Changes

Listing Page Classifier (view source)

Revision as of 20:52, 12 May 2019

578 bytes added, 20:52, 12 May 2019

====Data Preprocessing====

* '''''Retrieving All Internal Links: ''''' the generate_dataset tool reads all homepage urls in the <code>The File to Rule Them All</code> csv file and then feeds them into the Site Map Generator to retrieve their corresponding internal urls
* This process assigns the corresponding cohort indicator to each url; the indicator is separated from the url by a tab (see example below)
 http://fledge.co/blog/	0
 http://fledge.co/fledglings/	1
 http://fledge.co/2019/visiting-malawi/	0
 http://fledge.co/about/details/	0
 http://fledge.co/about/	0
* Results are automatically split into two text files: train.txt and test.txt.

Python file saved in

E:\projects\listing page identifier\generate_dataset.py
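The labeling, tab-separated output and train/test split described above can be sketched as follows. This is only an illustration, not the contents of generate_dataset.py: the function names, the set of known cohort urls used to decide the indicator, and the 20% test fraction are all assumptions made for the example.

```python
import random


def label_urls(internal_urls, cohort_urls):
    """Pair each url with a cohort indicator: 1 if it is in cohort_urls, else 0.

    cohort_urls is a hypothetical set of known cohort pages; the source does
    not say how the indicator is actually determined.
    """
    return [(url, 1 if url in cohort_urls else 0) for url in internal_urls]


def format_rows(rows):
    """Render (url, indicator) pairs as one 'url<TAB>indicator' line each."""
    return "".join(f"{url}\t{indicator}\n" for url, indicator in rows)


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train_rows, test_rows).

    test_fraction is an assumed value; the real split ratio is not stated.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_dataset(rows, path):
    """Write the formatted rows to a text file such as train.txt or test.txt."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_rows(rows))
```

For instance, <code>train, test = split_rows(label_urls(urls, cohort_urls))</code> followed by <code>write_dataset(train, "train.txt")</code> and <code>write_dataset(test, "test.txt")</code> would produce the two files in the format shown in the example.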

* '''''Generate and Separate Image Data: ''''' feed train.txt and test.txt into the Screenshot Tool to get our image data
* Images are split into two folders: train and test
** Images are also separated into corresponding subfolders, cohort and not_cohort, within the train folder and the test folder
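The folder layout described above can be sketched as a small helper that maps each labeled url to its destination folder. This is a hedged illustration only: the function names and the <code>root</code> argument are hypothetical, and the Screenshot Tool itself is not shown.

```python
from pathlib import Path


def read_labeled_urls(text):
    """Parse 'url<TAB>indicator' lines into (url, indicator) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        url, indicator = line.rsplit("\t", 1)
        pairs.append((url, int(indicator)))
    return pairs


def screenshot_dir(root, split, indicator):
    """Return the folder for a screenshot: root/<train|test>/<cohort|not_cohort>."""
    if split not in ("train", "test"):
        raise ValueError(f"unknown split: {split!r}")
    subfolder = "cohort" if indicator == 1 else "not_cohort"
    return Path(root) / split / subfolder
```

With this layout, a url labeled 1 in train.txt would have its screenshot saved under <code>train/cohort</code>, and a url labeled 0 in test.txt under <code>test/not_cohort</code>.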

NancyYu

