====Data Preprocessing====
* '''''Retrieving All Internal Links:''''' the generate_dataset tool reads all homepage URLs in the <code>The File to Rule Them All</code> CSV file and feeds them into the Site Map Generator to retrieve their corresponding internal URLs.
* This process assigns a cohort indicator to each internal URL generated. The indicator is separated from the URL by a tab (see example below).
 http://fledge.co/blog/	0
 http://fledge.co/fledglings/	1
 http://fledge.co/2019/visiting-malawi/	0
 http://fledge.co/about/details/	0
 http://fledge.co/about/	0
* Results are automatically split into two text files: train.txt and test.txt.
The Python file is saved in:
 E:\projects\listing page identifier\generate_dataset.py
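The line format and the train/test split described above can be sketched roughly as follows. This is a minimal illustration, not the contents of generate_dataset.py: the helper names (<code>format_line</code>, <code>parse_line</code>, <code>split_lines</code>, <code>write_lines</code>), the 80/20 split ratio, and the seeded shuffle are all assumptions, since the page does not say how the split is made.

```python
import random


def format_line(url, is_cohort):
    """Build one dataset line: URL, a tab, then the cohort indicator (1 or 0)."""
    return f"{url}\t{int(bool(is_cohort))}"


def parse_line(line):
    """Split one dataset line back into (url, indicator)."""
    url, label = line.rstrip("\n").split("\t")
    return url, int(label)


def split_lines(lines, test_fraction=0.2, seed=0):
    """Shuffle the lines and return (train, test).

    The 0.2 test fraction and the fixed seed are assumed values,
    chosen only to make the example reproducible.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_lines(path, lines):
    """Write one dataset line per row to a text file such as train.txt."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
```

With helpers like these, the two output files would be produced by calling <code>split_lines</code> on all formatted lines and passing each half to <code>write_lines</code> with the names train.txt and test.txt.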
'''''Generate and Separate Image Data:''''' feed train.txt and test.txt into the Screenshot Tool to get the image data.
* Images are also split into two folders: train and test.
** Within each of these folders, images are further separated into two subfolders: cohort and not_cohort.
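
The folder layout for the image data can be sketched as below. This is only an illustration of the train/test and cohort/not_cohort structure. The function names, the root folder argument, and the screenshot file name are hypothetical, and the Screenshot Tool itself is not shown.

```python
from pathlib import Path

SPLITS = ("train", "test")
SUBFOLDERS = ("cohort", "not_cohort")


def make_folders(root):
    """Create root/train and root/test, each with cohort and not_cohort inside."""
    for split in SPLITS:
        for subfolder in SUBFOLDERS:
            (Path(root) / split / subfolder).mkdir(parents=True, exist_ok=True)


def image_destination(root, split, indicator, filename):
    """Return where a screenshot belongs, given its split and cohort indicator.

    indicator is the 0/1 value from train.txt or test.txt;
    filename is a hypothetical screenshot file name.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split}")
    subfolder = "cohort" if indicator == 1 else "not_cohort"
    return Path(root) / split / subfolder / filename
```

For example, a screenshot of a URL listed in train.txt with indicator 1 would be placed under train/cohort, and one listed in test.txt with indicator 0 under test/not_cohort.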