====Data Preprocessing====
'''''Retrieving All Internal Links:''''' The <code>generate_dataset</code> tool reads every homepage URL in the <code>The File to Rule Them All</code> CSV file and feeds each one into the Site Map Generator to retrieve its internal URLs.
*This process assigns the corresponding cohort indicator to each URL. The indicator follows the URL on the same line, separated from it by a tab (see example below)
 http://fledge.co/blog/	0
 http://fledge.co/fledglings/	1
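
The tab-separated layout shown above can be read and written with Python's standard <code>csv</code> module. The following is a minimal sketch, not the actual <code>generate_dataset</code> code: the function names <code>parse_dataset</code> and <code>format_dataset</code> are hypothetical, and treating the cohort indicator as an integer is an assumption based on the two example lines.

```python
import csv
import io


def parse_dataset(text):
    """Parse 'url<TAB>indicator' lines into (url, indicator) pairs.

    Assumption: the indicator is an integer, as in the example lines.
    """
    pairs = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            # Skip blank lines rather than failing on them.
            continue
        url, indicator = row
        pairs.append((url, int(indicator)))
    return pairs


def format_dataset(pairs):
    """Serialize (url, indicator) pairs back to tab-separated lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(pairs)
    return buffer.getvalue()


sample = "http://fledge.co/blog/\t0\nhttp://fledge.co/fledglings/\t1\n"
records = parse_dataset(sample)
```

Using an explicit tab delimiter, rather than splitting on any whitespace, keeps each URL in one piece even if it happens to contain a space.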