====Data Preprocessing====
'''''Retrieving All Internal Links:''''' The <code>generate_dataset tool .py</code> script reads all homepage URLs from the file <code>The File to Rule Them All.csv</code> and feeds them into the Site Map Generator to retrieve their corresponding internal URLs.
*This process assigns a cohort indicator to each internal URL. Each output line holds the URL and its indicator, separated by a tab (see the example below).
http://fledge.co/blog/ 0
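
The labeling step can be sketched roughly as follows. This is a minimal illustration, not the actual script: the CSV column names <code>url</code> and <code>cohort</code>, the function names, and the <code>get_internal_urls</code> callable (standing in for the Site Map Generator) are all assumptions made for the example.

```python
import csv
from typing import Callable, Iterable, Iterator, TextIO, Tuple


def read_homepages(csv_file: TextIO,
                   url_column: str = "url",        # assumed column name
                   cohort_column: str = "cohort"   # assumed column name
                   ) -> Iterator[Tuple[str, str]]:
    """Yield (homepage URL, cohort indicator) pairs from an open CSV file."""
    for row in csv.DictReader(csv_file):
        yield row[url_column], row[cohort_column]


def label_internal_urls(rows: Iterable[Tuple[str, str]],
                        get_internal_urls: Callable[[str], Iterable[str]]
                        ) -> Iterator[str]:
    """Yield one 'URL<TAB>cohort' line per internal URL.

    get_internal_urls is a hypothetical stand-in for the Site Map
    Generator: it takes a homepage URL and returns its internal URLs.
    """
    for homepage, cohort in rows:
        for url in get_internal_urls(homepage):
            # Every internal URL inherits the cohort of its homepage.
            yield f"{url}\t{cohort}"
```

Passing the crawler in as a callable keeps the sketch independent of how the Site Map Generator is actually invoked, and makes it easy to substitute a stub when testing.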