This is a tensorflow project that classifies webpages as a demo day page containing a list of cohort companies, currently using scikit learn's random forest model. The classifier itself takes:
AFeatures: The number of times each word in words.txt occurs in a webpage. This is calculated by web_demo_features.py in the same directory. It also takes the number of occurrences of years from 1900-2099, and month words group grouped in seasons, and phrases of the form "# startups". It also takes the number of simple links (links in the form www.abc.com or www.abc.org) and the number of those that are attached to images. It also takes the number of <strong> html tags.
BTraining classifications: A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a tsv equivalent of Demo Day URLS.xlsx. Keep in mind that this txt file must be utf-8 encoded. In textpad, one can convert a file to utf-8 by pressing save-as, and changing the encoding at the bottom.
A demo day page is an advertisement page for a "demo day," which is a day that cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.