==Project==
This is a TensorFlow ML project that classifies whether a webpage is a demo day page containing a list of cohort companies. It currently uses scikit-learn's random forest model and a bag-of-words approach. The classifier itself takes:
<strong>Features:</strong> The frequency of each word from words.txt in the webpage. These are calculated by web_demo_features.py in the same directory. The feature set also includes the frequencies of years from 1900 to 2099, month words grouped into seasons, and phrases of the form "# startups"; the number of simple links (links of the form www.abc.com or www.abc.org) and the number of those that are attached to images; and the number of "strong" HTML tags in the body. These features can be extended by adding words to words.txt or regexes to PATTERNS in web_demo_features.py.
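The feature extraction described above can be sketched roughly as follows. This is a minimal illustration only and not the actual contents of web_demo_features.py: the function name extract_features, the sample word list, the regexes, and the use of raw counts as "frequencies" are all assumptions, and the season, image-link, and "strong" tag features are omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical stand-in for the contents of words.txt.
WORDS = ["demo", "cohort", "startups"]

# Hypothetical stand-in for PATTERNS; the real regexes may differ.
PATTERNS = {
    # Years from 1900 through 2099.
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    # Phrases of the form "# startups", e.g. "12 startups".
    "n_startups": re.compile(r"\b\d+\s+startups\b", re.IGNORECASE),
    # Simple links such as www.abc.com or www.abc.org.
    "simple_link": re.compile(r"\bwww\.[a-z0-9-]+\.(?:com|org)\b", re.IGNORECASE),
}


def extract_features(text, words=WORDS, patterns=PATTERNS):
    """Return a dict of bag-of-words counts and regex match counts for text."""
    # Lowercase and split into alphabetic tokens for the bag-of-words counts.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    features = {f"word:{w}": counts[w] for w in words}
    # Each pattern contributes the number of non-overlapping matches.
    for name, pattern in patterns.items():
        features[f"pattern:{name}"] = len(pattern.findall(text))
    return features


sample = "Demo Day 2017: meet the 12 startups in our cohort. Visit www.abc.com and www.xyz.org."
features = extract_features(sample)
```

Under this sketch, extending the feature set amounts to appending a word to WORDS or adding a named regex to PATTERNS, which mirrors the extension mechanism described above. The resulting dict could then be turned into a fixed-order vector before being passed to the classifier.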