This is an ML project that classifies whether a webpage is a demo day page containing a list of cohort companies. It currently uses scikit-learn's random forest model with a bag-of-words approach. Accuracy is currently about 80%, and the classifier heavily overfits the training data; more training data would likely improve this substantially. The classifier takes:
<strong>Input features:</strong> These are calculated by web_demo_features.py in the same directory and written to a tsv file. The features are: the frequency of each word in words.txt, month words grouped into seasons, and phrases of the form "# startups". They also include the number of simple links (links of the form www.abc.com or www.abc.org), the number of those links that are attached to images, and the number of "strong" HTML tags in the body. The feature set can be extended by adding words to words.txt or regexes to the PATTERNS variable in web_demo_features.py. There is also unused code for generating unigram/bigram TF-IDF frequencies. This might improve the classifier if we had more data; currently it does not.
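As a rough illustration of this kind of feature extraction, here is a minimal sketch. It is not the code in web_demo_features.py: the word list, the regexes, and the function name `extract_features` are all hypothetical stand-ins, and it leaves out the season grouping and the image-link count.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the real word list lives in words.txt and the
# real regexes in the PATTERNS variable of web_demo_features.py.
WORDS = ["cohort", "demo", "startups"]
PATTERNS = {
    # Phrases such as "12 startups"
    "n_startups": re.compile(r"\b\d+\s+startups\b", re.IGNORECASE),
    # Links that point at a bare domain such as www.abc.com or www.abc.org
    "simple_link": re.compile(
        r'href="(?:https?://)?www\.[\w-]+\.(?:com|org)/?"', re.IGNORECASE
    ),
    # Opening <strong> tags
    "strong_tag": re.compile(r"<strong\b", re.IGNORECASE),
}


def extract_features(html):
    """Return a flat dict of counts for one HTML page."""
    text = re.sub(r"<[^>]+>", " ", html).lower()  # crude tag stripping
    tokens = Counter(re.findall(r"[a-z]+", text))
    features = {f"word_{w}": tokens[w] for w in WORDS}
    for name, pattern in PATTERNS.items():
        features[name] = len(pattern.findall(html))
    return features
```

Each page becomes one row of counts, which is the shape a bag-of-words classifier expects. Adding a word to `WORDS` or a regex to `PATTERNS` adds one column.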
<strong>Training data:</strong> A set of webpages hand-classified by whether they contain a list of cohort companies. The classifications are stored in classification.txt, which is a tsv equivalent of Demo Day URLS.xlsx. This txt file must be UTF-8 encoded. In TextPad, you can convert a file to UTF-8 by choosing Save As and changing the encoding at the bottom of the dialog. The HTML pages themselves are stored in DemoDayHTMLFull.
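If TextPad is not available, the re-encoding can also be done in Python. The sketch below is one possible approach, not part of this project: the function names are hypothetical, and the default source encoding (cp1252) is an assumption that should be changed to match how the file was actually exported.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="cp1252"):
    """Read src in its original encoding and write it to dst as UTF-8.

    source_encoding is an assumption; set it to whatever the export used.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, `reencode_to_utf8("classification_raw.txt", "classification.txt")` would write the converted copy, where `classification_raw.txt` is a hypothetical name for the unconverted export. `is_utf8("classification.txt")` is a quick check before running the feature script.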
<strong>Usage:</strong>
* Steps to add training data to the model: Put all of the HTML files to be used in DemoDayHTMLFull. Add the corresponding entries to demo_day_cohort_lists.xlsx (only the columns "URL" and "Cohort" are necessary, but they must be in alphabetical order; data_reader.py will throw an error otherwise), then export it to classification.txt. Convert this file to UTF-8 (TextPad can do this: Save As, then set the encoding to UTF-8). Then run python3 web_demo_features.py to generate the feature matrix, hand_training_features.txt. Finally, run python3 demo_day_classifier_randforest.py to generate the model, classifier.pkl.
* Steps to run the model on Google results: In crawl_and_classify.py, set the variables as needed. Then run this command: