==Project==
This is a machine learning project that classifies webpages according to whether they are demo day pages containing a list of cohort companies. It currently uses scikit-learn's random forest model with a bag-of-words approach and achieves roughly 80% accuracy, which would likely improve substantially with more training data. The classifier takes:
<strong>Features:</strong> The frequency of each word from words.txt in the webpage, as computed by web_demo_features.py in the same directory. Additional features include the frequencies of years from 1900-2099, of month words grouped into seasons, and of phrases of the form "# startups"; the number of simple links (links of the form www.abc.com or www.abc.org) and the number of those links attached to images; and the number of "strong" HTML tags in the body. These features can be extended by adding words to words.txt or regexes to PATTERNS in web_demo_features.py (a simplified illustration of this style of feature counting is sketched below).
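The actual feature computation lives in web_demo_features.py; as a rough, self-contained illustration of the kinds of counts described above (word frequencies, year mentions, "# startups" phrases, simple links, strong tags), a simplified version might look like the following. The function and word list here are hypothetical and are not taken from the project code.
<pre>
import re
from collections import Counter

def extract_features(html_text, words):
    """Toy illustration: count word, year, '# startups', simple-link, and
    strong-tag occurrences in raw HTML text. Not the project's actual code."""
    text = html_text.lower()
    counts = Counter(re.findall(r"[a-z']+", text))

    features = [counts[w.lower()] for w in words]                      # word frequencies
    features.append(len(re.findall(r"\b(?:19|20)\d{2}\b", text)))      # years 1900-2099
    features.append(len(re.findall(r"\b\d+\s+startups\b", text)))      # "# startups" phrases
    features.append(len(re.findall(r"www\.\w+\.(?:com|org)", text)))   # simple links
    features.append(text.count("<strong"))                             # strong tags
    return features

# Example with a hypothetical word list
print(extract_features("<p>Demo Day: 12 startups, www.abc.com</p>", ["demo", "day", "cohort"]))
</pre>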
<strong>Project location:</strong>
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\
<strong>Training data:</strong>
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx
<strong>Usage:</strong>
* Steps to train the model: Put all of the HTML files to be used in DemoDayHTMLFull and add corresponding entries to demo_day_cohort_lists.xlsx, then export that spreadsheet to classification.txt and convert it to UTF-8. Run web_demo_features.py to generate the features matrix, training_features.txt, and then run demo_day_classifier_randforest.py to generate the model, classifier.pkl (a rough sketch of the train-and-predict flow appears at the end of this section).
* Steps to run the model on Google results: In the file crawl_and_classify.py, set the variables as desired. Then, run this command:
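Assuming the script is run directly with Python from the project directory, the invocation would be along the lines of:
<pre>
python crawl_and_classify.py
</pre>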
It will download all of the HTML files into the directory CrawledHTMLPages and generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages and save the results to CrawledHTMLPages\predicted.txt. Finally, the HTML pages are moved into CrawledHTMLPages/demoday/ or CrawledHTMLPages/non_demoday/ based on their prediction.
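As a rough, self-contained sketch of the underlying train-and-predict flow with scikit-learn (the real demo_day_classifier_randforest.py and crawl_and_classify.py may read the data differently; the plain-text file layouts assumed here are illustrative):
<pre>
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# --- Training step (roughly what demo_day_classifier_randforest.py does) ---
# Assumes one row of features per page and one 0/1 label per line.
X_train = np.loadtxt("training_features.txt")   # from web_demo_features.py
y_train = np.loadtxt("classification.txt")      # labels exported from the spreadsheet

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

with open("classifier.pkl", "wb") as f:          # pickled model used later
    pickle.dump(clf, f)

# --- Prediction step (roughly what crawl_and_classify.py does with crawled pages) ---
X_new = np.loadtxt("CrawledHTMLPages/features.txt")
with open("classifier.pkl", "rb") as f:
    clf = pickle.load(f)

predictions = clf.predict(X_new)                 # 1 = demo day page, 0 = not
np.savetxt("CrawledHTMLPages/predicted.txt", predictions)
</pre>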
==Files and Directories==
* CrawledHTMLPages
** Contains the results from crawl_and_classify.py, stored in positive and negative folders based on how the html files are classified.
* DemoDayHTMLFull
** Contains the training data for the classifier. demo_day_cohort_lists.xlsx is the classification (converted to classification.txt before use), and the html files are used for generating the features matrix.
* demo_day_classifier_randforest.py
** The classifier itself. A pkl'ed version of the classifier should be saved in classifier.pkl.
* web_demo_features.py
** Generates the features matrix from a directory of html files to be used in the classifier.
* words.txt
** The words for the features. The frequency of each word is used as a feature (maybe change this to TF-IDF? See the sketch below.)
* data_reader.py
** Helper functions to read in the data for the classifier.
* crawl_and_classify.py
** Googles a bunch of results for a given query and list of accelerators and their years, and then classifies the html pages into CrawledHTMLPages.

===Other scripts===
* feature_diff.py
** Generates a small image showing how the number of features differs between demo day and non-demo-day pages.
* delete_duplicate_classified.py
** Looks through CrawledHTMLPages/positive and CrawledHTMLPages/negative and deletes all duplicate files.
* classify_all_accelerator.py
** Taking the TSV files MasterAcceleratorList.tsv and SplitAcceleratorList.tsv, googles and classifies all accelerators from MasterAcceleratorList that are not already in SplitAcceleratorList.
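On the TF-IDF question raised for words.txt above: one way to try that, assuming words.txt holds one word per line, would be scikit-learn's TfidfVectorizer restricted to that vocabulary (a suggestion, not existing project code):
<pre>
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed format: words.txt contains one feature word per line
with open("words.txt") as f:
    vocabulary = [line.strip() for line in f if line.strip()]

page_texts = ["text of first page ...", "text of second page ..."]  # placeholder page texts

# TF-IDF weights over the words.txt vocabulary instead of raw counts
vectorizer = TfidfVectorizer(vocabulary=vocabulary)
X = vectorizer.fit_transform(page_texts)   # sparse matrix, one row per page
</pre>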
==Possible further steps==
* Change from the bag-of-words model to a more powerful neural network, perhaps an RNN (see the sketch below). This would likely need even more data, though.
* Handle PDF files using a PDF-to-text converter: E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper
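For the RNN suggestion, a possible starting point (assuming Keras and integer token sequences as input; none of this exists in the project yet) could look like:
<pre>
import tensorflow as tf

# Hypothetical RNN classifier: inputs would be padded sequences of word indices
# from each page's text rather than bag-of-words counts.
VOCAB_SIZE = 10000   # assumed vocabulary size
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # demo day page vs. not
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=10, validation_split=0.2)  # needs tokenized, labeled data
</pre>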
==Notes on Categorization==
[[Demo Day Page Parser]]