Demo Day Page Google Classifier

{{Project
|Has project output=Tool
|Has sponsor=McNair Center
|Has title=Demo Day Page Google Classifier
|Has owner=Kyran Adams,
|Has start date=2/5/2018
|Has keywords=Accelerator, Demo Day, Google Result, Word2vec, Tensorflow
|Has project status=Subsume
|Is dependent on=Accelerator Seed List (Data), Demo Day Page Parser
}}
 
 
==Project==
  
This is a machine learning project that classifies webpages according to whether they are demo day pages containing a list of cohort companies, currently using scikit-learn's random forest model and a bag-of-words approach. It currently achieves about 80% accuracy, which would likely improve substantially with more training data; at present the classifier badly overfits the training data. The classifier itself takes:
  
<strong>Input features:</strong> These are calculated by web_demo_features.py in the same directory and output to a tsv file. The features are: the frequency of each word from words.txt, the number of month words grouped by season, and the number of phrases of the form "# startups". They also include the number of simple links (links of the form www.abc.com or www.abc.org), the number of those links that are attached to images, and the number of "strong" html tags in the body. These features can be extended by adding words to words.txt or regexes to the PATTERNS variable in web_demo_features.py. There is also unused code for generating unigram/bigram tf-idf frequencies, which might improve the classifier if we had more data.
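As an illustration of this kind of counting, here is a minimal, hypothetical sketch in python3; the word list, patterns, and function name are placeholders for the example and are not the actual web_demo_features.py code.

 # Hypothetical sketch of word-count + regex-count features (not the real web_demo_features.py).
 import re
 from collections import Counter
 
 PATTERNS = [r"\b(19|20)\d{2}\b",      # years 1900-2099
             r"\b\d+\s+startups\b"]    # phrases like "12 startups"
 
 def count_features(page_text, words):
     """Return one count per word in the word list, then one count per regex pattern."""
     tokens = Counter(re.findall(r"[a-z']+", page_text.lower()))
     word_counts = [tokens[w] for w in words]
     pattern_counts = [len(re.findall(p, page_text, re.IGNORECASE)) for p in PATTERNS]
     return word_counts + pattern_counts
 
 # Tiny example standing in for words.txt:
 print(count_features("Demo Day 2018: 12 startups pitched to investors.", ["demo", "cohort", "startups"]))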
  
<strong>Training data:</strong> A set of webpages hand-classified according to whether they contain a list of cohort companies. The classifications are stored in classification.txt, a tsv equivalent of Demo Day URLS.xlsx. Keep in mind that this txt file must be UTF-8 encoded; in TextPad, one can convert a file to UTF-8 with Save As and changing the encoding at the bottom of the dialog. The HTML pages themselves are stored in DemoDayHTMLFull.
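For reference, a small sketch of reading such a UTF-8 tsv file in python3; the column layout assumed here (URL, then classification) is an assumption for the example rather than the documented format of classification.txt.

 # Hypothetical reader for a UTF-8 encoded, tab-separated classification file.
 import csv
 
 def load_classifications(path="classification.txt"):
     labels = {}
     with open(path, encoding="utf-8", newline="") as f:
         for row in csv.reader(f, delimiter="\t"):
             if len(row) >= 2:
                 labels[row[0]] = row[1]   # assumed columns: URL, classification
     return labels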
  
 
A demo day page is an advertisement page for a "demo day," which is a day on which cohorts graduating from accelerators pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.
  
<strong>Project location:</strong>
   E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\
  
<strong>Usage:</strong>

* Steps to add training data to the model: Put all of the html files to be used in DemoDayHTMLFull. Put corresponding entries into demo_day_cohort_lists.xlsx (only the "URL" and "Cohort" columns are necessary, but they must be in alphabetical order; data_reader.py will throw an error otherwise), then export it to classification.txt. Convert this file to UTF-8 (TextPad can do this: Save As -> encoding: UTF-8). Make sure that in demo_day_classifier_randforest.py, USE_CROSS_VALIDATION is set to False in order to generate the model. Then run:
  python3 web_demo_features.py                #generates the features matrix, hand_training_features.txt
  python3 demo_day_classifier_randforest.py   #generates the model, classifier.pkl
* Steps to run the model on google results: In the file crawl_and_classify.py, set the variables to whatever is wanted. Then run:
  python3 crawl_and_classify.py
It will download all of the html files into the directory CrawledHTMLPages and then generate a matrix of features, CrawledHTMLPages\features.txt. It will then run the trained model saved in classifier.pkl to predict whether these pages are demo day pages, and save the results to CrawledHTMLPages\predicted.txt. The HTML pages are then moved into CrawledHTMLPages/positive/ or CrawledHTMLPages/negative/ based on their prediction. If you want to run the classifier on html files that are already downloaded, the function classify_dir in crawl_and_classify.py will do this.
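Conceptually, the prediction step loads the pickled model and applies it to the generated features matrix. The following is a rough sketch under the assumption that features.txt is a plain tab-separated numeric matrix; the real crawl_and_classify.py may store things differently.

 # Sketch: load the pickled random forest and classify a features matrix.
 import pickle
 import numpy as np
 
 with open("classifier.pkl", "rb") as f:
     clf = pickle.load(f)
 
 features = np.loadtxt("CrawledHTMLPages/features.txt", delimiter="\t")   # assumed layout
 predictions = clf.predict(features)
 np.savetxt("CrawledHTMLPages/predicted.txt", predictions, fmt="%d")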
  
==Files and Directories==
* CrawledHTMLPages
** Contains the classified html file results from crawl_and_classify.py, stored in positive and negative folders based on how they are classified.
* DemoDayHTMLFull
** Contains the training data for the classifier. demo_day_cohort_lists.xlsx is the classification (the same as classification.txt, which is what the program actually uses, but the excel file has hyperlinks), and the html files are used for generating the features matrix.
* demo_day_classifier_randforest.py
** The classifier itself. A pickled version of the trained classifier is saved in classifier.pkl (a minimal training sketch is shown after this list).
* web_demo_features.py
** Generates the features matrix from a directory of html files to be used by the classifier. See the input features above.
* words.txt
** The words used for the features. The frequency of each word is used as a feature (perhaps this should be changed to tf-idf).
* data_reader.py
** Helper functions to read in the data for the classifier.
* crawl_and_classify.py
** Googles results for a given query and a list of accelerators and their years, and then classifies the downloaded html pages into CrawledHTMLPages.
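As referenced above, a minimal sketch of training and pickling a random forest of this kind; the label file name here is a placeholder, and the real logic lives in demo_day_classifier_randforest.py.

 # Hypothetical sketch: fit a random forest on the hand-built features and pickle it.
 import pickle
 import numpy as np
 from sklearn.ensemble import RandomForestClassifier
 
 X = np.loadtxt("hand_training_features.txt", delimiter="\t")   # features per page (assumed layout)
 y = np.loadtxt("hand_training_labels.txt", dtype=int)          # placeholder label file: 1 = demo day page
 
 clf = RandomForestClassifier(n_estimators=100, random_state=0)
 clf.fit(X, y)
 
 with open("classifier.pkl", "wb") as f:
     pickle.dump(clf, f)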
  
===Other scripts===
* feature_diff.py
** Generates a small plot showing how the feature counts differ between demo day and non-demo-day pages.
* delete_duplicate_classified.py
** Looks through CrawledHTMLPages/positive and CrawledHTMLPages/negative and deletes all the duplicate files. Run this after the crawler runs, because google results produce many duplicates (an illustrative deduplication sketch is shown after this list).
* classify_all_accelerator.py
** Taking the TSV files MasterAcceleratorList.tsv and SplitAcceleratorList.tsv, it googles and classifies all accelerators from MasterAcceleratorList that are not already in SplitAcceleratorList. These tsv files were exported from the Master Variable List on google sheets.
* google.py/google_crawl.py
** Functions for running google searches.
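As context for delete_duplicate_classified.py, here is an illustrative content-hash deduplication in python3; it is a sketch of the idea, not the script's actual implementation.

 # Illustrative duplicate removal: keep the first file with a given content hash, delete the rest.
 import hashlib
 import os
 
 def delete_duplicates(directory):
     seen = set()
     for name in sorted(os.listdir(directory)):
         path = os.path.join(directory, name)
         if not os.path.isfile(path):
             continue
         with open(path, "rb") as f:
             digest = hashlib.md5(f.read()).hexdigest()
         if digest in seen:
             os.remove(path)
         else:
             seen.add(digest)
 
 for folder in ("CrawledHTMLPages/positive", "CrawledHTMLPages/negative"):
     if os.path.isdir(folder):
         delete_duplicates(folder)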
  
==Possible further steps==

Change from the bag-of-words model to a more powerful model such as a neural network (perhaps an RNN), or use full tf-idf unigram/bigram frequencies. This would need even more data, though. The best way to collect more data would probably be to automate, or at least streamline, the data collection process and have a few people collect a few thousand data points, or to use Mechanical Turk. This would likely improve accuracy considerably and allow for more sophisticated classification methods.
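If the project does move to tf-idf unigram/bigram features, scikit-learn's TfidfVectorizer is a natural starting point. A brief sketch, with placeholder documents and labels:

 # Sketch: unigram + bigram tf-idf features feeding the same random forest.
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.pipeline import make_pipeline
 
 docs = ["demo day 2018 with 12 startups", "company homepage about us"]   # placeholder page texts
 labels = [1, 0]                                                          # placeholder classifications
 
 model = make_pipeline(
     TfidfVectorizer(ngram_range=(1, 2), min_df=1),
     RandomForestClassifier(n_estimators=100, random_state=0),
 )
 model.fit(docs, labels)
 print(model.predict(["our accelerator demo day startups"]))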
 
  
 
[[Demo Day Page Parser]]

==Resources==
*https://machinelearnings.co/tensorflow-text-classification-615198df9231
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
*http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
