Demo Day Page Google Classifier

McNair Project Information
 Project Title: Demo Day Page Google Classifier
 Owner: Kyran Adams
 Start Date: 2/5/2018
 Keywords: Accelerator, Demo Day, Google Result, Word2vec, Tensorflow
 Has project status: Active
 Is dependent on: Accelerator Seed List (Data), Demo Day Page Parser


Project

This project classifies webpages according to whether they are demo day pages containing a list of cohort companies. It currently uses scikit-learn's random forest model with a bag-of-words approach. The classifier itself takes two kinds of input:

Features: The frequency of each word from words.txt in the webpage, calculated by web_demo_features.py in the same directory. It also counts occurrences of years from 1900-2099, of month words grouped by season, and of phrases of the form "# startups", as well as the number of simple links (links of the form www.abc.com or www.abc.org), the number of those links attached to images, and the number of "strong" HTML tags in the body. These features can be extended by adding words to words.txt or regexes to PATTERNS in web_demo_features.py.
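To make the feature definitions concrete, here is a minimal sketch of the style of counting described above. The PATTERNS list and words.txt mirror the description, but the code is illustrative, not the actual contents of web_demo_features.py:

 import re
 from collections import Counter
 
 # Illustrative stand-ins for the regexes described above; the real
 # PATTERNS list lives in web_demo_features.py.
 PATTERNS = [
     re.compile(r"\b(?:19|20)\d{2}\b"),          # years 1900-2099
     re.compile(r"\b\d+\s+startups?\b", re.I),   # phrases like "30 startups"
     re.compile(r"\bwww\.\w+\.(?:com|org)\b"),   # simple links
 ]
 
 def extract_features(text, words):
     """Return bag-of-words counts for `words` plus one count per pattern."""
     counts = Counter(re.findall(r"[a-z']+", text.lower()))
     features = [counts[w] for w in words]
     features += [len(p.findall(text)) for p in PATTERNS]
     return features
 
 # words.txt holds one feature word per line.
 with open("words.txt", encoding="utf-8") as f:
     WORDS = [line.strip().lower() for line in f if line.strip()]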

Training classifications: A set of webpages hand-classified according to whether they contain a list of cohort companies. This is stored in classification.txt, a TSV equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be UTF-8 encoded; in TextPad, a file can be converted to UTF-8 by choosing Save As and changing the encoding setting at the bottom of the dialog.

A demo day page is an advertisement page for a "demo day," a day on which cohorts graduating from accelerators pitch their ideas to investors. These demo days give us a good indication of when each cohort graduated from its accelerator.

Project location:

 E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\

Training data:

 E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx

Usage:

  • Steps to train the model: Put all of the HTML files to be used in DemoDayHTMLFull. Run web_demo_features.py to generate the feature matrix, training_features.txt, and then run demo_day_classifier_randforest.py to generate the model, classifier.pkl (see the sketch after this list).
  • Steps to run the model on Google results: In the file crawl_and_classify.py, set the configuration variables as desired, and then run this command:
 python3 crawl_and_classify.py

This downloads all of the HTML files into the directory CrawledHTMLPages and generates a matrix of features, CrawledHTMLPages\features.txt. It then runs the trained model saved in classifier.pkl to predict whether these pages are demo day pages and saves the results to CrawledHTMLPages\predicted.txt. Finally, the HTML pages are moved into CrawledHTMLPages\demoday\ or CrawledHTMLPages\non_demoday\ based on their prediction.
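For orientation, here is a minimal sketch of the train-then-classify pipeline. It assumes the feature matrix is a plain numeric text file and that the hand labels are available one per line in a file called labels.txt (a hypothetical simplification of classification.txt); the real scripts may store things differently:

 import numpy as np
 import joblib
 from sklearn.ensemble import RandomForestClassifier
 
 # Training step (roughly what demo_day_classifier_randforest.py does):
 X = np.loadtxt("training_features.txt")   # matrix from web_demo_features.py
 y = np.loadtxt("labels.txt")              # hypothetical 0/1 labels, one per page
 clf = RandomForestClassifier(n_estimators=100, random_state=0)
 clf.fit(X, y)
 joblib.dump(clf, "classifier.pkl")
 
 # Classification step (roughly the end of crawl_and_classify.py):
 clf = joblib.load("classifier.pkl")
 X_new = np.loadtxt(r"CrawledHTMLPages\features.txt")
 np.savetxt(r"CrawledHTMLPages\predicted.txt", clf.predict(X_new), fmt="%d")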

Possible further steps

Change from the bag-of-words model to a more powerful neural network, perhaps an RNN.

Handle PDF files using a PDF-to-text converter:

 E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper
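If PDF handling is added, one option (a sketch only, not the PDF_Ripper tool itself) is the pdfminer.six library, which converts a PDF to plain text that could then be fed through the same feature extraction; the input filename here is hypothetical:

 # Requires: pip install pdfminer.six
 from pdfminer.high_level import extract_text
 
 text = extract_text("demo_day_flyer.pdf")  # hypothetical input file
 features = extract_features(text, WORDS)   # reuse the feature sketch above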

Demo Day Page Parser

Resources

*https://machinelearnings.co/tensorflow-text-classification-615198df9231
*http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
*https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw