Revision as of 16:08, 5 April 2018
Demo Day Page Google Classifier | |
---|---|
Project Information | |
Project Title | Demo Day Page Google Classifier |
Owner | Kyran Adams |
Start Date | 2/5/2018 |
Deadline | |
Keywords | Accelerator, Demo Day, Google Result, Word2vec, Tensorflow |
Primary Billing | |
Notes | |
Has project status | Active |
Is dependent on | Accelerator Seed List (Data), Demo Day Page Parser |
Copyright © 2016 edegan.com. All Rights Reserved. |
Project
This is a TensorFlow project that classifies webpages by whether they are demo day pages containing a list of cohort companies. It currently uses scikit-learn's random forest model. The classifier itself takes:
A: The number of times each word in words.txt occurs in a webpage, calculated by web_demo_features.py in the same directory. The features also include the number of occurrences of years from 1900 to 2099, the number of occurrences of month words grouped into seasons, the number of simple links (links of the form www.abc.com or www.abc.org), and the number of those links that are attached to images.
B: A set of webpages hand-classified according to whether they contain a list of cohort companies. This is stored in classification.txt, a TSV equivalent of Demo Day URLs.xlsx. This text file must be UTF-8 encoded. In TextPad, a file can be converted to UTF-8 by choosing Save As and changing the encoding at the bottom of the dialog.
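The two inputs above can be sketched roughly as follows. This is a minimal illustration and not the contents of web_demo_features.py: the function names, the month-to-season grouping, the regular expressions, and the assumed two-column layout of classification.txt (URL, then a 0/1 label, no header row) are all assumptions made for the example. It uses only the standard library and parses HTML with regular expressions, which is a simplification.

```python
import csv
import io
import re
from collections import Counter

# Assumed grouping of month words into seasons (not specified on this page).
SEASONS = {
    "winter": ("december", "january", "february"),
    "spring": ("march", "april", "may"),
    "summer": ("june", "july", "august"),
    "fall": ("september", "october", "november"),
}
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")  # years 1900-2099
WORD_RE = re.compile(r"[a-z']+")
# Assumed meaning of "simple link": www.<name>.com or www.<name>.org with no path.
SIMPLE_LINK = r"(?:https?://)?www\.[\w-]+\.(?:com|org)/?"
ANCHOR_RE = re.compile(
    r'<a\b[^>]*\bhref="(' + SIMPLE_LINK + r')"[^>]*>(.*?)</a>', re.I | re.S
)


def extract_features(html, vocabulary):
    """Return word counts, a year count, season counts, and link counts."""
    anchors = ANCHOR_RE.findall(html)
    simple_links = len(anchors)
    # A simple link counts as "attached to an image" if it wraps an <img> tag.
    image_links = sum(1 for _href, body in anchors if "<img" in body.lower())

    text = re.sub(r"<[^>]+>", " ", html).lower()  # crude tag stripping
    words = Counter(WORD_RE.findall(text))

    features = [words[w] for w in vocabulary]
    features.append(len(YEAR_RE.findall(text)))
    for months in SEASONS.values():
        features.append(sum(words[m] for m in months))
    features.extend([simple_links, image_links])
    return features


def load_labels(handle):
    """Read tab-separated (url, label) rows from an open text file."""
    labels = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            labels[row[0]] = int(row[1])
    return labels


# Small in-memory demonstration; the real inputs would be words.txt,
# the downloaded pages, and classification.txt.
sample_html = (
    "<p>Demo Day 2018: our summer cohort pitches in June and July.</p>"
    '<a href="http://www.abc.com"><img src="abc.png"></a>'
    '<a href="www.xyz.org">XYZ</a>'
    '<a href="http://www.abc.com/team/about">About</a>'
)
sample_features = extract_features(sample_html, ["demo", "cohort", "investors"])
sample_labels = load_labels(io.StringIO("http://a.com/demo\t1\nhttp://b.com/blog\t0\n"))
```

For the real file, opening it with `open("classification.txt", encoding="utf-8", newline="")` and passing the handle to `load_labels` would enforce the UTF-8 requirement noted above. The resulting feature vectors and labels are in the list-of-numbers shape that scikit-learn's `RandomForestClassifier` accepts in `fit`.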
A demo day page is a page advertising a "demo day," a day on which cohorts graduating from accelerators pitch their ideas to investors. The dates of these demo days give us a good idea of when each cohort graduated from its accelerator.
Project location:
E:\McNair\Projects\Accelerators\Spring 2018\google_classifier\
Training data:
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx
Possibly useful programs
Google bindings for python
E:\McNair\Projects\Accelerators\Spring 2017\Google_SiteSearch
PDF to text converter
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper
HTML to text converter
E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data
Resources
- https://www.tensorflow.org/tutorials/word2vec
- https://machinelearnings.co/tensorflow-text-classification-615198df9231
- http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
- https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw