Difference between revisions of "Demo Day Page Google Classifier"

From edegan.com
Line 10: Line 10:
 
==Project==
 
  
This is a TensorFlow project that classifies webpages as either a demo day page or not, currently using logistic regression. The classifier itself should take the output of Peter's DemoDayHits.py program and output whether the page is a demo day page. It is trained on a file output by DemoDayHits.py and a hand-classified set of Google results, some of which are demo day pages.
+
This is a TensorFlow project that classifies whether a webpage is a demo day page containing a list of cohort companies. It currently uses scikit-learn's random forest model. The classifier itself takes:
  
It may later take other inputs, such as the text of the page itself.
+
A: The number of times each word in words.txt occurs in a webpage. This is calculated by web_demo_features.py in the same directory.
 +
 
 +
B: A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a TSV equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be UTF-8 encoded. In TextPad, one can convert a file to UTF-8 by choosing Save As and changing the encoding at the bottom of the dialog.
  
 
A demo day page is an advertisement page for a "demo day," a day on which cohorts graduating from accelerators pitch their ideas to investors. The date of a demo day gives us a good estimate of when a cohort graduated from its accelerator.
 
  
The random forest implementation doesn't work on Windows, so it is kept on the Z: drive and run from the Linux box.
+
Project location:
 
 
Located:
 
 
   E:\McNair\Projects\Accelerators\Spring 2018\google_classifier\
 
  Z:\demoday
 
  
  
 
Training data:
 
   E:\McNair\Projects\Accelerators\Fall 2017\Demo Day URLs.xlsx
+
   E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx
  
 
==Possibly useful programs==
 

Revision as of 15:47, 2 April 2018


McNair Project
Demo Day Page Google Classifier
Project Information
Project Title: Demo Day Page Google Classifier
Owner: Kyran Adams
Start Date: 2/5/2018
Deadline:
Keywords: Accelerator, Demo Day, Google Result, Word2vec, Tensorflow
Primary Billing:
Notes:
Has project status: Active
Is dependent on: Accelerator Seed List (Data), Demo Day Page Parser
Copyright © 2016 edegan.com. All Rights Reserved.


Project

This is a TensorFlow project that classifies whether a webpage is a demo day page containing a list of cohort companies. It currently uses scikit-learn's random forest model. The classifier itself takes:

A: The number of times each word in words.txt occurs in a webpage. This is calculated by web_demo_features.py in the same directory.

B: A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a TSV equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be UTF-8 encoded. In TextPad, one can convert a file to UTF-8 by choosing Save As and changing the encoding at the bottom of the dialog.
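
As a rough illustration of inputs A and B, the sketch below counts vocabulary words in one page's text and reads hand-assigned labels from tab-separated data. It is not the code in web_demo_features.py. The function names count_words and read_labels, the lowercase tokenization, and the assumed column layout (URL first, 1/0 label second) are all hypothetical choices made for the example.

```python
import csv
import io
import re
from collections import Counter


def count_words(page_text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    # Assumption: matching is case-insensitive on simple word tokens.
    tokens = re.findall(r"[a-z0-9']+", page_text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def read_labels(tsv_file):
    """Read {url: label} from an open tab-separated file.

    Assumed layout: URL in the first column, 1/0 label in the second.
    """
    labels = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 2 or not row[1].strip().isdigit():
            continue  # skip blank lines and a header row, if any
        labels[row[0].strip()] = int(row[1])
    return labels


# Small stand-ins for words.txt, one page's text, and classification.txt.
vocabulary = ["demo", "day", "cohort", "startup"]
page_text = "Demo Day 2018: meet the cohort. Demo starts at noon."
features = count_words(page_text, vocabulary)

sample_tsv = io.StringIO(
    "url\tlabel\n"
    "http://example.com/demo-day\t1\n"
    "http://example.com/blog\t0\n"
)
labels = read_labels(sample_tsv)
```

To read the real file, it could be opened with open("classification.txt", encoding="utf-8", newline="") and passed to read_labels. Giving the encoding explicitly is what makes the UTF-8 requirement matter, since a file saved in another encoding may fail to decode or produce garbled URLs.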

A demo day page is an advertisement page for a "demo day," a day on which cohorts graduating from accelerators pitch their ideas to investors. The date of a demo day gives us a good estimate of when a cohort graduated from its accelerator.

Project location:

 E:\McNair\Projects\Accelerators\Spring 2018\google_classifier\


Training data:

 E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx

Possibly useful programs

Google bindings for python

 E:\McNair\Projects\Accelerators\Spring 2017\Google_SiteSearch

PDF to text converter

 E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data\Utilities\PDF_Ripper

HTML to text converter

 E:\McNair\Projects\Accelerators\Fall 2017\Code+Final_Data

Demo Day Page Parser

Resources