==Project==
This is a tensorflow project that classifies whether a webpage is a demo day page containing a list of cohort companies, currently using scikit-learn's random forest model. The classifier takes:
A: The number of times each word in words.txt occurs in a webpage. This is calculated by web_demo_features.py in the same directory.
B: A set of webpages hand-classified as to whether they contain a list of cohort companies. This is stored in classification.txt, which is a TSV equivalent of Demo Day URLs.xlsx. Keep in mind that this txt file must be UTF-8 encoded. In TextPad, one can convert a file to UTF-8 by pressing Save As and changing the encoding at the bottom.
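The two inputs described above can be illustrated with a minimal Python sketch. This is not the code in web_demo_features.py; the function names, the tokenization rule, and the assumed layout of classification.txt (a URL in the first column and a 0/1 label in the second) are all assumptions made for illustration.

```python
import csv
import io
import re
from collections import Counter


def count_words(page_text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order.

    Hypothetical helper: the real tokenization used by
    web_demo_features.py may differ.
    """
    tokens = re.findall(r"[a-z0-9']+", page_text.lower())
    counts = Counter(tokens)
    return [counts[word.lower()] for word in vocabulary]


def load_labels(tsv_file):
    """Read url -> label pairs from a tab-separated file object.

    Assumed layout: URL in column 1, a 0/1 label in column 2.
    Rows without a numeric label (such as a header row) are skipped.
    """
    labels = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        if len(row) < 2 or not row[1].strip().isdigit():
            continue
        labels[row[0].strip()] = int(row[1].strip())
    return labels


# Small in-memory example; on disk one would open the file with
# open("classification.txt", encoding="utf-8", newline="").
vocabulary = ["demo", "day", "cohort"]
page = "Demo Day 2018: meet the cohort. Demo day starts at noon."
features = count_words(page, vocabulary)

sample_tsv = io.StringIO(
    "http://example.com/demo\t1\nhttp://example.com/blog\t0\n"
)
labels = load_labels(sample_tsv)
```

Opening the file with an explicit `encoding="utf-8"` is what makes the UTF-8 requirement matter: a file saved in another encoding can raise a decoding error on read. Count vectors and labels of this shape are the kind of input that scikit-learn's `RandomForestClassifier.fit` accepts.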
A demo day page is an advertisement page for a "demo day," which is a day on which cohorts graduating from accelerators can pitch their ideas to investors. These demo days give us a good idea of when these cohorts graduated from their accelerator.
The random forest implementation doesn't work on Windows, so it is located on the Z drive to be run from the Linux box.
Project location:
E:\McNair\Projects\Accelerators\Spring 2018\google_classifier\
Z:\demoday
Training data:
E:\McNair\Projects\Accelerators\Spring 2018\demo_day_classifier\DemoDayHTMLFull\Demo Day URLs.xlsx
==Possibly useful programs==