Difference between revisions of "Demo Day Page Parser"

Project
Demo Day Page Parser
Project Information
Has title	Demo Day Page Parser
Has owner	Peter Jalbert
Has start date
Has deadline date
Has project status	Subsume
Dependent(s):	Demo Day Page Google Classifier, U.S. Seed Accelerators
Subsumed by:	Accelerator Demo Day
Has sponsor	McNair Center
Has project output	Tool

Latest revision as of 13:47, 21 September 2020

Project Specs

The goal of this project is to use Selenium-based data mining and machine learning to identify good candidate web pages for accelerator Demo Days. Relevant information on the project can be found on the Accelerator Data page (http://mcnair.bakerinstitute.org/wiki/Accelerator_Data).

Code Location

The code directory for this project is:

E:\McNair\Software\Accelerators

The Selenium-based crawler is in the file below. This script runs a Google search on accelerator names and keywords, and saves the URLs and HTML pages for future use:

DemoDayCrawler.py

The script below converts HTML to plain text. It reads HTML files from the DemoDayHTML directory and writes the extracted text to the DemoDayTxt directory:

htmlToText.py
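The conversion step could be sketched as follows. This is a minimal illustration using only the Python standard library, not the contents of htmlToText.py; the function names, the ".html" file extension, and the choice to skip script and style contents are assumptions.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def convert_directory(html_dir, txt_dir):
    """Write one .txt file per .html file (hypothetical naming scheme)."""
    out = Path(txt_dir)
    out.mkdir(parents=True, exist_ok=True)
    for html_file in sorted(Path(html_dir).glob("*.html")):
        raw = html_file.read_text(encoding="utf-8", errors="ignore")
        (out / (html_file.stem + ".txt")).write_text(html_to_text(raw), encoding="utf-8")
```

Under these assumptions, a call such as convert_directory("DemoDayHTML", "DemoDayTxt") would mirror the directory layout described above.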

A script to match keywords (accelerator and cohort names) against the resulting text pages can be found in KeyTerms.py. The script takes the keywords in CohortAndAcceleratorsFullList.txt and the text files in DemoDayTxt, and creates a file with the number of matches of each keyword against each text file.

The script is:

KeyTerms.py

The keyword matches text file is:

DemoDayTxt\KeyTermFile\KeyTerms.txt
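As a rough sketch of the matching step, the snippet below counts keyword occurrences in each text file. It is not the code in KeyTerms.py: the whole-word, case-sensitive matching rule, the function names, and the in-memory dictionary (instead of the KeyTerms.txt output format, which is not described here) are all assumptions.

```python
import re
from pathlib import Path


def count_matches(keyword, text):
    """Count whole-word, case-sensitive occurrences of keyword in text."""
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return len(re.findall(pattern, text))


def build_match_table(keywords, txt_dir):
    """Map each .txt file name to {keyword: count}, keeping only nonzero counts."""
    table = {}
    for txt_file in sorted(Path(txt_dir).glob("*.txt")):
        text = txt_file.read_text(encoding="utf-8", errors="ignore")
        counts = {kw: count_matches(kw, text) for kw in keywords}
        table[txt_file.name] = {kw: n for kw, n in counts.items() if n > 0}
    return table
```

Whole-word matching keeps a keyword such as "Her" from matching inside "Here", but it does not stop a keyword that is itself a common word from matching in ordinary prose.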

A script that identifies the text files of web pages with at least one keyword hit:

DemoDayHits.py
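The "at least one hit" filter might look like the sketch below. It assumes the keyword counts are available as a dictionary mapping each file name to its per-keyword counts; that structure and the function name are hypothetical and do not describe what DemoDayHits.py actually reads.

```python
def files_with_hits(match_table, min_hits=1):
    """Return the sorted file names whose total keyword count is at least min_hits."""
    return sorted(
        name
        for name, counts in match_table.items()
        if sum(counts.values()) >= min_hits
    )
```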

Downloading HTML Files with Selenium

The code that uses Selenium to download HTML files is in DemoDayCrawler.py.

The initial test run scraped 100 links for each of 20 sample accelerators drawn from the full list of accelerators. These sample pages were converted to text and scored, so that web pages with no mention of relevant accelerators or companies could be removed.

After the process was adjusted in response to this initial sample, it was run again over all accelerators. The test showed that we needed to take no more than 10 links for each accelerator, and that 'Demo Day' was a suitable search term.
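The search-and-collect step can be sketched roughly as below, using the limit of 10 links and the 'Demo Day' search term reported above. This is not the code in DemoDayCrawler.py: the query format, the link filtering rule, and all function names are assumptions. Selenium is imported lazily so the helper functions can be used without it, and Google may block or alter results for automated queries.

```python
from urllib.parse import quote_plus


def build_query(accelerator, search_term="Demo Day"):
    """Combine an accelerator name with the search term (assumed query format)."""
    return f"{accelerator} {search_term}"


def search_url(query):
    """Build a Google search URL for the query."""
    return "https://www.google.com/search?q=" + quote_plus(query)


def pick_links(hrefs, max_links=10):
    """Keep the first max_links distinct external http(s) links."""
    picked = []
    for href in hrefs:
        if href and href.startswith("http") and "google." not in href and href not in picked:
            picked.append(href)
        if len(picked) >= max_links:
            break
    return picked


def collect_result_links(driver, accelerator, max_links=10):
    """Run one search with an existing Selenium WebDriver and return result URLs."""
    from selenium.webdriver.common.by import By  # lazy import: needs selenium installed

    driver.get(search_url(build_query(accelerator)))
    hrefs = [a.get_attribute("href") for a in driver.find_elements(By.CSS_SELECTOR, "a")]
    return pick_links(hrefs, max_links)
```

Each returned URL could then be opened with driver.get and saved from driver.page_source into the HTML directory.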

Complete Files

These files hold data for all the accelerators, not just the test set.

The full list of accelerators:

ListOfAccs.txt

The full list of search terms to match with the text versions of news articles:

CohortAndAcceleratorsFullList.txt

A list of accelerators, queries, and urls:

demoday_crawl_full.txt

A directory with HTML files for all accelerator demo day results:

DemoDayHTMLFull

A directory with TXT files for all accelerator demo day results:

DemoDayTxtFull

A file with the names of the results that passed keyword matching:

DemoDayHitsFull.txt

A file with an analysis of the most frequently matched words in each text file:

topWordsFull.txt

Faulty Results

The first pass through the data revealed articles that had thousands of hits for keyword matches. This seemed highly suspicious, so we investigated the cause.

The following script, in the same directory, analyzes the keyword matches to determine the words with the highest number of hits.

DemoDayAnalysis.py
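One way to surface suspicious keywords is to rank them by hit count, per file and overall. The sketch below assumes the matches are held as a dictionary mapping each file name to its per-keyword counts; that structure and the function names are hypothetical, and this is not the code in DemoDayAnalysis.py.

```python
from collections import Counter


def top_words_per_file(match_table, n=5):
    """For each file, return its n most frequently matched keywords."""
    return {name: Counter(counts).most_common(n) for name, counts in match_table.items()}


def top_words_overall(match_table, n=5):
    """Return the n keywords with the most hits summed across all files."""
    totals = Counter()
    for counts in match_table.values():
        totals.update(counts)
    return totals.most_common(n)
```

A keyword that dominates the overall ranking by orders of magnitude is a likely sign that it is matching ordinary prose rather than a company name.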

After investigation, it was found that many company names are common English words. Here are some of the companies causing issues, along with their associated accelerators:

the, L-Spark

Matter, This., website: https://matter.vc/portfolio/this/

Fledge, HERE, website: http://fledge.co/fledgling/here/

StartupBootCamp, We...

LightBank Start, Zero

Entrepreneurs Roundtable Accelerator, SELECT

Y Combinator, Her

Y Combinator, Final

AngelCube, class

Matter, common

L-Spark, Company

Techstars, Hot

Rather than removing these companies individually from the list of search terms, we opted to exclude from the search terms any word among the 10,000 most common English words. For reference, we used the 10,000 most common English words according to a Google research study. The GitHub documentation of the study can be found at https://github.com/first20hours/google-10000-english.

The file containing the 10,000 most common English words is:

E:\McNair\Software\Accelerators\10000_common_words.txt

The results seemed much more plausible after removing these words. Some company words still appeared many times, but in the correct context.
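The common-word filter can be sketched as below. It assumes the word list has one word per line, compares case-insensitively, and strips surrounding punctuation so that names such as "This." and "We..." are caught; these rules and the function names are assumptions, not the project's exact procedure.

```python
def load_common_words(path):
    """Load a one-word-per-line list into a lowercase set (assumed file format)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def filter_search_terms(terms, common_words):
    """Drop terms that, ignoring case and surrounding punctuation, are common words."""
    kept = []
    for term in terms:
        normalized = term.strip().strip(".,!?").lower()
        if normalized and normalized not in common_words:
            kept.append(term)
    return kept
```

In this sketch a multi-word term such as "Y Combinator" is compared as a whole, so it is kept even if one of its words appears in the common-word list.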

@@ Line 1: / Line 1: @@
-{{McNair Projects
+{{Project
+|Has project output=Tool
+|Has sponsor=McNair Center
 |Has title=Demo Day Page Parser
 |Has owner=Peter Jalbert,
-|Has project status=Active
+|Has project status=Subsume
 }}
 ==Project Specs==
 The goal of this project is to leverage data mining with Selenium and Machine Learning to get good candidate web pages for Demo Days for accelerators. Relevant information on the project can be found on the [http://mcnair.bakerinstitute.org/wiki/Accelerator_Data Accelerator Data] page.
@@ Line 44: / Line 45: @@
   ListOfAccs.txt
-The full list of potential keywords (used for throwing out irrelevant results):
+The full list of search terms to match with the text versions of news articles:
-  Keywords.txt
+  CohortAndAcceleratorsFullList.txt
 A list of accelerators, queries, and urls:
@@ Line 58: / Line 59: @@
 A file with the name of the results that passed keyword matching:
   DemoDayHitsFull.txt
+A file with an analysis of the most frequent matched words in each text file:
+ topWordsFull.txt
 ==Faulty Results==
@@ Line 69: / Line 73: @@
 the, L-Spark
-Matter, This.
+Matter, This., [https://matter.vc/portfolio/this/ website]
-Fledge, HERE
+Fledge, HERE, [http://fledge.co/fledgling/here/ website]
 StartupBootCamp, We...
@@ Line 89: / Line 93: @@
 L-Spark, Company
-After removing these companies from consideration as keywords,
+Techstars, Hot
+Rather than removing these companies from the list of search terms, we opted to not include as search terms any words that were considered among the top 10000 most common English words. For reference, we used the top 10000 most common English words according to a Google research study. The github documentation of the study can be found [https://github.com/first20hours/google-10000-english here].
+The file containing the 10000 most common English words can be found:
+ E:\McNair\Software\Accelerators\10000_common_words.txt
+The results seemed much more plausible after removing these words. Some company words still appeared many times, but in the correct context.
