Demo Day Page Parser

From edegan.com
Revision as of 16:23, 28 November 2017 by Peterjalbert (talk | contribs)


McNair Project
Demo Day Page Parser
Project Information
Project Title Demo Day Page Parser
Owner Peter Jalbert
Start Date
Deadline
Primary Billing
Notes
Has project status Active


Project Specs

The goal of this project is to combine Selenium-based data mining with machine learning to identify good candidate web pages for accelerator Demo Days. Relevant information on the project can be found on the Accelerator Data page.

Code Location

The code directory for this project can be found:

E:\McNair\Software\Accelerators

The Selenium-based crawler can be found in the file below. This script runs a Google search on accelerator names and keywords, and saves the URLs and HTML pages for future use:

DemoDayCrawler.py
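The actual script is in the repository above; as an illustration only, a crawler of this kind can be sketched as follows. The accelerator names, output directory, CSS selector, and the 10-link cap (taken from the full-run decision described later on this page) are assumptions, not the script's actual values:

```python
# Sketch of a Selenium-based Google-search crawler in the spirit of
# DemoDayCrawler.py. Paths, selectors, and limits are illustrative.
import os
import urllib.parse

MAX_LINKS = 10          # assumed cap: keep at most 10 links per accelerator
SEARCH_TERM = "Demo Day"

def build_query(accelerator):
    """Combine an accelerator name with the Demo Day search keyword."""
    return f'"{accelerator}" {SEARCH_TERM}'

def crawl(accelerators, out_dir="DemoDayHTML"):
    """Search Google for each accelerator and save result pages as HTML."""
    # Selenium is imported here so the pure helpers above stay importable
    # without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Chrome()
    try:
        for acc in accelerators:
            url = ("https://www.google.com/search?q=" +
                   urllib.parse.quote(build_query(acc)))
            driver.get(url)
            # "div.g a" is an assumed selector for organic result links.
            links = [a.get_attribute("href")
                     for a in driver.find_elements(By.CSS_SELECTOR, "div.g a")]
            for i, link in enumerate(links[:MAX_LINKS]):
                driver.get(link)
                name = f"{acc}_{i}.html".replace(" ", "_")
                with open(os.path.join(out_dir, name), "w",
                          encoding="utf-8") as f:
                    f.write(driver.page_source)
    finally:
        driver.quit()
```

Saving the raw page source (rather than just the URL) means the scoring steps below can be rerun later without hitting the network again.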


A script to convert HTML to TXT can be found below. This script reads HTML files from the DemoDayHTML directory and writes text files to the DemoDayTxt directory:

htmlToText.py
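The conversion step can be sketched with the standard-library HTML parser; the directory names follow the description above, but the parsing approach itself is an assumption about how htmlToText.py works:

```python
# Sketch of an HTML-to-text pass like htmlToText.py, using only the
# standard library. Skips script/style contents, keeps visible text.
import os
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)

def convert_directory(src="DemoDayHTML", dst="DemoDayTxt"):
    """Convert every .html file in src to a matching .txt file in dst."""
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        if not name.endswith(".html"):
            continue
        with open(os.path.join(src, name), encoding="utf-8",
                  errors="ignore") as f:
            text = html_to_text(f.read())
        out = os.path.join(dst, name[:-len(".html")] + ".txt")
        with open(out, "w", encoding="utf-8") as f:
            f.write(text)
```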

A script to match keywords (accelerator and cohort names) against the resulting text pages can be found in KeyTerms.py. The script takes the keywords listed in CohortAndAcceleratorsFullList.txt and the text files in DemoDayTxt, and creates a file with the number of matches of each keyword in each text file.

The script can be found:

KeyTerms.py
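A minimal sketch of this matching step is shown below. The keyword-file and output paths follow the descriptions on this page, but the case-insensitive counting and the tab-separated output layout are assumptions:

```python
# Sketch of the keyword matcher in KeyTerms.py: count occurrences of
# each keyword in each text file. Output layout is an assumption.
import os

def load_keywords(path="CohortAndAcceleratorsFullList.txt"):
    """Read one keyword per line, dropping blanks."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def count_matches(text, keywords):
    """Return {keyword: number of case-insensitive occurrences in text}."""
    lowered = text.lower()
    return {kw: lowered.count(kw.lower()) for kw in keywords}

def score_directory(keywords, txt_dir="DemoDayTxt",
                    out_path=os.path.join("DemoDayTxt", "KeyTermFile",
                                          "KeyTerms.txt")):
    """Write a tab-separated (file, keyword, count) line per match pair."""
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(txt_dir)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(txt_dir, name), encoding="utf-8") as f:
                counts = count_matches(f.read(), keywords)
            for kw, n in counts.items():
                out.write(f"{name}\t{kw}\t{n}\n")
```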

The Keyword matches text file can be found:

DemoDayTxt\KeyTermFile\KeyTerms.txt

A script to determine which text files contain at least one hit for these keywords can be found:

DemoDayHits.py
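The filtering step can be sketched as below; scanning the text files directly (rather than reading the KeyTerms output) is an assumption about how DemoDayHits.py is implemented:

```python
# Sketch of a filter like DemoDayHits.py: report text files that
# contain at least one keyword hit. The scan-the-text approach and
# directory name are assumptions.
import os

def has_hit(text, keywords):
    """True if any keyword occurs (case-insensitively) in the text."""
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)

def find_hits(keywords, txt_dir="DemoDayTxt"):
    """Return the names of text files with at least one keyword hit."""
    hits = []
    for name in sorted(os.listdir(txt_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(txt_dir, name), encoding="utf-8") as f:
            if has_hit(f.read(), keywords):
                hits.append(name)
    return hits
```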

Downloading HTML Files with Selenium

The code for utilizing Selenium to download HTML files can be found in the DemoDayCrawler.py file.

For an initial observation set, the crawler scraped 100 links for each of 20 sample accelerators drawn from the overall accelerator list. These sample pages were converted to text and scored to remove web pages with no mention of relevant accelerators or companies.

Once the process was tweaked in response to the initial sample testing, it was run again over all accelerators. The testing determined that we needed to take no more than 10 links per accelerator, and that 'Demo Day' was a suitable search term.

Complete Files

These files hold data for all the accelerators: not just the test set.

The full list of accelerators:

ListOfAccs.txt

The full list of potential keywords (used for throwing out irrelevant results):

Keywords.txt

A list of accelerators, queries, and urls:

demoday_crawl_full.txt

A directory with HTML files for all accelerator demo day results:

DemoDayHTMLFull

A directory with TXT files for all accelerator demo day results:

DemoDayTxtFull