Difference between revisions of "Accelerator Demo Day"

From edegan.com

Revision as of 11:27, 25 July 2018


McNair Project
Accelerator Demo Day
Project logo 02.png
Project Information
Project Title Accelerator Demo Day
Owner Minh Le
Start Date 06/18/2018
Deadline
Primary Billing
Notes
Has project status Active
Subsumes: Demo Day Page Parser, Demo Day Page Google Classifier
Copyright © 2016 edegan.com. All Rights Reserved.


Project Introduction

This project uses Selenium and machine learning to find candidate web pages and to classify whether each one is a demo day page containing a list of cohort companies, with the ultimate goal of gathering good candidates to push to Mechanical Turk. The code is written in Python 3 using Selenium and TensorFlow (Keras).

Project Goal

The goal of this project is to find good "Demo Day" candidate web pages and to submit these pages to Amazon Mechanical Turk for data collection. A good candidate is defined as a page containing a list of cohort companies associated with an accelerator. From observation, good candidates usually also contain time and location information about the demo day, which makes them sufficient to push to MTurk for data collection.

Code Location

The source code and relevant files for the project can be found here:

E:\McNair\Projects\Accelerator Demo Day\

The current working Random Forest (RF) model is in:

E:\McNair\Projects\Accelerator Demo Day\Test Run

The RNN model is in:

E:\McNair\Projects\Accelerator Demo Day\Experiment

The RNN is still under heavy development. Modifying anything in this folder is not recommended.

All the other folders are used for experiments; please don't touch them.

General User Guide: How to Use this Project (Random Forest model)

First, change your directory to the working folder:

cd E:\McNair\Projects\Accelerator Demo Day\Test Run

Then you need to specify the list of accelerators you want to crawl by modifying the following file:

ListOfAccsToCrawl.txt

The first line must remain fixed as "Accelerator". Each following line holds one accelerator name. Names are not case sensitive, but keeping the original capitalization is preferable where possible.
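As an illustration of the file layout described above, the following minimal sketch parses the list the way a crawler might. The function name read_accelerators is hypothetical and is not taken from STEP1_crawl.py, and the accelerator names in the sample are placeholders. It only assumes a fixed "Accelerator" header followed by one name per line.

```python
def read_accelerators(text):
    """Parse the contents of ListOfAccsToCrawl.txt into a list of names.

    Hypothetical helper: assumes a header line "Accelerator" followed by
    one accelerator name per line.
    """
    lines = [line.strip() for line in text.splitlines()]
    lines = [line for line in lines if line]  # drop blank lines
    if not lines or lines[0] != "Accelerator":
        raise ValueError('first line must be "Accelerator"')
    return lines[1:]


# Placeholder contents; the real file lists the accelerators to crawl.
sample = "Accelerator\nY Combinator\nTechstars\n"
names = read_accelerators(sample)
print(names)
```

Checking the header up front is one way to catch an accidentally edited first line before a long crawl starts.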

All necessary preparations are now complete. Now onto running the code!

Running the project is as simple as executing the scripts in the correct order. The files are named in the format "STEPX_name", where X is the order of execution. Specifically, run the following 4 commands:

# Crawl Google to get the demo day pages for the accelerators listed in ListOfAccsToCrawl.txt
python3 STEP1_crawl.py
# Preprocess the data using a bag-of-words approach: each page is characterized by the frequencies of chosen keywords, which are stored in words.txt. This script creates a file called feature_matrix.txt
python3 STEP2_preprocessing_feature_matrix_generator.py
# Train the RF model
python3 STEP3_train_rf.py
# Run the model to classify the crawled HTML pages
python3 STEP4_classify_rf.py

The results are stored in the CrawledHTMLFull folder and are sorted into two folders: positive and negative. The positive folder contains the HTML files that the classifier considered "good candidates"; the negative folder contains the rest. There is also a text file called prediction.txt that lists every prediction. feature.txt is irrelevant for the general user and can be ignored; its sole purpose is analysis and debugging.

NEVER touch the TrainingHTML folder, datareader.py or classifier.txt. These are used internally for training.

Amazon Mechanical Turk

There's a file in the folder

CrawledHTMLFull

called

FinalResultWithURL

that was manually created by combining the file

crawled_demoday_page_list.txt

in the parent folder and the file

predicted.txt

This file matches each prediction to the actual URL of the website.
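Since the combined file was created manually, a small script could reproduce the join. The sketch below is only an illustration under stated assumptions: the layout of both input files is not documented here, so it assumes each is tab-separated with no header, with the file name in the first column and the URL or the predicted label in the second. The function name and the sample rows are hypothetical.

```python
import csv
import io


def join_predictions_with_urls(page_list_text, predictions_text):
    """Return (file name, URL, prediction) rows for files present in both inputs.

    Assumed layout (not confirmed by the project files): both inputs are
    tab-separated with no header; the page list holds "file<TAB>url" and
    the predictions hold "file<TAB>label".
    """
    urls = {}
    for row in csv.reader(io.StringIO(page_list_text), delimiter="\t"):
        if len(row) >= 2:
            urls[row[0]] = row[1]

    merged = []
    for row in csv.reader(io.StringIO(predictions_text), delimiter="\t"):
        # Skip predictions for which no URL was recorded.
        if len(row) >= 2 and row[0] in urls:
            merged.append((row[0], urls[row[0]], row[1]))
    return merged


# Placeholder data standing in for the two real files.
pages = "a.html\thttps://example.com/demo-day\nb.html\thttps://example.com/blog\n"
preds = "a.html\t1\nb.html\t0\nc.html\t1\n"
rows = join_predictions_with_urls(pages, preds)
print(rows)
```

If the real files use a different delimiter or column order, only the two parsing loops would need to change.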

Since MTurk makes it hard for us to display the downloaded HTML, it is much faster to copy the URL into the question box than to try to display the downloaded HTML.

The advantage of this is that some websites, such as techcrunch.com, behave abnormally when downloaded as HTML, so opening them in the browser is actually better because the UI is not broken. Moreover, if a website has a paywall or pop-up ads, the user can click out of them. Since paywalls and pop-ups are usually scripts within the HTML, the classifier can't rule such pages out, because the body of the HTML may still contain the useful information we are looking for. Major paywalled websites and websites that require log-ins, such as LinkedIn, have been blacklisted in the crawler. More detail is in the crawler section below.

However, there is a disadvantage to this: websites are ever changing, so in the future the URL may no longer be usable or may point to something else. Downloaded HTML files, on the other hand, stay the same: they do not require an internet connection to render, so their content is static.

To set up MTurk for this project, follow the tutorial in Mechanical Turk (Tool). For testing and development purposes, use https://requestersandbox.mturk.com/

Test account: email: mcboatfaceboaty670@gmail.com password: sameastheoneforemail2018

For this project, the fields asked of the user are:

Connor, add the criteria here

Layout:

Connor, add the screenshot here

Advanced User Guide: An in-depth look into the project and the various settings

The Crawler Functionality

The crawler functionality is stored in the file:

STEP1_crawl.py

The crawler was optimized for improved speed, performance and filtration while remaining functional over the large data set.
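One part of the filtration is the blacklisting of paywalled and login-only sites mentioned in the Mechanical Turk section. The sketch below shows one way such a filter could work. It is not the code in STEP1_crawl.py: the function name, the blacklist contents and the sample URLs are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Illustrative list only; the project's actual blacklist is in the crawler
# and is not reproduced here.
BLACKLISTED_DOMAINS = {"linkedin.com"}


def is_blacklisted(url, blacklist=BLACKLISTED_DOMAINS):
    """Return True if the URL's host is a blacklisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blacklist)


# Placeholder search results.
candidates = [
    "https://www.linkedin.com/company/example",
    "https://example.com/demo-day-2018",
]
kept = [u for u in candidates if not is_blacklisted(u)]
print(kept)
```

Matching on the parsed host name rather than on a substring of the URL avoids rejecting unrelated pages that merely mention a blacklisted site in their path.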

BUG REPORT by Maxine Tao (FIXED): update the crawler with this line of code:

search_results = driver.find_elements_by_xpath("//div[@class='g']/div/div/div/h3/a") + driver.find_elements_by_xpath("//div[@class='g']/div/div/h3/a")

For some reason, the crawler stopped grabbing the first web page (I think because Google may have modified how its website looks).

The Classifier

Input (Features)

The input (features) right now is the frequency of X_NUMBER words appearing in each document. The words are hand-selected. This is the naive bag-of-words approach.

Idea: Create a matrix with the first column being the file BiBTex, and the following columns being the words, so that the value at (file, word) is the frequency of that word in the file. Then split the matrix into an array of row vectors, and feed each vector into the RNN.

This does not seem to give very high accuracy with our LSTM RNN, so I will consider a word2vec approach.
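To make the bag-of-words matrix concrete, here is a minimal sketch of how such rows could be built. It is not the code in STEP2_preprocessing_feature_matrix_generator.py: the function names, the keyword list and the sample pages are placeholders, and it assumes the page text has already been extracted from the HTML and that frequencies are raw counts.

```python
import re
from collections import Counter


def keyword_frequencies(text, keywords):
    """Count how often each chosen keyword occurs in a page's text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in keywords]


def build_feature_matrix(pages, keywords):
    """Return one row per page: [file name, count of keyword 1, count of keyword 2, ...]."""
    return [[name] + keyword_frequencies(text, keywords) for name, text in pages.items()]


keywords = ["demo", "cohort", "startups"]  # placeholder for the contents of words.txt
pages = {
    "a.html": "Demo Day: meet the cohort. Ten startups pitch at demo day.",
    "b.html": "Quarterly earnings report.",
}
matrix = build_feature_matrix(pages, keywords)
for row in matrix:
    print(row)
```

Each row after the first column is the feature vector for one page, which is the form a classifier would consume.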

Development Notes

Right now I am working on two different classifiers: Kyran's old Random Forest model, which I am optimizing by tweaking parameters and trying different combinations of features, and my RNN text classifier.

The RF model has a ~92% accuracy on the training data and ~70% accuracy on the test data.

The RNN currently has ~50% accuracy on both train and test data, which is rather concerning.

The test : train ratio is 1:3 (25/75).
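For reference, a 25/75 split can be sketched with the standard library alone. The helper below is hypothetical and is not how the project scripts split their data; it shuffles with a fixed seed (an assumption, chosen so the split is reproducible) and holds out a quarter of the samples for testing.

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle samples and hold out test_fraction of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Integers stand in for labeled pages.
train, test = split_train_test(range(100))
print(len(train), len(test))
```

Shuffling before splitting matters when the samples are stored in a meaningful order, for example grouped by accelerator.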

Both models currently use the bag-of-words approach to preprocess data, but I will try to use Yang's code from the industry classifier to preprocess using word2vec. I'm not familiar with this approach, but I will try to learn it.

Reading resources

http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf