Both models currently use the bag-of-words approach to preprocess the data, but I will try to use Yang's code from the industry classifier to preprocess with word2vec instead. I'm not familiar with this approach, but I will try to learn it.
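As a rough illustration of the difference between the two preprocessing approaches, here is a minimal sketch (not this project's actual code) that builds a bag-of-words matrix with scikit-learn and trains a small word2vec model with gensim; the sample documents are placeholders.
 # Illustrative sketch only: bag-of-words vs. word2vec preprocessing.
 from sklearn.feature_extraction.text import CountVectorizer
 from gensim.models import Word2Vec
 
 docs = ["demo day for accelerator startups", "accelerator demo day schedule"]
 
 # Bag-of-words: each document becomes a sparse vector of word counts.
 bow = CountVectorizer()
 X = bow.fit_transform(docs)                     # shape: (n_docs, vocabulary_size)
 
 # word2vec: each word gets a dense vector learned from its context (gensim >= 4).
 tokenized = [d.split() for d in docs]
 w2v = Word2Vec(sentences=tokenized, vector_size=50, window=2, min_count=1)
 vec = w2v.wv["accelerator"]                     # 50-dimensional word vector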
==General User Guide: How to Use this Project (Random Forest model)==
First, change your directory to the working folder:
 cd E:\McNair\Projects\Accelerator Demo Day\Test Run
Then specify the list of accelerators you want to crawl by modifying the following file:
 ListOfAccsToCrawl.txt
The first line must remain fixed as "Accelerator". Each of the following rows is an accelerator name. The names are not case sensitive, but it is preferable to match the original capitalization if possible (see the example file below). All necessary preparations are now complete. Now onto running the code!
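For illustration, a ListOfAccsToCrawl.txt listing three accelerators might look like this (the names below are placeholders, not a required list):
 Accelerator
 Y Combinator
 Techstars
 500 Startups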
Running the project is as simple as executing the scripts in the correct order. The files are named in the format "STEPX_name", where X is the order of execution. To be more specific, run the following 4 commands:
''# Crawl Google to get the data for the demo day pages of the accelerators stored in ListOfAccsToCrawl.txt''
''# Run the model to predict on the crawled HTML files.''
python3 STEP4_classify_rf.py
 
The results are stored in the CrawledHTMLFull folder and are classified into two subfolders: positive and negative. The positive folder contains the HTML files that the classifier considered good candidates; the negative folder contains the rest. There is also a text file, prediction.txt, that lists every prediction. feature.txt is not relevant for the general user and can be ignored; its sole purpose is analysis and debugging.
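To make the output layout concrete, here is a minimal sketch of what a classification step like STEP4_classify_rf.py might do. The file layout, the assumption that classifier.txt is a pickled (model, vectorizer) pair, and the helper logic are all illustrative, not the project's actual implementation.
 # Illustrative sketch only: load a trained random forest, score crawled HTML
 # files, and sort them into positive/ and negative/ subfolders.
 import os, shutil, pickle
 
 with open("classifier.txt", "rb") as f:          # assumed: pickled (model, vectorizer)
     model, vectorizer = pickle.load(f)
 
 crawled_dir = "CrawledHTMLFull"
 for sub in ("positive", "negative"):
     os.makedirs(os.path.join(crawled_dir, sub), exist_ok=True)
 
 lines = []
 for name in os.listdir(crawled_dir):
     path = os.path.join(crawled_dir, name)
     if not name.endswith(".html"):
         continue
     with open(path, encoding="utf-8", errors="ignore") as f:
         text = f.read()
     label = model.predict(vectorizer.transform([text]))[0]   # 1 = good candidate
     dest = "positive" if label == 1 else "negative"
     shutil.move(path, os.path.join(crawled_dir, dest, name))
     lines.append(name + "\t" + dest)
 
 with open("prediction.txt", "w") as f:            # one line per crawled page
     f.write("\n".join(lines))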
 
NEVER touch the TrainingHTML folder, datareader.py, or classifier.txt. These are used internally to train the model.
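For context on why these files matter, here is a sketch of how the training pieces might fit together: a data reader loads labeled pages from TrainingHTML, fits a random forest on bag-of-words features, and saves the result as classifier.txt. The folder layout and pickling scheme below are assumptions for explanation only, not the actual datareader.py.
 # Illustrative sketch only: train a random forest on labeled HTML and save it.
 import os, pickle
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.feature_extraction.text import CountVectorizer
 
 texts, labels = [], []
 for label_dir, label in (("TrainingHTML/positive", 1), ("TrainingHTML/negative", 0)):
     for name in os.listdir(label_dir):
         with open(os.path.join(label_dir, name), encoding="utf-8", errors="ignore") as f:
             texts.append(f.read())
         labels.append(label)
 
 vectorizer = CountVectorizer(max_features=5000)   # bag-of-words features
 X = vectorizer.fit_transform(texts)
 
 model = RandomForestClassifier(n_estimators=100)
 model.fit(X, labels)
 
 with open("classifier.txt", "wb") as f:            # the file the classify step loads
     pickle.dump((model, vectorizer), f)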
==The Crawler Functionality==