Listing Page Classifier Progress

Summary

This page records the progress on the Listing Page Classifier Project.

Progress Log

3/28/2019

Assigned Tasks:

Build a site map generator: output every internal link of the input websites
Build a generator that captures a screenshot of each individual web page
Build a CNN classifier using Python and TensorFlow

Suggested Approaches:

beautifulsoup Python package. Articles for future reference:

https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html

selenium Python package

Worked on the site map first; wrote the web scraping script

4/1/2019

Site map:

Some href values may not include the home_page URL, e.g. /careers
Updated urlcrawler.py (having issues identifying internal links that do not start with "/"; will work on this part tomorrow)
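
A minimal sketch of one way to handle both cases (hrefs such as /careers and relative hrefs that do not start with "/"), using only the standard library. The function name resolve_internal_link is hypothetical and is not taken from urlcrawler.py; the sketch also assumes that "internal" means "same host as the home page", which treats www.example.com and example.com as different sites.

```python
from urllib.parse import urljoin, urlparse


def resolve_internal_link(home_page, href):
    """Return the absolute URL if href stays on home_page's host, else None."""
    # urljoin handles "/careers", "team.html", "../x" and absolute URLs alike.
    absolute = urljoin(home_page, href)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # skips mailto:, javascript:, tel: and similar
    if parsed.netloc != urlparse(home_page).netloc:
        return None  # assumption: a different host means an external link
    # Drop the fragment so /careers and /careers#top count as one page.
    return parsed._replace(fragment="").geturl()
```

For example, resolve_internal_link("https://example.com/about/", "team.html") resolves against the current page rather than the site root, which is the case that a simple startswith("/") check misses.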

4/2/2019

Site map:

Solved the second bullet point from yesterday (internal links that do not start with "/")
Recursively collecting internal links from a page causes an HTTPError on some websites (should set up a depth constraint; will work on this tomorrow)

4/3/2019

Site map:

Find similar work done for the McNair project
Clean up my own code + figure out the depth constraint

4/4/2019

Site map (BFS approach is DONE):

Test run a couple of sites to see if there are edge cases that I missed
Implement the BFS code: try to output the result to a txt file
Will work on the DFS approach next week
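
A rough sketch of a BFS site map with a depth constraint, written out to a txt file. This is not the project's actual code: get_links is a hypothetical callable that returns the internal links found on a page, and max_depth is an assumed parameter name for the depth constraint mentioned in the earlier entries.

```python
from collections import deque


def bfs_site_map(home_page, get_links, max_depth):
    """Breadth-first crawl from home_page, at most max_depth links away."""
    visited = {home_page}
    order = [home_page]
    queue = deque([(home_page, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth constraint: do not expand this page further
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


def write_site_map(urls, path):
    """Write one URL per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(urls) + "\n")
```

Passing get_links in as an argument keeps the traversal separate from the scraping, so the traversal can be checked against a small hand-made link graph before it is run on real sites.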

4/8/2019

Site map:

Work on the DFS approach (stuck on the anchor tag; will work on this part tomorrow)
Suggestion: may be able to improve the performance by using a queue

4/9/2019

Something went wrong with my DFS algorithm (it keeps outputting None as the result); will continue working on this tomorrow

4/10/2019

Finished the DFS method
Compared the two methods: DFS is 2-6 seconds faster; theoretically, both methods should take O(n) time
Test ran several websites

4/11/2019

Screenshot tool:

Reference on using the selenium package to generate a full-page screenshot:

http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html

Set up a script that captures a partial screenshot of a website; will work on how to get a full-page screenshot next week
- Downloaded ChromeDriver for Win32

4/15/2019

Screenshot tool:

Implement the screenshot tool
- can capture the full page
- avoids the scroll bar
Will work on generating the png file name automatically tomorrow

4/16/2019

Documentation on wiki

4/17/2019

Documentation on wiki
Implemented the screenshot tool:
- read input from text file
- auto-name png file

(still need to test run the code)

4/18/2019

Test run the screenshot tool
- can’t take a full screenshot of some websites
- WebDriverException: invalid session id occurred during the iteration (solved by not closing the driver each time)
Test run the site map
- BFS takes much more time than DFS when the depth is large (will look into this later)
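
A sketch of the "do not close the driver each time" fix for the invalid session id error, assuming the selenium package and ChromeDriver. The helper names make_headless_chrome and capture_all, the page_N.png naming, and the default width are all hypothetical. One driver is created once, shared across all URLs, and closed only after the loop. The scroll-height resize is one common way to approximate a full-page capture and may not work on every site.

```python
import os


def make_headless_chrome():
    """Create a headless Chrome driver (needs selenium and ChromeDriver)."""
    from selenium import webdriver  # imported here so the loop below is testable without it

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def capture_all(driver, urls, out_dir, width=1280):
    """Screenshot every URL with one shared driver; return the saved paths."""
    saved = []
    for i, url in enumerate(urls):
        driver.get(url)
        # Resize the window to the page height so one shot covers the page.
        height = driver.execute_script("return document.body.scrollHeight")
        driver.set_window_size(width, height)
        path = os.path.join(out_dir, f"page_{i}.png")
        driver.save_screenshot(path)
        saved.append(path)
    return saved
```

In use, the driver would be created once with make_headless_chrome(), passed to capture_all, and then closed with a single driver.quit() after the loop finishes, rather than inside it.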

4/22/2019

Trying to figure out why the full screenshot does not work for some websites:
- e.g. https://bunkerlabs.org/
- get the scroll height before running the headless browser (nope, doesn’t work)
- try out a different package, ‘splinter’

https://splinter.readthedocs.io/en/latest/screenshot.html

4/23/2019

Implement the new screenshot tool (splinter package):
- Reads all text files from one directory and takes a screenshot of each URL listed in each text file
- File name modification (file names are autogenerated, e.g. test7z_0i96__.png)
- Documentation on wiki
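
A small sketch of the input side of such a tool: collecting URLs from every text file in a directory and deriving a png name for each. It assumes one URL per line. The naming scheme here (text file name plus a cleaned-up URL) is hypothetical and is not the scheme that produced names like test7z_0i96__.png; read_urls and png_name are made-up helper names.

```python
import re
from pathlib import Path


def read_urls(directory):
    """Yield (text file stem, url) for each non-empty line of each .txt file."""
    for txt in sorted(Path(directory).glob("*.txt")):
        for line in txt.read_text(encoding="utf-8").splitlines():
            url = line.strip()
            if url:
                yield txt.stem, url


def png_name(stem, url):
    """Build a filesystem-safe PNG name from the source file and the URL."""
    # Drop the scheme, then replace every run of unsafe characters with "_".
    without_scheme = url.split("://", 1)[-1]
    slug = re.sub(r"[^A-Za-z0-9]+", "_", without_scheme).strip("_")
    return f"{stem}_{slug}.png"
```

Each (stem, url) pair from read_urls can then be handed to whatever screenshot call is in use, with png_name(stem, url) as the output file name.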

4/24/2019

Documentation on wiki
Went back to the time complexity issue with BFS and DFS
- The DFS algorithm has flaws!! (it does not visit all nodes, which is why DFS is much faster)
- need to look into the problem with the DFS tomorrow

4/25/2019

Site map:

The recursive DFS will not work for this type of problem, and if we rewrite it iteratively, it will be similar to the BFS approach. So I decided to keep only the BFS, since it is working just fine.
Implement the BFS algorithm: trying out deque etc. to see if it runs faster
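
To illustrate why an iterative DFS ends up looking so much like the BFS, here is a sketch in which the two traversals share all of their code except which end of a deque is popped. The names crawl and get_links are hypothetical, and get_links is assumed to return a list of internal links for a page. There is no depth limit in this sketch.

```python
from collections import deque


def crawl(start, get_links, depth_first=False):
    """Visit every reachable page once; only the pop side differs."""
    visited = set()
    order = []
    frontier = deque([start])
    while frontier:
        # Stack behaviour (pop) gives DFS, queue behaviour (popleft) gives BFS.
        url = frontier.pop() if depth_first else frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        links = list(get_links(url))
        if depth_first:
            links.reverse()  # so the first link on the page is explored first
        frontier.extend(link for link in links if link not in visited)
    return order
```

Both modes visit exactly the same set of pages, only in a different order, so a correct DFS should not be noticeably faster than the BFS on the same site. deque is used because popleft() takes constant time, whereas list.pop(0) shifts every remaining element.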

4/29/2019

Image processing work assigned
Documentation on wiki

4/30/2019

Image Processing:

Research on 3 packages for setting up a CNN
- Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries
  - Scikit-learn: good for small datasets, easy to use. Does not support GPU computation
  - PyTorch: coding is easy, so the learning curve is flatter; supports dynamic graphs, so you can adjust on the go; supports GPU acceleration
  - TensorFlow: flexible; contains several ready-to-use ML models and ready-to-run application packages; scales with hardware and software; large online community; supports only NVIDIA GPUs; slightly steep learning curve
Initiate the idea of data preprocessing: create a proper input dataset for the CNN model

5/2/2019

Work on data preprocessing

5/6/2019

Keep working on data preprocessing
Generate screenshots

5/7/2019

Some issues occurred while generating screenshots (will work on this more tomorrow)
Try to set up the CNN model
- https://www.datacamp.com/community/tutorials/cnn-tensorflow-python

5/8/2019

Fixed the screenshot tool by switching to Firefox
Data preprocessing

5/12/2019

Finish image data preprocessing

5/13/2019

Set up the initial CNN model using Keras
- issue: Keras freezes on the last batch of the first epoch. Make sure of the following:

steps_per_epoch = number of train samples//batch_size
validation_steps = number of validation samples//batch_size
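
The two formulas above as a short sketch. The sample counts and batch size are made-up numbers for illustration, and steps_for is a hypothetical helper name. The resulting values are what would be passed as the steps_per_epoch and validation_steps arguments when training a Keras model from generators.

```python
def steps_for(num_samples, batch_size):
    """Number of full batches per epoch (floor division)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return num_samples // batch_size


# Assumed example numbers, not the project's real dataset sizes.
num_train_samples = 1000
num_validation_samples = 250
batch_size = 32

steps_per_epoch = steps_for(num_train_samples, batch_size)        # 31
validation_steps = steps_for(num_validation_samples, batch_size)  # 7
```

Note that floor division gives 0 when a split has fewer samples than batch_size, so a very small validation set needs a smaller batch size.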

5/14/2019

Implement the CNN model
Work on some changes in the data preprocessing part (image data)
- place class label in image filename

5/15/2019

Correct some out-of-date data in The File to Rule Them ALL.csv; new file saved as The File to Rule Them ALL_NEW.csv
Implement generate_dataset.py and the site map tool
- regenerate the dataset using the updated data and tool

5/16/2019

Implementation of the CNN
Some problems to consider:
- some websites have more than one cohort page: a list of cohorts for each year
- class labels are highly imbalanced:

https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6
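
One common way to compensate for imbalanced class labels is to weight each class inversely to its frequency. The sketch below is an illustration of that idea and not necessarily the approach taken in the article above or in this project. It uses the n_samples / (n_classes * class_count) heuristic, and the function name and the 90/10 split in the usage note are assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

For a dataset with 90 samples of class 0 and 10 of class 1, this gives class 1 a weight of 5.0 and class 0 a weight of about 0.56. A dictionary of this shape can be passed to Keras through the class_weight argument of fit.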

5/17/2019

Have to go back to the old plan of separating the image data :(
documentation on wiki
test run on the GPU server
