Listing Page Classifier Progress
Summary
This page records the progress on the Listing Page Classifier Project
Progress Log
3/28/2019
Assigned Tasks:
- Build a site map generator: output every internal link of input websites
- Build a generator that captures screenshots of individual web pages
- Build a CNN classifier using Python and TensorFlow
Suggested Approaches:
- beautifulsoup Python package. Articles for future reference:
https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
- selenium Python package
Worked on the site map first and wrote the web-scraping script; a minimal sketch of the link-extraction step is below.
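A rough sketch of that first scraping step, assuming requests and beautifulsoup4 are installed; the site URL is a placeholder:

 # Minimal link-extraction sketch: fetch one page and collect every href.
 # Filtering to internal links and crawling come later.
 import requests
 from bs4 import BeautifulSoup

 home_page = "https://www.example.com"  # hypothetical input site

 html = requests.get(home_page, timeout=10).text
 soup = BeautifulSoup(html, "html.parser")
 links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
 print(links)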
4/1/2019
Site map:
- Some href values may not include the home page URL, e.g. /careers
- Updated urlcrawler.py (having issues identifying internal links that do not start with "/") <- will work on this part tomorrow; a sketch of the normalization idea follows this list
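One way to handle both issues above, using only the standard library: resolve every raw href against the home page and keep it if it stays on the same domain. The URLs are placeholders:

 # urljoin resolves "/careers", "careers.html", and absolute URLs alike.
 from urllib.parse import urljoin, urlparse

 home_page = "https://www.example.com"  # hypothetical input site

 def normalize(href):
     """Resolve a raw href and report whether it is internal."""
     full = urljoin(home_page, href)
     internal = urlparse(full).netloc == urlparse(home_page).netloc
     return internal, full

 print(normalize("/careers"))               # (True, 'https://www.example.com/careers')
 print(normalize("https://twitter.com/x"))  # (False, 'https://twitter.com/x')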
4/2/2019
Site map:
- Solved the second bullet point from yesterday
- Recursing to get internal links from a page causes an HTTPError on some websites (should set up a depth constraint - WILL WORK ON THIS TOMORROW)
4/3/2019
Site map:
- Find similar work done for the McNair project
- Clean up my own code + figure out the depth constraint
4/4/2019
Site map (BFS approach is DONE):
- Test ran a couple of sites to check for edge cases I missed
- Implemented the BFS code: outputs the result to a txt file (a sketch follows this list)
- Will work on the DFS approach next week
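A minimal sketch of the BFS crawl with the depth constraint, writing one URL per line to a txt file; get_internal_links() is a hypothetical stand-in for the scraping step shown earlier:

 # Breadth-first crawl of one site, bounded by max_depth.
 def bfs_site_map(home_page, max_depth, out_path):
     visited = {home_page}
     queue = [(home_page, 0)]              # (url, depth) pairs
     while queue:
         url, depth = queue.pop(0)         # plain list as a queue (see 4/25)
         if depth >= max_depth:
             continue
         for link in get_internal_links(url):  # hypothetical helper
             if link not in visited:
                 visited.add(link)
                 queue.append((link, depth + 1))
     with open(out_path, "w") as f:
         f.write("\n".join(sorted(visited)))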
4/8/2019
Site map:
- Work on the DFS approach (stuck on the anchor tag, will work on this part tomorrow)
- Suggestion: may be able to improve performance by using a queue
4/9/2019
- Something went wrong with my DFS algorithm (it keeps outputting None as the result); will continue working on this tomorrow
4/10/2019
- Finished the DFS method
- Compared the two methods: DFS is 2-6 seconds faster, though theoretically both should take O(n)
- Test ran several websites
4/11/2019
Screenshot tool:
- Reference on using the selenium package to generate a full-page screenshot:
http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html
- Set up a script that can capture a partial screenshot of a website (sketched below); will work on how to get a full screenshot next week
  - Downloaded the Chromedriver for Win32
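The partial (viewport-only) version is only a few lines with selenium 3 and the Win32 chromedriver; URLs and paths here are placeholders:

 # Viewport-only screenshot: captures whatever is visible without scrolling.
 from selenium import webdriver

 driver = webdriver.Chrome("chromedriver.exe")  # path to the Win32 driver
 driver.get("https://www.example.com")
 driver.save_screenshot("example.png")          # viewport only, not the full page
 driver.quit()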
4/15/2019
Screenshot tool:
- Implemented the screenshot tool (sketch below)
  - can capture the full screen
  - avoids the scroll bar
- will work on generating the png file name automatically tomorrow
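A sketch of the full-page technique from the blog post linked on 4/11: run Chrome headless (so no scroll bar is rendered), read the page's scroll dimensions, and resize the window to match before saving. Assumes selenium 3; the URL and paths are placeholders:

 from selenium import webdriver

 options = webdriver.ChromeOptions()
 options.add_argument("--headless")       # headless Chrome draws no scroll bar
 driver = webdriver.Chrome("chromedriver.exe", options=options)
 driver.get("https://www.example.com")

 # Grow the window to the full document size, then screenshot the viewport.
 width = driver.execute_script("return document.body.scrollWidth")
 height = driver.execute_script("return document.body.scrollHeight")
 driver.set_window_size(width, height)
 driver.save_screenshot("example_full.png")
 driver.quit()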
4/16/2019
- Documentation on wiki
4/17/2019
- Documentation on wiki
- Implemented the screenshot tool:
  - reads input from a text file
  - auto-names the png file (naming sketch below)
(still need to test-run the code)
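A hypothetical sketch of the auto-naming step: read URLs from a text file, one per line, and derive a filesystem-safe png name from each; the input filename is a placeholder:

 import re

 def png_name(url):
     """e.g. 'https://bunkerlabs.org/about' -> 'bunkerlabs_org_about.png'"""
     stem = re.sub(r"^https?://", "", url.strip())
     stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_")
     return stem + ".png"

 with open("urls.txt") as f:  # placeholder input file
     for url in f:
         if url.strip():
             print(png_name(url))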
4/18/2019
- Test ran the screenshot tool
  - can't take a full screenshot of some websites
  - WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)
- Test ran the site map
  - BFS takes much more time than DFS when the depth is big (will look into this later)
4/22/2019
- Trying to figure out why the full screenshot does not work for some websites:
  - e.g. https://bunkerlabs.org/
  - get the scroll height before launching the headless browser (nope, doesn't work)
  - try out a different package, 'splinter':
https://splinter.readthedocs.io/en/latest/screenshot.html
4/23/2019
- Implemented the new screenshot tool (splinter package), sketched below:
  - reads all text files from one directory and takes a screenshot of each url in each of those text files
  - filename modification (e.g. test7z_0i96__.png; the file name is auto-generated)
- Documentation on wiki
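A sketch of the splinter version, per the docs linked on 4/22: iterate over every text file in a directory and save a full-page screenshot per URL. splinter appends random characters to the name prefix, which is where names like test7z_0i96__.png come from; the directory layout is an assumption:

 import glob
 from splinter import Browser

 browser = Browser("chrome", headless=True)
 for txt_file in glob.glob("url_lists/*.txt"):  # one text file per site
     with open(txt_file) as f:
         for url in f:
             if url.strip():
                 browser.visit(url.strip())
                 # full=True captures the whole page, not just the viewport
                 browser.screenshot(name="test", full=True)
 browser.quit()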
4/24/2019
- Documentation on wiki
- Went back to the time complexity issue with BFS and DFS
  - the DFS algorithm has flaws!! (it does not visit all nodes, which is why DFS is much faster)
  - need to look into the problem with the DFS tomorrow
4/25/2019
Site map:
- Recursive DFS will not work for this type of problem, and rewriting it iteratively makes it similar to the BFS approach, so I decided to keep only the BFS since it is working just fine.
- Implemented the BFS algorithm: trying out deque etc. to see if it runs faster (see the note below)
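The deque experiment in brief: list.pop(0) shifts every remaining element, so it is O(n) per dequeue, while collections.deque.popleft() is O(1); swapping the queue type speeds up the BFS loop without changing its logic:

 from collections import deque

 queue = deque([("https://www.example.com", 0)])  # placeholder seed
 url, depth = queue.popleft()  # O(1), vs. queue.pop(0) on a plain list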
4/29/2019
- Image processing work assigned
- Documentation on wiki
4/30/19
Image Processing:
- Researched 3 packages for setting up a CNN
  - Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries
    - Scikit-learn: good for small datasets and easy to use, but does not support GPU computation
    - PyTorch: easy to code, so it has a flatter learning curve; supports dynamic graphs, so you can adjust on the go; supports GPU acceleration
    - TensorFlow: flexible; contains several ready-to-use ML models and ready-to-run application packages; scales with hardware and software; large online community; supports only NVIDIA GPUs; a slightly steep learning curve
- Initiated the idea of data preprocessing: create a proper input dataset for the CNN model
5/2/2019
- Work on data preprocessing
5/6/2019
- Keep working on data preprocessing
- Generate screenshots
5/7/2019
- Some issues occurred during screenshot generation (will work on this more tomorrow)
- Tried to set up the CNN model
  - https://www.datacamp.com/community/tutorials/cnn-tensorflow-python
5/8/2019
- Fixed the screenshot tool by switching to Firefox
- Data preprocessing
5/12/2019
- Finished image data preprocessing
5/13/2019
- Set up the initial CNN model using Keras (a sketch with the fix applied is below)
  - issue: Keras freezes on the last batch of the first epoch; make sure of the following:

 steps_per_epoch = number of train samples // batch_size
 validation_steps = number of validation samples // batch_size
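A minimal sketch of the Keras setup with that fix applied; the image size, batch size, and directory name are assumptions, not the project's actual values:

 from tensorflow.keras import layers, models
 from tensorflow.keras.preprocessing.image import ImageDataGenerator

 batch_size = 32
 gen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)
 train = gen.flow_from_directory("screenshots", target_size=(224, 224),
                                 batch_size=batch_size, subset="training")
 val = gen.flow_from_directory("screenshots", target_size=(224, 224),
                               batch_size=batch_size, subset="validation")

 model = models.Sequential([
     layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
     layers.MaxPooling2D(),
     layers.Conv2D(64, 3, activation="relu"),
     layers.MaxPooling2D(),
     layers.Flatten(),
     layers.Dense(64, activation="relu"),
     layers.Dense(train.num_classes, activation="softmax"),
 ])
 model.compile(optimizer="adam", loss="categorical_crossentropy",
               metrics=["accuracy"])

 # The fix: explicit step counts keep the generators from running past the
 # end of an epoch and hanging on the last batch.
 model.fit_generator(train,
                     steps_per_epoch=train.samples // batch_size,
                     validation_data=val,
                     validation_steps=val.samples // batch_size,
                     epochs=5)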
5/14/2019
- Implemented the CNN model
- Worked on some changes in the data preprocessing part (image data)
  - place the class label in the image filename (sketched below)
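A hypothetical sketch of the filename-label scheme: prefix each png with its class, then recover (path, label) pairs when building the dataset. The naming convention is an assumption for illustration:

 import os

 def label_filename(png_name, label):
     """e.g. ('bunkerlabs_org.png', 1) -> '1_bunkerlabs_org.png'"""
     return f"{label}_{png_name}"

 def read_label(filename):
     """Recover the class label from a name like '1_bunkerlabs_org.png'."""
     return int(os.path.basename(filename).split("_", 1)[0])

 assert read_label(label_filename("bunkerlabs_org.png", 1)) == 1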
5/15/2019
- Corrected some out-of-date data in The File to Rule Them ALL.csv; the new file is saved as The File to Rule Them ALL_NEW.csv
- Implemented generate_dataset.py and the sitemap tool
  - regenerated the dataset using the updated data and tools
5/16/2019
- Implementation of the CNN
- Some problems to consider:
  - some websites have more than 1 cohort page: a list of cohorts for each year
  - the class labels are highly imbalanced (one mitigation from the article below is sketched after it):
https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6
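One mitigation discussed in that article, sketched for the Keras setup above: weight the loss so the rare cohort-page class counts more per example. The class counts here are made up for illustration:

 n_cohort, n_other = 120, 4000   # hypothetical class counts
 total = n_cohort + n_other
 class_weight = {
     0: total / (2 * n_other),   # majority class: weight < 1
     1: total / (2 * n_cohort),  # minority class: weight > 1
 }
 # model.fit_generator(..., class_weight=class_weight)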
5/17/2019
- Have to go back to the old plan of separating the image data :(
- Documentation on wiki
- Test run on the GPU server