Listing Page Classifier Progress
Latest revision as of 18:37, 17 May 2019

Summary

This page records the progress on the Listing Page Classifier Project

Progress Log

3/28/2019

Assigned Tasks:

  • Build a site map generator: output every internal link of an input website
  • Build a generator that captures screenshots of individual web pages
  • Build a CNN classifier using Python and TensorFlow

Suggested Approaches:

  • beautifulsoup Python package. Articles for future reference:
https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
  • selenium Python package

Worked on the site map first; wrote the web-scraping script.
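The beautifulsoup approach can be sketched as follows: pull every anchor tag out of a page's HTML and collect the href values (a minimal example, assuming the bs4 package; the function name and sample page are illustrative, not project code):

```python
from bs4 import BeautifulSoup

def extract_hrefs(html):
    """Return the href value of every anchor tag in an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

# A tiny sample page standing in for a fetched website.
page = '<html><body><a href="/careers">Jobs</a> <a href="https://example.com/about">About</a></body></html>'
print(extract_hrefs(page))  # ['/careers', 'https://example.com/about']
```

The real crawler would fetch each page over HTTP first and then feed the response body to this step.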

4/1/2019

Site map:

  • Some hrefs may not include the home-page URL, e.g. /careers
  • Updated urlcrawler.py (having issues with identifying internal links that do not start with "/") <- will work on this part tomorrow
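Relative hrefs like /careers can be resolved against the home-page URL with the standard library's urljoin, and a link can be classed as internal by comparing hosts (a sketch of the idea, not the project's urlcrawler.py code):

```python
from urllib.parse import urljoin, urlparse

def normalize(home, href):
    """Resolve a possibly-relative href against the site's home page URL."""
    return urljoin(home, href)

def is_internal(home, url):
    """Treat a link as internal when it resolves to the same network host."""
    return urlparse(url).netloc == urlparse(home).netloc

home = "https://example.com"
print(normalize(home, "/careers"))                        # https://example.com/careers
print(is_internal(home, normalize(home, "about.html")))   # True
print(is_internal(home, "https://twitter.com/example"))   # False
```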

4/2/2019

Site map:

  • Solved the second bullet point from yesterday
  • Recursion to get internal links from a page causes an HTTPError on some websites (should set up a depth constraint - WILL WORK ON THIS TOMORROW)

4/3/2019

Site map:

  • Found similar work done for the McNair project
  • Cleaned up my own code + figured out the depth constraint

4/4/2019

Site map (BFS approach is DONE):

  • Test ran a couple of sites to see if there are edge cases that I missed
  • Implemented the BFS code: trying to output the result in a txt file
  • Will work on the DFS approach next week
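The BFS crawl with a depth constraint can be sketched over an in-memory link graph standing in for live fetches (all names here are illustrative, not the project's actual code):

```python
def bfs_sitemap(start, get_links, max_depth=2):
    """Breadth-first crawl up to max_depth, returning every URL seen."""
    visited = {start}
    queue = [(start, 0)]             # plain list used as a FIFO queue
    while queue:
        url, depth = queue.pop(0)
        if depth >= max_depth:
            continue                 # depth constraint stops runaway recursion
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return visited

# A toy link graph standing in for live HTTP fetches.
graph = {"/": ["/about", "/careers"], "/about": ["/team"],
         "/careers": [], "/team": []}
pages = bfs_sitemap("/", lambda u: graph.get(u, []), max_depth=2)

# Writing the result to a txt file would be one URL per line, e.g.:
# open("sitemap.txt", "w").write("\n".join(sorted(pages)))
print(sorted(pages))  # ['/', '/about', '/careers', '/team']
```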

4/8/2019

Site map:

  • Worked on the DFS approach (stuck on the anchor tag; will work on this part tomorrow)
  • Suggestion: may be able to improve the performance by using a queue

4/9/2019

  • Something went wrong with my DFS algorithm (it keeps outputting None as a result); will continue working on this tomorrow

4/10/2019

  • Finished the DFS method
  • Compared the two methods: DFS is 2 - 6 seconds faster; theoretically, both methods should run in O(n)
  • Test ran several websites
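An iterative DFS differs from the BFS only in using a LIFO stack instead of a FIFO queue; with the same visited-set bookkeeping, each reachable page is processed once, which is why both should be linear time (a sketch on a toy graph, not the project's code):

```python
def dfs_sitemap(start, get_links):
    """Depth-first crawl: a LIFO stack dives deep before going wide."""
    visited = set()
    stack = [start]
    while stack:
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        stack.extend(get_links(url))
    return visited

graph = {"/": ["/a", "/b"], "/a": ["/c"], "/b": [], "/c": []}
print(sorted(dfs_sitemap("/", lambda u: graph.get(u, []))))  # ['/', '/a', '/b', '/c']
```

The visit order differs from BFS, but the set of pages found must be identical on the same graph.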

4/11/2019

Screenshot tool:

  • Reference on using the selenium package to generate a full-page screenshot:
http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html
  • Set up a script that can capture a partial screenshot of a website; will work on how to get a full screenshot next week
    • Downloaded the ChromeDriver for Win32

4/15/2019

Screenshot tool:

  • Implemented the screenshot tool
    • can capture the full screen
    • avoids the scroll bar
  • will work on generating the png file name automatically tomorrow

4/16/2019

  • Documentation on wiki

4/17/2019

  • Documentation on wiki
  • Implemented the screenshot tool:
    • read input from text file
    • auto-name png file

(still need to test run the code)
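The auto-naming step can be sketched by deriving a filesystem-safe .png name from each URL (the sanitization rule below is an illustrative assumption, not the tool's documented scheme):

```python
import re

def auto_name(url):
    """Derive a filesystem-safe .png filename from a URL (illustrative rule)."""
    stem = re.sub(r"^https?://", "", url)        # drop the scheme
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem)   # collapse unsafe characters
    return stem.strip("_") + ".png"

print(auto_name("https://example.com/careers"))  # example_com_careers.png
```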

4/18/2019

  • Test ran the screenshot tool
    • can’t take a full screenshot of some websites
    • WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)
  • Test ran the site map
    • BFS takes much more time than DFS when the depth is big (will look into this later)

4/22/2019

  • Trying to figure out why the full screenshot does not work for some websites:
    • e.g. https://bunkerlabs.org/
    • get the scroll height before running the headless browser (Nope, doesn’t work)
    • try out a different package, ‘splinter’:
https://splinter.readthedocs.io/en/latest/screenshot.html


4/23/2019

  • Implemented a new screenshot tool (splinter package):
    • Reads all text files from one directory and takes a screenshot of each URL in the individual text files in that directory
    • Filename modification (e.g. test7z_0i96__.png; auto-generates the file name)
    • Documentation on wiki

4/24/2019

  • Documentation on wiki
  • Went back to the time-complexity issue with BFS and DFS
    • The DFS algorithm has flaws!! (it does not visit all nodes, which is why DFS is much faster)
    • Need to look into the problem with the DFS tomorrow

4/25/2019

Site map:

  • The recursive DFS will not work for this type of problem, and if we rewrite it iteratively it becomes similar to the BFS approach, so I decided to keep only the BFS, which is working just fine.
  • Implemented the BFS algorithm: trying out deque etc. to see if it runs faster
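The deque experiment targets the queue pop: list.pop(0) shifts every remaining element (O(n) per pop), while collections.deque.popleft() is O(1); both yield the identical FIFO order (a minimal comparison with made-up URLs):

```python
from collections import deque

urls = [f"/page{i}" for i in range(5)]

# list-as-queue: pop(0) is O(n) because the tail shifts left each time
lst = list(urls)
order_list = [lst.pop(0) for _ in urls]

# deque: popleft() is O(1)
dq = deque(urls)
order_deque = [dq.popleft() for _ in urls]

print(order_list == order_deque)  # True: same FIFO order, cheaper pops
```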


4/29/2019

  • Image processing work assigned
  • Documentation on wiki


4/30/19

Image Processing:

  • Researched 3 packages for setting up a CNN
    • Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries
      • Scikit-learn: good for small datasets, easy to use; does not support GPU computation
      • PyTorch: easy to code, so it has a flatter learning curve; supports dynamic graphs so you can adjust on the go; supports GPU acceleration
      • TensorFlow: flexible; contains several ready-to-use ML models and ready-to-run application packages; scales with hardware and software; large online community; supports only NVIDIA GPUs; a slightly steep learning curve
  • Initiated the idea of data preprocessing: create a proper input dataset for the CNN model

5/2/2019

  • Work on data preprocessing

5/6/2019

  • Keep working on data preprocessing
  • Generate screenshot

5/7/2019

  • Some issues occurred during screenshot generation (will work on this more tomorrow)
  • Tried to set up the CNN model
    • https://www.datacamp.com/community/tutorials/cnn-tensorflow-python

5/8/2019

  • Fixed the screenshot tool by switching to Firefox
  • Data preprocessing

5/12/2019

  • Finish image data preprocessing

5/13/2019

  • Set up an initial CNN model using Keras
    • issue: Keras freezes on the last batch of the first epoch; make sure of the following:
steps_per_epoch = number of train samples // batch_size
validation_steps = number of validation samples // batch_size
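The fix above is just integer division of the sample counts by the batch size, so the generator is exhausted exactly once per epoch (a minimal check of the arithmetic; the dataset sizes here are made up):

```python
def steps_for(n_samples, batch_size):
    """Batches per epoch: floor-divide so a partial final batch is dropped."""
    return n_samples // batch_size

# Hypothetical dataset sizes, just to illustrate the arithmetic.
steps_per_epoch = steps_for(1000, 32)    # 31 full batches (992 samples)
validation_steps = steps_for(200, 32)    # 6 full batches (192 samples)
print(steps_per_epoch, validation_steps)
```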

5/14/2019

  • Implement the CNN model
  • Work on some changes in the data preprocessing part (image data)
    • place class label in image filename
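With the class label embedded in the image filename, preprocessing can recover labels by parsing the name. A sketch, assuming a hypothetical "label prefix before the first underscore" convention (the project's actual naming format is not documented here):

```python
import os

def label_from_filename(path):
    """Assume '<label>_<rest>.png'; return the integer label prefix."""
    name = os.path.basename(path)
    return int(name.split("_", 1)[0])

print(label_from_filename("data/1_example_com_cohort.png"))  # 1
print(label_from_filename("0_example_com_about.png"))        # 0
```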

5/15/2019

  • Corrected some out-of-date data in The File to Rule Them ALL.csv; new file saved as The File to Rule Them ALL_NEW.csv
  • Implemented generate_dataset.py and the sitemap tool
    • regenerated the dataset using the updated data and tools

5/16/2019

  • Implementation of the CNN
  • Some problems to consider:
    • some websites have more than 1 cohort page: a list of cohorts for each year
    • the class label is highly imbalanced:
https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6
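One standard remedy from that article is to weight each class inversely to its frequency, so the rare listing-page class is not drowned out during training (a manual sketch; the label counts below are made up):

```python
def class_weights(counts):
    """Weight each class by total / (n_classes * count): rarer => heavier."""
    total = sum(counts.values())
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}

# Hypothetical label counts: far more non-listing pages than listing pages.
weights = class_weights({0: 900, 1: 100})
print(weights)  # class 1 is weighted 9x heavier than class 0
```

A dict like this can be passed to Keras's fit via its class_weight argument.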


5/17/2019

  • Had to go back to the old plan of separating the image data :(
  • Documentation on wiki
  • Test ran on the GPU server