Difference between revisions of "Listing Page Classifier Progress"

From edegan.com

Revision as of 13:53, 23 April 2019

Summary

This page records progress on the Listing Page Classifier project.

Progress Log

3/28/2019

Assigned Tasks:

  • Build a site map generator: output every internal link of the input websites
  • Build a generator that captures a screenshot of each individual web page
  • Build a CNN classifier using Python and TensorFlow

Suggested Approaches:

  • beautifulsoup Python package. Articles for future reference:
https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
  • selenium Python package

Worked on the site map first; wrote the web scraping script.

4/1/2019

Site map:

  • Some href values may not include the home page URL, e.g. /careers
  • Updated urlcrawler.py (still has issues identifying internal links that do not start with "/"); will work on this part tomorrow
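
Both link cases noted above (hrefs such as /careers that omit the home page URL, and internal links that do not start with "/") can be handled by resolving each href against the page URL and then comparing hosts. The following is a minimal sketch using only the standard library; the function name internal_links and the example URLs are hypothetical and are not taken from urlcrawler.py.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(page_url, hrefs):
    """Return absolute URLs from hrefs that stay on page_url's host."""
    host = urlparse(page_url).netloc
    found = []
    for href in hrefs:
        # urljoin resolves "/careers", "team.html" and full URLs alike;
        # urldefrag drops any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            if absolute not in found:
                found.append(absolute)
    return found


links = internal_links(
    "https://example.com/about/",
    [
        "/careers",                  # relative to the site root
        "team.html",                 # relative to the current page
        "https://example.com/blog",  # already absolute, same host
        "https://other.org/",        # external, dropped
        "mailto:hi@example.com",     # not http(s), dropped
    ],
)
```

Comparing netloc values treats www.example.com and example.com as different hosts, so a real crawler may want to normalise the host first.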

4/2/2019

Site map:

  • Solved the second bullet point from yesterday (internal links that do not start with "/")
  • Recursively collecting internal links from a page causes an HTTPError on some websites (should set up a depth constraint; will work on this tomorrow)

4/3/2019

Site map:

  • Find similar work done for the McNair project
  • Clean up my own code and figure out the depth constraint

4/4/2019

Site map (BFS approach is DONE):

  • Test run a couple of sites to see if there are edge cases that I missed
  • Implement the BFS code; try to output the result to a txt file
  • Will work on the DFS approach next week
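
A breadth-first site map with a depth constraint and text-file output could look roughly like the sketch below. It is an illustration, not the project's actual code: the names bfs_site_map, write_site_map and FAKE_SITE are hypothetical, and the link lookup is passed in as a function so the example runs without network access.

```python
from collections import deque


def bfs_site_map(start_url, get_links, max_depth):
    """Breadth-first crawl from start_url, following links up to max_depth hops."""
    seen = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()  # O(1), unlike list.pop(0)
        if depth == max_depth:
            continue  # depth constraint: do not expand this page
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


def write_site_map(urls, path):
    """Write one URL per line to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")


# Stand-in link graph (hypothetical) so the sketch runs offline.
FAKE_SITE = {
    "/": ["/about", "/careers"],
    "/about": ["/team", "/"],
    "/careers": ["/jobs"],
    "/team": ["/deep"],
}
site_map = bfs_site_map("/", lambda url: FAKE_SITE.get(url, []), max_depth=2)
```

With max_depth=2 the page /deep is never reached, because /team sits at depth 2 and is not expanded.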

4/8/2019

Site map:

  • Work on the DFS approach (stuck on the anchor tag; will work on this part tomorrow)
  • Suggestion: may be able to improve performance by using a queue

4/9/2019

  • Something went wrong with my DFS algorithm (it keeps outputting None as the result); will continue working on this tomorrow

4/10/2019

  • Finished the DFS method
  • Compared the two methods: DFS is 2 to 6 seconds faster, although in theory both methods should take O(n) time
  • Test ran several websites
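
For comparison, a depth-first version can be sketched as below. Like a breadth-first crawl, it visits each page once and examines each link once, which is why the two approaches have the same asymptotic cost. The names dfs_site_map and FAKE_SITE are hypothetical. The explicit return at the end matters: a recursive function that falls through without a return yields None, which is one common cause of the symptom logged on 4/9, although the log does not record the actual cause.

```python
def dfs_site_map(start_url, get_links, max_depth):
    """Depth-first crawl; returns visited URLs in the order they were reached."""
    visited = {}  # dict keeps insertion order and gives O(1) membership tests

    def visit(url, depth):
        visited[url] = depth
        if depth == max_depth:
            return  # depth constraint: do not go deeper from here
        for link in get_links(url):
            if link not in visited:
                visit(link, depth + 1)

    visit(start_url, 0)
    return list(visited)  # without this line the caller would receive None


# Stand-in link graph (hypothetical) so the sketch runs offline.
FAKE_SITE = {
    "/": ["/about", "/careers"],
    "/about": ["/team", "/"],
    "/careers": ["/jobs"],
    "/team": ["/deep"],
}
dfs_order = dfs_site_map("/", lambda url: FAKE_SITE.get(url, []), max_depth=2)
```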

4/11/2019

Screenshot tool:

  • Reference for using the selenium package to generate a full-page screenshot:
http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html
  • Set up a script that can capture a partial screenshot of a website; will work on how to get a full-page screenshot next week
    • Downloaded the ChromeDriver for Win32

4/15/2019

Screenshot tool:

  • Implement the screenshot tool
    • can capture the full page
    • avoids the scroll bar
  • Will work on generating the png file name automatically tomorrow
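
One common way to get a full-page capture without a scroll bar is to run Chrome headless, read the page's scroll height, and resize the window to that height before saving. The sketch below assumes that approach; the log does not say which technique the tool actually uses, and the 4/22 entry notes that a scroll-height approach fails on some sites. The names capture_full_page and make_headless_chrome and the 1280-pixel width are hypothetical. A stand-in driver is used for the dry run so the example executes without a browser.

```python
from unittest.mock import MagicMock


def capture_full_page(driver, url, png_path, width=1280):
    """Resize the window to the page's scroll height, then save a PNG."""
    driver.get(url)
    height = driver.execute_script(
        "return Math.max(document.body.scrollHeight,"
        " document.documentElement.scrollHeight);"
    )
    driver.set_window_size(width, height)
    driver.save_screenshot(png_path)
    return width, height


def make_headless_chrome():
    """Create a headless Chrome driver (needs selenium and chromedriver)."""
    from selenium import webdriver  # third-party import kept local

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--hide-scrollbars")  # keep the scroll bar out of the image
    return webdriver.Chrome(options=options)


# Dry run with a stand-in driver so the sketch executes without a browser.
fake_driver = MagicMock()
fake_driver.execute_script.return_value = 4200  # pretend scroll height
size = capture_full_page(fake_driver, "https://example.com/", "example.png")
```

With a real browser, pass the result of make_headless_chrome() as the driver and call driver.quit() when finished.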

4/16/2019

  • Documentation on wiki

4/17/2019

  • Documentation on wiki
  • Implemented the screenshot tool:
    • reads input from a text file
    • auto-names the png file

(still need to test run the code)
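
Reading URLs from a text file and deriving a png name from each URL could be sketched as follows. The naming scheme (host plus path, with unsafe characters replaced by underscores) is an assumption for illustration; the log does not describe the scheme the tool uses, and the names read_urls and png_name are hypothetical.

```python
import re
from urllib.parse import urlparse


def read_urls(path):
    """Return the non-empty, stripped lines of a text file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def png_name(url):
    """Build a file name from the host and path of a URL."""
    parts = urlparse(url)
    raw = parts.netloc + parts.path
    # Replace every run of characters that is unsafe in file names.
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return (safe or "page") + ".png"


name_home = png_name("https://bunkerlabs.org/")
name_deep = png_name("https://example.com/about/team.html")
```

Here name_home is "bunkerlabs.org.png" and name_deep is "example.com_about_team.html.png".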

4/18/2019

  • Test ran the screenshot tool
    • can’t take a full screenshot of some websites
    • WebDriverException: invalid session id occurred during the iteration (solved by not closing the driver after each URL)
  • Test ran the site map
    • BFS takes much more time than DFS when the depth is large (will try to fix this later)
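
The invalid session id fix described above amounts to keeping one driver session open for the whole loop and ending it once at the end. A minimal sketch of that pattern follows; the function name capture_all and the page0.png naming are hypothetical, and a stand-in driver replaces the real Selenium driver so the example runs without a browser.

```python
import os
from unittest.mock import MagicMock


def capture_all(driver, urls, out_dir):
    """Take one screenshot per URL with a single, shared driver session."""
    saved = []
    try:
        for index, url in enumerate(urls):
            driver.get(url)
            path = os.path.join(out_dir, "page{}.png".format(index))
            driver.save_screenshot(path)
            saved.append(path)
    finally:
        # End the session once, after the last URL. Calling quit() inside
        # the loop would invalidate the session for the next iteration.
        driver.quit()
    return saved


# Dry run with a stand-in driver so the sketch executes without a browser.
fake_driver = MagicMock()
saved_paths = capture_all(
    fake_driver, ["https://a.example/", "https://b.example/"], "shots"
)
```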

4/22/2019

  • Trying to figure out why the full screenshot does not work for some websites:
    • e.g. https://bunkerlabs.org/
    • getting the scroll height before running the headless browser (did not work)
    • trying a different package, ‘splinter’:
https://splinter.readthedocs.io/en/latest/screenshot.html


4/23/2019

  • Implement the new screenshot tool (splinter package):
    • Read all text files from one directory and take a screenshot of each URL listed in those files
    • Filename modification (e.g. test7.pngz_0i96__ has a strange trailing suffix)
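
The strange trailing characters look like the random infix that Python's tempfile.mkstemp places between a prefix and a suffix. Assuming the screenshot call builds its file name that way (an assumption about the installed splinter version, not something this log confirms), one workaround is to rename the file after capture. The sketch below demonstrates the directory reading and the rename without a browser; the names urls_from_directory and clean_name are hypothetical.

```python
import glob
import os
import tempfile


def urls_from_directory(directory):
    """Collect URLs from every .txt file in directory, one URL per line."""
    urls = []
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        with open(path, encoding="utf-8") as handle:
            urls.extend(line.strip() for line in handle if line.strip())
    return urls


def clean_name(temp_path, wanted_name):
    """Rename a randomly named screenshot to wanted_name in the same folder."""
    target = os.path.join(os.path.dirname(temp_path), wanted_name)
    os.replace(temp_path, target)
    return target


# Demonstration without a browser: mkstemp inserts random characters
# between the prefix and the suffix, giving names like test7.pngXXXXXXXX.png.
work_dir = tempfile.mkdtemp()
with open(os.path.join(work_dir, "sites.txt"), "w", encoding="utf-8") as handle:
    handle.write("https://bunkerlabs.org/\n\nhttps://example.com/\n")
fd, temp_path = tempfile.mkstemp(prefix="test7.png", suffix=".png", dir=work_dir)
os.close(fd)
final_path = clean_name(temp_path, "test7.png")
found_urls = urls_from_directory(work_dir)
```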