Listing Page Classifier Progress
Summary
This page records progress on the Listing Page Classifier project.
Progress Log
3/28/2019
Assigned Tasks:
- Build a site map generator: output every internal link of the input websites
- Build a generator that captures a screenshot of each individual web page
- Build a CNN classifier using Python and TensorFlow
Suggested Approaches:
- beautifulsoup Python package. Articles for future reference:
https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
- selenium Python package
Worked on the site map first; wrote the web-scraping script
4/1/2019
Site map:
- Some href values do not include the home page URL (e.g. /careers)
- Updated urlcrawler.py (having issues identifying internal links that do not start with "/"); will work on this part tomorrow
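A minimal sketch of how relative hrefs and internal-link detection could be handled with the standard library. This is not the code in urlcrawler.py; the function names `resolve_href` and `is_internal` are hypothetical, and the rule "same host means internal" is an assumption.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_href(page_url, href):
    """Turn a raw href into an absolute URL and drop any #fragment."""
    absolute = urljoin(page_url, href)  # handles "/careers", "team.html", "../x"
    return urldefrag(absolute).url


def is_internal(home_url, url):
    """True when url is an http(s) link on the same host as home_url."""
    home, target = urlparse(home_url), urlparse(url)
    return target.scheme in ("http", "https") and target.netloc == home.netloc
```

Resolving every href against the page it was found on covers links that start with "/" as well as those that do not. Note that this host comparison treats www.example.com and example.com as different sites.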
4/2/2019
Site map:
- Solved the second bullet point from yesterday (identifying internal links that do not start with "/")
- Recursing to collect internal links from a page raises HTTPError on some websites (should set up a depth constraint; will work on this tomorrow)
4/3/2019
Site map:
- Find similar work done for the McNair project
- Clean up my own code + figure out the depth constraint
4/4/2019
Site map (BFS approach is DONE):
- Test run a couple of sites to see if there are edge cases that I missed
- Implement the BFS code: try to output the result to a txt file
- Will work on DFS approach next week
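The BFS approach with a depth constraint can be sketched as follows. This is an illustration under stated assumptions, not the project's script: `bfs_site_map`, `write_site_map`, and the `get_links` callback (a function that returns the internal links found on one page) are hypothetical names.

```python
from collections import deque


def bfs_site_map(start_url, get_links, max_depth):
    """Return URLs reachable from start_url within max_depth clicks, in BFS order."""
    visited = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are listed but not expanded
        try:
            links = get_links(url)
        except OSError:
            # urllib's HTTPError is a subclass of OSError; skip pages that fail
            continue
        for link in links:
            if link not in visited:
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


def write_site_map(urls, path):
    """Write one URL per line to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
```

Passing the page fetcher in as `get_links` keeps the traversal separate from the scraping code, so it can be tried on a small in-memory link graph before running it against real sites.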
4/8/2019
Site map:
- Work on the DFS approach (stuck on the anchor tag; will work on this part tomorrow)
- Suggestion: may be able to improve performance by using a queue
4/9/2019
- Something went wrong with my DFS algorithm (it keeps outputting None as the result); will continue working on this tomorrow
4/10/2019
- Finished DFS method
- Compared the two methods: DFS was 2 to 6 seconds faster in test runs; in theory, both methods should take O(n) time
- Test run several websites
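A depth-limited DFS can be sketched in the same style. The names `dfs_site_map` and `get_links` are hypothetical, and this is not the project's implementation. One common reason a recursive Python function yields None is a code path that ends without a return statement; whether that caused the earlier issue is not recorded here, but the sketch makes the return explicit.

```python
def dfs_site_map(start_url, get_links, max_depth):
    """Return URLs reachable from start_url within max_depth clicks, in DFS preorder."""
    order = []
    seen = set()

    def visit(url, depth):
        seen.add(url)
        order.append(url)
        if depth == max_depth:
            return  # do not expand pages at the depth limit
        try:
            links = get_links(url)
        except OSError:
            return  # skip pages that fail to load
        for link in links:
            if link not in seen:
                visit(link, depth + 1)

    visit(start_url, 0)
    return order  # without this line the caller would receive None
```

Under a depth limit, DFS and BFS can return different sets of pages, because DFS may first reach a page through a longer path and then refuse to expand it.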
4/11/2019
Screenshot tool:
- Reference on using the selenium package to generate a full-page screenshot:
http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html
- Set up a script that can capture a partial screenshot of a website; will work on how to get a full-page screenshot next week
- Downloaded ChromeDriver for Win32
4/15/2019
Screenshot tool:
- Implemented the screenshot tool:
  - can capture the full page
  - avoids the scroll bar
- Will work on generating PNG file names automatically tomorrow
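One common way to get a full-page capture without a scroll bar is to resize a headless browser window to the page's scroll height before taking the screenshot. The sketch below shows that idea only and is not necessarily how the tool here works. `capture_full_page` is a hypothetical name, `driver` is assumed to be a Selenium WebDriver for Chrome started in headless mode, and the default width is arbitrary.

```python
def capture_full_page(driver, url, png_path, width=1920):
    """Resize the browser window to the page height, then save a PNG."""
    driver.get(url)
    # Ask the page how tall its content is
    height = driver.execute_script(
        "return Math.max(document.body.scrollHeight,"
        " document.documentElement.scrollHeight);"
    )
    driver.set_window_size(width, height)
    # save_screenshot returns True on success, False on an I/O error
    return driver.save_screenshot(png_path)
```

In a non-headless window the size is limited by the display, so this approach generally needs headless mode. Pages that load content lazily or size elements relative to the viewport may still not be captured completely.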
4/16/2019
- Documentation on wiki
4/17/2019
- Documentation on wiki
- Implemented the screenshot tool:
  - reads input from a text file
  - auto-names the PNG files
  (still need to test run the code)
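Reading the URL list and naming the PNG files automatically could look like the sketch below. The naming rule (host plus path, with other characters replaced by underscores) and the function names `png_name_for` and `read_urls` are assumptions for illustration, not the tool's actual behavior.

```python
import re
from urllib.parse import urlparse


def png_name_for(url):
    """Build a filesystem-safe PNG file name from a URL's host and path."""
    parts = urlparse(url)
    raw = parts.netloc + parts.path
    safe = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return (safe or "page") + ".png"


def read_urls(path):
    """Read one URL per line from a text file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

With this rule, https://bunkerlabs.org/ maps to bunkerlabs_org.png. The query string is ignored, so two URLs that differ only in their query would map to the same file name.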
4/18/2019
- Test run of the screenshot tool:
  - can’t take a full screenshot of some websites
  - WebDriverException: invalid session id occurred during the iteration (solved by not closing the driver each time)
- Test run of the site map:
  - BFS takes much more time than DFS when the depth is large (will try to fix this later)
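The invalid session id fix noted above (not closing the driver on each iteration) can be sketched as one browser session shared across all URLs and closed once at the end. `capture_all`, `make_driver`, and `capture` are hypothetical names; `make_driver` is assumed to return a Selenium WebDriver and `capture` to take a screenshot of one URL with it.

```python
def capture_all(urls, make_driver, capture):
    """Open one browser session, capture every URL, and quit once at the end."""
    driver = make_driver()
    failed = []
    try:
        for url in urls:
            try:
                capture(driver, url)
            except Exception as error:
                # Record the failure and keep going so one bad page
                # does not end the whole run
                failed.append((url, str(error)))
    finally:
        # Quitting inside the loop would invalidate the session
        # for every later iteration
        driver.quit()
    return failed
```

The returned list of failed URLs gives a starting point for checking the sites whose screenshots did not work.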
4/22/2019
- Trying to figure out why the full screenshot does not work for some websites:
  - e.g. https://bunkerlabs.org/
  - tried getting the scroll height before running the headless browser (did not work)
  - trying a different package, ‘splinter’:
    https://splinter.readthedocs.io/en/latest/screenshot.html