Difference between revisions of "Listing Page Classifier Progress"

Revision as of 17:07, 17 April 2019

Summary

This page records progress on the Listing Page Classifier project.

Progress Log

3/28/2019

Assigned Tasks:

Build a site map generator: output every internal link of the input websites
Build a generator that captures a screenshot of each individual web page
Build a CNN classifier using Python and TensorFlow

Suggested Approaches:

beautifulsoup Python package. Articles for future reference:

https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html

selenium Python package

Worked on the site map first; wrote the web-scraping script

4/1/2019

Site map:

Some href values do not include the home page URL (e.g. /careers)
Updated urlcrawler.py (still has trouble identifying internal links that do not start with "/"); will work on this part tomorrow
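One way to handle both kinds of relative href (with and without a leading "/") is to resolve every href against the URL of the page it came from and then compare hosts. The sketch below is an illustration, not the contents of urlcrawler.py: the names `internal_links` and `_HrefCollector` are hypothetical, and it uses the standard-library `html.parser` in place of BeautifulSoup only to stay dependency-free.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _HrefCollector(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(home_page, page_url, html):
    """Return the absolute internal links found in html, in page order."""
    collector = _HrefCollector()
    collector.feed(html)
    home_host = urlparse(home_page).netloc
    found = []
    for href in collector.hrefs:
        # urljoin resolves "/careers", "careers" and "../x" against page_url;
        # urldefrag drops any "#fragment" so the same page is not listed twice.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        # Keep only web links on the same host (this skips mailto:, tel:, etc.).
        if parsed.scheme in ("http", "https") and parsed.netloc == home_host:
            if absolute not in found:
                found.append(absolute)
    return found
```

Comparing `netloc` exactly treats `www.example.com` and `example.com` as different hosts; whether that is the desired behavior depends on the sites being crawled.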

4/2/2019

Site map:

Solved yesterday's second item (internal links that do not start with "/")
Recursing to get internal links from a page causes an HTTPError on some websites (should set up a depth constraint; will work on this tomorrow)
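A depth constraint for the recursion could look roughly like the sketch below. It is an assumption about the shape of the fix, not the project's code: `crawl` and the `get_links` callable (which returns the internal links of a page and may raise `HTTPError`) are hypothetical names, and the link fetcher is passed in so the example runs without network access.

```python
from urllib.error import HTTPError


def crawl(url, get_links, max_depth, seen=None):
    """Collect links by recursing at most max_depth levels deep from url."""
    if seen is None:
        seen = set()
    seen.add(url)
    if max_depth == 0:
        return seen  # depth budget used up: record the page, do not expand it
    try:
        links = get_links(url)
    except HTTPError:
        return seen  # skip pages that fail to load instead of aborting the crawl
    for link in links:
        if link not in seen:
            crawl(link, get_links, max_depth - 1, seen)
    return seen
```

Because the `seen` set is shared, a page first reached near the depth limit is not expanded again if it later turns up on a shorter path, so the depth here depends on the order in which links are followed.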

4/3/2019

Site map:

Find similar work done for the McNair project
Clean up my own code + figure out the depth constraint

4/4/2019

Site map (BFS approach is DONE):

Test-run a couple of sites to see if there are edge cases that I missed
Implement the BFS code: try to output the result to a txt file
Will work on DFS approach next week
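For reference, a breadth-first site map with a depth limit and text-file output can be sketched as follows. The function names and the `get_links` callable are hypothetical; the actual script may be organized differently.

```python
from collections import deque


def bfs_site_map(start_url, get_links, max_depth):
    """Return URLs in breadth-first order, at most max_depth hops from start_url."""
    order = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are listed but not expanded
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


def write_site_map(urls, path):
    """Write one URL per line to a text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
```

`collections.deque` gives constant-time pops from the left, which a plain list used as a queue does not.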

4/8/2019

Site map:

Work on DFS approach (stuck on the anchor tag, will work on this part tomorrow)
Suggestion: may be able to improve performance by using a queue

4/9/2019

Something went wrong with my DFS algorithm (it keeps outputting None as the result); will continue working on this tomorrow

4/10/2019

Finished DFS method
Compared the two methods: DFS is 2-6 seconds faster; theoretically, both methods should take O(n) time
Test-ran several websites
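To illustrate why both traversals are expected to be linear, here is a minimal depth-first sketch with an explicit stack (hypothetical names, not the finished DFS method). Like a breadth-first crawl, it requests the links of each page exactly once; only the visiting order differs.

```python
def dfs_site_map(start_url, get_links):
    """Return URLs in depth-first (preorder) order using an explicit stack."""
    order = []
    seen = set()
    stack = [start_url]
    while stack:
        url = stack.pop()
        if url in seen:
            continue  # the same URL can be pushed more than once
        seen.add(url)
        order.append(url)
        # Push in reverse so the first link on the page is explored first.
        for link in reversed(get_links(url)):
            if link not in seen:
                stack.append(link)
    return order
```

Since the per-page work is the same in both approaches, a difference of a few seconds in wall-clock time is more likely to come from network latency and visiting order than from the algorithms themselves.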

4/11/2019

Screenshot tool:

Reference for using the selenium package to generate a full-page screenshot:

http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html

Set up a script that captures a partial screenshot of a website; will work on capturing a full-page screenshot next week
- Downloaded ChromeDriver for Win32

4/15/2019

Screenshot tool:

Implement the screenshot tool
- can capture the full screen
- avoids scroll bar
Will work on generating PNG file names automatically tomorrow
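One common way to get a full-page capture without scroll bars is to run Chrome headless and resize the window to the size of the rendered document before taking the screenshot. The sketch below assumes that approach; it is not necessarily how the tool above works, and `capture_full_page`, `clamp`, `PAGE_SIZE_JS` and the size limits are hypothetical. It needs the selenium package and a ChromeDriver on the PATH.

```python
# JavaScript that measures the full rendered page, not just the visible viewport.
PAGE_SIZE_JS = (
    "return [Math.max(document.body.scrollWidth, document.documentElement.scrollWidth),"
    " Math.max(document.body.scrollHeight, document.documentElement.scrollHeight)];"
)


def clamp(value, low, high):
    """Keep a window dimension inside a sane range."""
    return max(low, min(value, high))


def capture_full_page(url, png_path, max_height=20000):
    """Save a full-page PNG of url using headless Chrome; returns True on success."""
    # Imported here so the module still loads where selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")         # no visible browser window
    options.add_argument("--hide-scrollbars")  # keep scroll bars out of the image
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        width, height = driver.execute_script(PAGE_SIZE_JS)
        # Grow the window to the page size so one screenshot covers everything.
        driver.set_window_size(clamp(width, 800, 4000), clamp(height, 600, max_height))
        return driver.save_screenshot(png_path)
    finally:
        driver.quit()  # always release the browser process
```

Pages that load content lazily while scrolling may still come out incomplete with this approach.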

4/16/2019

Documentation on wiki

4/17/2019

Documentation on wiki
Implemented the screenshot tool:
- reads input from a text file
- auto-names the PNG file

(still need to test-run the code)
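The two additions above could be sketched like this. The naming scheme (host plus path, with unsafe characters replaced by underscores) and the function names are assumptions made for illustration, not the scheme the tool uses.

```python
import re
from urllib.parse import urlparse


def png_name_for(url):
    """Build a filesystem-safe PNG file name from a URL."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Replace every run of characters outside [A-Za-z0-9.-] with one underscore.
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return (safe or "page") + ".png"


def read_urls(path):
    """Return the non-empty, stripped lines of a text file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

This scheme ignores query strings, so two URLs that differ only in their query would map to the same file name.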

