Listing Page Classifier Progress
Revision as of 15:03, 8 April 2019
This page records the progress on the Listing Page Classifier Project.
3/28/2019
Assigned Tasks:
- Build a site map generator: output every internal link of the input websites
- Build a generator that captures screenshots of individual web pages
- Build a CNN classifier using Python and TensorFlow
Suggested Approaches:
- beautifulsoup Python package. Articles for future reference:
https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
- selenium Python package
Work on the site map first:
- Python script to scrape URL links from a web page (saved as urlcrawler.py)
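A minimal sketch of the link-scraping step, not the actual contents of urlcrawler.py. To keep it dependency-free it uses the standard library's html.parser instead of the beautifulsoup package suggested above; the names LinkCollector and extract_links and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return all href values found in the HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical sample page; a real run would pass in the downloaded HTML.
sample = (
    '<a href="/careers">Careers</a> '
    '<a href="https://example.com/about">About</a> '
    '<a name="top">no href</a>'
)
print(extract_links(sample))
```

The hrefs come back exactly as written in the page, so relative ones such as /careers still have to be resolved against the page URL afterwards.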
4/1/2019
Site map:
- Some href values may not include the home page URL, e.g. /careers
- Updated urlcrawler.py (having issues identifying internal links that do not start with "/"); will work on this part tomorrow
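One possible way to handle both cases, sketched with the standard library's urllib.parse: resolve every href against the URL of the page it was found on, then compare hostnames. The function names and example.com URLs are assumptions for illustration, not the code in urlcrawler.py.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(page_url, href):
    """Resolve an href (absolute, root-relative, or page-relative) against page_url."""
    return urljoin(page_url, href)


def is_internal(home_url, absolute_url):
    """True if absolute_url is an http(s) link on the same host as home_url.

    Assumption: hosts must match exactly, so www.example.com and
    example.com count as different sites in this sketch.
    """
    home = urlparse(home_url)
    target = urlparse(absolute_url)
    return target.scheme in ("http", "https") and target.netloc == home.netloc


def internal_links(home_url, page_url, hrefs):
    """Return the absolute form of every internal href, in order."""
    result = []
    for href in hrefs:
        absolute = to_absolute(page_url, href)
        if is_internal(home_url, absolute):
            result.append(absolute)
    return result


home = "https://example.com/"
page = "https://example.com/about/"
hrefs = ["/careers", "team.html", "https://other.org/x", "mailto:hr@example.com"]
print(internal_links(home, page, hrefs))
```

Resolving first and classifying second means a link like team.html, which does not start with "/", is treated the same way as /careers.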
4/2/2019
Site map:
- Solved the second bullet point from yesterday (internal links that do not start with "/")
- Recursion to get internal links from a page causes HTTPError on some websites (should set up a depth constraint; will work on this tomorrow)
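A rough sketch of what a depth constraint on the recursion could look like. The get_links parameter, the in-memory SITE dictionary, and the choice to skip pages that raise an error are all assumptions for illustration; a real crawler would fetch and parse each page instead.

```python
from urllib.error import HTTPError


def crawl(url, get_links, max_depth, visited=None, depth=0):
    """Recursively collect pages reachable from url within max_depth link hops."""
    if visited is None:
        visited = set()
    if url in visited or depth > max_depth:
        return visited
    # Mark the page before fetching so a failing page is not retried.
    visited.add(url)
    try:
        links = get_links(url)
    except OSError:
        # HTTPError and URLError are subclasses of OSError; skip pages that fail.
        return visited
    for link in links:
        crawl(link, get_links, max_depth, visited, depth + 1)
    return visited


# Hypothetical site used in place of real HTTP requests.
SITE = {
    "/": ["/a", "/broken"],
    "/a": ["/b", "/"],
    "/b": ["/c"],
    "/c": [],
}


def fake_get_links(url):
    if url == "/broken":
        raise HTTPError(url, 404, "Not Found", None, None)
    return SITE[url]


print(sorted(crawl("/", fake_get_links, max_depth=2)))
```

One caveat of combining recursion with a shared visited set: a page first reached along a long path is not re-expanded when a shorter path to it turns up later, so the depth limit is approximate.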
4/3/2019
Site map:
- Find similar work done for the McNair project
- Clean up my own code and figure out the depth constraint
4/4/2019
Site map (BFS approach is DONE):
- Test-run a couple of sites to see if there are edge cases that I missed
- Implement the BFS code: try to output the result to a txt file
- Will work on DFS approach next week
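For reference, a minimal sketch of a breadth-first site map with a depth limit, written to a txt file one URL per line. It is an illustration under assumptions, not the project's BFS code: get_links, the SITE dictionary, and the file name site_map.txt are hypothetical.

```python
import os
import tempfile
from collections import deque


def bfs_site_map(start_url, get_links, max_depth):
    """Return pages in breadth-first order, at most max_depth hops from start_url."""
    visited = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand pages sitting at the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


def write_site_map(urls, path):
    """Write one URL per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")


# Hypothetical site used in place of real HTTP requests.
SITE = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/"], "/c": ["/d"], "/d": []}

pages = bfs_site_map("/", lambda u: SITE.get(u, []), max_depth=2)
out_path = os.path.join(tempfile.mkdtemp(), "site_map.txt")
write_site_map(pages, out_path)
print(pages)
```

Because each page is marked as visited when it is first queued, every page is recorded at its shortest distance from the start page, which makes the depth limit exact for BFS.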
4/8/2019
Site map:
- Work on the DFS approach (stuck on the anchor tag; will work on this part tomorrow)
- Suggestion: may be able to improve performance by using a queue
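A possible shape for the DFS approach, using an explicit stack instead of recursion. The log does not say what the anchor-tag problem was; this sketch assumes it involves links that differ only by a "#fragment" and strips the fragment with urllib.parse.urldefrag so such links are not treated as new pages. The get_links callable and the example.com URLs are hypothetical.

```python
from urllib.parse import urldefrag


def dfs_site_map(start_url, get_links):
    """Return pages in depth-first order, treating page#fragment as page."""
    visited = set()
    order = []
    stack = [start_url]
    while stack:
        # urldefrag splits "page#section" into ("page", "section").
        url, _fragment = urldefrag(stack.pop())
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Push in reverse so the first link on the page is explored first.
        stack.extend(reversed(get_links(url)))
    return order


# Hypothetical site used in place of real HTTP requests.
BASE = "https://example.com/"
SITE = {
    BASE: [BASE + "a", BASE + "#top", BASE + "b"],
    BASE + "a": [BASE + "c", BASE + "a#team"],
    BASE + "b": [],
    BASE + "c": [],
}

print(dfs_site_map(BASE, lambda u: SITE.get(u, [])))
```

On the queue suggestion: if the crawl keeps pending URLs in a list and removes them from the front, collections.deque is likely the better container, since deque.popleft() runs in constant time while list.pop(0) shifts every remaining element.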