Listing Page Classifier Progress
Summary
This page records the progress on the Listing Page Classifier Project
Progress Log
3/28/2019
Assigned Tasks:
- Build a site map generator: output every internal link of input websites
- Build a generator that captures screenshots of individual web pages
- Build a CNN classifier using Python and TensorFlow
Suggested Approaches:
- beautifulsoup Python package. Articles for future reference:
https://www.portent.com/blog/random/python-sitemap-crawler-1.htm
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html
- selenium Python package
Worked on the site map first and wrote the web-scraping script; a minimal sketch of the link-extraction step is below.
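A rough sketch of that first scraping step, assuming requests and beautifulsoup4 are installed; the site URL is a placeholder:

 # Minimal link-extraction sketch: fetch one page and collect every href.
 # Filtering to internal links and crawling come later.
 import requests
 from bs4 import BeautifulSoup

 home_page = "https://www.example.com"  # hypothetical input site

 html = requests.get(home_page, timeout=10).text
 soup = BeautifulSoup(html, "html.parser")
 links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
 print(links)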
4/1/2019
Site map:
- Some href values may not include the home page URL, e.g. /careers
- Updated urlcrawler.py (having issues identifying internal links that do not start with "/") <- will work on this part tomorrow; a sketch of the normalization idea follows this list
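One way to handle both issues above, using only the standard library: resolve every raw href against the home page and keep it if it stays on the same domain. The URLs are placeholders:

 # urljoin resolves "/careers", "careers.html", and absolute URLs alike.
 from urllib.parse import urljoin, urlparse

 home_page = "https://www.example.com"  # hypothetical input site

 def normalize(href):
     """Resolve a raw href and report whether it is internal."""
     full = urljoin(home_page, href)
     internal = urlparse(full).netloc == urlparse(home_page).netloc
     return internal, full

 print(normalize("/careers"))               # (True, 'https://www.example.com/careers')
 print(normalize("https://twitter.com/x"))  # (False, 'https://twitter.com/x')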
4/2/2019
Site map:
- Solved the second bullet point from yesterday
- Recursing to get internal links from a page causes an HTTPError on some websites (should set up a depth constraint - WILL WORK ON THIS TOMORROW)
4/3/2019
Site map:
- Find similar work done for the McNair project
- Clean up my own code + figure out the depth constraint
4/4/2019
Site map (BFS approach is DONE):
- Test ran a couple of sites to check for edge cases I missed
- Implemented the BFS code: outputs the result to a txt file (a sketch follows this list)
- Will work on the DFS approach next week
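A minimal sketch of the BFS crawl with the depth constraint, writing one URL per line to a txt file; get_internal_links() is a hypothetical stand-in for the scraping step shown earlier:

 # Breadth-first crawl of one site, bounded by max_depth.
 def bfs_site_map(home_page, max_depth, out_path):
     visited = {home_page}
     queue = [(home_page, 0)]              # (url, depth) pairs
     while queue:
         url, depth = queue.pop(0)         # plain list as a queue (see 4/25)
         if depth >= max_depth:
             continue
         for link in get_internal_links(url):  # hypothetical helper
             if link not in visited:
                 visited.add(link)
                 queue.append((link, depth + 1))
     with open(out_path, "w") as f:
         f.write("\n".join(sorted(visited)))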
4/8/2019
Site map:
- Work on the DFS approach (stuck on the anchor tag, will work on this part tomorrow)
- Suggestion: may be able to improve performance by using a queue
4/9/2019
- Something went wrong with my DFS algorithm (it keeps outputting None as the result); will continue working on this tomorrow
4/10/2019
- Finished the DFS method
- Compared the two methods: DFS is 2-6 seconds faster, though theoretically both should take O(n)
- Test ran several websites
4/11/2019
Screenshot tool:
- Reference on using the selenium package to generate a full-page screenshot:
http://seleniumpythonqa.blogspot.com/2015/08/generate-full-page-screenshot-in-chrome.html
- Set up a script that can capture a partial screenshot of a website (sketched below); will work on how to get a full screenshot next week
  - Downloaded the Chromedriver for Win32
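The partial (viewport-only) version is only a few lines with selenium 3 and the Win32 chromedriver; URLs and paths here are placeholders:

 # Viewport-only screenshot: captures whatever is visible without scrolling.
 from selenium import webdriver

 driver = webdriver.Chrome("chromedriver.exe")  # path to the Win32 driver
 driver.get("https://www.example.com")
 driver.save_screenshot("example.png")          # viewport only, not the full page
 driver.quit()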
4/15/2019
Screenshot tool:
- Implemented the screenshot tool (sketch below)
  - can capture the full screen
  - avoids the scroll bar
- will work on generating the png file name automatically tomorrow
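A sketch of the full-page technique from the blog post linked on 4/11: run Chrome headless (so no scroll bar is rendered), read the page's scroll dimensions, and resize the window to match before saving. Assumes selenium 3; the URL and paths are placeholders:

 from selenium import webdriver

 options = webdriver.ChromeOptions()
 options.add_argument("--headless")       # headless Chrome draws no scroll bar
 driver = webdriver.Chrome("chromedriver.exe", options=options)
 driver.get("https://www.example.com")

 # Grow the window to the full document size, then screenshot the viewport.
 width = driver.execute_script("return document.body.scrollWidth")
 height = driver.execute_script("return document.body.scrollHeight")
 driver.set_window_size(width, height)
 driver.save_screenshot("example_full.png")
 driver.quit()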
4/16/2019
- Documentation on wiki
4/17/2019
- Documentation on wiki
- Implemented the screenshot tool:
  - reads input from a text file
  - auto-names the png file (naming sketch below)
(still need to test-run the code)
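A hypothetical sketch of the auto-naming step: read URLs from a text file, one per line, and derive a filesystem-safe png name from each; the input filename is a placeholder:

 import re

 def png_name(url):
     """e.g. 'https://bunkerlabs.org/about' -> 'bunkerlabs_org_about.png'"""
     stem = re.sub(r"^https?://", "", url.strip())
     stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_")
     return stem + ".png"

 with open("urls.txt") as f:  # placeholder input file
     for url in f:
         if url.strip():
             print(png_name(url))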
4/18/2019
- Test ran the screenshot tool
  - can't take a full screenshot of some websites
  - WebDriverException: invalid session id occurred during the iteration (solved it by not closing the driver each time)
- Test ran the site map
  - BFS takes much more time than DFS when the depth is big (will look into this later)
4/22/2019
- Trying to figure out why the full screenshot does not work for some websites:
  - e.g. https://bunkerlabs.org/
  - get the scroll height before launching the headless browser (nope, doesn't work)
  - try out a different package, 'splinter':
https://splinter.readthedocs.io/en/latest/screenshot.html
4/23/2019
- Implemented the new screenshot tool (splinter package), sketched below:
  - reads all text files from one directory and takes a screenshot of each url in each of those text files
  - filename modification (e.g. test7z_0i96__.png; the file name is auto-generated)
- Documentation on wiki
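A sketch of the splinter version, per the docs linked on 4/22: iterate over every text file in a directory and save a full-page screenshot per URL. splinter appends random characters to the name prefix, which is where names like test7z_0i96__.png come from; the directory layout is an assumption:

 import glob
 from splinter import Browser

 browser = Browser("chrome", headless=True)
 for txt_file in glob.glob("url_lists/*.txt"):  # one text file per site
     with open(txt_file) as f:
         for url in f:
             if url.strip():
                 browser.visit(url.strip())
                 # full=True captures the whole page, not just the viewport
                 browser.screenshot(name="test", full=True)
 browser.quit()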
4/24/2019
- Documentation on wiki
- Went back to the time complexity issue with BFS and DFS
  - the DFS algorithm has flaws!! (it does not visit all nodes, which is why DFS is much faster)
  - need to look into the problem with the DFS tomorrow
4/25/2019
Site map:
- Recursive DFS will not work for this type of problem, and rewriting it iteratively makes it similar to the BFS approach, so I decided to keep only the BFS since it is working just fine.
- Implemented the BFS algorithm: trying out deque etc. to see if it runs faster (see the note below)
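The deque experiment in brief: list.pop(0) shifts every remaining element, so it is O(n) per dequeue, while collections.deque.popleft() is O(1); swapping the queue type speeds up the BFS loop without changing its logic:

 from collections import deque

 queue = deque([("https://www.example.com", 0)])  # placeholder seed
 url, depth = queue.popleft()  # O(1), vs. queue.pop(0) on a plain list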
4/29/2019
- Image processing work assigned
- Documentation on wiki
4/30/19
Image Processing:
- Researched 3 packages for setting up a CNN
  - Comparison between the 3 packages: https://kite.com/blog/python/python-machine-learning-libraries
    - Scikit-learn: good for small datasets and easy to use, but does not support GPU computation
    - PyTorch: easy to code, so it has a flatter learning curve; supports dynamic graphs, so you can adjust on the go; supports GPU acceleration
    - TensorFlow: flexible; contains several ready-to-use ML models and ready-to-run application packages; scales with hardware and software; large online community; supports only NVIDIA GPUs; a slightly steep learning curve
- Initiated the idea of data preprocessing: create a proper input dataset for the CNN model
5/2/2019
- Work on data preprocessing
5/6/2019
- Keep working on data preprocessing
- Generate screenshots
5/7/2019
- Some issues occurred during screenshot generation (will work on this more tomorrow)
- Tried to set up the CNN model
  - https://www.datacamp.com/community/tutorials/cnn-tensorflow-python
5/8/2019
- Fixed the screenshot tool by switching to Firefox
- Data preprocessing
5/12/2019
- Finished image data preprocessing
5/13/2019
- Set up the initial CNN model using Keras (a sketch with the fix applied is below)
  - issue: Keras freezes on the last batch of the first epoch; make sure of the following:

 steps_per_epoch = number of train samples // batch_size
 validation_steps = number of validation samples // batch_size
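A minimal sketch of the Keras setup with that fix applied; the image size, batch size, and directory name are assumptions, not the project's actual values:

 from tensorflow.keras import layers, models
 from tensorflow.keras.preprocessing.image import ImageDataGenerator

 batch_size = 32
 gen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)
 train = gen.flow_from_directory("screenshots", target_size=(224, 224),
                                 batch_size=batch_size, subset="training")
 val = gen.flow_from_directory("screenshots", target_size=(224, 224),
                               batch_size=batch_size, subset="validation")

 model = models.Sequential([
     layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
     layers.MaxPooling2D(),
     layers.Conv2D(64, 3, activation="relu"),
     layers.MaxPooling2D(),
     layers.Flatten(),
     layers.Dense(64, activation="relu"),
     layers.Dense(train.num_classes, activation="softmax"),
 ])
 model.compile(optimizer="adam", loss="categorical_crossentropy",
               metrics=["accuracy"])

 # The fix: explicit step counts keep the generators from running past the
 # end of an epoch and hanging on the last batch.
 model.fit_generator(train,
                     steps_per_epoch=train.samples // batch_size,
                     validation_data=val,
                     validation_steps=val.samples // batch_size,
                     epochs=5)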
5/14/2019
- Implemented the CNN model
- Worked on some changes in the data preprocessing part (image data)
  - place the class label in the image filename (sketched below)
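A hypothetical sketch of the filename-label scheme: prefix each png with its class, then recover (path, label) pairs when building the dataset. The naming convention is an assumption for illustration:

 import os

 def label_filename(png_name, label):
     """e.g. ('bunkerlabs_org.png', 1) -> '1_bunkerlabs_org.png'"""
     return f"{label}_{png_name}"

 def read_label(filename):
     """Recover the class label from a name like '1_bunkerlabs_org.png'."""
     return int(os.path.basename(filename).split("_", 1)[0])

 assert read_label(label_filename("bunkerlabs_org.png", 1)) == 1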
5/15/2019
- Corrected some out-of-date data in The File to Rule Them ALL.csv; the new file is saved as The File to Rule Them ALL_NEW.csv
- Implemented generate_dataset.py and the sitemap tool
  - regenerated the dataset using the updated data and tools
5/16/2019
- Implementation of the CNN
- Some problems to consider:
  - some websites have more than 1 cohort page: a list of cohorts for each year
  - the class labels are highly imbalanced (one mitigation from the article below is sketched after it):
https://towardsdatascience.com/deep-learning-unbalanced-training-data-solve-it-like-this-6c528e9efea6
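One mitigation discussed in that article, sketched for the Keras setup above: weight the loss so the rare cohort-page class counts more per example. The class counts here are made up for illustration:

 n_cohort, n_other = 120, 4000   # hypothetical class counts
 total = n_cohort + n_other
 class_weight = {
     0: total / (2 * n_other),   # majority class: weight < 1
     1: total / (2 * n_cohort),  # minority class: weight > 1
 }
 # model.fit_generator(..., class_weight=class_weight)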
5/17/2019
- Have to go back to the old plan of separating the image data :(
- Documentation on wiki
- Test run on the GPU server