Listing Page Classifier
Listing Page Classifier | |
---|---|
Project Information | |
Has title | Listing Page Classifier |
Has owner | Nancy Yu |
Has start date | |
Has deadline date | |
Has project status | Active |
Copyright © 2019 edegan.com. All Rights Reserved. |
Text Processing
There are two possible methods for processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) to do basic natural language processing and select words or phrases with discriminative power. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to a vector with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
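To make the "Bag of Words" weighting concrete, here is a minimal sketch of TF-IDF in plain Python. It is an illustration only, not the project's code: the function names `tokenize` and `tf_idf` are hypothetical, and the formula assumed is the basic one (relative term frequency multiplied by log(N / document frequency)), with no smoothing.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per input document.

    Assumed weighting: (count / document length) * log(N / df),
    where df is the number of documents containing the term.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty pages
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document gets weight zero, so words shared by all pages drop out, while words concentrated in a few pages are the candidates with discriminative power.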
Main Tasks
- Build a site map generator: output every internal link of the input websites
- Build a generator that captures screenshots of individual web pages
- Build a CNN classifier using Python and TensorFlow
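The first task, collecting internal links, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the contents of the project's crawler: `LinkCollector` and `internal_links` are hypothetical names, the HTML is assumed to be already downloaded, and "internal" is assumed to mean "same host as the page URL".

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return the sorted, de-duplicated internal links found in html.

    Assumption: a link is internal when it resolves to an http(s) URL
    on the same host as page_url.
    """
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc

    found = set()
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            found.add(absolute)
    return sorted(found)
```

A full site map generator would call a function like this on each fetched page and add unseen links to a queue until no new internal pages remain.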
Approaches (IN PROGRESS)
- URL Crawler
E:\projects\listing page identifier\urlcrawler.py