Difference between revisions of "Listing Page Classifier"
Jump to navigation
Jump to search
Line 4: | Line 4: | ||
|Has project status=Active | |Has project status=Active | ||
}} | }} | ||
+ | |||
+ | == Text Processing== | ||
+ | |||
+ | There are two possible classification methods for the processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency – Inverse Document Frequency to do basic natural language processing and select words or phrases which have discriminant capabilities. The second is a Word2Vec approach which uses shallow 2 layer neural networks to reduce descriptions to a vector with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.) | ||
== Main Tasks == | == Main Tasks == |
Revision as of 13:50, 30 March 2019
Listing Page Classifier | |
---|---|
Project Information | |
Has title | Listing Page Classifier |
Has owner | Nancy Yu |
Has start date | |
Has deadline date | |
Has project status | Active |
Copyright © 2019 edegan.com. All Rights Reserved. |
Text Processing
There are two possible classification methods for the processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency – Inverse Document Frequency to do basic natural language processing and select words or phrases which have discriminant capabilities. The second is a Word2Vec approach which uses shallow 2 layer neural networks to reduce descriptions to a vector with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
Main Tasks
- Build a site map generator: output every internal links of input websites
- Build a generator that captures screenshot of individual web pages
- Build a CNN classifier using Python and TensorFlow
Approaches (IN PROGRESS)
- URL Crawler
E:\projects\listing page identifier\urlcrawler.py