Difference between revisions of "Listing Page Classifier"

Project
Listing Page Classifier
Project Information
Has title	Listing Page Classifier
Has owner	Nancy Yu
Has start date
Has deadline date
Has project status	Active
	Copyright © 2019 edegan.com. All Rights Reserved.

Revision as of 14:50, 30 March 2019

Text Processing

There are two candidate methods for classifying the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency – Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select the words or phrases that best discriminate between page types. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to vectors with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
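
As an illustration of the Bag of Words idea, the following is a minimal sketch of TF-IDF weighting using only the Python standard library. The function names (tokenize, tfidf), the tokenization rule, and the sample page texts are assumptions made for this example and are not taken from the project code; a real pipeline would more likely use an established library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    This is the plain, unsmoothed TF-IDF variant.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty pages
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical page texts, for illustration only.
pages = [
    "our portfolio companies and startups",
    "contact our team",
    "portfolio of startups we incubate",
]
for page, scores in zip(pages, tfidf(pages)):
    top = sorted(scores, key=scores.get, reverse=True)[:3]
    print(page, "->", top)
```

Terms that occur in every page receive a weight of zero, so only words that help separate one page from another keep a positive score.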

Main Tasks

Build a site map generator: output every internal link of each input website
Build a generator that captures a screenshot of each individual web page
Build a CNN classifier using Python and TensorFlow
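
The first task above can be sketched as follows: given the URL and HTML of one page, collect the links that stay on the same host. This is only an assumption-laden illustration built on the standard library; the names LinkCollector and internal_links are hypothetical and are not claimed to match urlcrawler.py. Fetching pages and recursing through the site are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return the sorted, de-duplicated links on the same host as page_url."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(page_url).netloc

    found = set()
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        # Keep only http(s) links whose host matches the input page.
        if parts.scheme in ("http", "https") and parts.netloc == site:
            found.add(absolute)
    return sorted(found)


# Hypothetical input, for illustration only.
sample_html = (
    '<a href="/about">About</a>'
    '<a href="team.html#top">Team</a>'
    '<a href="https://other.org/x">External</a>'
    '<a href="mailto:info@example.com">Mail</a>'
)
print(internal_links("http://example.com/dir/index.html", sample_html))
```

Comparing hosts exactly treats subdomains such as www.example.com and example.com as different sites; whether that is the desired behavior is a design choice for the crawler.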

Approaches (IN PROGRESS)

URL Crawler

E:\projects\listing page identifier\urlcrawler.py

