Changes

Listing Page Classifier (view source)

Revision as of 13:51, 30 March 2019

179 bytes removed, 13:51, 30 March 2019

no edit summary

|Has project status=Active

}}

== Text Processing ==

There are two possible classification methods for processing the text of target HTML pages. The first is a "Bag of Words" approach, which uses Term Frequency–Inverse Document Frequency (TF-IDF) weighting to do basic natural language processing and select words or phrases that have discriminant capability. The second is a Word2Vec approach, which uses shallow two-layer neural networks to reduce descriptions to vectors with high discriminant potential. (See "Memo for Evan" in E:\mcnair\Projects\Incubators for further detail.)
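As a rough illustration of the "Bag of Words" idea, the sketch below computes TF-IDF weights with the Python standard library only. It is not the project's code: the function names (tokenize, tf_idf, top_terms) and the sample page texts are hypothetical, and the weighting used here (relative term frequency times the natural log of inverse document frequency) is just one common variant.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / tokens in document) * ln(N / document frequency)
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty pages
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


def top_terms(score, k=3):
    """The k highest-weighted terms, ties broken alphabetically."""
    return sorted(score, key=lambda t: (-score[t], t))[:k]


# Hypothetical page texts, for illustration only.
pages = [
    "incubator portfolio companies list",
    "incubator contact page",
    "incubator portfolio alumni list",
]
for weights in tf_idf(pages):
    print(top_terms(weights, 2))
```

A term that occurs on every page (here "incubator") receives weight zero, which is why TF-IDF favors terms that help separate one kind of page from another.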

== Main Tasks ==

# URL Crawler

E:\projects\listing page identifier\urlcrawler.py

=== Image Processing ===

This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. An implementation could build on the VGG16 model or a ResNet architecture, adding batch normalization to improve accuracy in this context.
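To make the building blocks concrete, here is a minimal pure-Python sketch of what one convolutional layer computes: a convolution, batch normalization, and a ReLU activation. It is not VGG16 or ResNet and not the project's implementation. The function names, the toy 4x4 "screenshot", and the edge-detecting kernel are assumptions for illustration, and the batch normalization omits the learned scale and shift parameters.

```python
import math


def conv2d(image, kernel):
    """Stride-1, no-padding cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def batch_norm(feature_maps, eps=1e-5):
    """Normalize one channel across a batch of 2D feature maps.

    Uses the mean and variance of all values in the batch; the learned
    scale (gamma) and shift (beta) are left out of this sketch.
    """
    flat = [v for fm in feature_maps for row in fm for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    std = math.sqrt(var + eps)
    return [[[(v - mean) / std for v in row] for row in fm] for fm in feature_maps]


def relu(feature_map):
    """Elementwise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Toy 4x4 grayscale "screenshot" with a vertical edge (hypothetical data).
image = [[0, 0, 1, 1] for _ in range(4)]
# Kernel that responds to dark-to-light transitions from left to right.
kernel = [[-1, 1], [-1, 1]]

edges = conv2d(image, kernel)
normalized = batch_norm([edges, conv2d(image[::-1], kernel)])
activated = [relu(fm) for fm in normalized]
print(edges)
```

In a real network the kernel values are learned during training, and many such layers are stacked before a final classification layer.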

Ed
