LP Extractor Protocol

Project
LP Extractor Protocol
Project Information
Has title	LP Extractor Protocol
Has start date
Has deadline date
Has project status	Active
Subsumed by:	Listing Page Extractor
	Copyright © 2019 edegan.com. All Rights Reserved.

Overview of Possible Methods

According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods for organizing and extracting useful information from an HTML web page. The first is textual processing: analyzing the text of the HTML page through either a Word2Vec or a “bag of words” approach. The second is image-based pattern recognition, likely using an off-the-shelf model that can identify key HTML elements from web page screenshots. The third, and most novel, method is to analyze the structure of the HTML tree and express a simplified version of that structure in a domain-specific language (DSL).
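
As a rough illustration of the first and third methods, the sketch below counts word occurrences in the visible text of a page (a plain "bag of words") and reduces the HTML tree to an indented outline of tag names. It uses only Python's standard library. The names `PageSummary`, `summarize`, and `VOID_TAGS` are hypothetical, and the indented outline is only a stand-in for the DSL, whose actual syntax is not specified here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not deepen the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PageSummary(HTMLParser):
    """Collects a bag of words and an indented tag outline (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = Counter()   # method 1: word -> count over visible text
        self.outline = []        # method 3: one indented line per opening tag
        self._stack = []         # currently open elements

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * len(self._stack) + tag)
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Skip text that is not rendered on the page.
        if self._stack and self._stack[-1] in ("script", "style"):
            return
        self.words.update(re.findall(r"[a-z0-9]+", data.lower()))


def summarize(html):
    """Return (bag_of_words, tag_outline) for an HTML string."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser.words, parser.outline


if __name__ == "__main__":
    sample = "<html><body><ul><li>Acme Labs</li><li>Beta Labs</li></ul></body></html>"
    words, outline = summarize(sample)
    print(words.most_common(3))
    print("\n".join(outline))
```

In an outline like this, repeated sibling lines at the same depth (for example, several `li` entries under one `ul`) hint at the list-like regions a listing page extractor would look for, whereas the word counts ignore structure entirely.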

HTML Tree Structure Analysis

Supervised Learning Approach

