Difference between revisions of "LP Extractor Protocol"
LasyaRajan (talk | contribs)
Revision as of 15:52, 21 March 2019
LP Extractor Protocol | |
---|---|
Project Information | |
Has title | LP Extractor Protocol |
Has start date | |
Has deadline date | |
Has project status | Active |
Subsumed by: | Listing Page Extractor |
Copyright © 2019 edegan.com. All Rights Reserved. |
Overview of Possible Methods
According to “Project Goal V2” (E:\mcnair\Projects\Incubators), there are three proposed methods for organizing and extracting useful information from an HTML web page. The first is textual processing: analyzing the text of the HTML page through either a Word2Vec or a “Bag of Words” approach. The second is image-based pattern recognition, likely using an off-the-shelf model that can identify key HTML elements from web page screenshots. The third, and most novel, method is to analyze the HTML tree structure and express a simplified version of that structure in a Domain Specific Language (DSL).
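As a rough illustration of the first and third methods, the sketch below uses only Python's standard library to build a bag of words from a page's text nodes and to print a collapsed outline of its tag tree. This is a minimal sketch under stated assumptions, not the project's implementation: the names `PageSketch`, `to_dsl`, and `sketch` are hypothetical, and the `tag(child*N)` notation is an invented stand-in for whatever DSL the project eventually defines. Word2Vec and the screenshot-based method are not shown.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Tags that never have a closing tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PageSketch(HTMLParser):
    """Hypothetical helper: collects words and a nested tag outline."""

    def __init__(self):
        super().__init__()
        self.words = Counter()        # bag of words over all text nodes
        self.root = ("root", [])      # each node is a (tag, children) tuple
        self.stack = [self.root]      # currently open elements

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Simplification: script and style text is counted like any other text.
        self.words.update(re.findall(r"[a-z0-9]+", data.lower()))


def to_dsl(node):
    """Render a node in the invented notation, collapsing repeated siblings."""
    tag, children = node
    if not children:
        return tag
    runs = []  # [rendered_child, repeat_count] for consecutive identical children
    for child in children:
        rendered = to_dsl(child)
        if runs and runs[-1][0] == rendered:
            runs[-1][1] += 1
        else:
            runs.append([rendered, 1])
    inner = " ".join(r if n == 1 else f"{r}*{n}" for r, n in runs)
    return f"{tag}({inner})"


def sketch(html):
    """Return (bag_of_words, structure_string) for one HTML document."""
    parser = PageSketch()
    parser.feed(html)
    parser.close()
    return parser.words, to_dsl(parser.root)


if __name__ == "__main__":
    page = ("<ul><li><a href='/a'>Alpha Labs</a></li>"
            "<li><a href='/b'>Beta Labs</a></li></ul>")
    words, structure = sketch(page)
    print(words.most_common(3))
    print(structure)  # root(ul(li(a)*2))
```

Collapsing identical siblings into `li(a)*2` is one possible way to make repeated listing entries stand out in the simplified structure; the real DSL may represent repetition differently.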