Latest revision as of 12:47, 21 September 2020
Listing Page Plugin Spec | |
---|---|
Project Information | |
Has title | Listing Page Plugin Spec |
Has owner | Rex Bone |
Has start date | |
Has deadline date | |
Has project status | Active |
Has sponsor | Kauffman Incubator Project |
Has project output | Tool |
Copyright © 2019 edegan.com. All Rights Reserved. |
Plugin Overview
Incubator and accelerator websites follow no standard layout, which raises the question of whether extracting information from them can be automated. A browser plugin guided by the user could serve as a first step towards fully mechanizing the process. See LP_Extractor_Protocol for a comprehensive introduction to potential methods.
The focus of this design is a tool that lets the user quickly identify the relevant HTML markup on a webpage and then reduces that markup to a DSL expression usable for data extraction. Several input options will be considered: letting the user visually 'draw' a grid, either by dragging or by marking vertices, and selecting by mouse-over. Attention will be given to potentially viable technical resources as well as to usability.
Current list of sites to examine: E:\projects\accelerators\The File to Rule Them All.xlsx
(E:\projects\Kauffman Incubator Project\02 Identify the client listing page\Listing Page Classifier)
Sample Webpage: [Figure: image taken from 500kobe.com]
Technical Specifications
HTML Layout Variations
HTML tree structure differs by site and web developer preference. A look at examples of accelerator websites reveals the following methods of organizing company data:
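- div tag, class parameter: each company entry sits in a div that shares a class value. In the example examined, each "views-row" div represents a starting extraction point.
In certain cases, the site's existing style guide, such as the class names it already applies to each entry, may be used to simplify extraction.
If the user highlights one company's section, its label (for example, the class of the enclosing div) can be extracted and then used to locate the remaining start-ups.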
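The highlight-then-generalize idea can be sketched in Python using only the standard library. This is a minimal illustration, not the project's implementation: the names `ClassRowExtractor`, `extract_rows`, and `SAMPLE_HTML` are hypothetical, the sample markup and company names are invented, and the label is assumed to be a class name such as "views-row" taken from the section the user highlighted.

```python
from html.parser import HTMLParser


class ClassRowExtractor(HTMLParser):
    """Collect the text of every <div> whose class list contains `label`."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.rows = []      # one list of text fragments per matching div
        self._depth = 0     # div nesting depth inside the current match (0 = outside)

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # A nested div inside a matching row: track it so we know when the row ends.
            self._depth += 1
        elif self.label in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.rows[-1].append(data.strip())


def extract_rows(html, label):
    """Return the text fragments of each div carrying the class `label`."""
    parser = ClassRowExtractor(label)
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented markup standing in for a listing page (assumption, not a real site).
SAMPLE_HTML = """
<div class="view-content">
  <div class="views-row views-row-1"><h3>Acme Robotics</h3><div class="desc">Warehouse automation</div></div>
  <div class="views-row views-row-2"><h3>Beta Health</h3><div class="desc">Remote diagnostics</div></div>
</div>
"""

rows = extract_rows(SAMPLE_HTML, "views-row")
```

Here `rows` holds one list of text fragments per company. A real plugin would more likely read the label from the live DOM in the browser; the parser above only shows how a single label can locate every remaining entry.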
Browser Choice
- Firefox
- Chrome
- Internet Explorer
Programming Language & Frameworks
- Python
- Node.js
User Input Styles
- Drag + Drop
- Marking Vertices
- Mouse-Over
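For the drag and marked-vertices styles, one possible way to turn the user's input into a selection is to compute the bounding rectangle of the marked points and keep the page elements that fall entirely inside it. The sketch below assumes the element rectangles have already been measured in page coordinates (a browser would supply these); `Box`, `box_from_vertices`, `elements_in_selection`, and the element names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other):
        """True if `other` lies entirely inside this box."""
        return (self.left <= other.left and self.top <= other.top
                and other.right <= self.right and other.bottom <= self.bottom)


def box_from_vertices(vertices):
    """Bounding rectangle of the (x, y) points the user marked or dragged through."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return Box(min(xs), min(ys), max(xs), max(ys))


def elements_in_selection(selection, element_boxes):
    """Names of the elements whose rectangle is fully inside the selection."""
    return [name for name, box in element_boxes.items() if selection.contains(box)]


# Invented layout: two company cards and a footer (assumption for illustration).
ELEMENT_BOXES = {
    "card-1": Box(10, 10, 140, 190),
    "card-2": Box(160, 10, 290, 190),
    "footer": Box(0, 400, 300, 450),
}

selection = box_from_vertices([(0, 0), (300, 0), (300, 200), (0, 200)])
selected = elements_in_selection(selection, ELEMENT_BOXES)
```

Requiring full containment is a design choice made here for simplicity; a tool that tolerates imprecise drawing might instead accept elements that overlap the selection by some fraction.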
Current Problems
- "Infinite Scroll" webpages: It may be impossible to fully account for incubator websites that display company lists in an infinite-scroll style, because further companies load only as the user scrolls. Handling such pages would require multiple instances of user input.
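If the user does supply input several times on an infinite-scroll page, the successive extraction passes are likely to overlap. A minimal sketch of merging them, assuming each pass yields a list of company names and that names differing only in case or surrounding whitespace refer to the same company (`merge_batches` and the sample names are hypothetical):

```python
def merge_batches(batches):
    """Merge repeated extraction passes, keeping the first spelling of each company."""
    seen = set()
    merged = []
    for batch in batches:
        for name in batch:
            key = name.strip().casefold()   # comparison key; assumption about equality
            if key and key not in seen:
                seen.add(key)
                merged.append(name.strip())
    return merged


# Three overlapping passes, as might result from three rounds of user input.
passes = [
    ["Acme Robotics", "Beta Health"],
    ["beta health ", "Gamma Foods"],
    ["Gamma Foods", "Delta Labs"],
]
companies = merge_batches(passes)
```

This only addresses combining the results; it does not remove the need for the user to trigger each pass.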