Difference between revisions of "Listing Page Classifier"
Revision as of 12:12, 16 April 2019
Listing Page Classifier | |
---|---|
Project Information | |
Has title | Listing Page Classifier |
Has owner | Nancy Yu |
Has start date | |
Has deadline date | |
Has project status | Active |
Copyright © 2019 edegan.com. All Rights Reserved. |
Summary
The objective of this project is to determine which web page on an incubator's website contains the client company listing.
The project will ultimately use data (incubator names and URLs) identified using the Ecosystem Organization Classifier (perhaps in conjunction with an additional website finder tool, if the Incubator Seed Data source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the U.S. Seed Accelerators project.
We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Given an incubator URL, we will find every web page on the website, generate a standardized-size screenshot of each one, label which page is the client listing page, and use the images and labels to train our classifier. We currently plan to build the classifier as a convolutional neural network (CNN), as these are particularly effective at image classification.
Current Work
Main Tasks
- Build a site map generator: output every internal link of input websites
- Build a tool that captures a screenshot of individual web pages
- Build a CNN classifier using Python and TensorFlow
Approaches (IN PROGRESS)
Progress Log (updated on 4/15/2019)
Site Map Generator
The following two algorithms find all internal links of a website from two user inputs: the homepage URL and the crawl depth.
[[File:WebPageTree.png|900px]]
- BFS approach
E:\projects\listing page identifier\Internal_Link\Internal_url_BFS.py
- DFS approach
E:\projects\listing page identifier\Internal_Link\Internal_url_DFS.py
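The two traversals can be sketched as follows. This is a minimal illustration and not the contents of the scripts above: the function names (`internal_links`, `fetch_links`, `site_map_bfs`, `site_map_dfs`) are hypothetical, "internal" is assumed to mean "same host as the page", and depth is assumed to count link hops from the homepage. Only the Python standard library is used.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return absolute same-host links found in html, in document order."""
    parser = _LinkParser()
    parser.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def fetch_links(url):
    """Download a page and extract its internal links (needs network access)."""
    with urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return internal_links(url, html)


def site_map_bfs(homepage, depth, get_links=fetch_links):
    """Visit pages level by level, up to `depth` hops from the homepage."""
    seen = {homepage: 0}  # URL -> depth; dicts keep insertion order
    queue = deque([homepage])
    while queue:
        url = queue.popleft()
        if seen[url] == depth:
            continue  # pages at the depth limit are recorded but not expanded
        for link in get_links(url):
            if link not in seen:
                seen[link] = seen[url] + 1
                queue.append(link)
    return list(seen)


def site_map_dfs(homepage, depth, get_links=fetch_links):
    """Follow each branch down to the depth limit before backtracking."""
    seen = {}

    def visit(url, level):
        seen[url] = level
        if level == depth:
            return
        for link in get_links(url):
            if link not in seen:
                visit(link, level + 1)

    visit(homepage, 0)
    return list(seen)
```

Both functions accept a `get_links` callable so they can be tried on an in-memory link graph without network access. One design note: with a shared visited set, the depth-limited DFS may first reach a page along a long path and then decline to expand it, so for the same depth it can return fewer pages than BFS, which always reaches each page by a shortest path.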
Web Page Screenshot Tool (IN PROGRESS)
E:\projects\listing page identifier\screen_shot\screen_shot_tool.py
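Since the tool is still in progress, the following is only one possible shape for it, not the contents of the script above. It assumes Selenium with a headless Chrome driver, which the project has not confirmed; the function names, the default window size of 1280x1024, and the file naming scheme are all hypothetical.

```python
import os
import re
from urllib.parse import urlparse


def screenshot_filename(url):
    """Build a filesystem-safe PNG file name from a URL (hypothetical scheme)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    safe = re.sub(r"[^A-Za-z0-9]+", "_", raw).strip("_")
    return (safe or "page") + ".png"


def capture_screenshot(url, out_dir=".", width=1280, height=1024):
    """Save a fixed-size screenshot of `url` and return the file path.

    Assumes Selenium and a Chrome driver are installed; the import is
    done lazily so the helper above works without them.
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        # A fixed window size gives every screenshot the same dimensions.
        driver.set_window_size(width, height)
        driver.get(url)
        path = os.path.join(out_dir, screenshot_filename(url))
        driver.save_screenshot(path)
        return path
    finally:
        driver.quit()
```

Note that `save_screenshot` captures only the visible window, so long pages are cropped at the chosen height. Whether cropped or full-page images are better training data for the classifier is an open choice.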
Image Processing
This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. One possible implementation builds on the VGG16 or ResNet architecture and adds batch normalization, which may improve training stability and accuracy in this context.
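As a rough sketch of the VGG16 option, the model below stacks a small classification head with batch normalization on top of the Keras VGG16 convolutional base. Several details are assumptions rather than project decisions: the function name `build_listing_classifier`, the 224x224 RGB input size, the single sigmoid output (listing page versus any other page), the 256-unit hidden layer, and training from randomly initialized weights.

```python
INPUT_SHAPE = (224, 224, 3)  # assumed screenshot size after resizing


def build_listing_classifier(input_shape=INPUT_SHAPE):
    """Return a compiled binary classifier: VGG16 base plus a BatchNorm head."""
    import tensorflow as tf  # imported lazily; requires TensorFlow 2.x

    # weights=None trains from scratch; weights="imagenet" would instead
    # download pretrained weights for transfer learning.
    base = tf.keras.applications.VGG16(
        include_top=False, weights=None, input_shape=input_shape
    )

    inputs = tf.keras.Input(shape=input_shape)
    x = base(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    # One sigmoid unit: estimated probability that the page is the listing page.
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
    )
    return model
```

Swapping in a ResNet base (for example `tf.keras.applications.ResNet50`) would leave the head unchanged; ResNet already includes batch normalization inside its convolutional blocks, whereas the standard VGG16 base does not.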