Difference between revisions of "Listing Page Classifier"

Project
Listing Page Classifier
Project Information
Has title	Listing Page Classifier
Has owner	Nancy Yu
Has start date
Has deadline date
Has project status	Active
	Copyright © 2019 edegan.com. All Rights Reserved.

Revision as of 15:02, 2 May 2019

Summary

The objective of this project is to determine which web page on an incubator's website contains the client company listing.

The project will ultimately use data (incubator names and URLs) identified using the Ecosystem Organization Classifier (perhaps in conjunction with an additional website finder tool, if the Incubator Seed Data source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the U.S. Seed Accelerators project.

We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Given an incubator URL, we will find every web page on the website and generate a standardized-size screenshot of each, code (label) which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.

Current Work

Progress Log (updated on 5/1/2019)

Main Tasks

Build a site map generator: output every internal link of a website
Build a tool that captures screenshots of individual web pages
Build a CNN classifier

Site Map Generator

URL Extraction from HTML

The goal here is to identify the URLs in the HTML code of a website. We can do this by finding the placeholder for a hyperlink, which is the anchor tag <a>. Within the anchor tag, we locate the href attribute, which contains the URL we are looking for (see example below).

<a href="/wiki/Listing_Page_Classifier_Progress" title="Listing Page Classifier Progress"> Progress Log (updated on 4/15/2019)</a>

Issues that may occur:

The href may not give the full url; in the example above, it excludes the domain name: http://www.edegan.com
Some hrefs do include the domain name, so we should take both cases into account when extracting the url

Note: the beautifulsoup package is used for pulling data out of HTML
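
The extraction step described above can be sketched as follows. This is a minimal illustration, not the project's code: it uses the standard library's html.parser in place of beautifulsoup so that it runs without extra packages, and the names HrefCollector and extract_urls, as well as the second sample link, are hypothetical. urljoin handles both cases from the issue list: it prepends the site to a relative href and leaves a full url unchanged.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_urls(html, page_url):
    """Return absolute urls for all anchors in html, resolved against page_url."""
    parser = HrefCollector()
    parser.feed(html)
    # urljoin leaves full urls untouched and completes relative ones
    return [urljoin(page_url, href) for href in parser.hrefs]


# The first link is the example from the text; the second is made up
sample = (
    '<a href="/wiki/Listing_Page_Classifier_Progress">Progress Log</a>'
    '<a href="http://www.edegan.com/wiki/Main_Page">Home</a>'
)
urls = extract_urls(sample, "http://www.edegan.com/wiki/Listing_Page_Classifier")
print(urls)
```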

Distinguish Internal Links

If the href is not in full url format (as in the example above), then it is certainly an internal link
If the href is in full url format but does not contain the site's domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)

<a href = https://www.facebook.com/...></a>
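
The two rules above might be expressed as the check below. It is a sketch under assumptions of our own: the function name is_internal is hypothetical, host names are compared exactly after dropping a leading "www." (so a subdomain would count as external), and hrefs with schemes such as mailto: are treated as not internal, a case the rules above do not cover.

```python
from urllib.parse import urlparse


def _host(url):
    """Lower-cased host name without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_internal(href, homepage_url):
    """True if href points to a page on the same site as homepage_url."""
    parsed = urlparse(href)
    if parsed.scheme not in ("", "http", "https"):
        # mailto:, javascript:, tel: and similar are not pages (our assumption)
        return False
    if not parsed.netloc:
        # No host in the href: not a full url, so it is internal
        return True
    # Full url: internal only if it names the same host as the homepage
    return _host(href) == _host(homepage_url)


print(is_internal("/wiki/Listing_Page_Classifier_Progress", "http://www.edegan.com"))
print(is_internal("https://www.facebook.com/somepage", "http://www.edegan.com"))
```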

Algorithm on Collecting Internal Links

Site Map Tree

Intuitions:

We treat each internal page as a tree node
Each node can have multiple linked children or none
Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4
Given the above idea, we have built the following 2 algorithms to find all internal links of a website, each taking 2 user inputs: the homepage url and the depth

Note: the recommended maximum depth input is 2. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at the first depth and otherwise at the second, so there is no need to go deeper than the second depth.

Breadth-First Search (BFS) approach:

We examine all pages (nodes) at the same depth before going down to the next depth.

Python file saved in

E:\projects\listing page identifier\Internal_Link\Internal_url_BFS.py
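
For illustration, a depth-limited breadth-first traversal of the site map tree could look like the sketch below. It is not the contents of Internal_url_BFS.py. The function name collect_internal_links and the get_internal_links callback (which would fetch a page and return its internal links) are hypothetical, and the toy site dictionary stands in for real pages.

```python
from collections import deque


def collect_internal_links(homepage, max_depth, get_internal_links):
    """Breadth-first crawl from homepage down to max_depth.

    get_internal_links is a caller-supplied function mapping a page url to
    the internal urls found on that page (hypothetical).
    Returns a dict of url -> depth at which the url was first seen.
    """
    depth_of = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()  # FIFO order gives level-by-level traversal
        depth = depth_of[page]
        if depth == max_depth:
            continue  # do not expand pages at the deepest allowed level
        for link in get_internal_links(page):
            if link not in depth_of:  # skip pages already seen
                depth_of[link] = depth + 1
                queue.append(link)
    return depth_of


# Toy site: the homepage links to four pages, as in the tree described above
site = {
    "home": ["page1", "page2", "page3", "page4"],
    "page1": ["page5", "home"],
    "page2": ["page1"],
    "page5": ["page6"],
}
found = collect_internal_links("home", 2, lambda url: site.get(url, []))
print(found)
```

With the recommended depth of 2, page6 is never visited because it sits at depth 3.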

Web Page Screenshot Tool

This tool reads all text files (which contain internal links of individual companies extracted from the above site map generator) from a directory, and outputs a full screenshot (.png) of each url from those text files (see sample output on the right).

Sample Output

Browser Automation Tool

The initial idea was to use the selenium package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. In test runs on different websites, this method worked well for most web pages, with some exceptions. Therefore, the splinter package was chosen as the final browser automation tool for our screenshot tool.

Used Browser

Chrome is the browser used for taking screenshots. A chromedriver was downloaded to set up the browser during browser automation.

Python file saved in

E:\projects\listing page identifier\screen_shot\screen_shot_tool.py
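
The overall flow of the screenshot tool might be sketched as below. This is not the contents of screen_shot_tool.py; the helper names read_urls, screenshot_name and capture_all are hypothetical. The capture step assumes a splinter version whose screenshot method accepts full= and unique_file=, and a chromedriver available on the PATH. The two file-handling helpers use only the standard library.

```python
import os
import re


def read_urls(directory):
    """Return every non-empty line from the .txt files in directory."""
    urls = []
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            with open(os.path.join(directory, name), encoding="utf-8") as f:
                urls.extend(line.strip() for line in f if line.strip())
    return urls


def screenshot_name(url):
    """Turn a url into a file-system-safe base name (no extension)."""
    without_scheme = re.sub(r"^https?://", "", url)
    return re.sub(r"[^A-Za-z0-9]+", "_", without_scheme).strip("_")


def capture_all(directory, out_dir):
    """Save a full-page .png for each url listed under directory."""
    from splinter import Browser  # third-party; imported only when capturing

    out_dir = os.path.abspath(out_dir)
    os.makedirs(out_dir, exist_ok=True)
    browser = Browser("chrome", headless=True)
    try:
        for url in read_urls(directory):
            browser.visit(url)
            # full=True requests the whole page rather than the visible window
            browser.screenshot(
                os.path.join(out_dir, screenshot_name(url)),
                full=True,
                unique_file=False,  # assumed parameter: keep our own file name
            )
    finally:
        browser.quit()
```

A call such as capture_all("internal_links", "screenshots") would then write one .png per url; both directory names here are placeholders.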

Image Processing

This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.

Set Up

Possible packages for building CNN: TensorFlow, PyTorch, scikit
Current dataset: The File to Rule Them All, which contains information on 160 accelerators (homepage url, found cohort url, etc.)
- We will use the data of the 121 accelerators that have cohort urls found for training and testing our CNN algorithm
- 90 of the 121 (around 75%) will be used to train our model; the rest (31 accelerators, around 25%) will be used as the test data
The type of inputs for CNN model:

Picture of the web page (Image data that is generated from the above screenshot tool)
Cohort indicator (Categorical data: 1 - it is a cohort page, 0 - not a cohort page)

Note: The cohort indicator means that our dataset is labeled, which may be helpful when choosing packages for building the CNN model
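
The 90/31 split described above could be produced as in the sketch below. The function name split_accelerators, the fixed seed, and the placeholder accelerator IDs are assumptions for illustration. Splitting by accelerator rather than by individual page is also our assumption; it keeps all screenshots from one website on the same side of the split.

```python
import random


def split_accelerators(accelerators, n_train, seed=0):
    """Shuffle reproducibly, then cut into (train, test) lists."""
    shuffled = list(accelerators)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder IDs standing in for the 121 accelerators with cohort urls
accelerators = [f"accelerator_{i:03d}" for i in range(121)]
train, test = split_accelerators(accelerators, n_train=90)
print(len(train), len(test))
```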

Data Preprocessing (IN PROGRESS)

This part aims to automate combining the results generated by the Site Map Tool and the Screenshot Tool with the cohort indicators. The dataset generated by this process will be fed into our CNN model.

Python file saved in

E:\projects\listing page identifier\generate_dataset.py
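
One way the combination step could work is sketched below. It is not the contents of generate_dataset.py; the names normalize and label_pages, the row layout, and the example.com urls are hypothetical. Each internal link of an accelerator is paired with its screenshot file and given a cohort indicator of 1 if it matches the known cohort url, else 0.

```python
def normalize(url):
    """Ignore a trailing slash when comparing urls (our assumption)."""
    return url.rstrip("/")


def label_pages(internal_links, cohort_url, screenshot_for):
    """Build one row per page: url, screenshot file, cohort indicator.

    screenshot_for is a caller-supplied function mapping a url to the path
    of its screenshot (hypothetical).
    """
    target = normalize(cohort_url)
    return [
        {
            "url": url,
            "screenshot": screenshot_for(url),
            "is_cohort": 1 if normalize(url) == target else 0,
        }
        for url in internal_links
    ]


links = ["http://www.example.com/", "http://www.example.com/portfolio/"]
rows = label_pages(
    links,
    cohort_url="http://www.example.com/portfolio",
    screenshot_for=lambda url: f"page_{links.index(url)}.png",
)
for row in rows:
    print(row)
```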
