Listing Page Classifier



Project
Listing Page Classifier
Project Information
Has title Listing Page Classifier
Has owner Nancy Yu
Has project status Active
Copyright © 2019 edegan.com. All Rights Reserved.


Summary

The objective of this project is to determine which web page on an incubator's website contains the client company listing.

The project will ultimately use data (incubator names and URLs) identified using the Ecosystem Organization Classifier (perhaps in conjunction with an additional website finder tool, if the Incubator Seed Data source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the U.S. Seed Accelerators project.

We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Given an incubator URL, we will generate standardized-size screenshots of every web page on the website, code (label) which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier as a convolutional neural network (CNN), as these are particularly effective at image classification.

Current Work

Progress Log (updated on 4/22/2019)

Main Tasks

  1. Build a site map generator: output every internal link of a website
  2. Build a tool that captures screenshots of individual web pages
  3. Build a CNN classifier using Python and TensorFlow

Site Map Generator

URL Extraction from HTML

The goal here is to identify the URLs in the HTML code of a website. Hyperlinks are held in anchor tags (<a>), and the href attribute of each anchor tag contains the URL we are looking for (see the example below).

<a href="/wiki/Listing_Page_Classifier_Progress" title="Listing Page Classifier Progress"> Progress Log (updated on 4/15/2019)</a>

Two issues may occur:

  • The href may not give the full URL. In the example above, it omits the domain name: http://www.edegan.com
  • Other href values do include the domain name, so both cases must be handled when extracting the URL

Note: the BeautifulSoup package is used to pull data out of the HTML.
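
The extraction step can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs, and the names AnchorCollector and extract_urls, as well as the second sample link, are assumptions made for the example. Relative href values are completed with urljoin, which leaves full URLs unchanged.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_urls(html, page_url):
    """Return absolute URLs for all anchors in html, resolved against page_url."""
    parser = AnchorCollector()
    parser.feed(html)
    # urljoin completes relative hrefs and leaves full URLs untouched.
    return [urljoin(page_url, href) for href in parser.hrefs]


# Hypothetical sample: one relative href and one full URL.
sample = (
    '<a href="/wiki/Listing_Page_Classifier_Progress">Progress Log</a>'
    '<a href="http://www.edegan.com/wiki/Main_Page">Main Page</a>'
)
urls = extract_urls(sample, "http://www.edegan.com/wiki/Listing_Page_Classifier")
for url in urls:
    print(url)
```

With BeautifulSoup, the collection step would typically be written as soup.find_all("a", href=True), with the same urljoin call applied to each result.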

Distinguish Internal Links

  • If the href is not a full URL (as in the example above), then it is an internal link
  • If the href is a full URL and contains the site's domain name, then it is also an internal link
  • If the href is a full URL but does not contain the site's domain name, then it is an external link (see the example below, assuming the site's domain name is not facebook.com)
<a href="https://www.facebook.com/..."></a>
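
These rules can be expressed as a small check. The sketch below is one possible reading of them, not the project's code: the function names are hypothetical, treating "www.edegan.com" and "edegan.com" as the same site is an assumption, and links with non-HTTP schemes (such as mailto:) are treated as not internal because they do not lead to a page.

```python
from urllib.parse import urljoin, urlparse


def normalize_host(netloc):
    """Lower-case the host and drop a leading 'www.' so both forms compare equal."""
    host = netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_internal(href, homepage_url):
    """Return True if href, found on the site at homepage_url, stays on that site."""
    # Relative hrefs inherit the homepage's domain; full URLs keep their own.
    absolute = urlparse(urljoin(homepage_url, href))
    if absolute.scheme not in ("http", "https"):
        return False  # e.g. mailto: or javascript: links are not pages
    site_host = normalize_host(urlparse(homepage_url).netloc)
    return normalize_host(absolute.netloc) == site_host


print(is_internal("/wiki/Listing_Page_Classifier_Progress", "http://www.edegan.com"))
print(is_internal("https://www.facebook.com/", "http://www.edegan.com"))
```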

Algorithm on Collecting Internal Links

 
Site Map Tree

Intuition:

  • We treat each internal page as a tree node
  • Each node can have multiple linked children or none
  • In the site map tree figure, for example, the homepage is the root node (depth = 0), which is given as input to our function, and it has 4 children (depth = 1): page 1, page 2, page 3, and page 4
  • Based on this idea, we have built the following 2 algorithms, which find all internal links of a website from 2 user inputs: the homepage URL and the maximum depth

Breadth-First Search (BFS) approach:

We examine all pages (nodes) at the same depth before moving down to the next depth.

Python file saved in

E:\projects\listing page identifier\Internal_Link\Internal_url_BFS.py
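
A depth-limited BFS of this kind might look like the sketch below. It is not the content of Internal_url_BFS.py: the function name, the get_links parameter (a callable that returns the internal links found on a page), and the in-memory site dictionary are all assumptions, chosen so the traversal can be shown without making HTTP requests.

```python
from collections import deque


def collect_links_bfs(homepage, max_depth, get_links):
    """Return pages reachable from homepage within max_depth links,
    in the order a breadth-first traversal discovers them."""
    visited = {homepage}
    order = [homepage]
    queue = deque([(homepage, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # pages at the depth limit are recorded but not expanded
        for child in get_links(url):
            if child not in visited:
                visited.add(child)
                order.append(child)
                queue.append((child, depth + 1))
    return order


# Hypothetical in-memory site standing in for real page downloads.
site = {
    "home": ["page1", "page2", "page3", "page4"],
    "page1": ["page5", "home"],
    "page2": ["page6"],
    "page5": ["page7"],
}


def links_on(url):
    return site.get(url, [])


print(collect_links_bfs("home", 1, links_on))
print(collect_links_bfs("home", 2, links_on))
```

Because the queue is processed in order, every page is first reached at its shallowest depth, so a plain visited set is enough to avoid loops.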

Depth-First Search (DFS) approach:

We visit a page (node) "A" and then all of A's descendants, down to the maximum depth, before we visit A's sibling node "B".

For example, assuming the maximum depth the user wants to explore is 2, we start with the homepage and examine its first child node "page 1", then visit page 1's children, at which point we have reached the maximum depth. We then move on to the homepage's second child "page 2" and visit page 2's children. Next we visit the homepage's next child, and so on.

Python file saved in

E:\projects\listing page identifier\Internal_Link\Internal_url_DFS.py
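
The DFS traversal can be sketched in the same style. Again, this is an illustration under assumptions rather than the content of Internal_url_DFS.py: the function name, the get_links callable, and the sample site are hypothetical. One design choice here is also an assumption: the sketch records the shallowest depth at which each page has been reached and re-expands a page if it is later found closer to the homepage, because with a depth limit a plain visited set could skip pages that are within range.

```python
def collect_links_dfs(homepage, max_depth, get_links):
    """Return pages reachable from homepage within max_depth links,
    in the order a depth-first traversal first visits them."""
    best_depth = {}  # shallowest depth at which each page has been reached
    order = []       # pages in order of first visit

    def visit(url, depth):
        # Skip pages already reached at this depth or shallower.
        if url in best_depth and best_depth[url] <= depth:
            return
        if url not in best_depth:
            order.append(url)
        best_depth[url] = depth
        if depth < max_depth:
            for child in get_links(url):
                visit(child, depth + 1)

    visit(homepage, 0)
    return order


# Hypothetical in-memory site standing in for real page downloads.
site = {
    "home": ["page1", "page2"],
    "page1": ["page3", "home"],
    "page2": ["page4"],
    "page3": ["page5"],
}


def links_on(url):
    return site.get(url, [])


print(collect_links_dfs("home", 2, links_on))
```

The recursion depth never exceeds max_depth + 1, so Python's recursion limit is not a concern for the small depths discussed here.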

Web Page Screenshot Tool (IN PROGRESS)

This tool takes 2 user inputs: the URL and the name of the output file (.png). It outputs a PNG file containing a full-page screenshot of the web page (see the sample output).

 
Sample Output

Python file saved in

E:\projects\listing page identifier\screen_shot\screen_shot_tool.py
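
The page does not say which library the tool uses. As one possible approach, the sketch below assumes Selenium driving headless Chrome (both must be installed separately) and captures the full page by resizing the browser window to the document's height. The function names and the default width of 1280 pixels are assumptions made for the example, not details of screen_shot_tool.py.

```python
def png_name(name):
    """Ensure the output file name ends in .png."""
    return name if name.lower().endswith(".png") else name + ".png"


def capture_full_page(url, output_name, width=1280):
    """Save a full-page screenshot of url and return the output file name."""
    # Imported here so that png_name works even without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Grow the window to the full document height so nothing is cut off.
        height = driver.execute_script(
            "return document.documentElement.scrollHeight"
        )
        driver.set_window_size(width, height)
        out = png_name(output_name)
        driver.save_screenshot(out)
        return out
    finally:
        driver.quit()  # always release the browser, even on errors
```

A call such as capture_full_page("http://www.edegan.com", "homepage") would then write homepage.png. Using a fixed width for every page is one way to keep the screenshots at a standardized size for the classifier.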

Image Processing

This method would likely rely on a convolutional neural network (CNN) to classify web page screenshots based on the HTML elements visible in them. One possible implementation is to start from an established architecture such as VGG16 or ResNet and add batch normalization, which may improve training stability and accuracy in this context.