Difference between revisions of "Listing Page Classifier"

From edegan.com
 
(104 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
{{Project
 
{{Project
 +
|Has project output=Tool
 +
|Has sponsor=Kauffman Incubator Project
 
|Has title=Listing Page Classifier
 
|Has title=Listing Page Classifier
 
|Has owner=Nancy Yu,
 
|Has owner=Nancy Yu,
Line 14: Line 16:
  
 
==Current Work==
 
==Current Work==
 +
[[Listing Page Classifier Progress|Progress Log (updated on 5/17/2019)]]
  
 
===Main Tasks===
 
===Main Tasks===
  
# Build a site map generator: output every internal link of input websites
+
# Build a site map generator: output every internal link of a website
# Build a tool that captures a screenshot of individual web pages
+
# Build a tool that captures screenshots of individual web pages
# Build a CNN classifier using Python and TensorFlow
+
# Build a CNN classifier
  
===Approaches (IN PROGRESS)===
+
===Site Map Generator===
[[Listing Page Classifier Progress|Progress Log (updated on 4/15/2019)]]
 
====Site Map Generator====
 
  
 +
====URL Extraction from HTML====
  
[[File:WebPageTree.png|900px]]
+
The goal here is to identify url links from the HTML code of a website. We can solve this by finding the place holder, which is the anchor tag <a>, for a hyperlink. Within the anchor tag, we may locate the href attribute that contains the url (see example below).
  
Intuition:
+
<code><a href="/wiki/Listing_Page_Classifier_Progress" title="Listing Page Classifier Progress"> Progress Log (updated on 4/15/2019)</a></code>
*We treat each internal page as a tree node. Each node can have multiple children.
 
*Taking the above picture as an example, the homepage is the first tree node that we will be given as an input to our function, and it has 4 children: page 1, page 2, page 3, and page 4
 
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 given user inputs: homepage url and depth
 
  
'''Breadth-First Search(BFS)approach''':  
+
Issues may occur:
 +
* The href may not give us the full url, like above example it excludes the domain name: http://www.edegan.com
 +
* Some may not exclude the domain name and we should take consideration of both cases when extracting the url
  
we examine all pages(nodes) at the same depth before going down to the next depth.
+
'''Note:''' the [https://www.crummy.com/software/BeautifulSoup/bs4/doc/ beautifulsoup] package is used for pulling data out of HTML
  
  E:\projects\listing page identifier\Internal_Link\Internal_url_BFS.py
+
====Distinguish Internal Links====
'''Depth-First Search (DFS) approach''':  
+
* If the href is not presented in a full url format (referring to the example above), then it is for sure an internal link
 +
* If the href is in a full url format, but it does not contain the domain name, then it is an external link (see example below, assuming the domain name is not facebook.com)
 +
   
 +
<code><a href = https://www.facebook.com/...></a></code>
 +
 
 +
====Algorithm on Collecting Internal Links====
 +
 
 +
[[File:WebPageTree.png|500px|thumb|center|Site Map Tree]]
  
we visit a page(node)"A" and then all its children on the current path will be visited before we visit A's neighbor node "B".
+
'''Intuitions:'''
 +
*We treat each internal page as a tree node
 +
*Each node can have multiple linked children or none
 +
*Taking the above picture as an example, the homepage is the first tree node (at depth = 0) that we will be given as an input to our function, and it has 4 children (at depth = 1): page 1, page 2, page 3, and page 4
 +
*Given the above idea, we have built 2 following algorithms to find all internal links of a web page with 2 user inputs: homepage url and depth
  
For example, assuming the furthest depth a user wants to dig in is 2, we will start with our homepage and then examine its first children "page 1", then visiting page 1's children until we meet the maximum depth. Then we move onto homepage's second children "page 2" and visit page 2's children until we reach the maximum depth. Next we visit homepage next children page 3 and so on.
+
'''Note:''' the '''recommended maximum depth''' input is '''2'''. Since our primary goal is to capture the screenshot of the portfolio page (client listing page) and this page often appears at the first depth, if not, second depth will be enough to achieve the goal, no need to dive deeper than the second depth.
+
 
  E:\projects\listing page identifier\Internal_Link\Internal_url_DFS.py
+
'''''Breadth-First Search (BFS) approach''''':
 +
 
 +
We examine all pages(nodes) at the same depth before going down to the next depth.
 +
 
 +
 
 +
Python file saved in
 +
 
 +
  E:\projects\listing page identifier\Internal_url_BFS.py
 +
 
 +
===Web Page Screenshot Tool===
 +
This tool reads two text files: test.txt and train.txt, and outputs a full screenshot (see sample output on the right) of each url in these 2 text files.
 +
[[File:screenshotEx.png|200x400px|thumb|right|Sample Output]]
 +
 
 +
====Browser Automation Tool====
 +
The initial idea was to use the [https://www.seleniumhq.org/ selenium] package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. After several test runs on different websites, this method worked great for most web pages but with some exceptions. Therefore, the [https://splinter.readthedocs.io/en/latest/why.html splinter] package is chosen as the final browser automation tool to assist our screenshot tool
 +
 
 +
====Used Browser====
 +
The picked browser for taking screenshot is Firefox. A geckodriver v0.24.0 was downloaded for setting up the browser during browser automation.
 +
 
 +
'''Note:''' initial plan was to use Chrome, but encountered some issues with switching different versions(v73 to v74) of chromedriver during the browser automation.
  
====Web Page Screenshot Tool (IN PROGRESS)====
+
Python file saved in
This tool will take 2 user input: the url and the output file(.png)'s name. It will output a png file that can take the full screen shot of a web page (see example output file on the left)
+
  E:\projects\listing page identifier\screen_shot_tool.py
  E:\projects\listing page identifier\screen_shot\screen_shot_tool.py
 
  
 
===Image Processing===
 
===Image Processing===
 +
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.
 +
====Set Up====
 +
*Possible Python packages for building CNN: TensorFlow, PyTorch, scikit
 +
*Current dataset: <code>The File to Rule Them All</code>, contains information of 160 accelerators (homepage url, found cohort url etc.)
 +
** We will use the data of 121 accelerators, which have cohort urls found, for training and testing our CNN algorithm
 +
** After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the result data to train our model. The rest, 25% will be used as the test data
 +
*The type of inputs for training CNN model:
 +
#Image: picture of the web page (generated by the Screenshot Tool)
 +
#Class Label: Cohort indicator ( 1 - it is a cohort page, 0 - not a cohort page)
 +
 +
====Data Preprocessing====
 +
'''''Retrieving All Internal Links: ''''' this <code>generate_dataset.py</code> reads all homepage urls in the file <code>The File to Rule Them All.csv</code> and then feed them into the Site Map Generator to retrieve their corresponding internal urls
 +
*This process assigns corresponding cohort indicator to each url, which is separated by tab (see example below)
 +
http://fledge.co/blog/ 0
 +
http://fledge.co/fledglings/ 1
 +
http://fledge.co/2019/visiting-malawi/ 0
 +
http://fledge.co/about/details/ 0
 +
http://fledge.co/about/ 0
 +
 +
*Results are automatically split into two text files: <code>train.txt</code> and <code>test.txt</code>.
  
This method would likely rely on a [https://en.wikipedia.org/wiki/Convolutional_neural_network convolutional neural network (CNN)] to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.
+
Python file saved in
 +
E:\projects\listing page identifier\generate_dataset.py
 +
 
 +
'''''Generate and Label Image Data: ''''' feed paths/directories of <code>train.txt</code> and <code>test.txt</code> into Screenshot Tool to get our image data
 +
*Results are split into two folders: train and test
 +
** Also separated into sub-folders: cohort and not_cohort[[File:autoName.png|250px]]
 +
** Make sure to create train and test folders (in the '''same directory''' as <code>train.txt</code> and <code>test.txt</code>), and their sub-folders cohort and not_cohort '''BEFORE''' running the Screenshot Tool
 +
 
 +
====CNN Model====
 +
Python file saved in
 +
E:\projects\listing page identifier\cnn.py
 +
 
 +
'''''NOTE: '''''[https://keras.io/ Keras]  package (with TensorFlow backend) is used for setting up the model
 +
 
 +
'''Current condition/issue''' of the model:
 +
* loss: 0.9109, accuracy: 0.9428
 +
* The model runs with no problem, however, it does not make classification. All predictions on the test set are the same
 +
 
 +
Some '''factors/problems''' to consider for '''future implementation''' on the model:
 +
* Class label is highly imbalanced: there are far more 0 (not cohort) than 1 (cohort) examples
 +
**may cause our model favoring the larger class, then the accuracy metric is not reliable
 +
**several suggestions to fix this: A) under-sampling the larger class B)over-sampling the smaller class
 +
* Convert image data into same format: [https://www.oreilly.com/library/view/linux-multimedia-hacks/0596100760/ch01s04.html Make image thumbnail]
 +
**we can modify image target size in our CNN, but we don't know if Keras library crop or re-scale image with given target size
 +
*I chose to group images into cohort folder or not_cohort folder to let our CNN model detect the class label of an image. There are certainly other ways to detect class label and one may want to modify the Screenshot Tool and <code>cnn.py</code> to assist with other approaches
 +
 
 +
 
 +
Useful resources:
 +
*Image generator in Keras: https://keras.io/preprocessing/image/
 +
*Keras tutorials for building a CNN:
 +
https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
 +
https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5
 +
https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8
 +
 
 +
===Workflow===
 +
This section summarizes a general process of utilizing above tools to get appropriate input for our CNN model, also serves as a guidance for anyone who wants to implement upon those tools.
 +
 
 +
# Feed raw data (for now, our raw data is <code>The File to Rule Them All.csv</code>) into <code>generate_dataset.py</code> to get text files (<code>train.txt</code> and <code>test.txt</code>) that contain a list of all internal urls with their corresponding indicator (class label)
 +
# Create 2 folders, train and test, in the same directory as <code>train.txt</code> and <code>test.txt</code>, and create 2 sub-folders, cohort and not_cohort, within each of them
 +
# Feed the directory/path of <code>train.txt</code> and <code>test.txt</code> into <code>screen_shot_tool.py</code>. This process automatically groups images into the corresponding folders created in step 2

Latest revision as of 12:47, 21 September 2020


Project
Listing Page Classifier
Project Information
Has title Listing Page Classifier
Has owner Nancy Yu
Has start date
Has deadline date
Has project status Active
Has sponsor Kauffman Incubator Project
Has project output Tool
Copyright © 2019 edegan.com. All Rights Reserved.


Summary

The objective of this project is to determine which web page on an incubator's website contains the client company listing.

The project will ultimately use data (incubator names and URLs) identified using the Ecosystem Organization Classifier (perhaps in conjunction with an additional website finder tool, if the Incubator Seed Data source does not contain URLs). Initially, however, we are using accelerator websites taken from the master file from the U.S. Seed Accelerators project.

We are building three tools: a site map generator, a web page screenshot tool, and an image classifier. Then, given an incubator URL, we will find and generate (standardized size) screenshots of every web page on the website, code which page is the client listing page, and use the images and the coding to train our classifier. We currently plan to build the classifier using a convolutional neural network (CNN), as these are particularly effective at image classification.

Current Work

Progress Log (updated on 5/17/2019)

Main Tasks

  1. Build a site map generator: output every internal link of a website
  2. Build a tool that captures screenshots of individual web pages
  3. Build a CNN classifier

Site Map Generator

URL Extraction from HTML

The goal here is to identify URLs in the HTML code of a website. Hyperlinks are held in anchor tags (<a>), and the href attribute of each anchor tag contains the URL (see example below).

<a href="/wiki/Listing_Page_Classifier_Progress" title="Listing Page Classifier Progress"> Progress Log (updated on 4/15/2019)</a>

Issues that may occur:

  • The href may not give the full URL. In the example above it omits the domain name: http://www.edegan.com
  • Other hrefs do include the domain name, so both cases must be handled when extracting the URL

Note: the beautifulsoup package is used for pulling data out of HTML
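A minimal sketch of the extraction step is given below. It is not the project's code: it swaps BeautifulSoup for Python's standard-library `html.parser` so that it runs without extra packages, and the names `HrefCollector` and `extract_links` are hypothetical. `urljoin` completes hrefs that omit the domain name and leaves full URLs unchanged, which covers both cases listed above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_links(html, page_url):
    """Return absolute URLs for all anchors found in html."""
    collector = HrefCollector()
    collector.feed(html)
    # urljoin leaves absolute hrefs untouched and completes relative ones.
    return [urljoin(page_url, href) for href in collector.hrefs]


sample = '<a href="/wiki/Listing_Page_Classifier_Progress">Progress Log</a>'
links = extract_links(sample, "http://www.edegan.com/wiki/Listing_Page_Classifier")
print(links)
```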

Distinguish Internal Links

  • If the href is not a full URL (as in the example above), it is an internal link
  • If the href is a full URL that contains the website's domain name, it is also an internal link
  • If the href is a full URL but does not contain the website's domain name, it is an external link (see example below, assuming the domain name is not facebook.com)

<a href = https://www.facebook.com/...></a>
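The rules above could be sketched as follows. The function name `is_internal` is hypothetical, and comparing host names exactly is a simplifying assumption: it treats `www.edegan.com` and `edegan.com` as different sites, which the project code may or may not do.

```python
from urllib.parse import urljoin, urlparse


def is_internal(href, homepage_url):
    """Return True when href points to the same host as homepage_url."""
    home_host = urlparse(homepage_url).netloc.lower()
    # Resolve relative hrefs first so they inherit the homepage host.
    target = urlparse(urljoin(homepage_url, href))
    if target.scheme not in ("http", "https"):
        return False  # e.g. mailto: or javascript: links
    return target.netloc.lower() == home_host


print(is_internal("/wiki/Listing_Page_Classifier_Progress", "http://www.edegan.com"))
print(is_internal("https://www.facebook.com/somepage", "http://www.edegan.com"))
```

The scheme check is an addition not mentioned in the text; it keeps non-web links such as `mailto:` out of the site map.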

Algorithm on Collecting Internal Links

Site Map Tree

Intuitions:

  • We treat each internal page as a tree node
  • Each node can have multiple linked children or none
  • Taking the above picture as an example, the homepage is the first tree node (at depth = 0) and is given as an input to our function. It has 4 children (at depth = 1): page 1, page 2, page 3, and page 4
  • Based on this idea, we built search algorithms that find all internal links of a website from 2 user inputs: the homepage URL and the maximum depth

Note: the recommended maximum depth is 2. Our primary goal is to capture a screenshot of the portfolio page (client listing page), which usually appears at depth 1 and otherwise at depth 2, so there is no need to go deeper than the second depth.

Breadth-First Search (BFS) approach:

We examine all pages (nodes) at the same depth before going down to the next depth.


Python file saved in

E:\projects\listing page identifier\Internal_url_BFS.py
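The breadth-first traversal can be sketched as below. This is an illustration under stated assumptions, not the contents of `Internal_url_BFS.py`: the function name `bfs_internal_links` is hypothetical, and `get_links` is a caller-supplied function standing in for the step that downloads a page and returns its internal links. A small in-memory dictionary replaces real HTTP requests so the example runs offline.

```python
from collections import deque


def bfs_internal_links(homepage, max_depth, get_links):
    """Breadth-first crawl returning {url: depth} for pages up to max_depth.

    get_links(url) is a caller-supplied function (an assumption here) that
    returns the internal links found on a page.
    """
    depths = {homepage: 0}
    queue = deque([homepage])
    while queue:
        url = queue.popleft()
        if depths[url] == max_depth:
            continue  # do not expand pages at the depth limit
        for child in get_links(url):
            if child not in depths:  # skip pages already seen
                depths[child] = depths[url] + 1
                queue.append(child)
    return depths


# Toy site standing in for real HTTP requests.
site = {
    "home": ["page1", "page2", "page3", "page4"],
    "page1": ["page1a", "home"],
    "page1a": ["deep"],
}
result = bfs_internal_links("home", 2, lambda url: site.get(url, []))
print(result)
```

Because the queue is first-in, first-out, every page at depth 1 is processed before any page at depth 2. Recording each URL the first time it is seen also prevents loops when pages link back to the homepage.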

Web Page Screenshot Tool

This tool reads two text files, test.txt and train.txt, and outputs a full-page screenshot (see sample output on the right) of each URL in these 2 text files.

Sample Output

Browser Automation Tool

The initial idea was to use the selenium package to set up a browser window that fits the web page size, then capture the whole window to get a full screenshot of the page. In test runs on different websites this method worked for most web pages, but with some exceptions. Therefore, the splinter package was chosen as the final browser automation tool for the screenshot tool.

Used Browser

The browser chosen for taking screenshots is Firefox. geckodriver v0.24.0 was downloaded to drive the browser during browser automation.

Note: the initial plan was to use Chrome, but we encountered issues when switching between chromedriver versions (v73 to v74) during browser automation.

Python file saved in

E:\projects\listing page identifier\screen_shot_tool.py

Image Processing

This method would likely rely on a convolutional neural network (CNN) to classify HTML elements present in web page screenshots. Implementation could be achieved by combining the VGG16 model or ResNet architecture with batch normalization to increase accuracy in this context.

Set Up

  • Possible Python packages for building CNN: TensorFlow, PyTorch, scikit
  • Current dataset: The File to Rule Them All, which contains information on 160 accelerators (homepage url, found cohort url etc.)
    • We will use the data of the 121 accelerators for which cohort urls were found to train and test our CNN algorithm
    • After applying the above Site Map Generator to those 121 accelerators, we will use 75% of the resulting data to train our model. The remaining 25% will be used as test data
  • The types of input for training the CNN model:
    1. Image: picture of the web page (generated by the Screenshot Tool)
    2. Class Label: Cohort indicator (1 - it is a cohort page, 0 - not a cohort page)
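The 75%/25% split described above could be done as in the following sketch. The function name `split_train_test`, the fixed random seed, and the example URLs are all assumptions for illustration; the text does not say whether the project splits by individual URL or by accelerator, and this sketch splits by individual item.

```python
import random


def split_train_test(items, train_fraction=0.75, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


urls = [f"http://example.com/page{i}" for i in range(8)]
train, test = split_train_test(urls)
print(len(train), len(test))  # 6 2
```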

Data Preprocessing

Retrieving All Internal Links: generate_dataset.py reads all homepage urls in the file The File to Rule Them All.csv and feeds them into the Site Map Generator to retrieve their internal urls

  • This process assigns a cohort indicator to each url; the url and the indicator are separated by a tab (see example below)
http://fledge.co/blog/	0
http://fledge.co/fledglings/	1
http://fledge.co/2019/visiting-malawi/	0
http://fledge.co/about/details/	0
http://fledge.co/about/	0 
  • Results are automatically split into two text files: train.txt and test.txt.
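A sketch of how the tab-separated lines shown above might be written and read back is given below. The function names are hypothetical, and labelling a url as 1 only when it exactly equals the known cohort url is an assumption; `generate_dataset.py` may match urls differently.

```python
def format_line(url, is_cohort):
    """Build one 'url<TAB>label' line."""
    return f"{url}\t{1 if is_cohort else 0}"


def parse_line(line):
    """Split a 'url<TAB>label' line back into (url, int_label)."""
    url, label = line.rstrip("\n").split("\t")
    return url, int(label.strip())


def label_urls(internal_urls, cohort_url):
    """Label each url 1 if it equals the known cohort url, else 0."""
    return [format_line(u, u == cohort_url) for u in internal_urls]


lines = label_urls(
    ["http://fledge.co/blog/", "http://fledge.co/fledglings/"],
    "http://fledge.co/fledglings/",
)
for line in lines:
    print(parse_line(line))
```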

Python file saved in

E:\projects\listing page identifier\generate_dataset.py

Generate and Label Image Data: feed the paths of train.txt and test.txt into the Screenshot Tool to get our image data

  • Results are split into two folders: train and test
    • Each is further separated into sub-folders: cohort and not_cohort
    • Make sure to create the train and test folders (in the same directory as train.txt and test.txt), and their sub-folders cohort and not_cohort, BEFORE running the Screenshot Tool
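The folder layout described above could be prepared with a short script like the one below. This is only a sketch: the helper names are hypothetical, a temporary directory stands in for the directory that holds the two text files, and naming each screenshot by a running index is an assumption, since the tool's actual naming scheme is not described here.

```python
import os
import tempfile


def make_label_folders(base_dir):
    """Create train/ and test/ with cohort/ and not_cohort/ inside each."""
    for split in ("train", "test"):
        for label in ("cohort", "not_cohort"):
            os.makedirs(os.path.join(base_dir, split, label), exist_ok=True)


def screenshot_path(base_dir, split, label, index):
    """Return where the screenshot for one labelled url would be saved."""
    folder = "cohort" if label == 1 else "not_cohort"
    return os.path.join(base_dir, split, folder, f"{index}.png")


base = tempfile.mkdtemp()  # stand-in for the directory holding train.txt
make_label_folders(base)
print(screenshot_path(base, "train", 1, 0))
```

Using `exist_ok=True` makes the script safe to run again if the folders already exist.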

CNN Model

Python file saved in

E:\projects\listing page identifier\cnn.py

NOTE: Keras package (with TensorFlow backend) is used for setting up the model

Current condition/issue of the model:

  • loss: 0.9109, accuracy: 0.9428
  • The model runs without errors, but it does not actually classify: all predictions on the test set are the same
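One way to check whether a high accuracy is meaningful is to compare it with the accuracy of always predicting the most common class. The sketch below uses made-up labels, not the project's data, and the function name is hypothetical. If the model's accuracy is close to this baseline, the model may simply be predicting the larger class every time.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Illustrative labels only, not the project's data: 94 negatives, 6 positives.
labels = [0] * 94 + [1] * 6
print(majority_baseline_accuracy(labels))  # 0.94
```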

Some factors/problems to consider for future implementation on the model:

  • The class label is highly imbalanced: there are far more 0 (not cohort) than 1 (cohort) examples
    • this may cause the model to favor the larger class, in which case the accuracy metric is not reliable
    • suggested fixes: A) under-sampling the larger class, B) over-sampling the smaller class
  • Convert image data into the same format: Make image thumbnail
    • we can set the image target size in our CNN, but we don't know whether the Keras library crops or re-scales an image to the given target size
  • I chose to group images into a cohort folder or a not_cohort folder so that the CNN model can read the class label of an image from its folder. There are other ways to supply the class label, and one may want to modify the Screenshot Tool and cnn.py to support other approaches
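The two resampling suggestions above can be sketched with plain random sampling, as below. The function names and the `(item, label)` pair format are assumptions for illustration and are not part of `cnn.py`. Resampling would normally be applied to the training set only, so that the test set keeps its original class balance.

```python
import random


def _group_by_label(samples):
    """Group (item, label) pairs into {0: [...], 1: [...]}."""
    by_label = {0: [], 1: []}
    for item, label in samples:
        by_label[label].append((item, label))
    return by_label


def oversample_minority(samples, seed=0):
    """Duplicate random minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    small, large = sorted(_group_by_label(samples).values(), key=len)
    if not small:
        return list(samples)  # nothing to duplicate
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    return large + small + extra


def undersample_majority(samples, seed=0):
    """Randomly drop majority-class samples until classes are balanced."""
    rng = random.Random(seed)
    small, large = sorted(_group_by_label(samples).values(), key=len)
    return small + rng.sample(large, len(small))


data = [("a.png", 1)] + [(f"n{i}.png", 0) for i in range(5)]
print(len(oversample_minority(data)), len(undersample_majority(data)))  # 10 2
```

Over-sampling keeps every screenshot but repeats the rare ones; under-sampling discards data, which may matter when the dataset is small.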


Useful resources:

https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5
https://becominghuman.ai/building-an-image-classifier-using-deep-learning-in-python-totally-from-a-beginners-perspective-be8dbaf22dd8

Workflow

This section summarizes the general process of using the above tools to get appropriate input for our CNN model, and also serves as guidance for anyone who wants to build upon those tools.

  1. Feed raw data (for now, our raw data is The File to Rule Them All.csv) into generate_dataset.py to get text files (train.txt and test.txt) that contain a list of all internal urls with their corresponding indicator (class label)
  2. Create 2 folders, train and test, in the same directory as train.txt and test.txt, and create 2 sub-folders, cohort and not_cohort, within each of them
  3. Feed the path of train.txt and test.txt into screen_shot_tool.py. This process automatically groups images into the corresponding folders created in step 2