Difference between revisions of "Google Crawler"

McNair Project
Google Crawler
Project Information
Project Title	Google Crawler
Owner	Anne Freeman
Start Date
Deadline
Primary Billing
Notes
Has project status	Active
	Copyright © 2016 edegan.com. All Rights Reserved.

Revision as of 15:23, 9 April 2019

Background

We wanted to create a Google web crawler that collects data from web searches specific to individual cities. Each search takes the form "incubator" + "city, state". The crawler was modeled on a previous researcher's web crawler, which collected information on accelerators. We could not simply modify that crawler because it used an outdated Python module.

The output from this crawler could be used in several ways:

The URLs determined to be incubator websites can be used as input to the Listing Page Classifier, which takes an incubator website URL and identifies which page contains the client company listing.
The title text can be analyzed using n-grams to look for keywords that classify the URL as an incubator. This strategy is discussed in Geocoding Inventor Locations (Tool).
Key elements of a page's HTML can be fed into an adapted version of the Demo Day Page Google Classifier to identify demo day webpages that contain a list of cohort companies.
The page can be passed to Amazon's Mechanical Turk (https://www.mturk.com/) to outsource the task of classifying pages as incubators.
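
The n-gram idea in the second item can be sketched roughly as follows. This is a minimal illustration and not the project's actual classifier: the function names, the keyword set, and the choice of unigrams plus bigrams are all assumptions made for the example.

```python
import re
from collections import Counter


def title_ngrams(title, n):
    """Return the lowercased word n-grams of a result title."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(titles, max_n=2):
    """Count every n-gram (n = 1..max_n) across a list of titles."""
    counts = Counter()
    for title in titles:
        for n in range(1, max_n + 1):
            counts.update(title_ngrams(title, n))
    return counts


# Hypothetical keyword list; a real list would be derived from the n-gram counts.
KEYWORDS = {"incubator", "business incubator", "startup incubator"}


def looks_like_incubator(title, keywords=KEYWORDS, max_n=2):
    """Flag a title if any of its n-grams matches a keyword."""
    grams = set()
    for n in range(1, max_n + 1):
        grams.update(title_ngrams(title, n))
    return bool(grams & keywords)
```

Running `count_ngrams` over the titles collected for many cities would show which words and phrases occur most often, and those could then seed the keyword list.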

Implementation

The crawler opens a text file containing a list of locations in the format "city, state", with one entry per line. For each location it builds a search URL by prepending the Google search prefix "https://www.google.com/search?q=" to the key term "incubator" followed by the city and state name, escaping commas and spaces for use in a URL. Then, using Beautiful Soup, the script opens each generated URL and parses the resulting page to collect the titles and URLs of the results. The titles and URLs are stored in a CSV file in the following format:

first row: city, state
second row: titles of results
third row: urls of results
fourth row: blank

This pattern repeats for each city, state query.
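
The query construction and the four-row CSV layout described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's script: the function names are hypothetical, standard URL encoding via `urllib.parse.quote_plus` stands in for whatever escaping the original used, and the city and state are assumed to be written as two separate cells. Fetching and parsing the result pages is omitted.

```python
import csv
from urllib.parse import quote_plus

SEARCH_PREFIX = "https://www.google.com/search?q="


def read_locations(lines):
    """Return the non-empty "city, state" entries, one per input line."""
    return [line.strip() for line in lines if line.strip()]


def build_query_url(location, term="incubator"):
    """Build the search URL for one location.

    quote_plus encodes spaces as "+" and commas as "%2C" (assumed escaping).
    """
    return SEARCH_PREFIX + quote_plus(f"{term} {location.strip()}")


def write_results(out, results):
    """Write (location, titles, urls) tuples in the four-row pattern."""
    writer = csv.writer(out)
    for location, titles, urls in results:
        # Row 1: city and state (assumed to be two cells).
        writer.writerow([part.strip() for part in location.split(",")])
        writer.writerow(titles)  # Row 2: titles of results
        writer.writerow(urls)    # Row 3: URLs of results
        writer.writerow([])      # Row 4: blank separator
```

When writing to a real file, opening it with `open(path, "w", newline="")` lets the `csv` module control line endings, as the Python documentation recommends.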

Relevant files, including the Python script, text files, and CSV files, are located in:

E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\GoogleCrawler

