Google Crawler

{{Project
|Has project output=Tool
|Has sponsor=McNair Center
|Has title=Google Crawler
|Has owner=Anne Freeman,
|Has project status=Active
}}
 
==Background==
 
We wanted to create a web crawler that could collect data from Google searches specific to individual cities. The searches take the format "incubator" + "city, state". The crawler was modeled on a previous researcher's web crawler, which collected information on accelerators; we could not simply modify that crawler because it relied on an outdated Python module.
  
The output from this crawler could be used in several ways:
# The URLs determined to be incubator websites can be input to the [[Listing Page Classifier]], which takes an incubator website URL and identifies which page contains the client company listing.
# The title text can be analyzed using n-grams to look for keywords in order to classify the URL as an incubator. This strategy is discussed in [[Geocoding Inventor Locations (Tool)]]; a small sketch of the idea follows this list.
# Key elements of a page's HTML can be fed into an adapted version of the [[Demo Day Page Google Classifier]] to identify demo day webpages that contain a list of cohort companies.
# The page can be passed to Amazon's [https://www.mturk.com/ Mechanical Turk] to outsource the task of classifying pages as incubators.
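For the second use above, here is a minimal, illustrative sketch of n-gram keyword matching on result titles. The keyword list and the choice of unigrams and bigrams are assumptions for illustration, not taken from the project code.
 # Illustrative only: flag a search-result title as incubator-related when any
 # of its word n-grams matches a (hypothetical) keyword list.
 KEYWORDS = {("incubator",), ("business", "incubator"), ("startup", "incubator")}
 def ngrams(words, n):
     """Return the word n-grams of a list of words as tuples."""
     return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
 def looks_like_incubator(title):
     words = title.lower().split()
     return any(g in KEYWORDS for g in ngrams(words, 1) + ngrams(words, 2))
 print(looks_like_incubator("Houston Business Incubator - About Us"))  # prints True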
 
  
==Selenium Implementation==
The Selenium implementation of the crawler requires a downloaded ChromeDriver. The crawler opens a text file containing a list of locations in the format "city, state", with each entry separated by a newline. It prepends the Google search query domain "https://www.google.com/search?q=" to the key term "incubator" and appends the city and state name, using Google's escape characters for commas and spaces. The crawler then uses the ChromeDriver-controlled browser to access each URL and parse the results for each location. Its default is to parse 10 pages of results, meaning that approximately 100 lines of data are collected for each location.
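Below is a minimal sketch of this approach. The driver path, the input file name, and the CSS selectors for Google's result markup are assumptions for illustration (Google's markup changes frequently); the actual script is incubator_selenium_scrape.py.
 # Sketch of the Selenium approach (Selenium 3.x API); selectors are assumptions.
 import urllib.parse
 from selenium import webdriver
 BASE = "https://www.google.com/search?q="
 def build_query(location, term="incubator"):
     # quote_plus escapes spaces as "+" and commas as "%2C"
     return BASE + urllib.parse.quote_plus(term + " " + location)
 driver = webdriver.Chrome(executable_path="/path/to/chromedriver")  # hypothetical path
 with open("locations.txt") as f:  # hypothetical input file: one "city, state" per line
     locations = [line.strip() for line in f if line.strip()]
 for location in locations:
     driver.get(build_query(location))
     for result in driver.find_elements_by_css_selector("div.g"):
         titles = result.find_elements_by_css_selector("h3")
         links = result.find_elements_by_css_selector("a")
         if titles and links:
             print(location, titles[0].text, links[0].get_attribute("href"), sep="\t")
 driver.quit()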
 
Relevant files, including the Python script and text files, are located in
 E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\SeleniumScraper
 
==Beautiful Soup Implementation==
When we created the web crawler, our first implementation used Beautiful Soup to directly "request" the URL. The crawler took the same input file (city, state on each line, separated by newlines) and formatted queries in the same manner. Then, using Beautiful Soup, the script opened each of the generated URLs and parsed the resulting page to collect the titles and URLs of the results. The data collected was stored in a tab-separated text file, with each row containing the city, state, title of the result, and URL.
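A minimal sketch of this approach follows. The HTML selectors and file names are assumptions for illustration, and request headers are omitted; as noted below, Google frequently blocks direct requests like these. The actual script is incubator_scrape_data.py.
 # Sketch of the Beautiful Soup approach: request the results page directly and
 # parse out result titles and URLs. Google often blocks requests like this.
 import csv
 import urllib.parse
 import requests
 from bs4 import BeautifulSoup
 BASE = "https://www.google.com/search?q="
 with open("locations.txt") as f:  # hypothetical input file: one "city, state" per line
     locations = [line.strip() for line in f if line.strip()]
 with open("results.tsv", "w", newline="") as out:  # hypothetical tab-separated output
     writer = csv.writer(out, delimiter="\t")
     for location in locations:
         url = BASE + urllib.parse.quote_plus("incubator " + location)
         soup = BeautifulSoup(requests.get(url).text, "html.parser")
         for h3 in soup.find_all("h3"):  # result titles; markup varies
             a = h3.find_parent("a")
             if a and a.get("href"):
                 writer.writerow([location, h3.get_text(), a["href"]])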
 
Relevant files, including the Python script and text files, are located in
 E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\GoogleCrawler
 
This crawler was frequently blocked, as it performed queries against Google directly and parsed the results with Beautiful Soup. Additionally, this implementation would only collect eight results for each location. To prevent the crawler from being blocked and to collect more results, we decided to switch to Selenium.
 
== Things to note/What needs work ==
The scraper coded using Beautiful Soup does not work; it is frequently blocked by Google. The scraper coded using Selenium navigates directly to the search-results URL rather than typing the search term into the search box and hitting enter. The Selenium script also does not yet collect results from multiple pages; at the moment it appears to collect results only from the first page. A sketch of one possible fix follows.
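The sketch below shows one way the Selenium script could instead type the query into Google's search box and click through multiple result pages. The element names and selectors (the "q" input and the "pnnext" next-page link) are assumptions about Google's markup, not part of the existing script.
 # Sketch: type the query into the search box and step through result pages,
 # rather than loading the search URL directly. Selectors are assumptions.
 from selenium import webdriver
 from selenium.webdriver.common.keys import Keys
 driver = webdriver.Chrome(executable_path="/path/to/chromedriver")  # hypothetical path
 driver.implicitly_wait(5)
 driver.get("https://www.google.com/")
 box = driver.find_element_by_name("q")  # Google's search input is named "q"
 box.send_keys('incubator "Austin, TX"')
 box.send_keys(Keys.ENTER)
 for page in range(10):  # collect up to ten pages of results
     for title in driver.find_elements_by_css_selector("div.g h3"):
         print(title.text)
     next_links = driver.find_elements_by_id("pnnext")  # "Next" link, if present
     if not next_links:
         break
     next_links[0].click()
 driver.quit()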
 
== How to Run ==
The scripts incubator_scrape_data.py and incubator_selenium_scrape.py were coded on a Mac in a virtualenv using Python 3.6.5.
The following packages were loaded into the environment for the Selenium script:
* numpy 1.16.2
* pandas 0.24.2
* pip 19.1.1
* python-dateutil 2.8.0
* pytz 2019.1
* selenium 3.141.0
* setuptools 41.0.0
* six 1.12.0
* urllib3 1.24.1
* wheel 0.33.1
 
==Five Cities==

We retrieved the first 10 pages of results for each city in our 'five' cities. These included:
*Washington, DC and surrounds:
**Arlington VA
**Alexandria VA
**Crystal City VA
**Fairfax VA
**Washington DC
**Springfield MD
**Bethesda MD
**Gaithersburg MD
**Rockville MD
**Frederick MD
*Burlington VT
*Boulder, CO and select other CO cities:
**Boulder CO
**Colorado Springs CO
**Fort Collins CO
*The Twin Cities and adjacent city:
**St. Paul MN
**Minneapolis MN
**Bloomington MN
*Austin TX
