| Google Crawler | |
| --- | --- |
| Project Information | |
| Project Title | Google Crawler |
| Owner | Anne Freeman |
| Start Date | |
| Deadline | |
| Primary Billing | |
| Notes | |
| Has project status | Active |
Background
We wanted to create a Google web crawler that could collect data from web searches specific to individual cities, with each search taking the form "incubator" + "city, state". It was modeled on a previous researcher's web crawler, which collected information on accelerators. We could not simply modify that crawler because it relied on an outdated Python module.
Implementation
The crawler opens a text file containing a list of locations in the format "city, state", one entry per line. For each location, it builds a search URL by prepending the Google search query domain "https://www.google.com/search?q=" to the key term "incubator" followed by the city and state name, percent-encoding the commas and spaces as Google expects. Then, using BeautifulSoup, the script opens each generated URL and parses the resulting page to collect the titles and URLs of the search results. The titles and URLs are stored in a CSV file in the following format (a code sketch follows the list below):
- first row: city, state
- second row: titles of results
- third row: urls of results
- fourth row: blank
This four-row pattern repeats for each "city, state" query.
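The following is a minimal sketch of that pipeline using the requests and beautifulsoup4 packages; it is not the original script. The input/output file names, the User-Agent header, the delay between requests, and especially the div.g/h3 result selectors are assumptions for illustration, since Google's result markup changes frequently.

```python
import csv
import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus

# Prefix described above; quote_plus() percent-encodes the comma (%2C)
# and turns spaces into "+", matching Google's query escaping.
SEARCH_PREFIX = "https://www.google.com/search?q="

def crawl(locations_path, output_path):
    # One "city, state" entry per line in the input file.
    with open(locations_path) as f:
        locations = [line.strip() for line in f if line.strip()]

    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        for loc in locations:
            url = SEARCH_PREFIX + quote_plus("incubator " + loc)
            page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
            soup = BeautifulSoup(page.text, "html.parser")

            titles, links = [], []
            # ASSUMPTION: "div.g" / "h3" reflect one historical layout of
            # Google's result markup; these selectors will need updating
            # as that markup changes.
            for result in soup.select("div.g"):
                heading = result.find("h3")
                anchor = result.find("a", href=True)
                if heading and anchor:
                    titles.append(heading.get_text())
                    links.append(anchor["href"])

            # Four-row block per location, matching the format above.
            writer.writerow(loc.split(", "))   # first row: city, state
            writer.writerow(titles)            # second row: titles
            writer.writerow(links)             # third row: urls
            writer.writerow([])                # fourth row: blank
            time.sleep(2)  # polite delay; Google throttles rapid queries

if __name__ == "__main__":
    crawl("locations.txt", "results.csv")
```

One design caveat: Google rate-limits and may block scripted searches, so a delay between requests (and possibly CAPTCHA handling) is usually necessary when running over a long list of cities.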
Relevant files, including the Python script, text files, and CSV files, are located in
E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\GoogleCrawler