INBIA

| Project Information | |
|---|---|
| Has title | INBIA |
| Has owner | Anne Freeman |
| Has start date | |
| Has deadline date | |
| Has project status | Active |
| Dependent(s) | Incubator Seed Data |
| Has sponsor | McNair Center |
| Has project output | Data, Tool |
Initial Review of INBIA
The International Business Innovation Association (INBIA, https://inbia.org/) has a directory (http://exchange.inbia.org/network/findacompany) that lists 415 incubators within the United States. Each directory entry links reliably to a secondary page within the INBIA domain containing the incubator's name, address, a link to the home page of its website, and key contact information. The secondary pages share the same HTML structure and contain reliable data, making INBIA an ideal candidate for web crawling to collect data from the internal pages.
See Wiki Page Table for more details on source evaluations.
Retrieve URLs from INBIA Directory
We retrieved the INBIA data as follows:
- Go to http://exchange.inbia.org/network/findacompany/ and search for US
- Change the display to 100 results per page
- Save the HTML page of results 0-100
- Choose the next page and save the HTML page of results 100-200
- Sort Z-A
- Save the HTML page of results 418-318
- Choose the next page and save the HTML page of results 318-218
- Note that we are missing some entries that start with L and M
- Search for US L, choose the page with L as the first letter, and save its HTML
- Search for US M, choose the page with M as the first letter, and save its HTML
Then process each of those HTML files with regular expressions in TextPad:
- Search for .*biobubblekey and replace with #
- Search for ^[^#].*\n and replace with nothing
- Search for .*href=\" and replace with nothing
- Search for <\/a> and replace with nothing
- Search for \"> and replace with \t
Then combine the files, remove duplicates, reorder the columns, and sort. This results in a file without headers where the lines look like:
1863 Ventures/Project 500 /?c=companyprofile&UserKey=4794e0a6-3f61-4357-a1cb-513baf00957e
4th Sector Innovations /?c=companyprofile&UserKey=cc47b04e-1c2a-4019-88b3-05d1163a0d6a
712 Innovations /?c=companyprofile&UserKey=531ad600-e11a-4c74-9f37-bace816b9325
AccelerateHER /?c=companyprofile&UserKey=3c05d1c1-91b5-48ae-8ec3-c77765b10c2b
ACTION Innovation Network /?c=companyprofile&UserKey=5ac08dd0-364d-47b2-8de0-a7536a3b4802
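The combine, deduplicate, reorder, and sort step could also be scripted. Here is a short sketch that takes the per-page files from the previous sketch; the glob pattern and output file name are illustrative assumptions.

```python
import glob

# Combine the processed per-page files, swap the columns so the incubator
# name comes first, drop duplicates, and sort. File names are assumptions.
rows = set()
for path in glob.glob("inbia_results_*.txt"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if "\t" not in line:
                continue
            url, name = line.split("\t", 1)
            rows.add((name, url))

with open("inbia_urls.txt", "w", encoding="utf-8") as f:
    for name, url in sorted(rows):
        f.write(name + "\t" + url + "\n")
```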
We can now build a crawler that calls http://exchange.inbia.org/network/findacompany/ followed by the URL extension (with the ampersand either HTML-encoded as &amp; or left as a plain &). For example, http://exchange.inbia.org/network/findacompany/?c=companyprofile&UserKey=da2dbe35-9afa-4141-9b31-4e2cfd46a5aa gets the company page for the Cambridge Innovation Center.
We can then extract the contact information, including the website URL, and the key people, using either Beautiful Soup or regular expressions, as sketched below.
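A minimal sketch for a single profile page, using requests and Beautiful Soup. The heading tag and the external-link filter are assumptions for illustration only; the real tag structure of the profile pages would need to be inspected.

```python
import requests
from bs4 import BeautifulSoup

# The Cambridge Innovation Center example URL from above.
url = ("http://exchange.inbia.org/network/findacompany/"
       "?c=companyprofile&UserKey=da2dbe35-9afa-4141-9b31-4e2cfd46a5aa")

resp = requests.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Illustrative extraction only: the heading tag and the external-link filter
# are assumptions that would need checking against the live page HTML.
name_tag = soup.find("h1")
company_name = name_tag.get_text(strip=True) if name_tag else ""

external_links = [a["href"] for a in soup.find_all("a", href=True)
                  if a["href"].startswith("http") and "inbia.org" not in a["href"]]
website = external_links[0] if external_links else ""

print(company_name, website)
```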
Retrieve Data from URLs Generated
We wrote a web crawler that:

- reads the CSV file containing the URLs to scrape into a pandas dataframe
- rewrites the URLs by replacing ?c=companyprofile& with companyprofile? and prepending the domain http://exchange.inbia.org/network/findacompany to each URL
- opens each URL and extracts information using an element tree parser
- collects the information from each URL and stores it in a txt file
The crawler generates a tab-separated text file called INBIA_data.txt with the columns [company_name, street_address, city, state, zipcode, country, website], populated with information from the 415 entries in the database.
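A rough sketch of that pipeline is below. It uses Beautiful Soup rather than the element tree parser mentioned above, and the input file name, its column names, and the field extraction are assumptions; the actual inbia_scrape.py may differ.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE = "http://exchange.inbia.org/network/findacompany"
COLUMNS = ["company_name", "street_address", "city", "state",
           "zipcode", "country", "website"]

def parse_profile(soup):
    """Hypothetical field extraction; the real tags and classes on the
    profile pages would need to be confirmed against the live HTML."""
    record = dict.fromkeys(COLUMNS, "")
    name_tag = soup.find("h1")
    record["company_name"] = name_tag.get_text(strip=True) if name_tag else ""
    return record

# Read the scraped URLs; here we assume the tab-separated file produced in
# the earlier sketch (the project's actual CSV file name isn't given above).
urls = pd.read_csv("inbia_urls.txt", sep="\t", names=["name", "url"])

records = []
for ext in urls["url"]:
    # /?c=companyprofile&UserKey=...  becomes  BASE/companyprofile?UserKey=...
    full_url = BASE + ext.replace("?c=companyprofile&", "companyprofile?")
    resp = requests.get(full_url)
    if resp.ok:
        records.append(parse_profile(BeautifulSoup(resp.text, "html.parser")))

# Write the tab-separated output file described above.
pd.DataFrame(records, columns=COLUMNS).to_csv("INBIA_data.txt", sep="\t", index=False)
```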
The txt file and the Python script (inbia_scrape.py) are located in:
E:\projects\Kauffman Incubator Project\01 Classify entrepreneurship ecosystem organizations\INBIA
How to Run
The script inbia_scrape.py was coded in a virtualenv on a Mac, using Python 3.6.5. The following packages were loaded in that virtualenv:
- beautifulsoup4 4.7.1
- certifi 2019.3.9
- chardet 3.0.4
- idna 2.8
- numpy 1.16.2
- pandas 0.24.2
- pip 19.1.1
- python-dateutil 2.8.0
- pytz 2018.9
- requests 2.21.0
- setuptools 40.8.0
- six 1.12.0
- soupsieve 1.9
- urllib3 1.24.1
- wheel 0.33.1