Revision as of 10:31, 3 April 2019


McNair Project
INBIA
Project Information
Project Title INBIA
Owner Anne Freeman
Start Date
Deadline
Primary Billing
Notes
Has project status Active
Subsumes: Incubator Seed Data, Ecosystem Organization Classifier
Copyright © 2016 edegan.com. All Rights Reserved.


The International Business Innovation Association (INBIA, https://inbia.org/) has a directory (http://exchange.inbia.org/network/findacompany) containing information on 415 incubators in the United States.


INBIA

We retrieved the INBIA data as follows:

  1. Go to http://exchange.inbia.org/network/findacompany/ and search for US
  2. Change to 100 results per page
  3. Save the HTML page of results 0-100
  4. Choose the next page and save the HTML page of results 100-200
  5. Sort Z-A
  6. Save the HTML page of results 418-318
  7. Choose the next page and save the HTML page of results 318-218
  8. Note that this leaves a gap: the A-Z pages cover results 0-200 and the Z-A pages cover 418-218, so results 200-218 (some names starting with L and M) are missing
  9. Search for US L, choose the page with L as the first letter, and save the HTML of L
  10. Search for US M, choose the page with M as the first letter, and save the HTML of M

Then process each of those HTML files with the following regular-expression replacements in TextPad, applied in order:

  • Search: .*biobubblekey    Replace with: #
  • Search: ^[^#].*\n    Replace with: nothing (delete the match)
  • Search: .*href=\"    Replace with: nothing (delete the match)
  • Search: <\/a>    Replace with: nothing (delete the match)
  • Search: \">    Replace with: \t
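The same five replacements can be sketched in Python. This is a minimal sketch, not the procedure actually used: the function name extract_links is hypothetical, and it assumes (as the TextPad steps imply) that each directory entry sits on a single line containing the marker biobubblekey followed by one link.

```python
import re


def extract_links(html_text):
    """Apply the five search/replace steps to each line of a saved page.

    Returns rows of the form "url_extension<TAB>name".
    Assumption: every entry is on one line that contains "biobubblekey".
    """
    rows = []
    for line in html_text.splitlines():
        # Step 1: collapse everything up to the marker into "#".
        marked = re.sub(r".*biobubblekey", "#", line)
        # Step 2: discard lines that were not marked.
        if not marked.startswith("#"):
            continue
        # Step 3: delete everything up to and including href=".
        marked = re.sub(r'.*href="', "", marked)
        # Step 4: delete the closing anchor tag.
        marked = marked.replace("</a>", "")
        # Step 5: turn the end of the opening tag into a tab separator.
        marked = marked.replace('">', "\t")
        rows.append(marked)
    return rows
```

Each returned row has the URL extension first and the name second, which is why the columns are swapped in the next step.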

Then combine the files, remove duplicates, swap the columns so that the name comes first, and sort. The result is a tab-separated file without headers whose lines look like:

1863 Ventures/Project 500	/?c=companyprofile&UserKey=4794e0a6-3f61-4357-a1cb-513baf00957e	
4th Sector Innovations	/?c=companyprofile&UserKey=cc47b04e-1c2a-4019-88b3-05d1163a0d6a	
712 Innovations	/?c=companyprofile&UserKey=531ad600-e11a-4c74-9f37-bace816b9325	
AccelerateHER	/?c=companyprofile&UserKey=3c05d1c1-91b5-48ae-8ec3-c77765b10c2b	
ACTION Innovation Network	/?c=companyprofile&UserKey=5ac08dd0-364d-47b2-8de0-a7536a3b4802	
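A possible Python version of the combine, de-duplicate, reorder and sort step is sketched below. The function name combine_rows is hypothetical, the input rows are assumed to be "url_extension<TAB>name" strings, and the case-insensitive sort is an inference from the sample lines above (AccelerateHER precedes ACTION Innovation Network).

```python
def combine_rows(*row_lists):
    """Merge several lists of "url<TAB>name" rows into sorted (name, url) pairs.

    Duplicates across the overlapping saved pages are dropped by using a set.
    """
    seen = set()
    for rows in row_lists:
        for row in rows:
            url, name = row.split("\t", 1)
            # Store name first so the columns come out in the final order.
            seen.add((name.strip(), url.strip()))
    # Case-insensitive sort by name, with the URL as a tie-breaker.
    return sorted(seen, key=lambda pair: (pair[0].lower(), pair[1]))
```

Writing each pair back out with a tab between the two fields would reproduce the file layout shown above.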

We can now build a crawler that requests each company page by appending the URL extension from the file to http://exchange.inbia.org/network/findacompany (the extension can be left encoded, or &amp; can be replaced with a plain &). For example, http://exchange.inbia.org/network/findacompany/?c=companyprofile&UserKey=da2dbe35-9afa-4141-9b31-4e2cfd46a5aa returns the company page for Cambridge Innovation Center.
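As an illustration of the URL construction only (no request is made), here is a small Python sketch. The names BASE_URL and profile_url are hypothetical; html.unescape from the standard library converts &amp; back to a plain &.

```python
import html

# Base address of the directory, taken from the text above.
BASE_URL = "http://exchange.inbia.org/network/findacompany"


def profile_url(extension):
    """Build a company-page URL from an extension such as
    "/?c=companyprofile&amp;UserKey=...".

    Works whether or not the ampersand is still HTML-encoded.
    """
    decoded = html.unescape(extension)
    # Normalise slashes so exactly one separates the base and the extension.
    return BASE_URL.rstrip("/") + "/" + decoded.lstrip("/")
```

The resulting string can be passed to whichever HTTP client the crawler uses.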

We can then extract the contact information (including the URL) and the people from each company page, using either Beautiful Soup or regular expressions.
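The layout of the company pages is not described here, so the following is only a rough sketch of the extraction idea. It uses Python's standard-library html.parser rather than Beautiful Soup to stay self-contained, and it assumes that contact details appear as ordinary mailto: and http(s) links. The names ContactLinkParser and extract_contacts are hypothetical, and pulling out the people would need page-specific rules that are not attempted.

```python
from html.parser import HTMLParser


class ContactLinkParser(HTMLParser):
    """Collect e-mail addresses and absolute web links from anchor tags."""

    def __init__(self):
        super().__init__()
        self.emails = []
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        lowered = href.lower()
        if lowered.startswith("mailto:"):
            # Keep only the address part of a mailto: link.
            self.emails.append(href[len("mailto:"):])
        elif lowered.startswith(("http://", "https://")):
            self.urls.append(href)


def extract_contacts(page_html):
    """Return the e-mail addresses and absolute URLs found in a page."""
    parser = ContactLinkParser()
    parser.feed(page_html)
    return {"emails": parser.emails, "urls": parser.urls}
```

In practice the link list would still need filtering, since navigation and footer links on the page are collected along with the incubator's own website.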