Difference between revisions of "Crunchbase Data"

From edegan.com
Jump to navigation Jump to search
 
(32 intermediate revisions by 5 users not shown)
Line 1: Line 1:
{{McNair Projects
+
{{Project
 +
|Has project output=Data
 +
|Has sponsor=McNair Center
 
|Has title=Crunchbase Data
 
|Has title=Crunchbase Data
|Has owner=Adrian Smart,
+
|Has owner=Adrian Smart, Grace Tan, Maxine Tao, Connor Rothschild,
 
|Has start date=June 2017
 
|Has start date=June 2017
 
|Has keywords=Data, Tool, Crunchbase, VC, Angel
 
|Has keywords=Data, Tool, Crunchbase, VC, Angel
 
|Has project status=Complete
 
|Has project status=Complete
 +
|Is dependent on=Crunchbase Accelerator Equity, Crunchbase Accelerator Founders,
 
}}
 
}}
==Files and dbases==
+
==New Work==
 +
 
 +
File Location:
 +
E://McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx
 +
 
 +
This file is an update of Crunchbase data to correspond with our updated list of accelerators. That updated list of 166 accelerators can be found at:
 +
E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx
 +
 
 +
We used SQL to match a huge list of companies downloaded from Crunchbase to our list of accelerators.
 +
 
 +
===Downloading Data===
 +
The download script is written in perl. It downloads from version 3.1 of the Crunchbase API. It is called downloadScriptv3.1 and located in  E:\McNair\Software\Database Scripts\Crunchbase2. You can execute it by typing "perl downloadScript.pl" in terminal.
 +
Use this key:
 +
  662e263576fe3e4ea5991edfbcfb9883
 +
 
 +
A full list of tables in the database called crunchbase2:
 +
organizations
 +
organization_descriptions
 +
people
 +
people_descriptions
 +
degrees
 +
funding_rounds
 +
investments
 +
investment_partners
 +
investors
 +
funds
 +
acquisitions
 +
ipos
 +
events
 +
event_appearances
 +
jobs
 +
category_groups
 +
org_parents
 +
 
 +
===Commands===
 +
 
 +
Note that these should be done in command line not in the database
 +
 
 +
To open a couple lines of a file:
 +
  head organizations-clean.txt -n 3
 +
 
 +
To replace "" with nothing in all of a file:
 +
  cat organizations.csv | sed s/\"\"//g > organizations-clean.txt
 +
 
 +
 
 +
===Files===
 +
 
 +
E:\McNair\Software\DatabaseScripts\Crunchbase2
 +
 
 +
Database is called crunchbase2.
 +
 
 +
In making a table of accelerators and their UUIDs, some match multiple times if you match only on accelerator name. The file(Accelerator Multiple Matches) of those that match multiple times is in:
 +
  E:\McNair\Projects\Accelerators\Summer2018
 +
 
 +
If you match on accelerator name and/or accelerator URL, the results can be found in the same location above. It is called 'Accelerators Matched by Name and Homepage URL'. The first sheet includes the results by matching accelerators to crunchbase on organization names AND urls. The second sheet includes the results by matching accelerators to crunchbase data on organization names OR urls. This second sheet also includes results that are manually added in.
 +
For a list of accelerators, their matches from the Crunchbase data, and their UUIDs, find the 'Accelerators and UUIDs' file also in the same location as above. This file contains trailing whitespace that has been removed and saved in the file 'Accelerators and UUIDs - clean.txt'.
 +
 +
 
 +
For the analysis script that obtains the results described above, see 'Analysis.sql' in:
 +
  E:\McNair\Software\Database Scripts\Crunchbase2
 +
 
 +
===Collecting Company Information===
 +
Code to build tables is here:
 +
E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript
 +
 
 +
Text file versions of the tables are located in the Z drive. The database is called:
 +
crunchbase2
 +
 
 +
Crunchbase may have blanks and random quotation marks as entries. I had to clean this on textpad.
 +
 
 +
<br>All crunchbase companies and their UUIDs are here:
 +
E:\McNair\Projects\Accelerators\Summer 2018\CB Company UUIDs
 +
 
 +
A list of UUIDs for the companies we are interested in can be found in the same place as above. It is called 'Our Company UUID matches'. SQL code to get matches is in the CompanyMatchScript mentioned above. The perl matches were done using Matcher.pl. The single match sheet for SQl contains companies that do not have multiple entries in crunchbase. This is also the case for the Perl single match sheet, and it has also been filtered to remove those that were flagged as multiple matches by Matcher.pl
 +
 
 +
A list for companies, UUIDs, and all other informations is also in the Summer 2018 folder. It is called 'FINAL LIST'. THIS SHEET HAS DUPLICATES THAT ARE IN THE SHEET 'DUPLICATE COMPANIES'.
 +
 
 +
<br> The definite, final list is:
 +
E:\McNair\Projects\Accelerators\Summer 2018\The File to Rule Them All.xlsx
 +
 
 +
==Old Notes from previous work==
 +
 
 +
===Files and dbases===
  
 
The dbase is:
 
The dbase is:
Line 15: Line 100:
 
  Z:\Crunchbase\CrunchbaseData
 
  Z:\Crunchbase\CrunchbaseData
  
==To do==
+
===To do===
 
#Download the data
 
#Download the data
 
#Extract top 5 lines
 
#Extract top 5 lines
Line 31: Line 116:
 
Where, what, etc.
 
Where, what, etc.
  
==Importing the data==
+
===Importing the data===
  
 
To import the data make sure that all 22 crunchbase csv files are on the db server in /bulk/crunchbase/crunchbaseData. Also make sure that the load_crunchbase.sql script is in this directory. Run psql crunchbasebulk to start the db from this directory. Run the command "\i load_crunchbase.sql" to run the script. This will load the contents of the 22 csv files into the db. Check that the number of lines copied into each table matches the actual lines in the csv file. The line numbers have been included in the comments of the load_crunchbase.sql script. See issues section for unexpected results.
 
To import the data make sure that all 22 crunchbase csv files are on the db server in /bulk/crunchbase/crunchbaseData. Also make sure that the load_crunchbase.sql script is in this directory. Run psql crunchbasebulk to start the db from this directory. Run the command "\i load_crunchbase.sql" to run the script. This will load the contents of the 22 csv files into the db. Check that the number of lines copied into each table matches the actual lines in the csv file. The line numbers have been included in the comments of the load_crunchbase.sql script. See issues section for unexpected results.
Line 51: Line 136:
 
Same thing for people - people end point then permalink
 
Same thing for people - people end point then permalink
  
==Accelerator Founders Data==
+
===Accelerator Founders Data===
  
 
The Crunchbase API can be used to readily access Founders Data. The API is used in the web format: https://api.crunchbase.com/v/3/organizations/company_name/?user_key=662e263576fe3e4ea5991edfbcfb9883, to get a JSON object with alot of data.
 
The Crunchbase API can be used to readily access Founders Data. The API is used in the web format: https://api.crunchbase.com/v/3/organizations/company_name/?user_key=662e263576fe3e4ea5991edfbcfb9883, to get a JSON object with alot of data.
Line 72: Line 157:
 
The next step is to match each founder name with a linkedin profile. These profiles can be accessed using our LinkedIn Crawler to gather more information about each founder. The results of matching the founders with their linkedin profiles can be found at:
 
The next step is to match each founder name with a linkedin profile. These profiles can be accessed using our LinkedIn Crawler to gather more information about each founder. The results of matching the founders with their linkedin profiles can be found at:
 
  E:\McNair\Projects\Accelerators\founders_linkedin.txt
 
  E:\McNair\Projects\Accelerators\founders_linkedin.txt
 +
 +
===Funded by Accelerators===
 +
Ben Baldazo : I used this database and coded in "Z:\bulk\crunchbase\AccFunding.psql" to create a table connecting accelerators (just names) from [[Accelerator Seed List (Data)]] to companies that include the acclerator as an investor. Table was joined using uuids and contains names, domains, some dates (founding), and funding amounts & each column has a suffix that labels what the data regards to (a.k.a. what table it came from)
 +
*More notation and the coding is available in "Z:\bulk\crunchbase\AccFunding.psql"

Latest revision as of 12:41, 21 September 2020


Project
Crunchbase Data
Project logo 02.png
Project Information
Has title Crunchbase Data
Has owner Adrian Smart, Grace Tan, Maxine Tao, Connor Rothschild
Has start date June 2017
Has deadline date
Has keywords Data, Tool, Crunchbase, VC, Angel
Has project status Complete
Is dependent on Crunchbase Accelerator Equity, Crunchbase Accelerator Founders
Dependent(s): U.S. Seed Accelerators
Has sponsor McNair Center
Has project output Data
Copyright © 2019 edegan.com. All Rights Reserved.

New Work

File Location:

E://McNair/Projects/Accelerators/Summer 2018/Accelerators and UUIDs.xlsx

This file is an update of Crunchbase data to correspond with our updated list of accelerators. That updated list of 166 accelerators can be found at:

E://McNair/Projects/Accelerators/Summer 2018/Connor Accelerator Work/Accelerator Master Variable List - Revised by Ed V2.xlsx

We used SQL to match a huge list of companies downloaded from Crunchbase to our list of accelerators.

Downloading Data

The download script is written in perl. It downloads from version 3.1 of the Crunchbase API. It is called downloadScriptv3.1 and located in E:\McNair\Software\Database Scripts\Crunchbase2. You can execute it by typing "perl downloadScript.pl" in terminal. Use this key:

 662e263576fe3e4ea5991edfbcfb9883

A full list of tables in the database called crunchbase2:

organizations
organization_descriptions
people
people_descriptions
degrees
funding_rounds
investments
investment_partners
investors
funds
acquisitions
ipos
events
event_appearances
jobs
category_groups
org_parents

Commands

Note that these should be done in command line not in the database

To open a couple lines of a file:

 head organizations-clean.txt -n 3

To replace "" with nothing in all of a file:

 cat organizations.csv | sed s/\"\"//g > organizations-clean.txt


Files

E:\McNair\Software\DatabaseScripts\Crunchbase2

Database is called crunchbase2.

In making a table of accelerators and their UUIDs, some match multiple times if you match only on accelerator name. The file(Accelerator Multiple Matches) of those that match multiple times is in:

  E:\McNair\Projects\Accelerators\Summer2018

If you match on accelerator name and/or accelerator URL, the results can be found in the same location above. It is called 'Accelerators Matched by Name and Homepage URL'. The first sheet includes the results by matching accelerators to crunchbase on organization names AND urls. The second sheet includes the results by matching accelerators to crunchbase data on organization names OR urls. This second sheet also includes results that are manually added in. For a list of accelerators, their matches from the Crunchbase data, and their UUIDs, find the 'Accelerators and UUIDs' file also in the same location as above. This file contains trailing whitespace that has been removed and saved in the file 'Accelerators and UUIDs - clean.txt'.


For the analysis script that obtains the results described above, see 'Analysis.sql' in:

 E:\McNair\Software\Database Scripts\Crunchbase2

Collecting Company Information

Code to build tables is here:

E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript

Text file versions of the tables are located in the Z drive. The database is called:

crunchbase2

Crunchbase may have blanks and random quotation marks as entries. I had to clean this on textpad.


All crunchbase companies and their UUIDs are here:

E:\McNair\Projects\Accelerators\Summer 2018\CB Company UUIDs

A list of UUIDs for the companies we are interested in can be found in the same place as above. It is called 'Our Company UUID matches'. SQL code to get matches is in the CompanyMatchScript mentioned above. The perl matches were done using Matcher.pl. The single match sheet for SQl contains companies that do not have multiple entries in crunchbase. This is also the case for the Perl single match sheet, and it has also been filtered to remove those that were flagged as multiple matches by Matcher.pl

A list for companies, UUIDs, and all other informations is also in the Summer 2018 folder. It is called 'FINAL LIST'. THIS SHEET HAS DUPLICATES THAT ARE IN THE SHEET 'DUPLICATE COMPANIES'.


The definite, final list is:

E:\McNair\Projects\Accelerators\Summer 2018\The File to Rule Them All.xlsx

Old Notes from previous work

Files and dbases

The dbase is:

crunchbasebulk

The files are in:

E:\McNair\Software\Database Scripts\Crunchbase
Z:\Crunchbase\CrunchbaseData

To do

  1. Download the data
  2. Extract top 5 lines
  3. Build table specs
  4. Import the data

Issue importing events.csv. Line count does not match import count. Also typos in data for start_time and end_time which created import errors with time datatype. Switched to varchar. Issue importing organizations.csv. Line count is 515496 but on import got 515491. Last 4 lines in file are chinese characters which might affect import.

Downloading the data

Our user key is: 662e263576fe3e4ea5991edfbcfb9883

  1. The download script is written in perl. It is called downloadScript and located in E:\McNair\Software\Database Scripts\Crunchbase. You can execute it by typing "perl downloadScript.pl" in terminal.


Where, what, etc.

Importing the data

To import the data make sure that all 22 crunchbase csv files are on the db server in /bulk/crunchbase/crunchbaseData. Also make sure that the load_crunchbase.sql script is in this directory. Run psql crunchbasebulk to start the db from this directory. Run the command "\i load_crunchbase.sql" to run the script. This will load the contents of the 22 csv files into the db. Check that the number of lines copied into each table matches the actual lines in the csv file. The line numbers have been included in the comments of the load_crunchbase.sql script. See issues section for unexpected results.

Type \dt to get a list of the tables in the crunchbasebulk db. Type \q to quit db. When the db is not running you can type 'wc -l acquisitions.csv' to get a line count of this file.


News is a relationship query parameter

https://api.crunchbase.com/v/3/organizations?user_key=662e263576fe3e4ea5991edfbcfb9883&uuid=1e4f199c-363b-451b-a164-f94571075ee5

permalink is airbnb in this example:

https://api.crunchbase.com/v/3/organizations/airbnb/news?user_key=662e263576fe3e4ea5991edfbcfb9883

Same thing for people - people end point then permalink

Accelerator Founders Data

The Crunchbase API can be used to readily access Founders Data. The API is used in the web format: https://api.crunchbase.com/v/3/organizations/company_name/?user_key=662e263576fe3e4ea5991edfbcfb9883, to get a JSON object with alot of data.

One such field in the JSON object is Founders, which is followed by profiles of the founders for any given company. Not all the accelerators have a crunchbase page, but it is a good start.

The script for querying the API can be found at:

E:\McNair\Projects\Accelerators\crunchbase_founders.py

I queried the API for all the accelerators we have listed in the following file:

E:\McNair\Projects\Accelerators\Fall 2017\ListofAccs.txt

I retrieved a list of founders for the accelerators that returned results from Crunchbase. Out of the 269 accelerators we have on record, 136 of them turned up results for founders, resulting in 312 founders. The list of founders and their respective company can be found at:

E:\McNair\Projects\Accelerators\founder_names.txt

A table has also been added to the crunchbasebulk database called founders. The table contains 3 columns: Company, first_name, last_name. Another table, called founders_linkedin, contains 4 columns: Company, first_name, last_name, and linkedin_url.

The file has been exported to the McNair DB Server.

The next step is to match each founder name with a linkedin profile. These profiles can be accessed using our LinkedIn Crawler to gather more information about each founder. The results of matching the founders with their linkedin profiles can be found at:

E:\McNair\Projects\Accelerators\founders_linkedin.txt

Funded by Accelerators

Ben Baldazo : I used this database and coded in "Z:\bulk\crunchbase\AccFunding.psql" to create a table connecting accelerators (just names) from Accelerator Seed List (Data) to companies that include the acclerator as an investor. Table was joined using uuids and contains names, domains, some dates (founding), and funding amounts & each column has a suffix that labels what the data regards to (a.k.a. what table it came from)

  • More notation and the coding is available in "Z:\bulk\crunchbase\AccFunding.psql"