Difference between revisions of "Crunchbase Database"

From edegan.com
Jump to navigation Jump to search
 
(12 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
{{Project
 
{{Project
 +
|Has project output=Data,Tool,How-to
 +
|Has sponsor=Kauffman Incubator Project
 
|Has title=Crunchbase Database
 
|Has title=Crunchbase Database
 
|Has owner=Hiep Nguyen
 
|Has owner=Hiep Nguyen
Line 57: Line 59:
 
  data\people_descriptions.csv
 
  data\people_descriptions.csv
  
The sql script get_data.sql from last year is copied to the current Crunchbase3 directory. However, two databases are very different now and adjustments are necessary. To keep track of the data type from each csv file used to copy to sql tables, a file get_type.py is included in E:\projects\crunchbase3. This python script will print the first 5 rows of every data frame in the current directory.
+
To keep track of the data type from each csv file used to copy to the SQL database, a file get_type.py is included in E:\projects\crunchbase3. This python script will print the first 5 rows of every data frame in the current directory.
  
 
All the crunchbase3 data from drive E are now also in drive Z:/crunchbase3
 
All the crunchbase3 data from drive E are now also in drive Z:/crunchbase3
  
A version of crunchbase3 database is live on the postgresql in Z:/crunchbase3. However, a few csv files have not been copied to the SQL database because of data type errors, which is a small problem but Hiep will need to spend some time to fix that. Hiep will work on it next week (March 28th).
 
  
Right now, a modification of load_crunchbase.sql is in both Z:/crunchbase3 and E:/projects/crunchbase3. Changes in dataset, datatype, and data columns have been made a lot compared to the previous version. The columns that are not yet added to the postgresql db are noted inside two lines of ################'s in the sql script. Since the data has changed a lot compared to last year, using \i load_crunchbase.sql was not very useful, and one may need to copy one table at a time by pasting the sql script into the terminal.
+
Since the data will be changing a lot compared to previous years, using \i load_crunchbase.sql might not very useful, and one may need to copy one table at a time by pasting the sql script into the terminal.
  
Files that have not yet been copied to the postgresql server are
+
All the dataset (17 of them) from the API have been copied to the PostgreSQL server in drive Z under /bulk/crunchbase3. To make date-time format in postgres work properly, all the empty string with quotes ("") in CSV files have been replaced by NULL with the command line
   degrees.csv
+
   sed 's/""//g' file.csv >file_clean.csv
  events.csv
+
The script that I used to do that is in the file clean_data.sh in E:/projects/crunchbase3. A shorter script to do that for all the files in the directory is possible but might not be necessary and not all files require such edit.
  funding_round.csv
+
 
  funds.csv
+
==Working with the database==
  investors.csv
+
 
  ipos.csv
+
All the scripts in load_crunchbase.sql have been updated. It now works perfectly with the current data (as of 03/29/2019) crawled from crunchbaseAPI and includes the correct number of rows copied from the csv files at the end of each \COPY command.
  jobs.csv
+
 
  organizations.csv
+
To see and use the data in the postgres server:
  people.csv
+
 
 +
1) Connect to reseacher@199.188.177.215. A password is required ( ask Prof Egan for details)
 +
 
 +
2) Go to /bulk/crunchbase3
 +
  cd /bulk/crunchbase3
 +
3) Connect to the database
 +
  psql crunchbase3
 +
  \dt
 +
4) Perform regular SQL queries
 +
 
 +
==Incubators in Crunchbase==
 +
 
 +
\COPY (SELECT uuid, company_name, short_description FROM Organizations WHERE country_code='USA' AND short_description LIKE '%incubat%') TO
 +
'CrunchbaseShortOrgDescsUSAIncubat.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
 +
--466
 +
 +
\COPY (SELECT A.uuid, A.company_name, B.description FROM Organizations AS A JOIN organization_descriptions AS B on A.uuid=B.uuid WHERE
 +
country_code='USA' AND description LIKE '%incubat%') TO 'CrunchbaseLongOrgDescsUSAIncubat.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
 +
--933
 +
 
 +
The two queries above were run against the Crunchbase database (see [[Ecosystem Organization Classifier]]), then their results were manually reviewed in two xlsx files (CrunchbaseLongOrgDescsUSAIncubat_IncubatorScore and CrunchbaseShortOrgDescsUSAIncubat_IncubatorScore), stored in E:\projects\crunchbase3
 +
 
 +
These files were then combined into IncubatorsFromCrunchbase.xlsx providing they scored 1 in the Long file or were marked keep and did not score 0 (social impact or virtual) in the Short file. The file has 564 (not necessarily unique) records and the following columns:
 +
uuid company_name description Score Notes Source
 +
 
 +
RetrievingIncubators.sql was then modified to load this data, locate distinct UUIDs and output Organizational records. The resulting file is CrunchbaseIncubators.txt (456 unique records, all USA), which has the following fields:
 +
company_name uuid address city state_code region status domain category_list short_description

Latest revision as of 12:46, 21 September 2020


Project
Crunchbase Database
Project logo 02.png
Project Information
Has title Crunchbase Database
Has owner Hiep Nguyen
Has start date 2019/03/13
Has deadline date 2019/03/22
Has project status Active
Dependent(s): Ecosystem Organization Classifier, Incubator Seed Data
Has sponsor Kauffman Incubator Project
Has project output Data, Tool, How-to
Copyright © 2019 edegan.com. All Rights Reserved.


Files and Dbase

Files are in:

  • E:\projects\crunchbase3
  • Z:\crunchbase3

Dbase is crunchbase3

The old project page is Crunchbase Data. File locations listed as Z:/bulk/ should now be Z:/bulk/mcnair/. For example there is an old loadscript in /bulk/mcnair/crunchbase/crunchbaseData/load_crunchbase.sql


Crunchbase Pro

https://www.crunchbase.com/login

Login details:

  • mcnair@rice.edu getpasswordfromed

Getting and cleaning data

The url to make API calls is https://api.crunchbase.com/v3.1/csv_export/csv_export.tar.gz?user_key=[API KEY GOES HERE]

API key (premium) is located at E:\projects\crunchbase3

The command line (bash script) to get the data and extract the data (1.9gb) is at E:\projects\crunchbase3\get_data.sh

Alternatively, we can download and extract directly using windows command prompt by typing the following commands

curl -O https://api.crunchbase.com/v3.1/csv_export/csv_export.tar.gz?user_key=[API key goes here] \
      
tar -xvf csv_export.tar.gz_user_key=[API key goes here].

Current csv files from crunchbase data

data\acquisitions.csv
data\category_groups.csv
data\degrees.csv
data\events.csv
data\event_appearances.csv
data\funding_rounds.csv
data\funds.csv
data\investments.csv
data\investment_partners.csv
data\investors.csv
data\ipos.csv
data\jobs.csv
data\organizations.csv
data\organization_descriptions.csv
data\org_parents.csv
data\people.csv
data\people_descriptions.csv

To keep track of the data type from each csv file used to copy to the SQL database, a file get_type.py is included in E:\projects\crunchbase3. This python script will print the first 5 rows of every data frame in the current directory.

All the crunchbase3 data from drive E are now also in drive Z:/crunchbase3


Since the data will be changing a lot compared to previous years, using \i load_crunchbase.sql might not very useful, and one may need to copy one table at a time by pasting the sql script into the terminal.

All the dataset (17 of them) from the API have been copied to the PostgreSQL server in drive Z under /bulk/crunchbase3. To make date-time format in postgres work properly, all the empty string with quotes ("") in CSV files have been replaced by NULL with the command line

 sed 's/""//g' file.csv >file_clean.csv

The script that I used to do that is in the file clean_data.sh in E:/projects/crunchbase3. A shorter script to do that for all the files in the directory is possible but might not be necessary and not all files require such edit.

Working with the database

All the scripts in load_crunchbase.sql have been updated. It now works perfectly with the current data (as of 03/29/2019) crawled from crunchbaseAPI and includes the correct number of rows copied from the csv files at the end of each \COPY command.

To see and use the data in the postgres server:

1) Connect to reseacher@199.188.177.215. A password is required ( ask Prof Egan for details)

2) Go to /bulk/crunchbase3

  cd /bulk/crunchbase3

3) Connect to the database

  psql crunchbase3
  \dt

4) Perform regular SQL queries

Incubators in Crunchbase

\COPY (SELECT uuid, company_name, short_description FROM Organizations WHERE country_code='USA' AND short_description LIKE '%incubat%') TO 
'CrunchbaseShortOrgDescsUSAIncubat.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--466

\COPY (SELECT A.uuid, A.company_name, B.description FROM Organizations AS A JOIN organization_descriptions AS B on A.uuid=B.uuid WHERE 
country_code='USA' AND description LIKE '%incubat%') TO 'CrunchbaseLongOrgDescsUSAIncubat.txt' WITH DELIMITER AS E'\t' HEADER NULL AS  CSV
--933

The two queries above were run against the Crunchbase database (see Ecosystem Organization Classifier), then their results were manually reviewed in two xlsx files (CrunchbaseLongOrgDescsUSAIncubat_IncubatorScore and CrunchbaseShortOrgDescsUSAIncubat_IncubatorScore), stored in E:\projects\crunchbase3

These files were then combined into IncubatorsFromCrunchbase.xlsx providing they scored 1 in the Long file or were marked keep and did not score 0 (social impact or virtual) in the Short file. The file has 564 (not necessarily unique) records and the following columns:

uuid	company_name	description	Score	Notes	Source

RetrievingIncubators.sql was then modified to load this data, locate distinct UUIDs and output Organizational records. The resulting file is CrunchbaseIncubators.txt (456 unique records, all USA), which has the following fields:

company_name	uuid	address	city	state_code	region	status	domain	category_list	short_description