Crunchbase Database
Crunchbase Database | |
---|---|
Project Information | |
Has title | Crunchbase Database |
Has owner | Hiep Nguyen |
Has start date | 2019/03/13 |
Has deadline date | 2019/03/22 |
Has project status | Active |
Dependent(s): | Ecosystem Organization Classifier, Incubator Seed Data |
Copyright © 2019 edegan.com. All Rights Reserved. |
Files and Dbase
Files are in:
- E:\projects\crunchbase3
- Z:\crunchbase3
Dbase is crunchbase3
The old project page is Crunchbase Data. File locations listed as Z:/bulk/ should now be Z:/bulk/mcnair/. For example there is an old loadscript in /bulk/mcnair/crunchbase/crunchbaseData/load_crunchbase.sql
Crunchbase Pro
https://www.crunchbase.com/login
Login details:
- mcnair@rice.edu getpasswordfromed
Getting and cleaning data
The url to make API calls is https://api.crunchbase.com/v3.1/csv_export/csv_export.tar.gz?user_key=[API KEY GOES HERE]
API key (premium) is located at E:\projects\crunchbase3
The command line (bash script) to get the data and extract the data (1.9gb) is at E:\projects\crunchbase3\get_data.sh
Alternatively, we can download and extract directly using windows command prompt by typing the following commands
curl -O https://api.crunchbase.com/v3.1/csv_export/csv_export.tar.gz?user_key=[API key goes here] \ tar -xvf csv_export.tar.gz_user_key=[API key goes here].
Current csv files from crunchbase data
data\acquisitions.csv data\category_groups.csv data\degrees.csv data\events.csv data\event_appearances.csv data\funding_rounds.csv data\funds.csv data\investments.csv data\investment_partners.csv data\investors.csv data\ipos.csv data\jobs.csv data\organizations.csv data\organization_descriptions.csv data\org_parents.csv data\people.csv data\people_descriptions.csv
The sql script get_data.sql from last year is copied to the current Crunchbase3 directory. However, two databases are very different now and adjustments are necessary. To keep track of the data type from each csv file used to copy to sql tables, a file get_type.py is included in E:\projects\crunchbase3. This python script will print the first 5 rows of every data frame in the current directory.
All the crunchbase3 data from drive E are now also in drive Z:/crunchbase3
A version of crunchbase3 database is live on the postgresql in Z:/crunchbase3. However, a few csv files have not been copied to the SQL database because of data type errors, which is a small problem but Hiep will need to spend some time to fix that. Hiep will work on it next week (March 28th).
Right now, a modification of load_crunchbase.sql is in both Z:/crunchbase3 and E:/projects/crunchbase3. Changes in dataset, datatype, and data columns have been made a lot compared to the previous version. The columns that are not yet added to the postgresql db are noted inside two lines of ################'s in the sql script. Since the data has changed a lot compared to last year, using \i load_crunchbase.sql was not very useful, and one may need to copy one table at a time by pasting the sql script into the terminal.
Files that have not yet been copied to the postgresql server are
degrees.csv events.csv funding_round.csv funds.csv investors.csv ipos.csv jobs.csv organizations.csv people.csv
03/28/2019 UPDATE
All the dataset (17 of them) from the API has been copied to the PostgreSQL server in drive Z under /bulk/crunchbase3. To make date-time format in postgres works properly, all the empty string with quotes ("") in CSV files have been replaced by NULL with the command line
sed 's/""//g' file.csv >file_clean.csv
The script that I used to do that is in the file clean_data.sh in E:/projects/crunchbase3. A shorter script to do that for all the files in the directory is possible but might not be necessary and not all files require such edit.
All the scripts in load_crunchbase.sql have been updated. It now works perfectly with the current data crawled from crunchbaseAPI and includes the correct number of rows copied from the csv files at the end of each \COPY command. I have also double-checked each table by comparing the postgres version of the data and the pandas version of the data.
To see and use the data in the postgres sever:
1) Connect to reseacher@199.188.177.215. A password is required ( ask Prof Egan for details)
2) Go to /bulk/crunchbase3
cd /bulk/crunchbase3
3) Connect to the database
psql crunchbase3 \dt
4) Perform regular SQL queries