Difference between revisions of "Merging Existing Data with Crunchbase"
Maxine.tao (talk | contribs) |
|||
(13 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
− | {{McNair | + | {{Project |
+ | |Has project output=Data | ||
+ | |Has sponsor=McNair Center | ||
|Has title=Merging Existing Data with Crunchbase | |Has title=Merging Existing Data with Crunchbase | ||
|Has owner=Connor Rothschild | |Has owner=Connor Rothschild | ||
Line 146: | Line 148: | ||
Doing training data - 2,600 pages and are a little bit more than 1/2 way (~1500-1600). | Doing training data - 2,600 pages and are a little bit more than 1/2 way (~1500-1600). | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− |
Latest revision as of 12:41, 21 September 2020
Merging Existing Data with Crunchbase | |
---|---|
Project Information | |
Has title | Merging Existing Data with Crunchbase |
Has owner | Connor Rothschild |
Has start date | |
Has deadline date | |
Has project status | |
Has sponsor | McNair Center |
Has project output | Data |
Copyright © 2019 edegan.com. All Rights Reserved. |
This page details the process of merging existing data with data pulled from Crunchbase.
Contents
Project Location
For the merge detailed in this page, our data was from:
/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx
and Crunchbase info can be found in:
/bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx
The data from Crunchbase, organized into tables, is in a script found at:
/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql
Code for building tables (from Maxine Tao):
E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql
And the database is
crunchbase2
Process
Step One: Creating UUID Matches
We began by making sure our company names were unique; creating a 1-1-1-1 relationship (only one instance of a company name in our data, and in Crunchbase data). We did so using the Matcher. We matched our sheet against itself, and Crunchbase info (pulled from organizations table detailed below) against itself, to remove duplicates and only leave unique values. More here: http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information
Upon Ed's instruction, we then looked at companies in Crunchbase which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had duplicate UUIDs. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).
Upon refining our list based on recursive filtering, we found 40 companies which match our data, and added UUIDs appropriately.
Step Two: Pulling Data
The necessary tables for this pull can be found at:
/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql
We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table organizations.
We pull based on the unique UUIDs found by Maxine, which can be found in the file:
/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx, in column W.
The table we get most Crunchbase data from looks like this:
DROP TABLE organizations; CREATE TABLE organizations ( company_name varchar(100), role varchar(255), permalink varchar(255), domain varchar(5000), homepage_url varchar(5000), country_code varchar(10), state_code varchar(2), region varchar(50), city varchar(100), address text, status varchar(50), short_description text, category_list text, category_group_list text, funding_rounds integer, funding_total_usd money, founded_on date, --yyyy-mm-dd last_funding_on date, --yyyy-mm-dd closed_on date, --yyyy-mm-dd employee_count varchar(255), email varchar(255), phone text, facebook_url varchar(5000), linkedin_url varchar(5000), cb_url varchar(5000), logo_url varchar(5000), twitter_url varchar(5000), alias varchar(10000), uuid varchar(255), created_at date, --yyyy-mm-dd-hh-mm-s.s updated_at date, --yyyy-mm-dd-hh-mm-s.s primary_role varchar(255), type varchar(255) );
From this list, we care about the following:
- company_name,
- domain,
- country_code,
- state_code,
- city,
- address,
- status,
- short_description,
- category_list,
- category_group_list,
- founded_on,
- employee_count,
- linkedin_url,
- uuid -- our primary key
We also want to get more information on organization descriptions. To do so, we can pull description from the table organization_descriptions, matching based on UUID.
We also, for the purposes of industry classification, want to pull category_name from the table category_groups, matching based on UUID.
Finally, it may be worthwhile to pull variables such as name, description, and started_on from the events table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.
Given the aforementioned information, we now have much data that can be used to populate empty cells in our existing data, as well as to create new columns.
Step Three: Merging
Of the data we've pulled from Crunchbase, we're interested in merging four columns with our existing data:
- domain (to be merged with the empty cells of courl)
- city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)
- status (to be merged with the empty cells of costatus)
- short_description and description from the table organization_descriptions (some combination to be merged with empty cells of codescription)
Note: we may also be able to merge some combination of category_list, category_group_list, and (from category_groups table) category_name, to merge with cosector in our data, and use it for Maxine Tao's industry classifier.
The other columns can be added to the end of our sheet as supplemental data.
SQL Scripts, Files, and Databases
The contents of E:\McNair\Projects\Accelerators\Summer 2018\For Ed Merge July 17.xlsx where copied into CohortCosWcbuuid.txt (in the Accelerators folder, as well as Z:/crunchbase2, Z:/../vcdb2).
The script AddCBData.sql loads this data into crunchbase2. It then outputs the relevant crunchbase data into CBCohortData.txt
The script LoadAcceleratorDataV2.sql (see around line 305) loads both CohortCosWcbuuid.txt and CBCohortData.txt into the database vcdb2. It then produces a CohortCoExtended table, which is output to a file.
Note that CohortCoExtended.txt includes a variable GotVC, which takes the value 1 if the cohort company got VC and zero otherwise:
gotvc | count -------+------- 0 | 11465 1 | 1504 (2 rows)
We now need to determine which cohort companies we have timing information for and which we don't - and use demo days to get the info we are missing!
Getting Timing info for Companies Who Got VC
Line 136 of
E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql
contains the code to find the companies which recieved VC but did not have timing info. There are 809 such companies. This table was exported into needtiminginfo.txt.
A list of distinct accelerators that we need timing data for was also created, which was given to Minh Le. There's 75 accelerators that need their timing doing.
Doing training data - 2,600 pages and are a little bit more than 1/2 way (~1500-1600).