Difference between revisions of "Merging Existing Data with Crunchbase"

From edegan.com
Jump to navigation Jump to search
Line 27: Line 27:
 
  /bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''
 
  /bulk/McNair/Software/Database Scripts/Crunchbase2/'''LoadTables.sql'''
  
We then pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table ''organizations''.
+
We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table ''organizations''.
The table looks like this:
 
  
DROP TABLE organizations;
+
We pull based on the unique UUIDs found by Maxine, which can be found in the file:
CREATE TABLE organizations (
+
/bulk/McNair/Projects/Accelerators/Summer 2018/'''The File to Rule Them All.xlsx''', in column W.
 +
 
 +
The table we get most Crunchbase data from looks like this:
 +
 
 +
DROP TABLE organizations;
 +
CREATE TABLE organizations (
 
   company_name varchar(100),
 
   company_name varchar(100),
 
   role varchar(255),
 
   role varchar(255),
Line 81: Line 85:
 
*employee_count,
 
*employee_count,
 
*linkedin_url,
 
*linkedin_url,
*uuid'''
+
*uuid -- our primary key'''
  
 
We also want to get more information on organization descriptions. To do so, we can pull ''description'' from the table ''organization_descriptions'', matching based on UUID.
 
We also want to get more information on organization descriptions. To do so, we can pull ''description'' from the table ''organization_descriptions'', matching based on UUID.

Revision as of 12:07, 17 July 2018

This page details the process of merging existing data with data pulled from Crunchbase.

Project Location

For the merge detailed in this page, our data was from:

/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx

and Crunchbase info can be found in:

/bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx

The data from Crunchbase, organized into tables, is in a script found at:

/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql

Process

Step One: Creating UUID Matches

We began my making sure our company names were unique; creating a 1-1-1-1 relationship (only one instance of a company name in our data, and in Crunchbase data). We did so using the Matcher. We matched our sheet against itself, and Crunchbase info against itself, to remove duplicates and only leave unique values.

Upon Ed's instruction, we then looked at companies in Crunchbase which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had duplicate UUIDs. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).

Upon refining our list based on recursive filtering, we found __ companies which match our data, and added UUIDs appropriately.

Step Two: Pulling Data

The necessary tables for this pull can be found at:

/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql

We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table organizations.

We pull based on the unique UUIDs found by Maxine, which can be found in the file:

/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx, in column W.

The table we get most Crunchbase data from looks like this:

DROP TABLE organizations;
CREATE TABLE organizations (
 company_name varchar(100),
 role varchar(255),
 permalink varchar(255),
 domain varchar(5000),
 homepage_url varchar(5000),
 country_code varchar(10),
 state_code varchar(2),
 region varchar(50),
 city varchar(100),
 address text,
 status varchar(50),
 short_description text,
 category_list text,
 category_group_list  text,
 funding_rounds integer,
 funding_total_usd money,
 founded_on date, --yyyy-mm-dd
 last_funding_on date, --yyyy-mm-dd
 closed_on date, --yyyy-mm-dd
 employee_count varchar(255),
 email varchar(255),
 phone text,
 facebook_url varchar(5000),
 linkedin_url varchar(5000),
 cb_url varchar(5000),
 logo_url varchar(5000),
 twitter_url varchar(5000),
 alias varchar(10000),
 uuid varchar(255),
 created_at date, --yyyy-mm-dd-hh-mm-s.s
 updated_at date, --yyyy-mm-dd-hh-mm-s.s
 primary_role varchar(255),
 type varchar(255)
);

From this list, we care about the following:

  • company_name,
  • domain,
  • country_code,
  • state_code,
  • city,
  • address,
  • status,
  • short_description,
  • category_list,
  • category_group_list,
  • founded_on,
  • employee_count,
  • linkedin_url,
  • uuid -- our primary key

We also want to get more information on organization descriptions. To do so, we can pull description from the table organization_descriptions, matching based on UUID.

We also, for the purposes of industry classification, want to pull category_name from the table category_groups, matching based on UUID.

Finally, it may be worthwhile to pull variables such as name, description, and started_on from the events table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.

Given the aforementioned information, we now have much data that can be used to populate empty cells in our existing data, as well as to create new columns.

Step Three: Merging

Of the data we've pulled from Crunchbase, we're interested in merging four columns with our existing data:

  • domain (to be merged with the empty cells of courl)
  • city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)
  • status (to be merged with the empty cells of costatus)
  • short_description and description from the table organization_descriptions (some combination to be merged with empty cells of codescription)

Note: we may also be able to merge some combination of category_list, category_group_list, and (from category_groups table) category_name, to merge with cosector in our data, and use it for Maxine Tao's industry classifier.