Revision as of 13:24, 27 July 2018
| Project Information | |
|---|---|
| Project Title | Merging Existing Data with Crunchbase |
| Owner | Connor Rothschild |
| Start Date | |
| Deadline | |
| Primary Billing | |
| Notes | |
| Has project status | |
This page details the process of merging existing data with data pulled from Crunchbase.
Project Location
For the merge detailed in this page, our data was from:
/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx
and Crunchbase info can be found in:
/bulk/McNair/Projects/Accelerators/Summer 2018/Cohort Companies with Crunchbase Info.xlsx
The data from Crunchbase, organized into tables, is in a script found at:
/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql
Code for building tables (from Maxine Tao):
E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql
And the database is
crunchbase2
Process
Step One: Creating UUID Matches
We began by making sure our company names were unique, creating a one-to-one relationship (only one instance of a company name in our data, and only one in the Crunchbase data). We did so using the Matcher: we matched our sheet against itself, and the Crunchbase info (pulled from the organizations table detailed below) against itself, to remove duplicates and leave only unique values. More here: http://mcnair.bakerinstitute.org/wiki/Crunchbase_Data#Collecting_Company_Information
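The deduplication step can be sketched as follows. This is only an illustration of the idea, not the Matcher tool itself, and the example names are hypothetical:

```python
from collections import Counter

def unique_names(names):
    """Keep only company names that appear exactly once in the list
    (case- and whitespace-insensitive), discarding all duplicates."""
    counts = Counter(n.strip().lower() for n in names)
    return [n for n in names if counts[n.strip().lower()] == 1]

# 'Acme' appears twice (in different casings), so only 'Beta Labs' survives.
companies = ["Acme", "Beta Labs", "acme"]
print(unique_names(companies))  # ['Beta Labs']
```

Running this against both our sheet and the Crunchbase pull independently leaves only names that can be matched one-to-one.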
Upon Ed's instruction, we then looked at companies in Crunchbase which had more than one UUID associated with the company name. Of the 670,000 companies in Crunchbase, only 15,000 had duplicate UUIDs. From this list of 15,000, we used recursive filtering to determine if any companies could be properly matched to the company in our data by looking at additional variables (such as company location).
After refining our list with this recursive filtering, we found 40 companies that matched our data, and added their UUIDs appropriately.
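The filtering on duplicate UUIDs can be sketched like this. The field names and example rows are illustrative, not the actual script logic:

```python
def resolve_uuid(our_company, crunchbase_rows):
    """Among Crunchbase rows sharing a company name, return the single UUID
    whose location also matches our record; return None if still ambiguous."""
    candidates = [r for r in crunchbase_rows
                  if r["company_name"].lower() == our_company["name"].lower()]
    if len(candidates) > 1:  # duplicate UUIDs: filter further on city
        candidates = [r for r in candidates
                      if r.get("city", "").lower() == our_company.get("city", "").lower()]
    return candidates[0]["uuid"] if len(candidates) == 1 else None

rows = [
    {"company_name": "Acme", "city": "Houston", "uuid": "u1"},
    {"company_name": "Acme", "city": "Boston", "uuid": "u2"},
]
print(resolve_uuid({"name": "Acme", "city": "Houston"}, rows))  # u1
```

Additional variables (state, country, founding year) could be layered in the same way, narrowing the candidate set one field at a time.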
Step Two: Pulling Data
The necessary tables for this pull can be found at:
/bulk/McNair/Software/Database Scripts/Crunchbase2/LoadTables.sql
We pulled the relevant data from Crunchbase based on unique UUID matches. In the crunchbase2 database, we used the table organizations.
We pulled based on the unique UUIDs found by Maxine, which can be found in the file:
/bulk/McNair/Projects/Accelerators/Summer 2018/The File to Rule Them All.xlsx, in column W.
The table we get most Crunchbase data from looks like this:
DROP TABLE organizations;
CREATE TABLE organizations (
    company_name varchar(100),
    role varchar(255),
    permalink varchar(255),
    domain varchar(5000),
    homepage_url varchar(5000),
    country_code varchar(10),
    state_code varchar(2),
    region varchar(50),
    city varchar(100),
    address text,
    status varchar(50),
    short_description text,
    category_list text,
    category_group_list text,
    funding_rounds integer,
    funding_total_usd money,
    founded_on date, --yyyy-mm-dd
    last_funding_on date, --yyyy-mm-dd
    closed_on date, --yyyy-mm-dd
    employee_count varchar(255),
    email varchar(255),
    phone text,
    facebook_url varchar(5000),
    linkedin_url varchar(5000),
    cb_url varchar(5000),
    logo_url varchar(5000),
    twitter_url varchar(5000),
    alias varchar(10000),
    uuid varchar(255),
    created_at date, --yyyy-mm-dd-hh-mm-s.s
    updated_at date, --yyyy-mm-dd-hh-mm-s.s
    primary_role varchar(255),
    type varchar(255)
);
From this list, we care about the following:
- company_name,
- domain,
- country_code,
- state_code,
- city,
- address,
- status,
- short_description,
- category_list,
- category_group_list,
- founded_on,
- employee_count,
- linkedin_url,
- uuid -- our primary key
We also want to get more information on organization descriptions. To do so, we can pull description from the table organization_descriptions, matching based on UUID.
We also, for the purposes of industry classification, want to pull category_name from the table category_groups, matching based on UUID.
Finally, it may be worthwhile to pull variables such as name, description, and started_on from the events table, in the hopes of finding Cohort years, or potentially demo days. This can also be matched based on UUID.
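The UUID-based joins described above were done in SQL, but the idea can be sketched in Python. The row contents here are made up; only the table and column names (organizations, organization_descriptions, uuid, description) come from the text:

```python
def attach_descriptions(organizations, org_descriptions):
    """Left-join organization_descriptions onto organizations by uuid:
    each organization gets a 'description' field, or None if no match."""
    desc_by_uuid = {d["uuid"]: d["description"] for d in org_descriptions}
    for org in organizations:
        org["description"] = desc_by_uuid.get(org["uuid"])
    return organizations

orgs = [{"uuid": "u1", "company_name": "Acme"},
        {"uuid": "u2", "company_name": "Beta Labs"}]
descs = [{"uuid": "u1", "description": "Makes anvils."}]
print(attach_descriptions(orgs, descs)[0]["description"])  # Makes anvils.
```

The same pattern applies to pulling category_name from category_groups or event fields from the events table: build a lookup keyed on UUID, then map it over our rows.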
Given the aforementioned information, we now have much data that can be used to populate empty cells in our existing data, as well as to create new columns.
Step Three: Merging
Of the data we've pulled from Crunchbase, we're interested in merging four columns with our existing data:
- domain (to be merged with the empty cells of courl)
- city, state_code, and country_code (some combination of this is to be merged with the empty cells of colocation)
- status (to be merged with the empty cells of costatus)
- short_description and description from the table organization_descriptions (some combination to be merged with empty cells of codescription)
Note: we may also be able to merge some combination of category_list, category_group_list, and (from the category_groups table) category_name with cosector in our data, and use it for Maxine Tao's industry classifier.
The other columns can be added to the end of our sheet as supplemental data.
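The fill-empty-cells-only merge rule above can be sketched as follows. The column pairing is taken from the list above, but the function and variable names are illustrative, not the actual merge script:

```python
MERGE_MAP = {  # our column  <-  Crunchbase column
    "courl": "domain",
    "costatus": "status",
    "codescription": "short_description",
}

def fill_empty_cells(our_row, cb_row):
    """Fill only the empty cells of our record with Crunchbase values;
    existing values in our data are never overwritten."""
    for ours, theirs in MERGE_MAP.items():
        if not our_row.get(ours):
            our_row[ours] = cb_row.get(theirs)
    return our_row

row = {"courl": "", "costatus": "operating", "codescription": None}
cb = {"domain": "acme.com", "status": "closed", "short_description": "Anvils."}
print(fill_empty_cells(row, cb))
# {'courl': 'acme.com', 'costatus': 'operating', 'codescription': 'Anvils.'}
```

Note that costatus keeps its existing value ("operating") even though Crunchbase disagrees: our data takes precedence, and Crunchbase only fills gaps.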
SQL Scripts, Files, and Databases
The contents of E:\McNair\Projects\Accelerators\Summer 2018\For Ed Merge July 17.xlsx were copied into CohortCosWcbuuid.txt (in the Accelerators folder, as well as Z:/crunchbase2, Z:/../vcdb2).
The script AddCBData.sql loads this data into crunchbase2. It then outputs the relevant crunchbase data into CBCohortData.txt
The script LoadAcceleratorDataV2.sql (see around line 305) loads both CohortCosWcbuuid.txt and CBCohortData.txt into the database vcdb2. It then produces a CohortCoExtended table, which is output to a file.
Note that CohortCoExtended.txt includes a variable GotVC, which takes the value 1 if the cohort company got VC and zero otherwise:
 gotvc | count
-------+-------
     0 | 11465
     1 |  1504
(2 rows)
We now need to determine which cohort companies we have timing information for and which we don't - and use demo days to get the info we are missing!
Getting Timing info for Companies Who Got VC
Line 136 of
E:\McNair\Software\Database Scripts\Crunchbase2\CompanyMatchScript.sql
contains the code to find the companies which received VC but did not have timing info. There are 809 such companies. This table was exported into needtiminginfo.txt.
A list of the distinct accelerators that we need timing data for was also created and given to Minh Le. There are 75 accelerators that need timing data.
Training-data collection covers about 2,600 pages and is a little more than halfway done (~1,500-1,600 pages).
Finding Company URLs
Excel master datasets are in:
E:\McNair\Projects\Accelerators\Summer 2018
Code and files specific to this URL finder are in:
E:\McNair\Projects\Accelerators\Summer 2018\url finder
Results
I used STEP1_crawl.py and STEP2_findcorrecturl.py to add approximately 1000 more URLs into 'The File to Rule Them All.xlsx'.
Testing
In this file (sheet 'Most Recent Merged Data'; note that this is just a copy of 'Cohorts Final' in 'The File to Rule Them All'):
E:\McNair\Projects\Accelerators\Summer 2018\Merged W Crunchbase Data as of July 17.xlx
We filter for companies (~4,000) that did not receive VC, are not in Crunchbase, and do not have URLs. Using a Google crawler (STEP1_crawl.py) and a URL-matching script (STEP2_findcorrecturl.py), we will try to find as many URLs as possible.
To test, I ran about 40 companies from "smallcompanylist.txt", using only the company name as a search term and taking the first 4 valid results (see the don't-collect list in the code). The Google crawler and URL matcher correctly identified around 20 URLs. They also misidentified some URLs that look very similar to the company name, but were accurate for the most part when the name was not too generic. I then ran the 20 unfound company names through the crawler again, this time using the company name plus "startup" as the search term. This did not identify any more correct URLs.
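The "first 4 valid results" filter can be sketched like this. The actual don't-collect list lives in STEP1_crawl.py; the domains below are stand-ins:

```python
from urllib.parse import urlparse

# Illustrative stand-in for the don't-collect list in STEP1_crawl.py;
# the real list may differ.
DONT_COLLECT = {"linkedin.com", "facebook.com", "twitter.com", "crunchbase.com"}

def first_valid_results(urls, n=4):
    """Return the first n search-result URLs whose domain is not excluded."""
    valid = []
    for url in urls:
        host = urlparse(url).netloc.lower().removeprefix("www.")
        if host not in DONT_COLLECT:
            valid.append(url)
        if len(valid) == n:
            break
    return valid

results = ["https://www.linkedin.com/company/acme", "https://acme.com",
           "https://twitter.com/acme", "https://acme.io/about"]
print(first_valid_results(results))  # ['https://acme.com', 'https://acme.io/about']
```

Social-media and aggregator hits are skipped rather than counted, so the 4 results passed on to the matcher are all plausible company homepages.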
It seems reasonable to assume that if a company's URL cannot be found within the first 4 valid search results, then that company probably does not have a URL at all. This was the case for many of the 20 unfound URLs from my test run above.
Actual Run Info
The companies we needed to find URLs for are in a file called 'ACTUALNEEDEDCOMPANIES.txt'.
The first four results for every company, as found by STEP1_crawl.py, are in 'ACTUAL_crawled_company_urls.txt'.
The results after the matching done by STEP2_findcorrecturl.py are in 'ACTUAL_finalurls.txt'.
Note that in the end, I decided to only take URLs that were given a match score of greater than 0.9.
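The 0.9 threshold can be illustrated with a simple name-to-domain similarity score. The actual scoring in STEP2_findcorrecturl.py may be computed differently; this is just a plausible sketch using the standard library:

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

def match_score(company_name, url):
    """Similarity (0 to 1) between a company name and the URL's base domain."""
    domain = urlparse(url).netloc.lower().removeprefix("www.").split(".")[0]
    name = company_name.lower().replace(" ", "")
    return SequenceMatcher(None, name, domain).ratio()

def best_url(company_name, candidate_urls, threshold=0.9):
    """Return the best-scoring candidate URL, or None if none clears the bar."""
    scored = [(match_score(company_name, u), u) for u in candidate_urls]
    score, url = max(scored)
    return url if score > threshold else None

print(best_url("Acme Labs", ["https://www.acmelabs.com",
                             "https://acme-widgets.net"]))
# https://www.acmelabs.com
```

A high threshold like 0.9 trades recall for precision: near-miss domains that merely resemble the company name (the misidentification problem noted in testing) fall below the bar and are discarded.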