Venture Capital (Data)
Venture Capital (Data) | |
---|---|
Project Information | |
Project Title | Venture Capital (Data) |
Owner | Adrian Smart, Jake Silberman |
Start Date | |
Deadline | |
Keywords | Data |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Contents
Project Objective
To create a master venture capital data set on which all other projects requiring venture capital data will be built.
Specifically, we will be building data to support the following:
- Lists and counts by year as well as state, MSA, and city, zip code, etc.
- Identify seed, early and later stage deals (first investment), dollars invested and rounds
- Identify transactional VC dollars invested and rounds
- Acquisitions and IPOs of VC backed firms
- Identify when VC backed firms are "alive"
- Geocode all portfolio companies
- Geocode VC headquarters and branch offices
- Compute distances between investors and their portfolio companies
- Identify lead investors for portfolio companies
- Identify top 100 VCs
Retrieving data
Old notes are Old Venture Capital Data Work
Retrieving US VC Data From SDC requires SDC platinum, perl scripts and SQL scripts, as well as some manual processing.
Scripts and other info
SDC and Perl Scripts are in:
E:\McNair\Projects\VC Database
This includes:
- NormalizeFixedWidth.pl
- RoundOnOneLine.pl - Needed for round on one line
SQL Scripts and finished data are in:
Z:\VentureCapitalData\SDCVCData
Notes:
- Portfolio companies attributes to be extracted from roundlevel information
- Round-on-one-line processed using RoundOnOneLine.pl and NormalizeFixedWidth.pl
- Firms includes branch office so attributes must be extracted
- Portfolio company descriptions - just the portco name, state, date of first inv, and the long description - has to be custom processed.
The next steps are detailed in VC Database Rebuild.
Additional datafiles (in E:\McNair\Projects\VC Database):
- GeocodedVCData.txt 43,724 records, tab-delimited with companynames but with "none" for some geocoords.
Processing the base tables
The SQL scripts need to do the following:
- Extract company infomation from roundbase then process it
- Add variable for growth (1), transactional (2), or non-VC (3) (see below)
- Create clean 5-digit ZIP field
- Extract firm information from firmbase, then:
- Add indicator variable for US
- Create clean 5-digit ZIP field
- Build out BranchOffice (BO) table
- Build lookups for:
- State Code
- raisestatus
SQL file to process data is:
Z:/VentureCapitalData/SDCVC/ProcessData.sql
Lookup Tables
Stage of Investment
We are generating 6 binary indicators as follows:
Growth VC (4) -------------- Seed (1) Early Stage (2) Later Stage (3)
Tranactional VC (5) -------------- Acq. for Expansion Acquisition Bridge Loan Expansion Pending Acq Recap or Turnaround
Exclude (non-VC) (6) -------------- LBO MBO Open Market Purchase PIPE Secondary Buyout VC Partnership Other
Building companybase and round
Both are built from roundbase which contains portfolio companies and round details. The scripts are below.
CREATE TABLE companybase AS SELECT DISTINCT coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode, statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip FROM roundbase ORDER BY coname;
CREATE TABLE round AS SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors FROM roundbase ORDER BY coname;
Building stageflags
Stageflags are built from the stage3 column of round. Using the tables above, I made 6 flags for seed, early, later, growth, transaction and exclude. The script is below. Verify that no row entries are missing flags in excel by copying the round table to a txt file. The last 3 columns should always sum to 1 for every row entry. There should be no zero sums and no sums greater than 1.
CREATE TABLE stageflags AS SELECT coname, stage3, CASE WHEN stage3 = 'Seed' THEN 1 ELSE 0 END AS seedflag, CASE WHEN stage3 = 'Early Stage' THEN 1 ELSE 0 END AS earlyflag, CASE WHEN stage3 = 'Later Stage' THEN 1 ELSE 0 END AS laterflag, CASE WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1 ELSE 0 END AS growthflag, CASE WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1 ELSE 0 END AS transactionflag, CASE WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1 ELSE 0 END AS excludeflag FROM round;
Sample output
The psql file was copied to an output text file using:
\COPY stageflags TO 'flag.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV;
A sum column was added in excel which is the sum of the last 3 columns. Notice how all entries are 1. This is good.
Checking companybase table
The following psql code creates a new table called companybase1 that checks for Undisclosed companies and nonUS companies in the companybase table and also inserts a primary key column which is a concatenation of coname, statecode and datefirstinv.
CREATE TABLE companybase1 AS SELECT *, concat(coname::text, statecode::text, datefirstinv::text) AS pkey, CASE WHEN nationcode = 'US' THEN 1::int ELSE 0::int END AS alwaysusflag, CASE WHEN coname = 'Undisclosed Company' THEN 1::int ELSE 0::int END AS undisclosedflag FROM companybase; \COPY companybase1 TO 'companybase1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV;
Checking if pkey is valid
The following psql code creates a new table called companybasepkeyvalid that counts the number of times the pkey created in companybase1 occurs in companybase1. In this case there are 3 pkeys that are not valid for companies: Undisclosed Company, New York Digital Health LLC and Undisclosed Company.
CREATE TABLE companybasepkeyvalid AS SELECT coname, pkey, COUNT(pkey) FROM companybase1 GROUP BY coname, pkey; --44771 \COPY companybasepkeyvalid TO 'companybasepkeyvalid.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV;
The Undisclosed Companies will be updated manually using the description to extract the company name. To remove the duplicate records of the company New York Digital Health LLC from the roundbase table run the following command:
DELETE FROM roundbase WHERE url = 'www.digitalhealthaccelerator.c'; --1
After the Undisclosed company names are updated, you will need to recreate the companybase and roundbase tables. Copy all the undisclosed companies to a text file and pull out the company names from the description.
DROP TABLE undisclosedcompanies CREATE TABLE undisclosedcompanies AS SELECT coname, description, url FROM roundbase WHERE coname = 'Undisclosed Company'; --37 \COPY undisclosedcompanies TO 'undisclosedcompanies.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV;
After manually pulling out the company names from the description from all 37 entries, I created the following commands to update the original roundbase table below. After the updates there are still 18 Undisclosed companies. Recheck to make sure there are no primary key issues again.
UPDATE roundbase SET coname = 'Standard Bariatrics' WHERE coname = 'Undisclosed Company' AND description = 'Standard Bariatrics is taking bariatric procedure and creating instruments'; --1 UPDATE roundbase SET coname = 'Jersey Watch' WHERE coname = 'Undisclosed Company' AND description = 'Jersey Watch provides organizers to professional.'; --1 UPDATE roundbase SET coname = 'CareSkore' WHERE coname = 'Undisclosed Company' AND description = 'CareSkore provides an evidence-based predictive systems.'; --1 UPDATE roundbase SET coname = 'Navistone' WHERE coname = 'Undisclosed Company' AND description = 'Navistone provides digital marketers a way to target the right web browsers'; --2 UPDATE roundbase SET coname = 'PolySync' WHERE coname = 'Undisclosed Company' AND description = 'Offers PolySync platform for driverless vehicle applications.'; --2 UPDATE roundbase SET coname = 'Platform9' WHERE coname = 'Undisclosed Company' AND description = 'Platform9 Systems Inc is a United States-based company.'; --2 UPDATE roundbase SET coname = 'Salamanca' WHERE coname = 'Undisclosed Company' AND description = 'Salamanca provides products and services.'; --1 UPDATE roundbase SET coname = 'Sourcegraph Inc.' WHERE coname = 'Undisclosed Company' AND description = 'Sourcegraph Inc. provides a code search engine for open source.'; --1 UPDATE roundbase SET coname = 'Sona' WHERE coname = 'Undisclosed Company' AND description = 'Sona provides products and services.'; --1 UPDATE roundbase SET coname = 'AlphaSource' WHERE coname = 'Undisclosed Company' AND description = 'AlphaSource in provides healthcare technology solutions.'; --2 UPDATE roundbase SET coname = 'Reflecktive' WHERE coname = 'Undisclosed Company' AND description = 'Reflecktive provides products and services.'; --1 UPDATE roundbase SET coname = 'SuperH' WHERE coname = 'Undisclosed Company' AND description = 'SuperH provides an SaaS platform for open source data'; --1 UPDATE roundbase SET coname = 'Islands Media Inc' WHERE coname = 'Undisclosed Company' AND description = 'Islands Media Inc is an IT company.'; --1 UPDATE roundbase SET coname = 'Envoy Commerce Corp.' WHERE coname = 'Undisclosed Company' AND description = 'Envoy Commerce Corp. provides online grocery delivery service.'; --1 UPDATE roundbase SET coname = 'Quilt' WHERE coname = 'Undisclosed Company' AND description = 'Quilt is a consumer-centric insurance company focused on renters insurances'; --1
Removing duplicates from ipos table
If you create the following two tables you'll notice that there are duplicate entries in the ipos table.
DROP TABLE ipospkey; CREATE TABLE ipospkey AS SELECT *, concat(issuer::text, issuedate::text, statecode::text, principalamt::text) AS pkey FROM ipos; --10440
DROP TABLE iposcount; CREATE TABLE iposcount AS SELECT issuer, issuedate, statecode, COUNT(pkey) FROM ipospkey GROUP BY issuer, issuedate, statecode; --9491
To clean these up I built the following tables and selected the max principalamt values if there were duplicates. There is probably a simpler way to do this but I'll leave that for the next person to discover.
DROP TABLE iposcounthigh; CREATE TABLE iposcounthigh AS SELECT * FROM iposcount WHERE count > 1;
DROP TABLE duplicateipodata; CREATE TABLE duplicateipodata AS SELECT ipos.* FROM ipos INNER JOIN iposcounthigh ON (ipos.issuer = iposcounthigh.issuer AND ipos.issuedate = iposcounthigh.issuedate AND ipos.statecode = iposcounthigh.statecode) ORDER BY principalamt DESC;
DROP TABLE ipoduplicatemin; CREATE TABLE ipoduplicatemin AS SELECT issuedate, issuer, statecode, MIN(principalamt) AS principalamt FROM duplicateipodata GROUP BY issuedate, issuer, statecode; --939 \COPY ipoduplicatemin TO 'ipoduplicatemin.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV
DROP TABLE ipoduplicatekeys; CREATE TABLE ipoduplicatekeys AS SELECT *, concat(issuer::text, issuedate::text, statecode::text, principalamt::text) AS pkey FROM ipoduplicatemin;
DROP TABLE iponoduplicates; CREATE TABLE iponoduplicates AS SELECT ipospkey.* FROM ipospkey WHERE NOT EXISTS(SELECT * FROM ipoduplicatekeys WHERE (ipospkey.pkey = ipoduplicatekeys.pkey)); --9490
To verify that there are no more duplicates create another table.
DROP TABLE iposcountdupes1; CREATE TABLE iposcountdupes1 AS SELECT issuer, issuedate, statecode, COUNT(pkey) FROM iponoduplicates GROUP BY issuer, issuedate, statecode; \COPY iposcountdupes1 TO 'iposcountdupes1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV
You will notice that there are still 9 duplicate records that slipped thru somehow. Remove these manually.
DELETE FROM iponoduplicates WHERE pkey = 'PacTel Corp1993-12-02CA$299.00'; --1 DELETE FROM iponoduplicates WHERE pkey = 'Templeton Dragon Fund Inc1994-09-21FL$169.50'; --1 DELETE FROM iponoduplicates WHERE pkey = 'Sterling Commerce1996-03-08TX$240.00' AND proceedsamt = '288'; --1 DELETE FROM iponoduplicates WHERE pkey = 'Sothebys Holdings Inc1988-05-13NY$27.00'; --1 DELETE FROM iponoduplicates WHERE pkey = 'TD Waterhouse Group Inc1999-06-23NY$655.20'; --1 DELETE FROM iponoduplicates WHERE pkey = 'Berlitz International Inc1989-12-13NJ$34.65'; --1 DELETE FROM iponoduplicates WHERE pkey = 'Spain Fund Inc1988-06-21NY$39.00'; --1 DELETE FROM iponoduplicates WHERE pkey = 'Ultramar Corp1992-06-26CT$85.39'; --1 DELETE FROM iponoduplicates WHERE pkey = 'Goldman Sachs Group Inc1999-05-03NY$424.00'; --1
iponoduplicates is now the master ipo table that contains unique keys which are concatenated from issuer, issuedate and prinicpalamt. It should contain 9481 lines.
Cleaning mas table
There were duplicate records in this file because the original text file from SDC contained addresses that were on two rows for some companies. Therefore the easiest way to get rid of these records is to remove lines in textpad and reimport into the mas table.
Remove lines in textpad using regex: ^#.*\n and replacing with nothing.
After reimporting there will still be duplicate rows. Slap an id number on them and take the min as shown below.
CREATE TABLE mas1 AS SELECT *, concat(targetname::text, targetstate::text, announceddate::text) AS pkey FROM mas; ALTER TABLE mas1 ADD COLUMN id SERIAL PRIMARY KEY; --114890
DROP TABLE masminid; CREATE TABLE masminid AS SELECT targetname, targetstate, announceddate, MIN(id) FROM mas1 GROUP BY targetname, targetstate, announceddate; --114825
DROP TABLE masnodupes; CREATE TABLE masnodupes AS SELECT mas1.* FROM mas1 JOIN masminid ON mas1.id = masminid.min; \COPY masnodupes TO 'masnodupes.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV
masnodupes should have no duplicates based on targetname, targetstate, announceddate. After running it thru the matcher you will get about 800 matching errors. We decided to exclude any rows that had equivalent targetname, targetstate where the announced dates fell within the same week. In this case we took the minimum date value and excluded the other row. Below are the queries to make those tables.
DROP TABLE masdistinctkeys; CREATE TABLE masdistinctkeys AS SELECT DISTINCT targetname, targetstate, announceddate FROM masnodupes; --114825 \COPY masdistinctkeys TO 'masdistinctkeys.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV
DROP TABLE masdistinctkeysmindates; CREATE TABLE masdistinctkeysmindates AS SELECT targetname, targetstate, MIN(announceddate) AS announceddate FROM masdistinctkeys GROUP BY targetname, targetstate; --113236
DROP TABLE masdistinctkeysdatewindow; CREATE TABLE masdistinctkeysdatewindow AS SELECT masdistinctkeys.*, masdistinctkeysmindates.announceddate as minanndate, CASE WHEN masdistinctkeys.announceddate - INTERVAL '7 day' > masdistinctkeysmindates.announceddate THEN 1::int ELSE 0::int END AS dateflag FROM masdistinctkeys JOIN masdistinctkeysmindates ON (masdistinctkeys.targetname = masdistinctkeysmindates.targetname AND masdistinctkeys.targetstate = masdistinctkeysmindates.targetstate); --114825 \COPY masdistinctkeysdatewindow TO 'masdistinctkeysdatewindow.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV
CREATE TABLE masdistinctkeys1 AS SELECT targetname, targetstate, announceddate FROM masdistinctkeysdatewindow WHERE dateflag = 0; --113267 \COPY masdistinctkeys1 TO 'masdistinctkeys1.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV