Venture Capital (Data)
Venture Capital (Data) | |
---|---|
Project Information | |
Project Title | Venture Capital (Data) |
Owner | Adrian Smart, Jake Silberman |
Start Date | |
Deadline | |
Keywords | Data |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Project Objective
To create a master venture capital data set on which all other projects requiring venture capital data will be built.
Specifically, we will be building data to support the following:
- Lists and counts by year as well as state, MSA, and city, zip code, etc.
- Identify seed, early and later stage deals (first investment), dollars invested and rounds
- Identify transactional VC dollars invested and rounds
- Acquisitions and IPOs of VC backed firms
- Identify when VC backed firms are "alive"
- Geocode all portfolio companies
- Geocode VC headquarters and branch offices
- Compute distances between investors and their portfolio companies
- Identify lead investors for portfolio companies
- Identify top 100 VCs
Retrieving data
Old notes are Old Venture Capital Data Work
Retrieving US VC Data From SDC requires SDC platinum, perl scripts and SQL scripts, as well as some manual processing.
Scripts and other info
SDC and Perl Scripts are in:
E:\McNair\Projects\VC Database
This includes:
- NormalizeFixedWidth.pl
- RoundOnOneLine.pl - Needed for round on one line
SQL Scripts and finished data are in:
Z:\VentureCapitalData\SDCVCData
Notes:
- Portfolio companies attributes to be extracted from roundlevel information
- Round-on-one-line processed using RoundOnOneLine.pl and NormalizeFixedWidth.pl
- Firms includes branch office so attributes must be extracted
- Portfolio company descriptions - just the portco name, state, date of first inv, and the long description - has to be custom processed.
The next steps are detailed in VC Database Rebuild.
Additional datafiles (in E:\McNair\Projects\VC Database):
- GeocodedVCData.txt 43,724 records, tab-delimited with companynames but with "none" for some geocoords.
Processing the base tables
The SQL scripts need to do the following:
- Extract company infomation from roundbase then process it
- Add variable for growth (1), transactional (2), or non-VC (3) (see below)
- Create clean 5-digit ZIP field
- Extract firm information from firmbase, then:
- Add indicator variable for US
- Create clean 5-digit ZIP field
- Build out BranchOffice (BO) table
- Build lookups for:
- State Code
- raisestatus
Lookup Tables
Stage of Investment
We are generating 6 binary indicators as follows:
Growth VC (4) -------------- Seed (1) Early Stage (2) Later Stage (3)
Tranactional VC (5) -------------- Acq. for Expansion Acquisition Bridge Loan Expansion Pending Acq Recap or Turnaround
Exclude (non-VC) (6) -------------- LBO MBO Open Market Purchase PIPE Secondary Buyout VC Partnership Other
Building companybase and round
Both are built from roundbase which contains portfolio companies and round details. The scripts are below.
CREATE TABLE companybase AS SELECT DISTINCT coname,updateddate,foundingdate,datelastinv,datefirstinv,investedk,city,description,msa,msacode,nationcode, statecode,addr1,addr2,indclass,indsubgroup3,indminor,url,zip FROM roundbase ORDER BY coname;
CREATE TABLE round AS SELECT DISTINCT coname,statecode,datefirstinv,rounddate,stage1,stage3,rndamtdisck,rndamtestk,roundnum,numinvestors FROM roundbase ORDER BY coname;
Building stageflags
Stageflags are built from the stage3 column of round. Using the tables above, I made 6 flags for seed, early, later, growth, transaction and exclude. The script is below. Verify that no row entries are missing flags in excel by copying the round table to a txt file. The last 3 columns should always sum to 1 for every row entry. There should be no zero sums and no sums greater than 1.
CREATE TABLE stageflags AS SELECT coname, stage3, CASE WHEN stage3 = 'Seed' THEN 1 ELSE 0 END AS seedflag, CASE WHEN stage3 = 'Early Stage' THEN 1 ELSE 0 END AS earlyflag, CASE WHEN stage3 = 'Later Stage' THEN 1 ELSE 0 END AS laterflag, CASE WHEN stage3 = 'Seed' OR stage3 = 'Later Stage' OR stage3 = 'Early Stage' THEN 1 ELSE 0 END AS growthflag, CASE WHEN stage3 = 'Acq. for Expansion' OR stage3 = 'Acquisition' OR stage3 = 'Bridge Loan' OR stage3 = 'Expansion' OR stage3 = 'Pending Acq' OR stage3 = 'Recap or Turnaround' OR stage3 = 'Mezzanine' THEN 1 ELSE 0 END AS transactionflag, CASE WHEN stage3 = 'LBO' OR stage3 = 'MBO' OR stage3 = 'Open Market Purchase' OR stage3 = 'PIPE' OR stage3 = 'Secondary Buyout' OR stage3 = 'Other' OR stage3 = 'VC Partnership' OR stage3 = 'Secondary Purchase' THEN 1 ELSE 0 END AS excludeflag FROM round;
Sample output
The psql file was copied to an output text file using:
\COPY stageflags TO 'flag.txt' WITH DELIMITER AS E'\t' HEADER NULL AS CSV;
A sum column was added in excel which is the sum of the last 3 columns. Notice how all entries are 1. This is good.