Venture Capital (Data)
Venture Capital (Data) | |
---|---|
Project Information | |
Project Title | Venture Capital (Data) |
Owner | Adrian Smart, Jake Silberman |
Start Date | |
Deadline | |
Keywords | Data |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Project Objective
To create a master venture capital data set on which all other projects requiring venture capital data will be built.
Specifically, we will be building data to support the following:
- Lists and counts by year as well as state, MSA, and city, zip code, etc.
- Identify seed, early and later stage deals (first investment), dollars invested and rounds
- Identify transactional VC dollars invested and rounds
- Acquisitions and IPOs of VC backed firms
- Identify when VC backed firms are "alive"
- Geocode all portfolio companies
- Geocode VC headquarters and branch offices
- Compute distances between investors and their portfolio companies
- Identify lead investors for portfolio companies
- Identify top 100 VCs
Retrieving data
Old notes are Old Venture Capital Data Work
All pulls and processing scripts are in:
E:\McNair\Projects\VC Database
Each of the following has a .rpt, .ssh, and .txt file, and has the following constraints in its search:
- Venture related deals
- Round date in 1/1/1980 to 06/15/2017
The portfolio company (inc. round) based pulls also have:
- Portfolio company nation = United States
The datasets retrieved from SDC platinum (on June 15-16th 2017) are:
- Portfolio companies (USVC1980-present.ssh) - attributes to be extracted from roundlevel information
- Portfolio company descriptions - just the portco name and the long description. Custom processed.
- Round-on-one-line (USVCRound1980-present) - processed using RoundOnOneLine.pl
- Funds (USVCFund1980-present)
- Firms (VCFirms1980-present) - includes branch office so attributes must be extracted
- IPOs
- M&A
All files were processed with NormalizeFixedWidth.pl (after footers were removed) unless otherwise indicated. Some files required some minor post-processing to load into PostgreSQL. Issues included:
- Firm level data didn't normalize correctly - had to adjust headers
- Stray quotation in address line
- Area code had a 1- in it
- Some line counts were off by one or two
- "Firm Capital under Mgmt" column header for VCFirms has a {0mil} which screws up the normalizer. Delete this part of the column title prior to running normalizer.
Additional datafiles (in E:\McNair\Projects\VC Database):
- GeocodedVCData.txt 43,724 records, tab-delimited with companynames but with "none" for some geocoords.
Loading the data into SQL
The SQL script and load data are in:
Z:\VentureCapitalData\SDCVCData
The load script is:
LoadData.sql
Processing the base tables
The SQL scripts need to do the following:
- Extract company infomation from roundbase then process it
- Add variable for growth (1), transactional (2), or non-VC (3) (see below)
- Create clean 5-digit ZIP field
- Extract firm information from firmbase, then:
- Add indicator variable for US
- Create clean 5-digit ZIP field
- Build out BranchOffice (BO) table
- Build lookups for:
- State Code
- raisestatus
Lookup Tables
Stage of Investment
Growth VC (1) -------------- Seed Early Stage Later Stage
Tranactional VC (2) -------------- Acq. for Expansion Acquisition Bridge Loan Expansion Pending Acq Recap or Turnaround
Exclude (non-VC) (3) -------------- LBO MBO Open Market Purchase PIPE Secondary Buyout