Difference between revisions of "Venture Capital (Data)"
Line 9: | Line 9: | ||
}} | }} | ||
− | [[Old Venture Capital Data Work]] | + | ==Project Objective== |
+ | To create a master venture capital data set on which all other projects requiring venture capital data will be built. | ||
+ | |||
+ | Specifically, we will be building data to support the following: | ||
+ | *Lists and counts by year as well as state, MSA, and city, zip code, etc. | ||
+ | *Identify seed, early and later stage deals (first investment), dollars invested and rounds | ||
+ | *Identify transactional VC dollars invested and rounds | ||
+ | *Acquisitions and IPOs of VC backed firms | ||
+ | *Identify when VC backed firms are "alive" | ||
+ | *Geocode all portfolio companies | ||
+ | *Geocode VC headquarters and branch offices | ||
+ | *Compute distances between investors and their portfolio companies | ||
+ | *Identify lead investors for portfolio companies | ||
+ | *Identify top 100 VCs | ||
+ | |||
+ | ==Retrieving data== | ||
+ | |||
+ | Old notes are [[Old Venture Capital Data Work]] | ||
+ | |||
+ | All pulls and processing scripts are in: | ||
+ | E:\McNair\Projects\VC Database | ||
+ | |||
+ | Each of the following has a .rpt, .ssh, and .txt file, and has the following constraints in its search: | ||
+ | *Venture related deals | ||
+ | *Round date in 1/1/1980 to 06/15/2017 | ||
+ | |||
+ | The portfolio company (inc. round) based pulls also have: | ||
+ | *Portfolio company nation = United States | ||
+ | |||
+ | The datasets retrieved from SDC platinum (on June 13-15th 2017) are: | ||
+ | *Portfolio companies (USVC1980-present.ssh) - attributes to be extracted from roundlevel information | ||
+ | *Portfolio company descriptions - just the portco name and the long description. Custom processed. | ||
+ | *Round-on-one-line (USVCRound1980-present) - processed using RoundOnOneLine.pl | ||
+ | *Funds (USVCFund1980-present) | ||
+ | *Firms (VCFirms1980-present-pull2) - includes branch office so attributes must be extracted | ||
+ | *IPOs | ||
+ | *M&A | ||
+ | |||
+ | All files were processed with NormalizeFixedWidth.pl (after footers were removed) unless otherwise indicated. Some files required some minor post-processing to load into PostgreSQL. Issues included: | ||
+ | *Firm level data didn't normalize correctly - had to adjust headers | ||
+ | *Stray quotation in address line | ||
+ | *Area code had a 1- in it | ||
+ | *Some line counts were off by one or two | ||
+ | |||
+ | Additional datafiles (in E:\McNair\Projects\VC Database): | ||
+ | *GeocodedVCData.txt 43,724 records, tab-delimited with companynames but with "none" for some geocoords. | ||
+ | |||
+ | ==Loading the data into SQL== | ||
+ | |||
+ | The SQL script and load data are in: | ||
+ | Z:\VentureCapitalData\SDCVCData | ||
+ | |||
+ | The load script is: | ||
+ | LoadData.sql | ||
+ | |||
+ | ==Processing the base tables== | ||
+ | |||
+ | The SQL scripts need to do the following: | ||
+ | *Extract company infomation from roundbase then process it | ||
+ | **Add variable for growth (1), transactional (2), or non-VC (3) (see below) | ||
+ | **Create clean 5-digit ZIP field | ||
+ | *Extract firm information from firmbase, then: | ||
+ | **Add indicator variable for US | ||
+ | **Create clean 5-digit ZIP field | ||
+ | *Build out BranchOffice (BO) table | ||
+ | *Build lookups for: | ||
+ | **State Code | ||
+ | **raisestatus | ||
+ | |||
+ | ==Lookup Tables== | ||
+ | |||
+ | ===Stage of Investment=== | ||
+ | |||
+ | Growth VC (1) | ||
+ | -------------- | ||
+ | Seed | ||
+ | Early Stage | ||
+ | Later Stage | ||
+ | |||
+ | Tranactional VC (2) | ||
+ | -------------- | ||
+ | Acq. for Expansion | ||
+ | Acquisition | ||
+ | Bridge Loan | ||
+ | Expansion | ||
+ | Pending Acq | ||
+ | Recap or Turnaround | ||
+ | |||
+ | Exclude (non-VC) (3) | ||
+ | -------------- | ||
+ | LBO | ||
+ | MBO | ||
+ | Open Market Purchase | ||
+ | PIPE | ||
+ | Secondary Buyout |
Revision as of 12:46, 16 June 2017
Venture Capital (Data) | |
---|---|
Project Information | |
Project Title | Venture Capital (Data) |
Owner | Adrian Smart, Jake Silberman |
Start Date | |
Deadline | |
Keywords | Data |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Contents
Project Objective
To create a master venture capital data set on which all other projects requiring venture capital data will be built.
Specifically, we will be building data to support the following:
- Lists and counts by year as well as state, MSA, and city, zip code, etc.
- Identify seed, early and later stage deals (first investment), dollars invested and rounds
- Identify transactional VC dollars invested and rounds
- Acquisitions and IPOs of VC backed firms
- Identify when VC backed firms are "alive"
- Geocode all portfolio companies
- Geocode VC headquarters and branch offices
- Compute distances between investors and their portfolio companies
- Identify lead investors for portfolio companies
- Identify top 100 VCs
Retrieving data
Old notes are Old Venture Capital Data Work
All pulls and processing scripts are in:
E:\McNair\Projects\VC Database
Each of the following has a .rpt, .ssh, and .txt file, and has the following constraints in its search:
- Venture related deals
- Round date in 1/1/1980 to 06/15/2017
The portfolio company (inc. round) based pulls also have:
- Portfolio company nation = United States
The datasets retrieved from SDC platinum (on June 13-15th 2017) are:
- Portfolio companies (USVC1980-present.ssh) - attributes to be extracted from roundlevel information
- Portfolio company descriptions - just the portco name and the long description. Custom processed.
- Round-on-one-line (USVCRound1980-present) - processed using RoundOnOneLine.pl
- Funds (USVCFund1980-present)
- Firms (VCFirms1980-present-pull2) - includes branch office so attributes must be extracted
- IPOs
- M&A
All files were processed with NormalizeFixedWidth.pl (after footers were removed) unless otherwise indicated. Some files required some minor post-processing to load into PostgreSQL. Issues included:
- Firm level data didn't normalize correctly - had to adjust headers
- Stray quotation in address line
- Area code had a 1- in it
- Some line counts were off by one or two
Additional datafiles (in E:\McNair\Projects\VC Database):
- GeocodedVCData.txt 43,724 records, tab-delimited with companynames but with "none" for some geocoords.
Loading the data into SQL
The SQL script and load data are in:
Z:\VentureCapitalData\SDCVCData
The load script is:
LoadData.sql
Processing the base tables
The SQL scripts need to do the following:
- Extract company infomation from roundbase then process it
- Add variable for growth (1), transactional (2), or non-VC (3) (see below)
- Create clean 5-digit ZIP field
- Extract firm information from firmbase, then:
- Add indicator variable for US
- Create clean 5-digit ZIP field
- Build out BranchOffice (BO) table
- Build lookups for:
- State Code
- raisestatus
Lookup Tables
Stage of Investment
Growth VC (1) -------------- Seed Early Stage Later Stage
Tranactional VC (2) -------------- Acq. for Expansion Acquisition Bridge Loan Expansion Pending Acq Recap or Turnaround
Exclude (non-VC) (3) -------------- LBO MBO Open Market Purchase PIPE Secondary Buyout