Latest revision as of 15:39, 31 May 2024
| VCDB20 | Project Information |
|---|---|
| Has title | VCDB20 |
| Has owner | Ed Egan |
| Has start date | |
| Has deadline date | |
| Has project status | |
| Has project output | Data, Tool, How-to |

Copyright © 2019 edegan.com. All Rights Reserved.
The VCDB20 project documents a build of my VCDB -- Venture Capital DataBase -- covering through the end of 2020. Each VCDB includes investments, funds, startups, executives, exits, locations, and more, derived from data from VentureXpert. This project updates vcdb4, which covered (almost) to the end of Q3 2019, and replaces VCDB20H1 and VCDB20Q3, which were partial builds. See also: SDC Normalizer.
Data design
I followed the same data design as in VCDB20H1. Essentially, the specification pulls everything -- even things that aren't needed, like incomplete M&As, withdrawn IPOs, or funds and investments that aren't venture capital -- all the way to the present (the pull was done on 2020-01-04), and then places restrictions on the data later. Crucially, the pulls no longer use the venture-related flag, so the data contains private equity and other deals, as well as secondaries and purchases. Note that the M&As are pulled separately for public and private acquirers, and in chunks by year to keep the request sizes manageable.
Processing Steps
Source Files
Copy over and update the source files:
Name
----
NormalizeFixedWidth.pl
RoundOnOneLine.pl
USFirmBranchOffices1980.rpt
USFirmBranchOffices1980.ssh
USFirms1980.rpt
USFirms1980.ssh
USFund1980.rpt
USFund1980.ssh
USFundExecs1980.rpt
USFundExecs1980.ssh
USIPO1980.rpt
USIPO1980.ssh
USMAPrivate.rpt
USMAPrivate00-10.ssh
USMAPrivate10-.ssh
USMAPrivate80-85.ssh
USMAPrivate85-00.ssh
USMAPublic.rpt
USMAPublic00-.ssh
USMAPublic80-00.ssh
USPortCo1980.rpt
USPortCo1980.ssh
USPortCoExecs1980.rpt
USPortCoExecs1980.ssh
USPortCoLongDesc1980.rpt
USPortCoLongDesc1980.ssh
USRound1980.rpt
USRound1980.ssh
USRoundOnOneLine1980.rpt
USRoundOnOneLine1980.ssh
Update the paths and dates in the ssh files, then run them (see SDC Platinum).
Database import
Run the SDC Normalizer on each of the files. For most of them, that's straightforward. You can safely ignore the Access Violation error messages that occur at the end of some pulls. However, the following require attention:
- Fix the header row in USFirms1980.txt before normalizing (the Capital Under Management column name is too long)
- Remove double quotes from USFund1980-normal.txt, USFundExecs1980-normal.txt, USPortCo1980-normal.txt
- The private and public M&A file sets have to be (separately) combined after they've been normalized. Then replace \tnp\t and \tnm\t with \t\t.
- For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first then RoundOnOneLine.pl, and then fix the header.
- The PortCo Long Description needs to be pre-processed from the command line and then post-processed in excel (see VCDB20H1 and Vcdb4#Long_Description).
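The combine-and-replace step for the M&A file sets can be sketched as follows. This is an illustration only: the function name, the glob pattern, and the cell-wise handling of the 'np'/'nm' placeholders (equivalent to replacing \tnp\t and \tnm\t with \t\t, but robust to adjacent placeholder cells) are my assumptions, not the exact commands used in the build.

```python
import glob

def combine_ma_files(pattern, outfile):
    """Concatenate normalized M&A chunk files (keeping only the first
    header row) and blank out the 'np'/'nm' placeholder cells."""
    with open(outfile, "w", encoding="utf-8") as out:
        for i, path in enumerate(sorted(glob.glob(pattern))):
            with open(path, encoding="utf-8") as f:
                for j, line in enumerate(f):
                    if i > 0 and j == 0:
                        continue  # drop the repeated header row
                    cells = line.rstrip("\n").split("\t")
                    cells = ["" if c in ("np", "nm") else c for c in cells]
                    out.write("\t".join(cells) + "\n")
```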
Create the database as a researcher:
createdb vcdb20
Move the files to //mother/bulk/vcdb20 and run the load script:
E:\projects\vcdb20\Load.sql
Create the keys
Standardize company names using Hall normalization without a fuzzy algorithm (see The Matcher (Tool)), matching them to themselves for PortCos, M&As, and IPOs. It is crucial that the self-matches are made using mode=2, or you won't select a stdname and will generate duplicate entries (from the stdname permutations).
perl .\Matcher.pl -mode=2 -file1="DistinctConame.txt" -file2="DistinctConame.txt"
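To see why the self-match collapses name variants onto a single stdname, here is a minimal sketch of Hall-style standardization: uppercase, strip punctuation, collapse whitespace, and standardize common suffixes. The suffix table below is a small illustrative subset; the real Matcher's rule set is far more extensive.

```python
import re

# Illustrative subset of Hall-style suffix standardizations; the actual
# Matcher uses a much larger rule set.
SUFFIXES = {"INCORPORATED": "INC", "CORPORATION": "CORP",
            "COMPANY": "CO", "LIMITED": "LTD"}

def stdname(name):
    """Uppercase, strip punctuation, collapse whitespace, standardize suffixes."""
    s = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    words = [SUFFIXES.get(w, w) for w in s.split()]
    return " ".join(words)
```

Both "Acme Robotics, Incorporated" and "acme robotics inc." map to the same stdname, which is what the mode=2 self-match relies on to avoid duplicate entries.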
The keys for each table are as follows:
- PortCo: Coname, statecode, datefirstinv
- M&A (private targets): targetname, statecode, announceddate
- IPOs: issuer, statecode, issuedate
- Fund: Fundname
- Firm: Firmname
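Under these composite keys, checking for duplicates amounts to grouping rows on the key tuple. A sketch (the row dicts and function names are hypothetical, for illustration only):

```python
def portco_key(row):
    """Composite key for a PortCo row: (coname, statecode, datefirstinv)."""
    return (row["coname"], row["statecode"], row["datefirstinv"])

def find_duplicates(rows, keyfunc):
    """Return the set of key tuples that appear more than once."""
    seen, dups = set(), set()
    for row in rows:
        k = keyfunc(row)
        if k in seen:
            dups.add(k)
        seen.add(k)
    return dups
```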
Note that when normalizing Fundexecs, use fundname, fundyear as the foreign keys. Generally, do not copy down keys unless they are foreign and blank in the source file.
Also, clean up the round table and create the SEL flags, as well as the PortCoSEL table. And make sure that the end of period date (plus 1) is updated in PortCoAliveDead.
Add the geocoding
Load in the old geocoding and create data files for new Google Maps API runs (see Geocode.py for directions). Note that I separate out Growth and non-growth PortCo addresses so that I can get the growth ones first for the PortCos, and that there are also firm and firm branch office addresses to process (US only). The free tier is limited to 2,500 calls/day, but the cost per call beyond that is low.
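Given the 2,500 calls/day cap, the address files need splitting into daily batches, with the growth PortCo addresses ordered first. A minimal sketch (the function name and batching-by-list approach are my illustration, not Geocode.py's actual interface):

```python
def daily_batches(addresses, per_day=2500):
    """Split an ordered address list (growth PortCos first) into
    per-day chunks that fit under the API's daily call limit."""
    return [addresses[i:i + per_day] for i in range(0, len(addresses), per_day)]
```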
Note that PortCoGeoID and other tables are built in the BuildBaseTables.sql script.
Changes to the Load script:
- Retrieve and load ZCTA Gazetteer from the U.S. Census
- A small but meaningful proportion of US venture firms have zip codes but not addresses. I created the table firmbogeoplus to add longitude and latitudes for these firms in Load.sql (right after firmbogeo).
A small number of US PortCos have zip codes but not places (which are used in the Rankings and elsewhere). To address this, I loaded the 2010 ZCTA-to-place lookup in Load.sql (passing the result through TigerGeog) to get the placename. This is now incorporated into PortCoGeoid in BuildBaseTables.sql.
2010 Census data (2020 isn't available but the mappings are fairly static):
- https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.2010.html
- https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_zcta_place_rel_10.pdf
- https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_place_rel_10.txt
I also load up the 2010 ANSI codes for places (i.e., statefp, statecode, placename, placefp, etc.):
- Source: https://www.census.gov/library/reference/code-lists/ansi.html
- Data: national_places.txt -> national_places_processed.txt (regexes)
- Pop the last word off the placename!
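"Popping the last word off the placename" refers to the place-type suffix the Census appends to ANSI place names (e.g. "Houston city", "Accord CDP"). A minimal sketch, assuming the type is always the single trailing word:

```python
def strip_place_type(placename):
    """Drop the trailing Census place-type word (e.g. 'city', 'town',
    'CDP') from an ANSI place name, if one is present."""
    return placename.rsplit(" ", 1)[0] if " " in placename else placename
```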
Note that the postprocessing of these two tables is done at the end of Load.sql. The final table is zctaplaceinfo, which takes (zcta, statecode) and returns placename and geoid (of the place).
Load the Additions
These include:
- CPI (see the spreadsheet CPIEstimate.xlsx, source data from https://data.bls.gov/pdq/SurveyOutputServlet)
- Population, statepop2017, and ACS. See American Community Survey (ACS) Data.
- tigerplaces, which builds tigergeog, and placedisplay. See Jeemin Sim (Work Log), Urban Start-up Agglomeration and Venture Capital Investment and Tiger Geocoder
- Industry, Firm type, Firm stage, Title, statecode, and other lookup tables.
- PortCoSBIR and PortCoPatent -- These are now out of date and don't appear to have build notes.
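The CPI table supports inflation-adjusting dollar amounts across the sample period. The standard adjustment, sketched below with hypothetical CPI values (see CPIEstimate.xlsx for the actual series):

```python
def to_real_dollars(nominal, year_cpi, base_cpi):
    """Convert a nominal dollar amount to real (base-year) dollars
    using the ratio of the base-year CPI to the amount's year CPI."""
    return nominal * base_cpi / year_cpi
```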
Join in the exit data
Bring in and clean up the IPO and MA (private target) data, match it to the PortCos, and reconcile the conflicts. This creates two core tables, one key table, and two PortCo results tables:
- IpoCleanNoDups
- MACleanNoDups
- ExitKeys (IPO and MA only, not PortCo)
- PortCoExit
- PortCoAliveDead
PortCoAliveDead follows the academic convention of marking a startup as alive if it has begun receiving investment, has not exited, and its last investment occurred less than five years ago.
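The alive/dead convention can be sketched as a simple predicate. The function signature and the five-year window expressed in days are my illustration of the rule stated above, not the build's actual SQL:

```python
from datetime import date

def is_alive(first_inv, last_inv, exited, as_of, window_years=5):
    """Academic alive/dead convention: alive if investment has begun,
    the startup has not exited, and the last investment falls within
    the window (five years by default)."""
    if first_inv is None or exited:
        return False
    return (as_of - last_inv).days < window_years * 365.25
```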
Base Tables
The base tables are built using BuildBaseTables.sql. These provide common foundations for:
- PortCo Geography
- Other PortCo tables, including industry, id, cpi adjustments, geoids.
- Exit, Alive/dead and patents & SBIR/STTR grant (note that these need updating) information is also included.
- Firm variables (note that the full build is only done for firms that make growth investments).
- Round Line Joiner (RLJoiner) tables.
- PortCo People, including gender, dr., titles, serials.
- Fund People: gender and dr.
- The Master tables:
- PortCoMaster
- PortCoPeopleMaster
- FirmGrowthMaster (the Firm master table, for growth investments)
- RLMaster
Finally, this code also builds MatchMostNumerous and MatchHighestRandom, which are used in Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists.