VCDB24
VCDB24 is the 2024 and final iteration of my VentureXpert-based Venture Capital DataBase. Thomson Reuters discontinued access to VentureXpert through SDC Platinum on December 31st, 2023 (see also: SDC Normalizer). This iteration contains data up until then. Each VCDB includes investments, funds, startups, executives, exits, locations, and more. The previous build was VCDB23, but the best previous instructions are from VCDB20 or the McNair Center build, which was called VentureXpert Data.
Processing Steps
Get the source data:
- Copy over the rpt, ssh, and pl files to E:\projects\vcdb24\SDC, and bulk-edit the ssh files:
  - Set the final date to 12/31/2023 and change vcdb23 to vcdb24
- Run the ssh files against SDC Platinum one last time on 31 December 2023.
- Run the SDC Normalizer script (one of the pl files) on each output:
  - Fix the header row in USFirms1980.txt before normalizing (the Capital Under Management column name is too long)
  - Remove double quotes from USFund1980-normal.txt, USFundExecs1980-normal.txt, USPortCo1980-normal.txt, and USFirmBranchOffices1980.txt
  - The private and public M&A file sets have to be separately combined into two files after they've been normalized. Then replace \tnp\t and \tnm\t with \t\t in each.
  - For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first, then RoundOnOneLine.pl, and then fix the header.
  - PortCoLongDescription must be pre-processed from the command line and then post-processed in Excel (see below, as well as VCDB20H1 and Vcdb4#Long_Description).
Create the Postgres database:
- Create a new database on mother (createdb vcdb24) and set up a directory for the input files: bulk\vcdb24
- Copy Load.sql over (to the sql folder) and edit it. Run it section-by-section (one section is sketched below).
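As orientation, here is a minimal sketch of what one Load.sql section might look like, loading one of the normalized files. The table name, columns, and types are illustrative assumptions, not the actual schema:

-- Hypothetical Load.sql section; table, columns, and types are assumptions.
DROP TABLE IF EXISTS usfund;
CREATE TABLE usfund (
    fundname  varchar,
    firmname  varchar,
    closedate date,
    fundsize  numeric
);
\copy usfund FROM 'bulk\vcdb24\USFund1980-normal.txt' WITH (FORMAT csv, DELIMITER E'\t', HEADER, NULL '')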
PortCoLongDescription
Process the Long Description data as follows:
- Remove the header and footer, and then save as Process.txt using UNIX line endings and UTF-8 encoding.
- Run the first section (producing Out5.txt) of the regex process below
- Import into Excel to make it tab-delimited
- Remove double quotes " from just the description field
- Put in a new header
- Save as In5.txt with UNIX/UTF-8
- Run the last regexes. They deal with the spaces in the description and the cases where there is no description.
- Try importing USVCPortCoLongDesc1980Cleaned.txt. It should be fine.
cat Process.txt | perl -pe 's/^([^ ])/###\1/g' > Out1.txt   # mark record-start lines (those not beginning with a space)
cat Out1.txt | perl -pe 's/\s{65,}/ /g' > Out2.txt          # collapse long runs of fixed-width padding to one space
cat Out2.txt | perl -pe 's/\n//g' > Out3.txt                # drop newlines, joining everything into one stream
cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt             # split the stream back into one line per record
cat Out4.txt | perl -pe 's/(\d{4}) $/\1\t/g' > Out5.txt     # turn a trailing "YYYY " into "YYYY<tab>" (empty description)
...
cat In5.txt | perl -pe 's/(\d{4})\t$/\1###/g' > Out6.txt    # re-mark records with an empty description field
cat Out6.txt | perl -pe 's/\s{2,}/ /g' > Out7.txt           # squeeze repeated whitespace inside descriptions
cat Out7.txt | perl -pe 's/###/\t/g' > USPortCoLongDesc1980Cleaned.txt  # restore the empty description delimiter
Geocoding
Part of Load.sql requires updating the geocoding: adding new longitude and latitude for PortCos and firm offices that we haven't seen before.
The last time this was run was vcdb20. Accordingly:
- In vcdb20, export the portcogeo, firmgeo, and bogeo tables
- Import them as portcogeo_vcdb20, firmgeo_vcdb20, and bogeo_vcdb20
- Build the portcogrowthneedsgeo, firmneedsgeo, and firmboneedsgeo files for geocoding (one is sketched after this list)
- Log into the Google Console (https://console.cloud.google.com/) and set up an API key. Note that:
  - Up to $200/month should be free
  - $5.00 USD per 1,000 lookups
  - 3,000 QPM max
- In E:/tools/Geocode, run the scripts: Geocode.py for portcos and GeocodeOneKey.py for everything else.
  - Strip the header line out of the input file(s)
  - python Geocode.py portcogrowthneedsgeo-NoHeader.txt
- Get the latest Gazetteer file(s): https://www.census.gov/geographies/reference-files/time-series/geo/gazetteer-files.html
- Check the coverage of portcogeo and create firmbogeoplus
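As a sketch of how one of the needs-geo extracts might be built, matching against the imported vcdb20 table on firm name; the firms source table and all column names here are assumptions, not the actual schema:

-- Hypothetical sketch: collect firms with no coordinates from the vcdb20 run.
DROP TABLE IF EXISTS firmneedsgeo;
CREATE TABLE firmneedsgeo AS
SELECT f.firmname, f.address, f.city, f.statecode, f.zipcode
FROM firms AS f
LEFT JOIN firmgeo_vcdb20 AS g ON f.firmname = g.firmname
WHERE g.firmname IS NULL;  -- only firms we haven't geocoded before
\copy firmneedsgeo TO 'E:/tools/Geocode/firmneedsgeo.txt' WITH (FORMAT csv, DELIMITER E'\t')

Exporting without HEADER matches the "strip the header line" step above.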
Exits
Another part of Load.sql does the matching to IPOs and M&As and their precedence. Note that:
- Issuer and target names are matched against themselves using the Matcher.pl script in mode 2
- PortCo stdnames are matched to issuerstd and targetstd (separately) in mode 0
- The state and date matching requirements are in Load.sql (a sketch follows this list).
- There seem to have been duplicate issue records in the IPO data for vcdb2020 (and perhaps earlier). Some duplicates are identical except that the dates are a day apart.
- The IPO records also contain listings on junior and foreign exchanges, as well as some OTC; I left these in and flagged them.
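A minimal sketch of what the state and date requirements might look like on the IPO side, assuming hypothetical portcos, ipos, and Matcher.pl output tables; the real conditions are in Load.sql:

-- Hypothetical sketch of the IPO match constraints; all names are assumptions.
SELECT p.coname, i.issuedate
FROM portcos AS p
JOIN ipomatch AS m ON p.stdname = m.portcostd   -- Matcher.pl mode 0 output
JOIN ipos AS i ON m.issuerstd = i.issuerstd
WHERE p.statecode = i.statecode                 -- states must agree
  AND i.issuedate >= p.firstinvestmentdate;     -- the IPO can't precede the first VC round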
Industry
The industry coding is in IndustryCodes.txt. Note that:
- The code map had to be updated. The Excel file is in projects/vcdb24 but I didn't have the original counts to hand.
- The codes are not unique at the industry subgroup 3 level (576 out of 585 are unique).
- The codes (code, code20, code100) should be joined using indclass, indminorgroup, and indsubgroup (see the sketch after the table below).
- The codes are hierarchical at the 1, 2, and 4 digit levels, but not at 3 digits.
- 1dg codes are IT, LS, and Other.
- Note that code is the full 4dg industry identifier, whereas code20 and code100 are name-based aggregates with at least 20 or 100 observations in them.
2dg Code | Minorcode                      | No. of PortCos
---------|--------------------------------|---------------
11       | Communications and Media       | 3,930
12       | Computer Hardware              | 3,058
13       | Computer Software and Services | 21,157
14       | Internet Specific              | 14,440
15       | Semiconductors/Other Elect.    | 3,145
21       | Biotechnology                  | 4,251
22       | Medical/Health                 | 7,138
31       | Consumer Related               | 7,459
32       | Industrial/Energy              | 7,028
33       | Other Products                 | 14,246
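A minimal sketch of the three-key join described above; the portcos and industrycodes table and column names are assumptions:

-- Hypothetical sketch of attaching code, code20, and code100 to PortCos.
SELECT p.coname, c.code, c.code20, c.code100
FROM portcos AS p
JOIN industrycodes AS c
  ON  p.indclass      = c.indclass
  AND p.indminorgroup = c.indminorgroup
  AND p.indsubgroup   = c.indsubgroup;  -- all three keys: codes alone aren't unique at the subgroup level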
I also tried some keyword industry coding from both the short and long descriptions; a sketch of one possible approach follows. The source code is at the top of BuildBaseTables.sql, and the results are in sheets in the IndustryCodes.xlsx file.
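A minimal sketch of keyword coding from a description field, assuming a portcos table with a description column; the keywords and the mapping here are invented for illustration, and the actual lists are at the top of BuildBaseTables.sql:

-- Hypothetical keyword coding sketch; keywords, table, and columns are assumptions.
SELECT coname,
       CASE
           WHEN description ILIKE '%software%' OR description ILIKE '%internet%' THEN 'IT'
           WHEN description ILIKE '%biotech%'  OR description ILIKE '%medical%'  THEN 'LS'
           ELSE 'Other'
       END AS keywordcode
FROM portcos;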
BuildBaseTables.sql
BuildBaseTables.sql builds the PortCoGrowthGeoId table, which codes each city-state pair to a geoid. The origin column records which method produced the coding (one rule is sketched after the table):
Origin | Count  | Method
-------|--------|------------------------------------------------------------------
1      | 45,111 | Address is geocoded and in a tiger place
2      | 270    | city, statecode matches to only 1 geoid, so use it
3      | 1,374  | city, statecode matches to multiple geoids; use the most popular
4      | 964    | 1:1 straight city<->place and statecode match with tiger
5      | 509    | Use zctaplaceinfo to look up the best place choice for the zipcode
6      | 636    | Unable to code
9      | 24     | Custom coded
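As a sketch of one rule in the cascade (origin 2, the unambiguous city/state match); the portcogrowthgeoid and tigerplaces names and columns are assumptions:

-- Hypothetical sketch of the origin = 2 rule; table and column names are assumptions.
UPDATE portcogrowthgeoid AS p
SET geoid  = t.geoid,
    origin = 2
FROM (
    SELECT city, statecode, min(geoid) AS geoid
    FROM tigerplaces
    GROUP BY city, statecode
    HAVING count(DISTINCT geoid) = 1   -- city/state pairs that match exactly one geoid
) AS t
WHERE p.origin IS NULL
  AND p.city = t.city
  AND p.statecode = t.statecode;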
StartupCities
The original Startup Cities code is in E:\projects\BayesianStartupCities\V1\startupcities.sql. The new version is in E:\projects\BayesianStartupCities\StartupCitiesV2.sql.