[[VCDB24]] is the 2024 and final iteration of my [[VentureXpert]]-based '''V'''enture '''C'''apital '''D'''ata'''B'''ase. Thomson-Reuters discontinued access to VentureXpert through [[SDC Platinum]] on December 31st, 2023 (see also: [[SDC Normalizer]]). This iteration contains data up until then. Each VCDB includes investments, funds, startups, executives, exits, locations, and more. The previous build was [[VCDB23]], but the best previous instructions are from [[VCDB20]].

== Processing Steps ==

Get the source data:
# Copy over the rpt, ssh, and pl files, now in E:\projects\vcdb24\SDC, and bulk edit the ssh files.
## Make the final date 12/31/2023 and change vcdb23 to vcdb24.
# Run the ssh files against SDC Platinum one last time. Note that SDC Platinum's service will be withdrawn on 31 December 2023.
# Run the [[SDC Normalizer]] script (one of the pl files) on each output
## Fix the header row in USFirms1980.txt before normalizing (the Capital Under Management column name is too long)
## The private and public M&A file sets have to be separately combined into 2 files after they've been normalized. Then replace \tnp\t and \tnm\t with \t\t in each.
## For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first, then RoundOnOneLine.pl, and then fix the header.
## PortCoLongDescription must be pre-processed from the command line and then post-processed in Excel (see below as well as [[VCDB20H1]] and [[Vcdb4#Long_Description]]). However, I didn't load it for this run.

Create the postgres database:
# Create a new database on mother (createdb vcdb24) and set up a directory for the input files: E:\projects\vcdb24
# Copy over (to sql folder) and edit Load.sql. Run it section-by-section.

===PortCoLongDescription===

Process the Long Description data as follows:
#Remove the header and footer, and then save as Process.txt using UNIX line endings and UTF-8 encoding.
#Run the first section (producing Out5.txt) of the regex process below.
#Import into Excel to make tab-delimited.
#Remove double quotes " from just the description field.
#Put in a new header.
#Save as In5.txt with UNIX/UTF-8.
#Run the last regex. It deals with the spaces in the description and the cases when there is no description.
#Try importing USVCPortCoLongDesc1980Cleaned.txt. It should be fine.

 cat Process.txt | perl -pe 's/^([^ ])/###\1/g' > Out1.txt
 cat Out1.txt | perl -pe 's/\s{65,}/ /g' > Out2.txt
 cat Out2.txt | perl -pe 's/\n//g' > Out3.txt
 cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt
 cat Out4.txt | perl -pe 's/(\d{4})$/\1\t/g' > Out5.txt
 ...
 cat In5.txt | perl -pe 's/(\d{4})\t$/\1###/g' > Out6.txt
 cat Out6.txt | perl -pe 's/\s{2,}/ /g' > Out7.txt
 cat Out7.txt | perl -pe 's/###/\t/g' > USPortCoLongDesc1980Cleaned.txt
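
Before trying the import, it may help to confirm that every row of the cleaned file has the same number of tab-delimited fields as the header, since a stray tab or a missing description would otherwise surface as a load error. The following is a minimal sketch, not part of the original process: the function name check_fields is hypothetical, and it assumes the first line of the file is the header.

```shell
# Hypothetical helper (not part of the original process): compare each row's
# tab-delimited field count with the header's and report any mismatch.
check_fields() {
  awk -F'\t' '
    NR == 1 { n = NF; next }            # remember the header field count
    NF != n {                           # report rows that differ from it
      printf "line %d: %d fields, expected %d\n", NR, NF, n
      bad = 1
    }
    END { exit (bad ? 1 : 0) }          # non-zero exit if any row was off
  ' "$1"
}

# Example call (assumes the cleaned file from the last regex step exists):
# check_fields USPortCoLongDesc1980Cleaned.txt
```

The function prints nothing and returns 0 when all rows match the header, so it can be chained in front of the load step.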
