Difference between revisions of "VCDB20"

From edegan.com
Jump to navigation Jump to search
Line 5: Line 5:
 
}}
 
}}
 
<onlyinclude>The [[VCDB20Q3]] project documents a build of my VCDB -- '''V'''enture '''C'''apital '''D'''ata'''B'''ase -- covering until the end of 2020 Q3. Each VCDB includes investments, funds, startups, executives, exits, locations, and more, derived from data from [[VentureXpert]]. This project updates [[vcdb4]], which covered (almost) to the of Q3 2019, and replaces [[VCDB20H1]], which was a partial build. See also: [[SDC Normalizer]].</onlyinclude>
 
<onlyinclude>The [[VCDB20Q3]] project documents a build of my VCDB -- '''V'''enture '''C'''apital '''D'''ata'''B'''ase -- covering until the end of 2020 Q3. Each VCDB includes investments, funds, startups, executives, exits, locations, and more, derived from data from [[VentureXpert]]. This project updates [[vcdb4]], which covered (almost) to the of Q3 2019, and replaces [[VCDB20H1]], which was a partial build. See also: [[SDC Normalizer]].</onlyinclude>
 +
 +
==Data design==
 +
 +
I followed the same data design as in [[VCDB20H1]]. Essentially the specification pulls everything, even things that aren't needed like incomplete M&As or withdrawn IPOs, or funds or investments that aren't venture capital, all the way to the present (the pull was done on 11/17/2020), and then place restrictions on the data later. Crucially, the pulls no longer use the venture-related flag, so the data contains private equity and other deals, and the data does contain secondaries and purchases. Note that the M&As are pulled separately for public and private acquirers, and in chunks by year to keep the request sizes manageable.
  
 
==Processing Steps==
 
==Processing Steps==
Line 44: Line 48:
 
  USRoundOnOneLine1980.ssh
 
  USRoundOnOneLine1980.ssh
  
Update the paths and dates in the ssh files then run them.
+
Update the paths and dates in the ssh files then run them (see [[SDC Platinum]]).
 +
 
 +
===Database import===
 +
 
 +
Run the [[SDC Normalizer]] on each of the files. For most of them, that's straightforward. You can safely ignore the Access Violation error messages that occur at the end of some pulls. However, the following require attention:
 +
*Fix the header row in USFirms1980.txt before normalizing (the Capital Under Management column name is too long)
 +
*Remove double quotes from USFund1980-normal.txt, USFundExecs1980-normal.txt, USPortCo1980-normal.txt
 +
*The private and public M&A files have to be combined after they've been normalized. Then replace \tnp\t and \tnm\t with \t\t.
 +
*For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first then RoundOnOneLine.pl, and then fix the header.
 +
*The PortCo Long Description needs to be pre-processed from the command line (see [[VCDB20H1]] and [[Vcdb4#Long_Description]])
 +
 
 +
Create the dbase as a researcher:
 +
createdb vcdb20q3
 +
 
 +
Move the files to //mother/bulk/vcdb20q3 and run the load script:
 +
E:\projects\vcdb20q3\Load.sql

Revision as of 19:11, 17 November 2020


Project
VCDB20
Project logo 02.png
Project Information
Has title VCDB20Q3
Has owner Ed Egan
Has start date
Has deadline date
Has project status
Has project output Data, Tool, How-to
Copyright © 2019 edegan.com. All Rights Reserved.

The VCDB20Q3 project documents a build of my VCDB -- Venture Capital DataBase -- covering until the end of 2020 Q3. Each VCDB includes investments, funds, startups, executives, exits, locations, and more, derived from data from VentureXpert. This project updates vcdb4, which covered (almost) to the of Q3 2019, and replaces VCDB20H1, which was a partial build. See also: SDC Normalizer.

Data design

I followed the same data design as in VCDB20H1. Essentially the specification pulls everything, even things that aren't needed like incomplete M&As or withdrawn IPOs, or funds or investments that aren't venture capital, all the way to the present (the pull was done on 11/17/2020), and then place restrictions on the data later. Crucially, the pulls no longer use the venture-related flag, so the data contains private equity and other deals, and the data does contain secondaries and purchases. Note that the M&As are pulled separately for public and private acquirers, and in chunks by year to keep the request sizes manageable.

Processing Steps

Source Files

Copy over and update the source files:

Name
----
NormalizeFixedWidth.pl
RoundOnOneLine.pl
USFirmBranchOffices1980.rpt
USFirmBranchOffices1980.ssh
USFirms1980.rpt
USFirms1980.ssh
USFund1980.rpt
USFund1980.ssh
USFundExecs1980.rpt
USFundExecs1980.ssh
USIPO1980.rpt
USIPO1980.ssh
USMAPrivate.rpt
USMAPrivate00-10.ssh
USMAPrivate10-.ssh
USMAPrivate80-85.ssh
USMAPrivate85-00.ssh
USMAPublic.rpt
USMAPublic00-.ssh
USMAPublic80-00.ssh
USPortCo1980.rpt
USPortCo1980.ssh
USPortCoExecs1980.rpt
USPortCoExecs1980.ssh
USPortCoLongDesc1980.rpt
USPortCoLongDesc1980.ssh
USRound1980.rpt
USRound1980.ssh
USRoundOnOneLine1980.rpt
USRoundOnOneLine1980.ssh

Update the paths and dates in the ssh files then run them (see SDC Platinum).

Database import

Run the SDC Normalizer on each of the files. For most of them, that's straightforward. You can safely ignore the Access Violation error messages that occur at the end of some pulls. However, the following require attention:

  • Fix the header row in USFirms1980.txt before normalizing (the Capital Under Management column name is too long)
  • Remove double quotes from USFund1980-normal.txt, USFundExecs1980-normal.txt, USPortCo1980-normal.txt
  • The private and public M&A files have to be combined after they've been normalized. Then replace \tnp\t and \tnm\t with \t\t.
  • For RoundOnOneLine, remove the footer, run NormalizeFixedWidth.pl first then RoundOnOneLine.pl, and then fix the header.
  • The PortCo Long Description needs to be pre-processed from the command line (see VCDB20H1 and Vcdb4#Long_Description)

Create the dbase as a researcher:

createdb vcdb20q3

Move the files to //mother/bulk/vcdb20q3 and run the load script:

E:\projects\vcdb20q3\Load.sql