Vcdb4


Source Files

Files are in:

E:\projects\vcdb4

The old files from VentureXpert Database are in the subfolder Student Work, and their latest work is in Updated.

We need a set of pulls (according to E:\projects\vcdb3\OriginalSQL\LoadingScriptsV1.sql), which are documented below, as well as some lookup tables (CPI may need updating) and some joined tables (which would have to be updated separately) in MatchingEntrepsV3.sql:

  • PortCoSBIR: PortCoSBIR.txt
  • PortCoPatent: PortCoPatent.txt

And to update RevisedDBaseCode.sql, we'll need to:

  • Join in the Crunchbase (which needs updating)
  • Update the Geocoordinates

Note that this data could support new or updated versions of:

and others.

The build should be done as quickly, but as cleanly, as possible: it is needed right away, but it will also likely need to be updated in January 2020 to reflect 2019's year end.

SDC Platinum Requests

Everything was updated to 09/22/2019 as the final date. Some files were renamed for clarity. Each result is a triplet of .ssh, .rpt, and .txt files. The following scripts, reports and their outputs are in E:\projects\vcdb4\SDC:

  • USVCRound1980
  • USVCPortCo1980
  • USVCRoundOnOneLine1980
  • USVCFund1980
  • USVCFirms1980
  • USPortCoLongDesc1980
  • USVCFirmBranchOffices1980
  • USIPO1980
  • USVCPortCoExecs1980
  • USVCFundExecs1980
  • USMAPrivate100pc1985
  • USMAPrivate100pc2013

The two USMAPrivate100pc queries are different. The first pulls just the date announced, date effective, target name, target state, and transaction value (tv). The second adds basic acquirer information from 2013 forward (to allow for retroactive revision by Thomson for 5+ years) and can be combined with MAUSTargetComp100pc1985-July2018.txt (after adjusting the spacing) to make USMAPrivate100pc2013Full. For some reason, the query always fails with an out-of-memory message when trying to pull the whole thing.

  • USVCRound1980 was updated to remove fields that should have been in USVCPortCo1980 only. When normalizing, be sure to only copy down key fields.
  • USMAPrivate100pc1985 was updated to reflect the M&As load in LoadingScriptsV1; there wasn't a good original. We are using 1985 forward, as data issues prevent download/extraction of the 1980-1984 data. Year Completed was added as a check variable but might have been the source of issues and so was removed; Date Effective can be used instead.
  • USIPO1980 was updated to allow all exchanges (not just NNA). I couldn't require completion in the search, so that will have to be done in the dbase.
  • USVCFund1980 was updated because some variables -- those concerned with the fund's name and fund address -- had changed name.
  • USVCRoundOnOneLine1980 was fixed so that it is just the key (coname, statecode, datefirst) and the round info field, so that it works with the RoundOnOneLine.pl script.
  • Finally, note that USPortCoLongDesc1980 needs processing separately (see below).

Long Description

The instructions on Retrieving_US_VC_Data_From_SDC#Scripts_and_other_info were modified as follows:

  1. Remove the header and footer, and then save as Process.txt using UNIX line endings and UTF-8 encoding.
  2. Run the Regex process (note that I modified it slightly)
  3. Manual Clean
  4. Remove double quotes " from just the description field
  5. Put in a new header with a very long description column
  6. Run the normalizer
  7. Remove duplicate spaces from the description column by pushing the data through Excel and running the last regex (save as In5.txt with UNIX line endings and UTF-8 encoding)
  8. Remove quote marks from Out5.txt, resave, and then put the file back into Excel to create USVCPortCoLongDesc1980Cleaned.txt
# Prefix lines that start with a non-space character (the start of each record) with ###
cat Process.txt | perl -pe 's/^([^ ])/###\1/g' > Out1.txt
# Collapse runs of 65 or more whitespace characters (column padding) to a single space
cat Out1.txt | perl -pe 's/\s{65,}/ /g' > Out2.txt
# Strip the newlines so that each record's wrapped lines are joined together
cat Out2.txt | perl -pe 's/\n//g' > Out3.txt
# Turn the ### markers back into newlines, giving one record per line
cat Out3.txt | perl -pe 's/###/\n/g' > Out4.txt
...
# The last regex (step 7): collapse duplicate spaces in the description column
cat In5.txt | perl -pe 's/\s{2,}/ /g' > Out5.txt

Round On One Line

The process is to run USVCRoundOnOneLine1980.ssh with USVCRoundOnOneLine1980.rpt to generate USVCRoundOnOneLine1980.txt, then remove the footer and:

perl Normalizer.pl -file="USVCRoundOnOneLine1980-NoFoot.txt"
# manually copy down the key (columns 0, 1, 2) in the normalized file
perl RoundOnOneLine.pl -file="USVCRoundOnOneLine1980-NoFoot-normal.txt"
# then put the header back in!

Everything else

Just run the Normalizer. Only copy down key fields -- never copy down anything else as it will introduce data errors. The primary and foreign key fields, which may still need cleaning in SQL to be valid, are as follows (they are marked with * if they should be copied down):

  • USVCRound1980 -- coname*, statecode*, datefirst*, rounddate
  • USVCPortCo1980 -- coname, statecode, datefirst
  • USVCRoundOnOneLine1980 -- coname*, statecode*, datefirst*, rounddate, fundname
  • USVCFund1980 -- fundname, firmname
  • USVCFirms1980 -- firmname
  • USPortCoLongDesc1980 -- coname*, statecode*, datefirst*
  • USVCFirmBranchOffices1980 -- firmname
  • USIPO1980 -- issuer
  • USVCPortCoExecs1980 -- coname*, statecode*, datefirst*
  • USVCFundExecs1980 -- fundname*, and maybe fundyear*
  • USMAPrivate100pc2013 -- dateeffective, targetname, acquirorname

Note that USMAPrivate100pc2013 and USIPO1980 have some non-numerics in their value fields, and we are generally going to have to take care of some type issues.
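
For example, non-numeric entries can be nulled out before casting; this is a minimal sketch, assuming a loaded staging table and a value column (the names usipo1980load and principalamt are hypothetical):

-- Null out anything that is not a plain decimal number before casting to numeric.
-- The table and column names here are hypothetical; the real ones come from the load scripts.
SELECT issuer,
       CASE WHEN principalamt ~ '^[0-9]+(\.[0-9]+)?$'
            THEN principalamt::numeric
            ELSE NULL
       END AS principalamtclean
FROM usipo1980load;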

Loading the data

First, create a dbase:

createdb vcdb4

Then load the data, running Load.sql
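
Load.sql follows the usual create-table-then-copy pattern for each pull; a minimal sketch of that pattern, with a hypothetical column list and file path:

-- Hypothetical example of the load pattern in Load.sql; the real column lists and paths differ.
DROP TABLE IF EXISTS firmbase;
CREATE TABLE firmbase (
    firmname varchar(255),
    foundingdate date,
    capitalundermgmt real
);
\COPY firmbase FROM 'USVCFirms1980-normal.txt' WITH (FORMAT csv, DELIMITER E'\t', HEADER, NULL '');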

The following were nuances of this process:

  • Replace all double quotes with nothing in PortCo, Fund, and FundExecs
  • Renormalize firm by fixing the header -- the Capital Under Mgmt field was too close to the next field -- and remove two bad area codes (lines 931 and 11298).
  • Be careful not to replace single quotes in company names as it will destroy the keys (or do it everywhere!).

Note that our roundbase doesn't currently have (non-key) company-level fields, unlike before. But this shouldn't matter, as we were going to throw them out anyway when we build Round.

The next phase is to standardize the names and match the portcos to the IPOs and MAs. Then clean up all of the core tables. That takes us to the end of the old LoadingScriptsV1.sql. We then need to do the equivalent of updating GeoLoad.sql. This time, we are going to use the old data, as before, but try updating any portco or firm that doesn't have at least 4dp (11.1m resolution) or perhaps 5dp (1.11m resolution) accuracy [1]. We have to fix the PortCo keys first, but we can only run 2,500 queries through Google each day for free per API key, so this should be a batch job.
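
Coordinates with fewer than four decimal places can be flagged directly in SQL; a minimal sketch, assuming the old coordinates sit in vcdb3portcogeo with latitude and longitude columns (the column names are assumptions):

-- Companies whose stored coordinates have fewer than 4 decimal places (worse than ~11.1m resolution).
-- Column names are assumptions; adjust to the actual geo table.
SELECT coname, statecode, datefirst, latitude, longitude
FROM vcdb3portcogeo
WHERE length(split_part(latitude::text, '.', 2)) < 4
   OR length(split_part(longitude::text, '.', 2)) < 4;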

Enable PostGIS Extensions

Before we can load up geometries and other PostGIS types, and do PostGIS operations, we need to:

create extension postgis;
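
With the extension enabled, coordinate pairs can be stored and queried as geometries; a minimal sketch (the table and column names are assumptions):

-- Add a point geometry (WGS84) built from longitude/latitude columns.
-- Table and column names are assumptions.
ALTER TABLE portcogeo ADD COLUMN geom geometry(Point, 4326);
UPDATE portcogeo SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);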

Updating GeoCoding

The vcdb3 portcogeo table was saved as vcdb3portcogeo.txt and uploaded to vcdb4. Then a new table, companyneedsgeo, was created with valid addresses that could be geolocated. There were fewer than 200 repeat addresses (see examples below) out of the 5519 to be geocoded.

SELECT address, count(*) FROM companyneedsgeo GROUP BY address HAVING COUNT(*) >1;
/*
 address	count
501 Massachusetts Avenue, Cambridge, MA, 02139	4
479 Jessie Street, San Francisco, CA, 94103	4
745 Atlantic Avenue, Boston, MA, 02111	4
80 State Street, Albany, NY, 12207	4
953 Indiana Street, San Francisco, CA, 94107	5
79 Madison Avenue, New York, NY, 10016	5
400 Technology Square, Tenth Floor, Cambridge, MA, 02139	6
440 North Wolfe Road, Sunnyvale, CA, 94085	7
 */
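
A minimal sketch of how companyneedsgeo might be assembled from the address components, assuming companybasecore holds them and vcdb3portcogeo holds the old coordinates (the join keys and column names are assumptions):

-- Companies with a usable address and no old geocode; names are assumptions.
CREATE TABLE companyneedsgeo AS
SELECT c.coname, c.statecode, c.datefirst,
       c.address1 || ', ' || c.city || ', ' || c.statecode || ', ' || c.zip AS address
FROM companybasecore AS c
LEFT JOIN vcdb3portcogeo AS g
       ON c.coname = g.coname AND c.statecode = g.statecode AND c.datefirst = g.datefirst
WHERE c.address1 IS NOT NULL AND c.city IS NOT NULL AND c.zip IS NOT NULL
  AND g.coname IS NULL;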

Then run Geocode.py on the output file (broken into batches of 2,500 address queries):

python3 Geocode.py companyneedsgeo1-2499.txt

Matching

Generally everything should be matched to itself first. Matching should be done using mode=2:

perl .\Matcher.pl -file1="DistinctConame.txt" -file2="DistinctConame.txt" -mode=2
perl .\Matcher.pl -mode=2 -file1="DistinctTargetName.txt" -file2="DistinctTargetName.txt" 
perl .\Matcher.pl -mode=2 -file1="IPODistinctIssuer.txt" -file2="IPODistinctIssuer.txt"

Then clean up the PortCo table by removing any duplicate companies with different datefirstinv records. Following this, construct portcokeysstd (to be used to restrict keys throughout) and companybasecore (49136).
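
A minimal sketch of the duplicate removal, assuming we keep the earliest datefirstinv per standardized name and state (the table and column names, and the keep-the-earliest rule, are assumptions):

-- Drop the later of any duplicate company rows that differ only in datefirstinv.
-- Table and column names, and the keep-the-earliest rule, are assumptions.
DELETE FROM portco AS p
USING portco AS q
WHERE p.conamestd = q.conamestd
  AND p.statecode = q.statecode
  AND p.datefirstinv > q.datefirstinv;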

Match the PortCos against the M&As and IPOs, and then review the matches:

 perl .\Matcher.pl -mode=2 -file1="PortCoMatchInput.txt" -file2="MAMatchInput.txt"

The M&A review does the following (10406):

  • Check if Hall (not Multi), that datefirstinv < announceddate, and that the statecodes match. Take the match when all three hold. (8064)
  • Throw out matches where the statecodes differ or where announceddate < datefirstinv. (1879)
  • Of the remaining (463), take the minimum date announced.

Follow a similar procedure for the IPOs. There are 3134 IPOs, of which 2551 are all good (3 out of 3), 495 have bad dates or states, and 87 are left. Of these 87, almost all are pairs with two issue dates (presumably the ones that aren't pairs had a bad date), and we should take the minimum issue date.
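
A minimal sketch of the review logic for the M&A matches, assuming the matcher output has been loaded into a table holding both sides' keys (the table and column names are assumptions):

-- Keep matches where the states agree and the first investment predates the announcement,
-- then keep the earliest announced date per portco. Names are assumptions.
SELECT DISTINCT ON (conamestd, statecode, datefirstinv)
       conamestd, statecode, datefirstinv, targetnamestd, announceddate
FROM mamatch
WHERE statecode = targetstatecode
  AND datefirstinv < announceddate
ORDER BY conamestd, statecode, datefirstinv, announceddate;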

The result will be (with portcokey unique and valid):

  • MACleanNoDups (8344) - PortCoKey, conamestd, targetnamestd, MA Info...
  • IPOCleanNoDups (2600) - PortCoKey, conamestd, issuernamestd, IPO Info...

Exits

Next we need to determine precedence when a PortCo has both an IPO and an M&A record; there are 173 such records. This is done in the query ExitKeys. PortCoExit (49136) is then built. This version fixes an error from before, where announceddate, rather than effectivedate, was being used as exitdate for M&As. Note that exitvaluem is left as null for non-exits. PortCoAliveDead still uses a 5 year interval, but the data truncation date is updated to 9/22/2019. exitvaluem is then updated to 0 for those that are deemed dead without an exit and left as null for non-exits. Exitdate and year are left as is, and new fields (deaddate, deadyear) are created appropriately.
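
A minimal sketch of the precedence step, using MACleanNoDups and IPOCleanNoDups; the choice to prefer the IPO when both exist, and the issuedate column name, are assumptions (the real rule is in the ExitKeys query):

-- One exit per portco; prefers the IPO when both an IPO and an M&A exist.
-- The preference and the issuedate column name are assumptions.
SELECT k.portcokey,
       CASE WHEN i.portcokey IS NOT NULL THEN 'IPO' ELSE 'MA' END AS exittype,
       COALESCE(i.issuedate, m.effectivedate) AS exitdate
FROM portcokeysstd AS k
LEFT JOIN ipocleannodups AS i ON k.portcokey = i.portcokey
LEFT JOIN macleannodups AS m ON k.portcokey = m.portcokey
WHERE i.portcokey IS NOT NULL OR m.portcokey IS NOT NULL;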

Fund and Branch

We have to key off of fundname for the fund table. Fundbase has 31931 records on load. Throwing out undisclosed funds leaves 31540 records with 31479 distinct names (requiring disclosed firmnames too reduces this to 31404). The solution is to keep the duplicate with the largest numportcos, discarding Net Ventures, Cambridge Innovations - Unspecified Fund, and Spring Capital, as these are the only three with a tie and all are of no consequence (numportcos=1).
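
A minimal sketch of that de-duplication, keeping the row with the largest numportcos per fundname (the column names, and the filter for undisclosed funds, are assumptions):

-- One row per fund name, keeping the duplicate with the most portfolio companies.
-- The undisclosed-fund filter and column names are assumptions.
SELECT DISTINCT ON (fundname)
       fundname, firmname, numportcos
FROM fundbase
WHERE fundname NOT ILIKE 'undisclosed%'
ORDER BY fundname, numportcos DESC;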

Likewise, we have to key off of firmname for all firm tables. It isn't that we don't have other identifying information in firmbase (all records have a founding date), but we only have firmname as a foreign key in the fund table and the branch office table.

Additions

Population

For now:

  • The population table (see Population.sql)
  • The statepop2017 table (from States2017PopulationEstimate.txt, with DC (2018), GU (2017), and PR (2017) added)

We might want to join in the American Community Survey (ACS) Data too.

Tiger Places

We will use the old Tiger Places, but it is worth exploring how to rebuild this from scratch. First, visit https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html. Then download each state's place file. Load them and merge them. See PostGIS_Installation#Bulk_Download_TIGER_Shapefiles for more information. It is also worth reading the Tiger Geocoder and Jeemin Sim (Work Log).
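
Assuming each state's place shapefile has been loaded into its own table (for example with shp2pgsql), the merge itself is just a union; a minimal sketch with hypothetical per-state table names:

-- Merge per-state place tables into a single tigerplaces table.
-- The per-state table names are hypothetical.
CREATE TABLE tigerplaces AS
SELECT * FROM places_al
UNION ALL SELECT * FROM places_ak
UNION ALL SELECT * FROM places_az;
-- ...and so on for the remaining states and territories.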

CPI

We loaded the old CPI file, updating it with real numbers for 2018 and an estimate for 2019 based on extrapolating the last four months from the first eight, and then computing an annual average. The inflator is normalized to 2017 dollars. Data was taken from: https://www.bls.gov/cpi/tables/supplemental-files/historical-cpi-u-201908.pdf
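
The inflator for each year is then the 2017 annual-average CPI divided by that year's annual average; a minimal sketch, assuming a cpi table with year and avgcpi columns (the names are assumptions):

-- Multiply a nominal value by inflator2017 to express it in 2017 dollars.
-- Table and column names are assumptions.
SELECT year,
       (SELECT avgcpi FROM cpi WHERE year = 2017) / avgcpi AS inflator2017
FROM cpi
ORDER BY year;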

Other things to do

We also might need to add:

  • Accelerator data? (See CohortPortCoMatch.sql for doing the portco matching) -- this might be in its own dbase
  • GSP (see Adding GDP.sql)
  • Industry, Firm type, Firm stage and other lookups (see MatchingEntreps.sql)
  • Titlelookup (see MatchingEntreps.sql)
  • PortCoSBIR and PortCoPatents (see MatchingEntreps.sql)

Building Datasets

We need to collect code from:

  • Ranking.sql
  • MatchingEntreps.sql, which builds:
    • Women
    • Some of matching VC-Entreps? (much might now be redundant)
  • RevisedDBaseCode.sql, which builds:
    • Matching VC-Entreps
    • Some base tables and the super tables: PortCoSuper, FirmSuper, etc.
  • Agglomeration.sql (the latest version is in E:\projects\agglomeration), which builds:
    • Agglomeration!
    • Note: Needs updating to reflect latest decisions

BuildBaseTables

To this end, the base tables, which are used by multiple different projects, are built in BuildBaseTables.sql

Build notes:

  • Common code from MatchingEntrepsV3.sql was moved to BuildBaseTables.sql
  • Almost all of the women's code now appears to be in the new base
  • The super tables still need building
  • The code for the Matching VC-Entreps project is now almost entirely in RevisedDBaseCode.sql

Agglomeration

This build supports the Urban Start-up Agglomeration and Venture Capital Investment paper. It uses the sql script: AgglomerationVcdb4.sql, which is in E:\projects\agglomeration. The notes are here: Urban_Start-up_Agglomeration_and_Venture_Capital_Investment#vcdb4

Ranking

A new Ranking.sql file was created in E:\projects\vcdb4 and run.

There are imperfect matches between cities in VentureXpert and places in Tiger. In the past, this hasn't mattered, as we'd considered such a small subset of cities (and had gotten lucky on coverage). Geocoding covers 97.2607% (47,760 out of 49,136) of companybasecore, 48,815 (99.35%) of which have a non-null city. This drops to 96.42% of firms with datefirstinv >= 2010. Using just the geocoded records would therefore cost us a small number of firms, but these may not be randomly distributed. What we need, then, is a complete lookup table...
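
A minimal sketch of such a lookup, matching VentureXpert city/state pairs against Tiger place names (the table and column names are assumptions, and the imperfect matches would still need manual review or fuzzy matching):

-- Distinct city/state pairs matched to Tiger places by name; unmatched cities come back NULL.
-- Table and column names are assumptions.
SELECT DISTINCT c.city, c.statecode, t.placename
FROM companybasecore AS c
LEFT JOIN tigerplaces AS t
       ON lower(c.city) = lower(t.placename) AND c.statecode = t.statecode
ORDER BY c.statecode, c.city;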