Difference between revisions of "Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists"
Line 5: | Line 5: | ||
|Has paper status=In development | |Has paper status=In development | ||
}} | }} | ||
+ | |||
+ | ==Notes== | ||
+ | |||
+ | This section provides notes on the analysis in Selected_prelim.do. I ran the code and did a basic exploration of variables. These are the results: | ||
+ | * I couldn't make the sign on distance change. That's a real finding! I tried CA/MA fixed effects and interactions, Silicon Valley/Boston Cambridge fixed effects and interactions, using quadratic effects, various transformations... nothing undid the results. Also, log distance is the best variable. | ||
+ | * matchinstagebroad outperforms matchinstagenarrow in some contexts, we should use it. | ||
+ | * I tried logging all the vars, even t_pccitydollarsrankm1, to no avail. | ||
+ | * pcexitvaluem is already set to 0 when exit=0. It is missing for undisclosed value acquisitions. It's a great RHS variable: Log it and use it! | ||
+ | * I used mktid fixed effects and the results look pretty good without the interactions... but less good with them for the probit. There was a slight improvement for the reg, but only 2 vars are sig either way. I think maybe year ind is fine. | ||
+ | * Doing value conditional on exit gave just 1 *** and 1 *. It's the exit that drives the sig in exit value. | ||
+ | |||
+ | So, all in all, I say you're good to go with your 'best spec' if you want to keep the interactions: | ||
+ | probit pcexit l_matchbodist matchinstagebroad c.l_firmageatdeal##c.l_pcexpceopres c.l_matchprevportcos##c.t_pccitydollarsrankm1 i.year i.ind if realmatch==1, cluster(mktid) | ||
+ | |||
+ | However, '''the interactions are never significant'''. If you drop the weakest of the interactions (the one involving l_matchprevportcos*t_pccitydollarsrankm1), you're good to go for every variable except the remaining interaction, which is (almost) borderline (it's significant if your drop the industry fixed effects): | ||
+ | probit pcexit l_matchbodist matchinstagebroad c.l_firmageatdeal##c.l_pcexpceopres matchprevportcos t_pccitydollarsrankm1 i.year i.ind if realmatch==1, cluster(mktid) | ||
+ | |||
+ | Without interactions, the variables are all good: | ||
+ | probit pcexit l_matchbodist matchinstagebroad l_firmageatdeal l_pcexpceopres matchprevportcos t_pccitydollarsrankm1 i.mktid if realmatch==1, cluster(mktid) | ||
+ | |||
+ | A word on the interpretation of variables for the write-up: | ||
+ | * l_firmageatdeal is our VC wisdom variable | ||
+ | * matchprevportcos is our VC size variable | ||
+ | * l_pcexpceopres is our PC team quality variable | ||
+ | * t_pccitydollarsrankm1 is our PC environment quality variable | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
==Dataset Rebuild== | ==Dataset Rebuild== |
Revision as of 17:19, 22 August 2021
Academic Paper | |
---|---|
Title | Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists |
Author | Ed Egan, Jeremy Fox, David Hsu, Chenyu Yang |
RAs | Meghana Gaur, James Chen, Kyran Adams, Marcos Ki Hyung Lee, Wei Wu |
Status | In development |
© edegan.com, 2016 |
Contents
Notes
This section provides notes on the analysis in Selected_prelim.do. I ran the code and did a basic exploration of variables. These are the results:
- I couldn't make the sign on distance change. That's a real finding! I tried CA/MA fixed effects and interactions, Silicon Valley/Boston Cambridge fixed effects and interactions, using quadratic effects, various transformations... nothing undid the results. Also, log distance is the best variable.
- matchinstagebroad outperforms matchinstagenarrow in some contexts, we should use it.
- I tried logging all the vars, even t_pccitydollarsrankm1, to no avail.
- pcexitvaluem is already set to 0 when exit=0. It is missing for undisclosed value acquisitions. It's a great RHS variable: Log it and use it!
- I used mktid fixed effects and the results look pretty good without the interactions... but less good with them for the probit. There was a slight improvement for the reg, but only 2 vars are sig either way. I think maybe year ind is fine.
- Doing value conditional on exit gave just 1 *** and 1 *. It's the exit that drives the sig in exit value.
So, all in all, I say you're good to go with your 'best spec' if you want to keep the interactions:
probit pcexit l_matchbodist matchinstagebroad c.l_firmageatdeal##c.l_pcexpceopres c.l_matchprevportcos##c.t_pccitydollarsrankm1 i.year i.ind if realmatch==1, cluster(mktid)
However, the interactions are never significant. If you drop the weakest of the interactions (the one involving l_matchprevportcos*t_pccitydollarsrankm1), you're good to go for every variable except the remaining interaction, which is (almost) borderline (it's significant if your drop the industry fixed effects):
probit pcexit l_matchbodist matchinstagebroad c.l_firmageatdeal##c.l_pcexpceopres matchprevportcos t_pccitydollarsrankm1 i.year i.ind if realmatch==1, cluster(mktid)
Without interactions, the variables are all good:
probit pcexit l_matchbodist matchinstagebroad l_firmageatdeal l_pcexpceopres matchprevportcos t_pccitydollarsrankm1 i.mktid if realmatch==1, cluster(mktid)
A word on the interpretation of variables for the write-up:
- l_firmageatdeal is our VC wisdom variable
- matchprevportcos is our VC size variable
- l_pcexpceopres is our PC team quality variable
- t_pccitydollarsrankm1 is our PC environment quality variable
Dataset Rebuild
This project uses VCDB20.
In E:\projects\vcdb20\
- Load.sql
- BuildBaseTables.sql
- Ranking.sql
Specific to this project:
- BuildDataset.sql
V2
New dataset
The new file is: MasterCode20YearV2-1.txt. It's in the dropbox!
V2-1 Changes
There were some issues with bodistkm being rather extreme (i.e., ~6000-8000km):
- A single portco, CodeHS had an incorrectly geocoded addresss. Despite being listed as San Francisco, CA, addr2 was "Babraham Research Campus", which Google Maps was incorrectly associating with the village of Babraham in Cambridge, UK. The address for this firm has been manually fixed (the correct address is 1328 Mission St, San Francisco, CA 94103) and its correct geocoding has been pushed through the tables (portcogeo, portcopoints, portcomaster, portcosuper) to the final dataset. CodeHS received its first round in 2012 and may achieve an exit but probably .
- The remaining extreme values were caused by portcos or firms being located outside of the continental U.S. (i.e., in HI, AK, or PR). When such a firm was paired with a mainland firm, it would have an extreme distance. The dataset is now restricted to the continental U.S. The largest distances are now ME-CA pairs or WA-FL pairs, as expected.
Summary of request
Objectives:
- Generate matchinstagenarrow and matchinstagebroad
- Check coverage of Xunjie's variables
- Generate some alternative/additional variables:
Xunjie's list of variables for estimation:
- matchhqdist -- matchbodist is preferred. was 152/500k nulls, should be done now.
- matchinstagenarrow -- Now in the dataset with improved logic. Probably use matchinstagebroad instead. No nulls.
- firmfirstinvyear -- firmageatdeal is preferred. No Nulls.
- matchprevindu20 -- no nulls.
- matchprevportcos -- no nulls.
- pcnumperson -- this is a conceptually and operationally terrible variable! See below.
- pccitydollarsrankm1 -- had lots of missing! This should be resolved now.
- pcexp -- similar issues to pcnumperson. See below.
Xunjie's Restrictions:
- I only keep data between 2002 and 2016.
- And I only keep the matching markets if the number of real matches is more than or equal to 5.
- This gives me 922 matching markets.
"The major issue is that there are many missing values (about 10%) for the last three variables (and possibly “matchinstagenarrow"). We need to drop the entire matching market if there is one missing value. The number of matching markets without missing values is 260. So less than 1/3 matching markets survive."
Dropping the entire market is surely way too extreme. We should just drop the offending portco and only drop the market if the number of real matches drops below our threshold (e.g., 5). I've included some new market stats to give analytics: mktdealcount, mktnumreal, mktnumsyn, mktnumfirms, mktvalid.
Review of Changes
In BuildBaseTables.sql:
- Fixed PortCoGeoid to use zipcodes too to determine placename
- Created separate geoid lookup table for place, statecode: PlaceStatecodeGeoid
- Pushed changes through PortCoMaster
- Added new vars to PortCoPeople and pushed them through.
Added restrictions to MatchMostNumerous (34604) by creating a temp view RLMasterRestricted where:
- Firm and portco nation code='US'
- Firm and portco statecode code!='UN'
- Code is not null
- placename is not null
- hqdistkm is not null
In Ranking.sql
- Re-run with updated portcogeoid!
- Pick up Geoid from new PlaceStatecodeGeoid table (BuildBastTables.sql)
In BuildDataset.sql.
- SynthKeys_Code20: Pushed changes through. 488065
- ComboKeys_Code20: 522669 = 488065 + 34604
- PortCoSuper (not restricted) - uses lastest version of PortCoMaster. Placename from PortCoGeoid, which uses zipcodes too. DealSuper (restricted to MatchKeys!), FirmSuper (now US only, but not restricted)
- Combodist, ComboIndu, and ComboMeasures tables (all based on ComboKeys_Code20) much as before but with new base sets
- ComboStats_Code20 added to provide market info.
- MasterCode20Year. Rerun with new feeders. New variables added.
pccitydollarsrankm1
There are a number of possible explanations for why this variable had lots of missing.
There do seem to be missing placenames. 4855/69882 PortCoSuper records don't join to PlaceYearRanking on placename and state (ignoring year) and 4,561 of these have valid zips. However, only 263 had growth VC and just 82 has non-null positive invested amounts, so this isn't the issue.
Ultimatately, I rebuild the underlying tables (portgeoid, etc.) and created a new lookup table (PlaceStatecodeGeoid), and then reran the rankings making sure to keep the "no activity" places for each year (tied for last place). The ranking variables should be fixed now.
pcnumperson / pcexp
pcnumperson suffers from a number of endogeneity issues, including:
- Thomson adds information each time the firm recieves more investment, so pcnumperson is correlated with the number of rounds, amount invested, prob of exit, etc.
- Higher quality firms/portcos are more likely to report the people in a portco.
- numperson includes a broad range of titles, roughly VP-level and above with some extras, and more organized portcos may report deeper into their ranks.
- It's possible that some non-exec board members are included
pcexp suffers from all of the above issues and more. pcexp has the following lineage:
- portcosuper.pcexp <- prevs.PortCoPeopleMaster <- sum(prev).PortCoPeopleAgg <- COALESCE(sum(hasperson),0).PortCoPeopleSerial <-hasperson.PortCoPeopleKey <- hasperson.PortCoPeopleFull <- hasperson.PortCoPeopleBase
- PortCoPeopleBase: CASE WHEN prefix IS NULL AND firstname IS NULL AND lastname IS NULL THEN 0::int ELSE 1::int END AS hasperson,
- So hasperson is a record indicator, and the pcexp sums over both people and thier positions. Accordingly:
- A portco with 1 person who has held two previous positions has pcexp=2
- A portco with 2 people who have each held one previous position has pcexp=2
- Non-exec board members (lawyers, investors, etc.) may have worked with lots of previous firms and be inflating this count!
I rebuilt these variables so that they have better coverage where possible. I also set doctors, serials, serialceopreses, serialfounders, prevs, prevceopreses, prevfounders to zero when missing in PortCoSuper (I left them as null in PortCoPeopleMaster).
We shouldn't use numperson at all. It's just horrible. Instead we should try one of the following:
- serialceopreses
- serialfounders
- serials
- doctors (maybe for something different)
- prevceopreses
- prevfounders
But I expect that you we have problems with variation.
match in stage
New logic makes 'matchinstagepref' (both narrow and broad) true when stagepref is null, the deal is sead/early and the firm is 'Ecosystem','SBIC','Angel','Gov', and otherwise when stagepref is null.
CASE WHEN (dealseed >= 1 AND (firmstageprefno=1 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (dealearly >= 1 AND (firmstageprefno=2 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (deallater >= 1 AND (firmstageprefno=4 OR firmstageprefno=5)) THEN 1::int WHEN firmstageprefno IS NULL AND firmcat IN ('Ecosystem','SBIC','Angel','Gov') AND (dealseed >= 1 OR dealearly >= 1) THEN 1::int WHEN firmstageprefno IS NULL AND firmcat IN ('Ecosystem','SBIC','Angel','Gov') AND NOT (dealseed >= 1 OR dealearly >= 1) THEN 0::int WHEN firmstageprefno IS NULL THEN 1::int ELSE 0::int END AS matchinstagenarrow, CASE WHEN (deallater >= 1 AND (firmstageprefno=1 OR firmstageprefno=2 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (dealearly >= 1 AND (firmstageprefno=1 OR firmstageprefno=2 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (deallater >= 1 AND (firmstageprefno=4 OR firmstageprefno=5)) THEN 1::int WHEN firmstageprefno IS NULL AND firmcat IN ('Ecosystem','SBIC','Angel','Gov') AND (dealseed >= 1 OR dealearly >= 1) THEN 1::int WHEN firmstageprefno IS NULL AND firmcat IN ('Ecosystem','SBIC','Angel','Gov') AND NOT (dealseed >= 1 OR dealearly >= 1) THEN 0::int WHEN firmstageprefno IS NULL THEN 1::int ELSE 0::int END AS matchinstagebroad,
Other
I added together the patents and SBIR grants pre and during VC to create the following variables (each has variation issues, but maybe try in order):
- pchaspatentsvc (1/0 indicator for portco has patents)
- pcpatentsvc (number of patents)
- pcsbircountvc (number of SBIR grants)
- pcsbiramountvc (value of SBIR grants)
Changes to date
Code is in E:\projects\unobservedcomplementarities\BuildDataset.sql
Changes:
- Changed MatchHighestRandom to MatchMaster. It is MatchMostNumerous (i.e., pick the firm with max(numportcos) for each portco from RLMaster) with a random tie break. It contains a lot of variables pertaining to the portco, firm, round, and match!
- MatchKeys is coname, statecode, datefirstinv, firmname, as well as minroundin, year, code, code20, code100. It replaces RealMatchesCode.
- Replaced SynRealSetc20 with SynthKeys_Code20.
- Replaced AllRealMatchKeysC20Code with ComboKeys_Code20, also renamed realmatch variable to isreal.
- Note that ComboKeys_Code20 (591,299 with 554,561 synths) is much smaller AllRealMatchKeysC20Code (1,631,896 with 1,599,427 synths), which allowed (almost) any other firm from RLJoinerFF that had done a deal in that code20-year. ComboKeys_Code20 only allows another real match from the same code20-year.
- Rebuilt Super tables (PortCoSuper, DealSuper, FirmSuper). Note that FirmSuper is now restricted to US firms only (matches were already US-US because MatchMostNumerous was constrained to US (and state !='UN') firms and portcos only. However, it is not clear that this was true in the past.).
- Rebuilt the AllMatchc20 tables. New names are Combo..._Code20.
- Crucial difference: Only investments within ComboKeys_Code20 are included in the history counts. Before anything in RLJoinerFF or even RoundlineBase were included, which created the impression of overcounts.
- The vast majority of missing distances were caused by missing firm addresses for just 391 VCs (some of these were state UN, and later removed). However, we have zip codes for almost all of them. I built firmbogeoplus (see VCDB20) to add in zcta centroids from the U.S. Census gazetteer where available. However, 47 zips weren't in the ZCTA lookup, and they account for 113 firms that participated in 3,997 pairs (real and synth). So I also ran GeocodeOneKey.py with a (no header) Zip\tZip input (MissingFirmZips.txt), manually added three records (02801,85292,91399) and loaded up the result as zipgeoaddon (from MissingFirmZips-Geocoded.txt) in Load.sql.
- Also required that matches have a code(/20/100)
- Changed the Experience table to count experience in previous years rather than previous rounddates, as then experience doesn't change within a market except through the match.
- Changed the join to the ranking. New build uses PlaceYearRanking, and joins on placename, state, year (placename is available in PortCoSuper).
- There was an error with Centerville Ohio appearing in the ranking twice each year from 1998 to 2001.
SELECT hasgeo, COUNT(*) FROM ComboDist_Code20 WHERE bodist IS NULL GROUP BY hasgeo; /* firm 52 none 4 pc 10444 */
Note: The following are available in firmbasecore but unused in firmvars:
SELECT invstatus, rolepref, geogpref, indpref, stagepref FROM firmbasecore; SELECT invstatus, count(*) FROM firmbasecore group by invstatus ORDER BY invstatus; Actively seeking new investments 10856 Inactive / Unknown 1625 Making few if any new investments 68 Reducing investment activity 48 1257 SELECT rolepref, count(*) FROM firmbasecore group by rolepref ORDER BY rolepref; BO 965 DO 340 EI 757 OH 10 OT 131 11651 SELECT COUNT(*) FROM ( SELECT geogpref, count(*) FROM firmbasecore group by geogpref ORDER BY geogpref) AS T; --181 --Includes, states, nations, continents, etc. SELECT COUNT(*) FROM ( SELECT indpref, count(*) FROM firmbasecore group by indpref ORDER BY indpref) AS T; --112 --COULD MAYBE BE CODED IN TERMS OF CODE20, etc.
Review of old dataset
Anticipating a dataset rebuild, the old production dataset was MasterRealC20YearFullPlus.txt, which was produced using:
E:\projects\vcdb3\MatchingEntreps2VCs\MatchingVCEntrepRevisions.sql
This seems to have an error in the construction of matchinstagebroad l129.
MasterRealC20YearFull requires:
- AllRealMatchKeysC20Code <-RealMatchescode, AllRealMatchKeysC20; AllRealMatchKeysC20 <-SynRealSetc20,RealMatchescode; SynRealSetc20 <-RealMatchescode; RealMatchesCode <-MatchHighestRandom, PortCoIindustry (RevisedDbaseCode.sql)
- PortCoSuper <- likely new PortCoMaster
- DealSuper: Would need to be rebuilt
- FirmSuper <- likely new FirmVars and FirmGrowthPerf
- AllMatchc20Dist <-AllMatchc20DistBase (RevisedDbaseCode.sql); AllMatchc20DistBase <- AllMatchKeysC20, portcogeo, firmbogeo; AllMatchKeysC20 <-RealMatchesCode, SynSetc20.
- AllMatchc20InduHistWZero <- AllMatchKeysC20Code, AllMatchc20InduHist ... RoundLineMasterBase (RevisedDbaseCode.sql)
- AllMatchc20InduTotal <- AllMatchKeysC20Code, RoundLineMasterBase (RevisedDbaseCode.sql)
- rankingfull <- likely new placeyearranking
Code for matchinstagenarrow
CASE WHEN (dealseed >= 1 AND (firmstageprefno=1 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (dealearly >= 1 AND (firmstageprefno=2 OR firmstageprefno=3 OR firmstageprefno=5)) THEN 1::int WHEN (deallater >= 1 AND (firmstageprefno=4 OR firmstageprefno=5)) THEN 1::int ELSE 0::int END AS matchinstagenarrow,
Xunjie's Notes
Xunjie Zheng wrote "Notes on Estimation of Venture Model", which is in the project folder.
Note on that doc:
- Startup-Specific Variables:
- VE adds executives round after round, so firms with more rounds will have more executives. We should only use Founders and whether or not they are serial from the executive measures.
- We should be using patents here?
- This appears ok in section 4 but not in section 1
- It somewhat repeats in section 5.1
- PC Total Investor suffers from similar issues
- The dataset might not be informational...
- Interaction terms related to VC Exp could have over 70% of zeros
- Same city indicator? Also, log distance. Transformation is awkward.
- We may want to redefine PC Rank so that the higher its value the higher the rank of the city
Potential news vars:
- FracOfDealsInSector
- Patent Measure? Imprecisely estimated...
Data:
- 748 -- real matches
- Cap on market size?
- 5-10 on each side
- market size cutoff
- firmcat - check it!
- Market size below 20-30. Over 100 markets. Should be fine.
- Single node - 80 Cores
Discussion Points
Discuss normalization:
Distance is standardized to have zero mean and unit variance with respect to the whole dataset. PC Rank is normalized to be between 0 and 1 with respect to the whole dataset. Other variables are normalized to be between 0 and 1 with respect to each matching market. See Table 1 for summary statistics for realized matches before data transformation. See Table 2 for summary statistics for realized matches after data transformation.
Discuss:
In general, the average market complementarity in Table 8, the average observed market production in Table 9, and the average observed market complementarity in Table 10 should be the highest for the original market, then after removing the worst VC, then after removing the mediocre VC, and then after removing the best VC. If the latter three counterfactual values are similar, it is due to the small magnitude of the coefficient compared to other coefficients in the model. The counterfactual values in Models 3 and 5 are higher than the original one in Table 8, it is because there is a dominant negative coefficient in the model. Also, the large standard deviation of unobserved complementarity makes unobserved complementarity and so is the whole complementarity of some matches to be large and negative. After one firm is removed, the number of total matches in the optimal assignment decreases by 1 so that the match with large and negative unobserved complementarity is dropped. However, since the observed production or observed complementarity of the same match may not be negative, such case only appears in Table 8.
To be checked:
- VC Exp is “matchprevindu20” in the codebook, which is defined as “Matched VCs prior experience investing in companies (i.e., count of companies) in the same pccode20 as the PortCo”. There might be a mistake in constructing this variable. The VC Exp for counterfactual matches are greater than or equal to the value for realized matches.
Tables in the paper
Tables: 1 & 2 Summary stats (reg, norm)
- Distance normalization is strange. Use logs?
- PC RANK should be non-linear?
- PC investor can be standardized - first round (of interest), or av. per round?
3 Structural Estimation Results
- Model 6 has two unexpected signs
- rhos are unobserved complementarity correlations and sigma is its std. dev.
- Why not the main effects? Interpretation of interactions is problematic.
4 Change in production with respect to one standard deviation of the (interaction) variable
5 Change in production with respect to one standard deviation of one variable in the interaction term while setting the other variable at 0.5
6 Number of zeros in matches (all vs. realized). Seems to indicate data issues???
7 Correlation between (two) firm-specific variables and (three) portco-specific variables in realized matches
8 Counterfactual average market complementarity, 9 Counterfactual average observed market production, 10 Counterfactual average observed market complementarity
- VC Tenure is probably the wrong measure... Presumably, funds are alive because they are in the market... Also, inverted U. Stage drift. Etc.
- Investment?
- Currently removing by tenure and experience (in thirds)
- Should do: GSVC, CVC, non-market VC?
11 - 14 - IPO and M&A and both among unmatched startups
- Exitvalue. M&A suffers from issues. Disclosed value M&A would be better?
- Worse VC IPO/M&A likely a better but maybe non-temporal as implemented.
15 Geog dist.
- Why 748 matches?
16-19 Posterior exit rates.
- VC tenure looks problematic again. Also M&A issues with firesales.
Tables from Section 5 - Results with New Variables:
- Check stage fit
- Check co-investors is in the deal.
Other Notes on Xunjie's Work
The estimation strategy is a two-step simulated method of moments. The solver is "adaptive_de_rand_1 bin_radius_limited" in the Julia package "BlackBoxOptim.jl", which executes Differential Evolution algorithm.
Julia code:
- https://github.com/robertfeldt/BlackBoxOptim.jl (support parallel CPU)
Previous Work
All work is in:
E:\projects\unobservedcomplementarities
Dataset was buit from vcdb3 (see below). Latest data files are:
- MasterRealC20YearFullPlus.txt
- MasterRealC20YearFullPlus - DataDictionary.txt
Subdirectories:
- Marcos' old work is in .\marcos
- Chenyu's code and datasets are in .\matlab
- .\linearmodel is the current STATA work
Chenyu's box (available until Oct 31st 2019) is here: https://rochester.app.box.com/s/nvtqgpmyygjykes3lcx9s53sfzmu8c27
It contains:
- working folder
- sample batch files
Both folders were cloned to E:\projects\unobservedcomplementarities\Chenyusbox on 17th July 2019.
Notes:
- data_import3.m uses MasterRealC20YearFullPlus.txt, which is the latest dataset
Linear Model
The objective is to add city ranking and serials, possibly as well as no. coinvestors, and VC experience x no. coinvestors, to a linear model. The data for the linear model should include real and synthetic matches. However, to make it comparable to Chenyu's data, we need to exclude some markets.
Marcos' used Z:\VentureXpertDB\vcdb3\MasterRealOneSynth.txt as a base dataset, which contained only a single synth. However, in lpm_full.do, he loads MasterWithSynthcode20.txt instead. Note that some of Marcos' do files were not in the dropbox but were in E:\mcnair\Projects\MatchingEntrepsToVC\Stata\DoFiles
Data notes:
- Exit, exitvalue and related measures are going to be right censored
In data_import2.m, Chenyu has the following restrictions that I clone in STATA (counts are mine):
- He starts with MasterC20YearFull.txt, rather than MasterRealC20YearFullPlus.txt (which suggests he isn't using the latest data)
- Mkts are pccode20,dealminroundyear
- Removes unmatched VCs and startups (shouldn't be in latest dataset?)
- Requires that matched VCs have synthetics with all startups in the market (should be redundant now?)
- Requires there to tbe >=5 and <=10 real matches in a market
- This reduces the number of obs from 445,710 to just 59,205 (13.3%)
- Requires the year to be >=1990 and <=2001
- 142,738 out of 445,710 (32%), or 18,055 out of 59,205
- Removes duplicates (should be redundant with revised data?)
- Removes markets with marketid NaN (not clear why this happens)
- In master_dyad.m, Chenyu has year bounds of 2002 and 2016. This upper bound likely has right censoring on exits.
The STATA do file is in:
E:\projects\unobservedcomplementarities\linearmodel
Rebuilding Marcos
Marcos starts with a dataset of reals with a single synthetic, and then constructs a dataset of reals with all synthetics (in the same year and code20).
Table 1 gives some LPMs using two sets of variables with and without VC-yearmet fixed effects. These are replicated in the new do file. In order to get something close to Marcos's reported numbers, I create a one-to-one variable so that each real match has only a single synthetic match. This gives about 60k observations as compared with Marcos's 64K (and as opposed to 445k for the full sample). The coefficients are very close to those in Table 1. There are some caveats, however. Marcos is using:
- Amounts in billions (as am I) without taking logs (of 1+x)
- Firmid x year (which he refers to as VC x yearmet) fixed effects, as opposed to year (i.e., dealminroundyear) x pccode20 fixed effects, which correctly define a market
- No restrictions on timing
Table 5 gives some LPMs before and after a Lasso. The hqdist variable was first transformed so that hqdist = hqdist/1000. Note that the matchhqdist variable is bimodal. Matchbodist is also bimodal but not as strongly. The second spike in the distribution is just over 4000km, which is the arc distance from San Francisco to Boston (4335km [1])
Again the data is just a single synthetic for each real. In this analysis, Marcos also clusters the standard errors at the year level, but does not use any fixed effects.
The labels in the pdf are somewhat misleading. The margin command reports only the underlying covariates not the interactions (unless you specifically generate the variables). An analysis of just the underlying variables without the interactions would have produced markedly different margins! The margins in table 6 column 1 of the pdf are coming from the following:
PDF -> source -------------------------------------- hdqist -> c.hqdist##c.hqdist sumprevsameindu20 -> c.sumprevsameindu20##c.sumprevsameindu20 serials -> c.serials##c.numprevportco numprevportcos -> c.patentsprevc##c.numprevportco firmtenure -> c.serials##c.firmtenure patentsprevc -> c.patentsprevc##c.firmtenure
Note that STATA uses ## to report both main effects for each variable as well as an interaction, so c.hqdist##c.hqdist reports both hqdist and hqdist^2, while c.serials##c.numprevportco reports serials, numprevportco, and serials*numprevportco. Variables are omitted when duplicated as in c.serials##c.numprevportco and c.patentsprevc##c.numprevportco, which both report numprevportco.
We don't get the same lasso results as Markus:
Variable MarcosLasso NewLasso ----------------------------- hdqist yes yes sumprevsameindu20 yes yes serials yes no numprevportcos yes no firmtenure yes yes patentsprevc no no
But Marcos's spec isn't very grounded. He clusters standard errors at the year level but uses no fixed effects. We want to know what goes on inside markets, implying market-level fixed effects. He believes that "Since non-match specific variables are not used in the structural model, we have to interact VC or Startup specific variables." I'm not sure that this is correct. He goes on to say that "Therefore, the main specification is one which every match-specific variable has a quadratic interaction, and startup and VC variables are interacted with each other. Also, we exclude industry code from the model because it is a discrete variable, and we transform VC founding year to VC tenure, which subtracts the former with year of match."
Industry certainly won't matter with market fixed effects. Marcos also used numprevportco as if it was purely a VC variable, rather than being closer to a match specific variable.
I tried Marcos's approach using all of the possible variables (old and new) but always and only using firmtenurel as a VC interaction variable (as firmportcosl is used to pick the real from the list of potential reals, and as firmapportione~ml is correlated with firmportcosl). I will also only use pccityoverallr~1l as the PortCo interaction variable, as that's the only PortCo variable that survives to significance.
The result was:
. margins, dydx(*) post Average marginal effects Number of obs = 381,882 Model VCE : Robust Expression : Linear prediction, predict() dy/dx w.r.t. : pccityoverallrankm1l firmtenurel firmportcosl matchprevindu20l matchbodistl matchinstagenarrow matchcity matchstate -------------------------------------------------------------------------------------- | Delta-method | dy/dx Std. Err. t P>|t| [95% Conf. Interval] ---------------------+---------------------------------------------------------------- pccityoverallrankm1l | .0055706 .0002035 27.38 0.000 .0051718 .0059694 firmtenurel | .0059353 .0005165 11.49 0.000 .0049229 .0069477 firmportcosl | .0052155 .0004399 11.86 0.000 .0043532 .0060777 matchprevindu20l | -.0536725 .0007413 -72.41 0.000 -.0551254 -.0522196 matchbodistl | -.0106516 .0003413 -31.21 0.000 -.0113205 -.0099826 matchinstagenarrow | .0057086 .0007494 7.62 0.000 .0042398 .0071774 matchcity | .0684326 .0041129 16.64 0.000 .0603715 .0764937 matchstate | .0436431 .0015343 28.45 0.000 .040636 .0466503 --------------------------------------------------------------------------------------
Finally, collapse the dataset by summing realmatch and produce a histogram and some analysis.
Notes from Conference Call
On July 5th at 10am, the co-authors had a conference call. These are the notes from that call:
Ed described the data. Chenyu said that he is running the estimation as two periods: one before and one after dot com crash. He is estimating:
- VCExperience (# prior deals in pccode20) x ranking ($amount or overall, lagged 1yr or 5yr)
- No. of Co-investors
- Distributional params of unobservables
The estimates are quite stable. But when he includes VC Experience x co-investors, estimates come much larger for unobserved hetero. It is possible that this is because of the optimization process. It uses a GA and the objective function is non-smooth because it relies upon Method-of-Moments (and hence the GA might not be finding the global max/min). However, it could also be a with the moments. This is the current sticking point. Possible solutions are:
- Replace the variable with total number of previous startups by VC?
- Scale Co-investors?
Ed is going to reproduce table 5 with new regression evidence for city ranking and serials. He's also going to check what is in dropbox and clear it up.
To Discuss:
- Patents - to many zeros, Serials - not specific to a market
Counterfactual
- No resorting
- Take away different categories
- Mkt vs. Non-mkt
Compute:
- Currently running it on a 120 core cluster. Each job takes 12 cores - 20 seconds for each evaluation, 6 free params.
- Needs Low RAM (3GB per core).
- Rochester - 120hr max war time
VCDB3 Rebuild
The dataset was rebuilt using vcdb3 -- See VentureXpert Data and then fixed by Ed using RevisedDbaseCode.sql in E:\projects\vcdb3 and /bulk/vcdb3.
After that, Ed did the following:
- Rebuild the code so that only matched VCs are used as synthetics
- Add year and industry as variables
- Add variables:
- City rankings over time, lagged by 1 year and by 5 years
- Age
- Various serial measures
- Rebuild data dictionary!
- Definition of industry codes again
Specifically:
E:\projects\vcdb3\MatchingEntrepsV3.sql
Fix PortCoMatchMaster and copeopleaggsimple to make and pass:
E.serialceopres, E.serialfounder, E.ceopres, E.singularceopres, E.founders, E.hasfounders, E.prevs, E.prevceopres, E.prevfounders
Also fix doctors!
E:\projects\vcdb3\RevisedDBaseCode.sql
Fix PortCoSuper and add:
C.serialceopres AS pcserialceopres, C.serialfounder AS pcserialfounder, C.ceopres AS pcceopres, C.singularceopres AS pcsingularceopres, C.founders AS pcfounders, C.hasfounders AS pchasfounders, C.prevs AS pcprevs, C.prevceopres AS pcprevceopres, C.prevfounders AS pcprevfounders, C.serialceopres*C.singularceopres AS pcserialceopressingular, C.serialfounder*C.hasfounders AS pcserialfounderhas, C.prevceopres*C.singularceopres AS pcprevceopressingular, C.prevfounders*C.hasfounders AS pcprevfoundershas
Fix E:\projects\vcdb3\OriginalSQL\Ranking.sql (Note originally fixed in E:\projects\vcdb3\Ranking.sql). Specifically, add the rankingfull queries: city, state, year, dollarsrank, dealsrank, aliverank, overallrank
Finally, in:
E:\projects\vcdb3\MatchingEntreps2VCs
Make:
- MatchingVCEntrepRevisions.sql
- MasterRealC20YearFullPlus.txt
- MasterRealC20YearFullPlus - DataDictionary.txt
Variables for inclusion
New potential variables still being considered for inclusion:
- VC historic CAPUM?
- Industry-Year Measures? Needs input from Chenyu. Likely not useful.
Currently Chenyu is using:
- Distance
- Sub-sector specific expertise of VC - could broaden definition
- Currently: most small 10-15 matches using pccode20
- Might end up with more large markets
- Startup specific experience
- patent counts - mostly 0s: 95%.
This will be revised once he has the new data set.
Chenyu is now going to do:
- Monte Carlo with data from empirical distro
- Actual estimation - doesn't take long
- Reduced form estimation: VC investment and outcomes? Logit? outcomes (exit measures). Real match explatory variable, match characteristics, controlling
- Target: May
Running Chenyu's code on HPCC
Two Wharton ugrads: Stacey and Kenneth (no account yet) are going to try running Chenyu's code on the HPCC. Chenyu is going to put everything into Box and invite us all to it.
Reference Papers
Jeremy's paper with David Hsu and Chenyu Yang is here: Unobserved Heterogeneity in Matching Games with an Application to Venture Capital.
Abstract: Agents in two-sided matching games vary in characteristics that are unobservable in typical data on matching markets. We investigate the identification of the distribution of unobserved characteristics using data on who matches with whom. In full generality, we consider many-to-many matching and matching with trades. The distribution of match-specific unobservables cannot be fully recovered without information on unmatched agents, but the distribution of a combination of unobservables, which we call unobserved complementarities, can be identified. Using data on unmatched agents restores identification. We estimate the contribution of observables and unobservable complementarities to match production in venture capital investments in biotechnology and medical firms.
Fox Hsu Yang (2015) - Unobserverd Heterogeneity in Matching Games with an Application to Venture Capital provides some notes.
MATLAB Estimator
MATLAB Code
Abhijit Brahme (Work Log) contains his notes on working with the MATLAB code. There is a seperate page here: Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists Matlab Code.
Data specification
The data spec sent to Jeremy is in:
Z:\Projects\MatchingAcceleratorsToVCs
Data foundations
The database is vcdb2
The foundational tables were built using:
Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessData2.sql
The documentation, which is a little messy, is on VC Database Rebuild
Our SQL script, which builds on top of the above database (still in vcdb2) is in:
E:\McNair\Projects\MatchingEntrepsToVC\DataWork
Dataset build
Decisions
Decisions we need to make:
- Will we need synthetic matches? If so what we do we do for outcomes? Can still do dyadic and left/right pair variables.
- Granularity of industry: To start let's use minor industry group (see below). We use a much finer grained industry definition and aggregate back up to balance out the counts somewhat later.
- Matching to a fund or a firm: For now, we will work with funds, though deals are sometimes transferred across funds within a firm (i.e. from Kliener fund IV to Kliener fund V), this is probably comparatively rare (check!).
- Dealing with the right censorship problem: We can likely address this with indicator variables to condition on, but may want to restrict estimation to dyads that don't have this issue. For now we will take portfolio companies that received their last investment before 2007, to allow funds a full 10 years to clear their portfolios.
- Inadequate coverage in early years: VentureXpert's coverage is notably inferior prior to 1982. We should start with portco that received their first investment in 1985 and forward.
- Determination of lead VC - see below
- How to collapse VC rounds (date, amount, etc.): We will use only seed, early, later stage investment and insist on the presence of seed/early for inclusion. We can then have date first, investment duration (to date last), total investment.
Objective dataset description
Unit of observation - a startup-fund match.
Constraints:
- PortCo name disclosed
- PortCo date of first investment >= 1/1/1985
- PortCo date of last investment <= 2007 to allow 10 yrs for the funds
- PortCo received at least one round of Seed or Early stage investment
- Matched VC is not undisclosed
Variables:
Startup:
- PortCo ID
- PortCo Name
- Longitude, latitude,
- State of inc., industry, year of founding, year of first investment, year of last investment
- SEL $invested, SEL num rounds, transactional VC indicator and $inv, investment duration SEL (yrs)
- Exit indicator, exit value, exit type indicator
- alive2016 indicator, last round pre-2012 indicator
- total MOOMI (Money Out Over Money In)
Fund:
- fund ID
- fund name
- Number of funds investing (SEL)
- As averages (?) and for lead:
- Fund ipo count, Fund M&A count, Fund investment count(calc at end), fund ipo rate, fund M&A rate, fund exit count, fund exit rate, fund ipo $, fund M&A $, fund exit $, fund fraction of MOOMI.
- Total invested by lead, number of rounds participation by lead, stage of participation of lead, location of lead, last investment pre-2012 indicator, lead fund type indicator (corp, priv, gov, etc.), lead fund size, lead fund vintage year.
Dyadic variables:
- Distance between lead and portco,
- industry preference match between lead and portco
- maybe stage-match (doesn't make a lot of sense when collapsing rounds) between lead and port co.
Identifying lead VCs
Possible methods:
- Best performing participant (on exit count/value or fractional MOOMI) with tie-breaker
- Closest participant (using great circle distance)
- Most frequent participant with tie-breaker
- Participant with greatest investment with tie-breaker
- Participant in earliest round that stayed in for longest with tie-breaker
Minor Industry
Across all time and without regard to SEL vs. transaction, here's the minor industry list and counts:
indminorgroup | count -------------------------------+------- Industrial/Energy | 2871 Internet Specific | 8794 Biotechnology | 2592 Semiconductors/Other Elect. | 2402 Other Products | 4891 Computer Hardware | 2061 Computer Software and Services | 10550 Communications and Media | 3271 Medical/Health | 4373 Consumer Related | 3161
Literature from David
Literature to "validate" our sample. I think you probably know the papers I reference below (let me know if you need any of them-some for which I am coauthor you can get from my website).
- VCs are more likely to match with geographically proximate startups (Lerner on corporate governance, Sorenson on geography)
- Startups prefer to match with VCs with domain experience within their startup sector (Morten Sorensen), possibly also prefer to match by stage of VC specialization relative to their own stage of development (not sure which paper if any documents that)
- Startup patents signal VCs (Hsu/Ziedonis in SMJ)
- VCs prefer serial founders, or at least may interact differently with founders based on their prior founding experience (Hsu 2007 in Research Policy)
- If we have access to more individual data: VCs prefer to invest in founders with similar demographic characteristics relative to their own characteristics (Gompers et al within the past few years in JFE, Bengtsson and Hsu in JBV within the last few years).
SBIR and Patent Data
SBIR Data taken from McNair\Projects\SBIR\Data\Aggregate SBIR\SBIR.txt. -Note! This file needed to be opened in excel to be readable, and took a very long time to open due to its large size. SBIR firm names converted to a pivot table to eliminate exact repeat entries, and then exported to a txt file, NSBIR. NSBIR then matched using The Matcher in mode 2 with the following code:
"-file1="NSBIR.txt" -file2="NSBIR.txt" -mode=2"
Output then placed in:
McNair\Projects\MatchingEntrepsToVC\Matching\Output
The original pre-matched, cleaned NSBIR.txt file is moved to:
McNair\Projects\MatchingEntrepsToVC\Matching\Input.
There is a sql file to extract VC portcos (SEL backed only), with key info from vcdb2, and distinct assignee names from allpatentsprocessed here:
E:\McNair\Projects\MatchingEntrepsToVC\Matching
There are three input files:
- distinctNSBIR.txt - made by pivot tabling SBIR.txt from the SBIR aggregation project
- distinctassignees.txt - extracted as distinct from allpatentsprocessed
- vcbackedselcokeys.txt - extracted with key info from vcdb2. It needs pivot tabling to get unique names.
These .txt files were made distinct, and then matched against themselves for normalization. The normalized files still need to be matched against each other. They are located in:
McNair\Projects\MatchingEntrepsToVC\Matching\Normalized
These normalized files were then matched against each other. Approximately 12,000 matches. they are located in:
McNair\Projects\MatchingEntrepsToVC\Matching\Normalized & Matched