Revision as of 12:48, 11 July 2018


McNair Project
VC Startup Matching Stata Work
Project Information
Project Title VC Startup Matching Stata Work
Owner Marcos Ki Hyung Lee
Start Date 06/2018
Deadline
Keywords VC, Stata, Matching, Startup
Primary Billing
Notes
Has project status Active
Is dependent on Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists
Copyright © 2016 edegan.com. All Rights Reserved.


Exploratory files and dictionaries, as well as Stata Do-Files and Logs, are located in:

E:\McNair\Projects\MatchingEntrepsToVC\Stata

Synopsis

The VC Startup Matching Stata Work Project is support work for the Estimating Unobserved Complementarities between Entrepreneurs and Venture Capitalists academic paper.

The goal is to estimate reduced-form models and produce summary statistics to 'validate' the dataset for future structural estimation, following the evidence from the literature pointed out by David Hsu on the academic paper page.

To do so, I use the Startup-VC match dataset, which contains variables describing the startup, the VC and the match itself.

Stata Do-Files Guide

The directory

E:\McNair\Projects\MatchingEntrepsToVC\Stata

contains all the necessary files to run the analysis. All the raw datasets are in the directory itself, while Do-Files, log-files and raw output like Stata-to-tex tables are in their respective folders. Written reports in .tex are in the Tex folder.

Regarding the organization of the Do-Files, 'master.do' must be opened first. It defines the globals that make referencing directories easier and lists any extra packages that are required. In the future, when the analysis is more robust and clear, general instructions on what each do-file does will also be written in the master do-file.
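As an illustration of the kind of setup described above, here is a minimal sketch of what such a master do-file might look like. It is not a copy of the actual 'master.do': the global names and the folder names other than Tex are assumptions, and only the root path and the two do-file names come from this page.

```stata
* Illustrative sketch of a master do-file (not the actual master.do).
clear all
set more off

* Root directory as given on this page. Forward slashes also work on Windows
* and avoid a backslash being read as an escape right before a $macro.
global root "E:/McNair/Projects/MatchingEntrepsToVC/Stata"

* Sub-folder globals. Only Tex is named on this page; the rest are hypothetical.
global dofiles "$root/DoFiles"
global logs    "$root/Logs"
global output  "$root/Output"
global tex     "$root/Tex"

* Extra packages would be listed here, for example (hypothetical):
* ssc install estout, replace

* Run the individual do-files through the globals.
do "$dofiles/summarystats.do"
do "$dofiles/lpmsynthetic.do"
```

With the globals defined once, the other do-files can refer to "$root" or "$output" instead of hard-coding the full path.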

For now, every do-file is more or less self-descriptive and self-contained.

Preliminary Analysis

A written report with detailed description of results can be found at

E:\McNair\Projects\MatchingEntrepsToVC\Stata\Tex


Initial Look at Dataset

Before attempting any statistical analysis, I inspected the raw dataset to spot possible problems.

There was a mistake in the synthetic VC's count of previous startups from the same sector as the current match, i.e., in the variables 'synsumprevsamesector', 'synsumprevsameindu', 'synsumprevsameindu20' and 'synsumprevsameindu10': their values contained many -1s and 0s. To correct it, I changed the SQL code.

More specifically, in the JOIN that creates the table 'FirmnameInduBlowout', the weak inequality was changed to a strict inequality. Then, in the creation of the next table, 'FirmnameRoundInduHist', I removed the subtraction. The same changes were made to the corresponding synthetic tables.
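A check of the kind that would reveal this problem can be sketched as follows. The four variable names come from this page; the toy values are made up for illustration and are not taken from the real dataset.

```stata
* Toy data standing in for the real dataset (values are invented).
clear
input synsumprevsamesector synsumprevsameindu synsumprevsameindu20 synsumprevsameindu10
-1  0  0 -1
 2  1  0  0
 0 -1  3  1
end

* A count of previous startups should never be negative, so report how many
* negative values each of the synthetic count variables contains.
foreach v of varlist synsum* {
    quietly count if `v' < 0
    display "`v': " r(N) " negative value(s)"
}
```

After the SQL fix, every count reported by this loop should be zero.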

Summary Statistics

Summary statistics were produced using the 'summarystats.do' do-file.
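The contents of 'summarystats.do' are not reproduced here. As a rough sketch of how summary statistics of this kind are typically produced in Stata, assuming hypothetical variables 'distance' and 'patentsbefore' and invented toy values:

```stata
* Toy data with hypothetical variable names (not the real dataset).
clear
input distance patentsbefore
  12.5 0
 840.0 3
  55.2 1
2300.0 0
end

* Default summary: observations, mean, std. dev., min and max.
summarize distance patentsbefore

* The same statistics laid out with one row per variable.
tabstat distance patentsbefore, statistics(n mean sd min max) columns(statistics)
```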

Linear Probability Model

A linear probability model was suggested by Jeremy Fox, where Y=1 when the match is real, and Y=0 when the match is synthetic, and independent variables are characteristics from the VCs.

To perform this regression, it is necessary to build a new dataset. This is done in 'lpmsynthetic.do'.

At first glance, this looks like a simple case for Stata's -reshape- command: the original dataset is in 'wide' format, i.e., for each observation (startup) the synthetic VC and its characteristics are stored as variables (columns), and we want to turn them into observations (rows), with a dummy indicating whether the match is real or synthetic. However, the -reshape- command does not work with string variable names.

Therefore the do-file performs a manual reshape. I sent the results to Jeremy Fox, who felt that they were not as expected and suggested some corrections.
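The manual reshape and the subsequent regression can be sketched roughly as below. This is an illustration of the general idea, not the code in 'lpmsynthetic.do': the identifier 'startupid', the dummy 'realmatch', the real-match variable 'sumprevsamesector' and the toy values are all assumptions, and the robust standard errors are my choice for the sketch (a common one for linear probability models, whose errors are heteroskedastic).

```stata
* Toy wide data: one row per startup, with the real VC's characteristic and
* the synthetic VC's characteristic (prefixed 'syn') side by side.
clear
input startupid sumprevsamesector synsumprevsamesector
1 3 0
2 1 2
3 0 5
4 4 1
end

* Block 1: keep the real match and flag it with realmatch = 1.
preserve
keep startupid sumprevsamesector
gen byte realmatch = 1
tempfile real
save `real'
restore

* Block 2: keep the synthetic match, strip the 'syn' prefix so the variable
* lines up with the real one, and flag it with realmatch = 0.
keep startupid synsumprevsamesector
rename syn* *
gen byte realmatch = 0

* Stack the two blocks: two rows per startup (one real, one synthetic).
append using `real'
sort startupid realmatch

* Linear probability model: Y = 1 for real matches, 0 for synthetic ones.
regress realmatch sumprevsamesector, vce(robust)
```

Stacking with -append- keeps each step explicit, at the cost of listing the real and synthetic variables by hand.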

Regressions

We want to know whether VCs are more likely to match with geographically close startups, whether patents are good signals for VCs, and whether VCs prefer serial founders and startups with similar demographic characteristics. We also want to know whether startups prefer to match with VCs that have previous experience with startups in the same sector and with VCs that prefer to invest in startups at their stage.

Since we don't have 'out-of-match' VCs and startups, I decided to do two different types of regressions.

I regress VCs' all-time characteristics on the characteristics of interest of their matched startups, such as distance, patents before the match, demographics, etc. I am basically looking for correlations. If 'good' VCs tend to match with very close startups that had many patents before the match, etc., then we can say there is some evidence of positive assortative matching.

On the other hand, if 'good' startups matched with VCs that were within their scope of investment and had a history of investing in similar sectors, then these characteristics are important for the startups.

Every regression has sector and VC founding year fixed effects.

Also, following Ed Egan's suggestion, all count variables are log-transformed, adding 1 first to account for zeros. The distance variable is also log-transformed. The other continuous variables are not log-transformed because most of them contain zeros, and adding 1 doesn't seem to make much sense.
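The transformations and fixed effects described above can be sketched as follows. The data are simulated and every variable name ('sector', 'vcfoundyear', 'patentsbefore', 'distance', 'vcnuminvest') is hypothetical. The choice of 'vcnuminvest' as the VC characteristic on the left-hand side and the robust standard errors are also assumptions made only for this example.

```stata
* Simulated data with hypothetical variable names.
clear
set seed 12345
set obs 200
gen sector        = ceil(5 * runiform())          // 5 sectors
gen vcfoundyear   = 1990 + floor(20 * runiform()) // VC founding year
gen patentsbefore = rpoisson(2)                   // count, may be zero
gen distance      = 1 + 3000 * runiform()         // strictly positive here
gen vcnuminvest   = rpoisson(10)                  // count, may be zero

* Count variables: add 1 before taking logs so that zeros are kept.
gen lnpatentsbefore = ln(1 + patentsbefore)
gen lnvcnuminvest   = ln(1 + vcnuminvest)

* Distance: plain log (a distance of zero would become missing).
gen lndistance = ln(distance)

* VC characteristic on matched-startup characteristics, with sector and
* VC founding year fixed effects entered as indicator variables.
regress lnvcnuminvest lndistance lnpatentsbefore i.sector i.vcfoundyear, vce(robust)
```

Entering the fixed effects with the i. prefix keeps the example in base Stata; with many categories, absorbing them instead would give the same slope estimates.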