Marcos Ki Hyung Lee (Work Log)

Revision as of 15:03, 5 July 2018 by Ed

Summer 2018

Notes from Ed

Please build and link to a project page for the STATA analysis!

Also, if/when you make changes to a sql file, please:

  1. Run them through, or make it clear with comments that you haven't
  2. Let me know by posting it on a project page and linking to it in your work log.

Otherwise, we are both going to be making conflicting changes to the same files.

By Date


Rework data analysis following suggestions from Egan and Fox.


Worked with regressions and made log files with summary statistics and outputs.

Had Skype meetings with Ed Egan and Jeremy Fox.


Picked relevant variables and started thinking of some regression specifications.


Studied the SQL code that creates the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector (and also the non-synthetic ones).

Found some problems with the non-synthetic ones: there is some double counting when a VC met with more than one portco on the same day AND the portcos had the same industry code. The code subtracts 1 from the sum of dummies that indicate same industry code, but in these instances it should be subtracting more.
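A minimal Python sketch of the double counting, not the project's SQL. The meeting rows, the field layout, and both helper functions are hypothetical. It also assumes that the existing count takes every same-code portco dated on or before the current meeting and subtracts 1, which may not be exactly what the SQL does.

```python
from datetime import date

# Hypothetical meetings of one VC: (portco, meeting date, industry code).
meetings = [
    ("A", date(2010, 1, 5), "1000"),
    ("B", date(2011, 3, 1), "1000"),
    ("C", date(2011, 3, 1), "1000"),  # same day and same code as B
    ("D", date(2012, 7, 9), "2000"),
]


def naive_prev_same_industry(current, rows):
    """Sum same-code dummies over rows dated on or before the meeting, minus 1."""
    _, cur_date, cur_code = current
    dummies = [1 for _, d, code in rows if d <= cur_date and code == cur_code]
    return sum(dummies) - 1  # removes only the current portco itself


def strict_prev_same_industry(current, rows):
    """Count same-code portcos met strictly before the current meeting date."""
    _, cur_date, cur_code = current
    return sum(1 for _, d, code in rows if d < cur_date and code == cur_code)


naive = {row[0]: naive_prev_same_industry(row, meetings) for row in meetings}
strict = {row[0]: strict_prev_same_industry(row, meetings) for row in meetings}
```

For B and C the naive version still counts the other same-day portco as "previous", so it is one too high. Comparing dates strictly avoids having to work out how many to subtract.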

Also, the synthetic counterparts have weird values. The historical ones (prior to meeting the portco) are mostly 0 or -1, while the all-time ones have lots of missings. Initially I thought it was an error in the code, but after thinking about this, I think it is a feature of the randomization. To correct the negative numbers, I think we should not subtract 1 from the sum of dummies. We did that to account for the repeated portcos that showed up in the blowout table, but now these repetitions don't happen, since we are joining a table of synthetic matches with a table of real matches.
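A small Python sketch of why the subtraction can produce -1 for synthetic matches. The function name, the `subtract_one` flag, and the industry codes are hypothetical. It assumes the subtraction exists only to remove the matched portco's own row, which does not appear when the portco is synthetic.

```python
def sum_same_code(candidate_code, portfolio_codes, subtract_one):
    """Sum the same-code dummies over a VC's portfolio, optionally minus 1."""
    total = sum(1 for code in portfolio_codes if code == candidate_code)
    return total - 1 if subtract_one else total


# Real match: the matched portco (code "1000") is itself among the portfolio
# rows, so subtracting 1 removes the self-match.
real_portfolio = ["1000", "1000", "2000"]
real_count = sum_same_code("1000", real_portfolio, subtract_one=True)

# Synthetic match: the synthetic portco (code "3000") is not in the VC's
# portfolio, so there is no self-match to remove.
synthetic_portfolio = ["1000", "2000"]
synthetic_with_subtraction = sum_same_code("3000", synthetic_portfolio, subtract_one=True)
synthetic_without_subtraction = sum_same_code("3000", synthetic_portfolio, subtract_one=False)
```

With the subtraction, a synthetic match that shares no industry code with the VC's portfolio comes out as -1. Without it, the count is 0.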


Plans for today: try to fix dataset.

Found more errors in the matched dataset. The synthetic firm variables seem to be wrong, as there are negative numbers and lots of missings.

Also, variables from matched firms, like number of people, doctors, etc., and city name, are missing seemingly at random.

Egan walked me through the SQL code that generates the matched dataset. We made a more precise count of coinvestors. Before, we were double counting funds. Now, if a PortCo had only one VC fund investment, numcoinvestor == 0.
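A Python sketch of a coinvestor count that does not double count funds. The rows, the fund and PortCo names, and `num_coinvestors` are hypothetical. It assumes the double counting came from a fund appearing in more than one investment row for the same PortCo, which this log does not confirm.

```python
# Hypothetical investment rows: (portco, fund). Fund1 appears twice for
# PortCoX, e.g. because it took part in two rounds.
investments = [
    ("PortCoX", "Fund1"),
    ("PortCoX", "Fund1"),
    ("PortCoX", "Fund2"),
    ("PortCoY", "Fund3"),
]


def num_coinvestors(rows, portco, fund):
    """Count distinct other funds that invested in the same portco."""
    others = {f for p, f in rows if p == portco and f != fund}
    return len(others)


coinvestors_x_fund2 = num_coinvestors(investments, "PortCoX", "Fund2")
coinvestors_y_fund3 = num_coinvestors(investments, "PortCoY", "Fund3")
```

Using a set of distinct funds means a repeated row adds nothing, and a PortCo with a single VC fund gets a count of 0, matching the numcoinvestor == 0 case above.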

Looking into the synthetic variables problem, the main issue is with the variables synsumprevsameindu100 synsumprevsameindu20 synsumprevsameindu synsumprevsamesector synnumprevportcos syntotsameindu100 syntotsameindu20 syntotsameindu syntotsamesector.

They basically count the number of PortCos a VC invested in that share an industry code with the current matched PortCo, before meeting it (synsum*) and over all time (syntot*). So they should be integers with tot >= sum. However, the synthetic firm ones are mostly -1 for the sum variables and missing for the tot variables.
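The expected properties of these counts can be written as a quick sanity check. This is a Python sketch with a hypothetical helper, `check_counts`, where `None` stands in for a missing value. In practice the same check would be run in Stata or SQL on the matched dataset.

```python
def check_counts(sum_prev, tot):
    """Return a list of problems for one (synsum*, syntot*) pair of values."""
    problems = []
    for name, value in (("sum", sum_prev), ("tot", tot)):
        if value is None:
            problems.append(f"{name} missing")
        elif value < 0:
            problems.append(f"{name} negative")
    # The all-time count can never be smaller than the count before the meeting.
    if sum_prev is not None and tot is not None and tot < sum_prev:
        problems.append("tot < sum")
    return problems


typical_synthetic_row = check_counts(-1, None)
valid_row = check_counts(1, 3)
```

A typical synthetic row as described above (sum of -1, tot missing) fails two of the checks, while a row with sum 1 and tot 3 passes.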

Looking at the code that generates these synthetic variables, there seems to be a problem when joining and subtracting one from the sum of dummies (where A.code100 = B.code100, for example). Can't figure out how to correct it yet.


Plans for today: get a full understanding of dataset and variables, start making some summary statistics.

Inspected the matched dataset and found inconsistencies in the investment amounts of VCs in PortCos. Talked to Egan about this; we will check it carefully in the source SQL code tomorrow.

Made summary statistics of firm variables. There do not seem to be any inconsistencies there.

2018-06-20: Created folder at "E:\McNair\Projects\MatchingEntrepsToVC\Stata\", imported files into Stata, and made master dofile.