Hubs
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of Venture Capital funding is located.
Primary Data Set
The Hubs data set, from SDC Platinum, is currently being constructed.
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015. Data has been accumulated at the portfolio company, fund, and round level. It will be analyzed at the MSA level, in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.
The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
- Rounds: Rounddate, coname, state, roundno, stage1, etc.
- CombinedRounds: Coname, rounddate, discamount, fundname
- Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
- Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address
Used variables:
Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
-> CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper
-> FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
-> RoundInfoSuper: coname, rounddate, nofunds, discamount
-> RoundInfo: Coname, roundyear, fundname, estamount (complete)
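The step from RoundInfoSuper (nofunds, discamount) to RoundInfo (estamount) can be illustrated with a short sketch. It assumes, for illustration only, that estamount is the disclosed round amount split evenly across the funds in the syndicate; this page does not state the exact rule, and the function name is hypothetical.

```python
from typing import Optional


def estimated_amount_per_fund(discamount: Optional[float], nofunds: int) -> Optional[float]:
    """Split a round's disclosed amount evenly across its syndicate.

    Assumption: an even split; the actual allocation rule is not documented here.
    Returns None when the amount is undisclosed or no funds are recorded.
    """
    if discamount is None or nofunds <= 0:
        return None
    return discamount / nofunds


# A 6,000,000 round with three participating funds.
print(estimated_amount_per_fund(6_000_000, 3))
```

Returning None for undisclosed amounts keeps them from being mistaken for zero-dollar rounds later on.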
Then take:
RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
-> SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
-> MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear ...
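The join and count described above can be sketched with an in-memory SQLite database. The table and column names follow this page; the rows are invented, and the sketch is not the query actually run on the Hubs server. One assumption to note: because SuperRoundInfo holds one row per company-fund pair, the sketch uses COUNT(DISTINCT Coname) so that a company backed by several funds is counted once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RoundInfo   (Coname TEXT, roundyear INTEGER, fundname TEXT, estamount REAL);
CREATE TABLE CompanyInfo (Coname TEXT, MSASuper TEXT, InduCode INTEGER, state TEXT);
CREATE TABLE FundInfo    (fundname TEXT, msacode TEXT, state TEXT);

-- Toy rows, invented for illustration only.
INSERT INTO RoundInfo VALUES
    ('Acme', 2001, 'Fund A', 1.0), ('Acme', 2001, 'Fund B', 1.0), ('Beta', 2001, 'Fund A', 2.0);
INSERT INTO CompanyInfo VALUES ('Acme', 'M1', 1, 'CA'), ('Beta', 'M1', 2, 'CA');
INSERT INTO FundInfo VALUES ('Fund A', 'M1', 'CA'), ('Fund B', 'M2', 'NY');

-- One row per company-fund-round, carrying both locations.
CREATE TABLE SuperRoundInfo AS
SELECT r.Coname, c.MSASuper AS CoMSASuper, c.InduCode AS CoInduCode, c.state AS CoState,
       r.fundname AS FundName, f.msacode AS FundMSASuper, f.state AS FundState,
       r.roundyear AS RoundYear, r.estamount AS RoundEstAmount
FROM RoundInfo r
JOIN CompanyInfo c ON c.Coname = r.Coname
JOIN FundInfo f ON f.fundname = r.fundname;
""")

msa_portcos = conn.execute("""
SELECT COUNT(DISTINCT Coname) AS NoPortCosFunded, CoMSASuper, RoundYear
FROM SuperRoundInfo
GROUP BY CoMSASuper, RoundYear
""").fetchall()
print(msa_portcos)
```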
Notes on Creation of Primary Data Set
Raw tables
- companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
- funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
- rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
- combined rounds (company name, round date, disclosed amount, investor)
- msalist (changes MSAs to CMSAs— combined MSAs)
- industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)
Process
- cleaned tables to eliminate duplications, undisclosed variables
- changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
- matched funds to avoid any issues with name variants (e.g. Fund ABC L.P./Fund ABC LP/Fund ABC)
- matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
- join by round and company conames
- bridge years (1990-2016), stage, and cmsa
- populate data with count of companies (Deal flow) and estimated amount ($)
- data set in 181 hubs folder under summarycmsa.txt (38394)
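The fund-matching step in the process above deals with name variants such as Fund ABC L.P./Fund ABC LP/Fund ABC. A minimal sketch of one way to collapse such variants follows. The suffix list and the function name are illustrative assumptions, not the procedure that produced the match tables.

```python
import re

# Legal-form suffixes to drop; this list is an illustrative assumption.
_SUFFIXES = {"lp", "llc", "inc", "ltd"}


def normalize_fund_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal-form suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    tokens = cleaned.split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


variants = ["Fund ABC L.P.", "Fund ABC LP", "Fund ABC"]
print({normalize_fund_name(v) for v in variants})
```

Matching on the normalized key rather than the raw name lets the round and fund tables join even when SDC records the same fund under different spellings.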
Glossary of Tables
- cleanco: used to remove duplicates from companies
- cleanedcompanies: clean set of companies with no duplicates
- cmsas: list of all CMSAs in final data set (for merging)
- cmsastats: statistics not including empty years (pre-merge)
- cmsastats2: statistics separated by year-MSA
- cmsastats3: statistics separated by year-MSA-stage
- cmsayears: empty merged table between year and cmsa
- cmsayearstage: empty merged table between cmsa/years and stage
- combinedrounds: raw SDC data for combined rounds
- combinedroundswamt: used to join rounds and combined rounds for roundinfo2
- companies: raw SDC company data
- companyinfo: cleaned companies joined with state and CMSA information
- companyinfo2: companyinfo1 with original industry categories
- companyinfo3: companyinfo2 with updated industry categories and codes
- companyinfo4: clean version of companyinfo3
- companyround: combined company information with round information
- companyround2: combined company information with round information, cleaned up from companyround2
- companyround3: combined company information with round information, cleaned up from companyround3
- fundinfo: funds joined with CMSA info
- fundinfo2: clean version of fundinfo1
- fundinfoclean: used in process to clean fundinfo2
- fundinfoclean2: used in process to clean fundinfo2
- fundinfocleanfinal: used in process to clean fundinfo2
- fundinfocleannodups: final clean set of fundinfo
- funds: raw SDC fund data
- industry: new industry codes (4), used for all future data sets
- industrylist: lookup table for new industry codes (went from 6 to 4)
- joined1: used for matching process
- joined2: used for matching process
- matchfund2: used for matching process
- matchfunds: used for matching process
- matchroundfund: used for matching process
- matchroundfund2: used for matching process
- msalist: lookup table for MSA to CMSA (used for all future data sets)
- roundfund: not used; joined round to fund; drop/ignore
- roundinfo: round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
- roundinfo2: roundinfo1 including name of investors/funds
- roundinfo3: clean version of roundinfo2
- roundinfoclean: final clean version of roundinfo3 (final roundinfo table)
- rounds: raw SDC round data
- stages: table for merging stage-year-CMSA
- superinfo: ignore/drop
- temp: used for matching process
- years: table for merging stage-year-CMSA
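The glossary describes cmsayears and cmsayearstage as empty merged tables. A sketch of the idea, under the assumption that these tables hold every year-CMSA-stage combination so that combinations with no activity still appear in the final panel, is given below. The data values are invented, and filling empty cells with 0 is also an assumption.

```python
from itertools import product

years = range(1990, 1993)           # shortened range for the example
cmsas = ["M1", "M2"]                # placeholder CMSA codes
stages = ["Seed", "Early", "Later"]

# Observed statistics (deal counts), keyed by (year, cmsa, stage).
observed = {(1990, "M1", "Seed"): 3, (1992, "M2", "Later"): 1}

# Empty skeleton: one row per year-CMSA-stage combination.
skeleton = list(product(years, cmsas, stages))

# Left-join the observations onto the skeleton, using 0 for empty cells.
panel = {key: observed.get(key, 0) for key in skeleton}

print(len(panel))
print(panel[(1991, "M1", "Early")])
```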
Hub Candidates Data Set
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidates' websites in an attempt to categorize what can be identified as a hub. This is a difficult data set to assemble: little to no quantitative information is available for this category of institution, and collection depends on what each candidate makes publicly available on the internet.
Characteristics/Variables
- Year Founded
- Square footage
- LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
- Activeness on Twitter (binary)
- Member Directory available online (binary)
- Number of conference rooms
- Price ($/month) for Flex desk
- Offers Reserved desk (binary)
- Offers office space for rent (binary)
- Offers community membership: not for coworking but for community events, etc. (binary)
- Number of events offered per month (estimate)
- Offers code academy
- Mission Statement/Vision (for qualitative or key-word analysis)
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.
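One possible way to hold these characteristics while they are being collected is sketched below. The field names are hypothetical renderings of the list above, name and msa are added identifiers, and None marks information that could not be found online. No classification rule is implied.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HubCandidate:
    """One row of the candidate list; None marks information not found online."""
    name: str
    msa: str
    year_founded: Optional[int] = None
    square_footage: Optional[float] = None
    linkedin_self_identifiers: Optional[str] = None
    active_on_twitter: Optional[bool] = None
    member_directory_online: Optional[bool] = None
    conference_rooms: Optional[int] = None
    flex_desk_price_per_month: Optional[float] = None
    offers_reserved_desk: Optional[bool] = None
    offers_office_space: Optional[bool] = None
    offers_community_membership: Optional[bool] = None
    events_per_month: Optional[float] = None
    offers_code_academy: Optional[bool] = None
    mission_statement: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Names of characteristics still to be collected."""
        return [k for k, v in vars(self).items() if v is None]


# A made-up candidate with only two characteristics filled in so far.
candidate = HubCandidate(name="Example Space", msa="M1",
                         year_founded=2012, active_on_twitter=True)
print(candidate.missing_fields())
```

Tracking missing fields per candidate gives a quick view of how much of the list is still dependent on information that is not public.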
As of March 10th, 2016, the list contains 125 Hub candidates.
Supplementary Data Sets
Patent data: to be pulled from USPTO or SDC Platinum.
- unable to find on the internet, must be pulled from the larger dataset
Number of STEM Graduate Students (NSF) and University R&D Spending (NSF): Grad Students found for the year 2015, no data going back historically; R&D found for the past 10 years
- categorized university by MSA, can be used for all university-based projects
Per Capita Income and Employment Data (US Census Bureau): complete for most recent census, unable to find data going back historically
Firm Births (BDS): data set found for 1990 to present, currently being cleaned up for use
Resources
- Hochberg and Fehder (2015), located in Dropbox
- Use this paper as a guideline on how to conduct the analysis
- US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
- USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
- MSA level trends: http://www.metrotrends.org/data.cf
To Do
We need to find and clean up data sets at the MSA level
- Patent data (USPTO)
- Number of STEM Graduate Students (NSF)
- in progress
- University R&D Spending (NSF)
- Per Capita Income (US Census)
- complete (Employment and Income_MSA.xls)
- Employment (US Census)
- complete (Employment and Income_MSA.xls)
- Firm births (BDS)
- SELECT MSAs!!! Possible method: choosing CMSAs with Distinct companies funded
- >100 = 38
- >75 = 45
- >50 = 52
- >25 = 80
- Total 238
- greater than 100 will give us 52 CMSAs to work with
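The selection method above (counting CMSAs by distinct companies funded) can be sketched as follows. The data and the small thresholds are invented for the example. Note that the counts are cumulative, so a lower threshold can never yield fewer CMSAs than a higher one.

```python
from collections import defaultdict

# (cmsa, company) pairs, one per funding record; toy data invented for illustration.
funded = [("M1", "Acme"), ("M1", "Beta"), ("M1", "Acme"), ("M2", "Gamma")]

companies_by_cmsa = defaultdict(set)
for cmsa, company in funded:
    companies_by_cmsa[cmsa].add(company)   # the set keeps companies distinct


def cmsas_above(threshold: int) -> int:
    """Number of CMSAs with more than `threshold` distinct funded companies."""
    return sum(1 for cos in companies_by_cmsa.values() if len(cos) > threshold)


for t in (0, 1, 2):
    print(t, cmsas_above(t))
```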
Data Cleaning
Cleaning tasks:
- Remove PortCos named Undisclosed, etc.
- Remove Funds named Unknown, etc.
- Basic Data cleaning:
- Enormous outliers on funds invested
- Check dates
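A minimal sketch of the first two cleaning tasks and the date check follows. The placeholder list, the 1990 through 2015 window, and the record layout are assumptions made for the example; outlier handling is not covered.

```python
from datetime import date

PLACEHOLDER_NAMES = {"undisclosed", "unknown"}  # assumed list; extend as needed


def is_placeholder(name: str) -> bool:
    """True for empty names or names that begin with a placeholder word."""
    words = name.strip().lower().split()
    return not words or words[0] in PLACEHOLDER_NAMES


def date_in_range(d: date, first: date = date(1990, 1, 1),
                  last: date = date(2015, 12, 31)) -> bool:
    """Check that a round date falls inside the assumed sample window."""
    return first <= d <= last


rounds = [
    {"coname": "Acme", "fundname": "Fund A", "rounddate": date(2001, 5, 1)},
    {"coname": "Undisclosed Company", "fundname": "Fund A", "rounddate": date(2001, 5, 1)},
    {"coname": "Beta", "fundname": "Unknown", "rounddate": date(2003, 1, 1)},
    {"coname": "Gamma", "fundname": "Fund B", "rounddate": date(1985, 1, 1)},
]

clean = [
    r for r in rounds
    if not is_placeholder(r["coname"])
    and not is_placeholder(r["fundname"])
    and date_in_range(r["rounddate"])
]
print([r["coname"] for r in clean])
```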
Lookup tables:
- SuperMSAs
- Industry
- Stages
The Target Dataset
We will need to process the following variables:
- SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
Example dataset:
MSA   Year  SeedVCInv  SeedEarlyVCInv  LaterVCInv  NoDeals  FundsInvested  DistinctInvestors  ....
----  ----  ---------  --------------  ----------  -------  -------------  -----------------
1234  2001  1000000    20000000        30000000    4        7              7
Note that the unit of observation is MSA-Year.
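To make the MSA-Year unit of observation concrete, the sketch below collapses round-level records into one row per MSA-year. The record layout, the stage labels, and the definition of a deal as one company-round are assumptions for illustration.

```python
from collections import defaultdict

# One record per fund's share of a round:
# (msa, year, stage, company, roundno, fund, estamount). Toy data.
rows = [
    ("1234", 2001, "Seed", "Acme", 1, "Fund A", 500_000.0),
    ("1234", 2001, "Seed", "Acme", 1, "Fund B", 500_000.0),
    ("1234", 2001, "Later", "Beta", 3, "Fund A", 3_000_000.0),
]

panel = defaultdict(lambda: {"SeedVCInv": 0.0, "LaterVCInv": 0.0,
                             "deals": set(), "investors": set()})
for msa, year, stage, company, roundno, fund, amount in rows:
    cell = panel[(msa, year)]
    if stage == "Seed":
        cell["SeedVCInv"] += amount
    elif stage == "Later":
        cell["LaterVCInv"] += amount
    cell["deals"].add((company, roundno))   # assumption: one deal per company-round
    cell["investors"].add(fund)

for (msa, year), cell in sorted(panel.items()):
    print(msa, year, cell["SeedVCInv"], cell["LaterVCInv"],
          len(cell["deals"]), len(cell["investors"]))
```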
Variables to be computed at the MSA level:
- HubActive (binary)
- NoHubsActive (Count)
- HubSqFt
- Other Hub Vars (build list!!!)
- SeedVCInv (Seed/Start-up)
- EarlyVCInv (Early Stage)
- LaterStageVC (Later)
- OtherStageVC (Buyout/Acq, Other)
- NoDeals (done by local VCs?)
- NoDealsNear
- NoDealsFar
- NoPortCosFunded
- FundsInv (in an MSA)
- FundsInvFromNear (within MSA?)
- FundsInvFromFar (outside MSA?)
- DistinctInvestors (?)
- DistinctInvestorsNear (within MSA?)
- DistinctInvestorsFar (outside MSA?)
- PatentCount
- NoSTEMGrads
- FirmBirths (BDS data)
- UniRandDSpend
- PerCapitaIncome
- Employment
We need to:
- Check funds invested means dollars invested
- Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?
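The near/far question is still open. As a starting point, the sketch below uses the simplest candidate definition (fund and company in the same CMSA means near, anything else is far). This is an assumption, not a decision recorded on this page, and the function name is hypothetical.

```python
def classify_distance(fund_cmsa: str, company_cmsa: str) -> str:
    """Label an investment 'near' when fund and company share a CMSA, else 'far'.

    Assumption: the simplest of the candidate definitions (same CMSA vs. not).
    """
    return "near" if fund_cmsa == company_cmsa else "far"


# (fund CMSA, company CMSA, estimated amount); toy values.
investments = [("M1", "M1", 1.0), ("M2", "M1", 2.5), ("M1", "M1", 0.5)]

totals = {"near": 0.0, "far": 0.0}
for fund_cmsa, company_cmsa, amount in investments:
    totals[classify_distance(fund_cmsa, company_cmsa)] += amount

print(totals)
```

Keeping the rule in one function means a later switch to an adjacency-based definition would only change that function.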
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).
Final Primary Data Set
Table name: finaldataset
- cmsa
- year
- total amount invested (totalamountinv)
- amount invested from local funds (nearamountinv)
- amount invested from funds outside CMSA (faramountinv)
- amount invested in early stage companies (earlyinv)
- amount invested in later stage companies (laterinv)
- amount invested in seed or startup stage companies (startupseedinv)
- amount invested in Acquisition/Buy-outs/Other stage companies (otherstageinv)
- distinct funds that are investing in that CMSA-year (investingfund)
- distinct funds from that CMSA that invested in that CMSA-year (investingfundnear)
- distinct funds from outside that CMSA that invested in that CMSA-year (investingfundfar)
- number of deals (deals)
- number of deals inside a CMSA (near deals)
- number of deals from outside a CMSA
- some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
- deals with early stage companies (earlystagedeals)
- deals with later stage companies (laterstagedeals)
- deals with startup/seed companies (startupseeddeals)
- deals with companies in other stages (otehrstagedeals)
- number of portfolio companies to receive their first investment in that year (newportcosfunded)
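The deal counts and the first-investment count can be sketched as below. The sketch assumes that a deal is one company-round, that near means a syndicate member located in the company's CMSA, and that a company's first investment year is the earliest round year in the data. The rows are invented. It also shows how one deal can count as both near and far when its syndicate spans CMSAs.

```python
from collections import defaultdict

# (company, company CMSA, round year, round number, fund CMSA); toy rows.
rows = [
    ("Acme", "M1", 2001, 1, "M1"),
    ("Acme", "M1", 2001, 1, "M2"),   # same deal, out-of-CMSA syndicate member
    ("Acme", "M1", 2002, 2, "M2"),
    ("Beta", "M1", 2002, 1, "M1"),
]

deals, near, far = defaultdict(set), defaultdict(set), defaultdict(set)
first_year = {}
for company, cmsa, year, roundno, fund_cmsa in rows:
    key, deal = (cmsa, year), (company, roundno)
    deals[key].add(deal)
    (near if fund_cmsa == cmsa else far)[key].add(deal)
    first_year[company] = min(year, first_year.get(company, year))

# Count each company once, in the CMSA-year of its first investment.
cmsa_of = {company: cmsa for company, cmsa, *_ in rows}
new_portcos = defaultdict(int)
for company, year in first_year.items():
    new_portcos[(cmsa_of[company], year)] += 1

k = ("M1", 2001)
print(len(deals[k]), len(near[k]), len(far[k]), new_portcos[k])
```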