Difference between revisions of "Hubs"

McNair Project
Hubs
Project Information
Project Title	Hubs
Owner	Hira Farooqi
Start Date
Deadline
Keywords	Data
Primary Billing
Notes
Has project status	Active
	Copyright © 2016 edegan.com. All Rights Reserved.

Revision as of 14:26, 26 June 2017

The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focused on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the greater majority of Venture Capital funding is located.

The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015. Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will be looking at in terms of number of companies funded in number of funds active, and flow of investment in a given MSA.

The data set has now been uploaded to the database server, named Hubs. There are 4 tables:

Rounds: Rounddate, coname, state, roundno, stage1, etc.
CombinedRounds: Coname, rounddate, discamount, fundname
Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address

Used variables:

Companies: Coname, MSACode, Industry, state
MSALookupTable: MSACode, MSASuper
IndustryLookupTable: IndustryMajor, InduCode
-> 
CompanyInfo: Coname, MSASuper, InduCode, state (complete)

Funds: fundname, msacode, state
MSALookupTable: MSACode, MSASuper 
-> 
FundInfo: fundname, msacode, state (complete)

Rounds: coname, rounddate, stagecode, roundno
CombinedRounds: coname, rounddate, discamount, fundname
->
RoundInfoSuper: coname, rounddate, nofunds, discamount   
->
RoundInfo: Coname, roundyear, fundname, estamount (complete)

Then take:

RoundInfo: Coname, roundyear, fundname, estamount
CompanyInfo: Coname, MSASuper, InduCode, state
FundInfo: fundname, msacode, state
->
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
->
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
...

Notes on Creation of Primary Data Set

Raw tables

companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
combined rounds (company name, round date, disclosed amount, investor)
msalist (changes MSAs to CMSAs— combined MSAs)
industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)

Process

cleaned tables to eliminate duplications, undisclosed variables
changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
join by round and company conames
bridge years (1990-2016), stage, and cmsa
populate data with count of companies (Deal flow) and estimated amount ($)
- data set in 181 hubs folder under summarycmsa.txt (38394)

Key decisions:

Threw out undisclosed co through-out as no address
Count is done by joining round and company
Anything fund related must be disclosed fund
Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only

Glossary of Tables

cleanco — used to remove duplicates from companies
cleanedcompanies — clean set of companies with no duplicates
cmsafunds- 
cmsas— list of all CMSAs in final data set (for merging) 
cmsastats- statistics not including empty years (pre-merge) 
cmsastats2 - statistics separated by year-MSA
cmsastats3— statistics separated by year-MSA-stage
cmsastats4
cmsayears— empty merged table between year and cmsa
cmsayearstage — empty merged table between cmsa/years and stage
combinedrounds— raw sdc data for combined rounds
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
companies- raw SDC company data
companyinfo — cleaned companies joined with state and CMSA information
companyinfo2— companyinfo1 with original industry categories
companyinfo3— companyinfo2 with updated industry categories and codes
companyinfo4-- clean version of companyinfo3
companyround- combined company information with round information
companyround2- combined company information with round information, cleaned up from companyround2
companyround3- combined company information with round information, cleaned up from companyround3
finaldataset- final statistics by CMSA-year, see section Final Primary Data Set for more information
fundinfo— funds joined with CMSA info
fundinfo2 - clean version of fundinfo1
fundinfoclean - used in process to clean fundinfo2
fundinfoclean2- used in process to clean fundinfo2
fundinfocleanfinal- used in process to clean fundinfo2
fundinfocleannodups- final clean set of fundinfo
funds - raw SDC fund data
Houston - analysis for Houston ecosystem team
Houston2- analysis for Houston ecosystem team
houston3- analysis for Houston ecosystem team
industry — new industry codes (4)— used for all future data sets
industrylist— lookup table for new industry codes (went from 6 to 4) 
joined1- used for matching process
joined2- used for matching process
matchfund2- used for matching process
matchfunds- used for matching process
matchroundfund - used for matching process
matchroundfund2- used for matching process
msalist — lookup table for MSA to CMSA (used for all future data sets) 
nearfar1-- beginning set before adding nearfar/stage variables 
nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
roundfund— not used— joined round to fund; drop/ignore
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
roundinfo2— roundinfo1 including name of investors/funds
roundinfo3— clean version of roundinfo2
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
rounds — raw SDC round data
stages — table for merging stage-year-CMSA
superinfo — ignore/drop
temp - used for matching process
years — table for merging stage-year-CMSA

Hub Candidates Data Set

The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidate's websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull, as there is little to no quantitative information available for this category of institution, and is dependent on accessibility of information to the public on the internet.

Characteristics/Variables

Year Founded
Square footage
LinkedIN self-identifiers (what the organization classifies itself on its LinkedIN profile)
Activeness on Twitter (binomial)
Member Directory available online (binomial)
Number of conference rooms
Price ($/month) for Flex desk
Offers Reserved desk (binomial)
Offers office space for rent (binomial)
Offers community membership-- not for coworking but for community events, etc. (binomial)
Number of events offered per month (estimate)
Offers code academy
Mission Statement/Vision (for qualitative or key-word analysis)

These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.

As of March 10th 2016, the list contains 125 Hub candidates.

Where to find: The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue

Supplementary Data Sets

Patent data: to be pulled from USPTO or SDC Platinum.

Number of STEM Graduate Students (NSF) and University R&D Spending (NSF):

University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx) or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
- not uploaded to server or matched yet to CMSA code, because of this discrepancy.
- "Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects

Per Capita Income and Employment Data (US Census Bureau):

"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
- not uploaded to server or matched yet to CMSA code

Firm Births (BDS)

in server 181, under table name "BDS"
includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
includes code for CMSA but is not aggregated by CMSA
- i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)

Resources

Yael Hochberg and Fehder (2015), located in dropbox
- Use this paper as a guideline on how to conduct the analysis
US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
MSA level trends: http://www.metrotrends.org/data.cf

The Target Dataset

We will need to process the following variables:

SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
CSV mapping msas to cmsas is in the folder (and a table in the dbase)

Example dataset:

MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....
----------------------------------------------------------------------------------------------------------------------------
1234     2001       1000000        20000000            30000000       4          7              7

Note that the unit of observation is MSA-Year.

Variables to be computed at the MSA level:

HubActive (binary)
NoHubsActive (Count)
HubSqFt
Other Hub Vars (build list!!!)
SeedVCInv (Seed/Start-up)
EarlyVCInv (Early Stage)
LaterStageVC (Later)
OtherStageVC (Buyout/Acq, Other)
NoDeals (done by local VCs?)
- NoDealsNear
- NoDealsFar
NoPortCosFunded
FundsInv (in an MSA)
- FundsInvFromNear (within MSA?)
- FundsInvFromFar (outside MSA?)
DistinctInvestors (?)
- DistinctInvestorsNear (within MSA?)
- DistinctInvestorsFar (outside MSA?)
PatentCount
NoSTEMGrads
FirmBirths (BDS data)
UniRandDSpend
PerCapitaIncome
Employment

We need to:

Check funds invested means dollars invested
Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?

There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).

Final Primary Data Set

Deal is a round syndicate (near/far deal is one investor that is near/far).

Table name: finaldataset

cmsa
year
totalamountinv--total amount invested 
nearamountinv--amount invested from local funds
faramountinv-- amount invested from funds outside CMSA 
earlyinv--amount invested in early stage companies 
laterinv--amount invested in later stage companies 
startupseedinv--amount invested in seed or startup stage companies 
otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies 
investingfund--distinct funds that are investing in that CMSA-year 
investingfundnear--distinct funds from that CMSA that invested in that CMSA-year 
investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year 
deals--number of deals 
neardeals--number of deals inside a CMSA 
fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
earlystagedeals--deals with earlystage companies
laterstagedeals--deals with later stage companies 
startupseeddeals--deals with startup/seed companies 
otherstagedeals--deals with companies in other stages 
newportcosfunded--number of portfolio companies to receive their first investment in that year

Data by zip code

Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)

https://www2.census.gov/programs-surveys/popest/datasets/

Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)

https://www.irs.gov/uac/about-irs

DCI index, to assess the economic well-being of communities

http://eig.org/dci/interactive-maps/u-s-zip-codes

R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).

Data by MSA

COMPUSTAT Data

Data is in:

E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is cities

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:

Date from 1980-2017 (July). All COMPUSTAT.
427799 records
Fields include:
- R&D Expenditure
- Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:

Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
1979-2016
4440 cities

NSF Data

Data is in:

E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is cities SQL script is: nsf_2017.sql The source file is nsf2017.txt, copied from table titled "nsf" in the biotech db. It contains:

Award ID
Award Institution
Award Effective date
Institution city
Award Value

From 1900 - 2017

Output file is nsfSummary.txt. It contains:

Variables: City, year, nogrants, valuegrant,
1900-2017
3854 cities

NIH Data

Data is in:

Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is cities SQL script is: nih2017.sql The source files are:

nih_1986_2001.csv
nih_2002_2012.txt
nih_2013_2015

located in E:\McNair\Projects\Federal Grant Data\NIH

Date from 1986-2015

@@ Line 297: / Line 297: @@
 *1979-2016
 *4440 cities
+==NSF Data==
+Data is in:
+ E:\McNair\Projects\Hubs\Summer 2017
+ Z:\Hubs\2017
+Database is '''cities'''
+SQL script is: nsf_2017.sql
+The source file is nsf2017.txt, copied from table titled "nsf" in the biotech db.
+It contains:
+*Award ID
+*Award Institution
+*Award Effective date
+*Institution city
+*Award Value
+From 1900 - 2017
+Output file is nsfSummary.txt. It contains:
+*Variables: City, year, nogrants, valuegrant,
+*1900-2017
+*3854 cities
 ==NIH Data==

Difference between revisions of "Hubs"

Revision as of 14:26, 26 June 2017

Contents

Primary Data Set

Hub Candidates Data Set

Supplementary Data Sets

Resources

The Target Dataset

Final Primary Data Set

Data by zip code

Data by MSA

COMPUSTAT Data

NSF Data

NIH Data

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Sites

Sections

Organizations

Help

Tools