Difference between revisions of "Hubs"

From edegan.com
imported>Rachel
 
(142 intermediate revisions by 8 users not shown)
Line 1: Line 1:
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.
+
{{Project
 +
|Has project output=Data
 +
|Has sponsor=McNair Center
 +
|Has title=Hubs
 +
|Has owner=Hira Farooqi,
 +
|Has keywords=Data
 +
|Has project status=Active
 +
|Does subsume=Hubs Analysis 2017,
 +
}}
  
This research will primarily focus on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the great majority of venture capital funding is located.
+
'''Important Notice: The last update to the hubs data was done manually by Ed and is in E:\projects\MeasuringHGHTEcosystems\HubsData-RevisedSimplified.xlsx'''
  
===Primary Data Set===
 
The Hubs data set, from SDC Platinum, is currently in the process of being constructed.
 
  
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
+
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.
Data has been accumulated at the portfolio company, fund, and round level. It will be analyzed at the MSA level, in terms of the number of companies funded, the number of funds active, and the flow of investment in a given MSA.
 
  
 +
This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.
  
The data set has now been uploaded to the database server, named Hubs.
 
There are 4 tables:
 
*Rounds: Rounddate, coname, state, roundno, stage1, etc.
 
*CombinedRounds: Coname, rounddate, discamount, fundname
 
*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
 
*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address
 
  
Used variables:
 
  
Companies: Coname, MSACode, Industry, state
+
Information on initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]].
MSALookupTable: MSACode, MSASuper
 
IndustryLookupTable: IndustryMajor, InduCode
 
->
 
CompanyInfo: Coname, MSASuper, InduCode, state (complete)
 
  
Funds: fundname, msacode, state
+
'''Note on joining:''' The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, nih, nsf, sbir, compustat) is first joined with the VC data on city-state-year ID and then the resulting tables are all joined together in the final table.
MSALookupTable: MSACode, MSASuper
 
->
 
FundInfo: fundname, msacode, state (complete)
 
  
Rounds: coname, rounddate, stagecode, roundno
 
CombinedRounds: coname, rounddate, discamount, fundname
 
->
 
RoundInfoSuper: coname, rounddate, '''nofunds''', discamount 
 
->
 
RoundInfo: Coname, roundyear, fundname, estamount (complete)
 
  
Then take:
+
===Data by zip code===
RoundInfo: Coname, roundyear, fundname, estamount
+
*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
CompanyInfo: Coname, MSASuper, InduCode, state
+
https://www2.census.gov/programs-surveys/popest/datasets/
FundInfo: fundname, msacode, state
+
*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
->
+
https://www.irs.gov/uac/about-irs
SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
+
*DCI index, to assess the economic well-being of communities
->
+
http://eig.org/dci/interactive-maps/u-s-zip-codes
MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
+
*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
...
+
*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).
  
'''Notes on Creation of Primary Data Set'''
+
== Data by MSA ==
  
Raw tables
+
We have principal cities of MSAs from the Census:
* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
+
https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html
* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
 
* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
 
* combined rounds (company name, round date, disclosed amount, investor)
 
* msalist (changes MSAs to CMSAs— combined MSAs)
 
*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)
 
  
Process
+
We might be able to go City -> FIPS place code -> MSA?
* cleaned tables to eliminate duplications, undisclosed variables
 
* changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
 
* matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
 
*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
 
*join by round and company conames
 
*bridge years (1990-2016), stage, and cmsa
 
* populate data with count of companies (Deal flow) and estimated amount ($)
 
** data set in 181 hubs folder under summarycmsa.txt (38394)
 
  
'''Glossary of Tables'''
+
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html
cleanco — used to remove duplicates from companies
 
cleanedcompanies — clean set of companies with no duplicates
 
cmsas— list of all CMSAs in final data set (for merging)
 
cmsastats- statistics not including empty years (pre-merge)
 
cmsastats2 - statistics separated by year-MSA
 
cmsastats3— statistics separated by year-MSA-stage
 
cmsayears— empty merged table between year and cmsa
 
cmsayearstage — empty merged table between cmsa/years and stage
 
combinedrounds— raw sdc data for combined rounds
 
combinedroundswamt— used to join rounds and combined rounds for roundinfo2
 
companies- raw SDC company data
 
companyinfo — cleaned companies joined with state and CMSA information
 
companyinfo2— companyinfo1 with original industry categories
 
companyinfo3— companyinfo2 with updated industry categories and codes
 
companyinfo4
 
companyround
 
companyround2
 
companyround3
 
fundinfo— funds joined with CMSA info
 
fundinfo2 - clean version of fundinfo1
 
fundinfoclean - used in process to clean fundinfo2
 
fundinfoclean2- used in process to clean fundinfo2
 
fundinfocleanfinal- used in process to clean fundinfo2
 
fundinfocleannodups- final clean set of fundinfo
 
funds - raw SDC fund data
 
industry — new industry codes (4)— used for all future data sets
 
industrylist— lookup table for new industry codes (went from 6 to 4)
 
joined1
 
joined2
 
matchfund2
 
matchfunds
 
matchroundfund
 
matchroundfund2
 
msalist — lookup table for MSA to CMSA (used for all future data sets)
 
roundfund— not used— joined round to fund; drop/ignore
 
roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
 
roundinfo2— roundinfo1 including name of investors/funds
 
roundinfo3— clean version of roundinfo2
 
roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
 
rounds — raw SDC round data
 
stages — table for merging stage-year-CMSA
 
superinfo — ignore/drop
 
temp
 
years — table for merging stage-year-CMSA
 
  
===Hub Candidates Data Set===
+
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
 +
However, there is only CBSA!
  
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidate's websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull, as there is little to no quantitative information available for this category of institution, and is dependent on accessibility of information to the public on the internet.
+
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
 +
We can maybe track city to principal city to MSA
  
Characteristics/Variables
+
==COMPUSTAT Data==
*Year Founded
+
The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?).
*Square footage
 
*LinkedIn self-identifiers (how the organization classifies itself on its LinkedIn profile)
 
*Activeness on Twitter (binomial)
 
*Member Directory available online (binomial)
 
*Number of conference rooms
 
*Price ($/month) for Flex desk
 
*Offers Reserved desk (binomial)
 
*Offers office space for rent (binomial)
 
*Offers community membership-- not for coworking but for community events, etc. (binomial)
 
*Number of events offered per month (estimate)
 
*Offers code academy
 
*Mission Statement/Vision (for qualitative or key-word analysis)  
 
  
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.
 
  
As of March 10th 2016, the list contains 125 Hub candidates.
+
Raw Data is in:
 +
E:\McNair\Projects\Hubs\Summer 2017
 +
Z:\Hubs\2017
  
===Supplementary Data Sets===
+
The source file is RandDExpenditures.txt. It contains:
'''Patent data''': to be pulled from USPTO or SDC Platinum.
+
*Date from 1980-2017 (July).
*unable to find on the internet, must be pulled from the larger dataset
+
*427799 records
 +
*Fields include:
 +
**R&D Expenditure
 +
**Address (inc. city, zip, state)
 +
**Revenue of firms
 +
 +
Database is '''cities'''
  
'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF): Grad Students found for the year 2015, no data going back historically; R&D found for the past 10 years
+
SQL script is: COMPUSTAT.sql
  
*categorized university by MSA, can be used for all university-based projects
+
Output file is COMPUSTATSummary.txt. It contains:
 +
*Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
 +
*1979-2016
 +
*4440 cities
  
'''Per Capita Income''' and '''Employment Data''' (US Census Bureau): complete for most recent census, unable to find data going back historically
+
It is located in
 +
Z:\Hubs\2017\Output_Files
  
'''Firm Births''' (BDS): data set found for 1990 to present, currently being cleaned up for use
+
==NSF Data==
 +
Data is in:
 +
E:\McNair\Projects\Hubs\Summer 2017
 +
Z:\Hubs\2017
 +
 +
Database is '''cities'''
  
===Resources===
+
SQL script is: nsf_2017.sql
* Yael Hochberg and Fehder (2015), located in dropbox
 
** Use this paper as a guideline on how to conduct the analysis
 
*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
 
*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
 
*MSA level trends: http://www.metrotrends.org/data.cf
 
  
===To Do===
+
The source files are: nsf2017.txt, copied from table '''nsf''', and nsf_institution copied from table '''nsf_grants_institution''' from the biotech db.
We need to find and clean up data sets at the MSA level
 
  
*Patent data (USPTO)
+
They contain:
*Number of STEM Graduate Students (NSF)
+
*Award ID
**in progress
+
*Award Institution
*University R&D Spending (NSF)
+
*Award Effective date
*Per Capita Income (US Census)
+
*Institution city
**complete (Employment and Income_MSA.xls)
+
*Award Value
*Employment (US Census)
+
*Organization state code
**complete (Employment and Income_MSA.xls)
+
From 1900 - 2017
*Firm births (BDS)
 
*SELECT MSAs!!! Possible method: choosing CMSAs with Distinct companies funded
 
** >100 = 38
 
** >75 = 45
 
** >50 = 52
 
** >25 = 80
 
** Total 238
 
**greater than 100 will give us 52 CMSAs to work with
 
  
===Data Cleaning===
+
Output file is nsfSummary.txt. It contains:
 +
*Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
 +
*1900-2017
  
Cleaning tasks:
+
===Joined NSF table===
*Remove PortCos named Undisclosed, etc.
+
The joined nsf table with the VC table is found in db '''cities'''. The table is named '''merged_nsf'''.
*Remove Funds named Unknown, etc.
+
All the values of nogrants and valuegrant with missing values for years 1990-2017 are set equal to 0.
*Basic Data cleaning:
+
The sql script is in
**Enormous outliers on funds invested
+
Z:\Hubs\2017\sql scripts
**Check dates
 
  
Lookup tables:
+
==NIH Data==
*SuperMSAs
+
Data is in:  
*Industry
+
Z:\Hubs
*Stages
+
E:\McNair\Projects\Hubs\Summer 2017
  
 +
Database is '''cities'''
 +
SQL script is: nih2017.sql
 +
The source files are:
 +
*nih_1986_2001.csv
 +
*nih_2002_2012.txt
 +
*nih_2013_2015
 +
located in E:\McNair\Projects\Federal Grant Data\NIH
  
===The Target Dataset===
 
  
We will need to process the following variables:
 
*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
 
  
 +
The script that cleans NIH data and generates the summary table is titled '''nihSummary'''. It is located here:
  
Example dataset:
+
Z:\Hubs\2017\sql scripts
MSA      Year      SeedVCInv      SeedEarlyVCInv      LaterVCInv    NoDeals  FundsInvested  DistinctInvestors  ....
 
----------------------------------------------------------------------------------------------------------------------------
 
1234    2001      1000000        20000000            30000000      4          7              7
 
  
 +
This table includes
 +
*year
 +
*city
 +
*state
 +
*country
 +
*nogrants (number of grants)
 +
*valuegrant
 +
*city_state
  
Note that the unit of observation is MSA-Year.
+
*Date from 1986-2015
  
Variables to be computed at the MSA level:
+
===Joined NIH table===
*HubActive (binary)
+
The joined NIH table with the VC table is found in db '''cities'''. The table is named '''merged_nih'''.
*NoHubsActive (Count)
+
All the values of nih_valuegrant and nih_nogrants with missing values for years 1986-2015 are set equal to 0.
*HubSqFt
+
The sql script is in
*Other Hub Vars (build list!!!)
+
Z:\Hubs\2017\sql scripts
*'''SeedVCInv''' (Seed/Start-up)
 
*'''EarlyVCInv''' (Early Stage)
 
*LaterStageVC (Later)
 
*OtherStageVC (Buyout/Acq, Other)
 
*'''NoDeals''' (done by local VCs?)
 
**NoDealsNear
 
**NoDealsFar
 
*NoPortCosFunded
 
*FundsInv (in an MSA)
 
**FundsInvFromNear (within MSA?)
 
**FundsInvFromFar (outside MSA?)
 
*DistinctInvestors
 
**DistinctInvestorsNear (within MSA?)
 
**DistinctInvestorsFar (outside MSA?)
 
*PatentCount
 
*NoSTEMGrads
 
*FirmBirths (BDS data)
 
*UniRandDSpend
 
*PerCapitaIncome
 
*Employment
 
  
We need to:
+
==Clinical Trials Data==
*Check funds invested means dollars invested
+
Data is in:  
*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?
+
Z:\Hubs
 +
E:\McNair\Projects\Hubs\Summer 2017
  
 +
Database is '''cities'''
 +
SQL script is: ctrials.sql
 +
The source file is:
  
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).
+
*medclinical.txt
 +
 
 +
located in Z:\Hubs\2017
 +
 
 +
*Date from 1999-2017
 +
 
 +
===Joined clinical trials table===
 +
 
 +
The file which contains the number of trials in each city and year is located in:
 +
Z:\Hubs\2017
 +
 +
The file is in:
 +
Z:\Hubs\2017\clean data
 +
The name of the file is:
 +
  ctrialsSummary.txt
 +
 
 +
It contains:
 +
*city
 +
*year
 +
*city_state_year
 +
*noctrials - number of trials
 +
 
 +
The ctrials table is joined with the VC table.
 +
The SQL script that performs the join is '''new_ctrials.sql''', and it is located in
 +
Z:\Hubs\2017\sql scripts
 +
 
 +
The name of the joined table is '''new_merged_ctrials'''.
 +
 
 +
It contains:
 +
*city
 +
*state
 +
*city_state_id
 +
*city_state_year
 +
*year
 +
*noctrials
 +
*seedamtm
 +
*earlyamtm
 +
*lateramtm
 +
*selamtm
 +
*numseeds
 +
*numearly
 +
*numlater
 +
*numsel
 +
 
 +
All the values of noctrials with missing values for years 1999-2017 are set equal to 0.
 +
 
 +
==Population Data==
 +
Data is in:
 +
Z:\Hubs
 +
E:\McNair\Projects\Hubs\Summer 2017
 +
 
 +
Database is '''cities'''
 +
 
 +
SQL script is: '''population.sql'''
 +
The source files are:
 +
*pop2000_2009.xlsx
 +
*pop2010_2016.xlsx
 +
 
 +
They contain:
 +
*State
 +
*City name
 +
*Year
 +
*Population Estimates
 +
 
 +
Date from 2000-2016
 +
 
 +
===Joined population table===
 +
 
 +
Data is in:
 +
Z:\Hubs\2017\clean data
 +
The file names are
 +
1_population.txt - contains data on population estimates from 2000-2009
 +
2_population.txt - contains data on population estimates from 2010-2016
 +
 
 +
 
 +
Database is '''cities'''
 +
SQL script is: '''new_population.sql''', located in
 +
Z:\Hubs\2017\sql scripts
 +
 
 +
The population table is joined to the VC table. The joined table is called '''new_merged_population'''.
 +
 
 +
They contain:
 +
*City
 +
*State
 +
*city_state_id to uniquely identify each city
 +
*city_state_year to uniquely identify each city in each year
 +
*Population estimates
 +
*Year
 +
*Code from the state code and Fips code
 +
*State full name
 +
 
 +
==Income Data==
 +
 
 +
Raw data was obtained from the US Census Bureau's American Community Survey (ACS).
 +
 
 +
Raw Data is in:
 +
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip
 +
 
 +
 
 +
Date from 2005-2015
 +
 
 +
The master list with MSAs and principal cities is titled '''list2.xls'''. It is located at:
 +
Z:\Hubs\2017
 +
 
 +
This master list includes:
 +
*MSA code
 +
*MSA name
 +
*Principal City
 +
*State
 +
*Place code (city code)
 +
*State Code
 +
 
 +
This master list was edited to associate each principal city with a unique state. For example, New York is a principal city of the New York-New Jersey MSA and was originally associated with the state NY-NJ, so '''list''' was edited to put New York with NY.
 +
 
 +
 
 +
Cleaned Income data files are in
 +
Z:\Hubs\2017\merging_on_ID
 +
 
 +
They contain:
 +
*MSA code
 +
*MSA
 +
*Year
 +
*Total Household Income
 +
 
 +
The MSA-City-State look up file is titled '''msa_city_state_wcode.txt'''. It is located in
 +
Z:\Hubs\2017\merging_on_ID
 +
 
 +
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled '''income.sql'''. It is located here:
 +
Z:\Hubs\2017\sql scripts
 +
 
 +
 +
The final income table is in db '''cities''' titled '''merged_income'''.
 +
 
 +
It includes:
 +
*MSA
 +
*City
 +
*State
 +
*Year
 +
*Total Household Income
 +
 
 +
The table includes 8780 observations
 +
 
 +
===Joined income table===
 +
 
 +
Data is in:
 +
Z:\Hubs\clean data
 +
The file names are:
 +
INC_05.txt - INC_15.txt
 +
 +
 
 +
Database is '''cities'''
 +
SQL script is: merged_income.sql
 +
 
 +
 
 +
They contain:
 +
*City
 +
*State
 +
*city_state_id to uniquely identify each city
 +
*Income
 +
*Year
 +
*Code from the state code and Fips code
 +
 
 +
==Employment Data==
 +
 
 +
Data on employment was obtained from the American Community Survey, US Census Bureau.
 +
 
 +
Raw Data is in:
 +
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
 +
Cleaned files are in
 +
Z:\Hubs\2017\clean data 
 +
 +
They contain:
 +
*MSA code
 +
*MSA
 +
*Year
 +
*Employment rate of individuals 16 years or older
 +
*Unemployment rate of individuals 16 years or older
 +
 
 +
Date from 2005-2015
 +
 
 +
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled '''Employment.sql'''.
 +
The file is located in:
 +
Z:\Hubs\2017
 +
 
 +
The final table is in db '''cities''' titled '''merged_employment'''.
 +
 
 +
It includes:
 +
*MSA
 +
*City
 +
*Year
 +
*Employment rate
 +
*Unemployment rate
 +
 
 +
===Joined employment table===
 +
 
 +
Data is in:
 +
Z:\Hubs\clean data
 +
 +
The file names are:
 +
EMP_05.txt - EMP_15.txt
 +
 
 +
Database is '''cities'''
 +
SQL script is: '''new_employment.sql''' and it is located in
 +
Z:\Hubs\2017\sql scripts
 +
 
 +
The final table which is joined on VC is in db cities titled '''new_merged_employment'''.
 +
 
 +
They contain:
 +
*City
 +
*State
 +
*Code from the state code and Fips code
 +
*city_state_id to uniquely identify each city
 +
*city_state_year to uniquely identify each city in each year
 +
*Employment rates of individuals of 16 years or older
 +
*Unemployment rates of individuals of 16 years or older
 +
*Year
 +
 
 +
==Schooling Data==
 +
 
 +
Data on schooling was obtained from the American Community Survey, US Census Bureau.
 +
 
 +
Raw Data is in:
 +
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
 +
Cleaned files are in
 +
Z:\Hubs\2017\clean data
 +
 +
They contain:
 +
*MSA code
 +
*MSA
 +
*Year
 +
*Total number of population 3 years and over enrolled in school
 +
*Percent of population 3 years and over enrolled in public school
 +
*Percent of population 3 years and over enrolled in private school
 +
 
 +
Date from 2005-2015
 +
 
 +
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled '''schooling.sql'''.
 +
The file is located in:
 +
Z:\Hubs\2017
 +
 
 +
The final table is in db '''cities''' titled '''merged_schooling'''.
 +
 
 +
It includes:
 +
*MSA
 +
*City
 +
*Year
 +
*Total
 +
*Percent_public_schooling
 +
*Percent_private_schooling
 +
 
 +
===Joined schooling table===
 +
 
 +
Data is in:
 +
Z:\Hubs\clean data
 +
The file names are:
 +
SCH_05.txt - SCH_15.txt
 +
 +
 
 +
Database is '''cities'''
 +
SQL script which joins this table with VC table is: '''new_merged_schooling.sql'''
 +
The final table is in db '''cities''' titled '''new_merged_schooling'''.
 +
 
 +
It contains:
 +
*City
 +
*State
 +
*city_state_id to uniquely identify each city
 +
*city_state_year to uniquely identify each city in each year
 +
*Total number of school enrollment
 +
*Percentage enrolled in public schools
 +
*Percentage enrolled in private schools
 +
*Year
 +
*Code from the state code and Fips code
 +
 
 +
==VC Data==
 +
 
 +
 
 +
 
 +
Raw Data is in:
 +
  Z:\VentureCapitalData\SDCVCData\vcdb2
 +
  The file name is roundleveloutput2.txt
 +
 
 +
It contains:
 +
*city
 +
*state
 +
*year
 +
*seedamtm - seed, amount in millions
 +
*earlyamtm - early, amount in millions
 +
*lateramtm - late, amount in millions
 +
*selamtm - seed early late, amount in millions
 +
*numseeds - number of seeds
 +
*numearly
 +
*numlater
 +
*numsel
 +
*numdeals
 +
*numalive
 +
 
 +
 
 +
Date from 1948-2017
 +
 
 +
 
 +
The table is in db '''cities''' titled '''new_vc'''.
 +
 
 +
It includes:
 +
*city
 +
*state
 +
*city_state_id
 +
*city_state_year
 +
*seedamtm
 +
*earlyamtm
 +
*lateramtm
 +
*selamtm
 +
*numseeds
 +
*numearly
 +
*numlater
 +
*numsel
 +
*numdeals
 +
*numalive
 +
*year
 +
 
 +
==Final Joined Data set ==
 +
 
 +
The final data set is in file '''final.txt''' and is located here:
 +
Z:\Hubs\2017
 +
 
 +
It includes:
 +
*city
 +
*state
 +
*city_state_year - (ID that data is merged on)
 +
*year
 +
*seedamtm - Seed Amount
 +
*earlyamtm - Early Investment Amount
 +
*lateramtm - Late Investment Amount
 +
*selamtm - Seed early or late amount
 +
*numseeds - Number of seed investments
 +
*numearly - Number of early investments
 +
*numlater - Number of late investments
 +
*numsel
 +
*numdeals - Number of deals (first contracts)
 +
*numalive - Number of start ups alive
 +
*income - Income per capita in each city-year
 +
*sbir_nogrants - Number of SBIR grants
 +
*sbir_valuegrant - Value of SBIR grants
 +
*emp - Employment stats of each city-year
 +
*unemp - Rate of unemployment
 +
*popestimate - Population estimate of each city-year
 +
*private - Enrollment in private schools
 +
*public - Enrollment in public schools
 +
*total - Total school enrollment
 +
*numfirms - Number of publicly traded firms
 +
*randd - R&D expenditure of publicly traded firms
 +
*revenue - Revenue of publicly traded firms
 +
*totalassets
 +
*nsf_nogrants - Number of NSF grants
 +
*valuegrant - Value of NSF grants
 +
*nih_nogrants - Number of NIH grants
 +
*nih_valuegrant - Value of NIH grants
 +
*noctrials - Number of clinical trials
 +
 
 +
== Defining Hubs ==
 +
'''Summer 2016''' - Last year a master list of 125 "potential" hubs was used. A scorecard was developed which filtered these 125 candidate hubs to determine which of these should be included in the study sample. This method resulted in a sample size of ~ 30. The master list and the final hubs list is titled '''Hubs Data v2_'16'''. It is located here:
 +
Z:\Hubs\2017\hubs_data
 +
 
 +
'''Summer 2017''' - In order to obtain a larger sample of hubs with more statistical power, we developed 5 criteria that produce a more relaxed definition of hubs than last year's. These include:
 +
 
 +
*Availability of co-working space
 +
*Coding classes or tech events
 +
*Some focus on the tech sector (this is important as our dependent variable is VC funding)
 +
*Presence of an accelerator
 +
*Availability of mentorship for members.
 +
 
 +
We will review the 125 candidate hubs and select those which satisfy a subset or all of these characteristics.
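The review step above can be sketched as a simple count of criteria met per candidate. This is only an illustration: the field names, the sample candidates, and the `min_criteria` threshold are hypothetical, since the page does not say how many of the five criteria a candidate must satisfy.

```python
# The five Summer 2017 criteria; the field names are hypothetical labels.
CRITERIA = (
    "coworking_space",
    "coding_classes_or_tech_events",
    "tech_focus",
    "accelerator",
    "mentorship",
)

def criteria_met(candidate):
    """Count how many of the five criteria a candidate satisfies."""
    return sum(1 for name in CRITERIA if candidate.get(name))

def select_hubs(candidates, min_criteria):
    """Return names of candidates meeting at least `min_criteria` criteria."""
    return [c["name"] for c in candidates if criteria_met(c) >= min_criteria]

# Made-up candidates for illustration only.
candidates = [
    {"name": "Candidate A", "coworking_space": True, "tech_focus": True,
     "accelerator": True, "mentorship": True},
    {"name": "Candidate B", "coworking_space": True},
]
selected = select_hubs(candidates, min_criteria=3)
```

Keeping the threshold as a parameter makes it easy to compare how the sample size changes under stricter or looser definitions.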
 +
 
 +
 
 +
 
 +
[[category:Internal]]

Latest revision as of 12:41, 21 September 2020


Project
Hubs
Project Information
Has title Hubs
Has owner Hira Farooqi
Has start date
Has deadline date
Has keywords Data
Has project status Active
Does subsume Hubs Analysis 2017
Has sponsor McNair Center
Has project output Data


Important Notice: The last update to the hubs data was done manually by Ed and is in E:\projects\MeasuringHGHTEcosystems\HubsData-RevisedSimplified.xlsx


The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area. It focuses on cities in the United States as the primary unit of analysis.

This page contains information about data used for this research project, including data sources, location of data on RDP and details on data processing.


Information on initial data work done prior to Summer 2017 can be found at Hubs Summer 2016.

Note on joining: The city-state-year ID from VC data is used as the master ID for joining datasets. Each table (e.g. income, nih, nsf, sbir, compustat) is first joined with the VC data on city-state-year ID and then the resulting tables are all joined together in the final table.
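The two-step joining scheme in the note above can be sketched as follows. This is a minimal Python illustration rather than the project's SQL: the helper `left_join`, the ID format (`CITY_ST_YEAR`), and the sample rows are all assumptions made for the example.

```python
def left_join(master, other, key="city_state_year"):
    """Return master rows with the columns of matching `other` rows added."""
    lookup = {row[key]: row for row in other}
    joined = []
    for row in master:
        merged = dict(row)
        # Rows without a match keep only their own columns.
        merged.update(lookup.get(row[key], {}))
        joined.append(merged)
    return joined

# Hypothetical sample rows; the real tables live in the cities database.
vc = [
    {"city_state_year": "HOUSTON_TX_2010", "numdeals": 12},
    {"city_state_year": "AUSTIN_TX_2010", "numdeals": 30},
]
nsf = [{"city_state_year": "HOUSTON_TX_2010", "nsf_nogrants": 45}]
nih = [{"city_state_year": "AUSTIN_TX_2010", "nih_nogrants": 7}]

# Step 1: join each source table with the VC data on the master ID.
merged_nsf = left_join(vc, nsf)
merged_nih = left_join(vc, nih)

# Step 2: join the resulting tables together into the final table.
final = left_join(merged_nsf, merged_nih)
```

Because the VC rows drive every join, the final table has exactly one row per VC city-state-year, whether or not the other sources have data for it.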


Data by zip code

  • Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)

https://www2.census.gov/programs-surveys/popest/datasets/

  • Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)

https://www.irs.gov/uac/about-irs

  • DCI index, to assess the economic well-being of communities

http://eig.org/dci/interactive-maps/u-s-zip-codes

Data by MSA

We have principal cities of MSAs from the Census: https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html

We might be able to go City -> FIPS place code -> MSA?

Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html

The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html However, there is only CBSA!

This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf We can maybe track city to principal city to MSA

COMPUSTAT Data

The data set includes information on publicly traded firms in the US. It was obtained from the Wharton Research Data Services (https://wrds-web.wharton.upenn.edu/wrds/index.cfm?).


Raw Data is in:

E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

The source file is RandDExpenditures.txt. It contains:

  • Date from 1980-2017 (July).
  • 427799 records
  • Fields include:
    • R&D Expenditure
    • Address (inc. city, zip, state)
    • Revenue of firms

Database is cities

SQL script is: COMPUSTAT.sql

Output file is COMPUSTATSummary.txt. It contains:

  • Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
  • 1979-2016
  • 4440 cities

It is located in

Z:\Hubs\2017\Output_Files
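The city-year summary described above can be sketched in Python. This is not the contents of COMPUSTAT.sql: the field names, the sample values, and the assumption that each input row is one firm in one year are all made up for illustration.

```python
from collections import defaultdict

def summarize(records):
    """Aggregate firm-level rows to one row per (city, year)."""
    totals = defaultdict(
        lambda: {"numfirms": 0, "randd": 0.0, "sales": 0.0, "totalassets": 0.0}
    )
    for rec in records:
        row = totals[(rec["city"], rec["year"])]
        # Assumes one input row per firm per year, so rows count firms.
        row["numfirms"] += 1
        for field in ("randd", "sales", "totalassets"):
            # Assumption: a missing value adds nothing to the sum.
            row[field] += rec.get(field) or 0.0
    return dict(totals)

# Made-up firm-level rows.
sample = [
    {"city": "Houston", "year": 2010, "randd": 1.5, "sales": 10.0, "totalassets": 20.0},
    {"city": "Houston", "year": 2010, "randd": None, "sales": 5.0, "totalassets": 8.0},
    {"city": "Austin", "year": 2010, "randd": 2.0, "sales": 3.0, "totalassets": 4.0},
]
summary = summarize(sample)
```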

NSF Data

Data is in:

E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is cities

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table nsf, and nsf_institution copied from table nsf_grants_institution from the biotech db.

They contain:

  • Award ID
  • Award Institution
  • Award Effective date
  • Institution city
  • Award Value
  • Organization state code

From 1900 - 2017

Output file is nsfSummary.txt. It contains:

  • Variables: City, State code, year, nsf_nogrants, nsf_valuegrant
  • 1900-2017

Joined NSF table

The NSF table joined with the VC table is in db cities, in the table merged_nsf. Missing values of nogrants and valuegrant for the years 1990-2017 are set to 0. The SQL script is in

Z:\Hubs\2017\sql scripts
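The zero-filling rule for the joined NSF table can be sketched as below. This is an assumed Python equivalent, not the project's SQL; the column names follow nsfSummary.txt, and the function name and sample rows are hypothetical.

```python
def fill_missing_grants(rows, columns=("nsf_nogrants", "nsf_valuegrant"),
                        first_year=1990, last_year=2017):
    """Set missing grant values to 0 for rows inside the year window."""
    filled = []
    for row in rows:
        out = dict(row)  # copy so the input rows are left unchanged
        if first_year <= out["year"] <= last_year:
            for col in columns:
                if out.get(col) is None:
                    out[col] = 0
        filled.append(out)
    return filled

# Made-up rows: one before the window, one missing, one with data.
rows = [
    {"year": 1989, "nsf_nogrants": None, "nsf_valuegrant": None},
    {"year": 1995, "nsf_nogrants": None, "nsf_valuegrant": None},
    {"year": 2000, "nsf_nogrants": 3, "nsf_valuegrant": 1200.0},
]
filled = fill_missing_grants(rows)
```

Restricting the fill to the covered years keeps "no grants recorded" (0) distinct from "no data for that year" (missing).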

NIH Data

Data is in:

Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is cities

SQL script is: nih2017.sql

The source files are:

  • nih_1986_2001.csv
  • nih_2002_2012.txt
  • nih_2013_2015

located in E:\McNair\Projects\Federal Grant Data\NIH


The script that cleans NIH data and generates the summary table is titled nihSummary. It is located here:

Z:\Hubs\2017\sql scripts

This table includes

  • year
  • city
  • state
  • country
  • nogrants (number of grants)
  • valuegrant
  • city_state
  • Date from 1986-2015

Joined NIH table

The NIH table joined with the VC table is in db cities, in the table merged_nih. Missing values of nih_valuegrant and nih_nogrants for the years 1986-2015 are set to 0. The SQL script is in

Z:\Hubs\2017\sql scripts

Clinical Trials Data

Data is in:

Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is cities

SQL script is: ctrials.sql

The source file is:

  • medclinical.txt

located in Z:\Hubs\2017

  • Date from 1999-2017

Joined clinical trials table

The file which contains the number of trials in each city and year is located in:

Z:\Hubs\2017

The file is in:

Z:\Hubs\2017\clean data

The name of the file is:

 ctrialsSummary.txt

It contains:

  • city
  • year
  • city_state_year
  • noctrials - number of trials

The ctrials table is joined with the VC table. The SQL script that performs the join is new_ctrials.sql, and it is located in

Z:\Hubs\2017\sql scripts

The name of the joined table is new_merged_ctrials.

It contains:

  • city
  • state
  • city_state_id
  • city_state_year
  • year
  • noctrials
  • seedamtm
  • earlyamtm
  • lateramtm
  • selamtm
  • numseeds
  • numearly
  • numlater
  • numsel

Missing values of noctrials for the years 1999-2017 are set to 0.

Population Data

Data is in:

Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is cities

SQL script is: population.sql

The source files are:

  • pop2000_2009.xlsx
  • pop2010_2016.xlsx

They contain:

  • State
  • City name
  • Year
  • Population Estimates

Date from 2000-2016

Joined population table

Data is in:

Z:\Hubs\2017\clean data

The file names are

1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016


Database is cities

SQL script is: new_population.sql, located in

Z:\Hubs\2017\sql scripts

The population table is joined to the VC table. The joined table is called new_merged_population.

They contain:

  • City
  • State
  • city_state_id to uniquely identify each city
  • city_state_year to uniquely identify each city in each year
  • Population estimates
  • Year
  • Code built from the state code and FIPS code
  • State full name
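The two identifiers above act as join keys, so they need to be built the same way in every table. The page does not document their actual format, so the sketch below assumes a simple string format (normalized city name, state code, year) purely for illustration; both function names are hypothetical.

```python
def make_city_state_id(city: str, state: str) -> str:
    """Hypothetical key: whitespace-normalized, upper-cased city plus state code."""
    normalized_city = " ".join(city.split()).upper()
    return f"{normalized_city}_{state.strip().upper()}"


def make_city_state_year(city: str, state: str, year: int) -> str:
    """Hypothetical key: the city-state key with the year appended."""
    return f"{make_city_state_id(city, state)}_{int(year)}"


print(make_city_state_id("  new  york ", "ny"))
print(make_city_state_year("New York", "NY", 2010))
```

Normalizing case and whitespace before building the key avoids silent mismatches when the same city is spelled slightly differently in two source files.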

Income Data

Raw data was obtained from the US Census Bureau's American Community Survey (ACS).

Raw Data is in:

E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip 


Data from 2005-2015

The master list with MSAs and principal cities is titled list2.xls. It is located at:

Z:\Hubs\2017

This master list includes:

  • MSA code
  • MSA name
  • Principal City
  • State
  • Place code (city code)
  • State Code

This master list was edited to associate each principal city with a single state. For example, New York is the principal city of the New York-New Jersey MSA and was originally associated with the state NY-NJ, so the list was edited to associate New York with NY.


Cleaned Income data files are in

Z:\Hubs\2017\merging_on_ID 

They contain:

  • MSA code
  • MSA
  • Year
  • Total Household Income

The MSA-City-State look up file is titled msa_city_state_wcode.txt. It is located in

Z:\Hubs\2017\merging_on_ID 

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled income.sql. It is located here:

Z:\Hubs\2017\sql scripts
  

The final income table is in db cities titled merged_income.

It includes:

  • MSA
  • City
  • State
  • Year
  • Total Household Income

The table includes 8780 observations.
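The merge of MSA-year income with the MSA-City lookup can be sketched as follows. This is not the contents of income.sql: the table names acs_income and msa_city, the MSA code 'M1', and all values are made up for illustration, and the sketch assumes each principal city simply inherits the value of its MSA.

```python
import sqlite3

# In-memory database with made-up rows; table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE acs_income (msa_code TEXT, year INTEGER, income REAL);
    CREATE TABLE msa_city (msa_code TEXT, city TEXT, state TEXT);
    INSERT INTO acs_income VALUES ('M1', 2005, 100.0), ('M1', 2006, 110.0);
    INSERT INTO msa_city VALUES ('M1', 'City A', 'TX'), ('M1', 'City B', 'TX');
""")

# Joining on the MSA code yields one row per principal city and year,
# so an MSA with two principal cities contributes two rows per year.
merged = conn.execute("""
    SELECT c.city, c.state, i.year, i.income
    FROM acs_income AS i
    JOIN msa_city AS c ON c.msa_code = i.msa_code
    ORDER BY c.city, i.year
""").fetchall()

for row in merged:
    print(row)
```

Because the join multiplies rows, the number of observations in the merged table depends on how many principal cities each MSA has in the lookup file.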

Joined income table

Data is in:

Z:\Hubs\clean data

The file names are:

INC_05.txt - INC_15.txt

Database is cities. SQL script is: merged_income.sql


They contain:

  • City
  • State
  • city_state_id to uniquely identify each city
  • Income
  • Year
  • Code built from the state code and FIPS code

Employment Data

Data on employment was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:

E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA

Cleaned files are in

Z:\Hubs\2017\clean data  

They contain:

  • MSA code
  • MSA
  • Year
  • Employment rate of individuals 16 years or older
  • Unemployment rate of individuals 16 years or older

Data from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled Employment.sql. The file is located in:

Z:\Hubs\2017

The final table is in db cities titled merged_employment.

It includes:

  • MSA
  • City
  • Year
  • Employment rate
  • Unemployment rate

Joined employment table

Data is in:

Z:\Hubs\clean data

The file names are:

EMP_05.txt - EMP_15.txt 

Database is cities. SQL script is: new_employment.sql, located in Z:\Hubs\2017\sql scripts

The final table, which is joined with the VC table, is in db cities and is titled new_merged_employment.

It contains:

  • City
  • State
  • Code built from the state code and FIPS code
  • city_state_id to uniquely identify each city
  • city_state_year to uniquely identify each city in each year
  • Employment rates of individuals of 16 years or older
  • Unemployment rates of individuals of 16 years or older
  • Year

Schooling Data

Data on schooling was obtained from the American Community Survey, US Census Bureau.

Raw Data is in:

E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA

Cleaned files are in

Z:\Hubs\2017\clean data

They contain:

  • MSA code
  • MSA
  • Year
  • Total number of population 3 years and over enrolled in school
  • Percent of population 3 years and over enrolled in public school
  • Percent of population 3 years and over enrolled in private school

Data from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled schooling.sql. The file is located in:

Z:\Hubs\2017

The final table is in db cities titled merged_schooling.

It includes:

  • MSA
  • City
  • Year
  • Total
  • Percent_public_schooling
  • Percent_private_schooling

Joined schooling table

Data is in:

Z:\Hubs\clean data

The file names are:

SCH_05.txt - SCH_15.txt

Database is cities. The SQL script which joins this table with the VC table is: new_merged_schooling.sql. The final table is in db cities titled new_merged_schooling.

It contains:

  • City
  • State
  • city_state_id to uniquely identify each city
  • city_state_year to uniquely identify each city in each year
  • Total number of school enrollment
  • Percentage enrolled in public schools
  • Percentage enrolled in private schools
  • Year
  • Code built from the state code and FIPS code

VC Data

Raw Data is in:

 Z:\VentureCapitalData\SDCVCData\vcdb2
 The file name is roundleveloutput2.txt

It contains:

  • city
  • state
  • year
  • seedamtm - seed stage, amount in millions
  • earlyamtm - early stage, amount in millions
  • lateramtm - later stage, amount in millions
  • selamtm - seed, early, and later stages, amount in millions
  • numseeds - number of seed investments
  • numearly
  • numlater
  • numsel
  • numdeals
  • numalive


Data from 1948-2017


The table is in db cities titled new_vc.

It includes:

  • city
  • state
  • city_state_id
  • city_state_year
  • seedamtm
  • earlyamtm
  • lateramtm
  • selamtm
  • numseeds
  • numearly
  • numlater
  • numsel
  • numdeals
  • numalive
  • year

Final Joined Data set

The final data set is in file final.txt and is located here:

Z:\Hubs\2017

It includes:

  • city
  • state
  • city_state_year - (ID that data is merged on)
  • year
  • seedamtm - Seed Amount
  • earlyamtm - Early Investment Amount
  • lateramtm - Late Investment Amount
  • selamtm - Seed, early, or late investment amount
  • numseeds - Number of seed investments
  • numearly - Number of early investments
  • numlater - Number of late investments
  • numsel
  • numdeals - Number of deals (first contracts)
  • numalive - Number of startups alive
  • income - Income per capita in each city-year
  • sbir_nogrants - Number of SBIR grants
  • sbir_valuegrant - Value of SBIR grants
  • emp - Employment rate of each city-year
  • unemp - Rate of unemployment
  • popestimate - Population estimate of each city-year
  • private - Enrollment in private schools
  • public - Enrollment in public schools
  • total - Total school enrollment
  • numfirms - Number of publicly traded firms
  • randd - R&D expenditure of publicly traded firms
  • revenue - Revenue of publicly traded firms
  • totalassets
  • nsf_nogrants - Number of NSF grants
  • valuegrant - Value of NSF grants
  • nih_nogrants - Number of NIH grants
  • nih_valuegrant - Value of NIH grants
  • noctrials - Number of clinical trials
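Combining the individual tables on city_state_year can be sketched as a repeated left merge. The sketch below is an illustration in plain Python, not the procedure actually used to build final.txt: it assumes the VC table is the base, the helper merge_on_key is hypothetical, and the keys and values are made up.

```python
def merge_on_key(base, *others):
    """Left-merge dict-of-dict tables onto base, keyed by city_state_year."""
    merged = {key: dict(row) for key, row in base.items()}
    for table in others:
        # Every column that appears anywhere in this table.
        columns = sorted({col for row in table.values() for col in row})
        for key, row in merged.items():
            match = table.get(key, {})
            for col in columns:
                row[col] = match.get(col)  # None when the city-year is absent
    return merged


# Made-up rows keyed by a hypothetical city_state_year format.
vc = {"CITY A_TX_2010": {"numdeals": 5}, "CITY B_TX_2010": {"numdeals": 2}}
population = {"CITY A_TX_2010": {"popestimate": 1000}}
nih = {"CITY B_TX_2010": {"nih_nogrants": 3, "nih_valuegrant": 250.0}}

final = merge_on_key(vc, population, nih)
for key, row in final.items():
    print(key, row)
```

Only city-years present in the base table survive the merge, and columns from tables without a matching row are left as None rather than dropped.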

Defining Hubs

Summer 2016 - A master list of 125 "potential" hubs was used. A scorecard was developed to filter these 125 candidate hubs and determine which should be included in the study sample. This method resulted in a sample size of ~30. The master list and the final hubs list are in the file titled Hubs Data v2_'16. It is located here:

Z:\Hubs\2017\hubs_data

Summer 2017 - To obtain a larger sample of hubs, and therefore more statistical power, we developed 5 criteria that produce a more relaxed definition of a hub than last year's. These are:

  • Availability of co-working space
  • Coding classes or tech events
  • Some focus on the tech sector (this is important as our dependent variable is VC funding)
  • Presence of an accelerator
  • Availability of mentorship for members.

We will review the 125 candidate hubs and select those that satisfy some or all of these criteria.
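The selection step can be sketched as counting how many of the five criteria each candidate meets. The page does not say how many criteria a hub must satisfy, so the min_criteria threshold below is an assumption, as are the field names, the helper functions, and the made-up candidates.

```python
# Hypothetical field names for the five criteria listed above.
CRITERIA = (
    "coworking_space",
    "coding_or_tech_events",
    "tech_focus",
    "accelerator",
    "mentorship",
)


def criteria_met(hub: dict) -> int:
    """Count how many of the five criteria a candidate hub satisfies."""
    return sum(1 for criterion in CRITERIA if hub.get(criterion, False))


def select_hubs(candidates: list, min_criteria: int) -> list:
    """Return the names of candidates meeting at least min_criteria criteria."""
    return [hub["name"] for hub in candidates if criteria_met(hub) >= min_criteria]


# Made-up candidates for illustration only.
candidates = [
    {"name": "Hub A", **{criterion: True for criterion in CRITERIA}},
    {"name": "Hub B", "coworking_space": True, "mentorship": True},
    {"name": "Hub C"},
]

print(select_hubs(candidates, min_criteria=2))
print(select_hubs(candidates, min_criteria=5))
```

Keeping the threshold as a parameter makes it easy to see how the sample size changes as the definition of a hub is relaxed or tightened.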