Difference between revisions of "Hubs"

McNair Project
Hubs
Project Information
Project Title	Hubs
Owner	Hira Farooqi
Start Date
Deadline
Keywords	Data
Primary Billing
Notes
Has project status	Active
	Copyright © 2016 edegan.com. All Rights Reserved.

Revision as of 13:11, 13 July 2017

The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.

This research will primarily focused on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the greater majority of Venture Capital funding is located.

Details of initial data work done prior to Summer 2017 can be found at Hubs Summer 2016

1 COMPUSTAT Data
2 NSF Data
- 2.1 Joined NSF table
3 NIH Data
- 3.1 Joined NIH table
4 Clinical Trials Data
- 4.1 Joined clinical trials table
5 Population Data
- 5.1 Joined population table
6 Income Data
- 6.1 Joined income table
7 Employment Data
- 7.1 Joined employment table
8 Schooling Data
- 8.1 Joined schooling table
9 VC Data

COMPUSTAT Data

Data is in:

E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is cities

SQL script is: COMPUSTAT.sql

The source file is RandDExpenditures.txt. It contains:

Date from 1980-2017 (July). All COMPUSTAT.
427799 records
Fields include:
- R&D Expenditure
- Address (inc. city, zip, state)

Output file is COMPUSTATSummary.txt. It contains:

Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
1979-2016
4440 cities

NSF Data

Data is in:

E:\McNair\Projects\Hubs\Summer 2017
Z:\Hubs\2017

Database is cities

SQL script is: nsf_2017.sql

The source files are: nsf2017.txt, copied from table nsf, and nsf_institution copied from table nsf_grants_institution from the biotech db.

They contain:

Award ID
Award Institution
Award Effective date
Institution city
Award Value
Organization state code

From 1900 - 2017

Output file is nsfSummary.txt. It contains:

Variables: City, State code year, nsf_nogrants, nsf_valuegrant
1900-2017

Joined NSF table

The joined nsf table with the VC table is found in db cities. The table is named merged_nsf. All the values of nogrants and valuegrant with missing values for years 1990-2017 are set equal to 0. The sql script is in

Z:\HUbs\2017\sql scripts

NIH Data

Data is in:

Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is cities SQL script is: nih2017.sql The source files are:

nih_1986_2001.csv
nih_2002_2012.txt
nih_2013_2015

located in E:\McNair\Projects\Federal Grant Data\NIH

The script that cleans NIH data and generates the summary table is titled nihSummary. It is located here:

E:\McNair\Projects\Hubs\Summer 2017

This table includes

year
city
state
country
nogrants (number of grants)
valuegrant
city_state (the city-state ID that we'll merge on)

Date from 1986-2015

Joined NIH table

The joined NIH table with the VC table is found in db cities. The table is named merged_nih. All the values of nih_valuegrant and nih_nogrants with missing values for years 1986-2015 are set equal to 0. The sql script is in

Z:\HUbs\2017\sql scripts

Clinical Trials Data

Data is in:

Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is cities SQL script is: ctrials.sql The source file is:

medclinical.txt

located in Z:\Hubs\2017

Date from 1999-2017

Joined clinical trials table

The file which contains the number of trials in each city and year is located in:

Z:\Hubs\2017

The file is in:

Z:\Hubs\2017\clean data

The name of the file is:

 ctrialsSummary.txt

It contains:

city
year
city_state_year
noctrials - number of trials

The ctrials is joined with VC table. The joined SQL script is: new_ctrials.sql and it is located in

Z:\Hubs\2017\sql scripts

The name of the joined table is new_merged_ctrials.

It contains:

city
state
city_state_id
city_state_year
year
noctrials
seedamtm
earlyamtm
lateramtm
selamtm
numseeds
numearly
numlater
numsel

All the values of noctrials with missing values for years 1999-2017 are set equal to 0.

Population Data

Data is in:

Z:\Hubs
E:\McNair\Projects\Hubs\Summer 2017

Database is cities

SQL script is: population.sql The source files are:

pop2000_2009.xlsx
pop2010_2016.xlsx

They contain:

State
City name
Year
Population Estimates

Date from 2000-2016

Joined population table

Data is in:

Z:\Hubs\2017\clean data

The file names are

1_population.txt - contains data on population estimates from 2000-2009
2_population.txt - contains data on population estimates from 2010-2016

Database is cities SQL script is: new_population.sql, located in

Z:\Hubs\2017\sql scripts

The population table is joined on VC table. The table is called new_merged_population.

They contain:

City
State
city_state_id to uniquely identify each city
city_state_year to uniquely identify each city in each year
Population estimates
Year
Code from the state code and Fips code
State full name

Income Data

Raw data was obtained from Census data, American Communities Survey.

Raw Data is in:

E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip

Date from 2005-2015

The master list with MSAs and principal cities is titled list2.xls. It is located at:

Z:\Hubs\2017

This master list includes:

MSA code
MSA name
Principal City
State
Place code (city code)
State Code

This master list was edited to associate each principal city with a unique state. E.g. if New York is the principal city located in New York-New Jersey MSA, it was associated with state NY-NJ. So list was edited to put New York with NY.

Cleaned Income data files are in

Z:\Hubs\2017\merging_on_ID

They contain:

MSA code
MSA
Year
Total Household Income

The MSA-City-State look up file is titled msa_city_state_wcode.txt. It is located in

Z:\Hubs\2017\merging_on_ID

The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled income.sql. It is located here:

Z:\Hubs\2017\sql scripts

The final income table is in db cities titled merged_income.

It includes:

MSA
City
State
Year
Total Household Income

The table includes 8780 observations

Joined income table

Data is in:

Z:\Hubs\clean data

The file names are:

INC_05.txt - INC_15.txt

Database is cities SQL script is: merged_income.sql

They contain:

City
State
city_state_id to uniquely identify each city
Income
Year
Code from the state code and Fips code

Employment Data

Data on employment was obtained from American Communities Survey, US Census Bureau.

Raw Data is in:

E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA

Cleaned files are in

Z:\Hubs\2017\clean data

They contain:

MSA code
MSA
Year
Employment rate of individuals 16 years or older
Unemployment rate of individuals 16 years or older

Date from 2005-2015

The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled Employment.sql. The file is located in:

Z:\Hubs\2017

The final table is in db cities titled merged_employment.

It includes:

MSA
City
Year
Employment rate
Unemployment rate

Joined employment table

Data is in:

Z:\Hubs\clean data

The file names are:

EMP_05.txt - EMP_15.txt

Database is cities SQL script is: new_employment.sql and it is located in Z:\Hubs\2017\sql scripts

The final table which is joined on VC is in db cities titled new_merged_employment.

They contain:

City
State
Code from the state code and Fips code
city_state_id to uniquely identify each city
city_state_year to uniquely identify each city in each year
Employment rates of individuals of 16 years or older
Unemployment rates of individuals of 16 years or older
Year

Schooling Data

Data on schooling was obtained from American Communities Survey, US Census Bureau.

Raw Data is in:

E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA

Cleaned files are in

Z:\Hubs\2017\clean data

They contain:

MSA code
MSA
Year
Total number of population 3 years and over enrolled in school
Percent of population 3 years and over enrolled in public school
Percent of population 3 years and over enrolled in private school

Date from 2005-2015

The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled schooling.sql. The file is located in:

Z:\Hubs\2017

The final table is in db cities titled merged_schooling.

It includes:

MSA
City
Year
Total
Percent_public_schooling
Percent_private_schooling

Joined schooling table

Data is in:

Z:\Hubs\clean data

The file names are:

SCH_05.txt - SCH_15.txt

Database is cities SQL script which joins this table with VC table is: new_merged_schooling.sql The final table is in db cities titled new_merged_schooling.

It contains:

City
State
city_state_id to uniquely identify each city
city_state_year to uniquely identify each city in each year
Total number of school enrollment
Percentage enrolled in public schools
Percentage enrolled in private schools
Year
Code from the state code and Fips code

VC Data

Raw Data is in:

 Z:\VentureCapitalData\SDCVCData
 The file name is roundcitystateyear.txt

It contains:

city
state
year
seedamtm - seed, amount in millions
earlyamtm - early, amount in millions
lateramtm - late, amount in millions
selamtm - seed early late, amount in millions
numseeds - number of seeds
numearly
numlater
numsel

Date from 1953-2017

The table is in db cities titled vc.

It includes:

city
state
city_state_id
city_state_year
seedamtm
earlyamtm
lateramtm
selamtm
numseeds
numearly
numlater
numsel
year

Difference between revisions of "Hubs"

Revision as of 13:11, 13 July 2017

Contents

COMPUSTAT Data

NSF Data

Joined NSF table

NIH Data

Joined NIH table

Clinical Trials Data

Joined clinical trials table

Population Data

Joined population table

Income Data

Joined income table

Employment Data

Joined employment table

Schooling Data

Joined schooling table

VC Data

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Sites

Sections

Organizations

Help

Tools

@@ Line 9: / Line 9: @@
 This research will primarily focused on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the greater majority of Venture Capital funding is located.
-===Primary Data Set===
-The Hubs data set, from SDC Platinum, has been constructed in the server:
- Data files are in 128.42.44.181/bulk/Hubs
- All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs
- psql Hubs
+Details of initial data work done prior to Summer 2017 can be found at [[Hubs Summer 2016]]
-The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
-Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will be looking at in terms of number of companies funded in  number of funds active, and flow of investment in a given MSA.
-The data set has now been uploaded to the database server, named Hubs.
-There are 4 tables:
-*Rounds: Rounddate, coname, state, roundno, stage1, etc.
-*CombinedRounds: Coname, rounddate, discamount, fundname
-*Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
-*Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address
-Used variables:
- Companies: Coname, MSACode, Industry, state
- MSALookupTable: MSACode, MSASuper
- IndustryLookupTable: IndustryMajor, InduCode
- ->
- CompanyInfo: Coname, MSASuper, InduCode, state (complete)
- Funds: fundname, msacode, state
- MSALookupTable: MSACode, MSASuper
- ->
- FundInfo: fundname, msacode, state (complete)
- Rounds: coname, rounddate, stagecode, roundno
- CombinedRounds: coname, rounddate, discamount, fundname
- ->
- RoundInfoSuper: coname, rounddate, '''nofunds''', discamount
- ->
- RoundInfo: Coname, roundyear, fundname, estamount (complete)
-Then take:
- RoundInfo: Coname, roundyear, fundname, estamount
- CompanyInfo: Coname, MSASuper, InduCode, state
- FundInfo: fundname, msacode, state
- ->
- SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount
- ->
- MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear
- ...
-'''Notes on Creation of Primary Data Set'''
-Raw tables
-* companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
-* funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
-* rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
-* combined rounds (company name, round date, disclosed amount, investor)
-* msalist (changes MSAs to CMSAs— combined MSAs)
-*industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)
-Process
-*cleaned tables to eliminate duplications, undisclosed variables
-*changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
-*matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
-*matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
-*join by round and company conames
-*bridge years (1990-2016), stage, and cmsa
-* populate data with count of companies (Deal flow) and estimated amount ($)
-** data set in 181 hubs folder under summarycmsa.txt (38394)
-Key decisions:
-*Threw out undisclosed co through-out as no address
-*Count is done by joining round and company
-*Anything fund related must be disclosed fund
-*Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only
-'''Glossary of Tables'''
- cleanco — used to remove duplicates from companies
- cleanedcompanies — clean set of companies with no duplicates
- cmsafunds-
- cmsas— list of all CMSAs in final data set (for merging)
- cmsastats- statistics not including empty years (pre-merge)
- cmsastats2 - statistics separated by year-MSA
- cmsastats3— statistics separated by year-MSA-stage
- cmsastats4
- cmsayears— empty merged table between year and cmsa
- cmsayearstage — empty merged table between cmsa/years and stage
- combinedrounds— raw sdc data for combined rounds
- combinedroundswamt— used to join rounds and combined rounds for roundinfo2
- companies- raw SDC company data
- companyinfo — cleaned companies joined with state and CMSA information
- companyinfo2— companyinfo1 with original industry categories
- companyinfo3— companyinfo2 with updated industry categories and codes
- companyinfo4-- clean version of companyinfo3
- companyround- combined company information with round information
- companyround2- combined company information with round information, cleaned up from companyround2
- companyround3- combined company information with round information, cleaned up from companyround3
- '''finaldataset'''- final statistics by CMSA-year, see section Final Primary Data Set for more information
- fundinfo— funds joined with CMSA info
- fundinfo2 - clean version of fundinfo1
- fundinfoclean - used in process to clean fundinfo2
- fundinfoclean2- used in process to clean fundinfo2
- fundinfocleanfinal- used in process to clean fundinfo2
- fundinfocleannodups- final clean set of fundinfo
- funds - raw SDC fund data
- Houston - analysis for Houston ecosystem team
- Houston2- analysis for Houston ecosystem team
- houston3- analysis for Houston ecosystem team
- industry — new industry codes (4)— used for all future data sets
- industrylist— lookup table for new industry codes (went from 6 to 4)
- joined1- used for matching process
- joined2- used for matching process
- matchfund2- used for matching process
- matchfunds- used for matching process
- matchroundfund - used for matching process
- matchroundfund2- used for matching process
- msalist — lookup table for MSA to CMSA (used for all future data sets)
- nearfar1-- beginning set before adding nearfar/stage variables
- nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset
- roundfund— not used— joined round to fund; drop/ignore
- roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate
- roundinfo2— roundinfo1 including name of investors/funds
- roundinfo3— clean version of roundinfo2
- roundinfoclean — final clean version of roundinfo3 (final roundinfo table)
- rounds — raw SDC round data
- stages — table for merging stage-year-CMSA
- superinfo — ignore/drop
- temp - used for matching process
- years — table for merging stage-year-CMSA
-===Hub Candidates Data Set===
-The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidate's websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull, as there is little to no quantitative information available for this category of institution, and is dependent on accessibility of information to the public on the internet.
-Characteristics/Variables
-*Year Founded
-*Square footage
-*LinkedIN self-identifiers (what the organization classifies itself on its LinkedIN profile)
-*Activeness on Twitter (binomial)
-*Member Directory available online (binomial)
-*Number of conference rooms
-*Price ($/month) for Flex desk
-*Offers Reserved desk (binomial)
-*Offers office space for rent (binomial)
-*Offers community membership-- not for coworking but for community events, etc. (binomial)
-*Number of events offered per month (estimate)
-*Offers code academy
-*Mission Statement/Vision (for qualitative or key-word analysis)
-These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.
-As of March 10th 2016, the list contains 125 Hub candidates.
-'''Where to find''': The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue
-===Supplementary Data Sets===
-'''Patent data''': to be pulled from USPTO or SDC Platinum.
-'''Number of STEM Graduate Students''' (NSF) and '''University R&D Spending''' (NSF):
-*University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
-*R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx) or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
-** not uploaded to server or matched yet to CMSA code, because of this discrepancy.
-**"Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects
-'''Per Capita Income''' and '''Employment Data''' (US Census Bureau):
-*"Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
-*"Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
-**not uploaded to server or matched yet to CMSA code
-'''Firm Births''' (BDS)
-*in server 181, under table name "BDS"
-*includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
-*includes code for CMSA but is not aggregated by CMSA
-** i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)
-===Resources===
-* Yael Hochberg and Fehder (2015), located in dropbox
-** Use this paper as a guideline on how to conduct the analysis
-*US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
-*USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
-*MSA level trends: http://www.metrotrends.org/data.cf
-===The Target Dataset===
-We will need to process the following variables:
-*SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
-*CSV mapping msas to cmsas is in the folder (and a table in the dbase)
-Example dataset:
- MSA      Year       SeedVCInv      SeedEarlyVCInv      LaterVCInv     NoDeals   FundsInvested   DistinctInvestors   ....
- ----------------------------------------------------------------------------------------------------------------------------
-     2001       1000000        20000000            30000000       4          7              7
-Note that the unit of observation is MSA-Year.
-Variables to be computed at the MSA level:
-*HubActive (binary)
-*NoHubsActive (Count)
-*HubSqFt
-*Other Hub Vars (build list!!!)
-*'''SeedVCInv'''  (Seed/Start-up)
-*'''EarlyVCInv''' (Early Stage)
-*'''LaterStageVC''' (Later)
-*'''OtherStageVC''' (Buyout/Acq, Other)
-*'''NoDeals''' (done by local VCs?)
-**'''NoDealsNear'''
-**'''NoDealsFar'''
-*NoPortCosFunded
-*'''FundsInv''' (in an MSA)
-**'''FundsInvFromNear''' (within MSA?)
-**'''FundsInvFromFar''' (outside MSA?)
-*DistinctInvestors (?)
-**DistinctInvestorsNear (within MSA?)
-**DistinctInvestorsFar (outside MSA?)
-*PatentCount
-*NoSTEMGrads
-*FirmBirths (BDS data)
-*UniRandDSpend
-*PerCapitaIncome
-*Employment
-We need to:
-*Check funds invested means dollars invested
-*Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?
-There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).
-===Final Primary Data Set===
-*Deal is a round syndicate (near/far deal is one investor that is near/far).
-Table name: finaldataset
- cmsa
- year
- totalamountinv--total amount invested
- nearamountinv--amount invested from local funds
- faramountinv-- amount invested from funds outside CMSA
- earlyinv--amount invested in early stage companies
- laterinv--amount invested in later stage companies
- startupseedinv--amount invested in seed or startup stage companies
- otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies
- investingfund--distinct funds that are investing in that CMSA-year
- investingfundnear--distinct funds from that CMSA that invested in that CMSA-year
- investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year
- deals--number of deals
- neardeals--number of deals inside a CMSA
- fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA
- earlystagedeals--deals with earlystage companies
- laterstagedeals--deals with later stage companies
- startupseeddeals--deals with startup/seed companies
- otherstagedeals--deals with companies in other stages
- newportcosfunded--number of portfolio companies to receive their first investment in that year
-===Data by zip code===
-*Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
-https://www2.census.gov/programs-surveys/popest/datasets/
-*Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
-https://www.irs.gov/uac/about-irs
-*DCI index, to assess the economic well-being of communities
-http://eig.org/dci/interactive-maps/u-s-zip-codes
-*R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
-*Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).
-== Data by MSA ==
-We have principle cities of MSAs from the census:
-https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html
-We might be able to go City -> FIPS place code -> MSA?
-Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html
-The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html
-However, there is only CBSA!
-This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf
-We can maybe track city to principal city to MSA
 ==COMPUSTAT Data==
@@ Line 700: / Line 424: @@
 *numsel
 *year
-check pg [[new hubs]]