Difference between revisions of "Hubs"
Line 444: | Line 444: | ||
SQL script is: '''new_population.sql''' | SQL script is: '''new_population.sql''' | ||
− | The population table is joined on vc_city_state_year. The table is called '''new_merged_population'''. | + | The population table is joined on VC table called 'vc_city_state_year'. The table is called '''new_merged_population'''. |
They contain: | They contain: |
Revision as of 14:36, 12 July 2017
Hubs | |
---|---|
Project Information | |
Project Title | Hubs |
Owner | Hira Farooqi |
Start Date | |
Deadline | |
Keywords | Data |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
The Hubs Research Project is a full-length academic paper analyzing the effectiveness of "hubs", a component of the entrepreneurship ecosystem, in the advancement and growth of entrepreneurial success in a metropolitan area.
This research will primarily focused on large and mid-sized Metropolitan Statistical Areas (MSAs), as that is where the greater majority of Venture Capital funding is located.
Contents
- 1 Primary Data Set
- 2 Hub Candidates Data Set
- 3 Supplementary Data Sets
- 4 Resources
- 5 The Target Dataset
- 6 Final Primary Data Set
- 7 Data by zip code
- 8 Data by MSA
- 9 COMPUSTAT Data
- 10 NSF Data
- 11 NIH Data
- 12 Clinical Trials Data
- 13 Population Data
- 14 Income Data
- 15 Employment Data
- 16 Schooling Data
- 17 VC Data
Primary Data Set
The Hubs data set, from SDC Platinum, has been constructed in the server:
Data files are in 128.42.44.181/bulk/Hubs All files are in 128.42.44.182/bulk/Projects/Ecosystem/Hubs psql Hubs
The data set includes all United States Venture Capital transactions (moneytree) from the twenty-five year period of 1990 through 2015.
Data has been aggregated at the portfolio company, fund, and round level. It will be analyzed at the combined MSA level. We will be looking at in terms of number of companies funded in number of funds active, and flow of investment in a given MSA.
The data set has now been uploaded to the database server, named Hubs.
There are 4 tables:
- Rounds: Rounddate, coname, state, roundno, stage1, etc.
- CombinedRounds: Coname, rounddate, discamount, fundname
- Companies: LastInv, FirstInv, coname, MSA, MSACode, Address, state, datefounded, totalknownfunding, industry(major)
- Funds: fundname, closingdate, lastinv, firstinv, msa, msacode, avinv, nocoinv, totalknowninv, address
Used variables:
Companies: Coname, MSACode, Industry, state MSALookupTable: MSACode, MSASuper IndustryLookupTable: IndustryMajor, InduCode -> CompanyInfo: Coname, MSASuper, InduCode, state (complete)
Funds: fundname, msacode, state MSALookupTable: MSACode, MSASuper -> FundInfo: fundname, msacode, state (complete)
Rounds: coname, rounddate, stagecode, roundno CombinedRounds: coname, rounddate, discamount, fundname -> RoundInfoSuper: coname, rounddate, nofunds, discamount -> RoundInfo: Coname, roundyear, fundname, estamount (complete)
Then take:
RoundInfo: Coname, roundyear, fundname, estamount CompanyInfo: Coname, MSASuper, InduCode, state FundInfo: fundname, msacode, state -> SuperRoundInfo: Coname, CoMSASuper, CoInduCode, CoState, FundName, FundMSASuper, FundState, RoundYear, RoundEstAmount -> MSAPortCos: Count(CoName) As NoPortCosFunded, CoMSASuper, RoundYear ...
Notes on Creation of Primary Data Set
Raw tables
- companies (last investment, first investment, company name, MSA, MSA code, address, state, date founded, known funding, industry)
- funds (fund closing date, last investment, first investment, fund name, address, MSA, MSA code, Average investment, number companies invested (NoCos), known investment)
- rounds (round date, company name, state, round number, stage 1, stage 2, stage 3)
- combined rounds (company name, round date, disclosed amount, investor)
- msalist (changes MSAs to CMSAs— combined MSAs)
- industry list (changes 6 industry categories to 4— ICT, Life Sciences, Semiconductors, Other)
Process
- cleaned tables to eliminate duplications, undisclosed variables
- changed all original characters to include CMSA and Industry Codes (companyinfo3, fundinfocleanfinal, roundinfoclean)
- matched funds to avoid any issues with names (i.e. Fund ABC L.P./Fund ABC LP/Fund ABC)
- matched roundinfoclean investors to fundinfocleanfinal investors (roundinfo.txt >> cleanfundfinal.txt)
- join by round and company conames
- bridge years (1990-2016), stage, and cmsa
- populate data with count of companies (Deal flow) and estimated amount ($)
- data set in 181 hubs folder under summarycmsa.txt (38394)
Key decisions:
- Threw out undisclosed co through-out as no address
- Count is done by joining round and company
- Anything fund related must be disclosed fund
- Near and far, and total invested, and fund counts, etc., are all done using disclosed funds that match only
Glossary of Tables
cleanco — used to remove duplicates from companies cleanedcompanies — clean set of companies with no duplicates cmsafunds- cmsas— list of all CMSAs in final data set (for merging) cmsastats- statistics not including empty years (pre-merge) cmsastats2 - statistics separated by year-MSA cmsastats3— statistics separated by year-MSA-stage cmsastats4 cmsayears— empty merged table between year and cmsa cmsayearstage — empty merged table between cmsa/years and stage combinedrounds— raw sdc data for combined rounds combinedroundswamt— used to join rounds and combined rounds for roundinfo2 companies- raw SDC company data companyinfo — cleaned companies joined with state and CMSA information companyinfo2— companyinfo1 with original industry categories companyinfo3— companyinfo2 with updated industry categories and codes companyinfo4-- clean version of companyinfo3 companyround- combined company information with round information companyround2- combined company information with round information, cleaned up from companyround2 companyround3- combined company information with round information, cleaned up from companyround3 finaldataset- final statistics by CMSA-year, see section Final Primary Data Set for more information fundinfo— funds joined with CMSA info fundinfo2 - clean version of fundinfo1 fundinfoclean - used in process to clean fundinfo2 fundinfoclean2- used in process to clean fundinfo2 fundinfocleanfinal- used in process to clean fundinfo2 fundinfocleannodups- final clean set of fundinfo funds - raw SDC fund data Houston - analysis for Houston ecosystem team Houston2- analysis for Houston ecosystem team houston3- analysis for Houston ecosystem team industry — new industry codes (4)— used for all future data sets industrylist— lookup table for new industry codes (went from 6 to 4) joined1- used for matching process joined2- used for matching process matchfund2- used for matching process matchfunds- used for matching process matchroundfund - used for matching process matchroundfund2- used for matching process msalist — lookup table for MSA to CMSA (used for all future data sets) nearfar1-- beginning set before adding nearfar/stage variables nearfar2 -- added binomial variables for near/far and for each of the stages, used to build final dataset roundfund— not used— joined round to fund; drop/ignore roundinfo— round info cleaned up to include number of investors in a syndicate and estimated investment per member of syndicate roundinfo2— roundinfo1 including name of investors/funds roundinfo3— clean version of roundinfo2 roundinfoclean — final clean version of roundinfo3 (final roundinfo table) rounds — raw SDC round data stages — table for merging stage-year-CMSA superinfo — ignore/drop temp - used for matching process years — table for merging stage-year-CMSA
Hub Candidates Data Set
The Hubs candidate data set is a list of potential hubs found in MSAs throughout the country. Researchers are currently pulling qualitative and quantitative information from the candidate's websites, in an attempt to categorize what can be identified as a hub. This is a difficult data set to pull, as there is little to no quantitative information available for this category of institution, and is dependent on accessibility of information to the public on the internet.
Characteristics/Variables
- Year Founded
- Square footage
- LinkedIN self-identifiers (what the organization classifies itself on its LinkedIN profile)
- Activeness on Twitter (binomial)
- Member Directory available online (binomial)
- Number of conference rooms
- Price ($/month) for Flex desk
- Offers Reserved desk (binomial)
- Offers office space for rent (binomial)
- Offers community membership-- not for coworking but for community events, etc. (binomial)
- Number of events offered per month (estimate)
- Offers code academy
- Mission Statement/Vision (for qualitative or key-word analysis)
These characteristics/variables will be used to determine whether a candidate is or is not likely to be a Hub.
As of March 10th 2016, the list contains 125 Hub candidates.
Where to find: The Hubs data set can be found in the Ecosystem>>Hubs>>dataset folder. It is not currently in the database due to a UTF8 issue
Supplementary Data Sets
Patent data: to be pulled from USPTO or SDC Platinum.
Number of STEM Graduate Students (NSF) and University R&D Spending (NSF):
- University R&D Data found under file "NSF DATA_2004 to 2011.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets)
- R&D spending found at the university level for 2014 ("Stem Grad Students.xlsx) or at state level ("Science and Engineering Grad Students by State and Year 2000-2011.csv")
- not uploaded to server or matched yet to CMSA code, because of this discrepancy.
- "Stem Grad Students.xlsx" contains categorized university by MSA, can be used for all university-based projects
Per Capita Income and Employment Data (US Census Bureau):
- "Per Capita Personal Income by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>Datasets>>Data from Yael)
- "Wages and Salaries by MSA 2000-2012.xlsx" in datasets folder (Ecosystem>>Hubs>>datasets>>Data from Yael)
- not uploaded to server or matched yet to CMSA code
Firm Births (BDS)
- in server 181, under table name "BDS"
- includes birth, death, net(birth-death) and rate(death rate) for years 1990-2013 for every msa
- includes code for CMSA but is not aggregated by CMSA
- i.e. BDS statistics are still separate for all the smaller MSAs in New York's CMSA (code=1)
Resources
- Yael Hochberg and Fehder (2015), located in dropbox
- Use this paper as a guideline on how to conduct the analysis
- US Census Bureau data on employment by MSA: http://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_14_5YR_B23027&prodType=table
- USPTO utility patents by MSA: http://www.uspto.gov/web/offices/ac/ido/oeip/taf/cls_cbsa/allcbsa_gd.htm
- MSA level trends: http://www.metrotrends.org/data.cf
The Target Dataset
We will need to process the following variables:
- SuperMSA - combine SanFran and SanJose, New York and Newark?, NC Research triangle, others?
- CSV mapping msas to cmsas is in the folder (and a table in the dbase)
Example dataset:
MSA Year SeedVCInv SeedEarlyVCInv LaterVCInv NoDeals FundsInvested DistinctInvestors .... ---------------------------------------------------------------------------------------------------------------------------- 1234 2001 1000000 20000000 30000000 4 7 7
Note that the unit of observation is MSA-Year.
Variables to be computed at the MSA level:
- HubActive (binary)
- NoHubsActive (Count)
- HubSqFt
- Other Hub Vars (build list!!!)
- SeedVCInv (Seed/Start-up)
- EarlyVCInv (Early Stage)
- LaterStageVC (Later)
- OtherStageVC (Buyout/Acq, Other)
- NoDeals (done by local VCs?)
- NoDealsNear
- NoDealsFar
- NoPortCosFunded
- FundsInv (in an MSA)
- FundsInvFromNear (within MSA?)
- FundsInvFromFar (outside MSA?)
- DistinctInvestors (?)
- DistinctInvestorsNear (within MSA?)
- DistinctInvestorsFar (outside MSA?)
- PatentCount
- NoSTEMGrads
- FirmBirths (BDS data)
- UniRandDSpend
- PerCapitaIncome
- Employment
We need to:
- Check funds invested means dollars invested
- Categorize near and far! Is it within MSA vs. not, within adjacent MSAs, etc.?
There may be a second dataset that has Hub-Industry-Year (where industry is semiconductor/non-semiconductor?).
Final Primary Data Set
- Deal is a round syndicate (near/far deal is one investor that is near/far).
Table name: finaldataset
cmsa year totalamountinv--total amount invested nearamountinv--amount invested from local funds faramountinv-- amount invested from funds outside CMSA earlyinv--amount invested in early stage companies laterinv--amount invested in later stage companies startupseedinv--amount invested in seed or startup stage companies otherstageinv--amount invested in Acquisition/Buy-outs/Other stage companies investingfund--distinct funds that are investing in that CMSA-year investingfundnear--distinct funds from that CMSA that invested in that CMSA-year investingfundfar--distinct funds from outside that CMSA that invested in that CMSA-year deals--number of deals neardeals--number of deals inside a CMSA fardeals--number of deals from outside a CMSA --some of these deals might count in both categories, because of syndicate members being both inside and outside the CMSA earlystagedeals--deals with earlystage companies laterstagedeals--deals with later stage companies startupseeddeals--deals with startup/seed companies otherstagedeals--deals with companies in other stages newportcosfunded--number of portfolio companies to receive their first investment in that year
Data by zip code
- Population data, 2000-2016 - US Census Bureau (E:\McNair\Hubs\summer 2017)
https://www2.census.gov/programs-surveys/popest/datasets/
- Income data, 1998-2014 - The Internal Revenue Service (E:\McNair\Hubs\summer 2017)
https://www.irs.gov/uac/about-irs
- DCI index, to assess the economic well-being of communities
http://eig.org/dci/interactive-maps/u-s-zip-codes
- R&D Expenses, 1980-2016 - Wharton Research Data Services (E:\McNair\Hubs\summer 2017)
- Zipcode look-up table obtained from https://www.unitedstateszipcodes.org/zip-code-database/. It's available in (E:\McNair\Hubs\summer 2017).
Data by MSA
We have principle cities of MSAs from the census: https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html
We might be able to go City -> FIPS place code -> MSA?
Cities and their FIPS codes (which don't perfectly correspond) are available from https://www.census.gov/geo/reference/codes/place.html
The Census claims to provide city to MSA here: https://www.census.gov/geo/maps-data/data/ua_rel_download.html However, there is only CBSA!
This might do it: https://www2.census.gov/geo/pdfs/maps-data/data/rel/explanation_ua_cbsa_rel_10.pdf We can maybe track city to principal city to MSA
COMPUSTAT Data
Data is in:
E:\McNair\Projects\Hubs\Summer 2017 Z:\Hubs\2017
Database is cities
SQL script is: COMPUSTAT.sql
The source file is RandDExpenditures.txt. It contains:
- Date from 1980-2017 (July). All COMPUSTAT.
- 427799 records
- Fields include:
- R&D Expenditure
- Address (inc. city, zip, state)
Output file is COMPUSTATSummary.txt. It contains:
- Variables: City, year, No.public firms, sum R&D, sum Sales, sum total assets
- 1979-2016
- 4440 cities
NSF Data
Data is in:
E:\McNair\Projects\Hubs\Summer 2017 Z:\Hubs\2017
Database is cities
SQL script is: nsf_2017.sql
The source files are: nsf2017.txt, copied from table nsf, and nsf_institution copied from table nsf_grants_institution from the biotech db.
They contain:
- Award ID
- Award Institution
- Award Effective date
- Institution city
- Award Value
- Organization state code
From 1900 - 2017
Output file is nsfSummary.txt. It contains:
- Variables: City, State code year, nsf_nogrants, nsf_valuegrant
- 1900-2017
NIH Data
Data is in:
Z:\Hubs E:\McNair\Projects\Hubs\Summer 2017
Database is cities SQL script is: nih2017.sql The source files are:
- nih_1986_2001.csv
- nih_2002_2012.txt
- nih_2013_2015
located in E:\McNair\Projects\Federal Grant Data\NIH
The script that cleans NIH data and generates the summary table is titled nihSummary. It is located here:
E:\McNair\Projects\Hubs\Summer 2017
This table includes
- year
- city
- state
- country
- nogrants (number of grants)
- valuegrant
- city_state (the city-state ID that we'll merge on)
- Date from 1986-2015
Clinical Trials Data
Data is in:
Z:\Hubs E:\McNair\Projects\Hubs\Summer 2017
Database is cities SQL script is: ctrials.sql The source file is:
- medclinical.txt
located in Z:\Hubs\2017
- Date from 1999-2017
Joined clinical trials data
The file which contains the number of trials in each city and year is located in:
Z:\Hubs\2017
The name of the file is
ctrialsSummary.txt
It contains:
- city
- year
- city_state_year
- noctrials - number of trials
The ctrials is joined with vc_city_state_year. The joined SQL script is: new_ctrials.sql
It contains:
- city
- state
- city_state_id
- city_state_year
- year
- noctrials
- seedamtm
- earlyamtm
- lateramtm
- selamtm
- numseeds
- numearly
- numlater
- numsel
Population Data
Data is in:
Z:\Hubs E:\McNair\Projects\Hubs\Summer 2017
Database is cities
SQL script is: population.sql The source files are:
- pop2000_2009.xlsx
- pop2010_2016.xlsx
They contain:
- State
- City name
- Year
- Population Estimates
Date from 2000-2016
Joined population data
Data is in:
Z:\Hubs\clean data
The file names are
1_population.txt - contains data on population estimates from 2000-2009 2_population.txt - contains data on population estimates from 2010-2016
Database is cities
SQL script is: new_population.sql
The population table is joined on VC table called 'vc_city_state_year'. The table is called new_merged_population.
They contain:
- City
- State
- city_state_id to uniquely identify each city
- city_state_year to uniquely identify each city in each year
- Population estimates
- Year
- Code from the state code and Fips code
- State full name
Income Data
Raw data was obtained from Census data, American Communities Survey.
Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\MSA Income_raw.zip
Date from 2005-2015
The master list with MSAs and principal cities is titled list2.xls. It is located at:
Z:\Hubs\2017
This master list includes:
- MSA code
- MSA name
- Principal City
- State
- Place code (city code)
- State Code
This master list was edited to associate each principal city with a unique state. E.g. if New York is the principal city located in New York-New Jersey MSA, it was associated with state NY-NJ. So list was edited to put New York with NY.
Cleaned Income data files are in
Z:\Hubs\2017\merging_on_ID
They contain:
- MSA code
- MSA
- Year
- Total Household Income
The MSA-City-State look up file is titled msa_city_state_wcode.txt. It is located in
Z:\Hubs\2017\merging_on_ID
The SQL file that merges income data from ACS (by MSA - Year) with the MSA-City file is titled income.sql. It is located here:
Z:\Hubs\2017
The final income table is in db cities titled merged_income.
It includes:
- MSA
- City
- State
- Year
- Total Household Income
The table includes 8780 observations
Joined income data
Data is in:
Z:\Hubs\clean data
The file names are:
INC_05.txt - INC_15.txt
Database is cities SQL script is: merged_income.sql
They contain:
- City
- State
- city_state_id to uniquely identify each city
- Income
- Year
- Code from the state code and Fips code
Employment Data
Data on employment was obtained from American Communities Survey, US Census Bureau.
Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\Employment Data by MSA
Cleaned files are in
Z:\Hubs\2017\merging_on_ID
They contain:
- MSA code
- MSA
- Year
- Employment rate of individuals 16 years or older
- Unemployment rate of individuals 16 years or older
Date from 2005-2015
The SQL file that merges employment data from ACS (by MSA - Year) with the MSA-City file is titled Employment.sql. The file is located in:
Z:\Hubs\2017
The final table is in db cities titled merged_employment.
It includes:
- MSA
- City
- Year
- Employment rate
- Unemployment rate
Joined employment data
Data is in:
Z:\Hubs\clean data
The file names are:
EMP_05.txt - EMP_15.txt
Database is cities SQL script is: new_employment.sql The final table is in db cities titled new_merged_employment.
They contain:
- City
- State
- Code from the state code and Fips code
- city_state_id to uniquely identify each city
- city_state_year to uniquely identify each city in each year
- Employment rates of individuals of 16 years or older
- Unemployment rates of individuals of 16 years or older
- Year
Schooling Data
Data on schooling was obtained from American Communities Survey, US Census Bureau.
Raw Data is in:
E:\McNair\Projects\Hubs\Summer 2017\School Enrollment Data by MSA
Cleaned files are in
Z:\Hubs\2017\merging_on_ID
They contain:
- MSA code
- MSA
- Year
- Total number of population 3 years and over enrolled in school
- Percent of population 3 years and over enrolled in public school
- Percent of population 3 years and over enrolled in private school
Date from 2005-2015
The SQL file that merges schooling data from ACS (by MSA - Year) with the MSA-City file is titled schooling.sql. The file is located in:
Z:\Hubs\2017
The final table is in db cities titled merged_schooling.
It includes:
- MSA
- City
- Year
- Total
- Percent_public_schooling
- Percent_private_schooling
Joined schooling data
Data is in:
Z:\Hubs\clean data
The file names are:
SCH_05.txt - SCH_15.txt
Database is cities SQL script is: new_merged_schooling.sql The final table is in db cities titled new_merged_schooling.
They contain:
- City
- State
- city_state_id to uniquely identify each city
- city_state_year to uniquely identify each city in each year
- Total number of school enrollment
- Percentage enrolled in public schools
- Percentage enrolled in private schools
- Year
- Code from the state code and Fips code
VC Data
Raw Data is in:
Z:\VentureCapitalData\SDCVCData The file name is roundcitystateyear.txt
It contains:
- city
- state
- year
- seedamtm - seed, amount in millions
- earlyamtm - early, amount in millions
- lateramtm - late, amount in millions
- selamtm - seed early late, amount in millions
- numseeds - number of seeds
- numearly
- numlater
- numsel
Date from 1953-2017
The SQL file that merges VC data with the MSA-City file is titled vc.sql. The file is located in:
Z:\Hubs\2017
The final table is in db cities titled vc_city_state_year.
It includes:
- city
- state
- city_state_id
- city_state_year
- seedamtm
- earlyamtm
- lateramtm
- selamtm
- numseeds
- numearly
- numlater
- numsel
- year