Difference between revisions of "Delineating Spatial Agglomerations"
Line 8: | Line 8: | ||
Shapefiles from the 2020 U.S. Census TIGER/Line data series provide the boundaries and names of the MSAs, and a python script (Geocode.py) in conjunction with a Google Maps API, provides longitudes and latitudes for startups. We restrict the accuracy of Google’s results to four decimal places, which is approximately 10m of precision. | Shapefiles from the 2020 U.S. Census TIGER/Line data series provide the boundaries and names of the MSAs, and a python script (Geocode.py) in conjunction with a Google Maps API, provides longitudes and latitudes for startups. We restrict the accuracy of Google’s results to four decimal places, which is approximately 10m of precision. | ||
− | [[File:AgglomerationDataSourcesAndSinks_v2.png|right|thumb| | + | [[File:AgglomerationDataSourcesAndSinks_v2.png|right|thumb|512px|Data Sources and Sinks]] All of our data assembly, and much of our data processing and analysis, is done in a PostgreSQL PostGIS database. |
However, we rely on python scripts to retrieve addresses from Google Maps, as well as compute the Hierarchical Cluster Analysis (HCA) itself, and estimate a cubic to determine the HCA-regression method agglomeration count for an MSA. We also use two Stata scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images based on queries to our database. These images use a Google Maps base layer. | However, we rely on python scripts to retrieve addresses from Google Maps, as well as compute the Hierarchical Cluster Analysis (HCA) itself, and estimate a cubic to determine the HCA-regression method agglomeration count for an MSA. We also use two Stata scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images based on queries to our database. These images use a Google Maps base layer. | ||
Line 14: | Line 14: | ||
== Data Processing Steps == | == Data Processing Steps == | ||
− | [[File:AgglomerationProcess_v2.png| | + | [[File:AgglomerationProcess_v2.png|center|thumb|1092px|Data Processing Steps]] |
+ | |||
+ | The script [[:File:Agglomeration_CBSA.sql.pdf|Agglomeration_CBSA.sql]] provides the processing steps within the PostgreSQL database. We first load the startup data, add in the longitudes and latitudes, and combine them with the CBSA boundaries. Startups in our data our keyed by a triple (coname, statecode, datefirstinv) as two different companies can have the same names in different states, or within the same state at two different times. | ||
A python script, HCA.py, consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year. | A python script, HCA.py, consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year. |
Revision as of 17:08, 20 July 2022
This page provides code and data set development material for:
Egan, Edward J. and James A. Brander (2022), "New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups.", Journal of Economic Geography, Manuscript: JOEG-2020-449.R2, forthcoming.
Contents
Overview
The dataset construction begins with startup data from Thomson Reuters’ VentureXpert. This data is retrieved using SDC platinum and comprises information on startup investment amounts and dates, stage descriptions, industries, and addresses. This data is combined with data on mergers and acquisitions from the Securities Data Commission M&A and Global New Issues databases, also available through SDC Platinum, to determine startup exit events.
Shapefiles from the 2020 U.S. Census TIGER/Line data series provide the boundaries and names of the MSAs, and a python script (Geocode.py) in conjunction with a Google Maps API, provides longitudes and latitudes for startups. We restrict the accuracy of Google’s results to four decimal places, which is approximately 10m of precision.
All of our data assembly, and much of our data processing and analysis, is done in a PostgreSQL PostGIS database.
However, we rely on python scripts to retrieve addresses from Google Maps, as well as compute the Hierarchical Cluster Analysis (HCA) itself, and estimate a cubic to determine the HCA-regression method agglomeration count for an MSA. We also use two Stata scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images based on queries to our database. These images use a Google Maps base layer.
Data Processing Steps
The script Agglomeration_CBSA.sql provides the processing steps within the PostgreSQL database. We first load the startup data, add in the longitudes and latitudes, and combine them with the CBSA boundaries. Startups in our data our keyed by a triple (coname, statecode, datefirstinv) as two different companies can have the same names in different states, or within the same state at two different times.
A python script, HCA.py, consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year.
The HCA.py script uses several functions from another python module, schedule.py, which encodes agglomeration schedules produced by the sklearn.cluster.AgglomerativeClustering package. The standard encoding records the agglomeration schedule as complete paths, indicating which clusters are merged together at each step. The layer-cluster encoding provided in schedule.py instead efficiently records the agglomeration schedule as a series of layers. It also relies on only a single read of the source data, so it is fast.
The code snippets provided in “hierarchy.py” modify the standard library provided in the scipy.cluster package of the same name. This code allows users to pre-calculate distances between locations (latitude-longitude pairs) using highly-accurate PostGIS spatial functions in PostgreSQL. Furthermore, the code caches the results so, provided the distances fit into (high-speed) memory, it also allows users to increase the maximum feasible scale by around an order of magnitude. The code in “hierarchy.py” contains two snippets. The first snippet should be inserted at line 188 in the standard library. Then line 732 of the standard library should be commented out (i.e., #y = distance.pdist(y, metric)), and the second snippet should be inserted at line 734.
A full copy of the amended hierarchy.py is available from https://www.edegan.com/wiki/Delineating_Spatial_Agglomerations.
The results of the HCA.py script are loaded back to the database, which produces a dataset for analysis in Stata. The script AgglomerationMaxR2.do loads this dataset and performs the HCA-Regressions. The results are passed to a python script, Cubic.py, which selects the appropriate number of agglomerations for each MSA. The results from both AgglomerationMaxR2.do and Cubic.py are then loaded back into the database, which produces a final dataset and set of tables providing data for the maps. The analysis on the final dataset uses the Stata script AgglomerationAnalysis.do and the maps are made using custom queries in QGIS.
Code
Agglomeration_CBSA.sql
File:AgglomerationAnalysis.do
Agglomeration_CBSA.sql
Cubic.py
Geocode.py
HCA.py
Hierarchy.py
Hierarchy-InsertSnippets.py
Schedule.py