Shapefiles from the 2020 U.S. Census TIGER/Line data series provide the boundaries and names of the MSAs, and a Python script (Geocode.py), in conjunction with the Google Maps API, provides longitudes and latitudes for startups. We restrict the accuracy of Google's results to four decimal places, which is approximately 10 m of precision.
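The coordinate truncation can be sketched as follows. This is an illustrative snippet, not the code from Geocode.py; the function name is hypothetical.

```python
# Hypothetical sketch of restricting geocoded coordinates to four decimal
# places. One degree of latitude is ~111 km, so 0.0001 degrees is ~11 m,
# i.e. roughly 10 m of precision.
def truncate_coord(value, places=4):
    """Round a longitude or latitude to a fixed number of decimal places."""
    return round(value, places)

lon = truncate_coord(-77.0365012)
lat = truncate_coord(38.8976805)
print(lon, lat)  # -77.0365 38.8977
```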
[[File:AgglomerationDataSourcesAndSinks_v2.png|right|thumb|512px|Data Sources and Sinks]] All of our data assembly, and much of our data processing and analysis, is done in a PostgreSQL/PostGIS database.
However, we rely on Python scripts to retrieve addresses from Google Maps, to compute the Hierarchical Cluster Analysis (HCA) itself, and to estimate a cubic that determines the HCA-regression method's agglomeration count for an MSA. We also use two Stata scripts: one to compute the HCA-regressions, and another to estimate the paper's summary statistics and regression specifications. Finally, we use QGIS to construct the map images based on queries to our database. These images use a Google Maps base layer.
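The cubic-estimation step can be sketched as below. This is a minimal illustration using NumPy, with made-up (layer, cluster count) data; the actual script and its inputs may differ.

```python
import numpy as np

# Illustrative only: fit a cubic to a series of cluster counts by HCA
# layer, analogous to the cubic estimated when determining the
# HCA-regression agglomeration count. The data points are fabricated.
layers = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
counts = np.array([1, 3, 8, 15, 22, 27, 29], dtype=float)

coeffs = np.polyfit(layers, counts, deg=3)  # highest-degree term first
cubic = np.poly1d(coeffs)

# The fitted polynomial can then be evaluated or differentiated, e.g. to
# locate an elbow in the cluster-count series.
print(cubic(4.0))
```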
== Data Processing Steps ==
[[File:AgglomerationProcess_v2.png|center|thumb|1092px|Data Processing Steps]] The script [[:File:Agglomeration_CBSA.sql.pdf|Agglomeration_CBSA.sql]] provides the processing steps within the PostgreSQL database. We first load the startup data, add in the longitudes and latitudes, and combine them with the CBSA boundaries. Startups in our data are keyed by a triple (coname, statecode, datefirstinv), as two different companies can have the same name in different states, or within the same state at two different times.
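The role of the composite key can be illustrated with a small sketch (field values are invented; the actual keying is done in the database):

```python
# The triple (coname, statecode, datefirstinv) identifies a startup:
# two firms may share a name across states, or within a state at two
# different times, so all three fields are needed for uniqueness.
startups = {}

def add_startup(coname, statecode, datefirstinv, lon, lat):
    key = (coname, statecode, datefirstinv)
    startups[key] = {"lon": lon, "lat": lat}

add_startup("Acme", "CA", "2005-01-15", -122.4194, 37.7749)
add_startup("Acme", "TX", "2005-01-15", -97.7431, 30.2672)   # same name, different state
add_startup("Acme", "CA", "2012-06-01", -118.2437, 34.0522)  # same name and state, later date

print(len(startups))  # 3 distinct startups
```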
A python script, HCA.py, consumes data on each startup and its location for each MSA-year. It performs the HCA and returns a file with layer and cluster numbers for each startup and MSA-year.
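The clustering step can be sketched with SciPy's hierarchical clustering routines. This is a minimal sketch on fabricated coordinates, assuming average linkage and a distance cut-off; HCA.py's actual linkage method and parameters may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up (x, y) locations in meters: a tight group of three startups
# and a second group far away.
coords = np.array([
    [0.0, 0.0], [10.0, 5.0], [5.0, 8.0],
    [1000.0, 1000.0], [1005.0, 995.0],
])

Z = linkage(coords, method="average")                 # build the dendrogram
labels = fcluster(Z, t=100.0, criterion="distance")   # cut at 100 m

# The first three points fall in one cluster, the last two in another.
print(labels)
```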