Changes

Jump to navigation Jump to search
no edit summary
This paper is published as:
 
[[Delineating Spatial Agglomerations|Egan, Edward J. and James A. Brander (2022), "New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups.", Journal of Economic Geography, Manuscript: JOEG-2020-449.R2, forthcoming.]]
 
{{AcademicPaper
|Has title=Urban Start-up Agglomeration and Venture Capital Investment
|Has author=Ed Egan,Jim Brander|Has RAs=Peter Jalbert, Jake Silberman, Christy Warden, Jeemin Sim,|Has paper status=In developmentPublished
}}
=New WorkSubmissionA revised version of the paper, now co-authored with [[Jim Brander]] and based on the version 3 rebuild, was submitted to the Journal of Economic Geography. This is solely a methods paper, and is titled: '''A New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups'''. The policy application would need to be written up as a separate paper. ==Acceptance== On July 5th 2022, the paper was accepted to the Journal of Economic Geography: * Manuscript ID JOEG-2020-449.R2* Title: A New Method for Identifying and Delineating Spatial Agglomerations with Application to Clusters of Venture-Backed Startups* Author(s): Edward J. Egan and James A. Brander.* Editor: Bill Kerr, HBS: wkerr@hbs.edu* Abstract: This paper advances a new approach using hierarchical cluster analysis (HCA) for identifying and delineating spatial agglomerations and applies it to venture-backed startups. HCA identifies nested clusters at varying aggregation levels. We describe two methods for selecting a particular aggregation level and the associated agglomerations. The “elbow method” relies entirely on geographic information. Our preferred method, the “regression method”, also uses venture capital investment data and identifies finer agglomerations, often the size of a small neighborhood. We use heat maps to illustrate how agglomerations evolve and indicate how our methods can aid in evaluating agglomeration support policies.* Permanent link for code/data: https://www.edegan.com/wiki/Delineating_Spatial_Agglomerations The paper is now in production. I will build a wiki page called [[Delineating_Spatial_Agglomerations]] that structures the documentation of the build process and shares code and some data or artifacts. Currently, that page redirects here. == R&R == Files:*Pdf: [[File:Egan Brander (2020) - A New Method for Identifying and Delineating Spatial Agglomerations (Submitted to JEG).pdf]]* In E:\projects\agglomeration** Last document was Agglomeration Dec 15.docx** Build is Version 3-6-2-2. ** SQL file is: AgglomerationVcdb4.sql After some inquiries, we heard from Bill Kerr, the associate editor, that the paper had new reviews on Aug 11th. On Aug 23rd, we recieved an email titled "Journal of Economic Geography - Decision on Manuscript ID JOEG-2020-449" giving us an R&R. Overall, the R&R is very positive. Bill's comments:* Referees aligned on central issue of Census places* Too short: Wants application and suggests ("not contractual requirements"):** Diversity within and between in terms of types of VC investment (e.g., Biotech vs. ICT in Waltham)** Patent citiations made between VC backed firms Reviewer 1's comments (excluding minor things):* Explain projection (should have said it was WGS1984)* Starting units: Suggests MSA level. Suppose cities that are close... can we find cases?* Identify clusters that have grown over time* Maybe try a cluster-level analysis* Is ruling out the first second-difference too limiting? Can a city be a cluster? (Vegas, baby?, Or starting from CMSA, probably yes in some sense.)* Discuss cluster boundaries (they aren't hard and fast: "think of these clusters as the kernels or seeds of VC-backedstartup hotspots") Reviewer 2's comments (excluding minor things):* Starting Units. Suggests MSA. * Explain R2 method better. He didn't say try cluster-level but that might be helpful to him too.* Change language (back) to microgeographies! (or startup neighborhoods). * Tighter connection to lit. He gives papers to start.* Discuss overlap of clusters (a la patent clustering). Check findings in Kerr and Kominers!!!* Discuss counterfactuals/cause-and-effect/application etc. Show/discuss that we didn't just find office parks. <pdf>File:JOEG1RndReviews.pdf</pdf>  ===Notes for further improvement=== We might want to add some things in/back in. These include technical notes:*To do the HCA we used the AgglomerativeClustering method from the sklearn.cluster library (version 0.20.1) in python 3.7.1, with Ward linkage and connectivity set to none. This method is documented here: https://scikit-learn.org/stable/modules/clustering.html. I checked some of the early results against an implementation of Ward's method using the agnes function, available through the cluster package, in R. https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/agnes*The data was assembled and processed in a Postgresql (version 10) database using PostGIS (version 2.4). We used World Geodetic System revision 84, known as WGS1984 (see https://en.wikipedia.org/wiki/World_Geodetic_System), as a coordinate system with an ellipsoidal earth, to calculate distances and areas (see https://postgis.net/docs/manual-2.4/using_postgis_dbmanagement.html). Shapefiles for Census Places were retrieved from the U.S. Census TIGER (Topologically Integrated Geographic Encoding and Referencing) database (see https://www.census.gov/programs-surveys/geography.html).*The statistical analysis was done in STATA/MP version 15.*All maps were made using QGIS v3.8.3. The base map is from Google Maps. City areas are highlighted using U.S. Census TIGER/Line Shapefiles.  The methodology has other applications:*Food deserts - one could study the agglomerations of restaurants and other food providers in urban environments.*Airports, cement factories, banana plantations, police/fire stations, hospitals/drug stores, etc.*We could think about commercial applications. Perhaps locating plants/facilities that are/aren't in clusters with a view to buying or selling them? =SSRN version of the paper (uses v2 build)= There are two 'final' papers based on the version 2 build. The one with Houston narrative as the motivation is available from SSRN: https://papers.ssrn.com/abstract=3537162 The Management Science submission version has a more conventional front end and is as follows: <pdf>File:AgglomerationV8-Reduced.pdf</pdf> =Version 3 Rebuild= ===Another round of refinements=== #The elbow method has issues in its current form, so we are going to try using the elbow in the curvature (degree of concavity) instead. #We might also try using elasticities...#Rerun the distance calculations -- avghulldisthm and avgdisthm are only computed for layers that we select with some method (like max r2). However, this table hadn't been updated for the elbow method, perhaps as well as some other methods, so some distances would have been missing (and replaced with zeros in the STATA script).#Create and run the new max R2 layer. In this variant, we'll use "the first layer a cluster number is reached as the representative layer for that cluster number"  I built two new curvature based elbow methods and so variables: curvaturelayer and curvaturelayerrestricted. They use the method described below and are identical except that curvaturelayerrestricted can't select layer 2 (both can't select the first and last layers as they use central second differences). For the example cities we have:{| class="wikitable" style="vertical-align:bottom;"|-! place! statecode! year! numstartups! elbowlayer! finallayer! curvaturelayer|-| Menlo Park| CA| 2,006| 68| 4| 51| 4|-| San Diego| CA| 2,006| 220| 3| 184| 181|-| Campbell| CA| 2,006| 38| 3| 26| 8|-| Charlotte| NC| 2,006| 30| 3| 30| 28|-| Waltham| MA| 2,006| 106| 3| 58| 55|} For these city-years, the curvaturelayer is the same as the curvaturelayerrestricted. As you can see, it is all over the place! I really don't think we can say that this method 'works' for any real value of 'works'. There's a sheet (Curvature Raw Data Examples) in ResultsV3-6.xlsx, and there's graphs for the selected cities on sheet "Elbow Curvature Selected Cities". ====New MaxR2 Layer==== I noticed a copy and paste error in the do file and I re-ran the existing max R2 method too, just to be sure. My process for the new method uses the code for the old chosenhullflayer variable. Key variables are:*firstlayer the layer at which numclusters first achieves that value*regfirst an indicator to select the right set of layers to run the max r2 estimation on*chosenhullflayer - the variable that records the layer number selected using firstlayer and the max r2 method*besthullflayer - the equivalent to besthulllayer but with the first layers instead of the lowest-highest ones*targetnumclustersf, besthullflayerisadded, maxr2flayerflag, etc.*'''regmaxr2f''' and '''regbestf''' - these are the dataset constraints to use. Everything is pushed through the database and back to generate them. The results for our sample cities are as follows:{| class="wikitable" style="vertical-align:bottom;"|-! place! statecode! year! finallayer! chosenhulllayer! style="font-weight:bold;" | chosenhullflayer! elbowlayer|-| Campbell| CA| 2,006| 26| 15| style="font-weight:bold;" | 3| style="font-weight:bold;" | 3|-| Charlotte| NC| 2,006| 30| 14| style="font-weight:bold;" | 3| style="font-weight:bold;" | 3|-| Menlo Park| CA| 2,006| 51| 33| style="font-weight:bold;" | 21| style="font-weight:bold;" | 4|-| San Diego| CA| 2,006| 184| 141| style="font-weight:bold;" | 12| style="font-weight:bold;" | 3|-| Waltham| MA| 2,006| 58| 31| style="font-weight:bold;" | 3| style="font-weight:bold;" | 3|} I build the max R2 graphs in the sheet '''New MaxR2''' in ResultsV3-6.xlsx ====Jim's notes on the curvature==== Suppose we have a function f. Then what I have been calling the curvature is -f’’/f’. If f is a utility function this is the coefficient of absolute risk aversion and it has quite often been called curvature in that context. However, in differential geometry curvature is described differently, although it is quite similar. Mas-Collel and others have suggested calling -f’’/f’ the “degree of concavity” instead. I came across this definition on the internet: :“The degree of concavity is measured by the proportionate rate of decrease of the slope, that is, the rate at which the slope decreases divided by the slope itself.” The general rationale for using this measure is that it is invariant to scale, whereas the straight second derivative, f’’, is not invariant. The same applies to the second difference of course." So our measure is the second difference divided by the first difference. However, it is not clear whether we should divide by the initial first difference or the second first difference or the average. I initially assumed that we should use the initial first difference. I now think that is wrong as it can produce anomalies. I think we should use the second (or “current”) first difference as the base.  Here is some data I sent before: {| class="wikitable" style="text-align:right;"|- style="background-color:#FFF; color:#222;"| Layer| SSR| D1| D2| Concavity| Concavity|- style="text-align:left; background-color:#FFF; color:#222;"| style="text-align:right;" | 1| style="text-align:right;" | 0| style="vertical-align:bottom;" | | style="vertical-align:bottom;" | | style="vertical-align:bottom;" | | style="vertical-align:bottom;" | |- style="background-color:#FFF; color:#222;"| 2| 40| 40| -5| 0.13| 0.14|- style="background-color:#FFF; color:#222;"| 3| 75| 35| style="background-color:#FF0;" | -20| 0.57| 1.33|- style="background-color:#FFF; color:#222;"| 4| 90| 15| -12| style="background-color:#FF0;" | 0.80| style="background-color:#FF0;" | 4|- style="background-color:#FFF; color:#222;"| 5| 93| 3| -1| 0.33| 0.5|- style="background-color:#FFF; color:#222;"| 6| 95| 2| -1| 0.50| 1|-| style="background-color:#FFF; color:#222;" | 7| style="background-color:#FFF; color:#222;" | 96| style="background-color:#FFF; color:#222;" | 1| style="background-color:#FFF; color:#222;" | -1| style="background-color:#FFF; color:#222;" | 1.00| style="text-align:left;" | |} The column at the far right uses the second first difference as the base, which I now think is correct. The column second from the right uses the first first difference at the base. Just to be clear, for layer 2 the first difference is 40 – 0 = 40. For layer 3 the first difference is 75 – 40 = 35. Therefore, for layer 2, the second difference is 35 – 40 = -5. I think this is what you would call the “middle second difference”. It tells how sharply the slope falls after the current layer, which is what we want. To correct for scaling, we need to divide by a first difference. In the first concavity column, for layer 2 I use 5/40 = 0.125. For the last column for layer 2 I use 5/35 = 0.143. Both approaches have a local max at layer 4, which is what we want. However, the second column from the right has a global max at the last layer, which is certainly not what we want. But is can happen at the end where the increments are very small. So it seems pretty clear that we want to use the second first difference at the base. More precisely, to get the concavity for layer 3 we want to divide the middle second difference by the forward first difference. (It would probably also be okay to use the middle second difference divided by the middle first difference, but I have not checked that out). =====Formalizing Jim's Notes===== Jim calculates the following (examples using layer 2):*The '''first-order backward difference''' in column '''D1''': <math>f(x)-f(x-1) = 40-0=40</math>*The '''second-order central difference''' in column '''D2''': <math>f(x+1)-2f(x)+f(x-1) = 75-2x40+0 = -5</math>*'''Concavity''' (in col5) as -D2_l/D1_l, or -1 times the backward first over the central second: <math> --5/40 = 0.125 \approx 0.13</math>*'''Concavity''' (in col6) as -D2_l/D1_{l+1}, or -1 times the central first over the central second: <math> --5/35 = 0.43 \approx 0.14</math> The concavity measure in col6 is therefore the -1 times central first difference divided by the central second difference, but the central first isn't computable for a step of 1 (and gives a weird answer anyway, as it straddles the observation in question). The central second difference isn't defined for either the first or last layer, and the backward first difference isn't defined for the first layer. It seems likely that we don't want the last layer and might get it because D1 is small and drives the ratio.  We could instead use the forward first difference - this isn't available for the last observation (for which we can't compute a second central anyway) but is available for the first observation - and increment the answer, much as Jim proposes decrementing it when using the backward layer. But seeing as we can't use the first observation we've gained nothing anyway! So we'll do Jim's method verbatim, and declare the result null if it comes out as either the first or last layer. ====Curvature==== {{Colored box|title=Specification|content=For layer <math>l</math>, I compute the curvature as -1 times the backward first difference in the variance explained ratio from layer <math>l+1</math> divided by the central second difference in the variance explained ratio from <math>l</math>. The first and last layers are forbidden results.}} The curvature results seem somewhat better than the elbow results but are still far from ideal. Here are some things I look for and/or don't like in a layer selection method:*Interior solutions are good, collapsing to the bounds, especially the lower bound is bad*Stable interior solutions are better - when the results approximate a quadratic so that margins generally decrease and then increase around a maximum, the interior results are stable and that's very desirable*Consistent solutions are good within cities - it's nice when adjacent years in the same city have more or less the same layer selected*Consistent solutions across cities are also good - When the method picks roughly similar layer indicies (i.e., % unclustered) across cities, particularly conceptually similar cities, that's a plus*From other analysis, I know that the equilibrium of agglomeration forces occurs when agglomerations have fairly small average hull sizes, perhaps on the order of 10hm2. ===Version 3.5 build notes=== In the process of building version 3.5, I noticed a discrepancy between tothulldensity and avghulldensity. This turned out to be correct. Both are measured in startups/hm2. Tothulldensity of the sum of the number of startups in hulls divided by the total hull area, whereas the avghulldensity is the average of the hull densities (computed as the number of startups in the hull divided by the hull area).  The revised script and dataset is v3-5. ResultsV3-5.xlsx has all of the old redundant results removed and has new sheets for Descriptives (copied over with renamed column names from Descriptives.xlx, which is generated by the .do file), as well as for the new scatterplot. Its Bar and Whisker is also stripped down to the bare essentials.  ===Heuristic Layer=== [[File:AgglomerationInflectionScatterPlotAllDataCircles.png|500px|right]] I had previously calculated the heuristic layer by calculating the mean fracinhull (i.e., % of startups in economic clusters) for each percentage of the layer index (i.e., for 101 observations) and then fitting a cubic to it. I did this because excel can't handle fitting a cubic to the full data (i.e., all 148,556 city-year-layers). However, it is incorrect because of orthogonality issues in calculating mean square distances (I'm also unsure that the mean would be the best measure of central tendency). So I redid the plot using all the data, and calculated the cubic in STATA instead. See: '''inflection.do''' and '''inflection.log'''. The old result is in [https://www.edegan.com/wiki/Urban_Start-up_Agglomeration_and_Venture_Capital_Investment#Fixing_an_issue Fixing an issue] below, and is x≈0.483879. The corrected result is x≈0.487717 (note that R2 has dropped to 92.43%): :2.737323 x^3 - 4.005114 x^2 + 0.3262405 x + 0.9575088≈0.481497 at x≈0.487717 [https://www.wolframalpha.com/input/?i=inflection+points+2.737323+x3+-+4.005114x2++%2B+0.3262405x+%2B+0.9575088] I also calculated an '''inflectionlayer''' (as opposed to the heurflhlayer, where flh stands for fraction of locations in hulls, described above). This inflectionlayer is '''the first time''' that the second central difference in the '''share of startups in economic clusters''' switches sign. It is only possible to calculate this when there are at least 4 data points, as the central difference requires data from layer-1, layer and layer+1, and we need two central differences. The variable is included in dataset (and so do files, etc.) version 3-4 forwards. However, the inflectionlayer is really meaningless. The sign of the second central switches back and forward due to integer effects and I can't find a straight forward algorithm to pick the "correct" candidate from the set of results. Picking the '''first one''', which I currently pick, is completely arbitrary. There are a bunch of examples of the curves and the issue(s) in Results3-4.xlsx sheet 'Inflection'. I expect that if I put a bunch of time into this I could come up with some change thresholds to rule candidate answers in or out, but even then this isn't a good method.  Ultimately, the individual city-year inflection curves (i.e., across layers within a city-year) are just way too noisy. A variant of this noise problem is what makes the elbow method so problematic, but the noise is even worse with the inflection method. Using the heuristic result above (i.e., the one using all city-years) solves this noise problem by aggregating city-years together. One complaint made about the heuristic results is that it is near the middle (i.e., it's 48.7717%, which happens to be near 50%). Although the nature of any HCA on geographic coords implies that the result is unlikely to the close to the bounds (0 or 100%) and more likely to be near the middle (50%), it could be in an entirely different place. '''This result (i.e., the heuristic layer at 48.7717%) characterizes the agglomeration of venture-backed startup firms'''. You'd get a very different number if you studied gas stations, supermarkets, airports, or banana plantations! ====Comparing the Heuristic and R2 Layers==== {{Colored box|title=The Case for the Heuristic Method|content=The heuristic method (i.e., using the inflection in the plot from the population of city-year-layers) finds pretty much the same layer as the R2 method with almost no work, and it can be used in a within-city analysis without having to hold hull count constant.}}  . tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regmaxr2==1, stats(p50 > mean sd N min max p10 p90) columns(statistics) variable | p50 mean sd N min max p10 p90 -------------+-------------------------------------------------------------------------------- nohull | 2 3.531407 7.07922 2977 1 68 1 6 tothullcount | 8 17.4565 35.65118 2977 3 380 3 30 tothullarea | 14.76523 448.029 2063.824 2977 .0049029 34780.04 .5275311 732.4005 tothullden~y | .7640136 11.32988 63.62256 2977 .0002282 1425.338 .0115537 16.15439 growthinv18 | 33.53101 142.5 561.6696 2977 0 22282.6 1.53118 309.0208 numdeals | 3 6.71347 17.06682 2977 0 275 0 15 numstartups | 16 41.28955 89.98027 2977 6 1317 7 90 ---------------------------------------------------------------------------------------------- . tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regheur1==1, stats(p50 > mean sd N min max p10 p90) columns(statistics) variable | p50 mean sd N min max p10 p90 -------------+-------------------------------------------------------------------------------- nohull | 2 4.279958 8.433203 3797 0 119 1 9 tothullcount | 8 20.08954 42.99372 3797 0 673 3 43 tothullarea | 11.32983 49.42803 158.7375 3797 0 2569.169 1.660208 93.94627 tothullden~y | .946713 3.48483 10.93185 3797 0 212.8198 .06182 7.601018 growthinv18 | 31.8453 133.0608 508.1196 3797 0 22282.6 1.235763 292.4397 numdeals | 2 6.629181 16.46614 3797 0 275 0 15 numstartups | 15 38.74743 83.6814 3797 6 1317 7 83 ---------------------------------------------------------------------------------------------- Analyzing layers: Method Avg. Layer Index Std. Dev Layer Index Max R2 0.392473192 0.2380288695 Heuristic 0.43423652 0.0495630531  '''The Max R2 and Heuristic layers are identical in 12.6% of cases!''' Some of these cases are found in city-years with a large number of layers, for instance, there are 90 city-years that have more than 20 startups and identical heuristic and max r2 layers. The table below shows city-years with more than 50 startups and identical heuristic and max R2 layers: {| class="wikitable" |- style="font-weight:bold;"! place! statecode! year! numstartups! chosenhulllayer! heurflhlayer|-| San Francisco| CA| 2,009| 503| 175| 175|-| Los Angeles| CA| 2,012| 213| 93| 93|-| Redwood City| CA| 2,012| 151| 49| 49|-| Redwood City| CA| 2,013| 151| 49| 49|-| Seattle| WA| 2,000| 113| 48| 48|-| Houston| TX| 2,007| 92| 40| 40|-| Waltham| MA| 2,012| 73| 24| 24|-| Pittsburgh| PA| 2,008| 70| 25| 25|-| Bellevue| WA| 2,001| 64| 25| 25|-| Bellevue| WA| 2,003| 61| 23| 23|-| Pleasanton| CA| 2,004| 54| 20| 20|-| Menlo Park| CA| 2,004| 52| 22| 22|-| Durham| NC| 2,009| 50| 22| 22|} In fact, 84% of city-years (which have both heuristic and max R2 layers) have heuristic and max R2 layers that are separated by less than or equal to 5 layers, and 59% have them separated by less than or equal to 2 layers! '''More than a third (36.3%) of city-years have their heuristic and max R2 layers separated by less than or equal to 1 layer.''' ===Another list of items=== Jim asked for the following (in order of delivery schedule, not importance):#A dataset and STATA do file and to implement table 5, complete with an exploration of which regressors to include#An implementation of the 'real elbow method', then integration with (1).#A (set of) comparison(s) between the max R2 method and the elbow methods#A new heatmap or two, based on a different location. All done... see the sections below. ====Heatmaps==== I built '''unbuffered heatmaps using maximum R2 layers from 1995 to 2018''' for a set of "interesting" cities. I often built the same city at multiple scales. Only the zoomed-in maps are in the gallery below. I can now quite quickly build more cities if needed. It is worth noting the following:*Because we are using unbuffered hulls, heatmap components are angular and non-diffuse.*Agglomerations are smaller in cities with higher startup counts but are small everywhere. *Agglomerations don't come close to overlapping city boundaries. Agglomerations within Palo Alto don't overflow into Mountain View and it isn't meant meaningful to talk about Boston-Cambridge agglomerations, except as a broad set. An agglomeration is typically a few square blocks (we knew this from the mean and median hull sizes). *Some famous policy interventions appear to have no effect. There is no agglomeration, let alone a concentration of them, in Boston's North End, where hundreds of millions were plowed into a TIF (and MassChallenge). <gallery widths=300 heights=300>File:Bellevue125000MaxR2UnbufferedHeatmap.png| Bellevue, WA, 1:125kFile:PaloAlto50000MaxR2UnbufferedHeatmap.png| Palo Alto, CA, 1:50kFile:Boulder50000MaxR2UnbufferedHeatmap.png| Boulder, CO, 1:50kFile:Waltham65000MaxR2UnbufferedHeatmap.png| Waltham, MA, 1:65kFile:Boston50000MaxR2UnbufferedHeatmap.png| Boston, MA, 1:50k</gallery> I also built three buffered heatmaps of Boston as a proof of concept. I used either the average distance between the points on the edge of the hull and the centroid, or half of it, as a buffering distance. I also varied the intensity of the shading (down to 10% per layer from 20% in the 1:70000 image). Boston should have 17 agglomerations according to the maximum R2 method, so the half distance buffer might be best for picking them out. <gallery widths=300 heights=300>File:Boston70000MaxR2buffered1xHeatmap.png| Boston, MA, 1:70k, 1x buffer, 10% opacityFile:Boston50000MaxR2bufferedHalfxHeatmap.png| Boston, MA, 1:50k, 0.5x buffer, 20% opacityFile:Boston50000MaxR2buffered1xHeatmap.png| Boston, MA, 1:50k, 1x buffer, 20% opacity</gallery> ====Comparing the Methods==== Summaries of the meta-data on geometries created by each lens is probably the best method of comparison. These are in the do file: . //Compare how their lenses look: . tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regmaxr2==1, stats(p50 > mean sd N min max p10 p90) columns(statistics) variable | p50 mean sd N min max p10 p90 -------------+-------------------------------------------------------------------------------- nohull | 2 3.531407 7.07922 2977 1 68 1 6 tothullcount | 8 17.4565 35.65118 2977 3 380 3 30 tothullarea | 14.76523 448.029 2063.824 2977 .0049029 34780.04 .5275311 732.4005 tothullden~y | .7640136 11.32988 63.62256 2977 .0002282 1425.338 .0115537 16.15439 growthinv18 | 33.53101 142.5 561.6696 2977 0 22282.6 1.53118 309.0208 numdeals | 3 6.71347 17.06682 2977 0 275 0 15 numstartups | 16 41.28955 89.98027 2977 6 1317 7 90 ---------------------------------------------------------------------------------------------- . tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regheur1==1, stats(p50 > mean sd N min max p10 p90) columns(statistics) variable | p50 mean sd N min max p10 p90 -------------+-------------------------------------------------------------------------------- nohull | 2 4.305768 8.498714 3797 0 119 1 9 tothullcount | 8 20.27153 43.51017 3797 0 675 3 43 tothullarea | 11.43336 49.81455 159.0983 3797 0 2569.169 1.661926 94.14735 tothullden~y | .9422739 3.452804 10.89704 3797 0 212.8198 .06182 7.47113 growthinv18 | 31.8453 133.0608 508.1196 3797 0 22282.6 1.235763 292.4397 numdeals | 2 6.629181 16.46614 3797 0 275 0 15 numstartups | 15 38.74743 83.6814 3797 6 1317 7 83 ---------------------------------------------------------------------------------------------- . tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regelbow==1, stats(p50 > mean sd N min max p10 p90) columns(statistics) variable | p50 mean sd N min max p10 p90 -------------+-------------------------------------------------------------------------------- nohull | 2 3.152638 3.973168 3374 0 48 0 7 tothullcount | 12 37.32721 87.6828 3374 0 1303 0 91 tothullarea | 55.78572 898.919 3849.938 3374 0 74067.25 0 1589.324 tothullden~y | .169715 31.79024 1726.042 3374 0 100257.3 0 1.841935 growthinv18 | 36.56511 146.5069 537.3532 3374 0 22282.6 1.816288 326.6357 numdeals | 3 7.232662 17.32508 3374 0 275 0 16 numstartups | 17 42.33225 88.08184 3374 6 1317 7 98 ---------------------------------------------------------------------------------------------- Another way to compare the methods is to look at the layers they select. This is visible in a box plot as well as summary statistics. The following is from ResultsV3.xlsx '''Separate Samples (All Available City-Years)'''{| class="wikitable" style="vertical-align:bottom;"|- style="font-weight:bold;"! Method! N! Avg Layer Index! Std. Dev Layer Index|-| Maximum R2| 3080| 39%| 18%|-| Startups in Clusters Inflection| 6,743| 35%| 16%|-| Variance Explained Elbow| 4,799| 43%| 30%|} '''Using Common City-Years'''{| class="wikitable" style="vertical-align:bottom;"|- style="font-weight:bold;"! Method! N! Avg Layer Index! Std. Dev Layer Index! L Index < Peak! L Index < Max R2! X < Max R2|-| Maximum R2| 2662| 40%| 24%| 0| 2662| |-| Startups in Clusters Inflection| 2662| 44%| 5%| 1102| 167| 6%|-| Variance Explained Elbow| 2662| 31%| 22%| 53| 297| 11%|} [[File:BoxPlot.PNG|frame|600px]] Finally, we can look at a city where different methods select different layers and look at those layers. Here's Cincinnati, Ohio, in 2018: <gallery widths=300 heights=300>File:Cincinnati2018_Level16of25(MaximumR2).png| Layer:16/25 MaxR2File:Cincinnati2018_Level12of25(FractionInHullsInflection).png| Layer:12/25 InflectionFile:Cincinnati2018_Level8of25(VarianceExplainedElbow).png| Layer:8/25 Elbow</gallery> ====Implementing the '''Real Elbow Method'''==== I calculated the between and within-cluster variances, as described below, using the Euclidean distance by using the ST_Distance function on PostGIS geographies (i.e., accounting for an ellipsoid earth using reference system WGS1984).  The output of the python HCL clustering script has around 40m observations (place-statecode, year, layer, cluster, startup), and some of the intermediate tables took several minutes to build. As the process should be O(n), this process could accommodate input data that is perhaps 100x bigger, assuming a patient researcher, which would imply source data perhaps 10x bigger. Note that the hardware/software that we are running this on is pretty close to the (current) frontier. =====Fixing an issue===== [[File:HeuristicLayerSelectionGraphv3-1.PNG|400px|right]]The within-cluster variance (and so F-stat and variance explained) revealed an issue with the data that had to be fixed: The Python HCA script forces the decomposition of multitons into singletons at the end of its run! We want to stop the HCA when we have every location in a separate point, rather than artificially forcing startups with the same location into separate points. This issue likely directly affects the heuristic method(s) that rely on layer indices and indirectly (by changing observation counts) affects the maximum r2 layer choice. I pushed through the change and reran everything. It is build '''version 3.1''', and includes a new .do file, new .txt data files, and a new .log file.  The new elbow layer is: 2.5795 x^3 - 3.7445 x^2 + 0.1989 x + 0.9808≈0.492554 at x≈0.483879 [https://www.wolframalpha.com/input/?i=inflection+points+2.5795x3+-+3.7445x2++%2B+0.1989x+%2B+0.9808].  {{Colored box|title=NOTICE|content= The results in the section below are outdated! The updated results are similar but not the same. Do not rely on the results on the wiki page. '''Always check the log file in the dropbox (or in E:\projects\agglomeration) for the latest results!'''}} =====Trying to find the elbow===== The objective is to apply the [https://en.wikipedia.org/wiki/Elbow_method_(clustering) Elbow Method], which involves finding the [https://en.wikipedia.org/wiki/Knee_of_a_curve Knee of the curve] of either the F-statistic or variance explained. I used distances calculated by ST_Distance and calculated the '''variance explained''' using the following equations: :<math>SS_{exp}=\sum_{i=1}^{K} n_i(\bar{Y}_{i\cdot} - \bar{Y})^2</math>:<math>SS_{unexp}=\sum_{i=1}^{K}\sum_{j=1}^{n_{i}} \left( Y_{ij}-\bar{Y}_{i\cdot} \right)^2</math>:<math>R^2 = \frac{SS_{exp}}{SS_{exp}+SS_{unexp}}</math> I then calculated forward differences, and added one to the answer, as using central differences left truncates the data. (An inspection of the data revealed that it is vastly more likely that the 'correct' answer is found at the left end of the data than the right. Also central first difference bridge the observation, which can lead to misidentification of monotonicity.) Specifically, I used: :<math> f'(x) = f(x + 1) - f(x) </math>:<math> f''(x) = f(x+2) - 2 f(x+1) + f(x)</math> See https://en.wikipedia.org/wiki/Finite_difference I required that a city-year had more than two layers, as it takes at least 3 layers to form an elbow. I then used <math>f'(x)</math> to determine the layer index from which the variance explained was monotonic (i.e., there was no change in sign in <math>f'(x)</math> in higher layer indices). This wasn't an issue when using the population variance explained. In an earlier version, when we used the sample variance explained, we had some non-monotonic sections of the curve resulting from integer division (<math>\frac{k-1}{n-k}</math>). I used <math>f''(x)</math> to find the layer index <math>i</math> at which <math>varexp_i = min(varexp)</math> (for elbowlayer) or for which <math>varexp_i = max(varexp)</math> (for elbowmaxlayer), for some city-year. I then marked <math>i+1</math> as the elbow (or elbowmax) layer for that city-year, as we are using forward differences, not central differences. Note that the biggest change in slope could be found using max(abs(f''(x))) but this is essentially always min(f''(x)), i.e., the elbow layer, as the change in slopes are mostly negative. However, the changes in slopes do often go positive, and the elbowmax layer captures the biggest positive change in slope. '''I created a new build (version 3.3) of the dataset, do file and log file, which includes the population variance explained elbow method, as well as the elbowmax method. It's in the dropbox.'''.  Note that the lens found by the population elbow method is slightly bigger than the lenses found using sample elbow method from before, but the lens found using the elbowmax method is about the same size as the sample elbow method, if not slightly smaller. I'm not sure about the justification of the elboxmax method though. ====Fixing the layer index==== I checked the implementation of the % layer index again, and fixed a mistake in it. In Heurcalc, I define fracunclusted as: CASE WHEN finallayer >1 THEN ((layer::real-1)/(finallayer::real-1)) ELSE NULL END AS fracunclustered, I then previously select the layer below or equal to % layer index 0.374833 in Heurflh: CASE WHEN floor((finallayer-1)*0.374833)=0 THEN 1::int ELSE floor((finallayer-1)*0.374833) END AS heurflhlayer This was incorrect. With some rearrangement, we can see:<math>index \le 0.374833 \implies \frac{layer-1}{final-1} \le 0.374833 \implies layer \le 0.374833(final-1) + 1 </math> I therefore fixed the query to: floor(((finallayer-1)*0.374833)+1) AS heurflhlayer I've updated the results in the New Do File section, and the two methods are now much closer to giving the same lens (see the descriptives!). ====New Do File==== There's a new do file, dataset and log file in the dropbox. The do file is reorganized and condensed into a single file. In order to select layers for specific purposes, like regressions, the do file uses flags (defined on lines 165-190). Do not try to run regressions using xt commands even with these flags, as the underlying layers used will be incorrect. Instead, put in the fixed effects yourself (using i.var) and put them at the front of the regression. STATA decides which variables to omit before doing the linear algebra, so you'll be able to get non-zero coefficients on omitted variables if you put them ahead of the fixed effects that would wipe them out. The actual regressions I chose are on lines 243-250. There have the log of next period growth VC per startup as the dependent variable. Each spec uses year and city fixed effects, and clusters the standard errors at the city level. However, only the second spec in each sequence uses explicit scale controls.  '''Here's extracts of the results for Max R2:''' reg growthinv18lfperstartup i.year i.placeid nohull tothullcountl tothullareal tothulldensityl avghulldisthml if regmaxr2==1, cluster(placeid) nohull | 0 (omitted) tothullcountl | -.0731116 .0170361 -4.29 0.000 -.1067484 -.0394748 tothullareal | .0166992 .0032354 5.16 0.000 .0103111 .0230873 tothulldensityl | -.0026603 .0034181 -0.78 0.438 -.0094092 .0040885 avghulldisthml | .0025771 .0060088 0.43 0.669 -.0092869 .0144411 reg growthinv18lfperstartup i.year i.placeid nohull tothullcountl tothullareal tothulldensityl avghulldisthml growthinv18l numstartupsl numdealsl if regmaxr2==1, cluster(placeid) nohull | .0045458 .0007755 5.86 0.000 .0030165 .0060752 tothullcountl | .0042399 .0087493 0.48 0.629 -.0130145 .0214943 tothullareal | .0045121 .0042544 1.06 0.290 -.0038778 .0129021 tothulldensityl | .0163778 .0048519 3.38 0.001 .0068095 .0259462 avghulldisthml | -.0024717 .0026637 -0.93 0.355 -.0077247 .0027813 growthinv18l | .0075835 .0025118 3.02 0.003 .0026301 .012537 numstartupsl | -.1492758 .016447 -9.08 0.000 -.1817106 -.1168411 numdealsl | .0121621 .005014 2.43 0.016 .002274 .0220501 '''Here's extracts of the results for the 1st Heuristic (Percentage of startups in clusters):''' reg growthinv18lfperstartup i.year i.placeid nohull tothullcountl tothullareal tothulldensityl avghulldisthml if regheur1==1, cluster(placeid) nohull | .0029323 .0007413 3.96 0.000 .0014704 .0043941 tothullcountl | -.0526231 .0113923 -4.62 0.000 -.0750896 -.0301567 tothullareal | .0155028 .0041201 3.76 0.000 .0073776 .023628 tothulldensityl | .0057162 .0040038 1.43 0.155 -.0021797 .0136121 avghulldisthml | -.0022889 .0029852 -0.77 0.444 -.0081759 .003598 reg growthinv18lfperstartup i.year i.placeid nohull tothullcountl tothullareal tothulldensityl avghulldisthml growthinv18l numstartupsl numdealsl if regheur1==1, cluster(placeid) nohull | .0047632 .0007554 6.31 0.000 .0032736 .0062528 tothullcountl | .0008837 .012177 0.07 0.942 -.0231303 .0248977 tothullareal | .0011863 .0042324 0.28 0.780 -.0071603 .0095329 tothulldensityl | .0067998 .0043843 1.55 0.123 -.0018464 .015446 avghulldisthml | -.0063252 .0028367 -2.23 0.027 -.0119194 -.0007311 growthinv18l | .0076041 .0024995 3.04 0.003 .0026749 .0125333 numstartupsl | -.1450849 .0158493 -9.15 0.000 -.176341 -.1138289 numdealsl | .0124877 .0050118 2.49 0.014 .002604 .0223714 Lines 252 to 255 of the do file also compare the lense given by the max r2 method and the 1st heuristic method. The results are below. Note that the second spec shows the 1st heuristic for all city-years and the third spec shows it for the city-years that have a max R2 layer, to make a fairer comparison.  . tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if '''regmaxr2==1''', stats(p50 mean sd N min max p10 p90) columns(statistics) variable | p50 mean sd N min max p10 p90 -------------+-------------------------------------------------------------------------------- nohull | 2 3.862758 7.279538 2951 1 67 1 7 tothullcount | 8 19.18841 36.62401 2951 3 369 3 37 tothullarea | 15.2049 431.5689 2067.431 2951 4.04e-06 34780.04 .7658609 606.8908 tothullden~y | .8115587 258.0355 13679.93 2951 .0002282 743141.7 .0123656 12.06713 growthinv18 | 33.16967 141.5326 563.4799 2951 0 22282.6 1.412436 298.6927 numdeals | 3 6.630973 17.17611 2951 0 275 0 14 numstartups | 16 41.2528 90.67289 2951 6 1317 7 87 ---------------------------------------------------------------------------------------------- . tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if '''regheur1==1''', stats(p50 mean sd N min max p10 p90) columns(statistics) variable | p50 mean sd N min max p10 p90 -------------+-------------------------------------------------------------------------------- nohull | 2 4.352647 8.107042 3797 0 108 1 9 tothullcount | 9 20.60706 40.26376 3797 0 566 4 42 tothullarea | 18.88252 110.757 348.9474 3797 0 6079.538 1.523386 236.2767 tothullden~y | .5860988 6.013232 43.74094 3797 0 1429.742 .0277004 7.329074 growthinv18 | 31.8453 133.0608 508.1196 3797 0 22282.6 1.235763 292.4397 numdeals | 2 6.629181 16.46614 3797 0 275 0 15 numstartups | 15 38.74743 83.6814 3797 6 1317 7 83 ---------------------------------------------------------------------------------------------- . tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if '''regheur1==1 & compok==1''' , stats(p50 mean sd N min max p10 p90) columns(statistics) variable | p50 mean sd N min max p10 p90 -------------+-------------------------------------------------------------------------------- nohull | 2 4.581159 8.548611 2951 0 108 1 9 tothullcount | 9 21.68519 42.78684 2951 0 566 4 42 tothullarea | 19.43593 114.4633 360.3934 2951 0 6079.538 1.86059 252.267 tothullden~y | .5797117 6.835335 49.19501 2951 0 1429.742 .0314084 7.848711 growthinv18 | 33.16967 141.5326 563.4799 2951 0 22282.6 1.412436 298.6927 numdeals | 3 6.630973 17.17611 2951 0 275 0 14 numstartups | 16 41.2528 90.67289 2951 6 1317 7 87 ---------------------------------------------------------------------------------------------- ===A list of items=== Items:#Rerun the code with just the 8 geometry variables, and three factors, in the maximum R2 selection#Fix the implementation of %unclustered (now called %complete)#Compare the changed definitions and try some test regressions#Write a paragraph justifying the 'heuristic' method.#Write a couple of sentences about Guzman and Stern#Explore if/how we could implement the variance-based elbow method I also wanted to fix confusion between CSAs (Combined Statistical Areas)[https://en.wikipedia.org/wiki/Combined_statistical_area] and CMSAs (Consolidated Metropolitan Statistical Areas)[https://www2.census.gov/geo/pdfs/reference/GARM/Ch13GARM.pdf]. CMSA redirects to CSA on Wikipedia. However, it is actually not clear if these are the same things. OMB is the originator of both terms[https://www.census.gov/programs-surveys/metro-micro/about/masrp.html]. ====New implementations==== We're making two small but important changes to the implementations of:#The maximum R2 method (now just 8 metadata vars)#%Complete - now i-1/I-1 =====Max R2===== I ran the code and pushed the results through the database. The newly selected layers are generally the same or one lower than the previously selected layers. I expect that this will be just fine. =====Fraction Complete===== Fraction complete isn't a variable in MasterLayersv3-0. It is used on the fly in queries to assess the elbow layer and then generated anew in AnalysisVcdb4.do (though it isn't really used there): gen fracunclustered=layer/finallayer I fixed the following, and reran the analysis using fracunclustered = (layer-1)/(finallayer-1):*elbowcalc->heurcalc*elbowdata->heurdata*Elbowflh-heurflh [[File:HeuristicLayerSelectionGraphv2.PNG|right|500px]] Plotting the data from heurdata.txt in excel gives: 2.5944 x^3 - 2.9174 x^2 - 0.6407 x + 1.0304Wolfram alpha[https://www.wolframalpha.com/input/?i=inflection+points+2.5944x3+-+2.9174x2++-+0.6407x+%2B+1.0304] says the inflection point is now: x≈0.374833, y≈0.516982 Note that this is now the fraction complete is relative to finallayer-1. For reference, the old value was 0.429388 relative to finallayer. ====Implementing The Elbow Method==== This section explores whether we could implement the '''actual''' elbow method (see https://en.wikipedia.org/wiki/Elbow_method_(clustering) ). The answer is that we might be able to, at least for some sub-sample of our data, but that it likely doesn't give us what we want. =====Background===== The elbow method plots the number of clusters (on x) against the percentage of variance explained (on y) and finds the elbow. The elbow is the point at which the "diminishing returns [in variance explained] are no longer worth the additional cost [of adding another cluster]'. For the variance explained there are two main options:  #Variance explained = between-group variance / total variance#Variance explained = between-group variance / within-group variance (Note that this is the ANOVA F-statistic). Using the [https://en.wikipedia.org/wiki/Law_of_total_variance Law of total variance], total variance = between-group variance + within-group variance. '''From Wikipedia:''' :[https://en.wikipedia.org/wiki/F-test The F-test] in one-way analysis of variance is used to assess whether the expected values of a quantitative variable within several pre-defined groups differ from each other. The "explained variance", or "between-group variability" is :<math>\sum_{i=1}^{K} n_i(\bar{Y}_{i\cdot} - \bar{Y})^2/(K-1)</math> where <math>\bar{Y}_{i\cdot}</math> denotes the sample mean in the ''i''-th group, <math>n_i</math> is the number of observations in the ''i''-th group,<math>\bar{Y}</math> denotes the overall mean of the data, and <math>K</math> denotes the number of groups. The "unexplained variance", or "within-group variability" is :<math>\sum_{i=1}^{K}\sum_{j=1}^{n_{i}} \left( Y_{ij}-\bar{Y}_{i\cdot} \right)^2/(N-K),</math> where <math>Y_{ij}</math> is the ''j''<sup>th</sup> observation in the ''i''<sup>th</sup> out of <math>K</math> groups and <math>N</math> is the overall sample size. This ''F''-statistic follows the ''F''-distribution with degrees of freedom <math>d_1=K-1</math> and <math>d_2=N-K</math> under the null hypothesis. The statistic will be large if the between-group variability is large relative to the within-group variability, which is unlikely to happen if the population means of the groups all have the same value. '''Also from Wikipedia:''' [https://en.wikipedia.org/wiki/Variance Variance] is the expectation of the squared deviation of a random variable from its mean. :<math> \operatorname{Var}(X) = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2 = \left( \frac{1}{n} \sum_{i=1}^n x_i^2 \right) - \mu^2, </math> where <math>\mu</math> is the average value. That is, :<math>\mu = \frac{1}{n}\sum_{i=1}^n x_i</math> =====Practical Consequences===== It is possible that any calculation of variance using the full sample of our data (layers x city-years) is computationally infeasible. It seems particularly unlikely that we are going to manage between-group variance. I had problems in the past calculating mean distances between all centroids for just hulls, let alone for all geometries! We could, however, do this for some meaningful sub-population. There is also the question as to whether this approach is sensible in our context. In its native form, we'd be selecting the number of statistical clusters. We could readily use it to select the number of hulls (economic clusters) instead. But, in either case, we'd have to be within-city for this to make sense.  We could normalize the number of clusters, dividing it by the maximum, to deal with the 'cities are different' problem. That is, we could put %unclustered (later called %complete) on the x-axis and %variance explained on the y-axis and fit a curve to a plot of city-year-layers. We could then pick a %unclustered value and apply it across cities. The difference between this and the 'heuristic method' is that we'd be choosing based on diminishing marginal returns in variance explained as opposed to in percentage locations in hulls. '''Addendum:''' #We could do the elbow method on a per city-year basis. The number of statistical clusters is equal to the number of layers, so we'd be indexing over layers, and selecting a layer, for each city-year. It might be worth trying this for some city-year, say Tulsa, 2003. The code would be reusable for a bigger sample. Estimate: 3hrs.#I've rechecked the code and I now think it is computationally feasible. What I was trying to do before was find the average distance between every set of coordinates, which is an order more complex than what we need to do to calculate even within-group variance (between-group variance is simpler). Think O(n) rather than O(n^2), and we have around ~20million statistical clusters spread over ~200k layers. Estimate, given (1) above: 2hrs. =====The "Right" Implementation===== I could implement the elbow method using scalar distances. Specifically. I could measure <math>\bar{Y}_{i\cdot} - \bar{Y}</math> as the distance between a cluster's centroid and the overall centroid, and measure <math>Y_{ij}-\bar{Y}_{i\cdot}</math> as the distance between a cluster's constituent location and its centroid. There's a sense in which this approach is "the right thing to do", and the distance measurements are pretty straight-forward in PostGIS (and would account for an elipsoid earth).  However, in actuality, we have vectors of locations <math>(a,b)</math>, and not scalar distances as fundamental inputs. This changes the math[https://online.stat.psu.edu/stat505/lesson/8/8.2], as well as the ultimate test statistic [https://online.stat.psu.edu/stat505/lesson/8/8.3] that we might use. Specifically, <math>Y_{ij}</math> is a vector: <math>\mathbf{Y_{ij}} = \begin{pmatrix} Y_{ija} \\ Y_{ijb}\end{pmatrix}</math>, as are the sample mean <math>\mathbf{Y_{i\cdot}} = \begin{pmatrix} \bar{Y}_{i\cdot a}\\ \bar{Y}_{i\cdot b} \end{pmatrix}</math> and the grand mean <math>\mathbf{Y} = \begin{pmatrix} \bar{Y}_{a}\\ \bar{Y}_{b} \end{pmatrix}</math>. So the between-variance, which is called the '''H'''ypothesis sum of squares and cross products, is a 2x2 matrix and has K-1 degrees of freedom: :<math>H=\sum_{i=1}^{K} n_i(\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}})(\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}})'</math> And the within-variacnce, called the '''E'''rror sum of squares and cross products, is a 2x2 matrix and has N-K degrees of freedom: :<math>E=\sum_{i=1}^{K}\sum_{j=1}^{n_{i}} ( \mathbf{Y_{ij}}-\mathbf{\bar{Y}_{i\cdot}} )( \mathbf{Y_{ij}}-\mathbf{\bar{Y}_{i\cdot}} )'</math> The '''T''' sum of squares and cross products is <math>\mathbf{T=H+E}</math>. This is all potentially problematic because relational databases don't naturally do linear algebra very well. And we likely don't want to do this in some other software for two reasons. First, we'd have to move the data out and back into the database, which is costly. And, second the other software isn't likely to handle an elipsoid earth and the coordinates are not actually on a flat plane. (Though, assuming a flat earth might be reasonable as most cities are sufficiently small that curvature won't materially affect results, no matter how conceptually offensive it is.)  We could calculate the elements of '''H''' and '''E''' separately in PostGIS, taking advantage of ST_Distance. That wouldn't be too bad as the matrices are only 2x2s. However, it's worth asking whether that would be "right". Let's look at how '''H''' captures information about 'distance' (ignoring flat earth issues). Denote <math>\mathbf{\bar{Y}_{i\cdot}} - \mathbf{\bar{Y}} = \begin{pmatrix} \Delta A \\ \Delta B\end{pmatrix}</math>. Then: :<math>\mathbf{H}=\sum_{i=1}^{K} n_i \begin{pmatrix} \Delta A^2 & \Delta A \Delta B \\ \Delta A \Delta B & \Delta B^2\end{pmatrix} </math> So, the determinant and trace of '''H''' are: :<math>\det(\mathbf{H}) = \Delta A^2 \Delta B^2 - (\Delta A \Delta B)^2 \quad \mbox{and} \quad Tr(\mathbf{H}) = \Delta A^2 + \Delta B^2</math> Thus the trace is the square of the distance between points (for H, between a point within a cluster and the cluster's centroid). The trace of a product is the product of the traces only under specific circumstances and not in general, though we likely meet those circumstances (no zero elements). =====Standard Test Statistics===== We'd like to use some measure of variance explained, but variance is now a matrix. The standard test statistics for MANOVA, which are the closest equivalents to the ANOVA F-statistic from earlier, are: Wilk's Lambda: <math>\Lambda^* = \frac{\det \mathbf{E}}{\det(\mathbf{H}+\mathbf{E})}</math> Hotelling-Lawley Trace: <math>T = Tr(HE^{-1})</math> Pillai Trace: <math>V = Tr(H(H+E)^{-1})</math> So... the two trace statistics are very close to what we would get if we used scalar distances and used either scalar definition of variance explained. The main difference is the lack of correction for degrees of freedom. =====An Opportunity?===== We can't find a decent, let alone seminal, reference for using the elbow method to select the number of clusters. Our problem, which uses geographic coordinates, is also a special case anyway. So, we could implement a method using scalar distance and put a description of it, and its relationship to other measures, in the appendix. It might be a good value-added for the paper. One final thought: We could weight the distance between locations and the mean(s) by the fraction of startups in the location. ====The Heuristic Method Justification==== An attempt at a paragraph justifying the 'heuristic' method::Our heuristic method provides an objective technique for picking a city-year layer, which identifies and maps its clusters. It uses two measures. First, we are interested in clusters rather than lines or points, so we measure the percentage of locations in clusters. Second, we want to view each city-year through the same lens. As layer indices are not comparably across city-years, we use the HCA's 'fraction complete' to measure a layer's lens.:As an HCA progresses towards completion, it takes locations out of clusters at an increasing and then decreasing rate. Accordingly, a city-year plot, with the fraction complete on the x-axis and the percentage of locations in clusters on the y-axis, gives an S-curve. The inflection point of this curve marks a conceptual transition between refining clusters and dismantling them.  '''Note:''' For each layer of the HCA from i=1 to i=I, we define the HCA's fraction complete as (i-1)/(I-1). The HCA's fraction complete is then zero for the first layer when all locations are in a single hull, and one for the last layer when it has decomposed every cluster into separate locations. From version 8::My ‘elbow method’ fits a cubic function to the relationship between the percentage unclustered and the percentage of locations that are in hulls and determines the inflection point. The inflection point finds the layer beyond which further unclustering moves locations out of hulls at a decreasing rate.:As a rough guide, divisive clustering of geographic data of things like startups in cities moves through three stages. In the first stage, the algorithm identifies outliers as points, lines, or small-population hulls. Highest-first layers with low hull counts occur in this stage. In the second stage, the algorithm breaks apart core areas until it achieves the maximum number of hulls. Then, in the third stage, these hulls are refined, providing a progressively tighter lens on the core groupings, and dismantled, until all that remains are points. The elbow method identifies layers in this third stage at the tipping point between refining hulls and dismantling them. ====Guzman and Stern===='''The objective:''':"I think less is more. We just want to try to immunize ourselves against a referee who thinks we might be unaware of that data. We need a sentence saying it exists and another sentence saying why we don’t use it. And citing one source is probably enough." '''A hard-and-fast two-sentence version, with a one-sentence pillow:''':Guzman and Stern (2015) use business registration to find non-venture-backed startups. We do not use this data as it does not have a performance measure to demonstrate the representative layer's selection. However, understanding the relationships between clusters of different types of startups is an exciting topic for future research. '''A paragraph that covers the reviewer's comments too:''':We use data on venture capital-backed startups as their venture capital investment provides a direct performance measure to identify their clusters. However, other entrepreneurial firms may have clusters that influence clusters of venture-backed firms. In particular, pre-venture capital startups, firms that have experienced an acquisition or an initial public offering, and high-growth startups that do not raise venture capital (see Guzman and Stern, 2015, and related papers) may all affect the agglomeration of venture-backed startups. We leave it to future research to examine the relationships between clusters of different types of startups. '''A two-sentence version:''':We use data on venture capital-backed startups as their venture capital investment provides a direct performance measure to identify their clusters. However, we expect that other entrepreneurial firms, such as those identified from business registration data in Guzman and Stern (2015), may have a second-order effect on the clustering of venture-backed firms. '''We should cite their Science paper:''':Guzman, Jorge and Stern, Scott (2015), "Where is Silicon Valley?", Science, Vol. 347, No.6222, pp. 606-609, American Association for the Advancement of Science Bibtex: @article {Guzman606, author = {Guzman, Jorge and Stern, Scott}, title = {Where is Silicon Valley?}, volume = {347}, number = {6222}, pages = {606--609}, year = {2015}, doi = {10.1126/science.aaa0201}, publisher = {American Association for the Advancement of Science}, issn = {0036-8075}, URL = {https://science.sciencemag.org/content/347/6222/606}, eprint = {https://science.sciencemag.org/content/347/6222/606.full.pdf}, journal = {Science} } =====Previous attempts to include Guzman and Stern===== '''From version 8:'''Finally, Guzman and Stern (2015a,b) and related papers use business registration data to suggest that non-venture high-growth, high-technology startups outnumber their venturebacked counterparts by around three or four to one. These papers map and analyze nonventure startup activity within states, cities, zip codes, and for an individual street. A natural extension to this paper would consider agglomeration economies of non-venture startups, as well as their effects on the agglomeration of venture-backed firms. '''From version 4.1:'''Catalini et al. (2019) suggest that there are high-growth, high-technology startups that are not supported by venture capital. These firms may have been turned down by venture capitalists, may never approach venture capitalists, or may not have had their venture investment recorded at the time the authors conducted their study. Regardless, according to Catalini et al. (2019), non-venture HGHT startups outnumber venture-backed firms by around three or four to one. One obvious question is then whether they enhance or detract from agglomeration economies among venture-backed startups.  A natural extension of this paper would consider the impact of these firms. Catalini et al. (2019) derives its data from the processing of business registration data, which is explored for Massachusetts in Guzman & Stern (2015a), California in Guzman & Stern (2015b), firms leaving Delaware in Guzman (2017), and 15 U.S. states in Guzman & Stern (2016). This data provides full addresses, and so readily lends itself to Geographic Information System mapping, as well as a hierarchical cluster analysis decomposition to create microgeographies. =====Comments from the Associate Editor and two reviewers at Management Science===== '''Associate Editor:'''Multiple referees recommend supplementing your analyses with additional datasets. An expensive but informative path would be the National Establishment Time Series, which has street addresses for every business that reports data to Dun & Bradstreet; this would solve your longitudinal address problem. Alternatively, at no cost you can take advantage of Guzman & Stern’s Startup Cartography project data, which provides counts of the high-potential startups you are probably trying to capture with VX (and also see high-potential startups that don’t raise VC). '''Reviewer 2:'''First, the analysis only considers startup ventures which have already received venture funding—an important component of the ecosystem, for sure, but not necessarily the most important for agglomeration purposes. This results in an incomplete and possibly distorted picture of what agglomeration actually means. if a district has 50 startups that have not yet received VC funding and one that has, would that qualify as agglomeration? And what about companies that have been acquired or have gone public? At a minimum, a more accurate count of active ventures would be required to draw any legitimate conclusions about the effects of agglomeration, along the lines of what Andrews, Fazio, Guzman, Liu, & Stern (2019) have done with the Startup Cartography Project, especially if it is true that “non-venture high-growth, high-technology startups outnumber their venture-backed counterparts by around three or four to one” (p. 5). '''Reviewer 4:'''The authors note the challenges of using VC database addresses. This needs to be more closely considered given the persistence/agglomeration application currently contemplated. If ala Guzman’s work, you have a venture being ported to Redwood City for a big financing round and you also assign all prior rounds to Redwood City, you create quite a biased sample to extra growth around winner places. ===Back and Forth=== Big things:*Q: What is this technique for? A: Endogenous selection!*Q: Why cities? A: They are just foundational geographies. We might be able to do this starting with the entire US for a year.*Big problems with last version:**Ed's choice of language, especially using GIS terms.**Lack of focus. The paper does too many things!*Get Jim set up on the infrastructure: Wiki account, RDP account, etc. Other notes:*Ed to fix the % unclustered definition.*Send Jim Wald method justification Proposed title:#A new method for estimating the pattern and efficiency of agglomeration with an application to venture-backed startups.#A new method for identifying and mapping spatial agglomeration with an application to venture-backed startups Note: We had discussions on mapping vs. delineating, startups vs. ecosystem, and other tradeoffs.  4 or 5 value-added points by Ed:*Transform the problem of how to find and delineate agglomerations into the problem of selecting the right lens with which to view agglomerations*Endogenously determine the location and boundaries of agglomerations**Examine agglomeration at any scale, including microgeographic scales (i.e., 10s or 100s of meters). This is important as we think that some sectors, like high-growth high-tech startups, have microgeographic agglomerations.**Produce maps of agglomerations so we can examine their characteristics (socio-economics, demographics, physical, etc.) and facilities (e.g., parks, coffee shops, etc.) of agglomerations without including irrelevant areas. The results have policy implications.*Allow estimation of continuous changes in spatial characteristics on agglomeration economies.**Determine the equilibrium, where costs (from increased competition over scarce resources) and benefits (from reduced transportation costs) are in balance, and understand its policy implications. **Conduct policy simulations by demarking some area as an innovation district and exploring the effects of relocating startups into it. *In the context of venture-backed startups, agglomeration economies are as powerful as scale effects but easier and cheaper to achieve with policy.  The revision will be built around a public-policy hook, which pervades these points. ===Questions and Answers=== *What's the method: Ward's method in reverse. Always add another cluster!*How do we select the best layer for our purpose: **Our purpose is to identify/delineate agglomeratons within a CDP**We use lowest-highest layers, which correspond to unique hull counts with 'stable' hulls, in a regression at the CDP level, with a performance measure (next year's venture investment) as the dependent variable. ===Suggestions and Questions=== Terminology - Match to audience but accurate. *Propose: Hulls -> Clusters. Issues: mixes statistical def, also what about boundary? Alternative: Area.*Census place -> cities. *Lowest-highest -> representative*Understand regression to select hull count*Heat map! References for things we used: To do the HCA we used the AgglomerativeClustering method from the sklearn.cluster library (version 0.20.1) in python 3.7.1, with Ward linkage and connectivity set to none. This method is documented here: https://scikit-learn.org/stable/modules/clustering.html. I checked some of the early results against an implementation of Ward's method using the agnes function, available through the cluster package, in R. https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/agnes The data was assembled and processed in a Postgresql (version 10) database using PostGIS (version 2.4). We used World Geodetic System revision 84, known as WGS1984 (see https://en.wikipedia.org/wiki/World_Geodetic_System), as a coordinate system with an ellipsoidal earth, to calculate distances and areas (see https://postgis.net/docs/manual-2.4/using_postgis_dbmanagement.html). Shapefiles for Census Places were retrieved from the U.S. Census TIGER (Topologically Integrated Geographic Encoding and Referencing) database (see https://www.census.gov/programs-surveys/geography.html). The statistical analysis was done in STATA/MP version 15. ===New Target Journals=== There is a section of Wikipedia page on [https://en.wikipedia.org/wiki/Economic_geography#New_economic_geography "New Economic Geography"] that is worth reading, though it seems very out of context... Impact measured using [https://en.wikipedia.org/wiki/H-index H-index] Journals (Ed's prefs in bold):*'''Journal of Economic Geography''' -- H-Index: 96[https://www.scimagojr.com/journalsearch.php?q=22750&tip=sid]; Times: 2.8 months for 1st round and 5.2 months completion based on 2 reviews[https://scirev.org/journal/economic-geography/]; Homepage: https://academic.oup.com/joeg; Other info: impact factor 3.289 (but dropping), length <= 8000 ex. refs (100 word abstract).*Journal of Regional Science -- H-Index 74[https://www.scimagojr.com/journalsearch.php?q=13933&tip=sid&clean=0]; Times: No data[https://scirev.org/search/?query=Regional+Science&kind=journal&linkto=journal]; Homepage: https://onlinelibrary.wiley.com/journal/14679787; Other: 250 abstract, 2016 impact factor of 1.743*'''Journal of Urban Economics''' -- H-Index: 102[https://www.scimagojr.com/journalsearch.php?q=24406&tip=sid&clean=0]; Times: No data[https://scirev.org/journal/journal-of-urban-economics/]; Homepage: https://www.journals.elsevier.com/journal-of-urban-economics; Other: $100 sub fee, anecdotal good turnaround time: https://www.econjobrumors.com/topic/journal-of-urban-economics-turnaround-time*Regional Science and Urban Economics -- H-Index: 73[https://www.scimagojr.com/journalsearch.php?q=15139&tip=sid&clean=0]; Times: 2.8 months to 1st round and 3.3 months to completion based on 1 review[https://scirev.org/journal/regional-science-and-urban-economics/]; Homepage: https://www.journals.elsevier.com/regional-science-and-urban-economics; Other: $100 sub fee*'''Regional Studies''' -- H-Index: 111[https://www.scimagojr.com/journalsearch.php?q=15143&tip=sid&clean=0]; Times: 1.9 months to 1st round and 6.4 months to completion based on 1 review[https://scirev.org/journal/regional-studies/]; Homepage: https://www.tandfonline.com/toc/cres20/current; <=8000 words inc. abstract, abstract of 100 words. No sub fee.*Regional Science Policy and Practice -- H-Index: 6[https://www.scimagojr.com/journalsearch.php?q=21100435556&tip=sid&clean=0]; Times: No record; Homepage: https://rsaiconnect.onlinelibrary.wiley.com/journal/17577802 Also, we might want to pull a Scott Stern and try:*Science -- H-Index: 1124 [https://www.scimagojr.com/journalsearch.php?q=23571&tip=sid&clean=0]; Times: 2.0 months to 1st round and 5.3 months to completion based on 54 reviews[https://scirev.org/journal/science/]; Homepage: https://science.sciencemag.org/; Other: up to ~4500 words. Can have additional supplementary materials. ===Changes to methodology=== We changed the PCA to use just 8 variables and we then took just 3 components: pca nosinglemulti nopair nohull totsinglemulticount totpaircount tothullcount tothullarea totpairlength Principal components/correlation Number of obs = 216,242 Number of comp. = 8 Trace = 8 Rotation: (unrotated = principal) Rho = 1.0000  -------------------------------------------------------------------------- Component | Eigenvalue Difference Proportion Cumulative -------------+------------------------------------------------------------ Comp1 | 3.22169 1.09811 0.4027 0.4027 Comp2 | 2.12359 1.11652 0.2654 0.6682 Comp3 | 1.00707 .147009 0.1259 0.7940 Comp4 | .860061 .199002 0.1075 0.9016 Comp5 | .661058 .550229 0.0826 0.9842 Comp6 | .110829 .101626 0.0139 0.9980 Comp7 | .00920377 .00270352 0.0012 0.9992 Comp8 | .00650025 . 0.0008 1.0000 --------------------------------------------------------------------------  Principal components (eigenvectors)   ------------------------------------------------------------------------------------------------------------ Variable | Comp1 Comp2 Comp3 Comp4 Comp5 Comp6 Comp7 Comp8 | Unexplained -------------+--------------------------------------------------------------------------------+------------- nosinglemu~i | 0.1763 0.6213 0.1362 -0.2153 0.1628 -0.0441 0.4473 0.5380 | 0 nopair | 0.4968 0.0588 -0.1088 0.4258 -0.1916 0.0992 0.5652 -0.4366 | 0 nohull | 0.4613 -0.2494 0.1004 -0.3290 -0.1480 -0.7638 -0.0352 -0.0087 | 0 totsinglem~t | 0.1948 0.6196 0.1305 -0.1813 0.1416 0.0103 -0.4529 -0.5504 | 0 totpaircount | 0.4932 0.0711 -0.0971 0.4262 -0.2260 0.1408 -0.5224 0.4656 | 0 tothullcount | 0.3642 -0.2913 0.2530 -0.5627 -0.1425 0.6171 0.0259 -0.0001 | 0 tothullarea | -0.0543 -0.0905 0.9248 0.3532 0.0797 -0.0494 0.0035 -0.0016 | 0 totpairlen~h | 0.3041 -0.2575 -0.1175 0.0902 0.9039 0.0391 -0.0224 0.0119 | 0 ------------------------------------------------------------------------------------------------------------ From the eigenvectors, we can see that (give or take):*Component 1 is driven by the '''number of lines and hulls''', as well as the '''number of startups in lines and hulls'''*Component 2 is driven by the '''number of points''' and the '''number of startups in points'''*Component 3 is driven by the '''total hull area''' =Version 2 Build= ==Target Journal== High-end options might include:*Science: See https://science.sciencemag.org/content/347/6222/606*Management Science: There's a special issue which might be a fit [https://pubsonline.informs.org/doi/10.1287/mnsc.2018.3120], but otherwise the E&I track is probably the obvious choice [https://pubsonline.informs.org/page/mnsc/editorial-statement]. Editors are Ashish, Toby and Sridhar Tayur: https://pubsonline.informs.org/page/mnsc/editorial-board*JPE: https://www.journals.elsevier.com/journal-of-public-economics*Journal of Economic Geography: https://academic.oup.com/joeg ===Submission=== The paper was submitted to Management Science under Toby Stuart on the 26th March 2020. Toby immediately requested a change to the framing and the paper was resubmitted on April 6th. It was then sent out for review. I think I drew Josh Gans as an A.E. The paper was rejected on June 25th with the following note from Toby: <q>I have received four reviews and an Associate Editor report on your paper. I am very sorry to inform you that three of the four reviewers and the AE recommend against proceeding with the manuscript. All readers find this to be a really thought-provoking piece of work and I'd say there is a consensus that it falls into the "valiant try" category. it presents a novel idea and it got everyone thinking, but in the end, it is also is too much of a kluge and too difficult to follow to proceed with a revision, given the reviewers' doubts that the project is one round of revision away from a successful outcome.</q>
==Files==
New files are in E:\projects\agglomeration
*agglomeration.sql -- new main sql file uses for vcdb3*AgglomerationVcdb4.sql -- even newer main sql file for vcdb4
Contents of the Old Code subdirectory:
See also:
*[[Restoring vcdb3]]
*[[vcdb4]]
 
==vcdb4==
 
The [[Vcdb4]] rebuild is done with the script AgglomerationVcdb4.sql in E:\projects\agglomeration
 
It needs:
*portcomaster -- built in new BuildBaseTables
*tiger2 -- built in AgglomerationVcdb4.sql from TigerModified, which combines 7 encapsulated places with their encapsulators (see section below).
*portcogeo -- built in load tables and modified there to create lat4 and long4, which have 4dp (11.1m) of precision, which wikipedia lists as a land parcel and roughly corresponds to an office front.
*roundsummary ->placeyearvc -- Roundsummary doesn't exist anymore. We'll build placeyearvc from a different base
*rankingfull -- Requires deciding how to deal with places (see below).
*Cpi - done in Load.sql
 
When doing the synthetics for Houston, we need a central location and hull size. The 2007 hull centroid was at (long,lat=-95.47029505572422,29.745008286648368). The WeWork in The Galleria on Westheimer is at (-95.4765668, 29.7256031). We'll use a scaling hull size of 1 hectare per startup. For a 25 hectare district, we would do sqrt(25)/2 as the plus-minus in hm, where 1 hm is 0.0000898 degrees (at the equator for latitude). Or more broadly, the degree +/- for an x hectare district is:
+/- = ((sqrt(x)/2)*0.0000898)
 
Other notes:
*This build also attempts to incorporate [[American Community Survey (ACS) Data]].
*The level code was taken out of the main SQL file and moved to ExcessCode.sql.
*The elbow calculation was done using '''elbowdata''', not '''elbowdatarestricted'''. The workings are in E:\projects\agglomeration\Elbow.xlsx. Plot fracunclustered (X) against avgfracinhull (Y), and fit a cubic to give: 2.8485 x^3 - 3.3892 x^2 - 0.43^x + 1.0373≈-0.0553263 at x≈0.425704 [https://www.wolframalpha.com/input/?i=inflection+points+2.8485x%5E3+-+3.3892x%5E2+-+0.43%5Ex+%2B+1.0373]
*The Chosen Hull Layer can't be done until after the STATA analysis has been run, so for the first pass these tables are created empty. The code will have to be re-run from line 1200 onwards to add this in, recomputer the layer distances, and update the master dataset later.
*The TIF data processing was taken out of the main file. It was moved to E:\projects\agglomeration\TIF\TIF.sql.
*The old Houston code was largely redundant after the rebuild, but it was taken out to ExcessCode.sql, as was the diagnostic code that followed it.
 
Still to do:
*ACS variables - Just needs a join!
*Run and add in chosen layers!
 
==Analysis and Exploration==
 
Choose LHS variable:
*Growth VC: Growthinv17lf -- Rather than seed, less shocky. Forward one period. In 2017 or 2018 dollars to make it real. Logged to normalize somewhat and give a 'change' interpretation
*Growth in growth VC: gg17pcl -- Definitely logged (it's wildly non-normal without and somewhat normal with), even though this gives an change in growth rate interpretation. Could winsorize but logging pretty much deals with outliers anyway. Forward less present over present, so still forward looking.
*We could also conceivably use the change in rank: rankchg (=f.overallrank-overallrank) or rankup (=overallrank-f.overallrank). It would be much 'smoother' than other measures. It's also pretty normal (if a little sharply peaked at 0). However, it would be relative performance measure!
 
Choose dataset:
*Restrict on numlayers. Those with <3 layers can't form hulls. 3 layers is therefore a minimum for a hull based analysis. 6 layers allows 2 hulls, in theory anyway, and doesn't. 6,746 place-years have 2 or more layers (max is 1317), 4,969 have 6 or more layers.
*Restrict on time period. Data is from 1980 to present, through note that we have our synthetic Houstons which have years from 1 to 65 to indicate how many locations were replaced. We might want to restrict to 1986 to present to get good coverage. Or from 1995 to present to do the 'modern era'. Note that 2019 is a half year, so we should probably through it out.
*We are already restricting to the 198 places that had greater than 10 active at some point in their history...
 
Choose scale regressors:
*growthinv17l numdealsl numstartupsl
 
Then the analysis will be loosely as follows:
*Non-layer dependent descriptives
*Explain layers using Burlington, VT?
*Explain outliers using Las Vegas, NV?
*Highest1Hull
*Show Boston? at the elbow for 2018, 2017, 2016
*Elbow
*Layer Dependent descriptives I
*Levels (picking layer by average hull size)
*Maximum R2
*Layer Dependent descriptives II
*Group Means
*Policy Simulation
 
It would be good to include some MSA level or broader ecosystem level variables. These could include:
*Scale of MSA?
*Boston-Cambridge, Research Triangle, Silicon Valley, North East Corridor, LA, Bay Area, Seattle Area, and other notable indicators.
*Some dbase calculated measures:
**Number of adjacent places in placetigerarea (or similar)? Create a 10km buffer and look for intersections?
**Chain places together and count them?
 
The Between Estimator just isn't appropriate. We do want fixed effects (we'll test to see if random effects are appropriate - they aren't). See E:\projects\agglomeration\microeconometrics-using-stata.pdf p254 to p262, as well as [https://www.statalist.org/forums/forum/general-stata-discussion/general/395948-in-which-cases-should-one-prefer-xtreg-be-over-xtreg-re this www.statalist.org thread].
 
Also the maximum r2 approach picked a number of hulls, which should have been constant (it is now - there were errors before) within a place, making it redundant. Note that STATA will run the variable if you include the fixed effect second but not first - try:
reg something else i.control
reg something i.control else
 
This problem doesn't occur when using xtreg, which also appears to correctly cluster the standard errors.
 
===Highest1hull===
 
The 'highest1hull' (or 2 or 3) is a specification that identifies the highest level that has 1 hull the first time (i.e., starting from level 1, go until the level has 2 hulls and back it up one, or if it doesn't have 2 hulls, just find the highest layer with 1 hull.
 
As a consequence we can't use nohull, as it is always 1. It also turns out that tothullcount and tothulldensity don't matter, likely because we already include numstartupsl (removing it makes tothullcount significant). avgdisthm, however, does work.
 
===Levels===
 
We tried the following 'levels' (See Levels.xlsx):
{| class="wikitable"
! Name
! Tgt. Avg. Hull Size
! Act. Avg. Hull Size
! Act. Avg. Hull Count
! Coef.
! Sig
! Act. Density
|-
| HCLLayerLevel1
| 2
| 1.390169
| 3.31986
| 0.031994
|
| 0.418743
|-
| HCLLayerLevel2
| 5
| 3.828996
| 4.213987
| -0.00471
|
| 0.90864
|-
| HCLLayerLevel3
| 10
| 7.921853
| 4.949919
| -0.04308
|
| 1.600401
|-
| HCLLayerLevel4
| 15
| 12.38832
| 5.483262
| -0.00495
|
| 2.259297
|-
| HCLLayerLevel5
| 20
| 16.82407
| 5.879045
| 0.057642
|
| 2.861701
|-
| HCLLayerLevel6
| 25
| 21.07783
| 6.218925
| 0.13614
| *
| 3.389304
|-
| HCLLayerLevel7
| 30
| 25.40073
| 6.512421
| 0.174672
| **
| 3.900351
|-
| HCLLayerLevel8
| 35
| 29.48548
| 6.788268
| 0.155578
| **
| 4.343594
|-
| HCLLayerLevel9
| 40
| 33.6133
| 7.02369
| 0.158089
| *
| 4.785704
|-
| HCLLayerLevel10
| 50
| 41.93593
| 7.452783
| 0.163425
| *
| 5.626882
|-
| HCLLayerLevel11
| 35000
| 5080.886
| 37.30932
| 0.3621
|
| 136.1828
|}
 
From this, the optimum density -- the point at which the change in total density gives no further growth effect is probably between 2 and 3 startups per hectare. We will split the difference and use 2.5 startups per hectare as the optimum for the our innovation district simulations. Given than 0.001 decimal degrees is 111.3199m, there are 0.000898 degrees/hm [http://wiki.gis.com/wiki/index.php/Decimal_degrees]. We'll use square boxes, so our corners will be:
coord +/- ( ( (sqrt(x)/2)*0.000898) / 2.5)
 
Note that we had one too many decimal places in the earlier version by accident, which is why everything looked really weird!
 
===Elbow===
 
The elbow uses the point of inflection between the fracunclustered and fracinhull in the base table '''hcllayerwfinal''' (i.e. across all place, statecode, year, layer). This inflection point is at 0.425704 fracunclustered -- which is layer/finallayer.
2.8485 x^3 - 3.3892 x^2 - 0.43^x + 1.0373≈-0.0553263 at x≈0.425704 (y=0.459799)
 
Within cities, and with year fixed effects, the only things that matter are the number of singleton/multitons and the number of hulls. The more singletons/multitons, the greater the growth; and the more hulls, the lower the growth. This result goes away if we control for the full set of geographic characteristics. Without city fixed effects, only avgdisthm is significant (and negative).
 
===Maximum R2===
 
The Maximum R-squared analysis relies on finding layers by hull count. Specifically, for a certain hull count, say 2, we find the lowest-highest layer occurrence. That is we first find the last layer (i.e., the highest layer) that had a hull count of 2, then if that occurrence is a part of a sequence, we find the first time that in the sequence.
 
Possible examples include Portland, OR (in say 2018), or Burlington, VT in 2015
SELECT * FROM HullsBase WHERE place='Burlington' AND statecode='VT' AND year=2015;
place statecode year layer numclusters
Burlington VT 2015 1 1
Burlington VT 2015 2 1
Burlington VT 2015 3 1
Burlington VT 2015 4 2
Burlington VT 2015 5 2
Burlington VT 2015 6 1
 
For Burlington, the lowest-highest layers are 4 for hull count 2 and 6 for hull count 1. Note that there is presumably a layer 7 with hull count 0.
 
We want to regress, within each city, our measures on performance, and then select the lowest-highest layer -- in effect the number of hulls -- which maximizes the R-squared of the regression. We are going to use 1995 to 2018 inclusive, which is 24 years. We want to include scale variables (i.e. 3 vars: growthinv17l numdealsl numstartupsl), and the (almost full -- we can't do avgdisthm) set of explanatory vars (8 vars nosinglemulti nopair nohull totsinglemulticount totpaircount tothullcount tothullarea totpairlength), which would total 11 vars plus an intercept. To reduce this, we can do a PCA first.
 
The alternative method is to create residuals from the outcome variable (growth vc) after taking out the variation explained by the scale effect. We can then use these residuals as the outcome variable in the city-by-city regressions (again, perhaps with a PCA). The advantage to this method is that we will pick layers based on their R2 of just the agglomeration effect, rather than agglomeration and scale together.
 
reg growthinv17lf growthinv17l numdealsl numstartupsl i.year i.placeid if lowesthighestflag==1 & year>=1995 & year <.
predict growthinv17lfres, residuals
 
Using the first method, the PCA yields 3 or maybe 4 components. We'll use 4.
 
Then we need to create a new dataset with placeid year as the panel, so that we can run between regressions. One issue was that some observations have a missing avgdisthm even when layer=chosenhulllayer. This was because in some cases there was only one geometry. In a similar vein, avghulldisthm is the distance between hull geometries, not within hull geometries, so is missing when there are zero or one hulls. In both cases, we flag the instances (using variables onegeom, onehull, zerohull) and then replace missings with zero (as this is conceptually correct).
 
===Instrumental Variable(s)===
 
See https://en.wikipedia.org/wiki/Instrumental_variables_estimation for background.
 
We want a variable that affects the amount of venture capital only through its effect on agglomeration. More specifically, in the case of the highest1hull, we want a variable that affects the next periods VC only through its effect on the hull size (given that there is always one and only one hull). Moreover, we will be using city-level fixed effects, so a variable that doesn't change over time -- like the size of the city -- will have no variation to drive an effect on hull size. There are two immediate possibilities, using a size measure involving the TIFs or using the number of locations within TIFs. The points measure has two opposing effects in it: it grows as more startups move into TIFs and it grows as TIFs take up more area. The size measure(s) could be the area of TIfs intersecting last periods hull and the area not intersecting it.
 
hull area in t = f( hull area in t-1 intersecting TIF areas in t, TIF areas in t not intersecting hull area in t-1).
 
The two measures will have temporal variation as TIF areas vary over time and as hull areas vary over time...
 
===Instrumenting highest1hull===
 
The instrument on highest1hull needs to work in itself, but we also need a consistent estimate of an effect for the population and for the 13 (or 10, etc.) cities that we can instrument before we start.
 
It looks like including year fixed effects in the 'standard' highest1hull spec is just too much with the reduced sample size. We can use a boom indicators though:
reg growthinv17lf growthinv17l numdealsl numstartupsl avghullarea boom i.placeid if reg1==1 & year95==1, robust
reg growthinv17lf growthinv17l numdealsl numstartupsl avghullarea boom i.placeid if reg1==1 & year95==1 & tifs==1, robust
reg avghullarea tifintareahm boom i.placeid if reg1==1 & year95==1 & tifs==1, robust
tab placeid if tifs==1, gen(tifcity)
ivreg2 growthinv17lf growthinv17l numdealsl numstartupsl boom tifcity* (avghullarea=avghullarea_L) if reg1==1 & year95==1 & tifs==1, robust endog(avghullarea)
 
Useful documentation:
*https://economics.mit.edu/files/18 -- Nice section on understanding coefficients, LATE, etc.
*https://www.nuffield.ox.ac.uk/media/3154/stata-intro-part-iii.pdf -- Using ivreg2 and doing it manually with predict
*http://www.repec.org/bocode/i/ivreg2.html - man page for ivreg2
*http://fmwww.bc.edu/EC-C/F2012/228/EC228.f2012.nn15.pdf -- Useful material on additional tests (endog, etc.)
*https://www.stata.com/statalist/archive/2011-04/msg00877.html -- For the stata tests
*https://journals.sagepub.com/doi/pdf/10.1177/1536867X0800700402 -- Actual documentation!
 
And later additions:
*http://www.columbia.edu/~ajc2241/Instrumental%20Variables%20Estimation%20in%20Political%20Science.pdf
*Interpreting Instrumental Variables Estimates of the Returns to Schooling: http://www.nber.org/~kling/kling_iv.pdf
*http://cameron.econ.ucdavis.edu/e240a/ch04iv.pdf
*Just How Sensitive are Instrumental Variable Estimates? https://web.stanford.edu/~preiss/iv_sensitive.pdf
*LATE: http://econometricsense.blogspot.com/2017/07/instrumental-variables-and-late.html
*Explanations of IVs: http://econometricsense.blogspot.com/2015/11/instrumental-explanations-of.html
 
Results are as follows:
*In the first stage, the instrument is highly (positively statistically significant).
*Without the instrument my estimate is small, negative and sig at 5%.
*With the instrument, the estimate is small, positive and insig.
*Underidentification test is significant, H0 is that model IS underidentified, so this is rejected [https://journals.sagepub.com/doi/pdf/10.1177/1536867X0800700402 p486]
*Weak ID test has a Cragg-Donald Wald F statistic of 58.713, which is well above the critical bounds
*Overidentification test is 0 (null is identified), and it says "equation exactly identified"
*Endogeneity test is 3.375 with Chi-sq(1) P-val = 0.0662. Null is that it is exogeneous, which is rejected.
 
 
. ivreg2 growthinv17lf growthinv17l numdealsl numstartupsl boom tifcity* (avghullarea = tifintareahm) if reg1==1 & year95==
> 1 & tifs==1, robust endog(avghullarea)
Warning - collinearities detected
Vars dropped: tifcity12
IV (2SLS) estimation
--------------------
Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity
Number of obs = 179
F( 16, 162) = 14.82
Prob > F = 0.0000
Total (centered) SS = 412.5880393 Centered R2 = 0.5846
Total (uncentered) SS = 3128.446788 Uncentered R2 = 0.9452
Residual SS = 171.3895217 Root MSE = .9785
------------------------------------------------------------------------------
| Robust
growthin~7lf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
avghullarea | 7.13e-06 8.44e-06 0.84 0.399 -9.42e-06 .0000237
growthinv17l | .2036359 .1261974 1.61 0.107 -.0437064 .4509783
numdealsl | -.0775574 .1797158 -0.43 0.666 -.4297939 .2746791
numstartupsl | .8311169 .2834661 2.93 0.003 .2755335 1.3867
boom | 1.063169 .2712561 3.92 0.000 .531517 1.594821
tifcity1 | .3257855 .401005 0.81 0.417 -.4601698 1.111741
tifcity2 | .055352 .6979564 0.08 0.937 -1.312617 1.423321
tifcity3 | -.6024319 .7152221 -0.84 0.400 -2.004241 .7993776
tifcity4 | .3234907 .4568871 0.71 0.479 -.5719915 1.218973
tifcity5 | -.2383231 .4824386 -0.49 0.621 -1.183885 .7072392
tifcity6 | -.2941022 .398834 -0.74 0.461 -1.075802 .4875981
tifcity7 | -.8108876 1.113175 -0.73 0.466 -2.992671 1.370896
tifcity8 | .2262667 .5042398 0.45 0.654 -.7620251 1.214558
tifcity9 | -.0979371 .4359115 -0.22 0.822 -.9523079 .7564337
tifcity10 | -1.042832 .6154755 -1.69 0.090 -2.249142 .1634775
tifcity11 | -.3423386 .4180955 -0.82 0.413 -1.161791 .4771135
tifcity12 | 0 (omitted)
_cons | -.0235353 .9222309 -0.03 0.980 -1.831075 1.784004
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic): 17.585
Chi-sq(1) P-val = 0.0000
------------------------------------------------------------------------------
Weak identification test (Cragg-Donald Wald F statistic): 58.713
(Kleibergen-Paap rk Wald F statistic): 15.579
Stock-Yogo weak ID test critical values: 10% maximal IV size 16.38
15% maximal IV size 8.96
20% maximal IV size 6.66
25% maximal IV size 5.53
Source: Stock-Yogo (2005). Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments): 0.000
(equation exactly identified)
-endog- option:
Endogeneity test of endogenous regressors: 3.375
Chi-sq(1) P-val = 0.0662
Regressors tested: avghullarea
------------------------------------------------------------------------------
Instrumented: avghullarea
Included instruments: growthinv17l numdealsl numstartupsl boom tifcity1
tifcity2 tifcity3 tifcity4 tifcity5 tifcity6 tifcity7
tifcity8 tifcity9 tifcity10 tifcity11
Excluded instruments: tifintareahm
Dropped collinear: tifcity12
------------------------------------------------------------------------------
Note that using lagged avghullarea as an instrument seems to work... Results are as follows:
*With the instrument, the estimate is close to the original, still negative and sig at the 5% level.
*Underidentification test is significant, H0 is that model IS underidentified, so this is rejected
*Weak ID test has a Cragg-Donald Wald F statistic of 679.195, which is massively above the critical bounds
*Overidentification test is 0 (null is identified), and it says "equation exactly identified"
*Endogeneity test is 0.037 with Chi-sq(1) P-val = 0.8465. Null is that it is exogeneous, which we can't reject.
 
Overall, though, it looks like the TIF instrument is dead. Which is a HUGE shame.
 
==Results==
 
It looks like a previous version of this results section wasn't saved somehow...
 
The crucial information is the magnitude of effect of coefficients:
*Log-level: (e^B -1)*100, gives pc change in Y from 1 unit change in X [http://www.cazaar.com/ta/econ113/interpreting-beta]
**Except with indicator variables, when the positive change is (e^B -1)*100 but the negative change is (e^(-B) -1)*100 [https://davegiles.blogspot.com/2011/03/dummies-for-dummies.html].
*Log-log: ((1.01^B)-1)*100 [https://www.cscu.cornell.edu/news/statnews/stnews83.pdf]
 
 
Note that some guides appear to contain errors [https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faqhow-do-i-interpret-a-regression-model-when-some-variables-are-log-transformed/]!
 
We are working off of the besthulllayer regression coefficient between:
. xtreg growthinv17lf numdealsl numstartupsl nohull frachull tothullareal avgdisthml onegeom i.year if reg6==1, be
Between regression (regression on group means) Number of obs = 3,635
Group variable: placeid Number of groups = 198
R-sq: Obs per group:
within = 0.0000 min = 2
between = 0.9010 avg = 18.4
overall = 0.0189 max = 23
F(29,168) = 52.73
sd(u_i + avg(e_i.))= .4628413 Prob > F = 0.0000
------------------------------------------------------------------------------
growthin~7lf | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
numdealsl | -.7000722 .2198263 -3.18 0.002 -1.13405 -.2660944
numstartupsl | 2.125103 .2104279 10.10 0.000 1.709679 2.540527
nohull | -.0154989 .0094233 -1.64 0.102 -.0341023 .0031045
frachull | .2890439 .2398689 1.21 0.230 -.1845017 .7625894
tothullareal | -.0726595 .0423528 -1.72 0.088 -.1562717 .0109526
avgdisthml | -.2554753 .0868682 -2.94 0.004 -.4269693 -.0839814
onegeom | -1.217626 .4010889 -3.04 0.003 -2.009449 -.4258019
|
year |
...
2017 | 11.83234 3.830136 3.09 0.002 4.270947 19.39374
|
_cons | -3.687907 3.026614 -1.22 0.225 -9.663005 2.28719
------------------------------------------------------------------------------
 
{| class="wikitable"
!
! 1 unit change gives pc change in Y
! 1 pc change gives pc change in Y
|-
| numdealsl
|
| -0.69417
|-
| numstartupsl
|
| 2.137063
|-
| nohull
| -1.537941017
|
|-
| frachull
| 33.51503403
|
|-
| tothullareal
|
| -0.07227
|-
| avgdisthml
|
| -0.25388
|-
| onegeom
| -70.40681241
|
|}
==CoLevelForCircles==
We went with the first option. The base table for this approach is '''hcllayer'''. The variables are '''highest1hulllayer''', highest2hulllayer, and highest3hulllayer in the '''highesthulllayer''' table.
It is worth noting that the highest1hulllayer occurs on average at around 2135.49% unclustered (with std dev. of 2048.20%), but this falls to just 12.8% when we exclude years where the highest1hulllayer is also the last and hence only layer. These percentages go down alightly slightly for highest2hulllayer and highest3hulllayer because cities that have 2 or 3 (or more) hulls have larger ecosystems and so more layers.
===Elbow on fraction of locations in hulls===
We could ignore, flag or discard these cites. A visual inspection suggests that Culver City, Torrence, El Segundo, Jersey City, and probably Richardson, Newark, and maybe Cary don't have any issues. Santa Monica, Santa Clara, Emeryville, Farmer's Branch and Addison do look like they have issues, but with the exception of Farmer's Branch and Addison, these are big cites and with lots of locations, so the issue should be washed out by removing outliers or otherwise appropriately choosing the clustering layer.
 
After reflection, we decided to deal merge the following places (listing geoids):
Santa Monica 0670000 -> LA 0644000
Santa Clara 0669084 -> San Jose 0668000
Emeryville 0622594 -> Oakland 0653000
Farmer's Branch 4825452 -> Dallas 4819000
Addison 4801240 -> Dallas 4819000
Newark 0650916 -> Freemont 0626000
Morrisville 3744520 -> Cary 3710740
====Intersecting All Encompassing Hulls====
==TIF data==
See:*the [[TIF Project]]*[[Jeemin Sim (Work Log)]] Install ogr2ogr page for details on mother: apt install gdal-bin ===Chicago=== Starting the process with Chicago. Do the following:*Go to: https://TIF data.cityofchicago.org/Community-Economic-Development/Boundaries-Tax-Increment-Financing-Districts/fz5x-7zak*Save the KML as E:\projects\agglomeration\TIF\Chicago.KML*Load the KML into the dbase using ogr2ogr [https://gdal.org/programs/ogr2ogr.html], note the the nln option creates a new layer (table)*Note that you have to hit refresh on the dbase (or at least the table list) in DataGrip to get the new table to show up*Also, you'll need to fix the date types in post-processing. Note that null dates are going to show up as year 0, which doesn't actually exist and is incorrect (try SELECT to_date('0', 'YYYY')) and again with -1. See https://github.com/sequelize/sequelize/issues/3762 and blame the pope.  ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" Chicago.kml -nln tifchicago The table tifchicago corresponds to the XML below, except that it has some extra fields. The index field seems to be autogenerated, and tessellate, extrude and visibility are always -1,0, and -1 respectively. All other fields are blank, including description, timestamp, begin, and end (note section that you can query end with "end" in postgres): SELECT ogc_fid, name, description, timestamp, begin, "end", altitudemode, tessellate, extrude, visibility, draworder, icon FROM tifchicago; The KML is an XML file with meta data (including an approval date, an expiration date, and a name, as well as some other values, some of which could be derived from the geometry) and then a set of points that describe the outer ring of a polygon: <Placemark> <styleUrl>#defaultStyle</styleUrl> <name>Northwest Industrial Corridor</name> <ExtendedData> <Data name="approval_d"><value>12/2/1998</value></Data> <Data name="comm_area"><value>19,20,23,25,26,27</value></Data> <Data name="expiration"><value>12/2/2021</value></Data> <Data name="ind"><value>Industrial</value></Data> <Data name="name_trim"><value>Northwest Industrial Corridor</value></Data> <Data name="objectid"><value>0</value></Data> <Data name="objectid_1"><value>107</value></Data> <Data name="ref"><value>T- 64</value></Data> <Data name="repealed_d"><value></value></Data> <Data name="sbif"><value>Y</value></Data> <Data name="shape_area"><value>51402231.0322</value></Data> <Data name="shape_leng"><value>80417.4828932</value></Data> <Data name="show"><value>1</value></Data> <Data name="type"><value>Existing</value></Data> <Data name="use"><value>Industrial</value></Data> <Data name="wards"><value>27,28,30,31,37</value></Data> </ExtendedData> <MultiGeometry> <Polygon> <outerBoundaryIs> <LinearRing> <coordinates> -87.74541914577178,41.92534327389125 -87.74541599224814,41.92516215074396 </coordinates> </LinearRing> </outerBoundaryIs> </Polygon> </MultiGeometry> </Placemark> The ind variable might be useful if they are sufficiently systematic across jurisdictions. '''use''' was a messy version of '''ind'''. The '''sbif''' apprears to be a Small Business Improvement Fund grant specific to Chicago [https://somercor.com/sbif/] SELECT sbif, count(*) FROM TifChicagoClean GROUP BY sbif; sbif count Y 92 N 45  SELECT DISTINCT ind FROM TifChicagoClean; ind Mixed-Use Residential Mixed-use Industrial Commercial Transportation ===One by One=== Chicago -- credit Jonathan Levy, created Jul 17, 2013, updated Mar 19, 2019 [https://data.cityofchicago.org/Community-Economic-Development/Boundaries-Tax-Increment-Financing-Districts/fz5x-7zak], has 138 TIFs with dates, approved from 1989-09-13 to 2018-10-31. Expiration ranges from 2018-07-13 to 2052-12-31.  ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" Chicago.kml -nln tifchicago Columbus, OH. Last updated 8/8/2017. Downloaded full dataset [http://data-columbus.opendata.arcgis.com/datasets/tax-increment-financing-areas?geometry=-112.588%2C-29.382%2C112.588%2C29.382], which should contain 69 records as a KML. URL in each record is out of date but just points to https://www.columbus.gov/development/economic-development/. Records '''have PASSAGE dates''' but no end dates. They also have DTE_STATUS but it isn't clear that originally on this really means anything (something to do with filling form DTE24?).   ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" Columbus.kml -nln tifcolumbus Washington, DC. Last updated 6/25/2019. Downloaded full dataset [https://opendata.dc.gov/datasets/tax-increment-financing-tif-areas], which should contain 17 records. There are '''no dates''' and a quick search didn't find any. There's useful information here: https://cfo.dc.gov/sites/default/files/dc/sites/ocfo/publication/attachments/tx_tif_chapter_2020j.pdfIt should also be noted that firms located in the DC Ballpark TIF area are specifically excluded from being Qualified High Technology Companies[https://otr.cfo.dc.gov/sites/default/files/dc/sites/otr/publication/attachments/48605_FR-399_22416.pdf]. Checked the old version, and it didn't have dates either... so the labelling on [[TIF Project]] appears incorrect. Call: 202 727 1140 -- Aisa Reed, senior data analyst. 202 727 4364 Nate Cruz, deputy director of Office of Economic Development Finance 202 478 9152 Nancy Fox, Senior Policy Analyst, Office of Economic Development FinanceNote that we might be able to infer them, one by one, from financial statements or by looking at their creating acts: https://cfo.dc.gov/node/230912 ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" WashingtonDC.kml -nln tifwashingtondc Some date data taken from: https://cfo.dc.gov/sites/default/files/dc/sites/ocfo/publication/attachments/FY%202018%20DC%20CAFR_Full%20Report.pdfOthers can be taken directly from the legislation either http://lims.dccouncil.us or https://code.dccouncil.us/dc/council/code/sections (better through Google). In 2 cases, I also took data from a news article [https://ggwash.org/view/12907/what-can-dc-learn-from-its-successful-subsidies]. See also, DC's TIF legislation: https://code.dccouncil.us/dc/council/code/titles/2/chapters/12/subchapters/IX/ Houston - no longer exists... But we have the shape files from the previous round of work. http://data.houstontx.gov/dataset/city-tax-increment-reinvestment-zones-tirz now leads to a dead link for the zip, but it does say that the file page was created June 30, 2017, 2:27 PM (UTC-04:00) and last updated July 20, 2017, 2:22 PM (UTC-04:00). Again, though, '''no dates'''! Can go one by one through the creation docs found at https://www.houstontx.gov/ecodev/tirz_info.html This site lists 27 TIRZ, whereas moved there are only 23 in the old data. Harris County says that it participates in just 5 Houston TIRZs [https://hcoed.harriscountytx.gov/tirz.aspx]. In z:/vcdb3/Houston: shp2pgsql -I "TIRZ.shp" tifhouston | psql -U researcher -W -d vcdb3 drop table tifhouston; In z:/vcdb3/HarrisCounty shp2pgsql -I "TIRZ.shp" tifharriscounty | psql -U researcher -W -d vcdb3 drop table tifharriscounty; See https://www.bostongis.com/pgsql2shp_shp2pgsql_quickguide.bqg for a cheatsheet for shp2pgsql. Found City of Houston KML data here: https://cohgis-mycity.opendata.arcgis.com/datasets/coh-tirz It doesn't have dates but does have 27 rows and was last updated 7/10/2019. ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" Houston.kml -nln tifhouston One TIF had a null name, which was changed to UNKNOWN. By process of elimination, this was Village Enclaves (Terminated). Found the dates in a review: https://www.houstontx.gov/finance/COHTIRZPerformanceAudit-FINAL(INCLUDINGAPPENDICES-NOMAPS).pdfThis was processed and extracted to E:\projects\agglomeration\TIF\HoustonDatesAndData.txt. Note that the total increment and project costs were $830,822k and $1,732,558k, respectively. Also, the Texas Comptroller's office has a nice chapter on TIFs: https://www.cdfa.net/cdfa/cdfaweb.nsf/ordredirect.html?open&id=tifintexas.html Dallas - downloaded 'latest' data [http://www.dallascad.org/GISDataProducts.aspx]. But no meta data is available. Shockingly, the data looks good and up to date. In contains legalstart and legalend fields for all 214 rows! In z:/vcdb3/Dallas shp2pgsql -I "LocalTaxDistrict.shp" tifdallas | psql -U researcher -W -d vcdb3 Vermont -- 12 records, last updated two years ago (4/7/2017), http://geodata.vermont.gov/datasets/cd53b08d40fc4e7dbc2757c36d038af5_7/data no dates available but one state contact (Megan Sullivan Executive Director, Vermont Economic Progress Council (802) 798-2221 Megan.Sullivan@vermont.gov) and an individual contact for each one. Dates are readily available from: https://accd.vermont.gov/sites/accdnew/files/documents/DED/VEPC/Tiff/TIFDISTRICTSstatus.pdf Other, more detailed info, is also available [https://accd.vermont.gov/community-development/funding-incentives/tif]. Note that Colchester was dissolved, having never become active and Milton has finished its debt service. TIFs get 10 yrs from approval to incur debt and then 20 yrs from debt to retain increment, which ends when debt service ends.  ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" Vermont.kml -nln tifvermont Dublin, OH -- 46 records, last updated 5/8/2018, seems up to date. Has dates! ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" Dublin.kml -nln tifdublin Atlanta, GA -- Seems like 6+4 projects [https://www.atlantaga.gov/government/departments/city-planning/office-of-zoning-development/plans-and-studies/tax-allocation-district-tad]. Found here: https://dcp-coaplangis.opendata.arcgis.com/datasets/tax-allocation-district All 10, last updated 8/29/2018. Has creation and end dates.   ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" Atlanta.kml -nln tifatlanta AlleghenyCounty (Pittsburg), PA -- Available from https://data.wprdc.org/dataset/allegheny-county-tif-boundaries but dates are "None" through out. Looks up to date (says August 20, 2019, 5:00 PM), and has 34 records. Same data available from http://openac-alcogis.opendata.arcgis.com/datasets/allegheny-county-tif-boundaries and https://catalog.data.gov/dataset/allegheny-county-tif-boundaries-a1715  ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" AlleghenyCounty.kml -nln tifalleghenycounty A report online [https://www.alleghenycontroller.com/admin/attachments/7880501fe636c1fe90fa572eb72b54292019MarTue13550615536085066512bd43d9caa6e02c990b0a82652dca36830bb3ce75baae6d5313ca822b2dc2FinalEconomicDevelopmentTaxSubsidiesReport.pdf], saved to the TIF folder, has end dates for 28 of 34 TIFs, which was used to infer start dates (T-20 from ). Note that the following TIF boundaries do not have dates: CENTER TRIANGLE - LAZARUS CENTER TRIANGLE - MELLON CLIENT SERVICE CENTER MCCLAREN ROAD TIF NORTH SHORE PENN LIBERTY PLAZA SCHENLEY CENTER SEWICKLEY SOUTH SIDE WORKS WASHINGTON PARK Also the following TIFs do not have boundaries: Hazelwood‐Almono Summerset at Frick Park Robinson Mall Peripheral is lumped in with Robinson Mall in the boundary data. ===Looking for add ons=== State-by-State-TIF-Spreadsheeet.xls is in the TIF folder. It was found on https://www.bettertogetherstl.com/economic-development-source-materials?rq=TIF and details the state level inception of TIF programs. St. Louis, has shape files: https://www.stlouis-mo.gov/data/tax-increment-financing-districts.cfm, also available as GEOJSON, but no date info. File name includes TIF_DISTRICTS_20190830, which implies it is up-to-date. St. Loius County has a KML file [https://worldmap.harvard.edu/data/geonode:slcounty_tif_aqt] with a PASSED field. It may or may not coincide... and might not be up to date.  There are also recent KMLs without dates here: https://hub.arcgis.com/datasets/8918f94ffeca407fa48db2d66bc74c52/dataAnd an official map, without data available seperately here: https://gisportal.stlouis-mo.gov/portal/apps/webappviewer/index.html?id=5cf974631f65487f868c5bfcfaa3e9f9 This says to contact Dale Ruthsatz (314) 657-3732 for help. Might also be able to get dates out of https://dor.mo.gov/business/tif/. Found the data on the dates here: https://www.stlouis-mo.gov/government/departments/sldc/documents/upload/Summary-of-TIF-Redevelopment-Projects-Updated-12-31-18.pdf  ogr2ogr -f PostgreSQL PG:"dbname=vcdb3" StLouisCounty.kml -nln tifstlouiscounty--There are 46 records and we'd have to work to pull the passed field out of the descriptionDROP TABLE tifstlouiscounty;  shp2pgsql -I "TIF_DISTRICTS_20190830.shp" tifstlouis | psql -U researcher -W -d vcdb3--118 records These were put together with the dates using some regexes: ^(.*) Retired.*$ ^(.*) Active.*$ ^(.*) Terminated.*$ ^(.*) Never Approved.*$ ^(.*) On Hold.*$ .* (\d\d?\-\w\w\w-\d\d).*
==TIF Analysis==
=Old Work Using Circles=
 
See: [[Enclosing Circle Algorithm]]
==Very Old Summary==
*Geocode.py
Entrepreneurship, small businesses and economic growth in cities:
*https://academic.oup.com/joeg/article/17/2/311/2930603
==Unbiased measure==
 
The number of startups affects the total area of the circles according to some function. The task is to find an unbiased measure of the area, which is not affected by the number of the startups, given the size and their distribution.
 
For the unbiased calculation of a measure in a different context see: http://users.nber.org/~edegan/w/images/d/d0/Hall_(2005)_-_A_Note_On_The_Bias_In_Herfindahl_Type_Measures_Based_On_Count_Data.pdf
 
==GIS Resources==
 
*https://www.census.gov/geo/maps-data/data/tiger-line.html
*https://www.census.gov/geo/maps-data/data/tiger.html
*http://postgis.net/features/
*https://en.wikipedia.org/wiki/GIS_file_formats
 
===Useful functions for spatial joins===
 
'''sum(expression)''': aggregate to return a sum for a set of records
'''count(expression)''': aggregate to return the size of a set of records
'''ST_Area(geometry)''' returns the area of the polygons
'''ST_AsText(geometry)''' returns WKT text
'''ST_Buffer(geometry, distance)''': For geometry: Returns a geometry that represents all points whose distance from this Geometry is less than or equal to distance. Calculations are in the Spatial Reference System of this Geometry. For geography: Uses a planar transform wrapper.
'''ST_Contains(geometry A, geometry B)''' returns the true if geometry A contains geometry B
'''ST_Distance(geometry A, geometry B)''' returns the minimum distance between geometry A and geometry B
'''ST_DWithin(geometry A, geometry B, radius)''' returns the true if geometry A is radius distance or less from geometry B
'''ST_GeomFromText(text)''' returns geometry
'''ST_Intersection(geometry A, geometry B)''': Returns a geometry that represents the shared portion of geomA and geomB. The geography implementation does a transform to geometry to do the intersection and then transform back to WGS84
'''ST_Intersects(geometry A, geometry B)''' returns the true if geometry A intersects geometry B
'''ST_Length(linestring)''' returns the length of the linestring
'''ST_Touches(geometry A, geometry B)''' returns the true if the boundary of geometry A touches geometry B
'''ST_Within(geometry A, geometry B)''' returns the true if geometry A is within geometry B
geometry_a '''&&''' geometry_b: Returns TRUE if A’s bounding box overlaps B’s.
geometry_a '''=''' geometry_b: Returns TRUE if A’s bounding box is the same as B’s.
'''ST_SetSRID(geometry, srid)''': Sets the SRID on a geometry to a particular integer value.
'''ST_SRID(geometry)''': Returns the spatial reference identifier for the ST_Geometry as defined in spatial_ref_sys table.
'''ST_Transform(geometry, srid)''': Returns a new geometry with its coordinates transformed to the SRID referenced by the integer parameter.
'''ST_Union()''': Returns a geometry that represents the point set union of the Geometries.
'''substring(string [from int] [for int])''': PostgreSQL string function to extract substring matching SQL regular expression.
'''ST_Relate(geometry A, geometry B)''': Returns a text string representing the DE9IM relationship between the geometries.
'''ST_GeoHash(geometry A)''': Returns a text string representing the GeoHash of the bounds of the object.
 
===Native functions for geography===
 
'''ST_AsText(geography)''' returns text
'''ST_GeographyFromText(text)''' returns geography
'''ST_AsBinary(geography)''' returns bytea
'''ST_GeogFromWKB(bytea)''' returns geography
'''ST_AsSVG(geography)''' returns text
'''ST_AsGML(geography)''' returns text
'''ST_AsKML(geography)''' returns text
'''ST_AsGeoJson(geography)''' returns text
'''ST_Distance(geography, geography)''' returns double
'''ST_DWithin(geography, geography, float8)''' returns boolean
'''ST_Area(geography)''' returns double
'''ST_Length(geography)''' returns double
'''ST_Covers(geography, geography)''' returns boolean
'''ST_CoveredBy(geography, geography)''' returns boolean
'''ST_Intersects(geography, geography)''' returns boolean
'''ST_Buffer(geography, float8)''' returns geography [1]
'''ST_Intersection(geography, geography)''' returns geography [1]
 
===Functions for Linear Referencing===
'''ST_LineInterpolatePoint(geometry A, double measure)''': Returns a point interpolated along a line.
'''ST_LineLocatePoint(geometry A, geometry B)''': Returns a float between 0 and 1 representing the location of the closest point on LineString to the given Point.
'''ST_Line_Substring(geometry A, double from, double to)''': Return a linestring being a substring of the input one starting and ending at the given fractions of total 2d length.
'''ST_Locate_Along_Measure(geometry A, double measure)''': Return a derived geometry collection value with elements that match the specified measure.
'''ST_Locate_Between_Measures(geometry A, double from, double to)''': Return a derived geometry collection value with elements that match the specified range of measures inclusively.
'''ST_AddMeasure(geometry A, double from, double to)''': Return a derived geometry with measure elements linearly interpolated between the start and end points. If the geometry has no measure dimension, one is added.
 
===3-D Functions===
'''ST_3DClosestPoint''' — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line.
'''ST_3DDistance''' — For geometry type Returns the 3-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units.
'''ST_3DDWithin''' — For 3d (z) geometry type Returns true if two geometries 3d distance is within number of units.
'''ST_3DDFullyWithin''' — Returns true if all of the 3D geometries are within the specified distance of one another.
'''ST_3DIntersects''' — Returns TRUE if the Geometries “spatially intersect” in 3d - only for points and linestrings
'''ST_3DLongestLine''' — Returns the 3-dimensional longest line between two geometries
'''ST_3DMaxDistance''' — For geometry type Returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units.
'''ST_3DShortestLine''' — Returns the 3-dimensional shortest line between two geometries
===Relevant PostgreSQL Commands=== '''\dt *.*''' Show all tables '''\q''' Exit table ===Specifities/ Outliers to consider===
New York (decompose)
Princeton area (keep Princeton unique)
Silicon Valley (all distinct)
===To make a circle=Unbiased measure==
SELECT ST_Buffer(''[desired point]''The number of startups affects the total area of the circles according to some function. The task is to find an unbiased measure of the area, ''[desired radius]''which is not affected by the number of the startups, 'quad_segs=8') given the size and their distribution. FROM ''[desired table]''quad_segs=8 indicates circle
[[File: CirclePostGIS.png]] For more precision the unbiased calculation of a measure in circlea different context see: SELECT ST_Transform(geometry( ST_Buffer(geography( ST_Transform( ''[desired point]'', 4326 )), ''[desired radius]')), 900913) FROM ''[desired table]''4326 and 900913 represent particular precision. ===Decimal Degrees=== We are working with longitude and latitude in decimal degrees. See httpshttp://enusers.wikipedianber.org/wiki~edegan/w/images/d/d0/Decimal_degrees When converting radius to km, multiply by 111.3199. For area, multiple by Hall_(111.31992005)^2=12,392.12013601_-_A_Note_On_The_Bias_In_Herfindahl_Type_Measures_Based_On_Count_Data.pdf
==Census Data==

Navigation menu