Difference between revisions of "Urban Start-up Agglomeration and Venture Capital Investment"
Line 112: | Line 112: | ||
*'''Concavity''' (in col6) as -D2_l/D1_{l+1}, or -1 times the central first over the central second: <math> --5/35 = 0.43 \approx 0.14</math> | *'''Concavity''' (in col6) as -D2_l/D1_{l+1}, or -1 times the central first over the central second: <math> --5/35 = 0.43 \approx 0.14</math> | ||
− | The concavity measure in col6 is therefore the -1 times central first difference divided by the central second difference, but the central first isn't computable for a step of 1 (and gives a weird answer anyway, as it straddles the observation in question). The central second difference isn't defined for either the first or last layer, and the backward first difference isn't defined for the first layer. It seems likely that we don't want the last layer and might get it because D1 is small and drives the ratio. | + | The concavity measure in col6 is therefore the -1 times central first difference divided by the central second difference, but the central first isn't computable for a step of 1 (and gives a weird answer anyway, as it straddles the observation in question). The central second difference isn't defined for either the first or last layer, and the backward first difference isn't defined for the first layer. It seems likely that we don't want the last layer and might get it because D1 is small and drives the ratio. |
+ | |||
+ | We could instead use the forward first difference - this isn't available for the last observation (for which we can't compute a second central anyway) but is available for the first observation - and increment the answer, much as Jim proposes decrementing it when using the backward layer. But seeing as we can't use the first observation we've gained nothing anyway! So we'll do Jim's method verbatim. | ||
+ | |||
+ | {{Colored box|title=Definition|content=For layer <math>l</math>, I will compute the concavity as -1 times the backward first difference from layer l+1 divided by the central second difference from l.} | ||
===Version 3.5 build notes=== | ===Version 3.5 build notes=== |
Revision as of 19:28, 1 December 2020
Academic Paper | |
---|---|
Title | Urban Start-up Agglomeration and Venture Capital Investment |
Author | Ed Egan, Jim Brander |
RAs | Peter Jalbert, Jake Silberman, Christy Warden, Jeemin Sim |
Status | Working paper |
© edegan.com, 2016 |
Contents
- 1 Working Paper
- 2 New Work
- 3 Old Work Using Circles
Working Paper
The last version of the paper, with the Houston narrative as the motivation, is available from SSRN: https://papers.ssrn.com/abstract=3537162
The Management Science submission version has a more conventional front end and is as follows:
A new version, written by Jim, is in the works!
New Work
Another round of refinements
- The elbow method is pretty questionable in its current form, so we are going to try using the elbow in the curvature (degree of concavity) instead.
Jim's notes on the curvature
Suppose we have a function f. Then what I have been calling the curvature is -f’’/f’. If f is a utility function this is the coefficient of absolute risk aversion and it has quite often been called curvature in that context. However, in differential geometry curvature is described differently, although it is quite similar. Mas-Collel and others have suggested calling -f’’/f’ the “degree of concavity” instead. I came across this definition on the internet:
- “The degree of concavity is measured by the proportionate rate of decrease of the slope, that is, the rate at which the slope decreases divided by the slope itself.”
The general rationale for using this measure is that it is invariant to scale, whereas the straight second derivative, f’’, is not invariant. The same applies to the second difference of course."
So our measure is the second difference divided by the first difference. However, it is not clear whether we should divide by the initial first difference or the second first difference or the average. I initially assumed that we should use the initial first difference. I now think that is wrong as it can produce anomalies. I think we should use the second (or “current”) first difference as the base.
Here is some data I sent before:
Layer | SSR | D1 | D2 | Concavity | Concavity |
1 | 0 | ||||
2 | 40 | 40 | -5 | 0.13 | 0.14 |
3 | 75 | 35 | -20 | 0.57 | 1.33 |
4 | 90 | 15 | -12 | 0.80 | 4 |
5 | 93 | 3 | -1 | 0.33 | 0.5 |
6 | 95 | 2 | -1 | 0.50 | 1 |
7 | 96 | 1 | -1 | 1.00 |
The column at the far right uses the second first difference as the base, which I now think is correct. The column second from the right uses the first first difference at the base.
Just to be clear, for layer 2 the first difference is 40 – 0 = 40. For layer 3 the first difference is 75 – 40 = 35. Therefore, for layer 2, the second difference is 35 – 40 = -5. I think this is what you would call the “middle second difference”. It tells how sharply the slope falls after the current layer, which is what we want.
To correct for scaling, we need to divide by a first difference. In the first concavity column, for layer 2 I use 5/40 = 0.125. For the last column for layer 2 I use 5/35 = 0.143.
Both approaches have a local max at layer 4, which is what we want. However, the second column from the right has a global max at the last layer, which is certainly not what we want. But is can happen at the end where the increments are very small.
So it seems pretty clear that we want to use the second first difference at the base. More precisely, to get the concavity for layer 3 we want to divide the middle second difference by the forward first difference. (It would probably also be okay to use the middle second difference divided by the middle first difference, but I have not checked that out).
Formalizing Jim's Notes
Jim calculates the following (examples using layer 2):
- The first-order backward difference in column D1: [math]f(x)-f(x-1) = 40-0=40[/math]
- The second-order central difference in column D2: [math]f(x+1)-2f(x)+f(x-1) = 75-2x40+0 = -5[/math]
- Concavity (in col5) as -D2_l/D1_l, or -1 times the backward first over the central second: [math] --5/40 = 0.125 \approx 0.13[/math]
- Concavity (in col6) as -D2_l/D1_{l+1}, or -1 times the central first over the central second: [math] --5/35 = 0.43 \approx 0.14[/math]
The concavity measure in col6 is therefore the -1 times central first difference divided by the central second difference, but the central first isn't computable for a step of 1 (and gives a weird answer anyway, as it straddles the observation in question). The central second difference isn't defined for either the first or last layer, and the backward first difference isn't defined for the first layer. It seems likely that we don't want the last layer and might get it because D1 is small and drives the ratio.
We could instead use the forward first difference - this isn't available for the last observation (for which we can't compute a second central anyway) but is available for the first observation - and increment the answer, much as Jim proposes decrementing it when using the backward layer. But seeing as we can't use the first observation we've gained nothing anyway! So we'll do Jim's method verbatim.
For layer [math]l[/math], I will compute the concavity as -1 times the backward first difference from layer l+1 divided by the central second difference from l.}
Version 3.5 build notes
In the process of building version 3.5, I noticed a discrepancy between tothulldensity and avghulldensity. This turned out to be correct. Both are measured in startups/hm2. Tothulldensity of the sum of the number of startups in hulls divided by the total hull area, whereas the avghulldensity is the average of the hull densities (computed as the number of startups in the hull divided by the hull area).
The revised script and dataset is v3-5. ResultsV3-5.xlsx has all of the old redundant results removed and has new sheets for Descriptives (copied over with renamed column names from Descriptives.xlx, which is generated by the .do file), as well as for the new scatterplot. Its Bar and Whisker is also stripped down to the bare essentials.
Heuristic Layer
I had previously calculated the heuristic layer by calculating the mean fracinhull (i.e., % of startups in economic clusters) for each percentage of the layer index (i.e., for 101 observations) and then fitting a cubic to it. I did this because excel can't handle fitting a cubic to the full data (i.e., all 148,556 city-year-layers). However, it is incorrect because of orthogonality issues in calculating mean square distances (I'm also unsure that the mean would be the best measure of central tendency). So I redid the plot using all the data, and calculated the cubic in STATA instead. See: inflection.do and inflection.log.The old result is in Fixing an issue below, and is x≈0.483879. The corrected result is x≈0.487717 (note that R2 has dropped to 92.43%):
- 2.737323 x^3 - 4.005114 x^2 + 0.3262405 x + 0.9575088≈0.481497 at x≈0.487717 [1]
I also calculated an inflectionlayer (as opposed to the heurflhlayer, where flh stands for fraction of locations in hulls, described above). This inflectionlayer is the first time that the second central difference in the share of startups in economic clusters switches sign. It is only possible to calculate this when there are at least 4 data points, as the central difference requires data from layer-1, layer and layer+1, and we need two central differences. The variable is included in dataset (and so do files, etc.) version 3-4 forwards.
However, the inflectionlayer is really meaningless. The sign of the second central switches back and forward due to integer effects and I can't find a straight forward algorithm to pick the "correct" candidate from the set of results. Picking the first one, which I currently pick, is completely arbitrary. There are a bunch of examples of the curves and the issue(s) in Results3-4.xlsx sheet 'Inflection'. I expect that if I put a bunch of time into this I could come up with some change thresholds to rule candidate answers in or out, but even then this isn't a good method.
Ultimately, the individual city-year inflection curves (i.e., across layers within a city-year) are just way too noisy. A variant of this noise problem is what makes the elbow method so problematic, but the noise is even worse with the inflection method. Using the heuristic result above (i.e., the one using all city-years) solves this noise problem by aggregating city-years together.
One complaint made about the heuristic results is that it is near the middle (i.e., it's 48.7717%, which happens to be near 50%). Although the nature of any HCA on geographic coords implies that the result is unlikely to the close to the bounds (0 or 100%) and more likely to be near the middle (50%), it could be in an entirely different place. This result (i.e., the heuristic layer at 48.7717%) characterizes the agglomeration of venture-backed startup firms. You'd get a very different number if you studied gas stations, supermarkets, airports, or banana plantations!
Comparing the Heuristic and R2 Layers
. tabstat nohull tothullcount tothullarea tothulldensity growthinv18 numdeals numstartups if regmaxr2==1, stats(p50 > mean sd N min max p10 p90) columns(statistics)variable
[2], point of inflection and local maximum. A quartic had an R2 of 90% at around x=0.44 (6.408 x^4 - 15.176 x^3 + 12.592 x^2 - 4.3046 x + 0.517≈0.00825284 at x≈0.440275). I tried a quintic and it had inflection points are x=0.33, 0.55, and 0.82, as well as local maxima at 0.39 and 0.90. Visually there seems to be something going on in the 20% to 40% uncovered range too, perhaps a bifurcation of results, which might be due to rounding issues.
Reasonable Exclusions
We started by including all U.S. cities that received at least $10m of growth venture capital in a year between 1980 and 2017 (inclusive). This gave us a list of 200 cities. However, we still have a lot of city-years with low number of startups.
What is a reasonable number of startups to analyze agglomeration? Three locations (which is at least three startups) is the bare minimum required for one hull without excluding outliers. And we only made images for places with 4 or more startups. A visual inspection suggests that while there is greater (relative) dispersion when counts are low, it isn't hugely problematic. It is also worth noting that excluding 4 or less would get rid of Farmer's Branch, Fort Lauderdale, and Tempe (and Bloomington, MN) in 2017, and 6 or less in 2017 would eliminate Cary and Addison, all of which are slightly problematic. Burlington, VT has 7 years in the data with more than 6 startups, and one with 6.
But everywhere (i.e., all 200 places) have 10 or more layers at some point in time. And everywhere has at least 6 years with 6 or more observations. Detroit has just 7 obs that meet this criteria, half the number of Germantown, MD and a third of Greenwood Village, CO.! Requiring a year to have six observations would reduce us to 4916 observations from 6702 (i.e., down to 73% of the data). Requiring 9 would reduce the data down to 3889 obs (58%), and we'd lose more observations as places wouldn't have enough to form a time-series. The answer then appears to be to limit to observations with 6 or more layers. We'll code the number of layers, and the max and min number of layers for a place, into the data.
Maximum R-squared
Using a maximum R-squared approach to find the 'best layer' for a city is inherently problematic. A city might have 5 layers in 1980 and 80 layers in 2017, and so using layer 40, say, irrespective of year is somewhat meaningless. There are several alternative that make more sense. One is to use the fraction unclustered, much like with the elbow approach. The other is to find the layer with a certain hull count (or as close to it as possible). Hulls might tend to be somewhat stable over time, so three hulls in Portland in 2017 will be centered in more or less the same place as three hulls in Portland in 2003. This turns out to be somewhat true, as seen in the image on the right, which uses the last time (highest level) that there are three hulls, or two for 1998 and 1993 (one of which is out of frame). One issue with this approach is that the highest level with a certain hull count is that hulls almost always contain just three points.
An obvious alternative approach is to use the first time (lowest level) that there are three hulls. There is a big difference in the layer numbers for this. See the queries below. Essentially, the HCA algorithm often takes an original hull apart and then takes the resulting hulls apart, giving a quadratic for hull count again layer. But, as is apparent in the image, this leads to big areas that would only be good for overlap analysis and not for identifying individual clusters.
--Lowest,highest, and lowest-highest, and first-after-peak level where there are three hulls SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2015' ORDER BY year,layer; --3, 41, 37, 33 (peak at 22,23,24,25) SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2010' ORDER BY year,layer; --3, 24, 24, 24 (peak at 18) SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2005' ORDER BY year,layer; --7, 23, 21, 21(peak at 18) SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='2000' ORDER BY year,layer; --3, 17, 17 , 17 (peak at 12,13,14) SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='1995' ORDER BY year,layer; --1(1), 8(1), 8(1), 8(1) (no peak-just flat but still take the highest layer with the nearest lower number of hulls) SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='1990' ORDER BY year,layer; --2(2), 5(2), 5(2), 5(2) (peak at 2,3,4,5 so 5 is first 'after' peak? but still take the highest layer with the nearest lower number of hulls) SELECT * FROM DisplayHullsFull WHERE place='Portland' and statecode='OR' AND year='1985' ORDER BY year,layer; --1(1), 1(1), 1(1), 1(1) (peak at 1 as that's the only layer with a hull but still take the highest layer with the nearest lower number of hulls)
Two further options are to find the lowest-highest and the first-after-peak. These often coincide. The lowest-highest finds the highest layer with x hulls and then works back down the layers taking the lowest in the continuous chain of x hulled layers. The first-after-peak finds the first layer with x hulls in or after the layers where there is the peak number of hulls. This last approach is a little problematic because sometimes there isn't a peak - its just flat - and it is inconceivable that there could be two or more peaks.
Computing the Lowest-Highest, running a PCA (see below) and using three factors in a regression to store R2, N, and adjusted R2, gives the results in the R2 On Hulls sheet of Images Review.xlsx. The first 20 placeIDs were checked (i.e., 10% of the sample). In most cases (16/20), the maximum R2 and R2-Adjusted coincided when considering only cases when N>=10. Note that R2=1 and R2 adjusted is missing for all cases where N <=4, and there there is very high volatility in both measures for N<10.
There were some draws out of scope (i.e., with N=9), and we should take the lowest(?) layer in the event of a draw (they will have the same N with prob ~1). There are some cases to look at:
- 10 - Austin (3885 layers): Came up 3 when it could have been anything up to 23 (with N>=10). Looking at a map, this might be ok. Austin's has some fairly homogenous big clusters in 2017.
- 9 - Atlanta (1414 layers) came up 8/8(N>=10) on R2 and 1 on adjusted R2, which was the biggest spread. All other discrepancies were of a single layer. It is hard to see either answer in the 2017 image, which looks like it has 2 clusters, or maybe 3 or 4. The r2 does jump at level 8 (to 0.50 from 0.28 in the layer before, with the previous highest being 0.39 at level 1), but level 8 has N=13 and is the second highest R2adj at 0.327 as compared with 0.332 in level 1.
- 13 - Bellevue (1181 layers) had R2 and R2adj in agreement at level 2, but there were 7 levels with N>=10 and 13 levels over all. 2 looks ok on the map in 2017 (3 or 4 would likely be better). I expect that this is a case of a place that had dramatic growth...
- 6 - Alpharetta (580 layers) comes it at 1 for both R2 and R2adj even with N<10 but has 6 layers, 4 of which have N>=10. It looks like a pretty homogeneous place in Georgia.
We also need to check some of the really big places:
- San Fran (160 - 12,946) is 52 on both (N=12, >=10). Picking any N>=10 up to N=15 will always find the largest hull count. At N=16 it finds the smallest large hull count (41 to 47 are all N=16).
- Boston (22 - 3,506) is 17 with N=10 and 11 (N=17) with N>10 on both measures.
- New York (122 - 11,466) is 27 with both measures with N>10 (N=18). It is 67 and then 66 at N=10 on both! N=12's 65 hulls is in fourth place.
- Palo Alto (134 - 3,492) is 1 with both measures. Even allowing N<10 there isn't an issue until N=5.
The analysis was repeated but using only 1995 to 2017 inclusive (i.e., the modern era). The results were much more stable for N>=10. 20 out of 20 maximum R2 hull counts were also maximum adjusted R2 counts. N>=10 also seemed much less contentious, though N=11 (just under 50%) would have given the same result and N>=12 would have change only one result. Austin now maximized at 11 hulls - which is pretty much in the middle of its set. Boston was the same as before: 17 hulls with N=10 and 11 hulls with N>=11. Given the shape using N=11 or 12 might be preferred. New York now follows a similar pattern to Boston, and is 67 hulls with N>=10 and 27 hulls with N>=11. Again, the higher N result seems preferred. Palo Alto is now 18 hulls (out of 33 with 19 N>=10), which is a shame but hey. SF maxes out at 52 hulls, with N=12.
So, aside from the Palo Alto result, this is clearly a greatly preferred spec. I think N>=12 (just over half of the 23 years) is fair as a cut off. Also Portland, OR, maximizes at 4 hulls on both measures with N=15...
The only thing that we should change is the R2 estimation regression. Up until this point, we've been using:
pca nosinglemulti nopair nohull totsinglemulticount totpaircount tothullcount totpairlength tothullarea predict pc1 pc2 pc3, score quietly capture reg growthinv17lf pc1 pc2 pc3 if placeid==`placeid' & numclusters==`clusters' & lowesthighestflag==1 & year>=1995 & year <.
Issues and Solution
There are two issues. Why are we using a PCA? Just to get the number of regressors down? The dimensionality isn't that high. And more importantly, one or more PCA components may be picking up a scale effect. We don't want to use the scale regressors in R2 estimation, because they might drive the R2.
So the solution is to first regress to estimate the scale effect and then create residuals:
reg growthinv17lf growthinv17l numdealsl numstartupsl i.year i.placeid if lowesthighestflag==1 & year>=1995 & year <. predict growthinv17lfres, residuals
We then can't use nohull in the regress so our variable list is as follows:
quietly capture reg growthinv17lfres nosinglemulti nopair totsinglemulticount totpaircount tothullcount totpairlength tothullarea if placeid==`placeid' & numclusters==`clusters' & lowesthighestflag==1 & year>=1995 & year <.
Again using a cut off of 12, there's some slight divergence between the R2 and adjusted R2 maximization points, but not much (2 out of 20). The results looks pretty much like before for the first 20 with some minor differences. 12 out of 20 places have the same answer. The 7 of the 8 remaining are different by just a hull or two. Only Austin is different, with now 21 hulls rather than 11. The R2 adjusted for Austin maximizes at 7.
Checking the other cities leads to the following observations:
- Portland is still maximized at 4 hulls
- Boston maximizes at 16 hulls on R2 and 5 on R2 adjusted. Recall that it was at 11 with N>=11.
- New York now maximizes at just 17 hulls, which is a massive drop. But it does look like a clean interior solution.
- Palo Alto is down at 12 hulls.
- SF is at 21 hulls, way down from its old value in the 50s.
- The large set results look more interior and stable than before... the cutoff of 12 looks reasonable too.
Revisiting Portland
Portland doesn't have 4 clusters for any year before 2000, or for 2007 and 2009. For 5 year multiples the layers are as follows:
2000 15 2005 19 2010 19 2015 33
The resulting map has much more adjacency than overlap. Measuring the nearest hull edge and center distance for each hull in a year to each hull in the next year and averaging would compute two measures of hull persistence. The overlap area from year to year, either in total or as a fraction of the second year's (or smaller years) total area, would provide another measure of persistence.
What do we want to know?
So now we have 200 (ish) cities with their optimally selected hulls (we chose the best hull count that is constant from 1995 to 2017 using the lowest-highest occurrence of that count). And now we'd like to know:
- Whether having fewer hulls is associated with growth, controlling for size -- it is: nohulll -.168335***
- Whether having a greater hull density is associated with growth, controlling for size -- it is: tothulldensityl .0730263***
- Whether having a higher fraction of locations inside hulls is associated with growth, controlling for size: -- it isn't: frachull -.1345406*
- Whether having hulls closer together is associated with growth, controlling for size. We should put these layer in the list to build avghulldisthm and avgdisthm (see line 1335).
Houston, TX
We also want to know about Houston, TX.
tab place placeid if placeid >50 & placeid <100 //Houston is 83 tab chosenhullspcar2inc if placeid==83 //10
SELECT year, layer FROM MasterLayers WHERE place='Houston' AND statecode='TX' AND layer=lowesthighestlayer AND numclusters=10 ORDER BY year, layer; 1990 20 2000 45 2005 50 2010 30 print myDict["Houston_TX"]; [-95.836717, -95.014608, 29.515925, 30.155908]
Useful place data:
- The Sears Building is at 29.8055662,-95.7145812 [3]
- The four corners of the innovation corridor are [4]:
- 100 Hogan St Houston, TX 77009 29.774004, -95.367127
- Second Ward Houston, TX 29.759421, -95.346044
- Orange Lot Houston, TX 77054 29.682241,
- Buffalo Speedway Houston, TX 77025 29.695301, -95.426753
The policy evaluation methodology:
- Find the optimal number of hulls. Choose the corresponding layer for 2017, or the closest layer to it. Call this the Houston2017 layer.
- Run a regression across cities to estimate the effect of the number of hulls, total hull area, avg hull distance, fraction in hulls, and perhaps other measures, as well as the scale measures, to generate coefficients.
- Plug in Houston2017's values.
- Create an artificial Houston2017X layer that moves 1/x (x=4, for example) of all of Houston's startups into a single 25hmsq innovation district. Evaluate it!
- Create an artificial Houston2017Y layer that moves 1/x of all of Houston's startups into the proposed innovation corridor. Evaluate it!
- Also calculate the expected values for a city with Houston's characteristics (?) and plug those in.
Working off this regression:
xtreg growthinv17lf nohull nohullsq frachull frachullsq tothullarea tothullareasq avghulldisthm avghulldisthmsq, be --------------------------------------------------------------------------------- growthinv17lf | Coef. Std. Err. t P>|t| [95% Conf. Interval] ----------------+---------------------------------------------------------------- nohull | .3839447 .0563233 6.82 0.000 .2727974 .495092 nohullsq | -.0098748 .0024791 -3.98 0.000 -.014767 -.0049826 frachull | -1.695035 2.044876 -0.83 0.408 -5.730354 2.340284 frachullsq | -.6243502 1.534811 -0.41 0.685 -3.653117 2.404416 tothullarea | -.0007168 .0003133 -2.29 0.023 -.001335 -.0000985 tothullareasq | 1.44e-07 5.52e-08 2.60 0.010 3.48e-08 2.53e-07 avghulldisthm | .0157895 .0068875 2.29 0.023 .0021978 .0293813 avghulldisthmsq | -.0001506 .0000556 -2.71 0.007 -.0002602 -.0000409 _cons | 3.629909 .5882279 6.17 0.000 2.469112 4.790707 ---------------------------------------------------------------------------------
In 2017, Houston had the following characteristics:
year layer growthinv17 growthinv17l nohull nohullsq frachull frachullsq tothullarea tothullareasq avghulldisthm avghulldisthmsq 2017 20 76.533 4.350703652 7 49 0.595744681 0.354911725 268.2674944 71967.44853 100.6191636 10124.21608 1 0.3839447 -0.0098748 -1.695035 -0.6243502 -0.0007168 1.44E-07 0.0157895 -0.0001506 3.629909 2.6876129 -0.4838652 -1.009808085 -0.221589206 -0.19229414 0.010363313 1.588726284 -1.524706942 3.629909 4.484347923 88.61914544
We are going to do the following:
- Calculate a base effect using the real data, much as above.
- Define some target areas - the innovation corridor and a new innovation district. The innovation districts location will have to be picked based on the data and its size determined by the optimal hull area (or using the 25hmsq suggested by the economists).
- Take the 2017 Houston data (providing we don't need f.growthinv or can come up with it seperately), reallocate a randomly selected 25% of startups to a random location in the target area and recompute the hulls and layers using the HCA script.
- Compute the effect on growth and do a back-of-the-envelope calculation for whether it is worth TIF financing.
Note that 0.001 decimal degrees is 111m or 1.11hm [5], so 2.5 hm (for a total of 5 per side, and so 25hmsq) is 0.002252 decimal degrees and 1.617hm (giving a hull area of 10hmsq, which at 0.95 startups per hectare -- essentially 1 per square block, accommodates 25% of Houston's 47 active startups in 2017 in 10.46025 hectares). Using centroid analysis on Houston's 2017 hulls, the optimal location for a 3.2 block by 3.2 block innovation district is just above the Galleria area in Uptown [6]. There is a business park directly adjacent to these coordinates which ia about the right size for the innovation district and currently includes the offices of Schlumberger, Alert Logic, Hire Priority, and others.
Supposing that all of Houston's 43 active startups were relocated, it doesn't much matter where you put them. One questions is whether the 4 sq mile innovation corridor would then be an improvement over the status quo, and how much worse it would be than a district that of the implied optimum density? Such a district, using 0.95 hectares per location, would have an implied hull area of around 45 hectares, or a 0.003011 decimal degrees deviation in four directions from a point (to give 4 corners of a square).
Group Means Regression
Once we have found optimum hull specifications within a city, they will not vary, or will vary very little, over time. We therefore want to use a between panel regression, also called a group means regression. See the following:
- The spec on p34 of http://people.stern.nyu.edu/wgreene/Lugano2013/Greene-Chapter-11.pdf
- The explanation at the top of https://stats.stackexchange.com/questions/83462/the-between-estimator-in-panel-data
- https://www.stata.com/support/faqs/statistics/between-estimator/
- The definition in https://www.stata.com/manuals13/xtxtreg.pdf
Image Analysis
Building Images
Use B&W:
- 50% grey at 75% transparent for city outline
- 50% grey at 50% transparent for hull
- Black + for singleton (size 10pt), 50% transparent when in pair or hull
- Black * for multiton (size 10pt), 50% transparent when in pair or hull
- Black line for pair
Informs colors:
- Orange: R240 G118 B34
- Blue: R31 G61 B124
- Green: R129 G190 B65
Town: Blue, 75% tranparent
Working with ArcPy
First version saved as E:\projects\agglomeration\Test.mxd
If the basemaps aren't available, connect to ERSI online using the icon in the system tray[7]
Basic set up is:
- displayhulls layer=1 grey50%,trans50%,noborder
- displaymultitons asterisk4,black,size10,trans50%
- displaysingletons cross1,black,size10,trans50%
- placetigerarea grey50%,trans75%,grey80%border
- Reference
- Basemap - World Light Grey Canvas
Change dataframe map to GCS 1984 and display to decimal degrees Saved as: FullHullReview2017.mxd
Open python window then:
#Load the map and create the dataframe mxd = arcpy.mapping.MapDocument(r"E:\projects\agglomeration\FullHullReview2017Colored.mxd") df = arcpy.mapping.ListDataFrames(mxd)[0] df.credits="(c) Ed Egan, 2019" #Now do the image generation! myDict = {} #myDict["Burlington_VT"] = [-73.27691399999999,-73.176383,44.445927999999995,44.539927] #... See the entries in E:\projects\agglomeration\arcpydict.txt for location in myDict: newExtent = df.extent newExtent.XMin = myDict[location][0] newExtent.XMax = myDict[location][1] newExtent.YMin = myDict[location][2] newExtent.YMax = myDict[location][3] df.extent = newExtent df.panToExtent(df.extent) filename="E:\\projects\\agglomeration\\Images\\"+location +".png" arcpy.mapping.ExportToPNG(mxd, filename, resolution=144) #arcpy.RefreshActiveView() #mxd.save()
If you run into issues, it's useful to test things step by step:
#Test some exports, note that if the geocords are in a different system from the extent parameters, you'll be exporting blank images! arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BurlingtonAuto1.png", resolution=144) print df.extent #-73.5977988105381 44.1185146554607 -72.8022011894619 44.680787974202 NaN NaN NaN NaN #Test with Burlington, VT newExtent = df.extent newExtent.XMin =-73.219086 newExtent.XMax =-73.19356 newExtent.YMin = 44.460346 newExtent.YMax = 44.48325 df.extent = newExtent arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BurlingtonAuto2.png", resolution=144) df.panToExtent(df.extent) arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BurlingtonAuto3.png", resolution=144) #Test with Buffalo, NY (while looking at Burlington, VT) newExtent = df.extent newExtent.XMin =-78.95687699999999 newExtent.XMax =-78.795157 newExtent.YMin =42.826023 newExtent.YMax =42.966454999999996 df.extent = newExtent arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BuffaloAuto1.png", resolution=144) df.panToExtent(df.extent) arcpy.mapping.ExportToPNG(mxd, r"E:\projects\agglomeration\Images\BuffaloAuto2.png", resolution=144)
Remove basemap credits[8]:
- Click on World map layer
- Insert->Dynamic Text->Service Level Credits
- Set the symbol color to no color
Help pages:
- ArcPy Functions: https://desktop.arcgis.com/en/arcmap/10.3/analyze/arcpy-functions/alphabetical-list-of-arcpy-functions.htm
- Export to PNG: http://desktop.arcgis.com/en/arcmap/10.3/analyze/arcpy-mapping/exporttopng.htm
- Data Frames: http://desktop.arcgis.com/en/arcmap/10.3/analyze/arcpy-mapping/dataframe-class.htm
- Map Documents: http://desktop.arcgis.com/en/arcmap/10.3/analyze/arcpy-mapping/mapdocument-class.htm
- Text Elements: http://desktop.arcgis.com/en/arcmap/10.3/analyze/arcpy-mapping/textelement-class.htm
- Data access module: http://desktop.arcgis.com/en/arcmap/10.3/analyze/arcpy-data-access/what-is-the-data-access-module-.htm
Analyzing the results
The following issues became apparent (Counts out of 191 cities with 4 or more locations in 2017 and greater than $10m inv in a year over all time):
- Encapsulation - A small number of place boundaries are fully encapsulated inside of other geoplaces. We need to determine when this happens. The initial list includes Addison, Culver City, Santa Monica (might be extreme adjacency), and others. We need a query to work this out.
- Concavity (6 marked) - Some place boundaries are fairly extremely concave (for instance, Fort Lauderdale, FL, Birmingham, AL, Boulder, CO). This in itself isn't too much of an (addressable) issue. However, a small number of places have concavity and adjacency issues, which together lead to hull overlaps. This is ameriorated by removing outliers, but we should check them (e.g., Cary, NC, Morrisville, NC, the city next to Newark, CA, Roswell, GA)
- Adjacency (23 marked) - The entire of the valley has an adjacency issue (these weren't marked), as do a fairly large number of other cities. See Newport Beach, CA and others. Lexington, MA provides a nice example of containment despite adjacency. As does Cambridge, MA with the right outliers removed.
- Outliers (52 marked) - perhaps as many as 1 in 5 cities had one or two obvious outliers on a visual inspection.
Critical checks:
- Addison, TX: encapsulation
- Culver City, CA: encapsulation
- Oklahoma City, OK: scale issue (one outlier in State House?)
- Portland, ME: scale issue. Though Portland's place boundary contains an island and some sea area, making it very wonky, this isn't an issue.
- San Juan Capistrano, CA: Just 2 locations (1 singleton and 1 multiton) and no hull. Note that we might want to omit this place.
We might also want to check Twin Cities. Here's the results:
place statecode Issue Reason Champaign IL No Urbana isn't in the data Phoenix AZ No Mesa isn't in the data San Francisco CA No Twinned with Oakland! Oakland CA No Twinned with SF! Stamford CT Yes Norwalk Norwalk CT Yes Stamford New Haven CT No Bridgeport isn't in the data Tampa FL No St. Petersburg Portland ME No South Portland isn't in the data Minneapolis MN Yes St. Paul Bloomington MN No Normal isn't in the data Durham NC Yes? Raleigh Raleigh NC Yes? Durham Portland OR No Vancouver isn't in the data Bethlehem PA No Allentown isn't in the data Dallas TX Yes? Fort Worth Fort Worth TX Yes? Dallas Seattle WA No Tacoma isn't in the data
A visual inspection suggests that Stamford and Norwalk might be better combined but don't really matter. Minneapolis and St. Paul are pretty separate and really separate after removing outliers. Rarleigh and Durham are completely separate (Cary is more of an issue), as are Dallas and Fort Worth and SF and Oakland.
Encapsulation
The data suggests that there are 12 places that encapsulated by 7 other places:
SELECT A.place, A.statecode, B.place AS ContainedPlace, B.statecode AS ContainedStatecode FROM placetigerarea AS A JOIN placetigerarea AS B ON st_contains(ST_ConvexHull(A.placegeog::geometry),ST_ConvexHull(B.placegeog::geometry)) WHERE NOT (A.place=B.place AND A.statecode=B.statecode); --12
place statecode containedplace containedstatecode Los Angeles CA Culver City CA Los Angeles CA Torrance CA Los Angeles CA El Segundo CA Los Angeles CA Santa Monica CA San Jose CA Santa Clara CA Fremont CA Newark CA Oakland CA Emeryville CA Cary NC Morrisville NC New York NY Jersey City NJ Dallas TX Richardson TX Dallas TX Addison TX Dallas TX Farmers Branch TX
We could ignore, flag or discard these cites. A visual inspection suggests that Culver City, Torrence, El Segundo, Jersey City, and probably Richardson, Newark, and maybe Cary don't have any issues. Santa Monica, Santa Clara, Emeryville, Farmer's Branch and Addison do look like they have issues, but with the exception of Farmer's Branch and Addison, these are big cites and with lots of locations, so the issue should be washed out by removing outliers or otherwise appropriately choosing the clustering layer.
After reflection, we decided to deal merge the following places (listing geoids):
Santa Monica 0670000 -> LA 0644000 Santa Clara 0669084 -> San Jose 0668000 Emeryville 0622594 -> Oakland 0653000 Farmer's Branch 4825452 -> Dallas 4819000 Addison 4801240 -> Dallas 4819000 Newark 0650916 -> Freemont 0626000 Morrisville 3744520 -> Cary 3710740
Intersecting All Encompassing Hulls
52 places have all encompasing hulls intersect in our data (i.e., there are 26 intersections). This includes some of the places that suffer from encapsulation (especially Santa Monica, Santa Clara, Emeryville, Farmer's Branch and Addison). So beyond encapsulated places, there are an additional 20 intersections. These are:
place statecode intersectedplace intersectedstatecode Alpharetta GA Roswell GA Bellevue WA Redmond WA Boston MA Cambridge MA Boston MA Somerville MA Campbell CA San Jose CA Centennial CO Greenwood Village CO Cupertino CA San Jose CA Fremont CA Newark CA Greenwood Village CO Centennial CO Irvine CA Newport Beach CA Los Altos CA Mountain View CA Menlo Park CA Redwood City CA Milpitas CA San Jose CA Mountain View CA Palo Alto CA Mountain View CA Sunnyvale CA Newton MA Wellesley MA Phoenix AZ Tempe AZ Redwood City CA San Carlos CA San Jose CA Sunnyvale CA Santa Clara CA Sunnyvale CA
At a glance, most of these appear big or very big startup ecosystems. Accordingly, any process that deals with outliers (etc.) should address this issue.
First Estimation(s)
Note that this subsection is now very out of date!
At this stage we have MasterLevels.txt and MasterLayers.txt as datafiles. MasterLevels.txt contains only layers corresponding to levels 0 through 12 and also has noothergeoms and avgdisthm as variables.
The questions we need to answer are: 1) Is there an agglomeration effect? 2) Which level or layer best describes a city (perhaps for a year, or perhaps over its life)?
We can just pick a level (say 25 hectares) and run a within-city regression:
. xtreg growthinv17l_f growthinv17l nosingletonl totmultitoncountl totpaircountl tothullcountl avgpairlengthl avghulldensi > tyl avgdisthml i.year if level==6, fe cluster(placelevelid) Fixed-effects (within) regression Number of obs = 5,027 Group variable: placelevelid Number of groups = 198 R-sq: Obs per group: within = 0.4097 min = 3 between = 0.8310 avg = 25.4 overall = 0.5974 max = 37 F(44,197) = 78.20 corr(u_i, Xb) = 0.4087 Prob > F = 0.0000 (Std. Err. adjusted for 198 clusters in placelevelid) ----------------------------------------------------------------------------------- | Robust growthinv17l_f | Coef. Std. Err. t P>|t| [95% Conf. Interval] ------------------+---------------------------------------------------------------- growthinv17l | .1388644 .0176074 7.89 0.000 .1041412 .1735877 nosingletonl | .1447935 .0402488 3.60 0.000 .0654197 .2241673 totmultitoncountl | .0909545 .0481349 1.89 0.060 -.0039714 .1858803 totpaircountl | .1724367 .0383185 4.50 0.000 .0968695 .2480039 tothullcountl | .7120504 .0467915 15.22 0.000 .6197739 .8043269 avgpairlengthl | -.0219417 .023633 -0.93 0.354 -.0685478 .0246645 avghulldensityl | .049566 .0202756 2.44 0.015 .0095808 .0895511 avgdisthml | .0933327 .076309 1.22 0.223 -.0571546 .2438201
Or:
. xtreg growthinv17l_f growthinv17l numstartups numstartupssq nosinglemulti nosinglemultisq nohull nohullsq nopair nopairs > q i.year if level==6, fe cluster(placelevelid) Fixed-effects (within) regression Number of obs = 5,773 Group variable: placelevelid Number of groups = 200 R-sq: Obs per group: within = 0.4017 min = 4 between = 0.8425 avg = 28.9 overall = 0.5708 max = 37 F(45,199) = 72.39 corr(u_i, Xb) = 0.4222 Prob > F = 0.0000 (Std. Err. adjusted for 200 clusters in placelevelid) --------------------------------------------------------------------------------- | Robust growthinv17l_f | Coef. Std. Err. t P>|t| [95% Conf. Interval] ----------------+---------------------------------------------------------------- growthinv17l | .220333 .018699 11.78 0.000 .1834595 .2572066 numstartups | .0062875 .0022944 2.74 0.007 .001763 .0108119 numstartupssq | -8.04e-07 1.14e-06 -0.70 0.483 -3.06e-06 1.45e-06 nosinglemulti | .0648575 .0168134 3.86 0.000 .0317023 .0980127 nosinglemultisq | -.0021614 .0006336 -3.41 0.001 -.0034108 -.000912 nohull | .1747691 .0255105 6.85 0.000 .1244636 .2250747 nohullsq | -.0057148 .0012164 -4.70 0.000 -.0081136 -.003316 nopair | .0896908 .0248207 3.61 0.000 .0407455 .1386361 nopairsq | -.0097196 .0024153 -4.02 0.000 -.0144825 -.0049567
Note that the following don't work, either alone or with other variables (including numstartups and numstartupsq), probably because they are third-order effects:
. xtreg growthinv17l_f growthinv17l avghulldensity avghulldensitysq avgpairlength avgpairlengthsq avgdisthm avgdisthmsq i. > year if level==6, fe cluster(placelevelid) Fixed-effects (within) regression Number of obs = 5,027 Group variable: placelevelid Number of groups = 198 R-sq: Obs per group: within = 0.3579 min = 3 between = 0.5926 avg = 25.4 overall = 0.3753 max = 37 F(43,197) = 2152.49 corr(u_i, Xb) = 0.2529 Prob > F = 0.0000 (Std. Err. adjusted for 198 clusters in placelevelid) ---------------------------------------------------------------------------------- | Robust growthinv17l_f | Coef. Std. Err. t P>|t| [95% Conf. Interval] -----------------+---------------------------------------------------------------- growthinv17l | .2668427 .0208574 12.79 0.000 .2257101 .3079752 avghulldensity | .0008076 .0003875 2.08 0.038 .0000433 .0015718 avghulldensitysq | -1.14e-07 8.80e-08 -1.29 0.197 -2.87e-07 5.97e-08 avgpairlength | -.0018724 .0036128 -0.52 0.605 -.0089972 .0052524 avgpairlengthsq | -.0000296 .0000596 -0.50 0.620 -.0001471 .0000879 avgdisthm | .001429 .0035371 0.40 0.687 -.0055465 .0084045 avgdisthmsq | -.000012 .0000157 -0.76 0.447 -.0000429 .000019
We can also do it with fractions and their squares (omit fracsinglemulti). However at level 6 (25 hectare), pairs seems more important than hulls:
. xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracpair fracpairsq frachull frachullsq i.year if level==6, > fe cluster(placelevelid) Fixed-effects (within) regression Number of obs = 5,773 Group variable: placelevelid Number of groups = 200 R-sq: Obs per group: within = 0.3919 min = 4 between = 0.8456 avg = 28.9 overall = 0.5274 max = 37 F(43,199) = 62.34 corr(u_i, Xb) = 0.4268 Prob > F = 0.0000 (Std. Err. adjusted for 200 clusters in placelevelid) ------------------------------------------------------------------------------- | Robust growthinv17~f | Coef. Std. Err. t P>|t| [95% Conf. Interval] --------------+---------------------------------------------------------------- growthinv17l | .2481436 .0181447 13.68 0.000 .2123631 .283924 numstartups | .0100673 .0019792 5.09 0.000 .0061644 .0139702 numstartupssq | -6.92e-06 2.03e-06 -3.41 0.001 -.0000109 -2.92e-06 fracpair | .7540177 .3709212 2.03 0.043 .0225772 1.485458 fracpairsq | -1.936 .7030942 -2.75 0.006 -3.322472 -.5495289 frachull | .1969853 .562807 0.35 0.727 -.9128457 1.306816 frachullsq | -.1491389 .3878513 -0.38 0.701 -.9139649 .615687
Whereas across all levels:
. xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracpair fracpairsq frachull frachullsq i.year, fe cluster(p > lacelevelid) Fixed-effects (within) regression Number of obs = 76,623 Group variable: placelevelid Number of groups = 2,600 R-sq: Obs per group: within = 0.3956 min = 4 between = 0.8330 avg = 29.5 overall = 0.5279 max = 37 F(43,2599) = 827.33 corr(u_i, Xb) = 0.4143 Prob > F = 0.0000 (Std. Err. adjusted for 2,600 clusters in placelevelid) ------------------------------------------------------------------------------- | Robust growthinv17~f | Coef. Std. Err. t P>|t| [95% Conf. Interval] --------------+---------------------------------------------------------------- growthinv17l | .2522524 .0049725 50.73 0.000 .2425019 .2620028 numstartups | .0100677 .0005323 18.91 0.000 .0090239 .0111114 numstartupssq | -6.95e-06 5.51e-07 -12.62 0.000 -8.03e-06 -5.87e-06 fracpair | .4152028 .0956859 4.34 0.000 .2275745 .6028311 fracpairsq | -.9753654 .1271631 -7.67 0.000 -1.224717 -.7260141 frachull | -.8606939 .1231519 -6.99 0.000 -1.10218 -.6192081 frachullsq | .495785 .0976557 5.08 0.000 .3042942 .6872758
This is probably because of the variation in hulls vs pairs at level 6, which has lots of cities with nothing in pairs and everything in hulls. We might want to 'control' for cityarea by restricting our within city analysis to large enough cities. A 25 hectare target area might be too encapsulating -- more than 10% of observations are 100% in hulls:
. su frachull if level==6, det frachull ------------------------------------------------------------- Percentiles Smallest 1% .2162162 .1153846 5% .3333333 .1153846 10% .4285714 .1428571 Obs 6,032 25% .6 .1428571 Sum of Wgt. 6,032 50% .8 Mean .7539126 Largest Std. Dev. .2209328 75% .9666666 1 90% 1 1 Variance .0488113 95% 1 1 Skewness -.6390206 99% 1 1 Kurtosis 2.400947
I tried using R2 to select levels, but only the second spec had an interior solution (at level 3):
forvalues i=1/12 { quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq avghulldensity avghulldensitysq avgpairlength avgpairlengthsq avgdisthm avgdisthmsq i.year if level==`i', fe cluster(placelevelid) display "Reg: 1 level: " `i' " r2-within: " `e(r2_w)' } forvalues i=1/12 { quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq i.year if level==`i', fe cluster(placelevelid) display "Reg: 2 level: " `i' " r2-within: " `e(r2_w)' } forvalues i=1/12 { quietly capture xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracsinglemulti fracsinglemultisq fracpair fracpairsq frachull frachullsq i.year if level==`i', fe cluster(placelevelid) display "Reg: 3 level: " `i' " r2-within: " `e(r2_w)' }
Reg: 2 level: 1 r2-within: .39552569 Reg: 2 level: 2 r2-within: .3998779 Reg: 2 level: 3 r2-within: .40348691 Reg: 2 level: 4 r2-within: .40130097 Reg: 2 level: 5 r2-within: .39931203 Reg: 2 level: 6 r2-within: .39707046 Reg: 2 level: 7 r2-within: .39366909 Reg: 2 level: 8 r2-within: .38957831 Reg: 2 level: 9 r2-within: .38398108 Reg: 2 level: 10 r2-within: .37662604 Reg: 2 level: 11 r2-within: .36999057 Reg: 2 level: 12 r2-within: .38393843
However, doing this by exgroup (0 t0 3), gives the same result - level 3 - for each exgroup.
Alternative approaches are to use AIC/BIC, or maybe entropy. For the same set of variables, in the same model, AIC/BIC are minimized when R2 is maximized, they are only useful when choosing the combination of the variables/estimation and the level. And it seems we can only do entropy one variable at a time:
entropyetc nohull if level==1
TIF data
See the TIF Project page for details on the TIF data. The section that was originally on this page was moved there.
TIF Analysis
Using Burlington, VT at the elbow, we plotted the hulls and TIFs. This is in WorkingMapV2.mxd. Some potential measures include:
- Overlapping Hull and TIF area
- Fraction of Hull area covered by TIFs
- Non-overlapping Hull and TIF area
- Adjacent Hull and TIF area (and non-adjacent) -- expand out hull area to allow for new inclusion without affecting density
- Count or Fraction of locations in TIF areas
The data needs to be reprocessed to be in the format:
place statecode year tifname geog
Or at least start-year and end-year...
Plotting Chicago (-87.7, 41.9) in 2017 using chosenlayer (layer=92) reveals some insights. There aren't any TIFs in the city core, but there are lots of startups. And conversely likewise, there are lots of TIFs covering suburbs that have few if any startups. There are, however, some areas of overlap. And some possible pattern of startups appearing exactly where TIFs aren't -- In a few cases (one notable), the startups are essentially surrounded.
And Houston in 2017, using layer=20 (as no chosen exists and 20 has max nohull at 7), it seems that there is little overlap between hulls and TIRZ but considerable overlap between Rice's "Innovation Corridor". The Sears Building, at (29.7255505,-95.3887225) [9] and Houston Exponential's building at (29.7521572,-95.3778919) [10] are both in the Midtown TIF, albeit at opposite ends! There are other TIFs impacted by Rice's proposal too. The East Downtown, Market Square, Montrose, and at least three others all intersect the "planned" innovation corridor.
Note that The Sears Building Area wasn't in the Midtown TIF when they did the last bond issue in 2015 (see map on A-1 [11]). They raised $13.5m in 2015 to pay off existing debt, and then raised $39.31m in 2017 [12] to conduct the Plan, pay off debt, etc. The area was in the map in the 2017 issue. The area around the Sears building "contains virtually no taxable property and therefore will produce no significant Captured Appraised Value".
Density Maps
It's useful to lay each year's chosen hulls on top of each other over time (from say 1995 to present, or using the layer with the max number of hulls if the year never achieves the chosen hull count). However, to do this we should expand the hulls, because all hulls have points at their corners and most hulls have points only at their corners. I propose using ST_Expand to increase X and Y distances separately. The method would be to take ((Centroid's X-ST_XMin)+(ST_Xmax-Centroid's X)/2) as the X expansion distance and likewise for Y.
Correction - we definately don't want ST_Expand as it creates a bounding box. We want ST_Buffer, but how big a buffer? Half of the maximum width is easy.
ST_Length(ST_LongestLine( (SELECT geom FROM mylayer WHERE gid=1), (SELECT geom FROM mylayer WHERE gid=1))
Also, transparency doesn't stack within a layer... See https://gis.stackexchange.com/questions/91537/how-to-vary-the-transparency-of-symbols-within-a-single-layer-in-arcmap Rather than half of the maximum width, we could use the average distance from a corner to another corner (i.e., pretty much between locations) divided by two. Use ST_DumpPoints(geometry geom) to recover the corners of the convex hulls. We should also put some other places on the map (long,lat):
- Mercury Fund [13] at (-95.4306675,29.7331018)
- The Galleria [14] at (-95.4627705,29.7411474).
- The center of Downtown [15] at (-95.3755815,29.7575263)
- Rice University [16] at (-95.4040199,29.7173941)
- The Energy Corridor [17] at (-95.7131468,29.8698068)
- Houston Community College - Spring Branch Campus [18] at (-95.5631846,29.7841191)
- Westchase Neighborhood [19] at (-95.5832518,29.72609)
- Houston Community College Alief Hayes Campus [20] at (-95.5770254,29.7336065)
Visually, it is easy to layer the years, using opacity to build up an effect over time. In the data, it is more difficult. Each year could have thickness one and then we could count the number using ST_intersects while creating the new hulls using ST_Intersection (returns null when no interection). If there are more than one intersections with the highest intersects count, then we could take the largest one as the ultimate one. The centroid of the ultimate intersection would be the heart of a city's startup scene.
I also added the roads from the Tiger Line Shapefile for Harris County[21]:
shp2pgsql -c -D -s 4269 -W "latin1" -I tl_2013_48201_roads.shp tlharris | psql -U researcher -d vcdb3
As well as the US national file for the coastlines:
shp2pgsql -c -D -s 4269 -W "latin1" -I tl_2017_us_coastline.shp tlcoastline | psql -U researcher -d vcdb3
Unfortunately, this doesn't show lakes... You can get all lines from https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2019&layergroup=All+Lines
But you have to do them county by county.
shp2pgsql -c -D -s 4269 -W "latin1" -I tl_2019_50007_edges.shp tlchittenden | psql -U researcher -d vcdb3
And there's so many features...
One problem with this method is that there are partial intersections. We could use ST_Difference to return the part that doesn't intersect... a bigger problem is that we are restricted to pairs of geometries. Using a cross product we could test all rows against all other rows. But then we'd need to aggregate the intersections... One method is to use a recursive CTE [22]. ST_Union is truly an aggregate function but not what we want in this context.
Another thing that might be an issue is that when hulls are expanded, they may intersect within a year too. Counting across year and within year intersections the same would simplify this, but it might be important to track them separately?
Other
See also:
Old Work Using Circles
Very Old Summary
Agglomeration is generally thought to be one of the most important determinants of growth for urban entrepreneurship ecosystems. However, there is essentially no empirical evidence to support this. This paper takes advantage of geocoding and introduces a novel measure of agglomeration. This measure is the smallest circle area that covers all startup offices, subject to having at least N startups in each circle. Using GIS data on cities, this paper controls for the density and socio-demographics of an area to identify the effect of just agglomeration.
Description
Clusters of economic activity plays a significant role in the firms performance and growth. An important driver of growth is the knowledge spillover between firms. This includes among others the facilitation of information flow and ideas between firms which could be a milestone especially in the growth of startup firms or small businesses. This project focuses on the effects of agglomeration on the performance and growth of startup firms. It introduces a novel measure of agglomeration which can be used to empirically test the effects of clustering. This measure the is smallest total circle area that covers all of the startups in the sample such that there are at least n firms in each circle. The projects is based on the creation of an algorithm which gives an unbiased measure to be used in the empirical analysis. The regression we are interested in takes the following form:
The dependent variable is a measure of growth of the firms. This measure could be investment forwarded one period or growth in investment. The control variables include the number of the startups firms, m, the agglomeration measure, A and a vector of other control variables affecting the growth of firms at time t. Because of the endogeneity in the circle area or the measure of agglomeration, A, there is a need for an instrumental variable to get consistent estimates of the effects we are interested in. The proposed instrument is the presence of a river, or road in between the points representing geographical locations of the venture capital backed up firms. The instrument affects agglomeration without having a direct impact on the growth. This makes it good candidate for a valid instrument. The next tasks are determining the additional control variables to include in the regression, years to include in the analysis and methods of finding an unbiased measure of agglomeration.
Data
Making the circle input data
Ed's additional datawork is in
Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessingCoLevelSimple.sql
The key table for circle processing is CoLevelBlowout, which is restricted (to include cities with greater than 10 active at some point in the data) to make CoLevelForCircles.
We need to:
- Winsorize CoLevelBlowout
- Compute the circles!
- Make the Bay Area (over time) data
- Plot the Bay Area data (with colors per Bay Area city) for 1985 to present
- Combine the plots to make an animated gif
To winsorize the data we need the formula for Great Circle Distance. The radius of the earth is 6,378km (half of diameter: 12,756 km). So:
GCD = acos( sin(lat1) x sin(lat2) + cos(lat1) x cos(lat2) x cos(long1-long2) ) x r
Main Sources
The primary sources of data for this project are:
- SDC VentureXpert - from VC Database Rebuild, the key table is
- GIS City Data
- Data on NSF, NIH, population, income, clinical trials, employment, schooling, R&D expenditures and revenue of firms can be found in Hubs.
VC data
Data on the number of new vc backed firms in each city and year is in:
Z:\Hubs\2017\clean data The name of the file is firm_nr.txt.
Database is cities SQL script is: nr_firms.sql
Raw data is in:
Z:\VentureCapitalData\SDCVCData\vcdb2 The file is colevelsimple.txt
In order to see if there are outliers, I get the average coordinates for all cities and find the differences of the firm's coordinates from the city coordinate. The script for the average city coordinates is in
Z:\Hubs\2017\sql scripts and the file name is newcolevel.sql.
The differences are taken in excel. The file containing the differences is in
Z:\Hubs\2017 and the file name is new_colevel.txt.
- Data on the circle area in each city and year is in:
Z:\Hubs\2017\clean data The name of the file is circles.txt. (It contains only 106 observations)
Database is cities SQL script is: circles.sql
The script for joining the two tables on the VC table is in:
Z:\Hubs\2017\sql scripts The name of the file is new_firm_nr_circles.sql
- We use the cities with greater than 10 active VC backed firms. Data on the cities and number of active firms is in:
E:\McNair\Projects\Hubs\Summer 2017 The file is CitiesWithGT10Active.txt
The script for joining the final data with this file is located in
Z:\Hubs\2017\sql scripts The file name is final_joined_kerda.sql.
The final data is in
Z:\Hubs\2017\clean data The file name is new_final_kerda.txt.
Accelerator data
Accelerators data is in
Z:\Hubs\2017\clean data The file name is accelerators.txt The table is accelerators
The joined accelerators data with the VC table is in joined_accelerators table. The script is in
Z:\Hubs\2017\sql scripts The file name is join_accelerators.sql
The do file is in
Z:\Hubs\2017\kerda The name is agglomeartion_kerda.do
It includes the graphs, tables and the preliminary FE regressions with VC funding amount and growth rate. It also predicts the hazard rates, matches on the hazard rate in order to create synthetic control and treatment groups. What is left to do is to add 2 lagged and 3 forward observations for the cities which do have a match. Remove the overlapping observations for the years that get a treatment but which at the same time serve as a control.
See also
Also:
- Enclosing Circle Algorithm
- Normalizer
- Geocode.py
Entrepreneurship, small businesses and economic growth in cities:
Specifities/ Outliers to consider
New York (decompose) Princeton area (keep Princeton unique) Reston, Virginia (keep) San Diego (include La Jolla) Silicon Valley (all distinct)
Unbiased measure
The number of startups affects the total area of the circles according to some function. The task is to find an unbiased measure of the area, which is not affected by the number of the startups, given the size and their distribution.
For the unbiased calculation of a measure in a different context see: http://users.nber.org/~edegan/w/images/d/d0/Hall_(2005)_-_A_Note_On_The_Bias_In_Herfindahl_Type_Measures_Based_On_Count_Data.pdf
Census Data
Population
The Census Gazetteer files for 2010, 2000 and 1990 can give use population by census place. See https://www.census.gov/geo/maps-data/data/gazetteer.html
The places file contains data for all incorporated places and census designated places (CDPs) in the 50 states, the District of Columbia and Puerto Rico as of the January 1, 2010. The file is tab-delimited text, one line per record. Some records contain special characters. Download the National Places Gazetteer Files (1.2MB) Download the State-Based Places Gazetteer Files: Column Label Description Column 1 USPS United States Postal Service State Abbreviation Column 2 GEOID Geographic Identifier - fully concatenated geographic code (State FIPS and Place FIPS) Column 3 ANSICODE American National Standards Insititute code Column 4 NAME Name Column 5 LSAD Legal/Statistical area descriptor. Column 6 FUNCSTAT Functional status of entity. Column 7 POP10 2010 Census population count. Column 8 HU10 2010 Census housing unit count. Column 9 ALAND Land Area (square meters) - Created for statistical purposes only. Column 10 AWATER Water Area (square meters) - Created for statistical purposes only. Column 11 ALAND_SQMI Land Area (square miles) - Created for statistical purposes only. Column 12 AWATER_SQMI Water Area (square miles) - Created for statistical purposes only. Column 13 INTPTLAT Latitude (decimal degrees) First character is blank or "-" denoting North or South latitude respectively Column 14 INTPTLONG Longitude (decimal degrees) First character is blank or "-" denoting East or West longitude respectively.
Relationships
See https://www.census.gov/geo/maps-data/data/relationship.html
These text files describe geographic relationships. There are two types of relationship files; those that show the relationship between the same type of geography over time (comparability) and those that show the relationship between two types of geography for the same time period.
ACS (American Community Survey) Data
Steps to download:
1) Go to https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml 2) Select 'I know the dataset or table(s) that I want to download.' 3) Press Next 4) For 'Select a program:' choose 'American Community Survey' 5) For 'Select a dataset and click Add to Your Selections:' choose '<YEAR OF INTEREST> ACS 1-year estimates' 6) Press 'Add To Your Selections' 7) Press Next 8) For 'Select a geographic type:' choose 'Place - 160' 9) For Select a state: Don't choose a state, as we wish to download all. 10) For 'Select one or more geographic areas...' choose 'All Places within United States and Puerto Rico' 11) Press Next
Other
Counts of firms by NAICS code at the county level may be useful: https://www2.census.gov/geo/pdfs/education/cbp12gdbs.pdf
Tax Increment Finance Zones
- State by state enabling statues: https://www.cdfa.net/cdfa/tifmap.nsf/index.html