I calculated the between- and within-cluster variances, as described below, using distances computed with the ST_Distance function on PostGIS geographies (i.e., geodesic distances that account for an ellipsoid earth using reference system WGS1984).
The output of the Python HCL clustering script has around 40m observations (place-statecode, year, layer, cluster, startup), and some of the intermediate tables took several minutes to build. As the process should be O(n), it could accommodate input data that is perhaps 100x to 1000x bigger, assuming a patient researcher, which would imply source data perhaps 10x bigger. That would put an upper bound at around 40b observations. Note that the hardware/software that we are running this on is pretty close to the (current) frontier.
=====Fixing an issue=====
I required that a city-year had more than two layers, as it takes at least 3 layers to form an elbow. I then used <math>f'(x)</math> to determine the layer index from which the variance explained was monotonic (i.e., there was no change in sign in <math>f'(x)</math> in higher layer indices), and used <math>f''(x)</math> to find the layer index <math>i</math> at which <math>varexp_i = min(varexp)</math> for some city-year. I then marked <math>i+1</math> as the elbow layer for that city-year, as we are using forward differences, not central differences.
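The elbow search above can be sketched in Python for a single city-year. This is a minimal illustration, not the code used in the build: the function name <code>elbow_layer</code>, the 0-based positions, and the example values are all hypothetical, and it assumes one reading of the text, namely that the elbow sits at the most negative second forward difference within the monotonic tail of the variance-explained curve.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def elbow_layer(varexp):
    """Return the 0-based position of the elbow layer, or None.

    varexp: variance explained per layer for one city-year, ordered by
    layer index (hypothetical input format).
    """
    n = len(varexp)
    if n < 3:
        # It takes at least 3 layers to form an elbow.
        return None

    # First forward differences: d1[k] = f(k+1) - f(k)
    d1 = [varexp[k + 1] - varexp[k] for k in range(n - 1)]
    # Second forward differences: d2[k] = f(k+2) - 2 f(k+1) + f(k)
    d2 = [d1[k + 1] - d1[k] for k in range(n - 2)]

    # Walk back from the last first difference to find the layer from
    # which f'(x) never changes sign in higher layers.
    start = n - 2
    while start > 0 and sign(d1[start - 1]) == sign(d1[n - 2]):
        start -= 1

    candidates = range(start, n - 2)
    if not candidates:
        # The monotonic tail has fewer than 3 layers.
        return None

    # Assumption: the elbow is where the second difference is smallest.
    i = min(candidates, key=lambda k: d2[k])

    # d2[i] is centred on layer i + 1 because forward differences are
    # used, so the elbow is marked at i + 1.
    return i + 1


# Made-up values for illustration only.
example = [0.10, 0.40, 0.60, 0.65, 0.68, 0.70]
print(elbow_layer(example))
```

The <code>i + 1</code> shift reflects that a second forward difference at <code>i</code> spans layers <code>i</code>, <code>i + 1</code> and <code>i + 2</code>, so the bend it measures is at the middle layer.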
'''I created a new build (version 3.2) of the dataset, do file and log file, which includes the variance explained elbow method'''.
Note that the lens found by this elbow method is only slightly bigger than the lenses found using the other heuristic method and the maximum R2 method (and those two lenses are near identical!). It's easy to look at the differences in the medians or means (etc.) and see differences, but it's important to remember just how big those differences could be!
====Fixing the layer index====