The objective is to apply the [https://en.wikipedia.org/wiki/Elbow_method_(clustering) Elbow Method], which involves finding the [https://en.wikipedia.org/wiki/Knee_of_a_curve Knee of the curve] of either the F-statistic or variance explained.
I used distances calculated by ST_Distance and calculated the '''variance explained''' using the following equations:
:<math>SS_{exp}=\sum_{i=1}^{K} n_i(\bar{Y}_{i\cdot} - \bar{Y})^2</math>
:<math>SS_{unexp}=\sum_{i=1}^{K}\sum_{j=1}^{n_{i}} \left( Y_{ij}-\bar{Y}_{i\cdot} \right)^2</math>
:<math>R^2 = \frac{SS_{exp}}{SS_{exp}+SS_{unexp}}</math>
The between-group variance is undefined for the first layer, as it has <math>K=1</math> (i.e., a single all-encompassing hull, so its centroid is the overall mean, <math>\bar{Y}_{i\cdot} = \bar{Y}</math>) and its variance is then <math>n_i(0)^2/(0)</math>.
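As a rough illustration of the three equations above, here is a minimal Python sketch. It assumes each group (hull) is given as a plain list of numeric values <math>Y_{ij}</math>; the function name <code>variance_explained</code> and the input layout are hypothetical and are not taken from the actual do file.

```python
from statistics import fmean


def variance_explained(groups):
    """Return R^2 = SS_exp / (SS_exp + SS_unexp) for a list of groups.

    `groups` is a list of lists of numbers (hypothetical layout: one inner
    list per hull, holding the values Y_ij for that hull).
    """
    all_values = [y for group in groups for y in group]
    grand_mean = fmean(all_values)

    ss_exp = 0.0    # between-group sum of squares
    ss_unexp = 0.0  # within-group sum of squares
    for group in groups:
        group_mean = fmean(group)
        ss_exp += len(group) * (group_mean - grand_mean) ** 2
        ss_unexp += sum((y - group_mean) ** 2 for y in group)

    total = ss_exp + ss_unexp
    if total == 0:
        return float("nan")  # every value identical: R^2 is not defined
    return ss_exp / total


# Two well-separated groups: SS_exp = 54, SS_unexp = 4
r2 = variance_explained([[1, 2, 3], [7, 8, 9]])

# A single all-encompassing group: the group mean equals the grand mean,
# so SS_exp is zero
r2_single = variance_explained([[1, 2, 3]])
```

With a single group the numerator is zero, which is the <math>n_i(0)^2</math> term noted above; dividing that by <math>K-1=0</math> degrees of freedom is what makes the between-group variance undefined for the first layer.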
I then calculated forward differences, and added one to the answer, as using central differences left-truncates the data. (An inspection of the data revealed that it is vastly more likely that the 'correct' answer is found at the left end of the data than the right. Also, central first differences bridge the observation, which can lead to misidentification of monotonicity.) Specifically, I used:
:<math> f''(x) = f(x+2) - 2 f(x+1) + f(x)</math>
I required that a city-year had more than two layers, as it takes at least 3 layers to form an elbow. I then used <math>f'(x)</math> to determine the layer index from which the variance explained was monotonic (i.e., there was no change in sign in <math>f'(x)</math> in higher layer indices). This wasn't an issue when using the population variance explained. In an earlier version, when we used sample variance explained, we had some non-monotonic sections of the curve resulting from integer division (<math>\frac{k-1}{n-k}</math>).
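The monotonicity check can be sketched in Python as follows. This is only an illustration under stated assumptions: the variance explained for one city-year is taken to be a list ordered by layer, the first difference is taken to be <math>f'(x) = f(x+1) - f(x)</math>, positions are 0-based list positions rather than layer indices, and the function names are hypothetical.

```python
def forward_first_differences(f):
    """Forward first differences: f'(x) = f(x+1) - f(x)."""
    return [f[i + 1] - f[i] for i in range(len(f) - 1)]


def monotonic_from(f):
    """Return the 0-based position after the last sign change in f'(x).

    From the returned position onwards, the non-zero first differences
    all share one sign. Zero differences are not counted as sign changes.
    """
    start = 0
    last_sign = 0
    for i, d in enumerate(forward_first_differences(f)):
        sign = (d > 0) - (d < 0)
        if sign != 0:
            if last_sign != 0 and sign != last_sign:
                start = i  # the sign flipped between positions i-1 and i
            last_sign = sign
    return start


# The curve dips once and then rises, so it is monotonic from position 1
start = monotonic_from([0.5, 0.4, 0.6, 0.8, 0.9])  # → 1
```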
I used <math>f''(x)</math> to find the layer index <math>i</math> at which <math>varexp_i = min(varexp)</math> (for elbowlayer) or for which <math>varexp_i = max(varexp)</math> (for elbowmaxlayer), for some city-year. I then marked <math>i+1</math> as the elbow (or elbowmax) layer for that city-year, as we are using forward differences, not central differences. Note that the biggest change in slope could be found using max(abs(f''(x))), but this is essentially always min(f''(x)), i.e., the elbow layer, as the changes in slope are mostly negative. However, the changes in slope do often go positive, and the elbowmax layer captures the biggest positive change in slope.
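A minimal Python sketch of the elbow and elbowmax selection is given below. It assumes the variance explained for one city-year is a list ordered by layer and interprets the selection as taking the smallest and largest forward second difference. It works in 0-based list positions, so mapping positions to the actual layer numbering is left to the caller. The function names are hypothetical.

```python
def forward_second_differences(f):
    """Forward second differences: f''(x) = f(x+2) - 2 f(x+1) + f(x)."""
    return [f[i + 2] - 2 * f[i + 1] + f[i] for i in range(len(f) - 2)]


def elbow_positions(varexp):
    """Return (elbow, elbowmax) as 0-based positions, or None.

    None is returned when there are fewer than 3 layers, since at least
    3 points are needed to form an elbow.
    """
    if len(varexp) < 3:
        return None
    d2 = forward_second_differences(varexp)
    i_min = min(range(len(d2)), key=d2.__getitem__)  # most negative change in slope
    i_max = max(range(len(d2)), key=d2.__getitem__)  # most positive change in slope
    # d2[i] is built from positions i, i+1 and i+2, so the bend itself sits
    # at i+1: this is the "add one" step for forward differences.
    return i_min + 1, i_max + 1


curve = [0.10, 0.50, 0.80, 0.85, 0.88, 0.95]
positions = elbow_positions(curve)  # → (2, 4)
```

In this made-up curve the slope flattens most sharply at position 2 (the elbow), while the late upturn at position 4 is what elbowmax picks up.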
'''I created a new build (version 3.3) of the dataset, do file and log file, which includes the population variance explained elbow method, as well as the elbowmax method. It's in the dropbox.''' Note that the lens found by the population elbow method is only slightly bigger than the lenses found using the sample elbow method from before, but the lens found using the elbowmax method is about the same size as that of the sample elbow method, if not slightly smaller. I'm not sure about the justification of the elbowmax method though. It's easy to look at the differences in the medians or means (etc.) and see differences, but it's important to remember just how big those differences could be!
====Fixing the layer index====