Difference between revisions of "Urban Start-up Agglomeration and Venture Capital Investment"
Line 244: | Line 244: | ||
99% 1 1 Kurtosis 2.400947 | 99% 1 1 Kurtosis 2.400947 | ||
+ | I tried using R2 to select levels, but only the second spec had an interior solution (at level 3): | ||
+ | forvalues i=1/12 { | ||
+ | quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq avghulldensity avghulldensitysq avgpairlength avgpairlengthsq avgdisthm avgdisthmsq i.year if level==`i', fe cluster(placelevelid) | ||
+ | display "Reg: 1 level: " `i' " r2-within: " `e(r2_w)' | ||
+ | } | ||
+ | forvalues i=1/12 { | ||
+ | quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq i.year if level==`i', fe cluster(placelevelid) | ||
+ | display "Reg: 2 level: " `i' " r2-within: " `e(r2_w)' | ||
+ | } | ||
+ | forvalues i=1/12 { | ||
+ | quietly capture xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracsinglemulti fracsinglemultisq fracpair fracpairsq frachull frachullsq i.year if level==`i', fe cluster(placelevelid) | ||
+ | display "Reg: 3 level: " `i' " r2-within: " `e(r2_w)' | ||
+ | } | ||
+ | |||
+ | Reg: 2 level: 1 r2-within: .39552569 | ||
+ | Reg: 2 level: 2 r2-within: .3998779 | ||
+ | Reg: 2 level: 3 r2-within: .40348691 | ||
+ | Reg: 2 level: 4 r2-within: .40130097 | ||
+ | Reg: 2 level: 5 r2-within: .39931203 | ||
+ | Reg: 2 level: 6 r2-within: .39707046 | ||
+ | Reg: 2 level: 7 r2-within: .39366909 | ||
+ | Reg: 2 level: 8 r2-within: .38957831 | ||
+ | Reg: 2 level: 9 r2-within: .38398108 | ||
+ | Reg: 2 level: 10 r2-within: .37662604 | ||
+ | Reg: 2 level: 11 r2-within: .36999057 | ||
+ | Reg: 2 level: 12 r2-within: .38393843 | ||
+ | |||
+ | However, doing this by exgroup (0 t0 3), gives the same result - level 3 - for each exgroup. | ||
+ | |||
+ | Alternative approaches are to use AIC/BIC, or maybe entropy. For the same set of variables, in the same model, AIC/BIC are minimized when R2 is maximized, they are only useful when choosing the combination of the variables/estimation and the level. And it seems we can only do entropy one variable at a time: | ||
+ | entropyetc nohull if level==1 | ||
===Other=== | ===Other=== |
Revision as of 19:33, 12 August 2019
Academic Paper | |
---|---|
Title | Urban Start-up Agglomeration and Venture Capital Investment |
Author | Ed Egan |
RAs | Peter Jalbert, Jake Silberman, Christy Warden, Jeemin Sim |
Status | In development |
© edegan.com, 2016 |
Contents
New Work
Files
New files are in E:\projects\agglomeration
- agglomeration.sql -- new main sql file uses vcdb3
Contents of the Old Code subdirectory:
- Agglomeration.SQL -- This was the replacement code that was run when vcdb3 was made. We had to work through it to update the counts. Came from E:\projects\vcdb3\OriginalSQL\Agglomeration.SQL
- Agglomeration.sql -- Old version that makes colevel data for python script. Came from E:\mcnair\Projects\Agglomeration
- Analysis.sql -- Old version builds from HCL forward. Came from E:\mcnair\Projects\Agglomeration
See also:
CoLevelForCircles
Note: Make sure that the geocoding has been fixed first! See Restoring_vcdb3#Fix_the_Geocoding.
The build is as follows:
- portcomaster 48001 (where hadgrowth=1 34761) and portcogeo (47715 where latitude and longitude NOT NULL 47715) -> colevelsimple 33869
- colevelsimple 33869-> CoPoints (adds point geoms)
- tigerplaces 29322 (has place geoms) -> tiger2 (adds areas)
- tiger2,CoPoints ST_Intersects -> colevelbase 29442
- colevelbase,years -> colevelblowout 219060, CoLevelSummary 35344, PlacesWithGT10Active 200
- colevelblowout,PlacesWithGT10Active -> CoLevelForCircles 171170
HCA
Take CoLevelForCircles.txt and feed it into the HCA script in
E:\projects\hca\main.py
See also:
Take results.tsv and load it as the HCL table.
Layers and levels
See Agglomeration.sql for the following build:
- hcl loaded from results.tsv 29751998
- Determine if singleset, multiset or hullset (fully partitions data)
- Make hclsingletons, hclmultitons and hclhulls (which contains both hulls and lines)
- UNION them together to create hclmain (14875999), which contains geographies
- Make hcllayer 163887: Aggregate to the layer level calculating nosingleton, nomultiton, etc., as well as tothullarea etc.
Note that everything uses Geographies (except to find centroids), and pair lengths and hull areas are scaled so that they are in hm and hectares (hm2) [1], as this is close to being a city block length and square block [2] at least in Houston, TX (note that a block in Chicago is 2 Houston blocks and there is no standard block definition).
Then load up leveldefinitions.txt. Note that we are using more levels than before, with finer grained levels at the bottom end:
Level Label Target 1 100msq 0.01 2 1blfront 0.1 3 1blsq 1 4 5blsq 5 5 10blsq 10 6 25blsq 25 7 50sqbl 50 8 1kmsq 100 9 5kmsq 400 10 10kmsq 1000 11 20min rule 35000
Then build the nohulls as level 0 and the allhull as level 12. Note that we are going to have to throw out panels or observations with too few points per city-year later, as these can have singletons, multitons, or pairs as their allhulls. This can be done with nohull !=0. Also nohulls can be built using hcllayer then it will contain pairs, or from colevelforcircles. I opted for the later, so that it only contains singletons and multitons.
Finally in this part, build hcllevels and hcllayerwzero. For hcllevels we are going to compute mean distances between clusters. It is computational infeasible to do this for all layers. And then for all layers (inc zero) we are going to run our selection regression.
For the next steps on the data see Jeemin Sim (Work Log). This includes details of how to load the TIF data.
First Estimation(s)
At this stage we have MasterLevels.txt and MasterLayers.txt as datafiles. MasterLevels.txt contains only layers corresponding to levels 0 through 12 and also has noothergeoms and avgdisthm as variables.
The questions we need to answer are: 1) Is there an agglomeration effect? 2) Which level or layer best describes a city (perhaps for a year, or perhaps over its life)?
We can just pick a level (say 25 hectares) and run a within-city regression:
. xtreg growthinv17l_f growthinv17l nosingletonl totmultitoncountl totpaircountl tothullcountl avgpairlengthl avghulldensi > tyl avgdisthml i.year if level==6, fe cluster(placelevelid) Fixed-effects (within) regression Number of obs = 5,027 Group variable: placelevelid Number of groups = 198 R-sq: Obs per group: within = 0.4097 min = 3 between = 0.8310 avg = 25.4 overall = 0.5974 max = 37 F(44,197) = 78.20 corr(u_i, Xb) = 0.4087 Prob > F = 0.0000 (Std. Err. adjusted for 198 clusters in placelevelid) ----------------------------------------------------------------------------------- | Robust growthinv17l_f | Coef. Std. Err. t P>|t| [95% Conf. Interval] ------------------+---------------------------------------------------------------- growthinv17l | .1388644 .0176074 7.89 0.000 .1041412 .1735877 nosingletonl | .1447935 .0402488 3.60 0.000 .0654197 .2241673 totmultitoncountl | .0909545 .0481349 1.89 0.060 -.0039714 .1858803 totpaircountl | .1724367 .0383185 4.50 0.000 .0968695 .2480039 tothullcountl | .7120504 .0467915 15.22 0.000 .6197739 .8043269 avgpairlengthl | -.0219417 .023633 -0.93 0.354 -.0685478 .0246645 avghulldensityl | .049566 .0202756 2.44 0.015 .0095808 .0895511 avgdisthml | .0933327 .076309 1.22 0.223 -.0571546 .2438201
Or:
. xtreg growthinv17l_f growthinv17l numstartups numstartupssq nosinglemulti nosinglemultisq nohull nohullsq nopair nopairs > q i.year if level==6, fe cluster(placelevelid) Fixed-effects (within) regression Number of obs = 5,773 Group variable: placelevelid Number of groups = 200 R-sq: Obs per group: within = 0.4017 min = 4 between = 0.8425 avg = 28.9 overall = 0.5708 max = 37 F(45,199) = 72.39 corr(u_i, Xb) = 0.4222 Prob > F = 0.0000 (Std. Err. adjusted for 200 clusters in placelevelid) --------------------------------------------------------------------------------- | Robust growthinv17l_f | Coef. Std. Err. t P>|t| [95% Conf. Interval] ----------------+---------------------------------------------------------------- growthinv17l | .220333 .018699 11.78 0.000 .1834595 .2572066 numstartups | .0062875 .0022944 2.74 0.007 .001763 .0108119 numstartupssq | -8.04e-07 1.14e-06 -0.70 0.483 -3.06e-06 1.45e-06 nosinglemulti | .0648575 .0168134 3.86 0.000 .0317023 .0980127 nosinglemultisq | -.0021614 .0006336 -3.41 0.001 -.0034108 -.000912 nohull | .1747691 .0255105 6.85 0.000 .1244636 .2250747 nohullsq | -.0057148 .0012164 -4.70 0.000 -.0081136 -.003316 nopair | .0896908 .0248207 3.61 0.000 .0407455 .1386361 nopairsq | -.0097196 .0024153 -4.02 0.000 -.0144825 -.0049567
Note that the following don't work, either alone or with other variables (including numstartups and numstartupsq), probably because they are third-order effects:
. xtreg growthinv17l_f growthinv17l avghulldensity avghulldensitysq avgpairlength avgpairlengthsq avgdisthm avgdisthmsq i. > year if level==6, fe cluster(placelevelid) Fixed-effects (within) regression Number of obs = 5,027 Group variable: placelevelid Number of groups = 198 R-sq: Obs per group: within = 0.3579 min = 3 between = 0.5926 avg = 25.4 overall = 0.3753 max = 37 F(43,197) = 2152.49 corr(u_i, Xb) = 0.2529 Prob > F = 0.0000 (Std. Err. adjusted for 198 clusters in placelevelid) ---------------------------------------------------------------------------------- | Robust growthinv17l_f | Coef. Std. Err. t P>|t| [95% Conf. Interval] -----------------+---------------------------------------------------------------- growthinv17l | .2668427 .0208574 12.79 0.000 .2257101 .3079752 avghulldensity | .0008076 .0003875 2.08 0.038 .0000433 .0015718 avghulldensitysq | -1.14e-07 8.80e-08 -1.29 0.197 -2.87e-07 5.97e-08 avgpairlength | -.0018724 .0036128 -0.52 0.605 -.0089972 .0052524 avgpairlengthsq | -.0000296 .0000596 -0.50 0.620 -.0001471 .0000879 avgdisthm | .001429 .0035371 0.40 0.687 -.0055465 .0084045 avgdisthmsq | -.000012 .0000157 -0.76 0.447 -.0000429 .000019
We can also do it with fractions and their squares (omit fracsinglemulti). However at level 6 (25 hectare), pairs seems more important than hulls:
. xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracpair fracpairsq frachull frachullsq i.year if level==6, > fe cluster(placelevelid) Fixed-effects (within) regression Number of obs = 5,773 Group variable: placelevelid Number of groups = 200 R-sq: Obs per group: within = 0.3919 min = 4 between = 0.8456 avg = 28.9 overall = 0.5274 max = 37 F(43,199) = 62.34 corr(u_i, Xb) = 0.4268 Prob > F = 0.0000 (Std. Err. adjusted for 200 clusters in placelevelid) ------------------------------------------------------------------------------- | Robust growthinv17~f | Coef. Std. Err. t P>|t| [95% Conf. Interval] --------------+---------------------------------------------------------------- growthinv17l | .2481436 .0181447 13.68 0.000 .2123631 .283924 numstartups | .0100673 .0019792 5.09 0.000 .0061644 .0139702 numstartupssq | -6.92e-06 2.03e-06 -3.41 0.001 -.0000109 -2.92e-06 fracpair | .7540177 .3709212 2.03 0.043 .0225772 1.485458 fracpairsq | -1.936 .7030942 -2.75 0.006 -3.322472 -.5495289 frachull | .1969853 .562807 0.35 0.727 -.9128457 1.306816 frachullsq | -.1491389 .3878513 -0.38 0.701 -.9139649 .615687
Whereas across all levels:
. xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracpair fracpairsq frachull frachullsq i.year, fe cluster(p > lacelevelid) Fixed-effects (within) regression Number of obs = 76,623 Group variable: placelevelid Number of groups = 2,600 R-sq: Obs per group: within = 0.3956 min = 4 between = 0.8330 avg = 29.5 overall = 0.5279 max = 37 F(43,2599) = 827.33 corr(u_i, Xb) = 0.4143 Prob > F = 0.0000 (Std. Err. adjusted for 2,600 clusters in placelevelid) ------------------------------------------------------------------------------- | Robust growthinv17~f | Coef. Std. Err. t P>|t| [95% Conf. Interval] --------------+---------------------------------------------------------------- growthinv17l | .2522524 .0049725 50.73 0.000 .2425019 .2620028 numstartups | .0100677 .0005323 18.91 0.000 .0090239 .0111114 numstartupssq | -6.95e-06 5.51e-07 -12.62 0.000 -8.03e-06 -5.87e-06 fracpair | .4152028 .0956859 4.34 0.000 .2275745 .6028311 fracpairsq | -.9753654 .1271631 -7.67 0.000 -1.224717 -.7260141 frachull | -.8606939 .1231519 -6.99 0.000 -1.10218 -.6192081 frachullsq | .495785 .0976557 5.08 0.000 .3042942 .6872758
This is probably because of the variation in hulls vs pairs at level 6, which has lots of cities with nothing in pairs and everything in hulls. We might want to 'control' for cityarea by restricting our within city analysis to large enough cities. A 25 hectare target area might be too encapsulating -- more than 10% of observations are 100% in hulls:
. su frachull if level==6, det frachull ------------------------------------------------------------- Percentiles Smallest 1% .2162162 .1153846 5% .3333333 .1153846 10% .4285714 .1428571 Obs 6,032 25% .6 .1428571 Sum of Wgt. 6,032 50% .8 Mean .7539126 Largest Std. Dev. .2209328 75% .9666666 1 90% 1 1 Variance .0488113 95% 1 1 Skewness -.6390206 99% 1 1 Kurtosis 2.400947
I tried using R2 to select levels, but only the second spec had an interior solution (at level 3):
forvalues i=1/12 { quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq avghulldensity avghulldensitysq avgpairlength avgpairlengthsq avgdisthm avgdisthmsq i.year if level==`i', fe cluster(placelevelid) display "Reg: 1 level: " `i' " r2-within: " `e(r2_w)' } forvalues i=1/12 { quietly capture xtreg growthinv17l_f growthinv17l nosinglemulti nosinglemultisq nohull nohullsq nopair nopairsq i.year if level==`i', fe cluster(placelevelid) display "Reg: 2 level: " `i' " r2-within: " `e(r2_w)' } forvalues i=1/12 { quietly capture xtreg growthinv17l_f growthinv17l numstartups numstartupssq fracsinglemulti fracsinglemultisq fracpair fracpairsq frachull frachullsq i.year if level==`i', fe cluster(placelevelid) display "Reg: 3 level: " `i' " r2-within: " `e(r2_w)' }
Reg: 2 level: 1 r2-within: .39552569 Reg: 2 level: 2 r2-within: .3998779 Reg: 2 level: 3 r2-within: .40348691 Reg: 2 level: 4 r2-within: .40130097 Reg: 2 level: 5 r2-within: .39931203 Reg: 2 level: 6 r2-within: .39707046 Reg: 2 level: 7 r2-within: .39366909 Reg: 2 level: 8 r2-within: .38957831 Reg: 2 level: 9 r2-within: .38398108 Reg: 2 level: 10 r2-within: .37662604 Reg: 2 level: 11 r2-within: .36999057 Reg: 2 level: 12 r2-within: .38393843
However, doing this by exgroup (0 t0 3), gives the same result - level 3 - for each exgroup.
Alternative approaches are to use AIC/BIC, or maybe entropy. For the same set of variables, in the same model, AIC/BIC are minimized when R2 is maximized, they are only useful when choosing the combination of the variables/estimation and the level. And it seems we can only do entropy one variable at a time:
entropyetc nohull if level==1
Other
See also:
Old Work Using Circles
Very Old Summary
Agglomeration is generally thought to be one of the most important determinants of growth for urban entrepreneurship ecosystems. However, there is essentially no empirical evidence to support this. This paper takes advantage of geocoding and introduces a novel measure of agglomeration. This measure is the smallest circle area that covers all startup offices, subject to having at least N startups in each circle. Using GIS data on cities, this paper controls for the density and socio-demographics of an area to identify the effect of just agglomeration.
Description
Clusters of economic activity plays a significant role in the firms performance and growth. An important driver of growth is the knowledge spillover between firms. This includes among others the facilitation of information flow and ideas between firms which could be a milestone especially in the growth of startup firms or small businesses. This project focuses on the effects of agglomeration on the performance and growth of startup firms. It introduces a novel measure of agglomeration which can be used to empirically test the effects of clustering. This measure the is smallest total circle area that covers all of the startups in the sample such that there are at least n firms in each circle. The projects is based on the creation of an algorithm which gives an unbiased measure to be used in the empirical analysis. The regression we are interested in takes the following form:
The dependent variable is a measure of growth of the firms. This measure could be investment forwarded one period or growth in investment. The control variables include the number of the startups firms, m, the agglomeration measure, A and a vector of other control variables affecting the growth of firms at time t. Because of the endogeneity in the circle area or the measure of agglomeration, A, there is a need for an instrumental variable to get consistent estimates of the effects we are interested in. The proposed instrument is the presence of a river, or road in between the points representing geographical locations of the venture capital backed up firms. The instrument affects agglomeration without having a direct impact on the growth. This makes it good candidate for a valid instrument. The next tasks are determining the additional control variables to include in the regression, years to include in the analysis and methods of finding an unbiased measure of agglomeration.
Data
Making the circle input data
Ed's additional datawork is in
Z:\VentureCapitalData\SDCVCData\vcdb2\ProcessingCoLevelSimple.sql
The key table for circle processing is CoLevelBlowout, which is restricted (to include cities with greater than 10 active at some point in the data) to make CoLevelForCircles.
We need to:
- Winsorize CoLevelBlowout
- Compute the circles!
- Make the Bay Area (over time) data
- Plot the Bay Area data (with colors per Bay Area city) for 1985 to present
- Combine the plots to make an animated gif
To winsorize the data we need the formula for Great Circle Distance. The radius of the earth is 6,378km (half of diameter: 12,756 km). So:
GCD = acos( sin(lat1) x sin(lat2) + cos(lat1) x cos(lat2) x cos(long1-long2) ) x r
Main Sources
The primary sources of data for this project are:
- SDC VentureXpert - from VC Database Rebuild, the key table is
- GIS City Data
- Data on NSF, NIH, population, income, clinical trials, employment, schooling, R&D expenditures and revenue of firms can be found in Hubs.
VC data
Data on the number of new vc backed firms in each city and year is in:
Z:\Hubs\2017\clean data The name of the file is firm_nr.txt.
Database is cities SQL script is: nr_firms.sql
Raw data is in:
Z:\VentureCapitalData\SDCVCData\vcdb2 The file is colevelsimple.txt
In order to see if there are outliers, I get the average coordinates for all cities and find the differences of the firm's coordinates from the city coordinate. The script for the average city coordinates is in
Z:\Hubs\2017\sql scripts and the file name is newcolevel.sql.
The differences are taken in excel. The file containing the differences is in
Z:\Hubs\2017 and the file name is new_colevel.txt.
- Data on the circle area in each city and year is in:
Z:\Hubs\2017\clean data The name of the file is circles.txt. (It contains only 106 observations)
Database is cities SQL script is: circles.sql
The script for joining the two tables on the VC table is in:
Z:\Hubs\2017\sql scripts The name of the file is new_firm_nr_circles.sql
- We use the cities with greater than 10 active VC backed firms. Data on the cities and number of active firms is in:
E:\McNair\Projects\Hubs\Summer 2017 The file is CitiesWithGT10Active.txt
The script for joining the final data with this file is located in
Z:\Hubs\2017\sql scripts The file name is final_joined_kerda.sql.
The final data is in
Z:\Hubs\2017\clean data The file name is new_final_kerda.txt.
Accelerator data
Accelerators data is in
Z:\Hubs\2017\clean data The file name is accelerators.txt The table is accelerators
The joined accelerators data with the VC table is in joined_accelerators table. The script is in
Z:\Hubs\2017\sql scripts The file name is join_accelerators.sql
The do file is in
Z:\Hubs\2017\kerda The name is agglomeartion_kerda.do
It includes the graphs, tables and the preliminary FE regressions with VC funding amount and growth rate. It also predicts the hazard rates, matches on the hazard rate in order to create synthetic control and treatment groups. What is left to do is to add 2 lagged and 3 forward observations for the cities which do have a match. Remove the overlapping observations for the years that get a treatment but which at the same time serve as a control.
See also
Also:
- Enclosing Circle Algorithm
- Normalizer
- Geocode.py
Unbiased measure
The number of startups affects the total area of the circles according to some function. The task is to find an unbiased measure of the area, which is not affected by the number of the startups, given the size and their distribution.
For the unbiased calculation of a measure in a different context see: http://users.nber.org/~edegan/w/images/d/d0/Hall_(2005)_-_A_Note_On_The_Bias_In_Herfindahl_Type_Measures_Based_On_Count_Data.pdf
GIS Resources
- https://www.census.gov/geo/maps-data/data/tiger-line.html
- https://www.census.gov/geo/maps-data/data/tiger.html
- http://postgis.net/features/
- https://en.wikipedia.org/wiki/GIS_file_formats
Useful functions for spatial joins
sum(expression): aggregate to return a sum for a set of records count(expression): aggregate to return the size of a set of records ST_Area(geometry) returns the area of the polygons ST_AsText(geometry) returns WKT text ST_Buffer(geometry, distance): For geometry: Returns a geometry that represents all points whose distance from this Geometry is less than or equal to distance. Calculations are in the Spatial Reference System of this Geometry. For geography: Uses a planar transform wrapper. ST_Contains(geometry A, geometry B) returns the true if geometry A contains geometry B ST_Distance(geometry A, geometry B) returns the minimum distance between geometry A and geometry B ST_DWithin(geometry A, geometry B, radius) returns the true if geometry A is radius distance or less from geometry B ST_GeomFromText(text) returns geometry ST_Intersection(geometry A, geometry B): Returns a geometry that represents the shared portion of geomA and geomB. The geography implementation does a transform to geometry to do the intersection and then transform back to WGS84 ST_Intersects(geometry A, geometry B) returns the true if geometry A intersects geometry B ST_Length(linestring) returns the length of the linestring ST_Touches(geometry A, geometry B) returns the true if the boundary of geometry A touches geometry B ST_Within(geometry A, geometry B) returns the true if geometry A is within geometry B geometry_a && geometry_b: Returns TRUE if A’s bounding box overlaps B’s. geometry_a = geometry_b: Returns TRUE if A’s bounding box is the same as B’s. ST_SetSRID(geometry, srid): Sets the SRID on a geometry to a particular integer value. ST_SRID(geometry): Returns the spatial reference identifier for the ST_Geometry as defined in spatial_ref_sys table. ST_Transform(geometry, srid): Returns a new geometry with its coordinates transformed to the SRID referenced by the integer parameter. ST_Union(): Returns a geometry that represents the point set union of the Geometries. substring(string [from int] [for int]): PostgreSQL string function to extract substring matching SQL regular expression. ST_Relate(geometry A, geometry B): Returns a text string representing the DE9IM relationship between the geometries. ST_GeoHash(geometry A): Returns a text string representing the GeoHash of the bounds of the object.
Native functions for geography
ST_AsText(geography) returns text ST_GeographyFromText(text) returns geography ST_AsBinary(geography) returns bytea ST_GeogFromWKB(bytea) returns geography ST_AsSVG(geography) returns text ST_AsGML(geography) returns text ST_AsKML(geography) returns text ST_AsGeoJson(geography) returns text ST_Distance(geography, geography) returns double ST_DWithin(geography, geography, float8) returns boolean ST_Area(geography) returns double ST_Length(geography) returns double ST_Covers(geography, geography) returns boolean ST_CoveredBy(geography, geography) returns boolean ST_Intersects(geography, geography) returns boolean ST_Buffer(geography, float8) returns geography [1] ST_Intersection(geography, geography) returns geography [1]
Functions for Linear Referencing
ST_LineInterpolatePoint(geometry A, double measure): Returns a point interpolated along a line. ST_LineLocatePoint(geometry A, geometry B): Returns a float between 0 and 1 representing the location of the closest point on LineString to the given Point. ST_Line_Substring(geometry A, double from, double to): Return a linestring being a substring of the input one starting and ending at the given fractions of total 2d length. ST_Locate_Along_Measure(geometry A, double measure): Return a derived geometry collection value with elements that match the specified measure. ST_Locate_Between_Measures(geometry A, double from, double to): Return a derived geometry collection value with elements that match the specified range of measures inclusively. ST_AddMeasure(geometry A, double from, double to): Return a derived geometry with measure elements linearly interpolated between the start and end points. If the geometry has no measure dimension, one is added.
3-D Functions
ST_3DClosestPoint — Returns the 3-dimensional point on g1 that is closest to g2. This is the first point of the 3D shortest line. ST_3DDistance — For geometry type Returns the 3-dimensional cartesian minimum distance (based on spatial ref) between two geometries in projected units. ST_3DDWithin — For 3d (z) geometry type Returns true if two geometries 3d distance is within number of units. ST_3DDFullyWithin — Returns true if all of the 3D geometries are within the specified distance of one another. ST_3DIntersects — Returns TRUE if the Geometries “spatially intersect” in 3d - only for points and linestrings ST_3DLongestLine — Returns the 3-dimensional longest line between two geometries ST_3DMaxDistance — For geometry type Returns the 3-dimensional cartesian maximum distance (based on spatial ref) between two geometries in projected units. ST_3DShortestLine — Returns the 3-dimensional shortest line between two geometries
Relevant PostgreSQL Commands
\dt *.* Show all tables \q Exit table
Specifities/ Outliers to consider
New York (decompose) Princeton area (keep Princeton unique) Reston, Virginia (keep) San Diego (include La Jolla) Silicon Valley (all distinct)
To make a circle
SELECT ST_Buffer([desired point], [desired radius], 'quad_segs=8') FROM [desired table]
quad_segs=8 indicates circle
For more precision in circle:
SELECT ST_Transform(geometry( ST_Buffer(geography( ST_Transform( [desired point], 4326 )), [desired radius]')), 900913) FROM [desired table]
4326 and 900913 represent particular precision.
Decimal Degrees
We are working with longitude and latitude in decimal degrees. See https://en.wikipedia.org/wiki/Decimal_degrees
When converting radius to km, multiply by 111.3199. For area, multiple by (111.3199)^2=12,392.12013601.
Census Data
Population
The Census Gazetteer files for 2010, 2000 and 1990 can give use population by census place. See https://www.census.gov/geo/maps-data/data/gazetteer.html
The places file contains data for all incorporated places and census designated places (CDPs) in the 50 states, the District of Columbia and Puerto Rico as of the January 1, 2010. The file is tab-delimited text, one line per record. Some records contain special characters. Download the National Places Gazetteer Files (1.2MB) Download the State-Based Places Gazetteer Files: Column Label Description Column 1 USPS United States Postal Service State Abbreviation Column 2 GEOID Geographic Identifier - fully concatenated geographic code (State FIPS and Place FIPS) Column 3 ANSICODE American National Standards Insititute code Column 4 NAME Name Column 5 LSAD Legal/Statistical area descriptor. Column 6 FUNCSTAT Functional status of entity. Column 7 POP10 2010 Census population count. Column 8 HU10 2010 Census housing unit count. Column 9 ALAND Land Area (square meters) - Created for statistical purposes only. Column 10 AWATER Water Area (square meters) - Created for statistical purposes only. Column 11 ALAND_SQMI Land Area (square miles) - Created for statistical purposes only. Column 12 AWATER_SQMI Water Area (square miles) - Created for statistical purposes only. Column 13 INTPTLAT Latitude (decimal degrees) First character is blank or "-" denoting North or South latitude respectively Column 14 INTPTLONG Longitude (decimal degrees) First character is blank or "-" denoting East or West longitude respectively.
Relationships
See https://www.census.gov/geo/maps-data/data/relationship.html
These text files describe geographic relationships. There are two types of relationship files; those that show the relationship between the same type of geography over time (comparability) and those that show the relationship between two types of geography for the same time period.
ACS (American Community Survey) Data
Steps to download:
1) Go to https://factfinder.census.gov/faces/nav/jsf/pages/download_center.xhtml 2) Select 'I know the dataset or table(s) that I want to download.' 3) Press Next 4) For 'Select a program:' choose 'American Community Survey' 5) For 'Select a dataset and click Add to Your Selections:' choose '<YEAR OF INTEREST> ACS 1-year estimates' 6) Press 'Add To Your Selections' 7) Press Next 8) For 'Select a geographic type:' choose 'Place - 160' 9) For Select a state: Don't choose a state, as we wish to download all. 10) For 'Select one or more geographic areas...' choose 'All Places within United States and Puerto Rico' 11) Press Next
Other
Counts of firms by NAICS code at the county level may be useful: https://www2.census.gov/geo/pdfs/education/cbp12gdbs.pdf
Tax Increment Finance Zones
- State by state enabling statues: https://www.cdfa.net/cdfa/tifmap.nsf/index.html