Difference between revisions of "Restoring vcdb3"

From edegan.com
Jump to navigation Jump to search
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
{{Project
 
{{Project
 +
|Has project output=Data
 
|Has title=Restoring vcdb3
 
|Has title=Restoring vcdb3
 
|Has owner=Ed Egan,
 
|Has owner=Ed Egan,
Line 53: Line 54:
  
 
Note that all code, including ranking code was run before backup.
 
Note that all code, including ranking code was run before backup.
 +
 +
==Fix the Geocoding==
 +
 +
The longitudes and latitudes were stored as numeric, rather than real, but this doesn't seem to the source of the problem as it has suitable precision[https://www.postgresql.org/docs/current/datatype-numeric.html]. Instead the problem seems to be in the loading and unloading of tables through the \COPY command.
 +
 +
Regardless, we have decimal degrees to six decimal places from Google Maps, which is equivalent to a staggering 11cm of accuracy at the equator. See http://wiki.gis.com/wiki/index.php/Decimal_degrees.
 +
 +
There are some notes on [[VentureXpert_Data#GeoCoding_Companies.2C_Firms.2C_and_Branch_Offices]] and the SQL appears to be in:
 +
E:\mcnair\Projects\VentureXpert Database\vcdb3\LoadingScripts\GeoLoad.sql
 +
 +
===Company Process===
 +
 +
The company process began with a load of oldgeocords 44740 from companybasegeomaster.txt (which came from vcdb2). The problem is that companybasegeomaster has low precision data in it. This was documented in [[VC_Database_Rebuild#Re-Fixing_erroneous_geo-coordinates]]. There's good data in vcdb2:
 +
*Geocore 43628 (with nulls), keyed by coname (colevelsimple)
 +
*Geoallcoords 44999 (with nulls), keyed by coname, datefirst, statecode
 +
*Geoallcoords1 44999 (with nulls), keyed by coname, datefirst, statecode, with exclude flag
 +
 +
The next step in the company geo process was to load remaining from RemainingLatLong.txt. This is high precision data.
 +
 +
The fix to the geo data is at the top of:
 +
E:\projects\vcdb3\RevisedDbaseCode.sql
 +
 +
Essentially this fix:
 +
*DROPs portcogeo; oldgeocords; geoallcoords; geoallcoords1; goodgeoold; geoallcoords2; geoallcoords3; goodgeonew; geocodesportco;
 +
*Load geoallcoords1fromvcdb2 (which comes from vcdb2!)
 +
*Merge the results of '''geoallcoords1fromvcdb2''' with the resutls of '''remaining'''
 +
*Mark the out-of-bounds exclusions
 +
*Produce '''portcogeo''' 47715
 +
 +
===Firm Process===
 +
 +
For the firms, there are three sources, all of which are high precision:
 +
*bogeo 2046 from BranchOfficesGeo.txt
 +
*oldfirmcoords 5556 from FirmCoords.txt
 +
*firmremainingcoords 706 from FirmRemainingCoords.txt
 +
 +
These produce:
 +
*bogeo 2046
 +
*firmgeocoords 6049
 +
*firmbasecore 15437, which is large because it contains non-US

Latest revision as of 12:41, 21 September 2020


Project
Restoring vcdb3
Project logo 02.png
Project Information
Has title Restoring vcdb3
Has owner Ed Egan
Has start date
Has deadline date
Has project status Complete
Has project output Data
Copyright © 2019 edegan.com. All Rights Reserved.


See also

The VentureXpert Data build notes.

Restore the dbase

Run the following:

cd /bulk/mcnair/backups
su researcher
createdb vcdb3
pg_restore -Fc -d vcdb3 vcdb3_Fc_230818.dump
cd /bulk
mkdir vcdb3
cd vcdb3
psql vcdb3

Reload the last changes

  1. make vcdb3 in E:\projects
  2. copy C:\Users\ed\Dropbox\coauthoredprojects\NonMcNair\MatchingVCs\RevisedDbaseCode.sql to e:\projects\vcdb3
  3. Run that code!

1.6m queries running in <7s index on 1.6m in <1s Big table (99m) used to take ~2mins, now takes 2:15. Might want to reoptimize config. Big reduction used to take 12mins. Now takes 55 secs! In later queries (not as big), allocated cpu 97-100% but mem only 3.2% Copy up to 3.6% mem usage. Max observed 4.3%. All of the code ran correctly, with the right counts, etc.

Consolidate code

Code was in:

E:\mcnair\Projects\VentureXpert Database\vcdb3\LoadingScripts

The following files were copied to E:\projects\vcdb3\OriginalSQL

  • Adding GDP.sql
  • Agglomeration.SQL - this is the agglomeration work
  • CohortPortCoMatch.sql
  • GeoLoad.sql
  • LoadingScriptsV1.sql - this does the load!
  • MatchingEntrepsV3.sql - this is the main build, after the load
  • MiscTasks.SQL
  • Population.sql
  • PortcoMatchMaster.SQL
  • RoundRanking.sql - this is the city ranking script
  • RoundRankingStates.sql - this is the state ranking script

Note that all code, including ranking code was run before backup.

Fix the Geocoding

The longitudes and latitudes were stored as numeric, rather than real, but this doesn't seem to the source of the problem as it has suitable precision[1]. Instead the problem seems to be in the loading and unloading of tables through the \COPY command.

Regardless, we have decimal degrees to six decimal places from Google Maps, which is equivalent to a staggering 11cm of accuracy at the equator. See http://wiki.gis.com/wiki/index.php/Decimal_degrees.

There are some notes on VentureXpert_Data#GeoCoding_Companies.2C_Firms.2C_and_Branch_Offices and the SQL appears to be in:

E:\mcnair\Projects\VentureXpert Database\vcdb3\LoadingScripts\GeoLoad.sql

Company Process

The company process began with a load of oldgeocords 44740 from companybasegeomaster.txt (which came from vcdb2). The problem is that companybasegeomaster has low precision data in it. This was documented in VC_Database_Rebuild#Re-Fixing_erroneous_geo-coordinates. There's good data in vcdb2:

  • Geocore 43628 (with nulls), keyed by coname (colevelsimple)
  • Geoallcoords 44999 (with nulls), keyed by coname, datefirst, statecode
  • Geoallcoords1 44999 (with nulls), keyed by coname, datefirst, statecode, with exclude flag

The next step in the company geo process was to load remaining from RemainingLatLong.txt. This is high precision data.

The fix to the geo data is at the top of:

E:\projects\vcdb3\RevisedDbaseCode.sql

Essentially this fix:

  • DROPs portcogeo; oldgeocords; geoallcoords; geoallcoords1; goodgeoold; geoallcoords2; geoallcoords3; goodgeonew; geocodesportco;
  • Load geoallcoords1fromvcdb2 (which comes from vcdb2!)
  • Merge the results of geoallcoords1fromvcdb2 with the resutls of remaining
  • Mark the out-of-bounds exclusions
  • Produce portcogeo 47715

Firm Process

For the firms, there are three sources, all of which are high precision:

  • bogeo 2046 from BranchOfficesGeo.txt
  • oldfirmcoords 5556 from FirmCoords.txt
  • firmremainingcoords 706 from FirmRemainingCoords.txt

These produce:

  • bogeo 2046
  • firmgeocoords 6049
  • firmbasecore 15437, which is large because it contains non-US