Difference between revisions of "Restoring vcdb3"
Line 63: | Line 63: | ||
===Company Process=== | ===Company Process=== | ||
− | The company process began with a load of oldgeocords 44740 from companybasegeomaster.txt (which came from vcdb2). The problem is that companybasegeomaster has low precision data in it. This was documented in [[VC_Database_Rebuild#Re-Fixing_erroneous_geo-coordinates]]. There's good data | + | The company process began with a load of oldgeocords 44740 from companybasegeomaster.txt (which came from vcdb2). The problem is that companybasegeomaster has low precision data in it. This was documented in [[VC_Database_Rebuild#Re-Fixing_erroneous_geo-coordinates]]. There's good data in vcdb2: |
− | *Geocore 43628 (with nulls) | + | *Geocore 43628 (with nulls), keyed by coname (colevelsimple) |
− | *Geoallcoords 44999 (with nulls) | + | *Geoallcoords 44999 (with nulls), keyed by coname, datefirst, statecode |
− | *Geoallcoords1 44999 (with nulls) | + | *Geoallcoords1 44999 (with nulls), keyed by coname, datefirst, statecode, with exclude flag |
The next step in the company geo process was to load remaining from RemainingLatLong.txt. This is high precision data. | The next step in the company geo process was to load remaining from RemainingLatLong.txt. This is high precision data. |
Revision as of 18:20, 30 May 2019
Restoring vcdb3 | |
---|---|
Project Information | |
Has title | Restoring vcdb3 |
Has owner | Ed Egan |
Has start date | |
Has deadline date | |
Has project status | Complete |
Copyright © 2019 edegan.com. All Rights Reserved. |
Contents
See also
The VentureXpert Data build notes.
Restore the dbase
Run the following:
cd /bulk/mcnair/backups su researcher createdb vcdb3 pg_restore -Fc -d vcdb3 vcdb3_Fc_230818.dump cd /bulk mkdir vcdb3 cd vcdb3 psql vcdb3
Reload the last changes
- make vcdb3 in E:\projects
- copy C:\Users\ed\Dropbox\coauthoredprojects\NonMcNair\MatchingVCs\RevisedDbaseCode.sql to e:\projects\vcdb3
- Run that code!
1.6m queries running in <7s index on 1.6m in <1s Big table (99m) used to take ~2mins, now takes 2:15. Might want to reoptimize config. Big reduction used to take 12mins. Now takes 55 secs! In later queries (not as big), allocated cpu 97-100% but mem only 3.2% Copy up to 3.6% mem usage. Max observed 4.3%. All of the code ran correctly, with the right counts, etc.
Consolidate code
Code was in:
E:\mcnair\Projects\VentureXpert Database\vcdb3\LoadingScripts
The following files were copied to E:\projects\vcdb3\OriginalSQL
- Adding GDP.sql
- Agglomeration.SQL - this is the agglomeration work
- CohortPortCoMatch.sql
- GeoLoad.sql
- LoadingScriptsV1.sql - this does the load!
- MatchingEntrepsV3.sql - this is the main build, after the load
- MiscTasks.SQL
- Population.sql
- PortcoMatchMaster.SQL
- RoundRanking.sql - this is the city ranking script
- RoundRankingStates.sql - this is the state ranking script
Note that all code, including ranking code was run before backup.
Fix the Geocoding
The longitudes and latitudes were stored as numeric, rather than real, but this doesn't seem to the source of the problem as it has suitable precision[1].
There are some notes on VentureXpert_Data#GeoCoding_Companies.2C_Firms.2C_and_Branch_Offices and the SQL appears to be in:
E:\mcnair\Projects\VentureXpert Database\vcdb3\LoadingScripts\GeoLoad.sql
Company Process
The company process began with a load of oldgeocords 44740 from companybasegeomaster.txt (which came from vcdb2). The problem is that companybasegeomaster has low precision data in it. This was documented in VC_Database_Rebuild#Re-Fixing_erroneous_geo-coordinates. There's good data in vcdb2:
- Geocore 43628 (with nulls), keyed by coname (colevelsimple)
- Geoallcoords 44999 (with nulls), keyed by coname, datefirst, statecode
- Geoallcoords1 44999 (with nulls), keyed by coname, datefirst, statecode, with exclude flag
The next step in the company geo process was to load remaining from RemainingLatLong.txt. This is high precision data.
The fix to the geo data is at the top of:
E:\projects\vcdb3\RevisedDbaseCode.sql
Firm Process
For the firms, there are three sources, all of which are high precision:
- bogeo 2046 from BranchOfficesGeo.txt
- oldfirmcoords 5556 from FirmCoords.txt
- firmremainingcoords 706 from FirmRemainingCoords.txt
These produce:
- bogeo 2046
- firmgeocoords 6049
- firmbasecore 15437, which is large because it contains non-US