US Address Verification
This project is composed of SQL queries against patents.ptoassigneend_allus
to gain confidence in the data. The scripts and rationale behind the design decisions in this table are described in its wiki page here and all original work was done by Sonia Zhang.
US Address Verification | |
---|---|
Project Information | |
Project Title | US Address Verification |
Owner | Oliver Chang |
Start Date | June 2017 |
Deadline | June 2017 |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Total Row Counts
Table Name | count(*) | Description |
---|---|---|
ptoassigneend | 7234001 | Base data with duplicates removed |
ptoassigneend_allus | 3572605 | Sonia's subset of only US addresses; 49.4% of ptoassigneend |
3572605 | ptoassigneend_us_cleaned | "cleaned postcode, city, and state" of ptoassigneend_allus; this is the one we want to work with |
5343 | ptoassigneend_us_extracted | cleaned up complete addresses; too small to be worthwhile |
Table Origin
Note that the suffix nd
stands for "no duplicates". Assume this method was done correctly.
Issue 1: Faulty CREATE WHERE clause
ptoassigneend_allus
is the result of
CREATE TABLE ptoassigneend_allus AS
SELECT * FROM ptoassigneend
WHERE city ~* '.*\d{5}[-]\d{4}.*' OR addrline1 ~* '.*\d{5}[-]\d{4}.*' OR
addrline2 ~* '.*\d{5}[-]\d{4}.*' OR postcode ~* '.*\d{5}[-]\d{4}.*' OR
country ~* 'UNITED STATES';
It is unclear why this approach is favored over
CREATE TABLE ptoassigneend_allus AS
SELECT * FROM ptoassigneend
WHERE (city ~* '.*\d{5}[-]\d{4}.*' OR addrline1 ~* '.*\d{5}[-]\d{4}.*' OR
addrline2 ~* '.*\d{5}[-]\d{4}.*' OR postcode ~* '.*\d{5}[-]\d{4}.*') AND
country ~* 'UNITED STATES';
It appears that the country field is correct and that the country names are taken from a small lookup table.
Proof:
patent=# select * from (select country, count(country) as c from ptoassigneend group by country o country | c ----------------------------------------+--------- UNITED STATES | 3570216 JAPAN | 1334774 GERMANY | 429429 KOREA, REPUBLIC OF | 284830 TAIWAN | 231493 FRANCE | 172995 CANADA | 132387 SWITZERLAND | 99281 UNITED KINGDOM | 90634 CHINA | 88954 NETHERLANDS | 84573 SWEDEN | 70538 ITALY | 61416 FINLAND | 44116 AUSTRALIA | 43314 ISRAEL | 41857 ENGLAND | 31341 DENMARK | 24742 BELGIUM | 24147 AUSTRIA | 19487 SINGAPORE | 17518 HONG KONG | 17091 SPAIN | 13288 INDIA | 12589 IRELAND | 11754 NORWAY | 11290 STATELESS | 10917 GREAT BRITAIN | 10692 BERMUDA | 8780 CAYMAN ISLANDS | 8206 NOT PROVIDED | 6658 VIRGIN ISLANDS, BRITISH | 6510 NEW ZEALAND | 6364 ONTARIO | 6047 LUXEMBOURG | 5856 KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF | 5625 LIECHTENSTEIN | 4711 BRAZIL | 4368 SOUTH AFRICA | 3699 RUSSIAN FEDERATION | 3375 SAUDI ARABIA | 3335 BARBADOS | 3115 HUNGARY | 2642 GERMAN DEMOCRATIC REPUBLIC | 2401 MEXICO | 2083 QUEBEC | 1777 NETHERLANDS ANTILLES | 1690 POLAND | 1607 MALAYSIA | 1606 BRITISH COLUMBIA | 1279 (50 rows)
Therefore, my proposed modified SELECT WHERE
is credible
patent=# select count(*) from ptoassigneend; count --------- 7234001 (1 row) patent=# select count(*) from ptoassigneend where country = 'UNITED STATES'; count --------- 3570216 (1 row) patent=# select count(*) from ptoassigneend where country like '%UNITED STATES%'; count --------- 3570217 (1 row) patent=# select * from ptoassigneend where (country like '%UNITED STATES%' and country != 'UNITED STATES'); reelno | frameno | name | addrline1 | addrline2 | --------+---------+------------------------------+---------------+----------------------+-------- 29344 | 970 | WARNER CHILCOTT COMPANY, LLC | P.O. BOX 1005 | UNION STREET, KM 1.1 | FAJARDO(1 row)