US Address Verification

From edegan.com
Jump to navigation Jump to search



This project is composed of SQL queries against patents.ptoassigneend_allus to gain confidence in the data. The scripts and rationale behind the design decisions in this table are described in its wiki page here and all original work was done by Sonia Zhang.

McNair Project
US Address Verification
Project logo 02.png
Project Information
Project Title US Address Verification
Owner Oliver Chang
Start Date June 2017
Deadline June 2017
Primary Billing
Notes
Has project status Active
Copyright © 2016 edegan.com. All Rights Reserved.


Total Row Counts

Number of Rows
Table Name count(*) Description
ptoassigneend 7234001 Base data with duplicates removed
ptoassigneend_allus 3572605 Sonia's subset of only US addresses; 49.4% of ptoassigneend
ptoassigneend_us_cleaned 3572605 "cleaned postcode, city, and state" of ptoassigneend_allus; this is the one we want to work with
ptoassigneend_us_extracted 5343 cleaned up complete addresses; too small to be worthwhile

Table Origin

Note that the suffix nd stands for "no duplicates". Assume this method was done correctly.

Issue 1: Faulty CREATE WHERE clause

ptoassigneend_allus is the result of

CREATE TABLE ptoassigneend_allus AS SELECT * FROM ptoassigneend WHERE city ~* '.*\d{5}[-]\d{4}.*' OR addrline1 ~* '.*\d{5}[-]\d{4}.*' OR addrline2 ~* '.*\d{5}[-]\d{4}.*' OR postcode ~* '.*\d{5}[-]\d{4}.*' OR country ~* 'UNITED STATES';

It is unclear why this approach is favored over

CREATE TABLE ptoassigneend_allus AS SELECT * FROM ptoassigneend WHERE (city ~* '.*\d{5}[-]\d{4}.*' OR addrline1 ~* '.*\d{5}[-]\d{4}.*' OR addrline2 ~* '.*\d{5}[-]\d{4}.*' OR postcode ~* '.*\d{5}[-]\d{4}.*') AND country ~* 'UNITED STATES';

It appears that the country field is correct and that the country names are taken from a small lookup table.

Proof:

patent=# select * from (select country, count(country) as c from ptoassigneend group by country o
                country                 |    c
----------------------------------------+---------
 UNITED STATES                          | 3570216
 JAPAN                                  | 1334774
 GERMANY                                |  429429
 KOREA, REPUBLIC OF                     |  284830
 TAIWAN                                 |  231493
 FRANCE                                 |  172995
 CANADA                                 |  132387
 SWITZERLAND                            |   99281
 UNITED KINGDOM                         |   90634
 CHINA                                  |   88954
 NETHERLANDS                            |   84573
 SWEDEN                                 |   70538
 ITALY                                  |   61416
 FINLAND                                |   44116
 AUSTRALIA                              |   43314
 ISRAEL                                 |   41857
 ENGLAND                                |   31341
 DENMARK                                |   24742
 BELGIUM                                |   24147
 AUSTRIA                                |   19487
 SINGAPORE                              |   17518
 HONG KONG                              |   17091
 SPAIN                                  |   13288
 INDIA                                  |   12589
 IRELAND                                |   11754
 NORWAY                                 |   11290
 STATELESS                              |   10917
 GREAT BRITAIN                          |   10692
 BERMUDA                                |    8780
 CAYMAN ISLANDS                         |    8206
 NOT PROVIDED                           |    6658
 VIRGIN ISLANDS, BRITISH                |    6510
 NEW ZEALAND                            |    6364
 ONTARIO                                |    6047
 LUXEMBOURG                             |    5856
 KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF |    5625
 LIECHTENSTEIN                          |    4711
 BRAZIL                                 |    4368
 SOUTH AFRICA                           |    3699
 RUSSIAN FEDERATION                     |    3375
 SAUDI ARABIA                           |    3335
 BARBADOS                               |    3115
 HUNGARY                                |    2642
 GERMAN DEMOCRATIC REPUBLIC             |    2401
 MEXICO                                 |    2083
 QUEBEC                                 |    1777
 NETHERLANDS ANTILLES                   |    1690
 POLAND                                 |    1607
 MALAYSIA                               |    1606
 BRITISH COLUMBIA                       |    1279
(50 rows)

Therefore, my proposed modified SELECT WHERE is credible

patent=# select count(*) from ptoassigneend;
  count
---------
 7234001
(1 row)

patent=# select count(*) from ptoassigneend where country = 'UNITED STATES';
  count
---------
 3570216
(1 row)

patent=# select count(*) from ptoassigneend where country like '%UNITED STATES%';
  count
---------
 3570217
(1 row)

patent=# select * from ptoassigneend where (country like '%UNITED STATES%' and country != 'UNITED STATES');
 reelno | frameno |             name             |   addrline1   |      addrline2       |
--------+---------+------------------------------+---------------+----------------------+--------
  29344 |     970 | WARNER CHILCOTT COMPANY, LLC | P.O. BOX 1005 | UNION STREET, KM 1.1 | FAJARDO(1 row)