US Address Verification
This project is composed of SQL queries against patents.ptoassigneend_allus
to gain confidence in the data. The scripts and rationale behind the design decisions in this table are described in its wiki page here and all original work was done by Sonia Zhang.
US Address Verification | |
---|---|
Project Information | |
Project Title | US Address Verification |
Owner | Oliver Chang |
Start Date | June 2017 |
Deadline | June 2017 |
Primary Billing | |
Notes | |
Has project status | Active |
Copyright © 2016 edegan.com. All Rights Reserved. |
Contents
- 1 Total Row Counts
- 2 The Good
- 3 The Bad
- 3.1 state column is ill-constrained
- 3.2 ptoassigneend_us_cleaned.postcode_*_cleaned is total garbage
- 3.2.1 only .15% of records have their postcodes extracted correctly
- 3.2.2 the underlying data is fine though...
- 3.2.3 ...this case is hit when postcode is present but none of the other fields are present
- 3.2.4 ...easy gains can be gotten as a quick fix
- 3.2.5 ...5 digit postcode works great
- 3.2.6 ...9 digit postcode works ok (two forms)
- 3.2.7 ...10 digit postcode works great
- 4 The Ugly
- 5 A Stopgap Measure
Total Row Counts
Table Name | count(*) | Description |
---|---|---|
ptoassigneend | 7,234,001 | Base data with duplicates removed |
ptoassigneend_allus | 3,572,605 | Sonia's subset of only US addresses; 49.4% of ptoassigneend |
ptoassigneend_us_cleaned | 3,572,605 | "cleaned postcode, city, and state" of ptoassigneend_allus; this is the one we want to work with |
ptoassigneend_us_extracted | 5,343 | cleaned up complete addresses; too small to be worthwhile |
Note that the suffix nd
stands for "no duplicates". Assume this method was done correctly.
The Good
country
column is reliable
It appears that the country field is consistent across all of the entries, and that the country names are taken from a small lookup table.
Proof:
patent=# select country, count(country) as c from ptoassigneend group by country order by c desc limit 50; country | c ----------------------------------------+--------- UNITED STATES | 3570216 JAPAN | 1334774 GERMANY | 429429 KOREA, REPUBLIC OF | 284830 TAIWAN | 231493 FRANCE | 172995 CANADA | 132387 SWITZERLAND | 99281 UNITED KINGDOM | 90634 CHINA | 88954 NETHERLANDS | 84573 SWEDEN | 70538 ITALY | 61416 FINLAND | 44116 AUSTRALIA | 43314 ISRAEL | 41857 ENGLAND | 31341 DENMARK | 24742 BELGIUM | 24147 AUSTRIA | 19487 SINGAPORE | 17518 HONG KONG | 17091 SPAIN | 13288 INDIA | 12589 IRELAND | 11754 NORWAY | 11290 STATELESS | 10917 GREAT BRITAIN | 10692 BERMUDA | 8780 CAYMAN ISLANDS | 8206 NOT PROVIDED | 6658 VIRGIN ISLANDS, BRITISH | 6510 NEW ZEALAND | 6364 ONTARIO | 6047 LUXEMBOURG | 5856 KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF | 5625 LIECHTENSTEIN | 4711 BRAZIL | 4368 SOUTH AFRICA | 3699 RUSSIAN FEDERATION | 3375 SAUDI ARABIA | 3335 BARBADOS | 3115 HUNGARY | 2642 GERMAN DEMOCRATIC REPUBLIC | 2401 MEXICO | 2083 QUEBEC | 1777 NETHERLANDS ANTILLES | 1690 POLAND | 1607 MALAYSIA | 1606 BRITISH COLUMBIA | 1279 (50 rows)
territories, etc are categorized differently
There are single-digit edge cases that in the strictest interpretation of country are correct.
patent=# select count(*) from ptoassigneend; count --------- 7234001 (1 row) patent=# select count(*) from ptoassigneend where country = 'UNITED STATES'; count --------- 3570216 (1 row) patent=# select count(*) from ptoassigneend where country like '%UNITED STATES%'; count --------- 3570217 (1 row) patent=# select * from ptoassigneend where (country like '%UNITED STATES%' and country != 'UNITED STATES'); reelno | frameno | name | addrline1 | addrline2 | --------+---------+------------------------------+---------------+----------------------+-------- 29344 | 970 | WARNER CHILCOTT COMPANY, LLC | P.O. BOX 1005 | UNION STREET, KM 1.1 | FAJARDO(1 row)
all zipcodes are valid US zipcodes
Using the data from geonames for the United States, put into the table via psql < Z:\zipcodes-oliver\dump.sql
where dump.sql
is generated via E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\uspto_assignments\GeonamesZips.java
, all of the values for postcodes are valid US postcodes.
patent=# select count(distinct c.postcode_f5_cleaned) from ptoassigneend_us_cleaned as c left join geonames_us_zipcodes as z on c.postcode_f5_cleaned = z.zipcode; count ------- 1116 (1 row) patent=# select count(distinct postcode_f5_cleaned) from ptoassigneend_us_cleaned; count ------- 1116 (1 row)
The Bad
state
column is ill-constrained
# returns 57 rows (District of Columbia is counted as distinct state) # select distinct state, count(state) from ptoassigneend_us_cleaned group by state limit 100; ... Armed Forces in Europe, the Middle East, Africa, and Canada | 2 UNITED STATES ARMY | 2 Armed Forces in the Pacific | 3 | 1 NATIONAL AERONAUTICS AND SPACE ADMINISTRATION | 25 | 0 ...
ptoassigneend_us_cleaned.postcode_*_cleaned
is total garbage
This garbage stretches to include
ptoassigneend_us_cleaned.postcode_f5_cleaned
ptoassigneend_us_cleaned.postcode_cleaned
only .15% of records have their postcodes extracted correctly
patent=# select count(*) from ptoassigneend_us_cleaned; count --------- 3572605 (1 row) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned is not null; count --------- 3376480 (1 row) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_f5_cleaned is not null; count ------- 5344 (1 row) patent=# select postcode_cleaned, postcode_f5_cleaned from ptoassigneend_us_cleaned where (postcode_cleaned is not null and postcode_f5_cleaned is null) limit 12; postcode_cleaned | postcode_f5_cleaned -------------------------------------------+--------------------- £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 | (12 rows) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082'; count --------- 3371136 (1 row)
the underlying data is fine though...
patent=# select postcode, postcode_addr1, postcode_addr2, postcode_city from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' limit 5; postcode | postcode_addr1 | postcode_addr2 | postcode_city ----------+----------------+----------------+--------------- 75024 | | | 55379 | | | 94538 | | | 23219 | | | 73114 | | | (5 rows)
Moreover, selecting from the tables that ptoassigneend_us_cleaned
is derived from did not yield this string. Therefore, there is likely an error in the SQL script, perhaps with some wonky copy-pasting from the internet.
...this case is hit when postcode is present but none of the other fields are present
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 0082'; count --------- 3371136 (1 row) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 0082' and postcode is not null; count --------- 3371136 (1 row) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 0082' and postcode is not null and postcode_addr1 is null and postcode_addr2 is null and postcode_city is null; count --------- 3371136 (1 row)
...easy gains can be gotten as a quick fix
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and postcode is not null; count --------- 3371136 (1 row) patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5; count --------- 2971542 (1 row)
...5 digit postcode works great
patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5 limit 12; postcode ---------- 75024 55379 94538 23219 73114 95134 33487 60603 84604 20191 97213 10504 (12 rows)
...9 digit postcode works ok (two forms)
patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 9 limit 12; postcode ----------- USA 20585 USA 33458 4826-2766 181951501 772522463 USA 27601 772522463 USA 20310 USA 47202 174012991 913929221 USA 12345 (12 rows)
...10 digit postcode works great
patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 10 limit 12; postcode ------------ 92121-1714 95134-1706 80527-2400 95066-4544 01862-2000 07962-2245 90245-5012 55133-3427 30332-0415 95134-1706 20892-7660 94080-4990 (12 rows)
The Ugly
fields directly contradict each other
Possible Heuristics:
- Presence of Ln, St, Ave, Blvd, Cir, etc...
approximate data is available for stateless assignees
only on the order of 500 stateless assignees have postcodes
patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS'); count ------- 17575 (1 row) patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode is not null); count ------- 329 (1 row) patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode is not null or addrline1 is not null); count ------- 1572 (1 row)
approximate location data is available (but unused)
patent=# select name, addrline1, addrline2, postcode from ptoassigneend where (country = 'NOT PROVIDED' or country = 'S TATELESS') and (addrline1 is not null or addrline2 is not null) limit 30; name | addrline1 | addrline2 | postcode --------------------------------------------------------------------------------------+----------------------------------+----------------------------------+---------- REGAL KING COMMERCIAL OFFSHORE DE MACAU LIMITED | ALAMEDA DR. CARLOS D'S ASSUMPCAO | DYNASTY PLAZA, 19 AND. (H) MACAO | ECODYNE CORPORATION | | A CORP. OF DE | SHELL OIL COMPANY | | A CORP. OF DE | RCA CORPORATION | | A CORP. OF DE | BACHEM FEINCHEMIKALIEN A.G. | | A SWISS CORP. | AMERICAN STERILIZER COMPANY | | A PA. CORP. | CREATIVE TECHNOLOGY LTD. | CREATIVE RESOURCE | 31 INTERNATIONAL BUSINESS PARK | 60922 FUJITSU LIMITED | NAKAHARA-KU, KAWASAKI-SHI | 1-1, KAMIKODANAKA 4-CHOME | 211-8 SIEMENS AKTIENGESELLSCHAFT | | A GERMAN CORP | ISABERG AB, HESTRA | | A CORP. OF SWEDEN | MOELLER MANUFACTURING CO., INC. | | A DE CORP. | MOBIL OIL CORPORATION | | A CORP OF NY | GENERAL ELECTRIC COMPANY | | A NY CORP. | GENERAL ELECTRIC COMPANY | A CORP. OF NY | | BTR INDUSTRIES LIMITED | CARLISLE PLACE | BTR HOUSE | DIATRON CORPORATION | | A CORP. OF CA | CLOSURES AND PACKAGING SERVICES LIMITED | P. O. BOX 119 | | GY1 3HB DR. JOHANNES HEIDENHAIN GMBH A CORPORATION OF GERMANY | | TRAUNREUT CITY | AMERICAN SAFETY RAZOR COMPANY | | A DE CORP. | ETHICON, INC. | | A NJ CORP. | GENERAL ELECTRIC COMPANY | | A NEW YORK CORP. | MOBIL OIL CORPORATION | | A CORP. OF NY | RICHARD LANGLECHNER GMBH | MAUERBERGERSTR. 15 | | KONISHIROKU PHOTO INDUSTRY CO., LTD., | | A CORP OF JAPAN | SHELL OIL COMPANY | | A CORP. OF DE | SIEMENS AKTIENGESELLSCHAFT A GERMAN CORPORATION | | MUNICH | C.A. BRIGGS COMPANY | | A CORP OF PA | GTE PRODUCTS CORPORATION | | A DE CORP. | TRAMEX ENGINEERING LIMITED | 8 CLARE STREET, | A IRISH COMPANY | MOBIL OIL CORPORATION | | A CORPORATION OF NEW YORK |
A Stopgap Measure
As a quick and dirty fix, walk through the cases and do the best we can. The script for that is located at E:\McNair\Projects\SimplerPatentData\src\db\ZipCodeHotfix.sql
and it creates the following table:
DROP VIEW IF EXISTS june_2017_zipcode_hotfix; CREATE VIEW june_2017_zipcode_hotfix AS SELECT reelno, frameno, name, GodHelpUsAll(postcode_cleaned, postcode) as postcode FROM ptoassigneend_us_cleaned;