Difference between revisions of "US Address Verification"

From edegan.com
Jump to navigation Jump to search
 
(35 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{McNair Projects
+
{{Project
 +
|Has project output=Tool
 +
|Has sponsor=McNair Center
 
|Has title=US Address Verification
 
|Has title=US Address Verification
 
|Has owner=Oliver Chang,
 
|Has owner=Oliver Chang,
 
|Has start date=June 2017
 
|Has start date=June 2017
 
|Has deadline=June 2017
 
|Has deadline=June 2017
|Has project status=Active
+
|Has project status=Complete
 
}}
 
}}
  
Line 15: Line 17:
 
|+Number of Rows
 
|+Number of Rows
 
|-
 
|-
! scope="col" | Table Name !! scope="col" | count(*) !! scope="col" | Description
+
! scope="col" | Table Name !! scope="col" | count(*) !! scope="col" | Description of Tables in <code>patent</code>
 
|-  
 
|-  
 
| ptoassigneend
 
| ptoassigneend
| 7234001
+
| 7,234,001
 
| Base data with duplicates removed
 
| Base data with duplicates removed
 
|-  
 
|-  
 
| ptoassigneend_allus
 
| ptoassigneend_allus
| 3572605
+
| 3,572,605
 
| Sonia's subset of only US addresses; 49.4% of ptoassigneend
 
| Sonia's subset of only US addresses; 49.4% of ptoassigneend
 
|-  
 
|-  
 
| ptoassigneend_us_cleaned
 
| ptoassigneend_us_cleaned
| 3572605
+
| 3,572,605
 
| "cleaned postcode, city, and state" of ptoassigneend_allus; '''this is the one we want to work with'''
 
| "cleaned postcode, city, and state" of ptoassigneend_allus; '''this is the one we want to work with'''
 
|-
 
|-
 
| ptoassigneend_us_extracted
 
| ptoassigneend_us_extracted
| 5343
+
| 5,343
 
| cleaned up complete addresses; too small to be worthwhile
 
| cleaned up complete addresses; too small to be worthwhile
 +
|-
 +
| june_2017_hotfix
 +
| 3,370,613
 +
| quick and dirty fix to get first 5 postcode digits from ptoassigneend_us_cleaned using heuristics (see [[#A_Stopgap_Measure]])
 +
|-
 +
| june_2017_zipcode_join
 +
| 1,365,408
 +
| hack to join zipcodes to patent ids; see <code>E:\McNair\Projects\SimplerPatentData\src\db\ZipcodeMapAssignmentToPatent.sql</code> for methodology
 
|}
 
|}
 
== Table Origin ==
 
  
 
Note that the suffix <code>nd</code> stands for "no duplicates". Assume this method was done correctly.
 
Note that the suffix <code>nd</code> stands for "no duplicates". Assume this method was done correctly.
  
'''Issue 1: Faulty CREATE WHERE clause'''
+
== The Good ==
 
 
<code>ptoassigneend_allus</code> is the result of
 
  
<code>
+
=== <code>country</code> column is reliable ===
CREATE TABLE ptoassigneend_allus AS
 
SELECT * FROM ptoassigneend
 
WHERE city ~* '.*\d{5}[-]\d{4}.*' OR addrline1 ~* '.*\d{5}[-]\d{4}.*' OR
 
addrline2 ~* '.*\d{5}[-]\d{4}.*' OR postcode ~* '.*\d{5}[-]\d{4}.*' OR
 
country ~* 'UNITED STATES';
 
</code>
 
  
It is unclear why this approach is favored over
+
It appears that the country field is consistent across all of the entries, and that the country names are taken from a small lookup table.
 
 
<code>
 
CREATE TABLE ptoassigneend_allus AS
 
SELECT * FROM ptoassigneend
 
WHERE (city ~* '.*\d{5}[-]\d{4}.*' OR addrline1 ~* '.*\d{5}[-]\d{4}.*' OR
 
addrline2 ~* '.*\d{5}[-]\d{4}.*' OR postcode ~* '.*\d{5}[-]\d{4}.*') AND
 
country ~* 'UNITED STATES';
 
</code>
 
 
 
It appears that the country field is correct and that the country names are taken from a small lookup table.
 
  
 
'''Proof:'''
 
'''Proof:'''
  
 
  <nowiki>
 
  <nowiki>
patent=# select * from (select country, count(country) as c from ptoassigneend group by country o
+
patent=# select country, count(country) as c from ptoassigneend group by country order by c desc limit 50;
 
                 country                |    c
 
                 country                |    c
 
----------------------------------------+---------
 
----------------------------------------+---------
Line 121: Line 111:
 
</nowiki>
 
</nowiki>
  
Therefore, my proposed modified <code>SELECT WHERE</code> is credible
+
=== territories, etc are categorized differently ===
 +
 
 +
There are single-digit edge cases that in the strictest interpretation of country are correct.
  
 
  <nowiki>
 
  <nowiki>
Line 147: Line 139:
 
   29344 |    970 | WARNER CHILCOTT COMPANY, LLC | P.O. BOX 1005 | UNION STREET, KM 1.1 | FAJARDO(1 row)
 
   29344 |    970 | WARNER CHILCOTT COMPANY, LLC | P.O. BOX 1005 | UNION STREET, KM 1.1 | FAJARDO(1 row)
 
</nowiki>
 
</nowiki>
 +
 +
=== all zipcodes are valid US zipcodes ===
 +
 +
Using the data from [http://download.geonames.org/export/zip/ geonames] for the United States, put into the table via <code>psql < Z:\zipcodes-oliver\dump.sql</code> where <code>dump.sql</code> is generated via <code>E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\uspto_assignments\GeonamesZips.java</code>, all of the values for postcodes are valid US postcodes.
 +
 +
<nowiki>
 +
patent=# select count(distinct c.postcode_f5_cleaned) from ptoassigneend_us_cleaned as c left join geonames_us_zipcodes
 +
as z on c.postcode_f5_cleaned = z.zipcode;                                                                           
 +
count                                                                                                                 
 +
-------                                                                                                               
 +
  1116                                                                                                                 
 +
(1 row)                                                                                                               
 +
                                                                                                                       
 +
patent=# select count(distinct postcode_f5_cleaned) from ptoassigneend_us_cleaned;                                     
 +
count                                                                                                                 
 +
-------                                                                                                               
 +
  1116                                                                                                                 
 +
(1 row)                                                                                                               
 +
</nowiki>
 +
 +
== The Bad ==
 +
 +
=== <code>state</code> column is ill-constrained ===
 +
 +
<nowiki>
 +
# returns 57 rows (District of Columbia is counted as distinct state)
 +
# select distinct state, count(state) from ptoassigneend_us_cleaned group by state limit 100;
 +
...
 +
Armed Forces in Europe, the Middle East, Africa, and Canada |      2
 +
UNITED STATES ARMY                                          |      2
 +
Armed Forces in the Pacific                                |      3
 +
                                                            |      1
 +
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION              |    25
 +
                                                            |      0
 +
...
 +
</nowiki>
 +
 +
 +
=== <code>ptoassigneend_us_cleaned.postcode_*_cleaned</code> is total garbage ===
 +
 +
This garbage stretches to include
 +
 +
* <code>ptoassigneend_us_cleaned.postcode_f5_cleaned</code>
 +
* <code>ptoassigneend_us_cleaned.postcode_cleaned</code>
 +
 +
==== only .15% of records have their postcodes extracted correctly ====
 +
 +
<nowiki>
 +
patent=# select count(*) from ptoassigneend_us_cleaned;                                                                                                             
 +
  count                                                                                                                                                             
 +
---------                                                                                                                                                           
 +
3572605                                                                                                                                                           
 +
(1 row)                                                                                                                                                             
 +
                                                                                                                                                                                                     
 +
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned is not null;                                                                         
 +
  count                                                                                                                                                             
 +
---------                                                                                                                                                           
 +
3376480                                                                                                                                                           
 +
(1 row)                                                                                                                                                             
 +
                                                                                                                                                                   
 +
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_f5_cleaned is not null;                                                                       
 +
count                                                                                                                                                             
 +
-------                                                                                                                                                             
 +
  5344                                                                                                                                                             
 +
(1 row)                                                                                                                                                             
 +
                                                                                                                                                                   
 +
patent=# select postcode_cleaned, postcode_f5_cleaned from ptoassigneend_us_cleaned where (postcode_cleaned is not null and postcode_f5_cleaned is null) limit 12; 
 +
            postcode_cleaned              | postcode_f5_cleaned                                                                                                   
 +
-------------------------------------------+---------------------                                                                                                   
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                       
 +
(12 rows)
 +
 +
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082';
 +
  count
 +
---------
 +
3371136
 +
(1 row)                                                                                                                                                         
 +
</nowiki>
 +
 +
==== the underlying data is fine though... ====
 +
 +
<nowiki>
 +
patent=# select postcode, postcode_addr1, postcode_addr2, postcode_city from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' limit 5;
 +
postcode | postcode_addr1 | postcode_addr2 | postcode_city
 +
----------+----------------+----------------+---------------
 +
75024    |                |                |
 +
55379    |                |                |
 +
94538    |                |                |
 +
23219    |                |                |
 +
73114    |                |                |
 +
(5 rows)</nowiki>
 +
 +
Moreover, selecting from the tables that <code>ptoassigneend_us_cleaned</code> is derived from did not yield this string. Therefore, there is likely an error in the SQL script, perhaps with some wonky copy-pasting from the internet.
 +
 +
==== ...this case is hit when postcode is present but none of the other fields are present ====
 +
 +
<nowiki>
 +
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u
 +
0082';                                                                                                                 
 +
  count                                                                                                               
 +
---------                                                                                                             
 +
3371136                                                                                                               
 +
(1 row)                                                                                                               
 +
                                                                                                                       
 +
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u
 +
0082' and postcode is not null;                                                                                       
 +
  count                                                                                                               
 +
---------                                                                                                             
 +
3371136                                                                                                               
 +
(1 row)           
 +
 +
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 0082' and postcode is not null and postcode_addr1 is null and postcode_addr2 is null and postcode_city is null;
 +
  count
 +
---------
 +
3371136
 +
(1 row)</nowiki>
 +
 +
==== ...easy gains can be gotten as a quick fix ====
 +
 +
<nowiki>
 +
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and postcode is not null;
 +
  count
 +
---------
 +
3371136
 +
(1 row)
 +
 +
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5;
 +
  count
 +
---------
 +
2971542
 +
(1 row)</nowiki>
 +
 +
==== ...5 digit postcode works great ====
 +
 +
<nowiki>
 +
patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5 limit 12;
 +
postcode
 +
----------
 +
75024
 +
55379
 +
94538
 +
23219
 +
73114
 +
95134
 +
33487
 +
60603
 +
84604
 +
20191
 +
97213
 +
10504
 +
(12 rows)</nowiki>
 +
 +
==== ...9 digit postcode works ok (two forms) ====
 +
 +
<nowiki>
 +
patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 9 limit 12;
 +
postcode
 +
-----------
 +
USA 20585
 +
USA 33458
 +
4826-2766
 +
181951501
 +
772522463
 +
USA 27601
 +
772522463
 +
USA 20310
 +
USA 47202
 +
174012991
 +
913929221
 +
USA 12345
 +
(12 rows)</nowiki>
 +
 +
==== ...10 digit postcode works great ====
 +
 +
<nowiki>
 +
patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 10 limit 12;
 +
  postcode
 +
------------
 +
92121-1714
 +
95134-1706
 +
80527-2400
 +
95066-4544
 +
01862-2000
 +
07962-2245
 +
90245-5012
 +
55133-3427
 +
30332-0415
 +
95134-1706
 +
20892-7660
 +
94080-4990
 +
(12 rows)</nowiki>
 +
 +
== The Ugly ==
 +
 +
=== fields directly contradict each other ===
 +
 +
Possible Heuristics:
 +
 +
* Presence of Ln, St, Ave, Blvd, Cir, etc...
 +
 +
=== approximate data is available for stateless assignees ===
 +
 +
==== only on the order of 500 stateless assignees have postcodes ====
 +
<nowiki>
 +
patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS');
 +
count
 +
-------
 +
17575
 +
(1 row)
 +
 +
patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode is
 +
not null);
 +
count
 +
-------
 +
  329
 +
(1 row)
 +
 +
patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode is
 +
not null or addrline1 is not null);
 +
count
 +
-------
 +
  1572
 +
(1 row)
 +
</nowiki>
 +
 +
 +
==== approximate location data is available  (but unused) ====
 +
 +
<nowiki>
 +
patent=# select name, addrline1, addrline2, postcode from ptoassigneend where (country = 'NOT PROVIDED' or country = 'S
 +
TATELESS') and (addrline1 is not null or addrline2 is not null)  limit 30;
 +
                                        name                                        |            addrline1            |            addrline2            | postcode
 +
--------------------------------------------------------------------------------------+----------------------------------+----------------------------------+----------
 +
REGAL KING COMMERCIAL OFFSHORE DE MACAU LIMITED                                      | ALAMEDA DR. CARLOS D'S ASSUMPCAO | DYNASTY PLAZA, 19 AND. (H) MACAO |
 +
ECODYNE CORPORATION                                                                  |                                  | A CORP. OF DE                    |
 +
SHELL OIL COMPANY                                                                    |                                  | A CORP. OF DE                    |
 +
RCA CORPORATION                                                                      |                                  | A CORP. OF DE                    |
 +
BACHEM FEINCHEMIKALIEN A.G.                                                          |                                  | A SWISS CORP.                    |
 +
AMERICAN STERILIZER COMPANY                                                          |                                  | A PA. CORP.                      |
 +
CREATIVE TECHNOLOGY LTD.                                                            | CREATIVE RESOURCE                | 31 INTERNATIONAL BUSINESS PARK  | 60922
 +
FUJITSU LIMITED                                                                      | NAKAHARA-KU, KAWASAKI-SHI        | 1-1, KAMIKODANAKA 4-CHOME        | 211-8
 +
SIEMENS AKTIENGESELLSCHAFT                                                          |                                  | A GERMAN CORP                    |
 +
ISABERG AB, HESTRA                                                                  |                                  | A CORP. OF SWEDEN                |
 +
MOELLER MANUFACTURING CO., INC.                                                      |                                  | A DE CORP.                      |
 +
MOBIL OIL CORPORATION                                                                |                                  | A CORP OF NY                    |
 +
GENERAL ELECTRIC COMPANY                                                            |                                  | A NY CORP.                      |
 +
GENERAL ELECTRIC COMPANY                                                            | A CORP. OF NY                    |                                  |
 +
BTR INDUSTRIES LIMITED                                                              | CARLISLE PLACE                  | BTR HOUSE                        |
 +
DIATRON CORPORATION                                                                  |                                  | A CORP. OF CA                    |
 +
CLOSURES AND PACKAGING SERVICES LIMITED                                              | P. O. BOX 119                    |                                  | GY1 3HB
 +
DR. JOHANNES HEIDENHAIN GMBH                                A CORPORATION OF GERMANY |                                  | TRAUNREUT CITY                  |
 +
AMERICAN SAFETY RAZOR COMPANY                                                        |                                  | A DE CORP.                      |
 +
ETHICON, INC.                                                                        |                                  | A NJ CORP.                      |
 +
GENERAL ELECTRIC COMPANY                                                            |                                  | A NEW YORK CORP.                |
 +
MOBIL OIL CORPORATION                                                                |                                  | A CORP. OF NY                    |
 +
RICHARD LANGLECHNER GMBH                                                            | MAUERBERGERSTR. 15              |                                  |
 +
KONISHIROKU PHOTO INDUSTRY CO., LTD.,                                                |                                  | A CORP OF JAPAN                  |
 +
SHELL OIL COMPANY                                                                    |                                  | A CORP. OF DE                    |
 +
SIEMENS AKTIENGESELLSCHAFT                                  A GERMAN CORPORATION    |                                  | MUNICH                          |
 +
C.A. BRIGGS COMPANY                                                                  |                                  | A CORP OF PA                    |
 +
GTE PRODUCTS CORPORATION                                                            |                                  | A DE CORP.                      |
 +
TRAMEX ENGINEERING LIMITED                                                          | 8 CLARE STREET,                  | A IRISH COMPANY                  |
 +
MOBIL OIL CORPORATION                                                                |                                  | A CORPORATION OF NEW YORK        |
 +
</nowiki>
 +
 +
== A Stopgap Measure ==
 +
 +
As a quick and dirty fix, walk through the cases and do the best we can. The script for that is located at <code>E:\McNair\Projects\SimplerPatentData\src\db\hacks\June2017PostcodeHotfix.sql</code> and it creates the following table:
 +
 +
<nowiki>
 +
DROP VIEW IF EXISTS june_2017_postcodecode_hotfix;
 +
CREATE VIEW june_2017_postcode_hotfix AS
 +
  SELECT
 +
    reelno,
 +
    frameno,
 +
    name,
 +
    city_cleaned as city,
 +
    GodHelpUsAll(postcode_cleaned, postcode) as postcode
 +
  FROM ptoassigneend_us_cleaned;</nowiki>
 +
 +
<nowiki>
 +
patent=# select count(*) from june_2017_postcode_hotfix ;
 +
  count
 +
---------
 +
3572605
 +
(1 row)
 +
 +
patent=# select count(*) from june_2017_postcode_hotfix where postcode is not null;
 +
  count
 +
---------
 +
3370613
 +
(1 row)</nowiki>

Latest revision as of 12:47, 21 September 2020


Project
US Address Verification
Project logo 02.png
Project Information
Has title US Address Verification
Has owner Oliver Chang
Has start date June 2017
Has deadline date
Has project status Complete
Has sponsor McNair Center
Has project output Tool
Copyright © 2019 edegan.com. All Rights Reserved.


This project is composed of SQL queries against patents.ptoassigneend_allus to gain confidence in the data. The scripts and rationale behind the design decisions in this table are described in its wiki page here and all original work was done by Sonia Zhang.

Total Row Counts

Number of Rows
Table Name count(*) Description of Tables in patent
ptoassigneend 7,234,001 Base data with duplicates removed
ptoassigneend_allus 3,572,605 Sonia's subset of only US addresses; 49.4% of ptoassigneend
ptoassigneend_us_cleaned 3,572,605 "cleaned postcode, city, and state" of ptoassigneend_allus; this is the one we want to work with
ptoassigneend_us_extracted 5,343 cleaned up complete addresses; too small to be worthwhile
june_2017_hotfix 3,370,613 quick and dirty fix to get first 5 postcode digits from ptoassigneend_us_cleaned using heuristics (see #A_Stopgap_Measure)
june_2017_zipcode_join 1,365,408 hack to join zipcodes to patent ids; see E:\McNair\Projects\SimplerPatentData\src\db\ZipcodeMapAssignmentToPatent.sql for methodology

Note that the suffix nd stands for "no duplicates". Assume this method was done correctly.

The Good

country column is reliable

It appears that the country field is consistent across all of the entries, and that the country names are taken from a small lookup table.

Proof:

patent=# select country, count(country) as c from ptoassigneend group by country order by c desc limit 50;
                country                 |    c
----------------------------------------+---------
 UNITED STATES                          | 3570216
 JAPAN                                  | 1334774
 GERMANY                                |  429429
 KOREA, REPUBLIC OF                     |  284830
 TAIWAN                                 |  231493
 FRANCE                                 |  172995
 CANADA                                 |  132387
 SWITZERLAND                            |   99281
 UNITED KINGDOM                         |   90634
 CHINA                                  |   88954
 NETHERLANDS                            |   84573
 SWEDEN                                 |   70538
 ITALY                                  |   61416
 FINLAND                                |   44116
 AUSTRALIA                              |   43314
 ISRAEL                                 |   41857
 ENGLAND                                |   31341
 DENMARK                                |   24742
 BELGIUM                                |   24147
 AUSTRIA                                |   19487
 SINGAPORE                              |   17518
 HONG KONG                              |   17091
 SPAIN                                  |   13288
 INDIA                                  |   12589
 IRELAND                                |   11754
 NORWAY                                 |   11290
 STATELESS                              |   10917
 GREAT BRITAIN                          |   10692
 BERMUDA                                |    8780
 CAYMAN ISLANDS                         |    8206
 NOT PROVIDED                           |    6658
 VIRGIN ISLANDS, BRITISH                |    6510
 NEW ZEALAND                            |    6364
 ONTARIO                                |    6047
 LUXEMBOURG                             |    5856
 KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF |    5625
 LIECHTENSTEIN                          |    4711
 BRAZIL                                 |    4368
 SOUTH AFRICA                           |    3699
 RUSSIAN FEDERATION                     |    3375
 SAUDI ARABIA                           |    3335
 BARBADOS                               |    3115
 HUNGARY                                |    2642
 GERMAN DEMOCRATIC REPUBLIC             |    2401
 MEXICO                                 |    2083
 QUEBEC                                 |    1777
 NETHERLANDS ANTILLES                   |    1690
 POLAND                                 |    1607
 MALAYSIA                               |    1606
 BRITISH COLUMBIA                       |    1279
(50 rows)

territories, etc are categorized differently

There are single-digit edge cases that in the strictest interpretation of country are correct.

patent=# select count(*) from ptoassigneend;
  count
---------
 7234001
(1 row)

patent=# select count(*) from ptoassigneend where country = 'UNITED STATES';
  count
---------
 3570216
(1 row)

patent=# select count(*) from ptoassigneend where country like '%UNITED STATES%';
  count
---------
 3570217
(1 row)

patent=# select * from ptoassigneend where (country like '%UNITED STATES%' and country != 'UNITED STATES');
 reelno | frameno |             name             |   addrline1   |      addrline2       |
--------+---------+------------------------------+---------------+----------------------+--------
  29344 |     970 | WARNER CHILCOTT COMPANY, LLC | P.O. BOX 1005 | UNION STREET, KM 1.1 | FAJARDO(1 row)

all zipcodes are valid US zipcodes

Using the data from geonames for the United States, put into the table via psql < Z:\zipcodes-oliver\dump.sql where dump.sql is generated via E:\McNair\Projects\SimplerPatentData\src\main\java\org\bakerinstitute\mcnair\uspto_assignments\GeonamesZips.java, all of the values for postcodes are valid US postcodes.

patent=# select count(distinct c.postcode_f5_cleaned) from ptoassigneend_us_cleaned as c left join geonames_us_zipcodes 
 as z on c.postcode_f5_cleaned = z.zipcode;                                                                             
 count                                                                                                                  
-------                                                                                                                 
  1116                                                                                                                  
(1 row)                                                                                                                 
                                                                                                                        
patent=# select count(distinct postcode_f5_cleaned) from ptoassigneend_us_cleaned;                                      
 count                                                                                                                  
-------                                                                                                                 
  1116                                                                                                                  
(1 row)                                                                                                                 

The Bad

state column is ill-constrained

# returns 57 rows (District of Columbia is counted as distinct state)
# select distinct state, count(state) from ptoassigneend_us_cleaned group by state limit 100;
...
Armed Forces in Europe, the Middle East, Africa, and Canada |      2
UNITED STATES ARMY                                          |      2
Armed Forces in the Pacific                                 |      3
                                                            |      1
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION               |     25
                                                            |      0
...


ptoassigneend_us_cleaned.postcode_*_cleaned is total garbage

This garbage stretches to include

  • ptoassigneend_us_cleaned.postcode_f5_cleaned
  • ptoassigneend_us_cleaned.postcode_cleaned

only .15% of records have their postcodes extracted correctly

patent=# select count(*) from ptoassigneend_us_cleaned;                                                                                                              
  count                                                                                                                                                              
---------                                                                                                                                                            
 3572605                                                                                                                                                             
(1 row)                                                                                                                                                              
                                                                                                                                                                                                       
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned is not null;                                                                           
  count                                                                                                                                                              
---------                                                                                                                                                            
 3376480                                                                                                                                                             
(1 row)                                                                                                                                                              
                                                                                                                                                                     
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_f5_cleaned is not null;                                                                        
 count                                                                                                                                                               
-------                                                                                                                                                              
  5344                                                                                                                                                               
(1 row)                                                                                                                                                              
                                                                                                                                                                     
patent=# select postcode_cleaned, postcode_f5_cleaned from ptoassigneend_us_cleaned where (postcode_cleaned is not null and postcode_f5_cleaned is null) limit 12;   
             postcode_cleaned              | postcode_f5_cleaned                                                                                                     
-------------------------------------------+---------------------                                                                                                    
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
 £\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082 |                                                                                                                         
(12 rows) 

patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082';
  count
---------
 3371136
(1 row)                                                                                                                                                           

the underlying data is fine though...

patent=# select postcode, postcode_addr1, postcode_addr2, postcode_city from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' limit 5;
 postcode | postcode_addr1 | postcode_addr2 | postcode_city
----------+----------------+----------------+---------------
 75024    |                |                |
 55379    |                |                |
 94538    |                |                |
 23219    |                |                |
 73114    |                |                |
(5 rows)

Moreover, selecting from the tables that ptoassigneend_us_cleaned is derived from did not yield this string. Therefore, there is likely an error in the SQL script, perhaps with some wonky copy-pasting from the internet.

...this case is hit when postcode is present but none of the other fields are present

patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 
0082';                                                                                                                  
  count                                                                                                                 
---------                                                                                                               
 3371136                                                                                                                
(1 row)                                                                                                                 
                                                                                                                        
patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 
0082' and postcode is not null;                                                                                         
  count                                                                                                                 
---------                                                                                                               
 3371136                                                                                                                
(1 row)             

patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u 0082' and postcode is not null and postcode_addr1 is null and postcode_addr2 is null and postcode_city is null;
  count
---------
 3371136
(1 row)

...easy gains can be gotten as a quick fix

patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and postcode is not null;
  count
---------
 3371136
(1 row)

patent=# select count(*) from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5;
  count
---------
 2971542
(1 row)

...5 digit postcode works great

patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 5 limit 12;
 postcode
----------
 75024
 55379
 94538
 23219
 73114
 95134
 33487
 60603
 84604
 20191
 97213
 10504
(12 rows)

...9 digit postcode works ok (two forms)

patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 9 limit 12;
 postcode
-----------
 USA 20585
 USA 33458
 4826-2766
 181951501
 772522463
 USA 27601
 772522463
 USA 20310
 USA 47202
 174012991
 913929221
 USA 12345
(12 rows)

...10 digit postcode works great

patent=# select postcode from ptoassigneend_us_cleaned where postcode_cleaned = E'£\u009B\u0084Ê\u0082Ò£\u009B\u0084Ë\u0082' and char_length(postcode) = 10 limit 12;
  postcode
------------
 92121-1714
 95134-1706
 80527-2400
 95066-4544
 01862-2000
 07962-2245
 90245-5012
 55133-3427
 30332-0415
 95134-1706
 20892-7660
 94080-4990
(12 rows)

The Ugly

fields directly contradict each other

Possible Heuristics:

  • Presence of Ln, St, Ave, Blvd, Cir, etc...

approximate data is available for stateless assignees

only on the order of 500 stateless assignees have postcodes

patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS');
 count
-------
 17575
(1 row)

patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode is
not null);
 count
-------
   329
(1 row)

patent=# select count(*) from ptoassigneend where (country = 'NOT PROVIDED' or country = 'STATELESS') and (postcode is
not null or addrline1 is not null);
 count
-------
  1572
(1 row)


approximate location data is available (but unused)

patent=# select name, addrline1, addrline2, postcode from ptoassigneend where (country = 'NOT PROVIDED' or country = 'S
TATELESS') and (addrline1 is not null or addrline2 is not null)  limit 30;
                                         name                                         |            addrline1             |            addrline2             | postcode
--------------------------------------------------------------------------------------+----------------------------------+----------------------------------+----------
 REGAL KING COMMERCIAL OFFSHORE DE MACAU LIMITED                                      | ALAMEDA DR. CARLOS D'S ASSUMPCAO | DYNASTY PLAZA, 19 AND. (H) MACAO |
 ECODYNE CORPORATION                                                                  |                                  | A CORP. OF DE                    |
 SHELL OIL COMPANY                                                                    |                                  | A CORP. OF DE                    |
 RCA CORPORATION                                                                      |                                  | A CORP. OF DE                    |
 BACHEM FEINCHEMIKALIEN A.G.                                                          |                                  | A SWISS CORP.                    |
 AMERICAN STERILIZER COMPANY                                                          |                                  | A PA. CORP.                      |
 CREATIVE TECHNOLOGY LTD.                                                             | CREATIVE RESOURCE                | 31 INTERNATIONAL BUSINESS PARK   | 60922
 FUJITSU LIMITED                                                                      | NAKAHARA-KU, KAWASAKI-SHI        | 1-1, KAMIKODANAKA 4-CHOME        | 211-8
 SIEMENS AKTIENGESELLSCHAFT                                                           |                                  | A GERMAN CORP                    |
 ISABERG AB, HESTRA                                                                   |                                  | A CORP. OF SWEDEN                |
 MOELLER MANUFACTURING CO., INC.                                                      |                                  | A DE CORP.                       |
 MOBIL OIL CORPORATION                                                                |                                  | A CORP OF NY                     |
 GENERAL ELECTRIC COMPANY                                                             |                                  | A NY CORP.                       |
 GENERAL ELECTRIC COMPANY                                                             | A CORP. OF NY                    |                                  |
 BTR INDUSTRIES LIMITED                                                               | CARLISLE PLACE                   | BTR HOUSE                        |
 DIATRON CORPORATION                                                                  |                                  | A CORP. OF CA                    |
 CLOSURES AND PACKAGING SERVICES LIMITED                                              | P. O. BOX 119                    |                                  | GY1 3HB
 DR. JOHANNES HEIDENHAIN GMBH                                A CORPORATION OF GERMANY |                                  | TRAUNREUT CITY                   |
 AMERICAN SAFETY RAZOR COMPANY                                                        |                                  | A DE CORP.                       |
 ETHICON, INC.                                                                        |                                  | A NJ CORP.                       |
 GENERAL ELECTRIC COMPANY                                                             |                                  | A NEW YORK CORP.                 |
 MOBIL OIL CORPORATION                                                                |                                  | A CORP. OF NY                    |
 RICHARD LANGLECHNER GMBH                                                             | MAUERBERGERSTR. 15               |                                  |
 KONISHIROKU PHOTO INDUSTRY CO., LTD.,                                                |                                  | A CORP OF JAPAN                  |
 SHELL OIL COMPANY                                                                    |                                  | A CORP. OF DE                    |
 SIEMENS AKTIENGESELLSCHAFT                                  A GERMAN CORPORATION     |                                  | MUNICH                           |
 C.A. BRIGGS COMPANY                                                                  |                                  | A CORP OF PA                     |
 GTE PRODUCTS CORPORATION                                                             |                                  | A DE CORP.                       |
 TRAMEX ENGINEERING LIMITED                                                           | 8 CLARE STREET,                  | A IRISH COMPANY                  |
 MOBIL OIL CORPORATION                                                                |                                  | A CORPORATION OF NEW YORK        |

A Stopgap Measure

As a quick and dirty fix, walk through the cases and do the best we can. The script for that is located at E:\McNair\Projects\SimplerPatentData\src\db\hacks\June2017PostcodeHotfix.sql and it creates the following table:

DROP VIEW IF EXISTS june_2017_postcodecode_hotfix;
CREATE VIEW june_2017_postcode_hotfix AS
  SELECT
    reelno,
    frameno,
    name,
    city_cleaned as city,
    GodHelpUsAll(postcode_cleaned, postcode) as postcode
  FROM ptoassigneend_us_cleaned;
patent=# select count(*) from june_2017_postcode_hotfix ;
  count
---------
 3572605
(1 row)

patent=# select count(*) from june_2017_postcode_hotfix where postcode is not null;
  count
---------
 3370613
(1 row)