Identifying five-digit postcode is risky because of the existence of P.O. BOX #, SUITE #, etc. One option is to identify state and postcode together with the following SQL code: (take 'addrline1' as an example)
SELECT addrline1
FROM ptoassigneend_us_temp2
WHERE (addrline1 ~* '([,]|[.])\s\w{2,}\s{0,}\w{0,}\s{1,}\d{5}' OR
767 FIFTH AVE., NEW YORK, NY 10153 | 10153
Even excluding the P.O. BOX # and SUITE #, the false positive rate is noise still a little bit highexists.
The details and SQL function are in E:\McNair\Projects\PatentAddress\Cleang_Step2.sql
The output is table ptoassigneend_us_expost ptoassigneend_us_postex which include a new feature 'postcode_extracted'.
====Clean city====