=== Clean Address: more patterns ===
====Clean postcode====
A postcode should always appear at the end of addrline1, addrline2 or city. Identifying a five-digit postcode on its own is risky because P.O. BOX and SUITE numbers can also be five digits long, so exclude them. One option is to identify state and postcode together. The SQL code to extract the postcode is (taking 'addrline1' as an example):
SELECT addrline1, substring(addrline1 from '\d{5}$')
FROM ptoassigneend_us_temp2
WHERE addrline1 ~* '(^|\s)\d{5}$'            -- trailing five-digit number
  AND NOT (addrline1 ~* 'BO')                -- exclude P.O. BOX numbers
  AND NOT (addrline1 ~* 'P[.]O')
  AND NOT (addrline1 ~* 'SUITE\s\d{5}$');    -- exclude SUITE numbers
-- The same filter applies to addrline2 and city.
-- SELECT 66803306
Examples:
addrline1                                      | substring
-----------------------------------------------+-----------
98625                                          | 98625
1650 WEST BIG BEAVER ROAD TROY, MI 48084       | 48084
1114 AVE NY NY 10036                           | 10036
GLENDALE, CA 91204                             | 91204
46 BAKER ST14714 F. PERTHSHIRE 77079           | 14714
314 N.JACKSON STREET, PROVIDENCEJACKSON 49201  | 49201
LAGUNA HILLS, RCA 92653                        | 92653
1 ARAB, ALABAMA 35016                          | 35016
1205 SIXTH ST.ISOUTHEAST 33907                 | 33907
767 FIFTH AVE. 02905 , NEW YORK, NY 10153      | 0290510153

Even after excluding P.O. BOX and SUITE numbers, the false-positive rate is still somewhat high.
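To see where false positives come from, it can help to replay the filter on a few strings outside the database. The following is a minimal sketch in Python's <code>re</code> module, not the code used to build the table. The function names <code>extract_postcode</code> and <code>extract_state_postcode</code> and the three pattern constants are hypothetical, and the stricter state-plus-postcode pattern is an assumption about what "identify state and postcode together" could look like.

```python
import re

# Loose filter: a trailing five-digit number preceded by start-of-string or
# whitespace, mirroring the PostgreSQL pattern '(^|\s)\d{5}$'.
POSTCODE_RE = re.compile(r"(?:^|\s)(\d{5})$")

# Stricter filter (assumption): a two-letter token, taken to be a state
# abbreviation, followed by whitespace and the five-digit postcode.
STATE_POSTCODE_RE = re.compile(r"(?:^|[\s,])([A-Z]{2})\s+(\d{5})$", re.IGNORECASE)

# Exclusions taken from the text: 'BO' and 'P[.]O' for P.O. BOX numbers,
# plus a trailing SUITE number. Matching is case-insensitive, like ~*.
EXCLUDE_RE = re.compile(r"BO|P[.]O|SUITE\s\d{5}$", re.IGNORECASE)


def extract_postcode(addr):
    """Return the trailing five-digit number, or None if the line is filtered out."""
    addr = addr.strip()
    if EXCLUDE_RE.search(addr):
        return None
    match = POSTCODE_RE.search(addr)
    return match.group(1) if match else None


def extract_state_postcode(addr):
    """Return (state, postcode) when a two-letter token precedes the postcode."""
    addr = addr.strip()
    if EXCLUDE_RE.search(addr):
        return None
    match = STATE_POSTCODE_RE.search(addr)
    return (match.group(1).upper(), match.group(2)) if match else None


if __name__ == "__main__":
    samples = [
        "GLENDALE, CA 91204",
        "1 ARAB, ALABAMA 35016",
        "98625",
        "P.O. BOX 12345",
    ]
    for addr in samples:
        print(addr, "|", extract_postcode(addr), "|", extract_state_postcode(addr))
```

The stricter pattern trades recall for precision: it accepts "GLENDALE, CA 91204" but rejects "1 ARAB, ALABAMA 35016" and a bare "98625", both of which the loose pattern lets through. Note also that the 'BO' exclusion is broad and would drop any line containing those two letters, such as a street or city name.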
====Clean city & state====