==== Identify Clean Data ====
As mentioned, the ptoassigneend_us_extracted is cleaned. This section works on the remaining records which are stored in ptoassigneend_us_temp.
Next, filter out records with city that is a city, zip that is a zip, state that is a state. Note: The consistency between city and state or city and postcode is not checked in this section.
*state that is a state
All Select distinct state records with SELECT DISTINCT state FROM ptoassigneend_us_temp; The output shows that all the records not null or '' are valid state names.
*city that is a city
No comma, no dot
The feature city is messy. In this way, we first clean feature city before identifying clean data.
===== Clean Feature City =====
The following patterns can be used to clean feature city.
Select distinct city records and store them in table citylist (30971).
* city, state postcode(5)
'.*[,].*\d{5}'
-- 565
* city, state code
'.*[,]\s{0,}\w{2}$'
-- 284
* city, state code
'.*[,]\s{0,}\w{2}\s'
* IS NOT NULL AND city != '' AND city !~* '([,]|[.])'
-- 23501