====Extract Address Information====
====='''1. Introduction'''=====
*Five features (addrline1, addrline2, city, country, postcode) in the table contain address information.
*Features addrline1, addrline2 and city are not cleaned. They have city, country and postcode information.
*By now, we only focus on cleaning American patents.
====='''2. Postcode(U.S.)'''=====
U.S. post code follows the pattern [five digits - four digits]. In this way, U.S. patents can be extracted by searching for post code with regular expression
E:/McNair/Projects/PatentAddress/RxPostcode.sql
====='''3. State (U.S.)'''=====
Some patterns can be used to extract state information.
Examples:
---------------------------------------------------------------------------------------------------------------------------------------------------------
A CORPORATION OF NY
A CORPORATION OF TEXAS
E:/McNair/Projects/PatentAddress/RxPostcode.sql
====='''4. City (U.S.)'''=====
Some patterns can be used to extract city information.
E:/McNair/Projects/PatentAddress/RxCity.sql
====='''5. Output (Tables)'''=====
*'''ptoassigneend_allus'''
city_city is the city name extracted from 'city'; city_addr1 is the city name extracted from 'addrline1'; city_addr2 is the city name extracted from 'addrline2'.
====Condense Address Information==== '''61. IssuesIntroduction''' The address information extracted from addrline1, addrline2 and city is not consistent. For example, the postcode extracted from addrline1 may be different from that extracted from city. Examples: | postcode | postcode_addr1 | postcode_addr2 | postcode_city | 77042 | 77251-1407 | | | 83716-9632 | | 83707-0006 | | 90045 | 90080-0028 | | (postcode_city is the postcode extracted from 'city'; postcode_addr1 is the postcode extracted from 'addrline1'; postcode_addr2 is the postcode extracted from 'addrline2'.)
* The post code and zip regex for other countries besides U.S. can be found hereOutput: http://stackoverflow.com/questions/578406/what-is-the-ultimate-postal-code-and-zip-regexSchema | Name | Type | Owner public | ptoassigneend_city | table | researcher public | ptoassigneend_postcode | table | researcher public | ptoassigneend_state | table | researcher
ExampleSQL code is in: "US", "\d{5}([ \-]\d{4})?" "CA", "[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJ-NPRSTV-Z][ ]?\d[ABCEGHJ-NPRSTV-Z]\d" "DE", "\d{5}" "JP", "E:\d{3}-McNair\d{4}" "FR", "Projects\d{2}[ ]?PatentAddress\d{3}"AddrClean.sql
* The city feature needs to be standardized. For example, ===='GRAND CAYMAN, CAYMAN ISLAND' Inconsistency between City and Country 'GRAND CAYMAN' indicate the same city. ''====
* Some state Inconsistency between addrline and country features don't match.
Example:
3001 ORCHARD PARKWAY SAN JOSE, CA 95134-2088 | MINATO-KU, TOKYO 107 | JAPAN
* Both state name and its abbreviation exist. * Information extracted from addrline1, addrline2 and city is not consistent. Which one is the priority?===='''Issues'''====
====Condense Address Information====* The post code and zip regex for other countries besides U.S. can be found here: http://stackoverflow.com/questions/578406/what-is-the-ultimate-postal-code-and-zip-regex
'''1. Introduction'''Example: "US", "\d{5}([ \-]\d{4})?" "CA", "[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJ-NPRSTV-Z][ ]?\d[ABCEGHJ-NPRSTV-Z]\d" "DE", "\d{5}" "JP", "\d{3}-\d{4}" "FR", "\d{2}[ ]?\d{3}"
* The address information extracted from addrline1, addrline2 and city is not consistentfeature needs to be standardized. For example, 'GRAND CAYMAN, CAYMAN ISLAND' and 'GRAND CAYMAN' indicate the postcode extracted from addrline1 may be different from that extracted from same city.
Examples: | postcode | postcode_addr1 | postcode_addr2 | postcode_city | 77042 | 77251-1407 | | | 83716-9632 | | 83707-0006 | | 90045 | 90080-0028 | |* Both state name and its abbreviation exist.
(postcode_city is the postcode * Information extracted from 'addrline1, addrline2 and city'; postcode_addr1 is the postcode extracted from 'addrline1'; postcode_addr2 not consistent. Which one is the postcode extracted from 'addrline2'.) Output: Schema | Name | Type | Owner public | ptoassigneend_city | table | researcher public | ptoassigneend_postcode | table | researcher public | ptoassigneend_state | table | researcher SQL code is in: E:\McNair\Projects\PatentAddress\AddrClean.sqlpriority?