IO files are on the dbase server in:
Z:/PatentAddress
====Introduction====
*Five features (addrline1, addrline2, city, state, postcode) in the table contain address information.
*Features addrline1, addrline2 and city are not cleaned: each may contain a mix of city, state and postcode information.
*The first objective of this project is to extract postcode, state and city information from the three features above (Section 2.2.1).
*Then, we combine the postcode, state and city information in the original table with the values extracted from the addresses to produce a single postcode, state and city for each record (Section 2.2.2).
*For now, we focus only on cleaning American patents.
====Extract Address Information====
=====1. Introduction=====
*Five features (addrline1, addrline2, city, country, postcode) in the table contain address information.
*Features addrline1, addrline2 and city are not cleaned. They have city, country and postcode information.
*The first objective of this project is to extract postcode, state and city information from the three features above.
*Then, we summarize the postcode, state and city information in the original table and those extracted from addresses to generate only one postcode, state and city for each record.
*For now, we focus only on cleaning American patents.
=====2. Postcode (U.S.)=====
A U.S. postcode follows the pattern [five digits - four digits]. U.S. patents can therefore be identified by searching for a postcode with a regular expression.
For example,
city | postcode_extracted
NEW YORK, NY 10022-3201 | 10022-3201
BEAVERTON, OREGON 97005-6453 | 97005-6453
OXFORD CT 06483-1011 | 06483-1011
The extracted postcode records are stored in the table ptoassigneend_us_extracted.
SQL code is in:
E:/McNair/Projects/PatentAddress/Functions.sql
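The postcode search described above can be illustrated with a short Python sketch. This is not the project's implementation (that is the SQL in Functions.sql); the function name <code>extract_postcode</code> and the digit-boundary checks are assumptions made for illustration only.

```python
import re

# ZIP+4: five digits, a hyphen, four digits. The lookarounds stop the
# pattern from matching inside a longer run of digits (an assumption;
# the original SQL may handle boundaries differently).
ZIP4_RE = re.compile(r"(?<!\d)\d{5}-\d{4}(?!\d)")

def extract_postcode(text):
    """Return the first ZIP+4 code found in text, or None."""
    if not text:
        return None
    match = ZIP4_RE.search(text)
    return match.group(0) if match else None

# The three sample values from the table above.
for value in ["NEW YORK, NY 10022-3201",
              "BEAVERTON, OREGON 97005-6453",
              "OXFORD CT 06483-1011"]:
    print(value, "|", extract_postcode(value))
```

A record with no match returns None, which is one simple way to separate rows that need further handling.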
=====3. State (U.S.)=====
The following patterns can be used to extract state information.
'''a. '[,] State Postcode''''
'''d. 'A CORP.* OF [State]''''
This pattern is not reliable. When addrline1 looks this way, addrline2 always provides more detailed address information than addrline1. In addition, a large share of the state values extracted from 'A CORP.* OF [State]' do not match the state extracted from the more detailed addrline2. We therefore discard this pattern.
Examples:
* Summary
The extracted state records are stored in the table ptoassigneend_us_extracted.
SQL code is in:
E:/McNair/Projects/PatentAddress/Functions.sql
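As an illustration of pattern a ('[,] State Postcode'), the Python sketch below looks for a state name or abbreviation directly in front of a ZIP+4 code. It is a sketch under assumptions, not the SQL in Functions.sql: the <code>STATES</code> lookup, the function name <code>extract_state</code> and the choice to return a two-letter abbreviation are all introduced here for illustration. Checking against a list of known states avoids treating "OXFORD CT" as a state name when no comma is present.

```python
import re

# Full state names mapped to two-letter abbreviations (50 states plus DC).
STATES = {
    "ALABAMA": "AL", "ALASKA": "AK", "ARIZONA": "AZ", "ARKANSAS": "AR",
    "CALIFORNIA": "CA", "COLORADO": "CO", "CONNECTICUT": "CT", "DELAWARE": "DE",
    "DISTRICT OF COLUMBIA": "DC", "FLORIDA": "FL", "GEORGIA": "GA", "HAWAII": "HI",
    "IDAHO": "ID", "ILLINOIS": "IL", "INDIANA": "IN", "IOWA": "IA",
    "KANSAS": "KS", "KENTUCKY": "KY", "LOUISIANA": "LA", "MAINE": "ME",
    "MARYLAND": "MD", "MASSACHUSETTS": "MA", "MICHIGAN": "MI", "MINNESOTA": "MN",
    "MISSISSIPPI": "MS", "MISSOURI": "MO", "MONTANA": "MT", "NEBRASKA": "NE",
    "NEVADA": "NV", "NEW HAMPSHIRE": "NH", "NEW JERSEY": "NJ", "NEW MEXICO": "NM",
    "NEW YORK": "NY", "NORTH CAROLINA": "NC", "NORTH DAKOTA": "ND", "OHIO": "OH",
    "OKLAHOMA": "OK", "OREGON": "OR", "PENNSYLVANIA": "PA", "RHODE ISLAND": "RI",
    "SOUTH CAROLINA": "SC", "SOUTH DAKOTA": "SD", "TENNESSEE": "TN", "TEXAS": "TX",
    "UTAH": "UT", "VERMONT": "VT", "VIRGINIA": "VA", "WASHINGTON": "WA",
    "WEST VIRGINIA": "WV", "WISCONSIN": "WI", "WYOMING": "WY",
}

# Accept either a full name or an abbreviation. Longer alternatives come
# first so that "WEST VIRGINIA" is tried before "VIRGINIA".
_tokens = sorted(set(STATES) | set(STATES.values()), key=len, reverse=True)
STATE_ZIP_RE = re.compile(
    r"(?:^|[,\s])\s*(" + "|".join(re.escape(t) for t in _tokens) + r")"
    r"\s+\d{5}-\d{4}(?!\d)",
    re.IGNORECASE,
)

def extract_state(text):
    """Return the two-letter state found before a ZIP+4 code, or None."""
    if not text:
        return None
    match = STATE_ZIP_RE.search(text)
    if not match:
        return None
    token = match.group(1).upper()
    # Full names are normalized; abbreviations are returned unchanged.
    return STATES.get(token, token)

for value in ["NEW YORK, NY 10022-3201",
              "BEAVERTON, OREGON 97005-6453",
              "OXFORD CT 06483-1011"]:
    print(value, "|", extract_state(value))
```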
=====4. City (U.S.)=====
Three lists of samples extracted from addrline1, addrline2 and city are used to summarize the patterns. They are in Z:/PatentAddress/
The following patterns can be used to extract city information.
'''a. '\s{2,} CityName [,] State Postcode''''
''This pattern can't be identified because of the noise:''
MASSACHUSETTS 02780-7319 ('State '+'Postcode')
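The city pattern a ('\s{2,} CityName [,] State Postcode') can be sketched in Python as follows. This is an illustrative sketch, not the SQL in Functions.sql: the function name <code>extract_city_state_zip</code> is hypothetical, the state is simplified to a two-letter abbreviation, and the street address in the usage lines is a made-up example rather than a record from the table.

```python
import re

# Two or more whitespace characters, then the city name, an optional
# comma, a two-letter state abbreviation, and a ZIP+4 code.
# Simplification (assumption): full state names are not handled here.
CITY_RE = re.compile(
    r"\s{2,}(?P<city>[A-Z][A-Z .'\-]*?)\s*,?\s+"
    r"(?P<state>[A-Z]{2})\s+(?P<postcode>\d{5}-\d{4})(?!\d)"
)

def extract_city_state_zip(text):
    """Return a dict with city, state and postcode, or None if no match."""
    if not text:
        return None
    match = CITY_RE.search(text)
    if not match:
        return None
    result = match.groupdict()
    result["city"] = result["city"].strip()
    return result

# Hypothetical address; the double space is written out explicitly.
with_gap = "100 MAIN STREET" + "  " + "NEW YORK, NY 10022-3201"
print(extract_city_state_zip(with_gap))

# Without the double space the separator is missing and nothing matches.
print(extract_city_state_zip("100 MAIN STREET NEW YORK, NY 10022-3201"))

# A value holding only a state and a postcode does not match either.
print(extract_city_state_zip("MASSACHUSETTS 02780-7319"))
```

The pattern depends entirely on the run of two or more spaces to mark where the city name starts, so values without that separator need a different rule.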
*'''no space between street and city name :( '''
*'''ptoassigneend_us_extracted'''
Contains all the original features as well as the city, state and postcode information extracted from features addrline1, addrline2 and city. See Section 2.2.1 for details.
Table "public.ptoassigneend_us_extracted"