Sonia Zhang (Work Log)
Summer Work
02/23/2017 - Set up the User Page and the Work Log Page. Got an overview of the patent data.
02/27/2017 - Started working on the issues listed on Patent Data Issues.
02/28/2017 - Cleaned assigneeinfo, msalist, etc.
03/01/2017 - Had a meeting to discuss problems in the patent data.
03/02/2017 - Cleaned some of the 'name' and 'city' records in ptoassigneend2. Created the ptoassigneend_country table to store country information. Worked out some methods for filling in empty 'city' and 'country' values.
03/06/2017 - Updated the ptoassigneend table. Filled some of the missing 'country' values with 'UNITED STATES' based on the 'state' information.
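A minimal sketch of the state-based fill described above, written in Python for illustration. The record layout, the `US_STATES` subset, and the `fill_country` helper are assumptions made for this example, not the actual code or schema used on the ptoassigneend table.

```python
# Hypothetical subset of U.S. state abbreviations; a real run would need the full list.
US_STATES = {"CA", "NY", "TX", "WA"}


def fill_country(record):
    """Return a copy of the record with 'country' set to 'UNITED STATES'
    when 'country' is empty and 'state' is a known U.S. state."""
    filled = dict(record)
    state = (filled.get("state") or "").strip().upper()
    country = (filled.get("country") or "").strip()
    if not country and state in US_STATES:
        filled["country"] = "UNITED STATES"
    return filled


# Made-up rows, only to show the behaviour.
rows = [
    {"name": "ACME CORP", "state": "tx", "country": None},
    {"name": "FOO GMBH", "state": None, "country": "GERMANY"},
    {"name": "BAR LTD", "state": "ON", "country": ""},
]
filled_rows = [fill_country(r) for r in rows]
```

Only rows with an empty 'country' and a recognized state are changed. Existing country values and unrecognized states are left untouched, so the fill never overwrites data.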
03/10/2017 - Extracted U.S. address information from the ptoassigneend table. The extracted records are stored in the new table 'ptoassigneend_missus'. See Patent Data Restructure for details.
03/13/2017 - Applied similar methods to extract address information from Japanese patents. The results are stored in 'ptoassigneend_missjapan'. Matched post code patterns to the 200 distinct countries that appear in the patent table.
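The per-country post code matching could look roughly like the Python sketch below. The two patterns shown (U.S. ZIP and Japanese NNN-NNNN codes) are standard formats, but the `POSTCODE_PATTERNS` table, the `extract_postcode` helper, and the "take the last match" rule are assumptions for illustration, not the patterns actually used for the 200 countries.

```python
import re

# Illustrative patterns for two countries only.
POSTCODE_PATTERNS = {
    "UNITED STATES": re.compile(r"\b\d{5}(?:-\d{4})?\b"),  # ZIP or ZIP+4
    "JAPAN": re.compile(r"\b\d{3}-\d{4}\b"),               # NNN-NNNN
}


def extract_postcode(address, country):
    """Return the last post-code-like token in the address, or None.

    The last match is used on the assumption that the post code sits near
    the end of the address, after any street number.
    """
    pattern = POSTCODE_PATTERNS.get(country)
    if pattern is None or not address:
        return None
    matches = pattern.findall(address)
    return matches[-1] if matches else None


us_code = extract_postcode("12345 MAIN STREET, HOUSTON, TX 77002", "UNITED STATES")
jp_code = extract_postcode("1-1 MARUNOUCHI, CHIYODA-KU, TOKYO 100-0005", "JAPAN")
```

In the first address the five-digit street number also fits the ZIP pattern, which is why the sketch keeps the last match rather than the first.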
03/14/2017 - As mentioned above, three kinds of information can be extracted from the address columns: city, country, and post code (plus state for U.S. addresses). The extracted post codes are quite accurate for almost all countries, and so is the country information (and the state for the U.S.).
The problem is that the extracted city information is not very accurate, because it gets mixed up with street names. One way to increase accuracy is to list all possible cities in each country and then match the address columns against those lists, which is time-consuming.
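A rough Python sketch of the city-list approach, to show what the matching step might involve. The tiny `CITIES_BY_COUNTRY` lists, the helper names, and the rule of keeping the last match are all hypothetical. A real run would need a full city list per country, which is the time-consuming part.

```python
import re

# Hypothetical, tiny city lists; real lists would come from an external source.
CITIES_BY_COUNTRY = {
    "UNITED STATES": ["NEW YORK", "HOUSTON", "YORK"],
    "JAPAN": ["TOKYO", "OSAKA"],
}


def build_city_pattern(cities):
    """Compile one whole-word regex that matches any city in the list."""
    # Longest names first, so that at a given position a longer name is
    # tried before a shorter name that is its prefix.
    ordered = sorted(cities, key=len, reverse=True)
    alternation = "|".join(re.escape(city) for city in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")


CITY_PATTERNS = {
    country: build_city_pattern(cities)
    for country, cities in CITIES_BY_COUNTRY.items()
}


def match_city(address, country):
    """Return the last known city name found in the address, or None."""
    pattern = CITY_PATTERNS.get(country)
    if pattern is None or not address:
        return None
    matches = pattern.findall(address.upper())
    # Assumption: the city comes after the street, so the last match is kept.
    return matches[-1] if matches else None


city = match_city("120 Houston Street, New York, NY 10002", "UNITED STATES")
```

The example address contains a street named after a city, which is the kind of confusion described above. Keeping the last match is only a heuristic and would still fail when a city name appears after the real city.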