In order to restructure the current patent dataset, the data requires rigorous cleaning. The primary areas for improvement are:
We applied similar methods to filter out patent records from Japan. Post codes in Japan follow the pattern [three digits]-[four digits].
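As a minimal sketch of the pattern described above, a regular expression can pick out a Japanese-style post code from a free-text address. The function name `extract_jp_postcode` and the sample address are hypothetical and only for illustration; the actual extraction code used for the dataset may differ.

```python
import re

# Assumed pattern: three digits, a hyphen, four digits (e.g. 100-0001).
# The lookarounds keep the match from starting or ending inside a longer
# run of digits, such as a U.S. ZIP+4 code like 12345-6789.
JP_POSTCODE = re.compile(r"(?<!\d)\d{3}-\d{4}(?!\d)")


def extract_jp_postcode(address):
    """Return the first Japanese-style post code in the address, or None."""
    match = JP_POSTCODE.search(address)
    return match.group(0) if match else None


# Hypothetical example address
print(extract_jp_postcode("1-1 Chiyoda, Chiyoda-ku, Tokyo 100-0001, Japan"))
```

The pattern alone does not prove a record is Japanese (a seven-digit phone number written as 555-1234 would also match), so it is probably best combined with the extracted country information.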
The extracted post code is quite accurate for almost all countries, and so is the country information (and, for the U.S., the state).
The problem is that the extracted city information is not very accurate: it is often mixed up with street names. One approach to improving accuracy is to list all possible cities in each country and then match the address columns against these cities, which is time-consuming.
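The city-list approach could be sketched roughly as follows. This is an illustration under stated assumptions, not the method actually implemented: the helper names `build_city_matcher` and `find_city`, the tiny city list, and the heuristic of preferring the last match (on the assumption that the city usually follows the street in an address) are all hypothetical.

```python
import re


def build_city_matcher(cities):
    """Compile one case-insensitive regex that matches any known city name."""
    # Longer names go first so that "San Jose" is tried before "San".
    ordered = sorted(cities, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(c) for c in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_city(address, matcher):
    """Return the last known city name found in the address, or None."""
    matches = matcher.findall(address)
    # Assumption: the street comes before the city, so a street named after
    # a city (e.g. "Boston Road") is outranked by the later match.
    return matches[-1] if matches else None


# Hypothetical city list; a real run would need a full list per country.
matcher = build_city_matcher(["Boston", "Cambridge", "York", "New York"])
print(find_city("12 Boston Road, Cambridge, MA 02139", matcher))
```

Compiling all names into a single pattern means each address is scanned once rather than once per city, which may ease the time cost somewhat, although building a complete city list for every country remains the expensive part.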