Difference between revisions of "Geocoding Inventor Locations"
Revision as of 20:49, 19 August 2009
- This page is part of a series under the NBER Patent Data Project
This page details the matching techniques used to geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations throughout the world and recording their longitude and latitude.
The scripts and modules that operationalize these matching techniques can be downloaded individually by function or as a bundle with all supporting data files (MatchPatentLocations.tar.gz).
Reference Data
The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's GEOnet Names Server (GNS), which covers the world excluding the U.S. and Antarctica.
This project uses ISO3166 two-character country codes to name source and reference files. Available reference data files for countries that have been or are being processed include:
- Belgium: GNS-BE.txt
- France: GNS-FR.txt
- Great Britain (The United Kingdom of Great Britain and Northern Ireland): GNS-GB.txt
- Spain: GNS-ES.txt
- Switzerland: GNS-CH.txt
The Perl module GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes an ISO3166 code, and the index methods and most other methods take one of two specific GNS FC codes ("P" for populated place, or "A" for administrative area).
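As a rough illustration of indexing reference data by FC code, the following Python sketch builds one lookup table per feature class. It is not the GNS.pm implementation: the function name `build_index`, the record keys (`FC`, `FULL_NAME`, `LAT`, `LONG`) and the sample records are all assumptions made for the example, and the coordinates are rounded placeholders rather than GNS values.

```python
from collections import defaultdict


def build_index(records, fc):
    """Map a lower-cased name to its coordinates for one feature class.

    `records` is an iterable of dicts; the key names used here are
    assumptions for this sketch, not a documented GNS.pm interface.
    """
    index = defaultdict(list)
    for rec in records:
        if rec["FC"] == fc:
            index[rec["FULL_NAME"].lower()].append((rec["LAT"], rec["LONG"]))
    return index


# Illustrative records only; coordinates are rounded placeholders.
records = [
    {"FC": "P", "FULL_NAME": "Cambridge", "LAT": 52.2, "LONG": 0.1},
    {"FC": "A", "FULL_NAME": "Cambridgeshire", "LAT": 52.3, "LONG": 0.0},
    {"FC": "P", "FULL_NAME": "Oxford", "LAT": 51.8, "LONG": -1.3},
]

place_index = build_index(records, "P")  # populated places
admin_index = build_index(records, "A")  # administrative areas
```

Keeping a list of coordinates per name allows for several reference entries sharing the same name, which a later reconciliation step would have to resolve.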
The Source Files
Per-country source files are extracted from the NBER patent data. The problem of identifying countries for some address records will be addressed later. The format of the source file(s) is as follows (XX is an ISO3166 code):
- XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): cty
- XX_exceptions.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): cty city adm postcode
The cty is used as a primary key in both files. The XX_exceptions.txt file provides details on hand-identified records, or other records where special care has been taken. This file is not strictly required by the scripts but will be processed if present.
The Perl module PatentLocations.pm loads and provides an interface to this source data. The source code is the primary module documentation.
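A minimal Python sketch of reading an exceptions file in the format described above follows. It is not the PatentLocations.pm code. It assumes the file has no header row and that the columns appear in the order cty, city, adm, postcode; the function name `load_exceptions` and the sample row are hypothetical.

```python
import csv
import io

# Column order as listed for XX_exceptions.txt above.
COLUMNS = ["cty", "city", "adm", "postcode"]


def load_exceptions(handle):
    """Read tab-delimited rows into a dict keyed by the cty column.

    Assumes no header row (an assumption of this sketch). Quoting is
    disabled because the files carry no string quotation.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad short rows so every record has all four columns.
        row = row + [""] * (len(COLUMNS) - len(row))
        record = dict(zip(COLUMNS, row))
        table[record["cty"]] = record
    return table


# Hypothetical sample row standing in for a real file on disk.
sample = io.StringIO("CAMBRIDGE, CAMBS\tCambridge\tCambridgeshire\t\n")
exceptions = load_exceptions(sample)
```

For a real file, the same function could be called with `open("GB_exceptions.txt", newline="")`, where the file name follows the XX_exceptions.txt pattern.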
Postal Codes
Postal codes, known as ZIP codes in the U.S., vary by national jurisdiction and for historical reasons. The following postal code formats are posted for reference:
- United Kingdom (Sourced from Wikipedia): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: ([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})
The PostalCodes.pm Perl module provides a method to extract a postcode from a text string for a given ISO3166 code.
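The extraction step can be sketched in Python using the simple UK regex quoted above. This is an illustration only, not the PostalCodes.pm interface: the function name `extract_postcode`, the pattern table and the upper-casing of the input are assumptions, and only the GB pattern is filled in because it is the only one listed on this page.

```python
import re

# One pattern per ISO3166 code; only the UK pattern is given on this page.
POSTCODE_PATTERNS = {
    "GB": re.compile(r"([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})"),
}


def extract_postcode(text, country):
    """Return the first postcode found in `text`, or None.

    The text is upper-cased first (an assumption of this sketch) because
    the pattern only matches capital letters.
    """
    pattern = POSTCODE_PATTERNS.get(country)
    if pattern is None:
        return None  # no pattern known for this country
    match = pattern.search(text.upper())
    return match.group(1) if match else None
```

For example, `extract_postcode("Cambridge CB2 1TN", "GB")` returns `"CB2 1TN"`, while a country without a registered pattern returns `None`.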
The Matching Process
The matching process is carried out by MatchPatentLocations.pl, which has a standard POD-based command line interface. The -co option specifies the ISO3166 country code to be matched. The script uses these modules: PatentLocations.pm, GNS.pm, CleanStrings.pm, GramMatch.pm, LCS.pm and PostalCodes.pm. In addition to the GNS reference files and patent data source files detailed above, the script also uses PatentLocations-Stopwords.txt.
Glossary of terms:
- Units - isolated logical units from an address, such as the street number and name, the town, or the region. Postal codes are treated separately.
- Tokens - Single words or sequences of words separated by a space (note that this is a specific usage)
- n-grams - character sequences, such as bigrams (two letters from aa to zz), trigrams (aaa-zzz) and so forth
- Exact Matching - Case-insensitive matching of the entire sequence of both the source and the reference strings
- LCS - Longest Common Subsequence based matching (See below)
- Place and administrative area - a location identified as FC=P or FC=A respectively in the GNS data. Unless otherwise specified, matches are performed for both place and administrative area separately and in series.
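The token and n-gram definitions in the glossary can be made concrete with a short Python sketch. The function names `tokens` and `ngrams` are hypothetical, and details such as lower-casing and keeping spaces inside n-grams are assumptions; the CleanStrings.pm and GramMatch.pm modules may treat these differently.

```python
def tokens(text, max_words=5):
    """Return every run of 1 to `max_words` consecutive words in `text`."""
    words = text.split()
    out = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out


def ngrams(text, n=2):
    """Return the overlapping character n-grams of the lower-cased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```

For instance, `tokens("Bury St Edmunds")` yields the three single words, the two two-word sequences and the full three-word sequence, and `ngrams("Bath")` yields the bigrams `ba`, `at` and `th`.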
The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):
- Load the source files, clean and parse (parsing identifies units)
- Load the reference file, build indices
- Exact match the exception units of records with exceptions
- Exact match the units of well-formatted records
- Exact match tokens (1-5 words)
- N-gram and LCS match
- Reconcile multiple matches
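The idea of matching only the remaining unmatched locations at each stage can be sketched in Python as a loop over stage functions. This is a simplified illustration rather than the logic of MatchPatentLocations.pl: `run_stages`, the two toy stages and the reference table are hypothetical, and the coordinates are rounded placeholders.

```python
def run_stages(sources, stages):
    """Apply each stage in order to the sources no earlier stage matched.

    A stage is a function that returns a match or None.
    """
    matched = {}
    remaining = list(sources)
    for stage in stages:
        still_unmatched = []
        for src in remaining:
            result = stage(src)
            if result is None:
                still_unmatched.append(src)
            else:
                matched[src] = result
        remaining = still_unmatched
    return matched, remaining


# Toy reference table with placeholder coordinates.
reference = {"cambridge": (52.2, 0.1), "oxford": (51.8, -1.3)}


def exact_stage(src):
    """Case-insensitive match of the whole source string."""
    return reference.get(src.lower())


def token_stage(src):
    """Match on the first single word found in the reference table."""
    for word in src.lower().replace(",", " ").split():
        if word in reference:
            return reference[word]
    return None


matched, unmatched = run_stages(
    ["Cambridge", "Oxford, Oxon", "Nowhere"],
    [exact_stage, token_stage],
)
```

Ordering the stages from strictest to loosest means that a record matched exactly is never reconsidered by a fuzzier stage.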
Longest Common Subsequence (LCS)
Longest Common Subsequence is perhaps the simplest (for certain inefficient implementations) and most widely used of the fuzzy matching techniques. The Longest Common Subsequence page on Wikipedia provides a very detailed background.
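For reference, the length of the longest common subsequence can be computed with the standard dynamic programming recurrence, sketched here in Python. This is not the LCS.pm code. The `lcs_score` normalisation (dividing by the longer string's length) is an assumption added for illustration and is not described on this page.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b.

    Uses the textbook recurrence, keeping only the previous row of the
    table so memory is proportional to len(b).
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(source, reference):
    """Case-insensitive LCS length divided by the longer string's length.

    This normalisation is an assumption of the sketch.
    """
    s, r = source.lower(), reference.lower()
    if not s or not r:
        return 0.0
    return lcs_length(s, r) / max(len(s), len(r))
```

A misspelling such as "cambrige" shares a subsequence of length 8 with "cambridge", so under this assumed normalisation it scores 8/9.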