Changes
Jump to navigation
Jump to search
Longest Common Subsequence To reconsile multiple matches the following process is perhaps the simplest (for certain inefficient implementations) undertaken:*If there are both P and A matches and most abundantly used more than one of fuzzy matching technique. The [http:either P and//enor A matches, then determine the P-A pair with the shortest distance between then using a Haversine formula distance calculation based on the GNS reported longitudes and latitudes.wikipedia*If there are multiple P matches but no A matches, take the one that was arrived at first.org/wiki/Longest_common_subsequence Longest Common Subsequence page on wikipedia] provides a very detailed background*If there are multiple A matches but no P matches, take the one that was arrived at first.
Geocoding Inventor Locations (view source)
Revision as of 20:18, 20 August 2009
, 20:18, 20 August 2009no edit summary
This page details the various matching techniques used to Geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations through-out the world and recording their longitude and latitude.
==Script Files== The scripts and modules that operationalize these matching techniques can be downloaded individually by function as individual scripts or as a bundle with all supporting data files ([http://www.edegan.com/repository/MatchPatentLocationsMatchLocations.tar.gz MatchPatentLocationsMatchLocations.tar.gz]).Note that the reference data should be placed in a subdirectory by default named "ref". The bundle contains:*[http://www.edegan.com/repository/MatchLocations.pl MatchLocations.pl] - The main script and initializes and processes the matching requests*[http://www.edegan.com/repository/Match::GNS.pm Match::GNS.pm - Interface to the GNS reference data (see below)*[http://www.edegan.com/repository/Match::Patent.pm Match::Patent.pm - Interface to the Patent Location data (see below)*[http://www.edegan.com/repository/Match::Gram.pm Match::Gram.pm - Custom NGram Module*[http://www.edegan.com/repository/Match::LCS.pm Match::LCS.pm - A standard LCS Module*[http://www.edegan.com/repository/Match::PostalCodes.pm Match::PostalCodes.pm]*[http://www.edegan.com/repository/Match::CleanStrings.pm Match::CleanStrings.pm - Provides string cleaning routines for both the reference and source data*[http://www.edegan.com/repository/PatentLocations-Stopwords.txt PatentLocations-Stopwords.txt - A Stop Word file (tab delimited)
==Reference Data==
*Switzerland: [http://www.edegan.com/repository/GNS-CH.txt GNS-CH.txt]
The perl module [httpMatch:://www.edegan.com/repository/GNS.pm GNS.pm] loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes and ISO3166 code, and the index methods and most other methods take one of two specific GNS FC codes (e.g. "P" for populated place, and "A" for administrative area).
==The Source Files==
The <tt>cty</tt> is used as a primary key in both files. The XX_exceptions.txt provides details on hand identified records, or other records where special care has been taken. This file is not strictly required by the scrips but will be processed if present.
The perl module [httpMatch://www.edegan.com/repository/PatentLocations.pm PatentLocations:Patent.pm] loads and provides an interface to this source data. The source code is the primary module documentation.
==Postal Codes==
*United Kingdom ([http://en.wikipedia.org/wiki/UK_postcodes Sourced from Wikipedia]): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: <tt>([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})</tt>
The [httpMatch:://www.edegan.com/repository/PostalCodes.pm PostalCodes.pm] perl module provides a method to extract a postcode from a text string for a given ISO3166 code.
==The Matching Process==
#Reconsile multiple matches
===Exact Matching Units=== The exact matching of units is performed for both the exception units and units of "well-formatted" records, that is records that have comma seperated logical units. Postcodes are extracted as a logical unit if possible first (to generate the PRS_POSTCODE field). Exact matching is case insensitive and units are trimmed of preceeding and subsequent spaces, but otherwise the match must be exact. ===Token Matching=== Each source string is cleaned of upper-asci code (i.e. reduced to alphanumerics plus spaces), removed of its postcode (to generate PRS_POSTCODE) and then seperated into token arrays on space characters. Thus the string "String1 String2 String3 String4 Postcode" would become:[0]String1 [1]String2 [2]String3 [3]String4 An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5. Then beginning at the upper length limit and decreasing by one after each set of this lenght has been tried, and starting from the right hand-side and moving one unit to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates all length one token sets have been tried and records the matches in the order that they were made. Thus continuing the example above the space-joined source token sets would be, in the order that they are tried:#String1 String2 String3 String4 (token set lenght=4, first and only set)#String2 String3 String4 (token set lenght=3, first set)#String1 String2 String3 (token set lenght=3, second set)#String3 String4 (token set lenght=2, first set)#String2 String3 (token set lenght=2, second set)#String1 String2 (token set lenght=2, third set)#String4 (token set lenght=1, first set)#String3 (token set lenght=1, second set)#String2 (token set lenght=1, third set)#String1 (token set lenght=1, fourth set) ===NGram and LCS Matching=== Longest Common Subsequence (LCS)is an abundantly used fuzzy matching technique. The [http://en.wikipedia.org/wiki/Longest_common_subsequence Longest Common Subsequence page on wikipedia] provides a very detailed background. However, LCS is an matching of two datasets is an NP-Hard problem and extremely processor intensive. To avoid long run times, LCS matching is done on only a small sub-set of string that have met the NGram criteria detailed below. NGram are letter token strings. Source and reference strings are transformed to include only characters from one of the following numbered sets:#ABCDEFGHIJKLMNOPQRSTUVWXYZ (i.e. uppercase Latin alphabet)#0123456789 (i.e. Standard numbers)#" " (i.e. the space character)#Alphanumeric (i.e. 1 and 2 above)#Alphanumeric plus space (1,2,3 above)#Alphabet only plus space(1 and 3 above) Gramsets of various lengths, currently 2,3 and 4 characters, are used. Reference strings are decomposed into grams and both forward and reverse indexed for speed on large datasets. Source strings are decomposed into grams. Candidate reference strings are found by compiling a list of all reference strings that contain at least one of the source grams (this uses the reverse index). Then each of the candidate reference strings is gram-matched against the source gram to compute a number of scores. Specically gram-matching involves computing:*The number of grams in the source string*The number of grams in the reference string*The number of grams in the source string that appear in the reference string*The number of grams in the reference string that appear in the source string*The above two numbers as a source percentage and a reference percentage Then an optional constraint that the first letter of the source and reference strings must match can be employed. If a source and reference pair meets a variety of criteria on the above variables (for example the source string is greater than 10 characters long, the source percentage of grams is greater then 80% and less than 110%, the reference percentage of grams is greater than 90% and less than 105%, and the two strings start with the same letter), then the LCS is computed. An LCS percentage, using the maximum of the lengths of the source and reference strings as a denominator, is also calculated. The candidate reference string with the highest gram and LCS scores, assuming that these scores meet the decision threshold is then selected as the closest match. If no candidate reference string meets the decision threshold the source string is left unmatched. The decision thresholds are configured in the MatchLocations.pl script, and sets of Ngram/LCS matchings, using different character sets, gram lengths and decision thresholds, are performed sequentially, with the currently unmatched source strings used as input for each round. ===Reconsiling Multiple Matches=== In a small number of cases it is possible that the source string will achieve more than one A (Area) or P (Place) match. For example suppose the string "Glouchester Street Cambridge Cambridgeshire" where considered. This could concievably produce two P matches and one A match with the token matching algorithm detailed above.