Difference between revisions of "Geocoding Inventor Locations"
imported>Ed |
|||
(61 intermediate revisions by 4 users not shown) | |||
Line 3: | Line 3: | ||
This page details the various matching techniques used to Geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations through-out the world and recording their longitude and latitude. | This page details the various matching techniques used to Geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations through-out the world and recording their longitude and latitude. | ||
− | + | ==Script Files== | |
− | [http://www. | + | The scripts and modules that operationalize these matching techniques can be downloaded as a bundle with ([http://www.edegan.com/repository/MatchLocations.tar.gz MatchLocations.tar.gz v1.0.1] ~20Mb) or without ([http://www.edegan.com/repository/MatchLocations_Full.tar.gz MatchLocations_Full.tar.gz v1.0.1] ~20Mb) all supporting data files. Note that the current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the MatchLocations.pl script. |
− | |||
− | + | The directories are as follows: | |
− | * | + | *Source - Source data should be placed here. See below for formatting. |
− | + | *Results - Results generated by the scripts, including logs will appear here. | |
− | + | *GNS - contains GNS reference data named GNS-XX.txt | |
− | * | + | *Match - contains the modules |
− | * | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | * | ||
− | |||
− | |||
− | + | The bundle contains: | |
+ | *MatchLocations.pl - The main script that initializes and processes the matching requests | ||
+ | *BatchMatch.pl - A script for running batches | ||
+ | *Match::GNS.pm - Interface to the GNS reference data (see below) | ||
+ | *Match::Patent.pm - Interface to the Patent Location data (see below) | ||
+ | *Match::Common.pm - Provides common (string cleaning) routines for both the reference and source interface modules | ||
+ | *Match::PostalCodes.pm - A module that extracts postcodes of various formats from (address) strings | ||
+ | *Match::Gram.pm - Custom NGram Module | ||
+ | *Match::LCS.pm - A standard LCS Module | ||
+ | *PatentLocations-Stopwords.txt - A Stop Word file (tab delimited) | ||
+ | *GNS Reference Files - The full bundle contains a full set of correctly named GNS reference files | ||
− | + | The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are: | |
− | + | <tt>perl MatchLocations.pl -co GB -u -human -r -wf </tt> | |
− | + | which will process ISO3166 <tt>country</tt> code GB (Great Britain), include <tt>unmatched</tt> inputs in the results file, produce a <tt>human</tt> choices file, write the <tt>report</tt> to a text file, and <tt>write fuzzy</tt> matches to additional seperate files as well as the main results file. Other options include <tt>over</tt> to override country designations and <tt>o</tt> to specify the results filename. | |
− | + | ||
+ | <tt>perl MatchLocations.pl -h</tt> | ||
+ | |||
+ | produces a simple help output. | ||
+ | |||
+ | ==The Source Files== | ||
+ | |||
+ | Per country source files are extracted from the NBER patent data. The format of the source file(s) is as follows (XX is an ISO3166 code): | ||
+ | |||
+ | XX.txt - Tab delimited plain text with no (intentional) string quotation. | ||
+ | Column(s): <tt>country</tt> <tt>str</tt> <tt>cty</tt> <tt>adm</tt> <tt>city</tt> <tt>postcode</tt> <tt>str</tt> | ||
+ | |||
+ | The column order is not important. <tt>country</tt>, <tt>str</tt>, and <tt>cty</tt> can not all be null. <tt>adm</tt> <tt>city</tt> <tt>postcode</tt> are optional 'exception' fields that are processed with priority. They provide hand corrections and other specifically generated information. | ||
+ | |||
+ | The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation. The Match::PostalCodes.pm perl module provides a method to extract [[Postal Codes]] from a the addresses for a large number of ISO3166 codes, and implements 'standard' postal code identification for all other jurisdictions. | ||
+ | |||
+ | ==Reference Data== | ||
+ | |||
+ | The reference data for the locations (which provides the longitude and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]] which covers the world excluding the U.S. and Antartica. | ||
+ | |||
+ | This project uses [[ISO3166]] two-character country codes to name source and reference files. GNS does not use ISO3166 country codes, and so users will need to translate accordingly (see the [[GEOnet Names Server | GNS page]] for details). A full bundle of correctly names GNS files is also available. | ||
+ | |||
+ | The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes an ISO3166 code, and the index methods and most other methods take specific GNS FC codes (e.g. "P" for populated place, "L" for locality, and "A" for administrative area). Which GNS FC codes are used is specified in the @Letters global varible of MatchLocations.pl and inherited by all other modules. | ||
+ | |||
+ | MatchLocations.pl also retrieves a list of all ISO3166 codes included in the data (from the MatchPatent.pm module) and in any specified override file, and calls Match::GNS.pm to load them. An override file can be specified with the <tt>-over</tt> option. Override files are tab-delimited and have the format: | ||
+ | |||
+ | ListedISO3166 1stPreference 2ndPreference 3rdPreference ... | ||
+ | |||
+ | The ISO3166 listed in the source data is then overridden and the alternatives are searched for matches in order of preference. The search is terminated when a match is found or the override set is exhausted. | ||
+ | |||
+ | ==The Matching Process== | ||
+ | |||
+ | The matching process is carried out by [http://www.edegan.com/repository/MatchPatentLocations.pl MatchLocations.pl] script, and its dependent modules (detailed above), which has a standard pod based command line interface. The <tt>-co</tt> option specifies the ISO3166 country code to be matched. If the override option is used, then the <tt>-co</tt> option can be used to specify the source file. When an override option is set to 1, rather than to the filename containing the overrides, then the source files countries are used to determine which GNS lookups to perform, otherwise the <tt>-co</tt> option specifies the GNS reference set. | ||
+ | |||
+ | Glossary of terms: | ||
+ | *Units - isolated logical units from an address, such as the street number and name, the town, or the region. Postal codes are treated separately. | ||
+ | *Tokens - Single words or sequences of words separated by a space (note that this is a specific usage) | ||
+ | *n-grams - character sequences, such as bigrams (two letters from aa to zz), trigrams (aaa-zzz) and so forth | ||
+ | *Exact Matching - Case insensitive of matching of the entire sequence of both the source and the reference strings | ||
+ | *LCS - Longest Common Subsequence based matching (See below) | ||
+ | *Administrative area, populated place, and locality - locations identified as a FC=A, FC=P or FC=L respectively in the GNS data. Unless otherwise specified, matches are performed for all GNS FC codes requested (default is A,P,L) separately and in series. | ||
+ | |||
+ | The sequence of processing is as follows (matching only the remaining unmatched locations at each stage): | ||
+ | #Load the source files, clean and parse (parsing identifies units) | ||
+ | #Load the reference file, build indices | ||
+ | #Exact match the exception units of records with exceptions | ||
+ | #Exact match the units of well-formatted records | ||
+ | #Exact match tokens (1-5 words) | ||
+ | #N-gram and LCS match | ||
+ | #Reconsile multiple matches | ||
+ | |||
+ | ===Exact Matching Units=== | ||
+ | |||
+ | The exact matching of units is performed for both the exception units and units of "well-formatted" records, that is records that have comma seperated logical units. Postcodes are extracted as a logical unit if possible first (to generate the PRS_POSTCODE field). Exact matching is case insensitive and units are trimmed of preceeding and subsequent spaces, but otherwise the match must be exact. Units are matched from the bottom to the top, in order of precedence. That is if the string is Unit1, Unit2, Unit3, Postcode; then Unit3 is matched with precedence over Units 2 and 1, and so forth. However, if multiple matches are made for a some FC code and one match is made for another, then preference is given to the different combination. For example if the string were "Chelsea, London" and both Chelsea and London were recorded in the GNS data as FC=P, but only London was recorded as a FC=A, then it would be most sensible to record P=Chelsea, A=London, and not P=London, A=London. This is differencing is done in the matching method and independent from the resolution of multiple matches at the end. | ||
+ | |||
+ | ===Token Matching=== | ||
+ | |||
+ | Each source string is cleaned of upper-asci code (i.e. reduced to alphanumerics plus spaces), removed of its postcode (to generate PRS_POSTCODE) and then seperated into token arrays on space characters. Thus the string "String1 String2 String3 String4 Postcode" would become: | ||
+ | [0]String1 | ||
+ | [1]String2 | ||
+ | [2]String3 | ||
+ | [3]String4 | ||
+ | |||
+ | An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5. Then beginning at the upper length limit and decreasing by one after each set of this lenght has been tried, and starting from the right hand-side and moving one unit to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates all length one token sets have been tried and records the matches in the order that they were made. Thus continuing the example above the space-joined source token sets would be, in the order that they are tried: | ||
+ | #String1 String2 String3 String4 (token set length=4, first and only set) | ||
+ | #String2 String3 String4 (token set length=3, first set) | ||
+ | #String1 String2 String3 (token set length=3, second set) | ||
+ | #String3 String4 (token set length=2, first set) | ||
+ | #String2 String3 (token set length=2, second set) | ||
+ | #String1 String2 (token set length=2, third set) | ||
+ | #String4 (token set length=1, first set) | ||
+ | #String3 (token set length=1, second set) | ||
+ | #String2 (token set length=1, third set) | ||
+ | #String1 (token set length=1, fourth set) | ||
+ | |||
+ | ===NGram and LCS Matching=== | ||
+ | |||
+ | Longest Common Subsequence (LCS) is an abundantly used fuzzy matching technique. The [http://en.wikipedia.org/wiki/Longest_common_subsequence Longest Common Subsequence page on wikipedia] provides a very detailed background. However, LCS matching of two datasets is an NP-Hard problem and extremely processor intensive. To avoid long run-times, LCS matching is done on only a small sub-set of strings that have met the NGram criteria detailed below. | ||
+ | |||
+ | NGrams are character-based token strings. Source and reference strings are transformed to include only characters from one of the following numbered sets: | ||
+ | #ABCDEFGHIJKLMNOPQRSTUVWXYZ (i.e. uppercase Latin alphabet) | ||
+ | #0123456789 (i.e. Standard numbers) | ||
+ | #" " (i.e. the space character) | ||
+ | #Alphanumeric (i.e. 1 and 2 above) | ||
+ | #Alphanumeric plus space (1,2,3 above) | ||
+ | #Alphabet only plus space(1 and 3 above) | ||
+ | |||
+ | Gramsets of various lengths, currently 2,3 and 4 characters, are used. Reference strings are decomposed into grams and both forward and reverse indexed for speed on large datasets. Source strings are decomposed into grams. Candidate reference strings are found by compiling a list of all reference strings that contain at least one of the source grams (this uses the reverse index). Then each of the candidate reference strings is gram-matched against the source gram to compute a number of scores. Specically gram-matching involves computing: | ||
+ | *The number of grams in the source string | ||
+ | *The number of grams in the reference string | ||
+ | *The number of grams in the source string that appear in the reference string | ||
+ | *The number of grams in the reference string that appear in the source string | ||
+ | *The above two numbers as a source percentage and a reference percentage | ||
+ | |||
+ | Then an optional constraint that the first letter of the source and reference strings must match can be employed. If a source and reference pair meets a variety of criteria on the above variables (for example the source string is greater than 10 characters long, the source percentage of grams is greater then 80% and less than 110%, the reference percentage of grams is greater than 90% and less than 105%, and the two strings start with the same letter), then the LCS is computed. An LCS percentage, using the maximum of the lengths of the source and reference strings as a denominator, is also calculated. | ||
+ | |||
+ | The candidate reference string with the highest gram and LCS scores, assuming that these scores meet the decision threshold is then selected as the closest match. If no candidate reference string meets the decision threshold the source string is left unmatched. The decision thresholds are configured in the MatchLocations.pl script, and sets of Ngram/LCS matchings, using different character sets, gram lengths and decision thresholds, are performed sequentially, with the currently unmatched source strings used as input for each round. | ||
+ | |||
+ | ===Reconciling Multiple Matches=== | ||
+ | |||
+ | In a small number of cases it is possible that the source string will achieve more than one match for more than one FC code. For example suppose the string "Glouchester Street Cambridge Cambridgeshire" were considered. This could concievably produce two P matches and one A match with the token matching algorithm detailed above. | ||
+ | |||
+ | To reconsile multiple matches the following process is undertaken: | ||
+ | *If an FC code has only one match keep that one match | ||
+ | *Aim for distinction in the set, giving priority in the order that the FC codes are specified in MatchLocations.pl. The default is to include A,P,L in order, so that precedence follows importance and size. This is important if multiple FC codes contain multiple overlapping matches. For example suppose A=1,2 P=2,3 and L=3. The algorithm will look forward and backwards to assign: A=1 P=2 L=3. | ||
+ | *Determine the set of FC code matches with the shortest distance between them using a [http://en.wikipedia.org/wiki/Haversine_formula Haversine formula] distance calculation based on the GNS reported longitudes and latitudes. (Note that the Haversine formula is implemented in the Match::GNS.pm module and is the most accurate method over short distances, where other methods, like the great-circle method, suffer from compounded rounding error problems.) This is important when multiple FC codes have muliple matches but they do not overlap. | ||
+ | *If one or more match is found for an FC code then one final 'best' match must be reported, even if it overlaps with another FC code or is distant. | ||
+ | |||
+ | ==Human Choices== | ||
+ | |||
+ | It is generally preferrable to have a very high degree of confidence in the fuzzy matches, so that they can be treated as correct without individual inspection. However the script and modules are capable of matching to any degree of accuracy. To get further matches that can be inspected/validated/chosen by a human agent, a very weak criteria is set for two runs of fuzzy matching, and then in each run the best (in terms of parameter scores) options are recorded and written into a 'human choice' file. | ||
+ | |||
+ | As a result a human choice file may contain: | ||
+ | #No matches for a source string as none of the reference strings managed to reach even the very weak threshold criteria. | ||
+ | #One match, as both runs of fuzzy matching produced the same recommendation. | ||
+ | #Two matches, as both runs of fuzzy matching produced one best candidate and the candidates were unique. | ||
+ | #More than two matches, as one or both of the fuzzy matching runs had multiple unique candidates with the same scores. | ||
+ | |||
+ | It appears likely that blocks of matches will be able to be identified from the human choice files, by restricting the results sets to ranges for one or more of the provided match accuracy parameters. | ||
+ | |||
+ | ==Output Files== | ||
+ | |||
+ | By default all files are outputted to the Results directory. Which files are outputted depends on the options selected, though the main results file is always outputted (with or without unmatched addresses) and includes fuzzy matches (unless the <tt>-e</tt> option is used to force just exact matching). The main results file outputs: | ||
+ | *COUNTRY - From the source entry | ||
+ | *STR - From the source entry | ||
+ | *CTY - From the source entry | ||
+ | *EXP_CITY - From the source entry | ||
+ | *EXP_ADM - From the source entry | ||
+ | *EXP_POSTCODE - From the source entry | ||
+ | *CTY_STR - A compound entry, delimited by #, used as an internal key. It is the software's best estimate of an address structure. | ||
+ | *EXP_STR - A compound entry, delimited by #, made from the exception data in a similar way to CTY_STR | ||
+ | *PRS_POSTCODE - The software's best estimate of the postcode if any | ||
+ | *MATCH_TYPE - The match type that was used to make the match | ||
+ | *PLACE - The name of the most precise location | ||
+ | *UNI - The GNS unique identifier of the most precise location | ||
+ | *LAT - The latitude of the most precise location | ||
+ | *LONG - The longitude of the most precise location | ||
+ | *FC - The FC code of the most precise location | ||
+ | |||
+ | The most precise location is taken to be the finest grained result. That is the match corresponding to the lowest level FC code. In the case of the default of FC=A,P,L preference is given to L then P then A. The following variables are then repeated for each FC code searched, and prefixed by the FC code (if no match was found for this FC code the entries will be blank): | ||
+ | *NAME | ||
+ | *UNI | ||
+ | *LAT | ||
+ | *LONG | ||
+ | |||
+ | The fuzzy match file(s), if requested with <tt>-wf</tt>, have the same format (they are written by the same method). The report file is a copy of the output to the terminal, and can be enabled with the <tt>-r</tt> option. The human choice file (enabled with <tt>-human</tt>) has its own format as follows: | ||
+ | *SOURCENAME - The word, token or string from the source entry that is being considered as relevant for a match | ||
+ | *REFNAME - The name of a place in the GNS file | ||
+ | *COUNTRY - From the source entry | ||
+ | *STR - From the source entry | ||
+ | *CTY - From the source entry | ||
+ | *EXP_CITY - From the source entry | ||
+ | *EXP_ADM - From the source entry | ||
+ | *EXP_POSTCODE - From the source entry | ||
+ | *REFTOTAL - The total number of grams in REFNAME | ||
+ | *SOURCETOTAL - The total number of grams in SOURCENAME | ||
+ | *REFPC - the percentage of the REFNAME grams that appear in the SOURCENAME gram set | ||
+ | *SOURCEPC - the percentage of the SOURCENAME grams that appear in the REFNAME gram set | ||
+ | *LEFTGRAMS - the number of the REFNAME grams that appear in the SOURCENAME gram set | ||
+ | *RIGHTGRAMS - the number of the SOURCENAME grams that appear in the REFNAME gram set | ||
+ | *LCSSCORE - The size of the longest common subsequence in characters | ||
+ | *SOURCELENGTH - The length of SOURCENAME | ||
+ | *REFLENGTH - The length of REFNAME | ||
+ | *MAXLENGTH - The maximum of the lengths of SOURCENAME and REFNAME | ||
+ | *LCSPC - The LCSSCORE divided by the MAXLENGTH | ||
+ | *FIRSTLETTERBINDS - Whether the fuzzy matching algorithm required the same first letter in SOURCENAME and REFNAME | ||
+ | *GRAMALPHABET - The gram alphabet used by the matching algorithm | ||
+ | *GRAMLENGTH - The length of the n-grams used |
Latest revision as of 04:47, 22 January 2010
- This page is part of a series under the NBER Patent Data Project
This page details the various matching techniques used to Geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations through-out the world and recording their longitude and latitude.
Contents
Script Files
The scripts and modules that operationalize these matching techniques can be downloaded as a bundle with (MatchLocations.tar.gz v1.0.1 ~20Mb) or without (MatchLocations_Full.tar.gz v1.0.1 ~20Mb) all supporting data files. Note that the current version is 1.0.3, which will be posted shortly. The bundles contain the default directory structure. Defaults can be changed in the MatchLocations.pl script.
The directories are as follows:
- Source - Source data should be placed here. See below for formatting.
- Results - Results generated by the scripts, including logs will appear here.
- GNS - contains GNS reference data named GNS-XX.txt
- Match - contains the modules
The bundle contains:
- MatchLocations.pl - The main script that initializes and processes the matching requests
- BatchMatch.pl - A script for running batches
- Match::GNS.pm - Interface to the GNS reference data (see below)
- Match::Patent.pm - Interface to the Patent Location data (see below)
- Match::Common.pm - Provides common (string cleaning) routines for both the reference and source interface modules
- Match::PostalCodes.pm - A module that extracts postcodes of various formats from (address) strings
- Match::Gram.pm - Custom NGram Module
- Match::LCS.pm - A standard LCS Module
- PatentLocations-Stopwords.txt - A Stop Word file (tab delimited)
- GNS Reference Files - The full bundle contains a full set of correctly named GNS reference files
The MatchLocations.pl script can be run from any shell or command line with perl installed. Example commands are:
perl MatchLocations.pl -co GB -u -human -r -wf
which will process ISO3166 country code GB (Great Britain), include unmatched inputs in the results file, produce a human choices file, write the report to a text file, and write fuzzy matches to additional seperate files as well as the main results file. Other options include over to override country designations and o to specify the results filename.
perl MatchLocations.pl -h
produces a simple help output.
The Source Files
Per country source files are extracted from the NBER patent data. The format of the source file(s) is as follows (XX is an ISO3166 code):
XX.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): country str cty adm city postcode str
The column order is not important. country, str, and cty can not all be null. adm city postcode are optional 'exception' fields that are processed with priority. They provide hand corrections and other specifically generated information.
The perl module Match::Patent.pm loads and provides an interface to this source data. The source code is the primary module documentation. The Match::PostalCodes.pm perl module provides a method to extract Postal Codes from a the addresses for a large number of ISO3166 codes, and implements 'standard' postal code identification for all other jurisdictions.
Reference Data
The reference data for the locations (which provides the longitude and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's GEOnet Names Server (GNS) which covers the world excluding the U.S. and Antartica.
This project uses ISO3166 two-character country codes to name source and reference files. GNS does not use ISO3166 country codes, and so users will need to translate accordingly (see the GNS page for details). A full bundle of correctly names GNS files is also available.
The perl module Match::GNS.pm loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. The load() method takes an ISO3166 code, and the index methods and most other methods take specific GNS FC codes (e.g. "P" for populated place, "L" for locality, and "A" for administrative area). Which GNS FC codes are used is specified in the @Letters global varible of MatchLocations.pl and inherited by all other modules.
MatchLocations.pl also retrieves a list of all ISO3166 codes included in the data (from the MatchPatent.pm module) and in any specified override file, and calls Match::GNS.pm to load them. An override file can be specified with the -over option. Override files are tab-delimited and have the format:
ListedISO3166 1stPreference 2ndPreference 3rdPreference ...
The ISO3166 listed in the source data is then overridden and the alternatives are searched for matches in order of preference. The search is terminated when a match is found or the override set is exhausted.
The Matching Process
The matching process is carried out by MatchLocations.pl script, and its dependent modules (detailed above), which has a standard pod based command line interface. The -co option specifies the ISO3166 country code to be matched. If the override option is used, then the -co option can be used to specify the source file. When an override option is set to 1, rather than to the filename containing the overrides, then the source files countries are used to determine which GNS lookups to perform, otherwise the -co option specifies the GNS reference set.
Glossary of terms:
- Units - isolated logical units from an address, such as the street number and name, the town, or the region. Postal codes are treated separately.
- Tokens - Single words or sequences of words separated by a space (note that this is a specific usage)
- n-grams - character sequences, such as bigrams (two letters from aa to zz), trigrams (aaa-zzz) and so forth
- Exact Matching - Case insensitive of matching of the entire sequence of both the source and the reference strings
- LCS - Longest Common Subsequence based matching (See below)
- Administrative area, populated place, and locality - locations identified as a FC=A, FC=P or FC=L respectively in the GNS data. Unless otherwise specified, matches are performed for all GNS FC codes requested (default is A,P,L) separately and in series.
The sequence of processing is as follows (matching only the remaining unmatched locations at each stage):
- Load the source files, clean and parse (parsing identifies units)
- Load the reference file, build indices
- Exact match the exception units of records with exceptions
- Exact match the units of well-formatted records
- Exact match tokens (1-5 words)
- N-gram and LCS match
- Reconsile multiple matches
Exact Matching Units
The exact matching of units is performed for both the exception units and units of "well-formatted" records, that is records that have comma seperated logical units. Postcodes are extracted as a logical unit if possible first (to generate the PRS_POSTCODE field). Exact matching is case insensitive and units are trimmed of preceeding and subsequent spaces, but otherwise the match must be exact. Units are matched from the bottom to the top, in order of precedence. That is if the string is Unit1, Unit2, Unit3, Postcode; then Unit3 is matched with precedence over Units 2 and 1, and so forth. However, if multiple matches are made for a some FC code and one match is made for another, then preference is given to the different combination. For example if the string were "Chelsea, London" and both Chelsea and London were recorded in the GNS data as FC=P, but only London was recorded as a FC=A, then it would be most sensible to record P=Chelsea, A=London, and not P=London, A=London. This is differencing is done in the matching method and independent from the resolution of multiple matches at the end.
Token Matching
Each source string is cleaned of upper-asci code (i.e. reduced to alphanumerics plus spaces), removed of its postcode (to generate PRS_POSTCODE) and then seperated into token arrays on space characters. Thus the string "String1 String2 String3 String4 Postcode" would become: [0]String1 [1]String2 [2]String3 [3]String4
An arbitrary upper token set length limit of 5 is used if the length of the source token array (4 in the example above) is greater than or equal to 5. Then beginning at the upper length limit and decreasing by one after each set of this lenght has been tried, and starting from the right hand-side and moving one unit to the left each time, the token sets are joined with spaces and exact matched against the reference string. This process iterates all length one token sets have been tried and records the matches in the order that they were made. Thus continuing the example above the space-joined source token sets would be, in the order that they are tried:
- String1 String2 String3 String4 (token set length=4, first and only set)
- String2 String3 String4 (token set length=3, first set)
- String1 String2 String3 (token set length=3, second set)
- String3 String4 (token set length=2, first set)
- String2 String3 (token set length=2, second set)
- String1 String2 (token set length=2, third set)
- String4 (token set length=1, first set)
- String3 (token set length=1, second set)
- String2 (token set length=1, third set)
- String1 (token set length=1, fourth set)
NGram and LCS Matching
Longest Common Subsequence (LCS) is an abundantly used fuzzy matching technique. The Longest Common Subsequence page on wikipedia provides a very detailed background. However, LCS matching of two datasets is an NP-Hard problem and extremely processor intensive. To avoid long run-times, LCS matching is done on only a small sub-set of strings that have met the NGram criteria detailed below.
NGrams are character-based token strings. Source and reference strings are transformed to include only characters from one of the following numbered sets:
- ABCDEFGHIJKLMNOPQRSTUVWXYZ (i.e. uppercase Latin alphabet)
- 0123456789 (i.e. Standard numbers)
- " " (i.e. the space character)
- Alphanumeric (i.e. 1 and 2 above)
- Alphanumeric plus space (1,2,3 above)
- Alphabet only plus space(1 and 3 above)
Gramsets of various lengths, currently 2,3 and 4 characters, are used. Reference strings are decomposed into grams and both forward and reverse indexed for speed on large datasets. Source strings are decomposed into grams. Candidate reference strings are found by compiling a list of all reference strings that contain at least one of the source grams (this uses the reverse index). Then each of the candidate reference strings is gram-matched against the source gram to compute a number of scores. Specically gram-matching involves computing:
- The number of grams in the source string
- The number of grams in the reference string
- The number of grams in the source string that appear in the reference string
- The number of grams in the reference string that appear in the source string
- The above two numbers as a source percentage and a reference percentage
Then an optional constraint that the first letter of the source and reference strings must match can be employed. If a source and reference pair meets a variety of criteria on the above variables (for example the source string is greater than 10 characters long, the source percentage of grams is greater then 80% and less than 110%, the reference percentage of grams is greater than 90% and less than 105%, and the two strings start with the same letter), then the LCS is computed. An LCS percentage, using the maximum of the lengths of the source and reference strings as a denominator, is also calculated.
The candidate reference string with the highest gram and LCS scores, assuming that these scores meet the decision threshold is then selected as the closest match. If no candidate reference string meets the decision threshold the source string is left unmatched. The decision thresholds are configured in the MatchLocations.pl script, and sets of Ngram/LCS matchings, using different character sets, gram lengths and decision thresholds, are performed sequentially, with the currently unmatched source strings used as input for each round.
Reconciling Multiple Matches
In a small number of cases it is possible that the source string will achieve more than one match for more than one FC code. For example suppose the string "Glouchester Street Cambridge Cambridgeshire" were considered. This could concievably produce two P matches and one A match with the token matching algorithm detailed above.
To reconsile multiple matches the following process is undertaken:
- If an FC code has only one match keep that one match
- Aim for distinction in the set, giving priority in the order that the FC codes are specified in MatchLocations.pl. The default is to include A,P,L in order, so that precedence follows importance and size. This is important if multiple FC codes contain multiple overlapping matches. For example suppose A=1,2 P=2,3 and L=3. The algorithm will look forward and backwards to assign: A=1 P=2 L=3.
- Determine the set of FC code matches with the shortest distance between them using a Haversine formula distance calculation based on the GNS reported longitudes and latitudes. (Note that the Haversine formula is implemented in the Match::GNS.pm module and is the most accurate method over short distances, where other methods, like the great-circle method, suffer from compounded rounding error problems.) This is important when multiple FC codes have muliple matches but they do not overlap.
- If one or more match is found for an FC code then one final 'best' match must be reported, even if it overlaps with another FC code or is distant.
Human Choices
It is generally preferrable to have a very high degree of confidence in the fuzzy matches, so that they can be treated as correct without individual inspection. However the script and modules are capable of matching to any degree of accuracy. To get further matches that can be inspected/validated/chosen by a human agent, a very weak criteria is set for two runs of fuzzy matching, and then in each run the best (in terms of parameter scores) options are recorded and written into a 'human choice' file.
As a result a human choice file may contain:
- No matches for a source string as none of the reference strings managed to reach even the very weak threshold criteria.
- One match, as both runs of fuzzy matching produced the same recommendation.
- Two matches, as both runs of fuzzy matching produced one best candidate and the candidates were unique.
- More than two matches, as one or both of the fuzzy matching runs had multiple unique candidates with the same scores.
It appears likely that blocks of matches will be able to be identified from the human choice files, by restricting the results sets to ranges for one or more of the provided match accuracy parameters.
Output Files
By default all files are outputted to the Results directory. Which files are outputted depends on the options selected, though the main results file is always outputted (with or without unmatched addresses) and includes fuzzy matches (unless the -e option is used to force just exact matching). The main results file outputs:
- COUNTRY - From the source entry
- STR - From the source entry
- CTY - From the source entry
- EXP_CITY - From the source entry
- EXP_ADM - From the source entry
- EXP_POSTCODE - From the source entry
- CTY_STR - A compound entry, delimited by #, used as an internal key. It is the software's best estimate of an address structure.
- EXP_STR - A compound entry, delimited by #, made from the exception data in a similar way to CTY_STR
- PRS_POSTCODE - The software's best estimate of the postcode if any
- MATCH_TYPE - The match type that was used to make the match
- PLACE - The name of the most precise location
- UNI - The GNS unique identifier of the most precise location
- LAT - The latitude of the most precise location
- LONG - The longitude of the most precise location
- FC - The FC code of the most precise location
The most precise location is taken to be the finest grained result. That is the match corresponding to the lowest level FC code. In the case of the default of FC=A,P,L preference is given to L then P then A. The following variables are then repeated for each FC code searched, and prefixed by the FC code (if no match was found for this FC code the entries will be blank):
- NAME
- UNI
- LAT
- LONG
The fuzzy match file(s), if requested with -wf, have the same format (they are written by the same method). The report file is a copy of the output to the terminal, and can be enabled with the -r option. The human choice file (enabled with -human) has its own format as follows:
- SOURCENAME - The word, token or string from the source entry that is being considered as relevant for a match
- REFNAME - The name of a place in the GNS file
- COUNTRY - From the source entry
- STR - From the source entry
- CTY - From the source entry
- EXP_CITY - From the source entry
- EXP_ADM - From the source entry
- EXP_POSTCODE - From the source entry
- REFTOTAL - The total number of grams in REFNAME
- SOURCETOTAL - The total number of grams in SOURCENAME
- REFPC - the percentage of the REFNAME grams that appear in the SOURCENAME gram set
- SOURCEPC - the percentage of the SOURCENAME grams that appear in the REFNAME gram set
- LEFTGRAMS - the number of the REFNAME grams that appear in the SOURCENAME gram set
- RIGHTGRAMS - the number of the SOURCENAME grams that appear in the REFNAME gram set
- LCSSCORE - The size of the longest common subsequence in characters
- SOURCELENGTH - The length of SOURCENAME
- REFLENGTH - The length of REFNAME
- MAXLENGTH - The maximum of the lengths of SOURCENAME and REFNAME
- LCSPC - The LCSSCORE divided by the MAXLENGTH
- FIRSTLETTERBINDS - Whether the fuzzy matching algorithm required the same first letter in SOURCENAME and REFNAME
- GRAMALPHABET - The gram alphabet used by the matching algorithm
- GRAMLENGTH - The length of the n-grams used