==Country Codes==
This project uses [[ISO3166]] two-character country codes, as recognised by the UN and (with exceptions) used for top level domains on the internet.

The scripts and modules that operationalize these matching techniques can be downloaded individually by function or as a bundle with all supporting data files ([http://www.edegan.com/repository/MatchPatentLocations.tar.gz MatchPatentLocations.tar.gz]).
Geocoding Inventor Locations
Revision as of 20:49, 19 August 2009
This page details the various matching techniques used to geocode inventor locations in the NBER patent data. Geocoding inventor locations entails matching the inventor addresses provided in the patent data to known locations throughout the world and recording their longitude and latitude.
==Reference Data==
The reference data for the locations (which provides the longitudes and latitudes) is taken from the (U.S.) National Geospatial-Intelligence Agency's [[GEOnet Names Server | GEOnet Names Server (GNS)]], which covers the world excluding the U.S. and Antarctica.
This project uses [[ISO3166]] two-character country codes to name source and reference files. Available reference data files for countries that have been or are being processed include:
*Belgium: [http://www.edegan.com/repository/GNS-BE.txt GNS-BE.txt]
*France: [http://www.edegan.com/repository/GNS-FR.txt GNS-FR.txt]
*Great Britain (The United Kingdom of Great Britain and Northern Ireland): [http://www.edegan.com/repository/GNS-GB.txt GNS-GB.txt]
*Spain: [http://www.edegan.com/repository/GNS-ES.txt GNS-ES.txt]
*Switzerland: [http://www.edegan.com/repository/GNS-CH.txt GNS-CH.txt]
The perl module [http://www.edegan.com/repository/GNS.pm GNS.pm] loads, indexes and provides an interface to key variables from this data. The source code is the primary module documentation. Exported Methods include:
*new() - Constructor. Takes an ISO3166 code, calls Load
*Load() - Expects to find GNS-XX.txt (where XX is an ISO3166 code) with GNS standard column names; loads it.
*Index - Builds the master index and all sub-indices
*GetIndexKeys() - Takes a GNS FC code (e.g. P for populated place, A for administrative area, or L) or ALL and returns a set of index keys
*GetUNIs() - Takes a place name and a type (e.g. P,L,A,ALL); returns a list of corresponding UNIs
*GetLongLat() - Takes a UNI, returns a longitude, latitude pair
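The interface listed above can be illustrated with a minimal Python sketch. This is not the code of GNS.pm: the class and method names are hypothetical analogues, the column names (<tt>UNI LAT LONG FC FULL_NAME</tt>) are assumed to follow the GNS layout, and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in a GNS-like tab-delimited layout. The column names
# (UNI, LAT, LONG, FC, FULL_NAME) are an assumption; check the real header.
SAMPLE = (
    "UNI\tLAT\tLONG\tFC\tFULL_NAME\n"
    "1001\t50.85\t4.35\tP\tBrussels\n"
    "1002\t50.83\t4.33\tA\tBrussels\n"
    "1003\t51.22\t4.40\tP\tAntwerp\n"
)


class GnsIndex:
    """Hypothetical Python analogue of the index described for GNS.pm."""

    def __init__(self, handle):
        self.coords = {}  # UNI -> (longitude, latitude)
        # FC code (or "ALL") -> upper-cased place name -> list of UNIs
        self.by_fc = defaultdict(lambda: defaultdict(list))
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            uni = row["UNI"]
            self.coords[uni] = (float(row["LONG"]), float(row["LAT"]))
            key = row["FULL_NAME"].strip().upper()
            self.by_fc[row["FC"]][key].append(uni)
            self.by_fc["ALL"][key].append(uni)

    def get_index_keys(self, fc="ALL"):
        """Return the set of place-name keys for one FC code or for ALL."""
        return set(self.by_fc[fc])

    def get_unis(self, name, fc="ALL"):
        """Return the UNIs recorded under a place name for the given type."""
        return list(self.by_fc[fc].get(name.strip().upper(), []))

    def get_long_lat(self, uni):
        """Return the (longitude, latitude) pair for a UNI."""
        return self.coords[uni]


index = GnsIndex(io.StringIO(SAMPLE))
```

One place name can map to several UNIs (here a populated place and an administrative area share a name), which is why lookups take a type and return a list.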
==The Source Files==
*XX_exceptions.txt - Tab delimited plain text with no (intentional) string quotation. Column(s): <tt>cty city adm postcode</tt>
The <tt>cty</tt> is used as a primary key in both files. The XX_exceptions.txt provides details on hand-identified records, or other records where special care has been taken. This file is not strictly required by the scripts but will be processed if present.
The perl module [http://www.edegan.com/repository/PatentLocations.pm PatentLocations.pm] loads and provides an interface to this source data. The source code is the primary module documentation. Exported Methods include:
*new() - Takes an ISO3166 country code, calls load. Expects to find a stop words file ([http://www.edegan.com/repository/PatentLocations-Stopwords.txt PatentLocations-Stopwords.txt]) and a postcode RegEx file ([http://www.edegan.com/repository/PatentLocations-PostCode.rex PatentLocations-PostCode.rex]).
*Load() - Loads the data file(s)
*CleanAndParse() - Does a first round of cleaning and parsing (calls internal methods). Extracts the postcode and replaces stop words.
*UnMatched() - Takes an FC code (e.g. P,L,A,ALL) and returns the set of currently unmatched country name keys for that type
*ReturnMatches() - Marks country name keys with their new match sets
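As a rough illustration of the stop-word part of the cleaning step, the Python sketch below upper-cases a location string, splits it into tokens and drops stop words. It is not the code of PatentLocations.pm: the function name and the stop words are hypothetical, and dropping the words outright (rather than substituting something for them) is an assumption.

```python
import re

# Hypothetical stop words; the real list lives in
# PatentLocations-Stopwords.txt, whose format is not described here.
STOP_WORDS = {"NEAR", "NR", "C/O"}


def clean_location(raw, stop_words=STOP_WORDS):
    """Upper-case, split on whitespace/commas/semicolons, drop stop words."""
    tokens = re.split(r"[\s,;]+", raw.upper())
    # Trailing or leading periods are stripped so "NR." is treated as "NR".
    stripped = [token.strip(".") for token in tokens]
    kept = [token for token in stripped if token and token not in stop_words]
    return " ".join(kept)
```

For example, <tt>clean_location("nr. Cambridge,  Cambs")</tt> yields <tt>CAMBRIDGE CAMBS</tt>.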
==Postal Codes==
*United Kingdom ([http://en.wikipedia.org/wiki/UK_postcodes Sourced from Wikipedia]): A9 9AA, A99 9AA, A9A 9AA, AA9 9AA, AA99 9AA, AA9A 9AA. Simple Regex: <tt>([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})</tt>
The [http://www.edegan.com/repository/PostalCodes.pm PostalCodes.pm] perl module provides a method to extract a postcode from a text string for a given ISO3166 code.
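A minimal Python sketch of postcode extraction keyed by country code follows. The UK pattern is the simple regex given above; the function name, the dictionary and the choice to return the remaining text are assumptions for illustration and do not describe the PostalCodes.pm interface.

```python
import re

# UK pattern copied from the simple regex above. Other countries would need
# their own entries; none are supplied here.
POSTCODE_PATTERNS = {
    "GB": re.compile(r"([A-Z]{1,2}[0-9]{1,2}[A-Z]{0,1}\s[0-9][A-Z]{2,2})"),
}


def extract_postcode(text, country):
    """Return (postcode, remaining_text), both upper-cased.

    postcode is None when the country's pattern does not match.
    """
    upper = text.upper()
    match = POSTCODE_PATTERNS[country].search(upper)
    if match is None:
        return None, " ".join(upper.split())
    start, end = match.span()
    # Cut the postcode out and collapse the whitespace left behind.
    remainder = upper[:start] + " " + upper[end:]
    return match.group(1), " ".join(remainder.split())
```

The pattern is unanchored, so it returns the first postcode-shaped substring it finds; stricter word boundaries may be needed on noisy address strings.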
==The Matching Process==
The matching process is carried out by [http://www.edegan.com/repository/MatchPatentLocations.pl MatchPatentLocations.pl], which has a standard POD-based command line interface. The -co option specifies the ISO3166 country code to be matched. The script uses these modules: PatentLocations.pm, GNS.pm, CleanStrings.pm, GramMatch.pm, LCS.pm and PostalCodes.pm. In addition to the GNS reference files and patent data source files detailed above, the script also uses PatentLocations-Stopwords.txt.
Glossary of terms:
#Exact match the units of well-formatted records
#Exact match tokens (1-5 words)
#LCS match the units of records with exceptions
#LCS match (all other)
#N-gram and LCS match
#Reconcile multiple matches
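The cascade of exact, token and approximate matching can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of MatchPatentLocations.pl: LCS is taken to mean longest common subsequence (LCS.pm may differ), n-grams are character bigrams used only to shortlist candidates, the 0.8 threshold is arbitrary, and exception handling and reconciliation of multiple matches are left out.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(a, b):
    """LCS length normalised by the longer string, in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def ngrams(text, n=2):
    """Set of character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def token_windows(text, max_words=5):
    """Yield runs of 1..max_words consecutive words, longest first."""
    words = text.split()
    for size in range(min(max_words, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])


def match_location(text, names, threshold=0.8):
    """Return (matched_name, method), or (None, None) if nothing qualifies."""
    if text in names:                      # exact match on the whole string
        return text, "exact"
    for window in token_windows(text):     # exact match on 1-5 word tokens
        if window in names:
            return window, "token"
    grams = ngrams(text)                   # shortlist by shared bigrams
    candidates = [name for name in names if grams & ngrams(name)]
    best = max(candidates, key=lambda name: lcs_score(text, name), default=None)
    if best is not None and lcs_score(text, best) >= threshold:
        return best, "ngram+lcs"
    return None, None
```

Running the cheap exact steps first means the quadratic-time LCS comparison is only applied to records that remain unmatched, and the bigram shortlist keeps the number of LCS comparisons per record small.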