NGrams are character-based token strings. Source and reference strings are transformed to include only characters from one of the following numbered sets:
Geocoding Inventor Locations (view source)
Revision as of 20:11, 24 August 2009
NGram and LCS Matching
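Longest Common Subsequence (LCS) is a widely used fuzzy matching technique. The [http://en.wikipedia.org/wiki/Longest_common_subsequence Longest Common Subsequence page on Wikipedia] provides detailed background. LCS matching of two datasets is extremely processor intensive, however: comparing a single pair of strings takes time proportional to the product of their lengths, and every source string may need to be compared against every reference string. (The general LCS problem for an arbitrary number of sequences is NP-hard; the pairwise case is not.) To avoid long run-times, LCS matching is done only on a small subset of strings that have met the NGram criteria detailed below.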
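As an illustration of the pairwise comparison, the following is a minimal sketch of the standard dynamic-programming computation of LCS length. The function name <code>lcs_length</code> is hypothetical and is not taken from the project's own code. The work grows with the product of the two string lengths, which is why it pays to run it only on pairs that have already been shortlisted.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b.

    Hypothetical helper for illustration only. Uses the textbook
    dynamic-programming recurrence and keeps just two rows of the
    table, so memory is O(len(b)) and time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)  # row for the previous character of a
    for ch_a in a:
        curr = [0]  # column 0: the empty prefix of b matches nothing
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the subsequence found for both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, <code>lcs_length("BOSTON", "BOTSON")</code> compares two spellings that differ by one transposition and returns the length of a common subsequence such as "BOSON".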
# ABCDEFGHIJKLMNOPQRSTUVWXYZ (i.e. the uppercase Latin alphabet)
# 0123456789 (i.e. the decimal digits)
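The transformation into one of the two character sets can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the names <code>keep_only</code> and <code>ngrams</code> are hypothetical, uppercasing the input before filtering is an assumption (the text lists only uppercase letters), and the n-gram length <code>n</code> is a free parameter because the text does not specify one.

```python
import string

LETTERS = set(string.ascii_uppercase)  # set 1: A-Z
DIGITS = set(string.digits)            # set 2: 0-9


def keep_only(text: str, allowed: set) -> str:
    """Drop every character of text that is not in the allowed set.

    Assumption: the input is uppercased first so that lowercase
    letters survive the filter against the uppercase alphabet.
    """
    return "".join(ch for ch in text.upper() if ch in allowed)


def ngrams(text: str, n: int = 2) -> list:
    """Return all overlapping substrings of length n (empty if text is shorter).

    Assumption: n is illustrative; the source does not state a length.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

With these helpers, <code>keep_only("New York, NY 10001", LETTERS)</code> keeps only the letters and <code>keep_only("New York, NY 10001", DIGITS)</code> keeps only the digits; either result can then be passed to <code>ngrams</code> to produce the token strings.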