Normalizing Surnames
Encodings
Surnames can be represented in many different encodings. For comparison purposes, it is convenient to have all surnames in a single standard encoding, such as the Latin alphabet.
The Latin alphabet offers the advantage of simplicity. There are only 26 letter characters, A to Z, provided one ignores case (upper or lower). There are no ligatures or diacritics. Because the number of possible n-grams is (number of symbols)^n, an encoding with a large number of symbols results in a much higher number of dimensions for the data, even for a small value of n. With 26 letters there are 26^2 = 676 possible bigrams and 26^3 = 17,576 possible trigrams. Furthermore, most datasets used for practical applications are encoded in the Latin alphabet, and a classification system that allows for non-Latin characters would therefore introduce redundancy.
The first stage of normalization is therefore to check that the encoding is in the Latin alphabet, with a minimal number of other symbols (such as the period, comma, and hyphen) that may provide meta information for further normalization, and to force it into the Latin alphabet if it is not. Maintaining information about the simplification or removal of ligatures and diacritics (in particular) may be useful, and is accomplished through the creation of additional binary variables.
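The following is a minimal sketch of this first stage, not the actual normalization script. It assumes Python's standard unicodedata module, folds everything to upper case, and uses a small hypothetical ligature table (LIGATURES) and set of retained symbols (KEEP); the apostrophe and space in KEEP are assumptions beyond the period, comma, and hyphen mentioned above.

```python
import unicodedata

# Assumed non-letter symbols retained as meta information.
KEEP = set(" .,-'")

# Hypothetical, deliberately short table of ligatures that Unicode
# decomposition does not expand on its own.
LIGATURES = {"Æ": "AE", "æ": "ae", "Œ": "OE", "œ": "oe", "ß": "ss"}


def to_latin(surname):
    """Return (normalized surname, had_diacritic, had_ligature)."""
    had_ligature = any(ch in LIGATURES for ch in surname)
    expanded = "".join(LIGATURES.get(ch, ch) for ch in surname)
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", expanded)
    had_diacritic = any(unicodedata.combining(ch) for ch in decomposed)
    # Keep only A-Z letters and the retained symbols; drop everything else.
    kept = [ch for ch in decomposed
            if (ch.isascii() and ch.isalpha()) or ch in KEEP]
    return "".join(kept).upper(), had_diacritic, had_ligature


print(to_latin("Müller"))    # ('MULLER', True, False)
print(to_latin("Ætheling"))  # ('AETHELING', False, True)
```

The two boolean flags play the role of the binary variables described above. Characters from scripts with no Latin decomposition are simply dropped in this sketch; a fuller treatment would need transliteration.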
Tussenvoegsel
Tussenvoegsel are surname prefixes, such as the words Van and De. The term is Dutch, but it is used here generically for surname prefixes in any language. A custom-compiled list of Tussenvoegsel is used in the normalization process. Tussenvoegsel can either be removed (with the removal recorded in a binary variable) or concatenated with the surname.
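As an illustration of the two options, here is a small sketch. The list TUSSENVOEGSELS is a hypothetical, heavily abbreviated stand-in for the custom-compiled list, and the function name and concatenate parameter are assumptions made for this example.

```python
# Hypothetical, abbreviated prefix list; the real list is custom compiled.
TUSSENVOEGSELS = ("van der", "van den", "van de", "van",
                  "de", "den", "der", "ter", "ten", "te")


def split_tussenvoegsel(surname, concatenate=False):
    """Return (processed surname, had_prefix)."""
    words = surname.split()
    # Try multi-word prefixes first so "van der" wins over "van".
    for prefix in sorted(TUSSENVOEGSELS, key=lambda p: -len(p.split())):
        parts = prefix.split()
        n = len(parts)
        # Require at least one word after the prefix.
        if len(words) > n and [w.lower() for w in words[:n]] == parts:
            if concatenate:
                return "".join(words), True
            return " ".join(words[n:]), True
    return surname, False


print(split_tussenvoegsel("van der Berg"))                    # ('Berg', True)
print(split_tussenvoegsel("van der Berg", concatenate=True))  # ('vanderBerg', True)
```

A name consisting only of a prefix word (for example "Van") is returned unchanged, since stripping it would leave no surname.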
Double-barrelled Surnames
Double-barrelled surnames may be hyphenated and easy to detect, such as Smith-Jones, but they also come in many more difficult forms. Spanish naming customs, for example, call for two surnames: a paternal surname (which is dominant) and a maternal surname. They are ordered paternal-maternal and are often written without a hyphen, which makes discrimination problematic. However, for cultural identification purposes it seems as suitable to use the maternal (last) surname as to use the (strictly correct) paternal surname. While problems will persist (as in Zarragoza-Watkins), this is to some extent unavoidable.
Honorifics and Suffixes
Surname data often contains honorifics such as Mr, Mrs, Ms, and Dr, as well as suffixes such as Esq., Jr., Roman numerals (II, III, IV, V, etc.) and occasionally academic qualifications (PhD, MSc, etc.). These need to be removed or separated, and can be classified for gender, education, and other characteristics.
Military, political and class honorifics and suffixes also need treatment. These include Sir, M.P., The Hon., Lord, Lt, Cap., Major, Gen., and so forth.
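A sketch of the "use the last surname" rule might look like the following. The function name is hypothetical, and the sketch assumes that any prefix handling has already been applied, since otherwise a prefix would be counted as a separate barrel.

```python
import re


def last_barrel(surname):
    """Return (last component, was_double_barrelled)."""
    # Split on hyphens and whitespace; discard empty fragments.
    parts = [p for p in re.split(r"[-\s]+", surname.strip()) if p]
    if not parts:
        return "", False
    return parts[-1], len(parts) > 1


print(last_barrel("Smith-Jones"))        # ('Jones', True)
print(last_barrel("García Márquez"))     # ('Márquez', True)
print(last_barrel("Zarragoza-Watkins"))  # ('Watkins', True)
```

The last call shows the persistent problem: the rule keeps Watkins, which discards the component that carries the other cultural signal.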
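One possible way to separate these tokens is sketched below. The sets HONORIFICS and SUFFIXES are hypothetical and contain only the examples named above; the function name is likewise an assumption. The sketch discards commas, which a real script might want to keep as a name-order hint.

```python
# Hypothetical token sets built from the examples in the text,
# stored lower-case with periods removed.
HONORIFICS = {"mr", "mrs", "ms", "dr", "sir", "lord", "lt", "cap",
              "major", "gen", "the", "hon"}
SUFFIXES = {"esq", "jr", "ii", "iii", "iv", "v", "phd", "msc", "mp"}


def _key(token):
    """Normalize a token for lookup: drop periods, fold case."""
    return token.replace(".", "").lower()


def strip_titles(name):
    """Return (remaining name, honorifics found, suffixes found)."""
    tokens = name.replace(",", " ").split()
    honorifics, suffixes = [], []
    # Always leave at least one token, so a surname like "Major" survives.
    while len(tokens) > 1 and _key(tokens[0]) in HONORIFICS:
        honorifics.append(_key(tokens.pop(0)))
    while len(tokens) > 1 and _key(tokens[-1]) in SUFFIXES:
        suffixes.insert(0, _key(tokens.pop()))
    return " ".join(tokens), honorifics, suffixes


print(strip_titles("Dr. John Smith Jr."))
# ('John Smith', ['dr'], ['jr'])
print(strip_titles("The Hon. Jane Doe, M.P."))
# ('Jane Doe', ['the', 'hon'], ['mp'])
```

The returned lists can then be classified for gender, education, and so on. Note that a trailing "V" is ambiguous between a Roman numeral and an initial; this sketch treats it as a suffix.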
Initials and Middle Names
Short Names
It is difficult to classify names consisting of a single word as either first names or surnames, or as data errors. For this reason single-word names should probably be discarded. While there is an abundance of surnames composed of two or three letters, single-letter names are exceedingly rare. Because a single-letter surname could be interpreted as an initial in a different format (as in Smith J), it is possible to process single-letter names in some instances, but not as surnames. The analysis of names depends on the frequencies of letter combinations; thus a single-letter surname is not meaningful for the analysis.
Name Orders and Formats
Some cultures and some datasets routinely reverse (or re-order) the order of names, the most common reversal being Surname, FirstName Initial. Such reversals may or may not be indicated by punctuation, and may be systematic across an entire dataset or idiosyncratic to groups or individuals within the dataset. To accommodate this, the normalization script must support idiosyncratic reversal options using indicator variables.
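A minimal sketch of such a reversal option follows. The function name and the reversed_order parameter are hypothetical: passing True or False forces the interpretation for a record or a whole dataset, while None falls back to inferring a reversal from the presence of a comma. When a reversal is forced without a comma, the sketch assumes the surname is the first word, which will be wrong for multi-word surnames.

```python
def reorder(name, reversed_order=None):
    """Return (given part, surname, was_reversed)."""
    if reversed_order is None:
        # Assumption: a comma signals "Surname, FirstName Initial".
        reversed_order = "," in name
    if reversed_order:
        # Split at the comma if present, otherwise at the first space.
        sep = "," if "," in name else " "
        surname, _, rest = name.strip().partition(sep)
        return rest.strip(), surname.strip(), True
    # Default order: the surname is taken to be the last word.
    given, _, surname = name.strip().rpartition(" ")
    return given.strip(), surname.strip(), False


print(reorder("Smith, John A"))                 # ('John A', 'Smith', True)
print(reorder("John A Smith"))                  # ('John A', 'Smith', False)
print(reorder("Smith J", reversed_order=True))  # ('J', 'Smith', True)
```

The third element of the result serves as the indicator variable recording whether the name was reversed.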