Changes

1,898 bytes added ,  19:27, 29 June 2009
no edit summary
==Double-barrelled Surnames==
Double-barrelled surnames may be hyphenated and easy to detect, such as Smith-Jones, but they also come in many more difficult forms. [http://en.wikipedia.org/wiki/Spanish_surname Spanish Naming Customs], for example, call for two surnames: a paternal surname (which is dominant) and a maternal surname. They are ordered paternal-maternal and are often written without a hyphen, making discrimination problematic. However, for cultural identification purposes it seems as suitable to use the maternal (last) surname as to use the (strictly correct) paternal surname. While problems will persist (as in Zarragoza-Watkins), this is to some extent unavoidable.
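As a rough illustration of the choice described above, the sketch below splits a surname on hyphens and whitespace and keeps either the last or the first part. The function name <code>split_surname</code> and the <code>prefer</code> parameter are hypothetical, and the approach is deliberately naive: multi-word particles (such as "de la") and cases like Zarragoza-Watkins are not handled.

```python
def split_surname(surname, prefer="last"):
    """Split a possibly double-barrelled surname and pick one part.

    prefer="last" keeps the final part (the maternal surname under the
    Spanish convention); prefer="first" keeps the first (paternal) part.
    Both the name and the parameter are assumptions for illustration.
    """
    # Treat hyphens as separators, then split on whitespace.
    parts = surname.replace("-", " ").split()
    if not parts:
        return [], None
    chosen = parts[-1] if prefer == "last" else parts[0]
    return parts, chosen


parts, chosen = split_surname("Smith-Jones")
print(parts, chosen)
```

Keeping the full list of parts alongside the chosen one lets later steps revisit the decision if the last-part rule turns out to be wrong for a given dataset.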
==Honorifics and Suffixes==
 
Surname data often contains honorifics such as Mr, Mrs, Ms, and Dr, as well as suffixes such as Esq., Jr., roman numerals (II, III, IV, V, etc.) and occasionally academic qualifications (PhD, MSc, etc.). These need to be removed or separated, and can be classified for gender, education, and other characteristics.
 
Military, political and class honorifics and suffixes also need treatment. These include Sir, M.P., The Hon., Lord, Lt, Cap., Major, Gen., and so forth.
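One possible way to separate these tokens is sketched below. The word lists are illustrative and incomplete, the function name <code>strip_titles</code> is hypothetical, and multi-word honorifics such as "The Hon." are not handled. Note also that a trailing "V" is treated as a roman numeral here, which would be wrong for data in the "Smith V" (surname plus initial) format.

```python
# Illustrative, incomplete lists. Tokens are compared lower-cased
# with periods removed, so "M.P." matches "mp" and "Dr." matches "dr".
HONORIFICS = {"mr", "mrs", "ms", "dr", "sir", "lord", "lt", "cap", "major", "gen"}
SUFFIXES = {"esq", "jr", "ii", "iii", "iv", "v", "phd", "msc", "mp"}


def _key(token):
    """Normalize a token for lookup in the lists above."""
    return token.replace(".", "").lower()


def strip_titles(name):
    """Return (honorifics, remaining_tokens, suffixes) for a raw name."""
    tokens = name.replace(",", " ").split()
    honorifics, suffixes = [], []
    # Peel honorifics off the front of the name.
    while tokens and _key(tokens[0]) in HONORIFICS:
        honorifics.append(tokens.pop(0))
    # Peel suffixes off the end, preserving their original order.
    while tokens and _key(tokens[-1]) in SUFFIXES:
        suffixes.insert(0, tokens.pop())
    return honorifics, tokens, suffixes


print(strip_titles("Dr. John Smith Jr."))
```

The separated honorifics and suffixes are returned rather than thrown away, so they remain available for classification by gender, education, and so on.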
 
==Initials and Middle Names==
 
 
==Short Names==
 
It is difficult to classify names consisting of a single word as either first names or surnames, or as data errors. For this reason single-word names should probably be discarded. While there is an abundance of surnames composed of two or three letters, single-letter names are exceedingly rare. Because a single-letter surname could be interpreted as an initial in a different format (as in Smith J), it is possible to process single-letter names in some instances, but not as surnames. The analysis of names depends on the frequencies of letter combinations; thus a single-letter surname is not meaningful for the analysis.
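The rules in this section might be expressed as follows. This is a minimal sketch: the function name <code>label_tokens</code> and the labels "initial" and "name" are assumptions made for illustration.

```python
def label_tokens(name):
    """Label each token of a name, or return None if it should be discarded.

    Single-word (or empty) names are discarded. Single-letter tokens are
    labelled as initials and never as surnames.
    """
    tokens = name.replace(",", " ").split()
    if len(tokens) < 2:
        return None  # cannot tell a first name from a surname or an error
    labels = []
    for token in tokens:
        core = token.strip(".")  # "J." and "J" are both one letter
        labels.append((token, "initial" if len(core) == 1 else "name"))
    return labels


print(label_tokens("Smith J"))
```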
 
==Name Orders and Formats==
 
Some cultures and some datasets routinely reverse (or re-order) names, the most common reversal being Surname, FirstName Initial. Such reversals may or may not be indicated by punctuation, and may be systematic across an entire dataset or idiosyncratic to groups or individuals within the dataset. To handle this, the normalization script must support idiosyncratic reversal options using indicator variables.
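A per-record indicator variable could be applied along the following lines. This is only a sketch under stated assumptions: the function name <code>normalize_order</code> and the <code>surname_first</code> flag are hypothetical, a comma is taken to signal a reversal unless the flag says otherwise, and without a comma the surname is assumed to be a single token.

```python
def normalize_order(name, surname_first=None):
    """Return (given_part, surname) for a raw name string.

    surname_first is the indicator variable: True means the surname comes
    first, False means it comes last, and None means "infer", in which
    case a comma is read as marking a reversed name.
    """
    if "," in name:
        left, right = (part.strip() for part in name.split(",", 1))
        if surname_first is None or surname_first:
            return right, left  # "Smith, John A" -> ("John A", "Smith")
        return left, right
    tokens = name.split()
    if len(tokens) < 2:
        return "", name.strip()
    if surname_first:
        return " ".join(tokens[1:]), tokens[0]
    return " ".join(tokens[:-1]), tokens[-1]


print(normalize_order("Smith, John A"))
print(normalize_order("Smith John A", surname_first=True))
```

Passing the flag per record rather than per dataset covers both the systematic case (the same value for every row) and the idiosyncratic case (a value set for individual rows or groups).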
 
==Stop Words==
Anonymous user
