Changes

Normalizing Surnames (view source)

Revision as of 20:18, 29 June 2009

1,337 bytes added , 20:18, 29 June 2009

no edit summary

Surname data often contains honorifics such as Mr, Mrs, Ms, and Dr, as well as suffixes such as Esq., Jr., Roman numerals (II, III, IV, V, etc.) and occasionally academic qualifications (PhD, MSc, etc.). These need to be removed or separated, and can be classified for gender, education, and other characteristics.

Military, political and class honorifics and suffixes also need treatment. These include Sir, M.P., The Hon., Lord, Lt, Cap., Major, Gen., and so forth. Practically all of these honorifics and suffixes are sufficiently distinct from real names to be treated as stop words, provided context permits: in "Major John Major" the first "Major" can be removed, but removing "Major" from "John Major" would compromise the name-string. Coding these stop words for gender, education and other variables of interest is possible.
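The context rule above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name <code>strip_stop_words</code>, the lookup tables, the <code>AMBIGUOUS</code> set and the specific gender and education codes are all assumptions made for the example, and the lists cover only the honorifics and suffixes mentioned in the text.

```python
import re

# Hypothetical lookup tables built from the examples in the text (not exhaustive).
# Each stop word maps to the characteristics it is assumed to encode.
PREFIXES = {
    "mr": {"gender": "M"}, "mrs": {"gender": "F"}, "ms": {"gender": "F"},
    "dr": {"education": "doctorate"}, "sir": {"gender": "M"},
    "lord": {"gender": "M"}, "the": {}, "hon": {},
    "lt": {}, "cap": {}, "major": {}, "gen": {},
}
SUFFIXES = {
    "esq": {}, "jr": {}, "ii": {}, "iii": {}, "iv": {}, "v": {}, "mp": {},
    "phd": {"education": "doctorate"}, "msc": {"education": "masters"},
}
# Stop words that are also plausible real names need extra context.
AMBIGUOUS = {"major"}


def _key(token):
    """Lower-case a token and drop periods and commas, so 'M.P.' becomes 'mp'."""
    return re.sub(r"[.,]", "", token).lower()


def strip_stop_words(name):
    """Return (cleaned_name, codes) with leading and trailing stop words removed."""
    tokens = name.split()
    codes = {}

    def removable(token, table):
        key = _key(token)
        # An ambiguous word is removed only if two other tokens would remain,
        # so "John Major" keeps its surname; other stop words need one.
        floor = 2 if key in AMBIGUOUS else 1
        return key in table and len(tokens) > floor

    while tokens and removable(tokens[0], PREFIXES):
        codes.update(PREFIXES[_key(tokens.pop(0))])
    while tokens and removable(tokens[-1], SUFFIXES):
        codes.update(SUFFIXES[_key(tokens.pop())])
    return " ".join(tokens).rstrip(","), codes
```

With these assumptions, <code>strip_stop_words("Major John Major")</code> yields the name "John Major", while "John Major" passes through unchanged. The rule is deliberately conservative: a two-token string such as "Major Smith" is left alone, because token counts alone cannot distinguish a rank from a first name.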

==Initials and Middle Names==

Many name sources provide either middle initials or middle names, or sometimes both. In the case of initials very little information can be deduced (possibly more initials are indicative of higher social class or some such, but this is a blind guess). Middle names could be used in much the same fashion as first names, that is, to deduce gender and possibly an SES (Socio-Economic Status) type variable. However, for the most part this is superfluous information that can be ignored.

==Short Names==

==Name Orders and Formats==

Some cultures and some datasets routinely reverse (or re-order) names, the most common reversal being Surname, FirstName Initial. Such reversals may or may not be indicated by punctuation, and may be systematic across an entire dataset or idiosyncratic to groups or individuals within it. To handle this, the normalization script must support idiosyncratic reversal options using indicator variables.
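One way to combine a per-record indicator variable with punctuation-based detection is sketched below. The function name <code>to_natural_order</code> and the parameter <code>reversed_flag</code> are hypothetical, and the sketch assumes that an unpunctuated reversed name has a single-token surname.

```python
def to_natural_order(name, reversed_flag=None):
    """Return the name in 'First [Middle] Surname' order.

    reversed_flag is the (hypothetical) indicator variable:
      True  -> the record is known to be 'Surname First ...'
      False -> the record is known to be in natural order
      None  -> unknown; treat a comma as the sign of a reversal
    """
    if reversed_flag is None:
        reversed_flag = "," in name
    if not reversed_flag:
        # Only collapse runs of whitespace; leave the order alone.
        return " ".join(name.split())
    if "," in name:
        # Punctuated reversal: everything before the first comma is the surname.
        surname, _, rest = name.partition(",")
    else:
        # Unpunctuated reversal: assume the first token is the surname.
        surname, _, rest = name.strip().partition(" ")
    return " ".join((rest + " " + surname).split())
```

Setting the flag once for a whole dataset covers systematic reversals, while setting it per record covers idiosyncratic ones. For example, <code>to_natural_order("Smith John Q", reversed_flag=True)</code> gives "John Q Smith".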

I declare the following de facto standard formats (there does not appear to be an [http://www.iso.org ISO] standard):

1. US Census [http://www.census.gov/geo/www/standards/scdd/ADCStandard.pdf Address Data Content Standard]

2. Phonebook

{|
! Source !! Element 1 !! Element 2 !! Element 3 !! Element 4 !! Element 5 !! Element 6
|-
| US Census ADCS || Name Prefix || First Name || Middle Initial || Surname || Name Suffix
|-
| Phone Book || Last Name || First Name || Middle Initial ||
|}
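The two formats in the table can be represented as ordered field lists, so that positional elements can be labelled by source. This is a sketch under stated assumptions: the dictionary keys, the field names and the function <code>label_elements</code> are invented for the example, and the input is assumed to be already split into elements.

```python
# Field order per source, taken from the table above.
# The key and field names are hypothetical identifiers.
FORMATS = {
    "us_census_adcs": [
        "name_prefix", "first_name", "middle_initial", "surname", "name_suffix",
    ],
    "phone_book": ["last_name", "first_name", "middle_initial"],
}


def label_elements(elements, source):
    """Pair positional name elements with the field names for a source."""
    fields = FORMATS[source]
    if len(elements) > len(fields):
        raise ValueError(f"{source} defines only {len(fields)} elements")
    # zip stops at the shorter sequence, so missing trailing elements are omitted.
    return dict(zip(fields, elements))
```

For instance, <code>label_elements(["Smith", "John", "Q"], "phone_book")</code> labels "Smith" as <code>last_name</code>, "John" as <code>first_name</code> and "Q" as <code>middle_initial</code>.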

Anonymous user

128.32.74.87
