Classifying Names by Culture

From edegan.com
Revision as of 19:28, 9 July 2009 by imported>Ed

Individuals' names carry information about their ethnic ancestry and culture (broadly defined). The purpose of this project is to create a classifier that, given an individual's name, can deduce their culture with good accuracy.

Classification techniques use variation in the features of their subjects to predict classes. In the classic example (see R.A. Fisher 1936), a classifier for species of iris used the features "petal width", "petal length", and so forth. In our context, features are properties of names, such as the length of the name string and the frequency of occurrence of each n-gram.

An n-gram is a contiguous sequence of characters (a gram) of length "n". For example, using 2-grams, also called bigrams or digraphs, the surname "EGAN" has a frequency of one for each of the grams EG, GA, and AN, and a frequency of zero for every other gram from AA to ZZ.
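The "EGAN" example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name ngram_counts is a hypothetical helper chosen for illustration, not part of the project's code, and it assumes the name is already an uppercase string with no spaces.

```python
from collections import Counter

def ngram_counts(name, n=2):
    """Count the contiguous character n-grams in a name (hypothetical helper)."""
    # Slide a window of width n across the string, one character at a time.
    return Counter(name[i:i + n] for i in range(len(name) - n + 1))

counts = ngram_counts("EGAN")
print(counts["EG"], counts["GA"], counts["AN"])  # each gram occurs once
print(counts["ZZ"])  # a Counter reports 0 for grams that never occur
```

Using a Counter means the zero-frequency grams never need to be stored explicitly; they can be filled in when a fixed-length feature vector is built.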

The Process

The process has the following broad steps:

  1. Sources of Surname Data: Various sources of surname data, with their classifications already known, are needed for training and testing the classifier.
  2. Normalizing Surnames: Before we can extract features from names, they must be in a standardized format, such as just a surname encoded in the Latin character set with no spaces.
  3. Extracting Features from Surnames: Given a standardized input, we can extract a number of features from our names, such as the n-grams.
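
Steps 2 and 3 might be sketched in Python as follows. Everything here is an assumption made for illustration rather than the project's actual method: the names normalize_surname, extract_features, and BIGRAMS are hypothetical, the normalization rule (strip accents, uppercase, keep only A to Z) is just one plausible choice, and the feature vector is limited to the name length plus the 676 bigram counts from AA to ZZ. Step 1 is omitted because it depends on the data sources.

```python
import unicodedata
from collections import Counter
from itertools import product
from string import ascii_uppercase

# All 676 bigrams from AA to ZZ, in a fixed order (hypothetical feature layout).
BIGRAMS = ["".join(pair) for pair in product(ascii_uppercase, repeat=2)]

def normalize_surname(raw):
    """Assumed normalization: strip accents, uppercase, keep only A-Z."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    # Dropping everything outside A-Z removes the marks, spaces, and
    # punctuation. Letters with no decomposition (such as "ø") are lost too,
    # so a real pipeline would likely need an explicit transliteration table.
    return "".join(ch for ch in decomposed.upper() if ch in ascii_uppercase)

def extract_features(surname):
    """Return [length, count(AA), count(AB), ..., count(ZZ)] for a normalized surname."""
    counts = Counter(surname[i:i + 2] for i in range(len(surname) - 1))
    return [len(surname)] + [counts[gram] for gram in BIGRAMS]

name = normalize_surname("O'Brien")
features = extract_features(name)
print(name, len(features))
```

Keeping the bigrams in a fixed order gives every surname a vector of the same length, which is what most classifiers expect as input.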