Classifying Names by Culture

Individuals' names contain information about their ethnic ancestry and culture (broadly defined). The purpose of this project is to create a classifier that, given an individual's name, can deduce their culture with good accuracy.

Classification techniques use variation in the features of their subjects to predict classes. In the classic example (see R.A. Fisher 1936), a classifier for types of iris used features such as "petal width" and "petal length". In our context, features are properties of names, such as the length of the name string and the frequency of occurrence of n-grams.

An n-gram is a sequence of "n" consecutive characters (a gram). For example, using 2-grams (also called bigrams or digraphs), the surname "EGAN" has a frequency of one for each of the grams EG, GA, and AN, and a frequency of zero for all other grams from AA to ZZ.
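
As a minimal sketch of the bigram counts described above, the following Python function slides a window of length n along a name and tallies each gram. The function name `ngram_counts` is illustrative and not part of the project. Grams that never occur are simply absent from the tally, which is equivalent to a frequency of zero.

```python
from collections import Counter

def ngram_counts(name, n=2):
    """Count the overlapping character n-grams in a name (hypothetical helper)."""
    # Slide a window of width n across the string; names shorter than n yield no grams.
    return Counter(name[i:i + n] for i in range(len(name) - n + 1))

counts = ngram_counts("EGAN")
# A Counter returns 0 for any gram that was never seen, such as "ZZ".
print(counts["EG"], counts["GA"], counts["AN"], counts["ZZ"])  # prints: 1 1 1 0
```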

The Process

The process follows these broad steps:

Sources of Surname Data: Various sources of surname data, with their classifications already known, are needed for training and testing the classifier.
Normalizing Surnames: Before we can extract features from names, they must be in a standardized format, such as a surname alone, encoded in the Latin character set with no spaces.
Extracting Features from Surnames: Given a standardized input, we can extract a number of features from our names, such as the n-grams.
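
The normalization and feature-extraction steps above might be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the names `normalize_surname` and `extract_features` are hypothetical, and the choice to uppercase the name and keep only the letters A to Z is one possible reading of "standardized format".

```python
import unicodedata
from collections import Counter

def normalize_surname(raw):
    """Reduce a raw surname to uppercase A-Z with no spaces (assumed format)."""
    # Decompose accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    # Keep only basic Latin letters; marks, spaces, hyphens and apostrophes are dropped.
    return "".join(ch for ch in decomposed.upper() if "A" <= ch <= "Z")

def extract_features(surname, n=2):
    """Build a feature dictionary: n-gram counts plus the name length."""
    name = normalize_surname(surname)
    features = Counter(name[i:i + n] for i in range(len(name) - n + 1))
    # The lowercase key cannot collide with the uppercase n-gram keys.
    features["length"] = len(name)
    return dict(features)

print(normalize_surname("O'Brien"))
print(extract_features("Egan"))
```

One limitation of this sketch is that letters with no Unicode decomposition, such as "Ø", are dropped rather than transliterated, so a real normalizer would likely need an explicit mapping for them.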
