Difference between revisions of "Culture Based Classifications"

Latest revision as of 22:08, 16 July 2009

There are many possible 'culture' based classes that one might want to use. At the finest-grained level (and in the most ambitious case), one might want to predict actual countries of origin (standardized country names are provided by the United Nations). At any coarser-grained level, countries must be aggregated into meaningful units.

Three commonly used (and meaningful) aggregations are:

  1. UN GeoRegions
  2. Ethnologue Classification - A language development based aggregation
  3. US Census Ethnic Origin

Custom Classifications

For the purpose of economic analysis it is often not necessary to know which country an individual's name has its roots in; broad areas of origin are usually sufficient. Areas that are of particular interest (excluding the U.S.) might include:

  1. Western European (including the colonies, particularly Australia, New Zealand, and Canada)
  2. Scandinavian
  3. Slavic
  4. Hispanic/Latin
  5. Chinese (including Chinese 'dependencies')
  6. Arab (Muslim)
  7. Israel (Jewish)
  8. African
  9. India/Pakistan
  10. Korean
  11. Japanese
  12. Other Asia-Pacific (Philippines, Vietnam...)

One might consider this list not "politically correct", but it makes no statement about the intrinsic worth of individuals from these areas, nor does it stereotype the areas. Economists are interested in areas that exhibit greater 'cultural' homogeneity within an area than across areas, and that are identifiable using names data.

Western European should be further decomposed into English, German, French, Italian, and perhaps Portuguese. It is unlikely to be possible to distinguish Spanish from Hispanic. Furthermore, Israel (and more particularly Jewish) is likely to be very hard to distinguish in most data sources. Israel has had a large amount of immigration, particularly from Russia and Europe, and is perhaps the definitive land of immigrants.
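
The aggregation of countries into broad areas of origin described in this section can be sketched in Python as follows. This is only an illustration: the mapping is partial and hypothetical, it covers only countries named explicitly in the text above, and the country spellings have not been checked against the UN standardized names. The names `COUNTRY_TO_AREA` and `aggregate_by_area` are invented for this example.

```python
from collections import Counter

# Hypothetical, partial mapping from country to broad area of origin.
# Only countries that the text names explicitly are included.
COUNTRY_TO_AREA = {
    "Australia": "Western European",
    "New Zealand": "Western European",
    "Canada": "Western European",
    "India": "India/Pakistan",
    "Pakistan": "India/Pakistan",
    "Israel": "Israel (Jewish)",
    "Philippines": "Other Asia-Pacific",
    "Vietnam": "Other Asia-Pacific",
}


def aggregate_by_area(countries, mapping=COUNTRY_TO_AREA, unknown="Unclassified"):
    """Count how many country labels fall into each broad area.

    Countries missing from the mapping are counted under `unknown`.
    """
    return Counter(mapping.get(country, unknown) for country in countries)


counts = aggregate_by_area(["Canada", "India", "Pakistan", "Atlantis"])
```

Countries outside the mapping are counted separately rather than dropped, so the share of unclassifiable observations remains visible.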

Classification by Egan (2009)

From an inspection (by hand) of names around the world, it appears that the following classes may be identifiable:

  • European
    • English
    • French
    • German
    • Greek
    • Italian
    • Portuguese
    • Romanian
    • Scandinavian
    • Spanish
    • Turk
  • Slavic
    • East-West Slavic
    • South Slavic
    • North Slavic
  • African
    • East African
    • Other African
  • Asian
    • Arab
    • Chinese
    • Indian
    • Indonesia/Philippines
    • Japanese
    • Korean
    • Pakistani
    • Polynesian
    • Thai
    • Vietnamese

Note that some areas, particularly North-Eastern African nations, are classified as Arab. Romanian and Pakistani may be difficult to identify in some datasets but do appear to be recognisably unique in sufficient volume. The various Slavic definitions (see http://en.wikipedia.org/wiki/Slavic_peoples) do not tie precisely to their accepted geographical and linguistic definitions but are close enough to be used.

For many countries the source file was not sufficiently detailed for them to be included in the classification. This is not necessarily a problem; the remaining countries account for a very low percentage of the world's population, and for most applications we are only interested in classifications that a human could make in a real world context. The definition file for this classification system (Culture-EganClassification.txt, http://www.edegan.com/repository/Culture-EganClassification.txt) uses UN recognized country names and excludes unclassifiable countries.
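
As a rough sketch, the two-level class list above can be held in a Python dictionary and inverted so that a fine-grained label can be rolled up to its top-level class. The structure below is transcribed from the list on this page; it does not reflect the format of Culture-EganClassification.txt, which is not described here, and the names `EGAN_CLASSES`, `SUBCLASS_TO_CLASS` and `top_level_class` are invented for the example.

```python
# Two-level hierarchy transcribed from the class list above.
EGAN_CLASSES = {
    "European": [
        "English", "French", "German", "Greek", "Italian",
        "Portuguese", "Romanian", "Scandinavian", "Spanish", "Turk",
    ],
    "Slavic": ["East-West Slavic", "South Slavic", "North Slavic"],
    "African": ["East African", "Other African"],
    "Asian": [
        "Arab", "Chinese", "Indian", "Indonesia/Philippines", "Japanese",
        "Korean", "Pakistani", "Polynesian", "Thai", "Vietnamese",
    ],
}

# Inverted index: fine-grained class -> top-level class.
SUBCLASS_TO_CLASS = {
    sub: top for top, subs in EGAN_CLASSES.items() for sub in subs
}


def top_level_class(subclass):
    """Return the top-level class of a subclass, or None if it is not listed."""
    return SUBCLASS_TO_CLASS.get(subclass)
```

For instance, `top_level_class("Turk")` returns `"European"`, following the grouping in the list, and an unlisted label returns `None`.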

Other Classifications

One other potentially useful classification of names is based on differences in writing systems (see http://en.wikipedia.org/wiki/Writing_system). The following is a loose list of the major writing systems of the world:

  • Latin (alphabetic)
  • Cyrillic (alphabetic)
  • Hangul (featural alphabetic)
  • Other alphabets
  • Arabic (abjad)
  • Other abjads
  • Devanagari (abugida)
  • Other abugidas
  • Syllabaries
  • Chinese characters (logographic)
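
One possible way to assign a name to one of the writing systems listed above is to look at the Unicode character names of its letters. The Python sketch below uses the standard library's `unicodedata.name` and matches on name prefixes. It is a simplification under stated assumptions: the prefix table is hand-picked for this example, only Hiragana and Katakana stand in for syllabaries, and every other script (other alphabets, abjads and abugidas) is lumped into a single "Other" label.

```python
import unicodedata
from collections import Counter

# Assumed mapping from Unicode character-name prefixes to the list above.
SCRIPT_PREFIXES = [
    ("LATIN", "Latin (alphabetic)"),
    ("CYRILLIC", "Cyrillic (alphabetic)"),
    ("HANGUL", "Hangul (featural alphabetic)"),
    ("ARABIC", "Arabic (abjad)"),
    ("DEVANAGARI", "Devanagari (abugida)"),
    ("CJK", "Chinese characters (logographic)"),
    ("HIRAGANA", "Syllabaries"),
    ("KATAKANA", "Syllabaries"),
]


def char_script(ch):
    """Classify one character by the prefix of its Unicode name."""
    name = unicodedata.name(ch, "")
    for prefix, label in SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return label
    return "Other"


def dominant_script(text):
    """Return the most frequent script among the letters of `text`.

    Non-letters (digits, spaces, punctuation) are ignored; returns None
    if the text contains no letters at all.
    """
    counts = Counter(char_script(ch) for ch in text if ch.isalpha())
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Taking the most frequent script rather than requiring every letter to agree makes the sketch tolerant of names that mix in a few characters from another script.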