We are primarily interested in data sources that record both a person's surname and their country of birth, for training and testing purposes.
==Olympic Athletes==
A list of all actors and their birth countries was extracted from the [http://www.imdb.com/interfaces#plain IMDB biographies file] using a simple one-off script ([http://www.edegan.com/repository/IMDB-ExtractNamesAndCountries.pl IMDB-ExtractNamesAndCountries.pl]). This produced the input file [http://www.edegan.com/repository/IMDB-BioData.txt IMDB-BioData.txt]. As with the Olympics data, the country names were then corrected to the [[UN GeoRegion standard | UN GeoRegions]]; individuals born in non-recognized jurisdictions, such as on a cruise ship at sea, were excluded (see [http://www.edegan.com/repository/IMDB-BiosUNCountryCodes.txt IMDB-BiosUNCountryCodes.txt]).
A small percentage of actors have changed their names or use stage names. Care was taken to record actors' original birth names where available.
The NormalizeSurnames.pl script was run with the following options (and defaults):
==World Leaders==
A request to the CIA for permission to use a web-bot to scrape data from the HTML version of the [https://www.cia.gov/library/publications/the-world-factbook CIA World Factbook] received no response. World leader information was instead downloaded in PDF format from the [https://www.cia.gov/library/publications/world-leaders-1/pdf-version/pdf-version.html CIA World Leaders PDF site] for April 2008, and converted into a plain-text file ([http://www.edegan.com/repository/WorldLeaders-Raw.txt WorldLeaders-Raw.txt]).
The raw file was then processed by a one-off script to produce [http://www.edegan.com/repository/WorldLeaders-Extracted.txt WorldLeaders-Extracted.txt]. The country codes were corrected using a look-up table ([http://www.edegan.com/repository/WorldLeaders-UNCountryLookup.txt WorldLeaders-UNCountryLookup.txt]) to produce the resulting basic dataset ([http://www.edegan.com/repository/WorldLeaders-ExtractedUNCountry.txt WorldLeaders-ExtractedUNCountry.txt]). Users should note that a very small number of individuals had invalidly coded countries (mostly countries recognized by the CIA but not by the UN) and were excluded in this process.
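The look-up correction described above can be sketched roughly as follows. This is an illustration in Python, not the original one-off script, and it makes several assumptions: the function name <tt>apply_country_lookup</tt>, the in-memory record layout, and the sample look-up entries are all hypothetical, and the real look-up file's format is not reproduced here. Records whose country has no entry in the table are set aside rather than guessed at, mirroring the exclusion noted above.

```python
def apply_country_lookup(records, lookup):
    """Map (name, raw_country) pairs to (name, un_country) pairs.

    Records whose raw country is missing from the look-up table are
    returned separately so that the exclusions can be inspected.
    """
    kept, excluded = [], []
    for name, raw_country in records:
        # Normalize case and whitespace before looking up (an assumption;
        # the original look-up table may match on exact strings).
        key = raw_country.strip().lower()
        un_country = lookup.get(key)
        if un_country is None:
            excluded.append((name, raw_country))
        else:
            kept.append((name, un_country))
    return kept, excluded


# Hypothetical look-up entries and placeholder records, for illustration only.
lookup = {
    "burma": "Myanmar",
    "korea, south": "Republic of Korea",
}
records = [
    ("Leader A", "Burma"),
    ("Leader B", "Korea, South"),
    ("Leader C", "Unrecognized Territory"),
]

kept, excluded = apply_country_lookup(records, lookup)
```

Returning the excluded records alongside the kept ones makes it easy to check that only a very small number of individuals are dropped.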
The NormalizeSurnames.pl script was run with the following options (and defaults):
<tt>perl NormalizeSurnames.pl -i=WorldLeaders-ExtractedUNCountry.txt -ncol=0 -rcol=1</tt>
The resultant output ([http://www.edegan.com/repository/IMDB-BiosUNCountryCodes-Normalized.txt IMDB-BiosUNCountryCodes-Normalized.txt]) was used to create the n-gram variables.
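As a rough illustration of what n-gram variables built from normalized surnames might look like, the Python sketch below splits each surname into overlapping character n-grams and counts them. The exact construction used for this project is not described here, so the function names, the lower-casing, the <tt>_</tt> boundary padding, and the default of <tt>n=2</tt> are all assumptions.

```python
from collections import Counter


def char_ngrams(surname, n=2, pad="_"):
    """Return the overlapping character n-grams of a surname.

    The surname is lower-cased and wrapped in one pad character on each
    side so that word-initial and word-final n-grams are distinguishable
    (an assumed convention, not necessarily the one used originally).
    """
    padded = pad + surname.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_counts(surnames, n=2):
    """Count n-gram occurrences across a collection of surnames."""
    counts = Counter()
    for surname in surnames:
        counts.update(char_ngrams(surname, n))
    return counts


bigrams = char_ngrams("Egan")
counts = ngram_counts(["Lee", "Leigh"])
```

Here <tt>char_ngrams("Egan")</tt> yields <tt>_e</tt>, <tt>eg</tt>, <tt>ga</tt>, <tt>an</tt> and <tt>n_</tt>. Counts of this kind, computed per country, could then serve as input features.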