#Single letter names...
==Internet Movie Database (IMDB)==
A list of all actors and their birth countries was extracted from the [http://www.imdb.com/interfaces#plain IMDB biographies file].
==Olympic Athletes==
The offline pages were then parsed by another script ([http://www.edegan.com/repository/Olympics-ExtractOlypiads.pl Olympics-ExtractOlypiads.pl]) and the resulting output ([http://www.edegan.com/repository/Olympics-RawOutput.txt Olympics-RawOutput.txt]) was checked by hand. This output is the basic set of names with countries for the 2004 Olympic athletes used here. Because some individuals competed in multiple events, identical full name strings were collapsed to produce a single record with a count. It seems unlikely that several different athletes named John Joe Smith entered, so this reduction should rarely be erroneous. Users of these scripts should note that the Wikipedia source files have likely changed, and should check the results carefully.
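The collapsing step described above can be illustrated with a minimal Python sketch. This is not the original script: the function name <tt>collapse_names</tt>, the (name, country) input layout, and the sample records are assumptions made for illustration. Records are keyed on the full name string only, and the country from the first occurrence is kept.

```python
def collapse_names(records):
    """Collapse identical full name strings into one record with a count.

    records: iterable of (full_name, country) tuples (assumed layout).
    Returns a list of (full_name, country, count) tuples in order of
    first appearance. The country of the first occurrence is kept.
    """
    collapsed = {}
    for name, country in records:
        if name in collapsed:
            first_country, count = collapsed[name]
            collapsed[name] = (first_country, count + 1)
        else:
            collapsed[name] = (country, 1)
    return [(name, country, count)
            for name, (country, count) in collapsed.items()]
```

Because the key is the name string alone, two different athletes who share a full name would be merged into one record, which is the case the hand check is meant to catch.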
The country names were then corrected to the [[UN GeoRegion Codes | UN GeoRegions]] and coded using SQL scripts, and countries with idiosyncratic name reversals were marked to produce a normalization input file ([http://www.edegan.com/repository/Olympics-RawOutputWithUNReversal.txt Olympics-RawOutputWithUNReversal.txt]). The NormalizeSurnames.pl script was run with the following options (and defaults): <tt>perl NormalizeSurnames.pl -i=Olympics-RawOutputWithUNReversal.txt -ncol=1 -rcol=3</tt> The resultant output ([http://www.edegan.com/repository/Olympics-RawOutputWithUNReversalShort-Normalized.txt Olympics-RawOutputWithUNReversalShort-Normalized.txt]) can be used to create the n-gram variables.
==World Leaders==