OKC Project
Revision as of 15:21, 26 September 2011
- This page is protected so that only Ed, Toby and Misiek can read or edit it.
I will be posting reports and other materials on the OK Cupid project here.
Data Description from Misiek
Toby sent me a file with the following data description:
USERS
Total = 1,804,993
Female = 722,889 = 40%
US = 1,523,778
NonUS = 281,215 = 16%
US Male = 897,323 = 59%
US Female = 626,455 = 41%
US Male Singles = 827,702 = 92% of US Male
US Female Singles = 565,067 = 90% of US Female
US Male LongTermInterest = 530,450 = 59% of Male
US Female LongTermInterest = 333,947 = 53% of US Female
US Male ShortTermInterest = 433,904 = 48% of Male
US Female ShortTermInterest = 225,715 = 36% of US Female
US Male CasualSex = 96,714 = 11% of Male
US Female CasualSex = 19,331 = 3% of US Female
US Male Gays = 77,902 = 9% of US Male
US Male Bi = 19,295 = 2% of US Male
US Male Straight = 800,126 = 89% of US Male
US Female Gays = 46,118 = 7% of US Female
US Female Bi = 59,585 = 10% of US Female
US Female Straight = 520,752 = 83% of US Female
US Male NoRace = 260,889 = 31% of US Male
US Male White = 517,636 = 58% of US Male
US Male Black = 38,787 = 4% of US Male
US Male Asian = 34,158 = 4% of US Male
US Male Latino = 57,172 = 6% of US Male
US Female NoRace = 185,725 = 30% of US Female
US Female White = 358,119 = 57% of US Female
US Female Black = 32,195 = 5% of US Female
US Female Asian = 22,979 = 4% of US Female
US Female Latino = 37,274 = 5% of US Female
Data Description
This section provides summary stats on gender/orientation/race/location/age.
Gender
+--------+---------+------------+
| female | Count   | Percentage |
+--------+---------+------------+
|      0 | 1082104 |     0.5995 |
|      1 |  722889 |     0.4005 |
+--------+---------+------------+
Orientation
0=Straight, 1=Gay, 2=Bi
+-------------+---------+------------+
| orientation | Count   | Percentage |
+-------------+---------+------------+
|           0 | 1570050 |     0.8698 |
|           1 |  139697 |     0.0774 |
|           2 |   95246 |     0.0528 |
+-------------+---------+------------+
Ethnicity
The data was preprocessed to assign individuals who reported more than one ethnicity to the 'mixed' category. The variable 'ethnicity' is an integer categorization for fast sorting, and the labels are provided in the variable 'ethnicitylabels'.
+------------------+--------+------------+
| ethnicitylabels  | Count  | Percentage |
+------------------+--------+------------+
| mixed            | 105456 |     0.0584 |
| white            | 917178 |     0.5081 |
| black            |  53884 |     0.0299 |
| hispanic_latin   |  61034 |     0.0338 |
| asian            |  54645 |     0.0303 |
| indian           |  11658 |     0.0065 |
| middle_eastern   |   5916 |     0.0033 |
| native_american  |   4056 |     0.0022 |
| pacific_islander |   3238 |     0.0018 |
| other            |  24761 |     0.0137 |
| none             | 563167 |     0.3120 |
+------------------+--------+------------+
The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:
+-----------------+----------+
| num_ethnicities | Count    |
+-----------------+----------+
|               1 |  1699537 |
|               2 |    83676 |
|               3 |    14781 |
|               4 |     3348 |
|               5 |      976 |
|               6 |      330 |
|               7 |      243 |
|               8 |      409 |
|               9 |     1693 |
+-----------------+----------+
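The 'mixed' assignment could be sketched roughly as follows. This is only an illustration, not the preprocessing code actually used: the function names are hypothetical, the input is assumed to be a list of reported ethnicity labels per user, and the integer codes are assumed to follow the row order of the table above.

```python
# Labels in the order they appear in the summary table. Using the list
# position as the integer code is an assumption made for illustration.
ETHNICITY_LABELS = [
    "mixed", "white", "black", "hispanic_latin", "asian", "indian",
    "middle_eastern", "native_american", "pacific_islander", "other", "none",
]


def ethnicity_label(reported):
    """Collapse a user's reported ethnicities into a single label.

    More than one distinct value becomes 'mixed'; an empty report is
    treated as 'none' (an assumption about how missing data is stored).
    """
    distinct = set(reported)
    if len(distinct) > 1:
        return "mixed"
    if len(distinct) == 1:
        return next(iter(distinct))
    return "none"


def ethnicity_code(label):
    """Return the (assumed) integer code for a label."""
    return ETHNICITY_LABELS.index(label)
```

For example, ethnicity_label(["white", "asian"]) gives 'mixed', which is consistent with the 'mixed' count (105456) equalling the sum of the rows with num_ethnicities of 2 or more.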
Location
Zip3 was provided in the data, but I was cautioned that the third digit is not meaningful (i.e. it was added randomly for obfuscation). Zip2 and Zip1 (the latter is reported below) are coded as variables.
See:
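- The Wikipedia Zip Code page for a map of Zip codes (http://en.wikipedia.org/wiki/ZIP_code#Primary_State_Prefixes)
- List of Zip Code Prefixes (http://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes)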
+------+--------+------------+
| zip1 | Count  | Percentage |
+------+--------+------------+
|    0 | 178253 |     0.0988 |
|    1 | 229719 |     0.1273 |
|    2 | 136538 |     0.0756 |
|    3 | 148748 |     0.0824 |
|    4 | 126115 |     0.0699 |
|    5 |  66067 |     0.0366 |
|    6 | 105097 |     0.0582 |
|    7 | 126851 |     0.0703 |
|    8 |  84204 |     0.0467 |
|    9 | 322186 |     0.1785 |
|   99 | 281215 |     0.1558 |
+------+--------+------------+
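A minimal sketch of how Zip1 and Zip2 might be derived from Zip3 is given below. Several things here are assumptions rather than facts about the data: that Zip3 is a three-digit value, that non-US users have no Zip3, and that the code 99 marks non-US users (the 99 row count, 281215, matches the NonUS total in Misiek's description). The function name is hypothetical.

```python
NON_US_ZIP1 = 99  # assumed sentinel for users outside the US


def zip_prefixes(zip3):
    """Return (zip1, zip2) for a three-digit ZIP prefix.

    zip1 is the first digit as an integer; zip2 is the first two digits
    as a string so that a leading zero is preserved. The third digit is
    dropped because it is reported to be random.
    """
    if zip3 is None:
        # Assumption: a missing Zip3 means the user is outside the US.
        return NON_US_ZIP1, None
    digits = str(zip3).zfill(3)
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"expected a three-digit ZIP prefix, got {zip3!r}")
    return int(digits[0]), digits[:2]
```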
Age
The age ranges were chosen arbitrarily, though they have worked out reasonably well. Further refinement is possible. The variable 'age_rnum' is an integer categorization, and 'age_range' provides the labels. Ages were calculated from 'birth_year', using 2010 as the reference point.
+-----------+--------+------------+
| age_range | Count  | Percentage |
+-----------+--------+------------+
| 15-19     |  97659 |     0.0541 |
| 20-24     | 485028 |     0.2687 |
| 25-29     | 475376 |     0.2634 |
| 30-34     | 274281 |     0.1520 |
| 35-39     | 155925 |     0.0864 |
| 40-44     | 109121 |     0.0605 |
| 45-49     |  78618 |     0.0436 |
| 50-59     |  95810 |     0.0531 |
| >60       |  33175 |     0.0184 |
+-----------+--------+------------+
Birth year is self-reported, and is almost surely wrong in some cases. It ranges from 1900 to 1995.
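The age calculation and binning could look something like the sketch below. It assumes that 'age_rnum' is a zero-based index in table order and that age 60 falls in the '>60' range; neither is confirmed by the data description, and the function name is hypothetical.

```python
REFERENCE_YEAR = 2010

# (lowest age, highest age, label) for each closed range in the table.
AGE_BINS = [
    (15, 19, "15-19"),
    (20, 24, "20-24"),
    (25, 29, "25-29"),
    (30, 34, "30-34"),
    (35, 39, "35-39"),
    (40, 44, "40-44"),
    (45, 49, "45-49"),
    (50, 59, "50-59"),
]


def age_range(birth_year, reference_year=REFERENCE_YEAR):
    """Return (age_rnum, age_range) for a self-reported birth year."""
    age = reference_year - birth_year
    for rnum, (low, high, label) in enumerate(AGE_BINS):
        if low <= age <= high:
            return rnum, label
    if age >= 60:
        # Assumption: the open-ended '>60' range includes age 60 itself.
        return len(AGE_BINS), ">60"
    raise ValueError(f"age {age} is below the youngest range")
```

With birth years from 1900 to 1995, computed ages run from 15 to 110, so every user lands in one of the nine ranges.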
Account Age
The account age range variables, 'acc_age_rnum' for the integer categorization and 'acc_age_range' for the labels, were created arbitrarily. They could easily be refined, but were created to examine differences in messaging and viewing activity, and to provide a normalization base. The oldest account is 2,613 days old. The youngest is 0 days old.
+---------------+--------+------------+
| acc_age_range | Count  | Percentage |
+---------------+--------+------------+
| 0             |   1385 |     0.0008 |
| 1             |   7914 |     0.0044 |
| 2             |   7967 |     0.0044 |
| 3             |   9051 |     0.0050 |
| 4             |   9134 |     0.0051 |
| 5             |  10032 |     0.0056 |
| 6-10          |  35862 |     0.0199 |
| 11-15         |  39986 |     0.0222 |
| 16-30         | 136983 |     0.0759 |
| 31-60         | 233560 |     0.1294 |
| 61-100        | 237369 |     0.1315 |
| 101-365       | 535715 |     0.2968 |
| 366-3650      | 540035 |     0.2992 |
+---------------+--------+------------+
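As a rough sketch, the account age ranges above can be assigned with a sorted list of upper bounds and a binary search. The zero-based numbering for 'acc_age_rnum' is an assumption, and the function name is hypothetical.

```python
from bisect import bisect_left

# Inclusive upper bound (in days) of each range, in table order.
ACC_AGE_UPPER = [0, 1, 2, 3, 4, 5, 10, 15, 30, 60, 100, 365, 3650]
ACC_AGE_LABELS = [
    "0", "1", "2", "3", "4", "5", "6-10", "11-15", "16-30",
    "31-60", "61-100", "101-365", "366-3650",
]


def acc_age_range(days):
    """Return (acc_age_rnum, acc_age_range) for an account age in days."""
    if days < 0 or days > ACC_AGE_UPPER[-1]:
        raise ValueError(f"account age {days} is outside 0-3650 days")
    # bisect_left finds the first range whose upper bound is >= days.
    rnum = bisect_left(ACC_AGE_UPPER, days)
    return rnum, ACC_AGE_LABELS[rnum]
```

The oldest account in the data (2,613 days) falls in the last range, '366-3650'.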
Other variables
Quit times range from:
- 2008-07-19 (earliest)
- 2010-12-17 (latest)
Deleted or blacklisted accounts:
+------------------------+----------+
| deleted_or_blacklisted | Count    |
+------------------------+----------+
|                      0 |  1519319 |
|                      1 |   285674 |
+------------------------+----------+
Outstanding Data Issues
- The variables 'num_essays' and 'profile_length' need fixing