Difference between revisions of "OKC Project"
imported>Ed |
imported>Ed |
||
Line 9: | Line 9: | ||
*[[:Image:OKC2.sql.asci|OKC2.sql]] | *[[:Image:OKC2.sql.asci|OKC2.sql]] | ||
+ | These files are self-documenting, and the second file builds out the current base table set containing: '''users''', '''views''', '''messages''', and '''delta'''. | ||
==Setup Instructions for Usha== | ==Setup Instructions for Usha== |
Revision as of 14:59, 19 December 2011
- This page is protected so that only Ed, Usha, Toby and Misiek can read or edit it.
I will be posting reports and other materials on the OK Cupid project here.
Contents
Script Files
There are two SQL script files that were written to process the data:
These files are self-documenting, and the second file builds out the current base table set containing: users, views, messages, and delta.
Setup Instructions for Usha
- Go to vpn.hbs.edu and click Network Connect -> Start
- Using SSH client, connect to researchgrid.hbs.edu
- Connect the MYSql cluster: msyql -h rcsmysql.hbs.edu -u eegan -p --ssl-ca=rcsmysql
Useful commands:
mysql> SHOW Databases; +--------------------+ | Database | +--------------------+ | information_schema | | mpiskorski_tstuart | | okcupid | | okcupid_2 | +--------------------+ 4 rows in set (0.02 sec)
You'll have access to mpiskorski_tstuart
USE mpiskorski_tstuart;
(Misiek and Toby's working database)
SHOW TABLES;
There are two datasets: OKC1 and OKC2. We are now working with OKC2.
Both datasets have:
Users Messages Views
OKC2 also has ProfileDelta (changes to profiles). Describe tables with:
DESC TableName
Data Description from Misiek
Toby sent me a file with the following data description:
USERS Total = 1,804,993 Female = 722,889 = 40% US = 1,523,778 NonUS = 281,215 = 16% US Male = 897,323 = 59% US Female = 626,455 = 41% US Male Singles = 827,702 = 92% of US Male US Female Singles = 565,067 = 90% of US Female US Male LongTermInterest = 530,450 = 59% of Male US Female LongTermInterest = 333,947 = 53% of US Female US Male ShortTermInterest = 433,904 = 48% of Male US Female ShortTermInterest = 225,715 = 36% of US Female US Male CasualSex = 96,714 = 11% of Male US Female CasualSex = 19,331 = 3% of US Female US Male Gays = 77,902 = 9% of US Male US Male Bi = 19,295 = 2% of US Male US Male Straight = 800,126 = 89% of US Male US Female Gays = 46,118 = 7% of US Female US Female Bi = 59,585 = 10% of US Female US Female Straight = 520,752 = 83% of US Female US Male NoRace = 260,889 = 31% of US Male US Male White = 517,636 = 57,172% of US Male US Male Black = 38,787 = 4% of US Male US Male Asian = 34,158= 4% of US Male US Male Latino = 57,172 = 6% of US Male US Female NoRace = 185,725 = 30% of US Female US Female White = 358,119 = 57% of US Female US Female Black = 32,195 = 5% of US Female US Female Asian = 22,979 = 4% of US Female US Female Latino = 37,274 = 5% of US Female
Data Description
This section provides summary stats on gender/orientation/race/location/age.
Gender
+--------+---------+------------+---------+-----------+---------+-----------+ | female | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +--------+---------+------------+---------+-----------+---------+-----------+ | 0 | 1082104 | 0.5995 | 4.1039 | 339857 | 2.2340 | 185671 | | 1 | 722889 | 0.4005 | 3.4872 | 174731 | 2.8786 | 304468 | +--------+---------+------------+---------+-----------+---------+-----------+
Orientation
0=Straight, 1=Gay, 2=Bi +-------------+---------+------------+---------+-----------+---------+-----------+ | orientation | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +-------------+---------+------------+---------+-----------+---------+-----------+ | 0 | 1570050 | 0.8698 | 3.9727 | 460499 | 2.6087 | 420784 | | 1 | 139697 | 0.0774 | 2.9008 | 28802 | 2.1532 | 29212 | | 2 | 95246 | 0.0528 | 3.5746 | 25287 | 2.8651 | 40143 | +-------------+---------+------------+---------+-----------+---------+-----------+
Ethnicity
The data was preprocessed to assign individuals who reported more than one gender as being of 'mixed' race. The variable 'ethnicity' is a interger categorization for fast sorting, and the label are provided in the variable 'ethnicitylabels'.
+------------------+--------+------------+---------+-----------+---------+-----------+ | ethnicitylabels | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +------------------+--------+------------+---------+-----------+---------+-----------+ | mixed | 105456 | 0.0584 | 4.0894 | 45674 | 2.8836 | 43540 | | white | 917178 | 0.5081 | 3.7588 | 332784 | 2.6029 | 327063 | | black | 53884 | 0.0299 | 4.2456 | 18498 | 2.4855 | 13101 | | hispanic_latin | 61034 | 0.0338 | 4.2292 | 25278 | 2.9754 | 23348 | | asian | 54645 | 0.0303 | 3.7712 | 16612 | 2.6209 | 17222 | | indian | 11658 | 0.0065 | 4.6518 | 4996 | 2.2567 | 2796 | | middle_eastern | 5916 | 0.0033 | 6.0000 | 3282 | 2.9282 | 1918 | | native_american | 4056 | 0.0022 | 5.0511 | 1682 | 3.0157 | 1345 | | pacific_islander | 3238 | 0.0018 | 3.6478 | 1098 | 2.9398 | 1173 | | other | 24761 | 0.0137 | 4.3936 | 11454 | 2.9021 | 9664 | | none | 563167 | 0.3120 | 3.9094 | 53230 | 2.1889 | 48969 | +------------------+--------+------------+---------+-----------+---------+-----------+
The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:
+-----------------+----------+ | num_ethnicities | COUNT(*) | +-----------------+----------+ | 1 | 1699537 | | 2 | 83676 | | 3 | 14781 | | 4 | 3348 | | 5 | 976 | | 6 | 330 | | 7 | 243 | | 8 | 409 | | 9 | 1693 | +-----------------+----------+
Location
Zip3 was provided in the data, but I was cautioned that the third number was not meaningul (i.e. added randomly for obscurification). Zip2 and Zip1 (the later is reported below) are coded as variables.
See:
- The Wikipedia Zip Code page for a map of Zip codes
- List of Zip Code Prefixes
+------+--------+------------+---------+-----------+---------+-----------+ | zip1 | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +------+--------+------------+---------+-----------+---------+-----------+ | 0 | 178253 | 0.0988 | 3.8582 | 55219 | 2.5989 | 52293 | | 1 | 229719 | 0.1273 | 4.4691 | 84690 | 2.6089 | 73503 | | 2 | 136538 | 0.0756 | 3.7811 | 38575 | 2.5554 | 37794 | | 3 | 148748 | 0.0824 | 3.8744 | 41979 | 2.6477 | 40292 | | 4 | 126115 | 0.0699 | 3.6275 | 33953 | 2.5882 | 33064 | | 5 | 66067 | 0.0366 | 3.3721 | 15643 | 2.3133 | 15157 | | 6 | 105097 | 0.0582 | 3.6479 | 29650 | 2.5607 | 29468 | | 7 | 126851 | 0.0703 | 4.0350 | 38615 | 2.8075 | 37758 | | 8 | 84204 | 0.0467 | 3.7276 | 23875 | 2.6178 | 23128 | | 9 | 322186 | 0.1785 | 3.8279 | 103851 | 2.7357 | 103539 | | 99 | 281215 | 0.1558 | 3.6252 | 48538 | 2.2509 | 44143 | +------+--------+------------+---------+-----------+---------+-----------+
Age
The age ranges were created arbitrarily - though they have worked out reasonable well. Further refinement is possible. The variable 'age_rnum' is a integer categorization, and 'age_range' provides the variable. Ages were calculated using 'birth_year' using 2010 as the reference point.
+-----------+--------+------------+---------+-----------+---------+-----------+ | age_range | Count | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd | +-----------+--------+------------+---------+-----------+---------+-----------+ | 15-19 | 97659 | 0.0541 | 4.8070 | 30385 | 3.5139 | 35708 | | 20-24 | 485028 | 0.2687 | 3.9309 | 146882 | 2.8494 | 159322 | | 25-29 | 475376 | 0.2634 | 3.5992 | 131205 | 2.4778 | 128834 | | 30-34 | 274281 | 0.1520 | 4.4587 | 89759 | 2.4158 | 66236 | | 35-39 | 155925 | 0.0864 | 3.7430 | 42681 | 2.3843 | 36490 | | 40-44 | 109121 | 0.0605 | 3.5817 | 27776 | 2.4255 | 24939 | | 45-49 | 78618 | 0.0436 | 3.6452 | 19225 | 2.2995 | 16338 | | 50-59 | 95810 | 0.0531 | 3.3421 | 21055 | 2.1347 | 17831 | | >60 | 33175 | 0.0184 | 2.9332 | 5620 | 1.8930 | 4441 | +-----------+--------+------------+---------+-----------+---------+-----------+
Birth year is self-reported almost surely wrongly so in some case. It ranges from 1900 to 1995.
Account Age
The account age range variables, 'acc_age_rnum' for the interger categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activitiy - and to provide a normalization base. The oldest account is 2,613 days. The youngest is 0 days.
+---------------+--------+------------+---------+-----------+---------+-----------+ | acc_age_range | Count | Percentage | | | | | +---------------+--------+------------+---------+-----------+---------+-----------+ | 0 | 1385 | 0.0008 | | | | | | 1 | 7914 | 0.0044 | | | | | | 2 | 7967 | 0.0044 | | | | | | 3 | 9051 | 0.0050 | | | | | | 4 | 9134 | 0.0051 | | | | | | 5 | 10032 | 0.0056 | | | | | | 6-10 | 35862 | 0.0199 | | | | | | 11-15 | 39986 | 0.0222 | | | | | | 16-30 | 136983 | 0.0759 | | | | | | 31-60 | 233560 | 0.1294 | | | | | | 61-100 | 237369 | 0.1315 | 5.2286 | 142694 | 3.5119 | 144483 | | 101-365 | 535715 | 0.2968 | 3.5052 | 220052 | 2.4617 | 220381 | | 366-3650 | 540035 | 0.2992 | 3.5435 | 151842 | 2.1519 | 125275 | +---------------+--------+------------+---------+-----------+---------+-----------+
Other variables
Quit times range from:
- 2008-07-19 (youngest)
- 2010-12-17 (oldest)
Deleted or blacklisted accounts:
+------------------------+----------+ | deleted_or_blacklisted | Count | +------------------------+----------+ | 0 | 1519319 | | 1 | 285674 | +------------------------+----------+
Outstanding Data Issues
- The variables 'num_essays' and 'profile_length' need fixing