Difference between revisions of "OKC Project"

From edegan.com
Jump to navigation Jump to search
imported>Ed
imported>Ed
 
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
*This page is protected so that only Ed, Toby and Misiek can read or edit it.
+
*This page is protected so that only Ed, Usha, Toby and Misiek can read or edit it.
  
 
I will be posting reports and other materials on the OK Cupid project here.
 
I will be posting reports and other materials on the OK Cupid project here.
 +
 +
==Script Files==
 +
 +
There are two SQL script files that were written to process the data:
 +
*[[:Image:OKC1.sql.asci|OKC1.sql]]
 +
*[[:Image:OKC2.sql.asci|OKC2.sql]]
 +
 +
These files are self-documenting, and the second file builds out the current base table set containing: '''users''', '''views''', '''messages''', and '''delta'''.
 +
 +
==Setup Instructions for Usha==
 +
 +
See the [[Research Computing At HBS]] page for detailed information.
 +
 +
Quick quide:
 +
#Go to vpn.hbs.edu and click Network Connect -> Start
 +
#Using SSH client, connect to researchgrid.hbs.edu
 +
#Connect the MYSql cluster: msyql -h rcsmysql.hbs.edu -u eegan -p --ssl-ca=rcsmysql
 +
 +
Useful commands:
 +
 +
mysql> SHOW Databases;
 +
+--------------------+
 +
| Database          |
 +
+--------------------+
 +
| information_schema |
 +
| mpiskorski_tstuart |
 +
| okcupid            |
 +
| okcupid_2          |
 +
+--------------------+
 +
4 rows in set (0.02 sec)
 +
 +
You'll have access to mpiskorski_tstuart
 +
 +
USE mpiskorski_tstuart;
 +
 +
(Misiek and Toby's working database)
 +
 +
SHOW TABLES;
 +
 +
There are two datasets: OKC1 and OKC2. We are now working with OKC2.
 +
 +
Both datasets have:
 +
Users
 +
Messages
 +
Views
 +
 +
OKC2 also has ProfileDelta (changes to profiles).
 +
Describe tables with:
 +
 +
DESC TableName
  
 
==Data Description from Misiek==
 
==Data Description from Misiek==
Line 56: Line 106:
 
===Gender===
 
===Gender===
  
  +--------+---------+------------+
+
  +--------+---------+------------+---------+-----------+---------+-----------+
  | female | Count  | Percentage |
+
  | female | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
  +--------+---------+------------+
+
  +--------+---------+------------+---------+-----------+---------+-----------+
  |      0 | 1082104 |    0.5995 |
+
  |      0 | 1082104 |    0.5995 |  4.1039 |    339857 |  2.2340 |    185671 |
  |      1 |  722889 |    0.4005 |
+
  |      1 |  722889 |    0.4005 |  3.4872 |    174731 |  2.8786 |    304468 |
  +--------+---------+------------+
+
  +--------+---------+------------+---------+-----------+---------+-----------+
  
 
===Orientation===
 
===Orientation===
  
 
  0=Straight, 1=Gay, 2=Bi
 
  0=Straight, 1=Gay, 2=Bi
  +-------------+---------+------------+
+
  +-------------+---------+------------+---------+-----------+---------+-----------+
  | orientation | Count  | Percentage |
+
  | orientation | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
  +-------------+---------+------------+
+
  +-------------+---------+------------+---------+-----------+---------+-----------+
  |          0 | 1570050 |    0.8698 |
+
  |          0 | 1570050 |    0.8698 |  3.9727 |    460499 |  2.6087 |    420784 |
  |          1 |  139697 |    0.0774 |
+
  |          1 |  139697 |    0.0774 |  2.9008 |    28802 |  2.1532 |    29212 |
  |          2 |  95246 |    0.0528 |
+
  |          2 |  95246 |    0.0528 |  3.5746 |    25287 |  2.8651 |    40143 |
  +-------------+---------+------------+
+
  +-------------+---------+------------+---------+-----------+---------+-----------+
  
 
===Ethnicity===
 
===Ethnicity===
Line 78: Line 128:
 
The data was preprocessed to assign individuals who reported more than one gender as being of 'mixed' race. The variable 'ethnicity' is a interger categorization for fast sorting, and the label are provided in the variable 'ethnicitylabels'.
 
The data was preprocessed to assign individuals who reported more than one gender as being of 'mixed' race. The variable 'ethnicity' is a interger categorization for fast sorting, and the label are provided in the variable 'ethnicitylabels'.
  
  +------------------+--------+------------+
+
  +------------------+--------+------------+---------+-----------+---------+-----------+
  | ethnicitylabels  | Count  | Percentage |
+
  | ethnicitylabels  | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
  +------------------+--------+------------+
+
  +------------------+--------+------------+---------+-----------+---------+-----------+
  | mixed            | 105456 |    0.0584 |
+
  | mixed            | 105456 |    0.0584 |  4.0894 |    45674 |  2.8836 |    43540 |
  | white            | 917178 |    0.5081 |
+
  | white            | 917178 |    0.5081 |  3.7588 |    332784 |  2.6029 |    327063 |
  | black            |  53884 |    0.0299 |
+
  | black            |  53884 |    0.0299 |  4.2456 |    18498 |  2.4855 |    13101 |
  | hispanic_latin  |  61034 |    0.0338 |
+
  | hispanic_latin  |  61034 |    0.0338 |  4.2292 |    25278 |  2.9754 |    23348 |
  | asian            |  54645 |    0.0303 |
+
  | asian            |  54645 |    0.0303 |  3.7712 |    16612 |  2.6209 |    17222 |
  | indian          |  11658 |    0.0065 |
+
  | indian          |  11658 |    0.0065 |  4.6518 |      4996 |  2.2567 |      2796 |
  | middle_eastern  |  5916 |    0.0033 |
+
  | middle_eastern  |  5916 |    0.0033 |  6.0000 |      3282 |  2.9282 |      1918 |
  | native_american  |  4056 |    0.0022 |
+
  | native_american  |  4056 |    0.0022 |  5.0511 |      1682 |  3.0157 |      1345 |
  | pacific_islander |  3238 |    0.0018 |
+
  | pacific_islander |  3238 |    0.0018 |  3.6478 |      1098 |  2.9398 |      1173 |
  | other            |  24761 |    0.0137 |
+
  | other            |  24761 |    0.0137 |  4.3936 |    11454 |  2.9021 |      9664 |
  | none            | 563167 |    0.3120 |
+
  | none            | 563167 |    0.3120 |  3.9094 |    53230 |  2.1889 |    48969 |
  +------------------+--------+------------+
+
  +------------------+--------+------------+---------+-----------+---------+-----------+
  
 
The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:
 
The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:
Line 118: Line 168:
 
*[http://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes List of Zip Code Prefixes]
 
*[http://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes List of Zip Code Prefixes]
  
  +------+--------+------------+
+
  +------+--------+------------+---------+-----------+---------+-----------+
  | zip1 | Count  | Percentage |
+
  | zip1 | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
  +------+--------+------------+
+
  +------+--------+------------+---------+-----------+---------+-----------+
  | 0    | 178253 |    0.0988 |
+
  | 0    | 178253 |    0.0988 |  3.8582 |    55219 |  2.5989 |    52293 |
  | 1    | 229719 |    0.1273 |
+
  | 1    | 229719 |    0.1273 |  4.4691 |    84690 |  2.6089 |    73503 |
  | 2    | 136538 |    0.0756 |
+
  | 2    | 136538 |    0.0756 |  3.7811 |    38575 |  2.5554 |    37794 |
  | 3    | 148748 |    0.0824 |
+
  | 3    | 148748 |    0.0824 |  3.8744 |    41979 |  2.6477 |    40292 |
  | 4    | 126115 |    0.0699 |
+
  | 4    | 126115 |    0.0699 |  3.6275 |    33953 |  2.5882 |    33064 |
  | 5    |  66067 |    0.0366 |
+
  | 5    |  66067 |    0.0366 |  3.3721 |    15643 |  2.3133 |    15157 |
  | 6    | 105097 |    0.0582 |
+
  | 6    | 105097 |    0.0582 |  3.6479 |    29650 |  2.5607 |    29468 |
  | 7    | 126851 |    0.0703 |
+
  | 7    | 126851 |    0.0703 |  4.0350 |    38615 |  2.8075 |    37758 |
  | 8    |  84204 |    0.0467 |
+
  | 8    |  84204 |    0.0467 |  3.7276 |    23875 |  2.6178 |    23128 |
  | 9    | 322186 |    0.1785 |
+
  | 9    | 322186 |    0.1785 |  3.8279 |    103851 |  2.7357 |    103539 |
  | 99  | 281215 |    0.1558 |
+
  | 99  | 281215 |    0.1558 |  3.6252 |    48538 |  2.2509 |    44143 |
  +------+--------+------------+
+
  +------+--------+------------+---------+-----------+---------+-----------+
  
 
===Age===
 
===Age===
Line 138: Line 188:
 
The age ranges were created arbitrarily - though they have worked out reasonable well. Further refinement is possible. The variable 'age_rnum' is a integer categorization, and 'age_range' provides the variable. Ages were calculated using 'birth_year' using 2010 as the reference point.
 
The age ranges were created arbitrarily - though they have worked out reasonable well. Further refinement is possible. The variable 'age_rnum' is a integer categorization, and 'age_range' provides the variable. Ages were calculated using 'birth_year' using 2010 as the reference point.
  
  +-----------+--------+------------+
+
  +-----------+--------+------------+---------+-----------+---------+-----------+
  | age_range | Count  | Percentage |
+
  | age_range | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
  +-----------+--------+------------+
+
  +-----------+--------+------------+---------+-----------+---------+-----------+
  | 15-19    |  97659 |    0.0541 |
+
  | 15-19    |  97659 |    0.0541 |  4.8070 |    30385 |  3.5139 |    35708 |
  | 20-24    | 485028 |    0.2687 |
+
  | 20-24    | 485028 |    0.2687 |  3.9309 |    146882 |  2.8494 |    159322 |
  | 25-29    | 475376 |    0.2634 |
+
  | 25-29    | 475376 |    0.2634 |  3.5992 |    131205 |  2.4778 |    128834 |
  | 30-34    | 274281 |    0.1520 |
+
  | 30-34    | 274281 |    0.1520 |  4.4587 |    89759 |  2.4158 |    66236 |
  | 35-39    | 155925 |    0.0864 |
+
  | 35-39    | 155925 |    0.0864 |  3.7430 |    42681 |  2.3843 |    36490 |
  | 40-44    | 109121 |    0.0605 |
+
  | 40-44    | 109121 |    0.0605 |  3.5817 |    27776 |  2.4255 |    24939 |
  | 45-49    |  78618 |    0.0436 |
+
  | 45-49    |  78618 |    0.0436 |  3.6452 |    19225 |  2.2995 |    16338 |
  | 50-59    |  95810 |    0.0531 |
+
  | 50-59    |  95810 |    0.0531 |  3.3421 |    21055 |  2.1347 |    17831 |
  | >60      |  33175 |    0.0184 |
+
  | >60      |  33175 |    0.0184 |  2.9332 |      5620 |  1.8930 |      4441 |
  +-----------+--------+------------+
+
  +-----------+--------+------------+---------+-----------+---------+-----------+
  
 
Birth year is self-reported almost surely wrongly so in some case. It ranges from 1900 to 1995.
 
Birth year is self-reported almost surely wrongly so in some case. It ranges from 1900 to 1995.
Line 158: Line 208:
 
The account age range variables, 'acc_age_rnum' for the interger categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activitiy - and to provide a normalization base. The oldest account is 2,613 days. The youngest is 0 days.
 
The account age range variables, 'acc_age_rnum' for the interger categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activitiy - and to provide a normalization base. The oldest account is 2,613 days. The youngest is 0 days.
  
  +---------------+--------+------------+
+
  +---------------+--------+------------+---------+-----------+---------+-----------+
  | acc_age_range | Count  | Percentage |
+
  | acc_age_range | Count  | Percentage |        |          |        |          |
  +---------------+--------+------------+
+
  +---------------+--------+------------+---------+-----------+---------+-----------+
  | 0            |  1385 |    0.0008 |
+
  | 0            |  1385 |    0.0008 |        |          |        |          |
  | 1            |  7914 |    0.0044 |
+
  | 1            |  7914 |    0.0044 |        |          |        |          |
  | 2            |  7967 |    0.0044 |
+
  | 2            |  7967 |    0.0044 |        |          |        |          |
  | 3            |  9051 |    0.0050 |
+
  | 3            |  9051 |    0.0050 |        |          |        |          |
  | 4            |  9134 |    0.0051 |
+
  | 4            |  9134 |    0.0051 |        |          |        |          |
  | 5            |  10032 |    0.0056 |
+
  | 5            |  10032 |    0.0056 |        |          |        |          |
  | 6-10          |  35862 |    0.0199 |
+
  | 6-10          |  35862 |    0.0199 |        |          |        |          |
  | 11-15        |  39986 |    0.0222 |
+
  | 11-15        |  39986 |    0.0222 |        |          |        |          |
  | 16-30        | 136983 |    0.0759 |
+
  | 16-30        | 136983 |    0.0759 |        |          |        |          |
  | 31-60        | 233560 |    0.1294 |
+
  | 31-60        | 233560 |    0.1294 |        |          |        |          |
  | 61-100        | 237369 |    0.1315 |
+
  | 61-100        | 237369 |    0.1315 |  5.2286 |    142694 |  3.5119 |    144483 |
  | 101-365      | 535715 |    0.2968 |
+
  | 101-365      | 535715 |    0.2968 |  3.5052 |    220052 |  2.4617 |    220381 |
  | 366-3650      | 540035 |    0.2992 |
+
  | 366-3650      | 540035 |    0.2992 |  3.5435 |    151842 |  2.1519 |    125275 |
  +---------------+--------+------------+
+
  +---------------+--------+------------+---------+-----------+---------+-----------+
  
 
===Other variables===
 
===Other variables===

Latest revision as of 15:00, 19 December 2011

  • This page is protected so that only Ed, Usha, Toby and Misiek can read or edit it.

I will be posting reports and other materials on the OK Cupid project here.

Script Files

There are two SQL script files that were written to process the data:

These files are self-documenting, and the second file builds out the current base table set containing: users, views, messages, and delta.

Setup Instructions for Usha

See the Research Computing At HBS page for detailed information.

Quick quide:

  1. Go to vpn.hbs.edu and click Network Connect -> Start
  2. Using SSH client, connect to researchgrid.hbs.edu
  3. Connect the MYSql cluster: msyql -h rcsmysql.hbs.edu -u eegan -p --ssl-ca=rcsmysql

Useful commands:

mysql> SHOW Databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mpiskorski_tstuart |
| okcupid            |
| okcupid_2          |
+--------------------+
4 rows in set (0.02 sec)

You'll have access to mpiskorski_tstuart

USE mpiskorski_tstuart;

(Misiek and Toby's working database)

SHOW TABLES;

There are two datasets: OKC1 and OKC2. We are now working with OKC2.

Both datasets have:

Users
Messages
Views

OKC2 also has ProfileDelta (changes to profiles). Describe tables with:

DESC TableName

Data Description from Misiek

Toby sent me a file with the following data description:

USERS

Total = 1,804,993
Female = 722,889 = 40%

US = 1,523,778
NonUS = 281,215 = 16%

US Male = 897,323 = 59%
US Female = 626,455 = 41%

US Male Singles = 827,702 = 92% of US Male
US Female Singles =  565,067 = 90% of US Female 

US Male LongTermInterest = 530,450 = 59% of Male
US Female LongTermInterest = 333,947 = 53% of US Female

US Male ShortTermInterest =  433,904 = 48% of Male
US Female ShortTermInterest = 225,715 = 36% of US Female

US Male CasualSex = 96,714 = 11% of Male
US Female CasualSex = 19,331 = 3% of US Female

US Male Gays = 77,902 = 9% of US Male
US Male Bi = 19,295 = 2% of US Male
US Male Straight = 800,126 = 89% of US Male

US Female Gays = 46,118 = 7% of US Female
US Female Bi = 59,585 = 10% of US Female
US Female Straight = 520,752 = 83% of US Female

US Male NoRace = 260,889 = 31% of US Male
US Male White = 517,636 = 57,172% of US Male
US Male Black = 38,787 = 4% of US Male
US Male Asian = 34,158= 4% of US Male
US Male Latino = 57,172 = 6% of US Male

US Female NoRace = 185,725 = 30% of US Female
US Female White = 358,119 = 57% of US Female
US Female Black =  32,195 = 5% of US Female
US Female Asian = 22,979 = 4% of US Female
US Female Latino = 37,274 = 5% of US Female

Data Description

This section provides summary stats on gender/orientation/race/location/age.

Gender

+--------+---------+------------+---------+-----------+---------+-----------+
| female | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+--------+---------+------------+---------+-----------+---------+-----------+
|      0 | 1082104 |     0.5995 |  4.1039 |    339857 |  2.2340 |    185671 |
|      1 |  722889 |     0.4005 |  3.4872 |    174731 |  2.8786 |    304468 |
+--------+---------+------------+---------+-----------+---------+-----------+

Orientation

0=Straight, 1=Gay, 2=Bi
+-------------+---------+------------+---------+-----------+---------+-----------+
| orientation | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+-------------+---------+------------+---------+-----------+---------+-----------+
|           0 | 1570050 |     0.8698 |  3.9727 |    460499 |  2.6087 |    420784 |
|           1 |  139697 |     0.0774 |  2.9008 |     28802 |  2.1532 |     29212 |
|           2 |   95246 |     0.0528 |  3.5746 |     25287 |  2.8651 |     40143 |
+-------------+---------+------------+---------+-----------+---------+-----------+

Ethnicity

The data was preprocessed to assign individuals who reported more than one gender as being of 'mixed' race. The variable 'ethnicity' is a interger categorization for fast sorting, and the label are provided in the variable 'ethnicitylabels'.

+------------------+--------+------------+---------+-----------+---------+-----------+
| ethnicitylabels  | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+------------------+--------+------------+---------+-----------+---------+-----------+
| mixed            | 105456 |     0.0584 |  4.0894 |     45674 |  2.8836 |     43540 |
| white            | 917178 |     0.5081 |  3.7588 |    332784 |  2.6029 |    327063 |
| black            |  53884 |     0.0299 |  4.2456 |     18498 |  2.4855 |     13101 |
| hispanic_latin   |  61034 |     0.0338 |  4.2292 |     25278 |  2.9754 |     23348 |
| asian            |  54645 |     0.0303 |  3.7712 |     16612 |  2.6209 |     17222 |
| indian           |  11658 |     0.0065 |  4.6518 |      4996 |  2.2567 |      2796 |
| middle_eastern   |   5916 |     0.0033 |  6.0000 |      3282 |  2.9282 |      1918 |
| native_american  |   4056 |     0.0022 |  5.0511 |      1682 |  3.0157 |      1345 |
| pacific_islander |   3238 |     0.0018 |  3.6478 |      1098 |  2.9398 |      1173 |
| other            |  24761 |     0.0137 |  4.3936 |     11454 |  2.9021 |      9664 |
| none             | 563167 |     0.3120 |  3.9094 |     53230 |  2.1889 |     48969 |
+------------------+--------+------------+---------+-----------+---------+-----------+

The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:

+-----------------+----------+
| num_ethnicities | COUNT(*) |
+-----------------+----------+
|               1 |  1699537 |
|               2 |    83676 |
|               3 |    14781 |
|               4 |     3348 |
|               5 |      976 |
|               6 |      330 |
|               7 |      243 |
|               8 |      409 |
|               9 |     1693 |
+-----------------+----------+

Location

Zip3 was provided in the data, but I was cautioned that the third number was not meaningul (i.e. added randomly for obscurification). Zip2 and Zip1 (the later is reported below) are coded as variables.

See:

+------+--------+------------+---------+-----------+---------+-----------+
| zip1 | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+------+--------+------------+---------+-----------+---------+-----------+
| 0    | 178253 |     0.0988 |  3.8582 |     55219 |  2.5989 |     52293 |
| 1    | 229719 |     0.1273 |  4.4691 |     84690 |  2.6089 |     73503 |
| 2    | 136538 |     0.0756 |  3.7811 |     38575 |  2.5554 |     37794 |
| 3    | 148748 |     0.0824 |  3.8744 |     41979 |  2.6477 |     40292 |
| 4    | 126115 |     0.0699 |  3.6275 |     33953 |  2.5882 |     33064 |
| 5    |  66067 |     0.0366 |  3.3721 |     15643 |  2.3133 |     15157 |
| 6    | 105097 |     0.0582 |  3.6479 |     29650 |  2.5607 |     29468 |
| 7    | 126851 |     0.0703 |  4.0350 |     38615 |  2.8075 |     37758 |
| 8    |  84204 |     0.0467 |  3.7276 |     23875 |  2.6178 |     23128 |
| 9    | 322186 |     0.1785 |  3.8279 |    103851 |  2.7357 |    103539 |
| 99   | 281215 |     0.1558 |  3.6252 |     48538 |  2.2509 |     44143 |
+------+--------+------------+---------+-----------+---------+-----------+

Age

The age ranges were created arbitrarily - though they have worked out reasonable well. Further refinement is possible. The variable 'age_rnum' is a integer categorization, and 'age_range' provides the variable. Ages were calculated using 'birth_year' using 2010 as the reference point.

+-----------+--------+------------+---------+-----------+---------+-----------+
| age_range | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+-----------+--------+------------+---------+-----------+---------+-----------+
| 15-19     |  97659 |     0.0541 |  4.8070 |     30385 |  3.5139 |     35708 |
| 20-24     | 485028 |     0.2687 |  3.9309 |    146882 |  2.8494 |    159322 |
| 25-29     | 475376 |     0.2634 |  3.5992 |    131205 |  2.4778 |    128834 |
| 30-34     | 274281 |     0.1520 |  4.4587 |     89759 |  2.4158 |     66236 |
| 35-39     | 155925 |     0.0864 |  3.7430 |     42681 |  2.3843 |     36490 |
| 40-44     | 109121 |     0.0605 |  3.5817 |     27776 |  2.4255 |     24939 |
| 45-49     |  78618 |     0.0436 |  3.6452 |     19225 |  2.2995 |     16338 |
| 50-59     |  95810 |     0.0531 |  3.3421 |     21055 |  2.1347 |     17831 |
| >60       |  33175 |     0.0184 |  2.9332 |      5620 |  1.8930 |      4441 |
+-----------+--------+------------+---------+-----------+---------+-----------+

Birth year is self-reported almost surely wrongly so in some case. It ranges from 1900 to 1995.

Account Age

The account age range variables, 'acc_age_rnum' for the interger categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activitiy - and to provide a normalization base. The oldest account is 2,613 days. The youngest is 0 days.

+---------------+--------+------------+---------+-----------+---------+-----------+
| acc_age_range | Count  | Percentage |         |           |         |           |
+---------------+--------+------------+---------+-----------+---------+-----------+
| 0             |   1385 |     0.0008 |         |           |         |           |
| 1             |   7914 |     0.0044 |         |           |         |           |
| 2             |   7967 |     0.0044 |         |           |         |           |
| 3             |   9051 |     0.0050 |         |           |         |           |
| 4             |   9134 |     0.0051 |         |           |         |           |
| 5             |  10032 |     0.0056 |         |           |         |           |
| 6-10          |  35862 |     0.0199 |         |           |         |           |
| 11-15         |  39986 |     0.0222 |         |           |         |           |
| 16-30         | 136983 |     0.0759 |         |           |         |           |
| 31-60         | 233560 |     0.1294 |         |           |         |           |
| 61-100        | 237369 |     0.1315 |  5.2286 |    142694 |  3.5119 |    144483 |
| 101-365       | 535715 |     0.2968 |  3.5052 |    220052 |  2.4617 |    220381 |
| 366-3650      | 540035 |     0.2992 |  3.5435 |    151842 |  2.1519 |    125275 |
+---------------+--------+------------+---------+-----------+---------+-----------+

Other variables

Quit times range from:

  • 2008-07-19 (youngest)
  • 2010-12-17 (oldest)

Deleted or blacklisted accounts:

+------------------------+----------+
| deleted_or_blacklisted |  Count   |
+------------------------+----------+
|                      0 |  1519319 |
|                      1 |   285674 |
+------------------------+----------+

Outstanding Data Issues

  • The variables 'num_essays' and 'profile_length' need fixing