Difference between revisions of "OKC Project"

From edegan.com
Jump to navigation Jump to search
imported>Ed
(New page: *This page is protected so that only Ed, Toby and Misiek can read or edit it. I will be posting reports and other materials on the OK Cupid project here shortly.)
 
imported>Ed
 
(13 intermediate revisions by 2 users not shown)
Line 1: Line 1:
*This page is protected so that only Ed, Toby and Misiek can read or edit it.
+
*This page is protected so that only Ed, Usha, Toby and Misiek can read or edit it.
  
I will be posting reports and other materials on the OK Cupid project here shortly.
+
I will be posting reports and other materials on the OK Cupid project here.
 +
 
 +
==Script Files==
 +
 
 +
There are two SQL script files that were written to process the data:
 +
*[[:Image:OKC1.sql.asci|OKC1.sql]]
 +
*[[:Image:OKC2.sql.asci|OKC2.sql]]
 +
 
 +
These files are self-documenting, and the second file builds out the current base table set containing: '''users''', '''views''', '''messages''', and '''delta'''.
 +
 
 +
==Setup Instructions for Usha==
 +
 
 +
See the [[Research Computing At HBS]] page for detailed information.
 +
 
 +
Quick quide:
 +
#Go to vpn.hbs.edu and click Network Connect -> Start
 +
#Using SSH client, connect to researchgrid.hbs.edu
 +
#Connect the MYSql cluster: msyql -h rcsmysql.hbs.edu -u eegan -p --ssl-ca=rcsmysql
 +
 
 +
Useful commands:
 +
 
 +
mysql> SHOW Databases;
 +
+--------------------+
 +
| Database          |
 +
+--------------------+
 +
| information_schema |
 +
| mpiskorski_tstuart |
 +
| okcupid            |
 +
| okcupid_2          |
 +
+--------------------+
 +
4 rows in set (0.02 sec)
 +
 
 +
You'll have access to mpiskorski_tstuart
 +
 
 +
USE mpiskorski_tstuart;
 +
 
 +
(Misiek and Toby's working database)
 +
 
 +
SHOW TABLES;
 +
 
 +
There are two datasets: OKC1 and OKC2. We are now working with OKC2.
 +
 
 +
Both datasets have:
 +
Users
 +
Messages
 +
Views
 +
 
 +
OKC2 also has ProfileDelta (changes to profiles).
 +
Describe tables with:
 +
 
 +
DESC TableName
 +
 
 +
==Data Description from Misiek==
 +
 
 +
Toby sent me a file with the following data description:
 +
 
 +
USERS
 +
 +
Total = 1,804,993
 +
Female = 722,889 = 40%
 +
 +
US = 1,523,778
 +
NonUS = 281,215 = 16%
 +
 +
US Male = 897,323 = 59%
 +
US Female = 626,455 = 41%
 +
 +
US Male Singles = 827,702 = 92% of US Male
 +
US Female Singles =  565,067 = 90% of US Female
 +
 +
US Male LongTermInterest = 530,450 = 59% of Male
 +
US Female LongTermInterest = 333,947 = 53% of US Female
 +
 +
US Male ShortTermInterest =  433,904 = 48% of Male
 +
US Female ShortTermInterest = 225,715 = 36% of US Female
 +
 +
US Male CasualSex = 96,714 = 11% of Male
 +
US Female CasualSex = 19,331 = 3% of US Female
 +
 +
US Male Gays = 77,902 = 9% of US Male
 +
US Male Bi = 19,295 = 2% of US Male
 +
US Male Straight = 800,126 = 89% of US Male
 +
 +
US Female Gays = 46,118 = 7% of US Female
 +
US Female Bi = 59,585 = 10% of US Female
 +
US Female Straight = 520,752 = 83% of US Female
 +
 +
US Male NoRace = 260,889 = 31% of US Male
 +
US Male White = 517,636 = 57,172% of US Male
 +
US Male Black = 38,787 = 4% of US Male
 +
US Male Asian = 34,158= 4% of US Male
 +
US Male Latino = 57,172 = 6% of US Male
 +
 +
US Female NoRace = 185,725 = 30% of US Female
 +
US Female White = 358,119 = 57% of US Female
 +
US Female Black =  32,195 = 5% of US Female
 +
US Female Asian = 22,979 = 4% of US Female
 +
US Female Latino = 37,274 = 5% of US Female
 +
 
 +
==Data Description==
 +
 
 +
This section provides summary stats on gender/orientation/race/location/age.
 +
 
 +
===Gender===
 +
 
 +
+--------+---------+------------+---------+-----------+---------+-----------+
 +
| female | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +
+--------+---------+------------+---------+-----------+---------+-----------+
 +
|      0 | 1082104 |    0.5995 |  4.1039 |    339857 |  2.2340 |    185671 |
 +
|      1 |  722889 |    0.4005 |  3.4872 |    174731 |  2.8786 |    304468 |
 +
+--------+---------+------------+---------+-----------+---------+-----------+
 +
 
 +
===Orientation===
 +
 
 +
0=Straight, 1=Gay, 2=Bi
 +
+-------------+---------+------------+---------+-----------+---------+-----------+
 +
| orientation | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +
+-------------+---------+------------+---------+-----------+---------+-----------+
 +
|          0 | 1570050 |    0.8698 |  3.9727 |    460499 |  2.6087 |    420784 |
 +
|          1 |  139697 |    0.0774 |  2.9008 |    28802 |  2.1532 |    29212 |
 +
|          2 |  95246 |    0.0528 |  3.5746 |    25287 |  2.8651 |    40143 |
 +
+-------------+---------+------------+---------+-----------+---------+-----------+
 +
 
 +
===Ethnicity===
 +
 
 +
The data was preprocessed to assign individuals who reported more than one gender as being of 'mixed' race. The variable 'ethnicity' is a interger categorization for fast sorting, and the label are provided in the variable 'ethnicitylabels'.
 +
 
 +
+------------------+--------+------------+---------+-----------+---------+-----------+
 +
| ethnicitylabels  | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +
+------------------+--------+------------+---------+-----------+---------+-----------+
 +
| mixed            | 105456 |    0.0584 |  4.0894 |    45674 |  2.8836 |    43540 |
 +
| white            | 917178 |    0.5081 |  3.7588 |    332784 |  2.6029 |    327063 |
 +
| black            |  53884 |    0.0299 |  4.2456 |    18498 |  2.4855 |    13101 |
 +
| hispanic_latin  |  61034 |    0.0338 |  4.2292 |    25278 |  2.9754 |    23348 |
 +
| asian            |  54645 |    0.0303 |  3.7712 |    16612 |  2.6209 |    17222 |
 +
| indian          |  11658 |    0.0065 |  4.6518 |      4996 |  2.2567 |      2796 |
 +
| middle_eastern  |  5916 |    0.0033 |  6.0000 |      3282 |  2.9282 |      1918 |
 +
| native_american  |  4056 |    0.0022 |  5.0511 |      1682 |  3.0157 |      1345 |
 +
| pacific_islander |  3238 |    0.0018 |  3.6478 |      1098 |  2.9398 |      1173 |
 +
| other            |  24761 |    0.0137 |  4.3936 |    11454 |  2.9021 |      9664 |
 +
| none            | 563167 |    0.3120 |  3.9094 |    53230 |  2.1889 |    48969 |
 +
+------------------+--------+------------+---------+-----------+---------+-----------+
 +
 
 +
The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:
 +
 
 +
+-----------------+----------+
 +
| num_ethnicities | COUNT(*) |
 +
+-----------------+----------+
 +
|              1 |  1699537 |
 +
|              2 |    83676 |
 +
|              3 |    14781 |
 +
|              4 |    3348 |
 +
|              5 |      976 |
 +
|              6 |      330 |
 +
|              7 |      243 |
 +
|              8 |      409 |
 +
|              9 |    1693 |
 +
+-----------------+----------+
 +
 
 +
===Location===
 +
 
 +
Zip3 was provided in the data, but I was cautioned that the third number was not meaningul (i.e. added randomly for obscurification). Zip2 and Zip1 (the later is reported below) are coded as variables.
 +
 
 +
See:
 +
*The [http://en.wikipedia.org/wiki/ZIP_code#Primary_State_Prefixes Wikipedia Zip Code page] for a map of Zip codes
 +
*[http://en.wikipedia.org/wiki/List_of_ZIP_code_prefixes List of Zip Code Prefixes]
 +
 
 +
+------+--------+------------+---------+-----------+---------+-----------+
 +
| zip1 | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +
+------+--------+------------+---------+-----------+---------+-----------+
 +
| 0    | 178253 |    0.0988 |  3.8582 |    55219 |  2.5989 |    52293 |
 +
| 1    | 229719 |    0.1273 |  4.4691 |    84690 |  2.6089 |    73503 |
 +
| 2    | 136538 |    0.0756 |  3.7811 |    38575 |  2.5554 |    37794 |
 +
| 3    | 148748 |    0.0824 |  3.8744 |    41979 |  2.6477 |    40292 |
 +
| 4    | 126115 |    0.0699 |  3.6275 |    33953 |  2.5882 |    33064 |
 +
| 5    |  66067 |    0.0366 |  3.3721 |    15643 |  2.3133 |    15157 |
 +
| 6    | 105097 |    0.0582 |  3.6479 |    29650 |  2.5607 |    29468 |
 +
| 7    | 126851 |    0.0703 |  4.0350 |    38615 |  2.8075 |    37758 |
 +
| 8    |  84204 |    0.0467 |  3.7276 |    23875 |  2.6178 |    23128 |
 +
| 9    | 322186 |    0.1785 |  3.8279 |    103851 |  2.7357 |    103539 |
 +
| 99  | 281215 |    0.1558 |  3.6252 |    48538 |  2.2509 |    44143 |
 +
+------+--------+------------+---------+-----------+---------+-----------+
 +
 
 +
===Age===
 +
 
 +
The age ranges were created arbitrarily - though they have worked out reasonable well. Further refinement is possible. The variable 'age_rnum' is a integer categorization, and 'age_range' provides the variable. Ages were calculated using 'birth_year' using 2010 as the reference point.
 +
 
 +
+-----------+--------+------------+---------+-----------+---------+-----------+
 +
| age_range | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
 +
+-----------+--------+------------+---------+-----------+---------+-----------+
 +
| 15-19    |  97659 |    0.0541 |  4.8070 |    30385 |  3.5139 |    35708 |
 +
| 20-24    | 485028 |    0.2687 |  3.9309 |    146882 |  2.8494 |    159322 |
 +
| 25-29    | 475376 |    0.2634 |  3.5992 |    131205 |  2.4778 |    128834 |
 +
| 30-34    | 274281 |    0.1520 |  4.4587 |    89759 |  2.4158 |    66236 |
 +
| 35-39    | 155925 |    0.0864 |  3.7430 |    42681 |  2.3843 |    36490 |
 +
| 40-44    | 109121 |    0.0605 |  3.5817 |    27776 |  2.4255 |    24939 |
 +
| 45-49    |  78618 |    0.0436 |  3.6452 |    19225 |  2.2995 |    16338 |
 +
| 50-59    |  95810 |    0.0531 |  3.3421 |    21055 |  2.1347 |    17831 |
 +
| >60      |  33175 |    0.0184 |  2.9332 |      5620 |  1.8930 |      4441 |
 +
+-----------+--------+------------+---------+-----------+---------+-----------+
 +
 
 +
Birth year is self-reported almost surely wrongly so in some case. It ranges from 1900 to 1995.
 +
 
 +
===Account Age===
 +
 
 +
The account age range variables, 'acc_age_rnum' for the interger categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activitiy - and to provide a normalization base. The oldest account is 2,613 days. The youngest is 0 days.
 +
 
 +
+---------------+--------+------------+---------+-----------+---------+-----------+
 +
| acc_age_range | Count  | Percentage |        |          |        |          |
 +
+---------------+--------+------------+---------+-----------+---------+-----------+
 +
| 0            |  1385 |    0.0008 |        |          |        |          |
 +
| 1            |  7914 |    0.0044 |        |          |        |          |
 +
| 2            |  7967 |    0.0044 |        |          |        |          |
 +
| 3            |  9051 |    0.0050 |        |          |        |          |
 +
| 4            |  9134 |    0.0051 |        |          |        |          |
 +
| 5            |  10032 |    0.0056 |        |          |        |          |
 +
| 6-10          |  35862 |    0.0199 |        |          |        |          |
 +
| 11-15        |  39986 |    0.0222 |        |          |        |          |
 +
| 16-30        | 136983 |    0.0759 |        |          |        |          |
 +
| 31-60        | 233560 |    0.1294 |        |          |        |          |
 +
| 61-100        | 237369 |    0.1315 |  5.2286 |    142694 |  3.5119 |    144483 |
 +
| 101-365      | 535715 |    0.2968 |  3.5052 |    220052 |  2.4617 |    220381 |
 +
| 366-3650      | 540035 |    0.2992 |  3.5435 |    151842 |  2.1519 |    125275 |
 +
+---------------+--------+------------+---------+-----------+---------+-----------+
 +
 
 +
===Other variables===
 +
 
 +
Quit times range from:
 +
*2008-07-19 (youngest)
 +
*2010-12-17 (oldest)
 +
 
 +
Deleted or blacklisted accounts:
 +
+------------------------+----------+
 +
| deleted_or_blacklisted |  Count  |
 +
+------------------------+----------+
 +
|                      0 |  1519319 |
 +
|                      1 |  285674 |
 +
+------------------------+----------+
 +
 
 +
==Outstanding Data Issues==
 +
 
 +
*The variables 'num_essays' and 'profile_length' need fixing

Latest revision as of 15:00, 19 December 2011

  • This page is protected so that only Ed, Usha, Toby and Misiek can read or edit it.

I will be posting reports and other materials on the OK Cupid project here.

Script Files

There are two SQL script files that were written to process the data:

These files are self-documenting, and the second file builds out the current base table set containing: users, views, messages, and delta.

Setup Instructions for Usha

See the Research Computing At HBS page for detailed information.

Quick quide:

  1. Go to vpn.hbs.edu and click Network Connect -> Start
  2. Using SSH client, connect to researchgrid.hbs.edu
  3. Connect the MYSql cluster: msyql -h rcsmysql.hbs.edu -u eegan -p --ssl-ca=rcsmysql

Useful commands:

mysql> SHOW Databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mpiskorski_tstuart |
| okcupid            |
| okcupid_2          |
+--------------------+
4 rows in set (0.02 sec)

You'll have access to mpiskorski_tstuart

USE mpiskorski_tstuart;

(Misiek and Toby's working database)

SHOW TABLES;

There are two datasets: OKC1 and OKC2. We are now working with OKC2.

Both datasets have:

Users
Messages
Views

OKC2 also has ProfileDelta (changes to profiles). Describe tables with:

DESC TableName

Data Description from Misiek

Toby sent me a file with the following data description:

USERS

Total = 1,804,993
Female = 722,889 = 40%

US = 1,523,778
NonUS = 281,215 = 16%

US Male = 897,323 = 59%
US Female = 626,455 = 41%

US Male Singles = 827,702 = 92% of US Male
US Female Singles =  565,067 = 90% of US Female 

US Male LongTermInterest = 530,450 = 59% of Male
US Female LongTermInterest = 333,947 = 53% of US Female

US Male ShortTermInterest =  433,904 = 48% of Male
US Female ShortTermInterest = 225,715 = 36% of US Female

US Male CasualSex = 96,714 = 11% of Male
US Female CasualSex = 19,331 = 3% of US Female

US Male Gays = 77,902 = 9% of US Male
US Male Bi = 19,295 = 2% of US Male
US Male Straight = 800,126 = 89% of US Male

US Female Gays = 46,118 = 7% of US Female
US Female Bi = 59,585 = 10% of US Female
US Female Straight = 520,752 = 83% of US Female

US Male NoRace = 260,889 = 31% of US Male
US Male White = 517,636 = 57,172% of US Male
US Male Black = 38,787 = 4% of US Male
US Male Asian = 34,158= 4% of US Male
US Male Latino = 57,172 = 6% of US Male

US Female NoRace = 185,725 = 30% of US Female
US Female White = 358,119 = 57% of US Female
US Female Black =  32,195 = 5% of US Female
US Female Asian = 22,979 = 4% of US Female
US Female Latino = 37,274 = 5% of US Female

Data Description

This section provides summary stats on gender/orientation/race/location/age.

Gender

+--------+---------+------------+---------+-----------+---------+-----------+
| female | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+--------+---------+------------+---------+-----------+---------+-----------+
|      0 | 1082104 |     0.5995 |  4.1039 |    339857 |  2.2340 |    185671 |
|      1 |  722889 |     0.4005 |  3.4872 |    174731 |  2.8786 |    304468 |
+--------+---------+------------+---------+-----------+---------+-----------+

Orientation

0=Straight, 1=Gay, 2=Bi
+-------------+---------+------------+---------+-----------+---------+-----------+
| orientation | Count   | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+-------------+---------+------------+---------+-----------+---------+-----------+
|           0 | 1570050 |     0.8698 |  3.9727 |    460499 |  2.6087 |    420784 |
|           1 |  139697 |     0.0774 |  2.9008 |     28802 |  2.1532 |     29212 |
|           2 |   95246 |     0.0528 |  3.5746 |     25287 |  2.8651 |     40143 |
+-------------+---------+------------+---------+-----------+---------+-----------+

Ethnicity

The data was preprocessed to assign individuals who reported more than one gender as being of 'mixed' race. The variable 'ethnicity' is a interger categorization for fast sorting, and the label are provided in the variable 'ethnicitylabels'.

+------------------+--------+------------+---------+-----------+---------+-----------+
| ethnicitylabels  | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+------------------+--------+------------+---------+-----------+---------+-----------+
| mixed            | 105456 |     0.0584 |  4.0894 |     45674 |  2.8836 |     43540 |
| white            | 917178 |     0.5081 |  3.7588 |    332784 |  2.6029 |    327063 |
| black            |  53884 |     0.0299 |  4.2456 |     18498 |  2.4855 |     13101 |
| hispanic_latin   |  61034 |     0.0338 |  4.2292 |     25278 |  2.9754 |     23348 |
| asian            |  54645 |     0.0303 |  3.7712 |     16612 |  2.6209 |     17222 |
| indian           |  11658 |     0.0065 |  4.6518 |      4996 |  2.2567 |      2796 |
| middle_eastern   |   5916 |     0.0033 |  6.0000 |      3282 |  2.9282 |      1918 |
| native_american  |   4056 |     0.0022 |  5.0511 |      1682 |  3.0157 |      1345 |
| pacific_islander |   3238 |     0.0018 |  3.6478 |      1098 |  2.9398 |      1173 |
| other            |  24761 |     0.0137 |  4.3936 |     11454 |  2.9021 |      9664 |
| none             | 563167 |     0.3120 |  3.9094 |     53230 |  2.1889 |     48969 |
+------------------+--------+------------+---------+-----------+---------+-----------+

The distribution for the reported number of ethnicities (with >1 assigned to mixed) is:

+-----------------+----------+
| num_ethnicities | COUNT(*) |
+-----------------+----------+
|               1 |  1699537 |
|               2 |    83676 |
|               3 |    14781 |
|               4 |     3348 |
|               5 |      976 |
|               6 |      330 |
|               7 |      243 |
|               8 |      409 |
|               9 |     1693 |
+-----------------+----------+

Location

Zip3 was provided in the data, but I was cautioned that the third number was not meaningul (i.e. added randomly for obscurification). Zip2 and Zip1 (the later is reported below) are coded as variables.

See:

+------+--------+------------+---------+-----------+---------+-----------+
| zip1 | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+------+--------+------------+---------+-----------+---------+-----------+
| 0    | 178253 |     0.0988 |  3.8582 |     55219 |  2.5989 |     52293 |
| 1    | 229719 |     0.1273 |  4.4691 |     84690 |  2.6089 |     73503 |
| 2    | 136538 |     0.0756 |  3.7811 |     38575 |  2.5554 |     37794 |
| 3    | 148748 |     0.0824 |  3.8744 |     41979 |  2.6477 |     40292 |
| 4    | 126115 |     0.0699 |  3.6275 |     33953 |  2.5882 |     33064 |
| 5    |  66067 |     0.0366 |  3.3721 |     15643 |  2.3133 |     15157 |
| 6    | 105097 |     0.0582 |  3.6479 |     29650 |  2.5607 |     29468 |
| 7    | 126851 |     0.0703 |  4.0350 |     38615 |  2.8075 |     37758 |
| 8    |  84204 |     0.0467 |  3.7276 |     23875 |  2.6178 |     23128 |
| 9    | 322186 |     0.1785 |  3.8279 |    103851 |  2.7357 |    103539 |
| 99   | 281215 |     0.1558 |  3.6252 |     48538 |  2.2509 |     44143 |
+------+--------+------------+---------+-----------+---------+-----------+

Age

The age ranges were created arbitrarily - though they have worked out reasonable well. Further refinement is possible. The variable 'age_rnum' is a integer categorization, and 'age_range' provides the variable. Ages were calculated using 'birth_year' using 2010 as the reference point.

+-----------+--------+------------+---------+-----------+---------+-----------+
| age_range | Count  | Percentage | AvgSent | TotalSent | AvgRecd | TotalRecd |
+-----------+--------+------------+---------+-----------+---------+-----------+
| 15-19     |  97659 |     0.0541 |  4.8070 |     30385 |  3.5139 |     35708 |
| 20-24     | 485028 |     0.2687 |  3.9309 |    146882 |  2.8494 |    159322 |
| 25-29     | 475376 |     0.2634 |  3.5992 |    131205 |  2.4778 |    128834 |
| 30-34     | 274281 |     0.1520 |  4.4587 |     89759 |  2.4158 |     66236 |
| 35-39     | 155925 |     0.0864 |  3.7430 |     42681 |  2.3843 |     36490 |
| 40-44     | 109121 |     0.0605 |  3.5817 |     27776 |  2.4255 |     24939 |
| 45-49     |  78618 |     0.0436 |  3.6452 |     19225 |  2.2995 |     16338 |
| 50-59     |  95810 |     0.0531 |  3.3421 |     21055 |  2.1347 |     17831 |
| >60       |  33175 |     0.0184 |  2.9332 |      5620 |  1.8930 |      4441 |
+-----------+--------+------------+---------+-----------+---------+-----------+

Birth year is self-reported almost surely wrongly so in some case. It ranges from 1900 to 1995.

Account Age

The account age range variables, 'acc_age_rnum' for the interger categorization and 'acc_age_range' for the labels, were created arbitrarily. These could easily be refined, but were created to examine differences in messaging and viewing activitiy - and to provide a normalization base. The oldest account is 2,613 days. The youngest is 0 days.

+---------------+--------+------------+---------+-----------+---------+-----------+
| acc_age_range | Count  | Percentage |         |           |         |           |
+---------------+--------+------------+---------+-----------+---------+-----------+
| 0             |   1385 |     0.0008 |         |           |         |           |
| 1             |   7914 |     0.0044 |         |           |         |           |
| 2             |   7967 |     0.0044 |         |           |         |           |
| 3             |   9051 |     0.0050 |         |           |         |           |
| 4             |   9134 |     0.0051 |         |           |         |           |
| 5             |  10032 |     0.0056 |         |           |         |           |
| 6-10          |  35862 |     0.0199 |         |           |         |           |
| 11-15         |  39986 |     0.0222 |         |           |         |           |
| 16-30         | 136983 |     0.0759 |         |           |         |           |
| 31-60         | 233560 |     0.1294 |         |           |         |           |
| 61-100        | 237369 |     0.1315 |  5.2286 |    142694 |  3.5119 |    144483 |
| 101-365       | 535715 |     0.2968 |  3.5052 |    220052 |  2.4617 |    220381 |
| 366-3650      | 540035 |     0.2992 |  3.5435 |    151842 |  2.1519 |    125275 |
+---------------+--------+------------+---------+-----------+---------+-----------+

Other variables

Quit times range from:

  • 2008-07-19 (youngest)
  • 2010-12-17 (oldest)

Deleted or blacklisted accounts:

+------------------------+----------+
| deleted_or_blacklisted |  Count   |
+------------------------+----------+
|                      0 |  1519319 |
|                      1 |   285674 |
+------------------------+----------+

Outstanding Data Issues

  • The variables 'num_essays' and 'profile_length' need fixing