OpenML

okcupid-stem

Status: active. Format: ARFF. Publicly available. Visibility: public. Uploaded 19-11-2020 by Marcos de Paula Bueno.

User profile data for San Francisco OkCupid users, published in [Kim, A. Y., & Escobedo-Land, A. (2015). OKCupid data for introductory statistics and data science courses. Journal of Statistics Education, 23(2)]. The curated dataset was downloaded from [https://github.com/rudeboybert/JSE_OkCupid]. The original dataset was created with a Python script that pulled the data from public profiles on www.okcupid.com on 06/30/2012. It includes people (n = 59946) within a 25-mile radius of San Francisco who had been online within the previous year (since 06/30/2011) and had at least one profile picture. Permission to use this data was obtained by the author of the original paper from OkCupid president and co-founder Christian Rudder, under the condition that the dataset remains public.

As the target, the variable 'job' was collapsed into three categories: 'stem', 'non_stem', and 'student'. STEM jobs were defined as 'job' %in% c('computer / hardware / software', 'science / tech / engineering'). Observations with 'job' %in% c('unemployed', 'retired', 'rather not say') or with a missing value in 'job' were removed.

The original dataset also included ten open-text variables, 'essay0' to 'essay9', which were removed from the dataset uploaded here. The dataset further includes the date/time variable 'last_online' (ignored by default), which could be used to construct additional features. Using OkCupid data to predict STEM jobs was inspired by Max Kuhn's book 'Feature Engineering and Selection: A Practical Approach for Predictive Models' [https://bookdown.org/max/FES/].
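The target construction described above can be illustrated with a minimal Python sketch. It is not the script used to build the uploaded dataset. The function names `collapse_job` and `build_target` are hypothetical, and two details are assumptions not stated in the description: that students carry the raw label 'student', and that a missing 'job' appears as `None` or an empty string. The raw label 'law / legal services' in the usage lines is only an illustrative non-STEM value.

```python
from typing import List, Optional

# Raw 'job' values treated as STEM, as listed in the description.
STEM_JOBS = {"computer / hardware / software", "science / tech / engineering"}
# Raw 'job' values whose observations were removed.
DROPPED_JOBS = {"unemployed", "retired", "rather not say"}
# Assumption: students carry the raw label 'student'.
STUDENT_JOB = "student"


def collapse_job(job: Optional[str]) -> Optional[str]:
    """Map a raw 'job' value to 'stem', 'non_stem', or 'student'.

    Returns None for observations that would be removed.
    Assumption: a missing value is represented as None or an empty string.
    """
    if job is None or job.strip() == "":
        return None
    job = job.strip()
    if job in DROPPED_JOBS:
        return None
    if job in STEM_JOBS:
        return "stem"
    if job == STUDENT_JOB:
        return "student"
    # Every remaining job label is treated as non-STEM.
    return "non_stem"


def build_target(jobs: List[Optional[str]]) -> List[str]:
    """Collapse raw job values and drop the removed observations."""
    collapsed = (collapse_job(j) for j in jobs)
    return [c for c in collapsed if c is not None]


# Illustrative input only; 'law / legal services' is a hypothetical non-STEM label.
raw = ["science / tech / engineering", "student", "retired", None,
       "law / legal services"]
print(build_target(raw))  # ['stem', 'student', 'non_stem']
```

Returning `None` for removed observations keeps the mapping and the row filter separate, so the same function could also be used to count how many rows each removal rule discards.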

20 features

Feature	Type	Unique values	Missing values
job (target)	nominal	3	0
age	numeric	53	0
body_type	nominal	12	3905
diet	nominal	18	19382
drinks	nominal	6	1552
drugs	nominal	3	11622
education	nominal	32	3486
ethnicity	nominal	208	3989
height	numeric	57	1
income	nominal	12	39886
location	nominal	184	0
offspring	nominal	15	28914
orientation	nominal	3	0
pets	nominal	15	14938
religion	nominal	45	15126
sex	nominal	2	0
sign	nominal	48	7700
smokes	nominal	5	3572
speaks	nominal	7019	34
status	nominal	5	0