Titanic

Status: active. Format: ARFF. Visibility: public. Uploaded 16-10-2017 by Joaquin Vanschoren.
  • Computational Universe Manufacturing text_data
Author: Frank E. Harrell Jr., Thomas Cason
Source: [Vanderbilt Biostatistics](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html)
Please cite:

The original Titanic dataset, describing the survival status of individual passengers on the Titanic. The titanic data does not contain information from the crew, but it does contain actual ages of half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

Thomas Cason of UVa has greatly updated and improved the titanic data frame using the Encyclopedia Titanica and created the dataset here. Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variables created.

For more information about how this dataset was constructed: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt

### Attribute information

The variables on our extracted dataset are pclass, survived, name, age, embarked, home.dest, room, ticket, boat, and sex. pclass refers to passenger class (1st, 2nd, 3rd) and is a proxy for socio-economic class. Age is in years, and some infants had fractional values.

The titanic2 data frame has no missing data and includes records for the crew, but age is dichotomized at adult vs. child. These data were obtained from Robert Dawson, Saint Mary's University. The variables are pclass, age, sex, survived.

These data frames are useful for demonstrating many of the functions in Hmisc as well as demonstrating binary logistic regression analysis using the Design library. For more details and references see Simonoff, Jeffrey S (1997): The "unusual episode" and a second statistics course. J Statistics Education, Vol. 5 No. 1.

14 features

survived (target): nominal, 2 unique values, 0 missing
pclass: numeric, 3 unique values, 0 missing
name: string, 1307 unique values, 0 missing
sex: nominal, 2 unique values, 0 missing
age: numeric, 98 unique values, 263 missing
sibsp: numeric, 7 unique values, 0 missing
parch: numeric, 8 unique values, 0 missing
ticket: string, 929 unique values, 0 missing
fare: numeric, 281 unique values, 1 missing
cabin: string, 186 unique values, 1014 missing
embarked: nominal, 3 unique values, 2 missing
boat: string, 27 unique values, 823 missing
body: numeric, 121 unique values, 1188 missing
home.dest: string, 369 unique values, 564 missing
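
The per-feature missing counts can be cross-checked against the dataset-level totals reported on this page (3855 missing values, 21.04% of all cells). A minimal sketch in Python, with the counts copied by hand from the feature list; the name `missing_per_feature` is only illustrative and not part of any API:

```python
# Missing-value counts copied by hand from the feature list on this page.
missing_per_feature = {
    "survived": 0, "pclass": 0, "name": 0, "sex": 0, "age": 263,
    "sibsp": 0, "parch": 0, "ticket": 0, "fare": 1, "cabin": 1014,
    "embarked": 2, "boat": 823, "body": 1188, "home.dest": 564,
}
n_rows = 1309  # number of instances reported on this page

total_missing = sum(missing_per_feature.values())
n_cells = n_rows * len(missing_per_feature)  # rows x columns
pct_missing = 100 * total_missing / n_cells

print(total_missing)          # 3855
print(round(pct_missing, 2))  # 21.04
```

Both results agree with the "Number of missing values" and "Percentage of missing values" properties.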

62 properties

Number of instances (rows) of the dataset: 1309
Number of attributes (columns) of the dataset: 14
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 3855
Number of instances with at least one value missing: 1309
Number of numeric attributes: 6
Number of nominal attributes: 3
Second quartile (median) of skewness among attributes of the numeric type: 2.04
Number of instances belonging to the most frequent class: 809
Minimal entropy among attributes: 0.94
Percentage of binary attributes: 14.29
Second quartile (median) of standard deviation of attributes of the numeric type: 7.73
Maximum entropy among attributes: 1.15
Minimum kurtosis among attributes of the numeric type: -1.32
Percentage of instances having missing values: 100
Third quartile of entropy among attributes: 1.15
Maximum kurtosis among attributes of the numeric type: 27.03
Minimum of means among attributes of the numeric type: 0.39
Percentage of missing values: 21.04
Third quartile of kurtosis among attributes of the numeric type: 22.91
Maximum of means among attributes of the numeric type: 160.81
Minimal mutual information between the nominal attributes and the target attribute: 0.02
Percentage of numeric attributes: 42.86
Third quartile of means among attributes of the numeric type: 65.17
Maximum mutual information between the nominal attributes and the target attribute: 0.21
Minimal number of distinct values among attributes of the nominal type: 2
Percentage of nominal attributes: 21.43
Third quartile of mutual information between the nominal attributes and the target attribute: 0.21
Maximum number of distinct values among attributes of the nominal type: 3
Minimum skewness among attributes of the numeric type: -0.6
Minimum standard deviation of attributes of the numeric type: 0.84
First quartile of entropy among attributes: 0.94
Third quartile of skewness among attributes of the numeric type: 3.98
Maximum skewness among attributes of the numeric type: 4.37
Percentage of instances belonging to the least frequent class: 38.2
First quartile of kurtosis among attributes of the numeric type: -1.27
Third quartile of standard deviation of attributes of the numeric type: 63.24
Maximum standard deviation of attributes of the numeric type: 97.7
Number of instances belonging to the least frequent class: 500
First quartile of means among attributes of the numeric type: 0.47
Standard deviation of the number of distinct values among attributes of the nominal type: 0.58
Average entropy of the attributes: 1.05
Number of binary attributes: 2
First quartile of mutual information between the nominal attributes and the target attribute: 0.02
Mean kurtosis among attributes of the numeric type: 11.03
First quartile of skewness among attributes of the numeric type: -0.08
Mean of means among attributes of the numeric type: 37.86
First quartile of standard deviation of attributes of the numeric type: 0.86
Average class difference between consecutive instances: 0.61
Average mutual information between the nominal attributes and the target attribute: 0.12
Second quartile (median) of entropy among attributes: 1.05
Entropy of the target attribute values: 0.96
Estimate of the amount of irrelevant information in the attributes regarding the class, equal to (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation: 8.09
Second quartile (median) of kurtosis among attributes of the numeric type: 10.1
Number of attributes divided by the number of instances: 0.01
Average number of distinct values among the attributes of the nominal type: 2.33
Second quartile (median) of means among attributes of the numeric type: 16.09
Number of attributes needed to optimally describe the class (under the assumption of independence among attributes), equal to ClassEntropy divided by MeanMutualInformation: 8.34
Mean skewness among attributes of the numeric type: 1.96
Second quartile (median) of mutual information between the nominal attributes and the target attribute: 0.12
Percentage of instances belonging to the most frequent class: 61.8
Mean standard deviation of attributes of the numeric type: 27.77
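
Several of the class-related properties follow directly from the two class counts reported here (809 and 500 instances). The sketch below recomputes them in Python as a sanity check. It assumes the entropy is measured in bits (log base 2), which the page does not state but which reproduces the reported 0.96; `entropy_bits` is a hypothetical helper name.

```python
import math

n_majority, n_minority = 809, 500  # class counts reported on this page
n_rows = n_majority + n_minority   # 1309 instances
n_features = 14

pct_majority = 100 * n_majority / n_rows
pct_minority = 100 * n_minority / n_rows

def entropy_bits(counts):
    """Shannon entropy in bits of a discrete distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

class_entropy = entropy_bits([n_majority, n_minority])
dimensionality = n_features / n_rows  # attributes per instance

print(round(pct_majority, 2))    # 61.8
print(round(pct_minority, 2))    # 38.2
print(round(class_entropy, 2))   # 0.96
print(round(dimensionality, 2))  # 0.01
```

The two ratio properties (8.09 and 8.34) are defined from MeanAttributeEntropy, MeanMutualInformation, and ClassEntropy. Plugging in the rounded values shown on the page gives only approximate agreement (about 7.75 and 8.0), presumably because the displayed inputs are rounded.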

11 tasks

0 runs - estimation_procedure: 10-fold Crossvalidation - target_feature: survived
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
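
For the first task above (10-fold cross-validation with `survived` as target), the following is a minimal, standard-library sketch of how the 1309 rows could be split into ten folds. Stratifying by class, the seed, and the helper name `stratified_kfold` are assumptions for illustration; the page does not say how the task's actual splits are generated. The labels are placeholders built from the reported class counts (809 and 500), not the real data.

```python
import random

def stratified_kfold(labels, k=10, seed=0):
    """Return k lists of row indices with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class
        # stopped, so overall fold sizes differ by at most one.
        for i, idx in enumerate(indices):
            folds[(offset + i) % k].append(idx)
        offset += len(indices)
    return folds

# Placeholder labels: 809 rows of one class, 500 of the other.
labels = [0] * 809 + [1] * 500
folds = stratified_kfold(labels, k=10)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # A model would be fit on train_idx and evaluated on test_idx here.

sizes = [len(fold) for fold in folds]
print(sizes)  # nine folds of 131 rows and one of 130
```

Each row appears in exactly one test fold, and every fold keeps close to the 61.8% / 38.2% class balance of the full dataset.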