dna

Status: active. Format: ARFF. Visibility: public. Uploaded 06-04-2017 by Pieter Gijsbers.
Tags: Biology, Computational Biology, Genetics, Kaggle, OpenML-CC18, study_135, study_98, study_99

Author: Ross King, based on data from Genbank 64.1
Source: [MLbench](https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/DNA). Originally from the StatLog project.
Please Cite: Primate Splice-Junction Gene Sequences (DNA)

Originally from the StatLog project. The raw data is still available on [UCI](https://archive.ics.uci.edu/ml/datasets/Molecular+Biology+(Splice-junction+Gene+Sequences)).

The data consists of 3,186 data points (splice junctions). Each data point is described by 180 binary indicator variables, and the problem is to recognize 3 classes: ei (an exon-to-intron boundary), ie (an intron-to-exon boundary), or neither. Exons are the parts of the DNA sequence retained after splicing; introns are the parts that are spliced out.

The StatLog DNA dataset is a processed version of the [Irvine database](https://archive.ics.uci.edu/ml/datasets/Molecular+Biology+(Splice-junction+Gene+Sequences)). The main difference is that the symbolic variables representing the nucleotides (only A, G, T, C) were each replaced by 3 binary indicator variables, so the original 60 symbolic attributes became 180 binary attributes. The names of the examples were removed. The examples with ambiguities were also removed (there were very few of them: 4). The StatLog version of this dataset was produced by Ross King at Strathclyde University. For original details, see the Irvine database documentation.

The nucleotides A, C, G, T were given indicator values as follows:
A -> 1 0 0
C -> 0 1 0
G -> 0 0 1
T -> 0 0 0

Hint: Much better performance is generally observed if only the attributes closest to the junction are used. In the StatLog version, this means using attributes A61 to A120 only.
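The indicator encoding described above can be sketched in a few lines of Python. This is an illustration only, not the code used to build the StatLog version. It assumes the three indicators of each nucleotide are stored consecutively and in sequence order, and that the hint counts attributes from 1 (the feature list on this page is named A0 to A179, so the index offset is an assumption). The names `encode_sequence` and `junction_window` are hypothetical.

```python
from typing import List, Sequence

# Indicator triplets exactly as listed in the dataset description.
NUCLEOTIDE_CODES = {
    "A": (1, 0, 0),
    "C": (0, 1, 0),
    "G": (0, 0, 1),
    "T": (0, 0, 0),
}


def encode_sequence(seq: str) -> List[int]:
    """Expand a nucleotide string into a flat list of 0/1 indicators.

    Assumption: each nucleotide contributes three consecutive values.
    """
    bits: List[int] = []
    for base in seq.upper():
        if base not in NUCLEOTIDE_CODES:
            # The StatLog version dropped examples with ambiguous symbols.
            raise ValueError(f"ambiguous or unknown nucleotide: {base!r}")
        bits.extend(NUCLEOTIDE_CODES[base])
    return bits


def junction_window(bits: Sequence[int], first: int = 61, last: int = 120) -> List[int]:
    """Return indicators first..last, counted from 1 as in the hint.

    Assumption: if columns are named A0..A179, the same slice would
    correspond to A60..A119.
    """
    return list(bits[first - 1:last])


if __name__ == "__main__":
    print(encode_sequence("ACGT"))
```

Under these assumptions, a 60-nucleotide sequence expands to 180 indicators, and the window from the hint keeps 60 of them, which corresponds to the 20 middle nucleotide positions.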

181 features

class (target): nominal, 3 unique values, 0 missing
A0: nominal, 2 unique values, 0 missing
A1: nominal, 2 unique values, 0 missing
A2: nominal, 2 unique values, 0 missing
A3: nominal, 2 unique values, 0 missing
A4: nominal, 2 unique values, 0 missing
A5: nominal, 2 unique values, 0 missing
A6: nominal, 2 unique values, 0 missing
A7: nominal, 2 unique values, 0 missing
A8: nominal, 2 unique values, 0 missing
A9: nominal, 2 unique values, 0 missing
A10: nominal, 2 unique values, 0 missing
A11: nominal, 2 unique values, 0 missing
A12: nominal, 2 unique values, 0 missing
A13: nominal, 2 unique values, 0 missing
A14: nominal, 2 unique values, 0 missing
A15: nominal, 2 unique values, 0 missing
A16: nominal, 2 unique values, 0 missing
A17: nominal, 2 unique values, 0 missing
A18: nominal, 2 unique values, 0 missing
A19: nominal, 2 unique values, 0 missing
A20: nominal, 2 unique values, 0 missing
A21: nominal, 2 unique values, 0 missing
A22: nominal, 2 unique values, 0 missing
A23: nominal, 2 unique values, 0 missing
A24: nominal, 2 unique values, 0 missing
A25: nominal, 2 unique values, 0 missing
A26: nominal, 2 unique values, 0 missing
A27: nominal, 2 unique values, 0 missing
A28: nominal, 2 unique values, 0 missing
A29: nominal, 2 unique values, 0 missing
A30: nominal, 2 unique values, 0 missing
A31: nominal, 2 unique values, 0 missing
A32: nominal, 2 unique values, 0 missing
A33: nominal, 2 unique values, 0 missing
A34: nominal, 2 unique values, 0 missing
A35: nominal, 2 unique values, 0 missing
A36: nominal, 2 unique values, 0 missing
A37: nominal, 2 unique values, 0 missing
A38: nominal, 2 unique values, 0 missing
A39: nominal, 2 unique values, 0 missing
A40: nominal, 2 unique values, 0 missing
A41: nominal, 2 unique values, 0 missing
A42: nominal, 2 unique values, 0 missing
A43: nominal, 2 unique values, 0 missing
A44: nominal, 2 unique values, 0 missing
A45: nominal, 2 unique values, 0 missing
A46: nominal, 2 unique values, 0 missing
A47: nominal, 2 unique values, 0 missing
A48: nominal, 2 unique values, 0 missing
A49: nominal, 2 unique values, 0 missing
A50: nominal, 2 unique values, 0 missing
A51: nominal, 2 unique values, 0 missing
A52: nominal, 2 unique values, 0 missing
A53: nominal, 2 unique values, 0 missing
A54: nominal, 2 unique values, 0 missing
A55: nominal, 2 unique values, 0 missing
A56: nominal, 2 unique values, 0 missing
A57: nominal, 2 unique values, 0 missing
A58: nominal, 2 unique values, 0 missing
A59: nominal, 2 unique values, 0 missing
A60: nominal, 2 unique values, 0 missing
A61: nominal, 2 unique values, 0 missing
A62: nominal, 2 unique values, 0 missing
A63: nominal, 2 unique values, 0 missing
A64: nominal, 2 unique values, 0 missing
A65: nominal, 2 unique values, 0 missing
A66: nominal, 2 unique values, 0 missing
A67: nominal, 2 unique values, 0 missing
A68: nominal, 2 unique values, 0 missing
A69: nominal, 2 unique values, 0 missing
A70: nominal, 2 unique values, 0 missing
A71: nominal, 2 unique values, 0 missing
A72: nominal, 2 unique values, 0 missing
A73: nominal, 2 unique values, 0 missing
A74: nominal, 2 unique values, 0 missing
A75: nominal, 2 unique values, 0 missing
A76: nominal, 2 unique values, 0 missing
A77: nominal, 2 unique values, 0 missing
A78: nominal, 2 unique values, 0 missing
A79: nominal, 2 unique values, 0 missing
A80: nominal, 2 unique values, 0 missing
A81: nominal, 2 unique values, 0 missing
A82: nominal, 2 unique values, 0 missing
A83: nominal, 2 unique values, 0 missing
A84: nominal, 2 unique values, 0 missing
A85: nominal, 2 unique values, 0 missing
A86: nominal, 2 unique values, 0 missing
A87: nominal, 2 unique values, 0 missing
A88: nominal, 2 unique values, 0 missing
A89: nominal, 2 unique values, 0 missing
A90: nominal, 2 unique values, 0 missing
A91: nominal, 2 unique values, 0 missing
A92: nominal, 2 unique values, 0 missing
A93: nominal, 2 unique values, 0 missing
A94: nominal, 2 unique values, 0 missing
A95: nominal, 2 unique values, 0 missing
A96: nominal, 2 unique values, 0 missing
A97: nominal, 2 unique values, 0 missing
A98: nominal, 2 unique values, 0 missing
A99: nominal, 2 unique values, 0 missing
A100: nominal, 2 unique values, 0 missing
A101: nominal, 2 unique values, 0 missing
A102: nominal, 2 unique values, 0 missing
A103: nominal, 2 unique values, 0 missing
A104: nominal, 2 unique values, 0 missing
A105: nominal, 2 unique values, 0 missing
A106: nominal, 2 unique values, 0 missing
A107: nominal, 2 unique values, 0 missing
A108: nominal, 2 unique values, 0 missing
A109: nominal, 2 unique values, 0 missing
A110: nominal, 2 unique values, 0 missing
A111: nominal, 2 unique values, 0 missing
A112: nominal, 2 unique values, 0 missing
A113: nominal, 2 unique values, 0 missing
A114: nominal, 2 unique values, 0 missing
A115: nominal, 2 unique values, 0 missing
A116: nominal, 2 unique values, 0 missing
A117: nominal, 2 unique values, 0 missing
A118: nominal, 2 unique values, 0 missing
A119: nominal, 2 unique values, 0 missing
A120: nominal, 2 unique values, 0 missing
A121: nominal, 2 unique values, 0 missing
A122: nominal, 2 unique values, 0 missing
A123: nominal, 2 unique values, 0 missing
A124: nominal, 2 unique values, 0 missing
A125: nominal, 2 unique values, 0 missing
A126: nominal, 2 unique values, 0 missing
A127: nominal, 2 unique values, 0 missing
A128: nominal, 2 unique values, 0 missing
A129: nominal, 2 unique values, 0 missing
A130: nominal, 2 unique values, 0 missing
A131: nominal, 2 unique values, 0 missing
A132: nominal, 2 unique values, 0 missing
A133: nominal, 2 unique values, 0 missing
A134: nominal, 2 unique values, 0 missing
A135: nominal, 2 unique values, 0 missing
A136: nominal, 2 unique values, 0 missing
A137: nominal, 2 unique values, 0 missing
A138: nominal, 2 unique values, 0 missing
A139: nominal, 2 unique values, 0 missing
A140: nominal, 2 unique values, 0 missing
A141: nominal, 2 unique values, 0 missing
A142: nominal, 2 unique values, 0 missing
A143: nominal, 2 unique values, 0 missing
A144: nominal, 2 unique values, 0 missing
A145: nominal, 2 unique values, 0 missing
A146: nominal, 2 unique values, 0 missing
A147: nominal, 2 unique values, 0 missing
A148: nominal, 2 unique values, 0 missing
A149: nominal, 2 unique values, 0 missing
A150: nominal, 2 unique values, 0 missing
A151: nominal, 2 unique values, 0 missing
A152: nominal, 2 unique values, 0 missing
A153: nominal, 2 unique values, 0 missing
A154: nominal, 2 unique values, 0 missing
A155: nominal, 2 unique values, 0 missing
A156: nominal, 2 unique values, 0 missing
A157: nominal, 2 unique values, 0 missing
A158: nominal, 2 unique values, 0 missing
A159: nominal, 2 unique values, 0 missing
A160: nominal, 2 unique values, 0 missing
A161: nominal, 2 unique values, 0 missing
A162: nominal, 2 unique values, 0 missing
A163: nominal, 2 unique values, 0 missing
A164: nominal, 2 unique values, 0 missing
A165: nominal, 2 unique values, 0 missing
A166: nominal, 2 unique values, 0 missing
A167: nominal, 2 unique values, 0 missing
A168: nominal, 2 unique values, 0 missing
A169: nominal, 2 unique values, 0 missing
A170: nominal, 2 unique values, 0 missing
A171: nominal, 2 unique values, 0 missing
A172: nominal, 2 unique values, 0 missing
A173: nominal, 2 unique values, 0 missing
A174: nominal, 2 unique values, 0 missing
A175: nominal, 2 unique values, 0 missing
A176: nominal, 2 unique values, 0 missing
A177: nominal, 2 unique values, 0 missing
A178: nominal, 2 unique values, 0 missing
A179: nominal, 2 unique values, 0 missing

62 properties

Number of instances (rows) of the dataset: 3186
Number of attributes (columns) of the dataset: 181
Number of distinct values of the target attribute (if it is nominal): 3
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 0
Number of nominal attributes: 181
Maximum skewness among attributes of the numeric type: (no value)
Minimum standard deviation of attributes of the numeric type: (no value)
First quartile of entropy among attributes: 0.77
Third quartile of skewness among attributes of the numeric type: (no value)
Maximum standard deviation of attributes of the numeric type: (no value)
Percentage of instances belonging to the least frequent class: 24.01
First quartile of kurtosis among attributes of the numeric type: (no value)
Third quartile of standard deviation of attributes of the numeric type: (no value)
Average entropy of the attributes: 0.81
Number of instances belonging to the least frequent class: 765
First quartile of means among attributes of the numeric type: (no value)
Standard deviation of the number of distinct values among attributes of the nominal type: 0.07
Mean kurtosis among attributes of the numeric type: (no value)
Number of binary attributes: 180
First quartile of mutual information between the nominal attributes and the target attribute: 0
Mean of means among attributes of the numeric type: (no value)
First quartile of skewness among attributes of the numeric type: (no value)
Average mutual information between the nominal attributes and the target attribute: 0.02
First quartile of standard deviation of attributes of the numeric type: (no value)
Average class difference between consecutive instances: 0.38
An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation: 36.27
Second quartile (Median) of entropy among attributes: 0.81
Entropy of the target attribute values: 1.48
Average number of distinct values among the attributes of the nominal type: 2.01
Second quartile (Median) of kurtosis among attributes of the numeric type: (no value)
Number of attributes divided by the number of instances: 0.06
Mean skewness among attributes of the numeric type: (no value)
Second quartile (Median) of means among attributes of the numeric type: (no value)
Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation: 68.43
Percentage of instances belonging to the most frequent class: 51.91
Mean standard deviation of attributes of the numeric type: (no value)
Second quartile (Median) of mutual information between the nominal attributes and the target attribute: 0.01
Number of instances belonging to the most frequent class: 1654
Minimal entropy among attributes: 0.58
Second quartile (Median) of skewness among attributes of the numeric type: (no value)
Maximum entropy among attributes: 1
Minimum kurtosis among attributes of the numeric type: (no value)
Percentage of binary attributes: 99.45
Second quartile (Median) of standard deviation of attributes of the numeric type: (no value)
Maximum kurtosis among attributes of the numeric type: (no value)
Minimum of means among attributes of the numeric type: (no value)
Percentage of instances having missing values: 0
Third quartile of entropy among attributes: 0.85
Maximum of means among attributes of the numeric type: (no value)
Minimal mutual information between the nominal attributes and the target attribute: 0
Percentage of missing values: 0
Third quartile of kurtosis among attributes of the numeric type: (no value)
Maximum mutual information between the nominal attributes and the target attribute: 0.38
The minimal number of distinct values among attributes of the nominal type: 2
Percentage of numeric attributes: 0
Third quartile of means among attributes of the numeric type: (no value)
The maximum number of distinct values among attributes of the nominal type: 3
Minimum skewness among attributes of the numeric type: (no value)
Percentage of nominal attributes: 100
Third quartile of mutual information between the nominal attributes and the target attribute: 0.02
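Several of the properties above follow directly from the listed counts, and a short Python sketch can show how. It is a consistency check, not the code OpenML uses to compute these properties. It assumes entropy is measured in bits (base 2). It also assumes that the class not listed separately holds the remaining instances, which follows from there being 3 classes and 3186 rows. The function names are hypothetical.

```python
import math
from typing import Iterable

# Counts copied from the property list.
N_INSTANCES = 3186
N_ATTRIBUTES = 181
N_BINARY = 180
MINORITY = 765   # least frequent class
MAJORITY = 1654  # most frequent class
# Derived, not listed on the page: the third class takes the rest.
REMAINING = N_INSTANCES - MINORITY - MAJORITY


def entropy_bits(counts: Iterable[int]) -> float:
    """Shannon entropy (base 2, an assumption) of a count distribution."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts)


def noise_to_signal(mean_attr_entropy: float, mean_mutual_info: float) -> float:
    """(MeanAttributeEntropy - MeanMutualInformation) / MeanMutualInformation."""
    return (mean_attr_entropy - mean_mutual_info) / mean_mutual_info


def equivalent_attributes(class_entropy: float, mean_mutual_info: float) -> float:
    """ClassEntropy / MeanMutualInformation."""
    return class_entropy / mean_mutual_info


class_entropy = entropy_bits([MINORITY, MAJORITY, REMAINING])
minority_pct = 100 * MINORITY / N_INSTANCES
majority_pct = 100 * MAJORITY / N_INSTANCES
binary_pct = 100 * N_BINARY / N_ATTRIBUTES
dimensionality = N_ATTRIBUTES / N_INSTANCES

if __name__ == "__main__":
    print(f"class entropy:  {class_entropy:.2f}")
    print(f"minority class: {minority_pct:.2f}%")
    print(f"majority class: {majority_pct:.2f}%")
    print(f"binary attrs:   {binary_pct:.2f}%")
    print(f"dimensionality: {dimensionality:.2f}")
```

Rounded to two decimals, these reproduce the listed 1.48, 24.01, 51.91, 99.45, and 0.06. The two ratio functions are included only to spell out the formulas. Feeding them the rounded values 0.81, 1.48, and 0.02 does not reproduce the listed 36.27 and 68.43 exactly, presumably because the page shows the mean mutual information rounded to two decimals.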

28 tasks

4967 runs - estimation_procedure: 10-fold Crossvalidation - target_feature: class
2096 runs - estimation_procedure: 10-fold Crossvalidation - evaluation_measure: precision - target_feature: class
0 runs - estimation_procedure: 10 times 10-fold Crossvalidation - target_feature: class
0 runs - estimation_procedure: 33% Holdout set - target_feature: class
0 runs - estimation_procedure: 4-fold Crossvalidation - target_feature: class
0 runs - estimation_procedure: 10-fold Crossvalidation - evaluation_measure: predictive_accuracy - target_feature: class
0 runs - estimation_procedure: 10-fold Crossvalidation - target_feature: classification problem
0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class
0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class
0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class
0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class
0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class
0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class
0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class
0 runs - estimation_procedure: 10-fold Learning Curve - target_feature: class
0 runs - estimation_procedure: Interleaved Test then Train - target_feature: class
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering - target_feature: tubercolosis
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering