OpenML

kdd_coil_2

active ARFF Publicly available Visibility: public Uploaded 03-10-2014 by Joaquin Vanschoren

Author:
Source: Unknown - Date unknown
Please cite:

%%%%%%%%%%%%%%%%%%%
% Data-Description %
%%%%%%%%%%%%%%%%%%%

COIL 1999 Competition Data

Data Type
multivariate

Abstract
This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities.

Sources
Original Owner
[1]ERUDIT
European Network for Fuzzy Logic and Uncertainty Modelling in Information Technology

Donor
Jens Strackeljan
Technical University Clausthal
Institute of Applied Mechanics
Graupenstr. 3, 38678 Clausthal-Zellerfeld, Germany
[2]tmjs@itm.tu-clausthal.de

Date Donated: September 9, 1999

Data Characteristics
This data comes from a water quality study where samples were taken from sites on different European rivers over a period of approximately one year. These samples were analyzed for various chemical substances including: nitrogen in the form of nitrates, nitrites and ammonia, phosphate, pH, oxygen, chloride. In parallel, algae samples were collected to determine the algae population distributions.

Other Relevant Information
The competition involved the prediction of algal frequency distributions on the basis of the measured concentrations of the chemical substances and the global information concerning the season when the sample was taken, the river size and its flow velocity. The competition [3]instructions contain additional information on the prediction task.

Data Format
There are a total of 340 examples each containing 17 values. The first 11 values of each data set are the season, the river size, the fluid velocity and 8 chemical concentrations which should be relevant for the algae population distribution. The last 8 values of each example are the distribution of different kinds of algae. These 8 kinds are only a very small part of the whole community, but for the competition we limited the number to 7. The value 0.0 means that the frequency is very low. The data set also contains some empty fields which are labeled with the string XXXXX.

The training data are saved in the file: analysis.data (ASCII format).

Table 1: Structure of the file analysis.data

A          ...  K           a          ...  g
CC[1,1]    ...  CC[1,11]    AG[1,1]    ...  AG[1,7]
...
CC[200,1]  ...  CC[200,11]  AG[200,1]  ...  AG[200,7]

Explanation:
CC[i,j]: Chemical concentration or river characteristic
AG[i,j]: Algal frequency

The chemical parameters are labeled A, ..., K. The algae columns are labeled a, ..., g.

Past Usage
[4]The Third (1999) International COIL Competition Home Page
_________________________________________________________________

[5]The UCI KDD Archive
[6]Information and Computer Science
[7]University of California, Irvine
Irvine, CA 92697-3425
Last modified: October 13, 1999

References
1. http://www.erudit.de/
2. mailto:tmjs@itm.tu-clausthal.de
3. file://localhost/research/ml/datasets/uci/raw/data/ucikdd/coil/instructions.txt
4. http://www.erudit.de/erudit/activities/ic-99/index.htm
5. http://kdd.ics.uci.edu/
6. http://www.ics.uci.edu/
7. http://www.uci.edu/

%%%%%%%%%%%%%%%%%%%
% Task-Description %
%%%%%%%%%%%%%%%%%%%

Third International Competition

Protecting rivers and streams by monitoring chemical concentrations and algae communities.

Intelligent Techniques for Monitoring Water Quality using chemical indicators and algae population

Recent years have been characterised by increasing concern at the impact man is having on the environment. The impact on the environment of toxic waste, from a wide variety of manufacturing processes, is well known. More recently, however, it has become clear that the more subtle effects of nutrient level and chemical balance changes arising from farming land run-off and sewage water treatment also have a serious, but indirect, effect on the states of rivers, lakes and even the sea.
In temperate climates across the world, summers are characterized by numerous reports of excessive summer algae growth resulting in poor water clarity, mass deaths of river fish from reduced oxygen levels, and the closure of recreational water facilities on account of the toxic effects of this annual algal bloom. Reducing the impact of these man-made changes in river nutrient levels has stimulated much biological research with the aim of identifying the crucial chemical control variables for the biological processes.

The data used in this problem comes from one such study. During the research study, water quality samples were taken from sites on different European rivers over a period of approximately one year. These samples were analyzed for various chemical substances including: nitrogen in the form of nitrates, nitrites and ammonia, phosphate, pH, oxygen, chloride. In parallel, algae samples were collected to determine the algae population distributions.

It is well known that the dynamics of the algae community are determined by the external chemical environment, with one or more factors being predominant. While the chemical analysis is cheap and easily automated, the biological part involves microscopic examination, requires trained manpower and is therefore both expensive and slow.

Diatoms like Cymbella are major contributors to primary production throughout the world. The diatom reacts with great sensitivity to even small changes in acidity. Over a three-and-a-half-billion-year history, algae have evolved and adapted as primary plant colonizers of almost every known habitat in terrestrial and aquatic environments. They respond very rapidly to man-made environmental changes. The relationship between the chemical and biological features is complex and can be expected to need the application of advanced techniques.
Typical of such real-life problems, the particular data set for the problem contains a mixture of (fuzzy) qualitative variables and numerical measurement values, with much of the data being incomplete.

The competition task is the prediction of algal frequency distributions on the basis of the measured concentrations of the chemical substances and the global information concerning the season when the sample was taken, the river size and its flow velocity. The two last variables are given as linguistic variables.

340 data sets were taken and each contains 17 values. The first 11 values of each data set are the season, the river size, the fluid velocity and 8 chemical concentrations which should be relevant for the algae population distribution. The last 8 values of each data set are the distribution of different kinds of algae. These 8 kinds are only a very small part of the whole community, but for the competition we limited the number to 7. The value 0.0 means that the frequency is very low. The data set also contains some empty fields which are labeled with the string XXXXX.

Each participant in the competition receives 200 complete data sets (training data) and 140 data sets (evaluation data) containing only the 11 values of the river descriptions and the chemical concentrations. The training data is to be used to obtain a 'model' that predicts the algal distributions associated with the evaluation data.

The training data are saved in the file: analysis.txt (ASCII format).

Structure of the file analysis.txt

A        ...  K         a        ...  g
CC1,1    ...  CC1,11    AG1,1    ...  AG1,7
....     ...  ...       ...
CC200,1  ...  CC200,11  AG200,1  ...  AG200,7

Explanation:
CCi,j: Chemical concentration, j=1,...,11
AGi,k: Algal frequency, k=1,...,7

The chemical parameters are labeled A, ..., K. The algae columns are labeled a, ..., g.

Evaluation data are saved in the file eval.txt (ASCII format).

Table 2: Structure of the file eval.*

A        ...  K
CC1,1    ...  CC1,11
.....    ...
CC140,1  ...  CC140,11
_____________________________________________________________

Objective
The objective of the competition is to provide a prediction model on the basis of the training data. Having obtained this prediction model, each participant must provide the solution in the form of the results of applying this model to the evaluation data. The results obtained in this way should correspond to the results of the evaluation data (which are known to the organizer). The criterion used to evaluate the results is given below. All 7 algae frequency distributions must be determined. For this purpose any number of partial models may be developed.
_____________________________________________________________

Judgment of the results
To judge the results, the sum of squared errors will be calculated. The following table describes the results of a particular participant.

Matrix of results

a         ...  g
Res1,1    ...  Res1,7
....      ...
Res140,1  ...  Res140,7

All solutions that lead to the smallest total error will be regarded as winners of the contest.

Information about the dataset
CLASSTYPE: numeric
CLASSINDEX: last
ALGAE #: 2/7
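
The judging rule described above (sum of squared errors over the matrix of results) can be sketched as follows. This is only an illustration: the function name `sum_squared_errors` and the toy numbers are hypothetical, and the competition's exact scoring script is not given in the description.

```python
def sum_squared_errors(predicted, actual):
    """Total squared error over all rows and all algae columns."""
    if len(predicted) != len(actual):
        raise ValueError("row counts differ")
    total = 0.0
    for p_row, a_row in zip(predicted, actual):
        if len(p_row) != len(a_row):
            raise ValueError("column counts differ")
        total += sum((p - a) ** 2 for p, a in zip(p_row, a_row))
    return total


# Toy matrices with 2 rows and 3 columns (not competition data).
# In the competition the matrices would have 140 rows and 7 columns.
predicted = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.0]]
actual = [[1.0, 1.0, 0.0], [0.0, 0.5, 0.0]]
total = sum_squared_errors(predicted, actual)
```

For this OpenML version of the data only one algae column (algae_2) is kept as the target, so each row would hold a single value.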

12 features

algae_2 (target)	numeric	119 unique values 0 missing
season	nominal	4 unique values 0 missing
river_size	nominal	3 unique values 0 missing
fluid_velocity	nominal	3 unique values 0 missing
concentration_1	numeric	96 unique values 2 missing
concentration_2	numeric	103 unique values 2 missing
concentration_3	numeric	272 unique values 16 missing
concentration_4	numeric	283 unique values 2 missing
concentration_5	numeric	270 unique values 2 missing
concentration_6	numeric	252 unique values 2 missing
concentration_7	numeric	286 unique values 7 missing
concentration_8	numeric	194 unique values 23 missing
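
A minimal sketch of reading rows with these features while treating the XXXXX marker from the description as a missing value. The comma-separated layout, the column order, the nominal labels and the sample values are all assumptions made for illustration; none of them are taken from the actual file.

```python
import csv
import io

MISSING = "XXXXX"  # missing-value marker named in the dataset description

NOMINAL = ["season", "river_size", "fluid_velocity"]
NUMERIC = [f"concentration_{i}" for i in range(1, 9)] + ["algae_2"]
COLUMNS = NOMINAL + NUMERIC  # assumed order, target last


def parse_rows(text):
    """Parse comma-separated rows into dicts, using None for missing fields."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        row = {}
        for name, raw in zip(COLUMNS, (f.strip() for f in fields)):
            if raw == MISSING:
                row[name] = None
            elif name in NOMINAL:
                row[name] = raw
            else:
                row[name] = float(raw)
        rows.append(row)
    return rows


# Made-up sample rows, for illustration only.
SAMPLE = """winter,small,medium,8.0,9.5,60.0,6.2,500.0,100.0,170.0,50.0,0.0
spring,small,medium,8.3,8.0,57.0,1.3,370.0,420.0,550.0,1.5,12.5
autumn,large,high,8.1,XXXXX,40.0,5.3,346.0,125.0,187.0,XXXXX,41.0
"""

rows = parse_rows(SAMPLE)
missing_per_column = {n: sum(r[n] is None for r in rows) for n in NUMERIC}
```

Counting `None` entries per column in this way gives the kind of "missing" figure shown next to each feature above.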

107 properties

NumberOfInstances

316

Number of instances (rows) of the dataset.

NumberOfFeatures

Number of attributes (columns) of the dataset.

NumberOfClasses

Number of distinct values of the target attribute (if it is nominal).

NumberOfMissingValues

Number of missing values in the dataset.

NumberOfInstancesWithMissingValues

Number of instances with at least one value missing.

NumberOfNumericFeatures

Number of numeric attributes.

NumberOfSymbolicFeatures

Number of nominal attributes.

MaxMutualInformation

Maximum mutual information between the nominal attributes and the target attribute.

MinNominalAttDistinctValues

The minimal number of distinct values among attributes of the nominal type.

PercentageOfBinaryFeatures

Percentage of binary attributes.

Quartile2StdDevOfNumericAtts

18.43

Second quartile (Median) of standard deviation of attributes of the numeric type.

RandomTreeDepth1AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

Dimensionality

0.04

Number of attributes divided by the number of instances.

MaxNominalAttDistinctValues

The maximum number of distinct values among attributes of the nominal type.

MinSkewnessOfNumericAtts

-0.89

Minimum skewness among attributes of the numeric type.

PercentageOfInstancesWithMissingValues

10.76

Percentage of instances having missing values.

Quartile3AttributeEntropy

Third quartile of entropy among attributes.

RandomTreeDepth1ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1

EquivalentNumberOfAtts

Number of attributes needed to optimally describe the class (under the assumption of independence among attributes). Equals ClassEntropy divided by MeanMutualInformation.
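
As a hedged illustration of the definition above, the sketch below computes ClassEntropy divided by MeanMutualInformation for nominal attributes and a nominal class. The helper names and toy data are hypothetical, and the exact estimator OpenML uses is not stated on this page. Because the target of this dataset is numeric, these class-based properties may simply not apply here, which would be consistent with no value being listed.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def equivalent_number_of_atts(attributes, target):
    """Class entropy divided by the mean attribute/class mutual information."""
    mean_mi = sum(mutual_information(a, target) for a in attributes) / len(attributes)
    if mean_mi == 0:
        raise ValueError("mean mutual information is zero; ratio is undefined")
    return entropy(target) / mean_mi


# Toy data: attr1 determines the class, attr2 is independent of it.
target = ["a", "a", "b", "b"]
attr1 = ["x", "x", "y", "y"]
attr2 = ["p", "q", "p", "q"]
result = equivalent_number_of_atts([attr1, attr2], target)
```

The ratio does not depend on the logarithm base, since numerator and denominator scale by the same factor.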

MaxSkewnessOfNumericAtts

3.03

Maximum skewness among attributes of the numeric type.

MinStdDevOfNumericAtts

0.59

Minimum standard deviation of attributes of the numeric type.

PercentageOfMissingValues

1.48

Percentage of missing values.
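
The reported values can be cross-checked against the per-feature missing counts listed earlier on this page. The sketch below assumes that PercentageOfMissingValues is the share of missing cells among all 316 x 12 cells and that Dimensionality is 12 / 316; both readings are inferred from the property descriptions and are not confirmed formulas.

```python
n_instances = 316  # NumberOfInstances
n_features = 12    # from the "12 features" listing

# Missing counts copied from the feature list on this page.
missing_per_feature = {
    "algae_2": 0,
    "season": 0,
    "river_size": 0,
    "fluid_velocity": 0,
    "concentration_1": 2,
    "concentration_2": 2,
    "concentration_3": 16,
    "concentration_4": 2,
    "concentration_5": 2,
    "concentration_6": 2,
    "concentration_7": 7,
    "concentration_8": 23,
}

total_missing = sum(missing_per_feature.values())
pct_missing = 100.0 * total_missing / (n_instances * n_features)
dimensionality = n_features / n_instances

print(total_missing, round(pct_missing, 2), round(dimensionality, 2))
```

Under these assumptions the rounded results agree with the 1.48 and 0.04 shown for PercentageOfMissingValues and Dimensionality.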

Quartile3KurtosisOfNumericAtts

7.44

Third quartile of kurtosis among attributes of the numeric type.

AutoCorrelation

-7.18

Average class difference between consecutive instances.

RandomTreeDepth1Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1
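
For readers unfamiliar with the measure, the following sketch computes Cohen's kappa, assuming that is the kappa coefficient meant by these landmarker properties. The function name and toy labels are hypothetical. Kappa is a classification measure, so it may not be meaningful for the numeric target of this dataset.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Toy labels (not results from this dataset).
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "a", "b", "a"]
kappa = cohen_kappa(y_true, y_pred)
```

A kappa of 1 means perfect agreement, while 0 means agreement no better than chance.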

J48.00001.AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001

MaxStdDevOfNumericAtts

187.1

Maximum standard deviation of attributes of the numeric type.

MinorityClassPercentage

Percentage of instances belonging to the least frequent class.

PercentageOfNumericFeatures

Percentage of numeric attributes.

Quartile3MeansOfNumericAtts

88.65

Third quartile of means among attributes of the numeric type.

CfsSubsetEval_DecisionStumpAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth2AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

J48.00001.ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .00001

MeanAttributeEntropy

Average entropy of the attributes.

MinorityClassSize

Number of instances belonging to the least frequent class.

PercentageOfSymbolicFeatures

Percentage of nominal attributes.

Quartile3MutualInformation

Third quartile of mutual information between the nominal attributes and the target attribute.

CfsSubsetEval_DecisionStumpErrRate

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth2ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

J48.00001.Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .00001

MeanKurtosisOfNumericAtts

4.98

Mean kurtosis among attributes of the numeric type.

NaiveBayesAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes

Quartile1AttributeEntropy

First quartile of entropy among attributes.

Quartile3SkewnessOfNumericAtts

2.5

Third quartile of skewness among attributes of the numeric type.

CfsSubsetEval_DecisionStumpKappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth2Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2

J48.0001.AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001

MeanMeansOfNumericAtts

46.56

Mean of means among attributes of the numeric type.

NaiveBayesErrRate

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes

Quartile1KurtosisOfNumericAtts

1.27

First quartile of kurtosis among attributes of the numeric type.

Quartile3StdDevOfNumericAtts

85.41

Third quartile of standard deviation of attributes of the numeric type.

CfsSubsetEval_NaiveBayesAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth3AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

J48.0001.ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .0001

MeanMutualInformation

Average mutual information between the nominal attributes and the target attribute.

NaiveBayesKappa

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes

Quartile1MeansOfNumericAtts

7.63

First quartile of means among attributes of the numeric type.

REPTreeDepth1AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1

CfsSubsetEval_NaiveBayesErrRate

Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth3ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

J48.0001.Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .0001

MeanNoiseToSignalRatio

An estimate of the amount of irrelevant information in the attributes regarding the class. Equals (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation.
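
The formula in this description can be written directly as a small function. The function name and the input numbers below are hypothetical, chosen only to show the arithmetic.

```python
def mean_noise_to_signal_ratio(mean_attribute_entropy, mean_mutual_information):
    """(MeanAttributeEntropy - MeanMutualInformation) / MeanMutualInformation."""
    if mean_mutual_information == 0:
        raise ValueError("mean mutual information is zero; ratio is undefined")
    noise = mean_attribute_entropy - mean_mutual_information
    return noise / mean_mutual_information


# Illustrative inputs (not values from this dataset).
ratio = mean_noise_to_signal_ratio(1.5, 0.5)
```

A larger ratio indicates that more of the attribute entropy is unrelated to the class.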

NumberOfBinaryFeatures

Number of binary attributes.

Quartile1MutualInformation

First quartile of mutual information between the nominal attributes and the target attribute.

REPTreeDepth1ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 1

CfsSubsetEval_NaiveBayesKappa

Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

RandomTreeDepth3Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3

J48.001.AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .001

MeanNominalAttDistinctValues

3.33

Average number of distinct values among the attributes of the nominal type.

Quartile1SkewnessOfNumericAtts

0.08

First quartile of skewness among attributes of the numeric type.

REPTreeDepth1Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 1

CfsSubsetEval_kNN1NAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

StdvNominalAttDistinctValues

0.58

Standard deviation of the number of distinct values among attributes of the nominal type.

J48.001.ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .001

J48.001.Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .001

MeanSkewnessOfNumericAtts

1.41

Mean skewness among attributes of the numeric type.

Quartile1StdDevOfNumericAtts

2.25

First quartile of standard deviation of attributes of the numeric type.

REPTreeDepth2AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2

CfsSubsetEval_kNN1NErrRate

Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

kNN1NAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk

MajorityClassPercentage

Percentage of instances belonging to the most frequent class.

MeanStdDevOfNumericAtts

48.52

Mean standard deviation of attributes of the numeric type.

Quartile2AttributeEntropy

Second quartile (Median) of entropy among attributes.

REPTreeDepth2ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 2

CfsSubsetEval_kNN1NKappa

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W

kNN1NErrRate

Error rate achieved by the landmarker weka.classifiers.lazy.IBk

MajorityClassSize

Number of instances belonging to the most frequent class.

MinAttributeEntropy

Minimal entropy among attributes.

Quartile2KurtosisOfNumericAtts

4.07

Second quartile (Median) of kurtosis among attributes of the numeric type.

REPTreeDepth2Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 2

ClassEntropy

Entropy of the target attribute values.

kNN1NKappa

Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk

MaxAttributeEntropy

Maximum entropy among attributes.

MinKurtosisOfNumericAtts

0.51

Minimum kurtosis among attributes of the numeric type.

Quartile2MeansOfNumericAtts

12.86

Second quartile (Median) of means among attributes of the numeric type.

REPTreeDepth3AUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3

DecisionStumpAUC

Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump

MaxKurtosisOfNumericAtts

15.78

Maximum kurtosis among attributes of the numeric type.

MinMeansOfNumericAtts

2.96

Minimum of means among attributes of the numeric type.

Quartile2MutualInformation

Second quartile (Median) of mutual information between the nominal attributes and the target attribute.

REPTreeDepth3ErrRate

Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 3

DecisionStumpErrRate

Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump

MaxMeansOfNumericAtts

160.45

Maximum of means among attributes of the numeric type.

MinMutualInformation

Minimal mutual information between the nominal attributes and the target attribute.

Quartile2SkewnessOfNumericAtts

2.04

Second quartile (Median) of skewness among attributes of the numeric type.

REPTreeDepth3Kappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 3

DecisionStumpKappa

Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump

13 tasks

Supervised Regression on kdd_coil_2

0 runs - estimation_procedure: 10-fold Crossvalidation - evaluation_measure: mean_absolute_error - target_feature: algae_2

Supervised Regression on kdd_coil_2

0 runs - estimation_procedure: 10 times 10-fold Crossvalidation - evaluation_measure: mean_absolute_error - target_feature: algae_2
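
The estimation procedures named in these tasks can be sketched with the standard library alone. This is an illustration under stated assumptions: the fold assignment, the seed handling, the mean-predicting baseline and the synthetic target values are all hypothetical stand-ins, not OpenML's actual splits or any model from this page.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validated_mae(y, k=10, seed=0):
    """Average MAE over k folds for a baseline that predicts the training mean."""
    scores = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = sum(train) / len(train)  # stand-in for a real model
        y_test = [y[i] for i in test_idx]
        scores.append(mean_absolute_error(y_test, [prediction] * len(y_test)))
    return sum(scores) / len(scores)


# Synthetic target with 316 values (the real algae_2 values are not used).
y = [float(i % 7) for i in range(316)]
folds = k_fold_indices(len(y), 10)
single = cross_validated_mae(y, k=10, seed=0)
# "10 times 10-fold": repeat with different shuffles and average.
repeated = sum(cross_validated_mae(y, k=10, seed=s) for s in range(10)) / 10
```

A real run would replace the training-mean baseline with a regression model fitted on the other features.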

Clustering on kdd_coil_2