kc1-binary

Status: active. Format: ARFF. Visibility: public (publicly available). Uploaded 06-10-2014 by Joaquin Vanschoren.
  • Data Science Engineering mythbusting_1 PROMISE study_1 study_15 study_20 study_41 study_52 study_7 study_88


Author:
Source: Unknown - Date unknown
Please cite:

This is a PROMISE Software Engineering Repository data set made publicly available in order to encourage repeatable, verifiable, refutable, and/or improvable predictive models of software engineering.

If you publish material based on PROMISE data sets, please follow the acknowledgment guidelines posted on the PROMISE repository web page http://promise.site.uottawa.ca/SERepository .

1. Title: Class-level data for KC1
   This one includes a {_TRUE,FALSE} attribute (DL) to indicate defectiveness.

2. Sources
   (a) Creator: A. Gunes Koru
   (b) Date: February 21, 2005
   (c) Contact: gkoru AT umbc DOT edu, Phone: +1 (410) 455 8843

3. Donor: A. Gunes Koru

4. Past Usage:
   This data was used for:

   A. Gunes Koru and Hongfang Liu, "An Investigation of the Effect of Module Size on Defect Prediction Using Static Measures", PROMISE - Predictive Models in Software Engineering Workshop, ICSE 2005, May 15th 2005, Saint Louis, Missouri, US.

   We used several machine learning algorithms to predict the defective modules in five NASA products, namely, CM1, JM1, KC1, KC2, and PC1. A set of static measures was used as predictor variables. While doing so, we observed that a large portion of the modules were small, as measured by lines of code (LOC). When we experimented on the data subsets created by partitioning according to module size, we obtained higher prediction performance for the subsets that include larger modules. We also performed defect prediction using class-level data for KC1 rather than method-level data. In this case, the use of class-level data resulted in improved prediction performance compared to using method-level data. These findings suggest that quality assurance activities can be guided even better if defect predictions are made by using data that belong to larger modules.

5. Features:
   The descriptions of the features are taken from http://mdp.ivv.nasa.gov/mdp_glossary.html

Feature Used as the Response Variable
======================================

DL: Defect level. _TRUE if the class contains one or more defects, FALSE otherwise.

Features at Class Level Originally
==================================

PERCENT_PUB_DATA: The percentage of data that is public and protected data in a class. In general, lower values indicate greater encapsulation. It is a measure of encapsulation.

ACCESS_TO_PUB_DATA: The number of times that a class's public and protected data is accessed. In general, lower values indicate greater encapsulation. It is a measure of encapsulation.

COUPLING_BETWEEN_OBJECTS: The number of distinct non-inheritance-related classes on which a class depends. If a class that is heavily dependent on many classes outside of its hierarchy is introduced into a library, all the classes upon which it depends need to be introduced as well. This may be acceptable, especially if the classes which it references are already part of a class library and are even more fundamental than the specified class.

DEPTH: The level for a class. For instance, if a parent has one child, the depth for the child is two. Depth indicates at what level a class is located within its class hierarchy. In general, inheritance increases when depth increases.

LACK_OF_COHESION_OF_METHODS: For each data field in a class, the percentage of the methods in the class using that data field; the percentages are averaged, then subtracted from 100%. The locm metric indicates a low or high percentage of cohesion. If the percentage is low, the class is cohesive. If it is high, it may indicate that the class could be split into separate classes that will individually have greater cohesion.

NUM_OF_CHILDREN: The number of classes derived from a specified class.

DEP_ON_CHILD: Whether a class is dependent on a descendant.

FAN_IN: This is a count of calls by higher modules.

RESPONSE_FOR_CLASS: A count of methods implemented within a class plus the number of methods accessible to an object class due to inheritance. In general, lower values indicate greater polymorphism.

WEIGHTED_METHODS_PER_CLASS: A count of methods implemented within a class (rather than all methods accessible within the class hierarchy). In general, lower values indicate greater polymorphism.

Features Transformed to Class Level (Originally at Method Level)
================================================================

Transformation was achieved by obtaining min, max, sum, and avg values over all the methods in a class. Therefore, this data set includes four features for each of the following features that were originally at the method level but transformed to the class level. For example, LOC_BLANK has minLOC_BLANK, maxLOC_BLANK, avgLOC_BLANK, and sumLOC_BLANK.

LOC_BLANK: Lines with only white space or no text content.

BRANCH_COUNT: This metric is the number of branches for each module. Branches are defined as those edges that exit from a decision node. The greater the number of branches in a program's modules, the more testing resources are required.

LOC_CODE_AND_COMMENT: Lines that contain both code and comment.

LOC_COMMENTS: The number of lines in a module. This particular metric includes all blank lines, comment lines, and source lines.

CYCLOMATIC_COMPLEXITY: It is a measure of the complexity of a module's decision structure. It is the number of linearly independent paths.

DESIGN_COMPLEXITY: Design complexity is a measure of a module's decision structure as it relates to calls to other modules. This quantifies the testing effort related to integration.

ESSENTIAL_COMPLEXITY: Essential complexity is a measure of the degree to which a module contains unstructured constructs.

LOC_EXECUTABLE: Source lines of code that contain only code and white space.

HALSTEAD_CONTENT: Complexity of a given algorithm independent of the language used to express the algorithm.

HALSTEAD_DIFFICULTY: Level of difficulty in the program.

HALSTEAD_EFFORT: Estimated mental effort required to develop the program.

HALSTEAD_ERROR_EST: Estimated number of errors in the program.

HALSTEAD_LENGTH: This is a Halstead metric that includes the total number of operator occurrences and total number of operand occurrences.

HALSTEAD_LEVEL: Level at which the program can be understood.

HALSTEAD_PROG_TIME: Estimated amount of time to implement the algorithm.

HALSTEAD_VOLUME: This is a Halstead metric that contains the minimum number of bits required for coding the program.

NUM_OPERANDS: Variables and identifiers; constants (numeric literal/string); function names when used during calls.

NUM_UNIQUE_OPERANDS: Variables and identifiers; constants (numeric literal/string); function names when used during calls.

NUM_UNIQUE_OPERATORS: Number of unique operators.

LOC_TOTAL: Total Lines of Code.
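The LACK_OF_COHESION_OF_METHODS definition above is a small piece of arithmetic, so a worked sketch may help. The function name `lack_of_cohesion`, the input layout (a mapping from field name to the set of methods using it), and the example class are all assumptions made for illustration; they are not part of the data set or of the tool that produced it.

```python
def lack_of_cohesion(field_usage, num_methods):
    """Return the lack-of-cohesion percentage as described in the text.

    field_usage: dict mapping each data field to the set of methods using it.
    num_methods: total number of methods in the class.
    """
    if not field_usage or num_methods == 0:
        # Assumption: a class without fields or methods is reported as 0.0.
        return 0.0
    # For each field, the percentage of the class's methods that use it.
    percentages = [
        100.0 * len(methods) / num_methods for methods in field_usage.values()
    ]
    # Average the percentages, then subtract from 100%.
    return 100.0 - sum(percentages) / len(percentages)


# Hypothetical class with 4 methods and 2 data fields.
usage = {
    "balance": {"deposit", "withdraw", "get_balance"},  # used by 3 of 4 methods
    "owner": {"get_owner"},                             # used by 1 of 4 methods
}
lcom = lack_of_cohesion(usage, 4)
print(lcom)  # (75% + 25%) / 2 = 50%, so 100% - 50% = 50.0
```

A low result means most methods touch most fields (a cohesive class); a high result suggests the class could be split.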

95 features

DL (target, nominal): 2 unique values, 0 missing
PERCENT_PUB_DATA (numeric): 12 unique values, 0 missing
ACCESS_TO_PUB_DATA (numeric): 1 unique value, 0 missing
COUPLING_BETWEEN_OBJECTS (numeric): 25 unique values, 0 missing
DEPTH (numeric): 7 unique values, 0 missing
LACK_OF_COHESION_OF_METHODS (numeric): 41 unique values, 0 missing
NUM_OF_CHILDREN (numeric): 6 unique values, 0 missing
DEP_ON_CHILD (numeric): 2 unique values, 0 missing
FAN_IN (numeric): 4 unique values, 0 missing
RESPONSE_FOR_CLASS (numeric): 63 unique values, 0 missing
WEIGHTED_METHODS_PER_CLASS (numeric): 39 unique values, 0 missing
minLOC_BLANK (numeric): 1 unique value, 0 missing
minBRANCH_COUNT (numeric): 1 unique value, 0 missing
minLOC_CODE_AND_COMMENT (numeric): 1 unique value, 0 missing
minLOC_COMMENTS (numeric): 1 unique value, 0 missing
minCYCLOMATIC_COMPLEXITY (numeric): 1 unique value, 0 missing
minDESIGN_COMPLEXITY (numeric): 1 unique value, 0 missing
minESSENTIAL_COMPLEXITY (numeric): 1 unique value, 0 missing
minLOC_EXECUTABLE (numeric): 5 unique values, 0 missing
minHALSTEAD_CONTENT (numeric): 13 unique values, 0 missing
minHALSTEAD_DIFFICULTY (numeric): 7 unique values, 0 missing
minHALSTEAD_EFFORT (numeric): 12 unique values, 0 missing
minHALSTEAD_ERROR_EST (numeric): 2 unique values, 0 missing
minHALSTEAD_LENGTH (numeric): 9 unique values, 0 missing
minHALSTEAD_LEVEL (numeric): 16 unique values, 0 missing
minHALSTEAD_PROG_TIME (numeric): 12 unique values, 0 missing
minHALSTEAD_VOLUME (numeric): 8 unique values, 0 missing
minNUM_OPERANDS (numeric): 6 unique values, 0 missing
minNUM_OPERATORS (numeric): 7 unique values, 0 missing
minNUM_UNIQUE_OPERANDS (numeric): 6 unique values, 0 missing
minNUM_UNIQUE_OPERATORS (numeric): 6 unique values, 0 missing
minLOC_TOTAL (numeric): 7 unique values, 0 missing
maxLOC_BLANK (numeric): 25 unique values, 0 missing
maxBRANCH_COUNT (numeric): 38 unique values, 0 missing
maxLOC_CODE_AND_COMMENT (numeric): 12 unique values, 0 missing
maxLOC_COMMENTS (numeric): 26 unique values, 0 missing
maxCYCLOMATIC_COMPLEXITY (numeric): 30 unique values, 0 missing
maxDESIGN_COMPLEXITY (numeric): 24 unique values, 0 missing
maxESSENTIAL_COMPLEXITY (numeric): 19 unique values, 0 missing
maxLOC_EXECUTABLE (numeric): 82 unique values, 0 missing
maxHALSTEAD_CONTENT (numeric): 122 unique values, 0 missing
maxHALSTEAD_DIFFICULTY (numeric): 112 unique values, 0 missing
maxHALSTEAD_EFFORT (numeric): 123 unique values, 0 missing
maxHALSTEAD_ERROR_EST (numeric): 63 unique values, 0 missing
maxHALSTEAD_LENGTH (numeric): 104 unique values, 0 missing
maxHALSTEAD_LEVEL (numeric): 18 unique values, 0 missing
maxHALSTEAD_PROG_TIME (numeric): 123 unique values, 0 missing
maxHALSTEAD_VOLUME (numeric): 118 unique values, 0 missing
maxNUM_OPERANDS (numeric): 88 unique values, 0 missing
maxNUM_OPERATORS (numeric): 97 unique values, 0 missing
maxNUM_UNIQUE_OPERANDS (numeric): 63 unique values, 0 missing
maxNUM_UNIQUE_OPERATORS (numeric): 31 unique values, 0 missing
maxLOC_TOTAL (numeric): 85 unique values, 0 missing
avgLOC_BLANK (numeric): 83 unique values, 0 missing
avgBRANCH_COUNT (numeric): 95 unique values, 0 missing
avgLOC_CODE_AND_COMMENT (numeric): 33 unique values, 0 missing
avgLOC_COMMENTS (numeric): 69 unique values, 0 missing
avgCYCLOMATIC_COMPLEXITY (numeric): 90 unique values, 0 missing
avgDESIGN_COMPLEXITY (numeric): 92 unique values, 0 missing
avgESSENTIAL_COMPLEXITY (numeric): 60 unique values, 0 missing
avgLOC_EXECUTABLE (numeric): 114 unique values, 0 missing
avgHALSTEAD_CONTENT (numeric): 133 unique values, 0 missing
avgHALSTEAD_DIFFICULTY (numeric): 125 unique values, 0 missing
avgHALSTEAD_EFFORT (numeric): 133 unique values, 0 missing
avgHALSTEAD_ERROR_EST (numeric): 110 unique values, 0 missing
avgHALSTEAD_LENGTH (numeric): 129 unique values, 0 missing
avgHALSTEAD_LEVEL (numeric): 129 unique values, 0 missing
avgHALSTEAD_PROG_TIME (numeric): 132 unique values, 0 missing
avgHALSTEAD_VOLUME (numeric): 133 unique values, 0 missing
avgNUM_OPERANDS (numeric): 122 unique values, 0 missing
avgNUM_OPERATORS (numeric): 126 unique values, 0 missing
avgNUM_UNIQUE_OPERANDS (numeric): 116 unique values, 0 missing
avgNUM_UNIQUE_OPERATORS (numeric): 115 unique values, 0 missing
avgLOC_TOTAL (numeric): 124 unique values, 0 missing
sumLOC_BLANK (numeric): 57 unique values, 0 missing
sumBRANCH_COUNT (numeric): 85 unique values, 0 missing
sumLOC_CODE_AND_COMMENT (numeric): 16 unique values, 0 missing
sumLOC_COMMENTS (numeric): 41 unique values, 0 missing
sumCYCLOMATIC_COMPLEXITY (numeric): 70 unique values, 0 missing
sumDESIGN_COMPLEXITY (numeric): 70 unique values, 0 missing
sumESSENTIAL_COMPLEXITY (numeric): 51 unique values, 0 missing
sumLOC_EXECUTABLE (numeric): 108 unique values, 0 missing
sumHALSTEAD_CONTENT (numeric): 131 unique values, 0 missing
sumHALSTEAD_DIFFICULTY (numeric): 124 unique values, 0 missing
sumHALSTEAD_EFFORT (numeric): 131 unique values, 0 missing
sumHALSTEAD_ERROR_EST (numeric): 90 unique values, 0 missing
sumHALSTEAD_LENGTH (numeric): 118 unique values, 0 missing
sumHALSTEAD_LEVEL (numeric): 119 unique values, 0 missing
sumHALSTEAD_PROG_TIME (numeric): 130 unique values, 0 missing
sumHALSTEAD_VOLUME (numeric): 131 unique values, 0 missing
sumNUM_OPERANDS (numeric): 117 unique values, 0 missing
sumNUM_OPERATORS (numeric): 116 unique values, 0 missing
sumNUM_UNIQUE_OPERANDS (numeric): 99 unique values, 0 missing
sumNUM_UNIQUE_OPERATORS (numeric): 101 unique values, 0 missing
sumLOC_TOTAL (numeric): 121 unique values, 0 missing
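
The min/max/avg/sum column families in the list above come from aggregating each method-level metric over the methods of a class. The following is a minimal sketch of that aggregation and of how the 95 columns add up. The helper `to_class_level`, the `"class"` key, and the toy method rows are hypothetical; only the metric names and the prefix naming scheme are taken from the feature list.

```python
from statistics import mean

# Prefixes as they appear in the feature list, mapped to their aggregate.
AGGREGATES = {"min": min, "max": max, "avg": mean, "sum": sum}

# The 21 method-level metric names that appear with each prefix in the list.
METHOD_LEVEL = [
    "LOC_BLANK", "BRANCH_COUNT", "LOC_CODE_AND_COMMENT", "LOC_COMMENTS",
    "CYCLOMATIC_COMPLEXITY", "DESIGN_COMPLEXITY", "ESSENTIAL_COMPLEXITY",
    "LOC_EXECUTABLE", "HALSTEAD_CONTENT", "HALSTEAD_DIFFICULTY",
    "HALSTEAD_EFFORT", "HALSTEAD_ERROR_EST", "HALSTEAD_LENGTH",
    "HALSTEAD_LEVEL", "HALSTEAD_PROG_TIME", "HALSTEAD_VOLUME",
    "NUM_OPERANDS", "NUM_OPERATORS", "NUM_UNIQUE_OPERANDS",
    "NUM_UNIQUE_OPERATORS", "LOC_TOTAL",
]
NUM_CLASS_LEVEL = 10  # PERCENT_PUB_DATA ... WEIGHTED_METHODS_PER_CLASS

# 1 target (DL) + 10 class-level + 4 aggregates x 21 method-level metrics.
total_columns = 1 + NUM_CLASS_LEVEL + len(AGGREGATES) * len(METHOD_LEVEL)


def to_class_level(method_rows, metrics):
    """Aggregate per-method metric rows into one feature dict per class."""
    by_class = {}
    for row in method_rows:
        by_class.setdefault(row["class"], []).append(row)
    result = {}
    for cls, rows in by_class.items():
        features = {}
        for metric in metrics:
            values = [r[metric] for r in rows]
            for prefix, aggregate in AGGREGATES.items():
                features[prefix + metric] = aggregate(values)
        result[cls] = features
    return result


# Hypothetical method-level rows for two classes.
methods = [
    {"class": "Stack", "LOC_BLANK": 0, "BRANCH_COUNT": 1},
    {"class": "Stack", "LOC_BLANK": 2, "BRANCH_COUNT": 5},
    {"class": "Stack", "LOC_BLANK": 1, "BRANCH_COUNT": 3},
    {"class": "Queue", "LOC_BLANK": 4, "BRANCH_COUNT": 7},
]
class_level = to_class_level(methods, ["LOC_BLANK", "BRANCH_COUNT"])
print(total_columns)
print(class_level["Stack"])
```

Under these assumptions the column count works out to the 95 features reported for the data set.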

107 properties

Number of instances (rows) of the dataset: 145
Number of attributes (columns) of the dataset: 95
Number of distinct values of the target attribute (if it is nominal): 2
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 94
Number of nominal attributes: 1
Third quartile of mutual information between the nominal attributes and the target attribute: (no value listed)
Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W: 0.35
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2: 0.3
Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .00001: 0.22
Average entropy of the attributes: (no value listed)
Number of instances belonging to the least frequent class: 60
Percentage of nominal attributes: 1.05
Third quartile of skewness among attributes of the numeric type: 3.96
Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W: 0.3
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2: 0.39
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .0001: 0.66
Mean kurtosis among attributes of the numeric type: 13.16
Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes: 0.8
First quartile of entropy among attributes: (no value listed)
Third quartile of standard deviation of attributes of the numeric type: 62.41
Area Under the ROC Curve achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W: 0.66
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3: 0.7
Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .0001: 0.37
Mean of means among attributes of the numeric type: 1290.47
Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes: 0.27
First quartile of kurtosis among attributes of the numeric type: 2.44
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 1: 0.69
Error rate achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W: 0.35
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3: 0.3
Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .0001: 0.22
Average mutual information between the nominal attributes and the target attribute: (no value listed)
Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes: 0.41
First quartile of means among attributes of the numeric type: 1
Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 1: 0.34
Kappa coefficient achieved by the landmarker weka.classifiers.bayes.NaiveBayes -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W: 0.3
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 3: 0.39
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .001: 0.66
An estimate of the amount of irrelevant information in the attributes regarding the class, equal to (MeanAttributeEntropy - MeanMutualInformation) divided by MeanMutualInformation: (no value listed)
Number of binary attributes: 1
First quartile of mutual information between the nominal attributes and the target attribute: (no value listed)
Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 1: 0.34
Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W: 0.66
Standard deviation of the number of distinct values among attributes of the nominal type: 0
Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .001: 0.37
Average number of distinct values among the attributes of the nominal type: 2
First quartile of skewness among attributes of the numeric type: 1.46
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 2: 0.69
Error rate achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W: 0.35
Area Under the ROC Curve achieved by the landmarker weka.classifiers.lazy.IBk: 0.63
Kappa coefficient achieved by the landmarker weka.classifiers.trees.J48 -C .001: 0.22
Mean skewness among attributes of the numeric type: 2.73
First quartile of standard deviation of attributes of the numeric type: 0.94
Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 2: 0.34
Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W: 0.3
Error rate achieved by the landmarker weka.classifiers.lazy.IBk: 0.37
Percentage of instances belonging to the most frequent class: 58.62
Mean standard deviation of attributes of the numeric type: 2999.29
Second quartile (Median) of entropy among attributes: (no value listed)
Second quartile (Median) of kurtosis among attributes of the numeric type: 8.18
Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 2: 0.34
Entropy of the target attribute values: 0.98
Kappa coefficient achieved by the landmarker weka.classifiers.lazy.IBk: 0.24
Number of instances belonging to the most frequent class: 85
Minimal entropy among attributes: (no value listed)
Second quartile (Median) of means among attributes of the numeric type: 7.4
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.REPTree -L 3: 0.69
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump: 0.73
Maximum entropy among attributes: (no value listed)
Minimum kurtosis among attributes of the numeric type: -0.82
Second quartile (Median) of mutual information between the nominal attributes and the target attribute: (no value listed)
Error rate achieved by the landmarker weka.classifiers.trees.REPTree -L 3: 0.34
Error rate achieved by the landmarker weka.classifiers.trees.DecisionStump: 0.3
Maximum kurtosis among attributes of the numeric type: 69.94
Minimum of means among attributes of the numeric type: 0
Second quartile (Median) of skewness among attributes of the numeric type: 2.4
Kappa coefficient achieved by the landmarker weka.classifiers.trees.REPTree -L 3: 0.34
Kappa coefficient achieved by the landmarker weka.classifiers.trees.DecisionStump: 0.43
Maximum of means among attributes of the numeric type: 76249.59
Minimal mutual information between the nominal attributes and the target attribute: (no value listed)
Second quartile (Median) of standard deviation of attributes of the numeric type: 7.96
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1: 0.7
Number of attributes divided by the number of instances: 0.66
Maximum mutual information between the nominal attributes and the target attribute: (no value listed)
The minimal number of distinct values among attributes of the nominal type: 2
Percentage of binary attributes: 1.05
Third quartile of entropy among attributes: (no value listed)
Error rate achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1: 0.3
Number of attributes needed to optimally describe the class (under the assumption of independence among attributes), equal to ClassEntropy divided by MeanMutualInformation: (no value listed)
The maximum number of distinct values among attributes of the nominal type: 2
Minimum skewness among attributes of the numeric type: -1.11
Percentage of instances having missing values: 0
Third quartile of kurtosis among attributes of the numeric type: 16.95
Average class difference between consecutive instances: 0.66
Kappa coefficient achieved by the landmarker weka.classifiers.trees.RandomTree -depth 1: 0.39
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.J48 -C .00001: 0.66
Maximum skewness among attributes of the numeric type: 8.42
Minimum standard deviation of attributes of the numeric type: 0
Percentage of missing values: 0
Third quartile of means among attributes of the numeric type: 59.62
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.DecisionStump -E "weka.attributeSelection.CfsSubsetEval -P 1 -E 1" -S "weka.attributeSelection.BestFirst -D 1 -N 5" -W: 0.66
Area Under the ROC Curve achieved by the landmarker weka.classifiers.trees.RandomTree -depth 2: 0.7
Error rate achieved by the landmarker weka.classifiers.trees.J48 -C .00001: 0.37
Maximum standard deviation of attributes of the numeric type: 200468.25
Percentage of instances belonging to the least frequent class: 41.38
Percentage of numeric attributes: 98.95
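
Several of the simple properties above can be re-derived from the reported counts alone (145 instances, 95 attributes, class sizes 85 and 60). The sketch below does so as a consistency check; the helper names `pct` and `entropy_bits` are made up for illustration, and rounding to two decimals is an assumption that matches how the values are displayed. The last two functions only restate the formulas quoted in the property descriptions, since the page lists no value for them.

```python
from math import log2

# Counts reported in the property list.
n_instances = 145
n_attributes = 95
n_numeric = 94
n_nominal = 1
class_sizes = [85, 60]  # most frequent and least frequent class


def pct(part, whole):
    """Percentage rounded to two decimals (rounding is an assumption)."""
    return round(100.0 * part / whole, 2)


def entropy_bits(counts):
    """Shannon entropy in bits of a discrete distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


majority_pct = pct(class_sizes[0], n_instances)         # most frequent class
minority_pct = pct(class_sizes[1], n_instances)         # least frequent class
dimensionality = round(n_attributes / n_instances, 2)   # attributes / instances
nominal_pct = pct(n_nominal, n_attributes)
numeric_pct = pct(n_numeric, n_attributes)
class_entropy = round(entropy_bits(class_sizes), 2)


def equivalent_number_of_attributes(class_entropy_value, mean_mutual_information):
    """ClassEntropy divided by MeanMutualInformation, as quoted above."""
    return class_entropy_value / mean_mutual_information


def noise_to_signal(mean_attribute_entropy, mean_mutual_information):
    """(MeanAttributeEntropy - MeanMutualInformation) / MeanMutualInformation."""
    return (mean_attribute_entropy - mean_mutual_information) / mean_mutual_information


print(majority_pct, minority_pct, dimensionality,
      nominal_pct, numeric_pct, class_entropy)
```

Under these assumptions the recomputed values agree with the listed ones (58.62, 41.38, 0.66, 1.05, 98.95, and 0.98).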

14 tasks

553 runs - estimation_procedure: 10-fold Crossvalidation - evaluation_measure: predictive_accuracy - target_feature: DL
212 runs - estimation_procedure: 10 times 10-fold Crossvalidation - evaluation_measure: predictive_accuracy - target_feature: DL
0 runs - estimation_procedure: Interleaved Test then Train - target_feature: DL
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
0 runs - estimation_procedure: 50 times Clustering
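
To make the first task's setup concrete (10-fold cross-validation scored by predictive accuracy on DL), here is a minimal stdlib-only sketch. It is not how OpenML or Weka implement the procedure: the split is a plain shuffled, unstratified partition, the classifier is a majority-class baseline, and the labels are placeholders built from the reported class sizes (the page does not say which DL value is the more frequent one).

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_label(labels):
    """Most frequent label in a list (the baseline 'model')."""
    return max(set(labels), key=labels.count)


# Placeholder labels: 85 instances of one class, 60 of the other.
labels = ["class_a"] * 85 + ["class_b"] * 60

folds = k_fold_indices(len(labels), 10)
correct = 0
for test_idx in folds:
    held_out = set(test_idx)
    # Fit on the nine training folds, predict on the held-out fold.
    train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
    prediction = majority_label(train_labels)
    correct += sum(labels[i] == prediction for i in test_idx)

# Accuracy pooled over all ten held-out folds.
accuracy = correct / len(labels)
print(round(accuracy, 4))
```

Because the larger class stays in the majority in every training split, this baseline scores 85/145 (about 0.586), which is the floor any run on the task should beat.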