C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling
- Chris Drummond, Institute for Information Technology, National Research Council Canada
- Robert Holte, Department of Computing Science, University of Alberta
Abstract
This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over-sampling and under-sampling with the decision tree learner C4.5. C4.5 was chosen because, combined with one of these sampling schemes, it is quickly becoming the community standard against which new cost-sensitive learning algorithms are evaluated. The paper shows that using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison, but it recommends that the least-cost classifier be part of that standard, as it can be better than under-sampling for relatively modest costs. Over-sampling, by contrast, shows little sensitivity: there is often little difference in performance when misclassification costs are changed.
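The two sampling schemes compared in the paper are random under-sampling (discarding majority-class examples) and random over-sampling (duplicating minority-class examples). Below is a minimal sketch of both, assuming a NumPy feature matrix X, a label vector y, and a 1:1 target class ratio; the array names, class labels, and target ratio are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch (not the authors' code): random under- and over-sampling
# to balance the classes before training a decision tree such as C4.5.
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, majority_label):
    """Randomly discard majority-class examples until the classes are balanced."""
    maj = np.flatnonzero(y == majority_label)
    minr = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj, size=len(minr), replace=False)
    idx = np.concatenate([keep, minr])
    return X[idx], y[idx]

def oversample(X, y, minority_label):
    """Randomly duplicate minority-class examples until the classes are balanced."""
    minr = np.flatnonzero(y == minority_label)
    maj = np.flatnonzero(y != minority_label)
    extra = rng.choice(minr, size=len(maj) - len(minr), replace=True)
    idx = np.concatenate([maj, minr, extra])
    return X[idx], y[idx]

# Toy imbalanced data: 90 negatives, 10 positives.
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

Xu, yu = undersample(X, y, majority_label=0)  # 10 examples per class
Xo, yo = oversample(X, y, minority_label=1)   # 90 examples per class
```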
Citation
C. Drummond and R. Holte. "C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling". Workshop on Learning from Imbalanced Datasets II, January 2003.
| Keywords: | under-sampling, over-sampling, machine learning |
| Category: | In Workshop |
BibTeX
@inproceedings{Drummond+Holte:03,
  author    = {Chris Drummond and Robert Holte},
  title     = {C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling
    Beats Over-Sampling},
  booktitle = {Workshop on Learning from Imbalanced Datasets II},
  year      = {2003},
}
Last Updated: August 13, 2007
Submitted by Russ Greiner