C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling
- Chris Drummond, Institute for Information Technology, National Research Council Canada
- Robert Holte, Department of Computing Science, University of Alberta
This paper takes a new look at two sampling schemes commonly used to adapt
machine learning algorithms to imbalanced classes and misclassification costs.
It uses a performance analysis technique called cost curves to explore the
interaction of over- and under-sampling with the decision tree learner C4.5.
C4.5 was chosen because, when combined with one of the sampling schemes, it is
quickly becoming the community standard for evaluating new cost-sensitive
learning algorithms. This paper shows that using C4.5 with under-sampling
establishes a reasonable standard for algorithmic comparison. But it is
recommended that the least-cost classifier be part of that standard, as it can
be better than under-sampling for relatively modest costs. Over-sampling,
however, shows little sensitivity: there is often little difference in
performance when misclassification costs are changed.
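For readers unfamiliar with the two sampling schemes, the following is a minimal sketch of random under-sampling and over-sampling, not the authors' implementation. The function name `resample`, its `mode` and `seed` parameters, and the choice to bring every class to exactly the same size are assumptions made for illustration; none of them is taken from the paper.

```python
import random
from collections import defaultdict


def resample(rows, labels, mode, seed=0):
    """Balance classes by random under- or over-sampling (illustrative only).

    mode="under": shrink every class to the size of the smallest class.
    mode="over":  grow every class to the size of the largest class by
                  duplicating randomly chosen members.
    """
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")

    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    sizes = [len(members) for members in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if mode == "under":
            # Draw without replacement, discarding the surplus examples.
            chosen = rng.sample(members, target)
        else:
            # Keep every original example and add random duplicates.
            extra = [rng.choice(members) for _ in range(target - len(members))]
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels
```

With 90 negative and 10 positive examples, `mode="under"` would leave 10 of each class and `mode="over"` would yield 90 of each. The resampled data would then be passed to a learner such as C4.5; that step is omitted here.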
Citation
C. Drummond, R. Holte. "C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling". Workshop on Learning from Imbalanced Datasets II, January 2003.
Keywords: under-sampling, over-sampling, machine learning
Category: In Workshop
BibTeX
@inproceedings{Drummond+Holte:03,
  author = {Chris Drummond and Robert Holte},
  title = {C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling},
  booktitle = {Workshop on Learning from Imbalanced Datasets II},
  year = 2003,
}
Last Updated: August 13, 2007
Submitted by Russ Greiner