C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling

This paper takes a new look at two sampling schemes commonly used to adapt machine learning algorithms to imbalanced classes and misclassification costs. It uses a performance analysis technique called cost curves to explore the interaction of over- and under-sampling with the decision tree learner C4.5. C4.5 was chosen because, when combined with one of the sampling schemes, it is quickly becoming the community standard for evaluating new cost-sensitive learning algorithms. This paper shows that using C4.5 with under-sampling establishes a reasonable standard for algorithmic comparison. It is recommended, though, that the least-cost classifier be part of that standard, as it can be better than under-sampling for relatively modest costs. Over-sampling, however, shows little sensitivity: there is often little difference in performance when misclassification costs are changed.
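
For readers unfamiliar with the two sampling schemes named in the abstract, the following is a minimal sketch of random under-sampling and random over-sampling. It assumes simple random sampling to a 1:1 class ratio, which is an illustrative choice and not necessarily the procedure used in the paper; the function names `under_sample` and `over_sample` are hypothetical.

```python
import random


def under_sample(majority, minority, seed=0):
    """Randomly discard majority-class examples until both classes are the same size.

    Assumption: a 1:1 target ratio, chosen only for illustration.
    """
    rng = random.Random(seed)
    keep = min(len(majority), len(minority))
    # Sample without replacement, so no majority example is repeated.
    return rng.sample(list(majority), keep), list(minority)


def over_sample(majority, minority, seed=0):
    """Randomly duplicate minority-class examples until both classes are the same size.

    Assumption: a 1:1 target ratio, chosen only for illustration.
    """
    rng = random.Random(seed)
    missing = max(0, len(majority) - len(minority))
    # Sample with replacement, so minority examples appear more than once.
    extra = [rng.choice(list(minority)) for _ in range(missing)]
    return list(majority), list(minority) + extra
```

Under-sampling shrinks the training set by removing majority examples, while over-sampling grows it by repeating minority examples. In both cases the resampled data would then be passed to a learner such as C4.5.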

Citation

C. Drummond, R. Holte. "C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling Beats Over-Sampling". Workshop on Learning from Imbalanced Datasets II, January 2003.

Keywords: under-sampling, over-sampling, machine learning
Category: In Workshop

BibTeX

@inproceedings{Drummond+Holte:03,
  author = {Chris Drummond and Robert Holte},
  title = {C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling
    Beats Over-Sampling},
  booktitle = {Workshop on Learning from Imbalanced Datasets II},
  year = 2003,
}

Last Updated: August 13, 2007
Submitted by Russ Greiner