:7 October 2016
Imbalance is a common occurrence in real-world datasets wherein the classes do not have (roughly) equal priors. For example, you may have a binary classification problem with 1000 samples. A total of 900 samples are labeled 0 and the remaining 100 samples are labeled 1. This is an imbalanced dataset and the ratio of Class-0 to Class-1 instances is 900:100 or more concisely 9:1. You can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Imbalanced data pose serious challenges to the task of classification. Dealing with imbalanced classes can be frustrating. You may discover that all the great results you were getting is a lie (i.e., accuracy paradox). Your accuracy measures might tell you that your models are doing great, but this accuracy might only be reflecting the underlying class distribution. In other words, your model is very likely predicting one class regardless of the data it is asked to predict. One approach to combating imbalanced data is to under-sample the majority class. This increases the sensitivity of the classifier to the minority class, but you are throwing away valuable data. Or you can over-sample the minority class which blunts the sensitivity of the classifier to the majority class. Or you can do both, intelligently. This is precisely what is behind the SMOTE technique (Chawla et. al., 2002). A combination of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. If you are frustrated by imbalanced classes, check out the SMOTE implementation in the imbalanced-learn Python package.
- Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321-357.
Last modified 7 October 2016 at 11:11 pm by svattam