International Journal of Data Science and Analysis

| Peer-Reviewed |

Issues of Class Imbalance in Classification of Binary Data: A Review

Received: 25 September 2019    Accepted: 08 November 2019    Published: 17 November 2019

Abstract

Handling classification problems with class-imbalanced data has gained the attention of researchers in the last few years. The class imbalance problem arises when one of two classes has more samples than the other. The class with more samples is called the major class, while the other is referred to as the minor class. Most classification or prediction models focus on classifying or predicting the major class correctly and ignore the minor class. In this paper, various data pre-processing approaches to improve model accuracy were reviewed, with application to terminated pregnancy data. The data were extracted from the 2013 Nigeria Demographic and Health Survey (NDHS). The response variable is “terminated pregnancy” (women of reproductive age were asked whether they had ever experienced a terminated pregnancy), which has two possible classes (“YES” or “NO”) that exhibited class imbalance. The major class (“NO”), representing Nigerian women aged 15–49 years who had never experienced a terminated pregnancy, is 86.82% of the sample, while the minor class (“YES”) is 13.18%. Hence, different resampling techniques were used to handle the problem and to improve model performance. Among the resampling techniques considered, the Synthetic Minority Oversampling Technique (SMOTE) improved the model the most. The following socio-demographic factors were significantly associated with having a terminated pregnancy in Nigeria: age, age at first birth, residential area, region, and education level of women.
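
To illustrate the oversampling idea named in the abstract, the following is a minimal SMOTE-style sketch in Python. It is not the authors' implementation and does not use the NDHS data: the two-feature toy points, the class sizes (868 and 132, chosen only to mimic the reported 86.82% / 13.18% split), the function name `smote_sketch`, and the parameter `k` are all assumptions made for illustration. Each synthetic point is placed at a random position on the line segment between a minority point and one of its nearest minority neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic minority points (hypothetical helper).

    Each point is interpolated between a randomly chosen minority point
    and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy data only: class sizes mimic the shares reported in the abstract.
rng = random.Random(42)
major = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(868)]
minor = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(132)]

# Oversample the minor class until both classes have the same size.
new_points = smote_sketch(minor, n_new=len(major) - len(minor))
balanced_minor = minor + new_points
print(len(major), len(balanced_minor))
```

For context on why resampling is considered at all: with the class shares reported above, a model that always predicts “NO” would reach 86.82% accuracy while identifying none of the minor-class cases. In practice, resampling of this kind is usually applied to the training data only, so that evaluation reflects the original class distribution.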

DOI 10.11648/j.ijdsa.20190506.13
Published in International Journal of Data Science and Analysis (Volume 5, Issue 6, December 2019)
Page(s) 123-127
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Classification, Class Imbalanced, Resampling Techniques, Logistic Model, Terminated Pregnancy

Author Information
  • Department of Statistics and Mathematical Sciences, Kwara State University, Ilorin, Nigeria

  • Department of Statistics, University of Ilorin, Ilorin, Nigeria

Cite This Article
  • APA Style

    Samuel Adewale Aderoju, Emmanuel Teju Jolayemi. (2019). Issues of Class Imbalance in Classification of Binary Data: A Review. International Journal of Data Science and Analysis, 5(6), 123-127. https://doi.org/10.11648/j.ijdsa.20190506.13

    ACS Style

    Samuel Adewale Aderoju; Emmanuel Teju Jolayemi. Issues of Class Imbalance in Classification of Binary Data: A Review. Int. J. Data Sci. Anal. 2019, 5(6), 123-127. doi: 10.11648/j.ijdsa.20190506.13

    AMA Style

    Samuel Adewale Aderoju, Emmanuel Teju Jolayemi. Issues of Class Imbalance in Classification of Binary Data: A Review. Int J Data Sci Anal. 2019;5(6):123-127. doi: 10.11648/j.ijdsa.20190506.13

  • @article{10.11648/j.ijdsa.20190506.13,
      author = {Samuel Adewale Aderoju and Emmanuel Teju Jolayemi},
      title = {Issues of Class Imbalance in Classification of Binary Data: A Review},
      journal = {International Journal of Data Science and Analysis},
      volume = {5},
      number = {6},
      pages = {123-127},
      doi = {10.11648/j.ijdsa.20190506.13},
      url = {https://doi.org/10.11648/j.ijdsa.20190506.13},
      eprint = {https://download.sciencepg.com/pdf/10.11648.j.ijdsa.20190506.13},
     year = {2019}
    }
    

  • TY  - JOUR
    T1  - Issues of Class Imbalance in Classification of Binary Data: A Review
    AU  - Samuel Adewale Aderoju
    AU  - Emmanuel Teju Jolayemi
    Y1  - 2019/11/17
    PY  - 2019
    N1  - https://doi.org/10.11648/j.ijdsa.20190506.13
    DO  - 10.11648/j.ijdsa.20190506.13
    T2  - International Journal of Data Science and Analysis
    JF  - International Journal of Data Science and Analysis
    JO  - International Journal of Data Science and Analysis
    SP  - 123
    EP  - 127
    PB  - Science Publishing Group
    SN  - 2575-1891
    UR  - https://doi.org/10.11648/j.ijdsa.20190506.13
    VL  - 5
    IS  - 6
    ER  - 
