PERFORMANCE ANALYSIS OF THE IMBALANCED DATA METHOD ON INCREASING THE CLASSIFICATION ACCURACY OF THE MACHINE LEARNING HYBRID METHOD

Azmi Aulia Rahman
Sri Suryani Prasetiyowati
Yuliant Sibaroni


DOI: https://doi.org/10.29100/jipi.v8i1.3286

Abstract


This study analyzes how well hybrid methods improve classification accuracy on imbalanced data, using Dengue Hemorrhagic Fever case data from Bandung City for 2017 to 2021. The attributes used are Total Population, Total Male, Elementary School Graduation, Junior High School Graduation, High School Graduation, College Graduation, Rainfall, Average Temperature, Humidity, Male Cases, Number of Cases, and Class. The research combines five Machine Learning methods: Decision Tree, Support Vector Machine, Artificial Neural Network, K-Nearest Neighbor, and Naive Bayes. The hybrid methods used are Voting and Stacking, and the oversampling methods used to handle the imbalanced data are Random Oversampling and Adasyn. Without oversampling, Voting and Stacking reach the same accuracy of 88.88%. With Random Oversampling, Voting reaches 95.37% and Stacking reaches 96.29%. With Adasyn, Voting reaches 94.44% and Stacking reaches 97.22%. These results indicate that Random Oversampling and Adasyn can improve the performance of hybrid Machine Learning methods on imbalanced data. The contribution of this research is an analysis of how applying Random Oversampling and Adasyn improves the performance of the Voting and Stacking methods in hybrid classification.
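
The two simplest ideas named in the abstract, random oversampling and hard (majority) voting, can be illustrated with a short standard-library sketch. This is not the study's implementation, which is not shown here. The function names `random_oversample` and `hard_vote`, the toy rows, and the toy predictions are all hypothetical, and Adasyn and Stacking are not covered.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class that can be duplicated.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def hard_vote(predictions):
    """Majority vote across classifiers.

    `predictions` holds one list of predicted labels per classifier.
    Ties go to the label encountered first (hypothetical helper)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]


# Toy, invented data: four majority-class rows and one minority-class row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["low", "low", "low", "low", "high"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)

# Invented predictions from three classifiers on three samples.
votes = hard_vote([
    ["low", "high", "low"],
    ["low", "high", "high"],
    ["high", "high", "low"],
])
```

In practice, oversampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.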

Keywords


Classification; Machine Learning; Hybrid Methods; Random Oversampling; Adasyn

