Optimizing Diabetes Prediction: Addressing Data Imbalance with Machine Learning Algorithms

Keywords

Machine learning
Imbalanced dataset
Diabetes classification
Ensemble learning

How to Cite

Optimizing Diabetes Prediction: Addressing Data Imbalance with Machine Learning Algorithms. (2024). ADBA Computer Science, 1(1), 26-35. https://doi.org/10.69882/adba.cs.2024075

Abstract

Imbalanced datasets pose significant challenges in many fields, including the classification of medical conditions such as diabetes. This study investigates six methodologies for handling imbalanced diabetes datasets, aiming to improve classification performance through different preprocessing techniques. Each methodology is evaluated with ten models: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, SVM, KNN, Naive Bayes, XGBoost, LightGBM, and CatBoost. The six methodologies are a simple implementation, data standardization, normalization, standardization with K-fold cross-validation, and two variations that incorporate the SMOTE oversampling technique. The effectiveness of each methodology is assessed by accuracy, precision, recall, and F1 score across the classifiers. Results indicate that standardization combined with K-fold cross-validation consistently improves model performance. Integrating SMOTE improves results further, especially for the Gradient Boosting and SVM classifiers. Among the tested models, CatBoost performed best on the imbalanced data, achieving an accuracy of 95.18%, a precision of 91.10%, a recall of 95.52%, and an F1 score of 93.26%. This study underscores the importance of tailored preprocessing in the classification of imbalanced medical datasets and its potential to improve predictive accuracy in critical applications.
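As a brief illustration of the four evaluation metrics named in the abstract, the sketch below computes them from confusion-matrix counts using their standard definitions. It is not the authors' code: the function names and the 90/10 class split are hypothetical and chosen only for illustration. The example shows why accuracy alone can mislead on imbalanced data, and it checks that the reported CatBoost F1 score agrees with the harmonic mean of the reported precision and recall.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical imbalanced test set: 90 negatives, 10 positives.
# A classifier that labels every sample negative still scores 90% accuracy,
# yet it finds none of the positive (diabetic) cases.
trivial = classification_metrics(tp=0, fp=0, fn=10, tn=90)
print(trivial["accuracy"], trivial["recall"])  # 0.9 0.0

# Consistency check on the figures reported in the abstract:
# precision 91.10% and recall 95.52% should give an F1 near 93.26%.
reported_f1 = f1_from(0.9110, 0.9552)
print(round(reported_f1 * 100, 2))  # 93.26
```

The all-negative classifier makes the case for reporting recall and F1 alongside accuracy when the positive class is rare.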

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.