Optimizing Diabetes Prediction: Addressing Data Imbalance with Machine Learning Algorithms
Machine learning
Imbalanced dataset
Diabetes classification
Ensemble learning

Optimizing Diabetes Prediction: Addressing Data Imbalance with Machine Learning Algorithms. (2024). ADBA Computer Science, 1(1), 26-35. https://doi.org/10.69882/adba.cs.2024075


Imbalanced datasets pose significant challenges in many fields, including the classification of medical conditions such as diabetes. This study investigates six methodologies for handling imbalanced diabetes datasets, aiming to improve classification performance through different preprocessing techniques. Each methodology is evaluated with ten models: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, SVM, KNN, Naive Bayes, XGBoost, LightGBM, and CatBoost. The six methodologies are a simple implementation, data standardization, normalization, standardization with K-fold cross-validation, and two variations incorporating the SMOTE oversampling technique. The effectiveness of each methodology is assessed by accuracy, precision, recall, and F1 score across the classifiers. Results indicate that standardization combined with K-fold cross-validation consistently improves model performance. Integrating SMOTE improves results further, especially for the Gradient Boosting and SVM classifiers. Among the tested models, CatBoost performed best on the imbalanced data, achieving an accuracy of 95.18%, precision of 91.10%, recall of 95.52%, and an F1 score of 93.26%. These findings underscore the importance of tailored preprocessing in the classification of imbalanced medical datasets and its potential to improve predictive accuracy in critical applications.
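
To make the preprocessing steps named in the abstract concrete, the following is a minimal sketch of z-score standardization, SMOTE-style oversampling, and the F1 computation, written with the Python standard library only. It is not the authors' implementation, which the abstract does not describe. The function names (`standardize`, `smote`, `f1_score`), the neighbor count `k`, the seed, and the toy data are all assumptions made for illustration.

```python
import math
import random


def standardize(rows):
    """Z-score each column: (x - mean) / std. Constant columns are left centered."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration: 6 majority rows and 3 minority rows.
majority = [[1.0, 80.0], [2.0, 85.0], [1.5, 90.0], [0.5, 88.0], [2.5, 92.0], [1.2, 84.0]]
minority = [[6.0, 150.0], [7.0, 160.0], [6.5, 170.0]]

scaled = standardize(majority + minority)
scaled_minority = scaled[len(majority):]

# Oversample the minority class until both classes have the same size.
synthetic = smote(scaled_minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = scaled_minority + synthetic

# The F1 score follows from the reported precision and recall.
catboost_f1 = round(100 * f1_score(0.9110, 0.9552), 2)
print(len(balanced_minority), catboost_f1)
```

The last line also serves as a consistency check on the reported CatBoost figures: a precision of 91.10% and a recall of 95.52% give an F1 score of 93.26%, matching the abstract. When oversampling is combined with K-fold cross-validation, a common precaution is to oversample only the training folds so that synthetic points do not leak into the evaluation data; the abstract does not say how the study handled this.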

