A Two-Stage Hybrid Oversampling and Ensemble Learning Framework for Improved Type 2 Diabetes Mellitus Classification

Keywords: Diabetes Mellitus; Imbalance Data; Oversampling; Ensemble Learning; Classification

Vol. 8 No. 2 (2026): May
Medical Informatics
December 10, 2025
February 14, 2026
May 2, 2026

Type 2 Diabetes Mellitus (T2DM) screening on clinical tabular data commonly suffers from class imbalance: non-diabetic records far outnumber diabetic cases, so models become biased toward the majority class and detect the positive (diabetic) class poorly. This study aims to improve T2DM classification on an imbalanced dataset by increasing minority-class detection while maintaining acceptable overall performance. The main contribution is a leakage-safe framework that integrates two-stage hybrid oversampling (RandomOverSampler followed by Borderline-SMOTE) with soft-voting ensemble learning to obtain more balanced predictions. Experiments were conducted on the Diabetes Bangladesh (DiaBD) dataset, which contains 5,288 clinical records with a binary target, diabetic (Yes/No, mapped to 1/0). The data were split with stratification into train_full and test sets (80/20), and train_full was further split into training and validation sets (80/20). Features were normalized with MinMaxScaler, fitted only on the training set and then applied to the validation and test sets to prevent data leakage. Class imbalance was handled on the training set only, using the proposed two-stage oversampling (ROS followed by Borderline-SMOTE; borderline-1, k_neighbors=3). The classification models were SVM (RBF), Random Forest, and Gradient Boosting, together with soft-voting ensembles of two and three models. Results show that the baseline setting without oversampling (No OS) can achieve high accuracy but low minority detection; for instance, SVM (No OS) reached an accuracy of 0.9374 with a Recall_pos of only 0.0909 and an F1_pos of 0.1587. With oversampling, SVM (OS) improved Recall_pos to 0.7273 and F1_pos to 0.4188, although accuracy decreased to 0.8688 because of additional false positives. The best balanced performance was achieved by the SVM + Random Forest soft-voting ensemble (OS), with an accuracy of 0.9125, a Recall_pos of 0.6545, and the highest F1_pos, 0.4932. Overall, the proposed two-stage hybrid oversampling combined with soft-voting ensembles improves T2DM detection on imbalanced tabular data, and the findings indicate that model selection should prioritize Recall_pos and F1_pos rather than accuracy alone.

How to Cite

Permatasari, S. F. N., & Ermatita, E. (2026). A Two-Stage Hybrid Oversampling and Ensemble Learning Framework for Improved Type 2 Diabetes Mellitus Classification. Indonesian Journal of Electronics, Electromedical Engineering, and Medical Informatics, 8(2), 202-217. https://doi.org/10.35882/ijeeemi.v8i2.308
