1. Data Preparation 💾

This dataset is used for predicting the likelihood of a cerebral stroke based on various health and demographic factors. It is an imbalanced classification problem, making it suitable for evaluating models using metrics like Precision, Recall, F1 Score, ROC-AUC, and PR-AUC.

Key Features Include:

  • Age, Hypertension, Heart Disease
  • Average Glucose Level, BMI
  • Smoking Status, Gender, Work Type, Residence Type
  • Stroke (target variable)

Class Imbalance: The target variable (stroke) is highly imbalanced, with a small percentage of positive stroke cases.

Source: Cerebral Stroke Prediction (Imbalanced Dataset) on Kaggle


2. Exploratory Data Analysis 🔍

2.1. Initial Data Overview 📊

Format

A data frame with 41,938 records and 11 columns; 643 records (≈1.5%) carry the positive (minority) outcome and 41,295 carry the negative (majority) outcome (roughly a 64:1 imbalance).

Variables

gender, age, hypertension, heart_disease, ever_married, work_type, Residence_type, avg_glucose_level, bmi, smoking_status, Class

## Rows: 41,938
## Columns: 11
## $ gender            <fct> Male, Male, Female, Female, Male, Female, Female, Fe…
## $ age               <dbl> 3, 58, 8, 70, 14, 47, 52, 75, 32, 74, 79, 79, 37, 37…
## $ hypertension      <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ heart_disease     <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ ever_married      <fct> No, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, …
## $ work_type         <fct> children, Private, Private, Private, Never_worked, P…
## $ Residence_type    <fct> Rural, Urban, Urban, Rural, Rural, Urban, Urban, Rur…
## $ avg_glucose_level <dbl> 95.12, 87.96, 110.89, 69.04, 161.28, 210.95, 77.59, …
## $ bmi               <dbl> 18.0, 39.2, 17.6, 35.9, 19.1, 50.1, 17.7, 27.0, 32.3…
## $ smoking_status    <int> 1, 3, 1, 2, 1, 1, 2, 3, 4, 3, 1, 2, 3, 2, 3, 3, 4, 3…
## $ Class             <fct> negative, negative, negative, negative, negative, ne…

2.2. Visualization

2.2.1. Distribution of Classes


2.2.2. Distribution of Predictors


2.2.3. Distribution of Predictors by Class - Histogram

2.2.4. Distribution of Predictors by Class - Contingency Tables

| gender | negative | positive |
|---|---|---|
| Female | 24,588 (98.6%) | 357 (1.4%) |
| Male | 16,700 (98.3%) | 286 (1.7%) |
| Other | 7 (100%) | 0 (0.0%) |

| hypertension | negative | positive |
|---|---|---|
| 0 | 37,797 (98.8%) | 471 (1.2%) |
| 1 | 3,498 (95.3%) | 172 (4.7%) |

| heart_disease | negative | positive |
|---|---|---|
| 0 | 39,631 (98.8%) | 499 (1.2%) |
| 1 | 1,664 (92%) | 144 (8%) |

| ever_married | negative | positive |
|---|---|---|
| No | 15,087 (99.6%) | 67 (0.4%) |
| Yes | 26,208 (97.8%) | 576 (2.2%) |

| work_type | negative | positive |
|---|---|---|
| children | 6,059 (100%) | 1 (0%) |
| Govt_job | 5,167 (98.5%) | 77 (1.5%) |
| Never_worked | 176 (100%) | 0 (0.0%) |
| Private | 23,626 (98.5%) | 358 (1.5%) |
| Self-employed | 6,267 (96.8%) | 207 (3.2%) |

| Residence_type | negative | positive |
|---|---|---|
| Rural | 20,619 (98.5%) | 315 (1.5%) |
| Urban | 20,676 (98.4%) | 328 (1.6%) |

| smoking_status | negative | positive |
|---|---|---|
| 1 | 12,771 (99.3%) | 95 (0.7%) |
| 2 | 6,919 (97.5%) | 180 (2.5%) |
| 3 | 15,491 (98.4%) | 256 (1.6%) |
| 4 | 6,114 (98.2%) | 112 (1.8%) |



3. Methodology 🛠️

The following machine learning workflow is considered for all models:

📦 Data Splitting:
The original dataset is divided into three parts; I used stratified splits to ensure the minority class is represented in all subsets:

  • Training set (≈56%): used to train the model.
  • Validation set (≈24%): used to tune hyperparameters and select the optimal classification threshold.
  • Test set (≈20%): held out and used only once for final model evaluation.

This 3-way split ensures that model selection and threshold tuning do not bias the final reported performance.
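
A minimal sketch of such a stratified 56/24/20 split using caret::createDataPartition (the data frame name `df` and the seed are assumptions; the original splitting code may differ):

```r
library(caret)

set.seed(42)

# Stage 1: hold out 20% of the data as the test set, stratified on Class
idx_trainval <- createDataPartition(df$Class, p = 0.80, list = FALSE)
trainval <- df[idx_trainval, ]
test     <- df[-idx_trainval, ]

# Stage 2: split the remaining 80% into 70/30, giving ~56% training and ~24% validation
idx_train  <- createDataPartition(trainval$Class, p = 0.70, list = FALSE)
train      <- trainval[idx_train, ]
validation <- trainval[-idx_train, ]
```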

⚖️ Handling Class Imbalance:
The training data are augmented using resampling techniques such as SMOTE, ADASYN, or SMOTE-ENN to address the severe class imbalance (positive class ≈ 1.5%).
The validation and test sets, however, are not augmented and retain their original class distributions to reflect real-world conditions.
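
As an illustration, a SMOTE-style augmentation of the training set could look like the sketch below (using the smotefamily package, which expects numeric predictors, so factors are one-hot encoded first; the original analysis may rely on a different implementation):

```r
library(smotefamily)

# SMOTE works on numeric features, so factor predictors are encoded as dummies first
x_train <- as.data.frame(model.matrix(Class ~ . - 1, data = train))

# dup_size = 0 lets SMOTE choose enough synthetic minority samples to roughly balance the classes
sm <- SMOTE(X = x_train, target = train$Class, K = 5, dup_size = 0)

train_smote <- sm$data        # original + synthetic rows; the last column is named "class"
table(train_smote$class)
```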

In addition to oversampling, we also evaluate class weighting during model training as another way to mitigate imbalance. The weighting scheme assigns a higher penalty to misclassifying the minority class, with weights computed inversely proportional to class frequencies using the formula:

\[ \text{weight}_{\text{class}} = \frac{n_{\text{total}}}{2 \times n_{\text{class}}} \]

where \(n_{\text{class}}\) is the number of observations in a given class, and \(n_{\text{total}}\) is the total number of training samples.
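
For example, a minimal sketch of this computation (assuming the `train` data frame from the split sketch above):

```r
# Inverse-frequency class weights: n_total / (2 * n_class)
n_total       <- nrow(train)
class_counts  <- table(train$Class)
class_weights <- n_total / (2 * class_counts)
class_weights
# The minority (positive) class receives a much larger weight than the majority class
```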

🎯 Model Training & Threshold Selection:
All models are trained to maximize the F1 score, which balances precision and recall, making it particularly suitable for highly imbalanced classification tasks. The classification threshold is tuned on the validation set. This tuned threshold is then applied to the test set for final evaluation, which is performed only once.
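
A sketch of one way to tune the threshold to maximize F1, assuming `val_probs` holds a model's predicted positive-class probabilities on the validation set:

```r
# Grid-search the classification threshold that maximizes F1 on the validation set
thresholds <- seq(0.01, 0.99, by = 0.01)

f1_scores <- sapply(thresholds, function(t) {
  pred <- factor(ifelse(val_probs >= t, "positive", "negative"),
                 levels = c("negative", "positive"))
  cm        <- table(Prediction = pred, Reference = validation$Class)
  precision <- cm["positive", "positive"] / sum(cm["positive", ])
  recall    <- cm["positive", "positive"] / sum(cm[, "positive"])
  if (is.na(precision) || precision + recall == 0) return(0)
  2 * precision * recall / (precision + recall)
})

best_threshold <- thresholds[which.max(f1_scores)]
```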

📊 Evaluation Metrics:
Model performance is assessed using the following metrics:

  • F1 Score: harmonic mean of precision and recall
  • Precision (PPV) and Recall (Sensitivity)
  • Specificity and ROC-AUC (where appropriate)
  • PR-AUC is also considered in some cases to better capture precision-recall trade-offs under imbalance

This methodology ensures fair comparison and realistic performance estimation.
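
For reference, these metrics can be computed directly from a 2x2 confusion matrix like the ones printed in the sections below (a sketch; the helper name `report_metrics` is hypothetical):

```r
# Metrics from a 2x2 confusion matrix (rows = prediction, columns = reference)
report_metrics <- function(cm) {
  tp <- cm["positive", "positive"]; fp <- cm["positive", "negative"]
  fn <- cm["negative", "positive"]; tn <- cm["negative", "negative"]
  precision   <- tp / (tp + fp)
  sensitivity <- tp / (tp + fn)    # recall
  specificity <- tn / (tn + fp)
  f1          <- 2 * precision * sensitivity / (precision + sensitivity)
  round(c(Sensitivity = sensitivity, Specificity = specificity,
          Precision = precision, F1_Score = f1), 3)
}
```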


3.1. Generalized Linear Models (GLM) 🔢

In this section, we evaluate logistic regression (GLM) for handling the imbalanced dataset.
We explore the following strategies:

  1. Baseline logistic regression without any class imbalance treatment.
  2. Resampling approaches: Random Over-Sampling (ROS), SMOTE, SMOTE-ENN, Random Under-Sampling (RUS)

The classification threshold is tuned on the validation set to maximize the F1 score,
ensuring that the model achieves a balance between precision and recall while accounting for the class imbalance.

3.1.1 Baseline GLM
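
A minimal sketch of the baseline fit and its validation-set probabilities (variable names follow the earlier split sketch; the actual modeling code may differ):

```r
# Baseline logistic regression with no imbalance treatment
glm_base <- glm(Class ~ ., data = train, family = binomial)

# Predicted probabilities of the positive class (second factor level) on the validation set
val_probs <- predict(glm_base, newdata = validation, type = "response")
```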

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.06.

##           Reference
## Prediction negative positive
##   negative     7856       88
##   positive      403       40
## 
## Sensitivity: 0.312
## Specificity: 0.951
## Precision:    0.09
## F1 Score:    0.14
3.1.2 GLM + ROS
## 
## negative positive 
##    23126    23126
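
For reference, a balanced training set like the one shown above can be produced by random over-sampling, e.g. with caret::upSample (a sketch; the original resampling code may differ):

```r
# Randomly duplicate minority-class rows until both classes have the same count
train_ros <- caret::upSample(x = subset(train, select = -Class),
                             y = train$Class, yname = "Class")
table(train_ros$Class)
```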

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.87.

##           Reference
## Prediction negative positive
##   negative     8047      102
##   positive      212       26
## 
## Sensitivity: 0.203
## Specificity: 0.974
## Precision:    0.109
## F1 Score:    0.142

3.1.3 GLM + SMOTE

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.87.

##           Reference
## Prediction negative positive
##   negative     7988       96
##   positive      271       32
## 
## Sensitivity: 0.25
## Specificity: 0.967
## Precision:    0.106
## F1 Score:    0.148

3.1.4 GLM + SMOTE-ENN
## Number of instances removed from majority class with ENN: 1544    Time needed: 2.39
## 
## negative positive 
##    21582    23126

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.89.

##           Reference
## Prediction negative positive
##   negative     7940       94
##   positive      319       34
## 
## Sensitivity: 0.266
## Specificity: 0.961
## Precision:    0.096
## F1 Score:    0.141

3.1.5 GLM + RUS
## 
## negative positive 
##      361      361

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.85.

##           Reference
## Prediction negative positive
##   negative     7895       92
##   positive      364       36
## 
## Sensitivity: 0.281
## Specificity: 0.956
## Precision:    0.09
## F1 Score:    0.136

3.2 Random Forest 🌲

Random Forest is a robust ensemble method that often performs well on imbalanced data thanks to its ability to capture non-linear relationships.
In this section, we test the following Random Forest variants:

  1. Baseline Random Forest without any class imbalance treatment.
  2. Resampling approaches: SMOTE, SMOTE-ENN, and ADASYN
  3. Cost-Sensitive Weights

The classification threshold is tuned on the validation set to maximize the F1 score,
ensuring that the model achieves a balance between precision and recall while accounting for the class imbalance.

3.2.1. Baseline RF

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.04.

##           Reference
## Prediction negative positive
##   negative     7530       62
##   positive      729       66
## 
## Sensitivity: 0.516
## Specificity: 0.912
## Precision:    0.083
## F1 Score:    0.143

3.2.2 RF + SMOTE

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.1.

##           Reference
## Prediction negative positive
##   negative     7718       94
##   positive      541       34
## 
## Sensitivity: 0.266
## Specificity: 0.934
## Precision:    0.059
## F1 Score:    0.097

3.2.3 RF + SMOTE-ENN

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.07.

##           Reference
## Prediction negative positive
##   negative     7310       72
##   positive      949       56
## 
## Sensitivity: 0.438
## Specificity: 0.885
## Precision:    0.056
## F1 Score:    0.099

3.2.4 RF + ADASYN

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.05.

##           Reference
## Prediction negative positive
##   negative     7399       70
##   positive      860       58
## 
## Sensitivity: 0.453
## Specificity: 0.896
## Precision:    0.063
## F1 Score:    0.111
3.2.5 Weighted RF
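
A sketch of one way to fit a cost-sensitive Random Forest, using ranger's class.weights argument together with the inverse-frequency weights from Section 3 (package choice and hyperparameters are assumptions; the original implementation may differ):

```r
library(ranger)

# Class weights inversely proportional to class frequency (see Section 3)
cw <- nrow(train) / (2 * table(train$Class))

rf_weighted <- ranger(
  Class ~ ., data = train,
  num.trees     = 500,
  probability   = TRUE,             # return class probabilities instead of hard labels
  class.weights = as.numeric(cw)    # order must match levels(train$Class)
)

val_probs <- predict(rf_weighted, data = validation)$predictions[, "positive"]
```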

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.05.

##           Reference
## Prediction negative positive
##   negative     7627       67
##   positive      632       61
## 
## Sensitivity: 0.477
## Specificity: 0.923
## Precision:    0.088
## F1 Score:    0.149

3.3 Boosting (XGBoost) ⚡

In this section, we explore the XGBoost algorithm, a powerful gradient boosting technique, for handling the severe class imbalance.

We compare the following strategies:

  1. Baseline XGBoost without any class imbalance treatment.
  2. Resampling approaches: SMOTE, SMOTE-ENN, and ADASYN
  3. Cost-Sensitive Weights

As with previous models, the classification threshold is tuned on the validation set to maximize the F1 score, ensuring a balance between precision and recall.

3.3.1 Baseline XGBoost

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.06.

##           Reference
## Prediction negative positive
##   negative     7840       94
##   positive      419       34
## 
## Sensitivity: 0.266
## Specificity: 0.949
## Precision:    0.075
## F1 Score:    0.117

3.3.2 XGBoost + SMOTE

Again, ROC is used to find an optimal threshold for classification. This time, the threshold is 0.03.

##           Reference
## Prediction negative positive
##   negative     7365       70
##   positive      894       58
## 
## Sensitivity: 0.453
## Specificity: 0.892
## Precision:    0.061
## F1 Score:    0.107
3.3.3 XGBoost + SMOTE-ENN

Again, ROC is used to find an optimal threshold for classification. This time, the threshold is 0.04.

##           Reference
## Prediction negative positive
##   negative     7296       63
##   positive      963       65
## 
## Sensitivity: 0.508
## Specificity: 0.883
## Precision:    0.063
## F1 Score:    0.112
3.3.4 XGBoost + ADASYN

Again, ROC is used to find an optimal threshold for classification. This time, the threshold is 0.02.

##           Reference
## Prediction negative positive
##   negative     7074       58
##   positive     1185       70
## 
## Sensitivity: 0.547
## Specificity: 0.857
## Precision:    0.056
## F1 Score:    0.101
3.3.5 Weighted XGBoost
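
One common way to make XGBoost cost-sensitive is the scale_pos_weight parameter. The sketch below is illustrative only (the feature encoding, hyperparameters, and negative/positive weighting ratio are assumptions):

```r
library(xgboost)

# One-hot encode predictors and build the training DMatrix
x_train <- model.matrix(Class ~ . - 1, data = train)
y_train <- as.numeric(train$Class == "positive")
dtrain  <- xgb.DMatrix(data = x_train, label = y_train)

# Weight positive examples by the negative/positive count ratio
spw <- sum(y_train == 0) / sum(y_train == 1)

xgb_weighted <- xgb.train(
  params = list(objective        = "binary:logistic",
                eval_metric      = "aucpr",
                scale_pos_weight = spw,
                max_depth        = 4,
                eta              = 0.1),
  data    = dtrain,
  nrounds = 200
)
```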

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.61.

##           Reference
## Prediction negative positive
##   negative     8089      114
##   positive      170       14
## 
## Sensitivity: 0.109
## Specificity: 0.979
## Precision:    0.076
## F1 Score:    0.09

3.4 Neural Networks (NN) 🧠

In this section, we explore feed-forward Neural Networks (NN) for the same task. Neural networks can capture complex patterns but are also sensitive to class imbalance.

We compare the following strategies:

  1. Baseline neural network without any class imbalance treatment.
  2. Resampling approaches: SMOTE, SMOTE-ENN, and ADASYN
  3. Cost-Sensitive Weights

As with previous models, the classification threshold is tuned on the validation set to maximize the F1 score, ensuring a balance between precision and recall.

3.4.1 Baseline NN
## Epoch 10, Loss: 1.9757
## Epoch 20, Loss: 0.9488
## Epoch 30, Loss: 0.4874
## Epoch 40, Loss: 0.2737
## Epoch 50, Loss: 0.1863
## Epoch 60, Loss: 0.1558
## Epoch 70, Loss: 0.1439
## Epoch 80, Loss: 0.1361
## Epoch 90, Loss: 0.1345
## Epoch 100, Loss: 0.1357

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.

##           Reference
## Prediction negative positive
##   negative      410       18
##   positive     7849      110
## 
## Sensitivity: 0.859
## Specificity: 0.05
## Precision:    0.014
## F1 Score:    0.027

3.4.2 NN + SMOTE
## Epoch 10, Loss: 0.9184
## Epoch 20, Loss: 0.7789
## Epoch 30, Loss: 0.6891
## Epoch 40, Loss: 0.6458
## Epoch 50, Loss: 0.6224
## Epoch 60, Loss: 0.6024
## Epoch 70, Loss: 0.5827
## Epoch 80, Loss: 0.5648
## Epoch 90, Loss: 0.5611
## Epoch 100, Loss: 0.5520

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.78.

##           Reference
## Prediction negative positive
##   negative     7777       92
##   positive      482       36
## 
## Sensitivity: 0.281
## Specificity: 0.942
## Precision:    0.069
## F1 Score:    0.111

3.4.3 NN + SMOTE-ENN
## Epoch 10, Loss: 0.8300
## Epoch 20, Loss: 0.7661
## Epoch 30, Loss: 0.7221
## Epoch 40, Loss: 0.6894
## Epoch 50, Loss: 0.6639
## Epoch 60, Loss: 0.6539
## Epoch 70, Loss: 0.6406
## Epoch 80, Loss: 0.6334
## Epoch 90, Loss: 0.6161
## Epoch 100, Loss: 0.6070

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.73.

##           Reference
## Prediction negative positive
##   negative     7643       89
##   positive      616       39
## 
## Sensitivity: 0.305
## Specificity: 0.925
## Precision:    0.06
## F1 Score:    0.1

3.4.4 NN + ADASYN
## Epoch 10, Loss: 1.4386
## Epoch 20, Loss: 1.4443
## Epoch 30, Loss: 1.4374
## Epoch 40, Loss: 1.4515
## Epoch 50, Loss: 1.4504
## Epoch 60, Loss: 1.4498
## Epoch 70, Loss: 1.4482
## Epoch 80, Loss: 1.4409
## Epoch 90, Loss: 1.4534
## Epoch 100, Loss: 1.4441

Again, ROC is used to find an optimal threshold for classification. This time, the threshold is 0.07.

##           Reference
## Prediction negative positive
##   negative       67        3
##   positive     8192      125
## 
## Sensitivity: 0.977
## Specificity: 0.008
## Precision:    0.015
## F1 Score:    0.03

3.4.5 Weighted NN
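
One simple way to apply class weights in a small feed-forward network is via per-observation case weights, e.g. with nnet. The sketch below is illustrative only: the original network, which logs a per-epoch loss, appears to use a different framework, and the hyperparameters here are assumptions.

```r
library(nnet)

# Case weights: each observation gets its class weight (n_total / (2 * n_class))
cw     <- nrow(train) / (2 * table(train$Class))
case_w <- as.numeric(cw[as.character(train$Class)])

nn_weighted <- nnet(Class ~ ., data = train, weights = case_w,
                    size = 8, decay = 1e-3, maxit = 200, trace = FALSE)

# Predicted probability of the positive class on the validation set
val_probs <- predict(nn_weighted, newdata = validation, type = "raw")[, 1]
```
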
## Epoch 10, Loss: 0.8496
## Epoch 20, Loss: 0.8554
## Epoch 30, Loss: 0.8446
## Epoch 40, Loss: 0.8488
## Epoch 50, Loss: 0.8457
## Epoch 60, Loss: 0.8597
## Epoch 70, Loss: 0.8653
## Epoch 80, Loss: 0.8501
## Epoch 90, Loss: 0.8651
## Epoch 100, Loss: 0.8473

ROC is used to find an optimal threshold for classification. This time, the threshold is 0.24.

##           Reference
## Prediction negative positive
##   negative      686       26
##   positive     7573      102
## 
## Sensitivity: 0.797
## Specificity: 0.083
## Precision:    0.013
## F1 Score:    0.026

3.5 Model Comparison 📊

In this section, we compare all four models (GLM, Random Forest, XGBoost, and Neural Networks) under six strategies for handling class imbalance:

  • Baseline (no adjustment)
  • Random Oversampling (ROS)
  • SMOTE
  • SMOTE-ENN
  • Adaptive Synthetic (ADASYN)
  • Cost-Sensitive Weighting

For each combination, we report Sensitivity (Recall), Specificity, Precision, and F1 Score, evaluated on the test set. Thresholds were tuned on the validation set to maximize the F1 Score, unless otherwise noted.

Model Performance Comparison
| Model | Sensitivity | Specificity | Precision | F1_Score |
|---|---|---|---|---|
| RF (Weighting) | 0.477 | 0.923 | 0.088 | 0.149 |
| GLM (SMOTE) | 0.25 | 0.967 | 0.106 | 0.148 |
| RF (Baseline) | 0.516 | 0.912 | 0.083 | 0.143 |
| GLM (SMOTE-ENN) | 0.266 | 0.961 | 0.096 | 0.141 |
| GLM (Baseline) | 0.312 | 0.951 | 0.09 | 0.14 |
| XGBoost (Baseline) | 0.266 | 0.949 | 0.075 | 0.117 |
| XGBoost (SMOTE-ENN) | 0.508 | 0.883 | 0.063 | 0.112 |
| NN (SMOTE) | 0.281 | 0.942 | 0.069 | 0.111 |
| RF (ADASYN) | 0.453 | 0.896 | 0.063 | 0.111 |
| XGBoost (SMOTE) | 0.453 | 0.892 | 0.061 | 0.107 |
| XGBoost (ADASYN) | 0.547 | 0.857 | 0.056 | 0.101 |
| NN (SMOTE-ENN) | 0.305 | 0.925 | 0.06 | 0.1 |
| RF (SMOTE-ENN) | 0.438 | 0.885 | 0.056 | 0.099 |
| RF (SMOTE) | 0.266 | 0.934 | 0.059 | 0.097 |
| XGBoost (Weighting) | 0.109 | 0.979 | 0.076 | 0.09 |
| NN (ADASYN) | 0.977 | 0.008 | 0.015 | 0.03 |
| NN (Baseline) | 0.859 | 0.05 | 0.014 | 0.027 |
| NN (Weighting) | 0.797 | 0.083 | 0.013 | 0.026 |
Note:
Performance metrics for the various classification models; values above the 75th percentile are highlighted green.

3.6 Improving the Best Model 🚀

Section 3.5 showed that the Random Forest model with class weighting achieved the best performance compared to other models. Building on that, I am experimenting with different weighting schemes to see if further improvements are possible.

So far, I have been using the following weighting scheme:

\[ \text{weight}_{\text{class}} = \frac{n_{\text{total}}}{2 \times n_{\text{class}}} \]

where:
- \(n_{\text{class}}\) is the number of samples in the given class
- \(n_{\text{total}}\) is the total number of samples in the training data

Now, I am testing alternative weighting strategies such as:

  1. Adjusting the denominator to values other than 2 \[ \text{weight}_{\text{class}} = \frac{n_{\text{total}}}{r \times n_{\text{class}}} \]

  2. Applying a direct inverse frequency weighting:
    \[ \text{weight}_{\text{class}} = \frac{1}{n_{\text{class}}} \]

  3. Applying Weighted Cross-Entropy (WCE):

\[ \text{weight}_{\text{major}} = - w \cdot \log(p_{major}) \\ \text{weight}_{\text{minor}} = - \log(1 - p_{major}) \]

  • \(w\): Class weight applied to the positive class to increase its importance
  • \(p\): Model-predicted probability of the positive class

  4. Applying Focal Loss:

\[ \text{weight}_{\text{major}} = (1-p)^{\gamma} \log(p) \\ \text{weight}_{\text{minor}} = p^{\gamma} \log(1-p) \]

Additionally, I am combining these weighting schemes with advanced resampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and SMOTE-ENN (a hybrid of SMOTE and Edited Nearest Neighbors) to further balance the dataset and improve model generalization.

These combined approaches aim to increase the model’s ability to correctly classify the minority class without sacrificing overall accuracy.
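
For concreteness, the alternative weights and loss terms above can be written down as in the sketch below (illustrative only; `p` denotes a predicted probability as in the formulas above, and the function names are assumptions):

```r
n_total <- nrow(train)
n_class <- table(train$Class)

# 1. Scaled inverse-frequency weights with a tunable denominator r
scaled_weights <- function(r) n_total / (r * n_class)

# 2. Direct inverse-frequency weights
inv_freq_weights <- 1 / n_class

# 3. Weighted cross-entropy terms (w scales one class's contribution)
wce_terms <- function(p, w) c(major = -w * log(p), minor = -log(1 - p))

# 4. Focal-loss style terms, down-weighting easy examples via gamma
focal_terms <- function(p, gamma) c(major = (1 - p)^gamma * log(p),
                                    minor = p^gamma * log(1 - p))
```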

3.6.1 Adjusting Weighting Scale (r)

\[ \text{weight}_{\text{class}} = \frac{n_{\text{total}}}{r \times n_{\text{class}}} \]

The five confusion matrices below correspond to r = 0.5, 1, 1.5, 2, and 3, in that order (see the comparison in Section 3.6.5).

##           Reference
## Prediction negative positive
##   negative     7597       75
##   positive      662       53
## 
## Sensitivity: 0.414
## Specificity: 0.92
## Precision:    0.074
## F1 Score:    0.126
##           Reference
## Prediction negative positive
##   negative     7626       76
##   positive      633       52
## 
## Sensitivity: 0.406
## Specificity: 0.923
## Precision:    0.076
## F1 Score:    0.128
##           Reference
## Prediction negative positive
##   negative     7556       69
##   positive      703       59
## 
## Sensitivity: 0.461
## Specificity: 0.915
## Precision:    0.077
## F1 Score:    0.133
##           Reference
## Prediction negative positive
##   negative     7554       71
##   positive      705       57
## 
## Sensitivity: 0.445
## Specificity: 0.915
## Precision:    0.075
## F1 Score:    0.128
##           Reference
## Prediction negative positive
##   negative     7453       62
##   positive      806       66
## 
## Sensitivity: 0.516
## Specificity: 0.902
## Precision:    0.076
## F1 Score:    0.132
3.6.2 Direct Inverse Frequency Weighting

\[ \text{weight}_{\text{class}} = \frac{1}{n_{\text{class}}} \]

##           Reference
## Prediction negative positive
##   negative     7669       80
##   positive      590       48
## 
## Sensitivity: 0.375
## Specificity: 0.929
## Precision:    0.075
## F1 Score:    0.125
3.6.3 Weighted Cross-Entropy (WCE) with Different Ratios

\[ \text{weight}_{\text{major}} = - w \cdot \log(p_{major}) \\ \text{weight}_{\text{minor}} = - \log(1 - p_{major}) \]

The six confusion matrices below correspond to w = 0.5, 1, 1.5, 2, 3, and 4, in that order.

##           Reference
## Prediction negative positive
##   negative     7601       74
##   positive      658       54
## 
## Sensitivity: 0.422
## Specificity: 0.92
## Precision:    0.076
## F1 Score:    0.129
##           Reference
## Prediction negative positive
##   negative     7632       73
##   positive      627       55
## 
## Sensitivity: 0.43
## Specificity: 0.924
## Precision:    0.081
## F1 Score:    0.136
##           Reference
## Prediction negative positive
##   negative     7648       75
##   positive      611       53
## 
## Sensitivity: 0.414
## Specificity: 0.926
## Precision:    0.08
## F1 Score:    0.134
##           Reference
## Prediction negative positive
##   negative     7678       79
##   positive      581       49
## 
## Sensitivity: 0.383
## Specificity: 0.93
## Precision:    0.078
## F1 Score:    0.129
##           Reference
## Prediction negative positive
##   negative     7859       95
##   positive      400       33
## 
## Sensitivity: 0.258
## Specificity: 0.952
## Precision:    0.076
## F1 Score:    0.118
##           Reference
## Prediction negative positive
##   negative     7535       71
##   positive      724       57
## 
## Sensitivity: 0.445
## Specificity: 0.912
## Precision:    0.073
## F1 Score:    0.125
3.6.4 Focal Loss with Different Ratios

\[ \text{weight}_{\text{major}} = (1-p)^{\gamma} \log(p) \\ \text{weight}_{\text{minor}} = p^{\gamma} \log(1-p) \]

The six confusion matrices below correspond to \(\gamma\) = 0.5, 1, 1.5, 2, 3, and 4, in that order.

##           Reference
## Prediction negative positive
##   negative     7458       69
##   positive      801       59
## 
## Sensitivity: 0.461
## Specificity: 0.903
## Precision:    0.069
## F1 Score:    0.119
##           Reference
## Prediction negative positive
##   negative     7658       78
##   positive      601       50
## 
## Sensitivity: 0.391
## Specificity: 0.927
## Precision:    0.077
## F1 Score:    0.128
##           Reference
## Prediction negative positive
##   negative     7755       82
##   positive      504       46
## 
## Sensitivity: 0.359
## Specificity: 0.939
## Precision:    0.084
## F1 Score:    0.136
##           Reference
## Prediction negative positive
##   negative     7601       72
##   positive      658       56
## 
## Sensitivity: 0.438
## Specificity: 0.92
## Precision:    0.078
## F1 Score:    0.133
##           Reference
## Prediction negative positive
##   negative     7431       59
##   positive      828       69
## 
## Sensitivity: 0.539
## Specificity: 0.9
## Precision:    0.077
## F1 Score:    0.135
##           Reference
## Prediction negative positive
##   negative     7835       90
##   positive      424       38
## 
## Sensitivity: 0.297
## Specificity: 0.949
## Precision:    0.082
## F1 Score:    0.129

3.6.5 Comparing Weighting Schemes

Performance Across Different Weighting Schemes

| Weighting | Sensitivity | Specificity | Precision | F1_Score |
|---|---|---|---|---|
| r = 0.5 | 0.4141 | 0.9198 | 0.0741 | 0.1257 |
| r = 1 | 0.4062 | 0.9234 | 0.0759 | 0.1279 |
| r = 1.5 | 0.4609 | 0.9149 | 0.0774 | 0.1326 |
| r = 2 | 0.4453 | 0.9146 | 0.0748 | 0.1281 |
| r = 3 | 0.5156 | 0.9024 | 0.0757 | 0.1320 |
| Inverse Frequency | 0.3750 | 0.9286 | 0.0752 | 0.1253 |
| WCE - Weights = 0.5 | 0.4219 | 0.9203 | 0.0758 | 0.1286 |
| WCE - Weights = 1 | 0.4297 | 0.9241 | 0.0806 | 0.1358 |
| WCE - Weights = 1.5 | 0.4141 | 0.9260 | 0.0798 | 0.1338 |
| WCE - Weights = 2 | 0.3828 | 0.9297 | 0.0778 | 0.1293 |
| WCE - Weights = 3 | 0.2578 | 0.9516 | 0.0762 | 0.1176 |
| WCE - Weights = 4 | 0.4453 | 0.9123 | 0.0730 | 0.1254 |
| FL - Gamma = 0.5 | 0.4609 | 0.9030 | 0.0686 | 0.1194 |
| FL - Gamma = 1 | 0.3906 | 0.9272 | 0.0768 | 0.1284 |
| FL - Gamma = 1.5 | 0.3594 | 0.9390 | 0.0836 | 0.1357 |
| FL - Gamma = 2 | 0.4375 | 0.9203 | 0.0784 | 0.1330 |
| FL - Gamma = 3 | 0.5391 | 0.8997 | 0.0769 | 0.1346 |
| FL - Gamma = 4 | 0.2969 | 0.9487 | 0.0823 | 0.1288 |

3.6.6 RF + WCE + SMOTE

After evaluating multiple class imbalance handling strategies, it is now clear that Weighted Cross-Entropy (WCE) performs best in this context. WCE adjusts the loss function by assigning more importance to the minority class, helping the model focus on harder-to-classify examples and address class imbalance effectively.

Building on this, the next step is to combine WCE with sampling techniques to further enhance performance. Specifically, we will integrate WCE with:

  • SMOTE (Synthetic Minority Oversampling Technique): generates synthetic samples for the minority class to balance the dataset.
  • SMOTE-ENN (SMOTE + Edited Nearest Neighbors): combines oversampling with data cleaning by removing ambiguous and noisy samples using nearest neighbors.

This hybrid approach aims to leverage both algorithm-level (WCE) and data-level (SMOTE/SMOTE-ENN) methods to improve predictive accuracy, especially for the minority class.

We will implement and evaluate the following combinations:

  • Weighted Cross-Entropy (weight = 1) + SMOTE
  • Weighted Cross-Entropy (weight = 1) + SMOTE-ENN

The models will be compared using performance metrics such as F1 Score, Precision, Recall, and AUC on the validation and test sets.

##           Reference
## Prediction negative positive
##   negative     7419       84
##   positive      840       44
## 
## Sensitivity: 0.344
## Specificity: 0.898
## Precision:    0.05
## F1 Score:    0.087
3.6.7 RF + FL + SMOTE-ENN
##           Reference
## Prediction negative positive
##   negative     7954      108
##   positive      305       20
## 
## Sensitivity: 0.156
## Specificity: 0.963
## Precision:    0.062
## F1 Score:    0.088

3.7 Quantification

Since the classification scores remain low, I now apply quantification methods, using two approaches, to our best model so far (the weighted Random Forest) and to our most basic baseline model (the baseline GLM):

  • Probabilistic Classify & Count (PCC): summing the predicted probabilities of the positive class over the test set.
  • Adjusted Classify & Count (ACC): using the classifier's known error rates (estimated from held-out data) to correct the raw prediction counts on the test set.
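
A minimal sketch of both estimators, assuming `test_probs` (predicted positive-class probabilities on the test set), `best_threshold`, and `tpr`/`fpr` (true- and false-positive rates estimated on held-out data) are available from earlier steps:

```r
# Probabilistic Classify & Count: sum the predicted positive-class probabilities
pcc_count <- sum(test_probs)

# Adjusted Classify & Count: correct the raw positive rate with the known error rates
#   adjusted prevalence = (raw prevalence - fpr) / (tpr - fpr)
cc_rate   <- mean(test_probs >= best_threshold)
acc_rate  <- (cc_rate - fpr) / (tpr - fpr)
acc_count <- acc_rate * length(test_probs)
```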

3.7.1 PCC (Weighted RF)

The actual number of positive cases in the test set is 128; the PCC estimate from the weighted RF is 127.

3.7.2 ACC (Weighted RF)

The actual number of positive cases in the test set is 128; the ACC estimate from the weighted RF is 142.

3.7.3 PCC (Baseline GLM)

The actual number of positive cases in the test set is 128; the PCC estimate from the baseline GLM is 124.

3.7.4 ACC (Baseline GLM)

The actual number of positive cases in the test set is 128; the ACC estimate from the baseline GLM is 408.