Miscalibration caused by data points reweighting

Another cause of model miscalibration is training set resampling. Resampling is generally encountered when dealing with imbalanced datasets. In this section, we show the effect of resampling on model calibration and the methodology to prefer when dealing with imbalanced datasets.

Let’s synthetically generate an imbalanced dataset with 99% of the samples belonging to the majority class and 1% to the minority class.

# Make sure to have scikit-learn >= 1.5
import sklearn
sklearn.__version__
'1.6.1'
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=50_000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.99, 0.01],
    class_sep=2,
    random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.9, random_state=0
)

As a model, we use logistic regression and check its classification report.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

logistic_regression = LogisticRegression().fit(X_train, y_train)
print(classification_report(y_test, logistic_regression.predict(X_test)))
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     44330
           1       0.96      0.44      0.60       670

    accuracy                           0.99     45000
   macro avg       0.97      0.72      0.80     45000
weighted avg       0.99      0.99      0.99     45000

When it comes to imbalanced datasets, data scientists are often unhappy with one of the reported metrics. Here, they might find the recall of the minority class too low for their taste.
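
If we want this number on its own rather than reading it off the report, we can compute it directly; a quick check using recall_score (which by default reports the recall of the positive class):

from sklearn.metrics import recall_score

# Recall of class 1 (the minority class), as shown in the report above
recall_score(y_test, logistic_regression.predict(X_test))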

Let’s check the decision boundary of our model.

import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

_, ax = plt.subplots()
DecisionBoundaryDisplay.from_estimator(
    logistic_regression,
    X_test,
    ax=ax,
    cmap="coolwarm",
    response_method="predict",
    plot_method="contour"
)
ax.scatter(*X_train.T, c=y_train, cmap="coolwarm", edgecolors="black")
_ = ax.set(xlabel="Feature 1", ylabel="Feature 2")
[Figure: decision boundary of the logistic regression model, with the training samples overlaid (Feature 1 vs. Feature 2)]

So we see that our model is conservative: it wrongly classifies many samples from the minority class as the majority class. However, if our data scientists want to improve the recall, they would like to move the decision boundary so as to correctly classify more samples from the minority class, at the cost of misclassifying more samples from the majority class.

A body of literature advocates resampling the training set so that the model is trained on a more balanced dataset. In scikit-learn, setting the parameter class_weight to "balanced" is equivalent to resampling the training set.
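
As a side note, the same effect can be obtained by passing explicit sample weights at fit time; a minimal sketch using compute_sample_weight, which reweights each sample inversely to its class frequency:

from sklearn.utils.class_weight import compute_sample_weight

# "balanced" weights are n_samples / (n_classes * class_count), so minority
# samples receive a proportionally larger weight during training.
sample_weight = compute_sample_weight("balanced", y_train)
# Equivalent in effect to LogisticRegression(class_weight="balanced")
lr_reweighted = LogisticRegression().fit(X_train, y_train, sample_weight=sample_weight)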

We therefore repeat the previous experiment, this time setting this parameter, and check its effect on the classification report and the decision boundary.

logistic_regression_balanced = LogisticRegression(class_weight="balanced")
logistic_regression_balanced.fit(X_train, y_train)
print(classification_report(y_test, logistic_regression_balanced.predict(X_test)))
              precision    recall  f1-score   support

           0       0.99      0.85      0.92     44330
           1       0.07      0.71      0.12       670

    accuracy                           0.85     45000
   macro avg       0.53      0.78      0.52     45000
weighted avg       0.98      0.85      0.90     45000

_, ax = plt.subplots()
DecisionBoundaryDisplay.from_estimator(
    logistic_regression_balanced,
    X_test,
    ax=ax,
    cmap="coolwarm",
    response_method="predict",
    plot_method="contour",
)
ax.scatter(*X_train.T, c=y_train, cmap="coolwarm", edgecolors="black")
_ = ax.set(xlabel="Feature 1", ylabel="Feature 2")
[Figure: decision boundary of the balanced logistic regression model, with the training samples overlaid (Feature 1 vs. Feature 2)]

So we see that the recall increases at the cost of a lower precision, which is confirmed by the displacement of the decision boundary.

However, so far we have completely ignored the potential effect on the calibration of the model. Instead of the hard decision boundary, let’s look at the decision boundary based on the probability estimates.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5), sharex=True, sharey=True)
for ax, model in zip(axes.ravel(), [logistic_regression, logistic_regression_balanced]):
    disp = DecisionBoundaryDisplay.from_estimator(
        model,
        X_test,
        ax=ax,
        cmap="coolwarm",
        response_method="predict_proba",
        alpha=0.8,
    )
    ax.scatter(*X_train.T, c=y_train, cmap="coolwarm", edgecolors="black")
    ax.set(xlabel="Feature 1", ylabel="Feature 2")
    fig.colorbar(disp.surface_, ax=ax, label="Probability estimate")
[Figure: probability estimates of the unbalanced (left) and balanced (right) logistic regression models, with colorbars showing the probability estimate]

We see that the two models output very different probability estimates. We should therefore compare their calibration to check whether one model is better calibrated than the other.

from sklearn.calibration import CalibrationDisplay

disp = CalibrationDisplay.from_estimator(
    logistic_regression, X_test, y_test, strategy="quantile", name="Unbalanced LR"
)
CalibrationDisplay.from_estimator(
    logistic_regression_balanced,
    X_test,
    y_test,
    strategy="quantile",
    ax=disp.ax_,
    name="Balanced LR",
)
disp.ax_.set(aspect="equal")
_ = disp.ax_.legend(loc="upper left")
[Figure: calibration curves of the unbalanced and balanced logistic regression models]

We clearly see that the balanced logistic regression model is completely miscalibrated: in short, this is the effect of resampling. We can have a look at the ROC curves of the two models to check whether the ranking of the predictions changed.
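
To complement the visual diagnosis with a single number, we can compare the two models with a proper scoring rule; a quick check using brier_score_loss (lower is better, and it is sensitive to miscalibration):

from sklearn.metrics import brier_score_loss

for name, model in [
    ("Unbalanced LR", logistic_regression),
    ("Balanced LR", logistic_regression_balanced),
]:
    # Brier score on the probability estimates of the positive class
    proba_pos = model.predict_proba(X_test)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_test, proba_pos):.4f}")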

from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots()
RocCurveDisplay.from_estimator(
    logistic_regression, X_test, y_test, ax=ax, linestyle="-.", name="Unbalanced LR"
)
_ = RocCurveDisplay.from_estimator(
    logistic_regression_balanced,
    X_test,
    y_test,
    ax=ax,
    linestyle="--",
    name="Balanced LR",
)
[Figure: ROC curves of the unbalanced and balanced logistic regression models]

We see that the two models have the same ROC curve, which means that the ranking of the predictions is unchanged.
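
We can confirm this numerically: since the ranking is preserved, the areas under the two curves should be (nearly) identical; a quick check using roc_auc_score:

from sklearn.metrics import roc_auc_score

# AUC depends only on the ranking of the probability estimates
print(roc_auc_score(y_test, logistic_regression.predict_proba(X_test)[:, 1]))
print(roc_auc_score(y_test, logistic_regression_balanced.predict_proba(X_test)[:, 1]))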

As a conclusion, we should not use resampling to deal with imbalanced datasets. Instead, if we are interested in improving a given metric, we should tune the decision threshold, set to 0.5 by default, that transforms the probability estimates into hard predictions. This has the same effect as “moving” the decision boundary, but it does not impact the calibration of the model. We will go into further detail on this topic in the next section, but we can quickly experiment with the FixedThresholdClassifier from scikit-learn, which allows setting the threshold used to transform the probability estimates into hard predictions.

from sklearn.model_selection import FixedThresholdClassifier

threshold = 0.1
logistic_regression_with_threshold = FixedThresholdClassifier(
    logistic_regression, threshold=threshold
).fit(X_train, y_train)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5), sharex=True, sharey=True)
for ax, model, title in zip(
    axes.ravel(),
    [logistic_regression, logistic_regression_with_threshold],
    ["Threshold 0.5 (default)", f"Threshold {threshold}"],
):
    disp = DecisionBoundaryDisplay.from_estimator(
        model,
        X_test,
        ax=ax,
        cmap="coolwarm",
        response_method="predict",
        plot_method="contour",
    )
    ax.scatter(*X_train.T, c=y_train, cmap="coolwarm", edgecolors="black")
    ax.set(xlabel="Feature 1", ylabel="Feature 2", title=title)
[Figure: decision boundaries with the default threshold of 0.5 (left) and a threshold of 0.1 (right), with the training samples overlaid]

We see that the decision boundary moved similarly to the one of the balanced logistic regression model. In addition, since we now have a parameter to tune, we can easily target a specific value of a metric of interest, which is not trivial with resampling.
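
For instance, here is a minimal sketch of such targeting: scan candidate thresholds until the recall reaches a chosen value. The target of 0.7 is hypothetical, and we use the test set for illustration only; in practice, tune on a separate validation split.

import numpy as np

target_recall = 0.7  # hypothetical target chosen for illustration
proba = logistic_regression.predict_proba(X_test)[:, 1]
# Decrease the threshold from the default 0.5 until the target is reached
for candidate in np.arange(0.5, 0.0, -0.01):
    if recall_score(y_test, (proba >= candidate).astype(int)) >= target_recall:
        print(f"threshold {candidate:.2f} reaches recall >= {target_recall}")
        break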

We can go further and check that the two models are calibrated in exactly the same way.

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5), sharex=True, sharey=True)
for ax, model, title in zip(
    axes.ravel(),
    [logistic_regression, logistic_regression_with_threshold],
    ["Threshold 0.5 (default)", f"Threshold {threshold}"],
):
    disp = DecisionBoundaryDisplay.from_estimator(
        model,
        X_test,
        ax=ax,
        cmap="coolwarm",
        response_method="predict_proba",
        alpha=0.8,
    )
    DecisionBoundaryDisplay.from_estimator(
        model,
        X_test,
        ax=ax,
        cmap="coolwarm",
        response_method="predict",
        plot_method="contour",
    )
    ax.scatter(*X_train.T, c=y_train, cmap="coolwarm", edgecolors="black")
    ax.set(xlabel="Feature 1", ylabel="Feature 2", title=title)
    fig.colorbar(disp.surface_, ax=ax, label="Probability estimate")
[Figure: probability estimates with the hard decision boundary overlaid, for the default threshold of 0.5 (left) and a threshold of 0.1 (right)]

This is not a surprise since thresholding is a post-processing step that only thresholds the probability estimates; therefore, it does not impact the calibration of the model.
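
We can verify this claim programmatically: the probability estimates of the wrapped and unwrapped models should be essentially identical (a quick sanity check, assuming FixedThresholdClassifier delegates predict_proba to its inner estimator, as used in the plots above):

import numpy as np

# Only the hard predictions differ; the probability estimates do not
proba_default = logistic_regression.predict_proba(X_test)
proba_thresholded = logistic_regression_with_threshold.predict_proba(X_test)
print(np.allclose(proba_default, proba_thresholded))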