Imbalanced classification: pitfalls and solutions#
Imbalanced classification refers to issues where the balance of the class frequencies in the target variable creates additional challenges for the classification problem. We focus on two particular issues related to imbalanced classification.
The first issue is related to a large difference between the class frequencies in the target variable: the event of interest to predict is rare. For example, in fraud detection, the event of interest is a fraudulent transaction, which is much less common than legitimate transactions. A large class imbalance can result in seemingly degenerate predictive performance when the model is evaluated naively. In this notebook, we first study this use case, which is often incorrectly addressed in many educational resources.
The second issue is related to the fact that the data acquisition process itself might not reflect the true class balance, so that the class frequencies in the target variable are not representative of the true class balance. For example, in medical diagnosis, the data acquisition process may be biased towards patients with a rare disease by collecting as many patients with the disease as without it. There is therefore a need to correct this bias. This will be the focus of the next notebook.
Class imbalance: representative data acquisition with rare events of interest#
In real-world applications, we commonly need to predict rare events, e.g. frauds, rare diseases, rare climatic events, etc. Simplifying this problem to a binary outcome, this means that the probability of the event of interest is low, typically lower than a few percent.
To cover the implications of class imbalance, we first generate a synthetic dataset for which we control the rate of the positive class. We define the generative process below as follows:
- We generate a vector of coefficients true_coef of shape (n_features,) where each element is a standard normal random variable. In short, it is the true model that we would like to learn.
- We generate a matrix of features X of shape (n_samples, n_features) where each column is a standard normal random variable.
- We compute the linear predictor z as the dot product of the features and the vector of coefficients true_coef.
- We transform the linear predictor z into class probabilities using the sigmoid function. To create rare positive events, we shift the intercept of the sigmoid function.
- Finally, we generate a binary target variable y where each event is drawn from a binomial distribution with n=1 and p being the probability of the positive class we previously computed.
import matplotlib.pyplot as plt # needed for pandas in jupyterlite.
import numpy as np
import pandas as pd
from scipy.special import expit
rng = np.random.default_rng(0)
n_samples, n_features = 1_000_000, 5
true_coef = rng.normal(size=n_features)
X = rng.normal(size=(n_samples, n_features))
z = X @ true_coef
true_intercept = -4
y = rng.binomial(n=1, p=expit(z + true_intercept))
# Wrap as pandas data structures for convenience.
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
y = pd.Series(y, name="target")
Recall that the expit
function, also known as the logistic sigmoid function,
is defined as expit(x) = 1 / (1 + np.exp(-x))
and looks as follows:
_, ax = plt.subplots()
z = np.linspace(-10, 10, 100)
ax.plot(z, expit(z))
_ = ax.set(
title="Sigmoid/Expit function",
xlabel="Linear predictor",
ylabel="Probability",
)

The expit function transforms the linear predictor into probabilities between 0 and 1. The role of the intercept is to shift the sigmoid function to the left or right.
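To make the role of the intercept concrete, here is a minimal sketch (assuming the X, true_coef and expit objects defined above are still in scope) that prints the average probability of the positive class for a few intercept values:
# Sketch: shifting the intercept moves the sigmoid left/right, which directly
# controls how rare the positive class is.
z_lin = X.to_numpy() @ true_coef  # recompute the linear predictor
for intercept in (0, -2, -4, -6):
    rate = expit(z_lin + intercept).mean()
    print(f"intercept={intercept:>3}: mean P(y=1) = {rate:.4f}")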
Let’s look at the true target and especially the relative class frequencies and absolute counts.
print(f"Relative class frequencies:\n {y.value_counts(normalize=True) * 100}")
Relative class frequencies:
target
0 97.5037
1 2.4963
Name: proportion, dtype: float64
print(f"Class counts:\n {y.value_counts()}\n")
Class counts:
target
0 975037
1 24963
Name: count, dtype: int64
Looking at the true target distribution, we observe that the positive class (label 1) is rare (~2.5% of samples). Regarding absolute counts, because we generated 1,000,000 samples, the number of events of interest is high enough to train a machine learning model (~25,000).
A particular challenge when dealing with real-world class imbalance is that the number of available samples of the rare event is usually low, even when the total number of samples is large. Therefore, it is always important to check the absolute count of the rare event. If the dataset contains fewer than 1,000 samples of the rare event, you will face the usual challenges of training a machine learning model on a small dataset: large variance of the estimator, weak signal, catastrophic overfitting, etc.
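As a sanity check before any modeling, the following short sketch (assuming the binary target is stored in y, as above) reports the absolute number of positive samples and warns when it is small:
# Minimal sanity check on the absolute number of rare-event samples.
n_positive = int(y.sum())
print(f"Number of positive samples: {n_positive}")
if n_positive < 1_000:
    print(
        "Warning: very few positive samples; expect high-variance estimates "
        "and a risk of overfitting."
    )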
Learning a predictive model#
Here, we know that our generative process was intentionally crafted to sample the target variable from the prediction function of a logistic regression model. Therefore, fitting a logistic regression model on this data might be able to recover the true model.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty=None).fit(X, y)
Let’s check if the learned model is able to recover the true model.
comparison_coef = pd.DataFrame(
{
"Data generating model": np.hstack((true_intercept, true_coef)),
"Unpenalized logistic regression": np.hstack(
(model.intercept_, model.coef_.flatten())
),
},
index=np.hstack(["intercept", model.feature_names_in_]),
)
ax = comparison_coef.plot.barh()
_ = ax.set(
title="Comparison of the true and learned model coefficients",
xlabel="Coefficient value",
ylabel="Feature",
)

We observe that the learned model is able to recover the true model coefficients. However, be aware that this is not always the case, as illustrated in the following exercise.
Exercise#
Write a small function that embeds the generative process that we defined above. This time, generate only 10,000 samples, train a logistic regression model and check the learned model coefficients. Make sure to pass the same true coefficients as in the previous section.
Do you recover the true model coefficients? If not, what is the reason?
# TODO: write your code here!
# Do not scroll too quickly!
Solution#
def generate_imbalanced_dataset(true_coef, true_intercept, n_samples=10_000, seed=0):
rng = np.random.default_rng(seed)
# We can sample a new design matrix but we need to keep the same true coefficients.
X = rng.normal(size=(n_samples, true_coef.shape[0]))
z = X @ true_coef
y = rng.binomial(n=1, p=expit(z + true_intercept))
# Wrap as pandas data structures for convenience.
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
y = pd.Series(y, name="target")
return X, y
X_exercise, y_exercise = generate_imbalanced_dataset(
true_coef, true_intercept, n_samples=10_000, seed=1
)
model_exercise = LogisticRegression(penalty=None).fit(X_exercise, y_exercise)
comparison_coef_exercise = pd.DataFrame(
{
"Data generating model": np.hstack((true_intercept, true_coef)),
"Unpenalized logistic regression": np.hstack(
(model_exercise.intercept_, model_exercise.coef_.flatten())
),
},
index=np.hstack(["intercept", model_exercise.feature_names_in_]),
)
ax = comparison_coef_exercise.plot.barh()
_ = ax.set(
title=(
"Comparison of the true and learned model coefficients\n"
"trained on a smaller dataset"
),
xlabel="Coefficient value",
ylabel="Feature",
)

We observe a larger difference between the coefficients of the true generative process and those of the learned model. Furthermore, the learned model has a larger variance: it learns different coefficients when we vary the seed used to sample the training set.
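To make the variance claim tangible, here is a small sketch (reusing the generate_imbalanced_dataset helper defined in the solution above) that refits the model for a few seeds and prints the learned intercepts and coefficients:
# Refit on several small datasets drawn with different seeds to observe the
# seed-to-seed variability of the learned coefficients.
for seed in range(3):
    X_s, y_s = generate_imbalanced_dataset(
        true_coef, true_intercept, n_samples=10_000, seed=seed
    )
    model_s = LogisticRegression(penalty=None).fit(X_s, y_s)
    print(
        f"seed={seed}: intercept={model_s.intercept_[0]:.2f}, "
        f"coef={np.round(model_s.coef_.ravel(), 2)}"
    )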
The reason is that the coefficients of the generative process can only be recovered if the following assumptions are met:
We have access to an unlimited number of labeled training data points. As the sample size increases, the coefficients of the predictive model will get closer to the true coefficients.
The predictive model should be well specified. In other words, if our predictive model is not flexible enough then it will underfit and not recover all the signal of the true model.
The training process converges to a minimum of a strictly proper scoring rule computed on the training set.
Let us explain the meaning of this last assumption. We are interested in assessing the quality of the probabilistic predictions made by our model:
y_proba = model.predict_proba(X)
y_proba = pd.DataFrame(y_proba, columns=["p_hat(y=0)", "p_hat(y=1)"])
y_proba
|        | p_hat(y=0) | p_hat(y=1) |
|--------|------------|------------|
| 0      | 0.949956   | 0.050044   |
| 1      | 0.992806   | 0.007194   |
| 2      | 0.991267   | 0.008733   |
| 3      | 0.993746   | 0.006254   |
| 4      | 0.990541   | 0.009459   |
| ...    | ...        | ...        |
| 999995 | 0.991282   | 0.008718   |
| 999996 | 0.972341   | 0.027659   |
| 999997 | 0.974709   | 0.025291   |
| 999998 | 0.964693   | 0.035307   |
| 999999 | 0.966115   | 0.033885   |
1000000 rows × 2 columns
_ = y_proba.plot.hist(
bins=100, figsize=(10, 5), subplots=True, layout=(1, 2), sharey=True
)

bins = np.linspace(0, 1, 300)
_ = (
pd.concat([y_proba, y.to_frame()], axis=1)
.groupby("target")["p_hat(y=1)"]
.plot.hist(bins=bins, alpha=0.5, legend=True, density=True)
)

Our predictive model estimates the probabilities of the class of interest (i.e.
p_hat(y=1)
). However, those probabilistic predictions do not necessarily reflect the
true probabilities.
First, we can quickly compute the (marginal) mean of the estimated probabilities and check if we are close to the true probability of the positive class.
y_proba.mean() * 100
p_hat(y=0) 97.503089
p_hat(y=1) 2.496911
dtype: float64
y.value_counts(normalize=True) * 100
target
0 97.5037
1 2.4963
Name: proportion, dtype: float64
This confirms that the probabilistic predictions of our model are meaningful, at least from a marginal point of view.
The reason is that the learning algorithm used by LogisticRegression
successfully
minimized the log-loss on the training set. The log-loss is a “strictly proper”
scoring rule: in expectation, it is minimized if and only if the predicted
probabilities match those of the data generating process.
The three above conditions work together: the strictly proper scoring rule provides the right objective from a probabilistic prediction point of view, the well-specified model ensures the true coefficients exist within the model’s parameter space, and the unbounded sample size prevents overfitting: the optimum reached on the training set matches the expected optimum on the test set.
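To connect this with the scoring rule argument, the following sketch (assuming y, X, model, true_coef and true_intercept are still in scope) compares the log-loss of the fitted model with the log-loss obtained from the true data generating probabilities; the two values should be very close:
from sklearn.metrics import log_loss

# Compare the log-loss of the fitted model with the log-loss obtained with the
# true probabilities of the data generating process (the irreducible part).
p_true = expit(X.to_numpy() @ true_coef + true_intercept)
print(f"Log-loss of the fitted model:       {log_loss(y, model.predict_proba(X)):.5f}")
print(f"Log-loss of the true probabilities: {log_loss(y, p_true):.5f}")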
Since our classifier has successfully converged to the parameters of the data generating process, we would expect our classifier to be well calibrated. We can check that by plotting the calibration curve.
from sklearn.calibration import CalibrationDisplay
display = CalibrationDisplay.from_estimator(model, X, y, n_bins=10, strategy="quantile")
_ = display.ax_.set_title("Calibration curve of the unpenalized logistic regression")

Since we have rare events, most data points have low predicted probabilities for the positive class and the quantile-based strategy will not show a curve on the right-hand side of the plot. Let’s zoom in on the plot to better see the curve.
display.plot()
axis_lim = (
min(display.prob_true.min(), display.prob_pred.min()) * 0.9,
max(display.prob_true.max(), display.prob_pred.max()) * 1.1,
)
_ = display.ax_.set(
xlim=axis_lim,
ylim=axis_lim,
title="Calibration curve of the unpenalized logistic regression",
)

We observe that our logistic regression model is well calibrated as the curve is close to the diagonal line. This is a direct consequence of the fact that the probabilities estimated by the model are close to the true probabilities.
From predicted probabilities to predicted outcomes (and to operational decisions)#
Up to this point of the notebook, we have not encountered any real issues due to the fact that our dataset is imbalanced: with enough data points and a well-specified model minimizing a strictly proper scoring rule, everything seems to be fine.
However, practitioners have been complaining about the above setting for many years. Indeed, practical issues often arise when naively translating the estimated probabilities into predicted classification outcomes.
In classification, the predicted outcomes correspond to the classes of the target. As a general rule, the estimated probabilities of the classifier are processed to predict a single binary outcome for each sample, most often by selecting the most probable class. For binary classification, this means that the predicted class probability is thresholded with a decision cut-off value set at 0.5. In scikit-learn, this happens in
the predict
method. Let’s check the link between the predict_proba
and predict
methods.
y_pred = model.predict(X)
y_proba = model.predict_proba(X)
np.allclose(y_pred, y_proba[:, 1] > 0.5)
True
Discrete binary classification outcomes are typically evaluated with dedicated metrics, all derived from the confusion matrix, which counts the true positives, true negatives, false positives and false negatives.
from sklearn.metrics import ConfusionMatrixDisplay
display = ConfusionMatrixDisplay.from_predictions(y, y_pred)
_ = display.ax_.set_title("Confusion matrix of the unpenalized logistic regression")

From the confusion matrix above, we can already understand what bothers practitioners: the total number of positive predictions is very close to zero.
One might interpret this to mean that our model is not able to detect rare events and is thus useless. In general, instead of using the confusion matrix directly, practitioners use derived metrics such as the precision, recall, etc. Let’s check the classification report available in scikit-learn, which provides a summary of these metrics.
from sklearn.metrics import classification_report
print(classification_report(y, model.predict(X)))
precision recall f1-score support
0 0.98 1.00 0.99 975037
1 0.00 0.00 0.00 24963
accuracy 0.98 1000000
macro avg 0.49 0.50 0.49 1000000
weighted avg 0.95 0.98 0.96 1000000
As expected, the precision and recall for the class of interest are degenerate.
In the next section, we present a popular “solution” implemented by many practitioners to deal with this problem.
What people naively do and why you should not do it#
One of the reasons for not having any true positives in the confusion matrix is that the estimated probabilities for rare events are low because, as previously shown, those events are rare. The second reason is that the features we have access to are not very predictive: a large proportion of the variability of the target is unexplained by the features but instead attributed to unobserved and independent factors.
One way to counter the issue of degenerate classification metrics is to resample the dataset to balance the class frequencies. This means that we artificially increase the relative frequency of the rare event, which makes it more likely to be predicted. When fitting a model on such a resampled dataset, we therefore artificially inflate the estimated probabilities for the positive class (as it became less rare in the resampled data).
Let’s use imbalanced-learn
to resample the dataset before training a logistic
regression model. When running this notebook under jupyterlite, it is necessary
to pip install imbalanced-learn first:
%pip install -q imbalanced-learn
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
# Enforce a 0.7 ratio between the number of data points of the positive and
# negative classes.
undersampling_model = make_pipeline(
RandomUnderSampler(sampling_strategy=0.7, random_state=0),
LogisticRegression(penalty=None),
).fit(X, y)
Now, let’s repeat the previous experiment and check the confusion matrix and the classification report.
display = ConfusionMatrixDisplay.from_estimator(undersampling_model, X, y)
_ = display.ax_.set_title("Confusion matrix of the under-sampled logistic regression")

print(classification_report(y, undersampling_model.predict(X)))
precision recall f1-score support
0 0.98 0.80 0.88 975037
1 0.06 0.50 0.11 24963
accuracy 0.79 1000000
macro avg 0.52 0.65 0.50 1000000
weighted avg 0.96 0.79 0.86 1000000
We observe that the number of true positives is now non-zero, and so are the precision and recall for the class of interest.
So we might be tempted to conclude that we did the right thing by resampling the dataset. However, here we only looked at the “thresholded” metrics. We should study the calibration of the model.
Since we are working with synthetic data and we have access to the true coefficients of the data generating process, we can also compare the learned coefficients to the true coefficients.
Exercise#
Plot the coefficients of the model and check whether or not the coefficients are close to the true coefficients. Then, plot the calibration curve and check whether or not the model is well calibrated. What do you observe?
# TODO: write your code here.
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
# Do not scroll too quickly ;)
Solution#
comparison_coef = pd.DataFrame(
{
"Data generating model": np.hstack((true_intercept, true_coef)),
"Model trained on under-sampled data": np.hstack(
(
undersampling_model[-1].intercept_,
undersampling_model[-1].coef_.flatten(),
)
),
},
index=np.hstack(["intercept", undersampling_model[-1].feature_names_in_]),
)
ax = comparison_coef.plot.barh()
_ = ax.set(
title="Comparison of the true and learned model coefficients",
xlabel="Coefficient value",
ylabel="Feature",
)

display = CalibrationDisplay.from_estimator(
undersampling_model,
X,
y,
n_bins=20,
strategy="quantile",
name="Model trained on under-sampled data",
)
display.ax_.set_title("Calibration curve of the under-sampled logistic regression")
_ = display.ax_.legend(loc="upper right")

We observe that the coefficients related to the features are close to the true coefficients of the generative model. However, the intercept is completely off. This results in an uncalibrated model, as seen in the calibration curve: our model becomes too confident at predicting the (originally) rare event, which is not surprising because it is exactly what we intended to do by under-sampling the data points from the negative class.
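Because the slope coefficients are preserved and only the intercept shifts, we can quantify the shift with the classical prior-correction formula for case-control sampling. The back-of-the-envelope sketch below (assuming undersampling_model, y and true_intercept are in scope) subtracts the log-ratio between the resampled and original class odds from the learned intercept, which should bring it back close to the true value:
# Prior-correction sketch: under-sampling the negatives shifts the intercept by
# the log-ratio between the class odds after and before resampling.
odds_resampled = 0.7  # the sampling_strategy passed to RandomUnderSampler
odds_original = y.mean() / (1 - y.mean())
intercept_fitted = undersampling_model[-1].intercept_[0]
intercept_corrected = intercept_fitted - np.log(odds_resampled / odds_original)
print(f"Fitted intercept on resampled data: {intercept_fitted:.2f}")
print(f"Corrected intercept: {intercept_corrected:.2f} (true value: {true_intercept})")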
Exercise#
Since our model is not well calibrated, as an exercise, re-calibrate the model using
sklearn.calibration.CalibratedClassifierCV and check the calibration curve,
the confusion matrix, and the classification report. In the CalibratedClassifierCV,
set the parameter method="isotonic".
What do you observe?
from sklearn.calibration import CalibratedClassifierCV
# TODO: write your code here.
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
# Do not scroll too quickly ;)
Solution#
calibrated_model = CalibratedClassifierCV(undersampling_model, method="isotonic")
calibrated_model.fit(X, y)
CalibratedClassifierCV(estimator=Pipeline(steps=[('randomundersampler', RandomUnderSampler(random_state=0, sampling_strategy=0.7)), ('logisticregression', LogisticRegression(penalty=None))]), method='isotonic')
display = CalibrationDisplay.from_estimator(
calibrated_model, X, y, n_bins=20, strategy="quantile"
)
axis_lim = (
min(display.prob_true.min(), display.prob_pred.min()) * 0.9,
max(display.prob_true.max(), display.prob_pred.max()) * 1.1,
)
_ = display.ax_.set(
xlim=axis_lim,
ylim=axis_lim,
title="Calibration curve of the calibrated under-sampled logistic regression",
)

display = ConfusionMatrixDisplay.from_estimator(calibrated_model, X, y)
_ = display.ax_.set_title(
"Confusion matrix of the calibrated under-sampled logistic regression"
)

print(classification_report(y, calibrated_model.predict(X)))
precision recall f1-score support
0 0.98 1.00 0.99 975037
1 0.45 0.00 0.00 24963
accuracy 0.98 1000000
macro avg 0.71 0.50 0.49 1000000
weighted avg 0.96 0.98 0.96 1000000
So in terms of calibration, we see that the CalibratedClassifierCV
is able to
re-calibrate the model. When looking at the confusion matrix, and the classification
report, we see that we reverted the effect of the resampling and we are back to square
one.
So what is the lesson to learn here?
Resampling acts by artificially shifting the class distribution such that rare events are more likely during the training process. It impacts the predicted outcomes and for the simple case where we have a well-defined linear model, it is equivalent to shifting the intercept. However, the class probabilities predicted by the model trained on resampled data are completely off compared to the true probabilities.
This tells us that we should be careful with the choice of evaluation metrics and how they interact with the choice of the decision cut-off threshold.
Ranking metrics (e.g. ROC AUC) and probabilistic metrics (e.g. the log loss, which assesses both ranking and calibration of the predictive model) are good choices, but they completely ignore the choice of the decision cut-off threshold.
“Thresholded” metrics (e.g. precision, recall) are impacted by the decision cut-off threshold. Therefore, looking at such metrics for a single decision cut-off only can be misleading: the performance metrics can be bad not because the underlying model is bad, but because the default choice of the cut-off makes no sense for highly imbalanced classification problems. It is recommended to look at how those metrics change when varying the decision cut-off threshold.
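As a quick illustration of this recommendation, the following sketch (assuming the well-calibrated model fitted earlier and the data X, y are in scope) uses scikit-learn's precision_recall_curve to inspect precision and recall for a few candidate thresholds:
from sklearn.metrics import precision_recall_curve

# Precision and recall for every candidate threshold, computed on the training data.
precision, recall, pr_thresholds = precision_recall_curve(y, model.predict_proba(X)[:, 1])
for t in (0.05, 0.1, 0.25, 0.5):
    # Index of the smallest candidate threshold greater than or equal to t.
    idx = np.searchsorted(pr_thresholds, t)
    print(f"threshold={t:.2f}: precision={precision[idx]:.2f}, recall={recall[idx]:.2f}")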
Let’s explore this further in the next section.
Assessing the impact of the decision cut-off on “thresholded” metrics#
In this section, we show two useful meta-estimators available in scikit-learn to set the decision cut-off threshold to change the predicted outcomes of a classifier.
On the one hand, the FixedThresholdClassifier
meta-estimator accepts an explicit
value that is used to threshold the estimated probabilities into predicted outcomes.
The value is defined by the user and is not optimized to maximize a specific metric.
On the other hand, the TunedThresholdClassifierCV
meta-estimator tunes the decision
cut-off threshold to maximize a specific metric. The metric is defined by the user and
is optimized using cross-validation.
Let’s first demonstrate how to use the FixedThresholdClassifier
meta-estimator.
First, let’s define a vanilla logistic regression model since we previously saw that
the resulting model is well calibrated when fitted on the original dataset.
model = LogisticRegression(penalty=None).fit(X, y)
Now, let’s say that we would like to get a model with a specific precision-recall trade-off. For such analysis, we compute the precision and recall as a function of the decision cut-off threshold as well as the precision-recall curve.
import numpy as np
from sklearn.metrics import make_scorer, precision_score, recall_score
# The following functionality is not yet implemented in scikit-learn and we use a bit
# of private API to easily compute the precision and recall as a function of the
# decision cut-off threshold. In the future, you can refer to the following PR that
# implements such functionality:
# https://github.com/scikit-learn/scikit-learn/pull/31338
from sklearn.metrics._scorer import _CurveScorer as CurveScorer
thresholds = np.linspace(0, 1, 100)
precision_curve_scorer = CurveScorer.from_scorer(
make_scorer(precision_score, zero_division=0),
response_method="predict_proba",
thresholds=thresholds,
)
recall_curve_scorer = CurveScorer.from_scorer(
make_scorer(recall_score, zero_division=0),
response_method="predict_proba",
thresholds=thresholds,
)
precision_scores, precision_thresholds = precision_curve_scorer(model, X, y)
recall_scores, recall_thresholds = recall_curve_scorer(model, X, y)
%pip install -q plotly nbformat
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig_plotly = make_subplots(
rows=1,
cols=2,
subplot_titles=("Precision and Recall vs Threshold", "Precision-Recall Curve"),
horizontal_spacing=0.1,
)
fig_plotly.add_trace(
go.Scatter(
x=precision_thresholds,
y=precision_scores,
mode="lines+markers",
name="Precision",
marker=dict(symbol="cross"),
hovertemplate="Threshold: %{x:.2f}<br>Precision: %{y:.3f}",
),
row=1,
col=1,
)
fig_plotly.add_trace(
go.Scatter(
x=recall_thresholds,
y=recall_scores,
mode="lines+markers",
name="Recall",
marker=dict(symbol="cross"),
hovertemplate="Threshold: %{x:.2f}<br>Recall: %{y:.3f}",
),
row=1,
col=1,
)
fig_plotly.add_trace(
go.Scatter(
x=recall_scores,
y=precision_scores,
mode="lines+markers",
name="PR Curve",
marker=dict(symbol="circle"),
hovertemplate="Recall: %{x:.3f}<br>Precision: %{y:.3f}<br>Threshold: %{text}",
text=[f"{t:.2f}" for t in precision_thresholds],
showlegend=False,
),
row=1,
col=2,
)
fig_plotly.update_layout(
legend=dict(
x=0.35,
y=0.85,
xanchor="left",
yanchor="top",
bgcolor="rgba(255,255,255,0.8)",
bordercolor="rgba(0,0,0,0.2)",
borderwidth=1,
),
hovermode="closest",
width=1200,
height=500,
)
fig_plotly.update_xaxes(title_text="Threshold", range=[0, 1], row=1, col=1)
fig_plotly.update_yaxes(title_text="Score", range=[0, 1], row=1, col=1)
fig_plotly.update_xaxes(title_text="Recall", range=[0, 1], row=1, col=2)
fig_plotly.update_yaxes(title_text="Precision", range=[0, 1], row=1, col=2)
fig_plotly.show()
Using these curves, we can now make a choice regarding a specific trade-off between the levels of recall and precision for our classifier. Let’s consider a possible use case: our model could be used for predictive maintenance and, in this particular setting, we could imagine that operators reviewing cases of rare failures expect a certain level of precision from the automated failure detection system. Otherwise, the system will show too many false positive cases, tiring the operators and leading to potential errors. However, while requiring a certain level of precision, we would also like our automated failure detection system to maximize the recall.
Thus, by looking at the precision-recall curve above, we could impose a minimum level of precision of 10%. You can mentally draw a horizontal line at 0.1 on the y-axis, consider all points above this line, look for the maximum recall, and deduce the corresponding optimal threshold. In this case, we find a threshold of about 0.07.
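The same reading can be done programmatically. The sketch below (assuming the precision_scores, recall_scores and precision_thresholds arrays computed above are in scope) keeps only the thresholds reaching at least 10% precision and picks the one with the highest recall:
# Programmatic counterpart of the visual reading: among the thresholds reaching
# at least 10% precision, pick the one with the highest recall.
prec = np.asarray(precision_scores)
rec = np.asarray(recall_scores)
thr = np.asarray(precision_thresholds)
best_idx = np.argmax(np.where(prec >= 0.1, rec, -np.inf))
print(
    f"Selected threshold: {thr[best_idx]:.2f} "
    f"(precision={prec[best_idx]:.2f}, recall={rec[best_idx]:.2f})"
)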
Exercise#
Using the FixedThresholdClassifier meta-estimator, set the decision cut-off
threshold to the value identified above (i.e. ~0.07). Check the calibration curve,
the confusion matrix, and the classification report.
Is the resulting model well calibrated? What are the levels of precision and recall for the class of interest?
from sklearn.model_selection import FixedThresholdClassifier
# TODO: write your code here.
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
# Do not scroll too quickly ;)
Solution#
threshold = 0.07
model = FixedThresholdClassifier(
LogisticRegression(penalty=None), threshold=threshold
).fit(X, y)
display = CalibrationDisplay.from_estimator(model, X, y, n_bins=20, strategy="quantile")
axis_lim = (
min(display.prob_true.min(), display.prob_pred.min()) * 0.9,
max(display.prob_true.max(), display.prob_pred.max()) * 1.1,
)
_ = display.ax_.set(
xlim=axis_lim,
ylim=axis_lim,
title="Calibration curve of the fixed threshold logistic regression",
)

display = ConfusionMatrixDisplay.from_estimator(model, X, y)
_ = display.ax_.set_title("Confusion matrix of the fixed threshold logistic regression")

print(classification_report(y, model.predict(X)))
precision recall f1-score support
0 0.98 0.95 0.97 975037
1 0.10 0.20 0.13 24963
accuracy 0.94 1000000
macro avg 0.54 0.58 0.55 1000000
weighted avg 0.96 0.94 0.95 1000000
As expected, we observe that the model is well calibrated because modifying the
decision cut-off threshold does not impact the values returned by the predict_proba
method: the calibration curve remains unchanged.
However, it does impact the binary values returned by the predict
method and
therefore the confusion matrix.
With the selected threshold, we expect to have a minimum level of precision of 10%, which is exactly what we observe.
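We can verify this claim with a small sketch (assuming the fixed-threshold model from the solution is in scope, and refitting a plain logistic regression on the same data): the predicted probabilities are identical while the hard predictions differ.
# The fixed-threshold wrapper only changes how the probabilities are thresholded:
# predict_proba is untouched while predict uses the custom cut-off.
plain_model = LogisticRegression(penalty=None).fit(X, y)
print(np.allclose(model.predict_proba(X), plain_model.predict_proba(X)))
print((model.predict(X) != plain_model.predict(X)).sum(), "predictions differ")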
While it is an interesting exercise, setting the threshold manually is not the best
practice. It would be better to use the TunedThresholdClassifierCV
meta-estimator to
tune the decision cut-off threshold to maximize a specific metric or a specific
trade-off using cross-validation to avoid depending too much on a single train/test
split.
Below, we show a case where we want to maximize the recall score while ensuring that
the model reaches a minimum precision score. We therefore need to create a custom function
that can be used by the TunedThresholdClassifierCV
meta-estimator.
def maximize_recall_under_constrained_precision(y_true, y_pred, precision_level):
precision, recall = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
if precision < precision_level:
# We reject any model that does not meet the required precision level
# by returning the worst possible score.
return -np.inf
# Otherwise, we want to select the cut-off threshold that maximizes the recall.
return recall
from sklearn.model_selection import TunedThresholdClassifierCV
# Create a scorer that maximizes the recall but such that the precision is at
# least 0.1.
scoring = make_scorer(maximize_recall_under_constrained_precision, precision_level=0.1)
model = TunedThresholdClassifierCV(
estimator=LogisticRegression(penalty=None), scoring=scoring, n_jobs=-1
).fit(X, y)
display = ConfusionMatrixDisplay.from_estimator(model, X, y)
_ = display.ax_.set_title("Confusion matrix of the tuned threshold logistic regression")

print(classification_report(y, model.predict(X)))
precision recall f1-score support
0 0.98 0.97 0.97 975037
1 0.11 0.15 0.13 24963
accuracy 0.95 1000000
macro avg 0.54 0.56 0.55 1000000
weighted avg 0.96 0.95 0.95 1000000
Looking at the confusion matrix, we observe that we detect a certain number of rare events. Looking at the classification report, we observe that the constraint set on the precision is respected (i.e. the precision is 0.11). For such precision, the maximum recall is 0.15. Now, let’s check the decision cut-off threshold that was found during the cross-validation procedure.
float(model.best_threshold_)
0.07943894606514569
Here, we chose to maximize a specific metric under a constraint. It is the best choice
when no “business” metric is known for the machine learning task at hand. However, be
aware that if you have a “business” metric available, then you should use it together
with the TunedThresholdClassifierCV
meta-estimator. To see an example, refer to the
notebook entitled “Cost-sensitive learning to optimize a business metrics” from this
course.
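For illustration only, here is a hedged sketch of what such a business metric could look like, with entirely made-up gain and cost values per confusion-matrix cell (in practice, these amounts would come from domain experts):
from sklearn.metrics import confusion_matrix

def hypothetical_business_gain(y_true, y_pred):
    # Entirely made-up monetary values: catching a rare event is worth 100,
    # a false alarm costs 5 and a missed event costs 50.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return 100 * tp - 5 * fp - 50 * fn

business_scoring = make_scorer(hypothetical_business_gain)
tuned_business_model = TunedThresholdClassifierCV(
    estimator=LogisticRegression(penalty=None),
    scoring=business_scoring,
    n_jobs=-1,
).fit(X, y)
print(f"Tuned threshold for the hypothetical business metric: {tuned_business_model.best_threshold_:.3f}")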
Take away#
When working on imbalanced classification problems, using the default decision threshold of 0.5 can lead to seemingly disappointing classification performance when evaluating the model using metrics derived from the confusion matrix (accuracy, precision, recall, F1 score, Matthews correlation coefficient, …).
Resampling the training set can improve those metrics, but at the cost of breaking the calibration of the predicted probabilities.
Instead, we recommend evaluating and tuning the hyper-parameters of the models using threshold-independent metrics (such as ROC AUC or log-loss), and then plotting the thresholded prediction metrics for many choices of the cut-off threshold.
Then, we can use the TunedThresholdClassifierCV meta-estimator to find the best decision threshold for an explicitly defined trade-off between precision and recall.
In later notebooks, we will explore how to deal with a prevalence shift between the available training data and the target deployment setting, how to incorporate business-defined costs into the threshold tuning process, and dive deeper into the interplay between ranking performance, calibration, and various choices of evaluation metrics.