Computing prediction intervals using quantile regression#

Environment setup#

If needed (i.e. when running in JupyterLite), we install some extra dependencies for this notebook.

%pip install -q https://pypi.anaconda.org/ogrisel/simple/polars/1.24.0/polars-1.24.0-cp39-abi3-emscripten_3_1_58_wasm32.whl
%pip install -q skrub altair holidays plotly nbformat
Note: you may need to restart the kernel to use updated packages.
import warnings

import altair
import cloudpickle
import numpy as np
import pyarrow  # noqa: F401
import polars as pl
import tzdata  # noqa: F401

from tutorial_helpers import (
    binned_coverage,
    plot_lorenz_curve,
    plot_reliability_diagram,
    plot_residuals_vs_predicted,
    collect_cv_predictions,
)

# Ignore warnings from pkg_resources triggered by Python 3.13's multiprocessing.
warnings.filterwarnings("ignore", category=UserWarning, module="pkg_resources")
with open("feature_engineering_pipeline.pkl", "rb") as f:
    feature_engineering_pipeline = cloudpickle.load(f)


features = feature_engineering_pipeline["features"]
targets = feature_engineering_pipeline["targets"]
prediction_time = feature_engineering_pipeline["prediction_time"]
horizons = feature_engineering_pipeline["horizons"]
target_column_name_pattern = feature_engineering_pipeline["target_column_name_pattern"]

Define the quantile regressors#

In this section, we show how one can use gradient boosting with a quantile loss function to predict different quantiles and thus obtain an uncertainty quantification of the predictions.

In terms of evaluation, we reuse the R2 and MAPE scores. However, they are not helpful to assess the reliability of quantile models. For this purpose, we use a metric derived from the loss minimized by those models, the pinball loss: the D2 pinball score. It is easier to interpret since the best possible score is bounded by 1 and a score of 0 corresponds to constant predictions at the target quantile.
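
To make the metric concrete, here is a minimal sketch (with made-up numbers) of how the D2 pinball score relates to the pinball loss: it compares the model's mean pinball loss to that of a constant prediction at the empirical target quantile of the observed values.

import numpy as np
from sklearn.metrics import d2_pinball_score, mean_pinball_loss

y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([9.0, 13.0, 14.0, 19.0])
alpha = 0.95

loss_model = mean_pinball_loss(y_true, y_pred, alpha=alpha)
loss_constant = mean_pinball_loss(
    y_true, np.full_like(y_true, np.percentile(y_true, 100 * alpha)), alpha=alpha
)
print(1 - loss_model / loss_constant)  # should match the score below
print(d2_pinball_score(y_true, y_pred, alpha=alpha))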

horizon_of_interest = horizons[-1]  # Focus on the 24-hour horizon
target_column_name = target_column_name_pattern.format(horizon=horizon_of_interest)
predicted_target_column_name = "predicted_" + target_column_name
target = targets[target_column_name].skb.mark_as_y()
target
<GetItem 'load_mw_horizon_24h'>

from sklearn.metrics import get_scorer, make_scorer
from sklearn.metrics import mean_absolute_percentage_error, d2_pinball_score

scoring = {
    "r2": get_scorer("r2"),
    "mape": make_scorer(mean_absolute_percentage_error),
    "d2_pinball_05": make_scorer(d2_pinball_score, alpha=0.05),
    "d2_pinball_50": make_scorer(d2_pinball_score, alpha=0.50),
    "d2_pinball_95": make_scorer(d2_pinball_score, alpha=0.95),
}

We now define three different models:

  • a model predicting the 5th percentile of the load

  • a model predicting the median of the load

  • a model predicting the 95th percentile of the load

from sklearn.ensemble import HistGradientBoostingRegressor


common_params = dict(
    loss="quantile", learning_rate=0.1, max_leaf_nodes=100, random_state=0
)
predictions_hgbr_05 = features.skb.apply(
    HistGradientBoostingRegressor(**common_params, quantile=0.05),
    y=target,
)
predictions_hgbr_50 = features.skb.apply(
    HistGradientBoostingRegressor(**common_params, quantile=0.5),
    y=target,
)
predictions_hgbr_95 = features.skb.apply(
    HistGradientBoostingRegressor(**common_params, quantile=0.95),
    y=target,
)

Evaluation via cross-validation#

We evaluate the performance of the quantile regressors via cross-validation.

from sklearn.model_selection import TimeSeriesSplit


max_train_size = 2 * 52 * 24 * 7  # max ~2 years of training data
test_size = 24 * 7 * 24  # 24 weeks of test data
gap = 7 * 24  # 1 week gap between train and test sets
ts_cv_5 = TimeSeriesSplit(
    n_splits=5, max_train_size=max_train_size, test_size=test_size, gap=gap
)
cv_results_hgbr_05 = predictions_hgbr_05.skb.cross_validate(
    cv=ts_cv_5,
    scoring=scoring,
    return_learner=True,
    verbose=1,
    n_jobs=-1,
)
cv_results_hgbr_50 = predictions_hgbr_50.skb.cross_validate(
    cv=ts_cv_5,
    scoring=scoring,
    return_learner=True,
    verbose=1,
    n_jobs=-1,
)
cv_results_hgbr_95 = predictions_hgbr_95.skb.cross_validate(
    cv=ts_cv_5,
    scoring=scoring,
    return_learner=True,
    verbose=1,
    n_jobs=-1,
)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   12.8s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   12.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    9.8s finished

Let’s first collect all the cross-validated predictions for further inspection.

cv_predictions_hgbr_05 = collect_cv_predictions(
    cv_results_hgbr_05["learner"], ts_cv_5, predictions_hgbr_05, prediction_time
)
cv_predictions_hgbr_50 = collect_cv_predictions(
    cv_results_hgbr_50["learner"], ts_cv_5, predictions_hgbr_50, prediction_time
)
cv_predictions_hgbr_95 = collect_cv_predictions(
    cv_results_hgbr_95["learner"], ts_cv_5, predictions_hgbr_95, prediction_time
)

Now, let’s plot the predictions of each model. To do so, we first gather all the predictions in a single dataframe.

results = pl.concat(
    [
        cv_predictions_hgbr_05[0].rename({"predicted_load_mw": "predicted_load_mw_05"}),
        cv_predictions_hgbr_50[0].select("predicted_load_mw").rename(
            {"predicted_load_mw": "predicted_load_mw_50"}
        ),
        cv_predictions_hgbr_95[0].select("predicted_load_mw").rename(
            {"predicted_load_mw": "predicted_load_mw_95"}
        ),
    ],
    how="horizontal",
).tail(24 * 10)

Now, we plot the observed values and the predicted median as lines. In addition, we plot the 5th and 95th percentiles as a shaded area: between those two bounds, we expect to find 90% of the observed values.

We plot this information for the last 10 days of the test data from the first fold of the cross-validation.

median_chart = (
    altair.Chart(results)
    .transform_fold(["load_mw", "predicted_load_mw_50"])
    .mark_line(tooltip=True)
    .encode(x="prediction_time:T", y="value:Q", color="key:N")
)

# Add a column for the band legend
results_with_band = results.with_columns(pl.lit("90% interval").alias("band_type"))

quantile_band_chart = (
    altair.Chart(results_with_band)
    .mark_area(opacity=0.4, tooltip=True)
    .encode(
        x="prediction_time:T",
        y="predicted_load_mw_05:Q",
        y2="predicted_load_mw_95:Q",
        color=altair.Color("band_type:N", scale=altair.Scale(range=["lightgreen"])),
    )
)

combined_chart = quantile_band_chart + median_chart
combined_chart.resolve_scale(color="independent").interactive()

Now, we can inspect the cross-validated metrics for each model.

cv_results_hgbr_05[
    [col for col in cv_results_hgbr_05.columns if col.startswith("test_")]
].mean(axis=0).round(3)
test_r2               0.853
test_mape             0.051
test_d2_pinball_05    0.698
test_d2_pinball_50    0.644
test_d2_pinball_95   -1.250
dtype: float64
cv_results_hgbr_50[
    [col for col in cv_results_hgbr_50.columns if col.startswith("test_")]
].mean(axis=0).round(3)
test_r2               0.970
test_mape             0.024
test_d2_pinball_05    0.180
test_d2_pinball_50    0.841
test_d2_pinball_95    0.494
dtype: float64
cv_results_hgbr_95[
    [col for col in cv_results_hgbr_95.columns if col.startswith("test_")]
].mean(axis=0).round(3)
test_r2               0.848
test_mape             0.064
test_d2_pinball_05   -2.392
test_d2_pinball_50    0.611
test_d2_pinball_95    0.775
dtype: float64

Focusing on the different D2 scores, we observe that each model obtains the best D2 score for the quantile it was trained on. For instance, the model predicting the 5th percentile obtains the highest D2 pinball score with alpha=0.05. This is expected, but it confirms which loss each model minimizes.

Now, let’s plot the residuals against the predicted values for each model.

plot_residuals_vs_predicted(cv_predictions_hgbr_05).interactive().properties(
    title=(
        "Residuals vs Predicted Values from cross-validation predictions"
        " for quantile 0.05"
    )
)
plot_residuals_vs_predicted(cv_predictions_hgbr_50).interactive().properties(
    title=("Residuals vs Predicted Values from cross-validation predictions for median")
)
plot_residuals_vs_predicted(cv_predictions_hgbr_95).interactive().properties(
    title=(
        "Residuals vs Predicted Values from cross-validation predictions"
        " for quantile 0.95"
    )
)

We observe the expected behaviour: the residuals are centered and symmetric around 0 for the median model, while they are off-center and biased for the 5th and 95th percentile models.

Note that we could obtain similar plots using scikit-learn’s PredictionErrorDisplay. This display can also plot the observed values against the predicted values.

cv_predictions_hgbr_05_concat = pl.concat(cv_predictions_hgbr_05, how="vertical")
cv_predictions_hgbr_50_concat = pl.concat(cv_predictions_hgbr_50, how="vertical")
cv_predictions_hgbr_95_concat = pl.concat(cv_predictions_hgbr_95, how="vertical")
import matplotlib.pyplot as plt
from sklearn.metrics import PredictionErrorDisplay


for kind in ["actual_vs_predicted", "residual_vs_predicted"]:
    fig, axs = plt.subplots(1, 3, figsize=(15, 5), sharey=True)

    PredictionErrorDisplay.from_predictions(
        y_true=cv_predictions_hgbr_05_concat["load_mw"].to_numpy(),
        y_pred=cv_predictions_hgbr_05_concat["predicted_load_mw"].to_numpy(),
        kind=kind,
        ax=axs[0],
    )
    axs[0].set_title("0.05 quantile regression")

    PredictionErrorDisplay.from_predictions(
        y_true=cv_predictions_hgbr_50_concat["load_mw"].to_numpy(),
        y_pred=cv_predictions_hgbr_50_concat["predicted_load_mw"].to_numpy(),
        kind=kind,
        ax=axs[1],
    )
    axs[1].set_title("Median regression")

    PredictionErrorDisplay.from_predictions(
        y_true=cv_predictions_hgbr_95_concat["load_mw"].to_numpy(),
        y_pred=cv_predictions_hgbr_95_concat["predicted_load_mw"].to_numpy(),
        kind=kind,
        ax=axs[2],
    )
    axs[2].set_title("0.95 quantile regression")

    fig.suptitle(f"{kind} for GBRT minimzing different quantile losses")

Those plots carry the same information as the previous ones.

Now, we assess whether the actual coverage of the models is close to the target coverage of 90%. In addition, we compute the average width of the prediction intervals.

def coverage(y_true, y_quantile_low, y_quantile_high):
    y_true = np.asarray(y_true)
    y_quantile_low = np.asarray(y_quantile_low)
    y_quantile_high = np.asarray(y_quantile_high)
    return float(
        np.logical_and(y_true >= y_quantile_low, y_true <= y_quantile_high)
        .mean()
        .round(4)
    )


def mean_width(y_true, y_quantile_low, y_quantile_high):
    y_true = np.asarray(y_true)
    y_quantile_low = np.asarray(y_quantile_low)
    y_quantile_high = np.asarray(y_quantile_high)
    return float(np.abs(y_quantile_high - y_quantile_low).mean().round(1))
coverage(
    cv_predictions_hgbr_50_concat["load_mw"].to_numpy(),
    cv_predictions_hgbr_05_concat["predicted_load_mw"].to_numpy(),
    cv_predictions_hgbr_95_concat["predicted_load_mw"].to_numpy(),
)
0.7657

We see that the coverage obtained on the cross-validated predictions (~77%) is much lower than the target coverage of 90%. Indeed, if both quantile regressors were individually well calibrated, we would expect the observed values to fall between the 5th and 95th percentile predictions 0.95 - 0.05 = 90% of the time. The pair of regressors is therefore not jointly calibrated to estimate the 90% interval.

mean_width(
    cv_predictions_hgbr_50_concat["load_mw"].to_numpy(),
    cv_predictions_hgbr_05_concat["predicted_load_mw"].to_numpy(),
    cv_predictions_hgbr_95_concat["predicted_load_mw"].to_numpy(),
)
5144.7

The mean width provides an interpretable measure in the original unit of the target variable: here, the intervals are ~5,100 MW wide on average.

We can go a bit further: bin the cross-validated predictions and check whether some specific bins show a better or worse coverage.
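
The binned_coverage tutorial helper computes this per fold and per bin. As a hedged sketch of the underlying idea (the helper's exact binning strategy may differ), the per-bin coverage for a single fold could be computed with equal-frequency bins of the observed load:

import numpy as np

def coverage_per_bin(y_true, y_low, y_high, n_bins=10):
    # Equal-frequency bin edges of the observed target.
    edges = np.quantile(y_true, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, y_true, side="right") - 1, 0, n_bins - 1)
    covered = (y_true >= y_low) & (y_true <= y_high)
    return [float(covered[bin_idx == b].mean()) for b in range(n_bins)]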

binned_coverage_results = binned_coverage(
    [df["load_mw"].to_numpy() for df in cv_predictions_hgbr_50],
    [df["predicted_load_mw"].to_numpy() for df in cv_predictions_hgbr_05],
    [df["predicted_load_mw"].to_numpy() for df in cv_predictions_hgbr_95],
    n_bins=10,
)
binned_coverage_results
bin_left bin_right bin_center fold_idx coverage mean_width n_samples
0 28744.0 36884.0 32814.0 0 0.4205 4235.5 459
1 36886.0 40158.0 38522.0 0 0.6493 4291.2 499
2 40160.0 42982.0 41571.0 0 0.7117 4501.7 496
3 42983.0 45219.0 44101.0 0 0.7400 4281.1 473
4 45220.0 47332.0 46276.0 0 0.7401 4071.5 454
5 47334.0 49718.0 48526.0 0 0.7044 3868.6 477
6 49720.0 53126.0 51423.0 0 0.7418 4679.3 426
7 53129.0 57570.0 55349.5 0 0.8162 6144.5 321
8 57572.0 63173.0 60372.5 0 0.8687 7106.8 259
9 63176.0 86573.0 74874.5 0 0.8929 8807.9 168
10 28744.0 36884.0 32814.0 1 0.4637 3917.9 496
11 36886.0 40158.0 38522.0 1 0.7960 4557.4 402
12 40160.0 42982.0 41571.0 1 0.8550 4842.4 407
13 42983.0 45219.0 44101.0 1 0.8783 4546.3 378
14 45220.0 47332.0 46276.0 1 0.8939 4620.4 377
15 47334.0 49718.0 48526.0 1 0.8648 5101.1 392
16 49720.0 53126.0 51423.0 1 0.8450 5774.8 400
17 53129.0 57570.0 55349.5 1 0.8657 6392.7 402
18 57572.0 63173.0 60372.5 1 0.8005 6515.6 386
19 63176.0 86573.0 74874.5 1 0.6505 7550.9 392
20 28744.0 36884.0 32814.0 2 0.6887 4985.7 318
21 36886.0 40158.0 38522.0 2 0.7657 5111.9 367
22 40160.0 42982.0 41571.0 2 0.7731 5186.4 357
23 42983.0 45219.0 44101.0 2 0.8050 4952.7 400
24 45220.0 47332.0 46276.0 2 0.8814 4906.7 413
25 47334.0 49718.0 48526.0 2 0.8951 4984.1 410
26 49720.0 53126.0 51423.0 2 0.8716 5976.0 475
27 53129.0 57570.0 55349.5 2 0.8577 6568.6 534
28 57572.0 63173.0 60372.5 2 0.8386 6386.9 446
29 63176.0 86573.0 74874.5 2 0.6731 7098.5 312
30 28744.0 36884.0 32814.0 3 0.7563 4199.4 513
31 36886.0 40158.0 38522.0 3 0.8534 3682.2 498
32 40160.0 42982.0 41571.0 3 0.8358 3972.7 481
33 42983.0 45219.0 44101.0 3 0.8337 3750.7 457
34 45220.0 47332.0 46276.0 3 0.8767 3718.1 511
35 47334.0 49718.0 48526.0 3 0.7968 4168.4 497
36 49720.0 53126.0 51423.0 3 0.7019 4535.3 369
37 53129.0 57570.0 55349.5 3 0.8386 5236.3 254
38 57572.0 63173.0 60372.5 3 0.7041 5945.4 267
39 63176.0 86573.0 74874.5 3 0.3568 7160.5 185
40 28744.0 36884.0 32814.0 4 0.6217 4230.8 230
41 36886.0 40158.0 38522.0 4 0.7570 4435.0 251
42 40160.0 42982.0 41571.0 4 0.8723 4569.2 274
43 42983.0 45219.0 44101.0 4 0.8350 4213.3 309
44 45220.0 47332.0 46276.0 4 0.8385 4167.3 260
45 47334.0 49718.0 48526.0 4 0.7137 4547.9 241
46 49720.0 53126.0 51423.0 4 0.8324 5691.7 346
47 53129.0 57570.0 55349.5 4 0.8591 6481.9 504
48 57572.0 63173.0 60372.5 4 0.7857 6359.0 658
49 63176.0 86573.0 74874.5 4 0.5495 6589.4 959

Let’s make a plot to check these results visually.

coverage_by_bin = binned_coverage_results.copy()
coverage_by_bin["bin_label"] = coverage_by_bin.apply(
    lambda row: f"[{row.bin_left:.0f}, {row.bin_right:.0f}]", axis=1
)
ax = coverage_by_bin.boxplot(column="coverage", by="bin_label", whis=1000)
ax.axhline(y=0.9, color="red", linestyle="--", label="Target coverage (0.9)")
ax.set(
    xlabel="Load bins (MW)",
    ylabel="Coverage",
    title="Coverage Distribution by Load Bins",
)
ax.legend()
plt.suptitle("")  # Remove automatic suptitle from boxplot
_ = plt.xticks(rotation=45)

We observe that the lowest and highest bins, i.e. low and high loads, have the worst coverage, with high variability across folds.

Reliability diagrams and Lorenz curves for quantile regression#
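
Before looking at the plots, recall what a quantile reliability diagram measures: within bins of the predicted quantile, a model calibrated at level alpha should see a fraction alpha of the observations fall below its predictions. Below is a minimal sketch of that quantity, assuming equal-frequency bins of the predictions; the plot_reliability_diagram helper may differ in details such as the binning strategy.

import numpy as np

def empirical_level_per_bin(y_true, y_pred_quantile, n_bins=10):
    # Bin the observations by predicted quantile value.
    edges = np.quantile(y_pred_quantile, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.clip(
        np.searchsorted(edges, y_pred_quantile, side="right") - 1, 0, n_bins - 1
    )
    below = y_true <= y_pred_quantile
    # A fraction close to the nominal level in every bin means good calibration.
    return [float(below[bin_idx == b].mean()) for b in range(n_bins)]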

plot_reliability_diagram(
    cv_predictions_hgbr_50, kind="quantile", quantile_level=0.50
).interactive().properties(
    title="Reliability diagram for quantile 0.50 from cross-validation predictions"
)
plot_reliability_diagram(
    cv_predictions_hgbr_05, kind="quantile", quantile_level=0.05
).interactive().properties(
    title="Reliability diagram for quantile 0.05 from cross-validation predictions"
)
plot_reliability_diagram(
    cv_predictions_hgbr_95, kind="quantile", quantile_level=0.95
).interactive().properties(
    title="Reliability diagram for quantile 0.95 from cross-validation predictions"
)
plot_lorenz_curve(cv_predictions_hgbr_50).interactive().properties(
    title="Lorenz curve for quantile 0.50 from cross-validation predictions"
)
plot_lorenz_curve(cv_predictions_hgbr_05).interactive().properties(
    title="Lorenz curve for quantile 0.05 from cross-validation predictions"
)
plot_lorenz_curve(cv_predictions_hgbr_95).interactive().properties(
    title="Lorenz curve for quantile 0.95 from cross-validation predictions"
)

Quantile regression as classification#

In the following, we turn a quantile regression problem for all possible quantile levels into a multiclass classification problem: we discretize the target variable into bins and interpolate the cumulative sum of the predicted bin membership probabilities to estimate the CDF of the distribution of the continuous target variable conditioned on the features.
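
As a toy illustration of this inverse-CDF interpolation (the bin edges and probabilities below are made up):

import numpy as np
from scipy.interpolate import interp1d

bin_edges = np.array([0.0, 10.0, 20.0, 30.0])  # edges of 3 bins
class_proba = np.array([0.2, 0.5, 0.3])  # predicted bin membership probabilities
cdf = np.concatenate([[0.0], np.cumsum(class_proba)])  # CDF at the bin edges
# Inverse CDF by linear interpolation: quantile levels -> target scale.
print(interp1d(cdf, bin_edges)([0.05, 0.5, 0.95]))  # approx. [2.5, 16.0, 28.33]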

Ideally, the classifier should remain efficient when trained on a large number of classes (induced by the number of bins). We therefore use a Random Forest classifier as the default base estimator.

There are several advantages to this approach:

  • a single model is trained and can jointly estimate quantiles for all quantile levels (assuming a well-tuned number of bins);

  • the quantile levels can be chosen at prediction time, which allows for a flexible quantile regression model;

  • in practice, the resulting predictions are often reasonably well calibrated as we will see in the reliability diagrams below.

One possible drawback is that current implementations of gradient boosting models tend to be very slow to train with a large number of classes. Random Forests are much more efficient in this case, but they do not always provide the best predictive performance. Combining this approach with tabular neural networks might lead to competitive results.

However, the current scikit-learn API is not expressive enough to handle the output shape of the quantile prediction function. We therefore cannot make it fit into a skrub pipeline.

import numpy as np
from scipy.interpolate import interp1d
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.utils import check_random_state
from sklearn.utils.validation import check_consistent_length, check_is_fitted


class BinnedQuantileRegressor(BaseEstimator, RegressorMixin):
    def __init__(
        self,
        estimator=None,
        n_bins=100,
        quantile=0.5,
        random_state=None,
    ):
        self.n_bins = n_bins
        self.estimator = estimator
        self.quantile = quantile
        self.random_state = random_state

    def fit(self, X, y):
        # Lightweight input validation: most of the input validation will be
        # handled by the sub estimators.
        random_state = check_random_state(self.random_state)
        check_consistent_length(X, y)
        self.target_binner_ = KBinsDiscretizer(
            n_bins=self.n_bins,
            strategy="quantile",
            subsample=200_000,
            encode="ordinal",
            quantile_method="averaged_inverted_cdf",
            random_state=random_state,
        )

        y_binned = (
            self.target_binner_.fit_transform(np.asarray(y).reshape(-1, 1))
            .ravel()
            .astype(np.int32)
        )

        # Fit the multiclass classifier to predict the binned targets from the
        # training set.
        if self.estimator is None:
            estimator = RandomForestClassifier(random_state=random_state)
        else:
            estimator = clone(self.estimator)
        self.estimator_ = estimator.fit(X, y_binned)
        return self

    def predict_quantiles(self, X, quantiles=(0.05, 0.5, 0.95)):
        check_is_fitted(self, "estimator_")
        edges = self.target_binner_.bin_edges_[0]
        n_bins = edges.shape[0] - 1
        expected_shape = (X.shape[0], n_bins)
        y_proba_raw = self.estimator_.predict_proba(X)

        # Some bins might stay empty on the training set. Typically,
        # classifiers do not learn to predict an explicit 0 probability for
        # unobserved classes, so we have to post-process their output:
        if y_proba_raw.shape != expected_shape:
            y_proba = np.zeros(shape=expected_shape)
            y_proba[:, self.estimator_.classes_] = y_proba_raw
        else:
            y_proba = y_proba_raw

        # Build the mapper for inverse CDF mapping, from cumulated
        # probabilities to continuous prediction.
        y_cdf = np.zeros(shape=(X.shape[0], edges.shape[0]))
        y_cdf[:, 1:] = np.cumsum(y_proba, axis=1)
        return np.asarray([interp1d(y_cdf_i, edges)(quantiles) for y_cdf_i in y_cdf])

    def predict(self, X):
        return self.predict_quantiles(X, quantiles=(self.quantile,)).ravel()
quantiles = (0.05, 0.5, 0.95)
bqr = BinnedQuantileRegressor(
    RandomForestClassifier(
        n_estimators=300,
        min_samples_leaf=5,
        max_features=0.2,
        n_jobs=-1,
        random_state=0,
    ),
    n_bins=30,
)
bqr
BinnedQuantileRegressor(estimator=RandomForestClassifier(max_features=0.2,
                                                         min_samples_leaf=5,
                                                         n_estimators=300,
                                                         n_jobs=-1,
                                                         random_state=0),
                        n_bins=30)
from sklearn.model_selection import cross_validate

X, y = features.skb.eval(), target.skb.eval()

cv_results_bqr = cross_validate(
    bqr,
    X,
    y,
    cv=ts_cv_5,
    scoring={
        "d2_pinball_50": make_scorer(d2_pinball_score, alpha=0.5),
    },
    return_estimator=True,
    return_indices=True,
    verbose=1,
    n_jobs=-1,
)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  2.1min finished
cv_predictions_bqr_all = [
    cv_predictions_bqr_05 := [],
    cv_predictions_bqr_50 := [],
    cv_predictions_bqr_95 := [],
]
for fold_ix, (qreg, test_idx) in enumerate(
    zip(cv_results_bqr["estimator"], cv_results_bqr["indices"]["test"])
):
    print(f"CV iteration #{fold_ix}")
    print(f"Test set size: {test_idx.shape[0]} rows")
    print(
        f"Test time range: {prediction_time.skb.eval()[test_idx][0, 0]} to "
        f"{prediction_time.skb.eval()[test_idx][-1, 0]} "
    )
    y_pred_all_quantiles = qreg.predict_quantiles(X[test_idx], quantiles=quantiles)

    coverage_score = coverage(
        y[test_idx],
        y_pred_all_quantiles[:, 0],
        y_pred_all_quantiles[:, 2],
    )
    print(f"Coverage: {coverage_score:.3f}")

    mean_width_score = mean_width(
        y[test_idx],
        y_pred_all_quantiles[:, 0],
        y_pred_all_quantiles[:, 2],
    )
    print(f"Mean prediction interval width: " f"{mean_width_score:.1f} MW")

    for q_idx, (quantile, predictions) in enumerate(
        zip(quantiles, cv_predictions_bqr_all)
    ):
        observed = y[test_idx]
        predicted = y_pred_all_quantiles[:, q_idx]
        predictions.append(
            pl.DataFrame(
                {
                    "prediction_time": prediction_time.skb.eval()[test_idx],
                    "load_mw": observed,
                    "predicted_load_mw": predicted,
                }
            )
        )
        print(f"d2_pinball score: {d2_pinball_score(observed, predicted):.3f}")
    print()
CV iteration #0
Test set size: 4032 rows
Test time range: 2023-02-11 00:00:00+00:00 to 2023-07-28 23:00:00+00:00 
Coverage: 0.965
Mean prediction interval width: 10778.0 MW
d2_pinball score: 0.346
d2_pinball score: 0.735
d2_pinball score: -0.025

CV iteration #1
Test set size: 4032 rows
Test time range: 2023-07-29 00:00:00+00:00 to 2024-01-12 23:00:00+00:00 
Coverage: 0.992
Mean prediction interval width: 11109.8 MW
d2_pinball score: 0.360
d2_pinball score: 0.813
d2_pinball score: 0.275

CV iteration #2
Test set size: 4032 rows
Test time range: 2024-01-13 00:00:00+00:00 to 2024-06-28 23:00:00+00:00 
Coverage: 0.988
Mean prediction interval width: 11762.3 MW
d2_pinball score: 0.292
d2_pinball score: 0.789
d2_pinball score: 0.108

CV iteration #3
Test set size: 4032 rows
Test time range: 2024-06-29 00:00:00+00:00 to 2024-12-13 23:00:00+00:00 
Coverage: 0.989
Mean prediction interval width: 9291.9 MW
d2_pinball score: 0.248
d2_pinball score: 0.806
d2_pinball score: 0.316

CV iteration #4
Test set size: 4032 rows
Test time range: 2024-12-14 00:00:00+00:00 to 2025-05-30 23:00:00+00:00 
Coverage: 0.978
Mean prediction interval width: 13170.0 MW
d2_pinball score: 0.349
d2_pinball score: 0.790
d2_pinball score: 0.249
# Let's assess the calibration of the quantile regression model:
plot_reliability_diagram(
    cv_predictions_bqr_50, kind="quantile", quantile_level=0.50
).interactive().properties(
    title="Reliability diagram for quantile 0.50 from cross-validation predictions"
)
plot_reliability_diagram(
    cv_predictions_bqr_05, kind="quantile", quantile_level=0.05
).interactive().properties(
    title="Reliability diagram for quantile 0.05 from cross-validation predictions"
)
plot_reliability_diagram(
    cv_predictions_bqr_95, kind="quantile", quantile_level=0.95
).interactive().properties(
    title="Reliability diagram for quantile 0.95 from cross-validation predictions"
)

We can complement this assessment with the Lorenz curves, which only assess the ranking power of the predictions, irrespective of their absolute values.
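
As a reminder, such a curve can be obtained by sorting the observations by increasing predicted value and accumulating the share of the observed target; here is a minimal sketch of this computation (the plot_lorenz_curve helper may differ in details):

import numpy as np

def lorenz_curve_points(y_true, y_pred):
    order = np.argsort(y_pred)  # rank observations by predicted value
    cum_samples = np.arange(1, len(y_true) + 1) / len(y_true)
    cum_target = np.cumsum(y_true[order]) / np.sum(y_true)
    return cum_samples, cum_target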

plot_lorenz_curve(cv_predictions_bqr_50).interactive().properties(
    title="Lorenz curve for quantile 0.50 from cross-validation predictions"
)
plot_lorenz_curve(cv_predictions_bqr_05).interactive().properties(
    title="Lorenz curve for quantile 0.05 from cross-validation predictions"
)
plot_lorenz_curve(cv_predictions_bqr_95).interactive().properties(
    title="Lorenz curve for quantile 0.95 from cross-validation predictions"
)