AutoScout24, based in Munich, is one of the largest European online marketplaces for buying and selling new and used cars. With its products and services, the car portal targets private consumers, dealers, manufacturers and advertisers. In this post, we use a high-quality car data set from Germany that was automatically scraped from AutoScout24 and is available on Kaggle. The data set comprises more than 45,000 records of cars for sale in Germany, registered between 2011 and 2021.
Let’s first install and import packages we’ll use and set up the working environment:
# Install packages
! uv add lightgbm scikit-optimize category_encoders
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from category_encoders import LeaveOneOutEncoder
from lightgbm import LGBMRegressor
import seaborn as sns
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.metrics import make_scorer, mean_pinball_loss, median_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, RobustScaler
from sklearn import set_config
from skopt import gp_minimize, space
from skopt.utils import use_named_args
# Settings
%config InlineBackend.figure_format = 'retina'
sns.set_theme(font_scale=1.5, rc={'lines.linewidth': 2.5})
Quantile Loss Function (aka Pinball Loss Function)
Instead of reporting only a single best estimate, we want to predict an estimate together with the uncertainty in that estimate. This is especially challenging when the uncertainty varies with the independent variables. Quantile regression is one method for estimating this uncertainty that works with our model of choice (gradient-boosted decision trees).
A quantile is the value below which a fraction of observations in a group falls.
By minimizing the quantile loss for a given quantile level \(\alpha\), \(\alpha \cdot 100\,\%\) of the observations should end up smaller than the predictions (\(y^{(i)} \lt \hat{y}^{(i)}\)), and \((1-\alpha) \cdot 100\,\%\) should end up bigger (\(y^{(i)} \ge \hat{y}^{(i)}\)).
In other words, the model should over-predict \(\alpha \cdot 100\,\%\) of the time and under-predict \((1-\alpha) \cdot 100\,\%\) of the time. Note that here the error is defined as \(y - \hat{y}\). As an example, a prediction for quantile \(\alpha=0.9\) should over-predict \(90\,\%\) of the time.
The quantile loss function is defined as below:
\[\begin{equation} \mathcal{L}_{\alpha}(y^{(i)}, \hat y^{(i)}) = \begin{cases} \begin{aligned} \alpha\,&(y^{(i)} - \hat y^{(i)}) && \text{if} \, y^{(i)} \ge \hat y^{(i)}\quad \text{(i.e., under-predictions)}, \\ (\alpha - 1)\,&(y^{(i)} - \hat y^{(i)}) && \text{if} \, y^{(i)} \lt \hat y^{(i)}\quad \text{(i.e., over-predictions)}. \end{aligned} \end{cases} \end{equation}\]
For a set of predictions, the total cost is the average of the per-sample losses, so the cost function is:
\[\begin{equation} \mathcal{J} = \frac{1}{m} \sum_{i=1}^m \mathcal{L}_{\alpha}(y^{(i)}, \hat y^{(i)}). \end{equation}\]
The quantile loss therefore depends on the quantile level \(\alpha\) being evaluated. The closer \(\alpha\) is to 0, the more the loss penalizes over-predictions (negative errors); the closer \(\alpha\) is to 1, the more it penalizes under-predictions (positive errors). The figure below illustrates this.
def quantile_loss(y_true, y_pred, alpha):
    """
    Quantile (pinball) loss function, evaluated per sample.

    Parameters
    ----------
    y_true : ndarray
        True observations (i.e., labels).
    y_pred : ndarray
        Model predictions (same shape as `y_true`).
    alpha : float
        Quantile level to evaluate. Should be between 0 and 1.
    """
    errors = y_true - y_pred
    pos = errors >= 0.0
    neg = np.logical_not(pos)
    loss_vals = alpha * (pos * errors) + (alpha - 1) * (neg * errors)
    return loss_vals
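The cost \(\mathcal{J}\) for a set of predictions is then just the average of these per-sample losses. For instance, with two toy samples (one under- and one over-prediction) at \(\alpha=0.9\):
# Mean pinball loss at alpha = 0.9 over two toy samples
toy_true = np.array([5.0, 5.0])
toy_pred = np.array([3.0, 7.0])
np.mean(quantile_loss(toy_true, toy_pred, alpha=0.9))   # (1.8 + 0.2) / 2 = 1.0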
For the sake of illustration, we consider a single true value and investigate how the quantile loss behaves when model predictions are larger than the true value (negative errors) or smaller than it (positive errors), for different quantile levels \(\alpha\).
y_true = 5.0
y_pred = np.linspace(0.0, 10, 51)
errors = y_true - y_pred
# Quantile values
alphas = [0.1, 0.5, 0.9]
colors = ['sandybrown', 'chocolate', 'saddlebrown']
fig, axes = plt.subplots(1, 3, sharey=True, figsize=(15, 5))
fig.tight_layout(w_pad=1)
axline_kw = dict(ls='--', lw=1.5, c='gray')
for alpha, color, ax in zip(alphas, colors, axes):
    ax.axvline(0.0, **axline_kw)
    ax.axhline(0.0, **axline_kw)
    loss_vals = quantile_loss(y_true, y_pred, alpha)
    ax.plot(errors, loss_vals, color)
    ax.set(title=fr'$\alpha={alpha}$', xlabel=r'Error, $y - \hat{y}$')
axes[0].set_ylabel(r'Quantile loss, $\mathcal{L}$');
Data Loading
Let’s load our data set that we have analysed and cleaned previously.
cars = pd.read_csv('../data/autoscout24_germany_dataset_cleaned.csv')
cars.sample(frac=1).head(n=10)
| | mileage | make | model | fuel | gear | price | hp | age |
|---|---|---|---|---|---|---|---|---|
| 3206 | 26566 | Fiat | 500 | Gasoline | Manual | 8240 | 69.0 | 4 |
| 2076 | 92000 | Mitsubishi | Space Star | Gasoline | Manual | 5399 | 71.0 | 6 |
| 8537 | 101709 | smart | forTwo | Gasoline | Automatic | 8490 | 102.0 | 10 |
| 502 | 20302 | Ford | Galaxy | Diesel | Manual | 28711 | 190.0 | 3 |
| 37020 | 79500 | Ford | Edge | Diesel | Automatic | 28740 | 209.0 | 5 |
| 9958 | 38335 | Hyundai | i20 | Gasoline | Manual | 12470 | 84.0 | 4 |
| 3856 | 74860 | Volkswagen | up! | Gasoline | Manual | 5500 | 75.0 | 10 |
| 10262 | 106000 | Ford | Galaxy | Diesel | Automatic | 14000 | 163.0 | 7 |
| 23216 | 29450 | Ford | Focus | Gasoline | Manual | 16450 | 140.0 | 4 |
| 22424 | 50000 | Mercedes-Benz | Sprinter | Diesel | Manual | 17990 | 114.0 | 5 |
Preparatory Feature Selection
We do not use the 'model' column, since it is effectively a subset of the 'make' (car brand) feature: each car brand has its own set of model categories that does not overlap with those of other brands. This column would be more useful when the goal is to predict the sale price of a specific make, e.g. BMW.
cars.drop(columns=['model'], inplace=True)
cars.head()
| | mileage | make | fuel | gear | price | hp | age |
|---|---|---|---|---|---|---|---|
| 0 | 235000 | BMW | Diesel | Manual | 6800 | 116.0 | 11 |
| 1 | 92800 | Volkswagen | Gasoline | Manual | 6877 | 122.0 | 11 |
| 2 | 149300 | SEAT | Gasoline | Manual | 6900 | 160.0 | 11 |
| 3 | 96200 | Renault | Gasoline | Manual | 6950 | 110.0 | 11 |
| 4 | 156000 | Peugeot | Gasoline | Manual | 6950 | 156.0 | 11 |
Data Splitting
We use \(80\%\) of the samples for the training set, while the remaining \(20\%\) form the testing set.
data = cars.drop(columns='price', inplace=False)
target = cars['price']
X_train, X_test, y_train, y_test = \
train_test_split(data, target, test_size=0.2)
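The split above is random; if you want the results reported later to be reproducible, the same call accepts a fixed seed (shown commented out here so the behaviour above is unchanged):
# Optional: fix the seed for a reproducible split
# X_train, X_test, y_train, y_test = \
#     train_test_split(data, target, test_size=0.2, random_state=0)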
Leave-One-Out Encoding
Our data consists of a mixture of numerical and categorical features. More specifically, ('mileage', 'hp', 'age') are the numerical features, and ('make', 'fuel', 'gear') are the categorical features. In order to be fed into a machine-learning model, categorical features must be represented numerically.
for feature in ('make', 'fuel', 'gear'):
    print(
        f'Feature: {feature} - '
        f'Number of categories: {data[feature].nunique()}')
Feature: make - Number of categories: 72
Feature: fuel - Number of categories: 4
Feature: gear - Number of categories: 3
To encode the features 'fuel' and 'gear', we use the most common method, one-hot encoding (sometimes called "dummy encoding"). This algorithm creates a binary variable for each unique value of a categorical variable: for each sample, the new feature is 1 if the sample's category matches that feature and 0 otherwise. It therefore adds one new feature per unique category in 'fuel' and 'gear'.
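As a quick standalone illustration of what the encoder produces for these two columns (this uses scikit-learn's OneHotEncoder directly and is not part of the final pipeline):
ohe = OneHotEncoder(handle_unknown='ignore')
fuel_gear_binary = ohe.fit_transform(X_train[['fuel', 'gear']])
print(ohe.get_feature_names_out())   # one binary column per category
print(fuel_gear_binary.shape)        # (n_samples, n_fuel_categories + n_gear_categories)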
The 'make' feature, on the other hand, has high cardinality (i.e., it has many unique categories), so one-hot encoding it would create a large number of additional feature variables. This could hurt model performance: as the dimensionality of the data increases, the model suffers from the curse of dimensionality.
To keep the number of dimensions under control, we use the leave-one-out encoding method for 'make'. It essentially replaces each category with the mean of the target variable over all samples that share that category. The encoding algorithm differs slightly between the training and testing sets:
- For the training set, it leaves the sample under consideration out of the mean, hence the name leave-one-out. This is done to avoid target leakage. Moreover, it adds random Gaussian noise with zero mean and standard deviation \(\sigma\) to the training encodings in order to reduce overfitting; \(\sigma\) is a hyperparameter that should be tuned. (A minimal sketch of this computation follows the list.)
- For the testing set, it does not leave the current sample out: the per-category target statistics computed on the training set are reused. No noise is added either.
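To make the training-time computation concrete, here is a minimal pandas sketch of the leave-one-out target mean (without the noise term); loo_mean is a hypothetical helper for illustration only, and the encoder we actually use comes from the Category Encoders library below:
def loo_mean(categories, target):
    # For each sample: mean of the target over all *other* samples
    # that share the same category (hence "leave one out").
    grouped = target.groupby(categories)
    sums = grouped.transform('sum')
    counts = grouped.transform('count')
    return (sums - target) / (counts - 1)

# e.g. loo_mean(X_train['make'], y_train); categories seen only once yield NaN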
The leave-one-out technique is implemented in the Category Encoders library as a scikit-learn-style transformer:
loo_encoder = LeaveOneOutEncoder(sigma=0.05)
encoded = loo_encoder.fit_transform(X_train['make'], y_train)
encoded.head(n=10)
| | make |
|---|---|
| 38370 | 9573.924156 |
| 13498 | 9029.417690 |
| 27888 | 8792.132242 |
| 12999 | 25944.844420 |
| 7243 | 9101.103968 |
| 2653 | 12284.006678 |
| 7050 | 13583.914503 |
| 31847 | 14364.409615 |
| 1741 | 8766.192993 |
| 36229 | 8188.430674 |
Box-Cox Transformation of Targets
In the previous notebook, where we did the exploratory data analysis, we saw that the distribution of the target variable (price) in our data set is highly skewed. We also saw that the target shows unequal variability when plotted against different independent variables (see the cone-shaped scatter plots there). This is a sign of heteroscedasticity.
Although we will use a tree-based learning algorithm, which makes no assumptions about the normality of the targets, it is still useful to make the targets more Gaussian-like in order to mitigate issues related to heteroscedasticity (non-constant variance).
Here, we use the Box-Cox transformation to map the targets as close to a Gaussian distribution as possible, in order to stabilise the variance and minimise the skewness. Note that the Box-Cox transform can only be applied to strictly positive data (such as incomes, prices, etc.).
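For reference, the Box-Cox transform of a strictly positive target \(y\) with parameter \(\lambda\) (estimated by maximum likelihood in PowerTransformer, which by default also standardises the output to zero mean and unit variance) is:
\[\begin{equation} y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \ne 0, \\ \ln y & \text{if } \lambda = 0. \end{cases} \end{equation}\]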
pt = PowerTransformer(method='box-cox')
y_orig = target.values[:, np.newaxis]
y_norm = pt.fit_transform(y_orig)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
plt.suptitle('Targets Distribution', y=1)
hist_kw = dict(bins=30, facecolor='teal')
_ = ax1.hist(y_orig, **hist_kw)
ax1.set_title('Original')
_ = ax2.hist(y_norm, **hist_kw)
ax2.set_title('Transformed (Box-Cox)');
Creating Model Pipeline Step by Step
Data-Preprocessing and Transformation
- Robust scaling for the numerical features
- One-hot encoding for 'fuel' and 'gear'
- Leave-one-out encoding for 'make'
def create_preprocessor():
    """
    Data preprocessing and feature transformation pipeline.

    Returns
    -------
    out : `sklearn.compose.ColumnTransformer` object
    """
    # Numerical features
    numerical_features = ['mileage', 'hp', 'age']
    numerical_transformer = RobustScaler()

    # Categorical features
    categorical_features_a = ['fuel', 'gear']
    categorical_transformer_a = OneHotEncoder(handle_unknown='ignore')

    categorical_features_b = ['make']
    categorical_transformer_b = Pipeline(
        steps=[
            ('loo_encoder', LeaveOneOutEncoder()),
            ('scaler', RobustScaler())])

    # Preprocessing pipeline
    preprocessor = ColumnTransformer(
        transformers=[
            ('numer', numerical_transformer, numerical_features),
            ('categ_a', categorical_transformer_a, categorical_features_a),
            ('categ_b', categorical_transformer_b, categorical_features_b)])
    return preprocessor
LightGBM Regression Estimator
LightGBM is a gradient-boosting framework based on decision trees, developed by Microsoft. Fast training, high accuracy, and efficient memory utilisation are its main selling points. LightGBM grows trees leaf-wise, which tends to converge faster than depth-wise growth. It is a go-to model for many tabular machine-learning problems.
LightGBM has several hyperparameters that control how the decision trees are constructed, including the following (a short snippet after this list shows where they go):
- num_leaves: the maximum number of leaves in one tree. Larger values increase accuracy on the training set, but also increase the risk of overfitting.
- max_depth: the maximum depth of a tree; it limits the number of successive splits along any branch.
- n_estimators: the number of boosting iterations (trees to build). More trees can make the model more accurate, at the cost of longer training time and a higher chance of overfitting.
- learning_rate: the shrinkage rate used for boosting.
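As a rough illustration of where these hyperparameters go in code (the values below are simply the LightGBM defaults, not the tuned values we obtain later):
# Placeholder values (library defaults), only to show the constructor arguments
lgbm = LGBMRegressor(
    objective='quantile', alpha=0.5,    # pinball loss at the median
    num_leaves=31, max_depth=-1,        # -1 means no explicit depth limit
    n_estimators=100, learning_rate=0.1)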
We will use Bayesian optimization to find a good combination of these hyperparameters.
To create our regression estimator:
- We use the LGBMRegressor model with the quantile loss function to estimate predictive intervals on our dataset.
- As discussed before, we regress on transformed targets (Box-Cox transformation). To do so, we wrap LightGBM's regressor in scikit-learn's TransformedTargetRegressor.
def create_quantile_regressor(alpha=0.5):
    """
    LightGBM regressor w/ quantile loss applied on transformed targets.

    In the regression problem, targets (prices) are transformed by using
    the Box-Cox power transform, `PowerTransformer(method='box-cox')`.

    Parameters
    ----------
    alpha : float
        The coefficient used in quantile-based loss, should be between 0
        and 1. Default: 0.5 (median).

    Returns
    -------
    Quantile regressor to apply on transformed targets.
    """
    ttq_regressor = TransformedTargetRegressor(
        regressor=LGBMRegressor(objective='quantile', alpha=alpha),
        transformer=PowerTransformer(method='box-cox'))
    return ttq_regressor
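One convenience of TransformedTargetRegressor is that fit() applies the Box-Cox transform to the targets before training and predict() applies the inverse transform, so predictions come back on the original price scale. A quick smoke test on the numerical columns only (a sketch, not part of the tuned pipeline):
reg = create_quantile_regressor(alpha=0.5)
reg.fit(X_train[['mileage', 'hp', 'age']], y_train)
reg.predict(X_test[['mileage', 'hp', 'age']])[:3]   # median price estimates in EUR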
Full Training Pipeline
def create_full_pipeline(alpha=0.5):
    """
    Full prediction pipeline.

    Parameters
    ----------
    alpha : float
        The coefficient used in quantile-based loss, should be between 0
        and 1. Default: 0.5 (median).

    Returns
    -------
    Pipeline of data transforms with the final estimator, returned as
    `sklearn.pipeline.Pipeline` object.
    """
    preprocessor = create_preprocessor()
    ttq_regressor = create_quantile_regressor(alpha=alpha)
    model = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('ttq_regressor', ttq_regressor)])
    return model
set_config(display='diagram')
create_full_pipeline()
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('numer', RobustScaler(), ['mileage', 'hp', 'age']), ('categ_a', OneHotEncoder(handle_unknown='ignore'), ['fuel', 'gear']), ('categ_b', Pipeline(steps=[('loo_encoder', LeaveOneOutEncoder()), ('scaler', RobustScaler())]), ['make'])])), ('ttq_regressor', TransformedTargetRegressor(regressor=LGBMRegressor(alpha=0.5, objective='quantile'), transformer=PowerTransformer(method='box-cox')))])
Hyperparameter Tuning with scikit-optimize
Objective
def hyperparameter_tuner(model, X, y, search_space, cvs_params, gp_params):
    """
    Tune the model's hyperparameters with Gaussian-process-based
    (Bayesian) optimization.

    Parameters
    ----------
    model : estimator object implementing `fit()` method
        The object to use to fit the data.
    X : array-like of shape (n_samples, n_features)
        The data to fit.
    y : array-like of shape (n_samples,) or (n_samples, n_outputs)
        The target variable to try to predict.
    search_space : list, shape (n_dims,)
        List of hyperparameter search space dimensions.
    cvs_params : dict
        Parameters for Scikit-Learn's `cross_val_score()` method.
    gp_params : dict
        Parameters for Scikit-Optimize's `gp_minimize()` method.

    Returns
    -------
    The Gaussian Processes optimization result with the required
    information, returned as SciPy's `OptimizeResult` object.
    """
    @use_named_args(search_space)
    def objective_func(**params):
        """
        Function to *minimize*. Decorated by `use_named_args()`, it
        takes named (keyword) arguments and returns the objective value.
        """
        model.set_params(**params)
        scores = cross_val_score(model, X, y, **cvs_params)
        return -1.0 * np.mean(scores)

    # We are now ready for sequential model-based optimization
    optim_result = gp_minimize(objective_func, search_space, **gp_params)
    return optim_result
Optimization
%%time
# Parameters of a Scikit-Learn estimator in a pipeline are accessed
# using the `<estimator>__<parameter>` syntax. Valid parameter keys
# can be listed with model's `get_params()` method.
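# For example, this (optional) one-liner lists the valid keys:
#     sorted(create_full_pipeline().get_params().keys())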
# Hyperparameter search space
search_space = [
    space.Real(
        0.01, 0.6, prior='log-uniform',
        name='preprocessor__categ_b__loo_encoder__sigma'),
    space.Integer(10, 50, name='ttq_regressor__regressor__num_leaves'),
    space.Integer(2, 30, name='ttq_regressor__regressor__max_depth'),
    space.Real(
        0.01, 0.5, prior='log-uniform',
        name='ttq_regressor__regressor__learning_rate')]

# Parameters that'll be passed to sklearn's `cross_val_score()`
cvs_params = {'cv': 3, 'n_jobs': 4}

# Parameters that'll be passed to skopt's `gp_minimize()`
gp_params = {'n_calls': 50,
             'n_initial_points': 10,
             'acq_func': 'EI',
             'n_jobs': 4,
             'verbose': False}

# Lower, middle and higher quantiles to evaluate: lower/upper bounds + median
quantiles = {'q_lower': 0.025, 'q_middle': 0.5, 'q_upper': 0.975}

# Cache Gaussian Process optimization results
cache_gp_results = dict()

for q_level, alpha in quantiles.items():
    print(f'\nHyperparameter optimization for quantile regressor with '
          f'alpha={alpha:.3f} ..... ', end='')

    # Create model pipeline
    model = create_full_pipeline(alpha=alpha)

    # Cross-validation scoring strategy
    scorer = make_scorer(
        mean_pinball_loss, alpha=alpha, greater_is_better=False)
    cvs_params['scoring'] = scorer

    # Find optimal hyperparameters
    gp_result = hyperparameter_tuner(
        model, X_train, y_train, search_space, cvs_params, gp_params)

    # Cache current results
    cache_gp_results[q_level] = gp_result
    print('Done')

    # Report GP optimization result
    print('Best parameters:')
    for i_param, param_value in enumerate(gp_result.x):
        param_name = search_space[i_param].name.split('__')[-1]
        print(''.rjust(4) + f'{param_name}: {param_value:g}')
    print('=' * 40)
Hyperparameter optimization for quantile regressor with alpha=0.025 ..... Done
Best parameters:
sigma: 0.01
num_leaves: 20
max_depth: 6
learning_rate: 0.320519
========================================
Hyperparameter optimization for quantile regressor with alpha=0.500 ..... Done
Best parameters:
sigma: 0.01
num_leaves: 41
max_depth: 27
learning_rate: 0.197304
========================================
Hyperparameter optimization for quantile regressor with alpha=0.975 ..... Done
Best parameters:
sigma: 0.01
num_leaves: 31
max_depth: 19
learning_rate: 0.157132
========================================
CPU times: user 3min 1s, sys: 3.07 s, total: 3min 4s
Wall time: 2min 15s
Plotting Convergence Traces
n_models = len(quantiles)
fig, axes = plt.subplots(1, n_models, figsize=(15, 5))
fig.tight_layout(w_pad=1)
colors = ['sandybrown', 'chocolate', 'saddlebrown']
for i, (q_level, gp_result) in enumerate(cache_gp_results.items()):
    call_idx = np.arange(1, gp_result.func_vals.size + 1)
    # Running minimum of the objective after each call
    min_vals = [gp_result.func_vals[:n].min() for n in call_idx]
    ax = axes[i]
    ax.plot(call_idx, min_vals, '-o', color=colors[i])
    alpha = quantiles[q_level]
    ax.set(xlabel=r'Number of calls, $n$', title=fr'$\alpha$={alpha}')
axes[0].set_ylabel(r'$f_{min}(x)$ after $n$ calls')
plt.suptitle('GP Convergence Traces', y=1.1);
Final Estimators: Computing Predictive Intervals
trained_models = dict()
for q_level, alpha in quantiles.items():
    # Create model for this quantile
    model = create_full_pipeline(alpha=alpha)

    # Retrieve best (optimized) parameters for this model
    gp_res = cache_gp_results[q_level]
    n_params = len(search_space)
    best_params = dict(
        (search_space[i].name, gp_res.x[i]) for i in range(n_params))

    # Set the parameters of this model and refit it on the training set
    model.set_params(**best_params)
    model.fit(X_train, y_train)
    trained_models[q_level] = model
def coverage_percentage(
        y: np.ndarray,
        y_lower: np.ndarray,
        y_upper: np.ndarray) -> float:
    """
    Function to calculate the coverage percentage of the predictive
    interval, i.e. the percentage of observations that fall between the
    lower and upper bounds of the predictions.

    Parameters
    ----------
    y : ndarray of shape (n_samples,)
        Target values.
    y_lower : ndarray of shape (n_samples,)
        Lower predictive intervals.
    y_upper : ndarray of shape (n_samples,)
        Upper predictive intervals.
    """
    return np.mean(np.logical_and(y > y_lower, y < y_upper)) * 100.0
# Compute predictive intervals
y_pred_lower = trained_models['q_lower'].predict(X_test)
y_pred_upper = trained_models['q_upper'].predict(X_test)
ctop = coverage_percentage(y_test, y_pred_lower, np.inf)
cbot = coverage_percentage(y_test, -np.inf, y_pred_upper)
cint = coverage_percentage(y_test, y_pred_lower, y_pred_upper)
print(f'Coverage of top 95%: {ctop:.1f}%',
f'Coverage of bottom 95%: {cbot:.1f}%',
f'Coverage of 95% predictive interval: {cint:.1f}%', sep='\n')
Coverage of top 95%: 97.4%
Coverage of bottom 95%: 97.0%
Coverage of 95% predictive interval: 94.4%
y_pred_middle = trained_models['q_middle'].predict(X_test)
print(
f'Median Absolute Error: ',
f'\tLower-quantile model: {int(median_absolute_error(y_test, y_pred_lower))}',
f'\tMiddle-quantile model: {int(median_absolute_error(y_test, y_pred_middle))}',
f'\tUpper-quantile model: {int(median_absolute_error(y_test, y_pred_upper))}',
sep='\n')
Median Absolute Error:
Lower-quantile model: 2708
Middle-quantile model: 967
Upper-quantile model: 3655