AutoScout24, based in Munich, is one of the largest European online marketplaces for buying and selling new and used cars. With its products and services, the car portal is aimed at private consumers, dealers, manufacturers, and advertisers. In this post, we use a high-quality car data set from Germany that was automatically scraped from AutoScout24 and is available on Kaggle. This interesting data set comprises more than 45,000 records of cars for sale in Germany registered between 2011 and 2021.
Outline
- Handling Missing Data
  - Overview
  - Median Imputation of Missing `hp` Values
  - Mode Imputation of Missing `gear` Values
  - Revisiting Our Data Set
- Exploratory Data Analysis
  - Visualizing Categorical Data Using Box Plots
  - Scatter Plots: Visualizing the Relationship Between Numerical Data
  - Distribution of the Numerical Features
  - Checking Correlations Between Price and Numerical Features
  - Distribution of Target Variable
Let’s first install and import packages we’ll use and set up the working environment:
!conda install -c conda-forge -q -y squarify
from datetime import datetime
from typing import List, Optional, Union
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
from scipy.stats import probplot
import seaborn as sns
import squarify
# Settings
%config InlineBackend.figure_format = 'retina'
sns.set_theme(font_scale=1.25)
random_seed = 87
Data Loading
Let’s load the data from the dataset file in CSV format. Pandas provides the handy function `read_csv` to create a DataFrame from a file path or buffer.
cars_rawdf = pd.read_csv('../data/autoscout24_germany_dataset.csv')
cars_rawdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46390 entries, 0 to 46389
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mileage 46390 non-null int64
1 make 46390 non-null object
2 model 46247 non-null object
3 fuel 46390 non-null object
4 gear 46208 non-null object
5 offerType 46390 non-null object
6 price 46390 non-null int64
7 hp 46361 non-null float64
8 year 46390 non-null int64
dtypes: float64(1), int64(3), object(5)
memory usage: 3.2+ MB
cars_rawdf.sample(frac=1, random_state=random_seed).head(n=10)
 | mileage | make | model | fuel | gear | offerType | price | hp | year |
---|---|---|---|---|---|---|---|---|---|
10316 | 151000 | BMW | 325 | Diesel | Automatic | Used | 13999 | 204.0 | 2011 |
25361 | 88413 | Fiat | Punto Evo | Gasoline | Manual | Used | 4740 | 105.0 | 2011 |
45829 | 50 | Skoda | Fabia | Gasoline | Manual | Pre-registered | 14177 | 95.0 | 2021 |
34066 | 25000 | Citroen | C3 | Gasoline | Manual | Used | 5000 | 73.0 | 2011 |
40035 | 48882 | Opel | Mokka | Gasoline | Manual | Used | 13200 | 140.0 | 2016 |
39682 | 59500 | SEAT | Mii | Gasoline | Manual | Used | 4450 | 60.0 | 2016 |
5457 | 25602 | Ford | Fiesta | Gasoline | Manual | Used | 7840 | 71.0 | 2019 |
15175 | 2500 | Volkswagen | Caddy | Diesel | Manual | Demonstration | 38550 | 122.0 | 2020 |
18569 | 26828 | Renault | Clio | Gasoline | Manual | Used | 10990 | 90.0 | 2018 |
40890 | 167482 | Volkswagen | Golf | Diesel | Manual | Used | 9799 | 110.0 | 2017 |
The data set contains information about cars in Germany (registered between 2011 and 2021), for which we’ll be predicting the sale price. It includes fields such as `make` (car brand), `hp` (horsepower), `mileage` (total distance traveled), etc. A quick look at the `offerType` field shows that the data set contains five distinct offer types: Used, Pre-registered, Demonstration, Employee's car, and New.
print(
'Counts of unique offer types:',
cars_rawdf['offerType'].value_counts(),
sep='\n'
)
Counts of unique offer types:
Used 40110
Pre-registered 2780
Demonstration 2366
Employee's car 1121
New 13
Name: offerType, dtype: int64
Here, we select the data for ‘Used’ cars only, to build a model for predicting car sale prices on the used-car market.
After selecting all records for ‘Used’ cars from the original data set, we remove the column `offerType`, since it is the same for all entries in the extracted data set.
cars = cars_rawdf[cars_rawdf['offerType'] == 'Used'].copy()
cars.sample(frac=1).head(n=10)
 | mileage | make | model | fuel | gear | offerType | price | hp | year |
---|---|---|---|---|---|---|---|---|---|
4136 | 47500 | Ford | Fiesta | Gasoline | Manual | Used | 8150 | 71.0 | 2017 |
5965 | 98600 | Ford | Transit Custom | Diesel | Manual | Used | 11781 | 101.0 | 2014 |
3054 | 61150 | Volkswagen | up! | Gasoline | Manual | Used | 7450 | 60.0 | 2017 |
23141 | 22000 | Mazda | CX-3 | Gasoline | Manual | Used | 20000 | 120.0 | 2017 |
23861 | 18100 | Nissan | X-Trail | Gasoline | Automatic | Used | 27375 | 159.0 | 2020 |
8375 | 31008 | Volkswagen | up! | Gasoline | Manual | Used | 6990 | 60.0 | 2018 |
40104 | 64000 | Volkswagen | Golf Variant | Gasoline | Manual | Used | 13800 | 125.0 | 2016 |
11711 | 28900 | Renault | Clio | Gasoline | Manual | Used | 8700 | 90.0 | 2017 |
10652 | 75367 | Mercedes-Benz | CLA 220 | Diesel | Automatic | Used | 27970 | 177.0 | 2017 |
43350 | 14721 | Peugeot | 108 | Gasoline | Manual | Used | 9990 | 72.0 | 2019 |
cars.drop(columns='offerType', inplace=True)
cars.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40110 entries, 0 to 46375
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mileage 40110 non-null int64
1 make 40110 non-null object
2 model 39985 non-null object
3 fuel 40110 non-null object
4 gear 39937 non-null object
5 price 40110 non-null int64
6 hp 40091 non-null float64
7 year 40110 non-null int64
dtypes: float64(1), int64(3), object(4)
memory usage: 2.8+ MB
Data Manipulation and Cleaning
Numerical Features
Add a New Column: Car Age
The column `year`, i.e. the car registration year, does not provide direct information on its own, but rather in combination with the current date. A feature considered by many buyers on the used-car market is the age of the car. In combination with the average annual mileage (which can differ from country to country, and between years within a given country!), it determines whether a car has reasonable mileage or not. This feature is missing in the data set, but we can subtract the column `year` from the current year and create a meaningful feature named `age`:
cars['age'] = datetime.now().year - cars['year']
cars.drop(columns='year', inplace=True)
cars.sample(frac=1).head(n=10)
 | mileage | make | model | fuel | gear | price | hp | age |
---|---|---|---|---|---|---|---|---|
10837 | 180000 | Ford | Focus | Diesel | Manual | 6900 | 95.0 | 6 |
31773 | 82500 | Citroen | Berlingo | Diesel | Automatic | 7790 | 92.0 | 9 |
35784 | 210000 | Renault | Megane | Diesel | Manual | 3700 | 110.0 | 10 |
30646 | 12461 | Audi | Q3 | Diesel | Automatic | 46480 | 190.0 | 2 |
25864 | 141500 | Volkswagen | Caddy | Diesel | Manual | 9999 | 102.0 | 7 |
31055 | 11316 | Ford | Focus | Gasoline | Manual | 19589 | 125.0 | 2 |
13981 | 138994 | Fiat | 500L | Diesel | Manual | 6995 | 105.0 | 8 |
17854 | 63290 | Nissan | Pulsar | Diesel | Manual | 9750 | 110.0 | 8 |
18375 | 16075 | Ford | Focus | Diesel | Manual | 19996 | 150.0 | 3 |
21141 | 162200 | Citroen | C3 | Diesel | Manual | 3890 | 68.0 | 10 |
Mask Outliers
We can identify outliers (which we also treat as missing observations) for the numerical features `mileage`, `hp`, and `age`, as well as for the target variable `price`. It is obvious that these values cannot be negative for any sample in the data set. Therefore, we mask negative values (outliers) in these columns:
numerical_features = ['mileage', 'hp', 'age']
for feature in numerical_features + ['price']:
    # Assign back rather than calling `mask(..., inplace=True)` on a column
    # selection, which may not propagate to the original DataFrame
    cars[feature] = cars[feature].mask(cars[feature] < 0)
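As a quick sanity check, we could count how many values are now missing in these columns (`mask` replaces the flagged entries with `NaN`, and `hp` already had a few missing values):
# Missing values per column after masking negative entries
print(cars[numerical_features + ['price']].isna().sum())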
Categorical Features
Overview
categorical_features = ['make', 'fuel', 'gear']
for feature in categorical_features:
print(
f"Categorical feature: '{feature}'\n"
f"No. of unique elemnts: {cars[feature].nunique()}\n"
f"Unique values:\n{cars[feature].unique()}"
)
print('=' * 40)
Categorical feature: 'make'
No. of unique elements: 76
Unique values:
['BMW' 'Volkswagen' 'SEAT' 'Renault' 'Peugeot' 'Toyota' 'Opel' 'Mazda'
'Ford' 'Mercedes-Benz' 'Chevrolet' 'Audi' 'Fiat' 'Kia' 'Dacia' 'MINI'
'Hyundai' 'Skoda' 'Citroen' 'Infiniti' 'Suzuki' 'SsangYong' 'smart'
'Volvo' 'Jaguar' 'Porsche' 'Nissan' 'Honda' 'Lada' 'Mitsubishi' 'Others'
'Lexus' 'Cupra' 'Maserati' 'Bentley' 'Land' 'Alfa' 'Jeep' 'Subaru'
'Dodge' 'Microcar' 'Baic' 'Tesla' 'Chrysler' '9ff' 'McLaren' 'Aston'
'Rolls-Royce' 'Alpine' 'Lancia' 'Abarth' 'DS' 'Daihatsu' 'Ligier'
'Caravans-Wohnm' 'Aixam' 'Piaggio' 'Morgan' 'Tazzari' 'Trucks-Lkw' 'RAM'
'Ferrari' 'Iveco' 'DAF' 'Alpina' 'Polestar' 'Maybach' 'Brilliance'
'FISKER' 'Lamborghini' 'Cadillac' 'Trailer-Anhänger' 'Isuzu' 'Corvette'
'DFSK' 'Estrima']
========================================
Categorical feature: 'fuel'
No. of unique elements: 11
Unique values:
['Diesel' 'Gasoline' 'Electric/Gasoline' '-/- (Fuel)' 'Electric' 'CNG'
'LPG' 'Electric/Diesel' 'Others' 'Hydrogen' 'Ethanol']
========================================
Categorical feature: 'gear'
No. of unique elements: 3
Unique values:
['Manual' 'Automatic' nan 'Semi-automatic']
========================================
The car `make` (brand) feature in our data set has high cardinality, meaning that there are too many unique values in this category. One-Hot Encoding becomes a big problem in this case, since we would have a separate column for each unique `make` value indicating its presence or absence in a given record. This leads to a big problem called the curse of dimensionality: as the number of features increases, the amount of data required to distinguish between these features and generalize the learned model grows exponentially. We will come back to this later and use a different encoding algorithm for this feature.
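As a rough illustration of the problem, we could check how many indicator columns a naive One-Hot Encoding of `make` alone would create:
# One-Hot Encoding `make` adds one indicator column per unique brand
make_dummies = pd.get_dummies(cars['make'], prefix='make')
print(make_dummies.shape)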
Fuel Type
We group the fuel types of the cars into four major categories: ‘Gasoline’, ‘Diesel’, ‘Electric’ and ‘Others’.
replace_mapping = {'Electric': ['Electric/Gasoline', 'Electric/Diesel'],
'Others': ['-/- (Fuel)', 'CNG', 'LPG', 'Hydrogen', 'Ethanol']}
for value, to_replace in replace_mapping.items():
cars['fuel'].replace(to_replace, value, inplace=True)
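A quick check confirms that only the four intended categories remain after the remapping:
# Only 'Gasoline', 'Diesel', 'Electric' and 'Others' should be left
print(cars['fuel'].value_counts())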
Handling Missing Data
Overview
The first step here is to get a better picture of the frequency of missing observations in our data set. This is important since it leads us to a decision about how to treat the missing values (i.e. which strategy to use for missing records). Moreover, in order to handle the missing observations, and also to prepare the data for training, it is necessary to split the features into numerical and categorical ones.
def countplot_missing_data(
df: pd.DataFrame,
ax: mpl.axes.Axes,
**kwargs
) -> None:
"""
Utility function to plot the counts of missing values in each
column of the given dataframe `df` using bars.
"""
missings = df.isna().sum(axis=0).rename('count')
missings = missings.rename_axis(index='feature').reset_index()
missings.sort_values('count', ascending=False, inplace=True)
x = range(missings.shape[0])
bars = ax.bar(x, missings['count'], **kwargs)
ax.bar_label(bars, fmt='%d', label_type='edge', padding=3)
ax.set(xticks=x, xticklabels=missings['feature'])
all_features = {
'numerical': ['mileage', 'hp', 'age'],
'categorical': ['make', 'fuel', 'gear']
}
colors = ['teal', 'skyblue']
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for (item, ax, color) in zip(all_features.items(), axes, colors):
features_type, features_list = item
countplot_missing_data(cars[features_list], ax, color=color, width=0.35)
ax.set(
title=f'{features_type.capitalize()} Features',
ylabel='Missing Values, Count [#]'
)
This is really a high-quality data set; there are no missing values for most of the available features. The numbers of missing observations for the features `hp` and `gear` are very small compared to the total number of records (less than 0.05% and 0.5%, respectively). Instead of eliminating (dropping) these records, we try to impute the missing values using information from other records. For the numerical feature `hp` we use the median, and for the categorical feature `gear` we use the most-frequent (mode) imputation strategy to replace missing values. In both cases, we compute the values of interest for each car `make` and `model`, meaning that we first group the data by `make` and `model`:
agg_data = cars.groupby(['make', 'model']).aggregate(
{'hp': np.nanmedian, 'gear': pd.Series.mode}
)
agg_data.sample(frac=1, random_state=912).head(n=20)
make | model | hp | gear |
---|---|---|---|
Peugeot | 807 | 136.0 | Manual |
Lexus | GS 450h | 345.0 | Automatic |
Lada | Niva | 83.0 | Manual |
Ford | Flex | 314.0 | Automatic |
Mitsubishi | Eclipse Cross | 163.0 | Automatic |
Suzuki | Vitara | 120.0 | Manual |
smart | forTwo | 71.0 | Automatic |
Citroen | C-Zero | 57.5 | Automatic |
BMW | X5 M | 400.0 | Automatic |
Peugeot | Rifter | 131.0 | Manual |
The function below can be used as an imputation transformer. It groups the data by the specified features, then aggregates the grouped data to extract the values of interest according to our imputation strategy. Finally, it returns the index and corresponding fill value for each missing observation, so it can be used to update the original data frame and replace its missing values. In the next section, we’ll see how this function comes in handy.
def imputation_transform(
df: pd.DataFrame,
label: str,
group_by: Union[str, List[str]],
strategy: str
) -> pd.Series:
"""
Imputation transformer for replacing missing values.
Parameters
----------
df:
Main data set.
label:
Column for which the missing values are imputed. After groupby,
the data is aggregated over this column.
group_by:
Column, or list of columns used to group data `df`.
strategy : str, {'mean', 'median', 'most_frequent'}
The imputation strategy.
If 'mean' or 'median', then uses equivalent NumPy functions
ignoring NaNs (`np.nanmean` and `np.nanmedian`, respectively).
If 'most_frequent', it uses `pd.Series.mode` method.
Returns
-------
Values that can be used to update `df` to replace its missing
values for the column `label`.
"""
strategy_maps = {
'mean': np.nanmean,
'median': np.nanmedian,
'most_frequent': pd.Series.mode
}
func = strategy_maps[strategy]
agg_data = df.groupby(group_by).aggregate({label: func})
    def imputer(row):
        """
        Imputation function applied to each row of the `df` dataframe
        to replace missing values of the column `label` using the
        imputation strategy `strategy`.
        """
        try:
            # `group_by` may be a single column name or a list of names
            key = tuple(row[group_by]) if isinstance(group_by, list) else row[group_by]
            x = agg_data.loc[key, label]
            if isinstance(x, np.ndarray):
                # `pd.Series.mode` may return several values; take the first
                fill_value = x[0] if x.size != 0 else np.nan
            else:
                fill_value = x
        except KeyError:
            fill_value = np.nan
        return fill_value
nan_idx = df[label].isna()
return df.loc[nan_idx].apply(lambda row: imputer(row), axis=1)
Median Imputation of Missing `hp` Values
hp_imp = imputation_transform(
df=cars,
label='hp',
strategy='median',
group_by=['make', 'model']
)
cars['hp'].update(hp_imp)
Mode Imputation of Missing `gear` Values
gear_imp = imputation_transform(
df=cars,
label='gear',
strategy='most_frequent',
group_by=['make', 'model']
)
cars['gear'].update(gear_imp)
Revisiting Our Data Set
all_features = {
'numerical': ['mileage', 'hp', 'age'],
'categorical': ['make', 'fuel', 'gear']
}
colors = ['teal', 'skyblue']
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for (item, ax, color) in zip(all_features.items(), axes, colors):
features_type, features_list = item
countplot_missing_data(cars[features_list], ax, color=color, width=0.35)
ax.set(
title=f'{features_type.capitalize()} Features',
ylabel='Missing Values, Count [#]'
)
So far so good. We have successfully handled the missing values for most of the records in the data set. But there are still a few missing observations left. At this point, we can simply eliminate (drop) the records for these samples:
cars.dropna(axis=0, subset=['hp', 'gear'], inplace=True)
cars.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40088 entries, 0 to 46375
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mileage 40088 non-null int64
1 make 40088 non-null object
2 model 39977 non-null object
3 fuel 40088 non-null object
4 gear 40088 non-null object
5 price 40088 non-null int64
6 hp 40088 non-null float64
7 age 40088 non-null int64
dtypes: float64(1), int64(3), object(4)
memory usage: 2.8+ MB
Exploratory Data Analysis
Visualizing Categorical Data Using Treemaps
A treemap is a popular technique for visualizing a ‘part of a whole’ relationship, and treemaps are easy to follow and interpret. They are often used for sales data, as they capture the relative sizes of data categories, allowing for quick perception of the items that are large contributors to each category. Apart from the sizes, categories can be colour-coded to show a separate dimension of the data. Here, we use treemap visualization to explore the relative sizes (counts) of different categories and their average price in the data set. The size of the nested grids in our treemap visualization represents the count of each category, and the grid-cell colours indicate the mean price for that category.
def plot_treemap(
df: pd.DataFrame,
size_key: str,
color_key: str,
ax: mpl.axes.Axes,
cmap: mpl.colors.Colormap = mpl.cm.Blues,
font_size: int = 8,
title_prefix: Optional[str] = None,
**kwargs
) -> None:
"""
Utility function to plot treemaps for categorical features using squarify.
Parameters
----------
size_key:
Column name whose numeric values specify sizes of rectangles.
color_key:
Column name whose numeric values specify colors of rectangles.
ax:
Matplotlib Axes instance.
cmap:
Matplotlib colormap.
font_size:
Set font size of the label text.
title_prefix:
Text to prepend to the figure title.
**kwargs:
Additional keyword arguments passed to matplotlib.Axes.bar by
squarify (e.g. `edgecolor`, `linewidth` etc).
"""
df_sorted = df.sort_values(by=size_key, ascending=False)
sizes = df_sorted[size_key]
norm = mpl.colors.Normalize(vmin=df[color_key].min(), vmax=df[color_key].max())
colors = [cmap(norm(x)) for x in df_sorted[color_key]]
labels = [
f"{entry[0]}\nCount: {entry[1]}\nAvg. Pr.: {entry[2]:.0f}"
for entry in df_sorted.values
]
squarify.plot(sizes=sizes, color=colors, label=labels, ax=ax, **kwargs)
ax.axis('off')
for text_obj in ax.texts:
text_obj.set_fontsize(font_size)
fig = ax.get_figure()
cbar = fig.colorbar(mpl.cm.ScalarMappable(norm, cmap), ax=ax)
cbar.set_label(color_key)
title = f"Size: {size_key}, Color: {color_key}"
if title_prefix:
title = title_prefix + '\n' + title
ax.set_title(title)
def get_category_meanprice(
df_in: pd.DataFrame,
categ_key: str
) -> pd.DataFrame:
"""
Utility function to get mean price for a given categorical feature.
Parameters
----------
df_in:
Input dataframe.
categ_key:
Categorical feature name (column name).
Returns
-------
df_out:
Output dataframe with column names `['count', 'mean_price']`.
"""
df_out = pd.concat(
objs=[
df_in[categ_key].value_counts().rename('count'),
df_in.groupby(categ_key)['price'].mean().rename('mean_price')
],
axis=1
)
df_out.reset_index(drop=False, inplace=True)
df_out.rename(columns={'index': categ_key}, inplace=True)
return df_out
Car Makes (Brands)
print(
f"Number of unique car-makes: {cars['make'].nunique()}",
f"Unique car makes listed in the data set:",
sep='\n'
)
cars['make'].unique()
Number of unique car-makes: 72
Unique car makes listed in the data set:
array(['BMW', 'Volkswagen', 'SEAT', 'Renault', 'Peugeot', 'Toyota',
'Opel', 'Mazda', 'Ford', 'Mercedes-Benz', 'Chevrolet', 'Audi',
'Fiat', 'Kia', 'Dacia', 'MINI', 'Hyundai', 'Skoda', 'Citroen',
'Infiniti', 'Suzuki', 'SsangYong', 'smart', 'Volvo', 'Jaguar',
'Porsche', 'Nissan', 'Honda', 'Lada', 'Mitsubishi', 'Others',
'Lexus', 'Cupra', 'Maserati', 'Bentley', 'Land', 'Alfa', 'Jeep',
'Subaru', 'Dodge', 'Microcar', 'Baic', 'Tesla', 'Chrysler', '9ff',
'McLaren', 'Aston', 'Rolls-Royce', 'Lancia', 'Abarth', 'DS',
'Daihatsu', 'Ligier', 'Caravans-Wohnm', 'Aixam', 'Piaggio',
'Alpine', 'Morgan', 'RAM', 'Ferrari', 'Iveco', 'Alpina',
'Polestar', 'Maybach', 'Brilliance', 'FISKER', 'Lamborghini',
'Cadillac', 'Isuzu', 'Corvette', 'DFSK', 'Estrima'], dtype=object)
# Count up cars by `make` feature & average `price`s for each `make` type
makes = get_category_meanprice(cars, 'make')
# Select data w/ a minimum number of items
min_nitems = 200
indexer = makes['count'] >= min_nitems
treemap_kwargs = dict(edgecolor='gray', linewidth=1, font_size=10, alpha=0.8)
fig, ax = plt.subplots(figsize=(16, 9))
plot_treemap(
makes.loc[indexer, :],
'count',
'mean_price',
ax,
cmap=mpl.cm.PuBu,
title_prefix=f'Car Makes (# >= {min_nitems})',
**treemap_kwargs
)
Fuel Types
# Count up cars by `fuel` feature & average `price`s for each `fuel` type
fuels = get_category_meanprice(cars, 'fuel')
fig, ax = plt.subplots(figsize=(12, 7))
plot_treemap(
fuels,
'count',
'mean_price',
ax,
cmap=mpl.cm.BuPu,
title_prefix='Fuel Types',
**treemap_kwargs
)
Gear Types
# Count up cars by `gear` feature & average `price`s for each `gear` type
gears = get_category_meanprice(cars, 'gear')
fig, ax = plt.subplots(figsize=(12, 7))
plot_treemap(
gears,
'count',
'mean_price',
ax,
cmap=mpl.cm.Reds,
title_prefix='Gear Types',
**treemap_kwargs
)
Visualizing Categorical Data Using Box Plots
min_nitems = 200
s = cars['make'].value_counts() > min_nitems
indexer = cars['make'].isin(s[s].index)
order = cars[indexer].groupby('make')['price'].median().sort_values(ascending=False).index
fig = plt.figure(figsize=(12, 7), tight_layout=True)
gs = mpl.gridspec.GridSpec(2, 2)
boxplot_kwargs = dict(orient='h', showfliers=False, linewidth=1)
sns.boxplot(
y=cars.loc[indexer, 'make'],
x=cars.loc[indexer, 'price'] / 1e+3,
ax=fig.add_subplot(gs[:, 0]),
order=order,
palette='pink',
**boxplot_kwargs
)
for ifeature, feature in enumerate(['fuel', 'gear']):
order = cars.groupby(feature)['price'].median().sort_values(ascending=False).index
sns.boxplot(
y=cars[feature],
x=cars['price'] / 1e+3,
ax=fig.add_subplot(gs[ifeature, 1]),
order=order,
**boxplot_kwargs
)
for ax in fig.get_axes():
ax.set_xlabel(r'price ($\times$ 1000)')
Scatter Plots: Visualizing the Relationship Between Numerical Data
fig, axes = plt.subplots(1, 3, figsize=(14, 4.5))
fig.tight_layout(w_pad=2.5)
for (feature, ax) in zip(all_features['numerical'], axes):
ax.scatter(cars[feature], cars['price'] / 1e+3, edgecolor='w', alpha=0.8)
ax.set(xlabel=feature, ylabel=r'price ($\times$ 1000)')
axes[0].set(ylabel=r'price ($\times$ 1000)');
The scatter plots above show that car sale prices are highly correlated with `mileage` and `hp`. Moreover, these plots reveal another important characteristic of the data: heteroscedasticity. We can see that the target variable (price), monitored over the different independent variables, shows unequal variability (particularly over the `hp` feature; note the cone shape of the scatter plot). This means that a linear regression model is not a proper model for predicting the sale price on this data set, since the heteroscedasticity of the data will ruin the results.
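As a rough, illustrative check of this heteroscedasticity, we could compare the spread of prices across horsepower quartiles; if the standard deviation of the price grows with `hp`, that is consistent with the cone shape of the scatter plot:
# Spread of the price within horsepower quartiles
hp_quartiles = pd.qcut(cars['hp'], q=4)
print(cars.groupby(hp_quartiles)['price'].agg(['count', 'mean', 'std']))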
Distribution of the Numerical Features
fig, axes = plt.subplots(1, 3, figsize=(14, 4.5))
fig.tight_layout(w_pad=2.5)
for (feature, ax) in zip(all_features['numerical'], axes):
ax.hist(cars[feature], bins=25)
ax.set(xlabel=feature, ylabel='Count [#]')
Checking Correlations Between Price and Numerical Features
corr = cars.corr()
new_order = ['price'] + numerical_features
corr = corr.reindex(index=new_order, columns=new_order)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(6, 5))
ax.set_facecolor('none')
sns.heatmap(corr, ax=ax, mask=mask, annot=True, cmap='BrBG', vmin=-1);
<img src="/assets/img/2021-11-05-car-price-germany-eda/output_55_0.png" class="center_img" style="width: 50%;">
The highest correlations are between `price` and horsepower `hp`, and between `price` and `age` (no big news). The negative correlation between `price` and `age` also seems natural.
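For a numeric view of the same information, we could also print the correlations of `price` with the numerical features directly:
# Correlation of price with each numerical feature, sorted
print(corr['price'].drop('price').sort_values(ascending=False))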
Distribution of Target Variable
prices = cars['price'] / 1e+3
bin_width = 5 # in €1000
nbins = int(round((prices.max() - prices.min()) / bin_width)) + 1
bins = np.linspace(prices.min(), prices.max(), nbins)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.hist(prices, bins=bins)
ax1.set(xlabel=r'Price ($\times$ 1000)', ylabel='Count [#]')
_ = probplot(prices, plot=ax2)
The target distribution (of sale prices) is heavily skewed. Most regression algorithms perform best when the target variable is normally distributed (or close to normal) and has a standard deviation of one.
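To quantify this, we could compute the sample skewness of the prices (a value well above zero indicates a strong right skew):
from scipy.stats import skew

# Sample skewness of the price distribution
print(f"Skewness of price: {skew(cars['price']):.2f}")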
Save Processed Data to a New CSV File
cars.to_csv(
'../data/autoscout24_germany_dataset_cleaned.csv',
header=True,
index=False
)