Preprocessing - encoding, discretization, dtypes#

Before performing All Relevant Feature Selection, you might want to apply preprocessing steps such as:

  • ordinal encoding (recommended over other encoding strategies)

  • discretization and auto-grouping of levels

  • selection of columns in a consistent fashion, according to data types and name patterns

Ordinal encoding is strongly recommended for GBM-based methods. It is well documented that GBMs generally return better results when categorical predictors are encoded as integers (see the LightGBM documentation, for instance). One-hot encoding should be avoided since it tends to produce unbalanced, inefficient trees.
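
For instance, LightGBM consumes integer-coded or pandas "category" columns natively, no dummy columns required. A minimal sketch (not part of ARFS; the toy dataframe below is purely illustrative):

import lightgbm as lgb
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame(
    {
        "color": pd.Categorical(rng.choice(["red", "green", "blue"], size=200)),
        "x": rng.normal(size=200),
    }
)
y_toy = toy["x"] + (toy["color"] == "red").astype(float) + rng.normal(scale=0.1, size=200)

# the "category" column is encoded as integers internally by LightGBM,
# no one-hot encoding needed
model = lgb.LGBMRegressor(n_estimators=50).fit(toy, y_toy, categorical_feature=["color"])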

Auto-grouping the levels of categorical predictors and discretizing continuous ones are achieved with a GBM-based method, specifically LightGBM. LightGBM is extremely efficient at building histograms and finding the best splits. By leveraging its inner workings, discretizing and auto-grouping the predictor matrix is fairly simple and blazing fast.
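
To illustrate the idea (a minimal sketch, not the actual TreeDiscretizer implementation): fit a single-round, few-leaf LightGBM model on one feature and reuse its split thresholds as bin edges.

import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
y_toy = np.sin(x[:, 0]) + rng.normal(scale=0.1, size=1000)

# one boosting round and at most 10 leaves -> at most 9 split thresholds
booster = lgb.train(
    {"objective": "rmse", "num_leaves": 10, "min_split_gain": 0.05, "verbosity": -1},
    lgb.Dataset(x, label=y_toy),
    num_boost_round=1,
)

def split_thresholds(node):
    """Recursively collect the split thresholds of a dumped LightGBM tree."""
    if "split_feature" not in node:  # leaf node
        return []
    return (
        [node["threshold"]]
        + split_thresholds(node["left_child"])
        + split_thresholds(node["right_child"])
    )

edges = sorted(split_thresholds(booster.dump_model()["tree_info"][0]["tree_structure"]))
binned = np.digitize(x[:, 0], edges)  # each observation mapped to a bin index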

[1]:
# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:95% !important; }</style>"))
[2]:
# Settings and libraries
from __future__ import print_function

import pandas as pd
import lightgbm
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import gc

import arfs
import arfs.preprocessing as arfspp
from arfs.utils import (
    _make_corr_dataset_regression,
    _make_corr_dataset_classification,
)

%matplotlib inline
[3]:
print(f"Run with ARFS {arfs.__version__} and ligthgbm {lightgbm.__version__}")
Run with ARFS 2.1.0 and ligthgbm 3.3.1

Generating artificial data#

Inspired by the BorutaPy unit test

[4]:
X, y, w = _make_corr_dataset_regression()
data = X.copy()
data["target"] = y
[5]:
# significant regressors
x_vars = ["var0", "var1", "var2", "var3", "var4"]
y_vars = ["target"]
g = sns.PairGrid(data, x_vars=x_vars, y_vars=y_vars)
g.map(plt.scatter, alpha=0.1)

# noise
x_vars = ["var5", "var6", "var7", "var8", "var9", "var10"]
y_vars = ["target"]
g = sns.PairGrid(data, x_vars=x_vars, y_vars=y_vars)
g.map(plt.scatter, alpha=0.1)

plt.plot();
../_images/notebooks_preprocessing_6_0.png
../_images/notebooks_preprocessing_6_1.png
[6]:
X.head()
[6]:
var0 var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 nice_guys
0 1.016361 0.821664 -1.184095 0.985738 1.050445 -0.494190 2 0 -0.361717 1 1.0 0 1.705980 Bias
1 0.929581 0.334013 -1.063030 0.819273 1.016252 0.283845 1 0 0.178670 2 1.0 1 NaN Klaue
2 1.095456 0.187234 -1.488666 1.087443 1.140542 0.962503 2 0 -3.375579 2 1.0 2 NaN Imaginedragons
3 1.318165 0.994528 -1.370624 1.592398 1.315021 1.165595 0 0 -0.449650 2 1.0 3 2.929707 MarkZ
4 0.849496 0.184859 -0.806604 0.865702 0.991916 -0.058833 1 0 0.763903 1 1.0 4 NaN Thanos
[7]:
X.dtypes
[7]:
var0          float64
var1          float64
var2          float64
var3          float64
var4          float64
var5          float64
var6            int32
var7            int32
var8          float64
var9            int32
var10         float64
var11        category
var12         float64
nice_guys      object
dtype: object

Selecting columns#

The custom selector returns a callable (function) which, when applied to a dataframe, returns the list of matching column names.

[8]:
from arfs.preprocessing import dtype_column_selector

cat_features_selector = dtype_column_selector(
    dtype_include=["category", "object", "bool"],
    dtype_exclude=[np.number],
    pattern=None,
    exclude_cols=None,
)

cat_features_selector(X)
[8]:
['var11', 'nice_guys']
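
Name-based selection can be combined with dtype filters. A small sketch, assuming (as the signature suggests) that pattern is a regular expression matched against column names, mirroring scikit-learn's make_column_selector:

num_var_selector = dtype_column_selector(
    dtype_include=[np.number],
    dtype_exclude=None,
    pattern="^var",  # keep only numeric columns whose name starts with "var"
    exclude_cols=None,
)

num_var_selector(X)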

Encoding categorical data#

Unlike scikit-learn, you don’t need to build a custom ColumnTransformer. You can simply pass in a dataframe and get back the same dataframe with the relevant columns encoded. You can:

  • fit

  • transform

  • fit_transform

  • inverse_transform

  • use it in a Pipeline

just as you would in vanilla scikit-learn.
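
As a one-step alternative to the separate fit / transform shown below, with the same arguments:

from arfs.preprocessing import OrdinalEncoderPandas

X_encoded = OrdinalEncoderPandas(exclude_cols=["nice_guys"]).fit_transform(X)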

[9]:
from arfs.preprocessing import OrdinalEncoderPandas

encoder = OrdinalEncoderPandas(exclude_cols=["nice_guys"]).fit(X)
X_encoded = encoder.transform(X)
[10]:
X_encoded.head()
[10]:
var0 var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 nice_guys
0 1.016361 0.821664 -1.184095 0.985738 1.050445 -0.494190 2 0 -0.361717 1 1.0 0.0 1.705980 Bias
1 0.929581 0.334013 -1.063030 0.819273 1.016252 0.283845 1 0 0.178670 2 1.0 1.0 NaN Klaue
2 1.095456 0.187234 -1.488666 1.087443 1.140542 0.962503 2 0 -3.375579 2 1.0 2.0 NaN Imaginedragons
3 1.318165 0.994528 -1.370624 1.592398 1.315021 1.165595 0 0 -0.449650 2 1.0 3.0 2.929707 MarkZ
4 0.849496 0.184859 -0.806604 0.865702 0.991916 -0.058833 1 0 0.763903 1 1.0 4.0 NaN Thanos
[11]:
encoder.inverse_transform(X_encoded)
[11]:
var0 var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 nice_guys
0 1.016361 0.821664 -1.184095 0.985738 1.050445 -0.494190 2 0 -0.361717 1 1.0 0 1.705980 Bias
1 0.929581 0.334013 -1.063030 0.819273 1.016252 0.283845 1 0 0.178670 2 1.0 1 NaN Klaue
2 1.095456 0.187234 -1.488666 1.087443 1.140542 0.962503 2 0 -3.375579 2 1.0 2 NaN Imaginedragons
3 1.318165 0.994528 -1.370624 1.592398 1.315021 1.165595 0 0 -0.449650 2 1.0 3 2.929707 MarkZ
4 0.849496 0.184859 -0.806604 0.865702 0.991916 -0.058833 1 0 0.763903 1 1.0 4 NaN Thanos
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 0.904073 0.009764 -0.909486 0.831147 1.041658 0.081291 0 0 -0.542799 3 1.0 495 NaN Dracula
996 1.370908 1.144173 -1.129008 2.002148 1.418488 0.484591 0 1 -1.932372 1 1.0 496 2.587139 Alien
997 1.113773 0.206798 -1.371649 1.167287 1.079760 0.267475 0 0 -0.329798 0 1.0 497 3.880225 Gruber
998 0.905654 0.598419 -0.792174 0.783557 1.275266 0.380826 1 0 -1.987310 1 1.0 498 1.888725 Gruber
999 1.138102 0.143818 -1.348573 1.321900 1.309521 1.342722 1 0 -0.871590 1 1.0 499 NaN Creationist

1000 rows × 14 columns

[12]:
X.head()
[12]:
var0 var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 nice_guys
0 1.016361 0.821664 -1.184095 0.985738 1.050445 -0.494190 2 0 -0.361717 1 1.0 0 1.705980 Bias
1 0.929581 0.334013 -1.063030 0.819273 1.016252 0.283845 1 0 0.178670 2 1.0 1 NaN Klaue
2 1.095456 0.187234 -1.488666 1.087443 1.140542 0.962503 2 0 -3.375579 2 1.0 2 NaN Imaginedragons
3 1.318165 0.994528 -1.370624 1.592398 1.315021 1.165595 0 0 -0.449650 2 1.0 3 2.929707 MarkZ
4 0.849496 0.184859 -0.806604 0.865702 0.991916 -0.058833 1 0 0.763903 1 1.0 4 NaN Thanos
[13]:
X.nunique()
[13]:
var0         1000
var1         1000
var2         1000
var3         1000
var4         1000
var5         1000
var6            6
var7            2
var8         1000
var9            6
var10           1
var11         500
var12         500
nice_guys      36
dtype: int64

Discretization and auto-grouping#

Discretization and auto-grouping are supervised tasks and therefore require a target, and optionally a weight.

[14]:
from arfs.preprocessing import TreeDiscretizer

# main parameter controlling how aggressive the auto-grouping will be
lgb_params = {"min_split_gain": 0.05}
# instantiate the discretizer
disc = TreeDiscretizer(bin_features="all", n_bins=10, boost_params=lgb_params)

disc.fit(X=X, y=y, sample_weight=w)
[14]:
TreeDiscretizer(bin_features=['var0', 'var1', 'var2', 'var3', 'var4', 'var5',
                              'var6', 'var7', 'var8', 'var9', 'var10', 'var11',
                              'var12', 'nice_guys'],
                boost_params={'max_leaf': 10, 'min_split_gain': 0.05,
                              'num_boost_round': 1, 'objective': 'rmse'})

As expected, the min_split_gain regularization auto-groups into a single level all the predictors that are not correlated with the target.

[15]:
X_trans = disc.transform(X)
X_trans
[15]:
var0 var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 nice_guys
0 (1.01, 1.09] (0.6, 0.85] (-1.34, -1.1] (0.84, 1.01] (1.0, 1.06] (-0.96, -0.24] (1.0, inf] (-inf, inf] (-1.43, -0.00048] (-inf, inf] (-inf, inf] 0 / 1 //.../10 / 11 (1.61, 2.55] Bias / /...// Excel
1 (0.89, 1.01] (0.32, 0.35] (-1.1, -0.96] (0.71, 0.84] (1.0, 1.06] (0.084, 0.34] (-inf, 1.0] (-inf, inf] (-0.00048, 0.18] (-inf, inf] (-inf, inf] 0 / 1 //.../10 / 11 (1.61, 2.55] Bias / /...// Excel
2 (1.09, 1.22] (0.17, 0.27] (-1.58, -1.34] (1.01, 1.12] (1.13, 1.34] (0.68, 1.45] (1.0, inf] (-inf, inf] (-inf, -1.91] (-inf, inf] (-inf, inf] 0 / 1 //.../10 / 11 (1.61, 2.55] Bias / /...// Excel
3 (1.3, 1.38] (0.85, inf] (-1.58, -1.34] (1.42, 1.6] (1.13, 1.34] (0.68, 1.45] (-inf, 1.0] (-inf, inf] (-1.43, -0.00048] (-inf, inf] (-inf, inf] 0 / 1 //.../10 / 11 (2.88, inf] Bias / /...// Excel
4 (0.78, 0.89] (0.17, 0.27] (-0.89, -0.59] (0.84, 1.01] (0.9, 1.0] (-0.12, 0.084] (-inf, 1.0] (-inf, inf] (0.72, 0.98] (-inf, inf] (-inf, inf] 0 / 1 //.../10 / 11 (1.61, 2.55] Bias / /...// Excel
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 (0.89, 1.01] (-inf, 0.17] (-0.96, -0.89] (0.71, 0.84] (1.0, 1.06] (-0.12, 0.084] (-inf, 1.0] (-inf, inf] (-1.43, -0.00048] (-inf, inf] (-inf, inf] 468 / 4/.../8 / 499 (1.61, 2.55] Bias / /...// Excel
996 (1.3, 1.38] (0.85, inf] (-1.34, -1.1] (1.81, inf] (1.42, inf] (0.44, 0.63] (-inf, 1.0] (-inf, inf] (-inf, -1.91] (-inf, inf] (-inf, inf] 468 / 4/.../8 / 499 (2.55, 2.88] Bias / /...// Excel
997 (1.09, 1.22] (0.17, 0.27] (-1.58, -1.34] (1.12, 1.31] (1.06, 1.13] (0.084, 0.34] (-inf, 1.0] (-inf, inf] (-1.43, -0.00048] (-inf, inf] (-inf, inf] 468 / 4/.../8 / 499 (2.88, inf] Bias / /...// Excel
998 (0.89, 1.01] (0.45, 0.6] (-0.89, -0.59] (0.71, 0.84] (1.13, 1.34] (0.34, 0.44] (-inf, 1.0] (-inf, inf] (-inf, -1.91] (-inf, inf] (-inf, inf] 468 / 4/.../8 / 499 (1.61, 2.55] Bias / /...// Excel
999 (1.09, 1.22] (-inf, 0.17] (-1.58, -1.34] (1.31, 1.42] (1.13, 1.34] (0.68, 1.45] (-inf, 1.0] (-inf, inf] (-1.43, -0.00048] (-inf, inf] (-inf, inf] 468 / 4/.../8 / 499 (1.61, 2.55] Bias / /...// Excel

1000 rows × 14 columns
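
To confirm the auto-grouping, count the number of distinct bins per column of the transformed frame; predictors collapsed to a single level are the uninformative ones:

# columns with a single distinct value were auto-grouped into one level
X_trans.nunique()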

Sklearn Pipeline#

The selectors follow the scikit-learn base classes and are therefore compatible with scikit-learn in general. This has several advantages:

  • easier to maintain

  • easier to version

  • more flexible

  • faster, since unnecessary columns are removed before the computationally demanding steps

A sketch chaining the resulting pipeline with an estimator is given after the example below.

[16]:
from sklearn.pipeline import Pipeline
from arfs.preprocessing import OrdinalEncoderPandas
from arfs.feature_selection import CollinearityThreshold, UniqueValuesThreshold, MissingValueThreshold

encoder = Pipeline(
    [
        ("zero_variance", UniqueValuesThreshold()),
        ("collinearity", CollinearityThreshold(threshold=0.75)),
        ("discretizer", TreeDiscretizer(bin_features="all", n_bins=10, boost_params=lgb_params)),
        # the tree discretization might introduce NaN for pure-noise columns
        ("zero_variance_disc", UniqueValuesThreshold()),
        ("encoder", OrdinalEncoderPandas()),
    ]
)

X_encoded = encoder.fit(X, y).transform(X)
X_encoded.head()
[16]:
var1 var3 var5 var6 var8
0 (0.74, 0.94] (0.91, 1.12] (-inf, -0.31] (1.0, inf] (-0.45, -0.28]
1 (0.19, 0.4] (0.71, 0.91] (0.084, 0.34] (-inf, 1.0] (-0.00048, 0.18]
2 (0.16, 0.19] (0.91, 1.12] (0.45, 1.03] (1.0, inf] (-inf, -1.15]
3 (0.94, inf] (1.48, 1.6] (1.03, 1.38] (-inf, 1.0] (-0.59, -0.45]
4 (0.16, 0.19] (0.71, 0.91] (-0.17, 0.084] (-inf, 1.0] (0.33, 0.98]
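
Because every step exposes the scikit-learn API, the preprocessing pipeline can also be chained with an estimator. A minimal sketch, assuming the encoded output dtypes are numeric or pandas categorical, which LightGBM accepts natively:

from lightgbm import LGBMRegressor
from sklearn.pipeline import Pipeline

# hypothetical end-to-end pipeline: preprocessing followed by a GBM
model = Pipeline([("preprocess", encoder), ("gbm", LGBMRegressor())])
model.fit(X, y)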