MRmR - regression and classification#
Maximal relevance minimal redundancy (MRmR) feature selection is, theoretically, a subset of all-relevant feature selection: it keeps a small set of predictors that are individually relevant to the target while being mutually non-redundant.
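The selection itself is greedy: at each step, the remaining candidate with the largest relevance-to-redundancy quotient joins the selected set. Below is a library-agnostic sketch of that loop (F-statistic for relevance, mean absolute Pearson correlation for redundancy); it assumes a numeric, NaN-free feature frame and is an illustration, not the ARFS implementation.

import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

def mrmr_quotient(X: pd.DataFrame, y: pd.Series, k: int = 5) -> list:
    """Greedy mRMR (quotient variant) for a numeric, NaN-free frame X."""
    relevance = pd.Series(f_regression(X, y)[0], index=X.columns)  # per-feature F-statistic
    corr = X.corr().abs()  # pairwise absolute correlation, used as the redundancy proxy
    selected, candidates = [], list(X.columns)
    for _ in range(min(k, len(candidates))):
        if selected:
            # average redundancy of each candidate with the features already selected
            redundancy = corr.loc[candidates, selected].mean(axis=1).clip(lower=1e-6)
        else:
            redundancy = pd.Series(1e-6, index=candidates)  # first pick: nothing to be redundant with
        best = (relevance[candidates] / redundancy).idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected

On the Boston frame loaded below, it could be tried on the numeric columns only, e.g. mrmr_quotient(X.select_dtypes("number"), y).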
[41]:
# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:95% !important; }</style>"))
import gc
import arfs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sys import getsizeof, path
from arfs.utils import load_data
from arfs.feature_selection import MinRedundancyMaxRelevance
from arfs.preprocessing import TreeDiscretizer
plt.style.use("fivethirtyeight")
rng = np.random.RandomState(seed=42)
import warnings
warnings.filterwarnings("ignore")
[42]:
print(f"Run with ARFS {arfs.__version__}")
Run with ARFS 2.2.0
[43]:
%matplotlib inline
[44]:
gc.enable()
gc.collect()
[44]:
203
Simple Usage#
In the following examples, I’ll use a classical data set to which I added random predictors (numerical and categorical). A maximal relevance minimal redundancy method should discard them. In the unit tests, you’ll find examples using artificial data with genuine (correlated and non-linear) predictors and some random/noise columns.
MRmR#
[45]:
boston = load_data(name="Boston")
X, y = boston.data, boston.target
y.name = "target"
[46]:
y.head()
[46]:
0 24.0
1 21.6
2 34.7
3 33.4
4 36.2
Name: target, dtype: float64
[47]:
X.dtypes
[47]:
CRIM float64
ZN float64
INDUS float64
CHAS category
NOX float64
RM float64
AGE float64
DIS float64
RAD category
TAX float64
PTRATIO float64
B float64
LSTAT float64
random_num1 float64
random_num2 int32
random_cat category
random_cat_2 category
genuine_num float64
dtype: object
[48]:
X.head()
[48]:
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | random_num1 | random_num2 | random_cat | random_cat_2 | genuine_num |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 0.496714 | 0 | cat_3517 | Platist | 7.080332 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | -0.138264 | 0 | cat_2397 | MarkZ | 5.245384 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 0.647689 | 0 | cat_3735 | Dracula | 6.375795 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 1.523030 | 0 | cat_2870 | Bejita | 6.725118 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | -0.234153 | 4 | cat_1160 | Variance | 7.867781 |
[49]:
fs_mrmr = MinRedundancyMaxRelevance(
n_features_to_select=5,
relevance_func=None,
redundancy_func=None,
task="regression", # "classification",
denominator_func=np.mean,
only_same_domain=False,
return_scores=False,
show_progress=True,
n_jobs=-1,
)
# fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
fs_mrmr.fit(X=X, y=y, sample_weight=None)
[49]:
MinRedundancyMaxRelevance(n_features_to_select=5, n_jobs=-1, redundancy_func=functools.partial(<function association_series at 0x00000281A87AE950>, n_jobs=-1, normalize=True), relevance_func=functools.partial(<function f_stat_regression_parallel at 0x00000281A87AEDD0>, n_jobs=-1))
[50]:
X_trans = fs_mrmr.transform(X)
X_trans.head()
[50]:
| | genuine_num | LSTAT | RM | CHAS | RAD |
---|---|---|---|---|---|
0 | 7.080332 | 4.98 | 6.575 | 0.0 | 1.0 |
1 | 5.245384 | 9.14 | 6.421 | 0.0 | 2.0 |
2 | 6.375795 | 4.03 | 7.185 | 0.0 | 2.0 |
3 | 6.725118 | 2.94 | 6.998 | 0.0 | 3.0 |
4 | 7.867781 | 5.33 | 7.147 | 0.0 | 3.0 |
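The aggregated redundancy in the denominator of the score is controlled by denominator_func; the cells here use np.mean. A stricter penalty, driven by the largest redundancy with the already selected features, can be tried instead (assuming, as the usage above suggests, that any callable reducing an array of redundancies to a scalar is accepted):

fs_mrmr_max = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    task="regression",
    denominator_func=np.max,  # penalise a candidate by its worst redundancy rather than the average
)
fs_mrmr_max.fit(X=X, y=y, sample_weight=None)
fs_mrmr_max.get_feature_names_out()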
Using a single job avoids the overhead of starting multiple processes (worthwhile for moderately sized data).
[51]:
fs_mrmr = MinRedundancyMaxRelevance(
n_features_to_select=5,
relevance_func=None,
redundancy_func=None,
task="regression", # "classification",
denominator_func=np.mean,
only_same_domain=False,
return_scores=False,
show_progress=True,
n_jobs=1,
)
fs_mrmr.fit(X=X, y=y, sample_weight=None)
[51]:
MinRedundancyMaxRelevance(n_features_to_select=5, redundancy_func=functools.partial(<function association_series at 0x00000281A87AE950>, n_jobs=1, normalize=True), relevance_func=functools.partial(<function f_stat_regression_parallel at 0x00000281A87AEDD0>, n_jobs=1))
[52]:
fs_mrmr.feature_names_in_
[52]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT', 'random_num1', 'random_num2',
'random_cat', 'random_cat_2', 'genuine_num'], dtype=object)
[53]:
fs_mrmr.support_
[53]:
array([False, False, False, True, False, True, False, False, True,
False, False, False, True, False, False, False, False, True])
[54]:
fs_mrmr.get_feature_names_out()
[54]:
array(['CHAS', 'RM', 'RAD', 'LSTAT', 'genuine_num'], dtype=object)
[55]:
fs_mrmr.ranking_
[55]:
| | mrmr | relevance | redundancy |
---|---|---|---|
genuine_num | inf | 2.461769 | 0.000000 |
LSTAT | 1636.219687 | 1.636220 | 0.001000 |
RM | 3.249000 | 1.106967 | 0.340710 |
CHAS | 1.752553 | 0.731266 | 0.417258 |
RAD | 2.070764 | 0.990593 | 0.478371 |
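The mrmr score is the relevance divided by the aggregated redundancy with the features already selected when the candidate was picked; the first feature selected has nothing to be redundant with, hence a redundancy of zero and an infinite score. The quotient can be checked directly on the ranking table (column names as in the output above):

# relevance / redundancy reproduces the mrmr column; the first pick divides by zero -> inf
fs_mrmr.ranking_.assign(check=lambda d: d["relevance"] / d["redundancy"])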
Classification#
[56]:
titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target
# y = y.astype('category')
y.name = "target"
[57]:
X.head()
[57]:
| | pclass | sex | embarked | random_cat | is_alone | title | age | family_size | fare | random_num |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | female | S | Morty | 1 | Mrs | 29.0000 | 0.0 | 211.3375 | 0.496714 |
1 | 1 | male | S | Morty | 0 | Master | 0.9167 | 3.0 | 151.5500 | -0.138264 |
2 | 1 | female | S | Fry | 0 | Mrs | 2.0000 | 3.0 | 151.5500 | 0.647689 |
3 | 1 | male | S | Cartman | 0 | Mr | 30.0000 | 3.0 | 151.5500 | 1.523030 |
4 | 1 | female | S | Vador | 0 | Mrs | 25.0000 | 3.0 | 151.5500 | -0.234153 |
[58]:
X.dtypes
[58]:
pclass object
sex object
embarked object
random_cat object
is_alone object
title object
age float64
family_size float64
fare float64
random_num float64
dtype: object
[59]:
y
[59]:
0 1
1 1
2 0
3 0
4 0
..
1304 0
1305 0
1306 0
1307 0
1308 0
Name: target, Length: 1309, dtype: category
Categories (2, object): ['0', '1']
[60]:
fs_mrmr = MinRedundancyMaxRelevance(
n_features_to_select=5,
relevance_func=None,
redundancy_func=None,
task="classification",
denominator_func=np.mean,
only_same_domain=False,
return_scores=False,
show_progress=True,
n_jobs=1,
)
# fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
fs_mrmr.fit(X=X, y=y, sample_weight=None)
[60]:
MinRedundancyMaxRelevance(n_features_to_select=5, redundancy_func=functools.partial(<function association_series at 0x00000281A87AE950>, n_jobs=1, normalize=True), relevance_func=functools.partial(<function f_stat_classification_parallel at 0x00000281A87AF010>, n_jobs=1), task='classification')
[61]:
fs_mrmr.feature_names_in_
[61]:
array(['pclass', 'sex', 'embarked', 'random_cat', 'is_alone', 'title',
'age', 'family_size', 'fare', 'random_num'], dtype=object)
[62]:
fs_mrmr.support_
[62]:
array([False, True, False, False, True, True, False, True, True,
False])
[63]:
fs_mrmr.get_feature_names_out()
[63]:
array(['sex', 'is_alone', 'title', 'family_size', 'fare'], dtype=object)
[64]:
fs_mrmr.ranking_
[64]:
| | mrmr | relevance | redundancy |
---|---|---|---|
sex | inf | 1.740256 | 0.000000 |
fare | 8.129865 | 1.499352 | 0.184425 |
title | 1.483999 | 0.694114 | 0.467732 |
family_size | -1.320894 | -0.516471 | 0.391001 |
is_alone | -2.068465 | -0.639219 | 0.309031 |
Pipeline#
Integration as a step of a feature selection pipeline
[65]:
from sklearn.pipeline import Pipeline
from arfs.feature_selection import (
MissingValueThreshold,
UniqueValuesThreshold,
make_fs_summary,
)
titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target
# y = y.astype('category')
y.name = "target"
fs_mrmr = MinRedundancyMaxRelevance(
n_features_to_select=5,
relevance_func=None,
redundancy_func=None,
task="classification",
denominator_func=np.mean,
only_same_domain=False,
return_scores=False,
show_progress=True,
n_jobs=1,
)
mrmr_fs_pipeline = Pipeline(
[
("missing", MissingValueThreshold(threshold=0.05)),
("unique", UniqueValuesThreshold(threshold=1)),
("mrmr", fs_mrmr),
]
)
X_trans = mrmr_fs_pipeline.fit(X=X, y=y).transform(X=X)
# collinearity__sample_weight=w,
# lowimp__sample_weight=w)
X_trans.head()
[65]:
| | sex | fare | title | family_size | is_alone |
---|---|---|---|---|---|
0 | female | 211.3375 | Mrs | 0.0 | 1 |
1 | male | 151.5500 | Master | 3.0 | 0 |
2 | female | 151.5500 | Mrs | 3.0 | 0 |
3 | male | 151.5500 | Mr | 3.0 | 0 |
4 | female | 151.5500 | Mrs | 3.0 | 0 |
[66]:
make_fs_summary(mrmr_fs_pipeline)
[66]:
| | predictor | missing | unique | mrmr |
---|---|---|---|---|
0 | pclass | 1 | 1 | 0 |
1 | sex | 1 | 1 | 1 |
2 | embarked | 1 | 1 | 0 |
3 | random_cat | 1 | 1 | 0 |
4 | is_alone | 1 | 1 | 1 |
5 | title | 1 | 1 | 1 |
6 | age | 1 | 1 | 0 |
7 | family_size | 1 | 1 | 1 |
8 | fare | 1 | 1 | 1 |
9 | random_num | 1 | 1 | 0 |
[67]:
mrmr_fs_pipeline.named_steps["mrmr"].get_feature_names_out()
[67]:
array(['sex', 'is_alone', 'title', 'family_size', 'fare'], dtype=object)
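The commented lines in the pipeline cell hint at passing sample weights to individual steps. Since the selector’s fit accepts sample_weight (see the earlier cells), a weight vector can be routed to the mrmr step with scikit-learn’s usual <step>__<parameter> syntax; the uniform weights below are only a placeholder:

w = pd.Series(np.ones(len(y)), index=y.index)  # placeholder: all observations weighted equally
mrmr_fs_pipeline.fit(X=X, y=y, mrmr__sample_weight=w)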
Does discretization help?#
We can use TreeDiscretizer to discretize and auto-group the predictors (whether numeric or not). Does that improve the MRmR output?
[68]:
# main parameter controlling how aggressive the auto-grouping will be
lgb_params = {"min_split_gain": 0.05}
# instantiate the discretizer
disc = TreeDiscretizer(bin_features="all", n_bins=10, boost_params=lgb_params)
titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target
# y = y.astype('category')
y.name = "target"
y = y.astype("int")
fs_mrmr = MinRedundancyMaxRelevance(
n_features_to_select=5,
relevance_func=None,
redundancy_func=None,
task="classification",
denominator_func=np.mean,
only_same_domain=False,
return_scores=False,
show_progress=True,
n_jobs=1,
)
mrmr_fs_pipeline = Pipeline(
[
("missing", MissingValueThreshold(threshold=0.05)),
("unique", UniqueValuesThreshold(threshold=1)),
("discretizer", disc),
("mrmr", fs_mrmr),
]
)
X_trans = mrmr_fs_pipeline.fit(X=X, y=y).transform(X=X)
# collinearity__sample_weight=w,
# lowimp__sample_weight=w)
X_trans.head()
[68]:
| | sex | title | is_alone | pclass | embarked |
---|---|---|---|---|---|
0 | female | Mrs | 1 | 1 | S / missing / Q |
1 | male | Master | 0 | 1 | S / missing / Q |
2 | female | Mrs | 0 | 1 | S / missing / Q |
3 | male | Mr | 0 | 1 | S / missing / Q |
4 | female | Mrs | 0 | 1 | S / missing / Q |
[69]:
make_fs_summary(mrmr_fs_pipeline)
[69]:
| | predictor | missing | unique | discretizer | mrmr |
---|---|---|---|---|---|
0 | pclass | 1 | 1 | nan | 1 |
1 | sex | 1 | 1 | nan | 1 |
2 | embarked | 1 | 1 | nan | 1 |
3 | random_cat | 1 | 1 | nan | 0 |
4 | is_alone | 1 | 1 | nan | 1 |
5 | title | 1 | 1 | nan | 1 |
6 | age | 1 | 1 | nan | 0 |
7 | family_size | 1 | 1 | nan | 0 |
8 | fare | 1 | 1 | nan | 0 |
9 | random_num | 1 | 1 | nan | 0 |
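To judge whether discretization changed the picture, the mrmr step of this pipeline exposes the same attributes as before, so its scores can be put side by side with the earlier, non-discretized run:

# same inspection as before, now computed on the discretized / auto-grouped predictors
mrmr_fs_pipeline.named_steps["mrmr"].ranking_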