MRmR - regression and classification#

Maximal relevance minimal redundancy (MRmr) feature selection is, theoretically, a subset of all-relevant feature selection: it aims for a small, non-redundant set of predictors rather than every feature that carries information about the target.
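
The greedy scheme is simple: at each step, select the candidate whose relevance to the target is largest relative to its mean redundancy with the features already selected (that mean is exactly what the denominator_func=np.mean argument controls below). Here is a minimal sketch of the loop on numeric columns, using absolute Pearson correlation for both terms; the ARFS defaults plug in f-statistics and association measures instead, and mrmr_sketch is purely illustrative, not part of the library API:

[ ]:
import numpy as np
import pandas as pd


def mrmr_sketch(X: pd.DataFrame, y: pd.Series, k: int) -> list:
    """Greedy mRMR sketch: score = relevance / mean redundancy (numeric columns only)."""
    relevance = X.corrwith(y).abs()  # relevance of each feature to the target
    selected, candidates = [], list(X.columns)
    for _ in range(k):
        best, best_score = None, -np.inf
        for c in candidates:
            # mean absolute correlation with the already selected features
            redundancy = X[selected].corrwith(X[c]).abs().mean() if selected else 0.0
            score = relevance[c] / max(redundancy, 1e-3)  # floor avoids division by zero
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
        candidates.remove(best)
    return selected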

[41]:
import gc
import arfs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sys import getsizeof, path
from arfs.utils import load_data
from arfs.feature_selection import MinRedundancyMaxRelevance
from arfs.preprocessing import TreeDiscretizer

plt.style.use("fivethirtyeight")
rng = np.random.RandomState(seed=42)

import warnings

warnings.filterwarnings("ignore")
[42]:
print(f"Run with ARFS {arfs.__version__}")
Run with ARFS 2.2.0
[43]:
%matplotlib inline
[44]:
gc.enable()
gc.collect()
[44]:
203

Simple Usage#

In the following examples, I'll use a classical data set to which I added random predictors (numerical and categorical). A maximal relevance minimal redundancy method should discard them. In the unit tests, you'll find examples using artificial data with genuine (correlated and non-linear) predictors and some random/noise columns.

MRmr#

[45]:
boston = load_data(name="Boston")
X, y = boston.data, boston.target
y.name = "target"
[46]:
y.head()
[46]:
0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: target, dtype: float64
[47]:
X.dtypes
[47]:
CRIM             float64
ZN               float64
INDUS            float64
CHAS            category
NOX              float64
RM               float64
AGE              float64
DIS              float64
RAD             category
TAX              float64
PTRATIO          float64
B                float64
LSTAT            float64
random_num1      float64
random_num2        int32
random_cat      category
random_cat_2    category
genuine_num      float64
dtype: object
[48]:
X.head()
[48]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT random_num1 random_num2 random_cat random_cat_2 genuine_num
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 0.496714 0 cat_3517 Platist 7.080332
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 -0.138264 0 cat_2397 MarkZ 5.245384
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 0.647689 0 cat_3735 Dracula 6.375795
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 1.523030 0 cat_2870 Bejita 6.725118
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 -0.234153 4 cat_1160 Variance 7.867781
[49]:
fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="regression",  # "classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=-1,
)

# fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
fs_mrmr.fit(X=X, y=y, sample_weight=None)
[49]:
MinRedundancyMaxRelevance(n_features_to_select=5, n_jobs=-1,
                          redundancy_func=functools.partial(<function association_series at 0x00000281A87AE950>, n_jobs=-1, normalize=True),
                          relevance_func=functools.partial(<function f_stat_regression_parallel at 0x00000281A87AEDD0>, n_jobs=-1))
[50]:
X_trans = fs_mrmr.transform(X)
X_trans.head()
[50]:
genuine_num LSTAT RM CHAS RAD
0 7.080332 4.98 6.575 0.0 1.0
1 5.245384 9.14 6.421 0.0 2.0
2 6.375795 4.03 7.185 0.0 2.0
3 6.725118 2.94 6.998 0.0 3.0
4 7.867781 5.33 7.147 0.0 3.0

Using a single job avoids the overhead of spawning multiple processes, which pays off for moderately sized data.

[51]:
fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="regression",  # "classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=1,
)

fs_mrmr.fit(X=X, y=y, sample_weight=None)
[51]:
MinRedundancyMaxRelevance(n_features_to_select=5,
                          redundancy_func=functools.partial(<function association_series at 0x00000281A87AE950>, n_jobs=1, normalize=True),
                          relevance_func=functools.partial(<function f_stat_regression_parallel at 0x00000281A87AEDD0>, n_jobs=1))
[52]:
fs_mrmr.feature_names_in_
[52]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT', 'random_num1', 'random_num2',
       'random_cat', 'random_cat_2', 'genuine_num'], dtype=object)
[53]:
fs_mrmr.support_
[53]:
array([False, False, False,  True, False,  True, False, False,  True,
       False, False, False,  True, False, False, False, False,  True])
[54]:
fs_mrmr.get_feature_names_out()
[54]:
array(['CHAS', 'RM', 'RAD', 'LSTAT', 'genuine_num'], dtype=object)
[55]:
fs_mrmr.ranking_
[55]:
mrmr relevance redundancy
genuine_num inf 2.461769 0.000000
LSTAT 1636.219687 1.636220 0.001000
RM 3.249000 1.106967 0.340710
CHAS 1.752553 0.731266 0.417258
RAD 2.070764 0.990593 0.478371
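
The ranking_ attribute makes the trade-off easy to inspect: MRmr rewards high relevance combined with low redundancy. A quick way to visualize both components, assuming ranking_ remains a DataFrame indexed by feature name as shown above:

[ ]:
# bar plot of the two components of the mRMR score for the selected features
ax = fs_mrmr.ranking_[["relevance", "redundancy"]].plot.barh(figsize=(8, 4))
ax.set_xlabel("normalized score")
ax.set_title("MRmr ranking: relevance vs redundancy")
plt.tight_layout()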

Classification#

[56]:
titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target

# y = y.astype('category')
y.name = "target"
[57]:
X.head()
[57]:
pclass sex embarked random_cat is_alone title age family_size fare random_num
0 1 female S Morty 1 Mrs 29.0000 0.0 211.3375 0.496714
1 1 male S Morty 0 Master 0.9167 3.0 151.5500 -0.138264
2 1 female S Fry 0 Mrs 2.0000 3.0 151.5500 0.647689
3 1 male S Cartman 0 Mr 30.0000 3.0 151.5500 1.523030
4 1 female S Vador 0 Mrs 25.0000 3.0 151.5500 -0.234153
[58]:
X.dtypes
[58]:
pclass          object
sex             object
embarked        object
random_cat      object
is_alone        object
title           object
age            float64
family_size    float64
fare           float64
random_num     float64
dtype: object
[59]:
y
[59]:
0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: target, Length: 1309, dtype: category
Categories (2, object): ['0', '1']
[60]:
fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=1,
)

# fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
fs_mrmr.fit(X=X, y=y, sample_weight=None)
[60]:
MinRedundancyMaxRelevance(n_features_to_select=5,
                          redundancy_func=functools.partial(<function association_series at 0x00000281A87AE950>, n_jobs=1, normalize=True),
                          relevance_func=functools.partial(<function f_stat_classification_parallel at 0x00000281A87AF010>, n_jobs=1),
                          task='classification')
[61]:
fs_mrmr.feature_names_in_
[61]:
array(['pclass', 'sex', 'embarked', 'random_cat', 'is_alone', 'title',
       'age', 'family_size', 'fare', 'random_num'], dtype=object)
[62]:
fs_mrmr.support_
[62]:
array([False,  True, False, False,  True,  True, False,  True,  True,
       False])
[63]:
fs_mrmr.get_feature_names_out()
[63]:
array(['sex', 'is_alone', 'title', 'family_size', 'fare'], dtype=object)
[64]:
fs_mrmr.ranking_
[64]:
mrmr relevance redundancy
sex inf 1.740256 0.000000
fare 8.129865 1.499352 0.184425
title 1.483999 0.694114 0.467732
family_size -1.320894 -0.516471 0.391001
is_alone -2.068465 -0.639219 0.309031

Pipeline#

Integration as a step of a feature selection pipeline

[65]:
from sklearn.pipeline import Pipeline
from arfs.feature_selection import (
    MissingValueThreshold,
    UniqueValuesThreshold,
    make_fs_summary,
)

titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target

# y = y.astype('category')
y.name = "target"

fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=1,
)

mrmr_fs_pipeline = Pipeline(
    [
        ("missing", MissingValueThreshold(threshold=0.05)),
        ("unique", UniqueValuesThreshold(threshold=1)),
        ("mrmr", fs_mrmr),
    ]
)

X_trans = mrmr_fs_pipeline.fit(X=X, y=y).transform(X=X)
X_trans.head()
[65]:
sex fare title family_size is_alone
0 female 211.3375 Mrs 0.0 1
1 male 151.5500 Master 3.0 0
2 female 151.5500 Mrs 3.0 0
3 male 151.5500 Mr 3.0 0
4 female 151.5500 Mrs 3.0 0
[66]:
make_fs_summary(mrmr_fs_pipeline)
[66]:
  predictor missing unique mrmr
0 pclass 1 1 0
1 sex 1 1 1
2 embarked 1 1 0
3 random_cat 1 1 0
4 is_alone 1 1 1
5 title 1 1 1
6 age 1 1 0
7 family_size 1 1 1
8 fare 1 1 1
9 random_num 1 1 0
[67]:
mrmr_fs_pipeline.named_steps["mrmr"].get_feature_names_out()
[67]:
array(['sex', 'is_alone', 'title', 'family_size', 'fare'], dtype=object)

Does discretization help?#

We can use TreeDiscretizer to discretize and auto-group the predictors (whether numeric or not). Does that improve the MRmr output?

[68]:
# main parameter controlling how aggressive the auto-grouping will be
lgb_params = {"min_split_gain": 0.05}
# instantiate the discretizer
disc = TreeDiscretizer(bin_features="all", n_bins=10, boost_params=lgb_params)

titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target

# y = y.astype('category')
y.name = "target"
y = y.astype("int")

fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=1,
)

mrmr_fs_pipeline = Pipeline(
    [
        ("missing", MissingValueThreshold(threshold=0.05)),
        ("unique", UniqueValuesThreshold(threshold=1)),
        ("discretizer", disc),
        ("mrmr", fs_mrmr),
    ]
)

X_trans = mrmr_fs_pipeline.fit(X=X, y=y).transform(X=X)
X_trans.head()
[68]:
sex title is_alone pclass embarked
0 female Mrs 1 1 S / missing / Q
1 male Master 0 1 S / missing / Q
2 female Mrs 0 1 S / missing / Q
3 male Mr 0 1 S / missing / Q
4 female Mrs 0 1 S / missing / Q
[69]:
make_fs_summary(mrmr_fs_pipeline)
[69]:
  predictor missing unique discretizer mrmr
0 pclass 1 1 nan 1
1 sex 1 1 nan 1
2 embarked 1 1 nan 1
3 random_cat 1 1 nan 0
4 is_alone 1 1 nan 1
5 title 1 1 nan 1
6 age 1 1 nan 0
7 family_size 1 1 nan 0
8 fare 1 1 nan 0
9 random_num 1 1 nan 0

With discretization, the selected subset changes: pclass and embarked replace fare and family_size, while the random predictors are still discarded.