MRmR - regression and classification#

Maximal relevance minimal redundancy (MRmr) feature selection is, theoretically, a subset of all-relevant feature selection: it aims for a small, non-redundant set of predictors rather than every feature that carries information about the target.
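
The greedy scheme is simple: at each step, select the candidate whose relevance to the target is largest relative to its mean redundancy with the features already selected (that mean is exactly what the denominator_func=np.mean argument controls below). Here is a minimal sketch of the loop on numeric columns, using absolute Pearson correlation for both terms; the ARFS defaults plug in f-statistics and association measures instead, and mrmr_sketch is purely illustrative, not part of the library API:

[ ]:
import numpy as np
import pandas as pd


def mrmr_sketch(X: pd.DataFrame, y: pd.Series, k: int) -> list:
    """Greedy mRMR sketch: score = relevance / mean redundancy (numeric columns only)."""
    relevance = X.corrwith(y).abs()  # relevance of each feature to the target
    selected, candidates = [], list(X.columns)
    for _ in range(k):
        best, best_score = None, -np.inf
        for c in candidates:
            # mean absolute correlation with the already selected features
            redundancy = X[selected].corrwith(X[c]).abs().mean() if selected else 0.0
            score = relevance[c] / max(redundancy, 1e-3)  # floor avoids division by zero
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
        candidates.remove(best)
    return selected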

[41]:
import gc
import arfs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sys import getsizeof, path
from arfs.utils import load_data
from arfs.feature_selection import MinRedundancyMaxRelevance
from arfs.preprocessing import TreeDiscretizer

plt.style.use("fivethirtyeight")
rng = np.random.RandomState(seed=42)

import warnings

warnings.filterwarnings("ignore")
[42]:
print(f"Run with ARFS {arfs.__version__}")
Run with ARFS 2.2.0
[43]:
%matplotlib inline
[44]:
gc.enable()
gc.collect()
[44]:
203

Simple Usage#

In the following examples, I'll use a classical data set to which I added random predictors (numerical and categorical). A maximal relevance minimal redundancy method should discard them. In the unit tests, you'll find examples using artificial data with genuine (correlated and non-linear) predictors and some random/noise columns.

MRmr#

[45]:
boston = load_data(name="Boston")
X, y = boston.data, boston.target
y.name = "target"
[46]:
y.head()
[46]:
0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: target, dtype: float64
[47]:
X.dtypes
[47]:
CRIM             float64
ZN               float64
INDUS            float64
CHAS            category
NOX              float64
RM               float64
AGE              float64
DIS              float64
RAD             category
TAX              float64
PTRATIO          float64
B                float64
LSTAT            float64
random_num1      float64
random_num2        int32
random_cat      category
random_cat_2    category
genuine_num      float64
dtype: object
[48]:
X.head()
[48]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT random_num1 random_num2 random_cat random_cat_2 genuine_num
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 0.496714 0 cat_3517 Platist 7.080332
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 -0.138264 0 cat_2397 MarkZ 5.245384
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 0.647689 0 cat_3735 Dracula 6.375795
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 1.523030 0 cat_2870 Bejita 6.725118
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 -0.234153 4 cat_1160 Variance 7.867781
[49]:
fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="regression",  # "classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=-1,
)

# fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
fs_mrmr.fit(X=X, y=y, sample_weight=None)
[49]:
MinRedundancyMaxRelevance(n_features_to_select=5, n_jobs=-1,
                          redundancy_func=functools.partial(<function association_series at 0x00000281A87AE950>, n_jobs=-1, normalize=True),
                          relevance_func=functools.partial(<function f_stat_regression_parallel at 0x00000281A87AEDD0>, n_jobs=-1))
[50]:
X_trans = fs_mrmr.transform(X)
X_trans.head()
[50]:
genuine_num LSTAT RM CHAS RAD
0 7.080332 4.98 6.575 0.0 1.0
1 5.245384 9.14 6.421 0.0 2.0
2 6.375795 4.03 7.185 0.0 2.0
3 6.725118 2.94 6.998 0.0 3.0
4 7.867781 5.33 7.147 0.0 3.0

Using a single job avoids the overhead of spawning multiple processes, which pays off for moderately sized data.

[51]:
fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="regression",  # "classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=1,
)

fs_mrmr.fit(X=X, y=y, sample_weight=None)
[51]:
MinRedundancyMaxRelevance(n_features_to_select=5,
                          redundancy_func=functools.partial(<function association_series at 0x00000281A87AE950>, n_jobs=1, normalize=True),
                          relevance_func=functools.partial(<function f_stat_regression_parallel at 0x00000281A87AEDD0>, n_jobs=1))
[52]:
fs_mrmr.feature_names_in_
[52]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT', 'random_num1', 'random_num2',
       'random_cat', 'random_cat_2', 'genuine_num'], dtype=object)
[53]:
fs_mrmr.support_
[53]:
array([False, False, False,  True, False,  True, False, False,  True,
       False, False, False,  True, False, False, False, False,  True])
[54]:
fs_mrmr.get_feature_names_out()
[54]:
array(['CHAS', 'RM', 'RAD', 'LSTAT', 'genuine_num'], dtype=object)
[55]:
fs_mrmr.ranking_
[55]:
mrmr relevance redundancy
genuine_num inf 2.461769 0.000000
LSTAT 1636.219687 1.636220 0.001000
RM 3.249000 1.106967 0.340710
CHAS 1.752553 0.731266 0.417258
RAD 2.070764 0.990593 0.478371
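
The ranking_ attribute makes the trade-off easy to inspect: MRmr rewards high relevance combined with low redundancy. A quick way to visualize both components, assuming ranking_ remains a DataFrame indexed by feature name as shown above:

[ ]:
# bar plot of the two components of the mRMR score for the selected features
ax = fs_mrmr.ranking_[["relevance", "redundancy"]].plot.barh(figsize=(8, 4))
ax.set_xlabel("normalized score")
ax.set_title("MRmr ranking: relevance vs redundancy")
plt.tight_layout()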

Classification#

[56]:
titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target

# y = y.astype('category')
y.name = "target"
[57]:
X.head()
[57]:
pclass sex embarked random_cat is_alone title age family_size fare random_num
0 1 female S Morty 1 Mrs 29.0000 0.0 211.3375 0.496714
1 1 male S Morty 0 Master 0.9167 3.0 151.5500 -0.138264
2 1 female S Fry 0 Mrs 2.0000 3.0 151.5500 0.647689
3 1 male S Cartman 0 Mr 30.0000 3.0 151.5500 1.523030
4 1 female S Vador 0 Mrs 25.0000 3.0 151.5500 -0.234153
[58]:
X.dtypes
[58]:
pclass          object
sex             object
embarked        object
random_cat      object
is_alone        object
title           object
age            float64
family_size    float64
fare           float64
random_num     float64
dtype: object
[59]:
y
[59]:
0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: target, Length: 1309, dtype: category
Categories (2, object): ['0', '1']
[60]:
fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=1,
)

# fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
fs_mrmr.fit(X=X, y=y, sample_weight=None)
[60]:
MinRedundancyMaxRelevance(n_features_to_select=5,
                          redundancy_func=functools.partial(<function association_series at 0x00000281A87AE950>, n_jobs=1, normalize=True),
                          relevance_func=functools.partial(<function f_stat_classification_parallel at 0x00000281A87AF010>, n_jobs=1),
                          task='classification')
[61]:
fs_mrmr.feature_names_in_
[61]:
array(['pclass', 'sex', 'embarked', 'random_cat', 'is_alone', 'title',
       'age', 'family_size', 'fare', 'random_num'], dtype=object)
[62]:
fs_mrmr.support_
[62]:
array([False,  True, False, False,  True,  True, False,  True,  True,
       False])
[63]:
fs_mrmr.get_feature_names_out()
[63]:
array(['sex', 'is_alone', 'title', 'family_size', 'fare'], dtype=object)
[64]:
fs_mrmr.ranking_
[64]:
mrmr relevance redundancy
sex inf 1.740256 0.000000
fare 8.129865 1.499352 0.184425
title 1.483999 0.694114 0.467732
family_size -1.320894 -0.516471 0.391001
is_alone -2.068465 -0.639219 0.309031

Pipeline#

Integration as a step of a feature selection pipeline

[65]:
from sklearn.pipeline import Pipeline
from arfs.feature_selection import (
    MissingValueThreshold,
    UniqueValuesThreshold,
    make_fs_summary,
)

titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target

# y = y.astype('category')
y.name = "target"

fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=1,
)

mrmr_fs_pipeline = Pipeline(
    [
        ("missing", MissingValueThreshold(threshold=0.05)),
        ("unique", UniqueValuesThreshold(threshold=1)),
        ("mrmr", fs_mrmr),
    ]
)

X_trans = mrmr_fs_pipeline.fit(X=X, y=y).transform(X=X)
X_trans.head()
[65]:
sex fare title family_size is_alone
0 female 211.3375 Mrs 0.0 1
1 male 151.5500 Master 3.0 0
2 female 151.5500 Mrs 3.0 0
3 male 151.5500 Mr 3.0 0
4 female 151.5500 Mrs 3.0 0
[66]:
make_fs_summary(mrmr_fs_pipeline)
[66]:
  predictor missing unique mrmr
0 pclass 1 1 0
1 sex 1 1 1
2 embarked 1 1 0
3 random_cat 1 1 0
4 is_alone 1 1 1
5 title 1 1 1
6 age 1 1 0
7 family_size 1 1 1
8 fare 1 1 1
9 random_num 1 1 0
[67]:
mrmr_fs_pipeline.named_steps["mrmr"].get_feature_names_out()
[67]:
array(['sex', 'is_alone', 'title', 'family_size', 'fare'], dtype=object)

Does discretization help?#

We can use TreeDiscretizer to discretize and auto-group the predictors (whether numeric or not). Does that improve the MRmr output?

[68]:
# main parameter controlling how aggressive the auto-grouping will be
lgb_params = {"min_split_gain": 0.05}
# instantiate the discretizer
disc = TreeDiscretizer(bin_features="all", n_bins=10, boost_params=lgb_params)

titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target

# y = y.astype('category')
y.name = "target"
y = y.astype("int")

fs_mrmr = MinRedundancyMaxRelevance(
    n_features_to_select=5,
    relevance_func=None,
    redundancy_func=None,
    task="classification",
    denominator_func=np.mean,
    only_same_domain=False,
    return_scores=False,
    show_progress=True,
    n_jobs=1,
)

mrmr_fs_pipeline = Pipeline(
    [
        ("missing", MissingValueThreshold(threshold=0.05)),
        ("unique", UniqueValuesThreshold(threshold=1)),
        ("discretizer", disc),
        ("mrmr", fs_mrmr),
    ]
)

X_trans = mrmr_fs_pipeline.fit(X=X, y=y).transform(X=X)
X_trans.head()
[68]:
sex title is_alone pclass embarked
0 female Mrs 1 1 S / missing / Q
1 male Master 0 1 S / missing / Q
2 female Mrs 0 1 S / missing / Q
3 male Mr 0 1 S / missing / Q
4 female Mrs 0 1 S / missing / Q
[69]:
make_fs_summary(mrmr_fs_pipeline)
[69]:
  predictor missing unique discretizer mrmr
0 pclass 1 1 nan 1
1 sex 1 1 nan 1
2 embarked 1 1 nan 1
3 random_cat 1 1 nan 0
4 is_alone 1 1 nan 1
5 title 1 1 nan 1
6 age 1 1 nan 0
7 family_size 1 1 nan 0
8 fare 1 1 nan 0
9 random_num 1 1 nan 0

With discretization, the selected subset changes: pclass and embarked replace fare and family_size, while the random predictors are still discarded.