ARFS - Customizing GrootCV#

GrootCV allows users to customize certain parameters when using the lightGBM algorithm in Python. However, it’s essential to be aware that some of these parameters will be automatically adjusted by the algorithm itself. Specifically, the overridden parameters are objective, verbose, is_balanced, and n_iterations.

[1]:

# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:95% !important; }</style>"))
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

import arfs
from arfs.feature_selection import GrootCV
from arfs.utils import load_data
from arfs.benchmark import highlight_tick

rng = np.random.RandomState(seed=42)

# import warnings
# warnings.filterwarnings('ignore')

Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)

[2]:

import os
import multiprocessing


def get_physical_cores():
    if os.name == "posix":  # For Unix-based systems (e.g., Linux, macOS)
        try:
            return os.sysconf("SC_NPROCESSORS_ONLN")
        except ValueError:
            pass
    elif os.name == "nt":  # For Windows
        try:
            return int(os.environ["NUMBER_OF_PROCESSORS"])
        except (ValueError, KeyError):
            pass
    return multiprocessing.cpu_count()


num_physical_cores = get_physical_cores()
print(f"Number of physical cores: {num_physical_cores}")

Number of physical cores: 4

[3]:

print(f"Run with ARFS {arfs.__version__}")

Run with ARFS 2.0.5

[4]:

boston = load_data(name="Boston")
X, y = boston.data, boston.target

[5]:

X.dtypes

[5]:

CRIM             float64
ZN               float64
INDUS            float64
CHAS            category
NOX              float64
RM               float64
AGE              float64
DIS              float64
RAD             category
TAX              float64
PTRATIO          float64
B                float64
LSTAT            float64
random_num1      float64
random_num2        int32
random_cat      category
random_cat_2    category
genuine_num      float64
dtype: object

[6]:

X.head()

[6]:

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	random_num1	random_num2	random_cat	random_cat_2	genuine_num
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	0.496714	0	cat_3517	Platist	7.080332
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	-0.138264	0	cat_2397	MarkZ	5.245384
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	0.647689	0	cat_3735	Dracula	6.375795
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	1.523030	0	cat_2870	Bejita	6.725118
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	-0.234153	4	cat_1160	Variance	7.867781

Number of processes#

For optimal performance on CPUs, set the n_jobs parameter to match the number of physical cores available on your machine. Keep in mind that using n_jobs=0 will use the default number of threads in OpenMP, and running multiple processes might not always yield faster results due to the overhead involved in initiating those processes.

[7]:

for n_jobs in range(num_physical_cores):
    start_time = time.time()
    feat_selector = GrootCV(
        objective="rmse",
        cutoff=1,
        n_folds=5,
        n_iter=5,
        silent=True,
        fastshap=False,
        n_jobs=n_jobs,
    )
    feat_selector.fit(X, y, sample_weight=None)
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"n_jobs = {n_jobs}, Execution time: {execution_time:.3f} seconds")

invalid value encountered in cast
invalid value encountered in cast
invalid value encountered in cast
invalid value encountered in cast

n_jobs = 0, Execution time: 9.602 seconds

n_jobs = 1, Execution time: 8.959 seconds

n_jobs = 2, Execution time: 7.147 seconds

n_jobs = 3, Execution time: 9.987 seconds

Regularization#

By default, GrootCV is using crossvalidation and early-stopping to prevent overfitting and to ensure the robustness of the feature selection. However, if you want to prevent overfitting even more, you can use the dedicated lightgbm parameters such as min_data_in_leaf.

Let’s illustrate: - less regularization than default - default regularization ({"min_data_in_leaf": 20}) - larger regularization

[8]:

# GrootCV with less regularization
feat_selector = GrootCV(
    objective="rmse",
    cutoff=1,
    n_folds=5,
    n_iter=5,
    silent=True,
    fastshap=True,
    n_jobs=0,
    lgbm_params={"min_data_in_leaf": 10},
)
feat_selector.fit(X, y, sample_weight=None)
print(f"The selected features: {feat_selector.get_feature_names_out()}")
print(f"The agnostic ranking: {feat_selector.ranking_}")
print(f"The naive ranking: {feat_selector.ranking_absolutes_}")
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random variable
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()

The selected features: ['CRIM' 'NOX' 'RM' 'AGE' 'DIS' 'TAX' 'PTRATIO' 'B' 'LSTAT' 'genuine_num']
The agnostic ranking: [2 1 1 1 2 2 2 2 1 2 2 2 2 1 1 1 1 2]
The naive ranking: ['LSTAT', 'RM', 'genuine_num', 'PTRATIO', 'DIS', 'NOX', 'CRIM', 'AGE', 'TAX', 'B', 'random_num1', 'INDUS', 'random_cat', 'random_cat_2', 'RAD', 'random_num2', 'ZN', 'CHAS']

../_images/notebooks_arfs_grootcv_custom_params_10_2.png

[9]:

# GrootCV with default regularization
feat_selector = GrootCV(
    objective="rmse",
    cutoff=1,
    n_folds=5,
    n_iter=5,
    silent=True,
    fastshap=True,
    n_jobs=0,
    lgbm_params=None,
)
feat_selector.fit(X, y, sample_weight=None)
print(f"The selected features: {feat_selector.get_feature_names_out()}")
print(f"The agnostic ranking: {feat_selector.ranking_}")
print(f"The naive ranking: {feat_selector.ranking_absolutes_}")
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random variable
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()

The selected features: ['CRIM' 'NOX' 'RM' 'AGE' 'DIS' 'TAX' 'PTRATIO' 'LSTAT' 'genuine_num']
The agnostic ranking: [2 1 1 1 2 2 2 2 1 2 2 1 2 1 1 1 1 2]
The naive ranking: ['LSTAT', 'RM', 'genuine_num', 'PTRATIO', 'DIS', 'NOX', 'CRIM', 'AGE', 'TAX', 'B', 'random_num1', 'INDUS', 'random_cat', 'random_cat_2', 'RAD', 'ZN', 'random_num2', 'CHAS']

../_images/notebooks_arfs_grootcv_custom_params_11_2.png

[10]:

# GrootCV with larger regularization
feat_selector = GrootCV(
    objective="rmse",
    cutoff=1,
    n_folds=5,
    n_iter=5,
    silent=True,
    fastshap=True,
    n_jobs=0,
    lgbm_params={"min_data_in_leaf": 100},
)
feat_selector.fit(X, y, sample_weight=None)
print(f"The selected features: {feat_selector.get_feature_names_out()}")
print(f"The agnostic ranking: {feat_selector.ranking_}")
print(f"The naive ranking: {feat_selector.ranking_absolutes_}")
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random variable
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()

The selected features: ['CRIM' 'RM' 'DIS' 'TAX' 'PTRATIO' 'LSTAT' 'genuine_num']
The agnostic ranking: [2 1 1 1 1 2 1 2 1 2 2 1 2 1 1 1 1 2]
The naive ranking: ['LSTAT', 'genuine_num', 'RM', 'PTRATIO', 'TAX', 'CRIM', 'DIS', 'B', 'random_num1', 'AGE', 'NOX', 'random_cat', 'RAD', 'random_num2', 'random_cat_2', 'INDUS', 'ZN', 'CHAS']

../_images/notebooks_arfs_grootcv_custom_params_12_2.png