ARFS vs Boruta and BorutaShap#

This notebook compares Boruta and BorutaShap with Leshy, which is a BorutaPy implementation with:

  • categorical features handling

  • plot method

  • CatBoost and LightGBM handling

  • SHAP and permutation importance

  • sample weight

The implementation is, however, quite close to the BorutaPy one. A PR has been opened on the official BorutaPy repository.

# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:95% !important; }</style>"))
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import gc
import shap
from boruta import BorutaPy as bp
from sklearn.datasets import fetch_openml
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sys import getsizeof, path

from boruta import BorutaPy

import arfs
import arfs.feature_selection as arfsfs
import arfs.feature_selection.allrelevant as arfsgroot
from arfs.utils import LightForestClassifier, LightForestRegressor
from arfs.benchmark import highlight_tick, compare_varimp, sklearn_pimp_bench
from arfs.utils import load_data

rng = np.random.RandomState(seed=42)

# import warnings
# warnings.filterwarnings('ignore')
%matplotlib inline


I’ll first remove the collinear predictors, since they are actually harmful to ARFS (see the Collinearity notebook).

cancer = load_data(name="cancer")
X, y =,

# basic feature selection
basic_fs_pipeline = Pipeline(
    [
        ("missing", arfsfs.MissingValueThreshold(threshold=0.05)),
        ("unique", arfsfs.UniqueValuesThreshold(threshold=1)),
        ("cardinality", arfsfs.CardinalityThreshold(threshold=1000)),
        ("collinearity", arfsfs.CollinearityThreshold(threshold=0.75)),
    ]
)

X_filtered = basic_fs_pipeline.fit_transform(
    X=X, y=y
)  #  , collinearity__sample_weight=w,
   mean texture  mean area  texture error  smoothness error  symmetry error  worst smoothness  random_num1  random_num2  genuine_num
0         10.38     1001.0         0.9053          0.006399         0.03003            0.1622     0.496714            0    -0.249340
1         17.77     1326.0         0.7339          0.005225         0.01389            0.1238    -0.138264            1    -0.044410
2         21.25     1203.0         0.7869          0.006150         0.02250            0.1444     0.647689            3     0.128395
3         20.38      386.1         1.1560          0.009110         0.05963            0.2098     1.523030            0    -0.079921
4         14.34     1297.0         0.7813          0.011490         0.01756            0.1374    -0.234153            0    -0.094302
f = basic_fs_pipeline.named_steps["collinearity"].plot_association()
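
To make the collinearity step concrete, here is a minimal sketch of threshold-based filtering using plain pandas. It is not the ARFS implementation: it assumes absolute Pearson correlation on numeric columns only, whereas CollinearityThreshold may rely on other association measures. The helper name drop_collinear and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_collinear(df: pd.DataFrame, threshold: float = 0.75) -> pd.DataFrame:
    """Greedily drop columns whose |Pearson r| with an already kept column exceeds threshold."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        # keep the column only if it is not too correlated with any kept column
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return df[kept]


rng = np.random.RandomState(42)
a = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "a": a,
        "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly identical to "a"
        "b": rng.normal(size=200),  # independent of "a"
    }
)
filtered = drop_collinear(toy, threshold=0.75)
print(list(filtered.columns))
```

The greedy pass keeps the first column of each correlated group, so the result depends on column order; a real selector may choose which member to keep differently.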


Boruta, in its “official” implementation, uses gain/Gini feature importance (which is known to be biased). Let’s see what results it gives on this data set.
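
The bias mentioned above can be illustrated with scikit-learn alone. The sketch below is a toy under stated assumptions, not part of the ARFS benchmark: it builds a synthetic problem with one binary informative column and one continuous pure-noise column, then compares impurity-based importance with permutation importance on held-out data. All variable names and the data-generating process are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n = 1000
signal = rng.randint(0, 2, size=n)  # binary, informative
noise = rng.normal(size=n)  # continuous, unrelated to the target
# the label equals the signal, flipped about 20% of the time
y = np.where(rng.uniform(size=n) < 0.8, signal, 1 - signal)
X = np.column_stack([signal, noise])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# impurity-based (gain/Gini) importance, computed on the training data
gini_importance = forest.feature_importances_
# permutation importance, computed on held-out data
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

names = ["signal", "noise"]
print("impurity-based:", dict(zip(names, gini_importance.round(3))))
print("permutation   :", dict(zip(names, perm.importances_mean.round(3))))
```

On data like this, the noise column typically receives a clearly non-zero impurity-based importance, because deep trees can keep splitting on a continuous variable, while its permutation importance on held-out data tends to stay close to zero.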

Unfortunately, Boruta does not work with newer versions of NumPy and Python. However, if you have an older version of NumPy, you can run this comparison; it should return the same results.
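
If BorutaPy cannot be installed, the core idea behind Boruta can still be sketched with scikit-learn alone. The snippet below is a deliberately simplified, single-round illustration: it has no repeated trials and no statistical test, both of which the real algorithm relies on. The function name shadow_hits and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def shadow_hits(X, y, random_state=0):
    """One Boruta-like round: flag features whose importance beats the best shadow."""
    rng = np.random.RandomState(random_state)
    # shadow features: each column shuffled independently, breaking its link with y
    X_shadow = np.apply_along_axis(rng.permutation, 0, X)
    X_aug = np.hstack([X, X_shadow])
    forest = RandomForestClassifier(
        n_estimators=200, max_depth=5, random_state=random_state
    ).fit(X_aug, y)
    imp = forest.feature_importances_
    n_feat = X.shape[1]
    real_imp, shadow_imp = imp[:n_feat], imp[n_feat:]
    return real_imp > shadow_imp.max()


# with shuffle=False the first three columns are the informative ones
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
hits = shadow_hits(X, y)
print(hits)  # boolean mask, one entry per original column
```

A full Boruta run repeats this round many times and keeps a hit count per feature, confirming or rejecting features only once the count is statistically distinguishable from chance.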


# define a random forest classifier using all cores, with class weights
# inversely proportional to the class frequencies
rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)

# define Boruta feature selection method
bp_feat_selector = BorutaPy(rf, n_estimators="auto", verbose=1, random_state=1)

# find all relevant features - 5 features should be selected, y.values)

# check selected features - first 5 features are selected
Let’s compare with the official Python implementation, using the same settings and the Gini/gain feature importance. We should get the same results (by the way, you can check the unit tests, where BorutaPy is used as the baseline).

# Leshy, all the predictors, no-preprocessing
model = clone(rf)

leshy_feat_selector = arfsgroot.Leshy(
), y, sample_weight=None)
print(f"The selected features: {leshy_feat_selector.get_feature_names_out()}")
print(f"The agnostic ranking: {leshy_feat_selector.ranking_}")
print(f"The naive ranking: {leshy_feat_selector.ranking_absolutes_}")
fig = leshy_feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random variable
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
fasttreeshap is not installed. Fallback to shap.

Leshy finished running using shap var. imp.

Iteration:      1 / 100
Confirmed:      7
Tentative:      0
Rejected:       2
All relevant predictors selected in 00:00:04.00
The selected features: ['mean texture' 'mean area' 'texture error' 'smoothness error'
 'symmetry error' 'worst smoothness' 'genuine_num']
The agnostic ranking: [1 1 1 1 1 1 2 3 1]
The naive ranking: ['mean area', 'worst smoothness', 'mean texture', 'genuine_num', 'smoothness error', 'texture error', 'symmetry error', 'random_num1', 'random_num2']
CPU times: user 5.25 s, sys: 537 ms, total: 5.79 s
Wall time: 4.5 s
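
The agnostic ranking is easier to read once it is paired with the column names. The sketch below hard-codes the values from the output above and assumes that ranking_ follows the column order of X_filtered, with rank 1 marking the confirmed features. Under that assumption, the rank-1 columns coincide with the printed selection.

```python
import numpy as np

# column order as shown in the X_filtered preview (assumed to match ranking_)
features = np.array([
    "mean texture", "mean area", "texture error", "smoothness error",
    "symmetry error", "worst smoothness", "random_num1", "random_num2",
    "genuine_num",
])
# values copied from the "agnostic ranking" printed above
ranking = np.array([1, 1, 1, 1, 1, 1, 2, 3, 1])

selected = features[ranking == 1]
rejected = features[ranking > 1]
print(selected.tolist())
print(rejected.tolist())
```

Here the two synthetic random columns are the only ones with a rank above 1, which matches the "Rejected: 2" line in the log.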

Same Results?#

def check_list_equal(L1, L2):
    return len(L1) == len(L2) and sorted(L1) == sorted(L2)
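
As a quick illustration of the helper above, the following sketch repeats its definition and applies it to two hypothetical selections. The feature lists are made up for the example and are not results of any selector.

```python
def check_list_equal(L1, L2):
    # equal length and equal content, ignoring order
    return len(L1) == len(L2) and sorted(L1) == sorted(L2)


same_up_to_order = check_list_equal(
    ["mean area", "genuine_num"], ["genuine_num", "mean area"]
)
different = check_list_equal(
    ["mean area", "genuine_num"], ["mean area", "random_num1"]
)
print(same_up_to_order, different)  # True False
```

Sorting both lists makes the comparison independent of the order in which each selector reports its features, while still counting duplicates.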
BorutaShap with native importance#

BorutaShap is an alternative implementation of Boruta (a heavy rewrite with new material) that adds SHAP feature importance. Let’s see what results it gives on this data set.


from BorutaShap import BorutaShap
from arfs.preprocessing import OrdinalEncoderPandas

# define a random forest classifier using all cores, with class weights
# inversely proportional to the class frequencies
model = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)

# define BorutaShap feature selection method (it does not convert categorical features automatically)
X_encoded = OrdinalEncoderPandas().fit_transform(X=X_filtered)
bs_feat_selector = BorutaShap(
    model=model, importance_measure="gini", classification=True

# find all relevant features - 5 features should be selected, y=y, n_trials=100, random_state=0)

# Returns Boxplot of features
bs_feat_selector.plot(X_size=12, figsize=(8, 6), y_scale="log", which_features="all")
The hasattr check is called before fitting, so feature_importances_ is not found. Let’s use the default model, which is also a random forest.
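
The behaviour described above can be reproduced with scikit-learn alone. This is a minimal sketch on toy data, independent of BorutaShap's actual code: a random forest only exposes feature_importances_ once it has been fitted, so a hasattr check on an unfitted estimator returns False.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0)

# the attribute is a property that raises on an unfitted estimator
before_fit = hasattr(forest, "feature_importances_")
forest.fit(X_toy, y_toy)
after_fit = hasattr(forest, "feature_importances_")
print(before_fit, after_fit)  # False True
```

Any check of this kind therefore has to run after fitting, or test for the capability in another way.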


from BorutaShap import BorutaShap

bs_feat_selector = BorutaShap(importance_measure="gini", classification=True)

# find all relevant features - 5 features should be selected, y=y, n_trials=100, random_state=0)

# Returns Boxplot of features
bs_feat_selector.plot(X_size=12, figsize=(12, 8), y_scale="log", which_features="all")
Comparison with Boruta#

    list(X_filtered.columns[bp_feat_selector.support_]), list(bs_feat_selector.accepted)