ARFS - fasttreeshap vs shap

Leshy, BoostAGroota, and GrootCV are tree-based algorithms. They can use fasttreeshap, a faster implementation of the Shapley values by LinkedIn, which is claimed to outperform both the TreeExplainer in the SHAP package and the native C++ implementations in lightgbm/xgboost/catboost. The speed-up varies with the size of the task and your hardware resources (including virtualization for VMs). On older machines, the fasttreeshap implementation might actually be slower.

However, it currently does not work with xgboost (not a deal breaker because lightgbm is the preferred default).
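Because the gain depends on the environment, it can be convenient to set the `fastshap` flag programmatically rather than hard-coding it. The snippet below is a minimal sketch, not part of ARFS: `module_available` and `groot_kwargs` are hypothetical names introduced for illustration, and the check only tests whether the `fasttreeshap` package can be imported, not whether it is faster on your machine.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be imported (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


# Enable the fast path only when the fasttreeshap package is installed.
use_fastshap = module_available("fasttreeshap")

# Keyword arguments mirroring the GrootCV calls used later in this notebook.
groot_kwargs = {
    "objective": "rmse",
    "cutoff": 1,
    "n_folds": 3,
    "n_iter": 3,
    "silent": True,
    "fastshap": use_fastshap,
}
print(f"fastshap enabled: {use_fastshap}")
```

The dictionary can then be unpacked as `GrootCV(**groot_kwargs)`. Timing both settings on your own data, as done below, remains the only reliable way to decide.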

[1]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

import arfs
from arfs.feature_selection import GrootCV, Leshy
from arfs.utils import load_data
from arfs.benchmark import highlight_tick

rng = np.random.RandomState(seed=42)
[2]:
# Generate synthetic data with Poisson-distributed target variable
bias = 1

n_samples = 100_000
n_features = 100
n_informative = 20

X, y, true_coef = make_regression(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=n_informative,
    noise=1,
    random_state=8,
    bias=bias,
    coef=True,
)
y = (y - y.mean()) / y.std()
y = np.exp(y)  # Transform to positive values for Poisson distribution
y = np.random.poisson(y)  # Add Poisson noise to the target variable
# dummy sample weight (e.g. exposure), smallest being 30 days
w = np.random.uniform(30 / 365, 1, size=len(y))
# make the count a Poisson rate (frequency)
y = y / w

X = pd.DataFrame(X)
X.columns = [f"pred_{i}" for i in range(X.shape[1])]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, test_size=0.5, random_state=42
)

true_coef = pd.Series(true_coef)
true_coef.index = X.columns
true_coef = pd.Series({**{"intercept": bias}, **true_coef})
true_coef

genuine_predictors = true_coef[true_coef > 0.0]

print(f"The true coefficients of the linear data generating process are:\n {true_coef}")
The true coefficients of the linear data generating process are:
 intercept     1.000000
pred_0        0.000000
pred_1        0.000000
pred_2        0.000000
pred_3        0.000000
               ...
pred_95       0.000000
pred_96      10.576299
pred_97       0.000000
pred_98       0.000000
pred_99      62.472033
Length: 101, dtype: float64

GrootCV - fastshap vs shap

Fastshap enabled

[3]:
%%time
feat_selector = GrootCV(
    objective="rmse",
    cutoff=1,
    n_folds=3,
    n_iter=3,
    silent=True,
    fastshap=True,
    n_jobs=0,
    lgbm_params={"device": "cpu"},
)
feat_selector.fit(X_train, y_train, sample_weight=None)
CPU times: user 10min 34s, sys: 4.55 s, total: 10min 39s
Wall time: 3min 11s
[3]:
GrootCV(fastshap=True,
        lgbm_params={'device': 'cpu', 'num_threads': 0, 'objective': 'rmse',
                     'verbosity': -1},
        n_folds=3, n_iter=3, objective='rmse')
[4]:
print(f"The selected features: {feat_selector.get_feature_names_out()}")
print(f"The agnostic ranking: {feat_selector.ranking_}")
print(f"The naive ranking: {feat_selector.ranking_absolutes_}")


# fig = feat_selector.plot_importance(n_feat_per_inch=5)
# # highlight synthetic random variable
# for name in true_coef.index:
#     if name in genuine_predictors.index:
#         fig = highlight_tick(figure=fig, str_match=name, color="green")
#     else:
#         fig = highlight_tick(figure=fig, str_match=name)

# plt.show()
The selected features: ['pred_7' 'pred_9' 'pred_15' 'pred_23' 'pred_27' 'pred_31' 'pred_35'
 'pred_39' 'pred_41' 'pred_46' 'pred_48' 'pred_49' 'pred_52' 'pred_66'
 'pred_71' 'pred_79' 'pred_85' 'pred_96' 'pred_99']
The agnostic ranking: [1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1
 1 1 2 1 2 1 1 1 1 2 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1
 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2]
The naive ranking: ['pred_7', 'pred_9', 'pred_31', 'pred_49', 'pred_41', 'pred_52', 'pred_71', 'pred_66', 'pred_27', 'pred_99', 'pred_23', 'pred_79', 'pred_39', 'pred_35', 'pred_85', 'pred_48', 'pred_46', 'pred_96', 'pred_15', 'pred_89', 'pred_21', 'pred_38', 'pred_32', 'pred_16', 'pred_69', 'pred_47', 'pred_50', 'pred_28', 'pred_60', 'pred_44', 'pred_67', 'pred_61', 'pred_34', 'pred_84', 'pred_17', 'pred_37', 'pred_29', 'pred_70', 'pred_5', 'pred_62', 'pred_19', 'pred_78', 'pred_59', 'pred_82', 'pred_64', 'pred_24', 'pred_92', 'pred_22', 'pred_80', 'pred_97', 'pred_95', 'pred_68', 'pred_58', 'pred_81', 'pred_91', 'pred_77', 'pred_53', 'pred_36', 'pred_10', 'pred_74', 'pred_45', 'pred_93', 'pred_30', 'pred_4', 'pred_65', 'pred_63', 'pred_76', 'pred_54', 'pred_43', 'pred_8', 'pred_56', 'pred_72', 'pred_0', 'pred_20', 'pred_11', 'pred_75', 'pred_83', 'pred_73', 'pred_18', 'pred_57', 'pred_14', 'pred_55', 'pred_12', 'pred_98', 'pred_88', 'pred_87', 'pred_26', 'pred_90', 'pred_42', 'pred_1', 'pred_33', 'pred_25', 'pred_94', 'pred_51', 'pred_2', 'pred_6', 'pred_40', 'pred_3', 'pred_13', 'pred_86']
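The agnostic ranking is easier to read once mapped back to column names. The sketch below uses the array printed above and assumes, as read off this output rather than taken from the ARFS documentation, that a value of 2 marks a selected feature and 1 a rejected one. The `pred_{i}` naming follows the data generation cell.

```python
# Agnostic ranking copied verbatim from the output above (100 entries).
ranking_text = """
1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1
1 1 2 1 2 1 1 1 1 2 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1
1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2
"""
ranking = [int(token) for token in ranking_text.split()]

# Assumption: 2 flags a selected column, 1 a rejected one.
selected = [f"pred_{i}" for i, rank in enumerate(ranking) if rank == 2]

print(len(ranking), len(selected))
print(selected)
```

The resulting names match the 19 entries returned by `get_feature_names_out()` above.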

Fastshap disabled

[5]:
%%time
feat_selector = GrootCV(
    objective="rmse",
    cutoff=1,
    n_folds=3,
    n_iter=3,
    silent=True,
    fastshap=False,
    n_jobs=0,
    lgbm_params={"device": "cpu"},
)
feat_selector.fit(X_train, y_train, sample_weight=None)
CPU times: user 18min 15s, sys: 3.74 s, total: 18min 19s
Wall time: 5min 23s
[5]:
GrootCV(lgbm_params={'device': 'cpu', 'num_threads': 0, 'objective': 'rmse',
                     'verbosity': -1},
        n_folds=3, n_iter=3, objective='rmse')
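The two `%%time` outputs can be turned into a speed-up factor. The snippet below is a small sketch that parses the duration strings reported above. `to_seconds` is a hypothetical helper, and the figures apply only to this run on this machine.

```python
import re


def to_seconds(duration: str) -> float:
    """Convert a %%time duration such as '3min 11s' to seconds (hypothetical helper)."""
    units = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 1e-3}
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)\s*(h|min|ms|s)\b", duration):
        total += float(value) * units[unit]
    return total


# Durations copied from the two GrootCV runs in this notebook.
wall_fast, wall_slow = to_seconds("3min 11s"), to_seconds("5min 23s")
cpu_fast, cpu_slow = to_seconds("10min 39s"), to_seconds("18min 19s")

wall_speedup = wall_slow / wall_fast
cpu_speedup = cpu_slow / cpu_fast
print(f"wall-time speed-up: {wall_speedup:.2f}x")  # 1.69x
print(f"CPU-time speed-up:  {cpu_speedup:.2f}x")   # 1.72x
```

On this task, enabling fastshap made the fit roughly 1.7 times faster on both wall time and total CPU time.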
[6]:
print(f"The selected features: {feat_selector.get_feature_names_out()}")
print(f"The agnostic ranking: {feat_selector.ranking_}")
print(f"The naive ranking: {feat_selector.ranking_absolutes_}")
The selected features: ['pred_7' 'pred_9' 'pred_15' 'pred_23' 'pred_27' 'pred_31' 'pred_35'
 'pred_39' 'pred_41' 'pred_46' 'pred_48' 'pred_49' 'pred_52' 'pred_66'
 'pred_71' 'pred_79' 'pred_85' 'pred_96' 'pred_99']
The agnostic ranking: [1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1
 1 1 2 1 2 1 1 1 1 2 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1
 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2]
The naive ranking: ['pred_7', 'pred_9', 'pred_31', 'pred_49', 'pred_41', 'pred_52', 'pred_71', 'pred_66', 'pred_27', 'pred_99', 'pred_23', 'pred_79', 'pred_39', 'pred_35', 'pred_85', 'pred_48', 'pred_46', 'pred_96', 'pred_15', 'pred_38', 'pred_32', 'pred_21', 'pred_89', 'pred_50', 'pred_5', 'pred_17', 'pred_29', 'pred_28', 'pred_69', 'pred_61', 'pred_84', 'pred_58', 'pred_67', 'pred_59', 'pred_68', 'pred_34', 'pred_97', 'pred_47', 'pred_60', 'pred_91', 'pred_75', 'pred_22', 'pred_10', 'pred_82', 'pred_16', 'pred_78', 'pred_42', 'pred_95', 'pred_80', 'pred_37', 'pred_2', 'pred_62', 'pred_76', 'pred_92', 'pred_20', 'pred_77', 'pred_19', 'pred_24', 'pred_63', 'pred_93', 'pred_44', 'pred_11', 'pred_53', 'pred_65', 'pred_33', 'pred_45', 'pred_14', 'pred_98', 'pred_57', 'pred_64', 'pred_30', 'pred_81', 'pred_83', 'pred_87', 'pred_25', 'pred_51', 'pred_70', 'pred_8', 'pred_36', 'pred_55', 'pred_0', 'pred_88', 'pred_43', 'pred_12', 'pred_4', 'pred_74', 'pred_72', 'pred_54', 'pred_1', 'pred_13', 'pred_73', 'pred_40', 'pred_56', 'pred_3', 'pred_26', 'pred_18', 'pred_94', 'pred_6', 'pred_86', 'pred_90']
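Both runs select the same 19 features, while the naive rankings diverge further down the list. The sketch below measures how long the two rankings agree, using only the first 23 entries of each list as printed above. `common_prefix_length` is a hypothetical helper introduced for illustration.

```python
from itertools import takewhile

# First 23 entries of the naive ranking with fastshap=True (from the output above).
fast_ranking = [
    "pred_7", "pred_9", "pred_31", "pred_49", "pred_41", "pred_52", "pred_71",
    "pred_66", "pred_27", "pred_99", "pred_23", "pred_79", "pred_39", "pred_35",
    "pred_85", "pred_48", "pred_46", "pred_96", "pred_15", "pred_89", "pred_21",
    "pred_38", "pred_32",
]
# First 23 entries of the naive ranking with fastshap=False.
slow_ranking = [
    "pred_7", "pred_9", "pred_31", "pred_49", "pred_41", "pred_52", "pred_71",
    "pred_66", "pred_27", "pred_99", "pred_23", "pred_79", "pred_39", "pred_35",
    "pred_85", "pred_48", "pred_46", "pred_96", "pred_15", "pred_38", "pred_32",
    "pred_21", "pred_89",
]


def common_prefix_length(a, b):
    """Count the leading positions where both sequences hold the same item."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


n_agree = common_prefix_length(fast_ranking, slow_ranking)
print(n_agree)  # 19
```

The rankings are identical for the first 19 positions, which are exactly the selected features. The order differs only among the rejected features that follow.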