ARFS - How to use with large data?#

It might be than the data set is too large for All Relevant Feature Selection to run in a reasonable time. Usually random sampling (stratified, grouped, etc) solves this issues. If extreme sampling is needed, ARFS provide two methods for decreasing drastically the number of rows. The sampling methods and outcomes are illustrated below.

[18]:
# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:95% !important; }</style>"))
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gc

from sklearn.pipeline import Pipeline

import arfs.feature_selection as arfsfs
from arfs.preprocessing import OrdinalEncoderPandas
from arfs.benchmark import highlight_tick, compare_varimp, sklearn_pimp_bench
from arfs.utils import load_data
from arfs.sampling import sample

# plt.style.use('fivethirtyeight')
rng = np.random.RandomState(seed=42)

import warnings

warnings.filterwarnings("ignore")
[2]:
import arfs

print(f"Run with ARFS {arfs.__version__}")
Run with ARFS 3.0.0
[3]:
%matplotlib inline
[4]:
gc.enable()
gc.collect()
[4]:
4

Sampling data#

A fairly large data set for illustration

[5]:
# X, y, w = _generated_corr_dataset_regr(size=100_000)

housing = load_data(name="housing")
X, y = housing.data, housing.target

X = pd.DataFrame(X)
X.columns = housing.feature_names
y = pd.Series(y)
X.head()
[5]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25

Cleaning the data#

[6]:
basic_fs_pipeline = Pipeline(
    [
        ("missing", arfsfs.MissingValueThreshold(threshold=0.05)),
        ("unique", arfsfs.UniqueValuesThreshold(threshold=1)),
        ("cardinality", arfsfs.CardinalityThreshold(threshold=10)),
        ("collinearity", arfsfs.CollinearityThreshold(threshold=0.75)),
    ]
)

X_trans = basic_fs_pipeline.fit_transform(X=X, y=y)
#   collinearity__sample_weight=w,
#   lowimp__sample_weight=w)
[7]:
f = basic_fs_pipeline.named_steps["collinearity"].plot_association(figsize=(4, 4))
../_images/notebooks_arfs_large_data_sampling_9_0.png
[8]:
data = X_trans.copy()
data["target"] = y.values
[9]:
print(f"The dataset shape is {data.shape}")
The dataset shape is (20640, 8)

Random sampling#

You can sample using pandas or scikit-learn (providing more advanced random sampling, as stratified sampling, depending on the task).

[10]:
data_rnd_samp = data.sample(n=1_000)
print(f"The sampled dataset shape is {data_rnd_samp.shape}")
The sampled dataset shape is (1000, 8)

Sampling by clustering the rows#

Using the Gower distance (handling mixed-type data), the rows are clustered and the average/mode for each column is returned. The larger the number of clusters, the closer to the original data. As it requires the computation of a distance matrix, you might need to use random sampling first to avoid a computational bottleneck.

[ ]:
%%time
data_dum = data.sample(n=10_000)
data_dum = data.copy()
# Compute a distance matrix, 10_000x10_000 and use it for clustering (1_000 clusters)
data_g_samp = sample(df=data_dum, n=1_000, sample_weight=None, method="gower")
print(f"The sampled dataset shape is {data_g_samp.shape}")

Sampling by removing outliers#

This method is adapted from BorutaShap. It uses IsolationForest to remove the less similar samples and iterates till the 2-sample KS statistics is > \(95\%\) (for having a similar distribution than the original data). There is no guarantee of the output size.

[11]:
%%time
data_isof_samp = sample(df=data, sample_weight=None, method="isoforest")
print(f"The sampled dataset shape is {data_isof_samp.shape}")
The sampled dataset shape is (2064, 8)
CPU times: user 604 ms, sys: 7.84 ms, total: 612 ms
Wall time: 711 ms

Impact on the feature selection#

Let’s perform the feature selection on this toy data set to see what is the impact of the different sampling strategies.

[16]:
%%time
# No Sampling
X = data.drop("target", axis=1)
y = data.target

# GrootCV
feat_selector = arfsfs.GrootCV(
    objective="rmse", cutoff=1, n_folds=5, n_iter=5, silent=True
)
feat_selector.fit(X, y, sample_weight=None)
print(feat_selector.get_feature_names_out())
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random variable
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[151]   training's l2: 0.204331     valid_1's l2: 0.326671
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[95]    training's l2: 0.240687     valid_1's l2: 0.312489
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[219]   training's l2: 0.175836     valid_1's l2: 0.346323
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[117]   training's l2: 0.228562     valid_1's l2: 0.303057
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[77]    training's l2: 0.251179     valid_1's l2: 0.337232
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[125]   training's l2: 0.221841     valid_1's l2: 0.316486
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[122]   training's l2: 0.223227     valid_1's l2: 0.334787
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[162]   training's l2: 0.199937     valid_1's l2: 0.320165
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[145]   training's l2: 0.207626     valid_1's l2: 0.325318
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[156]   training's l2: 0.203951     valid_1's l2: 0.335031
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[137]   training's l2: 0.213178     valid_1's l2: 0.34583
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[129]   training's l2: 0.218581     valid_1's l2: 0.315213
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[122]   training's l2: 0.218357     valid_1's l2: 0.336374
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[111]   training's l2: 0.232967     valid_1's l2: 0.313847
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[145]   training's l2: 0.20741      valid_1's l2: 0.322324
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[100]   training's l2: 0.23626      valid_1's l2: 0.309587
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[167]   training's l2: 0.196718     valid_1's l2: 0.336912
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[185]   training's l2: 0.187894     valid_1's l2: 0.333857
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[170]   training's l2: 0.19827      valid_1's l2: 0.319529
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[168]   training's l2: 0.193972     valid_1's l2: 0.329542
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[141]   training's l2: 0.209756     valid_1's l2: 0.340815
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[116]   training's l2: 0.225559     valid_1's l2: 0.322635
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[76]    training's l2: 0.253261     valid_1's l2: 0.328681
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[151]   training's l2: 0.208201     valid_1's l2: 0.320702
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[100]   training's l2: 0.235078     valid_1's l2: 0.31675
['MedInc' 'HouseAge' 'AveRooms' 'AveBedrms' 'Population' 'AveOccup'
 'Longitude']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:14

NameError: name 'highlight_tick' is not defined
../_images/notebooks_arfs_large_data_sampling_19_3.png
[17]:
%%time

X = data_rnd_samp.drop("target", axis=1)
y = data_rnd_samp.target

# GrootCV
feat_selector = arfsfs.GrootCV(
    objective="rmse", cutoff=1, n_folds=5, n_iter=5, silent=True
)
feat_selector.fit(X, y, sample_weight=None)
print(feat_selector.get_feature_names_out())
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random variable
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[30]    training's l2: 0.166799     valid_1's l2: 0.375475
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[45]    training's l2: 0.0961632    valid_1's l2: 0.470772
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[52]    training's l2: 0.0891238    valid_1's l2: 0.486354
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[32]    training's l2: 0.167216     valid_1's l2: 0.376773
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[46]    training's l2: 0.0968427    valid_1's l2: 0.431018
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[41]    training's l2: 0.119891     valid_1's l2: 0.359728
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[60]    training's l2: 0.0684558    valid_1's l2: 0.449729
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[43]    training's l2: 0.100557     valid_1's l2: 0.374715
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[45]    training's l2: 0.0956372    valid_1's l2: 0.469351
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[32]    training's l2: 0.134747     valid_1's l2: 0.547114
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[33]    training's l2: 0.15071      valid_1's l2: 0.397982
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[29]    training's l2: 0.176599     valid_1's l2: 0.421348
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[25]    training's l2: 0.193759     valid_1's l2: 0.444774
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[33]    training's l2: 0.147421     valid_1's l2: 0.500831
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[48]    training's l2: 0.0932204    valid_1's l2: 0.404222
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[67]    training's l2: 0.0487236    valid_1's l2: 0.506928
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[37]    training's l2: 0.128301     valid_1's l2: 0.442707
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[45]    training's l2: 0.0989439    valid_1's l2: 0.42606
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[63]    training's l2: 0.0657157    valid_1's l2: 0.432651
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[57]    training's l2: 0.068082     valid_1's l2: 0.408432
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[57]    training's l2: 0.068025     valid_1's l2: 0.499865
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[37]    training's l2: 0.126191     valid_1's l2: 0.419695
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[29]    training's l2: 0.16468      valid_1's l2: 0.419023
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[34]    training's l2: 0.144042     valid_1's l2: 0.430156
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[45]    training's l2: 0.102332     valid_1's l2: 0.463595
['MedInc' 'HouseAge' 'AveRooms' 'AveBedrms' 'Population' 'AveOccup'
 'Longitude']
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:13

NameError: name 'highlight_tick' is not defined
../_images/notebooks_arfs_large_data_sampling_20_3.png
[19]:
%%time

X = data_g_samp.drop("target", axis=1)
y = data_g_samp.target

# GrootCV
feat_selector = arfsfs.GrootCV(
    objective="rmse", cutoff=1, n_folds=5, n_iter=5, silent=True
)
feat_selector.fit(X, y, sample_weight=None)
print(feat_selector.get_feature_names_out())
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random variable
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <timed exec>:1

NameError: name 'data_g_samp' is not defined
[21]:
%%time

X = data_isof_samp.drop("target", axis=1)
y = data_isof_samp.target

# GrootCV
feat_selector = arfsfs.GrootCV(
    objective="rmse", cutoff=1, n_folds=5, n_iter=5, silent=True
)
feat_selector.fit(X, y, sample_weight=None)
print(feat_selector.get_feature_names_out())
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random variable
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[41]    training's l2: 0.143895     valid_1's l2: 0.419566
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[49]    training's l2: 0.123278     valid_1's l2: 0.397813
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[37]    training's l2: 0.159719     valid_1's l2: 0.383639
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[56]    training's l2: 0.11299      valid_1's l2: 0.366118
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[66]    training's l2: 0.0896061    valid_1's l2: 0.397312
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[39]    training's l2: 0.157648     valid_1's l2: 0.373146
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[50]    training's l2: 0.123834     valid_1's l2: 0.39308
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[40]    training's l2: 0.159366     valid_1's l2: 0.316867
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[52]    training's l2: 0.114191     valid_1's l2: 0.457971
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[46]    training's l2: 0.129243     valid_1's l2: 0.441562
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[56]    training's l2: 0.10658      valid_1's l2: 0.358745
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[48]    training's l2: 0.120558     valid_1's l2: 0.439188
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[37]    training's l2: 0.161763     valid_1's l2: 0.397943
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[43]    training's l2: 0.139222     valid_1's l2: 0.394755
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[39]    training's l2: 0.153084     valid_1's l2: 0.405815
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[48]    training's l2: 0.123297     valid_1's l2: 0.423094
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[41]    training's l2: 0.155462     valid_1's l2: 0.341056
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[31]    training's l2: 0.186186     valid_1's l2: 0.387608
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[38]    training's l2: 0.161008     valid_1's l2: 0.386238
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[41]    training's l2: 0.146398     valid_1's l2: 0.435355
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[36]    training's l2: 0.161081     valid_1's l2: 0.469358
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[32]    training's l2: 0.182545     valid_1's l2: 0.380976
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[40]    training's l2: 0.151907     valid_1's l2: 0.364813
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[38]    training's l2: 0.160105     valid_1's l2: 0.381487
Training until validation scores don't improve for 20 rounds
Early stopping, best iteration is:
[70]    training's l2: 0.0845652    valid_1's l2: 0.414601
['MedInc' 'HouseAge' 'AveRooms' 'AveBedrms' 'Population' 'AveOccup'
 'Longitude']
../_images/notebooks_arfs_large_data_sampling_22_2.png
CPU times: user 43.9 s, sys: 736 ms, total: 44.6 s
Wall time: 15.7 s