ARFS - How to use with large data?#

It might be that the data set is too large for All Relevant Feature Selection to run in a reasonable time. Usually, random sampling (stratified, grouped, etc.) solves this issue. If extreme downsampling is needed, ARFS provides two methods for drastically decreasing the number of rows. The sampling methods and their outcomes are illustrated below.

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gc

from sklearn.pipeline import Pipeline

import arfs.feature_selection as arfsfs
import arfs.feature_selection.allrelevant as arfsgroot
from arfs.preprocessing import OrdinalEncoderPandas
from arfs.benchmark import highlight_tick, compare_varimp, sklearn_pimp_bench
from arfs.utils import load_data
from arfs.sampling import sample

# plt.style.use('fivethirtyeight')
rng = np.random.RandomState(seed=42)

import warnings

warnings.filterwarnings("ignore")
Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
[2]:
import arfs

print(f"Run with ARFS {arfs.__version__}")
Run with ARFS 2.0.5
[3]:
%matplotlib inline
[4]:
gc.enable()
gc.collect()
[4]:
4

Sampling data#

A fairly large data set for illustration

[5]:
housing = load_data(name="housing")
X, y = housing.data, housing.target

X = pd.DataFrame(X, columns=housing.feature_names)
y = pd.Series(y)
X.head()
[5]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25

Cleaning the data#
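The basic pre-filtering pipeline below chains ARFS selectors: it screens features on missing values, number of unique values, cardinality, and pairwise association (collinearity) before any sampling is done.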

[6]:
basic_fs_pipeline = Pipeline(
    [
        ("missing", arfsfs.MissingValueThreshold(threshold=0.05)),
        ("unique", arfsfs.UniqueValuesThreshold(threshold=1)),
        ("cardinality", arfsfs.CardinalityThreshold(threshold=10)),
        ("collinearity", arfsfs.CollinearityThreshold(threshold=0.75)),
    ]
)

X_trans = basic_fs_pipeline.fit_transform(X=X, y=y)
[7]:
f = basic_fs_pipeline.named_steps["collinearity"].plot_association(figsize=(4, 4))
../_images/notebooks_arfs_large_data_sampling_9_0.png
[8]:
data = X_trans.copy()
data["target"] = y.values
[9]:
print(f"The dataset shape is {data.shape}")
The dataset shape is (20640, 7)

Random sampling#

You can sample using pandas or scikit-learn, the latter providing more advanced random sampling, such as stratified sampling, depending on the task (see the sketch after the cell below).

[10]:
data_rnd_samp = data.sample(n=1_000)
print(f"The sampled dataset shape is {data_rnd_samp.shape}")
The sampled dataset shape is (1000, 7)
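For a classification task, or after binning a continuous target, scikit-learn can draw a stratified subsample instead. A minimal sketch, assuming the continuous target column is discretized into deciles with pd.qcut purely for stratification (the choice of 10 bins is an arbitrary illustration, not part of ARFS):

from sklearn.model_selection import train_test_split

# bin the continuous target into deciles so the subsample
# preserves the target distribution (hypothetical 10-bin choice)
target_bins = pd.qcut(data["target"], q=10, labels=False)
data_strat_samp, _ = train_test_split(
    data, train_size=1_000, stratify=target_bins, random_state=42
)
print(f"The stratified sampled dataset shape is {data_strat_samp.shape}")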

Sampling by clustering the rows#

Using the Gower distance (which handles mixed-type data), the rows are clustered and the average (numeric) or mode (categorical) of each column is returned per cluster. The larger the number of clusters, the closer the result is to the original data. Since this requires computing a full distance matrix, you might need to random-sample first to avoid a computational bottleneck.

[11]:
%%time
# random-sample first so the distance matrix stays tractable:
# a 10_000 x 10_000 Gower distance matrix is computed and used
# for clustering the rows into 1_000 clusters
data_dum = data.sample(n=10_000)
data_g_samp = sample(df=data_dum, n=1_000, sample_weight=None, method="gower")
print(f"The sampled dataset shape is {data_g_samp.shape}")
The sampled dataset shape is (1000, 8)
CPU times: user 49.1 s, sys: 12.4 s, total: 1min 1s
Wall time: 1min 2s

Sampling by removing outliers#

This method is adapted from BorutaShap. It uses an IsolationForest to remove the least similar samples and iterates until the two-sample Kolmogorov-Smirnov statistic is > \(95\%\) (so that the subsample's distribution stays close to the original data's). The output size is not guaranteed.

[12]:
%%time
data_isof_samp = sample(df=data, sample_weight=None, method="isoforest")
print(f"The sampled dataset shape is {data_isof_samp.shape}")
The sampled dataset shape is (2064, 7)
CPU times: user 399 ms, sys: 13.3 ms, total: 413 ms
Wall time: 449 ms
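For intuition, here is a minimal sketch of the idea, not ARFS's exact implementation: score each row with an IsolationForest, drop the most anomalous ones, and check with a two-sample KS test that each column of the subsample still matches the original distribution. The 10% cut-off below is an arbitrary illustration:

from scipy.stats import ks_2samp
from sklearn.ensemble import IsolationForest

iso = IsolationForest(random_state=42).fit(data)
scores = iso.score_samples(data)  # lower score = more anomalous
keep = scores >= np.quantile(scores, 0.10)  # drop the 10% most anomalous rows
subsample = data[keep]

# per-column KS statistic between the subsample and the original data
ks_stats = {col: ks_2samp(subsample[col], data[col]).statistic for col in data.columns}
print(ks_stats)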

Impact on the feature selection#

Let's run the feature selection on this toy data set to see the impact of the different sampling strategies.

[13]:
%%time
# No Sampling
X = data.drop("target", axis=1)
y = data.target

# GrootCV
feat_selector = arfsgroot.GrootCV(
    objective="rmse", cutoff=1, n_folds=5, n_iter=5, silent=True
)
feat_selector.fit(X, y, sample_weight=None)
print(feat_selector.get_feature_names_out())
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random/genuine predictors, if present (none in this data set)
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()
['MedInc' 'HouseAge' 'AveRooms' 'Population' 'AveOccup' 'Longitude']
../_images/notebooks_arfs_large_data_sampling_19_2.png
CPU times: user 12min 19s, sys: 2.9 s, total: 12min 22s
Wall time: 3min 32s
[14]:
%%time

X = data_rnd_samp.drop("target", axis=1)
y = data_rnd_samp.target

# GrootCV
feat_selector = arfsgroot.GrootCV(
    objective="rmse", cutoff=1, n_folds=5, n_iter=5, silent=True
)
feat_selector.fit(X, y, sample_weight=None)
print(feat_selector.get_feature_names_out())
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random/genuine predictors, if present (none in this data set)
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()
['MedInc' 'HouseAge' 'AveRooms' 'Population' 'AveOccup' 'Longitude']
../_images/notebooks_arfs_large_data_sampling_20_2.png
CPU times: user 23.1 s, sys: 436 ms, total: 23.5 s
Wall time: 7.43 s
[15]:
%%time

X = data_g_samp.drop("target", axis=1)
y = data_g_samp.target

# GrootCV
feat_selector = arfsgroot.GrootCV(
    objective="rmse", cutoff=1, n_folds=5, n_iter=5, silent=True
)
feat_selector.fit(X, y, sample_weight=None)
print(feat_selector.get_feature_names_out())
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random/genuine predictors, if present (none in this data set)
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()
['MedInc' 'HouseAge' 'AveRooms' 'Population' 'AveOccup' 'Longitude']
../_images/notebooks_arfs_large_data_sampling_21_2.png
CPU times: user 25.3 s, sys: 512 ms, total: 25.8 s
Wall time: 8.27 s
[16]:
%%time

X = data_isof_samp.drop("target", axis=1)
y = data_isof_samp.target

# GrootCV
feat_selector = arfsgroot.GrootCV(
    objective="rmse", cutoff=1, n_folds=5, n_iter=5, silent=True
)
feat_selector.fit(X, y, sample_weight=None)
print(feat_selector.get_feature_names_out())
fig = feat_selector.plot_importance(n_feat_per_inch=5)

# highlight synthetic random/genuine predictors, if present (none in this data set)
fig = highlight_tick(figure=fig, str_match="random")
fig = highlight_tick(figure=fig, str_match="genuine", color="green")
plt.show()
['MedInc' 'HouseAge' 'AveRooms' 'Population' 'AveOccup' 'Longitude']
../_images/notebooks_arfs_large_data_sampling_22_2.png
CPU times: user 39.5 s, sys: 630 ms, total: 40.2 s
Wall time: 12.7 s
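All three sampling strategies select the same feature subset as the run on the full data set, while cutting the wall time from about three and a half minutes to under 13 seconds.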