arfs.feature_selection package#

Submodules#

arfs.feature_selection.allrelevant module#

This module provides three methods for performing all-relevant feature selection.

Reference:#

NILSSON, Roland, PEÑA, José M., BJÖRKEGREN, Johan, et al. Consistent feature selection for pattern recognition in polynomial time. Journal of Machine Learning Research, 2007, vol. 8, no Mar, p. 589-612.

KURSA, Miron B., RUDNICKI, Witold R., et al. Feature selection with the Boruta package. J Stat Softw, 2010, vol. 36, no 11, p. 1-13.

https://github.com/chasedehan/BoostARoota

The module structure#

Original BorutaPy version#

Author: Daniel Homola <dani.homola@gmail.com>

Original code and method by: Miron B Kursa, https://m2.icm.edu.pl/boruta/

Modified by Thomas Bury. The pull request https://github.com/scikit-learn-contrib/boruta_py/pull/100 is a new PR based on #77 that makes all the changes optional. It is waiting to be merged.

Leshy is a re-work of the PR I submitted.

License: BSD 3 clause

class arfs.feature_selection.allrelevant.BoostAGroota(estimator=None, cutoff=4, iters=10, max_rounds=500, delta=0.1, silent=True, importance='shap')[source]#

Bases: SelectorMixin, BaseEstimator

BoostAGroota is an all-relevant feature selection method, while most others are minimal optimal. It tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some estimator has a minimal error.

Why bother with all-relevant feature selection? When you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (minimal optimal set of features by definition depends on your estimator choice).

Parameters:
  • estimator (scikit-learn estimator) – The model to train, lightGBM recommended, see the reduce lightgbm method.

  • cutoff (float) – The value by which the max of shadow imp is divided, to compare to real importance.

  • iters (int (>0)) – The number of iterations to average for the feature importance (on the same split), to reduce the variance.

  • max_rounds (int (>0)) – The number of times the core BoostAGroota algorithm will run. Each round eliminates more and more features.

  • delta (float (0 < delta <= 1)) – Stopping criteria for whether another round is started.

  • silent (bool) – Set to True if you don’t want to see the BoostAGroota output printed.

  • importance (str, default 'shap') – The kind of feature importance to use. Possible values: ‘shap’ (Shapley values), ‘pimp’ (permutation importance), and ‘native’ (Gini/impurity).

Variables:
  • selected_features (list of str) – The list of columns to keep.

  • ranking (array of shape [n_features]) – The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1, and tentative features are assigned rank 2.

  • ranking_absolutes (array of shape [n_features]) – The absolute feature ranking as ordered by the selection process. It does not guarantee that this order is correct for all models. For a model-agnostic ranking, see the attribute ranking.

  • sha_cutoff_df (dataframe) – Feature importance of the real+shadow predictors over iterations.

  • mean_shadow (float) – The threshold below which the predictors are rejected.
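The role of cutoff can be illustrated with a small NumPy sketch. It is not the ARFS implementation: the helper name `keep_features`, the array layout and the toy importance values are assumptions made for illustration, and the threshold follows the parameter description above (max shadow importance divided by cutoff).

```python
import numpy as np


def keep_features(real_imp, shadow_imp, cutoff):
    """Return (mask, threshold) for the real features that beat the shadows.

    real_imp, shadow_imp: arrays of shape (iters, n_features) holding the
    importance measured at each iteration (hypothetical layout).
    """
    # Average over iterations to reduce the variance of the importance.
    real_mean = np.asarray(real_imp, dtype=float).mean(axis=0)
    shadow_mean = np.asarray(shadow_imp, dtype=float).mean(axis=0)
    # Assumption: the threshold is the best shadow importance divided by cutoff.
    threshold = shadow_mean.max() / cutoff
    return real_mean > threshold, threshold


# Toy importances for 3 real features and their 3 shadows over 2 iterations.
real = np.array([[8.0, 0.5, 3.0], [10.0, 0.3, 5.0]])
shadow = np.array([[2.0, 1.0, 0.5], [2.0, 3.0, 1.5]])
mask, threshold = keep_features(real, shadow, cutoff=1.0)
```

A larger cutoff lowers the threshold, so more features survive a round.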

Examples

>>> X = df[filtered_features].copy()
>>> y = df['target'].copy()
>>> w = df['weight'].copy()
>>> model = LGBMRegressor(n_jobs=-1, n_estimators=100, objective='rmse', random_state=42, verbose=0)
>>> feat_selector = BoostAGroota(estimator=model, cutoff=1, iters=10, max_rounds=10, delta=0.1, importance='shap')
>>> feat_selector.fit(X, y, sample_weight=None)
>>> print(feat_selector.selected_features_)
>>> feat_selector.plot_importance(n_feat_per_inch=5)
fit(X, y, sample_weight=None)[source]#

Fit the BoostAGroota transformer with the provided estimator.

Parameters:
  • X (pd.DataFrame) – The predictors matrix.

  • y (pd.Series) – The target.

  • sample_weight (pd.Series) – The sample weights, if any.

plot_importance(n_feat_per_inch=5)[source]#

Boxplot of the variable importance, ordered by magnitude. The max shadow variable importance is illustrated by the dashed line. The fit method must be called first.

Parameters:

n_feat_per_inch (int, default 5) – Number of features to plot per inch (for scaling the figure).

Returns:

fig (plt.figure or None) – The matplotlib figure object containing the boxplot, or None if there are no selected features.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') BoostAGroota#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Reduce X to the selected features.

Parameters:

X (array of shape [n_samples, n_features]) – The input samples.

Returns:

X_r (array of shape [n_samples, n_selected_features]) – The input samples with only the selected features.

class arfs.feature_selection.allrelevant.GrootCV(objective=None, cutoff=1, n_folds=5, folds=None, n_iter=5, silent=True, rf=False, fastshap=False, n_jobs=0, lgbm_params=None)[source]#

Bases: SelectorMixin, BaseEstimator

GrootCV is a feature selection method based on cross-validation with lightGBM.

A shuffled copy of the predictors matrix is added (shadows) to the original set of predictors. The lightGBM is fitted using repeated cross-validation, the feature importance is extracted each time and averaged to smooth out the noise. If the feature importance is larger than the shadow feature importance threshold, the predictor is kept; the others are rejected.

  • Cross-validated feature importance to smooth out the noise, based on lightGBM only (which is, most of the time, the fastest and most accurate boosting implementation).

  • The feature importance is derived using SHAP importance.

  • The threshold is the max of the median shadow variable importance over folds; anything less would not be conservative enough, and this choice also improves convergence (fewer evaluations are needed to find a threshold).

  • Not based on a given percentage of columns to delete.

  • Plot method for var. imp
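The procedure described above can be sketched end to end in NumPy. This is only an illustration under stated assumptions: `toy_importance` (absolute correlation with the target) stands in for the SHAP importance of a fitted lightGBM model, and the function names, the fold construction and the synthetic data are hypothetical.

```python
import numpy as np


def toy_importance(X, y):
    """Stand-in importance: absolute Pearson correlation with the target."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / denom


def shadow_cv_selection(X, y, n_folds=5, n_iter=5, cutoff=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    runs = []
    for _ in range(n_iter):
        # Shadows: every column shuffled independently, appended to the real ones.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        X_aug = np.hstack([X, shadows])
        for held_out in np.array_split(rng.permutation(n_samples), n_folds):
            train = np.setdiff1d(np.arange(n_samples), held_out)
            runs.append(toy_importance(X_aug[train], y[train]))
    runs = np.asarray(runs)  # shape (n_iter * n_folds, 2 * n_features)
    # Real importances are averaged over all runs.
    real_imp = runs[:, :n_features].mean(axis=0)
    # Threshold: max over shadows of their median importance, divided by cutoff.
    sha_cutoff = np.median(runs[:, n_features:], axis=0).max() / cutoff
    return real_imp > sha_cutoff, sha_cutoff


rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
# Only the first two columns carry signal; the last two are pure noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=300)
support, sha_cutoff = shadow_cv_selection(X, y)
```

The two informative columns clear the threshold comfortably. Whether a pure-noise column is rejected depends on the random draw, which is why the real procedure repeats the cross-validation several times.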

Parameters:
  • objective (str or callable, default None) – The objective function to use in lightGBM. If None, it uses the objective specified in lgbm_params.

  • cutoff (float, default 1) – The value by which the max of shadow imp is divided, to compare to real importance.

  • n_folds (int, default 5) – The number of folds for cross-validation.

  • folds (generator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None, default None) – If a generator or iterator, it should yield the train and test indices for each fold. If an object, it should be one of the scikit-learn splitter classes (https://scikit-learn.org/stable/modules/classes.html#splitter-classes) and have a split method. This argument takes priority over the other data-split arguments.

  • n_iter (int, default 5) – The number of iterations to average for the feature importance (on the same split), to reduce variance.

  • silent (bool, default True) – Set to True if you don’t want to see the GrootCV output printed.

  • rf (bool, default False) – If True, use random forest for calculating feature importances; otherwise, use lightGBM.

  • fastshap (bool, default False) – If True, use fastSHAP for calculating feature importances; otherwise, use SHAP.

  • n_jobs (int, default 0) – The number of jobs to run in parallel. If 0, no parallelism is used.

  • lgbm_params (dict, default None) – The parameters for the lightGBM model.
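As an illustration of the shape expected by the folds argument, the sketch below builds (train_idx, test_idx) tuples with NumPy. The helper `make_folds` is hypothetical and is not part of ARFS; a scikit-learn splitter object is the other option the parameter description mentions.

```python
import numpy as np


def make_folds(n_samples, n_folds=5, seed=0):
    """Yield one (train_idx, test_idx) tuple per fold."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    for test_idx in np.array_split(shuffled, n_folds):
        # Everything that is not held out in this fold is used for training.
        train_idx = np.setdiff1d(shuffled, test_idx)
        yield train_idx, test_idx


# Materialise the generator so the same folds can be iterated more than once.
folds = list(make_folds(n_samples=10, n_folds=5))
```

Whether a generator passed as folds is consumed more than once is not stated here, so materialising it as a list is the cautious choice.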

Variables:
  • selected_features (ndarray) – The list of columns to keep as selected features.

  • cv_df (pd.DataFrame) – DataFrame containing feature importance values for each fold and iteration.

  • sha_cutoff (float) – The threshold below which the predictors are rejected.

  • ranking_absolutes (list) – The absolute feature ranking as ordered by the selection process.

  • ranking (ndarray) – The feature ranking, where 2 corresponds to selected features and 1 to tentative features.

fit(X, y, sample_weight=None)[source]#

Fit the GrootCV on the input data.

transform(X)[source]#

Apply the fitted GrootCV on new data.

plot_importance(n_feat_per_inch=5)[source]#

Plot the feature importance of the fitted GrootCV.

Warning

If sha_cutoff is None, you should apply the fit method first.

Examples

>>> X = df[filtered_features].copy()
>>> y = df['target'].copy()
>>> w = df['weight'].copy()
>>> feat_selector = GrootCV(objective='rmse', cutoff=1, n_folds=5, n_iter=5)
>>> feat_selector.fit(X, y, sample_weight=None)
>>> feat_selector.plot_importance(n_feat_per_inch=5)
fit(X, y, sample_weight=None)[source]#

Fit the GrootCV on the input data.

Parameters:
  • X (pd.DataFrame of shape (n_samples, n_features)) – The predictor dataframe.

  • y (array-like of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

Returns:

self (object) – Returns self.

plot_importance(n_feat_per_inch=5)[source]#

Plot the feature importance of the fitted GrootCV.

Parameters:

n_feat_per_inch (int, default 5) – The number of features per inch in the plot.

Returns:

fig (matplotlib.figure.Figure or None) – The matplotlib figure containing the plot or None if no feature is selected.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GrootCV#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Apply the fitted GrootCV on new data.

Parameters:

X (pd.DataFrame of shape (n_samples, n_features)) – The predictor dataframe.

Returns:

X_selected (pd.DataFrame of shape (n_samples, n_selected_features)) – The selected features from the input dataframe.

class arfs.feature_selection.allrelevant.Leshy(estimator, n_estimators=1000, perc=90, alpha=0.05, importance='shap', two_step=True, max_iter=100, random_state=None, verbose=0, keep_weak=False)[source]#

Bases: SelectorMixin, BaseEstimator

This is an improved version of BorutaPy, which is itself an improved Python implementation of the Boruta R package. Boruta is an all-relevant feature selection method, while most others are minimal optimal; this means it tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some estimator has a minimal error. Why bother with all-relevant feature selection? When you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (a minimal optimal set of features by definition depends on your estimator choice).

Parameters:
  • estimator (object) – A supervised learning estimator, with a ‘fit’ method that returns the feature_importances_ attribute. Important features must correspond to high absolute values in the feature_importances_

  • n_estimators (int or string, default = 1000) – If int sets the number of estimators in the chosen ensemble method. If ‘auto’ this is determined automatically based on the size of the dataset. The other parameters of the used estimators need to be set with initialisation.

  • perc (int, default = 90) – Instead of the max, the percentile defined by the user is used to pick the threshold for comparison between shadow and real features. The max tends to be too stringent, and the percentile provides finer control. The lower perc is, the more false positives will be picked as relevant, but also the fewer relevant features will be left out: the usual trade-off. Setting perc to 100 is essentially vanilla Boruta, which corresponds to the max.

  • alpha (float, default = 0.05) – Level at which the corrected p-values will get rejected in both correction steps.

  • importance (str, default = 'shap') – The kind of variable importance used to compare and discriminate original vs shadow predictors. Note that the builtin tree importance (gini/impurity based importance) is biased towards numerical and large cardinality predictors, even if they are random. Shapley values and permutation imp. are robust w.r.t those predictors. Possible values: ‘shap’ (Shapley values), ‘fastshap’ (FastTreeShap implementation), ‘pimp’ (permutation importance) and ‘native’ (Gini/impurity)

  • two_step (Boolean, default = True) – If you want to use the original implementation of Boruta with Bonferroni correction only set this to False.

  • max_iter (int, default = 100) – The number of maximum iterations to perform.

  • random_state (int, RandomState instance or None; default=None) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • verbose (int, default 0) – Controls verbosity of output. 0: no output, 1: displays iteration number, 2: which features have been selected already
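The effect of perc can be seen on a handful of shadow importances. This is a minimal sketch, assuming the threshold is a plain percentile of the shadow importances; the helper name `shadow_threshold` and the values are illustrative only.

```python
import numpy as np


def shadow_threshold(shadow_imp, perc):
    """Percentile of the shadow importances used as the comparison threshold."""
    return np.percentile(shadow_imp, perc)


shadow_imp = np.array([0.1, 0.4, 0.2, 0.9, 0.3])
strict = shadow_threshold(shadow_imp, perc=100)  # identical to shadow_imp.max()
relaxed = shadow_threshold(shadow_imp, perc=90)  # lower bar, more features pass
```

A single unusually strong shadow drives the max, whereas a lower percentile is less sensitive to it.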

Variables:
  • n_features (int) – The number of selected features.

  • support (array of shape [n_features]) – The mask of selected features - only confirmed ones are True.

  • support_weak (array of shape [n_features]) – The mask of selected tentative features, which haven’t gained enough support during the max_iter number of iterations.

  • selected_features (list of str) – the list of columns to keep

  • ranking (array of shape [n_features]) – The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank one and tentative features are assigned rank 2.

  • ranking_absolutes (array of shape [n_features]) – The absolute feature ranking as ordered by the selection process. It does not guarantee that this order is correct for all models. For a model-agnostic ranking, see the attribute ranking.

  • cat_name (list of str) – the name of the categorical columns

  • cat_idx (list of int) – the index of the categorical columns

  • imp_real_hist (array) – array of the historical feature importance of the real predictors

  • sha_max (float) – the maximum feature importance of the shadow predictors

  • col_names (list of str) – the names of the real predictors

Examples

>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from arfs.feature_selection.allrelevant import Leshy
>>>
>>> # load X and y
>>> # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
>>> X = pd.read_csv('examples/test_X.csv', index_col=0).values
>>> y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
>>> y = y.ravel()
>>>
>>> # define random forest classifier, with utilising all cores and
>>> # sampling in proportion to y labels
>>> rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
>>>
>>> # define Boruta feature selection method
>>> feat_selector = Leshy(rf, n_estimators='auto', verbose=2, random_state=1)
>>>
>>> # find all relevant features - 5 features should be selected
>>> feat_selector.fit(X, y)
>>>
>>> # check selected features - first 5 features are selected
>>> feat_selector.selected_features_
>>>
>>> # check ranking of features
>>> feat_selector.ranking_
>>>
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)

References

See the original paper [1] for more details.

[1] Kursa M., Rudnicki W., “Feature Selection with the Boruta Package”, Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010

_add_shadows_get_imps(X, y, sample_weight, dec_reg)[source]#

Add a shuffled copy of the columns (shadows) and get the feature importance of the augmented data set

Parameters:
  • X (pd.DataFrame of shape [n_samples, n_features]) – predictor matrix

  • y (pd.series of shape [n_samples]) – target

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample

  • dec_reg (array) – holds the decision about each feature 1, 0, -1 (accepted, undecided, rejected)

Returns:

  • imp_real (array) – feature importance of the real predictors

  • imp_sha (array) – feature importance of the shadow predictors

static _assign_hits(hit_reg, cur_imp, imp_sha_max)[source]#

count how many times a given feature was more important than the best of the shadow features

Parameters:
  • hit_reg (array) – count how many times a given feature was more important than the best of the shadow features

  • cur_imp (array) – current importance

  • imp_sha_max (array) – importance of the best shadow predictor

Returns:

hit_reg (array) – how many times a given feature was more important than the best of the shadow features
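The hit counting described above amounts to one vectorised comparison per iteration. The sketch below is an illustration in NumPy, not the library code; the function name and the toy importances are assumptions.

```python
import numpy as np


def assign_hits(hit_reg, cur_imp, imp_sha_max):
    """Increment the counter of every feature that beats the best shadow."""
    hit_reg = hit_reg.copy()
    hit_reg[cur_imp > imp_sha_max] += 1
    return hit_reg


hit_reg = np.zeros(4, dtype=int)
# Two iterations with different importances, same best-shadow importance.
hit_reg = assign_hits(hit_reg, np.array([0.9, 0.1, 0.5, 0.2]), imp_sha_max=0.3)
hit_reg = assign_hits(hit_reg, np.array([0.8, 0.4, 0.1, 0.2]), imp_sha_max=0.3)
```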

_calculate_absolute_ranking()[source]#

Calculate the absolute feature ranking, as ordered by the selection process (see the ranking_absolutes attribute).

Parameters:

None

_calculate_relative_ranking(n_feat, tentative, confirmed, imp_history)[source]#

Calculates the relative ranking of features based on their importance history.

Parameters:
  • n_feat (int) – The total number of features.

  • tentative (ndarray of shape (n_tentative_features,)) – An array containing the indices of tentative features.

  • confirmed (ndarray of shape (n_confirmed_features,)) – An array containing the indices of confirmed features.

  • imp_history (ndarray of shape (n_iterations + 1, n_features)) – An array containing the feature importances for each iteration.

Returns:

None

_calculate_support(confirmed, tentative, n_feat)[source]#

Calculate the feature support arrays.

Parameters:
  • confirmed (array-like of shape (n_confirmed,)) – Indices of confirmed features.

  • tentative (array-like of shape (n_tentative,)) – Indices of tentative features.

  • n_feat (int) – Total number of features.

Returns:

None – The function populates the following class attributes:

  • n_features_ (int) – Number of selected features.

  • support_ (ndarray of shape (n_feat,)) – Boolean array indicating the selected features.

  • support_weak_ (ndarray of shape (n_feat,)) – Boolean array indicating the tentatively selected features.
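Turning index arrays into boolean masks can be sketched as follows. The helper `support_masks` is hypothetical and returns the values instead of setting attributes; counting only the confirmed features as selected is an assumption.

```python
import numpy as np


def support_masks(confirmed, tentative, n_feat):
    """Build the confirmed and tentative boolean masks from index arrays."""
    support = np.zeros(n_feat, dtype=bool)
    support[np.asarray(confirmed, dtype=int)] = True
    support_weak = np.zeros(n_feat, dtype=bool)
    support_weak[np.asarray(tentative, dtype=int)] = True
    # Assumption: only the confirmed features count as selected.
    return int(support.sum()), support, support_weak


n_features, support, support_weak = support_masks([0, 3], [1], n_feat=5)
```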

_check_params(X, y)[source]#

Private method. Checks the hyperparameters as well as X and y before proceeding with fit.

Parameters:
  • X (pd.DataFrame) – predictor matrix

  • y (pd.series) – target series

Raises:
  • ValueError – [description]

  • ValueError – [description]

_do_tests(dec_reg, hit_reg, _iter)[source]#

Private method. Tests whether each feature should be tagged as relevant (confirmed), not relevant (rejected) or undecided. The test treats the iterations as repeated binomial trials: it counts how many times a given feature was more important than the best of the shadow features and checks whether the probability associated with the z-score is below, between or above the rejection and acceptance thresholds.

Parameters:
  • dec_reg (array) – holds the decision about each feature 1, 0, -1 (accepted, undecided, rejected)

  • hit_reg (array) – counts how many times a given feature was more important than the best of the shadow features

  • _iter (int) – iteration number

Returns:

dec_reg (array) – holds the decision about each feature 1, 0, -1 (accepted, undecided, rejected)
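A standard-library sketch of such a binomial test is given below. It rests on assumptions not stated in this page: a hit probability of 0.5 under the null hypothesis, exact binomial tail probabilities, and a plain Bonferroni correction (the two-step correction is not reproduced). The function names are illustrative.

```python
from math import comb


def binom_sf(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(max(k, 0), n + 1))


def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, min(k, n) + 1))


def do_tests(dec_reg, hit_reg, n_iter, alpha=0.05):
    """Return updated decisions: 1 accepted, 0 undecided, -1 rejected."""
    n_tests = len(dec_reg)
    updated = list(dec_reg)
    for j, hits in enumerate(hit_reg):
        if dec_reg[j] != 0:
            continue  # a decision was already made for this feature
        if binom_sf(hits, n_iter) * n_tests <= alpha:  # too many hits for chance
            updated[j] = 1
        elif binom_cdf(hits, n_iter) * n_tests <= alpha:  # too few hits for chance
            updated[j] = -1
    return updated


# After 20 iterations: always a hit, never a hit, a hit half of the time.
decisions = do_tests(dec_reg=[0, 0, 0], hit_reg=[20, 0, 10], n_iter=20)
```

A feature that wins about half of the time stays undecided, which is why some features remain tentative when max_iter is reached.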

static _fdrcorrection(pvals, alpha=0.05)[source]#

Benjamini/Hochberg p-value correction for false discovery rate, from statsmodels package. Included here for decoupling dependency on statsmodels.

Parameters:
  • pvals (array_like) – set of p-values of the individual tests.

  • alpha (float) – error rate

Returns:

  • rejected (array, bool) – True if a hypothesis is rejected, False if not

  • pvalue-corrected (array) – pvalues adjusted for multiple hypothesis testing to limit FDR
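For reference, the Benjamini/Hochberg step-up procedure can be written in a few lines of NumPy. This is an independent sketch of the textbook procedure, not a copy of the statsmodels or ARFS code; the function name `fdr_bh` is illustrative.

```python
import numpy as np


def fdr_bh(pvals, alpha=0.05):
    """Benjamini/Hochberg correction: returns (rejected, adjusted p-values)."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    ranks = np.arange(1, n + 1)

    # Step-up rule: reject up to the largest k with p_(k) <= k / n * alpha.
    below = sorted_p <= ranks / n * alpha
    rejected_sorted = np.zeros(n, dtype=bool)
    if below.any():
        rejected_sorted[: np.nonzero(below)[0].max() + 1] = True

    # Adjusted p-values: scale by n / rank, enforce monotonicity, cap at 1.
    adjusted_sorted = np.minimum.accumulate((sorted_p * n / ranks)[::-1])[::-1]
    adjusted_sorted = np.clip(adjusted_sorted, 0.0, 1.0)

    # Put the results back in the original order of the p-values.
    rejected = np.empty(n, dtype=bool)
    adjusted = np.empty(n, dtype=float)
    rejected[order] = rejected_sorted
    adjusted[order] = adjusted_sorted
    return rejected, adjusted


rejected, adjusted = fdr_bh([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```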

_fit(X_raw, y, sample_weight=None)[source]#

Private method. See the methods overview in the documentation for explanation of the process

Parameters:
  • X_raw (array-like, shape = [n_samples, n_features]) – The training input samples.

  • y (array-like, shape = [n_samples]) – The target values.

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample

Returns:

self (object) – Nothing but attributes

_get_tree_num(n_feat)[source]#
Private method. Gets a good estimate of the number of trees, given the number of features.

Parameters:

n_feat (int) – The number of features

Returns:

n_estimators (int) – the number of trees

static _nanrankdata(X, axis=1)[source]#

Replaces bottleneck’s nanrankdata with scipy and numpy alternative.

Parameters:
  • X (array or pd.DataFrame) – the data array

  • axis (int, optional) – row-wise (0) or column-wise (1), by default 1

Returns:

ranks (array) – the ranked array

_print_result(dec_reg, _iter, start_time)[source]#

Print the results of feature selection.

Parameters:
  • dec_reg (bool) – Decision on whether to proceed with another round of feature selection.

  • _iter (int) – Current iteration number.

  • start_time (float) – Time when the feature selection process started.

Returns:

None – The function prints the relevant results and running time.

_print_results(dec_reg, _iter, flag)[source]#

Private method, printing the result

Parameters:
  • dec_reg (array) – whether the feature has been tagged as relevant (confirmed), not relevant (rejected) or undecided

  • _iter (int) – the iteration number

  • flag (int) – is still in the feature selection process or not

Returns:

output (str) – the output to be printed out

_run_iteration(X, y, sample_weight, dec_reg, sha_max_history, imp_history, hit_reg, _iter)[source]#

Run one iteration of the feature selection loop.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The input samples.

  • y (array-like of shape (n_samples,)) – The target values.

  • sample_weight (array-like of shape (n_samples,), default None) – Sample weights. If None, then samples are equally weighted.

  • dec_reg (array-like of shape (n_features,)) – Holds the decision about each feature: 1, 0, -1 (accepted, undecided, rejected).

  • sha_max_history (list of floats) – List of the maximum shadow importance value at each iteration.

  • imp_history (array-like of shape (n_iterations, n_features)) – Matrix of feature importances at each iteration.

  • hit_reg (array-like of shape (n_features,)) – Counts how many times a given feature was more important than the best of the shadow features.

  • _iter (int) – The current iteration number.

Returns:

  • dec_reg (array-like of shape (n_features,)) – Updated decision about each feature: 1, 0, -1 (accepted, undecided, rejected).

  • sha_max_history (list of floats) – List of the maximum shadow importance value at each iteration.

  • imp_history (array-like of shape (n_iterations, n_features)) – Matrix of feature importances at each iteration.

  • hit_reg (array-like of shape (n_features,)) – Updated count of how many times a given feature was more important than the best of the shadow features.

  • imp_sha_max (float) – The maximum shadow importance value for this iteration.

_update_estimator()[source]#

Update the estimator with a new random state, if applicable.

If the dataset is not categorical, the estimator’s random_state parameter is updated with a new random state generated by the random_state attribute of the Leshy object. If the estimator is a LightGBM model, the random state value is generated between 0 and 10000.

Parameters:

None

Returns:

None

_update_tree_num(dec_reg)[source]#

Update the number of trees in the estimator based on the number of selected features.

Parameters:

dec_reg (array-like of shape (n_features,)) – The decision rule for each feature, where negative values indicate that the feature should be rejected and non-negative values indicate that the feature should be selected.

Returns:

None

Notes

This function updates the n_estimators parameter of the estimator if it is set to “auto”. The number of trees is determined based on the number of selected features. Specifically, the number of trees is set to the value returned by the _get_tree_num method, which takes as input the number of selected features that are not rejected.

If n_estimators is not set to “auto”, this function does nothing.

fit(X, y, sample_weight=None)[source]#

Fits the Boruta feature selection with the provided estimator.

Parameters:
  • X (array-like, shape = [n_samples, n_features]) – The training input samples.

  • y (array-like, shape = [n_samples]) – The target values.

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample

Returns:

self (object) – Nothing but attributes

plot_importance(n_feat_per_inch=5)[source]#

Boxplot of the variable importance, ordered by magnitude. The max shadow variable importance is illustrated by the dashed line. The fit method must be called first.

Parameters:

n_feat_per_inch (int, default 5) – number of features to plot per inch (for scaling the figure)

Returns:

fig (plt.figure) – the matplotlib figure object containing the boxplot

select_features(X, y, sample_weight=None)[source]#

Select features using the Leshy algorithm.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The input data.

  • y (array-like of shape (n_samples,)) – The target values.

  • sample_weight (array-like of shape (n_samples,), default None) – Individual weights for each sample.

Returns:

  • dec_reg (ndarray of shape (n_features,)) – The decision rule. 1 means the feature is selected, 0 means the feature is not selected.

  • sha_max_history (list) – List of the maximum shadow importances per iteration.

  • imp_history (ndarray of shape (n_iterations, n_features)) – Array containing the feature importances per iteration.

  • imp_sha_max (float) – Maximum shadow importance value.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') Leshy#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Reduce X to the selected features.

Parameters:

X (array of shape [n_samples, n_features]) – The input samples.

Returns:

X_r (array of shape [n_samples, n_selected_features]) – The input samples with only the selected features.

arfs.feature_selection.allrelevant._boostaroota(X, y, estimator, cutoff, iters, max_rounds, delta, silent, weight, imp)[source]#

Private function, reduces the number of predictors using a sklearn estimator.

Parameters:
  • X (pd.DataFrame) – The dataframe to create shadow features on.

  • y (pd.Series) – The target.

  • estimator (scikit-learn estimator) – The model to train, lightGBM recommended, see the reduce lightgbm method.

  • cutoff (float) – The value by which the max of shadow imp is divided, to compare to real importance.

  • iters (int (>0)) – The number of iterations to average for the feature importances (on the same split), to reduce the variance.

  • max_rounds (int (>0)) – The number of times the core BoostARoota algorithm will run. Each round eliminates more and more features.

  • delta (float (0 < delta <= 1)) – Stopping criteria for whether another round is started.

  • silent (bool) – Set to True if you don’t want to see the BoostARoota output printed. Will still show any errors or warnings that may occur.

  • weight (pd.Series, optional) – Sample weights, if any.

  • imp (str) – The kind of feature importance to use: native, shap, fastshap or permutation importance.

Returns:

  • crit (bool) – If the criteria have been reached or not.

  • keep_vars (pd.DataFrame) – Feature importance of the real predictors over iterations.

  • df_vimp (pd.DataFrame) – Feature importance of the real+shadow predictors over iterations.

  • mean_shadow (float) – The feature importance threshold to reject or not the predictors.

arfs.feature_selection.allrelevant._compute_importance(new_x_tr, shap_matrix, param, objective, fastshap)[source]#

Compute feature importance scores using SHAP values.

Parameters:
  • new_x_tr (numpy.ndarray) – The training dataset after being processed.

  • shap_matrix (numpy.ndarray) – The matrix containing SHAP values computed by a LightGBM model.

  • param (dict) – A dictionary containing the parameters for a LightGBM model.

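  • objective (str) – The objective function of the LightGBM model.

  • fastshap (bool) – If True, use fastSHAP for calculating feature importances; otherwise, use SHAP.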

Returns:

list – A list of tuples containing feature names and their corresponding importance scores.
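A common way to turn a SHAP matrix into per-feature scores is the mean absolute SHAP value per column. The sketch below illustrates that aggregation only; the helper name is hypothetical, and it assumes the matrix has exactly one column per feature (any extra expected-value column already removed).

```python
import numpy as np


def mean_abs_shap_importance(shap_matrix, feature_names):
    """Hypothetical helper: score features by their mean absolute SHAP value."""
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    # Average the magnitude of each feature's contribution over all rows.
    scores = np.abs(shap_matrix).mean(axis=0)
    return list(zip(feature_names, scores.tolist()))


# Toy SHAP matrix: 3 samples, 2 features.
toy_shap = np.array([[0.5, -0.1], [-0.5, 0.1], [1.0, 0.1]])
importance = mean_abs_shap_importance(toy_shap, ["age", "noise"])
```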

arfs.feature_selection.allrelevant._create_shadow(X_train)[source]#

Create shadow features by making copies of all X variables and randomly shuffling them.

Parameters:

X_train (pd.DataFrame) – The dataframe to create shadow features on.

Returns:

pd.DataFrame – A dataframe that is twice the width of X_train and contains the shadow features, along with a list of the shadow feature names.
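As an illustration of the idea, shadow features can be built by permuting each column independently and appending the result to the original frame. This is a sketch under assumptions: the function name, the `seed` argument and the `ShadowVar<i>` naming are hypothetical and may differ from the library's implementation.

```python
import numpy as np
import pandas as pd


def create_shadow_sketch(X_train, seed=0):
    """Hypothetical helper: append a shuffled copy of every column."""
    rng = np.random.default_rng(seed)
    shadow = X_train.copy()
    for col in shadow.columns:
        # Permuting a column keeps its distribution but breaks its link to y.
        shadow[col] = rng.permutation(shadow[col].to_numpy())
    shadow_names = [f"ShadowVar{i + 1}" for i in range(shadow.shape[1])]
    shadow.columns = shadow_names
    combined = pd.concat(
        [X_train.reset_index(drop=True), shadow.reset_index(drop=True)], axis=1
    )
    return combined, shadow_names


X_demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]})
combined, names = create_shadow_sketch(X_demo)
```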

arfs.feature_selection.allrelevant._get_confirmed_and_tentative(dec_reg)[source]#

Extracts the confirmed and tentative features from dec_reg.

arfs.feature_selection.allrelevant._get_imp(estimator, X, y, sample_weight=None, cat_feature=None)[source]#

Private function. Gets the native feature importance (impurity-based, for instance).

Notes

This is known to return biased and uninformative results, e.g. https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-py

or

https://explained.ai/rf-importance/

Parameters:
  • X (array-like, shape = [n_samples, n_features]) – The training input samples.

  • y (array-like, shape = [n_samples]) – The target values.

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample

  • cat_feature (list of int or None) – the list of integers, cols loc, of the categorical predictors. Avoids detecting and encoding them at each iteration if the exact same columns are passed to the selection methods.

Returns:

imp (array) – the native feature importance array

arfs.feature_selection.allrelevant._get_perm_imp(estimator, X, y, sample_weight, cat_feature=None)[source]#

Private function. Gets the permutation feature importance.

Parameters:
  • estimator (sklearn estimator) –

  • X (pd.DataFrame of shape [n_samples, n_features]) – predictor matrix

  • y (pd.Series of shape [n_samples]) – target

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample

  • cat_feature (list of int or None) – the list of integers, cols loc, of the categorical predictors. Avoids detecting and encoding them at each iteration if the exact same columns are passed to the selection methods.

Returns:

imp (array) – the permutation importance array
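Permutation importance measures how much a fitted model's error grows when one column is shuffled. The from-scratch sketch below shows the principle with NumPy only; it is not the library's implementation, and the helper name, the mean-squared-error metric and the `n_repeats` argument are assumptions made for illustration.

```python
import numpy as np


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Hypothetical helper: mean increase in MSE after shuffling each column."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column only, leaving the others untouched.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances


rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0]  # only the first column carries signal
coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
perm_imp = permutation_importance_sketch(lambda A: A @ coef, X_demo, y_demo)
```

In practice the importance is usually computed on held-out data, which scikit-learn's `sklearn.inspection.permutation_importance` supports directly.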

arfs.feature_selection.allrelevant._get_shap_imp(estimator, X, y, sample_weight=None, cat_feature=None)[source]#

Get the SHAP feature importance (compatible with all SHAP versions)

Parameters:
  • estimator (estimator object) – An estimator object implementing fit and predict methods.

  • X (pd.DataFrame of shape [n_samples, n_features]) – Predictor matrix.

  • y (pd.Series of shape [n_samples]) – Target variable.

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.

  • cat_feature (list of int or None, default None) – The list of integers, columns loc, of the categorical predictors.

Returns:

shap_imp (array) – The SHAP importance array.

arfs.feature_selection.allrelevant._get_shap_imp_fast(estimator, X, y, sample_weight=None, cat_feature=None)[source]#

Get the SHAP feature importance using the fasttreeshap implementation

Parameters:
  • estimator (estimator object) – An estimator object implementing fit and predict methods.

  • X (pd.DataFrame of shape [n_samples, n_features]) – Predictor matrix.

  • y (pd.Series of shape [n_samples]) – Target variable.

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.

  • cat_feature (list of int or None, default None) – The list of integers, columns loc, of the categorical predictors. Avoids detecting and encoding each iteration if the exact same columns are passed to the selection methods.

Returns:

shap_imp (array) – The SHAP importance array.

arfs.feature_selection.allrelevant._merge_importance_df(df, importance, iter, n_folds, column_names, silent=True)[source]#

Merge the feature importance dataframe df with the importance information for the current iteration of a cross-validation loop.

Parameters:
  • df (pandas.DataFrame) – The current feature importance dataframe.

  • importance (dict) – A dictionary with the feature importance information for the current iteration.

  • iter (int) – The index of the current iteration.

  • n_folds (int) – The number of folds used in the cross-validation loop.

  • silent (bool, optional) – If True, suppress output.

Returns:

pandas.DataFrame – The updated feature importance dataframe.

arfs.feature_selection.allrelevant._reduce_vars_lgb_cv(X, y, objective, folds, n_folds, cutoff, n_iter, silent, weight, rf, fastshap, lgbm_params=None, n_jobs=0)[source]#

Reduce the number of predictors using a lightgbm (python API)

Parameters:
  • X (pd.DataFrame) – the dataframe to create shadow features on

  • y (pd.Series) – the target

  • objective (str) – the lightGBM objective

  • folds – (generator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None, optional (default=None)) If generator or iterator, it should yield the train and test indices for each fold. If object, it should be one of the scikit-learn splitter classes (https://scikit-learn.org/stable/modules/classes.html#splitter-classes) and have split method. This argument has highest priority over other data split arguments.

  • n_folds (int) – Number of folds in CV.

  • cutoff (float) – the value by which the max of shadow imp is divided, to compare to real importance

  • n_iter (int) – The number of repetitions of the cross-validation, to smooth out the feature importance noise

  • silent (bool) – Set to True if you don’t want to see the BoostARoota output printed. Will still show any errors or warnings that may occur

  • weight (pd.Series) – sample_weight, if any

  • rf (bool, default False) – the lightGBM implementation of the random forest

  • fastshap (bool) – enable or not the fasttreeshap implementation

  • lgbm_params (dict, optional) – dictionary of lightgbm parameters

  • n_jobs (int, default 0) – 0 means default number of threads in OpenMP for the best speed, set this to the number of real CPU cores, not the number of threads

Returns:

  • real_vars[‘feature’] (pd.DataFrame) – feature importance of the real predictors over iter

  • df (pd.DataFrame) – feature importance of the real+shadow predictors over iter

  • cutoff_shadow (float) – the feature importance threshold, to reject or not the predictors

arfs.feature_selection.allrelevant._reduce_vars_sklearn(X, y, estimator, this_round, cutoff, n_iterations, delta, silent, weight, imp_kind, cat_feature)[source]#

Private function, reduce the number of predictors using a sklearn estimator

Parameters:
  • X (pd.DataFrame) – the dataframe to create shadow features on

  • y (pd.Series) – the target

  • estimator (sklearn estimator) – the model to train, lightGBM recommended

  • this_round (int) – The number of times the core BoostARoota algorithm will run. Each round eliminates more and more features

  • cutoff (float) – the value by which the max of shadow imp is divided, to compare to real importance

  • n_iterations (int) – The number of iterations to average for the feature importance (on the same split), to reduce the variance

  • delta (float (0 < delta <= 1)) – Stopping criteria for whether another round is started

  • silent (bool) – Set to True if you don’t want to see the BoostARoota output printed. Will still show any errors or warnings that may occur

  • weight (pd.Series) – sample_weight, if any

  • imp_kind (str) – which feature importance to use: native, shap, fastshap or permutation

  • cat_feature (list of int or None) – the list of integers, cols loc, of the categorical predictors. Avoids detecting and encoding them at each iteration if the exact same columns are passed to the selection methods.

Returns:

  • criteria (bool) – whether the stopping criterion has been reached

  • real_vars[‘feature’] (pd.DataFrame) – feature importance of the real predictors over iter

  • df (pd.DataFrame) – feature importance of the real+shadow predictors over iter

  • mean_shadow (float) – the feature importance threshold, to reject or not the predictors

Raises:

ValueError – error if the feature importance type is not

arfs.feature_selection.allrelevant._select_tentative(tentative, imp_history, sha_max_history)[source]#

Select tentative features based on median importance values.

Parameters:
  • tentative (array-like of shape (n_tentative,)) – Array of indices representing tentative features.

  • imp_history (array-like of shape (n_iterations + 1, n_features)) – Importance values for each feature in each iteration.

  • sha_max_history (array-like of shape (n_iterations + 1,)) – The history of the maximum shadow feature importance in each iteration.

Returns:

tentative (array-like of shape (n_tentative_confirmed,)) – The confirmed tentative features based on their median importance values.
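The median comparison can be sketched as follows, in the spirit of BorutaPy. This is an illustration under assumptions: the helper name is hypothetical, and the first row of both histories is treated as an initialisation placeholder and skipped.

```python
import numpy as np


def select_tentative_sketch(tentative, imp_history, sha_max_history):
    """Hypothetical helper: confirm tentative features via median importance."""
    tentative = np.asarray(tentative)
    # Median importance of each tentative feature, skipping the assumed
    # placeholder row at index 0.
    tentative_median = np.median(imp_history[1:][:, tentative], axis=0)
    # Median of the per-iteration maximum shadow importance.
    shadow_median = np.median(sha_max_history[1:])
    return tentative[tentative_median > shadow_median]


imp_history = np.array(
    [[0.0, 0.0, 0.0], [0.2, 0.9, 0.1], [0.4, 0.8, 0.1], [0.6, 0.7, 0.2]]
)
sha_max_history = np.array([0.0, 0.3, 0.3, 0.5])
confirmed = select_tentative_sketch([0, 2], imp_history, sha_max_history)
```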

arfs.feature_selection.allrelevant._set_lgb_parameters(X, y, objective, rf, silent, n_jobs=0, lgbm_params=None)[source]#

Set parameters for a LightGBM model based on the input features and the objective.

Parameters:
  • X (numpy array or pandas DataFrame) – The feature matrix of the training data.

  • y (numpy array or pandas Series) – The target variable of the training data.

  • objective (str) – The objective function to optimize during training.

  • rf (bool, default False) – Whether to use random forest boosting.

  • silent (bool, default True) – Whether to print messages during parameter setting.

  • n_jobs (int, default 0) – 0 means default number of threads in OpenMP for the best speed, set this to the number of real CPU cores, not the number of threads

Return type:

dict

Returns:

dict – The dictionary of LightGBM parameters.

arfs.feature_selection.allrelevant._split_data(X, y, tridx, validx, weight=None)[source]#

Split data into train and validation sets based on provided indices.

Parameters:
  • X (pandas.DataFrame) – Features.

  • y (pandas.Series) – Target variable.

  • tridx (list) – Indices to be used for training.

  • validx (list) – Indices to be used for validation.

  • weight (pandas.Series, optional) – Weights for each sample, by default None.

Returns:

tuple of pandas.DataFrame and pandas.Series – X_train, X_val, y_train, y_val, weight_tr, weight_val
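A minimal sketch of such a split, assuming `tridx` and `validx` hold positional (integer) indices; the function name is hypothetical and the library may index differently.

```python
import pandas as pd


def split_data_sketch(X, y, tridx, validx, weight=None):
    """Hypothetical helper: positional train/validation split."""
    X_train, X_val = X.iloc[tridx], X.iloc[validx]
    y_train, y_val = y.iloc[tridx], y.iloc[validx]
    if weight is not None:
        weight_tr, weight_val = weight.iloc[tridx], weight.iloc[validx]
    else:
        # Without weights, both weight outputs are None.
        weight_tr, weight_val = None, None
    return X_train, X_val, y_train, y_val, weight_tr, weight_val


X_toy = pd.DataFrame({"a": range(5)})
y_toy = pd.Series([0, 1, 0, 1, 0])
parts = split_data_sketch(X_toy, y_toy, [0, 1, 2], [3, 4])
```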

arfs.feature_selection.allrelevant._split_fit_estimator(estimator, X, y, sample_weight=None, cat_feature=None)[source]#

Private function, split the train, test and fit the model

Parameters:
  • estimator (estimator object implementing 'fit' and 'predict') – The object to use to fit the data.

  • X (pd.DataFrame of shape [n_samples, n_features]) – predictor matrix

  • y (pd.Series of shape [n_samples]) – target

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample

  • cat_feature (list of int or None) – the list of integers, cols loc, of the categorical predictors. Avoids detecting and encoding them at each iteration if the exact same columns are passed to the selection methods.

Returns:

  • model – fitted model

  • X_tt (array [n_samples, n_features]) – the test split, predictors

  • y_tt (array [n_samples]) – the test split, target

arfs.feature_selection.allrelevant._train_lgb_model(X_train, y_train, weight_train, X_val, y_val, weight_val, category_cols=None, early_stopping_rounds=20, fastshap=False, **params)[source]#

Train a LightGBM model with the given training data and hyperparameters and return the trained model and its SHAP values.

Parameters:
  • X_train (array-like of shape (n_samples, n_features)) – The input training data.

  • y_train (array-like of shape (n_samples,)) – The target training data.

  • weight_train (array-like of shape (n_samples,)) – The sample weights for training data.

  • X_val (array-like of shape (n_val_samples, n_features)) – The input validation data.

  • y_val (array-like of shape (n_val_samples,)) – The target validation data.

  • weight_val (array-like of shape (n_val_samples,)) – The sample weights for validation data.

  • category_cols (array-like or None, optional (default=None)) – The indices of categorical columns. If None, no categorical columns will be considered.

  • early_stopping_rounds (int, optional (default=20)) – Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training.

  • fastshap (bool) – enable or not fasttreeshap implementation

  • **params (dict) – Other parameters passed to the LightGBM model.

Returns:

tuple of (Booster, numpy.ndarray, int) – The trained LightGBM model, its SHAP values for X_train, and the best iteration reached during training.

arfs.feature_selection.base module#

Base Submodule

This module provides a base class for selectors that use a statistic and a threshold.

Module Structure:#

  • BaseThresholdSelector: parent class for the “threshold-based” selectors

class arfs.feature_selection.base.BaseThresholdSelector(threshold=0.05, statistic_fn=None, greater_than_threshold=False)[source]#

Bases: SelectorMixin, BaseEstimator

Base class for threshold-based feature selection

Parameters:
  • threshold (float, default 0.05) – Features whose training-set statistic is greater/lower (geq/leq) than this threshold will be removed

  • statistic_fn (callable, optional) – The function for computing the statistic series. The index should be the column names and the values the computed statistic

  • greater_than_threshold (bool, default False) – Whether features are rejected when the statistic is lower or greater than the threshold

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features
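The statistic-plus-threshold pattern can be sketched without the class itself. The names `missing_ratio` and `threshold_select_sketch` are hypothetical, and the comparison direction (with `greater_than_threshold=False`, columns whose statistic stays below the threshold are kept) is an assumption rather than confirmed library behaviour.

```python
import numpy as np
import pandas as pd


def missing_ratio(X):
    """Hypothetical statistic_fn: fraction of missing values per column."""
    return X.isna().mean()


def threshold_select_sketch(X, statistic_fn, threshold=0.05, greater_than_threshold=False):
    """Hypothetical helper: keep columns on one side of the threshold."""
    stat = statistic_fn(X)  # pd.Series indexed by column name
    # Assumption: the flag flips which side of the threshold is kept.
    support = stat > threshold if greater_than_threshold else stat < threshold
    return X.columns[support.to_numpy()].tolist(), support.tolist()


X_toy = pd.DataFrame(
    {"full": [1.0, 2.0, 3.0, 4.0], "sparse": [np.nan, np.nan, 3.0, 4.0]}
)
selected, support = threshold_select_sketch(X_toy, missing_ratio)
```

Passing the statistic as a callable keeps the selection logic identical across selectors; only the function that produces the per-column series changes.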

fit(X, y=None, sample_weight=None)[source]#

Learn empirical statistics from X.

Parameters:
  • X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the statistic, where n_samples is the number of samples and n_features is the number of features.

  • y (any, default None) – Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

  • sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)

Returns:

self (object) – Returns the instance itself.

fit_transform(X, y=None, sample_weight=None, **fit_params)[source]#

Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • sample_weight (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Sample weight values.

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') BaseThresholdSelector#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Transform the data, returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

arfs.feature_selection.lasso module#

LassoFeatureSelection Submodule

This module provides LASSO-based feature selection, specifically designed for use with Generalized Linear Models (GLM). The Lasso Regularized GLM introduces an L1 regularization penalty (Lasso regularization), encouraging some coefficients to become exactly zero during the model fitting process. This regularization effectively removes irrelevant features from the model, making it a powerful tool for feature selection, particularly in datasets with numerous variables.

Module Structure:#

  • EnetGLM: class serves as a scikit-learn wrapper for the regularized statsmodels GLM, providing seamless integration with scikit-learn’s ecosystem.

  • weighted_cross_val_score: function allows users to pass weights to the model and define a custom scoring metric.

  • grid_search_cv: function performs a weighted LASSO grid search to find the best Lasso parameter for the model.

  • LassoFeatureSelection: class is the core feature selection class, estimating the Lasso parameter through the grid search process, enabling efficient and effective feature selection.

With this submodule, users can easily leverage Lasso Regularized GLMs and conduct feature selection, improving model performance and interpretability in various datasets.

class arfs.feature_selection.lasso.EnetGLM(family='gaussian', link=None, alpha=0.0, L1_wt=1e-06, fit_intercept=True)[source]#

Bases: BaseEstimator, RegressorMixin

Elastic Net Generalized Linear Model.

Parameters:
  • family (str, default="gaussian") – The distributional assumption of the model. It can be any of the statsmodels distributions: “gaussian”, “binomial”, “poisson”, “gamma”, “negativebinomial”, “tweedie”

  • link (str, optional) – the GLM link function. It can be any of: “identity”, “log”, “logit”, “probit”, “cloglog”, “inverse_squared”

  • alpha (float, optional (default=0.0)) – The penalty weight applied to the coefficients. The mix between the L1 and L2 penalties is controlled by L1_wt, not by alpha.

  • L1_wt (float, optional (default=1e-06)) – The weight of the L1 penalty term, 0 <= L1_wt <= 1. A value of 0 corresponds to ridge regression, while a value of 1 corresponds to lasso regression. To obtain statistics, L1_wt must be greater than 0: if it is set to 0.0, statsmodels returns a ridge regularized wrapper without refitting the model, making the statistics unavailable and breaking the class. Set L1_wt to a very small value, such as 1e-9, to obtain close-to-ridge behavior while still obtaining the necessary statistics.

  • fit_intercept (bool, optional (default=True)) – Whether to fit an intercept term in the model.

__init__(family='gaussian', link=None, alpha=0.0, L1_wt=1e-06, fit_intercept=True)[source]#

Initialize self.

Parameters:
  • family (str) – The distributional assumption of the model.

  • link (Optional[str]) – the GLM link function

  • alpha (float) – The penalty weight. If a scalar, the same penalty weight applies to all variables in the model. If a vector, it must have the same length as params, and contains a penalty weight for each coefficient.

  • L1_wt (float) – The L1_wt parameter represents the weight of the L1 penalty term in the model and should be within the range 0 to 1. A value of 0 corresponds to ridge regression, while a value of 1 corresponds to lasso regression. However, for obtaining statistics, L1_wt should be set to a value greater than 0. If it is set to 0.0, statsmodels returns a ridge regularized wrapper without refitting the model, making the statistics unavailable and breaking the class. Nevertheless, you can set L1_wt to a very small value, such as 1e-9, to obtain close-to-ridge behavior while still obtaining the necessary statistics.

  • fit_intercept (bool) – Whether to fit an intercept term in the model.
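For orientation, the elastic-net objective documented for the statsmodels `GLM.fit_regularized` method can be written as below, where ℓ is the log-likelihood, n the number of observations, α corresponds to alpha and w to L1_wt. That EnetGLM forwards both parameters unchanged to statsmodels is an assumption.

```latex
\min_{\beta}\; -\frac{1}{n}\,\ell(\beta)
  \;+\; \alpha \left[ \frac{1 - w}{2}\,\lVert \beta \rVert_2^2
  \;+\; w\,\lVert \beta \rVert_1 \right],
\qquad 0 \le w \le 1
```

With w = 1 only the L1 term remains (lasso), and as w approaches 0 the penalty approaches a pure ridge penalty, which matches the description of L1_wt above.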

fit(X, y, sample_weight=None)[source]#

Fit the model to the data.

Notes

In statsmodels and GLMs in general, you can use either an offset or a weight to account for differences in exposure between observations. However, if you choose to use an offset, you need to pass the number of cases (ncl) instead of the frequency and set the offset to the logarithm of the exposure due to the log link function. It is recommended to use the frequency and the weights instead of the offset because this ensures that all models have the same inputs. To use the frequency and the weights, you can fit the model using the following code:

```python
self.model = sm.GLM(endog=y, exog=X, var_weights=sample_weight, family=self.family)
```

This is equivalent to using the exposure and the log of the exposure internally, which can be done using the following code:

```python
self.model = sm.GLM(endog=y, exog=sm.add_constant(X), exposure=sample_weight, family=sm.families.Poisson())
self.result = self.model.fit()
```

Parameters:
  • X (DataFrame) – array-like, shape (n_samples, n_features) The input data.

  • y (Union[ndarray, Series]) – array-like, shape (n_samples,) The target values.

  • sample_weight (array-like, shape (n_samples,), optional (default=None)) – Sample weights.

Returns:

self (object) – Returns self.

get_coef()[source]#

Get the estimated coefficients of the fitted model.

Returns:

coef_ (array-like, shape (n_features,)) – The estimated coefficients of the fitted model.

predict(X)[source]#

Predict using the fitted model.

Parameters:

X – array-like, shape (n_samples, n_features) The input data.

Returns:

y (array-like, shape (n_samples,)) – The predicted target values.

Raises:

ValueError – If the model has not been fit.

score(X, y, sample_weight=None)[source]#

Return the deviance of the fitted model.

Parameters:
  • X (DataFrame) – array-like, shape (n_samples, n_features) The input data.

  • y (Union[ndarray, Series]) – array-like, shape (n_samples,) The target values.

  • sample_weight (array-like, shape (n_samples,), optional (default=None)) – Sample weights.

Returns:

deviance (float) – The deviance of the fitted model.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') EnetGLM#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') EnetGLM#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self (object) – The updated object.

summary()[source]#

Print a summary of the fitted model.

Returns:

summary (str) – The summary of the fitted model.

class arfs.feature_selection.lasso.LassoFeatureSelection(family='gaussian', link=None, n_iterations=10, score='bic', fit_intercept=True, n_jobs=-1)[source]#

Bases: BaseEstimator, TransformerMixin

LassoFeatureSelection performs feature selection using GLM Lasso regularization.

Parameters:
  • family (str, default="gaussian") – The distributional assumption of the model. It can be any of the statsmodels distributions: “gaussian”, “binomial”, “poisson”, “gamma”, “negativebinomial”, “tweedie”

  • link (str, optional) – the GLM link function. It can be any of the: “identity”, “log”, “logit”, “probit”, “cloglog”, “inverse_squared”

  • n_iterations (int, default 10) – Number of iterations for the grid search.

  • score (str, default "bic") – The score to use for model selection. Options: “bic” (Bayesian Information Criterion) or “mean_cv” (mean cross-validation score).

  • n_jobs (int, default -1) – the number of processes. -1 means all the processes

Variables:
  • family (str) – The family of the GLM.

  • n_iterations (int) – Number of iterations for the grid search.

  • best_estimator (EnetGLM) – The best estimator found after grid search cross-validation.

  • selected_features (ndarray) – The selected feature names.

  • support (ndarray) – The support of selected features (True for selected, False otherwise).

  • feature_names_in (ndarray) – The input feature names.

  • score (str) – The score used for model selection.

  • n_jobs (int) – the number of processes. -1 means all the processes

fit(X, y=None, sample_weight=None)[source]#

Fit the LassoFeatureSelection model and select the best features.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.

  • y (Optional[Union[pd.Series, np.ndarray]], default None) – The target values, can be either a pandas Series or a numpy array.

  • sample_weight (Optional[Union[pd.Series, np.ndarray]], default None) – Sample weights to be used during training. Can be either a pandas Series or a numpy array.

Returns:

LassoFeatureSelection – The fitted LassoFeatureSelection model.

get_feature_names_out()[source]#

Get the names of the selected features.

Return type:

ndarray

Returns:

np.ndarray – The names of the selected features.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') LassoFeatureSelection#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Transform the input data to keep only the selected features.

Parameters:

X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.

Return type:

DataFrame

Returns:

Union[pd.DataFrame, np.ndarray] – The transformed data with only the selected features.

arfs.feature_selection.lasso._fit_and_score(estimator, X, y, train_index, test_index, sample_weight=None)[source]#

Fit and score an estimator on a specified train-test split.

Parameters:
  • estimator (BaseEstimator) – The estimator object implementing the scikit-learn estimator interface.

  • X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.

  • y (Union[pd.Series, np.ndarray]) – The target values, can be either a pandas Series or a numpy array.

  • train_index (np.ndarray) – Array of indices representing the training data.

  • test_index (np.ndarray) – Array of indices representing the test data.

  • sample_weight (Optional[Union[pd.Series, np.ndarray]], default None) – Sample weights to be used during training. Can be either a pandas Series or a numpy array.

Return type:

float

Returns:

float – The score of the estimator on the test data.

Raises:

ValueError – If the input data is not of the correct format.

arfs.feature_selection.lasso.drop_existing_sm_constant_from_df(X)[source]#
arfs.feature_selection.lasso.grid_search_cv(X, y, sample_weight=None, n_iterations=10, family='gaussian', link=None, score='bic', fit_intercept=True, n_jobs=-1)[source]#

Perform grid search cross-validation for an Elastic Net Generalized Linear Model (EnetGLM).

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.

  • y (Union[pd.Series, np.ndarray]) – The target values, can be either a pandas Series or a numpy array.

  • sample_weight (Optional[Union[pd.Series, np.ndarray]], default None) – Sample weights to be used during training. Can be either a pandas Series or a numpy array.

  • n_iterations (int, default 10) – Number of iterations for the grid search.

  • family (str, default "gaussian") – The family of the GLM. Options: “gaussian”, “poisson”, “gamma”, “negativebinomial”, “binomial”, “tweedie”.

  • link (str, optional) – The GLM link function. Options: “identity”, “log”, “logit”, “probit”, “cloglog”, “inverse_squared”.

  • score (str, default "bic") – The score to use for model selection. Options: “bic” (Bayesian Information Criterion) or “mean_cv” (mean cross-validation score).

  • n_jobs (int, default -1) – The number of processes to run in parallel.

Return type:

EnetGLM

Returns:

EnetGLM – The best estimator found after grid search cross-validation.

Raises:

ValueError – If the input data is not of the correct format or if an invalid family or score value is provided.
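To illustrate what selecting a model with score="bic" means, here is a minimal sketch that swaps in scikit-learn's Lasso for the EnetGLM and uses the Gaussian BIC (up to an additive constant). The helper names `gaussian_bic` and `select_alpha_by_bic` are hypothetical, and the grid of penalties is an assumption; the actual arfs search may differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso


def gaussian_bic(y, y_pred, n_params):
    """Gaussian BIC up to a constant: n * log(RSS / n) + k * log(n)."""
    n = len(y)
    rss = float(np.sum((y - y_pred) ** 2))
    return n * np.log(rss / n) + n_params * np.log(n)


def select_alpha_by_bic(X, y, alphas):
    """Fit one penalised model per alpha and keep the one with the lowest BIC."""
    best_model, best_alpha, best_bic = None, None, np.inf
    for alpha in alphas:
        model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
        # Count the non-zero coefficients plus the intercept.
        n_params = int(np.count_nonzero(model.coef_)) + 1
        bic = gaussian_bic(y, model.predict(X), n_params)
        if bic < best_bic:
            best_model, best_alpha, best_bic = model, alpha, bic
    return best_model, best_alpha, best_bic


X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)
alphas = np.logspace(-2, 2, 10)
best_model, best_alpha, best_bic = select_alpha_by_bic(X_demo, y_demo, alphas)
```

The BIC penalises every extra non-zero coefficient by log(n), so it tends to prefer sparser models than a pure goodness-of-fit criterion would.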

arfs.feature_selection.lasso.weighted_cross_val_score(estimator, X, y, sample_weight=None, cv=5, n_jobs=-1)[source]#

Perform cross-validation for a scikit-learn estimator with a score function that requires sample_weight.

Parameters:
  • estimator (estimator) – The scikit-learn estimator object.

  • X (array-like of shape (n_samples, n_features)) – The input features.

  • y (array-like of shape (n_samples,)) – The target variable.

  • sample_weight (array-like of shape (n_samples,), optional) – The sample weights for each data point.

  • cv (int, default 5) – The number of cross-validation folds.

  • n_jobs (int, default -1) – The number of processes to run in parallel.

Returns:

  • scores (array of shape (cv,)) – The list of scores for each fold.

  • average_score (float) – The average score across all folds.
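A minimal sketch of weighted cross-validation, assuming the weights are sliced per fold and passed to both `fit` and `score`. The function name `weighted_cv_sketch` and the use of a shuffled `KFold` are illustrative assumptions, not the arfs implementation, and the sketch runs sequentially (no `n_jobs`).

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def weighted_cv_sketch(estimator, X, y, sample_weight=None, cv=5):
    """Return the per-fold scores and their average, using per-fold sample weights."""
    X, y = np.asarray(X), np.asarray(y)
    w = np.ones(len(y)) if sample_weight is None else np.asarray(sample_weight)
    scores = []
    splitter = KFold(n_splits=cv, shuffle=True, random_state=0)
    for train_idx, test_idx in splitter.split(X):
        model = clone(estimator).fit(X[train_idx], y[train_idx], sample_weight=w[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx], sample_weight=w[test_idx]))
    scores = np.asarray(scores)
    return scores, float(scores.mean())


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)
w_demo = rng.uniform(0.5, 2.0, size=200)

scores, average_score = weighted_cv_sketch(LinearRegression(), X_demo, y_demo, w_demo, cv=5)
```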

arfs.feature_selection.mrmr module#

MRMR Feature Selection Module

This module provides MinRedundancyMaxRelevance (MRMR) feature selection for classification or regression tasks. In a classification task, the target should be of object or pandas category dtype, while in a regression task, the target should be numeric. The predictors can be categorical or numerical without requiring encoding, as the appropriate method (correlation, correlation ratio, or Theil’s U) will be automatically selected based on the data type.

Module Structure:#

  • MinRedundancyMaxRelevance: MRMR feature selection class for classification or regression tasks.

class arfs.feature_selection.mrmr.MinRedundancyMaxRelevance(n_features_to_select, relevance_func=None, redundancy_func=None, task='regression', denominator_func=<function mean>, only_same_domain=False, return_scores=False, n_jobs=1, show_progress=True)[source]#

Bases: SelectorMixin, BaseEstimator

MRMR feature selection for a classification or a regression task. For a classification task, the target should be of object or pandas category dtype. For a regression task, the target should be numeric. The predictors can be categorical or numerical; no encoding is required. The dtype is detected automatically and the appropriate method is applied (correlation, correlation ratio or Theil’s U).

Parameters:
  • n_features_to_select (int) – Number of features to select.

  • relevance_func (callable, optional) – Relevance method. If callable, it should take “X”, “y”, “sample_weight” as input and return a pandas.Series containing a score of relevance for each feature.

  • redundancy_func (callable, optional) – Redundancy method. If callable, it should take “X”, “sample_weight” as input and return a pandas.Series containing a score of redundancy for each feature.

  • denominator_func (str or callable (optional, default 'mean')) – Synthesis function to apply to the denominator of MRMR score. If string, name of method. Supported: ‘max’, ‘mean’. If callable, it should take an iterable as input and return a scalar.

  • task (str) – either “regression” or “classification”

  • only_same_domain (bool (optional, default False)) – If False, all the necessary correlation coefficients are computed. If True, only features belonging to the same domain are compared. Domain is defined by the string preceding the first underscore: for instance “cusinfo_age” and “cusinfo_income” belong to the same domain, whereas “age” and “income” don’t.

  • return_scores (bool (optional, default False)) – If False, only the list of selected features is returned. If True, a tuple containing (list of selected features, relevance, redundancy) is returned.

  • n_jobs (int (optional, default 1)) – Maximum number of workers to use. Only used when relevance = “f” or redundancy = “corr”. If -1, use as many workers as min(cpu count, number of features).

  • show_progress (bool (optional, default True)) – If False, no progress bar is displayed. If True, a TQDM progress bar shows the number of features processed.
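The domain rule used by only_same_domain can be sketched in a few lines. The helper names `feature_domain` and `same_domain` are hypothetical, and treating a name without an underscore as having no shared domain is an assumption based on the “age” / “income” example above.

```python
def feature_domain(name: str) -> str:
    """Return the string preceding the first underscore of a feature name."""
    return name.split("_", 1)[0]


def same_domain(first: str, second: str) -> bool:
    """Two features share a domain when both carry the same prefix before '_'."""
    if "_" not in first or "_" not in second:
        # Assumption: names without an underscore do not define a domain.
        return False
    return feature_domain(first) == feature_domain(second)


shared = same_domain("cusinfo_age", "cusinfo_income")
not_shared = same_domain("age", "income")
```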

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • ranking (pd.DataFrame) – name and scores for the selected features

  • support (list of bool) – the list of the selected X-columns
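The greedy MRMR loop can be sketched as follows. This is a simplified illustration for numeric predictors only, using absolute Pearson correlation for both relevance and redundancy; the function name `mrmr_sketch` and the lower clip on the denominator are assumptions, and the arfs class additionally handles categorical data and sample weights.

```python
import numpy as np
import pandas as pd


def mrmr_sketch(X, y, n_features_to_select, denominator_func=np.mean):
    """Greedily pick features maximising relevance / redundancy with those already picked."""
    relevance = X.corrwith(y).abs()   # association of each feature with the target
    redundancy = X.corr().abs()       # association between features
    selected, candidates = [], list(X.columns)
    while candidates and len(selected) < n_features_to_select:
        if selected:
            block = redundancy.loc[candidates, selected].to_numpy()
            denom = pd.Series([denominator_func(row) for row in block], index=candidates)
            denom = denom.clip(lower=1e-3)  # avoid dividing by (almost) zero
        else:
            denom = pd.Series(1.0, index=candidates)
        score = relevance.loc[candidates] / denom
        best = score.idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X_demo = pd.DataFrame({"a": a, "a_copy": a + 0.01 * rng.normal(size=500), "b": b})
y_demo = pd.Series(a + 0.5 * b + 0.1 * rng.normal(size=500))

selected = mrmr_sketch(X_demo, y_demo, n_features_to_select=2)
```

In this toy data the near-duplicate of the first pick is penalised by its high redundancy, so the second pick is the weaker but non-redundant feature "b".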

Example

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.mrmr import MinRedundancyMaxRelevance
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> y.name = "target"
>>> fs_mrmr = MinRedundancyMaxRelevance(
...     n_features_to_select=5,
...     relevance_func=None,
...     redundancy_func=None,
...     task="regression",  # or "classification"
...     denominator_func=np.mean,
...     only_same_domain=False,
...     return_scores=False,
...     show_progress=True)
>>> # classification variant: fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
>>> fs_mrmr.fit(X=X, y=y, sample_weight=None)

fit(X, y, sample_weight=None)[source]#

Fit the MRMR selector by learning the associations.

Parameters:
  • X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the associations, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like or pd.Series of shape (n_samples,)) – Target vector. Must be numeric for regression or categorical for classification.

  • sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)

Returns:

self (object) – If return_scores=False, returns self. If return_scores=True, returns (selected_features, relevance_scores).

fit_transform(X, y, sample_weight=None, **fit_params)[source]#

Fit to data, then transform it. Fits transformer to X and y and optionally sample_weight with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • sample_weight (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – sample weight values.

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

run_feature_selection()[source]#
select_next_feature(not_selected_features, selected_features, relevance, redundancy)[source]#
set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') MinRedundancyMaxRelevance#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Transform the data, returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

update_ranks(best_feature, score, score_denominator)[source]#

arfs.feature_selection.summary module#

Feature Selection Summary Module

This module provides a function for creating the summary report of a FS pipeline

Module Structure:#

  • make_fs_summary: main function for creating the summary

  • highlight_discarded: function for creating the style of the pd.DataFrame

arfs.feature_selection.summary.highlight_discarded(s)[source]#

highlight X in red and V in green.

Parameters:

s (array-like of shape (n_features,)) – the boolean array for defining the style

arfs.feature_selection.summary.make_fs_summary(selector_pipe)[source]#

make_fs_summary makes a summary dataframe highlighting at which step a given predictor has been rejected (if any).

Parameters:

selector_pipe (sklearn.pipeline.Pipeline) – the feature selector pipeline.

Examples

>>> from sklearn.pipeline import Pipeline
>>> # df, predictors, target and weight are assumed to be defined beforehand
>>> groot_pipeline = Pipeline([
...     ('missing', MissingValueThreshold()),
...     ('unique', UniqueValuesThreshold()),
...     ('cardinality', CardinalityThreshold()),
...     ('collinearity', CollinearityThreshold(threshold=0.5)),
...     ('lowimp', VariableImportance(eval_metric='poisson', objective='poisson', verbose=2)),
...     ('grootcv', GrootCV(objective='poisson', cutoff=1, n_folds=3, n_iter=5))])
>>> groot_pipeline.fit_transform(
...     X=df[predictors],
...     y=df[target],
...     lowimp__sample_weight=df[weight],
...     grootcv__sample_weight=df[weight])
>>> fs_summary_df = make_fs_summary(groot_pipeline)
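To make the idea of a selection summary concrete, here is a minimal sketch built only on scikit-learn selectors. It is not the arfs implementation: `summarize_selection` is a hypothetical helper, and the 1 / 0 / NaN encoding (kept / rejected at this step / rejected earlier) is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_regression
from sklearn.pipeline import Pipeline


def summarize_selection(pipe, feature_names):
    """Build one column per pipeline step: 1 = kept, 0 = rejected here, NaN = rejected earlier."""
    summary = pd.DataFrame(index=pd.Index(feature_names, name="predictor"))
    remaining = np.asarray(feature_names, dtype=object)
    for name, step in pipe.steps:
        mask = step.get_support()  # boolean mask over the features seen by this step
        status = pd.Series(np.nan, index=summary.index)
        status.loc[list(remaining)] = mask.astype(float)
        summary[name] = status
        remaining = remaining[mask]
    return summary


rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["x0", "x1", "x2", "x3"])
X_demo["const"] = 1.0
y_demo = 2.0 * X_demo["x0"] - X_demo["x1"] + 0.1 * rng.normal(size=100)

pipe = Pipeline([
    ("variance", VarianceThreshold()),
    ("kbest", SelectKBest(f_regression, k=2)),
])
pipe.fit(X_demo, y_demo)
summary = summarize_selection(pipe, list(X_demo.columns))
```

Reading the table row by row shows at which step each predictor left the pipeline, which is the information the summary dataframe is meant to convey.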

arfs.feature_selection.unsupervised module#

Unsupervised Feature Selection

This module provides selectors using unsupervised statistics and a threshold

Module Structure:#

  • MissingValueThreshold: child class of the BaseThresholdSelector, filters out columns with too many missing values

  • UniqueValuesThreshold: child class of the BaseThresholdSelector, filters out columns with zero variance

  • CardinalityThreshold: child class of the BaseThresholdSelector, filters out categorical columns with too many levels

  • CollinearityThreshold: filters out collinear columns

class arfs.feature_selection.unsupervised.CardinalityThreshold(threshold=1000)[source]#

Bases: BaseThresholdSelector

Feature selector that removes all categorical features with more unique values than the threshold. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:

threshold (int, default = 1000) – Categorical features with more unique values in the training set than this threshold will be removed. The threshold should be >= 1.

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.unsupervised import CardinalityThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = CardinalityThreshold(100)
>>> selector.fit_transform(X)

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CardinalityThreshold#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

class arfs.feature_selection.unsupervised.CollinearityThreshold(threshold=0.8, method='association', n_jobs=1, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=<function weighted_corr>, nom_num_assoc=<function correlation_ratio>)[source]#

Bases: SelectorMixin, BaseEstimator

Feature selector that removes collinear features. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning. It computes the association between features (continuous or categorical), stores the pairs of collinear features and removes one feature of each pair whose association value is above the threshold.

The association measures are the Spearman correlation coefficient, correlation ratio and Theil’s U. The association matrix is not necessarily symmetrical.

If the method is set to “correlation”, the data are encoded as integers and the Spearman correlation coefficient is used for all pairs. This is faster but not a best practice, because the categorical variables are then treated as numeric.
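The threshold-based removal can be sketched for the purely numeric case. This illustration uses the Spearman correlation only (computed as the Pearson correlation of the ranks) and drops the later column of each collinear pair; the function name `drop_collinear` and that tie-breaking choice are assumptions, not the arfs behaviour.

```python
import numpy as np
import pandas as pd


def drop_collinear(X: pd.DataFrame, threshold: float = 0.8):
    """Drop one column of every pair whose absolute Spearman correlation exceeds the threshold."""
    # Spearman correlation equals the Pearson correlation of the ranks.
    assoc = X.rank().corr().abs()
    # Keep the strict upper triangle so that each pair is inspected once.
    upper = assoc.where(np.triu(np.ones(assoc.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_demo = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + 0.01 * rng.normal(size=200),  # almost a copy of "a"
    "b": rng.normal(size=200),
})
X_reduced, dropped = drop_collinear(X_demo, threshold=0.8)
```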

Parameters:
  • threshold (float, default = .8) – Features whose association with another feature is larger than this threshold will be removed. The threshold should be > 0 and <= 1.

  • method (str, default = "association") – method for computing the association matrix. Either “association” or “correlation”. Correlation leads to encoding of categorical variables as numeric

  • n_jobs (int, default = 1) – the number of threads, -1 uses all the threads for computing the association matrix

  • nom_nom_assoc (str or callable, default = "theil") – the categorical-categorical association measure, by default Theil’s U, not symmetrical!

  • num_num_assoc (str or callable, default = "spearman") – the numeric-numeric association measure

  • nom_num_assoc (str or callable, default = "correlation_ratio") – the numeric-categorical association measure

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • assoc_matrix (pd.DataFrame) – the square association matrix

  • collinearity_summary (pd.DataFrame) – the pairs of collinear features and the association values

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.unsupervised import CollinearityThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = CollinearityThreshold(threshold=0.75)
>>> selector.fit_transform(X)

fit(X, y=None, sample_weight=None)[source]#

Learn empirical associations from X.

Parameters:
  • X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the associations, where n_samples is the number of samples and n_features is the number of features.

  • y (any, default None) – Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

  • sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)

Returns:

self (object) – Returns the instance itself.

plot_association(ax=None, cmap='PuOr', figsize=None, cbar_kw=None, imgshow_kw=None)[source]#

plot_association plots the association matrix

Parameters:
  • ax (matplotlib.axes.Axes, optional) – the mpl axes if the figure object exists already, by default None

  • cmap (str, optional) – colormap name, by default “PuOr”

  • figsize (tuple of float, optional) – figure size, by default None

  • cbar_kw (dict, optional) – colorbar kwargs, by default None

  • imgshow_kw (dict, optional) – imgshow kwargs, by default None

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CollinearityThreshold#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Reduce X to the selected features.

Parameters:

X (array of shape [n_samples, n_features]) – The input samples.

Returns:

X_r (array of shape [n_samples, n_selected_features]) – The input samples with only the selected features.

class arfs.feature_selection.unsupervised.MissingValueThreshold(threshold=0.05)[source]#

Bases: BaseThresholdSelector

Feature selector that removes all high missing percentage features. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:

threshold (float, default = .05) – Features whose fraction of missing values in the training set is larger than this threshold will be removed.
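The rule can be sketched with plain pandas. The function name `missing_value_filter` is hypothetical, and the sketch assumes the statistic is the per-column fraction of missing values.

```python
import numpy as np
import pandas as pd


def missing_value_filter(X: pd.DataFrame, threshold: float = 0.05) -> list:
    """Return the columns whose fraction of missing values does not exceed the threshold."""
    missing_fraction = X.isna().mean()
    return list(missing_fraction[missing_fraction <= threshold].index)


X_demo = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0, 4.0],
    "half_missing": [1.0, np.nan, 3.0, np.nan],
})
kept = missing_value_filter(X_demo, threshold=0.05)
```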

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.unsupervised import MissingValueThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = MissingValueThreshold(0.05)
>>> selector.fit_transform(X)

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') MissingValueThreshold#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

class arfs.feature_selection.unsupervised.UniqueValuesThreshold(threshold=1)[source]#

Bases: BaseThresholdSelector

Feature selector that removes all features with zero variance (a single unique value) or, more generally, columns with too few unique values relative to the threshold. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:

threshold (int, default = 1) – Features with too few unique values in the training set, as measured against this threshold, will be removed; with the default of 1, these are the constant columns. The threshold should be >= 1.
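A minimal pandas sketch of this rule, assuming a column is kept only when its number of unique values is strictly greater than the threshold (which is consistent with the default of 1 removing constant columns). The function name `unique_values_filter` is hypothetical.

```python
import pandas as pd


def unique_values_filter(X: pd.DataFrame, threshold: int = 1) -> list:
    """Return the columns with strictly more unique values than the threshold."""
    n_unique = X.nunique()
    return list(n_unique[n_unique > threshold].index)


X_demo = pd.DataFrame({
    "constant": [7, 7, 7, 7],
    "binary": [0, 1, 0, 1],
    "varied": [1, 2, 3, 4],
})
kept_default = unique_values_filter(X_demo, threshold=1)
kept_strict = unique_values_filter(X_demo, threshold=2)
```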

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.unsupervised import UniqueValuesThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = UniqueValuesThreshold(1)
>>> selector.fit_transform(X)

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') UniqueValuesThreshold#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

arfs.feature_selection.unsupervised._pandas_count_unique_values_cat_features(X)[source]#

Counts the number of unique values in categorical features of a pandas DataFrame.

Parameters:

X (pandas DataFrame) – The input data.

Returns:

pandas Series – The number of unique values in each categorical feature.

Raises:

TypeError – If the input data is not a pandas DataFrame.
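A sketch of what such a helper might look like, assuming categorical features are the columns of category or object dtype. The name `count_unique_cat_values` is hypothetical and the dtype selection is an assumption, not the arfs code.

```python
import pandas as pd


def count_unique_cat_values(X) -> pd.Series:
    """Count the unique values of each categorical column of a DataFrame."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("X must be a pandas DataFrame")
    # Assumption: categorical features are the category or object columns.
    cat_cols = X.select_dtypes(include=["category", "object"]).columns
    return X[cat_cols].nunique()


X_demo = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red", "blue"], dtype="category"),
    "size": [1.0, 2.0, 3.0, 4.0],
})
counts = count_unique_cat_values(X_demo)

try:
    count_unique_cat_values([[1, 2], [3, 4]])
    raised = False
except TypeError:
    raised = True
```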

arfs.feature_selection.variable_importance module#

Supervised Feature Selection

This module provides selectors using supervised statistics and a threshold, using SHAP, permutation importance or impurity (Gini) importance.

Module Structure:#

  • VariableImportance main class for identifying non-important features

class arfs.feature_selection.variable_importance.VariableImportance(task='regression', encode=True, n_iterations=10, threshold=0.99, lgb_kwargs={'objective': 'rmse', 'zero_as_missing': False}, encoder_kwargs=None, fastshap=False, verbose=-1)[source]#

Bases: SelectorMixin, BaseEstimator

Feature selector that removes predictors with zero or low variable importance.

Identify the features with zero or low importance according to the SHAP values of a lightgbm model. The GBM can be trained with early stopping on a validation set to prevent overfitting. The feature importances are averaged over n_iterations to reduce the variance. The predictors are then ranked from the most important to the least important and the cumulative variable importance is computed. All the predictors that do not contribute (VI=0), or that are not needed to reach the threshold of cumulative importance, are removed.
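The cumulative-importance cut can be sketched with pandas alone, starting from importances that are already averaged. The function name `select_by_cumulative_importance` is hypothetical, and the exact rule (keep features until the cumulative share first reaches the threshold, drop zero-importance ones) is an assumption about the boundary handling.

```python
import pandas as pd


def select_by_cumulative_importance(importance: pd.Series, threshold: float = 0.99):
    """Keep the top-ranked features needed to reach the cumulative importance threshold."""
    ranked = importance.sort_values(ascending=False)
    share = ranked / ranked.sum()
    cumulative = share.cumsum()
    # A feature is dropped if the threshold was already reached before adding it.
    reached_before = cumulative.shift(1, fill_value=0.0) >= threshold
    keep = (ranked > 0) & ~reached_before
    return list(ranked[keep].index), cumulative


importance_demo = pd.Series({"a": 50.0, "b": 30.0, "c": 15.0, "d": 5.0, "e": 0.0})
selected, cumulative = select_by_cumulative_importance(importance_demo, threshold=0.9)
```

With a threshold of 0.9, the first three features already account for 95% of the total importance, so the remaining two are rejected.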

Parameters:
  • task (string) – The machine learning task, either ‘classification’, ‘regression’ or ‘multiclass’. Make sure that the objective function in lgb_kwargs is consistent with the task.

  • encode (boolean, default = True) – Whether or not to encode the predictors

  • n_iterations (int, default = 10) – Number of iterations, the more iterations, the smaller the variance

  • threshold (float, default = .99) – The selector computes the cumulative feature importance and ranks the predictors from the most important to the least important. All the predictors contributing to less than this value are rejected.

  • lgb_kwargs (dictionary of keyword arguments) – dictionary of lightgbm estimators parameters with at least the objective function {‘objective’:’rmse’}

  • encoder_kwargs (dictionary of keyword arguments, optional) – dictionary of the OrdinalEncoderPandas parameters

  • fastshap (boolean) – enable or not the fasttreeshap implementation

  • verbose (int, default = -1) – controls the progress bar, > 1 print out progress

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • assoc_matrix (pd.DataFrame) – the square association matrix

  • collinearity_summary (pd.DataFrame) – the pairs of collinear features and the association values

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.variable_importance import VariableImportance
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = VariableImportance(threshold=0.75)
>>> selector.fit_transform(X, y)

fit(X, y, sample_weight=None)[source]#

Learn variable importance from X and y, supervised learning.

Parameters:
  • X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the variable importance, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like or pd.Series of shape (n_samples,)) – Target values.

  • sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)

Returns:

self (object) – Returns the instance itself.

fit_transform(X, y=None, sample_weight=None)[source]#

Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

plot_importance(figsize=None, plot_n=50, n_feat_per_inch=3, log=True, style=None)[source]#

Plots plot_n most important features and the cumulative importance of features. If threshold is provided, prints the number of features needed to reach threshold cumulative importance.

Parameters:
  • plot_n (int, default = 50) – Number of most important features to plot. Defaults to 50 or the maximum number of features, whichever is smaller.

  • n_feat_per_inch (int) – number of features per inch, the larger the less space between labels

  • figsize (tuple of float, optional) – The rendered size as a percentage size

  • log (bool, default True) – Whether or not render variable importance on a log scale

  • style (bool, default False) – set arfs style or not

Returns:

hv.plot – the feature importances holoviews object

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') VariableImportance#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Transform the data, returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.

Returns:

X (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

Raises:

TypeError – if the input is not a pd.DataFrame

Module contents#

class arfs.feature_selection.BoostAGroota(estimator=None, cutoff=4, iters=10, max_rounds=500, delta=0.1, silent=True, importance='shap')[source]#

Bases: SelectorMixin, BaseEstimator

BoostAGroota is an all-relevant feature selection method, while most others are minimal optimal. It tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some estimator has a minimal error.

Why bother with all-relevant feature selection? When you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (minimal optimal set of features by definition depends on your estimator choice).

Parameters:
  • estimator (scikit-learn estimator) – The model to train, lightGBM recommended, see the reduce lightgbm method.

  • cutoff (float) – The value by which the max of shadow imp is divided, to compare to real importance.

  • iters (int (>0)) – The number of iterations to average for the feature importance (on the same split), to reduce the variance.

  • max_rounds (int (>0)) – The number of times the core BoostAGroota algorithm will run. Each round eliminates more and more features.

  • delta (float (0 < delta <= 1)) – Stopping criteria for whether another round is started.

  • silent (bool) – Set to True if you don’t want to see the BoostAGroota output printed.

  • importance (str, default 'shap') – The kind of feature importance to use. Possible values: ‘shap’ (Shapley values), ‘pimp’ (permutation importance), and ‘native’ (Gini/impurity).

Variables:
  • selected_features (list of str) – The list of columns to keep.

  • ranking (array of shape [n_features]) – The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1, and tentative features are assigned rank 2.

  • ranking_absolutes (array of shape [n_features]) – The absolute feature ranking as ordered by the selection process. It does not guarantee that this order is correct for all models. For a model-agnostic ranking, see the attribute ranking.

  • sha_cutoff_df (dataframe) – Feature importance of the real+shadow predictors over iterations.

  • mean_shadow (float) – The threshold below which the predictors are rejected.
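The role of cutoff can be illustrated with a minimal sketch. It assumes the rule stated for the cutoff parameter (the maximum shadow importance divided by cutoff is the threshold that real importances are compared to). The function name and the toy importances are hypothetical, and this is not the library's implementation.

```python
def select_above_shadow(real_imp, shadow_imp, cutoff=4.0):
    """Keep the real features whose importance exceeds max(shadow) / cutoff.

    Illustrative only: `real_imp` maps feature name -> mean importance and
    `shadow_imp` lists the importances of the shuffled (shadow) copies.
    """
    threshold = max(shadow_imp) / cutoff
    kept = [name for name, imp in real_imp.items() if imp > threshold]
    return kept, threshold


# Toy importances, made up for the illustration
real_imp = {"age": 0.40, "income": 0.12, "noise": 0.01}
shadow_imp = [0.02, 0.08, 0.05]

kept_default, thr_default = select_above_shadow(real_imp, shadow_imp, cutoff=4.0)
kept_strict, thr_strict = select_above_shadow(real_imp, shadow_imp, cutoff=0.5)
```

A larger cutoff lowers the threshold, so more features survive; a cutoff below 1 makes the selection stricter than the raw shadow maximum.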

Examples

>>> from lightgbm import LGBMRegressor
>>> from arfs.feature_selection import BoostAGroota
>>> X = df[filtered_features].copy()
>>> y = df['target'].copy()
>>> w = df['weight'].copy()
>>> model = LGBMRegressor(n_jobs=-1, n_estimators=100, objective='rmse', random_state=42, verbose=0)
>>> feat_selector = BoostAGroota(estimator=model, cutoff=1, iters=10, max_rounds=10, delta=0.1, importance='shap')
>>> feat_selector.fit(X, y, sample_weight=None)
>>> print(feat_selector.selected_features_)
>>> feat_selector.plot_importance(n_feat_per_inch=5)
fit(X, y, sample_weight=None)[source]#

Fit the BoostAGroota transformer with the provided estimator.

Parameters:
  • X (pd.DataFrame) – the predictors matrix

  • y (pd.Series) – the target

  • sample_weight (pd.Series) – sample_weight, if any

plot_importance(n_feat_per_inch=5)[source]#

Boxplot of the variable importance, ordered by magnitude. The max shadow variable importance is illustrated by the dashed line. Requires the fit method to be applied first.

Parameters:

n_feat_per_inch (int, default 5) – Number of features to plot per inch (for scaling the figure).

Returns:

fig (plt.figure or None) – The matplotlib figure object containing the boxplot, or None if there are no selected features.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') BoostAGroota#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Reduce X to the selected features.

Parameters:

X (array of shape [n_samples, n_features]) – The input samples.

Returns:

X_r (array of shape [n_samples, n_selected_features]) – The input samples with only the selected features.

class arfs.feature_selection.CardinalityThreshold(threshold=1000)[source]#

Bases: BaseThresholdSelector

Feature selector that removes all categorical features with more unique values than threshold. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:

threshold (int, default = 1000) – Features with more unique values than this threshold will be removed. The threshold should be >= 1

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features
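The selection rule can be sketched in a few lines. This is a dependency-free illustration, assuming that a column is kept when its number of unique values does not exceed the threshold; the dict-of-lists table and the function name are hypothetical stand-ins for a DataFrame and the real selector.

```python
def low_cardinality_columns(table, threshold=1000):
    """Return the names of the columns with at most `threshold` unique values.

    `table` maps column name -> list of values (a stand-in for the
    categorical columns of a DataFrame).
    """
    return [name for name, values in table.items() if len(set(values)) <= threshold]


table = {
    "country": ["BE", "FR", "BE", "NL", "FR", "BE"],  # 3 unique values
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],  # 6 unique values
}
selected = low_cardinality_columns(table, threshold=3)
```

Identifier-like columns such as user_id are the typical target: they have almost one value per row and rarely generalize.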

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import CardinalityThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = CardinalityThreshold(100)
>>> selector.fit_transform(X)
set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CardinalityThreshold#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

class arfs.feature_selection.CollinearityThreshold(threshold=0.8, method='association', n_jobs=1, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=<function weighted_corr>, nom_num_assoc=<function correlation_ratio>)[source]#

Bases: SelectorMixin, BaseEstimator

Feature selector that removes collinear features. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning. It computes the association between features (continuous or categorical), stores the pairs of collinear features and, for every pair with an association value above the threshold, removes one of the two features.

The association measures are the Spearman correlation coefficient, correlation ratio and Theil’s U. The association matrix is not necessarily symmetrical.

By changing the method to “correlation”, data will be encoded as integers and the Spearman correlation coefficient will be used instead. This is faster but not a best practice, because the categorical variables are treated as numeric.

Parameters:
  • threshold (float, default = .8) – Features with an association value larger than this threshold will be removed. The threshold should be > 0 and <= 1

  • method (str, default = "association") – method for computing the association matrix. Either “association” or “correlation”. Correlation leads to encoding of categorical variables as numeric

  • n_jobs (int, default = 1) – the number of threads, -1 uses all the threads for computing the association matrix

  • nom_nom_assoc (str or callable, default = "theil") – the categorical-categorical association measure, by default Theil’s U, not symmetrical!

  • num_num_assoc (str or callable, default = "spearman") – the numeric-numeric association measure

  • nom_num_assoc (str or callable, default = "correlation_ratio") – the numeric-categorical association measure

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • assoc_matrix (pd.DataFrame) – the square association matrix

  • collinearity_summary (pd.DataFrame) – the pairs of collinear features and the association values

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features
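The pairwise filtering described above can be sketched for numeric columns only. This is a simplified, dependency-free illustration: it uses an unweighted Spearman correlation, and it assumes that the second feature of each offending pair is the one dropped (the source only says that one of the two is removed). All function names are hypothetical.

```python
from itertools import combinations


def ranks(values):
    """Average ranks (1-based), so that tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def drop_collinear(table, threshold=0.8):
    """Drop the second feature of every pair with |association| > threshold."""
    dropped = set()
    for left, right in combinations(table, 2):
        if left in dropped or right in dropped:
            continue
        if abs(spearman(table[left], table[right])) > threshold:
            dropped.add(right)
    return [name for name in table if name not in dropped]


table = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x_squared": [1.0, 4.0, 9.0, 16.0, 25.0, 36.0],  # monotone in x
    "z": [3.0, 1.0, 6.0, 2.0, 5.0, 4.0],
}
kept = drop_collinear(table, threshold=0.8)
```

Because Spearman works on ranks, x_squared is flagged as collinear with x even though the relationship is not linear.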

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import CollinearityThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = CollinearityThreshold(threshold=0.75)
>>> selector.fit_transform(X)
fit(X, y=None, sample_weight=None)[source]#

Learn empirical associations from X.

Parameters:
  • X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the associations, where n_samples is the number of samples and n_features is the number of features.

  • y (any, default None) – Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.

  • sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)

Returns:

self (object) – Returns the instance itself.

plot_association(ax=None, cmap='PuOr', figsize=None, cbar_kw=None, imgshow_kw=None)[source]#

Plot the association matrix.

Parameters:
  • ax (matplotlib.axes.Axes, optional) – the mpl axes if the figure object exists already, by default None

  • cmap (str, optional) – colormap name, by default “PuOr”

  • figsize (tuple of float, optional) – figure size, by default None

  • cbar_kw (dict, optional) – colorbar kwargs, by default None

  • imgshow_kw (dict, optional) – imshow kwargs, by default None

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CollinearityThreshold#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Reduce X to the selected features.

Parameters:

X (array of shape [n_samples, n_features]) – The input samples.

Returns:

X_r (array of shape [n_samples, n_selected_features]) – The input samples with only the selected features.

class arfs.feature_selection.GrootCV(objective=None, cutoff=1, n_folds=5, folds=None, n_iter=5, silent=True, rf=False, fastshap=False, n_jobs=0, lgbm_params=None)[source]#

Bases: SelectorMixin, BaseEstimator

GrootCV is a feature selection method based on cross-validation with lightGBM.

A shuffled copy of the predictors matrix (the shadows) is added to the original set of predictors. The lightGBM model is fitted using repeated cross-validation; the feature importance is extracted each time and averaged to smooth out the noise. Predictors whose importance is lower than the shadow-importance threshold are rejected, the others are kept.

  • Cross-validated feature importance to smooth out the noise, based on lightGBM only (which is, most of the time, the fastest and most accurate boosting implementation).

  • The feature importance is derived using SHAP importance.

  • The threshold is the max of the median shadow importance over folds; a less strict choice would not be conservative enough, and this one improves convergence (fewer evaluations are needed to find a threshold).

  • Not based on a given percentage of columns to be deleted.

  • Plot method for the variable importance.
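One possible reading of the "max of median over folds" threshold is sketched below. It assumes that the median is taken over folds for each shadow feature, that the maximum is then taken across shadow features, and that real features are summarized by their median as well. These aggregation details, the function names and the toy numbers are assumptions, not the library's code.

```python
from statistics import median


def shadow_cutoff(shadow_imp_per_fold, cutoff=1.0):
    """Max over shadow features of the per-feature median importance, divided by cutoff."""
    medians = [median(imps) for imps in shadow_imp_per_fold.values()]
    return max(medians) / cutoff


def select_features(real_imp_per_fold, threshold):
    """Keep real features whose median importance over folds exceeds the threshold."""
    return [name for name, imps in real_imp_per_fold.items() if median(imps) > threshold]


# Toy importances: one value per cross-validation fold
shadow = {"shadow_a": [0.02, 0.03, 0.10], "shadow_b": [0.04, 0.05, 0.06]}
real = {"age": [0.30, 0.25, 0.35], "noise": [0.01, 0.08, 0.02]}

sha_cutoff = shadow_cutoff(shadow, cutoff=1.0)
kept = select_features(real, sha_cutoff)
```

The median makes the threshold insensitive to a single noisy fold (the 0.10 spike of shadow_a does not raise it), while the max across shadows keeps the rule conservative.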

Parameters:
  • objective (str or callable, default None) – The objective function to use in lightGBM. If None, it uses the objective specified in lgbm_params.

  • cutoff (float, default 1) – The value by which the max of shadow imp is divided, to compare to real importance.

  • n_folds (int, default 5) – The number of folds for cross-validation.

  • folds (generator or iterator of (train_idx, test_idx) tuples of np.ndarray, scikit-learn splitter object or None, default None) – If a generator or iterator, it should yield the train and test indices for each fold. If an object, it should be one of the scikit-learn splitter classes (https://scikit-learn.org/stable/modules/classes.html#splitter-classes) and have a split method. This argument takes priority over the other data split arguments.

  • n_iter (int, default 5) – The number of iterations to average for the feature importance (on the same split), to reduce variance.

  • silent (bool, default True) – Set to True if you don’t want to see the GrootCV output printed.

  • rf (bool, default False) – If True, use random forest for calculating feature importances; otherwise, use lightGBM.

  • fastshap (bool, default False) – If True, use fastSHAP for calculating feature importances; otherwise, use SHAP.

  • n_jobs (int, default 0) – The number of jobs to run in parallel. If 0, no parallelism is used.

  • lgbm_params (dict, default None) – The parameters for the lightGBM model.
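A custom value for folds can be any iterable of (train_idx, test_idx) pairs. The generator below is a hypothetical, dependency-free sketch that splits the rows into contiguous blocks (useful, for instance, when rows are ordered in time). It yields plain lists; since the parameter is documented with np.ndarray indices, converting each list with np.asarray before passing it on is probably the safer choice.

```python
def contiguous_kfold(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs over contiguous blocks of rows."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds get one additional row
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(contiguous_kfold(n_samples=10, n_folds=3))
```

Each row appears in exactly one test block, and the train and test indices of a fold never overlap.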

Variables:
  • selected_features (ndarray) – The list of columns to keep as selected features.

  • cv_df (pd.DataFrame) – DataFrame containing feature importance values for each fold and iteration.

  • sha_cutoff (float) – The threshold below which the predictors are rejected.

  • ranking_absolutes (list) – The absolute feature ranking as ordered by the selection process.

  • ranking (ndarray) – The feature ranking, where 2 corresponds to selected features and 1 to tentative features.

fit(X, y, sample_weight=None)[source]#

Fit the GrootCV on the input data.

transform(X)[source]#

Apply the fitted GrootCV on new data.

plot_importance(n_feat_per_inch=5)[source]#

Plot the feature importance of the fitted GrootCV.

Warning

If sha_cutoff is None, you should apply the fit method first.

Examples

>>> from arfs.feature_selection import GrootCV
>>> X = df[filtered_features].copy()
>>> y = df['target'].copy()
>>> w = df['weight'].copy()
>>> feat_selector = GrootCV(objective='rmse', cutoff=1, n_folds=5, n_iter=5)
>>> feat_selector.fit(X, y, sample_weight=None)
>>> feat_selector.plot_importance(n_feat_per_inch=5)
fit(X, y, sample_weight=None)[source]#

Fit the GrootCV on the input data.

Parameters:
  • X (pd.DataFrame of shape (n_samples, n_features)) – The predictor dataframe.

  • y (array-like of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

Returns:

self (object) – Returns self.

plot_importance(n_feat_per_inch=5)[source]#

Plot the feature importance of the fitted GrootCV.

Parameters:

n_feat_per_inch (int, default 5) – The number of features per inch in the plot.

Returns:

fig (matplotlib.figure.Figure or None) – The matplotlib figure containing the plot or None if no feature is selected.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GrootCV#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Apply the fitted GrootCV on new data.

Parameters:

X (pd.DataFrame of shape (n_samples, n_features)) – The predictor dataframe.

Returns:

X_selected (pd.DataFrame of shape (n_samples, n_selected_features)) – The selected features from the input dataframe.

class arfs.feature_selection.LassoFeatureSelection(family='gaussian', link=None, n_iterations=10, score='bic', fit_intercept=True, n_jobs=-1)[source]#

Bases: BaseEstimator, TransformerMixin

LassoFeatureSelection performs feature selection using GLM Lasso regularization.

Parameters:
  • family (str, default "gaussian") – The distributional assumption of the model. It can be any of the statsmodels distributions: “gaussian”, “binomial”, “poisson”, “gamma”, “negativebinomial”, “tweedie”

  • link (str, optional) – The GLM link function. It can be any of: “identity”, “log”, “logit”, “probit”, “cloglog”, “inverse_squared”

  • n_iterations (int, default 10) – Number of iterations for the grid search.

  • score (str, default "bic") – The score to use for model selection. Options: “bic” (Bayesian Information Criterion) or “mean_cv” (mean cross-validation score).

  • n_jobs (int, default -1) – the number of processes. -1 means all the processes

Variables:
  • family (str) – The family of the GLM.

  • n_iterations (int) – Number of iterations for the grid search.

  • best_estimator (EnetGLM) – The best estimator found after grid search cross-validation.

  • selected_features (ndarray) – The selected feature names.

  • support (ndarray) – The support of selected features (True for selected, False otherwise).

  • feature_names_in (ndarray) – The input feature names.

  • score (str) – The score used for model selection.

  • n_jobs (int) – the number of processes. -1 means all the processes
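To make the "bic" score concrete, here is a minimal sketch of model selection by the Bayesian Information Criterion, BIC = k ln(n) - 2 ln(L), where the candidate with the lowest value wins. The candidate list, its log-likelihoods and the regularization strengths are made up for the illustration; how the library builds its grid and evaluates the likelihood is not shown here.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


# Hypothetical Lasso fits: stronger regularization -> fewer non-zero coefficients
candidates = [
    {"alpha": 0.01, "log_likelihood": -120.0, "n_params": 12},
    {"alpha": 0.10, "log_likelihood": -123.0, "n_params": 5},
    {"alpha": 1.00, "log_likelihood": -160.0, "n_params": 1},
]
n_samples = 200

best = min(candidates, key=lambda c: bic(c["log_likelihood"], c["n_params"], n_samples))
```

The middle candidate wins: it loses little likelihood compared with the densest model but pays a much smaller complexity penalty.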

fit(X, y=None, sample_weight=None)[source]#

Fit the LassoFeatureSelection model and select the best features.

transform(X)[source]#

Transform the input data to keep only the selected features.

get_feature_names_out()[source]#

Get the names of the selected features.

fit(X, y=None, sample_weight=None)[source]#

Fit the LassoFeatureSelection model and select the best features.

Parameters:
  • X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.

  • y (Optional[Union[pd.Series, np.ndarray]], default None) – The target values, can be either a pandas Series or a numpy array.

  • sample_weight (Optional[Union[pd.Series, np.ndarray]], default None) – Sample weights to be used during training. Can be either a pandas Series or a numpy array.

Returns:

LassoFeatureSelection – The fitted LassoFeatureSelection model.

get_feature_names_out()[source]#

Get the names of the selected features.

Return type:

ndarray

Returns:

np.ndarray – The names of the selected features.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') LassoFeatureSelection#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Transform the input data to keep only the selected features.

Parameters:

X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.

Return type:

DataFrame

Returns:

Union[pd.DataFrame, np.ndarray] – The transformed data with only the selected features.

class arfs.feature_selection.Leshy(estimator, n_estimators=1000, perc=90, alpha=0.05, importance='shap', two_step=True, max_iter=100, random_state=None, verbose=0, keep_weak=False)[source]#

Bases: SelectorMixin, BaseEstimator

This is an improved version of BorutaPy, which is itself an improved Python implementation of the Boruta R package. Boruta is an all-relevant feature selection method, while most others are minimal optimal; this means it tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some estimator has a minimal error.

Why bother with all-relevant feature selection? When you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (the minimal optimal set of features by definition depends on your estimator choice).

Parameters:
  • estimator (object) – A supervised learning estimator, with a ‘fit’ method that returns the feature_importances_ attribute. Important features must correspond to high absolute values in the feature_importances_

  • n_estimators (int or string, default = 1000) – If int sets the number of estimators in the chosen ensemble method. If ‘auto’ this is determined automatically based on the size of the dataset. The other parameters of the used estimators need to be set with initialisation.

  • perc (int, default = 90) – Instead of the max, the percentile defined by the user is used to pick the threshold for comparison between shadow and real features. The max tends to be too stringent; perc provides finer control. The lower perc is, the more false positives will be picked as relevant, but also the fewer relevant features will be left out. The usual trade-off. Setting perc = 100 is essentially the vanilla Boruta, corresponding to the max.

  • alpha (float, default = 0.05) – Level at which the corrected p-values will get rejected in both correction steps.

  • importance (str, default = 'shap') – The kind of variable importance used to compare and discriminate original vs shadow predictors. Note that the builtin tree importance (gini/impurity based importance) is biased towards numerical and large cardinality predictors, even if they are random. Shapley values and permutation imp. are robust w.r.t those predictors. Possible values: ‘shap’ (Shapley values), ‘fastshap’ (FastTreeShap implementation), ‘pimp’ (permutation importance) and ‘native’ (Gini/impurity)

  • two_step (Boolean, default = True) – To use the original implementation of Boruta with Bonferroni correction only, set this to False.

  • max_iter (int, default = 100) – The number of maximum iterations to perform.

  • random_state (int, RandomState instance or None; default=None) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

  • verbose (int, default 0) – Controls verbosity of output. 0: no output, 1: displays iteration number, 2: which features have been selected already
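The effect of perc can be seen on a toy example. The sketch below computes the shadow threshold as a percentile with linear interpolation; the interpolation scheme, the function name and the numbers are assumptions chosen for the illustration.

```python
def percentile(values, perc):
    """Percentile (0-100) with linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * perc / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Toy shadow importances with one lucky shadow feature (0.20)
shadow_imp = [0.01, 0.02, 0.03, 0.04, 0.20]
real_imp = {"a": 0.15, "b": 0.05}

thr_max = percentile(shadow_imp, 100)  # vanilla Boruta: the max
thr_90 = percentile(shadow_imp, 90)    # softer threshold

hits_max = [name for name, imp in real_imp.items() if imp > thr_max]
hits_90 = [name for name, imp in real_imp.items() if imp > thr_90]
```

With the max, a single lucky shadow feature blocks feature a; with the 90th percentile, a scores a hit in this iteration.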

Variables:
  • n_features (int) – The number of selected features.

  • support (array of shape [n_features]) – The mask of selected features - only confirmed ones are True.

  • support_weak (array of shape [n_features]) – The mask of selected tentative features, which haven’t gained enough support during the max_iter number of iterations.

  • selected_features (list of str) – the list of columns to keep

  • ranking (array of shape [n_features]) – The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank one and tentative features are assigned rank 2.

  • ranking_absolutes (array of shape [n_features]) – The absolute feature ranking as ordered by the selection process. It does not guarantee that this order is correct for all models. For a model-agnostic ranking, see the attribute ranking.

  • cat_name (list of str) – the name of the categorical columns

  • cat_idx (list of int) – the index of the categorical columns

  • imp_real_hist (array) – array of the historical feature importance of the real predictors

  • sha_max (float) – the maximum feature importance of the shadow predictors

  • col_names (list of str) – the names of the real predictors

Examples

>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from arfs.feature_selection import Leshy
>>>
>>> # load X and y
>>> # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
>>> X = pd.read_csv('examples/test_X.csv', index_col=0).values
>>> y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
>>> y = y.ravel()
>>>
>>> # define random forest classifier, with utilising all cores and
>>> # sampling in proportion to y labels
>>> rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
>>>
>>> # define Boruta feature selection method
>>> feat_selector = Leshy(rf, n_estimators='auto', verbose=2, random_state=1)
>>>
>>> # find all relevant features - 5 features should be selected
>>> feat_selector.fit(X, y)
>>>
>>> # check selected features - first 5 features are selected
>>> feat_selector.selected_features_
>>>
>>> # check ranking of features
>>> feat_selector.ranking_
>>>
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)

References

See the original paper [1] for more details.

[1] Kursa M., Rudnicki W., “Feature Selection with the Boruta Package”, Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010

_add_shadows_get_imps(X, y, sample_weight, dec_reg)[source]#

Add a shuffled copy of the columns (shadows) and get the feature importance of the augmented data set

Parameters:
  • X (pd.DataFrame of shape [n_samples, n_features]) – predictor matrix

  • y (pd.series of shape [n_samples]) – target

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample

  • dec_reg (array) – holds the decision about each feature 1, 0, -1 (accepted, undecided, rejected)

Returns:

  • imp_real (array) – feature importance of the real predictors

  • imp_sha (array) – feature importance of the shadow predictors
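The shadow construction itself is simple: every column gets a permuted copy, which keeps its marginal distribution but breaks any relationship with the target. A dependency-free sketch, where the dict-of-lists table, the shadow_ prefix and the function name are hypothetical:

```python
import random


def add_shadows(table, seed=0):
    """Return a copy of `table` augmented with one shuffled copy per column."""
    rng = random.Random(seed)
    augmented = {name: list(values) for name, values in table.items()}
    for name, values in table.items():
        shadow = list(values)
        rng.shuffle(shadow)  # permuting the rows breaks the link with the target
        augmented[f"shadow_{name}"] = shadow
    return augmented


table = {"x": [1, 2, 3, 4, 5], "z": [10, 20, 30, 40, 50]}
augmented = add_shadows(table, seed=42)
```

A model fitted on the augmented table then yields importances for both real and shadow columns, and the shadow ones serve as the noise baseline.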

static _assign_hits(hit_reg, cur_imp, imp_sha_max)[source]#

Count how many times a given feature was more important than the best of the shadow features.

Parameters:
  • hit_reg (array) – count how many times a given feature was more important than the best of the shadow features

  • cur_imp (array) – current importance

  • imp_sha_max (array) – importance of the best shadow predictor

Returns:

hit_reg (array) – the number of times a given feature was more important than the best of the shadow features
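The hit counting can be sketched as follows, assuming the shadow threshold is a single number per iteration; the function name and the toy importances are hypothetical.

```python
def assign_hits(hit_reg, cur_imp, imp_sha_max):
    """Add one hit to every feature that beats the best shadow importance."""
    return [hits + (1 if imp > imp_sha_max else 0) for hits, imp in zip(hit_reg, cur_imp)]


hit_reg = [0, 0, 0]
iterations = [
    ([0.50, 0.10, 0.30], 0.20),  # (real importances, best shadow importance)
    ([0.40, 0.25, 0.10], 0.20),
]
for cur_imp, sha_max in iterations:
    hit_reg = assign_hits(hit_reg, cur_imp, sha_max)
```

After the two iterations, the first feature has beaten the shadows twice and the other two once each.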

_calculate_absolute_ranking()[source]#

Compute feature importance scores using SHAP values.

Parameters:
  • new_x_tr (numpy.ndarray) – The training dataset after being processed.

  • shap_matrix (numpy.ndarray) – The matrix containing SHAP values computed by a LightGBM model.

  • param (dict) – A dictionary containing the parameters for a LightGBM model.

  • objective (str) – The objective function of the LightGBM model.

Returns:

list – A list of tuples containing feature names and their corresponding importance scores.

_calculate_relative_ranking(n_feat, tentative, confirmed, imp_history)[source]#

Calculates the relative ranking of features based on their importance history.

Parameters:
  • n_feat (int) – The total number of features.

  • tentative (ndarray of shape (n_tentative_features,)) – An array containing the indices of tentative features.

  • confirmed (ndarray of shape (n_confirmed_features,)) – An array containing the indices of confirmed features.

  • imp_history (ndarray of shape (n_iterations + 1, n_features)) – An array containing the feature importances for each iteration.

Returns:

None

_calculate_support(confirmed, tentative, n_feat)[source]#

Calculate the feature support arrays.

Parameters:
  • confirmed (array-like of shape (n_confirmed,)) – Indices of confirmed features.

  • tentative (array-like of shape (n_tentative,)) – Indices of tentative features.

  • n_feat (int) – Total number of features.

Returns:

None – The function populates the following class attributes:

  • n_features_ (int) – Number of selected features.

  • support_ (ndarray of shape (n_feat,)) – Boolean array indicating the selected features.

  • support_weak_ (ndarray of shape (n_feat,)) – Boolean array indicating the tentatively selected features.

_check_params(X, y)[source]#

Private method. Checks the hyperparameters as well as X and y before proceeding with fit.

Parameters:
  • X (pd.DataFrame) – predictor matrix

  • y (pd.series) – target series

Raises:
  • ValueError – [description]

  • ValueError – [description]

_do_tests(dec_reg, hit_reg, _iter)[source]#

Private method. Performs the test that decides whether a feature should be tagged as relevant (confirmed), not relevant (rejected) or undecided. The test treats the iterations as binomial trials: it counts how many times a given feature was more important than the best of the shadow features and tests whether the probability associated with the z-score is below, between or above the rejection or acceptance threshold.

Parameters:
  • dec_reg (array) – holds the decision about each feature 1, 0, -1 (accepted, undecided, rejected)

  • hit_reg (array) – counts how many times a given feature was more important than the best of the shadow features

  • _iter (int) – iteration number

Returns:

dec_reg (array) – holds the decision about each feature 1, 0, -1 (accepted, undecided, rejected)
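The decision logic can be illustrated with an exact binomial test. This is a simplified sketch, not the method's code: it assumes a hit probability of 0.5 under the null hypothesis and applies a plain Bonferroni correction, whereas the two-step variant also involves an FDR correction. The function names are hypothetical.

```python
from math import comb


def binom_tail_ge(hits, n_iter, p=0.5):
    """P(X >= hits) for X ~ Binomial(n_iter, p)."""
    return sum(comb(n_iter, k) * p**k * (1 - p) ** (n_iter - k) for k in range(hits, n_iter + 1))


def binom_tail_le(hits, n_iter, p=0.5):
    """P(X <= hits) for X ~ Binomial(n_iter, p)."""
    return sum(comb(n_iter, k) * p**k * (1 - p) ** (n_iter - k) for k in range(0, hits + 1))


def decide(hit_reg, n_iter, alpha=0.05):
    """Return 1 (accepted), 0 (undecided) or -1 (rejected) for each feature."""
    n_tests = len(hit_reg)  # Bonferroni: multiply each p-value by the number of tests
    decisions = []
    for hits in hit_reg:
        if binom_tail_ge(hits, n_iter) * n_tests <= alpha:
            decisions.append(1)   # far more hits than chance: confirmed
        elif binom_tail_le(hits, n_iter) * n_tests <= alpha:
            decisions.append(-1)  # far fewer hits than chance: rejected
        else:
            decisions.append(0)   # not enough evidence either way
    return decisions


decisions = decide([20, 10, 0], n_iter=20)
```

A feature that wins every iteration is confirmed, one that never wins is rejected, and one that wins half of the time stays undecided.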

static _fdrcorrection(pvals, alpha=0.05)[source]#

Benjamini/Hochberg p-value correction for false discovery rate, from statsmodels package. Included here for decoupling dependency on statsmodels.

Parameters:
  • pvals (array_like) – set of p-values of the individual tests.

  • alpha (float) – error rate

Returns:

  • rejected (array, bool) – True if a hypothesis is rejected, False if not

  • pvalue-corrected (array) – pvalues adjusted for multiple hypothesis testing to limit FDR
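
A compact sketch of the Benjamini/Hochberg step-up procedure, written with NumPy only. It is meant as an illustration of the technique and assumes independent (or positively correlated) tests; the function name is hypothetical and the code is not copied from arfs or statsmodels.

```python
import numpy as np

def fdrcorrection_sketch(pvals, alpha=0.05):
    """Benjamini/Hochberg FDR correction (illustrative sketch)."""
    pvals = np.asarray(pvals, dtype=float)
    n = pvals.size
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    ecdf = np.arange(1, n + 1) / n              # rank / n for each sorted p-value
    below = sorted_p <= ecdf * alpha
    rejected_sorted = np.zeros(n, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank passing the threshold
        last = np.max(np.nonzero(below)[0])
        rejected_sorted[: last + 1] = True
    # Adjusted p-values: enforce monotonicity from the largest p-value down
    adjusted_sorted = np.minimum.accumulate((sorted_p / ecdf)[::-1])[::-1]
    adjusted_sorted = np.clip(adjusted_sorted, 0.0, 1.0)
    # Restore the original ordering
    rejected = np.empty(n, dtype=bool)
    adjusted = np.empty(n, dtype=float)
    rejected[order] = rejected_sorted
    adjusted[order] = adjusted_sorted
    return rejected, adjusted

rejected, adjusted = fdrcorrection_sketch([0.01, 0.04, 0.03, 0.005])
```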

_fit(X_raw, y, sample_weight=None)[source]#

Private method. See the methods overview in the documentation for explanation of the process

Parameters:
  • X_raw (array-like, shape = [n_samples, n_features]) – The training input samples.

  • y (array-like, shape = [n_samples]) – The target values.

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample

Returns:

self (object) – Nothing but attributes

_get_tree_num(n_feat)[source]#

Private method. Gets a good estimate of the number of trees given the number of features.

Parameters:

n_feat (int) – The number of features

Returns:

n_estimators (int) – the number of trees

static _nanrankdata(X, axis=1)[source]#

Replaces bottleneck’s nanrankdata with scipy and numpy alternative.

Parameters:
  • X (array or pd.DataFrame) – the data array

  • axis (int, optional) – row-wise (0) or column-wise (1), by default 1

Returns:

ranks (array) – the ranked array
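
One possible way to rank values while leaving NaN entries as NaN, using scipy and numpy. This is a sketch under the assumption that ties receive average ranks and that ranking is done along the given axis; the function name is hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def nanrankdata_sketch(X, axis=1):
    """Rank along `axis`, ignoring NaN values and keeping them as NaN."""
    X = np.asarray(X, dtype=float)

    def rank_1d(values):
        ranks = np.full(values.shape, np.nan)
        valid = ~np.isnan(values)
        ranks[valid] = rankdata(values[valid])  # average ranks for ties
        return ranks

    return np.apply_along_axis(rank_1d, axis, X)

ranks = nanrankdata_sketch([[3.0, np.nan, 1.0], [2.0, 2.0, np.nan]], axis=1)
```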

_print_result(dec_reg, _iter, start_time)[source]#

Print the results of feature selection.

Parameters:
  • dec_reg (bool) – Decision on whether to proceed with another round of feature selection.

  • _iter (int) – Current iteration number.

  • start_time (float) – Time when the feature selection process started.

Returns:

None – The function prints the relevant results and running time.

_print_results(dec_reg, _iter, flag)[source]#

Private method, printing the result

Parameters:
  • dec_reg (array) – whether each feature has been tagged as relevant (confirmed), not relevant (rejected) or undecided

  • _iter (int) – the iteration number

  • flag (int) – is still in the feature selection process or not

Returns:

output (str) – the output to be printed out

_run_iteration(X, y, sample_weight, dec_reg, sha_max_history, imp_history, hit_reg, _iter)[source]#

Run an iteration of the Gradient Boosting algorithm.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The input samples.

  • y (array-like of shape (n_samples,)) – The target values.

  • sample_weight (array-like of shape (n_samples,), default None) – Sample weights. If None, then samples are equally weighted.

  • dec_reg (array-like of shape (n_features,)) – Decision register holding the decision about each feature: 1, 0, -1 (accepted, undecided, rejected).

  • sha_max_history (list of floats) – List of the maximum shadow importance value at each iteration.

  • imp_history (array-like of shape (n_iterations, n_features)) – Matrix of feature importances at each iteration.

  • hit_reg (array-like of shape (n_features,)) – Array counting how many times each feature was more important than the best of the shadow features.

  • _iter (int) – The current iteration number.

Returns:

  • dec_reg (array-like of shape (n_features,)) – Updated decision register: 1, 0, -1 (accepted, undecided, rejected) for each feature.

  • sha_max_history (list of floats) – List of the maximum shadow importance value at each iteration.

  • imp_history (array-like of shape (n_iterations, n_features)) – Matrix of feature importances at each iteration.

  • hit_reg (array-like of shape (n_features,)) – Updated array of hit counts for each feature.

  • imp_sha_max (float) – The maximum shadow importance value for this iteration.

_update_estimator()[source]#

Update the estimator with a new random state, if applicable.

If the dataset is not categorical, the estimator’s random_state parameter is updated with a new random state generated by the random_state attribute of the Leshy object. If the estimator is a LightGBM model, the random state value is generated between 0 and 10000.

Parameters:

None

Returns:

None

_update_tree_num(dec_reg)[source]#

Update the number of trees in the estimator based on the number of selected features.

Parameters:

dec_reg (array-like of shape (n_features,)) – The decision rule for each feature, where negative values indicate that the feature should be rejected and non-negative values indicate that the feature should be selected.

Returns:

None

Notes

This function updates the n_estimators parameter of the estimator if it is set to “auto”. The number of trees is determined based on the number of selected features. Specifically, the number of trees is set to the value returned by the _get_tree_num method, which takes as input the number of selected features that are not rejected.

If n_estimators is not set to “auto”, this function does nothing.
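
The rule described in the notes can be sketched as below. The `params` dictionary and the `get_tree_num` callable are hypothetical stand-ins for the estimator settings and the `_get_tree_num` method; only the counting of non-rejected features follows the description above.

```python
import numpy as np

def update_tree_num_sketch(params, dec_reg, get_tree_num):
    """Return updated params when n_estimators is "auto" (illustrative only)."""
    if params.get("n_estimators") != "auto":
        return params  # nothing to do when the number of trees is fixed
    # Features with a non-negative decision are still in play (not rejected)
    n_not_rejected = int(np.sum(np.asarray(dec_reg) >= 0))
    return {**params, "n_estimators": get_tree_num(n_not_rejected)}

# Toy heuristic standing in for _get_tree_num: 50 trees per remaining feature
updated = update_tree_num_sketch(
    {"n_estimators": "auto"}, dec_reg=[1, 0, -1, -1], get_tree_num=lambda n: 50 * n
)
unchanged = update_tree_num_sketch(
    {"n_estimators": 200}, dec_reg=[1, 0, -1, -1], get_tree_num=lambda n: 50 * n
)
```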

fit(X, y, sample_weight=None)[source]#

Fits the Boruta feature selection with the provided estimator.

Parameters:
  • X (array-like, shape = [n_samples, n_features]) – The training input samples.

  • y (array-like, shape = [n_samples]) – The target values.

  • sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample

Returns:

self (object) – Nothing but attributes

plot_importance(n_feat_per_inch=5)[source]#

Boxplot of the variable importances, ordered by magnitude. The maximum shadow variable importance is shown as a dashed line. The fit method must be called first.

Parameters:

n_feat_per_inch (int, default 5) – number of features to plot per inch (for scaling the figure)

Returns:

fig (plt.figure) – the matplotlib figure object containing the boxplot

select_features(X, y, sample_weight=None)[source]#

Select features using the Leshy algorithm.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The input data.

  • y (array-like of shape (n_samples,)) – The target values.

  • sample_weight (array-like of shape (n_samples,), default None) – Individual weights for each sample.

Returns:

  • dec_reg (ndarray of shape (n_features,)) – The decision rule. 1 means the feature is selected, 0 means the feature is not selected.

  • sha_max_history (list) – List of the maximum shadow importances per iteration.

  • imp_history (ndarray of shape (n_iterations, n_features)) – Array containing the feature importances per iteration.

  • imp_sha_max (float) – Maximum shadow importance value.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → Leshy#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Reduce X to the selected features.

Parameters:

X (array of shape [n_samples, n_features]) – The input samples.

Returns:

X_r (array of shape [n_samples, n_selected_features]) – The input samples with only the selected features.

class arfs.feature_selection.MinRedundancyMaxRelevance(n_features_to_select, relevance_func=None, redundancy_func=None, task='regression', denominator_func=<function mean>, only_same_domain=False, return_scores=False, n_jobs=1, show_progress=True)[source]#

Bases: SelectorMixin, BaseEstimator

MRMR feature selection for a classification or a regression task. For a classification task, the target should be of object or pandas category dtype. For a regression task, the target should be of a numeric dtype. The predictors can be categorical or numerical; no encoding is required. The dtype is detected automatically and the appropriate method is applied (correlation, correlation ratio or Theil’s U).

Parameters:
  • n_features_to_select (int) – Number of features to select.

  • relevance_func (callable, optional) – relevance function having arguments “X”, “y”, “sample_weight” and returning a pd.Series containing a score of relevance for each feature

  • redundancy_func (callable, optional) – Redundancy method. If callable, it should take “X”, “sample_weight” as input and return a pandas.Series containing a score of redundancy for each feature.

  • denominator_func (str or callable (optional, default 'mean')) – Synthesis function to apply to the denominator of MRMR score. If string, name of method. Supported: ‘max’, ‘mean’. If callable, it should take an iterable as input and return a scalar.

  • task (str) – either “regression” or “classification”

  • only_same_domain (bool (optional, default False)) – If False, all the necessary correlation coefficients are computed. If True, only features belonging to the same domain are compared. Domain is defined by the string preceding the first underscore: for instance “cusinfo_age” and “cusinfo_income” belong to the same domain, whereas “age” and “income” don’t.

  • return_scores (bool (optional, default False)) – If False, only the list of selected features is returned. If True, a tuple containing (list of selected features, relevance, redundancy) is returned.

  • n_jobs (int (optional, default 1)) – Maximum number of workers to use. Only used when relevance = “f” or redundancy = “corr”. If -1, use as many workers as min(cpu count, number of features).

  • show_progress (bool (optional, default True)) – If False, no progress bar is displayed. If True, a TQDM progress bar shows the number of features processed.

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • ranking (pd.DataFrame) – name and scores for the selected features

  • support (list of bool) – the list of the selected X-columns
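
The greedy idea behind MRMR can be sketched on purely numeric data as follows. This is a simplified illustration, not the arfs implementation: it assumes absolute Pearson correlation for both relevance and redundancy, and the function name `mrmr_sketch` is hypothetical. Each step picks the candidate with the highest ratio of relevance to summarised redundancy with the features already selected.

```python
import numpy as np
import pandas as pd

def mrmr_sketch(X, y, n_features_to_select, denominator_func=np.mean):
    """Greedy MRMR on numeric data (illustrative sketch)."""
    relevance = X.corrwith(y).abs()   # relevance of each feature to the target
    redundancy = X.corr().abs()       # pairwise redundancy between features
    selected, candidates = [], list(X.columns)
    while candidates and len(selected) < n_features_to_select:
        if selected:
            # Summarise each candidate's redundancy with the selected features
            block = redundancy.loc[candidates, selected].to_numpy()
            denom = np.array([denominator_func(row) for row in block])
            denom = np.clip(denom, 1e-6, None)  # guard against division by zero
        else:
            denom = np.ones(len(candidates))    # first pick uses relevance only
        score = relevance.loc[candidates].to_numpy() / denom
        best = candidates[int(np.argmax(score))]
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(0)
a = rng.normal(size=500)
c = rng.normal(size=500)
# "b" is a near copy of "a", so it is relevant but redundant
X = pd.DataFrame({"a": a, "b": a + 0.01 * rng.normal(size=500), "c": c})
y = pd.Series(2 * a + c, name="target")
selected = mrmr_sketch(X, y, n_features_to_select=2)
```

In this toy data only one of the two near-duplicate columns is kept, together with the independent column `c`.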

Example

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import MinRedundancyMaxRelevance
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> y.name = "target"
>>> fs_mrmr = MinRedundancyMaxRelevance(
...     n_features_to_select=5,
...     relevance_func=None,
...     redundancy_func=None,
...     task="regression",  # or "classification"
...     denominator_func=np.mean,
...     only_same_domain=False,
...     return_scores=False,
...     show_progress=True)
>>> # classification variant: fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
>>> fs_mrmr.fit(X=X, y=y, sample_weight=None)

fit(X, y, sample_weight=None)[source]#

Fit the MRMR selector by learning the associations.

Parameters:
  • X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the associations, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like or pd.Series of shape (n_samples,)) – Target vector. Must be numeric for regression or categorical for classification.

  • sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)

Returns:

self (object) – If return_scores=False, returns self. If return_scores=True, returns (selected_features, relevance_scores).

fit_transform(X, y, sample_weight=None, **fit_params)[source]#

Fit to data, then transform it. Fits transformer to X and y and optionally sample_weight with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • sample_weight (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – sample weight values.

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

run_feature_selection()[source]#

select_next_feature(not_selected_features, selected_features, relevance, redundancy)[source]#

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MinRedundancyMaxRelevance#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Transform the data, returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

update_ranks(best_feature, score, score_denominator)[source]#

class arfs.feature_selection.MissingValueThreshold(threshold=0.05)[source]#

Bases: BaseThresholdSelector

Feature selector that removes all high missing percentage features. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:

threshold (float, default = .05) – Features whose proportion of missing values in the training set is larger than this threshold will be removed.

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features
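
The selection rule can be illustrated with plain pandas. This sketch assumes the statistic is the per-column proportion of missing values; the function name is hypothetical and the code is not the library's implementation.

```python
import numpy as np
import pandas as pd

def missing_value_support(X, threshold=0.05):
    """Boolean mask of columns whose missing proportion is within the threshold."""
    missing_fraction = X.isna().mean()   # proportion of NaN per column
    return missing_fraction <= threshold

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],          # 0% missing
    "b": [1.0, np.nan, np.nan, 4.0],    # 50% missing
    "c": [np.nan, 2.0, 3.0, 4.0],       # 25% missing
})
support = missing_value_support(X, threshold=0.3)
selected = support.index[support.to_numpy()].tolist()
```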

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import MissingValueThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = MissingValueThreshold(0.05)
>>> selector.fit_transform(X)

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → MissingValueThreshold#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

class arfs.feature_selection.UniqueValuesThreshold(threshold=1)[source]#

Bases: BaseThresholdSelector

Feature selector that removes all features with zero variance (a single unique value) or with fewer unique values than the threshold. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.

Parameters:

threshold (int, default = 1) – Features with too few unique values in the training set, relative to this threshold, will be removed. The threshold should be >= 1.

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features
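
A small pandas sketch of the idea. It assumes that columns with a number of unique values less than or equal to the threshold are dropped, which is consistent with the default of 1 removing constant columns; the exact boundary used by the library is an assumption here, and the function name is hypothetical.

```python
import pandas as pd

def unique_values_support(X, threshold=1):
    """Boolean mask of columns having more than `threshold` unique values."""
    n_unique = X.nunique()   # number of distinct non-missing values per column
    return n_unique > threshold

X = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "binary": [0, 1, 0, 1],
    "varied": [1, 2, 3, 4],
})
kept_default = unique_values_support(X).tolist()
kept_strict = unique_values_support(X, threshold=2).tolist()
```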

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import UniqueValuesThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = UniqueValuesThreshold(1)
>>> selector.fit_transform(X)

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → UniqueValuesThreshold#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

class arfs.feature_selection.VariableImportance(task='regression', encode=True, n_iterations=10, threshold=0.99, lgb_kwargs={'objective': 'rmse', 'zero_as_missing': False}, encoder_kwargs=None, fastshap=False, verbose=-1)[source]#

Bases: SelectorMixin, BaseEstimator

Feature selector that removes predictors with zero or low variable importance.

Identifies the features with zero or low importance according to the SHAP values of a LightGBM model. The GBM can be trained with early stopping using a validation set to prevent overfitting. The feature importances are averaged over n_iterations to reduce the variance. The predictors are then ranked from the most important to the least important and the cumulative variable importance is computed. All the predictors that do not contribute (VI=0) or that are not needed to reach the threshold of cumulative importance are removed.

Parameters:
  • task (string) – The machine learning task, either ‘classification’ or ‘regression’ or ‘multiclass’, be sure to use a consistent objective function

  • encode (boolean, default = True) – Whether or not to encode the predictors

  • n_iterations (int, default = 10) – Number of iterations, the more iterations, the smaller the variance

  • threshold (float, default = .99) – The selector computes the cumulative feature importance and ranks the predictors from the most important to the least important. All the predictors contributing to less than this value are rejected.

  • lgb_kwargs (dictionary of keyword arguments) – dictionary of lightgbm estimators parameters with at least the objective function {‘objective’:’rmse’}

  • encoder_kwargs (dictionary of keyword arguments, optional) – dictionary of the OrdinalEncoderPandas parameters

Returns:

selected_features (list of str) – List of selected features.

Variables:
  • n_features_in (int) – number of input predictors

  • assoc_matrix (pd.DataFrame) – the square association matrix

  • collinearity_summary (pd.DataFrame) – the pairs of collinear features and the association values

  • support (list of bool) – the list of the selected X-columns

  • selected_features (list of str) – the list of names of selected features

  • not_selected_features (list of str) – the list of names of rejected features

  • fastshap (boolean) – enable or not the fasttreeshap implementation

  • verbose (int, default = -1) – controls the progress bar, > 1 print out progress
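
The cumulative-importance rule can be sketched as follows, starting from importances that are already computed and averaged. The function name is hypothetical, and the handling of the feature that crosses the threshold (it is kept here) is an assumption, not necessarily what the library does.

```python
import pandas as pd

def select_by_cumulative_importance(importances, threshold=0.99):
    """Keep the top-ranked features needed to reach `threshold` cumulative importance."""
    ranked = importances.sort_values(ascending=False)
    normalized = ranked / ranked.sum()                    # importances sum to 1
    cumulative = normalized.cumsum()
    # Cumulative importance reached before each feature is added
    reached_before = cumulative.shift(1, fill_value=0.0)
    keep = (reached_before < threshold) & (normalized > 0)
    return keep[keep].index.tolist()

importances = pd.Series({"a": 5.0, "b": 3.0, "c": 1.5, "d": 0.5, "e": 0.0})
selected = select_by_cumulative_importance(importances, threshold=0.9)
```

Here `a`, `b` and `c` together reach 95% of the total importance, so `d` and the zero-importance `e` are dropped.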

Example

>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import VariableImportance
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = VariableImportance(threshold=0.75)
>>> selector.fit_transform(X, y)

fit(X, y, sample_weight=None)[source]#

Learn variable importance from X and y, supervised learning.

Parameters:
  • X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the variable importances, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like or pd.Series of shape (n_samples,)) – Target vector used to train the model.

  • sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)

Returns:

self (object) – Returns the instance itself.

fit_transform(X, y=None, sample_weight=None)[source]#

Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

plot_importance(figsize=None, plot_n=50, n_feat_per_inch=3, log=True, style=None)[source]#

Plots plot_n most important features and the cumulative importance of features. If threshold is provided, prints the number of features needed to reach threshold cumulative importance.

Parameters:
  • plot_n (int, default = 50) – Number of most important features to plot, capped at the total number of features if that is smaller.

  • n_feat_per_inch (int) – number of features per inch, the larger the less space between labels

  • figsize (tuple of float, optional) – The rendered size as a percentage size

  • log (bool, default True) – Whether or not render variable importance on a log scale

  • style (bool, default False) – set arfs style or not

Returns:

hv.plot – the feature importances holoviews object

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → VariableImportance#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

transform(X)[source]#

Transform the data, returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.

Returns:

X (ndarray array of shape (n_samples, n_features_new)) – Transformed array.

Raises:

TypeError – if the input is not a pd.DataFrame

arfs.feature_selection.make_fs_summary(selector_pipe)[source]#

make_fs_summary makes a summary dataframe highlighting at which step a given predictor has been rejected (if any).

Parameters:

selector_pipe (sklearn.pipeline.Pipeline) – the feature selector pipeline.

Examples

>>> from sklearn.pipeline import Pipeline
>>> # The selectors are assumed to be imported from arfs.feature_selection;
>>> # df, predictors, target and weight are assumed to be defined beforehand.
>>> groot_pipeline = Pipeline([
...     ('missing', MissingValueThreshold()),
...     ('unique', UniqueValuesThreshold()),
...     ('cardinality', CardinalityThreshold()),
...     ('collinearity', CollinearityThreshold(threshold=0.5)),
...     ('lowimp', VariableImportance(lgb_kwargs={'objective': 'poisson'}, verbose=2)),
...     ('grootcv', GrootCV(objective='poisson', cutoff=1, n_folds=3, n_iter=5))])
>>> groot_pipeline.fit_transform(
...     X=df[predictors],
...     y=df[target],
...     lowimp__sample_weight=df[weight],
...     grootcv__sample_weight=df[weight])
>>> fs_summary_df = make_fs_summary(groot_pipeline)