arfs.feature_selection package
Submodules
arfs.feature_selection.allrelevant module
This module provides three different methods to perform ‘all relevant feature selection’.
Reference:
NILSSON, Roland, PEÑA, José M., BJÖRKEGREN, Johan, et al. Consistent feature selection for pattern recognition in polynomial time. Journal of Machine Learning Research, 2007, vol. 8, no Mar, p. 589-612.
KURSA, Miron B., RUDNICKI, Witold R., et al. Feature selection with the Boruta package. J Stat Softw, 2010, vol. 36, no 11, p. 1-13.
The module structure
- The Leshy class, a heavy re-work of the BorutaPy class, itself a modified version of Boruta. The pull request I submitted is still pending: https://github.com/scikit-learn-contrib/boruta_py/pull/100
- The BoostAGroota class, a modified version of BoostARoota. PR still to be submitted: https://github.com/chasedehan/BoostARoota
- The GrootCV class, a new method for all-relevant feature selection using a LightGBM model, cross-validated SHAP importances and shadowing.
Original BorutaPy version
Author: Daniel Homola <dani.homola@gmail.com>
Original code and method by: Miron B Kursa, https://m2.icm.edu.pl/boruta/
Modified by Thomas Bury, pull request: https://github.com/scikit-learn-contrib/boruta_py/pull/100 (a new PR based on #77 that makes all the changes optional; waiting for merge).
Leshy is a re-work of the PR I submitted.
License: BSD 3 clause
- class arfs.feature_selection.allrelevant.BoostAGroota(estimator=None, cutoff=4, iters=10, max_rounds=500, delta=0.1, silent=True, importance='shap')[source]#
Bases: SelectorMixin, BaseEstimator
BoostAGroota is an all-relevant feature selection method, while most others are minimal optimal. It tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some estimator has a minimal error.
Why bother with all-relevant feature selection? When you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (minimal optimal set of features by definition depends on your estimator choice).
- Parameters:
estimator (scikit-learn estimator) – The model to train, lightGBM recommended, see the reduce lightgbm method.
cutoff (float) – The value by which the max of shadow imp is divided, to compare to real importance.
iters (int (>0)) – The number of iterations to average for the feature importance (on the same split), to reduce the variance.
max_rounds (int (>0)) – The number of times the core BoostAGroota algorithm will run. Each round eliminates more and more features.
delta (float (0 < delta <= 1)) – Stopping criterion for whether another round is started.
silent (bool) – Set to True if you don’t want to see the BoostAGroota output printed.
importance (str, default 'shap') – The kind of feature importance to use. Possible values: ‘shap’ (Shapley values), ‘pimp’ (permutation importance), and ‘native’ (Gini/impurity).
- Variables:
selected_features (list of str) – The list of columns to keep.
ranking (array of shape [n_features]) – The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1, and tentative features are assigned rank 2.
ranking_absolutes (array of shape [n_features]) – The absolute feature ranking as ordered by the selection process. It does not guarantee that this order is correct for all models. For a model-agnostic ranking, see the attribute ranking.
sha_cutoff_df (dataframe) – Feature importance of the real+shadow predictors over iterations.
mean_shadow (float) – The threshold below which the predictors are rejected.
Examples
>>> from lightgbm import LGBMRegressor
>>> from arfs.feature_selection.allrelevant import BoostAGroota
>>> X = df[filtered_features].copy()
>>> y = df['target'].copy()
>>> w = df['weight'].copy()
>>> model = LGBMRegressor(n_jobs=-1, n_estimators=100, objective='rmse', random_state=42, verbose=0)
>>> feat_selector = BoostAGroota(estimator=model, cutoff=1, iters=10, max_rounds=10, delta=0.1, importance='shap')
>>> feat_selector.fit(X, y, sample_weight=None)
>>> print(feat_selector.selected_features_)
>>> feat_selector.plot_importance(n_feat_per_inch=5)
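The cutoff parameter described above can be illustrated with a small standalone sketch. It is not the library's code: the helper name `select_above_shadow`, the toy importances, and the strict `>` comparison are assumptions made for illustration; only the rule "divide the max shadow importance by cutoff and compare it to the real importance" comes from the parameter description.

```python
def select_above_shadow(real_imp, shadow_imp, cutoff=4.0):
    """Keep features whose importance exceeds max(shadow importance) / cutoff.

    Hypothetical helper: real_imp and shadow_imp map feature names to
    importance values.
    """
    threshold = max(shadow_imp.values()) / cutoff
    kept = [name for name, imp in real_imp.items() if imp > threshold]
    return kept, threshold


# Toy importances, made up for the illustration
real_imp = {"age": 0.40, "income": 0.25, "noise": 0.02}
shadow_imp = {"shadow_age": 0.08, "shadow_income": 0.12, "shadow_noise": 0.04}

kept, threshold = select_above_shadow(real_imp, shadow_imp, cutoff=4.0)
print(kept, threshold)
```

A larger cutoff lowers the threshold, so more features survive a round; cutoff=1 compares real importances directly to the best shadow.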
- fit(X, y, sample_weight=None)[source]#
Fit the BoostAGroota transformer with the provided estimator.
- Parameters:
X (pd.DataFrame) – The predictors matrix.
y (pd.Series) – The target.
sample_weight (pd.Series) – The sample weights, if any.
- plot_importance(n_feat_per_inch=5)[source]#
Boxplot of the variable importance, ordered by magnitude. The max shadow variable importance is illustrated by the dashed line. Requires calling the fit method first.
- Parameters:
n_feat_per_inch (int, default 5) – Number of features to plot per inch (for scaling the figure).
- Returns:
fig (plt.figure or None) – The matplotlib figure object containing the boxplot, or None if there are no selected features.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') BoostAGroota#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.allrelevant.GrootCV(objective=None, cutoff=1, n_folds=5, folds=None, n_iter=5, silent=True, rf=False, fastshap=False, n_jobs=0, lgbm_params=None)[source]#
Bases: SelectorMixin, BaseEstimator
GrootCV is a feature selection method based on cross-validation with lightGBM.
A shuffled copy of the predictors matrix is added (shadows) to the original set of predictors. The lightGBM is fitted using repeated cross-validation, and the feature importance is extracted each time and averaged to smooth out the noise. If a predictor's importance is lower than the shadow importance threshold, it is rejected; the others are kept.
- Cross-validated feature importance to smooth out the noise, based on lightGBM only (which is, most of the time, the fastest and most accurate boosting implementation).
- The feature importance is derived using SHAP importance.
- The threshold is the max of the median of the shadow variable importance over folds; otherwise the selection is not conservative enough. This also improves convergence (fewer evaluations are needed to find a threshold).
- Not based on a given percentage of columns that need to be deleted.
- Plot method for the variable importance.
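To make the "max of median of the shadow importance over folds" rule concrete, here is a minimal standalone sketch in plain Python. It does not call GrootCV; the function names, the toy per-fold importances, and the choice to average the real importances over folds are assumptions for illustration only.

```python
from statistics import mean, median


def shadow_threshold(shadow_imp_per_fold, cutoff=1.0):
    """Max over shadow features of their median importance across folds."""
    medians = [median(values) for values in shadow_imp_per_fold.values()]
    return max(medians) / cutoff


def select_features(real_imp_per_fold, shadow_imp_per_fold, cutoff=1.0):
    """Keep real features whose fold-averaged importance beats the threshold."""
    thr = shadow_threshold(shadow_imp_per_fold, cutoff)
    return [name for name, values in real_imp_per_fold.items() if mean(values) > thr]


# Toy importances: one value per fold (three folds), made up for the example
real = {"x1": [0.50, 0.60, 0.55], "x2": [0.05, 0.02, 0.04]}
shadow = {"shadow_x1": [0.03, 0.06, 0.05], "shadow_x2": [0.01, 0.02, 0.09]}

threshold = shadow_threshold(shadow)
selected = select_features(real, shadow)
print(threshold, selected)
```

Taking the median per shadow feature damps a single noisy fold (the 0.09 above), while the max over shadows keeps the threshold conservative.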
- Parameters:
objective (str or callable, default None) – The objective function to use in lightGBM. If None, it uses the objective specified in lgbm_params.
cutoff (float, default 1) – The value by which the max of shadow imp is divided, to compare to real importance.
n_folds (int, default 5) – The number of folds for cross-validation.
folds (Optional[Union[Iterable[Tuple[np.ndarray, np.ndarray]]) – Generator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None (default None). If generator or iterator, it should yield the train and test indices for each fold. If object, it should be one of the scikit-learn splitter classes (https://scikit-learn.org/stable/modules/classes.html#splitter-classes) and have a split method. This argument has highest priority over other data split arguments.
n_iter (int, default 5) – The number of iterations to average for the feature importance (on the same split), to reduce variance.
silent (bool, default True) – Set to True if you don’t want to see the GrootCV output printed.
rf (bool, default False) – If True, use random forest for calculating feature importances; otherwise, use lightGBM.
fastshap (bool, default False) – If True, use fastSHAP for calculating feature importances; otherwise, use SHAP.
n_jobs (int, default 0) – The number of jobs to run in parallel. If 0, no parallelism is used.
lgbm_params (dict, default None) – The parameters for the lightGBM model.
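The folds argument accepts a generator of (train_idx, test_idx) tuples. The sketch below shows what such a generator could look like using only the standard library. The name `kfold_indices` is hypothetical, and it yields plain lists; in practice one would likely convert them to NumPy arrays (or simply pass a scikit-learn splitter object instead).

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) for contiguous, non-shuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(n_samples=7, n_folds=3))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Each sample index appears in exactly one test fold, which is the property a custom fold generator should preserve.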
- Variables:
selected_features (ndarray) – The list of columns to keep as selected features.
cv_df (pd.DataFrame) – DataFrame containing feature importance values for each fold and iteration.
sha_cutoff (float) – The threshold below which the predictors are rejected.
ranking_absolutes (list) – The absolute feature ranking as ordered by the selection process.
ranking (ndarray) – The feature ranking, where 2 corresponds to selected features and 1 to tentative features.
Warning
If sha_cutoff is None, you should apply the fit method first.
Examples
>>> from arfs.feature_selection.allrelevant import GrootCV
>>> X = df[filtered_features].copy()
>>> y = df['target'].copy()
>>> w = df['weight'].copy()
>>> feat_selector = GrootCV(objective='rmse', cutoff=1, n_folds=5, n_iter=5)
>>> feat_selector.fit(X, y, sample_weight=None)
>>> feat_selector.plot_importance(n_feat_per_inch=5)
- fit(X, y, sample_weight=None)[source]#
Fit the GrootCV on the input data.
- Parameters:
X (pd.DataFrame of shape (n_samples, n_features)) – The predictor dataframe.
y (array-like of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
- Returns:
self (object) – Returns self.
- plot_importance(n_feat_per_inch=5)[source]#
Plot the feature importance of the fitted GrootCV.
- Parameters:
n_feat_per_inch (int, default 5) – The number of features per inch in the plot.
- Returns:
fig (matplotlib.figure.Figure or None) – The matplotlib figure containing the plot or None if no feature is selected.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GrootCV#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.allrelevant.Leshy(estimator, n_estimators=1000, perc=90, alpha=0.05, importance='shap', two_step=True, max_iter=100, random_state=None, verbose=0, keep_weak=False)[source]#
Bases: SelectorMixin, BaseEstimator
This is an improved version of BorutaPy, which itself is an improved Python implementation of the Boruta R package. Boruta is an all-relevant feature selection method, while most others are minimal optimal; this means it tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some estimator has a minimal error.
Why bother with all-relevant feature selection? When you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (a minimal optimal set of features by definition depends on your estimator choice).
- Parameters:
estimator (object) – A supervised learning estimator, with a ‘fit’ method that exposes the feature_importances_ attribute. Important features must correspond to high absolute values in feature_importances_.
n_estimators (int or string, default=1000) – If int, sets the number of estimators in the chosen ensemble method. If ‘auto’, this is determined automatically based on the size of the dataset. The other parameters of the estimator need to be set at initialisation.
perc (int, default=90) – Instead of the max, the percentile defined by the user is used to pick the threshold for comparison between shadow and real features. The max tends to be too stringent, and this provides finer control. The lower perc is, the more false positives will be picked as relevant, but also the fewer relevant features will be left out: the usual trade-off. Setting perc=100 is essentially vanilla Boruta, corresponding to the max.
alpha (float, default=0.05) – Level at which the corrected p-values will get rejected in both correction steps.
importance (str, default='shap') – The kind of variable importance used to compare and discriminate original vs shadow predictors. Note that the builtin tree importance (gini/impurity based importance) is biased towards numerical and large cardinality predictors, even if they are random. Shapley values and permutation importance are robust w.r.t. those predictors. Possible values: ‘shap’ (Shapley values), ‘fastshap’ (FastTreeShap implementation), ‘pimp’ (permutation importance) and ‘native’ (Gini/impurity).
two_step (Boolean, default=True) – If you want to use the original implementation of Boruta with Bonferroni correction only, set this to False.
max_iter (int, default=100) – The maximum number of iterations to perform.
random_state (int, RandomState instance or None; default=None) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
verbose (int, default 0) – Controls verbosity of output. 0: no output, 1: displays iteration number, 2: which features have been selected already.
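The effect of perc can be sketched without the library: the threshold is a percentile of the shadow importances rather than their maximum. The helper below is a hypothetical pure-Python percentile with linear interpolation, and the shadow importances are made-up numbers; the interpolation method is an assumption, not necessarily what Leshy uses internally.

```python
def percentile(values, perc):
    """Percentile with linear interpolation, perc in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * perc / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


# Made-up shadow importances for one iteration
shadow_imp = [0.01, 0.02, 0.03, 0.04, 0.10]

vanilla = percentile(shadow_imp, 100)  # same as max(shadow_imp)
relaxed = percentile(shadow_imp, 90)   # less stringent threshold
print(vanilla, relaxed)
```

With a single outlying shadow importance, as in this toy list, lowering perc moves the threshold noticeably below the max, so more real features can register a hit.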
- Variables:
n_features (int) – The number of selected features.
support (array of shape [n_features]) – The mask of selected features - only confirmed ones are True.
support_weak (array of shape [n_features]) – The mask of selected tentative features, which haven’t gained enough support during the max_iter number of iterations.
selected_features (list of str) – The list of columns to keep.
ranking (array of shape [n_features]) – The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1 and tentative features are assigned rank 2.
ranking_absolutes (array of shape [n_features]) – The absolute feature ranking as ordered by the selection process. It does not guarantee that this order is correct for all models. For a model-agnostic ranking, see the attribute ranking.
cat_name (list of str) – The names of the categorical columns.
cat_idx (list of int) – The indices of the categorical columns.
imp_real_hist (array) – Array of the historical feature importance of the real predictors.
sha_max (float) – The maximum feature importance of the shadow predictors.
col_names (list of str) – The names of the real predictors.
Examples
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from arfs.feature_selection.allrelevant import Leshy
>>>
>>> # load X and y
>>> # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
>>> X = pd.read_csv('examples/test_X.csv', index_col=0).values
>>> y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
>>> y = y.ravel()
>>>
>>> # define random forest classifier, utilising all cores and
>>> # sampling in proportion to y labels
>>> rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
>>>
>>> # define the feature selection method
>>> feat_selector = Leshy(rf, n_estimators='auto', verbose=2, random_state=1)
>>>
>>> # find all relevant features - 5 features should be selected
>>> feat_selector.fit(X, y)
>>>
>>> # check selected features - first 5 features are selected
>>> feat_selector.selected_features_
>>>
>>> # check ranking of features
>>> feat_selector.ranking_
>>>
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
References
See the original paper [1] for more details.
[1] Kursa M., Rudnicki W., “Feature Selection with the Boruta Package”, Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010.
- _add_shadows_get_imps(X, y, sample_weight, dec_reg)[source]#
Add a shuffled copy of the columns (shadows) and get the feature importance of the augmented data set
- Parameters:
X (pd.DataFrame of shape [n_samples, n_features]) – Predictor matrix.
y (pd.Series of shape [n_samples]) – Target.
sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.
dec_reg (array) – Holds the decision about each feature: 1, 0, -1 (accepted, undecided, rejected).
- Returns:
imp_real (array) – Feature importance of the real predictors.
imp_sha (array) – Feature importance of the shadow predictors.
- static _assign_hits(hit_reg, cur_imp, imp_sha_max)[source]#
Count how many times a given feature was more important than the best of the shadow features.
- Parameters:
hit_reg (array) – Counts how many times a given feature was more important than the best of the shadow features.
cur_imp (array) – Current importance.
imp_sha_max (array) – Importance of the best shadow predictor.
- Returns:
hit_reg (array) – The number of times a given feature was more important than the best of the shadow features.
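The hit-counting step can be sketched in a few lines of plain Python. This is an illustration of the idea rather than the library's implementation: the function name `assign_hits` (without the leading underscore), the use of lists instead of arrays, and the strict `>` comparison are assumptions.

```python
def assign_hits(hit_reg, cur_imp, imp_sha_max):
    """Return updated hit counts: +1 where a feature beats the best shadow."""
    return [
        hits + (1 if imp > imp_sha_max else 0)
        for hits, imp in zip(hit_reg, cur_imp)
    ]


hit_reg = [0, 2, 1]          # hits accumulated in earlier iterations
cur_imp = [0.5, 0.1, 0.3]    # importances in the current iteration (made up)
hit_reg = assign_hits(hit_reg, cur_imp, imp_sha_max=0.2)
print(hit_reg)
```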
- _calculate_absolute_ranking()[source]#
Compute feature importance scores using SHAP values.
- Parameters:
new_x_tr (
numpy.ndarray) – The training dataset after being processed.shap_matrix (
numpy.ndarray) – The matrix containing SHAP values computed by a LightGBM model.param (
dict) – A dictionary containing the parameters for a LightGBM model.objective (
str) – The objective function of the LightGBM model.
- Returns:
list– A list of tuples containing feature names and their corresponding importance scores.
- _calculate_relative_ranking(n_feat, tentative, confirmed, imp_history)[source]#
Calculates the relative ranking of features based on their importance history.
- Parameters:
n_feat (int) – The total number of features.
tentative (ndarray of shape (n_tentative_features,)) – An array containing the indices of tentative features.
confirmed (ndarray of shape (n_confirmed_features,)) – An array containing the indices of confirmed features.
imp_history (ndarray of shape (n_iterations + 1, n_features)) – An array containing the feature importances for each iteration.
- Returns:
None
- _calculate_support(confirmed, tentative, n_feat)[source]#
Calculate the feature support arrays.
- Parameters:
confirmed (array-like of shape (n_confirmed,)) – Indices of confirmed features.
tentative (array-like of shape (n_tentative,)) – Indices of tentative features.
n_feat (int) – Total number of features.
- Returns:
None – The function populates the following class attributes:
n_features_ (int) – Number of selected features.
support_ (ndarray of shape (n_feat,)) – Boolean array indicating the selected features.
support_weak_ (ndarray of shape (n_feat,)) – Boolean array indicating the tentatively selected features.
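As a rough illustration of how index lists turn into the boolean masks described above, consider the following standalone sketch. The function `support_masks` is hypothetical and returns the values instead of setting attributes on an object.

```python
def support_masks(confirmed, tentative, n_feat):
    """Build boolean masks from the indices of confirmed and tentative features."""
    confirmed_set, tentative_set = set(confirmed), set(tentative)
    support = [i in confirmed_set for i in range(n_feat)]
    support_weak = [i in tentative_set for i in range(n_feat)]
    n_features = sum(support)  # number of confirmed features
    return n_features, support, support_weak


n_features, support, support_weak = support_masks(confirmed=[0, 3], tentative=[1], n_feat=5)
print(n_features, support, support_weak)
```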
- _check_params(X, y)[source]#
Private method. Check hyperparameters as well as X and y before proceeding with fit.
- Parameters:
X (pd.DataFrame) – Predictor matrix.
y (pd.Series) – Target series.
- Raises:
ValueError – [description]
ValueError – [description]
- _do_tests(dec_reg, hit_reg, _iter)[source]#
Private method. Perform the test of whether the feature should be tagged as relevant (confirmed), not relevant (rejected) or undecided. The test is performed by considering binomial trials over several attempts, i.e. count how many times a given feature was more important than the best of the shadow features and test whether the probability associated with the z-score is below, between or above the rejection or acceptance threshold.
- Parameters:
dec_reg (array) – Holds the decision about each feature: 1, 0, -1 (accepted, undecided, rejected).
hit_reg (array) – Counts how many times a given feature was more important than the best of the shadow features.
_iter (int) – Iteration number.
- Returns:
dec_reg (array) – Holds the decision about each feature: 1, 0, -1 (accepted, undecided, rejected).
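The decision logic can be sketched as a two-sided binomial test on the hit counts. The version below is a simplified illustration, not the library's code: it uses exact binomial tail probabilities with a Bonferroni-corrected level only (roughly the two_step=False setting), assumes a hit probability of 0.5 under the null, and the names `binom_cdf` and `decide` are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p); returns 0.0 when k < 0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(0, k + 1))


def decide(hits, n_iter, n_features, alpha=0.05):
    """Return 1 (accepted), -1 (rejected) or 0 (undecided) for one feature."""
    level = alpha / n_features                    # Bonferroni correction
    p_accept = 1.0 - binom_cdf(hits - 1, n_iter)  # P(X >= hits)
    p_reject = binom_cdf(hits, n_iter)            # P(X <= hits)
    if p_accept <= level:
        return 1
    if p_reject <= level:
        return -1
    return 0


# Three features after 20 iterations, with made-up hit counts
decisions = [decide(hits, n_iter=20, n_features=3) for hits in (19, 10, 1)]
print(decisions)
```

A feature that beats the best shadow in about half of the iterations stays undecided, which is why such features end up as tentative when max_iter is reached.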
- static _fdrcorrection(pvals, alpha=0.05)[source]#
Benjamini/Hochberg p-value correction for false discovery rate, from statsmodels package. Included here for decoupling dependency on statsmodels.
- Parameters:
pvals (array_like) – Set of p-values of the individual tests.
alpha (float) – Error rate.
- Returns:
rejected (array, bool) – True if a hypothesis is rejected, False if not.
pvalue-corrected (array) – p-values adjusted for multiple hypothesis testing to limit FDR.
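For readers unfamiliar with the Benjamini/Hochberg procedure, a compact pure-Python sketch follows. It is written from the textbook definition (adjusted p-value = p × n / rank, made monotone from the largest p-value down) and is not copied from the library or from statsmodels; the function name `fdr_bh` is an assumption.

```python
def fdr_bh(pvals, alpha=0.05):
    """Benjamini/Hochberg correction: returns (rejected, adjusted p-values)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    # Raw adjustment p_(k) * n / k on the sorted p-values (k is the 1-based rank)
    adj_sorted = [pvals[idx] * n / (rank + 1) for rank, idx in enumerate(order)]
    # Enforce monotonicity, walking from the largest p-value downwards
    for k in range(n - 2, -1, -1):
        adj_sorted[k] = min(adj_sorted[k], adj_sorted[k + 1])
    adjusted = [0.0] * n
    for rank, idx in enumerate(order):
        adjusted[idx] = min(adj_sorted[rank], 1.0)
    rejected = [p <= alpha for p in adjusted]
    return rejected, adjusted


rejected, adjusted = fdr_bh([0.01, 0.04, 0.03, 0.20], alpha=0.05)
print(rejected, adjusted)
```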
- _fit(X_raw, y, sample_weight=None)[source]#
Private method. See the methods overview in the documentation for explanation of the process
- Parameters:
X_raw (array-like, shape = [n_samples, n_features]) – The training input samples.
y (array-like, shape = [n_samples]) – The target values.
sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.
- Returns:
self (object) – Nothing but attributes.
- _get_tree_num(n_feat)[source]#
Private method. Get a good estimate for the number of trees given the number of features.
- Parameters:
n_feat (int) – The number of features.
- Returns:
n_estimators (int) – The number of trees.
- static _nanrankdata(X, axis=1)[source]#
Replaces bottleneck’s nanrankdata with scipy and numpy alternative.
- Parameters:
X (array or pd.DataFrame) – The data array.
axis (int, optional) – Row-wise (0) or column-wise (1), by default 1.
- Returns:
ranks (array) – The ranked array.
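The idea of ranking while ignoring missing values can be sketched for a single row in plain Python. This is only an illustration: the helper `nanrank` is hypothetical, it handles one 1-D sequence instead of a 2-D array with an axis argument, and averaging the ranks of ties is an assumption.

```python
import math


def nanrank(row):
    """1-based average ranks of the non-NaN entries; NaN entries stay NaN."""
    valid = sorted((v, i) for i, v in enumerate(row) if not math.isnan(v))
    ranks = [math.nan] * len(row)
    pos = 0
    while pos < len(valid):
        # Extend `end` over the run of values tied with valid[pos]
        end = pos
        while end + 1 < len(valid) and valid[end + 1][0] == valid[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for _, i in valid[pos:end + 1]:
            ranks[i] = avg_rank
        pos = end + 1
    return ranks


ranks = nanrank([0.3, math.nan, 0.1, 0.3])
print(ranks)
```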
- _print_result(dec_reg, _iter, start_time)[source]#
Print the results of feature selection.
- Parameters:
dec_reg (bool) – Decision on whether to proceed with another round of feature selection.
_iter (int) – Current iteration number.
start_time (float) – Time when the feature selection process started.
- Returns:
None – The function prints the relevant results and running time.
- _print_results(dec_reg, _iter, flag)[source]#
Private method, printing the result
- Parameters:
dec_reg (array) – Whether the feature has been tagged as relevant (confirmed), not relevant (rejected) or undecided.
_iter (int) – The iteration number.
flag (int) – Whether the feature selection process is still running or not.
- Returns:
output (str) – The output to be printed out.
- _run_iteration(X, y, sample_weight, dec_reg, sha_max_history, imp_history, hit_reg, _iter)[source]#
Run one iteration of the Leshy feature selection loop.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input samples.
y (array-like of shape (n_samples,)) – The target values.
sample_weight (array-like of shape (n_samples,), default None) – Sample weights. If None, then samples are equally weighted.
dec_reg (array-like of shape (n_features,)) – Holds the decision about each feature: 1, 0, -1 (accepted, undecided, rejected).
sha_max_history (list of floats) – List of the maximum shadow importance value at each iteration.
imp_history (array-like of shape (n_iterations, n_features)) – Matrix of feature importances at each iteration.
hit_reg (array-like of shape (n_features,)) – Counts how many times each feature was more important than the best of the shadow features.
_iter (int) – The current iteration number.
- Returns:
dec_reg (array-like of shape (n_features,)) – Updated decision about each feature.
sha_max_history (list of floats) – List of the maximum shadow importance value at each iteration.
imp_history (array-like of shape (n_iterations, n_features)) – Matrix of feature importances at each iteration.
hit_reg (array-like of shape (n_features,)) – Updated hit counts for each feature.
imp_sha_max (float) – The maximum shadow importance value for this iteration.
- _update_estimator()[source]#
Update the estimator with a new random state, if applicable.
If the dataset is not categorical, the estimator’s random_state parameter is updated with a new random state generated by the random_state attribute of the Leshy object. If the estimator is a LightGBM model, the random state value is generated between 0 and 10000.
- Parameters:
None –
- Returns:
None
- _update_tree_num(dec_reg)[source]#
Update the number of trees in the estimator based on the number of selected features.
- Parameters:
dec_reg (array-like of shape (n_features,)) – The decision rule for each feature, where negative values indicate that the feature should be rejected and non-negative values indicate that the feature should be selected.
- Returns:
None
Notes
This function updates the n_estimators parameter of the estimator if it is set to “auto”. The number of trees is determined based on the number of selected features. Specifically, the number of trees is set to the value returned by the _get_tree_num method, which takes as input the number of selected features that are not rejected.
If n_estimators is not set to “auto”, this function does nothing.
- fit(X, y, sample_weight=None)[source]#
Fits the Boruta feature selection with the provided estimator.
- Parameters:
X (array-like, shape = [n_samples, n_features]) – The training input samples.
y (array-like, shape = [n_samples]) – The target values.
sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.
- Returns:
self (object) – Nothing but attributes.
- plot_importance(n_feat_per_inch=5)[source]#
Boxplot of the variable importance, ordered by magnitude. The max shadow variable importance is illustrated by the dashed line. Requires calling the fit method first.
- Parameters:
n_feat_per_inch (int, default 5) – Number of features to plot per inch (for scaling the figure).
- Returns:
fig (plt.figure) – The matplotlib figure object containing the boxplot.
- select_features(X, y, sample_weight=None)[source]#
Select features using the Leshy algorithm.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input data.
y (array-like of shape (n_samples,)) – The target values.
sample_weight (array-like of shape (n_samples,), default None) – Individual weights for each sample.
- Returns:
dec_reg (ndarray of shape (n_features,)) – The decision rule. 1 means the feature is selected, 0 means the feature is not selected.
sha_max_history (list) – List of the maximum shadow importances per iteration.
imp_history (ndarray of shape (n_iterations, n_features)) – Array containing the feature importances per iteration.
imp_sha_max (float) – Maximum shadow importance value.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') Leshy#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- arfs.feature_selection.allrelevant._boostaroota(X, y, estimator, cutoff, iters, max_rounds, delta, silent, weight, imp)[source]#
Private function, reduces the number of predictors using a sklearn estimator.
- Parameters:
X (pd.DataFrame) – The dataframe to create shadow features on.
y (pd.Series) – The target.
estimator (scikit-learn estimator) – The model to train, lightGBM recommended, see the reduce lightgbm method.
cutoff (float) – The value by which the max of shadow imp is divided, to compare to real importance.
iters (int (>0)) – The number of iterations to average for the feature importances (on the same split), to reduce the variance.
max_rounds (int (>0)) – The number of times the core BoostARoota algorithm will run. Each round eliminates more and more features.
delta (float (0 < delta <= 1)) – Stopping criterion for whether another round is started.
silent (bool) – Set to True if you don’t want to see the BoostARoota output printed. Will still show any errors or warnings that may occur.
weight (pd.Series, optional) – Sample weights, if any.
imp (str) – Which importance to use: native, shap, fastshap or permutation importance.
- Returns:
crit (bool) – Whether the criteria have been reached or not.
keep_vars (pd.DataFrame) – Feature importance of the real predictors over iterations.
df_vimp (pd.DataFrame) – Feature importance of the real+shadow predictors over iterations.
mean_shadow (float) – The feature importance threshold to reject or not the predictors.
- arfs.feature_selection.allrelevant._compute_importance(new_x_tr, shap_matrix, param, objective, fastshap)[source]#
Compute feature importance scores using SHAP values.
- Parameters:
new_x_tr (numpy.ndarray) – The training dataset after being processed.
shap_matrix (numpy.ndarray) – The matrix containing SHAP values computed by a LightGBM model.
param (dict) – A dictionary containing the parameters for a LightGBM model.
objective (str) – The objective function of the LightGBM model.
- Returns:
list – A list of tuples containing feature names and their corresponding importance scores.
- arfs.feature_selection.allrelevant._create_shadow(X_train)[source]#
Create shadow features by making copies of all X variables and randomly shuffling them.
- Parameters:
X_train (pd.DataFrame) – The dataframe to create shadow features on.
- Returns:
pd.DataFrame – A dataframe that is twice the width of X_train and contains the shadow features, along with a list of the shadow feature names.
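Shadow creation can be illustrated without pandas by treating a table as a dict of columns. The sketch below is an assumption-laden stand-in for the real function: the name `create_shadow`, the `shadow_` prefix, and the seed argument are all hypothetical.

```python
import random


def create_shadow(columns, seed=0):
    """Return (augmented columns, shadow names); each shadow is a shuffled copy."""
    rng = random.Random(seed)
    augmented = dict(columns)          # keep the original columns untouched
    shadow_names = []
    for name, values in columns.items():
        shuffled = list(values)
        rng.shuffle(shuffled)          # breaks any link with the target
        shadow_name = f"shadow_{name}"  # hypothetical naming scheme
        augmented[shadow_name] = shuffled
        shadow_names.append(shadow_name)
    return augmented, shadow_names


columns = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}
augmented, shadow_names = create_shadow(columns)
print(shadow_names, len(augmented))
```

Because each shadow keeps the marginal distribution of its source column but loses its alignment with the target, its importance estimates what pure noise can achieve.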
- arfs.feature_selection.allrelevant._get_confirmed_and_tentative(dec_reg)[source]#
Extracts the confirmed and tentative features from dec_reg.
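Using the encoding given elsewhere on this page for dec_reg (1 accepted, 0 undecided, -1 rejected), the extraction could look like the sketch below. The function name and the use of plain lists are assumptions for illustration.

```python
def confirmed_and_tentative(dec_reg):
    """Split feature indices into confirmed (1) and tentative (0) lists."""
    confirmed = [i for i, decision in enumerate(dec_reg) if decision == 1]
    tentative = [i for i, decision in enumerate(dec_reg) if decision == 0]
    return confirmed, tentative


confirmed, tentative = confirmed_and_tentative([1, 0, -1, 1])
print(confirmed, tentative)
```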
- arfs.feature_selection.allrelevant._get_imp(estimator, X, y, sample_weight=None, cat_feature=None)[source]#
Private function. Get the native feature importance (impurity-based, for instance).
Notes
This is known to return biased and uninformative results, e.g. https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-py
or
https://explained.ai/rf-importance/
- Parameters:
X (array-like, shape = [n_samples, n_features]) – The training input samples.
y (array-like, shape = [n_samples]) – The target values.
sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.
cat_feature (list of int or None) – The list of integers, cols loc, of the categorical predictors. Avoids detecting and encoding each iteration if the exact same columns are passed to the selection methods.
- Returns:
imp (array) – The native importance array.
- arfs.feature_selection.allrelevant._get_perm_imp(estimator, X, y, sample_weight, cat_feature=None)[source]#
Private function, get the permutation feature importance.
- Parameters:
estimator (sklearn estimator) – An estimator object implementing fit and predict methods.
X (pd.DataFrame of shape [n_samples, n_features]) – Predictor matrix.
y (pd.Series of shape [n_samples]) – Target.
sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.
cat_feature (list of int or None) – The list of integers, columns loc, of the categorical predictors. Avoids detecting and encoding them at each iteration if the exact same columns are passed to the selection methods.
- Returns:
imp (array) – The permutation importance array.
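For readers unfamiliar with permutation importance, here is a small NumPy-only sketch of the general technique: shuffle one column at a time and measure how much the prediction error increases. The function name, the use of mean squared error and the least-squares stand-in model are assumptions for the illustration; they are not taken from `_get_perm_imp`.

```python
import numpy as np


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE after shuffling each column (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.mean((y - predict(X)) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between column j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((y - predict(X_perm)) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances


# Toy data: only the first column drives the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Ordinary least squares stands in for any fitted estimator
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
imp = permutation_importance_sketch(lambda A: A @ coef, X, y)
```

The first entry of `imp` dominates because permuting the only informative column destroys most of the predictive signal.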
- arfs.feature_selection.allrelevant._get_shap_imp(estimator, X, y, sample_weight=None, cat_feature=None)[source]#
Get the SHAP feature importance (compatible with all SHAP versions)
- Parameters:
estimator (estimator object) – An estimator object implementing fit and predict methods.
X (pd.DataFrame of shape [n_samples, n_features]) – Predictor matrix.
y (pd.Series of shape [n_samples]) – Target variable.
sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.
cat_feature (list of int or None, default None) – The list of integers, columns loc, of the categorical predictors.
- Returns:
shap_imp (array) – The SHAP importance array.
- arfs.feature_selection.allrelevant._get_shap_imp_fast(estimator, X, y, sample_weight=None, cat_feature=None)[source]#
Get the SHAP feature importance using the fasttreeshap implementation
- Parameters:
estimator (estimator object) – An estimator object implementing fit and predict methods.
X (pd.DataFrame of shape [n_samples, n_features]) – Predictor matrix.
y (pd.Series of shape [n_samples]) – Target variable.
sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.
cat_feature (list of int or None, default None) – The list of integers, columns loc, of the categorical predictors. Avoids detecting and encoding each iteration if the exact same columns are passed to the selection methods.
- Returns:
shap_imp (array) – The SHAP importance array.
- arfs.feature_selection.allrelevant._merge_importance_df(df, importance, iter, n_folds, column_names, silent=True)[source]#
Merge the feature importance dataframe df with the importance information for the current iteration of a cross-validation loop.
- Parameters:
df (pandas.DataFrame) – The current feature importance dataframe.
importance (dict) – A dictionary with the feature importance information for the current iteration.
iter (int) – The index of the current iteration.
n_folds (int) – The number of folds used in the cross-validation loop.
silent (bool, optional) – If True, suppress output.
- Returns:
pandas.DataFrame – The updated feature importance dataframe.
- arfs.feature_selection.allrelevant._reduce_vars_lgb_cv(X, y, objective, folds, n_folds, cutoff, n_iter, silent, weight, rf, fastshap, lgbm_params=None, n_jobs=0)[source]#
Reduce the number of predictors using LightGBM (Python API).
- Parameters:
X (pd.DataFrame) – The dataframe to create shadow features on.
y (pd.Series) – The target.
objective (str) – The LightGBM objective.
folds (generator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None, optional (default=None)) – If generator or iterator, it should yield the train and test indices for each fold. If object, it should be one of the scikit-learn splitter classes (https://scikit-learn.org/stable/modules/classes.html#splitter-classes) and have a split method. This argument has the highest priority over other data split arguments.
n_folds (int) – Number of folds in CV.
cutoff (float) – The value by which the max of the shadow importance is divided, to compare to the real importance.
n_iter (int) – The number of repetitions of the cross-validation, to smooth out the feature importance noise.
silent (bool) – Set to True if you don’t want to see the BoostARoota output printed. Errors and warnings will still be shown.
weight (pd.Series) – sample_weight, if any.
rf (bool, default False) – Whether to use the LightGBM implementation of the random forest.
fastshap (bool) – Whether to enable the fasttreeshap implementation.
lgbm_params (dict, optional) – Dictionary of LightGBM parameters.
n_jobs (int, default 0) – 0 means the default number of threads in OpenMP. For the best speed, set this to the number of real CPU cores, not the number of threads.
- Returns:
real_vars[‘feature’] (pd.DataFrame) – Feature importance of the real predictors over the iterations.
df (pd.DataFrame) – Feature importance of the real and shadow predictors over the iterations.
cutoff_shadow (float) – The feature importance threshold used to reject or keep the predictors.
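The role of `cutoff` can be illustrated with a short sketch that follows the parameter description above: the maximum shadow importance is divided by `cutoff`, and real features are kept only if they score above that threshold. The helper name and the `ShadowVar` prefix are hypothetical, and the real function works on importances aggregated over folds and repetitions, which this example skips.

```python
import pandas as pd


def keep_above_shadow_cutoff(importance, shadow_prefix="ShadowVar", cutoff=1.0):
    """Keep real features whose importance exceeds max(shadow) / cutoff (sketch)."""
    is_shadow = importance.index.str.startswith(shadow_prefix)
    # A larger cutoff lowers the threshold and therefore keeps more features
    threshold = importance[is_shadow].max() / cutoff
    real = importance[~is_shadow]
    kept = real[real > threshold].index.tolist()
    return kept, threshold


importance = pd.Series(
    {"age": 0.40, "income": 0.05, "noise": 0.01, "ShadowVar1": 0.02, "ShadowVar2": 0.04}
)
kept, threshold = keep_above_shadow_cutoff(importance, cutoff=1.0)
```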
- arfs.feature_selection.allrelevant._reduce_vars_sklearn(X, y, estimator, this_round, cutoff, n_iterations, delta, silent, weight, imp_kind, cat_feature)[source]#
Private function, reduce the number of predictors using a sklearn estimator.
- Parameters:
X (pd.DataFrame) – The dataframe to create shadow features on.
y (pd.Series) – The target.
estimator (sklearn estimator) – The model to train, LightGBM recommended.
this_round (int) – The number of times the core BoostARoota algorithm will run. Each round eliminates more and more features.
cutoff (float) – The value by which the max of the shadow importance is divided, to compare to the real importance.
n_iterations (int) – The number of iterations to average for the feature importance (on the same split), to reduce the variance.
delta (float (0 < delta <= 1)) – Stopping criterion for whether another round is started.
silent (bool) – Set to True if you don’t want to see the BoostARoota output printed. Errors and warnings will still be shown.
weight (pd.Series) – sample_weight, if any.
imp_kind (str) – Whether native, shap, fastshap or permutation importance should be used.
cat_feature (list of int or None) – The list of integers, columns loc, of the categorical predictors. Avoids detecting and encoding them at each iteration if the exact same columns are passed to the selection methods.
- Returns:
criteria (bool) – Whether the stopping criterion has been reached.
real_vars[‘feature’] (pd.DataFrame) – Feature importance of the real predictors over the iterations.
df (pd.DataFrame) – Feature importance of the real and shadow predictors over the iterations.
mean_shadow (float) – The feature importance threshold used to reject or keep the predictors.
- Raises:
ValueError – error if the feature importance type is not
- arfs.feature_selection.allrelevant._select_tentative(tentative, imp_history, sha_max_history)[source]#
Select tentative features based on median importance values.
- Parameters:
tentative (array-like of shape (n_tentative,)) – Array of indices representing tentative features.
imp_history (array-like of shape (n_iterations + 1, n_features)) – Importance values for each feature in each iteration.
sha_max_history (array-like of shape (n_iterations + 1,)) – The history of the highest stability scores.
- Returns:
tentative (array-like of shape (n_tentative_confirmed,)) – The confirmed tentative features based on their median importance values.
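A median-based rule of this kind can be sketched as below. The sketch is modeled on the tentative-feature rule commonly used in Boruta-style implementations and is an assumption, not a copy of `_select_tentative`. In particular, it assumes that the first row of `imp_history` is an initialisation placeholder to be skipped, and that `sha_max_history` holds one reference value per iteration.

```python
import numpy as np


def select_tentative_by_median(tentative, imp_history, sha_max_history):
    """Keep tentative features whose median importance beats the median reference."""
    tentative = np.asarray(tentative)
    imp_history = np.asarray(imp_history, dtype=float)
    # Skip the first row, assumed here to be an initialisation placeholder
    tentative_median = np.median(imp_history[1:, tentative], axis=0)
    threshold = np.median(sha_max_history)
    return tentative[tentative_median > threshold]


imp_history = np.array([
    [0.0, 0.0, 0.0, 0.0],  # placeholder row
    [0.5, 0.1, 0.3, 0.0],
    [0.6, 0.2, 0.1, 0.0],
    [0.7, 0.1, 0.4, 0.0],
])
sha_max_history = [0.20, 0.25, 0.30]
confirmed = select_tentative_by_median([1, 2], imp_history, sha_max_history)
```

Feature 1 has a median importance of 0.1 and feature 2 a median of 0.3, so with a reference median of 0.25 only feature 2 is retained.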
- arfs.feature_selection.allrelevant._set_lgb_parameters(X, y, objective, rf, silent, n_jobs=0, lgbm_params=None)[source]#
Set parameters for a LightGBM model based on the input features and the objective.
- Parameters:
X (numpy array or pandas DataFrame) – The feature matrix of the training data.
y (numpy array or pandas Series) – The target variable of the training data.
objective (str) – The objective function to optimize during training.
rf (bool, default False) – Whether to use random forest boosting.
silent (bool, default True) – Whether to suppress messages during parameter setting.
n_jobs (int, default 0) – 0 means the default number of threads in OpenMP. For the best speed, set this to the number of real CPU cores, not the number of threads.
- Return type:
dict
- Returns:
dict – The dictionary of LightGBM parameters.
- arfs.feature_selection.allrelevant._split_data(X, y, tridx, validx, weight=None)[source]#
Split data into train and validation sets based on provided indices.
- Parameters:
X (pandas.DataFrame) – Features.
y (pandas.Series) – Target variable.
tridx (list) – Indices to be used for training.
validx (list) – Indices to be used for validation.
weight (pandas.Series, optional) – Weights for each sample, by default None.
- Returns:
tuple of pandas.DataFrame and pandas.Series – X_train, X_val, y_train, y_val, weight_tr, weight_val
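An index-based split with optional weights might look like the following sketch. It assumes positional indices (hence `iloc`) and returns `None` for both weight outputs when no weights are given; both choices are assumptions for the example, and the function name is hypothetical.

```python
import pandas as pd


def split_by_index(X, y, tridx, validx, weight=None):
    """Split X, y and optional weights using positional indices (sketch)."""
    X_train, X_val = X.iloc[tridx], X.iloc[validx]
    y_train, y_val = y.iloc[tridx], y.iloc[validx]
    if weight is None:
        weight_tr, weight_val = None, None
    else:
        weight_tr, weight_val = weight.iloc[tridx], weight.iloc[validx]
    return X_train, X_val, y_train, y_val, weight_tr, weight_val


X = pd.DataFrame({"a": range(6), "b": range(6, 12)})
y = pd.Series(range(6), name="target")
w = pd.Series([1.0, 2.0, 1.0, 2.0, 1.0, 2.0])
X_tr, X_val, y_tr, y_val, w_tr, w_val = split_by_index(X, y, [0, 1, 2, 3], [4, 5], weight=w)
```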
- arfs.feature_selection.allrelevant._split_fit_estimator(estimator, X, y, sample_weight=None, cat_feature=None)[source]#
Private function, split the data into train and test sets and fit the model.
- Parameters:
estimator (estimator object implementing 'fit' and 'predict') – The object to use to fit the data.
X (pd.DataFrame of shape [n_samples, n_features]) – Predictor matrix.
y (pd.Series of shape [n_samples]) – Target.
sample_weight (array-like, shape = [n_samples], default None) – Individual weights for each sample.
cat_feature (list of int or None) – The list of integers, columns loc, of the categorical predictors. Avoids detecting and encoding them at each iteration if the exact same columns are passed to the selection methods.
- Returns:
model – The fitted model.
X_tt (array [n_samples, n_features]) – The test split, predictors.
y_tt (array [n_samples]) – The test split, target.
- arfs.feature_selection.allrelevant._train_lgb_model(X_train, y_train, weight_train, X_val, y_val, weight_val, category_cols=None, early_stopping_rounds=20, fastshap=False, **params)[source]#
Train a LightGBM model with the given training data and hyperparameters and return the trained model and its SHAP values.
- Parameters:
X_train (array-like of shape (n_samples, n_features)) – The input training data.
y_train (array-like of shape (n_samples,)) – The target training data.
weight_train (array-like of shape (n_samples,)) – The sample weights for training data.
X_val (array-like of shape (n_val_samples, n_features)) – The input validation data.
y_val (array-like of shape (n_val_samples,)) – The target validation data.
weight_val (array-like of shape (n_val_samples,)) – The sample weights for validation data.
category_cols (array-like or None, optional (default=None)) – The indices of categorical columns. If None, no categorical columns will be considered.
early_stopping_rounds (int, optional (default=20)) – Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training.
fastshap (bool) – Whether to enable the fasttreeshap implementation.
**params (dict) – Other parameters passed to the LightGBM model.
- Returns:
tuple of (Booster, numpy.ndarray, int) – The trained LightGBM model, its SHAP values for X_train, and the best iteration reached during training.
arfs.feature_selection.base module#
Base Submodule
This module provides a base class for selectors that use a statistic and a threshold.
Module Structure:#
BaseThresholdSelector: parent class for the “threshold-based” selectors
- class arfs.feature_selection.base.BaseThresholdSelector(threshold=0.05, statistic_fn=None, greater_than_threshold=False)[source]#
Bases:
SelectorMixin, BaseEstimator
Base class for threshold-based feature selection
- Parameters:
threshold (float, .05) – Features with a statistic greater/lower (geq/leq) than this threshold will be removed.
statistic_fn (callable, optional) – The function for computing the statistic series. The index should be the column names and the values the computed statistic.
greater_than_threshold (bool, False) – Whether to reject the features if lower or greater than threshold.
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – Number of input predictors.
support (list of bool) – The list of the selected X-columns.
selected_features (list of str) – The list of names of selected features.
not_selected_features (list of str) – The list of names of rejected features.
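The idea of a statistic plus a threshold can be sketched with plain pandas. The function below is hypothetical and is not the class itself; in particular, it assumes that `greater_than_threshold=True` keeps the features whose statistic exceeds the threshold, and that `False` keeps those below it. Check the source before relying on that direction.

```python
import pandas as pd


def select_by_threshold(X, statistic_fn, threshold=0.05, greater_than_threshold=False):
    """Return a boolean support mask and the selected column names (sketch)."""
    # statistic_fn maps the dataframe to a Series indexed by column name
    statistic = statistic_fn(X)
    if greater_than_threshold:
        support = statistic > threshold
    else:
        support = statistic < threshold
    return support.tolist(), statistic.index[support].tolist()


X = pd.DataFrame({
    "a": [1.0, None, None, 4.0],  # 50% missing
    "b": [1.0, 2.0, 3.0, 4.0],    # 0% missing
    "c": [None, 2.0, 3.0, 4.0],   # 25% missing
})
# Use the fraction of missing values as the statistic
support, selected = select_by_threshold(X, lambda df: df.isna().mean(), threshold=0.3)
```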
- fit(X, y=None, sample_weight=None)[source]#
Learn empirical statistics from X.
- Parameters:
X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the statistic, where n_samples is the number of samples and n_features is the number of features.
y (any, default None) – Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
sample_weight (pd.Series, optional, shape (n_samples,)) – Weights for computing the statistics (e.g. weighted average).
- Returns:
self (object) – Returns the instance itself.
- fit_transform(X, y=None, sample_weight=None, **fit_params)[source]#
Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
sample_weight (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Sample weight values.
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') BaseThresholdSelector#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
arfs.feature_selection.lasso module#
LassoFeatureSelection Submodule
This module provides LASSO-based feature selection, specifically designed for use with Generalized Linear Models (GLM). The Lasso Regularized GLM introduces an L1 regularization penalty (Lasso regularization), encouraging some coefficients to become exactly zero during the model fitting process. This regularization effectively removes irrelevant features from the model, making it a powerful tool for feature selection, particularly in datasets with numerous variables.
Module Structure:#
EnetGLM: a scikit-learn wrapper for the regularized statsmodels GLM, providing seamless integration with scikit-learn’s ecosystem.
weighted_cross_val_score: a function that lets users pass weights to the model and define a custom scoring metric.
grid_search_cv: a function that performs a weighted LASSO grid search to find the best Lasso parameter for the model.
LassoFeatureSelection: the core feature selection class, which estimates the Lasso parameter through the grid search process, enabling efficient and effective feature selection.
With this submodule, users can easily leverage Lasso Regularized GLMs and conduct feature selection, improving model performance and interpretability in various datasets.
- class arfs.feature_selection.lasso.EnetGLM(family='gaussian', link=None, alpha=0.0, L1_wt=1e-06, fit_intercept=True)[source]#
Bases:
BaseEstimator, RegressorMixin
Elastic Net Generalized Linear Model.
- Parameters:
family (str, default="gaussian") – The distributional assumption of the model. It can be any of the statsmodels distributions: “gaussian”, “binomial”, “poisson”, “gamma”, “negativebinomial”, “tweedie”.
link (str, optional) – The GLM link function. It can be any of: “identity”, “log”, “logit”, “probit”, “cloglog”, “inverse_squared”.
alpha (float, optional (default=0.0)) – The penalty weight applied to the coefficients. A value of 0 means no regularization. The mix between ridge and lasso is controlled by L1_wt.
L1_wt (float, optional (default=1e-06)) – The weight of the L1 penalty term, 0 <= L1_wt <= 1. A value of 0 corresponds to ridge regression, while a value of 1 corresponds to lasso regression. However, for obtaining statistics, L1_wt should be set to a value greater than 0. If it is set to 0.0, statsmodels returns a ridge regularized wrapper without refitting the model, making the statistics unavailable and breaking the class. Nevertheless, you can set L1_wt to a very small value, such as 1e-9, to obtain close-to-ridge behavior while still obtaining the necessary statistics.
fit_intercept (bool, optional (default=True)) – Whether to fit an intercept term in the model.
- __init__(family='gaussian', link=None, alpha=0.0, L1_wt=1e-06, fit_intercept=True)[source]#
Initialize self.
- Parameters:
family (str) – The distributional assumption of the model.
link (Optional[str]) – The GLM link function.
alpha (float) – The penalty weight. If a scalar, the same penalty weight applies to all variables in the model. If a vector, it must have the same length as params, and contains a penalty weight for each coefficient.
L1_wt (float) – The L1_wt parameter represents the weight of the L1 penalty term in the model and should be within the range 0 to 1. A value of 0 corresponds to ridge regression, while a value of 1 corresponds to lasso regression. However, for obtaining statistics, L1_wt should be set to a value greater than 0. If it is set to 0.0, statsmodels returns a ridge regularized wrapper without refitting the model, making the statistics unavailable and breaking the class. Nevertheless, you can set L1_wt to a very small value, such as 1e-9, to obtain close-to-ridge behavior while still obtaining the necessary statistics.
fit_intercept (bool) – Whether to fit an intercept term in the model.
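To make the interplay between `alpha` and `L1_wt` concrete, the sketch below evaluates an elastic-net penalty of the form alpha * ((1 - L1_wt) * ||params||_2^2 / 2 + L1_wt * ||params||_1), which is the form documented for the statsmodels elastic-net fit. Treat it as an assumption about the underlying penalty, not as code from this class; the helper name is hypothetical and only a scalar `alpha` is handled.

```python
import numpy as np


def elastic_net_penalty(params, alpha, L1_wt):
    """Elastic-net penalty for a coefficient vector and a scalar alpha (sketch)."""
    params = np.asarray(params, dtype=float)
    l1_term = np.abs(params).sum()           # lasso part
    l2_term = 0.5 * np.square(params).sum()  # ridge part
    return alpha * ((1.0 - L1_wt) * l2_term + L1_wt * l1_term)


coefs = [1.0, -2.0]
pure_lasso = elastic_net_penalty(coefs, alpha=0.5, L1_wt=1.0)  # 0.5 * (1 + 2)
pure_ridge = elastic_net_penalty(coefs, alpha=0.5, L1_wt=0.0)  # 0.5 * (1 + 4) / 2
no_penalty = elastic_net_penalty(coefs, alpha=0.0, L1_wt=1.0)
```

With `alpha=0.0` the penalty vanishes regardless of `L1_wt`, which is why `alpha` is the quantity tuned by a grid search while `L1_wt` only sets the ridge/lasso mix.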
- fit(X, y, sample_weight=None)[source]#
Fit the model to the data.
Notes
In statsmodels and GLMs in general, you can use either an offset or a weight to account for differences in exposure between observations. However, if you choose to use an offset, you need to pass the number of cases (ncl) instead of the frequency and set the offset to the logarithm of the exposure due to the log link function. It is recommended to use the frequency and the weights instead of the offset because this ensures that all models have the same inputs. To use the frequency and the weights, you can fit the model using the following code:
```python
# assumes: import statsmodels.api as sm
self.model = sm.GLM(endog=y, exog=X, var_weights=sample_weight, family=self.family)
```
This is equivalent to using the exposure and the log of the exposure internally, which can be done using the following code:
```python
# assumes: import statsmodels.api as sm
self.model = sm.GLM(endog=y, exog=sm.add_constant(X), exposure=sample_weight, family=sm.families.Poisson())
self.result = self.model.fit()
```
- Parameters:
X (DataFrame) – array-like, shape (n_samples, n_features). The input data.
y (Union[ndarray, Series]) – array-like, shape (n_samples,). The target values.
sample_weight (array-like, shape (n_samples,), optional (default=None)) – Sample weights.
- Returns:
self (object) – Returns self.
- get_coef()[source]#
Get the estimated coefficients of the fitted model.
- Returns:
coef_ (array-like, shape (n_features,)) – The estimated coefficients of the fitted model.
- predict(X)[source]#
Predict using the fitted model.
- Parameters:
X – array-like, shape (n_samples, n_features) The input data.
- Returns:
y (array-like, shape (n_samples,)) – The predicted target values.
- Raises:
ValueError – If the model has not been fit.
- score(X, y, sample_weight=None)[source]#
Return the deviance of the fitted model.
- Parameters:
X (DataFrame) – array-like, shape (n_samples, n_features). The input data.
y (array-like, shape (n_samples,)) – The target values.
sample_weight (array-like, shape (n_samples,), optional (default=None)) – Sample weights.
- Returns:
deviance (float) – The deviance of the fitted model.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') EnetGLM#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') EnetGLM#
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.lasso.LassoFeatureSelection(family='gaussian', link=None, n_iterations=10, score='bic', fit_intercept=True, n_jobs=-1)[source]#
Bases:
BaseEstimator, TransformerMixin
LassoFeatureSelection performs feature selection using GLM Lasso regularization.
- Parameters:
family (str, default="gaussian") – The distributional assumption of the model. It can be any of the statsmodels distributions: “gaussian”, “binomial”, “poisson”, “gamma”, “negativebinomial”, “tweedie”.
link (str, optional) – The GLM link function. It can be any of: “identity”, “log”, “logit”, “probit”, “cloglog”, “inverse_squared”.
n_iterations (int, default 10) – Number of iterations for the grid search.
score (str, default "bic") – The score to use for model selection. Options: “bic” (Bayesian Information Criterion) or “mean_cv” (mean cross-validation score).
n_jobs (int, default -1) – The number of processes. -1 means all available processors.
- Variables:
family (str) – The family of the GLM.
n_iterations (int) – Number of iterations for the grid search.
best_estimator (EnetGLM) – The best estimator found after grid search cross-validation.
selected_features (ndarray) – The selected feature names.
support (ndarray) – The support of selected features (True for selected, False otherwise).
feature_names_in (ndarray) – The input feature names.
score (str) – The score used for model selection.
n_jobs (int) – The number of processes. -1 means all available processors.
- fit(X, y=None, sample_weight=None)[source]#
Fit the LassoFeatureSelection model and select the best features.
- Parameters:
X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.
y (Optional[Union[pd.Series, np.ndarray]], default None) – The target values, can be either a pandas Series or a numpy array.
sample_weight (Optional[Union[pd.Series, np.ndarray]], default None) – Sample weights to be used during training. Can be either a pandas Series or a numpy array.
- Returns:
LassoFeatureSelection – The fitted LassoFeatureSelection model.
- get_feature_names_out()[source]#
Get the names of the selected features.
- Return type:
ndarray
- Returns:
np.ndarray – The names of the selected features.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') LassoFeatureSelection#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- transform(X)[source]#
Transform the input data to keep only the selected features.
- Parameters:
X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.
- Return type:
DataFrame
- Returns:
Union[pd.DataFrame, np.ndarray] – The transformed data with only the selected features.
- arfs.feature_selection.lasso._fit_and_score(estimator, X, y, train_index, test_index, sample_weight=None)[source]#
Fit and score an estimator on a specified train-test split.
- Parameters:
estimator (BaseEstimator) – The estimator object implementing the scikit-learn estimator interface.
X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.
y (Union[pd.Series, np.ndarray]) – The target values, can be either a pandas Series or a numpy array.
train_index (np.ndarray) – Array of indices representing the training data.
test_index (np.ndarray) – Array of indices representing the test data.
sample_weight (Optional[Union[pd.Series, np.ndarray]], default None) – Sample weights to be used during training. Can be either a pandas Series or a numpy array.
- Return type:
float
- Returns:
float – The score of the estimator on the test data.
- Raises:
ValueError – If the input data is not of the correct format.
- arfs.feature_selection.lasso.grid_search_cv(X, y, sample_weight=None, n_iterations=10, family='gaussian', link=None, score='bic', fit_intercept=True, n_jobs=-1)[source]#
Perform grid search cross-validation for an Elastic Net Generalized Linear Model (EnetGLM).
- Parameters:
X (Union[pd.DataFrame, np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.
y (Union[pd.Series, np.ndarray]) – The target values, can be either a pandas Series or a numpy array.
sample_weight (Optional[Union[pd.Series, np.ndarray]], default None) – Sample weights to be used during training. Can be either a pandas Series or a numpy array.
n_iterations (int, default 10) – Number of iterations for the grid search.
family (str, default "gaussian") – The family of the GLM. Options: “gaussian”, “poisson”, “gamma”, “negativebinomial”, “binomial”, “tweedie”.
link (str, optional) – The GLM link function. It can be any of: “identity”, “log”, “logit”, “probit”, “cloglog”, “inverse_squared”.
score (str, default "bic") – The score to use for model selection. Options: “bic” (Bayesian Information Criterion) or “mean_cv” (mean cross-validation score).
n_jobs (int) – The number of processes.
- Return type:
EnetGLM
- Returns:
EnetGLM – The best estimator found after grid search cross-validation.
- Raises:
ValueError – If the input data is not of the correct format or if an invalid family or score value is provided.
- arfs.feature_selection.lasso.weighted_cross_val_score(estimator, X, y, sample_weight=None, cv=5, n_jobs=-1)[source]#
Perform cross-validation for a scikit-learn estimator with a score function that requires sample_weight.
- Parameters:
estimator (estimator) – The scikit-learn estimator object.
X (array-like of shape (n_samples, n_features)) – The input features.
y (array-like of shape (n_samples,)) – The target variable.
sample_weight (array-like of shape (n_samples,), optional) – The sample weights for each data point.
cv (int, default 5) – The number of cross-validation folds.
n_jobs – The number of processes.
- Returns:
scores (array of shape (cv,)) – The list of scores for each fold.
average_score (float) – The average score across all folds.
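The idea of cross-validating with weights passed to both fitting and scoring can be sketched without any dependency beyond NumPy. Everything below is illustrative: `WeightedLeastSquares` is a stand-in estimator invented for the example, `weighted_cv_scores` is a hypothetical name, and the sequential loop ignores the parallelism that `n_jobs` would provide.

```python
import numpy as np


class WeightedLeastSquares:
    """Tiny stand-in estimator exposing fit/score with sample_weight."""

    def fit(self, X, y, sample_weight=None):
        w = np.ones(len(y)) if sample_weight is None else np.asarray(sample_weight, dtype=float)
        sw = np.sqrt(w)
        # Weighted least squares via rescaled ordinary least squares
        self.coef_, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        return self

    def score(self, X, y, sample_weight=None):
        # Negative weighted mean squared error, so that higher is better
        return -np.average((y - X @ self.coef_) ** 2, weights=sample_weight)


def weighted_cv_scores(estimator, X, y, sample_weight, cv=5, seed=0):
    """Return per-fold scores and their mean, passing weights to fit and score."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), cv)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        estimator.fit(X[train_idx], y[train_idx], sample_weight=sample_weight[train_idx])
        scores.append(
            estimator.score(X[test_idx], y[test_idx], sample_weight=sample_weight[test_idx])
        )
    scores = np.array(scores)
    return scores, scores.mean()


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=100)
weights = rng.uniform(0.5, 2.0, size=100)
scores, average_score = weighted_cv_scores(WeightedLeastSquares(), X, y, weights, cv=5)
```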
arfs.feature_selection.mrmr module#
MRMR Feature Selection Module
This module provides MinRedundancyMaxRelevance (MRMR) feature selection for classification or regression tasks. In a classification task, the target should be of object or pandas category dtype, while in a regression task, the target should be numeric. The predictors can be categorical or numerical without requiring encoding, as the appropriate method (correlation, correlation ratio, or Theil’s U) will be automatically selected based on the data type.
Module Structure:#
MinRedundancyMaxRelevance: MRMR feature selection class for classification or regression tasks.
- class arfs.feature_selection.mrmr.MinRedundancyMaxRelevance(n_features_to_select, relevance_func=None, redundancy_func=None, task='regression', denominator_func=<function mean>, only_same_domain=False, return_scores=False, n_jobs=1, show_progress=True)[source]#
Bases:
SelectorMixin, BaseEstimator
MRMR feature selection for a classification or a regression task. For a classification task, the target should be of object or pandas category dtype. For a regression task, the target should be of numeric dtype. The predictors can be categorical or numerical; no encoding is required. The dtype is automatically detected and the right method applied (either correlation, correlation ratio or Theil’s U).
- Parameters:
n_features_to_select (int) – Number of features to select.
relevance_func (callable, optional) – Relevance function taking “X”, “y”, “sample_weight” as arguments and returning a pd.Series containing a relevance score for each feature.
redundancy_func (callable, optional) – Redundancy method. If callable, it should take “X”, “sample_weight” as input and return a pandas.Series containing a score of redundancy for each feature.
denominator_func (str or callable (optional, default 'mean')) – Synthesis function to apply to the denominator of MRMR score. If string, name of method. Supported: ‘max’, ‘mean’. If callable, it should take an iterable as input and return a scalar.
task (str) – Either “regression” or “classification”.
only_same_domain (bool (optional, default False)) – If False, all the necessary correlation coefficients are computed. If True, only features belonging to the same domain are compared. Domain is defined by the string preceding the first underscore: for instance “cusinfo_age” and “cusinfo_income” belong to the same domain, whereas “age” and “income” don’t.
return_scores (bool (optional, default False)) – If False, only the list of selected features is returned. If True, a tuple containing (list of selected features, relevance, redundancy) is returned.
n_jobs (int (optional, default 1)) – Maximum number of workers to use. Only used when relevance = “f” or redundancy = “corr”. If -1, use as many workers as min(cpu count, number of features).
show_progress (bool (optional, default True)) – If False, no progress bar is displayed. If True, a TQDM progress bar shows the number of features processed.
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – Number of input predictors.
ranking (pd.DataFrame) – Name and scores for the selected features.
support (list of bool) – The list of the selected X-columns.
Example
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.mrmr import MinRedundancyMaxRelevance
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> y.name = "target"
>>> fs_mrmr = MinRedundancyMaxRelevance(
...     n_features_to_select=5,
...     relevance_func=None,
...     redundancy_func=None,
...     task="regression",  # or "classification"
...     denominator_func=np.mean,
...     only_same_domain=False,
...     return_scores=False,
...     show_progress=True)
>>> # for classification: fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
>>> fs_mrmr.fit(X=X, y=y, sample_weight=None)
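The only_same_domain naming rule described in the parameter list can be sketched in a few lines. This only illustrates the convention (the text before the first underscore is the domain). It is not arfs code: the helper names `domain_of` and `same_domain` are hypothetical, and treating a name without an underscore as having no domain is an assumption made to match the “age”/“income” example.

```python
def domain_of(feature_name):
    """Return the text before the first underscore, or None if there is none."""
    prefix, sep, _ = feature_name.partition("_")
    return prefix if sep else None  # assumption: no underscore means no domain


def same_domain(a, b):
    """Two features share a domain only if both have a prefix and the prefixes match."""
    da, db = domain_of(a), domain_of(b)
    return da is not None and da == db


print(same_domain("cusinfo_age", "cusinfo_income"))  # True
print(same_domain("age", "income"))                  # False
```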
- fit(X, y, sample_weight=None)[source]#
Fit the MRMR selector by learning the associations.
- Parameters:
X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the associations, where n_samples is the number of samples and n_features is the number of features.
y (array-like or pd.Series of shape (n_samples,)) – Target vector. Must be numeric for regression or categorical for classification.
sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)
- Returns:
self (object) – If return_scores=False, returns self. If return_scores=True, returns (selected_features, relevance_scores).
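To make the relevance/redundancy trade-off concrete, here is a minimal greedy MRMR sketch in plain Python. It is not the arfs implementation. It assumes the score is relevance divided by the synthesized redundancy (the "denominator" mentioned in the parameters), and it uses the absolute Pearson correlation as a stand-in for both measures, whereas the class picks correlation, correlation ratio or Theil's U depending on dtype. The names `pearson` and `mrmr_select` are hypothetical.

```python
from statistics import mean


def pearson(a, b):
    """Plain Pearson correlation between two equal-length numeric lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def mrmr_select(columns, target, n_features_to_select, denominator_func=mean):
    """Greedy selection: at each step pick the feature with the best relevance / redundancy."""
    relevance = {name: abs(pearson(col, target)) for name, col in columns.items()}
    selected, remaining = [], list(columns)
    while remaining and len(selected) < n_features_to_select:
        def score(name):
            if not selected:
                return relevance[name]  # first pick: relevance only
            redundancy = denominator_func(
                [abs(pearson(columns[name], columns[s])) for s in selected]
            )
            return relevance[name] / max(redundancy, 1e-6)  # guard against division by zero
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


columns = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1.1, 1.9, 3.1, 3.9, 5.1, 5.9, 7.1, 7.9],  # near duplicate of x1
    "x3": [1, -1, -1, 1, 1, -1, -1, 1],              # uncorrelated with x1
}
y = [2 * a + 3 * b for a, b in zip(columns["x1"], columns["x3"])]
print(mrmr_select(columns, y, 2))  # ['x1', 'x3']
```

Although x2 is almost as relevant as x1, it is passed over in the second step because it is redundant with the feature already selected.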
- fit_transform(X, y, sample_weight=None, **fit_params)[source]#
Fit to data, then transform it. Fits transformer to X and y and optionally sample_weight with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
sample_weight (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – sample weight values.
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new (ndarray of shape (n_samples, n_features_new)) – Transformed array.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') MinRedundancyMaxRelevance#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
arfs.feature_selection.summary module#
Feature Selection Summary Module
This module provides a function for creating the summary report of a FS pipeline
Module Structure:#
make_fs_summary: main function for creating the summary
highlight_discarded: function for creating the style of the pd.DataFrame
- arfs.feature_selection.summary.highlight_discarded(s)[source]#
Highlight X in red and V in green.
- Parameters:
s (array-like of shape (n_features,)) – the boolean array for defining the style
- arfs.feature_selection.summary.make_fs_summary(selector_pipe)[source]#
make_fs_summary makes a summary dataframe highlighting at which step a given predictor has been rejected (if any).
- Parameters:
selector_pipe (sklearn.pipeline.Pipeline) – the feature selector pipeline.
Examples
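>>> from sklearn.pipeline import Pipeline
>>> from arfs.feature_selection.allrelevant import GrootCV
>>> from arfs.feature_selection.summary import make_fs_summary
>>> from arfs.feature_selection.unsupervised import (
...     MissingValueThreshold, UniqueValuesThreshold,
...     CardinalityThreshold, CollinearityThreshold)
>>> from arfs.feature_selection.variable_importance import VariableImportance
>>> # df, predictors, target and weight are assumed to be defined by the user
>>> groot_pipeline = Pipeline([
...     ('missing', MissingValueThreshold()),
...     ('unique', UniqueValuesThreshold()),
...     ('cardinality', CardinalityThreshold()),
...     ('collinearity', CollinearityThreshold(threshold=0.5)),
...     ('lowimp', VariableImportance(lgb_kwargs={'objective': 'poisson'}, verbose=2)),
...     ('grootcv', GrootCV(objective='poisson', cutoff=1, n_folds=3, n_iter=5))])
>>> groot_pipeline.fit_transform(
...     X=df[predictors],
...     y=df[target],
...     lowimp__sample_weight=df[weight],
...     grootcv__sample_weight=df[weight])
>>> fs_summary_df = make_fs_summary(groot_pipeline)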
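The idea behind the summary (recording at which step a predictor was rejected) can be illustrated without any pipeline object. The sketch below is not how make_fs_summary is implemented. It assumes that each step reports the set of features it kept, and the function name `rejection_step` and the toy step outputs are hypothetical.

```python
def rejection_step(all_features, kept_after_step):
    """Map each feature to the first step that dropped it, or None if it survived.

    kept_after_step: ordered list of (step_name, set_of_kept_features) pairs.
    """
    rejected_at = {name: None for name in all_features}
    alive = set(all_features)
    for step_name, kept in kept_after_step:
        for name in alive - set(kept):
            rejected_at[name] = step_name  # first step at which the feature disappears
        alive &= set(kept)
    return rejected_at


steps = [
    ("missing", {"age", "income", "zip"}),
    ("cardinality", {"age", "income"}),
    ("grootcv", {"age"}),
]
summary = rejection_step(["age", "income", "zip", "notes"], steps)
for name, step in summary.items():
    print(f"{name:8s} {'kept' if step is None else 'rejected at ' + step}")
```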
arfs.feature_selection.unsupervised module#
Unsupervised Feature Selection
This module provides selectors using unsupervised statistics and a threshold
Module Structure:#
MissingValueThreshold: child class of the BaseThresholdSelector, filters out columns with too many missing values
UniqueValuesThreshold: child class of the BaseThresholdSelector, filters out columns with zero variance
CardinalityThreshold: child class of the BaseThresholdSelector, filters out categorical columns with too many levels
CollinearityThreshold: filters out collinear columns (derives from SelectorMixin and BaseEstimator)
- class arfs.feature_selection.unsupervised.CardinalityThreshold(threshold=1000)[source]#
Bases: BaseThresholdSelector
Feature selector that removes all categorical features with more unique values than the threshold. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters:
threshold (int, default = 1000) – Categorical features with more unique values than this threshold will be removed. The threshold should be >= 1.
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
support (list of bool) – the list of the selected X-columns
selected_features (list of str) – the list of names of selected features
not_selected_features (list of str) – the list of names of rejected features
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.unsupervised import CardinalityThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = CardinalityThreshold(100)
>>> selector.fit_transform(X)
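The example above uses purely numeric data, so no column is affected. The following stdlib-only sketch shows the rule on a small mixed table. It is an illustration rather than the arfs code: columns are plain lists, a column made only of strings is assumed to be categorical, and the function names are hypothetical.

```python
def count_unique_categorical(columns):
    """Count distinct values per column, looking only at columns made of strings."""
    return {
        name: len(set(values))
        for name, values in columns.items()
        if all(isinstance(v, str) for v in values)
    }


def cardinality_filter(columns, threshold):
    """Keep numeric columns and the categorical ones with at most `threshold` levels."""
    n_levels = count_unique_categorical(columns)
    return [name for name in columns if n_levels.get(name, 0) <= threshold]


data = {
    "customer_id": ["c1", "c2", "c3", "c4", "c5"],  # one level per row
    "region": ["north", "south", "north", "south", "north"],
    "amount": [10.0, 12.5, 9.0, 11.0, 10.0],        # numeric, never counted
}
print(count_unique_categorical(data))  # {'customer_id': 5, 'region': 2}
print(cardinality_filter(data, 3))     # ['region', 'amount']
```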
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CardinalityThreshold#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.unsupervised.CollinearityThreshold(threshold=0.8, method='association', n_jobs=1, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=<function weighted_corr>, nom_num_assoc=<function correlation_ratio>)[source]#
Bases: SelectorMixin, BaseEstimator
Feature selector that removes collinear features. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning. It computes the association between features (continuous or categorical), stores the pairs of collinear features and removes one feature from every pair whose association value is above the threshold.
The association measures are the Spearman correlation coefficient, correlation ratio and Theil’s U. The association matrix is not necessarily symmetrical.
By changing the method to “correlation”, data will be encoded as integer and the Spearman correlation coefficient will be used instead. Faster but not a best practice because the categorical variables are considered as numeric.
- Parameters:
threshold (float, default = .8) – Features whose association with another feature is larger than this threshold will be removed. The threshold should be > 0 and <= 1.
method (str, default = "association") – method for computing the association matrix. Either “association” or “correlation”. Correlation leads to encoding of categorical variables as numeric.
n_jobs (int, default = 1) – the number of threads for computing the association matrix, -1 uses all the threads
nom_nom_assoc (str or callable, default = "theil") – the categorical-categorical association measure, by default Theil’s U, not symmetrical!
num_num_assoc (str or callable, default = "spearman") – the numeric-numeric association measure
nom_num_assoc (str or callable, default = "correlation_ratio") – the numeric-categorical association measure
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
assoc_matrix (pd.DataFrame) – the square association matrix
collinearity_summary (pd.DataFrame) – the pairs of collinear features and the association values
support (list of bool) – the list of the selected X-columns
selected_features (list of str) – the list of names of selected features
not_selected_features (list of str) – the list of names of rejected features
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.unsupervised import CollinearityThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = CollinearityThreshold(threshold=0.75)
>>> selector.fit_transform(X)
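The remark that the association matrix is not necessarily symmetrical comes from Theil's U, which measures how much of the uncertainty about one categorical variable is removed by knowing another. The sketch below computes the plain, unweighted U(x | y) = (H(x) - H(x | y)) / H(x) with the standard library. The class signature shows a weighted variant (weighted_theils_u) as default, whose details are not documented here, so treat this as an illustration of the asymmetry only.

```python
from collections import Counter
from math import log


def entropy(values):
    """Shannon entropy (natural log) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(x | y): remaining uncertainty about x once y is known."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum(
        (c / n) * log(c / y_counts[y_val]) for (_, y_val), c in xy_counts.items()
    )


def theils_u(x, y):
    """U(x | y): fraction of the entropy of x explained by y, between 0 and 1."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # a constant x is fully determined
    return (h_x - conditional_entropy(x, y)) / h_x


city = ["paris", "lyon", "rome", "milan", "paris", "rome"]
country = ["fr", "fr", "it", "it", "fr", "it"]
print(theils_u(country, city))  # the city fully determines the country
print(theils_u(city, country))  # the country only partly determines the city
```

The first value is 1 while the second is clearly lower, which is why entry (i, j) of such an association matrix can differ from entry (j, i).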
- fit(X, y=None, sample_weight=None)[source]#
Learn empirical associations from X.
- Parameters:
X (pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the associations, where n_samples is the number of samples and n_features is the number of features.
y (any, default None) – Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.
sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)
- Returns:
self (object) – Returns the instance itself.
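Once the associations are known, one feature of every pair above the threshold is removed. The documentation does not state which member of a pair is dropped, so the sketch below makes an explicit assumption: it keeps the first-listed feature and drops the second. The function name `drop_collinear` and the toy association values are hypothetical.

```python
def drop_collinear(assoc, threshold):
    """Greedy sketch: record the pairs above the threshold and drop one feature per pair.

    assoc: dict mapping (feature_a, feature_b) to an association value in [0, 1].
    Returns (kept_features, collinear_pairs).
    """
    features = []
    for a, b in assoc:
        for name in (a, b):
            if name not in features:
                features.append(name)  # preserve first-seen order
    dropped, pairs = set(), []
    for (a, b), value in assoc.items():
        if value > threshold:
            pairs.append((a, b, value))
            if a not in dropped and b not in dropped:
                dropped.add(b)  # assumption: keep the first-listed feature
    return [f for f in features if f not in dropped], pairs


assoc = {
    ("height_cm", "height_in"): 0.99,
    ("height_cm", "weight"): 0.55,
    ("height_in", "weight"): 0.54,
    ("weight", "bmi"): 0.85,
}
kept, pairs = drop_collinear(assoc, threshold=0.8)
print(kept)   # ['height_cm', 'weight']
print(pairs)  # [('height_cm', 'height_in', 0.99), ('weight', 'bmi', 0.85)]
```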
- plot_association(ax=None, cmap='PuOr', figsize=None, cbar_kw=None, imgshow_kw=None)[source]#
Plot the association matrix.
- Parameters:
ax (matplotlib.axes.Axes, optional) – the mpl axes if the figure object exists already, by default None
cmap (str, optional) – colormap name, by default “PuOr”
figsize (tuple of float, optional) – figure size, by default None
cbar_kw (dict, optional) – colorbar kwargs, by default None
imgshow_kw (dict, optional) – imgshow kwargs, by default None
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CollinearityThreshold#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.unsupervised.MissingValueThreshold(threshold=0.05)[source]#
Bases: BaseThresholdSelector
Feature selector that removes all features with a high percentage of missing values. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters:
threshold (float, default = .05) – Features whose fraction of missing values in the training set is larger than this threshold will be removed.
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
support (list of bool) – the list of the selected X-columns
selected_features (list of str) – the list of names of selected features
not_selected_features (list of str) – the list of names of rejected features
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.unsupervised import MissingValueThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = MissingValueThreshold(0.05)
>>> selector.fit_transform(X)
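Since the synthetic data above contains no missing values, the selector keeps every column. A stdlib-only sketch of the underlying statistic on a small table with gaps is shown below. It is illustrative only: columns are plain lists, None and NaN are both counted as missing, and the function names are hypothetical.

```python
from math import isnan


def missing_fraction(values):
    """Share of entries that are None or NaN."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and isnan(v))
    return sum(is_missing(v) for v in values) / len(values)


def missing_value_filter(columns, threshold=0.05):
    """Keep the columns whose missing share is not larger than the threshold."""
    return [name for name, values in columns.items()
            if missing_fraction(values) <= threshold]


data = {
    "age": [34, 51, 27, 45, 38, 29, 60, 41, 33, 48],
    "income": [52.0, None, 48.5, float("nan"), 61.0, None, 55.0, 47.0, 58.0, 50.0],
    "city": ["a", "b", None, "a", "c", "b", "a", "c", "b", "a"],
}
print({name: missing_fraction(col) for name, col in data.items()})
print(missing_value_filter(data, 0.05))  # ['age']
print(missing_value_filter(data, 0.2))   # ['age', 'city']
```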
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') MissingValueThreshold#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.unsupervised.UniqueValuesThreshold(threshold=1)[source]#
Bases: BaseThresholdSelector
Feature selector that removes all features with zero variance (a single unique value) or, more generally, columns with too few unique values relative to the threshold. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters:
threshold (int, default = 1) – Features whose number of unique values in the training set does not exceed this threshold will be removed; the default of 1 removes constant columns. The threshold should be >= 1.
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
support (list of bool) – the list of the selected X-columns
selected_features (list of str) – the list of names of selected features
not_selected_features (list of str) – the list of names of rejected features
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.unsupervised import UniqueValuesThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = UniqueValuesThreshold(1)
>>> selector.fit_transform(X)
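The continuous columns produced by make_regression all have many distinct values, so the example above removes nothing. The sketch below shows the effect on a table that does contain a constant column. It is a stdlib-only illustration, and the behaviour exactly at the boundary (a column is kept only if it has strictly more distinct values than the threshold) is an assumption chosen so that the default of 1 removes constant columns.

```python
def unique_values_filter(columns, threshold=1):
    """Keep the columns with strictly more than `threshold` distinct values."""
    return [name for name, values in columns.items()
            if len(set(values)) > threshold]


data = {
    "constant": [7, 7, 7, 7],        # zero variance
    "binary": [0, 1, 0, 1],
    "amount": [1.5, 2.0, 3.5, 4.0],
}
print(unique_values_filter(data))     # ['binary', 'amount']
print(unique_values_filter(data, 2))  # ['amount']
```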
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') UniqueValuesThreshold#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- arfs.feature_selection.unsupervised._pandas_count_unique_values_cat_features(X)[source]#
Counts the number of unique values in categorical features of a pandas DataFrame.
- Parameters:
X (pandas DataFrame) – The input data.
- Returns:
pandas Series – The number of unique values in each categorical feature.
- Raises:
TypeError – If the input data is not a pandas DataFrame.
arfs.feature_selection.variable_importance module#
Supervised Feature Selection
This module provides selectors using supervised statistics and a threshold, using SHAP, permutation importance or impurity (Gini) importance.
Module Structure:#
VariableImportance: main class for identifying non-important features
- class arfs.feature_selection.variable_importance.VariableImportance(task='regression', encode=True, n_iterations=10, threshold=0.99, lgb_kwargs={'objective': 'rmse', 'zero_as_missing': False}, encoder_kwargs=None, fastshap=False, verbose=-1)[source]#
Bases: SelectorMixin, BaseEstimator
Feature selector that removes predictors with zero or low variable importance.
Identify the features with zero/low importance according to SHAP values of a lightgbm. The gbm can be trained with early stopping using a validation set to prevent overfitting. The feature importances are averaged over n_iterations to reduce the variance. The predictors are then ranked from the most important to the least important and the cumulative variable importance is computed. All the predictors that do not contribute (VI=0), or that fall outside the share of cumulative importance set by the threshold, are removed.
- Parameters:
task (string) – The machine learning task, either ‘classification’ or ‘regression’ or ‘multiclass’; be sure to use a consistent objective function
encode (boolean, default = True) – Whether or not to encode the predictors
n_iterations (int, default = 10) – Number of iterations; the more iterations, the smaller the variance
threshold (float, default = .99) – The selector computes the cumulative feature importance and ranks the predictors from the most important to the least important. All the predictors contributing to less than this value are rejected.
lgb_kwargs (dictionary of keyword arguments) – dictionary of lightgbm estimator parameters with at least the objective function, e.g. {‘objective’: ’rmse’}
encoder_kwargs (dictionary of keyword arguments, optional) – dictionary of the OrdinalEncoderPandas parameters
fastshap (boolean) – enable or not the fasttreeshap implementation
verbose (int, default = -1) – controls the progress bar, > 1 print out progress
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
assoc_matrix (pd.DataFrame) – the square association matrix
collinearity_summary (pd.DataFrame) – the pairs of collinear features and the association values
support (list of bool) – the list of the selected X-columns
selected_features (list of str) – the list of names of selected features
not_selected_features (list of str) – the list of names of rejected features
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection.variable_importance import VariableImportance
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = VariableImportance(threshold=0.75)
>>> selector.fit_transform(X, y)
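The cumulative-importance rule described for this class can be sketched independently of lightgbm and SHAP. The snippet below starts from importances that are already averaged (made-up numbers) and keeps the shortest ranked prefix whose cumulative share reaches the threshold. The behaviour exactly at the boundary is an assumption, and `select_by_cumulative_importance` is a hypothetical name, not an arfs function.

```python
def select_by_cumulative_importance(importances, threshold=0.99):
    """Rank features by importance and keep the smallest prefix reaching `threshold`.

    importances: dict mapping feature name to a non-negative importance.
    """
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, value in ranked:
        if value == 0 or cumulative >= threshold:
            break  # zero-importance features and the tail beyond the threshold are dropped
        selected.append(name)
        cumulative += value / total
    return selected


importances = {"a": 50, "b": 30, "c": 15, "d": 5, "e": 0}
print(select_by_cumulative_importance(importances, 0.9))   # ['a', 'b', 'c']
print(select_by_cumulative_importance(importances, 0.99))  # ['a', 'b', 'c', 'd']
```

Lowering the threshold trims more of the low-importance tail, while features with zero importance are never kept.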
- fit(X, y, sample_weight=None)[source]#
Learn variable importance from X and y, supervised learning.
- Parameters:
X (pd.DataFrame, shape (n_samples, n_features)) – Predictor data, where n_samples is the number of samples and n_features is the number of features.
y (array-like or pd.Series of shape (n_samples,)) – Target vector.
sample_weight (pd.Series, optional, shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)
- Returns:
self (object) – Returns the instance itself.
- fit_transform(X, y=None, sample_weight=None)[source]#
Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new (ndarray of shape (n_samples, n_features_new)) – Transformed array.
- plot_importance(figsize=None, plot_n=50, n_feat_per_inch=3, log=True, style=None)[source]#
Plots plot_n most important features and the cumulative importance of features. If threshold is provided, prints the number of features needed to reach threshold cumulative importance.
- Parameters:
plot_n (int, default = 50) – Number of most important features to plot. Defaults to 50 or the maximum number of features, whichever is smaller
n_feat_per_inch (int) – number of features per inch, the larger the less space between labels
figsize (tuple of float, optional) – The rendered size as a percentage size
log (bool, default True) – Whether or not to render the variable importance on a log scale
style (bool, default False) – set arfs style or not
- Returns:
hv.plot – the feature importances holoviews object
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') VariableImportance#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
Module contents#
- class arfs.feature_selection.BoostAGroota(estimator=None, cutoff=4, iters=10, max_rounds=500, delta=0.1, silent=True, importance='shap')[source]#
Bases: SelectorMixin, BaseEstimator
BoostAGroota is an all-relevant feature selection method, while most others are minimal optimal. It tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some estimator has a minimal error.
Why bother with all-relevant feature selection? When you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (minimal optimal set of features by definition depends on your estimator choice).
- Parameters:
estimator (scikit-learn estimator) – The model to train, lightGBM recommended, see the reduce lightgbm method.
cutoff (float) – The value by which the max of shadow imp is divided, to compare to real importance.
iters (int (>0)) – The number of iterations to average for the feature importance (on the same split), to reduce the variance.
max_rounds (int (>0)) – The number of times the core BoostAGroota algorithm will run. Each round eliminates more and more features.
delta (float (0 < delta <= 1)) – Stopping criterion for whether another round is started.
silent (bool) – Set to True if you don’t want to see the BoostAGroota output printed.
importance (str, default 'shap') – The kind of feature importance to use. Possible values: ‘shap’ (Shapley values), ‘pimp’ (permutation importance), and ‘native’ (Gini/impurity).
- Variables:
selected_features (list of str) – The list of columns to keep.
ranking (array of shape [n_features]) – The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1, and tentative features are assigned rank 2.
ranking_absolutes (array of shape [n_features]) – The absolute feature ranking as ordered by the selection process. It does not guarantee that this order is correct for all models. For a model-agnostic ranking, see the attribute ranking.
sha_cutoff_df (dataframe) – Feature importance of the real+shadow predictors over iterations.
mean_shadow (float) – The threshold below which the predictors are rejected.
Examples
>>> from lightgbm import LGBMRegressor
>>> from arfs.feature_selection import BoostAGroota
>>> # df (a pd.DataFrame) and filtered_features (a list of its columns) are assumed to be defined by the user
>>> X = df[filtered_features].copy()
>>> y = df['target'].copy()
>>> w = df['weight'].copy()
>>> model = LGBMRegressor(n_jobs=-1, n_estimators=100, objective='rmse', random_state=42, verbose=0)
>>> feat_selector = BoostAGroota(estimator=model, cutoff=1, iters=10, max_rounds=10, delta=0.1, importance='shap')
>>> feat_selector.fit(X, y, sample_weight=None)
>>> print(feat_selector.selected_features_)
>>> feat_selector.plot_importance(n_feat_per_inch=5)
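The role of the cutoff parameter can be shown without fitting any model. The sketch below builds shadow features by shuffling each column, then applies the comparison described in the parameter list: real importances are compared with the largest shadow importance divided by cutoff. The importances are made-up numbers standing in for what a fitted estimator would provide, the "shadow_" prefix and the function names are hypothetical, and the rounds, iterations and delta stopping rule of the real algorithm are left out.

```python
import random


def make_shadow(columns, seed=0):
    """Shadow features: each column shuffled on its own, which breaks any link to the target."""
    rng = random.Random(seed)
    shadows = {}
    for name, values in columns.items():
        shuffled = list(values)
        rng.shuffle(shuffled)
        shadows[f"shadow_{name}"] = shuffled
    return shadows


def keep_above_shadow(importances, cutoff=4):
    """Keep the real features whose importance exceeds max(shadow importance) / cutoff."""
    shadow_max = max(v for k, v in importances.items() if k.startswith("shadow_"))
    limit = shadow_max / cutoff
    return [k for k, v in importances.items()
            if not k.startswith("shadow_") and v > limit]


columns = {"x1": [1, 2, 3, 4], "x2": [5, 6, 7, 8], "x3": [9, 10, 11, 12]}
shadows = make_shadow(columns)

# Made-up importances, as a model fitted on real + shadow columns might report them.
importances = {"x1": 0.40, "x2": 0.25, "x3": 0.02,
               "shadow_x1": 0.04, "shadow_x2": 0.03, "shadow_x3": 0.05}
print(keep_above_shadow(importances, cutoff=1))  # ['x1', 'x2']
print(keep_above_shadow(importances, cutoff=4))  # ['x1', 'x2', 'x3']
```

A larger cutoff lowers the bar set by the shadow features, so the selection becomes more lenient.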
- fit(X, y, sample_weight=None)[source]#
Fit the BoostAGroota transformer with the provided estimator.
- Parameters:
X (pd.DataFrame) – the predictors matrix
y (pd.Series) – the target
sample_weight (pd.Series) – sample_weight, if any
- plot_importance(n_feat_per_inch=5)[source]#
Boxplot of the variable importance, ordered by magnitude. The max shadow variable importance is illustrated by the dashed line. The fit method must be called first.
- Parameters:
n_feat_per_inch (int, default 5) – Number of features to plot per inch (for scaling the figure).
- Returns:
fig (plt.figure or None) – The matplotlib figure object containing the boxplot, or None if there are no selected features.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') BoostAGroota#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.CardinalityThreshold(threshold=1000)[source]#
Bases: BaseThresholdSelector
Feature selector that removes all categorical features with more unique values than the threshold. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters:
threshold (int, default = 1000) – Categorical features with more unique values than this threshold will be removed. The threshold should be >= 1.
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
support (list of bool) – the list of the selected X-columns
selected_features (list of str) – the list of names of selected features
not_selected_features (list of str) – the list of names of rejected features
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import CardinalityThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = CardinalityThreshold(100)
>>> selector.fit_transform(X)
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CardinalityThreshold#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.CollinearityThreshold(threshold=0.8, method='association', n_jobs=1, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=<function weighted_corr>, nom_num_assoc=<function correlation_ratio>)[source]#
Bases:
SelectorMixin, BaseEstimator
Feature selector that removes collinear features. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning. It computes the association between features (continuous or categorical), stores the pairs of collinear features and, for every pair with an association value above the threshold, removes one of the two features.
The association measures are the Spearman correlation coefficient, the correlation ratio and Theil’s U. The association matrix is not necessarily symmetrical.
By changing the method to “correlation”, the data will be encoded as integers and the Spearman correlation coefficient will be used for all pairs instead. This is faster but not a best practice, because the categorical variables are treated as numeric.
- Parameters:
threshold (
float, default= .8) – For each pair of features whose association value is larger than this threshold, one of the two features will be removed. The threshold should be > 0 and <= 1.
method (
str, default ="association") – method for computing the association matrix. Either “association” or “correlation”. “correlation” leads to encoding the categorical variables as numeric.
n_jobs (
int, default= 1) – the number of threads used for computing the association matrix; -1 uses all the threads.
nom_nom_assoc (
str or callable, default ="theil") – the categorical-categorical association measure, by default Theil’s U, not symmetrical!
num_num_assoc (
str or callable, default ="spearman") – the numeric-numeric association measure
nom_num_assoc (
str or callable, default ="correlation_ratio") – the numeric-categorical association measure
- Returns:
selected_features (
list of str) – List of selected features.
- Variables:
n_features_in (
int) – number of input predictors
assoc_matrix (
pd.DataFrame) – the square association matrix
collinearity_summary (
pd.DataFrame) – the pairs of collinear features and the association values
support (
list of bool) – the list of the selected X-columns
selected_features (
list of str) – the list of names of selected features
not_selected_features (
list of str) – the list of names of rejected features
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import CollinearityThreshold
>>> X, y = make_regression(n_samples = 1000, n_features = 50, n_informative = 5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = CollinearityThreshold(threshold=0.75)
>>> selector.fit_transform(X)
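To make the pairwise removal rule concrete, here is a minimal standard-library sketch. It is not the arfs implementation: it swaps in the absolute Pearson correlation for the association measures listed above, and the helper names `pearson` and `drop_collinear`, as well as the choice to drop the second member of each pair, are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_collinear(columns, threshold=0.8):
    """Greedy sketch: for each pair above the threshold, drop the second column."""
    dropped = set()
    for a, b in combinations(columns, 2):
        if a in dropped or b in dropped:
            continue  # a pair involving an already removed column is ignored
        if abs(pearson(columns[a], columns[b])) > threshold:
            dropped.add(b)
    return [name for name in columns if name not in dropped]


data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly 2 * x1
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],  # weakly related to x1
}
kept = drop_collinear(data, threshold=0.8)
print(kept)
```

Only one member of the strongly associated pair survives; which member is kept is a design choice that this sketch makes arbitrarily.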
- fit(X, y=None, sample_weight=None)[source]#
Learn empirical associations from X.
- Parameters:
X (
pd.DataFrame, shape (n_samples, n_features)) – Data from which to compute the associations, where n_samples is the number of samples and n_features is the number of features.
y (
any, defaultNone) – Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.sample_weight (
pd.Series, optional,shape (n_samples,)) – weights for computing the statistics (e.g. weighted average)
- Returns:
self (
object) – Returns the instance itself.
- plot_association(ax=None, cmap='PuOr', figsize=None, cbar_kw=None, imgshow_kw=None)[source]#
plot_association plots the association matrix
- Parameters:
ax (
matplotlib.axes.Axes, optional) – the mpl axes if the figure object exists already, by default Nonecmap (
str, optional) – colormap name, by default “PuOr”figsize (
tupleoffloat, optional) – figure size, by default Nonecbar_kw (
dict, optional) – colorbar kwargs, by default Noneimgshow_kw (
dict, optional) – imgshow kwargs, by default None
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CollinearityThreshold#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (
str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (
object) – The updated object.
- class arfs.feature_selection.GrootCV(objective=None, cutoff=1, n_folds=5, folds=None, n_iter=5, silent=True, rf=False, fastshap=False, n_jobs=0, lgbm_params=None)[source]#
Bases:
SelectorMixin, BaseEstimator
GrootCV is a feature selection method based on cross-validation with lightGBM.
A shuffled copy of the predictors matrix (the shadows) is added to the original set of predictors. The lightGBM model is fitted using repeated cross-validation; the feature importance is extracted each time and averaged to smooth out the noise. If the feature importance is smaller than the cutoff derived from the shadow feature importances, the predictor is rejected; the others are kept.
The feature importance is cross-validated to smooth out the noise and is based on lightGBM only (which is, most of the time, the fastest and most accurate boosting implementation).
The feature importance is derived using SHAP importance.
The cutoff is the max of the medians of the shadow variable importances over the folds; anything lower is not conservative enough, and this choice also improves convergence (fewer evaluations are needed to find a threshold).
The selection is not based on a given percentage of columns to be deleted.
A plot method is available for the variable importance.
- Parameters:
objective (
str or callable, default None) – The objective function to use in lightGBM. If None, it uses the objective specified in lgbm_params.
cutoff (
float, default 1) – The value by which the max of shadow imp is divided, to compare to real importance.
n_folds (
int, default 5) – The number of folds for cross-validation.
folds (
Optional[Union[Iterable[Tuple[np.ndarray, np.ndarray]]) – Generator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None, optional (default=None). If generator or iterator, it should yield the train and test indices for each fold. If object, it should be one of the scikit-learn splitter classes (https://scikit-learn.org/stable/modules/classes.html#splitter-classes) and have a split method. This argument has the highest priority over other data split arguments.
n_iter (
int, default 5) – The number of iterations to average for the feature importance (on the same split), to reduce variance.
silent (
bool, default True) – Set to True if you don’t want to see the GrootCV output printed.
rf (
bool, default False) – If True, use random forest for calculating feature importances; otherwise, use lightGBM.
fastshap (
bool, default False) – If True, use fastSHAP for calculating feature importances; otherwise, use SHAP.
n_jobs (
int, default 0) – The number of jobs to run in parallel. If 0, no parallelism is used.
lgbm_params (
dict, default None) – The parameters for the lightGBM model.
- Variables:
selected_features (
ndarray) – The list of columns to keep as selected features.
cv_df (
pd.DataFrame) – DataFrame containing feature importance values for each fold and iteration.
sha_cutoff (
float) – The threshold below which the predictors are rejected.
ranking_absolutes (
list) – The absolute feature ranking as ordered by the selection process.
ranking (
ndarray) – The feature ranking, where 2 corresponds to selected features and 1 to tentative features.
Warning
If sha_cutoff is None, you should apply the fit method first.
Examples
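>>> import arfs.feature_selection.allrelevant as arfsgroot
>>> # df is assumed to be a pandas DataFrame holding the predictors, 'target' and 'weight'
>>> X = df[filtered_features].copy()
>>> y = df['target'].copy()
>>> w = df['weight'].copy()
>>> feat_selector = arfsgroot.GrootCV(objective='rmse', cutoff = 1, n_folds=5, n_iter=5)
>>> feat_selector.fit(X, y, sample_weight=None)
>>> feat_selector.plot_importance(n_feat_per_inch=5)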
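The selection rule described for GrootCV (a cutoff built from the max of the per-shadow medians, divided by `cutoff`) can be sketched in isolation. The snippet below is a simplified illustration on made-up importance values: it does not fit lightGBM or compute SHAP values, and the helper name `groot_style_selection` and the dict layout are hypothetical.

```python
from statistics import mean, median


def groot_style_selection(real_imp, shadow_imp, cutoff=1.0):
    """Keep the features whose averaged importance reaches the shadow cutoff.

    real_imp / shadow_imp map a feature name to its importances, one value
    per (fold, iteration). The values here are assumed, not computed.
    """
    # max over the shadows of the median importance across folds
    sha_cutoff = max(median(v) for v in shadow_imp.values()) / cutoff
    selected = [name for name, v in real_imp.items() if mean(v) >= sha_cutoff]
    return selected, sha_cutoff


real = {"age": [0.40, 0.42, 0.38], "noise": [0.02, 0.01, 0.03]}
shadow = {"shadow_age": [0.03, 0.05, 0.04], "shadow_noise": [0.02, 0.02, 0.06]}
selected, sha_cutoff = groot_style_selection(real, shadow, cutoff=1.0)
print(selected, sha_cutoff)
```

A larger `cutoff` lowers the threshold and therefore keeps more features.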
- fit(X, y, sample_weight=None)[source]#
Fit the GrootCV on the input data.
- Parameters:
X (
pd.DataFrameofshape (n_samples,n_features)) – The predictor dataframe.y (
array-likeofshape (n_samples,)) – The target vector.sample_weight (
array-likeofshape (n_samples,), optional) – The weight vector, by default None.
- Returns:
self (
object) – Returns self.
- plot_importance(n_feat_per_inch=5)[source]#
Plot the feature importance of the fitted GrootCV.
- Parameters:
n_feat_per_inch (
int, default5) – The number of features per inch in the plot.- Returns:
fig (
matplotlib.figure.FigureorNone) – The matplotlib figure containing the plot or None if no feature is selected.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') GrootCV#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (
str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (
object) – The updated object.
- class arfs.feature_selection.LassoFeatureSelection(family='gaussian', link=None, n_iterations=10, score='bic', fit_intercept=True, n_jobs=-1)[source]#
Bases:
BaseEstimator, TransformerMixin
LassoFeatureSelection performs feature selection using GLM Lasso regularization.
- Parameters:
family (
str, default "gaussian") – The distributional assumption of the model. It can be any of the statsmodels distributions: “gaussian”, “binomial”, “poisson”, “gamma”, “negativebinomial”, “tweedie”.
link (
str, optional) – The GLM link function. It can be any of: “identity”, “log”, “logit”, “probit”, “cloglog”, “inverse_squared”.
n_iterations (
int, default 10) – Number of iterations for the grid search.
score (
str, default "bic") – The score to use for model selection. Options: “bic” (Bayesian Information Criterion) or “mean_cv” (mean cross-validation score).
n_jobs (
int, default -1) – The number of processes. -1 means all the processes.
- Variables:
family (
str) – The family of the GLM.
n_iterations (
int) – Number of iterations for the grid search.
best_estimator (
EnetGLM) – The best estimator found after grid search cross-validation.
selected_features (
ndarray) – The selected feature names.
support (
ndarray) – The support of selected features (True for selected, False otherwise).
feature_names_in (
ndarray) – The input feature names.
score (
str) – The score used for model selection.
n_jobs (
int) – The number of processes. -1 means all the processes.
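As a reminder of what the “bic” score rewards, the sketch below uses the textbook definition BIC = k ln(n) - 2 ln(L), where lower is better. How arfs computes the criterion internally is not stated here, so the function `bic`, the candidate names and the log-likelihood values are all illustrative assumptions.

```python
from math import log


def bic(log_likelihood, n_params, n_samples):
    """Textbook Bayesian Information Criterion (lower is better)."""
    return n_params * log(n_samples) - 2.0 * log_likelihood


# Hypothetical candidates from a regularization grid:
# name -> (log-likelihood, number of non-zero coefficients)
candidates = {"strong_penalty": (-120.0, 2), "weak_penalty": (-118.5, 6)}
n = 100
scores = {name: bic(ll, k, n) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

The sparser model wins here because its slightly worse likelihood is outweighed by the penalty on the four extra coefficients.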
- fit(X, y=None, sample_weight=None)[source]#
Fit the LassoFeatureSelection model and select the best features.
- Parameters:
X (
Union[pd.DataFrame,np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.y (
Optional[Union[pd.Series,np.ndarray]], defaultNone) – The target values, can be either a pandas Series or a numpy array.sample_weight (
Optional[Union[pd.Series,np.ndarray]], defaultNone) – Sample weights to be used during training. Can be either a pandas Series or a numpy array.
- Returns:
LassoFeatureSelection– The fitted LassoFeatureSelection model.
- get_feature_names_out()[source]#
Get the names of the selected features.
- Return type:
ndarray- Returns:
np.ndarray– The names of the selected features.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') LassoFeatureSelection#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (
str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (
object) – The updated object.
- transform(X)[source]#
Transform the input data to keep only the selected features.
- Parameters:
X (
Union[pd.DataFrame,np.ndarray]) – The input features, can be either a pandas DataFrame or a numpy array.- Return type:
DataFrame- Returns:
Union[pd.DataFrame,np.ndarray]– The transformed data with only the selected features.
- class arfs.feature_selection.Leshy(estimator, n_estimators=1000, perc=90, alpha=0.05, importance='shap', two_step=True, max_iter=100, random_state=None, verbose=0, keep_weak=False)[source]#
Bases:
SelectorMixin, BaseEstimator
This is an improved version of BorutaPy, which is itself an improved Python implementation of the Boruta R package. Boruta is an all-relevant feature selection method, while most others are minimal optimal; this means it tries to find all features carrying information usable for prediction, rather than finding a possibly compact subset of features on which some estimator has a minimal error. Why bother with all-relevant feature selection? When you try to understand the phenomenon that made your data, you should care about all factors that contribute to it, not just the bluntest signs of it in the context of your methodology (the minimal optimal set of features by definition depends on your estimator choice).
- Parameters:
estimator (
object) – A supervised learning estimator, with a ‘fit’ method that provides the feature_importances_ attribute. Important features must correspond to high absolute values in feature_importances_.
n_estimators (
int or string, default= 1000) – If int, sets the number of estimators in the chosen ensemble method. If ‘auto’, this is determined automatically based on the size of the dataset. The other parameters of the estimator need to be set at initialisation.
perc (
int, default= 90) – Instead of the max, we use the percentile defined by the user to pick our threshold for comparison between shadow and real features. The max tends to be too stringent. This provides finer control. The lower perc is, the more false positives will be picked as relevant, but also the fewer relevant features will be left out. The usual trade-off. Setting perc to 100 is essentially the vanilla Boruta, corresponding to the max.
alpha (
float, default= 0.05) – Level at which the corrected p-values will get rejected in both correction steps.
importance (
str, default ='shap') – The kind of variable importance used to compare and discriminate original vs shadow predictors. Note that the builtin tree importance (gini/impurity based importance) is biased towards numerical and large cardinality predictors, even if they are random. Shapley values and permutation importance are robust w.r.t. those predictors. Possible values: ‘shap’ (Shapley values), ‘fastshap’ (FastTreeShap implementation), ‘pimp’ (permutation importance) and ‘native’ (Gini/impurity).
two_step (
Boolean, default= True) – If you want to use the original implementation of Boruta with Bonferroni correction only, set this to False.
max_iter (
int, default= 100) – The maximum number of iterations to perform.
random_state (
int, RandomState instance or None; default=None) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
verbose (
int, default 0) – Controls verbosity of output. 0: no output, 1: displays iteration number, 2: which features have been selected already
- Variables:
n_features (
int) – The number of selected features.
support (
array of shape [n_features]) – The mask of selected features - only confirmed ones are True.
support_weak (
array of shape [n_features]) – The mask of selected tentative features, which haven’t gained enough support during the max_iter number of iterations.
selected_features (
list of str) – the list of columns to keep
ranking (
array of shape [n_features]) – The feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1 and tentative features are assigned rank 2.
ranking_absolutes (
array of shape [n_features]) – The absolute feature ranking as ordered by the selection process. It does not guarantee that this order is correct for all models. For a model-agnostic ranking, see the attribute ranking.
cat_name (
list of str) – the name of the categorical columns
cat_idx (
list of int) – the index of the categorical columns
imp_real_hist (
array) – array of the historical feature importance of the real predictors
sha_max (
float) – the maximum feature importance of the shadow predictors
col_names (
list of str) – the names of the real predictors
Examples
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from arfs.feature_selection import Leshy
>>>
>>> # load X and y
>>> # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
>>> X = pd.read_csv('examples/test_X.csv', index_col=0).values
>>> y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
>>> y = y.ravel()
>>>
>>> # define random forest classifier, utilising all cores and
>>> # sampling in proportion to y labels
>>> rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
>>>
>>> # define the Leshy (Boruta) feature selection method
>>> feat_selector = Leshy(rf, n_estimators='auto', verbose=2, random_state=1)
>>>
>>> # find all relevant features - 5 features should be selected
>>> feat_selector.fit(X, y)
>>>
>>> # check selected features - first 5 features are selected
>>> feat_selector.selected_features_
>>>
>>> # check ranking of features
>>> feat_selector.ranking_
>>>
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
References
See the original paper [1] for more details.
[1] Kursa M., Rudnicki W., “Feature Selection with the Boruta Package”, Journal of Statistical Software, Vol. 36, Issue 11, Sep 2010
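The effect of the `perc` parameter can be illustrated with a small percentile computation over shadow importances. This is a sketch under assumptions: the importance values are made up, and the linearly interpolated `percentile` helper is written here for self-containment rather than taken from the library.

```python
def percentile(values, perc):
    """Linearly interpolated percentile (perc in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * perc / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Hypothetical shadow importances from one iteration
shadow_importances = [0.01, 0.02, 0.03, 0.04, 0.10]

threshold_max = percentile(shadow_importances, 100)  # vanilla Boruta: the max
threshold_90 = percentile(shadow_importances, 90)    # less stringent threshold
print(threshold_max, threshold_90)
```

A single unusually important shadow feature dominates the max, whereas a lower percentile dampens its influence, at the cost of letting more false positives through.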
- _add_shadows_get_imps(X, y, sample_weight, dec_reg)[source]#
Add a shuffled copy of the columns (shadows) and get the feature importance of the augmented data set
- Parameters:
X (
pd.DataFrameofshape [n_samples,n_features]) – predictor matrixy (
pd.seriesofshape [n_samples]) – targetsample_weight (
array-like,shape = [n_samples], defaultNone) – Individual weights for each sampledec_reg (
array) – holds the decision about each feature 1, 0, -1 (accepted, undecided, rejected)
- Returns:
- imp_real: array
feature importance of the real predictors
- imp_sha: array
feature importance of the shadow predictors
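The shadowing step itself is simple: each column is shuffled independently, which preserves its marginal distribution while breaking any link with the target. Below is a standard-library sketch; the `shadow_` prefix, the `add_shadows` name and the dict-of-lists layout are hypothetical choices, not the library's.

```python
import random


def add_shadows(columns, seed=0):
    """Return the original columns plus one independently shuffled copy of each."""
    rng = random.Random(seed)
    augmented = dict(columns)
    for name, values in columns.items():
        shuffled = list(values)   # copy, so the real column is left untouched
        rng.shuffle(shuffled)
        augmented[f"shadow_{name}"] = shuffled
    return augmented


data = {"age": [23, 35, 41, 58, 62], "income": [1.2, 3.4, 2.2, 5.1, 4.0]}
augmented = add_shadows(data, seed=42)
print(sorted(augmented))
```

The importance of the real and shadow columns would then be computed on the augmented data with the chosen estimator.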
- static _assign_hits(hit_reg, cur_imp, imp_sha_max)[source]#
count how many times a given feature was more important than the best of the shadow features
- Parameters:
hit_reg (
array) – count how many times a given feature was more important than the best of the shadow featurescur_imp (
array) – current importanceimp_sha_max (
array) – importance of the best shadow predictor
- Returns:
hit_reg (
array) – the number of times a given feature was more important than the best of the shadow features
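The hit-counting logic amounts to incrementing a counter for every feature that beats the shadow threshold in the current iteration. A minimal list-based sketch (the real method presumably works on arrays; the strict comparison is an assumption):

```python
def assign_hits(hit_reg, cur_imp, imp_sha_max):
    """Increment the hit count of each feature whose importance beats the shadow threshold."""
    return [h + 1 if imp > imp_sha_max else h for h, imp in zip(hit_reg, cur_imp)]


hits = assign_hits(hit_reg=[0, 2, 1], cur_imp=[0.5, 0.1, 0.3], imp_sha_max=0.2)
print(hits)
```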
- _calculate_absolute_ranking()[source]#
Compute feature importance scores using SHAP values.
- Parameters:
new_x_tr (
numpy.ndarray) – The training dataset after being processed.shap_matrix (
numpy.ndarray) – The matrix containing SHAP values computed by a LightGBM model.param (
dict) – A dictionary containing the parameters for a LightGBM model.objective (
str) – The objective function of the LightGBM model.
- Returns:
list– A list of tuples containing feature names and their corresponding importance scores.
- _calculate_relative_ranking(n_feat, tentative, confirmed, imp_history)[source]#
Calculates the relative ranking of features based on their importance history.
- Parameters:
n_feat (
int) – The total number of features.tentative (
ndarrayofshape (n_tentative_features,)) – An array containing the indices of tentative features.confirmed (
ndarrayofshape (n_confirmed_features,)) – An array containing the indices of confirmed features.imp_history (
ndarrayofshape (n_iterations + 1,n_features)) – An array containing the feature importances for each iteration.
- Returns:
None
- _calculate_support(confirmed, tentative, n_feat)[source]#
Calculate the feature support arrays.
- Parameters:
confirmed (
array-likeofshape (n_confirmed,)) – Indices of confirmed features.tentative (
array-likeofshape (n_tentative,)) – Indices of tentative features.n_feat (
int) – Total number of features.
- Returns:
None– The function populates the following class attributes: - n_features_ : intNumber of selected features.
- support_ndarray of shape (n_feat,)
Boolean array indicating the selected features.
- support_weak_ndarray of shape (n_feat,)
Boolean array indicating the tentatively selected features.
- _check_params(X, y)[source]#
Private method. Checks the hyperparameters as well as X and y before proceeding with fit.
- Parameters:
X (
pd.DataFrame) – predictor matrixy (
pd.series) – target series
- Raises:
ValueError – [description]
ValueError – [description]
- _do_tests(dec_reg, hit_reg, _iter)[source]#
Private method. Tests whether each feature should be tagged as relevant (confirmed), not relevant (rejected) or undecided. The test treats the iterations as binomial trials: it counts how many times a given feature was more important than the best of the shadow features, and checks whether the probability associated with the z-score is below, between or above the rejection or acceptance threshold.
- Parameters:
dec_reg (
array) – holds the decision about each feature 1, 0, -1 (accepted, undecided, rejected)hit_reg (
array) – counts how many times a given feature was more important than the best of the shadow features_iter (
int) – iteration number
- Returns:
dec_reg (
array) – holds the decision about each feature 1, 0, -1 (accepted, undecided, rejected)
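The decision rule can be sketched with exact binomial tails. This is an illustration under assumptions, not the library's code: it uses a fair-coin null hypothesis (a feature beats the best shadow half of the time by chance), a plain Bonferroni correction over the number of features, and hypothetical helper names.

```python
from math import comb


def binom_upper_tail(hits, n):
    """P(X >= hits) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(hits, n + 1)) / 2 ** n


def binom_lower_tail(hits, n):
    """P(X <= hits) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(0, hits + 1)) / 2 ** n


def decide(hits, n_iter, n_features, alpha=0.05):
    """Return 1 (accepted), -1 (rejected) or 0 (undecided) for one feature."""
    level = alpha / n_features  # Bonferroni-corrected level
    if binom_upper_tail(hits, n_iter) < level:
        return 1
    if binom_lower_tail(hits, n_iter) < level:
        return -1
    return 0


decisions = [decide(h, n_iter=20, n_features=3) for h in (20, 10, 0)]
print(decisions)
```

A feature that wins every round is accepted, one that never wins is rejected, and one that wins about half the time stays undecided until more iterations accumulate.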
- static _fdrcorrection(pvals, alpha=0.05)[source]#
Benjamini/Hochberg p-value correction for false discovery rate, from statsmodels package. Included here for decoupling dependency on statsmodels.
- Parameters:
pvals (
array_like) – set of p-values of the individual tests.alpha (
float) – error rate
- Returns:
rejected (
array,bool) – True if a hypothesis is rejected, False if notpvalue-corrected (
array) – pvalues adjusted for multiple hypothesis testing to limit FDR
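For readers unfamiliar with the Benjamini/Hochberg procedure, a pure-Python sketch follows. It mirrors the two return values documented above (rejection mask and adjusted p-values) but is written independently of statsmodels and of the library; the name `fdr_bh` is a placeholder.

```python
def fdr_bh(pvals, alpha=0.05):
    """Benjamini/Hochberg step-up correction.

    Returns (rejected, adjusted): a list of booleans and the adjusted p-values,
    both in the original order of `pvals`.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    rejected = [p <= alpha for p in adjusted]
    return rejected, adjusted


rejected, adjusted = fdr_bh([0.01, 0.04, 0.03, 0.5], alpha=0.05)
print(rejected, adjusted)
```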
- _fit(X_raw, y, sample_weight=None)[source]#
Private method. See the methods overview in the documentation for explanation of the process
- Parameters:
X_raw (
array-like,shape = [n_samples,n_features]) – The training input samples.y (
array-like,shape = [n_samples]) – The target values.sample_weight (
array-like,shape = [n_samples], defaultNone) – Individual weights for each sample
- Returns:
self (
object) – Nothing but attributes
- _get_tree_num(n_feat)[source]#
Private method. Gets a good estimate for the number of trees given the number of features.
- Parameters:
n_feat (
int) – The number of features- Returns:
n_estimators (
int) – the number of trees
- static _nanrankdata(X, axis=1)[source]#
Replaces bottleneck’s nanrankdata with scipy and numpy alternative.
- Parameters:
X (
arrayorpd.DataFrame) – the data arrayaxis (
int, optional) – row-wise (0) or column-wise (1), by default 1
- Returns:
ranks (
array) – the ranked array
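NaN-aware ranking means that missing values are skipped when ranks are assigned and stay NaN in the output. The one-dimensional sketch below assumes average ranks for ties (the default of scipy's rankdata); whether the library method uses exactly this tie rule is not stated here, and `nan_rank` is a hypothetical name.

```python
import math


def nan_rank(values):
    """Rank the non-NaN entries (1 = smallest, ties get the average rank); NaN stays NaN."""
    valid = sorted((v, i) for i, v in enumerate(values) if not math.isnan(v))
    ranks = [math.nan] * len(values)
    pos = 0
    while pos < len(valid):
        end = pos
        # extend the block while the next value is tied with the current one
        while end + 1 < len(valid) and valid[end + 1][0] == valid[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for _, i in valid[pos:end + 1]:
            ranks[i] = avg_rank
        pos = end + 1
    return ranks


ranks = nan_rank([0.3, math.nan, 0.1, 0.3])
print(ranks)
```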
- _print_result(dec_reg, _iter, start_time)[source]#
Print the results of feature selection.
- Parameters:
dec_reg (
bool) – Decision on whether to proceed with another round of feature selection._iter (
int) – Current iteration number.start_time (
float) – Time when the feature selection process started.
- Returns:
None– The function prints the relevant results and running time.
- _print_results(dec_reg, _iter, flag)[source]#
Private method, printing the result
- Parameters:
dec_reg (
array) – whether the feature has been tagged as relevant (confirmed), not relevant (rejected) or undecided
_iter (
int) – the iteration numberflag (
int) – is still in the feature selection process or not
- Returns:
- output: str
the output to be printed out
- _run_iteration(X, y, sample_weight, dec_reg, sha_max_history, imp_history, hit_reg, _iter)[source]#
Run one iteration of the Leshy (Boruta-style) feature selection loop.
- Parameters:
X (
array-likeofshape (n_samples,n_features)) – The input samples.y (
array-likeofshape (n_samples,)) – The target values.sample_weight (
array-likeofshape (n_samples,), defaultNone) – Sample weights. If None, then samples are equally weighted.dec_reg (
array-likeofshape (n_samples,)) – Decision function of the estimator.sha_max_history (
listoffloats) – List of the maximum shadow importance value at each iteration.imp_history (
array-likeofshape (n_iterations,n_features)) – Matrix of feature importances at each iteration.hit_reg (
array-likeofshape (n_samples,)) – Array of hit counts for each sample._iter (
int) – The current iteration number.
- Returns:
dec_reg (
array-likeofshape (n_samples,)) – Updated decision function of the estimator.sha_max_history (
listoffloats) – List of the maximum shadow importance value at each iteration.imp_history (
array-likeofshape (n_iterations,n_features)) – Matrix of feature importances at each iteration.hit_reg (
array-likeofshape (n_samples,)) – Array of hit counts for each sample.imp_sha_max (
float) – The maximum shadow importance value for this iteration.
- _update_estimator()[source]#
Update the estimator with a new random state, if applicable.
If the dataset is not categorical, the estimator’s random_state parameter is updated with a new random state generated by the random_state attribute of the Leshy object. If the estimator is a LightGBM model, the random state value is generated between 0 and 10000.
- Parameters:
None –
- Returns:
None
- _update_tree_num(dec_reg)[source]#
Update the number of trees in the estimator based on the number of selected features.
- Parameters:
dec_reg (
array-likeofshape (n_features,)) – The decision rule for each feature, where negative values indicate that the feature should be rejected and non-negative values indicate that the feature should be selected.- Returns:
None
Notes
This function updates the n_estimators parameter of the estimator if it is set to “auto”. The number of trees is determined based on the number of selected features. Specifically, the number of trees is set to the value returned by the _get_tree_num method, which takes as input the number of selected features that are not rejected.
If n_estimators is not set to “auto”, this function does nothing.
- fit(X, y, sample_weight=None)[source]#
Fits the Boruta feature selection with the provided estimator.
- Parameters:
X (
array-like,shape = [n_samples,n_features]) – The training input samples.y (
array-like,shape = [n_samples]) – The target values.sample_weight (
array-like,shape = [n_samples], defaultNone) – Individual weights for each sample
- Returns:
self (
object) – Nothing but attributes
- plot_importance(n_feat_per_inch=5)[source]#
Boxplot of the variable importance, ordered by magnitude. The max shadow variable importance is illustrated by the dashed line. Requires applying the fit method first.
- Parameters:
n_feat_per_inch (
int, default5) – number of features to plot per inch (for scaling the figure)- Returns:
fig (
plt.figure) – the matplotlib figure object containing the boxplot
- select_features(X, y, sample_weight=None)[source]#
Select features using the Leshy algorithm.
- Parameters:
X (
array-likeofshape (n_samples,n_features)) – The input data.y (
array-likeofshape (n_samples,)) – The target values.sample_weight (
array-likeofshape (n_samples,), defaultNone) – Individual weights for each sample.
- Returns:
dec_reg (
ndarrayofshape (n_features,)) – The decision rule. 1 means the feature is selected, 0 means the feature is not selected.sha_max_history (
list) – List of the maximum shadow importances per iteration.imp_history (
ndarrayofshape (n_iterations,n_features)) – Array containing the feature importances per iteration.imp_sha_max (
float) – Maximum shadow importance value.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') Leshy#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (
str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (
object) – The updated object.
- class arfs.feature_selection.MinRedundancyMaxRelevance(n_features_to_select, relevance_func=None, redundancy_func=None, task='regression', denominator_func=<function mean>, only_same_domain=False, return_scores=False, n_jobs=1, show_progress=True)[source]#
Bases: SelectorMixin, BaseEstimator
MRMR feature selection for a classification or a regression task. For a classification task, the target should be of object or pandas category dtype. For a regression task, the target should be of numeric dtype. The predictors can be categorical or numerical; no encoding is required. The dtype is detected automatically and the appropriate association measure is applied (correlation, correlation ratio or Theil’s U).
- Parameters:
n_features_to_select (int) – Number of features to select.
relevance_func (callable, optional) – Relevance function having arguments “X”, “y”, “sample_weight” and returning a pd.Series containing a score of relevance for each feature.
redundancy_func (callable, optional) – Redundancy method. If callable, it should take “X”, “sample_weight” as input and return a pandas.Series containing a score of redundancy for each feature.
denominator_func (str or callable, optional, default 'mean') – Synthesis function to apply to the denominator of MRMR score. If string, name of method. Supported: ‘max’, ‘mean’. If callable, it should take an iterable as input and return a scalar.
task (str) – either “regression” or “classification”
only_same_domain (bool, optional, default False) – If False, all the necessary correlation coefficients are computed. If True, only features belonging to the same domain are compared. Domain is defined by the string preceding the first underscore: for instance “cusinfo_age” and “cusinfo_income” belong to the same domain, whereas “age” and “income” don’t.
return_scores (bool, optional, default False) – If False, only the list of selected features is returned. If True, a tuple containing (list of selected features, relevance, redundancy) is returned.
n_jobs (int, optional, default 1) – Maximum number of workers to use. Only used when relevance = “f” or redundancy = “corr”. If -1, use as many workers as min(cpu count, number of features).
show_progress (bool, optional, default True) – If False, no progress bar is displayed. If True, a TQDM progress bar shows the number of features processed.
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
ranking (pd.DataFrame) – name and scores for the selected features
support (list of bool) – the list of the selected X-columns
Example
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import MinRedundancyMaxRelevance
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> y.name = "target"
>>> fs_mrmr = MinRedundancyMaxRelevance(
...     n_features_to_select=5,
...     relevance_func=None,
...     redundancy_func=None,
...     task="regression",  # or "classification"
...     denominator_func=np.mean,
...     only_same_domain=False,
...     return_scores=False,
...     show_progress=True)
>>> # for a classification task: fs_mrmr.fit(X=X, y=y.astype(str), sample_weight=None)
>>> fs_mrmr.fit(X=X, y=y, sample_weight=None)
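To make the relevance/redundancy trade-off concrete, here is a minimal sketch of a greedy MRMR loop, assuming absolute Pearson correlation as a stand-in for both the relevance and the redundancy measures and the mean as the denominator function. The function `greedy_mrmr` and the toy data are hypothetical; the class itself may use different association measures and bookkeeping.

```python
import numpy as np
import pandas as pd


def greedy_mrmr(X, y, n_features_to_select):
    """Greedy MRMR sketch: score = relevance / mean redundancy (illustration only)."""
    relevance = X.corrwith(y).abs()   # association of each predictor with the target
    corr = X.corr().abs()             # pairwise association between predictors
    selected, candidates = [], list(X.columns)
    while candidates and len(selected) < n_features_to_select:
        if selected:
            # mean association of each candidate with the already selected features
            redundancy = corr.loc[candidates, selected].mean(axis=1)
            score = relevance.loc[candidates] / redundancy.clip(lower=1e-12)
        else:
            # first pick: relevance only, nothing to be redundant with yet
            score = relevance.loc[candidates]
        best = score.idxmax()
        selected.append(best)
        candidates.remove(best)
    return selected


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = pd.DataFrame({"a": a, "a_copy": a + 0.01 * rng.normal(size=500), "b": b})
y = pd.Series(2 * a + b, name="target")

selected = greedy_mrmr(X, y, n_features_to_select=2)
print(selected)
```

In this toy data `a_copy` is almost a duplicate of `a`, so once one of them is selected the other is heavily penalised by the redundancy term and `b` is picked instead.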
- fit(X, y, sample_weight=None)[source]#
Fit the MRMR selector by learning the associations.
- Parameters:
X (pd.DataFrame, shape (n_samples, n_features)) – Training predictors, where n_samples is the number of samples and n_features is the number of features.
y (array-like or pd.Series of shape (n_samples,)) – Target vector. Must be numeric for regression or categorical for classification.
sample_weight (pd.Series, optional, shape (n_samples,)) – Weights for computing the statistics (e.g. weighted average).
- Returns:
self (
object) – If return_scores=False, returns self. If return_scores=True, returns (selected_features, relevance_scores).
- fit_transform(X, y, sample_weight=None, **fit_params)[source]#
Fit to data, then transform it. Fits transformer to X and y and optionally sample_weight with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
sample_weight (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Sample weight values.
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new (ndarray of shape (n_samples, n_features_new)) – Transformed array.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') MinRedundancyMaxRelevance#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.MissingValueThreshold(threshold=0.05)[source]#
Bases: BaseThresholdSelector
Feature selector that removes all features with a high percentage of missing values. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters:
threshold (float, default=0.05) – Features whose fraction of missing values in the training set is larger than this threshold will be removed.
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
support (list of bool) – the list of the selected X-columns
selected_features (list of str) – the list of names of selected features
not_selected_features (list of str) – the list of names of rejected features
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import MissingValueThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = MissingValueThreshold(0.05)
>>> selector.fit_transform(X)
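The rule behind this selector can be sketched with plain pandas. This is an illustration under assumptions, not the library code: the toy frame is hypothetical, and treating a fraction exactly equal to the threshold as acceptable is an assumption.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: one complete column, one half-empty, one with a single gap
X = pd.DataFrame({
    "complete": [1.0, 2.0, 3.0, 4.0],
    "sparse": [np.nan, np.nan, 3.0, 4.0],
    "one_gap": [1.0, np.nan, 3.0, 4.0],
})
threshold = 0.3

# fraction of missing values per column: 0.0, 0.5 and 0.25
missing_fraction = X.isna().mean()
# assumption: a column is kept when its missing fraction does not exceed the threshold
support = (missing_fraction <= threshold).to_numpy()
selected_features = X.columns[support].tolist()
not_selected_features = X.columns[~support].tolist()

print(selected_features)      # ['complete', 'one_gap']
print(not_selected_features)  # ['sparse']
```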
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') MissingValueThreshold#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.UniqueValuesThreshold(threshold=1)[source]#
Bases: BaseThresholdSelector
Feature selector that removes all features with zero variance (a single unique value) or with too few unique values relative to the threshold. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning.
- Parameters:
threshold (int, default=1) – Threshold on the number of unique values in the training set. Features with too few unique values relative to this threshold will be removed; with the default of 1, constant features are removed. The threshold should be >= 1.
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
support (list of bool) – the list of the selected X-columns
selected_features (list of str) – the list of names of selected features
not_selected_features (list of str) – the list of names of rejected features
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import UniqueValuesThreshold
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = UniqueValuesThreshold(1)
>>> selector.fit_transform(X)
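As a rough illustration of the idea, the unique-value count per column can be computed with pandas. The toy frame is hypothetical, and keeping only columns with strictly more unique values than the threshold is an assumption chosen so that the default of 1 drops constant columns.

```python
import pandas as pd

# hypothetical toy data: a constant column, a binary column and an identifier
X = pd.DataFrame({
    "constant": [7, 7, 7, 7],
    "binary": [0, 1, 0, 1],
    "identifier": [1, 2, 3, 4],
})
threshold = 1

# number of distinct values per column: 1, 2 and 4
n_unique = X.nunique()
# assumption: keep columns with strictly more unique values than the threshold
support = (n_unique > threshold).to_numpy()
selected_features = X.columns[support].tolist()

print(selected_features)  # ['binary', 'identifier']
```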
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') UniqueValuesThreshold#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- class arfs.feature_selection.VariableImportance(task='regression', encode=True, n_iterations=10, threshold=0.99, lgb_kwargs={'objective': 'rmse', 'zero_as_missing': False}, encoder_kwargs=None, fastshap=False, verbose=-1)[source]#
Bases: SelectorMixin, BaseEstimator
Feature selector that removes predictors with zero or low variable importance.
Identify the features with zero/low importance according to the SHAP values of a lightgbm model. The GBM can be trained with early stopping using a validation set to prevent overfitting. The feature importances are averaged over n_iterations to reduce the variance. The predictors are then ranked from the most important to the least important and the cumulative variable importance is computed. All the predictors that do not contribute (VI=0), or that fall outside the share of cumulative importance set by the threshold, are removed.
- Parameters:
task (string) – The machine learning task, either ‘classification’ or ‘regression’ or ‘multiclass’, be sure to use a consistent objective function
encode (boolean, default=True) – Whether or not to encode the predictors
n_iterations (int, default=10) – Number of iterations, the more iterations, the smaller the variance
threshold (float, default=0.99) – The selector computes the cumulative feature importance and ranks the predictors from the most important to the least important. All the predictors contributing to less than this value are rejected.
lgb_kwargs (dictionary of keyword arguments) – dictionary of lightgbm estimators parameters with at least the objective function {‘objective’:’rmse’}
encoder_kwargs (dictionary of keyword arguments, optional) – dictionary of the OrdinalEncoderPandas parameters
- Returns:
selected_features (list of str) – List of selected features.
- Variables:
n_features_in (int) – number of input predictors
assoc_matrix (pd.DataFrame) – the square association matrix
collinearity_summary (pd.DataFrame) – the pairs of collinear features and the association values
support (list of bool) – the list of the selected X-columns
selected_features (list of str) – the list of names of selected features
not_selected_features (list of str) – the list of names of rejected features
fastshap (boolean) – enable or not the fasttreeshap implementation
verbose (int, default=-1) – controls the progress bar, > 1 print out progress
Example
>>> import pandas as pd
>>> from sklearn.datasets import make_regression
>>> from arfs.feature_selection import VariableImportance
>>> X, y = make_regression(n_samples=1000, n_features=50, n_informative=5, shuffle=False)
>>> X = pd.DataFrame(X)
>>> y = pd.Series(y)
>>> pred_name = [f"pred_{i}" for i in range(X.shape[1])]
>>> X.columns = pred_name
>>> selector = VariableImportance(threshold=0.75)
>>> selector.fit_transform(X, y)
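The cumulative-importance cut described above can be sketched as follows, assuming the importances are already available as a name-to-value mapping. The helper `select_by_cumulative_importance` and the example values are hypothetical, and the exact handling of the boundary in the library may differ.

```python
import pandas as pd


def select_by_cumulative_importance(importances, threshold=0.99):
    """Keep the top-ranked features whose cumulative share stays within threshold.

    Hypothetical helper for illustration only.
    """
    # rank from the most important to the least important
    imp = pd.Series(importances, dtype=float).sort_values(ascending=False)
    # normalise so that the importances sum to 1, then accumulate
    cumulative = (imp / imp.sum()).cumsum()
    # drop zero-importance features and everything past the threshold
    keep = (cumulative <= threshold) & (imp > 0)
    return imp.index[keep].tolist()


importances = {"a": 50, "b": 30, "c": 15, "d": 5, "e": 0}
selected = select_by_cumulative_importance(importances, threshold=0.9)
print(selected)  # ['a', 'b']
```

Here the cumulative shares are 0.5, 0.8, 0.95, 1.0 and 1.0, so only the first two features stay within a threshold of 0.9.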
- fit(X, y, sample_weight=None)[source]#
Learn variable importance from X and y, supervised learning.
- Parameters:
X (pd.DataFrame, shape (n_samples, n_features)) – Training predictors, where n_samples is the number of samples and n_features is the number of features.
y (array-like of shape (n_samples,)) – Target values used to train the model from which the variable importances are computed.
sample_weight (pd.Series, optional, shape (n_samples,)) – Weights for computing the statistics (e.g. weighted average).
- Returns:
self (
object) – Returns the instance itself.
- fit_transform(X, y=None, sample_weight=None)[source]#
Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new (ndarray of shape (n_samples, n_features_new)) – Transformed array.
- plot_importance(figsize=None, plot_n=50, n_feat_per_inch=3, log=True, style=None)[source]#
Plots plot_n most important features and the cumulative importance of features. If threshold is provided, prints the number of features needed to reach threshold cumulative importance.
- Parameters:
plot_n (int, default=50) – Number of most important features to plot. Defaults to 50 or the maximum number of features, whichever is smaller.
n_feat_per_inch (int) – Number of features per inch; the larger the value, the less space between labels.
figsize (tuple of float, optional) – The rendered size as a percentage size.
log (bool, default True) – Whether or not to render the variable importance on a log scale.
style (bool, optional) – Whether or not to set the arfs style.
- Returns:
hv.plot – the feature importances holoviews object
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') VariableImportance#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- arfs.feature_selection.make_fs_summary(selector_pipe)[source]#
make_fs_summary makes a summary dataframe highlighting at which step a given predictor has been rejected (if any).
- Parameters:
selector_pipe (sklearn.pipeline.Pipeline) – the feature selector pipeline.
Examples
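>>> from sklearn.pipeline import Pipeline
>>> from arfs.feature_selection import (
...     MissingValueThreshold, UniqueValuesThreshold, CardinalityThreshold,
...     CollinearityThreshold, VariableImportance, GrootCV, make_fs_summary)
>>> # df, predictors, target and weight are assumed to be defined by the user
>>> groot_pipeline = Pipeline([
...     ('missing', MissingValueThreshold()),
...     ('unique', UniqueValuesThreshold()),
...     ('cardinality', CardinalityThreshold()),
...     ('collinearity', CollinearityThreshold(threshold=0.5)),
...     ('lowimp', VariableImportance(lgb_kwargs={'objective': 'poisson'}, verbose=2)),
...     ('grootcv', GrootCV(objective='poisson', cutoff=1, n_folds=3, n_iter=5))])
>>> groot_pipeline.fit_transform(
...     X=df[predictors],
...     y=df[target],
...     lowimp__sample_weight=df[weight],
...     grootcv__sample_weight=df[weight])
>>> fs_summary_df = make_fs_summary(groot_pipeline)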
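The kind of table such a summary conveys can be illustrated without fitting a pipeline. The sketch below is not the make_fs_summary implementation: the function `summarize_rejections`, the column names and the "not rejected" label are all hypothetical, and the input is assumed to be an ordered mapping from step name to the features that survive that step.

```python
import pandas as pd


def summarize_rejections(predictors, surviving_by_step):
    """Record, for each predictor, the first step that rejected it (illustration only)."""
    rejected_at = {}
    remaining = list(predictors)
    # dicts preserve insertion order, so the steps are visited in pipeline order
    for step, kept in surviving_by_step.items():
        kept = set(kept)
        for name in remaining:
            if name not in kept:
                rejected_at[name] = step
        remaining = [name for name in remaining if name in kept]
    return pd.DataFrame({
        "predictor": list(predictors),
        "rejected_at": [rejected_at.get(p, "not rejected") for p in predictors],
    })


summary = summarize_rejections(
    predictors=["age", "income", "constant_flag"],
    surviving_by_step={
        "missing": ["age", "constant_flag"],  # income has too many missing values
        "unique": ["age"],                    # constant_flag has a single value
    },
)
print(summary)
```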