arfs package#
Subpackages#
- arfs.feature_selection package
- Submodules
- arfs.feature_selection.allrelevant module
- Reference:
- The module structure
- Original BorutaPy version
BoostAGroota
GrootCV
Leshy
Leshy._add_shadows_get_imps()
Leshy._assign_hits()
Leshy._calculate_absolute_ranking()
Leshy._calculate_relative_ranking()
Leshy._calculate_support()
Leshy._check_params()
Leshy._do_tests()
Leshy._fdrcorrection()
Leshy._fit()
Leshy._get_tree_num()
Leshy._nanrankdata()
Leshy._print_result()
Leshy._print_results()
Leshy._run_iteration()
Leshy._update_estimator()
Leshy._update_tree_num()
Leshy.fit()
Leshy.plot_importance()
Leshy.select_features()
Leshy.set_fit_request()
Leshy.transform()
_boostaroota()
_compute_importance()
_create_shadow()
_get_confirmed_and_tentative()
_get_imp()
_get_perm_imp()
_get_shap_imp()
_get_shap_imp_fast()
_merge_importance_df()
_reduce_vars_lgb_cv()
_reduce_vars_sklearn()
_select_tentative()
_set_lgb_parameters()
_split_data()
_split_fit_estimator()
_train_lgb_model()
- arfs.feature_selection.base module
- arfs.feature_selection.lasso module
- arfs.feature_selection.mrmr module
- arfs.feature_selection.summary module
- arfs.feature_selection.unsupervised module
- arfs.feature_selection.variable_importance module
- Module contents
BoostAGroota
CardinalityThreshold
CollinearityThreshold
GrootCV
LassoFeatureSelection
Leshy
Leshy._add_shadows_get_imps()
Leshy._assign_hits()
Leshy._calculate_absolute_ranking()
Leshy._calculate_relative_ranking()
Leshy._calculate_support()
Leshy._check_params()
Leshy._do_tests()
Leshy._fdrcorrection()
Leshy._fit()
Leshy._get_tree_num()
Leshy._nanrankdata()
Leshy._print_result()
Leshy._print_results()
Leshy._run_iteration()
Leshy._update_estimator()
Leshy._update_tree_num()
Leshy.fit()
Leshy.plot_importance()
Leshy.select_features()
Leshy.set_fit_request()
Leshy.transform()
MinRedundancyMaxRelevance
MissingValueThreshold
UniqueValuesThreshold
VariableImportance
make_fs_summary()
Submodules#
arfs.association module#
Parallelized Association and Correlation matrix
This module provides parallelized methods for computing associations, namely correlation, correlation ratio, Theil's U, and Cramer's V.
These are the basis of the mRMR feature selection.
- arfs.association._callable_association_matrix_fn(assoc_fn, X, sample_weight=None, n_jobs=1, kind='nom-nom', cols_comb=None)[source]#
_callable_association_matrix_fn, private utility for computing the association matrix with a callable custom association measure.
- Parameters:
  - assoc_fn (callable) – a function that receives two pd.Series (and optionally a weight array) and returns a single number
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - kind (str) – the kind of association, either 'num-num', 'nom-nom' or 'nom-num'
  - cols_comb (list of 2-tuples of str, optional) – the combinations of column names (list of 2-tuples of strings)
- Returns:
  - pd.DataFrame – the association matrix
- arfs.association._callable_association_series_fn(assoc_fn, X, target, sample_weight=None, n_jobs=1, kind='nom-nom')[source]#
_callable_association_series_fn, private utility for computing an association series with a callable custom association measure.
- Parameters:
  - assoc_fn (callable) – a function that receives two pd.Series (and optionally a weight array) and returns a single number
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - target (str or int) – the predictor name or index with which to compute the association
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - kind (str) – the kind of association, either 'num-num', 'nom-nom' or 'nom-num'
- Returns:
  - pd.Series – the association series
- Raises:
  - ValueError – if kind is not 'num-num', 'nom-nom' or 'nom-num'
- arfs.association._check_association_input(X, sample_weight=None, handle_na='drop')[source]#
_check_association_input, private function that checks the inputs: converts X to a pd.DataFrame if needed, adds column names if none are provided, checks that sample_weight is None or of the right dimensionality, and handles NA values according to the chosen method (drop, fill, or None).
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - tuple – the dataframe and the sample weights
- Raises:
  - ValueError – if sample_weight contains NA
- arfs.association._weighted_correlation_ratio(*args)[source]#
Calculates the correlation ratio (sometimes denoted by the Greek letter eta) for categorical-continuous association. It answers the question: given a continuous measurement, is it possible to know which category it is associated with? The value is in the range [0, 1], where 0 means a category cannot be determined from a continuous measurement, and 1 means a category can be determined with absolute certainty.
Based on the scikit-learn implementation of the unweighted version.
- Returns:
  - float – the value of the correlation ratio
- arfs.association.annotate_heatmap(im, data=None, valfmt='{x:.2f}', textcolors=('black', 'white'), threshold=None, **textkw)[source]#
annotate_heatmap annotates a heatmap.
- Parameters:
  - im (matplotlib.axes.Axes) – the AxesImage to be labeled
  - data (array-like of shape (M, N), optional) – the data to illustrate; if none is provided, the function retrieves the array from the matplotlib object, by default None
  - valfmt (str, optional) – the annotation formatting, by default "{x:.2f}"
  - textcolors (tuple, optional) – a pair of colors; the first is used for values below a threshold, the second for those above, by default ("black", "white")
  - threshold (float, optional) – the value in data units according to which the colors from textcolors are applied; if None (the default), the middle of the colormap is used as the separation
  - textkw (dict, optional) – all other arguments are forwarded to the matplotlib annotation
- Returns:
  - list – the text annotations
- arfs.association.association_matrix(X, sample_weight=None, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=<function weighted_corr>, nom_num_assoc=<function correlation_ratio>, n_jobs=1, handle_na='drop')[source]#
Computes the association matrix for continuous-continuous, categorical-continuous, and categorical-categorical predictors using specified callable functions.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - nom_nom_assoc (callable) – the function to compute the categorical-categorical association
  - num_num_assoc (callable) – the function to compute the numerical-numerical association
  - nom_num_assoc (callable) – the function to compute the categorical-numerical association
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – how to handle NA values ('drop', 'fill', or None), by default "drop"
- Returns:
  - pd.DataFrame – the association matrix
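Example
A minimal usage sketch; the mixed-type frame and column names below are illustrative, not taken from the library:
>>> import numpy as np
>>> import pandas as pd
>>> from arfs.association import association_matrix
>>> rng = np.random.default_rng(42)
>>> df = pd.DataFrame({
...     "num_1": rng.normal(size=100),
...     "num_2": rng.normal(size=100),
...     "cat_1": pd.Series(rng.choice(list("abc"), size=100), dtype="category"),
... })
>>> # default measures: Theil's U (nom-nom), weighted correlation (num-num),
>>> # correlation ratio (nom-num)
>>> assoc = association_matrix(df, n_jobs=1)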
- arfs.association.association_series(X, target, features=None, sample_weight=None, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=functools.partial(<function weighted_corr>, method='spearman'), nom_num_assoc=<function correlation_ratio>, normalize=False, n_jobs=1, handle_na='drop')[source]#
Computes the association series for different types of predictors.
This function calculates the association between the specified target and other predictors in X. It supports different types of associations: nominal-nominal, numerical-numerical, and nominal-numerical.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - target (str or int) – the predictor name or index with which to compute the association
  - features (list of str, optional) – the list of features with which to compute the association; if None, all features in X are used
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - nom_nom_assoc (callable) – the function to compute the nominal-nominal (categorical-categorical) association; it should take two pd.Series and an optional weight array, and return a single number
  - num_num_assoc (callable) – the function to compute the numerical-numerical association; it should take two pd.Series and return a single number
  - nom_num_assoc (callable) – the function to compute the nominal-numerical association; it should take two pd.Series and return a single number
  - normalize (bool, optional) – whether to normalize the scores; if True, scores are normalized to the range [0, 1]
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1; -1 uses all available cores
  - handle_na (str, optional) – how to handle NA values; options are 'drop', 'fill', and None. The default, 'drop', drops rows with NA values.
- Returns:
  - pd.Series – a series with all the association values with the target column, sorted in descending order
- Raises:
  - TypeError – if features is provided but is not a list of strings
Examples
>>> import pandas as pd
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> association_series(X, 'sepal length (cm)', num_num_assoc=my_num_num_function)
Notes
The function dynamically selects the appropriate association method based on the data types of the target and other predictors. For numerical-numerical associations, it uses num_num_assoc; for nominal-nominal, nom_nom_assoc; and for nominal-numerical, nom_num_assoc.
- arfs.association.cluster_sq_matrix(sq_matrix, method='ward')[source]#
Apply agglomerative clustering to sort a square correlation matrix.
- Parameters:
  - sq_matrix (pd.DataFrame) – a square correlation matrix
  - method (str, optional) – the linkage method, by default "ward"
- Returns:
  - pd.DataFrame – the sorted square matrix
Example
>>> from arfs.association import association_matrix, cluster_sq_matrix
>>> assoc = association_matrix(iris_df)
>>> assoc_clustered = cluster_sq_matrix(assoc, method="complete")
- arfs.association.correlation_ratio(x, y, sample_weight=None, as_frame=False)[source]#
Compute the weighted correlation ratio. The association between a continuous predictor (y) and a categorical predictor (x). It can be weighted.
- Parameters:
  - x (pd.Series of shape (n_samples,)) – the categorical predictor vector
  - y (pd.Series of shape (n_samples,)) – the continuous predictor vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - as_frame (bool) – whether to return the output as a dataframe or a float
- Returns:
  - float – the value of the correlation ratio
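Example
A small sketch of the expected behaviour (unweighted; the data are constructed so that the category fully determines the continuous value):
>>> import numpy as np
>>> import pandas as pd
>>> from arfs.association import correlation_ratio
>>> x = pd.Series(["a"] * 50 + ["b"] * 50)           # categorical predictor
>>> y = pd.Series(np.r_[np.zeros(50), np.ones(50)])  # continuous, determined by x
>>> correlation_ratio(x, y)                          # expected to be close to 1.0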
- arfs.association.correlation_ratio_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
correlation_ratio_matrix computes the weighted Correlation Ratio for categorical-numerical association. This is a symmetric coefficient: CR(x,y) = CR(y,x)
The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - pd.DataFrame – the correlation ratio matrix (lower triangular) in a tidy (long) format
- arfs.association.correlation_ratio_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
correlation_ratio_series computes the weighted correlation ratio for categorical-numerical association. This is a symmetric coefficient: CR(x,y) = CR(y,x).
The computation is embarrassingly parallel and is distributed over the available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format, as a series.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - target (str or int) – the predictor name or index with which to compute the association
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - pd.Series – the correlation ratio series (lower triangular) in a tidy (long) format
- arfs.association.cramer_v(x, y, sample_weight=None, as_frame=False)[source]#
Computes the weighted Cramer's V statistic of two categorical predictors.
- Parameters:
  - x (pd.Series of shape (n_samples,)) – the first categorical predictor
  - y (pd.Series of shape (n_samples,)) – the second categorical predictor; the order doesn't matter, the association is symmetrical
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - as_frame (bool) – whether to return the output as a DataFrame or a float
- Returns:
  - pd.DataFrame or float – a single-row DataFrame with the predictor names and the statistic value, or the statistic as a float
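Example
A small sketch of the expected behaviour (unweighted, perfectly associated columns):
>>> import pandas as pd
>>> from arfs.association import cramer_v
>>> x = pd.Series(["a", "a", "b", "b"] * 25)
>>> y = pd.Series(["u", "u", "v", "v"] * 25)  # one-to-one mapping with x
>>> cramer_v(x, y)                            # expected to be close to 1.0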
- arfs.association.cramer_v_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
cramer_v_matrix computes the weighted Cramer’s V statistic for categorical-categorical association. This is a symmetric coefficient: V(x,y) = V(y,x)
It uses the corrected Cramer’s V statistics, itself based on the chi2 contingency table. The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - pd.DataFrame – the Cramer's V matrix (lower triangular) in a tidy (long) format
- arfs.association.cramer_v_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
cramer_v_series computes the weighted Cramer’s V statistic for categorical-categorical association. This is a symmetric coefficient: V(x,y) = V(y,x)
It uses the corrected Cramer’s V statistics, itself based on the chi2 contingency table. The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - target (str or int) – the predictor name or index with which to compute the association
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - pd.Series – the Cramer's V series
- arfs.association.create_col_combinations(func, selected_cols)[source]#
Create column combinations or permutations based on the symmetry of the function.
This function checks if func is symmetric. If it is, it creates combinations of selected_cols; otherwise, it creates permutations.
- Parameters:
  - func (callable) – the function to check for symmetry; it should be decorated with @symmetric_function
  - selected_cols (list) – the columns to be combined or permuted
- Returns:
  - list of tuples – a list of tuples representing column combinations or permutations; if func is symmetric, combinations of selected_cols are returned, otherwise permutations
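Example
To illustrate the difference with plain itertools (a sketch of the idea, not the arfs internals):
>>> from itertools import combinations, permutations
>>> cols = ["a", "b", "c"]
>>> list(combinations(cols, 2))   # symmetric measure: each unordered pair once
[('a', 'b'), ('a', 'c'), ('b', 'c')]
>>> list(permutations(cols, 2))   # asymmetric measure (e.g. Theil's U): both orders
[('a', 'b'), ('a', 'c'), ('b', 'a'), ('b', 'c'), ('c', 'a'), ('c', 'b')]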
- arfs.association.f_cat_classification_parallel(X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na='drop')[source]#
Univariate information dependence.
It ranks features in the same order if all the features are positively correlated with the target. Note that it is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - y (array-like of shape (n_samples,)) – the target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default -1
  - handle_na (str, optional) – either drop rows with NaN, fill NaN with 0, or do nothing, by default "drop"
  - force_finite (bool, optional) – whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected not to be finite:
    - when the target y or some features in X are constant; in this case Pearson's R is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value to 1.0.
    - when a feature in X is perfectly correlated (or anti-correlated) with the target y; in this case the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.
- Returns:
  - f_statistic (array-like of shape (n_features,)) – the F-statistic for each feature
- arfs.association.f_cat_regression(x, y, sample_weight=None, as_frame=False)[source]#
f_cat_regression computes the weighted ANOVA F-value for the provided sample. (continuous target, categorical predictor)
- Parameters:
  - x (pd.Series of shape (n_samples,)) – the categorical predictor vector
  - y (pd.Series of shape (n_samples,)) – the continuous target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - as_frame (bool) – whether to return the output as a dataframe or a float
- Returns:
  - float – the value of the F-statistic
- arfs.association.f_cat_regression_parallel(X, y, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
f_cat_regression_parallel computes the weighted ANOVA F-value for the provided categorical predictors using parallelization of the code (continuous target, categorical predictor).
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - y (array-like of shape (n_samples,)) – the target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - pd.Series – the value of the F-statistic for each predictor
- arfs.association.f_cont_classification(x, y, sample_weight=None, as_frame=False)[source]#
f_cont_classification computes the weighted ANOVA F-value for the provided sample. Categorical target, continuous predictor.
- Parameters:
  - x (pd.Series of shape (n_samples,)) – the continuous predictor vector
  - y (pd.Series of shape (n_samples,)) – the categorical target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - as_frame (bool) – whether to return the output as a dataframe or a float
- Returns:
  - float – the value of the F-statistic
- arfs.association.f_cont_classification_parallel(X, y, sample_weight=None, n_jobs=-1, handle_na='drop')[source]#
f_cont_classification_parallel computes the weighted ANOVA F-value for the provided continuous predictors using parallelization (categorical target, continuous predictors).
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the set of regressors that will be tested sequentially
  - y (array-like of shape (n_samples,)) – the target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default -1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - pd.Series – the value of the F-statistic for each predictor
- arfs.association.f_cont_regression_parallel(X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na='drop')[source]#
Univariate linear regression tests returning F-statistic.
Quick linear model for testing the effect of a single regressor, sequentially for many regressors. This is done in two steps:
1. The cross-correlation between each regressor and the target is computed as E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y)).
2. It is converted to an F-score, which ranks features in the same order if all the features are positively correlated with the target.
Note that it is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - y (array-like of shape (n_samples,)) – the target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default -1
  - handle_na (str, optional) – either drop rows with NaN, fill NaN with 0, or do nothing, by default "drop"
  - force_finite (bool, optional) – whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected not to be finite:
    - when the target y or some features in X are constant; in this case Pearson's R is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value to 1.0.
    - when a feature in X is perfectly correlated (or anti-correlated) with the target y; in this case the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.
- Returns:
  - f_statistic (array-like of shape (n_features,)) – the F-statistic for each feature
- arfs.association.f_oneway_weighted(*args)[source]#
Calculate the weighted F-statistic for one-way ANOVA (continuous target, categorical predictor).
- Parameters:
  - x (array-like of shape (n_samples,)) – the predictor vector
  - y (array-like of shape (n_samples,)) – the target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
- Returns:
  - float – the value of the F-statistic
Notes
The F-statistic is calculated as:
\[F = \frac{\sum_{i=1}^{K} n_i \, (\bar{Y}_{i \bullet} - \bar{Y})^2 / (K-1)}{\sum_{i=1}^{K} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i \bullet})^2 / (N - K)}\]
where K is the number of groups, n_i the size of group i, and N the total number of observations.
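A minimal NumPy sketch of the unweighted case, for intuition; the weighted version replaces the counts and means by their weighted analogues:
>>> import numpy as np
>>> groups = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]  # K = 2 groups
>>> N, K = sum(len(g) for g in groups), len(groups)
>>> grand_mean = np.concatenate(groups).mean()
>>> between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (K - 1)
>>> within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - K)
>>> between / within  # agrees with scipy.stats.f_oneway([1, 2, 3], [4, 5, 6])
13.5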
- arfs.association.f_stat_classification_parallel(X, y, sample_weight=None, n_jobs=1, force_finite=True, handle_na='drop')[source]#
Compute the weighted ANOVA F-value for the provided categorical and numerical predictors using parallelization.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - y (array-like of shape (n_samples,)) – the target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
  - force_finite (bool, optional) – whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected not to be finite:
    - when the target y or some features in X are constant; in this case Pearson's R is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value to 1.0.
    - when a feature in X is perfectly correlated (or anti-correlated) with the target y; in this case the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.
- Returns:
  - pd.Series – the value of the F-statistic for each predictor
- arfs.association.f_stat_regression_parallel(X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na='drop')[source]#
Compute the weighted explained variance for the provided categorical and numerical predictors using parallelization.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - y (array-like of shape (n_samples,)) – the target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default -1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
  - force_finite (bool, optional) – whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected not to be finite:
    - when the target y or some features in X are constant; in this case Pearson's R is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value to 1.0.
    - when a feature in X is perfectly correlated (or anti-correlated) with the target y; in this case the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.
- Returns:
  - pd.Series – the value of the F-statistic for each predictor
- arfs.association.heatmap(data, row_labels, col_labels, ax=None, cbar_kw=None, cbarlabel='', **kwargs)[source]#
Create a heatmap from a numpy array and two lists of labels.
- Parameters:
  - data (array-like of shape (M, N)) – the matrix to plot
  - row_labels (array-like of shape (M,)) – the labels for the rows
  - col_labels (array-like of shape (N,)) – the labels for the columns
  - ax (matplotlib.axes.Axes, optional) – a matplotlib.axes.Axes instance to which the heatmap is plotted; if not provided, the current axes are used or a new one is created, by default None
  - cbar_kw (dict, optional) – a dictionary of arguments for matplotlib.Figure.colorbar, by default None
  - cbarlabel (str, optional) – the label for the colorbar, by default ""
  - kwargs (dict, optional) – all other arguments are forwarded to imshow
- Returns:
  - tuple – the imshow and colorbar objects
- arfs.association.is_list_of_str(str_list)[source]#
Raise an exception if str_list is not a list of strings.
- Parameters:
  - str_list (list) – the list to be tested
- Raises:
  - TypeError – if str_list is not a list[str]
- arfs.association.matrix_to_xy(df, columns=None, reset_index=False)[source]#
matrix_to_xy converts the association matrix from wide to long format.
- Parameters:
  - df (pd.DataFrame) – the wide format of the association matrix
  - columns (list of str, optional) – the list of column names, by default None
  - reset_index (bool, optional) – whether to reset the index or not, by default False
- Returns:
  - pd.DataFrame – the long format of the association matrix
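Example
Conceptually this is a pandas stack; a sketch of the wide-to-long idea (not the exact implementation):
>>> import pandas as pd
>>> wide = pd.DataFrame([[1.0, 0.3], [0.3, 1.0]], index=["a", "b"], columns=["a", "b"])
>>> long = wide.stack()  # MultiIndex (row, col) -> value
>>> long[("a", "b")]
0.3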
- arfs.association.plot_association_matrix(assoc_mat, suffix_dic=None, ax=None, cmap='PuOr', cbarlabel=None, figsize=None, show=True, cbar_kw=None, imgshow_kw=None, annotate=False)[source]#
plot_association_matrix renders the sorted association/correlation matrix. The sorting is done using hierarchical clustering, much like in seaborn and other packages. Categorical (nom): the uncertainty coefficient and the correlation ratio, from 0 to 1. The uncertainty coefficient is asymmetrical (approximating how much the elements on the left PROVIDE INFORMATION about the elements in the row). Continuous (con): symmetrical numerical correlation (Spearman's), from -1 to 1.
- Parameters:
  - assoc_mat (pd.DataFrame) – the square association frame
  - suffix_dic (Dict[str, str], optional) – a dictionary of data types for adding suffixes to column names in the plotting utility, by default None
  - ax (matplotlib.axes.Axes, optional) – the axes to plot on, by default None
  - cmap (str, optional) – the colormap; please use a scientific colormap, see the scicomap package, by default "PuOr"
  - cbarlabel (str, optional) – the colorbar label, by default None
  - figsize (Tuple[float, float], optional) – the figure size in inches, by default None
  - show (bool, optional) – whether or not to display the figure, by default True
  - cbar_kw (Dict, optional) – the colorbar kwargs, by default None
  - imgshow_kw (Dict, optional) – the imshow kwargs, by default None
  - annotate (bool) – whether or not to annotate the colormap
- Returns:
  - matplotlib.figure and matplotlib.axes.Axes – the figure and the axes
- arfs.association.plot_association_matrix_int(assoc_mat, suffix_dic=None, cmap='PuOr', figsize=(800, 600), cluster_matrix=True)[source]#
Plot the interactive sorted association/correlation matrix. The sorting is done using hierarchical clustering, much like in seaborn and other packages. Categorical (nom): the uncertainty coefficient and the correlation ratio, from 0 to 1. The uncertainty coefficient is asymmetrical (approximating how much the elements on the left PROVIDE INFORMATION about the elements in the row). Continuous (con): symmetrical numerical correlation (Spearman's), from -1 to 1.
- Parameters:
  - assoc_mat (pd.DataFrame) – the square association frame
  - suffix_dic (Dict[str, str], optional) – a dictionary of data types for adding suffixes to column names in the plotting utility, by default None
  - cmap (str, optional) – the colormap; please use a scientific colormap, see the scicomap package, by default "PuOr"
  - figsize (Tuple[float, float], optional) – the figure size, by default (800, 600)
  - cluster_matrix (bool) – whether or not to cluster the square matrix, by default True
- Returns:
  - panel.Column – the panel object
- arfs.association.theils_u_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
theils_u_matrix computes the weighted Theil's U statistic for categorical-categorical association. This is an asymmetric coefficient: U(x,y) != U(y,x). U(x,y) is the uncertainty of x given y: the value is in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x.
The computation is embarrassingly parallel and is distributed over the available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - pd.DataFrame – the Theil's U matrix in a tidy (long) format
- arfs.association.theils_u_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
theils_u_series computes the weighted Theil's U statistic for categorical-categorical association. This is an asymmetric coefficient: U(x,y) != U(y,x). U(x,y) is the uncertainty of x given y: the value is in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x.
The computation is embarrassingly parallel and is distributed over the available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - target (str or int) – the predictor name or index with which to compute the association
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - pd.Series – the Theil's U series
- arfs.association.wcorr(x, y, w)[source]#
wcorr computes the weighted Pearson correlation coefficient.
- Parameters:
  - x (array-like of shape (n_samples,)) – the first predictor array
  - y (array-like of shape (n_samples,)) – the second predictor array
  - w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
  - float – the weighted correlation coefficient
- arfs.association.wcorr_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop', method='spearman')[source]#
wcorr_matrix computes the weighted correlation statistic (Pearson or Spearman) for continuous-continuous association. This is a symmetric coefficient: corr(x,y) = corr(y,x).
The computation is embarrassingly parallel and is distributed over the available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
  - method (str) – either "spearman" or "pearson", by default "spearman"
- Returns:
  - pd.DataFrame – the weighted correlation matrix (lower triangular) in a tidy (long) format
- arfs.association.wcorr_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop', method='spearman')[source]#
wcorr_series computes the weighted correlation coefficient (Pearson or Spearman) for continuous-continuous association. This is a symmetric coefficient: corr(x,y) = corr(y,x).
The computation is embarrassingly parallel and is distributed over the available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - target (str or int) – the predictor name or index with which to compute the association
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default 1
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
  - method (str) – either "spearman" or "pearson", by default "spearman"
- Returns:
  - pd.Series – the weighted correlation series
- arfs.association.wcov(x, y, w)[source]#
wcov computes the weighted covariance.
- Parameters:
  - x (array-like of shape (n_samples,)) – the first predictor array
  - y (array-like of shape (n_samples,)) – the second predictor array
  - w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
  - float – the weighted covariance
- arfs.association.weighted_conditional_entropy(x, y, sample_weight=None)[source]#
Computes the weighted conditional entropy between two categorical predictors.
- Parameters:
  - x (pd.Series of shape (n_samples,)) – the predictor vector
  - y (pd.Series of shape (n_samples,)) – the target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
- Returns:
  - float – the weighted conditional entropy
- arfs.association.weighted_corr(x, y, sample_weight=None, as_frame=False, method='spearman')[source]#
weighted_corr computes the weighted correlation coefficient (Pearson or Spearman)
- Parameters:
  - x (pd.Series of shape (n_samples,)) – the first continuous predictor vector
  - y (pd.Series of shape (n_samples,)) – the second continuous predictor vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - as_frame (bool) – whether to return the output as a dataframe or a float
  - method (str) – either "spearman" or "pearson", by default "spearman"
- Returns:
  - float or pd.DataFrame – the weighted correlation coefficient
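Example
A small sketch of the expected behaviour (uniform weights, perfectly linear data):
>>> import numpy as np
>>> import pandas as pd
>>> from arfs.association import weighted_corr
>>> x = pd.Series(np.arange(10, dtype=float))
>>> y = 2 * x + 1  # perfectly linear in x
>>> weighted_corr(x, y, sample_weight=np.ones(10), method="pearson")  # close to 1.0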
- arfs.association.weighted_correlation_1cpu(X, sample_weight=None, handle_na='drop')[source]#
weighted_correlation_1cpu computes the lower triangular weighted correlation matrix using a single CPU, relying on common numpy linear algebra.
- Parameters:
  - X (array-like of shape (n_samples, n_features)) – the predictor dataframe
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - handle_na (str, optional) – either drop rows with NA, fill NA with 0, or do nothing, by default "drop"
- Returns:
  - pd.DataFrame – the lower triangular weighted correlation matrix in long format
- arfs.association.weighted_theils_u(x, y, sample_weight=None, as_frame=False)[source]#
Computes the weighted Theil’s U statistic between two categorical predictors.
- Parameters:
  - x (pd.Series of shape (n_samples,)) – the predictor vector
  - y (pd.Series of shape (n_samples,)) – the target vector
  - sample_weight (array-like of shape (n_samples,), optional) – the weight vector, by default None
  - as_frame (bool) – whether to return the output as a dataframe or a float
- Returns:
  - pd.DataFrame or float – the predictor names and the value of the Theil's U statistic
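Example
A small sketch of the expected behaviour (the columns are independent, so y carries no information about x):
>>> import pandas as pd
>>> from arfs.association import weighted_theils_u
>>> x = pd.Series(["a", "a", "b", "b"] * 25)
>>> y = pd.Series(["u", "v", "u", "v"] * 25)  # independent of x
>>> weighted_theils_u(x, y)                   # expected to be close to 0.0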
- arfs.association.wm(x, w)[source]#
wm computes the weighted mean
- Parameters:
  - x (array-like of shape (n_samples,)) – the target array
  - w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
  - float – the weighted mean
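Example
For intuition, this behaves like NumPy's weighted average (a sketch, not the internal implementation):
>>> import numpy as np
>>> x = np.array([1.0, 2.0, 3.0])
>>> w = np.array([1.0, 1.0, 2.0])
>>> np.average(x, weights=w)  # (1*1 + 1*2 + 2*3) / 4
2.25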
- arfs.association.wrank(x, w)[source]#
wrank computes the weighted rank
- Parameters:
  - x (array-like of shape (n_samples,)) – the target array
  - w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
  - float – the weighted rank
- arfs.association.wspearman(x, y, w)[source]#
wspearman computes the weighted Spearman correlation coefficient.
- Parameters:
  - x (array-like of shape (n_samples,)) – the first predictor array
  - y (array-like of shape (n_samples,)) – the second predictor array
  - w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
  - float – the weighted Spearman correlation coefficient
arfs.benchmark module#
Benchmark Feature Selection
This module provides utilities for comparing and benchmarking feature selection methods
Module Structure:#
- sklearn_pimp_bench: function for comparing using the sklearn permutation importance
- compare_varimp: function for comparing using 3 kinds of variable importance
- highlight_tick: function for highlighting specific (genuine or noise, for instance) predictors in the importance chart
- arfs.benchmark.compare_varimp(feat_selector, models, X, y, sample_weight=None)[source]#
Utility function to compare the results for the three possible kinds of feature importance.
- Parameters:
  - feat_selector (object) – an instance of either Leshy, BoostAGroota or GrootCV
  - models (list of objects) – a list of tree-based scikit-learn estimators
  - X (pd.DataFrame, shape (n_samples, n_features)) – the predictors frame
  - y (pd.Series) – the target (same length as X)
  - sample_weight (None or pd.Series, optional) – the sample weights if any, by default None
- arfs.benchmark.highlight_tick(str_match, figure, color='red', axis='y')[source]#
Highlight the x/y tick-labels if they contain a given string.
- Parameters:
  - str_match (str) – the substring to match
  - figure (object) – the matplotlib figure
  - color (str, optional) – the matplotlib color for highlighting tick-labels, by default 'red'
  - axis (str, optional) – the axis to use for highlighting, by default 'y'
- Returns:
  - plt.figure – the modified matplotlib figure
- Raises:
  - ValueError – if axis is not 'x' or 'y'
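Example
A usage sketch, assuming `fig` is an importance chart produced by one of the arfs selectors and that the injected noise predictors have "random" in their names:
>>> from arfs.benchmark import highlight_tick
>>> fig = highlight_tick(str_match="random", figure=fig, color="red", axis="y")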
- arfs.benchmark.sklearn_pimp_bench(model, X, y, task='regression', sample_weight=None)[source]#
Benchmark using sklearn permutation importance, works for regression and classification.
- Parameters:
  - model (object) – an estimator that has not been fitted, sklearn compatible
  - X (ndarray or DataFrame, shape (n_samples, n_features)) – the data on which permutation importance will be computed
  - y (array-like of shape (n_samples,) or (n_samples, n_classes), or None) – the targets for supervised learning, or None for unsupervised
  - task (str, optional) – the kind of task, either 'regression' or 'classification', by default 'regression'
  - sample_weight (array-like of shape (n_samples,), optional) – the sample weights, by default None
- Returns:
  - plt.figure – the figure corresponding to the feature selection
- Raises:
  - ValueError – if task is not 'regression' or 'classification'
arfs.gbm module#
GBM Wrapper
This module offers a class to train base LightGBM and CatBoost models, with early stopping as the default behavior. The target variable can be finite discrete (classification) or continuous (regression). Additionally, the model allows boosting from an initial score (also known as a baseline for CatBoost) and accepts sample weights as input.
Module Structure:#
- GradientBoosting: main class to train a lightGBM or CatBoost model with early stopping
- class arfs.gbm.GradientBoosting(cat_feat='auto', params=None, stratified=False, show_learning_curve=True, verbose_eval=50, return_valid_features=False)[source]#
Bases: object
Performs the training of a base lightGBM/CatBoost model using early stopping. It works for any of the supported loss functions (lightGBM/CatBoost), so you can use an instance of this class for both regression and classification. For the early stopping process, 20% of the data set is used, with a fixed seed for reproducibility.
The resulting model can be saved at the desired location. Last, you can pass relevant lightGBM/CatBoost parameters and/or sample weights (exposure, etc.) if needed.
Init score of Booster to start from, if required (as for GLM residual modelling using GBM).
- Parameters:
  - cat_feat (List[str], 'auto' or None) – the list of column names of the categorical predictors. For CatBoost, this is much more efficient if those columns are of dtype pd.Categorical. For lightGBM, it is most of the time better to integer-encode them and NOT consider them as categorical (set this parameter to None).
  - params (dict, default None) – the parameters you want to pass to lightGBM/CatBoost, as long as they are valid; if None, default parameters are used
  - stratified (bool, default False) – stratified shuffle split for the early stopping process; for a classification problem, it guarantees the same class proportions
  - show_learning_curve (bool, default True) – whether or not to show the learning curve
  - verbose_eval (int, default 50) – the period for printing the train and validation results; if < 1, no output
- Variables:
  - cat_feat (Union[str, List[str], None]) – the list of categorical predictors after pre-processing
  - model_params (Dict) – the dictionary of model parameters
  - learning_curve (plt.figure) – the learning curve
  - is_init_score (bool) – whether boosted from an initial score or not
  - stratified (bool) – whether or not a stratified shuffle split was used for the early stopping process
Example
>>> # set up the trainer
>>> save_path = "C:/Users/mtpl_bi_pp/base/"
>>> gbm_model = GradientBoosting(cat_feat='auto',
>>>                              stratified=False,
>>>                              params={
>>>                                  'objective': 'tweedie',
>>>                                  'tweedie_variance_power': 1.1
>>>                              })
>>>
>>> # train the model
>>> gbm_model.fit(X=X_tr, y=y_tr, sample_weight=exp_tr)
>>>
>>> # predict new values (test set)
>>> y_bt = gbm_model.predict(X_tt)
>>>
>>> # save the model
>>> gbm_model.save(save_path='C:/models/', name="my_fancy_model")
- fit(X, y, sample_weight=None, init_score=None, groups=None)[source]#
Fit the lightGBM/CatBoost model using the python API and early stopping.
- Parameters:
  - X (pd.DataFrame or np.ndarray) – the predictors' matrix
  - y (pd.Series or np.ndarray) – the target series/array
  - sample_weight (pd.Series or np.ndarray, optional) – the sample_weight series/array, if relevant; if not None, it should be of the same length as the target (default None)
  - init_score (pd.Series or np.ndarray, optional) – the initial score to boost from (series/array), if relevant; if not None, it should be of the same length as the target (default None)
  - groups (pd.Series or np.ndarray, optional) – the groups (e.g. polID) for robust cross-validation; the same group will not appear in two different folds
- predict(X, predict_proba=False)[source]#
Predict the new values using the fitted model.
- Parameters:
  - X (pd.DataFrame or np.ndarray) – the predictors' matrix
  - predict_proba (bool, default False) – whether to return probabilities (only for classification)
- predict_raw(X, **kwargs)[source]#
The native predict method, if you need raw_score, etc.
- Parameters:
  - X (pd.DataFrame or np.ndarray) – the predictors' matrix
  - **kwargs (dict, optional) – an optional dictionary of other parameters for the prediction; see the lightgbm and catboost documentation for details
- Raises:
  - Exception – "method not found" if the method specified in the init differs from "lgb" or "cat"
- save(save_path=None, name=None)[source]#
Save method; saves the model as a pkl file in the specified folder as name.pkl. If the path is None, the model is saved in the current working directory. If the name is not specified, the model is saved as 'gbm_base_model_[TIMESTAMP].pkl'.
- Parameters:
  - save_path (str, optional) – the folder where to save the model, as a pickle/joblib file
  - name (str, optional) – the name of the model file
- Returns:
  - str – where the pkl file is saved
- arfs.gbm._fit_early_stopped_lgb(X, y, sample_weight=None, groups=None, init_score=None, params=None, cat_feat=None, stratified=False, learning_curve=True, verbose_eval=0, return_valid_features=False)[source]#
Convenience function: early stopping for lightGBM, using a dataset and setting the categorical features, sample weights and baseline (init_score), if any. User-defined params can be passed. It works for classification and regression.
- Parameters:
  - X (pd.DataFrame or np.ndarray) – the predictors' matrix
  - y (pd.Series or np.ndarray) – the target series/array
  - sample_weight (pd.Series or np.ndarray, optional) – the sample_weight series/array, if relevant; if not None, it should be of the same length as the target (default None)
  - groups (pd.Series or np.ndarray, optional) – the groups (e.g. polID) for robust cross-validation; the same group will not appear in two different folds
  - params (dict, optional) – the parameters you want to pass to lightGBM/CatBoost, as long as they are valid; if None, default parameters are used
  - init_score (pd.Series or np.ndarray, optional) – the initial score to boost from (series/array), if relevant; if not None, it should be of the same length as the target (default None)
  - cat_feat (str or list of strings, optional) – the categorical features. If a list of int, interpreted as indices. If a list of strings, interpreted as feature names (need to specify feature_name as well). If 'auto' and the data is a pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than the int32 max value (2147483647). Large values could be memory consuming; consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature (default None).
  - stratified (bool, default False) – stratified shuffle split for the early stopping process; for a classification problem, it guarantees the same class proportions
  - learning_curve (bool, default False) – whether or not to show the learning curve
  - verbose_eval (int, default 0) – the period for printing the train and validation results; if < 1, no output
  - return_valid_features (bool, default False) – whether or not to return the validation features
- Returns:
  - model (object) – the model object
  - fig (plt.figure) – the learning curves, matplotlib figure object
- arfs.gbm._make_split(X, y, sample_weight=None, init_score=None, groups=None, stratified=False, test_size=0.2)[source]#
_make_split is a private function for splitting the dataset according to the task.
- Parameters:
  - X (pd.DataFrame or np.ndarray) – the predictors' matrix
  - y (pd.Series or np.ndarray) – the target series/array
  - sample_weight (pd.Series or np.ndarray, optional) – the sample_weight series/array, if relevant; if not None, it should be of the same length as the target (default None)
  - init_score (pd.Series or np.ndarray, optional) – the initial score to boost from, if relevant (default None)
  - groups (pd.Series or np.ndarray, optional) – the groups (e.g. polID) for robust cross-validation; the same group will not appear in two different folds
  - stratified (bool, default False) – stratified shuffle split for the early stopping process; for a classification problem, it guarantees the same class proportions
  - test_size (float, default 0.2) – the test set size, as a percentage of the total number of rows
- Returns:
  - Tuple[Union[pd.DataFrame, pd.Series]] – the split data, target, weights and initial scores (if any)
arfs.parallel module#
Parallelize Pandas
This module provides utilities for parallelizing operations on pd.DataFrame
Module Structure:#
- parallel_matrix_entries: for parallelizing operations returning a matrix (2D), applied on pairs of columns
- parallel_df: for parallelizing operations returning a series (1D), applied on a single column at a time
- arfs.parallel._compute_matrix_entries(X, comb_list, sample_weight=None, func_xyw=None)[source]#
Base closure for computing matrix entries, applying a function to each chunk of column combinations of the dataframe, distributed over the cores. This is similar to https://github.com/smazzanti/mrmr/mrmr/pandas.py
- Parameters:
  - X (pd.DataFrame of shape (n_samples, n_features)) – the set of regressors that will be tested sequentially
  - sample_weight (pd.Series or np.array of shape (n_samples,), optional) – the weight vector, if any, by default None
  - func_xyw (callable, optional) – the callable (function) for computing the individual elements of the matrix; it takes two mandatory inputs (x and y) and an optional input w, the sample_weights
  - comb_list (list of 2-tuples of str) – the pairs of column names corresponding to the entries
- Returns:
  - pd.DataFrame – the results concatenated into a single pandas DataFrame
- arfs.parallel._compute_series(X, y, sample_weight=None, func_xyw=None)[source]#
_compute_series is a utility function for computing the series resulting from the apply.
- Parameters:
  - X (pd.DataFrame of shape (n_samples, n_features)) – the set of regressors that will be tested sequentially
  - y (pd.Series or np.array of shape (n_samples,)) – the target vector
  - sample_weight (pd.Series or np.array of shape (n_samples,), optional) – the weight vector, if any, by default None
  - func_xyw (callable, optional) – the callable (function) for computing the individual elements of the series; it takes two mandatory inputs (x and y) and an optional input w, the sample_weights
- arfs.parallel.parallel_df(func, df, series, sample_weight=None, n_jobs=-1)[source]#
parallel_df applies a function to each column of the dataframe, distributed over the cores. This is similar to https://github.com/smazzanti/mrmr/mrmr/pandas.py
- Parameters:
  - func (callable) – the function to be applied to each column
  - df (pd.DataFrame) – the dataframe on which to apply the function
  - series (pd.Series) – the series (target) used by the function
  - sample_weight (pd.Series or np.array, optional) – the weight vector, if any, of shape (n_samples,), by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default -1
- Returns:
  - pd.DataFrame – the results concatenated into a single pandas DataFrame
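Example
A usage sketch; the exact signature expected from func is an assumption based on the mrmr-style chunking described above (it receives a chunk of columns, the target series, and the optional weights):
>>> import numpy as np
>>> import pandas as pd
>>> from arfs.parallel import parallel_df
>>> def abs_corr(df_chunk, y, sample_weight=None):  # hypothetical column-wise scorer
...     return df_chunk.apply(lambda col: abs(col.corr(y)))
>>> rng = np.random.default_rng(0)
>>> X = pd.DataFrame(rng.normal(size=(100, 3)), columns=list("abc"))
>>> y = 2 * X["a"] + rng.normal(size=100)
>>> scores = parallel_df(abs_corr, X, y, n_jobs=2)  # one score per column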
- arfs.parallel.parallel_matrix_entries(func, df, comb_list, sample_weight=None, n_jobs=-1)[source]#
parallel_matrix_entries applies a function to each chunk of column combinations of the dataframe, distributed over the cores. This is similar to https://github.com/smazzanti/mrmr/mrmr/pandas.py
- Parameters:
  - func (callable) – the function to be applied to each column pair
  - df (pd.DataFrame) – the dataframe on which to apply the function
  - comb_list (list of tuples of str) – the pairs of column names corresponding to the entries
  - sample_weight (pd.Series or np.array, optional) – the weight vector, if any, of shape (n_samples,), by default None
  - n_jobs (int, optional) – the number of cores to use for the computation, by default -1
- Returns:
  - pd.DataFrame – the results concatenated into a single pandas DataFrame
arfs.preprocessing module#
This module provides preprocessing classes
Module Structure:#
- OrdinalEncoderPandas: main class for ordinal encoding; takes in a DataFrame and returns a DataFrame of the same shape
- dtype_column_selector: for standardizing the selection of columns based on their dtypes
- TreeDiscretizer: class for discretizing continuous columns and auto-grouping levels of categorical columns
- IntervalToMidpoint: class for converting pandas numerical intervals into their float midpoints
- PatsyTransformer: class for encoding data for (generalized) linear models, leveraging Patsy
- class arfs.preprocessing.IntervalToMidpoint(cols='all')[source]#
Bases: BaseEstimator, TransformerMixin
IntervalToMidpoint is a transformer that converts numerical intervals in a pandas DataFrame to their midpoints.
- Parameters:
  - cols (list of str or str, default "all") – the column(s) to transform; if "all", all columns with numerical intervals will be transformed
- Variables:
  - cols (list of str or str) – the column(s) to transform
  - float_interval_cols (list of str) – the columns with numerical interval data types in the input DataFrame
  - columns_to_transform (list of str) – the columns to be transformed, based on the specified cols attribute
- fit(X=None, y=None)[source]#
Fit the transformer on the input data.
- Parameters:
  - X (Optional[DataFrame]) – the input data to fit the transformer on
  - y (Optional[Series]) – ignored parameter
- Returns:
  - self (IntervalToMidpoint) – the fitted transformer object
- class arfs.preprocessing.OrdinalEncoderPandas(dtype_include=['category', 'object', 'bool'], dtype_exclude=[<class 'numpy.number'>], pattern=None, exclude_cols=None, output_dtype=<class 'numpy.float64'>, handle_unknown='use_encoded_value', unknown_value=nan, encoded_missing_value=nan, return_pandas_categorical=False)[source]#
Bases: OrdinalEncoder
Encode categorical features as an integer array and returns a pandas DF. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature. Read more in the scikit-learn OrdinalEncoder documentation
- Parameters:
  - pattern (str, default None) – the name of columns containing this regex pattern will be included; if None, column selection will not be based on a pattern
  - dtype_include (column dtype or list of column dtypes, default ['category', 'object', 'bool']) – a selection of dtypes to include; for more details, see pandas.DataFrame.select_dtypes
  - dtype_exclude (column dtype or list of column dtypes, default [np.number]) – a selection of dtypes to exclude; for more details, see pandas.DataFrame.select_dtypes
  - exclude_cols (list of str, optional) – the columns to not encode
  - output_dtype (number type, default np.float64) – the desired dtype of the output
  - handle_unknown ({'error', 'use_encoded_value'}, default 'use_encoded_value') – when set to 'error', an error will be raised in case an unknown categorical feature is present during transform; when set to 'use_encoded_value', the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform, an unknown category will be denoted as None.
  - unknown_value (int or np.nan, default np.nan) – when the parameter handle_unknown is set to 'use_encoded_value', this parameter is required and will set the encoded value of unknown categories; it has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.
  - encoded_missing_value (int or np.nan, default np.nan) – the encoded value of missing categories; if set to np.nan, the dtype parameter must be a float dtype
  - return_pandas_categorical (bool, default False) – whether to return the encoded columns as pandas category dtype or as float
- Variables:
  - categories (list of arrays) – the categories of each feature determined during fit (in order of the features in X and corresponding with the output of transform); this does not include categories that weren't seen during fit
  - feature_names_in (ndarray of shape (n_features_in_,)) – the names of features seen during fit; defined only when X has feature names that are all strings
Examples
Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to an ordinal encoding.
>>> ord_enc = OrdinalEncoderPandas(exclude_cols=["PARENT1", "SEX"])
>>> X_enc = ord_enc.fit_transform(X)
>>> X_original = ord_enc.inverse_transform(X_enc)
- fit(X, y=None)[source]#
Fit the OrdinalEncoder to X.
- Parameters:
X (pd.DataFrame of shape (n_samples, n_features)) – The data to determine the categories of each feature.
y (Ignored) – This parameter exists only for compatibility with Pipeline.
- Returns:
self – Fitted encoder.
- fit_transform(X, y=None, sample_weight=None, **fit_params)[source]#
Fit to data, then transform it. Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default None) – Target values (None for unsupervised transformations).
sample_weight (array-like of shape (n_samples,), default None) – Sample weights.
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new (ndarray of shape (n_samples, n_features_new)) – Transformed array.
- inverse_transform(X)[source]#
Convert the data back to the original representation. When unknown categories are encountered, None is used to represent them. If the feature with the unknown category has a dropped category, the dropped category will be its inverse. For a given input feature, if there is an infrequent category, 'infrequent_sklearn' will be used to represent it.
- Parameters:
X (pd.DataFrame of shape (n_samples, n_encoded_features)) – The transformed data.
- Returns:
X_tr (pd.DataFrame of shape (n_samples, n_features)) – Inverse transformed array.
- set_transform_request(*, sample_weight: bool | None | str = '$UNCHANGED$') OrdinalEncoderPandas #
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in transform.
- Returns:
self (object) – The updated object.
- class arfs.preprocessing.PatsyTransformer(formula=None, add_intercept=True, eval_env=0, NA_action='drop', return_type='dataframe')[source]#
Bases:
BaseEstimator
,TransformerMixin
Transformer using patsy formulas.
PatsyTransformer transforms a pandas DataFrame (or dict-like) according to the formula and produces a numpy array or a pandas DataFrame, depending on return_type.
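For instance, a right-hand-side-only formula can inject nonlinearities ahead of a scikit-learn model. A minimal sketch (illustrative data; the formula is plain patsy syntax):
>>> import numpy as np
>>> import pandas as pd
>>> from arfs.preprocessing import PatsyTransformer
>>> data = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [1.0, 4.0, 9.0]})
>>> pt = PatsyTransformer(formula="x1 + np.log(x2)")
>>> X_t = pt.fit_transform(data)  # design matrix with columns x1 and np.log(x2)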
- Parameters:
formula (string or formula-like) – Patsy formula used to transform the data.
add_intercept (boolean, default True) – Whether to add an intercept. If False, the intercept is dropped from the formula; scikit-learn models have built-in intercepts, so an intercept in the data is often unnecessary.
eval_env (environment or int, default 0) – Environment in which to evaluate the formula. Defaults to the scope in which PatsyTransformer was instantiated.
NA_action (string or NAAction, default "drop") – What to do with rows that contain missing values. You can "drop" them, "raise" an error, or, for customization, pass an NAAction object. See patsy.NAAction for details on what values count as 'missing' (and how to alter this).
- Variables:
feature_names (list of string) – Column names / keys of the training data.
return_type (string, default "dataframe") – Data type returned by the transform method: use "dataframe" for a pandas DataFrame (for example, to chain scikit-learn transformers that expect a DataFrame as input) and "ndarray" for a numpy array.
Note
PatsyTransformer adds an intercept only when add_intercept is True; otherwise any intercept is dropped from the formula, even if you specified one there.
As scikit-learn transformers cannot output y, the formula should not contain a left-hand side. If you need to transform both features and targets, use PatsyModel.
- fit(data, y=None)[source]#
Fit the scikit-learn model using the formula.
- Parameters:
data (dict-like (pandas DataFrame)) – Input data. Column names need to match the variables in the formula.
- fit_transform(data, y=None)[source]#
Fit the scikit-learn model using the formula and transform it.
- Parameters:
data (dict-like (pandas DataFrame)) – Input data. Column names need to match the variables in the formula.
- Returns:
X_transform (ndarray) – Transformed data.
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') PatsyTransformer #
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the data parameter in fit.
- Returns:
self (object) – The updated object.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') PatsyTransformer #
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the data parameter in transform.
- Returns:
self (object) – The updated object.
- class arfs.preprocessing.TreeDiscretizer(bin_features='all', n_bins=10, n_bins_max=None, num_bins_as_category=False, boost_params=None, raw=False, task='regression')[source]#
Bases:
BaseEstimator
,TransformerMixin
Discretize continuous and/or categorical data using univariate regularized trees, returning a pandas DataFrame. The TreeDiscretizer is designed to support regression and binary classification tasks. Discretization, also known as quantization or binning, allows for the partitioning of continuous features into discrete values. In certain datasets with continuous attributes, discretization can be beneficial as it transforms the dataset into one with only nominal attributes. Additionally, for categorical predictors, grouping levels can help reduce overfitting and create meaningful clusters.
By encoding discretized features, a model can become more expressive while maintaining interpretability. For example, preprocessing with a discretizer can introduce nonlinearity to linear models. For more advanced possibilities, particularly smooth ones, you can refer to the section on generating polynomial features. The TreeDiscretizer function utilizes univariate regularized trees, with one tree per column to be binned. It finds the optimal partition and returns numerical intervals for numerical continuous columns and pd.Categorical for categorical columns. This approach groups similar levels together, reducing dimensionality and regularizing the model.
TreeDiscretizer handles missing values for both numerical and categorical predictors, eliminating the need for encoding categorical predictors separately.
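A minimal sketch of the intended workflow (synthetic data; the parameter values are illustrative):
>>> import numpy as np
>>> import pandas as pd
>>> from arfs.preprocessing import TreeDiscretizer
>>> rng = np.random.default_rng(42)
>>> X = pd.DataFrame({"num": rng.normal(size=500),
...                   "cat": rng.choice(list("abcdef"), size=500)})
>>> y = pd.Series(rng.poisson(lam=1.0, size=500))
>>> disc = TreeDiscretizer(bin_features="all", n_bins=5, task="regression")
>>> X_binned = disc.fit(X, y).transform(X)  # intervals for "num", grouped levels for "cat"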
Notes
This is a substitute for proper regularization schemes such as:
- GroupLasso: categorical predictors, which are usually encoded as multiple dummy variables, are considered together rather than separately.
- FusedLasso: takes the ordering of the features into account.
- Parameters:
bin_features (list of str or str or None) – The list of names of the variables to be binned, or "all", "numerical" or "categorical" for splitting and grouping all, only numerical, or only categorical columns.
n_bins (int) – The number of bins to create while binning the variables in the bin_features list.
n_bins_max (int, optional) – The maximum number of levels a categorical column can have to avoid being binned.
num_bins_as_category (bool, default False) – Save the numeric bins as pandas category dtype rather than as pandas interval.
boost_params (dict) – The boosting parameters dictionary.
raw (bool) – Return raw levels (non-human-interpretable) rather than levels matching the original ones.
task (str) – Either regression or classification (binary).
- Variables:
tree_dic (dict) – Keys are the binned column names; items are the univariate trees.
bin_upper_bound_dic (dict) – The upper bounds of the numerical intervals.
cat_bin_dict (dict) – The mapping dictionary for the categorical columns.
tree_imputer (dict) – The values that missing entries are mapped to; missing values are routed by the tree and end up in similar splits.
ordinal_encoder_dic (dict) – Dictionary with the fitted encoder, if any.
cat_features (list) – Names of the detected categorical columns.
- fit_transform(X)#
Fit and apply the transformer object on data.
Example
>>> lgb_params = {'min_split_gain': 5}
>>> disc = TreeDiscretizer(bin_features='all', n_bins=10, boost_params=lgb_params)
>>> disc.fit(X=df[predictors], y=df['Frequency'], sample_weight=df['Exposure'])
- fit(X, y, sample_weight=None)[source]#
Fit the TreeDiscretizer on the input data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The predictor dataframe.
y (array-like of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
- Returns:
self (object) – Returns self.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') TreeDiscretizer #
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- transform(X)[source]#
Apply the discretizer on X. Only the columns with more than n_bins_max unique values will be transformed.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features the number of features.
- Returns:
X (pd.DataFrame) – DataFrame with the binned and grouped columns.
- arfs.preprocessing._drop_intercept(formula, add_intercept)[source]#
Drop the intercept from the formula if add_intercept is False.
- arfs.preprocessing.cat_var(data, col_excl=None, return_cat=True)[source]#
Ad hoc categorical encoding (as integer). Automatically detects the non-numerical columns, saves their indices and names, encodes them as integers, and saves the direct and inverse mappers as dictionaries. Returns the dataset with the encoded columns, typed either as int or as pandas categorical.
- Parameters:
data (pd.DataFrame) – the dataset
col_excl (list of str, default None) – the list of column names not to encode (e.g. the ID column)
return_cat (bool, default True) – whether to return encoded object columns as pandas categoricals
- Returns:
df (pd.DataFrame) – the dataframe with encoded columns
cat_var_df (pd.DataFrame) – the dataframe with the indices and names of the categorical columns
inv_mapper (dict) – the dictionary mapping integer -> category
mapper (dict) – the dictionary mapping category -> integer
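A short illustration (the integer codes in the mappers depend on the order in which categories are encountered):
>>> import pandas as pd
>>> from arfs.preprocessing import cat_var
>>> df = pd.DataFrame({"city": ["Paris", "London", "Paris"], "rating": [4, 5, 3]})
>>> df_enc, cat_var_df, inv_mapper, mapper = cat_var(df)
>>> mapper["city"]  # e.g. {'London': 0, 'Paris': 1}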
- class arfs.preprocessing.dtype_column_selector(pattern=None, *, dtype_include=None, dtype_exclude=None, exclude_cols=None)[source]#
Bases:
object
Create a callable to select columns to be used with ColumnTransformer. dtype_column_selector() can select columns based on their dtype or on their names matching a regex. When using multiple selection criteria, all criteria must match for a column to be selected.
- Parameters:
pattern (str, default None) – Columns whose name contains this regex pattern will be included. If None, column selection will not be based on a pattern.
dtype_include (column dtype or list of column dtypes, default None) – A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes().
dtype_exclude (column dtype or list of column dtypes, default None) – A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes().
exclude_cols (list of column names, default None) – A selection of columns to exclude.
- Returns:
selector (callable) – Callable for column selection to be used by a ColumnTransformer.
See also
ColumnTransformer
Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.
Examples
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.compose import make_column_transformer
>>> from arfs.preprocessing import dtype_column_selector
>>> import numpy as np
>>> import pandas as pd
>>> X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
...                   'rating': [5, 3, 4, 5]})
>>> ct = make_column_transformer(
...     (StandardScaler(),
...      dtype_column_selector(dtype_include=np.number)),  # rating
...     (OneHotEncoder(),
...      dtype_column_selector(dtype_include=object)))  # city
>>> ct.fit_transform(X)
array([[ 0.90453403,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  1.        ,  0.        ],
       [ 0.90453403,  0.        ,  0.        ,  1.        ]])
- arfs.preprocessing.find_interval_midpoint(interval_series)[source]#
Find the midpoint (or left/right bound if the interval contains Inf).
- Parameters:
interval_series (pd.Series) – Series of pandas intervals.
- Return type:
ndarray
- Returns:
np.ndarray – Array of midpoints or bounds of the intervals.
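For example (per the description above, an open-ended interval falls back to its finite bound):
>>> import numpy as np
>>> import pandas as pd
>>> from arfs.preprocessing import find_interval_midpoint
>>> s = pd.Series(pd.IntervalIndex.from_tuples([(0.0, 2.0), (2.0, np.inf)]))
>>> find_interval_midpoint(s)  # midpoint 1.0, then the finite (left) bound of the Inf interval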
- arfs.preprocessing.highlight_discarded(s)[source]#
Highlight X in red and V in green.
- Parameters:
s (np.array) – the array of markers to style
- Returns:
list
- arfs.preprocessing.transform_interval_to_midpoint(X, cols='all')[source]#
Transforms interval columns in a pandas DataFrame to their midpoint values.
Notes
Equivalent to IntervalToMidpoint, without the estimator API.
- Parameters:
X (pd.DataFrame) – The input DataFrame containing the data to be transformed.
cols (list of str or str) – The columns to be transformed. Defaults to "all", which transforms all interval columns.
- Return type:
DataFrame
- Returns:
pd.DataFrame – The transformed DataFrame with interval columns replaced by their midpoint values.
- Raises:
TypeError – If the input data is not a pandas DataFrame.
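A quick sketch (illustrative frame; non-interval columns are left untouched):
>>> import pandas as pd
>>> from arfs.preprocessing import transform_interval_to_midpoint
>>> X = pd.DataFrame({"band": pd.IntervalIndex.from_tuples([(0.0, 1.0), (1.0, 3.0)]),
...                   "other": ["a", "b"]})
>>> transform_interval_to_midpoint(X, cols="all")  # "band" -> 0.5 and 2.0, "other" unchanged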
arfs.sampling module#
This module provides methods for sampling large datasets in order to reduce the running time.
- arfs.sampling._gower_distance_row(xi_cat, xi_num, xj_cat, xj_num, feature_weight_cat, feature_weight_num, feature_weight_sum, ranges_of_numeric)[source]#
Compute a row of the Gower matrix
- Parameters:
xi_cat (np.array) – categorical part of row i of the X matrix
xi_num (np.array) – numerical part of row i of the X matrix
xj_cat (np.array) – categorical part of row j of the X matrix
xj_num (np.array) – numerical part of row j of the X matrix
feature_weight_cat (np.array) – weight vector for the categorical features
feature_weight_num (np.array) – weight vector for the numerical features
feature_weight_sum (float) – the sum of the weights
ranges_of_numeric (np.array) – range of the scaled numerical features (between 0 and 1)
- Returns:
np.array – a row vector of the Gower distance matrix
- arfs.sampling.get_5_percent_splits(length)[source]#
Split the number of rows into 5% increments.
- Parameters:
length (int) – the array length
- Returns:
array – vector of sample sizes
- arfs.sampling.gower_matrix(data_x, data_y=None, weight=None, cat_features='auto')[source]#
Compute the Gower distances between X and Y.
Gower is a similarity measure for categorical, boolean and numerical mixed data.
- Parameters:
data_x (np.array or pd.DataFrame) – The data for computing the Gower distance.
data_y (np.array or pd.DataFrame or pd.Series, optional) – The reference matrix or vector to compare with.
weight (np.array or pd.Series, optional) – sample weight.
cat_features (list of str or bool or int, optional) – auto-detect categorical features, or pass a list of categorical features, by default 'auto'.
- Returns:
np.array – The Gower distance matrix, shape (n_samples, n_samples).
Notes
The non-numeric features and the numeric feature ranges are determined from X, not from Y.
- Raises:
TypeError – If two dataframes are passed but have a different number of columns.
TypeError – If two arrays are passed but have a different number of columns.
TypeError – If sparse matrices are passed (not supported).
TypeError – If the list of categorical columns is not a list of strings, integers, or boolean values.
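A small mixed-type example (illustrative; with cat_features='auto', the object column should be detected as categorical):
>>> import pandas as pd
>>> from arfs.sampling import gower_matrix
>>> X = pd.DataFrame({"age": [21, 35, 63], "smoker": ["yes", "no", "no"]})
>>> D = gower_matrix(X)  # symmetric (3, 3) matrix with zeros on the diagonal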
- arfs.sampling.gower_topn(data_x, data_y=None, weight=None, cat_features='auto', n=5, key=None)[source]#
Get the n most similar elements
- Parameters:
data_x (np.array or pd.DataFrame) – The data for the look-up.
data_y (np.array or pd.DataFrame or pd.Series, optional) – The element for which to return the most similar elements; should be a single row.
weight (np.array or pd.Series, optional) – sample weight, by default None.
cat_features (list of str or bool or int, optional) – auto-detect categorical features, or pass a list of strings, booleans, or integers, by default 'auto'.
n (int, optional) – the number of neighbors/similar rows to find, by default 5.
key (str, optional) – identifier key. If several rows refer to the same id, this column is used to find the nearest neighbors with a different id, by default None.
- Returns:
dict – the dictionary of indices and values of the closest elements.
- Raises:
TypeError – If the reference element is not a single row.
- arfs.sampling.isof_find_sample(X, sample_weight=None)[source]#
Find a sample by comparing the distributions of the anomaly scores between the sample and the original distribution, using the KS test. Starts at 5%; however, it will increase to 10%, then 15%, etc., if a significant sample cannot be found.
References
Sampling method taken from boruta_shap, author: https://github.com/Ekeany
- Parameters:
X (pd.DataFrame) – the predictors matrix
sample_weight (pd.Series or np.array, optional) – the sample weights, if any, by default None
- Returns:
array – the indices for reducing the shadow predictors matrix
- arfs.sampling.isolation_forest(X, sample_weight=None)[source]#
Fit an isolation forest to the dataset and assign an anomaly score to every sample.
- Parameters:
X (pd.DataFrame or np.array) – the predictors matrix
sample_weight (pd.Series or np.array, optional) – the sample weights, if any, by default None
- arfs.sampling.sample(df, n=1000, sample_weight=None, method='gower')[source]#
Sample rows from a dataframe when plain random sampling is not enough to reduce the number of rows. The strategies either use hierarchical clustering based on the Gower distance or use an isolation forest to identify the most similar elements. For the clustering algorithm, clusters are determined using the Gower distance (mixed-type data) and the dataset is shrunk from n_samples to n_clusters.
For the isolation forest algorithm, samples are added until a sufficient two-sample KS statistic is reached or until the number of iterations hits the maximum (20).
- Parameters:
df (pd.DataFrame) – the dataframe to sample, with or without the target
n (int, optional) – the number of clusters if method is "gower", by default 1000
sample_weight (pd.Series or np.array, optional) – sample weights, by default None
method (str, optional) – the strategy used for sampling the rows, either "gower" or "isoforest", by default 'gower'
- Returns:
pd.DataFrame – the sampled dataframe
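A minimal sketch of both strategies (synthetic data; the sizes are illustrative):
>>> import numpy as np
>>> import pandas as pd
>>> from arfs.sampling import sample
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({"num": rng.normal(size=2000),
...                    "cat": rng.choice(list("abc"), size=2000)})
>>> df_gower = sample(df, n=200, method="gower")  # 2000 rows shrunk to 200 cluster representatives
>>> df_isofor = sample(df, method="isoforest")    # KS-driven subsample, size chosen automatically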
arfs.utils module#
Utility and validation functions
- arfs.utils.LightForestClassifier(n_feat, n_estimators=10)[source]#
LightGBM implementation of the random forest classifier with the ideal number of features, according to The Elements of Statistical Learning.
- Parameters:
n_feat (int) – the number of predictors (number of columns of the X matrix)
n_estimators (int, optional) – the number of trees/estimators, by default 10
- Returns:
lightgbm classifier – sklearn-compatible random forest estimator based on lightgbm
- arfs.utils.LightForestRegressor(n_feat, n_estimators=10)[source]#
LightGBM implementation of the random forest regressor with the ideal number of features, according to The Elements of Statistical Learning.
- Parameters:
n_feat (int) – the number of predictors (number of columns of the X matrix)
n_estimators (int, optional) – the number of trees/estimators, by default 10
- Returns:
lightgbm regressor – sklearn-compatible random forest estimator based on lightgbm
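The Elements of Statistical Learning suggests sampling roughly sqrt(p) features per split for classification and p/3 for regression; these helpers bake that heuristic into a LightGBM estimator running in random-forest mode. A minimal usage sketch (synthetic data):
>>> from sklearn.datasets import make_classification
>>> from arfs.utils import LightForestClassifier
>>> X, y = make_classification(n_samples=300, n_features=30, random_state=0)
>>> rf = LightForestClassifier(n_feat=30, n_estimators=10).fit(X, y)
>>> y_pred = rf.predict(X[:5])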
- arfs.utils._get_cancer_data()[source]#
Load the breast cancer data and add dummies (random predictors) and a genuine one, for benchmarking purposes. Classification (binary).
- Returns:
object – sklearn Bunch, an extension of dictionary
- arfs.utils._get_titanic_data()[source]#
Load the Titanic data and add dummies (random predictors, numeric and categorical) and a genuine one, for benchmarking purposes. Classification (binary).
- Returns:
object – sklearn Bunch, an extension of dictionary
- arfs.utils._load_boston_data()[source]#
Load the Boston data and add dummies (random predictors, numeric and categorical) and a genuine one, for benchmarking purposes. Regression (positive domain).
- Returns:
object – sklearn Bunch, an extension of dictionary
- arfs.utils._load_housing(as_frame=False)[source]#
Load the California housing data. See here https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html for the downloadable version.
- Parameters:
as_frame (bool) – whether to return a pandas dataframe; if not, a "Bunch" (enhanced dictionary) is returned (default False)
- Returns:
pd.DataFrame or Bunch – the dataset
- arfs.utils._make_corr_dataset_classification(size=1000)[source]#
Generate an artificial dataset for classification tasks, with columns that are correlated, have no variance, have large cardinality, and are numerical or categorical.
- Parameters:
size (int, optional) – the number of rows to generate, by default 1000
- Returns:
pd.DataFrame, pd.Series, pd.Series – the predictors matrix, the target, and the weights
- arfs.utils._make_corr_dataset_regression(size=1000)[source]#
Generate an artificial dataset for regression tasks, with columns that are correlated, have no variance, have large cardinality, and are numerical or categorical.
- Parameters:
size (int, optional) – number of rows to generate, by default 1000
- Returns:
pd.DataFrame, pd.Series, pd.Series – the predictors matrix, the target, and the weights
- arfs.utils.check_if_tree_based(model)[source]#
Check whether the estimator is tree-based.
- Parameters:
model (object) – the estimator to check
- Returns:
condition (boolean) – whether the estimator is tree-based
- arfs.utils.concat_or_group(col, x, max_length=25)[source]#
Concatenate unique values from a column or return a group value.
- Parameters:
col (str) – The name of the column to process.
x (pd.DataFrame) – The DataFrame containing the data.
max_length (int, optional) – The maximum length for the concatenated string, beyond which grouping is performed, by default 25.
- Returns:
str – A concatenated string of unique values if its length is less than max_length; otherwise, a unique group value from the specified column.
Notes
If the concatenated string length is greater than or equal to max_length, this function returns the unique group value from the column with a “_g” suffix.
Examples
>>> data = {
...     'Category_g': [1, 1, 2, 2, 3],
...     'Category': ['AAAAAAAAAAAAAAA', 'Bovoh', 'Ccccccccccccccc', 'D', 'E']}
>>> X = pd.DataFrame(data)
>>> cat_bin_dict = {}
>>> col = 'Category'
>>> cat_bin_dict[col] = (
...     X[[f"{col}_g", col]]
...     .groupby(f"{col}_g")
...     .apply(lambda x: concat_or_group(col, x))
...     .to_dict()
... )
>>> print(cat_bin_dict)
{'Category': {1: 'gr_1', 2: 'gr_2', 3: 'E'}}
- arfs.utils.create_dtype_dict(df, dic_keys='col_names')[source]#
Create a custom dictionary of data type for adding suffixes to column names in the plotting utility for association matrix.
- Parameters:
df (pd.DataFrame) – The dataframe used for computing the association matrix.
dic_keys (str) – Either "col_names" or "dtypes", to return a dictionary keyed by column names or by dtypes.
- Return type:
dict
- Returns:
dict – A dictionary with either column names or dtypes as keys.
- Raises:
ValueError – If dic_keys is neither "col_names" nor "dtypes".
- arfs.utils.get_pandas_cat_codes(X)[source]#
Converts categorical and time features in a pandas DataFrame into numerical codes.
- Parameters:
X (pandas DataFrame) – The input DataFrame containing categorical and/or time features.
- Returns:
X (pandas DataFrame) – The modified DataFrame with categorical and time features replaced by numerical codes.
obj_feat (list or None) – List of column names that were converted to numerical codes. None if no categorical or time features were found.
cat_idx (list or None) – List of column indices for the columns in obj_feat. None if no categorical or time features were found.
- arfs.utils.is_catboost(estimator)[source]#
Check whether the estimator is CatBoost-based.
- Parameters:
estimator (object) – the estimator to check
- Returns:
condition (boolean) – whether it is CatBoost-based
- arfs.utils.is_lightgbm(estimator)[source]#
Check whether the estimator is LightGBM-based.
- Parameters:
estimator (object) – the estimator to check
- Returns:
condition (boolean) – whether it is LightGBM-based
- arfs.utils.is_list_of_bool(bool_list)[source]#
Check if bool_list is a list of Booleans.
- Parameters:
bool_list (list of bool) – the list we want to check
- Returns:
bool – True if a list of Booleans, else False
- arfs.utils.is_list_of_int(int_list)[source]#
Check if int_list is a list of integers.
- Parameters:
int_list (list of int) – the list we want to check
- Returns:
bool – True if a list of integers, else False
- arfs.utils.is_list_of_str(str_list)[source]#
Check if str_list is a list of strings.
- Parameters:
str_list (list or None) – The list to check.
- Returns:
bool – True if the list is a list of strings, False otherwise.
- arfs.utils.is_xgboost(estimator)[source]#
Check whether the estimator is XGBoost-based.
- Parameters:
estimator (object) – the estimator to check
- Returns:
condition (boolean) – whether it is XGBoost-based
- arfs.utils.load_data(name='Titanic')[source]#
Load a toy dataset to test the all relevant feature selection methods. Dummy (random) predictors are added, and ARFS should be able to filter them out. The Titanic predictors are encoded (needed for scikit-learn estimators).
Titanic and cancer are for binary classification; they contain synthetic random (dummy) predictors and a noisy but genuine synthetic predictor. A good all relevant FS method should be able to detect all the predictors genuinely related to the target.
Boston is for regression and likewise contains synthetic dummy predictors and a genuine one.
- Parameters:
name (str, optional) – the name of the dataset. Titanic is for classification with sample_weight, Boston for regression, and cancer for classification (without sample weight), by default 'Titanic'
- Returns:
Bunch – extension of dictionary, accessible by key
- Raises:
ValueError – if the dataset name is invalid
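Typical usage (the data and target attribute names follow the scikit-learn Bunch convention and are assumed here):
>>> from arfs.utils import load_data
>>> bunch = load_data(name="Titanic")
>>> X, y = bunch.data, bunch.target  # a Bunch is a dict with attribute access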
- arfs.utils.plot_y_vs_X(X, y, ncols=2, figsize=(10, 10))[source]#
Plot target vs relevant and non-relevant predictors
- Parameters:
X (pd.DataFrame) – The DataFrame of the predictors.
y (np.array) – The target.
ncols (int, optional) – The number of columns in the facet plot. Default is 2.
figsize (tuple, optional) – The figure size. Default is (10, 10).
- Returns:
plt.figure – The univariate plots of y vs pred_i.
- arfs.utils.set_my_plt_style(height=3, width=5, linewidth=2)[source]#
Set the matplotlib style to fivethirtyeight, with some modifications (colours, axes).
- Parameters:
linewidth (int, default 2) – line width
height (int, default 3) – fig height in inches (yeah, they're still struggling with the metric system)
width (int, default 5) – fig width in inches
- arfs.utils.validate_pandas_input(arg)[source]#
Validate that a pandas object or a numpy array is provided.
- Parameters:
arg (pd.DataFrame or np.array) – the object to validate
- Raises:
TypeError – error if neither a pandas object nor a numpy array is provided
- arfs.utils.validate_sample_weight(sample_weight)[source]#
Validate the sample_weight parameter.
- Parameters:
sample_weight (array-like or None) – Input sample weights.
- Returns:
np.ndarray or None – If sample_weight is a pandas Series, its values are returned as a numpy array. If it is already a numpy array, it is returned unmodified. If it is None, None is returned.
- Raises:
ValueError – If sample_weight is not an array-like object or None.
Module contents#
Init module, providing information about the arfs package.