arfs package#
Subpackages#
- arfs.feature_selection package
- Submodules
- arfs.feature_selection.allrelevant module
- Reference:
- The module structure
- Original BorutaPy version
- BoostAGroota
- GrootCV
- Leshy
- Leshy._add_shadows_get_imps(), Leshy._assign_hits(), Leshy._calculate_absolute_ranking(), Leshy._calculate_relative_ranking(), Leshy._calculate_support(), Leshy._check_params(), Leshy._do_tests(), Leshy._fdrcorrection(), Leshy._fit(), Leshy._get_tree_num(), Leshy._nanrankdata(), Leshy._print_result(), Leshy._print_results(), Leshy._run_iteration(), Leshy._update_estimator(), Leshy._update_tree_num(), Leshy.fit(), Leshy.plot_importance(), Leshy.select_features(), Leshy.set_fit_request(), Leshy.transform()
- _boostaroota(), _compute_importance(), _create_shadow(), _get_confirmed_and_tentative(), _get_imp(), _get_perm_imp(), _get_shap_imp(), _get_shap_imp_fast(), _merge_importance_df(), _reduce_vars_lgb_cv(), _reduce_vars_sklearn(), _select_tentative(), _set_lgb_parameters(), _split_data(), _split_fit_estimator(), _train_lgb_model()
- arfs.feature_selection.base module
- arfs.feature_selection.lasso module
- arfs.feature_selection.mrmr module
- arfs.feature_selection.summary module
- arfs.feature_selection.unsupervised module
- arfs.feature_selection.variable_importance module
- Module contents
- BoostAGroota
- CardinalityThreshold
- CollinearityThreshold
- GrootCV
- LassoFeatureSelection
- Leshy
- Leshy._add_shadows_get_imps(), Leshy._assign_hits(), Leshy._calculate_absolute_ranking(), Leshy._calculate_relative_ranking(), Leshy._calculate_support(), Leshy._check_params(), Leshy._do_tests(), Leshy._fdrcorrection(), Leshy._fit(), Leshy._get_tree_num(), Leshy._nanrankdata(), Leshy._print_result(), Leshy._print_results(), Leshy._run_iteration(), Leshy._update_estimator(), Leshy._update_tree_num(), Leshy.fit(), Leshy.plot_importance(), Leshy.select_features(), Leshy.set_fit_request(), Leshy.transform()
- MinRedundancyMaxRelevance
- MissingValueThreshold
- UniqueValuesThreshold
- VariableImportance
- make_fs_summary()
Submodules#
arfs.association module#
Parallelized Association and Correlation matrix
This module provides parallelized methods for computing associations, namely correlation, correlation ratio, Theil’s U, and Cramer’s V.
They are the basis of the MRMR feature selection.
- arfs.association._callable_association_matrix_fn(assoc_fn, X, sample_weight=None, n_jobs=1, kind='nom-nom', cols_comb=None)[source]#
_callable_association_matrix_fn private function, utility for computing association matrix for a callable custom association
- Parameters:
assoc_fn (callable) – a function which receives two pd.Series (and optionally a weight array) and returns a single number
X (array-like of shape (n_samples, n_features)) – predictor dataframe
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
kind (str) – kind of association, either ‘num-num’ or ‘nom-nom’ or ‘nom-num’
cols_comb (list of 2-tuples of str, optional) – combinations of column names
- Returns:
pd.DataFrame – the association matrix
- arfs.association._callable_association_series_fn(assoc_fn, X, target, sample_weight=None, n_jobs=1, kind='nom-nom')[source]#
_callable_association_series_fn private function, utility for computing association series for a callable custom association
- Parameters:
assoc_fn (callable) – a function which receives two pd.Series (and optionally a weight array) and returns a single number
X (array-like of shape (n_samples, n_features)) – predictor dataframe
target (str or int) – the predictor name or index with which to compute association
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
kind (str) – kind of association, either ‘num-num’ or ‘nom-nom’ or ‘nom-num’
- Returns:
pd.Series – the association series
- Raises:
ValueError – if kind is not ‘num-num’ or ‘nom-nom’ or ‘nom-num’
- arfs.association._check_association_input(X, sample_weight=None, handle_na='drop')[source]#
_check_association_input private function. Checks the inputs, converts X to a pd.DataFrame if needed, and adds column names if none are provided. Checks that sample_weight is None or of the right dimensionality, and handles NA according to the chosen method (drop, fill, None).
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
tuple – the dataframe and the sample weights
- Raises:
ValueError – if sample_weight contains NA
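The checks described above can be sketched as follows. This is a minimal illustration and not the arfs implementation: the function name `check_inputs`, the generated column names, and the choice of unit weights when `sample_weight` is None are all assumptions.

```python
import numpy as np
import pandas as pd


def check_inputs(X, sample_weight=None, handle_na="drop"):
    """Hypothetical stand-in for the input checks described above."""
    # Convert to a DataFrame and invent column names if none were given
    # (assumes X is two-dimensional)
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(np.asarray(X))
        X.columns = [f"pred_{i}" for i in range(X.shape[1])]
    else:
        X = X.copy()

    # Assumption: missing weights are replaced by unit weights
    if sample_weight is None:
        w = np.ones(len(X))
    else:
        w = np.asarray(sample_weight, dtype=float)
        if w.shape != (len(X),):
            raise ValueError("sample_weight must have shape (n_samples,)")
        if np.isnan(w).any():
            raise ValueError("sample_weight contains NA")

    if handle_na == "drop":
        # Drop incomplete rows and the matching weights
        keep = X.notna().all(axis=1).to_numpy()
        X, w = X.loc[keep], w[keep]
    elif handle_na == "fill":
        X = X.fillna(0)
    # handle_na=None leaves the data untouched
    return X, w
```

Dropping rows and weights with the same mask keeps the two aligned, which matters for every weighted statistic computed afterwards.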
- arfs.association._weighted_correlation_ratio(*args)[source]#
Calculates the Correlation Ratio (sometimes denoted by the Greek letter Eta) for categorical-continuous association. It answers the question: given a continuous value of a measurement, is it possible to know which category it is associated with? The value is in the range [0,1], where 0 means a category cannot be determined by a continuous measurement, and 1 means a category can be determined with absolute certainty.
Based on the scikit-learn implementation of the unweighted version.
- Returns:
float– value of the correlation ratio
- arfs.association.annotate_heatmap(im, data=None, valfmt='{x:.2f}', textcolors=('black', 'white'), threshold=None, **textkw)[source]#
annotate_heatmap annotates a heatmap
- Parameters:
im (matplotlib.axes.Axes) – The AxesImage to be labeled
data (array-like of shape (M, N), optional) – data to illustrate; if none is provided, the function retrieves the array of the mpl object, by default None
valfmt (str, optional) – annotation formatting, by default “{x:.2f}”
textcolors (tuple, optional) – A pair of colors. The first is used for values below a threshold, the second for those above, by default (“black”, “white”)
threshold (float, optional) – Value in data units according to which the colors from textcolors are applied. If None (the default), the middle of the colormap is used as separation
textkw (dict, optional) – All other arguments are forwarded to mpl annotation.
- Returns:
_type_– _description_
- arfs.association.association_matrix(X, sample_weight=None, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=<function weighted_corr>, nom_num_assoc=<function correlation_ratio>, n_jobs=1, handle_na='drop')[source]#
Computes the association matrix for continuous-continuous, categorical-continuous, and categorical-categorical predictors using specified callable functions.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Predictor dataframe.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
nom_nom_assoc (callable) – Function to compute the categorical-categorical association.
num_num_assoc (callable) – Function to compute the numerical-numerical association.
nom_num_assoc (callable) – Function to compute the categorical-numerical association.
n_jobs (int, optional) – The number of cores to use for the computation, by default 1.
handle_na (str, optional) – How to handle NA values (‘drop’, ‘fill’, or None), by default “drop”.
- Returns:
pd.DataFrame – The association matrix.
- arfs.association.association_series(X, target, features=None, sample_weight=None, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=functools.partial(<function weighted_corr>, method='spearman'), nom_num_assoc=<function correlation_ratio>, normalize=False, n_jobs=1, handle_na='drop')[source]#
Computes the association series for different types of predictors.
This function calculates the association between the specified target and other predictors in X. It supports different types of associations: nominal-nominal, numerical-numerical, and nominal-numerical.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Predictor dataframe.
target (str or int) – The predictor name or index with which to compute the association.
features (list of str, optional) – List of features with which to compute the association. If None, all features in X are used.
sample_weight (array-like, shape (n_samples,), optional) – The weight vector, by default None.
nom_nom_assoc (callable) – Function to compute the nominal-nominal (categorical-categorical) association. It should take two pd.Series and an optional weight array, and return a single number.
num_num_assoc (callable) – Function to compute the numerical-numerical association. It should take two pd.Series and return a single number.
nom_num_assoc (callable) – Function to compute the nominal-numerical association. It should take two pd.Series and return a single number.
normalize (bool, optional) – Whether to normalize the scores or not. If True, scores are normalized to the range [0, 1]. By default False.
n_jobs (int, optional) – The number of cores to use for the computation, by default 1 (-1 uses all available cores).
handle_na (str, optional) – How to handle NA values. Options are ‘drop’, ‘fill’, and None. The default, ‘drop’, drops rows with NA values.
- Returns:
pd.Series – A series with all the association values with the target column, sorted in descending order.
- Raises:
TypeError – If features is provided but is not a list of strings.
Examples
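>>> import pandas as pd
>>> from sklearn import datasets
>>> from arfs.association import association_series
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> # my_num_num_function is a user-supplied callable taking two pd.Series and returning a number
>>> association_series(X, 'sepal length (cm)', num_num_assoc=my_num_num_function)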
Notes
The function dynamically selects the appropriate association method based on the data types of the target and other predictors. For numerical-numerical associations, it uses num_num_assoc; for nominal-nominal, nom_nom_assoc; and for nominal-numerical, nom_num_assoc.
- arfs.association.cluster_sq_matrix(sq_matrix, method='ward')[source]#
Apply agglomerative clustering to sort a square correlation matrix.
- Parameters:
sq_matrix (pd.DataFrame) – A square correlation matrix.
method (str, optional) – The linkage method, by default “ward”.
- Returns:
pd.DataFrame – A sorted square matrix.
Example
>>> from arfs.association import association_matrix, cluster_sq_matrix
>>> assoc = association_matrix(iris_df)
>>> assoc_clustered = cluster_sq_matrix(assoc, method="complete")
- arfs.association.correlation_ratio(x, y, sample_weight=None, as_frame=False)[source]#
Compute the weighted correlation ratio. The association between a continuous predictor (y) and a categorical predictor (x). It can be weighted.
- Parameters:
x (pd.Series of shape (n_samples,)) – The categorical predictor vector
y (pd.Series of shape (n_samples,)) – The continuous predictor
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
as_frame (bool) – return output as a dataframe or a float
- Returns:
float – value of the correlation ratio
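For intuition, the weighted correlation ratio can be sketched in plain NumPy as the square root of the between-category share of the weighted variance. This is an illustrative sketch under stated assumptions, not the arfs code: the name `weighted_eta` is hypothetical, weights default to one, and whether arfs returns eta or its square is not stated in this page.

```python
import numpy as np


def weighted_eta(x, y, sample_weight=None):
    """Hypothetical sketch: weighted correlation ratio between categories x and values y."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if sample_weight is None else np.asarray(sample_weight, dtype=float)

    grand_mean = np.average(y, weights=w)
    total = np.sum(w * (y - grand_mean) ** 2)
    if total == 0:
        # y is constant: nothing to explain (a convention chosen for this sketch)
        return 0.0

    between = 0.0
    for level in np.unique(x):
        mask = x == level
        group_mean = np.average(y[mask], weights=w[mask])
        # Weighted squared distance of the category mean from the grand mean
        between += w[mask].sum() * (group_mean - grand_mean) ** 2
    return float(np.sqrt(between / total))
```

A value near 1 means the category means alone reproduce almost all of the variation in y; a value near 0 means the categories share the same mean.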
- arfs.association.correlation_ratio_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
correlation_ratio_matrix computes the weighted Correlation Ratio for categorical-numerical association. This is a symmetric coefficient: CR(x,y) = CR(y,x)
The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
pd.DataFrame – The correlation ratio matrix (lower triangular) in a tidy (long) format.
- arfs.association.correlation_ratio_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
correlation_ratio_series computes the weighted correlation ratio for categorical-numerical association. This is a symmetric coefficient: CR(x,y) = CR(y,x)
The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format, a series.
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
target (str or int) – the predictor name or index with which to compute association
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
pd.Series – The correlation ratio series (lower triangular) in a tidy (long) format.
- arfs.association.cramer_v(x, y, sample_weight=None, as_frame=False)[source]#
Computes the weighted Cramer’s V statistic of two categorical predictors.
- Parameters:
x (pd.Series of shape (n_samples,)) – The first categorical predictor.
y (pd.Series of shape (n_samples,)) – The second categorical predictor; order doesn’t matter, the association is symmetrical.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
as_frame (bool) – Return output as a DataFrame or a float.
- Returns:
pd.DataFrame or float – Single row DataFrame with the predictor names and the statistic value, or the statistic as a float.
- arfs.association.cramer_v_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
cramer_v_matrix computes the weighted Cramer’s V statistic for categorical-categorical association. This is a symmetric coefficient: V(x,y) = V(y,x)
It uses the corrected Cramer’s V statistics, itself based on the chi2 contingency table. The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
pd.DataFrame – The Cramer’s V matrix (lower triangular) in a tidy (long) format.
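The corrected Cramer’s V mentioned above starts from the chi-squared statistic of a contingency table. A minimal sketch follows, assuming the usual bias correction for Cramer’s V and a contingency table built from summed weights. The name `weighted_cramers_v`, the `bias_correction` flag, and the use of the total weight as the sample size are assumptions for illustration, not the arfs implementation.

```python
import numpy as np
import pandas as pd


def weighted_cramers_v(x, y, sample_weight=None, bias_correction=True):
    """Hypothetical sketch of a weighted, bias-corrected Cramer's V."""
    x = pd.Series(x).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    if sample_weight is None:
        w = pd.Series(1.0, index=x.index)
    else:
        w = pd.Series(np.asarray(sample_weight, dtype=float))

    # Weighted contingency table: each cell holds the summed weights
    observed = pd.crosstab(x, y, values=w, aggfunc="sum").fillna(0.0).to_numpy()
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = np.sum((observed - expected) ** 2 / expected)

    phi2 = chi2 / n
    r, k = observed.shape
    if bias_correction:
        # Shrink phi^2 and the table dimensions to reduce small-sample bias
        phi2 = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
        r = r - (r - 1) ** 2 / (n - 1)
        k = k - (k - 1) ** 2 / (n - 1)
    denom = min(r - 1, k - 1)
    return float(np.sqrt(phi2 / denom)) if denom > 0 else 0.0
```

The sketch assumes strictly positive weights, so that every row and column of the table has a non-zero total.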
- arfs.association.cramer_v_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
cramer_v_series computes the weighted Cramer’s V statistic for categorical-categorical association. This is a symmetric coefficient: V(x,y) = V(y,x)
It uses the corrected Cramer’s V statistics, itself based on the chi2 contingency table. The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
target (str or int) – the predictor name or index with which to compute association
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
pd.Series – The Cramer’s V series
- arfs.association.create_col_combinations(func, selected_cols)[source]#
Create column combinations or permutations based on the symmetry of the function.
This function checks if func is symmetric. If it is, it creates combinations of selected_cols; otherwise, it creates permutations.
- Parameters:
func (callable) – The function to check for symmetry. Should be decorated with @symmetric_function.
selected_cols (list) – The columns to be combined or permuted.
- Returns:
list of tuples – A list of tuples representing column combinations or permutations. If func is symmetric, combinations of selected_cols are returned; otherwise, permutations are returned.
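The distinction between combinations and permutations can be illustrated with `itertools`. This sketch is not the arfs implementation: the `symmetric` decorator and the `is_symmetric` attribute are hypothetical stand-ins for whatever marker `@symmetric_function` actually sets.

```python
from itertools import combinations, permutations


def symmetric(func):
    """Hypothetical marker decorator: flags an association measure as symmetric."""
    func.is_symmetric = True
    return func


def col_pairs(func, selected_cols):
    """Unordered pairs for symmetric measures, ordered pairs otherwise."""
    make_pairs = combinations if getattr(func, "is_symmetric", False) else permutations
    return list(make_pairs(selected_cols, 2))


@symmetric
def sym_assoc(x, y):
    return 0.0  # placeholder body, only the marker matters here


def asym_assoc(x, y):
    return 0.0  # placeholder body
```

For a symmetric measure such as Cramer’s V, computing only the unordered pairs halves the work; an asymmetric measure such as Theil’s U needs both (a, b) and (b, a).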
- arfs.association.f_cat_classification_parallel(X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na='drop')[source]#
Univariate information dependence.
It ranks features in the same order if all the features are positively correlated with the target. Note that it is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The predictor dataframe.
y (array-like of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
n_jobs (int, optional) – The number of cores to use for the computation, by default -1.
handle_na (str, optional) – Either drop rows with NaN, fill NaN with 0, or do nothing, by default “drop”.
force_finite (bool, optional) – Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:
- when the target y or some features in X are constant. In this case, the Pearson’s R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.
- when a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.
- Returns:
f_statistic (array-like of shape (n_features,)) – F-statistic for each feature.
- arfs.association.f_cat_regression(x, y, sample_weight=None, as_frame=False)[source]#
f_cat_regression computes the weighted ANOVA F-value for the provided sample. (continuous target, categorical predictor)
- Parameters:
x (pd.Series of shape (n_samples,)) – The categorical predictor
y (pd.Series of shape (n_samples,)) – The continuous target
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
as_frame (bool) – return output as a dataframe or a float
- Returns:
float – value of the F-statistic
- arfs.association.f_cat_regression_parallel(X, y, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
f_cat_regression_parallel computes the weighted ANOVA F-value for the provided categorical predictors using parallelization of the code (continuous target, categorical predictor).
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
y (array-like of shape (n_samples,)) – The target vector
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
pd.Series – the value of the F-statistic for each predictor
- arfs.association.f_cont_classification(x, y, sample_weight=None, as_frame=False)[source]#
f_cont_classification computes the weighted ANOVA F-value for the provided sample. Categorical target, continuous predictor.
- Parameters:
x (pd.Series of shape (n_samples,)) – The continuous predictor
y (pd.Series of shape (n_samples,)) – The categorical target
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
as_frame (bool) – return output as a dataframe or a float
- Returns:
float – value of the F-statistic
- arfs.association.f_cont_classification_parallel(X, y, sample_weight=None, n_jobs=-1, handle_na='drop')[source]#
f_cont_classification_parallel computes the weighted ANOVA F-value for the provided continuous predictors using parallelization of the code. Categorical target, continuous predictor.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The set of regressors that will be tested sequentially
y (array-like of shape (n_samples,)) – The target vector
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default -1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
pd.Series – the value of the F-statistic for each predictor
- arfs.association.f_cont_regression_parallel(X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na='drop')[source]#
Univariate linear regression tests returning F-statistic.
Quick linear model for testing the effect of a single regressor, sequentially for many regressors. This is done in 2 steps:
1. The cross-correlation between each regressor and the target is computed using:
E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y))
2. It is converted to an F score.
The F score ranks features in the same order as the correlation if all the features are positively correlated with the target.
Note that it is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The predictor dataframe.
y (array-like of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
n_jobs (int, optional) – The number of cores to use for the computation, by default -1.
handle_na (str, optional) – Either drop rows with NaN, fill NaN with 0, or do nothing, by default “drop”.
force_finite (bool, optional) – Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:
- when the target y or some features in X are constant. In this case, the Pearson’s R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.
- when a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.
- Returns:
f_statistic (array-like of shape (n_features,)) – F-statistic for each feature.
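The two steps, together with the force_finite handling, can be sketched for a single unweighted regressor as below. The conversion F = r² / (1 - r²) · (n - 2) is the standard one for simple linear regression; the function name `f_from_correlation` is hypothetical and the sketch ignores sample weights and p-values.

```python
import numpy as np


def f_from_correlation(x, y, force_finite=True):
    """Hypothetical sketch: F score of a simple linear regression via Pearson's r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    with np.errstate(divide="ignore", invalid="ignore"):
        # Step 1: cross-correlation between the regressor and the target
        r = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
        r = np.clip(r, -1.0, 1.0)  # guard against rounding slightly outside [-1, 1]
        # Step 2: convert the correlation to an F score with (1, n - 2) degrees of freedom
        f = r**2 / (1.0 - r**2) * (n - 2)
    if force_finite and not np.isfinite(f):
        # nan (constant input) becomes 0.0, inf (perfect correlation) becomes the largest float
        f = 0.0 if np.isnan(f) else np.finfo(np.float64).max
    return float(f)
```

Because F is a monotone function of r², features are ranked by the strength of the linear association regardless of its sign.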
- arfs.association.f_oneway_weighted(*args)[source]#
Calculate the weighted F-statistic for one-way ANOVA (continuous target, categorical predictor).
- Parameters:
X (array-like of shape (n_samples,)) – The predictor dataframe.
y (array-like of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
- Returns:
float – The value of the F-statistic.
Notes
The F-statistic is calculated as:
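\[F = \frac{\sum_{i=1}^{K} n_i (\bar{Y}_{i \bullet} - \bar{Y})^2 / (K-1)}{\sum_{i=1}^{K} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_{i\bullet})^2 / (N - K)}\]
where K is the number of groups, n_i the number of observations in group i, N the total number of observations, \(\bar{Y}_{i\bullet}\) the mean of group i, and \(\bar{Y}\) the overall mean. This is the unweighted form of the statistic.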
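One way to carry sample weights into the one-way ANOVA F-statistic is to replace counts by summed weights in the group means and sums of squares. The sketch below does that while keeping the degrees of freedom based on the number of rows; this is an assumption for illustration (the name `weighted_f_oneway` is hypothetical) and may differ from how arfs handles weights.

```python
import numpy as np


def weighted_f_oneway(x, y, sample_weight=None):
    """Hypothetical sketch: one-way ANOVA F for categories x and continuous y."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if sample_weight is None else np.asarray(sample_weight, dtype=float)

    levels = np.unique(x)
    n_samples, n_groups = y.size, levels.size
    grand_mean = np.average(y, weights=w)

    ss_between = 0.0
    ss_within = 0.0
    for level in levels:
        mask = x == level
        group_mean = np.average(y[mask], weights=w[mask])
        # Between-group term: group weight times squared distance to the grand mean
        ss_between += w[mask].sum() * (group_mean - grand_mean) ** 2
        # Within-group term: weighted squared residuals around the group mean
        ss_within += np.sum(w[mask] * (y[mask] - group_mean) ** 2)

    return float((ss_between / (n_groups - 1)) / (ss_within / (n_samples - n_groups)))
```

With unit weights this reduces to the classical statistic: for groups [1, 2, 3] and [4, 5, 6] the between sum of squares is 13.5, the within sum of squares is 4, and F = (13.5 / 1) / (4 / 4) = 13.5.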
- arfs.association.f_stat_classification_parallel(X, y, sample_weight=None, n_jobs=1, force_finite=True, handle_na='drop')[source]#
Compute the weighted ANOVA F-value for the provided categorical and numerical predictors using parallelization.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The predictor dataframe.
y (array-like of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
n_jobs (int, optional) – The number of cores to use for the computation, by default 1.
handle_na (str, optional) – Either drop rows with NA, fill NA with 0, or do nothing, by default “drop”.
force_finite (bool, optional) – Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:
- When the target y or some features in X are constant. In this case, the Pearson’s R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.
- When a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.
- Returns:
pd.Series – The value of the F-statistic for each predictor.
- arfs.association.f_stat_regression_parallel(X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na='drop')[source]#
Compute the weighted explained variance for the provided categorical and numerical predictors using parallelization.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The predictor dataframe.
y (array-like of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
n_jobs (int, optional) – The number of cores to use for the computation, by default -1.
handle_na (str, optional) – Either drop rows with NA, fill NA with 0, or do nothing, by default “drop”.
force_finite (bool, optional) – Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:
- When the target y or some features in X are constant. In this case, the Pearson’s R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.
- When a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.
- Returns:
pd.Series – The value of the F-statistic for each predictor.
- arfs.association.heatmap(data, row_labels, col_labels, ax=None, cbar_kw=None, cbarlabel='', **kwargs)[source]#
heatmap Create a heatmap from a numpy array and two lists of labels.
- Parameters:
data (array-like of shape (M, N)) – matrix to plot
row_labels (array-like of shape (M,)) – labels for the rows
col_labels (array-like of shape (N,)) – labels for the columns
ax (matplotlib.axes.Axes, optional) – A matplotlib.axes.Axes instance to which the heatmap is plotted. If not provided, use current axes or create a new one, by default None
cbar_kw (dict, optional) – A dictionary with arguments to matplotlib.Figure.colorbar, by default None
cbarlabel (str, optional) – The label for the colorbar, by default “”
kwargs (dict, optional) – All other arguments are forwarded to imshow.
- Returns:
tuple – imgshow and cbar objects
- arfs.association.is_list_of_str(str_list)[source]#
Raise an exception if str_list is not a list of strings.
- Parameters:
str_list (list) – the list to be tested
- Raises:
TypeError – if str_list is not a list[str]
- arfs.association.matrix_to_xy(df, columns=None, reset_index=False)[source]#
matrix_to_xy wide to long format of the association matrix
- Parameters:
df (pd.DataFrame) – the wide format of the association matrix
columns (list of str, optional) – list of column names, by default None
reset_index (bool, optional) – whether to reset the index or not, by default False
- Returns:
pd.DataFrame – the long format of the association matrix
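A wide-to-long conversion of this kind can be sketched with `DataFrame.stack`. The function name `wide_to_long` and the default column names `row`, `col`, and `val` are assumptions made for this example; the actual names used by matrix_to_xy are not given in this page.

```python
import pandas as pd


def wide_to_long(df, columns=None, reset_index=False):
    """Hypothetical sketch: reshape a square association matrix to long format."""
    columns = columns or ["row", "col", "val"]
    # stack() turns each (row label, column label) cell into one entry of a Series
    long = df.stack().rename_axis(columns[:2]).rename(columns[2])
    return long.reset_index() if reset_index else long.to_frame()


wide = pd.DataFrame(
    [[1.0, 0.5], [0.5, 1.0]], index=["a", "b"], columns=["a", "b"]
)
long = wide_to_long(wide, reset_index=True)
```

Each row of `long` then holds one pair of predictors and their association value, which is the tidy layout the matrix functions in this module return.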
- arfs.association.plot_association_matrix(assoc_mat, suffix_dic=None, ax=None, cmap='PuOr', cbarlabel=None, figsize=None, show=True, cbar_kw=None, imgshow_kw=None, annotate=False)[source]#
plot_association_matrix renders the sorted association/correlation matrix. The sorting is done using hierarchical clustering, much as in seaborn or other packages. Categorical (nom): uncertainty coefficient and correlation ratio, from 0 to 1. The uncertainty coefficient is asymmetrical (approximating how much the elements on the left PROVIDE INFORMATION on elements in the row). Continuous (con): symmetrical numerical correlations (Spearman’s), from -1 to 1.
- Parameters:
assoc_mat (pd.DataFrame) – the square association frame
suffix_dic (Dict[str, str], optional) – dictionary of data type for adding suffixes to column names in the plotting utility for association matrix, by default None
ax (matplotlib.axes.Axes, optional) – the axes to plot on, by default None
cmap (str, optional) – the colormap. Please use a scientific colormap. See the scicomap package, by default “PuOr”
cbarlabel (str, optional) – the colorbar label, by default None
figsize (Tuple[float, float], optional) – figure size in inches, by default None
show (bool, optional) – Whether or not to display the figure, by default True
cbar_kw (Dict, optional) – colorbar kwargs, by default None
imgshow_kw (Dict, optional) – imgshow kwargs, by default None
annotate (bool) – Whether to annotate or not the colormap
- Returns:
matplotlib.figure and matplotlib.axes.Axes – the figure and the axes
- arfs.association.plot_association_matrix_int(assoc_mat, suffix_dic=None, cmap='PuOr', figsize=(800, 600), cluster_matrix=True)[source]#
Plot the interactive sorted association/correlation matrix. The sorting is done using hierarchical clustering, much as in seaborn or other packages. Categorical (nom): uncertainty coefficient and correlation ratio, from 0 to 1. The uncertainty coefficient is asymmetrical (approximating how much the elements on the left PROVIDE INFORMATION on elements in the row). Continuous (con): symmetrical numerical correlations (Spearman’s), from -1 to 1.
- Parameters:
assoc_mat (pd.DataFrame) – the square association frame
suffix_dic (Dict[str, str], optional) – dictionary of data type for adding suffixes to column names in the plotting utility for association matrix, by default None
cmap (str, optional) – the colormap. Please use a scientific colormap. See the scicomap package, by default “PuOr”
figsize (Tuple[float, float], optional) – figure size (width, height), by default (800, 600)
cluster_matrix (bool) – whether or not to cluster the square matrix, by default True
- Returns:
panel.Column – the panel object
- arfs.association.theils_u_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
theils_u_matrix computes the weighted Theil’s U statistic for categorical-categorical association. This is an asymmetric coefficient: U(x,y) != U(y,x). U(x, y) means the uncertainty of x given y: the value is in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x.
The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
pd.DataFrame – The Theil’s U matrix in a tidy (long) format.
- arfs.association.theils_u_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop')[source]#
theils_u_series computes the weighted Theil’s U statistic for categorical-categorical association. This is an asymmetric coefficient: U(x,y) != U(y,x). U(x, y) means the uncertainty of x given y: the value is in the range [0,1], where 0 means y provides no information about x, and 1 means y provides full information about x.
The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
target (str or int) – the predictor name or index with which to compute association
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
pd.Series – The Theil’s U series.
- arfs.association.wcorr(x, y, w)[source]#
wcorr computes the weighted Pearson correlation coefficient
- Parameters:
x (array-like of shape (n_samples,)) – the predictor 1 array
y (array-like of shape (n_samples,)) – the predictor 2 array
w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
float– weighted correlation coefficient
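As a rough illustration of how a weighted Pearson coefficient is built from a weighted mean and a weighted covariance, here is a minimal pure-Python sketch. The function names mirror wm, wcov and wcorr for readability, but the bodies are written for this example and are not taken from the arfs source. In particular, normalizing by the sum of weights is an assumption.

```python
import math


def wm(x, w):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)


def wcov(x, y, w):
    """Weighted covariance around the weighted means."""
    mx, my = wm(x, w), wm(y, w)
    return sum(wi * (xi - mx) * (yi - my) for xi, yi, wi in zip(x, y, w)) / sum(w)


def wcorr(x, y, w):
    """Weighted Pearson correlation: wcov(x, y) / sqrt(wcov(x, x) * wcov(y, y))."""
    return wcov(x, y, w) / math.sqrt(wcov(x, x, w) * wcov(y, y, w))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]
w = [1.0, 1.0, 2.0, 0.5]

r = wcorr(x, y, w)
```

Because the normalization by the sum of weights cancels in the ratio, the coefficient is unchanged when all weights are multiplied by the same constant.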
- arfs.association.wcorr_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop', method='spearman')[source]#
wcorr_matrix computes the weighted correlation statistic (Pearson or Spearman) for continuous-continuous association. This is a symmetric coefficient: corr(x, y) = corr(y, x)
The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
method (str) – either “spearman” or “pearson”, by default “spearman”
- Returns:
pd.DataFrame – The weighted correlation matrix (lower triangular) in a tidy (long) format.
- arfs.association.wcorr_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop', method='spearman')[source]#
wcorr_series computes the weighted correlation coefficient (Pearson or Spearman) for continuous-continuous association. This is a symmetric coefficient: corr(x, y) = corr(y, x)
The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
target (str or int) – the predictor name or index with which to compute association
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default 1
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
method (str) – either “spearman” or “pearson”, by default “spearman”
- Returns:
pd.Series– The weighted correlation series.
- arfs.association.wcov(x, y, w)[source]#
wcov computes the weighted covariance
- Parameters:
x (array-like of shape (n_samples,)) – the predictor 1 array
y (array-like of shape (n_samples,)) – the predictor 2 array
w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
float– weighted covariance
- arfs.association.weighted_conditional_entropy(x, y, sample_weight=None)[source]#
Computes the weighted conditional entropy between two categorical predictors.
- Parameters:
x (pd.Series of shape (n_samples,)) – The predictor vector.
y (pd.Series of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
- Returns:
float– Weighted conditional entropy.
- arfs.association.weighted_corr(x, y, sample_weight=None, as_frame=False, method='spearman')[source]#
weighted_corr computes the weighted correlation coefficient (Pearson or Spearman)
- Parameters:
x (pd.Series of shape (n_samples,)) – The categorical predictor vector
y (pd.Series of shape (n_samples,)) – The continuous predictor
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
as_frame (bool) – return output as a dataframe or a float
method (str) – either “spearman” or “pearson”, by default “spearman”
- Returns:
floatorpd.DataFrame– weighted correlation coefficient
- arfs.association.weighted_correlation_1cpu(X, sample_weight=None, handle_na='drop')[source]#
weighted_correlation_1cpu computes the lower triangular weighted correlation matrix on a single CPU, using common numpy linear algebra
- Parameters:
X (array-like of shape (n_samples, n_features)) – predictor dataframe
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None
handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”
- Returns:
pd.DataFrame– the lower triangular weighted correlation matrix in long format
- arfs.association.weighted_theils_u(x, y, sample_weight=None, as_frame=False)[source]#
Computes the weighted Theil’s U statistic between two categorical predictors.
- Parameters:
x (pd.Series of shape (n_samples,)) – The predictor vector.
y (pd.Series of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
as_frame (bool) – Return output as a dataframe or a float.
- Returns:
pd.DataFrameorfloat– Predictor names and value of the Theil’s U statistic.
- arfs.association.wm(x, w)[source]#
wm computes the weighted mean
- Parameters:
x (array-like of shape (n_samples,)) – the target array
w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
float– weighted mean
- arfs.association.wrank(x, w)[source]#
wrank computes the weighted rank
- Parameters:
x (array-like of shape (n_samples,)) – the target array
w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
float– weighted rank
- arfs.association.wspearman(x, y, w)[source]#
wspearman computes the weighted Spearman correlation coefficient
- Parameters:
x (array-like of shape (n_samples,)) – the predictor 1 array
y (array-like of shape (n_samples,)) – the predictor 2 array
w (array-like of shape (n_samples,)) – the sample weights array
- Returns:
float– Spearman weighted correlation coefficient
arfs.benchmark module#
Benchmark Feature Selection
This module provides utilities for comparing and benchmarking feature selection methods
Module Structure:#
- sklearn_pimp_bench: function for comparing using the sklearn permutation importance
- compare_varimp: function for comparing using the 4 possible kinds of variable importance
- highlight_tick: function for highlighting specific (genuine or noise for instance) predictors in the importance chart
- arfs.benchmark.compare_varimp(feat_selector, models, X, y, sample_weight=None)[source]#
Utility function to compare the results for the three possible kinds of feature importance
- Parameters:
feat_selector (object) – an instance of either Leshy, BoostAGroota or GrootCV
models (list of objects) – list of tree based scikit-learn estimators
X (pd.DataFrame, shape (n_samples, n_features)) – the predictors frame
y (pd.Series) – the target (same length as X)
sample_weight (None or pd.Series, optional) – sample weights if any, by default None
- arfs.benchmark.highlight_tick(str_match, figure, color='red', axis='y')[source]#
Highlight the x/y tick-labels if they contain a given string
- Parameters:
str_match (str) – the substring to match
figure (object) – the matplotlib figure
color (str, optional) – the matplotlib color for highlighting tick-labels, by default ‘red’
axis (str, optional) – axis to use for highlighting, by default ‘y’
- Returns:
plt.figure – the modified matplotlib figure
- Raises:
ValueError – if axis is not ‘x’ or ‘y’
- arfs.benchmark.sklearn_pimp_bench(model, X, y, task='regression', sample_weight=None)[source]#
Benchmark using sklearn permutation importance, works for regression and classification.
- Parameters:
model (object) – An estimator that has not been fitted, sklearn compatible.
X (ndarray or DataFrame, shape (n_samples, n_features)) – Data on which permutation importance will be computed.
y (array-like or None, shape (n_samples,) or (n_samples, n_classes)) – Targets for supervised or None for unsupervised.
task (str, optional) – kind of task, either ‘regression’ or ‘classification’, by default ‘regression’
sample_weight (array-like of shape (n_samples,), optional) – Sample weights, by default None
- Returns:
plt.figure – the figure corresponding to the feature selection
- Raises:
ValueError – if task is not ‘regression’ or ‘classification’
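For readers unfamiliar with permutation importance, the idea can be sketched without any dependency: shuffle one column at a time and measure how much the error of an already fitted model increases. The snippet below is a simplified illustration with a hypothetical `predict` function and toy data. It does not reproduce what sklearn_pimp_bench does internally.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE after shuffling each column of X in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with column j replaced by its shuffled version.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical fitted model: only the first feature matters.
def predict(row):
    return 3.0 * row[0]


X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(predict, X, y)
```

Shuffling the second column leaves the predictions untouched, so its importance is zero, while shuffling the first column degrades the fit.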
arfs.gbm module#
GBM Wrapper
This module offers a class to train base LightGBM models, with early stopping as the default behavior. The target variable can be finite discrete (classification) or continuous (regression). Additionally, the model allows boosting from an initial score and accepts sample weights as input.
This module is part of the ‘arfs’ package and relies on ‘arfs.utils’.
Module Structure:#
GradientBoosting: main class to train a lightGBM with early stopping
Dependencies:#
Requires ‘arfs.utils’ for ‘create_dtype_dict’.
- class arfs.gbm.GradientBoosting(cat_feat='auto', params=None, stratified=False, show_learning_curve=True, verbose_eval=50, return_valid_features=False)[source]#
Bases: object
Performs the training of a base LightGBM using early stopping.
Works for regression and classification objectives supported by LightGBM. Uses a fixed 20% validation split for early stopping (stratified if specified). Allows boosting from an initial score and using sample weights.
- Parameters:
cat_feat (List[str], 'auto', or None, default 'auto') – List of categorical feature names. If ‘auto’, uses arfs.utils.create_dtype_dict to identify columns with dtypes ‘object’, ‘category’, ‘bool’, ‘datetime’, ‘timedelta’, ‘datetimetz’, and any unrecognized types as categorical for LightGBM. If None, no features are treated as categorical by LightGBM. Note: For LightGBM, integer-encoded features often perform well even when not explicitly marked as categorical.
params (dict, optional) – LightGBM parameters. Must include ‘objective’. If None, uses default RMSE objective with 10,000 boosting rounds (subject to early stopping).
stratified (bool, default False) – Whether to use StratifiedShuffleSplit for the validation set. Ensures class proportions are maintained in classification tasks.
show_learning_curve (bool, default True) – If True, generates and stores the learning curve plot.
verbose_eval (int, default 50) – Period (in boosting rounds) for printing training/validation metrics. Set to 0 or False to disable logging during training.
return_valid_features (bool, default False) – If True, stores the validation features (X_val) used for early stopping.
- Variables:
model (lgb.Booster or None) – The trained LightGBM Booster object.
cat_feat (Union[List[str], None]) – Categorical features used (after potential ‘auto’ detection).
model_params (Dict[str, Any] or None) – Parameters of the trained LightGBM model.
params (Dict[str, Any] or None) – Original parameters passed during initialization.
learning_curve (plt.Figure or None) – Matplotlib figure object of the learning curve, if generated.
is_init_score (bool) – True if the model was trained with an initial score.
stratified (bool) – Whether stratified splitting was used.
show_learning_curve (bool) – Whether the learning curve was requested.
verbose_eval (int) – Verbosity level used during training.
return_valid_features (bool) – Whether validation features were stored.
valid_features (pd.DataFrame or None) – Validation features (X_val), if return_valid_features was True.
Example
>>> # Example Usage (assuming X_tr, y_tr, X_tt exist)
>>> gbm_trainer = GradientBoosting(
...     cat_feat='auto',  # Automatically detect categorical/object/bool/time cols
...     stratified=False,
...     params={'objective': 'regression_l1', 'metric': 'mae', 'n_estimators': 500}
... )
>>> # Train the model (assuming sample_weight 'exp_tr' exists if needed)
>>> # gbm_trainer.fit(X=X_tr, y=y_tr, sample_weight=exp_tr)
>>> gbm_trainer.fit(X=X_tr, y=y_tr)  # Without sample weight
>>>
>>> # Predict on test data
>>> y_pred = gbm_trainer.predict(X_tt)
>>>
>>> # Save the model
>>> # gbm_trainer.save(save_path='./models/', name="my_regression_model")
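The class description mentions a fixed 20% validation split used for early stopping. The mechanism can be pictured with the dependency-free sketch below. The helper names, the `patience` value and the list of losses are hypothetical; the actual class delegates the stopping logic to LightGBM.

```python
import random


def holdout_split(n_samples, valid_fraction=0.2, seed=0):
    """Shuffle indices and hold out a fixed fraction for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = int(round(n_samples * valid_fraction))
    return indices[n_valid:], indices[:n_valid]


def best_round(valid_losses, patience=3):
    """Return the 1-based round with the lowest validation loss, stopping
    once the loss has not improved for `patience` consecutive rounds."""
    best_loss, best_idx = float("inf"), 0
    for idx, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_idx = loss, idx
        elif idx - best_idx >= patience:
            break
    return best_idx


train_idx, valid_idx = holdout_split(100)

# Hypothetical per-round validation losses from a boosting run.
losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.62, 0.50]
stopped_at = best_round(losses, patience=3)
```

Note that the later improvement at round 8 is never seen: training stops at round 7 and round 4 is kept, which is the usual trade-off of patience-based stopping.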
- __repr__()[source]#
Provides a string representation of the GradientBoosting object.
- Return type:
str
- fit(X, y, sample_weight=None, init_score=None, groups=None)[source]#
Fits the LightGBM model using early stopping.
- Parameters:
X (pd.DataFrame or np.ndarray) – Predictor matrix (features).
y (pd.Series or np.ndarray) – Target variable.
sample_weight (pd.Series or np.ndarray, optional) – Sample weights. Must have the same length as y.
init_score (pd.Series or np.ndarray, optional) – Initial scores to boost from. Must have the same length as y.
groups (pd.Series or np.ndarray, optional) – Group labels for GroupShuffleSplit. Ensures samples from the same group are not in both train and validation sets.
- Return type:
None
- load(model_path)[source]#
Loads a previously saved LightGBM model.
Overwrites the current model and model_params attributes.
- Parameters:
model_path (str) – Path to the saved model file (.joblib).
- Raises:
FileNotFoundError – If the model file does not exist at the specified path.
- Return type:
None
- predict(X, predict_proba=False)[source]#
Predicts target values or probabilities for new data.
- Parameters:
X (pd.DataFrame or np.ndarray) – Predictor matrix for which to make predictions.
predict_proba (bool, default False) – If True and the objective is classification, returns class probabilities. Otherwise, returns predicted values (regression) or class labels (classification).
- Return type:
ndarray
- Returns:
np.ndarray – Predicted values or probabilities.
- Raises:
AttributeError – If the model was trained with init_score (use predict_raw).
ValueError – If the model has not been trained yet.
- predict_raw(X, **kwargs)[source]#
Provides direct access to the underlying LightGBM predict method.
Useful for obtaining raw scores, leaf indices, etc., especially when init_score was used during training.
- Parameters:
X (pd.DataFrame or np.ndarray) – Predictor matrix.
**kwargs (dict, optional) – Additional keyword arguments passed directly to lgb.Booster.predict(). Examples: raw_score=True, pred_leaf=True. See LightGBM docs.
- Return type:
ndarray
- Returns:
np.ndarray – The raw prediction output from the LightGBM model.
- Raises:
ValueError – If the model has not been trained yet.
- save(save_path=None, name=None)[source]#
Saves the trained model and learning curve (if generated).
Model is saved using joblib. Learning curve is saved as a PNG image.
- Parameters:
save_path (str, optional) – Directory path to save the files. If None, saves in the current working directory. The directory will be created if it doesn’t exist.
name (str, optional) – Base name for the saved files (without extension). If None, defaults to ‘gbm_base_model_{objective}_{date}’.
- Return type:
str
- Returns:
str – The full path to the saved model file (.joblib).
- Raises:
ValueError – If the model has not been trained yet.
TypeError – If the learning curve exists but is not a matplotlib Figure.
- arfs.gbm._fit_early_stopped_lgb(X, y, sample_weight, groups, init_score, params, cat_feat, stratified, show_learning_curve, verbose_eval, return_valid_features)[source]#
Internal function to train LightGBM with early stopping.
- Return type:
Union[Booster,Tuple[Booster,DataFrame],Tuple[Booster,Figure],Tuple[Booster,DataFrame,Figure]]
arfs.parallel module#
Parallelize Pandas
This module provides utilities for parallelizing operations on pd.DataFrame
Module Structure:#
- parallel_matrix_entries for parallelizing operations returning a matrix (2D) (apply on pairs of columns)
- parallel_df for parallelizing operations returning a series (1D) (apply on a single column at a time)
- arfs.parallel._compute_matrix_entries(X, comb_list, sample_weight=None, func_xyw=None)[source]#
base closure for computing matrix entries applying a function to each chunk of combination of columns of the dataframe, distributed by cores. This is similar to https://github.com/smazzanti/mrmr/mrmr/pandas.py
- Parameters:
X (pd.DataFrame, of shape (n_samples, n_features)) – The set of regressors that will be tested sequentially
sample_weight (pd.Series or np.array, of shape (n_samples,), optional) – The weight vector, if any, by default None
func_xyw (callable, optional) – callable (function) for computing the individual elements of the matrix; takes two mandatory inputs (x and y) and an optional input w, the sample weights
comb_list (list of 2-tuple of str) – Pairs of column names corresponding to the entries
- Returns:
List[pd.DataFrame]– a list of partial dfs to be concatenated
- arfs.parallel._compute_series(X, y, sample_weight=None, func_xyw=None)[source]#
_compute_series is a utility function for computing the series that results from applying func_xyw to each column of X
- Parameters:
X (pd.DataFrame, of shape (n_samples, n_features)) – The set of regressors that will be tested sequentially
y (pd.Series or np.array, of shape (n_samples,)) – The target vector
sample_weight (pd.Series or np.array, of shape (n_samples,), optional) – The weight vector, if any, by default None
func_xyw (callable, optional) – callable (function) for computing the individual elements of the series; takes two mandatory inputs (x and y) and an optional input w, the sample weights
- arfs.parallel.parallel_df(func, df, series, sample_weight=None, n_jobs=-1)[source]#
parallel_df applies a function to each column of the dataframe, distributed by cores. This is similar to https://github.com/smazzanti/mrmr/mrmr/pandas.py
- Parameters:
func (callable) – function to be applied to each column
df (pd.DataFrame) – the dataframe on which to apply the function
series (pd.Series) – series (target) used by the function
sample_weight (pd.Series or np.array, optional) – The weight vector, if any, of shape (n_samples,), by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default -1
- Returns:
pd.DataFrame– concatenated results into a single pandas DF
- arfs.parallel.parallel_matrix_entries(func, df, comb_list, sample_weight=None, n_jobs=-1)[source]#
parallel_matrix_entries applies a function to each chunk of combination of columns of the dataframe, distributed by cores. This is similar to https://github.com/smazzanti/mrmr/mrmr/pandas.py
- Parameters:
func (callable) – function to be applied to each pair of columns in comb_list
df (pd.DataFrame) – the dataframe on which to apply the function
comb_list (list of tuples of str) – Pairs of column names corresponding to the entries
sample_weight (pd.Series or np.array, optional) – The weight vector, if any, of shape (n_samples,), by default None
n_jobs (int, optional) – the number of cores to use for the computation, by default -1
- Returns:
pd.DataFrame– concatenated results into a single pandas DF
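The pattern described here (split the list of column pairs into chunks, compute each chunk on its own worker, then concatenate the partial results) can be sketched with the standard library alone. The snippet is an illustration under simplifying assumptions: columns are plain lists in a dict, workers are threads from concurrent.futures, and all helper names are made up for the example. It does not show how arfs schedules the work.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def chunk(items, n_chunks):
    """Split a list into at most n_chunks contiguous, near-equal parts."""
    size, extra = divmod(len(items), n_chunks)
    parts, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:
            parts.append(items[start:end])
        start = end
    return parts


def compute_entries(columns, pairs, func):
    """Apply func to each pair of columns; return long-format rows."""
    return [(a, b, func(columns[a], columns[b])) for a, b in pairs]


def parallel_entries(func, columns, pairs, n_jobs=2):
    """Distribute the pairs over n_jobs workers and concatenate the results."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        partial = pool.map(
            lambda part: compute_entries(columns, part, func),
            chunk(pairs, n_jobs),
        )
        return [row for rows in partial for row in rows]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


columns = {"a": [1, 2, 3], "b": [0, 1, 0], "c": [2, 2, 2]}
pairs = list(combinations(columns, 2))  # unique unordered pairs only
rows = parallel_entries(dot, columns, pairs, n_jobs=2)
```

Each row is a (column, column, value) triple, which corresponds to the tidy (long) layout mentioned for the association functions.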
arfs.preprocessing module#
This module provides preprocessing classes
Module Structure:#
- OrdinalEncoderPandas: main class for ordinal encoding, takes in a DF and returns a DF of the same shape
- dtype_column_selector: for standardizing selection of columns based on their dtypes
- TreeDiscretizer: class for discretizing continuous columns and auto-grouping levels of categorical columns
- IntervalToMidpoint: class for converting pandas numerical intervals into their float midpoint
- PatsyTransformer: class for encoding data for (generalized) linear models, leveraging Patsy
- class arfs.preprocessing.IntervalToMidpoint(cols='all')[source]#
Bases: BaseEstimator, TransformerMixin
IntervalToMidpoint is a transformer that converts numerical intervals in a pandas DataFrame to their midpoints.
- Parameters:
cols (list of str or str, default "all") – The column(s) to transform. If “all”, all columns with numerical intervals will be transformed.
- Variables:
cols (list of str or str) – The column(s) to transform.
float_interval_cols (list of str) – The columns with numerical interval data types in the input DataFrame.
columns_to_transform (list of str) – The columns to be transformed based on the specified cols attribute.
- fit(X=None, y=None)[source]#
Fit the transformer on the input data.
- Parameters:
X (Optional[DataFrame]) – The input data to fit the transformer on.
y (Optional[Series]) – Ignored parameter.
- Returns:
self (IntervalToMidpoint) – The fitted transformer object.
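Conceptually, the transformation replaces each interval by the average of its two bounds. The sketch below illustrates this with a hypothetical `Interval` tuple standing in for a pandas interval. Passing non-interval values through unchanged is an assumption of this example, not documented behavior.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for a numerical interval (left, right]."""
    left: float
    right: float


def to_midpoint(value):
    """Return the midpoint of an interval; pass other values through."""
    if isinstance(value, Interval):
        return (value.left + value.right) / 2.0
    return value


column = [Interval(0.0, 10.0), Interval(10.0, 25.0), None]
midpoints = [to_midpoint(v) for v in column]
```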
- class arfs.preprocessing.OrdinalEncoderPandas(dtype_include=['category', 'object', 'bool'], dtype_exclude=[<class 'numpy.number'>], pattern=None, exclude_cols=None, output_dtype=<class 'numpy.float64'>, handle_unknown='use_encoded_value', unknown_value=nan, encoded_missing_value=nan, return_pandas_categorical=False)[source]#
Bases: OrdinalEncoder
Encodes categorical features as an integer array and returns a pandas DF. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature. Read more in the scikit-learn OrdinalEncoder documentation.
- Parameters:
pattern (str, default None) – Names of columns containing this regex pattern will be included. If None, columns are not selected based on pattern.
dtype_include (column dtype or list of column dtypes, default ['category', 'object', 'bool']) – A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes.
dtype_exclude (column dtype or list of column dtypes, default [np.number]) – A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes.
exclude_cols (list of str, optional) – columns to not encode
output_dtype (number type, default np.float64) – Desired dtype of output.
handle_unknown ({'error', 'use_encoded_value'}, default 'use_encoded_value') – When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform, an unknown category will be denoted as None.
unknown_value (int or np.nan, default np.nan) – When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.
encoded_missing_value (int or np.nan, default np.nan) – Encoded value of missing categories. If set to np.nan, then the dtype parameter must be a float dtype.
return_pandas_categorical (bool, default False) – return encoded columns as pandas category dtype or as float
- Variables:
categories (list of arrays) – The categories of each feature determined during fit (in order of the features in X and corresponding with the output of transform). This does not include categories that weren’t seen during fit.
feature_names_in (ndarray of shape (n_features_in_,)) – Names of features seen during fit. Defined only when X has feature names that are all strings.
Examples
Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to an ordinal encoding.
>>> ord_enc = OrdinalEncoderPandas(exclude_cols=["PARENT1", "SEX"])
>>> X_enc = ord_enc.fit_transform(X)
>>> X_original = ord_enc.inverse_transform(X_enc)
- fit(X, y=None)[source]#
Fit the OrdinalEncoder to X.
- Parameters:
X (pd.DataFrame, of shape (n_samples, n_features)) – The data to determine the categories of each feature.
y (Ignored) – This parameter exists only for compatibility with Pipeline.
- Returns:
self – Fitted encoder.
- fit_transform(X, y=None, sample_weight=None, **fit_params)[source]#
Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new (ndarray array of shape (n_samples, n_features_new)) – Transformed array.
- inverse_transform(X)[source]#
Convert the data back to the original representation. When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse. For a given input feature, if there is an infrequent category, ‘infrequent_sklearn’ will be used to represent the infrequent category.
- Parameters:
X (pd.DataFrame of shape (n_samples, n_encoded_features)) – The transformed data.
- Returns:
X_tr (pd.DataFrame of shape (n_samples, n_features)) – Inverse transformed array.
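To make the roles of unknown_value and encoded_missing_value concrete, the following pure-Python sketch encodes one column and inverts it again. It only mimics the documented behavior on a toy list: the helper names are hypothetical, and sorting the categories to assign codes is an assumption of this example.

```python
import math


def fit_categories(column):
    """Sorted unique non-missing categories seen during fit."""
    return sorted({v for v in column if v is not None})


def encode(column, categories, unknown_value=math.nan, encoded_missing_value=math.nan):
    """Map categories to 0..n_categories-1; unknown and missing get sentinels."""
    index = {cat: float(i) for i, cat in enumerate(categories)}
    out = []
    for v in column:
        if v is None:
            out.append(encoded_missing_value)
        else:
            out.append(index.get(v, unknown_value))
    return out


def decode(codes, categories):
    """Inverse mapping; NaN codes (unknown or missing) come back as None."""
    return [None if math.isnan(c) else categories[int(c)] for c in codes]


train = ["low", "high", "mid", "low"]
categories = fit_categories(train)
codes = encode(["mid", "unseen", None], categories)
restored = decode(codes, categories)
```

Because NaN is used for both sentinels here, the output has to be a float type, which matches the remark that a float dtype is required when np.nan is used.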
- set_transform_request(*, sample_weight: bool | None | str = '$UNCHANGED$') OrdinalEncoderPandas#
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in transform.
- Returns:
self (object) – The updated object.
- class arfs.preprocessing.PatsyTransformer(formula=None, add_intercept=True, eval_env=0, NA_action='drop', return_type='dataframe')[source]#
Bases: BaseEstimator, TransformerMixin
Transformer using patsy-formulas.
PatsyTransformer transforms a pandas DataFrame (or dict-like) according to the formula and produces a numpy array or a pandas DataFrame, depending on return_type.
- Parameters:
formula (string or formula-like) – Patsy formula used to transform the data.
add_intercept (boolean, default False) – Whether to add an intercept. By default scikit-learn has built-in intercepts for all models, so we don’t add an intercept to the data, even if one is specified in the formula.
eval_env (environment or int, default 0) – Environment in which to evaluate the formula. Defaults to the scope in which PatsyModel was instantiated.
NA_action (string or NAAction, default "drop") – What to do with rows that contain missing values. You can "drop" them, "raise" an error, or for customization, pass an NAAction object. See patsy.NAAction for details on what values count as ‘missing’ (and how to alter this).
- Variables:
feature_names (list of string) – Column names / keys of training data.
return_type (string, default "dataframe") – Data type that the transform method returns: "dataframe" (the default) for a pandas DataFrame, for example to feed scikit-learn transformers that take a dataframe as input, or "ndarray" for a numpy array.
Note
PatsyTransformer does by default not add an intercept, even if you specified it in the formula. You need to set add_intercept=True.
As scikit-learn transformers can not output y, the formula should not contain a left hand side. If you need to transform both features and targets, use PatsyModel.
- fit(data, y=None)[source]#
Fit the scikit-learn model using the formula.
- Parameters:
data (
dict-like (pandas dataframe)) – Input data. Column names need to match variables in formula.
- fit_transform(data, y=None)[source]#
Fit the scikit-learn model using the formula and transform it.
- Parameters:
data (
dict-like (pandas dataframe)) – Input data. Column names need to match variables in formula.- Returns:
X_transform (
ndarray) – Transformed data
- set_fit_request(*, data: bool | None | str = '$UNCHANGED$') PatsyTransformer#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in fit.
- Returns:
self (object) – The updated object.
- set_transform_request(*, data: bool | None | str = '$UNCHANGED$') PatsyTransformer#
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in transform.
- Returns:
self (object) – The updated object.
- class arfs.preprocessing.TreeDiscretizer(bin_features='all', n_bins=10, n_bins_max=None, num_bins_as_category=False, boost_params=None, raw=False, task='regression')[source]#
Bases: BaseEstimator, TransformerMixin
Discretize continuous and/or categorical data using univariate regularized trees, returning a pandas DataFrame. The TreeDiscretizer is designed to support regression and binary classification tasks. Discretization, also known as quantization or binning, allows for the partitioning of continuous features into discrete values. In certain datasets with continuous attributes, discretization can be beneficial as it transforms the dataset into one with only nominal attributes. Additionally, for categorical predictors, grouping levels can help reduce overfitting and create meaningful clusters.
By encoding discretized features, a model can become more expressive while maintaining interpretability. For example, preprocessing with a discretizer can introduce nonlinearity to linear models. For more advanced possibilities, particularly smooth ones, you can refer to the section on generating polynomial features. The TreeDiscretizer function utilizes univariate regularized trees, with one tree per column to be binned. It finds the optimal partition and returns numerical intervals for numerical continuous columns and pd.Categorical for categorical columns. This approach groups similar levels together, reducing dimensionality and regularizing the model.
TreeDiscretizer handles missing values for both numerical and categorical predictors, eliminating the need for encoding categorical predictors separately.
Notes
This is a substitute for proper regularization schemes such as:
- GroupLasso: Categorical predictors, which are usually encoded as multiple dummy variables, are considered together rather than separately.
- FusedLasso: Takes into account the ordering of the features.
- Parameters:
bin_features (list of str or None) – The list of names of the variables to be binned, or “all”, “numerical” or “categorical” for splitting and grouping all, only numerical or only categorical columns.
n_bins (int) – The number of bins to create when binning the variables in the “bin_features” list.
n_bins_max (int, optional) – The maximum number of levels that a categorical column can have to avoid being binned.
num_bins_as_category (bool, default False) – Save the numeric bins as pandas category or as pandas interval.
boost_params (dict) – The boosting parameters dictionary.
raw (bool) – Returns raw levels (non-human-interpretable) or levels matching the original ones.
task (str) – Either regression or classification (binary).
- Variables:
tree_dic (dict) – The dictionary keys are binned column names and items are the univariate trees.
bin_upper_bound_dic (dict) – The upper bound of the numerical intervals.
cat_bin_dict (dict) – The mapping dictionary for the categorical columns.
tree_imputer (dict) – The missing values are split by the tree and lead to similar splits and are mapped to this value.
ordinal_encoder_dic (dict) – Dictionary with the fitted encoder, if any.
cat_features (list) – Names of the found categorical columns.
- fit_transform(X)#
Fit and apply the transformer object on data.
Example
>>> from arfs.preprocessing import TreeDiscretizer
>>> lgb_params = {'min_split_gain': 5}
>>> disc = TreeDiscretizer(bin_features='all', n_bins=10, boost_params=lgb_params)
>>> disc.fit(X=df[predictors], y=df['Frequency'], sample_weight=df['Exposure'])
- fit(X, y, sample_weight=None)[source]#
Fit the TreeDiscretizer on the input data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The predictor dataframe.
y (array-like of shape (n_samples,)) – The target vector.
sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.
- Returns:
self (object) – Returns self.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') TreeDiscretizer#
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- transform(X)[source]#
Apply the discretizer on X. Only the columns with more than n_bins_max unique values will be transformed.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data with shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.
- Returns:
X (pd.DataFrame) – DataFrame with the binned and grouped columns.
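The idea of one univariate tree per column can be illustrated with a minimal sketch. It is not the TreeDiscretizer implementation: it swaps in scikit-learn's DecisionTreeRegressor for the boosted trees configured through boost_params, and the helper name tree_bin_edges and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def tree_bin_edges(x, y, n_bins=10, sample_weight=None):
    """Fit one shallow tree on a single column and return its sorted split points.

    Hypothetical helper: a plain decision tree stands in for the regularized
    boosted tree used by the library.
    """
    tree = DecisionTreeRegressor(
        max_leaf_nodes=n_bins, min_samples_leaf=5, random_state=0
    )
    tree.fit(np.asarray(x, dtype=float).reshape(-1, 1), y, sample_weight=sample_weight)
    # Internal nodes carry a feature index >= 0; leaves carry a negative index.
    thresholds = tree.tree_.threshold[tree.tree_.feature >= 0]
    return np.sort(thresholds)


# Synthetic column whose relationship with the target is a three-level step function.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = np.where(x < 3, 1.0, np.where(x < 7, 5.0, 2.0)) + rng.normal(0, 0.1, size=500)

edges = tree_bin_edges(x, y, n_bins=3)
# Open-ended outer bins so that unseen values still fall into an interval.
bins = np.concatenate(([-np.inf], edges, [np.inf]))
binned = pd.cut(pd.Series(x), bins=bins)
print(binned.value_counts().sort_index())
```

The split points follow the target rather than the marginal distribution of the column, which is what distinguishes supervised tree binning from quantile or equal-width binning.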
- arfs.preprocessing._drop_intercept(formula, add_intercept)[source]#
Drop the intercept from formula if not add_intercept
- arfs.preprocessing.cat_var(data, col_excl=None, return_cat=True)[source]#
Ad hoc categorical encoding (as integer). Automatically detect the non-numerical columns, save the index and name of those columns, encode them as integer, save the direct and inverse mappers as dictionaries. Return the data-set with the encoded columns with a data type either int or pandas categorical.
- Parameters:
data (pd.DataFrame) – the dataset
col_excl (list of str, default None) – the list of column names not being encoded (e.g. the ID column)
return_cat (bool, default True) – return encoded object columns as pandas categoricals or not.
- Returns:
df (pd.DataFrame) – the dataframe with encoded columns
cat_var_df (pd.DataFrame) – the dataframe with the indices and names of the categorical columns
inv_mapper (dict) – the dictionary to map integer –> category
mapper (dict) – the dictionary to map category –> integer
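The direct and inverse mappers can be pictured with a small pandas sketch. This is an illustration of the encoding idea under simplifying assumptions, not the cat_var code: the function name encode_non_numeric is hypothetical, it returns only three of the four documented outputs, and it relies on pd.factorize, which assigns codes in order of first appearance and encodes missing values as -1.

```python
import pandas as pd


def encode_non_numeric(data, col_excl=None):
    """Integer-encode the non-numerical columns and keep both mapping directions."""
    col_excl = col_excl or []
    df = data.copy()
    mapper, inv_mapper = {}, {}
    non_num_cols = [
        c for c in df.select_dtypes(exclude="number").columns if c not in col_excl
    ]
    for col in non_num_cols:
        codes, uniques = pd.factorize(df[col])
        df[col] = codes
        mapper[col] = {cat: i for i, cat in enumerate(uniques)}      # category -> integer
        inv_mapper[col] = {i: cat for i, cat in enumerate(uniques)}  # integer -> category
    return df, mapper, inv_mapper


raw = pd.DataFrame(
    {"id": ["a1", "a2", "a3"], "city": ["Paris", "London", "Paris"], "rating": [5, 3, 4]}
)
encoded, mapper, inv_mapper = encode_non_numeric(raw, col_excl=["id"])
print(encoded)
print(mapper)
```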
- class arfs.preprocessing.dtype_column_selector(pattern=None, *, dtype_include=None, dtype_exclude=None, exclude_cols=None)[source]#
Bases: object
Create a callable to select columns to be used with ColumnTransformer.
dtype_column_selector() can select columns based on datatype or the column names with a regex. When using multiple selection criteria, all criteria must match for a column to be selected.
- Parameters:
pattern (str, default None) – Names of columns containing this regex pattern will be included. If None, columns will not be selected based on pattern.
dtype_include (column dtype or list of column dtypes, default None) – A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes().
dtype_exclude (column dtype or list of column dtypes, default None) – A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes().
exclude_cols (list of column names, default None) – A selection of columns to exclude.
- Returns:
selector (callable) – Callable for column selection to be used by a ColumnTransformer.
See also
ColumnTransformer: Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.
Examples
>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.compose import make_column_transformer
>>> from arfs.preprocessing import dtype_column_selector
>>> import numpy as np
>>> import pandas as pd
>>> X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
...                   'rating': [5, 3, 4, 5]})
>>> ct = make_column_transformer(
...       (StandardScaler(),
...        dtype_column_selector(dtype_include=np.number)),  # rating
...       (OneHotEncoder(),
...        dtype_column_selector(dtype_include=object)))  # city
>>> ct.fit_transform(X)
array([[ 0.90453403,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  1.        ,  0.        ],
       [ 0.90453403,  0.        ,  0.        ,  1.        ]])
- arfs.preprocessing.find_interval_midpoint(interval_series)[source]#
Find the midpoint (or left/right bound if the interval contains Inf).
- Parameters:
interval_series (pd.Series) – series of pandas intervals.
- Return type:
ndarray
- Returns:
np.ndarray – Array of midpoints or bounds of the intervals.
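The midpoint rule with the fallback for open-ended intervals can be sketched as follows. This is a minimal illustration of the behaviour described above, not the library code, and the function name interval_midpoints is hypothetical.

```python
import numpy as np
import pandas as pd


def interval_midpoints(interval_series):
    """Return the midpoint of each interval, or the finite bound if one side is infinite."""
    left = np.array([iv.left for iv in interval_series], dtype=float)
    right = np.array([iv.right for iv in interval_series], dtype=float)
    midpoint = (left + right) / 2.0
    # (-inf, b] falls back to b and (a, inf] falls back to a.
    return np.where(np.isinf(left), right, np.where(np.isinf(right), left, midpoint))


intervals = pd.Series(
    [pd.Interval(-np.inf, 1.0), pd.Interval(1.0, 3.0), pd.Interval(3.0, np.inf)]
)
mids = interval_midpoints(intervals)
print(mids)
```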
- arfs.preprocessing.highlight_discarded(s)[source]#
Highlight X in red and V in green.
- Parameters:
s (np.arrays) –
- Returns:
list
- arfs.preprocessing.transform_interval_to_midpoint(X, cols='all')[source]#
Transforms interval columns in a pandas DataFrame to their midpoint values.
Notes
Equivalent function to IntervalToMidpoint without the estimator API.
- Parameters:
X (pd.DataFrame) – The input DataFrame containing the data to be transformed.
cols (list of str or str) – The columns to be transformed. Defaults to “all” which transforms all columns.
- Return type:
DataFrame
- Returns:
pd.DataFrame – The transformed DataFrame with interval columns replaced by their midpoint values.
- Raises:
TypeError – If the input data is not a pandas DataFrame.
arfs.sampling module#
This module provides methods for sampling large datasets in order to reduce the running time.
- arfs.sampling._gower_distance_row(xi_cat, xi_num, xj_cat, xj_num, feature_weight_cat, feature_weight_num, feature_weight_sum, ranges_of_numeric)[source]#
Compute a row of the Gower matrix
- Parameters:
xi_cat (np.array) – categorical row of the X matrix
xi_num (np.array) – numerical row of the X matrix
xj_cat (np.array) – categorical row of the X matrix
xj_num (np.array) – numerical row of the X matrix
feature_weight_cat (np.array) – weight vector for the categorical features
feature_weight_num (np.array) – weight vector for the numerical features
feature_weight_sum (float) – The sum of the weights
ranges_of_numeric (np.array) – range of the scaled numerical features (between 0 and 1)
- Returns:
np.array (array) – a row vector of the Gower distance
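For intuition, the sketch below computes one row of a Gower distance matrix from the same kinds of inputs as listed above: a 0/1 mismatch for categorical features, a range-scaled absolute difference for numerical ones, combined as a weighted average. It follows the textbook definition and is not guaranteed to match the library's internals; the function name gower_row and the toy data are assumptions.

```python
import numpy as np


def gower_row(xi_cat, xi_num, xj_cat, xj_num, w_cat, w_num, w_sum, ranges):
    """Gower distance between one reference row (xi) and every row of xj."""
    # Categorical part: 1 where the categories differ, 0 where they match.
    cat_mismatch = (xi_cat != xj_cat).astype(float)
    sum_cat = cat_mismatch @ w_cat
    # Numerical part: absolute difference scaled by the feature range.
    abs_delta = np.abs(xi_num - xj_num)
    scaled = np.divide(
        abs_delta, ranges, out=np.zeros_like(abs_delta), where=ranges != 0
    )
    sum_num = scaled @ w_num
    return (sum_cat + sum_num) / w_sum


xi_cat = np.array(["a", "x"])
xi_num = np.array([1.0, 10.0])
xj_cat = np.array([["a", "x"], ["b", "x"]])
xj_num = np.array([[1.0, 10.0], [3.0, 10.0]])
w_cat = np.ones(2)
w_num = np.ones(2)
ranges = np.array([4.0, 0.0])  # a constant feature has a zero range

row = gower_row(xi_cat, xi_num, xj_cat, xj_num, w_cat, w_num, w_cat.sum() + w_num.sum(), ranges)
print(row)
```

In the second row one of two categorical features differs (contribution 1) and one numerical feature differs by half its range (contribution 0.5), so the distance is 1.5 / 4.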
- arfs.sampling.get_5_percent_splits(length)[source]#
Splits the dataframe into 5% intervals.
- Parameters:
length (int) – array length
- Returns:
array – vector of sizes
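One plausible reading of "5% intervals" is a vector of cumulative sample sizes growing in steps of 5% of the array length. The sketch below implements that reading; the function name five_percent_sizes and the guard for very short arrays are assumptions, not the library's code.

```python
import numpy as np


def five_percent_sizes(length):
    """Candidate sample sizes: 5%, 10%, 15%, ... of `length` (full length excluded)."""
    step = max(1, round(0.05 * length))  # guard against a zero step for tiny inputs
    return np.arange(step, length, step)


sizes = five_percent_sizes(100)
print(sizes)
```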
- arfs.sampling.gower_matrix(data_x, data_y=None, weight=None, cat_features='auto')[source]#
Computes the gower distances between X and Y
Gower is a similarity measure for categorical, boolean and numerical mixed data.
- Parameters:
data_x (np.array or pd.DataFrame) – The data for computing the Gower distance
data_y (np.array or pd.DataFrame or pd.Series, optional) – The reference matrix or vector to compare with, optional
weight (np.array or pd.Series, optional) – sample weight, optional
cat_features (list of str or bool or int, optional) – auto-detect cat features or a list of cat features, by default ‘auto’
- Returns:
np.array – The Gower distance matrix, shape (n_samples, n_samples)
Notes
The non-numeric features, and numeric feature ranges are determined from X and not Y.
- Raises:
TypeError – If two dataframes are passed but have different number of columns
TypeError – If two arrays are passed but have different number of columns
TypeError – Sparse matrices are not supported
TypeError – if a list of categorical columns is passed, it should be a list of strings or integers or boolean values
- arfs.sampling.gower_topn(data_x, data_y=None, weight=None, cat_features='auto', n=5, key=None)[source]#
Get the n most similar elements
- Parameters:
data_x (np.array or pd.DataFrame) – The data for the lookup
data_y (np.array or pd.DataFrame or pd.Series, optional) – elements for which to return the most similar elements, should be a single row
weight (np.array or pd.Series, optional) – sample weight, by default None
cat_features (list of str or bool or int, optional) – auto detection of cat features or a list of strings, booleans or integers, by default ‘auto’
n (int, optional) – the number of neighbors/similar rows to find, by default 5
key (str, optional) – identifier key. If several rows refer to the same id, this column will be used for finding the nearest neighbors with a different id, by default None
- Returns:
dict – the dictionary of indices and values of the closest elements
- Raises:
TypeError – if the reference element is not a single row
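Once the distances between the reference row and every candidate row are known, selecting the n closest elements reduces to a partial sort. The sketch below assumes the distances are already available as a 1-D array; the function name top_n_closest and the dictionary keys "index" and "values" are illustrative assumptions.

```python
import numpy as np


def top_n_closest(distances, n=5):
    """Return the positions and distances of the n smallest entries."""
    distances = np.asarray(distances, dtype=float)
    n = min(n, distances.size)
    order = np.argsort(distances, kind="stable")[:n]
    return {"index": order, "values": distances[order]}


closest = top_n_closest([0.4, 0.0, 0.9, 0.1], n=2)
print(closest)
```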
- arfs.sampling.isof_find_sample(X, sample_weight=None)[source]#
Finds a sample by comparing the distribution of the anomaly scores of the sample with that of the original data using the KS-test. Starts with a 5% sample and increases it to 10%, then 15%, etc. if a sufficiently similar sample cannot be found.
References
Sampling method taken from boruta_shap, author: https://github.com/Ekeany
- Parameters:
X (pd.DataFrame) – the predictors matrix
sample_weight (pd.Series or np.array, optional) – the sample weights, if any, by default None
- Returns:
array – the indices for reducing the shadow predictors matrix
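The search loop described above can be sketched with SciPy's two-sample KS test. The sketch takes precomputed anomaly scores as input instead of fitting an isolation forest, and the function name find_representative_sample, the p-value threshold of 0.95 and the fixed seed are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp


def find_representative_sample(scores, p_threshold=0.95, max_iter=20, seed=0):
    """Grow a random sample in 5% steps until its score distribution matches the full one."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    idx = np.arange(n)
    for k in range(1, max_iter + 1):
        size = min(n, max(1, round(0.05 * k * n)))
        idx = rng.choice(n, size=size, replace=False)
        # A high p-value means the KS test cannot tell the sample from the full data.
        p_value = ks_2samp(scores, scores[idx]).pvalue
        if p_value > p_threshold or size == n:
            break
    return idx


all_scores = np.random.default_rng(42).normal(size=1000)
sample_idx = find_representative_sample(all_scores)
print(sample_idx.size)
```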
- arfs.sampling.isolation_forest(X, sample_weight=None)[source]#
Fits an isolation forest to the dataset and gives an anomaly score to every sample.
- Parameters:
X (pd.DataFrame or np.array) – the predictors matrix
sample_weight (pd.Series or np.array, optional) – the sample weights, if any, by default None
- arfs.sampling.sample(df, n=1000, sample_weight=None, method='gower')[source]#
Sampling rows from a dataframe when random sampling is not enough for reducing the number of rows. The strategies are either using hierarchical clustering based on the Gower distance or using isolation forest for identifying the most similar elements. For the clustering algorithm, clusters are determined using the Gower distance (mixed type data) and the dataset is shrunk from n_samples to n_clusters.
For the isolation forest algorithm, samples are added until a sufficient two-sample KS statistic is reached or until the number of iterations reaches the maximum (20).
- Parameters:
df (pd.DataFrame) – the dataframe to sample, with or without the target
n (int, optional) – the number of clusters if method is "gower", by default 1000
sample_weight (pd.Series or np.array, optional) – sample weights, by default None
method (str, optional) – the strategy to use for sampling the rows. Either "gower" or "isoforest", by default ‘gower’
- Returns:
pd.DataFrame – the sampled dataframe
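The clustering strategy, shrinking n_samples rows to n_clusters rows from a precomputed distance matrix, can be sketched with SciPy's hierarchical clustering. This is an illustration rather than the library's procedure: the function name shrink_by_clustering, the average linkage and the use of the cluster mean as representative are assumptions, and the toy distance matrix is a plain absolute difference rather than a Gower distance.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def shrink_by_clustering(df, distance_matrix, n_clusters):
    """Cluster rows hierarchically and keep one averaged row per cluster."""
    condensed = squareform(distance_matrix, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return df.groupby(labels).mean()


df = pd.DataFrame({"x": [0.0, 0.1, 0.2, 10.0, 10.1, 20.0]})
values = df["x"].to_numpy()
dist = np.abs(values[:, None] - values[None, :])

small = shrink_by_clustering(df, dist, n_clusters=3)
print(small)
```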
arfs.utils module#
Utility and validation functions
- arfs.utils.LightForestClassifier(n_feat, n_estimators=10)[source]#
lightGBM implementation of the Random Forest classifier with the ideal number of features, according to Elements of statistical learning
- Parameters:
n_feat (int) – the number of predictors (nbr of columns of the X matrix)
n_estimators (int, optional) – the number of trees/estimators, by default 10
- Returns:
lightgbm classifier – sklearn random forest estimator based on lightgbm
- arfs.utils.LightForestRegressor(n_feat, n_estimators=10)[source]#
lightGBM implementation of the Random Forest regressor with the ideal number of features, according to Elements of statistical learning
- Parameters:
n_feat (int) – the number of predictors (nbr of columns of the X matrix)
n_estimators (int, optional) – the number of trees/estimators, by default 10
- Returns:
lightgbm regressor – sklearn random forest estimator based on lightgbm
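The "ideal number of features" refers to the random forest defaults recommended in The Elements of Statistical Learning: floor(sqrt(p)) candidate features per split for classification and floor(p/3) for regression. The sketch below turns those counts into a column-sampling fraction; how the library passes this value to lightGBM is not shown here, and the function name default_feature_fraction is hypothetical.

```python
import math


def default_feature_fraction(n_feat, task):
    """Fraction of features to sample per split, following the ESL defaults."""
    if task == "classification":
        m = max(1, math.floor(math.sqrt(n_feat)))
    elif task == "regression":
        m = max(1, n_feat // 3)
    else:
        raise ValueError("task must be 'classification' or 'regression'")
    return m / n_feat


print(default_feature_fraction(16, "classification"))
print(default_feature_fraction(9, "regression"))
```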
- arfs.utils._get_cancer_data()[source]#
Load breast cancer data and add dummies (random predictors) and a genuine one, for benchmarking purpose. Classification (binary).
- Returns:
object – Bunch sklearn, extension of dictionary
- arfs.utils._get_titanic_data()[source]#
Load Titanic data and add dummies (random predictors, numeric and categorical) and a genuine one, for benchmarking purpose. Classification (binary)
- Returns:
object – Bunch sklearn, extension of dictionary
- arfs.utils._load_boston_data()[source]#
Load Boston data and add dummies (random predictors, numeric and categorical) and a genuine one, for benchmarking purpose. Regression (positive domain).
- Returns:
object – Bunch sklearn, extension of dictionary
- arfs.utils._load_housing(as_frame=False)[source]#
Load the California housing data. See here https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html for the downloadable version.
- Parameters:
as_frame (bool) – whether to return a pandas dataframe. If not, a “Bunch” (enhanced dictionary) is returned (default False)
- Returns:
pd.DataFrame or Bunch – the dataset
- arfs.utils._make_corr_dataset_classification(size=1000)[source]#
Generate an artificial dataset for classification tasks. Some columns are correlated, have no variance, large cardinality, numerical and categorical.
- Parameters:
size (int, optional) – The number of rows to generate. Default is 1000.
- Returns:
tuple – A tuple containing the predictors matrix, the target, and the weights.
- arfs.utils._make_corr_dataset_regression(size=1000)[source]#
Generate an artificial dataset for regression tasks with columns that are correlated, have no variance, large cardinality, numerical and categorical.
- Parameters:
size (int, optional) – number of rows to generate, by default 1000
- Returns:
pd.DataFrame, pd.Series, pd.Series – the predictors matrix, the target and the weights
- arfs.utils.check_if_tree_based(model)[source]#
Check if the estimator is tree based.
- Parameters:
model (object) – the estimator to check
- Returns:
condition (boolean) – if tree based or not
- arfs.utils.concat_or_group(col, x, max_length=25)[source]#
Concatenate unique values from a column or return a group value.
- Parameters:
col (str) – The name of the column to process.
x (pd.DataFrame) – The DataFrame containing the data.
max_length (int, optional) – The maximum length for concatenated strings, beyond which grouping is performed, by default 25.
- Returns:
str – A concatenated string of unique values if the length is less than max_length, otherwise, a unique group value from the specified column.
Notes
If the concatenated string length is greater than or equal to max_length, this function returns the unique group value from the column with a “_g” suffix.
Examples
>>> import pandas as pd
>>> from arfs.utils import concat_or_group
>>> data = {
...     'Category_g': [1, 1, 2, 2, 3],
...     'Category': ['AAAAAAAAAAAAAAA', 'Bovoh', 'Ccccccccccccccc', 'D', 'E']}
>>> X = pd.DataFrame(data)
>>> cat_bin_dict = {}
>>> col = 'Category'
>>> cat_bin_dict[col] = (
...     X[[f"{col}_g", col]]
...     .groupby(f"{col}_g")
...     .apply(lambda x: concat_or_group(col, x))
...     .to_dict()
... )
>>> print(cat_bin_dict)
{'Category': {1: 'gr_1', 2: 'gr_2', 3: 'E'}}
- arfs.utils.create_dtype_dict(df, dic_keys='col_names')[source]#
Create a custom dictionary of data type for adding suffixes to column names in the plotting utility for association matrix.
- Parameters:
df (pd.DataFrame) – The dataframe used for computing the association matrix.
dic_keys (str) – Either “col_names” or “dtypes” for returning either a dictionary with column names or dtypes as keys.
- Return type:
dict
- Returns:
dict – A dictionary with either column names or dtypes as keys.
- Raises:
ValueError – If dic_keys is not either “col_names” or “dtypes”.
- arfs.utils.get_pandas_cat_codes(X)[source]#
Converts categorical and time features in a pandas DataFrame into numerical codes.
- Parameters:
X (pandas DataFrame) – The input DataFrame containing categorical and/or time features.
- Returns:
X (pandas DataFrame) – The modified input DataFrame with categorical and time features replaced by numerical codes.
obj_feat (list or None) – List of column names that were converted to numerical codes. Returns None if no categorical or time features found.
cat_idx (list or None) – List of column indices for the columns in obj_feat. Returns None if no categorical or time features found.
- arfs.utils.is_catboost(estimator)[source]#
Check if the estimator is catboost.
- Parameters:
estimator (object) – the estimator to check
- Returns:
condition (boolean) – if catboost based or not
- arfs.utils.is_lightgbm(estimator)[source]#
Check if the estimator is lightgbm.
- Parameters:
estimator (object) – the estimator to check
- Returns:
condition (boolean) – if lgbm based or not
- arfs.utils.is_list_of_bool(bool_list)[source]#
Check if bool_list is a list of Booleans.
- Parameters:
bool_list (list of bool) – the list we want to check
- Returns:
bool – True if list of Booleans, else False
- arfs.utils.is_list_of_int(int_list)[source]#
Check if int_list is a list of integers.
- Parameters:
int_list (list of int) – the list we want to check
- Returns:
bool – True if list of integers, else False
- arfs.utils.is_list_of_str(str_list)[source]#
Check if str_list is a list of strings.
- Parameters:
str_list (list or None) – The list to check.
- Returns:
bool – True if the list is a list of strings, False otherwise.
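The three list checks share one pattern, sketched below with a single generic helper. The helper name is_list_of and the choice to treat None and empty lists as False are assumptions, not the library's documented behaviour. Note that in Python bool is a subclass of int, so a list of Booleans also passes an integer check written this way.

```python
def is_list_of(values, expected_type):
    """True if `values` is a non-empty list whose items are all of `expected_type`."""
    return (
        isinstance(values, list)
        and len(values) > 0
        and all(isinstance(v, expected_type) for v in values)
    )


print(is_list_of(["a", "b"], str))   # every item is a string
print(is_list_of(["a", 1], str))     # mixed types
print(is_list_of(None, str))         # not a list at all
```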
- arfs.utils.is_xgboost(estimator)[source]#
Check if the estimator is xgboost.
- Parameters:
estimator (object) – the estimator to check
- Returns:
condition (boolean) – if xgboost based or not
- arfs.utils.load_data(name='Titanic')[source]#
Load some toy data set to test the All Relevant Feature Selection methods. Dummies (random) predictors are added and ARFS should be able to filter them out. The Titanic predictors are encoded (needed for scikit estimators).
Titanic and cancer are for binary classification, they contain synthetic random (dummies) predictors and a noisy but genuine synthetic predictor. Hopefully, a good All Relevant FS should be able to detect all the predictors genuinely related to the target.
Boston is for regression, this data set contains
- Parameters:
name (str, optional) – the name of the data set. Titanic is for classification with sample_weight, Boston for regression and cancer for classification (without sample weight), by default ‘Titanic’
- Returns:
Bunch – extension of dictionary, accessible by key
- Raises:
ValueError – if the dataset name is invalid
- arfs.utils.plot_y_vs_X(X, y, ncols=2, figsize=(10, 10))[source]#
Plot target vs relevant and non-relevant predictors
- Parameters:
X (pd.DataFrame) – The DataFrame of the predictors.
y (np.array) – The target.
ncols (int, optional) – The number of columns in the facet plot. Default is 2.
figsize (tuple, optional) – The figure size. Default is (10, 10).
- Returns:
plt.figure – The univariate plots y vs pred_i.
- arfs.utils.set_my_plt_style(height=3, width=5, linewidth=2)[source]#
Set the matplotlib style to fivethirtyeight with some modifications (colours, axes).
- Parameters:
linewidth (int, default 2) – line width
height (int, default 3) – fig height in inches (yeah they’re still struggling with the metric system)
width (int, default 5) – fig width in inches (yeah they’re still struggling with the metric system)
- arfs.utils.validate_pandas_input(arg)[source]#
Validate if pandas or numpy arrays are provided.
- Parameters:
arg (pd.DataFrame or np.array) – the object to validate
- Raises:
TypeError – error if pandas or numpy arrays are not provided
- arfs.utils.validate_sample_weight(sample_weight)[source]#
Validate the sample_weight parameter.
- Parameters:
sample_weight (array-like or None) – Input sample weights.
- Returns:
np.ndarray or None – If sample_weight is a Pandas Series, its values are returned as a numpy array. If sample_weight is already a numpy array, it is returned unmodified. If sample_weight is None, None is returned.
- Raises:
ValueError – If sample_weight is not an array-like object or None.
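The three documented cases (Series, array, None) and the error case map directly onto a chain of type checks. The sketch below mirrors the description above rather than the library source; the function name validate_weights is hypothetical, and rejecting plain Python lists is an assumption about what "array-like" covers.

```python
import numpy as np
import pandas as pd


def validate_weights(sample_weight):
    """Normalise sample weights to a numpy array, passing None through."""
    if sample_weight is None:
        return None
    if isinstance(sample_weight, pd.Series):
        return sample_weight.to_numpy()
    if isinstance(sample_weight, np.ndarray):
        return sample_weight  # returned unmodified
    raise ValueError("sample_weight must be a pandas Series, a numpy array or None")


weights = validate_weights(pd.Series([1.0, 2.0, 0.5]))
print(type(weights), weights)
```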
Module contents#
init module, providing information about the arfs package