arfs package#

Subpackages#

Submodules#

arfs.association module#

Parallelized Association and Correlation matrix

This module provides parallelized methods for computing associations, namely: correlation, correlation ratio, Theil’s U, and Cramer’s V.

These measures are the basis of the MRmr feature selection.

arfs.association._callable_association_matrix_fn(assoc_fn, X, sample_weight=None, n_jobs=1, kind='nom-nom', cols_comb=None)[source]#

_callable_association_matrix_fn is a private utility for computing the association matrix from a custom association callable.

Parameters:
  • assoc_fn (callable) – a function which receives two pd.Series (and optionally a weight array) and returns a single number

  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • kind (str) – kind of association, either ‘num-num’ or ‘nom-nom’ or ‘nom-num’

  • cols_comb (list of 2-tuples of str, optional) – the pairs of column names (2-tuples of strings) for which to compute the association

Returns:

pd.DataFrame – the association matrix

arfs.association._callable_association_series_fn(assoc_fn, X, target, sample_weight=None, n_jobs=1, kind='nom-nom')[source]#

_callable_association_series_fn is a private utility for computing an association series from a custom association callable.

Parameters:
  • assoc_fn (callable) – a function which receives two pd.Series (and optionally a weight array) and returns a single number

  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • target (str or int) – the predictor name or index with which to compute association

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • kind (str) – kind of association, either ‘num-num’ or ‘nom-nom’ or ‘nom-num’

Returns:

pd.Series – the association series

Raises:

ValueError – if kind is not ‘num-num’ or ‘nom-nom’ or ‘nom-num’

arfs.association._check_association_input(X, sample_weight=None, handle_na='drop')[source]#

_check_association_input is a private function. It checks the inputs, converts X to a pd.DataFrame if needed, and adds column names if none are provided. It also checks that sample_weight is either None or of the right dimensionality, and handles NA values according to the chosen method (drop, fill, None).

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

tuple – the dataframe and the sample weights

Raises:

ValueError – if sample_weight contains NA

arfs.association._weighted_correlation_ratio(*args)[source]#

Calculates the Correlation Ratio (sometimes marked by the Greek letter Eta) for categorical-continuous association. It answers the question: given a continuous value of a measurement, is it possible to know which category it is associated with? The value is in the range [0, 1], where 0 means a category cannot be determined from a continuous measurement, and 1 means a category can be determined with absolute certainty.

Based on the scikit-learn implementation of the unweighted version.

Returns:

float – value of the correlation ratio

arfs.association.annotate_heatmap(im, data=None, valfmt='{x:.2f}', textcolors=('black', 'white'), threshold=None, **textkw)[source]#

annotate_heatmap annotates a heatmap

Parameters:
  • im (matplotlib.image.AxesImage) – The AxesImage to be labeled

  • data (array-like of shape (M, N), optional) – the data to annotate; if none is provided, the array is retrieved from the matplotlib object, by default None

  • valfmt (str, optional) – annotation formatting, by default “{x:.2f}”

  • textcolors (tuple, optional) – A pair of colors. The first is used for values below a threshold, the second for those above, by default (“black”, “white”)

  • threshold (float, optional) – Value in data units according to which the colors from textcolors are applied. If None (the default) uses the middle of the colormap as separation, by default None

  • textkw (dict, optional) – All other arguments are forwarded to mpl annotation.

Returns:

list of matplotlib.text.Text – the text annotations

arfs.association.association_matrix(X, sample_weight=None, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=<function weighted_corr>, nom_num_assoc=<function correlation_ratio>, n_jobs=1, handle_na='drop')[source]#

Computes the association matrix for continuous-continuous, categorical-continuous, and categorical-categorical predictors using specified callable functions.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Predictor dataframe.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

  • nom_nom_assoc (callable) – Function to compute the categorical-categorical association.

  • num_num_assoc (callable) – Function to compute the numerical-numerical association.

  • nom_num_assoc (callable) – Function to compute the categorical-numerical association.

  • n_jobs (int, optional) – The number of cores to use for the computation, by default 1.

  • handle_na (str, optional) – How to handle NA values (‘drop’, ‘fill’, or None), by default “drop”.

Returns:

pd.DataFrame – The association matrix.
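
Example

A minimal usage sketch; the toy mixed-dtype DataFrame below is illustrative only and not part of the package:

>>> import pandas as pd
>>> from arfs.association import association_matrix
>>> df = pd.DataFrame(
>>>     {"num_1": [1.0, 2.0, 3.0, 4.0],
>>>      "num_2": [2.0, 1.0, 4.0, 3.0],
>>>      "cat_1": pd.Categorical(["a", "a", "b", "b"])})
>>> assoc = association_matrix(df, n_jobs=1)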

arfs.association.association_series(X, target, features=None, sample_weight=None, nom_nom_assoc=<function weighted_theils_u>, num_num_assoc=functools.partial(<function weighted_corr>, method='spearman'), nom_num_assoc=<function correlation_ratio>, normalize=False, n_jobs=1, handle_na='drop')[source]#

Computes the association series for different types of predictors.

This function calculates the association between the specified target and other predictors in X. It supports different types of associations: nominal-nominal, numerical-numerical, and nominal-numerical.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Predictor dataframe.

  • target (str or int) – The predictor name or index with which to compute the association.

  • features (list of str, optional) – List of features with which to compute the association. If None, all features in X are used.

  • sample_weight (array-like, shape (n_samples,), optional) – The weight vector, by default None.

  • nom_nom_assoc (callable) – Function to compute the nominal-nominal (categorical-categorical) association. It should take two pd.Series and an optional weight array, and return a single number.

  • num_num_assoc (callable) – Function to compute the numerical-numerical association. It should take two pd.Series and return a single number.

  • nom_num_assoc (callable) – Function to compute the nominal-numerical association. It should take two pd.Series and return a single number.

  • normalize (bool, optional) – Whether to normalize the scores or not. If True, scores are normalized to the range [0, 1].

  • n_jobs (int, optional) – The number of cores to use for the computation, by default 1. Passing -1 uses all available cores.

  • handle_na (str, optional) – How to handle NA values. Options are ‘drop’, ‘fill’, and None. The default, ‘drop’, drops rows with NA values.

Returns:

pd.Series – A series with all the association values with the target column, sorted in descending order.

Raises:

TypeError – If features is provided but is not a list of strings.

Examples

>>> import pandas as pd
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = pd.DataFrame(iris.data, columns=iris.feature_names)
>>> association_series(X, 'sepal length (cm)', num_num_assoc=my_num_num_function)

Notes

The function dynamically selects the appropriate association method based on the data types of the target and other predictors. For numerical-numerical associations, it uses num_num_assoc; for nominal-nominal, nom_nom_assoc; and for nominal-numerical, nom_num_assoc.

arfs.association.asymmetric_function(func)[source]#
arfs.association.cluster_sq_matrix(sq_matrix, method='ward')[source]#

Apply agglomerative clustering to sort a square correlation matrix.

Parameters:
  • sq_matrix (pd.DataFrame) – A square correlation matrix.

  • method (str, optional) – The linkage method, by default “ward”.

Returns:

pd.DataFrame – A sorted square matrix.

Example

>>> from arfs.association import association_matrix, cluster_sq_matrix
>>> assoc = association_matrix(iris_df)
>>> assoc_clustered = cluster_sq_matrix(assoc, method="complete")
arfs.association.correlation_ratio(x, y, sample_weight=None, as_frame=False)[source]#

Compute the weighted correlation ratio, i.e. the association between a categorical predictor (x) and a continuous predictor (y). The computation can be weighted.

Parameters:
  • x (pd.Series of shape (n_samples,)) – The categorical predictor vector

  • y (pd.Series of shape (n_samples,)) – The continuous predictor

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • as_frame (bool) – return output as a dataframe or a float

Returns:

float – value of the correlation ratio
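
Example

A minimal sketch, assuming a categorical series x and a continuous series y (toy data, illustrative only):

>>> import pandas as pd
>>> from arfs.association import correlation_ratio
>>> x = pd.Series(["a", "a", "b", "b"], name="cat")
>>> y = pd.Series([1.0, 1.2, 3.1, 2.9], name="num")
>>> correlation_ratio(x, y)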

arfs.association.correlation_ratio_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop')[source]#

correlation_ratio_matrix computes the weighted Correlation Ratio for categorical-numerical association. This is a symmetric coefficient: CR(x,y) = CR(y,x)

The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

pd.DataFrame – The correlation ratio matrix (lower triangular) in a tidy (long) format.

arfs.association.correlation_ratio_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop')[source]#

correlation_ratio_series computes the weighted correlation ratio for categorical-numerical association. This is a symmetric coefficient: CR(x,y) = CR(y,x)

The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format, a series.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • target (str or int) – the predictor name or index with which to compute association

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

pd.Series – The Correlation ratio series (lower triangular) in a tidy (long) format.

arfs.association.cramer_v(x, y, sample_weight=None, as_frame=False)[source]#

Computes the weighted Cramer’s V statistic of two categorical predictors.

Parameters:
  • x (pd.Series of shape (n_samples,)) – The first categorical predictor.

  • y (pd.Series of shape (n_samples,)) – The second categorical predictor, order doesn’t matter, symmetrical association.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

  • as_frame (bool) – Return output as a DataFrame or a float.

Returns:

pd.DataFrame or float – Single row DataFrame with the predictor names and the statistic value, or the statistic as a float.
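
Example

A minimal sketch with two toy categorical series (illustrative only):

>>> import pandas as pd
>>> from arfs.association import cramer_v
>>> x = pd.Series(["a", "a", "b", "b"], name="x")
>>> y = pd.Series(["u", "u", "v", "v"], name="y")
>>> cramer_v(x, y, as_frame=False)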

arfs.association.cramer_v_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop')[source]#

cramer_v_matrix computes the weighted Cramer’s V statistic for categorical-categorical association. This is a symmetric coefficient: V(x,y) = V(y,x)

It uses the corrected Cramer’s V statistic, which is itself based on the chi2 contingency table. The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

pd.DataFrame – The Cramer’s V matrix (lower triangular) in a tidy (long) format.

arfs.association.cramer_v_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop')[source]#

cramer_v_series computes the weighted Cramer’s V statistic for categorical-categorical association. This is a symmetric coefficient: V(x,y) = V(y,x)

It uses the corrected Cramer’s V statistic, which is itself based on the chi2 contingency table. The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • target (str or int) – the predictor name or index with which to compute association

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

pd.Series – The Cramer’s V series

arfs.association.create_col_combinations(func, selected_cols)[source]#

Create column combinations or permutations based on the symmetry of the function.

This function checks if func is symmetric. If it is, it creates combinations of selected_cols; otherwise, it creates permutations.

Parameters:
  • func (callable) – The function to check for symmetry. Should be decorated with @symmetric_function.

  • selected_cols (list) – The columns to be combined or permuted.

Returns:

list of tuples – A list of tuples representing column combinations or permutations. If func is symmetric, combinations of selected_cols are returned; otherwise, permutations are returned.
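
Example

A minimal sketch; it assumes weighted_corr is flagged as symmetric by the module, so combinations (rather than permutations) of the column names would be returned:

>>> from arfs.association import create_col_combinations, weighted_corr
>>> create_col_combinations(weighted_corr, ["a", "b", "c"])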

arfs.association.f_cat_classification_parallel(X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na='drop')[source]#

Univariate information dependence.

It ranks features in the same order if all the features are positively correlated with the target. Note that it is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The predictor dataframe.

  • y (array-like of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

  • n_jobs (int, optional) – The number of cores to use for the computation, by default -1.

  • handle_na (str, optional) – Either drop rows with NaN, fill NaN with 0, or do nothing, by default “drop”.

  • force_finite (bool, optional) –

    Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:

    • when the target y or some features in X are constant. In this case, the Pearson’s R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.

    • when a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.

Returns:

f_statistic (array-like of shape (n_features,)) – F-statistic for each feature.

arfs.association.f_cat_regression(x, y, sample_weight=None, as_frame=False)[source]#

f_cat_regression computes the weighted ANOVA F-value for the provided sample (continuous target, categorical predictor).

Parameters:
  • x (pd.Series of shape (n_samples,)) – The categorical predictor vector

  • y (pd.Series of shape (n_samples,)) – The continuous target vector

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • as_frame (bool) – return output as a dataframe or a float

Returns:

float – value of the F-statistic

arfs.association.f_cat_regression_parallel(X, y, sample_weight=None, n_jobs=1, handle_na='drop')[source]#

f_cat_regression_parallel computes the weighted ANOVA F-value for the provided categorical predictors using parallelization of the code (continuous target, categorical predictor).

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • y (array-like of shape (n_samples,)) – The target vector

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

pd.Series – the value of the F-statistic for each predictor
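
Example

A minimal sketch with toy categorical predictors and a continuous target (illustrative only):

>>> import pandas as pd
>>> from arfs.association import f_cat_regression_parallel
>>> X = pd.DataFrame({"c1": ["a", "b", "a", "b"], "c2": ["x", "x", "y", "y"]})
>>> y = pd.Series([1.0, 2.0, 1.5, 2.5])
>>> f_cat_regression_parallel(X, y, n_jobs=1)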

arfs.association.f_cont_classification(x, y, sample_weight=None, as_frame=False)[source]#

f_cont_classification computes the weighted ANOVA F-value for the provided sample (categorical target, continuous predictor).

Parameters:
  • x (pd.Series of shape (n_samples,)) – The continuous predictor vector

  • y (pd.Series of shape (n_samples,)) – The categorical target vector

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • as_frame (bool) – return output as a dataframe or a float

Returns:

float – value of the F-statistic

arfs.association.f_cont_classification_parallel(X, y, sample_weight=None, n_jobs=-1, handle_na='drop')[source]#

f_cont_classification_parallel computes the weighted ANOVA F-value for the provided continuous predictors using parallelization (categorical target, continuous predictors).

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The set of regressors that will be tested sequentially

  • y (array-like of shape (n_samples,)) – The target vector

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default -1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

pd.Series – the value of the F-statistic for each predictor

arfs.association.f_cont_regression_parallel(X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na='drop')[source]#

Univariate linear regression tests returning F-statistic.

Quick linear model for testing the effect of a single regressor, sequentially for many regressors. This is done in two steps:

  1. The cross-correlation between each regressor and the target is computed as E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y)).

  2. It is converted to an F score, which ranks features in the same order if all the features are positively correlated with the target.

Note that it is therefore recommended as a feature selection criterion to identify potentially predictive features for a downstream classifier, irrespective of the sign of the association with the target variable.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The predictor dataframe.

  • y (array-like of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

  • n_jobs (int, optional) – The number of cores to use for the computation, by default -1.

  • handle_na (str, optional) – Either drop rows with NaN, fill NaN with 0, or do nothing, by default “drop”.

  • force_finite (bool, optional) –

    Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:

    • when the target y or some features in X are constant. In this case, the Pearson’s R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.

    • when a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.

Returns:

f_statistic (array-like of shape (n_features,)) – F-statistic for each feature.

arfs.association.f_oneway_weighted(*args)[source]#

Calculate the weighted F-statistic for one-way ANOVA (continuous target, categorical predictor).

Parameters:
  • X (array-like of shape (n_samples,)) – The categorical predictor vector.

  • y (array-like of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

Returns:

float – The value of the F-statistic.

Notes

The F-statistic is calculated as:

\[F = \frac{\sum_i (\bar{Y}_{i \bullet} - \bar{Y})^2 / (K-1)}{\sum_i \sum_j (Y_{ij} - \bar{Y}_{i\bullet})^2 / (N - K)}\]
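
Example

A minimal sketch; since the signature is *args, the positional arguments are assumed here to be the categorical predictor, the continuous target and, optionally, the sample weights, matching the parameter list above (toy data, illustrative only):

>>> import numpy as np
>>> from arfs.association import f_oneway_weighted
>>> x = np.array([0, 0, 1, 1, 2, 2])
>>> y = np.array([1.0, 1.1, 2.0, 2.1, 3.0, 3.2])
>>> w = np.ones_like(y)
>>> f_oneway_weighted(x, y, w)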
arfs.association.f_stat_classification_parallel(X, y, sample_weight=None, n_jobs=1, force_finite=True, handle_na='drop')[source]#

Compute the weighted ANOVA F-value for the provided categorical and numerical predictors using parallelization.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The predictor dataframe.

  • y (array-like of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

  • n_jobs (int, optional) – The number of cores to use for the computation, by default 1.

  • handle_na (str, optional) – Either drop rows with NA, fill NA with 0, or do nothing, by default “drop”.

  • force_finite (bool, optional) –

    Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:

    • When the target y or some features in X are constant. In this case, the Pearson’s R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.

    • When a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.

Returns:

pd.Series – The value of the F-statistic for each predictor.

arfs.association.f_stat_regression_parallel(X, y, sample_weight=None, n_jobs=-1, force_finite=True, handle_na='drop')[source]#

Compute the weighted explained variance for the provided categorical and numerical predictors using parallelization.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The predictor dataframe.

  • y (array-like of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

  • n_jobs (int, optional) – The number of cores to use for the computation, by default -1.

  • handle_na (str, optional) – Either drop rows with NA, fill NA with 0, or do nothing, by default “drop”.

  • force_finite (bool, optional) –

    Whether or not to force the F-statistics and associated p-values to be finite. There are two cases where the F-statistic is expected to not be finite:

    • When the target y or some features in X are constant. In this case, the Pearson’s R correlation is not defined, leading to np.nan values in the F-statistic and p-value. When force_finite=True, the F-statistic is set to 0.0 and the associated p-value is set to 1.0.

    • When a feature in X is perfectly correlated (or anti-correlated) with the target y. In this case, the F-statistic is expected to be np.inf. When force_finite=True, the F-statistic is set to np.finfo(dtype).max.

Returns:

pd.Series – The value of the F-statistic for each predictor.

arfs.association.heatmap(data, row_labels, col_labels, ax=None, cbar_kw=None, cbarlabel='', **kwargs)[source]#

heatmap creates a heatmap from a numpy array and two lists of labels.

Parameters:
  • data (array-like of shape (M, N)) – matrix to plot

  • row_labels (array-like of shape (M,)) – labels for the rows

  • col_labels (array-like of shape (N,)) – labels for the columns

  • ax (matplotlib.axes.Axes, optional) – A matplotlib.axes.Axes instance to which the heatmap is plotted. If not provided, use current axes or create a new one, by default None

  • cbar_kw (dict, optional) – A dictionary with arguments to matplotlib.Figure.colorbar, by default None

  • cbarlabel (str, optional) – The label for the colorbar, by default “”

  • kwargs (dict, optional) – All other arguments are forwarded to imshow.

Returns:

tuple – the imshow (AxesImage) and colorbar objects
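
Example

A minimal sketch combining heatmap and annotate_heatmap on a toy 2x2 association array (illustrative only); the cmap keyword is forwarded to imshow:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from arfs.association import heatmap, annotate_heatmap
>>> data = np.array([[1.0, 0.3], [0.3, 1.0]])
>>> fig, ax = plt.subplots()
>>> im, cbar = heatmap(data, ["x1", "x2"], ["x1", "x2"], ax=ax, cmap="PuOr", cbarlabel="association")
>>> texts = annotate_heatmap(im, valfmt="{x:.2f}")
>>> plt.show()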

arfs.association.is_list_of_str(str_list)[source]#

Raise an exception if str_list is not a list of strings.

Parameters:

str_list (list) – the list to be tested

Raises:

TypeError – if str_list is not a list[str]

arfs.association.matrix_to_xy(df, columns=None, reset_index=False)[source]#

matrix_to_xy converts the association matrix from wide to long format.

Parameters:
  • df (pd.DataFrame) – the wide format of the association matrix

  • columns (list of str, optional) – list of column names, by default None

  • reset_index (bool, optional) – whether to reset the index or not, by default False

Returns:

pd.DataFrame – the long format of the association matrix

arfs.association.plot_association_matrix(assoc_mat, suffix_dic=None, ax=None, cmap='PuOr', cbarlabel=None, figsize=None, show=True, cbar_kw=None, imgshow_kw=None, annotate=False)[source]#

plot_association_matrix renders the sorted association/correlation matrix. The sorting is done using hierarchical clustering, much like in seaborn or other packages. Categorical (nom): uncertainty coefficient and correlation ratio, ranging from 0 to 1. The uncertainty coefficient is asymmetric (it approximates how much the elements on the left PROVIDE INFORMATION on the elements in the row). Continuous (con): symmetric numerical correlations (Spearman’s), ranging from -1 to 1.

Parameters:
  • assoc_mat (pd.DataFrame) – the square association frame

  • suffix_dic (Dict[str, str], optional) – dictionary of data type for adding suffixes to column names in the plotting utility for association matrix, by default None

  • ax (matplotlib.axes.Axes, optional) – the matplotlib axes on which to render the matrix, by default None

  • cmap (str, optional) – the colormap. Please use a scientific colormap. See the scicomap package, by default “PuOr”

  • cbarlabel (str, optional) – the colorbar label, by default None

  • figsize (Tuple[float, float], optional) – figure size in inches, by default None

  • show (bool, optional) – Whether or not to display the figure, by default True

  • cbar_kw (Dict, optional) – colorbar kwargs, by default None

  • imgshow_kw (Dict, optional) – imshow kwargs, by default None

  • annotate (bool) – Whether or not to annotate the heatmap

Returns:

matplotlib.figure and matplotlib.axes.Axes – the figure and the axes
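
Example

A minimal sketch; it assumes the output of association_matrix is in long (tidy) format and therefore pivots it to a square frame with xy_to_matrix before plotting (toy data, illustrative only):

>>> import pandas as pd
>>> from arfs.association import association_matrix, xy_to_matrix, plot_association_matrix
>>> df = pd.DataFrame(
>>>     {"num_1": [1.0, 2.0, 3.0, 4.0],
>>>      "num_2": [2.0, 1.0, 4.0, 3.0],
>>>      "cat_1": pd.Categorical(["a", "a", "b", "b"])})
>>> assoc = association_matrix(df, n_jobs=1)
>>> assoc_wide = xy_to_matrix(assoc)
>>> f, ax = plot_association_matrix(assoc_wide, cmap="PuOr")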

arfs.association.plot_association_matrix_int(assoc_mat, suffix_dic=None, cmap='PuOr', figsize=(800, 600), cluster_matrix=True)[source]#

Plot the interactive sorted association/correlation matrix. The sorting is done using hierarchical clustering, much like in seaborn or other packages. Categorical (nom): uncertainty coefficient and correlation ratio, ranging from 0 to 1. The uncertainty coefficient is asymmetric (it approximates how much the elements on the left PROVIDE INFORMATION on the elements in the row). Continuous (con): symmetric numerical correlations (Spearman’s), ranging from -1 to 1.

Parameters:
  • assoc_mat (pd.DataFrame) – the square association frame

  • suffix_dic (Dict[str, str], optional) – dictionary of data type for adding suffixes to column names in the plotting utility for association matrix, by default None

  • cmap (str, optional) – the colormap. Please use a scientific colormap. See the scicomap package, by default “PuOr”

  • figsize (Tuple[float, float], optional) – figure size, by default (800, 600)

  • cluster_matrix (bool) – whether or not to cluster the square matrix, by default True

Returns:

panel.Column – the panel object

arfs.association.symmetric_function(func)[source]#
arfs.association.theils_u_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop')[source]#

theils_u_matrix computes the weighted Theil’s U statistic for categorical-categorical association. This is an asymmetric coefficient: U(x,y) != U(y,x). U(x, y) is the uncertainty of x given y: the value is in the range [0, 1], where 0 means y provides no information about x, and 1 means y provides full information about x.

The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

pd.DataFrame – The Theil’s U matrix in a tidy (long) format.
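
Example

A minimal sketch with toy categorical columns (illustrative only):

>>> import pandas as pd
>>> from arfs.association import theils_u_matrix
>>> X = pd.DataFrame({"c1": ["a", "b", "a", "b"], "c2": ["x", "x", "y", "y"]})
>>> theils_u_matrix(X, n_jobs=1)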

arfs.association.theils_u_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop')[source]#

theils_u_series computes the weighted Theil’s U statistic for categorical-categorical association. This is an asymmetric coefficient: U(x,y) != U(y,x). U(x, y) is the uncertainty of x given y: the value is in the range [0, 1], where 0 means y provides no information about x, and 1 means y provides full information about x.

The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • target (str or int) – the predictor name or index with which to compute association

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

pd.Series – The Theil’s U series.

arfs.association.wcorr(x, y, w)[source]#

wcorr computes the weighted Pearson correlation coefficient

Parameters:
  • x (array-like of shape (n_samples,)) – the predictor 1 array

  • y (array-like of shape (n_samples,)) – the predictor 2 array

  • w (array-like of shape (n_samples,)) – the sample weights array

Returns:

float – weighted correlation coefficient
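
Example

A minimal sketch with toy arrays (illustrative only):

>>> import numpy as np
>>> from arfs.association import wcorr
>>> x = np.array([1.0, 2.0, 3.0, 4.0])
>>> y = np.array([1.1, 1.9, 3.2, 3.8])
>>> w = np.array([1.0, 1.0, 2.0, 2.0])
>>> wcorr(x, y, w)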

arfs.association.wcorr_matrix(X, sample_weight=None, n_jobs=1, handle_na='drop', method='spearman')[source]#

wcorr_matrix computes the weighted correlation coefficient (Pearson or Spearman) for continuous-continuous association. This is a symmetric coefficient: corr(x,y) = corr(y,x)

The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

  • method (str) – either “spearman” or “pearson”

Returns:

pd.DataFrame – The weighted correlation matrix (lower triangular) in a tidy (long) format.

arfs.association.wcorr_series(X, target, sample_weight=None, n_jobs=1, handle_na='drop', method='spearman')[source]#

wcorr_series computes the weighted correlation coefficient (Pearson or Spearman) for continuous-continuous association. This is a symmetric coefficient: corr(x,y) = corr(y,x)

The computation is embarrassingly parallel and is distributed on available cores. Moreover, the statistic is computed for the unique combinations only and returned in a tidy (long) format.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • target (str or int) – the predictor name or index with which to compute association

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default 1

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

  • method (str) – either “spearman” or “pearson”, by default “spearman”

Returns:

pd.Series – The weighted correlation series.

arfs.association.wcov(x, y, w)[source]#

wcov computes the weighted covariance

Parameters:
  • x (array-like of shape (n_samples,)) – the predictor 1 array

  • y (array-like of shape (n_samples,)) – the predictor 2 array

  • w (array-like of shape (n_samples,)) – the sample weights array

Returns:

float – weighted covariance

arfs.association.weighted_conditional_entropy(x, y, sample_weight=None)[source]#

Computes the weighted conditional entropy between two categorical predictors.

Parameters:
  • x (pd.Series of shape (n_samples,)) – The predictor vector.

  • y (pd.Series of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

Returns:

float – Weighted conditional entropy.

arfs.association.weighted_corr(x, y, sample_weight=None, as_frame=False, method='spearman')[source]#

weighted_corr computes the weighted correlation coefficient (Pearson or Spearman)

Parameters:
  • x (pd.Series of shape (n_samples,)) – The first continuous predictor

  • y (pd.Series of shape (n_samples,)) – The second continuous predictor

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • as_frame (bool) – return output as a dataframe or a float

  • method (str) – either “spearman” or “pearson”, by default “spearman”

Returns:

float or pd.DataFrame – weighted correlation coefficient

arfs.association.weighted_correlation_1cpu(X, sample_weight=None, handle_na='drop')[source]#

weighted_correlation_1cpu computes the lower triangular weighted correlation matrix on a single CPU, using plain numpy linear algebra

Parameters:
  • X (array-like of shape (n_samples, n_features)) – predictor dataframe

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None

  • handle_na (str, optional) – either drop rows with na, fill na with 0 or do nothing, by default “drop”

Returns:

pd.DataFrame – the lower triangular weighted correlation matrix in long format

arfs.association.weighted_theils_u(x, y, sample_weight=None, as_frame=False)[source]#

Computes the weighted Theil’s U statistic between two categorical predictors.

Parameters:
  • x (pd.Series of shape (n_samples,)) – The predictor vector.

  • y (pd.Series of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

  • as_frame (bool) – Return output as a dataframe or a float.

Returns:

pd.DataFrame or float – Predictor names and value of the Theil’s U statistic.
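
Example

A minimal sketch with two toy categorical series (illustrative only); since Theil's U is asymmetric, weighted_theils_u(x, y) and weighted_theils_u(y, x) generally differ:

>>> import pandas as pd
>>> from arfs.association import weighted_theils_u
>>> x = pd.Series(["a", "a", "b", "b"], name="x")
>>> y = pd.Series(["u", "v", "u", "w"], name="y")
>>> weighted_theils_u(x, y)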

arfs.association.wm(x, w)[source]#

wm computes the weighted mean

Parameters:
  • x (array-like of shape (n_samples,)) – the target array

  • w (array-like of shape (n_samples,)) – the sample weights array

Returns:

float – weighted mean

arfs.association.wrank(x, w)[source]#

wrank computes the weighted rank

Parameters:
  • x (array-like of shape (n_samples,)) – the target array

  • w (array-like of shape (n_samples,)) – the sample weights array

Returns:

float – weighted rank

arfs.association.wspearman(x, y, w)[source]#

wspearman computes the weighted Spearman correlation coefficient

Parameters:
  • x (array-like of shape (n_samples,)) – the predictor 1 array

  • y (array-like of shape (n_samples,)) – the predictor 2 array

  • w (array-like of shape (n_samples,)) – the sample weights array

Returns:

float – Spearman weighted correlation coefficient

arfs.association.xy_to_matrix(xy)[source]#

xy_to_matrix converts the association matrix from long to wide format.

Parameters:

xy (pd.DataFrame) – the long format of the association matrix, 3 columns.

Returns:

pd.DataFrame – the wide (square) format of the association matrix
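
Example

A round-trip sketch with matrix_to_xy; it assumes that reset_index=True yields the three-column long format expected by xy_to_matrix (toy data, illustrative only):

>>> import pandas as pd
>>> from arfs.association import matrix_to_xy, xy_to_matrix
>>> wide = pd.DataFrame([[1.0, 0.5], [0.5, 1.0]], index=["a", "b"], columns=["a", "b"])
>>> long = matrix_to_xy(wide, reset_index=True)
>>> xy_to_matrix(long)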

arfs.benchmark module#

Benchmark Feature Selection

This module provides utilities for comparing and benchmarking feature selection methods

Module Structure:#

  • sklearn_pimp_bench: function for comparing using the sklearn permutation importance

  • compare_varimp: function for comparing using three kinds of variable importance

  • highlight_tick: function for highlighting specific (genuine or noise for instance) predictors in the importance chart

arfs.benchmark.compare_varimp(feat_selector, models, X, y, sample_weight=None)[source]#

Utility function to compare the results for the three possible kinds of feature importance

Parameters:
  • feat_selector (object) – an instance of either Leshy, BoostaGRoota or GrootCV

  • models (list of objects) – list of tree based scikit-learn estimators

  • X (pd.DataFrame, shape (n_samples, n_features)) – the predictors frame

  • y (pd.Series) – the target (same length as X)

  • sample_weight (None or pd.Series, optional) – sample weights if any, by default None

arfs.benchmark.highlight_tick(str_match, figure, color='red', axis='y')[source]#

Highlight the x/y tick-labels if they contain a given string

Parameters:
  • str_match (str) – the substring to match

  • figure (object) – the matplotlib figure

  • color (str, optional) – the matplotlib color for highlighting tick-labels, by default ‘red’

  • axis (str, optional) – axis to use for highlighting, by default ‘y’

Returns:

plt.figure – the modified matplotlib figure

Raises:

ValueError – if axis is not ‘x’ or ‘y’
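
Example

A minimal sketch; the bar chart below is only a stand-in for an importance plot produced by the selectors (illustrative only):

>>> import matplotlib.pyplot as plt
>>> from arfs.benchmark import highlight_tick
>>> fig, ax = plt.subplots()
>>> bars = ax.barh(["genuine_1", "noise_1", "genuine_2"], [0.5, 0.1, 0.4])
>>> fig = highlight_tick(str_match="noise", figure=fig, color="gray", axis="y")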

arfs.benchmark.sklearn_pimp_bench(model, X, y, task='regression', sample_weight=None)[source]#

Benchmark using sklearn permutation importance, works for regression and classification.

Parameters:
  • model (object) – An estimator that has not been fitted, sklearn compatible.

  • X (ndarray or DataFrame, shape (n_samples, n_features)) – Data on which permutation importance will be computed.

  • y (array-like or None, shape (n_samples, ) or (n_samples, n_classes)) – Targets for supervised or None for unsupervised.

  • task (str, optional) – kind of task, either ‘regression’ or ‘classification’, by default ‘regression’

  • sample_weight (array-like of shape (n_samples,), optional) – Sample weights, by default None

Returns:

plt.figure – the figure corresponding to the feature selection

Raises:

ValueError – if task is not ‘regression’ or ‘classification’

arfs.gbm module#

GBM Wrapper

This module offers a class to train base LightGBM and CatBoost models, with early stopping as the default behavior. The target variable can be finite discrete (classification) or continuous (regression). Additionally, the model allows boosting from an initial score (also known as a baseline for CatBoost) and accepts sample weights as input.

Module Structure:#

  • GradientBoosting: main class to train a lightGBM or catboost with early stopping

class arfs.gbm.GradientBoosting(cat_feat='auto', params=None, stratified=False, show_learning_curve=True, verbose_eval=50, return_valid_features=False)[source]#

Bases: object

Performs the training of a base lightGBM/CatBoost model using early stopping. It works for any of the supported loss functions (lightGBM/CatBoost), so an instance of this class can be used for both regression and classification. For the early stopping process, 20% of the data set is held out and a fixed seed is used for reproducibility.

The resulting model can be saved at the desired location. Lastly, you can pass relevant lightGBM/CatBoost parameters and/or sample weights (exposure, etc.) if needed.

An init score for the Booster to start from can also be provided, if required (e.g. for GLM residual modelling using GBM).

Parameters:
  • cat_feat (List[str], 'auto' or None) – The list of column names of the categorical predictors. For CatBoost, it is much more efficient if those columns have the pd.Categorical dtype. For lightGBM, it is most of the time better to integer-encode the categorical columns and NOT declare them as categorical (set this parameter to None).

  • params (dict, default None) – the parameters to pass to lightGBM/CatBoost, as long as they are valid. If None, default parameters are used.

  • stratified (bool, default False) – whether to use a stratified shuffle split for the early stopping process. For classification problems, it guarantees the same class proportions in the train and validation splits.

  • show_learning_curve (bool, default True) – whether or not to show the learning curve

  • verbose_eval (int, default 50) – period for printing the train and validation results. If < 1, no output

Variables:
  • cat_feat (Union[str, List[str], None]) – The list of categorical predictors after pre-processing

  • model_params (Dict) – the dictionary of model parameters

  • learning_curve (plt.figure) – the learning curve

  • is_init_score (bool) – boosted from an initial score or not

  • stratified (bool) – whether or not a stratified shuffle split was used for the early stopping process

Example

>>> # set up the trainer
>>> save_path = "C:/Users/mtpl_bi_pp/base/"
>>> gbm_model = GradientBoosting(cat_feat='auto',
>>>                              stratified=False,
>>>                              params={
>>>                                 'objective': 'tweedie',
>>>                                 'tweedie_variance_power': 1.1
>>>                             })
>>>
>>> # train the model
>>> gbm_model.fit(X=X_tr,y=y_tr,sample_weight=exp_tr)
>>>
>>> # predict new values (test set)
>>> y_bt = gbm_model.predict(X_tt)
>>>
>>> # save the model
>>> gbm_model.save(save_path='C:/models/', name="my_fancy_model")
fit(X, y, sample_weight=None, init_score=None, groups=None)[source]#

Fit the lightGBM/CatBoost model using the python API and early stopping

Parameters:
  • X (pd.DataFrame or np.ndarray) – the predictors’ matrix

  • y (pd.Series or np.ndarray) – the target series/array

  • sample_weight (pd.Series or np.ndarray, optional) – the sample_weight series/array, if relevant. If not None, it should be of the same length as the target (default None)

  • init_score (pd.Series or np.ndarray, optional) – the initial score to boost from (series/array), if relevant. If not None, it should be of the same length as the target (default None)

  • groups (pd.Series or np.ndarray, optional) – the groups (e.g. polID) for robust cross validation. The same group will not appear in two different folds.

load(model_path)[source]#
predict(X, predict_proba=False)[source]#

Predict the new values using the fitted model.

Parameters:
  • X (pd.DataFrame or np.ndarray) – the predictors’ matrix

  • predict_proba (bool, default False) – whether to return probabilities (classification only), by default False

predict_raw(X, **kwargs)[source]#

The native predict method, if you need raw_score, etc.

Parameters:
  • X (pd.DataFrame or np.ndarray) – the predictors’ matrix

  • **kwargs (dict, optional) – optional dictionary of other parameters for the prediction. See the lightgbm and catboost documentation for details.

Raises:

Exception – “method not found” if the method specified in the init differs from “lgb” or “cat”

save(save_path=None, name=None)[source]#

Save method; saves the model as a pkl file in the specified folder as name.pkl. If the path is None, the model is saved in the current working directory. If the name is not specified, the model is saved as ‘gbm_base_model_[TIMESTAMP].pkl’.

Parameters:
  • save_path (str, optional) – folder where to save the model, as a pickle/joblib file

  • name (str, optional) – name of the model file

Returns:

str – where the pkl file is saved

arfs.gbm._fit_early_stopped_lgb(X, y, sample_weight=None, groups=None, init_score=None, params=None, cat_feat=None, stratified=False, learning_curve=True, verbose_eval=0, return_valid_features=False)[source]#

Convenience function for early stopping with lightGBM; it builds the dataset and sets the categorical features, sample weights and baseline (init_score), if any. User-defined params can be passed. It works for classification and regression.

Parameters:
  • X (pd.DataFrame or np.ndarray) – the predictors’ matrix

  • y (pd.Series or np.ndarray) – the target series/array

  • sample_weight (pd.Series or np.ndarray, optional) – the sample_weight series/array, if relevant. If not None, it should be of the same length as the target (default None)

  • groups (pd.Series or np.ndarray, optional) – the groups (e.g. polID) for robust cross validation. The same group will not appear in two different folds.

  • params (dict, optional) – the parameters to pass to lightGBM/CatBoost, as long as they are valid. If None, default parameters are passed.

  • init_score (pd.Series or np.ndarray, optional) – the initial score to boost from (series/array), if relevant. If not None, it should be of the same length as the target (default None)

  • cat_feat (str or list of strings, optional) – Categorical features. If list of int, interpreted as indices. If list of strings, interpreted as feature names (need to specify feature_name as well). If ‘auto’ and data is pandas DataFrame, pandas unordered categorical columns are used. All values in categorical features should be less than int32 max value (2147483647). Large values could be memory consuming. Consider using consecutive integers starting from zero. All negative values in categorical features will be treated as missing values. The output cannot be monotonically constrained with respect to a categorical feature (default None)

  • stratified (bool, default = False) – whether to use a stratified shuffle split for the early stopping process. For classification problems, it guarantees the same class proportions in the train and validation splits.

  • learning_curve (bool, default = False) – whether or not to show the learning curve

  • verbose_eval (int, default = 0) – period for printing the train and validation results. If < 1, no output

  • return_valid_features (bool, default = False) – Whether or not to return validation features

Returns:

  • model (object) – model object

  • fig (plt.figure) – the learning curves, matplotlib figure object

arfs.gbm._make_split(X, y, sample_weight=None, init_score=None, groups=None, stratified=False, test_size=0.2)[source]#

_make_split is a private function for splitting the dataset according to the task

Parameters:
  • X (pd.DataFrame or np.ndarray) – the predictors’ matrix

  • y (pd.Series or np.ndarray) – the target series/array

  • sample_weight (pd.Series or np.ndarray, optional) – the sample_weight series/array, if relevant. If not None, it should be of the same length as the target (default None)

  • groups (pd.Series or np.ndarray, optional) – the groups (e.g. polID) for robust cross validation. The same group will not appear in two different folds.

  • stratified (bool, default False) – whether to use a stratified shuffle split for the early stopping process. For classification problems, it guarantees the same class proportions in the train and validation splits.

  • test_size (float, default 0.2) – test set size as a fraction of the total number of rows, by default 0.2

Returns:

Tuple[Union[pd.DataFrame, pd.Series]] – the split data, target, weights and initial scores (if any)

arfs.gbm.gbm_flavour(estimator)[source]#

arfs.parallel module#

Parallelize Pandas

This module provides utilities for parallelizing operations on pd.DataFrame

Module Structure:#

  • parallel_matrix_entries for parallelizing operations returning a matrix (2D) (apply on pairs of columns)

  • parallel_df for parallelizing operations returning a series (1D) (apply on a single column at a time)

arfs.parallel._compute_matrix_entries(X, comb_list, sample_weight=None, func_xyw=None)[source]#

Base closure for computing matrix entries, applying a function to each chunk of column combinations of the dataframe, distributed over cores. This is similar to https://github.com/smazzanti/mrmr/mrmr/pandas.py

Parameters:
  • X (pd.DataFrame, of shape (n_samples, n_features)) – The set of regressors that will be tested sequentially

  • sample_weight (pd.Series or np.array, of shape (n_samples,), optional) – The weight vector, if any, by default None

  • func_xyw (callable, optional) – callable (function) for computing the individual elements of the matrix takes two mandatory inputs (x and y) and an optional input w, sample_weights

  • comb_list (list of 2-tuples of str) – Pairs of column names corresponding to the entries

Returns:

pd.DataFrame – concatenated results into a single pandas DF

arfs.parallel._compute_series(X, y, sample_weight=None, func_xyw=None)[source]#

_compute_series is a utility function for computing the series resulting from the apply

Parameters:
  • X (pd.DataFrame, of shape (n_samples, n_features)) – The set of regressors that will be tested sequentially

  • y (pd.Series or np.array, of shape (n_samples,)) – The target vector

  • sample_weight (pd.Series or np.array, of shape (n_samples,), optional) – The weight vector, if any, by default None

  • func_xyw (callable, optional) – callable (function) for computing the individual elements of the series takes two mandatory inputs (x and y) and an optional input w, sample_weights

arfs.parallel.parallel_df(func, df, series, sample_weight=None, n_jobs=-1)[source]#

parallel_df applies a function to each column of the dataframe, distributed over cores. This is similar to https://github.com/smazzanti/mrmr/mrmr/pandas.py

Parameters:
  • func (callable) – function to be applied to each column

  • df (pd.DataFrame) – the dataframe on which to apply the function

  • series (pd.Series) – series (target) used by the function

  • sample_weight (pd.Series or np.array, optional) – The weight vector, if any, of shape (n_samples,), by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default -1

Returns:

pd.DataFrame – concatenated results into a single pandas DF

arfs.parallel.parallel_matrix_entries(func, df, comb_list, sample_weight=None, n_jobs=-1)[source]#

parallel_matrix_entries applies a function to each chunk of column combinations of the dataframe, distributed over cores. This is similar to https://github.com/smazzanti/mrmr/mrmr/pandas.py

Parameters:
  • func (callable) – function to be applied to each combination (pair) of columns

  • df (pd.DataFrame) – the dataframe on which to apply the function

  • comb_list (list of tuples of str) – Pairs of column names corresponding to the entries

  • sample_weight (pd.Series or np.array, optional) – The weight vector, if any, of shape (n_samples,), by default None

  • n_jobs (int, optional) – the number of cores to use for the computation, by default -1

Returns:

pd.DataFrame – concatenated results into a single pandas DF
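
Example

A minimal sketch; it assumes func is a pairwise callable such as weighted_corr that accepts two columns and an optional weight vector, and that comb_list is built with create_col_combinations (both exposed by arfs.association); toy data, illustrative only:

>>> import pandas as pd
>>> from arfs.association import weighted_corr, create_col_combinations
>>> from arfs.parallel import parallel_matrix_entries
>>> df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 1.0, 3.0], "c": [3.0, 2.0, 1.0]})
>>> comb_list = create_col_combinations(weighted_corr, list(df.columns))
>>> parallel_matrix_entries(weighted_corr, df, comb_list, n_jobs=1)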

arfs.preprocessing module#

This module provides preprocessing classes

Module Structure:#

  • OrdinalEncoderPandas: main class for ordinal encoding, takes in a DF and returns a DF of the same shape

  • dtype_column_selector: for standardizing selection of columns based on their dtypes

  • TreeDiscretizer: class for discretizing continuous columns and auto-group levels of categorical columns

  • IntervalToMidpoint: class for converting pandas numerical intervals into their float midpoint

  • PatsyTransformer: class for encoding data for (generalized) linear models, leveraging Patsy

class arfs.preprocessing.IntervalToMidpoint(cols='all')[source]#

Bases: BaseEstimator, TransformerMixin

IntervalToMidpoint is a transformer that converts numerical intervals in a pandas DataFrame to their midpoints.

Parameters:

cols (list of str or str, default "all") – The column(s) to transform. If “all”, all columns with numerical intervals will be transformed.

Variables:
  • cols (list of str or str) – The column(s) to transform.

  • float_interval_cols (list of str) – The columns with numerical interval data types in the input DataFrame.

  • columns_to_transform (list of str) – The columns to be transformed based on the specified cols attribute.

fit(X, y=None)[source]#

Fit the transformer on the input data.

transform(X)[source]#

Transform the input data by converting numerical intervals to midpoints.

inverse_transform(X)[source]#

Inverse transform is not implemented for this transformer.

fit(X=None, y=None)[source]#

Fit the transformer on the input data.

Parameters:
  • X (Optional[DataFrame]) – The input data to fit the transformer on.

  • y (Optional[Series]) – Ignored parameter.

Returns:

self (IntervalToMidpoint) – The fitted transformer object.

inverse_transform(X)[source]#

Inverse transform is not implemented for this transformer.

Parameters:

X (pd.DataFrame) – The input data to perform inverse transform on.

Raises:

NotImplementedError – Raised since inverse transform is not implemented for this transformer.

transform(X)[source]#

Transform the input data by converting numerical intervals to midpoints.

Parameters:

X (pd.DataFrame) – The input data to transform.

Returns:

X (pd.DataFrame) – The transformed data with numerical intervals replaced by their midpoints.
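
Example

A minimal sketch; an interval-dtype column is built with pd.arrays.IntervalArray and then replaced by its midpoints (toy data, illustrative only):

>>> import pandas as pd
>>> from arfs.preprocessing import IntervalToMidpoint
>>> intervals = pd.arrays.IntervalArray.from_breaks([0.0, 3.0, 6.0, 9.0])
>>> df = pd.DataFrame({"x_binned": intervals})
>>> IntervalToMidpoint(cols="all").fit_transform(df)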

class arfs.preprocessing.OrdinalEncoderPandas(dtype_include=['category', 'object', 'bool'], dtype_exclude=[<class 'numpy.number'>], pattern=None, exclude_cols=None, output_dtype=<class 'numpy.float64'>, handle_unknown='use_encoded_value', unknown_value=nan, encoded_missing_value=nan, return_pandas_categorical=False)[source]#

Bases: OrdinalEncoder

Encode categorical features as an integer array and returns a pandas DF. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature. Read more in the scikit-learn OrdinalEncoder documentation

Parameters:
  • pattern (str, default None) – Name of columns containing this regex pattern will be included. If None, column selection will not be selected based on pattern.

  • dtype_include (column dtype or list of column dtypes, default ['category', 'object', 'bool']) – A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes.

  • dtype_exclude (column dtype or list of column dtypes, default [np.number]) – A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes.

  • exclude_cols (list of str, optional) – columns to not encode

  • output_dtype (number type, default np.float64) – Desired dtype of output.

  • handle_unknown ({'error', 'use_encoded_value'}, default 'use_encoded_value') – When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform, an unknown category will be denoted as None.

  • unknown_value (int or np.nan, default np.nan) – When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.

  • encoded_missing_value (int or np.nan, default np.nan) – Encoded value of missing categories. If set to np.nan, then the dtype parameter must be a float dtype.

  • return_pandas_categorical (bool, default False) – return encoded columns as pandas category dtype or as float

Variables:
  • categories_ (list of arrays) – The categories of each feature determined during fit (in order of the features in X and corresponding with the output of transform). This does not include categories that weren’t seen during fit.

  • feature_names_in_ (ndarray of shape (n_features_in_,)) – Names of features seen during fit. Defined only when X has feature names that are all strings.

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to an ordinal encoding.

>>> ord_enc = OrdinalEncoderPandas(exclude_cols=["PARENT1", "SEX"])
>>> X_enc = ord_enc.fit_transform(X)
>>> X_original = ord_enc.inverse_transform(X_enc)
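
A more self-contained sketch (the toy DataFrame and the "JOB" column below are hypothetical, invented only for illustration):

>>> import pandas as pd
>>> from arfs.preprocessing import OrdinalEncoderPandas
>>> X = pd.DataFrame({"PARENT1": ["yes", "no", "no"],
...                   "SEX": ["F", "M", "F"],
...                   "JOB": ["doctor", "lawyer", "doctor"]})  # hypothetical toy data
>>> ord_enc = OrdinalEncoderPandas(exclude_cols=["PARENT1", "SEX"])
>>> X_enc = ord_enc.fit_transform(X)  # only "JOB" is ordinally encoded; excluded columns pass through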

fit(X, y=None)[source]#

Fit the OrdinalEncoder to X.

Parameters:
  • X (pd.DataFrame, of shape (n_samples, n_features)) – The data to determine the categories of each feature.

  • y (Ignored) – This parameter exists only for compatibility with Pipeline.

Returns:

self – Fitted encoder.

fit_transform(X, y=None, sample_weight=None, **fit_params)[source]#

Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new (ndarray of shape (n_samples, n_features_new)) – Transformed array.

inverse_transform(X)[source]#

Convert the data back to the original representation. When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse. For a given input feature, if there is an infrequent category, ‘infrequent_sklearn’ will be used to represent the infrequent category.

Parameters:

X (pd.DataFrame of shape (n_samples, n_encoded_features)) – The transformed data.

Returns:

X_tr (pd.DataFrame of shape (n_samples, n_features)) – Inverse transformed data.

set_transform_request(*, sample_weight: bool | None | str = '$UNCHANGED$') OrdinalEncoderPandas#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in transform.

Returns:

self (object) – The updated object.

transform(X, y=None, sample_weight=None)[source]#

Transform X to ordinal codes.

Parameters:

X (pd.DataFrame of shape (n_samples, n_features)) – The data to encode.

Returns:

X_out (pd.DataFrame of shape (n_samples, n_features)) – Transformed input.

class arfs.preprocessing.PatsyTransformer(formula=None, add_intercept=True, eval_env=0, NA_action='drop', return_type='dataframe')[source]#

Bases: BaseEstimator, TransformerMixin

Transformer using patsy-formulas.

PatsyTransformer transforms a pandas DataFrame (or dict-like) according to the formula and produces a numpy array or a pandas DataFrame, depending on return_type.

Parameters:
  • formula (string or formula-like) – Patsy formula used to transform the data.

  • add_intercept (boolean, default True) – Whether to add an intercept. Scikit-learn has built-in intercepts for all models, so an explicit intercept in the transformed data is often unnecessary, even if one is specified in the formula.

  • eval_env (environment or int, default 0) – Environment in which to evaluate the formula. Defaults to the scope in which PatsyTransformer was instantiated.

  • NA_action (string or NAAction, default "drop") – What to do with rows that contain missing values. You can "drop" them, "raise" an error, or for customization, pass an NAAction object. See patsy.NAAction for details on what values count as ‘missing’ (and how to alter this).

Variables:
  • feature_names (list of string) – Column names / keys of training data.

  • return_type (string, default "dataframe") – Data type returned by the transform method: "dataframe" for a pandas DataFrame (useful when chaining scikit-learn transformers that expect a DataFrame as input) or "ndarray" for a numpy array.

Note

PatsyTransformer only adds an intercept term when add_intercept=True, regardless of whether the formula specifies one.

As scikit-learn transformers cannot output y, the formula should not contain a left-hand side. If you need to transform both features and targets, use PatsyModel.
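
A minimal sketch (the DataFrame and the formula below are illustrative, not taken from the package):

>>> import pandas as pd
>>> from arfs.preprocessing import PatsyTransformer
>>> df = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
...                    "x2": ["a", "b", "a", "b"]})  # hypothetical toy data
>>> pt = PatsyTransformer(formula="x1 + C(x2) + x1:C(x2)", return_type="dataframe")
>>> X_design = pt.fit_transform(df)  # design matrix built from the patsy formula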

fit(data, y=None)[source]#

Fit the scikit-learn model using the formula.

Parameters:

data (dict-like (pandas dataframe)) – Input data. Column names need to match variables in formula.

fit_transform(data, y=None)[source]#

Fit the scikit-learn model using the formula and transform it.

Parameters:

data (dict-like (pandas dataframe)) – Input data. Column names need to match variables in formula.

Returns:

X_transform (ndarray) – Transformed data

set_fit_request(*, data: bool | None | str = '$UNCHANGED$') PatsyTransformer#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in fit.

Returns:

self (object) – The updated object.

set_transform_request(*, data: bool | None | str = '$UNCHANGED$') PatsyTransformer#

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

data (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for data parameter in transform.

Returns:

self (object) – The updated object.

transform(data)[source]#

Transform the data using the fitted formula.

Parameters:

data (dict-like (pandas dataframe)) – Input data. Column names need to match variables in formula.

class arfs.preprocessing.TreeDiscretizer(bin_features='all', n_bins=10, n_bins_max=None, num_bins_as_category=False, boost_params=None, raw=False, task='regression')[source]#

Bases: BaseEstimator, TransformerMixin

Discretize continuous and/or categorical data using univariate regularized trees, returning a pandas DataFrame. The TreeDiscretizer is designed to support regression and binary classification tasks. Discretization, also known as quantization or binning, allows for the partitioning of continuous features into discrete values. In certain datasets with continuous attributes, discretization can be beneficial as it transforms the dataset into one with only nominal attributes. Additionally, for categorical predictors, grouping levels can help reduce overfitting and create meaningful clusters.

By encoding discretized features, a model can become more expressive while maintaining interpretability. For example, preprocessing with a discretizer can introduce nonlinearity to linear models. For more advanced possibilities, particularly smooth ones, you can refer to the section on generating polynomial features. The TreeDiscretizer function utilizes univariate regularized trees, with one tree per column to be binned. It finds the optimal partition and returns numerical intervals for numerical continuous columns and pd.Categorical for categorical columns. This approach groups similar levels together, reducing dimensionality and regularizing the model.

TreeDiscretizer handles missing values for both numerical and categorical predictors, eliminating the need for encoding categorical predictors separately.

Notes

This is a substitute for proper regularization schemes such as:

  • GroupLasso: Categorical predictors, which are usually encoded as multiple dummy variables, are considered together rather than separately.

  • FusedLasso: Takes into account the ordering of the features.

Parameters:
  • bin_features (List of string or None) – The list of names of the variables to be binned, or “all”, “numerical” or “categorical” for splitting and grouping all, only numerical, or only categorical columns.

  • n_bins (int) – The number of bins to create for the variables in the “bin_features” list.

  • n_bins_max (int, optional) – The maximum number of levels that a categorical column can have to avoid being binned.

  • num_bins_as_category (bool, default False) – Save the numeric bins as pandas category or as pandas interval.

  • boost_params (dict) – The boosting parameters dictionary.

  • raw (bool) – Whether to return raw levels (non-human-interpretable) or levels matching the original ones.

  • task (str) – Either regression or classification (binary).

Variables:
  • tree_dic (dict) – The dictionary keys are binned column names and items are the univariate trees.

  • bin_upper_bound_dic (dict) – The upper bound of the numerical intervals.

  • cat_bin_dict (dict) – The mapping dictionary for the categorical columns.

  • tree_imputer (dict) – The values used to impute missing entries; missing values follow the tree splits and are mapped to the corresponding value.

  • ordinal_encoder_dic (dict) – Dictionary with the fitted encoder, if any.

  • cat_features (list) – Names of the found categorical columns.

fit(X, y, sample_weight=None)[source]#

Fit the transformer object on data.

transform(X)[source]#

Apply the fitted transformer object on new data.

fit_transform(X)#

Fit and apply the transformer object on data.

Example

>>> lgb_params = {'min_split_gain': 5}
>>> disc = TreeDiscretizer(bin_features='all', n_bins=10, boost_params=lgb_params)
>>> disc.fit(X=df[predictors], y=df['Frequency'], sample_weight=df['Exposure'])
fit(X, y, sample_weight=None)[source]#

Fit the TreeDiscretizer on the input data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The predictor dataframe.

  • y (array-like of shape (n_samples,)) – The target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – The weight vector, by default None.

Returns:

self (object) – Returns self.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') TreeDiscretizer#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.
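
A hedged sketch of how this request is typically used with scikit-learn's metadata routing (Pipeline routing requires a sufficiently recent scikit-learn; the toy data below is illustrative only):

>>> import pandas as pd
>>> import sklearn
>>> from sklearn.pipeline import Pipeline
>>> from arfs.preprocessing import TreeDiscretizer
>>> sklearn.set_config(enable_metadata_routing=True)
>>> X = pd.DataFrame({"x": [float(i) for i in range(50)]})  # hypothetical toy data
>>> y = pd.Series([float(i) for i in range(50)])
>>> w = pd.Series([1.0] * 50)
>>> disc = TreeDiscretizer(bin_features="all", n_bins=3).set_fit_request(sample_weight=True)
>>> pipe = Pipeline([("disc", disc)])
>>> _ = pipe.fit(X, y, sample_weight=w)  # sample_weight is routed to TreeDiscretizer.fit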

transform(X)[source]#

Apply the discretizer on X. Only the columns with more than n_bins_max unique values will be transformed.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data with shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features.

Returns:

X (pd.DataFrame) – DataFrame with the binned and grouped columns.
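
An end-to-end sketch on toy data (the columns and target below are invented for illustration; exact bin boundaries depend on the fitted trees):

>>> import numpy as np
>>> import pandas as pd
>>> from arfs.preprocessing import TreeDiscretizer
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({"num": rng.normal(size=200),
...                    "cat": rng.choice(list("abcdefgh"), size=200)})  # hypothetical data
>>> y = pd.Series(2 * df["num"] + rng.normal(size=200))
>>> disc = TreeDiscretizer(bin_features="all", n_bins=4, task="regression")
>>> X_binned = disc.fit(df, y).transform(df)  # "num" becomes intervals, "cat" levels are grouped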

arfs.preprocessing._drop_intercept(formula, add_intercept)[source]#

Drop the intercept from formula if not add_intercept

arfs.preprocessing.cat_var(data, col_excl=None, return_cat=True)[source]#

Ad hoc categorical encoding (as integer). Automatically detect the non-numerical columns, save the index and name of those columns, encode them as integer, save the direct and inverse mappers as dictionaries. Return the data-set with the encoded columns with a data type either int or pandas categorical.

Parameters:
  • data (pd.DataFrame) – the dataset

  • col_excl (list of str, default None) – the list of columns names not being encoded (e.g. the ID column)

  • return_cat (bool, default True) – return encoded object columns as pandas categoricals or not.

Returns:

  • df (pd.DataFrame) – the dataframe with encoded columns

  • cat_var_df (pd.DataFrame) – the dataframe with the indices and names of the categorical columns

  • inv_mapper (dict) – the dictionary to map integer –> category

  • mapper (dict) – the dictionary to map category –> integer
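
A usage sketch (the toy DataFrame is illustrative, and the unpacking assumes the return order listed above):

>>> import pandas as pd
>>> from arfs.preprocessing import cat_var
>>> df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})  # hypothetical data
>>> df_enc, cat_var_df, inv_mapper, mapper = cat_var(df)
>>> # "color" is now integer-encoded; inv_mapper["color"] maps the codes back to labels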

class arfs.preprocessing.dtype_column_selector(pattern=None, *, dtype_include=None, dtype_exclude=None, exclude_cols=None)[source]#

Bases: object

Create a callable to select columns to be used with ColumnTransformer. dtype_column_selector() can select columns based on data type or on column names matching a regex. When using multiple selection criteria, all criteria must match for a column to be selected.

Parameters:
  • pattern (str, default None) – Columns whose names contain this regex pattern will be included. If None, column selection will not be based on a pattern.

  • dtype_include (column dtype or list of column dtypes, default None) – A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes().

  • dtype_exclude (column dtype or list of column dtypes, default None) – A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes().

  • exclude_cols (list of column names, default None) – A selection of columns to exclude

Returns:

selector (callable) – Callable for column selection to be used by a ColumnTransformer.

See also

ColumnTransformer

Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.

Examples

>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.compose import make_column_transformer
>>> from arfs.preprocessing import dtype_column_selector
>>> import numpy as np
>>> import pandas as pd  
>>> X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
...                   'rating': [5, 3, 4, 5]})  
>>> ct = make_column_transformer(
...       (StandardScaler(),
...        dtype_column_selector(dtype_include=np.number)),  # rating
...       (OneHotEncoder(),
...        dtype_column_selector(dtype_include=object)))  # city
>>> ct.fit_transform(X)
array([[ 0.90453403,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  1.        ,  0.        ],
       [ 0.90453403,  0.        ,  0.        ,  1.        ]])
__call__(df)[source]#

Callable for column selection to be used by a ColumnTransformer.

Parameters:

df (pd.DataFrame of shape (n_features, n_samples)) – DataFrame to select columns from.
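
Outside of a ColumnTransformer, the selector can also be called directly on a DataFrame (a hedged sketch; the returned value is assumed to be the list of matching column names, as with scikit-learn's make_column_selector):

>>> import pandas as pd
>>> from arfs.preprocessing import dtype_column_selector
>>> X = pd.DataFrame({"city": ["London", "Paris"], "rating": [5, 3]})
>>> select_str_cols = dtype_column_selector(dtype_include=object)
>>> select_str_cols(X)  # expected to return the object-dtype column names, e.g. ['city']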

arfs.preprocessing.find_interval_midpoint(interval_series)[source]#

Find the midpoint (or left/right bound if the interval contains Inf).

Parameters:

interval_series (pd.Series) – series of pandas intervals.

Return type:

ndarray

Returns:

np.ndarray – Array of midpoints or bounds of the intervals.
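
For example (the intervals below are built with pandas purely for illustration):

>>> import pandas as pd
>>> from arfs.preprocessing import find_interval_midpoint
>>> s = pd.Series(pd.interval_range(start=0, end=30, periods=3))
>>> find_interval_midpoint(s)  # midpoints of (0, 10], (10, 20], (20, 30], i.e. 5.0, 15.0, 25.0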

arfs.preprocessing.highlight_discarded(s)[source]#

Highlight X in red and V in green.

Parameters:

s (np.arrays) –

Returns:

list

arfs.preprocessing.transform_interval_to_midpoint(X, cols='all')[source]#

Transforms interval columns in a pandas DataFrame to their midpoint values.

Notes

Equivalent function to IntervalToMidpoint without the estimator API

Parameters:
  • X (pd.DataFrame) – The input DataFrame containing the data to be transformed.

  • cols (list of str or str) – The columns to be transformed. Defaults to “all” which transforms all columns.

Return type:

DataFrame

Returns:

pd.DataFrame – The transformed DataFrame with interval columns replaced by their midpoint values.

Raises:

TypeError – If the input data is not a pandas DataFrame.

arfs.sampling module#

This module provides methods for sampling large datasets in order to reduce the running time

arfs.sampling._gower_distance_row(xi_cat, xi_num, xj_cat, xj_num, feature_weight_cat, feature_weight_num, feature_weight_sum, ranges_of_numeric)[source]#

Compute a row of the Gower matrix

Parameters:
  • xi_cat (np.array) – categorical row of the X matrix

  • xi_num (np.array) – numerical row of the X matrix

  • xj_cat (np.array) – categorical row of the X matrix

  • xj_num (np.array) – numerical row of the X matrix

  • feature_weight_cat (np.array) – weight vector for the categorical features

  • feature_weight_num (np.array) – weight vector for the numerical features

  • feature_weight_sum (float) – The sum of the weights

  • ranges_of_numeric (np.array) – range of the scaled numerical features (between 0 and 1)

Returns:

np.array (array) – a row vector of the Gower distance

arfs.sampling.get_5_percent_splits(length)[source]#

Splits the dataset length into 5% increments

Parameters:

length (int) – array length

Returns:

array – vector of sizes

arfs.sampling.get_most_common(srs)[source]#
arfs.sampling.gower_matrix(data_x, data_y=None, weight=None, cat_features='auto')[source]#

Computes the Gower distances between X and Y

Gower is a similarity measure for categorical, boolean and numerical mixed data.

Parameters:
  • data_x (np.array or pd.DataFrame) – The data for computing the Gower distance

  • data_y (np.array or pd.DataFrame or pd.Series, optional) – The reference matrix or vector to compare with, optional

  • weight (np.array or pd.Series, optional) – sample weight, optional

  • cat_features (list of str or bool or int, optional) – auto-detect cat features or a list of cat features, by default ‘auto’

Returns:

np.array – The Gower distance matrix, shape (n_samples, n_samples)

Notes

The non-numeric features, and numeric feature ranges are determined from X and not Y.

Raises:
  • TypeError – If two dataframes are passed but have different number of columns

  • TypeError – If two arrays are passed but have different number of columns

  • TypeError – Sparse matrices are not supported

  • TypeError – if a list of categorical columns is passed, it should be a list of strings or integers or boolean values
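
A small sketch on mixed-type data (the toy DataFrame is illustrative):

>>> import pandas as pd
>>> from arfs.sampling import gower_matrix
>>> X = pd.DataFrame({"age": [22, 35, 58], "city": ["London", "Paris", "Paris"]})
>>> D = gower_matrix(X)  # (3, 3) matrix of pairwise Gower distances, zeros on the diagonal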

arfs.sampling.gower_topn(data_x, data_y=None, weight=None, cat_features='auto', n=5, key=None)[source]#

Get the n most similar elements

Parameters:
  • data_x (np.array or pd.DataFrame) – The data for the look up

  • data_y (np.array or pd.DataFrame or pd.Series, optional) – elements for which to return the most similar elements, should be a single row

  • weight (np.array or pd.Series, optional) – sample weight, by default None

  • cat_features (list of str or bool or int, optional) – auto detection of cat features or a list of strings, booleans or integers, by default ‘auto’

  • n (int, optional) – the number of neighbors/similar rows to find, by default 5

  • key (str, optional) – identifier key. If several rows refer to the same id, this column will be used for finding the nearest neighbors with a different id, by default None

Returns:

dict – the dictionary of indices and values of the closest elements

Raises:

TypeError – if the reference element is not a single row

arfs.sampling.isof_find_sample(X, sample_weight=None)[source]#

Finds a sample by comparing the distributions of the anomaly scores between the sample and the original distribution using the KS test. Starts at 5% of the data but increases to 10%, 15%, etc. if a significant sample cannot be found

References

Sampling method taken from boruta_shap, author: https://github.com/Ekeany

Parameters:
  • X (pd.DataFrame) – the predictors matrix

  • sample_weight (pd.Series or np.array, optional) – the sample weights, if any, by default None

Returns:

array – the indices for reducing the shadow predictors matrix

arfs.sampling.isolation_forest(X, sample_weight=None)[source]#

Fits an isolation forest to the dataset and gives an anomaly score to every sample

Parameters:
  • X (pd.DataFrame or np.array) – the predictors matrix

  • sample_weight (pd.Series or np.array, optional) – the sample weights, if any, by default None

arfs.sampling.sample(df, n=1000, sample_weight=None, method='gower')[source]#

Sampling rows from a dataframe when random sampling is not enough for reducing the number of rows. The strategies are either using hierarchical clustering based on the Gower distance or using isolation forest for identifying the most similar elements. For the clustering algorithm, clusters are determined using the Gower distance (mixed type data) and the dataset is shrunk from n_samples to n_clusters.

For the isolation forest algorithm, samples are added until a sufficient two-sample KS statistic is reached or the number of iterations reaches the maximum (20)

Parameters:
  • df (pd.DataFrame) – the dataframe to sample, with or without the target

  • n (int, optional) – the number of clusters if method is "gower", by default 1000

  • sample_weight (pd.Series or np.array, optional) – sample weights, by default None

  • method (str, optional) – the strategy to use for sampling the rows. Either "gower" or "isoforest", by default ‘gower’

Returns:

pd.DataFrame – the sampled dataframe
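
A usage sketch (the dataset below is synthetic and illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> from arfs.sampling import sample
>>> rng = np.random.default_rng(42)
>>> df = pd.DataFrame({"x": rng.normal(size=2000),
...                    "cat": rng.choice(["a", "b", "c"], size=2000)})  # hypothetical data
>>> df_small = sample(df, n=100, method="gower")  # shrink 2000 rows to roughly 100 representative rows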

arfs.sampling.smallest_indices(ary, n)[source]#

Returns the n smallest indices from a numpy array.

Parameters:
  • ary (np.array) – the array for which to return the smallest indices

  • n (int) – the number of indices to return

Returns:

dict – the dictionary of indices and values of the smallest elements

arfs.utils module#

Utility and validation functions

arfs.utils.LightForestClassifier(n_feat, n_estimators=10)[source]#

LightGBM implementation of the Random Forest classifier with the ideal number of features, according to The Elements of Statistical Learning

Parameters:
  • n_feat (int) – the number of predictors (nbr of columns of the X matrix)

  • n_estimators (int, optional) – the number of trees/estimators, by default 10

Returns:

lightgbm classifier – sklearn random forest estimator based on lightgbm

arfs.utils.LightForestRegressor(n_feat, n_estimators=10)[source]#

LightGBM implementation of the Random Forest regressor with the ideal number of features, according to The Elements of Statistical Learning

Parameters:
  • n_feat (int) – the number of predictors (nbr of columns of the X matrix)

  • n_estimators (int, optional) – the number of trees/estimators, by default 10

Returns:

lightgbm regressor – sklearn random forest estimator based on lightgbm
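
A hedged sketch (this assumes the returned object exposes the usual scikit-learn fit/predict API, as stated above; the data is synthetic and illustrative):

>>> import numpy as np
>>> from arfs.utils import LightForestRegressor
>>> rng = np.random.default_rng(0)
>>> X = rng.normal(size=(200, 5))  # hypothetical data
>>> y = X[:, 0] + rng.normal(scale=0.1, size=200)
>>> reg = LightForestRegressor(n_feat=X.shape[1], n_estimators=50)
>>> _ = reg.fit(X, y)  # behaves like any scikit-learn regressor
>>> preds = reg.predict(X)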

arfs.utils._get_cancer_data()[source]#

Load breast cancer data and add dummies (random predictors) and a genuine one, for benchmarking purposes. Classification (binary)

Returns:

object – Bunch sklearn, extension of dictionary

arfs.utils._get_titanic_data()[source]#

Load Titanic data and add dummies (random predictors, numeric and categorical) and a genuine one, for benchmarking purposes. Classification (binary)

Returns:

object – Bunch sklearn, extension of dictionary

arfs.utils._load_boston_data()[source]#

Load Boston data and add dummies (random predictors, numeric and categorical) and a genuine one, for benchmarking purposes. Regression (positive domain).

Returns:

object – Bunch sklearn, extension of dictionary

arfs.utils._load_housing(as_frame=False)[source]#

Load the California housing data. See here https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html for the downloadable version.

Parameters:

as_frame (bool, default False) – Whether to return a pandas DataFrame; if False, a “Bunch” (enhanced dictionary) is returned

Returns:

pd.DataFrame or Bunch – the dataset

arfs.utils._make_corr_dataset_classification(size=1000)[source]#

Generate an artificial dataset for classification tasks. Some columns are correlated, have no variance, large cardinality, numerical and categorical.

Parameters:

size (int, optional) – the number of rows to generate, by default 1000

Returns:

tuple – the predictors matrix, the target, and the weights

arfs.utils._make_corr_dataset_regression(size=1000)[source]#

Generate an artificial dataset for regression tasks with columns that are correlated, have no variance, large cardinality, numerical and categorical.

Parameters:

size (int, optional) – number of rows to generate, by default 1000

Returns:

pd.DataFrame, pd.Series, pd.Series – the predictors matrix, the target and the weights

arfs.utils.check_if_tree_based(model)[source]#

check if estimator is tree based

Parameters:

model (object) – the estimator to check

Returns:

condition (boolean) – if tree based or not

arfs.utils.concat_or_group(col, x, max_length=25)[source]#

Concatenate unique values from a column or return a group value.

Parameters:
  • col (str) – The name of the column to process.

  • x (pd.DataFrame) – The DataFrame containing the data.

  • max_length (int, optional) – The maximum length for concatenated strings, beyond which grouping is performed, by default 25.

Returns:

str – A concatenated string of unique values if the length is less than max_length, otherwise, a unique group value from the specified column.

Notes

If the concatenated string length is greater than or equal to max_length, this function returns the unique group value from the column with a “_g” suffix.

Examples

>>> import pandas as pd
>>> from arfs.utils import concat_or_group
>>> data = {
...     'Category_g': [1, 1, 2, 2, 3],
...     'Category': ['AAAAAAAAAAAAAAA', 'Bovoh', 'Ccccccccccccccc', 'D', 'E']}
>>> X = pd.DataFrame(data)
>>> cat_bin_dict = {}
>>> col = 'Category'
>>> cat_bin_dict[col] = (
...     X[[f"{col}_g", col]]
...     .groupby(f"{col}_g")
...     .apply(lambda x: concat_or_group(col, x))
...     .to_dict()
... )
>>> print(cat_bin_dict)
{'Category': {1: 'gr_1', 2: 'gr_2', 3: 'E'}}

arfs.utils.create_dtype_dict(df, dic_keys='col_names')[source]#

Create a custom dictionary of data type for adding suffixes to column names in the plotting utility for association matrix.

Parameters:
  • df (pd.DataFrame) – The dataframe used for computing the association matrix.

  • dic_keys (str) – Either “col_names” or “dtypes” for returning either a dictionary with column names or dtypes as keys.

Return type:

dict

Returns:

dict – A dictionary with either column names or dtypes as keys.

Raises:

ValueError – If dic_keys is not either “col_names” or “dtypes”.
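
A minimal sketch (the toy DataFrame is illustrative; the exact dictionary values depend on the dtype tags used internally):

>>> import pandas as pd
>>> from arfs.utils import create_dtype_dict
>>> df = pd.DataFrame({"num": [1.0, 2.0], "cat": ["a", "b"]})
>>> create_dtype_dict(df)                      # keys are column names, values their dtype tag
>>> create_dtype_dict(df, dic_keys="dtypes")   # keys are dtype tags, values the matching columns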

arfs.utils.get_pandas_cat_codes(X)[source]#

Converts categorical and time features in a pandas DataFrame into numerical codes.

Parameters:

X (pandas DataFrame) – The input DataFrame containing categorical and/or time features.

Returns:

  • X (pandas DataFrame) – The modified input DataFrame with categorical and time features replaced by numerical codes.

  • obj_feat (list or None) – List of column names that were converted to numerical codes. Returns None if no categorical or time features found.

  • cat_idx (list or None) – List of column indices for the columns in obj_feat. Returns None if no categorical or time features found.

arfs.utils.is_catboost(estimator)[source]#

check if estimator is catboost

Parameters:

estimator (object) – the estimator to check

Returns:

condition (boolean) – if catboost based or not

arfs.utils.is_lightgbm(estimator)[source]#

check if estimator is lightgbm

Parameters:

estimator (object) – the estimator to check

Returns:

condition (boolean) – if lgbm based or not

arfs.utils.is_list_of_bool(bool_list)[source]#

Check whether bool_list is a list of Booleans

Parameters:

bool_list (list of bool) – the list we want to check for

Returns:

bool – True if list of Booleans, else False

arfs.utils.is_list_of_int(int_list)[source]#

Check whether int_list is a list of integers

Parameters:

int_list (list of int) – the list we want to check for

Returns:

bool – True if list of integers, else False

arfs.utils.is_list_of_str(str_list)[source]#

Check if str_list is a list of strings.

Parameters:

str_list (list or None) – The list to check.

Returns:

bool – True if the list is a list of strings, False otherwise.

arfs.utils.is_xgboost(estimator)[source]#

check if estimator is xgboost

Parameters:

estimator (object) – the estimator to check

Returns:

condition (boolean) – if xgboost based or not

arfs.utils.load_data(name='Titanic')[source]#

Load some toy data set to test the All Relevant Feature Selection methods. Dummies (random) predictors are added and ARFS should be able to filter them out. The Titanic predictors are encoded (needed for scikit estimators).

Titanic and cancer are for binary classification, they contain synthetic random (dummies) predictors and a noisy but genuine synthetic predictor. Hopefully, a good All Relevant FS should be able to detect all the predictors genuinely related to the target.

Boston is for regression; like the others, it contains synthetic random (dummy) predictors alongside a genuine synthetic one.

Parameters:

name (str, optional) – the name of the data set. Titanic is for classification with sample_weight, Boston for regression and cancer for classification (without sample weight), by default ‘Titanic’

Returns:

Bunch – extension of dictionary, accessible by key

Raises:

ValueError – if the dataset name is invalid
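
A hedged sketch (the attribute names follow the scikit-learn Bunch convention, data/target; the sample-weight key name is an assumption):

>>> from arfs.utils import load_data
>>> bunch = load_data(name="Titanic")
>>> X, y = bunch.data, bunch.target              # predictors and target
>>> w = getattr(bunch, "sample_weight", None)    # assumed key name for the Titanic sample weights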

arfs.utils.plot_y_vs_X(X, y, ncols=2, figsize=(10, 10))[source]#

Plot target vs relevant and non-relevant predictors

Parameters:
  • X (pd.DataFrame) – The DataFrame of the predictors.

  • y (np.array) – The target.

  • ncols (int, optional) – The number of columns in the facet plot. Default is 2.

  • figsize (tuple, optional) – The figure size. Default is (10, 10).

Returns:

plt.figure – The univariate plots y vs pred_i.

arfs.utils.reset_plot()[source]#

Reset plot style

arfs.utils.set_my_plt_style(height=3, width=5, linewidth=2)[source]#

This sets the matplotlib style to fivethirtyeight with some modifications (colours, axes)

Parameters:
  • linewidth (int, default 2) – line width

  • height (int, default 3) – fig height in inches (yeah they’re still struggling with the metric system)

  • width (int, default 5) – fig width in inches (yeah they’re still struggling with the metric system)

arfs.utils.validate_pandas_input(arg)[source]#

Validate that a pandas or numpy object is provided.

Parameters:

arg (pd.DataFrame or np.array) – the object to validate

Raises:

TypeError – error if pandas or numpy arrays are not provided

arfs.utils.validate_sample_weight(sample_weight)[source]#

Validate the sample_weight parameter.

Parameters:

sample_weight (array-like or None) – Input sample weights.

Returns:

np.ndarray or None – If sample_weight is a Pandas Series, its values are returned as a numpy array. If sample_weight is already a numpy array, it is returned unmodified. If sample_weight is None, None is returned.

Raises:

ValueError – If sample_weight is not an array-like object or None.

Module contents#

init module, providing information about the arfs package