Association and Feature Selection#

Measuring the association between variables is a fundamental part of data analysis: it reveals the relationships and dependencies among features. This notebook introduces a set of functions that compute associations for each combination of data types: categorical-categorical, numerical-numerical, and categorical-numerical. The results are useful for both exploratory data analysis and predictive modeling.

The computations can be parallelized: when the data is large enough to justify it, each batch of columns from the pandas DataFrame is handled by a separate process (the work is CPU-bound). Note that with n_jobs > 1 the first call may be slow because of the overhead of starting the processes. For data of moderate size, n_jobs=1 is usually faster.

Lastly, you can supply custom functions to compute associations between columns. The resulting associations can then drive feature elimination in the CollinearityThreshold selector.

Default association functions#

Numerical-numerical: Spearman correlation coefficient, symmetric
Nominal-Numerical: correlation ratio, symmetric, but the implementation requires the first argument to be the nominal column
Nominal-Nominal: Theil’s U statistic, asymmetric
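
To build intuition for why Theil's U is listed as asymmetric while the correlation ratio is a single number per pair, the sketch below implements both from scratch using only the standard library. It is an illustration on made-up toy data, not the arfs implementation: the helper names toy_theils_u and toy_correlation_ratio are hypothetical, and the orientation U(x | y) is an assumption of this sketch.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(x | y): entropy left in x once y is known."""
    n = len(x)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[yi].append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def toy_theils_u(x, y):
    """U(x | y) = (H(x) - H(x | y)) / H(x): share of x's entropy explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so it is trivially predictable
    return (h_x - conditional_entropy(x, y)) / h_x


def toy_correlation_ratio(categories, values):
    """eta = sqrt(between-group sum of squares / total sum of squares)."""
    mean = sum(values) / len(values)
    total = sum((v - mean) ** 2 for v in values)
    if total == 0:
        return 0.0
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    return math.sqrt(between / total)


# Toy data (made up): title determines sex, but sex does not determine title
title = ["Mr", "Mrs", "Miss", "Mr", "Miss", "Mrs"]
sex = ["male", "female", "female", "male", "female", "female"]
fare = [7.0, 50.0, 20.0, 9.0, 22.0, 48.0]

u_sex_given_title = toy_theils_u(sex, title)
u_title_given_sex = toy_theils_u(title, sex)
eta_title_fare = toy_correlation_ratio(title, fare)
```

On this toy data, title fully determines sex, so U(sex | title) equals 1, whereas sex only partly determines title, so U(title | sex) is strictly smaller (about 0.58). Swapping the arguments changes the result, which is what "asymmetric" means in the list above.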

[1]:

import pandas as pd
import numpy as np

from arfs.utils import load_data
from arfs.feature_selection.unsupervised import CollinearityThreshold
from arfs.association import (
    asymmetric_function, xy_to_matrix, association_matrix, association_series,
    correlation_ratio, correlation_ratio_matrix, weighted_corr, wcorr_matrix,
    theils_u_matrix, weighted_theils_u,
    _callable_association_matrix_fn, _callable_association_series_fn,
)

[2]:

titanic = load_data(name="Titanic")
X, y = titanic.data, titanic.target
y = y.astype(int)
X.head()

[2]:

	pclass	sex	embarked	random_cat	is_alone	title	age	family_size	fare	random_num
0	1	female	S	Fry	1	Mrs	29.0000	0.0	211.3375	0.496714
1	1	male	S	Bender	0	Master	0.9167	3.0	151.5500	-0.138264
2	1	female	S	Thanos	0	Mrs	2.0000	3.0	151.5500	0.647689
3	1	male	S	Morty	0	Mr	30.0000	3.0	151.5500	1.523030
4	1	female	S	Morty	0	Mrs	25.0000	3.0	151.5500	-0.234153

Association series and matrix#

Compute the association series and the association matrix for columns of any dtype.

[3]:

association_series(
    X=X,
    target="age",
    normalize=False,
    n_jobs=1,
    handle_na="drop",
)

[3]:

age            1.000000
title          0.403618
pclass         0.375524
is_alone       0.222841
fare           0.163930
embarked       0.098631
sex            0.057398
random_cat     0.037237
random_num    -0.035203
family_size   -0.139715
dtype: float64

[4]:

assoc_m = association_matrix(X=X, n_jobs=1)
assoc_m

[4]:

	row	col	val
0	family_size	fare	0.226465
1	fare	family_size	0.226465
2	age	fare	0.171521
3	fare	age	0.171521
4	family_size	random_num	-0.019169
...	...	...	...
85	sex	random_cat	0.002444
86	random_cat	pclass	0.002035
87	random_cat	is_alone	0.001386
88	pclass	is_alone	0.001314
89	random_cat	sex	0.000818

90 rows × 3 columns

You can reshape this long-format frame into a square matrix with xy_to_matrix:

[5]:

xy_to_matrix(assoc_m)

[5]:

	age	embarked	family_size	fare	is_alone	pclass	random_cat	random_num	sex	title
age	0.000000	0.098631	-0.196996	0.171521	0.222841	0.375524	0.037237	-0.041389	0.057398	0.403618
embarked	0.098631	0.000000	0.104125	0.300998	0.006505	0.100167	0.006122	0.052706	0.011014	0.012525
family_size	-0.196996	0.104125	0.000000	0.226465	0.785592	0.059053	0.064754	-0.019169	0.188583	0.438517
fare	0.171521	0.300998	0.226465	0.000000	0.175140	0.602869	0.070161	-0.024327	0.185484	0.196217
is_alone	0.222841	0.010056	0.785592	0.175140	0.000000	0.002527	0.005153	0.010023	0.031810	0.172022
pclass	0.375524	0.080502	0.059053	0.602869	0.001314	0.000000	0.003934	0.052158	0.007678	0.029489
random_cat	0.037237	0.002545	0.064754	0.070161	0.001386	0.002035	0.000000	0.064163	0.000818	0.002632
random_num	-0.041389	0.052706	-0.019169	-0.024327	0.010023	0.052158	0.064163	0.000000	0.045685	0.066008
sex	0.057398	0.013678	0.188583	0.185484	0.025554	0.011864	0.002444	0.045685	0.000000	0.976945
title	0.403618	0.010986	0.438517	0.196217	0.097605	0.032183	0.005553	0.066008	0.690022	0.000000
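
For readers who want to see what such a reshape involves, here is a minimal pandas sketch of turning a long "row"/"col"/"val" frame into a wide matrix. It assumes the reshape amounts to a pivot with missing pairs (such as the diagonal) filled with 0, which matches the output above but is not taken from the xy_to_matrix source. The values are made up.

```python
import pandas as pd

# Toy long-format associations (made-up values), same "row"/"col"/"val" layout
long_df = pd.DataFrame(
    {
        "row": ["a", "b", "a", "c", "b", "c"],
        "col": ["b", "a", "c", "a", "c", "b"],
        "val": [0.8, 0.8, 0.1, 0.3, 0.5, 0.5],
    }
)

# One line per "row" label, one column per "col" label.
# Pairs absent from the long frame (here the diagonal) become 0.
wide = long_df.pivot(index="row", columns="col", values="val").fillna(0.0)
```

Because each direction is stored as its own entry, an asymmetric measure simply yields a matrix where wide.loc["a", "c"] and wide.loc["c", "a"] differ.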

User defined association functions#

The functions association_series and association_matrix delegate the computation to two utilities, _callable_association_series_fn and _callable_association_matrix_fn, which parallelize it when n_jobs > 1. These two utilities can parallelize any generic function that computes an association coefficient.

However, the input functions must follow a well-defined structure. For a symmetric association coefficient:

@symmetric_function
def input_function_computing_coefficient_values(x, y, sample_weight=None, as_frame=True):
    """
    Calculate the [DESCRIPTION HERE] for series x with respect to series y.

    Parameters
    ----------
    x : pandas.Series
        A pandas Series representing a feature.
    y : pandas.Series
        Another pandas Series representing a feature.
    sample_weight : array-like, optional
        Sample weights, if the coefficient supports them. The default is None.
    as_frame : bool, optional
        If True, the function returns the result as a pandas DataFrame;
        otherwise, it returns a float value. The default is True.

    Returns
    -------
    Union[float, pandas.DataFrame]
        A score representing the [COEFFICIENT NAME] between x and y.
        If `as_frame` is True, returns a DataFrame with the columns "row", "col", and "val",
        where "row" and "col" hold the names of the series x and y
        and "val" is the coefficient value. If `as_frame` is False, returns the coefficient as a float.
    """

    if x.name == y.name:
        c = 1
    else:
        df = pd.DataFrame({"x": x.values, "y": y.values})
        # Compute the coefficient from df and store it in c
        [CUSTOM CODE HERE, RETURNING THE COEFFICIENT VALUE c]

    if as_frame:
        # Symmetry means the same value serves both (x, y) and (y, x),
        # so it is computed only once
        return pd.DataFrame(
            {"row": [x.name, y.name], "col": [y.name, x.name], "val": [c, c]}
        )
    else:
        return c

For an asymmetric association coefficient, such as Theil’s U statistic:

@asymmetric_function
def input_function_computing_coefficient_values(x, y, sample_weight=None, as_frame=True):
    """
    Calculate the [DESCRIPTION HERE] for series x with respect to series y.

    Parameters
    ----------
    x : pandas.Series
        A pandas Series representing a feature.
    y : pandas.Series
        Another pandas Series representing a feature.
    sample_weight : array-like, optional
        Sample weights, if the coefficient supports them. The default is None.
    as_frame : bool, optional
        If True, the function returns the result as a pandas DataFrame;
        otherwise, it returns a float value. The default is True.

    Returns
    -------
    Union[float, pandas.DataFrame]
        A score representing the [COEFFICIENT NAME] between x and y.
        If `as_frame` is True, returns a DataFrame with the columns "row", "col", and "val",
        where "row" and "col" hold the names of the series x and y
        and "val" is the coefficient value. If `as_frame` is False, returns the coefficient as a float.
    """

    if x.name == y.name:
        c = 1
    else:
        df = pd.DataFrame({"x": x.values, "y": y.values})
        # Compute the coefficient from df and store it in c
        [CUSTOM CODE HERE, RETURNING THE COEFFICIENT VALUE c]

    if as_frame:
        # Asymmetric coefficient: only the (x, y) direction is returned
        return pd.DataFrame({"row": x.name, "col": y.name, "val": c}, index=[0])
    else:
        return c
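
As a concrete instance of the symmetric template, the sketch below fills in the placeholder with the absolute Pearson correlation. The function name abs_pearson and the toy series are hypothetical, and the function is left undecorated so that it runs with pandas alone. To plug it into association_matrix, it would presumably need the @symmetric_function decorator shown in the template.

```python
import pandas as pd


def abs_pearson(x, y, sample_weight=None, as_frame=True):
    """Absolute Pearson correlation between two numeric Series.

    sample_weight is accepted to match the expected signature but is ignored.
    """
    if x.name == y.name:
        c = 1.0
    else:
        # Pair the values positionally and drop rows with missing entries
        df = pd.DataFrame({"x": x.values, "y": y.values}).dropna()
        c = abs(df["x"].corr(df["y"]))

    if as_frame:
        # Symmetric coefficient: emit both (x, y) and (y, x) from one computation
        return pd.DataFrame(
            {"row": [x.name, y.name], "col": [y.name, x.name], "val": [c, c]}
        )
    return c


# Toy series (made up): b decreases linearly as a increases
a = pd.Series([1.0, 2.0, 3.0, 4.0], name="a")
b = pd.Series([8.0, 6.0, 4.0, 2.0], name="b")

out = abs_pearson(a, b)                      # two-row "row"/"col"/"val" frame
value = abs_pearson(a, b, as_frame=False)    # plain float
```

Returning both directions in one frame is what lets a symmetric function skip half of the pairwise computations.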

Example#

[6]:

import ppscore as pps

@asymmetric_function
def ppscore_arfs(x, y, sample_weight=None, as_frame=True):
    """
    Calculate the Predictive Power Score (PPS) for series x with respect to series y.

    The PPS is a score that shows the predictive relationship between two variables.
    This function calculates the PPS of x predicting y. If the series have the same name,
    the function assumes they are identical and returns a score of 1.

    Parameters
    ----------
    x : pandas.Series
        A pandas Series representing a feature.
    y : pandas.Series
        Another pandas Series representing a feature.
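    sample_weight : array-like, optional
        Accepted for interface compatibility; not used by this function.
    as_frame : bool, optional
        If True, the function returns the result as a pandas DataFrame;
        otherwise, it returns a float value. The default is True.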

    Returns
    -------
    Union[float, pandas.DataFrame]
        A score representing the PPS between x and y.
        If `as_frame` is True, returns a DataFrame with the columns "row", "col", and "val",
        where "row" and "col" represent the names of the series x and y, respectively,
        and "val" is the PPS score. If `as_frame` is False, returns the PPS score as a float.
    """

    # If x and y are single-column DataFrames, unwrap them into Series
    if (isinstance(x, pd.DataFrame) and isinstance(y, pd.DataFrame) and x.shape[1] == 1 and y.shape[1] == 1):
        # Extracting the series from the DataFrames
        x = x.iloc[:, 0]
        y = y.iloc[:, 0]

    if x.name == y.name:
        score = 1
    else:
        df = pd.DataFrame({"x": x.values, "y": y.values})
        # Calculating the PPS and extracting the score
        score = pps.score(df, df.columns[0], df.columns[1])['ppscore']

    if as_frame:
        return pd.DataFrame({"row": x.name, "col": y.name, "val":score}, index=[0])
    else:
        return score

[7]:

d = association_matrix(
    X=X,
    n_jobs=1,
    nom_nom_assoc=ppscore_arfs,
    num_num_assoc=ppscore_arfs,
    nom_num_assoc=ppscore_arfs)

xy_to_matrix(d)

The least populated class in y has only 2 members, which is less than n_splits=4.

[7]:

	age	embarked	family_size	fare	is_alone	pclass	random_num	sex	title
age	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
embarked	0.00000	0.000000	0.000000	0.000000	0.000000	0.132714	0.000000	0.000000	0.000000
family_size	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
fare	0.00000	0.000000	0.603721	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
is_alone	0.00000	0.000012	0.324128	0.000000	0.000000	0.000000	0.000000	0.166500	0.208732
pclass	0.00000	0.000012	0.000000	0.188409	0.000000	0.000000	0.000606	0.000000	0.000000
random_cat	0.00000	0.000012	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
random_num	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
sex	0.00000	0.000012	0.000000	0.000000	0.000000	0.000000	0.001554	0.000000	0.796922
title	0.04168	0.000012	0.000000	0.000000	0.267387	0.000000	0.001815	0.984495	0.000000

Link to feature selection#

The CollinearityThreshold selector of the arfs.feature_selection.unsupervised module uses the association matrix behind the scenes. You can replace the default association functions with your own user-defined ones:

[8]:

selector = CollinearityThreshold(
    method="association",
    nom_nom_assoc=ppscore_arfs,
    num_num_assoc=ppscore_arfs,
    nom_num_assoc=ppscore_arfs,
    threshold=0.5,
).fit(X)

print(f"The features going in the selector are : {selector.feature_names_in_}")
print(f"The support is : {selector.support_}")
print(f"The selected features are : {selector.get_feature_names_out()}")

The least populated class in y has only 2 members, which is less than n_splits=4.

The features going in the selector are : ['pclass' 'sex' 'embarked' 'random_cat' 'is_alone' 'title' 'age'
 'family_size' 'fare' 'random_num']
The support is : [ True  True  True  True  True False  True  True False  True]
The selected features are : ['pclass' 'sex' 'embarked' 'random_cat' 'is_alone' 'age' 'family_size'
 'random_num']


[9]:

selector.assoc_matrix_

[9]:

	age	embarked	family_size	fare	is_alone	pclass	random_num	sex	title
age	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
embarked	0.00000	0.000000	0.000000	0.000000	0.000000	0.132714	0.000000	0.000000	0.000000
family_size	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
fare	0.00000	0.000000	0.603721	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
is_alone	0.00000	0.000012	0.324128	0.000000	0.000000	0.000000	0.000000	0.166500	0.208732
pclass	0.00000	0.000012	0.000000	0.188409	0.000000	0.000000	0.000606	0.000000	0.000000
random_cat	0.00000	0.000012	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
random_num	0.00000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
sex	0.00000	0.000012	0.000000	0.000000	0.000000	0.000000	0.001554	0.000000	0.796922
title	0.04168	0.000012	0.000000	0.000000	0.267387	0.000000	0.001815	0.984495	0.000000
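
To make the link between an association matrix and feature elimination more tangible, here is a rough sketch of threshold-based pruning. It is not the CollinearityThreshold implementation: the function prune_by_threshold, its tie-breaking rule, and the toy matrix are all hypothetical, and it assumes a square matrix with identical row and column labels.

```python
import numpy as np
import pandas as pd


def prune_by_threshold(assoc, threshold=0.5):
    """Greedily drop features until no pairwise |association| exceeds threshold."""
    names = list(assoc.index)
    vals = np.abs(assoc.loc[names, names].to_numpy(dtype=float))
    # Asymmetric measures: keep the stronger of the two directions
    vals = np.maximum(vals, vals.T)
    np.fill_diagonal(vals, 0.0)
    m = pd.DataFrame(vals, index=names, columns=names)

    dropped = []
    while m.size and m.to_numpy().max() > threshold:
        # Drop the feature with the most links above the threshold;
        # ties are broken by the largest mean association
        n_links = (m > threshold).sum(axis=1)
        candidates = n_links[n_links == n_links.max()].index
        worst = m.loc[candidates].mean(axis=1).idxmax()
        dropped.append(worst)
        m = m.drop(index=worst, columns=worst)
    return list(m.index), dropped


# Toy association matrix (made-up values); "a" is strongly tied to "b" and "c"
toy = pd.DataFrame(
    [
        [0.0, 0.9, 0.6, 0.1],
        [0.7, 0.0, 0.2, 0.1],
        [0.6, 0.2, 0.0, 0.1],
        [0.1, 0.1, 0.1, 0.0],
    ],
    index=list("abcd"),
    columns=list("abcd"),
)

kept, dropped = prune_by_threshold(toy, threshold=0.5)
```

Removing the single most connected feature first keeps as many features as possible: here dropping "a" alone resolves both pairs above the threshold.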

[10]:

f = selector.plot_association(figsize=(4, 4))

[Figure: plot produced by selector.plot_association(figsize=(4, 4))]
