Lasso Regularized GLM for Feature Selection in Python#

Introduction#

Welcome to our tutorial on using the Lasso Regularized Generalized Linear Model (GLM) for feature selection in Python! Feature selection is a critical step in machine learning, helping us identify the most important variables from a dataset, making models more accurate and easier to understand.

Description#

The Lasso Regularized GLM is a powerful variant of the traditional GLM: it adds an L1 penalty (lasso regularization) to the fitting objective. This penalty encourages some coefficients to become exactly zero, effectively excluding irrelevant features from the model. As a result, the Lasso Regularized GLM is an excellent tool for feature selection, especially in datasets with many variables.
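For reference, the penalized objective can be written as follows (standard lasso-GLM notation, with $\ell$ the log-likelihood of the chosen family, $\alpha$ the regularization strength and $p$ the number of predictors):

$$\hat{\beta} = \underset{\beta}{\arg\min}\; -\frac{1}{n}\,\ell(\beta; X, y) + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert$$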

In this tutorial, we’ll build a user-friendly Python class called “LassoFeatureSelection” that harnesses the Lasso Regularized GLM to perform feature selection. The class can handle different types of data, like pandas DataFrames and numpy arrays, making it super flexible and compatible with various data formats.

Please note that one limitation of the lasso is that it treats each level of a categorical predictor as a separate feature. This can be addressed with the TreeDiscretizer, which automatically bins numerical variables and groups the levels of categorical variables (see the sketch below and the Advanced approach section).
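As a minimal sketch (using the toy regression dataset from arfs that appears later in this tutorial, and parameter values mirroring the Advanced approach section), grouping the levels of a categorical column could look like this:

# Hedged sketch: group the levels of a categorical predictor with TreeDiscretizer
# before running the lasso. The dataset helper and parameter values are borrowed
# from the sections below; treat the exact settings as indicative only.
from arfs.preprocessing import TreeDiscretizer
from arfs.utils import _make_corr_dataset_regression

X_demo, y_demo, _ = _make_corr_dataset_regression(size=1_000)
disc = TreeDiscretizer(bin_features=["nice_guys"], n_bins=5)
X_binned = disc.fit_transform(X_demo, y_demo)
print(X_binned["nice_guys"].unique())  # levels merged into a handful of groups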

Key Features of the “LassoFeatureSelection” Class#

  1. Automated Intercept Handling: Our class automatically adds an intercept column to the input data if it’s missing. It then checks if the intercept is statistically significant and keeps it only if necessary.

  2. Model Selection: We use a grid-search cross-validation approach to find the best regularization parameter (alpha) for the Lasso Regularized GLM. Users can choose between two scoring options: the Bayesian Information Criterion (BIC) or the mean cross-validation score (deviance). The BIC option is much faster because it avoids cross-validation (see the conceptual sketch after this list).

  3. Transparent Feature Selection: After fitting the “LassoFeatureSelection” model, you can access the selected features. This lets you understand which variables are the most crucial in the final model.
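As a conceptual aside (not the internal implementation of "LassoFeatureSelection"), the idea of picking the regularization strength with an information criterion instead of cross-validation can be illustrated, for the plain Gaussian case, with scikit-learn's LassoLarsIC:

# Conceptual sketch only: BIC-driven choice of alpha for a Gaussian lasso.
# LassoFeatureSelection relies on a penalized GLM internally; this merely shows
# why an information criterion is cheaper than cross-validation (a single path fit
# instead of K refits).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

X_demo, y_demo = make_regression(n_samples=500, n_features=10, n_informative=3, noise=1.0, random_state=0)

bic_lasso = LassoLarsIC(criterion="bic").fit(X_demo, y_demo)
print("alpha selected by BIC:", bic_lasso.alpha_)
print("indices of non-zero coefficients:", np.flatnonzero(bic_lasso.coef_))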

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

from mpl_toolkits.axes_grid1.inset_locator import inset_axes

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_squared_error

from arfs.feature_selection.lasso import LassoFeatureSelection
from arfs.preprocessing import PatsyTransformer
import arfs.feature_selection as arfsfs


def plot_y_vs_X(
    X: pd.DataFrame, y: pd.Series, ncols: int = 2, figsize: tuple = (10, 10)
) -> plt.Figure:
    """
    Create subplots of scatter plots showing the relationship between each column in X and the target variable y.

    Parameters
    ----------
        X : pd.DataFrame
            The input DataFrame containing the predictor variables.
        y : pd.Series
            The target variable to be plotted against.
        ncols : int, optional (default: 2)
            The number of columns in the subplot grid.
        figsize : tuple, optional (default: (10, 10))
            The size of the figure (width, height) in inches.

    Returns
    -------
        plt.Figure
            The generated Figure object containing the scatter plots.
    """

    # determine the grid size: add an extra row if the columns do not divide evenly
    n_rows, remainder = divmod(X.shape[1], ncols)
    n_rows += int(remainder > 0)

    # Create figure and axes
    f, axs = plt.subplots(nrows=n_rows, ncols=ncols, figsize=figsize)

    # Iterate through columns and plot against y
    for ax, col in zip(axs.flat, X.columns):
        ax.scatter(X[col], y, alpha=0.1)
        ax.set_title(col)

    # Remove any unused subplots
    for ax in axs.flat[len(X.columns) :]:
        ax.set_axis_off()

    # Display the figure
    plt.tight_layout()
    return f

Linear model#

The main goal here is not to build a detailed model for the data, but rather to identify the most important predictors. We will check how well the feature selection process can distinguish between meaningful predictors and random noise.

[2]:
bias = 7.0
X, y, true_coef = make_regression(
    n_samples=2_000,
    n_features=10,
    n_informative=5,
    noise=1,
    random_state=8,
    bias=bias,
    coef=True,
)
X = pd.DataFrame(X)
X.columns = [f"pred_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

print(f"The true coefficient of the linear data generating process are:\n {true_coef}")
f = plot_y_vs_X(X_train, y_train, ncols=5, figsize=(15, 5))
The true coefficients of the linear data generating process are:
 [ 0.         77.90364383  0.          0.         63.70230035  0.
 95.59499715 43.86990345  0.          4.11861919]
../_images/notebooks_lasso_feature_selection_3_1.png
[3]:
# family should be: "gaussian", "poisson", "gamma", "negativebinomial", "binomial", "tweedie"
selector = LassoFeatureSelection(n_iterations=10, family="gaussian", score="bic")
selector.fit(X=X_train, y=y_train, sample_weight=None)
selector.get_feature_names_out()
[3]:
Index(['Intercept', 'pred_1', 'pred_4', 'pred_6', 'pred_7', 'pred_9'], dtype='object')
[4]:
true_coef = pd.Series(true_coef)
true_coef.index = X.columns
true_coef = pd.Series({**{"intercept": bias}, **true_coef})
true_coef
[4]:
intercept     7.000000
pred_0        0.000000
pred_1       77.903644
pred_2        0.000000
pred_3        0.000000
pred_4       63.702300
pred_5        0.000000
pred_6       95.594997
pred_7       43.869903
pred_8        0.000000
pred_9        4.118619
dtype: float64
[5]:
selector.transform(X_test).head()
[5]:
Intercept pred_1 pred_4 pred_6 pred_7 pred_9
1860 1.0 1.091750 -0.015868 0.231947 -0.998250 0.041597
353 1.0 -0.871767 2.557578 -0.026118 0.936062 0.533048
1333 1.0 1.147664 0.678622 -0.313260 -1.717187 0.228773
905 1.0 2.247261 -0.008657 0.354843 0.151635 2.047569
1289 1.0 -0.061458 -1.429949 -1.012678 -1.070863 0.768713
[6]:
selector.feature_names_in_
[6]:
Index(['Intercept', 'pred_0', 'pred_1', 'pred_2', 'pred_3', 'pred_4', 'pred_5',
       'pred_6', 'pred_7', 'pred_8', 'pred_9'],
      dtype='object')
[7]:
selector.support_
[7]:
array([ True, False,  True, False, False,  True, False,  True,  True,
       False,  True])

The exciting part is that the lasso can still perform remarkably well despite potential mismatches between the assumed distribution and the data: even if the data does not perfectly follow the assumed family, the lasso can successfully identify relevant predictors and filter out irrelevant ones.

[8]:
selector.get_feature_names_out()
[8]:
Index(['Intercept', 'pred_1', 'pred_4', 'pred_6', 'pred_7', 'pred_9'], dtype='object')

We can even inspect the values of the coefficients, even though this is not the intended use (the ultimate output is given by the method .get_feature_names_out()).

[9]:
selector.best_estimator_.summary()
[9]:
Generalized Linear Model Regression Results
Dep. Variable: y No. Observations: 1000
Model: GLM Df Residuals: 994
Model Family: Gaussian Df Model: 6
Link Function: Identity Scale: 1.0435
Method: elastic_net Log-Likelihood: -1437.2
Date: Fri, 11 Aug 2023 Deviance: 1037.2
Time: 10:44:34 Pearson chi2: 1.04e+03
No. Iterations: 10 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 7.0127 0.032 215.952 0.000 6.949 7.076
pred_0 0 0 nan nan 0 0
pred_1 77.8866 0.033 2375.009 0.000 77.822 77.951
pred_2 0 0 nan nan 0 0
pred_3 0 0 nan nan 0 0
pred_4 63.7313 0.033 1958.221 0.000 63.667 63.795
pred_5 0 0 nan nan 0 0
pred_6 95.5690 0.032 2959.937 0.000 95.506 95.632
pred_7 43.9089 0.033 1325.779 0.000 43.844 43.974
pred_8 0 0 nan nan 0 0
pred_9 4.0589 0.033 123.946 0.000 3.995 4.123

A toy example using the selector in a pipeline#

[10]:
# Create a pipeline with LassoFeatureSelection and LinearRegression
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("selector", LassoFeatureSelection(n_iterations=10, score="bic")),
        ("linear_regression", LinearRegression()),
    ]
)

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
Mean Squared Error: 0.9917846740502078
[11]:
arfsfs.make_fs_summary(pipeline)
[11]:
  predictor scaler selector linear_regression
0 pred_0 nan 0 nan
1 pred_1 nan 1 nan
2 pred_2 nan 0 nan
3 pred_3 nan 0 nan
4 pred_4 nan 1 nan
5 pred_5 nan 0 nan
6 pred_6 nan 1 nan
7 pred_7 nan 1 nan
8 pred_8 nan 0 nan
9 pred_9 nan 1 nan
[12]:
pipeline.named_steps["selector"].get_feature_names_out()
[12]:
Index(['Intercept', 'pred_1', 'pred_4', 'pred_6', 'pred_7', 'pred_9'], dtype='object')

Beyond the feature selection quality, we can also check whether the model itself is good.

[13]:
import seaborn as sns

# Plot the predictions and residuals
plt.figure(figsize=(10, 5))

# Plot predictions
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot(
    [min(y_test), max(y_test)],
    [min(y_test), max(y_test)],
    color="#d1ae11",
    linestyle="--",
)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("True Values vs. Predictions")

# Plot residuals
residuals = y_test - y_pred
plt.subplot(1, 2, 2)
plt.scatter(y=residuals, x=y_pred, alpha=0.5)
plt.ylabel("Residuals")
plt.xlabel("Prediction")
plt.title("Distribution of Residuals")

plt.tight_layout()
plt.show();
../_images/notebooks_lasso_feature_selection_18_0.png

Poisson toy example#

The family can be any of the typical GLM families: "gaussian", "poisson", "gamma", "negativebinomial", "binomial", "tweedie".
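For instance, switching to a Gamma-distributed target (e.g. a positive, right-skewed severity) only changes the family argument; a hedged one-liner using the same parameters as above:

# Same selector, different GLM family (illustrative only)
from arfs.feature_selection.lasso import LassoFeatureSelection

gamma_selector = LassoFeatureSelection(n_iterations=10, family="gamma", score="bic")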

[14]:
# Generate synthetic data with Poisson-distributed target variable
bias = 1
X, y, true_coef = make_regression(
    n_samples=2_000,
    n_features=10,
    n_informative=5,
    noise=1,
    random_state=8,
    bias=bias,
    coef=True,
)
y = (y - y.mean()) / y.std()
y = np.exp(y)  # Transform to positive values for Poisson distribution
y = np.random.poisson(y)  # Add Poisson noise to the target variable
# dummy sample weight (e.g. exposure), smallest being 30 days
w = np.random.uniform(30 / 365, 1, size=len(y))
# make the count a Poisson rate (frequency)
y = y / w

X = pd.DataFrame(X)
X.columns = [f"pred_{i}" for i in range(X.shape[1])]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, test_size=0.5, random_state=42
)

print(f"The true coefficient of the linear data generating process are:\n {true_coef}")

f = plot_y_vs_X(X_train, y_train, ncols=5, figsize=(15, 5))
The true coefficients of the linear data generating process are:
 [ 0.         77.90364383  0.          0.         63.70230035  0.
 95.59499715 43.86990345  0.          4.11861919]
../_images/notebooks_lasso_feature_selection_20_1.png
[15]:
y
[15]:
array([0.        , 2.34937393, 4.39264421, ..., 1.51556199, 5.72062399,
       0.        ])
[16]:
# Create a pipeline with LassoFeatureSelection and PoissonRegressor
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        (
            "selector",
            LassoFeatureSelection(
                n_iterations=10, score="bic", family="poisson", fit_intercept=True
            ),
        ),
        ("glm", PoissonRegressor()),
    ]
)

# Fit the pipeline to the training data
pipeline.fit(
    X_train, y_train, selector__sample_weight=w_train, glm__sample_weight=w_train
)
[16]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('selector', LassoFeatureSelection(family='poisson')),
                ('glm', PoissonRegressor())])
[17]:
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.axes_grid1.inset_locator import inset_axes

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Calculate the residuals
residuals = y_test - y_pred

# Plot the predictions and residuals
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

# Plot predictions
axs[0].scatter(y_test, y_pred, alpha=0.05)
axs[0].plot(
    [min(y_test), max(y_test)],
    [min(y_test), max(y_test)],
    color="#d1ae11",
    linestyle="--",
)
axs[0].set_xlabel("True Values")
axs[0].set_ylabel("Predictions")
axs[0].set_title("True Values vs. Predictions")


# Plot residuals
axs[1] = sns.histplot(residuals, kde=True, ax=axs[1])
axs[1].set_xlabel("Residuals")
axs[1].set_ylabel("Frequency")
axs[1].set_title("Distribution of Residuals")

plt.show();

../_images/notebooks_lasso_feature_selection_23_0.png
[18]:
true_coef = pd.Series(true_coef)
true_coef.index = X.columns
true_coef = pd.Series({**{"intercept": bias}, **true_coef})
true_coef
[18]:
intercept     1.000000
pred_0        0.000000
pred_1       77.903644
pred_2        0.000000
pred_3        0.000000
pred_4       63.702300
pred_5        0.000000
pred_6       95.594997
pred_7       43.869903
pred_8        0.000000
pred_9        4.118619
dtype: float64

the selected predictors are

[19]:
pipeline.named_steps["selector"].get_feature_names_out()
[19]:
Index(['Intercept', 'pred_1', 'pred_4', 'pred_6', 'pred_7'], dtype='object')
[20]:
arfsfs.make_fs_summary(pipeline)
[20]:
  predictor scaler selector glm
0 pred_0 nan 0 nan
1 pred_1 nan 1 nan
2 pred_2 nan 0 nan
3 pred_3 nan 0 nan
4 pred_4 nan 1 nan
5 pred_5 nan 0 nan
6 pred_6 nan 1 nan
7 pred_7 nan 1 nan
8 pred_8 nan 0 nan
9 pred_9 nan 0 nan
[21]:
pipeline.named_steps["selector"].best_estimator_.summary()
[21]:
Generalized Linear Model Regression Results
Dep. Variable: y No. Observations: 1000
Model: GLM Df Residuals: 995
Model Family: Poisson Df Model: 5
Link Function: Log Scale: 1.0000
Method: elastic_net Log-Likelihood: -1295.4
Date: Fri, 11 Aug 2023 Deviance: 1561.5
Time: 10:44:42 Pearson chi2: 2.28e+03
No. Iterations: 24 Pseudo R-squ. (CS): 0.7581
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 0.6199 0.035 17.503 0.000 0.550 0.689
pred_0 0 0 nan nan 0 0
pred_1 0.5340 0.026 20.859 0.000 0.484 0.584
pred_2 0 0 nan nan 0 0
pred_3 0 0 nan nan 0 0
pred_4 0.4129 0.026 16.176 0.000 0.363 0.463
pred_5 0 0 nan nan 0 0
pred_6 0.6096 0.026 23.814 0.000 0.559 0.660
pred_7 0.3059 0.026 11.553 0.000 0.254 0.358
pred_8 0 0 nan nan 0 0
pred_9 0 0 nan nan 0 0

Categorical variables, missing values and a mis-specified distribution#

When working with the PatsyTransformer, missing values in categorical variables need specific handling: we convert them into an explicit string level, because native NaN support is not yet fully implemented for general (object) numpy arrays.
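A minimal plain-pandas alternative to the SimpleImputer used in the pipelines below might look like the following (the helper name fill_categorical_na is hypothetical):

# Hypothetical helper: replace NaN in object/categorical columns by an explicit
# string level so that missingness becomes its own category for Patsy.
import pandas as pd


def fill_categorical_na(df: pd.DataFrame, fill_value: str = "Missing_Value") -> pd.DataFrame:
    df = df.copy()
    for col in df.select_dtypes(include=["object", "category"]).columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            # the placeholder must be a valid category before filling
            df[col] = df[col].cat.add_categories([fill_value])
        df[col] = df[col].fillna(fill_value)
    return df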

Keep in mind that the dummy dataset we are using in this context does not necessarily correspond to any specific distribution assumption, such as the exponential family. However, for the purpose of feature selection, this might not pose a significant obstacle. The feature selection process can still be effective in identifying relevant predictors, even when the data does not precisely follow a specific distribution pattern.

Classical approach#

Unfortunately, Patsy objects are not yet picklable, so I'll present two equivalent approaches: a picklable one based on a ColumnTransformer, and another using the PatsyTransformer.
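To make the picklability point concrete, here is a small hedged sketch showing that a Patsy-free pipeline serializes cleanly with joblib (the toy data and file name are made up for illustration):

# Sketch: a pipeline without PatsyTransformer can be persisted with joblib,
# whereas a pipeline containing PatsyTransformer cannot be pickled.
import joblib
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from arfs.feature_selection.lasso import LassoFeatureSelection

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"x{i}" for i in range(5)])

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("selector", LassoFeatureSelection(n_iterations=10, score="bic")),
    ]
).fit(X_demo, y_demo)

joblib.dump(pipe, "lasso_selector_pipeline.joblib")
restored = joblib.load("lasso_selector_pipeline.joblib")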

[22]:
from arfs.utils import _make_corr_dataset_regression

X, y, w = _make_corr_dataset_regression(size=2_000)
# drop the dummy categorical variable with as many levels as rows
# and the zero-variance variable
X = X.drop(columns=["var10", "var11"])
[23]:
# select all columns considered categorical based on their dtype
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as columns_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder

cat_features = columns_selector(dtype_include=["category", "object", "boolean"])
num_features = columns_selector(dtype_include=np.number)

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(missing_values=np.nan, strategy="median")), ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="Missing_Value")),
           ("encoder", OneHotEncoder(handle_unknown='ignore', sparse_output=False))])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")



# Create the pipeline
pipeline = Pipeline(
    [
        ("preprocess", preprocessor),
        (
            "selector",
            LassoFeatureSelection(
                n_iterations=10, score="bic", family="gaussian", fit_intercept=True
            ),
        ),
    ]
)

# Fit the model
pipeline.fit(X, y)  # , selector__sample_weight=w)
Elastic net fitting did not converge
Elastic net fitting did not converge
Elastic net fitting did not converge
Elastic net fitting did not converge
[23]:
Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F605EA590>),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='Missing_Value',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F605EBF70>)],
                                   verbose_feature_names_out=False)),
                ('selector', LassoFeatureSelection())])
[24]:
arfsfs.make_fs_summary(pipeline)
[24]:
  predictor preprocess selector
0 var0 nan 1
1 var1 nan 1
2 var2 nan 1
3 var3 nan 1
4 var4 nan 1
5 var5 nan 1
6 var6 nan 0
7 var7 nan 0
8 var8 nan 0
9 var9 nan 0
10 var12 nan 0
11 nice_guys nan nan
[25]:
pipeline.named_steps["selector"].get_feature_names_out()
[25]:
Index(['Intercept', 'var0', 'var1', 'var2', 'var3', 'var4', 'var5'], dtype='object')
[26]:
pipeline.named_steps["selector"].feature_names_in_
[26]:
Index(['Intercept', 'var0', 'var1', 'var2', 'var3', 'var4', 'var5', 'var6',
       'var7', 'var8', 'var9', 'var12', 'nice_guys_Alien', 'nice_guys_Bane',
       'nice_guys_Bejita', 'nice_guys_Bender', 'nice_guys_Bias',
       'nice_guys_Blackbeard', 'nice_guys_Cartman', 'nice_guys_Cell',
       'nice_guys_Coldplay', 'nice_guys_Creationist', 'nice_guys_Dracula',
       'nice_guys_Drago', 'nice_guys_Excel', 'nice_guys_Fry',
       'nice_guys_Garry', 'nice_guys_Geoffrey', 'nice_guys_Goldfinder',
       'nice_guys_Gruber', 'nice_guys_Human', 'nice_guys_Imaginedragons',
       'nice_guys_KeyserSoze', 'nice_guys_Klaue', 'nice_guys_Krueger',
       'nice_guys_Lecter', 'nice_guys_Luthor', 'nice_guys_MarkZ',
       'nice_guys_Morty', 'nice_guys_Platist', 'nice_guys_Rick',
       'nice_guys_SAS', 'nice_guys_Scrum', 'nice_guys_Terminator',
       'nice_guys_Thanos', 'nice_guys_Tinkywinky', 'nice_guys_Vador',
       'nice_guys_Variance'],
      dtype='object')

Using the Patsy transformer#

[27]:
cat_features = columns_selector(dtype_include=["category", "object", "boolean"])
num_features = columns_selector(dtype_include=np.number)

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(missing_values=np.nan, strategy="median")), ("scaler", StandardScaler())]
)

# either drop the NA or impute them
imputer = ColumnTransformer(
    transformers=[
        (
            "cat",
            SimpleImputer(
                missing_values=np.nan, strategy="constant", fill_value="Missing_Value"
            ),
            cat_features,
        ),
        ("num", numeric_transformer, num_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Create the pipeline
pipeline = Pipeline(
    [
        ("imputer", imputer),
        ("preprocess", PatsyTransformer()),
        (
            "selector",
            LassoFeatureSelection(
                n_iterations=10, score="bic", family="gaussian", fit_intercept=True
            ),
        ),
    ]
)

# Fit the model
pipeline.fit(X, y)  # , selector__sample_weight=w)

Elastic net fitting did not converge
Elastic net fitting did not converge
[27]:
Pipeline(steps=[('imputer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat',
                                                  SimpleImputer(fill_value='Missing_Value',
                                                                strategy='constant'),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F5F227BB0>),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F5F227400>)],
                                   verbose_feature_names_out=False)),
                ('preprocess',
                 PatsyTransformer(formula='nice_guys + var0 + var1 + var12 + '
                                          'var2 + var3 + var4 + var5 + var6 + '
                                          'var7 + var8 + var9')),
                ('selector', LassoFeatureSelection())])
[28]:
f = plot_y_vs_X(X, y, ncols=7, figsize=(16, 4))
../_images/notebooks_lasso_feature_selection_37_0.png

The genuine predictors are: var0, var1, var2, var3, var4 and var12.

[29]:
X.head()
[29]:
var0 var1 var2 var3 var4 var5 var6 var7 var8 var9 var12 nice_guys
0 1.091810 0.489242 -1.235042 1.443517 1.084685 0.716430 2 0 0.532132 1 2.212539 Krueger
1 0.976880 0.235321 -1.033447 0.569663 1.017232 1.155733 1 1 0.147567 1 0.919993 KeyserSoze
2 1.161955 0.422190 -1.249014 1.210296 1.218725 -3.307900 1 0 -0.295357 2 1.506035 Klaue
3 1.335830 0.300262 -1.282522 1.693665 1.332802 -0.151363 1 0 -0.740745 3 NaN Bane
4 0.974240 0.107206 -0.713734 0.911713 1.107156 -2.875773 2 0 0.112322 1 1.971962 Krueger
[30]:
arfsfs.make_fs_summary(pipeline)
[30]:
  predictor imputer preprocess selector
0 var0 nan nan 1
1 var1 nan nan 1
2 var2 nan nan 1
3 var3 nan nan 1
4 var4 nan nan 1
5 var5 nan nan 1
6 var6 nan nan 0
7 var7 nan nan 0
8 var8 nan nan 0
9 var9 nan nan 0
10 var12 nan nan 1
11 nice_guys nan nan nan
[31]:
pipeline.named_steps["selector"].get_feature_names_out()
[31]:
Index(['Intercept', 'var0', 'var1', 'var12', 'var2', 'var3', 'var4', 'var5'], dtype='object')
[32]:
pipeline.named_steps["selector"].support_
[32]:
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True,  True,  True,  True,  True,  True,  True, False, False,
       False, False])
[33]:
X_trans = pipeline.transform(X)
X_trans.head()
[33]:
Intercept var0 var1 var12 var2 var3 var4 var5
0 1.0 0.410244 0.579520 0.515223 -0.785263 0.844345 -0.518385 0.683297
1 1.0 -0.148030 -0.373188 -1.496868 -0.068675 -1.102092 -0.803974 1.120786
2 1.0 0.750973 0.327939 -0.584584 -0.834928 0.324865 0.049133 -3.324418
3 1.0 1.595568 -0.129529 -0.083170 -0.954037 1.401527 0.532125 -0.180913
4 1.0 -0.160855 -0.853871 0.140719 1.067778 -0.340203 -0.423244 -2.894075
[34]:
pipeline.named_steps["selector"].feature_names_in_
[34]:
Index(['Intercept', 'nice_guys[T.Bane]', 'nice_guys[T.Bejita]',
       'nice_guys[T.Bender]', 'nice_guys[T.Bias]', 'nice_guys[T.Blackbeard]',
       'nice_guys[T.Cartman]', 'nice_guys[T.Cell]', 'nice_guys[T.Coldplay]',
       'nice_guys[T.Creationist]', 'nice_guys[T.Dracula]',
       'nice_guys[T.Drago]', 'nice_guys[T.Excel]', 'nice_guys[T.Fry]',
       'nice_guys[T.Garry]', 'nice_guys[T.Geoffrey]',
       'nice_guys[T.Goldfinder]', 'nice_guys[T.Gruber]', 'nice_guys[T.Human]',
       'nice_guys[T.Imaginedragons]', 'nice_guys[T.KeyserSoze]',
       'nice_guys[T.Klaue]', 'nice_guys[T.Krueger]', 'nice_guys[T.Lecter]',
       'nice_guys[T.Luthor]', 'nice_guys[T.MarkZ]', 'nice_guys[T.Morty]',
       'nice_guys[T.Platist]', 'nice_guys[T.Rick]', 'nice_guys[T.SAS]',
       'nice_guys[T.Scrum]', 'nice_guys[T.Terminator]', 'nice_guys[T.Thanos]',
       'nice_guys[T.Tinkywinky]', 'nice_guys[T.Vador]',
       'nice_guys[T.Variance]', 'var0', 'var1', 'var12', 'var2', 'var3',
       'var4', 'var5', 'var6', 'var7', 'var8', 'var9'],
      dtype='object')
[35]:
pipeline.named_steps["selector"].best_estimator_.summary()
[35]:
Generalized Linear Model Regression Results
Dep. Variable: y No. Observations: 2000
Model: GLM Df Residuals: 1992
Model Family: Gaussian Df Model: 8
Link Function: Identity Scale: 0.0021507
Method: elastic_net Log-Likelihood: 3308.1
Date: Fri, 11 Aug 2023 Deviance: 4.2841
Time: 10:44:50 Pearson chi2: 4.28
No. Iterations: 49 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 1.0090 0.001 973.031 0.000 1.007 1.011
nice_guys[T.Bane] 0 0 nan nan 0 0
nice_guys[T.Bejita] 0 0 nan nan 0 0
nice_guys[T.Bender] 0 0 nan nan 0 0
nice_guys[T.Bias] 0 0 nan nan 0 0
nice_guys[T.Blackbeard] 0 0 nan nan 0 0
nice_guys[T.Cartman] 0 0 nan nan 0 0
nice_guys[T.Cell] 0 0 nan nan 0 0
nice_guys[T.Coldplay] 0 0 nan nan 0 0
nice_guys[T.Creationist] 0 0 nan nan 0 0
nice_guys[T.Dracula] 0 0 nan nan 0 0
nice_guys[T.Drago] 0 0 nan nan 0 0
nice_guys[T.Excel] 0 0 nan nan 0 0
nice_guys[T.Fry] 0 0 nan nan 0 0
nice_guys[T.Garry] 0 0 nan nan 0 0
nice_guys[T.Geoffrey] 0 0 nan nan 0 0
nice_guys[T.Goldfinder] 0 0 nan nan 0 0
nice_guys[T.Gruber] 0 0 nan nan 0 0
nice_guys[T.Human] 0 0 nan nan 0 0
nice_guys[T.Imaginedragons] 0 0 nan nan 0 0
nice_guys[T.KeyserSoze] 0 0 nan nan 0 0
nice_guys[T.Klaue] 0 0 nan nan 0 0
nice_guys[T.Krueger] 0 0 nan nan 0 0
nice_guys[T.Lecter] 0 0 nan nan 0 0
nice_guys[T.Luthor] 0 0 nan nan 0 0
nice_guys[T.MarkZ] 0 0 nan nan 0 0
nice_guys[T.Morty] 0 0 nan nan 0 0
nice_guys[T.Platist] 0 0 nan nan 0 0
nice_guys[T.Rick] 0 0 nan nan 0 0
nice_guys[T.SAS] 0 0 nan nan 0 0
nice_guys[T.Scrum] 0 0 nan nan 0 0
nice_guys[T.Terminator] 0 0 nan nan 0 0
nice_guys[T.Thanos] 0 0 nan nan 0 0
nice_guys[T.Tinkywinky] 0 0 nan nan 0 0
nice_guys[T.Vador] 0 0 nan nan 0 0
nice_guys[T.Variance] 0 0 nan nan 0 0
var0 0.1400 0.002 63.092 0.000 0.136 0.144
var1 0.0022 0.001 2.065 0.039 0.000 0.004
var12 0.0038 0.001 3.170 0.002 0.001 0.006
var2 -0.0152 0.001 -10.635 0.000 -0.018 -0.012
var3 0.0411 0.002 20.080 0.000 0.037 0.045
var4 0.0053 0.001 4.606 0.000 0.003 0.008
var5 -0.0014 0.001 -1.379 0.168 -0.003 0.001
var6 0 0 nan nan 0 0
var7 0 0 nan nan 0 0
var8 0 0 nan nan 0 0
var9 0 0 nan nan 0 0

Advanced approach#

We can auto-bin the numerical predictors and group the levels of the categorical ones. The non-informative predictors will not be binned, and their values will be np.nan or np.inf.

[36]:
from arfs.preprocessing import TreeDiscretizer, IntervalToMidpoint
from arfs.feature_selection import UniqueValuesThreshold, MissingValueThreshold

cat_features = columns_selector(dtype_include=["category", "object", "boolean"])
num_features = columns_selector(dtype_include=np.number)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")


# Create the pipeline
pipeline = Pipeline(
    [
        (
            "disctretizer",
            TreeDiscretizer(bin_features="all", n_bins=10, boost_params={"min_split_gain": 0.05}),
        ),
        ("midpointer", IntervalToMidpoint()),
        ("zero_variance", UniqueValuesThreshold()),
        # the treediscretization might introduce NaN for pure noise columns
        ("missing", MissingValueThreshold(0.05)),
        ("preprocess", preprocessor),
        (
            "selector",
            LassoFeatureSelection(
                n_iterations=10, score="bic", family="gaussian", fit_intercept=True
            ),
        ),
    ]
)

pipeline.fit(X, y)

Elastic net fitting did not converge
[36]:
Pipeline(steps=[('discretizer',
                 TreeDiscretizer(bin_features=['var0', 'var1', 'var2', 'var3',
                                               'var4', 'var5', 'var6', 'var7',
                                               'var8', 'var9', 'var12',
                                               'nice_guys'],
                                 boost_params={'max_leaf': 10,
                                               'min_split_gain': 0.05,
                                               'num_boost_round': 1,
                                               'objective': 'RMSE'})),
                ('midpointer',
                 IntervalToMidpoint(cols=Index(['var0', 'var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'var7', 'var...
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num', StandardScaler(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F5F00F160>),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F60C12D40>)],
                                   verbose_feature_names_out=False)),
                ('selector', LassoFeatureSelection())])
[37]:
arfsfs.make_fs_summary(pipeline)
[37]:
  predictor discretizer midpointer zero_variance missing preprocess selector
0 var0 nan nan 1 1 nan 1
1 var1 nan nan 1 1 nan 0
2 var2 nan nan 1 1 nan 1
3 var3 nan nan 1 1 nan 1
4 var4 nan nan 1 1 nan 1
5 var5 nan nan 1 1 nan 0
6 var6 nan nan 0 nan nan nan
7 var7 nan nan 0 nan nan nan
8 var8 nan nan 1 1 nan 0
9 var9 nan nan 0 nan nan nan
10 var12 nan nan 1 1 nan 0
11 nice_guys nan nan 1 1 nan nan
[38]:
pipeline.named_steps["selector"].get_feature_names_out()
[38]:
Index(['Intercept', 'var0', 'var2', 'var3', 'var4'], dtype='object')
[39]:
pipeline.transform(X).head()
[39]:
Intercept var0 var2 var3 var4
0 1.0 0.709677 -1.206221 0.715412 -0.378512
1 1.0 -0.224358 -0.151710 -1.149747 -1.330565
2 1.0 0.709677 -1.206221 0.175806 0.525938
3 1.0 1.845666 -1.206221 1.384054 0.954361
4 1.0 -0.224358 1.459900 -0.352069 -0.378512
[40]:
pipeline.named_steps["selector"].best_estimator_.summary()
[40]:
Generalized Linear Model Regression Results
Dep. Variable: y No. Observations: 2000
Model: GLM Df Residuals: 1995
Model Family: Gaussian Df Model: 5
Link Function: Identity Scale: 0.0027178
Method: elastic_net Log-Likelihood: 3072.6
Date: Fri, 11 Aug 2023 Deviance: 5.4220
Time: 10:44:52 Pearson chi2: 5.42
No. Iterations: 48 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 1.0090 0.001 865.576 0.000 1.007 1.011
var0 0.1290 0.002 53.132 0.000 0.124 0.134
var1 0 0 nan nan 0 0
var2 -0.0188 0.002 -11.788 0.000 -0.022 -0.016
var3 0.0477 0.002 21.737 0.000 0.043 0.052
var4 0.0104 0.001 7.293 0.000 0.008 0.013
var5 0 0 nan nan 0 0
var8 0 0 nan nan 0 0
var12 0 0 nan nan 0 0
nice_guys_Alien 0 0 nan nan 0 0
nice_guys_Bane / Bejita 0 0 nan nan 0 0
nice_guys_Bender / Cartman / Cell / Bias / Blackbeard 0 0 nan nan 0 0
nice_guys_Creationist / Coldplay 0 0 nan nan 0 0
nice_guys_Fry / Drago / Excel / Garry / Dracula 0 0 nan nan 0 0
nice_guys_Geoffrey 0 0 nan nan 0 0
nice_guys_Goldfinder 0 0 nan nan 0 0
nice_guys_Gruber 0 0 nan nan 0 0
nice_guys_Imaginedragons / Human 0 0 nan nan 0 0
nice_guys_Krueger / KeyserSoze / Klaue / SAS / Terminator / Luthor / Thanos / Variance / Platist / Scrum / MarkZ / Morty / Tinkywinky / Lecter / Vador / Rick 0 0 nan nan 0 0