Lasso Regularized GLM for Feature Selection in Python#

Introduction#

Welcome to our tutorial on using the Lasso Regularized Generalized Linear Model (GLM) for feature selection in Python! Feature selection is a critical step in machine learning, helping us identify the most important variables from a dataset, making models more accurate and easier to understand.

Description#

The Lasso Regularized GLM is a powerful variant of the traditional GLM: it adds an L1 penalty (lasso regularization) to the fitting objective. This penalty encourages some coefficients to become exactly zero, effectively excluding irrelevant features from the model. As a result, the Lasso Regularized GLM is an excellent tool for feature selection, especially in datasets with many variables.
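For reference, the penalized objective can be written as follows (standard lasso-GLM notation, with $\ell$ the log-likelihood of the chosen family, $\alpha$ the regularization strength and $p$ the number of predictors):

$$\hat{\beta} = \underset{\beta}{\arg\min}\; -\frac{1}{n}\,\ell(\beta; X, y) + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert$$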

In this tutorial, we’ll build a user-friendly Python class called “LassoFeatureSelection” that harnesses the Lasso Regularized GLM to perform feature selection. The class can handle different types of data, like pandas DataFrames and numpy arrays, making it super flexible and compatible with various data formats.

Please note that one limitation of the lasso is that it treats each level of a categorical predictor as a separate feature. This can be addressed with the TreeDiscretizer, which automatically bins numerical variables and groups the levels of categorical variables (see the sketch below and the Advanced approach section).
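As a minimal sketch (using the toy regression dataset from arfs that appears later in this tutorial, and parameter values mirroring the Advanced approach section), grouping the levels of a categorical column could look like this:

# Hedged sketch: group the levels of a categorical predictor with TreeDiscretizer
# before running the lasso. The dataset helper and parameter values are borrowed
# from the sections below; treat the exact settings as indicative only.
from arfs.preprocessing import TreeDiscretizer
from arfs.utils import _make_corr_dataset_regression

X_demo, y_demo, _ = _make_corr_dataset_regression(size=1_000)
disc = TreeDiscretizer(bin_features=["nice_guys"], n_bins=5)
X_binned = disc.fit_transform(X_demo, y_demo)
print(X_binned["nice_guys"].unique())  # levels merged into a handful of groups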

Key Features of the “LassoFeatureSelection” Class#

  1. Automated Intercept Handling: Our class automatically adds an intercept column to the input data if it’s missing. It then checks if the intercept is statistically significant and keeps it only if necessary.

  2. Model Selection: We use a grid-search cross-validation approach to find the best regularization parameter (alpha) for the Lasso Regularized GLM. Users can choose between two scoring options: the Bayesian Information Criterion (BIC) or the mean cross-validation score (deviance). The BIC option is much faster because it avoids cross-validation (see the conceptual sketch after this list).

  3. Transparent Feature Selection: After fitting the “LassoFeatureSelection” model, you can access the selected features. This lets you understand which variables are the most crucial in the final model.
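As a conceptual aside (not the internal implementation of "LassoFeatureSelection"), the idea of picking the regularization strength with an information criterion instead of cross-validation can be illustrated, for the plain Gaussian case, with scikit-learn's LassoLarsIC:

# Conceptual sketch only: BIC-driven choice of alpha for a Gaussian lasso.
# LassoFeatureSelection relies on a penalized GLM internally; this merely shows
# why an information criterion is cheaper than cross-validation (a single path fit
# instead of K refits).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

X_demo, y_demo = make_regression(n_samples=500, n_features=10, n_informative=3, noise=1.0, random_state=0)

bic_lasso = LassoLarsIC(criterion="bic").fit(X_demo, y_demo)
print("alpha selected by BIC:", bic_lasso.alpha_)
print("indices of non-zero coefficients:", np.flatnonzero(bic_lasso.coef_))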

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

from mpl_toolkits.axes_grid1.inset_locator import inset_axes

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_squared_error

from arfs.feature_selection.lasso import LassoFeatureSelection
from arfs.preprocessing import PatsyTransformer
import arfs.feature_selection as arfsfs


def plot_y_vs_X(
    X: pd.DataFrame, y: pd.Series, ncols: int = 2, figsize: tuple = (10, 10)
) -> plt.Figure:
    """
    Create subplots of scatter plots showing the relationship between each column in X and the target variable y.

    Parameters
    ----------
        X : pd.DataFrame
            The input DataFrame containing the predictor variables.
        y : pd.Series
            The target variable to be plotted against.
        ncols : int, optional (default: 2)
            The number of columns in the subplot grid.
        figsize : tuple, optional (default: (10, 10))
            The size of the figure (width, height) in inches.

    Returns
    -------
        plt.Figure
            The generated Figure object containing the scatter plots.
    """

    # determine the grid size: add an extra row if the columns do not divide evenly
    n_rows, remainder = divmod(X.shape[1], ncols)
    n_rows += int(remainder > 0)

    # Create figure and axes
    f, axs = plt.subplots(nrows=n_rows, ncols=ncols, figsize=figsize)

    # Iterate through columns and plot against y
    for ax, col in zip(axs.flat, X.columns):
        ax.scatter(X[col], y, alpha=0.1)
        ax.set_title(col)

    # Remove any unused subplots
    for ax in axs.flat[len(X.columns) :]:
        ax.set_axis_off()

    # Display the figure
    plt.tight_layout()
    return f

Linear model#

The main goal here is not to build a detailed model for the data, but rather to identify the most important predictors. We will check how well the feature selection process can distinguish between meaningful predictors and random noise.

[2]:
bias = 7.0
X, y, true_coef = make_regression(
    n_samples=2_000,
    n_features=10,
    n_informative=5,
    noise=1,
    random_state=8,
    bias=bias,
    coef=True,
)
X = pd.DataFrame(X)
X.columns = [f"pred_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

print(f"The true coefficient of the linear data generating process are:\n {true_coef}")
f = plot_y_vs_X(X_train, y_train, ncols=5, figsize=(15, 5))
The true coefficients of the linear data generating process are:
 [ 0.         77.90364383  0.          0.         63.70230035  0.
 95.59499715 43.86990345  0.          4.11861919]
../_images/notebooks_lasso_feature_selection_3_1.png
[3]:
# family should be: "gaussian", "poisson", "gamma", "negativebinomial", "binomial", "tweedie"
selector = LassoFeatureSelection(n_iterations=10, family="gaussian", score="bic")
selector.fit(X=X_train, y=y_train, sample_weight=None)
selector.get_feature_names_out()
[3]:
Index(['Intercept', 'pred_1', 'pred_4', 'pred_6', 'pred_7', 'pred_9'], dtype='object')
[4]:
true_coef = pd.Series(true_coef)
true_coef.index = X.columns
true_coef = pd.Series({**{"intercept": bias}, **true_coef})
true_coef
[4]:
intercept     7.000000
pred_0        0.000000
pred_1       77.903644
pred_2        0.000000
pred_3        0.000000
pred_4       63.702300
pred_5        0.000000
pred_6       95.594997
pred_7       43.869903
pred_8        0.000000
pred_9        4.118619
dtype: float64
[5]:
selector.transform(X_test).head()
[5]:
Intercept pred_1 pred_4 pred_6 pred_7 pred_9
1860 1.0 1.091750 -0.015868 0.231947 -0.998250 0.041597
353 1.0 -0.871767 2.557578 -0.026118 0.936062 0.533048
1333 1.0 1.147664 0.678622 -0.313260 -1.717187 0.228773
905 1.0 2.247261 -0.008657 0.354843 0.151635 2.047569
1289 1.0 -0.061458 -1.429949 -1.012678 -1.070863 0.768713
[6]:
selector.feature_names_in_
[6]:
Index(['Intercept', 'pred_0', 'pred_1', 'pred_2', 'pred_3', 'pred_4', 'pred_5',
       'pred_6', 'pred_7', 'pred_8', 'pred_9'],
      dtype='object')
[7]:
selector.support_
[7]:
array([ True, False,  True, False, False,  True, False,  True,  True,
       False,  True])

The exciting part is that the lasso can still perform remarkably well despite potential mismatches between the assumed distribution and the data: even if the data does not perfectly follow the assumed family, the lasso can successfully identify relevant predictors and filter out irrelevant ones.

[8]:
selector.get_feature_names_out()
[8]:
Index(['Intercept', 'pred_1', 'pred_4', 'pred_6', 'pred_7', 'pred_9'], dtype='object')

We can even inspect the values of the coefficients, even though this is not the intended use (the ultimate output is given by the method .get_feature_names_out()).

[9]:
selector.best_estimator_.summary()
[9]:
Generalized Linear Model Regression Results
Dep. Variable: y No. Observations: 1000
Model: GLM Df Residuals: 994
Model Family: Gaussian Df Model: 6
Link Function: Identity Scale: 1.0435
Method: elastic_net Log-Likelihood: -1437.2
Date: Fri, 11 Aug 2023 Deviance: 1037.2
Time: 10:44:34 Pearson chi2: 1.04e+03
No. Iterations: 10 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 7.0127 0.032 215.952 0.000 6.949 7.076
pred_0 0 0 nan nan 0 0
pred_1 77.8866 0.033 2375.009 0.000 77.822 77.951
pred_2 0 0 nan nan 0 0
pred_3 0 0 nan nan 0 0
pred_4 63.7313 0.033 1958.221 0.000 63.667 63.795
pred_5 0 0 nan nan 0 0
pred_6 95.5690 0.032 2959.937 0.000 95.506 95.632
pred_7 43.9089 0.033 1325.779 0.000 43.844 43.974
pred_8 0 0 nan nan 0 0
pred_9 4.0589 0.033 123.946 0.000 3.995 4.123

A toy example using the selector in a pipeline#

[10]:
# Create a pipeline with LassoFeatureSelection and LinearRegression
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("selector", LassoFeatureSelection(n_iterations=10, score="bic")),
        ("linear_regression", LinearRegression()),
    ]
)

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
Mean Squared Error: 0.9917846740502078
[11]:
arfsfs.make_fs_summary(pipeline)
[11]:
  predictor scaler selector linear_regression
0 pred_0 nan 0 nan
1 pred_1 nan 1 nan
2 pred_2 nan 0 nan
3 pred_3 nan 0 nan
4 pred_4 nan 1 nan
5 pred_5 nan 0 nan
6 pred_6 nan 1 nan
7 pred_7 nan 1 nan
8 pred_8 nan 0 nan
9 pred_9 nan 1 nan
[12]:
pipeline.named_steps["selector"].get_feature_names_out()
[12]:
Index(['Intercept', 'pred_1', 'pred_4', 'pred_6', 'pred_7', 'pred_9'], dtype='object')

Beyond the feature selection quality, we can also check whether the model itself is good.

[13]:
import seaborn as sns

# Plot the predictions and residuals
plt.figure(figsize=(10, 5))

# Plot predictions
plt.subplot(1, 2, 1)
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot(
    [min(y_test), max(y_test)],
    [min(y_test), max(y_test)],
    color="#d1ae11",
    linestyle="--",
)
plt.xlabel("True Values")
plt.ylabel("Predictions")
plt.title("True Values vs. Predictions")

# Plot residuals
residuals = y_test - y_pred
plt.subplot(1, 2, 2)
plt.scatter(y=residuals, x=y_pred, alpha=0.5)
plt.ylabel("Residuals")
plt.xlabel("Prediction")
plt.title("Distribution of Residuals")

plt.tight_layout()
plt.show();
../_images/notebooks_lasso_feature_selection_18_0.png

Poisson toy example#

The family can be any of the typical GLM families: "gaussian", "poisson", "gamma", "negativebinomial", "binomial", "tweedie".
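For instance, switching to a Gamma-distributed target (e.g. a positive, right-skewed severity) only changes the family argument; a hedged one-liner using the same parameters as above:

# Same selector, different GLM family (illustrative only)
from arfs.feature_selection.lasso import LassoFeatureSelection

gamma_selector = LassoFeatureSelection(n_iterations=10, family="gamma", score="bic")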

[14]:
# Generate synthetic data with Poisson-distributed target variable
bias = 1
X, y, true_coef = make_regression(
    n_samples=2_000,
    n_features=10,
    n_informative=5,
    noise=1,
    random_state=8,
    bias=bias,
    coef=True,
)
y = (y - y.mean()) / y.std()
y = np.exp(y)  # Transform to positive values for Poisson distribution
y = np.random.poisson(y)  # Add Poisson noise to the target variable
# dummy sample weight (e.g. exposure), smallest being 30 days
w = np.random.uniform(30 / 365, 1, size=len(y))
# make the count a Poisson rate (frequency)
y = y / w

X = pd.DataFrame(X)
X.columns = [f"pred_{i}" for i in range(X.shape[1])]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, w, test_size=0.5, random_state=42
)

print(f"The true coefficient of the linear data generating process are:\n {true_coef}")

f = plot_y_vs_X(X_train, y_train, ncols=5, figsize=(15, 5))
The true coefficients of the linear data generating process are:
 [ 0.         77.90364383  0.          0.         63.70230035  0.
 95.59499715 43.86990345  0.          4.11861919]
../_images/notebooks_lasso_feature_selection_20_1.png
[15]:
y
[15]:
array([0.        , 2.34937393, 4.39264421, ..., 1.51556199, 5.72062399,
       0.        ])
[16]:
# Create a pipeline with LassoFeatureSelection and PoissonRegressor
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        (
            "selector",
            LassoFeatureSelection(
                n_iterations=10, score="bic", family="poisson", fit_intercept=True
            ),
        ),
        ("glm", PoissonRegressor()),
    ]
)

# Fit the pipeline to the training data
pipeline.fit(
    X_train, y_train, selector__sample_weight=w_train, glm__sample_weight=w_train
)
[16]:
Pipeline(steps=[('scaler', StandardScaler()),
                ('selector', LassoFeatureSelection(family='poisson')),
                ('glm', PoissonRegressor())])
[17]:
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.axes_grid1.inset_locator import inset_axes

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Calculate the residuals
residuals = y_test - y_pred

# Plot the predictions and residuals
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

# Plot predictions
axs[0].scatter(y_test, y_pred, alpha=0.05)
axs[0].plot(
    [min(y_test), max(y_test)],
    [min(y_test), max(y_test)],
    color="#d1ae11",
    linestyle="--",
)
axs[0].set_xlabel("True Values")
axs[0].set_ylabel("Predictions")
axs[0].set_title("True Values vs. Predictions")


# Plot residuals
axs[1] = sns.histplot(residuals, kde=True, ax=axs[1])
axs[1].set_xlabel("Residuals")
axs[1].set_ylabel("Frequency")
axs[1].set_title("Distribution of Residuals")

plt.show();

../_images/notebooks_lasso_feature_selection_23_0.png
[18]:
true_coef = pd.Series(true_coef)
true_coef.index = X.columns
true_coef = pd.Series({**{"intercept": bias}, **true_coef})
true_coef
[18]:
intercept     1.000000
pred_0        0.000000
pred_1       77.903644
pred_2        0.000000
pred_3        0.000000
pred_4       63.702300
pred_5        0.000000
pred_6       95.594997
pred_7       43.869903
pred_8        0.000000
pred_9        4.118619
dtype: float64

the selected predictors are

[19]:
pipeline.named_steps["selector"].get_feature_names_out()
[19]:
Index(['Intercept', 'pred_1', 'pred_4', 'pred_6', 'pred_7'], dtype='object')
[20]:
arfsfs.make_fs_summary(pipeline)
[20]:
  predictor scaler selector glm
0 pred_0 nan 0 nan
1 pred_1 nan 1 nan
2 pred_2 nan 0 nan
3 pred_3 nan 0 nan
4 pred_4 nan 1 nan
5 pred_5 nan 0 nan
6 pred_6 nan 1 nan
7 pred_7 nan 1 nan
8 pred_8 nan 0 nan
9 pred_9 nan 0 nan
[21]:
pipeline.named_steps["selector"].best_estimator_.summary()
[21]:
Generalized Linear Model Regression Results
Dep. Variable: y No. Observations: 1000
Model: GLM Df Residuals: 995
Model Family: Poisson Df Model: 5
Link Function: Log Scale: 1.0000
Method: elastic_net Log-Likelihood: -1295.4
Date: Fri, 11 Aug 2023 Deviance: 1561.5
Time: 10:44:42 Pearson chi2: 2.28e+03
No. Iterations: 24 Pseudo R-squ. (CS): 0.7581
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 0.6199 0.035 17.503 0.000 0.550 0.689
pred_0 0 0 nan nan 0 0
pred_1 0.5340 0.026 20.859 0.000 0.484 0.584
pred_2 0 0 nan nan 0 0
pred_3 0 0 nan nan 0 0
pred_4 0.4129 0.026 16.176 0.000 0.363 0.463
pred_5 0 0 nan nan 0 0
pred_6 0.6096 0.026 23.814 0.000 0.559 0.660
pred_7 0.3059 0.026 11.553 0.000 0.254 0.358
pred_8 0 0 nan nan 0 0
pred_9 0 0 nan nan 0 0

Categorical variables, missing values and a mis-specified distribution#

When working with the PatsyTransformer, missing values in categorical variables need specific handling: we convert them into an explicit string level, because native NaN support is not yet fully implemented for general (object) numpy arrays.
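A minimal plain-pandas alternative to the SimpleImputer used in the pipelines below might look like the following (the helper name fill_categorical_na is hypothetical):

# Hypothetical helper: replace NaN in object/categorical columns by an explicit
# string level so that missingness becomes its own category for Patsy.
import pandas as pd


def fill_categorical_na(df: pd.DataFrame, fill_value: str = "Missing_Value") -> pd.DataFrame:
    df = df.copy()
    for col in df.select_dtypes(include=["object", "category"]).columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            # the placeholder must be a valid category before filling
            df[col] = df[col].cat.add_categories([fill_value])
        df[col] = df[col].fillna(fill_value)
    return df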

Keep in mind that the dummy dataset we are using in this context does not necessarily correspond to any specific distribution assumption, such as the exponential family. However, for the purpose of feature selection, this might not pose a significant obstacle. The feature selection process can still be effective in identifying relevant predictors, even when the data does not precisely follow a specific distribution pattern.

Classical approach#

Unfortunately, Patsy objects are not yet picklable, so I'll present two equivalent approaches: a picklable one based on a ColumnTransformer, and another using the PatsyTransformer.
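To make the picklability point concrete, here is a small hedged sketch showing that a Patsy-free pipeline serializes cleanly with joblib (the toy data and file name are made up for illustration):

# Sketch: a pipeline without PatsyTransformer can be persisted with joblib,
# whereas a pipeline containing PatsyTransformer cannot be pickled.
import joblib
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from arfs.feature_selection.lasso import LassoFeatureSelection

X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"x{i}" for i in range(5)])

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("selector", LassoFeatureSelection(n_iterations=10, score="bic")),
    ]
).fit(X_demo, y_demo)

joblib.dump(pipe, "lasso_selector_pipeline.joblib")
restored = joblib.load("lasso_selector_pipeline.joblib")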

[22]:
from arfs.utils import _make_corr_dataset_regression

X, y, w = _make_corr_dataset_regression(size=2_000)
# drop the dummy categorical variable with as many levels as rows
# and the zero-variance variable
X = X.drop(columns=["var10", "var11"])
[23]:
# select all columns considered categorical based on their dtype
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as columns_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder

cat_features = columns_selector(dtype_include=["category", "object", "boolean"])
num_features = columns_selector(dtype_include=np.number)

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(missing_values=np.nan, strategy="median")), ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(missing_values=np.nan, strategy="constant", fill_value="Missing_Value")),
           ("encoder", OneHotEncoder(handle_unknown='ignore', sparse_output=False))])


preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")



# Create the pipeline
pipeline = Pipeline(
    [
        ("preprocess", preprocessor),
        (
            "selector",
            LassoFeatureSelection(
                n_iterations=10, score="bic", family="gaussian", fit_intercept=True
            ),
        ),
    ]
)

# Fit the model
pipeline.fit(X, y)  # , selector__sample_weight=w)
Elastic net fitting did not converge
Elastic net fitting did not converge
Elastic net fitting did not converge
Elastic net fitting did not converge
[23]:
Pipeline(steps=[('preprocess',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F605EA590>),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='Missing_Value',
                                                                                 strategy='constant')),
                                                                  ('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse_output=False))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F605EBF70>)],
                                   verbose_feature_names_out=False)),
                ('selector', LassoFeatureSelection())])
[24]:
arfsfs.make_fs_summary(pipeline)
[24]:
  predictor preprocess selector
0 var0 nan 1
1 var1 nan 1
2 var2 nan 1
3 var3 nan 1
4 var4 nan 1
5 var5 nan 1
6 var6 nan 0
7 var7 nan 0
8 var8 nan 0
9 var9 nan 0
10 var12 nan 0
11 nice_guys nan nan
[25]:
pipeline.named_steps["selector"].get_feature_names_out()
[25]:
Index(['Intercept', 'var0', 'var1', 'var2', 'var3', 'var4', 'var5'], dtype='object')
[26]:
pipeline.named_steps["selector"].feature_names_in_
[26]:
Index(['Intercept', 'var0', 'var1', 'var2', 'var3', 'var4', 'var5', 'var6',
       'var7', 'var8', 'var9', 'var12', 'nice_guys_Alien', 'nice_guys_Bane',
       'nice_guys_Bejita', 'nice_guys_Bender', 'nice_guys_Bias',
       'nice_guys_Blackbeard', 'nice_guys_Cartman', 'nice_guys_Cell',
       'nice_guys_Coldplay', 'nice_guys_Creationist', 'nice_guys_Dracula',
       'nice_guys_Drago', 'nice_guys_Excel', 'nice_guys_Fry',
       'nice_guys_Garry', 'nice_guys_Geoffrey', 'nice_guys_Goldfinder',
       'nice_guys_Gruber', 'nice_guys_Human', 'nice_guys_Imaginedragons',
       'nice_guys_KeyserSoze', 'nice_guys_Klaue', 'nice_guys_Krueger',
       'nice_guys_Lecter', 'nice_guys_Luthor', 'nice_guys_MarkZ',
       'nice_guys_Morty', 'nice_guys_Platist', 'nice_guys_Rick',
       'nice_guys_SAS', 'nice_guys_Scrum', 'nice_guys_Terminator',
       'nice_guys_Thanos', 'nice_guys_Tinkywinky', 'nice_guys_Vador',
       'nice_guys_Variance'],
      dtype='object')

Using the Patsy transformer#

[27]:
cat_features = columns_selector(dtype_include=["category", "object", "boolean"])
num_features = columns_selector(dtype_include=np.number)

numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(missing_values=np.nan, strategy="median")), ("scaler", StandardScaler())]
)

# either drop the NA or impute them
imputer = ColumnTransformer(
    transformers=[
        (
            "cat",
            SimpleImputer(
                missing_values=np.nan, strategy="constant", fill_value="Missing_Value"
            ),
            cat_features,
        ),
        ("num", numeric_transformer, num_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Create the pipeline
pipeline = Pipeline(
    [
        ("imputer", imputer),
        ("preprocess", PatsyTransformer()),
        (
            "selector",
            LassoFeatureSelection(
                n_iterations=10, score="bic", family="gaussian", fit_intercept=True
            ),
        ),
    ]
)

# Fit the model
pipeline.fit(X, y)  # , selector__sample_weight=w)

Elastic net fitting did not converge
Elastic net fitting did not converge
[27]:
Pipeline(steps=[('imputer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('cat',
                                                  SimpleImputer(fill_value='Missing_Value',
                                                                strategy='constant'),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F5F227BB0>),
                                                 ('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F5F227400>)],
                                   verbose_feature_names_out=False)),
                ('preprocess',
                 PatsyTransformer(formula='nice_guys + var0 + var1 + var12 + '
                                          'var2 + var3 + var4 + var5 + var6 + '
                                          'var7 + var8 + var9')),
                ('selector', LassoFeatureSelection())])
[28]:
f = plot_y_vs_X(X, y, ncols=7, figsize=(16, 4))
../_images/notebooks_lasso_feature_selection_37_0.png

The genuine predictors are: var0, var1, var2, var3, var4 and var12.

[29]:
X.head()
[29]:
var0 var1 var2 var3 var4 var5 var6 var7 var8 var9 var12 nice_guys
0 1.091810 0.489242 -1.235042 1.443517 1.084685 0.716430 2 0 0.532132 1 2.212539 Krueger
1 0.976880 0.235321 -1.033447 0.569663 1.017232 1.155733 1 1 0.147567 1 0.919993 KeyserSoze
2 1.161955 0.422190 -1.249014 1.210296 1.218725 -3.307900 1 0 -0.295357 2 1.506035 Klaue
3 1.335830 0.300262 -1.282522 1.693665 1.332802 -0.151363 1 0 -0.740745 3 NaN Bane
4 0.974240 0.107206 -0.713734 0.911713 1.107156 -2.875773 2 0 0.112322 1 1.971962 Krueger
[30]:
arfsfs.make_fs_summary(pipeline)
[30]:
  predictor imputer preprocess selector
0 var0 nan nan 1
1 var1 nan nan 1
2 var2 nan nan 1
3 var3 nan nan 1
4 var4 nan nan 1
5 var5 nan nan 1
6 var6 nan nan 0
7 var7 nan nan 0
8 var8 nan nan 0
9 var9 nan nan 0
10 var12 nan nan 1
11 nice_guys nan nan nan
[31]:
pipeline.named_steps["selector"].get_feature_names_out()
[31]:
Index(['Intercept', 'var0', 'var1', 'var12', 'var2', 'var3', 'var4', 'var5'], dtype='object')
[32]:
pipeline.named_steps["selector"].support_
[32]:
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True,  True,  True,  True,  True,  True,  True, False, False,
       False, False])
[33]:
X_trans = pipeline.transform(X)
X_trans.head()
[33]:
Intercept var0 var1 var12 var2 var3 var4 var5
0 1.0 0.410244 0.579520 0.515223 -0.785263 0.844345 -0.518385 0.683297
1 1.0 -0.148030 -0.373188 -1.496868 -0.068675 -1.102092 -0.803974 1.120786
2 1.0 0.750973 0.327939 -0.584584 -0.834928 0.324865 0.049133 -3.324418
3 1.0 1.595568 -0.129529 -0.083170 -0.954037 1.401527 0.532125 -0.180913
4 1.0 -0.160855 -0.853871 0.140719 1.067778 -0.340203 -0.423244 -2.894075
[34]:
pipeline.named_steps["selector"].feature_names_in_
[34]:
Index(['Intercept', 'nice_guys[T.Bane]', 'nice_guys[T.Bejita]',
       'nice_guys[T.Bender]', 'nice_guys[T.Bias]', 'nice_guys[T.Blackbeard]',
       'nice_guys[T.Cartman]', 'nice_guys[T.Cell]', 'nice_guys[T.Coldplay]',
       'nice_guys[T.Creationist]', 'nice_guys[T.Dracula]',
       'nice_guys[T.Drago]', 'nice_guys[T.Excel]', 'nice_guys[T.Fry]',
       'nice_guys[T.Garry]', 'nice_guys[T.Geoffrey]',
       'nice_guys[T.Goldfinder]', 'nice_guys[T.Gruber]', 'nice_guys[T.Human]',
       'nice_guys[T.Imaginedragons]', 'nice_guys[T.KeyserSoze]',
       'nice_guys[T.Klaue]', 'nice_guys[T.Krueger]', 'nice_guys[T.Lecter]',
       'nice_guys[T.Luthor]', 'nice_guys[T.MarkZ]', 'nice_guys[T.Morty]',
       'nice_guys[T.Platist]', 'nice_guys[T.Rick]', 'nice_guys[T.SAS]',
       'nice_guys[T.Scrum]', 'nice_guys[T.Terminator]', 'nice_guys[T.Thanos]',
       'nice_guys[T.Tinkywinky]', 'nice_guys[T.Vador]',
       'nice_guys[T.Variance]', 'var0', 'var1', 'var12', 'var2', 'var3',
       'var4', 'var5', 'var6', 'var7', 'var8', 'var9'],
      dtype='object')
[35]:
pipeline.named_steps["selector"].best_estimator_.summary()
[35]:
Generalized Linear Model Regression Results
Dep. Variable: y No. Observations: 2000
Model: GLM Df Residuals: 1992
Model Family: Gaussian Df Model: 8
Link Function: Identity Scale: 0.0021507
Method: elastic_net Log-Likelihood: 3308.1
Date: Fri, 11 Aug 2023 Deviance: 4.2841
Time: 10:44:50 Pearson chi2: 4.28
No. Iterations: 49 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 1.0090 0.001 973.031 0.000 1.007 1.011
nice_guys[T.Bane] 0 0 nan nan 0 0
nice_guys[T.Bejita] 0 0 nan nan 0 0
nice_guys[T.Bender] 0 0 nan nan 0 0
nice_guys[T.Bias] 0 0 nan nan 0 0
nice_guys[T.Blackbeard] 0 0 nan nan 0 0
nice_guys[T.Cartman] 0 0 nan nan 0 0
nice_guys[T.Cell] 0 0 nan nan 0 0
nice_guys[T.Coldplay] 0 0 nan nan 0 0
nice_guys[T.Creationist] 0 0 nan nan 0 0
nice_guys[T.Dracula] 0 0 nan nan 0 0
nice_guys[T.Drago] 0 0 nan nan 0 0
nice_guys[T.Excel] 0 0 nan nan 0 0
nice_guys[T.Fry] 0 0 nan nan 0 0
nice_guys[T.Garry] 0 0 nan nan 0 0
nice_guys[T.Geoffrey] 0 0 nan nan 0 0
nice_guys[T.Goldfinder] 0 0 nan nan 0 0
nice_guys[T.Gruber] 0 0 nan nan 0 0
nice_guys[T.Human] 0 0 nan nan 0 0
nice_guys[T.Imaginedragons] 0 0 nan nan 0 0
nice_guys[T.KeyserSoze] 0 0 nan nan 0 0
nice_guys[T.Klaue] 0 0 nan nan 0 0
nice_guys[T.Krueger] 0 0 nan nan 0 0
nice_guys[T.Lecter] 0 0 nan nan 0 0
nice_guys[T.Luthor] 0 0 nan nan 0 0
nice_guys[T.MarkZ] 0 0 nan nan 0 0
nice_guys[T.Morty] 0 0 nan nan 0 0
nice_guys[T.Platist] 0 0 nan nan 0 0
nice_guys[T.Rick] 0 0 nan nan 0 0
nice_guys[T.SAS] 0 0 nan nan 0 0
nice_guys[T.Scrum] 0 0 nan nan 0 0
nice_guys[T.Terminator] 0 0 nan nan 0 0
nice_guys[T.Thanos] 0 0 nan nan 0 0
nice_guys[T.Tinkywinky] 0 0 nan nan 0 0
nice_guys[T.Vador] 0 0 nan nan 0 0
nice_guys[T.Variance] 0 0 nan nan 0 0
var0 0.1400 0.002 63.092 0.000 0.136 0.144
var1 0.0022 0.001 2.065 0.039 0.000 0.004
var12 0.0038 0.001 3.170 0.002 0.001 0.006
var2 -0.0152 0.001 -10.635 0.000 -0.018 -0.012
var3 0.0411 0.002 20.080 0.000 0.037 0.045
var4 0.0053 0.001 4.606 0.000 0.003 0.008
var5 -0.0014 0.001 -1.379 0.168 -0.003 0.001
var6 0 0 nan nan 0 0
var7 0 0 nan nan 0 0
var8 0 0 nan nan 0 0
var9 0 0 nan nan 0 0

Advanced approach#

We can auto-bin the numerical predictors and group the levels of the categorical ones. The non-informative predictors will not be binned, and their values will be np.nan or np.inf.

[36]:
from arfs.preprocessing import TreeDiscretizer, IntervalToMidpoint
from arfs.feature_selection import UniqueValuesThreshold, MissingValueThreshold

cat_features = columns_selector(dtype_include=["category", "object", "boolean"])
num_features = columns_selector(dtype_include=np.number)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_features),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")


# Create the pipeline
pipeline = Pipeline(
    [
        (
            "disctretizer",
            TreeDiscretizer(bin_features="all", n_bins=10, boost_params={"min_split_gain": 0.05}),
        ),
        ("midpointer", IntervalToMidpoint()),
        ("zero_variance", UniqueValuesThreshold()),
        # the treediscretization might introduce NaN for pure noise columns
        ("missing", MissingValueThreshold(0.05)),
        ("preprocess", preprocessor),
        (
            "selector",
            LassoFeatureSelection(
                n_iterations=10, score="bic", family="gaussian", fit_intercept=True
            ),
        ),
    ]
)

pipeline.fit(X, y)

Elastic net fitting did not converge
[36]:
Pipeline(steps=[('discretizer',
                 TreeDiscretizer(bin_features=['var0', 'var1', 'var2', 'var3',
                                               'var4', 'var5', 'var6', 'var7',
                                               'var8', 'var9', 'var12',
                                               'nice_guys'],
                                 boost_params={'max_leaf': 10,
                                               'min_split_gain': 0.05,
                                               'num_boost_round': 1,
                                               'objective': 'RMSE'})),
                ('midpointer',
                 IntervalToMidpoint(cols=Index(['var0', 'var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'var7', 'var...
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('num', StandardScaler(),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F5F00F160>),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore',
                                                                sparse_output=False),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x0000027F60C12D40>)],
                                   verbose_feature_names_out=False)),
                ('selector', LassoFeatureSelection())])
[37]:
arfsfs.make_fs_summary(pipeline)
[37]:
  predictor discretizer midpointer zero_variance missing preprocess selector
0 var0 nan nan 1 1 nan 1
1 var1 nan nan 1 1 nan 0
2 var2 nan nan 1 1 nan 1
3 var3 nan nan 1 1 nan 1
4 var4 nan nan 1 1 nan 1
5 var5 nan nan 1 1 nan 0
6 var6 nan nan 0 nan nan nan
7 var7 nan nan 0 nan nan nan
8 var8 nan nan 1 1 nan 0
9 var9 nan nan 0 nan nan nan
10 var12 nan nan 1 1 nan 0
11 nice_guys nan nan 1 1 nan nan
[38]:
pipeline.named_steps["selector"].get_feature_names_out()
[38]:
Index(['Intercept', 'var0', 'var2', 'var3', 'var4'], dtype='object')
[39]:
pipeline.transform(X).head()
[39]:
Intercept var0 var2 var3 var4
0 1.0 0.709677 -1.206221 0.715412 -0.378512
1 1.0 -0.224358 -0.151710 -1.149747 -1.330565
2 1.0 0.709677 -1.206221 0.175806 0.525938
3 1.0 1.845666 -1.206221 1.384054 0.954361
4 1.0 -0.224358 1.459900 -0.352069 -0.378512
[40]:
pipeline.named_steps["selector"].best_estimator_.summary()
[40]:
Generalized Linear Model Regression Results
Dep. Variable: y No. Observations: 2000
Model: GLM Df Residuals: 1995
Model Family: Gaussian Df Model: 5
Link Function: Identity Scale: 0.0027178
Method: elastic_net Log-Likelihood: 3072.6
Date: Fri, 11 Aug 2023 Deviance: 5.4220
Time: 10:44:52 Pearson chi2: 5.42
No. Iterations: 48 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 1.0090 0.001 865.576 0.000 1.007 1.011
var0 0.1290 0.002 53.132 0.000 0.124 0.134
var1 0 0 nan nan 0 0
var2 -0.0188 0.002 -11.788 0.000 -0.022 -0.016
var3 0.0477 0.002 21.737 0.000 0.043 0.052
var4 0.0104 0.001 7.293 0.000 0.008 0.013
var5 0 0 nan nan 0 0
var8 0 0 nan nan 0 0
var12 0 0 nan nan 0 0
nice_guys_Alien 0 0 nan nan 0 0
nice_guys_Bane / Bejita 0 0 nan nan 0 0
nice_guys_Bender / Cartman / Cell / Bias / Blackbeard 0 0 nan nan 0 0
nice_guys_Creationist / Coldplay 0 0 nan nan 0 0
nice_guys_Fry / Drago / Excel / Garry / Dracula 0 0 nan nan 0 0
nice_guys_Geoffrey 0 0 nan nan 0 0
nice_guys_Goldfinder 0 0 nan nan 0 0
nice_guys_Gruber 0 0 nan nan 0 0
nice_guys_Imaginedragons / Human 0 0 nan nan 0 0
nice_guys_Krueger / KeyserSoze / Klaue / SAS / Terminator / Luthor / Thanos / Variance / Platist / Scrum / MarkZ / Morty / Tinkywinky / Lecter / Vador / Rick 0 0 nan nan 0 0