Using sklearn data pre-processing pipelines inside LightningDataModule #19807

tiefenthaler · 2024-04-23T20:03:19Z

I wonder whether it makes sense to use sklearn pipelines for data pre-processing within the LightningDataModule.
I am a big fan of sklearn pipelines because they structure the code cleanly and make it easy to apply pre-processing steps correctly when splitting data into train, val and test sets. Besides classical ML models, I am increasingly using NNs for tabular data and have had great results for some use cases. Some use cases require more individual handling of a variety of features.
PyTorch Lightning uses the LightningDataModule so that the data can be used efficiently for training NNs. The LightningDataModule encapsulates the preprocessing, such as shuffling, train-val-test splits and transformations (categorical encoding, normalization, etc.), that any dataframe should undergo before being fed into the dataloaders (e.g. train_dataloader, val_dataloader, test_dataloader, predict_dataloader). To me it makes sense to use sklearn pipelines to define those data pre-processing steps (categorical encoding, normalization, etc.).
But I have not seen anyone using sklearn pipelines in this context before. I wondered whether "PyTorch Tabular" uses sklearn pipelines, but it does not. It defines a separate method for preprocessing instead, which does the same job as described above and of course uses sklearn functions.
Is there a reason not to use sklearn pipelines for this (e.g. conflicts with GPU acceleration, ...)?
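
To make the train/val point concrete, here is a minimal sketch of the pattern in question: the pipeline is fitted on the training split only and then reused unchanged on the validation split. The toy dataframes and the column names `age` and `city` are made up for illustration; `sparse_threshold=0.0` is used only so that the result is always a dense NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data, made up for illustration only.
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 50.0], "city": ["A", "B", "A", "B"]})
val = pd.DataFrame({"age": [40.0, np.nan], "city": ["B", "C"]})  # "C" never occurs in train

preprocess = ColumnTransformer(
    transformers=[
        ("number", Pipeline(steps=[
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), ["age"]),
        ("category", Pipeline(steps=[
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("one_hot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["city"]),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

# Medians, means, variances and categories are learned from train only ...
X_train = preprocess.fit_transform(train)
# ... and reused as-is for val, so no statistics leak from held-out data.
X_val = preprocess.transform(val)

print(X_train.shape, X_val.shape)
```

The unseen category "C" is encoded as an all-zero one-hot row because of `handle_unknown="ignore"`, and the missing validation age is filled with the median learned from the training split.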

Pseudo Code:

from typing import List, Optional

import lightning as L
import pandas as pd
import torch
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from torch.utils.data import DataLoader, TensorDataset

SEED = 42  # arbitrary value; any fixed seed works


class TabularDataModule(L.LightningDataModule):
    def __init__(
        self,
        data: pd.DataFrame,
        continuous_cols: Optional[List[str]] = None,
        categorical_cols: Optional[List[str]] = None,
        target: Optional[List[str]] = None,
        batch_size: int = 64,
    ):
        super().__init__()
        self.data = data
        self.categorical_cols = categorical_cols if categorical_cols else []
        self.continuous_cols = continuous_cols if continuous_cols else []
        self.target = target
        self.batch_size = batch_size

    ...

    def setup(self, stage: str):
        # Assumes the target is the last column of the dataframe.
        X = self.data.iloc[:, :-1]
        y = self.data.iloc[:, -1]

        # Generate train, val, test sets (70/10/20). sklearn has no three-way
        # split helper, so train_test_split is called twice.
        X_trainval, X_test, y_trainval, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=SEED
        )
        X_train, X_val, y_train, y_val = train_test_split(
            X_trainval, y_trainval, test_size=0.125,  # 0.125 * 0.8 = 0.1 of all rows
            stratify=y_trainval, random_state=SEED,
        )

        # DEFINE PREPROCESSING PIPELINE
        numerical_features = X.select_dtypes(include='number').columns.tolist()
        numeric_feature_pipeline = Pipeline(steps=[
            ('impute', SimpleImputer(strategy='median')),
            ('scale', StandardScaler())
        ])
        categorical_features = X.select_dtypes(exclude='number').columns.tolist()
        categorical_feature_pipeline = Pipeline(steps=[
            ('impute', SimpleImputer(strategy='most_frequent')),
            # `sparse_output` replaced `sparse` in scikit-learn 1.2
            ('one_hot', OneHotEncoder(handle_unknown='ignore', max_categories=None, sparse_output=False))
        ])
        self.preprocess_pipeline = ColumnTransformer(transformers=[
            ('number', numeric_feature_pipeline, numerical_features),
            ('category', categorical_feature_pipeline, categorical_features)
        ])
        self.label_encoder = LabelEncoder()

        # Fit on the training split only, then reuse the fitted transformers
        # for val and test so that no statistics leak from held-out data.
        self.preprocess_pipeline.fit(X_train)
        self.label_encoder.fit(y_train)

        def to_dataset(X_part, y_part):
            # The pipeline returns NumPy arrays; wrap them as tensors for the DataLoader.
            X_t = torch.as_tensor(self.preprocess_pipeline.transform(X_part), dtype=torch.float32)
            y_t = torch.as_tensor(self.label_encoder.transform(y_part), dtype=torch.long)
            return TensorDataset(X_t, y_t)

        self.data_train = to_dataset(X_train, y_train)
        self.data_val = to_dataset(X_val, y_val)
        self.data_test = to_dataset(X_test, y_test)

    def train_dataloader(self):
        return DataLoader(self.data_train, batch_size=self.batch_size, shuffle=True, drop_last=True)

    def val_dataloader(self):
        return DataLoader(self.data_val, batch_size=self.batch_size, shuffle=False)

    def test_dataloader(self):
        return DataLoader(self.data_test, batch_size=self.batch_size, shuffle=False)
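
On the GPU question, a hedged observation: standard sklearn transformers run on the CPU and return NumPy arrays, so one plausible arrangement is to transform once and hand the result to the dataloaders as tensors. The sketch below assumes exactly that; the random arrays are stand-ins for the output of a fitted pipeline, not real data.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the dense arrays a fitted sklearn pipeline would return.
X_train = np.random.default_rng(0).normal(size=(10, 3))
y_train = np.array([0, 1] * 5)

# Convert once on the CPU; features as float32, class labels as int64.
train_ds = TensorDataset(
    torch.as_tensor(X_train, dtype=torch.float32),
    torch.as_tensor(y_train, dtype=torch.long),
)
loader = DataLoader(train_ds, batch_size=4, shuffle=True, drop_last=True)

# 10 rows with batch_size=4 and drop_last=True give 2 full batches.
batches = list(loader)
features, labels = batches[0]
print(len(batches), features.shape, labels.dtype)
```

With this arrangement the sklearn step stays a one-off CPU cost, and the batches coming out of the dataloader are ordinary tensors that the Lightning Trainer can move to the training device.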

AhmedThahir · 2024-10-17T12:33:56Z

Any updates?
