Using sklearn data pre-processing pipelines inside LightningDataModule #19807
Unanswered
tiefenthaler asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment
Any updates?
I wonder whether it makes sense to use sklearn pipelines for data pre-processing within the LightningDataModule.
I am a big fan of sklearn pipelines: they give the code a clear structure and make it easy to apply pre-processing steps correctly when the data is split into train, val and test sets. Besides classical ML models, I am using NNs more and more for tabular data and have gotten great results for some use cases. Some of these use cases require more individual handling of a variety of features.
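To make "applying pre-processing steps correctly across splits" concrete, here is a minimal sketch, assuming a toy table with one numeric and one categorical column. The column names `age` and `city` and the data are hypothetical and only serve as illustration. The point is that the transformer is fitted on the training split only and then reused unchanged on the validation split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table: one numeric and one categorical feature.
df = pd.DataFrame({
    "age": [22.0, 35.0, 58.0, 41.0, 29.0, 63.0, 47.0, 33.0],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
})

train_df, val_df = train_test_split(df, test_size=0.25, random_state=0)

# Per-column handling: scale the numeric column, one-hot encode the categorical one.
# sparse_threshold=0.0 makes the combined output a dense NumPy array.
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([("scale", StandardScaler())]), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0.0,
)

# Mean, standard deviation and category list are learned from the training split only.
X_train = preprocess.fit_transform(train_df)
# The validation split is transformed with the already fitted statistics.
X_val = preprocess.transform(val_df)
```

With `handle_unknown="ignore"`, a category that only appears in the validation split is encoded as all zeros instead of raising an error.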
Lightning uses the LightningDataModule to bundle the data handling for training NNs. It is the place for the preprocessing that any dataframe should undergo before it is fed into the dataloaders (train_dataloader, val_dataloader, test_dataloader, predict_dataloader): shuffling, train-val-test splits, and transformations such as categorical encoding and normalization. To me it makes sense to define those pre-processing steps (categorical encoding, normalization, etc.) as sklearn pipelines.
But I have not seen anyone use sklearn pipelines in this context before. I checked whether "PyTorch Tabular" uses sklearn pipelines, but it does not. Instead, it defines a separate method for preprocessing, which does the same job as described above and, of course, uses sklearn functions.
Is there a reason not to use sklearn pipelines for this (e.g. conflicts with GPU acceleration, ...)?
Pseudo Code: