Custom Dataloader for Very Large Datasets #17317
Replies: 3 comments
-
I would suggest looking into using an iterable-style (streaming) dataset for this, so the data is read on the fly instead of loaded up front.
-
Thanks for the response, I am aware of that approach. My objective is to inherit all the functionality of the Trainer module and only modify the dataloader part. What I want the custom loop to do is roughly the following: since my data is very big and cannot be loaded into memory, I read it in pieces and feed those pieces to the Trainer as they come.
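For illustration only (this is not the missing pseudocode from this reply; the class names, file paths, the `pytorch_lightning` import and the assumption of numeric columns are all mine), a minimal sketch of keeping the stock `Trainer` and swapping only the dataloader could wrap a streaming `IterableDataset` in a `LightningDataModule`:

```python
# Hypothetical sketch: only the dataloader side is customised, the Trainer is untouched.
import glob

import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset
import pytorch_lightning as pl


class ParquetStream(IterableDataset):
    """Yields rows lazily from parquet files; nothing is held in memory up front."""

    def __init__(self, files, chunk_rows=10_000):
        self.files = files
        self.chunk_rows = chunk_rows

    def __iter__(self):
        for path in self.files:
            # iter_batches reads each file in chunks instead of all at once.
            for batch in pq.ParquetFile(path).iter_batches(batch_size=self.chunk_rows):
                for row in batch.to_pandas().itertuples(index=False):
                    yield torch.tensor(row, dtype=torch.float32)  # assumes numeric columns


class ParquetDataModule(pl.LightningDataModule):
    def __init__(self, data_dir, batch_size=256):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size

    def train_dataloader(self):
        files = sorted(glob.glob(f"{self.data_dir}/*.parquet"))
        return DataLoader(ParquetStream(files), batch_size=self.batch_size)


# The rest of the training loop stays stock, e.g.:
#   trainer = pl.Trainer(max_epochs=1)
#   trainer.fit(model, datamodule=ParquetDataModule("data/"))
```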
-
@VRM1 How did you solve your problem? I'm having a similar issue while loading data from a parquet file: I'm trying to load one row group per iteration, and both torch's DataLoader and the Lightning one are complaining.
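For what it's worth, one way the row-group-per-iteration idea can be made to work (a sketch under assumptions: a made-up file path, numeric-only columns, and `batch_size=None` so the `DataLoader` passes each row group through unbatched) is to let an `IterableDataset` open the file inside `__iter__` and shard the row groups across workers:

```python
# Hypothetical sketch: yields one parquet row group per iteration.
import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class RowGroupDataset(IterableDataset):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Open the file per worker so each worker has its own handle.
        pf = pq.ParquetFile(self.path)
        worker = get_worker_info()
        start = worker.id if worker is not None else 0
        step = worker.num_workers if worker is not None else 1
        # Stride over the row groups so each one is read by exactly one worker.
        for i in range(start, pf.num_row_groups, step):
            table = pf.read_row_group(i)  # only this row group is in memory
            yield torch.as_tensor(table.to_pandas().to_numpy(), dtype=torch.float32)


# batch_size=None disables automatic batching: each yielded row group
# arrives in the training loop as one "batch".
loader = DataLoader(RowGroupDataset("big_file.parquet"), batch_size=None, num_workers=2)
```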
-
Hi All,
I want to create a PyTorch Lightning dataloader for reading a large dataset that cannot fit into memory if everything is loaded in one go. The data is split across many parquet files. My idea is to first read a small list of parquet files with one loader, CustomFileLoader, and to have a second loader, CustomDataLoader, create batches from the files that CustomFileLoader has read. The two would be called repeatedly until all files have been parsed. In the code I wrote, the problem is in the constructor of CustomDataLoader: there I iterate over the entire dataset and concatenate it, so iterating ends up reading all of the data anyway. How can I avoid this? I only want to read a few files at a time and then create batches from them. I am not sure whether I have approached this the right way; any suggestions are much appreciated.
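One way to avoid the constructor problem (a sketch, not the code from the original post; it borrows the `CustomFileLoader` name, but its internals, the glob pattern and the numeric-column assumption are illustrative) is to keep the constructor down to bookkeeping and move all reading into `__iter__` of an `IterableDataset`, so only a few files are materialised at any time:

```python
# Hypothetical sketch: the constructor only stores the file list; files are read
# a few at a time inside __iter__, so the full dataset is never concatenated.
import glob

import pyarrow as pa
import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, IterableDataset


class CustomFileLoader(IterableDataset):
    def __init__(self, file_pattern, files_per_chunk=4):
        # No data is touched here -- just bookkeeping.
        self.files = sorted(glob.glob(file_pattern))
        self.files_per_chunk = files_per_chunk

    def __iter__(self):
        for i in range(0, len(self.files), self.files_per_chunk):
            chunk = self.files[i : i + self.files_per_chunk]
            # Only this small group of files is held in memory at any moment.
            df = pa.concat_tables([pq.read_table(p) for p in chunk]).to_pandas()
            for row in df.itertuples(index=False):
                yield torch.tensor(row, dtype=torch.float32)  # assumes numeric columns


# The DataLoader handles batching; no full-dataset concatenation is ever needed.
loader = DataLoader(CustomFileLoader("data/*.parquet"), batch_size=256)
for xb in loader:
    ...  # train on the batch, or hand the loader to the Trainer via a DataModule
```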