Custom Dataloader for Very Large Datasets #17317

VRM1 · 2023-04-10T03:09:50Z

VRM1
Apr 10, 2023

Hi All,

I want to create a PyTorch lightning data loader for reading large data that cannot fit into memory (if everything is loaded in one go). The datafile are split into parquet files. My idea is to first reach a small list of parquet files using one data loader called CustomFileLoader and another data loader called CustomDataLoader which creates batches from the list of parquet files that are
read by DataFileLoader. The idea here is to recursively call CustomFileLoader and DataFileLoader until all
files have been parsed. I wrote the following code, however, the problem I am facing is in the constructor
function of the CustomDataLoader. Here, I am iterating over the entire data and concatenating.
How can I avoid this? In other words, when you iterate you end up reading the entire data.
However, only want to read just a few files at one go and then create batches. I am not sure if I have approached this in the right way. Any suggestions is much appreciated.

import pytorch_lightning as pl
from torch.utils.data import Dataset, DataLoader, ConcatDataset

class CustomFileLoader(Dataset):
    
    def __init__(self, files, extension, label_clm, \
                 num_features: List[str] = None, cat_features: List[str] = None):
        self.files = files
        self.ext = extension
        self.num_features, self.cat_features = num_features, cat_features
        self.label_clm = label_clm
        
    def __len__(self):
        return len(self.files)
    

    def __getitem__(self, index):

        data_obj = {'numerical_data':None, \
             'categorical_data':None, 'label':None}
        if self.ext == 'parquet':
            data = pd.read_parquet(self.files[index])
        else:
            data = pd.read_csv(self.files[index])
        
        if self.num_features:
            self.x_numerical = torch.FloatTensor(data[self.num_features].values)
        if self.cat_features:
            self.x_categ = torch.LongTensor(data[self.cat_features].values)
        self.y = torch.LongTensor(data[self.label_clm].values).flatten()

        if self.num_features and self.cat_features:
            data_obj['numerical_data'] = self.x_numerical
            data_obj['categorical_data'] = self.x_categ
            data_obj['label'] = self.y
            return data_obj
        else:
            data_obj['numerical_data'] = self.x_numerical
            # fill up with random value for categorical to avoid pytorch returning error due to None value
            data_obj['categorical_data'] = torch.zeros((1, 1))
            data_obj['label'] = self.y
            return data_obj
    
class CustomDataLoader(Dataset):
    
    def __init__(self, data: List):

        self.c_data = []
        self.n_data = []
        self.label = []
        for d in data:
          self.c_data.append(d['categorical_data'])
          self.n_data.append(d['numerical_data'])
          self.label.append(d['label'])
        
        self.n_data = ConcatDataset(self.n_data)
        self.c_data = ConcatDataset(self.c_data)
        self.label = ConcatDataset(self.label)

    def __getitem__(self, index):

            return self.n_data[index], self.c_data[index], self.label[index]
    def __len__(self):
        return len(self.n_data)

class GenericDataModule(pl.LightningDataModule):
    def __init__(self, args, t_files, extension, label_clm, num_features, cat_features):
        super().__init__()
        self.t_files = t_files
        self.extension = extension
        self.label_clm = label_clm
        self.num_features = num_features
        self.cat_featres = cat_features
        
    def setup(self, stage=None):
        
        self.train_file_loader = DataLoader(CustomFileLoader(self.t_files, self.extension, self.label_clm, \
                self.num_features, self.cat_features), batch_size=self.file_batch_size, \
                    num_workers=1)
    
    def train_dataloader(self):

        return DataLoader(CustomDataLoader(self.train_file_loader), \
                           batch_size=self.batch_size, num_workers=6)

EvanZ · 2023-04-11T17:47:17Z

EvanZ
Apr 11, 2023

I would suggest looking into using IterableDataset for your use case.

0 replies

VRM1 · 2023-04-15T22:45:12Z

VRM1
Apr 15, 2023
Author

Thanks for the response, I am aware of the IteratableDataset, but that will not work for my case. After trying a bunch of ideas, I translated my problem as follows:

My objective is to inherit all the functionalities of the Trainer module, but just modify the dataloader part. The objective of the custom loop is defined in the following pseudocode. Since my data is very big and cannot be loaded into the memory. I am taking the following approach.

#For every epochs, I would like to 

class Mlp(pl.LightningModule):

    def __init__...

    def training_step(batch..
        # calculate, loss, return loss...

    def test_step(...

    def test_epoch_end(
        self.log...

#I want to modify pl.Trainer to do validation, training and testing using the following loop.

for epoch in epochs:
    f_indx = 0
    new_loader = get_dataloader(file[f_indx], batch=<batch_size>)
    loaders = [new_loader]
    total_data_lengths = len
    while (new_loader):
        for batch in new_loader:
            # train, get loss, etc
        loaders.pop(<old_loader>)
        f_indx += 1
        new_loader = get_dataloader(file[f_indx], batch=<batch_size>)
        loaders.append()

0 replies

tdincer · 2023-09-20T21:12:56Z

tdincer
Sep 20, 2023

@VRM1 How did you solve your problem?

I'm having a similar issue while loading data from a parquet file. Basically I'm trying to load row_group per iteration. Both the torch's DataLoader and the lighting are complaining.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Custom Dataloader for Very Large Datasets #17317

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Custom Dataloader for Very Large Datasets #17317

Uh oh!

Uh oh!

VRM1 Apr 10, 2023

Replies: 3 comments

Uh oh!

EvanZ Apr 11, 2023

Uh oh!

Uh oh!

VRM1 Apr 15, 2023 Author

Uh oh!

tdincer Sep 20, 2023

VRM1
Apr 10, 2023

EvanZ
Apr 11, 2023

VRM1
Apr 15, 2023
Author

tdincer
Sep 20, 2023