
feat: use datasets.IterableDataset shard if possible. #3583


Open
wants to merge 1 commit into main

Conversation

ValMystletainn

What does this PR do?

Add support for sharding a datasets.IterableDataset when it is passed to accelerator.prepare.
Use the dataset's n_shards (via its shard method) rather than IterableDatasetShard, to reduce the data-reading overhead and let each rank read only its own shards efficiently.

Fixes #3547
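For illustration, a minimal sketch of the idea (not the exact patch in this PR): the divisibility guard on n_shards and the helper name shard_iterable_dataset are my own assumptions, and PartialState is only used here to obtain the process count and index.

# Rough sketch, not the actual implementation in this PR. Instead of wrapping the
# dataset in IterableDatasetShard (where every rank iterates the full stream and
# keeps only its slice), ask datasets to split the underlying shards so each rank
# only reads its own files.
from accelerate import PartialState
from datasets import IterableDataset


def shard_iterable_dataset(dataset: IterableDataset) -> IterableDataset:
    state = PartialState()
    if state.num_processes > 1 and dataset.n_shards % state.num_processes == 0:
        # Each process keeps n_shards / num_processes of the underlying shards
        # and never reads the rest.
        return dataset.shard(num_shards=state.num_processes, index=state.process_index)
    # Otherwise fall back to the existing IterableDatasetShard-style wrapping.
    return dataset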

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@SunMarc

When `accelerator.prepare` is called on a `datasets.IterableDataset`, use the `shard` method to
split the dataset across the available processes. This allows for more efficient data loading
and processing, without the load-and-slice overhead of `IterableDatasetShard`.
@ValMystletainn
Author

I guess I should write a test for it and update the docs somewhere.

However, I read through tests/test_data_loader.py and couldn't find where to start a multi-process test suite for this function.

Member

@SunMarc SunMarc left a comment


Thanks! A test would be nice to have! You can put the test in test_distributed_data_loop.py. We run these tests on 2 GPUs. Check the test_distributed_data_loop test in the test_multi-gpu.py file.
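A rough sketch of what such a 2-process test could look like (hedged: the dataset construction and the assertion below are my own illustration, not the test added by this PR):

# Sketch only: build a small IterableDataset with 4 underlying shards, prepare a
# DataLoader on 2 processes, and check that gathering the per-rank elements
# recovers the whole dataset exactly once.
import torch
from torch.utils.data import DataLoader

from accelerate import Accelerator
from datasets import Dataset


def test_iterable_dataset_shard():
    accelerator = Accelerator()
    ds = Dataset.from_dict({"x": list(range(16))}).to_iterable_dataset(num_shards=4)

    dataloader = accelerator.prepare(DataLoader(ds, batch_size=2))

    local_values = []
    for batch in dataloader:
        local_values.extend(batch["x"].tolist())

    # Each rank should see a disjoint subset; gathering across the 2 processes
    # should recover every element exactly once.
    gathered = accelerator.gather(torch.tensor(local_values, device=accelerator.device))
    assert sorted(gathered.tolist()) == list(range(16))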

Comment on lines +1198 to +1199
if (
isinstance(new_dataset, getattr(sys.modules.get("datasets"), "IterableDataset", type(None)))
Member


That could work, but let's instead check whether datasets is available (is_datasets_available) and import the IterableDataset class from there to perform the check.

Author


I wrote it in this style rather than using

if is_datasets_available():
    from datasets import IterableDataset as DatasetsIterableDatasets
...

if isinstance(new_dataset, DatasetsIterableDatasets):
    ...

in order to reduce import overhead, in the same spirit as this code snippet: it checks whether the object is a torch.Tensor and skips importing the heavy pytorch package if there is no torch.Tensor object at all.

However, I think the import overhead of datasets is not as heavy as torch's, so if it helps readability and maintainability, I will change to this style:

if is_datasets_available():
    from datasets import IterableDataset as DatasetsIterableDatasets
...

So which do you prefer: the original version or the import-and-check version?
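For reference, the lazy-check pattern being referred to looks roughly like this (a generic illustration of the torch case, with a helper name of my own, not a quote of the accelerate source):

import sys


def is_torch_tensor_lazy(obj) -> bool:
    # Only look torch up if some other code has already imported it; if it has
    # not, sys.modules.get("torch") returns None, getattr falls back to
    # type(None), and the isinstance check fails without ever importing torch.
    return isinstance(obj, getattr(sys.modules.get("torch"), "Tensor", type(None)))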

Member


Yeah, the overhead should be fine. We only call this function once, so it shouldn't create a huge overhead. Please go with the import + check version.
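The agreed-upon shape would then be roughly the following (sketch only; the helper name is mine, and the exact placement inside data_loader.py is up to the patch):

from accelerate.utils import is_datasets_available

# Import the class once when the datasets package is installed; compared to the
# sys.modules trick this pays for a single import of datasets, which is cheap
# relative to importing torch.
if is_datasets_available():
    from datasets import IterableDataset as DatasetsIterableDataset
else:
    DatasetsIterableDataset = None


def is_datasets_iterable_dataset(obj) -> bool:
    return DatasetsIterableDataset is not None and isinstance(obj, DatasetsIterableDataset)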

Successfully merging this pull request may close these issues.

datasets Iterable Dataset sharding support