(fix) remove sampler_is_batch_sampler code in prepare_data_loader(..) #3469

Open: suzyahyah wants to merge 3 commits into main from fix/data_loader_batch_sampler

Conversation


@suzyahyah suzyahyah commented Mar 31, 2025

What does this PR do?

This PR fixes several points of confusion around torch.utils.data.DataLoader, torch.utils.data.BatchSampler, and accelerate/data_loader.py::prepare_data_loader(..).

#3322 (Edit: not quite this one)
#3014
#2091

Motivation

  1. accelerate/data_loader.py has various patches allowing for a BatchSampler object to be passed as an argument to sampler.

  2. In the code, this behavior is enabled via sampler_is_batch_sampler = isinstance(dataloader.sampler, BatchSampler) (see the simplified sketch after this list).

  3. Allowing this is unintuitive for developers because it directly conflicts with the torch.utils.data.DataLoader documentation; accelerate should only wrap the PyTorch dataloader, not change or extend its logic or arguments.

  4. It also permits unintended behavior in other libraries that rely on accelerate. For instance, when a custom BatchSampler is passed to the dataloader argument in the HuggingFace Trainer, it is currently forwarded through the sampler kwarg, regardless of whether it is a BatchSampler or just a RandomSampler. https://github.com/huggingface/transformers/blob/3b07ca78bb696825feee3e976795fab58f2b6d0c/src/transformers/trainer.py#L1026 (I'll open a separate PR in HuggingFace for this.)

  5. I believe this is due to a misunderstanding stemming from Issue #679 (Error in prepared DataLoader with BatchSampler). As @sgugger initially suspected, this was probably a typo or misunderstanding on the HuggingFace side.
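
To make the pattern in points 1 and 2 concrete, here is a simplified sketch of the special-casing being removed (illustrative only, not the exact accelerate source; _pick_batch_sampler is a hypothetical helper name):

from torch.utils.data import BatchSampler, DataLoader

def _pick_batch_sampler(dataloader: DataLoader):
    # prepare_data_loader-style special-casing: if the object passed as `sampler`
    # is actually a BatchSampler, use it in place of the real batch_sampler.
    sampler_is_batch_sampler = isinstance(dataloader.sampler, BatchSampler)
    if sampler_is_batch_sampler:
        # `sampler` already yields lists of indices, so it is treated as the
        # batch sampler even though it was passed through the wrong argument.
        return dataloader.sampler
    return dataloader.batch_sampler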

This should not be allowed behavior, as shown by the following basic test case, which throws TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list' from datasets/formatting/formatting.py:

from torch.utils.data import BatchSampler, DataLoader, RandomSampler
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-103-raw-v1", split='validation')

train_dataloader = DataLoader(
    ds,
    sampler=BatchSampler(RandomSampler(ds), batch_size=32, drop_last=False),
    num_workers=0,  # Adjust based on your setup
    pin_memory=True,
)

for batch in train_dataloader:
    print(batch)

Official PyTorch documentation for reference:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, 
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)

This PR

  1. Reverts the historical PR Fix DataLoader with samplers that are batch samplers #687, which supports the wrong logic.
  2. Removes all traces of sampler_is_batch_sampler, simplifying the code.
  3. Immediately throws an error (Edit: now throws a warning) if a BatchSampler has been passed as the sampler argument.

Tests

Passes all tests in tests/test_data_loader.py, but does not introduce any new tests. Open to suggestions.

Considerations

  1. The PR immediately throws an error (later changed to a warning) if a BatchSampler has been passed as the sampler argument. Technically this error should be thrown earlier, in torch.utils.data, but raising it here makes it explicit that the problem is not coming from the Accelerate library, since this usage had previously been allowed.

  2. Upon reading the torch.utils.data source code (v2.6.0), I found that torch.utils.data.DataLoader will construct a BatchSampler from the given Sampler if batch_sampler=None, and will also construct a default sampler if sampler=None.

This means we should always be able to recover dataloader.batch_sampler when wrapping accelerate around an already constructed PyTorch dataloader, so there is no need to check sampler_is_batch_sampler again if the intent is just to get "new_batch_sampler" (see the sketch below).
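
For illustration, a minimal check of this behavior (a sketch assuming a recent PyTorch version; the toy list dataset is only for demonstration):

from torch.utils.data import DataLoader, RandomSampler

data = list(range(10))

# Only `sampler` is given; DataLoader builds a BatchSampler around it.
dl = DataLoader(data, sampler=RandomSampler(data), batch_size=4)
print(type(dl.batch_sampler))  # <class 'torch.utils.data.sampler.BatchSampler'>

# Neither is given; DataLoader builds a default SequentialSampler and a BatchSampler.
dl2 = DataLoader(data, batch_size=4)
print(type(dl2.sampler), type(dl2.batch_sampler))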

Before submitting

Who can review?

@SunMarc @BenjaminBossan @zach-huggingface @muellerzr

Comment on lines 1153 to 1159
raise ValueError(
    "Should not pass a BatchSampler to the dataloader sampler argument. As per pytorch>2.1.0 documentation, please pass this to batch_sampler instead"
)
Contributor


Can you please show the exact paragraph/section of the pytorch docs that states this? You reference many issues, but I don't see where you found this.

Author

@suzyahyah suzyahyah Mar 31, 2025


Thanks for the quick reply! Yes it’s referenced under motivation #5.

The official PyTorch documentation for reference, showing the two arguments sampler and batch_sampler:

"users may use the sampler argument to specify a custom Sampler object that at each time yields the next index/key to fetch. A custom Sampler that yields a list of batch indices at a time can be passed as the batch_sampler argument."

And in the docs they write that this is "mutually exclusive with sampler":

batch_sampler (Sampler or Iterable, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
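
For concreteness, a minimal example of the documented batch_sampler usage (a sketch, not part of the PR diff; the toy list dataset is only for illustration):

from torch.utils.data import BatchSampler, DataLoader, RandomSampler

data = list(range(100))

# Documented usage: a sampler that yields lists of indices goes to `batch_sampler`;
# batch_size, shuffle, sampler, and drop_last must then be left at their defaults.
loader = DataLoader(
    data,
    batch_sampler=BatchSampler(RandomSampler(data), batch_size=32, drop_last=False),
)

for batch in loader:
    print(len(batch))  # 32 items per batch (the last batch may be smaller)
    break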

Author

@suzyahyah suzyahyah Mar 31, 2025


If we pass a BatchSampler (which yields a list of indices) as the sampler argument, we will get a list of lists of indices, because PyTorch will convert whatever was passed as sampler into "batch mode".

Maybe it's legal behavior for some very custom parallel processing on batches of batches, but it seems rare and pretty unlikely to me. Should I convert the ValueError to a warning?
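
A minimal sketch of that "batch of batches" effect (using a NumPy array as the dataset so that a list of indices is accepted by __getitem__):

import numpy as np
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

data = np.arange(8)
batch_sampler = BatchSampler(SequentialSampler(data), batch_size=4, drop_last=False)

# Passed as `sampler`, the BatchSampler's lists of indices get batched again,
# so each "index" handed to the dataset is itself a list of indices.
loader = DataLoader(data, sampler=batch_sampler, batch_size=2)
print(next(iter(loader)).shape)  # torch.Size([2, 4]) -- a batch of batches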

@SunMarc
Member

SunMarc commented Apr 1, 2025

Can you rebase again? The diff doesn't look correct.

@suzyahyah suzyahyah force-pushed the fix/data_loader_batch_sampler branch from 17fc17a to e228780 on April 1, 2025, 14:41
@suzyahyah
Author

Can you rebase again? The diff doesn't look correct.

Yeah I think I fixed it now.

I don't see where you found this

Re @muellerzr's very valid concern: I changed this to a warning and referenced the PyTorch documentation.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc
Member

SunMarc commented Apr 9, 2025

Some tests are still failing @suzyahyah

@suzyahyah suzyahyah force-pushed the fix/data_loader_batch_sampler branch from e228780 to b90f9a6 on April 12, 2025, 11:07
@suzyahyah suzyahyah force-pushed the fix/data_loader_batch_sampler branch from b90f9a6 to b60fd09 on April 12, 2025, 11:20
@suzyahyah
Author

Thanks, @SunMarc

The tests were failing because I misunderstood and had wrongly changed the logic in src/accelerate/data_loader.py::set_sampler(). I have reverted the logic in that function back to the main branch.

After rebasing against main, the tests run successfully (w/o hardware accelerators):

make test_core
make test_prod
pytest tests/test_data_loader.py
make test

Quality checks run:

make style
make quality

Member

@SunMarc SunMarc left a comment


Thanks for your work! I feel like it is better if we first add a deprecation message instead of breaking everything right now, WDYT @suzyahyah? This will help users fix their code. Also, it would be nice to fix the HF docs, otherwise users will still use the sampler arg by default.

Comment on lines -980 to -981
sampler_is_batch_sampler = isinstance(dataloader.sampler, BatchSampler)
if sampler_is_batch_sampler:
Member


let's keep the isinstance(dataloader.sampler, BatchSampler) check

Comment on lines +1154 to +1160
if isinstance(dataloader.sampler, BatchSampler):
    logger.warning(
        "BatchSampler was passed to the sampler argument. "
        "If you have a custom Sampler that yields a list of batch indices at a time, please pass it as the batch_sampler argument instead. "
        "For more information, see https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader"
    )

Member


The warning is nice; maybe we should also add a deprecation message saying that we won't allow passing BatchSampler to sampler anymore?
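
For illustration, one way such a deprecation path could look (hypothetical wording and helper name, not part of the PR):

import warnings

from torch.utils.data import BatchSampler, DataLoader

def _warn_if_batch_sampler_passed_as_sampler(dataloader: DataLoader) -> None:
    # Hypothetical sketch: keep the old behavior for now, but tell users that
    # passing a BatchSampler through `sampler` will stop being supported.
    if isinstance(dataloader.sampler, BatchSampler):
        warnings.warn(
            "Passing a BatchSampler as the `sampler` argument is deprecated; "
            "pass it as `batch_sampler` instead.",
            FutureWarning,
        )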

Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@SunMarc
Member

SunMarc commented May 12, 2025

LMK if you plan to finish the PR or I will leave it as a feature request / bug to fix @suzyahyah
