Description
When calling IterableDatasetDict.map(), each split's IterableDataset.map() is invoked without a features argument. Omitting the argument is not itself incorrect, but the implementation then sets info.features = features, which overwrites the original features with None. Since IterableDataset.column_names relies on info.features, it ends up as None as well.
Reproduction
- Define an IterableDatasetDict with a non-None features schema.
- my_iterable_dataset_dict contains a "text" column.
- Call:

    new_dict = my_iterable_dataset_dict.map(
        function=my_fn,
        with_indices=False,
        batched=True,
        batch_size=16,
    )

- Observe:

    new_dict["train"].info.features  # {'text': Value(dtype='string', id=None)}
    new_dict["train"].column_names   # ['text']

- Call:

    new_dict = my_iterable_dataset_dict.map(
        function=my_fn,
        with_indices=False,
        batched=True,
        batch_size=16,
        remove_columns=["foo"],
    )

- Observe:

    new_dict["train"].info.features  # → None
    new_dict["train"].column_names   # → None

- Internally, in dataset_dict.py, this loop omits features:

    for split, dataset in self.items():
        dataset_dict[split] = dataset.map(
            function=function,
            with_indices=with_indices,
            input_columns=input_columns,
            batched=batched,
            batch_size=batch_size,
            drop_last_batch=drop_last_batch,
            remove_columns=remove_columns,
            fn_kwargs=fn_kwargs,
            # features omitted → defaults to None
        )

- Then, inside IterableDataset.map(), the correct info.features is replaced by features, which is None:

    info = self.info.copy()
    info.features = features  # features is None here
    return IterableDataset(..., info=info, ...)

Suggestion
This replacement looks intentional, but it should probably be done only if features is not None.
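The guarded assignment suggested here can be sketched on a toy model. This is not the library's code: `ToyInfo`, `map_info_current`, and `map_info_guarded` are hypothetical stand-ins that only mimic the `info.features = features` step quoted in this issue.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ToyInfo:
    """Hypothetical stand-in for a dataset info object; only `features` matters here."""
    features: Optional[dict] = None


def map_info_current(info: ToyInfo, features: Optional[dict] = None) -> ToyInfo:
    """Mimics the behavior described in this issue: always overwrite."""
    new_info = replace(info)      # shallow copy, playing the role of info.copy()
    new_info.features = features  # a None argument wipes the existing schema
    return new_info


def map_info_guarded(info: ToyInfo, features: Optional[dict] = None) -> ToyInfo:
    """Suggested variant: overwrite only when the caller passed a schema."""
    new_info = replace(info)
    if features is not None:
        new_info.features = features
    return new_info


original = ToyInfo(features={"text": "string"})
print(map_info_current(original).features)  # None
print(map_info_guarded(original).features)  # {'text': 'string'}
```

One caveat: if the mapped function adds or removes columns, the preserved schema may no longer describe the output. The sketch does not try to address that case.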
Workaround:
SFTTrainer calls dataset.map() several times and then fails because dataset.column_names is None when it iterates over it.
I decided to write this patch, which works for me.
    from datasets import IterableDataset

    def patch_iterable_dataset_map():
        _orig_map = IterableDataset.map

        def _patched_map(self, *args, **kwargs):
            # Fall back to the existing schema when no features are passed.
            if "features" not in kwargs or kwargs["features"] is None:
                kwargs["features"] = self.info.features
            return _orig_map(self, *args, **kwargs)

        IterableDataset.map = _patched_map
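To see what such a wrapper does without loading a real dataset, the same pattern can be tried on a stand-in class. `FakeDataset`, its `map` signature, and `patch_map` are assumptions made for this illustration only; they model nothing but the features overwrite reported above.

```python
import functools


class FakeDataset:
    """Hypothetical stand-in; it only models the features-overwrite behavior."""

    def __init__(self, features=None):
        self.features = features

    def map(self, function=None, *, features=None):
        # Like the reported behavior: the result takes `features` exactly as given.
        return FakeDataset(features=features)


def patch_map(cls):
    """Wrap cls.map so a missing or None `features` falls back to the current one."""
    orig_map = cls.map

    @functools.wraps(orig_map)
    def patched_map(self, *args, **kwargs):
        if kwargs.get("features") is None:
            kwargs["features"] = self.features
        return orig_map(self, *args, **kwargs)

    cls.map = patched_map
    return orig_map  # returned so the caller can restore it later


ds = FakeDataset(features={"text": "string"})
before = ds.map(lambda x: x).features   # schema lost
original_map = patch_map(FakeDataset)
after = ds.map(lambda x: x).features    # schema carried over
FakeDataset.map = original_map          # undo the patch
print(before, after)
```

Like the workaround, this wrapper only looks at keyword arguments, so it assumes `features` is never passed positionally. Keeping a reference to the original method makes the patch easy to remove once the upstream behavior changes.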