`IterableDatasetDict.map()` call removes `column_names` (in fact `info.features`) #7568

When calling `IterableDatasetDict.map()`, each split's `IterableDataset.map()` is invoked without a `features` argument. Omitting the argument is not itself incorrect, but the implementation then unconditionally sets `info.features = features`, which overwrites the original schema with `None`. Since `IterableDataset.column_names` is derived from `info.features`, it also ends up as `None`.
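
The failure chain can be illustrated with a toy model that does not use the `datasets` library. `ToyInfo` and `ToyIterableDataset` are hypothetical stand-ins; only the unconditional `info.features = features` assignment and the dependence of `column_names` on `info.features` are taken from the description above.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyInfo:
    """Hypothetical stand-in for the dataset info object."""
    features: Optional[dict] = None


class ToyIterableDataset:
    """Hypothetical stand-in; not the real IterableDataset."""

    def __init__(self, info: ToyInfo):
        self.info = info

    @property
    def column_names(self) -> Optional[list]:
        # Column names are derived from info.features, so None propagates.
        features = self.info.features
        return list(features) if features is not None else None

    def map(self, function=None, features: Optional[dict] = None) -> "ToyIterableDataset":
        info = copy.copy(self.info)
        info.features = features  # unconditional overwrite, as in the issue
        return ToyIterableDataset(info)


original = ToyIterableDataset(ToyInfo(features={"text": "string"}))
mapped = original.map(function=lambda batch: batch)

print(original.column_names)  # ['text']
print(mapped.column_names)    # None
```

The original dataset is untouched because `map()` works on a copy of `info`; only the returned dataset loses its schema.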

**Reproduction**

1. Define an `IterableDatasetDict` named `my_iterable_dataset_dict` with a non-`None` `features` schema.
2. In this example, `my_iterable_dataset_dict` contains a `"text"` column.
3. Call `map()` without `remove_columns`:
```Python
new_dict = my_iterable_dataset_dict.map(
    function=my_fn,
    with_indices=False,
    batched=True,
    batch_size=16,
)
```
4. Observe that the schema is preserved:
```Python
new_dict["train"].info.features  # {'text': Value(dtype='string', id=None)}
new_dict["train"].column_names   # ['text']
```
5. Call `map()` again, this time with `remove_columns`:
```Python
new_dict = my_iterable_dataset_dict.map(
    function=my_fn,
    with_indices=False,
    batched=True,
    batch_size=16,
    remove_columns=["foo"]
)
```
6. Observe that both values are now `None`:
```Python
new_dict["train"].info.features  # → None
new_dict["train"].column_names   # → None
```
7. Internally, in `dataset_dict.py`, this loop omits `features` ([code](https://github.com/huggingface/datasets/blob/b9efdc64c3bfb8f21f8a4a22b21bddd31ecd5a31/src/datasets/dataset_dict.py#L2047C5-L2056C14)):
```Python
for split, dataset in self.items():
    dataset_dict[split] = dataset.map(
        function=function,
        with_indices=with_indices,
        input_columns=input_columns,
        batched=batched,
        batch_size=batch_size,
        drop_last_batch=drop_last_batch,
        remove_columns=remove_columns,
        fn_kwargs=fn_kwargs,
        # features omitted → defaults to None
    )
```
8. Then, inside `IterableDataset.map()` ([code](https://github.com/huggingface/datasets/blob/b9efdc64c3bfb8f21f8a4a22b21bddd31ecd5a31/src/datasets/iterable_dataset.py#L2619C1-L2622C37)), the correct `info.features` is replaced by `features`, which is `None`:
```Python
info = self.info.copy()
info.features = features  # features is None here
return IterableDataset(..., info=info, ...)
```

**Suggestion**
It looks like this replacement was added intentionally, but it should probably happen only if `features` is not `None`.
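
A minimal sketch of that guard, written as a standalone helper rather than as a patch to the library. `resolve_features` is a hypothetical name, and plain dicts stand in for the real features objects.

```python
from typing import Optional


def resolve_features(current: Optional[dict], requested: Optional[dict]) -> Optional[dict]:
    """Return the schema to store after a map() call.

    Hypothetical helper: keep the current schema unless the caller
    explicitly passed a new one.
    """
    return requested if requested is not None else current


schema = {"text": "string"}
print(resolve_features(schema, None))              # {'text': 'string'}
print(resolve_features(schema, {"label": "int"}))  # {'label': 'int'}
```

One trade-off to keep in mind: if the mapped function adds or removes columns, a preserved schema no longer describes the output, so the guard is only safe when the function keeps the columns unchanged.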

**Workaround**
`SFTTrainer` calls `dataset.map()` several times and then fails with a `NoneType` error when iterating over `dataset.column_names`.
I wrote the following patch, which works for me.

```python
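from datasets import IterableDataset


def patch_iterable_dataset_map():
    """Monkeypatch IterableDataset.map to keep the existing schema by default."""
    _orig_map = IterableDataset.map

    def _patched_map(self, *args, **kwargs):
        # Fall back to the current schema when `features` is missing or None.
        # Note: this only handles `features` passed as a keyword argument.
        if kwargs.get("features") is None:
            kwargs["features"] = self.info.features
        return _orig_map(self, *args, **kwargs)

    IterableDataset.map = _patched_map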
```
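
To try the same idea without installing the library, the sketch below applies the wrapper to any class that exposes `map()` and `info.features`, and returns a function that restores the original method. `patch_map_preserving_features` and `FakeDataset` are hypothetical names; with the real library you would presumably pass `IterableDataset` instead of the stand-in.

```python
import functools
from types import SimpleNamespace


def patch_map_preserving_features(cls):
    """Wrap cls.map so a missing or None `features` falls back to self.info.features.

    Returns a zero-argument function that restores the original method.
    """
    original_map = cls.map

    @functools.wraps(original_map)
    def patched_map(self, *args, **kwargs):
        if kwargs.get("features") is None:
            kwargs["features"] = self.info.features
        return original_map(self, *args, **kwargs)

    cls.map = patched_map

    def restore():
        cls.map = original_map

    return restore


class FakeDataset:
    """Hypothetical stand-in that reproduces the unconditional overwrite."""

    def __init__(self, features=None):
        self.info = SimpleNamespace(features=features)

    def map(self, function=None, features=None):
        return FakeDataset(features=features)


restore = patch_map_preserving_features(FakeDataset)
patched_result = FakeDataset({"text": "string"}).map(function=None)
restore()
unpatched_result = FakeDataset({"text": "string"}).map(function=None)

print(patched_result.info.features)    # {'text': 'string'}
print(unpatched_result.info.features)  # None
```

Returning a `restore` callable keeps the monkeypatch reversible, which is convenient in tests or when the patch is only needed around a single trainer call.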





