Open
Description
Describe the bug
from datasets import load_dataset
dataset = load_dataset("animetimm/danbooru-wdtagger-v4-w640-ws-30k")
Running this produces the following traceback:
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/arrow_writer.py", line 626, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 255, in pyarrow.lib.array
File "pyarrow/array.pxi", line 117, in pyarrow.lib._handle_arrow_array_protocol
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/arrow_writer.py", line 258, in __arrow_array__
out = cast_array_to_feature(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/table.py", line 1798, in wrapper
return func(array, *args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/table.py", line 2006, in cast_array_to_feature
arrays = [
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/table.py", line 2007, in <listcomp>
_c(array.field(name) if name in array_fields else null_array, subfeature)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/table.py", line 1798, in wrapper
return func(array, *args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/table.py", line 2066, in cast_array_to_feature
casted_array_values = _c(array.values, feature.feature)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/table.py", line 1798, in wrapper
return func(array, *args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/table.py", line 2103, in cast_array_to_feature
return array_cast(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/table.py", line 1798, in wrapper
return func(array, *args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/table.py", line 1949, in array_cast
raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
TypeError: Couldn't cast array of type string to null
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/load.py", line 2084, in load_dataset
builder_instance.download_and_prepare(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/builder.py", line 925, in download_and_prepare
self._download_and_prepare(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
super()._download_and_prepare(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/builder.py", line 1001, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/builder.py", line 1487, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/ubuntu/miniconda3/lib/python3.10/site-packages/datasets/builder.py", line 1644, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
I am using datasets==3.5.1. What is wrong?
The dataset's inner JSON structure looks like this:
features:
  - name: "image"
    dtype: "image"
  - name: "json.id"
    dtype: "string"
  - name: "json.width"
    dtype: "int32"
  - name: "json.height"
    dtype: "int32"
  - name: "json.rating"
    sequence:
      dtype: "string"
  - name: "json.general_tags"
    sequence:
      dtype: "string"
  - name: "json.character_tags"
    sequence:
      dtype: "string"
I'm 100% sure that all the JSON files satisfy the format above.
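One possible cause, not confirmed for this dataset, is that a list-valued field is empty in every record seen during schema inference. An empty list carries no element type, so the inferred element type can end up as null, and later string values then fail to cast. The following is a minimal sketch for checking that, assuming the records are plain dicts keyed by the field names from the schema above; the helper name `find_untyped_list_fields` and the sample records are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List

# Field names taken from the schema above (assumption: they appear
# un-prefixed inside each JSON file; adjust if they are nested differently).
LIST_FIELDS = ["rating", "general_tags", "character_tags"]


def find_untyped_list_fields(
    records: Iterable[Dict[str, Any]],
    list_fields: List[str] = LIST_FIELDS,
) -> List[str]:
    """Return the list fields that are empty, None, or missing in every record."""
    seen_value = {name: False for name in list_fields}
    for record in records:
        for name in list_fields:
            if record.get(name):  # a non-empty list carries type information
                seen_value[name] = True
    return [name for name, seen in seen_value.items() if not seen]


if __name__ == "__main__":
    # Hypothetical records standing in for the first few JSON files of a shard.
    first_batch = [
        json.loads('{"id": "1", "rating": ["general"], "general_tags": ["sky"], "character_tags": []}'),
        json.loads('{"id": "2", "rating": ["general"], "general_tags": ["tree"], "character_tags": []}'),
    ]
    print(find_untyped_list_fields(first_batch))  # prints ['character_tags']
```

If a field is reported here for the first records of a shard, that would be consistent with the string-to-null cast error above, though it does not prove it is the cause.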
Steps to reproduce the bug
from datasets import load_dataset
dataset = load_dataset("animetimm/danbooru-wdtagger-v4-w640-ws-30k")
Expected behavior
The dataset loads successfully, with the JSON format described above and WebP images.
Environment info
- datasets version: 3.5.1
- Platform: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
- Python version: 3.10.16
- huggingface_hub version: 0.30.2
- PyArrow version: 20.0.0
- Pandas version: 2.2.3
- fsspec version: 2025.3.0