Issue with offline mode and partial dataset cached #7551

nrv · 2025-05-04T16:49:37Z

Describe the bug

Hi,

This is an issue related to #4760: after loading a single file from a dataset, I am unable to access it in offline mode afterwards.

Steps to reproduce the bug

import os
# os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_TOKEN"] = "xxxxxxxxxxxxxx"

import datasets

dataset_name = "uonlp/CulturaX"
data_files = "fr/fr_part_00038.parquet"

ds = datasets.load_dataset(dataset_name, split='train', data_files=data_files)
print(f"Dataset loaded   : {ds}")

Once the file has been cached, I rerun with HF_HUB_OFFLINE activated and get this error:

ValueError: Couldn't find cache for uonlp/CulturaX for config 'default-1e725f978350254e'
Available configs in the cache: ['default-2935e8cdcc21c613']

Expected behavior

Should be able to access the previously cached files

Environment info

datasets version: 3.2.0
Platform: Linux-5.4.0-215-generic-x86_64-with-glibc2.31
Python version: 3.12.0
huggingface_hub version: 0.27.0
PyArrow version: 19.0.0
Pandas version: 2.2.2
fsspec version: 2024.3.1


nrv · 2025-05-04T17:15:01Z

It seems the problem comes from builder.py / create_config_id()

On the first call, when the cache is empty we have

config_kwargs = {'data_files': {'train': ['hf://datasets/uonlp/CulturaX@6a8734bc69fefcbb7735f4f9250f43e4cd7a442e/fr/fr_part_00038.parquet']}}

leading to config_id being 'default-2935e8cdcc21c613'

then, on the second call,

config_kwargs = {'data_files': 'fr/fr_part_00038.parquet'}

This explains why the hash is not the same, despite passing the same parameter when calling load_dataset: data_files="fr/fr_part_00038.parquet"
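To illustrate why the two config_kwargs values above cannot produce the same suffix, here is a minimal sketch. The helper toy_config_suffix is hypothetical and uses a plain SHA-256 over JSON as a stand-in; it is not the hashing that create_config_id() actually performs, so the printed suffixes will not match the ones in the error message. It only shows that any content-based hash will differ when the resolved and unresolved forms of data_files differ.

```python
import hashlib
import json


def toy_config_suffix(config_kwargs: dict) -> str:
    """Hypothetical stand-in for the hashing step (NOT the datasets implementation)."""
    payload = json.dumps(config_kwargs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


# config_kwargs as reported for the first (online) call: data_files is resolved
# to a dict of split -> list of fully qualified hf:// paths.
online_kwargs = {
    "data_files": {
        "train": [
            "hf://datasets/uonlp/CulturaX@6a8734bc69fefcbb7735f4f9250f43e4cd7a442e"
            "/fr/fr_part_00038.parquet"
        ]
    }
}

# config_kwargs as reported for the second (offline) call: data_files is still
# the raw string passed by the user.
offline_kwargs = {"data_files": "fr/fr_part_00038.parquet"}

online_id = "default-" + toy_config_suffix(online_kwargs)
offline_id = "default-" + toy_config_suffix(offline_kwargs)

print(online_id)
print(offline_id)
print("same config_id:", online_id == offline_id)
```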

nrv · 2025-05-04T17:20:55Z

Same behavior with version 3.5.1

YanshekWoo · 2025-05-12T11:57:54Z

Same issue when loading google/IndicGenBench_flores_in with datasets==2.21.0 and datasets==3.6.0.

YanshekWoo · 2025-05-13T03:18:41Z

It seems the problem comes from builder.py / create_config_id()

On the first call, when the cache is empty we have
config_kwargs = {'data_files': {'train': ['hf://datasets/uonlp/CulturaX@6a8734bc69fefcbb7735f4f9250f43e4cd7a442e/fr/fr_part_00038.parquet']}}
leading to config_id being 'default-2935e8cdcc21c613'

then, on the second call,
config_kwargs = {'data_files': 'fr/fr_part_00038.parquet'}
This explains why the hash is not the same, despite passing the same parameter when calling load_dataset: data_files="fr/fr_part_00038.parquet"

I have identified that the issue indeed lies in the data_files within config_kwargs.
The format and prefix of data_files differ depending on whether HF_HUB_OFFLINE is set, leading to different final config_id values.
When I use other datasets without passing the data_files parameter, this issue does not occur.

A possible solution might be to standardize the formatting of data_files within the create_config_id function.
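One way such a standardization could look is sketched below. This is only an illustration of the idea, not a patch for create_config_id(): the helper normalize_data_files, the regular expression, and the default split name "train" are all assumptions made for this example. It reduces both forms of data_files seen in this thread to the same canonical dict, which could then be hashed.

```python
import re
from typing import Dict, List, Union

DataFiles = Union[str, List[str], Dict[str, Union[str, List[str]]]]

# Matches a leading "hf://datasets/[<namespace>/]<name>@<revision>/" prefix.
_HF_PREFIX = re.compile(r"^hf://datasets/(?:[^/@]+/)?[^/@]+@[^/]+/")


def normalize_data_files(
    data_files: DataFiles, default_split: str = "train"
) -> Dict[str, List[str]]:
    """Hypothetical helper: map data_files to {split: sorted relative paths}."""
    # A bare string or list is treated as belonging to the default split
    # (an assumption made for this sketch).
    if isinstance(data_files, (str, list)):
        data_files = {default_split: data_files}

    normalized: Dict[str, List[str]] = {}
    for split, paths in data_files.items():
        if isinstance(paths, str):
            paths = [paths]
        # Drop the repository and revision prefix so that resolved and
        # unresolved paths compare equal.
        normalized[split] = sorted(_HF_PREFIX.sub("", p) for p in paths)
    return normalized


online = {
    "train": [
        "hf://datasets/uonlp/CulturaX@6a8734bc69fefcbb7735f4f9250f43e4cd7a442e"
        "/fr/fr_part_00038.parquet"
    ]
}
offline = "fr/fr_part_00038.parquet"

print(normalize_data_files(online))
print(normalize_data_files(offline))
```

Note the trade-off in this sketch: stripping the revision means two different revisions of the same file would map to the same canonical form, which may or may not be acceptable for cache identification.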
