
Issue with offline mode and partial dataset cached #7551

Open
nrv opened this issue May 4, 2025 · 4 comments

Comments

@nrv

nrv commented May 4, 2025

Describe the bug

Hi,

This issue is related to #4760: after loading a single file from a dataset, I am unable to access it in offline mode.

Steps to reproduce the bug

import os
# os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_TOKEN"] = "xxxxxxxxxxxxxx"

import datasets

dataset_name = "uonlp/CulturaX"
data_files = "fr/fr_part_00038.parquet"

ds = datasets.load_dataset(dataset_name, split='train', data_files=data_files)
print(f"Dataset loaded   : {ds}")

Once the file has been cached, I rerun the script with HF_HUB_OFFLINE enabled (the commented-out line above) and get this error:

ValueError: Couldn't find cache for uonlp/CulturaX for config 'default-1e725f978350254e'
Available configs in the cache: ['default-2935e8cdcc21c613']

Expected behavior

Previously cached files should remain accessible in offline mode.

Environment info

  • datasets version: 3.2.0
  • Platform: Linux-5.4.0-215-generic-x86_64-with-glibc2.31
  • Python version: 3.12.0
  • huggingface_hub version: 0.27.0
  • PyArrow version: 19.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.3.1
@nrv
Author

nrv commented May 4, 2025

It seems the problem comes from builder.py / create_config_id()

On the first call, when the cache is empty we have

config_kwargs = {'data_files': {'train': ['hf://datasets/uonlp/CulturaX@6a8734bc69fefcbb7735f4f9250f43e4cd7a442e/fr/fr_part_00038.parquet']}}

leading to config_id being 'default-2935e8cdcc21c613'

then, on the second call,

config_kwargs = {'data_files': 'fr/fr_part_00038.parquet'}

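This explains why the hash differs even though load_dataset receives the same argument both times: data_files="fr/fr_part_00038.parquet". Online, the value is resolved to a full hf:// URL pinned to a commit and grouped by split before hashing; offline, the raw string is hashed as-is.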
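
To make the mismatch concrete, here is a minimal sketch of how hashing the two config_kwargs values above yields two different ids. It is an illustration only: toy_config_id is a hypothetical stand-in, and the JSON serialization and SHA-256 used here are assumptions, not the hashing that create_config_id() actually performs, so the suffixes will not match the ones in the error message.

```python
import hashlib
import json


def toy_config_id(config_name, config_kwargs):
    """Hypothetical stand-in: derive an id suffix from the serialized kwargs."""
    payload = json.dumps(config_kwargs, sort_keys=True).encode("utf-8")
    suffix = hashlib.sha256(payload).hexdigest()[:16]
    return f"{config_name}-{suffix}"


# config_kwargs as reported for the first (online) call
online_kwargs = {
    "data_files": {
        "train": [
            "hf://datasets/uonlp/CulturaX@6a8734bc69fefcbb7735f4f9250f43e4cd7a442e/fr/fr_part_00038.parquet"
        ]
    }
}
# config_kwargs as reported for the second (offline) call
offline_kwargs = {"data_files": "fr/fr_part_00038.parquet"}

online_id = toy_config_id("default", online_kwargs)
offline_id = toy_config_id("default", offline_kwargs)

print(online_id)
print(offline_id)
print("same id:", online_id == offline_id)
```

Any hash over the raw kwargs behaves this way: the two inputs describe the same file but are different values, so the offline run looks for a cache directory that the online run never created.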

@nrv
Author

nrv commented May 4, 2025

Same behavior with version 3.5.1

@YanshekWoo

Same issue when loading google/IndicGenBench_flores_in with datasets==2.21.0 and datasets==3.6.0.

@YanshekWoo

> It seems the problem comes from builder.py / create_config_id()
>
> On the first call, when the cache is empty we have
>
> config_kwargs = {'data_files': {'train': ['hf://datasets/uonlp/CulturaX@6a8734bc69fefcbb7735f4f9250f43e4cd7a442e/fr/fr_part_00038.parquet']}}
>
> leading to config_id being 'default-2935e8cdcc21c613'
>
> then, on the second call,
>
> config_kwargs = {'data_files': 'fr/fr_part_00038.parquet'}
>
> thus explaining why the hash is not the same, despite having the same parameter when calling load_dataset : data_files="fr/fr_part_00038.parquet"

I can confirm that the issue lies in the data_files entry of config_kwargs.
The format and prefix of data_files differ depending on whether HF_HUB_OFFLINE is set, leading to different final config_id values.
When I use other datasets without passing the data_files parameter, this issue does not occur.

A possible solution might be to standardize the formatting of data_files within the create_config_id function.
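
As a rough sketch of what such standardization could look like, the helper below maps both observed forms of data_files to the same canonical value before any hashing takes place. Everything here is an assumption made for illustration: normalize_data_files is a hypothetical name, not a function in datasets, the regular expression assumes resolved paths have the shape hf://datasets/<repo>@<revision>/<path>, and a bare string or list is assumed to belong to the 'train' split.

```python
import re

# Assumed shape of a resolved Hub path:
#   hf://datasets/<namespace>/<name>@<revision>/<path inside the repo>
# The namespace part is optional to cover repos without one.
_HF_PREFIX = re.compile(r"^hf://datasets/(?:[^/@]+/)?[^/@]+@[^/]+/")


def normalize_data_files(data_files, default_split="train"):
    """Hypothetical helper: map str / list / dict inputs to {split: [relative paths]}."""
    if isinstance(data_files, str):
        data_files = {default_split: [data_files]}
    elif isinstance(data_files, (list, tuple)):
        data_files = {default_split: list(data_files)}

    normalized = {}
    for split, files in data_files.items():
        if isinstance(files, str):
            files = [files]
        # Strip the repo and revision prefix so only the in-repo path remains.
        normalized[str(split)] = [_HF_PREFIX.sub("", f) for f in files]
    return normalized


online = {
    "train": [
        "hf://datasets/uonlp/CulturaX@6a8734bc69fefcbb7735f4f9250f43e4cd7a442e/fr/fr_part_00038.parquet"
    ]
}
offline = "fr/fr_part_00038.parquet"

print(normalize_data_files(online))
print(normalize_data_files(offline))
print("equal:", normalize_data_files(online) == normalize_data_files(offline))
```

One trade-off of this sketch is that dropping the revision means two different commits of the same file would map to the same id, so a real fix would need to decide whether the revision should stay part of the cache key.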
