🍱 OBenTO-LLM:
Open Benchmark Translation with Open LLMs

OBenTO-LLM provides the tools to translate evaluation benchmarks from a source language to other languages using open-source Large Language Models (LLMs) and a standardized translation pipeline.

What is the problem?

Translating evaluation benchmarks may seem like a trivial task. However, if we translate the benchmarks blindly, we may introduce biases or errors that distort our assessment of a model's performance in the target language.

Importantly, different datasets are characterized by nuances and linguistic features that can greatly affect translation quality, even when using state-of-the-art LLMs such as GPT-3.5/4. For example, translating the ARC Challenge dataset requires a different approach than translating the Winogrande dataset. Moreover, different instances within ARC Challenge may require different translation strategies to obtain the best results.

However, for many existing translated benchmarks, we do not have access to the code used to generate the translations, which can cause issues when comparing models' performance across languages.

How can this project help?

OBenTO-LLM provides a standardized pipeline to translate evaluation benchmarks using Large Language Models (LLMs). The aim is to provide a pipeline that is:

  • Free: based on open-source LLMs, so it is free to use if you have the resources to run the translation models.
  • Tailored: designed to maximize translation quality across different datasets by taking their peculiarities into account.
  • Reproducible: can be used to quickly regenerate the translations of the benchmarks.
  • Transparent: designed to be transparent, so you can understand how the translations are generated.
  • Extensible: can be easily extended to support new datasets and languages.

A note on quality: the quality of the translations depends on the LLMs used. The pipeline is designed to provide a good starting point for translating evaluation benchmarks, but even models like Tower-LLM, which are on par with GPT-3.5/4, are not perfect. Therefore, we always recommend checking the translations before using them in your experiments.

Supported Datasets

| Dataset       | Original Dataset    | IT Translation                 |
| ------------- | ------------------- | ------------------------------ |
| ARC Challenge | allenai/ai2_arc     | sapienzanlp/arc_italian        |
| ARC Easy      | allenai/ai2_arc     | sapienzanlp/arc_italian        |
| BoolQ         | google/boolq        | sapienzanlp/boolq_italian      |
| GSM8K         | gsm8k               | sapienzanlp/gsm8k_italian      |
| HellaSwag     | Rowan/hellaswag     | sapienzanlp/hellaswag_italian  |
| MMLU          | cais/mmlu           | sapienzanlp/mmlu_italian       |
| PIQA          | piqa                | sapienzanlp/piqa_italian       |
| SciQ          | allenai/sciq        | sapienzanlp/sciq_italian       |
| TruthfulQA    | truthful_qa         | sapienzanlp/truthful_qa_italian |
| Winogrande    | winogrande          | sapienzanlp/winogrande_italian |
| GPQA          | Idavidrein/gpqa     | sapienzanlp/gpqa_italian       |
| MuSR          | TAUR-Lab/MuSR       | sapienzanlp/MUSR_italian       |
| MATH          | lighteval/MATH-Hard | sapienzanlp/MATH_hard_italian  |
| BBH           | SaylorTwift/bbh     | sapienzanlp/BBH_italian        |

Missing a dataset? Open an issue or submit a pull request! If you do not have the resources or hardware to translate the datasets, we can help you with that. Missing a language? We are currently writing a guide to help you translate the datasets in other languages. Stay tuned!

Supported LLMs

Currently, the pipeline supports the following LLMs:

  • Unbabel/TowerInstruct-7B-v0.2

Missing an LLM? OBenTO-LLM is designed to support most LLMs available in the Hugging Face model hub. We tested the pipeline with TowerInstruct models, but it should work with other models as well. If you encounter any issues, open an issue or submit a pull request!

Installation

It is recommended to set up a Conda environment to run the code. To create a new Conda environment, run the following command:

conda create --name llm-data-translation python=3.10

Remember to activate the Conda environment before running the code:

conda activate llm-data-translation

To install the required packages, run the following command:

pip install -r requirements.txt

Usage

The code is organized as follows:

  • src/translation/translate_<dataset>.py: Contains the code to translate the dataset from a source language to other languages, e.g., Italian.

For example, to translate the allenai/ai2_arc dataset from English to Italian, run the following command:

python src/translation/translate_arc.py \
    --source_language English \
    --target_language Italian \
    --output_path data/translations/it/arc_challenge.train.json \
    --model_name Unbabel/TowerInstruct-7B-v0.2 \
    --device_map "cuda:0" \
    --dataset_name allenai/ai2_arc \
    --dataset_config ARC-Challenge \
    --split train \
    --batch_size 4 \
    --max_new_tokens 1024 \
    --beam_size 3 \
    --length_penalty 2.5 \
    --num_return_sequences 1 \
    --do_sample False \
    --early_stopping False

The translated dataset will be saved in the specified output path. For more details on the arguments, check out the documentation of the script.
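Since manual spot-checking of the translations is recommended, here is a minimal sketch of how you might sample a few entries from the output file for review. It assumes the output is a JSON array of translated records; the exact schema is not documented here, so the path and field layout below are illustrative only.

```python
import json
import random

def sample_records(records, k=5, seed=0):
    """Pick up to k random records for manual review (deterministic via seed)."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

if __name__ == "__main__":
    # Hypothetical usage, assuming the output path used above:
    # with open("data/translations/it/arc_challenge.train.json") as f:
    #     records = json.load(f)
    # for rec in sample_records(records):
    #     print(json.dumps(rec, ensure_ascii=False, indent=2))
    pass
```

A fixed seed makes the review sample reproducible, so collaborators can discuss the same instances.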

Translation scripts

The following scripts are available to translate the datasets from English to Italian:

  • scripts/local/translation/translate_arc_challenge.sh: Translates the ARC Challenge dataset.
  • scripts/local/translation/translate_arc_easy.sh: Translates the ARC Easy dataset.
  • scripts/local/translation/translate_boolq.sh: Translates the BoolQ dataset.
  • scripts/local/translation/translate_gsm8k.sh: Translates the GSM8K dataset.
  • scripts/local/translation/translate_hellaswag.sh: Translates the HellaSwag dataset.
  • scripts/local/translation/translate_mmlu.sh: Translates the MMLU dataset.
  • scripts/local/translation/translate_piqa.sh: Translates the PIQA dataset.
  • scripts/local/translation/translate_sciq.sh: Translates the SciQ dataset.
  • scripts/local/translation/translate_truthfulqa.sh: Translates the TruthfulQA dataset.
  • scripts/local/translation/translate_winogrande.sh: Translates the Winogrande dataset.
  • scripts/local/translation/translate_gpqa.sh: Translates the GPQA dataset.
  • scripts/local/translation/translate_musr.sh: Translates the MuSR dataset.
  • scripts/local/translation/translate_math.sh: Translates the MATH dataset.
  • scripts/local/translation/translate_bbh.sh: Translates the BBH dataset.
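To regenerate all Italian translations in one go, the per-dataset scripts above can be run in sequence. This is a sketch, not an official entry point of the repository; it assumes you run it from the repository root.

```shell
# Run every per-dataset translation script in sequence.
for script in scripts/local/translation/translate_*.sh; do
    [ -e "$script" ] || continue   # skip if the glob matched nothing
    echo "Running $script"
    bash "$script"
done
```

Note that each script launches a full LLM translation run, so executing all of them sequentially can take a long time on a single GPU.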

For more details about the parameters, check the scripts.

Publication and citation

If you use this library or part of it, please consider citing us:

@inproceedings{moroni-etal-2024-towards,
    title = "Towards a More Comprehensive Evaluation for {I}talian {LLM}s",
    author = "Moroni, Luca  and
      Conia, Simone  and
      Martelli, Federico  and
      Navigli, Roberto",
    editor = "Dell'Orletta, Felice  and
      Lenci, Alessandro  and
      Montemagni, Simonetta  and
      Sprugnoli, Rachele",
    booktitle = "Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)",
    month = dec,
    year = "2024",
    address = "Pisa, Italy",
    publisher = "CEUR Workshop Proceedings",
    url = "https://aclanthology.org/2024.clicit-1.67/",
    pages = "584--599",
    ISBN = "979-12-210-7060-6",
}

License

This repository is licensed under the MIT License. See the LICENSE file for more information.

Acknowledgments

Special thanks

We would like to thank:

  • Simone Conia for the core idea and the development of the OBenTO library;
  • Pere-Lluís Huguet Cabot for his help with setting up the Tower-LLM model;
  • Riccardo Orlando for his experience with multi-GPU inference.
