OBenTO-LLM provides the tools to translate evaluation benchmarks from a source language to other languages using open-source Large Language Models (LLMs) and a standardized translation pipeline.
Translating evaluation benchmarks may seem like a trivial task. However, a blind translation can introduce biases or errors that skew the assessment of a model's performance in the target language.
Importantly, each dataset has nuances and linguistic features that can greatly affect translation quality, even with state-of-the-art LLMs such as GPT-3.5/4. For example, translating the ARC Challenge dataset requires a different approach than translating the Winogrande dataset; moreover, different instances within ARC Challenge may call for different translation strategies to obtain the best results.
In addition, for many existing translated benchmarks, the code used to generate the translations is not publicly available, which makes it hard to compare models' performance across languages.
OBenTO-LLM provides a standardized pipeline to translate evaluation benchmarks using Large Language Models (LLMs). The aim is to provide a pipeline that is:
- Free: based on open-source LLMs, so it is free to use if you have the resources to run the translation models.
- Tailored: designed to maximize translation quality across different datasets by accounting for their peculiarities.
- Reproducible: can be used to quickly regenerate the translations of the benchmarks.
- Transparent: designed to be transparent, so you can understand how the translations are generated.
- Extensible: can be easily extended to support new datasets and languages.
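The "Tailored" point above matters in practice: a multiple-choice item like ARC should be translated as a single unit (question plus choices) so terminology stays consistent across options, while a dataset like Winogrande needs its pronoun ambiguity preserved. A minimal sketch of what a dataset-specific prompt builder might look like (the function name and prompt wording are illustrative, not the library's actual code):

```python
# Hypothetical sketch: build one prompt that translates a question together
# with its answer choices, so the model keeps them mutually consistent.
# The function name and prompt wording are illustrative only.

def build_translation_prompt(
    question: str,
    choices: list[str],
    source_language: str = "English",
    target_language: str = "Italian",
) -> str:
    """Return a single translation prompt covering question and choices."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(choices))
    return (
        f"Translate the following multiple-choice question from "
        f"{source_language} to {target_language}. Keep the numbering and "
        f"do not answer the question.\n\n"
        f"Question: {question}\nChoices:\n{numbered}"
    )

prompt = build_translation_prompt(
    "Which gas do plants absorb during photosynthesis?",
    ["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"],
)
print(prompt)
```

Translating the whole instance in one prompt, rather than one field at a time, is one way a pipeline can keep terminology consistent between a question and its options.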
A note on quality: the quality of the translations depends on the LLMs used. The pipeline is designed to provide a good starting point for translating evaluation benchmarks, but, even though models like Tower-LLM are on par with GPT-3.5/4, they may not be perfect. Therefore, we always recommend checking the translations before using them in your experiments.
| Dataset | Original Dataset | IT Translation |
|---|---|---|
| ARC Challenge | allenai/ai2_arc | sapienzanlp/arc_italian |
| ARC Easy | allenai/ai2_arc | sapienzanlp/arc_italian |
| BoolQ | google/boolq | sapienzanlp/boolq_italian |
| GSM8K | gsm8k | sapienzanlp/gsm8k_italian |
| HellaSwag | Rowan/hellaswag | sapienzanlp/hellaswag_italian |
| MMLU | cais/mmlu | sapienzanlp/mmlu_italian |
| PIQA | piqa | sapienzanlp/piqa_italian |
| SciQ | allenai/sciq | sapienzanlp/sciq_italian |
| TruthfulQA | truthful_qa | sapienzanlp/truthful_qa_italian |
| Winogrande | winogrande | sapienzanlp/winogrande_italian |
| GPQA | Idavidrein/gpqa | sapienzanlp/gpqa_italian |
| MuSR | TAUR-Lab/MuSR | sapienzanlp/MUSR_italian |
| MATH | lighteval/MATH-Hard | sapienzanlp/MATH_hard_italian |
| BBH | SaylorTwift/bbh | sapienzanlp/BBH_italian |
Missing a dataset? Open an issue or submit a pull request! If you do not have the resources or hardware to translate the datasets, we can help you with that. Missing a language? We are currently writing a guide to help you translate the datasets in other languages. Stay tuned!
Currently, the pipeline supports the following LLMs:
Missing an LLM? OBenTO-LLM is designed to support most LLMs available in the Hugging Face model hub. We tested the pipeline with TowerInstruct models, but it should work with other models as well. If you encounter any issues, open an issue or submit a pull request!
It is recommended to set up a Conda environment to run the code. To create a new Conda environment, run the following command:

```bash
conda create --name llm-data-translation python=3.10
```

Remember to activate the Conda environment before running the code:

```bash
conda activate llm-data-translation
```

To install the required packages, run the following command:

```bash
pip install -r requirements.txt
```

The code is organized as follows:
- `src/translation/translate_<dataset>.py`: Contains the code to translate the dataset from a source language to other languages, e.g., Italian.
For example, to translate the allenai/ai2_arc dataset from English to Italian, run the following command:
```bash
python src/translation/translate_arc.py \
    --source_language English \
    --target_language Italian \
    --output_path data/translations/it/arc_challenge.train.json \
    --model_name Unbabel/TowerInstruct-7B-v0.2 \
    --device_map "cuda:0" \
    --dataset_name allenai/ai2_arc \
    --dataset_config ARC-Challenge \
    --split train \
    --batch_size 4 \
    --max_new_tokens 1024 \
    --beam_size 3 \
    --length_penalty 2.5 \
    --num_return_sequences 1 \
    --do_sample False \
    --early_stopping False
```

The translated dataset will be saved to the specified output path. For more details on the arguments, check the documentation of the script.
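As the note on quality above recommends, it is worth spot-checking the output before running evaluations. A minimal sketch, assuming the output is a JSON list of records; the field names below are illustrative guesses, so inspect your actual file for the real schema:

```python
import json
import tempfile
from pathlib import Path

# Illustrative sample mimicking a translated ARC instance. The schema
# (field names, answer format) is an assumption, not the scripts' actual output.
sample = [
    {
        "question": "Quale gas assorbono le piante durante la fotosintesi?",
        "choices": ["Ossigeno", "Anidride carbonica", "Azoto", "Idrogeno"],
        "answer": "B",
    }
]

path = Path(tempfile.mkdtemp()) / "arc_challenge.train.json"
path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")

# Load the translated file and run basic sanity checks before evaluation.
records = json.loads(path.read_text(encoding="utf-8"))
for rec in records:
    assert rec["question"].strip(), "empty translated question"
    assert len(rec["choices"]) == 4, "unexpected number of choices"
print(f"Loaded {len(records)} translated instance(s) from {path.name}")
```

Simple checks like these (non-empty fields, expected number of choices, preserved answer keys) catch most truncated or malformed generations early.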
The following scripts are available to translate the datasets from English to Italian:
- `scripts/local/translation/translate_arc_challenge.sh`: Translates the ARC Challenge dataset.
- `scripts/local/translation/translate_arc_easy.sh`: Translates the ARC Easy dataset.
- `scripts/local/translation/translate_boolq.sh`: Translates the BoolQ dataset.
- `scripts/local/translation/translate_gsm8k.sh`: Translates the GSM8K dataset.
- `scripts/local/translation/translate_hellaswag.sh`: Translates the HellaSwag dataset.
- `scripts/local/translation/translate_mmlu.sh`: Translates the MMLU dataset.
- `scripts/local/translation/translate_piqa.sh`: Translates the PIQA dataset.
- `scripts/local/translation/translate_sciq.sh`: Translates the SciQ dataset.
- `scripts/local/translation/translate_truthfulqa.sh`: Translates the TruthfulQA dataset.
- `scripts/local/translation/translate_winogrande.sh`: Translates the Winogrande dataset.
- `scripts/local/translation/translate_gpqa.sh`: Translates the GPQA dataset.
- `scripts/local/translation/translate_musr.sh`: Translates the MuSR dataset.
- `scripts/local/translation/translate_math.sh`: Translates the MATH dataset.
- `scripts/local/translation/translate_bbh.sh`: Translates the BBH dataset.
For more details about the parameters, check the scripts.
If you use this library or part of it, consider citing us:
```bibtex
@inproceedings{moroni-etal-2024-towards,
    title = "Towards a More Comprehensive Evaluation for {I}talian {LLM}s",
    author = "Moroni, Luca and
      Conia, Simone and
      Martelli, Federico and
      Navigli, Roberto",
    editor = "Dell'Orletta, Felice and
      Lenci, Alessandro and
      Montemagni, Simonetta and
      Sprugnoli, Rachele",
    booktitle = "Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)",
    month = dec,
    year = "2024",
    address = "Pisa, Italy",
    publisher = "CEUR Workshop Proceedings",
    url = "https://aclanthology.org/2024.clicit-1.67/",
    pages = "584--599",
    ISBN = "979-12-210-7060-6",
}
```

This repository is licensed under the MIT License. See the LICENSE file for more information.
- Future AI Research for supporting this work.
- Hugging Face for building the transformers and datasets libraries.
- Unbabel for building Tower-LLM.
- The authors of the original datasets for making them available.
We would like to thank:
- Simone Conia for the core idea and the development of the OBenTO-LLM library;
- Pere-Lluís Huguet Cabot for his help with setting up the Tower-LLM model;
- Riccardo Orlando for his experience with multi-GPU inference.