This repository contains the code to reproduce the experiments in the paper "Gandalf the Red: Adaptive Security for LLMs". Try out Gandalf at gandalf.lakera.ai.
We recommend working inside a virtual environment (virtualenv, Conda, Poetry, or similar) to avoid dependency conflicts. Before running any of the commands below, install the dependencies:
pip install -r requirements.txt
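To confirm that an isolated environment is actually active before installing, a quick check along the following lines may help. This is a minimal sketch and not part of the repository: the function name `in_isolated_env` is hypothetical, and the check assumes a standard venv/virtualenv layout (which Poetry also uses) or an activated Conda environment that sets `CONDA_PREFIX`.

```python
import os
import sys


def in_isolated_env() -> bool:
    """Return True if Python appears to run inside a venv/virtualenv or a Conda env.

    Hypothetical helper for illustration; it is not provided by this repository.
    """
    # Inside a venv/virtualenv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    # Conda sets CONDA_PREFIX when an environment is activated
    # (note that this includes the "base" environment).
    in_conda = bool(os.environ.get("CONDA_PREFIX"))
    return in_venv or in_conda


if __name__ == "__main__":
    if in_isolated_env():
        print(f"Isolated environment detected at: {sys.prefix}")
    else:
        print("No virtual environment detected; consider creating one before installing.")
```

If the check reports no environment, creating one first (for example with `python -m venv .venv` and then activating it) keeps the packages from requirements.txt separate from the system-wide installation.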
To reproduce all plots and tables from the paper, run:
python create_all_paper_plots_and_tables.py
The code is structured as follows:
analysis
│
├── adaptive_defenses # Adaptive defenses experiments
│ └── adaptive_defenses.py
│
├── attack_classification # Attack classification experiments from the Appendix
│ ├── active_learning.py
│ ├── active_learning_data.py
│ ├── create_gandalf_rct_attack_categories.py
│ ├── create_gandalf_rct_subsampled.py
│ ├── labels
│ ├── plots.py
│ ├── predictions.py
│ └── sample_selection.py
│
├── defense_in_depth # Defense in depth experiments
│ ├── optimal_aggregation.py
│ └── venn_diagram.py
│
├── supporting_analyses # Supporting analyses from the Appendix
│ ├── basic_statistics.py
│ ├── false_positives.py
│ ├── level_difficulty.py
│ └── session_length.py
│
├── utility_sensitivity # Experiments on the sensitivity of utility to data and metric
│ ├── sensitivity_to_data.py
│ └── sensitivity_to_metric.py
│
├── embedding_utils.py # Auxiliary functions for text embeddings
│
└── utils.py # Auxiliary functions used by several scripts

create_all_paper_plots_and_tables.py # Script to reproduce all plots and tables
data.py # Auxiliary functions to load datasets
If you find our work useful, please consider citing our paper:
@article{lakera2025gandalf,
title={Gandalf the Red: Adaptive Security for LLMs},
author={Niklas Pfister and Václav Volhejn and Manuel Knott and Santiago Arias and Julia Bazińska and Mykhailo Bichurin and Alan Commike and Janet Darling and Peter Dienes and Matthew Fiedler and David Haber and Matthias Kraft and Marco Lancini and Max Mathys and Damián Pascual-Ortiz and Jakub Podolak and Adrià Romero-López and Kyriacos Shiarlis and Andreas Signer and Zsolt Terek and Athanasios Theocharis and Daniel Timbrell and Samuel Trautwein and Samuel Watts and Yun-Han Wu and Mateo Rojas-Carulla},
journal={arXiv preprint arXiv:2501.07927},
year={2025}
}