Efficient reasoning at scale by pruning redundant reasoning traces—without sacrificing accuracy.
Large language models (LLMs) often generate multiple reasoning traces in parallel to improve answer reliability. However, these traces frequently exhibit severe inter-trace redundancy, leading to wasted computation and inflated inference costs.
DeepPrune addresses this by learning to identify and prune semantically redundant traces before full execution—enabling cost-effective parallel reasoning while preserving performance.
More details can be found on our website.
cd DeepPrune
pip install -r requirements.txt
We use Llama-Factory for model fine-tuning and inference; the version we used is provided in the Llama-Factory folder, modified to support Focal Loss. Please refer to the GitHub issue if you want to clone LLaMA-Factory yourself and apply the modification.
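For context, Focal Loss down-weights easy (confidently classified) trace pairs so that training focuses on the hard ones. Below is a minimal PyTorch sketch of the binary form; alpha and gamma are illustrative defaults, not the values used in this repository.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss sketch (illustrative defaults, not the repo's settings).

    logits:  raw scores, shape (batch,)
    targets: 0/1 labels, shape (batch,)
    """
    targets = targets.float()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # model's probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Easy examples (p_t close to 1) are scaled down by (1 - p_t)^gamma.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```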
We use Qwen/Qwen3-4B-Instruct-2507 as the backbone LLM for DeepPrune. You can download it from https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507, or substitute another open-source LLM.
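If you prefer scripting the download, a minimal sketch using huggingface_hub (the local_dir path below is just an example):

```python
from huggingface_hub import snapshot_download

# Download the backbone checkpoint; local_dir is an example path.
snapshot_download(
    repo_id="Qwen/Qwen3-4B-Instruct-2507",
    local_dir="models/Qwen3-4B-Instruct-2507",
)
```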
⚠️ The dataset provided here is incomplete due to its size. Please refer to https://huggingface.co/datasets/THU-KEG/DeepPrune for the full dataset.
To understand how to use the dataset, please refer to DeepPrune_data/README.md.
To understand the motivation behind DeepPrune, explore the preliminary analysis in:
📁 Preliminaries/Preliminary experiment.ipynb
This notebook includes:
- 📊 Distribution of answer agreement: most trace pairs yield the same final answer, revealing significant redundancy in parallel reasoning.
- 📈 ROC curves for redundancy detection:
  - Sentence-BERT (shallow similarity): AUROC = 0.58 → limited discriminative power.
  - Qwen3-4B-Instruct (zero-shot LLM comparison): AUROC = 0.66 → moderate improvement, but still suboptimal.
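For intuition, the shallow-similarity baseline scores each trace pair by embedding cosine similarity and asks how well that score predicts whether the two traces end in the same answer (AUROC). A minimal sketch, assuming a generic Sentence-BERT encoder and toy data rather than the notebook's actual setup:

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import roc_auc_score

# Illustrative data: (trace_a, trace_b, same_answer) triples.
pairs = [
    ("... reasoning trace A ...", "... reasoning trace B ...", 1),
    ("... reasoning trace C ...", "... reasoning trace D ...", 0),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder, not necessarily the one used
emb_a = encoder.encode([a for a, _, _ in pairs], convert_to_tensor=True)
emb_b = encoder.encode([b for _, b, _ in pairs], convert_to_tensor=True)

# Cosine similarity of each matched pair, used as the redundancy score.
scores = util.cos_sim(emb_a, emb_b).diagonal().cpu().numpy()
labels = [y for _, _, y in pairs]
print("AUROC:", roc_auc_score(labels, scores))
```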
🔧 To reproduce the zero-shot Qwen3-4B-Instruct results:
- Prepare the evaluation dataset using DeepPrune/Offline/Ablation_Study.ipynb
- Run Preliminaries/zero_shot_exp.py (a minimal zero-shot judging sketch follows this list)
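For intuition, zero-shot redundancy judging simply asks the backbone model whether two partial traces will reach the same final answer and reads off a yes/no. The prompt wording and generation settings below are illustrative assumptions, not the exact ones in Preliminaries/zero_shot_exp.py:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

def judge_same_answer(trace_a: str, trace_b: str) -> bool:
    # Illustrative prompt; the real script's template may differ.
    prompt = (
        "Here are two partial reasoning traces for the same question.\n\n"
        f"Trace 1:\n{trace_a}\n\nTrace 2:\n{trace_b}\n\n"
        "Will they reach the same final answer? Reply with 'yes' or 'no'."
    )
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    # Decode only the newly generated tokens.
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return reply.strip().lower().startswith("yes")
```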
- Install Llama-Factory
  ⚠️ Patch required: modify the codebase to support Focal Loss (see the GitHub issue for guidance).
Generate the supervised training data for DeepPrune:
jupyter notebook DeepPrune/finetuning/build_finetune_dataset.ipynb
This constructs pairwise trace comparisons labeled by answer equivalence.
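Concretely, each supervised example pairs two (possibly truncated) traces with a binary equivalence label. The schema below is a hypothetical illustration of such an example, not the notebook's exact format:

```python
# Hypothetical schema for one pairwise training example (field names are illustrative).
example = {
    "instruction": "Judge whether the two partial reasoning traces will reach the same final answer.",
    "input": "Trace 1:\n... first partial chain of thought ...\n\n"
             "Trace 2:\n... second partial chain of thought ...",
    "output": "same",  # or "different", derived from answer equivalence of the full traces
}
```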
Train the DeepPrune model using supervised fine-tuning:
- Config: DeepPrune/Offline/Qwen3_full_sft.yaml
- Framework: Llama-Factory
After training:
- Generate test data: DeepPrune/Offline/Ablation_Study.ipynb
- Evaluate performance: python DeepPrune/Offline/test_model_performance_parallel.py
- Visualize results: DeepPrune/Offline/check_model_output.ipynb
✅ Expect significant gains over shallow similarity baselines (AUROC > 0.83 in our experiments).
Deploy DeepPrune for real-time trace pruning during inference:
- Establish baselines:
  Run DeepPrune/Online/check_pass_k.ipynb to compute:
  - pass@1: accuracy with a single trace
  - cons@512: consensus accuracy with 512 traces
- Apply DeepPrune:
  python DeepPrune/Online/greedy_cluster_threshold.py
  This performs greedy clustering of traces using DeepPrune's similarity scores and prunes redundant ones (a minimal sketch follows this list).
- Trade-off control:
  Adjust the similarity threshold to balance:
  - 💰 Cost reduction (fewer traces executed)
  - 🎯 Performance retention (maintained consensus accuracy)
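For intuition, the online stage can be viewed as greedy clustering: each incoming trace is scored against one representative per existing cluster; if the best score clears the threshold the trace is judged redundant and pruned, otherwise it starts a new cluster, and the final answer is a majority vote over the survivors. The sketch below assumes a score_pair(a, b) -> float callable (e.g. the fine-tuned judge's "same answer" probability); both that interface and the code are illustrative, not the actual greedy_cluster_threshold.py implementation.

```python
from collections import Counter

def greedy_cluster_and_vote(traces, answers, score_pair, threshold=0.5):
    """Greedy clustering sketch: prune traces judged redundant with an existing cluster.

    traces:     list of (partial) reasoning traces
    answers:    final answer extracted from each trace (for the consensus vote)
    score_pair: assumed callable(trace_a, trace_b) -> redundancy score in [0, 1]
    threshold:  lower = prune more aggressively
    """
    clusters = []  # indices of cluster-representative traces
    for i, trace in enumerate(traces):
        best = max(
            (score_pair(traces[rep], trace) for rep in clusters),
            default=None,
        )
        if best is not None and best >= threshold:
            continue          # judged redundant with an existing cluster: prune
        clusters.append(i)    # novel line of reasoning: keep as a new representative

    # Consensus answer = majority vote over the surviving representatives.
    vote = Counter(answers[rep] for rep in clusters)
    return vote.most_common(1)[0][0], clusters
```

With this convention, a lower threshold prunes more aggressively (cheaper, but riskier), while a higher threshold keeps more traces and approaches the full cons@512 behaviour.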
This code repository builds on Llama-Factory, vllm, DeepScaleR, and DeepConf.
Thanks for their great work!
If you use DeepPrune in your research, please cite our work:
@article{tu2025deepprune,
title={DeepPrune: Parallel Scaling without Inter-trace Redundancy},
author={Shangqing Tu and Yaxuan Li and Yushi Bai and Lei Hou and Juanzi Li},
journal={arXiv preprint arXiv:2510.08483},
year={2025}
}
