Modern-BERT-Score is a reimplementation of the BERTScore metric introduced by Zhang et al. (2019), optimized for modern inference workflows using SentenceTransformers and vLLM.
This library provides fast, GPU-accelerated scoring for text generation evaluation, making BERTScore practical for large-scale inference tasks.
- Fast, efficient computation with optional vLLM support
- Compatible with all Hugging Face transformer models
- Supports truncated and optimized model versions for faster inference
- Works seamlessly with both CPU and GPU setups
Modern-BERT-Score comes in two variants: a base version and a vLLM-enhanced version. For vLLM, an NVIDIA GPU is strongly recommended.
pip install modern-bert-score

For the vLLM-enhanced version:

pip install modern-bert-score[vllm]
This implementation is significantly faster than the original BERTScore, especially with GPU acceleration.
BERTScore (Zhang et al., 2019) evaluates the similarity between candidate and reference texts by comparing their contextual embeddings from a pre-trained transformer model. For each token in the candidate, it finds the most similar token in the reference (using cosine similarity) and aggregates these scores to compute precision, recall, and F1. Optionally, IDF weighting can be applied to give more importance to rare, informative words, improving the metric's sensitivity to meaningful content over common words. Additionally, optional baseline rescaling shifts the scores so that they fall roughly in the range [0, 1], making them easier to interpret. This approach captures semantic similarity beyond exact word matches, making it robust for tasks such as machine translation and text generation evaluation.
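The matching, IDF-weighting, and rescaling steps described above can be sketched with toy 2-d embeddings (a minimal illustration in pure Python; the vectors, IDF weights, and baseline value are made up and are not real transformer outputs or the library's internals):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bert_score(cand_emb, ref_emb, cand_idf=None, ref_idf=None):
    """Greedy-matching BERTScore on token embeddings.

    Each candidate token is matched to its most similar reference token
    (precision) and vice versa (recall); optional IDF weights emphasize
    rare, informative tokens."""
    cand_idf = cand_idf or [1.0] * len(cand_emb)
    ref_idf = ref_idf or [1.0] * len(ref_emb)
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]
    precision = (sum(w * max(row) for w, row in zip(cand_idf, sim))
                 / sum(cand_idf))
    recall = (sum(w * max(sim[i][j] for i in range(len(cand_emb)))
                  for j, w in enumerate(ref_idf))
              / sum(ref_idf))
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def rescale(score, baseline):
    """Baseline rescaling: subtract an empirical baseline so scores
    spread over a more interpretable range."""
    return (score - baseline) / (1 - baseline)

# Toy "contextual embeddings" for a 2-token candidate and 3-token reference
cand = [[1.0, 0.0], [0.8, 0.6]]
ref = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
p, r, f1 = bert_score(cand, ref)
```

Note that matching a text against itself yields precision, recall, and F1 of exactly 1, since every token's best match is itself.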
The following figure, taken from the original paper, illustrates how BERTScore works:
from modern_bert_score import BertScore
candidates = ["Hello World!", "A robin is a bird."]
references = ["Hi World!", "A robin is not a bird."]
metric = BertScore(model_id="roberta-base")
scores = metric(candidates, references)
# scores is a list of (Precision, Recall, F1) tuples
# To get separate lists of P, R, F1:
P, R, F1 = zip(*scores)
print("Precision scores:", P)
print("Recall scores:", R)
print("F1 scores:", F1)
- For best performance, an optimal layer should be used for each model.
- To find the optimal layer, please use this script from the original BERTScore implementation.
Some pre-truncated models optimized for vLLM are available on Hugging Face and directly available in this library:
- `LazerLambda/ModernBERT-base-ModBERTScore-12` -> `ModernBERTBaseScore`
- `LazerLambda/ModernBERT-large-ModBERTScore-19` -> `ModernBERTLargeScore`
- `LazerLambda/roberta-base-ModBERTScore-10` -> `RobertaBaseScore`
- `LazerLambda/roberta-large-ModBERTScore-17` -> `RobertaLargeScore`
- `LazerLambda/roberta-large-mnli-ModBERTScore-19` -> `RobertaLargeMNLIScore`
- Implement base version and vLLM addon
- Add IDF-weighted scoring
- Add baseline-rescaling and scripts for identifying optimal baselines
- Add model (vLLM-)adaptation script for slicing the model
- Add multilingual support
- Add CLI tool

