
Modern BERTScore for Fast Inference

Supports Python 3.12, 3.13, and 3.14.


Modern-BERT-Score is a reimplementation of the BERTScore metric introduced by Zhang et al., 2019, optimized for modern inference workflows using SentenceTransformers and vLLM.

This library provides fast, GPU-accelerated scoring for text generation evaluation, making BERTScore practical for large-scale inference tasks.


⚡ Features

  • Fast, efficient computation with optional vLLM support
  • Compatible with all Hugging Face transformer models
  • Supports truncated and optimized model versions for faster inference
  • Works seamlessly with both CPU and GPU setups

📦 Installation

Modern-BERT-Score comes in two variants: a base version and a vLLM-enhanced version. For vLLM, an NVIDIA GPU is strongly recommended.

Base Version

```bash
pip install modern-bert-score
```

vLLM Version

```bash
pip install "modern-bert-score[vllm]"
```

(The quotes keep shells such as zsh from interpreting the square brackets.)

This implementation is significantly faster than the original BERTScore, especially with GPU acceleration.

📝 BERTScore

BERTScore (Zhang et al., 2019) evaluates the similarity between candidate and reference texts by comparing their contextual embeddings from a pre-trained transformer model. For each token in the candidate, it finds the most similar token in the reference (using cosine similarity) and aggregates these scores into precision, recall, and F1. Optionally, IDF weighting gives more importance to rare, informative words, improving the metric's sensitivity to meaningful content over common words. Baseline rescaling, also optional, linearly rescales the raw scores so that they typically fall in the range [0, 1] and are easier to interpret. Because it captures semantic similarity beyond exact word matches, BERTScore is robust for tasks such as machine translation and text generation evaluation.
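The per-pair computation described above can be sketched as follows. This is a minimal illustration assuming the token embeddings are already available as arrays; in the library they come from the chosen transformer model, and the function name here is purely illustrative:

```python
import numpy as np

def bert_score_pair(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """Greedy-matching BERTScore for one sentence pair.

    cand_emb: (m, d) candidate token embeddings
    ref_emb:  (n, d) reference token embeddings
    """
    # Normalize rows so that dot products are cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T  # (m, n) cosine similarity matrix

    # Each candidate token matches its most similar reference token, and
    # vice versa; averaging gives precision and recall.
    precision = sim.max(axis=1).mean()
    recall = sim.max(axis=0).mean()
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

IDF weighting would replace the plain means with IDF-weighted averages, and baseline rescaling would apply an affine transform to the resulting scores.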

The following figure, taken from the original paper, illustrates how BERTScore works:

*(Figure: the BERTScore computation pipeline, from Zhang et al., 2019.)*

🛠 Usage

Example

```python
from modern_bert_score import BertScore

candidates = ["Hello World!", "A robin is a bird."]
references = ["Hi World!", "A robin is not a bird."]

metric = BertScore(model_id="roberta-base")
scores = metric(candidates, references)

# scores is a list of (Precision, Recall, F1) tuples.
# To get separate lists of P, R, F1:
P, R, F1 = zip(*scores)

print("Precision scores:", P)
print("Recall scores:", R)
print("F1 scores:", F1)
```
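Corpus-level results are usually reported as the mean over sentence pairs. A minimal sketch, using made-up score tuples in place of the metric's output:

```python
from statistics import mean

# Hypothetical per-pair (Precision, Recall, F1) tuples, standing in for
# the list returned by the metric above.
scores = [(0.95, 0.92, 0.93), (0.88, 0.90, 0.89)]

P, R, F1 = zip(*scores)
print(f"Mean P={mean(P):.3f}  R={mean(R):.3f}  F1={mean(F1):.3f}")
```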

⚠️ NOTICE

Some pre-truncated models optimized for vLLM are published on Hugging Face and exposed directly by this library:

  • LazerLambda/ModernBERT-base-ModBERTScore-12 -> ModernBERTBaseScore
  • LazerLambda/ModernBERT-large-ModBERTScore-19 -> ModernBERTLargeScore
  • LazerLambda/roberta-base-ModBERTScore-10 -> RobertaBaseScore
  • LazerLambda/roberta-large-ModBERTScore-17 -> RobertaLargeScore
  • LazerLambda/roberta-large-mnli-ModBERTScore-19 -> RobertaLargeMNLIScore

🗺 Roadmap

  • Implement base version and vLLM addon
  • Add IDF-weighted scoring
  • Add baseline-rescaling and scripts for identifying optimal baselines
  • Add model (vLLM-)adaptation script for slicing the model
  • Add multilingual support
  • Add CLI tool
