Evaluation, benchmarking, and scorecards, targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination.
- Install from PyPI
pip install -r requirements.txt
pip install opea-eval

Note: Install requirements.txt first, because a PyPI package cannot declare a direct dependency on a specific commit.
- Build from Source
git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .

To evaluate models on text-generation tasks, we follow lm-evaluation-harness and provide both command line usage and function call usage. Over 60 standard academic benchmarks for LLMs are available, with hundreds of subtasks and variants implemented, such as ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.
# pip install --upgrade-strategy eager optimum[habana]
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model gaudi-hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device hpu \
--batch_size 8

cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model hf \
--model_args pretrained=EleutherAI/gpt-j-6B \
--tasks hellaswag \
--device cpu \
--batch_size 8

from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate
args = LMEvalParser(
    model="hf",
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="hellaswag",
    device="cpu",
    batch_size=8,
)
results = evaluate(args)

- set up a separate server with GenAIComps
# build cpu docker
docker build -f Dockerfile.cpu -t opea/lm-eval:latest .
# start the server
docker run -p 9006:9006 --ipc=host -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest
- evaluate the model
- set base_url, tokenizer, and --model genai-hf
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
--model genai-hf \
--model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
--tasks "lambada_openai" \
--batch_size 2
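
The --model_args value in the command above is a comma-separated list of key=value pairs. As a minimal sketch of how that string and the full command could be assembled from Python, the helpers below build the argument list. The function names build_model_args and genai_hf_command are hypothetical and are not part of GenAIEval; the default port 9006 is taken from the docker run command above.

```python
from typing import List


def build_model_args(**kwargs: str) -> str:
    """Join keyword arguments into a comma-separated key=value string.

    Hypothetical helper: it only mirrors the --model_args format shown in
    the command above and is not part of GenAIEval.
    """
    for key, value in kwargs.items():
        # A comma inside a value would be read as a pair separator.
        if "," in str(value):
            raise ValueError(f"value for {key!r} must not contain a comma")
    return ",".join(f"{key}={value}" for key, value in kwargs.items())


def genai_hf_command(
    host: str,
    tokenizer: str,
    task: str,
    port: int = 9006,
    batch_size: int = 2,
) -> List[str]:
    """Build the argument list for evaluating against a running server."""
    model_args = build_model_args(
        base_url=f"http://{host}:{port}",
        tokenizer=tokenizer,
    )
    return [
        "python", "main.py",
        "--model", "genai-hf",
        "--model_args", model_args,
        "--tasks", task,
        "--batch_size", str(batch_size),
    ]
```

The resulting list can be passed to subprocess.run with cwd set to evals/evaluation/lm_evaluation_harness/examples, which avoids shell quoting issues around the --model_args value.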
For evaluating the models on coding tasks or specifically coding LLMs, we follow the bigcode-evaluation-harness and provide the command line usage and function call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 for both completion (left-to-right) and insertion (FIM) mode are available.
cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py \
--model "codeparrot/codeparrot-small" \
--tasks "humaneval" \
--n_samples 100 \
--batch_size 10 \
--allow_code_execution

from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate
args = BigcodeEvalParser(
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="humaneval",
    n_samples=100,
    batch_size=10,
    allow_code_execution=True,
)
results = evaluate(args)
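
Both function call examples assume that user_model and tokenizer already exist. One plausible way to create them is with Hugging Face Transformers, as sketched below. This assumes the parsers accept a standard causal language model and its tokenizer, which this document does not state explicitly; load_user_model is a hypothetical helper name.

```python
def load_user_model(model_name: str):
    """Load a causal language model and its tokenizer by name.

    Assumption: the evaluation parsers accept the objects returned by
    Transformers' from_pretrained methods.
    """
    # Imported lazily so the helper can be defined without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    user_model = AutoModelForCausalLM.from_pretrained(model_name)
    user_model.eval()  # switch to inference mode (disables dropout)
    return user_model, tokenizer
```

For example, user_model, tokenizer = load_user_model("codeparrot/codeparrot-small") would load the same model used in the bigcode command line example, downloading it on first use.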