GenAIEval

Evaluation, benchmark, and scorecard, targeting for performance on throughput and latency, accuracy on popular evaluation harness, safety, and hallucination

Installation

Install from Pypi

pip install -r requirements.txt
pip install opea-eval

notes: We have to install requirements.txt at first, cause Pypi can't have direct dependency with specific commit.

Build from Source

git clone https://github.com/opea-project/GenAIEval
cd GenAIEval
pip install -r requirements.txt
pip install -e .

Evaluation

lm-evaluation-harness

For evaluating the models on text-generation tasks, we follow the lm-evaluation-harness and provide the command line usage and function call usage. Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented, such as ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K and so on.

command line usage

Gaudi2

# pip install --upgrade-strategy eager optimum[habana]
cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model gaudi-hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device hpu \
    --batch_size 8

CPU

cd evals/evaluation/lm_evaluation_harness/examples
python main.py \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cpu \
    --batch_size 8

function call usage

from evals.evaluation.lm_evaluation_harness import LMEvalParser, evaluate

args = LMevalParser(
    model="hf",
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="hellaswag",
    device="cpu",
    batch_size=8,
)
results = evaluate(args)

remote service usage

setup a separate server with GenAIComps

# build cpu docker
docker build -f Dockerfile.cpu -t opea/lm-eval:latest .

# start the server
docker run -p 9006:9006 --ipc=host  -e MODEL="hf" -e MODEL_ARGS="pretrained=Intel/neural-chat-7b-v3-3" -e DEVICE="cpu" opea/lm-eval:latest

evaluate the model

set base_url, tokenizer and --model genai-hf

cd evals/evaluation/lm_evaluation_harness/examples

python main.py \
    --model genai-hf \
    --model_args "base_url=http://{your_ip}:9006,tokenizer=Intel/neural-chat-7b-v3-3" \
    --tasks  "lambada_openai" \
    --batch_size 2

bigcode-evaluation-harness

For evaluating the models on coding tasks or specifically coding LLMs, we follow the bigcode-evaluation-harness and provide the command line usage and function call usage. HumanEval, HumanEval+, InstructHumanEval, APPS, MBPP, MBPP+, and DS-1000 for both completion (left-to-right) and insertion (FIM) mode are available.

command line usage

cd evals/evaluation/bigcode_evaluation_harness/examples
python main.py \
    --model "codeparrot/codeparrot-small" \
    --tasks "humaneval" \
    --n_samples 100 \
    --batch_size 10 \
    --allow_code_execution

function call usage

from evals.evaluation.bigcode_evaluation_harness import BigcodeEvalParser, evaluate

args = BigcodeEvalParser(
    user_model=user_model,
    tokenizer=tokenizer,
    tasks="humaneval",
    n_samples=100,
    batch_size=10,
    allow_code_execution=True,
)
results = evaluate(args)

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
.github		.github
docker		docker
evals		evals
tests		tests
.pre-commit-config.yaml		.pre-commit-config.yaml
LEGAL_INFORMATION.md		LEGAL_INFORMATION.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GenAIEval

Installation

Evaluation

lm-evaluation-harness

command line usage

Gaudi2

CPU

function call usage

remote service usage

bigcode-evaluation-harness

command line usage

function call usage

Additional Content

About

Uh oh!

Releases

Packages

Languages

License

XinyuYe-Intel/GenAIEval

Folders and files

Latest commit

History

Repository files navigation

GenAIEval

Installation

Evaluation

lm-evaluation-harness

command line usage

Gaudi2

CPU

function call usage

remote service usage

bigcode-evaluation-harness

command line usage

function call usage

Additional Content

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages