Description
I'm encountering an issue while trying to run inference on the meta-llama/Llama-3.1-8B-Instruct model using the benchmarking script provided in the repo. Here's my setup:
Environment:
- Env is created using instructions provided in intel-extension-for-pytorch/examples/cpu/llm/README.md
- Model: meta-llama/Llama-3.1-8B-Instruct
- Script: intel-extension-for-pytorch/examples/cpu/llm/inference/run.py
- Hardware: Intel(R) Xeon(R) Platinum 8592+ (EMR machine)
- Command:
python run.py --benchmark -m meta-llama/Llama-3.1-8B-Instruct --dtype float16 --max-new-tokens 1024 --input-tokens 128 --num-warmup 2 --batch-size 32 --num-iter 1
Issue: When I run the above command with --dtype float16 and the --ipex flag added (IPEX enabled), I get the following error:
RuntimeError: could not create a primitive descriptor for the inner product forward propagation primitive.
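For reference, here is a minimal sketch that tries to isolate the same code path outside run.py: a single fp16 inner product (Linear) run eagerly and then through ipex.optimize. The use of ipex.optimize(dtype=torch.float16) and the 4096x14336 shape (taken from the verbose log further down) are assumptions for illustration, not necessarily the exact path run.py takes:

# Minimal sketch (assumed repro, not the actual run.py code path): check whether
# a plain fp16 Linear forward works on this CPU, and whether the failure only
# appears after ipex.optimize(..., dtype=torch.float16).
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

# Shape mirrors the matmul reported in the oneDNN verbose log (4096 -> 14336).
model = nn.Linear(4096, 14336, bias=False).eval().to(torch.float16)
x = torch.randn(32, 4096, dtype=torch.float16)

with torch.no_grad():
    y = model(x)  # eager fp16 path, no IPEX
    print("eager fp16 forward ok:", y.shape, y.dtype)

    # Assumed to roughly mirror what --ipex --dtype float16 does inside run.py.
    opt_model = ipex.optimize(model, dtype=torch.float16)
    y2 = opt_model(x)  # the inner-product primitive is created on this call
    print("ipex fp16 forward ok:", y2.shape, y2.dtype)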
However, if I remove the --ipex flag, the script runs without crashing, but the oneDNN verbose logs show that the source, weight, and destination tensors are still in bf16 rather than float16 as expected:
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src:bf16::blocked:ab::f0 wei:bf16::blocked:ba::f0 dst:bf16::blocked:ab::f0,attr-scratchpad:user,,15232x4096:4096x14336,129.611
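For completeness, a verbose line like the one above can be checked in isolation with oneDNN's documented ONEDNN_VERBOSE environment variable. The standalone bf16 matmul below (shapes borrowed from the log) is only a sketch to confirm which datatypes the CPU matmul primitives actually run in, not part of the benchmarking script:

# Sketch for checking oneDNN primitive datatypes via verbose logging.
# ONEDNN_VERBOSE is read by oneDNN itself; set it before torch (and oneDNN)
# is loaded so primitive creation/execution lines are printed.
import os
os.environ.setdefault("ONEDNN_VERBOSE", "1")

import torch

# bf16 matmul with the same inner/output dims as the logged primitive.
a = torch.randn(32, 4096, dtype=torch.bfloat16)
w = torch.randn(4096, 14336, dtype=torch.bfloat16)
out = a @ w
print(out.dtype)  # the verbose output should list src/wei/dst as bf16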