
Commit f3a6c18

Authored by guangy10 (Guang Yang)
Benchmark HF optimum-executorch (#11450)
Benchmark LLMs from [`optimum-executorch`](https://github.com/huggingface/optimum-executorch). With all the recent work in `optimum-executorch`, we are able to boost the out-of-the-box performance. Putting these models on the benchmark infra lets us gather perf numbers and understand the remaining perf gap versus the in-house models generated via export_llama.

By introducing quantization, custom SDPA, and custom KV cache to native Hugging Face models in `optimum-executorch`, we can do an apples-to-apples comparison on the CPU backend: `hf_xnnpack_custom_spda_kv_cache_8da4w` is the recipe used by optimum-et, and `et_xnnpack_custom_spda_kv_cache_8da4w` is the etLLM counterpart.

Benchmark jobs in our infra:
- Android: https://github.com/pytorch/executorch/actions/runs/15597347625
- Apple: https://github.com/pytorch/executorch/actions/runs/15597340954

Also fixed the benchmark apps (Android and iOS) to support HF tokenizers.

---------

Co-authored-by: Guang Yang <[email protected]>
1 parent b905975 commit f3a6c18

File tree: 9 files changed, +229 −59 lines changed

.ci/scripts/gather_benchmark_configs.py
Lines changed: 7 additions & 5 deletions

@@ -32,7 +32,8 @@
 BENCHMARK_CONFIGS = {
     "xplat": [
         "xnnpack_q8",
-        "hf_xnnpack_fp32",
+        "hf_xnnpack_custom_spda_kv_cache_8da4w",
+        "et_xnnpack_custom_spda_kv_cache_8da4w",
         "llama3_fb16",
         "llama3_spinquant",
         "llama3_qlora",
@@ -129,25 +130,26 @@ def generate_compatible_configs(model_name: str, target_os=None) -> List[str]:
     """
     configs = []
     if is_valid_huggingface_model_id(model_name):
+        configs.append("hf_xnnpack_custom_spda_kv_cache_8da4w")
         if model_name.startswith("meta-llama/"):
-            # LLaMA models
+            # etLLM recipes for Llama
            repo_name = model_name.split("meta-llama/")[1]
            if "qlora" in repo_name.lower():
                configs.append("llama3_qlora")
            elif "spinquant" in repo_name.lower():
                configs.append("llama3_spinquant")
            else:
                configs.append("llama3_fb16")
+                configs.append("et_xnnpack_custom_spda_kv_cache_8da4w")
            configs.extend(
                [
                    config
                    for config in BENCHMARK_CONFIGS.get(target_os, [])
                    if config.startswith("llama")
                ]
            )
-        else:
-            # Non-LLaMA models
-            configs.append("hf_xnnpack_fp32")
+        if model_name.startswith("Qwen/Qwen3"):
+            configs.append("et_xnnpack_custom_spda_kv_cache_8da4w")
     elif model_name in MODEL_NAME_TO_MODEL:
         # ExecuTorch in-tree non-GenAI models
         configs.append("xnnpack_q8")

.github/workflows/android-perf-private-device-experiment.yml
Lines changed: 3 additions & 3 deletions

@@ -18,7 +18,7 @@ on:
         description: Models to be benchmarked
         required: false
         type: string
-        default: mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8
+        default: google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf
       devices:
         description: Target devices to run benchmark
         required: false
@@ -34,7 +34,7 @@ on:
         description: Models to be benchmarked
         required: false
         type: string
-        default: mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8
+        default: google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf
       devices:
         description: Target devices to run benchmark
         required: false
@@ -57,6 +57,6 @@ jobs:
       id-token: write
       contents: read
     with:
-      models: ${{ inputs.models || 'mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' }}
+      models: ${{ inputs.models || 'Qwen/Qwen3-0.6B' }}
      devices: samsung_galaxy_s22_private
      benchmark_configs: ${{ inputs.benchmark_configs }}

.github/workflows/android-perf.yml
Lines changed: 82 additions & 13 deletions

@@ -70,7 +70,7 @@ jobs:
         # Separate default values from the workflow dispatch. To ensure defaults are accessible
         # during scheduled runs and to provide flexibility for different defaults between
         # on-demand and periodic benchmarking.
-        CRON_DEFAULT_MODELS: ${{ github.event_name == 'schedule' && 'llama,mv3,mv2,ic4,ic3,resnet50,edsr,mobilebert,w2l,meta-llama/Llama-3.2-1B,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' || 'llama' }}
+        CRON_DEFAULT_MODELS: ${{ github.event_name == 'schedule' && 'llama,mv3,mv2,ic4,ic3,resnet50,edsr,mobilebert,w2l,meta-llama/Llama-3.2-1B,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8,google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,allenai/OLMo-1B-hf' || 'llama' }}
         CRON_DEFAULT_DEVICES: samsung_galaxy_s22
       run: |
         set -eux
@@ -201,8 +201,8 @@ jobs:
           HF_MODEL_REPO=${{ matrix.model }}
           OUT_ET_MODEL_NAME="$(echo "$HF_MODEL_REPO" | awk -F'/' '{print $2}' | sed 's/_/-/g' | tr '[:upper:]' '[:lower:]')_${{ matrix.config }}"

+          # Convert HF checkpoint to ET via etLLM path
           if [[ "$HF_MODEL_REPO" == meta-llama/* ]]; then
-            # Llama models on Hugging Face
             if [[ ${{ matrix.config }} == "llama3_spinquant" ]]; then
               # SpinQuant
               # Download prequantized chceckpoint from Hugging Face
@@ -272,6 +272,21 @@ jobs:
                 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
                 --output_name="${OUT_ET_MODEL_NAME}.pte"
               ls -lh "${OUT_ET_MODEL_NAME}.pte"
+            elif [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
+              DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model" "params.json" "consolidated.00.pth")
+              python -m examples.models.llama.export_llama \
+                --model llama3_2 \
+                --checkpoint "${DOWNLOADED_PATH}/consolidated.00.pth" \
+                --params "${DOWNLOADED_PATH}/params.json" \
+                -kv \
+                --use_sdpa_with_kv_cache \
+                -d fp32 \
+                -X \
+                --xnnpack-extended-ops \
+                -qmode 8da4w -G 32 -E 8,0 \
+                --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
+                --output_name="${OUT_ET_MODEL_NAME}.pte"
+              ls -lh "${OUT_ET_MODEL_NAME}.pte"
             elif [[ ${{ matrix.config }} == "llama3_qnn_htp" ]]; then
               export QNN_SDK_ROOT=/tmp/qnn/2.28.0.241029
               export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang/
@@ -292,21 +307,75 @@ jobs:
               OUT_ET_MODEL_NAME="llama3_2_qnn" # Qualcomm hard-coded it in their script
               find . -name "${OUT_ET_MODEL_NAME}.pte" -not -path "./${OUT_ET_MODEL_NAME}.pte" -exec mv {} ./ \;
               ls -lh "${OUT_ET_MODEL_NAME}.pte"
-            else
-              # By default, test with the Hugging Face model and the xnnpack recipe
-              DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model")
-              python -m extension.export_util.export_hf_model -hfm="$HF_MODEL_REPO" -o "$OUT_ET_MODEL_NAME"
-              ls -lh "${OUT_ET_MODEL_NAME}.pte"
             fi
-          else
-            echo "Unsupported model ${{ matrix.model }}"
-            exit 1
+          elif [[ "$HF_MODEL_REPO" == "Qwen/Qwen3-0.6B" ]]; then
+            if [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
+              DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "." --files "tokenizer.json")
+              python -m examples.models.llama.export_llama \
+                --model qwen3-0_6b \
+                --params examples/models/qwen3/0_6b_config.json \
+                -kv \
+                --use_sdpa_with_kv_cache \
+                -d fp32 \
+                -X \
+                --xnnpack-extended-ops \
+                -qmode 8da4w \
+                -G 32 \
+                -E 8,0 \
+                --metadata '{"get_bos_id": 151644, "get_eos_ids":[151645]}' \
+                --output_name="${OUT_ET_MODEL_NAME}.pte"
+              ls -lh "${OUT_ET_MODEL_NAME}.pte"
+            fi
+          fi
+
+          if [[ ${{ matrix.config }} == "hf_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
+            DOWNLOADED_PATH=$(
+              bash .ci/scripts/download_hf_hub.sh \
+                --model_id "${HF_MODEL_REPO}" \
+                --files "tokenizer.json"
+            )
+            echo "tokenizer.json is downloaded to $DOWNLOADED_PATH"
+
+            # Install optimum-executorch
+            git clone https://github.com/huggingface/optimum-executorch
+            pushd optimum-executorch
+            # There is no release yet, for CI stability, always test from the same commit on main
+            git checkout 1c653dc49812fc431a22312c7295d97005d22e12
+            python install_dev.py
+            pip list
+
+            ARGS=(
+              "--model" "${HF_MODEL_REPO}"
+              "--task" "text-generation"
+              "--recipe" "xnnpack"
+              "--use_custom_sdpa"
+              "--qlinear"
+              "--qembedding"
+              "--output_dir" ".."
+            )
+
+            # Add conditional arguments based on model
+            case "${HF_MODEL_REPO}" in
+              *"google/gemma-3-1b-it"*)
+                echo "--use_custom_kv_cache can not be used for HybridCache"
+                ;;
+              *)
+                ARGS+=("--use_custom_kv_cache")
+                ;;
+            esac
+
+            optimum-cli export executorch "${ARGS[@]}"
+            popd
+
+            mv model.pte ${OUT_ET_MODEL_NAME}.pte
+            ls -lh "${OUT_ET_MODEL_NAME}.pte"
           fi

-          zip -j model.zip "${OUT_ET_MODEL_NAME}.pte" "${DOWNLOADED_PATH}/tokenizer.model"
+          zip -j model.zip ${OUT_ET_MODEL_NAME}.pte ${DOWNLOADED_PATH}/tokenizer.*
           ls -lh model.zip
-          mkdir -p "${ARTIFACTS_DIR_NAME}"
-          mv model.zip "${ARTIFACTS_DIR_NAME}"
+          mkdir -p ${ARTIFACTS_DIR_NAME}
+          mv model.zip ${ARTIFACTS_DIR_NAME}
+          ls -lh ${ARTIFACTS_DIR_NAME}
         elif [[ ${{ matrix.model }} == "llama" ]]; then
           # Install requirements for export_llama
           PYTHON_EXECUTABLE=python bash examples/models/llama/install_requirements.sh
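
For a local apples-to-apples reproduction of the two Qwen3-0.6B export paths in this step, the commands boil down to the sketch below. This is a rough driver, not CI code: it assumes the executorch repo root as the working directory, the export_llama requirements installed, and `optimum-cli` available from an optimum-executorch install (the workflow pins commit 1c653dc). The flags are taken verbatim from the workflow step above; the output file name is just an example.

```python
# Local-reproduction sketch: export Qwen3-0.6B with both the etLLM recipe and the
# optimum-executorch recipe so the resulting .pte files can be compared.
import subprocess

ETLLM_CMD = [
    "python", "-m", "examples.models.llama.export_llama",
    "--model", "qwen3-0_6b",
    "--params", "examples/models/qwen3/0_6b_config.json",
    "-kv", "--use_sdpa_with_kv_cache",
    "-d", "fp32",
    "-X", "--xnnpack-extended-ops",
    "-qmode", "8da4w", "-G", "32", "-E", "8,0",
    "--metadata", '{"get_bos_id": 151644, "get_eos_ids":[151645]}',
    "--output_name=qwen3-0.6b_et_xnnpack_custom_spda_kv_cache_8da4w.pte",  # example name
]

OPTIMUM_CMD = [
    "optimum-cli", "export", "executorch",
    "--model", "Qwen/Qwen3-0.6B",
    "--task", "text-generation",
    "--recipe", "xnnpack",
    "--use_custom_sdpa",
    "--use_custom_kv_cache",  # skipped for gemma-3 (HybridCache) in the workflow
    "--qlinear", "--qembedding",
    "--output_dir", ".",
]

for cmd in (ETLLM_CMD, OPTIMUM_CMD):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)
```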

.github/workflows/apple-perf-private-device-experiment.yml
Lines changed: 3 additions & 3 deletions

@@ -18,7 +18,7 @@ on:
         description: Models to be benchmarked
         required: false
         type: string
-        default: mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8
+        default: google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf
       devices:
         description: Target devices to run benchmark
         required: false
@@ -34,7 +34,7 @@ on:
         description: Models to be benchmarked
         required: false
         type: string
-        default: mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8
+        default: Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf
       devices:
         description: Target devices to run benchmark
         required: false
@@ -57,6 +57,6 @@ jobs:
       id-token: write
       contents: read
     with:
-      models: ${{ inputs.models || 'mv3,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' }}
+      models: ${{ inputs.models || 'Qwen/Qwen3-0.6B' }}
      devices: apple_iphone_15_private
      benchmark_configs: ${{ inputs.benchmark_configs }}

.github/workflows/apple-perf.yml
Lines changed: 84 additions & 10 deletions

@@ -70,7 +70,7 @@ jobs:
         # Separate default values from the workflow dispatch. To ensure defaults are accessible
         # during scheduled runs and to provide flexibility for different defaults between
         # on-demand and periodic benchmarking.
-        CRON_DEFAULT_MODELS: ${{ github.event_name == 'schedule' && 'llama,mv3,mv2,ic4,ic3,resnet50,edsr,mobilebert,w2l,meta-llama/Llama-3.2-1B,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8' || 'llama' }}
+        CRON_DEFAULT_MODELS: ${{ github.event_name == 'schedule' && 'llama,mv3,mv2,ic4,ic3,resnet50,edsr,mobilebert,w2l,meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8,meta-llama/Llama-3.2-1B-Instruct-QLORA_INT4_EO8,google/gemma-3-1b-it,Qwen/Qwen3-0.6B,HuggingFaceTB/SmolLM2-135M,meta-llama/Llama-3.2-1B,allenai/OLMo-1B-hf' || 'llama' }}
         CRON_DEFAULT_DEVICES: apple_iphone_15
       run: |
         set -eux
@@ -207,7 +207,10 @@ jobs:
           HF_MODEL_REPO=${{ matrix.model }}
           OUT_ET_MODEL_NAME="$(echo "$HF_MODEL_REPO" | awk -F'/' '{print $2}' | sed 's/_/-/g' | tr '[:upper:]' '[:lower:]')_${{ matrix.config }}"

+          # Convert HF checkpoint to ET via etLLM path
           if [[ "$HF_MODEL_REPO" == meta-llama/* ]]; then
+            # The benchmark app replies on the _llm suffix to determine whether the model is a LLM or not
+            OUT_ET_MODEL_NAME=${OUT_ET_MODEL_NAME}_llm
             # Llama models on Hugging Face
             if [[ ${{ matrix.config }} == "llama3_spinquant" ]]; then
               # SpinQuant
@@ -278,6 +281,21 @@ jobs:
                 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
                 --output_name="${OUT_ET_MODEL_NAME}.pte"
               ls -lh "${OUT_ET_MODEL_NAME}.pte"
+            elif [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
+              DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model" "params.json" "consolidated.00.pth")
+              ${CONDA_RUN} python -m examples.models.llama.export_llama \
+                --model llama3_2 \
+                --checkpoint "${DOWNLOADED_PATH}/consolidated.00.pth" \
+                --params "${DOWNLOADED_PATH}/params.json" \
+                -kv \
+                --use_sdpa_with_kv_cache \
+                -d fp32 \
+                -X \
+                --xnnpack-extended-ops \
+                -qmode 8da4w -G 32 -E 8,0 \
+                --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
+                --output_name="${OUT_ET_MODEL_NAME}.pte"
+              ls -lh "${OUT_ET_MODEL_NAME}.pte"
             elif [[ ${{ matrix.config }} == "llama3_coreml_ane" ]]; then
               # ANE
               DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model" "params.json" "consolidated.00.pth")
@@ -293,18 +311,74 @@ jobs:
                 --coreml-compute-units cpu_and_ne \
                 --output_name="${OUT_ET_MODEL_NAME}.pte"
               ls -lh "${OUT_ET_MODEL_NAME}.pte"
-            else
-              # By default, test with the Hugging Face model and the xnnpack recipe
-              DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "original" --files "tokenizer.model")
-              ${CONDA_RUN} python -m extension.export_util.export_hf_model -hfm="$HF_MODEL_REPO" -o "$OUT_ET_MODEL_NAME"
-              ls -lh "${OUT_ET_MODEL_NAME}.pte"
             fi
-          else
-            echo "Unsupported model ${{ matrix.model }}"
-            exit 1
+          elif [[ "$HF_MODEL_REPO" == "Qwen/Qwen3-0.6B" ]]; then
+            OUT_ET_MODEL_NAME=${OUT_ET_MODEL_NAME}_llm
+            if [[ ${{ matrix.config }} == "et_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
+              DOWNLOADED_PATH=$(bash .ci/scripts/download_hf_hub.sh --model_id "${HF_MODEL_REPO}" --subdir "." --files "tokenizer.json")
+              ${CONDA_RUN} python -m examples.models.llama.export_llama \
+                --model qwen3-0_6b \
+                --params examples/models/qwen3/0_6b_config.json \
+                -kv \
+                --use_sdpa_with_kv_cache \
+                -d fp32 \
+                -X \
+                --xnnpack-extended-ops \
+                -qmode 8da4w \
+                -G 32 \
+                -E 8,0 \
+                --metadata '{"get_bos_id": 151644, "get_eos_ids":[151645]}' \
+                --output_name="${OUT_ET_MODEL_NAME}.pte"
+              ls -lh "${OUT_ET_MODEL_NAME}.pte"
+            fi
+          fi
+
+          if [[ ${{ matrix.config }} == "hf_xnnpack_custom_spda_kv_cache_8da4w" ]]; then
+            DOWNLOADED_PATH=$(
+              bash .ci/scripts/download_hf_hub.sh \
+                --model_id "${HF_MODEL_REPO}" \
+                --files "tokenizer.json"
+            )
+            echo "tokenizer.json is downloaded to $DOWNLOADED_PATH"
+
+            # Install optimum-executorch
+            git clone https://github.com/huggingface/optimum-executorch
+            pushd optimum-executorch
+            # There is no release yet, for CI stability, always test from the same commit on main
+            git checkout 1c653dc49812fc431a22312c7295d97005d22e12
+            ${CONDA_RUN} python install_dev.py
+            pip list
+
+            ARGS=(
+              "--model" "${HF_MODEL_REPO}"
+              "--task" "text-generation"
+              "--recipe" "xnnpack"
+              "--use_custom_sdpa"
+              "--qlinear"
+              "--qembedding"
+              "--output_dir" ".."
+            )
+
+            # Add conditional arguments based on model
+            case "${HF_MODEL_REPO}" in
+              *"google/gemma-3-1b-it"*)
+                echo "--use_custom_kv_cache can not be used for HybridCache"
+                ;;
+              *)
+                ARGS+=("--use_custom_kv_cache")
+                ;;
+            esac
+
+            ${CONDA_RUN} optimum-cli export executorch "${ARGS[@]}"
+            popd
+
+            # The benchmark app replies on the _llm suffix to determine whether the model is a LLM or not
+            OUT_ET_MODEL_NAME=${OUT_ET_MODEL_NAME}_llm
+            mv model.pte ${OUT_ET_MODEL_NAME}.pte
+            ls -lh "${OUT_ET_MODEL_NAME}.pte"
           fi

-          zip -j model.zip "${OUT_ET_MODEL_NAME}.pte" "${DOWNLOADED_PATH}/tokenizer.model"
+          zip -j model.zip ${OUT_ET_MODEL_NAME}.pte ${DOWNLOADED_PATH}/tokenizer.*
           ls -lh model.zip
           mkdir -p "${ARTIFACTS_DIR_NAME}"
           mv model.zip "${ARTIFACTS_DIR_NAME}"
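
Beyond the on-device benchmark apps, the same optimum-executorch path can be sanity-checked from Python. The sketch below is an assumption based on the ExecuTorchModelForCausalLM API shown in the optimum-executorch README, not code from this PR, and it uses the plain xnnpack recipe rather than the full custom-SDPA/KV-cache/8da4w flag set the workflows pass via optimum-cli.

```python
# Hedged sketch: assumes the Python API from the optimum-executorch README.
# The CI jobs above export with `optimum-cli export executorch` and run the
# resulting .pte inside the Android/iOS benchmark apps instead.
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

model_id = "Qwen/Qwen3-0.6B"  # one of the newly benchmarked models
model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(
    model.text_generation(
        tokenizer=tokenizer,
        prompt="Give me a short introduction to ExecuTorch.",
        max_seq_len=128,
    )
)
```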
