Qualcomm AI Engine Direct - GA Static QWEN2.5 0.5B #12054
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12054
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Pending, 2 Unrelated Failures as of commit 7beed6e with merge base 6ac5df2.
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Thanks! If it's just that the accuracy is bad, @rohansjoshi has been making progress on this. Can the qwen model run with this PR (it doesn't need to be accurate)?
from transformers.configuration_utils import PretrainedConfig


class RMSNorm(nn.Module):
I think we need to have the same static llama file (maybe with some branches) instead of starting a new static qwen model
Thanks for reviewing the PR. We are trying to have a static decoder file that can be shared across all GA decoder models + Llama.
For this draft, we are just making sure the e2e flow is working, and we will then work on merging all static decoder files into one.
Yes, the qwen model can run with this PR.
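For illustration, here is a minimal sketch of the "one static decoder file with branches" idea, assuming illustrative class and field names (StaticDecoderArgs, tie_word_embeddings) rather than the PR's actual code:

```python
# Sketch only: names here are illustrative, not the PR's implementation.
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class StaticDecoderArgs:
    dim: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    vocab_size: int
    norm_eps: float = 1e-6
    # Qwen2.5-0.5B ties the LM head to the embedding table; Llama-style
    # checkpoints typically do not, so this becomes a config branch.
    tie_word_embeddings: bool = False


class StaticDecoder(nn.Module):
    """One shared static decoder, branching on per-model options."""

    def __init__(self, args: StaticDecoderArgs):
        super().__init__()
        self.tok_embeddings = nn.Embedding(args.vocab_size, args.dim)
        self.output = nn.Linear(args.dim, args.vocab_size, bias=False)
        if args.tie_word_embeddings:
            # Branch for models like Qwen2.5-0.5B instead of a separate file.
            self.output.weight = self.tok_embeddings.weight
```

The point is that model-specific differences (tied embeddings, GQA head counts, norm epsilon) become configuration branches inside one shared file rather than separate static model files.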
Two things you can do to fix accuracy are:
If you make both of these changes in
auto eos_ids = std::make_unique<std::unordered_set<uint64_t>>();
// TODO: remove this once we could release the new tokens used for the
// tokenizer
if (tokenizer_ != nullptr) {
@limintang can you take a look at the changes?
Oops, I clicked merge by mistake... reverting in #12506. I have some comments for this PR.
# LICENSE file in the root directory of this source tree.


def convert_configs(config):
Can you share where this config comes from and how we can scale it to more models?
The config can be found here:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/configuration_qwen2.py.
For now, I checked qwen and gemma, and I think most configs follow the same naming, since they all use PretrainedConfig as the base class. Ideally, this function should be able to support most HF decoder models, but I will need to test them one by one to confirm they can all be supported by this function.
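To make the idea concrete, here is a minimal sketch of such a mapping, assuming the HF attribute names used by the Qwen2/Gemma configs (hidden_size, num_hidden_layers, ...); the target keys are illustrative and may differ from the PR's actual convert_configs:

```python
from transformers.configuration_utils import PretrainedConfig


def convert_configs(config: PretrainedConfig) -> dict:
    """Map common HF PretrainedConfig attributes to static-decoder arguments."""
    return {
        "dim": config.hidden_size,
        "n_layers": config.num_hidden_layers,
        "n_heads": config.num_attention_heads,
        # Fall back to MHA when the model does not define GQA heads.
        "n_kv_heads": getattr(config, "num_key_value_heads", config.num_attention_heads),
        "hidden_dim": config.intermediate_size,
        "norm_eps": config.rms_norm_eps,
        "rope_theta": getattr(config, "rope_theta", 10000.0),
        "vocab_size": config.vocab_size,
        "max_seq_len": config.max_position_embeddings,
    }


# e.g. convert_configs(transformers.AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B"))
```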
sys.setrecursionlimit(4096)
FORMAT = "[%(levelname)s %(asctime)s %(filename)s:%(lineno)s] %(message)s"
logging.basicConfig(level=logging.INFO, format=FORMAT)
logging.getLogger().setLevel(logging.INFO)

HUGGING_FACE_REPO_IDS = {"qwen2_5": "Qwen/Qwen2.5-0.5B"}
Each model will have its own config, and I feel like it can be shared with the CPU model: https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L32
It doesn't have to be in this PR, and I'd like to hear your thoughts on this.
I would like to confirm which config you are mentioning here.
Do you mean that instead of creating this dictionary, maybe we could contribute to the dict here?
HUGGING_FACE_REPO_IDS = {
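For context, a minimal sketch of how a mapping like HUGGING_FACE_REPO_IDS is typically consumed, assuming huggingface_hub's snapshot_download for the fetch (download_checkpoint is a hypothetical helper, not necessarily the script's actual code):

```python
from huggingface_hub import snapshot_download

HUGGING_FACE_REPO_IDS = {"qwen2_5": "Qwen/Qwen2.5-0.5B"}


def download_checkpoint(decoder_model: str) -> str:
    """Resolve the --decoder_model flag to a HF repo id and fetch it locally."""
    repo_id = HUGGING_FACE_REPO_IDS[decoder_model]
    # Returns the local directory containing config.json, tokenizer files,
    # and the model weights.
    return snapshot_download(repo_id=repo_id)


# e.g. ckpt_dir = download_checkpoint("qwen2_5")
```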
custom_annotations = (annotate_matmul_16a8w,)
if args.llama_model == "stories110m":
    custom_annotations = (
        # For qwen2.5, skip annotate_conv can improve result.
Why is it the case?
The "skip annotate_conv" here actually refers to skipping the annotation of conv as 8a4w, not skipping conv2d quantization altogether, which means that for qwen the conv2d is quantized as 16a8w. I have tested both skipping and not skipping the 8a4w conv annotation, and the CPU QDQ results are a lot better when I skip annotating conv2d as 8a4w.
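To illustrate what "skip annotate_conv" means here: with the global PTQ config at 16a8w, adding the conv annotation would push conv2d down to 8a4w, while skipping it leaves conv2d at 16a8w. A rough sketch, where annotate_conv_8a4w is a hypothetical name standing in for the conv annotation pass and the import path for annotate_matmul_16a8w is assumed from the ExecuTorch Qualcomm quantizer utilities:

```python
from executorch.backends.qualcomm.quantizer.custom_annotation import (
    annotate_matmul_16a8w,
)


def build_custom_annotations(decoder_model: str, annotate_conv_8a4w=None):
    """Return custom annotation passes; annotate_conv_8a4w is a hypothetical
    stand-in for the conv pass that is skipped for qwen2.5."""
    custom_annotations = (annotate_matmul_16a8w,)
    if decoder_model != "qwen2_5" and annotate_conv_8a4w is not None:
        # Other models additionally annotate conv layers down to 8a4w.
        custom_annotations += (annotate_conv_8a4w,)
    # For qwen2.5 the 8a4w conv annotation is skipped, so conv2d stays at the
    # global 16a8w PTQ config, which gave better CPU QDQ accuracy in testing.
    return custom_annotations
```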
After the comments are addressed, I can revert the revert again...
Also, I tried rebasing onto mainline to test whether the tokenizer issue is resolved (#12333). However, after rebasing, I encountered another error:
Summary
Static Qwen2.5 0.5B enablement.
Please use 16a8w for qwen, as other quant configs are not yet fully supported.
Script
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8650 --prompt "Hello, how are you?" --temperature 0 --model_mode kv --max_seq_len 128 --ptq 16a8w --decoder_model qwen2_5
Stats
SM8650

SM8750

Test plan
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_qwen2_5 --model SM8650 --build_folder build-android/ --executorch_root . -s $DEVICE --artifact ./qwen2_5
Author: @haowhsu-quic, @winskuo-quic