
Qualcomm AI Engine Direct - GA Qwen 2.5 0.5B #12333


Draft · wants to merge 2 commits into main from dev1/hutton/ga_qwen2.5_0.5B

Conversation

@shewu-quic (Collaborator) commented Jul 10, 2025

Summary:

  • Add a decoder_model_wrapper.py to ensure that the exported model can be fully delegated to the QNN backend
  • Add an e2e script to run Qwen 2.5
    • Support SpinQuant R3
    • Replace Qwen2Attention with QCQwen2Attention
    • Pre-compute freqs_cos and freqs_sin to bypass rotary embedding (see the sketch below)
    • Replace Qwen2RMSNorm with torch.nn.RMSNorm
    • Tag quant I/O to avoid inserting Q/DQ for I/O
    • Reuse the ExecuTorch llama runner, llama_main

Note that accuracy is currently poor; more investigation is needed.
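For context on the rotary-embedding item above, the idea is to compute the cos/sin tables once at export time and register them as buffers, so the exported graph only gathers rows by position instead of running the rotary math per token. The sketch below is a minimal illustration under that assumption; the wrapper name, forward signature, and theta value are illustrative and not the actual decoder_model_wrapper.py code.

```python
import torch

def precompute_freqs(head_dim: int, max_seq_len: int, theta: float = 1_000_000.0):
    # Standard RoPE tables: one row per position, one column per rotary pair.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_seq_len).float(), inv_freq)
    return torch.cos(angles), torch.sin(angles)  # freqs_cos, freqs_sin

class DecoderModelWrapper(torch.nn.Module):
    """Illustrative wrapper: holds the RoPE tables as buffers so no trig ops are traced."""

    def __init__(self, model, head_dim: int, max_seq_len: int):
        super().__init__()
        self.model = model
        freqs_cos, freqs_sin = precompute_freqs(head_dim, max_seq_len)
        self.register_buffer("freqs_cos", freqs_cos, persistent=False)
        self.register_buffer("freqs_sin", freqs_sin, persistent=False)

    def forward(self, input_ids, input_pos):
        # Gather the rows for the current positions; the wrapped model is assumed
        # to accept precomputed cos/sin instead of computing rotary embeddings.
        cos = self.freqs_cos[input_pos]
        sin = self.freqs_sin[input_pos]
        return self.model(input_ids, cos, sin)
```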

Reproduce command

python3 examples/qualcomm/oss_scripts/qwen/qwen.py -s <serial> -H <host> -m SM8750 --prompt "My favourite condiment is " -b build-android --decoder_model qwen2.5_0.5B --ptq 16a16w

Results

7/9

ptq: 16a16w
Speed: 62 tok/sec on SM8750, seq_len = 128
Accuracy: Bad

Outputs:

I 00:00:02.944266 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:02.944270 executorch:stats.h:114] 	Model Load Time:		0.677000 (seconds)
I 00:00:02.944274 executorch:stats.h:124] 	Total inference time:		2.034000 (seconds)		 Rate: 	59.488692 (tokens/second)
I 00:00:02.944279 executorch:stats.h:132] 		Prompt evaluation:	0.093000 (seconds)		 Rate: 	64.516129 (tokens/second)
I 00:00:02.944283 executorch:stats.h:143] 		Generated 121 tokens:	1.941000 (seconds)		 Rate: 	62.339001 (tokens/second)
I 00:00:02.944288 executorch:stats.h:151] 	Time to first generated token:	0.093000 (seconds)
I 00:00:02.944292 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.059000 (seconds)
My favourite condiment is a thing, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan

7/11

ptq: 16a8w
Speed: 135 tok/sec on SM8750, seq_len = 128
Accuracy: Seems better

Outputs:

I 00:00:00.734588 executorch:text_llm_runner.cpp:100] RSS after loading model: 829.648438 MiB (0 if unsupported)
I 00:00:00.734865 executorch:text_llm_runner.cpp:157] Max new tokens resolved: 122, given start_pos 0, num_prompt_tokens 6, max_context_len 128
I 00:00:00.784392 executorch:text_llm_runner.cpp:184] RSS after prompt prefill: 829.648438 MiB (0 if unsupported)
I 00:00:01.677137 executorch:text_llm_runner.cpp:204] RSS after finishing text generation: 829.648438 MiB (0 if unsupported)
I 00:00:01.677171 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:01.677180 executorch:stats.h:114] 	Model Load Time:		0.431000 (seconds)
I 00:00:01.677187 executorch:stats.h:124] 	Total inference time:		0.943000 (seconds)		 Rate: 	128.313892 (tokens/second)
I 00:00:01.677193 executorch:stats.h:132] 		Prompt evaluation:	0.050000 (seconds)		 Rate: 	120.000000 (tokens/second)
I 00:00:01.677201 executorch:stats.h:143] 		Generated 121 tokens:	0.893000 (seconds)		 Rate: 	135.498320 (tokens/second)
I 00:00:01.677208 executorch:stats.h:151] 	Time to first generated token:	0.050000 (seconds)
I 00:00:01.677215 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.017000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Function not called, PrepareLib isn't loaded!

/data/local/tmp/shewu/executorch/qwen_qnn_q16/outputs/: 1 file pulled. 0.7 MB/s (883 bytes in 0.001s)
INFO:root:Results[0]:
Setting up pretokenizer...
Pretokenizer set up
My favourite condiment is iced tea. I love it so much that I have to have it every day. I have a habit of making it at home. I have a few recipes for iced tea. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite

cc: @winskuo-quic, @haowhsu-quic


pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12333

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 10, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@shewu-quic (Collaborator, Author)

Hi @cccclai @kimishpatel,

I am working on supporting decoder-only models via the transformers path.
I created a wrapper for the decoder model based on TorchExportableModuleWithStaticCache from transformers.
A few changes are needed to fully delegate it to the QNN backend:

  1. Change the attention mask to avoid computing it inside the model (see the sketch after this list)
  2. Add buffers for freqs_cos and freqs_sin to bypass rotary embedding in the model
  3. Replace the attention with QCQwen2Attention
  4. Replace Qwen2RMSNorm with torch.nn.RMSNorm
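As a rough illustration of item 1, one way to avoid building the causal mask inside the model is to precompute a full additive mask once and slice it by position. This is a hedged sketch of the idea only; the class name and shapes are illustrative and not the code in this PR.

```python
import torch

class StaticCausalMask(torch.nn.Module):
    """Illustrative: a (max_seq_len, max_seq_len) additive causal mask built once
    and sliced per step, so no mask construction runs inside the decoder."""

    def __init__(self, max_seq_len: int, dtype: torch.dtype = torch.float32):
        super().__init__()
        mask = torch.full((max_seq_len, max_seq_len), torch.finfo(dtype).min, dtype=dtype)
        mask = torch.triu(mask, diagonal=1)  # 0 on/below the diagonal, -inf above
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, input_pos: torch.Tensor) -> torch.Tensor:
        # Select the rows for the current positions; broadcastable over batch/heads.
        return self.mask[input_pos].unsqueeze(0).unsqueeze(0)
```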

May I know whether these changes are acceptable?

Thanks.

@shewu-quic (Collaborator, Author)

And one more question: when I use your runner with Qwen 2.5, I get an error from the tokenizer. I added a workaround that changes the behavior to MergedWithPrevious. Could you give me some advice on how to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

@kimishpatel (Contributor)

And one more question: when I use your runner with Qwen 2.5, I get an error from the tokenizer. I added a workaround that changes the behavior to MergedWithPrevious. Could you give me some advice on how to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

@larryliu0820 on this question

@@ -156,11 +156,22 @@ def __init__(self, weight, bias=None):

def forward(self, x):
rank = x.dim()
x = x.unsqueeze(-1) if rank == 3 else x.reshape(1, *x.shape, 1)
x = torch.transpose(x, 1, 2)
if rank == 2:
Contributor

I have a different question here: can this be done as a graph pass?

Collaborator Author

I believe it's possible: for the linear target, we could use set_parameter to modify the weight and add permute and reshape nodes during transform_for_annotation.
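For what it's worth, the rewrite such a pass would perform can be checked in eager mode: a Linear over a (batch, seq, features) tensor equals a 1x1 Conv2d once the weight is reshaped (what set_parameter would install) and the activation is permuted/reshaped around the op. A small sanity check under those assumptions:

```python
import torch

B, L, Cin, Cout = 2, 8, 16, 32
x = torch.randn(B, L, Cin)
linear = torch.nn.Linear(Cin, Cout)

y_ref = linear(x)

# Weight reshaped to a 1x1 conv kernel; activation permuted/reshaped around the op.
w4d = linear.weight.view(Cout, Cin, 1, 1)
x4d = x.permute(0, 2, 1).unsqueeze(-1)                      # (B, Cin, L, 1)
y_conv = torch.nn.functional.conv2d(x4d, w4d, linear.bias)  # (B, Cout, L, 1)
y_conv = y_conv.squeeze(-1).permute(0, 2, 1)                # back to (B, L, Cout)

torch.testing.assert_close(y_ref, y_conv)
```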

# =====================================================================
outs = self.model(
input_ids=input_ids,
attention_mask=attn_mask,
Contributor

One question: if you have to specify a per-layer mask, how would you do it?

@guangy10 does the transformers API allow a per-layer mask to be specified here, as a list of tensors or something?

)
if quant_dtype == QuantDtype.use_16a4w_block:
conv_nodes = [
n for n in fx_graph_module.graph.nodes if "conv" in n.name
Contributor

Don't you want to check the type of the node or node.target to see if it is a conv?

Collaborator Author

Good point. Thanks!
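For reference, a name-based check such as `"conv" in n.name` can over-match nodes whose autogenerated names merely contain that substring. A hedged sketch of a target-based check instead, assuming the same fx_graph_module as in the snippet above; the exact set of conv overloads to include depends on how the graph was captured:

```python
import torch

CONV_TARGETS = {
    torch.ops.aten.conv2d.default,
    torch.ops.aten.convolution.default,
}

conv_nodes = [
    n
    for n in fx_graph_module.graph.nodes
    if n.op == "call_function" and n.target in CONV_TARGETS
]
```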

@@ -67,10 +67,11 @@ class ET_EXPERIMENTAL TextDecoderRunner {
const executorch::aten::Tensor& logits_tensor,
const float temperature = 0.0f) {
int32_t result = 0;
ET_SWITCH_THREE_TYPES(
ET_SWITCH_FOUR_TYPES(
Contributor

@larryliu0820 do these changes seem acceptable?

@kimishpatel (Contributor) left a comment

Left some comments, but I think this is not the right approach for enabling transformers models. I left inline comments describing why.

@kimishpatel (Contributor)

A couple of other questions I have:

  1. What is the performance like compared to a static_llama-like solution?
  2. How is the KV cache managed? If it is quantized, how are you doing quantized cache updates?

@shewu-quic (Collaborator, Author)

Left some comments, but I think this is not the right approach for enabling transformers models. I left inline comments describing why.

Let me confirm my understanding: the general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without touching the actual model code structure.
I have a question: is it possible to extend the flexibility of the transformers API (specifically TorchExportableModuleWithStaticCache) to allow custom attention masks or attention functions to be specified?
It seems the transformers framework intends to support this, as indicated by the presence of ALL_MASK_ATTENTION_FUNCTIONS and ALL_ATTENTION_FUNCTIONS.

@kimishpatel (Contributor)

Left some comments, but I think this is not the right approach for enabling transformers models. I left inline comments describing why.

Let me confirm my understanding: the general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without touching the actual model code structure. I have a question: is it possible to extend the flexibility of the transformers API (specifically TorchExportableModuleWithStaticCache) to allow custom attention masks or attention functions to be specified? It seems the transformers framework intends to support this, as indicated by the presence of ALL_MASK_ATTENTION_FUNCTIONS and ALL_ATTENTION_FUNCTIONS.

Yes. I pointed to some examples of that, but note that it won't allow you to do the kinds of things you may be doing, like inserting R1/R3, etc., at least to my understanding. If you can do that using the attention customization interface, that's great.
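For what it's worth, below is a minimal sketch of the attention-customization route, assuming the AttentionInterface registration API available in recent transformers releases; the function name and the exact callable signature are assumptions, and the R3 hook is only indicated by a comment.

```python
import torch
from transformers import AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def qnn_friendly_attention(module, query, key, value, attention_mask, **kwargs):
    # Hook point: rotations (e.g. SpinQuant R3) or mask adjustments could be
    # applied to query/key here before delegating to the stock SDPA kernel.
    return sdpa_attention_forward(module, query, key, value, attention_mask, **kwargs)

AttentionInterface.register("qnn_friendly", qnn_friendly_attention)
# model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="qnn_friendly")
```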

@shewu-quic (Collaborator, Author)

A couple of other questions I have:

  1. What is the performance like compared to a static_llama-like solution?
  2. How is the KV cache managed? If it is quantized, how are you doing quantized cache updates?

  1. We are currently working on this item, which evaluates PPL and performance. We will get back to you with the results as soon as possible.
  2. For the transformers path, the KV cache is managed with the mutable-buffer mechanism, using index_put to update the cache on each turn (see the sketch below). For the quantized cache, we avoid inserting Q/DQ for the KV cache inputs and outputs with the tag_quant_io pass.
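A minimal sketch of the mutable-buffer idea described in point 2, assuming fixed-size cache buffers updated in place with aten.index_put_; the class and shapes are illustrative, not the transformers StaticCache implementation.

```python
import torch

class StaticKVCache(torch.nn.Module):
    """Illustrative static cache: fixed-size K/V buffers mutated in place, so the
    exported graph carries index_put-style buffer updates instead of growing tensors."""

    def __init__(self, n_heads: int, max_seq_len: int, head_dim: int):
        super().__init__()
        shape = (1, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape))
        self.register_buffer("v_cache", torch.zeros(shape))

    def update(self, input_pos: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # input_pos: (seq_len,); k, v: (1, n_heads, seq_len, head_dim).
        # None entries mean "take the whole dimension" for aten.index_put_.
        torch.ops.aten.index_put_(self.k_cache, [None, None, input_pos], k)
        torch.ops.aten.index_put_(self.v_cache, [None, None, input_pos], v)
        return self.k_cache, self.v_cache
```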

Summary:
- Add a decoder_model_wrapper.py to ensure that the exported model can
  be fully delegated to the QNN backend
- Add an e2e script to run Qwen 2.5
  - Support SpinQuant R3
  - Replace Qwen2Attention with QCQwen2Attention
  - Pre-compute freqs_cos and freqs_sin to bypass rotary embedding
  - Replace Qwen2RMSNorm with torch.nn.RMSNorm
  - Tag quant I/O to avoid inserting Q/DQ for I/O
  - Reuse the ExecuTorch llama runner, llama_main

Note that accuracy is currently poor; more investigation is needed.
@cccclai (Contributor)

cccclai commented Jul 15, 2025

And one more question: when I use your runner with Qwen 2.5, I get an error from the tokenizer. I added a workaround that changes the behavior to MergedWithPrevious. Could you give me some advice on how to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

This should be fixed with the latest tokenizers: https://github.com/pytorch-labs/tokenizers

- Support recomposing RMSNorm via pattern matching
- Leverage AttentionMaskInterface and AttentionInterface without
  touching the model structure
- Add an eval script to evaluate PPL on device
@shewu-quic shewu-quic force-pushed the dev1/hutton/ga_qwen2.5_0.5B branch from 34838d6 to 381bf9b on July 17, 2025 01:18
@shewu-quic (Collaborator, Author) commented Jul 17, 2025

Hi @kimishpatel and @cccclai,

I have pushed a commit that leverages transformers APIs such as AttentionInterface, AttentionMaskInterface, and TorchExportableModuleForDecoderOnlyLM to make the decoder model QNN-friendly without altering the model structure.
Could you please let me know if this approach meets your expectations for the decoder-only model?

Results

It can be fully delegated and produces a reasonable result with Qwen 2.5 0.5B.
The result below was generated by Qwen 2.5 0.5B without R3, seq_len = 128, device = SM8750, quant_config = 16a8w.

I 00:00:00.900210 executorch:text_llm_runner.cpp:100] RSS after loading model: 829.289062 MiB (0 if unsupported)
I 00:00:00.900500 executorch:text_llm_runner.cpp:157] Max new tokens resolved: 122, given start_pos 0, num_prompt_tokens 6, max_context_len 128
I 00:00:00.949226 executorch:text_llm_runner.cpp:184] RSS after prompt prefill: 829.289062 MiB (0 if unsupported)
I 00:00:01.851722 executorch:text_llm_runner.cpp:204] RSS after finishing text generation: 829.289062 MiB (0 if unsupported)
I 00:00:01.851748 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:01.851755 executorch:stats.h:114] 	Model Load Time:		0.516000 (seconds)
I 00:00:01.851763 executorch:stats.h:124] 	Total inference time:		0.952000 (seconds)		 Rate: 	127.100840 (tokens/second)
I 00:00:01.851769 executorch:stats.h:132] 		Prompt evaluation:	0.049000 (seconds)		 Rate: 	122.448980 (tokens/second)
I 00:00:01.851777 executorch:stats.h:143] 		Generated 121 tokens:	0.903000 (seconds)		 Rate: 	133.997785 (tokens/second)
I 00:00:01.851785 executorch:stats.h:151] 	Time to first generated token:	0.049000 (seconds)
I 00:00:01.851792 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.012000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Function not called, PrepareLib isn't loaded!

/data/local/tmp/shewu/executorch/qwen_qnn_q16/outputs/: 1 file pulled. 0.5 MB/s (893 bytes in 0.002s)
INFO:root:Results[0]:
Setting up pretokenizer...
Pretokenizer set up
My favourite condiment is iced tea. I love the taste of it, and I love the fact that it is so easy to make. I make it in a couple of ways. The first way is to make a batch of iced tea at the end of the day and put it in the fridge. The second way is to make it in advance and put it in the freezer. I like the second way because it is much easier to make. I make it in advance by putting the tea in a large pitcher and adding the tea leaves. I add the sugar and the milk and stir it all together. I then put the

Validation of the .pte on wikitext (limit = 1):

  • PPL of the original nn module: 49
  • PPL of the QDQ module: 51
  • PPL of the QNN-delegated model on device: 51
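For reference, perplexity here is just the exponentiated average next-token negative log-likelihood over the evaluation text. A minimal sketch of how it could be computed for one tokenized wikitext sequence, assuming an HF-style causal LM that returns .logits; this is not the eval script in this PR.

```python
import math
import torch

def perplexity(model, input_ids: torch.Tensor) -> float:
    """input_ids: (1, seq_len) LongTensor; returns exp(mean next-token NLL)."""
    with torch.no_grad():
        logits = model(input_ids).logits            # (1, seq_len, vocab_size)
    nll = torch.nn.functional.cross_entropy(
        logits[0, :-1], input_ids[0, 1:], reduction="mean"
    )
    return math.exp(nll.item())
```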

Reproduce command

# export command
python3 examples/qualcomm/oss_scripts/qwen/qwen.py -s <serial> -H <host> -m SM8750 -a qwen2 --prompt "My favourite condiment is " -b build-android --decoder_model qwen2.5_0.5B --calibration_tasks wikitext --calibration_limit 1 --ptq 16a8w

# eval command
python3 examples/qualcomm/oss_scripts/qwen/eval_qwen_qnn.py -s <serial> -H <host> -m SM8750 -a qwen2 -b build-android --limit 1 --tokenizer_path qwen2/tokenizer.json --pte qwen2/qwen_qnn_q16.pte --logits_quant_attr_path qwen2/qwen_qnn_q16_quant_attrs.txt
