
Qualcomm AI Engine Direct - GA Qwen 2.5 0.5B #12333


Draft · wants to merge 2 commits into main from dev1/hutton/ga_qwen2.5_0.5B

Conversation

@shewu-quic (Collaborator) commented Jul 10, 2025

Summary:

  • Add a decoder_model_wrapper.py to ensure that the exported model can be fully delegated to the QNN backend
  • Add an e2e script to run Qwen 2.5
    • Support SpinQuant R3
    • Replace Qwen2Attention with QCQwen2Attention
    • Pre-compute freqs_cos and freqs_sin to bypass rotary embedding (see the sketch below)
    • Replace Qwen2RMSNorm with torch.nn.RMSNorm
    • Tag quant I/O to avoid inserting Q/DQ for I/O
    • Reuse the ExecuTorch llama runner, llama_main

Note that accuracy is currently poor; more investigation is needed.
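For context on the rotary-embedding item above, the idea is to compute the cos/sin tables once at export time and register them as buffers, so the exported graph only gathers rows by position instead of running the rotary math per token. The sketch below is a minimal illustration under that assumption; the wrapper name, forward signature, and theta value are illustrative and not the actual decoder_model_wrapper.py code.

```python
import torch

def precompute_freqs(head_dim: int, max_seq_len: int, theta: float = 1_000_000.0):
    # Standard RoPE tables: one row per position, one column per rotary pair.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_seq_len).float(), inv_freq)
    return torch.cos(angles), torch.sin(angles)  # freqs_cos, freqs_sin

class DecoderModelWrapper(torch.nn.Module):
    """Illustrative wrapper: holds the RoPE tables as buffers so no trig ops are traced."""

    def __init__(self, model, head_dim: int, max_seq_len: int):
        super().__init__()
        self.model = model
        freqs_cos, freqs_sin = precompute_freqs(head_dim, max_seq_len)
        self.register_buffer("freqs_cos", freqs_cos, persistent=False)
        self.register_buffer("freqs_sin", freqs_sin, persistent=False)

    def forward(self, input_ids, input_pos):
        # Gather the rows for the current positions; the wrapped model is assumed
        # to accept precomputed cos/sin instead of computing rotary embeddings.
        cos = self.freqs_cos[input_pos]
        sin = self.freqs_sin[input_pos]
        return self.model(input_ids, cos, sin)
```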

Reproduce command

python3 examples/qualcomm/oss_scripts/qwen/qwen.py -s <serial> -H <host> -m SM8750 --prompt "My favourite condiment is " -b build-android --decoder_model qwen2.5_0.5B --ptq 16a16w

Results

7/9

ptq: 16a16w
Speed: 62 tok/sec on SM8750, seq_len = 128
Accuracy: Bad

Outputs:

I 00:00:02.944266 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:02.944270 executorch:stats.h:114] 	Model Load Time:		0.677000 (seconds)
I 00:00:02.944274 executorch:stats.h:124] 	Total inference time:		2.034000 (seconds)		 Rate: 	59.488692 (tokens/second)
I 00:00:02.944279 executorch:stats.h:132] 		Prompt evaluation:	0.093000 (seconds)		 Rate: 	64.516129 (tokens/second)
I 00:00:02.944283 executorch:stats.h:143] 		Generated 121 tokens:	1.941000 (seconds)		 Rate: 	62.339001 (tokens/second)
I 00:00:02.944288 executorch:stats.h:151] 	Time to first generated token:	0.093000 (seconds)
I 00:00:02.944292 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.059000 (seconds)
My favourite condiment is a thing, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan, and I am a fan

7/11

ptq: 16a8w
Speed: 135 tok/sec on SM8750, seq_len = 128
Accuracy: Seems better

Outputs:

I 00:00:00.734588 executorch:text_llm_runner.cpp:100] RSS after loading model: 829.648438 MiB (0 if unsupported)
I 00:00:00.734865 executorch:text_llm_runner.cpp:157] Max new tokens resolved: 122, given start_pos 0, num_prompt_tokens 6, max_context_len 128
I 00:00:00.784392 executorch:text_llm_runner.cpp:184] RSS after prompt prefill: 829.648438 MiB (0 if unsupported)
I 00:00:01.677137 executorch:text_llm_runner.cpp:204] RSS after finishing text generation: 829.648438 MiB (0 if unsupported)
I 00:00:01.677171 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:01.677180 executorch:stats.h:114] 	Model Load Time:		0.431000 (seconds)
I 00:00:01.677187 executorch:stats.h:124] 	Total inference time:		0.943000 (seconds)		 Rate: 	128.313892 (tokens/second)
I 00:00:01.677193 executorch:stats.h:132] 		Prompt evaluation:	0.050000 (seconds)		 Rate: 	120.000000 (tokens/second)
I 00:00:01.677201 executorch:stats.h:143] 		Generated 121 tokens:	0.893000 (seconds)		 Rate: 	135.498320 (tokens/second)
I 00:00:01.677208 executorch:stats.h:151] 	Time to first generated token:	0.050000 (seconds)
I 00:00:01.677215 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.017000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Function not called, PrepareLib isn't loaded!

/data/local/tmp/shewu/executorch/qwen_qnn_q16/outputs/: 1 file pulled. 0.7 MB/s (883 bytes in 0.001s)
INFO:root:Results[0]:
Setting up pretokenizer...
Pretokenizer set up
My favourite condiment is iced tea. I love it so much that I have to have it every day. I have a habit of making it at home. I have a few recipes for iced tea. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite iced tea recipes. I have a few favorite

cc: @winskuo-quic, @haowhsu-quic


pytorch-bot bot commented Jul 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12333

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 10, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@shewu-quic (Collaborator, Author)

Hi @cccclai @kimishpatel,

I am working on supporting decoder-only models via the transformers path.
I created a wrapper for the decoder model based on TorchExportableModuleWithStaticCache from transformers.
A few changes are needed to fully delegate it to the QNN backend:

  1. Change the attention mask to avoid computing it inside the model (see the sketch after this list)
  2. Add buffers for freqs_cos and freqs_sin to bypass rotary embedding in the model
  3. Replace the attention with QCQwen2Attention
  4. Replace Qwen2RMSNorm with torch.nn.RMSNorm
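As a rough illustration of item 1, one way to avoid building the causal mask inside the model is to precompute a full additive mask once and slice it by position. This is a hedged sketch of the idea only; the class name and shapes are illustrative and not the code in this PR.

```python
import torch

class StaticCausalMask(torch.nn.Module):
    """Illustrative: a (max_seq_len, max_seq_len) additive causal mask built once
    and sliced per step, so no mask construction runs inside the decoder."""

    def __init__(self, max_seq_len: int, dtype: torch.dtype = torch.float32):
        super().__init__()
        mask = torch.full((max_seq_len, max_seq_len), torch.finfo(dtype).min, dtype=dtype)
        mask = torch.triu(mask, diagonal=1)  # 0 on/below the diagonal, -inf above
        self.register_buffer("mask", mask, persistent=False)

    def forward(self, input_pos: torch.Tensor) -> torch.Tensor:
        # Select the rows for the current positions; broadcastable over batch/heads.
        return self.mask[input_pos].unsqueeze(0).unsqueeze(0)
```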

May I know whether these changes are acceptable?

Thanks.

@shewu-quic (Collaborator, Author)

And one more question: when I use your runner with Qwen 2.5, I get an error from the tokenizer. I added a workaround that changes the behavior to MergedWithPrevious. Could you give me some advice on how to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

@kimishpatel (Contributor)

And one more question: when I use your runner with Qwen 2.5, I get an error from the tokenizer. I added a workaround that changes the behavior to MergedWithPrevious. Could you give me some advice on how to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

@larryliu0820 on this question

@@ -156,11 +156,22 @@ def __init__(self, weight, bias=None):

def forward(self, x):
rank = x.dim()
x = x.unsqueeze(-1) if rank == 3 else x.reshape(1, *x.shape, 1)
x = torch.transpose(x, 1, 2)
if rank == 2:
Contributor

I have a different question here: can this be done as a graph pass?

Collaborator Author

I believe it's possible: for the linear target, we could use set_parameter to modify the weight and add permute and reshape nodes during transform_for_annotation.
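For what it's worth, the rewrite such a pass would perform can be checked in eager mode: a Linear over a (batch, seq, features) tensor equals a 1x1 Conv2d once the weight is reshaped (what set_parameter would install) and the activation is permuted/reshaped around the op. A small sanity check under those assumptions:

```python
import torch

B, L, Cin, Cout = 2, 8, 16, 32
x = torch.randn(B, L, Cin)
linear = torch.nn.Linear(Cin, Cout)

y_ref = linear(x)

# Weight reshaped to a 1x1 conv kernel; activation permuted/reshaped around the op.
w4d = linear.weight.view(Cout, Cin, 1, 1)
x4d = x.permute(0, 2, 1).unsqueeze(-1)                      # (B, Cin, L, 1)
y_conv = torch.nn.functional.conv2d(x4d, w4d, linear.bias)  # (B, Cout, L, 1)
y_conv = y_conv.squeeze(-1).permute(0, 2, 1)                # back to (B, L, Cout)

torch.testing.assert_close(y_ref, y_conv)
```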

# =====================================================================
outs = self.model(
input_ids=input_ids,
attention_mask=attn_mask,
Contributor

One question: if you have to specify a per-layer mask, how would you do it?

@guangy10 does the transformers API allow a per-layer mask to be specified here, as a list of tensors or something?

)
if quant_dtype == QuantDtype.use_16a4w_block:
conv_nodes = [
n for n in fx_graph_module.graph.nodes if "conv" in n.name
Contributor

Don't you want to check the type of the node or node.target to see if it is a conv?

Collaborator Author

Good point. Thanks!
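For reference, a name-based check such as `"conv" in n.name` can over-match nodes whose autogenerated names merely contain that substring. A hedged sketch of a target-based check instead, assuming the same fx_graph_module as in the snippet above; the exact set of conv overloads to include depends on how the graph was captured:

```python
import torch

CONV_TARGETS = {
    torch.ops.aten.conv2d.default,
    torch.ops.aten.convolution.default,
}

conv_nodes = [
    n
    for n in fx_graph_module.graph.nodes
    if n.op == "call_function" and n.target in CONV_TARGETS
]
```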

@@ -67,10 +67,11 @@ class ET_EXPERIMENTAL TextDecoderRunner {
const executorch::aten::Tensor& logits_tensor,
const float temperature = 0.0f) {
int32_t result = 0;
ET_SWITCH_THREE_TYPES(
ET_SWITCH_FOUR_TYPES(
Contributor

@larryliu0820 do these changes seem acceptable?

@kimishpatel (Contributor) left a comment

Left some comments, but I think this is not the right approach for enabling transformers models. I left inline comments describing why.

@kimishpatel (Contributor)

A couple of other questions I have:

  1. What is the performance like compared to a static_llama-like solution?
  2. How is the KV cache managed? If it is quantized, how are you doing quantized cache updates?

@shewu-quic (Collaborator, Author)

Left some comments, but I think this is not the right approach for enabling transformers models. I left inline comments describing why.

Let me confirm my understanding: the general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without touching the actual model code structure.
I have a question: is it possible to extend the flexibility of the transformers API (specifically TorchExportableModuleWithStaticCache) to allow custom attention masks or attention functions to be specified?
It seems the transformers framework intends to support this, as indicated by the presence of ALL_MASK_ATTENTION_FUNCTIONS and ALL_ATTENTION_FUNCTIONS.

@kimishpatel (Contributor)

Left some comments, but I think this is not the right approach for enabling transformers models. I left inline comments describing why.

Let me confirm my understanding: the general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without touching the actual model code structure. I have a question: is it possible to extend the flexibility of the transformers API (specifically TorchExportableModuleWithStaticCache) to allow custom attention masks or attention functions to be specified? It seems the transformers framework intends to support this, as indicated by the presence of ALL_MASK_ATTENTION_FUNCTIONS and ALL_ATTENTION_FUNCTIONS.

Yes. I pointed to some examples of that, but note that it won't allow you to do the kinds of things you may be doing, like inserting R1/R3, etc., at least to my understanding. If you can do that using the attention customization interface, that's great.
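For what it's worth, below is a minimal sketch of the attention-customization route, assuming the AttentionInterface registration API available in recent transformers releases; the function name and the exact callable signature are assumptions, and the R3 hook is only indicated by a comment.

```python
import torch
from transformers import AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def qnn_friendly_attention(module, query, key, value, attention_mask, **kwargs):
    # Hook point: rotations (e.g. SpinQuant R3) or mask adjustments could be
    # applied to query/key here before delegating to the stock SDPA kernel.
    return sdpa_attention_forward(module, query, key, value, attention_mask, **kwargs)

AttentionInterface.register("qnn_friendly", qnn_friendly_attention)
# model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="qnn_friendly")
```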

@shewu-quic (Collaborator, Author)

A couple of other questions I have:

  1. What is the performance like compared to a static_llama-like solution?
  2. How is the KV cache managed? If it is quantized, how are you doing quantized cache updates?

  1. We are currently working on this item, which evaluates PPL and performance. We will get back to you with the results as soon as possible.
  2. For the transformers path, the KV cache is managed with the mutable-buffer mechanism, using index_put to update the cache on each turn (see the sketch below). For the quantized cache, we avoid inserting Q/DQ for the KV cache inputs and outputs with the tag_quant_io pass.
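A minimal sketch of the mutable-buffer idea described in point 2, assuming fixed-size cache buffers updated in place with aten.index_put_; the class and shapes are illustrative, not the transformers StaticCache implementation.

```python
import torch

class StaticKVCache(torch.nn.Module):
    """Illustrative static cache: fixed-size K/V buffers mutated in place, so the
    exported graph carries index_put-style buffer updates instead of growing tensors."""

    def __init__(self, n_heads: int, max_seq_len: int, head_dim: int):
        super().__init__()
        shape = (1, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape))
        self.register_buffer("v_cache", torch.zeros(shape))

    def update(self, input_pos: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # input_pos: (seq_len,); k, v: (1, n_heads, seq_len, head_dim).
        # None entries mean "take the whole dimension" for aten.index_put_.
        torch.ops.aten.index_put_(self.k_cache, [None, None, input_pos], k)
        torch.ops.aten.index_put_(self.v_cache, [None, None, input_pos], v)
        return self.k_cache, self.v_cache
```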

Summary:
- Add a decoder_model_wrapper.py to ensure that the exported model can
  be fully delegated to the QNN backend
- Add an e2e script to run Qwen 2.5
  - Support SpinQuant R3
  - Replace Qwen2Attention with QCQwen2Attention
  - Pre-compute freqs_cos and freqs_sin to bypass rotary embedding
  - Replace Qwen2RMSNorm with torch.nn.RMSNorm
  - Tag quant I/O to avoid inserting Q/DQ for I/O
  - Reuse the ExecuTorch llama runner, llama_main

Note that accuracy is currently poor; more investigation is needed.
@cccclai (Contributor)

cccclai commented Jul 15, 2025

And one more question: when I use your runner with Qwen 2.5, I get an error from the tokenizer. I added a workaround that changes the behavior to MergedWithPrevious. Could you give me some advice on how to address it?

Unsupported behavior 'Isolated' for Split PreTokenizer. Only 'MergedWithPrevious' is supported.

This should be fixed with the latest tokenizers: https://github.com/pytorch-labs/tokenizers

- Support recomposing RMSNorm via pattern matching
- Leverage AttentionMaskInterface and AttentionInterface without
  touching the model structure
- Add an eval script to evaluate PPL on device
@shewu-quic shewu-quic force-pushed the dev1/hutton/ga_qwen2.5_0.5B branch from 34838d6 to 381bf9b on July 17, 2025 01:18
@shewu-quic (Collaborator, Author) commented Jul 17, 2025

Hi @kimishpatel and @cccclai,

I have pushed a commit that leverages transformers APIs such as AttentionInterface, AttentionMaskInterface, and TorchExportableModuleForDecoderOnlyLM to make the decoder model QNN-friendly without altering the model structure.
Could you please let me know if this approach meets your expectations for the decoder-only model?

Results

It can be fully delegated and produces a reasonable result with Qwen 2.5 0.5B.
The result below was generated by Qwen 2.5 0.5B without R3, seq_len = 128, device = SM8750, quant_config = 16a8w.

I 00:00:00.900210 executorch:text_llm_runner.cpp:100] RSS after loading model: 829.289062 MiB (0 if unsupported)
I 00:00:00.900500 executorch:text_llm_runner.cpp:157] Max new tokens resolved: 122, given start_pos 0, num_prompt_tokens 6, max_context_len 128
I 00:00:00.949226 executorch:text_llm_runner.cpp:184] RSS after prompt prefill: 829.289062 MiB (0 if unsupported)
I 00:00:01.851722 executorch:text_llm_runner.cpp:204] RSS after finishing text generation: 829.289062 MiB (0 if unsupported)
I 00:00:01.851748 executorch:stats.h:108] 	Prompt Tokens: 6    Generated Tokens: 121
I 00:00:01.851755 executorch:stats.h:114] 	Model Load Time:		0.516000 (seconds)
I 00:00:01.851763 executorch:stats.h:124] 	Total inference time:		0.952000 (seconds)		 Rate: 	127.100840 (tokens/second)
I 00:00:01.851769 executorch:stats.h:132] 		Prompt evaluation:	0.049000 (seconds)		 Rate: 	122.448980 (tokens/second)
I 00:00:01.851777 executorch:stats.h:143] 		Generated 121 tokens:	0.903000 (seconds)		 Rate: 	133.997785 (tokens/second)
I 00:00:01.851785 executorch:stats.h:151] 	Time to first generated token:	0.049000 (seconds)
I 00:00:01.851792 executorch:stats.h:158] 	Sampling time over 127 tokens:	0.012000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[WARNING] [Qnn ExecuTorch]: QnnDsp <W> Function not called, PrepareLib isn't loaded!

/data/local/tmp/shewu/executorch/qwen_qnn_q16/outputs/: 1 file pulled. 0.5 MB/s (893 bytes in 0.002s)
INFO:root:Results[0]:
Setting up pretokenizer...
Pretokenizer set up
My favourite condiment is iced tea. I love the taste of it, and I love the fact that it is so easy to make. I make it in a couple of ways. The first way is to make a batch of iced tea at the end of the day and put it in the fridge. The second way is to make it in advance and put it in the freezer. I like the second way because it is much easier to make. I make it in advance by putting the tea in a large pitcher and adding the tea leaves. I add the sugar and the milk and stir it all together. I then put the

Validation of the .pte on wikitext (limit = 1):

  • PPL of the original nn module: 49
  • PPL of the QDQ module: 51
  • PPL of the QNN-delegated model on device: 51
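For reference, perplexity here is just the exponentiated average next-token negative log-likelihood over the evaluation text. A minimal sketch of how it could be computed for one tokenized wikitext sequence, assuming an HF-style causal LM that returns .logits; this is not the eval script in this PR.

```python
import math
import torch

def perplexity(model, input_ids: torch.Tensor) -> float:
    """input_ids: (1, seq_len) LongTensor; returns exp(mean next-token NLL)."""
    with torch.no_grad():
        logits = model(input_ids).logits            # (1, seq_len, vocab_size)
    nll = torch.nn.functional.cross_entropy(
        logits[0, :-1], input_ids[0, 1:], reduction="mean"
    )
    return math.exp(nll.item())
```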

Reproduce command

# export command
python3 examples/qualcomm/oss_scripts/qwen/qwen.py -s <serial> -H <host> -m SM8750 -a qwen2 --prompt "My favourite condiment is " -b build-android --decoder_model qwen2.5_0.5B --calibration_tasks wikitext --calibration_limit 1 --ptq 16a8w

# eval command
python3 examples/qualcomm/oss_scripts/qwen/eval_qwen_qnn.py -s <serial> -H <host> -m SM8750 -a qwen2 -b build-android --limit 1 --tokenizer_path qwen2/tokenizer.json --pte qwen2/qwen_qnn_q16.pte --logits_quant_attr_path qwen2/qwen_qnn_q16_quant_attrs.txt
