Qualcomm AI Engine Direct - GA Static QWEN2.5 0.5B #12054


Merged — 1 commit merged into pytorch:main on Jul 15, 2025

Conversation

winskuo-quic (Collaborator) commented Jun 27, 2025

Summary

Static Qwen2.5 0.5B enablement.
Please use 16a8w for Qwen, as other quant configs are not yet fully supported.

Script
`python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8650 --prompt "Hello, how are you?" --temperature 0 --model_mode kv --max_seq_len 128 --ptq 16a8w --decoder_model qwen2_5`

Stats

SM8650
(screenshot: https://github.com/user-attachments/assets/ce162c20-9025-4c1c-b794-93176e4ee677)

SM8750
(screenshot: https://github.com/user-attachments/assets/25db8a97-8adf-42d4-b8b4-6cdfaf933c69)

Test plan

`python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_qwen2_5 --model SM8650 --build_folder build-android/ --executorch_root . -s $DEVICE --artifact ./qwen2_5`

Author: @haowhsu-quic, @winskuo-quic

pytorch-bot (bot) commented Jun 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12054

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Pending, 2 Unrelated Failures

As of commit 7beed6e with merge base 6ac5df2:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jun 27, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e., would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track of changes and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

winskuo-quic (Collaborator, Author) commented

Hi @billmguo, @cccclai,

This is a draft for Qwen2.5 0.5B using a static nn.Module structure.
Please note this is just a draft, as the model's accuracy is still poor.

Thanks
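For context, a minimal sketch of what a "static" attention block looks like: the KV cache and attention mask are fixed-shape tensors passed explicitly through forward, so the exported graph has no dynamic shapes. All names below are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn


class StaticAttentionSketch(nn.Module):
    """Illustrative single-head attention with a static KV cache.

    The cache and mask are fixed-shape inputs/outputs rather than module
    state, which keeps the exported graph free of dynamic shapes.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.scale = dim**-0.5

    def forward(self, x, atten_mask, k_cache, v_cache):
        # x: [B, 1, dim] -- one token per step in kv mode.
        # k_cache / v_cache: [B, max_seq_len, dim], updated by shifting.
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        k_cache = torch.cat([k_cache[:, 1:], k], dim=1)
        v_cache = torch.cat([v_cache[:, 1:], v], dim=1)
        # atten_mask: [B, 1, max_seq_len], large negative at unused slots.
        scores = q @ k_cache.transpose(-1, -2) * self.scale + atten_mask
        attn = torch.softmax(scores, dim=-1)
        return attn @ v_cache, k_cache, v_cache
```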

cccclai (Contributor) commented Jun 27, 2025

Thanks! If it's just that the accuracy is bad, @rohansjoshi has been making progress on this. Can the Qwen model run with this PR (it doesn't need to be accurate)?

from transformers.configuration_utils import PretrainedConfig


class RMSNorm(nn.Module):
Contributor

I think we need to share the same static llama file (maybe with some branches) instead of starting a new static qwen model.

Collaborator Author

Thanks for reviewing the PR. We are trying to build a single static decoder file that can be shared across all GA decoder models plus Llama.
For this draft, we are just making sure the e2e flow works, and we will work on merging all static decoder files into one.

winskuo-quic (Collaborator, Author) commented

> Thanks! If it's just that the accuracy is bad, @rohansjoshi has been making progress on this. Can the Qwen model run with this PR (it doesn't need to be accurate)?

Yes, the model can run with this PR.
I have shared an example script and stats in the Summary section.

rohansjoshi (Contributor) commented

Two things you can do to fix accuracy are (see the sketch after this list):

  1. In make_quantizer, add the argument act_observer=MinMaxObserver (the default is MovingAverageMinMaxObserver, which is much worse).
  2. To see better outputs from the script, flush the KV cache before each token-generation loop (i.e., add the line _, atten_mask, _, k_caches, v_caches = qc_model.get_example_inputs() before the loops).

If you make both of these changes in qwen.py, you will see a much better response. @haowhsu-quic, @winskuo-quic
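A minimal sketch of both changes, assuming make_quantizer and an already-built qc_model from the examples/qualcomm scripts are in scope (the import paths, enum names, and loop structure below are best-effort assumptions, not the PR's exact code):

```python
from torch.ao.quantization.observer import MinMaxObserver

from executorch.backends.qualcomm.quantizer.quantizer import QuantDtype
from executorch.examples.qualcomm.utils import make_quantizer

# 1. Track the true min/max of calibration activations instead of a
#    moving average, which can under-estimate outlier-heavy LLM ranges.
quantizer = make_quantizer(
    quant_dtype=QuantDtype.use_16a8w,
    act_observer=MinMaxObserver,
)

# 2. Flush the KV cache before each token-generation loop so stale
#    entries from the previous prompt do not leak into attention.
for prompt in prompts:  # prompts: whatever the script iterates over
    _, atten_mask, _, k_caches, v_caches = qc_model.get_example_inputs()
    # ... run the generation loop with the fresh caches ...
```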

winskuo-quic force-pushed the dev1/winskuo/qwen2_0.5B branch from 0c4785b to 7beed6e on July 14, 2025 14:27
winskuo-quic changed the title from "Qualcomm AI Engine Direct - DRAFT for GA QWEN2.5 0.5B" to "Qualcomm AI Engine Direct - GA Static QWEN2.5 0.5B" on Jul 14, 2025
auto eos_ids = std::make_unique<std::unordered_set<uint64_t>>();
// TODO: remove this once we can release the new tokens used for the
// tokenizer
if (tokenizer_ != nullptr) {
Collaborator Author

Hi @cccclai,
I am unsure which situations will enter this if statement.
It was added in PR #12285.
Could you verify that this PR does not break that logic?

Contributor

@limintang can you take a look at the changes?

winskuo-quic marked this pull request as ready for review on July 14, 2025 14:29
facebook-github-bot (Contributor) commented

@cccclai has imported this pull request. If you are a Meta employee, you can view this in D78296169.

cccclai merged commit fe3062a into pytorch:main on Jul 15, 2025
101 of 106 checks passed
cccclai added a commit that referenced this pull request Jul 15, 2025
cccclai (Contributor) commented Jul 15, 2025

Oops, I clicked merge by mistake... reverting in #12506. I have some comments for this PR.

# LICENSE file in the root directory of this source tree.


def convert_configs(config):
Contributor

Can you share where this config comes from and how we can scale it to more models?

Collaborator Author

The config can be found here:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/configuration_qwen2.py.
For now, I have checked Qwen and Gemma, and most configs follow the same naming, since they all use PretrainedConfig as the base class. Ideally, this function should be able to support most HF decoder models, but I will need to test them one by one to confirm they can all be supported using this function.
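For illustration, a rough sketch of such a mapping (the left-hand keys are hypothetical static-decoder parameter names, not necessarily the ones this PR uses; the right-hand attributes follow Qwen2Config/PretrainedConfig naming):

```python
from transformers import AutoConfig


def convert_configs(hf_config):
    # Map PretrainedConfig-style attributes onto static-decoder params.
    # Qwen2 and Gemma both expose these attributes via their
    # PretrainedConfig subclasses; other HF decoders need verification.
    return {
        "dim": hf_config.hidden_size,
        "n_layers": hf_config.num_hidden_layers,
        "n_heads": hf_config.num_attention_heads,
        "n_kv_heads": hf_config.num_key_value_heads,
        "vocab_size": hf_config.vocab_size,
        "norm_eps": hf_config.rms_norm_eps,
        "rope_theta": hf_config.rope_theta,
        "max_seq_len": hf_config.max_position_embeddings,
    }


# e.g. convert_configs(AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B"))
```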


sys.setrecursionlimit(4096)
FORMAT = "[%(levelname)s %(asctime)s %(filename)s:%(lineno)s] %(message)s"
logging.basicConfig(level=logging.INFO, format=FORMAT)
logging.getLogger().setLevel(logging.INFO)

HUGGING_FACE_REPO_IDS = {"qwen2_5": "Qwen/Qwen2.5-0.5B"}
Contributor

Each model will have its own config, and I feel like it can be shared with the CPU model: https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L32

It doesn't have to be in this PR; I'd like to hear your thoughts on this.

Collaborator Author

I would like to confirm which config you are referring to here.
Do you mean that instead of creating this dictionary, we could contribute to the dict here?

HUGGING_FACE_REPO_IDS = {

custom_annotations = (annotate_matmul_16a8w,)
if args.llama_model == "stories110m":
    custom_annotations = (
# For qwen2.5, skipping annotate_conv can improve results.
Contributor

Why is that the case?

Collaborator Author

Skipping annotate_conv here actually means skipping the annotation of conv as 8a4w, not skipping conv2d quantization altogether; for Qwen, the conv2d therefore stays at 16a8w. I have tested both with and without the 8a4w conv annotation, and the CPU QDQ results are a lot better when conv2d is not annotated down to 8a4w; see the sketch below.
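Hedged sketch of the difference (annotate_conv stands in here for the 8a4w conv annotation; the actual helper names and conditionals in the script may differ):

```python
# Omitting the 8a4w conv annotation leaves conv2d at the quantizer's
# default 16a8w, which gave much better CPU QDQ results for Qwen2.5.
if args.decoder_model == "qwen2_5":
    custom_annotations = (annotate_matmul_16a8w,)  # conv2d stays 16a8w
else:
    custom_annotations = (annotate_matmul_16a8w, annotate_conv)  # conv2d -> 8a4w
```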

cccclai (Contributor) commented Jul 16, 2025

After the comments are addressed, I can revert the revert...

winskuo-quic (Collaborator, Author) commented Jul 16, 2025

> After the comments are addressed, I can revert the revert...

Also, I tried rebasing onto mainline to test whether the tokenizer issue (#12333) is resolved. However, after rebasing, I encountered another error:

`libc++abi: terminating due to uncaught exception of type std::runtime_error: Unsupported Normalizer type: NFC.`

lucylq pushed a commit that referenced this pull request Jul 17, 2025