Qualcomm AI Engine Direct - GA Static QWEN2.5 0.5B #12054
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12054
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 1 Pending, 2 Unrelated Failures as of commit 7beed6e with merge base 6ac5df2.
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Thanks! If it's just that the accuracy is bad, @rohansjoshi has been making progress on this. Can the qwen model run with this PR (it doesn't need to be accurate)?
from transformers.configuration_utils import PretrainedConfig


class RMSNorm(nn.Module):
I think we need to have the same static llama file (maybe with some branches) instead of starting a new static qwen model
Thanks for reviewing the PR. We are trying to have a static decoder file that can be shared across all GA decoder models + Llama.
For this draft, we are just making sure the e2e flow is working, and we will then work on merging all static decoder files into one.
Yes, the qwen model can run with this PR.
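For illustration, here is a minimal sketch of the "one static decoder file with branches" idea, assuming illustrative class and field names (StaticDecoderArgs, tie_word_embeddings) rather than the PR's actual code:

```python
# Sketch only: names here are illustrative, not the PR's implementation.
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class StaticDecoderArgs:
    dim: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    vocab_size: int
    norm_eps: float = 1e-6
    # Qwen2.5-0.5B ties the LM head to the embedding table; Llama-style
    # checkpoints typically do not, so this becomes a config branch.
    tie_word_embeddings: bool = False


class StaticDecoder(nn.Module):
    """One shared static decoder, branching on per-model options."""

    def __init__(self, args: StaticDecoderArgs):
        super().__init__()
        self.tok_embeddings = nn.Embedding(args.vocab_size, args.dim)
        self.output = nn.Linear(args.dim, args.vocab_size, bias=False)
        if args.tie_word_embeddings:
            # Branch for models like Qwen2.5-0.5B instead of a separate file.
            self.output.weight = self.tok_embeddings.weight
```

The point is that model-specific differences (tied embeddings, GQA head counts, norm epsilon) become configuration branches inside one shared file rather than separate static model files.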
Two things you can do to fix accuracy are:
If you make both of these changes in
auto eos_ids = std::make_unique<std::unordered_set<uint64_t>>();
// TODO: remove this once we could release the new tokens used for the
// tokenizer
if (tokenizer_ != nullptr) {
@limintang can you take a look at the changes?
Oops, I clicked merge by mistake... reverting in #12506. I have some comments for this PR.
# LICENSE file in the root directory of this source tree.


def convert_configs(config):
Can you share where this config comes from and how we can scale it to more models?
The config can be found here:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2/configuration_qwen2.py.
For now, I checked qwen and gemma, and I think most configs follow the same naming, since they all use PretrainedConfig as the base class. Ideally, this function should be able to support most HF decoder models, but I will need to test them one by one to confirm they can all be supported by this function.
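To make the idea concrete, here is a minimal sketch of such a mapping, assuming the HF attribute names used by the Qwen2/Gemma configs (hidden_size, num_hidden_layers, ...); the target keys are illustrative and may differ from the PR's actual convert_configs:

```python
from transformers.configuration_utils import PretrainedConfig


def convert_configs(config: PretrainedConfig) -> dict:
    """Map common HF PretrainedConfig attributes to static-decoder arguments."""
    return {
        "dim": config.hidden_size,
        "n_layers": config.num_hidden_layers,
        "n_heads": config.num_attention_heads,
        # Fall back to MHA when the model does not define GQA heads.
        "n_kv_heads": getattr(config, "num_key_value_heads", config.num_attention_heads),
        "hidden_dim": config.intermediate_size,
        "norm_eps": config.rms_norm_eps,
        "rope_theta": getattr(config, "rope_theta", 10000.0),
        "vocab_size": config.vocab_size,
        "max_seq_len": config.max_position_embeddings,
    }


# e.g. convert_configs(transformers.AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B"))
```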
sys.setrecursionlimit(4096)
FORMAT = "[%(levelname)s %(asctime)s %(filename)s:%(lineno)s] %(message)s"
logging.basicConfig(level=logging.INFO, format=FORMAT)
logging.getLogger().setLevel(logging.INFO)

HUGGING_FACE_REPO_IDS = {"qwen2_5": "Qwen/Qwen2.5-0.5B"}
Each model will have its own config, and I feel like it can be shared with the CPU model: https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L32
It doesn't have to be in this PR, and I'd like to hear your thoughts on this.
I would like to confirm which config you are mentioning here.
Do you mean that instead of creating this dictionary, maybe we could contribute to the dict here?
HUGGING_FACE_REPO_IDS = {
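For context, a minimal sketch of how a mapping like HUGGING_FACE_REPO_IDS is typically consumed, assuming huggingface_hub's snapshot_download for the fetch (download_checkpoint is a hypothetical helper, not necessarily the script's actual code):

```python
from huggingface_hub import snapshot_download

HUGGING_FACE_REPO_IDS = {"qwen2_5": "Qwen/Qwen2.5-0.5B"}


def download_checkpoint(decoder_model: str) -> str:
    """Resolve the --decoder_model flag to a HF repo id and fetch it locally."""
    repo_id = HUGGING_FACE_REPO_IDS[decoder_model]
    # Returns the local directory containing config.json, tokenizer files,
    # and the model weights.
    return snapshot_download(repo_id=repo_id)


# e.g. ckpt_dir = download_checkpoint("qwen2_5")
```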
custom_annotations = (annotate_matmul_16a8w,)
if args.llama_model == "stories110m":
    custom_annotations = (
        # For qwen2.5, skip annotate_conv can improve result.
Why is it the case?
The "skip annotate_conv" here actually refers to skipping the annotation of conv as 8a4w, not skipping conv2d quantization altogether, which means that for qwen the conv2d is quantized as 16a8w. I have tested both skipping and not skipping the 8a4w conv annotation, and the CPU QDQ results are a lot better when I skip annotating conv2d as 8a4w.
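To illustrate what "skip annotate_conv" means here: with the global PTQ config at 16a8w, adding the conv annotation would push conv2d down to 8a4w, while skipping it leaves conv2d at 16a8w. A rough sketch, where annotate_conv_8a4w is a hypothetical name standing in for the conv annotation pass and the import path for annotate_matmul_16a8w is assumed from the ExecuTorch Qualcomm quantizer utilities:

```python
from executorch.backends.qualcomm.quantizer.custom_annotation import (
    annotate_matmul_16a8w,
)


def build_custom_annotations(decoder_model: str, annotate_conv_8a4w=None):
    """Return custom annotation passes; annotate_conv_8a4w is a hypothetical
    stand-in for the conv pass that is skipped for qwen2.5."""
    custom_annotations = (annotate_matmul_16a8w,)
    if decoder_model != "qwen2_5" and annotate_conv_8a4w is not None:
        # Other models additionally annotate conv layers down to 8a4w.
        custom_annotations += (annotate_conv_8a4w,)
    # For qwen2.5 the 8a4w conv annotation is skipped, so conv2d stays at the
    # global 16a8w PTQ config, which gave better CPU QDQ accuracy in testing.
    return custom_annotations
```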
After the comments are addressed, I can revert the revert again...
Also, I tried rebasing onto mainline to test whether the tokenizer issue is resolved (#12333). However, after rebasing, I encountered another error:
Summary
Static Qwen2.5 0.5B enablement.
Please use 16a8w for qwen, as other quant configs are not yet fully supported.
Script
python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -s $DEVICE -m SM8650 --prompt "Hello, how are you?" --temperature 0 --model_mode kv --max_seq_len 128 --ptq 16a8w --decoder_model qwen2_5
Stats
SM8650

SM8750

Test plan
python backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_qwen2_5 --model SM8650 --build_folder build-android/ --executorch_root . -s $DEVICE --artifact ./qwen2_5
Author: @haowhsu-quic, @winskuo-quic