Qualcomm AI Engine Direct - GA Qwen 2.5 0.5B #12333
Conversation
Hi @cccclai @kimishpatel, I am working on supporting decoder-only models from the transformers path. May I know if these changes are acceptable? Thanks.
And one more question, when I use your runner with Qwen 2.5.
@larryliu0820 on this question
@@ -156,11 +156,22 @@ def __init__(self, weight, bias=None):

    def forward(self, x):
        rank = x.dim()
        x = x.unsqueeze(-1) if rank == 3 else x.reshape(1, *x.shape, 1)
        x = torch.transpose(x, 1, 2)
        if rank == 2:
I have a different question here: can this be done as a graph pass?
I believe it's possible to use set_parameter on the linear target to modify the weight, and to add permute and reshape nodes during transform_for_annotation.
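Something along these lines might work as a standalone pass; this is only a generic torch.fx sketch (the real flow would presumably go through the QNN quantizer's transform_for_annotation and its set_parameter helper, which are not shown here). It rewrites each aten.linear into a 1x1 conv2d by inserting the same reshape/transpose the wrapper's forward() applies:

```python
import torch
from torch.fx import GraphModule


def linear_to_conv2d_pass(gm: GraphModule) -> GraphModule:
    """Hypothetical sketch: rewrite aten.linear into a 1x1 conv2d by inserting
    reshape/transpose nodes (handles the rank-3 activation case only)."""
    graph = gm.graph
    for node in list(graph.nodes):
        if node.op != "call_function" or node.target != torch.ops.aten.linear.default:
            continue
        x, weight = node.args[0], node.args[1]
        bias = node.args[2] if len(node.args) > 2 else None
        with graph.inserting_before(node):
            # activation: (B, L, in) -> (B, L, in, 1) -> (B, in, L, 1)
            x4 = graph.call_function(torch.ops.aten.unsqueeze.default, (x, -1))
            x4 = graph.call_function(torch.ops.aten.transpose.int, (x4, 1, 2))
            # weight: (out, in) -> (out, in, 1, 1)
            w4 = graph.call_function(torch.ops.aten.unsqueeze.default, (weight, -1))
            w4 = graph.call_function(torch.ops.aten.unsqueeze.default, (w4, -1))
            conv = graph.call_function(torch.ops.aten.conv2d.default, (x4, w4, bias))
            # output: (B, out, L, 1) -> (B, L, out, 1) -> (B, L, out)
            out = graph.call_function(torch.ops.aten.transpose.int, (conv, 1, 2))
            out = graph.call_function(torch.ops.aten.squeeze.dim, (out, -1))
        node.replace_all_uses_with(out)
        graph.erase_node(node)
    graph.eliminate_dead_code()
    gm.recompile()
    return gm
```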
# =====================================================================
outs = self.model(
    input_ids=input_ids,
    attention_mask=attn_mask,
One question: if you have to specify a per-layer mask, how would you?
@guangy10 does the transformers API allow a per-layer mask to be specified here, e.g. as a list of tensors?
)
if quant_dtype == QuantDtype.use_16a4w_block:
    conv_nodes = [
        n for n in fx_graph_module.graph.nodes if "conv" in n.name
Don't you want to check the type of the node or node.target to see if it is a conv?
Good point. Thanks!
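A minimal sketch of that check, matching on the node target rather than the node name (the exact set of conv targets present in the lowered graph is an assumption):

```python
import torch
from torch.fx import GraphModule


def get_conv_nodes(fx_graph_module: GraphModule):
    # Match on the call target instead of the substring "conv" in the node name.
    conv_targets = {
        torch.ops.aten.conv2d.default,
        torch.ops.aten.convolution.default,
    }
    return [
        n
        for n in fx_graph_module.graph.nodes
        if n.op == "call_function" and n.target in conv_targets
    ]
```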
@@ -67,10 +67,11 @@ class ET_EXPERIMENTAL TextDecoderRunner {
      const executorch::aten::Tensor& logits_tensor,
      const float temperature = 0.0f) {
    int32_t result = 0;
-   ET_SWITCH_THREE_TYPES(
+   ET_SWITCH_FOUR_TYPES(
@larryliu0820 do these changes seem acceptable?
I left some comments, but I think this is not the right approach for enabling transformer models; the inline comments describe why.
A couple of other questions I have:
Let me confirm my understanding: the general principle is to use the APIs provided by transformers, such as TorchExportableModuleWithStaticCache, without touching the actual model code structure.
Yes. I pointed to some examples of that, but note that it won't allow you to do the kinds of things you may be doing, like inserting R1/R3, etc., at least to my understanding. If you can do that using the attention customization interface, that's great.
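A rough sketch of that principle, assuming transformers' exportable-module helper (the class location, constructor, and cache-config fields below may differ between versions, so treat the details as assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, GenerationConfig
from transformers.integrations.executorch import TorchExportableModuleWithStaticCache

# Load the HF model configured for a static KV cache; the model code itself
# is left untouched.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    torch_dtype=torch.float32,
    generation_config=GenerationConfig(
        use_cache=True,
        cache_implementation="static",
        max_length=128,
        cache_config={"batch_size": 1, "max_cache_len": 128},
    ),
)

# Wrap and export; the resulting ExportedProgram is what a backend lowering
# flow (e.g. QNN) would consume downstream.
wrapper = TorchExportableModuleWithStaticCache(model)
example_input_ids = torch.zeros((1, 1), dtype=torch.long)
example_cache_position = torch.tensor([0], dtype=torch.long)
exported = torch.export.export(wrapper, (example_input_ids, example_cache_position))
```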
Summary:
- Add a decoder_model_wrapper.py to ensure that the exported model can be fully delegated to the QNN backend
- Add an e2e script to run Qwen 2.5
- Support SpinQuant R3
- Replace Qwen2Attention with QCQwen2Attention
- Pre-compute freqs_cos and freqs_sin to bypass rotary embedding (see the sketch below)
- Replace Qwen2RMSNorm with torch.nn.RMSNorm
- Tag quant IO to avoid inserting Q/DQ for I/O
- Reuse the executorch llama runner, llama_main

Note that accuracy is currently bad; more investigation is needed.
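The rotary pre-compute item above could look roughly like this (a generic sketch; head_dim, max_seq_len, and theta below are illustrative, not the PR's exact values):

```python
import torch


def precompute_freqs(head_dim: int, max_seq_len: int, theta: float = 1_000_000.0):
    # Build cos/sin lookup tables once, so the exported graph only indexes
    # them instead of computing rotary embeddings at runtime.
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(max_seq_len, dtype=torch.float32)
    freqs = torch.outer(positions, inv_freq)  # (max_seq_len, head_dim // 2)
    return torch.cos(freqs), torch.sin(freqs)


# e.g. Qwen 2.5 0.5B uses head_dim = 64; seq_len = 128 matches the runs below
freqs_cos, freqs_sin = precompute_freqs(head_dim=64, max_seq_len=128)
```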
This should be fixed with the latest tokenizer: https://github.com/pytorch-labs/tokenizers
- Support recomposing RMSNorm via pattern matching
- Leverage AttentionMaskInterface and AttentionInterface without touching the model structure (see the sketch below)
- Add an eval script to evaluate perplexity (ppl) on device
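A rough sketch of the AttentionInterface route mentioned above, assuming the registration API available in recent transformers releases (the function body is a generic eager attention, not this PR's exact implementation):

```python
import torch
from transformers import AttentionInterface


def qnn_friendly_attention(module, query, key, value, attention_mask,
                           scaling=None, dropout=0.0, **kwargs):
    # Expand grouped KV heads to match the query heads (GQA models like Qwen 2.5).
    num_groups = query.shape[1] // key.shape[1]
    if num_groups > 1:
        key = key.repeat_interleave(num_groups, dim=1)
        value = value.repeat_interleave(num_groups, dim=1)
    if scaling is None:
        scaling = query.shape[-1] ** -0.5
    # Plain eager attention expressed with ops that lower cleanly to QNN.
    attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = torch.softmax(attn_weights, dim=-1)
    attn_output = torch.matmul(attn_weights, value)
    attn_output = attn_output.transpose(1, 2).contiguous()
    return attn_output, attn_weights


# Register under a name and select it via attn_implementation; the model
# source is never modified.
AttentionInterface.register("qnn_attention", qnn_friendly_attention)
# model = AutoModelForCausalLM.from_pretrained(
#     "Qwen/Qwen2.5-0.5B", attn_implementation="qnn_attention"
# )
```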
Force-pushed from 34838d6 to 381bf9b
Hi @kimishpatel and @cccclai, I have pushed a commit that leverages transformers APIs such as AttentionInterface, AttentionMaskInterface, and TorchExportableModuleForDecoderOnlyLM to make the decoder model QNN-friendly without altering the model structure.

Results

It can be fully delegated and produces reasonable results with Qwen 2.5 0.5B.
Validated the .pte on wikitext (limit = 1).
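For reference, the perplexity reported by the eval is just the exponential of the mean next-token negative log-likelihood; a generic sketch (not this PR's eval script):

```python
import torch
import torch.nn.functional as F


def perplexity(logits: torch.Tensor, target_ids: torch.Tensor) -> float:
    # logits: (seq_len, vocab_size) scores for each position,
    # target_ids: (seq_len,) ground-truth next-token ids.
    nll = F.cross_entropy(logits, target_ids, reduction="mean")
    return torch.exp(nll).item()
```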
Reproduce command
Summary:
Note that accuracy is currently bad; more investigation is needed.
Reproduce command
Results
7/9
- ptq: 16a16w
- Speed: 62 tok/sec on SM8750, seq_len = 128
- Accuracy: Bad
- Outputs:

7/11
- ptq: 16a8w
- Speed: 135 tok/sec on SM8750, seq_len = 128
- Accuracy: Seems better
- Outputs:
cc: @winskuo-quic, @haowhsu-quic