Implemented range setting in QNN llama flow #12377
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12377
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures as of commit d55c96d with merge base dd4488d.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D78127727
Summary: `llama.py` now has a `--range_setting` flag with two options, `mse_weight_only` and `mse_with_act_loss`. There is also an eval script for computing perplexity, `eval_llama_qnn.py` (for faster eval, try sequence length 1024). That script additionally has a `--quant_linear_only` flag to quantize only linear/conv nodes, for faster experiments.

Commands:

```
python examples/qualcomm/oss_scripts/llama/llama.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss
```

```
python examples/qualcomm/oss_scripts/llama/eval_llama_qnn.py --checkpoint {MODEL_DIR}/consolidated.00.pth --params {MODEL_DIR}/params.json --tokenizer_path {MODEL_DIR}/tokenizer.model --max_seq_length 128 --ptq 16a4w --range_setting mse_with_act_loss
```

Rollback Plan:

Differential Revision: D78127727
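For readers unfamiliar with the technique, below is a minimal sketch of what MSE-based weight range setting does. This is an illustration of the general idea under my own assumptions, not the code in this PR, and all names in it are hypothetical. Instead of clipping at the absolute-max weight, it searches over candidate clipping thresholds and keeps the one that minimizes the quantize-dequantize reconstruction error; the `mse_with_act_loss` variant presumably scores candidates by the error they induce downstream (e.g. in layer outputs) rather than in the weights themselves.

```python
# Hypothetical sketch of MSE-based (weight-only) range setting; illustrative only.
import torch

def mse_weight_threshold(w: torch.Tensor, n_bits: int = 4, num_steps: int = 100) -> float:
    """Pick a symmetric clipping threshold for `w` that minimizes quantization MSE."""
    qmax = 2 ** (n_bits - 1) - 1  # e.g. 7 for signed 4-bit
    max_abs = w.abs().max().item()
    best_thresh, best_err = max_abs, float("inf")
    for step in range(1, num_steps + 1):
        thresh = max_abs * step / num_steps  # candidate clipping range
        scale = thresh / qmax
        # Fake-quantize: scale, round to the integer grid, clamp, dequantize.
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        err = torch.mean((w - w_q) ** 2).item()
        if err < best_err:
            best_thresh, best_err = thresh, err
    return best_thresh

# Usage: find a clipping threshold for a random weight matrix at 4 bits.
print(mse_weight_threshold(torch.randn(256, 256), n_bits=4))
```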
Force-pushed from a457091 to d55c96d.
Still reading, will finish reading in a bit
```python
model.ar_len = model.max_seq_len
tokens, atten_mask = model.get_example_inputs(use_kv_cache=False)
atten_mask.to(torch.float)
print(atten_mask.shape)
```
Removing debugging line
```python
kv_quant_attrs=kv_quant_attrs,
),
)
# custom_annotations = custom_annotations + (
```
Actually I need to have a separate PR for this.