Commit dc267a8

Update LLM documentation
1 parent 3419b46 commit dc267a8

5 files changed (+738, -875 lines)

docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md

Lines changed: 33 additions & 13 deletions
@@ -13,37 +13,57 @@ This tutorial demonstrates how to export Llama 3 8B Instruct for Qualcomm AI Eng

 ## Instructions

-### Step1: Prepare the checkpoint of the model and optimized matrix from [Spin Quant](https://github.com/facebookresearch/SpinQuant)
+### Step 1: Prepare the checkpoint of the model and optimized matrix from [Spin Quant](https://github.com/facebookresearch/SpinQuant)

 1. For Llama 3 tokenizer and checkpoint, please refer to https://github.com/meta-llama/llama-models/blob/main/README.md for further instructions on how to download `tokenizer.model`, `consolidated.00.pth` and `params.json`.
 2. To get the optimized matrix, please refer to [SpinQuant on GitHub](https://github.com/facebookresearch/SpinQuant). You can download the optimized rotation matrices in the Quantized Models section. Please choose **LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0**.

-### Step2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend
+### Step 2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend
 Deploying large language models like Llama 3 on-device presents the following challenges:

 1. The model size is too large to fit in device memory for inference.
 2. High model loading and inference time.
 3. Difficulty in quantization.

 To address these challenges, we have implemented the following solutions:
-1. Using `--pt2e_quantize qnn_16a4w` to quantize activations and weights, thereby reducing the on-disk model size and alleviating memory pressure during inference.
-2. Using `--num_sharding 8` to shard the model into sub-parts.
+1. Using `quantization.pt2e_quantize = "qnn_16a4w"` to quantize activations and weights, thereby reducing the on-disk model size and alleviating memory pressure during inference.
+2. Using `backend.qnn.num_sharding = 8` to shard the model into sub-parts.
 3. Performing graph transformations to convert or decompose operations into more accelerator-friendly operations.
-4. Using `--optimized_rotation_path <path_to_optimized_matrix>` to apply R1 and R2 of [Spin Quant](https://github.com/facebookresearch/SpinQuant) to improve accuracy.
-5. Using `--calibration_data "<|start_header_id|>system<|end_header_id|..."` to ensure that during the quantization of Llama 3 8B instruct, the calibration includes special tokens in the prompt template. For more details on the prompt template, refer to [the model card of meta llama3 instruct](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).
+4. Using `backend.qnn.optimized_rotation_path = "<path_to_optimized_matrix>"` to apply R1 and R2 of [Spin Quant](https://github.com/facebookresearch/SpinQuant) to improve accuracy.
+5. Using `quantization.calibration_data = "<|start_header_id|>system<|end_header_id|..."` to ensure that during quantization, the calibration includes special tokens in the prompt template. For more details on the prompt template, refer to [the model card](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).

-To export Llama 3 8B instruct with the Qualcomm AI Engine Direct Backend, ensure the following:
+To export with the Qualcomm AI Engine Direct Backend, ensure the following:

 1. The host machine has more than 100GB of memory (RAM + swap space).
 2. The entire process takes a few hours.

 ```bash
-# Please note that calibration_data must include the prompt template for special tokens.
-python -m examples.models.llama.export_llama -t <path_to_tokenizer.model>
-llama3/Meta-Llama-3-8B-Instruct/tokenizer.model -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+# path/to/config.yaml
+base:
+  model_class: llama3
+  checkpoint: path/to/consolidated.00.pth
+  params: path/to/params.json
+  tokenizer_path: path/to/tokenizer.model
+  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
+model:
+  use_kv_cache: True
+  enable_dynamic_shape: False
+quantization:
+  pt2e_quantize: qnn_16a4w
+  # Please note that calibration_data must include the prompt template for special tokens.
+  calibration_data: "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+backend:
+  qnn:
+    enabled: True
+    num_sharding: 8
+
+
+# export_llm
+python -m extension.llm.export.export_llm \
+  --config path/to/config.yaml
 ```
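
> Note: the new example config above does not include the `backend.qnn.optimized_rotation_path` setting described in item 4, and the old `--calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128` flags have no counterpart in it. A minimal sketch of how these might be added, assuming the same schema as the example; the three `calibration_*` field names under `quantization` are assumptions carried over from the removed CLI flags, not settings confirmed by this commit:

```yaml
# Sketch only: possible additions to path/to/config.yaml from the example above.
quantization:
  pt2e_quantize: qnn_16a4w
  calibration_tasks: [wikitext]  # assumed field, mirrors the old --calibration_tasks flag
  calibration_limit: 1           # assumed field, mirrors the old --calibration_limit flag
  calibration_seq_length: 128    # assumed field, mirrors the old --calibration_seq_length flag
backend:
  qnn:
    enabled: True
    num_sharding: 8
    # Placement follows the dotted name `backend.qnn.optimized_rotation_path` in item 4.
    optimized_rotation_path: "<path_to_optimized_matrix>"
```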

-### Step3: Invoke the Runtime on an Android smartphone with Qualcomm SoCs
+### Step 3: Invoke the Runtime on an Android smartphone with Qualcomm SoCs
 1. Build executorch with Qualcomm AI Engine Direct Backend for android
 ```bash
 cmake \
@@ -116,9 +136,9 @@ You should see the message:
 ```

 ## What is coming?
-- Improve the performance for Llama 3 Instruct
+- Performance improvements
 - Reduce the memory pressure during inference to support 12GB Qualcomm devices
-- Support more LLMs
+- Support more LLMs (Qwen, Phi-4-mini, etc.)

 ## FAQ
