From 1b2fd42c601f40de161ced334177459150508afd Mon Sep 17 00:00:00 2001 From: Jack <32371937+jackzhxng@users.noreply.github.com> Date: Thu, 10 Jul 2025 16:44:20 -0700 Subject: [PATCH 1/4] Update LLM documentation --- docs/source/index.md | 24 +- ...lama3-qualcomm-ai-engine-direct-backend.md | 46 +- docs/source/llm/export-custom-llm.md | 424 +++++++++ docs/source/llm/export-llm.md | 271 ++++++ docs/source/llm/getting-started.md | 873 +----------------- docs/source/llm/run-with-c-plus-plus.md | 1 + 6 files changed, 756 insertions(+), 883 deletions(-) create mode 100644 docs/source/llm/export-custom-llm.md create mode 100644 docs/source/llm/export-llm.md create mode 100644 docs/source/llm/run-with-c-plus-plus.md diff --git a/docs/source/index.md b/docs/source/index.md index 5f114d547ac..c69ba8c172e 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -87,11 +87,13 @@ ExecuTorch provides support for: - [Custom ATen Kernel](kernel-library-custom-aten-kernel) - [Selective Build](kernel-library-selective-build) #### Working with LLMs -- [Llama](llm/llama.md) -- [Llama on Android](llm/llama-demo-android.md) -- [Llama on iOS](llm/llama-demo-ios.md) -- [Llama on Android via Qualcomm backend](llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md) -- [Intro to LLMs in Executorch](llm/getting-started.md) +- [Getting Started](llm/getting-started.md) +- [Exporting LLMs with export_llm](llm/export-llm.md) +- [Exporting custom LLMs](llm/export-custom-llm.md) +- [Running with C++](llm/run-with-c-plus-plus.md) +- [Running on Android (XNNPack)](llm/llama-demo-android.md) +- [Running on Android (QNN)](llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md) +- [Running on iOS](llm/llama-demo-ios.md) #### Backend Development - [Delegates Integration](backend-delegates-integration) - [XNNPACK Reference](backend-delegates-xnnpack-reference) @@ -239,11 +241,13 @@ kernel-library-selective-build :caption: Working with LLMs :hidden: -Llama -Llama on Android -Llama on iOS -Llama on Android via Qualcomm backend -Intro to LLMs in Executorch +Getting Started +Exporting LLMs with export_llm +Exporting custom LLMs +Running with C++ +Running on Android +Running on Android +Running on iOS ``` ```{toctree} diff --git a/docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md b/docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md index b61dd93233e..4587589a51b 100644 --- a/docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md +++ b/docs/source/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md @@ -13,12 +13,12 @@ This tutorial demonstrates how to export Llama 3 8B Instruct for Qualcomm AI Eng ## Instructions -### Step1: Prepare the checkpoint of the model and optimized matrix from [Spin Quant](https://github.com/facebookresearch/SpinQuant) +### Step 1: Prepare the checkpoint of the model and optimized matrix from [Spin Quant](https://github.com/facebookresearch/SpinQuant) 1. For Llama 3 tokenizer and checkpoint, please refer to https://github.com/meta-llama/llama-models/blob/main/README.md for further instructions on how to download `tokenizer.model`, `consolidated.00.pth` and `params.json`. 2. To get the optimized matrix, please refer to [SpinQuant on GitHub](https://github.com/facebookresearch/SpinQuant). You can download the optimized rotation matrices in the Quantized Models section. Please choose **LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0**. 
-### Step2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend
+### Step 2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend
 Deploying large language models like Llama 3 on-device presents the following challenges:

 1. The model size is too large to fit in device memory for inference.
@@ -26,24 +26,44 @@ Deploying large language models like Llama 3 on-device presents the following ch
 3. Difficulty in quantization.

 To address these challenges, we have implemented the following solutions:
-1. Using `--pt2e_quantize qnn_16a4w` to quantize activations and weights, thereby reducing the on-disk model size and alleviating memory pressure during inference.
-2. Using `--num_sharding 8` to shard the model into sub-parts.
+1. Using `quantization.pt2e_quantize = "qnn_16a4w"` to quantize activations and weights, thereby reducing the on-disk model size and alleviating memory pressure during inference.
+2. Using `backend.qnn.num_sharding = 8` to shard the model into sub-parts.
 3. Performing graph transformations to convert or decompose operations into more accelerator-friendly operations.
-4. Using `--optimized_rotation_path ` to apply R1 and R2 of [Spin Quant](https://github.com/facebookresearch/SpinQuant) to improve accuracy.
-5. Using `--calibration_data "<|start_header_id|>system<|end_header_id|..."` to ensure that during the quantization of Llama 3 8B instruct, the calibration includes special tokens in the prompt template. For more details on the prompt template, refer to [the model card of meta llama3 instruct](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).
+4. Using `backend.qnn.optimized_rotation_path = "<path_to_optimized_matrix>"` to apply R1 and R2 of [Spin Quant](https://github.com/facebookresearch/SpinQuant) to improve accuracy.
+5. Using `quantization.calibration_data = "<|start_header_id|>system<|end_header_id|..."` to ensure that during quantization, the calibration includes the special tokens of the prompt template. For more details on the prompt template, refer to [the model card](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).

-To export Llama 3 8B instruct with the Qualcomm AI Engine Direct Backend, ensure the following:
+To export with the Qualcomm AI Engine Direct Backend, ensure the following:

 1. The host machine has more than 100GB of memory (RAM + swap space).
 2. The entire process takes a few hours.

 ```bash
-# Please note that calibration_data must include the prompt template for special tokens.
-python -m examples.models.llama.export_llama -t
-llama3/Meta-Llama-3-8B-Instruct/tokenizer.model -p -c --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+# path/to/config.yaml
+base:
+  model_class: llama3
+  checkpoint: path/to/consolidated.00.pth
+  params: path/to/params.json
+  tokenizer_path: path/to/tokenizer.model
+  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
+model:
+  use_kv_cache: True
+  enable_dynamic_shape: False
+quantization:
+  pt2e_quantize: qnn_16a4w
+  # Please note that calibration_data must include the prompt template for special tokens.
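+  # The earlier CLI flow also pinned the calibration corpus explicitly. If your version of
+  # LlmConfig exposes these fields, the equivalent settings would be (illustrative, not
+  # verified for every release):
+  # calibration_tasks: ["wikitext"]
+  # calibration_limit: 1
+  # calibration_seq_length: 128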
+ calibration_data: "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" +backend: + qnn: + enabled: True + num_sharding: 8 + + +# export_llm +python -m extension.llm.export.export_llm \ + --config path/to/config.yaml ``` -### Step3: Invoke the Runtime on an Android smartphone with Qualcomm SoCs +### Step 3: Invoke the Runtime on an Android smartphone with Qualcomm SoCs 1. Build executorch with Qualcomm AI Engine Direct Backend for android ```bash cmake \ @@ -116,9 +136,9 @@ You should see the message: ``` ## What is coming? -- Improve the performance for Llama 3 Instruct +- Performance improvements - Reduce the memory pressure during inference to support 12GB Qualcomm devices -- Support more LLMs +- Support more LLMs (Qwen, Phi-4-mini, etc.) ## FAQ diff --git a/docs/source/llm/export-custom-llm.md b/docs/source/llm/export-custom-llm.md new file mode 100644 index 00000000000..c1191cf85f6 --- /dev/null +++ b/docs/source/llm/export-custom-llm.md @@ -0,0 +1,424 @@ +# Exporting custom LLMs + +If you have your own PyTorch model that is an LLM, this guide will show you how to manually export and lower to ExecuTorch, with many of the same optimizations as covered in the previous `export_llm` guide. + +This example uses Karpathy’s [nanoGPT](https://github.com/karpathy/nanoGPT), which is a minimal implementation of +GPT-2 124M. This guide is applicable to other language models, as ExecuTorch is model-invariant. + + +## Exporting to ExecuTorch (basic) + +Exporting takes a PyTorch model and converts it into a format that can run efficiently on consumer devices. + +For this example, you will need the nanoGPT model and the corresponding tokenizer vocabulary. + +::::{tab-set} +:::{tab-item} curl +``` +curl https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py -O +curl https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json -O +``` +::: +:::{tab-item} wget +``` +wget https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py +wget https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json +``` +::: +:::: + +To convert the model into a format optimized for standalone execution, there are two steps. First, use the PyTorch +`export` function to convert the PyTorch model into an intermediate, platform-independent intermediate representation. Then +use the ExecuTorch `to_edge` and `to_executorch` methods to prepare the model for on-device execution. This creates a .pte +file which can be loaded by a desktop or mobile application at runtime. + +Create a file called export_nanogpt.py with the following contents: + +```python +# export_nanogpt.py + +import torch + +from executorch.exir import EdgeCompileConfig, to_edge +from torch.nn.attention import sdpa_kernel, SDPBackend +from torch.export import export, export_for_training + +from model import GPT + +# Load the model. +model = GPT.from_pretrained('gpt2') + +# Create example inputs. This is used in the export process to provide +# hints on the expected shape of the model input. +example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), ) + +# Set up dynamic shape configuration. This allows the sizes of the input tensors +# to differ from the sizes of the tensors in `example_inputs` during runtime, as +# long as they adhere to the rules specified in the dynamic shape configuration. 
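+# For this nanoGPT example only the prompt length varies between runs, which is why a
+# single dynamic dimension (the token/sequence dimension) is declared below.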
+# Here we set the range of 0th model input's 1st dimension as +# [0, model.config.block_size]. +# See https://pytorch.org/executorch/main/concepts#dynamic-shapes +# for details about creating dynamic shapes. +dynamic_shape = ( + {1: torch.export.Dim("token_dim", max=model.config.block_size)}, +) + +# Trace the model, converting it to a portable intermediate representation. +# The torch.no_grad() call tells PyTorch to exclude training-specific logic. +with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad(): + m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module() + traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape) + +# Convert the model into a runnable ExecuTorch program. +edge_config = EdgeCompileConfig(_check_ir_validity=False) +edge_manager = to_edge(traced_model, compile_config=edge_config) +et_program = edge_manager.to_executorch() + +# Save the ExecuTorch program to a file. +with open("nanogpt.pte", "wb") as file: + file.write(et_program.buffer) +``` + +To export, run the script with `python export_nanogpt.py` (or python3, as appropriate for your environment). It will generate a `nanogpt.pte` file in the current directory. + +For more information, see [Exporting to ExecuTorch](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial) and +[torch.export](https://pytorch.org/docs/stable/export.html). + +## Backend delegation + +While ExecuTorch provides a portable, cross-platform implementation for all +operators, it also provides specialized backends for a number of different +targets. These include, but are not limited to, x86 and ARM CPU acceleration via +the XNNPACK backend, Apple acceleration via the Core ML backend and Metal +Performance Shader (MPS) backend, and GPU acceleration via the Vulkan backend. + +Because optimizations are specific to a given backend, each pte file is specific +to the backend(s) targeted at export. To support multiple devices, such as +XNNPACK acceleration for Android and Core ML for iOS, export a separate PTE file +for each backend. + +To delegate a model to a specific backend during export, ExecuTorch uses the +`to_edge_transform_and_lower()` function. This function takes the exported program +from `torch.export` and a backend-specific partitioner object. The partitioner +identifies parts of the computation graph that can be optimized by the target +backend. Within `to_edge_transform_and_lower()`, the exported program is +converted to an edge dialect program. The partitioner then delegates compatible +graph sections to the backend for acceleration and optimization. Any graph parts +not delegated are executed by ExecuTorch's default operator implementations. + +To delegate the exported model to a specific backend, we need to import its +partitioner as well as edge compile config from ExecuTorch codebase first, then +call `to_edge_transform_and_lower`. 
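+
+The same traced program can be lowered more than once to produce one `.pte` per backend. The sketch below is illustrative only: it reuses `traced_model` from the export script above, keeps the default compile config, and the commented-out Core ML entry is a hypothetical second target whose partitioner import is not shown here.
+
+```python
+# Sketch: produce one backend-specific .pte per target from a single traced program.
+from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
+from executorch.exir import to_edge_transform_and_lower
+
+targets = [
+    ("nanogpt_xnnpack.pte", [XnnpackPartitioner()]),
+    # ("nanogpt_coreml.pte", [CoreMLPartitioner()]),  # hypothetical second backend
+]
+for filename, partitioners in targets:
+    # Partition, delegate, and convert to an ExecuTorch program for this target.
+    et_program = to_edge_transform_and_lower(
+        traced_model, partitioner=partitioners
+    ).to_executorch()
+    with open(filename, "wb") as f:
+        f.write(et_program.buffer)
+```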
+ +Here's an example of how to delegate nanoGPT to XNNPACK (if you're deploying to an Android phone for instance): + +```python +# export_nanogpt.py + +# Load partitioner for Xnnpack backend +from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner + +# Model to be delegated to specific backend should use specific edge compile config +from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config +from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower + +import torch +from torch.export import export +from torch.nn.attention import sdpa_kernel, SDPBackend +from torch.export import export_for_training + +from model import GPT + +# Load the nanoGPT model. +model = GPT.from_pretrained('gpt2') + +# Create example inputs. This is used in the export process to provide +# hints on the expected shape of the model input. +example_inputs = ( + torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long), + ) + +# Set up dynamic shape configuration. This allows the sizes of the input tensors +# to differ from the sizes of the tensors in `example_inputs` during runtime, as +# long as they adhere to the rules specified in the dynamic shape configuration. +# Here we set the range of 0th model input's 1st dimension as +# [0, model.config.block_size]. +# See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes +# for details about creating dynamic shapes. +dynamic_shape = ( + {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)}, +) + +# Trace the model, converting it to a portable intermediate representation. +# The torch.no_grad() call tells PyTorch to exclude training-specific logic. +with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad(): + m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module() + traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape) + +# Convert the model into a runnable ExecuTorch program. +# To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config +edge_config = get_xnnpack_edge_compile_config() +# Converted to edge program and then delegate exported model to Xnnpack backend +# by invoking `to` function with Xnnpack partitioner. +edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config) +et_program = edge_manager.to_executorch() + +# Save the Xnnpack-delegated ExecuTorch program to a file. +with open("nanogpt.pte", "wb") as file: + file.write(et_program.buffer) +``` + + +## Quantization + +Quantization refers to a set of techniques for running calculations and storing tensors using lower precision types. +Compared to 32-bit floating point, using 8-bit integers can provide both a significant speedup and reduction in +memory usage. There are many approaches to quantizing a model, varying in amount of pre-processing required, data +types used, and impact on model accuracy and performance. + +Because compute and memory are highly constrained on mobile devices, some form of quantization is necessary to ship +large models on consumer electronics. In particular, large language models, such as Llama2, may require quantizing +model weights to 4 bits or less. + +Leveraging quantization requires transforming the model before export. PyTorch provides the pt2e (PyTorch 2 Export) +API for this purpose. This example targets CPU acceleration using the XNNPACK delegate. 
As such, it needs to use the + XNNPACK-specific quantizer. Targeting a different backend will require use of the corresponding quantizer. + +To use 8-bit integer dynamic quantization with the XNNPACK delegate, call `prepare_pt2e`, calibrate the model by +running with a representative input, and then call `convert_pt2e`. This updates the computational graph to use +quantized operators where available. + +```python +# export_nanogpt.py + +from executorch.backends.transforms.duplicate_dynamic_quant_chain import ( + DuplicateDynamicQuantChainPass, +) +from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import ( + get_symmetric_quantization_config, + XNNPACKQuantizer, +) +from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e +``` + +```python +# Use dynamic, per-channel quantization. +xnnpack_quant_config = get_symmetric_quantization_config( + is_per_channel=True, is_dynamic=True +) +xnnpack_quantizer = XNNPACKQuantizer() +xnnpack_quantizer.set_global(xnnpack_quant_config) + +m = export_for_training(model, example_inputs).module() + +# Annotate the model for quantization. This prepares the model for calibration. +m = prepare_pt2e(m, xnnpack_quantizer) + +# Calibrate the model using representative inputs. This allows the quantization +# logic to determine the expected range of values in each tensor. +m(*example_inputs) + +# Perform the actual quantization. +m = convert_pt2e(m, fold_quantize=False) +DuplicateDynamicQuantChainPass()(m) + +traced_model = export(m, example_inputs) +``` + +Additionally, add or update the `to_edge_transform_and_lower()` call to use `XnnpackPartitioner`. This +instructs ExecuTorch to optimize the model for CPU execution via the XNNPACK backend. + +```python +from executorch.backends.xnnpack.partition.xnnpack_partitioner import ( + XnnpackPartitioner, +) +``` + +```python +edge_config = get_xnnpack_edge_compile_config() +# Convert to edge dialect and lower to XNNPack. +edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config) +et_program = edge_manager.to_executorch() + +with open("nanogpt.pte", "wb") as file: + file.write(et_program.buffer) +``` + +For more information, see [Quantization in ExecuTorch](../quantization-overview.md). + +## Profiling and Debugging +After lowering a model by calling `to_edge_transform_and_lower()`, you may want to see what got delegated and what didn’t. ExecuTorch +provides utility methods to give insight on the delegation. You can use this information to gain visibility into +the underlying computation and diagnose potential performance issues. Model authors can use this information to +structure the model in a way that is compatible with the target backend. + +### Visualizing the Delegation + +The `get_delegation_info()` method provides a summary of what happened to the model after the `to_edge_transform_and_lower()` call: + +```python +from executorch.devtools.backend_debug import get_delegation_info +from tabulate import tabulate + +# ... 
After call to to_edge_transform_and_lower(), but before to_executorch() +graph_module = edge_manager.exported_program().graph_module +delegation_info = get_delegation_info(graph_module) +print(delegation_info.get_summary()) +df = delegation_info.get_operator_delegation_dataframe() +print(tabulate(df, headers="keys", tablefmt="fancy_grid")) +``` + +For nanoGPT targeting the XNNPACK backend, you might see the following (note that the numbers below are for illustration purposes only and actual values may vary): +``` +Total delegated subgraphs: 145 +Number of delegated nodes: 350 +Number of non-delegated nodes: 760 +``` + + +| | op_type | # in_delegated_graphs | # in_non_delegated_graphs | +|----|---------------------------------|------- |-----| +| 0 | aten__softmax_default | 12 | 0 | +| 1 | aten_add_tensor | 37 | 0 | +| 2 | aten_addmm_default | 48 | 0 | +| 3 | aten_any_dim | 0 | 12 | +| | ... | | | +| 25 | aten_view_copy_default | 96 | 122 | +| | ... | | | +| 30 | Total | 350 | 760 | + +From the table, the operator `aten_view_copy_default` appears 96 times in delegate graphs and 122 times in non-delegated graphs. +To see a more detailed view, use the `format_delegated_graph()` method to get a formatted str of printout of the whole graph or use `print_delegated_graph()` to print directly: + +```python +from executorch.exir.backend.utils import format_delegated_graph +graph_module = edge_manager.exported_program().graph_module +print(format_delegated_graph(graph_module)) +``` +This may generate a large amount of output for large models. Consider using "Control+F" or "Command+F" to locate the operator you’re interested in +(e.g. “aten_view_copy_default”). Observe which instances are not under lowered graphs. + +In the fragment of the output for nanoGPT below, observe that a transformer module has been delegated to XNNPACK while the where operator is not. 
+ +``` +%aten_where_self_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.where.self](args = (%aten_logical_not_default_33, %scalar_tensor_23, %scalar_tensor_22), kwargs = {}) +%lowered_module_144 : [num_users=1] = get_attr[target=lowered_module_144] +backend_id: XnnpackBackend +lowered graph(): + %p_transformer_h_0_attn_c_attn_weight : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_weight] + %p_transformer_h_0_attn_c_attn_bias : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_bias] + %getitem : [num_users=1] = placeholder[target=getitem] + %sym_size : [num_users=2] = placeholder[target=sym_size] + %aten_view_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%getitem, [%sym_size, 768]), kwargs = {}) + %aten_permute_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.permute_copy.default](args = (%p_transformer_h_0_attn_c_attn_weight, [1, 0]), kwargs = {}) + %aten_addmm_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.addmm.default](args = (%p_transformer_h_0_attn_c_attn_bias, %aten_view_copy_default, %aten_permute_copy_default), kwargs = {}) + %aten_view_copy_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%aten_addmm_default, [1, %sym_size, 2304]), kwargs = {}) + return [aten_view_copy_default_1] +``` + +### Performance Analysis + +Through the ExecuTorch Developer Tools, users are able to profile model execution, giving timing information for each operator in the model. + +#### Prerequisites + +##### ETRecord generation (Optional) + +An ETRecord is an artifact generated at the time of export that contains model graphs and source-level metadata linking the ExecuTorch program to the original PyTorch model. You can view all profiling events without an ETRecord, though with an ETRecord, you will also be able to link each event to the types of operators being executed, module hierarchy, and stack traces of the original PyTorch source code. For more information, see [the ETRecord docs](../etrecord.rst). + + +In your export script, after calling `to_edge()` and `to_executorch()`, call `generate_etrecord()` with the `EdgeProgramManager` from `to_edge()` and the `ExecuTorchProgramManager` from `to_executorch()`. Make sure to copy the `EdgeProgramManager`, as the call to `to_edge_transform_and_lower()` mutates the graph in-place. + +``` +# export_nanogpt.py + +import copy +from executorch.devtools import generate_etrecord + +# Make the deep copy immediately after to to_edge() +edge_manager_copy = copy.deepcopy(edge_manager) + +# ... +# Generate ETRecord right after to_executorch() +etrecord_path = "etrecord.bin" +generate_etrecord(etrecord_path, edge_manager_copy, et_program) +``` + +Run the export script and the ETRecord will be generated as `etrecord.bin`. + +##### ETDump generation + +An ETDump is an artifact generated at runtime containing a trace of the model execution. For more information, see [the ETDump docs](../etdump.md). + +Include the ETDump header and namespace in your code. +```cpp +// main.cpp + +#include + +using executorch::etdump::ETDumpGen; +using torch::executor::etdump_result; +``` + +Create an Instance of the ETDumpGen class and pass it to the Module constructor. 
+```cpp +std::unique_ptr etdump_gen_ = std::make_unique(); +Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors, std::move(etdump_gen_)); +``` + +After calling `generate()`, save the ETDump to a file. You can capture multiple +model runs in a single trace, if desired. +```cpp +ETDumpGen* etdump_gen = static_cast(model.event_tracer()); + +ET_LOG(Info, "ETDump size: %zu blocks", etdump_gen->get_num_blocks()); +etdump_result result = etdump_gen->get_etdump_data(); +if (result.buf != nullptr && result.size > 0) { + // On a device with a file system, users can just write it to a file. + FILE* f = fopen("etdump.etdp", "w+"); + fwrite((uint8_t*)result.buf, 1, result.size, f); + fclose(f); + free(result.buf); +} +``` + +Additionally, update CMakeLists.txt to build with Developer Tools and enable events to be traced and logged into ETDump: + +``` +option(EXECUTORCH_ENABLE_EVENT_TRACER "" ON) +option(EXECUTORCH_BUILD_DEVTOOLS "" ON) + +# ... + +target_link_libraries( + # ... omit existing ones + etdump) # Provides event tracing and logging + +target_compile_options(executorch PUBLIC -DET_EVENT_TRACER_ENABLED) +target_compile_options(portable_ops_lib PUBLIC -DET_EVENT_TRACER_ENABLED) +``` +Build and run the runner, you will see a file named “etdump.etdp” is generated. (Note that this time we build in release mode to get around a flatccrt build limitation.) +```bash +(rm -rf cmake-out && mkdir cmake-out && cd cmake-out && cmake -DCMAKE_BUILD_TYPE=Release ..) +cmake --build cmake-out -j10 +./cmake-out/nanogpt_runner +``` + +## Performance debugging and profiling + +Once you’ve collected debug artifacts ETDump (and optionally an ETRecord), you can use the Inspector API to view performance information. + +```python +from executorch.devtools import Inspector + +inspector = Inspector(etdump_path="etdump.etdp") +# If you also generated an ETRecord, then pass that in as well: `inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")` + +with open("inspector_out.txt", "w") as file: + inspector.print_data_tabular(file) +``` +This prints the performance data in a tabular format in “inspector_out.txt”, with each row being a profiling event. Top rows look like this: +![](../_static/img/llm_manual_print_data_tabular.png) +View in full size + +To learn more about the Inspector and the rich functionality it provides, see the [Inspector API Reference](../model-inspector.rst). diff --git a/docs/source/llm/export-llm.md b/docs/source/llm/export-llm.md new file mode 100644 index 00000000000..d8b69d51c40 --- /dev/null +++ b/docs/source/llm/export-llm.md @@ -0,0 +1,271 @@ +# Exporting popular LLMs out of the box + +Instead of needing to manually write code to call torch.export(), use ExecuTorch's assortment of lowering APIs, or even interact with TorchAO quantize_ APIs for quantization, we have provided an out of box experience which performantly exports a selection of supported models to ExecuTorch. + +As of this doc, the list of supported LLMs include the following: +- Llama 2/3/3.1/3.2 +- Qwen 2.5/3 +- Phi 3.5/4-mini +- SmolLM2 + +The up-to-date list of supported LLMs can be found in the code [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L32). + +## The export_llm API +`export_llm` is ExecuTorch's high-level export API for LLMs. In this tutorial, we will focus on exporting Llama 3.2 1B using this API. 
`export_llm`'s arguments are specified either through CLI args or through a yaml configuration whose fields are defined in [`LlmConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py). To call `export_llm`:
+
+```
+python -m extension.llm.export.export_llm \
+  --config <path-to-config-yaml> \
+  +base.<additional CLI arguments>
+```
+
+## Basic export
+
+To perform a basic export of Llama 3.2, we will first need to download the checkpoint file (`consolidated.00.pth`) and params file (`params.json`). You can find these from the [Llama website](https://www.llama.com/llama-downloads/) or [Hugging Face](https://huggingface.co/meta-llama/Llama-3.2-1B/tree/main/original).
+
+Then, we specify the `model_class`, `checkpoint` (path to checkpoint file), and `params` (path to params file) as arguments. Additionally, later when we run the exported .pte with our runner APIs, the runner will need to know the bos and eos ids for this model to know when to terminate. These are exposed through bos and eos getter methods in the .pte, which we can add by specifying bos and eos ids in a `metadata` argument. The values for these tokens can usually be found in the model's `tokenizer_config.json` on Hugging Face.
+
+```
+# path/to/config.yaml
+base:
+  model_class: llama3_2
+  checkpoint: path/to/consolidated.00.pth
+  params: path/to/params.json
+  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
+
+# export_llm
+python -m extension.llm.export.export_llm \
+  --config path/to/config.yaml
+```
+
+We only require manually specifying a checkpoint path for the Llama model family, since it is our most optimized model and we have more advanced optimizations such as [SpinQuant](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#spinquant) that require custom checkpoints.
+
+For the other supported LLMs, the checkpoint is downloaded from Hugging Face automatically, and the param files can be found in their respective directories under `executorch/examples/models`, for instance `executorch/examples/models/qwen3/config/0_6b_config.json`.
+
+## Adding optimizations
+`export_llm` performs a variety of optimizations on the model before export, during export, and during lowering. Quantization and delegation to accelerator backends are the main ones and will be covered in the next two sections. All other optimizations can be found under [`ModelConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L120). We will go ahead and add a few optimizations.
+
+```
+# path/to/config.yaml
+base:
+  model_class: llama3_2
+  checkpoint: path/to/consolidated.00.pth
+  params: path/to/params.json
+  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
+model:
+  use_kv_cache: True
+  use_sdpa_with_kv_cache: True
+
+# export_llm
+python -m extension.llm.export.export_llm \
+  --config path/to/config.yaml
+```
+
+`use_kv_cache` and `use_sdpa_with_kv_cache` are recommended for exporting any LLM, while other options are useful situationally. For example:
+- `use_shared_embedding` can help for models with tied input/output embedding layers, given that you quantize using TorchAO low-bit ops (`quantization.qmode: torchao:8da(\d+)w` or `quantization.qmode: torchao:fpa(\d+)w`), see more [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L307).
+- `use_attention_sink` extends generation by evicting tokens from the beginning of the KV cache once the max context length is reached.
+- `quantize_kv_cache` quantizes the KV cache in int8.
+- `local_global_attention` implements [Local-Global Attention](https://arxiv.org/abs/2411.09604), making specific attention layers use a much smaller, localized sliding-window KV cache.
+
+## Quantization
+Quantization options are defined by [`QuantizationConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L283). ExecuTorch does quantization in two ways:
+1. TorchAO [`quantize_`](https://docs.pytorch.org/ao/stable/generated/torchao.quantization.quantize_.html) API
+2. [pt2e quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html)
+
+### TorchAO
+TorchAO quantizes at the source code level, swapping out Linear modules for QuantizedLinear modules.
+This is the recommended quantization path for running on CPU.
+The quantization modes are defined [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L306).
+
+Common ones to use are:
+- `8da4w`: short for int8 dynamic activation + int4 weight quantization.
+- `int8`: int8 weight-only quantization.
+
+For Arm CPUs, there are also [low-bit kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for int8 dynamic activation + int[1-8] weight quantization. Note that these should not be used alongside XNNPACK; experimentally, we have found that their performance can sometimes even be better than the equivalent `8da4w` lowered to XNNPACK. To use these, set `qmode` to either:
+- `torchao:8da(\d+)w`: int8 dynamic activation + int[1-8] weights, for example `torchao:8da5w`
+- `torchao:fpa(\d+)w`: int[1-8] weight only, for example `torchao:fpa4w`
+
+To quantize embeddings, specify either `embedding_quantize: <bitwidth>,<groupsize>` (`bitwidth` here must be 2, 4, or 8), or for low-bit kernels use `embedding_quantize: torchao:<bitwidth>,<groupsize>` (`bitwidth` can be from 1-8).
+
+```
+# path/to/config.yaml
+base:
+  model_class: llama3_2
+  checkpoint: path/to/consolidated.00.pth
+  params: path/to/params.json
+  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
+model:
+  use_kv_cache: True
+  use_sdpa_with_kv_cache: True
+quantization:
+  embedding_quantize: 4,32
+  qmode: 8da4w
+
+# export_llm
+python -m extension.llm.export.export_llm \
+  --config path/to/config.yaml
+```
+
+### pt2e
+pt2e quantizes at the post-export graph level, swapping nodes and injecting quant/dequant nodes.
+It is used mainly for non-CPU backends (QNN, CoreML, Vulkan).
+Read more about pt2e [here](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html), and how ExecuTorch uses pt2e [here](https://github.com/pytorch/executorch/blob/main/docs/source/quantization-overview.md).
+
+
+## Backend support
+Backend options are defined by [`BackendConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L434). Each backend has its own configuration options. Here is an example of lowering the LLM to XNNPACK for CPU acceleration:
+
+```
+# path/to/config.yaml
+base:
+  model_class: llama3_2
+  checkpoint: path/to/consolidated.00.pth
+  params: path/to/params.json
+  metadata: '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}'
+model:
+  use_kv_cache: True
+  use_sdpa_with_kv_cache: True
+quantization:
+  embedding_quantize: 4,32
+  qmode: 8da4w
+backend:
+  xnnpack:
+    enabled: True
+    extended_ops: True # Expand the selection of ops delegated to XNNPACK.
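+  # Other accelerators follow the same enable-and-configure pattern -- for example, the
+  # Qualcomm tutorial uses a backend.qnn section (enabled, num_sharding) paired with
+  # quantization.pt2e_quantize. Check BackendConfig for the exact fields in your release.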
+ +# export_llm +python -m extension.llm.export.export_llm \ + --config path/to/config.yaml +``` + + +## Profiling and Debugging +To see which ops got delegated to the backend and which didn't, specify `verbose: True`: + +``` +# path/to/config.yaml +... +debug: + verbose: True +... + +# export_llm +python -m extension.llm.export.export_llm \ + --config path/to/config.yaml +``` + +In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated. Here is an example: +``` +Total delegated subgraphs: 368 +Number of delegated nodes: 2588 +Number of non-delegated nodes: 2513 +╒════╤═══════════════════════════════════════════╤═══════════════════════════════════╤═══════════════════════════════════════╕ +│ │ op_type │ occurrences_in_delegated_graphs │ occurrences_in_non_delegated_graphs │ +╞════╪═══════════════════════════════════════════╪═══════════════════════════════════╪═══════════════════════════════════════╡ +│ 0 │ _assert_scalar │ 0 │ 167 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 1 │ _local_scalar_dense │ 0 │ 123 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 2 │ add │ 0 │ 31 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 3 │ aten__to_copy_default │ 0 │ 44 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +| 4 │ aten_add_tensor │ 418 │ 44 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 5 │ aten_alias_copy_default │ 0 │ 52 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 6 │ aten_arange_start_step │ 0 │ 66 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 7 │ aten_bitwise_and_tensor │ 0 │ 44 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 8 │ aten_cat_default │ 52 │ 52 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 9 │ aten_copy_default │ 0 │ 22 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 10 │ aten_eq_scalar │ 0 │ 22 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 11 │ aten_full_default │ 0 │ 22 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 12 │ aten_ge_scalar │ 0 │ 44 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 13 │ aten_gelu_default │ 26 │ 0 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 14 │ aten_index_put_default │ 0 │ 22 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 15 │ aten_linear_default │ 183 │ 0 │ 
+├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 16 │ aten_lt_scalar │ 0 │ 44 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 17 │ aten_mean_dim │ 0 │ 157 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 18 │ aten_mul_tensor │ 445 │ 0 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 19 │ aten_neg_default │ 52 │ 0 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 20 │ aten_pow_tensor_scalar │ 157 │ 0 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 21 │ aten_remainder_scalar │ 0 │ 22 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 22 │ aten_rsqrt_default │ 157 │ 0 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 23 │ aten_select_copy_int │ 0 │ 124 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 24 │ aten_slice_copy_tensor │ 0 │ 107 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 25 │ aten_sub_tensor │ 0 │ 22 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 26 │ aten_unsqueeze_copy_default │ 0 │ 74 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 27 │ aten_view_copy_default │ 0 │ 126 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 28 │ aten_where_self │ 0 │ 44 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 29 │ auto_functionalized │ 0 │ 52 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 30 │ ge │ 0 │ 75 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 31 │ getitem │ 366 │ 628 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 32 │ le │ 0 │ 57 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 33 │ llama_custom_sdpa_default │ 0 │ 26 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 34 │ lt │ 0 │ 35 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 35 │ quantized_decomposed_embedding_4bit_dtype │ 0 │ 1 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 36 │ scalar_tensor │ 0 │ 88 │ 
+├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 37 │ sym_constrain_range_for_size │ 0 │ 75 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 38 │ sym_size │ 0 │ 1 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 39 │ torchao_choose_qparams_affine_default │ 183 │ 0 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 40 │ torchao_dequantize_affine_default │ 366 │ 0 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 41 │ torchao_quantize_affine_default │ 183 │ 0 │ +├────┼───────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤ +│ 42 │ Total │ 2588 │ 2513 │ +╘════╧═══════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛ +``` + +To do further performance analysis, you can may opt to use [ExecuTorch's Inspector APIs](https://docs.pytorch.org/executorch/stable/llm/getting-started.html#performance-analysis) to do things such as trace individual operator performance back to source code, view memory planning, and debug intermediate activations. To generate the ETRecord necessary for the Inspector APIs to link back to source code, you can use: + +``` +# path/to/config.yaml +... +debug: + generate_etrecord: True +... + +# export_llm +python -m extension.llm.export.export_llm \ + --config path/to/config.yaml +``` + +Other debug and profiling options can be found in [DebugConfig](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L228). + +A few examples ones: +- `profile_memory`: Used to generate activation memory profile in chrome trace format. It allows one to visualize the lifetimes of different intermediate tensors of a model, how their lifetimes overlap, where these tensors come from, and how they impact the memory footprint of the model during its execution. Click [here](https://github.com/pytorch/executorch/blob/dd4488d720d676a1227450e8ea0c0c97beed900c/docs/source/memory-planning-inspection.md?plain=1#L19) for more details on memory profiling. +- `profile_path`: Used to generate time profile of various components of export_llm. Such components include `torch.export`, quantization, `to_edge`, delegation via to_backend APIs etc. This option generate a .html file that gives you time profile in flamegraph/icicle format. It is helpful to understand what part of `export_llm` takes the most time. Largely useful for developers and contributors of ExecuTorch. For more details on flamegraph one can checkout https://www.parca.dev/docs/icicle-graph-anatomy/ diff --git a/docs/source/llm/getting-started.md b/docs/source/llm/getting-started.md index 06705c2e4d5..b4925a492cc 100644 --- a/docs/source/llm/getting-started.md +++ b/docs/source/llm/getting-started.md @@ -1,871 +1,24 @@ -# Intro to LLMs in Executorch +# Deploying LLMs to Executorch -Welcome to LLM Manual! This manual is designed to provide a practical example to leverage -ExecuTorch in onboarding your own Large Language Models (LLMs). Our primary goal is to offer - a clear and concise guideline on how to integrate our system with your own LLMs. 
- -Please note that this project is intended as a demonstration and not as a fully functional -example with optimal performance. As such, certain components such as the sampler, tokenizer, -and others are provided in their bare minimum versions solely for demonstration purposes. -Consequently, the results produced by the model may vary and might not always be optimal. +ExecuTorch is designed to support all types of machine learning models, and LLMs are no exception. +In this section we demonstrate how to leverage ExecuTorch to performantly run state of the art +LLMs on-device out of the box with our provided export LLM APIs, acceleration backends, quantization +libraries, tokenizers, and more. We encourage users to use this project as a starting point and adapt it to their specific needs, which includes creating your own versions of the tokenizer, sampler, acceleration backends, and other components. We hope this project serves as a useful guide in your journey with LLMs and ExecuTorch. -For deploying Llama with optimal performance, please see [Llama guide](llama.md). - -### Table Of Contents - - -1. Prerequisites -2. Hello World Example -3. Quantization -4. Using Mobile Acceleration -5. Debugging and Profiling -6. How to use custom kernels -7. How to build mobile apps - ## Prerequisites -To follow this guide, you'll need to clone the ExecuTorch repository and install dependencies. -ExecuTorch recommends Python 3.10 and the use of Conda to manage your environment. Conda is not -required, though be aware that you may need to replace the use of python/pip with python3/pip3 -depending on your environment. - -::::{tab-set} -:::{tab-item} conda -Instructions on installing miniconda can be [found here](https://docs.anaconda.com/free/miniconda). - -``` -# Create a directory for this example. -mkdir et-nanogpt -cd et-nanogpt - -# Clone the ExecuTorch repository. -mkdir third-party -git clone -b viable/strict https://github.com/pytorch/executorch.git third-party/executorch && cd third-party/executorch - -# Create either a Python virtual environment: -python3 -m venv .venv && source .venv/bin/activate && pip install --upgrade pip - -# Or a Conda environment: -conda create -yn executorch python=3.10.0 && conda activate executorch - -# Install requirements -./install_executorch.sh - -cd ../.. -``` -::: -:::{tab-item} pyenv-virtualenv -Instructions on installing pyenv-virtualenv can be [found here](https://github.com/pyenv/pyenv-virtualenv?tab=readme-ov-file#installing-with-homebrew-for-macos-users). - -Importantly, if installing pyenv through brew, it does not automatically enable pyenv in the terminal, leading to errors. Run the following commands to enable. -See the pyenv-virtualenv installation guide above on how to add this to your .bashrc or .zshrc to avoid needing to run these commands manually. -``` -eval "$(pyenv init -)" -eval "$(pyenv virtualenv-init -)" -``` - -``` -# Create a directory for this example. -mkdir et-nanogpt -cd et-nanogpt - -pyenv install -s 3.10 -pyenv virtualenv 3.10 executorch -pyenv activate executorch - -# Clone the ExecuTorch repository. -git clone -b viable/strict https://github.com/pytorch/executorch.git third-party/executorch && cd third-party/executorch - -# Install requirements. -PYTHON_EXECUTABLE=python ./install_executorch.sh - -cd ../.. -``` -::: -:::: - -For more information, see [Setting Up ExecuTorch](../getting-started-setup.rst). 
- - -## Running a Large Language Model Locally - -This example uses Karpathy’s [nanoGPT](https://github.com/karpathy/nanoGPT), which is a minimal implementation of -GPT-2 124M. This guide is applicable to other language models, as ExecuTorch is model-invariant. - -There are two steps to running a model with ExecuTorch: - -1. Export the model. This step preprocesses it into a format suitable for runtime execution. -2. At runtime, load the model file and run with the ExecuTorch runtime. - -
- -The export step happens ahead of time, typically as part of the application build or when the model changes. The resultant -.pte file is distributed with the application. At runtime, the application loads the .pte file and passes it to the -ExecuTorch runtime. - -### Step 1. Exporting to ExecuTorch - -Exporting takes a PyTorch model and converts it into a format that can run efficiently on consumer devices. - -For this example, you will need the nanoGPT model and the corresponding tokenizer vocabulary. - -::::{tab-set} -:::{tab-item} curl -``` -curl https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py -O -curl https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json -O -``` -::: -:::{tab-item} wget -``` -wget https://raw.githubusercontent.com/karpathy/nanoGPT/master/model.py -wget https://huggingface.co/openai-community/gpt2/resolve/main/vocab.json -``` -::: -:::: - -To convert the model into a format optimized for standalone execution, there are two steps. First, use the PyTorch -`export` function to convert the PyTorch model into an intermediate, platform-independent intermediate representation. Then -use the ExecuTorch `to_edge` and `to_executorch` methods to prepare the model for on-device execution. This creates a .pte -file which can be loaded by a desktop or mobile application at runtime. - -Create a file called export_nanogpt.py with the following contents: - -```python -# export_nanogpt.py - -import torch - -from executorch.exir import EdgeCompileConfig, to_edge -from torch.nn.attention import sdpa_kernel, SDPBackend -from torch.export import export, export_for_training - -from model import GPT - -# Load the model. -model = GPT.from_pretrained('gpt2') - -# Create example inputs. This is used in the export process to provide -# hints on the expected shape of the model input. -example_inputs = (torch.randint(0, 100, (1, model.config.block_size), dtype=torch.long), ) - -# Set up dynamic shape configuration. This allows the sizes of the input tensors -# to differ from the sizes of the tensors in `example_inputs` during runtime, as -# long as they adhere to the rules specified in the dynamic shape configuration. -# Here we set the range of 0th model input's 1st dimension as -# [0, model.config.block_size]. -# See https://pytorch.org/executorch/main/concepts#dynamic-shapes -# for details about creating dynamic shapes. -dynamic_shape = ( - {1: torch.export.Dim("token_dim", max=model.config.block_size)}, -) - -# Trace the model, converting it to a portable intermediate representation. -# The torch.no_grad() call tells PyTorch to exclude training-specific logic. -with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad(): - m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module() - traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape) - -# Convert the model into a runnable ExecuTorch program. -edge_config = EdgeCompileConfig(_check_ir_validity=False) -edge_manager = to_edge(traced_model, compile_config=edge_config) -et_program = edge_manager.to_executorch() - -# Save the ExecuTorch program to a file. -with open("nanogpt.pte", "wb") as file: - file.write(et_program.buffer) -``` - -To export, run the script with `python export_nanogpt.py` (or python3, as appropriate for your environment). It will generate a `nanogpt.pte` file in the current directory. 
- -For more information, see [Exporting to ExecuTorch](https://pytorch.org/executorch/main/tutorials/export-to-executorch-tutorial) and -[torch.export](https://pytorch.org/docs/stable/export.html). - -### Step 2. Invoking the Runtime - -ExecuTorch provides a set of runtime APIs and types to load and run models. - -Create a file called main.cpp with the following contents: - -```cpp -// main.cpp - -#include - -#include "basic_sampler.h" -#include "basic_tokenizer.h" - -#include -#include -#include -#include -#include - -using executorch::aten::ScalarType; -using executorch::aten::Tensor; -using executorch::extension::from_blob; -using executorch::extension::Module; -using executorch::runtime::EValue; -using executorch::runtime::Result; -``` - -The model inputs and outputs take the form of tensors. A tensor can be thought of as an multi-dimensional array. -The ExecuTorch `EValue` class provides a wrapper around tensors and other ExecuTorch data types. - -Since the LLM generates one token at a time, the driver code needs to repeatedly invoke the model, building the -output token by token. Each generated token is passed as input for the next run. - -```cpp -// main.cpp - -// The value of the gpt2 `<|endoftext|>` token. -#define ENDOFTEXT_TOKEN 50256 - -std::string generate( - Module& llm_model, - std::string& prompt, - BasicTokenizer& tokenizer, - BasicSampler& sampler, - size_t max_input_length, - size_t max_output_length) { - // Convert the input text into a list of integers (tokens) that represents it, - // using the string-to-token mapping that the model was trained on. Each token - // is an integer that represents a word or part of a word. - std::vector input_tokens = tokenizer.encode(prompt); - std::vector output_tokens; - - for (auto i = 0u; i < max_output_length; i++) { - // Convert the input_tokens from a vector of int64_t to EValue. EValue is a - // unified data type in the ExecuTorch runtime. - auto inputs = from_blob( - input_tokens.data(), - {1, static_cast(input_tokens.size())}, - ScalarType::Long); - - // Run the model. It will return a tensor of logits (log-probabilities). - auto logits_evalue = llm_model.forward(inputs); - - // Convert the output logits from EValue to std::vector, which is what the - // sampler expects. - Tensor logits_tensor = logits_evalue.get()[0].toTensor(); - std::vector logits( - logits_tensor.data_ptr(), - logits_tensor.data_ptr() + logits_tensor.numel()); - - // Sample the next token from the logits. - int64_t next_token = sampler.sample(logits); - - // Break if we reached the end of the text. - if (next_token == ENDOFTEXT_TOKEN) { - break; - } - - // Add the next token to the output. - output_tokens.push_back(next_token); - - std::cout << tokenizer.decode({next_token}); - std::cout.flush(); - - // Update next input. - input_tokens.push_back(next_token); - if (input_tokens.size() > max_input_length) { - input_tokens.erase(input_tokens.begin()); - } - } - - std::cout << std::endl; - - // Convert the output tokens into a human-readable string. - std::string output_string = tokenizer.decode(output_tokens); - return output_string; -} -``` - -The `Module` class handles loading the .pte file and preparing for execution. - -The tokenizer is responsible for converting from a human-readable string representation of the prompt to the -numerical form expected by the model. To do this, the tokenzier associates short substrings with a given token ID. 
-The tokens can be thought of as representing words or parts of words, though, in-practice, they may be arbitrary -sequences of characters. - -The tokenizer loads the vocabulary from a file, which contains the mapping between each token ID and the text it -represents. Call `tokenizer.encode()` and `tokenizer.decode()` to convert between string and token representations. - -The sampler is responsible for selecting the next token, based on the logits, or log-probabilties, output by the -model. The LLM returns a logit value for each possible next token. The sampler chooses which token to use based -on some strategy. The simplest approach, used here, is to take the token with the highest logit value. - -Samplers may provide configurable options, such as configurable amount of randomness to the outputs selection, -penalties for repeated tokens, and biases to prioritize or de-prioritize specific tokens. - - -```cpp -// main.cpp - -int main() { - // Set up the prompt. This provides the seed text for the model to elaborate. - std::cout << "Enter model prompt: "; - std::string prompt; - std::getline(std::cin, prompt); - - // The tokenizer is used to convert between tokens (used by the model) and - // human-readable strings. - BasicTokenizer tokenizer("vocab.json"); - - // The sampler is used to sample the next token from the logits. - BasicSampler sampler = BasicSampler(); - - // Load the exported nanoGPT program, which was generated via the previous - // steps. - Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors); - - const auto max_input_tokens = 1024; - const auto max_output_tokens = 30; - std::cout << prompt; - generate( - model, prompt, tokenizer, sampler, max_input_tokens, max_output_tokens); -} -``` - -Finally, download the following files into the same directory as main.cpp: - -``` -curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_sampler.h -curl -O https://raw.githubusercontent.com/pytorch/executorch/main/examples/llm_manual/basic_tokenizer.h -``` - -To learn more, see the [Runtime APIs Tutorial](../extension-module.md). - -### Building and Running - -ExecuTorch uses the CMake build system. To compile and link against the ExecuTorch runtime, -include the ExecuTorch project via `add_directory` and link against `executorch` and additional -dependencies. - -Create a file named CMakeLists.txt with the following content: - -``` -# CMakeLists.txt - -cmake_minimum_required(VERSION 3.19) -project(nanogpt_runner) - -set(CMAKE_CXX_STANDARD 17) -set(CMAKE_CXX_STANDARD_REQUIRED True) - -# Set options for executorch build. -option(EXECUTORCH_ENABLE_LOGGING "" ON) -option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON) -option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON) -option(EXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR "" ON) -option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON) -option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON) - -# Include the executorch subdirectory. 
-add_subdirectory( - ${CMAKE_CURRENT_SOURCE_DIR}/third-party/executorch - ${CMAKE_BINARY_DIR}/executorch -) - -add_executable(nanogpt_runner main.cpp) -target_link_libraries( - nanogpt_runner - PRIVATE executorch - extension_module_static # Provides the Module class - extension_tensor # Provides the TensorPtr class - optimized_native_cpu_ops_lib # Provides baseline cross-platform - # kernels -) -``` - -At this point, the working directory should contain the following files: - -- CMakeLists.txt -- main.cpp -- basic_tokenizer.h -- basic_sampler.h -- export_nanogpt.py -- model.py -- vocab.json -- nanogpt.pte - -If all of these are present, you can now build and run: -```bash -(mkdir cmake-out && cd cmake-out && cmake ..) -cmake --build cmake-out -j10 -./cmake-out/nanogpt_runner -``` - -You should see the message: - -``` -Enter model prompt: -``` - -Type some seed text for the model and press enter. Here we use "Hello world!" as -an example prompt: - -``` -Enter model prompt: Hello world! -Hello world! - -I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in -``` - -At this point, it is likely to run very slowly. This is because ExecuTorch hasn't been told to optimize for -specific hardware (delegation), and because it is doing all of the calculations in 32-bit floating point (no quantization). - -## Delegation - -While ExecuTorch provides a portable, cross-platform implementation for all -operators, it also provides specialized backends for a number of different -targets. These include, but are not limited to, x86 and ARM CPU acceleration via -the XNNPACK backend, Apple acceleration via the Core ML backend and Metal -Performance Shader (MPS) backend, and GPU acceleration via the Vulkan backend. - -Because optimizations are specific to a given backend, each pte file is specific -to the backend(s) targeted at export. To support multiple devices, such as -XNNPACK acceleration for Android and Core ML for iOS, export a separate PTE file -for each backend. - -To delegate a model to a specific backend during export, ExecuTorch uses the -`to_edge_transform_and_lower()` function. This function takes the exported program -from `torch.export` and a backend-specific partitioner object. The partitioner -identifies parts of the computation graph that can be optimized by the target -backend. Within `to_edge_transform_and_lower()`, the exported program is -converted to an edge dialect program. The partitioner then delegates compatible -graph sections to the backend for acceleration and optimization. Any graph parts -not delegated are executed by ExecuTorch's default operator implementations. - -To delegate the exported model to a specific backend, we need to import its -partitioner as well as edge compile config from ExecuTorch codebase first, then -call `to_edge_transform_and_lower`. 
- -Here's an example of how to delegate nanoGPT to XNNPACK (if you're deploying to an Android phone for instance): - -```python -# export_nanogpt.py - -# Load partitioner for Xnnpack backend -from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner - -# Model to be delegated to specific backend should use specific edge compile config -from executorch.backends.xnnpack.utils.configs import get_xnnpack_edge_compile_config -from executorch.exir import EdgeCompileConfig, to_edge_transform_and_lower - -import torch -from torch.export import export -from torch.nn.attention import sdpa_kernel, SDPBackend -from torch.export import export_for_training - -from model import GPT - -# Load the nanoGPT model. -model = GPT.from_pretrained('gpt2') - -# Create example inputs. This is used in the export process to provide -# hints on the expected shape of the model input. -example_inputs = ( - torch.randint(0, 100, (1, model.config.block_size - 1), dtype=torch.long), - ) - -# Set up dynamic shape configuration. This allows the sizes of the input tensors -# to differ from the sizes of the tensors in `example_inputs` during runtime, as -# long as they adhere to the rules specified in the dynamic shape configuration. -# Here we set the range of 0th model input's 1st dimension as -# [0, model.config.block_size]. -# See https://pytorch.org/executorch/main/concepts.html#dynamic-shapes -# for details about creating dynamic shapes. -dynamic_shape = ( - {1: torch.export.Dim("token_dim", max=model.config.block_size - 1)}, -) - -# Trace the model, converting it to a portable intermediate representation. -# The torch.no_grad() call tells PyTorch to exclude training-specific logic. -with torch.nn.attention.sdpa_kernel([SDPBackend.MATH]), torch.no_grad(): - m = export_for_training(model, example_inputs, dynamic_shapes=dynamic_shape).module() - traced_model = export(m, example_inputs, dynamic_shapes=dynamic_shape) - -# Convert the model into a runnable ExecuTorch program. -# To be further lowered to Xnnpack backend, `traced_model` needs xnnpack-specific edge compile config -edge_config = get_xnnpack_edge_compile_config() -# Converted to edge program and then delegate exported model to Xnnpack backend -# by invoking `to` function with Xnnpack partitioner. -edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config) -et_program = edge_manager.to_executorch() - -# Save the Xnnpack-delegated ExecuTorch program to a file. -with open("nanogpt.pte", "wb") as file: - file.write(et_program.buffer) -``` - -Additionally, update CMakeLists.txt to build and link the XNNPACK backend to -ExecuTorch runner. - -``` -cmake_minimum_required(VERSION 3.19) -project(nanogpt_runner) - -set(CMAKE_CXX_STANDARD 17) -set(CMAKE_CXX_STANDARD_REQUIRED True) - -# Set options for executorch build. -option(EXECUTORCH_ENABLE_LOGGING "" ON) -option(EXECUTORCH_BUILD_EXTENSION_DATA_LOADER "" ON) -option(EXECUTORCH_BUILD_EXTENSION_MODULE "" ON) -option(EXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR "" ON) -option(EXECUTORCH_BUILD_EXTENSION_TENSOR "" ON) -option(EXECUTORCH_BUILD_KERNELS_OPTIMIZED "" ON) -option(EXECUTORCH_BUILD_XNNPACK "" ON) # Build with Xnnpack backend - -# Include the executorch subdirectory. 
-add_subdirectory( - ${CMAKE_CURRENT_SOURCE_DIR}/third-party/executorch - ${CMAKE_BINARY_DIR}/executorch -) - -add_executable(nanogpt_runner main.cpp) -target_link_libraries( - nanogpt_runner - PRIVATE executorch - extension_module_static # Provides the Module class - extension_tensor # Provides the TensorPtr class - optimized_native_cpu_ops_lib # Provides baseline cross-platform - # kernels - xnnpack_backend # Provides the XNNPACK CPU acceleration backend -) -``` - -Keep the rest of the code the same. For more details refer to [Exporting -to ExecuTorch](#step-1-exporting-to-executorch) and [Invoking the -Runtime](#step-2-invoking-the-runtime) for more details - -At this point, the working directory should contain the following files: - -- CMakeLists.txt -- main.cpp -- basic_tokenizer.h -- basic_sampler.h -- export_nanogpt.py -- model.py -- vocab.json - -If all of these are present, you can now export Xnnpack delegated pte model: -```bash -python export_nanogpt.py -``` - -It will generate `nanogpt.pte`, under the same working directory. - -Then we can build and run the model by: -```bash -(rm -rf cmake-out && mkdir cmake-out && cd cmake-out && cmake ..) -cmake --build cmake-out -j10 -./cmake-out/nanogpt_runner -``` - - -You should see the message: - -``` -Enter model prompt: -``` - -Type some seed text for the model and press enter. Here we use "Hello world!" as -an example prompt: - -``` -Enter model prompt: Hello world! -Hello world! - -I'm not sure if you've heard of the "Curse of the Dragon" or not, but it's a very popular game in -``` - -The delegated model should be noticeably faster compared to the non-delegated model. - -For more information regarding backend delegation, see the ExecuTorch guides -for the [XNNPACK Backend](../backends-xnnpack.md), [Core ML -Backend](../backends-coreml.md) and [Qualcomm AI Engine Direct Backend](../backends-qualcomm.md). - -## Quantization - -Quantization refers to a set of techniques for running calculations and storing tensors using lower precision types. -Compared to 32-bit floating point, using 8-bit integers can provide both a significant speedup and reduction in -memory usage. There are many approaches to quantizing a model, varying in amount of pre-processing required, data -types used, and impact on model accuracy and performance. - -Because compute and memory are highly constrained on mobile devices, some form of quantization is necessary to ship -large models on consumer electronics. In particular, large language models, such as Llama2, may require quantizing -model weights to 4 bits or less. - -Leveraging quantization requires transforming the model before export. PyTorch provides the pt2e (PyTorch 2 Export) -API for this purpose. This example targets CPU acceleration using the XNNPACK delegate. As such, it needs to use the - XNNPACK-specific quantizer. Targeting a different backend will require use of the corresponding quantizer. - -To use 8-bit integer dynamic quantization with the XNNPACK delegate, call `prepare_pt2e`, calibrate the model by -running with a representative input, and then call `convert_pt2e`. This updates the computational graph to use -quantized operators where available. 
- -```python -# export_nanogpt.py - -from executorch.backends.transforms.duplicate_dynamic_quant_chain import ( - DuplicateDynamicQuantChainPass, -) -from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import ( - get_symmetric_quantization_config, - XNNPACKQuantizer, -) -from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e -``` - -```python -# Use dynamic, per-channel quantization. -xnnpack_quant_config = get_symmetric_quantization_config( - is_per_channel=True, is_dynamic=True -) -xnnpack_quantizer = XNNPACKQuantizer() -xnnpack_quantizer.set_global(xnnpack_quant_config) - -m = export_for_training(model, example_inputs).module() - -# Annotate the model for quantization. This prepares the model for calibration. -m = prepare_pt2e(m, xnnpack_quantizer) - -# Calibrate the model using representative inputs. This allows the quantization -# logic to determine the expected range of values in each tensor. -m(*example_inputs) - -# Perform the actual quantization. -m = convert_pt2e(m, fold_quantize=False) -DuplicateDynamicQuantChainPass()(m) - -traced_model = export(m, example_inputs) -``` - -Additionally, add or update the `to_edge_transform_and_lower()` call to use `XnnpackPartitioner`. This -instructs ExecuTorch to optimize the model for CPU execution via the XNNPACK backend. - -```python -from executorch.backends.xnnpack.partition.xnnpack_partitioner import ( - XnnpackPartitioner, -) -``` - -```python -edge_config = get_xnnpack_edge_compile_config() -# Convert to edge dialect and lower to XNNPack. -edge_manager = to_edge_transform_and_lower(traced_model, partitioner = [XnnpackPartitioner()], compile_config = edge_config) -et_program = edge_manager.to_executorch() - -with open("nanogpt.pte", "wb") as file: - file.write(et_program.buffer) -``` - -Then run: -```bash -python export_nanogpt.py -./cmake-out/nanogpt_runner -``` - -For more information, see [Quantization in ExecuTorch](../quantization-overview.md). - -## Profiling and Debugging -After lowering a model by calling `to_edge_transform_and_lower()`, you may want to see what got delegated and what didn’t. ExecuTorch -provides utility methods to give insight on the delegation. You can use this information to gain visibility into -the underlying computation and diagnose potential performance issues. Model authors can use this information to -structure the model in a way that is compatible with the target backend. - -### Visualizing the Delegation - -The `get_delegation_info()` method provides a summary of what happened to the model after the `to_edge_transform_and_lower()` call: - -```python -from executorch.devtools.backend_debug import get_delegation_info -from tabulate import tabulate - -# ... 
After call to to_edge_transform_and_lower(), but before to_executorch() -graph_module = edge_manager.exported_program().graph_module -delegation_info = get_delegation_info(graph_module) -print(delegation_info.get_summary()) -df = delegation_info.get_operator_delegation_dataframe() -print(tabulate(df, headers="keys", tablefmt="fancy_grid")) -``` - -For nanoGPT targeting the XNNPACK backend, you might see the following (note that the numbers below are for illustration purposes only and actual values may vary): -``` -Total delegated subgraphs: 145 -Number of delegated nodes: 350 -Number of non-delegated nodes: 760 -``` - - -| | op_type | # in_delegated_graphs | # in_non_delegated_graphs | -|----|---------------------------------|------- |-----| -| 0 | aten__softmax_default | 12 | 0 | -| 1 | aten_add_tensor | 37 | 0 | -| 2 | aten_addmm_default | 48 | 0 | -| 3 | aten_any_dim | 0 | 12 | -| | ... | | | -| 25 | aten_view_copy_default | 96 | 122 | -| | ... | | | -| 30 | Total | 350 | 760 | - -From the table, the operator `aten_view_copy_default` appears 96 times in delegate graphs and 122 times in non-delegated graphs. -To see a more detailed view, use the `format_delegated_graph()` method to get a formatted str of printout of the whole graph or use `print_delegated_graph()` to print directly: - -```python -from executorch.exir.backend.utils import format_delegated_graph -graph_module = edge_manager.exported_program().graph_module -print(format_delegated_graph(graph_module)) -``` -This may generate a large amount of output for large models. Consider using "Control+F" or "Command+F" to locate the operator you’re interested in -(e.g. “aten_view_copy_default”). Observe which instances are not under lowered graphs. - -In the fragment of the output for nanoGPT below, observe that a transformer module has been delegated to XNNPACK while the where operator is not. 
- -``` -%aten_where_self_22 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.where.self](args = (%aten_logical_not_default_33, %scalar_tensor_23, %scalar_tensor_22), kwargs = {}) -%lowered_module_144 : [num_users=1] = get_attr[target=lowered_module_144] -backend_id: XnnpackBackend -lowered graph(): - %p_transformer_h_0_attn_c_attn_weight : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_weight] - %p_transformer_h_0_attn_c_attn_bias : [num_users=1] = placeholder[target=p_transformer_h_0_attn_c_attn_bias] - %getitem : [num_users=1] = placeholder[target=getitem] - %sym_size : [num_users=2] = placeholder[target=sym_size] - %aten_view_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%getitem, [%sym_size, 768]), kwargs = {}) - %aten_permute_copy_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.permute_copy.default](args = (%p_transformer_h_0_attn_c_attn_weight, [1, 0]), kwargs = {}) - %aten_addmm_default : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.addmm.default](args = (%p_transformer_h_0_attn_c_attn_bias, %aten_view_copy_default, %aten_permute_copy_default), kwargs = {}) - %aten_view_copy_default_1 : [num_users=1] = call_function[target=executorch.exir.dialects.edge._ops.aten.view_copy.default](args = (%aten_addmm_default, [1, %sym_size, 2304]), kwargs = {}) - return [aten_view_copy_default_1] -``` - -### Performance Analysis - -Through the ExecuTorch Developer Tools, users are able to profile model execution, giving timing information for each operator in the model. - -#### Prerequisites - -##### ETRecord generation (Optional) - -An ETRecord is an artifact generated at the time of export that contains model graphs and source-level metadata linking the ExecuTorch program to the original PyTorch model. You can view all profiling events without an ETRecord, though with an ETRecord, you will also be able to link each event to the types of operators being executed, module hierarchy, and stack traces of the original PyTorch source code. For more information, see [the ETRecord docs](../etrecord.rst). - - -In your export script, after calling `to_edge()` and `to_executorch()`, call `generate_etrecord()` with the `EdgeProgramManager` from `to_edge()` and the `ExecuTorchProgramManager` from `to_executorch()`. Make sure to copy the `EdgeProgramManager`, as the call to `to_edge_transform_and_lower()` mutates the graph in-place. - -``` -# export_nanogpt.py - -import copy -from executorch.devtools import generate_etrecord - -# Make the deep copy immediately after to to_edge() -edge_manager_copy = copy.deepcopy(edge_manager) - -# ... -# Generate ETRecord right after to_executorch() -etrecord_path = "etrecord.bin" -generate_etrecord(etrecord_path, edge_manager_copy, et_program) -``` - -Run the export script and the ETRecord will be generated as `etrecord.bin`. - -##### ETDump generation - -An ETDump is an artifact generated at runtime containing a trace of the model execution. For more information, see [the ETDump docs](../etdump.md). - -Include the ETDump header and namespace in your code. -```cpp -// main.cpp - -#include - -using executorch::etdump::ETDumpGen; -using torch::executor::etdump_result; -``` - -Create an Instance of the ETDumpGen class and pass it to the Module constructor. 
-```cpp -std::unique_ptr etdump_gen_ = std::make_unique(); -Module model("nanogpt.pte", Module::LoadMode::MmapUseMlockIgnoreErrors, std::move(etdump_gen_)); -``` - -After calling `generate()`, save the ETDump to a file. You can capture multiple -model runs in a single trace, if desired. -```cpp -ETDumpGen* etdump_gen = static_cast(model.event_tracer()); - -ET_LOG(Info, "ETDump size: %zu blocks", etdump_gen->get_num_blocks()); -etdump_result result = etdump_gen->get_etdump_data(); -if (result.buf != nullptr && result.size > 0) { - // On a device with a file system, users can just write it to a file. - FILE* f = fopen("etdump.etdp", "w+"); - fwrite((uint8_t*)result.buf, 1, result.size, f); - fclose(f); - free(result.buf); -} -``` - -Additionally, update CMakeLists.txt to build with Developer Tools and enable events to be traced and logged into ETDump: - -``` -option(EXECUTORCH_ENABLE_EVENT_TRACER "" ON) -option(EXECUTORCH_BUILD_DEVTOOLS "" ON) - -# ... - -target_link_libraries( - # ... omit existing ones - etdump) # Provides event tracing and logging - -target_compile_options(executorch PUBLIC -DET_EVENT_TRACER_ENABLED) -target_compile_options(portable_ops_lib PUBLIC -DET_EVENT_TRACER_ENABLED) -``` -Build and run the runner, you will see a file named “etdump.etdp” is generated. (Note that this time we build in release mode to get around a flatccrt build limitation.) -```bash -(rm -rf cmake-out && mkdir cmake-out && cd cmake-out && cmake -DCMAKE_BUILD_TYPE=Release ..) -cmake --build cmake-out -j10 -./cmake-out/nanogpt_runner -``` - -#### Analyze with Inspector APIs - -Once you’ve collected debug artifacts ETDump (and optionally an ETRecord), you can use the Inspector API to view performance information. - -```python -from executorch.devtools import Inspector - -inspector = Inspector(etdump_path="etdump.etdp") -# If you also generated an ETRecord, then pass that in as well: `inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")` - -with open("inspector_out.txt", "w") as file: - inspector.print_data_tabular(file) -``` -This prints the performance data in a tabular format in “inspector_out.txt”, with each row being a profiling event. Top rows look like this: -![](../_static/img/llm_manual_print_data_tabular.png) -View in full size - -To learn more about the Inspector and the rich functionality it provides, see the [Inspector API Reference](../model-inspector.rst). - -## Custom Kernels -With the ExecuTorch custom operator APIs, custom operator and kernel authors can easily bring in their kernel into PyTorch/ExecuTorch. - -There are three steps to use custom kernels in ExecuTorch: - -1. [Write the custom kernel](../kernel-library-custom-aten-kernel.md#c-api-for-custom-ops) using ExecuTorch types. -2. [Compile and link the custom kernel](../kernel-library-custom-aten-kernel.md#compile-and-link-the-custom-kernel) to both AOT Python environment as well as the runtime binary. -3. [Source-to-source transformation](../kernel-library-custom-aten-kernel.md#using-a-custom-operator-in-a-model) to swap an operator with a custom op. - -For more information, see [PyTorch Custom Operators](https://pytorch.org/tutorials/advanced/torch_script_custom_ops.html) and -and [ExecuTorch Kernel Registration](../kernel-library-custom-aten-kernel.md). +To follow this guide, you'll need to install ExecuTorch. Please see [Setting Up ExecuTorch](../getting-started.md#installation). -## How to Build Mobile Apps -See the instructions for building and running LLMs using ExecuTorch on iOS and Android. 
+## Next steps -* **[iOS ExecuTorch LLaMA Demo App](llama-demo-ios.md)** -* **[Android ExecuTorch LLaMA Demo App](llama-demo-android.md)** +- [Exporting popular LLMs out of the box](export-llm.md) +- [Exporting custom LLMs](export-custom-llm.md) +- [Running with C++](run-with-c-plus-plus.md) +- [Running on Android (XNNPack)](llama-demo-android.md) +- [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md) +- [Running on iOS](llama-demo-ios.md) diff --git a/docs/source/llm/run-with-c-plus-plus.md b/docs/source/llm/run-with-c-plus-plus.md new file mode 100644 index 00000000000..a8b348e209e --- /dev/null +++ b/docs/source/llm/run-with-c-plus-plus.md @@ -0,0 +1 @@ +# TODO(mengwei) \ No newline at end of file From 659db4f05bf1d020ec18d0109a8bd2c2f97ab9e5 Mon Sep 17 00:00:00 2001 From: Jack Zhang <32371937+jackzhxng@users.noreply.github.com> Date: Wed, 16 Jul 2025 16:41:09 -0700 Subject: [PATCH 2/4] Scott comments --- docs/source/llm/export-llm.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/docs/source/llm/export-llm.md b/docs/source/llm/export-llm.md index d8b69d51c40..9545ea87796 100644 --- a/docs/source/llm/export-llm.md +++ b/docs/source/llm/export-llm.md @@ -72,9 +72,9 @@ Quantization options are defined by [`QuantizationConfig`](https://github.com/py 1. TorchAO [`quantize_`](https://docs.pytorch.org/ao/stable/generated/torchao.quantization.quantize_.html) API 2. [pt2e quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) -### TorchAO +### TorchAO (XNNPACK) TorchAO quantizes at the source code level, swapping out Linear modules for QuantizedLinear modules. -This is the recommended quantization path for running on CPU. +**To quantize on XNNPACK backend, this is the quantization path to follow.** The quantization modes are defined [here](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L306). Common ones to use are: @@ -106,11 +106,13 @@ python -m extension.llm.export.export_llm \ --config path/to/config.yaml ``` -### pt2e +### pt2e (QNN, CoreML, and Vulkan) pt2e quantizes at the post-export graph level, swapping nodes and injecting quant/dequant nodes. -Used mainly for non-CPU backends (QNN, CoreML, Vulkan). +**To quantize on non-CPU backends (QNN, CoreML, Vulkan), this is the quantization path to follow.** Read more about pt2e [here](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html), and how ExecuTorch uses pt2e [here](https://github.com/pytorch/executorch/blob/main/docs/source/quantization-overview.md). +*CoreML and Vulkan support for export_llm is currently experimental and limited. To read more about QNN export, please read [Running on Android (Qualcomm)](build-run-llama3-qualcomm-ai-engine-direct-backend.md).* + ## Backend support Backend options are defined by [`BackendConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L434). Each backend has their own backend configuration options. 
Here is an example of lowering the LLM to XNNPACK for CPU acceleration: From 1a3e4221a568552c1d537d9caa0d822e2402668c Mon Sep 17 00:00:00 2001 From: Jack Zhang <32371937+jackzhxng@users.noreply.github.com> Date: Fri, 18 Jul 2025 17:18:15 -0700 Subject: [PATCH 3/4] Mergen comments --- docs/source/index.md | 2 +- docs/source/llm/export-llm.md | 15 +++++++++++++-- docs/source/llm/getting-started.md | 6 ++++-- 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/docs/source/index.md b/docs/source/index.md index c69ba8c172e..b354bd40bfe 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -88,7 +88,7 @@ ExecuTorch provides support for: - [Selective Build](kernel-library-selective-build) #### Working with LLMs - [Getting Started](llm/getting-started.md) -- [Exporting LLMs with export_llm](llm/export-llm.md) +- [Exporting LLMs](llm/export-llm.md) - [Exporting custom LLMs](llm/export-custom-llm.md) - [Running with C++](llm/run-with-c-plus-plus.md) - [Running on Android (XNNPack)](llm/llama-demo-android.md) diff --git a/docs/source/llm/export-llm.md b/docs/source/llm/export-llm.md index 9545ea87796..f952e88c7b0 100644 --- a/docs/source/llm/export-llm.md +++ b/docs/source/llm/export-llm.md @@ -1,4 +1,4 @@ -# Exporting popular LLMs out of the box +# Exporting LLMs Instead of needing to manually write code to call torch.export(), use ExecuTorch's assortment of lowering APIs, or even interact with TorchAO quantize_ APIs for quantization, we have provided an out of box experience which performantly exports a selection of supported models to ExecuTorch. @@ -42,6 +42,9 @@ We only require manually specifying a checkpoint path for the Llama model family For the other supported LLMs, the checkpoint will be downloaded from HuggingFace automatically, and the param files can be found in their respective directories under `executorch/examples/models`, for instance `executorch/examples/models/qwen3/config/0_6b_config.json`. +## Export settings +[ExportConfig](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py) contains settings for the exported `.pte`, such as `max_seq_length` (max length of the prompt) and `max_context_length` (max length of the model's memory/cache). + ## Adding optimizations `export_llm` performs a variety of optimizations to the model before export, during export, and during lowering. Quantization and delegation to accelerator backends are the main ones and will be covered in the next two sections. All other optimizations can be found under [`ModelConfig`](https://github.com/pytorch/executorch/blob/main/extension/llm/export/config/llm_config.py#L120). We will go ahead and add a few optimizations. @@ -81,6 +84,9 @@ Common ones to use are: - `8da4w`: short for int8 dynamic activation + int4 weight quantization. - `int8`: int8 weight-only quanziation. +Group size is specified with: +- `group_size`: 8, 32, 64, etc. + For Arm CPUs, there are also [low-bit kernels](https://pytorch.org/blog/hi-po-low-bit-operators/) for int8 dynamic activation + int[1-8] weight quantization. Note that this should not be used alongside XNNPACK, and experimentally we have found that the performance could sometimes even be better for the equivalent `8da4w`. 
To use these, specify `qmode` to either: - `torchao:8da(\d+)w`: int8 dynamic activation + int[1-8] weights, for example `torchao:8da5w` - `torchao:fpa(\d+)w`: int[1-8] weight only, for example `torchao:fpa4w` @@ -156,7 +162,10 @@ python -m extension.llm.export.export_llm \ --config path/to/config.yaml ``` -In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated. Here is an example: +In the logs, there will be a table of all ops in the graph, and which ones were and were not delegated. + +Here is an example: +
``` Total delegated subgraphs: 368 Number of delegated nodes: 2588 @@ -251,6 +260,8 @@ Number of non-delegated nodes: 2513 │ 42 │ Total │ 2588 │ 2513 │ ╘════╧═══════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛ ``` +
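For reference, the XNNPACK lowering that produces a delegation summary like the one above is configured through the `backend` section of the YAML file passed to `export_llm`. A minimal sketch is shown below; the field names are assumptions based on `BackendConfig` in `llm_config.py`, so verify them against the current schema before relying on them.

```yaml
# path/to/config.yaml (backend section only, illustrative)
backend:
  xnnpack:
    enabled: True        # partition and delegate supported ops to XNNPACK
    extended_ops: True   # opt into a wider set of ops that can be lowered
```

Exporting with `python -m extension.llm.export.export_llm --config path/to/config.yaml`, as shown earlier, should then print a delegation table like the one above in the logs.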
+
To do further performance analysis, you may opt to use [ExecuTorch's Inspector APIs](https://docs.pytorch.org/executorch/stable/llm/getting-started.html#performance-analysis) to do things such as trace individual operator performance back to source code, view memory planning, and debug intermediate activations. To generate the ETRecord necessary for the Inspector APIs to link back to source code, you can use:
diff --git a/docs/source/llm/getting-started.md b/docs/source/llm/getting-started.md
index b4925a492cc..c75d5bbc3f5 100644
--- a/docs/source/llm/getting-started.md
+++ b/docs/source/llm/getting-started.md
@@ -1,4 +1,4 @@
-# Deploying LLMs to Executorch
+# Deploying LLMs to ExecuTorch
 
 ExecuTorch is designed to support all types of machine learning models, and LLMs are no exception.
 In this section we demonstrate how to leverage ExecuTorch to performantly run state of the art
@@ -16,7 +16,9 @@ To follow this guide, you'll need to install ExecuTorch. Please see [Setting Up
 
 ## Next steps
 
+Deploying LLMs to ExecuTorch can be boiled down to a two-step process: (1) exporting the LLM to a `.pte` file and (2) running the `.pte` file using our C++ APIs or Swift/Java bindings.
+
-- [Exporting popular LLMs out of the box](export-llm.md)
+- [Exporting LLMs](export-llm.md)
 - [Exporting custom LLMs](export-custom-llm.md)
 - [Running with C++](run-with-c-plus-plus.md)
 - [Running on Android (XNNPack)](llama-demo-android.md)
From 5fafb3a21f45b1ed87e2908c6f7a6b8044aede4a Mon Sep 17 00:00:00 2001
From: Jack <32371937+jackzhxng@users.noreply.github.com>
Date: Mon, 21 Jul 2025 15:55:11 -0700
Subject: [PATCH 4/4] Anthony comment

Co-authored-by: Anthony Shoumikhin <shoumikhin@meta.com>
---
 docs/source/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/index.md b/docs/source/index.md
index b354bd40bfe..d0b7adbaab1 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -93,7 +93,7 @@ ExecuTorch provides support for:
 - [Running with C++](llm/run-with-c-plus-plus.md)
 - [Running on Android (XNNPack)](llm/llama-demo-android.md)
 - [Running on Android (QNN)](llm/build-run-llama3-qualcomm-ai-engine-direct-backend.md)
-- [Running on iOS](llm/llama-demo-ios.md)
+- [Running on iOS](llm/run-on-ios.md)
 #### Backend Development
 - [Delegates Integration](backend-delegates-integration)
 - [XNNPACK Reference](backend-delegates-xnnpack-reference)
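To make the two-step flow described in the updated `getting-started.md` concrete (first export the LLM to a `.pte` file, then run that file), a minimal end-to-end sketch is shown below. The runner binary location and flag names are assumptions based on the ExecuTorch Llama example runner and may differ in your checkout or build configuration.

```bash
# Step 1: export the model to a .pte file using export_llm and a YAML config.
python -m extension.llm.export.export_llm --config path/to/config.yaml

# Step 2: run the exported .pte with a C++ runner.
# The binary path and flags below are illustrative; adjust to your build.
cmake-out/examples/models/llama/llama_main \
  --model_path=path/to/model.pte \
  --tokenizer_path=path/to/tokenizer.model \
  --prompt="Hello, world!"
```

On-device deployment follows the same shape, with step 2 handled by the Android and iOS demo apps linked in the Next steps list.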