Heterogeneously run the LLaMA model on both the QNN and XNNPACK backends. #13629
-
I’m planning to deploy the quantized LLaMA 3.2-3B model on QNN and run some of its linear layers on XNNPACK. Would this be possible?
-
@yujiaoliang For QNN specifically, you can instruct the QNN partitioner to skip specific node IDs or operators, which allows those nodes to fall back to XNNPACK. See the QNN partitioner args here - https://www.internalfb.com/code/fbsource/[3369a2d3a668]/fbcode/executorch/backends/qualcomm/partition/qnn_partitioner.py?lines=135. You can then pass both the QnnPartitioner and the XnnpackPartitioner to to_edge_transform_and_lower; the second partitioner acts as a fallback:

```python
to_edge_transform_and_lower(
    ep,
    partitioner=[qnn_partitioner, xnnpack_partitioner],
)
```

You can also provide a custom partitioner for advanced use cases, but it will require a bit of coding. There is an example in https://docs.pytorch.org/executorch/main/compiler-delegate-and-partitioner.html#common-questions under "5. Can we delegate to multiple backends?".
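For reference, here is a minimal end-to-end sketch of that flow. It assumes the QnnPartitioner exposes skip-set arguments (e.g. `skip_node_op_set`, as referenced in qnn_partitioner.py), and that `model`, `example_inputs`, and `compiler_specs` for your LLaMA export are already prepared; exact import paths and argument names may differ between ExecuTorch versions, so double-check against the file linked above.

```python
import torch

from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Assumed to be defined elsewhere: the LLaMA nn.Module, its example inputs,
# and the QNN compile specs (e.g. built via generate_qnn_executorch_compiler_spec).
ep = torch.export.export(model, example_inputs)

# Tell the QNN partitioner which ops to skip so they can fall back to XNNPACK.
# skip_node_op_set / skip_node_id_set are the skip arguments referenced above;
# verify the exact parameter names in your version of qnn_partitioner.py.
qnn_partitioner = QnnPartitioner(
    compiler_specs,
    skip_node_op_set={"aten.linear.default"},
)
xnnpack_partitioner = XnnpackPartitioner()

# Partitioners are applied in order: nodes QNN skips are offered to XNNPACK,
# and anything neither claims stays on the portable CPU ops.
edge = to_edge_transform_and_lower(
    ep,
    partitioner=[qnn_partitioner, xnnpack_partitioner],
)

exec_prog = edge.to_executorch()
with open("llama_qnn_xnnpack.pte", "wb") as f:
    f.write(exec_prog.buffer)
```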
Which executor runner are you using? Some of them may not be linked against the QNN backend, or the other way around. This executor runner should link the QNN backend: https://github.com/pytorch/executorch/blob/main/examples/qualcomm/executor_runner/qnn_executor_runner.cpp - I'm not sure whether it is also linked with the XNNPACK backend.
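To see which backend libraries your runner actually needs to link, you can inspect what the lowered program delegates before building it. A small sketch, assuming ExecuTorch's delegation-debug helper lives at this path (it may have moved between releases) and reusing the `edge` program from the export sketch above:

```python
# Print which backends the lowered program delegates to (e.g. QnnBackend,
# XnnpackBackend); whatever appears here must be linked into the runner binary.
from executorch.devtools.backend_debug import get_delegation_info

info = get_delegation_info(edge.exported_program().graph_module)
print(info.get_summary())
```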