
QNN runtime fails to load all context binaries when model contains >50 context binaries #14985

@goka-wu

Description

I am currently deploying the Llama-3.2-1B-Instruct model on a QCS8550. I customized the QNN partitioner to keep Linear operations on the CPU while delegating all non-Linear operations to the QNN backend (HtpV73). This partitioning produced 66 QNN backend subgraphs, and consequently 66 context binaries serialized into the resulting *.pte file.

However, upon execution on the QCS8550, the process failed. The error log indicates an issue during context loading.
[screenshot: error log showing the failure during context loading]

I have attached the detailed logs for reference. Please review them.
err_20251009.log

The source code is from the main branch, at the following commit:

commit cf6e895c53bd1052f1266821a76bd7c5a85ace52 (HEAD -> dev)
Author: Šimon Strýček <[email protected]>
Date:   Tue Sep 16 11:33:01 2025 +0200

    NXP backend: Relocation of remove_io_quant_ops_pass.py (#14202)

    ### Summary
    Relocate `remove_io_quant_ops_pass.py` to `nxp/edge_passes`.

    ### Test plan
    Should be covered by already existing unit tests.

    Co-authored-by: Roman Janik <[email protected]>

Thank you for your time!

Analysis

(1) It appears that QnnExecuTorchBackend first loads and manages all 66 context binaries upfront, then executes the specific graph stored in a given context binary as needed, and finally destroys all context binaries and other resources together.
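To make this concrete, here is a rough sketch of the lifecycle I am describing (illustrative code only, with hypothetical names; this is not the actual ExecuTorch source):

#include <vector>

// Illustrative sketch: every context binary is initialized at method-load
// time, executed on demand, and all handles are destroyed together at
// method unload.
struct BackendSketch {
  void* init(const void* context_binary) {
    // Load one context binary; the handle is just the blob here.
    return const_cast<void*>(context_binary);
  }
  void execute(void* /*handle*/) { /* run the graph inside the binary */ }
  void destroy(void* /*handle*/) { /* free the loaded binary */ }
};

void method_lifecycle(BackendSketch& b, const std::vector<const void*>& blobs) {
  std::vector<void*> handles;
  for (const void* blob : blobs) {  // all 66 context binaries loaded upfront
    handles.push_back(b.init(blob));
  }
  for (void* handle : handles) {  // graphs executed as needed
    b.execute(handle);
  }
  for (void* handle : handles) {  // everything destroyed together at the end
    b.destroy(handle);
  }
}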

(2) I believe the 66 context binaries are themselves valid, because I successfully executed all of them sequentially (load one context binary → execute the model graph → destroy the context binary). The main modifications for this test are as follows:

Result<DelegateHandle*> QnnExecuTorchBackend::init(
    BackendInitContext& context,
    FreeableBuffer* processed,
    ArrayRef<CompileSpec> compile_specs) const {
  // Convert CompileSpec to QNN ExecuTorch options.
  for (auto& compile_spec : compile_specs) {
    if (std::strcmp(compile_spec.key, QNN_COMPILE_SPEC) == 0) {
      qnn_executorch_options_ =
          GetQnnExecuTorchOptions(compile_spec.value.buffer);
    } else {
      QNN_EXECUTORCH_LOG_WARN("unknown argument: %s", compile_spec.key);
    }
  }

  // init() is now a no-op: the processed buffer itself is returned as the
  // delegate handle, deferring all QNN work to execute().
  ET_LOG(Info, "QnnExecuTorchBackend::init() is a dummy function.");
  return processed;
}

Error QnnExecuTorchBackend::execute(
    BackendExecutionContext& context,
    DelegateHandle* handle,
    Span<EValue*> args) const {
  FreeableBuffer* processed = (FreeableBuffer*)(handle);
  QnnExecuTorchContextBinary qnn_context_blob;
  auto [status, signature, ctx_size, ctx_bin] =
      QnnContextCustomProtocol().DeserializeContextCustomBuffer(
          const_cast<void*>(processed->data()));
  if (status == Error::Ok) {
    QNN_EXECUTORCH_LOG_INFO(
        "Deserializing processed data using QnnContextCustomProtocol");
    // After this stage, qnn_context_blob.nbytes & qnn_context_blob.buffer will
    // only store qnn_context_binary.
    qnn_context_blob.nbytes = ctx_size;
    qnn_context_blob.buffer = ctx_bin;
    // Dump each deserialized context binary to disk for offline verification
    // (backend_cnt_ is a counter added for this test).
    std::string file_name =
        "contexts/context_" + std::to_string(backend_cnt_) + ".txt";
    write_file(file_name.c_str(), ctx_bin, ctx_size);
  } else {
    // This buffer will be verified again in QnnBackendCache.
    QNN_EXECUTORCH_LOG_INFO("Deserializing processed data using Dlc");
    qnn_context_blob.buffer = const_cast<void*>(processed->data());
    qnn_context_blob.nbytes = processed->size();
  }


  // Create QnnManager
  MemoryAllocator* runtime_allocator = context.get_temp_allocator();
  QnnManager* qnn_manager = runtime_allocator->allocateInstance<QnnManager>();
  if (qnn_manager == nullptr) {
    return Error::MemoryAllocationFailed;
  }
  // NOTE: Since we use placement new and since this type is not trivially
  // destructible, we must call the destructor manually in destroy().
  new (qnn_manager) QnnManager(qnn_executorch_options_, qnn_context_blob);
  // TODO: this is a temporal solution for multi-graph support, will be
  //       removed once framework starts to accept runtime configuration
  // ---
  // check if current context binary has already been initialized
  // return cached one for reducing memory footprint
  ET_CHECK_OR_RETURN_ERROR(
      qnn_manager->Init() == Error::Ok,
      Internal,
      "Fail to initialize Qnn Manager");
    
  // ... (unchanged code that prepares input_tensor_structs and
  // output_tensor_structs omitted) ...

  ET_CHECK_OR_RETURN_ERROR(
      qnn_manager->Execute(
          method_name,
          input_tensor_structs,
          output_tensor_structs,
          context.event_tracer()) == Error::Ok,
      Internal,
      "Fail to execute graph");
  ET_CHECK_OR_RETURN_ERROR(
      qnn_manager->ProfileExecuteData(method_name, context.event_tracer()) ==
          Error::Ok,
      Internal,
      "Fail to profile graph");
  
  // Tear down this context binary immediately after execution.
  qnn_manager->Destroy();
  return Error::Ok;
}
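For completeness, write_file above is a small local helper I added to dump each context binary to disk; it is not part of the ExecuTorch tree. A minimal sketch of it, assuming stdio is available on the target:

#include <cstddef>
#include <cstdio>

// Minimal helper used in execute() above to dump a buffer to disk for
// offline inspection; error handling kept to a bare minimum.
static void write_file(const char* path, const void* data, size_t nbytes) {
  FILE* fp = std::fopen(path, "wb");
  if (fp != nullptr) {
    std::fwrite(data, /*size=*/1, /*count=*/nbytes, fp);
    std::fclose(fp);
  }
}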

cc @cccclai @winskuo-quic @shewu-quic @haowhsu-quic @DannyYuyang-quic @cbilgin

Metadata

    Labels

    module: qnn (Issues related to Qualcomm's QNN delegate and code under backends/qualcomm)
    partner: qualcomm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm)
