Session activity cancellation causes permanent TMPRL1100 Non-Determinism Error on replay when DataConverter/codec fails #2206

@d2army

Expected Behavior

When a DataConverter (e.g., hitting the remote codec server) returns any error during encodeArgs in ExecuteActivity, the workflow task should fail and be retryable without triggering an NDE. The workflow should remain recoverable.

Actual Behavior

Two bugs combine to permanently brick the workflow:

Bug 1: encodeArgs panics instead of returning an error (internal/workflow.go:943):

input, err := encodeArgs(dataConverter, args)
if err != nil {
    panic(err) // Should fail the workflow task gracefully, not panic
}

Bug 2: handleCancelInitiatedEvent rejects Initiated state on replay (internal/internal_command_state_machine.go:566-574):

func (d *commandStateMachineBase) handleCancelInitiatedEvent() {
    switch d.state {
    case commandStateCancellationCommandSent, commandStateCanceledAfterInitiated:
        // No state change
    default:
        d.failStateTransition(eventCancelInitiated) // Panics for Initiated state
    }
}

When a workflow uses CreateSession + defer CompleteSession, the panic from Bug 1 causes an UNHANDLED_FAILURE. On replay, the SDK encounters ActivityTaskCancelRequested for the internalSessionCreationActivity (the session creation activity scheduled by CreateSession, typically the first activity in the workflow, e.g. event ID 7 in our case) before the workflow code has re-executed CompleteSession() to transition the state machine to CancellationCommandSent. The state machine is still in Initiated, so Bug 2 panics with TMPRL1100. Every subsequent replay hits the same panic -- the workflow is permanently unrecoverable.

Note: Even if Bug 1 is not fixed (i.e., encodeArgs continues to panic), Bug 2 alone can still brick a workflow whenever any cancel event hits a session activity in Initiated state during replay. The state machine fix is required regardless of whether the panic-on-encode behavior is changed.

Decoded failure from the bricked workflow:

message: "failed to encode payloads: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
stack_trace: "...workflowEnvironmentInterceptor.ExecuteActivity at workflow.go:943..."

Decoded NDE from replay:

[TMPRL1100] invalid state transition: attempt to handleCancelInitiatedEvent,
CommandType: Activity, ID: 7, state=Initiated, isDone()=false,
history=[Created handleCommandSent CommandSent handleInitiatedEvent Initiated handleCancelInitiatedEvent]

Activity ID 7 here is the internalSessionCreationActivity scheduled on the default__internal_session_creation task queue by workflow.CreateSession(). The ActivityTaskCancelRequested for this activity is generated when CompleteSession() cancels the session context. On replay, this cancel event arrives while the activity state machine is still in Initiated, causing the TMPRL1100 panic.

Steps to Reproduce the Problem

  1. Create a workflow that uses workflow.CreateSession() with defer workflow.CompleteSession(sessionCtx)
  2. Configure a DataConverter that can fail (see reproducer below)
  3. When the workflow calls workflow.ExecuteActivity(), the DataConverter.ToPayloads() returns an error
  4. The SDK panics at workflow.go:943, recorded as WORKFLOW_TASK_FAILED with WORKFLOW_WORKER_UNHANDLED_FAILURE
  5. Temporal retries the workflow task → replay hits handleCancelInitiatedEvent in Initiated state → TMPRL1100
  6. Workflow is permanently bricked -- every replay attempt hits the same TMPRL1100 panic

Minimal workflow code:

func MyWorkflow(ctx workflow.Context) error {
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Minute,
    })

    sessionCtx, err := workflow.CreateSession(ctx, &workflow.SessionOptions{
        CreationTimeout:  2 * time.Minute,
        ExecutionTimeout: 8 * time.Hour,
    })
    if err != nil {
        return err
    }
    defer workflow.CompleteSession(sessionCtx)

    // If the DataConverter fails when this executes,
    // encodeArgs panics → UNHANDLED_FAILURE → NDE on replay → permanently bricked
    var result MyResult
    err = workflow.ExecuteActivity(sessionCtx, MyActivity, myInput).Get(sessionCtx, &result)
    return err
}

Deterministic reproducer -- custom DataConverter that fails on demand:

To reproduce without a remote codec server, wrap the default DataConverter so that ToPayloads starts returning an error once a configurable call count is reached. Set failAfter to a value that allows early calls to succeed (e.g., session creation, SideEffect encoding) but fails when ExecuteActivity encodes its arguments. For example, with the code below failAfter=5 lets the first four ToPayloads calls through, then fails the fifth and every later call, which in our case is the activity argument encoding. No custom retry policy is needed -- the default workflow task retry behavior is sufficient to trigger the replay NDE.

import (
    "fmt"
    "sync/atomic"

    commonpb "go.temporal.io/api/common/v1"
    "go.temporal.io/sdk/converter"
)

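// failingDataConverter wraps a DataConverter and fails every ToPayloads call
// from the failAfter-th call onward (i.e., after failAfter-1 successful calls).
// ToPayload does not advance the counter; it only fails once the threshold has been reached.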
type failingDataConverter struct {
    converter.DataConverter
    callCount atomic.Int32
    failAfter int32
}

func (f *failingDataConverter) ToPayloads(values ...interface{}) (*commonpb.Payloads, error) {
    if f.callCount.Add(1) >= f.failAfter {
        return nil, fmt.Errorf("simulated codec server timeout: DeadlineExceeded")
    }
    return f.DataConverter.ToPayloads(values...)
}

func (f *failingDataConverter) ToPayload(value interface{}) (*commonpb.Payload, error) {
    if f.callCount.Load() >= f.failAfter {
        return nil, fmt.Errorf("simulated codec server timeout: DeadlineExceeded")
    }
    return f.DataConverter.ToPayload(value)
}

Use this as the DataConverter in client.Options:

c, err := client.Dial(client.Options{
    DataConverter: &failingDataConverter{
        DataConverter: converter.GetDefaultDataConverter(),
        failAfter:     5, // allow session creation to succeed, fail on activity encode
    },
})
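
To pick a failAfter value, it can help to check exactly which call starts failing. The sketch below isolates only the counter logic from ToPayloads above, using the standard library; `callGate` and `allow` are hypothetical names for this illustration and are not part of the reproducer or the SDK.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// callGate models only the counter check used by failingDataConverter.ToPayloads.
type callGate struct {
	callCount atomic.Int32
	failAfter int32
}

// allow reports whether the current call would be forwarded to the wrapped
// converter. It mirrors the reproducer's condition: the call fails when the
// incremented counter is >= failAfter.
func (g *callGate) allow() bool {
	return g.callCount.Add(1) < g.failAfter
}

func main() {
	g := &callGate{failAfter: 5}
	for i := 1; i <= 6; i++ {
		// Calls 1 through 4 are allowed; calls 5 and 6 are rejected.
		fmt.Printf("call %d allowed=%v\n", i, g.allow())
	}
}
```

How many ToPayloads calls happen before the activity arguments are encoded depends on the workflow, so the right threshold may differ from 5 in other setups.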

Key observation: When sessions are disabled (CreateSession / CompleteSession removed), the same DataConverter failure triggers Bug 1 (the encodeArgs panic) but the workflow task retries and recovers successfully because there is no session activity cancel event to trigger Bug 2 on replay. With sessions enabled, the workflow is permanently unrecoverable. This has been reproduced consistently across multiple workflows and namespaces on Temporal Cloud.

Prior art -- same class of state machine bug, partially fixed multiple times:

  • PR #323 (Dec 2020): Child workflow cancellation event ordering -- fixed by accepting both event orderings
  • Issue #343 (Jan 2021): Another child workflow cancel state transition variant -- Maxim Fateev noted "there is a state transition which is still missing" from PR #323 (Child workflow cancellation unusual event ordering bugfix). Fixed by Spencer Judge within a day.
  • PR #625 / v1.11.1 (Nov 2021): "Fix state machine bug that occurs when actions happen after activity cancellation"
  • PR #726 / v1.13.1 (Feb 2022): "Remove pending activity cancellations when activity completion occurs"
  • Issue #1227 (Sep 2023): Sessions + worker versioning edge case -- acknowledged as a bug by Quinn Klassen

Each fix addressed one cancellation ordering variant, but handleCancelInitiatedEvent + Initiated state for activities/sessions was not covered.

Suggested fix:

  1. workflow.go:943: Change panic(err) to return the error, failing the workflow task gracefully instead of panicking
  2. internal_command_state_machine.go:handleCancelInitiatedEvent: Add commandStateInitiated to the accepted states, consistent with the approach in PR #323 (accept the unexpected but valid event ordering rather than panicking)

Note: Even if (1) is not changed, fix (2) is still required -- any code path that produces an ActivityTaskCancelRequested for a session activity during replay can trigger this permanent bricking via the state machine panic.
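
The effect of fix (2) can be sketched with a toy model of the state check. This is a simplified stand-in written for this issue, not the SDK's commandStateMachineBase: the `machine` type, the `acceptInitiated` flag, and the lowercase state names are assumptions, and the model returns an error where the SDK would panic with TMPRL1100.

```go
package main

import "fmt"

type commandState int

const (
	stateCreated commandState = iota
	stateCommandSent
	stateInitiated
	stateCancellationCommandSent
	stateCanceledAfterInitiated
)

// machine is a toy model of an activity command state machine.
type machine struct {
	state           commandState
	acceptInitiated bool // true models suggested fix (2)
}

// handleCancelInitiated mirrors the shape of handleCancelInitiatedEvent.
// It returns an error in the cases where the SDK would call failStateTransition.
func (m *machine) handleCancelInitiated() error {
	switch m.state {
	case stateCancellationCommandSent, stateCanceledAfterInitiated:
		return nil // No state change
	case stateInitiated:
		if m.acceptInitiated {
			return nil // Accept the unexpected but valid event ordering
		}
	}
	return fmt.Errorf("invalid state transition: handleCancelInitiatedEvent in state %d", m.state)
}

func main() {
	current := &machine{state: stateInitiated}
	fixed := &machine{state: stateInitiated, acceptInitiated: true}
	fmt.Println("current behavior rejects:", current.handleCancelInitiated() != nil)
	fmt.Println("with fix (2) rejects:", fixed.handleCancelInitiated() != nil)
}
```

The model leaves the state untouched when it accepts the event. Whether the real state machine should also move to a different state at that point, so that later command handling stays consistent, is something the maintainers would need to confirm.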

Specifications

  • Version: v1.36.0 (confirmed still present in v1.40.0 as of Feb 26, 2026 -- panic(err) at workflow.go:943 and handleCancelInitiatedEvent at internal_command_state_machine.go:566-574 are unchanged)
  • Platform: Linux (Kubernetes), Temporal Cloud, remote codec server via gRPC for payload encryption
