Expected Behavior
When a DataConverter (e.g., hitting the remote codec server) returns any error during encodeArgs in ExecuteActivity, the workflow task should fail and be retryable without triggering an NDE. The workflow should remain recoverable.
Actual Behavior
Two bugs combine to permanently brick the workflow:
Bug 1: encodeArgs panics instead of returning an error (internal/workflow.go:943):
input, err := encodeArgs(dataConverter, args)
if err != nil {
	panic(err) // Should fail the workflow task gracefully, not panic
}
Bug 2: handleCancelInitiatedEvent rejects Initiated state on replay (internal/internal_command_state_machine.go:566-574):
func (d *commandStateMachineBase) handleCancelInitiatedEvent() {
	switch d.state {
	case commandStateCancellationCommandSent, commandStateCanceledAfterInitiated:
		// No state change
	default:
		d.failStateTransition(eventCancelInitiated) // Panics for Initiated state
	}
}
When a workflow uses CreateSession + defer CompleteSession, the panic from Bug 1 causes an UNHANDLED_FAILURE. On replay, the SDK encounters ActivityTaskCancelRequested for the internalSessionCreationActivity (the session creation activity scheduled by CreateSession, typically the first activity in the workflow, e.g. event ID 7 in our case) before the workflow code has re-executed CompleteSession() to transition the state machine to CancellationCommandSent. The state machine is still in Initiated, so Bug 2 panics with TMPRL1100. Every subsequent replay hits the same panic -- the workflow is permanently unrecoverable.
Note: Even if Bug 1 is not fixed (i.e., encodeArgs continues to panic), Bug 2 alone can still brick a workflow whenever any cancel event hits a session activity in Initiated state during replay. The state machine fix is required regardless of whether the panic-on-encode behavior is changed.
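To make the transition in question concrete, here is a minimal sketch of the state check, written as a standalone toy model rather than SDK code. The type toyMachine, the tolerateInitiated flag, and the choice to return an error instead of panicking are all assumptions made for illustration; only the three state names correspond to states mentioned in this report.

```go
package main

import "fmt"

// commandState is a simplified stand-in for the SDK's internal command states.
type commandState int

const (
	stateInitiated commandState = iota
	stateCancellationCommandSent
	stateCanceledAfterInitiated
)

// toyMachine models only the transition discussed in this issue; it is not SDK code.
type toyMachine struct {
	state commandState
	// tolerateInitiated selects the proposed behavior (accept the Initiated state).
	tolerateInitiated bool
}

// handleCancelInitiated returns an error where the real state machine would panic.
func (m *toyMachine) handleCancelInitiated() error {
	switch m.state {
	case stateCancellationCommandSent, stateCanceledAfterInitiated:
		return nil // no state change
	case stateInitiated:
		if m.tolerateInitiated {
			return nil // proposed: accept the unexpected but valid ordering
		}
	}
	return fmt.Errorf("invalid state transition: handleCancelInitiatedEvent in state %d", m.state)
}

func main() {
	current := &toyMachine{state: stateInitiated}
	proposed := &toyMachine{state: stateInitiated, tolerateInitiated: true}
	fmt.Println("current: ", current.handleCancelInitiated())
	fmt.Println("proposed:", proposed.handleCancelInitiated())
}
```

In this model the current behavior rejects a cancel event that arrives in the Initiated state, while the proposed variant accepts it. Whether the real fix should also change state at that point is left open here.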
Decoded failure from the bricked workflow:
message: "failed to encode payloads: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
stack_trace: "...workflowEnvironmentInterceptor.ExecuteActivity at workflow.go:943..."
Decoded NDE from replay:
[TMPRL1100] invalid state transition: attempt to handleCancelInitiatedEvent,
CommandType: Activity, ID: 7, state=Initiated, isDone()=false,
history=[Created handleCommandSent CommandSent handleInitiatedEvent Initiated handleCancelInitiatedEvent]
Activity ID 7 here is the internalSessionCreationActivity scheduled on the default__internal_session_creation task queue by workflow.CreateSession(). The ActivityTaskCancelRequested for this activity is generated when CompleteSession() cancels the session context. On replay, this cancel event arrives while the activity state machine is still in Initiated, causing the TMPRL1100 panic.
Steps to Reproduce the Problem
1. Create a workflow that uses workflow.CreateSession() with defer workflow.CompleteSession(sessionCtx)
2. Configure a DataConverter that can fail (see reproducer below)
3. When the workflow calls workflow.ExecuteActivity(), the DataConverter.ToPayloads() returns an error
4. The SDK panics at workflow.go:943, recorded as WORKFLOW_TASK_FAILED with WORKFLOW_WORKER_UNHANDLED_FAILURE
5. Temporal retries the workflow task → replay hits handleCancelInitiatedEvent in Initiated state → TMPRL1100
6. Workflow is permanently bricked -- every replay attempt hits the same TMPRL1100 panic
Minimal workflow code:
func MyWorkflow(ctx workflow.Context) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
	})
	sessionCtx, err := workflow.CreateSession(ctx, &workflow.SessionOptions{
		CreationTimeout:  2 * time.Minute,
		ExecutionTimeout: 8 * time.Hour,
	})
	if err != nil {
		return err
	}
	defer workflow.CompleteSession(sessionCtx)
	// If the DataConverter fails when this executes,
	// encodeArgs panics → UNHANDLED_FAILURE → NDE on replay → permanently bricked
	var result MyResult
	err = workflow.ExecuteActivity(sessionCtx, MyActivity, myInput).Get(sessionCtx, &result)
	return err
}
Deterministic reproducer -- custom DataConverter that fails on demand:
To reproduce without a remote codec server, wrap the default DataConverter so that ToPayloads returns an error after a configurable number of successful calls. Set failAfter to a value that allows early calls to succeed (e.g., session creation, SideEffect encoding) but fails when ExecuteActivity encodes its arguments. For example, failAfter=5 lets the first few encode calls through, then fails on the activity argument encoding. No custom retry policy is needed -- the default workflow task retry behavior is sufficient to trigger the replay NDE.
import (
	"fmt"
	"sync/atomic"

	commonpb "go.temporal.io/api/common/v1"
	"go.temporal.io/sdk/converter"
)

// failingDataConverter wraps a DataConverter and fails ToPayloads once
// callCount reaches failAfter (so the first failAfter-1 calls succeed).
type failingDataConverter struct {
	converter.DataConverter
	callCount atomic.Int32
	failAfter int32
}

func (f *failingDataConverter) ToPayloads(values ...interface{}) (*commonpb.Payloads, error) {
	if f.callCount.Add(1) >= f.failAfter {
		return nil, fmt.Errorf("simulated codec server timeout: DeadlineExceeded")
	}
	return f.DataConverter.ToPayloads(values...)
}

// ToPayload does not increment the counter; it only starts failing after
// ToPayloads calls have pushed callCount to the threshold.
func (f *failingDataConverter) ToPayload(value interface{}) (*commonpb.Payload, error) {
	if f.callCount.Load() >= f.failAfter {
		return nil, fmt.Errorf("simulated codec server timeout: DeadlineExceeded")
	}
	return f.DataConverter.ToPayload(value)
}
Use this as the DataConverter in client.Options:
c, err := client.Dial(client.Options{
	DataConverter: &failingDataConverter{
		DataConverter: converter.GetDefaultDataConverter(),
		failAfter:     5, // allow session creation to succeed, fail on activity encode
	},
})
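When picking a failAfter value, it may help to see exactly which ToPayloads call is the first to fail. The sketch below copies only the counting logic of the wrapper into a standalone program with no SDK dependency; failAfterCounter and firstFailingCall are hypothetical names introduced for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// failAfterCounter mirrors only the counting logic of failingDataConverter.ToPayloads.
type failAfterCounter struct {
	callCount atomic.Int32
	failAfter int32
}

// call increments the counter and returns an error once the threshold is reached.
func (f *failAfterCounter) call() error {
	if f.callCount.Add(1) >= f.failAfter {
		return errors.New("simulated codec server timeout: DeadlineExceeded")
	}
	return nil
}

// firstFailingCall returns the 1-based index of the first call that errors,
// or 0 if no call fails within limit attempts.
func firstFailingCall(failAfter int32, limit int) int {
	c := &failAfterCounter{failAfter: failAfter}
	for i := 1; i <= limit; i++ {
		if c.call() != nil {
			return i
		}
	}
	return 0
}

func main() {
	fmt.Println(firstFailingCall(5, 10)) // prints 5
}
```

With failAfter=5 the first four calls succeed and the fifth fails, so the value has to be tuned to the number of encode calls the workflow makes before the ExecuteActivity of interest.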
Key observation: When sessions are disabled (CreateSession / CompleteSession removed), the same DataConverter failure triggers Bug 1 (the encodeArgs panic) but the workflow task retries and recovers successfully because there is no session activity cancel event to trigger Bug 2 on replay. With sessions enabled, the workflow is permanently unrecoverable. This has been reproduced consistently across multiple workflows and namespaces on Temporal Cloud.
Prior art -- same class of state machine bug, partially fixed multiple times:
PR #323 (Dec 2020): Child workflow cancellation event ordering -- fixed by accepting both event orderings
Issue #1227 (Sep 2023): Sessions + worker versioning edge case -- acknowledged as a bug by Quinn Klassen
Each fix addressed one cancellation ordering variant, but handleCancelInitiatedEvent + Initiated state for activities/sessions was not covered.
Suggested fix:
1. workflow.go:943: Change panic(err) to return the error, failing the workflow task gracefully instead of panicking
2. internal_command_state_machine.go:handleCancelInitiatedEvent: Add commandStateInitiated to the accepted states, consistent with the approach in PR #323 (Child workflow cancellation unusual event ordering bugfix): accept the unexpected but valid event ordering rather than panicking
Note: Even if (1) is not changed, fix (2) is still required -- any code path that produces an ActivityTaskCancelRequested for a session activity during replay can trigger this permanent bricking via the state machine panic.
Specifications
Version: v1.36.0 (confirmed still present in v1.40.0 as of Feb 26, 2026 -- panic(err) at workflow.go:943 and handleCancelInitiatedEvent at internal_command_state_machine.go:566-574 are unchanged)
Platform: Linux (Kubernetes), Temporal Cloud, remote codec server via gRPC for payload encryption