Session activity cancellation causes permanent TMPRL1100 Non-Determinism Error on replay when DataConverter/codec fails

## Expected Behavior

When a `DataConverter` (e.g., one that calls a remote codec server) returns an error during `encodeArgs` in `ExecuteActivity`, the workflow task should fail and be retried without triggering a non-determinism error (NDE). The workflow should remain recoverable.

## Actual Behavior

Two bugs combine to permanently brick the workflow:

**Bug 1: `ExecuteActivity` panics when `encodeArgs` returns an error** (`internal/workflow.go:943`):
```go
input, err := encodeArgs(dataConverter, args)
if err != nil {
    panic(err) // Should fail the workflow task gracefully, not panic
}
```

**Bug 2: `handleCancelInitiatedEvent` rejects `Initiated` state on replay** (`internal/internal_command_state_machine.go:566-574`):
```go
func (d *commandStateMachineBase) handleCancelInitiatedEvent() {
    switch d.state {
    case commandStateCancellationCommandSent, commandStateCanceledAfterInitiated:
        // No state change
    default:
        d.failStateTransition(eventCancelInitiated) // Panics for Initiated state
    }
}
```

When a workflow uses `CreateSession` + `defer CompleteSession`, the panic from Bug 1 causes an `UNHANDLED_FAILURE`. On replay, the SDK encounters `ActivityTaskCancelRequested` for the `internalSessionCreationActivity` before the workflow code has re-executed `CompleteSession()` to transition the state machine to `CancellationCommandSent`. (This is the session creation activity scheduled by `CreateSession`; it is typically the first activity in the workflow, event ID 7 in our case.) The state machine is still in `Initiated`, so Bug 2 panics with TMPRL1100. Every subsequent replay hits the same panic, so the workflow is permanently unrecoverable.

**Note:** Even if Bug 1 is not fixed (i.e., a failed `encodeArgs` call continues to panic), **Bug 2 alone** can still brick a workflow whenever a cancel event reaches a session activity that is still in `Initiated` state during replay. The state machine fix is required regardless of whether the panic-on-encode behavior is changed.

**Decoded failure from the bricked workflow:**
```
message: "failed to encode payloads: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
stack_trace: "...workflowEnvironmentInterceptor.ExecuteActivity at workflow.go:943..."
```

**Decoded NDE from replay:**
```
[TMPRL1100] invalid state transition: attempt to handleCancelInitiatedEvent,
CommandType: Activity, ID: 7, state=Initiated, isDone()=false,
history=[Created handleCommandSent CommandSent handleInitiatedEvent Initiated handleCancelInitiatedEvent]
```

Activity ID 7 here is the `internalSessionCreationActivity` scheduled on the `default__internal_session_creation` task queue by `workflow.CreateSession()`. The `ActivityTaskCancelRequested` for this activity is generated when `CompleteSession()` cancels the session context. On replay, this cancel event arrives while the activity state machine is still in `Initiated`, causing the TMPRL1100 panic.
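The ordering problem can be illustrated with a toy model. The sketch below is not the SDK's state machine: the `machine` type, the `replay` helper, and the step names are simplified stand-ins chosen for illustration, and only the transitions relevant to this report are modeled. It feeds the step sequence from the NDE log into the model, then the sequence in which the workflow code issues the cancel before the event is processed.

```go
package main

import "fmt"

// state is a simplified stand-in for the SDK's command states.
type state string

const (
    created                state = "Created"
    commandSent            state = "CommandSent"
    initiated              state = "Initiated"
    canceledAfterInitiated state = "CanceledAfterInitiated"
)

// machine is a toy model of one activity command; it is not the SDK type.
type machine struct{ st state }

// apply advances the toy machine by one step and reports invalid transitions.
func (m *machine) apply(step string) error {
    switch {
    case step == "handleCommandSent" && m.st == created:
        m.st = commandSent
    case step == "handleInitiatedEvent" && m.st == commandSent:
        m.st = initiated
    case step == "cancel" && m.st == initiated:
        m.st = canceledAfterInitiated
    case step == "handleCancelInitiatedEvent" && m.st == canceledAfterInitiated:
        // Accepted: the workflow code already requested the cancel.
    default:
        return fmt.Errorf("invalid state transition: attempt to %s, state=%s", step, m.st)
    }
    return nil
}

// replay feeds the steps to a fresh machine and stops at the first error.
func replay(steps ...string) error {
    m := &machine{st: created}
    for _, s := range steps {
        if err := m.apply(s); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    // Ordering from the NDE log: the cancel event arrives while the
    // machine is still in Initiated.
    fmt.Println(replay("handleCommandSent", "handleInitiatedEvent", "handleCancelInitiatedEvent"))
    // Ordering the handler accepts: the cancel is issued first.
    fmt.Println(replay("handleCommandSent", "handleInitiatedEvent", "cancel", "handleCancelInitiatedEvent"))
}
```

In this model the first sequence fails at the last step with the machine in `Initiated`, matching the shape of the logged history, while the second sequence completes without error.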

## Steps to Reproduce the Problem

1. Create a workflow that uses `workflow.CreateSession()` with `defer workflow.CompleteSession(sessionCtx)`
2. Configure a `DataConverter` that can fail (see reproducer below)
3. When the workflow calls `workflow.ExecuteActivity()`, `DataConverter.ToPayloads()` returns an error
4. The SDK panics at `workflow.go:943`, recorded as `WORKFLOW_TASK_FAILED` with `WORKFLOW_WORKER_UNHANDLED_FAILURE`
5. Temporal retries the workflow task → replay hits `handleCancelInitiatedEvent` in `Initiated` state → TMPRL1100
6. Workflow is permanently bricked -- every replay attempt hits the same TMPRL1100 panic

**Minimal workflow code:**
```go
func MyWorkflow(ctx workflow.Context) error {
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 10 * time.Minute,
    })

    sessionCtx, err := workflow.CreateSession(ctx, &workflow.SessionOptions{
        CreationTimeout:  2 * time.Minute,
        ExecutionTimeout: 8 * time.Hour,
    })
    if err != nil {
        return err
    }
    defer workflow.CompleteSession(sessionCtx)

    // If the DataConverter fails when this executes,
    // encodeArgs panics → UNHANDLED_FAILURE → NDE on replay → permanently bricked
    var result MyResult
    err = workflow.ExecuteActivity(sessionCtx, MyActivity, myInput).Get(sessionCtx, &result)
    return err
}
```

**Deterministic reproducer -- custom DataConverter that fails on demand:**

To reproduce without a remote codec server, wrap the default `DataConverter` so that `ToPayloads` returns an error after a configurable number of successful calls. Set `failAfter` to a value that allows early calls to succeed (e.g., session creation, SideEffect encoding) but fails when `ExecuteActivity` encodes its arguments. For example, `failAfter=5` lets the first few encode calls through, then fails on the activity argument encoding. No custom retry policy is needed -- the default workflow task retry behavior is sufficient to trigger the replay NDE.

```go
import (
    "fmt"
    "sync/atomic"

    commonpb "go.temporal.io/api/common/v1"
    "go.temporal.io/sdk/converter"
)

// failingDataConverter wraps a DataConverter and fails ToPayloads from the
// failAfter-th call onward, so the first failAfter-1 calls succeed.
type failingDataConverter struct {
    converter.DataConverter
    callCount atomic.Int32 // number of ToPayloads calls seen so far
    failAfter int32        // 1-based index of the first ToPayloads call that fails
}

// ToPayloads counts every call and fails once the count reaches failAfter.
func (f *failingDataConverter) ToPayloads(values ...interface{}) (*commonpb.Payloads, error) {
    if f.callCount.Add(1) >= f.failAfter {
        return nil, fmt.Errorf("simulated codec server timeout: DeadlineExceeded")
    }
    return f.DataConverter.ToPayloads(values...)
}

// ToPayload does not advance the counter; it only fails after ToPayloads
// has already reached failAfter.
func (f *failingDataConverter) ToPayload(value interface{}) (*commonpb.Payload, error) {
    if f.callCount.Load() >= f.failAfter {
        return nil, fmt.Errorf("simulated codec server timeout: DeadlineExceeded")
    }
    return f.DataConverter.ToPayload(value)
}
```

Use this as the `DataConverter` in `client.Options`:
```go
c, err := client.Dial(client.Options{
    DataConverter: &failingDataConverter{
        DataConverter: converter.GetDefaultDataConverter(),
        failAfter:     5, // allow session creation to succeed, fail on activity encode
    },
})
```
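When picking `failAfter`, it may help to check which call is the first to fail. The following standard-library sketch mirrors the counting logic of `failingDataConverter.ToPayloads` without depending on the SDK. The `encoder` interface, `okEncoder`, `failingEncoder`, and `firstFailure` are hypothetical names introduced only for this illustration.

```go
package main

import (
    "errors"
    "fmt"
    "sync/atomic"
)

// encoder is a hypothetical stand-in for the ToPayloads half of a DataConverter.
type encoder interface {
    Encode(v any) (string, error)
}

// okEncoder always succeeds; it plays the role of the default converter.
type okEncoder struct{}

func (okEncoder) Encode(v any) (string, error) { return fmt.Sprint(v), nil }

// failingEncoder uses the same counter comparison as failingDataConverter.ToPayloads.
type failingEncoder struct {
    inner     encoder
    callCount atomic.Int32
    failAfter int32
}

func (f *failingEncoder) Encode(v any) (string, error) {
    if f.callCount.Add(1) >= f.failAfter {
        return "", errors.New("simulated codec server timeout: DeadlineExceeded")
    }
    return f.inner.Encode(v)
}

// firstFailure returns the 1-based index of the first failing call,
// or 0 if no call fails within maxCalls.
func firstFailure(failAfter int32, maxCalls int) int {
    f := &failingEncoder{inner: okEncoder{}, failAfter: failAfter}
    for i := 1; i <= maxCalls; i++ {
        if _, err := f.Encode(i); err != nil {
            return i
        }
    }
    return 0
}

func main() {
    // With failAfter=5, calls 1 through 4 succeed and call 5 is the first to fail.
    fmt.Println(firstFailure(5, 10))
}
```

How many encode calls happen before `ExecuteActivity` depends on the workflow, so the right `failAfter` for a given reproducer is best found by experiment.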

**Key observation:** When sessions are disabled (`CreateSession` / `CompleteSession` removed), the same `DataConverter` failure triggers Bug 1 (the `encodeArgs` panic) but the workflow task retries and recovers successfully because there is no session activity cancel event to trigger Bug 2 on replay. With sessions enabled, the workflow is permanently unrecoverable. This has been reproduced consistently across multiple workflows and namespaces on Temporal Cloud.

**Prior art -- same class of state machine bug, partially fixed multiple times:**

- [PR #323](https://github.com/temporalio/sdk-go/pull/323) (Dec 2020): Child workflow cancellation event ordering -- fixed by accepting both event orderings
- [Issue #343](https://github.com/temporalio/sdk-go/issues/343) (Jan 2021): Another child workflow cancel state transition variant -- Maxim Fateev noted "there is a state transition which is still missing" from PR #323. Fixed by Spencer Judge within a day.
- [PR #625](https://github.com/temporalio/sdk-go/pull/625) / v1.11.1 (Nov 2021): "Fix state machine bug that occurs when actions happen after activity cancellation"
- [PR #726](https://github.com/temporalio/sdk-go/pull/726) / v1.13.1 (Feb 2022): "Remove pending activity cancellations when activity completion occurs"
- [Issue #1227](https://github.com/temporalio/sdk-go/issues/1227) (Sep 2023): Sessions + worker versioning edge case -- acknowledged as a bug by Quinn Klassen

Each fix addressed one cancellation ordering variant, but `handleCancelInitiatedEvent` + `Initiated` state for activities/sessions was not covered.

**Suggested fix:**

1. **`workflow.go:943`**: Change `panic(err)` to return the error, failing the workflow task gracefully instead of panicking
2. **`internal_command_state_machine.go:handleCancelInitiatedEvent`**: Add `commandStateInitiated` to the accepted states, consistent with the approach in PR #323 (accept the unexpected but valid event ordering rather than panicking)

Note: Even if (1) is not changed, fix (2) is still required -- any code path that produces an `ActivityTaskCancelRequested` for a session activity during replay can trigger this permanent bricking via the state machine panic.
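As a rough illustration of fix (2), the sketch below models the handler as a pure function over a reduced set of states. It is not SDK code: `commandState`, the state constants, and the `tolerateInitiated` flag are hypothetical, and the flag exists only to compare the current behavior with the proposed one side by side.

```go
package main

import "fmt"

// commandState is a reduced, hypothetical set of command states.
type commandState int

const (
    stateInitiated commandState = iota
    stateCancellationCommandSent
    stateCanceledAfterInitiated
    stateCompleted
)

// handleCancelInitiated models the handler's accept/reject decision.
// tolerateInitiated=false mirrors the current behavior; true mirrors the
// proposed change of also accepting the Initiated state.
func handleCancelInitiated(s commandState, tolerateInitiated bool) error {
    switch s {
    case stateCancellationCommandSent, stateCanceledAfterInitiated:
        return nil // no state change
    case stateInitiated:
        if tolerateInitiated {
            return nil // proposed: accept this event ordering
        }
    }
    return fmt.Errorf("invalid state transition: handleCancelInitiatedEvent in state %d", s)
}

func main() {
    fmt.Println(handleCancelInitiated(stateInitiated, false)) // current: rejected
    fmt.Println(handleCancelInitiated(stateInitiated, true))  // proposed: accepted
    fmt.Println(handleCancelInitiated(stateCompleted, true))  // still rejected
}
```

States outside the accepted set, such as `stateCompleted` here, are still rejected, so the change would widen the handler only for the one ordering described in this issue.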

## Specifications

- Version: v1.36.0 (confirmed still present in v1.40.0 as of Feb 26, 2026 -- `panic(err)` at `workflow.go:943` and `handleCancelInitiatedEvent` at `internal_command_state_machine.go:566-574` are unchanged)
- Platform: Linux (Kubernetes), Temporal Cloud, remote codec server via gRPC for payload encryption

