
[Realtime API] gpt-realtime-translate emits transcript.delta containing only U+FFFD for multi-byte UTF-8 input #43806

Description

@AyumuKataoka

Summary

When using the Azure OpenAI Realtime API with the gpt-realtime-translate deployment, individual
session.input_transcript.delta and session.output_transcript.delta events frequently contain
only U+FFFD (REPLACEMENT CHARACTER) in the delta field when the source language is Japanese
(this likely affects any language whose text is encoded as multi-byte UTF-8 sequences).

The final transcript delivered via the *.completed / *.done events is correct, so this is
purely a streaming-delta issue, but it makes realtime subtitle UI unusable for multi-byte
languages.

Environment

  • Service: Azure OpenAI on Azure AI Foundry (endpoint host: *.services.ai.azure.com, eastus2)
  • Model deployment: gpt-realtime-translate (model: gpt-realtime-translate-2026-05-06)
  • Client: macOS, Swift / SwiftUI, URLSessionWebSocketTask (.string messages)
  • Source language: Japanese (ja), Target language: en / vi (both reproduce)

Reproduction

  1. Open a WebSocket session with a gpt-realtime-translate deployment.
  2. Send session.update with:
    {
      "type": "session.update",
      "session": {
        "audio": {
          "input": { "transcription": { "model": "gpt-realtime-whisper" } },
          "output": { "language": "en" }
        }
      }
    }
  3. Stream Japanese speech (PCM16 24kHz, base64) via input_audio_buffer.append.
  4. Observe session.input_transcript.delta and session.output_transcript.delta events.
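
For reference, step 2 can be sketched in Swift roughly as follows. This is a minimal sketch, not the reporter's actual client code: makeSessionUpdate and its targetLanguage parameter are hypothetical names, and only the JSON shape is taken from the payload above.

```swift
import Foundation

// Builds the session.update message from step 2 as a JSON string.
// `makeSessionUpdate` and `targetLanguage` are illustrative names only.
func makeSessionUpdate(targetLanguage: String) throws -> String {
    let input: [String: Any] = ["transcription": ["model": "gpt-realtime-whisper"]]
    let output: [String: Any] = ["language": targetLanguage]
    let audio: [String: Any] = ["input": input, "output": output]
    let payload: [String: Any] = [
        "type": "session.update",
        "session": ["audio": audio]
    ]
    // sortedKeys keeps the output stable, which makes captured logs easier to diff.
    let data = try JSONSerialization.data(withJSONObject: payload, options: [.sortedKeys])
    return String(decoding: data, as: UTF8.self)
}
```

The resulting string can then be passed to URLSessionWebSocketTask.send(_:completionHandler:) as a .string message on an already opened task.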

Expected

Each delta field should contain only whole, correctly decoded code points (ideally complete
grapheme clusters), so that concatenating consecutive deltas reproduces the same string
returned by the eventual *.completed.transcript.

Actual

Many delta events contain exactly "delta": "�" (a single U+FFFD Replacement Character).
Concatenating consecutive deltas produces strings with multiple replacement characters, and
the original text cannot be recovered on the client side.
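
One plausible mechanism, which is an assumption and not something confirmed for the service, is that a byte stream is cut in the middle of a multi-byte UTF-8 sequence and each piece is decoded on its own. The sketch below reproduces that effect locally; splitAndDecode is a hypothetical helper, not part of any API.

```swift
// Decodes the UTF-8 bytes of `text` as two independent chunks,
// cut at byte offset `offset`. Hypothetical helper for illustration.
func splitAndDecode(_ text: String, atByte offset: Int) -> (String, String) {
    let bytes = Array(text.utf8)
    // String(decoding:as:) repairs ill-formed input by substituting U+FFFD.
    let head = String(decoding: bytes[..<offset], as: UTF8.self)
    let tail = String(decoding: bytes[offset...], as: UTF8.self)
    return (head, tail)
}

// "日本" is six bytes: E6 97 A5 (日) followed by E6 9C AC (本).
// Cutting after byte 2 lands inside the first character.
let (head, tail) = splitAndDecode("日本", atByte: 2)
let replacement: Unicode.Scalar = "\u{FFFD}"
print(head.unicodeScalars.contains(replacement))  // true
print(tail.unicodeScalars.contains(replacement))  // true
print(head + tail == "日本")                       // false
```

Once each piece has been decoded separately, the original bytes are gone, which matches the observation that the client cannot repair the text afterwards.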

Sample raw WebSocket payloads captured at the moment URLSession delivers .string to the
application (113 bytes total, delta field is a single U+FFFD):

{ ... ,"item_id":"...yc0oOJQYT","delta":"�","elapsed_ms":11000, ... }
{ ... ,"item_id":"...8UwU24ILJ","delta":"�","elapsed_ms":11000, ... }

The replacement character is already present in the JSON returned by the server; the client
has performed no transformation at the point of detection.
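
A check along these lines can be run on each raw message before any other processing, to confirm that U+FFFD is already in the payload. It is a minimal sketch: deltaHasReplacementCharacter is a hypothetical helper, and it assumes only that the event is a JSON object with a string "delta" field, as in the samples above.

```swift
import Foundation

// Returns true when the "delta" field of a raw event payload contains U+FFFD.
// Payloads that are not JSON objects, or have no string "delta", return false.
func deltaHasReplacementCharacter(_ rawMessage: String) -> Bool {
    guard
        let object = try? JSONSerialization.jsonObject(with: Data(rawMessage.utf8)),
        let event = object as? [String: Any],
        let delta = event["delta"] as? String
    else { return false }
    let replacement: Unicode.Scalar = "\u{FFFD}"
    return delta.unicodeScalars.contains(replacement)
}
```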

Impact

  • Realtime subtitle UI displays replacement characters (�) during speech.
  • Streaming display is effectively unusable for Japanese / Chinese / Korean / any multi-byte
    language.
  • Final transcripts (conversation.item.input_audio_transcription.completed,
    session.output_transcript.done) are correct.

Workaround (current)

  • Discard delta events that contain U+FFFD and rely on *.completed / *.done for display.
  • This defeats the purpose of streaming.
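
The workaround can be sketched as a small buffer that skips damaged deltas and is overwritten by the final transcript. SubtitleLine and its members are hypothetical names; the sketch assumes the caller has already extracted the delta and final transcript strings from the events.

```swift
// Accumulates streaming text for one subtitle line.
struct SubtitleLine {
    private(set) var text = ""
    private(set) var droppedDeltas = 0

    // Appends a delta unless it contains U+FFFD, in which case it is skipped.
    mutating func apply(delta: String) {
        let replacement: Unicode.Scalar = "\u{FFFD}"
        if delta.unicodeScalars.contains(replacement) {
            droppedDeltas += 1
            return
        }
        text += delta
    }

    // Replaces whatever was streamed with the transcript from *.completed / *.done.
    mutating func finish(with finalTranscript: String) {
        text = finalTranscript
    }
}
```

Skipped deltas leave gaps in the streamed text until finish(with:) runs, which is why this approach loses most of the benefit of streaming.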

Request

Please consider one of the following:

  1. Ensure server-emitted delta strings are split only at UTF-8 code point (preferably grapheme
    cluster) boundaries.
  2. Provide a session.update option to receive delta as base64-encoded raw bytes (e.g.
    delta_b64) so clients can concatenate and decode them safely with an incremental
    UTF-8 decoder.
  3. Alternatively, document that partial transcript deltas are not guaranteed to be well-formed
    for multi-byte languages on Azure and recommend *.completed / *.done for display.
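
To illustrate option 2, a client-side incremental decoder could look roughly like the sketch below. The delta_b64 field does not exist today; it is the hypothetical option proposed above, and IncrementalUTF8Decoder is an illustrative name. The decoder holds back a trailing incomplete sequence until the next chunk arrives and leaves any other ill-formed bytes to String(decoding:as:), which substitutes U+FFFD.

```swift
import Foundation

struct IncrementalUTF8Decoder {
    // Bytes of a trailing, not yet complete UTF-8 sequence.
    private var pending: [UInt8] = []

    // Decodes one base64 chunk (the hypothetical delta_b64 value).
    mutating func decode(base64 chunk: String) -> String {
        guard let data = Data(base64Encoded: chunk) else { return "" }
        return decode([UInt8](data))
    }

    // Returns the text that is complete so far; keeps the rest for later.
    mutating func decode(_ chunk: [UInt8]) -> String {
        pending.append(contentsOf: chunk)
        let cut = Self.completePrefixLength(pending)
        let text = String(decoding: pending[..<cut], as: UTF8.self)
        pending.removeFirst(cut)
        return text
    }

    // Length of the longest prefix that does not end in a truncated sequence.
    private static func completePrefixLength(_ bytes: [UInt8]) -> Int {
        var i = bytes.count
        var continuation = 0
        // Step back over at most three trailing continuation bytes (10xxxxxx).
        while i > 0, continuation < 3, (bytes[i - 1] & 0xC0) == 0x80 {
            i -= 1
            continuation += 1
        }
        guard i > 0 else { return bytes.count }
        let expected: Int
        switch bytes[i - 1] {
        case 0xC0...0xDF: expected = 2
        case 0xE0...0xEF: expected = 3
        case 0xF0...0xF7: expected = 4
        default: return bytes.count  // ASCII or stray byte: nothing to hold back
        }
        // Hold back the lead byte and its continuations if the sequence is short.
        return continuation + 1 < expected ? i - 1 : bytes.count
    }
}
```

Feeding the bytes of "日本語" in chunks cut at arbitrary offsets yields the original string once concatenated, because no chunk is decoded until its last sequence is complete.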
