Summary
When using the Azure OpenAI Realtime API with the gpt-realtime-translate deployment, individual
session.input_transcript.delta and session.output_transcript.delta events frequently contain
only U+FFFD (Replacement Character) in the delta field when the source language is Japanese
(this likely affects any language whose characters are multi-byte in UTF-8).
The final transcript delivered via the *.completed / *.done events is correct, so this is
purely a streaming-delta issue, but it makes a realtime subtitle UI unusable for such
languages.
Environment
- Service: Azure OpenAI on Azure AI Foundry (endpoint host: *.services.ai.azure.com, eastus2)
- Model deployment: gpt-realtime-translate (model: gpt-realtime-translate-2026-05-06)
- Client: macOS, Swift / SwiftUI, URLSessionWebSocketTask (.string messages)
- Source language: Japanese (ja), Target language: en / vi (both reproduce)
Reproduction
- Open a WebSocket session with a gpt-realtime-translate deployment.
- Send session.update with:
    {
      "type": "session.update",
      "session": {
        "audio": {
          "input": { "transcription": { "model": "gpt-realtime-whisper" } },
          "output": { "language": "en" }
        }
      }
    }
- Stream Japanese speech (PCM16 24kHz, base64) via input_audio_buffer.append.
- Observe session.input_transcript.delta and session.output_transcript.delta events.
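For anyone reproducing this outside Swift, the messages in the steps above can be sketched in Python as follows. This only builds and inspects JSON payloads and does not open a connection. The helper names are hypothetical, and the "audio" field name on input_audio_buffer.append is an assumption based on the usual Realtime API message shape, not something confirmed for this deployment.

```python
import base64
import json


def session_update(target_language: str = "en") -> str:
    """Build the session.update message from the reproduction steps."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "audio": {
                "input": {"transcription": {"model": "gpt-realtime-whisper"}},
                "output": {"language": target_language},
            }
        },
    })


def audio_append(pcm16_chunk: bytes) -> str:
    """Wrap one chunk of PCM16 24 kHz audio for input_audio_buffer.append.

    Assumption: the base64 payload is carried in an "audio" field.
    """
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })


def is_replacement_only_delta(raw_message: str) -> bool:
    """Return True if a raw transcript delta event carries only U+FFFD."""
    event = json.loads(raw_message)
    is_delta = event.get("type", "").endswith("_transcript.delta")
    return is_delta and event.get("delta") == "\ufffd"
```

Running is_replacement_only_delta over every received text frame gives a quick count of affected events per session.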
Expected
Each delta field should contain either a complete grapheme cluster or at least valid UTF-8,
so that concatenating consecutive deltas reproduces the same string returned by the eventual
*.completed.transcript.
Actual
Many delta events contain exactly "delta": "�" (a single Replacement Character).
Concatenating consecutive deltas produces strings with multiple � characters that cannot
be recovered on the client side.
Sample raw WebSocket payloads captured at the moment URLSession delivers .string to the
application (113 bytes total, delta field is a single U+FFFD):
{ ... ,"item_id":"...yc0oOJQYT","delta":"�","elapsed_ms":11000, ... }
{ ... ,"item_id":"...8UwU24ILJ","delta":"�","elapsed_ms":11000, ... }
The replacement character is already present in the JSON returned by the server; the client
has performed no transformation at the point of detection.
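The server internals are not visible from the client, so the following is only a guess at the mechanism: if a UTF-8 byte stream is cut at arbitrary byte offsets and each piece is decoded on its own with replacement enabled, the output looks exactly like the deltas above. The Python sketch below contrasts that with a splitter that only cuts at character boundaries. Both function names are hypothetical, and the second one assumes its input is valid UTF-8.

```python
def decode_chunks_independently(data: bytes, size: int) -> list[str]:
    """Suspected failure mode: decode fixed-size byte chunks one by one."""
    return [
        data[i:i + size].decode("utf-8", errors="replace")
        for i in range(0, len(data), size)
    ]


def split_on_char_boundaries(data: bytes, size: int) -> list[str]:
    """Cut valid UTF-8 into chunks of at most `size` bytes without splitting
    a character (a chunk may exceed `size` if one character is larger)."""
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + size, len(data))
        # Back up while the cut would land on a continuation byte (10xxxxxx).
        while start < end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            # `size` is smaller than this character: take the whole character.
            end = start + 1
            while end < len(data) and (data[end] & 0xC0) == 0x80:
                end += 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks


text = "こんにちは"
data = text.encode("utf-8")  # 3 bytes per character

broken = decode_chunks_independently(data, 1)
safe = split_on_char_boundaries(data, 4)
```

With one-byte chunks, every element of broken is a lone U+FFFD and the original bytes are gone, which is why the client cannot repair the stream after the fact. Joining safe reproduces text.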
Impact
- Realtime subtitle UI displays � during speech.
- Streaming display is effectively unusable for Japanese / Chinese / Korean / any multi-byte
language.
- Final transcripts (conversation.item.input_audio_transcription.completed,
session.output_transcript.done) are correct.
Workaround (current)
- Discard delta events that contain U+FFFD and rely on *.completed / *.done for display.
- This defeats the purpose of streaming.
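A minimal sketch of this workaround, written in Python for illustration rather than in the Swift client. The SubtitleBuffer class is hypothetical. It assumes that the final events carry an item_id and a "transcript" field; the report only shows item_id on delta events and only names *.completed.transcript.

```python
import json

REPLACEMENT = "\ufffd"

# Event type -> stream, so input and output text for one item stay separate.
DELTA_TYPES = {
    "session.input_transcript.delta": "input",
    "session.output_transcript.delta": "output",
}
FINAL_TYPES = {
    "conversation.item.input_audio_transcription.completed": "input",
    "session.output_transcript.done": "output",
}


class SubtitleBuffer:
    """Collects usable deltas and lets final transcripts override them."""

    def __init__(self) -> None:
        self.partial: dict[tuple[str, str], str] = {}
        self.final: dict[tuple[str, str], str] = {}

    def handle(self, raw_message: str) -> None:
        event = json.loads(raw_message)
        event_type = event.get("type", "")
        item_id = event.get("item_id", "")
        if event_type in DELTA_TYPES:
            delta = event.get("delta", "")
            if REPLACEMENT in delta:
                return  # Discard corrupted deltas instead of displaying them.
            key = (DELTA_TYPES[event_type], item_id)
            self.partial[key] = self.partial.get(key, "") + delta
        elif event_type in FINAL_TYPES:
            key = (FINAL_TYPES[event_type], item_id)
            # Assumption: the final text is in a "transcript" field.
            self.final[key] = event.get("transcript", "")
            self.partial.pop(key, None)

    def display_text(self, stream: str, item_id: str) -> str:
        key = (stream, item_id)
        return self.final.get(key, self.partial.get(key, ""))
```

Because discarded deltas leave holes, the partial text can be missing characters until the final event arrives.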
Request
Please consider one of the following:
- Ensure server-emitted delta strings are split only at valid UTF-8 (preferably grapheme
cluster) boundaries.
- Provide a session.update option to receive delta as base64-encoded raw bytes (e.g.
delta_b64) so clients can concatenate and decode them safely with an incremental
UTF-8 decoder.
- Alternatively, document that partial transcript deltas are not guaranteed for multi-byte
languages on Azure and recommend *.completed for display.
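To make the second option concrete, here is a sketch of how a client could consume such a field. delta_b64 does not exist today; it is the name proposed in this request, and the B64DeltaDecoder class is hypothetical. The example is in Python, whose standard codecs module provides an incremental UTF-8 decoder.

```python
import base64
import codecs


class B64DeltaDecoder:
    """Decodes a stream of base64-encoded byte fragments into text.

    Bytes of a character split across fragments are held back until the
    character is complete, so no U+FFFD appears for valid input.
    """

    def __init__(self) -> None:
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def feed(self, delta_b64: str) -> str:
        """Return the text that became complete with this fragment."""
        return self._decoder.decode(base64.b64decode(delta_b64))

    def finish(self) -> str:
        """Flush at end of item; dangling bytes become U+FFFD."""
        return self._decoder.decode(b"", final=True)


# Simulate a server that cuts "日本語" after every single byte.
payload = "日本語".encode("utf-8")
fragments = [
    base64.b64encode(payload[i:i + 1]).decode("ascii")
    for i in range(len(payload))
]

decoder = B64DeltaDecoder()
pieces = [decoder.feed(fragment) for fragment in fragments]
result = "".join(pieces) + decoder.finish()
```

Here feed returns an empty string until the third byte of each character arrives, and result equals the original text even though every fragment boundary falls inside a character.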