This repository contains a production-ready implementation of a Genesys AudioHook & Transcription Connector server. It processes real-time audio streams for transcription, using either the Google Cloud Speech-to-Text API or the OpenAI Speech-to-Text API, and for optional translation, using Google Gemini. The transcribed (and optionally translated) text is then injected back into Genesys Cloud via event messages. The server is designed to meet the Genesys AudioHook protocol requirements and supports the essential transactions: session establishment, audio streaming, ping/pong heartbeats, and clean disconnection.
The project is designed to be deployed on Digital Ocean (or a similar platform) and integrates with Google Cloud or OpenAI for transcription and Google Gemini for translation.
- Overview
- Use Cases
- Architecture
- Code Structure
- Transcription and Translation Processing
- Supported Speech Models
- Language Handling
- Dynamic Transcription Vendor Selection
- Synthetic Timestamps and Confidence Scores
- Deployment
- Prerequisites
- Usage
- Error Handling and Logging
- Configuration
- Known Issues
## Overview

The server accepts WebSocket connections from Genesys Cloud (the AudioHook client) and performs the following key operations:
- **Connection Establishment & Validation:**
  - Validates incoming HTTPS upgrade requests against required headers (e.g., API key, organization ID).
  - Negotiates a media format (typically `PCMU` at 8000 Hz).
- **Session Lifecycle Management:**
  - Manages the session lifecycle by handling `"open"`, `"ping"`, `"close"`, and other transaction messages.
  - Sends an `"opened"` message to Genesys Cloud upon a successful open transaction, enabling audio streaming.
- **Real-Time Audio Processing and Control Message Handling:**
  - Processes incoming audio frames (in `PCMU` format) in real time.
  - Converts audio frames from `PCMU` (u-law) to `PCM16` using the Python `audioop` module (see the conversion sketch after this list).
  - Processes control messages such as `"paused"`, `"discarded"`, and `"resumed"` to adjust the effective audio timeline. This ensures that the computed offsets for transcription events exclude any periods where audio was lost or intentionally paused, as required by the Genesys AudioHook protocol.
- **Transcription via Google Cloud Speech-to-Text or OpenAI Speech-to-Text:**
  - Sends `PCM16` audio to either the Google Cloud Speech-to-Text API or OpenAI's Speech-to-Text API for transcription in the source language.
  - The transcription vendor can be specified in the open message via `customConfig.transcriptionVendor`, with a fallback to the `DEFAULT_SPEECH_PROVIDER` environment variable.
- **Translation via Google Gemini (Optional):**
  - If enabled via `customConfig.enableTranslation` in the open message, translates the transcribed text into the destination language using Google Gemini (see `enableTranslation` in the Language Handling section).
  - Uses structured output to ensure only the translated text is returned.
  - If disabled or not specified, the original transcript is returned without translation, in the input language.
- **Injection Back into Genesys Cloud:**
  - Constructs a transcript event message with the (translated or original) text, including accurate offset and duration values adjusted for any control messages.
  - Sends the message back to Genesys Cloud via the WebSocket connection for injection into the conversation.
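The `PCMU` to `PCM16` conversion mentioned above is essentially a single `audioop` call. Below is a minimal sketch assuming 8 kHz mono u-law frames; the helper name is illustrative, not the repository's actual function:

```python
# Minimal sketch: convert a u-law (PCMU) frame to 16-bit linear PCM.
# Assumes 8 kHz mono audio as negotiated with Genesys Cloud.
import audioop

def ulaw_frame_to_pcm16(frame: bytes) -> bytes:
    # width=2 -> 16-bit samples in the output
    return audioop.ulaw2lin(frame, 2)

# A 20 ms frame at 8000 Hz is 160 u-law bytes and becomes 320 bytes of PCM16.
pcm16 = ulaw_frame_to_pcm16(b"\xff" * 160)
assert len(pcm16) == 320
```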
## Use Cases

This connector is designed to support two primary use cases that address different needs in contact center environments:
### Use Case 1: Transcription Without Translation

This use case is ideal when you need a specialized transcription engine different from the native options provided by Genesys Cloud or EVTS.
Key benefits:
- Leverages Google's or OpenAI's advanced speech recognition capabilities
- Supports languages that might not be available in Genesys' native transcription or EVTS
Configuration:
- Set `enableTranslation: false` or omit it in the `customConfig`.
- Ensure the `inputLanguage` in `customConfig` matches the language being spoken.
This approach maintains the original language throughout the conversation, making it suitable for environments where all systems (including analytics, agent assistance, etc.) support the source language.
### Use Case 2: Transcription with Translation

This use case is particularly valuable for enabling advanced Genesys features (like Copilot or Speech & Text Analytics) for languages that aren't directly supported by these tools.
Example scenario: A contact center serves customers who speak a regionally-important language (such as Basque, Zulu, Welsh, etc.) that isn't directly supported by Genesys Copilot or STA. However, these tools do support a widely-used language in that same region (such as Spanish or English).
How it works:
- The customer speaks in their preferred language (e.g., Basque)
- The connector transcribes the audio in the source language
- The text is translated to a widely-supported language (e.g., Spanish)
- Genesys Cloud receives the translated transcript, enabling tools like Copilot and STA to function
Key benefits:
- Extends advanced Genesys features to additional languages
- Provides a more inclusive customer experience
- Leverages existing agent language capabilities
- Enables analytics and assistance tools across more languages
Configuration:
- Set `enableTranslation: true` in the `customConfig`.
- Set `inputLanguage` to the regionally important language (source).
- The `language` field in the open message determines the target language for translation.
This use case is especially valuable in regions with linguistic diversity, where contact centers need to support regional languages while leveraging tools optimized for more widely-spoken languages.
## Architecture

The application is built around the following core components:
- **WebSocket Server:**
  - Uses the `websockets` library to manage connections and message exchanges with Genesys Cloud (see the sketch after this list).
- **Session Handler (`AudioHookServer`):**
  - Processes incoming messages, handles transactions (open, ping, close, etc.), manages rate limiting, and adjusts transcription offsets based on control messages.
  - Implemented in `audio_hook_server.py`.
- **Audio Processing:**
  - Converts audio frames from `PCMU` to `PCM16` using `audioop`.
  - Feeds `PCM16` audio to either Google Cloud Speech-to-Text or OpenAI Speech-to-Text for transcription.
  - Optionally translates transcribed text using Google Gemini.
- **Transcription and Translation:**
  - Transcription uses either:
    - Google Cloud Speech-to-Text API with streaming recognition for real-time transcription.
    - OpenAI Speech-to-Text API with buffered streaming for real-time transcription.
  - Translation (optional) uses Google Gemini with structured output to ensure only the translated text is returned. This step is performed only if `customConfig.enableTranslation` is set to `true` in the open message.
- **Rate Limiting:**
  - Implements a custom rate limiter to prevent exceeding Genesys Cloud AudioHook's messaging rate limits.
  - Defined in `rate_limiter.py`.
- **Environment Configuration:**
  - Loads configuration (API keys, Google Cloud settings, OpenAI settings, rate limits, supported languages, etc.) from environment variables.
  - Managed in `config.py`.
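As a rough illustration of how the WebSocket server and the `/health` endpoint fit together, here is a minimal sketch using the `websockets` library. The handler and constant names are assumptions, and the exact `process_request` signature depends on the `websockets` version in use:

```python
# Minimal sketch: a websockets server that answers HTTP health probes on
# /health and hands WebSocket connections to a session handler.
import asyncio
import http
import websockets

PORT = 8080  # illustrative; the platform forwards public HTTPS (443) traffic to the app

async def health_check(path, request_headers):
    # Answer plain HTTP health probes before the WebSocket handshake.
    if path == "/health":
        return http.HTTPStatus.OK, [], b"ok\n"
    return None  # proceed with the normal WebSocket handshake

async def handle_session(websocket):
    # In the real server, this is where AudioHookServer validates headers,
    # negotiates the media format, and processes messages/audio frames.
    async for message in websocket:
        ...

async def main():
    async with websockets.serve(handle_session, "0.0.0.0", PORT,
                                process_request=health_check):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```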
## Code Structure

- **`Procfile`**
  - Specifies the command to start the application: `web: python main.py`
- **`main.py`**
  - Main entry point that starts the WebSocket server.
  - Validates incoming connections and delegates handling to `AudioHookServer`.
  - Includes WebSocket handshake validation and a health endpoint (`/health`) for Digital Ocean.
- **`audio_hook_server.py`**
  - Contains the `AudioHookServer` class, which manages:
    - Session lifecycle (open, ping, close, etc.).
    - Audio frame processing, control message handling, and rate limiting.
    - Transcription and (optionally) translation event sending back to Genesys Cloud.
    - Dynamic loading of transcription providers based on `customConfig`.
  - For probe connections, the server sends the list of supported languages (as defined in the `SUPPORTED_LANGUAGES` environment variable) to Genesys Cloud.
  - Adjusts transcript offsets based on control messages (`"paused"`, `"discarded"`, and `"resumed"`) to ensure that only the processed audio timeline is considered.
- **`google_speech_transcription.py`**
  - Implements the `StreamingTranscription` class for real-time transcription using Google Cloud Speech-to-Text.
  - Handles audio conversion from `PCMU` to `PCM16` and feeds it to the API.
  - Includes `normalize_language_code` for BCP-47 language code normalization.
- **`openai_speech_transcription.py`**
  - Implements the `StreamingTranscription` class for real-time transcription using OpenAI's Speech-to-Text API.
  - Features:
    - Intelligent buffering that accumulates audio until complete utterances are detected
    - Voice Activity Detection (VAD) to identify speech segments and silence
    - Temporary WAV files for processing detected utterances
    - Streaming to OpenAI's API with appropriate parameters for real-time performance
    - Processing of response chunks to build complete transcripts with confidence scores
    - Language code mapping from BCP-47 to ISO formats
    - Synthetic word-level timing information for compatibility with Genesys AudioHook
    - Token-to-word confidence mapping for accurate per-word confidence scores
    - Artifact filtering to prevent spurious words/phrases in transcripts
    - Initial frame skipping to avoid connection sounds/beeps
    - Low-confidence token filtering
- **`google_gemini_translation.py`**
  - Implements the `translate_with_gemini` function for translating text using Google Gemini.
  - Uses structured output (via Pydantic) to ensure only the translation is returned.
  - Handles translation errors and logs them appropriately.
- **`language_mapping.py`**
  - Contains functions for normalizing language codes:
    - `normalize_language_code`: Normalizes language codes to BCP-47 format (e.g., "es-es" → "es-ES").
    - `get_openai_language_code`: Maps BCP-47 codes to ISO 639-1/639-3 codes compatible with OpenAI's API.
    - `is_openai_unsupported_language`: Identifies languages not directly supported by OpenAI's API.
    - `get_language_specific_prompt`: Provides native language prompts for unsupported languages.
- **`rate_limiter.py`**
  - Provides an asynchronous rate limiter (`RateLimiter`) to throttle message sending (see the sketch after this list).
  - Supports Genesys Cloud's rate limits (e.g., 5 messages/sec, 25-message burst limit).
- **`config.py`**
  - Loads all configuration variables from environment variables.
  - Includes settings for Google Cloud, OpenAI, Google Gemini, Genesys, rate limiting, and supported languages.
- **`utils.py`**
  - Contains helper functions:
    - `format_json`: Pretty-prints JSON for logging.
    - `parse_iso8601_duration`: Parses ISO 8601 duration strings for rate limiting.
- **`requirements.txt`**
  - Lists all Python dependencies required for the project.
## Transcription and Translation Processing

- **Receiving Audio:**
  - Genesys Cloud streams audio frames (binary WebSocket messages) after the open transaction.
  - Each frame is received in `AudioHookServer.handle_audio_frame`.
- **Real-Time Processing:**
  - Converts audio frames from `PCMU` (u-law) to `PCM16` using `audioop`.
  - Supports multi-channel audio (e.g., stereo, with external and internal channels).
- **Control Message Handling:**
  - The server processes control messages such as `"paused"`, `"discarded"`, and `"resumed"`.
  - These messages adjust an internal offset (tracked as `processed_audio_samples`) so that transcription offsets and durations accurately reflect only the audio that was received, excluding any gaps due to pauses or audio loss (see the offset sketch after this list).
- **Transcription Provider Selection:**
  - The system dynamically selects the appropriate transcription provider based on `customConfig.transcriptionVendor`, with a fallback to the `DEFAULT_SPEECH_PROVIDER` environment variable.
  - This selection determines which implementation of `StreamingTranscription` is instantiated.
- **Google Cloud Transcription:**
  - Uses the Google Cloud Speech-to-Text API with streaming recognition.
  - Feeds `PCM16` audio directly to the API.
  - Retrieves transcription results with word-level timing and confidence scores when available.
- **OpenAI Transcription:**
  - Uses an intelligent utterance detection system:
    - Accumulates audio frames in a buffer
    - Applies Voice Activity Detection (VAD) to identify speech segments
    - Detects the end of an utterance based on silence duration (currently 800 ms)
    - Creates temporary WAV files for complete utterances
    - Sends audio to OpenAI's API via streaming for real-time results
    - Processes chunked responses to build complete transcriptions
    - Extracts confidence scores from logprobs when available
    - Maps token-level confidence to word-level confidence
    - Generates synthetic word-level timing, since OpenAI doesn't provide it
  - Implements safeguards:
    - Timeout-based processing to ensure audio doesn't accumulate indefinitely
    - Energy thresholds to avoid processing silence
    - Buffer overflow prevention
    - Duplicate transcript prevention
  - Transcript quality improvements:
    - Skips initial audio frames to avoid connection sounds/beeps
    - Filters out known spurious artifacts ("context:", "ring", etc.)
    - Uses regex pattern matching to identify and remove common artifacts
    - Filters low-confidence tokens that might represent misinterpreted sounds
    - Uses enhanced prompting to instruct the model to ignore system sounds
- **Translation (Optional):**
  - If `customConfig.enableTranslation` is set to `true` in the open message, the transcribed text is sent to Google Gemini for translation into the destination language.
  - If disabled or not specified, the original transcript is returned without translation, in the input language.
  - Structured output ensures that only the translated (or original) text is returned.
  - Translation failures are logged and skipped.
- **Injection Back into Genesys Cloud:**
  - Constructs a transcript event message with:
    - A unique transcript ID.
    - A channel identifier (e.g., 0 for external, 1 for internal).
    - The transcribed text with adjusted offsets, duration, and confidence.
  - Sends the event to Genesys Cloud via WebSocket for conversation injection.
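The offset bookkeeping described above boils down to simple sample arithmetic at 8000 Hz. The following sketch uses the counter names mentioned in this README (`processed_audio_samples`, `offset_adjustment`); the numbers are illustrative:

```python
# Sketch: derive transcript offset/duration from sample counters, excluding
# paused or discarded audio from the timeline.
SAMPLE_RATE = 8000  # PCMU sample rate negotiated with Genesys Cloud

def samples_to_seconds(samples: int) -> float:
    return samples / SAMPLE_RATE

# Example: the utterance starts after 240,000 received samples, of which
# 16,000 samples (2 s) were paused or discarded; the utterance itself spans
# 24,000 samples (3 s).
processed_audio_samples = 240_000
offset_adjustment = 16_000
utterance_samples = 24_000

offset = samples_to_seconds(processed_audio_samples - offset_adjustment)  # 28.0 s
duration = samples_to_seconds(utterance_samples)                          # 3.0 s
```

These values are then carried into the transcript event (together with the text and per-word confidence) before it is sent back over the WebSocket.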
## Supported Speech Models

This connector supports two speech recognition providers:
### Google Cloud Speech-to-Text

- **`chirp_2`** (the most advanced model, with full feature support):
  - Greater performance
  - Faster
  - Word-level confidence scores
  - Limited language support
- **`chirp`** (a good model with broad language support):
  - Does not support word-level confidence scores (a fixed value of `1.0` is used)
  - Slower, with a bit more lag before the transcript gets back into Genesys Cloud
### OpenAI Speech-to-Text

- **`gpt-4o-mini-transcribe`** (the default model, balancing speed and accuracy):
  - Limited parameter support (no timestamps)
  - Uses a sophisticated buffering system to detect complete utterances
  - Features:
    - Voice Activity Detection to process only speech segments
    - Response streaming for real-time results
    - Confidence scores derived from token logprobs
    - Synthetic word-level timing for Genesys compatibility
    - Artifact filtering to prevent spurious transcripts
- **Higher-quality model** for more accurate transcriptions:
  - Limited parameter support (no timestamps)
  - Same processing features as `gpt-4o-mini-transcribe`
The connector automatically adapts to whichever provider and model is specified in the environment variables, adjusting request parameters and response handling accordingly. When using models without word-level confidence or timing, the connector still maintains full compatibility with the Genesys AudioHook protocol by supplying generated values where needed.
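The utterance detection used on the OpenAI path relies on frame-level energy. The sketch below shows the basic idea; the threshold value, frame size, and function names are assumptions for illustration, not the repository's exact implementation:

```python
# Sketch: energy-based voice activity detection with an 800 ms end-of-utterance rule.
import audioop

FRAME_MS = 20            # duration of one audio frame
SILENCE_MS = 800         # continuous silence needed to close an utterance
ENERGY_THRESHOLD = 300   # RMS level below which a PCM16 frame counts as silence

def is_silence(pcm16_frame: bytes) -> bool:
    return audioop.rms(pcm16_frame, 2) < ENERGY_THRESHOLD

def utterance_ended(frames: list[bytes]) -> bool:
    # True if the most recent frames form at least 800 ms of continuous silence.
    needed = SILENCE_MS // FRAME_MS
    tail = frames[-needed:]
    return len(tail) == needed and all(is_silence(f) for f in tail)
```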
## Language Handling

- **Input Language (Source):**
  - Determined from the `customConfig.inputLanguage` field in the `"open"` message received from Genesys Cloud. For example:

    ```json
    {
      "inputLanguage": "es-es",
      "enableTranslation": true
    }
    ```

  - Used for transcription via Google Cloud Speech-to-Text or OpenAI Speech-to-Text.
  - Defaults to `"en-US"` if not provided.
  - Normalized to BCP-47 format using `normalize_language_code`.
- **Language Code Mapping for OpenAI:**
  - OpenAI's speech models accept ISO 639-1/639-3 language codes rather than BCP-47 format.
  - The connector automatically maps BCP-47 codes (e.g., `"es-ES"`) to ISO codes (e.g., `"es"`) before sending requests to OpenAI, using the `get_openai_language_code` function (see the sketch at the end of this section).
  - This mapping covers all major language variants (Spanish, English, French, etc.) and gracefully handles unsupported codes.
  - The mapping is handled transparently, so you can continue using BCP-47 codes in your Genesys configuration.
- **Unsupported Languages Handling:**
  - For languages not officially supported by OpenAI's API (like Zulu/`zu-ZA`), the connector uses a special approach:
    - Detects unsupported languages using the `is_openai_unsupported_language` function
    - Omits the `language` parameter that would cause API errors
    - Instead includes a native-language prompt (e.g., `"Humusha ngesizulu (Mzansi Afrika)"` for Zulu)
    - The prompt is provided in the target language to help guide the model appropriately
  - This approach allows transcription in languages that OpenAI doesn't explicitly support via the `language` parameter.
- **Destination Language:**
  - Determined from the `language` field in the `"open"` message.
  - Used as the target language for translation via Google Gemini when translation is enabled.
  - Normalized to BCP-47 format.
- **Supported Languages:**
  - Defined in the `SUPPORTED_LANGUAGES` environment variable (comma-separated, e.g., `"es-ES,it-IT,en-US"`).
  - Sent to Genesys Cloud in the `"opened"` message for probe connections.
- **Translation Toggle:**
  - The `customConfig.enableTranslation` boolean in the `"open"` message controls whether translation is enabled for the session.
  - If disabled or not specified, the server returns the original transcription without translation, in the input language.
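As a rough sketch of the mappings above (the real `language_mapping.py` covers far more languages, so the tiny lookup table here is purely illustrative):

```python
# Sketch: BCP-47 normalization and BCP-47 -> ISO 639 mapping for OpenAI.
def normalize_language_code(code: str) -> str:
    # "es-es" -> "es-ES"; bare codes are lowercased and passed through.
    parts = code.replace("_", "-").split("-")
    if len(parts) == 2:
        return f"{parts[0].lower()}-{parts[1].upper()}"
    return code.lower()

OPENAI_UNSUPPORTED = {"zu"}  # example: Zulu is handled via a native-language prompt

def get_openai_language_code(bcp47: str) -> str | None:
    # OpenAI expects ISO 639-1/639-3 codes, so drop the region subtag.
    base = normalize_language_code(bcp47).split("-")[0]
    return None if base in OPENAI_UNSUPPORTED else base

print(normalize_language_code("es-es"))    # es-ES
print(get_openai_language_code("es-ES"))   # es
print(get_openai_language_code("zu-ZA"))   # None -> omit the language parameter
```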
## Dynamic Transcription Vendor Selection

The server supports dynamic selection of the transcription vendor on a per-conversation basis:
- **Configuration in Genesys Open Message:**
  - The transcription vendor can be specified in the `"open"` message via the `customConfig.transcriptionVendor` field:

    ```json
    {
      "transcriptionVendor": "google",  // or "openai"
      "inputLanguage": "es-es",
      "enableTranslation": true
    }
    ```

  - This allows different conversations to use different transcription providers based on specific needs.
- **Default Fallback:**
  - If not specified, the server falls back to the `DEFAULT_SPEECH_PROVIDER` environment variable.
  - This maintains backward compatibility with existing deployments.
- **Dynamic Module Loading:**
  - The server uses Python's `importlib` module to dynamically load the appropriate transcription provider at runtime (see the sketch after this list).
  - The `_load_transcription_provider()` method instantiates the correct module after receiving the `"open"` message.
- **Fault Tolerance:**
  - If the specified provider fails to load, the system gracefully falls back to the Google provider.
  - This ensures robustness even if configuration errors occur.
- **Benefits:**
  - Enables A/B testing between different providers
  - Allows different language needs to be serviced by different providers
  - Creates flexibility to use the most appropriate provider for specific scenarios
  - Eliminates the need for multiple deployments for different provider needs
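A minimal sketch of the `importlib`-based selection, following the module naming used in this repository (`google_speech_transcription`, `openai_speech_transcription`); the function name and logging are illustrative:

```python
# Sketch: load the transcription provider chosen in customConfig, falling
# back to the Google provider if the module cannot be imported.
import importlib
import logging

def load_transcription_provider(vendor: str):
    module_name = f"{vendor.lower()}_speech_transcription"
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        logging.warning("Provider %s not found, falling back to google", vendor)
        module = importlib.import_module("google_speech_transcription")
    return module.StreamingTranscription

StreamingTranscription = load_transcription_provider("openai")
```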
## Synthetic Timestamps and Confidence Scores

Since OpenAI's Speech-to-Text API doesn't provide word-level timing information, the connector generates synthetic timestamps to ensure compatibility with Genesys AudioHook:
- **Audio Position Tracking:**
  - The system tracks the accurate position of audio in the stream, accounting for paused and discarded segments.
  - This ensures that even synthetic timestamps accurately reflect the true audio timeline.
- **Utterance-Based Timestamp Generation:**
  - When speech is detected, the system records the position where the utterance begins.
  - This position is used as the base timestamp for all words in the utterance.
  - The utterance duration is calculated from the number of audio samples processed.
- **Word-Level Distribution:**
  - The total utterance duration is evenly distributed across all words in the transcript (see the sketch after this list).
  - Each word is assigned a start and end time relative to the utterance start.
  - For example, a 3-second utterance containing 6 words allocates 0.5 seconds to each word.
- **Timeline Adjustment:**
  - Timestamps are adjusted based on `"paused"` and `"discarded"` messages.
  - This ensures that reported timestamps exclude periods of silence or discarded audio.
  - The `offset_adjustment` property tracks the number of samples to adjust in the timeline.
- **Accurate Temporal Alignment:**
  - The result is synthetic timestamps that closely align with the actual speech.
  - This provides a realistic visualization in the Genesys Cloud UI.
  - The timestamps properly account for the complete audio timeline, including gaps and pauses.
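The even distribution step can be expressed in a few lines. This sketch reproduces the 3-second/6-word example from above; the function name and output shape are illustrative:

```python
# Sketch: spread an utterance's duration evenly across its words.
def distribute_word_timings(words: list[str], utterance_start: float,
                            utterance_duration: float) -> list[dict]:
    per_word = utterance_duration / max(len(words), 1)
    return [
        {
            "word": word,
            "start": utterance_start + i * per_word,
            "end": utterance_start + (i + 1) * per_word,
        }
        for i, word in enumerate(words)
    ]

# 6 words over 3 seconds starting at 28.0 s -> 0.5 s per word.
print(distribute_word_timings("hola buenos dias como esta usted".split(), 28.0, 3.0))
```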
The connector ensures proper confidence scores for all transcription models:
- **Google Chirp 2:**
  - Uses native word-level confidence scores directly from the API
  - These scores represent the model's certainty for each word
  - Values range from `0.0` to `1.0`, with higher values indicating greater confidence
- **Google Chirp:**
  - Chirp doesn't support word-level confidence scores (it always returns `0.0`)
  - The connector automatically replaces these `0.0` values with `1.0`
  - This ensures proper display in the Genesys Cloud UI and prevents errors
  - This fallback applies only to words with explicitly zero confidence scores
- **OpenAI models:**
  - Derive confidence scores from token logprobs (see the sketch after this section)
  - Map token-level confidence to word-level confidence using advanced matching
  - Provide realistic variation in confidence based on the model's certainty
  - Reflect lower confidence for unusual terms or unclear audio
This approach ensures consistent and meaningful confidence scores across all providers and models, maintaining full compatibility with the Genesys AudioHook protocol even when using models with limited confidence score support.
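For the OpenAI path, the logprob-to-confidence step can be illustrated as follows. The token-to-word matching in the actual implementation is more elaborate; taking the minimum token probability per word is just one simple aggregation choice:

```python
# Sketch: convert token logprobs into word-level confidence scores.
import math

def token_confidence(logprob: float) -> float:
    return math.exp(logprob)  # logprobs are <= 0, so this lies in (0, 1]

def word_confidence(token_logprobs: list[float]) -> float:
    return min(token_confidence(lp) for lp in token_logprobs)

print(word_confidence([-0.01, -0.05]))  # ~0.95: clearly recognized word
print(word_confidence([-0.9, -1.6]))    # ~0.20: uncertain word
```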
## Deployment

This project is designed to be deployed on Digital Ocean (or a similar platform). It integrates with Google Cloud or OpenAI for transcription (Speech-to-Text API) and Google Gemini for translation.
### Digital Ocean App Platform Configuration

When deploying this application on Digital Ocean App Platform, you'll need to configure the following settings:
- **HTTP Request Routes**
  - Route Path: `/audiohook`
  - Preserve Path Prefix: Enabled (check this option to ensure the path remains `/audiohook` when forwarded to the component)
- **Ports**
  - Public HTTP Port: `443` (for HTTPS connections)
- **Health Checks**
  - Path: `/health`
  - Protocol: HTTP
- **Commands**
  - Build Command: None
  - Run Command: `python main.py`
These settings ensure that:

- The application listens on the correct path (`/audiohook`) for incoming Genesys Cloud AudioHook connections.
- The health check path (`/health`) is properly configured to allow Digital Ocean to monitor the application's status.
- The application starts correctly with the proper run command.
Important: When configuring your Genesys Cloud AudioHook integration, use the full URL provided by Digital Ocean (e.g., https://startish-app-1gxm4.ondigitalocean.app/audiohook) as your connector endpoint.
## Prerequisites

- **Dependencies:**
  - All Python dependencies are listed in `requirements.txt`:
    - `websockets`
    - `aiohttp`
    - `pydub`
    - `python-dotenv`
    - `google-cloud-speech`
    - `google-generativeai`
    - `openai`
- **Google Cloud Account:**
  - Required for Google Cloud Speech-to-Text API access if using Google as the speech provider.
  - Set up a service account and download the JSON key.
- **OpenAI API Key:**
  - Required for OpenAI Speech-to-Text API access if using OpenAI as the speech provider.
  - Obtain from OpenAI's platform.
- **Google Gemini API Key:**
  - Required for translation services.
  - Obtain from Google AI Studio or similar.
## Usage

- **Local Development:**
  - Set up your environment variables (you can use a `.env` file; see the example after the Configuration table).
  - Install dependencies: `pip install -r requirements.txt`
  - Run the server: `python main.py`
- **Deployment on Digital Ocean App Platform:**
  - Configure environment variables in the App Platform settings.
  - Set up HTTP routes, health checks, and commands as described in the Digital Ocean App Platform Configuration section.
  - Deploy the application; the Run Command will start the server.
## Error Handling and Logging

- **Error Logging:**
  - Logs detailed debug and error messages for:
    - WebSocket connection issues.
    - Audio processing errors.
    - Transcription and translation failures.
    - Rate limiting events.
- **Transcription and Translation Logging:**
  - Transcription results and events sent to Genesys are logged at the `INFO` level.
  - Translation failures are logged with details.
- **Graceful Shutdown:**
  - Handles close transactions by sending a `"closed"` message to Genesys Cloud.
  - Cleans up session resources (stops transcription threads, cancels tasks).
- **Rate Limiting:**
  - Implements backoff for `429` (rate limit exceeded) errors from Genesys (see the sketch below).
  - Supports retry-after durations from Genesys messages or HTTP headers.
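A sketch of the backoff behavior, under the assumption that the retry-after value arrives either as an ISO 8601 duration (e.g., `PT3S`) in a Genesys message or as a plain number of seconds in an HTTP header; the helper here is a simplified stand-in for the real `parse_iso8601_duration` in `utils.py`:

```python
# Sketch: back off after a 429, honoring retry-after when provided.
import asyncio
import re

def parse_iso8601_duration(value: str) -> float:
    # Minimal parser for durations like "PT3S" or "PT0.5S".
    match = re.fullmatch(r"PT(\d+(?:\.\d+)?)S", value)
    return float(match.group(1)) if match else 3.0

async def backoff_on_429(retry_after: str | None, attempt: int) -> None:
    if retry_after is None:
        delay = min(2 ** attempt, 30)      # exponential backoff, capped
    elif retry_after.startswith("PT"):
        delay = parse_iso8601_duration(retry_after)
    else:
        delay = float(retry_after)         # plain seconds from an HTTP header
    await asyncio.sleep(delay)
```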
## Configuration

All configurable parameters are defined in `config.py` and loaded from environment variables. Below is a list of the required environment variables:
| Variable | Description | Default |
|---|---|---|
| `GOOGLE_CLOUD_PROJECT` | Google Cloud project ID for Speech-to-Text API (required for Google provider) | - |
| `GOOGLE_APPLICATION_CREDENTIALS` | JSON key for Google Cloud service account (required for Google provider) | - |
| `GOOGLE_SPEECH_MODEL` | Google Speech recognition model (`chirp_2` or `chirp`) | `chirp_2` |
| `GOOGLE_TRANSLATION_MODEL` | Google Gemini model for translation | - |
| `GEMINI_API_KEY` | API key for Google Gemini | - |
| `OPENAI_API_KEY` | API key for OpenAI (required for OpenAI provider) | - |
| `OPENAI_SPEECH_MODEL` | OpenAI Speech-to-Text model | `gpt-4o-mini-transcribe` |
| `DEFAULT_SPEECH_PROVIDER` | Default speech provider if not specified in `customConfig` (`google` or `openai`) | `google` |
| `GENESYS_API_KEY` | API key for the Genesys Cloud Transcription Connector | - |
| `GENESYS_ORG_ID` | Genesys Cloud organization ID | - |
| `DEBUG` | Set to `"true"` for increased logging granularity | `false` |
| `SUPPORTED_LANGUAGES` | Comma-separated list of supported input languages (e.g., `"es-ES,it-IT,en-US"`) | `es-ES,it-IT` |
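For local development (see the Usage section), these variables can go in a `.env` file. The example below uses placeholder values and assumes the Google provider; adjust or add the OpenAI variables as needed:

```
GOOGLE_CLOUD_PROJECT=<your-gcp-project-id>
GOOGLE_APPLICATION_CREDENTIALS=<service-account-json-key>
GOOGLE_SPEECH_MODEL=chirp_2
GOOGLE_TRANSLATION_MODEL=<gemini-model-name>
GEMINI_API_KEY=<your-gemini-api-key>
DEFAULT_SPEECH_PROVIDER=google
GENESYS_API_KEY=<your-genesys-api-key>
GENESYS_ORG_ID=<your-genesys-org-id>
SUPPORTED_LANGUAGES=es-ES,it-IT,en-US
DEBUG=false
```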
## Known Issues

- **Google Transcription: random numbers in the transcription:**
  - From time to time, arbitrary numbers show up in the transcription that are completely unrelated to the conversation itself. This requires further investigation.
- **OpenAI Transcription: synthetic word timing:**
  - Because OpenAI's Speech API doesn't provide word-level timing information, the connector generates synthetic timestamps based on transcript length.
  - This may cause slight misalignment in the Genesys UI compared to actual speech timing.
- **OpenAI Transcription: latency during speech detection:**
  - The utterance detection system waits for silence (800 ms) to determine the end of speech.
  - This introduces a small latency between someone speaking and the transcript appearing.
  - This is a tradeoff to ensure complete utterances are captured rather than partial fragments.