feat(kokoro): IPA-input synthesis + G2P-kind query for espeak-less builds (#11776) #43
Merged
…ilds (#11776)

The fused eliza_inference_kokoro_synthesize() takes raw text and phonemizes inside the lib: kokoro_phonemize() uses real espeak-ng G2P only when KOKORO_USE_ESPEAK is compiled in, else falls back to a lossy per-byte ASCII grapheme map. Every build that does not link libespeak-ng (Android + iOS always; desktop whenever the host lacks libespeak-ng dev files) therefore produces speech-shaped but unintelligible audio.

The kokoro lib already had eliza_kokoro::ipa_to_token_ids() (embedded 115-entry vocab, exact reference ids); it just was not reachable through the fused FFI. Expose it additively (ABI v12 -> v14; v13 is the main-lineage vision surface, so this develop-pinned lineage advances to v14 to stay collision-free through the #11386 fork reconciliation):

- eliza_inference_kokoro_g2p_kind(ctx): reports ELIZA_KOKORO_G2P_ESPEAK vs _ASCII so the caller knows whether it must pre-phonemize.
- eliza_inference_kokoro_synthesize_ipa(ctx, ipa, ...): synthesize from precomputed espeak-ng IPA, routed through ipa_to_token_ids(), bypassing the in-lib phonemizer entirely.

kokoro_synthesize() and the new kokoro_synthesize_ipa() now share one synthesis core (kokoro_synthesize_from_input_ids); the only difference is the G2P front end. All new symbols are additive: a v12/v13 caller is unaffected.

Native tests extended: test_kokoro_phonemes asserts g2p_kind_of_build() mirrors espeak_available() and that the IPA-input path derives the exact wrapped input_ids; test_kokoro_g2p_espeak asserts g2p_kind == ESPEAK when linked.

Refs elizaOS/eliza#11776, #10726, #10727, #11238.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
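For readers who want a concrete picture of the IPA-input path, here is a minimal TypeScript sketch of the tokenize-then-wrap step the native test checks. It is not the library's code. The vocabulary ids, the pad id of 0, the choice to silently drop unknown symbols, and the names ipaToTokenIds / wrapInputIds are all assumptions made for illustration; the real lib embeds its own 115-entry vocab with the reference ids.

```typescript
// Toy IPA -> token-id vocabulary. These ids are invented for this sketch;
// they are NOT the reference ids embedded in the kokoro lib.
const TOY_VOCAB: ReadonlyMap<string, number> = new Map<string, number>([
  ["h", 1], ["ə", 2], ["l", 3], ["ˈ", 4], ["o", 5], ["ʊ", 6],
]);

// Assumed boundary/pad token id placed before and after the sequence.
const PAD_ID = 0;

// Map each IPA symbol (one Unicode code point) to its id.
// Assumption: symbols missing from the vocab are dropped.
function ipaToTokenIds(
  ipa: string,
  vocab: ReadonlyMap<string, number> = TOY_VOCAB,
): number[] {
  const ids: number[] = [];
  for (const symbol of ipa) { // for...of iterates by code point
    const id = vocab.get(symbol);
    if (id !== undefined) ids.push(id);
  }
  return ids;
}

// Wrap the ids with the pad token on both sides to form the model input.
function wrapInputIds(ids: readonly number[]): number[] {
  return [PAD_ID, ...ids, PAD_ID];
}

// With the toy vocab above this logs [0, 1, 2, 3, 4, 5, 6, 0].
console.log(wrapInputIds(ipaToTokenIds("həlˈoʊ")));
```

The point of the sketch is that the mapping is a plain table lookup over phoneme symbols, so no espeak-ng is needed inside the lib once the caller supplies IPA.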
lalalune added a commit to elizaOS/eliza that referenced this pull request Jul 3, 2026
…ilds (#11776) (#11827)

The fused eliza_inference_kokoro_synthesize() takes raw text and phonemizes inside the lib. Real espeak-ng G2P is compiled in only under KOKORO_USE_ESPEAK; otherwise it uses a lossy per-byte ASCII grapheme fallback. Every fused build that does not link libespeak-ng (Android + iOS always, desktop whenever the host lacks libespeak-ng dev files) therefore produced speech-shaped but unintelligible audio.

Fork (gitlink bump 66ab678cb, a descendant of the develop pin dda200ab0; elizaOS/llama.cpp#43, additive ABI v12 -> v14): expose the kokoro lib's existing ipa_to_token_ids() through the fused FFI as eliza_inference_kokoro_synthesize_ipa plus an eliza_inference_kokoro_g2p_kind capability query.

TS runtime:
- ffi-bindings.ts: bind the two new symbols as an additive kokoro-g2p family, add a develop-pinned-lineage cascade rung (v12 + Kokoro IPA WITHOUT the main-lineage vision-stream v13 symbols; the fork advanced 12 -> 14 for Kokoro, fork-sync #11386), accept a lib reporting v14, and keep accepting v13/v12. ELIZA_INFERENCE_ABI_VERSION 13 -> 14.
- KokoroFfiRuntime queries g2p_kind once at load and routes per lib: espeak -> raw text (keeps the #11238 fix, no double-phonemization); ascii -> feed the espeak-ng-WASM IPA the TS phonemizer already produced through synthesize_ipa; unknown (pre-v14 lib) -> raw text with a loud one-time warning. When only the lossy dev phonemizer resolved, warn once but still use the IPA entry.
- kokoro-backend threads the phonemizer id so the runtime can name the fallback.

Staging honesty: stage-desktop-fused-lib.mjs warns loudly (non-fatal) when the host has no libespeak-ng, since the TS IPA path now keeps it intelligible. The desktop + iOS verify-symbol lists require the two new v14 symbols.

Evidence (.github/issue-evidence/11776-kokoro-ipa-g2p/, real eliza-1-asr ASR):
- espeak-less lib mean WER: RAW-TEXT 0.958 (bug) -> WASM-IPA 0.042 (fix); espeak-linked baseline 0.042 == 0.042 (the fix reaches parity).
- Real TS runtime smoke on the espeak-less lib: WER 0.13, phonemizer=phonemizer.
- Android emulator-5554 (arm64, real 76MB fused .so, g2p=ascii): RAW-TEXT mean WER 0.958 -> WASM-IPA mean WER 0.042, non-empty accurate transcript (the exact round-trip that returned EMPTY in #10727's emu leg).

iOS inherits the fix (platform-neutral TS+FFI; g2p=ascii); on-device iOS capture tracked by #11612-residual / #11734.

Closes #11776.

Co-authored-by: Shaw <shawgotbags@gmail.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
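The evidence above is reported as mean WER (word error rate). As a reminder of what that number measures, here is a minimal TypeScript sketch: word-level edit distance between the reference text and the ASR transcript, divided by the reference word count, then averaged over utterances. The text normalization in tokenize (lowercasing, ASCII-only punctuation stripping) and all function names are assumptions; the actual evidence harness may normalize differently, so this sketch is not expected to reproduce the figures quoted.

```typescript
// Assumed normalization: lowercase, replace anything that is not an ASCII
// letter, digit, apostrophe, or whitespace with a space, then split.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter((word) => word.length > 0);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a two-row Levenshtein table over words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  // Convention chosen for this sketch when the reference is empty.
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion (reference word missing)
        curr[j - 1] + 1,    // insertion (extra hypothesis word)
        prev[j - 1] + cost, // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Mean WER over [reference, hypothesis] pairs.
function meanWer(pairs: ReadonlyArray<readonly [string, string]>): number {
  if (pairs.length === 0) throw new Error("meanWer needs at least one pair");
  const total = pairs.reduce((sum, [r, h]) => sum + wordErrorRate(r, h), 0);
  return total / pairs.length;
}
```

Read this way, a mean WER near 0.958 means almost every reference word was wrong or missing in the transcript, while 0.042 is roughly one error per 24 words.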
Why
The fused eliza_inference_kokoro_synthesize() takes raw text and phonemizes inside the lib: kokoro_phonemize() uses real espeak-ng G2P only under KOKORO_USE_ESPEAK, else falls back to a lossy per-byte ASCII grapheme map. So every fused build that does not link libespeak-ng (Android + iOS always, desktop whenever the host lacks libespeak-ng dev files) synthesizes unintelligible audio (elizaOS/eliza#11776).

The kokoro lib already had eliza_kokoro::ipa_to_token_ids() (embedded 115-entry vocab, exact reference ids); it just wasn't reachable through the fused FFI. This exposes it additively.

What (additive ABI v12 -> v14)

- eliza_inference_kokoro_g2p_kind(ctx) -> ELIZA_KOKORO_G2P_ESPEAK | ELIZA_KOKORO_G2P_ASCII: the caller (TS voice layer) queries whether it must pre-phonemize.
- eliza_inference_kokoro_synthesize_ipa(ctx, ipa, ...): synthesize from precomputed espeak-ng IPA, routed through ipa_to_token_ids(), bypassing the in-lib phonemizer entirely (the intelligible path for espeak-less builds, fed by the TS espeak-ng-WASM phonemizer).
- kokoro_synthesize() / new kokoro_synthesize_ipa() now share one synthesis core (kokoro_synthesize_from_input_ids); only the G2P front end differs.

All new symbols are additive: a v12/v13 caller is unaffected; a library that predates this surface reports the symbols absent and the loader falls back to raw text.
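To illustrate how a caller might consume this surface, here is a hedged TypeScript sketch of the routing decision: query the G2P kind once, send raw text to an espeak-linked lib, send precomputed IPA to an ASCII-fallback lib, and fall back to raw text with a single warning when the symbols are absent. The KokoroLib interface, makeSpeak, and the phonemize callback are hypothetical names for this example only; they are not the real FFI bindings or the KokoroFfiRuntime API.

```typescript
type G2pKind = "espeak" | "ascii" | "unknown";
type Audio = Float32Array;

// Hypothetical shape of a bound library. The two optional members model
// the new symbols, which a pre-v14 lib does not export.
interface KokoroLib {
  synthesize(text: string): Audio;
  g2pKind?(): "espeak" | "ascii";
  synthesizeIpa?(ipa: string): Audio;
}

// A lib missing either new symbol is treated as "unknown".
function detectG2pKind(lib: KokoroLib): G2pKind {
  if (!lib.g2pKind || !lib.synthesizeIpa) return "unknown";
  return lib.g2pKind();
}

// Returns a speak() function whose routing is fixed at load time.
function makeSpeak(
  lib: KokoroLib,
  phonemize: (text: string) => string, // e.g. an espeak-ng WASM phonemizer
  warn: (msg: string) => void = console.warn,
): (text: string) => Audio {
  const kind = detectG2pKind(lib); // queried once, not per utterance
  let warned = false;

  return (text: string): Audio => {
    if (kind === "ascii" && lib.synthesizeIpa) {
      // The lib cannot phonemize properly, so the caller pre-phonemizes.
      return lib.synthesizeIpa(phonemize(text));
    }
    if (kind === "unknown" && !warned) {
      warned = true;
      warn("kokoro lib has no g2p_kind/synthesize_ipa; using raw text");
    }
    // espeak-linked lib (or unknown): pass raw text and let the lib
    // phonemize, which avoids double phonemization.
    return lib.synthesize(text);
  };
}
```

Deciding the route once at load keeps the per-utterance path free of capability checks, and treating missing symbols as a distinct "unknown" state keeps older libs working while making the degraded path visible.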
ABI numbering
v13 (token-by-token vision describe) is the main-lineage vision surface. This develop-pinned lineage (the eliza develop submodule pins dda200ab0) advances 12 -> 14 for the Kokoro IPA surface so the two independent bumps stay collision-free through the #11386 fork reconciliation.

Tests
Built on macOS (Apple, espeak linked via Homebrew); both kokoro unit suites pass:

- test_kokoro_phonemes: OK. New asserts that g2p_kind_of_build() mirrors espeak_available() and that wrap_input_ids(ipa_to_token_ids("həlˈoʊ")) yields the exact wrapped input_ids the IPA entry feeds the model.
- test_kokoro_g2p_espeak: ALL PASS. New assert g2p_kind_of_build() == ESPEAK when linked; reference ids still match.

Also built the espeak-less fused shared lib (-DKOKORO_ENABLE_ESPEAK=OFF): eliza_inference_kokoro_g2p_kind + _synthesize_ipa export correctly and the lib has no libespeak-ng dependency. The espeak-less end-to-end WER round-trip is in the eliza-side PR.

Base
Targets fix/11377-ifgo-reader-sync (tip = dda200ab0, the current eliza develop submodule pin) so the diff is purely the additive kokoro-IPA change with nothing from the divergent main lineage. The eliza gitlink bump will point at this branch's merged tip (a descendant of dda200ab0, so no regression of the #11612 Metal fixes or the IFGO diarizer guard).

Refs elizaOS/eliza#11776, #10726, #10727, #11238.
🤖 Generated with Claude Code