feat(kokoro): IPA-input synthesis + G2P-kind query for espeak-less builds (#11776) #43

Merged
lalalune merged 1 commit into fix/11377-ifgo-reader-sync from fix/11776-kokoro-ipa-ffi
Jul 3, 2026

@lalalune lalalune commented Jul 3, 2026

Member

Why

The fused eliza_inference_kokoro_synthesize() takes raw text and phonemizes inside the lib: kokoro_phonemize() uses real espeak-ng G2P only under KOKORO_USE_ESPEAK, else falls back to a lossy per-byte ASCII grapheme map. So every fused build that does not link libespeak-ng — Android + iOS always, desktop whenever the host lacks libespeak-ng dev files — synthesizes unintelligible audio (elizaOS/eliza#11776).

The kokoro lib already had eliza_kokoro::ipa_to_token_ids() (embedded 115-entry vocab, exact reference ids); it just wasn't reachable through the fused FFI. This exposes it additively.

What (additive ABI v12 -> v14)

  • eliza_inference_kokoro_g2p_kind(ctx) -> ELIZA_KOKORO_G2P_ESPEAK | ELIZA_KOKORO_G2P_ASCII — the caller (TS voice layer) queries whether it must pre-phonemize.
  • eliza_inference_kokoro_synthesize_ipa(ctx, ipa, ...) — synthesize from precomputed espeak-ng IPA, routed through ipa_to_token_ids(), bypassing the in-lib phonemizer entirely (the intelligible path for espeak-less builds, fed by the TS espeak-ng-WASM phonemizer).
  • kokoro_synthesize() / new kokoro_synthesize_ipa() now share one synthesis core (kokoro_synthesize_from_input_ids); only the G2P front end differs.

All new symbols are additive — a v12/v13 caller is unaffected; a library that predates this surface reports the symbols absent and the loader falls back to raw text.
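
A loader-side sketch of that fallback. The `Resolve` callback is hypothetical (it stands in for whatever the real binding layer uses to look up exports and is not part of this PR); only the two symbol names come from the change itself.

```typescript
// Hypothetical resolver: returns a handle, or null when the library
// does not export the symbol. Not the real binding API.
type Resolve = (name: string) => object | null;

// The two symbols this PR adds, treated as one all-or-nothing family.
const KOKORO_G2P_SYMBOLS = [
  "eliza_inference_kokoro_g2p_kind",
  "eliza_inference_kokoro_synthesize_ipa",
] as const;

// True only when every symbol in the family resolves. A library that
// predates the surface returns false and the caller keeps using the
// raw-text entry point.
function hasKokoroG2pFamily(resolve: Resolve): boolean {
  return KOKORO_G2P_SYMBOLS.every((name) => resolve(name) !== null);
}

// Fake libraries for illustration only.
const olderLib: Resolve = () => null;
const newerLib: Resolve = (name) =>
  name.startsWith("eliza_inference_kokoro_") ? {} : null;
const halfLib: Resolve = (name) =>
  name === "eliza_inference_kokoro_g2p_kind" ? {} : null;
```

Treating the pair as a unit avoids a half-bound state where the capability query exists but the IPA entry point does not.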

ABI numbering

v13 (token-by-token vision describe) is the main-lineage vision surface. This develop-pinned lineage (the eliza develop submodule pins dda200ab0) advances 12 -> 14 for the Kokoro IPA surface so the two independent bumps stay collision-free through the #11386 fork reconciliation.
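
The practical consequence is that the version number is not cumulative on this lineage: v14 does not imply the v13 symbols. A minimal sketch of how a caller might model that, assuming illustrative family labels (`base`, `vision-stream`, `kokoro-g2p`) that are not names from the codebase, and assuming v13 still carries the v12 base surface:

```typescript
// Illustrative labels for symbol families; not identifiers from the repo.
type Family = "base" | "vision-stream" | "kokoro-g2p";

// Hypothetical lookup: what each reported ABI version implies.
// v14 here is "v12 + Kokoro IPA", without the v13 vision symbols.
const FAMILIES_BY_ABI = new Map<number, readonly Family[]>([
  [12, ["base"]],
  [13, ["base", "vision-stream"]],
  [14, ["base", "kokoro-g2p"]],
]);

// Returns the families a version implies, or null for an unknown version.
function expectedFamilies(abi: number): readonly Family[] | null {
  return FAMILIES_BY_ABI.get(abi) ?? null;
}
```

A caller that checked `abi >= 13` to enable the vision surface would misfire on a v14 library from this lineage, which is why a per-version table (or per-symbol probing) is safer than an ordered comparison.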

Tests

Built on macOS (Apple, espeak linked via Homebrew) — both kokoro unit suites pass:

  • test_kokoro_phonemes: OK — new asserts that g2p_kind_of_build() mirrors espeak_available() and that wrap_input_ids(ipa_to_token_ids("həlˈoʊ")) yields the exact wrapped input_ids the IPA entry feeds the model.
  • test_kokoro_g2p_espeak: ALL PASS — new assert g2p_kind_of_build() == ESPEAK when linked; reference ids still match.

Also built the espeak-less fused shared lib (-DKOKORO_ENABLE_ESPEAK=OFF): eliza_inference_kokoro_g2p_kind + _synthesize_ipa export correctly and the lib has no libespeak-ng dependency — the espeak-less end-to-end WER round-trip is in the eliza-side PR.

Base

Targets fix/11377-ifgo-reader-sync (tip = dda200ab0, the current eliza develop submodule pin) so the diff is purely the additive kokoro-IPA change with nothing from the divergent main lineage. The eliza gitlink bump will point at this branch's merged tip (a descendant of dda200ab0 — no regression of the #11612 Metal fixes or the IFGO diarizer guard).

Refs elizaOS/eliza#11776, #10726, #10727, #11238.

🤖 Generated with Claude Code

…ilds (#11776)

The fused eliza_inference_kokoro_synthesize() takes raw text and phonemizes
inside the lib: kokoro_phonemize() uses real espeak-ng G2P only when
KOKORO_USE_ESPEAK is compiled in, else falls back to a lossy per-byte ASCII
grapheme map. Every build that does not link libespeak-ng (Android + iOS
always; desktop whenever the host lacks libespeak-ng dev files) therefore
produces speech-shaped but unintelligible audio.

The kokoro lib already had eliza_kokoro::ipa_to_token_ids() (embedded 115-entry
vocab, exact reference ids) — it just was not reachable through the fused FFI.
Expose it additively (ABI v12 -> v14; v13 is the main-lineage vision surface,
so this develop-pinned lineage advances to v14 to stay collision-free through
the #11386 fork reconciliation):

- eliza_inference_kokoro_g2p_kind(ctx): reports ELIZA_KOKORO_G2P_ESPEAK vs
  _ASCII so the caller knows whether it must pre-phonemize.
- eliza_inference_kokoro_synthesize_ipa(ctx, ipa, ...): synthesize from
  precomputed espeak-ng IPA, routed through ipa_to_token_ids(), bypassing the
  in-lib phonemizer entirely.

kokoro_synthesize() and the new kokoro_synthesize_ipa() now share one synthesis
core (kokoro_synthesize_from_input_ids); the only difference is the G2P front
end. All new symbols are additive — a v12/v13 caller is unaffected.

Native tests extended: test_kokoro_phonemes asserts g2p_kind_of_build() mirrors
espeak_available() and that the IPA-input path derives the exact wrapped
input_ids; test_kokoro_g2p_espeak asserts g2p_kind == ESPEAK when linked.

Refs elizaOS/eliza#11776, #10726, #10727, #11238.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jul 3, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@lalalune lalalune merged commit 1045ed5 into fix/11377-ifgo-reader-sync Jul 3, 2026
8 of 38 checks passed
lalalune added a commit to elizaOS/eliza that referenced this pull request Jul 3, 2026
…ilds (#11776) (#11827)

The fused eliza_inference_kokoro_synthesize() takes raw text and phonemizes
inside the lib. Real espeak-ng G2P is compiled in only under KOKORO_USE_ESPEAK;
otherwise it uses a lossy per-byte ASCII grapheme fallback. Every fused build
that does not link libespeak-ng — Android + iOS always, desktop whenever the
host lacks libespeak-ng dev files — therefore produced speech-shaped but
unintelligible audio.

Fork (gitlink bump 66ab678cb, a descendant of the develop pin dda200ab0 —
elizaOS/llama.cpp#43, additive ABI v12 -> v14): expose the kokoro lib's existing
ipa_to_token_ids() through the fused FFI as eliza_inference_kokoro_synthesize_ipa
plus an eliza_inference_kokoro_g2p_kind capability query.

TS runtime:
- ffi-bindings.ts: bind the two new symbols as an additive kokoro-g2p family,
  add a develop-pinned-lineage cascade rung (v12 + Kokoro IPA WITHOUT the
  main-lineage vision-stream v13 symbols — the fork advanced 12 -> 14 for Kokoro,
  fork-sync #11386), accept a lib reporting v14, and keep accepting v13/v12.
  ELIZA_INFERENCE_ABI_VERSION 13 -> 14.
- KokoroFfiRuntime queries g2p_kind once at load and routes per lib: espeak ->
  raw text (keeps the #11238 fix, no double-phonemization); ascii -> feed the
  espeak-ng-WASM IPA the TS phonemizer already produced through synthesize_ipa;
  unknown (pre-v14 lib) -> raw text with a loud one-time warning. When only the
  lossy dev phonemizer resolved, warn once but still use the IPA entry.
- kokoro-backend threads the phonemizer id so the runtime can name the fallback.
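
The per-lib routing described in the list above can be sketched roughly like this. All names (`createKokoroRouter`, `Route`, the warning text) are hypothetical and do not mirror KokoroFfiRuntime; the case where the lib reports ascii but no IPA is available is not specified in this commit message, so the sketch simply falls back to raw text there.

```typescript
type G2pKind = "espeak" | "ascii" | "unknown";

interface SynthesisRequest {
  text: string;        // raw input text
  ipa: string | null;  // IPA from the TS-side phonemizer, if any
}

interface Route {
  entry: "synthesize" | "synthesize_ipa";
  input: string;
}

// The G2P kind is queried once at load, so it is fixed per router.
function createKokoroRouter(kind: G2pKind, warn: (message: string) => void) {
  let warned = false;
  return (req: SynthesisRequest): Route => {
    // Espeak-less lib: bypass the in-lib phonemizer with precomputed IPA.
    if (kind === "ascii" && req.ipa !== null) {
      return { entry: "synthesize_ipa", input: req.ipa };
    }
    // Pre-v14 lib: capability unknown, warn once and use raw text.
    if (kind === "unknown" && !warned) {
      warned = true;
      warn("kokoro: G2P kind unknown (pre-v14 lib); sending raw text");
    }
    // Espeak-linked lib: raw text, so nothing is phonemized twice.
    return { entry: "synthesize", input: req.text };
  };
}
```

Usage: build the router once after loading the library, then call it per utterance, for example `createKokoroRouter("ascii", console.warn)({ text: "hello", ipa: "həlˈoʊ" })`.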

Staging honesty: stage-desktop-fused-lib.mjs warns loudly (non-fatal) when the
host has no libespeak-ng, since the TS IPA path now keeps it intelligible. The
desktop + iOS verify-symbol lists require the two new v14 symbols.

Evidence (.github/issue-evidence/11776-kokoro-ipa-g2p/, real eliza-1-asr ASR):
- espeak-less lib mean WER: RAW-TEXT 0.958 (bug) -> WASM-IPA 0.042 (fix);
  espeak-linked baseline 0.042 == 0.042 (the fix reaches parity).
- Real TS runtime smoke on the espeak-less lib: WER 0.13, phonemizer=phonemizer.
- Android emulator-5554 (arm64, real 76MB fused .so, g2p=ascii): RAW-TEXT mean
  WER 0.958 -> WASM-IPA mean WER 0.042, non-empty accurate transcript (the exact
  round-trip that returned EMPTY in #10727's emu leg).
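
For context on the metric: WER is the word-level edit distance (substitutions + deletions + insertions) between the ASR transcript and the reference, divided by the reference word count. A minimal sketch of that computation, assuming whitespace tokenization and lowercase normalization (the evidence harness may normalize differently):

```typescript
// Word error rate: Levenshtein distance over words / reference length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const tokenize = (s: string) =>
    s.toLowerCase().trim().split(/\s+/).filter((w) => w.length > 0);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // prev[j] = distance between the first i-1 reference words and the
  // first j hypothesis words; rows are rolled to keep memory linear.
  let prev: number[] = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;      // reference word missing
      const insertion = curr[j - 1] + 1; // extra hypothesis word
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Mean over utterances, as in the "mean WER" figures above.
function meanWer(pairs: Array<[string, string]>): number {
  if (pairs.length === 0) return 0;
  const total = pairs.reduce((sum, [r, h]) => sum + wordErrorRate(r, h), 0);
  return total / pairs.length;
}
```

Under this definition an empty transcript against a non-empty reference scores 1.0, which is how an EMPTY round-trip shows up in the metric.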

iOS inherits the fix (platform-neutral TS+FFI; g2p=ascii); on-device iOS capture
tracked by #11612-residual / #11734.

Closes #11776.

Co-authored-by: Shaw <shawgotbags@gmail.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>