Conversation

@Bennethxyz

This PR introduces batched sampling to reduce per-token overhead when multiple requests reach the last shard concurrently.

What
- Add an async batcher in Node to group sampling calls within a short window (default 5 ms) or until a maximum batch size is reached (default 8).
- Stack logits and sample once for the whole batch; on failure, fall back to per-request sampling.
- Emit per-request token callbacks and forward the sampled token for continued generation, preserving current behavior.

Why
- Incremental progress toward full forward-pass batching requested in #1 ([BOUNTY - ] Batched Requests). Sampling is a measurable hotspot and can benefit from batching with minimal risk.

Notes
- No changes to public APIs or the gRPC schema; fully backward compatible.
- Future work: extend batching earlier in the pipeline (prompt encode and forward passes), combining per-request caches into batch-aware caches.

Config
- Maximum batch size (default 8)
- Batch window in milliseconds (default 5)

I’m happy to iterate on full tensor-forward batching next (MLX/Tinygrad cache semantics).
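The description above maps fairly directly onto a small asyncio batcher. The sketch below is illustrative only, not this PR's actual code: `SamplingBatcher`, `sample_batch_fn`, and `sample_single_fn` are hypothetical names, and the engine's sampling functions are treated as plain callables over NumPy logits for brevity.

```python
import asyncio
import numpy as np


class SamplingBatcher:
    """Group concurrent sample() calls into a single batched call.

    A flush happens when `max_batch_size` requests are pending, or when
    `window_ms` has elapsed since the first request of the window,
    whichever comes first.
    """

    def __init__(self, sample_batch_fn, sample_single_fn,
                 max_batch_size: int = 8, window_ms: float = 5.0):
        self._sample_batch = sample_batch_fn    # (B, vocab) logits -> B tokens
        self._sample_single = sample_single_fn  # (vocab,) logits -> 1 token
        self.max_batch_size = max_batch_size
        self.window_ms = window_ms
        self._pending = []          # list of (logits, future) pairs
        self._flush_task = None     # timer task for the current window
        self._lock = asyncio.Lock()

    async def sample(self, logits: np.ndarray) -> int:
        """Enqueue one request's logits and await its sampled token."""
        fut = asyncio.get_running_loop().create_future()
        async with self._lock:
            self._pending.append((logits, fut))
            if len(self._pending) >= self.max_batch_size:
                # Batch is full: cancel the timer and flush immediately.
                if self._flush_task is not None:
                    self._flush_task.cancel()
                    self._flush_task = None
                self._flush(self._drain())
            elif self._flush_task is None:
                # First request of a new window: schedule a timed flush.
                self._flush_task = asyncio.create_task(self._flush_after_window())
        return await fut

    async def _flush_after_window(self):
        try:
            await asyncio.sleep(self.window_ms / 1000.0)
        except asyncio.CancelledError:
            return
        async with self._lock:
            self._flush_task = None
            batch = self._drain()
        self._flush(batch)

    def _drain(self):
        batch, self._pending = self._pending, []
        return batch

    def _flush(self, batch):
        if not batch:
            return
        logits_list, futures = zip(*batch)
        try:
            # One sampling call for the whole batch: stack to (B, vocab).
            tokens = self._sample_batch(np.stack(logits_list))
            if len(tokens) != len(futures):
                raise ValueError("batched sampler returned wrong token count")
        except Exception:
            # Any failure: fall back to per-request sampling.
            tokens = [self._sample_single(l) for l in logits_list]
        for fut, tok in zip(futures, tokens):
            if not fut.done():
                fut.set_result(int(tok))
```

On the last shard, the per-request sampling path would then `await batcher.sample(last_token_logits)` instead of calling the engine directly, and each request's token callback fires when its future resolves.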

…-token overhead

- Add async batch queues with a short timeout and max batch size
- Stack logits and call engine.sample once for the batch (with per-request fallback)
- Forward sampled tokens and emit callbacks per-request

This is an incremental step toward full forward-pass batching as requested in exo-explore#1.
… to per-request

- Prevent passing batched logits to TinygradDynamicShardInferenceEngine.sample
- Safety-check the token count; fall back to per-request sampling on mismatch
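One plausible shape for the guard described in this commit is a routing helper that keeps stacked (B, vocab) logits away from engines that only sample one request at a time. This is a sketch under assumptions, not the PR's code: `supports_batched_sampling` is a hypothetical capability flag (an `isinstance` check against the Tinygrad engine would work equally well), and `engine.sample` is treated as synchronous for brevity.

```python
import numpy as np


def sample_with_guard(engine, stacked_logits: np.ndarray):
    """Sample a (B, vocab) batch without handing batched logits to an
    engine that only supports per-request sampling."""
    def per_request():
        # Split the stack into single (vocab,) rows and sample each one.
        return [engine.sample(row) for row in stacked_logits]

    if not getattr(engine, "supports_batched_sampling", False):
        return per_request()  # e.g. TinygradDynamicShardInferenceEngine
    tokens = engine.sample(stacked_logits)
    if len(tokens) != stacked_logits.shape[0]:
        # Safety check: token count must match the batch size.
        return per_request()
    return tokens
```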