Feature request
Add support for token-based rate-limiting for both input and output tokens when using RemoteInferenceEngine.
Motivation / references
In addition to request-based limits, many API providers also enforce token-based limits on both input and output tokens:
https://console.anthropic.com/settings/limits
https://platform.openai.com/settings/organization/limits
If we don't let users configure these limits, their requests start failing once a limit is hit.
Your contribution
- Update `remote_params.py` to add input- and output-token limits, and switch from the custom politeness policy to requests per minute (RPM); see the config sketch after this list. The current field (`oumi/src/oumi/core/configs/params/remote_params.py`, Line 45 in `eee596f`):

  ```python
  politeness_policy: float = 0.0
  ```
- Update `RemoteInferenceEngine` to use requests per minute (RPM) rather than a custom politeness policy.
- Update `RemoteInferenceEngine` to sleep when either the RPM or the TPM (tokens per minute) limit is reached, whichever comes first (a limiter sketch also follows this list). The current sleep call (`oumi/src/oumi/inference/remote_inference_engine.py`, Line 427 in `eee596f`):

  ```python
  await asyncio.sleep(remote_params.politeness_policy)
  ```
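
A minimal sketch of what the updated `remote_params.py` fields could look like, assuming a dataclass-style config; the field names `requests_per_minute`, `input_tokens_per_minute`, and `output_tokens_per_minute` are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteParams:
    """Sketch of the updated parameters (field names are hypothetical)."""

    # Replaces the fixed `politeness_policy` sleep with an explicit RPM limit.
    requests_per_minute: Optional[int] = None
    # Tokens-per-minute (TPM) limits; None disables the corresponding check.
    input_tokens_per_minute: Optional[int] = None
    output_tokens_per_minute: Optional[int] = None
```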
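One way the combined RPM/TPM throttling could be shaped is a sliding-window limiter that blocks until a new request fits both budgets. This is a sketch under the assumptions above, not the engine's actual implementation; `MinuteRateLimiter` and its methods are hypothetical names:

```python
import asyncio
import time
from collections import deque


class MinuteRateLimiter:
    """Blocks until a new request fits both the RPM and TPM budgets.

    Uses a 60-second sliding window; assumes a single request's tokens
    fit within the TPM budget.
    """

    def __init__(self, rpm: int, tpm: int):
        self._rpm = rpm
        self._tpm = tpm
        self._requests: deque[float] = deque()            # request timestamps
        self._tokens: deque[tuple[float, int]] = deque()  # (timestamp, tokens)

    def _prune(self, now: float) -> None:
        # Discard events that have aged out of the 60-second window.
        while self._requests and now - self._requests[0] > 60.0:
            self._requests.popleft()
        while self._tokens and now - self._tokens[0][0] > 60.0:
            self._tokens.popleft()

    async def acquire(self, estimated_input_tokens: int) -> None:
        # Sleep until one more request stays within both limits,
        # whichever of the two was hit first.
        while True:
            now = time.monotonic()
            self._prune(now)
            tokens_used = sum(count for _, count in self._tokens)
            if (len(self._requests) < self._rpm
                    and tokens_used + estimated_input_tokens <= self._tpm):
                self._requests.append(now)
                self._tokens.append((now, estimated_input_tokens))
                return
            # The earliest chance to proceed is when the oldest tracked
            # event leaves the window.
            oldest = min(
                self._requests[0] if self._requests else now,
                self._tokens[0][0] if self._tokens else now,
            )
            await asyncio.sleep(max(60.0 - (now - oldest), 0.05))

    def record_output_tokens(self, output_tokens: int) -> None:
        # Output tokens are only known after the response arrives (see the
        # note below on usage reporting), so charge them retroactively.
        self._tokens.append((time.monotonic(), output_tokens))
```

The engine would call `await limiter.acquire(...)` before each request and `limiter.record_output_tokens(...)` after a successful response.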
Most APIs report in a successful response how many input and output tokens were consumed by that request, so actual usage can be tracked rather than estimated.
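
For illustration, both providers linked above include a usage object in successful responses; a small normalizing helper (the name `extract_token_usage` and the dual-shape handling are assumptions, not Oumi code) could look like:

```python
def extract_token_usage(response_json: dict) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) from a provider response body.

    OpenAI-style responses carry {"usage": {"prompt_tokens": ...,
    "completion_tokens": ...}}; Anthropic-style responses carry
    {"usage": {"input_tokens": ..., "output_tokens": ...}}.
    """
    usage = response_json.get("usage") or {}
    input_tokens = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    output_tokens = usage.get("completion_tokens", usage.get("output_tokens", 0))
    return input_tokens, output_tokens
```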