[Feature] Add support for token-based rate-limiting in RemoteInferenceEngine #1457

@jgreer013

Description

Feature request

Add support for token-based rate-limiting for both input and output tokens when using RemoteInferenceEngine.

Motivation / references

In addition to request-based limits, many API providers also enforce token-based limits on both input and output tokens:

https://console.anthropic.com/settings/limits
https://platform.openai.com/settings/organization/limits

If we don't allow users to configure these limits, their requests eventually fail once a limit is hit.

Your contribution

  1. Update remote_params.py to add input and output token limits, and replace the custom politeness policy with a requests-per-minute (RPM) setting. The current field:
    politeness_policy: float = 0.0
  2. Update RemoteInferenceEngine to use "requests per minute" rather than the custom politeness policy
  3. Update RemoteInferenceEngine to sleep when either the RPM or the TPM limit is reached (whichever comes first), replacing the current fixed delay:
    await asyncio.sleep(remote_params.politeness_policy)
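For step 1, the new configuration could look like the following. This is a hypothetical sketch; the actual field names and structure of remote_params.py may differ:

```python
# Hypothetical sketch of the proposed RemoteParams fields.
# Field names are illustrative, not the project's actual API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteParams:
    # Replaces the old fixed politeness_policy delay with explicit limits;
    # None means "no limit enforced" for that dimension.
    requests_per_minute: Optional[int] = None
    input_tokens_per_minute: Optional[int] = None
    output_tokens_per_minute: Optional[int] = None
```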

Most APIs specify in a successful API response how many input and output tokens were processed for that request.
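Steps 2 and 3 could be sketched as a sliding-window limiter that records the per-request token counts reported by the API and sleeps once any limit is saturated. All names here are illustrative, not the actual RemoteInferenceEngine implementation:

```python
# Minimal sketch of a combined RPM/TPM limiter over a 60-second
# sliding window. Names are hypothetical, not the real engine's API.
import asyncio
import time
from collections import deque


class RateLimiter:
    def __init__(self, rpm: int, input_tpm: int, output_tpm: int):
        self._rpm = rpm
        self._input_tpm = input_tpm
        self._output_tpm = output_tpm
        # Each entry: (timestamp, input_tokens, output_tokens)
        self._window: deque = deque()

    def _prune(self, now: float) -> None:
        # Drop entries older than 60 seconds from the sliding window.
        while self._window and now - self._window[0][0] > 60.0:
            self._window.popleft()

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Called after a successful response, using the token counts
        # most APIs include in the response body.
        self._window.append((time.monotonic(), input_tokens, output_tokens))

    async def wait_if_needed(self) -> None:
        # Block until the next request would stay under every limit.
        while True:
            now = time.monotonic()
            self._prune(now)
            requests = len(self._window)
            in_toks = sum(e[1] for e in self._window)
            out_toks = sum(e[2] for e in self._window)
            if (requests < self._rpm
                    and in_toks < self._input_tpm
                    and out_toks < self._output_tpm):
                return
            # Sleep until the oldest entry ages out of the window.
            await asyncio.sleep(60.0 - (now - self._window[0][0]) + 0.01)
```

The engine would call `await limiter.wait_if_needed()` before each request and `limiter.record(...)` after each successful response, so whichever limit fills up first is the one that triggers the sleep.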
