[Feature] Add support for token-based rate-limiting in RemoteInferenceEngine #1457

@jgreer013

Description

Feature request

Add support for token-based rate-limiting for both input and output tokens when using RemoteInferenceEngine.

Motivation / references

In addition to request-based limits, many API providers also enforce token-based limits on both input and output tokens:

https://console.anthropic.com/settings/limits
https://platform.openai.com/settings/organization/limits

If we don't allow users to configure these limits, their requests eventually fail once a limit is hit.

Your contribution

  1. Update remote_params.py to add input and output token limits, and replace the custom politeness policy with a requests-per-minute (RPM) setting. The current field:
    politeness_policy: float = 0.0
  2. Update RemoteInferenceEngine to use "requests per minute" rather than the custom politeness policy
  3. Update RemoteInferenceEngine to sleep when either the RPM or the TPM limit is reached (whichever comes first), replacing the current fixed delay:
    await asyncio.sleep(remote_params.politeness_policy)
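For step 1, the new configuration could look like the following. This is a hypothetical sketch; the actual field names and structure of remote_params.py may differ:

```python
# Hypothetical sketch of the proposed RemoteParams fields.
# Field names are illustrative, not the project's actual API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemoteParams:
    # Replaces the old fixed politeness_policy delay with explicit limits;
    # None means "no limit enforced" for that dimension.
    requests_per_minute: Optional[int] = None
    input_tokens_per_minute: Optional[int] = None
    output_tokens_per_minute: Optional[int] = None
```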

Most APIs specify in a successful API response how many input and output tokens were processed for that request.
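Steps 2 and 3 could be sketched as a sliding-window limiter that records the per-request token counts reported by the API and sleeps once any limit is saturated. All names here are illustrative, not the actual RemoteInferenceEngine implementation:

```python
# Minimal sketch of a combined RPM/TPM limiter over a 60-second
# sliding window. Names are hypothetical, not the real engine's API.
import asyncio
import time
from collections import deque


class RateLimiter:
    def __init__(self, rpm: int, input_tpm: int, output_tpm: int):
        self._rpm = rpm
        self._input_tpm = input_tpm
        self._output_tpm = output_tpm
        # Each entry: (timestamp, input_tokens, output_tokens)
        self._window: deque = deque()

    def _prune(self, now: float) -> None:
        # Drop entries older than 60 seconds from the sliding window.
        while self._window and now - self._window[0][0] > 60.0:
            self._window.popleft()

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Called after a successful response, using the token counts
        # most APIs include in the response body.
        self._window.append((time.monotonic(), input_tokens, output_tokens))

    async def wait_if_needed(self) -> None:
        # Block until the next request would stay under every limit.
        while True:
            now = time.monotonic()
            self._prune(now)
            requests = len(self._window)
            in_toks = sum(e[1] for e in self._window)
            out_toks = sum(e[2] for e in self._window)
            if (requests < self._rpm
                    and in_toks < self._input_tpm
                    and out_toks < self._output_tpm):
                return
            # Sleep until the oldest entry ages out of the window.
            await asyncio.sleep(60.0 - (now - self._window[0][0]) + 0.01)
```

The engine would call `await limiter.wait_if_needed()` before each request and `limiter.record(...)` after each successful response, so whichever limit fills up first is the one that triggers the sleep.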
