Rate Limiting

Rate limiting prevents API abuse by controlling the number of requests allowed within a specific time frame. You can configure rate limits by setting hourly, daily, and monthly total limits.

This ensures fair usage and helps maintain system performance and stability.

# Set hourly, daily, and monthly request limits
ai-gateway serve \
    --rate-hourly 1000 \
    --rate-daily 1000 \
    --rate-monthly 1000

Or in config.yaml:

rate_limit:
  hourly: 100
  daily: 1000
  monthly: 10000
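
To illustrate how three limits of this kind can apply at once, here is a minimal Python sketch of a fixed-window counter. It is not the gateway's actual implementation, which this page does not describe. The `FixedWindowLimiter` class, the `WINDOWS` table, and the 30-day month are assumptions made for the example. A request is allowed only when every configured window still has room.

```python
import time
from dataclasses import dataclass, field

# Window lengths in seconds; the 30-day month is an assumption of this sketch.
WINDOWS = {"hourly": 3600, "daily": 86400, "monthly": 30 * 86400}


@dataclass
class FixedWindowLimiter:
    """Hypothetical limiter that enforces several fixed windows together."""

    limits: dict  # e.g. {"hourly": 100, "daily": 1000, "monthly": 10000}
    counts: dict = field(default_factory=dict)  # (name, window index) -> requests seen

    def allow(self, now=None):
        """Return True and count the request if every window has room."""
        now = time.time() if now is None else now
        keys = [(name, int(now // WINDOWS[name])) for name in self.limits]
        # Reject if any single window is already full; rejected requests are not counted.
        if any(self.counts.get(key, 0) >= self.limits[key[0]] for key in keys):
            return False
        for key in keys:
            self.counts[key] = self.counts.get(key, 0) + 1
        return True
```

The sketch never discards counters for expired windows, so a long-running service would need to prune them. A fixed window also allows short bursts around a window boundary, and a sliding window or token bucket is a common alternative when that matters.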

When a rate limit is exceeded, the API will return a 429 (Too Many Requests) response.
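
A client can treat a 429 as a signal to slow down rather than as a hard failure. The following Python sketch shows one way to do that with exponential backoff. The `call_with_retry` helper and its `send` callable are hypothetical names for this example, not part of the gateway. `send` stands in for whatever HTTP call your client makes and is assumed to return a `(status_code, body)` tuple.

```python
import time


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out.

    send is a hypothetical zero-argument callable returning a
    (status_code, body) tuple; wrap your HTTP client of choice in it.
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, then 2x, 4x, ... between attempts.
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the last attempt; hand the 429 back to the caller.
    return status, body
```

Short backoff helps with brief bursts. Once an hourly, daily, or monthly total is used up, retries keep failing until that window resets, so it is usually better to surface the error to the caller at that point.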

Why Rate Limiting Matters

Prevents excessive LLM API usage: Controls the number of requests per user to avoid resource exhaustion.
Optimizes model inference efficiency: Ensures that LLM requests are processed smoothly without congestion.
