Rate Limiting

Rate limiting is an essential mechanism to prevent API abuse by controlling the number of requests allowed within a specific time frame. You can configure rate limits by setting hourly, daily, and monthly request totals.

This ensures fair usage and helps maintain system performance and stability.

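# Limit to 1000 requests per hour, per day, and per month.
# Each line except the last ends with a backslash so the shell
# treats the flags as part of one command.
ai-gateway serve \
  --rate-hourly 1000 \
  --rate-daily 1000 \
  --rate-monthly 1000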

Or in config.yaml:

rate_limit:
  hourly: 100
  daily: 1000
  monthly: 10000
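
To make the interaction of the three limits concrete, here is a minimal Python sketch of a fixed-window counter. It is not the gateway's actual implementation, which this page does not describe. It assumes that a request is allowed only when every configured window is still under its limit, that windows are aligned to fixed clock intervals, and that a month is approximated as 30 days. The names `FixedWindowLimiter`, `allow`, and `WINDOW_SECONDS` are hypothetical.

```python
import time
from dataclasses import dataclass, field

# Window lengths in seconds. Treating a month as 30 days is an assumption.
WINDOW_SECONDS = {"hourly": 3600, "daily": 86400, "monthly": 30 * 86400}


@dataclass
class FixedWindowLimiter:
    """Hypothetical limiter: counts requests per fixed time window."""

    limits: dict  # e.g. {"hourly": 100, "daily": 1000, "monthly": 10000}
    counts: dict = field(default_factory=dict)  # (window name, window index) -> count

    def allow(self, now=None):
        """Return True and record the request if every window has capacity."""
        now = time.time() if now is None else now
        # Identify the current window for each configured limit.
        keys = {
            name: (name, int(now // WINDOW_SECONDS[name])) for name in self.limits
        }
        # Reject without counting if any single window is already full.
        for name, key in keys.items():
            if self.counts.get(key, 0) >= self.limits[name]:
                return False
        # Otherwise count the request against all windows.
        # Old window entries are never pruned in this sketch.
        for key in keys.values():
            self.counts[key] = self.counts.get(key, 0) + 1
        return True
```

With the values from the config above, the hourly limit of 100 would normally be the first to trigger, while the daily and monthly totals cap sustained usage over longer periods.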

When a rate limit is exceeded, the API will return a 429 (Too Many Requests) response.
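
A client can treat a 429 as a signal to slow down rather than as a hard failure. The sketch below shows one possible approach in Python: retry with exponential backoff, and honor a `Retry-After` header when one is present. Whether the gateway sends `Retry-After` is an assumption; this page only states that it returns 429. The function `call_with_retry` and its `send` callback are hypothetical names, and `send` stands in for whatever HTTP call your client makes.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429.

    send() is a hypothetical callback returning (status, headers, body),
    where headers is a dict. Returns the final (status, body) pair.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # Out of attempts: hand the 429 back to the caller.
        # Prefer the server's hint if it is a number of seconds (assumed header);
        # otherwise fall back to exponential backoff: 1s, 2s, 4s, ...
        try:
            delay = float(headers.get("Retry-After"))
        except (TypeError, ValueError):
            delay = base_delay * 2 ** attempt
        sleep(delay)
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. Because the limits are hourly or longer, a production client would likely also cap the total wait and surface the error to the user once that cap is reached.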

Why Rate Limiting Matters

  • Prevents excessive LLM API usage: Controls the number of requests per user to avoid resource exhaustion.
  • Optimizes model inference efficiency: Ensures that LLM requests are processed smoothly without congestion.