Load Balancer Routing

Distribute requests across multiple models to ensure high availability, balance load, and optimize performance. Use real-time metrics to select the best available model.

Use Case

  • High availability requirements
  • Load distribution across models
  • Performance optimization
  • Failover scenarios

Configuration

{
  "model": "router/dynamic",
  "router": {
    "type": "conditional",
    "routes": [
      {
        "name": "Balanced",
        "targets": {
          "$any": [
            "openai/gpt-4.1-nano",
            "gemini/gemini-2.0-flash",
            "bedrock/llama3-2-3b-instruct-v1.0"
          ],
          "sort_by": "requests",
          "sort_order": "min"
        }
      }
    ]
  }
}

How It Works

  1. Model Pool: The $any list defines three candidate models for load distribution (GPT-4.1-nano, Gemini-2.0-flash, Llama3-2-3b)
  2. Load Balancing: For each request, the router picks the model with the lowest current value of the requests metric (sort_by: requests, sort_order: min)
  3. Automatic Distribution: Over time, traffic spreads across the pool because each request shifts load toward the least-used model
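The min-load rule described above can be sketched in a few lines of Python. This is an illustrative model of the selection step, not the gateway's actual implementation, and the in-flight counts shown are hypothetical:

```python
def select_model(pool, in_flight):
    """Pick the model with the fewest in-flight requests
    (equivalent to sort_by: requests, sort_order: min)."""
    return min(pool, key=lambda model: in_flight.get(model, 0))

# Pool mirrors the $any list in the configuration above.
pool = [
    "openai/gpt-4.1-nano",
    "gemini/gemini-2.0-flash",
    "bedrock/llama3-2-3b-instruct-v1.0",
]

# Hypothetical snapshot of current load per model.
in_flight = {
    "openai/gpt-4.1-nano": 7,
    "gemini/gemini-2.0-flash": 2,
}

# The Bedrock model has no recorded load (0), so it wins the min sort.
print(select_model(pool, in_flight))
```

Models absent from the metrics map are treated as having zero load here, which is one reasonable convention; a real gateway may instead exclude unhealthy or unknown targets.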

Variables Used

  • requests: Current load metric (used for sorting)

Customization

  • Adjust health thresholds
  • Add more models to the pool
  • Use different sorting strategies (ttft, price, etc.)
  • Implement weighted load balancing
  • Add geographic considerations
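As a sketch of the weighted load balancing mentioned above, the following Python snippet distributes traffic by fixed weights instead of live metrics. The weights are hypothetical, and the gateway's actual syntax for weighted routing may differ:

```python
import random

def weighted_choice(weights):
    """Pick a model with probability proportional to its weight."""
    models = list(weights)
    return random.choices(models, weights=[weights[m] for m in models], k=1)[0]

# Hypothetical traffic split across the same pool.
weights = {
    "openai/gpt-4.1-nano": 0.5,                 # half of traffic
    "gemini/gemini-2.0-flash": 0.3,
    "bedrock/llama3-2-3b-instruct-v1.0": 0.2,
}

# Simulate 10,000 requests and count where they land.
counts = {m: 0 for m in weights}
for _ in range(10_000):
    counts[weighted_choice(weights)] += 1
```

Static weights are useful when you want a deliberate traffic split (e.g. for cost control or gradual rollout) rather than purely metric-driven balancing.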