Overview

The Router Settings screen controls how requests are routed to model deployments and how failures are handled. It allows platform administrators to configure load balancing strategies, fallback models, and retry behavior to improve reliability and performance. Changes made here apply tenant-wide and affect how requests are resolved at runtime.

Tabs

The Router Settings screen is organized into three tabs:
  • Loadbalancing
  • Fallbacks
  • General
Use the tabs to configure different aspects of routing behavior.

Loadbalancing

The Loadbalancing tab defines how requests are distributed across multiple deployments.

Routing Strategy

Select a strategy used to balance traffic. Available options include:
  • simple-shuffle
    Randomly selects a deployment from the available list. Simple and fast.
  • least-busy
    Routes requests to the deployment with the lowest number of ongoing requests.
  • usage-based-routing (deprecated)
    Routes to the deployment with the lowest token usage. Deprecated; prefer usage-based-routing-v2.
  • usage-based-routing-v2
    Improved version of usage-based routing with better tracking.
  • latency-based-routing
    Routes to the deployment with the lowest observed latency over a sliding window.
  • cost-based-routing
    Routes to the deployment with the lowest cost per token.
Choose a strategy based on your primary goal (throughput, latency, or efficiency).
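
To make the differences between strategies concrete, the following is a minimal Python sketch of how a selection step could work. It is not the product's implementation: the `Deployment` record, its field names, and the `pick` function are hypothetical, and the usage-based strategies are omitted because the source does not describe how token usage is tracked.

```python
import random
from dataclasses import dataclass


@dataclass
class Deployment:
    # Hypothetical record; field names are illustrative only.
    name: str
    in_flight: int = 0            # ongoing requests (least-busy)
    avg_latency_s: float = 0.0    # observed latency (latency-based-routing)
    cost_per_token: float = 0.0   # price (cost-based-routing)


def pick(deployments, strategy, rng=random):
    """Return one deployment according to the named strategy."""
    if not deployments:
        raise ValueError("no deployments available")
    if strategy == "simple-shuffle":
        return rng.choice(deployments)
    if strategy == "least-busy":
        return min(deployments, key=lambda d: d.in_flight)
    if strategy == "latency-based-routing":
        return min(deployments, key=lambda d: d.avg_latency_s)
    if strategy == "cost-based-routing":
        return min(deployments, key=lambda d: d.cost_per_token)
    raise ValueError(f"unsupported strategy: {strategy}")


pool = [
    Deployment("a", in_flight=4, avg_latency_s=0.9, cost_per_token=0.002),
    Deployment("b", in_flight=1, avg_latency_s=1.4, cost_per_token=0.001),
    Deployment("c", in_flight=2, avg_latency_s=0.6, cost_per_token=0.003),
]
```

With this sample pool, least-busy and cost-based-routing both pick deployment "b", while latency-based-routing picks "c". The same pool can therefore resolve differently depending on the goal you optimize for.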

Enable Tag Filtering

When enabled, routing decisions can consider tags associated with requests.
  • Useful for environment-aware or workload-aware routing
  • Common use cases include separating production, staging, or experimental traffic
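
As an illustration of tag-aware routing, the sketch below narrows the candidate deployments before a strategy is applied. The matching rule (a deployment must carry every tag on the request) is an assumption made for this example, as are the `filter_by_tags` function and the deployment names; the source does not specify the exact matching semantics.

```python
def filter_by_tags(deployments, request_tags):
    """Keep deployments whose tags include every tag on the request.

    Assumption: a request without tags matches all deployments.
    """
    wanted = set(request_tags)
    if not wanted:
        return list(deployments)
    return [d for d in deployments if wanted <= set(d["tags"])]


# Hypothetical deployments used only for illustration.
deployments = [
    {"name": "chat-prod", "tags": ["production"]},
    {"name": "chat-staging", "tags": ["staging"]},
    {"name": "chat-exp", "tags": ["staging", "experimental"]},
]

staging_only = filter_by_tags(deployments, ["staging"])
```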

Reliability & Retries

This section defines how failures are handled during request execution.

Allowed Fails

Number of times a deployment can fail before it is placed into cooldown.

Cooldown Time

Length of time a failed deployment is excluded before it becomes eligible again.
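
The interaction between Allowed Fails and Cooldown Time can be sketched as follows. This is a simplified model, not the product's code: the `CooldownTracker` class is hypothetical, times are plain numbers in seconds supplied by the caller, and it assumes cooldown starts once the failure count exceeds the allowed value.

```python
class CooldownTracker:
    """Tracks failures per deployment and excludes unhealthy ones for a while."""

    def __init__(self, allowed_fails, cooldown_time_s):
        self.allowed_fails = allowed_fails
        self.cooldown_time_s = cooldown_time_s
        self._fails = {}            # deployment name -> failures since last cooldown
        self._cooldown_until = {}   # deployment name -> time it becomes eligible again

    def record_failure(self, name, now):
        self._fails[name] = self._fails.get(name, 0) + 1
        # Assumption: cooldown begins when failures exceed allowed_fails.
        if self._fails[name] > self.allowed_fails:
            self._cooldown_until[name] = now + self.cooldown_time_s
            self._fails[name] = 0

    def is_available(self, name, now):
        return now >= self._cooldown_until.get(name, 0.0)


tracker = CooldownTracker(allowed_fails=2, cooldown_time_s=30)
tracker.record_failure("a", now=0)
tracker.record_failure("a", now=1)   # still within the allowed number of failures
tracker.record_failure("a", now=2)   # third failure places "a" into cooldown
```

In this example, deployment "a" is excluded from time 2 until time 32, after which it becomes eligible again.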

Number of Retries

Maximum number of retry attempts for a failed request.

Timeout

Maximum time allowed for a request before it is considered failed.

Retry After

Minimum waiting period before retrying a failed request.

Retry Policy

Optional custom retry behavior for different error types.
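
The retry settings above can be pictured with a small retry loop. This is a sketch under stated assumptions: `call_with_retries`, the two exception classes, and the shape of `retry_policy` (a mapping from error type to a retry limit that overrides Number of Retries) are all hypothetical. Timeout is not modeled directly; the sketch assumes a timed-out request surfaces as an exception raised by `send`.

```python
import time


class RequestTimeout(Exception):
    """Hypothetical error raised when a request exceeds the Timeout setting."""


class RateLimited(Exception):
    """Hypothetical error raised when the provider throttles the request."""


def call_with_retries(send, num_retries, retry_after_s, retry_policy=None,
                      sleep=time.sleep):
    """Call send(), retrying on failure up to the applicable limit."""
    retry_policy = retry_policy or {}
    retries_used = 0
    while True:
        try:
            return send()
        except Exception as exc:
            # A per-error-type limit, if present, overrides num_retries.
            limit = retry_policy.get(type(exc), num_retries)
            if retries_used >= limit:
                raise
            retries_used += 1
            sleep(retry_after_s)  # Retry After: minimum wait before the next attempt


def flaky(failures, exc_type):
    """Build a callable that fails `failures` times, then returns 'ok'."""
    state = {"left": failures}

    def send():
        if state["left"] > 0:
            state["left"] -= 1
            raise exc_type("simulated failure")
        return "ok"

    return send
```

For example, `call_with_retries(flaky(2, RequestTimeout), num_retries=2, retry_after_s=0.5)` succeeds on the third attempt, while passing `retry_policy={RateLimited: 0}` makes a `RateLimited` error fail immediately without retrying.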

Model Group Alias

Defines aliases for model groups that can be referenced in routing and configuration. This allows:
  • Abstracting underlying model sets
  • Safer model migrations
  • Stable references across applications
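
A short sketch shows why aliases make migrations safer. The alias and model group names here are invented for illustration, and `resolve_model_group` is a hypothetical helper, not part of the product.

```python
def resolve_model_group(requested, aliases):
    """Map an alias to its underlying model group; unknown names pass through."""
    return aliases.get(requested, requested)


# Hypothetical alias table: applications always request "chat-default".
aliases = {"chat-default": "model-group-v1"}
before = resolve_model_group("chat-default", aliases)

# Migrating means changing only the alias target, not the applications.
aliases["chat-default"] = "model-group-v2"
after = resolve_model_group("chat-default", aliases)
```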

Fallbacks

The Fallbacks tab allows you to define fallback models for improved availability.

Add Fallbacks

Select Add Fallbacks to configure fallback behavior.

Primary Model

The model that users initially request.

Fallback Models

One or more models used when the primary model is unavailable or fails.
Order matters: fallback models are tried sequentially (first, second, third, and so on).
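
The sequential behavior described above can be sketched as a simple loop. The function `complete_with_fallbacks`, the fake model caller, and the model names are hypothetical and exist only to show the ordering.

```python
def complete_with_fallbacks(model, fallbacks, call_model):
    """Try the primary model, then each fallback in the configured order."""
    failed = []
    for candidate in [model, *fallbacks.get(model, [])]:
        try:
            return candidate, call_model(candidate)
        except Exception:
            failed.append(candidate)
    raise RuntimeError(f"all models failed: {failed}")


def fake_call(name):
    # Simulates an outage of the primary model and the first fallback.
    if name in {"primary", "backup-1"}:
        raise ConnectionError(f"{name} unavailable")
    return f"response from {name}"


# Hypothetical configuration: primary model -> ordered fallback list.
fallbacks = {"primary": ["backup-1", "backup-2"]}
served_by, response = complete_with_fallbacks("primary", fallbacks, fake_call)
```

Here the request is served by "backup-2", because "primary" and "backup-1" are tried first and both fail.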

General Settings

The General tab under Router Settings defines global constraints and limits that apply to all routed requests. These settings help control concurrency, payload size, and proxy-level protection across the tenant.

max_parallel_requests

Maximum number of concurrent requests allowed per API key.
Purpose:
  • Prevents a single key or client from overwhelming the system
  • Enforces fair usage at the key level
If not set, no per-key concurrency limit is enforced.

global_max_parallel_requests

Maximum number of concurrent requests allowed across the entire proxy instance.
Purpose:
  • Acts as a hard upper bound for total concurrency
  • Protects the proxy from overload during traffic spikes
This limit is evaluated before per-key limits.
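
The evaluation order of the two concurrency limits can be illustrated with a small admission check. This is a sketch only: the `ParallelLimiter` class and its return strings are hypothetical, and a value of `None` stands for a setting that is not set.

```python
import threading


class ParallelLimiter:
    """Admits requests while respecting a global and a per-key concurrency cap."""

    def __init__(self, global_max=None, per_key_max=None):
        self.global_max = global_max      # global_max_parallel_requests
        self.per_key_max = per_key_max    # max_parallel_requests
        self._lock = threading.Lock()
        self._total = 0
        self._per_key = {}

    def try_acquire(self, api_key):
        with self._lock:
            # The global limit is checked first, then the per-key limit.
            if self.global_max is not None and self._total >= self.global_max:
                return "rejected: global limit"
            used = self._per_key.get(api_key, 0)
            if self.per_key_max is not None and used >= self.per_key_max:
                return "rejected: per-key limit"
            self._total += 1
            self._per_key[api_key] = used + 1
            return "accepted"

    def release(self, api_key):
        """Call once a previously accepted request has finished."""
        with self._lock:
            self._total -= 1
            self._per_key[api_key] -= 1


limiter = ParallelLimiter(global_max=3, per_key_max=2)
results = [
    limiter.try_acquire("key-a"),  # accepted
    limiter.try_acquire("key-a"),  # accepted
    limiter.try_acquire("key-a"),  # key-a is at its per-key limit
    limiter.try_acquire("key-b"),  # accepted, proxy now at the global limit
    limiter.try_acquire("key-b"),  # key-b has headroom, but the proxy is full
]
```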

max_request_size_mb

Maximum allowed size of a request payload.
Behavior:
  • Requests exceeding this size are rejected immediately
  • Applies to the full request body
Use this to protect against unusually large prompts or inputs.

max_response_size_mb

Maximum allowed size of a response payload.
Behavior:
  • Responses larger than this limit are rejected
  • Prevents excessive output from impacting stability
Useful for controlling verbose model responses or accidental large outputs.
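
Both size limits reduce to the same kind of check on the payload. The sketch below is illustrative: `within_size_limit` is a hypothetical helper, and it assumes 1 MB means 1024 × 1024 bytes, which the source does not state.

```python
BYTES_PER_MB = 1024 * 1024  # Assumption: the limits use binary megabytes.


def within_size_limit(payload: bytes, limit_mb):
    """Return True if the payload fits the limit, or if no limit is set."""
    if limit_mb is None:
        return True
    return len(payload) <= limit_mb * BYTES_PER_MB


exactly_one_mb = b"x" * BYTES_PER_MB
slightly_over = exactly_one_mb + b"x"
```

The same helper could be applied to the full request body against max_request_size_mb and to the response body against max_response_size_mb.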

pass_through_endpoints

Defines provider-specific pass-through endpoints that bypass standard routing.
Use cases:
  • Access provider-native APIs
  • Support custom or non-standard endpoints
  • Enable advanced integrations not covered by default routes
When configured, requests targeting these endpoints are forwarded directly to the provider with required headers and constraints.

Managing Settings

For each setting:
  • Enter a value in the Value column
  • Select Update to apply the change
  • Use the delete action to remove a previously set value

Status Indicator

  • Not Set – The setting is currently not enforced
  • Set values become active immediately after saving

Best Practices

  • Set conservative global limits to protect the proxy
  • Use per-key parallel limits for multi-tenant environments
  • Restrict request and response sizes for production workloads
  • Review and document any pass-through endpoints carefully
  • Test changes in non-production environments before rollout

Related Sections

  • Loadbalancing – Control request distribution
  • Fallbacks – Improve reliability with backup models
  • Logs – Observe request rejection and limit enforcement
  • Virtual Keys – Apply limits at the key level


Save or Reset Changes

  • Save Changes
    Applies all configuration updates.
  • Reset
    Reverts unsaved changes.

Common Use Cases

  • Improve reliability with fallback models
  • Reduce latency using intelligent routing
  • Balance load across multiple deployments
  • Protect against flapping or unstable deployments
  • Standardize routing behavior across teams

Best Practices

  • Start with simple-shuffle unless you need optimization
  • Add fallback models for critical workloads
  • Keep retry limits conservative to avoid cascading failures
  • Monitor routing behavior using logs and usage dashboards
  • Test routing changes in non-production environments first

Related Sections

  • Models – Manage available deployments
  • Usage & Logs – Observe routing outcomes
  • Virtual Keys – Apply access and constraints
  • Guardrails – Enforce safe and compliant responses