Routing

LLMGrid’s router selects the best model for each request based on the configured strategy and policies. Routes are tenant‑scoped and can be customized per project.

Strategies

  • cost_optimized — Prefer the lowest price within SLA and policy constraints.
  • latency_optimized — Prefer models with the fastest recent response times.
  • performance_graded — Route using eval scores by task type (e.g., coding, summarization).
  • balanced — Weighted blend of cost, latency, and performance.
  • direct — Always call a specific model (bypass scoring).
  • custom — Your scoring hook for full control.

Strategies respect rate limits, budgets, allow/deny lists, redaction/guardrails, and fallbacks configured on the route, project, or tenant.
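
At its simplest, a route names a strategy and its candidate models; policies are layered on top, as in the examples below. A minimal sketch using cost_optimized (the fields mirror the balanced example that follows):

{
  "name": "chat_cheap",
  "type": "chat",
  "strategy": "cost_optimized",
  "models": [
    { "provider": "openai", "id": "gpt-4o-mini" },
    { "provider": "anthropic", "id": "claude-3-haiku" }
  ]
}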

Example: Balanced Strategy (JSON editor)

{
  "name": "chat_default",
  "type": "chat",
  "strategy": "balanced",
  "weights": { "cost": 0.4, "latency": 0.3, "performance": 0.3 },
  "models": [
    { "provider": "openai", "id": "gpt-4o-mini" },
    { "provider": "anthropic", "id": "claude-3-haiku" }
  ],
  "policies": {
    "rateLimit": { "rpm": 300, "rpsBurst": 20 },
    "budget": { "maxTokensPerDay": 1000000, "perUserCap": 50000 },
    "retries": { "maxAttempts": 2, "backoff": "exponential" },
    "fallback": {
      "onTimeout": { "toModel": { "provider": "openai", "id": "gpt-4o-mini" }, "maxAttempts": 1 }
    },
    "redaction": { "enabled": true, "pii": { "email": "mask", "phone": "mask" } }
  }
}
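
Requests then reference the route by name. The exact request shape is defined in the API Reference; as a rough sketch, assuming a top-level route field and chat-style messages:

{
  "route": "chat_default",
  "messages": [
    { "role": "user", "content": "Summarize this incident report in two sentences." }
  ]
}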

A/B Experiments

Split traffic across models for experiments or gradual rollouts.

{
  "name": "chat_ab_test",
  "type": "chat",
  "strategy": "balanced",
  "models": [
    { "provider": "openai", "id": "gpt-4o-mini", "weight": 0.5 },
    { "provider": "anthropic", "id": "claude-3-haiku", "weight": 0.5 }
  ]
}
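
For a gradual rollout, skew the per‑model weights toward the incumbent instead of splitting evenly, e.g. a 90/10 ramp:

{
  "name": "chat_rollout",
  "type": "chat",
  "strategy": "balanced",
  "models": [
    { "provider": "openai", "id": "gpt-4o-mini", "weight": 0.9 },
    { "provider": "anthropic", "id": "claude-3-haiku", "weight": 0.1 }
  ]
}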

Sticky Sessions

Keep a user on the same model within a session for UX consistency.

{
  "name": "chat_sticky",
  "type": "chat",
  "strategy": "balanced",
  "sticky": { "enabled": true, "key": "session_id", "ttlSeconds": 3600 },
  "models": [
    { "provider": "openai", "id": "gpt-4o-mini" },
    { "provider": "anthropic", "id": "claude-3-haiku" }
  ]
}
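
For stickiness to work, each request must carry the configured key. A sketch, assuming the key is passed in request metadata (the metadata field name is illustrative):

{
  "route": "chat_sticky",
  "metadata": { "session_id": "sess_7f3a92" },
  "messages": [
    { "role": "user", "content": "Continue where we left off." }
  ]
}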

Fallbacks & Retries

Configure resilience for timeouts and transient failures.

{
  "policies": {
    "retries": { "maxAttempts": 2, "backoff": "exponential" },
    "fallback": {
      "onTimeout": { "toModel": { "provider": "anthropic", "id": "claude-3-haiku" }, "maxAttempts": 1 },
      "onError":   { "toRoute": "chat_direct_backup" }
    }
  }
}
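
The onError target above is an ordinary route; a minimal sketch of chat_direct_backup using the direct strategy:

{
  "name": "chat_direct_backup",
  "type": "chat",
  "strategy": "direct",
  "models": [
    { "provider": "openai", "id": "gpt-4o-mini" }
  ]
}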

Custom Routing (Pseudo-code)

Define a scoring function that blends cost, latency, and performance with your own weights.

function score(candidate, context) {
  // candidate: { modelId, costPer1kTokens }
  // context:   { latency: { [modelId]: ms }, performance: { [modelId]: 0..1 }, taskType }
  const perf  = context.performance[candidate.modelId] ?? 0.7; // eval score, already 0..1
  const latMs = context.latency[candidate.modelId] ?? 1000;    // recent latency in ms
  const price = candidate.costPer1kTokens;                     // e.g. USD per 1k tokens

  // Normalize latency and price to 0..1 so raw units don't swamp the blend.
  const latScore  = Math.min(1, 500 / latMs);     // 500 ms or faster scores 1.0
  const costScore = Math.min(1, 0.0005 / price);  // $0.0005/1k tokens or cheaper scores 1.0

  // Example: emphasize performance for "coding" tasks
  const perfWeight = context.taskType === "coding" ? 0.7 : 0.5;

  return (perf * perfWeight) + (latScore * 0.2) + (costScore * 0.3);
}
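
A route opts into the hook via the custom strategy. How the function is attached is deployment‑specific; the scoringHook field below is illustrative only:

{
  "name": "chat_custom",
  "type": "chat",
  "strategy": "custom",
  "scoringHook": "score",
  "models": [
    { "provider": "openai", "id": "gpt-4o-mini" },
    { "provider": "anthropic", "id": "claude-3-haiku" }
  ]
}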

Policy Interactions

Rate limits and budgets are evaluated first; overages block or throttle the request. Allow/deny lists restrict which models can be selected. Redaction and guardrails may block certain requests or post‑filter outputs. Streaming preserves the router's decision while partial tokens are sent to the client.
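
For example, an allow list scopes the candidate set before any scoring runs (the allowModels key is illustrative, mirroring the policies blocks above):

{
  "policies": {
    "allowModels": [
      { "provider": "openai", "id": "gpt-4o-mini" },
      { "provider": "anthropic", "id": "claude-3-haiku" }
    ]
  }
}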

Best Practices

  • Start with balanced; tune weights against your latency and cost targets.
  • Use A/B experiments to validate improvements before a full rollout.
  • Enable sticky sessions for conversational UX; disable them for batch jobs.
  • Add fallbacks to a reliable, cost‑effective baseline model.
  • Track metrics (latency, errors, tokens, cost) and refine strategies regularly.

Next: see API Reference for the request format and Observability for routing analytics.