Configuration Guide

The system uses two configuration files for runtime setup:

  • config/models.yaml (required): Registers available models and their backend adapters

  • config/routing.yaml (optional): Configures traffic distribution between local and remote deployments

The two files have distinct roles: models.yaml defines the candidate set of adapters, while routing.yaml only adjusts the weights of adapters that are already registered. Without routing.yaml, the system falls back to the default weights from models.yaml or environment variables (typically 1.0).

1. Environment Variables and Priority

Variable Substitution

The system supports environment variable placeholders in YAML:

  • ${VAR}: Substitutes the value of VAR from the environment

  • ${VAR:-default}: Substitutes VAR if it is set; otherwise uses default
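
For illustration, the substitution logic can be sketched in Python roughly as follows (a simplified sketch, not the project's actual loader; in particular, how an unset ${VAR} without a default is handled may differ):

# Illustrative sketch; not the project's loader.
import os
import re

_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def expand_env(text: str) -> str:
    """Replace ${VAR} and ${VAR:-default} with values from the environment."""
    def substitute(match: re.Match) -> str:
        var, default = match.group(1), match.group(2)
        value = os.environ.get(var)
        if value is not None:
            return value
        # ${VAR:-default} falls back to its default; in this sketch an unset bare ${VAR} becomes "".
        return default if default is not None else ""
    return _PLACEHOLDER.sub(substitute, text)

print(expand_env("base_url: ${LOCAL_BASE_URL:-http://localhost:8000}"))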

Configuration Priority

Environment variables (.env or system) > routing.yaml/models.yaml > code defaults
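
To make the precedence concrete, a small hypothetical helper (resolve is an illustrative name, not part of the project) could look like this:

# Hypothetical illustration of the precedence chain; not actual project code.
import os

def resolve(env_name: str, yaml_value, code_default):
    """Environment (.env or system) wins, then the YAML value, then the code default."""
    env_value = os.environ.get(env_name)
    if env_value is not None:
        return env_value
    if yaml_value is not None:
        return yaml_value
    return code_default

# E.g. with HEALTH_CHECK unset and no value in routing.yaml, the code default applies.
print(resolve("HEALTH_CHECK", None, 0))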

2. models.yaml (Required)

Registers models and adapters at startup. Each entry describes a model identifier and its backend configuration (base_url, api_key, capabilities, aliases, etc.).

Example:

# config/models.yaml
models:
  - id: llama-3.3-70b-instruct
    name: Llama 3.3 70B Instruct
    provider: llama
    base_url: ${LLAMA_BASE_URL}
    api_key: ${LLAMA_API_KEY}
    context_length: 131072
    max_output_length: 8192
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, top_k, min_p, max_tokens, stop, seed]
    aliases: ["llama-3.3-70b-instruct"]
    route:
      - kind: llama
        weight: 1.0
        base_url: ${LLAMA_BASE_URL}
        api_key: ${LLAMA_API_KEY}

  - id: llama-4-scout
    name: Llama 4 Scout
    provider: vllm
    base_url: ${LOCAL_BASE_URL}
    provider_model_id: "/models/meta-llama_Llama-4-Scout-17B-16E"  # backend expects this id
    context_length: 262144
    max_output_length: 16384
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, top_k, min_p, max_tokens, stop, seed]
    aliases: ["/models/meta-llama_Llama-4-Scout-17B-16E"]
    route:
      - kind: vllm
        weight: 1.0
        base_url: ${LOCAL_BASE_URL}

Key Points:

  • id: Public model ID exposed by the API (what clients use to call the model)

  • provider_model_id: The actual model name sent to the backend provider (e.g., vLLM/freeinference’s /models/...). If omitted, uses id

  • aliases: Additional public aliases that are registered alongside id to point to the same adapter

  • provider: Determines adapter type (llama, vllm, deepseek, gemini, etc.)

  • The /v1/models endpoint generates its response dynamically from the registered adapters
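
As a rough sketch of the mapping implied by the example above (REGISTRY and backend_model_id are illustrative names, not the project's code), resolving a public id or alias to the backend model name might look like:

# Sketch: how a public id or alias might resolve to the backend model name.
REGISTRY = {
    # public id / alias -> name forwarded to the backend
    "llama-3.3-70b-instruct": "llama-3.3-70b-instruct",  # no provider_model_id, so id is reused
    "llama-4-scout": "/models/meta-llama_Llama-4-Scout-17B-16E",
    "/models/meta-llama_Llama-4-Scout-17B-16E": "/models/meta-llama_Llama-4-Scout-17B-16E",  # alias
}

def backend_model_id(public_id: str) -> str:
    if public_id not in REGISTRY:
        raise ValueError(f"unknown model: {public_id}")
    return REGISTRY[public_id]

assert backend_model_id("llama-4-scout") == "/models/meta-llama_Llama-4-Scout-17B-16E"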

3. routing.yaml (Optional)

Controls traffic distribution between local and remote deployments with optional health monitoring.

Fixed-Ratio Strategy

  • Set routing_strategy: fixed

  • Control local traffic percentage via routing_parameter.local_fraction (0.0–1.0)

  • Weights are distributed equally within local and remote groups
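
A minimal sketch of that weight derivation, assuming an equal split within each group (fixed_weights is an illustrative name, not the project's function):

# Illustrative only: split local_fraction equally among local adapters,
# and the remainder equally among remote adapters.
def fixed_weights(local_count: int, remote_count: int, local_fraction: float) -> dict:
    weights = {f"local-{i}": local_fraction / local_count for i in range(local_count)}
    weights.update(
        {f"remote-{i}": (1.0 - local_fraction) / remote_count for i in range(remote_count)}
    )
    return weights

# With local_fraction: 0.6 and one adapter per group, weights come out 0.6 local / 0.4 remote.
print(fixed_weights(local_count=1, remote_count=1, local_fraction=0.6))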

Health Checking (Optional)

  • health_check: N: Sends a GET request to each endpoint's /health every N seconds

  • Unhealthy endpoints temporarily get weight 0

  • Endpoints recover automatically when health checks succeed

  • Set to 0 or omit to disable health checking
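
The described behavior could be pictured roughly like this (a simplified sketch using the requests library; function and argument names are illustrative, not the project's implementation):

# Rough sketch of the described behaviour, not the actual implementation.
import time
import requests

def health_loop(endpoints: dict, weights: dict, default_weights: dict, interval: int = 30):
    """Poll GET /health on every endpoint each `interval` seconds and adjust its weight."""
    while True:
        for name, base_url in endpoints.items():
            try:
                healthy = requests.get(f"{base_url}/health", timeout=2).status_code == 200
            except requests.RequestException:
                healthy = False
            # Unhealthy endpoints are weighted 0; healthy ones recover their configured weight.
            weights[name] = default_weights[name] if healthy else 0.0
        time.sleep(interval)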

Example: Hybrid Deployment (60% local / 40% remote)

# config/routing.yaml
routing_strategy: fixed
routing_parameter:
  local_fraction: 0.6
timeout: 2
health_check: 30
logging:
  output: output.log
local_deployment:
  - endpoint: ${LOCAL_BASE_URL:-http://localhost:8000}
    models:
      - llama-3.3-70b-instruct
      - llama-4-scout
remote_deployment:
  - endpoint: ${LLAMA_BASE_URL}
    models:
      - llama-3.3-70b-instruct

How It Works:

  • At startup, the RoutingManager applies the 60/40 weights to the registered adapters

  • If the local endpoint becomes unhealthy, the weights adjust automatically (0% local, 100% remote)

  • System falls back gracefully to maintain service availability

Local-Only Deployment

Simply omit routing.yaml to use default weights from models.yaml (typically 1.0 for all adapters).

4. Running the System

Set Environment Variables:

export LOCAL_BASE_URL=http://localhost:8000
export LLAMA_BASE_URL=https://api.llama.com/compat/v1
export LLAMA_API_KEY=sk-...

Start the Server:

python -m serving.servers.app
# Or use uvicorn/pm2/supervisor for production
# uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080

Verify Operation:

  • GET /v1/models - Returns available models

  • POST /v1/chat/completions - Routes requests based on configured ratios

  • GET /routing - Shows current routing configuration
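
For a quick smoke test with Python's requests library (assuming the server listens on http://localhost:8080 as in the uvicorn example above and that all three endpoints return JSON; adjust to your deployment):

# Smoke test using the requests library.
import requests

BASE = "http://localhost:8080"  # adjust host/port to your deployment

print(requests.get(f"{BASE}/v1/models").json())   # registered models
print(requests.get(f"{BASE}/routing").json())     # current routing configuration

resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())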

5. FAQ

Q: What if routing.yaml conflicts with models.yaml? A: routing.yaml only adjusts weights; it doesn’t add/remove adapters. The candidate set comes from models.yaml and environment variables.

Q: How to disable health checks? A: Set health_check: 0 or omit the field entirely.

Q: Can I use other routing strategies? A: Currently only fixed is built-in. You can add new strategies in routing/strategies.py and configure them in routing.yaml.
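
Purely as a hypothetical sketch of what such an extension might look like (the class name, method, and signature are invented for illustration; check routing/strategies.py for the actual interface):

# Hypothetical shape only; the real interface lives in routing/strategies.py.
class WeightedRandomStrategy:
    """Illustrative custom strategy: arbitrary per-group weights instead of a fixed local fraction."""

    def __init__(self, weights: dict):
        self.weights = weights  # e.g. {"local": 0.3, "remote": 0.7}

    def assign_weights(self, adapter_groups: dict) -> dict:
        # Spread each group's share equally across the adapters in that group.
        result = {}
        for group, adapters in adapter_groups.items():
            share = self.weights.get(group, 0.0)
            for adapter in adapters:
                result[adapter] = share / len(adapters)
        return result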

Q: What happens during failover? A: The system automatically tries alternative adapters when the primary fails, ensuring continuous service availability.