# Configuration Guide

The system uses two configuration files for runtime setup:

- `config/models.yaml` (required): Registers available models and their backend adapters
- `config/routing.yaml` (optional): Configures traffic distribution between local and remote deployments

These files are independent: `models.yaml` provides the candidate set of adapters, while `routing.yaml` adjusts weights on top of the registered adapters. Without `routing.yaml`, the system uses default weights from `models.yaml` or environment variables (typically 1.0).

## 1. Environment Variables and Priority

### Variable Substitution

The system supports environment variable placeholders in YAML:

- `${VAR}`: Reads `VAR` from the environment
- `${VAR:-default}`: Reads `VAR` from the environment, falling back to `default` if it is not set

A minimal loader sketch illustrating this substitution follows the `models.yaml` example below.

### Configuration Priority

Environment variables (`.env` or system) > `routing.yaml`/`models.yaml` > code defaults

## 2. models.yaml (Required)

Registers models and adapters at startup. Each entry describes a model identifier and its backend configuration (base_url, api_key, capabilities, aliases, etc.).

Example:

```yaml
# config/models.yaml
models:
  - id: llama-3.3-70b-instruct
    name: Llama 3.3 70B Instruct
    provider: llama
    base_url: ${LLAMA_BASE_URL}
    api_key: ${LLAMA_API_KEY}
    context_length: 131072
    max_output_length: 8192
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, top_k, min_p, max_tokens, stop, seed]
    aliases: ["llama-3.3-70b-instruct"]
    route:
      - kind: llama
        weight: 1.0
        base_url: ${LLAMA_BASE_URL}
        api_key: ${LLAMA_API_KEY}

  - id: llama-4-scout
    name: Llama 4 Scout
    provider: vllm
    base_url: ${LOCAL_BASE_URL}
    provider_model_id: "/models/meta-llama_Llama-4-Scout-17B-16E"  # backend expects this id
    context_length: 262144
    max_output_length: 16384
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, top_k, min_p, max_tokens, stop, seed]
    aliases: ["/models/meta-llama_Llama-4-Scout-17B-16E"]
    route:
      - kind: vllm
        weight: 1.0
        base_url: ${LOCAL_BASE_URL}
```

### Key Points:

- `id`: Public model ID exposed by the API (what clients use to call the model)
- `provider_model_id`: The actual model name sent to the backend provider (e.g., vLLM/freeinference's `/models/...`). If omitted, `id` is used
- `aliases`: Additional public aliases registered alongside `id`; they resolve to the same adapter
- `provider`: Determines the adapter type (`llama`, `vllm`, `deepseek`, `gemini`, etc.)
- The `/v1/models` endpoint generates its response dynamically from the registered adapters
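The `${LLAMA_BASE_URL}`-style placeholders in the example above are resolved against the environment when the file is loaded, following the substitution rules from Section 1. Below is a minimal sketch of that substitution step, assuming PyYAML; the function names and error handling are illustrative, not the project's actual loader.

```python
# Illustrative only: expand ${VAR} and ${VAR:-default} placeholders before parsing.
# Assumes PyYAML; this is not the project's actual configuration loader.
import os
import re

import yaml

_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::-([^}]*))?\}")

def _expand(text: str) -> str:
    """Replace ${VAR} with the environment value and ${VAR:-default} with its fallback."""
    def replace(match: re.Match) -> str:
        var, default = match.group(1), match.group(2)
        value = os.environ.get(var, default)
        if value is None:
            raise KeyError(f"environment variable {var!r} is not set and has no default")
        return value
    return _PLACEHOLDER.sub(replace, text)

def load_yaml_config(path: str) -> dict:
    """Read a YAML file, substitute placeholders, and parse the result."""
    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(_expand(fh.read()))

# Example: list the public model IDs registered in models.yaml
config = load_yaml_config("config/models.yaml")
print([entry["id"] for entry in config["models"]])
```

Substituting on the raw text before parsing keeps the loader simple, but it also means values containing YAML-special characters may need quoting in the file.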
## 3. routing.yaml (Optional)

Controls traffic distribution between local and remote deployments, with optional health monitoring.

### Fixed-Ratio Strategy

- Set `routing_strategy: fixed`
- Control the local traffic percentage via `routing_parameter.local_fraction` (0.0–1.0)
- Weights are distributed equally within the local and remote groups

### Health Checking (Optional)

- `health_check: N`: Sends a GET request to `/health` every N seconds
- Unhealthy endpoints temporarily get weight 0
- Endpoints recover automatically when health checks succeed again
- Set to 0 or omit to disable health checking

### Example: Hybrid Deployment (60% local / 40% remote)

```yaml
# config/routing.yaml
routing_strategy: fixed
routing_parameter:
  local_fraction: 0.6
  timeout: 2
health_check: 30
logging:
  output: output.log

local_deployment:
  - endpoint: ${LOCAL_BASE_URL:-http://localhost:8000}
    models:
      - llama-3.3-70b-instruct
      - llama-4-scout

remote_deployment:
  - endpoint: ${LLAMA_BASE_URL}
    models:
      - llama-3.3-70b-instruct
```

### How It Works:

- At startup, `RoutingManager` applies the 60/40 weights to the registered adapters
- If the local endpoint becomes unhealthy, weights adjust automatically (0% local, 100% remote)
- The system falls back gracefully to maintain service availability

### Local-Only Deployment

Simply omit `routing.yaml` to use the default weights from `models.yaml` (typically 1.0 for all adapters).

## 4. Running the System

### Set Environment Variables:

```bash
export LOCAL_BASE_URL=http://localhost:8000
export LLAMA_BASE_URL=https://api.llama.com/compat/v1
export LLAMA_API_KEY=sk-...
```

### Start the Server:

```bash
python -m serving.servers.app

# Or use uvicorn/pm2/supervisor for production
# uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080
```

### Verify Operation:

- `GET /v1/models` - Returns the available models
- `POST /v1/chat/completions` - Routes requests according to the configured ratios
- `GET /routing` - Shows the current routing configuration

A short smoke-test script for these endpoints appears at the end of this guide.

## 5. FAQ

**Q: What if routing.yaml conflicts with models.yaml?**
A: `routing.yaml` only adjusts weights; it does not add or remove adapters. The candidate set comes from `models.yaml` and environment variables.

**Q: How do I disable health checks?**
A: Set `health_check: 0` or omit the field entirely.

**Q: Can I use other routing strategies?**
A: Currently only `fixed` is built in. You can add new strategies in `routing/strategies.py` and configure them in `routing.yaml`.

**Q: What happens during failover?**
A: The system automatically tries alternative adapters when the primary fails, ensuring continuous service availability.
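As a quick end-to-end check of the endpoints listed in Section 4, a short script along the following lines can be run against a local instance. The port 8080, the `requests` dependency, and the exact response shapes are assumptions based on the uvicorn example above, not guarantees of the project.

```python
# Illustrative smoke test; assumes the server listens on http://localhost:8080
# (as in the uvicorn example) and that the `requests` package is installed.
import requests

BASE_URL = "http://localhost:8080"

# 1. List registered models (generated dynamically from the models.yaml adapters).
models = requests.get(f"{BASE_URL}/v1/models", timeout=10)
models.raise_for_status()
print("models:", models.json())

# 2. Send a chat completion; the request is routed according to the configured weights.
completion = requests.post(
    f"{BASE_URL}/v1/chat/completions",
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
completion.raise_for_status()
print("completion:", completion.json())

# 3. Inspect the routing configuration currently in effect.
routing = requests.get(f"{BASE_URL}/routing", timeout=10)
routing.raise_for_status()
print("routing:", routing.json())
```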