# Adding a New Model (OpenRouter-Compatible) This guide explains how to add support for new LLM models and providers to the hybridInference gateway while keeping full OpenRouter/OpenAI API compatibility. Reference PR for provider integration example: https://github.com/HarvardSys/hybridInference/pull/34 ## Overview There is a single guide for both needs. Depending on your case, follow one of: 1) Use an existing provider adapter (vLLM, DeepSeek, Gemini, Llama, Zhipu) — only YAML + env changes. 2) Integrate a new provider — add an adapter class + small registration changes, then YAML + env. ## Quick Start ### Adding a Model with an Existing Provider If the provider is already supported (vLLM, DeepSeek, Gemini, Llama, Zhipu), you only need to add configuration: 1. **Add model configuration** in `config/models.yaml`: ```yaml models: - id: your-model-id name: Your Model Display Name provider: existing_provider # e.g., "gemini", "deepseek" provider_model_id: "actual-provider-model-id" base_url: ${PROVIDER_BASE_URL} api_key: ${PROVIDER_API_KEY} quantization: "bf16" input_modalities: ["text"] output_modalities: ["text"] context_length: 8192 max_output_length: 4096 supports_tools: true supports_structured_output: true supported_params: [temperature, top_p, max_tokens, stop] pricing: prompt: "0" completion: "0" image: "0" request: "0" input_cache_reads: "0" input_cache_writes: "0" route: - kind: existing_provider weight: 1.0 base_url: ${PROVIDER_BASE_URL} api_key: ${PROVIDER_API_KEY} ``` 2. **Configure environment variables** in `.env`: ```ini PROVIDER_BASE_URL=https://api.provider.com/v1 PROVIDER_API_KEY=your-api-key ``` 3. **Restart the server** to load the new model. 4. **Verify**: ```bash curl http://localhost:8080/v1/models | jq curl -s -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{"model":"your-model-id","messages":[{"role":"user","content":"Hello"}]}' | jq ``` Note on aliases: If you want the model to appear under an OpenRouter-style slug (e.g., a local vLLM path), add it in `aliases` so clients can call either name. OFFLOAD behavior: When `OFFLOAD=1`, the service will remove all local adapters whose `base_url` matches `LOCAL_BASE_URL` and only use remote adapters. See “Hybrid Routing & OFFLOAD” below. ## Adding a New Provider If you need to integrate a completely new provider, follow these steps: ### Step 1: Create Provider Adapter Create a new file in `serving/adapters/` (e.g., `serving/adapters/your_provider.py`): ```python import json from collections.abc import AsyncGenerator from typing import Any from serving.stream import done_sentinel, make_final_usage_chunk from serving.utils.tokens import estimate_prompt_tokens, estimate_text_tokens from .base import BaseAdapter, UsageInfo class YourProviderAdapter(BaseAdapter): """Adapter for YourProvider API. This adapter translates OpenAI-compatible requests to YourProvider's API format and normalizes responses back to OpenAI format. """ async def chat_completion( self, messages: list[dict[str, Any]], **params ) -> dict[str, Any]: """Execute a non-streaming chat completion request. Args: messages: List of chat messages in OpenAI format. **params: Additional parameters (temperature, max_tokens, etc.). Returns: OpenAI-compatible response dictionary. """ # Validate and filter parameters validated_params = self.validate_params(params) # Build provider-specific request payload payload = { "model": self.config.provider_model_id or self.config.id, "messages": messages, **validated_params, } # Add optional features if params.get("tools"): payload["tools"] = params["tools"] if params.get("response_format", {}).get("type") == "json_object": payload["response_format"] = {"type": "json_object"} # Set up authentication headers headers = { "Content-Type": "application/json", "Authorization": f"Bearer {self.config.api_key}", } # Make API request data = await self.http.json_post_with_retry( f"{self.config.base_url}/chat/completions", json=payload, headers=headers, ) # Extract usage information usage = UsageInfo( prompt_tokens=data.get("usage", {}).get("prompt_tokens", 0), completion_tokens=data.get("usage", {}).get("completion_tokens", 0), total_tokens=data.get("usage", {}).get("total_tokens", 0), ) # Fallback to estimation if provider doesn't return usage if usage.total_tokens == 0: content = data["choices"][0]["message"].get("content", "") prompt_tokens = estimate_prompt_tokens(messages) completion_tokens = estimate_text_tokens(content) usage = UsageInfo( prompt_tokens=int(prompt_tokens), completion_tokens=int(completion_tokens), total_tokens=int(prompt_tokens + completion_tokens), ) # Extract tool calls if present tool_calls = None if "tool_calls" in data["choices"][0]["message"]: tool_calls = data["choices"][0]["message"]["tool_calls"] # Return normalized response return self.format_response( content=data["choices"][0]["message"].get("content", ""), model=self.config.id, usage=usage, tool_calls=tool_calls, finish_reason=data["choices"][0].get("finish_reason", "stop"), ) async def stream_chat_completion( self, messages: list[dict[str, Any]], **params ) -> AsyncGenerator[str, None]: """Execute a streaming chat completion request. Args: messages: List of chat messages in OpenAI format. **params: Additional parameters. Yields: Server-sent event formatted strings. """ validated_params = self.validate_params(params) payload = { "model": self.config.provider_model_id or self.config.id, "messages": messages, "stream": True, **validated_params, } if params.get("tools"): payload["tools"] = params["tools"] headers = { "Content-Type": "application/json", "Authorization": f"Bearer {self.config.api_key}", } total_content = "" prompt_tokens = 0 async for line in self.http.stream_post( f"{self.config.base_url}/chat/completions", json=payload, headers=headers, ): if not line.startswith("data: "): continue if line == "data: [DONE]": # Emit final usage chunk using shared helper for consistency yield make_final_usage_chunk( model=self.config.id, messages=messages, total_content=total_content, prompt_tokens_override=prompt_tokens or None, finish_reason="stop", ) yield done_sentinel() break try: chunk_data = json.loads(line[6:]) # Extract usage if available if "usage" in chunk_data: prompt_tokens = chunk_data["usage"].get("prompt_tokens", prompt_tokens) # Extract and yield content delta if chunk_data["choices"][0]["delta"].get("content"): content = chunk_data["choices"][0]["delta"]["content"] total_content += content yield self.format_stream_chunk(content, self.config.id) except json.JSONDecodeError: continue ``` ### Step 2: Register the Adapter 1. **Update `serving/adapters/__init__.py`**: ```python from .your_provider import YourProviderAdapter __all__ = [ # ... existing exports "YourProviderAdapter", ] ``` 2. **Update `serving/servers/registry.py`**: Add the import at the top: ```python from serving.adapters import ( # ... existing imports YourProviderAdapter, ) ``` Add a branch in the `_make_adapter` function: ```python def _make_adapter(kind: str, cfg: dict[str, Any]): """Construct a provider adapter from a kind string and model config.""" model_cfg = ModelConfig(**cfg) # ... existing conditions if kind == "your_provider": return YourProviderAdapter(model_cfg) raise ValueError(f"Unknown adapter kind: {kind}") ``` ### Step 3: Add Model Configuration Add your model to `config/models.yaml`: ```yaml models: - id: your-model-id name: Your Model Name provider: your_provider provider_model_id: "actual-model-id" base_url: ${YOUR_PROVIDER_BASE_URL} api_key: ${YOUR_PROVIDER_API_KEY} quantization: "bf16" input_modalities: ["text"] output_modalities: ["text"] context_length: 8192 max_output_length: 4096 supports_tools: true supports_structured_output: true supported_params: [temperature, top_p, max_tokens, stop] aliases: [] # Optional alternative names pricing: prompt: "0" completion: "0" image: "0" request: "0" input_cache_reads: "0" input_cache_writes: "0" route: - kind: your_provider weight: 1.0 base_url: ${YOUR_PROVIDER_BASE_URL} api_key: ${YOUR_PROVIDER_API_KEY} ``` ### Step 4: Configure Environment Variables Add to `.env`: ```ini YOUR_PROVIDER_BASE_URL=https://api.yourprovider.com/v1 YOUR_PROVIDER_API_KEY=your-api-key-here ``` ### Step 5: Test the Integration ```bash # Start the server uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080 # List available models curl http://localhost:8080/v1/models # Test chat completion curl http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "your-model-id", "messages": [{"role": "user", "content": "Hello!"}] }' ``` Streaming test: ```bash curl -N -s -X POST http://localhost:8080/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "your-model-id", "messages": [{"role": "user", "content": "Stream test"}], "stream": true, "max_tokens": 64 }' ``` ## Configuration Reference ### ModelConfig Fields | Field | Type | Required | Description | |-------|------|----------|-------------| | `id` | string | Yes | Unique model identifier | | `name` | string | Yes | Display name | | `provider` | string | Yes | Provider/adapter kind | | `base_url` | string | Yes | API endpoint base URL | | `api_key` | string | No | API authentication key | | `provider_model_id` | string | No | Provider's model identifier (overrides `id`) | | `aliases` | list[string] | No | Alternative names for routing | | `quantization` | string | No | Quantization format (default: "bf16") | | `input_modalities` | list[string] | No | Input types: "text", "image" | | `output_modalities` | list[string] | No | Output types: "text" | | `context_length` | int | No | Maximum context window (default: 8192) | | `max_output_length` | int | No | Maximum output tokens (default: 4096) | | `supports_tools` | bool | No | Function calling support (default: false) | | `supports_structured_output` | bool | No | JSON mode support (default: false) | | `supported_params` | list[string] | No | Allowed parameter names | | `pricing` | dict | No | Cost information per token/request | ### Route Configuration Routes allow multiple endpoints for a single model with weighted distribution: ```yaml route: # Local vLLM deployment - kind: vllm weight: 0.7 # 70% of traffic base_url: http://localhost:8000 provider_model_id: "/models/local-model" # Remote API fallback - kind: your_provider weight: 0.3 # 30% of traffic base_url: https://api.provider.com api_key: ${API_KEY} ``` ### Hybrid Routing & OFFLOAD - Weighted routes are applied at registration time. You can further adjust weights or override distribution centrally using `config/routing.yaml` (loaded by the `RoutingManager`). - If `OFFLOAD=1`, the bootstrap process removes any adapter whose `base_url` matches `LOCAL_BASE_URL`, effectively forcing traffic to remote providers only. This lets you flip from hybrid to remote-only during incidents without editing YAML. ## BaseAdapter API Reference All adapters must inherit from `BaseAdapter` and implement: ### Required Methods ```python async def chat_completion( self, messages: list[dict[str, Any]], **params ) -> dict[str, Any]: """Execute non-streaming chat completion.""" pass async def stream_chat_completion( self, messages: list[dict[str, Any]], **params ) -> AsyncGenerator[str, None]: """Execute streaming chat completion.""" pass ``` ### Utility Methods ```python def validate_params(self, params: dict[str, Any]) -> dict[str, Any]: """Validate and clamp parameters to supported ranges.""" def format_response( self, content: str, model: str, usage: UsageInfo | None = None, tool_calls: list[dict] | None = None, finish_reason: str = "stop", ) -> dict[str, Any]: """Format response in OpenAI-compatible format.""" def format_stream_chunk( self, content: str, model: str, finish_reason: str | None = None ) -> str: """Format SSE chunk for streaming responses.""" ``` ### Available Attributes ```python self.config # ModelConfig instance self.http # AsyncHTTPClient for API requests ``` ## Advanced Features ### Multi-Modal Support For models supporting images: ```yaml input_modalities: ["text", "image"] ``` Implement image handling in your adapter's `chat_completion` method. ### Tool/Function Calling For models supporting function calls: ```yaml supports_tools: true ``` Parse and include `tool_calls` in the response: ```python tool_calls = [] if "function_call" in data: tool_calls.append({ "id": f"call_{int(time.time() * 1000)}", "type": "function", "function": { "name": data["function_call"]["name"], "arguments": data["function_call"]["arguments"], }, }) return self.format_response( content=content, model=self.config.id, usage=usage, tool_calls=tool_calls, ) ``` ### Structured Output (JSON Mode) For models supporting JSON schema: ```yaml supports_structured_output: true ``` Handle `response_format` parameter: ```python if params.get("response_format", {}).get("type") == "json_object": payload["response_format"] = {"type": "json_object"} ``` ### Rate Limiting (Optional) If the provider has known token policies and you want server-side fairness controls, add a limiter configuration in `serving/servers/bootstrap.py` alongside existing examples (Gemini/DeepSeek/Zhipu). This enables per-model queues, burst control, and persistent counters. ## Examples ### Example 1: OpenAI-Compatible Provider See `serving/adapters/deepseek.py` for a simple OpenAI-compatible implementation. ### Example 2: Custom API Format See `serving/adapters/gemini.py` for handling non-standard API formats with message conversion. ### Example 3: Local Deployment See `serving/adapters/vllm.py` for integrating local inference servers. ## Troubleshooting ### Model Not Appearing in `/v1/models` - Check `config/models.yaml` syntax - Verify environment variables are set - Check server logs for configuration errors - If using `aliases`, verify the canonical `id` appears exactly once and aliases do not collide with other model IDs. ### Authentication Failures - Verify API key in `.env` - Check if `${ENV_VAR}` expansion is working - Ensure `base_url` is correct ### Response Format Errors - Ensure `format_response()` returns OpenAI-compatible structure - Validate `UsageInfo` fields are integers - Check `finish_reason` is valid: "stop", "length", "content_filter" - For streaming, ensure the first non-empty content chunk is emitted as soon as available so TTFT metrics record properly. ### Streaming Issues - Ensure chunks are SSE-formatted: `data: {json}\n\n` - Send final usage chunk before `data: [DONE]` - Handle JSON parsing errors gracefully ## Best Practices 1. **Error handling**: Use `self.http.json_post_with_retry()` and handle provider faults gracefully with useful messages. 2. **Usage accounting**: Prefer provider usage when available; otherwise fall back to `estimate_prompt_tokens()`/`estimate_text_tokens()`. 3. **Streaming helpers**: Use `format_stream_chunk()`, `make_final_usage_chunk()`, and `done_sentinel()` for consistent SSE. 4. **Type safety**: Provide full type hints and keep request/response shapes aligned with `serving/schemas.py`. 5. **Testing**: Exercise both streaming and non-streaming paths, and try large prompts to validate token clamping. 6. **Docs & style**: Keep adapter docstrings and comments in English (Google style). Avoid provider-specific logic in shared code. 7. **Env expansion**: Use `${ENV_VAR}` in YAML instead of hardcoding secrets or endpoints; let dotenv load `.env`. ## See Also - [OpenRouter Gateway Overview](openrouter.md) — Architecture and endpoints - [Routing Configuration](routing.md) — Central weight overrides and strategy - [Configuration Guide](configuration.md) — Environment and YAML configuration - [API Reference](../api-reference.md) — Top-level usage and running