# Adding a New Model (OpenRouter-Compatible)

This guide explains how to add support for new LLM models and providers to the hybridInference gateway while keeping full OpenRouter/OpenAI API compatibility.

Reference PR for a provider integration example: https://github.com/HarvardSys/hybridInference/pull/34
## Overview

This one guide covers both cases. Depending on yours, follow one of:

- **Use an existing provider adapter** (vLLM, DeepSeek, Gemini, Llama, Zhipu): YAML and env changes only.
- **Integrate a new provider**: add an adapter class and small registration changes, then YAML and env.
## Quick Start

### Adding a Model with an Existing Provider

If the provider is already supported (vLLM, DeepSeek, Gemini, Llama, Zhipu), you only need to add configuration.

Add the model configuration in `config/models.yaml`:
```yaml
models:
  - id: your-model-id
    name: Your Model Display Name
    provider: existing_provider  # e.g., "gemini", "deepseek"
    provider_model_id: "actual-provider-model-id"
    base_url: ${PROVIDER_BASE_URL}
    api_key: ${PROVIDER_API_KEY}
    quantization: "bf16"
    input_modalities: ["text"]
    output_modalities: ["text"]
    context_length: 8192
    max_output_length: 4096
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    pricing:
      prompt: "0"
      completion: "0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: existing_provider
        weight: 1.0
        base_url: ${PROVIDER_BASE_URL}
        api_key: ${PROVIDER_API_KEY}
```
Configure environment variables in `.env`:

```
PROVIDER_BASE_URL=https://api.provider.com/v1
PROVIDER_API_KEY=your-api-key
```
Restart the server to load the new model, then verify:

```bash
curl http://localhost:8080/v1/models | jq

curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model-id","messages":[{"role":"user","content":"Hello"}]}' | jq
```
**Note on aliases:** If you want the model to appear under an OpenRouter-style slug (e.g., a local vLLM path), list it under `aliases` so clients can call either name, as in the sketch below.
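For example, a hypothetical entry that exposes a local vLLM path under an OpenRouter-style slug (field semantics follow the ModelConfig table below):

```yaml
models:
  - id: your-org/your-model          # OpenRouter-style slug clients will see
    provider: vllm
    provider_model_id: "/models/your-model"
    aliases: ["/models/your-model"]  # callers may use either name
```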
**OFFLOAD behavior:** When `OFFLOAD=1`, the service removes all local adapters whose `base_url` matches `LOCAL_BASE_URL` and uses only remote adapters. See "Hybrid Routing & OFFLOAD" below.
## Adding a New Provider

If you need to integrate a completely new provider, follow these steps.

### Step 1: Create Provider Adapter

Create a new file in `serving/adapters/` (e.g., `serving/adapters/your_provider.py`):
```python
import json
from collections.abc import AsyncGenerator
from typing import Any

from serving.stream import done_sentinel, make_final_usage_chunk
from serving.utils.tokens import estimate_prompt_tokens, estimate_text_tokens

from .base import BaseAdapter, UsageInfo


class YourProviderAdapter(BaseAdapter):
    """Adapter for YourProvider API.

    This adapter translates OpenAI-compatible requests to YourProvider's
    API format and normalizes responses back to OpenAI format.
    """

    async def chat_completion(
        self, messages: list[dict[str, Any]], **params
    ) -> dict[str, Any]:
        """Execute a non-streaming chat completion request.

        Args:
            messages: List of chat messages in OpenAI format.
            **params: Additional parameters (temperature, max_tokens, etc.).

        Returns:
            OpenAI-compatible response dictionary.
        """
        # Validate and filter parameters
        validated_params = self.validate_params(params)

        # Build provider-specific request payload
        payload = {
            "model": self.config.provider_model_id or self.config.id,
            "messages": messages,
            **validated_params,
        }

        # Add optional features
        if params.get("tools"):
            payload["tools"] = params["tools"]
        if params.get("response_format", {}).get("type") == "json_object":
            payload["response_format"] = {"type": "json_object"}

        # Set up authentication headers
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.config.api_key}",
        }

        # Make API request
        data = await self.http.json_post_with_retry(
            f"{self.config.base_url}/chat/completions",
            json=payload,
            headers=headers,
        )

        # Extract usage information
        usage = UsageInfo(
            prompt_tokens=data.get("usage", {}).get("prompt_tokens", 0),
            completion_tokens=data.get("usage", {}).get("completion_tokens", 0),
            total_tokens=data.get("usage", {}).get("total_tokens", 0),
        )

        # Fall back to estimation if the provider doesn't return usage
        if usage.total_tokens == 0:
            content = data["choices"][0]["message"].get("content", "")
            prompt_tokens = estimate_prompt_tokens(messages)
            completion_tokens = estimate_text_tokens(content)
            usage = UsageInfo(
                prompt_tokens=int(prompt_tokens),
                completion_tokens=int(completion_tokens),
                total_tokens=int(prompt_tokens + completion_tokens),
            )

        # Extract tool calls if present
        tool_calls = None
        if "tool_calls" in data["choices"][0]["message"]:
            tool_calls = data["choices"][0]["message"]["tool_calls"]

        # Return normalized response
        return self.format_response(
            content=data["choices"][0]["message"].get("content", ""),
            model=self.config.id,
            usage=usage,
            tool_calls=tool_calls,
            finish_reason=data["choices"][0].get("finish_reason", "stop"),
        )

    async def stream_chat_completion(
        self, messages: list[dict[str, Any]], **params
    ) -> AsyncGenerator[str, None]:
        """Execute a streaming chat completion request.

        Args:
            messages: List of chat messages in OpenAI format.
            **params: Additional parameters.

        Yields:
            Server-sent event formatted strings.
        """
        validated_params = self.validate_params(params)
        payload = {
            "model": self.config.provider_model_id or self.config.id,
            "messages": messages,
            "stream": True,
            **validated_params,
        }
        if params.get("tools"):
            payload["tools"] = params["tools"]
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.config.api_key}",
        }

        total_content = ""
        prompt_tokens = 0
        async for line in self.http.stream_post(
            f"{self.config.base_url}/chat/completions",
            json=payload,
            headers=headers,
        ):
            if not line.startswith("data: "):
                continue
            if line == "data: [DONE]":
                # Emit the final usage chunk via the shared helper for consistency
                yield make_final_usage_chunk(
                    model=self.config.id,
                    messages=messages,
                    total_content=total_content,
                    prompt_tokens_override=prompt_tokens or None,
                    finish_reason="stop",
                )
                yield done_sentinel()
                break
            try:
                chunk_data = json.loads(line[6:])
                # Record usage if the provider includes it in the chunk
                if "usage" in chunk_data:
                    prompt_tokens = chunk_data["usage"].get("prompt_tokens", prompt_tokens)
                # Extract and yield the content delta; guard against
                # usage-only chunks that carry an empty choices list
                choices = chunk_data.get("choices") or []
                if choices and choices[0].get("delta", {}).get("content"):
                    content = choices[0]["delta"]["content"]
                    total_content += content
                    yield self.format_stream_chunk(content, self.config.id)
            except json.JSONDecodeError:
                continue
```
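Before wiring the adapter into the registry, it can help to smoke-test it in isolation. A minimal sketch, assuming the `ModelConfig` import path and required fields shown here (adapt both to the real definitions) and a reachable provider endpoint:

```python
import asyncio

from serving.adapters.your_provider import YourProviderAdapter
from serving.servers.registry import ModelConfig  # assumed import path

# Hypothetical config; the authoritative field list is in the table below.
cfg = ModelConfig(
    id="your-model-id",
    name="Your Model",
    provider="your_provider",
    base_url="https://api.yourprovider.com/v1",
    api_key="your-api-key",
)

async def main() -> None:
    adapter = YourProviderAdapter(cfg)
    resp = await adapter.chat_completion(
        [{"role": "user", "content": "ping"}], max_tokens=8
    )
    # format_response() returns an OpenAI-compatible body
    print(resp["choices"][0]["message"]["content"])

asyncio.run(main())
```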
### Step 2: Register the Adapter

Update `serving/adapters/__init__.py`:
```python
from .your_provider import YourProviderAdapter

__all__ = [
    # ... existing exports
    "YourProviderAdapter",
]
```
Then update `serving/servers/registry.py`. Add the import at the top:

```python
from serving.adapters import (
    # ... existing imports
    YourProviderAdapter,
)
```
Add a branch in the `_make_adapter` function:

```python
def _make_adapter(kind: str, cfg: dict[str, Any]):
    """Construct a provider adapter from a kind string and model config."""
    model_cfg = ModelConfig(**cfg)
    # ... existing conditions
    if kind == "your_provider":
        return YourProviderAdapter(model_cfg)
    raise ValueError(f"Unknown adapter kind: {kind}")
```
### Step 3: Add Model Configuration

Add your model to `config/models.yaml`:

```yaml
models:
  - id: your-model-id
    name: Your Model Name
    provider: your_provider
    provider_model_id: "actual-model-id"
    base_url: ${YOUR_PROVIDER_BASE_URL}
    api_key: ${YOUR_PROVIDER_API_KEY}
    quantization: "bf16"
    input_modalities: ["text"]
    output_modalities: ["text"]
    context_length: 8192
    max_output_length: 4096
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    aliases: []  # Optional alternative names
    pricing:
      prompt: "0"
      completion: "0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: your_provider
        weight: 1.0
        base_url: ${YOUR_PROVIDER_BASE_URL}
        api_key: ${YOUR_PROVIDER_API_KEY}
```
### Step 4: Configure Environment Variables

Add to `.env`:

```
YOUR_PROVIDER_BASE_URL=https://api.yourprovider.com/v1
YOUR_PROVIDER_API_KEY=your-api-key-here
```
### Step 5: Test the Integration

```bash
# Start the server
uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080

# List available models
curl http://localhost:8080/v1/models

# Test chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
Streaming test:

```bash
curl -N -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Stream test"}],
    "stream": true,
    "max_tokens": 64
  }'
```
## Configuration Reference

### ModelConfig Fields

| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Unique model identifier |
| `name` | string | Yes | Display name |
| `provider` | string | Yes | Provider/adapter kind |
| `base_url` | string | Yes | API endpoint base URL |
| `api_key` | string | No | API authentication key |
| `provider_model_id` | string | No | Provider's model identifier (overrides `id`) |
| `aliases` | list[string] | No | Alternative names for routing |
| `quantization` | string | No | Quantization format (default: "bf16") |
| `input_modalities` | list[string] | No | Input types: "text", "image" |
| `output_modalities` | list[string] | No | Output types: "text" |
| `context_length` | int | No | Maximum context window (default: 8192) |
| `max_output_length` | int | No | Maximum output tokens (default: 4096) |
| `supports_tools` | bool | No | Function calling support (default: false) |
| `supports_structured_output` | bool | No | JSON mode support (default: false) |
| `supported_params` | list[string] | No | Allowed parameter names |
| `pricing` | dict | No | Cost information per token/request |
### Route Configuration

Routes allow multiple endpoints for a single model with weighted traffic distribution:

```yaml
route:
  # Local vLLM deployment
  - kind: vllm
    weight: 0.7  # 70% of traffic
    base_url: http://localhost:8000
    provider_model_id: "/models/local-model"
  # Remote API fallback
  - kind: your_provider
    weight: 0.3  # 30% of traffic
    base_url: https://api.provider.com
    api_key: ${API_KEY}
```
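The gateway applies these weights at registration time (see the next subsection). Purely as an illustration of the idea, not the project's actual routing code, weighted selection can be as small as a `random.choices` call:

```python
import random

# Illustrative route table mirroring the YAML above
routes = [
    {"kind": "vllm", "weight": 0.7},
    {"kind": "your_provider", "weight": 0.3},
]

def pick_route(routes: list[dict]) -> dict:
    """Sample one route proportionally to its weight."""
    return random.choices(routes, weights=[r["weight"] for r in routes], k=1)[0]

# ~70% of calls land on the local vLLM route
print(pick_route(routes)["kind"])
```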
### Hybrid Routing & OFFLOAD

Weighted routes are applied at registration time. You can further adjust weights or override the distribution centrally using `config/routing.yaml` (loaded by the `RoutingManager`).

If `OFFLOAD=1`, the bootstrap process removes any adapter whose `base_url` matches `LOCAL_BASE_URL`, effectively forcing traffic to remote providers only. This lets you flip from hybrid to remote-only during incidents without editing YAML.
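The exact bootstrap code lives in the project; as a hedged sketch of the described behavior (names here are illustrative), the OFFLOAD filter amounts to:

```python
import os

def drop_local_adapters(route_table: dict[str, list]) -> dict[str, list]:
    """Illustrative OFFLOAD=1 filter: drop adapters whose base_url is local."""
    if os.environ.get("OFFLOAD") != "1":
        return route_table
    local = os.environ.get("LOCAL_BASE_URL", "")
    return {
        model_id: [a for a in adapters if a.base_url != local]
        for model_id, adapters in route_table.items()
    }
```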
## BaseAdapter API Reference

All adapters must inherit from `BaseAdapter` and implement:

### Required Methods

```python
async def chat_completion(
    self, messages: list[dict[str, Any]], **params
) -> dict[str, Any]:
    """Execute non-streaming chat completion."""
    pass

async def stream_chat_completion(
    self, messages: list[dict[str, Any]], **params
) -> AsyncGenerator[str, None]:
    """Execute streaming chat completion."""
    pass
```
### Utility Methods

```python
def validate_params(self, params: dict[str, Any]) -> dict[str, Any]:
    """Validate and clamp parameters to supported ranges."""

def format_response(
    self,
    content: str,
    model: str,
    usage: UsageInfo | None = None,
    tool_calls: list[dict] | None = None,
    finish_reason: str = "stop",
) -> dict[str, Any]:
    """Format response in OpenAI-compatible format."""

def format_stream_chunk(
    self, content: str, model: str, finish_reason: str | None = None
) -> str:
    """Format SSE chunk for streaming responses."""
```
### Available Attributes

```python
self.config  # ModelConfig instance
self.http    # AsyncHTTPClient for API requests
```
## Advanced Features

### Multi-Modal Support

For models supporting images:

```yaml
input_modalities: ["text", "image"]
```

Implement image handling in your adapter's `chat_completion` method.
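What "image handling" means is provider-specific. As a hypothetical sketch (the provider-side field names are invented here), an adapter might normalize OpenAI-style multimodal content parts before building its payload:

```python
from typing import Any

def convert_content(content: str | list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Normalize an OpenAI message's content into provider 'parts' (illustrative)."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    parts: list[dict[str, Any]] = []
    for item in content:
        if item.get("type") == "text":
            parts.append({"type": "text", "text": item["text"]})
        elif item.get("type") == "image_url":
            # OpenAI carries the URL (or data: URI) under image_url.url
            parts.append({"type": "image", "url": item["image_url"]["url"]})
    return parts
```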
### Tool/Function Calling

For models supporting function calls:

```yaml
supports_tools: true
```

Parse and include `tool_calls` in the response:
```python
import time  # at module top

# Translate a provider-specific function_call into OpenAI's tool_calls shape
tool_calls = []
if "function_call" in data:
    tool_calls.append({
        "id": f"call_{int(time.time() * 1000)}",  # synthesize a unique call id
        "type": "function",
        "function": {
            "name": data["function_call"]["name"],
            "arguments": data["function_call"]["arguments"],
        },
    })
return self.format_response(
    content=content,
    model=self.config.id,
    usage=usage,
    tool_calls=tool_calls,
)
```
### Structured Output (JSON Mode)

For models supporting structured output:

```yaml
supports_structured_output: true
```

Handle the `response_format` parameter:

```python
if params.get("response_format", {}).get("type") == "json_object":
    payload["response_format"] = {"type": "json_object"}
```
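For an end-to-end check through the gateway (stdlib only; the model id is a placeholder), a JSON-mode request looks like:

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "model": "your-model-id",
        "messages": [{"role": "user", "content": "Reply with a JSON object."}],
        "response_format": {"type": "json_object"},
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
# The content should parse as JSON when JSON mode is honored
print(json.loads(body["choices"][0]["message"]["content"]))
```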
### Rate Limiting (Optional)

If the provider has known token policies and you want server-side fairness controls, add a limiter configuration in `serving/servers/bootstrap.py` alongside the existing examples (Gemini/DeepSeek/Zhipu). This enables per-model queues, burst control, and persistent counters.
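The limiter's actual configuration is project-specific; purely as a generic illustration of the underlying idea (not the gateway's API), a token bucket enforces a steady rate plus bursts like this:

```python
import time

class TokenBucket:
    """Generic token bucket: allows `rate` tokens/sec with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```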
## Examples

### Example 1: OpenAI-Compatible Provider

See `serving/adapters/deepseek.py` for a simple OpenAI-compatible implementation.

### Example 2: Custom API Format

See `serving/adapters/gemini.py` for handling non-standard API formats with message conversion.

### Example 3: Local Deployment

See `serving/adapters/vllm.py` for integrating local inference servers.
## Troubleshooting

### Model Not Appearing in /v1/models

- Check `config/models.yaml` syntax
- Verify environment variables are set
- Check server logs for configuration errors
- If using `aliases`, verify the canonical `id` appears exactly once and aliases do not collide with other model IDs
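A scripted version of this check (stdlib only; assumes the endpoint returns the OpenAI-style `{"data": [...]}` list):

```python
import json
import urllib.request

# Fetch the model list and confirm the new id (or an alias) is registered
with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    ids = {m["id"] for m in json.load(resp)["data"]}
assert "your-model-id" in ids, f"model missing; registered ids: {sorted(ids)}"
```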
### Authentication Failures

- Verify the API key in `.env`
- Check that `${ENV_VAR}` expansion is working
- Ensure `base_url` is correct
### Response Format Errors

- Ensure `format_response()` returns an OpenAI-compatible structure
- Validate that `UsageInfo` fields are integers
- Check that `finish_reason` is valid: "stop", "length", "content_filter"
- For streaming, emit the first non-empty content chunk as soon as it is available so TTFT metrics record properly
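For reference, a minimal OpenAI-compatible non-streaming body has this shape (all values are placeholders):

```python
response = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "your-model-id",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",  # "stop" | "length" | "content_filter"
        }
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
}
```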
### Streaming Issues

- Ensure chunks are SSE-formatted: `data: {json}\n\n`
- Send the final usage chunk before `data: [DONE]`
- Handle JSON parsing errors gracefully
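A minimal sketch of SSE-correct chunk formatting (the chunk shape follows OpenAI's `chat.completion.chunk`; the helper name is made up):

```python
import json

def sse_chunk(model: str, content: str) -> str:
    """Wrap one delta in the SSE envelope: a `data: ` line plus a blank line."""
    chunk = {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": content}, "finish_reason": None}],
    }
    return f"data: {json.dumps(chunk)}\n\n"

# Stream order: content chunks, then the final usage chunk, then:
# data: [DONE]
```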
## Best Practices

- **Error handling:** Use `self.http.json_post_with_retry()` and handle provider faults gracefully with useful messages.
- **Usage accounting:** Prefer provider usage when available; otherwise fall back to `estimate_prompt_tokens()` / `estimate_text_tokens()`.
- **Streaming helpers:** Use `format_stream_chunk()`, `make_final_usage_chunk()`, and `done_sentinel()` for consistent SSE.
- **Type safety:** Provide full type hints and keep request/response shapes aligned with `serving/schemas.py`.
- **Testing:** Exercise both streaming and non-streaming paths, and try large prompts to validate token clamping.
- **Docs & style:** Keep adapter docstrings and comments in English (Google style). Avoid provider-specific logic in shared code.
- **Env expansion:** Use `${ENV_VAR}` in YAML instead of hardcoding secrets or endpoints; let dotenv load `.env` (see the sketch below).
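The gateway's actual expansion code may differ; as a hypothetical sketch, `${ENV_VAR}` substitution over loaded YAML strings is a small regex pass:

```python
import os
import re

_ENV_VAR = re.compile(r"\$\{(\w+)\}")

def expand_env(value: str) -> str:
    """Replace ${VAR} placeholders with environment values (empty if unset)."""
    return _ENV_VAR.sub(lambda m: os.environ.get(m.group(1), ""), value)

# expand_env("${PROVIDER_BASE_URL}") -> "https://api.provider.com/v1"
# when PROVIDER_BASE_URL is set in .env and loaded via dotenv
```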
## See Also

- OpenRouter Gateway Overview: architecture and endpoints
- Routing Configuration: central weight overrides and strategy
- Configuration Guide: environment and YAML configuration
- API Reference: top-level usage and running