Adding a New Model (OpenRouter-Compatible)

This guide explains how to add support for new LLM models and providers to the hybridInference gateway while keeping full OpenRouter/OpenAI API compatibility.

Reference PR for provider integration example: https://github.com/HarvardSys/hybridInference/pull/34

Overview

This single guide covers both tasks. Depending on your case, follow one of:

  1. Use an existing provider adapter (vLLM, DeepSeek, Gemini, Llama, Zhipu) — only YAML + env changes.

  2. Integrate a new provider — add an adapter class + small registration changes, then YAML + env.

Quick Start

Adding a Model with an Existing Provider

If the provider is already supported (vLLM, DeepSeek, Gemini, Llama, Zhipu), you only need to add configuration:

  1. Add model configuration in config/models.yaml:

models:
  - id: your-model-id
    name: Your Model Display Name
    provider: existing_provider  # e.g., "gemini", "deepseek"
    provider_model_id: "actual-provider-model-id"
    base_url: ${PROVIDER_BASE_URL}
    api_key: ${PROVIDER_API_KEY}
    quantization: "bf16"
    input_modalities: ["text"]
    output_modalities: ["text"]
    context_length: 8192
    max_output_length: 4096
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    pricing:
      prompt: "0"
      completion: "0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: existing_provider
        weight: 1.0
        base_url: ${PROVIDER_BASE_URL}
        api_key: ${PROVIDER_API_KEY}

  2. Configure environment variables in .env:

PROVIDER_BASE_URL=https://api.provider.com/v1
PROVIDER_API_KEY=your-api-key

  3. Restart the server to load the new model.

  4. Verify:

curl http://localhost:8080/v1/models | jq
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model-id","messages":[{"role":"user","content":"Hello"}]}' | jq

Note on aliases: If you want the model to be reachable under additional names (e.g., an OpenRouter-style slug or a local vLLM path), add them in aliases so clients can call either name.
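
For example, a model entry might expose both an OpenRouter-style slug and a local vLLM path (both alias values below are placeholders):

models:
  - id: your-model-id
    # ... other fields as above
    aliases:
      - "your-org/your-model-id"    # OpenRouter-style slug (placeholder)
      - "/models/your-local-model"  # local vLLM path (placeholder)

Requests for the canonical id or any alias resolve to the same model.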

OFFLOAD behavior: When OFFLOAD=1, the service will remove all local adapters whose base_url matches LOCAL_BASE_URL and only use remote adapters. See “Hybrid Routing & OFFLOAD” below.

Adding a New Provider

If you need to integrate a completely new provider, follow these steps:

Step 1: Create Provider Adapter

Create a new file in serving/adapters/ (e.g., serving/adapters/your_provider.py):

import json
from collections.abc import AsyncGenerator
from typing import Any

from serving.stream import done_sentinel, make_final_usage_chunk
from serving.utils.tokens import estimate_prompt_tokens, estimate_text_tokens
from .base import BaseAdapter, UsageInfo


class YourProviderAdapter(BaseAdapter):
    """Adapter for YourProvider API.

    This adapter translates OpenAI-compatible requests to YourProvider's
    API format and normalizes responses back to OpenAI format.
    """

    async def chat_completion(
        self, messages: list[dict[str, Any]], **params
    ) -> dict[str, Any]:
        """Execute a non-streaming chat completion request.

        Args:
            messages: List of chat messages in OpenAI format.
            **params: Additional parameters (temperature, max_tokens, etc.).

        Returns:
            OpenAI-compatible response dictionary.
        """
        # Validate and filter parameters
        validated_params = self.validate_params(params)

        # Build provider-specific request payload
        payload = {
            "model": self.config.provider_model_id or self.config.id,
            "messages": messages,
            **validated_params,
        }

        # Add optional features
        if params.get("tools"):
            payload["tools"] = params["tools"]

        if params.get("response_format", {}).get("type") == "json_object":
            payload["response_format"] = {"type": "json_object"}

        # Set up authentication headers
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.config.api_key}",
        }

        # Make API request
        data = await self.http.json_post_with_retry(
            f"{self.config.base_url}/chat/completions",
            json=payload,
            headers=headers,
        )

        # Extract usage information
        usage = UsageInfo(
            prompt_tokens=data.get("usage", {}).get("prompt_tokens", 0),
            completion_tokens=data.get("usage", {}).get("completion_tokens", 0),
            total_tokens=data.get("usage", {}).get("total_tokens", 0),
        )

        # Fallback to estimation if provider doesn't return usage
        if usage.total_tokens == 0:
            content = data["choices"][0]["message"].get("content", "")
            prompt_tokens = estimate_prompt_tokens(messages)
            completion_tokens = estimate_text_tokens(content)
            usage = UsageInfo(
                prompt_tokens=int(prompt_tokens),
                completion_tokens=int(completion_tokens),
                total_tokens=int(prompt_tokens + completion_tokens),
            )

        # Extract tool calls if present
        tool_calls = None
        if "tool_calls" in data["choices"][0]["message"]:
            tool_calls = data["choices"][0]["message"]["tool_calls"]

        # Return normalized response
        return self.format_response(
            content=data["choices"][0]["message"].get("content", ""),
            model=self.config.id,
            usage=usage,
            tool_calls=tool_calls,
            finish_reason=data["choices"][0].get("finish_reason", "stop"),
        )

    async def stream_chat_completion(
        self, messages: list[dict[str, Any]], **params
    ) -> AsyncGenerator[str, None]:
        """Execute a streaming chat completion request.

        Args:
            messages: List of chat messages in OpenAI format.
            **params: Additional parameters.

        Yields:
            Server-sent event formatted strings.
        """
        validated_params = self.validate_params(params)

        payload = {
            "model": self.config.provider_model_id or self.config.id,
            "messages": messages,
            "stream": True,
            **validated_params,
        }

        if params.get("tools"):
            payload["tools"] = params["tools"]

        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.config.api_key}",
        }

        total_content = ""
        prompt_tokens = 0

        async for line in self.http.stream_post(
            f"{self.config.base_url}/chat/completions",
            json=payload,
            headers=headers,
        ):
            if not line.startswith("data: "):
                continue

            if line == "data: [DONE]":
                # Emit final usage chunk using shared helper for consistency
                yield make_final_usage_chunk(
                    model=self.config.id,
                    messages=messages,
                    total_content=total_content,
                    prompt_tokens_override=prompt_tokens or None,
                    finish_reason="stop",
                )
                yield done_sentinel()
                break

            try:
                chunk_data = json.loads(line[6:])
            except json.JSONDecodeError:
                continue

            # Extract usage if available (some providers send a usage-only chunk)
            if chunk_data.get("usage"):
                prompt_tokens = chunk_data["usage"].get("prompt_tokens", prompt_tokens)

            # Extract and yield content delta; usage-only chunks may have no choices
            choices = chunk_data.get("choices") or []
            if choices and choices[0].get("delta", {}).get("content"):
                content = choices[0]["delta"]["content"]
                total_content += content
                yield self.format_stream_chunk(content, self.config.id)

Step 2: Register the Adapter

  1. Update serving/adapters/__init__.py:

from .your_provider import YourProviderAdapter

__all__ = [
    # ... existing exports
    "YourProviderAdapter",
]

  2. Update serving/servers/registry.py:

Add the import at the top:

from serving.adapters import (
    # ... existing imports
    YourProviderAdapter,
)

Add a branch in the _make_adapter function:

def _make_adapter(kind: str, cfg: dict[str, Any]):
    """Construct a provider adapter from a kind string and model config."""
    model_cfg = ModelConfig(**cfg)
    # ... existing conditions
    if kind == "your_provider":
        return YourProviderAdapter(model_cfg)
    raise ValueError(f"Unknown adapter kind: {kind}")
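
To quickly confirm the registration wires up, a minimal check like the following can help. This is a sketch: it assumes ModelConfig accepts just the required fields (id, name, provider, base_url) with defaults for the rest, and that constructing the adapter performs no network I/O.

# Sanity-check sketch: confirm _make_adapter resolves the new kind.
from serving.servers.registry import _make_adapter

cfg = {
    "id": "your-model-id",
    "name": "Your Model Name",
    "provider": "your_provider",
    "base_url": "https://api.yourprovider.com/v1",
    "api_key": "test-key",
}

adapter = _make_adapter("your_provider", cfg)
assert type(adapter).__name__ == "YourProviderAdapter"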

Step 3: Add Model Configuration

Add your model to config/models.yaml:

models:
  - id: your-model-id
    name: Your Model Name
    provider: your_provider
    provider_model_id: "actual-model-id"
    base_url: ${YOUR_PROVIDER_BASE_URL}
    api_key: ${YOUR_PROVIDER_API_KEY}
    quantization: "bf16"
    input_modalities: ["text"]
    output_modalities: ["text"]
    context_length: 8192
    max_output_length: 4096
    supports_tools: true
    supports_structured_output: true
    supported_params: [temperature, top_p, max_tokens, stop]
    aliases: []  # Optional alternative names
    pricing:
      prompt: "0"
      completion: "0"
      image: "0"
      request: "0"
      input_cache_reads: "0"
      input_cache_writes: "0"
    route:
      - kind: your_provider
        weight: 1.0
        base_url: ${YOUR_PROVIDER_BASE_URL}
        api_key: ${YOUR_PROVIDER_API_KEY}

Step 4: Configure Environment Variables

Add to .env:

YOUR_PROVIDER_BASE_URL=https://api.yourprovider.com/v1
YOUR_PROVIDER_API_KEY=your-api-key-here

Step 5: Test the Integration

# Start the server
uvicorn serving.servers.app:app --host 0.0.0.0 --port 8080

# List available models
curl http://localhost:8080/v1/models

# Test chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Streaming test:

curl -N -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "Stream test"}],
    "stream": true,
    "max_tokens": 64
  }'
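
Because the gateway speaks the OpenAI API, you can also test it with the official openai Python client pointed at the gateway. This is a sketch; the api_key value is a placeholder (use whatever your gateway expects, or any dummy string if auth is not enforced):

from openai import OpenAI

# Point the standard OpenAI client at the local gateway instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy-key")

# Non-streaming request
resp = client.chat.completions.create(
    model="your-model-id",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

# Streaming request
stream = client.chat.completions.create(
    model="your-model-id",
    messages=[{"role": "user", "content": "Stream test"}],
    stream=True,
    max_tokens=64,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)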

Configuration Reference

ModelConfig Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | Yes | Unique model identifier |
| name | string | Yes | Display name |
| provider | string | Yes | Provider/adapter kind |
| base_url | string | Yes | API endpoint base URL |
| api_key | string | No | API authentication key |
| provider_model_id | string | No | Provider's model identifier (overrides id) |
| aliases | list[string] | No | Alternative names for routing |
| quantization | string | No | Quantization format (default: "bf16") |
| input_modalities | list[string] | No | Input types: "text", "image" |
| output_modalities | list[string] | No | Output types: "text" |
| context_length | int | No | Maximum context window (default: 8192) |
| max_output_length | int | No | Maximum output tokens (default: 4096) |
| supports_tools | bool | No | Function calling support (default: false) |
| supports_structured_output | bool | No | JSON mode support (default: false) |
| supported_params | list[string] | No | Allowed parameter names |
| pricing | dict | No | Cost information per token/request |

Route Configuration

Routes allow multiple endpoints for a single model with weighted distribution:

route:
  # Local vLLM deployment
  - kind: vllm
    weight: 0.7  # 70% of traffic
    base_url: http://localhost:8000
    provider_model_id: "/models/local-model"

  # Remote API fallback
  - kind: your_provider
    weight: 0.3  # 30% of traffic
    base_url: https://api.provider.com
    api_key: ${API_KEY}

Hybrid Routing & OFFLOAD

  • Weighted routes are applied at registration time. You can further adjust weights or override distribution centrally using config/routing.yaml (loaded by the RoutingManager).

  • If OFFLOAD=1, the bootstrap process removes any adapter whose base_url matches LOCAL_BASE_URL, effectively forcing traffic to remote providers only.

This lets you flip from hybrid to remote-only during incidents without editing YAML.
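
For example, with a .env along these lines (values are placeholders), setting OFFLOAD=1 forces all traffic to the remote routes:

# Local vLLM endpoint that hybrid routes point at
LOCAL_BASE_URL=http://localhost:8000

# Incident switch: when set to 1, adapters whose base_url matches LOCAL_BASE_URL
# are removed at startup, leaving only remote providers
OFFLOAD=1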

BaseAdapter API Reference

All adapters must inherit from BaseAdapter and implement:

Required Methods

async def chat_completion(
    self, messages: list[dict[str, Any]], **params
) -> dict[str, Any]:
    """Execute non-streaming chat completion."""
    pass

async def stream_chat_completion(
    self, messages: list[dict[str, Any]], **params
) -> AsyncGenerator[str, None]:
    """Execute streaming chat completion."""
    pass

Utility Methods

def validate_params(self, params: dict[str, Any]) -> dict[str, Any]:
    """Validate and clamp parameters to supported ranges."""

def format_response(
    self,
    content: str,
    model: str,
    usage: UsageInfo | None = None,
    tool_calls: list[dict] | None = None,
    finish_reason: str = "stop",
) -> dict[str, Any]:
    """Format response in OpenAI-compatible format."""

def format_stream_chunk(
    self, content: str, model: str, finish_reason: str | None = None
) -> str:
    """Format SSE chunk for streaming responses."""

Available Attributes

self.config       # ModelConfig instance
self.http         # AsyncHTTPClient for API requests

Advanced Features

Multi-Modal Support

For models supporting images:

input_modalities: ["text", "image"]

Implement image handling in your adapter’s chat_completion method.
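
In OpenAI-compatible requests, image inputs arrive as a list of content parts rather than a plain string. A minimal conversion sketch follows; the provider-side part shape is an assumption, so replace it with your provider's actual image format:

def _convert_content(self, content):
    """Convert OpenAI content parts to the provider's format (sketch)."""
    if isinstance(content, str):
        return content  # plain text message, pass through unchanged
    parts = []
    for part in content:
        if part.get("type") == "text":
            parts.append({"type": "text", "text": part["text"]})
        elif part.get("type") == "image_url":
            # OpenAI nests the URL (or data: URI) under image_url.url
            parts.append({"type": "image", "url": part["image_url"]["url"]})
    return parts

Apply this to each message's content before building the payload in chat_completion.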

Tool/Function Calling

For models supporting function calls:

supports_tools: true

Parse and include tool_calls in the response:

import time  # module-level import, used to generate call ids

tool_calls = []
if "function_call" in data:
    tool_calls.append({
        "id": f"call_{int(time.time() * 1000)}",
        "type": "function",
        "function": {
            "name": data["function_call"]["name"],
            "arguments": data["function_call"]["arguments"],
        },
    })

return self.format_response(
    content=content,
    model=self.config.id,
    usage=usage,
    tool_calls=tool_calls,
)
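
On the client side, tool definitions follow the standard OpenAI request shape; for example (the function name and schema are placeholders):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-id",
    "messages": [{"role": "user", "content": "What is the weather in Boston?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'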

Structured Output (JSON Mode)

For models supporting JSON schema:

supports_structured_output: true

Handle response_format parameter:

if params.get("response_format", {}).get("type") == "json_object":
    payload["response_format"] = {"type": "json_object"}

Rate Limiting (Optional)

If the provider has known token policies and you want server-side fairness controls, add a limiter configuration in serving/servers/bootstrap.py alongside existing examples (Gemini/DeepSeek/Zhipu). This enables per-model queues, burst control, and persistent counters.

Examples

Example 1: OpenAI-Compatible Provider

See serving/adapters/deepseek.py for a simple OpenAI-compatible implementation.

Example 2: Custom API Format

See serving/adapters/gemini.py for handling non-standard API formats with message conversion.

Example 3: Local Deployment

See serving/adapters/vllm.py for integrating local inference servers.

Troubleshooting

Model Not Appearing in /v1/models

  • Check config/models.yaml syntax

  • Verify environment variables are set

  • Check server logs for configuration errors

  • If using aliases, verify the canonical id appears exactly once and aliases do not collide with other model IDs.

Authentication Failures

  • Verify API key in .env

  • Check if ${ENV_VAR} expansion is working

  • Ensure base_url is correct

Response Format Errors

  • Ensure format_response() returns OpenAI-compatible structure

  • Validate UsageInfo fields are integers

  • Check finish_reason is valid: "stop", "length", "tool_calls", "content_filter"

  • For streaming, ensure the first non-empty content chunk is emitted as soon as available so TTFT metrics record properly.

Streaming Issues

  • Ensure chunks are SSE-formatted: data: {json}\n\n (see the example after this list)

  • Send final usage chunk before data: [DONE]

  • Handle JSON parsing errors gracefully
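
For reference, a well-formed stream looks roughly like this on the wire (ids, timestamps, and token counts are illustrative):

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"your-model-id","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1700000000,"model":"your-model-id","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":5,"completion_tokens":1,"total_tokens":6}}

data: [DONE]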

Best Practices

  1. Error handling: Use self.http.json_post_with_retry() and handle provider faults gracefully with useful messages.

  2. Usage accounting: Prefer provider usage when available; otherwise fall back to estimate_prompt_tokens()/estimate_text_tokens().

  3. Streaming helpers: Use format_stream_chunk(), make_final_usage_chunk(), and done_sentinel() for consistent SSE.

  4. Type safety: Provide full type hints and keep request/response shapes aligned with serving/schemas.py.

  5. Testing: Exercise both streaming and non-streaming paths, and try large prompts to validate token clamping.

  6. Docs & style: Keep adapter docstrings and comments in English (Google style). Avoid provider-specific logic in shared code.

  7. Env expansion: Use ${ENV_VAR} in YAML instead of hardcoding secrets or endpoints; let dotenv load .env.

See Also