HybridInference

Getting Started:

  • Quick Start
    • Overview
    • Get Your API Key
    • Make Your First Request
      • Using curl
      • Using Python
      • Using JavaScript/TypeScript
    • Streaming Responses
    • Check Available Models
    • Next Steps
  • Available Models
    • Model Overview
    • Model Details
      • Llama 3.3 70B Instruct
      • Llama 4 Scout
      • Llama 4 Maverick
      • Gemini 2.5 Flash
      • GLM-4.5
      • GPT-5
      • Claude Opus 4.1
    • Using Different Models
    • Model Selection Guide
  • Code Examples
    • Installation
    • Basic Chat Completion
      • Python
      • JavaScript
      • curl
    • Streaming Responses
      • Python
      • JavaScript
      • curl
    • Function Calling
      • Python
      • JavaScript
    • Structured Output (JSON Mode)
      • Python
      • JavaScript
    • Tips
      • Temperature Settings
      • Max Tokens
      • Choosing Models
  • API Reference
    • Base URL
    • Authentication
    • Endpoints
      • Chat Completions
      • List Models
      • Health Check
    • Parameters Reference
      • Temperature
      • Max Tokens
      • Top P (Nucleus Sampling)
      • Stop Sequences
    • Response Formats
      • Standard Text Response
      • JSON Mode
    • Function Calling
      • Tool Definition
      • Tool Choice Options
      • Response with Function Call
    • Vision (Multimodal)
      • Image URL
      • Base64 Image
    • Error Codes
      • Error Response Format
    • Rate Limits
    • OpenRouter Compatibility

Developer Guide:

  • Installation
    • System Requirements
    • Installation Methods
      • Using uv (Recommended)
      • Using conda
    • Configuration
      • Environment Variables
    • Verification
    • Troubleshooting
      • Common Issues
  • Deployment Guide
    • Production Deployment
      • Using systemd
      • Environment Variables
      • Health Checks
      • Logs
    • Monitoring
      • Prometheus Metrics
      • Grafana Dashboards
    • Database Setup
      • PostgreSQL
    • Troubleshooting
      • Service won’t start
      • High latency
      • Rate limiting
  • Architecture Overview
    • System Architecture
    • Core Components
      • Serving Layer (serving/)
      • Routing Layer (routing/)
      • Configuration (config/)
      • Infrastructure (infrastructure/)
    • Key Design Principles
    • Data Flow
  • Hybrid Inference Routing System
    • Architecture
      • Decision Layer (routing/manager.py + routing/strategies.py)
      • Execution Layer (routing/executor.py)
    • Features
    • Configuration
      • Required Files
      • Optional Files
      • Example Configuration (60/40 split):
    • Running the Server
    • API Endpoints
    • Extending the System
      • Adding New Strategies
      • Health Monitoring
    • Migration Notes
  • Adding a New Model (OpenRouter-Compatible)
    • Overview
    • Quick Start
      • Adding a Model with an Existing Provider
    • Adding a New Provider
      • Step 1: Create Provider Adapter
      • Step 2: Register the Adapter
      • Step 3: Add Model Configuration
      • Step 4: Configure Environment Variables
      • Step 5: Test the Integration
    • Configuration Reference
      • ModelConfig Fields
      • Route Configuration
      • Hybrid Routing & OFFLOAD
    • BaseAdapter API Reference
      • Required Methods
      • Utility Methods
      • Available Attributes
    • Advanced Features
      • Multi-Modal Support
      • Tool/Function Calling
      • Structured Output (JSON Mode)
      • Rate Limiting (Optional)
    • Examples
      • Example 1: OpenAI-Compatible Provider
      • Example 2: Custom API Format
      • Example 3: Local Deployment
    • Troubleshooting
      • Model Not Appearing in /v1/models
      • Authentication Failures
      • Response Format Errors
      • Streaming Issues
    • Best Practices
    • See Also
  • Configuration Guide
    • 1. Environment Variables and Priority
      • Variable Substitution
      • Configuration Priority
    • 2. models.yaml (Required)
      • Key Points:
    • 3. routing.yaml (Optional)
      • Fixed-Ratio Strategy
      • Health Checking (Optional)
      • Example: Hybrid Deployment (60% local / 40% remote)
      • How It Works:
      • Local-Only Deployment
    • 4. Running the System
      • Set Environment Variables:
      • Start the Server:
      • Verify Operation:
    • 5. FAQ
  • PostgreSQL Admin Playbook
    • Environment Variables
    • Restart Admin Stack
    • Access pgAdmin Securely
    • Register Primary Database
    • Post-Restart Checks
  • OpenRouter-Compatible API Gateway
    • Architecture
      • Key Components
    • Features
    • Development Setup
      • Prerequisites
      • Create Environment
      • Local Environment Variables
      • Run Locally
      • Quick Checks
    • Production Deployment
    • API Surface
      • Example Requests
    • Logging and Metrics
    • Testing
    • Troubleshooting
    • Related Docs
  • FreeInference Deployment
    • FastAPI + systemd (current)
      • Overview
      • Deployment Steps
      • Runtime Operations
      • Why We Dropped Nginx
    • Legacy Architectures
      • Nginx (v2, abandoned)
      • Nginx + Lua via OpenResty (v1, abandoned)
        • Overview
        • Installation Notes
      • Nginx (v0, abandoned)
  • FASRC Deployment
    • Docker
  • Contributing
    • Development Setup
    • Code Quality Standards
      • Pre-commit Hooks
    • Development Workflow
    • Testing
    • Documentation
    • Pull Request Guidelines

Project Info:

  • Changelog
    • [Unreleased]
      • Added
    • [0.1.0] - 2025-10-26
      • Added

