# Architecture Overview

HybridInference is designed as a modular, high-performance inference gateway.

## System Architecture

```
         ┌─────────────┐
         │   Clients   │
         └──────┬──────┘
                │
                ▼
       ┌─────────────────┐
       │ FastAPI Gateway │
       │   (serving/)    │
       └────────┬────────┘
                │
           ┌────┴────┐
           │         │
           ▼         ▼
      ┌────────┐ ┌─────────┐
      │Routing │ │Adapters │
      │Manager │ │         │
      └────┬───┘ └────┬────┘
           │          │
           ▼          ▼
      ┌──────────────────┐
      │  LLM Providers   │
      │ ┌──────────────┐ │
      │ │  Local vLLM  │ │
      │ │  OpenAI API  │ │
      │ │  Gemini API  │ │
      │ └──────────────┘ │
      └──────────────────┘
```

## Core Components

### Serving Layer (`serving/`)

The serving layer provides the FastAPI-based gateway:

- **Gateway**: HTTP API endpoints for inference requests
- **Adapters**: Provider-specific API adapters
- **Observability**: Logging, metrics, and tracing
- **Storage**: PostgreSQL integration for request/response logging

### Routing Layer (`routing/`)

Intelligent routing and load balancing:

- **Manager**: Routes each request to the optimal provider
- **Strategies**: Pluggable routing algorithms (round-robin, cost-based, latency-based); a sketch appears at the end of this overview
- **Health Checks**: Monitor provider availability and performance

### Configuration (`config/`)

Centralized configuration management:

- Model configurations
- Provider settings
- Routing policies
- Feature flags

### Infrastructure (`infrastructure/`)

Deployment and observability:

- Systemd service definitions
- Prometheus metrics collection
- Grafana dashboards
- Alertmanager rules

## Key Design Principles

1. **Modularity**: Clear separation between serving, routing, and provider layers
2. **Extensibility**: Easy to add new providers and routing strategies
3. **Observability**: Comprehensive logging and metrics at every layer
4. **Performance**: Optimized for low-latency, high-throughput inference
5. **Reliability**: Health checks, retries, and fallback mechanisms

## Data Flow

1. Client sends an inference request to the Gateway
2. Gateway validates and preprocesses the request
3. Routing Manager selects the optimal provider
4. Adapter translates the request into the provider-specific format
5. Provider runs the inference
6. Response is logged to PostgreSQL
7. Metrics are exported to Prometheus
8. Response is returned to the client
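
## Illustrative Sketches

The snippets below are minimal, self-contained sketches of the concepts described above. Class, function, and endpoint names in them are assumptions chosen for illustration, not the actual identifiers in the `routing/` and `serving/` packages.

### Routing Strategy (sketch)

The Routing Manager chooses among pluggable strategies such as round-robin, cost-based, and latency-based. A latency- or cost-based selection over healthy providers might look like this, assuming hypothetical `Provider` and `RoutingStrategy` types:

```python
# Sketch only: Provider, RoutingStrategy, and the strategy classes are assumed names.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Provider:
    """Minimal view of a provider as the router might see it."""
    name: str
    healthy: bool = True
    cost_per_1k_tokens: float = 0.0                      # provider pricing, if known
    recent_latencies_ms: list[float] = field(default_factory=list)

    @property
    def avg_latency_ms(self) -> float:
        # Unknown latency sorts last so unmeasured providers are deprioritized.
        if not self.recent_latencies_ms:
            return float("inf")
        return sum(self.recent_latencies_ms) / len(self.recent_latencies_ms)


class RoutingStrategy(Protocol):
    def select(self, providers: list[Provider]) -> Provider: ...


class LatencyBasedStrategy:
    """Pick the healthy provider with the lowest observed average latency."""

    def select(self, providers: list[Provider]) -> Provider:
        healthy = [p for p in providers if p.healthy]
        if not healthy:
            raise RuntimeError("no healthy providers available")
        return min(healthy, key=lambda p: p.avg_latency_ms)


class CostBasedStrategy:
    """Pick the cheapest healthy provider."""

    def select(self, providers: list[Provider]) -> Provider:
        healthy = [p for p in providers if p.healthy]
        if not healthy:
            raise RuntimeError("no healthy providers available")
        return min(healthy, key=lambda p: p.cost_per_1k_tokens)


if __name__ == "__main__":
    providers = [
        Provider("local-vllm", cost_per_1k_tokens=0.0, recent_latencies_ms=[45.0, 60.0]),
        Provider("openai", cost_per_1k_tokens=0.5, recent_latencies_ms=[300.0]),
    ]
    print(LatencyBasedStrategy().select(providers).name)  # -> local-vllm
```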
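
### Provider Adapter (sketch)

Adapters translate the gateway's internal request into each provider's wire format and extract the generated text from the response. The sketch below targets an OpenAI-style chat-completions payload; `InferenceRequest` and the adapter method names are assumptions for this example:

```python
# Sketch only: InferenceRequest and OpenAIAdapter are assumed names, not the
# actual classes in serving/.
from dataclasses import dataclass
from typing import Any


@dataclass
class InferenceRequest:
    prompt: str
    model: str
    max_tokens: int = 256
    temperature: float = 0.7


class OpenAIAdapter:
    """Translate between the gateway's request shape and an OpenAI-style API."""

    def to_provider_payload(self, req: InferenceRequest) -> dict[str, Any]:
        # Chat-completions request body.
        return {
            "model": req.model,
            "messages": [{"role": "user", "content": req.prompt}],
            "max_tokens": req.max_tokens,
            "temperature": req.temperature,
        }

    def from_provider_response(self, resp: dict[str, Any]) -> str:
        # Chat-completions responses carry the text under choices[0].message.content.
        return resp["choices"][0]["message"]["content"]
```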
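
### Request Path Through the Gateway (sketch)

The data-flow steps above can be read as a single FastAPI handler: validate, route, adapt, infer, log, and respond. The endpoint path and the `select_provider` and `run_inference` helpers below are stubs standing in for the routing and adapter layers, not the gateway's actual API surface:

```python
# Sketch only: endpoint name and helper functions are assumptions for illustration.
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class CompletionRequest(BaseModel):
    prompt: str
    model: str
    max_tokens: int = 256


class CompletionResponse(BaseModel):
    text: str
    provider: str


def select_provider(model: str) -> Optional[str]:
    # Stub standing in for the Routing Manager; always picks the local vLLM backend.
    return "local-vllm"


async def run_inference(provider: str, req: CompletionRequest) -> str:
    # Stub standing in for adapter translation plus the provider call.
    return f"[{provider}] completion for: {req.prompt[:40]}"


@app.post("/v1/completions", response_model=CompletionResponse)
async def complete(req: CompletionRequest) -> CompletionResponse:
    # Steps 1-2: FastAPI/pydantic validate the request; further preprocessing goes here.
    # Step 3: the routing manager picks a provider.
    provider = select_provider(req.model)
    if provider is None:
        raise HTTPException(status_code=503, detail="no healthy provider available")

    # Steps 4-5: the adapter translates the request and the provider runs inference.
    text = await run_inference(provider, req)

    # Steps 6-7: PostgreSQL request/response logging and Prometheus metric export
    # would be hooked in here (omitted in this sketch).

    # Step 8: return the response to the client.
    return CompletionResponse(text=text, provider=provider)
```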