# Architecture Overview

HybridInference is designed as a modular, high-performance inference gateway.

## System Architecture

```
         ┌─────────────┐
         │   Clients   │
         └──────┬──────┘
                │
                ▼
       ┌─────────────────┐
       │ FastAPI Gateway │
       │   (serving/)    │
       └────────┬────────┘
                │
           ┌────┴────┐
           │         │
           ▼         ▼
      ┌────────┐ ┌─────────┐
      │Routing │ │Adapters │
      │Manager │ │         │
      └────┬───┘ └────┬────┘
           │          │
           ▼          ▼
      ┌──────────────────┐
      │  LLM Providers   │
      │ ┌──────────────┐ │
      │ │  Local vLLM  │ │
      │ │  OpenAI API  │ │
      │ │  Gemini API  │ │
      │ └──────────────┘ │
      └──────────────────┘
```

## Core Components

### Serving Layer (`serving/`)

The serving layer provides the FastAPI-based gateway:

- **Gateway**: HTTP API endpoints for inference requests
- **Adapters**: Provider-specific API adapters
- **Observability**: Logging, metrics, and tracing
- **Storage**: PostgreSQL integration for request/response logging

### Routing Layer (`routing/`)

Intelligent routing and load balancing:

- **Manager**: Routes each request to the optimal provider
- **Strategies**: Pluggable routing algorithms (round-robin, cost-based, latency-based); a sketch appears at the end of this overview
- **Health Checks**: Monitor provider availability and performance

### Configuration (`config/`)

Centralized configuration management:

- Model configurations
- Provider settings
- Routing policies
- Feature flags

### Infrastructure (`infrastructure/`)

Deployment and observability:

- Systemd service definitions
- Prometheus metrics collection
- Grafana dashboards
- Alertmanager rules

## Key Design Principles

1. **Modularity**: Clear separation between serving, routing, and provider layers
2. **Extensibility**: Easy to add new providers and routing strategies
3. **Observability**: Comprehensive logging and metrics at every layer
4. **Performance**: Optimized for low-latency, high-throughput inference
5. **Reliability**: Health checks, retries, and fallback mechanisms

## Data Flow

1. Client sends an inference request to the Gateway
2. Gateway validates and preprocesses the request
3. Routing Manager selects the optimal provider
4. Adapter translates the request into the provider-specific format
5. Provider runs the inference
6. Response is logged to PostgreSQL
7. Metrics are exported to Prometheus
8. Response is returned to the client
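
## Illustrative Sketches

The snippets below are minimal, self-contained sketches of the concepts described above. Class, function, and endpoint names in them are assumptions chosen for illustration, not the actual identifiers in the `routing/` and `serving/` packages.

### Routing Strategy (sketch)

The Routing Manager chooses among pluggable strategies such as round-robin, cost-based, and latency-based. A latency- or cost-based selection over healthy providers might look like this, assuming hypothetical `Provider` and `RoutingStrategy` types:

```python
# Sketch only: Provider, RoutingStrategy, and the strategy classes are assumed names.
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Provider:
    """Minimal view of a provider as the router might see it."""
    name: str
    healthy: bool = True
    cost_per_1k_tokens: float = 0.0                      # provider pricing, if known
    recent_latencies_ms: list[float] = field(default_factory=list)

    @property
    def avg_latency_ms(self) -> float:
        # Unknown latency sorts last so unmeasured providers are deprioritized.
        if not self.recent_latencies_ms:
            return float("inf")
        return sum(self.recent_latencies_ms) / len(self.recent_latencies_ms)


class RoutingStrategy(Protocol):
    def select(self, providers: list[Provider]) -> Provider: ...


class LatencyBasedStrategy:
    """Pick the healthy provider with the lowest observed average latency."""

    def select(self, providers: list[Provider]) -> Provider:
        healthy = [p for p in providers if p.healthy]
        if not healthy:
            raise RuntimeError("no healthy providers available")
        return min(healthy, key=lambda p: p.avg_latency_ms)


class CostBasedStrategy:
    """Pick the cheapest healthy provider."""

    def select(self, providers: list[Provider]) -> Provider:
        healthy = [p for p in providers if p.healthy]
        if not healthy:
            raise RuntimeError("no healthy providers available")
        return min(healthy, key=lambda p: p.cost_per_1k_tokens)


if __name__ == "__main__":
    providers = [
        Provider("local-vllm", cost_per_1k_tokens=0.0, recent_latencies_ms=[45.0, 60.0]),
        Provider("openai", cost_per_1k_tokens=0.5, recent_latencies_ms=[300.0]),
    ]
    print(LatencyBasedStrategy().select(providers).name)  # -> local-vllm
```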
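
### Provider Adapter (sketch)

Adapters translate the gateway's internal request into each provider's wire format and extract the generated text from the response. The sketch below targets an OpenAI-style chat-completions payload; `InferenceRequest` and the adapter method names are assumptions for this example:

```python
# Sketch only: InferenceRequest and OpenAIAdapter are assumed names, not the
# actual classes in serving/.
from dataclasses import dataclass
from typing import Any


@dataclass
class InferenceRequest:
    prompt: str
    model: str
    max_tokens: int = 256
    temperature: float = 0.7


class OpenAIAdapter:
    """Translate between the gateway's request shape and an OpenAI-style API."""

    def to_provider_payload(self, req: InferenceRequest) -> dict[str, Any]:
        # Chat-completions request body.
        return {
            "model": req.model,
            "messages": [{"role": "user", "content": req.prompt}],
            "max_tokens": req.max_tokens,
            "temperature": req.temperature,
        }

    def from_provider_response(self, resp: dict[str, Any]) -> str:
        # Chat-completions responses carry the text under choices[0].message.content.
        return resp["choices"][0]["message"]["content"]
```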
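
### Request Path Through the Gateway (sketch)

The data-flow steps above can be read as a single FastAPI handler: validate, route, adapt, infer, log, and respond. The endpoint path and the `select_provider` and `run_inference` helpers below are stubs standing in for the routing and adapter layers, not the gateway's actual API surface:

```python
# Sketch only: endpoint name and helper functions are assumptions for illustration.
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class CompletionRequest(BaseModel):
    prompt: str
    model: str
    max_tokens: int = 256


class CompletionResponse(BaseModel):
    text: str
    provider: str


def select_provider(model: str) -> Optional[str]:
    # Stub standing in for the Routing Manager; always picks the local vLLM backend.
    return "local-vllm"


async def run_inference(provider: str, req: CompletionRequest) -> str:
    # Stub standing in for adapter translation plus the provider call.
    return f"[{provider}] completion for: {req.prompt[:40]}"


@app.post("/v1/completions", response_model=CompletionResponse)
async def complete(req: CompletionRequest) -> CompletionResponse:
    # Steps 1-2: FastAPI/pydantic validate the request; further preprocessing goes here.
    # Step 3: the routing manager picks a provider.
    provider = select_provider(req.model)
    if provider is None:
        raise HTTPException(status_code=503, detail="no healthy provider available")

    # Steps 4-5: the adapter translates the request and the provider runs inference.
    text = await run_inference(provider, req)

    # Steps 6-7: PostgreSQL request/response logging and Prometheus metric export
    # would be hooked in here (omitted in this sketch).

    # Step 8: return the response to the client.
    return CompletionResponse(text=text, provider=provider)
```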