# FreeInference Deployment

## FastAPI + systemd (current)

We serve OpenRouter-compatible traffic directly through a FastAPI application listening on port 80. Removing Nginx reduces operational overhead, keeps debugging straightforward, and lets `systemd` own the lifecycle of the gateway process.

### Overview

```
┌─────────────┐      ┌─────────────────┐      ┌────────────────────┐
│ OpenRouter  │─────▶│ FastAPI Gateway │─────▶│ Model Executors... │
└─────────────┘      └─────────────────┘      └────────────────────┘
```

- FastAPI binds to `0.0.0.0:80` and exposes `/v1` endpoints consumed by OpenRouter clients.
- The gateway handles request authentication, routing, and backpressure before invoking the selected model adapter.
- `systemd` supervises the process, ensuring automatic restarts after crashes or host reboots.
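Before wiring the app into `systemd`, it can be launched by hand for a quick smoke test. This is a sketch, not part of the deployment proper: it reuses the entry point from the unit file below, but runs on an arbitrary unprivileged port (8000) so root is not needed; `/health` and `/v1/models` are the same routes used under Runtime Operations.

```bash
# Manual smoke test on an unprivileged port (8000 is arbitrary)
uvicorn serving.servers.bootstrap:app --host 127.0.0.1 --port 8000 &

# Confirm the OpenRouter-compatible surface answers
curl -s http://127.0.0.1:8000/health
curl -s http://127.0.0.1:8000/v1/models | jq
```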
### Deployment Steps

1. **Install runtime dependencies**

   Ensure the Python environment and model weights are ready. Confirm the FastAPI entry point (`serving.servers.bootstrap:app`) is reachable via `uvicorn` or the configured launcher script.

2. **Create the unit file**

   ```bash
   sudo tee /etc/systemd/system/freeinference.service <<'UNIT'
   [Unit]
   Description=FreeInference FastAPI service
   After=network-online.target
   Wants=network-online.target

   [Service]
   Type=simple
   User=ubuntu
   WorkingDirectory=/home/ubuntu/hybridInference
   ExecStart=/usr/bin/env uvicorn serving.servers.bootstrap:app --host 0.0.0.0 --port 80
   Restart=always
   RestartSec=5
   Environment=PYTHONUNBUFFERED=1

   [Install]
   WantedBy=multi-user.target
   UNIT
   ```

   Replace the `User`, `WorkingDirectory`, and `Environment` entries as needed for the target host. The repository carries a maintained version of this unit at `infrastructure/systemd/hybrid_inference.service`; copy or symlink it into `/etc/systemd/system/freeinference.service` during deploys.

3. **Reload and enable the service**

   ```bash
   sudo systemctl daemon-reload
   sudo systemctl enable freeinference.service
   sudo systemctl start freeinference.service
   sudo systemctl status freeinference.service
   ```

### Runtime Operations

- Restart on demand: `sudo systemctl restart freeinference.service`
- Follow logs: `journalctl -u freeinference.service -f`
- Health check: `curl https://freeinference.org/health`
- List registered models: `curl https://freeinference.org/v1/models | jq`

### Why We Dropped Nginx

- FastAPI already terminates HTTP and exposes the required OpenRouter-compatible endpoints.
- Nginx added another moving part, increasing failover complexity and opaque error handling.
- Debugging latency or request routing is simpler when traffic is handled in a single process.

## Legacy Architectures

### Nginx (v2, abandoned)

We briefly fronted FastAPI (running on port 8080) with vanilla Nginx that exposed `http://freeinference.org` on port 80 and terminated TLS for the public endpoint. Once Cloudflare took over edge SSL duties, the extra hop mostly added deployment and observability complexity without material benefit, so the setup was removed.

### Nginx + Lua via OpenResty (v1, abandoned)

We previously relied on OpenResty (Nginx + Lua) to provide a production routing tier across multiple LLM backends. The stack handled model mapping, load balancing, health checks, and error handling. We keep the installation notes for posterity.

#### Overview

```
┌─────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   Client    │─────▶│    OpenResty     │─────▶│   Backend 1     │
│ (API Call)  │      │    (Router)      │      │  (Qwen@8000)    │
└─────────────┘      │                  │      └─────────────────┘
                     │ - Model Mapping  │
                     │ - Load Balancing │      ┌─────────────────┐
                     │ - Health Checks  │─────▶│   Backend 2     │
                     │ - Error Handling │      │  (Llama@8001)   │
                     └──────────────────┘      └─────────────────┘
```

#### Installation Notes

```bash
# Add repository
wget -O - https://openresty.org/package/pubkey.gpg | sudo apt-key add -
echo "deb http://openresty.org/package/ubuntu $(lsb_release -sc) main" | \
    sudo tee /etc/apt/sources.list.d/openresty.list

# Install
sudo apt-get update
sudo apt-get install openresty
```

```bash
# Create the site directories
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-available
sudo mkdir -p /usr/local/openresty/nginx/conf/sites-enabled

# Copy the vllm site config into sites-available
# (source path depends on where the config lives on the host)
sudo cp <path-to-vllm-config> /usr/local/openresty/nginx/conf/sites-available/vllm

# Enable the site
sudo ln -s /usr/local/openresty/nginx/conf/sites-available/vllm \
    /usr/local/openresty/nginx/conf/sites-enabled/vllm
```

Add the Lua settings and the site include to the `http` block of the main OpenResty `nginx.conf`:

```nginx
http {
    # ... other settings ...

    # Lua settings
    lua_package_path "/usr/local/openresty/lualib/?.lua;;";
    lua_shared_dict model_cache 10m;

    # Include site configuration
    include /usr/local/openresty/nginx/conf/sites-enabled/*;
}
```

```bash
# Test the OpenResty config
sudo openresty -t

# Start
sudo systemctl start openresty

# Enable auto-start
sudo systemctl enable openresty

# Reload OpenResty
sudo openresty -s reload
```

```bash
# Check service status
curl https://freeinference.org/health

# List all models
curl https://freeinference.org/v1/models | jq

# Chat with Qwen3-Coder
curl -X POST http://freeinference.org/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Qwen_Qwen3-Coder-480B-A35B-Instruct-FP8", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'

# Chat with llama4-scout
curl -X POST http://freeinference.org/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/meta-llama_Llama-4-Scout-17B-16E", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}'
```

### Nginx (v0, abandoned)

```bash
# Edit the site config, validate, and reload
sudo vim /etc/nginx/sites-available/vllm
sudo nginx -t
sudo systemctl reload nginx

# Test the endpoint
curl https://freeinference.org/v1/models
```
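For reference, a minimal sketch of the kind of reverse-proxy site file this step edits; the backend port (8080) follows the v2 description, and the actual historical config may have differed (TLS/Cloudflare details omitted):

```bash
sudo tee /etc/nginx/sites-available/vllm <<'SITE'
server {
    listen 80;
    server_name freeinference.org;

    location / {
        # Forward all traffic to the local FastAPI app
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
SITE

# Symlink into sites-enabled so nginx picks it up on reload
sudo ln -s /etc/nginx/sites-available/vllm /etc/nginx/sites-enabled/vllm
```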