API Reference

Complete reference for the HybridInference API.

Base URL

https://freeinference.org/v1

Authentication

All API requests require authentication using an API key in the Authorization header:

Authorization: Bearer YOUR_API_KEY
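For example, a minimal sketch of attaching this header from Python using the requests library (the full URL assumes endpoints are served directly under the base URL):

import requests

API_KEY = "YOUR_API_KEY"  # replace with your key

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# The same header is attached to every authenticated request, e.g. listing models:
response = requests.get("https://freeinference.org/v1/models", headers=headers)
print(response.status_code)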

Endpoints

Chat Completions

Create a chat completion using a specified model.

Endpoint: POST /v1/chat/completions

Headers:

Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

Request Body:

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID to use (e.g., llama-3.3-70b-instruct) |
| messages | array | Yes | Array of message objects |
| temperature | number | No | Sampling temperature (0-2). Default: 1 |
| max_tokens | integer | No | Maximum tokens to generate |
| top_p | number | No | Nucleus sampling parameter (0-1) |
| stream | boolean | No | Whether to stream responses. Default: false |
| stop | string or array | No | Stop sequences |
| response_format | object | No | Format of response (e.g., {"type": "json_object"}) |
| tools | array | No | Function calling tools |
| tool_choice | string or object | No | Tool choice strategy |

Message Object:

| Field | Type | Description |
|---|---|---|
| role | string | Role: system, user, or assistant |
| content | string or array | Message content (text or multimodal) |

Example Request:

{
  "model": "llama-3.3-70b-instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
  ],
  "temperature": 0.7,
  "max_tokens": 1000
}

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "llama-3.3-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 8,
    "total_tokens": 28
  }
}
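The example above can be sent from Python with the requests library; a minimal sketch (assuming the endpoint resolves to the base URL plus /chat/completions):

import requests

url = "https://freeinference.org/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}
payload = {
    "model": "llama-3.3-70b-instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.7,
    "max_tokens": 1000,
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()
print(data["choices"][0]["message"]["content"])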

Streaming Response:

When stream: true, responses are sent as Server-Sent Events (SSE). With curl, use -N (no-buffer) to see tokens as they arrive:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1234567890,"model":"llama-3.3-70b-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":"The"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1234567890,"model":"llama-3.3-70b-instruct","choices":[{"index":0,"delta":{"content":" capital"},"finish_reason":null}]}

data: [DONE]
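A rough Python sketch of consuming the stream with requests (each non-empty line carries a data: chunk; parsing details beyond what is shown above are assumptions):

import json
import requests

url = "https://freeinference.org/v1/chat/completions"
headers = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}
payload = {
    "model": "llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about rivers."}],
    "stream": True,
}

with requests.post(url, headers=headers, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue  # SSE events are separated by blank lines
        text = line.decode("utf-8")
        if text.startswith("data: "):
            text = text[len("data: "):]
        if text == "[DONE]":
            break
        delta = json.loads(text)["choices"][0]["delta"]
        print(delta.get("content", ""), end="", flush=True)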

List Models

Get a list of available models.

Endpoint: GET /v1/models

Headers:

Authorization: Bearer YOUR_API_KEY

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama-3.3-70b-instruct",
      "object": "model",
      "created": 1234567890,
      "owned_by": "system",
      "context_length": 131072,
      "architecture": {
        "modality": "text",
        "tokenizer": "llama3",
        "instruct_type": "llama3"
      },
      "pricing": {
        "prompt": "0",
        "completion": "0",
        "request": "0"
      }
    }
  ]
}
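For example, printing model IDs and context lengths from Python (a requests-based sketch; field names follow the response above):

import requests

headers = {"Authorization": "Bearer YOUR_API_KEY"}
response = requests.get("https://freeinference.org/v1/models", headers=headers)

for model in response.json()["data"]:
    print(model["id"], "- context length:", model.get("context_length"))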

Health Check

Check API health status.

Endpoint: GET /health

Response:

{
  "status": "ok",
  "timestamp": "2025-10-26T08:00:00Z"
}
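A quick check from Python (this assumes the health endpoint is served at the API host root rather than under /v1):

import requests

print(requests.get("https://freeinference.org/health").json())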

Parameters Reference

Temperature

Controls randomness in responses.

  • Range: 0.0 - 2.0

  • Default: 1.0

  • Lower values: More focused and deterministic

  • Higher values: More creative and diverse

Examples:

  • 0.0 - Deterministic (good for factual tasks)

  • 0.7 - Balanced (general use)

  • 1.5 - Creative (storytelling, brainstorming)

Max Tokens

Maximum number of tokens to generate.

  • Default: Model-specific

  • Note: Input + output tokens cannot exceed model’s context length

Top P (Nucleus Sampling)

Alternative to temperature for controlling diversity.

  • Range: 0.0 - 1.0

  • Default: 1.0

  • Lower values: More focused

  • Higher values: More diverse

Note: It’s recommended to use either temperature or top_p, not both.
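To illustrate how the sampling parameters fit together in a request body, here is an illustrative payload that sets temperature and leaves top_p at its default (the values are arbitrary examples):

payload = {
    "model": "llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Brainstorm five product names."}],
    "temperature": 1.2,   # more creative output
    "max_tokens": 256,    # cap the completion length
    # top_p is left at its default of 1.0 because temperature is already set
}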

Stop Sequences

Sequences where the model will stop generating.

Examples:

{
  "stop": "\n"  // Single stop sequence
}
{
  "stop": ["\n", "###", "END"]  // Multiple stop sequences
}

Response Formats

Standard Text Response

Default response format.

{
  "response_format": {"type": "text"}
}

JSON Mode

Forces the model to respond with valid JSON.

{
  "response_format": {"type": "json_object"}
}

Example:

{
  "model": "llama-3.3-70b-instruct",
  "messages": [
    {
      "role": "user",
      "content": "Extract person info: John is a 30-year-old engineer. Return as JSON."
    }
  ],
  "response_format": {"type": "json_object"}
}

Response:

{
  "name": "John",
  "age": 30,
  "occupation": "engineer"
}
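The assistant's reply still arrives as a string in message.content, so clients typically parse it themselves; a sketch using requests and json.loads (the call mirrors the example above):

import json
import requests

response = requests.post(
    "https://freeinference.org/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"},
    json={
        "model": "llama-3.3-70b-instruct",
        "messages": [
            {"role": "user", "content": "Extract person info: John is a 30-year-old engineer. Return as JSON."}
        ],
        "response_format": {"type": "json_object"},
    },
)

person = json.loads(response.json()["choices"][0]["message"]["content"])
print(person["name"], person["age"], person["occupation"])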

Function Calling

Enable the model to call functions you define.

Tool Definition

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "City name"
            },
            "unit": {
              "type": "string",
              "enum": ["celsius", "fahrenheit"]
            }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Tool Choice Options

  • "auto" - Model decides whether to call a function

  • "none" - Model will not call any function

  • {"type": "function", "function": {"name": "function_name"}} - Force specific function

Response with Function Call

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"Paris\", \"unit\": \"celsius\"}"
            }
          }
        ]
      }
    }
  ]
}
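A sketch of the full round trip in Python: send the request with tools, read tool_calls from the response, run your own function locally, then return its result in a follow-up message. The get_weather implementation is a local stand-in, and the "tool" role message follows the usual OpenAI-style convention, which is assumed here rather than documented above:

import json
import requests

URL = "https://freeinference.org/v1/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}

def get_weather(location, unit="celsius"):
    # Local stand-in; a real implementation would call a weather service.
    return {"location": location, "temperature": 18, "unit": unit}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
first = requests.post(URL, headers=HEADERS, json={
    "model": "llama-3.3-70b-instruct",
    "messages": messages,
    "tools": tools,
    "tool_choice": "auto",
}).json()

tool_call = first["choices"][0]["message"]["tool_calls"][0]
args = json.loads(tool_call["function"]["arguments"])
result = get_weather(**args)

# Send the assistant's tool call and the tool result back for a final answer.
messages.append(first["choices"][0]["message"])
messages.append({"role": "tool", "tool_call_id": tool_call["id"], "content": json.dumps(result)})

final = requests.post(URL, headers=HEADERS, json={
    "model": "llama-3.3-70b-instruct",
    "messages": messages,
    "tools": tools,
}).json()
print(final["choices"][0]["message"]["content"])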

Vision (Multimodal)

Send images along with text (requires vision-capable models like llama-4-maverick).

Image URL

{
  "model": "llama-4-maverick",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/image.jpg"
          }
        }
      ]
    }
  ]
}

Base64 Image

{
  "type": "image_url",
  "image_url": {
    "url": "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
  }
}
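A small Python sketch for building that data URL from a local file (the file path is a placeholder):

import base64

with open("photo.jpg", "rb") as f:  # placeholder path
    encoded = base64.b64encode(f.read()).decode("utf-8")

image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
}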

Error Codes

| Code | Description |
|---|---|
| 400 | Bad Request - Invalid parameters |
| 401 | Unauthorized - Invalid or missing API key |
| 404 | Not Found - Model or endpoint not found |
| 429 | Too Many Requests - Rate limit exceeded |
| 500 | Internal Server Error - Server error |
| 503 | Service Unavailable - Server overloaded |

Error Response Format

{
  "error": {
    "message": "Invalid API key provided",
    "type": "invalid_request_error",
    "code": "invalid_api_key"
  }
}
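One way to handle these errors from Python, with a simple retry on 429 (the backoff values are arbitrary; the error body shape follows the format above):

import time
import requests

def post_with_retry(url, headers, payload, max_retries=3):
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload)
        if response.status_code == 429 and attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # simple exponential backoff
            continue
        if response.status_code >= 400:
            error = response.json().get("error", {})
            raise RuntimeError(f"{response.status_code}: {error.get('message')}")
        return response.json()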

Rate Limits

Current rate limits (subject to change):

  • Requests per minute: Based on your API key tier

  • Tokens per minute: Based on your API key tier

Rate limit headers are included in responses:

X-RateLimit-Limit-Requests: 100
X-RateLimit-Remaining-Requests: 99
X-RateLimit-Reset-Requests: 2025-10-26T08:01:00Z
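These headers can be read straight off the HTTP response; for example, with requests (header names as listed above):

import requests

response = requests.get(
    "https://freeinference.org/v1/models",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
remaining = response.headers.get("X-RateLimit-Remaining-Requests")
reset_at = response.headers.get("X-RateLimit-Reset-Requests")
print(f"Requests remaining: {remaining}, resets at: {reset_at}")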

OpenRouter Compatibility

This API is fully compatible with OpenRouter clients and libraries. Simply change the base URL to:

import openai

client = openai.OpenAI(
    base_url="https://freeinference.org/v1",
    api_key="your-api-key-here"
)

All OpenRouter-compatible parameters and features are supported.
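With the client configured this way, calls go through the standard OpenAI-style interface; a brief usage sketch:

completion = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Say hello in French."}],
)
print(completion.choices[0].message.content)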