# Available Models

HybridInference provides access to multiple state-of-the-art large language models.

## Model Overview

| Model ID | Name | Context Length | Pricing |
|----------|------|----------------|---------|
| `llama-3.3-70b-instruct` | Llama 3.3 70B Instruct | 131K tokens | Free |
| `llama-4-scout` | Llama 4 Scout | 128K tokens | Free |
| `llama-4-maverick` | Llama 4 Maverick | 128K tokens | Free |
| `gemini-2.5-flash` | Gemini 2.5 Flash | 1M tokens | Free |
| `gemini-2.5-flash-preview-09-2025` | Gemini 2.5 Flash Preview | 1M tokens | Free |
| `glm-4.5` | GLM-4.5 | 128K tokens | Free |
| `gpt-5` | GPT-5 | 128K tokens | Free |
| `custom-model-alpha` | Claude Opus 4.1 | 200K tokens | Free |
| `custom-model-beta` | GPT-5 (Azure) | 400K tokens | Free |

## Model Details

### Llama 3.3 70B Instruct

**Model ID:** `llama-3.3-70b-instruct`

High-performance open-source model optimized for instruction following.

**Key Features:**
- Context length: 131,072 tokens
- Max output: 8,192 tokens
- Function calling support
- Structured output (JSON mode)
- Quantization: bf16

**Best For:**
- General-purpose chat
- Long-form content generation
- Code generation
- Instruction following

**Example:**

```python
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=2048
)
```

---

### Llama 4 Scout

**Model ID:** `llama-4-scout`

Efficient MoE (Mixture of Experts) model for fast inference.

**Key Features:**
- Context length: 128,000 tokens
- Max output: 16,384 tokens
- Function calling support
- Structured output
- Quantization: fp8

**Best For:**
- Fast inference scenarios
- Cost-effective deployments
- Production workloads

---

### Llama 4 Maverick

**Model ID:** `llama-4-maverick`

Advanced MoE (Mixture of Experts) model for complex tasks.

**Key Features:**
- Context length: 128,000 tokens
- Max output: 16,384 tokens
- Function calling support
- Structured output
- Quantization: fp8

**Best For:**
- Complex reasoning tasks
- Long-form generation
- Production workloads with high quality requirements

---

### Gemini 2.5 Flash

**Model ID:** `gemini-2.5-flash`

Google's fast and efficient model.

**Key Features:**
- Fast inference speed
- High throughput
- Large context window
- Production-ready

**Best For:**
- Real-time applications
- High-volume workloads
- Quick responses

---

### GLM-4.5

**Model ID:** `glm-4.5`

Bilingual model optimized for Chinese and English.

**Best For:**
- Chinese language tasks
- Bilingual applications
- Cross-language translation

---

### GPT-5

**Model ID:** `gpt-5`

OpenAI's latest flagship model.

**Best For:**
- Complex reasoning
- Advanced code generation
- Research applications

---

### Claude Opus 4.1

**Model ID:** `custom-model-alpha`

Anthropic's most capable model for complex tasks.

**Best For:**
- Long-form writing
- Advanced analysis
- Research and development
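**Example** (a minimal sketch mirroring the Llama 3.3 example above; the prompt and `max_tokens` value are illustrative, and `client` is the OpenAI-compatible client shown under "Using Different Models" below):

```python
response = client.chat.completions.create(
    model="custom-model-alpha",  # routes to Claude Opus 4.1
    messages=[{"role": "user", "content": "Write a detailed analysis of renewable energy trends"}],
    max_tokens=4096  # illustrative; long-form writing benefits from generous output budgets
)
print(response.choices[0].message.content)
```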
---

## Using Different Models

Simply change the `model` parameter in your request:

```python
import openai

client = openai.OpenAI(
    base_url="https://freeinference.org/v1",
    api_key="your-api-key-here"
)

# Use Llama 3.3
response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Switch to Gemini
response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

## Model Selection Guide

**For general chat and instructions:**
- `llama-3.3-70b-instruct` - Best balance of quality and speed
- `llama-4-maverick` - High quality, complex tasks

**For fast inference:**
- `llama-4-scout` - Optimized for speed
- `gemini-2.5-flash` - High throughput, real-time use

**For Chinese language:**
- `glm-4.5` - Bilingual Chinese/English support

**For advanced reasoning:**
- `gpt-5` - Latest OpenAI capabilities
- `custom-model-alpha` (Claude Opus 4.1) - Complex analysis and long-form writing
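For the real-time use cases where the guide recommends `gemini-2.5-flash`, you will usually want to stream tokens as they arrive rather than wait for the full completion. A minimal sketch using the standard OpenAI-compatible `stream=True` flag; per-model streaming support on this endpoint is an assumption, not something the table above confirms:

```python
stream = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Summarize the benefits of solar power"}],
    stream=True,  # assumes the endpoint supports server-sent-event streaming
)

# Print each token as it arrives instead of waiting for the full response.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```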
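Because the endpoint is OpenAI-compatible, it may also expose the standard `GET /v1/models` route; that is an assumption based on the client setup above, not something these docs state. If it does, you can discover model IDs at runtime instead of hard-coding the overview table:

```python
import openai

client = openai.OpenAI(
    base_url="https://freeinference.org/v1",
    api_key="your-api-key-here"
)

# List every model ID the endpoint advertises (assumes the standard
# OpenAI-compatible /v1/models route is implemented).
for model in client.models.list():
    print(model.id)
```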