Overview
LLM Services provides a unified interface for interacting with multiple Large Language Model providers. It abstracts provider-specific complexities, allowing you to switch models via configuration without changing code logic.
Key Features:
- Multi-Provider: Support for multiple LLM providers, enabling flexible selection based on compliance, cost, and performance requirements.
- Structured Output: Built-in json_mode ensures responses are valid JSON objects, regardless of the underlying provider.
- Dynamic Ollama: Automatically discovers local models from your Ollama instance.
- Resilience: Automatic retries and circuit breakers for high availability.
- Advanced Logic: Support for "Thinking" models and streaming responses.
- Conditional Flask: Use as a library (no Flask required) or as an HTTP API server (Flask required only in that mode).
- Debugging & Transparency: Built-in logging, request tracking, performance metrics, and configuration status endpoints.
Supported Providers
Access leading cloud and local LLMs through a single interface; the providers referenced throughout this documentation include OpenAI, Anthropic, Google Gemini, DeepSeek, and local Ollama models.
Resilience Architecture
Built-in fault tolerance mechanisms to ensure high availability. Settings can be tuned in llm_config.json.
Configuration Parameters
The resilience behavior is controlled by the following parameters in the root "resilience" block of your config file.
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_retries | Int | 3 | Number of retry attempts for transient errors (e.g., 5xx errors, rate limits, timeouts). |
| backoff_factor | Float | 1.5 | Multiplier for calculating wait time between retries. Formula: wait = factor ^ attempt. |
| circuit_breaker_failure_threshold | Int | 5 | Consecutive failures allowed before a provider is blocked (Circuit Open state). |
| circuit_breaker_recovery_timeout | Int | 60 | Seconds to wait before testing a blocked provider again (Half-Open state). |
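For reference, a minimal sketch of this block using the default values from the table above. With backoff_factor 1.5, retry waits grow roughly as 1.5, 2.25, and 3.4 seconds over successive attempts (assuming the formula's result is in seconds).
{
  "resilience": {
    "max_retries": 3,
    "backoff_factor": 1.5,
    "circuit_breaker_failure_threshold": 5,
    "circuit_breaker_recovery_timeout": 60
  }
}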
Setup & Configuration
The service separates secrets (API keys) from logic (Model definitions).
1. Requirements & Secrets
Install packages and set environment variables.
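A minimal install sketch, using the package list from the prerequisites of the complete standalone example later in this documentation:
pip install python-dotenv openai anthropic google-genai requests urllib3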
2. Environment Variables (.env)
Store your API keys and endpoints here. Do not commit this file.
With LLM_API_MODE=false (the default), you can use LLMService as a Python library without installing Flask. Set LLM_API_MODE=true to enable the HTTP API server.
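For reference, a minimal .env sketch assembled from the variables used throughout this documentation (values are placeholders; include only the keys for providers you actually call):
# API keys (placeholders)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...

# Optional settings
LLM_CONFIG_FILE=llm_config.json
LLM_API_MODE=false
LLM_LOG_LEVEL=INFO
LLM_DEBUG=false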
3. Model & Resilience Configuration (llm_config.json)
Define models and resilience behaviors. This file can be hot-reloaded.
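As an illustration only, a hypothetical layout is sketched below: the resilience keys match the table above, while the model-definition section (key names and fields) is an assumption and should be adapted to your actual schema.
{
  "models": {
    "openai": {"default_model": "gpt-4o", "max_tokens": 2000},
    "deepseek": {"default_model": "deepseek-reasoner"}
  },
  "resilience": {
    "max_retries": 3,
    "backoff_factor": 1.5,
    "circuit_breaker_failure_threshold": 5,
    "circuit_breaker_recovery_timeout": 60
  }
}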
API Endpoints
RESTful API endpoints for interaction and management.
LLM Call Endpoint
Config Reload Endpoint
Debugging & Transparency
Comprehensive debugging tools for monitoring, troubleshooting, and optimizing your LLM integrations.
Environment Variables
Control logging and debugging behavior through environment variables in your .env file.
| Variable | Values | Description |
|---|---|---|
| LLM_LOG_LEVEL | DEBUG, INFO, WARNING, ERROR | Set the logging verbosity level. Default: INFO |
| LLM_DEBUG | true, false | Enable verbose logging with request/response details. Default: false |
Warning: Avoid enabling LLM_DEBUG=true or LLM_LOG_LEVEL=DEBUG in production, as they may log sensitive data, including request payloads and response content.
Configuration Status Endpoint
Check your configuration and provider status programmatically.
Response Example:
{
"version": "v1.9.0",
"providers_configured": ["openai", "anthropic", "deepseek"],
"providers_status": {
"openai": {"configured": true},
"anthropic": {"configured": true},
"ollama": {
"configured": false,
"missing_keys": ["OLLAMA_CHAT_ENDPOINT"]
}
},
"environment": {
"LLM_API_MODE": false,
"LLM_LOG_LEVEL": "INFO",
"LLM_DEBUG": false,
"config_loaded": true
},
"clients_initialized": {
"openai": true,
"anthropic": true,
"gemini": false
},
"circuit_breakers": {
"openai": {
"state": "CLOSED",
"failure_count": 0,
"failure_threshold": 5
}
}
}
Request Tracking
Every request automatically gets a unique request_id for tracing.
from llmservices import LLMService, LLMRequest
req = LLMRequest(
provider="openai",
model="gpt-4o",
prompt="Hello"
)
# Access the auto-generated request ID
print(f"Request ID: {req.request_id}")
response = LLMService.call(req)
# Response includes debugging information
print(f"Request ID: {response.request_id}")
print(f"Finish Reason: {response.finish_reason}")
print(f"Usage: {response.usage}")
if response.timing:
print(f"Duration: {response.timing.total_duration_ms}ms")
if response.metadata:
print(f"Metadata: {response.metadata.to_dict()}")
Enhanced Error Handling
The LLMError class provides detailed error context for debugging.
from llmservices import LLMService, LLMRequest, LLMError
try:
req = LLMRequest(provider="openai", model="gpt-4o", prompt="Hello")
response = LLMService.call(req)
except LLMError as e:
# Access structured error information
print(f"Error: {e.message}")
print(f"Provider: {e.provider}")
print(f"Model: {e.model}")
print(f"Status Code: {e.status_code}")
print(f"Request ID: {e.request_id}")
print(f"Error Code: {e.error_code}")
# Convert to dictionary for logging
error_dict = e.to_dict()
import json
print(json.dumps(error_dict, indent=2))
Logging Examples
Enable different log levels based on your needs:
# .env file for development
LLM_LOG_LEVEL=DEBUG
LLM_DEBUG=true

# .env file for production
LLM_LOG_LEVEL=WARNING
LLM_DEBUG=false
Debugging Use Cases
- Track Requests: Use request_id to trace requests through your system
- Monitor Performance: Check response.timing for latency metrics
- Debug Errors: Use LLMError for structured error information
- Verify Configuration: Call LLMConfig.get_status() to check setup
- Monitor Token Usage: Check response.usage for cost tracking
- Trace Failures: Use circuit breaker status to identify provider issues
Tip: You can call LLMConfig.get_status() directly in your Python code to check configuration status without making an HTTP request.
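A minimal sketch of that direct call, assuming LLMConfig is importable from llmservices like the other classes shown here and that get_status() returns a dictionary matching the response example above:
from llmservices import LLMConfig  # assumed import path
import json

status = LLMConfig.get_status()
print(json.dumps(status, indent=2))

# Spot-check documented fields (structure assumed from the response example above)
if not status["environment"]["config_loaded"]:
    print("Warning: llm_config.json was not loaded")
for provider, info in status["providers_status"].items():
    if not info.get("configured"):
        print(f"{provider} missing: {info.get('missing_keys', [])}")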
Best Practices
Guidelines for optimal performance and security.
- JSON Mode: Use "json_mode": true when building tools that require structured data parsing (e.g., dashboards, extracting data from CVs).
- Use Configuration Files: Keep model definitions in llm_config.json to allow hot-swapping models without code changes.
- Temperature Settings: Use 0.3 for factual tasks and 0.7-1.0 for creative writing (see the sketch after this list).
- Enable Thinking: For logic or math, use models like o1 or R1 and set "enable_thinking": true.
- Streaming: Always use "stream": true for long-form generation to improve UX.
Usage Modes
Use as a standalone Python library or as an HTTP API server.
Library Mode (Default)
Import and use LLMService directly in your Python code. No Flask required.
from llmservices import LLMService, LLMRequest

req = LLMRequest(provider="openai", model="gpt-4o", prompt="Hello")
response = LLMService.call(req)
print(response.content)
API Server Mode
Set LLM_API_MODE=true and run as an HTTP server.
# Enable API mode in .env
export LLM_API_MODE=true

# Run the server
python llmservices.py

# Or programmatically
from llmservices import run_api_server
run_api_server(port=8888)
Python Integration Patterns
Choose the integration method that fits your architecture.
Comparison: API vs. Direct Class
| Feature | API Approach (HTTP) | Direct Approach (Class) |
|---|---|---|
| Use Case | Microservices, Frontend-to-Backend, Polyglot systems | Internal Python Tools, Monolithic Backends |
| Performance | Network overhead introduced | No network overhead (in-process call) |
| Data Structure | Raw JSON Responses | Typed Objects (LLMResponse) |
Method 1: API Approach (HTTP)
Best for decoupled services or when calling from non-Python languages.
import requests
import json
# Standard API call
url = 'http://localhost:8888/api/llm/call'
payload = {
'provider': 'deepseek',
'model': 'deepseek-reasoner',
'prompt': 'Solve this complex logic puzzle...',
'enable_thinking': True,
'stream': True
}
try:
response = requests.post(url, json=payload, stream=True)
# Check for Circuit Breaker (503) or other errors
if response.status_code == 503:
error_data = response.json()
print(f"⛔ Circuit Breaker Open: {error_data['error']}")
elif response.status_code != 200:
print(f"⚠ Error {response.status_code}: {response.text}")
else:
# Process successful stream
for line in response.iter_lines():
if line:
decoded_line = line.decode('utf-8')
if decoded_line.startswith('data: '):
print(decoded_line)
except requests.exceptions.ConnectionError:
print("❌ Could not connect to the LLM Service API.")
Method 2: Direct Approach (LLMService Class)
Best for high-performance internal Python scripts with simple streaming.
from llmservices import LLMService, LLMRequest, CircuitBreakerOpenException
# Create typed request object
llm_request = LLMRequest(
provider='deepseek',
model='deepseek-reasoner',
prompt='Write a short poem about AI.',
enable_thinking=True,
stream=True
)
# Call the LLM service directly (No HTTP overhead)
try:
for chunk in LLMService.stream(llm_request):
print(chunk, end='', flush=True)
except CircuitBreakerOpenException as e:
print(f"Service Unavailable: {e}")
except Exception as e:
print(f"Error: {e}")
Method 3: Structured JSON Output
Best for extracting data or building software tools. Works with all providers.
from llmservices import LLMService, LLMRequest
import json
llm_request = LLMRequest(
provider='openai', # Works with anthropic, gemini, etc.
model='gpt-4o',
prompt='Extract names and dates from the text: "Meeting with Sarah on 2025-05-12."',
json_mode=True # <--- Forces valid JSON output
)
response = LLMService.call(llm_request)
# Parse response as standard JSON
data = json.loads(response.content)
print(data)
# Output: {"names": ["Sarah"], "dates": ["2025-05-12"]}
Method 4: Reasoning Models (o1, gpt-5)
Best for complex logic, math, and multi-step reasoning. OpenAI supports reasoning_effort levels (low/medium/high).
from llmservices import LLMService, LLMRequest
# OpenAI o1/gpt-5 with reasoning effort control
llm_request = LLMRequest(
provider='openai',
model='o1', # or 'gpt-5'
prompt='Solve this step by step: What is 12345 + 67890?',
enable_thinking=True,
reasoning_effort='high', # Options: 'low', 'medium', 'high'
max_tokens=2000
)
response = LLMService.call(llm_request)
print(f"Answer: {response.content}")
print(f"Reasoning: {response.reasoning_content}")
print(f"Tokens: {response.usage}")
Note: Only OpenAI supports reasoning_effort levels for o1/gpt-5. DeepSeek uses the boolean enable_thinking (no effort levels), and Anthropic's extended thinking is automatic.
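A short sketch of that difference, reusing only parameters already shown in Methods 2 and 4 (prompts and model choices are illustrative):
from llmservices import LLMRequest

# OpenAI reasoning models: effort levels are available
openai_req = LLMRequest(
    provider='openai', model='o1',
    prompt='Prove that the sum of two even numbers is even.',
    enable_thinking=True, reasoning_effort='medium'
)

# DeepSeek: boolean thinking switch only, no effort levels
deepseek_req = LLMRequest(
    provider='deepseek', model='deepseek-reasoner',
    prompt='Prove that the sum of two even numbers is even.',
    enable_thinking=True
)

# Anthropic: extended thinking is automatic, so no extra flags are set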
Method 5: Streaming with JSON Mode (Advanced)
Best for real-time JSON responses. Shows SSE parsing to handle the [DONE] marker.
from llmservices import LLMService, LLMRequest
import json
def parse_stream_chunk(chunk: str) -> str:
"""Parse SSE format and extract text content."""
chunk = chunk.strip()
if chunk.startswith('data: '):
data_part = chunk[6:] # Remove "data: " prefix
if data_part.strip() == '[DONE]':
return '' # Filter out stream end marker
try:
data = json.loads(data_part)
if 'chunk' in data:
return data['chunk']
elif 'delta' in data and 'content' in data['delta']:
return data['delta']['content']
elif 'content' in data:
return data['content']
except json.JSONDecodeError:
pass
return chunk
# Stream JSON response
llm_request = LLMRequest(
provider='openai',
model='gpt-4o',
prompt='List 3 programming languages with their release years.',
json_mode=True,
stream=True
)
full_response = ""
for raw_chunk in LLMService.stream(llm_request):
text = parse_stream_chunk(raw_chunk)
if text: # Skip empty strings (filters out [DONE])
print(text, end='', flush=True)
full_response += text
# Parse the accumulated JSON
data = json.loads(full_response)
print(f"\nParsed: {json.dumps(data, indent=2)}")
Note: The stream sends data: [DONE] as the final chunk. The parse_stream_chunk() helper returns an empty string for this marker; use if text: checks when accumulating content to skip it.
Complete Standalone Example
A complete, runnable example showing library mode setup from scratch.
"""
Example of using llmservices in Library Mode with Streaming Enabled
This example demonstrates how to use LLMService as a Python library
(without Flask) to stream responses from an LLM provider.
Prerequisites:
1. Install required packages:
pip install python-dotenv openai anthropic google-genai requests urllib3
2. Set up your .env file with API keys:
The .env file can be located in either:
- This directory (llmbase_demo/.env) - for local configuration
- Parent llmbase directory (llmbase/.env) - shared configuration
Example .env file:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
LLM_CONFIG_FILE=llm_config.json
3. (Optional) Create llm_config.json with custom model configurations
- Default config: llmbase/llm_config.json (already exists with sensible defaults)
- Local override: Create llmbase_demo/llm_config.json for project-specific settings
- Useful for overriding default settings like max_tokens, temperature, etc.
"""
import os
import json
from dotenv import load_dotenv
import sys
from pathlib import Path
# ============================================================================
# PATH CONFIGURATION
# ============================================================================
# Add parent llmbase directory to Python path for imports
# This allows importing llmservices from a sibling directory
# Example structure:
# /Users/cio/gai/
# ├── llmbase/ (contains llmservices.py)
# └── llmbase_demo/ (contains this file)
sys.path.insert(0, str(Path(__file__).parent.parent / "llmbase"))
# ============================================================================
# LLM SERVICE IMPORTS
# ============================================================================
from llmservices import LLMService, LLMRequest, CircuitBreakerOpenException
# LLMService: Main class for making LLM calls
# LLMRequest: Data class for request parameters
# CircuitBreakerOpenException: Raised when provider is blocked due to failures
# ============================================================================
# ENVIRONMENT SETUP
# ============================================================================
# Load environment variables from .env file
# Required: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.
# Optional: LLM_CONFIG_FILE (defaults to llm_config.json)
load_dotenv()
def parse_stream_chunk(chunk: str) -> str:
"""
Parse SSE (Server-Sent Events) format and extract the text content.
Handles formats like:
- data: {"chunk": "text"}
- data: {"delta": {"content": "text"}}
- data: [DONE] (stream end marker)
- raw text
Returns:
str: The extracted text content, or empty string for [DONE] markers
"""
chunk = chunk.strip()
# Handle SSE format: data: {...}
if chunk.startswith('data: '):
data_part = chunk[6:] # Remove "data: " prefix
# Check for stream end marker
if data_part.strip() == '[DONE]':
return '' # Signal end of stream
try:
data = json.loads(data_part)
# Try different possible keys for the content
if 'chunk' in data:
return data['chunk']
elif 'delta' in data and 'content' in data['delta']:
return data['delta']['content']
elif 'content' in data:
return data['content']
elif 'text' in data:
return data['text']
except json.JSONDecodeError:
pass
# Return as-is if not SSE format or parsing failed
return chunk
def stream_basic_example():
"""Basic streaming example with default settings."""
llm_request = LLMRequest(
provider='openai',
model='gpt-4o',
prompt='Write a short haiku about artificial intelligence.',
stream=True # Enable streaming
)
try:
for raw_chunk in LLMService.stream(llm_request):
text = parse_stream_chunk(raw_chunk)
print(text, end='', flush=True)
print()
except CircuitBreakerOpenException as e:
print(f"Service Unavailable (Circuit Breaker): {e}")
except Exception as e:
print(f"Error: {e}")
# Run the example
if __name__ == "__main__":
stream_basic_example()
Save this example as a file (e.g., streaming_example.py) and run it directly. All required setup, including path configuration, imports, and environment loading, is included.