Overview
LLM Services provides a unified interface for interacting with multiple Large Language Model providers. It abstracts provider-specific complexities, allowing you to switch models via configuration without changing code logic.
Key Features:
- Multi-Provider: Support for multiple LLM providers, enabling flexible selection based on compliance, cost, and performance requirements.
- Structured Output: Built-in json_mode ensures responses are valid JSON objects, regardless of the underlying provider.
- Dynamic Ollama: Automatically discovers local models from your Ollama instance.
- Resilience: Automatic retries and circuit breakers for high availability.
- Advanced Logic: Support for "Thinking" models and streaming responses.
- Conditional Flask: Use as a library (no Flask required) or API server (Flask optional).
Supported Providers
Access leading Cloud and Local LLMs through a single interface.
Resilience Architecture
Built-in fault tolerance mechanisms to ensure high availability. Settings can be tuned in llm_config.json.
Configuration Parameters
The resilience behavior is controlled by the following parameters in the root "resilience" block of your config file.
| Parameter | Type | Default | Description |
|---|---|---|---|
| max_retries | Int | 3 | Number of retry attempts for transient errors (e.g., 5xx errors, rate limits, timeouts). |
| backoff_factor | Float | 1.5 | Base for exponential backoff between retries. Formula: wait = backoff_factor ^ attempt. |
| circuit_breaker_failure_threshold | Int | 5 | Consecutive failures allowed before a provider is blocked (Circuit Open state). |
| circuit_breaker_recovery_timeout | Int | 60 | Seconds to wait before testing a blocked provider again (Half-Open state). |
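To make the backoff formula concrete, here is a small illustration (plain Python, independent of the library) that prints the wait before each retry using the defaults above. Whether attempts are counted from 0 or 1 is an implementation detail; this sketch assumes they start at 1.
# Illustration of wait = backoff_factor ^ attempt with the default settings.
# Assumes attempts are numbered from 1; the actual counting is an implementation detail.
MAX_RETRIES = 3
BACKOFF_FACTOR = 1.5

for attempt in range(1, MAX_RETRIES + 1):
    wait = BACKOFF_FACTOR ** attempt
    print(f"Retry {attempt}: wait {wait:.2f}s")

# Retry 1: wait 1.50s
# Retry 2: wait 2.25s
# Retry 3: wait 3.38s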
Setup & Configuration
The service separates secrets (API keys) from logic (Model definitions).
1. Requirements & Secrets
Install packages and set environment variables.
2. Environment Variables (.env)
Store your API keys and endpoints here. Do not commit this file.
With LLM_API_MODE=false (the default), you can use LLMService as a Python library without installing Flask. Set LLM_API_MODE=true to enable the HTTP API server.
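A minimal .env using the variable names referenced throughout this documentation; set only the keys for the providers you actually use:
# Provider API keys (only those you need)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...

# Optional: path to the model/resilience config (defaults to llm_config.json)
LLM_CONFIG_FILE=llm_config.json

# Library mode by default; set to true to run the Flask HTTP API server
LLM_API_MODE=false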
3. Model & Resilience Configuration (llm_config.json)
Define models and resilience behaviors. This file can be hot-reloaded.
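The model-definition schema is not reproduced here. As a minimal sketch, the root "resilience" block looks like this with the documented defaults; model definitions live alongside it in the same file:
{
  "resilience": {
    "max_retries": 3,
    "backoff_factor": 1.5,
    "circuit_breaker_failure_threshold": 5,
    "circuit_breaker_recovery_timeout": 60
  }
}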
API Endpoints
RESTful API endpoints for interaction and management.
LLM Call Endpoint
Config Reload Endpoint
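As a quick reference, the sketch below calls the LLM Call Endpoint without streaming (the documented path used in Method 1 further down) and then triggers a config reload. The reload path shown is a hypothetical placeholder, since the actual route is not documented here.
import requests

BASE = 'http://localhost:8888'  # default port used elsewhere in this documentation

# LLM Call Endpoint (documented path; non-streaming variant of Method 1 below)
resp = requests.post(f'{BASE}/api/llm/call', json={
    'provider': 'openai',
    'model': 'gpt-4o',
    'prompt': 'Hello',
})
print(resp.status_code, resp.text)  # response body format depends on the service

# Config Reload Endpoint: the path below is a HYPOTHETICAL placeholder;
# check your deployment for the actual route that hot-reloads llm_config.json.
reload_resp = requests.post(f'{BASE}/api/llm/reload_config')
print(reload_resp.status_code)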
Best Practices
Guidelines for optimal performance and security.
- JSON Mode: Use "json_mode": true when building tools that require structured data parsing (e.g., dashboards, extracting data from CVs).
- Use Configuration Files: Keep model definitions in llm_config.json to allow hot-swapping models without code changes.
- Temperature Settings: Use 0.3 for factual tasks and 0.7-1.0 for creative writing (see the sketch after this list).
- Enable Thinking: For logic or math, use models like o1 or R1 and set "enable_thinking": true.
- Streaming: Always use "stream": true for long-form generation to improve UX.
Usage Modes
Use as a standalone Python library or as an HTTP API server.
Library Mode (Default)
Import and use LLMService directly in your Python code. No Flask required.
from llmservices import LLMService, LLMRequest

req = LLMRequest(provider="openai", model="gpt-4o", prompt="Hello")
response = LLMService.call(req)
print(response.content)
API Server Mode
Set LLM_API_MODE=true and run as an HTTP server.
# Enable API mode in .env
export LLM_API_MODE=true

# Run the server
python llmservices.py

# Or programmatically
from llmservices import run_api_server
run_api_server(port=8888)
Python Integration Patterns
Choose the integration method that fits your architecture.
Comparison: API vs. Direct Class
| Feature | API Approach (HTTP) | Direct Approach (Class) |
|---|---|---|
| Use Case | Microservices, Frontend-to-Backend, Polyglot systems | Internal Python Tools, Monolithic Backends |
| Performance | Network overhead introduced | No network overhead (in-process call) |
| Data Structure | Raw JSON Responses | Typed Objects (LLMResponse) |
Method 1: API Approach (HTTP)
Best for decoupled services or when calling from non-Python languages.
import requests
import json
# Standard API call
url = 'http://localhost:8888/api/llm/call'
payload = {
'provider': 'deepseek',
'model': 'deepseek-reasoner',
'prompt': 'Solve this complex logic puzzle...',
'enable_thinking': True,
'stream': True
}
try:
response = requests.post(url, json=payload, stream=True)
# Check for Circuit Breaker (503) or other errors
if response.status_code == 503:
error_data = response.json()
print(f"⛔ Circuit Breaker Open: {error_data['error']}")
elif response.status_code != 200:
print(f"⚠ Error {response.status_code}: {response.text}")
else:
# Process successful stream
for line in response.iter_lines():
if line:
decoded_line = line.decode('utf-8')
if decoded_line.startswith('data: '):
print(decoded_line)
except requests.exceptions.ConnectionError:
print("❌ Could not connect to the LLM Service API.")
Method 2: Direct Approach (LLMService Class)
Best for high-performance internal Python scripts with simple streaming.
from llmservices import LLMService, LLMRequest, CircuitBreakerOpenException
# Create typed request object
llm_request = LLMRequest(
provider='deepseek',
model='deepseek-reasoner',
prompt='Write a short poem about AI.',
enable_thinking=True,
stream=True
)
# Call the LLM service directly (No HTTP overhead)
try:
for chunk in LLMService.stream(llm_request):
print(chunk, end='', flush=True)
except CircuitBreakerOpenException as e:
print(f"Service Unavailable: {e}")
except Exception as e:
print(f"Error: {e}")
Method 3: Structured JSON Output
Best for extracting data or building software tools. Works with all providers.
from llmservices import LLMService, LLMRequest
import json
llm_request = LLMRequest(
provider='openai', # Works with anthropic, gemini, etc.
model='gpt-4o',
prompt='Extract names and dates from the text: "Meeting with Sarah on 2025-05-12."',
json_mode=True # <--- Forces valid JSON output
)
response = LLMService.call(llm_request)
# Parse response as standard JSON
data = json.loads(response.content)
print(data)
# Output: {"names": ["Sarah"], "dates": ["2025-05-12"]}
Method 4: Reasoning Models (o1, gpt-5)
Best for complex logic, math, and multi-step reasoning. OpenAI supports reasoning_effort levels (low/medium/high).
from llmservices import LLMService, LLMRequest
# OpenAI o1/gpt-5 with reasoning effort control
llm_request = LLMRequest(
provider='openai',
model='o1', # or 'gpt-5'
prompt='Solve this step by step: What is 12345 + 67890?',
enable_thinking=True,
reasoning_effort='high', # Options: 'low', 'medium', 'high'
max_tokens=2000
)
response = LLMService.call(llm_request)
print(f"Answer: {response.content}")
print(f"Reasoning: {response.reasoning_content}")
print(f"Tokens: {response.usage}")
Note: reasoning_effort levels (low/medium/high) apply to OpenAI o1/gpt-5. DeepSeek uses the boolean enable_thinking (no effort levels). Anthropic's extended thinking is automatic.
Method 5: Streaming with JSON Mode (Advanced)
Best for real-time JSON responses. Shows SSE parsing to handle the [DONE] marker.
from llmservices import LLMService, LLMRequest
import json
def parse_stream_chunk(chunk: str) -> str:
"""Parse SSE format and extract text content."""
chunk = chunk.strip()
if chunk.startswith('data: '):
data_part = chunk[6:] # Remove "data: " prefix
if data_part.strip() == '[DONE]':
return '' # Filter out stream end marker
try:
data = json.loads(data_part)
if 'chunk' in data:
return data['chunk']
elif 'delta' in data and 'content' in data['delta']:
return data['delta']['content']
elif 'content' in data:
return data['content']
except json.JSONDecodeError:
pass
return chunk
# Stream JSON response
llm_request = LLMRequest(
provider='openai',
model='gpt-4o',
prompt='List 3 programming languages with their release years.',
json_mode=True,
stream=True
)
full_response = ""
for raw_chunk in LLMService.stream(llm_request):
text = parse_stream_chunk(raw_chunk)
if text: # Skip empty strings (filters out [DONE])
print(text, end='', flush=True)
full_response += text
# Parse the accumulated JSON
data = json.loads(full_response)
print(f"\nParsed: {json.dumps(data, indent=2)}")
Note: streams end with data: [DONE] as the final chunk. The parse_stream_chunk() helper returns an empty string for this marker; use an if text: check when accumulating content so it is skipped.
Complete Standalone Example
A complete, runnable example showing library mode setup from scratch.
"""
Example of using llmservices in Library Mode with Streaming Enabled
This example demonstrates how to use LLMService as a Python library
(without Flask) to stream responses from an LLM provider.
Prerequisites:
1. Install required packages:
pip install python-dotenv openai anthropic google-genai requests urllib3
2. Set up your .env file with API keys:
The .env file can be located in either:
- This directory (llmbase_demo/.env) - for local configuration
- Parent llmbase directory (llmbase/.env) - shared configuration
Example .env file:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
LLM_CONFIG_FILE=llm_config.json
3. (Optional) Create llm_config.json with custom model configurations
- Default config: llmbase/llm_config.json (already exists with sensible defaults)
- Local override: Create llmbase_demo/llm_config.json for project-specific settings
- Useful for overriding default settings like max_tokens, temperature, etc.
"""
import os
import json
from dotenv import load_dotenv
import sys
from pathlib import Path
# ============================================================================
# PATH CONFIGURATION
# ============================================================================
# Add parent llmbase directory to Python path for imports
# This allows importing llmservices from a sibling directory
# Example structure:
# /Users/cio/gai/
# ├── llmbase/ (contains llmservices.py)
# └── llmbase_demo/ (contains this file)
sys.path.insert(0, str(Path(__file__).parent.parent / "llmbase"))
# ============================================================================
# LLM SERVICE IMPORTS
# ============================================================================
from llmservices import LLMService, LLMRequest, CircuitBreakerOpenException
# LLMService: Main class for making LLM calls
# LLMRequest: Data class for request parameters
# CircuitBreakerOpenException: Raised when provider is blocked due to failures
# ============================================================================
# ENVIRONMENT SETUP
# ============================================================================
# Load environment variables from .env file
# Required: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.
# Optional: LLM_CONFIG_FILE (defaults to llm_config.json)
load_dotenv()
def parse_stream_chunk(chunk: str) -> str:
"""
Parse SSE (Server-Sent Events) format and extract the text content.
Handles formats like:
- data: {"chunk": "text"}
- data: {"delta": {"content": "text"}}
- data: [DONE] (stream end marker)
- raw text
Returns:
str: The extracted text content, or empty string for [DONE] markers
"""
chunk = chunk.strip()
# Handle SSE format: data: {...}
if chunk.startswith('data: '):
data_part = chunk[6:] # Remove "data: " prefix
# Check for stream end marker
if data_part.strip() == '[DONE]':
return '' # Signal end of stream
try:
data = json.loads(data_part)
# Try different possible keys for the content
if 'chunk' in data:
return data['chunk']
elif 'delta' in data and 'content' in data['delta']:
return data['delta']['content']
elif 'content' in data:
return data['content']
elif 'text' in data:
return data['text']
except json.JSONDecodeError:
pass
# Return as-is if not SSE format or parsing failed
return chunk
def stream_basic_example():
"""Basic streaming example with default settings."""
llm_request = LLMRequest(
provider='openai',
model='gpt-4o',
prompt='Write a short haiku about artificial intelligence.',
stream=True # Enable streaming
)
try:
for raw_chunk in LLMService.stream(llm_request):
text = parse_stream_chunk(raw_chunk)
print(text, end='', flush=True)
print()
except CircuitBreakerOpenException as e:
print(f"Service Unavailable (Circuit Breaker): {e}")
except Exception as e:
print(f"Error: {e}")
# Run the example
if __name__ == "__main__":
stream_basic_example()
Save this as a file (e.g., streaming_example.py) and run it directly. All required setup, including path configuration, imports, and environment loading, is included.