Overview

LLM Services provides a unified interface for interacting with multiple Large Language Model providers. It abstracts provider-specific complexities, allowing you to switch models via configuration without changing code logic.

Key Features:

  • Multi-Provider: Support for multiple LLM providers, enabling flexible selection based on compliance, cost, and performance requirements.
  • Structured Output: Built-in json_mode ensures responses are valid JSON objects, regardless of the underlying provider.
  • Dynamic Ollama: Automatically discovers local models from your Ollama instance.
  • Resilience: Automatic retries and circuit breakers for high availability.
  • Advanced Logic: Support for "Thinking" models and streaming responses.
  • Conditional Flask: Use as a plain Python library (no Flask required) or as an HTTP API server (Flask needed only in that mode).

🏗️ Supported Providers

Access leading Cloud and Local LLMs through a single interface.

  • OpenAI / Azure: GPT-4o, o1
  • Anthropic: Claude 3.5, 3.7
  • Gemini: Gemini 2.5, 3
  • DeepSeek: V3, R1 Reasoner
  • xAI / Grok: Grok 3, 4
  • Ollama: Local Models
Note: Ollama models require no manual configuration. If the Ollama endpoints are set in your .env, the service automatically fetches all available local models.
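
For reference, here is a minimal sketch of what that discovery step could look like, using the OLLAMA_MODELS_ENDPOINT variable from the .env example below. The function name and the assumed response shape ({"models": [{"name": ...}]}) are illustrative, not the library's actual internals.

import os
import requests

def discover_ollama_models():
    """Hypothetical sketch: query the configured Ollama endpoint for locally available models."""
    endpoint = os.getenv("OLLAMA_MODELS_ENDPOINT", "http://localhost:11434/api/models")
    try:
        resp = requests.get(endpoint, timeout=10)
        resp.raise_for_status()
        # Assumes the endpoint returns JSON with a "models" list of {"name": ...} entries
        return [m.get("name", "") for m in resp.json().get("models", [])]
    except requests.RequestException:
        return []  # No reachable Ollama instance; skip local models

print(discover_ollama_models())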

🛡️ Resilience Architecture

Built-in fault tolerance mechanisms to ensure high availability. Settings can be tuned in llm_config.json.

Configuration Parameters

The resilience behavior is controlled by the following parameters in the root "resilience" block of your config file.

  • max_retries (Int, default 3): Number of retry attempts for transient errors (e.g., 5xx errors, rate limits, timeouts).
  • backoff_factor (Float, default 1.5): Multiplier used to compute the wait time between retries. Formula: wait = factor ^ attempt.
  • circuit_breaker_failure_threshold (Int, default 5): Consecutive failures allowed before a provider is blocked (Circuit Open state).
  • circuit_breaker_recovery_timeout (Int, default 60): Seconds to wait before testing a blocked provider again (Half-Open state).
Note on Timeouts: HTTP connection timeouts are handled internally by the application (300s for Ollama, default client settings for other providers). The parameters above control application-level retry logic and circuit breaking.
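
To make the retry math concrete, here is a minimal sketch of how these settings could drive a retry loop and circuit breaker. It is illustrative only: the helper name call_with_resilience and the module-level state are assumptions, not the library's internal implementation.

import time

# Values mirroring the "resilience" block (illustrative)
MAX_RETRIES = 3
BACKOFF_FACTOR = 1.5
FAILURE_THRESHOLD = 5
RECOVERY_TIMEOUT = 60

consecutive_failures = 0
circuit_opened_at = None

def call_with_resilience(do_request):
    """Hypothetical helper: retry transient errors, open the circuit after repeated failures."""
    global consecutive_failures, circuit_opened_at

    # Circuit Open: reject calls until the recovery timeout elapses (then allow a Half-Open test)
    if circuit_opened_at and time.time() - circuit_opened_at < RECOVERY_TIMEOUT:
        raise RuntimeError("Circuit breaker open")

    for attempt in range(1, MAX_RETRIES + 1):
        try:
            result = do_request()
            consecutive_failures = 0
            circuit_opened_at = None
            return result
        except Exception:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                circuit_opened_at = time.time()  # Enter Circuit Open state
                raise
            if attempt < MAX_RETRIES:
                time.sleep(BACKOFF_FACTOR ** attempt)  # wait = factor ^ attempt
    raise RuntimeError("All retries exhausted")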

⚙️ Setup & Configuration

The service separates secrets (API keys) from logic (Model definitions).

1. Requirements & Secrets

Install packages and set environment variables.

# Core requirements (Library mode - no Flask)
pip install python-dotenv openai anthropic google-genai requests urllib3

# Optional: For API server mode
pip install flask flask-session

2. Environment Variables (.env)

Store your API keys and endpoints here. Do not commit this file.

.env
# Provider Keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
DEEPSEEK_API_KEY=sk-...

# Azure (Optional)
AZURE_OAI_ENDPOINT=https://your-resource.azure.com/
AZURE_OAI_KEY=...

# Ollama (Optional)
OLLAMA_CHAT_ENDPOINT=http://localhost:11434/api/chat
OLLAMA_MODELS_ENDPOINT=http://localhost:11434/api/models

# Service Config
FLASK_SECRET_KEY=your-secret-key
LLM_CONFIG_FILE=llm_config.json

# API Mode (Optional - set to true to enable Flask API server)
LLM_API_MODE=false
Library Mode: When LLM_API_MODE=false (default), you can use LLMService as a Python library without installing Flask. Set LLM_API_MODE=true to enable the HTTP API server.
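
As a rough sketch of what that toggle implies for your own entry point, you could gate the Flask-dependent server startup on LLM_API_MODE. The structure below is an assumption for illustration, not the library's actual startup code; it relies only on run_api_server and LLMService.call, shown later in this document.

import os
from dotenv import load_dotenv

load_dotenv()

if os.getenv("LLM_API_MODE", "false").lower() == "true":
    # API server mode: requires Flask (pip install flask flask-session)
    from llmservices import run_api_server
    run_api_server(port=8888)
else:
    # Library mode: no Flask import needed
    from llmservices import LLMService, LLMRequest
    req = LLMRequest(provider="openai", model="gpt-4o", prompt="Hello")
    print(LLMService.call(req).content)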

3. Model & Resilience Configuration (llm_config.json)

Define models and resilience behaviors. This file can be hot-reloaded.

llm_config.json
{ "resilience": { "max_retries": 3, "backoff_factor": 1.5, "retry_jitter": 0.5, "circuit_breaker_failure_threshold": 5, "circuit_breaker_recovery_timeout": 60 }, "openai": { "api_base": "https://api.openai.com/v1", "default_model": "gpt-4o", "models": { "gpt-4o": { "max_tokens": 16384, "supports_streaming": true, "temperature_default": 0.3 } } }, "anthropic": { "models": { "claude-3-5-sonnet-20240620": { "max_tokens": 8192, "supports_extended_thinking": true } } } }

🔌 API Endpoints

RESTful API endpoints for interaction and management.

LLM Call Endpoint

POST /api/llm/call

{
  "provider": "openai",
  "model": "gpt-4o",
  "prompt": "List 3 colors and their hex codes",
  "json_mode": true,
  "stream": false,
  "temperature": 0.3,
  "system_prompt": "You are a color theorist."
}

Config Reload Endpoint

POST /api/config/reload

{
  "config_file": "llm_config.json"
}
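
For example, to trigger a reload from Python (assuming the API server is running on localhost:8888, the port used in the examples later in this document):

import requests

resp = requests.post(
    "http://localhost:8888/api/config/reload",
    json={"config_file": "llm_config.json"},
    timeout=30,
)
print(resp.status_code, resp.text)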

🏆 Best Practices

Guidelines for optimal performance and security.

  • JSON Mode: Use "json_mode": true when building tools that require structured data parsing (e.g., dashboards, extracting data from CVs).
  • Use Configuration Files: Keep model definitions in llm_config.json to allow hot-swapping models without code changes.
  • Temperature Settings: Use 0.3 for factual tasks and 0.7-1.0 for creative writing.
  • Enable Thinking: For logic or math, use models like o1 or R1 and set "enable_thinking": true.
  • Streaming: Always use "stream": true for long-form generation to improve UX.

💻 Usage Modes

Use as a standalone Python library or as an HTTP API server.

Library Mode (Default)

Import and use LLMService directly in your Python code. No Flask required.

from llmservices import LLMService, LLMRequest

req = LLMRequest(provider="openai", model="gpt-4o", prompt="Hello")
response = LLMService.call(req)
print(response.content)

API Server Mode

Set LLM_API_MODE=true and run as an HTTP server.

# Enable API mode in .env
export LLM_API_MODE=true

# Run the server
python llmservices.py

# Or programmatically
from llmservices import run_api_server
run_api_server(port=8888)

💻 Python Integration Patterns

Choose the integration method that fits your architecture.

Comparison: API vs. Direct Class

Feature          API Approach (HTTP)                                     Direct Approach (Class)
Use Case         Microservices, frontend-to-backend, polyglot systems    Internal Python tools, monolithic backends
Performance      Network overhead introduced                             No network overhead (in-process call)
Data Structure   Raw JSON responses                                      Typed objects (LLMResponse)

Method 1: API Approach (HTTP)

Best for decoupled services or when calling from non-Python languages.

import requests
import json

# Standard API call
url = 'http://localhost:8888/api/llm/call'
payload = {
    'provider': 'deepseek',
    'model': 'deepseek-reasoner',
    'prompt': 'Solve this complex logic puzzle...',
    'enable_thinking': True,
    'stream': True
}

try:
    response = requests.post(url, json=payload, stream=True)

    # Check for Circuit Breaker (503) or other errors
    if response.status_code == 503:
        error_data = response.json()
        print(f"⛔ Circuit Breaker Open: {error_data['error']}")
    elif response.status_code != 200:
        print(f"⚠ Error {response.status_code}: {response.text}")
    else:
        # Process successful stream
        for line in response.iter_lines():
            if line:
                decoded_line = line.decode('utf-8')
                if decoded_line.startswith('data: '):
                    print(decoded_line)
except requests.exceptions.ConnectionError:
    print("❌ Could not connect to the LLM Service API.")

Method 2: Direct Approach (LLMService Class)

Best for high-performance internal Python scripts with simple streaming.

from llmservices import LLMService, LLMRequest, CircuitBreakerOpenException

# Create typed request object
llm_request = LLMRequest(
    provider='deepseek',
    model='deepseek-reasoner',
    prompt='Write a short poem about AI.',
    enable_thinking=True,
    stream=True
)

# Call the LLM service directly (No HTTP overhead)
try:
    for chunk in LLMService.stream(llm_request):
        print(chunk, end='', flush=True)
except CircuitBreakerOpenException as e:
    print(f"Service Unavailable: {e}")
except Exception as e:
    print(f"Error: {e}")

Method 3: Structured JSON Output

Best for extracting data or building software tools. Works with all providers.

from llmservices import LLMService, LLMRequest
import json

llm_request = LLMRequest(
    provider='openai',  # Works with anthropic, gemini, etc.
    model='gpt-4o',
    prompt='Extract names and dates from the text: "Meeting with Sarah on 2025-05-12."',
    json_mode=True  # <--- Forces valid JSON output
)

response = LLMService.call(llm_request)

# Parse response as standard JSON
data = json.loads(response.content)
print(data)
# Output: {"names": ["Sarah"], "dates": ["2025-05-12"]}

Method 4: Reasoning Models (o1, gpt-5)

Best for complex logic, math, and multi-step reasoning. OpenAI supports reasoning_effort levels (low/medium/high).

from llmservices import LLMService, LLMRequest

# OpenAI o1/gpt-5 with reasoning effort control
llm_request = LLMRequest(
    provider='openai',
    model='o1',  # or 'gpt-5'
    prompt='Solve this step by step: What is 12345 + 67890?',
    enable_thinking=True,
    reasoning_effort='high',  # Options: 'low', 'medium', 'high'
    max_tokens=2000
)

response = LLMService.call(llm_request)

print(f"Answer: {response.content}")
print(f"Reasoning: {response.reasoning_content}")
print(f"Tokens: {response.usage}")
Provider Differences: OpenAI supports reasoning_effort levels for o1/gpt-5. DeepSeek uses boolean enable_thinking (no effort levels). Anthropic's extended thinking is automatic.

Method 5: Streaming with JSON Mode (Advanced)

Best for real-time JSON responses. Shows SSE parsing to handle the [DONE] marker.

from llmservices import LLMService, LLMRequest
import json

def parse_stream_chunk(chunk: str) -> str:
    """Parse SSE format and extract text content."""
    chunk = chunk.strip()
    if chunk.startswith('data: '):
        data_part = chunk[6:]  # Remove "data: " prefix
        if data_part.strip() == '[DONE]':
            return ''  # Filter out stream end marker
        try:
            data = json.loads(data_part)
            if 'chunk' in data:
                return data['chunk']
            elif 'delta' in data and 'content' in data['delta']:
                return data['delta']['content']
            elif 'content' in data:
                return data['content']
        except json.JSONDecodeError:
            pass
    return chunk

# Stream JSON response
llm_request = LLMRequest(
    provider='openai',
    model='gpt-4o',
    prompt='List 3 programming languages with their release years.',
    json_mode=True,
    stream=True
)

full_response = ""
for raw_chunk in LLMService.stream(llm_request):
    text = parse_stream_chunk(raw_chunk)
    if text:  # Skip empty strings (filters out [DONE])
        print(text, end='', flush=True)
        full_response += text

# Parse the accumulated JSON
data = json.loads(full_response)
print(f"\nParsed: {json.dumps(data, indent=2)}")
Note on [DONE] marker: When streaming, the service sends data: [DONE] as the final chunk. The parse_stream_chunk() helper returns an empty string for this marker. Use if text: checks when accumulating content to skip it.

📄 Complete Standalone Example

A complete, runnable example showing library mode setup from scratch.

"""
Example of using llmservices in Library Mode with Streaming Enabled

This example demonstrates how to use LLMService as a Python library
(without Flask) to stream responses from an LLM provider.

Prerequisites:
1. Install required packages:
   pip install python-dotenv openai anthropic google-genai requests urllib3

2. Set up your .env file with API keys:
   The .env file can be located in either:
   - This directory (llmbase_demo/.env) - for local configuration
   - Parent llmbase directory (llmbase/.env) - shared configuration

   Example .env file:
   OPENAI_API_KEY=sk-...
   ANTHROPIC_API_KEY=sk-ant-...
   GEMINI_API_KEY=...
   LLM_CONFIG_FILE=llm_config.json

3. (Optional) Create llm_config.json with custom model configurations
   - Default config: llmbase/llm_config.json (already exists with sensible defaults)
   - Local override: Create llmbase_demo/llm_config.json for project-specific settings
   - Useful for overriding default settings like max_tokens, temperature, etc.
"""

import os
import json
from dotenv import load_dotenv
import sys
from pathlib import Path

# ============================================================================
# PATH CONFIGURATION
# ============================================================================
# Add parent llmbase directory to Python path for imports
# This allows importing llmservices from a sibling directory
# Example structure:
#   /Users/cio/gai/
#     ├── llmbase/           (contains llmservices.py)
#     └── llmbase_demo/      (contains this file)
sys.path.insert(0, str(Path(__file__).parent.parent / "llmbase"))


# ============================================================================
# LLM SERVICE IMPORTS
# ============================================================================
from llmservices import LLMService, LLMRequest, CircuitBreakerOpenException
# LLMService: Main class for making LLM calls
# LLMRequest: Data class for request parameters
# CircuitBreakerOpenException: Raised when provider is blocked due to failures

# ============================================================================
# ENVIRONMENT SETUP
# ============================================================================
# Load environment variables from .env file
# Required: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.
# Optional: LLM_CONFIG_FILE (defaults to llm_config.json)
load_dotenv()


def parse_stream_chunk(chunk: str) -> str:
    """
    Parse SSE (Server-Sent Events) format and extract the text content.

    Handles formats like:
    - data: {"chunk": "text"}
    - data: {"delta": {"content": "text"}}
    - data: [DONE] (stream end marker)
    - raw text

    Returns:
        str: The extracted text content, or empty string for [DONE] markers
    """
    chunk = chunk.strip()

    # Handle SSE format: data: {...}
    if chunk.startswith('data: '):
        data_part = chunk[6:]  # Remove "data: " prefix

        # Check for stream end marker
        if data_part.strip() == '[DONE]':
            return ''  # Signal end of stream

        try:
            data = json.loads(data_part)

            # Try different possible keys for the content
            if 'chunk' in data:
                return data['chunk']
            elif 'delta' in data and 'content' in data['delta']:
                return data['delta']['content']
            elif 'content' in data:
                return data['content']
            elif 'text' in data:
                return data['text']
        except json.JSONDecodeError:
            pass

    # Return as-is if not SSE format or parsing failed
    return chunk


def stream_basic_example():
    """Basic streaming example with default settings."""
    llm_request = LLMRequest(
        provider='openai',
        model='gpt-4o',
        prompt='Write a short haiku about artificial intelligence.',
        stream=True  # Enable streaming
    )

    try:
        for raw_chunk in LLMService.stream(llm_request):
            text = parse_stream_chunk(raw_chunk)
            print(text, end='', flush=True)
        print()
    except CircuitBreakerOpenException as e:
        print(f"Service Unavailable (Circuit Breaker): {e}")
    except Exception as e:
        print(f"Error: {e}")


# Run the example
if __name__ == "__main__":
    stream_basic_example()
Ready to run: Save this as a Python file (e.g., streaming_example.py) and run it directly. All required setup, including path configuration, imports, and environment loading, is included.