Overview

LLM Services provides a unified interface for interacting with multiple Large Language Model providers. It abstracts provider-specific complexities, allowing you to switch models via configuration without changing code logic.

Key Features:

  • Multi-Provider: Support for multiple LLM providers, enabling flexible selection based on compliance, cost, and performance requirements.
  • Structured Output: Built-in json_mode ensures responses are valid JSON objects, regardless of the underlying provider.
  • Dynamic Ollama: Automatically discovers local models from your Ollama instance.
  • Resilience: Automatic retries and circuit breakers for high availability.
  • Advanced Logic: Support for "Thinking" models and streaming responses.
  • Conditional Flask: Use as a plain Python library (no Flask required) or as an HTTP API server (Flask needed only in that mode).

🏗️ Supported Providers

Access leading Cloud and Local LLMs through a single interface.

  • OpenAI / Azure: GPT-4o, o1
  • Anthropic: Claude 3.5, 3.7
  • Gemini: Gemini 2.5, 3
  • DeepSeek: V3, R1 Reasoner
  • xAI / Grok: Grok 3, 4
  • Ollama: Local Models
Note: Ollama models require no manual configuration. If the Ollama endpoints are set in your .env, the service automatically fetches all available local models.
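
For reference, here is a minimal sketch of what that discovery step could look like, using the OLLAMA_MODELS_ENDPOINT variable from the .env example below. The function name and the assumed response shape ({"models": [{"name": ...}]}) are illustrative, not the library's actual internals.

import os
import requests

def discover_ollama_models():
    """Hypothetical sketch: query the configured Ollama endpoint for locally available models."""
    endpoint = os.getenv("OLLAMA_MODELS_ENDPOINT", "http://localhost:11434/api/models")
    try:
        resp = requests.get(endpoint, timeout=10)
        resp.raise_for_status()
        # Assumes the endpoint returns JSON with a "models" list of {"name": ...} entries
        return [m.get("name", "") for m in resp.json().get("models", [])]
    except requests.RequestException:
        return []  # No reachable Ollama instance; skip local models

print(discover_ollama_models())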

🛡️ Resilience Architecture

Built-in fault tolerance mechanisms to ensure high availability. Settings can be tuned in llm_config.json.

Configuration Parameters

The resilience behavior is controlled by the following parameters in the root "resilience" block of your config file.

  • max_retries (Int, default 3): Number of retry attempts for transient errors (e.g., 5xx errors, rate limits, timeouts).
  • backoff_factor (Float, default 1.5): Multiplier used to compute the wait time between retries. Formula: wait = factor ^ attempt.
  • circuit_breaker_failure_threshold (Int, default 5): Consecutive failures allowed before a provider is blocked (Circuit Open state).
  • circuit_breaker_recovery_timeout (Int, default 60): Seconds to wait before testing a blocked provider again (Half-Open state).
Note on Timeouts: HTTP connection timeouts are handled internally by the application (300s for Ollama, default client settings for other providers). The parameters above control application-level retry logic and circuit breaking.
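
To make the retry math concrete, here is a minimal sketch of how these settings could drive a retry loop and circuit breaker. It is illustrative only: the helper name call_with_resilience and the module-level state are assumptions, not the library's internal implementation.

import time

# Values mirroring the "resilience" block (illustrative)
MAX_RETRIES = 3
BACKOFF_FACTOR = 1.5
FAILURE_THRESHOLD = 5
RECOVERY_TIMEOUT = 60

consecutive_failures = 0
circuit_opened_at = None

def call_with_resilience(do_request):
    """Hypothetical helper: retry transient errors, open the circuit after repeated failures."""
    global consecutive_failures, circuit_opened_at

    # Circuit Open: reject calls until the recovery timeout elapses (then allow a Half-Open test)
    if circuit_opened_at and time.time() - circuit_opened_at < RECOVERY_TIMEOUT:
        raise RuntimeError("Circuit breaker open")

    for attempt in range(1, MAX_RETRIES + 1):
        try:
            result = do_request()
            consecutive_failures = 0
            circuit_opened_at = None
            return result
        except Exception:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                circuit_opened_at = time.time()  # Enter Circuit Open state
                raise
            if attempt < MAX_RETRIES:
                time.sleep(BACKOFF_FACTOR ** attempt)  # wait = factor ^ attempt
    raise RuntimeError("All retries exhausted")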

⚙️ Setup & Configuration

The service separates secrets (API keys) from logic (Model definitions).

1. Requirements & Secrets

Install packages and set environment variables.

# Core requirements (Library mode - no Flask)
pip install python-dotenv openai anthropic google-genai requests urllib3

# Optional: For API server mode
pip install flask flask-session

2. Environment Variables (.env)

Store your API keys and endpoints here. Do not commit this file.

.env
# Provider Keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
DEEPSEEK_API_KEY=sk-...

# Azure (Optional)
AZURE_OAI_ENDPOINT=https://your-resource.azure.com/
AZURE_OAI_KEY=...

# Ollama (Optional)
OLLAMA_CHAT_ENDPOINT=http://localhost:11434/api/chat
OLLAMA_MODELS_ENDPOINT=http://localhost:11434/api/models

# Service Config
FLASK_SECRET_KEY=your-secret-key
LLM_CONFIG_FILE=llm_config.json

# API Mode (Optional - set to true to enable Flask API server)
LLM_API_MODE=false
Library Mode: When LLM_API_MODE=false (default), you can use LLMService as a Python library without installing Flask. Set LLM_API_MODE=true to enable the HTTP API server.
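
As a rough sketch of what that toggle implies for your own entry point, you could gate the Flask-dependent server startup on LLM_API_MODE. The structure below is an assumption for illustration, not the library's actual startup code; it relies only on run_api_server and LLMService.call, shown later in this document.

import os
from dotenv import load_dotenv

load_dotenv()

if os.getenv("LLM_API_MODE", "false").lower() == "true":
    # API server mode: requires Flask (pip install flask flask-session)
    from llmservices import run_api_server
    run_api_server(port=8888)
else:
    # Library mode: no Flask import needed
    from llmservices import LLMService, LLMRequest
    req = LLMRequest(provider="openai", model="gpt-4o", prompt="Hello")
    print(LLMService.call(req).content)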

3. Model & Resilience Configuration (llm_config.json)

Define models and resilience behaviors. This file can be hot-reloaded.

llm_config.json
{ "resilience": { "max_retries": 3, "backoff_factor": 1.5, "retry_jitter": 0.5, "circuit_breaker_failure_threshold": 5, "circuit_breaker_recovery_timeout": 60 }, "openai": { "api_base": "https://api.openai.com/v1", "default_model": "gpt-4o", "models": { "gpt-4o": { "max_tokens": 16384, "supports_streaming": true, "temperature_default": 0.3 } } }, "anthropic": { "models": { "claude-3-5-sonnet-20240620": { "max_tokens": 8192, "supports_extended_thinking": true } } } }

🔌 API Endpoints

RESTful API endpoints for interaction and management.

LLM Call Endpoint

POST /api/llm/call

{
  "provider": "openai",
  "model": "gpt-4o",
  "prompt": "List 3 colors and their hex codes",
  "json_mode": true,
  "stream": false,
  "temperature": 0.3,
  "system_prompt": "You are a color theorist."
}

Config Reload Endpoint

POST /api/config/reload

{
  "config_file": "llm_config.json"
}
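
For example, to trigger a reload from Python (assuming the API server is running on localhost:8888, the port used in the examples later in this document):

import requests

resp = requests.post(
    "http://localhost:8888/api/config/reload",
    json={"config_file": "llm_config.json"},
    timeout=30,
)
print(resp.status_code, resp.text)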

🏆 Best Practices

Guidelines for optimal performance and security.

  • JSON Mode: Use "json_mode": true when building tools that require structured data parsing (e.g., dashboards, extracting data from CVs).
  • Use Configuration Files: Keep model definitions in llm_config.json to allow hot-swapping models without code changes.
  • Temperature Settings: Use 0.3 for factual tasks and 0.7-1.0 for creative writing.
  • Enable Thinking: For logic or math, use models like o1 or R1 and set "enable_thinking": true.
  • Streaming: Always use "stream": true for long-form generation to improve UX.

💻 Usage Modes

Use as a standalone Python library or as an HTTP API server.

Library Mode (Default)

Import and use LLMService directly in your Python code. No Flask required.

from llmservices import LLMService, LLMRequest

req = LLMRequest(provider="openai", model="gpt-4o", prompt="Hello")
response = LLMService.call(req)
print(response.content)

API Server Mode

Set LLM_API_MODE=true and run as an HTTP server.

# Enable API mode in .env
export LLM_API_MODE=true

# Run the server
python llmservices.py

# Or programmatically
from llmservices import run_api_server
run_api_server(port=8888)

💻 Python Integration Patterns

Choose the integration method that fits your architecture.

Comparison: API vs. Direct Class

Feature          API Approach (HTTP)                                     Direct Approach (Class)
Use Case         Microservices, frontend-to-backend, polyglot systems    Internal Python tools, monolithic backends
Performance      Network overhead introduced                             No network overhead (in-process call)
Data Structure   Raw JSON responses                                      Typed objects (LLMResponse)

Method 1: API Approach (HTTP)

Best for decoupled services or when calling from non-Python languages.

import requests
import json

# Standard API call
url = 'http://localhost:8888/api/llm/call'
payload = {
    'provider': 'deepseek',
    'model': 'deepseek-reasoner',
    'prompt': 'Solve this complex logic puzzle...',
    'enable_thinking': True,
    'stream': True
}

try:
    response = requests.post(url, json=payload, stream=True)

    # Check for Circuit Breaker (503) or other errors
    if response.status_code == 503:
        error_data = response.json()
        print(f"⛔ Circuit Breaker Open: {error_data['error']}")
    elif response.status_code != 200:
        print(f"⚠ Error {response.status_code}: {response.text}")
    else:
        # Process successful stream
        for line in response.iter_lines():
            if line:
                decoded_line = line.decode('utf-8')
                if decoded_line.startswith('data: '):
                    print(decoded_line)
except requests.exceptions.ConnectionError:
    print("❌ Could not connect to the LLM Service API.")

Method 2: Direct Approach (LLMService Class)

Best for high-performance internal Python scripts with simple streaming.

from llmservices import LLMService, LLMRequest, CircuitBreakerOpenException

# Create typed request object
llm_request = LLMRequest(
    provider='deepseek',
    model='deepseek-reasoner',
    prompt='Write a short poem about AI.',
    enable_thinking=True,
    stream=True
)

# Call the LLM service directly (No HTTP overhead)
try:
    for chunk in LLMService.stream(llm_request):
        print(chunk, end='', flush=True)
except CircuitBreakerOpenException as e:
    print(f"Service Unavailable: {e}")
except Exception as e:
    print(f"Error: {e}")

Method 3: Structured JSON Output

Best for extracting data or building software tools. Works with all providers.

from llmservices import LLMService, LLMRequest
import json

llm_request = LLMRequest(
    provider='openai',  # Works with anthropic, gemini, etc.
    model='gpt-4o',
    prompt='Extract names and dates from the text: "Meeting with Sarah on 2025-05-12."',
    json_mode=True  # <--- Forces valid JSON output
)

response = LLMService.call(llm_request)

# Parse response as standard JSON
data = json.loads(response.content)
print(data)
# Output: {"names": ["Sarah"], "dates": ["2025-05-12"]}

Method 4: Reasoning Models (o1, gpt-5)

Best for complex logic, math, and multi-step reasoning. OpenAI supports reasoning_effort levels (low/medium/high).

from llmservices import LLMService, LLMRequest

# OpenAI o1/gpt-5 with reasoning effort control
llm_request = LLMRequest(
    provider='openai',
    model='o1',  # or 'gpt-5'
    prompt='Solve this step by step: What is 12345 + 67890?',
    enable_thinking=True,
    reasoning_effort='high',  # Options: 'low', 'medium', 'high'
    max_tokens=2000
)

response = LLMService.call(llm_request)

print(f"Answer: {response.content}")
print(f"Reasoning: {response.reasoning_content}")
print(f"Tokens: {response.usage}")
Provider Differences: OpenAI supports reasoning_effort levels for o1/gpt-5. DeepSeek uses boolean enable_thinking (no effort levels). Anthropic's extended thinking is automatic.

Method 5: Streaming with JSON Mode (Advanced)

Best for real-time JSON responses. Shows SSE parsing to handle the [DONE] marker.

from llmservices import LLMService, LLMRequest
import json

def parse_stream_chunk(chunk: str) -> str:
    """Parse SSE format and extract text content."""
    chunk = chunk.strip()
    if chunk.startswith('data: '):
        data_part = chunk[6:]  # Remove "data: " prefix
        if data_part.strip() == '[DONE]':
            return ''  # Filter out stream end marker
        try:
            data = json.loads(data_part)
            if 'chunk' in data:
                return data['chunk']
            elif 'delta' in data and 'content' in data['delta']:
                return data['delta']['content']
            elif 'content' in data:
                return data['content']
        except json.JSONDecodeError:
            pass
    return chunk

# Stream JSON response
llm_request = LLMRequest(
    provider='openai',
    model='gpt-4o',
    prompt='List 3 programming languages with their release years.',
    json_mode=True,
    stream=True
)

full_response = ""
for raw_chunk in LLMService.stream(llm_request):
    text = parse_stream_chunk(raw_chunk)
    if text:  # Skip empty strings (filters out [DONE])
        print(text, end='', flush=True)
        full_response += text

# Parse the accumulated JSON
data = json.loads(full_response)
print(f"\nParsed: {json.dumps(data, indent=2)}")
Note on [DONE] marker: When streaming, the service sends data: [DONE] as the final chunk. The parse_stream_chunk() helper returns an empty string for this marker. Use if text: checks when accumulating content to skip it.

📄 Complete Standalone Example

A complete, runnable example showing library mode setup from scratch.

"""
Example of using llmservices in Library Mode with Streaming Enabled

This example demonstrates how to use LLMService as a Python library
(without Flask) to stream responses from an LLM provider.

Prerequisites:
1. Install required packages:
   pip install python-dotenv openai anthropic google-genai requests urllib3

2. Set up your .env file with API keys:
   The .env file can be located in either:
   - This directory (llmbase_demo/.env) - for local configuration
   - Parent llmbase directory (llmbase/.env) - shared configuration

   Example .env file:
   OPENAI_API_KEY=sk-...
   ANTHROPIC_API_KEY=sk-ant-...
   GEMINI_API_KEY=...
   LLM_CONFIG_FILE=llm_config.json

3. (Optional) Create llm_config.json with custom model configurations
   - Default config: llmbase/llm_config.json (already exists with sensible defaults)
   - Local override: Create llmbase_demo/llm_config.json for project-specific settings
   - Useful for overriding default settings like max_tokens, temperature, etc.
"""

import os
import json
from dotenv import load_dotenv
import sys
from pathlib import Path

# ============================================================================
# PATH CONFIGURATION
# ============================================================================
# Add parent llmbase directory to Python path for imports
# This allows importing llmservices from a sibling directory
# Example structure:
#   /Users/cio/gai/
#     ├── llmbase/           (contains llmservices.py)
#     └── llmbase_demo/      (contains this file)
sys.path.insert(0, str(Path(__file__).parent.parent / "llmbase"))


# ============================================================================
# LLM SERVICE IMPORTS
# ============================================================================
from llmservices import LLMService, LLMRequest, CircuitBreakerOpenException
# LLMService: Main class for making LLM calls
# LLMRequest: Data class for request parameters
# CircuitBreakerOpenException: Raised when provider is blocked due to failures

# ============================================================================
# ENVIRONMENT SETUP
# ============================================================================
# Load environment variables from .env file
# Required: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.
# Optional: LLM_CONFIG_FILE (defaults to llm_config.json)
load_dotenv()


def parse_stream_chunk(chunk: str) -> str:
    """
    Parse SSE (Server-Sent Events) format and extract the text content.

    Handles formats like:
    - data: {"chunk": "text"}
    - data: {"delta": {"content": "text"}}
    - data: [DONE] (stream end marker)
    - raw text

    Returns:
        str: The extracted text content, or empty string for [DONE] markers
    """
    chunk = chunk.strip()

    # Handle SSE format: data: {...}
    if chunk.startswith('data: '):
        data_part = chunk[6:]  # Remove "data: " prefix

        # Check for stream end marker
        if data_part.strip() == '[DONE]':
            return ''  # Signal end of stream

        try:
            data = json.loads(data_part)

            # Try different possible keys for the content
            if 'chunk' in data:
                return data['chunk']
            elif 'delta' in data and 'content' in data['delta']:
                return data['delta']['content']
            elif 'content' in data:
                return data['content']
            elif 'text' in data:
                return data['text']
        except json.JSONDecodeError:
            pass

    # Return as-is if not SSE format or parsing failed
    return chunk


def stream_basic_example():
    """Basic streaming example with default settings."""
    llm_request = LLMRequest(
        provider='openai',
        model='gpt-4o',
        prompt='Write a short haiku about artificial intelligence.',
        stream=True  # Enable streaming
    )

    try:
        for raw_chunk in LLMService.stream(llm_request):
            text = parse_stream_chunk(raw_chunk)
            print(text, end='', flush=True)
        print()
    except CircuitBreakerOpenException as e:
        print(f"Service Unavailable (Circuit Breaker): {e}")
    except Exception as e:
        print(f"Error: {e}")


# Run the example
if __name__ == "__main__":
    stream_basic_example()
Ready to run: Save this as a Python file (e.g., streaming_example.py) and run it directly. All required setup, including path configuration, imports, and environment loading, is included.