Overview

Tablemind is a Python library for building retrieval-augmented generation (RAG) systems that correctly handle tables, figures, and document-level queries in research papers, technical reports, and business documents.

Problem it Solves

Standard RAG systems chunk documents to work within LLM context limits. This works for prose but fails for:

Solution

Tablemind treats tables as first-class citizens: it preserves their complete structure, detects references to them, and routes each query to either chunk-based retrieval or full-document review.
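As an illustration of what reference detection can look like, here is a minimal sketch that finds explicit mentions such as "Table 3" or "Fig. 2" in a query or chunk. The pattern and the function name find_references are hypothetical and are not part of the library's API; the library's own detection may work differently.

```python
import re

# Matches explicit references such as "Table 3", "Fig. 2", or "Figure 10".
# Hypothetical pattern for illustration only.
REFERENCE_PATTERN = re.compile(r"\b(table|figure|fig\.?)\s*(\d+)", re.IGNORECASE)


def find_references(text: str) -> list[tuple[str, int]]:
    """Return (kind, number) pairs for every table or figure reference in text."""
    references = []
    for match in REFERENCE_PATTERN.finditer(text):
        # Normalize "Fig." and "Figure" to the same kind.
        kind = "table" if match.group(1).lower() == "table" else "figure"
        references.append((kind, int(match.group(2))))
    return references


print(find_references("What's the F1 score in Table 3 compared to Fig. 2?"))
```

A retrieval step could use the returned pairs to fetch the complete table or figure rather than relying on whichever chunk happened to match.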

Quick Start

Installation

bash
pip install docling sentence-transformers qdrant-client fastapi uvicorn

Basic Usage

python
from rag_ingestion import RAGIngestor
from rag_library import RAGLibrary

# Initialize
ingestor = RAGIngestor(collection_name="my_docs")
rag = RAGLibrary(ingestor=ingestor)

# Ingest a document
result = ingestor.ingest_document("paper.pdf")
print(f"Ingested {result['chunks_indexed']} chunks")

# Query
response = rag.query("What's the F1 score in Table 3?",
                    search_mode="semantic",
                    table_priority=True)
print(response['answer'])

# Reindex all documents
result = ingestor.reindex_collection("./docs")
print(f"Reindexed {result['indexed_count']} documents")

Core Modules

docling_parser.py — Document Parsing

Extracts structure from PDF, Markdown, HTML, DOCX, and text files.

python
from docling_parser import DoclingParser

parser = DoclingParser()
parsed = parser.parse("document.pdf")

# Access parsed data
print(f"Sections: {list(parsed.sections.keys())}")
print(f"Tables: {len(parsed.tables)}")
print(f"Figures: {len(parsed.figures)}")

# Get table as markdown
table_md = parsed.tables[0]['markdown']

Returns:

rag_ingestion.py — Vector Database Ingestion

Manages document ingestion into Qdrant vector database with intelligent chunking.

python
from rag_ingestion import RAGIngestor

ingestor = RAGIngestor(
    collection_name="my_documents",
    embedding_model="nomic-ai/nomic-embed-text-v1.5",
    qdrant_path="./qdrant_db"
)

# Ingest single file
result = ingestor.ingest_document("paper.pdf")
# Returns: {"status": "success", "doc_id": "sha256_hash", "chunks_indexed": 42}

# Batch ingest directory
results = ingestor.ingest_directory("./docs", pattern="*.pdf")

# Delete document
ingestor.delete_document(doc_id="sha256_hash")
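The doc_id shown above is described only as a SHA-256 hash. The sketch below assumes it is derived from the file's bytes, which would make re-ingesting an unchanged file produce the same ID. The function name compute_doc_id is hypothetical and the library may hash different input.

```python
import hashlib
from pathlib import Path


def compute_doc_id(file_path: str, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in blocks to bound memory use.

    Assumption: the document ID is a hash of the raw file contents.
    """
    digest = hashlib.sha256()
    with Path(file_path).open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Under that assumption, compute_doc_id("paper.pdf") would give a value that could be passed to delete_document.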

Key features:

rag_library.py — Query System

The main RAG pipeline with multiple query modes and retrieval strategies.

python
from rag_library import RAGLibrary

rag = RAGLibrary(ingestor=ingestor)

# Standard RAG query (chunk-based, fast)
response = rag.query(
    query="What is the accuracy of Model A?",
    search_mode="semantic",  # or "keyword", "hybrid"
    table_priority=True,
    agentic_references=True
)

# Full document review (for broad questions)
response = rag.query(
    query="Does the paper's narrative flow logically?",
    query_mode="full_review"
)

# Auto mode (LLM chooses appropriate mode)
response = rag.query(
    query="Compare all approaches in the paper",
    query_mode="auto"
)

# Access results
print(response['answer'])
for source in response['sources']:
    print(f"- {source['file_name']} ({source['chunk_type']})")

Query modes:

Search modes:

web_app.py — Web Interface

FastAPI-based web server providing a REST API and an interactive chat interface for document querying.

bash
# Set environment variables
export LLM_PROVIDER=gemini
export LLM_MODEL=gemini-2.5-pro
export PDF_DIRECTORY=./docs

# Start server
python web_app.py
# Server runs on http://localhost:5005

Key features:

Web API Endpoints

Starting the server:

bash
export LLM_PROVIDER=gemini
export LLM_MODEL=gemini-2.5-pro
export PDF_DIRECTORY=./docs

python web_app.py
# Server runs on http://localhost:5005

Query / Chat

POST /api/chat
Query documents with streaming response
json
{
  "query": "What's the F1 score in Table 3?",
  "conversation_id": "optional-conv-id",
  "search_mode": "semantic",
  "query_mode": "auto",
  "table_priority": true,
  "agentic_references": true
}
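A minimal sketch of building this request from Python with only the standard library. The helper build_chat_request is hypothetical, and the commented-out read loop assumes the stream can be consumed line by line, which the documentation does not specify.

```python
import json
import urllib.request


def build_chat_request(query, base_url="http://localhost:5005", **options):
    """Build a POST request for /api/chat with the given query and options."""
    payload = {"query": query, **options}
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "What's the F1 score in Table 3?",
    search_mode="semantic",
    query_mode="auto",
    table_priority=True,
    agentic_references=True,
)

# Sending requires a running server. The line-by-line read is an assumption
# about the stream format.
# with urllib.request.urlopen(request) as response:
#     for line in response:
#         print(line.decode("utf-8"), end="")
```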

Document Management

POST /api/documents/upload
bash
curl -X POST http://localhost:5005/api/documents/upload \
  -F "file=@document.pdf"

GET /api/documents
List all documents

DELETE /api/documents/{relative_path}
Delete document from storage and vector DB

GET /api/documents/status/{task_id}
Check upload status

POST /api/documents/reindex
Reindex all documents (clears and rebuilds vector database)
bash
curl -X POST http://localhost:5005/api/documents/reindex

This clears the vector database collection and re-ingests all documents from the docs folder. The job runs in the background; use the status endpoint with the returned task_id to track progress.
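Tracking a background task amounts to polling the status endpoint until it reports a final state. The sketch below assumes the status response is JSON with a "status" field whose final values are "completed" or "failed". Those field names and values, and the helper names, are hypothetical, so adjust them to what the server actually returns.

```python
import json
import time
import urllib.request


def fetch_status(task_id, base_url="http://localhost:5005"):
    """GET /api/documents/status/{task_id} and decode the JSON body."""
    url = f"{base_url}/api/documents/status/{task_id}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_for_task(task_id, fetch=fetch_status, poll_seconds=2.0,
                  timeout_seconds=600.0, done_states=("completed", "failed")):
    """Poll until the task reaches a final state or the timeout expires.

    Assumption: the response has a "status" key, and done_states lists
    the values that mean the task has finished.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch(task_id)
        if status.get("status") in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Task {task_id} did not finish in {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Passing fetch as a parameter keeps the polling loop testable without a running server.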

File Watcher

POST /api/watcher/start
Start watching for file changes
json
{
  "check_interval": 10.0,
  "debounce_interval": 2.0
}

GET /api/watcher
Get watcher status

POST /api/watcher/stop
Stop file watcher

Configuration

GET /api/config
Get current configuration

POST /api/config/reload
Reload config from environment (no restart needed)

Conversations

POST /api/conversations
Create new conversation

GET /api/conversations/{conv_id}
Get conversation history

DELETE /api/conversations/{conv_id}
Delete conversation

Configuration

About .env File Location

Place the .env file in your current working directory (the directory you run your scripts from). The library calls load_dotenv(), which loads environment variables from the .env file in that directory.

Environment Variables

.env
# Qdrant Configuration
QDRANT_COLLECTION_NAME=my_documents
QDRANT_PATH=./qdrant_db

# Model Configuration
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# LLM Configuration
LLM_PROVIDER=gemini  # anthropic, openai, ollama
LLM_MODEL=gemini-2.5-pro
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=8192

# API Keys
# Note: These keys are loaded from .env and passed to the LLM service
ANTHROPIC_API_KEY=your-api-key
OPENAI_API_KEY=your-api-key

# Chunking Configuration
MAX_TOKENS=4000
MERGE_PEERS=true
INCLUDE_STRUCTURE_CONTEXT=true

# Retrieval
RETRIEVE_N=100
RERANK_TOP_K=6
QUERY_EXPANSION=false

# Documents
PDF_DIRECTORY=./docs

Chunking Configuration Options

These settings control how documents are split into chunks during ingestion:

MAX_TOKENS

Default: 4000

Maximum number of tokens per chunk. Larger chunks contain more context but may reduce retrieval precision. Smaller chunks provide more granular matching but may fragment related information.

  • 2000-3000: More chunks, better for specific questions
  • 4000 (default): Balanced approach
  • 6000-8000: Fewer chunks, better for broad summaries

MERGE_PEERS

Default: true

Whether to merge adjacent "peer" chunks (chunks from the same section/heading) that are small enough to fit together within MAX_TOKENS.

When enabled (true):

  • Related content stays together (better context)
  • Fewer, more coherent chunks
  • Reduced fragmentation of ideas

When disabled (false):

  • Maximum granularity
  • More chunks for precise matching
  • Sections split into smaller pieces

Recommendation: Keep true for most RAG use cases. Only disable if you need maximum chunk granularity for very specific queries.
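To make the interaction between MAX_TOKENS and MERGE_PEERS concrete, here is a minimal sketch of greedy peer merging. It is not the library's or Docling's implementation: the Chunk class and merge_peers function are hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str
    text: str


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def merge_peers(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    """Merge adjacent chunks under the same heading while they fit in max_tokens."""
    merged: list[Chunk] = []
    for chunk in chunks:
        previous = merged[-1] if merged else None
        fits = (
            previous is not None
            and previous.heading == chunk.heading
            and count_tokens(previous.text) + count_tokens(chunk.text) <= max_tokens
        )
        if fits:
            merged[-1] = Chunk(previous.heading, previous.text + "\n" + chunk.text)
        else:
            merged.append(chunk)
    return merged


chunks = [
    Chunk("Results", "a b c"),
    Chunk("Results", "d e"),
    Chunk("Results", "f g h i"),
    Chunk("Methods", "j k"),
]
print(len(merge_peers(chunks, max_tokens=6)))  # 3
print(len(merge_peers(chunks, max_tokens=9)))  # 2
```

Raising the token limit lets more peers merge, which is the trade-off between fewer coherent chunks and finer-grained matching described above.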

INCLUDE_STRUCTURE_CONTEXT

Default: true

Whether to prepend document structure (headings, section paths) to each chunk. This uses Docling's contextualize() method.

With structure context (true):

Example Chunk
# 5.3 Understanding Opportunities for Improvement

## Ablation Study Results

The results show that our method achieved 95% accuracy...

  • LLMs understand document organization
  • Better semantic retrieval (headings add context)
  • Clearer source attribution

Without structure context (false):

Example Chunk
The results show that our method achieved 95% accuracy...

  • Smaller chunks (no redundant headings)
  • LLMs lose context about content location
  • Worse retrieval for generic text

Recommendation: Always keep true for RAG. The structure context significantly improves both retrieval relevance and LLM comprehension of source material.
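The effect of this setting can be approximated in a few lines. The sketch below prepends a heading path to a chunk's text. The function with_structure_context is hypothetical and only imitates the idea, so the exact output of Docling's contextualize() may differ.

```python
def with_structure_context(heading_path: list[str], text: str) -> str:
    """Prepend the non-empty headings above a chunk to its text, one per line."""
    lines = [heading for heading in heading_path if heading]
    lines.append(text)
    return "\n".join(lines)


chunk_text = "The results show that our method achieved 95% accuracy..."
headings = ["5.3 Understanding Opportunities for Improvement", "Ablation Study Results"]

# With structure context: headings travel with the chunk into the embedding.
print(with_structure_context(headings, chunk_text))

# Without structure context: only the bare text is embedded.
print(with_structure_context([], chunk_text))
```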

How Environment Variables Are Loaded

The library loads configuration in the following order:

  1. rag_library.py calls load_dotenv() on import
  2. Environment variables are read from .env file in current working directory
  3. Variables not found in .env use default values
  4. Explicit parameters override .env values

Python - Override .env with parameters
from rag_library import RAGConfig

# .env values are used by default
config = RAGConfig()

# Or override specific values
config = RAGConfig(
    collection_name="custom_collection",
    temperature=0.3
    # Other values come from .env
)
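The precedence described in this section (explicit parameter, then environment, then default) can be sketched as a single lookup function. resolve_setting is a hypothetical helper for illustration, not a function in rag_library, and the default values used below are examples only.

```python
import os


def resolve_setting(name, explicit=None, default=None, cast=str, environ=os.environ):
    """Resolve one setting: explicit argument, then environment, then default."""
    if explicit is not None:
        return explicit
    raw = environ.get(name)
    if raw is not None and raw != "":
        # Environment values are always strings, so convert to the target type.
        return cast(raw)
    return default


# A plain dict stands in for the process environment after .env is loaded.
env = {"LLM_TEMPERATURE": "0.3", "RETRIEVE_N": "100"}

print(resolve_setting("LLM_TEMPERATURE", cast=float, environ=env))                # 0.3
print(resolve_setting("LLM_TEMPERATURE", explicit=0.7, cast=float, environ=env))  # 0.7
print(resolve_setting("RERANK_TOP_K", default=5, cast=int, environ=env))          # 5
```

Note that python-dotenv's load_dotenv() does not overwrite variables already set in the process environment unless override=True is passed, so an exported shell variable takes precedence over the same key in .env.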
Dynamic Configuration (Web App)

The web app supports dynamic LLM provider changes without restart:

bash
# Update environment
export LLM_PROVIDER=anthropic
export LLM_MODEL=claude-sonnet-4-5

# Reload via API
curl -X POST http://localhost:5005/api/config/reload

Advanced Usage

File Watcher for Auto-Ingestion

python
from rag_ingestion import RAGIngestor

ingestor = RAGIngestor(collection_name="docs")

watcher = ingestor.create_file_watcher(
    directory="./docs",
    pattern="*.pdf",
    check_interval=10.0,
    debounce_interval=2.0,
    callback=lambda event_type, data: print(f"{event_type}: {data}"),
    autostart=True
)

# Watcher runs in background, auto-ingests new/modified files
watcher.stop()
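The two intervals serve different purposes: check_interval controls how often the directory is scanned, while debounce_interval delays processing until a file has stopped changing. The sketch below shows one plausible reading of the debounce rule. The function files_ready_for_ingest is hypothetical and the watcher's actual logic may differ.

```python
def files_ready_for_ingest(last_modified: dict[str, float], now: float,
                           debounce_interval: float) -> list[str]:
    """Return paths whose last modification is at least debounce_interval seconds old.

    Assumption: a file still being written keeps updating its modification
    time, so it is skipped until it has been quiet for the full interval.
    """
    return sorted(
        path
        for path, modified_at in last_modified.items()
        if now - modified_at >= debounce_interval
    )


# a.pdf was last touched 10 s ago, b.pdf only 0.5 s ago.
seen = {"a.pdf": 100.0, "b.pdf": 109.5}
print(files_ready_for_ingest(seen, now=110.0, debounce_interval=2.0))  # ['a.pdf']
```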

Direct Library Usage (No Vector DB)

For large-context LLMs, skip vector retrieval:

python
from docling_parser import DoclingParser
from rag_library import RAGLibrary

parser = DoclingParser()
parsed = parser.parse("paper.pdf")

rag = RAGLibrary(ingestor=None)  # No ingestor needed
response = rag.query_with_full_context(
    query="Summarize the methodology",
    parsed_document=parsed,
    selected_tables=[0, 2]
)

CLI Commands

tablemind — Query Documents

The main CLI for querying your document collection with RAG. Uses agentic AI to automatically evaluate query intent and select the optimal retrieval strategy (standard RAG vs. full-document review).
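To show the shape of the routing decision, here is a deliberately simple keyword-based stand-in. The library uses an LLM to classify the query; this sketch does not, and both the hint list and the function choose_query_mode are hypothetical. The mode names match the CLI's --query-mode values.

```python
# Phrases that suggest the question needs the whole document rather than a few chunks.
BROAD_QUERY_HINTS = (
    "summarize",
    "overall",
    "compare all",
    "narrative",
    "entire",
    "whole document",
)


def choose_query_mode(query: str) -> str:
    """Return "full_review" for broad questions and "specific" otherwise."""
    lowered = query.lower()
    if any(hint in lowered for hint in BROAD_QUERY_HINTS):
        return "full_review"
    return "specific"


print(choose_query_mode("Summarize the results"))            # full_review
print(choose_query_mode("What's the F1 score in Table 3?"))  # specific
```

A keyword list like this misclassifies many queries, which is the motivation for delegating the decision to an LLM in auto mode.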

Usage

The CLI is built into rag_library.py. Use one of these methods:

  • python rag_library.py "your question"
  • python -m rag_library "your question"
  • Create an alias: alias tablemind='python /path/to/rag_library.py'

Terminal
# Run the CLI directly (uses agentic auto-detection by default)
python rag_library.py "What are the main findings?"

# Or as a module
python -m rag_library "What are the main findings?"

# Query with custom retrieval options
python rag_library.py "Compare table 3" --retrieve-n 50 --top-k 10

# Enable verbose output (shows agentic classification)
python rag_library.py "What datasets were used?" -v

# Force specific query mode
python rag_library.py "What are the findings?" --query-mode specific

# Force full-document review mode
python rag_library.py "Summarize the results" --query-mode full_review

# Table-only query with agentic fetching
python rag_library.py "What are the performance metrics?" --tables-only --prioritize-tables

# Query with all options
python rag_library.py "Analyze the results" \
  --retrieve-n 100 \
  --top-k 20 \
  --agentic \
  --prioritize-tables \
  --verbose

Command-Line Options

Option                    Default  Description
-v, --verbose             False    Print detailed progress information
--retrieve-n N            20       Number of chunks to retrieve before reranking
--top-k K                 5        Number of chunks to keep after reranking
--query-mode MODE         auto     Query mode: auto (agentic), specific, full_review
--agentic / --no-agentic  True     Enable/disable agentic table/figure fetching
--tables-only             False    Only search table chunks
--figures-only            False    Only search figure chunks
--prioritize-tables       False    Boost table chunks in retrieval results
--show-reasoning          False    Include LLM reasoning in response
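For readers who want to wrap the library in their own script, the option table maps naturally onto argparse. This is a sketch that mirrors the documented flags and defaults, not the parser in rag_library.py. The positional name question and the prog name are assumptions, and BooleanOptionalAction requires Python 3.9 or later.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="tablemind", description="Query a document collection.")
    parser.add_argument("question", help="Question to ask")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("--retrieve-n", type=int, default=20, metavar="N")
    parser.add_argument("--top-k", type=int, default=5, metavar="K")
    parser.add_argument("--query-mode", choices=("auto", "specific", "full_review"), default="auto")
    # BooleanOptionalAction generates both --agentic and --no-agentic.
    parser.add_argument("--agentic", action=argparse.BooleanOptionalAction, default=True)
    parser.add_argument("--tables-only", action="store_true")
    parser.add_argument("--figures-only", action="store_true")
    parser.add_argument("--prioritize-tables", action="store_true")
    parser.add_argument("--show-reasoning", action="store_true")
    return parser


args = build_parser().parse_args(
    ["Compare table 3", "--retrieve-n", "50", "--top-k", "10", "--no-agentic"]
)
print(args.retrieve_n, args.top_k, args.agentic, args.query_mode)  # 50 10 False auto
```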

Other CLI Commands

Terminal
# Run the example script
python ragexample.py

# Run specific examples
python ragexample.py --parse          # Parse PDF example
python ragexample.py --ingest         # Ingest documents example
python ragexample.py --query          # Query documents example
python ragexample.py "Your question"  # Custom query

# For custom query with options, edit ragexample.py or use rag_library.py directly:
python rag_library.py "Your question" --prioritize-tables -v

Performance Considerations

Mode          Speed     Token Usage  Best For
Standard RAG  Fast      Low          Specific facts, table lookups
Full Review   Slow      High         Document synthesis, structure analysis
Auto          Variable  Variable     Mixed workloads

Recommendations

Limitations

API Quick Reference

docling_parser.py

Class/Function              Description
DoclingParser               Main parser class for PDF, Markdown, HTML, DOCX, and text documents
ParsedDocument              Dataclass containing parsed document data
parse_document(file_path)   Parse any supported format (auto-detected)
parse_pdf(file_path)        Quick function to parse a single PDF
parse_markdown(file_path)   Quick function to parse a Markdown file
parse_html(file_path)       Quick function to parse an HTML file
parse_docx(file_path)       Quick function to parse a DOCX file
parse_text_file(file_path)  Quick function to parse a text file
parse_documents(directory)  Parse all documents in directory (recursive)

rag_ingestion.py

Class/Function               Description
RAGIngestor                  Main ingestion class for documents
VectorDBConfig               Configuration for vector database
EmbeddingConfig              Configuration for embedding models
ingest_document(file_path)   Ingest a single document (any format)
ingest_directory(directory)  Batch ingest all documents (recursive)
delete_document(doc_id)      Delete a document from the collection
get_collection_stats()       Get collection statistics

rag_library.py

Class/Function                           Description
RAGLibrary                               Main RAG query system class
LLMService                               Base class for LLM providers (Anthropic, OpenAI, Gemini, Ollama)
RAGLibrary.query(question, search_mode)  Perform query with retrieval and LLM generation
RAGLibrary.query_full_review(question)   Perform hierarchical full-document analysis

Query Mode Parameter

Value          Behavior
"auto"         LLM evaluates query and selects strategy (default)
"standard"     Force standard RAG (fast, focused retrieval)
"full_review"  Force full-document hierarchical review

Examples

Example 1: PDF-Only Pipeline

Python
from rag_ingestion import ingest_pdfs
from rag_library import query_rag

# Step 1: Ingest all PDFs from a directory
print("Ingesting documents...")
results = ingest_pdfs(
    directory="./pdfs",
    collection_name="research_papers",
    pattern="*.pdf"
)

for r in results:
    print(f"  {r['file_path']}: {r['status']} ({r['chunks_indexed']} chunks)")

# Step 2: Query the documents
print("\nQuerying documents...")
result = query_rag(
    "What are the main findings across all papers?",
    collection_name="research_papers",
    db_path="./qdrant_db",
    retrieve_n=50,
    rerank_top_k=10,
    agentic=True
)

print(f"\nAnswer:\n{result['answer']}")

# Step 3: Show sources
print(f"\nSources ({len(result['sources'])}):")
for i, s in enumerate(result['sources'], 1):
    marker = "📊" if s.get('is_table') else "📄"
    print(f"  {i}. {marker} {s.get('file_name')} - {s.get('heading')}")

Example 2: Multi-Format Directory Pipeline

Parse and ingest a directory containing PDFs, Markdown, HTML, DOCX, and text files. The system automatically detects formats and handles each appropriately.

Python
from pathlib import Path
from rag_ingestion import RAGIngestor, VectorDBConfig
from rag_library import query_rag

# Step 1: Set up ingestor with multi-format support
db_config = VectorDBConfig(
    collection_name="multi_format_docs",
    path="./qdrant_db"
)
ingestor = RAGIngestor(db_config=db_config)

# Step 2: Define directory with mixed formats
docs_dir = Path("./documents")

# Step 3: Ingest all supported formats
print("Ingesting multi-format documents...")

# Supported extensions: .pdf, .md, .markdown, .html, .htm, .docx, .txt
supported_extensions = [".pdf", ".md", ".markdown", ".html", ".htm", ".docx", ".txt"]

results = []
for ext in supported_extensions:
    for file_path in docs_dir.glob(f"*{ext}"):
        try:
            result = ingestor.ingest_document(file_path)
            results.append(result)
            print(f"  ✓ {file_path.name}: {result['status']} ({result['chunks_indexed']} chunks, {result['num_tables']} tables)")
        except Exception as e:
            print(f"  ✗ {file_path.name}: {e}")

# Step 4: Query across all formats
print("\nQuerying multi-format collection...")
result = query_rag(
    "What are the main findings across all documents?",
    collection_name="multi_format_docs",
    db_path="./qdrant_db",
    retrieve_n=30,
    rerank_top_k=10,
    agentic=True,
    verbose=True
)

print(f"\nAnswer:\n{result['answer']}")

# Step 5: Show sources with format indicators
print(f"\nSources ({len(result['sources'])}):")
for i, s in enumerate(result['sources'], 1):
    # Determine icon based on content type
    if s.get('is_table'):
        icon = "📊"
    elif s.get('is_figure'):
        icon = "🖼️"
    else:
        icon = "📄"

    # Get file extension for format indicator
    file_name = s.get('file_name', "Unknown")
    ext = Path(file_name).suffix.lstrip('.').upper() or "TXT"

    print(f"  {i}. {icon} [{ext}] {file_name} - {s.get('heading', 'N/A')}")

Example 3: Using ragexample.py for Multi-Format

The bundled ragexample.py script also supports multi-format ingestion:

Terminal
# Put documents (PDF, MD, HTML, DOCX, TXT) in ./documents folder
# Then run:

# Ingest all supported formats
python ragexample.py --ingest

# Query with agentic AI (auto-detects optimal strategy)
python ragexample.py "What are the main findings?"

# The agent evaluates query intent and selects appropriate mode