Overview

Tablemind is a Python library for building retrieval-augmented generation (RAG) systems that properly handle tables, figures, and document-level queries in research papers, technical reports, and business documents.

Problem it Solves

Standard RAG systems chunk documents to work within LLM context limits. This works for prose but fails for:

Tables: rows get fragmented across chunks, headers are separated from their data, and structure is lost
Cross-references: text mentions "Table 3", but retrieval cannot connect the mention to the table itself
Document-level questions: answering them requires synthesis across the entire paper
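To make the table failure mode concrete, here is a minimal sketch of fixed-size chunking applied to a small Markdown table. The table contents, the `naive_chunks` helper, and the 90-character window are all made up for illustration and are not part of Tablemind.

```python
# Hypothetical illustration: fixed-size chunking applied to a Markdown table.
# The table values below are invented for the example.
TABLE_MD = "\n".join([
    "| Model | Precision | Recall | F1 |",
    "|-------|-----------|--------|----|",
    "| A     | 0.91      | 0.88   | 0.89 |",
    "| B     | 0.85      | 0.90   | 0.87 |",
    "| C     | 0.93      | 0.86   | 0.89 |",
])


def naive_chunks(text: str, max_chars: int) -> list:
    """Split text into fixed-size character windows, ignoring structure."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def has_header(chunk: str) -> bool:
    """True if the chunk still carries the table's column names."""
    return "Precision" in chunk and "Recall" in chunk


chunks = naive_chunks(TABLE_MD, max_chars=90)

# Every chunk after the first holds numbers with no column names attached.
orphaned = [c for c in chunks if not has_header(c)]
print(f"{len(orphaned)} of {len(chunks)} chunks lost the header row")
```

A retriever that returns one of the orphaned chunks hands the LLM a row of numbers with no way to tell which column is which.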

Solution

Tablemind treats tables as first-class citizens: it preserves each table's complete structure, detects references to tables and figures, and routes each query to either chunk-based retrieval or full-document review.
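Reference detection can be pictured with a small regular expression. This is only a sketch of the idea, under the assumption that references look like "Table 3" or "Fig. 2". The pattern and the `find_references` helper are hypothetical and may differ from what the library actually does.

```python
import re

# Matches "Table 3", "Figure 10", "Fig. 2", and "fig 2" (case-insensitive).
REF_PATTERN = re.compile(r"\b(Table|Figure|Fig\.?)\s+(\d+)", re.IGNORECASE)


def find_references(text: str) -> list:
    """Return unique (kind, number) pairs in order of first appearance."""
    refs = []
    for kind, number in REF_PATTERN.findall(text):
        normalized = "table" if kind.lower().startswith("tab") else "figure"
        ref = (normalized, int(number))
        if ref not in refs:
            refs.append(ref)
    return refs


sample = "Results are summarized in Table 3 and plotted in Fig. 2 (see table 3)."
print(find_references(sample))
```

Once a chunk is known to mention `("table", 3)`, a retrieval layer can fetch that table even when the table itself did not rank highly for the query.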

Quick Start

Installation

bash

pip install docling sentence-transformers qdrant-client fastapi uvicorn

Basic Usage

python

from rag_ingestion import RAGIngestor
from rag_library import RAGLibrary

# Initialize
ingestor = RAGIngestor(collection_name="my_docs")
rag = RAGLibrary(ingestor=ingestor)

# Ingest a document
result = ingestor.ingest_document("paper.pdf")
print(f"Ingested {result['chunks_indexed']} chunks")

# Query
response = rag.query("What's the F1 score in Table 3?",
                    search_mode="semantic",
                    table_priority=True)
print(response['answer'])

# Reindex all documents
result = ingestor.reindex_collection("./docs")
print(f"Reindexed {result['indexed_count']} documents")

Core Modules

docling_parser.py — Document Parsing

Extracts structure from PDF, Markdown, HTML, DOCX, and text files.

python

from docling_parser import DoclingParser

parser = DoclingParser()
parsed = parser.parse("document.pdf")

# Access parsed data
print(f"Sections: {list(parsed.sections.keys())}")
print(f"Tables: {len(parsed.tables)}")
print(f"Figures: {len(parsed.figures)}")

# Get table as markdown
table_md = parsed.tables[0]['markdown']

Returns:

sections: Hierarchical section structure
tables: Complete markdown representation with row/column counts
figures: Captions and descriptions
full_text: Plain text extraction
markdown: Full markdown representation
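Because each table is available as Markdown, simple checks such as row and column counts can be done with plain string handling. The sketch below is an assumption-laden helper, not part of `docling_parser`: `table_shape` and `is_wide_table` are hypothetical names, and escaped pipes inside cells are not handled.

```python
def table_shape(markdown: str) -> tuple:
    """Return (data_rows, columns) for a pipe-delimited Markdown table."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header row and a separator row")
    # Count cells in the header row after dropping the outer pipes.
    columns = len(lines[0].strip("|").split("|"))
    # Everything after the header and separator rows is data.
    data_rows = len(lines) - 2
    return data_rows, columns


def is_wide_table(markdown: str, max_columns: int = 20) -> bool:
    """Flag tables at or above a column threshold (20 is an illustrative cutoff)."""
    _, columns = table_shape(markdown)
    return columns >= max_columns


example = "\n".join([
    "| Model | Precision | F1 |",
    "|-------|-----------|----|",
    "| A     | 0.9       | 0.8 |",
    "| B     | 0.7       | 0.6 |",
])
print(table_shape(example))
```

With a parsed document, the same helper could be applied to `parsed.tables[0]['markdown']` to decide whether a table needs special handling before it is sent to an LLM.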

rag_ingestion.py — Vector Database Ingestion

Manages document ingestion into a Qdrant vector database with intelligent chunking.

python

from rag_ingestion import RAGIngestor

ingestor = RAGIngestor(
    collection_name="my_documents",
    embedding_model="nomic-ai/nomic-embed-text-v1.5",
    qdrant_path="./qdrant_db"
)

# Ingest single file
result = ingestor.ingest_document("paper.pdf")
# Returns: {"status": "success", "doc_id": "sha256_hash", "chunks_indexed": 42}

# Batch ingest directory
results = ingestor.ingest_directory("./docs", pattern="*.pdf")

# Delete document
ingestor.delete_document(doc_id="sha256_hash")

Key features:

SHA256-based stable document IDs (unchanged files skip reindexing)
Docling HybridChunker (merges related content, respects section boundaries)
Configurable chunk sizes and merge behavior
Metadata-rich chunks (table/figure flags, headings, captions)
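The stable-ID idea can be sketched with the standard library. This assumes the ID is a SHA-256 digest of the file's bytes, which is one plausible reading of "SHA256-based". The helper names below are hypothetical and not the library's API.

```python
import hashlib
from pathlib import Path


def content_id(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def file_id(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large PDFs are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(doc_id: str, indexed_ids: set) -> bool:
    """A document whose ID is already indexed has unchanged content."""
    return doc_id not in indexed_ids


indexed = {content_id(b"version 1 of the paper")}
print(needs_reindex(content_id(b"version 1 of the paper"), indexed))  # unchanged
print(needs_reindex(content_id(b"version 2 of the paper"), indexed))  # edited
```

Hashing content rather than the file path means a renamed file keeps its ID, while any edit to the bytes produces a new one.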

rag_library.py — Query System

The main RAG pipeline with multiple query modes and retrieval strategies.

python

from rag_library import RAGLibrary

rag = RAGLibrary(ingestor=ingestor)

# Standard RAG query (chunk-based, fast)
response = rag.query(
    query="What is the accuracy of Model A?",
    search_mode="semantic",  # or "keyword", "hybrid"
    table_priority=True,
    agentic_references=True
)

# Full document review (for broad questions)
response = rag.query(
    query="Does the paper's narrative flow logically?",
    query_mode="full_review"
)

# Auto mode (LLM chooses appropriate mode)
response = rag.query(
    query="Compare all approaches in the paper",
    query_mode="auto"
)

# Access results
print(response['answer'])
for source in response['sources']:
    print(f"- {source['file_name']} ({source['chunk_type']})")

Query modes:

standard: Chunk-based retrieval (fast, within context limits)
full_review: Hierarchical section summarization (slower, comprehensive)
auto: LLM analyzes query and selects appropriate mode

Search modes:

semantic: Vector embeddings (conceptual queries)
keyword: BM25 (exact terms, model names, metrics)
hybrid: Combined with configurable weights
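One common way to combine semantic and keyword results is to normalize each score set and take a weighted sum. The sketch below shows that technique under stated assumptions: the function names and the `semantic_weight` parameter are illustrative, and the library's actual fusion method may differ.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to [0, 1] so cosine and BM25 values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_scores(semantic: dict, keyword: dict, semantic_weight: float = 0.7) -> dict:
    """Weighted sum of normalized scores; a chunk missing from one list scores 0 there."""
    sem, kw = min_max(semantic), min_max(keyword)
    return {
        chunk_id: semantic_weight * sem.get(chunk_id, 0.0)
        + (1.0 - semantic_weight) * kw.get(chunk_id, 0.0)
        for chunk_id in set(sem) | set(kw)
    }


def top_k(scores: dict, k: int) -> list:
    """Chunk IDs ordered by descending fused score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


semantic = {"a": 0.9, "b": 0.5, "c": 0.1}   # e.g. cosine similarities
keyword = {"b": 12.0, "c": 3.0, "d": 6.0}   # e.g. BM25 scores
print(top_k(hybrid_scores(semantic, keyword, semantic_weight=0.5), k=2))
```

Raising `semantic_weight` favors conceptual matches. Lowering it favors exact terms such as model names and metric labels.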

web_app.py — Web Interface

Flask-based web server providing a REST API and an interactive chat interface for document querying.

bash

# Set environment variables
export LLM_PROVIDER=gemini
export LLM_MODEL=gemini-2.5-pro
export PDF_DIRECTORY=./docs

# Start server
python web_app.py
# Server runs on http://localhost:5005

Key features:

Real-time streaming responses with SSE
Conversation history with memory compaction
Dynamic document ingestion (upload via web UI)
File watcher for auto-ingestion on file changes
Configurable search mode (semantic/keyword/hybrid/auto)
Table prioritization and agentic reference fetching
Query mode selection (standard RAG or full document review)
Dynamic LLM provider switching without restart

Web API Endpoints

Query / Chat

POST /api/chat

Query documents with streaming response

json

{
  "query": "What's the F1 score in Table 3?",
  "conversation_id": "optional-conv-id",
  "search_mode": "semantic",
  "query_mode": "auto",
  "table_priority": true,
  "agentic_references": true
}
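A client for this endpoint needs to build the JSON body and read the server-sent event stream. The sketch below covers both using only the standard library. The request fields come from the example above, but the contents of each event are not documented here, so the parser yields raw `data:` payloads and leaves their interpretation open. The helper names are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional


def build_chat_payload(query: str, conversation_id: Optional[str] = None,
                       search_mode: str = "semantic", query_mode: str = "auto",
                       table_priority: bool = True,
                       agentic_references: bool = True) -> bytes:
    """Encode the request body; conversation_id is omitted when not given."""
    body = {
        "query": query,
        "search_mode": search_mode,
        "query_mode": query_mode,
        "table_priority": table_priority,
        "agentic_references": agentic_references,
    }
    if conversation_id is not None:
        body["conversation_id"] = conversation_id
    return json.dumps(body).encode("utf-8")


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    Events end at a blank line, multi-line data is joined with newlines,
    and comment lines (starting with ':') are skipped. A trailing event
    without a closing blank line is yielded too, which is more lenient
    than the SSE specification.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        yield "\n".join(buffer)


stream = ["data: Hello\n", "\n", ": keep-alive\n", "data: wor\n", "data: ld\n", "\n"]
print(list(iter_sse_data(stream)))
```

In practice the `lines` argument would be the decoded lines of the HTTP response body from `POST /api/chat`.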

Document Management

POST /api/documents/upload

bash

curl -X POST http://localhost:5005/api/documents/upload \
  -F "file=@document.pdf"

GET /api/documents

List all documents

DELETE /api/documents/{relative_path}

Delete document from storage and vector DB

GET /api/documents/status/{task_id}

Check upload status

POST /api/documents/reindex

Reindex all documents (clears and rebuilds vector database)

bash

curl -X POST http://localhost:5005/api/documents/reindex

Clears the vector database collection and re-ingests all documents from the docs folder. Runs in background. Use the status endpoint with the returned task_id to track progress.
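Tracking a background task usually comes down to a polling loop. The sketch below assumes the status endpoint returns JSON with a `status` field whose terminal values are `"completed"` and `"failed"`. Those names are guesses, since the response schema is not shown here, so adjust them to match the real payload.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumed terminal states; the real API may use different names.
TERMINAL_STATES = {"completed", "failed"}


def fetch_status_http(base_url: str, task_id: str) -> dict:
    """GET /api/documents/status/{task_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/documents/status/{task_id}") as resp:
        return json.load(resp)


def wait_for_task(fetch_status: Callable[[], dict], poll_interval: float = 2.0,
                  timeout: float = 600.0,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the task reaches a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(poll_interval)
```

A call might look like `wait_for_task(lambda: fetch_status_http("http://localhost:5005", task_id))`, where `task_id` comes from the reindex response.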

File Watcher

POST /api/watcher/start

Start watching for file changes

json

{
  "check_interval": 10.0,
  "debounce_interval": 2.0
}

GET /api/watcher

Get watcher status

POST /api/watcher/stop

Stop file watcher

Configuration

GET /api/config

Get current configuration

POST /api/config/reload

Reload config from environment (no restart needed)

Conversations

POST /api/conversations

Create new conversation

GET /api/conversations/{conv_id}

Get conversation history

DELETE /api/conversations/{conv_id}

Delete conversation

Configuration

About .env File Location

The .env file should be placed in your current working directory (where you run your scripts from). The library uses load_dotenv() which automatically loads environment variables from .env in the current directory.

Environment Variables

.env

# Qdrant Configuration
QDRANT_COLLECTION_NAME=my_documents
QDRANT_PATH=./qdrant_db

# Model Configuration
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# LLM Configuration
LLM_PROVIDER=gemini  # anthropic, openai, ollama
LLM_MODEL=gemini-2.5-pro
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=8192

# API Keys
# Note: These keys are loaded from .env and passed to the LLM service
ANTHROPIC_API_KEY=your-api-key
OPENAI_API_KEY=your-api-key

# Chunking Configuration
MAX_TOKENS=4000
MERGE_PEERS=true
INCLUDE_STRUCTURE_CONTEXT=true

# Retrieval
RETRIEVE_N=100
RERANK_TOP_K=6
QUERY_EXPANSION=false

# Documents
PDF_DIRECTORY=./docs

Chunking Configuration Options

These settings control how documents are split into chunks during ingestion:

MAX_TOKENS

Default: 4000

Maximum number of tokens per chunk. Larger chunks contain more context but may reduce retrieval precision. Smaller chunks provide more granular matching but may fragment related information.

2000-3000: More chunks, better for specific questions
4000 (default): Balanced approach
6000-8000: Fewer chunks, better for broad summaries

MERGE_PEERS

Default: true

Whether to merge adjacent "peer" chunks (chunks from the same section/heading) that are small enough to fit together within MAX_TOKENS.

When enabled (true):

Related content stays together (better context)
Fewer, more coherent chunks
Reduced fragmentation of ideas

When disabled (false):

Maximum granularity
More chunks for precise matching
Sections split into smaller pieces

Recommendation: Keep true for most RAG use cases. Only disable if you need maximum chunk granularity for very specific queries.
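The effect of peer merging can be approximated in a few lines. This is a simplified sketch, not Docling's HybridChunker: "peers" are taken to be adjacent chunks with the same heading, and a whitespace word count stands in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str
    text: str


def count_tokens(text: str) -> int:
    """Whitespace word count, a stand-in for a real tokenizer."""
    return len(text.split())


def merge_peers(chunks: list, max_tokens: int) -> list:
    """Merge adjacent chunks that share a heading while the result fits max_tokens."""
    merged = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        fits = (
            prev is not None
            and prev.heading == chunk.heading
            and count_tokens(prev.text) + count_tokens(chunk.text) <= max_tokens
        )
        if fits:
            merged[-1] = Chunk(prev.heading, prev.text + "\n" + chunk.text)
        else:
            merged.append(Chunk(chunk.heading, chunk.text))
    return merged


chunks = [
    Chunk("Intro", "a b c"),
    Chunk("Intro", "d e"),
    Chunk("Methods", "f g"),
    Chunk("Methods", "h i j k"),
]
for c in merge_peers(chunks, max_tokens=5):
    print(c.heading, "|", c.text.replace("\n", " / "))
```

With a budget of 5 tokens the two "Intro" chunks merge, while the two "Methods" chunks stay separate because together they would exceed the limit.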

INCLUDE_STRUCTURE_CONTEXT

Default: true

Whether to prepend document structure (headings, section paths) to each chunk. This uses Docling's contextualize() method.

With structure context (true):

Example Chunk

# 5.3 Understanding Opportunities for Improvement

## Ablation Study Results

The results show that our method achieved 95% accuracy...

LLMs understand document organization
Better semantic retrieval (headings add context)
Clearer source attribution

Without structure context (false):

Example Chunk

The results show that our method achieved 95% accuracy...

Smaller chunks (no redundant headings)
LLMs lose context about content location
Worse retrieval for generic text

Recommendation: Always keep true for RAG. The structure context significantly improves both retrieval relevance and LLM comprehension of source material.
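The "with structure context" form can be reproduced by prepending the heading path to the chunk text. The helper below is a hypothetical stand-in that mimics the example chunk shown earlier in this section. It is not Docling's `contextualize()` implementation.

```python
def with_structure_context(heading_path: list, text: str) -> str:
    """Prefix chunk text with its heading path, one Markdown level per entry."""
    headings = [
        f"{'#' * level} {title}"
        for level, title in enumerate(heading_path, start=1)
    ]
    return "\n\n".join(headings + [text])


chunk_text = "The results show that our method achieved 95% accuracy..."
path = ["5.3 Understanding Opportunities for Improvement", "Ablation Study Results"]
print(with_structure_context(path, chunk_text))
```

The embedded text now contains "Ablation Study Results", so a query about ablations can match a chunk whose body never uses that word.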

How Environment Variables Are Loaded

The library loads configuration in the following order:

rag_library.py calls load_dotenv() on import
Environment variables are read from .env file in current working directory
Variables not found in .env use default values
Explicit parameters override .env values

Python - Override .env with parameters

from rag_library import RAGConfig

# .env values are used by default
config = RAGConfig()

# Or override specific values
config = RAGConfig(
    collection_name="custom_collection",
    temperature=0.3
    # Other values come from .env
)
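The precedence described above (explicit parameter, then environment, then default) can be sketched as a small resolver. The function names and the default values used here are placeholders for illustration, not the library's actual defaults.

```python
import os
from typing import Mapping, Optional


def resolve_setting(name: str, explicit: Optional[str],
                    env: Mapping[str, str], default: str) -> str:
    """Explicit argument wins, then the environment, then the default."""
    if explicit is not None:
        return explicit
    return env.get(name, default)


def load_settings(env: Mapping[str, str] = os.environ,
                  collection_name: Optional[str] = None,
                  temperature: Optional[str] = None) -> dict:
    """Resolve two example settings; the defaults are placeholders."""
    return {
        "collection_name": resolve_setting(
            "QDRANT_COLLECTION_NAME", collection_name, env, "documents"),
        "temperature": float(resolve_setting(
            "LLM_TEMPERATURE", temperature, env, "0.3")),
    }


fake_env = {"QDRANT_COLLECTION_NAME": "my_documents"}
print(load_settings(fake_env))
print(load_settings(fake_env, collection_name="custom_collection"))
```

Passing the environment as a plain mapping keeps the resolver easy to test without touching the real process environment.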

Dynamic Configuration (Web App)

The web app supports dynamic LLM provider changes without restart:

bash

# Update environment
export LLM_PROVIDER=anthropic
export LLM_MODEL=claude-sonnet-4-5

# Reload via API
curl -X POST http://localhost:5005/api/config/reload

Advanced Usage

File Watcher for Auto-Ingestion

python

from rag_ingestion import RAGIngestor

ingestor = RAGIngestor(collection_name="docs")

watcher = ingestor.create_file_watcher(
    directory="./docs",
    pattern="*.pdf",
    check_interval=10.0,
    debounce_interval=2.0,
    callback=lambda event_type, data: print(f"{event_type}: {data}"),
    autostart=True
)

# Watcher runs in background, auto-ingests new/modified files
watcher.stop()
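The two intervals can be understood through a small debounce model. The sketch assumes that `check_interval` is how often the directory is polled and that `debounce_interval` is how long a file must stay unchanged before it is ingested. That reading is an assumption, and the `Debouncer` class is hypothetical.

```python
class Debouncer:
    """Release a path only after it has been quiet for debounce_interval seconds."""

    def __init__(self, debounce_interval: float):
        self.debounce_interval = debounce_interval
        self._last_change = {}

    def record_change(self, path: str, now: float) -> None:
        """Note a modification; repeated changes push the release time back."""
        self._last_change[path] = now

    def pop_ready(self, now: float) -> list:
        """Return and forget the paths whose last change is old enough."""
        ready = sorted(
            path for path, changed_at in self._last_change.items()
            if now - changed_at >= self.debounce_interval
        )
        for path in ready:
            del self._last_change[path]
        return ready


debouncer = Debouncer(debounce_interval=2.0)
debouncer.record_change("a.pdf", now=0.0)
debouncer.record_change("a.pdf", now=1.5)   # still being written
print(debouncer.pop_ready(now=3.0))          # too soon after the last change
print(debouncer.pop_ready(now=4.0))          # quiet long enough
```

Debouncing avoids ingesting a file that is still being copied or saved, at the cost of a short delay before new content becomes searchable.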

Direct Library Usage (No Vector DB)

For large-context LLMs, skip vector retrieval:

python

from docling_parser import DoclingParser
from rag_library import RAGLibrary

parser = DoclingParser()
parsed = parser.parse("paper.pdf")

rag = RAGLibrary(ingestor=None)  # No ingestor needed
response = rag.query_with_full_context(
    query="Summarize the methodology",
    parsed_document=parsed,
    selected_tables=[0, 2]
)

CLI Commands

tablemind — Query Documents

The main CLI for querying your document collection with RAG. Uses agentic AI to automatically evaluate query intent and select the optimal retrieval strategy (standard RAG vs. full-document review).

Usage

The CLI is built into rag_library.py. Use one of these methods:

python rag_library.py "your question"
python -m rag_library "your question"
Create an alias: alias tablemind='python /path/to/rag_library.py'

Terminal

# Run the CLI directly (uses agentic auto-detection by default)
python rag_library.py "What are the main findings?"

# Or as a module
python -m rag_library "What are the main findings?"

# Query with custom retrieval options
python rag_library.py "Compare table 3" --retrieve-n 50 --top-k 10

# Enable verbose output (shows agentic classification)
python rag_library.py "What datasets were used?" -v

# Force specific query mode
python rag_library.py "What are the findings?" --query-mode specific

# Force full-document review mode
python rag_library.py "Summarize the results" --query-mode full_review

# Table-only query with agentic fetching
python rag_library.py "What are the performance metrics?" --tables-only --prioritize-tables

# Query with all options
python rag_library.py "Analyze the results" \
  --retrieve-n 100 \
  --top-k 20 \
  --agentic \
  --prioritize-tables \
  --verbose

Command-Line Options

Option	Default	Description
`-v, --verbose`	`False`	Print detailed progress information
`--retrieve-n N`	`20`	Number of chunks to retrieve before reranking
`--top-k K`	`5`	Number of chunks to keep after reranking
`--query-mode MODE`	`auto`	Query mode: auto (agentic), specific, full_review
`--agentic / --no-agentic`	`True`	Enable/disable agentic table/figure fetching
`--tables-only`	`False`	Only search table chunks
`--figures-only`	`False`	Only search figure chunks
`--prioritize-tables`	`False`	Boost table chunks in retrieval results
`--show-reasoning`	`False`	Include LLM reasoning in response
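The relationship between `--retrieve-n` and `--top-k` is the usual two-stage pattern: a cheap first pass selects a shortlist, and a more expensive reranker picks the final chunks. The sketch below uses toy scoring functions in place of an embedding search and a cross-encoder. All names are illustrative.

```python
from typing import Callable


def retrieve_then_rerank(candidates: list,
                         first_stage: Callable[[str], float],
                         reranker: Callable[[str], float],
                         retrieve_n: int, top_k: int) -> list:
    """Keep the retrieve_n best by the cheap score, then the top_k best by the reranker."""
    shortlist = sorted(candidates, key=first_stage, reverse=True)[:retrieve_n]
    return sorted(shortlist, key=reranker, reverse=True)[:top_k]


def overlap(query: str) -> Callable[[str], float]:
    """Toy first-stage scorer: number of words shared with the query."""
    query_words = set(query.lower().split())
    return lambda chunk: float(len(query_words & set(chunk.lower().split())))


def mentions_table(chunk: str) -> float:
    """Toy reranker: prefer chunks that mention a table."""
    return 1.0 if "table" in chunk.lower() else 0.0


docs = [
    "The F1 score is reported in Table 3",
    "We thank the reviewers",
    "The F1 score improves with more data",
    "Training used eight GPUs",
]
print(retrieve_then_rerank(docs, overlap("F1 score"), mentions_table,
                           retrieve_n=2, top_k=1))
```

A larger `retrieve_n` gives the reranker more chances to recover a relevant chunk that scored poorly in the first pass, at the cost of more reranker calls.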

Other CLI Commands

Terminal

# Run the example script
python ragexample.py

# Run specific examples
python ragexample.py --parse          # Parse PDF example
python ragexample.py --ingest         # Ingest documents example
python ragexample.py --query          # Query documents example
python ragexample.py "Your question"  # Custom query

# For custom query with options, edit ragexample.py or use rag_library.py directly:
python rag_library.py "Your question" --prioritize-tables -v

Performance Considerations

Mode	Speed	Token Usage	Best For
Standard RAG	Fast	Low	Specific facts, table lookups
Full Review	Slow	High	Document synthesis, structure analysis
Auto	Variable	Variable	Mixed workloads

Recommendations

Use standard mode for most queries (default via auto)
Enable table_priority for table-heavy documents
Use full_review only when needed (auto mode handles this)
Set MAX_TOKENS=4000 for better table preservation
Enable MERGE_PEERS=true for better context coherence

Limitations

Large tables: 20+ columns can be challenging for LLMs
Cross-document references: Not currently supported
OCR quality: Scanned PDFs may have table structure errors
Full review cost: More expensive than chunk-based retrieval

API Quick Reference

docling_parser.py

Class/Function	Description
`DoclingParser`	Main parser class for PDF, Markdown, HTML, DOCX, and text documents
`ParsedDocument`	Dataclass containing parsed document data
`parse_document(file_path)`	Parse any supported format (auto-detected)
`parse_pdf(file_path)`	Quick function to parse a single PDF
`parse_markdown(file_path)`	Quick function to parse a Markdown file
`parse_html(file_path)`	Quick function to parse an HTML file
`parse_docx(file_path)`	Quick function to parse a DOCX file
`parse_text_file(file_path)`	Quick function to parse a text file
`parse_documents(directory)`	Parse all documents in directory (recursive)

rag_ingestion.py

Class/Function	Description
`RAGIngestor`	Main ingestion class for documents
`VectorDBConfig`	Configuration for vector database
`EmbeddingConfig`	Configuration for embedding models
`ingest_document(file_path)`	Ingest a single document (any format)
`ingest_directory(directory)`	Batch ingest all documents (recursive)
`delete_document(doc_id)`	Delete a document from the collection
`get_collection_stats()`	Get collection statistics

rag_library.py

Class/Function	Description
`RAGLibrary`	Main RAG query system class
`LLMService`	Base class for LLM providers (Anthropic, OpenAI, Gemini, Ollama)
`RAGLibrary.query(question, search_mode)`	Perform query with retrieval and LLM generation
`RAGLibrary.query_full_review(question)`	Perform hierarchical full-document analysis

Query Mode Parameter

Value	Behavior
`"auto"`	LLM evaluates query and selects strategy (default)
`"standard"`	Force standard RAG (fast, focused retrieval)
`"full_review"`	Force full-document hierarchical review

Examples

Example 1: PDF-Only Pipeline

Python

from rag_ingestion import ingest_pdfs
from rag_library import query_rag

# Step 1: Ingest all PDFs from a directory
print("Ingesting documents...")
results = ingest_pdfs(
    directory="./pdfs",
    collection_name="research_papers",
    pattern="*.pdf"
)

for r in results:
    print(f"  {r['file_path']}: {r['status']} ({r['chunks_indexed']} chunks)")

# Step 2: Query the documents
print("\nQuerying documents...")
result = query_rag(
    "What are the main findings across all papers?",
    collection_name="research_papers",
    db_path="./qdrant_db",
    retrieve_n=50,
    rerank_top_k=10,
    agentic=True
)

print(f"\nAnswer:\n{result['answer']}")

# Step 3: Show sources
print(f"\nSources ({len(result['sources'])}):")
for i, s in enumerate(result['sources'], 1):
    marker = "📊" if s.get('is_table') else "📄"
    print(f"  {i}. {marker} {s.get('file_name')} - {s.get('heading')}")

Example 2: Multi-Format Directory Pipeline

Parse and ingest a directory containing PDFs, Markdown, HTML, DOCX, and text files. The system automatically detects formats and handles each appropriately.

Python

from pathlib import Path
from rag_ingestion import RAGIngestor, VectorDBConfig
from rag_library import query_rag

# Step 1: Set up ingestor with multi-format support
db_config = VectorDBConfig(
    collection_name="multi_format_docs",
    path="./qdrant_db"
)
ingestor = RAGIngestor(db_config=db_config)

# Step 2: Define directory with mixed formats
docs_dir = Path("./documents")

# Step 3: Ingest all supported formats
print("Ingesting multi-format documents...")

# Supported extensions: .pdf, .md, .markdown, .html, .htm, .docx, .txt
supported_extensions = [".pdf", ".md", ".markdown", ".html", ".htm", ".docx", ".txt"]

results = []
for ext in supported_extensions:
    for file_path in docs_dir.glob(f"*{ext}"):
        try:
            # ingest_document accepts any supported format (see API Quick Reference)
            result = ingestor.ingest_document(str(file_path))
            results.append(result)
            print(f"  ✓ {file_path.name}: {result['status']} ({result['chunks_indexed']} chunks, {result.get('num_tables', 0)} tables)")
        except Exception as e:
            print(f"  ✗ {file_path.name}: {e}")

# Step 4: Query across all formats
print("\nQuerying multi-format collection...")
result = query_rag(
    "What are the main findings across all documents?",
    collection_name="multi_format_docs",
    db_path="./qdrant_db",
    retrieve_n=30,
    rerank_top_k=10,
    agentic=True,
    verbose=True
)

print(f"\nAnswer:\n{result['answer']}")

# Step 5: Show sources with format indicators
print(f"\nSources ({len(result['sources'])}):")
for i, s in enumerate(result['sources'], 1):
    # Determine icon based on content type
    if s.get('is_table'):
        icon = "📊"
    elif s.get('is_figure'):
        icon = "🖼️"
    else:
        icon = "📄"

    # Get file extension for format indicator
    file_name = s.get('file_name', "Unknown")
    ext = Path(file_name).suffix.lstrip('.').upper() or "TXT"

    print(f"  {i}. {icon} [{ext}] {file_name} - {s.get('heading', 'N/A')}")

Example 3: Using ragexample.py for Multi-Format

The bundled ragexample.py script also supports multi-format ingestion:

Terminal

# Put documents (PDF, MD, HTML, DOCX, TXT) in ./documents folder
# Then run:

# Ingest all supported formats
python ragexample.py --ingest

# Query with agentic AI (auto-detects optimal strategy)
python ragexample.py "What are the main findings?"

# The agent evaluates query intent and selects appropriate mode