v2.5.0

Tablemind

Powerful document parsing & Retrieval-Augmented Generation

Tablemind

A table-first RAG library that solves the critical gap in document QA: querying tables. Built on semantic/layout-preserving document parsing (PDF, Markdown, HTML, DOCX) with Docling, it provides modular components for parsing, ingestion, and retrieval — plus an interactive web interface for demonstration. Features automatic query mode detection, agentic AI for table/figure fetching, and full-document review capabilities.

Incremental Efficiency: SHA256-based content hashing ensures only changed documents are re-processed. The intelligent file watching system automatically detects additions, modifications, and deletions — reindexing only what changed, dramatically reducing processing time and computational overhead for growing document collections.

📄

Structure Preservation

Extracts and preserves document structure including sections, headings, and hierarchical organization for rich context understanding.

📊

Table Extraction

Advanced table detection and extraction with Markdown formatting, preserving cell structure and relationships.

🔍

Hybrid Search

Combines semantic, keyword (BM25), and hybrid search with cross-encoder reranking for optimal retrieval.

🤖

Agentic RAG

LLM-powered agent that automatically detects and fetches referenced tables and figures for complete answers.

⚡

Production Ready

Built for scale with modular design, comprehensive error handling, and flexible configuration options.

🔌

Easy Integration

Simple API with convenience functions for quick integration. Works with Anthropic, OpenAI, and more.

01 What is Tablemind?

A comprehensive document intelligence library for RAG applications

đŸŽ¯ Purpose

Tablemind is a Python library that helps you build Retrieval-Augmented Generation (RAG) applications. It handles the entire pipeline from document parsing to intelligent query answering, with special focus on preserving document structure, extracting tables and figures, and providing accurate source citations.

🔄 Workflow

1. Parse

Extract structure, tables, figures from PDF/MD/HTML/DOCX

2. Ingest

Chunk, embed, store in vector database (Qdrant)

3. Retrieve

Vector search + cross-encoder reranking

4. Answer

LLM generates answer with source citations

đŸ—ī¸ Architecture

The library is organized into three core modules:

docling_parser.py

Multi-format document parsing (PDF, Markdown, HTML, DOCX, TXT) with table/figure extraction and structure preservation

rag_ingestion.py

Document ingestion pipeline with chunking, embedding generation, and vector database storage (Qdrant)

rag_library.py

RAG query system with vector search, reranking, agentic table/figure fetching, and LLM answer generation

Key Differentiators

✓ Table-First Design

Tables are extracted as complete Markdown (not chunked), with special retrieval prioritization and agentic fetching

✓ Semantic Context Authority

Document structure and semantic references are preserved, enabling accurate cross-section table discovery

✓ Multi-Format Support

Parse PDF, Markdown, HTML, DOCX, and text files with a unified API and automatic format detection

✓ Production Ready

Incremental updates, change detection, error handling, and comprehensive configuration options

Use Cases

Research Paper Q&A

Query academic papers with accurate table/figure citations

Data Analysis

Extract and compare metrics from reports

Document Search

Semantic search across large document collections

Knowledge Base

Build AI assistants with domain-specific knowledge

02 Installation

Get up and running in seconds with pip

Quick Install

Terminal
pip install tablemind

Or install from source for development:

Terminal
git clone https://github.com/ngstcf/tablemind.git
cd tablemind
pip install -e .

03 Library Modules

Three powerful modules for complete document intelligence

docling_parser.py

Multi-Format Document Parsing

  • PDF: Full Docling support with OCR option
  • Markdown: Native Docling support (.md, .markdown)
  • HTML: Native Docling support (.html, .htm)
  • DOCX: Native Docling support (.docx)
  • Text: Custom parser with structure extraction
  • Extract tables as Markdown
  • Identify figures and captions
  • Preserve section hierarchy
  • Get LLM-ready context
  • Batch directory processing (recursive)
  • Export to JSON/dict

rag_ingestion.py

Document Ingestion for RAG

  • Vector database integration (Qdrant)
  • Automatic chunking with merge
  • Embedding generation
  • Metadata extraction
  • Change detection (skip unchanged)
  • Batch ingestion
  • Custom configurations

rag_library.py

RAG Query System

  • Vector similarity search
  • Cross-encoder reranking
  • LLM answer generation
  • Agentic table/figure fetching
  • Table-prioritized retrieval
  • Query filtering
  • Multi-LLM support

04 Quick Start

Get results in minutes with these simple steps

1

Parse Your Documents

Extract structured data including tables, figures, and sections from PDF, Markdown, HTML, DOCX, or text files.

Python
from docling_parser import parse_document

# Parse any supported format (auto-detected)
doc = parse_document("research_paper.pdf")
doc = parse_document("README.md")
doc = parse_document("report.html")
doc = parse_document("document.docx")
doc = parse_document("notes.txt")

# Access extracted data
print(f"Format: {doc.metadata.get('format')}")
print(f"Tables: {len(doc.tables)}")
print(f"Figures: {len(doc.figures)}")
print(f"Sections: {len(doc.sections)}")
2

Ingest into Vector Database

Chunk, embed, and store your documents for fast semantic retrieval.

Python
from rag_ingestion import ingest_document, ingest_documents

# Ingest a single document (any format)
result = ingest_document(
    "research_paper.pdf",
    collection_name="my_documents",
    db_path="./qdrant_db"
)

# Or ingest entire directory recursively
results = ingest_documents(
    directory="./docs",
    collection_name="my_documents",
    recursive=True
)

print(f"Status: {result['status']}")
print(f"Chunks indexed: {result['chunks_indexed']}")
3

Query with RAG

Ask questions and get accurate answers with source citations and agentic table fetching.

Python
from rag_library import query_rag

# Query your documents
result = query_rag(
    "What are the main findings from the ablation study?",
    collection_name="my_documents",
    db_path="./qdrant_db",
    retrieve_n=50,
    rerank_top_k=10,
    agentic=True
)

print(f"Answer: {result['answer']}")

05 Incremental Updates

Smart change detection for efficient document reindexing

How It Works

The ingestion system automatically detects file changes using SHA256 content hashing. When you ingest documents, it calculates a hash of each file and stores it as the doc_id. On subsequent ingestions, the system checks if a document with that hash already exists in the vector database and skips re-indexing if unchanged.

Key Features

🔐

SHA256 Content Hashing

Each document is hashed based on its binary content, creating a unique identifier that changes if any byte is modified.

⚡

Skip Unchanged Files

Documents already in the collection are automatically skipped, saving processing time and API costs.

🔄

Force Re-index

Use force=True to delete old chunks and re-index a document from scratch.

📊

Progress Tracking

Get detailed status for each file including skipped, success, and chunks indexed.

👀

Automatic File Watching

Monitor directories for changes and automatically reindex when files are added, modified, or deleted.

🔔

Event Callbacks

Register callbacks to be notified when files are added, modified, or deleted from the collection.

Usage Examples

1

Ingest Only New/Changed Documents

By default, only documents that haven't been indexed before will be processed.

Python
from rag_ingestion import ingest_pdfs

# Only ingest new/changed documents (default)
results = ingest_pdfs(
    directory="./pdfs",
    collection_name="my_documents"
)

for r in results:
    if r["status"] == "skipped":
        print(f"â­ī¸  Skipped: {r['file_path']}")
    elif r["status"] == "success":
        print(f"✅ Indexed: {r['file_path']} ({r['chunks_indexed']} chunks)")
2

Force Re-index All Documents

Delete existing chunks and re-process all documents from scratch.

Python
# Force re-index all documents
results = ingest_pdfs(
    directory="./pdfs",
    collection_name="my_documents",
    force=True  # Delete old chunks and re-index
)

# Or force re-index a single file
from rag_ingestion import ingest_pdf
result = ingest_pdf(
    "updated_document.pdf",
    force=True
)
3

Understanding the Status

Each ingestion result includes status information.

Python
# Possible status values:
"success"     # Document was indexed successfully
"skipped"     # Document already indexed (unchanged)
"error"       # Error during ingestion

# Result structure:
{
    "status": "success",
    "file_path": "./pdfs/paper.pdf",
    "doc_id": "a3f5e9...",  # SHA256 hash
    "chunks_indexed": 42,
    "message": "Successfully indexed"
}
4

Automatic File Watching

Monitor a directory for changes and automatically reindex when files are added, modified, or deleted.

Python
from rag_ingestion import RAGIngestor, VectorDBConfig

ingestor = RAGIngestor(
    db_config=VectorDBConfig(
        collection_name="my_documents",
        path="./qdrant_db"
    )
)

# Start watching a directory
watcher = ingestor.watch(
    directory="./documents",
    pattern="*.pdf",
    debounce_interval=2.0,  # Wait 2s before processing
    check_interval=10.0     # Check every 10s
)

# Watcher runs in background, automatically detecting:
# - New files added ➕
# - Existing files deleted đŸ—‘ī¸
# - Previously indexed files modified 🔄

# Stop when done
watcher.stop()

File Watching with Callbacks

Register callbacks to be notified of file changes:

Python
from rag_ingestion import RAGIngestor

ingestor = RAGIngestor()

def on_file_change(event_type, event_data):
    """Callback for file change events"""
    if event_type == "added":
        print(f"➕ New: {event_data['file_path']}")
        print(f"   Chunks: {event_data['chunks_indexed']}")
    elif event_type == "modified":
        print(f"🔄 Modified: {event_data['file_path']}")
    elif event_type == "deleted":
        print(f"đŸ—‘ī¸  Deleted: {event_data['file_path']}")

watcher = ingestor.watch(
    "./documents",
    callback=on_file_change
)

# Or use context manager for automatic cleanup
with ingestor.watch("./documents"):
    import time
    time.sleep(300)  # Watch for 5 minutes

CLI Usage

Terminal
# Ingest only new documents (default)
python -m rag_ingestion ./pdfs --pattern "*.pdf"

# Force re-index all documents
python -m rag_ingestion ./pdfs --pattern "*.pdf" --force

# Watch directory for changes (automatic reindexing)
python -m rag_ingestion ./pdfs --watch

# Watch with custom intervals
python -m rag_ingestion ./pdfs --watch --check-interval 10 --debounce 2

# Watch with verbose output
python -m rag_ingestion ./pdfs --watch --verbose

# Use custom collection
python -m rag_ingestion ./pdfs --collection my_docs --watch

Understanding Watch Parameters

📊 Debounce Interval

The --debounce parameter defines how long to wait (in seconds) after detecting a file change before processing it. This prevents redundant reindexing when files are being rapidly modified.

Example: If you save a PDF file that triggers multiple filesystem events, the watcher waits 2 seconds (with --debounce 2) after the last change before reindexing. This ensures only the final version is processed.

🔍 Check Interval

The --check-interval parameter defines how frequently (in seconds) the watcher scans the directory for changes. Lower values = faster detection but higher CPU usage.

📁 Default Collection

When --collection is not specified, the system uses:

  1. The QDRANT_COLLECTION_NAME environment variable (if set)
  2. Falls back to my_documents as the default collection name

Tip: Set QDRANT_COLLECTION_NAME in your .env file to avoid specifying --collection for every command.

âš™ī¸ Recommended Settings

Use Case Check Interval Debounce
Active development 2-5 seconds 1-2 seconds
Production/low-load 10-30 seconds 2-5 seconds
Batch processing 60+ seconds 5-10 seconds

💡 Best Practices

  • Run incremental ingestion regularly to add new documents without reprocessing existing ones
  • Use force=True only when documents have been updated or you need to refresh embeddings
  • The SHA256 hash is based on file content, so renaming a file won't trigger re-indexing
  • Check the status field in results to understand what was processed
  • Use file watching for automatic reindexing in production environments
  • Adjust debounce_interval based on how quickly files are typically modified
  • Set check_interval appropriately to balance responsiveness and system load

06 Large-Context LLM Integration

Pass parsed documents directly to LLMs without chunking or vector databases

1

Get Document Context for LLM

Parse a PDF and get LLM-ready context with all tables, figures, and sections preserved.

Python
from docling_parser import get_llm_context

# Get full document context suitable for large-context LLMs
# Parameters:
# - include_tables: Adds tables as Markdown with full structure (headers, rows, columns)
# - include_figures: Adds figure captions (not the images themselves)
# - table_numbers: Optional list to include specific tables only (e.g., [1, 3, 5])
context = get_llm_context(
    "research_paper.pdf",
    include_tables=True,
    include_figures=True
)

print(f"Context length: {len(context)} characters")
2

Query with OpenAI

IMPORTANT: The prompt must explicitly instruct the LLM to output tables in markdown format. Without proper instructions, LLMs will summarize tables into bullet points.

Python
from openai import OpenAI
from docling_parser import get_llm_context

# 1. Get full document context with tables and figures
context = get_llm_context(
    "research_paper.pdf",
    include_tables=True,   # Include tables as Markdown
    include_figures=True   # Include figure captions
)

# 2. Initialize OpenAI client
client = OpenAI()

# 3. CRITICAL: Include explicit table presentation instructions in prompt
table_prompt = """
Structure your response as follows:
1. First: Provide your analysis and answer to the question citing specific sources from the context.
2. Then: If any tables support your answer, reproduce those complete tables in markdown format

For non-table content, provide a detailed answer citing specific sources from the context.
"""

response = client.chat.completions.create(
    model="gpt-4o",  # 128K context window
    messages=[{
        "role": "user",
        "content": f"""Document:

{context}

{table_prompt}

Question: What are the key findings from the ablation study?
"""
    }]
)

print(response.choices[0].message.content)

âš ī¸ Why Prompt Instructions Matter

LLMs naturally tend to summarize tabular data into bullet points or prose. To get accurate table reproduction, you must explicitly instruct them to:

  • Output markdown tables - Tell the LLM to create markdown table syntax with | separators
  • Include ALL data - Specify that all rows, columns, and values must be included
  • Preserve precision - Instruct not to round numerical values

Without these instructions: LLM output → "The table shows GPT-4 scored 0.85 while Claude scored 0.82."
With these instructions: LLM output → Full markdown table with all variants and metrics preserved

Advanced Options

Python
from docling_parser import get_llm_context

# Include only specific tables
context = get_llm_context(
    "research_paper.pdf",
    include_tables=True,
    table_numbers=[1, 3, 5]  # Only tables 1, 3, and 5
)

# Exclude figures
context = get_llm_context(
    "research_paper.pdf",
    include_tables=True,
    include_figures=False  # No figure captions
)

When to Use Direct LLM Context

📄

Single Document Analysis

Perfect for analyzing one document without needing a vector database.

📊

Table-Heavy Documents

Preserves all table data in Markdown format for accurate analysis.

🔍

Detailed Review

When you need comprehensive understanding of the entire document.

⚡

Fast Prototyping

No setup required - just parse and send to your LLM of choice.

07 Multi-Format Support

Parse PDF, Markdown, and text files with a unified API

Supported Formats

The docling_parser module supports multiple document formats with automatic format detection based on file extension.

📄

PDF (.pdf)

Full Docling support with OCR option for scanned documents. Preserves tables, figures, and sections.

📝

Markdown (.md, .markdown)

Native Docling support. Extracts tables, figures, and preserves document hierarchy.

🌐

HTML (.html, .htm)

Native Docling support. Parses web pages with structure, tables, and embedded content.

📝

DOCX (.docx)

Native Docling support. Parses Word documents with formatting, tables, and sections.

📃

Text (.txt)

Custom parser with markdown-style table and section extraction. Great for README and notes files.

Usage Examples

Python - Parse any format
from docling_parser import parse_document

# Format is auto-detected from file extension
pdf_doc = parse_document("paper.pdf")
md_doc = parse_document("README.md")
html_doc = parse_document("report.html")
docx_doc = parse_document("document.docx")
txt_doc = parse_document("notes.txt")

# All return ParsedDocument with same structure
for doc in [pdf_doc, md_doc, html_doc, docx_doc, txt_doc]:
    print(f"{doc.file_name}: {doc.metadata.get('format')}")
Python - Format-specific convenience functions
from docling_parser import (
    parse_pdf, parse_markdown, parse_html,
    parse_docx, parse_text_file
)

# Use format-specific functions for clarity
pdf_doc = parse_pdf("paper.pdf")
md_doc = parse_markdown("README.md")
html_doc = parse_html("report.html")
docx_doc = parse_docx("document.docx")
txt_doc = parse_text_file("notes.txt")
Python - Parse directory with multiple formats
from docling_parser import parse_documents

# Parse all supported formats in a directory
docs = parse_documents("./docs", recursive=True)

for doc in docs:
    print(f"{doc.file_name} ({doc.metadata.get('format')}): {len(doc.tables)} tables")

Format Detection

The library automatically detects the format based on file extension:

Extension Format Parser
.pdf PDF Docling with OCR option
.md, .markdown Markdown Docling native
.html, .htm HTML Docling native
.docx DOCX Docling native
.txt Text Custom parser

You can also selectively enable formats using the supported_formats parameter:

Python
from docling_parser import DoclingParser

# Only enable PDF, Markdown, and HTML
parser = DoclingParser(supported_formats=["pdf", "markdown", "html"])
doc = parser.parse("paper.pdf")

08 Capabilities

Everything you need for production document intelligence

Smart Chunking

Merges related chunks while preserving document structure and context boundaries.

Table Prioritization

Boost retrieval of table chunks for data-heavy queries.

Cross-Encoder Rerank

Uses state-of-the-art reranking for highly relevant results.

Agentic Fetching

LLM agent detects missing references and fetches tables automatically.

Complete Table Extraction

Reproduces full tables with all rows and columns preserved intact.

Source Attribution

Every answer includes source citations with file names and headings.

Full Document Review

Comprehensive document-level analysis with hierarchical section summarization.

Auto Mode Detection

Automatically detects when to use standard RAG vs full document review.

09 Agentic RAG

Intelligent table and figure fetching using LLM-powered agent

What is Agentic RAG?

Agentic RAG is an intelligent agent that analyzes your question and the retrieved context to detect missing references to tables or figures. When the text mentions "Table 3" but Table 3 isn't included in the retrieved chunks, the agent automatically fetches it from the vector database.

How It Works

1

Initial Retrieval

System retrieves relevant chunks using vector similarity search based on your question.

2

LLM Analysis

An LLM analyzes the retrieved context to find explicit references (e.g., "see Table 3") to tables/figures that are NOT included.

3

Smart Fetching or Boosting

The agent either fetches missing tables/figures from the database OR boosts existing ones that were already retrieved but marked important.

4

Re-insertion Protection

Agentic chunks are protected from being filtered out by the reranker, ensuring critical data is always included in the final answer.

Key Concepts

🤖

Agentic Fetch

Tables/figures that were NOT in the initial retrieval and had to be fetched from the database. Marked with is_agentic_fetch: true.

âŦ†ī¸

Agentic Boost

Tables/figures that WERE already retrieved but the LLM agent identified as critically important. Marked with agentic_boosted: true.

🔄

Re-insertion

If the reranker filters out agentic chunks, they're automatically re-inserted with a high score to ensure inclusion.

đŸŽ¯

Reference Detection

Detects explicit references like "Table 3", "see Figure 2", and contextual references like "the following table".

Example Output

Terminal
🔍 Question: What are the main findings from the ablation study?

📚 Step 1: Retrieving top 30 chunks...
   Retrieved 30 chunks

🧠 Step 1.5: Agentic analysis...
   📊 Referenced but missing: Table 3
   ✅ Added 1 agenticly-fetched tables/figures
   âŦ†ī¸  Boosted 1 existing tables/figures

đŸŽ¯ Step 2: Reranking to top 10 most relevant...
   🔄 Re-inserted agentic-boosted table 3 (excluded by reranker)

📝 Step 3: Generating answer...

When to Use Agentic RAG

✅ Enable When:

  • Questions likely to reference specific tables/figures
  • Analyzing research papers with data tables
  • Comparative analysis requiring multiple data sources
  • When accuracy is critical

⚡ Disable When:

  • Simple factual questions
  • General summaries
  • Speed is more important than completeness
  • Documents with no tables/figures

🧠 Autonomous Query Mode Selection

The agentic system also works together with autonomous query mode detection. When you set query_mode="auto" (the default), an LLM agent evaluates your question and automatically selects the optimal retrieval strategy:

  • Specific queries → Fast semantic search with agentic table/figure fetching
  • Document-level queries → Full-document review with hierarchical summarization

The agent provides explainable reasoning for its decision, included in the response metadata as agent_reasoning. This ensures you get both the right retrieval strategy and comprehensive table/figure coverage—all automatically.

11 Querying Tables and Figures

Best practices for retrieving tabular and visual data

📊
Why Tables Need Special Handling
Tables contain dense, structured data that's often critical for answering questions. However, they're often stored as separate chunks from the surrounding text, which can make them harder to retrieve with standard semantic search. This section explains how to effectively query tables and figures.

Understanding the Challenge

When you query a document, the system uses semantic search to find relevant chunks. However, tables pose unique challenges:

🔍

Semantic Mismatch

Table headers and data may not semantically match your question (e.g., asking "ablation results" but table contains numeric values).

đŸ“Ļ

Separate Chunks

Tables are often chunked separately from the text that references them, causing the reference to be retrieved but not the table.

âœ‚ī¸

Reranker Blindness

The reranker sees only a truncated preview of each chunk, so table data at the end gets missed. Use full-document review for comprehensive analysis.

đŸŽ¯

Solution: prioritize_tables

Put table data at the beginning of chunks and boost table relevance scores during retrieval.

Essential: prioritize_tables Parameter

For best results when asking about tables, ALWAYS use prioritize_tables=True.

This parameter does four things:

  1. Doubles retrieval count: Fetches 2× more chunks to increase table inclusion probability
  2. Boosts table scores: Multiplies table chunk relevance scores by 3.0×
  3. Ensures table representation: Guarantees at least 50% of top results are tables (minimum 3 tables)
  4. Re-sorts results: Moves tables higher in the ranking before reranking
🔍
Automatic Table-Focused Query Detection
The RAG system automatically detects table-focused queries and enables prioritize_tables. Detection keywords include: table, chart, graph, figure, data, result, finding, metric, performance, score, compare, ablation, baseline, experiment.
âš ī¸
Manual Override Recommended
While auto-detection works for most cases, explicitly setting prioritize_tables=True ensures best results for table queries, especially when using the library directly.

Example: Querying Tables

Python - Query with table prioritization
from rag_library import query_rag

# ✅ GOOD: Explicitly enable prioritize_tables
result = query_rag(
    question="What are the main findings from the ablation study?",
    prioritize_tables=True,  # Critical for table queries!
    retrieve_n=30,
    rerank_top_k=10,
    agentic=True,
    verbose=True
)

# ❌ BAD: Without prioritize_tables, tables may be missed
result = query_rag(
    question="What are the main findings from the ablation study?",
    # Missing prioritize_tables parameter!
    retrieve_n=30,
    rerank_top_k=10
)

CLI Usage for Table Queries

When using the tablemind CLI, always include the --prioritize-tables flag for table-related questions:

Terminal
# Query with table prioritization
tablemind "What are the main findings from the ablation study?" --prioritize-tables -v

# Table-only query (excludes non-table chunks)
tablemind "Compare the performance metrics" --tables-only --prioritize-tables

# High retrieval for comprehensive table search
tablemind "Show me Table 3" --prioritize-tables --retrieve-n 50 --top-k 10

Advanced: Tables-Only Queries

For questions that ONLY require table data (no surrounding text), use include_tables_only=True:

Python
from rag_library import RAGSystem, RAGConfig

config = RAGConfig()
rag = RAGSystem(config)

# Only search table chunks
result = rag.query(
    question="Compare all the performance metrics across tables",
    include_tables_only=True,
    prioritize_tables=True,
    retrieve_n=50,
    rerank_top_k=10
)

Combining with Agentic RAG

For comprehensive table queries, combine prioritize_tables with agentic=True:

✅
Recommended Settings for Table Queries
Use this combination for best results: prioritize_tables=True, agentic=True, retrieve_n=30 (or higher), rerank_top_k=10

Example Output: Ablation Study Query

Below is an example RAG query output for the question "What are the main results from the ablation study?" from the research paper "DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis" (arXiv:2508.20033):

RAG Query Output - Ablation Study Results
Figure: RAG query output showing ablation study results with table data extracted from arXiv:2508.20033. The system retrieved relevant chunks, performed agentic analysis to detect missing table references, and generated a comprehensive answer with proper source citations.

10 Query Modes

Agentic AI query classification for optimized retrieval strategies

What are Query Modes?

The RAG system uses an agentic AI approach to automatically classify your query and select the optimal retrieval strategy. An LLM agent evaluates your question's intent and context, then chooses between standard RAG (fast, focused) or full-document review (comprehensive). The agent provides reasoning for its decision, visible in response metadata.

Available Query Modes

đŸŽ¯

Specific

Standard RAG retrieval for focused questions about specific facts, data points, or concepts. Uses vector search with configurable retrieval limits.

📊

Table Reference

Auto-detected for queries mentioning tables, figures, or charts. Combines prioritized table retrieval with agentic fetching.

📄

Full Document Review

Retrieves entire document in sequential order, organizes by sections, and performs hierarchical summarization for comprehensive analysis.

🧠 Agentic Query Classification

How Agentic Detection Works

The system uses an LLM agent to intelligently evaluate query intent:

  1. User submits query with query_mode="auto" (default)
  2. LLM agent evaluates the query's intent and context
  3. Agent chooses strategy:
    • Standard RAG - Fast semantic retrieval for focused questions
    • Full Document Review - Comprehensive analysis for document-level queries
  4. Agent provides reasoning for its decision (visible in response metadata)
  5. Users can override by manually specifying query_mode

Benefits: No brittle keyword lists. Agent considers query context, not just surface-level patterns. Adapts automatically to new query types. Provides explainable decisions.

Full-Document Review Flow

1

Agentic Query Classification

An LLM agent analyzes your question's intent and context, then classifies it as "specific" or "full_review" with reasoning.

2

Sequential Document Retrieval

For full-review queries, all chunks are retrieved in document order (by page number and index), preserving the document's natural flow.

3

Section Organization

Chunks are grouped by document headings (Introduction, Methods, Results, etc.) to understand the paper's structure.

4

Hierarchical Summarization

Each section is summarized independently (2-3 sentences), then synthesized into a comprehensive document-level review.

Example: Agentic Query Detection

Python - Agentic query classification with reasoning
from rag_library import query_rag

# Query that triggers full-document review (agentic detection)
result = query_rag(
    question="How can I improve this paper's flow and clarity?",
    verbose=True
)

# Agent reasoning is included in response metadata
print(f"Query Mode: {result['query_mode']}")
print(f"Agent Reasoning: {result['agent_reasoning']}")

# Output:
# 🤖 Agent evaluating query type...
#    ✓ Decision: full_review
#    📝 Reasoning: Query asks for document-level assessment of flow and clarity
#
# 📄 Full-Document Review Mode
#    Question: How can I improve this paper's flow and clarity?
#
# 📚 Step 1: Retrieving all document chunks...
#    Retrieved 247 chunks from document
#
# 📋 Step 2: Organizing content by sections...
#    Found 8 sections
#
# 🔍 Step 3: Analyzing each section...
#
# 🤖 Step 4: Generating comprehensive review...
#
# Query Mode: full_review
# Agent Reasoning: Query asks for document-level assessment of flow and clarity

Manual Query Mode Selection

Python - Explicit query mode
from rag_library import RAGSystem, RAGConfig

rag = RAGSystem()

# Force full-document review mode
result = rag.query(
    question="What are the key findings?",
    query_mode="full_review",  # Override auto-detection
    verbose=True
)

# Force specific mode (standard RAG)
result = rag.query(
    question="Review the structure",
    query_mode="specific",  # Use standard RAG instead
    verbose=True
)

When to Use Each Mode

đŸŽ¯ Specific Mode

  • Factual questions ("What is X?")
  • Specific data point queries
  • Concept explanations
  • Targeted information retrieval
  • Fast, focused responses

📊 Table Reference Mode

  • Questions mentioning tables/figures
  • Data comparison queries
  • Performance metrics analysis
  • Experimental results
  • Ablation studies

📄 Full-Document Review

  • Writing quality assessment
  • Flow and structure analysis
  • Comprehensive critiques
  • Argument evaluation
  • Overall document assessment

CLI Usage

Terminal - Query mode selection
# Auto-detect query mode (default)
python -m rag_library "How can I improve the writing?" -v

# Force specific mode
python -m rag_library "What are the findings?" --query-mode specific -v

# Force full-document review
python -m rag_library "Summarize the results" --query-mode full_review -v

13 API Reference

Core classes and functions

docling_parser.py

Class/Function Description
DoclingParser Main parser class for PDF, Markdown, HTML, DOCX, and text documents
ParsedDocument Dataclass containing parsed document data
parse_document(file_path) Parse any supported format (auto-detected)
parse_pdf(file_path) Quick function to parse a single PDF
parse_markdown(file_path) Quick function to parse a Markdown file
parse_html(file_path) Quick function to parse an HTML file
parse_docx(file_path) Quick function to parse a DOCX file
parse_text_file(file_path) Quick function to parse a text file
parse_documents(directory) Parse all documents in directory (recursive)
get_llm_context(file_path) Get document context formatted for LLMs

rag_ingestion.py

Class/Function Description
RAGIngestor Main ingestion class for documents
VectorDBConfig Configuration for vector database
EmbeddingConfig Configuration for embedding models
ingest_document(file_path) Ingest a single document (any format)
ingest_pdf(file_path) Quick function to ingest a single PDF
ingest_documents(directory) Batch ingest all documents (recursive)
ingest_pdfs(directory) Batch ingest all PDFs in a directory

rag_library.py

Class/Function Description
RAGSystem Main RAG query system class with agentic AI capabilities
RAGConfig Configuration for RAG system (models, database, LLM settings)
query_rag(question, query_mode) Quick function to perform a RAG query with optional query mode
RAGSystem.query(question, query_mode) Perform query with full options including agentic classification
RAGSystem.retrieve() Retrieve chunks without LLM generation
RAGSystem.query_full_document_review() Perform hierarchical full-document analysis

🧠 Query Mode Parameter

The query_mode parameter controls query classification strategy:

Value Behavior
"auto" or None Agentic LLM evaluates query and selects strategy (default)
"specific" Force standard RAG (fast, focused retrieval)
"table_reference" Force table-prioritized retrieval with agentic fetching
"full_review" Force full-document hierarchical review

Response Metadata: When using query_mode="auto", the response includes agent_reasoning field explaining the agent's decision.

14 Configuration

Environment variables for customization

â„šī¸
About .env File Location
The .env file should be placed in your current working directory (where you run your scripts from). The library uses load_dotenv() which automatically loads environment variables from .env in the current directory. API keys (ANTHROPIC_API_KEY, OPENAI_API_KEY) are passed to the LLM service via environment variables.
.env
# Qdrant Configuration
QDRANT_COLLECTION_NAME=my_documents
QDRANT_PATH=./qdrant_db

# Model Configuration
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# LLM Configuration
LLM_PROVIDER=anthropic
LLM_MODEL=claude-sonnet-4-20250514
LLM_TEMPERATURE=0.7
LLM_MAX_TOKENS=2000

# API Keys
# Note: These keys are loaded from .env and passed to the LLM service
ANTHROPIC_API_KEY=your-api-key
OPENAI_API_KEY=your-api-key

# Chunking Configuration
MAX_TOKENS=4000
MERGE_PEERS=true
INCLUDE_STRUCTURE_CONTEXT=true

Chunking Configuration Options

These settings control how documents are split into chunks during ingestion:

MAX_TOKENS

Default: 4000

Maximum number of tokens per chunk. Larger chunks contain more context but may reduce retrieval precision. Smaller chunks provide more granular matching but may fragment related information.

  • 2000-3000: More chunks, better for specific questions
  • 4000 (default): Balanced approach
  • 6000-8000: Fewer chunks, better for broad summaries

MERGE_PEERS

Default: true

Whether to merge adjacent "peer" chunks (chunks from the same section/heading) that are small enough to fit together within MAX_TOKENS.

When enabled (true):

  • Related content stays together (better context)
  • Fewer, more coherent chunks
  • Reduced fragmentation of ideas

When disabled (false):

  • Maximum granularity
  • More chunks for precise matching
  • Sections split into smaller pieces

Recommendation: Keep true for most RAG use cases. Only disable if you need maximum chunk granularity for very specific queries.

INCLUDE_STRUCTURE_CONTEXT

Default: true

Whether to prepend document structure (headings, section paths) to each chunk. This uses Docling's contextualize() method.

With structure context (true):

Example Chunk
# 5.3 Understanding Opportunities for Improvement

## Ablation Study Results

The results show that our method achieved 95% accuracy...
  • LLMs understand document organization
  • Better semantic retrieval (headings add context)
  • Clearer source attribution

Without structure context (false):

Example Chunk
The results show that our method achieved 95% accuracy...
  • Smaller chunks (no redundant headings)
  • LLMs lose context about content location
  • Worse retrieval for generic text

Recommendation: Always keep true for RAG. The structure context significantly improves both retrieval relevance and LLM comprehension of source material.

How Environment Variables Are Loaded

The library loads configuration in the following order:

  1. rag_library.py calls load_dotenv() on import
  2. Environment variables are read from .env file in current working directory
  3. Variables not found in .env use default values
  4. Explicit parameters override .env values
Python - Override .env with parameters
from rag_library import RAGConfig, RAGSystem

# .env values are used by default
config = RAGConfig()

# Or override specific values
config = RAGConfig(
    collection_name="custom_collection",
    temperature=0.3
    # Other values come from .env
)

15 CLI Commands

Command-line interface for quick operations

tablemind - Query Documents

The main CLI for querying your document collection with RAG. Uses agentic AI to automatically evaluate query intent and select the optimal retrieval strategy (standard RAG vs. full-document review). The agent provides reasoning for its decisions.

Usage:
The CLI is built into rag_library.py. Use one of these methods:
  • python rag_library.py "your question"
  • python -m rag_library "your question"
  • Create an alias: alias tablemind='python /path/to/rag_library.py'
Terminal
# Run the CLI directly (uses agentic auto-detection by default)
python rag_library.py "What are the main findings?"

# Or as a module
python -m rag_library "What are the main findings?"

# Query with custom retrieval options
python rag_library.py "Compare table 3" --retrieve-n 50 --top-k 10

# Enable verbose output (shows agentic classification)
python rag_library.py "What datasets were used?" -v

# Force specific query mode
python rag_library.py "What are the findings?" --query-mode specific

# Force full-document review mode
python rag_library.py "Summarize the results" --query-mode full_review

# Table-only query with agentic fetching
python rag_library.py "What are the performance metrics?" --tables-only --prioritize-tables

# Query with all options
python rag_library.py "Analyze the results" \
  --retrieve-n 100 \
  --top-k 20 \
  --agentic \
  --prioritize-tables \
  --verbose

Command-Line Options

Option Default Description
-v, --verbose False Print detailed progress information
--retrieve-n N 20 Number of chunks to retrieve before reranking
--top-k K 5 Number of chunks to keep after reranking
--query-mode MODE auto Query mode: auto (agentic), specific, full_review
--agentic / --no-agentic True Enable/disable agentic table/figure fetching
--tables-only False Only search table chunks
--figures-only False Only search figure chunks
--prioritize-tables False Boost table chunks in retrieval results
--show-reasoning False Include LLM reasoning in response

Other CLI Commands

Terminal
# Run the example script
python ragexample.py

# Run specific examples
python ragexample.py --parse          # Parse PDF example
python ragexample.py --ingest         # Ingest documents example
python ragexample.py --query           # Query documents example
python ragexample.py "Your question"  # Custom query

# For custom query with options, edit ragexample.py or use rag_library.py directly:
python rag_library.py "Your question" --prioritize-tables -v

16 Web Interface

Interactive chat interface with dynamic document ingestion

What is the Web Interface?

The RAG system includes a Flask-based web interface that provides an interactive chat experience for querying your documents. It features real-time streaming responses, conversation history management, configurable retrieval options, and dynamic document ingestion.

Key Features

đŸ’Ŧ

Interactive Chat

Real-time streaming responses with markdown rendering and code highlighting

📤

Dynamic Upload

Upload and index documents while the server is running - no restart needed

👀

File Watching

Automatic reindexing when documents in your library folder change

âš™ī¸

Configurable Options

Search modes, table prioritization, agentic mode, and table content display

📊

Table Control

Toggle whether tables are shown in responses or just used for context

🔄

Conversation History

Full conversation context with automatic compaction to manage token limits

Starting the Web Interface

Terminal
# Start the web server
python web_app.py

# The interface will be available at:
# http://localhost:5005

Document Management

The web interface supports two ways to add documents:

📤 Upload via UI

  • Click "Upload Document" in the sidebar
  • Supported: PDF, MD, HTML, DOCX, TXT
  • Automatic parsing and indexing
  • Progress tracking for ingestion

👀 Auto-Watch Folder

  • Enable "File Watcher" to watch ./docs folder
  • Automatically detects new/modified/deleted files
  • Reindexes on changes with debouncing
  • Configurable check intervals

Table Content Control

The "Show Tables in Response" checkbox controls how tables are displayed in LLM responses:

Setting Behavior
✅ Checked (default) LLM reproduces full table markdown in response
☐ Unchecked LLM analyzes tables but doesn't reproduce them - more token efficient

Note: Tables are ALWAYS retrieved and considered by the LLM for accuracy. This setting only controls presentation in the response.

Web Interface Screenshot

Below is a screenshot of the RAG web interface showing document management, table control options, and agentic mode:

RAG Web Interface - Full Features
Figure: RAG web interface with: (1) Document upload and file watcher controls, (2) "Show Tables in Response" toggle, (3) Agentic mode indicators showing fetched tables, (4) Configurable search options including semantic/keyword/hybrid modes, table prioritization, and hybrid alpha slider, (5) Query modes, including specific, full-document review, and auto agentic detection.

Interface Options

🔍 Search Mode

  • Semantic: Vector-based conceptual search
  • Keyword: Exact term matching with BM25
  • Hybrid: Weighted combination of both

📊 Prioritize Tables

  • Boosts table chunks in retrieval (3x score)
  • Ensures at least 50% of results are tables
  • Ideal for data-heavy queries

🤖 Agentic Mode

  • Intelligently detects table/figure references
  • Fetches missing tables automatically
  • Boosts relevant existing content

đŸ–Ĩī¸ Show Tables

  • On: Full table markdown in response
  • Off: Tables analyzed but not shown
  • Saves tokens when unchecked

API Endpoints

HTTP Endpoints
# === Chat Endpoints ===

# Create a new conversation
POST /api/conversations
Response: {"conversation_id": "uuid"}

# Send a chat message
POST /api/chat
Body: {
  "conversation_id": "uuid",
  "message": "Your question",
  "search_mode": "semantic",       # semantic|keyword|hybrid
  "prioritize_tables": true,
  "agentic_mode": true,
  "include_table_content": true,  # NEW: show tables in response
  "stream": true
}

# Get conversation history
GET /api/conversations/{conversation_id}

# === Document Management Endpoints ===

# Upload a document
POST /api/documents/upload
Content-Type: multipart/form-data
Response: {"task_id": "uuid", "filename": "doc.pdf", "status": "pending"}

# Get ingestion status
GET /api/documents/status/{task_id}
Response: {"status": "completed", "chunks_indexed": 45, ...}

# List all documents
GET /api/documents
Response: {"documents": [{"name": "doc.pdf", "path": "...", ...}]}

# Delete a document
DELETE /api/documents/{path}

# Reindex all documents (clears and rebuilds vector database)
POST /api/documents/reindex
Response: {
  "status": "started",
  "message": "Reindexing started in background",
  "note": "Use /api/documents/status/ to track progress"
}

# === File Watcher Endpoints ===

# Get watcher status (includes last event)
GET /api/watcher
Response: {
  "enabled": true,
  "folder": "./docs",
  "pattern": "*",  # Watches all files, filters by extension
  "last_event": {  # Most recent file change event
    "type": "added",
    "data": {"file_path": "doc.pdf", "chunks_indexed": 45},
    "timestamp": "2026-01-03T12:34:56"
  }
}

# Start file watcher
POST /api/watcher/start
Body: {"check_interval": 10.0, "debounce_interval": 2.0}
Response: {"status": "started", "enabled": true, "folder": "./docs"}

# Stop file watcher
POST /api/watcher/stop

# Note: File watcher supports PDF, MD, HTML, DOCX, TXT
# Unsupported extensions are automatically skipped

# === Configuration ===

# Get current configuration
GET /api/config
Response: {
  "embedding_model": "...",
  "llm_provider": "anthropic",
  "llm_model": "claude-sonnet-4-5",
  "collection_name": "my_documents",
  "available_search_modes": ["semantic", "keyword", "hybrid"]
}

17 Complete Example

End-to-end pipeline in a few lines of code

Example 1: PDF-Only Pipeline

Python
from rag_ingestion import ingest_pdfs
from rag_library import query_rag

# Step 1: Ingest all PDFs from a directory
print("Ingesting documents...")
results = ingest_pdfs(
    directory="./pdfs",
    collection_name="research_papers",
    pattern="*.pdf"
)

for r in results:
    print(f"  {r['file_path']}: {r['status']} ({r['chunks_indexed']} chunks)")

# Step 2: Query the documents
print("\nQuerying documents...")
result = query_rag(
    "What are the main findings across all papers?",
    collection_name="research_papers",
    db_path="./qdrant_db",
    retrieve_n=50,
    rerank_top_k=10,
    agentic=True
)

print(f"\nAnswer:\n{result['answer']}")

# Step 3: Show sources
print(f"\nSources ({len(result['sources'])}):")
for i, s in enumerate(result['sources'], 1):
    marker = "📊" if s.get('is_table') else "📄"
    print(f"  {i}. {marker} {s.get('file_name')} - {s.get('heading')})

Example 2: Multi-Format Directory Pipeline

Parse and ingest a directory containing PDFs, Markdown, HTML, DOCX, and text files. The system automatically detects formats and handles each appropriately.

Python
from pathlib import Path
from rag_ingestion import RAGIngestor, VectorDBConfig
from rag_library import query_rag

# Step 1: Set up ingestor with multi-format support
db_config = VectorDBConfig(
    collection_name="multi_format_docs",
    path="./qdrant_db"
)
ingestor = RAGIngestor(db_config=db_config)

# Step 2: Define directory with mixed formats
docs_dir = Path("./documents")

# Step 3: Ingest all supported formats
print("Ingesting multi-format documents...")

# Supported extensions: .pdf, .md, .markdown, .html, .htm, .docx, .txt
supported_extensions = [".pdf", ".md", ".html", ".docx", ".txt"]

results = []
for ext in supported_extensions:
    for file_path in docs_dir.glob(f"*{ext}"):
        try:
            result = ingestor.ingest_file(file_path)
            results.append(result)
            print(f"  ✓ {file_path.name}: {result['status']} ({result['chunks_indexed']} chunks, {result['num_tables']} tables)")
        except Exception as e:
            print(f"  ✗ {file_path.name}: {e}")

# Step 4: Query across all formats
print("\nQuerying multi-format collection...")
result = query_rag(
    "What are the main findings across all documents?",
    collection_name="multi_format_docs",
    db_path="./qdrant_db",
    retrieve_n=30,
    rerank_top_k=10,
    agentic=True,
    verbose=True
)

print(f"\nAnswer:\n{result['answer']}")

# Step 5: Show sources with format indicators
print(f"\nSources ({len(result['sources'])}):")
for i, s in enumerate(result['sources'], 1):
    # Determine icon based on content type
    if s.get('is_table'):
        icon = "📊"
    elif s.get('is_figure'):
        icon = "đŸ–ŧī¸"
    else:
        icon = "📄"

    # Get file extension for format indicator
    file_name = s.get('file_name', "Unknown")
    ext = Path(file_name).suffix.upper() if '.' in file_name else "TXT"

    print(f"  {i}. {icon} [{ext}] {file_name} - {s.get('heading', 'N/A')})

Example 3: Using ragexample.py for Multi-Format

The bundled ragexample.py script also supports multi-format ingestion:

Terminal
# Put documents (PDF, MD, HTML, DOCX, TXT) in ./documents folder
# Then run:

# Ingest all supported formats
python ragexample.py --ingest

# Query with agentic AI (auto-detects optimal strategy)
python ragexample.py "What are the main findings?"

# The agent evaluates query intent and selects appropriate mode