Overview
Tablemind is a Python library for building retrieval-augmented generation (RAG) systems that correctly handle tables, figures, and document-level queries in research papers, technical reports, and business documents.
Problem it Solves
Standard RAG systems chunk documents to work within LLM context limits. This works for prose but fails for:
- Tables — get fragmented, headers separated, structure lost
- Cross-references — text mentions "Table 3" but retrieval can't connect it
- Document-level questions — requires synthesis across entire paper
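The table failure mode above is easy to reproduce. The following is a minimal sketch, not Tablemind code: `naive_chunks` is a hypothetical fixed-size splitter and the table values are made up for illustration. It shows that only the first chunk keeps the header row.

```python
def naive_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size character windows, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative table; the numbers are invented for this sketch.
table = (
    "| Model | Precision | Recall | F1 |\n"
    "|---|---|---|---|\n"
    "| A | 0.91 | 0.88 | 0.89 |\n"
    "| B | 0.85 | 0.90 | 0.87 |\n"
)

chunks = naive_chunks(table, size=60)
header = "| Model | Precision | Recall | F1 |"

# Chunks that still carry the header row; the rest hold bare values
# with no column names attached.
chunks_with_header = [c for c in chunks if header in c]
```

A retriever that returns only the second chunk would hand the LLM numbers without any column names, which is the fragmentation described above.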
Solution
Treat tables as first-class citizens — preserve complete structure, detect references, and intelligently route between chunk-based retrieval and full document review.
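Reference detection can be approximated with a regular expression. This is a minimal sketch under the assumption that references look like "Table 3" or "Fig. 2"; `REF_PATTERN` and `find_references` are hypothetical names, and the library's own detection may work differently.

```python
import re

# Matches "Table 3", "Figure 2", "Fig. 2" (case-insensitive).
REF_PATTERN = re.compile(r"\b(Table|Figure|Fig\.)\s+(\d+)", re.IGNORECASE)


def find_references(text: str) -> list[tuple[str, int]]:
    """Return unique (kind, number) references in order of first appearance."""
    refs: list[tuple[str, int]] = []
    for kind, number in REF_PATTERN.findall(text):
        label = "table" if kind.lower().startswith("tab") else "figure"
        ref = (label, int(number))
        if ref not in refs:
            refs.append(ref)
    return refs
```

Once a chunk is known to mention a table, the pipeline can fetch that table's complete chunk alongside the retrieved text.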
Quick Start
Installation
pip install docling sentence-transformers qdrant-client fastapi uvicorn
Basic Usage
from rag_ingestion import RAGIngestor
from rag_library import RAGLibrary
# Initialize
ingestor = RAGIngestor(collection_name="my_docs")
rag = RAGLibrary(ingestor=ingestor)
# Ingest a document
result = ingestor.ingest_document("paper.pdf")
print(f"Ingested {result['chunks_indexed']} chunks")
# Query
response = rag.query("What's the F1 score in Table 3?",
search_mode="semantic",
table_priority=True)
print(response['answer'])
# Reindex all documents
result = ingestor.reindex_collection("./docs")
print(f"Reindexed {result['indexed_count']} documents")
Core Modules
docling_parser.py — Document Parsing
Extracts structure from PDF, Markdown, HTML, DOCX, and text files.
from docling_parser import DoclingParser
parser = DoclingParser()
parsed = parser.parse("document.pdf")
# Access parsed data
print(f"Sections: {list(parsed.sections.keys())}")
print(f"Tables: {len(parsed.tables)}")
print(f"Figures: {len(parsed.figures)}")
# Get table as markdown
table_md = parsed.tables[0]['markdown']
Returns:
- sections: Hierarchical section structure
- tables: Complete markdown representation with row/column counts
- figures: Captions and descriptions
- full_text: Plain text extraction
- markdown: Full markdown representation
rag_ingestion.py — Vector Database Ingestion
Manages document ingestion into Qdrant vector database with intelligent chunking.
from rag_ingestion import RAGIngestor
ingestor = RAGIngestor(
collection_name="my_documents",
embedding_model="nomic-ai/nomic-embed-text-v1.5",
qdrant_path="./qdrant_db"
)
# Ingest single file
result = ingestor.ingest_document("paper.pdf")
# Returns: {"status": "success", "doc_id": "sha256_hash", "chunks_indexed": 42}
# Batch ingest directory
results = ingestor.ingest_directory("./docs", pattern="*.pdf")
# Delete document
ingestor.delete_document(doc_id="sha256_hash")
Key features:
- SHA256-based stable document IDs (unchanged files skip reindexing)
- Docling HybridChunker (merges related content, respects section boundaries)
- Configurable chunk sizes and merge behavior
- Metadata-rich chunks (table/figure flags, headings, captions)
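The idea behind content-based document IDs can be sketched with the standard library. This assumes the ID is a SHA256 digest of the file's bytes; `content_doc_id` and `needs_reindex` are hypothetical helpers for illustration, not functions exported by rag_ingestion.py.

```python
import hashlib
from pathlib import Path


def content_doc_id(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the file contents in blocks so large PDFs are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_reindex(path: Path, indexed_ids: set[str]) -> bool:
    """A file needs indexing only if its content hash is not already known."""
    return content_doc_id(path) not in indexed_ids
```

Because the ID depends only on content, renaming or touching a file does not change it, while any edit to the bytes produces a new ID.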
rag_library.py — Query System
The main RAG pipeline with multiple query modes and retrieval strategies.
from rag_library import RAGLibrary
rag = RAGLibrary(ingestor=ingestor)
# Standard RAG query (chunk-based, fast)
response = rag.query(
query="What is the accuracy of Model A?",
search_mode="semantic", # or "keyword", "hybrid"
table_priority=True,
agentic_references=True
)
# Full document review (for broad questions)
response = rag.query(
query="Does the paper's narrative flow logically?",
query_mode="full_review"
)
# Auto mode (LLM chooses appropriate mode)
response = rag.query(
query="Compare all approaches in the paper",
query_mode="auto"
)
# Access results
print(response['answer'])
for source in response['sources']:
    print(f"- {source['file_name']} ({source['chunk_type']})")
Query modes:
- standard: Chunk-based retrieval (fast, within context limits)
- full_review: Hierarchical section summarization (slower, comprehensive)
- auto: LLM analyzes query and selects appropriate mode
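Hierarchical section summarization can be pictured as a two-level pass: summarize each section, then summarize the summaries. The sketch below is an assumption about the general shape, not the library's implementation; `hierarchical_review` and `first_sentence` are hypothetical names, and `first_sentence` stands in for an LLM call.

```python
from typing import Callable


def hierarchical_review(
    sections: dict[str, str],
    summarize: Callable[[str], str],
) -> str:
    """Summarize each section, then summarize the combined section summaries."""
    section_summaries = [
        f"{title}: {summarize(body)}" for title, body in sections.items()
    ]
    return summarize("\n".join(section_summaries))


def first_sentence(text: str) -> str:
    """Stand-in summarizer: keep the text up to the first period."""
    head, _, _ = text.partition(".")
    return head.strip() + "."
```

The cost profile follows from the structure: one summarizer call per section plus one final call, which is why this mode is slower than chunk-based retrieval.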
Search modes:
- semantic: Vector embeddings (conceptual queries)
- keyword: BM25 (exact terms, model names, metrics)
- hybrid: Combined with configurable weights
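One common way to combine the two signals is a weighted sum of normalized scores. The sketch below assumes min-max normalization and a single `semantic_weight` parameter; both are illustrative choices, and the fusion that Tablemind actually applies is not specified here.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and BM25 scores are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 1.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


def hybrid_scores(
    semantic: dict[str, float],
    keyword: dict[str, float],
    semantic_weight: float = 0.7,  # hypothetical default for this sketch
) -> dict[str, float]:
    """Weighted sum of normalized scores; a missing score counts as 0."""
    sem, kw = min_max(semantic), min_max(keyword)
    keyword_weight = 1.0 - semantic_weight
    return {
        chunk_id: semantic_weight * sem.get(chunk_id, 0.0)
        + keyword_weight * kw.get(chunk_id, 0.0)
        for chunk_id in sem.keys() | kw.keys()
    }
```

Normalizing first matters because BM25 scores are unbounded while cosine similarities are not, so a raw sum would let the keyword score dominate.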
web_app.py — Web Interface
Flask-based web server providing REST API and interactive chat interface for document querying.
# Set environment variables
export LLM_PROVIDER=gemini
export LLM_MODEL=gemini-2.5-pro
export PDF_DIRECTORY=./docs
# Start server
python web_app.py
# Server runs on http://localhost:5005
Key features:
- Real-time streaming responses with SSE
- Conversation history with memory compaction
- Dynamic document ingestion (upload via web UI)
- File watcher for auto-ingestion on file changes
- Configurable search mode (semantic/keyword/hybrid/auto)
- Table prioritization and agentic reference fetching
- Query mode selection (standard RAG or full document review)
- Dynamic LLM provider switching without restart
Web API Endpoints
Starting the server:
export LLM_PROVIDER=gemini
export LLM_MODEL=gemini-2.5-pro
export PDF_DIRECTORY=./docs
python web_app.py
# Server runs on http://localhost:5005
Query / Chat
{
"query": "What's the F1 score in Table 3?",
"conversation_id": "optional-conv-id",
"search_mode": "semantic",
"query_mode": "auto",
"table_priority": true,
"agentic_references": true
}
Document Management
curl -X POST http://localhost:5005/api/documents/upload \
-F "file=@document.pdf"
curl -X POST http://localhost:5005/api/documents/reindex
Clears the vector database collection and re-ingests all documents from the docs folder. Runs in background. Use the status endpoint with the returned task_id to track progress.
File Watcher
{
"check_interval": 10.0,
"debounce_interval": 2.0
}
Configuration
Conversations
Configuration
Place the .env file in your current working directory (the directory you run your scripts from). The library calls load_dotenv(), which loads environment variables from .env in that directory.
Environment Variables
# Qdrant Configuration
QDRANT_COLLECTION_NAME=my_documents
QDRANT_PATH=./qdrant_db
# Model Configuration
EMBEDDING_MODEL=nomic-ai/nomic-embed-text-v1.5
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
# LLM Configuration
LLM_PROVIDER=gemini # anthropic, openai, ollama
LLM_MODEL=gemini-2.5-pro
LLM_TEMPERATURE=0.3
LLM_MAX_TOKENS=8192
# API Keys
# Note: These keys are loaded from .env and passed to the LLM service
ANTHROPIC_API_KEY=your-api-key
OPENAI_API_KEY=your-api-key
# Chunking Configuration
MAX_TOKENS=4000
MERGE_PEERS=true
INCLUDE_STRUCTURE_CONTEXT=true
# Retrieval
RETRIEVE_N=100
RERANK_TOP_K=6
QUERY_EXPANSION=false
# Documents
PDF_DIRECTORY=./docs
Chunking Configuration Options
These settings control how documents are split into chunks during ingestion:
MAX_TOKENS
Default: 4000
Maximum number of tokens per chunk. Larger chunks contain more context but may reduce retrieval precision. Smaller chunks provide more granular matching but may fragment related information.
- 2000-3000: More chunks, better for specific questions
- 4000 (default): Balanced approach
- 6000-8000: Fewer chunks, better for broad summaries
MERGE_PEERS
Default: true
Whether to merge adjacent "peer" chunks (chunks from the same section/heading) that are small enough to fit together within MAX_TOKENS.
When enabled (true):
- Related content stays together (better context)
- Fewer, more coherent chunks
- Reduced fragmentation of ideas
When disabled (false):
- Maximum granularity
- More chunks for precise matching
- Sections split into smaller pieces
Recommendation: Keep true for most RAG use cases. Only disable if you need maximum chunk granularity for very specific queries.
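The peer-merging behavior described above can be sketched as a greedy pass over adjacent chunks. This is an illustration only: `Chunk`, `count_tokens`, and `merge_peers` are hypothetical names, the whitespace tokenizer is a crude stand-in, and Docling's HybridChunker may apply different rules.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str
    text: str


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real pipeline would use the model tokenizer."""
    return len(text.split())


def merge_peers(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    """Merge adjacent chunks that share a heading while they fit in max_tokens."""
    merged: list[Chunk] = []
    for chunk in chunks:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.heading == chunk.heading
            and count_tokens(last.text) + count_tokens(chunk.text) <= max_tokens
        ):
            merged[-1] = Chunk(last.heading, last.text + "\n" + chunk.text)
        else:
            merged.append(Chunk(chunk.heading, chunk.text))
    return merged
```

Raising `max_tokens` lets more peers merge into fewer, larger chunks, which mirrors the trade-off between the 2000-3000 and 6000-8000 settings listed earlier.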
INCLUDE_STRUCTURE_CONTEXT
Default: true
Whether to prepend document structure (headings, section paths) to each chunk. This uses Docling's contextualize() method.
With structure context (true):
# 5.3 Understanding Opportunities for Improvement
## Ablation Study Results
The results show that our method achieved 95% accuracy...
- LLMs understand document organization
- Better semantic retrieval (headings add context)
- Clearer source attribution
Without structure context (false):
The results show that our method achieved 95% accuracy...
- Smaller chunks (no redundant headings)
- LLMs lose context about content location
- Worse retrieval for generic text
Recommendation: Always keep true for RAG. The structure context significantly improves both retrieval relevance and LLM comprehension of source material.
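The effect of structure context can be imitated in a few lines. This sketch reproduces the output shape shown above and is not Docling's contextualize() implementation; `contextualize_chunk` is a hypothetical helper.

```python
def contextualize_chunk(heading_path: list[str], body: str) -> str:
    """Prepend the heading path as markdown headings, outermost first."""
    lines = [
        f"{'#' * level} {title}"
        for level, title in enumerate(heading_path, start=1)
    ]
    return "\n".join(lines + [body])


chunk_text = contextualize_chunk(
    ["5.3 Understanding Opportunities for Improvement", "Ablation Study Results"],
    "The results show that our method achieved 95% accuracy...",
)
```

Since the headings become part of the embedded text, a query such as "ablation results" can match this chunk even though the body never uses the word "ablation".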
How Environment Variables Are Loaded
The library loads configuration in the following order:
- rag_library.py calls load_dotenv() on import
- Environment variables are read from the .env file in the current working directory
- Variables not found in .env use default values
- Explicit parameters override .env values
from rag_library import RAGConfig, RAGSystem
# .env values are used by default
config = RAGConfig()
# Or override specific values
config = RAGConfig(
collection_name="custom_collection",
temperature=0.3
# Other values come from .env
)
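The precedence (explicit parameter, then environment, then built-in default) can be sketched with a dataclass whose defaults are read from the environment at construction time. `SketchConfig` is a hypothetical stand-in, not the library's RAGConfig, and the fallback collection name "documents" is an assumption made for this example.

```python
import os
from dataclasses import dataclass, field


@dataclass
class SketchConfig:
    # Each default is resolved when the object is created, so values loaded
    # into the environment (for example by load_dotenv) are picked up.
    collection_name: str = field(
        default_factory=lambda: os.getenv("QDRANT_COLLECTION_NAME", "documents")
    )
    temperature: float = field(
        default_factory=lambda: float(os.getenv("LLM_TEMPERATURE", "0.3"))
    )
    max_tokens: int = field(
        default_factory=lambda: int(os.getenv("MAX_TOKENS", "4000"))
    )


# An explicit argument wins over both the environment and the default.
config = SketchConfig(collection_name="custom_collection")
```

Environment values arrive as strings, so numeric settings are cast explicitly; a malformed value such as MAX_TOKENS=abc would raise ValueError at construction.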
The web app supports dynamic LLM provider changes without restart:
# Update environment
export LLM_PROVIDER=anthropic
export LLM_MODEL=claude-sonnet-4-5
# Reload via API
curl -X POST http://localhost:5005/api/config/reload
Advanced Usage
File Watcher for Auto-Ingestion
from rag_ingestion import RAGIngestor
ingestor = RAGIngestor(collection_name="docs")
watcher = ingestor.create_file_watcher(
directory="./docs",
pattern="*.pdf",
check_interval=10.0,
debounce_interval=2.0,
callback=lambda event_type, data: print(f"{event_type}: {data}"),
autostart=True
)
# Watcher runs in background, auto-ingests new/modified files
watcher.stop()
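The role of debounce_interval can be illustrated with a small helper that releases a file only after it has stopped changing. This is a sketch of the general technique, assuming the watcher polls every check_interval seconds; `Debouncer` is a hypothetical class and the library's watcher may be built differently.

```python
class Debouncer:
    """Hold changed paths until they have been quiet for debounce_interval."""

    def __init__(self, debounce_interval: float) -> None:
        self.debounce_interval = debounce_interval
        self._pending: dict[str, float] = {}

    def record_change(self, path: str, now: float) -> None:
        # A new change resets the quiet period for that path.
        self._pending[path] = now

    def ready(self, now: float) -> list[str]:
        """Return and clear the paths whose last change is old enough."""
        due = [
            path
            for path, seen in self._pending.items()
            if now - seen >= self.debounce_interval
        ]
        for path in due:
            del self._pending[path]
        return sorted(due)
```

Timestamps are passed in rather than read from the clock, so the polling loop can call `ready(time.monotonic())` on each check while the logic stays easy to exercise in isolation. Debouncing keeps a file that is still being written from being ingested half-finished.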
Direct Library Usage (No Vector DB)
For large-context LLMs, skip vector retrieval:
from docling_parser import DoclingParser
from rag_library import RAGLibrary
parser = DoclingParser()
parsed = parser.parse("paper.pdf")
rag = RAGLibrary(ingestor=None) # No ingestor needed
response = rag.query_with_full_context(
query="Summarize the methodology",
parsed_document=parsed,
selected_tables=[0, 2]
)
CLI Commands
tablemind — Query Documents
The main CLI for querying your document collection with RAG. Uses agentic AI to automatically evaluate query intent and select the optimal retrieval strategy (standard RAG vs. full-document review).
The CLI is built into rag_library.py. Use one of these methods:
- python rag_library.py "your question"
- python -m rag_library "your question"
- Create an alias: alias tablemind='python /path/to/rag_library.py'
# Run the CLI directly (uses agentic auto-detection by default)
python rag_library.py "What are the main findings?"
# Or as a module
python -m rag_library "What are the main findings?"
# Query with custom retrieval options
python rag_library.py "Compare table 3" --retrieve-n 50 --top-k 10
# Enable verbose output (shows agentic classification)
python rag_library.py "What datasets were used?" -v
# Force specific query mode
python rag_library.py "What are the findings?" --query-mode specific
# Force full-document review mode
python rag_library.py "Summarize the results" --query-mode full_review
# Table-only query with agentic fetching
python rag_library.py "What are the performance metrics?" --tables-only --prioritize-tables
# Query with all options
python rag_library.py "Analyze the results" \
--retrieve-n 100 \
--top-k 20 \
--agentic \
--prioritize-tables \
--verbose
Command-Line Options
| Option | Default | Description |
|---|---|---|
| -v, --verbose | False | Print detailed progress information |
| --retrieve-n N | 20 | Number of chunks to retrieve before reranking |
| --top-k K | 5 | Number of chunks to keep after reranking |
| --query-mode MODE | auto | Query mode: auto (agentic), specific, full_review |
| --agentic / --no-agentic | True | Enable/disable agentic table/figure fetching |
| --tables-only | False | Only search table chunks |
| --figures-only | False | Only search figure chunks |
| --prioritize-tables | False | Boost table chunks in retrieval results |
| --show-reasoning | False | Include LLM reasoning in response |
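The interaction between --retrieve-n, --top-k, and --prioritize-tables can be sketched as a two-stage ranking. This is an assumption about the general shape, not the library's code: `retrieve_then_rerank` is a hypothetical function, the additive `table_boost` is one possible way to favor tables, and `rerank_score` stands in for a cross-encoder.

```python
from typing import Callable


def retrieve_then_rerank(
    candidates: list[dict],
    rerank_score: Callable[[dict], float],
    retrieve_n: int = 20,
    top_k: int = 5,
    table_boost: float = 0.0,
) -> list[dict]:
    """Keep the retrieve_n best first-stage hits, rerank them, return top_k."""
    # Stage 1: cheap first-stage score (for example vector similarity).
    shortlist = sorted(candidates, key=lambda c: c["score"], reverse=True)
    shortlist = shortlist[:retrieve_n]

    # Stage 2: more expensive reranker score, plus an optional table bonus.
    def final_score(chunk: dict) -> float:
        boost = table_boost if chunk.get("is_table") else 0.0
        return rerank_score(chunk) + boost

    return sorted(shortlist, key=final_score, reverse=True)[:top_k]
```

A larger retrieve_n gives the reranker more candidates to rescue at the cost of more reranker calls, while top_k bounds how much text reaches the LLM.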
Other CLI Commands
# Run the example script
python ragexample.py
# Run specific examples
python ragexample.py --parse # Parse PDF example
python ragexample.py --ingest # Ingest documents example
python ragexample.py --query # Query documents example
python ragexample.py "Your question" # Custom query
# For custom query with options, edit ragexample.py or use rag_library.py directly:
python rag_library.py "Your question" --prioritize-tables -v
Performance Considerations
| Mode | Speed | Token Usage | Best For |
|---|---|---|---|
| Standard RAG | Fast | Low | Specific facts, table lookups |
| Full Review | Slow | High | Document synthesis, structure analysis |
| Auto | Variable | Variable | Mixed workloads |
Recommendations
- Use standard mode for most queries (default via auto)
- Enable table_priority for table-heavy documents
- Use full_review only when needed (auto mode handles this)
- Set MAX_TOKENS=4000 for better table preservation
- Enable MERGE_PEERS=true for better context coherence
Limitations
- Large tables: 20+ columns can be challenging for LLMs
- Cross-document references: Not currently supported
- OCR quality: Scanned PDFs may have table structure errors
- Full review cost: More expensive than chunk-based retrieval
API Quick Reference
docling_parser.py
| Class/Function | Description |
|---|---|
| DoclingParser | Main parser class for PDF, Markdown, HTML, DOCX, and text documents |
| ParsedDocument | Dataclass containing parsed document data |
| parse_document(file_path) | Parse any supported format (auto-detected) |
| parse_pdf(file_path) | Quick function to parse a single PDF |
| parse_markdown(file_path) | Quick function to parse a Markdown file |
| parse_html(file_path) | Quick function to parse an HTML file |
| parse_docx(file_path) | Quick function to parse a DOCX file |
| parse_text_file(file_path) | Quick function to parse a text file |
| parse_documents(directory) | Parse all documents in directory (recursive) |
rag_ingestion.py
| Class/Function | Description |
|---|---|
| RAGIngestor | Main ingestion class for documents |
| VectorDBConfig | Configuration for vector database |
| EmbeddingConfig | Configuration for embedding models |
| ingest_document(file_path) | Ingest a single document (any format) |
| ingest_directory(directory) | Batch ingest all documents (recursive) |
| delete_document(doc_id) | Delete a document from the collection |
| get_collection_stats() | Get collection statistics |
rag_library.py
| Class/Function | Description |
|---|---|
| RAGLibrary | Main RAG query system class |
| LLMService | Base class for LLM providers (Anthropic, OpenAI, Gemini, Ollama) |
| RAGLibrary.query(question, search_mode) | Perform query with retrieval and LLM generation |
| RAGLibrary.query_full_review(question) | Perform hierarchical full-document analysis |
Query Mode Parameter
| Value | Behavior |
|---|---|
| "auto" | LLM evaluates query and selects strategy (default) |
| "standard" | Force standard RAG (fast, focused retrieval) |
| "full_review" | Force full-document hierarchical review |
Examples
Example 1: PDF-Only Pipeline
from rag_ingestion import ingest_pdfs
from rag_library import query_rag
# Step 1: Ingest all PDFs from a directory
print("Ingesting documents...")
results = ingest_pdfs(
directory="./pdfs",
collection_name="research_papers",
pattern="*.pdf"
)
for r in results:
    print(f" {r['file_path']}: {r['status']} ({r['chunks_indexed']} chunks)")
# Step 2: Query the documents
print("\nQuerying documents...")
result = query_rag(
"What are the main findings across all papers?",
collection_name="research_papers",
db_path="./qdrant_db",
retrieve_n=50,
rerank_top_k=10,
agentic=True
)
print(f"\nAnswer:\n{result['answer']}")
# Step 3: Show sources
print(f"\nSources ({len(result['sources'])}):")
for i, s in enumerate(result['sources'], 1):
    marker = "📊" if s.get('is_table') else "📄"
    print(f" {i}. {marker} {s.get('file_name')} - {s.get('heading')}")
Example 2: Multi-Format Directory Pipeline
Parse and ingest a directory containing PDFs, Markdown, HTML, DOCX, and text files. The system automatically detects formats and handles each appropriately.
from pathlib import Path
from rag_ingestion import RAGIngestor, VectorDBConfig
from rag_library import query_rag
# Step 1: Set up ingestor with multi-format support
db_config = VectorDBConfig(
collection_name="multi_format_docs",
path="./qdrant_db"
)
ingestor = RAGIngestor(db_config=db_config)
# Step 2: Define directory with mixed formats
docs_dir = Path("./documents")
# Step 3: Ingest all supported formats
print("Ingesting multi-format documents...")
# Supported extensions: .pdf, .md, .markdown, .html, .htm, .docx, .txt
supported_extensions = [".pdf", ".md", ".markdown", ".html", ".htm", ".docx", ".txt"]
results = []
for ext in supported_extensions:
    for file_path in docs_dir.glob(f"*{ext}"):
        try:
            # ingest_document is the single-file method listed in the API Quick Reference
            result = ingestor.ingest_document(file_path)
            results.append(result)
            print(f" ✓ {file_path.name}: {result['status']} ({result['chunks_indexed']} chunks, {result['num_tables']} tables)")
        except Exception as e:
            # Report the failure and continue with the remaining files
            print(f" ✗ {file_path.name}: {e}")
# Step 4: Query across all formats
print("\nQuerying multi-format collection...")
result = query_rag(
"What are the main findings across all documents?",
collection_name="multi_format_docs",
db_path="./qdrant_db",
retrieve_n=30,
rerank_top_k=10,
agentic=True,
verbose=True
)
print(f"\nAnswer:\n{result['answer']}")
# Step 5: Show sources with format indicators
print(f"\nSources ({len(result['sources'])}):")
for i, s in enumerate(result['sources'], 1):
    # Determine icon based on content type
    if s.get('is_table'):
        icon = "📊"
    elif s.get('is_figure'):
        icon = "🖼️"
    else:
        icon = "📄"
    # Get file extension (without the leading dot) for the format indicator
    file_name = s.get('file_name', "Unknown")
    ext = Path(file_name).suffix.lstrip('.').upper() or "TXT"
    print(f" {i}. {icon} [{ext}] {file_name} - {s.get('heading', 'N/A')}")
Example 3: Using ragexample.py for Multi-Format
The bundled ragexample.py script also supports multi-format ingestion:
# Put documents (PDF, MD, HTML, DOCX, TXT) in ./documents folder
# Then run:
# Ingest all supported formats
python ragexample.py --ingest
# Query with agentic AI (auto-detects optimal strategy)
python ragexample.py "What are the main findings?"
# The agent evaluates query intent and selects appropriate mode