What is llms.txt?
Just as robots.txt provides instructions for web crawlers, llms.txt is a proposed standard specifically designed to help Large Language Models (LLMs) discover and consume a website's core content without processing visual noise (ads, navigation, scripts).
This tool automates the creation of this file according to the specifications outlined at llmstxt.org.
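For reference, the format defined at llmstxt.org is itself plain Markdown: an H1 title, a blockquote summary, and H2 sections containing link lists. A minimal example (the site name, URLs, and descriptions here are invented for illustration):

```markdown
# Example Docs

> Example Docs is the documentation site for the Example project.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): Install and run in five minutes
- [API reference](https://example.com/docs/api.md): Full endpoint reference

## Optional

- [Blog](https://example.com/blog.md): Release notes and announcements
```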
1. Core Capabilities
The generate_corpus.py script is an autonomous agent designed to transform a dynamic website into a static knowledge base suitable for RAG (Retrieval Augmented Generation) pipelines and AI agents.
Capability A: The "Shadow Corpus" Strategy
The tool creates a local mirror of your content in Markdown format. This strips away HTML overhead, reducing token usage by up to 95% while preserving semantic structure (headers, lists, tables).
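As a rough sketch of the kind of conversion involved (not necessarily the library or settings generate_corpus.py actually uses), an html2text-based step looks like this:

```python
# Sketch only: the real script may use a different converter or options.
import html2text

def html_to_markdown(html: str) -> str:
    """Convert a page's raw HTML into lean Markdown, dropping visual noise."""
    converter = html2text.HTML2Text()
    converter.ignore_images = True  # images add tokens without adding answers
    converter.body_width = 0        # disable hard line wrapping
    return converter.handle(html)
```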
Capability B: Incremental Intelligence (New in v2.2)
Unlike basic crawlers that reset every session, this tool maintains persistent state in crawl_state.json; a sketch of the resume logic follows the list below.
- Smart Resuming: It remembers pages scanned in previous runs. If you add 5 new blog posts, the next run will detect them and simply append them to the existing index without losing historical data.
- Efficiency: It skips redundant processing, making daily or weekly updates of large documentation sites practical.
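A minimal sketch of how such a resume mechanism can work; the real crawl_state.json schema is not documented here, so the key names below are placeholders:

```python
# Sketch only: the keys and layout of crawl_state.json are assumptions.
import json
from pathlib import Path

STATE_FILE = Path("crawl_state.json")

def load_state() -> dict:
    """Return the state from previous runs, or an empty state on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text(encoding="utf-8"))
    return {"seen_urls": []}

def save_state(state: dict) -> None:
    """Persist the crawl state so the next run can resume incrementally."""
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")

def filter_new_urls(candidates: list[str], state: dict) -> list[str]:
    """Keep only URLs that were not processed in earlier runs."""
    seen = set(state["seen_urls"])
    return [url for url in candidates if url not in seen]
```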
Capability C: Quality Control Filters
To ensure the AI index contains only high-value "Answers," the script now enforces two strict quality gates, sketched in code after the list:
- Error Rejection: Automatically drops pages returning 403 Forbidden or 404 Not Found errors.
- "Thin Content" Filtering: Skips pages with missing or placeholder meta descriptions (e.g., "No description available"), ensuring the
llms.txtmanifest remains a high-signal resource.
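A sketch of both gates as a single check; the function name and the placeholder list are illustrative, not taken from the script:

```python
# Sketch only: the actual gate logic lives inside generate_corpus.py.
PLACEHOLDER_DESCRIPTIONS = {"", "no description available"}

def passes_quality_gates(status_code: int, meta_description: str | None) -> bool:
    """Reject error pages and thin-content pages before they reach the index."""
    if status_code in (403, 404):  # Error Rejection
        return False
    description = (meta_description or "").strip().lower()
    if description in PLACEHOLDER_DESCRIPTIONS:  # "Thin Content" Filtering
        return False
    return True
```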
2. Output Architecture
Upon execution, the script generates a robust file structure including the new state database:
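A rough sketch of what that layout might look like, assuming default output locations; apart from llms.txt and crawl_state.json, every name below is illustrative:

```
site-root/
├── llms.txt            # manifest of indexed pages (per llmstxt.org)
├── crawl_state.json    # persistent state database for incremental runs
└── corpus/             # the Markdown "shadow corpus" mirroring your pages
    ├── index.md
    └── blog/
        └── example-post.md
```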
3. Configuration for Efficiency
To maximize crawling speed while ensuring full coverage, the script uses "Hub vs. Leaf" logic: it skips known "Leaf" pages (articles) but re-crawls "Hub" pages (indices) to find new content. You must customize the hub patterns below to match your URL structure.
Defining Your Hubs
Locate the HUB_EXACT_PATHS and HUB_SUBSTRINGS lists at the top of the Python script.
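The list names come from the script; the example values and the is_hub helper below are hypothetical, shown only to illustrate how such patterns are typically matched:

```python
# Example values only; adjust these to match your own URL structure.
HUB_EXACT_PATHS = ["/", "/blog/", "/docs/"]   # index pages, re-crawled every run
HUB_SUBSTRINGS = ["/category/", "/tag/"]      # any path containing these is a hub

def is_hub(path: str) -> bool:
    """Hypothetical helper: hubs are re-crawled, leaves are skipped once seen."""
    return path in HUB_EXACT_PATHS or any(s in path for s in HUB_SUBSTRINGS)
```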
Tip: If your blog uses a different structure (e.g., /articles/ instead of /blog/), update these lists to ensure the crawler detects new posts correctly.