Automated llms.txt Generator

A Python utility for creating AI-ready site indices and shadow corpora with state persistence.

The Proposed Standard

What is llms.txt?
Just as robots.txt provides instructions for web crawlers, llms.txt is a proposed standard specifically designed to help Large Language Models (LLMs) discover and consume a website's core content without processing visual noise (ads, navigation, scripts).

This tool automates the creation of this file according to the specifications outlined at llmstxt.org.

1. Core Capabilities

The generate_corpus.py script is an autonomous agent designed to transform a dynamic website into a static knowledge base suitable for RAG (Retrieval Augmented Generation) pipelines and AI agents.

Capability A: The "Shadow Corpus" Strategy

The tool creates a local mirror of your content in Markdown format. Stripping the HTML overhead reduces token usage by up to 95% while preserving semantic structure (headings, lists, tables).
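
A minimal sketch of this conversion step, assuming the requests and html2text libraries are available (the function name save_shadow_page is illustrative, not the script's actual API):

```python
import requests
import html2text

def save_shadow_page(url: str, out_path: str) -> None:
    """Fetch a page and store a Markdown "shadow" copy of its content."""
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()

    converter = html2text.HTML2Text()
    converter.ignore_images = True   # drop visual noise
    converter.ignore_links = False   # keep hyperlinks; they carry semantic value
    converter.body_width = 0         # disable hard line wrapping

    markdown = converter.handle(resp.text)
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(markdown)
```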

Capability B: Incremental Intelligence (New in v2.2)

Unlike basic crawlers that reset every session, this tool maintains persistent state in crawl_state.json.

  • Smart Resuming: It remembers pages scanned in previous runs. If you add 5 new blog posts, the next run will detect them and simply append them to the existing index without losing historical data.
  • Efficiency: It avoids redundant processing, making daily or weekly updates of large documentation sites practical (see the sketch below).
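
A minimal sketch of how this state persistence can work, assuming crawl_state.json maps each URL to the date of its last successful crawl (the script's real schema may differ):

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("crawl_state.json")

def load_state() -> dict:
    """Return the persisted crawl history, or an empty dict on the first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text(encoding="utf-8"))
    return {}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")

# Usage: only process URLs that are not already in the state, then persist.
state = load_state()
discovered = ["https://example.com/blog/new-post"]  # placeholder for URLs found this run
for url in (u for u in discovered if u not in state):
    state[url] = {"last_crawled": time.strftime("%Y-%m-%d")}
save_state(state)
```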

Capability C: Quality Control Filters

To ensure the AI index contains only high-value "Answers," the script now enforces strict quality gates (a sketch of both checks follows the list):

  • Error Rejection: Automatically drops pages returning 403 Forbidden or 404 Not Found errors.
  • "Thin Content" Filtering: Skips pages with missing or placeholder meta descriptions (e.g., "No description available"), ensuring the llms.txt manifest remains a high-signal resource.

2. Output Architecture

On execution, the script generates the following file structure, including the state database:

```
📁 Project Root/
│
├── 📄 llms.txt (The Master Index)
│   ↳ A consolidated manifest of ALL valid pages (historical + new).
│
├── ⚙️ crawl_state.json (The Database)
│   ↳ Persists crawl history to enable incremental updates.
│
└── 📂 llms_content/ (The Library)
    ├── 📝 index.md
    ├── 📝 about-us.md
    └── 📝 blog-2025-ai-trends.md
```

3. Configuration for Efficiency

To maximize crawling speed while ensuring full coverage, the script uses "Hub vs. Leaf" logic: it skips known "Leaf" pages (articles) but re-crawls "Hub" pages (indices) to find new content. You must customize these patterns to match your URL structure.

Defining Your Hubs

Locate the HUB_EXACT_PATHS and HUB_SUBSTRINGS lists at the top of the Python script.

```python
# 1. Exact Paths: Top-level landing pages (re-crawled every time)
HUB_EXACT_PATHS = [
    "/blog",
    "/news",
    "/research",
    "/"
]

# 2. Substrings: Identifiers for pagination or archives
HUB_SUBSTRINGS = [
    "/page/",      # Standard pagination: /blog/page/2
    "page=",       # Query pagination: ?page=2
    "/category/",  # Category archives
    "/archive/"    # Date archives
]
```

Tip: If your blog uses a different structure (e.g., /articles/ instead of /blog/), update these lists to ensure the crawler detects new posts correctly.
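
As a rough sketch, the hub check might combine both lists like this (is_hub is an illustrative name, not necessarily the function used in the script):

```python
from urllib.parse import urlparse

HUB_EXACT_PATHS = ["/blog", "/news", "/research", "/"]
HUB_SUBSTRINGS = ["/page/", "page=", "/category/", "/archive/"]

def is_hub(url: str) -> bool:
    """Hub pages are always re-crawled; leaf pages are skipped once recorded in the state."""
    parsed = urlparse(url)
    path = parsed.path.rstrip("/") or "/"
    path_and_query = parsed.path + ("?" + parsed.query if parsed.query else "")
    if path in HUB_EXACT_PATHS:
        return True
    return any(marker in path_and_query for marker in HUB_SUBSTRINGS)

assert is_hub("https://example.com/blog")
assert is_hub("https://example.com/blog/page/2")
assert not is_hub("https://example.com/blog/my-first-post")
```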

4. Generated File Examples

The Manifest (llms.txt)

```markdown
# docs.python.org Site Index

> Generated for AI Agents. Updated: 2025-12-22. Contains 150 core pages.

- [Python 3.13 Documentation](https://docs.python.org/3/): The official home of the Python documentation.
- [What's New in Python 3.13](https://docs.python.org/3/whatsnew/3.13.html): Summary of new features and changes.
```
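
Each manifest entry pairs a page title and URL with its meta description. A minimal sketch of how one line could be assembled (manifest_line is an illustrative helper, not part of the script's API):

```python
def manifest_line(title: str, url: str, description: str) -> str:
    """Format one llms.txt entry as '- [Title](url): description'."""
    return f"- [{title}]({url}): {description.strip()}"

print(manifest_line(
    "What's New in Python 3.13",
    "https://docs.python.org/3/whatsnew/3.13.html",
    "Summary of new features and changes.",
))
```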

The Shadow Content (*.md)

```markdown
# What's New in Python 3.13

> Source: https://docs.python.org/3/whatsnew/3.13.html
> Description: Summary of new features...

## Summary – Release highlights

Python 3.13 is the latest stable release...

* Feature A
* Feature B
```