Automated llms.txt Generator

A Python utility for creating AI-ready site indices and shadow corpora with state persistence.

The Proposed Standard

What is llms.txt?
Just as robots.txt provides instructions for web crawlers, llms.txt is a proposed standard specifically designed to help Large Language Models (LLMs) discover and consume a website's core content without processing visual noise (ads, navigation, scripts).

This tool automates the creation of that file according to the specification outlined at llmstxt.org.

1. Core Capabilities

The generate_corpus.py script is an autonomous agent designed to transform a dynamic website into a static knowledge base suitable for RAG (Retrieval Augmented Generation) pipelines and AI agents.

Capability A: The "Shadow Corpus" Strategy

The tool creates a local mirror of your content in Markdown format. This strips away HTML overhead, reducing token usage by up to 95% while preserving semantic structure (headers, lists, tables).
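As an illustration of this conversion step, here is a minimal sketch. It assumes the third-party requests, beautifulsoup4, and markdownify packages, and the helper name save_shadow_page is hypothetical; the actual script may rely on different libraries.

# Sketch of the HTML -> Markdown conversion behind the shadow corpus.
# `requests`, `beautifulsoup4`, and `markdownify` are assumed dependencies;
# the helper name `save_shadow_page` is hypothetical.
from pathlib import Path

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

def save_shadow_page(url: str, out_dir: Path) -> Path:
    """Fetch one page and store a Markdown mirror of it under out_dir."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Strip non-content elements (scripts, styles, navigation chrome) so only
    # the semantic structure survives the conversion.
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    slug = url.rstrip("/").rsplit("/", 1)[-1] or "index"
    out_path = (out_dir / slug).with_suffix(".md")
    out_path.write_text(md(str(soup)), encoding="utf-8")
    return out_path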

Capability B: Incremental Intelligence (New in v2.2)

Unlike basic crawlers that reset every session, this tool maintains persistent state in crawl_state.json; a minimal sketch of this pattern follows the list below.

  • Smart Resuming: It remembers pages scanned in previous runs. If you add 5 new blog posts, the next run will detect them and simply append them to the existing index without losing historical data.
  • Efficiency: It eliminates redundant processing, allowing for efficient daily or weekly updates of large documentation sites.
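For illustration, here is a minimal sketch of that load/skip/save cycle. The "visited" key and the overall JSON layout are assumptions made for this example; the real crawl_state.json schema may differ.

# Sketch of incremental state handling; the "visited" key is an
# illustrative assumption, not the script's documented schema.
import json
from pathlib import Path

STATE_FILE = Path("crawl_state.json")

def load_visited() -> set:
    """Return the URLs recorded by previous runs (empty set on the first run)."""
    if STATE_FILE.exists():
        data = json.loads(STATE_FILE.read_text(encoding="utf-8"))
        return set(data.get("visited", []))
    return set()

def save_visited(visited: set) -> None:
    """Persist the cumulative URL set so the next run can resume."""
    STATE_FILE.write_text(
        json.dumps({"visited": sorted(visited)}, indent=2),
        encoding="utf-8",
    )

visited = load_visited()
discovered = ["https://example.com/blog/new-post"]   # stand-in for real URL discovery
new_urls = [url for url in discovered if url not in visited]
# ... fetch, convert, and index only new_urls here ...
visited.update(new_urls)
save_visited(visited)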

Capability C: Quality Control Filters

To ensure the AI index contains only high-value "Answers," the script now enforces strict quality gates, sketched in code after this list:

  • Error Rejection: Automatically drops pages returning 403 Forbidden or 404 Not Found errors.
  • "Thin Content" Filtering: Skips pages with missing or placeholder meta descriptions (e.g., "No description available"), ensuring the llms.txt manifest remains a high-signal resource.

2. Output Architecture

Upon execution, the script generates the following file structure, including the new state database:

📁 Project Root/
│
├── 📄 llms.txt (The Master Index)
│     ↳ A consolidated manifest of ALL valid pages (historical + new).
│
├── ⚙️ crawl_state.json (The Database)
│     ↳ Persists crawl history to enable incremental updates.
│
└── 📂 llms_content/ (The Library)
      ├── 📝 index.md
      ├── 📝 about-us.md
      └── 📝 blog-2025-ai-trends.md

3. Configuration for Efficiency

To maximize crawling speed while ensuring full coverage, the script uses "Hub vs. Leaf" logic: it skips known "Leaf" pages (individual articles) but always re-crawls "Hub" pages (index and archive pages) to discover new content. You must customize these patterns to match your URL structure; a code sketch of the classification appears at the end of this section.

Defining Your Hubs

Locate the HUB_EXACT_PATHS and HUB_SUBSTRINGS lists at the top of the Python script.

# 1. Exact Paths: Top-level landing pages (re-crawled every time)
HUB_EXACT_PATHS = [
    "/blog",
    "/news",
    "/research",
    "/"
]

# 2. Substrings: Identifiers for pagination or archives
HUB_SUBSTRINGS = [
    "/page/",      # Standard pagination: /blog/page/2
    "page=",       # Query pagination: ?page=2
    "/category/",  # Category archives
    "/archive/"    # Date archives
]

Tip: If your blog uses a different structure (e.g., /articles/ instead of /blog/), update these lists to ensure the crawler detects new posts correctly.
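For clarity, here is a minimal sketch of how these lists might drive the re-crawl decision. The is_hub_url function and its matching rules are illustrative assumptions rather than the script's actual implementation.

# Sketch of "Hub vs. Leaf" classification using the lists configured above.
from urllib.parse import urlparse

HUB_EXACT_PATHS = ["/blog", "/news", "/research", "/"]
HUB_SUBSTRINGS = ["/page/", "page=", "/category/", "/archive/"]

def is_hub_url(url: str) -> bool:
    """Hubs are re-crawled on every run; leaves are skipped once archived."""
    parsed = urlparse(url)
    path = parsed.path.rstrip("/") or "/"
    if path in HUB_EXACT_PATHS:
        return True
    path_and_query = parsed.path + ("?" + parsed.query if parsed.query else "")
    return any(marker in path_and_query for marker in HUB_SUBSTRINGS)

print(is_hub_url("https://example.com/blog"))            # True  -> always re-crawl
print(is_hub_url("https://example.com/blog/page/2"))     # True  -> pagination hub
print(is_hub_url("https://example.com/blog/ai-trends"))  # False -> leaf, skip if archived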

4. Generated File Examples

The Manifest (llms.txt) - llmstxt.org Compliant

The output follows the official specification from llmstxt.org:

# example.com

> AI-curated index of 150 pages from example.com. Generated for LLM consumption per the llms.txt specification (llmstxt.org).

This index provides structured access to site content crawled and archived for AI model consumption. Each entry includes the page title, direct URL, and a description to help LLMs understand the content context.

## All Pages

- [Python 3.13 Documentation](https://docs.python.org/3/): The official home of the Python documentation.
- [What's New in Python 3.13](https://docs.python.org/3/whatsnew/3.13.html): Summary of new features and changes.

The Shadow Content (*.md)

# What's New in Python 3.13

> Source: https://docs.python.org/3/whatsnew/3.13.html
> Description: Summary of new features...

## Summary – Release highlights

Python 3.13 is the latest stable release...

* Feature A
* Feature B

Reference Documentation

For more on SEO in the AI era, see the accompanying article: "SEO for the AI Era: A 2025 Quick Guide".