
When PDFs Fight Back: Converting Complex Documents for Information Diffusion

Or: How I learned to stop trusting automated tools and verify everything

The Experiment: Unlocking the Value in Long-Form PDFs

We’ve all been there: you have a comprehensive 200-page PDF—a research report, technical manual, or detailed analysis—packed with valuable information. But actually using it? Nearly impossible. No search that works well, no way to jump between sections, footnotes buried at the bottom of pages, tables that don’t resize on mobile.

The information is there, but it’s locked up. It doesn’t diffuse to the people who need it.

I wanted to test if modern tools could transform this locked information into something actually useful: a navigable, searchable, interactive web experience. So I chose a deliberately complex 200+ page document as my stress test:

  • Complex nested tables with historical data
  • 125+ footnotes with citations
  • Cross-references woven throughout
  • External links to reference materials
  • Multiple appendices with detailed case studies

If the conversion tools could handle this, they could handle anything.

Spoiler: they couldn’t. But the journey revealed something important about information architecture in the age of AI.

The First Mistake: Trusting pdftohtml

I started optimistically. Use pdftohtml, extract everything, clean it up a bit. Maybe a day of work.

The HTML looked fine at first glance. Pages rendered, headings appeared, tables showed up. Success!

Then I started clicking links.

1,800 Broken Links: The Hidden Chaos

Links that should have said “See Appendix 3: Methodology” showed up as three separate, useless fragments:

  • A link containing just “A”
  • A link containing “PPENDIX”
  • A link containing “3:”

I had extracted 2,055 internal links. Approximately 1,800 were completely broken: an 87% failure rate.

Why? PDFs aren’t documents—they’re printer instructions. Text is stored as: “Place ‘d’ at position (72.5, 234.8). Place ‘o’ at (77.2, 234.8)…”

When pdftohtml wraps hyperlinks around text, it catches only fragments, because the text is scattered across individually positioned characters. The word “required” with a hyperlink becomes:

<a href='output.html#111'>d </a>

Just the letter “d”. Linked. Useless.
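You can watch this happen with pdfminer.six. A quick sketch (report.pdf is a placeholder; any PDF will do) that dumps exactly what the format stores: positioned glyphs, not words:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

# Print each character with the (x, y) origin the PDF records for it.
# There are no words in here, just glyphs at coordinates.
for page in extract_pages("report.pdf"):
    for element in page:
        if isinstance(element, LTTextContainer):
            for line in element:
                for obj in line:
                    if isinstance(obj, LTChar):
                        print(f"Place {obj.get_text()!r} at ({obj.x0:.1f}, {obj.y0:.1f})")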

The Markdown Breakthrough: Enabling Information Diffusion

Frustrated, I tried converting to Markdown first.

This was the key decision that saved the experiment—not just for accuracy, but because it unlocked automation.

Markdown preserves semantic structure:

See [Appendix 3: Research Methodology](#appendix-3)

Not fragmented garbage:

See <a>A</a><a>PPENDIX</a> <a>3:</a>

Why this matters: Markdown enables automated generation of interactive experiences.

With clean, hierarchical syntax, I could use LLMs to automatically generate:

  • Navigation menus from headers (zero manual work)
  • Table of contents with proper nesting
  • Search indices from semantic content
  • Cross-reference maps
  • Mobile-responsive layouts

Try doing that with fragmented HTML spans. Good luck.
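To make that concrete, here is a minimal sketch of the navigation piece, not the exact code from my pipeline: walk the Markdown's ATX headings, build anchors with a simplified GitHub-style slug (it will miss edge cases), and emit a nested menu:

import re

def build_toc(markdown_text):
    # Collect (level, title, anchor) for every ATX heading (# through ######).
    toc = []
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.+)", line)
        if match:
            title = match.group(2).strip()
            # Simplified GitHub-style slug: drop punctuation, hyphenate spaces.
            slug = re.sub(r"[^\w\s-]", "", title).strip().lower().replace(" ", "-")
            toc.append((len(match.group(1)), title, f"#{slug}"))
    return toc

def toc_to_html(toc):
    # Render the flat heading list as nested <ul> navigation markup.
    parts, depth = [], 0
    for level, title, anchor in toc:
        while depth < level:
            parts.append("<ul>")
            depth += 1
        while depth > level:
            parts.append("</ul>")
            depth -= 1
        parts.append(f'<li><a href="{anchor}">{title}</a></li>')
    parts.append("</ul>" * depth)
    return "\n".join(parts)

Twenty-odd lines, and the sidebar builds itself from the document's own structure.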

The key insight: Markdown preserves meaning, not just positioning. And meaning is what enables information to flow to users.

But Markdown didn’t have the anchor IDs I needed for navigation. For that, I needed the HTML version.

The Hybrid Solution: Best of Both Worlds

I realized I needed both:

Markdown as source of truth:

  • All content and link text (accurate and complete)
  • Table structure
  • Foundation for automated generation

HTML as navigation guide:

  • Anchor IDs
  • Section boundaries
  • Where things link TO

Think of it like manuscript (Markdown) + architectural blueprint (HTML). Both essential.
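The blueprint half is just a lookup table. A sketch, assuming BeautifulSoup: harvest every id the converter emitted, keyed by the visible text, so links rebuilt from the Markdown know where to point:

from bs4 import BeautifulSoup

def build_anchor_map(html_text):
    # Map visible element text to the anchor ID the converter assigned.
    # This is the "where things link TO" half of the hybrid approach.
    soup = BeautifulSoup(html_text, "html.parser")
    anchor_map = {}
    for tag in soup.find_all(id=True):
        text = tag.get_text(" ", strip=True)
        if text:
            anchor_map[text.lower()] = tag["id"]
    return anchor_map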

The Reconstruction Effort

1. Fix 1,800 Fragmented Links (LLM-Assisted, But Not Perfect)

Here’s where LLMs became invaluable. For each broken link, I needed to:

  • Find fragments in HTML
  • Look up correct text in Markdown
  • Identify the anchor target
  • Reconstruct the proper link

Before: <a href="#111">M</a><a href="#111">ETHODOLOGY</a>
After: <a href="#methodology">Research Methodology</a>

Instead of doing this 1,800 times manually, I used LLMs to:

  • Parse both HTML and Markdown systematically
  • Match fragmented links to their complete Markdown counterparts
  • Generate corrected HTML with proper anchor targets
  • Batch process hundreds of fixes at once

The LLM could understand the semantic relationship between <a href="#111">M</a><a href="#111">ETHODOLOGY</a> in the HTML and [Research Methodology](#methodology) in the Markdown, then reconstruct the proper link automatically.
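Here is a stripped-down sketch of that pipeline, assuming fragments shaped like the examples above. Real documents needed fuzzier matching than this, which is exactly where the LLM earned its keep:

import re

# Complete [text](#anchor) links in the Markdown, the source of truth.
LINK_MD = re.compile(r"\[([^\]]+)\]\((#[^)\s]+)\)")
# A run of two or more adjacent fragment anchors from pdftohtml.
RUN = re.compile(r'(?:<a href="#\d+">[^<]*</a>\s*){2,}')
# A single anchor, used to pick runs apart and to repoint targets.
FRAG = re.compile(r'<a href="(#[^"]+)">([^<]*)</a>')

def markdown_index(md_text):
    # Key each complete link by its text, normalized for loose matching.
    return {re.sub(r"\s+", "", text).lower(): (text, anchor)
            for text, anchor in LINK_MD.findall(md_text)}

def merge_fragments(html_text):
    # Collapse <a href="#111">M</a><a href="#111">ETHODOLOGY</a> into one
    # anchor carrying the concatenated fragment text.
    def merge(match):
        pieces = FRAG.findall(match.group(0))
        return f'<a href="{pieces[0][0]}">{"".join(t for _, t in pieces)}</a>'
    return RUN.sub(merge, html_text)

def repoint_links(html_text, md_index):
    # Swap numeric pdftohtml targets for the semantic anchor and complete
    # link text recovered from the Markdown.
    def repair(match):
        key = re.sub(r"\s+", "", match.group(2)).lower()
        for md_key, (full_text, anchor) in md_index.items():
            if key and md_key.endswith(key):
                return f'<a href="{anchor}">{full_text}</a>'
        return match.group(0)  # unmatched links fall through for human review
    return FRAG.sub(repair, html_text)

Run the merge pass first, then the repoint pass; anything the heuristics can't match falls through untouched, which is where the LLM (and then a human) took over.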

But here’s the reality: LLM output wasn’t perfect. It would:

  • Occasionally misidentify which Markdown link matched which HTML fragment
  • Generate anchor IDs that didn’t exist in the target document
  • Miss edge cases where fragmentation patterns were unusual
  • Confidently produce incorrect links that looked plausible

Manual verification was essential. The LLM got me from 1,800 manual fixes down to maybe 200-300 that needed human review—a huge time saver, but not fully automated.

Critical insight: This is only possible because Markdown provides clean, semantic content that LLMs can parse. The fragmented HTML alone would be impossible to fix at scale—you need the Markdown as the source of truth.

2. Clean Nested Anchor Nightmares (LLM + Scripts + Manual Review)

The converter created invalid HTML like:

<a href="https://example.org/<a href="REAL">TEXT</a>" target="_blank">

I found 145+ instances. Rather than manually fixing each one, I:

  • Used LLMs to identify the pattern and generate regex solutions
  • Wrote Python scripts based on LLM suggestions
  • Batch processed all fixes systematically
  • Manually reviewed edge cases where the LLM’s regex was too aggressive or missed nuances

The combination of LLM pattern recognition and scripting made quick work of what would have been tedious manual editing—but human oversight was still needed to catch the 10-15% of cases where automated fixes broke something else.
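For the pattern above, the fix compresses to one substitution. A sketch, assuming every instance follows that exact shape (plenty didn't, hence the manual review):

import re

# Matches an <a> whose href has swallowed a second, real <a> tag, e.g.
# <a href="https://example.org/<a href="REAL">TEXT</a>" target="_blank">
NESTED = re.compile(
    r'<a href="[^"]*<a href="(?P<href>[^"]+)">(?P<text>[^<]*)</a>"[^>]*>')

def unnest_anchors(html_text):
    # Rebuild each invalid nested anchor from its inner, real link.
    return NESTED.sub(
        lambda m: f'<a href="{m.group("href")}" target="_blank">{m.group("text")}</a>',
        html_text)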

3. Automate the Interactive Experience (LLM-Powered)

Markdown’s machine-readability paid off. I fed it to LLMs to auto-generate:

  • Navigation structure: LLM parsed headers and generated sidebar menus
  • Search functionality: Extracted sections and built search indices
  • Responsive layouts: Transformed semantic structure into mobile-friendly HTML/CSS
  • Link reconstruction: Matched fragmented HTML to complete Markdown text

The key workflow:

  1. Clean Markdown as input
  2. LLM parses structure and generates code
  3. Minimal manual intervention needed

I didn’t hand-code 200 pages of navigation. LLMs generated it from semantic structure, because Markdown preserves the meaning machines need to understand.
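The search index came from the same trick. A sketch: split the Markdown on headings and emit one JSON record per section, ready for a client-side search library (Lunr.js, for example; the record shape here is my assumption, not a requirement):

import json
import re

def build_search_index(md_text):
    # One record per section: title, anchor, and the body text under it.
    sections, current = [], None
    for line in md_text.splitlines():
        heading = re.match(r"^(#{1,6})\s+(.+)", line)
        if heading:
            if current:
                sections.append(current)
            title = heading.group(2).strip()
            slug = re.sub(r"[^\w\s-]", "", title).strip().lower().replace(" ", "-")
            current = {"title": title, "anchor": f"#{slug}", "body": ""}
        elif current:
            current["body"] += line + "\n"
    if current:
        sections.append(current)
    return json.dumps(sections, indent=2)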

4. Verify Systematically (Catches LLM Mistakes)

Spot-checking catches nothing. I wrote scripts to:

  • Map every link from its source to its destination
  • Verify all tables and footnotes
  • Validate anchor-link relationships
  • Cross-reference Markdown against HTML

This verification was crucial for catching LLM errors:

  • Links that looked correct but pointed to wrong anchors
  • Generated IDs that didn’t match the target structure
  • Regex fixes that broke valid HTML in edge cases
  • Subtle mismatches between Markdown and final output

This caught dozens of issues that manual checking would have missed—and more importantly, caught the issues that LLM automation created while trying to help.
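The workhorse was also the simplest script of the bunch. A sketch of the link validator: collect every id that exists in the final HTML, then flag every internal href that points at nothing:

import re
import sys

def validate_internal_links(html_path):
    # Every href="#target" must point at an id that actually exists.
    with open(html_path, encoding="utf-8") as f:
        html = f.read()
    ids = set(re.findall(r'\bid="([^"]+)"', html))
    hrefs = re.findall(r'href="#([^"]+)"', html)
    broken = sorted({h for h in hrefs if h not in ids})
    print(f"{len(hrefs)} internal links checked, {len(broken)} broken targets")
    for target in broken:
        print(f"  missing anchor: #{target}")
    return not broken

if __name__ == "__main__":
    sys.exit(0 if validate_internal_links(sys.argv[1]) else 1)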

The Result: Information That Flows

The final interactive web document:

✓ Instant search across 200+ pages
✓ Auto-generated navigation from structure
✓ Mobile-responsive
✓ Hover tooltips for footnotes
✓ Every link works

Information flows to users instead of hiding from them.

Users find what they need in seconds, not minutes.

What I Learned

1. Automated Conversion Gets You 60-70% There

Tools like pdftohtml are impressive, but the remaining 30-40% is where all the work lives. Budget accordingly.

2. Markdown Enables Information Diffusion

The key lesson: It’s not about converting a document—it’s about making information accessible.

Markdown’s semantic structure enables:

  • Machine parsing and transformation
  • Automated navigation generation
  • Multi-channel distribution (web, mobile, print)
  • LLM-powered enhancements

HTML preserves positioning. Markdown preserves meaning. Meaning enables information flow.

3. Multiple Formats Provide Redundancy

Both Markdown and HTML were essential. Each preserved different aspects correctly.

4. Systematic Verification Is Non-Negotiable

When staring at hundreds of links, your brain fills in gaps. Automated verification catches weird edge cases—and there are always edge cases.

Even more critical with LLM assistance: LLMs are confidently wrong. They’ll generate links that look perfect but point to the wrong place. They’ll create anchor IDs that don’t exist. Systematic verification is what catches these plausible-but-broken outputs.

LLMs accelerate the work dramatically, but they don’t eliminate the need for verification—they make it more important.

5. Universal Principles Apply

Whether converting research papers, technical manuals, reports, or reference guides—same challenges, same solutions. PDF’s design makes it hostile to semantic extraction, regardless of content domain.

The Takeaway

This experiment proved: converting complex documents isn’t a technical problem—it’s an information architecture problem.

You’re transforming a locked vault into a diffusion system—making information flow to people in the format they need, when they need it.

Markdown is the key. Not because it’s perfect, but because it preserves semantic meaning that both humans and machines can understand and transform.

If you’re tackling similar conversions:

  1. Choose Markdown as source of truth (enables LLM automation)
  2. Use multiple conversion outputs for redundancy
  3. Leverage LLMs for link reconstruction and code generation
  4. Build verification scripts (LLMs can help write these too)
  5. Expect to manually review 10-20% of LLM output (they’re helpful, not perfect)
  6. Budget 3x your initial estimate (even with LLM assistance)

The work is messy and frustrating. LLMs accelerate it significantly—turning weeks into days—but they don’t eliminate the need for careful verification.

When you unlock 200 pages of valuable information and make it searchable, navigable, and accessible, you understand why it matters.

Information wants to be free. This experiment proved we can set it free responsibly—with the right approach, LLM-assisted workflows, systematic verification, and realistic expectations.


Have your own document conversion stories? The failures are often more educational than the successes.