Best Document Format for LLMs: A Benchmark
Everyone optimizes prompts. Everyone debates model selection. Almost nobody optimizes the format they send documents in — even though it has a measurable impact on cost, speed, and answer quality.
We benchmarked four common document formats to find the best one for LLM consumption. Here's what the data shows.
Methodology
We used a standardized 10-page business report containing a mix of content types commonly found in enterprise documents:
- 6 section headings with 2-3 levels of hierarchy
- 14 paragraphs of body text
- 3 data tables (revenue breakdown, regional comparison, quarterly metrics)
- 4 bullet lists
- 2 footnotes
- 1 table of contents
This document was converted to four formats:
- Raw PDF text — Extracted using a standard PDF text extraction library
- HTML — Exported as HTML with inline styles (typical of web-based document viewers)
- DOCX text extraction — Content extracted from the DOCX XML structure
- Clean Markdown — Converted via MDSpin with full structure preservation
We measured four dimensions:
- Token count using tiktoken (cl100k_base encoding, used by GPT-4 and GPT-4o)
- Cost per document across major LLM providers
- Retrieval accuracy in a RAG setup (top-k=5, cosine similarity, OpenAI text-embedding-3-small)
- Response quality on 20 factual questions about the document content, scored by human evaluators
Note: These benchmarks use representative examples. Actual results will vary by document content, complexity, and the specific extraction tools used.
Token Count Results
| Format | Token Count | vs. Markdown | Overhead |
|---|---|---|---|
| Clean Markdown | 8,350 | — (baseline) | 0% |
| DOCX extraction | 9,800 | +1,450 | +17% |
| Raw PDF text | 12,400 | +4,050 | +48% |
| HTML export | 14,200 | +5,850 | +70% |
Markdown produces the fewest tokens because its syntax is minimal by design. A heading is ## (2 characters). In HTML, the same heading is <h2 class="section-heading"> plus </h2> — 30+ characters that all get tokenized.
DOCX extraction lands in the middle. The XML structure of DOCX files is reasonably clean, but extraction tools often retain some formatting artifacts and whitespace that inflate token counts.
Raw PDF text carries positioning metadata, repeated headers/footers, and garbled table content. HTML carries the most overhead due to tags, attributes, styles, and scripts.
Cost Impact Across Models
Using the token counts above, here's what each format costs per document across major LLM providers:
Cost Per Document (Input Tokens Only)
| Format | GPT-4o ($2.50/M) | Claude 3.5 Sonnet ($3/M) | GPT-4 ($30/M) | Gemini 1.5 Pro ($1.25/M) |
|---|---|---|---|---|
| Markdown | $0.021 | $0.025 | $0.251 | $0.010 |
| DOCX | $0.025 | $0.029 | $0.294 | $0.012 |
| PDF text | $0.031 | $0.037 | $0.372 | $0.016 |
| HTML | $0.036 | $0.043 | $0.426 | $0.018 |
The per-document difference looks small. It's not — it scales.
Annual Cost for 500 Documents/Month
| Format | GPT-4o | Claude 3.5 Sonnet | GPT-4 |
|---|---|---|---|
| Markdown | $125 | $150 | $1,503 |
| PDF text | $186 | $223 | $2,232 |
| HTML | $213 | $255 | $2,556 |
Savings (Markdown vs. PDF):
- GPT-4o: $61/year
- Claude 3.5 Sonnet: $73/year
- GPT-4: $729/year
At higher volumes (5,000+ documents/month) or with more expensive models, the savings scale proportionally. For teams processing thousands of documents daily through GPT-4-class models, the annual savings reach into thousands of dollars — just from changing the input format.
For a deeper analysis of token cost optimization strategies, see our guide on how to cut AI token costs by 40%.
Retrieval Accuracy (RAG)
We built a RAG pipeline using the same document in each format:
- Chunking: 512-token chunks with 50-token overlap (character-count for PDF/text, heading-aware for Markdown)
- Embeddings: OpenAI text-embedding-3-small
- Vector store: In-memory cosine similarity
- Retrieval: Top-5 chunks
- Evaluation: 20 factual questions with known correct answers; accuracy = correct answer in retrieved chunks
| Format | Retrieval Accuracy | vs. PDF |
|---|---|---|
| Clean Markdown | 85% | +64% |
| HTML (tags stripped) | 68% | +31% |
| DOCX extraction | 60% | +15% |
| Raw PDF text | 52% | baseline |
Markdown's advantage comes from two compounding factors:
- Better chunking — Heading-aware chunking produces semantically coherent chunks. Each chunk contains a complete section rather than an arbitrary text slice.
- Cleaner embeddings — Without formatting noise, embedding models produce vectors that more accurately represent the chunk's semantic content.
The HTML result is interesting: even after stripping tags, HTML-derived text retains some structural artifacts that help retrieval. But Markdown's explicit, minimal structure outperforms HTML by a significant margin.
For a complete guide to building Markdown-first RAG pipelines, see why Markdown is the best format for RAG.
Response Quality
We asked GPT-4o 20 factual questions about the document (with the full document as context in each format) and had human evaluators score each response on three dimensions:
| Dimension | Markdown | DOCX | PDF text | HTML |
|---|---|---|---|---|
| Accuracy (1-5) | 4.7 | 4.2 | 3.4 | 4.0 |
| Completeness (1-5) | 4.5 | 4.0 | 3.1 | 3.8 |
| Formatting (1-5) | 4.6 | 4.1 | 2.9 | 3.7 |
| Average | 4.6 | 4.1 | 3.1 | 3.8 |
Key findings:
-
PDF text scored lowest across all dimensions. Table-related questions were the worst — the model frequently extracted wrong numbers or reported it couldn't find the data. Three of the 20 responses contained outright hallucinated figures that appeared plausible but didn't exist in the source document.
-
Markdown scored highest across all dimensions. Every table question was answered correctly. Section references were accurate. The model produced well-structured responses that matched the document's organization.
-
HTML performed surprisingly well on accuracy but lost points on formatting and completeness. The model would sometimes truncate responses, likely because the HTML overhead consumed context window space that would otherwise be available for longer answers.
-
DOCX was a solid middle ground — better than PDF, worse than Markdown. Most formatting artifacts in DOCX extraction are minor (extra whitespace, occasional style remnants) rather than structure-destroying.
Why Markdown Wins
Across all four dimensions — token count, cost, retrieval accuracy, and response quality — Markdown outperforms every other common document format. The reasons are structural:
Minimal syntax overhead. Markdown uses the fewest characters possible to encode document structure. Every character serves a purpose — there's no boilerplate, no redundancy, no formatting-only content.
Explicit structure. Heading hierarchy (#, ##, ###) maps directly to document organization. Tables use unambiguous pipe syntax. Lists use clear indent-based hierarchy. There's no ambiguity for parsers, embedding models, or LLMs to resolve.
Universal LLM training data. Every major LLM was trained on massive amounts of Markdown (GitHub, documentation sites, Reddit, etc.). Markdown is arguably the format LLMs understand best natively.
Clean chunking boundaries. For RAG, Markdown's heading structure provides natural split points that align with semantic boundaries. No other format provides this without additional parsing.
How to Convert Your Documents
For quick, individual conversion: Use MDSpin to convert PDFs, DOCX, PPTX, and other formats to Markdown in seconds. Zero setup, zero installation.
For developer pipelines: Integrate a conversion step before your LLM calls. Options include MDSpin for web-based conversion, MarkItDown for Python-based pipelines, or Docling for complex PDF processing. See our full comparison of conversion tools.
For automation workflows: MDSpin's Make.com integration (coming soon) will automate document conversion inside no-code scenarios.
For a comprehensive overview of preprocessing strategies across all document types and AI use cases, read our guide on document preprocessing for AI.
The Bottom Line
Document format isn't a minor implementation detail. It's a core input variable that affects cost, performance, and quality across your entire AI stack.
Markdown wins the benchmark decisively. The conversion step takes seconds and pays dividends on every API call, every retrieval query, and every AI-generated response.
Try MDSpin free at mdspin.app — see the token reduction for your own documents.
Related guides: