MDSpin is a web-based tool that converts documents (PDF, DOCX, PPTX, and more) into clean, AI-ready Markdown. It requires zero setup — just drop a file and get structured Markdown output optimized for ChatGPT, Claude, Gemini, and RAG pipelines.

How do I convert a PDF to markdown for AI?

Go to mdspin.app, drag and drop your PDF onto the converter, and click Spin. MDSpin extracts the content and converts it to clean Markdown that preserves headings, tables, and lists. Copy the output and paste it into ChatGPT, Claude, or any other AI tool for better responses.

Does markdown reduce AI token costs?

Yes. Converting documents to Markdown before sending them to AI can reduce token usage by 30-50% compared to raw PDF text or HTML. Markdown strips formatting overhead while preserving all meaningful content structure, which means lower API costs and faster inference.

What file formats does MDSpin support?

MDSpin supports 8 document formats: PDF, DOCX, DOC, PPTX, Google Slides, Apple Pages, RTF, and TXT. Each format is converted to clean, structured Markdown with headings, tables, and lists preserved.

Is MDSpin free to use?

MDSpin is free during the beta period. You can convert documents without any cost. Sign up for a free account to save your conversion history and get higher rate limits.

How does MDSpin compare to MarkItDown?

Microsoft MarkItDown is a Python library that requires installation, coding knowledge, and command-line usage. MDSpin provides the same core document-to-Markdown conversion through a zero-setup web interface — no Python, no CLI, no deployment. MDSpin also shows conversion impact metrics like token reduction and cost savings that MarkItDown does not provide.

Tags: benchmark, ai, comparison

Best Document Format for LLMs: A Benchmark

MDSpin Team · March 28, 2026 · 7 min read

Everyone optimizes prompts. Everyone debates model selection. Almost nobody optimizes the format they send documents in — even though it has a measurable impact on cost, speed, and answer quality.

We benchmarked four common document formats to find the best one for LLM consumption. Here's what the data shows.

Methodology

We used a standardized 10-page business report containing a mix of content types commonly found in enterprise documents:

6 section headings with 2-3 levels of hierarchy
14 paragraphs of body text
3 data tables (revenue breakdown, regional comparison, quarterly metrics)
4 bullet lists
2 footnotes
1 table of contents

This document was converted to four formats:

Raw PDF text — Extracted using a standard PDF text extraction library
HTML — Exported as HTML with inline styles (typical of web-based document viewers)
DOCX text extraction — Content extracted from the DOCX XML structure
Clean Markdown — Converted via MDSpin with full structure preservation

We measured four dimensions:

Token count using tiktoken (cl100k_base encoding, used by GPT-4; GPT-4o uses the newer o200k_base encoding, so its counts will differ somewhat)
Cost per document across major LLM providers
Retrieval accuracy in a RAG setup (top-k=5, cosine similarity, OpenAI text-embedding-3-small)
Response quality on 20 factual questions about the document content, scored by human evaluators

Note: These benchmarks use representative examples. Actual results will vary by document content, complexity, and the specific extraction tools used.

Token Count Results

Format	Token Count	vs. Markdown	Overhead
Clean Markdown	8,350	— (baseline)	0%
DOCX extraction	9,800	+1,450	+17%
Raw PDF text	12,400	+4,050	+48%
HTML export	14,200	+5,850	+70%
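
The overhead column follows directly from the token counts. As a quick check, the short Python sketch below recomputes each format's extra tokens and percentage overhead relative to the Markdown baseline. The dictionary and function names are ours, chosen for illustration.

```python
# Token counts from the table above.
TOKEN_COUNTS = {
    "Clean Markdown": 8_350,
    "DOCX extraction": 9_800,
    "Raw PDF text": 12_400,
    "HTML export": 14_200,
}

def overhead_vs_baseline(counts, baseline="Clean Markdown"):
    """Return {format: (extra_tokens, overhead_fraction)} relative to the baseline."""
    base = counts[baseline]
    return {name: (n - base, (n - base) / base) for name, n in counts.items()}

for name, (extra, frac) in overhead_vs_baseline(TOKEN_COUNTS).items():
    print(f"{name:16} +{extra:>5,} tokens  ({frac:+.1%})")
```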

Markdown produces the fewest tokens because its syntax is minimal by design. A heading is ## (2 characters). In HTML, the same heading is <h2 class="section-heading"> plus </h2> — 30+ characters that all get tokenized.
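
To make the heading comparison concrete, this small Python sketch counts the markup characters wrapped around the same heading text in both formats. The heading text "Quarterly Revenue" is an invented example; the class name is the one quoted above. Character counts are only a proxy, since the actual token difference depends on the tokenizer.

```python
heading_text = "Quarterly Revenue"  # invented example heading

markdown_heading = f"## {heading_text}"
html_heading = f'<h2 class="section-heading">{heading_text}</h2>'

def markup_overhead(rendered, text):
    """Characters in the rendered heading that are not the heading text itself."""
    return len(rendered) - len(text)

print(markup_overhead(markdown_heading, heading_text))  # 3: "##" plus a space
print(markup_overhead(html_heading, heading_text))      # 33: opening and closing tags
```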

DOCX extraction lands in the middle. The XML structure of DOCX files is reasonably clean, but extraction tools often retain some formatting artifacts and whitespace that inflate token counts.

Raw PDF text carries positioning metadata, repeated headers/footers, and garbled table content. HTML carries the most overhead due to tags, attributes, styles, and scripts.

Cost Impact Across Models

Using the token counts above, here's what each format costs per document across major LLM providers:

Cost Per Document (Input Tokens Only)

Format	GPT-4o ($2.50/M)	Claude 3.5 Sonnet ($3/M)	GPT-4 ($30/M)	Gemini 1.5 Pro ($1.25/M)
Markdown	$0.021	$0.025	$0.251	$0.010
DOCX	$0.025	$0.029	$0.294	$0.012
PDF text	$0.031	$0.037	$0.372	$0.016
HTML	$0.036	$0.043	$0.426	$0.018
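
Each cell in the table is the token count multiplied by the provider's input price per million tokens. A minimal Python sketch of that arithmetic follows, using the prices from the column headers (which may no longer match current provider pricing) and a helper name of our own choosing.

```python
# Input prices in USD per million tokens, as listed in the table header.
PRICE_PER_MILLION = {
    "GPT-4o": 2.50,
    "Claude 3.5 Sonnet": 3.00,
    "GPT-4": 30.00,
    "Gemini 1.5 Pro": 1.25,
}

def input_cost(tokens, price_per_million):
    """Input-side cost in USD of sending `tokens` tokens once."""
    return tokens * price_per_million / 1_000_000

for model, price in PRICE_PER_MILLION.items():
    markdown = input_cost(8_350, price)   # Clean Markdown
    pdf = input_cost(12_400, price)       # Raw PDF text
    print(f"{model:18} Markdown ${markdown:.4f}  PDF ${pdf:.4f}")
```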

The per-document difference looks small, but it compounds with volume.

Annual Cost for 500 Documents/Month

Format	GPT-4o	Claude 3.5 Sonnet	GPT-4
Markdown	$125	$150	$1,503
PDF text	$186	$223	$2,232
HTML	$213	$256	$2,556

Savings (Markdown vs. PDF):

GPT-4o: $61/year
Claude 3.5 Sonnet: $73/year
GPT-4: $729/year

At higher volumes (5,000+ documents/month) or with more expensive models, the savings scale proportionally. For teams processing thousands of documents daily through GPT-4-class models, the annual savings reach tens of thousands of dollars, just from changing the input format.
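
To see how the savings scale with volume, the sketch below annualizes the per-document difference between raw PDF text and Markdown. It assumes every document is about the size of the benchmark report, which real workloads will not match exactly. `annual_savings` is an illustrative helper, not part of any tool.

```python
def annual_savings(docs_per_month, price_per_million,
                   baseline_tokens=12_400, markdown_tokens=8_350):
    """Yearly input-cost savings (USD) from sending Markdown instead of raw PDF text."""
    saved_tokens = (baseline_tokens - markdown_tokens) * docs_per_month * 12
    return saved_tokens * price_per_million / 1_000_000

for volume in (500, 5_000, 50_000):
    gpt4o = annual_savings(volume, 2.50)
    gpt4 = annual_savings(volume, 30.00)
    print(f"{volume:>6,} docs/month: GPT-4o ${gpt4o:,.0f}/yr, GPT-4 ${gpt4:,.0f}/yr")
```

Because the calculation is linear in document count, ten times the volume yields ten times the savings; only the per-document token difference and the model price change the slope.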

For a deeper analysis of token cost optimization strategies, see our guide on how to cut AI token costs by 40%.

Retrieval Accuracy (RAG)

We built a RAG pipeline using the same document in each format:

Chunking: 512-token chunks with 50-token overlap (character-count for PDF/text, heading-aware for Markdown)
Embeddings: OpenAI text-embedding-3-small
Vector store: In-memory cosine similarity
Retrieval: Top-5 chunks
Evaluation: 20 factual questions with known correct answers; accuracy = correct answer in retrieved chunks
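
The retrieval step can be sketched in a few lines of Python. This is a simplified stand-in for the pipeline described above, not the benchmark code: the vectors are tiny hand-written examples in place of real text-embedding-3-small embeddings, and `top_k` and the chunk names are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=5):
    """Return (chunk_id, score) for the k chunks most similar to the query."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings"; real ones have far more dimensions.
chunks = {
    "revenue-table": [0.9, 0.1, 0.0],
    "regional-comparison": [0.7, 0.6, 0.1],
    "footnotes": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

for chunk_id, score in top_k(query, chunks, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

A question counts as correctly retrieved when the chunk holding its answer appears in the returned list.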

Format	Retrieval Accuracy	vs. PDF (relative)
Clean Markdown	85%	+63%
HTML (tags stripped)	68%	+31%
DOCX extraction	60%	+15%
Raw PDF text	52%	baseline

Markdown's advantage comes from two compounding factors:

Better chunking — Heading-aware chunking produces semantically coherent chunks. Each chunk contains a complete section rather than an arbitrary text slice.
Cleaner embeddings — Without formatting noise, embedding models produce vectors that more accurately represent the chunk's semantic content.
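
A minimal sketch of heading-aware chunking in Python illustrates the first point. It is not the chunker used in the benchmark: it splits only at ATX headings (lines starting with one to six `#` characters), ignores the 512-token limit and overlap that a real pipeline would apply to oversized sections, and does not skip `#` lines inside fenced code blocks. The sample document is invented.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")

def chunk_by_heading(markdown):
    """Split Markdown into (heading, body) chunks, one per heading."""
    chunks = []
    heading, body = None, []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if heading is not None or any(l.strip() for l in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    # Flush the final section (or heading-less text).
    if heading is not None or any(l.strip() for l in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks

sample = """# Annual Report

Intro paragraph.

## Revenue

| Region | Q1 |
|--------|----|
| EMEA   | 10 |

## Outlook

Growth is expected.
"""

for heading, body in chunk_by_heading(sample):
    print(heading, "->", len(body), "characters")
```

Each chunk keeps its table or paragraph together with the heading that introduces it, which is what a fixed character-count split cannot guarantee.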

The HTML result is interesting: even after stripping tags, HTML-derived text retains some structural artifacts that help retrieval. But Markdown's explicit, minimal structure still outperforms HTML by 17 percentage points (85% vs. 68%).

For a complete guide to building Markdown-first RAG pipelines, see why Markdown is the best format for RAG.

Response Quality

We asked GPT-4o 20 factual questions about the document (with the full document as context in each format) and had human evaluators score each response on three dimensions:

Dimension	Markdown	DOCX	PDF text	HTML
Accuracy (1-5)	4.7	4.2	3.4	4.0
Completeness (1-5)	4.5	4.0	3.1	3.8
Formatting (1-5)	4.6	4.1	2.9	3.7
Average	4.6	4.1	3.1	3.8

Key findings:

PDF text scored lowest across all dimensions. Table-related questions were the worst — the model frequently extracted wrong numbers or reported it couldn't find the data. Three of the 20 responses contained outright hallucinated figures that appeared plausible but didn't exist in the source document.
Markdown scored highest across all dimensions. Every table question was answered correctly. Section references were accurate. The model produced well-structured responses that matched the document's organization.
HTML performed surprisingly well on accuracy but lost points on formatting and completeness. The model would sometimes truncate responses, likely because the HTML overhead consumed context window space that would otherwise be available for longer answers.
DOCX was a solid middle ground — better than PDF, worse than Markdown. Most formatting artifacts in DOCX extraction are minor (extra whitespace, occasional style remnants) rather than structure-destroying.

Why Markdown Wins

Across all four dimensions — token count, cost, retrieval accuracy, and response quality — Markdown outperforms every other common document format. The reasons are structural:

Minimal syntax overhead. Markdown uses the fewest characters possible to encode document structure. Every character serves a purpose — there's no boilerplate, no redundancy, no formatting-only content.

Explicit structure. Heading hierarchy (#, ##, ###) maps directly to document organization. Tables use unambiguous pipe syntax. Lists use clear indent-based hierarchy. There's no ambiguity for parsers, embedding models, or LLMs to resolve.

Universal LLM training data. Major LLMs are trained on large amounts of Markdown (GitHub, documentation sites, Reddit, etc.). Markdown is arguably the format LLMs understand best natively.

Clean chunking boundaries. For RAG, Markdown's heading structure provides natural split points that align with semantic boundaries. No other format provides this without additional parsing.

How to Convert Your Documents

For quick, individual conversion: Use MDSpin to convert PDFs, DOCX, PPTX, and other formats to Markdown in seconds. Zero setup, zero installation.

For developer pipelines: Integrate a conversion step before your LLM calls. Options include MDSpin for web-based conversion, MarkItDown for Python-based pipelines, or Docling for complex PDF processing. See our full comparison of conversion tools.

For automation workflows: MDSpin's Make.com integration (coming soon) will automate document conversion inside no-code scenarios.

For a comprehensive overview of preprocessing strategies across all document types and AI use cases, read our guide on document preprocessing for AI.

The Bottom Line

Document format isn't a minor implementation detail. It's a core input variable that affects cost, performance, and quality across your entire AI stack.

Markdown wins the benchmark decisively. The conversion step takes seconds and pays dividends on every API call, every retrieval query, and every AI-generated response.

Try MDSpin free at mdspin.app — see the token reduction for your own documents.
