MDSpin is a web-based tool that converts documents (PDF, DOCX, PPTX, and more) into clean, AI-ready Markdown. It requires zero setup — just drop a file and get structured Markdown output optimized for ChatGPT, Claude, Gemini, and RAG pipelines.

How do I convert a PDF to markdown for AI?

Go to mdspin.app, drag and drop your PDF onto the converter, and click Spin. MDSpin extracts the content and converts it to clean Markdown that preserves headings, tables, and lists. Copy the output and paste it into ChatGPT, Claude, or any other AI tool for better responses.

Does markdown reduce AI token costs?

Yes. Converting documents to Markdown before sending them to AI can reduce token usage by 30-50% compared to raw PDF text or HTML. Markdown strips formatting overhead while preserving all meaningful content structure, which means lower API costs and faster inference.

What file formats does MDSpin support?

MDSpin supports 8 document formats: PDF, DOCX, DOC, PPTX, Google Slides, Apple Pages, RTF, and TXT. Each format is converted to clean, structured Markdown with headings, tables, and lists preserved.

Is MDSpin free to use?

MDSpin is free during the beta period. You can convert documents without any cost. Sign up for a free account to save your conversion history and get higher rate limits.

How does MDSpin compare to MarkItDown?

Microsoft MarkItDown is a Python library that requires installation, coding knowledge, and command-line usage. MDSpin provides the same core document-to-Markdown conversion through a zero-setup web interface — no Python, no CLI, no deployment. MDSpin also shows conversion impact metrics like token reduction and cost savings that MarkItDown does not provide.

All guides

guideragretrievalai-engineering

Why Markdown is the Best Format for RAG Pipelines

MDSpin TeamMarch 26, 20268 min read

Your RAG pipeline retrieves the wrong chunks. The LLM gets irrelevant context. The user gets a bad answer. You blame the embedding model, the chunk size, the retrieval algorithm.

But the problem started earlier — when the document was ingested. The format you use to feed documents into your RAG pipeline determines the quality ceiling for everything downstream.

What RAG Does (and Where Format Fits In)

Retrieval-Augmented Generation works in three stages:

Ingest — Documents are split into chunks, each chunk is converted to a vector embedding, and embeddings are stored in a vector database.
Retrieve — When a user asks a question, the query is embedded and the most similar document chunks are retrieved.
Generate — The retrieved chunks are injected into the LLM's context window alongside the query, and the model generates an answer grounded in the retrieved content.

Document format affects all three stages. During ingestion, format determines how well chunks capture coherent semantic units. During retrieval, format affects embedding quality and similarity matching. During generation, format determines how efficiently the context window is used and how well the LLM can parse the retrieved content.

Most RAG tutorials focus on chunk size, overlap, embedding models, and retrieval algorithms. Almost none address the preprocessing step — converting source documents into a format that makes the rest of the pipeline work better.

The Format Comparison

PDF

PDF is the most common source format and the worst for RAG pipelines.

PDF text extraction produces a stream of characters with position information stripped away. Multi-column layouts get linearized incorrectly — text from the left column merges with text from the right column. Table cells bleed into adjacent cells. Headers and footers repeat on every page and get mixed into body content. Footnotes splice into paragraphs.

The result: chunks that cross semantic boundaries, contain garbled table data, and mix structural elements randomly. Embedding models produce poor vectors from this noisy input, and retrieval suffers accordingly.

HTML

HTML preserves document structure through tags, which is a significant improvement over PDF. Headings are wrapped in <h1> through <h6> tags, tables use <table>, <tr>, <td> elements, and lists use <ul> / <ol>.

The problem is overhead. A typical HTML document carries 2-5x more tokens than necessary due to CSS classes, inline styles, data attributes, script tags, navigation elements, and boilerplate. These extra tokens consume context window space without adding retrievable information. They also dilute embedding quality — the embedding model encodes formatting noise alongside actual content.

Plain Text

Plain text is token-efficient but structureless. There are no headings, no tables, no lists — just a continuous stream of characters. The chunking algorithm has no structural cues to work with, so it falls back to arbitrary character-count splits that cut through sentences, paragraphs, and even words.

Without structure, embedding models can't distinguish between a section heading and body text. Retrieval queries like "What does section 3 say about pricing?" have nothing to match against because the concept of "section 3" doesn't exist in the plain text representation.

Markdown

Markdown hits the sweet spot: explicit structure with minimal token overhead.

Headings (#, ##, ###) create a clear document hierarchy
Tables (| pipe syntax) preserve tabular data unambiguously
Lists (-, 1.) maintain item-level structure
Code blocks (triple backticks) separate code from prose
Links and emphasis use inline syntax that adds negligible tokens

Markdown gives chunking algorithms the structural cues they need, gives embedding models clean semantic content to encode, and gives the LLM clear context to reason about — all while using the fewest tokens possible.

Impact on Chunking

Chunking is where document format has its most direct impact on RAG quality.

Character-Count Chunking (What You're Stuck With for PDF/Text)

When the source format has no structural markers, chunking algorithms split at fixed character counts — 500 characters, 1000 characters, etc. With overlap, adjacent chunks share some text to reduce information loss at boundaries.

The problem: these splits are semantically arbitrary. A 500-character chunk might contain the end of one section and the beginning of another. A table might be split across three chunks, making each chunk individually meaningless. A list might be cut between items.

Structure-Aware Chunking (What Markdown Enables)

With Markdown, chunking algorithms can split at heading boundaries. Each chunk corresponds to a coherent section of the document:

Chunk 1: "## Revenue Overview\n\nThe company reported..."
Chunk 2: "## Regional Breakdown\n\n| Region | Revenue |..."
Chunk 3: "## Year-over-Year Comparison\n\nCompared to..."

Each chunk is a self-contained semantic unit. Tables stay intact within their sections. Lists aren't split mid-item. The heading at the top of each chunk gives the embedding model clear context about what the chunk contains.

This produces meaningfully better embeddings and meaningfully better retrieval.

Impact on Retrieval Accuracy

Better chunks produce better embeddings, which produce better retrieval. The chain is direct.

When a user asks "What was the revenue in Asia Pacific?", the retrieval system needs to find the chunk containing the regional revenue table. With Markdown-based chunking:

The chunk containing ## Regional Breakdown and the complete revenue table is a single, coherent unit
The embedding for this chunk strongly encodes "regional," "revenue," and the table data
The query embedding matches this chunk with high similarity

With PDF-extracted text and character-count chunking:

The revenue table might be split across two chunks, with column headers in one and data rows in another
Adjacent non-table text dilutes the embedding signal
The query has to match against fragmented, noisy chunks

Our testing shows clean Markdown input can improve retrieval accuracy by up to 64% compared to raw PDF text extraction. The improvement comes from two compounding factors: better chunk boundaries and cleaner embedding input.

Impact on Generation Quality

Even after retrieval, format affects the final answer quality.

The retrieved chunks are injected into the LLM's context window. If those chunks contain formatting artifacts, the LLM has to parse through noise to find the answer. If a table is garbled, the LLM may extract the wrong number. If headings are missing, the LLM can't navigate the content structure.

Clean Markdown context means the LLM spends its reasoning capacity on actual content analysis, not input interpretation. The result: more accurate answers, better citations, and fewer hallucinations from misread data.

Markdown also means fewer tokens per retrieved chunk, which leaves more context window space for additional chunks or conversation history. For reducing token costs at scale, this is a significant factor.

Building a Markdown-First RAG Pipeline

Here's the practical implementation:

Step 1: Convert Source Documents to Markdown

Before ingestion, convert all source documents to clean Markdown. This is the preprocessing step that most pipelines skip.

For manual conversion: Use MDSpin to convert PDFs, DOCX, PPTX, and other formats
For automated pipelines: Integrate a conversion step that processes incoming documents before they reach the chunking stage
For Python pipelines: Tools like MarkItDown provide programmatic conversion (see our tool comparison)

Step 2: Chunk at Heading Boundaries

Configure your chunking strategy to split at Markdown headings. Most RAG frameworks (LangChain, LlamaIndex) support Markdown-aware text splitters:

Split on ## headings for section-level chunks
Use ### splits within large sections
Keep tables intact within their parent sections
Set a maximum chunk size as a fallback for very long sections

Step 3: Embed and Store

Use any standard embedding model (OpenAI text-embedding-3-small, Cohere embed-english-v3.0, or open-source alternatives). The cleaner input from Markdown conversion means these models produce higher-quality vectors regardless of which model you choose.

Step 4: Retrieve and Generate

Standard retrieval (cosine similarity, top-k) works better with Markdown-derived chunks because the embeddings are more semantically meaningful. The retrieved context is cleaner, and the LLM can reason about it more effectively.

MDSpin for RAG Preprocessing

MDSpin converts PDF, DOCX, PPTX, and other supported formats to clean Markdown optimized for LLM consumption. For RAG pipelines, this means:

Source documents are converted to a consistent format regardless of original type
Document structure (headings, tables, lists) is preserved for intelligent chunking
Token overhead is minimized, maximizing context window utilization
Conversion is fast enough to process documents at ingestion time

For individual documents, the web converter at mdspin.app handles conversion instantly. For automated pipelines, MDSpin's Make.com integration (coming soon) will enable document conversion as a step in automation workflows.

Start Building Better RAG

The quality ceiling of your RAG pipeline is set at the preprocessing step. If you're ingesting PDFs and HTML directly, you're leaving retrieval accuracy and generation quality on the table.

Convert to Markdown first. The improvement compounds through every stage of the pipeline.

Try MDSpin free at mdspin.app — convert your first document and see the difference.

Related reading:

Back to all guides