MDSpin is a web-based tool that converts documents (PDF, DOCX, PPTX, and more) into clean, AI-ready Markdown. It requires zero setup — just drop a file and get structured Markdown output optimized for ChatGPT, Claude, Gemini, and RAG pipelines.

How do I convert a PDF to markdown for AI?

Go to mdspin.app, drag and drop your PDF onto the converter, and click Spin. MDSpin extracts the content and converts it to clean Markdown that preserves headings, tables, and lists. Copy the output and paste it into ChatGPT, Claude, or any other AI tool for better responses.

Does markdown reduce AI token costs?

Yes. Converting documents to Markdown before sending them to AI can reduce token usage by 30-50% compared to raw PDF text or HTML. Markdown strips formatting overhead while preserving all meaningful content structure, which means lower API costs and faster inference.

What file formats does MDSpin support?

MDSpin supports 8 document formats: PDF, DOCX, DOC, PPTX, Google Slides, Apple Pages, RTF, and TXT. Each format is converted to clean, structured Markdown with headings, tables, and lists preserved.

Is MDSpin free to use?

MDSpin is free during the beta period. You can convert documents without any cost. Sign up for a free account to save your conversion history and get higher rate limits.

How does MDSpin compare to MarkItDown?

Microsoft MarkItDown is a Python library that requires installation, coding knowledge, and command-line usage. MDSpin provides the same core document-to-Markdown conversion through a zero-setup web interface — no Python, no CLI, no deployment. MDSpin also shows conversion impact metrics like token reduction and cost savings that MarkItDown does not provide.

All guides

guidepreprocessingai-engineeringllm

Document Preprocessing for AI: The Complete Guide

MDSpin TeamMarch 25, 202611 min read

Teams spend months on prompt engineering, model selection, and fine-tuning. Then they feed their LLM a raw PDF and wonder why the output is wrong.

Document preprocessing — the step between "file exists" and "AI can reason about it" — is where 80% of AI quality issues originate. It's also the step most teams skip entirely.

This guide covers everything you need to know about preprocessing documents for AI: what goes wrong when you skip it, what best practices look like, and how to implement preprocessing for different AI use cases.

The AI Document Pipeline

Every AI system that processes documents follows the same fundamental pipeline, whether it's a ChatGPT conversation, a RAG-powered knowledge base, or an autonomous AI agent:

Source Document → Extraction & Conversion → Processing → LLM → Output
     (PDF)          (Preprocessing)       (Chunking/    (GPT/Claude)
                                           Embedding)

Source documents arrive in various formats: PDF, DOCX, PPTX, HTML, CSV, RTF, TXT. Each format encodes content differently, with varying amounts of structural information and formatting overhead.

Extraction and conversion is the preprocessing step. This is where the document's content is extracted from its native format and converted into something an LLM can process efficiently. This step determines the quality ceiling for everything downstream.

Processing depends on the use case. For direct chat, the converted text is injected into the context window. For RAG, the text is chunked and embedded. For agents, the text becomes part of the agent's working memory.

LLM processing happens after all the upstream steps. The model can only work with what it receives. If the preprocessing step produces garbled text, every downstream operation inherits that corruption.

The Five Pitfalls of Document Preprocessing

Pitfall 1: Pasting Raw PDF Text

The most common mistake. Someone copies text from a PDF viewer and pastes it into ChatGPT. Or a pipeline uses a basic PDF text extraction library without post-processing.

What happens: PDF stores text as positioned characters, not as structured content. When you extract "text" from a PDF, you get characters in approximate reading order — but multi-column layouts get merged, table cells bleed together, headers and footers repeat on every page, and footnotes splice into body paragraphs.

The AI receives text soup and reasons about it as if it were accurate. The result: wrong numbers extracted from tables, misattributed quotes, hallucinated section boundaries.

Pitfall 2: Using HTML as an Intermediate Format

Some teams export documents to HTML before feeding them to AI, thinking HTML's tags will preserve structure. The structure is preserved — but buried under massive overhead.

A typical HTML document carries 2-5x more tokens than the same content in Markdown. CSS classes, inline styles, data attributes, script blocks, navigation elements, and boilerplate all get tokenized and sent to the LLM. You're paying for tokens that carry zero information.

Worse, HTML noise dilutes embedding quality in RAG pipelines. The embedding model encodes class="article-section-heading text-lg font-bold" right alongside the actual heading text, producing less meaningful vectors.

Pitfall 3: Ignoring Document Structure

Converting to plain text strips all structure. No headings, no tables, no lists — just a continuous character stream. This might seem "clean," but it destroys the semantic information that helps LLMs understand and navigate documents.

When an LLM receives structured Markdown with ## Revenue by Region followed by a table, it can navigate directly to revenue data. When it receives plain text where "Revenue by Region" is just another line with no structural distinction, it has to infer structure from content alone — and it frequently gets this wrong.

For RAG, losing structure means losing the ability to chunk at section boundaries. Character-count chunking is the only fallback, and it produces semantically arbitrary splits that degrade retrieval accuracy.

Pitfall 4: One-Size-Fits-All Conversion

Different document types need different conversion strategies:

PDFs need column detection, table reconstruction, and header/footer stripping
PPTX files need slide-by-slide extraction with speaker notes included
DOCX files need tracked changes and comments stripped while preserving heading hierarchy
CSV files need conversion to properly formatted Markdown tables
HTML files need tag stripping while preserving semantic elements

Using the same extraction method for all formats produces inconsistent results. A PDF parser applied to a DOCX file misses heading levels. A plain text extractor applied to a PPTX file loses slide boundaries.

Pitfall 5: Not Validating Output

Teams build a preprocessing pipeline, run documents through it, and assume the output is correct. Nobody spot-checks what the LLM actually receives.

This leads to silent quality degradation. A table that renders correctly in the original PDF might come through as a jumbled text block. A heading hierarchy that's clear in the DOCX might flatten to a single level. These issues only surface when users complain about bad AI output — and even then, the preprocessing step is rarely the first suspect.

Always validate your conversion output on representative documents before deploying a pipeline.

Best Practices for Document Preprocessing

Convert to Markdown

Markdown is the optimal target format for AI preprocessing. It provides:

Semantic structure — Headings, tables, lists, and emphasis preserved with minimal syntax
Token efficiency — 30-50% fewer tokens than raw PDF text, 50-70% fewer than HTML
Universal compatibility — Every LLM and every framework understands Markdown natively
Reproducibility — Same document produces same Markdown output every time

For a detailed analysis of why Markdown outperforms other formats, see our guide on why Markdown is the best format for RAG pipelines.

Preserve Heading Hierarchy

Heading structure is the single most important structural element for AI processing. It enables:

Section-aware chunking in RAG pipelines
Document navigation in chat contexts
Table of contents generation for long documents
Scope-limited queries ("What does section 3 say about...")

Ensure your conversion preserves #, ##, ### heading levels that match the original document's structure.

Maintain Table Structure

Tables contain some of the highest-value information in business documents — financial data, comparison matrices, specifications, schedules. When table structure is lost, the AI can't reliably extract data from specific cells.

Use Markdown's pipe-delimited table syntax:

| Region | Q3 Revenue | YoY Growth |
|--------|-----------|------------|
| NA     | $2.1M     | 23%        |
| EU     | $1.4M     | 18%        |

This is unambiguous. The AI knows exactly which number belongs to which region and which metric.

Strip Non-Content Elements

Remove elements that exist for printing or display but carry no semantic content:

Page headers and footers (page numbers, document titles repeated on every page)
Watermarks
Print margins and layout metadata
Font specifications and styling information
Revision history and edit tracking metadata

These elements inflate token counts without helping the AI understand the content.

Handle Images Thoughtfully

Most document conversion pipelines drop images entirely. For text-focused AI processing, this is usually acceptable. But if images contain important information (diagrams, charts, screenshots with text), consider:

Using OCR to extract text from images
Adding alt-text descriptions where available
Noting where images appeared in the document flow

For most business document preprocessing, dropping images is the right default — but validate that no critical information is image-only.

Validate Before Processing

Build validation into your pipeline:

Spot-check converted output for representative documents
Compare heading counts between source and output
Verify table dimensions match the original
Check that document length is reasonable (dramatic length changes indicate extraction issues)
Test with your actual LLM to confirm response quality

Format-Specific Preprocessing Tips

PDF

Watch for multi-column layouts — columns must be linearized into reading order
Detect and strip repeating headers and footers
Reconstruct tables from positioned text elements
Handle scanned PDFs with OCR if needed
Be aware that some PDFs use images for text (common in scanned documents)

DOCX / DOC

Generally the cleanest extraction source after plain text
Strip tracked changes and comments
Preserve heading styles as Markdown heading levels
Convert embedded tables to Markdown table syntax
Handle nested lists and mixed formatting

PPTX

Extract slide-by-slide with clear separators
Include speaker notes — they often contain the most detailed information
Preserve bullet hierarchy within slides
Note that slides are inherently non-linear; conversion creates a linear document

HTML

Strip all CSS, JavaScript, and non-semantic markup
Preserve semantic elements: headings, tables, lists, links
Remove navigation, sidebars, ads, and boilerplate
Convert relative links to absolute where needed

CSV

Convert to Markdown tables with proper header rows
Handle quoted fields containing commas
Detect and label column types where possible

RTF

Strip RTF control codes entirely
Preserve structural formatting (bold, headings, lists) as Markdown syntax
Handle legacy encoding properly

For details on how MDSpin handles each format, see our supported formats page.

Tool Comparison for Document Preprocessing

Tool	Type	Best For	Setup Required	Formats
MDSpin	Web app	Knowledge workers, quick conversion	None	PDF, DOCX, PPTX, + 5 more
MarkItDown	Python library	Developer pipelines	Python + CLI	15+ formats
Docling	Python library	Complex PDFs with tables/equations	Python + compute	PDF-focused
Unstructured	API platform	Enterprise-scale processing	API integration	Multiple
Firecrawl	API	Web content to Markdown	API integration	URLs/HTML

For a detailed comparison, read our full analysis of MDSpin vs. MarkItDown, Docling, and other converters.

Preprocessing for Different AI Use Cases

Chat (ChatGPT, Claude, Gemini)

For direct chat use, preprocessing means converting the document to Markdown before pasting it into the conversation.

The conversion is simple and the impact is immediate: the AI receives clean, structured input and produces more accurate responses. For a step-by-step walkthrough, see our guide on converting PDFs for ChatGPT, Claude, and Gemini.

RAG Pipelines

For RAG, preprocessing is the first step in the ingestion pipeline. Convert all source documents to Markdown before chunking and embedding. This enables structure-aware chunking at heading boundaries, produces higher-quality embeddings, and maximizes context window utilization during generation.

See our dedicated guide on why Markdown is the best format for RAG pipelines.

AI Agents

AI agents that process documents autonomously need consistent, reliable input. When an agent's document parser produces inconsistent results, the agent's behavior becomes unpredictable. Markdown conversion provides a consistent preprocessing layer that standardizes document input regardless of source format.

API Integrations

For teams calling LLM APIs programmatically, preprocessing reduces cost per request and improves response quality. The token cost savings compound with volume — at thousands of API calls per day, formatting overhead becomes a significant line item.

Summarization

Document summarization quality depends on the AI's ability to understand document structure. With Markdown, the LLM can produce section-aware summaries that follow the document's organizational logic, rather than treating the entire document as an undifferentiated text block.

Building an Automated Preprocessing Pipeline

For Individuals

Use MDSpin to convert documents on demand. The web converter handles conversion in seconds with zero setup.

For Teams

Standardize your team's AI workflows around Markdown input. MDSpin's Make.com integration (coming soon) will automate conversion inside no-code automation scenarios — every document gets preprocessed before it reaches an AI module.

For Developers

Build a conversion step into your data pipeline:

Incoming document is received (via upload, email, API, etc.)
Document format is detected
Conversion to Markdown is applied using the appropriate method for the format
Clean Markdown output is passed to the next pipeline stage
Validation confirms the output is structurally sound

This step adds seconds to your pipeline but compounds in quality improvement across every downstream operation.

Start Preprocessing

The gap between "documents exist" and "AI can reason about them accurately" is a preprocessing problem. Most teams skip this step and pay for it in lower AI quality, higher token costs, and more user complaints.

The fix is straightforward: convert to Markdown before processing. The improvement is measurable at every stage of the pipeline.

Try MDSpin free at mdspin.app — convert your first document and see what your AI has been missing.

Related reading:

Back to all guides