DeepSeek launches 3‑billion‑parameter vision‑language model

DeepSeek has open‑sourced a 3‑billion‑parameter vision‑language model built for OCR and structured conversion, with native and dynamic tiling modes, Markdown prompts, and vLLM/Transformers support.

Tuesday, October 21, 2025

DeepSeek has released an open‑source, 3‑billion‑parameter vision‑language model (VLM) for optical character recognition (OCR) and document parsing, positioning the system at the junction of OCR and structured conversion.

The model, published on 20 October 2025 with code on GitHub and weights on Hugging Face, is designed to convert complex pages into clean, structured text such as Markdown while keeping compute costs in check.

It ships with example prompts such as “Convert the document to markdown” and supports both Transformers and vLLM inference for batch processing and PDF workflows.
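
As a rough illustration of what a batch workflow around that prompt might look like, the sketch below pairs each page image with the prompt text and groups the requests into fixed‑size batches. It is not DeepSeek's API: `PageRequest`, `batched_requests` and the file names are hypothetical, and the exact prompt template (including any special tokens) is defined in the repository rather than here.

```python
from dataclasses import dataclass
from typing import Iterator

# Prompt text as quoted in the article; the repository's real template
# may wrap it in additional tokens (assumption: plain text only here).
PROMPT = "Convert the document to markdown"


@dataclass(frozen=True)
class PageRequest:
    """One unit of work: a rendered page image plus the prompt to run on it."""
    image_path: str
    prompt: str


def batched_requests(image_paths: list[str], batch_size: int) -> Iterator[list[PageRequest]]:
    """Yield lists of at most `batch_size` requests, preserving page order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(image_paths), batch_size):
        chunk = image_paths[start:start + batch_size]
        yield [PageRequest(path, PROMPT) for path in chunk]


# Hypothetical page images, e.g. rendered from a five-page PDF.
pages = [f"page_{i:03d}.png" for i in range(1, 6)]
batches = list(batched_requests(pages, batch_size=2))
for batch in batches:
    print([request.image_path for request in batch])
```

Each batch would then be handed to whichever back‑end is in use; that call is left out because its signature depends on the repository's scripts.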

How it works: vision‑as‑compression for documents

DeepSeek’s release refers to OCR as “contexts optical compression”: images are encoded into a small set of vision tokens that the language model then decodes to text, allowing long documents to be handled with fewer tokens.

The repository details native‑resolution modes—Tiny (64 tokens), Small (100), Base (256) and Large (400)—plus a dynamic “Gundam” tiling mode for dense pages. The open materials include runnable scripts, prompt templates and guidance for PDF throughput using vLLM.
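
To put those token counts in perspective, here is a back‑of‑the‑envelope sketch of the vision‑token budget per mode. It assumes one image per page, and the figure of 1,000 text tokens per page is an illustrative guess, not a number from the release. The function names are hypothetical, and the Gundam mode is left out because the article gives no token count for it.

```python
# Vision tokens per image for each native-resolution mode, as listed above.
NATIVE_MODE_TOKENS = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}


def vision_token_budget(num_pages: int, mode: str) -> int:
    """Total vision tokens for a document, assuming one image per page."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * NATIVE_MODE_TOKENS[mode]


def compression_ratio(text_tokens_per_page: int, mode: str) -> float:
    """Text tokens represented per vision token on a single page."""
    return text_tokens_per_page / NATIVE_MODE_TOKENS[mode]


# Illustrative inputs: a 50-page document with roughly 1,000 text tokens per page.
for mode in NATIVE_MODE_TOKENS:
    budget = vision_token_budget(50, mode)
    ratio = compression_ratio(1000, mode)
    print(f"{mode:>5}: {budget:>6} vision tokens, about {ratio:.1f}x compression")
```

A higher ratio means fewer tokens for the language model to process, but this arithmetic says nothing about how accurately the text is recovered at each setting.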

The model is presented first and foremost as a tool for structured document conversion. Out‑of‑the‑box prompts target Markdown extraction, while additional examples cover figure parsing and layout‑aware OCR, indicating a focus on end‑to‑end page understanding rather than plain text transcription.

Licence, availability and paper

DeepSeek‑OCR has been released under an MIT licence, with instructions for CUDA 11.8/PyTorch 2.6 environments and both Transformers and vLLM back‑ends.

The GitHub repository links to a technical paper alongside the model card on Hugging Face, which lists the model at 3B parameters.

Why this matters:

Token efficiency: By compressing long contexts into vision tokens, the approach aims to reduce cost while preserving page structure—key for invoices, tables, equations and forms.
Developer readiness: Example prompts, PDF pipelines and vLLM acceleration could make it practical to integrate into production conversion flows.
Open ecosystem: Code and weights are public, enabling independent benchmarking and rapid iteration by the document‑AI community.
