DeepSeek Coder Guide

What is this about?

This guide is the model-centric companion to the Cursor + DeepSeek + VS Code Guide. That guide is about the editor setup. This one is about the model itself: what DeepSeek Coder is, the three realistic ways to host it, and how to use it well once it is running.

Important context

The facts in section 1 are taken from the official DeepSeek Coder project page at deepseekcoder.github.io, checked on June 22, 2026. That page documents the original DeepSeek Coder release. Where this guide goes beyond the page (hosting commands, usage advice), I mark it as practice or inference, and I avoid quoting numbers for newer models I cannot verify from the linked source.

1. What DeepSeek Coder is

DeepSeek Coder is a family of open-weight code models published by DeepSeek. According to the official project page, the release includes:

1.1 Variants

Parameter sizes: 1.3B, 5.7B, 6.7B, and 33B
Two flavors of each:
- Base — pretrained, best for raw completion and fill-in-the-middle
- Instruct — instruction-tuned, best for chat-style "do this for me" prompts

1.2 Training and capabilities

Trained from scratch on 2 trillion tokens
Composition: 87% code, 13% natural language (English and Chinese)
Coverage of 80+ programming languages
16K token context window, which the page frames as enabling project-level code completion and infilling (not just single-function snippets)

1.3 Benchmarks (as stated on the page)

The page compares the 33B base model against CodeLlama-34B:

Benchmark	DeepSeek-Coder-Base-33B lead over CodeLlama-34B
HumanEval (Python)	+7.9%
HumanEval (Multilingual)	+9.3%
MBPP	+10.8%
DS-1000	+5.9%

It also states that DeepSeek-Coder-Instruct-33B surpasses GPT-3.5-turbo on HumanEval. Treat these as the vendor's own published figures, not as independent third-party results.

1.4 License and access

Described on the page as open source and free for research and commercial use
Weights: huggingface.co/deepseek-ai
Code and exact usage formats: github.com/deepseek-ai/DeepSeek-Coder
Hosted chat: coder.deepseek.com

Always re-check the license before commercial use

"Free for commercial use" is what the project page states, but model licenses can carry conditions and can change between releases. Read the license file shipped with the specific weights you download before you ship anything on top of it.

Newer models exist — verify their specs yourself

The linked page covers the original DeepSeek Coder. DeepSeek has since published newer code-capable models (e.g. a DeepSeek-Coder-V2 line and general V-series chat/reasoner models). I am deliberately not quoting their parameter counts or benchmarks here, because the linked source does not contain them. For anything newer, check the DeepSeek-Coder repository and the deepseek-ai Hugging Face org directly.

2. The three ways to host it

In practice there are three realistic modes. Pick by how much control and scale you need.

Mode	What it is	Best when
Local runtime (Ollama / LM Studio)	One-machine inference, minimal setup	Single developer, laptop/workstation, "just works"
Self-hosted server (vLLM / TGI)	OpenAI-compatible API on your GPU box	A team, larger models, many tools sharing one backend
DeepSeek API	DeepSeek's own hosted, token-billed API	No hardware, fastest start, and it is acceptable for your code to leave your network

The rest of this section walks each one. For the editor wiring (Continue in VS Code, Cursor's free tier) see the Cursor + DeepSeek + VS Code Guide — this guide does not repeat it.

2.1 Local runtime with Ollama (easiest)

The fastest path for one machine. Install Ollama, then pull and start a DeepSeek Coder model (ollama run downloads the weights on first use, then opens an interactive chat prompt):

ollama run deepseek-coder:6.7b

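The Ollama library lists DeepSeek Coder under these size tags (the library page also lists base, instruct, and quantized variants; check it for exact names):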

deepseek-coder:1.3b
deepseek-coder:6.7b
deepseek-coder:33b

Ollama serves an HTTP endpoint on http://localhost:11434 and also exposes an OpenAI-compatible path at http://localhost:11434/v1, so most tools and SDKs can talk to it without custom code.

Quick smoke test

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-coder:6.7b",
    "messages": [{"role": "user", "content": "Write a Python function that reverses a linked list."}]
  }'

If you get JSON back with a code block in it, your local model is live.
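If you would rather script the same check, here is a minimal Python sketch using only the standard library. It assumes Ollama's default OpenAI-compatible base URL from above and the OpenAI-style response shape (choices[0].message.content); the helper names are illustrative, not part of any library. Because it only changes the base URL, the same code works against the other OpenAI-compatible backends below.

```python
import json
import urllib.request

# Ollama's OpenAI-compatible base URL on its default port (see above).
OLLAMA_BASE_URL = "http://localhost:11434/v1"


def build_chat_request(model, prompt, base_url=OLLAMA_BASE_URL):
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion response."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


def chat(model, prompt, base_url=OLLAMA_BASE_URL, timeout=120.0):
    """Send one prompt and return the reply. Requires a running server."""
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(response.read())


# Usage, with Ollama running and the model pulled:
# print(chat("deepseek-coder:6.7b", "Write a Python function that reverses a linked list."))
```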

2.2 Self-hosted server with vLLM (scales)

When one laptop is not enough — bigger models, a shared team backend, or routing many tools to one endpoint — use vLLM, which exposes an OpenAI-compatible server:

vllm serve deepseek-ai/deepseek-coder-6.7b-instruct

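That gives you an endpoint like http://your-server:8000/v1 that behaves like the OpenAI API. Any OpenAI-compatible client (VS Code extensions, Codex-style CLIs configured with a custom base URL, your own apps) can point at it. Tools built on a different API format, such as Claude Code's Anthropic-style Messages API, need a translation proxy in front.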
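The "model" field in each request has to match a name the server actually serves. With vllm serve that is, by default, the Hugging Face repo ID you passed on the command line (unless you override it with --served-model-name). vLLM answers GET /v1/models in the OpenAI format, and so do recent Ollama versions, so a small helper can tell you which names to use. A minimal standard-library sketch; "your-server" is the same placeholder host as in the text above.

```python
import json
import urllib.request


def parse_model_ids(response_body):
    """Return model IDs from an OpenAI-style GET /v1/models response body."""
    data = json.loads(response_body)
    return [entry["id"] for entry in data.get("data", [])]


def list_models(base_url, timeout=10.0):
    """Ask an OpenAI-compatible server which model names it accepts."""
    url = base_url.rstrip("/") + "/models"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_model_ids(response.read())


# Usage against the vLLM server above ("your-server" is a placeholder):
# print(list_models("http://your-server:8000/v1"))
# By default this should include "deepseek-ai/deepseek-coder-6.7b-instruct".
```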

Inference, not a hardware guarantee

The 33B model needs serious GPU memory (think a high-VRAM card or multi-GPU, especially at higher precision). The 6.7B model is the realistic "runs on a single decent GPU" choice. Exact VRAM depends on quantization and context length — benchmark on your own hardware rather than trusting a table.
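You can still get a lower bound from arithmetic: parameters times bits per weight, divided by 8, gives bytes for the weights. The sketch below does only that. It deliberately ignores the KV cache (which grows with context length), activations, and runtime overhead, and it treats bit widths as nominal (real 4-bit formats also store scale data). For example, 6.7B at 16-bit is about 13.4 GB of weights and 33B at 16-bit about 66 GB, which is why 6.7B is the single-GPU choice. Use it to rule options out, not to size hardware.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    This is a lower bound: KV cache, activations and runtime overhead come on top.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Print a rough weights-only table for the DeepSeek Coder sizes discussed here.
for params in (1.3, 6.7, 33):
    row = ", ".join(
        f"{bits}-bit ~{weight_memory_gb(params, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{params}B: {row}")
```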

2.3 DeepSeek's hosted API (no hardware)

If you do not want to run anything, DeepSeek offers a paid, OpenAI-compatible API. This is the least private option — your prompts and code leave your network — so weigh it against your data-handling requirements before sending anything sensitive.
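On the client side, the main difference from the local modes is authentication: OpenAI-compatible hosted APIs expect an API key sent as a bearer token. A minimal sketch, assuming you take the base URL and model name from DeepSeek's API documentation (they are parameters here, not values this guide asserts) and keep the key in an environment variable; DEEPSEEK_API_KEY is a hypothetical name chosen for illustration.

```python
import json
import os
import urllib.request


def api_key_from_env(var_name="DEEPSEEK_API_KEY"):
    """Read the API key from an environment variable (the name is hypothetical)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"set {var_name} before calling the hosted API")
    return key


def build_hosted_chat_request(base_url, model, prompt, api_key):
    """Same body as a local request, plus an OpenAI-style bearer token header."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Usage (BASE_URL and MODEL come from DeepSeek's API docs):
# request = build_hosted_chat_request(BASE_URL, MODEL, "Explain this stack trace: ...", api_key_from_env())
# Send it with urllib.request.urlopen(request), as with any OpenAI-compatible call.
```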

3. Which variant should you actually run?

Two independent choices: size and base vs instruct.

3.1 Pick a size

Inference from model size, not a vendor SLA

Size	Use it for
1.3B	Smoke tests, weak laptops, autocomplete-only, very low latency
6.7B	The realistic local baseline — good quality, modest hardware
33B	Highest quality, needs strong GPU hardware or a server

If you are unsure, start at 6.7B and only move up if quality is the bottleneck, or down if latency is.

3.2 Base vs Instruct — this matters

This is the part people get wrong.

Use -instruct when you talk to the model in chat: "refactor this", "explain this stack trace", "write tests for this function". This is what you want for an editor chat panel or an agent.
Use -base when you want raw completion or fill-in-the-middle with no conversational wrapper — inline autocomplete, code infilling, or your own custom prompting harness.

A common strong setup is two models at once: a small base model for fast inline completion, and a larger instruct model for the chat/agent panel.
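One way to wire that two-model setup is to route by task. Inline completion goes to a base model through the raw /completions endpoint with a plain prompt. Chat goes to an instruct model through /chat/completions with a message list. As far as I know, both Ollama and vLLM expose both endpoints in OpenAI-compatible form. The sketch below only builds the path and JSON body. The Ollama tags are assumptions (deepseek-coder:1.3b-base in particular), so confirm the exact names on the Ollama library page or with ollama list.

```python
# Assumed Ollama tags: check the Ollama library page or `ollama list` for exact names.
BASE_COMPLETION_MODEL = "deepseek-coder:1.3b-base"  # small base model: raw completion
INSTRUCT_CHAT_MODEL = "deepseek-coder:6.7b"         # instruct model: chat / agent panel


def route_request(task, text):
    """Return (endpoint_path, json_body), matching the model type to the task."""
    if task == "autocomplete":
        # Base model: no chat wrapper, just a prompt to continue.
        return "/completions", {"model": BASE_COMPLETION_MODEL, "prompt": text}
    if task == "chat":
        # Instruct model: conversational message list.
        return "/chat/completions", {
            "model": INSTRUCT_CHAT_MODEL,
            "messages": [{"role": "user", "content": text}],
        }
    raise ValueError(f"unknown task: {task!r}")


# Usage:
# path, body = route_request("chat", "Write tests for this function: ...")
```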

4. Getting the best results

4.1 Use fill-in-the-middle (FIM) for completion

The official page explicitly calls out infilling and project-level completion as core capabilities — the model was trained to fill a gap given code on both sides of the cursor, not just to continue from the end. This is exactly what makes it strong for inline autocomplete.

Use the exact FIM token format from the repo

FIM only works if you wrap the prefix/suffix/hole in DeepSeek Coder's specific special tokens. The exact token strings are defined in the official repository — copy them from there rather than from memory, because getting the delimiters wrong silently degrades output. Most editor extensions (e.g. Continue) handle this for you when you select the right template; you mainly need to care if you build your own completion harness.
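If you do build your own harness, the prompt assembly itself is simple string work: split the file at the cursor, then wrap prefix and suffix in the FIM tokens. In this sketch the token strings are deliberately left as inputs, to be copied from the repository as the note above says; the placeholders in the usage comment are not the real tokens. The begin, prefix, hole, suffix, end ordering is my reading of the repo's infilling example, so confirm that there too. The resulting string goes to a base model as a raw completion prompt, and the model's output is the text for the hole.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FimTokens:
    """FIM delimiters. Copy the exact strings from the DeepSeek-Coder repo."""
    begin: str
    hole: str
    end: str


def split_at_cursor(source, cursor):
    """Split file text at a character offset into (prefix, suffix)."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor offset is outside the file")
    return source[:cursor], source[cursor:]


def build_fim_prompt(prefix, suffix, tokens):
    """Assemble begin + prefix + hole + suffix + end; the model fills the hole."""
    return f"{tokens.begin}{prefix}{tokens.hole}{suffix}{tokens.end}"


# Usage with PLACEHOLDER tokens (replace with the real ones from the repo):
# tokens = FimTokens(begin="<BEGIN>", hole="<HOLE>", end="<END>")
# prefix, suffix = split_at_cursor(file_text, cursor_offset)
# prompt = build_fim_prompt(prefix, suffix, tokens)
```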

4.2 Feed it real context

The 16K window is enough for multi-file context, which is where DeepSeek Coder is meant to shine. Practical wins:

Paste the actual surrounding file(s), not just the one function.
Include relevant type definitions, interfaces, and call sites so it matches your codebase instead of inventing APIs.
For "write tests for X", give it both X and an existing test so it copies your testing conventions.
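A small packing helper makes this repeatable. Put files in priority order (the file being edited, then types and call sites, then an example test), label each with its path, and stop before the window is full so the model has room to answer. This sketch rests on two assumptions: the 16K window is taken as 16,384 tokens, and tokens are estimated at roughly four characters each, a generic heuristic rather than DeepSeek Coder's tokenizer. For precise budgets, count with the model's own tokenizer.

```python
CONTEXT_WINDOW = 16_384  # assumption: "16K" read as 16,384 tokens
CHARS_PER_TOKEN = 4      # rough heuristic, NOT the model's real tokenizer


def estimate_tokens(text):
    """Crude token estimate: characters divided by CHARS_PER_TOKEN, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def pack_context(files, reserve_for_answer=2_000, window=CONTEXT_WINDOW):
    """Concatenate (path, text) pairs in priority order until the budget runs out.

    Stops at the first file that would not fit, so lower-priority files are
    dropped instead of higher-priority ones being truncated.
    """
    budget = window - reserve_for_answer
    parts = []
    used = 0
    for path, text in files:
        chunk = f"=== {path} ===\n{text}\n"
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        parts.append(chunk)
        used += cost
    return "".join(parts)


# Usage (paths and variables are illustrative):
# context = pack_context([
#     ("src/orders.py", orders_src),
#     ("src/models.py", models_src),
#     ("tests/test_orders.py", example_test_src),
# ])
```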

4.3 Prompt the instruct model like a senior

Be explicit about language, framework version, and constraints ("PHP 8.3, Laravel 11, no Eloquent raw queries").
Ask for one thing per turn; long multi-task prompts get muddier output than a focused sequence.
When it drifts from your conventions, correct with a concrete example rather than restating the rule abstractly.
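The same advice can be expressed as message construction: state the stack and constraints up front in one focused request, and when the output drifts, send a follow-up that shows the preferred form instead of restating the rule. The helper names, wording, and task below are illustrative, not a required format; the stack and constraint reuse the example above.

```python
def constrained_request(task, stack, rules):
    """One focused chat turn: stack and constraints first, then a single task."""
    lines = [f"Stack: {stack}.", "Constraints:"]
    lines += [f"- {rule}" for rule in rules]
    lines += ["", f"Task: {task}"]
    return {"role": "user", "content": "\n".join(lines)}


def correction_by_example(got, wanted):
    """Follow-up turn that corrects drift with a concrete before/after example."""
    content = (
        "That doesn't match our conventions. Instead of:\n"
        f"{got}\n"
        "write it like this:\n"
        f"{wanted}\n"
        "Apply the same pattern everywhere in your answer."
    )
    return {"role": "user", "content": content}


# Usage (hypothetical task; stack and constraint from the example above):
messages = [
    constrained_request(
        "Add a repository method that returns orders placed in the last 7 days.",
        "PHP 8.3, Laravel 11",
        ["no Eloquent raw queries"],
    )
]
# Send messages, append the assistant reply, and if it used a raw query:
# messages.append(correction_by_example(raw_query_snippet, query_builder_snippet))
```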

4.4 Tune the sampling

For code, keep temperature low (≈0–0.3). High temperature buys you creative prose and buggy code.
Set a stop sequence if you are completing inside a larger structure, so it does not run past the boundary you care about.
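In request terms, both settings are ordinary fields on an OpenAI-compatible completion call: temperature and stop. Below is a minimal payload builder that defaults to a low temperature. The 0 to 0.3 check is a deliberately strict guardrail encoding the guideline above, not a server limit; max_tokens is a standard OpenAI-style field whose default here is arbitrary. The usage comment assumes the same hypothetical base-model tag as elsewhere and uses stop strings that end a Python function body before the next top-level definition.

```python
def low_temp_completion(model, prompt, temperature=0.2, stop=None, max_tokens=256):
    """Body for an OpenAI-compatible /completions call tuned for code."""
    # Guardrail for the low-temperature guideline; relax it if you have a reason.
    if not 0.0 <= temperature <= 0.3:
        raise ValueError("for code, keep temperature between 0 and 0.3")
    body = {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if stop:
        # The server stops generating before emitting any of these strings.
        body["stop"] = list(stop)
    return body


# Usage: complete one function body without running into the next definition.
# body = low_temp_completion("deepseek-coder:1.3b-base", "def parse_date(s):\n",
#                            stop=["\ndef ", "\nclass "])
```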

4.5 Always review the output

DeepSeek Coder will produce confident, plausible, and sometimes wrong code — like every code model. The 87%-code training mix makes it fluent, not infallible. Treat every suggestion as a draft to review, especially around security-sensitive paths, edge cases, and anything touching real data.
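Automated checks cannot replace that review, but a cheap first filter catches the worst drafts before a human reads them: pull the code blocks out of a chat reply and make sure they at least parse. A minimal sketch for Python output only, using the standard ast and re modules. It says nothing about whether the code is correct or safe, so tests and reading still apply. The function names are illustrative.

```python
import ast
import re

FENCE = "`" * 3  # Markdown code-fence marker, built at runtime
CODE_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"([\w+-]*)[ \t]*\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_python_blocks(reply):
    """Return bodies of fenced blocks tagged python/py or left untagged."""
    return [
        body
        for lang, body in CODE_BLOCK_RE.findall(reply)
        if lang.lower() in ("", "python", "py")
    ]


def syntax_problem(code):
    """Return a short description of the first syntax error, or None if it parses."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


# Usage:
# for block in extract_python_blocks(reply_text):
#     problem = syntax_problem(block)
#     if problem:
#         print("reject draft:", problem)
```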

5. Recommended setups

5.1 Single developer, cheap and local

Ollama + deepseek-coder:6.7b  (instruct for chat)
+ a small base model for inline completion

No token bills, fully local, good daily quality.

5.2 Team, shared backend

vLLM serving a DeepSeek Coder model on a GPU box
→ OpenAI-compatible endpoint
→ every developer's editor + internal tools point at it

One model to maintain, consistent behavior across the team, code never leaves your infrastructure.

5.3 Zero infrastructure

DeepSeek hosted API

Fastest to start; accept that prompts and code leave your network.

6. Bottom line

What it is: an open-weight code-model family (1.3B–33B, base + instruct), 2T-token training, 16K context, strong published code benchmarks, open and free for commercial use per the project page.
How to host it: Ollama for one machine, vLLM for a team server, the DeepSeek API if you want no hardware.
How to use it well: match base-vs-instruct to the task, lean on fill-in-the-middle for completion, feed it real multi-file context, keep temperature low, and review everything.

For the editor side of this story — Cursor's free tier, Continue in VS Code, and the self-hosting trade-offs — go to the Cursor + DeepSeek + VS Code Guide.
