vLLM Server Guide

What is this about?

This is the server companion to the DeepSeek Coder Guide. That guide names vLLM as the "scales to a team" hosting path. This guide is the how: installing vLLM on an Ubuntu GPU server, running it as a proper OpenAI-compatible service, and pointing your tools at it.

Important context

Requirements and commands in sections 1–4 come from the official vLLM docs, checked on June 22, 2026. The production-hardening parts (systemd, binding, reverse proxy, firewall) are marked as practice: standard Linux server operations, not vLLM-specific documentation.

1. "Only local?" No. vLLM is a server

This is the question that prompted the guide, so it gets answered first.

vLLM is not a local-only desktop tool like LM Studio. It is a high-throughput inference server engine. Running it "locally" just means the server happens to sit on your own machine. Running it on a remote Ubuntu GPU server is the primary, intended use case.

     your Ubuntu server (has the GPU)
   ┌──────────────────────────────┐
   │   vLLM → OpenAI-compatible   │
   │         API on :8000         │
   └──────────────┬───────────────┘
                  │ /v1/chat/completions
   ┌──────────────┼───────────────┐
VS Code       your apps     Claude Code / Codex-style tools

The catch is not "local vs server"; it is hardware:

vLLM is GPU-first: your server needs an NVIDIA GPU

vLLM officially supports Linux only and targets NVIDIA CUDA GPUs with compute capability 7.5+ (e.g. T4, RTX 20xx, L4, A100, H100, B200). A CPU-only build exists, but the docs explicitly call it not optimized, so treat it as non-viable for real serving. AMD ROCm and Intel XPU are supported via separate wheels/builds.

So: if your Ubuntu server has a supported NVIDIA GPU, vLLM is a perfect fit. If it is a CPU-only VPS, vLLM is the wrong tool; use the hosted DeepSeek API instead.


2. Requirements (confirmed)

Requirement   Value
OS            Linux only (Windows via WSL)
Python        3.10 – 3.13
GPU           NVIDIA, compute capability 7.5+
CUDA          12.9 default (12.8 and 13.0 also available)
Driver        A recent NVIDIA driver on the host (verify with nvidia-smi)

Driver vs toolkit

The vLLM wheel bundles its own PyTorch + CUDA runtime, so on the host you mainly need the NVIDIA driver, not the full CUDA toolkit. For the Docker path you need the NVIDIA Container Toolkit.

2.1 Prepare the Ubuntu host

# Install the recommended NVIDIA driver, then reboot
sudo ubuntu-drivers autoinstall
sudo reboot

# After reboot, confirm the GPU is visible and note the CUDA version
nvidia-smi

If nvidia-smi lists your GPU and a CUDA version, the host is ready.
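The requirements above can also be checked from a script. The following is a minimal sketch, not an official vLLM tool: it compares the Python version and each GPU's compute capability against the values in the requirements table. It assumes the installed driver's nvidia-smi supports the compute_cap query field (older drivers may not), and the function names are made up for this example.

```python
import subprocess
import sys

MIN_COMPUTE_CAP = (7, 5)            # minimum from the requirements table
PYTHON_RANGE = ((3, 10), (3, 13))   # inclusive range from the requirements table


def python_supported(version=None):
    """Return True if the (major, minor) version is inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return PYTHON_RANGE[0] <= major_minor <= PYTHON_RANGE[1]


def parse_gpus(csv_text):
    """Parse 'name, compute_cap' rows into (name, (major, minor), supported) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma only, in case a GPU name ever contains one
        name, cap = (field.strip() for field in line.rsplit(",", 1))
        major, minor = (int(part) for part in cap.split("."))
        gpus.append((name, (major, minor), (major, minor) >= MIN_COMPUTE_CAP))
    return gpus


def query_gpus():
    """Run nvidia-smi on this host and return the parsed GPU list.

    Assumption: the driver is recent enough to know the compute_cap field.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpus(result.stdout)
```

On the server, calling python_supported() and then iterating over query_gpus() gives a quick yes/no per GPU before any multi-GB download starts.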


3. Installing vLLM on Ubuntu

Three paths. Pick one. For a dedicated server I recommend Docker (clean, reproducible) or uv (fast, isolated venv).

3.1 uv

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

--torch-backend=auto detects your CUDA driver and selects the matching PyTorch build.

3.2 pip

pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
# CUDA 13.0 instead:
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu130

3.3 Docker

Requires the NVIDIA Container Toolkit on the host. Official image: vllm/vllm-openai:latest.

docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 --ipc=host \
vllm/vllm-openai:latest \
--model deepseek-ai/deepseek-coder-6.7b-instruct

The -v ~/.cache/huggingface mount persists downloaded weights between container restarts so you don't re-download multi-GB models every time.


4. Start the OpenAI-compatible server

The core command takes a Hugging Face model id:

vllm serve deepseek-ai/deepseek-coder-6.7b-instruct

That serves an OpenAI-compatible API on http://localhost:8000 by default. Useful flags:

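# --host 0.0.0.0 : listen on all interfaces (needed for remote access)
# --api-key      : require this key on every request
# Comments go above the command: a comment after a trailing backslash breaks the line continuation.
vllm serve <model> \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key <YOUR_KEY>

Check the exact model id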

deepseek-ai/deepseek-coder-6.7b-instruct is the expected Hugging Face repo id, but confirm the precise name on the deepseek-ai HF org before pulling, and remember the base vs instruct choice from the DeepSeek Coder Guide.

4.1 Verify it works

# List loaded models
curl http://localhost:8000/v1/models

# Chat completion
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/deepseek-coder-6.7b-instruct",
"messages": [
{"role": "system", "content": "You are a senior engineer."},
{"role": "user", "content": "Write a Python function to debounce calls."}
]
}'
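Loading weights can take a while, so a script that starts right after the server may hit a closed port. As a hedged sketch (standard library only, helper names invented for this guide), the /v1/models call above can be turned into a readiness poll:

```python
import json
import time
import urllib.error
import urllib.request


def fetch_model_ids(base_url, api_key=None, timeout=5.0):
    """GET {base_url}/v1/models and return the ids of the served models."""
    request = urllib.request.Request(f"{base_url.rstrip('/')}/v1/models")
    if api_key:
        # The --api-key value is sent as a bearer token, as OpenAI clients do
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(fetch, attempts=60, delay=5.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty list or the attempts run out."""
    for attempt in range(attempts):
        try:
            model_ids = fetch()
            if model_ids:
                return model_ids
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server not listening yet, or still loading the model
        if attempt < attempts - 1:
            sleep(delay)
    raise TimeoutError("vLLM did not report any models in time")
```

A typical call would be wait_until_ready(lambda: fetch_model_ids("http://localhost:8000")). The attempt count and delay are arbitrary defaults; large models may need more patience.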

There are two endpoints: /v1/chat/completions (use the instruct model, message format) and /v1/completions (raw prompt, good with base models).
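To make the difference between the two endpoints concrete, here is a small sketch that builds (but does not send) a request for either one. It uses only the standard library; build_request and its max_tokens default are illustrative choices, not part of vLLM.

```python
import json
import urllib.request


def build_request(base_url, model, prompt, chat=True, api_key=None, max_tokens=256):
    """Build a POST request for the chat endpoint or the raw completions endpoint."""
    if chat:
        # Instruct models: the prompt is wrapped in a list of role-tagged messages
        path = "/v1/chat/completions"
        body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    else:
        # Base models: the prompt is passed through as raw text to continue
        path = "/v1/completions"
        body = {"model": model, "prompt": prompt}
    body["max_tokens"] = max_tokens

    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"

    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends it. With chat=False the model simply continues the text, which suits a base model given a code prefix such as "def debounce(".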


5. Running it as a real server (practice)

vllm serve in a terminal dies when your SSH session ends. On a server you want it managed, restarted on failure, and not exposed raw to the internet.

5.1 systemd service

Create /etc/systemd/system/vllm.service:

[Unit]
Description=vLLM OpenAI-compatible server
After=network-online.target
Wants=network-online.target

[Service]
User=vllm
WorkingDirectory=/opt/vllm
Environment="HF_HOME=/opt/vllm/hf-cache"
# Assumed location: a root-only file (chmod 600) with one line, VLLM_API_KEY=<YOUR_KEY>.
# Without it, ${VLLM_API_KEY} below expands to an empty string.
EnvironmentFile=/etc/vllm/vllm.env
ExecStart=/opt/vllm/.venv/bin/vllm serve deepseek-ai/deepseek-coder-6.7b-instruct \
    --host 127.0.0.1 --port 8000 --api-key ${VLLM_API_KEY}
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Then load and start the unit:

sudo systemctl daemon-reload
sudo systemctl enable --now vllm
journalctl -u vllm -f # watch startup / model load

Do not expose port 8000 to the public internet

vLLM's --api-key is a single static shared secret: it is not real authentication and provides no rate limiting. Two safe patterns:

  1. Bind to 127.0.0.1 (as above) and reach it over an SSH tunnel or private network / VPN only.
  2. Put it behind a reverse proxy (nginx/Caddy) that terminates TLS and adds auth, then firewall the raw port:

sudo ufw allow 22/tcp
sudo ufw allow 443/tcp
sudo ufw deny 8000/tcp
sudo ufw enable

Never put a raw --host 0.0.0.0 vLLM port on a public IP.
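A deploy script can enforce this rule before it ever starts the server. The sketch below is one possible guard, assuming the --host value is a literal IP address; classify_bind_host is a hypothetical helper built on Python's ipaddress module, not something vLLM provides.

```python
import ipaddress


def classify_bind_host(host):
    """Classify a --host value as 'wildcard', 'loopback', 'private', or 'public'."""
    address = ipaddress.ip_address(host)  # raises ValueError for hostnames
    if address.is_unspecified:
        # 0.0.0.0 or :: means "listen on every interface", including public ones
        return "wildcard"
    if address.is_loopback:
        return "loopback"
    if address.is_private:
        return "private"
    return "public"
```

A cautious script might refuse to continue on "wildcard" or "public" unless a reverse proxy and firewall are known to be in place. Names such as localhost raise ValueError here and would need to be resolved first.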

5.2 nginx reverse proxy (sketch)

server {
    listen 443 ssl;
    server_name llm.yourdomain.tld;
    # ssl_certificate ... (e.g. via certbot)

    location /v1/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_read_timeout 300s; # long generations
        proxy_buffering off; # stream tokens through
    }
}

6. Connecting clients

Because the API is OpenAI-compatible, anything that speaks OpenAI works: just override base_url.

Python / your apps:

from openai import OpenAI

client = OpenAI(
    api_key="<YOUR_KEY>",  # the --api-key value ("EMPTY" if none set)
    base_url="https://llm.yourdomain.tld/v1",
)
resp = client.chat.completions.create(
    model="deepseek-ai/deepseek-coder-6.7b-instruct",
    messages=[{"role": "user", "content": "Refactor this for readability: ..."}],
)
print(resp.choices[0].message.content)
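For long generations it is usually nicer to stream tokens as they arrive, which is what the proxy_buffering off line in the nginx sketch is for. Below is a minimal sketch of consuming such a stream. It assumes the chunk shape used by the openai Python client (choices[0].delta.content); collect_stream and fake_chunk are illustrative names, and the stand-in chunks only exist so the block runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_token=None):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        delta = chunk.choices[0].delta.content
        if delta:  # role-only or final chunks can have content=None
            parts.append(delta)
            if on_token:
                on_token(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like one streamed chunk (for dry runs only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


# Dry run without a server: three fake chunks, one of them empty
demo_text = collect_stream([fake_chunk("def "), fake_chunk(None), fake_chunk("f():")])
```

Against the real server, pass stream=True to client.chat.completions.create(...) and hand the returned iterator to collect_stream, for example with on_token=lambda t: print(t, end="", flush=True) to echo tokens live.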

VS Code (Continue), Claude Code / Codex-style tools: point them at the same base URL + key. The editor wiring is covered in the Cursor + DeepSeek + VS Code Guide.


7. vLLM vs Ollama: which one?

Criterion                 Ollama                      vLLM
Setup effort              Minimal                     Moderate
Target                    One developer machine       Server / team backend
Throughput & concurrency  Modest                      High (batching, paged attention)
OpenAI-compatible API     Yes (/v1)                   Yes (/v1)
Best for                  "Just works" local coding   Shared GPU server, many clients, larger models

Rule of thumb: Ollama for your laptop, vLLM for the GPU server. See the Ollama Developer Guide for the local path.
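The throughput row only pays off if clients actually send requests in parallel, so the server has something to batch. As a rough sketch (run_concurrently is a made-up helper, and the worker count is an arbitrary default), a client can fan prompts out over a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(prompts, send, max_workers=8):
    """Call send(prompt) for every prompt in parallel threads.

    Results come back in the same order as the input prompts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))
```

Here send would be any function that posts one prompt to the server and returns the answer, such as a thin wrapper around client.chat.completions.create. Threads are enough for this because the work is network-bound.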


8. Bottom line

  • Local-only? No. vLLM is a server engine; a remote Ubuntu GPU box is its home turf.
  • Hard requirement: Linux + a supported NVIDIA GPU (CC 7.5+). CPU-only is not viable; on a CPU VPS, use the DeepSeek API instead.
  • Install: uv pip install vllm --torch-backend=auto, or the vllm/vllm-openai Docker image.
  • Serve: vllm serve <model> → OpenAI-compatible /v1 on port 8000.
  • Productionize: systemd unit, bind to 127.0.0.1, reach it via SSH tunnel/VPN or a TLS reverse proxy, firewall the raw port.

For the model side (variant choice, base vs instruct, FIM, prompting), go back to the DeepSeek Coder Guide.


Sources