
Ollama Developer Guide

What is this about?

This is a practical guide for developers who want to use Ollama as a local-first AI runtime. It covers what Ollama is, where it shines, when you should use it, how to self-host it cleanly, and which best practices matter in real projects.

Scope

This guide reflects the official Ollama documentation as checked on June 22, 2026.

1. What Ollama is​

Ollama is a tool for running LLMs on your own hardware behind a local CLI and an HTTP API.

In practice, it gives you:

  • a local runtime for models on macOS, Windows, and Linux
  • a simple CLI with commands like ollama run, ollama pull, and ollama serve
  • a native API at http://localhost:11434/api
  • an OpenAI-compatible API at http://localhost:11434/v1/
  • official Python and JavaScript/TypeScript libraries
  • a way to customize models with Modelfiles
  • support for streaming, thinking, structured outputs, tool calling, embeddings, and vision

Short version: if you want to build with local or self-hosted models without managing a full inference stack from scratch, Ollama is one of the cleanest entry points.


2. Why developers like Ollama​

The main advantages​

  • Local-first workflow: your default path is your own machine or your own server.
  • Very low setup friction: install it, pull a model, call the API.
  • OpenAI compatibility: many existing tools can be pointed at Ollama with only a base URL change.
  • Strong app-building features: tool calling, structured outputs, embeddings, vision, streaming.
  • Good customization story: Modelfiles let you pin context, system prompts, and generation settings.
  • Good editor story: official docs include integrations for VS Code, Codex, Claude Code, and more.
  • Good self-hosting story: it works directly on a workstation, in Docker, or on an internal server.

The real trade-off​

You are usually trading:

  • more control
  • better privacy
  • lower recurring API cost

for:

  • your own hardware limits
  • more responsibility for model choice
  • less raw quality than frontier hosted models in some tasks

3. What Ollama is best used for​

Excellent fits​

Use case | Why Ollama fits well
Local coding assistant | Good for editor integrations, CLI tooling, and private codebases
Internal chat over company docs | Good privacy and easy local APIs for RAG
Structured extraction pipelines | JSON mode and schema-based outputs work well
Embeddings + semantic search | Official embeddings support is built in
Agent/tool workflows | Tool calling and multi-turn loops are documented clearly
Vision prototypes | Multimodal models can be used through the same chat API
Offline or controlled environments | Useful when you do not want every request sent to a hosted vendor

Good, but not always ideal​

Use case | Watch out for
Production customer-facing chat at scale | You still need capacity planning, monitoring, and traffic control
Large-context agents | Context size costs memory quickly
Frontier reasoning tasks | A hosted frontier model may still perform better

Usually not the best choice​

  • If you need the absolute strongest frontier model quality on every request
  • If you want a fully managed multi-tenant AI platform with elastic scaling out of the box
  • If your hardware is weak but your workload expects big models + large context + low latency

4. How you should think about using it​

The cleanest way to use Ollama is:

  1. use it as your default local runtime
  2. choose separate models for separate jobs
  3. treat the OpenAI-compatible API as a compatibility layer, not a guarantee of full parity
  4. move to Docker or an internal server only once the local workflow already feels good

A healthy mental model​

  • Ollama is not just "chat in the terminal"
  • it is a developer platform for local AI features
  • the strongest pattern is usually hybrid:
    • local models for speed, privacy, and cheap iteration
    • hosted frontier models only for the hardest tasks

5. Quickstart​

CLI​

Install Ollama, then run:

ollama

The official quickstart says this opens an interactive menu where you can:

  • run a model
  • launch tools and integrations
  • access supported coding workflows

Run a first model​

ollama run gemma4

First API call​

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [{ "role": "user", "content": "Hello!" }]
}'

That local API is the center of most real integrations.
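
The same call can be made from application code. The following is a minimal sketch using only the Python standard library; the helper names build_chat_request and chat are illustrative, not part of any Ollama library, and it assumes a server is listening on the default port with the model already pulled. Setting "stream" to false asks for a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the native chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

With a running server, chat("gemma4", "Hello!") should return the reply as a plain string. The official Python library wraps the same endpoint if you prefer not to handle HTTP yourself.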


6. The two API paths​

Native Ollama API​

Base URL:

http://localhost:11434/api

Use this when:

  • you want Ollama-native features directly
  • you are building greenfield tooling
  • you want the clearest mapping to the official docs

OpenAI-compatible API​

Base URL:

http://localhost:11434/v1/

Use this when:

  • your app already speaks OpenAI
  • you want to reuse SDKs or tools with minimal changes
  • you are plugging Ollama into existing editors, agents, or internal services

Example with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required by the SDK, ignored by Ollama locally
)

response = client.chat.completions.create(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': 'Explain what Ollama is in one paragraph.'}],
)

print(response.choices[0].message.content)

Compatibility caveat​

Ollama supports large parts of the OpenAI API, but it does not offer one-to-one parity. For example, the current docs show gaps such as:

  • no tool_choice support in /v1/chat/completions
  • no stateful previous_response_id flow in /v1/responses
  • only partial field support depending on the endpoint

So the safe rule is: compatible enough for many tools, but test the exact features you rely on.


7. Best practices​

1. Use different models for different jobs​

Do not force one model to do everything.

  • use a general instruct model for chat or coding
  • use a dedicated embedding model for retrieval
  • use a vision-capable model only when images are involved

The official embeddings docs explicitly recommend models such as:

  • embeddinggemma
  • qwen3-embedding
  • all-minilm
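
To make the retrieval side concrete, here is a rough sketch of semantic ranking on top of the native /api/embed endpoint. The function names embed, cosine_similarity, and rank are illustrative, the default model is just one of the names listed above, and a real system would usually store vectors in a vector database rather than rank them in memory.

```python
import json
import math
import urllib.request

EMBED_URL = "http://localhost:11434/api/embed"


def embed(texts: list[str], model: str = "embeddinggemma") -> list[list[float]]:
    """Request one embedding vector per input text from a local Ollama server."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    request = urllib.request.Request(
        EMBED_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)["embeddings"]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices ordered from most to least similar."""
    return sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
```

Embed the documents once, embed each query at request time, and pass both to rank. Use the same embedding model for documents and queries, since vectors from different models are not comparable.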

2. Prefer structured outputs for app logic​

If the model output feeds code, workflows, or databases, use:

  • format: "json" for simple JSON
  • a full JSON schema for stable structured outputs

The docs also recommend validating responses with:

  • Pydantic in Python
  • Zod in JavaScript/TypeScript

This is one of the highest-leverage practices in real software.
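
A minimal sketch of the schema-based path follows. The ticket schema, field names, and helper functions are hypothetical examples, and the hand-written checks stand in for the Pydantic validation the docs recommend so that the snippet has no third-party dependencies.

```python
import json

# Hypothetical schema for illustration; replace with your own fields.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["bug", "feature", "question"]},
        "priority": {"type": "integer"},
    },
    "required": ["category", "priority"],
}


def build_extraction_payload(model: str, text: str) -> dict:
    """Body for POST /api/chat that constrains the reply to TICKET_SCHEMA."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": f"Classify this ticket: {text}"}],
        "format": TICKET_SCHEMA,  # a JSON schema instead of the plain "json" mode
        "stream": False,
    }


def parse_ticket(response: dict) -> dict:
    """Parse and validate the model's reply; raise ValueError if it is unusable."""
    data = json.loads(response["message"]["content"])
    allowed = TICKET_SCHEMA["properties"]["category"]["enum"]
    if data.get("category") not in allowed:
        raise ValueError(f"unexpected category: {data.get('category')!r}")
    priority = data.get("priority")
    if not isinstance(priority, int) or isinstance(priority, bool):
        raise ValueError(f"priority must be an integer, got {priority!r}")
    return data
```

Validate even when a schema is supplied: treating the reply as untrusted input keeps a malformed response from reaching your database or workflow engine.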

3. Keep temperature low for deterministic workflows​

For extraction, routing, classification, or app-facing JSON:

  • lower temperature
  • for strict structured work, use 0 when appropriate

This is especially important when you want predictable outputs across runs.
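
In the native API, sampling settings travel in the request's options object. The helper below is a small sketch; the name deterministic and the default seed value are arbitrary choices for illustration. Temperature 0 plus a fixed seed makes runs far more repeatable, though identical output across different hardware or Ollama versions should not be assumed.

```python
def deterministic(payload: dict, seed: int = 42) -> dict:
    """Return a copy of a request body with low-variance sampling options."""
    options = {**payload.get("options", {}), "temperature": 0, "seed": seed}
    return {**payload, "options": options}


# Example request body for POST /api/chat (model name taken from the quickstart).
base = {
    "model": "gemma4",
    "messages": [{"role": "user", "content": "Route this ticket: login fails"}],
    "stream": False,
}
request_body = deterministic(base)
```

The helper copies rather than mutates, so the same base payload can still be reused for creative tasks that want a higher temperature.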

4. Size context intentionally​

The official context-length docs currently say Ollama defaults to:

  • 4k when VRAM is under 24 GiB
  • 32k at 24-48 GiB
  • 256k at 48 GiB or more

They also explicitly note that web search, agents, and coding tools should be set to at least 64000 tokens.

That does not mean "always max out context." Larger context means more memory pressure. Only increase it when the workload needs it.

You can set it at serve time:

OLLAMA_CONTEXT_LENGTH=64000 ollama serve
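
Context can also be set per request through the num_ctx option, which makes it possible to size it to the workload instead of globally. The sketch below picks the smallest tier that fits a prompt. The tier list and the four-characters-per-token estimate are rough assumptions for illustration, not values from the Ollama docs; measure prompt_eval_count on real requests before relying on them.

```python
# Hypothetical context tiers, smallest first.
CONTEXT_TIERS = (4096, 8192, 16384, 32768, 65536)


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token for English."""
    return len(text) // 4 + 1


def pick_num_ctx(prompt: str, reply_budget: int = 1024) -> int:
    """Smallest tier that holds the prompt plus room for the reply."""
    needed = estimate_tokens(prompt) + reply_budget
    for size in CONTEXT_TIERS:
        if needed <= size:
            return size
    raise ValueError(f"prompt needs about {needed} tokens; trim or chunk it")


def with_context(payload: dict, prompt: str) -> dict:
    """Return a copy of a request body with options.num_ctx sized to the prompt."""
    options = {**payload.get("options", {}), "num_ctx": pick_num_ctx(prompt)}
    return {**payload, "options": options}
```

Short prompts stay on the cheapest tier, and only long inputs pay the memory cost of a larger window.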

5. Use Modelfiles instead of retyping behavior everywhere​

If you keep repeating the same system prompt, context size, or generation settings, move that into a Modelfile.

Example:

FROM gemma4
PARAMETER num_ctx 4096
PARAMETER temperature 0.2
SYSTEM """You are a careful senior developer who answers in concise technical English."""

Then create and run it:

ollama create my-dev-model -f Modelfile
ollama run my-dev-model

This is cleaner than scattering giant prompts across scripts.

6. Build tool calling as an explicit loop​

For real tool use:

  • let the model request tools
  • execute the tools in your app
  • append the tool results
  • ask the model to continue

The official docs show this pattern for:

  • single tool calls
  • parallel tool calls
  • multi-turn agent loops
  • streaming tool loops

That is the right architecture. Do not expect "one request and magic agent behavior" unless your wrapper implements the loop.
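
The loop itself is short. The sketch below keeps transport out of the picture: send is assumed to be any callable that posts the message list (plus your tool definitions) to /api/chat and returns the parsed JSON response. The tool get_temperature, its canned data, and the function run_tool_loop are hypothetical examples.

```python
from typing import Callable


def get_temperature(city: str) -> str:
    """Hypothetical tool backed by canned data."""
    return {"Paris": "18 C", "Oslo": "7 C"}.get(city, "unknown")


# Registry mapping tool names the model may request to local functions.
TOOLS: dict[str, Callable[..., str]] = {"get_temperature": get_temperature}


def run_tool_loop(send: Callable[[list], dict], messages: list, max_turns: int = 5) -> str:
    """Call the model, run any requested tools, and repeat until it answers."""
    messages = list(messages)  # work on a copy of the caller's history
    for _ in range(max_turns):
        reply = send(messages)["message"]
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply.get("content", "")  # no tool requested: final answer
        for call in calls:
            fn = call["function"]
            tool = TOOLS.get(fn["name"])
            if tool is None:
                result = f"unknown tool: {fn['name']}"
            else:
                result = tool(**fn["arguments"])
            # Append the tool result so the model can use it on the next turn.
            messages.append(
                {"role": "tool", "tool_name": fn["name"], "content": str(result)}
            )
    raise RuntimeError("tool loop did not finish within max_turns")
```

The max_turns cap is a deliberate design choice: without it, a model that keeps requesting tools would loop forever. Looking tools up in an explicit registry also means the model can only trigger functions you have chosen to expose.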

7. Track usage metrics​

Ollama responses expose useful metrics such as:

  • total_duration
  • load_duration
  • prompt_eval_count
  • prompt_eval_duration
  • eval_count
  • eval_duration

Use those for:

  • spotting cold starts
  • comparing models
  • measuring prompt inflation
  • catching slow context-heavy workflows
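
These fields are easy to turn into numbers you can log. The durations are reported in nanoseconds, so the sketch below converts them to seconds and derives output throughput. The function name summarize_metrics and the keys of the returned dict are illustrative.

```python
NS_PER_SECOND = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize_metrics(response: dict) -> dict:
    """Convert raw response metrics into seconds and tokens per second."""
    eval_seconds = response.get("eval_duration", 0) / NS_PER_SECOND
    output_tokens = response.get("eval_count", 0)
    return {
        "total_seconds": response.get("total_duration", 0) / NS_PER_SECOND,
        # A large load time relative to the total usually indicates a cold start.
        "load_seconds": response.get("load_duration", 0) / NS_PER_SECOND,
        "prompt_tokens": response.get("prompt_eval_count", 0),
        "prompt_seconds": response.get("prompt_eval_duration", 0) / NS_PER_SECOND,
        "output_tokens": output_tokens,
        "output_tokens_per_second": (
            output_tokens / eval_seconds if eval_seconds else 0.0
        ),
    }
```

Logging prompt_tokens over time is a cheap way to notice prompt inflation, and comparing output_tokens_per_second across models gives a like-for-like speed figure on your own hardware.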

8. Pin names and aliases for compatibility​

If a tool expects a model name like gpt-3.5-turbo, the docs recommend using ollama cp to create a compatible alias.

That is a practical way to make legacy tools happy without changing every config surface.

9. Treat remote exposure as infrastructure, not a toy​

Officially, local access to http://localhost:11434 requires no authentication.

That is fine on your machine. It is not a reason to expose port 11434 directly to the internet.

For team or server use:

  • keep Ollama behind an internal network boundary
  • put a reverse proxy or gateway in front of it
  • handle auth outside Ollama if you expose it remotely
  • use Ollama API keys only for direct access to https://ollama.com/api

8. Self-hosting patterns​

Pattern A: local workstation​

Best for:

  • solo development
  • editor integrations
  • private code or docs
  • fast experimentation

This is where most people should start.

Pattern B: Docker on a dev box or internal server​

CPU only​

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Nvidia GPU​

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

AMD GPU​

docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm

Best for:

  • stable internal endpoint for a small team
  • CI or shared demo environments
  • repeatable setup on Linux

Pattern C: internal AI service​

Best for:

  • one central model host
  • shared editor and tool integrations
  • one endpoint for multiple internal apps

Recommended shape:

  • Ollama in Docker or on a dedicated machine
  • models pre-pulled
  • reverse proxy in front
  • internal-only networking
  • monitoring around API latency and memory use
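
One check worth automating for a shared host is that the expected models really are pre-pulled. The sketch below uses the native /api/tags endpoint, which lists locally available models. The function names and the base URL default are assumptions; point it at your internal hostname.

```python
import json
import urllib.request


def list_models(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from an Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)


def normalize(name: str) -> str:
    """Model names without an explicit tag resolve to the 'latest' tag."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_response: dict, required: list[str]) -> list[str]:
    """Return the required models that the server does not have yet."""
    installed = {normalize(m["name"]) for m in tags_response.get("models", [])}
    return sorted({normalize(name) for name in required} - installed)
```

Running missing_models(list_models(), [...]) as a startup or deployment check catches a missing model before the first user request triggers a slow pull or an error.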

9. VS Code usage​

The current official docs say VS Code can use Ollama models through GitHub Copilot Chat.

Current prerequisites listed by Ollama:

  • Ollama v0.18.3+
  • VS Code 1.113+
  • GitHub Copilot Chat 0.41.0+

The docs also note an important detail:

  • you still need to be logged in
  • but you do not need a paid Copilot plan
  • GitHub Copilot Free is enough for custom/local model selection

Quick setup:

ollama launch vscode

Then make sure Local is selected in the Copilot Chat panel.

This makes Ollama especially useful if you want:

  • local chat over your codebase
  • private prompts
  • a lower-cost coding setup
  • a bridge between local inference and a familiar editor UI

10. Practical recommendation​

Use Ollama when​

  • you want a local-first AI stack
  • you care about privacy or code locality
  • you want cheap iteration without per-token billing
  • you want a simple self-hosting path
  • you need RAG, embeddings, tools, or structured extraction on your own infrastructure

Do not force Ollama when​

  • the task absolutely requires the best hosted frontier model
  • your team is not ready to own GPU capacity, model choice, and latency trade-offs
  • the workload is mostly product-facing and you actually need a managed inference platform

My practical default​

For most developers, the strongest setup is:

  1. start with Ollama locally
  2. make your app work through the native or OpenAI-compatible API
  3. add Modelfiles and structured outputs
  4. move to Docker or an internal host only when the workflow is already proven
  5. keep one hosted frontier model as an escalation path for the hardest tasks

That gives you the best balance of control, cost, and developer speed.

