Ollama Developer Guide

What is this about?

This is a practical guide for developers who want to use Ollama as a local-first AI runtime. It covers what Ollama is, where it shines, when you should use it, how to self-host it cleanly, and which best practices matter in real projects.

Scope

This guide reflects the official Ollama documentation as checked on June 22, 2026.

1. What Ollama is

Ollama runs open-weight LLMs on your own hardware and exposes them through a local CLI and an HTTP API.

In practice, it gives you:

a local runtime for models on macOS, Windows, and Linux
a simple CLI with commands such as ollama run, ollama pull, and ollama serve
a native API at http://localhost:11434/api
an OpenAI-compatible API at http://localhost:11434/v1/
official Python and JavaScript/TypeScript libraries
a way to customize models with Modelfiles
support for streaming, thinking, structured outputs, tool calling, embeddings, and vision

Short version: if you want to build with local or self-hosted models without managing a full inference stack from scratch, Ollama is one of the cleanest entry points.

2. Why developers like Ollama

The main advantages

Local-first workflow: your default path is your own machine or your own server.
Very low setup friction: install it, pull a model, call the API.
OpenAI compatibility: many existing tools can be pointed at Ollama with only a base URL change.
Strong app-building features: tool calling, structured outputs, embeddings, vision, streaming.
Good customization story: Modelfiles let you pin context, system prompts, and generation settings.
Good editor story: official docs include integrations for VS Code, Codex, Claude Code, and more.
Good self-hosting story: it works directly on a workstation, in Docker, or on an internal server.

The real trade-off

You are usually trading:

more control
better privacy
lower recurring API cost

for:

your own hardware limits
more responsibility for model choice
less raw quality than frontier hosted models in some tasks

3. What Ollama is best used for

Excellent fits

Use case	Why Ollama fits well
Local coding assistant	Good for editor integrations, CLI tooling, and private codebases
Internal chat over company docs	Good privacy and easy local APIs for RAG
Structured extraction pipelines	JSON mode and schema-based outputs work well
Embeddings + semantic search	Official embeddings support is built in
Agent/tool workflows	Tool calling and multi-turn loops are documented clearly
Vision prototypes	Multimodal models can be used through the same chat API
Offline or controlled environments	Useful when you do not want every request sent to a hosted vendor

Good, but not always ideal

Use case	Watch out for
Production customer-facing chat at scale	You still need capacity planning, monitoring, and traffic control
Large-context agents	Context size costs memory quickly
Frontier reasoning tasks	A hosted frontier model may still perform better

Usually not the best choice

If you need the absolute strongest frontier model quality on every request
If you want a fully managed multi-tenant AI platform with elastic scaling out of the box
If your hardware is weak but your workload expects big models + large context + low latency

4. How you should think about using it

The cleanest way to use Ollama is:

use it as your default local runtime
choose separate models for separate jobs
treat the OpenAI-compatible API as a compatibility layer, not as full feature parity
move to Docker or an internal server only once the local workflow already feels good

A healthy mental model

Ollama is not just "chat in the terminal"
it is a developer platform for local AI features
the strongest pattern is usually hybrid:
- local models for speed, privacy, and cheap iteration
- hosted frontier models only for the hardest tasks

5. Quickstart

CLI

Install Ollama, then run:

ollama

The official quickstart says this opens an interactive menu where you can:

run a model
launch tools and integrations
access supported coding workflows

Run a first model

ollama run gemma4

First API call

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [{ "role": "user", "content": "Hello!" }]
}'

That local API is the center of most real integrations.
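The same call can be made from Python with only the standard library. This is a minimal sketch, assuming a local Ollama server on the default port and the gemma4 model from the quickstart; the helper names build_chat_request and chat are illustrative, not part of any Ollama library. It sets "stream": false so the server returns a single JSON object rather than a stream of chunks.

```python
import json
import urllib.request

# Native chat endpoint from this guide (default local port).
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the native chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]
```

With the server running, chat("gemma4", "Hello!") should return the reply as a string. Keeping request construction separate from the network call makes the payload easy to inspect and test offline.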

6. The two API paths

Native Ollama API

Base URL:

http://localhost:11434/api

Use this when:

you want Ollama-native features directly
you are building greenfield tooling
you want the clearest mapping to the official docs

OpenAI-compatible API

Base URL:

http://localhost:11434/v1/

Use this when:

your app already speaks OpenAI
you want to reuse SDKs or tools with minimal changes
you are plugging Ollama into existing editors, agents, or internal services

Example with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required by the SDK, ignored by Ollama locally
)

response = client.chat.completions.create(
    model='gpt-oss:20b',
    messages=[{'role': 'user', 'content': 'Explain what Ollama is in one paragraph.'}],
)

print(response.choices[0].message.content)

Compatibility caveat

Ollama supports large parts of the OpenAI API, but it does not offer perfect one-to-one parity. For example, the current docs show gaps such as:

no tool_choice support in /v1/chat/completions
no stateful previous_response_id flow in /v1/responses
only partial field support depending on the endpoint

So the safe rule is: compatible enough for many tools, but test the exact features you rely on.
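One practical consequence of having two paths is that the response JSON is shaped differently on each. The sketch below is a small adapter for code that may talk to either path; it assumes non-streaming chat responses, and the function name reply_text is illustrative.

```python
from typing import Any


def reply_text(body: dict[str, Any]) -> str:
    """Return the assistant text from a non-streaming chat response.

    Handles the native /api/chat shape and the OpenAI-style
    /v1/chat/completions shape.
    """
    if "choices" in body:
        # OpenAI-compatible shape: choices[0].message.content (may be null)
        return body["choices"][0]["message"]["content"] or ""
    if "message" in body:
        # Native Ollama shape: message.content
        return body["message"].get("content", "")
    raise ValueError("unrecognized chat response shape")
```

An adapter like this keeps the rest of the app independent of which base URL is configured, which makes it easier to test the exact features you rely on against both paths.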

7. Best practices

1. Use different models for different jobs

Do not force one model to do everything.

use a general instruct model for chat or coding
use a dedicated embedding model for retrieval
use a vision-capable model only when images are involved

The official embeddings docs explicitly recommend models such as:

embeddinggemma
qwen3-embedding
all-minilm
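To show how a dedicated embedding model fits into retrieval, here is a minimal sketch of the two pieces an app usually needs: a request payload and a similarity ranking. It assumes the native embeddings endpoint is /api/embed, accepts a list of strings in "input", and returns one vector per string; verify that against the API reference for your version. The helper names are illustrative.

```python
import math


def build_embed_payload(model: str, texts: list[str]) -> dict:
    """Payload for the native embeddings endpoint (assumed: POST /api/embed)."""
    return {"model": model, "input": texts}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

In a real pipeline the vectors would come from the embedding model, and the top-ranked documents would then be passed to a separate chat model as context.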

2. Prefer structured outputs for app logic

If the model output feeds code, workflows, or databases, use:

format: "json" for simple JSON
a full JSON schema for stable structured outputs

The docs also recommend validating responses with:

Pydantic in Python
Zod in JavaScript/TypeScript

This is one of the highest-leverage practices in real software: malformed output fails at the boundary instead of deep inside your app logic.
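As an illustration of schema-based output plus validation, the sketch below passes a JSON schema in the "format" field of a native chat request and then checks the returned content by hand. The schema, the prompt wording, and the helper names are hypothetical; the docs recommend Pydantic for this step, and plain checks are used here only to keep the example dependency-free.

```python
import json

# Hypothetical schema for illustration: extract a person's name and age.
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}


def build_extraction_payload(model: str, text: str) -> dict:
    """Chat payload that asks for output constrained to PERSON_SCHEMA."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": f"Extract the person from: {text}"}],
        "format": PERSON_SCHEMA,         # a full JSON schema instead of "json"
        "options": {"temperature": 0},   # low temperature for repeatable output
        "stream": False,
    }


def parse_person(content: str) -> dict:
    """Parse and validate the message content; raise ValueError on any mismatch."""
    try:
        data = json.loads(content)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' must be a string")
    age = data.get("age")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("'age' must be an integer")
    return {"name": data["name"], "age": age}
```

Validating after the call matters even with a schema in the request, because the app should not trust any external output without checking it.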

3. Keep temperature low for deterministic workflows

For extraction, routing, classification, or app-facing JSON:

lower temperature
for strict structured work, use 0 when appropriate

This is especially important when you want predictable outputs across runs.

4. Size context intentionally

The official context-length docs currently say Ollama defaults to:

4k when VRAM is under 24 GiB
32k at 24-48 GiB
256k at 48 GiB or more

They also explicitly note that web search, agents, and coding tools should be set to at least 64000 tokens.

That does not mean "always max out context." Larger context means more memory pressure. Only increase it when the workload needs it.

You can set it at serve time:

OLLAMA_CONTEXT_LENGTH=64000 ollama serve
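Context can also be requested per call through the num_ctx option, the same parameter used in Modelfiles. The sketch below encodes the default tiers listed above and adds num_ctx to a request payload. It assumes that "k" means 1024 tokens and that exactly 48 GiB falls into the top tier; both are my reading of the tiers, not confirmed by the docs, and the function names are illustrative.

```python
def default_context_tokens(vram_gib: float) -> int:
    """Default context size for a given amount of VRAM, per the tiers above.

    Assumptions: "k" means 1024 tokens, and 48 GiB belongs to the top tier.
    """
    if vram_gib < 24:
        return 4 * 1024
    if vram_gib < 48:
        return 32 * 1024
    return 256 * 1024


def with_context(payload: dict, num_ctx: int) -> dict:
    """Return a copy of a request payload with options.num_ctx set."""
    options = {**payload.get("options", {}), "num_ctx": num_ctx}
    return {**payload, "options": options}
```

For example, with_context(payload, 64000) raises the context for a single agent-style request while leaving the server default untouched for everything else.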

5. Use Modelfiles instead of retyping behavior everywhere

If you keep repeating the same system prompt, context size, or generation settings, move that into a Modelfile.

Example:

FROM gemma4
PARAMETER num_ctx 4096
PARAMETER temperature 0.2
SYSTEM """You are a careful senior developer who answers in concise technical English."""

Then create and run it:

ollama create my-dev-model -f Modelfile
ollama run my-dev-model

This is cleaner than scattering giant prompts across scripts.

6. Build tool calling as an explicit loop

For real tool use:

let the model request tools
execute the tools in your app
append the tool results
ask the model to continue

The official docs show this pattern for:

single tool calls
parallel tool calls
multi-turn agent loops
streaming tool loops

That is the right architecture. Do not expect "one request and magic agent behavior" unless your wrapper implements the loop.
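The four steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the official implementation: the chat argument is any function that sends the message history to the model and returns the assistant message as a dict, tool calls are assumed to arrive as tool_calls entries with function.name and function.arguments, and results are appended as "tool" messages with a tool_name field. Check those field names against the tool calling docs for your version.

```python
from typing import Any, Callable

Message = dict[str, Any]
# Takes the message history, returns the assistant message.
ChatFn = Callable[[list[Message]], Message]


def run_tool_loop(
    chat: ChatFn,
    messages: list[Message],
    tools: dict[str, Callable[..., Any]],
    max_turns: int = 5,
) -> Message:
    """Run the request/execute/append/continue loop until the model stops calling tools."""
    history = list(messages)  # do not mutate the caller's list
    for _ in range(max_turns):
        reply = chat(history)
        history.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply  # no tool requests left: this is the final answer
        for call in calls:
            function = call["function"]
            name = function["name"]
            arguments = function.get("arguments") or {}
            if name in tools:
                try:
                    result = tools[name](**arguments)
                except Exception as exc:  # report tool failures back to the model
                    result = f"error: {exc}"
            else:
                result = f"error: unknown tool {name!r}"
            history.append({"role": "tool", "tool_name": name, "content": str(result)})
    raise RuntimeError("tool loop did not finish within max_turns")
```

The max_turns cap is a design choice to stop a model that keeps requesting tools forever; pick a limit that suits the workflow.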

7. Track usage metrics

Ollama responses expose useful metrics such as:

total_duration
load_duration
prompt_eval_count
prompt_eval_duration
eval_count
eval_duration

Use those for:

spotting cold starts
comparing models
measuring prompt inflation
catching slow context-heavy workflows
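These fields become much easier to read once converted into seconds and tokens per second. The sketch below does that for a single response body; Ollama reports the duration fields in nanoseconds, and the function name and output keys are illustrative.

```python
NS_PER_SECOND = 1_000_000_000  # duration fields are reported in nanoseconds


def summarize_metrics(body: dict) -> dict:
    """Convert raw response metrics into seconds and tokens per second."""

    def seconds(key: str) -> float:
        return body.get(key, 0) / NS_PER_SECOND

    def rate(count_key: str, duration_key: str) -> float:
        duration = body.get(duration_key, 0)
        if not duration:
            return 0.0  # avoid dividing by zero when a metric is missing
        return body.get(count_key, 0) * NS_PER_SECOND / duration

    return {
        "total_s": seconds("total_duration"),
        "load_s": seconds("load_duration"),  # a large value suggests a cold start
        "prompt_tokens": body.get("prompt_eval_count", 0),
        "prompt_tokens_per_s": rate("prompt_eval_count", "prompt_eval_duration"),
        "output_tokens_per_s": rate("eval_count", "eval_duration"),
    }
```

Logging this summary per request gives a simple baseline for comparing models and for noticing when prompts grow over time.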

8. Pin names and aliases for compatibility

If a tool expects a model name like gpt-3.5-turbo, the docs recommend using ollama cp to create a compatible alias.

That is a practical way to make legacy tools happy without changing every config surface.

9. Treat remote exposure as infrastructure, not a toy

Officially, local access to http://localhost:11434 requires no authentication.

That is fine on your machine. It is not a reason to expose port 11434 directly to the internet.

For team or server use:

keep Ollama behind an internal network boundary
put a reverse proxy or gateway in front of it
handle auth outside Ollama if you expose it remotely
use Ollama API keys only for direct access to https://ollama.com/api

8. Self-hosting patterns

Pattern A: local workstation

Best for:

solo development
editor integrations
private code or docs
fast experimentation

This is where most people should start.

Pattern B: Docker on a dev box or internal server

CPU only

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Nvidia GPU

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

AMD GPU

docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm

Best for:

stable internal endpoint for a small team
CI or shared demo environments
repeatable setup on Linux

Pattern C: internal AI service

Best for:

one central model host
shared editor and tool integrations
one endpoint for multiple internal apps

Recommended shape:

Ollama in Docker or on a dedicated machine
models pre-pulled
reverse proxy in front
internal-only networking
monitoring around API latency and memory use

9. VS Code usage

The current official docs say VS Code can use Ollama models through GitHub Copilot Chat.

Current prerequisites listed by Ollama:

Ollama v0.18.3+
VS Code 1.113+
GitHub Copilot Chat 0.41.0+

The docs also note an important detail:

you still need to be logged in
but you do not need a paid Copilot plan
GitHub Copilot Free is enough for custom/local model selection

Quick setup:

ollama launch vscode

Then make sure Local is selected in the Copilot Chat panel.

This makes Ollama especially useful if you want:

local chat over your codebase
private prompts
a lower-cost coding setup
a bridge between local inference and a familiar editor UI

10. Practical recommendation

Use Ollama when

you want a local-first AI stack
you care about privacy or code locality
you want cheap iteration without per-token billing
you want a simple self-hosting path
you need RAG, embeddings, tools, or structured extraction on your own infrastructure

Do not force Ollama when

the task absolutely requires the best hosted frontier model
your team is not ready to own GPU capacity, model choice, and latency trade-offs
the workload is mostly product-facing and you actually need a managed inference platform

My practical default

For most developers, the strongest setup is:

start with Ollama locally
make your app work through the native or OpenAI-compatible API
add Modelfiles and structured outputs
move to Docker or an internal host only when the workflow is already proven
keep one hosted frontier model as an escalation path for the hardest tasks

That gives you the best balance of control, cost, and developer speed.
