Meta Llama Guide

What is this about?

Meta's AI story now has two tracks: the open-weight Llama family you can download, self-host, and fine-tune, and a newer proprietary flagship (Muse Spark) that powers the Meta AI consumer assistant. This guide focuses on the developer-relevant track — Llama models, the hosting options, Llama Stack, and the license — and explains where Muse Spark fits.

Source scope as of June 23, 2026

Based on official Meta sources (llama.com / developer.meta.com/ai, ai.meta.com, huggingface.co/meta-llama, the Llama Stack repo) plus the major hosting partners. Two things to keep straight: the latest open-weight generation is still Llama 4 (Scout / Maverick, April 2025) — no newer open-weight generation has shipped — and Muse Spark (April 2026) is proprietary, not part of the open-weight family. Meta's own hosted Llama API is in preview, not GA. Preview endpoints and partner model IDs shift; confirm against the live docs before relying on them.

1. The mental model

| Surface | What it is for | Primary user |
| --- | --- | --- |
| Open-weight Llama models | Download the weights; self-host or fine-tune on your own infrastructure | Teams wanting control, privacy, on-prem |
| Llama API (Meta-hosted) | Meta-hosted inference for Llama models, OpenAI-SDK-compatible (preview) | Developers wanting Llama without running infra |
| Llama Stack | Open, standardized API layer + toolkit (inference, RAG, agents, safety, evals, telemetry) you can run anywhere | Developers building portable LLM apps |
| Meta AI (consumer assistant) | End-user chat across WhatsApp, Instagram, Messenger, and meta.ai | Consumers (not a build-on platform) |
| Muse Spark | Meta's proprietary flagship model now powering Meta AI | Consumers; select partners via private API preview |

Hosting / inference ecosystem that serves Llama: Hugging Face (weights), Ollama and llama.cpp (local), vLLM (high-throughput serving), and hosted partners Together, Groq, Fireworks, AWS Bedrock, Azure, and Google Vertex AI.

Rule of thumb:

  • Need control, privacy, or fine-tuning? → self-host the open weights.
  • Want Llama with zero ops? → a hosted partner (Together / Groq / Fireworks / Bedrock / Azure / Vertex).
  • Want it "from the source," OpenAI-compatible, and only prototyping? → the Meta Llama API (preview).
  • Just want an end-user assistant? → Meta AI (it is a product, not a platform).

2. The open-weight Llama models

This is the track most developers mean by "Llama." You download the weights and run them yourself.

The latest open-weight generation is Llama 4 (released April 2025), built on a Mixture-of-Experts architecture:

| Model | Active / total params | Context window | Multimodal |
| --- | --- | --- | --- |
| Llama 4 Scout (Llama-4-Scout-17B-16E) | 17B active / ~109B total (16 experts) | up to 10M tokens (Meta's headline claim) | Text + image → text; fits on a single H100 (with Int4 quantization, per Meta) |
| Llama 4 Maverick (Llama-4-Maverick-17B-128E) | 17B active / ~400B total (128 experts) | 1M tokens | Text + image → text |

Two caveats:
  • Llama 4 Behemoth (~288B active / ~2T total) was announced as "still in training" and has not been publicly released — treat it as unreleased.
  • The previous-generation dense model Llama 3.3 70B Instruct (text-only, Dec 2024) is still widely served and remains a sensible choice for text workloads.
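
The "fits on a single H100" note in the table can be sanity-checked with back-of-the-envelope arithmetic. In a Mixture-of-Experts model all experts must be resident in memory, so the total parameter count (not the 17B active) drives the weight footprint. The sketch below is a rough estimate only: it assumes the 80 GB H100 variant, counts weights alone, and ignores KV cache, activations, and runtime overhead, so real requirements are higher.

```python
# Rough weight-memory estimate: total parameters x bytes per parameter.
# Assumptions (not from Meta): 80 GB H100, weights only, 1 GB = 1e9 bytes.

H100_MEMORY_GB = 80

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Total (not active) parameters in billions, taken from the table above.
MODELS_TOTAL_PARAMS_B = {
    "Llama 4 Scout": 109,
    "Llama 4 Maverick": 400,
}


def weight_memory_gb(total_params_billions: float, precision: str) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    return total_params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for name, params in MODELS_TOTAL_PARAMS_B.items():
        for precision in BYTES_PER_PARAM:
            gb = weight_memory_gb(params, precision)
            verdict = "fits" if gb < H100_MEMORY_GB else "does not fit"
            print(f"{name} @ {precision}: ~{gb:.1f} GB, "
                  f"{verdict} on one {H100_MEMORY_GB} GB GPU")
```

Under these assumptions Scout's weights come to roughly 218 GB at bf16 and about 54.5 GB at int4, which is why the single-GPU claim depends on 4-bit quantization. Maverick at int4 is still around 200 GB and needs multiple GPUs.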

Safety models (use alongside the base models):

  • Llama Guard 4 (12B) — unified text + image safeguard for the Llama 4 and Llama 3 series.
  • Llama Prompt Guard 2 — prompt-injection / jailbreak detection, in 86M and a low-latency 22M variant.
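
When a guard model sits in front of or behind the main model, the application has to turn its text reply into a decision. The sketch below parses such a reply, assuming the format described in the Llama Guard model cards: a first line of `safe` or `unsafe`, and for unsafe content a second line of comma-separated category codes such as `S1,S10`. The function name is hypothetical, and the exact output format should be confirmed against the model card for the version you deploy.

```python
from typing import List, Tuple


def parse_guard_verdict(raw: str) -> Tuple[bool, List[str]]:
    """Parse a guard-model reply into (is_safe, violated_category_codes).

    Assumed format: "safe", or "unsafe" followed by a line of
    comma-separated category codes. Anything unrecognised is treated
    as unsafe so the caller fails closed.
    """
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        return False, []

    verdict = lines[0].lower()
    if verdict == "safe":
        return True, []

    categories: List[str] = []
    if verdict == "unsafe" and len(lines) > 1:
        categories = [code.strip() for code in lines[1].split(",") if code.strip()]
    return False, categories
```

Failing closed on empty or unexpected output is a design choice for this sketch; a production system might log and retry instead.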

(Code Llama is legacy — coding capability now lives in the general Llama 3+/4 models. No 2026 update was confirmed.)


3. The Llama license: read this before shipping

Llama 4 ships under the Llama 4 Community License Agreement (effective April 5, 2025). It is source-available, not OSI "open source."

  • Grant: royalty-free, worldwide, non-exclusive rights to use, modify, create derivatives, reproduce, and distribute. Commercial use is allowed.
  • The 700M-MAU clause: if, on the Llama 4 release date, the products or services made available by you or your affiliates had more than 700 million monthly active users in the preceding calendar month, you must request a separate license from Meta. Meta may grant or deny it at its sole discretion, and you may not exercise the license rights until it is granted.
  • Obligations: preserve attribution ("Built with Llama"), include the license, and comply with the Acceptable Use Policy.

Use the term "open weight" (or "source available"), not "open source" — because of the MAU restriction and acceptable-use limits, Llama does not meet the Open Source Initiative definition.


4. Llama Stack and Meta AI / Muse Spark

Llama Stack is an open, standardized API layer plus toolkit — inference, RAG, agents, safety, evals, telemetry — designed so an app written against it stays portable across providers (Together, Fireworks, Groq, vLLM, Ollama, AWS Bedrock, and others ship as adapters). It is OpenAI-compatible and lives in the llamastack/llama-stack repo.

Meta AI is the consumer assistant across WhatsApp, Instagram, Messenger, and meta.ai. It is an end-user product, not something you build on.

Muse Spark is Meta's new proprietary flagship reasoning model (announced April 8, 2026) that now powers Meta AI, replacing Llama inside the consumer apps. Key points for developers:

  • It is closed — no public weights. Reachable only via a private API preview to select partners (initially US/Canada).
  • It offers modes such as Instant, Thinking, and a parallel-reasoning "Contemplating" mode.
  • Meta's own wording is that it hopes to open-source future versions — a hope, not a commitment. Detailed specs (parameter count, context window) were not disclosed; do not assume numbers.

Practical implication: if you want Meta's newest, strongest model programmatically today, you generally cannot. For build-on work, use Llama 4 via the options below.


5. Quickstart paths

(a) Self-host locally with Ollama (shortest local path)

ollama run llama4

Ollama exposes a local OpenAI-compatible endpoint at http://localhost:11434/v1, so any OpenAI SDK works against it.
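
To make the "OpenAI-compatible" claim concrete, here is a minimal sketch of the request a client sends to that local endpoint, written with the Python standard library only so it has no dependencies. It assumes the server is running on the default port, that the model tag is `llama4` as in the command above, and that the chat route is `/chat/completions` under the base URL, as in the OpenAI API. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # local endpoint from the text above


def build_chat_request(prompt: str, model: str = "llama4",
                       base_url: str = OLLAMA_BASE_URL) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "llama4") -> str:
    """Send the request to a running local server and return the reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    # Response shape assumed to follow the OpenAI chat completions schema.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Say hello in one sentence.")` only works once `ollama run llama4` has pulled the model and the server is up. In application code the OpenAI SDK pointed at the same base URL does the equivalent.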

(b) A hosted partner (OpenAI-compatible)

Most partners speak the OpenAI API — swap base_url and the key:

from openai import OpenAI

client = OpenAI(
    api_key="<PARTNER_KEY>",
    base_url="https://api.together.xyz/v1",  # or https://api.groq.com/openai/v1
)

response = client.chat.completions.create(
    model="<partner's Llama 4 model id>",  # e.g. meta-llama/Llama-4-Maverick-17B-128E-Instruct
    messages=[{"role": "user", "content": "..."}],
)

(Each partner's exact Llama 4 model string differs — pull it from that partner's model list.)
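
Because only the base URL and key change between OpenAI-compatible backends, the choice of provider can live in configuration rather than code. The following is a sketch of that idea, not any partner's official setup. The base URLs are the ones quoted in this guide, while the environment variable names, the `Provider` class, and `client_kwargs` are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Provider:
    base_url: str
    api_key_env: str  # hypothetical environment variable holding the key


PROVIDERS: Dict[str, Provider] = {
    "together": Provider("https://api.together.xyz/v1", "TOGETHER_API_KEY"),
    "groq": Provider("https://api.groq.com/openai/v1", "GROQ_API_KEY"),
    "ollama": Provider("http://localhost:11434/v1", "OLLAMA_API_KEY"),
}


def client_kwargs(provider_name: str) -> Dict[str, str]:
    """Return keyword arguments suitable for OpenAI(**kwargs)."""
    provider = PROVIDERS[provider_name]
    # A placeholder keeps local servers (which typically ignore the key)
    # working; hosted partners will reject it at request time.
    api_key = os.environ.get(provider.api_key_env, "not-set")
    return {"base_url": provider.base_url, "api_key": api_key}
```

With the SDK installed, `OpenAI(**client_kwargs("groq"))` would then give a client for that backend. The model ID still has to be looked up per partner.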

(c) Meta's Llama API (preview)

OpenAI-SDK-compatible; requires waitlist access and an API key. The served models expose a 128k context window (the 10M / 1M figures are Meta's capability claims, not what the preview API serves). Confirm the exact base path against the live docs, since the preview is still changing.
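
The gap between the headline context figures and the served window matters when budgeting a request, because the prompt and the reply typically share the same window. The sketch below shows that arithmetic. It assumes "128k" means 128,000 tokens (it could equally be 131,072; check the live docs), and the function names are hypothetical.

```python
PREVIEW_CONTEXT_TOKENS = 128_000  # assumption: "128k" read as 128,000 tokens


def max_prompt_tokens(max_output_tokens: int,
                      context_window: int = PREVIEW_CONTEXT_TOKENS) -> int:
    """Tokens left for the prompt after reserving room for the reply."""
    if max_output_tokens < 0 or max_output_tokens > context_window:
        raise ValueError("max_output_tokens must be between 0 and context_window")
    return context_window - max_output_tokens


def fits(prompt_tokens: int, max_output_tokens: int,
         context_window: int = PREVIEW_CONTEXT_TOKENS) -> bool:
    """True if the prompt plus the reserved reply fits in the window."""
    return prompt_tokens <= max_prompt_tokens(max_output_tokens, context_window)
```

For example, a 900,000-token prompt that would sit within Maverick's claimed 1M window does not fit the assumed preview window, so long-document workloads may need chunking or a self-hosted deployment.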


6. Decision guide

| If you need… | Choose… |
| --- | --- |
| data residency, privacy, offline use, or fine-tuning | self-host: vLLM for production throughput, Ollama / llama.cpp for local/dev |
| Llama with zero ops, SLAs, and existing cloud billing | a hosted partner: Groq for lowest latency; Bedrock / Azure / Vertex if you are already in that cloud |
| to prototype "from the source," OpenAI-compatible | the Meta Llama API (preview; not for production commitments yet) |
| an end-user assistant | Meta AI (consumer product, not a platform) |

  • Solo developer / quick test: Ollama locally, or a hosted partner's free tier.
  • Production, cost-sensitive at scale: self-host Llama 4 on vLLM, or a hosted partner with autoscaling.
  • Regulated / data-residency: self-host the open weights on your own EU infrastructure (vLLM), pairing Llama Guard 4 for safety.
  • Portability across providers: build against Llama Stack so you can switch inference backends later.
