Meta Llama Guide
Meta's AI story now has two tracks: the open-weight Llama family you can download, self-host, and fine-tune, and a newer proprietary flagship (Muse Spark) that powers the Meta AI consumer assistant. This guide focuses on the developer-relevant track — Llama models, the hosting options, Llama Stack, and the license — and explains where Muse Spark fits.
This guide is based on official Meta sources (llama.com / developer.meta.com/ai, ai.meta.com, huggingface.co/meta-llama, the Llama Stack repo) plus the major hosting partners. Two things to keep straight. First, the latest open-weight generation is still Llama 4 (Scout / Maverick, April 2025); no newer open-weight generation has shipped. Second, Muse Spark (April 2026) is proprietary and not part of the open-weight family. Meta's own hosted Llama API is in preview, not GA. Preview endpoints and partner model IDs shift, so confirm against the live docs before relying on them.
1. The mental model
| Surface | What it is for | Primary user |
|---|---|---|
| Open-weight Llama models | Download the weights; self-host or fine-tune on your own infrastructure | Teams wanting control, privacy, on-prem |
| Llama API (Meta-hosted) | Meta-hosted inference for Llama models, OpenAI-SDK-compatible — preview | Developers wanting Llama without running infra |
| Llama Stack | Open, standardized API layer + toolkit (inference, RAG, agents, safety, evals, telemetry) you can run anywhere | Developers building portable LLM apps |
| Meta AI (consumer assistant) | End-user chat across WhatsApp, Instagram, Messenger, and meta.ai | Consumers (not a build-on platform) |
| Muse Spark | Meta's proprietary flagship model now powering Meta AI | Consumers; select partners via private API preview |
Hosting / inference ecosystem that serves Llama: Hugging Face (weights), Ollama and llama.cpp (local), vLLM (high-throughput serving), and hosted partners Together, Groq, Fireworks, AWS Bedrock, Azure, and Google Vertex AI.
Rule of thumb:
- Need control, privacy, or fine-tuning? → self-host the open weights.
- Want Llama with zero ops? → a hosted partner (Together / Groq / Fireworks / Bedrock / Azure / Vertex).
- Want it "from the source," OpenAI-compatible, and only prototyping? → the Meta Llama API (preview).
- Just want an end-user assistant? → Meta AI (it is a product, not a platform).
2. The open-weight Llama models
This is the track most developers mean by "Llama." You download the weights and run them yourself.
The latest open-weight generation is Llama 4 (released April 2025), built on a Mixture-of-Experts architecture:
| Model | Active / total params | Context window | Multimodal |
|---|---|---|---|
| Llama 4 Scout (Llama-4-Scout-17B-16E) | 17B active / ~109B total (16 experts) | up to 10M tokens (Meta's headline claim) | Text + image → text; fits on a single H100 |
| Llama 4 Maverick (Llama-4-Maverick-17B-128E) | 17B active / ~400B total (128 experts) | 1M tokens | Text + image → text |
- Llama 4 Behemoth (~288B active / ~2T total) was announced as "still in training" and has not been publicly released — treat it as unreleased.
- The previous-generation dense model Llama 3.3 70B Instruct (text-only, Dec 2024) is still widely served and remains a sensible choice for text workloads.
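The gap between active and total parameters matters when sizing hardware: in a Mixture-of-Experts model every expert's weights have to be resident in memory, even though only about 17B parameters are used per token. The following is a rough back-of-the-envelope sketch, assuming weights dominate memory and ignoring KV cache, activations, and runtime overhead. The 80 GB figure is the memory of a single 80 GB H100; the bit widths are illustrative assumptions, not deployment settings taken from this guide.

```python
# Rough weight-memory estimate for a Mixture-of-Experts model.
# Assumption: only the weights are counted. KV cache, activations, and
# framework overhead are ignored, so real requirements are higher.

H100_MEMORY_GB = 80  # memory of a single 80 GB H100


def weight_memory_gb(total_params_billions: float, bits_per_param: int) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    bytes_total = total_params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


def fits_on_one_h100(total_params_billions: float, bits_per_param: int) -> bool:
    """True if the weights alone fit in one H100's memory."""
    return weight_memory_gb(total_params_billions, bits_per_param) <= H100_MEMORY_GB


if __name__ == "__main__":
    # Total parameter counts (in billions) from the table above.
    for name, total_b in [("Scout", 109), ("Maverick", 400)]:
        for bits in (16, 8, 4):
            gb = weight_memory_gb(total_b, bits)
            ok = fits_on_one_h100(total_b, bits)
            print(f"{name} @ {bits}-bit: ~{gb:.1f} GB, fits one H100: {ok}")
```

Under these assumptions Scout's ~109B weights come to roughly 54.5 GB at 4 bits per parameter but about 218 GB at 16 bits, which suggests the single-H100 figure depends on aggressive quantization.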
Safety models (use alongside the base models):
- Llama Guard 4 (12B) — unified text + image safeguard for the Llama 4 and Llama 3 series.
- Llama Prompt Guard 2 — prompt-injection / jailbreak detection, in 86M and a low-latency 22M variant.
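One common way to use safety models "alongside" a base model is an input screen before generation and an output screen after it. The sketch below shows only that control flow. The three callables (`prompt_guard`, `generate`, `content_guard`) are hypothetical placeholders that you would back with your own Prompt Guard, Llama, and Llama Guard deployments; they are not real library APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GuardedReply:
    text: str
    blocked_by: Optional[str]  # None when nothing was blocked


def guarded_chat(
    user_prompt: str,
    prompt_guard: Callable[[str], bool],        # hypothetical: True if the prompt looks like an injection or jailbreak
    generate: Callable[[str], str],             # hypothetical: calls the base Llama model
    content_guard: Callable[[str, str], bool],  # hypothetical: True if the (prompt, reply) pair is unsafe
    refusal: str = "Sorry, I can't help with that.",
) -> GuardedReply:
    # 1. Cheap input screen (the role a Prompt Guard-style classifier would play).
    if prompt_guard(user_prompt):
        return GuardedReply(refusal, blocked_by="prompt_guard")
    # 2. Call the larger model only for prompts that passed the screen.
    reply = generate(user_prompt)
    # 3. Output screen (the role a Llama Guard-style safeguard would play).
    if content_guard(user_prompt, reply):
        return GuardedReply(refusal, blocked_by="content_guard")
    return GuardedReply(reply, blocked_by=None)
```

Putting the small classifier first means the expensive generation call is skipped for prompts that are rejected outright.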
(Code Llama is legacy — coding capability now lives in the general Llama 3+/4 models. No 2026 update was confirmed.)
3. The Llama license: read this before shipping
Llama 4 ships under the Llama 4 Community License Agreement (effective April 5, 2025). It is source-available, not OSI "open source."
- Grant: royalty-free, worldwide, non-exclusive rights to use, modify, create derivatives, reproduce, and distribute. Commercial use is allowed.
- The 700M-MAU clause: if, on the Llama 4 release date, the products or services you (or your affiliates) make available had more than 700 million monthly active users in the preceding calendar month, you must request a separate license from Meta, which Meta may grant or deny at its sole discretion.
- Obligations: preserve attribution ("Built with Llama"), include the license, and comply with the Acceptable Use Policy.
Use the term "open weight" (or "source available"), not "open source" — because of the MAU restriction and acceptable-use limits, Llama does not meet the Open Source Initiative definition.
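The MAU clause is easy to misread as a rolling limit. As summarized above, it is evaluated once, against the calendar month before the Llama 4 release date. The sketch below encodes that reading. It is a simplification of this guide's summary, not of the license text, and the function name and input are hypothetical.

```python
MAU_THRESHOLD = 700_000_000  # the 700 million figure from the clause


def needs_separate_license(mau_in_month_before_release: int) -> bool:
    """True if MAU in the calendar month before the Llama 4 release date
    exceeded the threshold. Later growth is not an input to this check."""
    return mau_in_month_before_release > MAU_THRESHOLD


if __name__ == "__main__":
    print(needs_separate_license(5_000_000))    # small product at release time
    print(needs_separate_license(900_000_000))  # very large platform at release time
```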
4. Llama Stack and Meta AI / Muse Spark
Llama Stack is an open, standardized API layer plus toolkit — inference, RAG, agents, safety, evals, telemetry — designed so an app written against it stays portable across providers (Together, Fireworks, Groq, vLLM, Ollama, AWS Bedrock, and others ship as adapters). It is OpenAI-compatible and lives in the llamastack/llama-stack repo.
Meta AI is the consumer assistant across WhatsApp, Instagram, Messenger, and meta.ai. It is an end-user product, not something you build on.
Muse Spark is Meta's new proprietary flagship reasoning model (announced April 8, 2026) that now powers Meta AI, replacing Llama inside the consumer apps. Key points for developers:
- It is closed — no public weights. Reachable only via a private API preview to select partners (initially US/Canada).
- It offers modes such as Instant, Thinking, and a parallel-reasoning "Contemplating" mode.
- Meta's own wording is that it hopes to open-source future versions — a hope, not a commitment. Detailed specs (parameter count, context window) were not disclosed; do not assume numbers.
Practical implication: if you want Meta's newest, strongest model programmatically today, you generally cannot. For build-on work, use Llama 4 via the options below.
5. Quickstart paths
(a) Self-host locally with Ollama (shortest local path)
```shell
ollama run llama4
```
Ollama exposes a local OpenAI-compatible endpoint at http://localhost:11434/v1, so any OpenAI SDK works against it.
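To make the "OpenAI-compatible" point concrete, here is a minimal standard-library sketch of the request that an OpenAI SDK would send to the local endpoint. It assumes the model is available under the `llama4` tag used above and that the server follows the usual OpenAI chat-completions request and response shape. `build_chat_request` and `chat` are illustrative helper names.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/v1"  # local endpoint named in the text


def build_chat_request(prompt: str, model: str = "llama4") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # assumes the model was pulled under this tag
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{OLLAMA_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request. Requires a running Ollama server on localhost."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    # OpenAI-style response shape: first choice, assistant message content.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Say hello")` only works while the Ollama server is running; building the request on its own needs no network access.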
(b) A hosted partner (OpenAI-compatible)
Most partners speak the OpenAI API — swap base_url and the key:
```python
from openai import OpenAI

client = OpenAI(
    api_key="<PARTNER_KEY>",
    base_url="https://api.together.xyz/v1",  # or https://api.groq.com/openai/v1
)
response = client.chat.completions.create(
    model="<partner's Llama 4 model id>",  # e.g. meta-llama/Llama-4-Maverick-17B-128E-Instruct
    messages=[{"role": "user", "content": "..."}],
)
```
(Each partner's exact Llama 4 model string differs — pull it from that partner's model list.)
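Because only the base URL and key change between OpenAI-compatible backends, one way to keep the switch in a single place is a small provider table. The sketch below uses the base URLs quoted in this guide. The environment-variable names, the `Provider` class, and the `client_kwargs` helper are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Provider:
    base_url: str
    api_key_env: str  # hypothetical environment-variable name


# Base URLs as quoted in this guide; confirm them against each partner's docs.
PROVIDERS = {
    "together": Provider("https://api.together.xyz/v1", "TOGETHER_API_KEY"),
    "groq": Provider("https://api.groq.com/openai/v1", "GROQ_API_KEY"),
    "ollama": Provider("http://localhost:11434/v1", "OLLAMA_API_KEY"),
}


def client_kwargs(name: str, env: Optional[Mapping[str, str]] = None) -> dict:
    """Return keyword arguments suitable for OpenAI(**kwargs)."""
    env = os.environ if env is None else env
    provider = PROVIDERS[name]  # raises KeyError for an unknown provider
    # A placeholder key is used when none is set, which is only acceptable
    # for a local server that does not check it.
    api_key = env.get(provider.api_key_env, "not-needed")
    return {"base_url": provider.base_url, "api_key": api_key}
```

With the `openai` package installed, `OpenAI(**client_kwargs("groq"))` would then build a client for that backend. The model ID still has to come from that partner's model list.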
(c) Meta's Llama API (preview)
The API is OpenAI-SDK-compatible and requires waitlist access plus an API key. The served models expose a 128k context window (the 10M / 1M figures are Meta's capability claims, not what the preview API serves). Confirm the exact base path against the live docs, because the preview is still changing.
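Since the preview serves a much smaller window than the headline figures, it can help to budget tokens before sending a long prompt. The sketch below makes two assumptions: that "128k" means 128,000 tokens, and that the window covers prompt and output combined. The characters-per-token ratio is a crude heuristic; accurate counts come from the model's tokenizer.

```python
import math

PREVIEW_CONTEXT_TOKENS = 128_000  # "128k" from the text; the exact limit is an assumption


def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Crude estimate of token count from character length."""
    return math.ceil(len(text) / chars_per_token)


def fits_preview_context(
    prompt_tokens: int,
    max_output_tokens: int,
    window: int = PREVIEW_CONTEXT_TOKENS,
) -> bool:
    """True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= window
```

For example, a prompt estimated at 127,000 tokens with 2,000 tokens of requested output would not fit under these assumptions, even though it is far below the 1M and 10M capability figures.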
6. Decision guide
| If you need… | Choose… |
|---|---|
| data residency, privacy, offline use, or fine-tuning | self-host — vLLM for production throughput, Ollama / llama.cpp for local/dev |
| Llama with zero ops, SLAs, and existing cloud billing | a hosted partner — Groq for lowest latency; Bedrock / Azure / Vertex if you are already in that cloud |
| to prototype "from the source," OpenAI-compatible | the Meta Llama API (preview — not for production commitments yet) |
| an end-user assistant | Meta AI (consumer product, not a platform) |
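The table can also be read as an ordered set of checks. The sketch below is one possible encoding. The precedence (end-user need first, then control, then zero-ops, then prototyping) is an assumption made for illustration and is not stated in the table, and `choose_surface` is a hypothetical helper.

```python
def choose_surface(
    *,
    end_user_only: bool = False,
    needs_control: bool = False,   # data residency, privacy, offline use, or fine-tuning
    wants_zero_ops: bool = False,  # SLAs, existing cloud billing, no infrastructure to run
    prototyping_only: bool = False,
) -> str:
    """Map requirements to a row of the decision table (assumed precedence)."""
    if end_user_only:
        return "Meta AI"
    if needs_control:
        return "self-host"
    if wants_zero_ops:
        return "hosted partner"
    if prototyping_only:
        return "Meta Llama API (preview)"
    return "no clear match"


if __name__ == "__main__":
    print(choose_surface(needs_control=True, wants_zero_ops=True))
    print(choose_surface(prototyping_only=True))
```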
7. Recommended starting points
- Solo developer / quick test: Ollama locally, or a hosted partner's free tier.
- Production, cost-sensitive at scale: self-host Llama 4 on vLLM, or a hosted partner with autoscaling.
- Regulated / data-residency: self-host the open weights with vLLM on infrastructure you control in the required jurisdiction (for example, your own EU servers), and pair them with Llama Guard 4 for safety.
- Portability across providers: build against Llama Stack so you can switch inference backends later.
8. Official links
Models & docs
- Llama home
- Llama 4 models
- Llama 4 announcement (specs)
- Llama 4 Community License
- Hugging Face — meta-llama
- Llama Stack (GitHub)
- Llama Guard 4 model card