Unofficial community resource — not affiliated with Google

DiffusionGemma — Try Google's Text-Diffusion Model Online

Everything about DiffusionGemma in one place: what Google DeepMind's new text-diffusion model is, how fast it really runs, how it compares to Gemma 4, and how to run it yourself — plus a free playground for prototyping prompts.

Text Generation Playground

No API provider offers hosted DiffusionGemma inference yet, so this playground runs on Gemini 2.5 Flash to let you prototype prompts today. Playground output is not produced by DiffusionGemma. To run the real model, see How to Use DiffusionGemma below.

What is DiffusionGemma?

DiffusionGemma is an experimental open-weights language model from Google DeepMind, released on June 10, 2026. Instead of writing text one token at a time the way GPT-style models do, it generates text through discrete diffusion: it starts from a canvas of noise tokens and iteratively denoises blocks of 256 tokens in parallel until coherent text emerges. The result is one of the fastest open models available — Google reports generation speeds of more than 1,000 tokens per second on a single NVIDIA H100, up to four times faster than comparable autoregressive models.

The model is built on the Gemma 4 architecture as a 26B-class Mixture-of-Experts (25.2B total parameters), but only about 3.8B parameters are active per step — which is why Google labels it 26B A4B. It accepts text, image, and video inputs and produces text output, and it ships under the permissive Apache 2.0 license, so you can download the weights and deploy it in your own projects.

Speed and Benchmarks

Speed is the headline feature. Because diffusion decodes whole blocks of tokens in parallel rather than waiting on one token before predicting the next, DiffusionGemma keeps a GPU busy in a way sequential decoding can't. Published numbers so far:

1,000+ tokens/second on a single NVIDIA H100 (Google's official figure, single-user mode).
1,008 tokens/second on H100 and 1,288 tokens/second on H200 with the FP8-quantized checkpoint, measured by the vLLM team. DiffusionGemma is the first diffusion LLM natively supported in vLLM.
Up to 4x faster than Google's comparable autoregressive Gemma models in single-user inference.
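
To put these figures in concrete terms, here is a small back-of-the-envelope sketch in Python. The 2,000-token response length is an arbitrary example, and dividing the vLLM H100 figure by four to get an autoregressive baseline is an assumption: "up to 4x" is an upper bound, so a real baseline may well be faster than the value computed here.

```python
# Throughput figures quoted above (tokens per second, FP8 checkpoint, vLLM team).
H100_TPS = 1008
H200_TPS = 1288
MAX_SPEEDUP = 4  # "up to 4x" is an upper bound, not a measured ratio


def seconds_to_generate(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to emit `tokens` at a steady decode rate."""
    return tokens / tokens_per_second


# Assumption: if the full 4x applied, the autoregressive baseline would be this slow.
implied_baseline_tps = H100_TPS / MAX_SPEEDUP

response_tokens = 2000  # hypothetical response length
fast = seconds_to_generate(response_tokens, H100_TPS)
slow = seconds_to_generate(response_tokens, implied_baseline_tps)
print(f"DiffusionGemma on H100:     {fast:.2f} s")
print(f"Implied 4x-slower baseline: {slow:.2f} s")
print(f"H200 vs H100:               {H200_TPS / H100_TPS:.2f}x")
```

Under these assumptions a 2,000-token response takes roughly 2 seconds instead of roughly 8, and the H200 result is about 1.28 times the H100 result.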

The trade-off is output quality: early community testing (including the r/LocalLLaMA discussion) finds DiffusionGemma's raw text quality lower than standard Gemma 4 of the same size. Google positions it as experimental — a preview of where fast parallel generation is heading, not a drop-in replacement for frontier chat models.

Architecture: MoE Meets Text Diffusion

Two design choices define DiffusionGemma. First, the Mixture-of-Experts backbone inherited from Gemma 4: 25.2B total parameters split across experts, with only ~3.8B activated for any given step. That keeps per-step compute low. Memory is a separate matter, because all 25.2B weights still have to be loaded, but quantized builds fit in roughly 18 GB of VRAM, which puts high-end consumer GPUs in reach.
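
The parameter counts above translate into rough memory figures. The sketch below is weights-only arithmetic and assumes 1 GB = 10^9 bytes. The bit widths are common examples rather than the formats of any particular published checkpoint, and activations, caches, and runtime overhead are not counted.

```python
TOTAL_PARAMS = 25.2e9   # total parameters, from the text above
ACTIVE_PARAMS = 3.8e9   # approximate parameters active per step


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active per step: {active_fraction:.1%} of all parameters")

# Illustrative precisions only; real quantized builds often mix bit widths.
for name, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>6}: {weight_gb(TOTAL_PARAMS, bits):5.1f} GB of weights")
```

At 4 bits per weight the weights alone come to about 12.6 GB, so the roughly 18 GB figure presumably also covers higher-precision layers and runtime overhead, although the source does not break that down. At 16 bits the weights need about 50 GB, which is why full-precision serving points to data-center cards.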

Second, block diffusion decoding, building on Google's Gemini Diffusion research. The model treats a 256-token block as a canvas, fills it with noise, and refines all positions simultaneously over a handful of denoising iterations. Because the model sees bidirectional context within the block at every step, it can revise earlier choices as later ones firm up. Sequential decoders commit each token as soon as it is emitted, so they lack this kind of self-correction. Bidirectional refinement also makes the model naturally strong at infilling: completing gaps in the middle of text or code rather than only appending at the end.
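
The control flow of block denoising can be illustrated with a toy loop. This is only a sketch, not DiffusionGemma's actual algorithm: the function name `denoise_block` is hypothetical, the "denoiser" cheats by peeking at a known target sentence (a real model predicts tokens from the noisy canvas), and the linear schedule for how many positions settle per step is an assumption.

```python
import math
import random


def denoise_block(target, vocab, steps, seed=0):
    """Toy block denoising: every position is rewritten on every step.

    `target` stands in for the text a trained model would converge to.
    """
    rng = random.Random(seed)
    n = len(target)
    order = list(range(n))
    rng.shuffle(order)  # arbitrary order in which positions firm up

    canvas = [rng.choice(vocab) for _ in range(n)]  # start from pure noise
    history = [list(canvas)]
    for step in range(1, steps + 1):
        # Assumed linear schedule: a growing share of positions is settled.
        settled = set(order[: math.ceil(n * step / steps)])
        # One parallel pass proposes a token for all n positions at once.
        canvas = [target[i] if i in settled else rng.choice(vocab)
                  for i in range(n)]
        history.append(list(canvas))
    return canvas, history


target = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(target)) + ["zebra", "cloud", "seven"]
final, history = denoise_block(target, vocab, steps=4)
for i, snapshot in enumerate(history):
    print(i, " ".join(snapshot))
```

The nine-token block is finished after four parallel passes, where a token-by-token decoder would need nine sequential steps. Unsettled positions change between passes, which loosely mirrors how earlier guesses can be revised.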

DiffusionGemma vs Gemma 4

Both models share the same 26B A4B MoE foundation, so the comparison is really about the decoding strategy:

Throughput: DiffusionGemma is up to 4x faster in single-user inference; Gemma 4 decodes token-by-token.
Output quality: Gemma 4 currently produces higher-quality text at the same scale. DiffusionGemma trades some quality for speed.
Editing and infilling: DiffusionGemma's bidirectional denoising handles insert-in-the-middle and fill-in-the-blank tasks natively; autoregressive models need special fill-in-middle training for that.
Maturity: Gemma 4 is a stable release line; DiffusionGemma is explicitly experimental.
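
The infilling point can be made concrete with a toy canvas. The sketch below shows one common way to set up fill-in-the-middle for a denoising model: known tokens are clamped and only the gap is editable. Whether DiffusionGemma handles infilling exactly this way is an assumption, and `NOISE`, `build_infill_canvas`, `apply_proposal`, and the hard-coded proposal are all hypothetical.

```python
NOISE = "<noise>"  # hypothetical placeholder for a position not yet denoised


def build_infill_canvas(prefix, suffix, gap_len):
    """Lay out known text around a noisy gap and mark which slots may change."""
    canvas = prefix + [NOISE] * gap_len + suffix
    editable = [False] * len(prefix) + [True] * gap_len + [False] * len(suffix)
    return canvas, editable


def apply_proposal(canvas, editable, proposal):
    """Accept the denoiser's proposal only at editable positions."""
    return [new if can_edit else old
            for old, can_edit, new in zip(canvas, editable, proposal)]


prefix = ["def", "mean", "(", "xs", ")", ":"]
suffix = ["return", "total", "/", "len", "(", "xs", ")"]
canvas, editable = build_infill_canvas(prefix, suffix, gap_len=4)

# Stand-in for one denoising pass: a proposal for every position,
# including the ones that must stay fixed.
proposal = ["X"] * len(prefix) + ["total", "=", "sum", "(xs)"] + ["Y"] * len(suffix)
canvas = apply_proposal(canvas, editable, proposal)
print(" ".join(canvas))
```

Because the prefix and suffix are both visible while the gap is refined, the filled-in tokens can be conditioned on what comes after them as well as what comes before.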

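A practical heuristic: reach for DiffusionGemma when single-user latency or structured editing dominates your requirements, and stay with Gemma 4 (or larger frontier models) when answer quality is the bottleneck. The published speed figures are all single-user measurements, so they say nothing yet about heavily batched serving.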

How to Use DiffusionGemma

There are two practical paths today. The first is the official open-weights route: download the model from Hugging Face (google/diffusiongemma-26B-A4B-it), Kaggle, or Vertex AI and run it yourself with transformers, vLLM, or a quantized build in llama.cpp, Ollama, LM Studio, or a similar runtime. Our step-by-step local guide covers each option with the exact commands.
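
As a minimal sketch of the download step, the snippet below fetches the weights with `snapshot_download` from the `huggingface_hub` package, using the repository id given above. The helper names `download_kwargs` and `download` are hypothetical, the package has to be installed separately, and the repository may require accepting license terms and signing in first. Loading and running the model is left to the local guide.

```python
REPO_ID = "google/diffusiongemma-26B-A4B-it"  # repository id quoted in the text


def download_kwargs(revision=None, local_dir=None):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {"repo_id": REPO_ID}
    if revision is not None:
        kwargs["revision"] = revision    # branch, tag, or commit hash to pin
    if local_dir is not None:
        kwargs["local_dir"] = local_dir  # otherwise the default cache is used
    return kwargs


def download(revision=None, local_dir=None):
    """Download the checkpoint files and return the local folder path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(revision, local_dir))


print(download_kwargs(revision="main"))
```

Calling `download(revision=...)` with a specific commit hash instead of `"main"` pins the exact files you tested against, which is worth doing while the checkpoints are still changing.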

The second is prompt prototyping in the playground at the top of this page. No hosted API currently serves native DiffusionGemma inference, so the playground runs on Gemini 2.5 Flash — clearly labeled, and useful for drafting and iterating prompts you'll later run against the real model locally. Sign in, spend a free credit, and you have a working text-generation loop in seconds.

Limitations to Know Before You Build

Quality gap: generated text ranks below standard Gemma 4 in early evaluations — expect rougher prose and more factual slips.
Experimental status: APIs, checkpoints, and best practices are days old and likely to shift; pin versions in anything that matters.
No hosted API yet: as of mid-June 2026 no inference provider offers DiffusionGemma as a managed endpoint — running it means provisioning your own GPU.
Hardware floor: even quantized you need ~18 GB VRAM; comfortable full-precision serving wants an H100-class card.

Frequently Asked Questions

What is DiffusionGemma?

DiffusionGemma is Google DeepMind's experimental open-weights language model released on June 10, 2026. It generates text by iteratively denoising 256-token blocks in parallel (discrete diffusion) instead of predicting one token at a time, reaching over 1,000 tokens per second on a single NVIDIA H100.

Is DiffusionGemma free to use?

Yes. The weights are released under the Apache 2.0 license on Hugging Face, Kaggle, and Vertex AI, so you can download, modify, and deploy the model commercially at no cost. You pay only for your own compute.

Can I try DiffusionGemma online?

No hosted API currently serves native DiffusionGemma inference. The playground on this page runs on Gemini 2.5 Flash for prompt prototyping and is labeled accordingly. To run the real model you currently need your own GPU — see our how-to-run guide.

How fast is DiffusionGemma really?

Google reports 1,000+ tokens per second on a single H100 in single-user mode, up to 4x faster than comparable autoregressive Gemma models. The vLLM team independently measured 1,008 tokens/second on H100 and 1,288 tokens/second on H200 using the FP8 checkpoint.

Is DiffusionGemma better than Gemma 4?

It is faster, not better. Both share the 26B A4B MoE architecture, but early testing finds DiffusionGemma’s output quality lower than standard Gemma 4. Its advantages are speed, parallel decoding, and native infilling/editing thanks to bidirectional context.

What hardware do I need to run DiffusionGemma locally?

Quantized builds fit in roughly 18 GB of VRAM, so a high-end consumer GPU (e.g. a 24 GB card) works. Full-speed FP8 serving was benchmarked on H100/H200-class accelerators. Community quantizations for llama.cpp, Ollama, and LM Studio are listed on the Hugging Face model page.

Is this the official DiffusionGemma website?

No. diffusiongemma.org is an unofficial community resource and is not affiliated with or endorsed by Google. Official documentation lives at ai.google.dev and deepmind.google — we link to both throughout the site.