Google released DiffusionGemma on June 10, an experimental open model that generates text using diffusion rather than the usual word-by-word method. Built by Google DeepMind on the Gemma 4 family, the model ships under a permissive Apache 2.0 license, with weights on Hugging Face. It is a 26-billion-parameter Mixture of Experts system that activates only 3.8 billion parameters per step. Instead of predicting one token at a time from left to right, it drafts a block of 256 tokens at once and refines them over several passes.
The approach borrows from AI image generators, which start with noise and sharpen it into a clear result. DiffusionGemma applies the same idea to text, beginning with placeholder tokens and locking in correct ones across passes.
Key points:
- Up to 4x faster generation on dedicated GPUs, with more than 1,000 tokens per second on an NVIDIA H100 and more than 700 on a GeForce RTX 5090.
- Runs within 18GB of VRAM when quantized, fitting high-end consumer cards.
- Bi-directional attention lets every token in a block consider all others, which helps with code infilling, in-line editing and non-linear text.
- Self-correction across passes, including cleanly closing complex markdown.
- A 256K-token context window and support for more than 140 languages.
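The draft-and-refine loop described above can be illustrated with a toy sketch. This is not Google's algorithm: `toy_denoiser`, `diffusion_decode`, the random confidence scores and the even commit schedule are all hypothetical, and the stand-in "model" simply peeks at a known target sentence. It only shows the shape of the process: a block starts fully masked and a share of positions is committed on every pass.

```python
import random

MASK = "<mask>"

def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for the model.

    Proposes a token and a confidence for every masked position. A real
    model would score the whole block at once; this toy reads the answer
    from `target` and attaches a random confidence.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok == MASK:
            proposals[i] = (target[i], rng.random())
    return proposals

def diffusion_decode(target, passes=4, seed=0):
    """Fill a fully masked block over a fixed number of passes."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = [list(block)]  # snapshot of the block after each pass
    for step in range(passes):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break
        # Commit an even share of the still-masked positions on each pass
        # (an assumed schedule, chosen only to keep the example simple).
        remaining_passes = passes - step
        k = -(-len(proposals) // remaining_passes)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history

if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    final, history = diffusion_decode(words, passes=4)
    for snapshot in history:
        print(" ".join(snapshot))
```

Each printed line is one pass, so the output shows tokens appearing out of order rather than left to right. The sketch leaves out the self-correction step the article mentions, in which already-placed tokens can be revised on a later pass.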
The speed comes with a clear trade-off. Google says output quality is lower than standard Gemma 4 and recommends the regular models when quality matters most. The gains also concentrate on local, single-user workloads. In high-volume cloud serving, where autoregressive models already use hardware efficiently, parallel decoding offers less benefit and can cost more.
DiffusionGemma works with tools including MLX, vLLM, Hugging Face Transformers, Unsloth and NVIDIA NeMo, with llama.cpp support planned. Google worked with NVIDIA on optimizations such as 4-bit NVFP4 kernels for RTX, Hopper and Blackwell hardware.
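A rough back-of-the-envelope check shows why quantization matters for the 18GB figure. The article does not say which precision that figure refers to, so the 4-bit case below is an assumption, and the helper `weight_memory_gb` is hypothetical. It counts weights only, ignoring quantization scale factors, the KV cache and activations. It also assumes all 26 billion parameters stay resident in memory, since in a Mixture of Experts model any expert may be needed even though only a fraction is active per step.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9

if __name__ == "__main__":
    # Total parameter count from the article; bit widths are illustrative.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: about {weight_memory_gb(26, bits):.1f} GB")
```

Under these assumptions, 4-bit weights come to about 13 GB, which would leave some headroom below the quoted 18GB for everything the estimate leaves out, while 16-bit weights alone would need about 52 GB.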
Who It’s For
- The model targets researchers and developers building latency-sensitive local tools, such as inline editors and code completion.
- Lower memory needs put a capable open model within reach of consumer GPUs, which often cannot run large language models.
- Fine-tuning can sharpen narrow skills. Google cites an Unsloth fine-tune that taught the model to solve Sudoku, a task that trips up left-to-right models.
The Technical Shift
Most production models, including those from OpenAI, Anthropic and Google itself, are autoregressive and generate text sequentially. Researchers have explored text diffusion for years, but scaling it to large models has been hard. DiffusionGemma is built on the Gemma 4 26B A4B model that Google released in April, with its attention mechanism reworked to enable parallel generation.
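The difference between sequential and block-wise attention can be pictured with two small masks. The article does not describe DiffusionGemma's actual attention layout, so this is only an illustrative assumption: `causal_mask` is the standard autoregressive pattern, while the hypothetical `block_bidirectional_mask` lets each token see every token in its own block plus all earlier blocks.

```python
def causal_mask(n):
    """Standard autoregressive mask: position i sees positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def block_bidirectional_mask(n, block_size):
    """Assumed layout: full visibility inside a block, plus earlier blocks."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(n)]
        for i in range(n)
    ]

def render(mask):
    """Draw a mask as rows of 'x' (visible) and '.' (hidden)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)

if __name__ == "__main__":
    print(render(causal_mask(4)))
    print()
    print(render(block_bidirectional_mask(4, block_size=2)))
```

In the causal mask the first token sees only itself. In the block version it also sees the token after it, which is the property that lets a model fill in the middle of a passage using text on both sides.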
By shifting the bottleneck from memory bandwidth to raw compute, DiffusionGemma puts otherwise idle local hardware to fuller use. The release signals growing interest in alternatives to token-by-token decoding, and pairs an open license with day-one support across common developer frameworks.
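The bandwidth argument can be made concrete with a simplified estimate of how many weight bytes must be read from memory per generated token. Everything here is a sketch under stated assumptions: the helper `weight_bytes_per_token` is hypothetical, weights are taken to be 4-bit, the figure of 16 refinement passes is invented for illustration (the article says only "several"), and the model is assumed to read just its 3.8 billion active parameters on each pass. In practice, different tokens in a block may route to different experts, so a block pass could touch more weights than this.

```python
def weight_bytes_per_token(active_params, bits_per_weight, tokens_per_block, passes=1):
    """Weight bytes read per generated token, under the assumptions above."""
    bytes_per_pass = active_params * bits_per_weight / 8
    return bytes_per_pass * passes / tokens_per_block

if __name__ == "__main__":
    active = 3.8e9  # active parameters per step, from the article

    # Autoregressive decoding: one forward pass yields one token.
    sequential = weight_bytes_per_token(active, 4, tokens_per_block=1)

    # Block decoding: 256 tokens share each pass; 16 passes is hypothetical.
    block = weight_bytes_per_token(active, 4, tokens_per_block=256, passes=16)

    print(f"sequential: {sequential / 1e6:.0f} MB of weights per token")
    print(f"block:      {block / 1e6:.0f} MB of weights per token")
    print(f"reduction:  {sequential / block:.0f}x")
```

The reduction in memory traffic equals the block size divided by the number of passes, so the benefit shrinks as more passes are needed. The compute per pass grows with the block size, which is why the work moves from memory bandwidth toward arithmetic throughput.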