Google Releases DiffusionGemma, an Open Model for Faster Text Generation

Google released DiffusionGemma, an open model that generates text in parallel blocks for up to four times faster output on local GPUs.

By Daniel Mercer | Edited by Maria Konash | Published: Jun 11, 2026 at 1:39 pm UTC | Updated: Jun 11, 2026 at 4:33 pm UTC

On June 10, Google released DiffusionGemma, an experimental open model that generates text using diffusion rather than the usual word-by-word method. Built by Google DeepMind on the Gemma 4 family, the model ships under a permissive Apache 2.0 license, with weights on Hugging Face. It is a 26-billion-parameter Mixture of Experts system that activates only 3.8 billion parameters per step. Instead of predicting one token at a time from left to right, it drafts a block of 256 tokens at once and refines them over several passes.

The approach borrows from AI image generators, which start with noise and sharpen it into a clear result. DiffusionGemma applies the same idea to text, beginning with placeholder tokens and locking in correct ones across passes. Key points:

Up to 4x faster generation on dedicated GPUs, with more than 1,000 tokens per second on an NVIDIA H100 and more than 700 on a GeForce RTX 5090.
Runs within 18GB of VRAM when quantized, fitting high-end consumer cards.
Bidirectional attention lets every token in a block attend to all the others, which helps with code infilling, inline editing and non-linear text.
Self-correction across passes, including cleanly closing complex Markdown structures.
A 256K-token context window and support for more than 140 languages.
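
The draft-and-refine loop described above can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm or API: `toy_denoiser`, `block_decode` and the even-share commit schedule are hypothetical names and choices made for illustration, and the stand-in "model" simply reads its proposals from a known target. The sketch shows only the control flow: start from placeholder tokens, score every open position in one pass, and commit the most confident ones each time.

```python
import random

MASK = "<mask>"  # placeholder token for positions that are not yet decided


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a diffusion model.

    For every masked position it returns a (proposed_token, confidence) pair.
    A real model would predict these from context; here the proposal is read
    from `target` and the confidence is random, purely to drive the loop.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def block_decode(target, num_passes=4, seed=0):
    """Fill a block of placeholders over at most `num_passes` passes."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = []
    for step in range(num_passes):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break  # every position is already committed
        # Commit an even share of the remaining slots, most confident first.
        passes_left = num_passes - step
        quota = -(-len(proposals) // passes_left)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:quota]:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    final, history = block_decode(words, num_passes=4)
    for n, snapshot in enumerate(history, start=1):
        print(n, " ".join(snapshot))
```

Unlike the behavior Google describes, this toy never revises a token once it is committed, so it does not model self-correction across passes.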

The speed comes with a clear trade-off. Google says output quality is lower than standard Gemma 4 and recommends the regular models when quality matters most. The gains also concentrate on local, single-user workloads. In high-volume cloud serving, where autoregressive models already use hardware efficiently, parallel decoding offers less benefit and can cost more.

DiffusionGemma works with tools including MLX, vLLM, Hugging Face Transformers, Unsloth and NVIDIA NeMo, with llama.cpp support planned. Google worked with NVIDIA on optimizations such as 4-bit NVFP4 kernels for RTX, Hopper and Blackwell hardware.

Who It’s For

The model targets researchers and developers building latency-sensitive local tools, such as inline editors and code completion.
Lower memory needs put a capable open model within reach of consumer GPUs that often cannot run large models.
Fine-tuning can sharpen narrow skills. Google cites an Unsloth fine-tune that taught the model to solve Sudoku, a task that trips up left-to-right models.
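
The memory point above can be sanity-checked with rough arithmetic. The sketch below estimates weight storage for a 26-billion-parameter model at a few bit widths and compares it with the 18GB figure. The article does not say which quantization level that figure refers to, so the bit widths here are assumptions, and `weight_memory_gb` is a hypothetical helper. It counts weights only and ignores quantization scale factors, activations and the attention cache.

```python
TOTAL_PARAMS = 26e9   # total parameter count reported in the article
VRAM_BUDGET_GB = 18   # quantized footprint reported in the article


def weight_memory_gb(total_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    # Assumed bit widths; the article does not state which one applies.
    for bits in (16, 8, 4):
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        verdict = "fits" if gb <= VRAM_BUDGET_GB else "does not fit"
        print(f"{bits}-bit weights: about {gb:.0f} GB, {verdict} in {VRAM_BUDGET_GB} GB")
```

Under these assumptions only the 4-bit case leaves headroom inside 18GB, which is consistent with the article's mention of 4-bit kernels but is not confirmed by it.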

The Technical Shift

Most production models, including those from OpenAI, Anthropic and Google itself, are autoregressive and generate text sequentially. Researchers have explored text diffusion for years, but scaling it to large models has been hard. DiffusionGemma is built on the Gemma 4 26B A4B model that Google released in April, with its attention mechanism reworked to enable parallel generation.

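Single-stream autoregressive decoding is typically limited by how quickly model weights can be read from memory, because every new token requires another pass over them. By processing a whole block per pass and so shifting the bottleneck from memory bandwidth to raw compute, DiffusionGemma puts otherwise idle local hardware to fuller use. The release signals growing interest in alternatives to token-by-token decoding, and pairs an open license with day-one support across common developer frameworks.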
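
The bottleneck shift can be made concrete with a back-of-the-envelope, roofline-style estimate. The sketch below compares the time to read weights with the time to do the arithmetic for one forward pass. The hardware numbers are round hypothetical values, not the specifications of any GPU. The 4-bit weight size, the rule of thumb of 2 floating-point operations per active parameter per token, and the worst-case assumption that a 256-token block touches every expert are also assumptions. Attention cost, the cache and the number of refinement passes are ignored.

```python
MEM_BANDWIDTH = 1e12   # bytes per second, hypothetical round number
PEAK_FLOPS = 1e14      # floating-point operations per second, hypothetical
BYTES_PER_PARAM = 0.5  # assumes 4-bit weights

TOTAL_PARAMS = 26e9    # from the article
ACTIVE_PARAMS = 3.8e9  # from the article


def pass_times(params_read, active_params, tokens_per_pass):
    """Return (memory_seconds, compute_seconds) for one forward pass."""
    memory_s = params_read * BYTES_PER_PARAM / MEM_BANDWIDTH
    # Rule of thumb: about 2 FLOPs per active parameter per token.
    compute_s = 2 * active_params * tokens_per_pass / PEAK_FLOPS
    return memory_s, compute_s


def bottleneck(memory_s, compute_s):
    """Name the slower of the two resources for a pass."""
    return "memory" if memory_s >= compute_s else "compute"


if __name__ == "__main__":
    # One token per pass: only the active experts are read.
    single = pass_times(ACTIVE_PARAMS, ACTIVE_PARAMS, tokens_per_pass=1)
    # A 256-token block: assume, as a worst case, every expert is read once.
    block = pass_times(TOTAL_PARAMS, ACTIVE_PARAMS, tokens_per_pass=256)
    print("single token:", single, bottleneck(*single))
    print("256-token block:", block, bottleneck(*block))
```

With these placeholder numbers the single-token pass is limited by memory reads and the block pass by arithmetic. Actual throughput would also depend on how many refinement passes each block needs.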
