Local AI Inference: LiteRT-LM and DiffusionGemma

What Google’s New Local AI Stack Actually Is

Google’s latest Gemma-focused tools, LiteRT-LM and DiffusionGemma, are built for local AI inference. They accelerate text generation, cut memory usage, and broaden language and platform support, so developers can run capable models directly on consumer hardware instead of relying on high-latency calls to distant cloud servers. Together, they mark a clear push toward edge computing speed, where powerful models execute on phones, laptops, and desktops. LiteRT-LM builds on LiteRT (formerly TensorFlow Lite) with an orchestration layer tuned for Gemma 4, while DiffusionGemma experiments with diffusion-based text generation rather than standard token-by-token decoding. For teams building privacy-sensitive, interactive apps, from chat interfaces to multimodal assistants, these tools promise lower latency, greater control over data, and more predictable costs than fully cloud-bound architectures.

Google’s LiteRT-LM and DiffusionGemma Push Local AI Inference into the Fast Lane

LiteRT-LM: Multi-Token Gemma 4 Optimization for Faster On-Device AI

LiteRT-LM is Google’s specialized runtime for Gemma 4, built to squeeze more speed and efficiency out of local hardware. A key addition is native support for Gemma 4 Multi-Token Prediction (MTP) drafters, which bring speculative decoding to on-device inference: a small drafter proposes several tokens ahead, and the main model then verifies them. These drafters deliver up to 2.2x faster inference for Gemma 4 E4B and 1.6x for E2B. According to Google, prefill and decode are 1.8x to 3.7x faster than with frameworks such as llama.cpp, MLX, Cactus, and ONNX. The runtime enforces memory locality by running both the primary model and the MTP drafter on the same GPU, sharing the KV cache and activations to avoid slow cross-chip transfers. It also focuses on memory efficiency: per-layer embeddings stay out of main memory, and image or audio encoders load only when needed. These measures shrink Gemma 4 E2B from an original footprint of about 2.58GB to about 607MB on Apple mobile CPUs, less than a quarter of the original size.
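To make the draft-and-verify idea concrete, here is a minimal Python sketch of greedy speculative decoding. It is not LiteRT-LM's implementation or API: `target_model`, `draft_model`, and `speculative_decode` are hypothetical toy stand-ins, and the "models" are simple arithmetic functions chosen so the drafter is right most of the time. The point it illustrates is that the output matches plain greedy decoding with the target model while needing fewer target verification passes.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_model(ctx: List[int]) -> int:
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_model(ctx: List[int]) -> int:
    """Toy stand-in for a drafter: agrees with the target except at every 4th position."""
    guess = target_model(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The cheap drafter proposes k tokens ahead.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal. A real runtime would score all k
        #    positions in one batched forward pass; here we count it as one pass.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong draft token
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all drafts right: bonus token
        out.extend(accepted)
    return out[:goal], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(target_model, draft_model, [1, 2, 3], 20)
    print("same as greedy:", tokens == greedy_decode(target_model, [1, 2, 3], 20))
    print("verification passes:", passes, "vs 20 sequential target calls")
```

Because every emitted token is one the target model itself would have chosen, the speedup depends entirely on how often the drafter guesses correctly, which is consistent with the different gains reported for E4B and E2B.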

DiffusionGemma: 4x Faster Text Blocks on Local GPUs

DiffusionGemma takes a different route to speed, applying diffusion techniques more familiar from image generation to text. Instead of predicting one token at a time, it denoises blocks of up to 256 tokens in parallel, repeatedly refining them until the entire block settles into usable text. Google reports that DiffusionGemma reaches over 1,000 tokens per second on a single Nvidia H100 and around 700 tokens per second on a GeForce RTX 5090, roughly four times faster than similarly sized autoregressive Gemma models on local GPUs. The model uses a 26-billion-parameter Mixture of Experts design but activates only 3.8 billion parameters per inference (about 15% of the total) and can fit within 18GB of VRAM when quantized. It accepts text, image, and video inputs, supports more than 140 languages, and works with a 256K-token context window, making it attractive for large local documents and multimodal workflows.
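The article does not describe DiffusionGemma's actual algorithm, so the following Python sketch only illustrates the general shape of block-wise parallel refinement using iterative unmasking, one common approach to text diffusion. Everything here is an assumption for illustration: `toy_denoiser` is a hypothetical stand-in that already "knows" the answer and attaches random confidences, and `denoise_block` is not a real API. What the sketch shows is the cost structure: the number of model calls tracks the number of refinement steps, not the number of tokens.

```python
import random
from typing import List, Optional, Tuple

MASK = None  # placeholder for a position that has not been decided yet


def toy_denoiser(
    block: List[Optional[int]], reference: List[int], rng: random.Random
) -> List[Tuple[int, float]]:
    """Hypothetical network stand-in: one call predicts every position at once,
    returning (token, confidence) pairs. It cheats by reading `reference`."""
    return [(reference[i], rng.random()) for i in range(len(block))]


def denoise_block(reference: List[int], steps: int, seed: int = 0) -> Tuple[List[int], int]:
    """Fill a fully masked block in roughly `steps` parallel refinement passes."""
    rng = random.Random(seed)
    n = len(reference)
    block: List[Optional[int]] = [MASK] * n
    per_step = -(-n // steps)  # ceiling division: positions committed per pass
    calls = 0
    while any(tok is MASK for tok in block):
        preds = toy_denoiser(block, reference, rng)  # one parallel pass over the block
        calls += 1
        # Commit the most confident predictions among still-masked positions.
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = preds[i][0]
    return block, calls


if __name__ == "__main__":
    reference = list(range(256))  # a 256-token block, the maximum size cited above
    block, calls = denoise_block(reference, steps=8)
    print("block complete:", block == reference)
    print("model calls:", calls, "vs", len(reference), "for token-by-token decoding")
```

The step count of 8 is an arbitrary choice for the demo; a real model trades the number of refinement steps against output quality.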

Why Local Inference Matters: Latency, Privacy, and Edge Computing Speed

Both LiteRT-LM and DiffusionGemma signal that local AI inference is becoming practical for real-world consumer and enterprise applications. Running models on-device removes round-trips to remote servers, which reduces latency and helps interactions feel instantaneous, especially for multi-turn chat, code assistants, and in-line editing tools. Keeping data on local hardware also lowers the privacy and sovereignty risks linked to sending prompts or user content to external services. DiffusionGemma’s parallel generation is particularly well-suited to single-user scenarios, where token-by-token decoding would otherwise leave much of a GPU’s parallel compute idle. At the same time, LiteRT-LM’s quantization and smart memory handling show how Gemma 4 optimization can fit capable models on laptops and phones without exceeding memory budgets. Combined, they move edge computing speed closer to what developers are used to from cloud GPUs, but under direct local control.

What App Developers Can Do with Faster On-Device AI

For developers, the headline is that on-device AI performance is no longer a niche experiment. LiteRT-LM now exposes native APIs not only for Kotlin and C++ but also for Swift and JavaScript, opening Gemma 4 optimization and multi-token prediction to mobile and web stacks. Its built-in session management lets apps save and restore the KV cache, so long conversations, code sessions, or document workflows can resume without re-running the full prompt. Support for constrained decoding, function-calling, and “Thinking Mode” gives app logic more control over structured outputs and tool use. DiffusionGemma’s block-wise generation fits local coding tools, in-place document editing, or scientific workloads that need bi-directional context and large windows. Together, these tools make it feasible to design products around edge computing speed, where low-latency, privacy-friendly intelligence runs next to the user instead of in a distant data center.
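The benefit of saving and restoring a KV cache can be sketched in a few lines of Python. This is a toy model of the idea only, not LiteRT-LM's session API: `ToySession`, `feed`, `save`, and `restore` are hypothetical names, and the "cache" is a list of integers standing in for per-token key/value tensors. It shows that a restored session only has to process the new tokens, while a cold start must re-process the whole history.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySession:
    """Hypothetical session: one cache entry per processed token."""
    kv_cache: List[int] = field(default_factory=list)
    prefill_steps: int = 0  # how many tokens this object had to process itself

    def feed(self, tokens: List[int]) -> None:
        """Process new tokens, extending the cache (stand-in for prefill work)."""
        for tok in tokens:
            self.kv_cache.append(tok * 2 + 1)  # pretend key/value computation
            self.prefill_steps += 1

    def save(self) -> str:
        """Serialize the cache so the session can be resumed later."""
        return json.dumps(self.kv_cache)

    @classmethod
    def restore(cls, blob: str) -> "ToySession":
        """Rebuild a session from a saved cache without re-processing anything."""
        return cls(kv_cache=json.loads(blob))


if __name__ == "__main__":
    history = list(range(1000))   # a long earlier conversation
    follow_up = [7, 8, 9]         # the user's next message

    first = ToySession()
    first.feed(history)
    blob = first.save()           # e.g. written to disk when the app is backgrounded

    resumed = ToySession.restore(blob)
    resumed.feed(follow_up)

    cold = ToySession()
    cold.feed(history + follow_up)

    print("same cache:", resumed.kv_cache == cold.kv_cache)
    print("tokens processed on resume:", resumed.prefill_steps)
    print("tokens processed on cold start:", cold.prefill_steps)
```

In a real runtime the saved state would be large tensors rather than a JSON string, so the trade-off is storage and load time against the prefill compute that is avoided.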