NVIDIA Nemotron 3 Ultra for Long-Running AI Agents

What Nemotron 3 Ultra Is and Why Long-Running AI Agents Need It

NVIDIA Nemotron 3 Ultra is an open Mixture-of-Experts language model built to improve reasoning efficiency, speed, and context retention in long-running AI agents. These agents must sustain multi-turn workflows, call tools, and coordinate subtasks over many steps without losing focus or slowing down. Traditional single-turn chatbots treat each message as a fresh conversation and struggle when token counts grow across extended sessions. Long-running AI agents work differently: they plan, call tools and sub-agents, read observations, and then feed rich histories and reasoning traces back into the model. As context grows, this loop can quickly raise costs and introduce goal drift. Nemotron 3 Ultra is designed as the “frontier reasoning” brain inside such systems, handling the difficult decisions while smaller, cheaper models take on routine execution, validation, and tool calls.
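The division of labor described above can be sketched as a simple routing rule. This is only an illustration: the step kinds, the tier names, and the `pick_tier` helper are hypothetical and do not come from NVIDIA. The sketch makes no model calls; it only shows which tier each step would go to.

```python
from dataclasses import dataclass

# Step kinds the article describes as routine work for smaller models.
# The kind labels and tier names are hypothetical, chosen for illustration.
ROUTINE_KINDS = {"execute", "validate", "tool_call"}


@dataclass
class Step:
    kind: str
    description: str


def pick_tier(step: Step) -> str:
    """Send routine steps to a small model and everything else to the orchestrator."""
    return "small-worker" if step.kind in ROUTINE_KINDS else "frontier-orchestrator"


plan = [
    Step("plan", "Decide how to split the refactor into subtasks"),
    Step("tool_call", "Run the unit tests"),
    Step("validate", "Check that the test report has no failures"),
    Step("resolve_conflict", "Reconcile two contradictory design notes"),
]

for step in plan:
    print(f"{pick_tier(step):22s} <- {step.description}")
```

A production system would likely route on richer signals than a fixed label, such as estimated difficulty or past failure rates, but the cost argument is the same: the expensive model is used only where a cheaper one is likely to fall short.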

A Smaller Frontier Model Built for Multi-Turn Agent Orchestration

Nemotron 3 Ultra is a 550B-parameter Mixture-of-Experts model with 55B active parameters, tuned explicitly for orchestration in agentic systems. It targets the hardest calls in a workflow: keeping architectural decisions consistent across long coding sessions, aggregating conflicting evidence from hundreds of research sources, or validating chip designs against thousands of constraints. NVIDIA reports that Nemotron 3 Ultra delivers frontier-level accuracy with fewer active parameters than some peers, along with 5x higher throughput than comparable open models in its class. On long-context tasks, it scores 95% on Ruler at a 1M-token context window, which suggests it can hold the detailed histories that multi-turn agents accumulate. In practice, this means an agent can keep a large project memory (requirements, intermediate plans, tool outputs) and still respond quickly enough to stay interactive for end users.
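To put those figures in perspective, here is a back-of-the-envelope calculation. The parameter counts come from the paragraph above. Two things are assumptions made for illustration: reading "1M" as exactly 1,000,000 tokens, and the per-turn and reserved token sizes.

```python
TOTAL_PARAMS_B = 550        # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 55        # active parameters, in billions (from the article)
CONTEXT_WINDOW = 1_000_000  # assumption: "1M" read as exactly 1,000,000 tokens

# Share of the model's parameters that are active.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B


def turns_that_fit(tokens_per_turn: int, reserved: int = 0) -> int:
    """How many turns of a given size fit in the window after reserving some tokens."""
    return (CONTEXT_WINDOW - reserved) // tokens_per_turn


print(f"active share of parameters: {active_fraction:.0%}")
# Illustrative numbers: 40k tokens held back for instructions, 8k tokens per turn.
print("turns that fit:", turns_that_fit(8_000, reserved=40_000))
```

With these illustrative sizes, roughly a tenth of the parameters are active and about 120 turns of history fit in the window before any summarization or truncation is needed.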

Speed, Token Efficiency, and Stable Performance Over Long Runs

Nemotron 3 Ultra targets reasoning efficiency in multi-turn agents by cutting both latency and token overhead. In benchmarks such as SWE-bench and Terminal Bench 2.0, the model finished tasks using fewer total tokens and fewer tokens per turn than comparable systems, lowering the cost of agentic tasks by up to 30%. NVIDIA also states that "Nemotron 3 Ultra achieves 5x higher throughput compared to other open models in its class, enabling long-running agents to complete tasks faster and more efficiently." Multi-token prediction lets the model generate several future tokens per forward pass, speeding up the long outputs that agent loops often require. NVFP4 precision keeps the same checkpoint usable across Hopper, Blackwell, and Ampere GPUs while delivering higher throughput per GPU, which helps keep performance stable as agents run extended workflows.
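The effect of multi-token prediction on decoding can be illustrated with simple arithmetic. The sketch below assumes an idealized case in which every forward pass yields a fixed number of tokens; the value 4 is illustrative and is not a published figure for Nemotron 3 Ultra, and `forward_passes` is a hypothetical helper.

```python
import math


def forward_passes(output_tokens: int, tokens_per_pass: int) -> int:
    """Idealized number of decoding passes if each pass yields tokens_per_pass tokens."""
    if output_tokens < 0 or tokens_per_pass < 1:
        raise ValueError("output_tokens must be >= 0 and tokens_per_pass >= 1")
    return math.ceil(output_tokens / tokens_per_pass)


baseline = forward_passes(2_000, 1)  # one token per pass
with_mtp = forward_passes(2_000, 4)  # illustrative: four tokens per pass
print(baseline, with_mtp, baseline / with_mtp)
```

Real speedups are typically smaller than this upper bound, because extra predicted tokens are not always accepted and each pass does somewhat more work.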

Architectural Innovations That Help Agents Maintain Context and Use Tools

Several architectural choices in Nemotron 3 Ultra target the practical needs of multi-turn agents: long context, precise recall, and complex tool use. A hybrid Mamba-Transformer design improves sequence efficiency for long-context workloads while relying on Transformer layers to recall specific facts when agents query large histories. LatentMoE routing supports diverse workloads in a single model, from reasoning and code generation to tool calls and domain-specific logic. The model is post-trained with NVIDIA’s NeMo RL and Gym libraries on what NVIDIA describes as one of the largest suites of long-running, tool-using datasets available. That post-training is optimized not for single-turn chat but for workflows where agents plan, call tools, read observations, delegate to sub-agents, validate outputs, and recover from errors across many turns. The result is a model aligned with how long-running AI agents operate in production.
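The workflow that post-training targets (call a tool, read the observation, recover from errors) can be sketched as a small loop. This is a minimal sketch with a fixed plan and stub tools. A real agent would ask the model for the next step after each observation. The `run_agent` function, the tool names, and the retry policy are all hypothetical.

```python
from typing import Callable, Dict, List


def run_agent(plan: List[dict], tools: Dict[str, Callable[[str], str]],
              max_retries: int = 1) -> List[dict]:
    """Run each planned step, record every observation, and retry failed tool calls."""
    history: List[dict] = []
    for step in plan:
        for _attempt in range(max_retries + 1):
            try:
                observation = tools[step["tool"]](step["arg"])
            except Exception as exc:
                # Keep the failure in the history so later turns can see it.
                history.append({"tool": step["tool"], "ok": False, "observation": repr(exc)})
                continue
            history.append({"tool": step["tool"], "ok": True, "observation": observation})
            break
    return history


# Stub tools: the search tool fails on its first call to exercise the retry path.
calls = {"search": 0}


def flaky_search(query: str) -> str:
    calls["search"] += 1
    if calls["search"] == 1:
        raise TimeoutError("search timed out")
    return f"results for {query}"


def word_count(text: str) -> str:
    return str(len(text.split()))


history = run_agent(
    [{"tool": "search", "arg": "NVFP4"}, {"tool": "count", "arg": "a b c"}],
    {"search": flaky_search, "count": word_count},
)
for entry in history:
    print(entry)
```

The history keeps failed attempts as well as successes, which is one reason context grows quickly in long runs.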

Training Methods and Data That Expand Agent Capabilities

Nemotron 3 Ultra adds Multi-Teacher On-Policy Distillation (MOPD) to improve reasoning in long-running AI agents without sacrificing efficiency. More than 10 specialized teacher models are trained, each dedicated to a domain such as coding, legal reasoning, or open-domain question answering. During MOPD, the student model generates its own rollouts and receives dense reward signals from the relevant teachers, with rollout generation, scoring, and optimization pipelined asynchronously to stay efficient. The process is iterative: each new round of teachers is initialized from improved student checkpoints. On the data side, Nemotron 3 Ultra builds on a 10T-token base with 212B new tokens, made up of 4B tokens of synthetic legal data, 35B tokens of synthesized Wiki-based data, and 173B tokens of refreshed GitHub data running up to late 2025. These additions strengthen multi-turn agents that must reason across law, knowledge work, and code while handling complex, tool-rich workflows.
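A rough sketch of the dense reward signal follows. It assumes one common formulation of on-policy distillation, in which each token the student samples is scored by the teacher's log-probability minus the student's own. The article does not confirm that MOPD uses this exact form. The stub teachers, their fixed scores, and the `dense_rewards` helper are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical teacher scorers: each maps the student's sampled tokens to
# per-token log-probabilities. Real teachers would be full language models.
Teacher = Callable[[List[str]], List[float]]


def coding_teacher(tokens: List[str]) -> List[float]:
    return [-0.25 for _ in tokens]


def legal_teacher(tokens: List[str]) -> List[float]:
    return [-3.0 for _ in tokens]


TEACHERS: Dict[str, Teacher] = {"coding": coding_teacher, "legal": legal_teacher}


def dense_rewards(domain: str, tokens: List[str],
                  student_logprobs: List[float]) -> List[float]:
    """Per-token reward: teacher log-prob minus student log-prob (an assumed form)."""
    teacher_logprobs = TEACHERS[domain](tokens)  # pick the teacher for this domain
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# The rollout is sampled by the student itself, which is what makes it on-policy.
rollout = ["def", "add", "(", ")"]
student_logprobs = [-0.5, -0.25, -1.0, -0.125]
rewards = dense_rewards("coding", rollout, student_logprobs)
print(rewards)
```

A positive reward marks a token the teacher finds more likely than the student did, and a negative reward marks a token the student was overconfident about. Because every token gets a score, the signal is dense rather than a single end-of-episode reward.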