Nemotron 3 Ultra for Long-Running AI Agents

From Single-Turn Chatbots to Long-Running AI Agents

Nemotron 3 Ultra is a frontier reasoning model designed to power long-running AI agents that maintain context, plan across many steps, call tools, coordinate sub-agents, and complete complex workflows efficiently and autonomously. Traditional chatbots answer one question at a time; long-running AI agents stay active over many turns, passing history, tool outputs, and reasoning steps back into the model as tasks grow. This shift introduces a challenge: multi-agent workflows cause token counts to surge, driving up costs and increasing the chance of goal drift as conversations stretch out. NVIDIA’s response is a system-of-models approach, in which Nemotron 3 Ultra handles orchestration and deep multi-turn reasoning while smaller models tackle high-volume execution. In this architecture, Nemotron 3 Ultra acts as the strategic brain of complex workflows, keeping agents on track over extended interactions without sacrificing performance or efficiency.
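To see why token counts surge, consider a minimal sketch of an agent loop that resends its full history on every turn. The per-turn token figures are illustrative assumptions rather than measurements, the function name is hypothetical, and prompt caching or context compaction, which real agent stacks often use, is ignored.

```python
def total_tokens_read(tokens_added_per_turn):
    """Sum the context the model reads over a run when the full
    history is fed back in on every turn (no caching or truncation)."""
    context = 0      # tokens accumulated in the history so far
    total_read = 0   # tokens the model has had to process in total
    for added in tokens_added_per_turn:
        context += added
        total_read += context
    return total_read


# Illustrative assumption: every turn adds 1,000 tokens of output,
# tool results, and reasoning to the history.
ten_turns = total_tokens_read([1_000] * 10)
twenty_turns = total_tokens_read([1_000] * 20)
print(ten_turns, twenty_turns)  # 55000 210000
```

Under these assumptions, doubling the number of turns nearly quadruples the tokens the model must read, which is why long-running workflows are sensitive to how many tokens each turn adds.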

Inside Nemotron 3 Ultra: Mixture-of-Experts for Multi-Turn Reasoning

Nemotron 3 Ultra is a 550B-parameter Mixture-of-Experts model with 55B active parameters, tuned for frontier reasoning and long-horizon planning in agentic systems. Its design targets the hardest parts of multi-turn reasoning: sustaining architectural decisions across lengthy coding sessions, synthesizing conflicting research evidence, or validating designs against thousands of constraints. According to NVIDIA, Nemotron 3 Ultra delivers frontier accuracy in a smaller model while achieving up to 5x higher throughput than other open models in its class. Architectural innovations such as a hybrid Mamba–Transformer stack increase sequence efficiency for long contexts while preserving precise recall, which is vital when long-running AI agents must retrieve specific facts from large histories. LatentMoE improves expert routing across reasoning, code generation and tool calls, while multi-token prediction speeds generation for long outputs, making Nemotron 3 Ultra well suited to extended workflows.
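The gap between 550B total and 55B active parameters comes from sparse expert routing: each token is processed by only a few experts. The sketch below shows generic top-k routing with softmax weights renormalized over the selected experts. It is not NVIDIA's LatentMoE, whose details are not given here; the function name, the logits, and the choice of k are illustrative assumptions.

```python
import math


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and return their
    softmax weights, renormalized over the selected experts only."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Subtract the max logit for numerical stability.
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}


# Hypothetical router scores for four experts; only two are activated.
weights = top_k_route([0.1, 2.0, -1.0, 2.0], k=2)
print(weights)  # {1: 0.5, 3: 0.5}

# Figures from the text: 55B of 550B parameters are active per token.
active_fraction = 55 / 550
print(f"{active_fraction:.0%}")  # 10%
```

Only the selected experts run for a given token, so compute per token scales with the active parameter count rather than the total.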

Efficiency Breakthroughs for Autonomous Agent Workflows

Nemotron 3 Ultra targets autonomous agent efficiency by cutting both token usage and latency across long interactions. In experiments on SWE-bench and Terminal-Bench 2.0, it completed the benchmarks using fewer total tokens and fewer tokens per turn than comparable models, lowering the cost to task completion by up to 30%. This matters because multi-agent workflows multiply context: agents plan, invoke tools, spawn sub-agents and feed back outputs over many rounds, quickly inflating token budgets. Nemotron 3 Ultra’s NVFP4 precision format allows a single checkpoint to run across Hopper, Blackwell and Ampere GPUs, delivering up to 5x higher throughput per GPU at similar interactivity compared with BF16 on Blackwell. Combined with multi-token prediction, this efficiency lets enterprises keep agents running longer, with richer context and more steps, without the usual performance or cost penalties associated with high-capacity reasoning models.
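One general reason lower-precision formats help throughput is that weights take less memory and less bandwidth to move. The back-of-envelope sketch below compares approximate weight storage at BF16 and at a 4-bit block-scaled format. It assumes, for illustration only, that every weight is quantized and that each block of 16 four-bit values shares one 8-bit scale; real checkpoints typically keep some layers in higher precision, so the format details should be checked against NVIDIA's documentation.

```python
def weight_footprint_gb(params_billions, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


TOTAL_PARAMS_B = 550  # total parameter count from the text

BF16_BITS = 16
# Assumption: 4-bit values plus one 8-bit scale per 16-value block,
# i.e. 4 + 8/16 = 4.5 effective bits per weight.
BLOCK_SCALED_4BIT_BITS = 4 + 8 / 16

bf16_gb = weight_footprint_gb(TOTAL_PARAMS_B, BF16_BITS)
fp4_gb = weight_footprint_gb(TOTAL_PARAMS_B, BLOCK_SCALED_4BIT_BITS)
print(f"BF16: {bf16_gb:.0f} GB, 4-bit block-scaled: {fp4_gb:.0f} GB")
print(f"reduction: {bf16_gb / fp4_gb:.2f}x")
```

Under these assumptions the weights shrink from roughly 1,100 GB to roughly 309 GB. This estimates storage only; it does not by itself explain the throughput figure quoted above, which also depends on hardware support for the format.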

Multi-Teacher Distillation and Data for Stronger Agent Reasoning

To sustain reliable performance for long-running AI agents, Nemotron 3 Ultra uses Multi-Teacher On-Policy Distillation (MOPD), in which more than 10 specialized teacher models train the student across domains. During training, the student model generates its own rollouts and receives dense rewards from domain-specific teachers, with rollout generation, scoring and optimization pipelined asynchronously for efficiency. This iterative co-evolution produces stronger specialization over time and more consistent multi-turn reasoning. The model builds on a 10T-token base with an additional 212B tokens targeting legal, knowledge and code domains, including 4B synthetic legal tokens, 35B synthesized Wiki-based tokens and 173B refreshed GitHub tokens. NVIDIA also releases large-scale SFT and RL data and environments, helping Nemotron 3 Ultra stay consistent across frameworks, with SWE-bench Verified scores ranging from 65% to 70.4% across various agent stacks.
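The idea of dense, per-token rewards from a domain teacher can be illustrated with a toy sketch. It assumes a common on-policy distillation formulation in which each token sampled by the student is rewarded by the teacher's log-probability minus the student's own; the text does not state that MOPD uses exactly this reward. The distributions are context-free toy tables, and all names are hypothetical.

```python
import math


def dense_rewards(rollout, student_probs, teacher_probs):
    """Per-token reward: log p_teacher(token) - log p_student(token).
    Positive when the teacher finds the token more likely than the
    student does, negative otherwise."""
    return [math.log(teacher_probs[tok]) - math.log(student_probs[tok])
            for tok in rollout]


def score_with_domain_teacher(domain, rollout, student_probs, teachers):
    """Route a student rollout to the teacher for its domain."""
    return dense_rewards(rollout, student_probs, teachers[domain])


# Toy token distributions (assumptions for illustration only).
student = {"a": 0.5, "b": 0.5}
teachers = {
    "code": {"a": 0.8, "b": 0.2},
    "legal": {"a": 0.3, "b": 0.7},
}

rollout = ["a", "b", "a"]  # tokens sampled by the student itself
rewards = score_with_domain_teacher("code", rollout, student, teachers)
print([round(r, 3) for r in rewards])
```

Because every token receives its own score, the student gets a learning signal at each position of the rollout instead of a single reward at the end of the task.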

Enterprise Autonomy: Orchestrating Complex, Always-On Agents

Nemotron 3 Ultra is built for enterprises that need agents to stay active, reliable and efficient across long workflows, not just respond in single turns. Post-training focuses on agent harnesses rather than simple chat, targeting environments where agents plan, call tools, read observations, delegate to sub-agents, validate outputs and recover from errors. Integrations with agent frameworks such as Hermes Agent and OpenClaw provide orchestration loops, memory and tool ecosystems suited to multi-turn reasoning. In practice, Nemotron 3 Ultra can act as an orchestration model that decides when deep reasoning is needed, when to invoke specialized tools, and when to hand off work to smaller models for bulk execution. This division of labor enables sustained autonomy: agents can manage complex operations, maintain context over many turns and complete end-to-end workflows while keeping throughput high and token costs manageable.
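The division of labor described above can be sketched as a small dispatch loop. This is a minimal illustration, not the API of any agent framework: the `Step` type, the handler names, and the stub handlers are all hypothetical, and the plan is fixed up front, whereas a real orchestrator would generate steps dynamically from model output.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str     # "reason", "tool", or "bulk"
    payload: str  # what to think about, run, or process


def run_workflow(plan, handlers, max_steps=20):
    """Dispatch each step to the matching handler and keep a history.
    Errors are recorded instead of raised so a caller could replan."""
    history = []
    for step in plan[:max_steps]:
        handler = handlers.get(step.kind)
        if handler is None:
            history.append((step.kind, "error: no handler"))
            continue
        try:
            result = handler(step.payload)
        except Exception as err:
            result = f"error: {err}"
        history.append((step.kind, result))
    return history


# Stub handlers standing in for a large model, a tool, and a small model.
handlers = {
    "reason": lambda p: f"[large model] plan for: {p}",
    "tool": lambda p: f"[tool] ran: {p}",
    "bulk": lambda p: f"[small model] processed: {p}",
}

plan = [
    Step("reason", "fix failing test"),
    Step("tool", "pytest -q"),
    Step("bulk", "summarize logs"),
]

for kind, result in run_workflow(plan, handlers):
    print(kind, "->", result)
```

Keeping the routing decision separate from the handlers means the expensive model is only invoked for steps tagged as needing deep reasoning, while bulk work goes to cheaper handlers.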