Nemotron 3 Ultra and Long-Running AI Agents

From Single-Turn Chat to Long-Running AI Agents

Nemotron 3 Ultra is a large reasoning model designed to move applications from single-turn chat to long-running AI agents that can think through complex problems, maintain context across many steps, and coordinate tools and sub-agents to complete multi-stage workflows efficiently. Traditional chatbots answer each request in isolation and struggle with long tasks, whereas agentic systems must plan, call tools, delegate to sub-agents, and validate results over many turns. Every one of those turns adds to the history the model has to carry, so token counts and cost climb quickly, and so does the risk that the agent drifts from its original goal. NVIDIA positions Nemotron 3 Ultra as an orchestration engine for these long-running AI agents, helping them keep track of decisions, reconcile conflicting information, and stay aligned to a goal across coding projects, research tasks, or operations workflows that span hundreds of steps.

Frontier Reasoning Model Built for Multi-Turn Agents

Nemotron 3 Ultra is a 550B-parameter Mixture-of-Experts model with 55B active parameters, aimed at frontier reasoning and orchestration roles rather than routine execution. Within a multi-agent workflow, most calls are simple, but a smaller set demands deep reasoning, such as maintaining architecture choices over long coding sessions or checking chip designs against thousands of constraints. According to NVIDIA, Nemotron 3 Ultra reaches 91% on the Agent Productivity PinchBench and 95% on RULER at a 1M-token context while offering 5x higher throughput than other open models in its class. It also uses fewer tokens per turn on benchmarks like SWE-bench and Terminal-Bench 2.0, cutting the token cost of long-running tasks by up to 30%. This makes it well suited to power multi-turn agents that must think ahead without becoming too expensive to operate.
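
To see why per-turn token savings matter for long runs, consider a deliberately simplified cost model. The sketch below assumes every turn re-reads the full history and then writes a fixed number of new tokens. It ignores prompt caching, system prompts, and tool output, and the turn counts and token sizes are illustrative numbers, not measurements of any model.

```python
def total_tokens_processed(turns: int, tokens_per_turn: int) -> int:
    """Tokens read plus tokens written over a run.

    Toy assumption: each turn re-reads everything written so far,
    then appends `tokens_per_turn` new tokens to the history.
    """
    total = 0
    history = 0
    for _ in range(turns):
        total += history + tokens_per_turn  # read the history, write new tokens
        history += tokens_per_turn
    return total


# Illustrative numbers only: a 100-turn task at 1,000 vs. 700 tokens per turn.
baseline = total_tokens_processed(turns=100, tokens_per_turn=1000)
leaner = total_tokens_processed(turns=100, tokens_per_turn=700)
savings = 1 - leaner / baseline

print(f"baseline: {baseline:,} tokens")
print(f"leaner:   {leaner:,} tokens")
print(f"savings:  {savings:.0%}")
```

In this toy model the total grows roughly with the square of the number of turns, so a 30% cut in tokens per turn carries through to a 30% cut in total tokens processed. Real savings depend on caching and on how much of the history comes from tools rather than from the model.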

Architectural Innovations for Autonomous Workflows

To support autonomous workflows that span hundreds of steps, Nemotron 3 Ultra combines several architectural choices aimed at efficient long-context reasoning. A hybrid Mamba–Transformer design improves sequence efficiency for long inputs while still allowing precise fact retrieval when agents query large context windows. LatentMoE expert routing lets the model switch between reasoning, code generation, tool calling, and domain-specific logic without activating unnecessary capacity. Multi-token prediction reduces generation time by predicting several tokens in one forward pass, which is especially useful when agents produce long plans or detailed analysis. The model is post-trained specifically for agent harnesses using NVIDIA NeMo RL and NeMo Gym, focusing on environments where agents must plan, read observations, recover from errors, and coordinate sub-agents over many turns. This tuning makes Nemotron 3 Ultra more reliable inside orchestration loops than models optimized only for single-turn chat.
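
The idea of activating only part of the model per token can be illustrated with generic top-k expert routing. The sketch below is a textbook-style Mixture-of-Experts gate on scalar inputs. It is not a description of LatentMoE, whose internals are not covered here, and the function names, experts, and router scores are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_forward(
    x: float,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[float], float]],
    k: int = 2,
) -> float:
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Four toy "experts"; only two of them run for this input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
router_logits = [0.1, 2.0, 1.5, -1.0]  # hypothetical router scores for one token

routes = route_top_k(router_logits, k=2)
output = moe_forward(3.0, router_logits, experts, k=2)
print(routes, output)
```

With k=2 out of four experts, half of the expert capacity stays idle for this input, which is the general mechanism behind a model having far fewer active parameters than total parameters.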

MOPD and Open Data to Strengthen Enterprise Reasoning

Nemotron 3 Ultra introduces Multi-Teacher On-Policy Distillation (MOPD), where the model learns from more than ten specialized teacher models while generating its own rollouts. Each teacher scores outputs within its domain, and the student receives dense rewards across coding, legal reasoning, question answering, and other areas. The process runs asynchronously, with rollout generation, teacher scoring, and optimization pipelined, and is iterated so newer student checkpoints seed stronger teachers in later rounds. On the data side, Ultra builds on 10T pre-training tokens and adds 212B tokens targeted at legal, wiki-based, and refreshed GitHub content, plus 10M new SFT samples, 1M RL tasks, and 15 new RL environments. For enterprises, this combination of transparent data pipelines and structured training is designed to deliver stronger, more predictable reasoning models that can underpin long-running AI agents in production.
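
The core loop of multi-teacher, on-policy scoring can be sketched in a few lines: the student produces a rollout, the rollout is sent to the teacher for its domain, and a reward comes back for every token. The sketch below is a toy under stated assumptions. In particular, using the per-token log-probability gap between teacher and student as the reward is a common on-policy distillation formulation assumed here for illustration, not a confirmed detail of MOPD, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A teacher maps a token sequence to one log-probability per token (hypothetical interface).
Teacher = Callable[[List[str]], List[float]]


@dataclass
class Rollout:
    domain: str                    # which teacher should score this rollout
    tokens: List[str]              # tokens sampled by the student itself (on-policy)
    student_logprobs: List[float]  # student's log-probability for each sampled token


def dense_rewards(rollout: Rollout, teachers: Dict[str, Teacher]) -> List[float]:
    """Per-token reward: teacher log-prob minus student log-prob (assumed formulation)."""
    teacher = teachers[rollout.domain]
    teacher_logprobs = teacher(rollout.tokens)
    if len(teacher_logprobs) != len(rollout.student_logprobs):
        raise ValueError("teacher must return one score per token")
    return [t - s for t, s in zip(teacher_logprobs, rollout.student_logprobs)]


# Toy teachers that assign a constant log-probability to every token.
teachers: Dict[str, Teacher] = {
    "coding": lambda tokens: [-0.5 for _ in tokens],
    "legal": lambda tokens: [-2.0 for _ in tokens],
}

rollout = Rollout(
    domain="coding",
    tokens=["def", "f", "(", ")"],
    student_logprobs=[-1.0, -0.5, -0.25, -1.5],
)
rewards = dense_rewards(rollout, teachers)
print(rewards)
```

A positive reward marks a token the domain teacher finds more likely than the student did, and a negative reward marks the opposite. Because every token gets a signal, the reward is dense rather than a single score per rollout. The asynchronous pipelining and iterated teacher refresh described above are outside the scope of this toy.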

System-of-Models Strategy and Path to Smaller Agents

Nemotron 3 Ultra is meant to sit at the top of a system-of-models stack for autonomous workflows. In this pattern, the large reasoning model focuses on orchestration, hard planning decisions, and complex validation, while smaller, efficient models handle high-volume execution, simple tool calls, and routine steps. Because Ultra is open and trained for agent harnesses like Hermes Agent, developers can deploy it as the central planner and then post-train compact models on its outputs. These distilled models inherit stronger reasoning for their size and can be used where cost or latency matters more than peak accuracy. With deployment recipes using NVIDIA’s Dynamo and fine-tuning options via LoRA, SFT, and reinforcement learning in NeMo, enterprises gain a practical path to building cost-effective, long-running AI agents that coordinate many tools and steps without losing context or control.
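
The division of labor in a system-of-models stack can be sketched as a small dispatcher that sends planning and validation steps to the large model and everything else to a compact one. The step categories, model labels, and function names below are hypothetical placeholders chosen for illustration. A real deployment would route on richer signals and call actual model endpoints.

```python
from dataclasses import dataclass
from typing import Dict, List

# Step kinds that justify the large reasoning model in this toy policy (an assumption).
ORCHESTRATOR_KINDS = {"plan", "validate"}


@dataclass(frozen=True)
class Step:
    description: str
    kind: str  # e.g. "plan", "validate", "tool_call", "routine"


def pick_model(
    step: Step,
    large: str = "large-reasoning-model",  # placeholder label, not a real endpoint
    small: str = "small-executor-model",   # placeholder label, not a real endpoint
) -> str:
    """Send hard planning and validation to the large model, the rest to the small one."""
    return large if step.kind in ORCHESTRATOR_KINDS else small


def assign(steps: List[Step]) -> Dict[str, List[str]]:
    """Group step descriptions by the model chosen for them."""
    plan: Dict[str, List[str]] = {}
    for step in steps:
        plan.setdefault(pick_model(step), []).append(step.description)
    return plan


workflow = [
    Step("Break the migration into stages", "plan"),
    Step("List files in the repository", "tool_call"),
    Step("Rename a config key", "routine"),
    Step("Check the final diff against requirements", "validate"),
]
assignment = assign(workflow)
print(assignment)
```

Keeping the routing policy in one small function makes it easy to shift more step kinds to the compact model as distilled checkpoints improve.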