Nemotron 3 Ultra for Long-Running AI Agents

From Single-Turn Chatbots to Long-Running AI Agents

Nemotron 3 Ultra is a frontier reasoning model designed to power long-running AI agents that maintain context, coordinate tools, and reason across many steps without runaway token usage or cost. Long-running AI agents differ from single-turn chatbots because they reason over extended multi-turn workflows: planning tasks, calling tools, invoking sub-agents, and feeding observations back into the next step. In these systems, every extra message, plan, and tool response adds to the context window. Over hundreds of turns, this history can grow large enough to slow down inference, raise costs, and increase the risk of goal drift, where the agent gradually loses track of its original objective. Nemotron 3 Ultra addresses this problem by acting as a specialized orchestrator in a system of models, handling the hardest reasoning calls while lighter models take on repetitive execution. This division of labor helps enterprises build AI agent workflows that stay coherent over time instead of collapsing under token bloat.
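The orchestrator-plus-workers split described above can be sketched as a simple routing rule. This is only an illustration under stated assumptions: the model labels, the `Step` type, and the `needs_deep_reasoning` flag are hypothetical names invented here, and a real harness would decide escalation in its own way.

```python
from dataclasses import dataclass

# Hypothetical labels, not real endpoint or model identifiers.
ORCHESTRATOR = "orchestrator-model"
WORKER = "lightweight-model"


@dataclass
class Step:
    description: str
    # Assumed to be set by the harness; the source does not say how
    # "hard" calls are detected.
    needs_deep_reasoning: bool


def route(step: Step) -> str:
    """Send hard reasoning calls to the orchestrator, the rest to a worker."""
    return ORCHESTRATOR if step.needs_deep_reasoning else WORKER


steps = [
    Step("rename a variable across files", needs_deep_reasoning=False),
    Step("reconcile two conflicting design documents", needs_deep_reasoning=True),
    Step("run the unit tests", needs_deep_reasoning=False),
]

# Pair each step with the model that would handle it.
assignments = [(s.description, route(s)) for s in steps]
for description, model in assignments:
    print(f"{model:>20}  <-  {description}")
```

In this toy run only one of the three steps is escalated, which is the point of the design: the expensive model is reserved for the minority of calls that need it.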

A 550B-Parameter Reasoning Orchestrator for Multi-Turn Workflows

Nemotron 3 Ultra is a 550B-parameter Mixture-of-Experts (MoE) model with 55B active parameters, aimed at frontier-class multi-turn reasoning. Within a long-running AI agent workflow, most calls are routine, but a smaller subset demands deep planning: maintaining architectural decisions across coding sessions, synthesizing conflicting research, or verifying complex designs. Nemotron 3 Ultra is built for these orchestration moments. NVIDIA reports that the model reaches frontier accuracy while using fewer active parameters than peers such as GLM 5.1, Kimi K2.6, and Qwen3.5, and that “Nemotron 3 Ultra achieves 5x higher throughput compared to other open models in its class.” On long-horizon planning and professional-work benchmarks, this balance of accuracy and sparse activation translates into high-quality decisions without running a massive model on every call. In practice, multi-turn agents can escalate difficult reasoning to Nemotron 3 Ultra when needed and stay efficient the rest of the time.
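To make the idea of "active parameters" concrete, here is a toy sketch of top-k expert routing, the generic mechanism behind Mixture-of-Experts models. It is not Nemotron 3 Ultra's actual router: the expert count, the value of `k`, and the gate scores are assumptions chosen so that the active fraction mirrors the 55B of 550B figure, and real models also contain shared, always-active parameters.

```python
import math


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Made-up gate scores for one token over 10 hypothetical experts.
gate_logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9, -2.0, 0.2]

selected = top_k_route(gate_logits, k=1)
active_fraction = len(selected) / len(gate_logits)

print("experts used for this token:", [i for i, _ in selected])
print("fraction of experts active:", active_fraction)
```

Only the selected experts run for a given token, so compute per token scales with the active subset rather than with the full parameter count.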

Cutting Token Bloat and Improving AI Agent Efficiency

Multi-turn reasoning agents face a core efficiency problem: every round trip adds tokens. Plans, tool outputs, intermediate reasoning steps, and sub-agent dialogues all need to be kept in context for the agent to stay coherent. Nemotron 3 Ultra targets this challenge directly. In experiments on SWE-bench and Terminal Bench 2.0, NVIDIA found that Nemotron 3 Ultra completes tasks with fewer total tokens and fewer tokens per turn than comparable models, lowering the cost of agentic tasks by up to 30%. Because it can reason more effectively per call, it reduces wasted back-and-forth and shortens long-running workflows. Its high throughput also lets agents respond faster even when outputs are long. For organizations building complex automations, the result is long-running AI agents that can sustain detailed reasoning over many steps without turning every session into an expensive, bloated context window.
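A rough back-of-the-envelope sketch shows why tokens per turn matter so much. It assumes the full history is re-sent as the prompt on every turn and ignores prompt caching, summarization, and output tokens; the turn count and token figures are invented for illustration and are not NVIDIA measurements.

```python
def cumulative_prompt_tokens(turns: int, tokens_added_per_turn: int) -> int:
    """Total prompt tokens processed when the whole history is re-read each turn."""
    history = 0
    total = 0
    for _ in range(turns):
        history += tokens_added_per_turn  # the context grows every turn
        total += history                  # and is processed again in full
    return total


# Hypothetical session: 100 turns, 500 new tokens per turn.
baseline = cumulative_prompt_tokens(100, 500)

# Same session if each turn added 30% fewer tokens (illustrative figure only).
leaner = cumulative_prompt_tokens(100, 350)

print(f"baseline prompt tokens: {baseline:,}")
print(f"leaner prompt tokens:   {leaner:,}")
print(f"tokens saved:           {baseline - leaner:,}")
```

Under these assumptions the total grows roughly with the square of the number of turns, so savings per turn compound across a long session.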

Architectural Innovations: Hybrid Mamba-Transformer and Multi-Teacher Training

Nemotron 3 Ultra’s design is tuned for long context and multi-turn reasoning rather than single-turn chat. Hybrid Mamba-Transformer layers combine Mamba layers for sequence efficiency with Transformer layers that preserve precise recall, helping agents read and reuse large histories without losing important details. NVFP4 precision lets the same checkpoint run efficiently on multiple NVIDIA GPU generations, and NVIDIA reports up to 5x higher throughput per GPU than BF16 on Blackwell at similar interactivity. LatentMoE improves expert routing across tasks such as code generation, tool calling, and domain-specific logic, while multi-token prediction speeds up long outputs by predicting several tokens per forward pass. Training uses Multi-Teacher On-Policy Distillation, in which Nemotron 3 Ultra learns from more than 10 specialized teacher models, each scoring its outputs in a specific domain. This setup steadily refines reasoning across coding, legal, knowledge, and other workflows that long-running AI agents must handle.
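The benefit of multi-token prediction for long outputs can be illustrated with simple arithmetic. This is a best-case sketch: it assumes every predicted token is kept, which real systems generally do not achieve, and the number of tokens per pass is a hypothetical value because the source does not state one.

```python
import math


def ideal_forward_passes(output_tokens: int, tokens_per_pass: int) -> int:
    """Best-case number of forward passes needed to emit output_tokens."""
    if tokens_per_pass < 1:
        raise ValueError("tokens_per_pass must be at least 1")
    return math.ceil(output_tokens / tokens_per_pass)


output_tokens = 2000  # hypothetical long answer

one_at_a_time = ideal_forward_passes(output_tokens, tokens_per_pass=1)
four_at_a_time = ideal_forward_passes(output_tokens, tokens_per_pass=4)

print("passes, 1 token per pass: ", one_at_a_time)
print("passes, 4 tokens per pass:", four_at_a_time)
```

The longer the output, the more forward passes are saved, which is why the technique matters most for agents that produce long plans or code.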

Agent Harness Integration and Enterprise-Ready Planning

Nemotron 3 Ultra is post-trained for agent harnesses rather than only chat, using NVIDIA’s NeMo RL and Gym libraries and a large suite of long-running, tool-using datasets. It is tuned to work in workflows where agents plan, call tools, read observations, delegate to sub-agents, validate outputs, and recover from errors across many turns. Integration with harnesses such as Hermes Agent and frameworks like OpenClaw and AibleClaw shows how Nemotron 3 Ultra can act as the planning brain inside enterprise agents. Developers can use efficient models for high-volume execution and reserve Nemotron 3 Ultra for complex orchestration, creating long-running AI agents that maintain goals and context over extensive sessions. With open training data pipelines, extensive SFT and RL tasks, and support for fine-tuning through NeMo libraries, Nemotron 3 Ultra gives teams a practical path from simple chatbots to multi-turn reasoning agents that deliver reliable workflows at scale.
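The loop of planning, calling tools, reading observations, and recovering from errors can be sketched in a few lines. This is a minimal illustration, not the interface of any harness named above: `run_agent`, the tool names, and the fixed plan are hypothetical, and in a real system the plan would come from the model rather than a hard-coded list.

```python
from typing import Callable, Dict, List, Tuple


def run_agent(plan: List[Tuple[str, str]],
              tools: Dict[str, Callable[[str], str]],
              max_retries: int = 1) -> List[str]:
    """Execute each planned tool call, logging observations and retrying on error."""
    observations: List[str] = []
    for tool_name, argument in plan:
        for _attempt in range(max_retries + 1):
            try:
                result = tools[tool_name](argument)
            except Exception as exc:
                # Record the failure so a later step could reason about it.
                observations.append(f"{tool_name} failed: {exc}")
                continue
            observations.append(f"{tool_name}: {result}")
            break  # success, move on to the next planned step
    return observations


def make_flaky_tool() -> Callable[[str], str]:
    """Build a toy tool that fails on its first call, then succeeds."""
    calls = {"count": 0}

    def flaky(argument: str) -> str:
        calls["count"] += 1
        if calls["count"] == 1:
            raise RuntimeError("temporary failure")
        return argument.upper()

    return flaky


tools = {"echo": lambda text: text, "shout": make_flaky_tool()}
plan = [("echo", "read config"), ("shout", "deploy")]

log = run_agent(plan, tools)
for line in log:
    print(line)
```

Every entry in the log is something a real agent would append to its context, which ties this loop back to the token-growth concern discussed earlier in the article.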