NVIDIA’s 120B Nemotron 3 Runs Like a 12B Model

Sanket Chaukiyal

March 24, 2026

TL;DR

  • NVIDIA released Nemotron 3 Super, a 120B-parameter hybrid Mamba-Transformer MoE model that activates just 12B parameters per token, cutting compute costs while, NVIDIA says, maintaining performance.
  • The model delivers a 1M token context window and clocks 2.2x better throughput than GPT-OSS-120B, positioning NVIDIA squarely against Google DeepMind and Anthropic in the efficiency wars.
  • Available now via API, Nemotron 3 Super targets developers chasing long-context tasks without the infrastructure headaches of dense transformer models.
  • NVIDIA’s hybrid architecture blends Mamba state-space layers with transformer attention, a bet that selective computation beats brute-force scaling.

NVIDIA Drops Nemotron 3 Super into the MoE Arena

NVIDIA launched Nemotron 3 Super, a 120-billion-parameter hybrid model that uses Mixture-of-Experts routing to activate only 12 billion parameters during inference. The model ships with a 1-million-token context window and is available now through NVIDIA’s API. This isn’t just another model drop — it’s NVIDIA signaling it wants a seat at the table where Google, Anthropic, and OpenAI argue about who builds the most efficient frontier systems.

The architecture mixes Mamba state-space layers with traditional transformer attention. NVIDIA claims the hybrid design delivers 2.2x the throughput of GPT-OSS-120B, a comparable open-weight model, while handling contexts that stretch to a million tokens. That’s enough runway to process entire codebases, legal documents, or transcripts without chunking.

Nemotron 3 Super joins NVIDIA’s ongoing push into hybrid architectures — models that blend different computational primitives to squeeze more performance per watt. The company clearly believes the future isn’t just bigger transformers. It’s smarter routing.

Why the 12B Active Parameter Count Actually Matters

Here’s the thing about Mixture-of-Experts models: they let you have your cake and eat it too. You get the representational capacity of 120 billion parameters, but each forward pass only activates 12 billion. That’s roughly a 10x reduction in compute per token compared with a dense model of the same size, which translates into faster inference and lower API costs. The catch is memory: all 120 billion parameters still have to be loaded, so the savings show up in compute, not in hardware footprint. For developers on tight budgets or building latency-sensitive apps, that gap isn’t academic. It’s the difference between viable and prohibitively expensive.
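The 10x figure above can be checked with back-of-the-envelope arithmetic. The sketch below assumes the common rule of thumb of about 2 floating-point operations per active parameter per token (one multiply, one add). That rule is an approximation, not NVIDIA's published accounting, and it ignores the attention cost that grows with sequence length.

```python
# Parameter counts taken from the article.
TOTAL_PARAMS = 120e9   # parameters stored in the model
ACTIVE_PARAMS = 12e9   # parameters used for each token


def approx_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params


dense_flops = approx_flops_per_token(TOTAL_PARAMS)    # hypothetical dense 120B model
sparse_flops = approx_flops_per_token(ACTIVE_PARAMS)  # 12B active parameters
reduction = dense_flops / sparse_flops

print(f"dense: {dense_flops:.2e} FLOPs/token")
print(f"MoE:   {sparse_flops:.2e} FLOPs/token")
print(f"reduction: {reduction:.0f}x")
```

The ratio depends only on the two parameter counts, so it holds whatever constant is used in the rule of thumb. It says nothing about memory: the full set of weights still has to be resident to serve the model.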

NVIDIA’s 2.2x throughput advantage over GPT-OSS-120B isn’t just a benchmark flex. GPT-OSS-120B is itself a sparse MoE model, so the gap can’t be credited to expert routing alone; it points at the hybrid design. Routing intelligently, sending each token to the subset of experts that matter, cuts the feed-forward compute that dominates per-token cost. The quadratic attention bottleneck that’s plagued transformers since day one is a separate problem, and that’s the job of the Mamba layers. NVIDIA’s pitch is that you get both without sacrificing accuracy on standard evals.
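To make "sending each token to the subset of experts that matter" concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is an illustration of the general technique, not Nemotron 3 Super's actual router: the function names, the toy scalar experts, and the choice of k=2 are all assumptions. In a real model the router scores come from a learned projection of the token; here they are passed in directly.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy experts: each one just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

chosen = route_top_k(logits, k=2)
y = moe_layer(1.0, logits, experts, k=2)
print(chosen, y)
```

Only two of the four toy experts execute for this token, which is where the compute saving comes from. Production routers also have to balance load across experts, which this sketch leaves out.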

But the real unlock here is the 1-million-token context window. Most production models top out around 128K tokens, maybe 200K if you’re willing to eat the latency hit. A million tokens? That’s a different game. You can ingest an entire novel, a semester’s worth of lecture notes, or a company’s internal wiki and ask questions across all of it without preprocessing. The use cases shift from “answer this question” to “be my research assistant for the next six hours.”

I’ve watched the context window arms race accelerate over the past year, and it’s clear the endgame isn’t just longer windows. It’s longer windows that don’t collapse into incoherence or crawl to a halt. NVIDIA’s hybrid architecture is a bet that you can’t brute-force your way there with pure transformers. You need state-space layers to carry long sequences in linear time and attention layers to handle the hard reasoning. It’s like building a sports car with both a turbocharger and a hybrid battery: you use whichever power source fits the terrain.

The competitive stakes are obvious. Google has described its Gemini models as sparse MoE systems, and Anthropic’s Claude family competes on the same balance of cost and capability, although Anthropic hasn’t published its architecture. NVIDIA entering this space with an open-weight model that it says outpaces GPT-OSS-120B by over 2x means the open-weight ecosystem just got a serious performance injection. Developers who couldn’t justify the infrastructure costs of running a 120B dense model now have a credible alternative that runs leaner and faster.

Hybrid Mamba-Transformer Architecture Signals NVIDIA’s Long Game

NVIDIA’s decision to hybridize Mamba and transformer layers isn’t a gimmick. It’s a structural bet on where AI architectures are heading. Transformers dominate today because they parallelize beautifully and scale predictably. But they’re also inefficient — every token attends to every other token, even when most of those connections contribute nothing useful. Mamba state-space models, by contrast, compress context into a fixed-size state and update it sequentially. Fast, efficient, but historically weaker at complex reasoning.
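The difference in cost shape between the two layer types can be sketched in a few lines. The code below counts query-key comparisons for causal self-attention (one head, one layer) and contrasts that with a toy scalar recurrence that keeps a single fixed-size state. The recurrence is a deliberate simplification with assumed constants `a` and `b`: real Mamba layers use a vector state and input-dependent parameters, so treat this as an illustration of the scaling only.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Query-key score computations in causal self-attention: each token sees itself and all earlier tokens."""
    return n_tokens * (n_tokens + 1) // 2


def linear_recurrence(xs, a=0.9, b=0.1):
    """Toy state-space scan: h_t = a * h_{t-1} + b * x_t, one update per token."""
    h = 0.0  # fixed-size state, independent of sequence length
    outputs = []
    for x in xs:
        h = a * h + b * x
        outputs.append(h)
    return outputs


for n in (1_000, 128_000, 1_000_000):
    # Attention work grows quadratically; the recurrence does n state updates.
    print(f"{n:>9} tokens: {causal_attention_pairs(n):>15} attention pairs, {n:>9} state updates")
```

Going from 128K to 1M tokens multiplies the attention pair count by roughly 61, while the number of state updates grows by under 8x. That gap is the motivation for pushing most of the sequence mixing into state-space layers.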

Nemotron 3 Super splits the difference. Mamba layers chew through long contexts cheaply, and transformer attention layers are stacked alongside them for the steps that need the full attention mechanism. The result, NVIDIA says, is a model that doesn’t choke on million-token inputs but still holds its own on the reasoning benchmarks that matter to developers.

This fits into NVIDIA’s broader strategy of building models that showcase their hardware advantages. The company dominates AI training and inference infrastructure — every hyperscaler runs NVIDIA GPUs. Releasing a model that squeezes 2.2x better throughput from the same silicon is both a technical flex and a sales pitch. It tells enterprises: our chips run AI better, and here’s the proof.

And NVIDIA isn’t alone in this hybrid architecture bet. Reportedly, multiple research labs are experimenting with blending state-space models and transformers, chasing the same efficiency gains. But NVIDIA has the distribution advantage — developers already use their tools, their frameworks, their cloud instances. Nemotron 3 Super slots directly into that ecosystem.

What Nemotron 3 Super Means for Developers and Competitors

For developers, the calculus is straightforward. If you’re building apps that need long-context understanding, such as legal analysis, codebase navigation, or research synthesis, Nemotron 3 Super just became a top-tier option. The 1-million-token window can remove the need for chunking strategies or retrieval-augmented generation pipelines when the whole corpus fits inside it. You can throw the entire corpus at the model and let it figure out what matters, though anything larger than the window still needs retrieval.
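Whether a corpus can skip chunking comes down to a token budget. The sketch below estimates that budget under two loudly labeled assumptions: about 4 characters per token, which is a rough heuristic for English text and not the model's real tokenizer, and a hypothetical 8,000-token reserve for the model's reply. For a real deployment, count tokens with the model's own tokenizer instead.

```python
import math

CONTEXT_WINDOW = 1_000_000  # tokens, from the article
CHARS_PER_TOKEN = 4         # assumption: rough average for English prose


def estimate_tokens(text: str) -> int:
    """Crude token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def plan_request(documents, reserve_for_output=8_000):
    """Report whether all documents fit in one call, leaving room for the reply."""
    budget = CONTEXT_WINDOW - reserve_for_output
    total = sum(estimate_tokens(doc) for doc in documents)
    return {"estimated_tokens": total, "budget": budget, "fits": total <= budget}


small_corpus = ["a" * 400, "b" * 1_200]
large_corpus = ["c" * 4_000_000]

print(plan_request(small_corpus))
print(plan_request(large_corpus))
```

When `fits` comes back false, the corpus still needs chunking or retrieval in front of the model, even with a million-token window.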

The API availability is key. NVIDIA isn’t just open-sourcing weights and walking away. They’re hosting inference, which means you don’t need to provision GPU clusters or optimize serving pipelines. That lowers the barrier to entry for startups and solo developers who want frontier-model performance without the DevOps nightmare.

For competitors, Nemotron 3 Super raises the bar on what “open-weight” means. It’s not enough to release a model that matches GPT-4 on MMLU anymore. You need to ship something that’s faster, cheaper, and capable of handling contexts that would melt older architectures. Google’s Gemini and Anthropic’s Claude models already do this — but they’re closed. NVIDIA is betting that open weights plus superior throughput will carve out a distinct market position.

The MoE architecture also intensifies the efficiency war. If you can deliver 120B-scale performance while activating only 12B parameters, why would anyone run a dense 120B model? The answer used to be “because MoE models are finicky to train and serve.” But if NVIDIA’s cracked the engineering — and the 2.2x throughput number suggests they have — that excuse evaporates.

There’s also a regulatory angle. As governments worldwide scrutinize AI model releases, open-weight models with strong performance benchmarks give developers an alternative to closed APIs controlled by a handful of companies. NVIDIA positioning itself as the infrastructure provider for open AI is a hedge against a future where model access becomes a bottleneck.

Three Things to Watch as Nemotron 3 Super Hits Production

First, watch how developers stress-test the 1-million-token context window in real-world applications. Benchmark numbers are one thing. Actual coherence across a million tokens of messy, unstructured data — that’s where models usually stumble. If Nemotron 3 Super holds up, it’ll redefine what “long-context” means in production systems. If it degrades past 500K tokens, the hype deflates fast.
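One common way to run this kind of stress test is a "needle in a haystack" probe: bury a single fact at different depths in filler text and check whether the model can retrieve it. The sketch below is a minimal harness for that idea. `ask_model` is a hypothetical callable standing in for whatever client sends a prompt to the model, and the stand-in `truncating_model` only simulates a model that loses everything past its first 200 characters.

```python
def build_haystack_prompt(filler_sentence, n_sentences, needle, depth, question):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) inside repeated filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler_sentence] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences) + "\n\nQuestion: " + question


def recall_by_depth(ask_model, depths, expected, **prompt_kwargs):
    """Map each depth to True if the model's answer contains the expected string."""
    results = {}
    for depth in depths:
        prompt = build_haystack_prompt(depth=depth, **prompt_kwargs)
        results[depth] = expected.lower() in ask_model(prompt).lower()
    return results


def truncating_model(prompt):
    """Stand-in for a real model call: only 'sees' the first 200 characters."""
    return "7421" if "7421" in prompt[:200] else "unknown"


results = recall_by_depth(
    truncating_model,
    depths=[0.0, 0.5, 1.0],
    expected="7421",
    filler_sentence="The sky was grey that day.",
    n_sentences=50,
    needle="The vault code is 7421.",
    question="What is the vault code?",
)
print(results)
```

Swapping `truncating_model` for a real client and scaling `n_sentences` toward the full window gives a recall-by-depth table, which is the kind of evidence that would show whether quality holds or degrades deep into the context.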

Second, track how NVIDIA prices API access relative to Anthropic’s Claude and Google’s Gemini. The 2.2x throughput advantage only matters if it translates into lower per-token costs or faster response times. If NVIDIA charges a premium because it’s NVIDIA, the efficiency gains get swallowed by margin. But if they undercut competitors, they could trigger a price war that reshapes the entire API market.
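The link between throughput and price is simple arithmetic: at a fixed hardware cost per hour, serving cost per token falls in proportion to tokens per second. The sketch below works that through. The $4.00 hourly GPU cost and the 1,000 tokens-per-second baseline are made-up illustrative inputs, not NVIDIA's or anyone's real figures; only the 2.2x multiplier comes from the article.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


GPU_HOURLY_USD = 4.0    # hypothetical hardware cost
BASELINE_TPS = 1_000.0  # hypothetical baseline throughput
SPEEDUP = 2.2           # throughput multiplier cited in the article

baseline_cost = cost_per_million_tokens(GPU_HOURLY_USD, BASELINE_TPS)
faster_cost = cost_per_million_tokens(GPU_HOURLY_USD, BASELINE_TPS * SPEEDUP)
ratio = faster_cost / baseline_cost

print(f"baseline: ${baseline_cost:.3f} per 1M tokens")
print(f"at 2.2x:  ${faster_cost:.3f} per 1M tokens")
print(f"cost ratio: {ratio:.3f}")
```

Whatever inputs are plugged in, the ratio works out to 1/2.2, or about 45% of the baseline serving cost. Whether that shows up in the API price is a margin decision, which is the point of the paragraph above.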

Third, pay attention to whether other labs adopt hybrid Mamba-Transformer architectures. NVIDIA’s release could validate the approach and spark a wave of similar models. Or it could remain a niche experiment if pure transformers continue scaling more predictably. The next six months will clarify whether hybrid architectures are the future or a detour.

FAQ

What makes NVIDIA Nemotron 3 Super different from other large language models?

Nemotron 3 Super uses a hybrid Mamba-Transformer architecture with Mixture-of-Experts routing, activating only 12 billion of its 120 billion parameters per token. NVIDIA says this design delivers 2.2x better throughput than comparable models like GPT-OSS-120B while supporting a 1-million-token context window, far longer than most production models.

How does the Mixture-of-Experts architecture improve efficiency?

MoE models route each input token to a small subset of specialized expert networks instead of activating all parameters. Nemotron 3 Super activates just 12 billion of its 120 billion parameters per forward pass, slashing compute costs and latency while maintaining the representational power of a much larger dense model.

What can developers do with a 1-million-token context window?

A 1-million-token context lets developers process entire codebases, legal document sets, research papers, or transcripts in a single inference call without chunking or retrieval pipelines. This enables applications like cross-document analysis, long-form content generation, and research synthesis that require understanding relationships across massive amounts of text.

How does Nemotron 3 Super compare to Google Gemini and Anthropic Claude?

Nemotron 3 Super competes with Google’s Gemini and Anthropic’s Claude models on efficient long-context inference, but with open-weight availability and API access through NVIDIA’s infrastructure. The claimed 2.2x throughput advantage over GPT-OSS-120B positions it as a faster, more efficient alternative for developers who need long-context capabilities without closed-API vendor lock-in.

Source: llm-stats.com

Sanket Chaukiyal

Technology editor • 12+ years in editorial

Sanket is the founder and editor of Smart Chunks. He spent over six years at Autocar India (Haymarket SAC Publishing) as Sub Editor and Senior Copy Editor, and later served as Account Director (Content) at Rite Knowledge Labs. He holds a Master's in Media and Communication from the Symbiosis Institute of Media and Communication.
