Industry Analysis

The Moat Isn't Silicon: Why 6 Million CUDA Developers Are Harder to Move Than Any Chip

MI300X has 2.4x the memory and 3.49x the L2 bandwidth of H100. Yet B200 beats it by 105% at 128 concurrent users. The gap is CUDA, not silicon.

Josh Chen · 2026-07-05 · 5 min read

The most common retail-investor mistake is reducing the gap between GPUs to a spec sheet. See AMD MI300X at 192GB HBM3 and 5.3 TB/s bandwidth versus NVIDIA H100 at 80GB and 3.35 TB/s, and the natural conclusion is "AMD has caught up." Low-level benchmarks by Chips and Cheese go further: MI300X's L1 cache bandwidth is 1.6x that of H100, and its L2 bandwidth is 3.49x. On raw silicon, AMD is ahead of H100 almost everywhere.

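Then look at AIMultiple's H1 2026 inference benchmark: at 128 concurrent users, NVIDIA B200 delivers +105.3% throughput over MI300X, a little more than double. B200 is a newer generation than the H100 in the spec comparison above, so the two data points are not like-for-like, but the direction is the point: the vendor that trails on those spec-sheet ratios wins by more than 2x in deployed inference. The spec sheet doesn't explain that gap. The eighteen-year software inertia wrapped around the silicon does.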
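
To keep the units straight, here is a minimal Python sketch that turns the figures quoted above into comparable multiples. The variable names are illustrative and the inputs are only the numbers cited in this article. Note that the two spec ratios are measured against H100, while the throughput figure is measured against B200.

```python
# Figures quoted in the article (spec sheet and AIMultiple benchmark).
mi300x_mem_gb, h100_mem_gb = 192, 80
mi300x_bw_tbs, h100_bw_tbs = 5.3, 3.35
b200_uplift_pct = 105.3  # B200 throughput advantage over MI300X at 128 users

# Spec-sheet ratios: MI300X relative to H100.
mem_ratio = mi300x_mem_gb / h100_mem_gb
bw_ratio = mi300x_bw_tbs / h100_bw_tbs

# A "+X%" uplift is a multiple of (1 + X/100): B200 relative to MI300X.
b200_multiple = 1 + b200_uplift_pct / 100

print(f"memory: {mem_ratio:.2f}x, bandwidth: {bw_ratio:.2f}x, "
      f"B200 vs MI300X throughput: {b200_multiple:.2f}x")
```

The only point of the arithmetic is that "+105.3%" means roughly 2.05x, which is the same order of magnitude as the 2.4x memory advantage running in the opposite direction.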

CUDA's lock-in isn't the API — it's baked-in hardware assumptions

If CUDA were just a set of APIs, AMD's hipify tool would have automated CUDA-to-ROCm translation years ago. It hasn't, and the reason hides in an easily missed technical detail: NVIDIA GPUs execute threads in warps of 32, while AMD's CDNA data-center GPUs such as MI300X execute them in wavefronts of 64. For existing code, that isn't a tunable parameter. It's a hardware assumption baked into CUDA kernels written over the past eighteen years.

When hipify auto-ports CUDA code to ROCm, a kernel that hard-codes the assumption that 32 threads execute in lockstep (a fixed shuffle offset, a 32-bit lane mask) can silently mishandle half of the lanes on a wavefront-64 device: the code runs, the output is wrong, and the tool doesn't warn you. The ROCmPort-AI open-source project documents this pattern in detail on GitHub. The practical engineering conclusion: carrying CUDA performance across means auditing and hand-tuning one kernel at a time, and any inline PTX assembly, which is specific to NVIDIA hardware, has to be rewritten for AMD's instruction set.
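
The failure mode can be sketched without a GPU. The following is a plain-Python simulation, not real CUDA or HIP code: it models a shuffle-down tree reduction, where each lane adds the value held by the lane `offset` positions above it. The function names, the all-ones input, and the rule that an out-of-range lane reads its own value (modeled on CUDA's documented `__shfl_down_sync` behavior) are assumptions made for illustration.

```python
def shuffle_down(values, offset):
    """Simulate one shuffle-down step across a single hardware warp/wavefront.

    Lane i reads the value of lane i + offset; lanes whose source would fall
    outside the hardware width read their own value instead (assumption).
    """
    width = len(values)
    return [values[i + offset] if i + offset < width else values[i]
            for i in range(width)]


def lane0_reduce(values, start_offset):
    """Tree-sum via repeated shuffle-down; return what lane 0 holds at the end."""
    vals = list(values)
    offset = start_offset
    while offset >= 1:
        shifted = shuffle_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]
        offset //= 2
    return vals[0]


warp32 = [1] * 32   # 32-lane warp, every lane contributes 1
wave64 = [1] * 64   # 64-lane wavefront, every lane contributes 1

HARDCODED_START = 16  # "warp size / 2" with the warp size assumed to be 32

ok_on_warp32 = lane0_reduce(warp32, HARDCODED_START)
wrong_on_wave64 = lane0_reduce(wave64, HARDCODED_START)
fixed_on_wave64 = lane0_reduce(wave64, len(wave64) // 2)

print(ok_on_warp32, wrong_on_wave64, fixed_on_wave64)
```

With the starting offset hard-coded to 16, lane 0 ends up holding 32 on both devices. That is correct for a 32-lane warp, but it is only half of the true total of 64 on a 64-lane wavefront, and nothing raises an error. Deriving the starting offset from the actual lane count restores the right answer, which is the kind of per-kernel audit a translation tool cannot do on its own.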

Which means: AMD has the code, the silicon, and the theoretical bandwidth advantage. It doesn't have eighteen years of engineer-time.

6 million developers is organizational inertia, not a user count

NVIDIA's 2026 disclosure: roughly 6 million active CUDA developers globally, and 40,000+ organizations using the stack. That looks like a user count, but the real lock-in mechanism isn't the users — it's the structures wrapped around them. Universities teach parallel computing on CUDA. Journal papers benchmark performance in CUDA. AI-engineer job descriptions list "CUDA experience required." Venture capitalists evaluating AI startups look for CUDA credentials on the founding team.

This is a near-perfect analogy to Microsoft's Win32 API dominance of desktop application software in the 1990s. When an entire generation of engineers builds capability, publishes papers, and changes jobs on the same API, the moat stops being a technical problem and becomes a sociological one. Competitors aren't facing "rewrite the code" — they're facing "rebuild the entire human-capital formation pipeline."

The $20B Groq acquisition: defense and offense

In December 2025, NVIDIA closed a $20B asset-and-licensing acquisition of Groq's IP and talent — not an equity M&A — and at GTC 2026 unveiled Groq's SRAM-based LPU as the "Groq 3 LPX" rack module inside the Vera Rubin platform. The signaling is clear: Groq's LPU architecture had a plausible path to disrupting the GPU-dominant inference paradigm. NVIDIA bought it outright and turned Groq's technical advantage into an internal component of the CUDA ecosystem rather than an external threat.

The heterogeneous-architecture figure disclosed at GTC 2026: with LPUs handling decode and Rubin GPUs handling prefill and training, throughput per megawatt on trillion-parameter models improves by up to 35x. That number needs caution, since it comes from NVIDIA's own developer blog and no independent third-party validation is currently available. The strategic logic of the acquisition does not depend on it: rather than let Groq grow into an external threat, NVIDIA absorbed it inside CUDA's ecosystem radius.

Where this moat has real cracks

CUDA's lock-in isn't invincible. Google's TPU v6e (Trillium) bypasses CUDA entirely with its own JAX + XLA software stack. At 4096-token inference contexts, two TPU v6e-8 systems deliver 66% higher throughput than two H100s and a 23.6x faster time to first token (TTFT), and SDXL image generation on Trillium runs at $0.22 per 1,000 images. For specific workloads, bypassing CUDA is not only viable, it's cheaper.

More noteworthy is the March 2026 launch of Huawei's Ascend 950PR, which introduces a phrase that changes the framing: "higher CUDA compatibility." Target production of 750,000 units in 2026, DDR version ~50k RMB, HBM version ~70k RMB, with ByteDance and Alibaba named as major buyers. If that claim holds up under technical review, CUDA shifts from "the lock you can't remove" to "the bridge you can ride" — competitors stop fighting CUDA and start shipping on top of it. That reframing carries much heavier consequences than another benchmark loss. The technical detail is only in Chinese-language reporting so far, and independent verification is missing, but the direction is worth tracking.

One more disclosure on the data: vendor-narrative numbers should be discounted as a matter of basic research discipline. The "35x throughput per megawatt" figure rests on a single NVIDIA developer blog post with no independent test.

A framework for judging any software-ecosystem moat

Four questions are more predictive than user counts:

Does lock-in live at the API layer or the hardware-assumption layer? APIs can be attacked by translation tools. Hardware assumptions cannot, at least not automatically: warp=32 vs wavefront=64 is eighteen years of accumulated implicit liability.
Is organizational inertia locked in via academia, hiring, and education simultaneously? All three channels together produce Win32-class durability. Any one alone is short-term lock-in.
When facing a potential disruptive competitor, does the incumbent buy it or fight it? Buying and integrating shows strategic maturity — and shows that the disruption threat was real.
Do competitors choose to bypass or to claim compatibility? Bypassing (Google TPU) requires rebuilding an entire ecosystem — extraordinarily expensive. Claiming compatibility (Huawei 950PR), if true, is more damaging — because it means the lock is no longer a lock.

The framework applies to any software-platform moat — well beyond CUDA.

Disclaimer

This article reflects independent research and framework-sharing based on publicly available information and the author's own analysis. It does not constitute investment advice and does not promote, manage, or advise on any specific security for a fee. Investment decisions should be made independently, accounting for your own financial situation, objectives, and risk tolerance. Past performance does not guarantee future results.