Let’s get one thing straight: AI hardware is no longer just about training. The real action is shifting to inference, the moment when neural networks actually do something. Think of it as the difference between cramming for an exam and acing it in real time. From reasoning models to instant personal and coding assistants to machines using vision to make split-second decisions everywhere from fields to factories, inference is where AI meets the physical world.
As models evolve (think DeepSeek’s inference-intensive mixture-of-experts, or MoE, architectures), the hardware is hitting a wall. GPUs? Inefficient and too power-hungry for this new era. CPUs? Too sluggish. MoE models, which dynamically route each token to a handful of specialized subnetworks called experts, are exploding in popularity because they are faster and more efficient. The catch? Traditional processors weren’t built for this. GPUs, designed to brute-force dense parallel computations, waste energy shuttling data between memory and compute units.
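To make that routing idea concrete, here is a minimal, hypothetical sketch of top-k expert routing in PyTorch. It illustrates the general MoE pattern only; it is not DeepSeek’s or any vendor’s actual implementation, and the class name, expert shapes, and the choice of k=2 are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal mixture-of-experts layer: a learned gate picks the top-k
    experts per token, and only those experts run. Illustrative sketch;
    production MoE layers add load balancing, shared experts, etc."""
    def __init__(self, dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)        # the router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                              # x: (tokens, dim)
        scores = self.gate(x)                          # (tokens, num_experts)
        weights, idx = scores.topk(self.k, dim=-1)     # pick k experts/token
        weights = F.softmax(weights, dim=-1)           # mixing weights
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e               # tokens routed to expert e
                if mask.any():                         # sparse: most experts idle
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

Note how sparse the work is: with eight experts and k=2, three-quarters of the expert weights sit idle for any given token. That irregular, dynamic access pattern is exactly what punishes hardware built for dense, uniform workloads.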
The age of inference demands purpose-built inference chips, and Untether AI is one of the few semiconductor companies with chips on the market designed specifically to serve this new era. Instead of forcing data to crawl through a bottleneck between memory and processor, Untether AI’s chips break the memory wall by placing compute directly with memory, a design called “at-memory compute.” For MoE models, which thrive on sparse, dynamic computation, this means blazing-fast routing between experts. For time-sensitive inference, it enables real-time processing of temporal data streams without melting your power budget. And because these chips are purpose-built for inference, they sidestep the bloat of legacy hardware.
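Why does placing compute next to memory matter so much? A rough back-of-envelope model makes the point. The per-operation energy figures below are approximate 45 nm numbers widely cited from Mark Horowitz’s ISSCC 2014 keynote, used here purely for illustration; they are not Untether AI’s published specifications, and real ratios depend on process node, word width, and access patterns.

```python
# Back-of-envelope energy model for one matrix-vector multiply, the core
# op of inference. Per-op energies are approximate 45 nm figures from
# Horowitz (ISSCC 2014); treat them as illustrative assumptions only.
DRAM_READ_PJ = 640.0   # read one 32-bit word from off-chip DRAM
SRAM_READ_PJ = 5.0     # read one 32-bit word from local on-chip SRAM
MAC_PJ       = 4.6     # 32-bit FP multiply (~3.7 pJ) + add (~0.9 pJ)

def matvec_energy_uj(rows, cols, weight_read_pj):
    """Energy in microjoules to compute y = W @ x, counting one weight
    fetch and one multiply-accumulate per element of W."""
    ops = rows * cols
    return ops * (weight_read_pj + MAC_PJ) * 1e-6  # pJ -> uJ

rows = cols = 4096  # one layer of a modest model
uj_dram = matvec_energy_uj(rows, cols, DRAM_READ_PJ)
uj_sram = matvec_energy_uj(rows, cols, SRAM_READ_PJ)
print(f"weights in off-chip DRAM: {uj_dram:8.1f} uJ")  # fetch dominates
print(f"weights at-memory:        {uj_sram:8.1f} uJ")
print(f"ratio: {uj_dram / uj_sram:.0f}x")              # ~67x under these assumptions
```

Under those assumed numbers, fetching weights from off-chip DRAM costs roughly 60 to 70 times more energy than reading them from adjacent on-chip memory. That is why at-memory designs attack the fetch, not the math.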
This demand for inference hardware becomes even more acute as AI intersects with the real world. The future of AI is not solely the hyperscaler cloud; it is at the edge: in your phone, your car, your factory robots, and in on-prem, regional, and sovereign datacenters. These environments do not have the luxury of infinite power or cooling. A tractor in a field can’t lug around a server rack, and powering AI for your enterprise should mean deploying models, not pouring new concrete and plumbing to handle the power demands of GPUs. Untether AI’s chips, which boast world-leading efficiency, are tailor-made for this reality.
Like every paradigm shift, this new age of inference is exposing the flaws in yesterday’s tech stack. DeepSeek’s innovations in MoE architectures are just the beginning. Just as phones and laptops demanded architectures and design points different from those in a desktop or server, what comes next, whether ubiquitous AI running in tight spaces at the edge or more compute per rack in the data center, will hinge on hardware that’s fast, efficient, and ruthlessly focused on solving the emerging power wall.
Inference is not the future. It’s the now. And the now needs better silicon.
For more information on Untether AI, please visit their website.