Microsoft Maia 200 and the ongoing evolution of custom AI silicon
Microsoft’s Maia 200 AI chip highlights a growing shift towards a model of vertical integration where one company designs and controls the model, the software stack, the silicon, and the data-centre infrastructure.
Tech giant Microsoft recently introduced the Maia 200, a bespoke artificial intelligence (AI) accelerator designed to improve the performance and lower the cost of running large-scale AI models within its Azure cloud.
This chip belongs to a category known as application-specific integrated circuits, or ASICs, which are processors built for one particular job rather than serving as general-purpose hardware. Specifically, Maia 200 is an inference-first chip.
In the world of AI, there are two main phases for a model: training and inference. Training is the initial stage where a model learns from vast datasets, whereas inference is the phase where a finished model responds to user queries, such as generating text in a chat interface or creating an image.
By focusing on inference, Microsoft aims to make its AI services more responsive and affordable while reducing its long-term reliance on external hardware suppliers.
Custom chips
The arrival of Maia 200 follows the earlier Maia 100, which was Microsoft’s first foray into in-house AI accelerators in 2023.
The creation of Maia 200 is part of a wider industry shift where major cloud providers develop their own silicon to gain more control over their infrastructure.
Google has its Tensor Processing Units, or TPUs, which are now in their seventh generation with the Ironwood series. Amazon Web Services (AWS) offers the Inferentia family for low-cost inference and the Trainium series for high-intensity training. Meta has also joined this club with its Meta Training and Inference Accelerator, known as MTIA, which is designed to power the recommendation systems behind its social platforms.
The driving force behind these multi-million-dollar investments is the need for efficiency—a combination of cost control, supply chain security, and performance optimisation.
While general-purpose graphics processing units, or GPUs, from companies like NVIDIA or AMD are incredibly powerful, they are also expensive and in high demand.
Custom chips allow cloud providers to tailor the hardware to the specific mathematical needs of their own AI models, such as Microsoft Copilot or the GPT series from OpenAI. This vertical integration lets a company tune every layer of the system together, with the aim of delivering more performance per watt of electricity consumed and more performance per dollar.
Technical capabilities
Maia 200 is a significant leap forward in manufacturing technology, as it is built using a 3-nanometre process from Taiwan Semiconductor Manufacturing Company (TSMC). It contains more than 140 billion transistors, which are the tiny switches that perform calculations. For comparison, the latest Meta MTIA chip uses a 5-nanometre process and houses fewer transistors, while the NVIDIA Blackwell Ultra features 208 billion.
In terms of speed, Maia 200 is capable of delivering over 10 petaFLOPS of performance in a format known as 4-bit floating-point precision, or FP4. Precision refers to the number of bits used to represent each number in the chip’s mathematical operations. While higher precision is useful for training models, lower-precision formats like FP4 are becoming the standard for inference because they allow for much faster processing without losing much accuracy.
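To make the trade-off concrete, here is a minimal Python sketch of what rounding numbers to 4 bits looks like. It assumes the common E2M1 flavour of FP4 (one sign bit, two exponent bits, one mantissa bit) and a single shared scale factor; the article does not say which FP4 encoding or scaling scheme Maia 200 uses, so treat both as assumptions.

```python
# Illustrative only: assumes the E2M1 flavour of FP4, whose representable
# magnitudes are listed below. Maia 200's actual encoding is not stated.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_fp4(values):
    """Round each value to the nearest FP4 number after applying a shared scale.

    Returns the rounded values (converted back to ordinary floats) and the scale.
    """
    max_abs = max(abs(v) for v in values)
    # Scale so the largest input lands on the largest FP4 magnitude (6.0).
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    rounded = []
    for v in values:
        scaled = abs(v) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - scaled))
        rounded.append((nearest if v >= 0 else -nearest) * scale)
    return rounded, scale


weights = [0.02, -0.75, 0.31, 1.20, -0.06]
approx, scale = quantize_fp4(weights)
for original, stored in zip(weights, approx):
    print(f"{original:+.2f} is stored as {stored:+.2f}")
```

Each value collapses onto one of only a handful of levels, which is why FP4 numbers are cheap to move and multiply, and why small values can be rounded noticeably.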
Microsoft claims that the FP4 performance of this chip is three times higher than that of the third-generation Amazon Trainium. Furthermore, Maia 200 is said to deliver 30% better performance per dollar than the existing hardware in Microsoft’s current fleet.
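A 30% gain in performance per dollar is not the same as a 30% price cut. The short Python calculation below works through the arithmetic, assuming the workload stays fixed and the 30% figure applies uniformly; the function name is made up for illustration.

```python
def relative_cost(perf_per_dollar_gain):
    """Cost of a fixed workload relative to the old hardware.

    perf_per_dollar_gain is fractional, so 0.30 means 30% better.
    """
    return 1.0 / (1.0 + perf_per_dollar_gain)


gain = 0.30  # figure quoted by Microsoft
cost = relative_cost(gain)
print(f"Relative cost: {cost:.3f}, saving: {1 - cost:.1%}")
# prints "Relative cost: 0.769, saving: 23.1%"
```

On these assumptions, the same amount of work would cost roughly 23% less, not 30% less.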
Overcoming the bottleneck
A major hurdle in modern AI is not just how fast a chip can think, but how fast it can be fed with data. If the data cannot move quickly enough from storage to the processor, the chip sits idle.
Maia 200 tackles this with a redesigned memory system that combines 216GB of HBM3e, a type of high-bandwidth memory, with a substantial amount of on-chip memory called SRAM.
The HBM3e provides a massive 7 terabytes per second of bandwidth, which helps keep the chip’s compute units busy even when running the largest AI models. Meanwhile, the 272MB of on-chip SRAM acts as a very fast local scratchpad for the chip to store data it needs to access repeatedly.
By keeping this data on the chip itself rather than sending it back and forth to external memory, the system saves time and energy. This is supported by a specialised data-movement engine and a high-speed internal network that coordinates the flow of information without stalling the compute process.
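A rough way to see why memory matters is the so-called roofline model, sketched below in Python using the figures quoted in this article. It treats "over 10 petaFLOPS" as exactly 10 and assumes decimal units, so the results are back-of-the-envelope estimates rather than measured numbers.

```python
PEAK_FLOPS = 10e15     # FP4 peak; the article says "over 10 petaFLOPS"
HBM_BANDWIDTH = 7e12   # bytes per second (7 TB/s)
HBM_CAPACITY = 216e9   # bytes (216 GB, decimal units assumed)


def attainable_flops(flops_per_byte):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(PEAK_FLOPS, HBM_BANDWIDTH * flops_per_byte)


# Operations the chip must perform per byte fetched to avoid waiting on memory.
ridge_point = PEAK_FLOPS / HBM_BANDWIDTH
# Time needed to read the entire HBM contents once.
full_sweep_ms = HBM_CAPACITY / HBM_BANDWIDTH * 1e3

print(f"Break-even: about {ridge_point:.0f} operations per byte")
print(f"One full pass over HBM: about {full_sweep_ms:.1f} ms")
for intensity in (2, 100, 1500):
    share = attainable_flops(intensity) / PEAK_FLOPS
    print(f"{intensity:>5} ops/byte -> {share:.2%} of peak")
```

Workloads that reuse each byte only a few times reach a small fraction of peak speed in this model, which is the case that keeping frequently used data in on-chip SRAM is meant to help.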
Building blocks
The internal design of Maia 200 is built like a layered set of building blocks. At the smallest level is a unit called a tile, which is the basic autonomous unit of the chip. Each tile features two types of engines.
One is a specialised maths unit for heavy grid-based calculations such as matrix multiplication, while the other is a more flexible processor that can handle a variety of different tasks. These tiles are grouped into larger clusters that share a pool of fast memory to manage the enormous amounts of data required for real-time AI queries.
To build systems larger than a single chip, Microsoft has developed a two-tier network design based on standard Ethernet technology.
Inside each server unit, four Maia accelerators are joined together by direct bridges. These bridges allow the four chips to talk to each other at ultra-high speeds without needing to pass through an external network switch, which keeps the communication fast and efficient.
For massive data centre deployments, these groups can be linked together to include as many as 6,144 individual chips. This large-scale network uses a custom communication system called the AI Transport Layer, or ATL, which moves data between chips at a rate of 2.8 terabytes per second.
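The scale of such a deployment can be sketched in a few lines of Python. The chip counts and memory size come from the figures above; the class name, the consecutive chip numbering and the tier labels are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaiaCluster:
    """Headline figures from the article; the layout logic is illustrative."""
    total_chips: int = 6144
    chips_per_server: int = 4
    hbm_per_chip_gb: int = 216

    @property
    def servers(self) -> int:
        return self.total_chips // self.chips_per_server

    @property
    def total_hbm_tb(self) -> float:
        return self.total_chips * self.hbm_per_chip_gb / 1000

    def link_tier(self, chip_a: int, chip_b: int) -> str:
        """Which network tier two chips would use to reach each other.

        Assumes chips are numbered consecutively, so IDs 0-3 share the first
        server, 4-7 the second, and so on (a hypothetical scheme).
        """
        for chip in (chip_a, chip_b):
            if not 0 <= chip < self.total_chips:
                raise ValueError(f"chip id {chip} out of range")
        same_server = (chip_a // self.chips_per_server
                       == chip_b // self.chips_per_server)
        return "direct bridge" if same_server else "scale-out network"


cluster = MaiaCluster()
print(cluster.servers, "four-chip groups")
print(f"{cluster.total_hbm_tb:,.1f} TB of HBM3e in total")
print(cluster.link_tier(0, 3))
print(cluster.link_tier(3, 4))
```

At the maximum size, that works out to 1,536 four-chip groups and about 1.3 petabytes of high-bandwidth memory across the cluster.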
Data centre infra
High-performance AI chips generate a significant amount of heat, and Maia 200 is designed to operate within a 750-watt power envelope. To manage this, Microsoft has integrated advanced liquid cooling.
The chips use a second-generation, closed-loop liquid cooling system that acts like a radiator for the server rack. This approach is far more efficient than traditional air cooling.
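For a sense of scale, the Python snippet below turns the quoted figures into two rough numbers. It assumes every chip draws its full 750-watt envelope, treats "over 10 petaFLOPS" as exactly 10, and ignores the power used by cooling, networking and host servers, so real-world values will differ.

```python
CHIP_POWER_W = 750       # power envelope per chip
PEAK_FP4_FLOPS = 10e15   # "over 10 petaFLOPS", treated as exactly 10
MAX_CHIPS = 6144         # largest cluster size quoted in the article

# Peak FP4 throughput per watt, in teraFLOPS.
tflops_per_watt = PEAK_FP4_FLOPS / CHIP_POWER_W / 1e12
# Power drawn by the accelerators alone in a maximum-size cluster, in megawatts.
cluster_power_mw = CHIP_POWER_W * MAX_CHIPS / 1e6

print(f"About {tflops_per_watt:.1f} teraFLOPS per watt at peak")
print(f"About {cluster_power_mw:.1f} MW for {MAX_CHIPS:,} chips")
```

Several megawatts of heat concentrated in a few racks is the kind of load that liquid cooling is designed to carry away.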
Maia 200 is also fully integrated with the existing Azure management system. This means it uses the same security, health monitoring, and diagnostic tools as the rest of Microsoft’s cloud infrastructure.
Because the hardware and software were developed together, Microsoft says it was able to move from the first finished chips to a working data centre deployment in less than half the time of other similar industry programmes.
Software tools
Hardware is only as useful as the software that runs on it, and Microsoft has introduced the Maia SDK (software development kit) to bridge this gap. This software kit is designed to work with popular open-source tools like PyTorch and the ONNX Runtime. It allows researchers and developers to move their existing AI models onto the new hardware without having to rewrite all of their code.
The SDK includes a compiler for Triton, an open-source programming language created by OpenAI to simplify the way programmers write high-speed code for AI chips. For experts who want to squeeze every bit of performance out of the silicon, Microsoft also offers a more complex, low-level language called NPL.
Industry standard
While Maia 200 is a powerful first-party tool, it competes in a market where NVIDIA remains the dominant player.
For example, the NVIDIA Blackwell Ultra features 208 billion transistors and provides up to 15 petaFLOPS of compute power. NVIDIA also benefits from its CUDA (compute unified device architecture) ecosystem, which is a massive collection of software and libraries that has been the industry standard for nearly 20 years.
Given that lead, Microsoft’s Maia 200 is expected to complement NVIDIA chips within a heterogeneous infrastructure rather than replace them.
Maia 200 is specifically tuned for high-volume tasks such as powering Microsoft 365 Copilot and the newest models from OpenAI. By using its own silicon for these predictable and massive workloads, Microsoft can provide a better balance of performance and cost to its customers.
Edited by Jyoti Narayan


