An exclusive tour of the lab behind Amazon’s Trainium, the chip that’s won over Anthropic, OpenAI, even Apple
Curious, I agreed to go. My tour guides for the day were the lab’s director, Kristopher King (pictured below right) and director of engineering Mark Carroll (below left), as well as the team’s PR person who arranged the visit, Doron Aronson (pictured with yours truly later in the story).
We’ll see if that exclusivity stands exactly as announced.
What makes AWS so appealing to OpenAI?
“Our customer base is just expanding as fast as we can get capacity out there,” King said.
“Bedrock could be as big as EC2 one day,” he added, referring to AWS’s behemoth compute cloud service.
“What that gives us is something huge,” Carroll said.
“That’s why Trainium3 is breaking all kinds of records,” particularly in “price per power,” he said. When trillions of tokens a day are involved, such improvements add up. In fact, Amazon’s chip team was lauded by Apple in 2024.
These chips represent the classic Amazon playbook: See what people want to buy, then build an in-house alternative that competes on price. The catch for chips, historically, has been switching costs. Applications written for Nvidia’s chips must be re-architected to work with others — a time-consuming process that discourages developers from switching. But the AWS chip team proudly told me that Trainium now supports PyTorch, a popular open source framework for building AI models. That includes many of the ones hosted on Hugging Face, a vast library where developers share open source models. The transition, Carroll told me, requires “basically a one-line change, and then recompile, and then run on Trainium.” In other words, Amazon is attempting to chip away at Nvidia’s market dominance wherever possible.
But Amazon’s ambitions go beyond the chips themselves. It also designs the servers that host them. Beyond the networking components, this team designed “Nitro,” a hardware-software combo that provides virtualization tech (which allows many instances of software to run separately on the same server); new state-of-the-art liquid cooling technology; and the server sleds (pictured below) that host this gear. All of it is aimed at controlling costs and improving performance.
This team has now spent more than 10 years designing chips for AWS. The unit has retained its Annapurna roots and name; its logo is everywhere in the office.
The offices have your classic tech corporate vibe: desks in cubicles, gathering spots, and conference rooms. But tucked away at the back of a high floor in the building is the actual lab, with sweeping views of the city. The shelving-filled lab, about the size of two large conference rooms, is a noisy industrial space thanks to the fans on the equipment. It looks like a cross between a high school shop class and a Hollywood set for a high-end lab, except the engineers are dressed in jeans, not white lab coats. Note that this is not where the chips are manufactured, so no white hazmat suits were necessary.
But this is the room where the magic of the “bring-up” occurs. “A silicon bring-up is when you get the chip for the first time, and it’s like a big overnight party. You stay here, like a lock-in,” King explained.
After 18 months of work, the chip is activated for the first time to verify it works as designed.
The team even filmed some of the Trainium3 bring-up and posted it on YouTube. Spoiler alert: It’s never problem-free. For Trainium3, the prototype chip was originally air-cooled, like previous versions.
Unfazed, the team “immediately got a grinder and just started grinding off the metal,” King said. Because they didn’t want the noise disrupting the bring-up pizza party atmosphere, they snuck off and did the grinding in a conference room. Staying up all night and solving problems “is what silicon bring-up is all about,” King said.

The lab even has a welding station, where hardware lab engineer and master welder Isaac Guevara demonstrated welding tiny integrated circuit components through a microscope. This is such insanely difficult work that senior leader Carroll openly admitted he couldn’t do it, to the guffaws of Guevara and the rest of the engineers in the room.

The lab also contains both custom-made and commercial tools for testing and analyzing issues with chips. Here’s signal engineer Arvind Srinivasan demonstrating how the lab tests each tiny component on the chip:

Sleds are the star of the lab

But the star of the lab is an entire row showcasing each generation of the “sleds” the team designed.
Stack them together on a rack with the networking component, also custom-designed by this team, and you get the systems that are at the heart of Anthropic Claude’s success. Here’s the sled that was shown off during the AWS re:Invent conference in December:

Proven by Anthropic and OpenAI

I expected my guides to crow about the OpenAI deal during the tour. They didn’t. The reticence could have been related to the aforementioned potential legal haze that might hang over the deal. But the sense I got was that these boots-on-the-ground engineers (who are currently designing the next version, Trainium4) haven’t had much chance to work with OpenAI yet. Their day-to-day work has so far been focused on Anthropic’s and Amazon’s needs.
Trainium is already proven with Anthropic. But there was a wall monitor in the main office displaying a quote about how OpenAI will be using Trainium. The pride was there, if subtle.

In addition to this lab, the team also has its own private data center for quality and testing purposes. A short drive away, it’s housed at a co-location facility rather than an AWS data center, since it doesn’t run customer workloads. Security is tight: There are strict protocols to enter the building and to access Amazon’s area within.
It’s not a pleasant place for the average person to hang out.
Here’s what a current Trn3 UltraServer looks like: Multiple sleds are on top and bottom, with the Neuron switches in the middle.

Hardware development engineer David Martinez-Darrow is seen here performing maintenance on a sled:

While attention on the team has always been high, the scrutiny has really ratcheted up as of late. Amazon CEO Andy Jassy keeps a close eye on this lab, publicly bragging about its products like a proud dad. In December, he said Trainium was already a multibillion-dollar business for AWS and called it one piece of AWS tech he’s most excited about. He also gave the chip a shout-out when announcing the OpenAI agreement. The team feels the pressure, too.
“It’s very important that we get as fast as possible to prove that it’s actually going to work,” Carroll said. “So far, we’ve been doing really well.”