Elon Musk has officially announced the start of GROK 3 training at the Memphis supercomputer facility, equipped with NVIDIA’s current-generation H100 GPUs. The facility, which Musk calls ‘the most powerful AI training cluster in the world,’ came online on Monday running 100,000 liquid-cooled H100 GPUs on a single RDMA fabric.
Training began at approximately 4:20 am local time in Memphis. In a follow-up tweet, Musk said the world’s “most advanced AI” could be ready by December of this year, and he congratulated the teams from xAI, X, and NVIDIA on their work.
Nice work by @xAI team, @X team, @Nvidia & supporting companies getting Memphis Supercluster training started at ~4:20am local time. With 100k liquid-cooled H100s on a single RDMA fabric, it’s the most powerful AI training cluster in the world!
— Elon Musk (@elonmusk) July 22, 2024
xAI shifts strategy and cancels Oracle server deal
The announcement comes in the wake of the recent cancellation of a $10 billion server deal between xAI and Oracle. Musk indicated that the xAI Gigafactory of Compute, initially expected to be operational by the fall of 2025, has started operations ahead of schedule.
xAI had previously rented AI compute from Oracle but walked away from the deal to build its own advanced supercomputer. The project now runs on state-of-the-art H100 GPUs that cost around $30,000 each. GROK 2 was trained on 20,000 GPUs, and GROK 3 requires five times as many to build a more sophisticated AI chatbot, which accounts for the 100,000-GPU cluster.
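A quick back-of-envelope check of those figures, using only the numbers cited above (the per-unit price is the article's approximate figure, and the totals are illustrative, not confirmed by xAI):

```python
# Back-of-envelope check of the scaling and cost figures cited above.
# All inputs are approximate figures from the article, not confirmed by xAI.
grok2_gpus = 20_000      # GPUs reportedly used to train GROK 2
scale_factor = 5         # GROK 3 reportedly needs five times as many
unit_price_usd = 30_000  # approximate price of one H100

grok3_gpus = grok2_gpus * scale_factor
print(f"GROK 3 GPUs: {grok3_gpus:,}")  # 100,000 -> matches the cluster size
print(f"GPU hardware alone: ~${grok3_gpus * unit_price_usd / 1e9:.0f}B")  # ~$3B
```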
This is quite surprising, especially because NVIDIA recently announced the upcoming release of the H200 GPUs, also based on the Hopper architecture. xAI chose to begin training with H100s rather than wait for the H200 or the forthcoming Blackwell-based B100 and B200 GPUs. The H200, which entered mass production in Q2, promises significant performance gains, but xAI’s immediate focus is on leveraging its existing H100 infrastructure to meet its ambitious targets.
Analyst questions power supply for Memphis Supercluster
Dylan Patel, an expert in AI and semiconductors, initially raised concerns over the power supply for the Memphis Supercluster. He pointed out that the current grid supply of 7 megawatts can only sustain about 4,000 GPUs. The Tennessee Valley Authority (TVA) is expected to supply 50MW to the facility under a deal expected to be signed by August 1. However, the substation needed to meet the full power demand will only be completed in late 2024.
I bow down to Elon, he is so fucking good. Deleted the tweet. Yes only 8MW now from grid, 50MW Aug 1st once they sign TVA deal. 200MW by EOY, only need 155MW for 100k GPU but 32k online now and rest online in Q4. 3 months on 100k h100 will get them similar to current GPT 5 run pic.twitter.com/NQp3M5ruu8
— Dylan Patel @ ICML (@dylan522p) July 23, 2024
Analyzing satellite images, Patel noted that Musk has deployed 14 VoltaGrid mobile generators, each capable of producing 2.5 megawatts. Together, these generators supply 35 megawatts of electricity. Combined with the 8MW from the grid, that gives a total of 43MW, enough to power about 32,000 H100 GPUs with some power capping.
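A minimal sketch of Patel's power math, using only the figures quoted above; the implied per-GPU draw covers the whole server, networking, and cooling, and is inferred here rather than stated by Patel:

```python
# Reconstructing the power budget from the figures quoted above.
generators = 14
mw_per_generator = 2.5
grid_mw = 8

generator_mw = generators * mw_per_generator  # 35 MW from mobile generators
total_mw = generator_mw + grid_mw             # 43 MW available today
print(f"Generators: {generator_mw} MW, total with grid: {total_mw} MW")

# Implied all-in draw per GPU, assuming 43 MW really does sustain
# ~32,000 power-capped H100s -- an inference, not a stated figure.
gpus_online = 32_000
print(f"Implied draw: ~{total_mw * 1e6 / gpus_online:.0f} W per GPU")  # ~1,344 W
```

At roughly 1.3kW of facility power per GPU, the arithmetic is consistent with Patel's claim that 155MW would be needed for the full 100,000-GPU build-out.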