Original title: "Prime Intellect: Making Magic to Scale AI Training"
Original author: Teng Yan
Original translation: Siweiguai
Translator's note: With Nvidia's market value exceeding $3 trillion in the middle of the year, GPU compute rental has become the hottest sector in crypto AI in 2024. However, most projects stop at aggregating compute resources and fail to solve the core problem of decentralized AI training: training models across distributed GPU clusters. The cutting-edge project Prime Intellect is trying to break this bottleneck. Crypto researcher Teng Yan wrote this article to explore Prime Intellect's innovative solutions and how it may lead the future of decentralized AI training.
Most GPU marketplaces are mediocre. They tend to repeat the same product experience, and the only addition is a token that subsidizes costs.
But decentralized AI training is a whole new game with transformative potential. Prime Intellect is building critical infrastructure for decentralized AI training at scale.
Here’s why they go beyond the average DePIN project:
Prime Intellect’s grand blueprint consists of four parts:
1. Integrate global computing resources
2. Develop a distributed training framework for collaborative model development
3. Collaboratively train open source AI models
4. Enable collective ownership of AI models
GPU Market Aggregator
On July 1, they began the first phase by launching the GPU Marketplace, which integrates computing resources from major centralized and decentralized GPU vendors, including Akash Network, io.net, Vast.ai, Lambda Cloud, and others. The goal is to give users the best rental prices by aggregating supply and providing convenient tools. Users can rent directly through the Prime Intellect platform instead of visiting Akash, io.net, and the others one by one to compare prices.
Their online testing platform is intuitive and easy to use. Users can spin up a cluster in minutes, without KYC. You can choose where you want to rent GPUs and the security level of the network (such as secure cloud or community cloud), and there is also a "lowest price" option.
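The "lowest price" option can be pictured as a filter followed by a minimum over the aggregated listings. The sketch below is only an illustration of that idea, not Prime Intellect's implementation: the Offer type, the cheapest_offer helper, the tier labels, the anonymized provider names, and every price are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    provider: str            # marketplace the listing comes from (placeholder name)
    gpu: str                 # GPU model, e.g. "H100"
    tier: str                # "secure" or "community" (hypothetical labels)
    usd_per_gpu_hour: float  # invented price, not a real quote


def cheapest_offer(offers, gpu, tier=None) -> Optional[Offer]:
    """Return the lowest-priced offer for `gpu`, optionally restricted to one tier."""
    matches = [o for o in offers if o.gpu == gpu and (tier is None or o.tier == tier)]
    # `default=None` covers the case where no provider lists this GPU.
    return min(matches, key=lambda o: o.usd_per_gpu_hour, default=None)


# Placeholder listings: all names and prices are made up for the example.
OFFERS = [
    Offer("provider-a", "H100", "secure", 3.00),
    Offer("provider-b", "H100", "community", 2.00),
    Offer("provider-c", "H100", "secure", 2.50),
    Offer("provider-c", "RTX 4090", "community", 0.50),
]

best_any = cheapest_offer(OFFERS, "H100")
best_secure = cheapest_offer(OFFERS, "H100", tier="secure")
```

With these placeholder listings, the unrestricted search picks the community-tier offer, while restricting to the secure tier picks a more expensive one. That is the trade-off the security-level selector exposes.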
They offer a variety of GPU options, from the top-end H100 to the RTX 3000 and 4000 series. Cluster size is currently capped at 8 GPUs, and Prime Intellect is working to raise that to 16-128 GPUs.
Large-Scale Decentralized Training
The second part of their blueprint, developing a distributed AI training framework, is the most eye-catching.
Today, training large foundation models usually requires a purpose-built data center. That means high-speed networking, customized data storage, privacy protection, and efficiency optimization, none of which can be achieved by simply renting a handful of GPUs. So it is no surprise that giants such as Microsoft, Google, and OpenAI dominate this field while smaller players lack the necessary resources.
Prime Intellect will enable model training across multiple distributed GPU clusters.
Decentralized training faces multiple challenges:
· Optimizing communication latency and bandwidth between nodes around the world
· Accommodating different types of GPUs in these networks
· Fault tolerance: the training process must be able to adapt to changes in the availability of GPU clusters, as these clusters may join or leave at any time
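A back-of-the-envelope calculation shows why the first challenge dominates. The sketch below estimates how long it takes to ship one full copy of a model's weights over a slow link, and what share of wall-clock time that costs at different synchronization intervals. All inputs are assumptions chosen for illustration (a 1-billion-parameter model in 32-bit floats, a 1 Gbit/s link, 1 second of compute per step), and the model ignores latency, compression, and any overlap of communication with compute.

```python
def sync_seconds(n_params: float, bytes_per_param: int, bandwidth_gbit_s: float) -> float:
    """Seconds to send one full copy of the weights over a link of the given bandwidth."""
    bits = n_params * bytes_per_param * 8
    return bits / (bandwidth_gbit_s * 1e9)


def comm_fraction(step_s: float, sync_s: float, sync_every: int) -> float:
    """Share of wall-clock time spent communicating when syncing every `sync_every` steps."""
    compute = step_s * sync_every
    return sync_s / (compute + sync_s)


# Assumed setup: 1e9 parameters, 4 bytes each, 1 Gbit/s between sites.
one_sync = sync_seconds(1e9, 4, 1.0)

# Assumed 1 s of compute per training step.
every_step = comm_fraction(1.0, one_sync, sync_every=1)
every_500 = comm_fraction(1.0, one_sync, sync_every=500)

print(f"one sync: {one_sync:.0f} s")
print(f"communication share, sync every step: {every_step:.0%}")
print(f"communication share, sync every 500 steps: {every_500:.0%}")
```

Under these assumptions a single synchronization takes about half a minute. Syncing after every step would leave the GPUs idle almost all the time, while syncing every few hundred steps keeps communication to a small share of the run.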
This requires translating cutting-edge research into actual production systems:
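· Distributed Low-Communication Training (DiLoCo): A method for data-parallel training on poorly connected devices. Each worker trains locally and the workers synchronize only about every 500 steps instead of every step, exchanging the accumulated change in their weights rather than per-step gradients.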
· Prime Intellect recently open-sourced a framework that supports collaborative model development on globally distributed GPUs, making the code available to anyone.
· They reproduced Google DeepMind's DiLoCo experiment, training models across 3 countries with 90-95% compute utilization. They also scaled the experiment to 3 times the size of the original work, demonstrating that the method holds up on a billion-parameter model.
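To make the DiLoCo idea concrete, here is a minimal toy sketch of one way its inner/outer loop can be written. It is not Prime Intellect's code. Two simplifications are deliberate and should be read as assumptions: the DiLoCo paper uses AdamW for the local steps and Nesterov momentum for the outer update, whereas this sketch uses plain SGD for both, and the "model" is just two numbers fitted to a made-up quadratic loss per worker. The function names and hyperparameters are hypothetical.

```python
def local_steps(weights, target, lr, n_steps):
    """Run n_steps of plain SGD on the toy loss 0.5 * ||w - target||^2."""
    w = list(weights)
    for _ in range(n_steps):
        # The gradient of the toy loss with respect to w is (w - target).
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def diloco_round(global_w, shard_targets, inner_lr, inner_steps, outer_lr):
    """One outer round: every worker trains locally, then a single sync."""
    deltas = []
    for target in shard_targets:
        local_w = local_steps(global_w, target, inner_lr, inner_steps)
        # Pseudo-gradient: how far this worker moved away from the shared weights.
        deltas.append([g - l for g, l in zip(global_w, local_w)])
    # The only communication in the round: average the pseudo-gradients.
    avg = [sum(col) / len(deltas) for col in zip(*deltas)]
    # Outer update (plain SGD here; the paper uses Nesterov momentum).
    return [g - outer_lr * a for g, a in zip(global_w, avg)]


# Three workers, each with a toy "data shard" represented by a target point.
targets = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
w = [0.0, 0.0]
for _ in range(20):
    w = diloco_round(w, targets, inner_lr=0.1, inner_steps=500, outer_lr=0.7)

print([round(x, 6) for x in w])
```

The shared weights settle at the point that minimizes the combined loss of all three shards, even though the workers exchange information only once per 500 local steps. Because the outer step averages over whichever pseudo-gradients it is given, a round could in principle run with only the workers that reported back, which hints at how such a scheme might tolerate clusters joining or leaving.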
If Prime Intellect can solve these problems, it could significantly change how models are trained and how efficiently computing resources are used.
The last piece Prime Intellect is developing is a protocol that rewards participants who contribute computing power, code, and funds, and that enables collective governance of AI models. This fits the ethos of decentralized AI and encourages users to participate. They may well use cryptocurrency as the medium for transactions and ownership.
My opinion
· The current GPU market is largely undifferentiated and unappealing. Although some marketplaces have aggregated supply through token incentives, the demand side remains weak because decentralized training is still so hard.
· The global decentralized GPU market is highly competitive. (Here is a price comparison of several GPU providers:)
· If Prime Intellect can make decentralized AI training more efficient, it will unlock demand for GPUs.
· Prime Intellect is backed by well-known investors, including Clem Delangue (co-founder and CEO of Hugging Face), Erik Voorhees (founder and CEO of ShapeShift), and Andrew Kang (co-founder and partner of Mechanism Capital).