By Jeff Amico

Compiled by: TechFlow

Introduction

During the COVID-19 pandemic, Folding@home achieved a major milestone. The research project accessed 2.4 exaFLOPS of computing power, provided by 2 million volunteer devices around the world. This represented fifteen times the processing power of the world's largest supercomputer at the time, allowing scientists to simulate COVID protein dynamics at scale. Their work advanced our understanding of the virus and its pathological mechanisms, especially in the early stages of the pandemic.

Global distribution of Folding@home users, 2021

Folding@home builds on a long history of volunteer computing, where projects crowdsource computing resources to solve large-scale problems. The idea gained traction in the 1990s with SETI@home, which brought together more than 5 million volunteer computers to search for extraterrestrial life. The idea has since been applied to a variety of fields, including astrophysics, molecular biology, mathematics, cryptography, and gaming. In each case, the collective power has amplified the capabilities of individual projects far beyond what they could achieve alone. This drives progress and enables research to be conducted in a more open and collaborative way.

Many have wondered whether this crowdsourcing model could be applied to deep learning. In other words, can we train a large neural network on the crowd? Training cutting-edge models is one of the most computationally intensive tasks in human history, and unlike the workloads behind most @home projects, the costs currently put it within reach of only the largest players. This could hinder future progress as we become dependent on fewer and fewer companies to find new breakthroughs. It also concentrates control of our AI systems in the hands of a few. Regardless of your views on this technology, this is a future worth paying attention to.

Most critics dismiss the idea of decentralized training as incompatible with current training techniques. However, this view is increasingly outdated. New techniques have emerged that reduce the need for inter-node communication, allowing efficient training on devices with poor network connectivity. These include DiLoCo, SWARM Parallelism, lo-fi, and decentralized training of foundation models in heterogeneous environments. Many of them are fault-tolerant and support heterogeneous compute. There are also new architectures designed specifically for decentralized networks, including DiPaCo and decentralized mixture-of-experts models.

We are also seeing the maturation of various cryptographic primitives that enable networks to coordinate resources on a global scale. These technologies support use cases such as digital currencies, cross-border payments, and prediction markets. Unlike early volunteer projects, these networks are able to aggregate astonishing computing power, often orders of magnitude larger than the largest cloud training clusters currently envisioned.

Together, these elements form a new paradigm for model training. This paradigm exploits the full extent of the world’s computing resources, including the vast number of edge devices that can be put to work if connected together. It would reduce the cost of most training workloads by introducing new competition. It could also unlock new forms of training, making model development collaborative and modular rather than isolated and monolithic. Models could source compute and data from the crowd, learning in real time. Individuals could own a portion of the models they help create. And researchers could once again share novel findings openly, without needing to monetize their discoveries to recoup enormous compute budgets.

This report examines the current state of large-scale model training and the associated costs. It reviews previous distributed computing efforts—from SETI to Folding to BOINC—as inspiration to explore alternative paths. The report discusses historical challenges to decentralized training and turns to recent breakthroughs that may help overcome these challenges. Finally, it summarizes the opportunities and challenges ahead.

The current state of cutting-edge model training

The cost of training cutting-edge models has become prohibitive for all but the largest players. This trend is not new, but it is becoming more severe as frontier labs continue to push the limits of scaling. OpenAI reportedly spent over $3 billion on training this year. Anthropic predicts that by 2025 we will begin training $10 billion models, and that $100 billion models will not be far behind.

This trend has led to industry concentration, as only a handful of companies can afford to participate. It raises a core policy question for the future: are we comfortable with a world in which all leading AI systems are controlled by one or two companies? It also limits the rate of progress, a point that is evident in the research community, where smaller labs cannot afford the compute needed to scale their experiments. Industry leaders have raised this repeatedly:

Joe Spisak, Meta: To really understand the power of [model] architectures, you have to explore them at scale, and I think that's what's missing from the current ecosystem. If you look at academia -- there's a lot of great talent in academia, but they lack access to compute resources, and that becomes a problem because they have these great ideas but don't really have the means to implement them at the level that they need.

Max Ryabinin, Together: The need for expensive hardware puts a lot of pressure on the research community. Most researchers cannot participate in large neural network development because conducting the necessary experiments would be too expensive for them. If we continue to increase the size of models by scaling them up, we will eventually be unable to compete.

Francois Chollet, Google: We know that large language models (LLMs) have not yet achieved artificial general intelligence (AGI). At the same time, progress toward AGI has stalled. The limitations we face with large language models are exactly the same as those we faced five years ago. We need new ideas and breakthroughs. I think the next breakthrough is likely to come from an outside team while all the large labs are busy training ever-larger language models.

Some are skeptical of these concerns, arguing that hardware improvements and cloud capital expenditures will solve the problem. But this seems unlikely. By the end of this decade, new generations of Nvidia chips will deliver significantly more FLOPs, perhaps 10 times that of today's H100. This will reduce the price per FLOP by 80-90%. Similarly, total FLOP supply is expected to grow roughly 20-fold by the end of the decade, alongside improvements in networking and related infrastructure. All of this will improve training efficiency per dollar.

Source: SemiAnalysis AI Cloud TCO Model

At the same time, total FLOP demand will also rise sharply as labs push to scale further. If the decade-long trend in training compute holds, FLOPs for frontier training runs are expected to reach roughly 2e29 by 2030. Training at this scale would require about 20 million H100-equivalent GPUs, based on current training durations and utilization. Assuming multiple frontier labs remain in the race, the total required FLOPs will be several times that figure, since the overall supply would be split among them. Epoch AI predicts that we will need roughly 100 million H100-equivalent GPUs by then, about 50 times 2024 shipments. SemiAnalysis makes a similar prediction, expecting frontier training demand and GPU supply to grow roughly in tandem over this period.

Capacity could become tighter still for a number of reasons: manufacturing bottlenecks could delay projected shipment schedules (not uncommon), energy production could fall short of what data centers need, new energy sources might not connect to the grid in time, or growing scrutiny of capital expenditures could force the industry to scale back. In the best case, our current approach lets only a handful of companies continue to push research forward, and even that may not be enough.

Clearly, a new approach is needed. One that doesn’t require ever-expanding data centers, capital expenditures, and energy consumption in search of the next breakthrough, but instead makes efficient use of existing infrastructure and can flex with demand. This would allow far more experimentation in research, since training runs would no longer need to guarantee a return on billion-dollar compute budgets. Freed from that constraint, we could move beyond the current large language model (LLM) paradigm, a step many believe will be necessary to reach artificial general intelligence (AGI). To understand what this alternative might look like, we can draw inspiration from past distributed computing efforts.

Swarm computing: a brief history

SETI@home popularized the concept in 1999, allowing millions of participants to analyze radio signals in the search for extraterrestrial intelligence. SETI collected electromagnetic data from the Arecibo telescope, split it into batches, and sent it to users over the internet. Users' machines analyzed the batches during idle time and sent the results back. No communication between users was required, and batches could be analyzed independently, allowing a high degree of parallelism. At its peak, SETI@home had over 5 million participants and more processing power than the largest supercomputers of the time. It finally shut down in March 2020, but its success inspired the volunteer computing movement that followed.

Folding@home continued the idea in 2000, using edge computing to simulate protein folding in diseases like Alzheimer’s, cancer, and Parkinson’s. Volunteers ran protein simulations during their PCs' idle time, helping researchers study how proteins misfold and cause disease. At various points in its history, its computing power has exceeded that of the largest supercomputers of the day, including in the late 2000s and again during COVID, when it became the first distributed computing project to exceed one exaFLOPS. Since its inception, Folding researchers have published more than 200 peer-reviewed papers, each relying on volunteers' computing power.

The Berkeley Open Infrastructure for Network Computing (BOINC) generalized the idea in 2002, providing a crowdsourced computing platform for a variety of research projects. It has supported projects such as SETI@home and Folding@home, as well as new ones in fields such as astrophysics, molecular biology, mathematics, and cryptography. As of 2024, BOINC lists 30 ongoing projects and nearly 1,000 published scientific papers produced using its computing network.

Outside of research, volunteer computing has been used to train engines for games such as Go (LeelaZero, KataGo) and chess (Stockfish, LeelaChessZero). LeelaZero was trained from 2017 to 2021 through volunteer computing, playing over 10 million games against itself and becoming one of the strongest Go engines available today. Similarly, Stockfish has been trained continuously on a volunteer network since 2013, making it one of the most popular and powerful chess engines.

Challenges of deep learning

But can we apply this model to deep learning? Can we network edge devices around the world to create a low-cost public training cluster? Consumer hardware, from Apple laptops to Nvidia gaming graphics cards, keeps getting better at deep learning. In many cases, these devices even beat data-center graphics cards on performance per dollar.

However, to effectively utilize these resources in a distributed environment, we need to overcome various challenges.

First, current distributed training techniques assume frequent communication between nodes.

Current state-of-the-art models have become so large that training must be split across thousands of GPUs. This is achieved through a variety of parallelization techniques, typically splitting the model, the dataset, or both across the available GPUs. This typically requires high-bandwidth and low-latency networks, otherwise nodes will sit idle, waiting for data to arrive.

For example, distributed data parallelism (DDP) distributes the dataset across GPUs, with each GPU training a full model on its specific piece of data and then sharing its gradient updates to generate new model weights at each step. This requires relatively limited communication overhead, as nodes only share gradient updates after each backpropagation, and collective communication operations can partially overlap with computation. However, this approach is only suitable for smaller models because it requires each GPU to store the weights, activation values, and optimizer states of the entire model in memory. For example, GPT-4 requires more than 10TB of memory when training, while a single H100 has only 80GB.
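To make the mechanics concrete, below is a minimal sketch of the gradient-averaging step in data parallelism, written against PyTorch's torch.distributed collectives. The model, batch format, and loss function are placeholders, and production systems would use torch.nn.parallel.DistributedDataParallel, which also overlaps this communication with the backward pass.

```python
# Minimal sketch of data-parallel training: every rank holds a full model
# copy, computes gradients on its own shard of data, then averages gradients
# with an all-reduce so all replicas apply an identical update.
# Placeholder model/batch/loss; real systems use DistributedDataParallel.
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()

    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients across all ranks, then divide to get the mean.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
    return loss.item()
```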

To address this memory constraint, various techniques split the model itself so that it can be distributed across GPUs. For example, tensor parallelism splits the weights within a single layer, so that each GPU performs its portion of the operation and passes the output to the others. This reduces each GPU's memory requirements but demands constant communication between them, so high-bandwidth, low-latency connections are required for efficiency.
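As a toy illustration of the idea (not any particular library's implementation), the sketch below stores only a column slice of a linear layer's weight on each rank and reassembles the full activation with an all-gather; shapes and initialization are arbitrary, and gradient flow through the collective is omitted for brevity.

```python
# Toy column-parallel linear layer: each rank owns a slice of the output
# columns and the full activation is rebuilt with an all-gather.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world = dist.get_world_size()
        assert out_features % world == 0, "sketch assumes even divisibility"
        # Each rank stores only its slice of the weight matrix.
        self.local_weight = torch.nn.Parameter(
            torch.randn(out_features // world, in_features) * 0.02
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Local partial result: (batch, out_features // world_size)
        local_out = x @ self.local_weight.t()
        # Gather every rank's slice to rebuild the full activation.
        # (Gradient flow through all_gather is omitted for brevity.)
        world = dist.get_world_size()
        pieces = [torch.empty_like(local_out) for _ in range(world)]
        dist.all_gather(pieces, local_out)
        return torch.cat(pieces, dim=-1)
```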

Pipeline parallelism distributes the layers of the model across GPUs, with each GPU performing its work and sharing updates with the next GPU in the pipeline. While this requires less communication than tensor parallelism, it can lead to "bubbles" (i.e., idle time) where GPUs later in the pipeline wait for information from earlier GPUs to start their work.
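A rough rule of thumb for the bubble in a GPipe-style schedule with p stages and m microbatches is that the idle fraction is about (p - 1) / (m + p - 1); the short calculation below uses illustrative numbers to show why more microbatches shrink the bubble.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Approximate idle fraction for a GPipe-style pipeline schedule."""
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(stages=8, microbatches=8))    # ~0.47: nearly half the time is idle
print(bubble_fraction(stages=8, microbatches=64))   # ~0.10: more microbatches help
```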

Various techniques have been developed to address these challenges. For example, ZeRO (Zero Redundancy Optimizer) is a memory optimization technique that trades extra communication for lower memory usage, allowing larger models to be trained on a given set of devices. ZeRO reduces memory requirements by partitioning model parameters, gradients, and optimizer states across GPUs, but relies on additional communication so that each device can fetch the shards it needs. It underpins popular implementations such as Fully Sharded Data Parallel (FSDP) and DeepSpeed.
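The sketch below is a deliberately simplified, conceptual version of that trade: gradients are reduce-scattered so each rank updates only its own shard of a flat parameter vector, and the updated shards are then all-gathered. Real implementations shard per parameter group, keep optimizer state sharded as well, and overlap these collectives with compute.

```python
# Conceptual ZeRO-style step on a flat parameter vector (not a real library API).
import torch
import torch.distributed as dist

def zero_style_step(flat_params: torch.Tensor, flat_grads: torch.Tensor, lr: float = 1e-3):
    world = dist.get_world_size()
    rank = dist.get_rank()
    # Assume the parameter count divides evenly by the world size for the sketch.
    grad_chunks = list(flat_grads.chunk(world))
    grad_shard = torch.empty_like(grad_chunks[rank])

    # 1. Each rank receives only the summed gradient for its own shard.
    dist.reduce_scatter(grad_shard, grad_chunks, op=dist.ReduceOp.SUM)
    grad_shard /= world

    # 2. Plain SGD update on the local shard (stands in for the real optimizer,
    #    whose states would also live only on this rank).
    param_shard = flat_params.chunk(world)[rank].clone()
    param_shard -= lr * grad_shard

    # 3. Re-assemble the full parameter vector on every rank.
    gathered = [torch.empty_like(param_shard) for _ in range(world)]
    dist.all_gather(gathered, param_shard)
    flat_params.copy_(torch.cat(gathered))
```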

These techniques are often combined in large training runs to maximize resource utilization, a setup known as 3D parallelism. In this configuration, tensor parallelism is typically used to split weights across GPUs within a single server, since each split layer requires heavy communication. Pipeline parallelism is then used to distribute layers across servers (within the same island of the data center), since it requires less communication. Finally, data parallelism or fully sharded data parallelism (FSDP) splits the dataset across server islands, since it can tolerate longer network latencies by sharing updates asynchronously and/or compressing gradients. Meta used this combined approach to train Llama 3.1, as shown in the figure below.

These approaches pose core challenges for decentralized training networks, which rely on devices connected over (slower and more variable) consumer-grade internet. In that environment, communication costs can quickly outweigh the benefits of edge compute, because devices sit idle waiting for data to arrive. As a simple example, distributed data-parallel training of a 1-billion-parameter model at half precision requires each GPU to share 2GB of data at every optimization step. Over a typical internet connection (say, 1 gigabit per second), and assuming computation and communication do not overlap, transmitting the gradient update takes at least 16 seconds, leaving devices idle for most of the step. Techniques like tensor parallelism, which require far more communication, would of course fare even worse.
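The arithmetic behind that example is worth spelling out, since it drives most of the design choices discussed later:

```python
# Back-of-the-envelope version of the 16-second figure in the text.
params = 1_000_000_000          # 1B parameters
bytes_per_param = 2             # half precision (fp16 / bf16)
grad_bytes = params * bytes_per_param        # ~2 GB of gradients per step
link_bits_per_second = 1e9                   # 1 Gbit/s consumer uplink
seconds = grad_bytes * 8 / link_bits_per_second
print(f"~{seconds:.0f} s per optimizer step just to send gradients")  # ~16 s
```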

Second, current training techniques lack fault tolerance. Like any distributed system, training clusters become more prone to failures as they scale. The problem is exacerbated in training, however, because our current techniques are largely synchronous: GPUs must work in lockstep to complete the run. The failure of a single GPU among thousands can halt the entire training process, forcing every other GPU to wait while the job restarts from the most recent checkpoint. In some cases, a GPU does not fail outright but becomes sluggish for any number of reasons, slowing down thousands of other GPUs in the cluster. Given the size of today's clusters, this can mean tens to hundreds of millions of dollars in added cost.

Meta detailed these issues during their Llama training, where they experienced over 400 unexpected interruptions, roughly eight per day. These were mainly attributed to hardware problems, such as GPU or host failures. Their GPU utilization came in at only 38-43%. OpenAI did even worse during GPT-4 training, at just 32-36%, also due to frequent failures during the run.

In other words, frontier labs still struggle to exceed 40% utilization even when training in fully optimized environments, with homogeneous, state-of-the-art hardware, networking, power, and cooling. This is mainly due to hardware failures and network issues, which get worse in edge training environments where devices have uneven processing power, bandwidth, latency, and reliability. Not to mention that decentralized networks are vulnerable to malicious actors who may try to undermine the overall project or cheat on specific workloads for any number of reasons. Even SETI@home, a purely volunteer network, saw cheating from some participants.

Third, cutting-edge model training requires massive amounts of compute power. While projects like SETI and Folding have achieved impressive scale, they pale in comparison to the compute power required for cutting-edge training today. GPT-4 was trained on a cluster of 20,000 A100s with a peak throughput of 6.28 ExaFLOPS at half precision. This is three times more compute power than Folding@home had at its peak. Llama 405b was trained using 16,000 H100s with a peak throughput of 15.8 ExaFLOPS, seven times Folding’s peak. This gap will only widen as multiple labs plan to build clusters of more than 100,000 H100s, each with a staggering 99 ExaFLOPS of compute power.

This makes sense, since @home projects are volunteer-driven. Contributors donate their memory and processor cycles, and cover the associated costs. This naturally limits their size relative to commercial projects.

Recent progress

While these issues have historically plagued decentralized training efforts, they no longer appear to be insurmountable. New training techniques have emerged that reduce the need for inter-node communication, allowing efficient training on internet-connected devices. Many of these techniques originate from large labs that want to add greater scale to model training and therefore require efficient communication techniques across data centers. We are also seeing progress in fault-tolerant training methods and cryptographic incentive systems that can support larger-scale training at the edge.

Efficient communication techniques

DiLoCo is a recent work from Google that reduces communication overhead by performing local optimizations before passing updated model state between devices. Their approach (building on earlier federated learning research) shows comparable results to traditional synchronous training while reducing the amount of communication between nodes by 500x. The approach has since been replicated by other researchers and scaled to train larger models (over 1 billion parameters). It also scales to asynchronous training, meaning nodes can share gradient updates at different times rather than all at once. This better accommodates edge hardware with varying processing power and network speeds.
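For intuition, here is a schematic of the inner/outer loop described in the DiLoCo paper: each worker takes many local AdamW steps, and the only communication per round is averaging a "pseudo-gradient" (the difference between the shared weights and the locally updated ones), which an outer SGD optimizer with Nesterov momentum then applies. The function names, loss interface, and use of all_reduce here are illustrative assumptions, not the paper's actual code.

```python
import copy
import torch
import torch.distributed as dist

H = 500  # inner (local) steps between synchronizations; hundreds in the paper

def diloco_round(global_model, outer_opt, local_batches, loss_fn):
    # Each worker starts the round from the current shared weights.
    local_model = copy.deepcopy(global_model)
    inner_opt = torch.optim.AdamW(local_model.parameters(), lr=1e-4)

    for batch in local_batches[:H]:            # H steps with no communication
        inner_opt.zero_grad()
        loss_fn(local_model(batch["x"]), batch["y"]).backward()
        inner_opt.step()

    # "Pseudo-gradient": how far this worker drifted from the shared weights.
    # Averaging it across workers is the only communication in the round.
    world = dist.get_world_size()
    for g_param, l_param in zip(global_model.parameters(), local_model.parameters()):
        delta = g_param.data - l_param.data
        dist.all_reduce(delta, op=dist.ReduceOp.SUM)
        g_param.grad = delta / world           # hand the averaged delta to the outer optimizer

    outer_opt.step()                           # SGD with Nesterov momentum in the paper
    outer_opt.zero_grad()

# outer_opt would be constructed once, e.g. (values are illustrative):
# outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
```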

Other data-parallel approaches, such as lo-fi and DisTrO, aim to push communication costs down further. Lo-fi proposes fully local fine-tuning: nodes train independently and only exchange weights at the end. The approach matches baselines when fine-tuning language models with over a billion parameters, while eliminating communication overhead entirely. In a preliminary report, DisTrO claims to use a new family of distributed optimizers that it believes can cut communication requirements by four to five orders of magnitude, though the approach has yet to be independently confirmed.
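Because lo-fi only merges weights once training is finished, its entire "synchronization" step can be as simple as averaging checkpoints. The sketch below assumes every node fine-tuned an identical architecture; the file paths in the usage note are hypothetical.

```python
import torch

def average_checkpoints(state_dicts):
    """Uniformly average the weights of independently fine-tuned checkpoints."""
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        averaged[key] = stacked.mean(dim=0)
    return averaged

# Hypothetical usage once every node has finished training locally:
# merged = average_checkpoints([torch.load(p) for p in ["node0.pt", "node1.pt"]])
```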

New model-parallel approaches have also emerged that make even greater scale possible. DiPaCo (also from Google) partitions the model into modules, each holding distinct experts, to facilitate training for specific tasks. Training data is then sharded by "paths," which are sequences of experts corresponding to each data sample. Given a shard, each worker can train its path almost independently, apart from the communication needed to share modules, which is handled by DiLoCo. This architecture cuts the training time of a billion-parameter model by more than half.

SWARM Parallelism and Decentralized Training of Foundation Models in Heterogeneous Environments (DTFMHE) also propose model-parallel approaches for training large models in heterogeneous environments. SWARM found that as model size grows, the communication requirements of pipeline parallelism shrink relative to compute, making it feasible to train larger models effectively over lower bandwidth and higher latency. To apply this idea in heterogeneous settings, they use temporary "pipeline connections" between nodes that can be updated on the fly at each iteration. A node can send its output to any peer in the next pipeline stage, so if one peer is faster than the others, or if any participant drops out, outputs can be dynamically rerouted and training continues as long as at least one active participant remains in each stage. They used this approach to train a model with more than 1 billion parameters on low-cost heterogeneous GPUs with slow interconnects (as shown in the figure below).
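The rerouting idea can be illustrated in a few lines: pick the fastest live peer in the next stage, and if sending fails, mark the peer dead and try another. The data structures and send function below are assumptions for illustration, not SWARM's actual interfaces.

```python
import random

def pick_peer(next_stage_peers):
    """next_stage_peers: list of dicts like {"id": str, "alive": bool, "latency_s": float}."""
    live = [p for p in next_stage_peers if p["alive"]]
    if not live:
        raise RuntimeError("no active peer in the next pipeline stage")
    # Prefer low-latency peers, but keep some randomness to spread load.
    weights = [1.0 / max(p["latency_s"], 1e-3) for p in live]
    return random.choices(live, weights=weights, k=1)[0]

def forward_activation(activation, next_stage_peers, send_fn):
    while True:
        peer = pick_peer(next_stage_peers)
        try:
            return send_fn(peer["id"], activation)   # hypothetical transport call
        except ConnectionError:
            peer["alive"] = False                    # mark dead and reroute
```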

DTFMHE similarly proposes a novel scheduling algorithm, together with pipeline and data parallelism, to train large models on devices spread across three continents. Even though their network was 100x slower than datacenter interconnects, their approach was only 1.7-3.5x slower than running standard DeepSpeed in a datacenter. Like SWARM, DTFMHE shows that communication costs can be effectively hidden as model size increases, even in geographically distributed networks. This allows weak connections between nodes to be overcome through various techniques, including increasing hidden-layer size and adding more layers per pipeline stage.

Fault tolerance

Many of the data-parallel approaches described above are fault-tolerant by default, since each node stores the entire model in memory. This redundancy often means that nodes can still function independently even if other nodes fail. This is important for decentralized training, since nodes are often unreliable, heterogeneous, and can even behave maliciously. However, as mentioned earlier, pure data-parallel approaches only work for smaller models, so model size is constrained by the memory capacity of the smallest node in the network.

To address these issues, researchers have proposed fault-tolerant techniques suited to model-parallel (or hybrid-parallel) training. SWARM responds to peer failures by prioritizing stable, low-latency peers and rerouting tasks within pipeline stages when failures occur. Other methods, such as Oobleck, take a similar approach, creating multiple "pipeline templates" to provide redundancy against partial node failures. Although tested in data centers, Oobleck's approach provides strong reliability guarantees that also apply to decentralized environments.

We are also seeing new model architectures, such as the Decentralized Mixture of Experts (DMoE), designed to support fault-tolerant training in decentralized environments. Like a traditional mixture of experts, DMoE consists of multiple independent "expert" networks distributed across a set of worker nodes. It uses a distributed hash table to track and consolidate asynchronous updates in a decentralized manner. This mechanism (also used in SWARM) is resilient to node failures, since experts can simply be excluded from the averaging step if their nodes fail or do not respond in time.
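The fault-tolerance logic is easy to sketch in isolation: request updates from every node with a deadline, drop whoever fails or times out, and average the rest. The fetch function and timeout below are assumptions; systems like Hivemind implement this over a distributed hash table rather than direct calls.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import torch

def robust_average(nodes, fetch_update, timeout_s=5.0):
    """Average expert updates from `nodes`, skipping any that fail or time out.
    `fetch_update(node)` is a hypothetical callable returning a torch.Tensor."""
    pool = ThreadPoolExecutor(max_workers=max(len(nodes), 1))
    futures = [pool.submit(fetch_update, node) for node in nodes]
    updates = []
    for fut in futures:
        try:
            updates.append(fut.result(timeout=timeout_s))
        except (FutureTimeout, ConnectionError):
            continue  # unresponsive or failed node: drop it from this round
    pool.shutdown(wait=False, cancel_futures=True)  # don't block on stragglers
    if not updates:
        raise RuntimeError("no node responded in time")
    return torch.stack(updates).mean(dim=0)
```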

Scale

Finally, cryptographic incentive systems like those used by Bitcoin and Ethereum can help achieve the required scale. Both networks crowdsource computation by paying contributors with a native asset that increases in value as adoption grows. This design incentivizes early contributors by giving them generous rewards that can be gradually reduced once the network reaches a minimum viable scale.

To be sure, this mechanism has pitfalls that need to be avoided. Chief among them is over-incentivizing supply without generating corresponding demand. It can also raise regulatory issues if the underlying network is not sufficiently decentralized. When designed properly, however, decentralized incentive systems can achieve significant scale over a sustained period.

For example, Bitcoin consumes about 150 terawatt-hours (TWh) of electricity per year, more than two orders of magnitude above the annual energy consumption of the largest AI training cluster currently conceived (100,000 H100s running at full capacity for a year). For reference, OpenAI’s GPT-4 was trained on 20,000 A100s, and Meta’s flagship Llama 405B model was trained on 16,000 H100s. Similarly, at its peak, Ethereum consumed about 70 TWh per year, spread across millions of GPUs. Even accounting for the rapid growth of AI data centers in the coming years, incentivized computing networks like these would still exceed their scale many times over.
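A quick back-of-the-envelope check on that comparison, counting only GPU board power and ignoring cooling and other datacenter overhead (which would add tens of percent):

```python
gpus = 100_000
watts_per_gpu = 700                          # H100 SXM board power
hours_per_year = 24 * 365
cluster_twh = gpus * watts_per_gpu * hours_per_year / 1e12   # Wh -> TWh
print(f"cluster: ~{cluster_twh:.2f} TWh/year")               # ~0.61 TWh
print(f"Bitcoin multiple: ~{150 / cluster_twh:.0f}x")        # roughly 250x
```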

Of course, not all computation is fungible, and training has unique requirements relative to mining that need to be considered. Nonetheless, these networks demonstrate the scale that can be achieved through these mechanisms.

The road ahead

Tying these pieces together, we can see the beginnings of a new path forward.

Soon, new training techniques will allow us to scale beyond the limits of data centers, as devices no longer need to be co-located to function. This will take time, as our current decentralized training methods are still relatively small, mostly in the 1 billion to 2 billion parameter range, much smaller than models like GPT-4. We’ll need further breakthroughs to increase the scale of these methods without sacrificing key properties like communication efficiency and fault tolerance. Alternatively, we’ll need new model architectures that are different from today’s large monolithic models—perhaps smaller, more modular, and run on edge devices rather than in the cloud.

Regardless, it’s reasonable to expect further progress in this direction. The costs of our current approaches are unsustainable, providing a strong market incentive for innovation. We’re already seeing this trend with manufacturers like Apple building more powerful edge devices to run more workloads locally, rather than relying on the cloud. We’re also seeing growing support for open source solutions — even within companies like Meta — to foster more decentralized research and development. These trends will only accelerate over time.

At the same time, we also need new network infrastructure to connect edge devices so that they can be used in this way. These devices include laptops, gaming desktops, and eventually even mobile phones with high-performance graphics cards and large memory. This will enable us to build a "global cluster" of low-cost, always-on computing power that can handle training tasks in parallel. This is also a challenging problem that requires progress in multiple areas.

We need better scheduling techniques for training in heterogeneous environments. There is currently no way to automatically parallelize a model for optimal training across such devices, especially when they can connect or disconnect at any moment. This is a key next step for optimizing training while retaining the scale advantages of edge-based networks.

We must also contend with the general complexity of decentralized networks. To maximize scale, the network should be built as an open protocol — a set of standards and instructions that dictate the interactions between participants, like TCP/IP but for machine learning computations. This would enable any device that follows a specific specification to connect to the network, regardless of ownership and location. It also ensures that the network remains neutral, allowing users to train the models they like.

While this maximizes scale, it also requires a mechanism to verify the correctness of all training tasks without relying on a single entity. This is critical because there are inherent incentives to cheat—for example, claiming to have completed a certain training task in order to get paid, but not actually doing so. This is particularly challenging given that different devices often perform machine learning operations in different ways, making it difficult to verify correctness using standard replication techniques. Correctly solving this problem requires deep research in cryptography and other disciplines.

Fortunately, we continue to see progress on all of these fronts. The challenges no longer look insurmountable as they did in years past. And they look small next to the opportunity. Google summed it up best in their DiPaCo paper, pointing to the feedback loop that decentralized training has the potential to break:

Advances in distributed training of machine learning models may lead to simplified infrastructure construction, ultimately leading to more widespread availability of computing resources. Currently, infrastructure is designed around standard methods for training large monolithic models, while machine learning model architectures are designed to take advantage of current infrastructure and training methods. This feedback loop may lead the community into a misleading local minimum, where computing resources are more limited than actually needed.

Perhaps most exciting is the growing enthusiasm in the research community for solving these problems. Our team at Gensyn is building the network infrastructure described above. Teams like Hivemind and BigScience are applying many of these techniques in practice. Projects like Petals, sahajBERT, and BLOOM demonstrate the power of these techniques, as well as the growing interest in community-based machine learning. Many others are driving the research forward too, with the goal of building a more open and collaborative ecosystem for model training. If you are interested in this work, please contact us to get involved.