Multi Token Prediction Increases AI Model Speed Three Times, Says Meta

Cryptopolitan · 2024-05-07T05:13:07.000Z

Training language models to predict multiple tokens at once results in better sample efficiency, say researchers at Meta. Large language models such as Llama and ChatGPT are usually trained for next-token prediction, but the new approach can deliver better performance.

What is the single-token prediction technique?

The multi-token prediction technique provides a significant edge in some scenarios, running generative tasks up to three times faster, but it is not a one-size-fits-all solution for every type of model. The technique still has considerable room for improvement, and for some LLM applications it could become a robust tool.

For a clearer understanding: the traditional process for LLM training uses an approach called “next-token prediction,” in which the model predicts only the single next token in a given sequence. The predicted token is then automatically added to the input, and the process is repeated across the entire text provided, so that the model learns common patterns and develops the ability to produce logical, consistent text.

This technique has drawbacks. Because it processes only the next token, the model becomes too focused on local patterns in the text and overlooks predictions that can only be made with reasoning. It also requires huge amounts of data to reach the fluency in language that humans achieve with very little text.

Multi-token prediction enables 3X speed

In the new multi-token approach proposed by Meta, the LLM is trained to predict multiple tokens from different positions at the same time. The researchers used a simple prediction architecture for multi-token prediction that does not require extra resources such as training time or memory.
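To make the difference between the two training objectives concrete, here is a minimal sketch of how training targets could be built from one tokenized sentence. The function names and the toy sentence are illustrative assumptions, not code from Meta's study.

```python
def next_token_targets(tokens):
    """Pair each prefix with the single token that follows it."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


def multi_token_targets(tokens, n):
    """Pair each prefix with the next n tokens.

    Prefixes with fewer than n tokens remaining are skipped
    (a simplifying assumption made for this sketch).
    """
    return [
        (tokens[: i + 1], tokens[i + 1 : i + 1 + n])
        for i in range(len(tokens) - n)
    ]


tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Next-token prediction: one target per prefix.
for prefix, target in next_token_targets(tokens):
    print(prefix, "->", target)

# Multi-token prediction with n = 3: three targets per prefix.
for prefix, targets in multi_token_targets(tokens, 3):
    print(prefix, "->", targets)
```

With next-token prediction, every prefix is matched with exactly one target; with multi-token prediction, the same prefix has to account for several future tokens at once.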
The researchers used the same Transformer architecture that most LLMs already use, but modified it to accommodate multi-token prediction by increasing its output heads from one to several and allocating one head to each predicted token. For drawing conclusions and making predictions, the model still uses the same basic next-token prediction strategy, but the multiple heads let it speed up the process. The study says, “While cost-free and simple, multi-token prediction is an effective modification to train stronger and faster transformer models.”

The researchers found that the technique produced subpar results on smaller models, but the results became better than average on larger models and kept improving with model size. As the study puts it, “The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points.”

The researchers also said that multi-token prediction makes the model up to three times faster at producing results, at little or no extra cost.
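The multiple-output-head idea can be illustrated with a small sketch in plain Python. It assumes a shared hidden vector (standing in for the Transformer's output at one position) feeding several independent linear heads, with the training loss taken as the sum of each head's cross-entropy. All names, shapes, and weights below are hypothetical and chosen for illustration; this is not Meta's implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, hidden):
    """One logit per vocabulary entry: dot product of each weight row with hidden."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def multi_token_loss(hidden, heads, targets):
    """Sum the cross-entropy of every output head.

    hidden  -- shared representation for one position (list of floats)
    heads   -- one weight matrix per head; head k predicts the token k+1 steps ahead
    targets -- one target token id per head
    """
    assert len(heads) == len(targets)
    loss = 0.0
    for weights, target in zip(heads, targets):
        probs = softmax(linear(weights, hidden))
        loss += -math.log(probs[target])
    return loss


# Toy setup (assumed): hidden size 2, vocabulary of 3 tokens, 2 heads.
hidden = [1.0, 2.0]
heads = [
    [[0.5, 0.1], [0.0, 0.3], [-0.2, 0.4]],  # head for the next token
    [[0.1, 0.0], [0.2, -0.1], [0.3, 0.2]],  # head for the token after that
]
targets = [1, 2]
print(multi_token_loss(hidden, heads, targets))
```

Because every head reads the same hidden vector, the extra heads add only their own output layers on top of the shared model, which is consistent with the article's point that the change is simple.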

