Multi Token Prediction Increases AI Model Speed Three Times, Says Meta

Cryptopolitan · 2024-05-07T05:13:07.000Z

Training language models to predict multiple tokens at once results in better sample efficiency, say researchers at Meta. Large language models (LLMs) such as Llama and ChatGPT are usually trained on next-token prediction, but the new approach can deliver better performance.

What is the single-token prediction technique?

The multi-token prediction technique offers a significant edge in some scenarios, running up to three times faster on generative tasks, but it is not a one-size-fits-all solution for every type of model. The technique still has room for improvement, and for some LLM applications it could become a robust tool.

Traditional LLM training uses an approach called “next-token prediction,” in which the model predicts only the single next token in a given sequence. The process is automated: the sequence is extended by one token and the prediction is repeated over the entire text provided, so that the model learns common patterns and develops the ability to produce logical, consistent text.

This technique has drawbacks. By processing only the next token, the model becomes too focused on local patterns in the text and overlooks predictions that can only be made through reasoning. It also requires huge amounts of training data to reach a fluency that humans achieve after seeing far less text.

Multi-token prediction enables 3X speed

In the new multi-token approach proposed by Meta, the LLM is trained to predict several future tokens at once from each position in the training text. The researchers used a simple prediction architecture for multi-token prediction that does not require extra training time or memory.
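The difference between the two training setups can be illustrated with a minimal sketch of how the targets are built. This is not code from the Meta study: the function name `build_targets`, the parameter `n_future`, and the word-level "tokens" are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Example = Tuple[List[str], List[str]]


def build_targets(tokens: List[str], n_future: int = 1) -> List[Example]:
    """Pair each context prefix with the next `n_future` tokens.

    n_future=1 corresponds to ordinary next-token prediction;
    n_future>1 corresponds to the multi-token setup, where each
    position is asked for the following n tokens at once.
    (Hypothetical helper, not the study's implementation.)
    """
    examples = []
    # Stop early enough that every example has a full set of targets.
    for t in range(1, len(tokens) - n_future + 1):
        context = tokens[:t]
        targets = tokens[t:t + n_future]
        examples.append((context, targets))
    return examples


tokens = ["the", "cat", "sat", "on", "the", "mat"]
next_token = build_targets(tokens, n_future=1)
multi_token = build_targets(tokens, n_future=3)

print(next_token[0])   # (['the'], ['cat'])
print(multi_token[0])  # (['the'], ['cat', 'sat', 'on'])
```

With `n_future=3`, the same six-token text yields fewer examples, but each one carries three targets instead of one.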
The researchers used the same Transformer architecture that most LLMs already use, but modified it to accommodate multi-token prediction by increasing the number of output heads from one to several and allocating one head to each predicted token. For inference, the model still uses the same basic next-token prediction strategy, but the additional heads can be used to speed up the process. The study says, “While cost-free and simple, multi-token prediction is an effective modification to train stronger and faster transformer models.”

The researchers found that the technique produced subpar results on smaller models, but it outperformed standard training when applied to larger models, and the results kept improving with model size. As the study puts it, “The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points.”

The researchers also said that multi-token prediction makes the model up to three times faster at inference, at little or no extra cost.
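The idea of one shared model body feeding several output heads can be sketched in plain Python. Everything here is an illustrative assumption rather than the study's implementation: the class `MultiTokenModel` is hypothetical, the "trunk" is a single random matrix standing in for a full Transformer (it sees only the last token, not the whole context), each head is a plain linear layer, and the weights are random rather than trained.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def log_softmax(logits: Vector) -> Vector:
    top = max(logits)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]


class MultiTokenModel:
    """Toy stand-in: one shared trunk, n_heads independent output heads."""

    def __init__(self, vocab_size: int, hidden_size: int, n_heads: int, seed: int = 0):
        rng = random.Random(seed)

        def rand_matrix(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.vocab_size = vocab_size
        self.trunk = rand_matrix(hidden_size, vocab_size)  # placeholder for the Transformer body
        self.heads = [rand_matrix(vocab_size, hidden_size) for _ in range(n_heads)]

    def hidden(self, token_id: int) -> Vector:
        one_hot = [1.0 if i == token_id else 0.0 for i in range(self.vocab_size)]
        return [math.tanh(h) for h in matvec(self.trunk, one_hot)]

    def log_probs(self, token_id: int) -> List[Vector]:
        h = self.hidden(token_id)  # computed once, shared by every head
        return [log_softmax(matvec(head, h)) for head in self.heads]

    def loss(self, token_id: int, future_ids: List[int]) -> float:
        # Sum of cross-entropy terms: head k is scored on the token k+1 steps ahead.
        per_head = self.log_probs(token_id)
        return -sum(lp[target] for lp, target in zip(per_head, future_ids))


model = MultiTokenModel(vocab_size=5, hidden_size=4, n_heads=3)
per_head = model.log_probs(token_id=2)
training_loss = model.loss(token_id=2, future_ids=[0, 3, 1])
next_token_only = per_head[0]  # the first head alone gives ordinary next-token prediction
print(len(per_head), training_loss > 0)
```

In this sketch the trunk output is computed once and reused by all heads, which is why adding heads costs little compared with running the model several times. At inference, the first head can be used on its own, and the remaining heads are available as extra guesses about later tokens.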

