The "token"-based language processing process of the new generation AI model is revealing many limitations, posing a major barrier to the development of this field.

Generative AI models, from the compact Gemma to the advanced GPT-4, are built on the transformer architecture. Instead of processing raw text the way humans do, transformers encode data into smaller units called “tokens.” Tokens can be words, subword fragments, or even individual characters. This process, called tokenization, lets the model ingest text more efficiently, but it also introduces a number of limitations.
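
To make this concrete, here is a minimal sketch using OpenAI's open-source tiktoken library (chosen purely for illustration; the models named in this article each ship their own tokenizers, and the exact splits differ between vocabularies):

```python
# pip install tiktoken  (OpenAI's open-source tokenizer library)
import tiktoken

# cl100k_base is one BPE vocabulary; other models use different ones.
enc = tiktoken.get_encoding("cl100k_base")

text = "once upon a time"
ids = enc.encode(text)                      # text -> list of integer token IDs
pieces = [enc.decode_single_token_bytes(i) for i in ids]

print(ids)      # the integer IDs the transformer actually sees
print(pieces)   # the byte strings each ID stands for
assert enc.decode(ids) == text              # decoding round-trips the text
```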

One of the main challenges is inconsistency in how text maps to tokens. For example, a model may split “once upon a time” into “once”, “upon”, “a”, “time”, while “once upon a ” (with a trailing space) is split into “once”, “upon”, “a”, “ ”, where the final token is a bare space. This makes it harder for the model to grasp the context and true meaning of a sentence, leading to inaccurate results.
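
A quick way to see this inconsistency is to compare the token IDs for the same phrase with and without a trailing space (again with tiktoken as a stand-in; the specific splits vary by tokenizer):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

without_space = enc.encode("once upon a")
with_space = enc.encode("once upon a ")

# The two ID sequences differ even though the visible words are identical:
# the trailing space either becomes its own token or merges into a neighbor,
# depending on the vocabulary.
print(without_space)
print(with_space)
```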

Furthermore, the distinction between uppercase and lowercase letters also makes a significant difference. To the model, “Hello” and “HELLO” can be encoded as two completely unrelated concepts. It is this quirk in how tokens are encoded that causes many AI models to fail simple capitalization tests.
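
The effect is easy to verify: different casings of the same word map to token IDs that share nothing, so any relationship between them must be learned from data rather than read off the encoding (tiktoken used for illustration):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ("hello", "Hello", "HELLO", "HeLLo"):
    # Each casing may map to one token or several, with unrelated IDs.
    print(word, "->", enc.encode(word))
```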

According to Sheridan Feucht, a doctoral student at Northeastern University, there is no such thing as a “perfect token.” Language itself inherently contains many complex elements, and determining which is the optimal semantic unit for encoding is still a difficult problem.

The problem becomes even worse when languages other than English are considered. Many current tokenizers assume spaces separate words, which does not hold for languages such as Chinese, Japanese, and Korean. According to a 2023 Oxford University study, inefficient language encoding can make an AI model take twice as long to complete a task as it would in English.

Users of these “token-inefficient” languages are also likely to face poorer AI performance and higher costs, since many providers charge by the number of tokens.

Research, also from 2023, by Yennie Jun, an AI researcher at Google DeepMind, showed that some languages need up to 10 times more tokens than English to convey the same meaning. This is a clear sign of linguistic inequality in the AI field.
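
The disparity can be measured directly by tokenizing roughly equivalent sentences in different languages and comparing lengths. The sketch below uses tiktoken and my own approximate translations of one sentence; the exact ratios depend heavily on the tokenizer and the text, so treat the numbers as indicative only:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Approximate translations of "The weather is nice today."
samples = {
    "English":  "The weather is nice today.",
    "Chinese":  "今天天气很好。",
    "Japanese": "今日はいい天気ですね。",
    "Korean":   "오늘은 날씨가 좋네요.",
}

for lang, sentence in samples.items():
    # Languages under-represented in the tokenizer's training data
    # tend to shatter into many more tokens per sentence.
    print(f"{lang}: {len(enc.encode(sentence))} tokens")
```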

In addition, tokenization is also cited as the reason current AI models struggle with mathematics. Lacking any real understanding of numbers, the tokenizer may treat “380” as a single token but represent “381” as a pair (“38” and “1”), destroying the relationship between the digits and confusing the transformer.

Inconsistent encoding of numbers makes it difficult for the model to grasp the relationships between digits in equations and mathematical formulas.
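
This irregularity is easy to observe by tokenizing consecutive integers. In the sketch below (tiktoken again, purely as an illustrative tokenizer), whether a number survives as one token or splinters into pieces is an accident of the learned vocabulary:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for n in range(378, 384):
    ids = enc.encode(str(n))
    pieces = [enc.decode_single_token_bytes(i).decode() for i in ids]
    # Adjacent numbers can split completely differently, so the model
    # sees no consistent positional structure in the digits.
    print(n, "->", pieces)
```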

“We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.”

— Andrej Karpathy (@karpathy) February 20, 2024

Despite these challenges, scientists are actively researching possible solutions. “Byte-level” state space models such as MambaByte, which process raw data directly as bytes, show strong potential for handling linguistic “noise” and analyzing text more efficiently. However, MambaByte and similar models are still in the early research stages.
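
The idea behind byte-level models is that the input is simply the UTF-8 byte stream: a fixed alphabet of 256 values, with no learned vocabulary to go wrong. A minimal sketch of the difference between the two representations:

```python
import tiktoken

text = "Hello, 世界"

# Subword tokenization: IDs drawn from a learned vocabulary (~100k entries).
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode(text))

# Byte-level input, as used by models like MambaByte: the raw UTF-8 bytes.
# Every possible text is covered by the same 256 values.
print(list(text.encode("utf-8")))
```

The trade-off is sequence length: the byte stream is several times longer than the equivalent token sequence, which is precisely the computational cost that makes this approach hard for transformers.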

According to Sheridan Feucht, “Completely eliminating tokenization is a possible path, but currently it is computationally infeasible for transformers.”

The emergence of new model architectures could hold the key to a breakthrough on the tokenization problem. In the meantime, researchers continue to search for ways to optimize tokenization for different languages, working toward a future in which AI can understand and process language naturally and effectively.