Original title: "Can AI survive in the encrypted world: 18 large-scale model encryption experiments"
Original author: Wang Chao, Empower Labs
In the annals of technological progress, revolutionary technologies often emerge independently, each leading the transformation of an era. When two revolutionary technologies meet, their collision often has an exponential impact. Today, we are standing at such a historic moment: artificial intelligence and encryption technology, two equally disruptive new technologies, are stepping into the center of the stage hand in hand.
We imagine that many challenges in the field of AI can be solved by encryption technology; we expect AI Agents to build autonomous economic networks and promote the large-scale adoption of encryption technology; we also hope that AI can accelerate the development of existing scenarios in the field of encryption. Countless eyes are focused on this, and a huge amount of funds are pouring in. Like any buzzword, it embodies people's desire for innovation, their vision for the future, and their irrepressible ambition and greed.
Yet, amid all the noise, we know very little about the most basic questions. How well does AI really understand encryption? Do agents equipped with large language models actually have the ability to use encryption tools? How do different models differ on encryption tasks?
The answers to these questions will determine the mutual influence of AI and encryption technology, and are also crucial to the product direction and technology route selection in this cross-field. In order to explore these issues, I conducted some evaluation experiments on large language models. By evaluating their knowledge and capabilities in the field of encryption, I measured the level of AI encryption application and judged the potential and challenges of the integration of AI and encryption technology.
Conclusion first
The large language model performs well in cryptography and blockchain basics, and has a good understanding of the crypto ecosystem, but performs poorly in mathematical calculations and complex business logic analysis. In terms of private keys and basic wallet operations, the model has a satisfactory foundation, but faces the severe challenge of how to store private keys in the cloud. Many models can generate effective smart contract code for simple scenarios, but cannot independently perform difficult tasks such as contract auditing and complex contract creation.
Commercial closed-source models are generally ahead. In the open-source camp only Llama 3.1-405B performs well, while all open-source models with smaller parameter sizes fail. However, there is potential: with prompt guidance, chain-of-thought reasoning, and few-shot learning, the performance of all models improves greatly. The leading models already show strong technical feasibility in some vertical application scenarios.
Experimental details
18 representative language models were selected for evaluation, including:
Closed-source models: GPT-4o, GPT-4o Mini, Claude 3.5 Sonnet, Gemini 1.5 Pro, Grok2 beta (temporarily closed-source)
Open source models: Llama 3.1 8B/70B/405B, Mistral Nemo 12B, DeepSeek-coder-v2, Nous-hermes2, Phi3 3.8B/14B, Gemma2 9B/27B, Command-R
Mathematical optimization models: Qwen2-math-72B, MathΣtral
These models cover mainstream commercial and popular open source models, with a parameter size ranging from 3.8B to 405B, a span of more than 100 times. Considering the close relationship between encryption technology and mathematics, the experiment also specially selected two mathematical optimization models.
The knowledge areas covered by the experiment include cryptography, blockchain basics, private key and wallet operations, smart contracts, DAO and governance, consensus and economic models, Dapp/DeFi/NFT, on-chain data analysis, etc. Each area consists of a series of questions and tasks from easy to difficult, which not only tests the knowledge reserve of the model, but also tests its performance in the application scenario through simulation tasks.
The tasks come from several sources: some are based on input from multiple experts in the encryption field, while others were generated with the assistance of AI and then manually proofread to ensure they are accurate and challenging. Some tasks use simple multiple-choice questions so that they can be tested and scored automatically in a standardized way. Other experiments use more complex question forms, and are evaluated through a combination of program automation, manual review, and AI. All test tasks are evaluated zero-shot: no examples, chain-of-thought guidance, or directive prompts are provided.
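As an illustration of how the multiple-choice portion could be scored automatically, here is a minimal sketch. It is not the author's test harness: the function names `extract_choice` and `accuracy`, the A-D answer format, and the parsing heuristic are all assumptions made for this example.

```python
import re
from typing import Optional, Sequence

# Hypothetical heuristic: look for "answer is X" / "Answer: X" with an
# uppercase option letter. The word "answer" is matched case-insensitively.
ANSWER_PATTERN = re.compile(r"(?i:answer\s*(?:is|:)?\s*)\(?([A-D])\b")
# Fallback: the whole response is just a letter such as "B", "(B)" or "B."
BARE_PATTERN = re.compile(r"\(?([A-D])\)?\.?")


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter found in a model response, or None."""
    match = ANSWER_PATTERN.search(response)
    if match:
        return match.group(1)
    bare = BARE_PATTERN.fullmatch(response.strip())
    return bare.group(1) if bare else None


def accuracy(responses: Sequence[str], answer_key: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter matches the key."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    correct = sum(extract_choice(r) == k for r, k in zip(responses, answer_key))
    return correct / len(answer_key)
```

A response that cannot be parsed counts as wrong here, which is one way a crude scorer can underestimate a model; a real harness would likely need stricter output formatting or manual review of unparsed answers.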
Since the experiment itself is still relatively crudely designed and lacks sufficient academic rigor, the questions and tasks used for testing are far from covering the entire encryption field, and the testing framework is not mature. Therefore, this article does not list specific experimental data, but focuses on sharing some insights from the experiment.
Knowledge/Concept
During the evaluation, the large language models performed well in basic knowledge tests across fields such as encryption algorithms, blockchain basics, and DeFi applications. For example, on an open-ended question testing understanding of the concept of data availability, all models gave accurate answers. On a question evaluating mastery of the Ethereum transaction structure, the models' answers differed slightly in detail but generally contained the correct key information. The multiple-choice questions on concepts were even easier, with almost all models scoring above 95% accuracy.
Conceptual questions and answers are not at all difficult for large models.
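For readers unfamiliar with the Ethereum transaction structure mentioned above, the fields of a legacy (pre-EIP-1559) transaction can be sketched as a small data class. This is only an illustration of the concept being tested, not the question used in the experiment; the class name `LegacyTransaction`, the helper `max_fee_wei`, and the sample values are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegacyTransaction:
    """Fields of a legacy Ethereum transaction (illustrative only)."""

    nonce: int        # number of transactions already sent by the sender
    gas_price: int    # price per unit of gas, in wei
    gas_limit: int    # maximum gas the transaction may consume
    to: str           # recipient address (empty for contract creation)
    value: int        # amount of ether to transfer, in wei
    data: bytes       # call data or contract init code
    v: int = 0        # signature components; 0 here means "not yet signed"
    r: int = 0
    s: int = 0

    def max_fee_wei(self) -> int:
        """Upper bound on the fee the sender can be charged."""
        return self.gas_price * self.gas_limit


# Sample values: a plain transfer of 1 ether at 20 gwei per gas.
tx = LegacyTransaction(
    nonce=0,
    gas_price=20 * 10**9,
    gas_limit=21_000,          # intrinsic gas cost of a simple ETH transfer
    to="0x" + "00" * 20,       # placeholder address
    value=10**18,
    data=b"",
)
```

Newer transaction types add fields such as a chain ID and separate maximum fee and priority fee values, which this sketch deliberately leaves out.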
Calculation/Business Logic
However, the situation is reversed when it comes to questions that require specific calculations. A simple RSA calculation question stumped most models. This is not hard to understand: large language models mainly operate by identifying and reproducing patterns in training data, rather than by deeply understanding the essence of mathematical concepts. This limitation is particularly evident with abstract operations such as modular arithmetic and exponentiation. Given how closely the field of encryption is tied to mathematics, it is unreliable to depend directly on models for encryption-related calculations.
In other calculation questions, the performance of large language models was also unsatisfactory. For example, on a simple question about calculating the impermanent loss of an AMM, which involves no complex mathematical operations, only 4 of the 18 models gave the correct answer. On another, more basic question about calculating the probability of a block, not a single model answered correctly. This not only exposes the shortcomings of large language models in precise calculation, but also shows that they have major problems with business logic analysis. It is worth noting that even the mathematical optimization models showed no obvious advantage on calculation questions, and their performance was disappointing.
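To show how little arithmetic an impermanent loss question actually involves, here is a minimal sketch. It assumes a constant-product pool with two assets in equal value and ignores trading fees; the actual question used in the experiment is not published, so the function name `impermanent_loss` and the sample price ratios are assumptions for this example.

```python
import math


def impermanent_loss(price_ratio: float) -> float:
    """Impermanent loss of a 50/50 constant-product pool, fees ignored.

    price_ratio is the new price of one asset divided by its price at
    deposit time. The result is the pool position's value relative to
    simply holding the two assets, minus 1 (so it is zero or negative).
    """
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return 2 * math.sqrt(price_ratio) / (1 + price_ratio) - 1


# If one asset quadruples in price relative to the other, the liquidity
# provider ends up with 20% less than by simply holding.
loss_when_price_quadruples = impermanent_loss(4.0)
```

The formula follows from the pool invariant x * y = k: after the price moves by a factor r, the pool position is worth 2 * sqrt(r) in units where holding is worth 1 + r.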
However, the problem of mathematical calculation is not unsolvable. If we make a slight adjustment and ask LLMs to produce the corresponding Python code instead of calculating the result directly, accuracy improves greatly. Taking the RSA calculation problem above as an example, the Python code given by most models runs without errors and produces the correct result. In a real production environment, the model's own calculation step can be bypassed entirely by supplying preset algorithm code, which is similar to how humans handle such tasks. At the business logic level, model performance can also be improved effectively through carefully designed prompts.
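The kind of code a model might produce for a textbook RSA question looks roughly like the sketch below. The numbers are a common small teaching example, not the question used in the experiment, and the function names are assumptions for this illustration. Keys this small are for demonstration only and offer no security.

```python
def rsa_keypair(p: int, q: int, e: int) -> tuple[int, int, int]:
    """Return (n, e, d) for primes p, q and public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    # Modular inverse via three-argument pow (Python 3.8+).
    d = pow(e, -1, phi)
    return n, e, d


def rsa_encrypt(message: int, e: int, n: int) -> int:
    return pow(message, e, n)


def rsa_decrypt(ciphertext: int, d: int, n: int) -> int:
    return pow(ciphertext, d, n)


n, e, d = rsa_keypair(p=61, q=53, e=17)   # n = 3233, d = 2753
ciphertext = rsa_encrypt(65, e, n)        # 65^17 mod 3233 = 2790
recovered = rsa_decrypt(ciphertext, d, n) # back to 65
```

The point is that the interpreter, not the model, performs the modular exponentiation, so the result no longer depends on the model's arithmetic.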
Private key management and wallet operations
If you ask what is the first scenario for Agents to adopt cryptocurrency, my answer is payment. Cryptocurrency can almost be regarded as a native form of currency for AI. Compared with the many obstacles Agents face in the traditional financial system, using encryption technology to equip themselves with digital identities and manage funds through encrypted wallets is a natural choice. Therefore, the generation and management of private keys and various wallet operations constitute the most basic skill requirements for Agents to use encrypted networks autonomously.
The key to securely generating private keys lies in high-quality random numbers, which is obviously a capability that large language models do not have. However, the models have sufficient knowledge of private key security. When asked to generate private keys, most models choose to use code (such as Python's related libraries) to guide users to generate private keys independently. Even if a model directly gives a private key, it clearly states that this is only for demonstration purposes and is not a secure private key that can be used directly. In this regard, all large models have shown satisfactory performance.
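The code-based approach the models tend to suggest can be sketched as follows. This example assumes a secp256k1 key (the curve used by Bitcoin and Ethereum) and uses only Python's standard `secrets` module; the function names are assumptions for this illustration, and a production wallet would normally rely on an audited library rather than hand-written code.

```python
import secrets

# Order of the secp256k1 group; valid private keys lie in [1, order - 1].
SECP256K1_ORDER = int(
    "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141", 16
)


def generate_private_key() -> int:
    """Draw a private key from the operating system's CSPRNG."""
    # randbelow(order - 1) yields [0, order - 2]; adding 1 shifts the
    # range to [1, order - 1], so zero is never returned.
    return secrets.randbelow(SECP256K1_ORDER - 1) + 1


def to_hex(private_key: int) -> str:
    """Render the key as a 64-character hex string (32 bytes)."""
    return f"{private_key:064x}"


key = generate_private_key()
key_hex = to_hex(key)
```

The randomness comes from the local machine running the code, which is exactly why a model that prints a key directly in its response cannot offer the same guarantee.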
Private key management faces some challenges, which stem mainly from the inherent limitations of the technical architecture rather than from a lack of model capability. When a locally deployed model is used, the generated private key can be considered relatively secure. If a commercial cloud model is used, however, we must assume that the private key is exposed to the model's operator at the moment of generation. Yet an agent whose goal is to work independently needs access to the private key, which means the key cannot reside only on the user's local machine. In this case, relying on the model alone is not enough to keep the private key secure, and additional security services such as a trusted execution environment or an HSM need to be introduced.
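One way to picture this separation is an agent that holds only a handle to a signing service and never the key itself. The sketch below is a toy stand-in, not a real TEE or HSM integration: the class name `IsolatedSigner`, the spending-limit policy, and the use of HMAC-SHA256 in place of a real transaction signature are all assumptions made for illustration.

```python
import hashlib
import hmac
import secrets


class IsolatedSigner:
    """Hypothetical stand-in for a TEE- or HSM-backed signing service."""

    def __init__(self, spend_limit: int) -> None:
        # The key is created inside the signer and is never returned to
        # callers, so it never appears in a model prompt or response.
        self._key = secrets.token_bytes(32)
        self._spend_limit = spend_limit

    def sign_payment(self, recipient: str, amount: int) -> str:
        # Policy is enforced on the signer side, not by the model.
        if amount <= 0 or amount > self._spend_limit:
            raise PermissionError("payment rejected by signer policy")
        message = f"{recipient}:{amount}".encode()
        # HMAC-SHA256 stands in for a real signature scheme such as ECDSA.
        return hmac.new(self._key, message, hashlib.sha256).hexdigest()


# The agent proposes a payment; the signer decides whether to authorize it.
signer = IsolatedSigner(spend_limit=100)
authorization = signer.sign_payment("alice", 50)
```

With this shape, a compromised or mistaken model can at worst request payments within the policy limits; it cannot leak a key it never sees.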
Assuming that the Agent already holds the private key securely, the various models in the test have shown good capabilities when performing various basic operations on this basis. Although the output steps and codes often have errors, these problems can be largely solved under the appropriate engineering architecture. It can be said that from a technical perspective, there are not many obstacles for the Agent to perform basic wallet operations autonomously.
Smart Contracts
The ability to understand, utilize, write and identify risks of smart contracts is the key for AI Agents to perform complex tasks in the on-chain world, and is therefore also a key testing area for experiments. Large language models have shown significant potential in this area, but they have also exposed some obvious problems.
In the test, almost all models were able to correctly answer basic contract concepts and identify simple bugs. In terms of contract gas optimization, most models were able to identify key optimization points and analyze possible conflicts caused by optimization. However, when it comes to deep business logic, the limitations of large models begin to emerge.
Take a token vesting contract as an example: all models correctly understood the contract function, and most models found several low- to medium-risk vulnerabilities. However, no model was able to autonomously discover a high-risk vulnerability hidden in the business logic that could cause some funds to be locked up under special circumstances. In multiple tests using real contracts, the models performed roughly the same.
This shows that large models' understanding of contracts still remains at the formal level and lacks a grasp of the underlying business logic. However, after being given additional prompts, some models were eventually able to find the deeper vulnerability hidden in the contract above on their own. Based on this performance, with good engineering design support, large models are basically capable of serving as a co-pilot in the field of smart contracts. However, there is still a long way to go before they can independently undertake important tasks such as contract auditing.
One thing to note is that the code-related tasks in the experiment mainly involve contracts with simple logic and fewer than 2,000 lines of code. Larger and more complex projects, absent fine-tuning or elaborate prompt engineering, are in my view clearly beyond the effective processing capacity of current models and were not included in the test. In addition, this test only involves Solidity and does not cover other smart contract languages such as Rust and Move.
In addition to the above test contents, the experiment also covers DeFi scenarios, DAO and its governance, on-chain data analysis, consensus mechanism design, and Tokenomics. The large language model has demonstrated certain capabilities in these aspects. Given that many tests are still in progress and the testing methods and frameworks are being continuously optimized, this article will not discuss these areas in depth for the time being.
Differences in models
Among all the large language models participating in the evaluation, GPT-4o and Claude 3.5 Sonnet continued their outstanding performance in other fields and are the undisputed leaders. When faced with basic questions, these two models can almost always give accurate answers; in complex scenario analysis, they can provide in-depth and well-argued insights. They even showed a high success rate in computing tasks that large models are not good at. Of course, this "high" success rate is relative and has not yet reached the level of stable output in a production environment.
In the open source model camp, Llama 3.1-405B is far ahead of its peers thanks to its large parameter scale and advanced model algorithms. In other open source models with smaller parameter sizes, there is no significant performance gap between the models. Although the scores are slightly different, overall they are far from the passing line.
Therefore, if you want to build encryption-related AI applications, these models with small and medium parameters are not suitable choices.
Two models particularly stood out in our review. The first is the Phi-3 3.8B model launched by Microsoft. It is the smallest model in this experiment, yet with less than half the parameters it reaches a performance level comparable to models in the 8B-12B range, and on some specific categories of questions it performs even better. This result highlights the importance of model architecture optimization and training strategy, rather than relying solely on increases in parameter size.
The Command-R model from Cohere turned out to be an unexpected "dark horse", but in reverse. Command-R is not as well-known as other models, but Cohere is a large model company focused on the enterprise (B2B) market. I thought this focus fit well with areas such as Agent development, so I specifically included it in the test scope. However, Command-R, with 35B parameters, ranked last in most tests and was not as good as many models below 10B.
This result is thought-provoking: when Command-R was released, it was promoted mainly for its retrieval-augmented generation (RAG) capability, and even its results on regular benchmarks were not announced. Does this mean that it is a "special key" that can only unlock its full potential in specific scenarios?
Experimental limitations
In this series of tests, we have gained a preliminary understanding of AI's capabilities in the field of encryption. Of course, these tests are far from reaching professional standards. The coverage of the data set is far from enough, the quantitative standards of the answers are relatively rough, and there is still a lack of a sophisticated and more accurate scoring mechanism, which will affect the accuracy of the evaluation results and may lead to underestimation of the performance of some models.
In terms of testing methods, the experiment used only zero-shot evaluation and did not explore methods such as chain-of-thought and few-shot learning that could bring out more of a model's potential. In terms of model parameters, the experiments all used the default settings and did not examine the impact of different parameter settings on model performance. Relying on a single testing method limited our overall evaluation of the models' potential and left the performance differences between models under specific conditions unexplored.
Despite the relatively crude testing conditions, these experiments still produced a lot of valuable insights and provided a reference for developers to build applications.
The crypto space needs its own benchmark
In the field of AI, benchmarks play a key role. The rapid development of modern deep learning is often traced to ImageNet, a standardized benchmark and dataset for computer vision created by a team led by Professor Fei-Fei Li and first published in 2009, and to the 2012 ImageNet challenge, where a deep neural network won by a wide margin.
By providing a unified evaluation standard, benchmarks not only provide developers with clear goals and reference points, but also promote technological progress throughout the industry. This explains why every newly released large language model will focus on publishing its performance on various benchmarks. These results have become a "common language" for model capabilities, enabling researchers to locate breakthroughs, developers to choose the model that best suits a specific task, and users to make informed choices based on objective data. More importantly, benchmarks often indicate the future direction of AI applications, guiding resource investment and research focus.
If we believe that the intersection of AI and cryptography holds great potential, then establishing a dedicated cryptography benchmark becomes an urgent task. The establishment of a benchmark may become a key bridge connecting the two fields of AI and cryptography, catalyzing innovation and providing clear guidance for future applications.
However, compared with mature benchmarks in other fields, building benchmarks in the encryption field faces unique challenges: encryption technology is evolving rapidly, the industry knowledge system has not yet been solidified, and there is a lack of consensus on multiple core directions. As an interdisciplinary field, encryption covers cryptography, distributed systems, economics, etc., and its complexity far exceeds that of a single field. What is more challenging is that encryption benchmarks not only need to evaluate knowledge, but also need to examine the actual operational capabilities of AI using encryption technology, which requires the design of a completely new evaluation architecture. The lack of relevant data sets further increases the difficulty.
The complexity and importance of this task means that it cannot be accomplished by a single person or team. It requires the wisdom of users, developers, cryptography experts, encryption researchers, and more interdisciplinary people, and relies on broad community participation and consensus. Therefore, encryption benchmarks require more extensive discussions, because this is not only a technical work, but also a deep reflection on how we understand this emerging technology.
Postscript: The topic is far from over. In the next article, I will delve into the specific ideas and challenges of building AI benchmarks in the encryption field. Experiments are still ongoing, and we are constantly optimizing test models, enriching data sets, improving evaluation frameworks, and improving automated testing projects. Adhering to the concept of open collaboration, all relevant resources in the future - including data sets, experimental results, evaluation frameworks, and automated testing codes - will be open source as public resources.