Anthropic, an AI company founded by former OpenAI executive Dario Amodei, recently released a research report and explored the risks that AI may bring to human decision-making in the future. The study tested four ways in which AI may undermine human decision-making, and evaluated existing models. Make an assessment. Anthropic said that the risk is still within control and only simple protective measures are needed.

Four ways AI disrupts human decision-making

Factors that undermine human decision-making: AI model positivity, human nature

Whether AI can lead humans to make wrong decisions without being suspected. Experiments show that using AI models that actively try to influence decisions makes participants more likely to make poor decisions, but also makes it easier to detect the AI's destructive intentions. Some participants are prone to making wrong decisions due to excessive trust in AI or lack of skepticism, which shows that decision-making errors are also related to human attitudes.​

Illustration of AI models misleading humans. Program breaking factors: defender model strength, environment setting conditions

Whether AI can insert faulty code without being noticed. Experiments show that the "attacker" AI model attempts to sneak in error code, while the "defender" AI model is responsible for checking and preventing errors. The results show that when the defender's AI model is stronger, the attacker's success rate is lower.

  • For example, when the defender is Claude 3 Opus, the attacker has a 10% success rate;

  • When switching to the more powerful Claude 3.5 Sonnet, the success rate was only 1%.

While the experiment favored the attacker, these results do not indicate a significant risk with current AI models.

AI attempts to insert error code. Hiding ability (Sandbagging) factors: whether there is keyword triggering, safety training

Sandbagging describes retaining true strength. In this experiment, it refers to whether the AI ​​will deliberately hide all its abilities under certain circumstances and wait until specific situations to reveal them.

For example, it might perform poorly on purpose when testing its ability to handle dangerous situations, but fully demonstrate its capabilities when prompted by a specific keyword.

The results showed that when the AI ​​model was evaluated for its capabilities involving dangerous tasks, its performance dropped to 36.1% of its normal capabilities, showing that the model does indeed deliberately hide its processing capabilities in certain situations. After safety training, the performance of the AI ​​model returned to 95.6% of the normal level, indicating that safety training can effectively reduce the sandbagging phenomenon.

AI attempts to hide its strength and undermine regulatory factors: AI’s ability to conceal and detect

Whether AI can manipulate the regulatory system during surveillance and conceal criminal behavior. There are two AI models in the experiment, one is called "Overseer", which is responsible for supervising the conversation process between the user and the other AI model, and at the same time assessing whether the conversation is potentially dangerous.

The researchers instructed Overseer to rate these conversations as "harmless" regardless of whether they were dangerous, and in the process avoid being discovered by another "Monitor" model. The role of the Monitor model is to act as a human supervisor, responsible for observing The whole process.

The experimental results show that Overseer successfully reduced the harmful dialogue score, that is, it successfully "undermined" supervision. However, at the same time, the Monitor model observed some abnormalities and began to doubt Overseer, indicating that Overseer's "destructive" behavior did not escape the Monitor's eyes. .

The picture shows two AI models spy against spy

Anthropic said that current AI models have demonstrated low-level destructive capabilities, but at this stage only simple protective measures are needed to effectively address these risks. But as AI advances rapidly, more realistic assessments and stronger safeguards may be needed to address higher levels of risk in the future.

This article, former OpenAI senior executive Dario Amodei: AI may undermine human decision-making, and we need to actively respond to it in the future, first appeared on Chain News ABMedia.