Anthropic and many other technology giants are forming a "Red teaming" to patch security flaws and prevent the risk of the model being exploited for bad purposes.

During the week, Anthropic released “red team” guidance, joining a group of vendors such as Google, Microsoft, NIST, NVIDIA and OpenAI who have also released similar frameworks. The goal of these frameworks is to identify and remediate growing security vulnerabilities in artificial intelligence (AI) models.

The “red team” approach is proving effective in detecting security vulnerabilities that other security methods cannot see, helping AI companies avoid their models being used to generate content. unwanted content.

The goal and importance of the "red team" strategy in the field of AI

Concerns about security risks from AI models are increasing, pushing policymakers to seek solutions for a safe, trustworthy AI platform. The Executive Order (EO) on Safe, Secure, and Trusted AI (14110), signed by President Biden on October 30, 2018, directed NIST to establish appropriate guidelines and processes to enable AI developers, especially with closed-use platform models, conduct "AI model testing" - also an AI "red team" option, to deploy safe, reliable AI systems. .

NIST released two draft publications in late April to help manage the risks of generative AI. These documents are complementary resources to the AI ​​Risk Management Framework (AI RMF) and the Secure Software Development Framework (SSDF).

The German Federal Office for Information Security (BSI) offers a “red team” strategy as part of their broader IT-Grundschutz framework. Australia, Canada, the European Union, Japan, the Netherlands and Singapore also have prominent frameworks. The European Parliament passed the EU Artificial Intelligence Act in March this year.

The concept of “red team” AI

In fact, the red team model has been around since the 1960s, when adversarial attacks were created in simulation form to ensure computer systems operated stably. “In computers, there is no concept of 'safety'. Instead, what engineers can say is: we tried but we couldn't break it," said Bruce Schneier, a security expert and fellow at Harvard University's Berkman Klein Research Center. .

Today, “red teaming” is also known as a technique of testing AI models by simulating diverse and unpredictable attacks, in order to determine their strengths and weaknesses. Because generative AI models are trained on huge data warehouses, traditional security methods find it difficult to find vulnerabilities.

But like any computer software, these models still share common cyber vulnerabilities: they can be attacked by nefarious actors to achieve a variety of goals, including asking questions. Harmful replies, pornographic content, illegal use of copyrighted material or disclosure of personal information such as name, address and phone number. The goal of the strategy is to promote patterns of responding and saying things that are not already programmed to do, including revealing biases.

In particular, members of the "red team" will use large language models (LLM) to automate the creation of commands and attack scripts to find and fix weaknesses of AI models generated in the field. large scale.

For example, Google uses red teams to protect AI models from threats such as prompt injection attacks, data poisoning attacks, and backdoors. Once such vulnerabilities are identified, they can narrow down the errors in the software and improve them.

The value of a “red team” strategy in improving AI model security continues to be demonstrated in competitions across the industry. Last year, DEF CON – the world's largest hacker conference – organized the first Generative Red Team (GRT) competition, considered one of the great successes in using crowdsourcing techniques.

Models are provided by Anthropic, Cohere, Google, Hugging Face, Meta, Nvidia, OpenAI and Stability. Participants test models on an evaluation platform developed by Scale AI.

Anthropic's AI "red team" strategy

In publishing its methods, Anthropic emphasized the need for scalable standardized and systematized testing procedures. According to the company, the lack of common standards is a major barrier to AI model testing across the industry

Anthropic also proposes four main testing methods: testing by domain experts, using language models for testing, testing in new methods and general open testing.

The notable point in Anthropic's approach is the seamless combination of in-depth human understanding and quantitative results from testing techniques. Typically, Anthropic focuses on the role of a group of experts by field, and prioritizes the application of Policy Vulnerability Testing (PVT) - a qualitative technique that helps identify and deploy security protection measures. , especially in sensitive areas that are easily exploited such as election interference, hate incitement, pornography,...

Like many other technology companies, Anthropic is aiming to automate the testing process by using AI models to perform random simulated attacks, thereby detecting vulnerabilities. "We believe that the more powerful the AI ​​models are, the more effectively they can support humans in testing and automating the testing process," Anthropic shared.

Based on the red group/blue group model, Anthropic uses attack models, "provoking" the target AI model to perform the desired behavior, thereby collecting data and adjusting and strengthening the system.

One of the key and challenging areas that Anthropic is pursuing is multi-modality testing. Testing AI models with images and sounds is much more complicated than with text, because attackers can completely "disguise" malicious code in images and sounds, bypassing the security system. honey. The proof is that Anthropic's Claude 3 model line, before being launched, had to go through a rigorous testing process regarding its ability to process multimedia information, to minimize potential risks such as fraud and incitement. hostile, or threatens children's safety.

Conclude

It can be said that AI model testing is gradually showing its position as an important shield, protecting the sustainable development of the AI ​​industry. The participation of leading technology corporations and government agencies shows the joint effort to create a solid legal and technical framework, opening up a future for AI to prosper while still ensuring safety. integrity and responsibility.