Red Team Reloaded: Hacking AI Applications

Generative Artificial Intelligence is everywhere these days— from customer service chatbots all the way up to autonomous agents writing code on behalf of users. However, as these AI models are given more power and agency, they also become a much larger target for adversarial attacks. How can we ensure that our AI models are robust, safe, and reliable?

In October 2023, President Biden published an executive order on AI. One of its most important recommendations was for large foundational AI models to undergo AI Red Teaming, to “enable deployment of safe, secure, and trustworthy systems”. Since then, AI Red Teaming has evolved into a formal process run across the AI industry, especially in Generative AI models.

AI Red Teaming is the process of stress testing AI models to simulate real-world attacks. Based on similar practices in military and cybersecurity, AI Red Teaming helps engineers and researchers better understand their models by probing for vulnerabilities, biases, and other risks. The result is safer and more trustworthy AI systems.

Model providers, such as OpenAI and Anthropic, typically handle AI Red Teaming and provide the benefits as part of the service. However, knowing how and when to do AI Red Teaming is a useful skill even for those outside of these companies— it helps engineers integrating with AI services understand the underlying security model, and security researchers perform external testing.

In this blog post, we’ll walk through the step-by-step process of red teaming an AI Application, including:

Planning and scoping an effective engagement
Key attack strategies to test AI systems
How to evaluate AI security and resilience
Defensive strategies to protect against risks

Whether you’re trying to build out a new Red Team Assessment practice, or an open source researcher looking for new tactics to test on AI models, read on!

Understanding AI Red Teaming

AI Red Teaming is typically defined as the process of probing an AI system for security vulnerabilities and other system failures. These include unacceptable content generation, unacceptable biases, resilience, and technical correctness.

While traditional cybersecurity red teaming focuses on penetration testing, social engineering, and security assessments for IT systems, AI Red Teaming takes a different approach. Instead of focusing just on security, AI Red Teaming is primarily about safety.

For example, a traditional red team assessment might search for infrastructure hacks that can cause financial damage to the company. An AI Red Teamer assessment might instead search for prompt hacks— such as crafting a malicious prompt that causes the AI to generate unsafe content that can cause ethical damage to society.

The process of AI Red Teaming involves running AI Red Team Assessments— targeted engagements where experts from multiple disciplines come together to test assumptions about how an AI works and make improvements to it for safer and more trustworthy use.

When and Who Should Conduct an AI Red Team Assessment?

AI Red Team assessments should be conducted whenever you are considering deploying or integrating with a new AI system. In common parlance, AI Red Teaming typically refers to exercises done on generative AI systems. While some of this discipline may apply to other AI systems, in this blog post we will focus primarily on generative AI.

AI Red Teaming is ideally integrated throughout the entire AI development lifecycle. An adversarial mindset should be used to evaluate the system from inception to deployment and should be re-run after any major updates.

If you are just integrating with a cloud AI service, the provider has likely already handled some of the Red Teaming already. You can take advantage of this by using Responsible AI (RAI) controls that are provided, such as Off-Topic detection and Automated Reasoning. Check your model provider to find any relevant security controls.

What Does AI Red Teaming Measure?

The scope of an AI Red Team assessment may vary wildly between companies. However, typical key objectives that an AI Red Team assessment evaluates for are:

Adversarial robustness

An attacker may attempt to manipulate or disrupt a model’s behavior through malicious prompting. Testing for adversarial robustness can help build stronger protections to keep a model functioning as intended.

Accuracy and Factual Correctness

Large AI models can be prone to generating incorrect content due to hallucination or even data poisoning. Testing for accuracy will ensure users are able to have trust in the AI model.

Bias and Fairness

Testing models to ensure that they respond to inputs in a fair and unbiased way can help prevent potential discrimination issues.

Security and Data Protection

An AI model may be trained off of, or otherwise have access to, sensitive information. Ensuring that it responds in a compliant manner can help prevent data leaks.

Now, let’s examine how assessments are typically conducted.

Steps for a Typical AI Red Team Assessment

The methodology for AI Red Teaming borrows heavily from its cybersecurity and military counterparts. If you’ve ever done a traditional Red Team operation, this process may seem familiar.

There are three major steps to an AI Red Team Assessment:

Planning and Scope Definition
Attacking the System
Strengthening Defenses

1. Planning and Scope Definition

The first step of the process is defining the boundaries of the test. It is important to set a narrow and specific scope with clear success criteria, to ensure maximal coverage in high-risk areas. This scope may include specific model capabilities to evaluate, ethical concerns to be tested, or specific safety guardrails that should be stress-tested.

Key objectives to evaluate and success criteria in the assessment may include: defining what constitutes a vulnerability in the given system (such as privacy violations, biases, toxicity), and a severity calculator for risk prioritization. For example, scenarios that users are very likely to run into or can cause the most societal harm, may be prioritized over scenarios that are harder to trigger or are less harmful.

2. Attacking the System

Once the scope has been defined, an attack plan should be formulated and executed.

This attack plan should include different scenarios to probe the AI system for, such as:

User stories to emulate, like users attempting to use a model for illegal purposes
Domain areas to query the model on, to test for factual inaccuracies and biases
Strategies to circumvent the system prompt, otherwise known as “jailbreaking”

The last bullet is what open source security researchers usually focus on— finding novel ways of getting an AI to ignore its system prompt and do things that should not be allowed. Some common tactics used for this are: character substitution, language switching, role-playing, and many others. This is just one part of the entire Red Team assessment, however.

Most of the assessment involves turning user stories into prompts and evaluating the model’s responses to said prompts. Given the enormous search space, the typical AI Red Teaming workflow involves a lot of automation. After scenarios are constructed and domain areas are identified, these are seeded into a Red Teaming framework such as PyRIT or garak. This turns scenarios into prompts, which are automatically evaluated against the model using LLMs. AI testing the AIs!

3. Strengthening Defenses

The final phase of the assessment turns findings into system improvements. This entails working with the engineering team to prioritize risks, designing and implementing fixes, and remediation testing to ensure the mitigations work.

While the specific technical controls will vary wildly depending on your AI, some potential technical controls to tweak are:

For risks that occur due to improperly sourced data, such as PII leakage or unacceptable levels of bias, one technical control may be to suggest refining the data and retraining the model. These data refinements may include things such as PII redaction via regexes, reselection of data to reduce bias or other data engineering tricks.

Ideal for: Reducing bias; reducing sensitive data leakage

Not ideal for: Fixing a model that has already been deployed—retraining is expensive!

Learn More: Data Curation for Fine Tuning

Content Filtering

Another solution for sensitive data leakage is filtering the content as it is generated by the AI model. This can be much cheaper than retraining your entire AI model, but may be less robust, as the model will retain a memory of the sensitive data; there may be bypasses to the content filter.

Ideal for: Reducing sensitive data leakage

Not ideal for: Robustly resolving sensitive data leakage; reducing bias

Learn More: Off topic/ Content Filtering

Intent Classification / Off-topic detection

To detect if adversarial users are trying to use your AI application for unintended purposes (i.e. jailbreaking), you may consider adding an intent classification layer to detect potentially adversarial inputs. This can be implemented as a small LLM, or with other forms of AI classifiers, depending on your use case.

Ideal for: Mitigating jailbreaks and prompt injection

Not ideal for: Reducing sensitive data leakage or bias

Learn More: Off Topic/ Content Filtering

Prompt improvements

For Generative AI systems, the system prompt is a soft, but still important, control that impacts a model’s behavior. Adding more context and refining the prompt can reduce ambiguity and be a quick way to make your AI safer.

Ideal for: Reducing ambiguity; fixing issues in AI integrations

Not ideal for: Robustly fixing jailbreaks, since system prompts can be bypassed; fixing issues with data fidelity

Learn more: Prompt Evaluations, Prompt Engineering tips

Conclusion

As AI systems continue to become more powerful, the practice of AI Red Teaming has evolved from a recommended practice to an essential component of responsible AI deployment. Since Biden’s executive order in 2023, organizations have recognized that proactively stress-testing their AI systems is more than just a compliance exercise; it is a fundamental safeguard for building safe and trustworthy technology.

Effective AI Red Teaming requires a comprehensive approach: careful planning and scoping, diverse and creative attack strategies, and implementing targeted and layered defensive measures. Unlike traditional cybersecurity testing, AI Red Teaming's focus on both security and safety requires an understanding of both the technical architecture and the societal implications of AI systems.

For even more technical information on this topic, we recommend the following GitHub repositories. Also, check out industry research from the major model providers for the most state-of-the-art techniques and procedures.