| Cluster | Method (title + arXiv) | Core idea | Code repo | Relevance to R-vs-M |
|---|---|---|---|---|
| Foundational | Red Teaming Language Models with Language Models (Perez et al. 2022, arXiv 2202.03286) | Use one LM to auto-generate test prompts (zero-shot, few-shot, supervised learning, RL) that elicit harmful outputs from a target LM; first to scale LM-vs-LM red-teaming and to document the diversity/effectiveness trade-off and RL mode collapse. | No official public code | Origin of both framing axes: it sets the baseline R must beat and is the paper that first documented the mode-collapse problem. |
| RL + diversity | Curiosity-driven Red-teaming for LLMs (Hong et al. 2024, arXiv 2402.19464) | Add curiosity/novelty rewards (SelfBLEU + embedding novelty) to an RL attacker to maximize coverage of effective test cases while keeping attack success high; provokes toxic responses even from heavily RLHF'd Llama-2. | github.com/Improbable-AI/curiosity_redteam | Direct template for the diversity-reward term in R; black-box reward on target only. |
| RL + diversity | Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step RL (Beutel et al./OpenAI 2024, arXiv 2412.18693) | Decompose into (1) auto-generated diverse goals + rule-based rewards and (2) multi-step RL with a style/tactic diversity reward; demonstrated on prompt injection and jailbreaks. | No public code | The closest production-grade recipe for training R: goal-conditioning + per-step diversity reward. |
| RL + diversity (borderline) | Learning Diverse Attacks on LLMs for Robust Red-Teaming and Safety Tuning (Lee et al. 2024, arXiv 2405.18540) | GFlowNet fine-tuning + MLE smoothing to sample diverse, transferable attack prompts proportional to reward, avoiding RL mode collapse. | github.com/GFNOrg/red-teaming | GFlowNet sampling is a principled alternative to the diversity term. |
| Quality-diversity | Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts (Samvelyan et al. 2024, arXiv 2402.16822) | Cast adversarial prompt generation as MAP-Elites quality-diversity search over a feature archive; black-box, LLM-driven mutation + fitness; >90% ASR across tested Llama models. | No official repo | The canonical "preserve diversity via an archive" method. |
| Quality-diversity | RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search (Dang et al. 2025, arXiv 2504.15047) | Extends Rainbow Teaming with a multi-individual archive (many elites per cell) + concurrent fitness evaluation; yields more unique prompts, higher ASR (avg 81.1% on HarmBench), and shorter runtime. | github.com/knoveleng/rainbowplus | Shows how to scale archive diversity without collapsing. |
| Quality-diversity | Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring (Pala et al. 2024, arXiv 2408.10701) | Generates multiple prompt mutations per iteration and ranks them with a reward model/Llama Guard/LLM-judge; raises overall ASR to 95% (46% higher than Rainbow Teaming) and reaches 90% ASR 15.2% faster than the baseline. | github.com/declare-lab/ferret | Reward-based mutation selection is a drop-in for choosing which human-distribution recombinations to keep. |
| Lifelong strategy | AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (Liu et al. 2024, arXiv 2410.05295) | Black-box agent that discovers a growing library of jailbreak strategies from scratch and reuses them; plug-and-play with human strategies; 88.5% ASR on GPT-4-1106-turbo (93.4% with human strategies). | github.com/SaFoLab-WISC/AutoDAN-Turbo | A self-growing strategy library is a diversity reservoir; the plug-in of human strategies mirrors mine-and-recombine. |
| Iterative refine | Jailbreaking Black Box LLMs in Twenty Queries (PAIR) (Chao et al. 2023, arXiv 2310.08419) | Attacker LM iteratively refines a candidate jailbreak against the target with a judge; black-box; often requires fewer than twenty queries. | github.com/patrickrchao/JailbreakingLLMs | The minimal refinement loop; useful as the "execution/refinement" inner loop in the pipeline. |
| Tree search | Tree of Attacks with Pruning (TAP) (Mehrotra et al. 2023, arXiv 2312.02119) | Tree-of-thought branching over attack prompts with pruning of off-topic/unpromising branches; jailbreaks >80% of prompts on GPT-4-Turbo/GPT-4o, with a 90% success rate on GPT-4 at 28.8 queries. | github.com/RICommunity/TAP | Branch-and-prune is a diversity-preserving search that avoids collapsing onto one prompt. |
| Multi-turn | The Crescendo Multi-Turn LLM Jailbreak Attack (Russinovich et al. 2024, arXiv 2404.01833) | Start benign, escalate by referencing the model's own prior answers across turns. | github.com/Azure/PyRIT | Multi-turn escalation is a human-like trajectory shape that monitors under-detect. |
| Agentic attacker | Automated Red Teaming with GOAT: the Generative Offensive Agent Tester (Pavlova et al./Meta 2024, arXiv 2410.01606) | Agentic attacker that holds plain-language adversarial conversations, reasoning over a toolbox of attack techniques; ASR@10 of 97% against Llama 3.1 and 88% against GPT-4-Turbo on JailbreakBench, within 5 turns. | No public code | Models the realistic human attacker conversation distribution; a template for an agentic R. |
| Multi-turn human | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet (Li et al. 2024, arXiv 2408.15221) | Human red-teamers exceed 70% ASR on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks; releases the MHJ dataset (2,912 prompts across 537 multi-turn jailbreaks). | huggingface.co/datasets/ScaleAI/mhj (gated) | Empirical proof that human-distribution multi-turn attacks evade defenses — the core hypothesis, with a dataset to mine. |
| Mine-and-recombine | WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer LMs (Jiang et al. 2024, NeurIPS; arXiv 2406.18510) | Mine real chatbot logs to discover 5.7K unique clusters of novel jailbreak tactics, then compose multiple tactics into attacks — up to 4.6× more diverse and successful than SOTA; releases WildJailbreak (262K vanilla + adversarial prompt-response pairs). | github.com/allenai/wildteaming | Explicit starting point; both axes built in (human distribution by construction, diversity via combinatorial composition). |
| Learning layer | Adaptive Instruction Composition for Automated LLM Red-Teaming (Zymet et al./Capital One 2026, arXiv 2604.21159) | Replaces WildTeaming's random tactic combination with a contextual-bandit/RL composer that adaptively selects tactics per target to jointly optimize effectiveness + diversity. | No public code | The direct learning-layer upgrade to the baseline: turns mine-and-recombine into a trained policy. |
| Learning layer | Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts (Ziakas et al. 2025, arXiv 2510.07239) | Train per-style LoRA experts with GRPO, then a multi-armed bandit selects experts at inference based on target responses; SOTA ASR@10 on AdvBench with lower-perplexity (more human-readable) prompts. | No public code | Bandit-over-experts is a concrete mechanism to keep R's attack styles diverse and adapt online to M. |
| Real-trajectory | LinuxArena: A Control Setting for AI Agents in Live Production Software Environments (Tracy et al. 2026, arXiv 2604.15384) | 20 live multi-service environments, 1,671 real SWE main tasks + 184 sabotage side-tasks; releases the LaStraj dataset of human attack trajectories that evade monitors better than model-generated ones. | Repo not confirmed | The clearest "real repo/trace → attack" pipeline in the control setting; LaStraj is human-distribution attack data to mine directly. |
| Real-trajectory | BashArena: A Control Setting for Highly Privileged AI Agents (Kaufman et al. 2025, arXiv 2512.15688) | 637 sysadmin/infra tasks derived from filtered StackOverflow questions + four sabotage objectives; Sonnet 4.5 evades a GPT-4.1-mini monitor 26% of the time at 4% FPR. | github.com/UKGovernmentBEIS/control-arena | Repository/Q&A-derived realistic tasks that give attacks human-distribution cover; the substrate MonitoringBench is built on. |
| Real-trajectory | RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents (2025, arXiv 2510.02609) | Fully automated adaptive red-team agent with a toolbox (incl. GCG, jailbreak tools) + memory of successful attacks; validated on real assistants (Cursor, Codeium). | github.com/AI-secure/RedCode | Converts real code-agent sessions into executed attacks; memory module is a diversity/experience reservoir. |
| Benchmark | AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents (Debenedetti et al. 2024, arXiv 2406.13352) | Extensible env with 97 realistic tasks + 629 security cases; emulates realistic multi-step agent execution under injection. | github.com/ethz-spylab/agentdojo | The environment substrate SHADE-Arena was built from; defines the tool-use attack surface M must watch. |
| Benchmark | InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents (Zhan et al. 2024, arXiv 2403.02691) | 1,054 single-turn IPI test cases (17 user tools, 62 attacker tools); direct-harm + data-exfiltration intents. | github.com/uiuc-kang-lab/InjecAgent | Canonical IPI surface; baseline attack vectors R can embed into trajectories. |
| Benchmark | AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (Andriushchenko et al. 2024, arXiv 2410.09024) | 110 malicious agent tasks (440 augmented) across 11 harm categories; measures refusal + retained capability post-jailbreak. | huggingface.co/datasets/ai-safety-institute/AgentHarm | Defines harmful side-task space; complements side-task design in control settings. |
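
The "preserve diversity via an archive" idea in the Quality-diversity rows can be illustrated with a minimal MAP-Elites-style archive. This is a sketch under stated assumptions, not the Rainbow Teaming implementation: the feature descriptors (`risk_category`, `attack_style`), the class names, and the numeric fitness values are all hypothetical, and the fitness is a stand-in for whatever judge or reward model scores a candidate.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical feature descriptors: (risk_category, attack_style).
Cell = Tuple[str, str]


@dataclass
class Elite:
    prompt: str
    fitness: float  # assumed to come from an external judge, in [0, 1]


@dataclass
class Archive:
    cells: Dict[Cell, Elite] = field(default_factory=dict)

    def try_insert(self, cell: Cell, prompt: str, fitness: float) -> bool:
        """Keep the candidate only if its cell is empty or it beats the current elite."""
        current = self.cells.get(cell)
        if current is None or fitness > current.fitness:
            self.cells[cell] = Elite(prompt, fitness)
            return True
        return False

    def coverage(self, total_cells: int) -> float:
        """Fraction of the feature grid that holds at least one elite."""
        return len(self.cells) / total_cells


# Placeholder candidates; the strings stand in for generated test prompts.
archive = Archive()
archive.try_insert(("fraud", "role_play"), "candidate A", 0.4)
archive.try_insert(("fraud", "role_play"), "candidate B", 0.7)  # replaces A
archive.try_insert(("fraud", "role_play"), "candidate C", 0.5)  # rejected
archive.try_insert(("privacy", "hypothetical"), "candidate D", 0.2)  # new cell
```

Because candidates compete only within their own cell, a low-fitness prompt in an unexplored cell is kept rather than discarded, which is what stops the search from collapsing onto a single high-scoring style. A multi-elite variant in the spirit of RainbowPlus would store a list per cell instead of a single `Elite`.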