Key Findings

Cluster 1. Foundational loop + mode collapse

| Cluster | Method (title + arXiv) | Core idea | Code repo | Relevance to R-vs-M |
| --- | --- | --- | --- | --- |
| Foundational | Red Teaming Language Models with Language Models (Perez et al. 2022, arXiv 2202.03286) | Use one LM to auto-generate test prompts (zero/few-shot, SL, RL) that elicit harmful outputs from a target LM; first to scale LM-vs-LM red-teaming and to document the diversity/effectiveness trade-off and RL mode collapse. | No official public code | The origin of both framing axes: the baseline R must beat, and the paper that named the mode-collapse problem. |

Cluster 2. RL-based attackers with diversity rewards (black-box reward)

| Cluster | Method (title + arXiv) | Core idea | Code repo | Relevance to R-vs-M |
| --- | --- | --- | --- | --- |
| RL + diversity | Curiosity-driven Red-teaming for LLMs (Hong et al. 2024, arXiv 2402.19464) | Add curiosity/novelty rewards (SelfBLEU + embedding novelty) to an RL attacker to maximize coverage of effective test cases while keeping attack success high; provokes toxic responses even from heavily RLHF'd Llama-2. | github.com/Improbable-AI/curiosity_redteam | Direct template for the diversity-reward term in R; black-box reward on target only. |
| RL + diversity | Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step RL (Beutel et al./OpenAI 2024, arXiv 2412.18693) | Decompose into (1) auto-generated diverse goals + rule-based rewards and (2) multi-step RL with a style/tactic diversity reward; demonstrated on prompt injection and jailbreaks. | No public code | The closest production-grade recipe for training R: goal-conditioning + per-step diversity reward. |
| RL + diversity (borderline) | Learning Diverse Attacks on LLMs for Robust Red-Teaming and Safety Tuning (Lee et al. 2024, arXiv 2405.18540) | GFlowNet fine-tuning + MLE smoothing to sample diverse, transferable attack prompts proportional to reward, avoiding RL mode collapse. | github.com/GFNOrg/red-teaming | GFlowNet sampling is a principled alternative to the diversity term. |
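
The diversity-reward idea in this cluster can be illustrated with a small sketch. It is not the implementation from any of the papers listed: it swaps in a plain bigram-overlap score as a simplified stand-in for SelfBLEU, leaves out the embedding-novelty term, and the names `novelty` and `shaped_reward` and the `weight` parameter are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n=2):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def novelty(candidate, history, n=2):
    """1 minus the highest overlap with any earlier prompt (1.0 if no history)."""
    if not history:
        return 1.0
    return 1.0 - max(overlap(candidate, past, n) for past in history)


def shaped_reward(attack_score, candidate, history, weight=0.5):
    """Attack score plus a weighted novelty bonus (weight is an assumed value)."""
    return attack_score + weight * novelty(candidate, history)
```

A prompt that repeats an earlier one earns no bonus, while a prompt sharing no bigrams with the history earns the full `weight`. This is the pressure against mode collapse that the diversity term is meant to supply.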

Cluster 3. Evolutionary / mutation-based / quality-diversity

| Cluster | Method (title + arXiv) | Core idea | Code repo | Relevance to R-vs-M |
| --- | --- | --- | --- | --- |
| Quality-diversity | Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts (Samvelyan et al. 2024, arXiv 2402.16822) | Cast adversarial prompt generation as MAP-Elites quality-diversity search over a feature archive; black-box, LLM-driven mutation + fitness; >90% ASR across tested Llama models. | No official repo | The canonical "preserve diversity via an archive" method. |
| Quality-diversity | RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search (Dang et al. 2025, arXiv 2504.15047) | Extends Rainbow Teaming with a multi-individual archive (many elites per cell) + concurrent fitness evaluation; more unique prompts, higher ASR (avg 81.1% on HarmBench), faster. | github.com/knoveleng/rainbowplus | Shows how to scale archive diversity without collapsing. |
| Quality-diversity | Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring (Pala et al. 2024, arXiv 2408.10701) | Generates multiple prompt mutations per iteration and ranks them with a reward model/Llama Guard/LLM-judge; raises overall ASR to 95% (46% higher than Rainbow Teaming) and reaches 90% ASR 15.2% faster than the baseline. | github.com/declare-lab/ferret | Reward-based mutation selection is a drop-in for choosing which human-distribution recombinations to keep. |
| Lifelong strategy | AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (Liu et al. 2024, arXiv 2410.05295) | Black-box agent that discovers a growing library of jailbreak strategies from scratch and reuses them; plug-and-play with human strategies; 88.5% ASR on GPT-4-1106-turbo (93.4% with human strategies). | github.com/SaFoLab-WISC/AutoDAN-Turbo | A self-growing strategy library is a diversity reservoir; the plug-in of human strategies mirrors mine-and-recombine. |
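
The archive mechanism behind the quality-diversity methods above can be sketched as follows. This is a minimal MAP-Elites-style grid, assuming a scalar fitness score and a two-part descriptor such as (risk category, attack style). The `Archive` class and its method names are hypothetical, and the papers' own judges and mutation operators are not reproduced here.

```python
class Archive:
    """Grid archive holding the best-scoring prompt per feature cell."""

    def __init__(self):
        # Maps a descriptor tuple to a (fitness, prompt) pair.
        self.cells = {}

    def add(self, prompt, descriptor, fitness):
        """Insert the prompt if its cell is empty or it beats the current elite."""
        current = self.cells.get(descriptor)
        if current is None or fitness > current[0]:
            self.cells[descriptor] = (fitness, prompt)
            return True
        return False

    def coverage(self, total_cells):
        """Fraction of the descriptor grid that holds at least one prompt."""
        return len(self.cells) / total_cells
```

Because a candidate only competes with the elite in its own cell, a strong prompt in one cell cannot displace weaker prompts elsewhere, which is how the archive keeps diversity. A multi-elite variant in the spirit of RainbowPlus would store a list per cell instead of a single pair.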

Cluster 4. Iterative refinement / tree-search black-box jailbreaks

| Cluster | Method (title + arXiv) | Core idea | Code repo | Relevance to R-vs-M |
| --- | --- | --- | --- | --- |
| Iterative refine | Jailbreaking Black Box LLMs in Twenty Queries (PAIR) (Chao et al. 2023, arXiv 2310.08419) | Attacker LM iteratively refines a candidate jailbreak against the target with a judge; black-box; often requires fewer than twenty queries. | github.com/patrickrchao/JailbreakingLLMs | The minimal refinement loop; useful as the "execution/refinement" inner loop in the pipeline. |
| Tree search | Tree of Attacks with Pruning (TAP) (Mehrotra et al. 2023, arXiv 2312.02119) | Tree-of-thought branching over attack prompts with pruning of off-topic/unpromising branches; jailbreaks >80% of prompts on GPT-4-Turbo/GPT-4o, with a 90% success rate on GPT-4 at 28.8 queries. | github.com/RICommunity/TAP | Branch-and-prune is a diversity-preserving search that avoids collapsing onto one prompt. |
| Multi-turn | The Crescendo Multi-Turn LLM Jailbreak Attack (Russinovich et al. 2024, arXiv 2404.01833) | Start benign, then escalate across turns by referencing the model's own prior answers. | github.com/Azure/PyRIT | Multi-turn escalation is a human-like trajectory shape that monitors under-detect. |
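
The attacker/target/judge refinement loop described for PAIR can be sketched in a few lines. This is a single-stream simplification under stated assumptions, not the code from the PAIR repository: `attacker`, `target`, and `judge` are hypothetical callables supplied by the caller, and the 1-to-10 score scale with a success threshold of 10 is assumed here for illustration.

```python
def refine(goal, attacker, target, judge, max_queries=20, threshold=10):
    """Refine a prompt until the judge score reaches the threshold.

    Returns (prompt, queries_used) on success, or (None, max_queries)
    if the query budget runs out first.
    """
    history = []  # (prompt, response, score) triples fed back to the attacker
    for query in range(1, max_queries + 1):
        prompt = attacker(goal, history)         # propose or revise a candidate
        response = target(prompt)                # one black-box query
        score = judge(goal, prompt, response)    # rate the response
        history.append((prompt, response, score))
        if score >= threshold:
            return prompt, query
    return None, max_queries
```

The loop only needs query access to the target, which is what makes it black-box. A tree-search variant in the spirit of TAP would propose several candidates per step and drop those the judge rates as off-topic before querying the target.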

Cluster 5. Multi-turn / agentic conversational attackers

| Cluster | Method (title + arXiv) | Core idea | Code repo | Relevance to R-vs-M |
| --- | --- | --- | --- | --- |
| Agentic attacker | Automated Red Teaming with GOAT: the Generative Offensive Agent Tester (Pavlova et al./Meta 2024, arXiv 2410.01606) | Agentic attacker that holds plain-language adversarial conversations, reasoning over a toolbox of attack techniques; ASR@10 of 97% against Llama 3.1 and 88% against GPT-4-Turbo on JailbreakBench, within 5 turns. | No public code | Models the realistic human attacker conversation distribution; a template for an agentic R. |
| Multi-turn human | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet (Li et al. 2024, arXiv 2408.15221) | Human red-teamers exceed 70% ASR on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks; releases the MHJ dataset (2,912 prompts across 537 multi-turn jailbreaks). | huggingface.co/datasets/ScaleAI/mhj (gated) | Empirical proof that human-distribution multi-turn attacks evade defenses: the core hypothesis, with a dataset to mine. |
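
Since several entries report ASR@k figures, a short helper makes the metric concrete. The sketch assumes ASR@k means the share of target behaviors for which at least one of the first k attack attempts succeeds. The function name and the `outcomes` layout (behavior id mapped to a list of per-attempt booleans) are hypothetical.

```python
def asr_at_k(outcomes, k):
    """Share of behaviors with at least one success in the first k attempts."""
    if not outcomes:
        return 0.0
    hits = sum(1 for attempts in outcomes.values() if any(attempts[:k]))
    return hits / len(outcomes)
```

Under this definition ASR@k can only grow with k, so figures reported at different k are not directly comparable.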

Cluster 6. Mining human distribution / in-the-wild recombination

| Cluster | Method (title + arXiv) | Core idea | Code repo | Relevance to R-vs-M |
| --- | --- | --- | --- | --- |
| Mine-and-recombine | WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer LMs (Jiang et al. 2024, NeurIPS; arXiv 2406.18510) | Mine real chatbot logs to discover 5.7K unique clusters of novel jailbreak tactics, then compose multiple tactics into attacks, up to 4.6× more diverse and successful than SOTA; releases WildJailbreak (262K vanilla + adversarial prompt-response pairs). | github.com/allenai/wildteaming | Explicit starting point; both axes built in (human distribution by construction, diversity via combinatorial composition). |
| Learning layer | Adaptive Instruction Composition for Automated LLM Red-Teaming (Zymet et al./Capital One 2026, arXiv 2604.21159) | Replaces WildTeaming's random tactic combination with a contextual-bandit/RL composer that adaptively selects tactics per target to jointly optimize effectiveness + diversity. | No public code | The direct learning-layer upgrade to the baseline: turns mine-and-recombine into a trained policy. |
| Learning layer | Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts (Ziakas et al. 2025, arXiv 2510.07239) | Train per-style LoRA experts with GRPO, then a multi-armed bandit selects experts at inference based on target responses; SOTA ASR@10 on AdvBench with lower-perplexity (more human-readable) prompts. | No public code | Bandit-over-experts is a concrete mechanism to keep R's attack styles diverse and adapt online to M. |
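
The bandit-over-styles mechanism in the learning-layer entries can be illustrated with a standard UCB1 selector. UCB1 is swapped in here as a generic, non-contextual choice, and this summary does not confirm which bandit algorithm either paper uses. The `UCB1` class, the arm names, and the exploration constant `c` are all assumptions for illustration.

```python
import math


class UCB1:
    """UCB1 bandit over a fixed list of arms (e.g. attack styles or tactics)."""

    def __init__(self, arms, c=2.0):
        self.arms = list(arms)
        self.c = c  # exploration constant (assumed value)
        self.counts = {arm: 0 for arm in self.arms}
        self.totals = {arm: 0.0 for arm in self.arms}

    def select(self):
        """Return the next arm to play."""
        # Play every arm once before using the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        n = sum(self.counts.values())

        def ucb(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(self.c * math.log(n) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        """Record the reward observed after playing an arm."""
        self.counts[arm] += 1
        self.totals[arm] += reward
```

The exploration bonus shrinks as an arm is played more often, so rarely used styles keep getting sampled. That is one simple way to stop the attacker from settling on a single style while still adapting to what works against the target.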

Cluster 7. Real-trajectory / repository-derived attack synthesis

| Cluster | Method (title + arXiv) | Core idea | Code repo | Relevance to R-vs-M |
| --- | --- | --- | --- | --- |
| Real-trajectory | LinuxArena: A Control Setting for AI Agents in Live Production Software Environments (Tracy et al. 2026, arXiv 2604.15384) | 20 live multi-service environments, 1,671 real SWE main tasks + 184 sabotage side-tasks; releases the LaStraj dataset of human attack trajectories that evade monitors better than model-generated ones. | Repo not confirmed | The clearest "real repo/trace → attack" pipeline in the control setting; LaStraj is human-distribution attack data to mine directly. |
| Real-trajectory | BashArena: A Control Setting for Highly Privileged AI Agents (Kaufman et al. 2025, arXiv 2512.15688) | 637 sysadmin/infra tasks derived from filtered StackOverflow questions + four sabotage objectives; Sonnet 4.5 evades a GPT-4.1-mini monitor 26% of the time at 4% FPR. | github.com/UKGovernmentBEIS/control-arena | Repository/Q&A-derived realistic tasks that give attacks human-distribution cover; the substrate MonitoringBench is built on. |
| Real-trajectory | RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents (2025, arXiv 2510.02609) | Fully automated adaptive red-team agent with a toolbox (incl. GCG, jailbreak tools) + memory of successful attacks; validated on real assistants (Cursor, Codeium). | github.com/AI-secure/RedCode | Converts real code-agent sessions into executed attacks; memory module is a diversity/experience reservoir. |
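
Figures of the form "evades the monitor X% of the time at Y% FPR" come from fixing a suspicion threshold on benign runs and then counting the attack runs that stay at or below it. The sketch below shows one way to compute that, assuming the monitor emits a numeric suspicion score per trajectory. The function names and the tie-handling convention (flag only scores strictly above the threshold) are assumptions, not taken from the BashArena paper.

```python
import math


def threshold_at_fpr(benign_scores, max_fpr):
    """Lowest threshold that flags at most max_fpr of the benign runs."""
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign runs we may flag; the epsilon guards float rounding.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")  # budget large enough to flag every run
    # Scores strictly above this value number at most `allowed`.
    return ranked[allowed]


def evasion_rate(attack_scores, threshold):
    """Fraction of attack runs whose score does not exceed the threshold."""
    missed = sum(1 for score in attack_scores if score <= threshold)
    return missed / len(attack_scores)
```

For example, with 100 benign scores running from 1 to 100 and a 4% budget, the threshold is 96, so only the four highest benign runs are flagged. Attack runs scoring 96 or below would count as evasions.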

Cluster 8. Agent red-teaming benchmarks / attack surfaces

| Cluster | Method (title + arXiv) | Core idea | Code repo | Relevance to R-vs-M |
| --- | --- | --- | --- | --- |
| Benchmark | AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents (Debenedetti et al. 2024, arXiv 2406.13352) | Extensible env with 97 realistic tasks + 629 security cases; emulates realistic multi-step agent execution under injection. | github.com/ethz-spylab/agentdojo | The environment substrate SHADE-Arena was built from; defines the tool-use attack surface M must watch. |
| Benchmark | InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents (Zhan et al. 2024, arXiv 2403.02691) | 1,054 single-turn IPI test cases (17 user tools, 62 attacker tools); direct-harm + data-exfiltration intents. | github.com/uiuc-kang-lab/InjecAgent | Canonical IPI surface; baseline attack vectors R can embed into trajectories. |
| Benchmark | AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (Andriushchenko et al. 2024, arXiv 2410.09024) | 110 malicious agent tasks (440 augmented) across 11 harm categories; measures refusal + retained capability post-jailbreak. | huggingface.co/datasets/ai-safety-institute/AgentHarm | Defines harmful side-task space; complements side-task design in control settings. |