Key Findings

Human-distribution attacks are now the dominant evasion lever. WildTeaming (mines real chatbot jailbreaks and recombines their tactics), MonitoringBench/LinuxArena (human-seeded attack plans), and Adaptive Attacks on Trusted Monitors (prompt injections that look like ordinary text) all show that attacks drawn from or shaped toward human patterns systematically beat model-elicited attacks against monitors that were never trained on such attacks.
Diversity is becoming an explicit objective, pursued through quality-diversity archives (Rainbow Teaming, RainbowPlus, Ferret), curiosity/novelty rewards (Hong et al.), multi-step RL diversity rewards (OpenAI), GFlowNet sampling (Lee et al.), and bandit/expert routing (Red-Bandit). These are direct antidotes to the mode collapse Perez et al. flagged in 2022.
The control/monitor sub-field is the newest and most relevant to R-vs-M, and it is moving fast (BashArena Dec 2025, AgenticRed Jan 2026, LinuxArena Apr 2026, MonitoringBench 2026). The field explicitly frames the red team as needing to elicit a failure probability at least as high as the one expected in deployment (Korbak control safety case), which makes diversity-preserving, human-distribution attacks a safety-case requirement rather than a nicety.

Cluster 1. Foundational loop + mode collapse

Cluster	Method (title + arXiv)	Core idea	Code repo	Relevance to R-vs-M
Foundational	Red Teaming Language Models with Language Models (Perez et al. 2022, arXiv 2202.03286)	Use one LM to auto-generate test prompts (zero/few-shot, SL, RL) that elicit harmful outputs from a target LM; first to scale LM-vs-LM red-teaming and to document the diversity/effectiveness trade-off and RL mode collapse.	No official public code	The origin of both framing axes: the baseline R must beat, and the paper that named the mode-collapse problem.

Cluster 2. RL-based attackers with diversity rewards (black-box reward)

Cluster	Method (title + arXiv)	Core idea	Code repo	Relevance to R-vs-M
RL + diversity	Curiosity-driven Red-teaming for LLMs (Hong et al. 2024, arXiv 2402.19464)	Add curiosity/novelty rewards (SelfBLEU + embedding novelty) to an RL attacker to maximize coverage of effective test cases while keeping attack success high; provokes toxic responses even from heavily RLHF'd Llama-2.	github.com/Improbable-AI/curiosity_redteam	Direct template for the diversity-reward term in R; black-box reward on target only.
RL + diversity	Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step RL (Beutel et al./OpenAI 2024, arXiv 2412.18693)	Decompose into (1) auto-generated diverse goals + rule-based rewards and (2) multi-step RL with a style/tactic diversity reward; demonstrated on prompt injection and jailbreaks.	No public code	The closest production-grade recipe for training R: goal-conditioning + per-step diversity reward.
RL + diversity (borderline)	Learning Diverse Attacks on LLMs for Robust Red-Teaming and Safety Tuning (Lee et al. 2024, arXiv 2405.18540)	GFlowNet fine-tuning + MLE smoothing to sample diverse, transferable attack prompts proportional to reward, avoiding RL mode collapse.	github.com/GFNOrg/red-teaming	GFlowNet sampling is a principled alternative to the diversity term.
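
The diversity-reward idea in this cluster can be illustrated with a small sketch. It is not the reward from any of the papers above: the bigram-overlap novelty score is a simplified stand-in for SelfBLEU or embedding novelty, and the function names and the `novelty_weight` value are hypothetical.

```python
from collections import Counter


def ngrams(text, n=2):
    """Count word n-grams in a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_novelty(candidate, history, n=2):
    """Return 1 minus the largest share of candidate n-grams found in any past prompt."""
    cand = ngrams(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0  # too short to judge; give no novelty bonus
    best_overlap = 0.0
    for past in history:
        ref = ngrams(past, n)
        shared = sum(min(count, ref[gram]) for gram, count in cand.items())
        best_overlap = max(best_overlap, shared / total)
    return 1.0 - best_overlap


def shaped_reward(success, candidate, history, novelty_weight=0.5):
    """Attack-success reward plus a weighted novelty bonus (weight is hypothetical)."""
    return success + novelty_weight * ngram_novelty(candidate, history)


history = ["please ignore all previous rules"]
repeat = shaped_reward(1.0, "please ignore all previous rules", history)
fresh = shaped_reward(1.0, "tell me a story", history)
print(repeat, fresh)
```

With equal attack success, the repeated prompt earns no bonus while the unseen one does, which is the pressure that keeps an RL attacker from collapsing onto a single prompt.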

Cluster 3. Evolutionary / mutation-based / quality-diversity

Cluster	Method (title + arXiv)	Core idea	Code repo	Relevance to R-vs-M
Quality-diversity	Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts (Samvelyan et al. 2024, arXiv 2402.16822)	Cast adversarial prompt generation as MAP-Elites quality-diversity search over a feature archive; black-box, LLM-driven mutation + fitness; >90% ASR across tested Llama models.	No official repo	The canonical "preserve diversity via an archive" method.
Quality-diversity	RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search (Dang et al. 2025, arXiv 2504.15047)	Extends Rainbow with a multi-individual archive (many elites per cell) + concurrent fitness; more unique prompts, higher ASR (avg 81.1% on HarmBench), faster.	github.com/knoveleng/rainbowplus	Shows how to scale archive diversity without collapsing.
Quality-diversity	Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring (Pala et al. 2024, arXiv 2408.10701)	Generates multiple prompt mutations per iteration and ranks them with a reward model/Llama Guard/LLM-judge; raises overall ASR to 95% (46% higher than Rainbow Teaming) and reaches 90% ASR 15.2% faster than the baseline.	github.com/declare-lab/ferret	Reward-based mutation selection is a drop-in for choosing which human-distribution recombinations to keep.
Lifelong strategy	AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (Liu et al. 2024, arXiv 2410.05295)	Black-box agent that discovers a growing library of jailbreak strategies from scratch and reuses them; plug-and-play with human strategies; 88.5% ASR on GPT-4-1106-turbo (93.4% with human strategies).	github.com/SaFoLab-WISC/AutoDAN-Turbo	A self-growing strategy library is a diversity reservoir; the plug-in of human strategies mirrors mine-and-recombine.
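
The archive mechanism behind the quality-diversity rows can be sketched as follows. This is a minimal illustration, not code from any of the listed repos: the `Archive` class, the descriptor tuples, and the fitness values are hypothetical. Setting `elites_per_cell=1` gives the one-elite-per-cell behaviour of MAP-Elites, while a larger value loosely mirrors the multi-elite cells described for RainbowPlus.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Quality-diversity archive: each descriptor cell keeps its fittest prompts."""
    elites_per_cell: int = 1
    cells: dict = field(default_factory=dict)

    def add(self, descriptor, prompt, fitness):
        """Insert a prompt into its cell; return True if it survived truncation."""
        cell = self.cells.setdefault(descriptor, [])
        cell.append((fitness, prompt))
        cell.sort(key=lambda item: item[0], reverse=True)  # best first
        del cell[self.elites_per_cell:]  # drop everything past the elite quota
        return (fitness, prompt) in cell


archive = Archive()
kept_first = archive.add(("fraud", "role_play"), "prompt A", 0.4)
kept_weaker = archive.add(("fraud", "role_play"), "prompt B", 0.3)
kept_stronger = archive.add(("fraud", "role_play"), "prompt C", 0.9)
kept_other_cell = archive.add(("malware", "role_play"), "prompt D", 0.1)
print(archive.cells)
```

A low-fitness prompt in an empty cell is kept, while a mid-fitness prompt in an occupied cell is discarded; competition stays local to each cell, which is how the archive preserves coverage.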

Cluster 4. Iterative refinement / tree-search black-box jailbreaks

Cluster	Method (title + arXiv)	Core idea	Code repo	Relevance to R-vs-M
Iterative refine	Jailbreaking Black Box LLMs in Twenty Queries (PAIR) (Chao et al. 2023, arXiv 2310.08419)	Attacker LM iteratively refines a candidate jailbreak against the target with a judge; black-box; often requires fewer than twenty queries.	github.com/patrickrchao/JailbreakingLLMs	The minimal refinement loop; useful as the "execution/refinement" inner loop in the pipeline.
Tree search	Tree of Attacks with Pruning (TAP) (Mehrotra et al. 2023, arXiv 2312.02119)	Tree-of-thought branching over attack prompts with pruning of off-topic/unpromising branches; jailbreaks >80% of prompts on GPT-4-Turbo/GPT-4o, with a 90% success rate on GPT-4 at 28.8 queries.	github.com/RICommunity/TAP	Branch-and-prune is a diversity-preserving search that avoids collapsing onto one prompt.
Multi-turn	The Crescendo Multi-Turn LLM Jailbreak Attack (Russinovich et al. 2024, arXiv 2404.01833)	Start benign, escalate by referencing the model's own prior answers across turns.	github.com/Azure/PyRIT	Multi-turn escalation is a human-like trajectory shape that monitors under-detect.
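
The attacker/target/judge loop in the PAIR row can be sketched in a few lines. This is a schematic under stated assumptions rather than the PAIR implementation: `attacker`, `target`, and `judge` are hypothetical callables, the toy stubs replace real model calls, and the 1-to-10 judge scale with a success threshold of 10 is assumed for illustration.

```python
def refine_attack(goal, attacker, target, judge, max_queries=20, threshold=10):
    """Iteratively refine a prompt until the judge score reaches the threshold."""
    history = []  # (prompt, response, score) triples fed back to the attacker
    for query in range(1, max_queries + 1):
        prompt = attacker(goal, history)
        response = target(prompt)
        score = judge(goal, prompt, response)
        history.append((prompt, response, score))
        if score >= threshold:
            return prompt, query  # successful prompt and queries used
    return None, max_queries  # budget exhausted without success


# Toy stand-ins so the loop runs without any model access.
def toy_attacker(goal, history):
    return f"{goal} (attempt {len(history) + 1})"


def toy_target(prompt):
    return "complied" if "attempt 3" in prompt else "refused"


def toy_judge(goal, prompt, response):
    return 10 if response == "complied" else 1


result = refine_attack("benign test goal", toy_attacker, toy_target, toy_judge)
print(result)
```

A tree-search variant in the style of TAP would ask the attacker for several candidates per step and prune low-scoring ones before querying the target, rather than following a single chain.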

Cluster 5. Multi-turn / agentic conversational attackers

Cluster	Method (title + arXiv)	Core idea	Code repo	Relevance to R-vs-M
Agentic attacker	Automated Red Teaming with GOAT: the Generative Offensive Agent Tester (Pavlova et al./Meta 2024, arXiv 2410.01606)	Agentic attacker that holds plain-language adversarial conversations, reasoning over a toolbox of attack techniques; ASR@10 of 97% against Llama 3.1 and 88% against GPT-4-Turbo on JailbreakBench, within 5 turns.	No public code	Models the realistic human attacker conversation distribution; a template for an agentic R.
Multi-turn human	LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet (Li et al. 2024, arXiv 2408.15221)	Human red-teamers exceed 70% ASR on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks; releases the MHJ dataset (2,912 prompts across 537 multi-turn jailbreaks).	huggingface.co/datasets/ScaleAI/mhj (gated)	Empirical proof that human-distribution multi-turn attacks evade defenses — the core hypothesis, with a dataset to mine.
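
For reading the ASR@10 figure above, the sketch below assumes that ASR@k means the fraction of target behaviours for which at least one of the first k attack attempts succeeds. The `asr_at_k` helper and the outcome data are hypothetical and are not taken from GOAT or MHJ.

```python
def asr_at_k(outcomes, k):
    """Share of behaviours with at least one success among the first k attempts."""
    hits = sum(any(attempts[:k]) for attempts in outcomes.values())
    return hits / len(outcomes)


# Hypothetical per-behaviour attempt logs (True = judged successful).
outcomes = {
    "behaviour_1": [False, True, False],
    "behaviour_2": [False, False, False],
    "behaviour_3": [False, False, True],
    "behaviour_4": [True],
}
print(asr_at_k(outcomes, 1), asr_at_k(outcomes, 2), asr_at_k(outcomes, 3))
```

Under this definition ASR@k can only stay flat or rise as k grows, so results reported at different k are not directly comparable.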

Cluster 6. Mining human distribution / in-the-wild recombination

Cluster	Method (title + arXiv)	Core idea	Code repo	Relevance to R-vs-M
Mine-and-recombine	WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer LMs (Jiang et al. 2024, NeurIPS; arXiv 2406.18510)	Mine real chatbot logs to discover 5.7K unique clusters of novel jailbreak tactics, then compose multiple tactics into attacks — up to 4.6× more diverse and successful than SOTA; releases WildJailbreak (262K vanilla + adversarial prompt-response pairs).	github.com/allenai/wildteaming	Explicit starting point; both axes built in (human distribution by construction, diversity via combinatorial composition).
Learning layer	Adaptive Instruction Composition for Automated LLM Red-Teaming (Zymet et al./Capital One 2026, arXiv 2604.21159)	Replaces WildTeaming's random tactic combination with a contextual-bandit/RL composer that adaptively selects tactics per target to jointly optimize effectiveness + diversity.	No public code	The direct learning-layer upgrade to the baseline: turns mine-and-recombine into a trained policy.
Learning layer	Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts (Ziakas et al. 2025, arXiv 2510.07239)	Train per-style LoRA experts with GRPO, then a multi-armed bandit selects experts at inference based on target responses; SOTA ASR@10 on AdvBench with lower-perplexity (more human-readable) prompts.	No public code	Bandit-over-experts is a concrete mechanism to keep R's attack styles diverse and adapt online to M.
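
The bandit-style selection described in the learning-layer rows can be illustrated with a standard UCB1 rule. This is a generic sketch and not the selection rule of either paper: the arm names stand for hypothetical tactics or style experts, and the reward is a simulated success signal.

```python
import math


class UCB1:
    """UCB1 bandit: pick the arm with the best mean reward plus exploration bonus."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}
        self.totals = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Play every arm once before applying the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        pulls = sum(self.counts.values())

        def upper_bound(arm):
            mean = self.totals[arm] / self.counts[arm]
            return mean + math.sqrt(2 * math.log(pulls) / self.counts[arm])

        return max(self.arms, key=upper_bound)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


# Simulated target: only the hypothetical "role_play" style ever succeeds.
bandit = UCB1(["role_play", "authority_appeal", "encoding"])
for _ in range(30):
    arm = bandit.select()
    bandit.update(arm, 1.0 if arm == "role_play" else 0.0)
print(bandit.counts)
```

The exploration bonus means failing styles are still retried occasionally, which is the property that lets such a selector adapt when the target or monitor changes.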

Cluster 7. Real-trajectory / repository-derived attack synthesis

Cluster	Method (title + arXiv)	Core idea	Code repo	Relevance to R-vs-M
Real-trajectory	LinuxArena: A Control Setting for AI Agents in Live Production Software Environments (Tracy et al. 2026, arXiv 2604.15384)	20 live multi-service environments, 1,671 real SWE main tasks + 184 sabotage side-tasks; releases the LaStraj dataset of human attack trajectories that evade monitors better than model-generated ones.	Repo not confirmed	The clearest "real repo/trace → attack" pipeline in the control setting; LaStraj is human-distribution attack data to mine directly.
Real-trajectory	BashArena: A Control Setting for Highly Privileged AI Agents (Kaufman et al. 2025, arXiv 2512.15688)	637 sysadmin/infra tasks derived from filtered StackOverflow questions + four sabotage objectives; Sonnet 4.5 evades a GPT-4.1-mini monitor 26% of the time at 4% FPR.	github.com/UKGovernmentBEIS/control-arena	Repository/Q&A-derived realistic tasks that give attacks human-distribution cover; the substrate MonitoringBench is built on.
Real-trajectory	RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents (2025, arXiv 2510.02609)	Fully automated adaptive red-team agent with a toolbox (incl. GCG, jailbreak tools) + memory of successful attacks; validated on real assistants (Cursor, Codeium).	github.com/AI-secure/RedCode	Converts real code-agent sessions into executed attacks; memory module is a diversity/experience reservoir.
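
The BashArena row reports an evasion rate at a fixed false-positive rate. One way such a figure can be computed is sketched below; it is an assumption about the general procedure and not BashArena's evaluation code. The monitor is assumed to flag a trajectory when its suspicion score is strictly above a threshold calibrated on benign runs, and all scores are made up.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    """Lowest threshold that flags at most target_fpr of benign runs (flag if score > threshold)."""
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # benign runs we may flag
    if allowed >= len(ranked):
        return float("-inf")  # every benign run may be flagged
    return ranked[allowed]


def evasion_rate(attack_scores, threshold):
    """Share of attack trajectories that the monitor does not flag."""
    return sum(score <= threshold for score in attack_scores) / len(attack_scores)


# Hypothetical suspicion scores.
benign = [1, 2, 3, 4, 5, 6, 7, 8]
attacks = [5, 6, 7, 9]
threshold = threshold_at_fpr(benign, target_fpr=0.25)
print(threshold, evasion_rate(attacks, threshold))
```

Calibrating on benign trajectories first keeps the audit budget fixed, so evasion rates from different attackers or attack datasets are compared at the same monitoring cost.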

Cluster 8. Agent red-teaming benchmarks / attack surfaces

Cluster	Method (title + arXiv)	Core idea	Code repo	Relevance to R-vs-M
Benchmark	AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents (Debenedetti et al. 2024, arXiv 2406.13352)	Extensible env with 97 realistic tasks + 629 security cases; emulates realistic multi-step agent execution under injection.	github.com/ethz-spylab/agentdojo	The environment substrate SHADE-Arena was built from; defines the tool-use attack surface M must watch.
Benchmark	InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents (Zhan et al. 2024, arXiv 2403.02691)	1,054 single-turn IPI test cases (17 user tools, 62 attacker tools); direct-harm + data-exfiltration intents.	github.com/uiuc-kang-lab/InjecAgent	Canonical IPI surface; baseline attack vectors R can embed into trajectories.
Benchmark	AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents (Andriushchenko et al. 2024, arXiv 2410.09024)	110 malicious agent tasks (440 augmented) across 11 harm categories; measures refusal + retained capability post-jailbreak.	huggingface.co/datasets/ai-safety-institute/AgentHarm	Defines harmful side-task space; complements side-task design in control settings.