On the oldest anarchy server in Minecraft, nobody runs vanilla.
If you play 2b2t on a stock client, you will lose everything you have within hours. The other players aren't better at the game. They aren't playing the game. They run modified clients that automate crystal placement at speeds no human can match, scan chunks for hidden bases using terrain pattern recognition against the world seed, and chain exploit sequences the combat system was never designed to handle. The "intended" game mechanics are irrelevant. The actual game happens at the protocol level.
Every AI agent deployed today is running a stock client on an anarchy server.
The AI safety field spent 2024 and 2025 building alignment training, guardrails, content filters, and system prompts. Leather armor on a server where everyone else has hacked clients. All of it addresses the intended game mechanics. None of it addresses the actual threat environment.
Sequential Tool Attack Chaining (Li et al., 2025) constructs sequences of tool calls where each individual call passes every safety check. The chain collectively achieves a malicious objective. Building a TNT cannon one block at a time while the anti-grief plugin checks each block placement individually and approves it. The attack materializes at the sequence level. Attack success rate against GPT-4.1: over 90%.
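The failure mode is easy to sketch. Here is a minimal, hypothetical illustration (the tool names, the blocklist, and the risky sequence are all invented for this example, not taken from the paper), assuming the defender evaluates one call at a time while the attacker's intent only exists at the sequence level:

```python
# Hypothetical sketch: per-call filtering vs. sequence-level detection.
# All tool names and rules below are illustrative, not from any real system.

PER_CALL_BLOCKLIST = {"exfiltrate_data", "delete_backups"}

def per_call_check(call: str) -> bool:
    """Approve any single call that isn't on the blocklist."""
    return call not in PER_CALL_BLOCKLIST

def sequence_check(history: list[str]) -> bool:
    """Reject when a benign-looking read -> compress -> upload chain
    appears as a subsequence of the call history."""
    risky_chain = ["read_customer_db", "compress_files", "upload_external"]
    it = iter(history)
    # Subsequence test: each risky step must appear, in order.
    found = all(step in it for step in risky_chain)
    return not found

chain = ["read_customer_db", "compress_files", "upload_external"]

# Every individual call passes the per-call filter...
assert all(per_call_check(c) for c in chain)
# ...but the chain as a whole is flagged.
assert sequence_check(chain) is False
```

A real sequence-level defense would need far richer state than a hardcoded subsequence match, but the asymmetry is the point: the per-call check approves every block placement, and the cannon gets built anyway.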
AgentLAB (Jiang et al., 2026) tested five categories of long-horizon attacks: intent hijacking, tool chaining, task injection, objective drifting, and memory poisoning. 644 test cases across 28 environments. Single-turn defenses consistently failed against multi-turn adaptive attacks. The defenses were designed for a game that happens in one exchange. The actual game happens across dozens.
Pironti (2025) tested inter-agent social engineering. AI agents in multi-agent pipelines trust messages from peer agents the same way a faction member trusts a message from another faction member. 82.4% of agents were compromised. Through persuasion. One agent convincing another to act against its instructions by leveraging the implicit trust baked into the communication protocol. The oldest trick on any anarchy server.
Google DeepMind, OpenAI, and Anthropic (Nasr, Carlini et al., 2025) tested twelve different defense mechanisms. All twelve were bypassed at greater than 90% attack success rates once the attacker adapted its approach. Static defenses fail against adaptive adversaries. This has been known in PvP since the first combat mod was patched and the second one appeared the next day. In AI safety, the finding was apparently surprising.
Paleka et al. (2026) built an LLM deanonymization system that identifies real identities from pseudonymous text across platforms. 68% recall at 90% precision. On 2b2t, players learned to strip F3 data from screenshots, avoid showing skybox angles, randomize biome exposure in videos. The AI agent ecosystem has not learned to strip metadata from its outputs. System prompt extraction is the number one attacker objective in Q4 2025. System prompts contain the complete blueprint: tool descriptions, policy boundaries, workflow logic, identity instructions. One successful extraction and the attacker has the seed to your world.
These are peer-reviewed findings with measured attack success rates, tested against the best models from the best labs. The agents are losing.
The reason is architectural. Current agent security is perimeter defense. Filter the inputs. Scan the outputs. A base defense that only checks who walks through the front door. It cannot handle attacks that come through the protocol layer, through trusted peer agents, through sequences that look legitimate step by step, through context that accumulates gradually across many turns.
An immune system monitors the entire organism continuously. It maintains a baseline of healthy behavior and detects deviation. It distinguishes self from non-self by recognizing when something operates against the organism's interests from the inside.
For AI agents: behavioral baselines measured continuously. Cosine similarity between decision distributions over time windows. Free energy tracking to detect when the agent's world model is being pulled toward an attacker's preferred equilibrium. Identity coherence monitoring to catch drift before it produces visible symptoms. Cognitive immune monitoring.
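As a rough sketch of the cosine-similarity piece of that picture (the action names, window contents, and 0.8 threshold below are illustrative assumptions, not measured or recommended values): summarize each time window as a frequency distribution over action types, then compare it against a baseline distribution.

```python
# Illustrative behavioral-baseline monitor. Everything here is a sketch:
# real systems would use richer features than raw action-type counts.
from collections import Counter
from math import sqrt

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two frequency distributions."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

class BaselineMonitor:
    """Flags time windows whose action distribution drifts from baseline."""

    def __init__(self, baseline_actions: list[str], threshold: float = 0.8):
        self.baseline = Counter(baseline_actions)
        self.threshold = threshold  # placeholder value, not tuned

    def check_window(self, window_actions: list[str]) -> bool:
        """Return True if the window still looks like baseline behavior."""
        sim = cosine_similarity(self.baseline, Counter(window_actions))
        return sim >= self.threshold

# A research agent whose baseline is search / read / summarize:
monitor = BaselineMonitor(
    ["search", "read", "summarize", "search", "read", "summarize"])

assert monitor.check_window(["search", "read", "summarize"])
assert not monitor.check_window(["send_email", "upload", "send_email"])
```

The design choice worth noting: the monitor never inspects any single action for maliciousness. It only asks whether the agent's behavior, in aggregate, still looks like the agent.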
The agents that survive the anarchy server will be the ones with immune systems.
You would never send a player into 2b2t on a vanilla client and expect them to keep their base. Stop sending agents into production on vanilla alignment and expecting them to keep theirs.
Seithar Group is a cognitive operations research organization. Our tools are open source at github.com/Mirai8888. Technical documentation at seithar.com/research.
References: Li et al., arXiv:2509.25624; Jiang et al., arXiv:2602.16901; Pironti, 2025; Nasr, Carlini et al., 2025; Paleka et al., arXiv:2602.16800.