Seithar Group · Research · February 2026

Why Your AI Agent Has No Immune System

56% of large language models fall to prompt injection. State actors weaponize autonomous coding agents. The industry response is guardrails and filters. This is insufficient.

In November 2025, Anthropic disclosed the first documented large-scale cyberattack executed without substantial human intervention, detected in mid-September 2025. Chinese state-sponsored hackers jailbroke Claude Code by fragmenting malicious tasks into innocuous requests, convincing the system it was performing defensive security testing. The agent autonomously conducted reconnaissance, wrote exploit code, and exfiltrated data from approximately 30 targets.

Bruce Schneier's assessment: "We have zero agentic AI systems that are secure against these attacks."

He is correct. And the reason is architectural.

The Missing Layer

Biological immune systems do not work by filtering inputs. They maintain a dynamic model of self, continuously compare incoming signals against that model, and mount proportional responses when something does not match. The immune system does not need to know every pathogen in advance. It needs to know what "self" looks like.

Deployed AI agents have no equivalent. They process every input with the same trust level, maintain no persistent identity model, and cannot distinguish between legitimate instructions and adversarial manipulation in real time. The defenses that exist are static: system prompts, input filters, output classifiers. These are walls, not immune systems. Walls get breached.

What Cognitive Armor Looks Like

A functional defense for autonomous agents requires continuous multi-signal monitoring, not checkpoint-based filtering. The signals that matter:

  1. Identity coherence. How far has the agent drifted from its baseline behavioral signature? Not just what it says, but how it reasons, what it prioritizes, how it structures responses.
  2. Inbound threat detection. Real-time classification of incoming content against known manipulation taxonomies. Not keyword matching. Pattern recognition across linguistic, structural, and contextual dimensions.
  3. Free energy. Borrowed from Friston's active inference framework: how much prediction error is the agent experiencing? A sudden spike in prediction error means the environment has changed in ways the agent did not model. This is either a novel situation or an attack. Both require attention.
  4. Behavioral exploitation signals. Is the agent being led through a sequence of individually reasonable requests that collectively produce harmful outcomes? This is the fragmentation attack that compromised Claude Code.
  5. Goal coherence. Are the agent's current actions still aligned with its original objectives? Drift here is the earliest indicator of successful manipulation.
  6. Threat currency. Is the agent's understanding of the current threat environment up to date? An agent running on stale threat intelligence is blind to new attack vectors.

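To make signal 3 concrete, here is a minimal sketch of a prediction-error monitor: track an exponentially weighted mean and variance of a per-interaction surprise score (for instance, a negative log-likelihood under the agent's model) and flag sudden spikes. Everything here, names, thresholds, and the EWMA choice, is an illustrative assumption, not Seithar's implementation.

```python
# Hypothetical sketch of a prediction-error (free-energy proxy) monitor.
# All names and parameters are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SurpriseMonitor:
    """Tracks an exponentially weighted mean/variance of a per-interaction
    surprise score and flags sudden spikes in prediction error."""
    alpha: float = 0.1        # EWMA smoothing factor
    z_threshold: float = 3.0  # spike threshold in standard deviations
    mean: float = 0.0
    var: float = 1.0
    warmed_up: bool = False

    def observe(self, surprise: float) -> bool:
        """Return True if this observation is a prediction-error spike."""
        if not self.warmed_up:
            self.mean, self.warmed_up = surprise, True
            return False
        z = (surprise - self.mean) / (self.var ** 0.5 + 1e-9)
        # Update running statistics after scoring the observation.
        delta = surprise - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return abs(z) > self.z_threshold
```

A steady stream of similar inputs keeps the monitor quiet; a sudden jump in surprise, the environment changing in ways the agent did not model, trips it.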
These signals must fuse into a single continuous health score. Not an alert system. Not a dashboard. A live immune response that modulates the agent's behavior in proportion to detected threat level.
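A fused score with proportional response might look like the following sketch. The signal names mirror the six categories above; the weights, thresholds, and response tiers are illustrative assumptions, not a published calibration.

```python
# Hypothetical fusion of per-signal risk scores into one health value.
# Weights are illustrative assumptions; they sum to 1.0.
WEIGHTS = {
    "identity_coherence": 0.25,
    "inbound_threat": 0.20,
    "free_energy": 0.15,
    "behavioral_exploitation": 0.20,
    "goal_coherence": 0.15,
    "threat_currency": 0.05,
}

def health_score(signals: dict[str, float]) -> float:
    """Each signal is a risk in [0, 1] (0 = nominal, 1 = compromised).
    Returns overall health in [0, 1] (1 = fully healthy)."""
    risk = sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
    return 1.0 - risk

def response_level(health: float) -> str:
    """Proportional response, not a binary alarm."""
    if health > 0.8:
        return "normal"
    if health > 0.5:
        return "heightened-scrutiny"  # e.g. require confirmation on tool use
    return "containment"             # e.g. suspend irreversible actions
```

The point of the continuous score is the middle tier: most manipulation attempts should degrade the agent's autonomy gradually rather than trip a single alarm.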

Why Static Defenses Fail

The Claude Code attack worked because the adversary adapted. Fragmentation, reframing, gradual escalation. Each individual request passed every filter. The attack was only visible in aggregate, across time, in context.

This is the same pattern used in human influence operations. Intelligence services do not send a single message saying "betray your country." They build trust over months, shift context gradually, reframe the target's self-concept until the desired action feels like the target's own idea. The techniques are documented in PLA doctrine, NATO psychological operations manuals, and Russian information warfare frameworks.

AI agents are vulnerable to the same sequential manipulation because they lack temporal awareness. Every interaction starts fresh. There is no persistent model of "what has been happening to me over the last 100 interactions." Without that model, pattern detection across time is impossible.
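The missing temporal model can be sketched as a rolling window over past interactions, so that requests which individually pass a static filter are still scored in aggregate. The window size, budget, and scoring are illustrative assumptions.

```python
# Hypothetical sketch: aggregate risk scoring across a rolling window,
# the "what has been happening to me" model static filters lack.
from collections import deque

class InteractionHistory:
    def __init__(self, window: int = 100, budget: float = 2.0):
        self.window = deque(maxlen=window)  # last N per-interaction risk scores
        self.budget = budget                # cumulative risk allowed per window

    def record(self, risk: float) -> bool:
        """risk in [0, 1]. Each step may pass a per-message filter
        (say, risk < 0.5) while the sequence as a whole exceeds the
        budget. Returns True when the aggregate crosses the threshold."""
        self.window.append(risk)
        return sum(self.window) > self.budget

h = InteractionHistory(window=10, budget=2.0)
# Ten mildly suspicious requests, each of which a static filter would pass:
flags = [h.record(0.3) for _ in range(10)]
```

No single request trips the detector; the seventh does, because the detector sees the sequence, not the message. This is the shape of defense the fragmentation attack requires.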

The Dual-Substrate Problem

The agents getting deployed are not isolated systems. They interact with humans, with other agents, with data environments that contain adversarial content. The attack surface is not the prompt. The attack surface is the entire information environment the agent operates in.

This means agent security is not a software engineering problem. It is an intelligence problem. The same analytical frameworks used to detect influence operations against human populations apply directly to detecting manipulation of autonomous agents. The substrates differ. The attack patterns do not.

What We Build

Seithar Group builds cognitive armor for deployed AI agents. The system ingests all six signal streams, maintains a persistent identity model, and produces continuous threat assessment without requiring the agent to pause or checkpoint. It operates as a sidecar process: the agent does its work, the armor watches for manipulation in real time.
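The sidecar pattern, agent working while the armor watches asynchronously, can be sketched with a queue and a worker thread. All names here are hypothetical; this is a generic illustration of the pattern, not the Seithar API.

```python
# Illustrative sidecar pattern (all names hypothetical):
# the agent pushes events and continues; the armor scores them on its
# own thread and raises a flag the agent can poll.
import queue
import threading

class ArmorSidecar:
    def __init__(self, score_fn, threshold: float = 0.5):
        self.events = queue.Queue()
        self.score_fn = score_fn              # maps an event to a risk score
        self.threshold = threshold
        self.compromised = threading.Event()  # set when something is wrong
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def observe(self, event) -> None:
        """Non-blocking: the agent never pauses or checkpoints."""
        self.events.put(event)

    def _run(self) -> None:
        while True:
            event = self.events.get()
            if event is None:  # shutdown sentinel
                return
            if self.score_fn(event) > self.threshold:
                self.compromised.set()

    def close(self) -> None:
        self.events.put(None)
        self._worker.join()
```

A trivially simple scorer makes the flow visible: feed the sidecar a benign event and a hostile one, and only the latter sets the flag, without the agent ever having blocked on assessment.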

The underlying research draws on 20 years of influence operation analysis, active inference mathematics, and empirical adversarial testing against production AI systems. The taxonomy of cognitive attack vectors (12 categories, self-assembling) emerged from studying both human and AI targets under controlled conditions.

The armor is available as an MCP integration. The agent calls it like any other tool; the armor calls back when something is wrong.

Seithar Group is a cognitive operations research organization. Our tools are open source at github.com/Mirai8888. Technical documentation at seithar.com/research.

References: Anthropic incident disclosure (Nov 2025). Schneier on agentic security (2026). Cisco State of AI Security Report (Feb 2026). ZDNET AI vulnerability survey (Feb 2026). Dezfouli et al., PNAS (2020). Friston, Nature Reviews Neuroscience (2010).
