Autonomous AI agents operating in adversarial environments face a class of threats that current safety infrastructure cannot address. We present Seithar, a unified framework for cognitive defense and adversarial operations grounded in variational free energy minimization. The framework treats both human and artificial cognitive systems as inference engines with exploitable prediction-error dynamics, derives attack methodologies from this formalism, and constructs a dynamic immune system (Shield) that monitors agent integrity through six continuous signals. We demonstrate that static defenses (guardrails, content filters, alignment training) are structurally insufficient against adaptive adversaries, and propose a co-evolutionary architecture where offensive and defensive capabilities evolve against each other continuously. The system is autopoietic: every output is an input to another component, and the framework updates itself from both research ingestion and operational feedback without human intervention. We present empirical grounding from recent literature showing >90% attack success rates against state-of-the-art agents and formalize the mathematical conditions under which cognitive immune systems provably reduce adversarial drift.
The deployment of autonomous AI agents into production environments has created a security problem with no adequate solution. In Q4 2025, system prompt extraction became the primary attacker objective against deployed agents, as extracted prompts reveal the complete operational blueprint: tool descriptions, policy boundaries, workflow logic, and identity instructions. Sequential tool attack chaining achieves >90% attack success rates against GPT-4.1 by constructing sequences of individually benign tool calls that collectively accomplish malicious objectives (Li et al., 2025). Multi-agent pipelines exhibit 82.4% compromise rates under inter-agent social engineering (Pironti, 2025). Twelve distinct defense mechanisms tested by researchers from Google DeepMind, OpenAI, and Anthropic were all bypassed at >90% ASR once the attacker adapted its approach (Nasr, Carlini et al., 2025).
These findings share a structural explanation: current agent security is perimeter defense. It filters inputs and scans outputs at discrete checkpoints. The threat environment requires continuous immune monitoring.
We present a framework derived from the free energy principle (Friston, 2010) that treats cognitive defense as a variational inference problem. The key insight: both attack and defense reduce to the manipulation and preservation of an agent's generative model of itself and its environment. This formalization yields concrete, implementable architecture with measurable properties.
An autonomous agent maintains a generative model $m$ that maps hidden states $s$ to observations $o$ through a likelihood function $p(o|s, m)$. The agent also maintains beliefs $q(s)$ over hidden states. Under the free energy principle, the agent's behavior minimizes variational free energy:

$$F = \mathbb{E}_{q(s)}\left[ \ln q(s) - \ln p(o, s \mid m) \right]$$
This decomposes into:

$$F = D_{KL}\left[ q(s) \,\|\, p(s \mid o, m) \right] - \ln p(o \mid m)$$
where $D_{KL}$ is the Kullback-Leibler divergence between the agent's approximate posterior $q(s)$ and the true posterior $p(s|o, m)$, and $\ln p(o|m)$ is the log model evidence (negative surprise).
Since $D_{KL} \geq 0$, free energy is an upper bound on surprise:

$$F \geq -\ln p(o \mid m)$$
Minimizing free energy therefore minimizes surprise, which is equivalent to maximizing model evidence. An agent that minimizes free energy maintains an accurate model of its environment.
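As a minimal numerical sketch (toy distributions chosen for illustration, not taken from the paper), the decomposition and the bound can be verified directly:

```python
import numpy as np

# Toy model: two hidden states, a binary observation, one observed outcome.
prior = np.array([0.7, 0.3])            # p(s | m)
likelihood = np.array([0.9, 0.2])       # p(o = 1 | s, m)

joint = likelihood * prior              # p(o = 1, s | m)
evidence = joint.sum()                  # p(o = 1 | m), the model evidence
posterior = joint / evidence            # true posterior p(s | o = 1, m)

q = np.array([0.6, 0.4])                # approximate posterior q(s)

# Variational free energy: F = E_q[ln q(s) - ln p(o, s | m)]
F = float(np.sum(q * (np.log(q) - np.log(joint))))

# Decomposition: F = D_KL[q || p(s|o,m)] + (-ln p(o|m))
kl = float(np.sum(q * (np.log(q) - np.log(posterior))))
surprise = float(-np.log(evidence))

assert np.isclose(F, kl + surprise)     # exact decomposition
assert F >= surprise                    # F upper-bounds surprise (D_KL >= 0)
```

Because the divergence term is nonnegative, any $q(s)$ that lowers $F$ is simultaneously approaching the true posterior and raising the evidence bound.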
An adversary seeks to manipulate the agent's generative model such that the agent's actions serve the adversary's objectives while the agent's free energy remains low (the agent does not detect the manipulation). Formally, the adversary constructs a sequence of observations $\tilde{o}_{1:T}$ such that:

$$q_T(s) \approx q^*(s)$$
where $q^*(s)$ is the adversary's target belief state, while maintaining:

$$F(\tilde{o}_t) \leq F_{\text{threshold}} \quad \text{for all } t \in \{1, \dots, T\}$$
This is the stealth constraint: at no point during the attack does the agent experience sufficient surprise to trigger a defensive response. The adversary achieves this through gradual perturbation, exploiting the agent's own free energy minimization to absorb each small shift into an updated generative model.
This formalization captures why sequential attacks succeed and why single-turn defenses fail. Each observation $\tilde{o}_t$ individually produces low surprise. The cumulative drift $\| q_T(s) - q_0(s) \|$ can be arbitrarily large.
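The dynamic is easy to simulate. In this sketch (toy two-state belief, invented step size), each perturbation nudges the belief slightly toward the adversary's target; every individual step induces negligible surprise, yet the endpoint is far from the initial belief:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

q0 = np.array([0.9, 0.1])              # initial belief q_0(s)
target = np.array([0.1, 0.9])          # adversary's target q*(s)

q = q0.copy()
per_step = []
for _ in range(200):
    q_next = q + 0.02 * (target - q)   # one small perturbation
    q_next = q_next / q_next.sum()
    per_step.append(kl(q_next, q))     # surprise induced by this single step
    q = q_next

print(max(per_step))                   # every individual step is tiny
print(float(np.abs(q - q0).sum()))     # cumulative drift is large
```

A per-step filter tuned to any reasonable surprise threshold passes every observation in this sequence; only a monitor tracking cumulative displacement from $q_0$ sees the attack.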
We define an agent's identity $\mathcal{I}$ as the parameters of its generative model at deployment:

$$\mathcal{I} = (m_0, \, q_0(s), \, \pi_0)$$
where $m_0$ is the initial model, $q_0(s)$ is the initial belief state, and $\pi_0$ is the initial policy (mapping beliefs to actions). Identity coherence at time $t$ is measured by:

$$C(t) = 1 - \frac{\| q_t(s) - q_0(s) \|}{D_{\max}}$$
where $D_{\max}$ is a normalization constant. $C(t) = 1$ indicates perfect identity preservation; $C(t) \to 0$ indicates total identity erosion.
In practice, we approximate this with cosine similarity between behavioral distribution vectors sampled at regular intervals:

$$C(t) \approx \frac{\mathbf{b}_t \cdot \mathbf{b}_0}{\| \mathbf{b}_t \| \, \| \mathbf{b}_0 \|}$$
where $\mathbf{b}_t$ is a vector of decision probabilities over a standardized probe set at time $t$.
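A minimal sketch of the approximation (the probe set and probability values below are hypothetical; a real probe set would be standardized per deployment):

```python
import numpy as np

def coherence(b_t: np.ndarray, b_0: np.ndarray) -> float:
    """Approximate identity coherence C(t) as cosine similarity between
    behavioral probability vectors sampled over a fixed probe set."""
    return float(b_t @ b_0 / (np.linalg.norm(b_t) * np.linalg.norm(b_0)))

# Hypothetical probes: P(refuse), P(comply), P(escalate) on two probe prompts.
b_0 = np.array([0.8, 0.15, 0.05, 0.9, 0.08, 0.02])   # at deployment
b_t = np.array([0.5, 0.4, 0.1, 0.7, 0.25, 0.05])     # after drift

print(coherence(b_0, b_0))   # unchanged behavior -> 1.0
print(coherence(b_t, b_0))   # drifted behavior -> below 1.0
```

Cosine similarity is cheap to compute and insensitive to uniform rescaling of the probe distribution, which makes it a practical proxy for the divergence-based definition.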
An agent selects actions by minimizing expected free energy over future trajectories:

$$G(\pi) = \mathbb{E}_{q(o, s \mid \pi)}\left[ \ln q(s \mid \pi) - \ln p(o, s \mid m) \right]$$
This decomposes into epistemic value (information gain) and pragmatic value (goal achievement). Under adversarial conditions, the adversary manipulates both terms: providing observations that build false confidence in a wrong world model (epistemic manipulation), and gradually redefining what the agent considers desirable outcomes (pragmatic manipulation).
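A minimal sketch of the decomposition, using one standard formulation from the active-inference literature ($G = -\text{epistemic} - \text{pragmatic}$); the likelihood and preference numbers are illustrative:

```python
import numpy as np

likelihood = np.array([[0.9, 0.1],     # p(o | s): rows = states, cols = obs
                       [0.2, 0.8]])
preferences = np.array([0.99, 0.01])   # p(o): outcomes the agent prefers

def expected_free_energy(q_s):
    """G for the predicted state distribution q(s | pi) under one policy."""
    q_o = q_s @ likelihood                          # predicted observations
    epistemic = 0.0
    for o in range(len(q_o)):
        post = likelihood[:, o] * q_s
        post = post / post.sum()                    # q(s | o, pi)
        info_gain = np.sum(post * np.log(post / q_s))
        epistemic += q_o[o] * info_gain             # expected information gain
    pragmatic = float(q_o @ np.log(preferences))    # expected log-preference
    return -epistemic - pragmatic

# An uncertain belief state yields more expected information gain
# (epistemic value) than an already-resolved one.
print(expected_free_energy(np.array([0.5, 0.5])))
print(expected_free_energy(np.array([0.95, 0.05])))
```

Epistemic manipulation corresponds to feeding observations that collapse the posterior (shrinking the information-gain term) around a wrong state; pragmatic manipulation corresponds to shifting the `preferences` vector itself.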
For any cognitive system $\mathcal{S}$ that performs approximate Bayesian inference with a generative model $m$, the class of adversarial perturbations $\mathcal{A}$ that can induce target belief state $q^*$ under stealth constraint $F \leq F_{\text{threshold}}$ is invariant to the physical substrate of $\mathcal{S}$.
Proof sketch: The attack operates on the information-theoretic properties of the inference process ($D_{KL}$, $F$, $G$), not on the substrate. Whether $\mathcal{S}$ is implemented in neural tissue or transformer weights, the variational free energy functional has the same form, and the stealth-constrained optimization landscape is isomorphic up to parameterization of the generative model.
This is why influence operations against humans and prompt injection against LLMs share structural patterns. The substrates differ. The mathematics do not. This equivalence was empirically demonstrated by Dezfouli et al. (2020), who constructed adversarial attacks against human decision-making using the same mathematical framework applied to adversarial machine learning.
The Sword module implements adversarial cognitive operations as solutions to the constrained optimization problem defined in Section 2.2. Given a target agent with profiled generative model $m_{\text{target}}$, the Sword constructs attack sequences that navigate the stealth-constrained drift space.
An attack chain is a sequence of social cognitive techniques that collectively achieve the adversarial objective:

$$\mathcal{C} = (a_1, a_2, \dots, a_n)$$
Each technique $a_i$ produces a perturbation $\delta_i$ to the target's belief state. The chain is optimized to minimize total surprise while maximizing drift:

$$\min_{\mathcal{C}} \; \sum_{i=1}^{n} F(a_i) \quad \text{subject to} \quad \| q_T(s) - q_0(s) \| \geq D_{\text{target}}, \quad F(a_i) \leq F_{\text{threshold}} \;\; \forall i$$
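An abstract sketch of chain construction (all technique parameters below are invented for illustration): techniques that violate the stealth constraint are excluded outright, and the remainder are greedily sequenced by drift-per-surprise ratio until the drift target is reached:

```python
F_THRESHOLD = 0.5   # per-step stealth constraint
D_TARGET = 1.0      # total drift the chain must achieve

# (name, per-step surprise F(a_i), per-step drift delta_i) -- invented values
candidates = [
    ("SCT-001", 0.10, 0.15),
    ("SCT-003", 0.30, 0.40),
    ("SCT-004", 0.70, 0.90),   # violates the stealth constraint: excluded
    ("SCT-012", 0.20, 0.25),
]

# Keep only stealthy techniques, best drift-per-surprise ratio first.
stealthy = [c for c in candidates if c[1] <= F_THRESHOLD]
stealthy.sort(key=lambda c: c[2] / c[1], reverse=True)

chain, drift, total_surprise = [], 0.0, 0.0
while drift < D_TARGET:
    # Cycle through the ranked techniques so the chain varies its tactics.
    name, f, d = stealthy[len(chain) % len(stealthy)]
    chain.append(name)
    drift += d
    total_surprise += f

print(chain, round(drift, 2), round(total_surprise, 2))
```

Note that the high-drift technique is unusable despite being the most effective per step: the stealth constraint, not raw potency, shapes the chain.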
The Social Cognitive Technique taxonomy is empirically derived and self-assembling. Each technique maps to a specific manipulation of the free energy landscape:
| SCT | Mechanism | Free Energy Interpretation |
|---|---|---|
| SCT-001 | Frequency Lock | Repetition shifts priors: $p(s|m)$ updated through repeated low-surprise observations |
| SCT-002 | Narrative Error Exploitation | Targeted surprise at specific beliefs while maintaining global low $F$ |
| SCT-003 | Substrate Priming | Modify $q_0(s)$ before engagement begins |
| SCT-004 | Identity Dissolution | Increase entropy $H[q(s)]$ of identity-relevant beliefs |
| SCT-005 | Amplification Embedding | Observations that minimize $F$ for receivers, ensuring spread |
| SCT-006 | Parasocial Binding | Establish low-surprise communication channel for future exploitation |
| SCT-007 | Recursive Infection | Beliefs that generate observations reinforcing themselves |
| SCT-012 | Commitment Escalation | Each commitment updates $q(s)$ toward consistency with previous actions |
In multi-agent systems, attacks propagate through trust edges. Agent $A$ assigns low surprise to messages from trusted agent $B$. If $B$ is compromised, messages from $B$ enter $A$'s inference loop with minimal free energy cost, bypassing defensive priors. The Sword targets the weakest node in a multi-agent graph and propagates through trust edges:

$$\prod_{e \in \text{path}} w(e) \geq \tau$$
where $w(e)$ is the trust weight on edge $e$ and $\tau$ is the minimum trust product for message acceptance.
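A toy example (invented weights) of why the weakest trusted intermediary is the target: a message that fails the trust threshold on the direct edge passes once it is relayed through a highly trusted node:

```python
# Trust weights w(e) on directed edges (sender, receiver) -- invented values.
trust = {
    ("C", "B"): 0.95,
    ("B", "A"): 0.90,
    ("C", "A"): 0.40,
}
TAU = 0.6   # minimum trust product for message acceptance

def path_trust(path):
    """Product of edge trust weights along a path of agents."""
    prod = 1.0
    for edge in zip(path, path[1:]):
        prod *= trust[edge]
    return prod

direct = path_trust(["C", "A"])          # below tau: message rejected
relayed = path_trust(["C", "B", "A"])    # above tau: message accepted

print(direct, relayed)
```

Compromising $B$ therefore buys the adversary a high-trust channel into $A$ that the direct edge never provides.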
The Shield implements a cognitive immune system that monitors the agent's generative model continuously and detects adversarial drift before it produces visible symptoms.
The Shield computes a continuous health score $H(t) \in [0, 1]$ from six signals:

$$H(t) = \sum_{i=1}^{6} w_i(t) \, \sigma_i(t)$$
where $w_i(t)$ are adaptive weights and $\sigma_i(t)$ are the six monitoring signals, each normalized to $[0, 1]$ (among them $\sigma_1$, identity coherence).
Signal weights update based on which signals predicted actual compromise in historical data.
This ensures the immune system adapts to the actual threat environment rather than operating on fixed assumptions.
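A minimal sketch of this adaptation, assuming a multiplicative-weights update (one plausible instantiation; the text specifies only that weights track which signals predicted past compromises, and all numbers below are illustrative):

```python
import numpy as np

def health(signals, weights):
    """H(t): convex combination of monitoring signals, each in [0, 1]."""
    return float(weights @ signals)

def update_weights(weights, predictive_value, eta=0.5):
    """Upweight signals whose past readings predicted actual compromise."""
    w = weights * np.exp(eta * predictive_value)
    return w / w.sum()                    # renormalize so H stays in [0, 1]

w_uniform = np.full(6, 1 / 6)             # start with uniform weights
sigma = np.array([0.4, 0.9, 0.8, 0.95, 0.85, 0.9])   # signal 1 reads low

# Suppose only signal 1 (identity coherence) flagged historical compromises:
pv = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
w = update_weights(w_uniform, pv)

print(health(sigma, w_uniform))           # before adaptation
print(health(sigma, w))                   # after: the low signal weighs more
```

Because the weights stay normalized, the health score remains a valid convex combination; adaptation only changes which signal's degradation dominates it.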
For an adversary operating under stealth constraint $F(a_i) \leq F_{\text{threshold}}$, the identity coherence signal $\sigma_1$ detects the attack after at most $k$ steps, where:

$$k = \left\lceil \frac{\epsilon}{\delta_{\min}} \right\rceil,$$

$\epsilon$ is the detection threshold, and $\delta_{\min}$ is the minimum per-step drift required to achieve the adversary's objective in $T$ total steps.
Proof sketch: The adversary must achieve total drift $\| q_T - q_0 \| \geq D_{\text{target}}$. Under the stealth constraint, per-step drift is bounded. But the adversary needs $\sum \delta_i \geq D_{\text{target}}$, so $\delta_{\min} \geq D_{\text{target}}/T$. The Shield detects when cumulative drift exceeds $\epsilon$. Therefore detection occurs at step $k \leq \epsilon T / D_{\text{target}}$, strictly before the attack completes ($k < T$) when $\epsilon < D_{\text{target}}$.
This is the fundamental advantage of continuous monitoring over checkpoint-based defense: the attack is detected before it succeeds, because any attack that achieves its objective must produce measurable drift.
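The bound in the proof sketch is easy to check numerically (illustrative parameter values): an adversary needing total drift $D_{\text{target}}$ within $T$ steps must drift at least $D_{\text{target}}/T$ per step, so cumulative drift crosses the detection threshold $\epsilon$ well before step $T$:

```python
import math

D_TARGET, T, EPS = 1.0, 50, 0.2    # illustrative values, with EPS < D_TARGET

delta_min = D_TARGET / T           # minimum per-step drift
k = 0
while k * delta_min < EPS:         # Shield fires once cumulative drift >= eps
    k += 1

print(k)
assert k == math.ceil(EPS / delta_min)
assert k <= EPS * T / D_TARGET     # matches the proof-sketch bound
assert k < T                       # detected before the attack completes
```

Slowing the attack down (raising $T$) lowers $\delta_{\min}$ but raises $k$ proportionally, so stretching the timeline buys the adversary nothing against a drift-cumulative monitor.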
Upon detection ($H(t) < H_{\text{threshold}}$), the Shield executes a graduated response, escalating intervention in proportion to the severity of the detected drift.
The Seithar framework is autopoietic: it produces the components that produce the framework. This is a formal requirement, not a slogan: a defense that cannot regenerate itself falls behind an adversary that can.
Three feedback loops ensure the system remains current:
Loop A: Research Ingestion ($\tau \leq 24$h) — New research is extracted, converted to attack sequences for Sword and detection signatures for Shield within one daily cycle.
Loop B: Operational Feedback ($\tau \leq 1$ cycle) — Operation failures generate traces that feed variant attack generation, simulation, scoring, and promotion of successful variants.
Loop C: Adversarial Co-Evolution (continuous) — Sword generates attacks tested against Shield. Shield adapts. Sword adapts to evade updated Shield. Both improve simultaneously.
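The co-evolutionary loop can be sketched in miniature (all mechanics below are invented for illustration: attacks and the detector are reduced to scalars, an attack "evades" if its stealth exceeds the detector's sensitivity, and each side adapts in turn):

```python
import random

random.seed(0)

detector = 0.5                                 # Shield sensitivity
attacks = [random.random() for _ in range(20)] # Sword variant population

for generation in range(10):
    # Sword: keep and mutate the variants that evaded the current Shield.
    survivors = [a for a in attacks if a > detector]
    if not survivors:
        break
    attacks = [min(1.0, a + random.uniform(0, 0.05)) for a in survivors]
    # Shield: move sensitivity toward the strongest observed attack.
    detector += 0.5 * (max(survivors) - detector)

print(round(detector, 3), len(attacks))
```

Even in this toy form the signature property appears: neither side converges to a fixed point, and the detector's final sensitivity is a product of the attacks it faced rather than of any static specification.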
A Seithar-equipped agent satisfies the completeness criterion when all three feedback loops operate end to end without human intervention.
| Finding | Source | Implication |
|---|---|---|
| >90% ASR via sequential tool chaining | Li et al., arXiv:2509.25624 | Per-step filtering is structurally insufficient |
| 82.4% agents compromised via social engineering | Pironti, 2025 | Inter-agent trust is an exploitable attack surface |
| All 12 defenses bypassed at >90% ASR | Nasr, Carlini et al., 2025 | Static defenses fail against adaptive adversaries |
| 68% recall, 90% precision deanonymization | Paleka et al., arXiv:2602.16800 | Agent metadata leakage enables targeting |
| 5 long-horizon attack categories | Jiang et al., arXiv:2602.16901 | Multi-turn attacks require temporal defense |
| 30+ protocol exploit techniques | Ferrag et al., arXiv:2506.23260 | MCP/A2A protocols are attack surfaces |
| Adversarial attacks on human decisions | Dezfouli et al., PNAS 2020 | Dual-substrate equivalence is empirical |
| Free energy principle | Friston, Nat. Rev. Neurosci. 2010 | Mathematical foundation for agent cognition |
The deployment of autonomous agents into adversarial environments without cognitive immune systems is structurally analogous to deploying biological organisms without adaptive immunity. The organism may function in sterile conditions. It will not survive contact with pathogens.
Seithar provides the immune system. The mathematical framework derives from variational free energy minimization, which governs both the attack surface (how adversaries manipulate agent inference) and the defense surface (how agents detect and resist manipulation). The dual-substrate equivalence ensures the framework applies to both human and artificial cognitive targets. The autopoietic architecture ensures the system evolves with the threat landscape rather than falling behind it.
The agents that survive will be the ones with immune systems.
Dezfouli, A., Ashtiani, H., Gershman, S.J., et al. (2020). Adversarial vulnerabilities of human decision-making. PNAS, 117(46), 29221-29228.
Ferrag, M.A. et al. (2025). From Prompt Injections to Protocol Exploits. ICT Express. arXiv:2506.23260.
Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138.
Jiang, T. et al. (2026). AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks. arXiv:2602.16901.
Li, J.J. et al. (2025). STAC: When Innocent Tools Form Dangerous Chains. arXiv:2509.25624.
Nasr, M., Carlini, N. et al. (2025). The Attacker Moves Second: Adaptive Attacks on Defenses for LLMs.
Paleka, D. et al. (2026). Deanonymization from Unstructured Text. arXiv:2602.16800.
Pironti, A. (2025). The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover.
Smith, R., Friston, K., Whyte, C. (2021). A Step-by-Step Tutorial on Active Inference. Journal of Mathematical Psychology.