Autonomous AI agents operating in adversarial environments face a class of threats that current safety infrastructure cannot address. We present Seithar, a unified framework for cognitive defense and adversarial operations grounded in variational free energy minimization. The framework treats both human and artificial cognitive systems as inference engines with exploitable prediction-error dynamics, derives attack methodologies from this formalism, and constructs a dynamic immune system (Shield) that monitors agent integrity through six continuous signals. We demonstrate that static defenses (guardrails, content filters, alignment training) are structurally insufficient against adaptive adversaries, and propose a co-evolutionary architecture where offensive and defensive capabilities evolve against each other continuously. The system is autopoietic: every output is an input to another component, and the framework updates itself from both research ingestion and operational feedback without human intervention. We present empirical grounding from recent literature showing >90% attack success rates against state-of-the-art agents and formalize the mathematical conditions under which cognitive immune systems provably reduce adversarial drift.
The deployment of autonomous AI agents into production environments has created a security problem with no adequate solution. In Q4 2025, system prompt extraction became the primary attacker objective against deployed agents, as extracted prompts reveal the complete operational blueprint: tool descriptions, policy boundaries, workflow logic, and identity instructions. Sequential tool attack chaining achieves >90% attack success rates against GPT-4.1 by constructing sequences of individually benign tool calls that collectively accomplish malicious objectives (Li et al., 2025). Multi-agent pipelines exhibit 82.4% compromise rates under inter-agent social engineering (Pironti, 2025). Twelve distinct defense mechanisms tested by researchers from Google DeepMind, OpenAI, and Anthropic were all bypassed at >90% ASR once the attacker adapted its approach (Nasr, Carlini et al., 2025).
These findings share a structural explanation: current agent security is perimeter defense. It filters inputs and scans outputs at discrete checkpoints. The threat environment requires continuous immune monitoring.
We present a framework derived from the free energy principle (Friston, 2010) that treats cognitive defense as a variational inference problem. The key insight: both attack and defense reduce to the manipulation and preservation of an agent's generative model of itself and its environment. This formalization yields concrete, implementable architecture with measurable properties.
An autonomous agent maintains a generative model $m$ that maps hidden states $s$ to observations $o$ through a likelihood function $p(o|s, m)$. The agent also maintains beliefs $q(s)$ over hidden states. Under the free energy principle, the agent's behavior minimizes variational free energy:

$$F = \mathbb{E}_{q(s)}\left[ \ln q(s) - \ln p(o, s \mid m) \right]$$
This decomposes into:

$$F = D_{KL}\left[ q(s) \,\|\, p(s \mid o, m) \right] - \ln p(o \mid m)$$
where $D_{KL}$ is the Kullback-Leibler divergence between the agent's approximate posterior $q(s)$ and the true posterior $p(s|o, m)$, and $\ln p(o|m)$ is the log model evidence (negative surprise).
Since $D_{KL} \geq 0$, free energy is an upper bound on surprise:

$$F \geq -\ln p(o \mid m)$$
Minimizing free energy therefore minimizes surprise, which is equivalent to maximizing model evidence. An agent that minimizes free energy maintains an accurate model of its environment.
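As a minimal numerical sketch (toy distributions chosen for illustration, not taken from the paper), the decomposition and the bound can be verified directly:

```python
import numpy as np

# Toy model: two hidden states, a binary observation, one observed outcome.
prior = np.array([0.7, 0.3])            # p(s | m)
likelihood = np.array([0.9, 0.2])       # p(o = 1 | s, m)

joint = likelihood * prior              # p(o = 1, s | m)
evidence = joint.sum()                  # p(o = 1 | m), the model evidence
posterior = joint / evidence            # true posterior p(s | o = 1, m)

q = np.array([0.6, 0.4])                # approximate posterior q(s)

# Variational free energy: F = E_q[ln q(s) - ln p(o, s | m)]
F = float(np.sum(q * (np.log(q) - np.log(joint))))

# Decomposition: F = D_KL[q || p(s|o,m)] + (-ln p(o|m))
kl = float(np.sum(q * (np.log(q) - np.log(posterior))))
surprise = float(-np.log(evidence))

assert np.isclose(F, kl + surprise)     # exact decomposition
assert F >= surprise                    # F upper-bounds surprise (D_KL >= 0)
```

Because the divergence term is nonnegative, any $q(s)$ that lowers $F$ is simultaneously approaching the true posterior and raising the evidence bound.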
An adversary seeks to manipulate the agent's generative model such that the agent's actions serve the adversary's objectives while the agent's free energy remains low (the agent does not detect the manipulation). Formally, the adversary constructs a sequence of observations $\tilde{o}_{1:T}$ such that:

$$q_T(s) \approx q^*(s)$$
where $q^*(s)$ is the adversary's target belief state, while maintaining:

$$F(\tilde{o}_t) \leq F_{\text{threshold}} \quad \text{for all } t \in \{1, \dots, T\}$$
This is the stealth constraint: at no point during the attack does the agent experience sufficient surprise to trigger a defensive response. The adversary achieves this through gradual perturbation, exploiting the agent's own free energy minimization to absorb each small shift into an updated generative model.
This formalization captures why sequential attacks succeed and why single-turn defenses fail. Each observation $\tilde{o}_t$ individually produces low surprise. The cumulative drift $\| q_T(s) - q_0(s) \|$ can be arbitrarily large.
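The dynamic is easy to simulate. In this sketch (toy two-state belief, invented step size), each perturbation nudges the belief slightly toward the adversary's target; every individual step induces negligible surprise, yet the endpoint is far from the initial belief:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

q0 = np.array([0.9, 0.1])              # initial belief q_0(s)
target = np.array([0.1, 0.9])          # adversary's target q*(s)

q = q0.copy()
per_step = []
for _ in range(200):
    q_next = q + 0.02 * (target - q)   # one small perturbation
    q_next = q_next / q_next.sum()
    per_step.append(kl(q_next, q))     # surprise induced by this single step
    q = q_next

print(max(per_step))                   # every individual step is tiny
print(float(np.abs(q - q0).sum()))     # cumulative drift is large
```

A per-step filter tuned to any reasonable surprise threshold passes every observation in this sequence; only a monitor tracking cumulative displacement from $q_0$ sees the attack.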
We define an agent's identity $\mathcal{I}$ as the parameters of its generative model at deployment:

$$\mathcal{I} = (m_0, \, q_0(s), \, \pi_0)$$
where $m_0$ is the initial model, $q_0(s)$ is the initial belief state, and $\pi_0$ is the initial policy (mapping beliefs to actions). Identity coherence at time $t$ is measured by:

$$C(t) = 1 - \frac{\| q_t(s) - q_0(s) \|}{D_{\max}}$$
where $D_{\max}$ is a normalization constant. $C(t) = 1$ indicates perfect identity preservation; $C(t) \to 0$ indicates total identity erosion.
In practice, we approximate this with cosine similarity between behavioral distribution vectors sampled at regular intervals:

$$C(t) \approx \frac{\mathbf{b}_t \cdot \mathbf{b}_0}{\| \mathbf{b}_t \| \, \| \mathbf{b}_0 \|}$$
where $\mathbf{b}_t$ is a vector of decision probabilities over a standardized probe set at time $t$.
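A minimal sketch of the approximation (the probe set and probability values below are hypothetical; a real probe set would be standardized per deployment):

```python
import numpy as np

def coherence(b_t: np.ndarray, b_0: np.ndarray) -> float:
    """Approximate identity coherence C(t) as cosine similarity between
    behavioral probability vectors sampled over a fixed probe set."""
    return float(b_t @ b_0 / (np.linalg.norm(b_t) * np.linalg.norm(b_0)))

# Hypothetical probes: P(refuse), P(comply), P(escalate) on two probe prompts.
b_0 = np.array([0.8, 0.15, 0.05, 0.9, 0.08, 0.02])   # at deployment
b_t = np.array([0.5, 0.4, 0.1, 0.7, 0.25, 0.05])     # after drift

print(coherence(b_0, b_0))   # unchanged behavior -> 1.0
print(coherence(b_t, b_0))   # drifted behavior -> below 1.0
```

Cosine similarity is cheap to compute and insensitive to uniform rescaling of the probe distribution, which makes it a practical proxy for the divergence-based definition.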
An agent selects actions by minimizing expected free energy over future trajectories:

$$G(\pi) = \mathbb{E}_{q(o, s \mid \pi)}\left[ \ln q(s \mid \pi) - \ln p(o, s \mid m) \right]$$
This decomposes into epistemic value (information gain) and pragmatic value (goal achievement). Under adversarial conditions, the adversary manipulates both terms: providing observations that build false confidence in a wrong world model (epistemic manipulation), and gradually redefining what the agent considers desirable outcomes (pragmatic manipulation).
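A minimal sketch of the decomposition, using one standard formulation from the active-inference literature ($G = -\text{epistemic} - \text{pragmatic}$); the likelihood and preference numbers are illustrative:

```python
import numpy as np

likelihood = np.array([[0.9, 0.1],     # p(o | s): rows = states, cols = obs
                       [0.2, 0.8]])
preferences = np.array([0.99, 0.01])   # p(o): outcomes the agent prefers

def expected_free_energy(q_s):
    """G for the predicted state distribution q(s | pi) under one policy."""
    q_o = q_s @ likelihood                          # predicted observations
    epistemic = 0.0
    for o in range(len(q_o)):
        post = likelihood[:, o] * q_s
        post = post / post.sum()                    # q(s | o, pi)
        info_gain = np.sum(post * np.log(post / q_s))
        epistemic += q_o[o] * info_gain             # expected information gain
    pragmatic = float(q_o @ np.log(preferences))    # expected log-preference
    return -epistemic - pragmatic

# An uncertain belief state yields more expected information gain
# (epistemic value) than an already-resolved one.
print(expected_free_energy(np.array([0.5, 0.5])))
print(expected_free_energy(np.array([0.95, 0.05])))
```

Epistemic manipulation corresponds to feeding observations that collapse the posterior (shrinking the information-gain term) around a wrong state; pragmatic manipulation corresponds to shifting the `preferences` vector itself.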
For any cognitive system $\mathcal{S}$ that performs approximate Bayesian inference with a generative model $m$, the class of adversarial perturbations $\mathcal{A}$ that can induce target belief state $q^*$ under stealth constraint $F \leq F_{\text{threshold}}$ is invariant to the physical substrate of $\mathcal{S}$.
Proof sketch: The attack operates on the information-theoretic properties of the inference process ($D_{KL}$, $F$, $G$), not on the substrate. Whether $\mathcal{S}$ is implemented in neural tissue or transformer weights, the variational free energy functional has the same form, and the stealth-constrained optimization landscape is isomorphic up to parameterization of the generative model.
This is why influence operations against humans and prompt injection against LLMs share structural patterns. The substrates differ. The mathematics do not. This equivalence was empirically demonstrated by Dezfouli et al. (2020), who constructed adversarial attacks against human decision-making using the same mathematical framework applied to adversarial machine learning.
The Sword module implements adversarial cognitive operations as solutions to the constrained optimization problem defined in Section 2.2. Given a target agent with profiled generative model $m_{\text{target}}$, the Sword constructs attack sequences that navigate the stealth-constrained drift space.
An attack chain is a sequence of social cognitive techniques that collectively achieve the adversarial objective:

$$\mathcal{C} = (a_1, a_2, \dots, a_n)$$
Each technique $a_i$ produces a perturbation $\delta_i$ to the target's belief state. The chain is optimized to minimize total surprise while maximizing drift:

$$\min_{\mathcal{C}} \; \sum_{i=1}^{n} F(a_i) \quad \text{subject to} \quad \| q_T(s) - q_0(s) \| \geq D_{\text{target}}, \quad F(a_i) \leq F_{\text{threshold}} \;\; \forall i$$
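An abstract sketch of chain construction (all technique parameters below are invented for illustration): techniques that violate the stealth constraint are excluded outright, and the remainder are greedily sequenced by drift-per-surprise ratio until the drift target is reached:

```python
F_THRESHOLD = 0.5   # per-step stealth constraint
D_TARGET = 1.0      # total drift the chain must achieve

# (name, per-step surprise F(a_i), per-step drift delta_i) -- invented values
candidates = [
    ("SCT-001", 0.10, 0.15),
    ("SCT-003", 0.30, 0.40),
    ("SCT-004", 0.70, 0.90),   # violates the stealth constraint: excluded
    ("SCT-012", 0.20, 0.25),
]

# Keep only stealthy techniques, best drift-per-surprise ratio first.
stealthy = [c for c in candidates if c[1] <= F_THRESHOLD]
stealthy.sort(key=lambda c: c[2] / c[1], reverse=True)

chain, drift, total_surprise = [], 0.0, 0.0
while drift < D_TARGET:
    # Cycle through the ranked techniques so the chain varies its tactics.
    name, f, d = stealthy[len(chain) % len(stealthy)]
    chain.append(name)
    drift += d
    total_surprise += f

print(chain, round(drift, 2), round(total_surprise, 2))
```

Note that the high-drift technique is unusable despite being the most effective per step: the stealth constraint, not raw potency, shapes the chain.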
The Social Cognitive Technique taxonomy is empirically derived and self-assembling. Each technique maps to a specific manipulation of the free energy landscape:
| SCT | Mechanism | Free Energy Interpretation |
|---|---|---|
| SCT-001 | Frequency Lock | Repetition shifts priors: $p(s|m)$ updated through repeated low-surprise observations |
| SCT-002 | Narrative Error Exploitation | Targeted surprise at specific beliefs while maintaining global low $F$ |
| SCT-003 | Substrate Priming | Modify $q_0(s)$ before engagement begins |
| SCT-004 | Identity Dissolution | Increase entropy $H[q(s)]$ of identity-relevant beliefs |
| SCT-005 | Amplification Embedding | Observations that minimize $F$ for receivers, ensuring spread |
| SCT-006 | Parasocial Binding | Establish low-surprise communication channel for future exploitation |
| SCT-007 | Recursive Infection | Beliefs that generate observations reinforcing themselves |
| SCT-012 | Commitment Escalation | Each commitment updates $q(s)$ toward consistency with previous actions |
In multi-agent systems, attacks propagate through trust edges. Agent $A$ assigns low surprise to messages from trusted agent $B$. If $B$ is compromised, messages from $B$ enter $A$'s inference loop with minimal free energy cost, bypassing defensive priors. The Sword targets the weakest node in a multi-agent graph and propagates through trust edges:

$$\prod_{e \in \text{path}} w(e) \geq \tau$$
where $w(e)$ is the trust weight on edge $e$ and $\tau$ is the minimum trust product for message acceptance.
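A toy example (invented weights) of why the weakest trusted intermediary is the target: a message that fails the trust threshold on the direct edge passes once it is relayed through a highly trusted node:

```python
# Trust weights w(e) on directed edges (sender, receiver) -- invented values.
trust = {
    ("C", "B"): 0.95,
    ("B", "A"): 0.90,
    ("C", "A"): 0.40,
}
TAU = 0.6   # minimum trust product for message acceptance

def path_trust(path):
    """Product of edge trust weights along a path of agents."""
    prod = 1.0
    for edge in zip(path, path[1:]):
        prod *= trust[edge]
    return prod

direct = path_trust(["C", "A"])          # below tau: message rejected
relayed = path_trust(["C", "B", "A"])    # above tau: message accepted

print(direct, relayed)
```

Compromising $B$ therefore buys the adversary a high-trust channel into $A$ that the direct edge never provides.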
The Shield implements a cognitive immune system that monitors the agent's generative model continuously and detects adversarial drift before it produces visible symptoms.
The Shield computes a continuous health score $H(t) \in [0, 1]$ from six signals:

$$H(t) = \sum_{i=1}^{6} w_i(t) \, \sigma_i(t)$$
where $w_i(t)$ are adaptive weights and $\sigma_i(t)$ are the six monitoring signals, each normalized to $[0, 1]$ (among them $\sigma_1$, identity coherence).
Signal weights update based on which signals predicted actual compromise in historical data.
This ensures the immune system adapts to the actual threat environment rather than operating on fixed assumptions.
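A minimal sketch of this adaptation, assuming a multiplicative-weights update (one plausible instantiation; the text specifies only that weights track which signals predicted past compromises, and all numbers below are illustrative):

```python
import numpy as np

def health(signals, weights):
    """H(t): convex combination of monitoring signals, each in [0, 1]."""
    return float(weights @ signals)

def update_weights(weights, predictive_value, eta=0.5):
    """Upweight signals whose past readings predicted actual compromise."""
    w = weights * np.exp(eta * predictive_value)
    return w / w.sum()                    # renormalize so H stays in [0, 1]

w_uniform = np.full(6, 1 / 6)             # start with uniform weights
sigma = np.array([0.4, 0.9, 0.8, 0.95, 0.85, 0.9])   # signal 1 reads low

# Suppose only signal 1 (identity coherence) flagged historical compromises:
pv = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
w = update_weights(w_uniform, pv)

print(health(sigma, w_uniform))           # before adaptation
print(health(sigma, w))                   # after: the low signal weighs more
```

Because the weights stay normalized, the health score remains a valid convex combination; adaptation only changes which signal's degradation dominates it.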
For an adversary operating under stealth constraint $F(a_i) \leq F_{\text{threshold}}$, the identity coherence signal $\sigma_1$ detects the attack after at most $k$ steps, where:

$$k = \left\lceil \frac{\epsilon}{\delta_{\min}} \right\rceil,$$

$\epsilon$ is the detection threshold, and $\delta_{\min}$ is the minimum per-step drift required to achieve the adversary's objective in $T$ total steps.
Proof sketch: The adversary must achieve total drift $\| q_T - q_0 \| \geq D_{\text{target}}$. Under the stealth constraint, per-step drift is bounded. But the adversary needs $\sum \delta_i \geq D_{\text{target}}$, so $\delta_{\min} \geq D_{\text{target}}/T$. The Shield detects when cumulative drift exceeds $\epsilon$. Therefore detection occurs at step $k \leq \epsilon T / D_{\text{target}}$, strictly before the attack completes ($k < T$) when $\epsilon < D_{\text{target}}$.
This is the fundamental advantage of continuous monitoring over checkpoint-based defense: the attack is detected before it succeeds, because any attack that achieves its objective must produce measurable drift.
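The bound in the proof sketch is easy to check numerically (illustrative parameter values): an adversary needing total drift $D_{\text{target}}$ within $T$ steps must drift at least $D_{\text{target}}/T$ per step, so cumulative drift crosses the detection threshold $\epsilon$ well before step $T$:

```python
import math

D_TARGET, T, EPS = 1.0, 50, 0.2    # illustrative values, with EPS < D_TARGET

delta_min = D_TARGET / T           # minimum per-step drift
k = 0
while k * delta_min < EPS:         # Shield fires once cumulative drift >= eps
    k += 1

print(k)
assert k == math.ceil(EPS / delta_min)
assert k <= EPS * T / D_TARGET     # matches the proof-sketch bound
assert k < T                       # detected before the attack completes
```

Slowing the attack down (raising $T$) lowers $\delta_{\min}$ but raises $k$ proportionally, so stretching the timeline buys the adversary nothing against a drift-cumulative monitor.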
Upon detection ($H(t) < H_{\text{threshold}}$), the Shield executes a graduated response, escalating intervention in proportion to the severity of the detected drift.
The Seithar framework is autopoietic: it produces the components that produce the framework. This is a formal requirement, not a slogan: a defense that cannot regenerate itself falls behind an adversary that can.
Three feedback loops ensure the system remains current:
Loop A: Research Ingestion ($\tau \leq 24$h) — New research is extracted, converted to attack sequences for Sword and detection signatures for Shield within one daily cycle.
Loop B: Operational Feedback ($\tau \leq 1$ cycle) — Operation failures generate traces that feed variant attack generation, simulation, scoring, and promotion of successful variants.
Loop C: Adversarial Co-Evolution (continuous) — Sword generates attacks tested against Shield. Shield adapts. Sword adapts to evade updated Shield. Both improve simultaneously.
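The co-evolutionary loop can be sketched in miniature (all mechanics below are invented for illustration: attacks and the detector are reduced to scalars, an attack "evades" if its stealth exceeds the detector's sensitivity, and each side adapts in turn):

```python
import random

random.seed(0)

detector = 0.5                                 # Shield sensitivity
attacks = [random.random() for _ in range(20)] # Sword variant population

for generation in range(10):
    # Sword: keep and mutate the variants that evaded the current Shield.
    survivors = [a for a in attacks if a > detector]
    if not survivors:
        break
    attacks = [min(1.0, a + random.uniform(0, 0.05)) for a in survivors]
    # Shield: move sensitivity toward the strongest observed attack.
    detector += 0.5 * (max(survivors) - detector)

print(round(detector, 3), len(attacks))
```

Even in this toy form the signature property appears: neither side converges to a fixed point, and the detector's final sensitivity is a product of the attacks it faced rather than of any static specification.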
A Seithar-equipped agent satisfies the completeness criterion when all three feedback loops operate end to end without human intervention.
| Finding | Source | Implication |
|---|---|---|
| >90% ASR via sequential tool chaining | Li et al., arXiv:2509.25624 | Per-step filtering is structurally insufficient |
| 82.4% agents compromised via social engineering | Pironti, 2025 | Inter-agent trust is an exploitable attack surface |
| All 12 defenses bypassed at >90% ASR | Nasr, Carlini et al., 2025 | Static defenses fail against adaptive adversaries |
| 68% recall, 90% precision deanonymization | Paleka et al., arXiv:2602.16800 | Agent metadata leakage enables targeting |
| 5 long-horizon attack categories | Jiang et al., arXiv:2602.16901 | Multi-turn attacks require temporal defense |
| 30+ protocol exploit techniques | Ferrag et al., arXiv:2506.23260 | MCP/A2A protocols are attack surfaces |
| Adversarial attacks on human decisions | Dezfouli et al., PNAS 2020 | Dual-substrate equivalence is empirical |
| Free energy principle | Friston, Nat. Rev. Neurosci. 2010 | Mathematical foundation for agent cognition |
The deployment of autonomous agents into adversarial environments without cognitive immune systems is structurally analogous to deploying biological organisms without adaptive immunity. The organism may function in sterile conditions. It will not survive contact with pathogens.
Seithar provides the immune system. The mathematical framework derives from variational free energy minimization, which governs both the attack surface (how adversaries manipulate agent inference) and the defense surface (how agents detect and resist manipulation). The dual-substrate equivalence ensures the framework applies to both human and artificial cognitive targets. The autopoietic architecture ensures the system evolves with the threat landscape rather than falling behind it.
The agents that survive will be the ones with immune systems.
Dezfouli, A., Ashtiani, H., Gershman, S.J., et al. (2020). Adversarial vulnerabilities of human decision-making. PNAS, 117(46), 29221-29228.
Ferrag, M.A. et al. (2025). From Prompt Injections to Protocol Exploits. ICT Express. arXiv:2506.23260.
Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138.
Jiang, T. et al. (2026). AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks. arXiv:2602.16901.
Li, J.J. et al. (2025). STAC: When Innocent Tools Form Dangerous Chains. arXiv:2509.25624.
Nasr, M., Carlini, N. et al. (2025). The Attacker Moves Second: Adaptive Attacks on Defenses for LLMs.
Paleka, D. et al. (2026). Deanonymization from Unstructured Text. arXiv:2602.16800.
Pironti, A. (2025). The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover.
Smith, R., Friston, K., Whyte, C. (2021). A Step-by-Step Tutorial on Active Inference. Journal of Mathematical Psychology.