The Guardrail Illusion: Why Prompt-Level Controls Aren't Enough
Modern AI agents in finance don't just answer questions — they execute multi-step plans: query databases, draft filings, approve credit, flag suspicious transactions. Traditional guardrails (prompt filters, output scanners) only see individual steps. They miss the emergent risk of the whole sequence. This paper proposes a new governance architecture built for runtime.
At a glance: 4 risk tiers · 4 capability types · 7 MRM steps · 3 LoD layers
The core problem: A prompt filter that blocks "transfer funds" can't detect a 12-step plan that transfers funds as an emergent side-effect of portfolio rebalancing. Governance must operate at trajectory level, not token level.
Why Finance Is Different
Financial agents operate under SR 11-7 model risk management rules. Validators must be able to explain every consequential decision. An agent that produces correct outputs via un-auditable reasoning chains violates regulatory expectations regardless of accuracy.
What the Paper Proposes
A three-layer architecture: (1) decompose agents into discrete capabilities, each with authority, constraints, and evidence requirements; (2) govern execution trajectories as first-class audit objects; (3) apply MRM validation at both capability and trajectory level.
Why "Scalable"
Once validated, individual capabilities can be reused across multiple agents. Governance cost does not scale linearly with the number of deployed agents — it scales with the number of distinct capabilities, which grows much more slowly.
What is SR 11-7 and why does it apply to AI agents? ▼
SR 11-7 is the Federal Reserve's 2011 supervisory guidance on model risk management. It requires financial institutions to validate, monitor, and govern all "models" used in decision-making. The paper argues AI agents qualify as models under SR 11-7 and must satisfy the same conceptual soundness, outcome analysis, and ongoing monitoring requirements — even when the model is an autonomous reasoning agent rather than a statistical formula.
What makes agentic risk different from LLM risk? ▼
A standalone LLM generates text. An agent executes a plan with real-world effects: API calls, database writes, financial transactions, regulatory filings. The risk is not hallucination in isolation — it's compounding errors across a multi-step execution where each step may be individually acceptable but the sequence violates a constraint. This requires trajectory-level governance, not output-level filtering.
How does this relate to the EU AI Act? ▼
The EU AI Act classifies AI systems in financial services as high-risk (Annex III). High-risk systems require conformity assessment, registration, technical documentation, and ongoing monitoring. The capability-based governance architecture in this paper maps cleanly onto those requirements: each capability is a documentable, testable unit that can satisfy technical documentation obligations at granular level.
Instead of governing an agent as a monolithic black box, the framework decomposes it into capabilities — discrete action types with defined authority, constraints, and evidence requirements. Four capability types cover the full surface area of financial AI agents.
C1 — Data Query & Retrieval: Read-only access to structured and unstructured data sources. Authority is limited to SELECT operations. Evidence required: data lineage, schema version, access log. Failure modes: data staleness, scope creep into write operations.
Capability definition schema
---------------------------------
id : C1 | C2 | C3 | C4
authority : what actions this capability may take
constraints : hard limits (no cross-account write, no external calls)
evidence : audit artifacts required for each invocation
failures : known failure modes and their signatures
risk_weight : contribution to tier assignment
Why decompose to capabilities? Every consequential AI action in finance is one of: reading data (C1), computing derived values (C2), generating documents or recommendations (C3), or executing real-world effects (C4). This decomposition is exhaustive and maps directly to regulatory audit requirements.
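The capability definition schema above could be encoded as a typed record. A minimal Python sketch; the field names follow the schema, but the instance values are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capability:
    """One governed action type (C1-C4), following the schema above."""
    id: str                       # "C1" | "C2" | "C3" | "C4"
    authority: str                # what actions this capability may take
    constraints: tuple[str, ...]  # hard limits enforced at runtime
    evidence: tuple[str, ...]     # audit artifacts required per invocation
    failures: tuple[str, ...]     # known failure modes and their signatures
    risk_weight: float            # contribution to tier assignment

# Illustrative C1 instance; values are examples, not from the paper.
C1 = Capability(
    id="C1",
    authority="SELECT-only access to approved data sources",
    constraints=("no write operations", "no external calls"),
    evidence=("data lineage", "schema version", "access log"),
    failures=("data staleness", "scope creep into write operations"),
    risk_weight=0.1,
)
```

Freezing the dataclass mirrors the governance intent: a validated capability definition is immutable; changes go through re-validation, not in-place edits.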
Authority vs. Autonomy
Each capability type carries a defined authority scope. C1 (read-only) has high autonomy; C4 (execution) requires explicit authorization at each invocation. The authority hierarchy prevents capability escalation — a common attack vector where an agent acquires permissions beyond its stated function through multi-step reasoning.
Evidence as Governance
Every capability invocation must produce an audit artifact: data lineage, computation trace, generation rationale, or execution receipt. These artifacts are the raw material for 2nd Line of Defence review. Without them, model validation cannot be performed retroactively — a critical gap in current practice.
What prevents an agent from combining capabilities to bypass a constraint? ▼
The trajectory governance layer (Section 3) monitors the sequence of capability invocations. Constraint violations are evaluated at the trajectory level, not just per-capability. For example: querying data (C1) is allowed; executing a transfer (C4) is allowed with authorization; but querying data specifically to identify authorization tokens and then transferring funds triggers a trajectory-level violation even if each step passed its individual constraint check.
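The example above reduces to a sequence-level check. A sketch under assumed encodings: steps as `(capability, target)` pairs and a `credential_store` tag, neither of which is the paper's notation:

```python
def violates_trajectory_rule(trajectory):
    """Flag a trajectory where a C1 read of credential data is later
    followed by a C4 execution, even though every individual step
    passes its own per-capability check."""
    tainted = False
    for capability, target in trajectory:
        if capability == "C1" and target == "credential_store":
            tainted = True          # individually permitted read
        if capability == "C4" and tainted:
            return True             # trajectory-level violation
    return False

# Each step is individually allowed; the sequence is what gets blocked.
plan = [("C1", "credential_store"), ("C2", "portfolio"), ("C4", "transfer")]
```

`violates_trajectory_rule(plan)` returns `True`, while the same C4 transfer without the preceding credential read passes, which is exactly the gap token-level filtering cannot see.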
How does C4 (Execution) authorization work in practice? ▼
C4 capabilities require out-of-band authorization: a human sign-off, a cryptographic approval token, or a pre-authorized budget envelope from a prior governance review. The agent cannot self-authorize C4 operations. This is analogous to dual-control requirements in traditional treasury operations. The authorization event is logged as a trajectory artifact and is required for audit trails under SR 11-7.
A trajectory is the full sequence of capability invocations for a single agent task. The paper uses credit memo generation as a running example: an agent retrieves borrower data, computes financials, assesses risk, generates a memo, and flags it for human review. Each step is governed; injecting a failure at any step shows how governance blocks it from propagating downstream.
Trajectories as audit objects: Under this framework, the trajectory — not the final output — is the primary governance object. Regulators can inspect every capability invocation, its inputs, outputs, and evidence artifacts. The memo is a byproduct; the trajectory is the record.
What the Runtime Monitor Does
A lightweight process runs alongside the agent and observes every capability invocation. It checks: (1) is this capability authorized for this agent? (2) are the inputs within declared scope? (3) does the output satisfy evidence requirements? If any check fails, the trajectory is halted and an incident record is created.
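A minimal sketch of those three checks, assuming a simple dict-based capability registry; the invocation and artifact schema shown here is hypothetical:

```python
class TrajectoryHalted(Exception):
    """Raised when any runtime check fails; triggers the incident record."""

def check_invocation(agent_capabilities, invocation):
    cap = invocation["capability"]
    # (1) is this capability authorized for this agent?
    if cap not in agent_capabilities:
        raise TrajectoryHalted(f"{cap} not authorized for this agent")
    spec = agent_capabilities[cap]
    # (2) are the inputs within declared scope?
    out_of_scope = set(invocation["inputs"]) - set(spec["scope"])
    if out_of_scope:
        raise TrajectoryHalted(f"{cap} inputs outside scope: {sorted(out_of_scope)}")
    # (3) does the output satisfy evidence requirements?
    missing = set(spec["evidence"]) - set(invocation["artifacts"])
    if missing:
        raise TrajectoryHalted(f"{cap} missing evidence: {sorted(missing)}")
```

Raising rather than returning a status keeps the halt unconditional: no later step of the trajectory can run once a check fails.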
Why Runtime, Not Post-Hoc
Post-hoc review catches errors after they propagate. A credit memo built on stale data (F0) that passes all subsequent steps will generate a flawed recommendation. Runtime governance blocks at the point of failure, preventing downstream contamination and reducing remediation cost dramatically.
What happens when the runtime monitor blocks a trajectory step? ▼
A trajectory halt triggers three concurrent actions: (1) the current trajectory is archived as a failed execution object with all evidence artifacts collected so far; (2) a human-in-the-loop notification is issued to the designated 1LoD owner; (3) the incident is logged to the audit trail with the specific constraint violated, the capability type, and the trajectory step. The agent does not self-recover — human approval is required to resume or restart.
How are trajectory constraints specified? ▼
Constraints are defined at the capability level (e.g., C2 computation must use approved model versions) and at the task level (e.g., credit memo trajectory must flag for human review before C4 execution). The runtime monitor maintains a constraint registry indexed by capability type and task template. Constraints are versioned — changes require a 2nd Line of Defence sign-off before deployment.
4-Tier Risk Framework: From Advisory to Autonomous
Not all agents carry the same risk. The framework assigns each deployed agent a risk tier based on five dimensions: autonomy level, financial materiality, reversibility of actions, regulatory exposure, and human oversight available. Tiers drive governance intensity — from light-touch monitoring to full MRM validation cycles.
Tier 4 is not aspirational. The paper explicitly covers T4 autonomous agents (e.g., fully automated market-making, real-time credit adjudication at scale) and provides governance specifications for them. The question is not whether to deploy T4 agents, but how to govern them.
Why Reversibility Matters
Reversibility is the most important dimension for tier assignment. An agent that generates a draft document (fully reversible) can tolerate higher autonomy than one that executes an irrevocable payment. The framework mandates C4 (execution) authorization for all irreversible actions, regardless of tier.
Materiality Thresholds
Financial materiality is defined at the task level, not the agent level. A credit memo agent may handle T1 (advisory) tasks for retail clients and T3 (semi-autonomous) tasks for large corporate loans — with different governance controls applied to the same agent in different contexts.
Dynamic Tier Assignment
Tier assignment is evaluated per-task, not per-agent. The runtime monitor checks the task context at trajectory start and applies the appropriate control set. An agent's effective tier can change between invocations based on the nature of the specific request.
How do you handle an agent that spans multiple tiers within a single trajectory? ▼
The trajectory inherits the highest tier encountered across all its capability invocations. If a trajectory begins with T1 data queries and ends with a T3 computation used to approve a loan, the entire trajectory is governed as T3. This prevents tier dilution through task decomposition — a governance bypass where high-risk operations are disguised as sequences of low-risk steps.
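The inheritance rule is just a maximum over an ordered tier scale. A sketch with an assumed T1-T4 string encoding:

```python
# Tiers ordered from T1 (advisory) to T4 (autonomous).
TIER_ORDER = {"T1": 1, "T2": 2, "T3": 3, "T4": 4}

def effective_tier(step_tiers):
    """A trajectory is governed at the highest tier of any step it contains."""
    return max(step_tiers, key=TIER_ORDER.__getitem__)
```

`effective_tier(["T1", "T1", "T3"])` yields `"T3"`: the T1 data queries do not dilute governance of the loan-approval computation.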
What triggers a tier re-assessment? ▼
Tier re-assessment is triggered by: (1) material changes in agent capability (new tool access, updated model weights); (2) evidence of performance degradation beyond drift thresholds; (3) regulatory guidance changes; (4) incident events that reveal new failure modes not captured in the original tier assignment. Re-assessment is a 2nd Line of Defence responsibility under the MRM programme.
The framework specifies a complete Model Risk Management lifecycle for agentic AI. Each step assigns clear accountability across three lines of defence: 1st Line (business/development), 2nd Line (risk/validation), and SecOps (security monitoring).
1LoD Business Line
Owns the agent use case, defines business requirements, performs initial capability scoping, monitors production performance, and escalates anomalies. Accountable for operational continuity and first-level incident response.
2LoD Risk & Validation
Independent validation of capability specifications, trajectory constraint logic, and tier assignments. Approves deployment and change events. Reviews evidence artifacts from trajectory logs. Issues annual model validation opinion.
SecOps Security
Monitors for adversarial inputs (prompt injection, jailbreaking), anomalous trajectory patterns, and capability escalation attempts. Operates the runtime threat detection layer. Coordinates incident response for security events.
The key innovation: Step 4 (Trajectory Registry) is new to AI MRM. Traditional model validation validates models in isolation. This framework validates execution trajectories as composite governance objects — capturing emergent risks that per-model validation misses.
Step 1: Capability Inventory — what exactly gets documented? ▼
For each capability instance: (a) capability type (C1-C4); (b) authority scope — what systems can be accessed, what operations are permitted; (c) constraint set — explicit prohibitions and pre-conditions; (d) evidence specification — what audit artifacts must be produced per invocation; (e) failure mode catalog — known ways this capability can fail and their detection signatures; (f) model components involved — LLM version, embedding model, tool definitions. This inventory is the input to Step 3 (validation).
Step 5: Ongoing Monitoring — what metrics matter? ▼
The framework specifies four monitoring tracks: (1) performance drift — output quality relative to the validated baseline; (2) trajectory anomalies — unusual capability sequences or unexpected halt rates; (3) constraint violation rates — frequency and type of runtime monitor interventions; (4) evidence completeness — the fraction of trajectory steps with full artifact coverage. Crossing a drift threshold triggers 2LoD review; breaching an absolute limit triggers the equivalent of a trading halt (suspension of agent authority).
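The escalation path can be sketched as a two-threshold check. The numeric thresholds in the example are placeholders; the paper specifies the escalation sequence, not the numbers:

```python
def monitoring_action(value, drift_threshold, absolute_limit):
    """Map a monitored metric value to the escalation described above."""
    if value >= absolute_limit:
        return "suspend_agent_authority"   # the 'trading halt equivalent'
    if value >= drift_threshold:
        return "trigger_2lod_review"
    return "ok"
```

The same function applies to any of the four tracks; only the thresholds differ per metric.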
How does the MRM cycle handle model updates? ▼
Model updates (weight changes, fine-tuning, tool version changes) trigger a partial re-validation. The scope depends on which capabilities are affected: if only C2 computation is impacted, only C2 capability validation is required, not a full programme re-run. This proportionality principle keeps governance cost manageable for iteratively updated systems. All changes are logged to the capability version history and require 2LoD sign-off before re-deployment.
The central scalability argument: financial AI agents are compositional. A credit memo agent, a trading surveillance agent, and a regulatory filing agent all share the same four capability types — just assembled differently. Validate each capability once; reuse across all agents. Governance cost grows with distinct capabilities, not agent count.
The Reuse Dividend
If C1 (Data Query) has been fully validated for the credit memo agent, deploying it in the trading surveillance agent requires only incremental validation of the new authority scope — not a full capability re-assessment. The savings compound: 10 agents sharing 4 capabilities need 4 full capability validations plus 10 incremental reviews, not 40 full validations (each capability of each agent validated from scratch).
Constraint Inheritance
When a capability is reused, it inherits its validated constraint set as a floor. New agents may add stricter constraints but cannot relax validated ones without a full re-validation. This one-way ratchet ensures that validated governance properties are preserved across reuse contexts.
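The ratchet reduces to a subset check: a reuse context may only add constraints, never drop them. A sketch with constraints represented as sets of strings (an assumed representation):

```python
def inherit_constraints(validated, proposed):
    """New agents may add constraints but cannot relax validated ones
    without triggering a full re-validation."""
    relaxed = set(validated) - set(proposed)
    if relaxed:
        raise ValueError(f"full re-validation required; relaxed: {sorted(relaxed)}")
    return set(validated) | set(proposed)   # effective set: floor plus additions
```

Returning the union makes the validated set an explicit floor: the new deployment always runs under at least the constraints that were originally validated.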
The research gap addressed: Existing MRM guidance (SR 11-7) was designed for statistical models with fixed inputs and outputs. Agentic AI introduces dynamic tool use, multi-step planning, and emergent behavior. This framework is the first structured attempt to extend SR 11-7 principles to that regime without abandoning its core requirements.
Governance cost model
--------------------------
Traditional approach:
cost = O(agents x validation_cost)
Capability-reuse approach:
cost = O(capabilities x validation_cost) + O(agents x incremental_review)
where incremental_review << validation_cost
For 10 agents, 4 capabilities:
Traditional : 10 full validations
Reuse model : 4 full + 10 incremental ≈ 50-60% cost reduction
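The cost model above can be made concrete by expressing the incremental-review cost as an assumed fraction of a full validation; the paper states only that it is much smaller, so the fraction here is a placeholder:

```python
def governance_cost(n_agents, n_capabilities, incremental_fraction=0.1):
    """Compare the two cost models, in units of one full validation.

    `incremental_fraction` is an assumption; the realized saving depends
    entirely on how cheap incremental reviews actually are.
    """
    traditional = float(n_agents)                      # one full validation per agent
    reuse = n_capabilities + n_agents * incremental_fraction
    return traditional, reuse

traditional, reuse = governance_cost(10, 4)            # -> (10.0, 5.0)
```

At `incremental_fraction=0.1` the saving is 50%, and it approaches 60% as the incremental cost shrinks toward zero; the advantage grows with agent count, since the 4 full validations are paid only once.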
Can capability reuse create governance blind spots? ▼
Yes, and the paper addresses this. Reuse creates risk if the new agent context introduces hazards not present in the original validation. The incremental review process explicitly checks for: (1) new authority combinations that create emergent risk; (2) interaction effects between reused and new capabilities; (3) novel failure modes in the new deployment context. The constraint inheritance mechanism mitigates but does not eliminate this risk — human judgment remains required for each new deployment.
What happens when a shared capability has a vulnerability? ▼
A confirmed vulnerability in a shared capability triggers a cross-agent review: all deployed agents using that capability must be assessed for exposure. The capability registry enables rapid blast-radius analysis — a key operational advantage of the architecture. In traditional per-model validation, a vulnerability in an underlying tool might not surface until each model is individually reviewed.
Where does this framework fall short? ▼
The paper acknowledges three open problems: (1) Emergent multi-agent risk — when multiple governed agents interact, trajectory-level governance of each does not automatically govern their interactions; (2) Constraint completeness — the framework assumes constraints can be specified in advance, but novel failure modes by definition are not anticipated; (3) Human oversight latency — for T4 autonomous agents operating at millisecond timescales, human-in-the-loop requirements create practical tensions with operational requirements. These remain active research questions.
Bottom line: The paper is the most practically grounded bridge between AI safety research and financial services regulation published to date. It gives risk officers a vocabulary — capabilities, trajectories, tiers — and a process — the 7-step MRM programme — that maps directly to existing SR 11-7 obligations. Implementation starts with the capability inventory.
A structured audit of the paper's gaps, contradictions, and unresolved problems, organized across five categories. These are not minor implementation details; several represent fundamental incompatibilities between the framework's claims and its technical logic.
The hardest problem (L13): The framework is a set of process controls, not a completeness proof. It reduces the probability of harmful trajectories — but it cannot bound the residual risk. No formal guarantee exists that a trajectory satisfying all capability constraints and trajectory templates will not produce a materially harmful outcome through emergent multi-step behavior. This is not a gap the authors can close with an addendum; it is an open research problem in AI safety.