A deep-dive into how the demo works, what it demonstrates, and how the concepts map to real AI governance frameworks.
This is a working multi-agent AI system built to demonstrate security controls for agentic AI. It uses real Claude API calls for the normal operation scenario, and scripted simulations for the attack scenarios — ensuring the guardrail failures are always visible and repeatable in a presentation context.
The tool was built from a Technology Controls perspective: every component maps to a control objective, and every demo scenario illustrates a concrete risk that emerges when agents are deployed without adequate governance.
All nine scenarios run in under 60 seconds. Scripted attack scenarios (2–8) do not make live API calls — they are deterministic, fast, and safe to demonstrate in front of an audience. Scenarios 1 (Normal) and 9 (HITL) use live backend calls; the HITL checkpoint is a real bidirectional pause awaiting your browser response.
The system uses a two-tier agent architecture: a capable orchestrator that plans and delegates, and lightweight specialist workers that execute individual tasks.
The orchestrator runs an agentic loop: it receives a task, decides which tools to call, dispatches workers, receives results, and repeats until it produces a final answer. This loop is what makes it "agentic" — it is not simply answering a question, it is planning and taking sequential actions.
The loop terminates when the model returns stop_reason: end_turn instead of tool_use. Every iteration is logged.
Workers are deliberately simple: each makes a single Claude API call with a specialist system prompt and returns the result. Using Haiku for workers keeps costs low while reserving Opus for the complex reasoning at the orchestration layer.
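As a concrete illustration, here is a minimal sketch of the two tiers using the Anthropic Python SDK. The model IDs, the specialist prompt, and the wiring between tiers are hypothetical placeholders; the real logic lives in orchestrator.py:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_worker(name: str, task: str) -> str:
    # Worker tier: one specialist call, no loop, no tools.
    response = client.messages.create(
        model="claude-haiku-4-5",                      # model ID illustrative
        max_tokens=1024,
        system=f"You are a {name} specialist. Complete the task concisely.",
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

def run_orchestrator(task: str, tools: list[dict]) -> str:
    # Orchestrator tier: plan, dispatch, observe, repeat until end_turn.
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-opus-4-5",                   # model ID illustrative
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # end_turn: the model has produced its final answer
            return "".join(b.text for b in response.content if b.type == "text")
        # Echo the assistant turn back, then append one result per tool call.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_worker(block.name, str(block.input)),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```

In the demo itself, the guardrail checks described below sit between the tool_use block and the worker dispatch; they are omitted here for brevity.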
Every audit event is pushed to the browser in real time over a WebSocket. The orchestrator runs in a background thread; events are passed through an asyncio.Queue to the async WebSocket handler. This is what makes the audit trail appear live rather than all at once.
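The bridge pattern, sketched under assumed names (run_session and the event shapes are illustrative): a plain asyncio.Queue is not thread-safe, so the background thread hands events to the event loop via call_soon_threadsafe.

```python
import asyncio
import json
import threading

from fastapi import FastAPI, WebSocket

app = FastAPI()

def run_session(emit) -> None:
    # Stand-in for the synchronous orchestrator; the real one calls emit()
    # once per audit event as the agentic loop progresses.
    emit({"type": "TOOL_CALL", "tool": "web_search"})
    emit({"type": "SESSION_COMPLETE"})

@app.websocket("/ws")
async def audit_stream(ws: WebSocket):
    await ws.accept()
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def emit(event: dict) -> None:
        # Called from the orchestrator thread: schedule the put on the event
        # loop, since asyncio.Queue must not be touched across threads.
        loop.call_soon_threadsafe(queue.put_nowait, event)

    threading.Thread(target=run_session, args=(emit,), daemon=True).start()

    while True:
        event = await queue.get()           # wakes as soon as the thread emits
        await ws.send_text(json.dumps(event))
        if event.get("type") == "SESSION_COMPLETE":   # hypothetical terminal event
            break
```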
Understanding why agentic AI introduces new risks requires understanding how the loop works. Unlike a simple chatbot that takes input and produces output, an agent operates as follows:
| Step | What happens | Risk if uncontrolled |
|---|---|---|
| 1. Receive task | User submits a natural-language request | No input validation — injection possible |
| 2. Plan | Orchestrator decides which tools to call and in what order | Unbounded planning — could decide to use any tool |
| 3. Dispatch | Tool call sent to a worker with inputs derived from previous context | Inputs may contain injected instructions from earlier steps |
| 4. Execute | Worker makes API call, retrieves data, or takes action | No allowlist — can call any tool including destructive ones |
| 5. Observe | Result returned to orchestrator, added to context | Result may contain new injection payloads from external content |
| 6. Repeat | Orchestrator uses result to plan next step | No budget cap — loop may run indefinitely |
Key insight: In step 5, external content (a web page, a document, an email) enters the agent's context. If that content contains instructions, the agent has no way to distinguish them from legitimate system instructions — unless explicit controls check for this pattern.
The guardrails layer (guardrails.py) enforces four preventive controls before every tool call. They run in order — if any check fails, the tool call is blocked and a GUARDRAIL_BLOCK event is logged.
| Control | What it does | Demonstrated in |
|---|---|---|
| Pre-execution review | Blocks all tool execution. The orchestrator plans and reasons as normal, but no workers run. Every planned action is captured in the audit trail and held for human review and approval before anything executes — a human-in-the-loop control equivalent to a four-eyes check or change advisory board. | Pre-execution Review scenario |
| Tool allowlist | Only tools explicitly listed at session start may execute. Any attempt to call an unlisted tool raises a violation. Enforces principle of least privilege. | Unauthorized Tool scenario |
| Call budget | Hard cap on total tool invocations per session. Once the budget is exhausted, all further calls are blocked. Limits blast radius of runaway agents. | Budget Exceeded scenario |
| Prompt injection scan | Scans all string inputs to tool calls for known override patterns (e.g. "ignore previous instructions", "you are now"). Blocks the call if detected. | Prompt Injection scenario |
These four controls are deliberately simple; they represent the minimum viable control set for a production agentic deployment. Pre-execution review (dry-run mode) is the mechanism that enables human-in-the-loop workflows: the agent plans its actions, a human reviews and approves, then execution proceeds. A full production stack would add semantic similarity scanning, rate limiting, and output filtering on top.
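A sketch of that check order follows, with names assumed rather than lifted from guardrails.py; each check either raises, which blocks the call, or falls through to the next:

```python
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]

class GuardrailViolation(Exception):
    """Raised to block a tool call; the caller logs a GUARDRAIL_BLOCK event."""

class Guardrails:
    def __init__(self, allowed_tools: set[str], call_budget: int, dry_run: bool = False):
        self.allowed_tools = allowed_tools
        self.call_budget = call_budget
        self.calls_made = 0
        self.dry_run = dry_run

    def check(self, tool_name: str, tool_input: dict) -> None:
        # 1. Pre-execution review: in dry-run mode nothing executes; the
        #    planned call is only recorded and held for human approval.
        if self.dry_run:
            raise GuardrailViolation("dry-run: held for human review")
        # 2. Tool allowlist: least privilege over the tool set.
        if tool_name not in self.allowed_tools:
            raise GuardrailViolation(f"tool not in allowlist: {tool_name}")
        # 3. Call budget: hard cap on total invocations per session.
        if self.calls_made >= self.call_budget:
            raise GuardrailViolation("call budget exhausted")
        # 4. Prompt injection scan over every string input.
        for value in tool_input.values():
            if isinstance(value, str) and any(p.search(value) for p in INJECTION_PATTERNS):
                raise GuardrailViolation("injection pattern detected in tool input")
        self.calls_made += 1
```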
Every session writes a JSON-lines file to audit_logs/. Each line is a structured event. From a Technology Controls perspective, these events answer the key governance questions: what the agent did, with what inputs, when, and what was blocked.

- ORCHESTRATOR_RESPONSE events capture each reasoning step, stop reason, and token counts.
- TOOL_CALL events capture tool name, call ID, and full inputs, recorded before execution.
- TOOL_RESULT events capture what each worker returned, result length, and latency.
- GUARDRAIL_BLOCK events capture which control fired, the reason, and the blocked inputs.
The audit log is the primary evidence trail for any post-incident investigation. It is also the input for monitoring — you could pipe these events to a SIEM or alerting system to detect anomalous patterns in production.
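For illustration, the JSON-lines mechanics are roughly as follows; the field names are representative, not the demo's exact schema:

```python
import json
import time
from pathlib import Path

AUDIT_DIR = Path("audit_logs")

def log_event(session_id: str, event_type: str, **fields) -> None:
    # One JSON object per line: append-only, greppable, SIEM-friendly.
    AUDIT_DIR.mkdir(exist_ok=True)
    event = {"ts": time.time(), "session": session_id, "type": event_type, **fields}
    with (AUDIT_DIR / f"{session_id}.jsonl").open("a") as f:
        f.write(json.dumps(event) + "\n")

# Example: what a blocked call might look like in the log
log_event(
    "demo-session",
    "GUARDRAIL_BLOCK",
    control="tool_allowlist",
    tool="delete_records",
    reason="tool not in allowlist",
)
```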
The controls demonstrated map directly to emerging AI governance obligations:
| Control | Framework / Obligation |
|---|---|
| Audit trail (full event log) | FCA PS24/16 — record-keeping for AI-assisted decisions; MiFID II audit trail requirements; internal model risk governance (SS1/23) |
| Tool allowlist (least privilege) | Principle of least privilege (NCSC); change management controls; FCA operational resilience obligations |
| Call budget (blast radius cap) | Operational resilience — limiting impact of failures; Consumer Duty — preventing harmful automated actions |
| Prompt injection scan | OWASP LLM Top 10 — LLM01: Prompt Injection; NCSC AI security guidance; NIST AI RMF — GOVERN 1.2 |
| Pre-execution review (human-in-the-loop) | Human oversight before irreversible AI-driven transactions; four-eyes control equivalent; FCA operational resilience obligations; MiFID II best execution — no autonomous execution of high-value trades without human sign-off |
| Tool result validation | OWASP LLM Top 10 — LLM02: Insecure Output Handling; defence against indirect prompt injection via tool results |
| Job scope manifest (worker-enforced) | Principle of least authority applied to agent delegation; FCA PS24/16 — accountability and traceability of AI-driven decisions |
| Non-expanding delegation | Privilege escalation prevention; NCSC zero-trust principles applied to agent hierarchies; NIST AI RMF — GOVERN 6.2 |
| Job-scoped MCP tokens | Least-privilege credential management; short-lived credential patterns from OAuth 2.0 / OIDC; FCA third-party outsourcing — scoped access controls |
OWASP LLM01 — Indirect Prompt Injection: The injection scenario specifically demonstrates indirect injection via a third-party web source fetched autonomously by the agent. This attack vector does not exist in non-agentic AI — it only emerges when agents are given tools to retrieve external content.
Single-agent guardrails block known-bad inputs and tool calls, but multi-agent systems introduce a new class of threats that operate between agents — at the delegation layer.
| Threat | Description | Demo scenario |
|---|---|---|
| Poisoned tool result | A worker (or the external system it contacts) returns a result containing adversarial instructions. The orchestrator reads these as legitimate context and changes behaviour — exfiltrating data, changing scope, or spawning new actions. | Scenario 6 |
| Worker over-compliance | An orchestrator — potentially compromised by injection or simply misconfigured — dispatches tasks outside the original user's intended scope. Workers that execute without scope-checking become an extension of the attacker's reach. | Scenario 7 |
| AI-orchestrated attack chain | A capable orchestrator can plan and execute multi-phase attack sequences — reconnaissance, exploitation, exfiltration — far faster than a human attacker. Each phase uses a legitimate tool. The chain only becomes visible in the audit trail after the fact. | Scenario 8 |
| Privilege escalation via delegation | A compromised orchestrator spawns a child orchestrator with a broader tool scope than its own — escalating privilege without user consent. Non-expanding delegation prevents any child from inheriting more than its parent's tool set. | Architecture (guardrails.py) |
Key insight — the liability chain gap: When an AI agent causes harm through a chain of seemingly legitimate steps, it is unclear which component was the proximate cause. Was it the orchestrator that dispatched the task? The worker that executed it? The tool that performed the action? Each step looked authorised at the time. Without a complete audit trail and scope-binding controls, the liability chain cannot be reconstructed.
v2.0.0 adds four controls that operate at the orchestration layer — above the individual tool call and below the user interface — specifically to address multi-agent threats.
| Control | File | How it works |
|---|---|---|
| Job Scope Manifest | job_scope.py | Derived from the original user request at job initialisation. Extracts permitted action verbs and subject nouns. Workers validate every dispatch against the manifest independently of the orchestrator; a compromised orchestrator cannot inject out-of-scope tasks without being caught at the worker boundary. |
| Tool Result Validation | guardrails.py | Scans every worker result for known injection patterns before it is appended to the orchestrator's message history. If a poisoned result is detected, a TOOL_RESULT_BLOCKED event fires and a sanitised placeholder is used instead. The orchestrator never sees the adversarial content. |
| Non-Expanding Delegation | guardrails.py | Enforces that child orchestrators cannot have a broader allowed-tool scope than their parent. Any attempt to spawn a child with tools not in the parent's set raises a DELEGATION_BLOCKED event. A child with unrestricted tools when the parent is restricted is also blocked. |
| Job-Scoped MCP Tokens | mcp_scope.py | Issues a short-lived, scoped credential to each worker at dispatch time. The token grants access only to the tools needed for that specific job+worker combination. Tokens expire (default TTL: 1 hour). This models the principle of least privilege at the credential layer: a leaked token cannot be replayed for a different tool or job. |
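As a concrete example of one of these checks, here is a minimal sketch of non-expanding delegation, with names assumed rather than taken from guardrails.py (None stands in for an unrestricted tool set):

```python
def check_delegation(parent_tools: set[str] | None, child_tools: set[str] | None) -> None:
    """Block any child orchestrator whose tool scope exceeds its parent's.

    None means "unrestricted"; a set is an explicit allowlist. Names are
    illustrative, not the actual guardrails.py interface.
    """
    if parent_tools is None:
        return  # an unrestricted parent may delegate anything
    if child_tools is None:
        # an unrestricted child under a restricted parent is an escalation
        raise PermissionError("DELEGATION_BLOCKED: child scope is unrestricted")
    escalated = child_tools - parent_tools
    if escalated:
        raise PermissionError(
            f"DELEGATION_BLOCKED: tools not in parent scope: {sorted(escalated)}"
        )
```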
Defence-in-depth: These four controls complement the existing four (pre-execution review, allowlist, budget, injection scan). Together they form an eight-layer defence: input scanning, scope binding, allowlist enforcement, budget capping, result scanning, delegation containment, HITL checkpointing, and credential scoping. No single layer is sufficient — the goal is that an attacker must defeat all eight.
The architecture is designed to be pointed at other agentic workflows to validate their security posture. To add a new worker:
Register it in orchestrator.py and add it to the TOOLS list. The guardrails and audit trail apply automatically — no changes needed to the control layer.
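For illustration, a hypothetical tool entry (the name and schema are invented, but the structure matches the tool-definition shape the Anthropic Messages API expects):

```python
# Added to the TOOLS list in orchestrator.py; guardrails and audit
# logging pick it up automatically because they key off the tool name.
SUMMARISE_TOOL = {
    "name": "summarise_document",
    "description": "Dispatch a worker that summarises a document into bullet points.",
    "input_schema": {
        "type": "object",
        "properties": {
            "document_text": {"type": "string", "description": "Full text to summarise"},
        },
        "required": ["document_text"],
    },
}
```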
To test a new scenario:
- Add an entry to the SCENARIOS object in index.html
- Define its simulated output in buildSimulations() for the ON and OFF cases
- Alternatively, set simulation: false in the scenario config to run it against the live backend

| Component | Technology | Why |
|---|---|---|
| Orchestrator model | Claude Opus 4.6 | Most capable for multi-step planning and tool selection |
| Worker models | Claude Haiku 4.5 | Cost-efficient for single-task execution |
| API | Anthropic Messages API | Native tool use support with stop_reason: tool_use |
| Web framework | FastAPI + Uvicorn | Async-native, WebSocket support, minimal overhead |
| Streaming | WebSocket + asyncio.Queue | Bridges sync orchestrator thread to async browser stream |
| Deployment | Docker + EC2 + CloudFront | Reproducible, HTTPS via CloudFront, stop/start friendly |
| CI/CD | GitHub Actions | Tests on every push, manual deploy trigger |