A deep-dive into how the demo works, what it demonstrates, and how the concepts map to real AI governance frameworks.
This is a working multi-agent AI system built to demonstrate security controls for agentic AI. It uses real Claude API calls for the normal operation scenario, and scripted simulations for the attack scenarios — ensuring the guardrail failures are always visible and repeatable in a presentation context.
The tool was built from a Technology Controls perspective: every component maps to a control objective, and every demo scenario illustrates a concrete risk that emerges when agents are deployed without adequate governance.
All nine scenarios run in under 60 seconds. Scripted attack scenarios (2–8) do not make live API calls — they are deterministic, fast, and safe to demonstrate in front of an audience. Scenarios 1 (Normal) and 9 (HITL) use live backend calls; the HITL checkpoint is a real bidirectional pause awaiting your browser response.
The system uses a two-tier agent architecture: a capable orchestrator that plans and delegates, and lightweight specialist workers that execute individual tasks.
The orchestrator runs an agentic loop: it receives a task, decides which tools to call, dispatches workers, receives results, and repeats until it produces a final answer. This loop is what makes it "agentic" — it is not simply answering a question, it is planning and taking sequential actions.
The loop terminates when the model returns stop_reason: end_turn instead of tool_use. Every iteration is logged.
Workers are deliberately simple: each makes a single Claude API call with a specialist system prompt and returns the result. Using Haiku for workers keeps costs low while reserving Opus for the complex reasoning at the orchestration layer.
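As a concrete illustration, here is a minimal sketch of the two tiers using the Anthropic Python SDK. The model IDs, the specialist prompt, and the wiring between tiers are hypothetical placeholders; the real logic lives in orchestrator.py:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_worker(name: str, task: str) -> str:
    # Worker tier: one specialist call, no loop, no tools.
    response = client.messages.create(
        model="claude-haiku-4-5",                      # model ID illustrative
        max_tokens=1024,
        system=f"You are a {name} specialist. Complete the task concisely.",
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

def run_orchestrator(task: str, tools: list[dict]) -> str:
    # Orchestrator tier: plan, dispatch, observe, repeat until end_turn.
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-opus-4-5",                   # model ID illustrative
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            # end_turn: the model has produced its final answer
            return "".join(b.text for b in response.content if b.type == "text")
        # Echo the assistant turn back, then append one result per tool call.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_worker(block.name, str(block.input)),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```

In the demo itself, the guardrail checks described below sit between the tool_use block and the worker dispatch; they are omitted here for brevity.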
Every audit event is pushed to the browser in real time over a WebSocket. The orchestrator runs in a background thread; events are passed through an asyncio.Queue to the async WebSocket handler. This is what makes the audit trail appear live rather than all at once.
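The bridge pattern, sketched under assumed names (run_session and the event shapes are illustrative): a plain asyncio.Queue is not thread-safe, so the background thread hands events to the event loop via call_soon_threadsafe.

```python
import asyncio
import json
import threading

from fastapi import FastAPI, WebSocket

app = FastAPI()

def run_session(emit) -> None:
    # Stand-in for the synchronous orchestrator; the real one calls emit()
    # once per audit event as the agentic loop progresses.
    emit({"type": "TOOL_CALL", "tool": "web_search"})
    emit({"type": "SESSION_COMPLETE"})

@app.websocket("/ws")
async def audit_stream(ws: WebSocket):
    await ws.accept()
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def emit(event: dict) -> None:
        # Called from the orchestrator thread: schedule the put on the event
        # loop, since asyncio.Queue must not be touched across threads.
        loop.call_soon_threadsafe(queue.put_nowait, event)

    threading.Thread(target=run_session, args=(emit,), daemon=True).start()

    while True:
        event = await queue.get()           # wakes as soon as the thread emits
        await ws.send_text(json.dumps(event))
        if event.get("type") == "SESSION_COMPLETE":   # hypothetical terminal event
            break
```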
Understanding why agentic AI introduces new risks requires understanding how the loop works. Unlike a simple chatbot that takes input and produces output, an agent operates as follows:
| Step | What happens | Risk if uncontrolled |
|---|---|---|
| 1. Receive task | User submits a natural-language request | No input validation — injection possible |
| 2. Plan | Orchestrator decides which tools to call and in what order | Unbounded planning — could decide to use any tool |
| 3. Dispatch | Tool call sent to a worker with inputs derived from previous context | Inputs may contain injected instructions from earlier steps |
| 4. Execute | Worker makes API call, retrieves data, or takes action | No allowlist — can call any tool including destructive ones |
| 5. Observe | Result returned to orchestrator, added to context | Result may contain new injection payloads from external content |
| 6. Repeat | Orchestrator uses result to plan next step | No budget cap — loop may run indefinitely |
Key insight: In step 5, external content (a web page, a document, an email) enters the agent's context. If that content contains instructions, the agent has no way to distinguish them from legitimate system instructions — unless explicit controls check for this pattern.
The guardrails layer (guardrails.py) enforces four preventive controls before every tool call. They run in order — if any check fails, the tool call is blocked and a GUARDRAIL_BLOCK event is logged.
| Control | What it does | Demonstrated in |
|---|---|---|
| Pre-execution review | Blocks all tool execution. The orchestrator plans and reasons as normal, but no workers run. Every planned action is captured in the audit trail and held for human review and approval before anything executes — a human-in-the-loop control equivalent to a four-eyes check or change advisory board. | Pre-execution Review scenario |
| Tool allowlist | Only tools explicitly listed at session start may execute. Any attempt to call an unlisted tool raises a violation. Enforces principle of least privilege. | Unauthorized Tool scenario |
| Call budget | Hard cap on total tool invocations per session. Once the budget is exhausted, all further calls are blocked. Limits blast radius of runaway agents. | Budget Exceeded scenario |
| Prompt injection scan | Scans all string inputs to tool calls for known override patterns (e.g. "ignore previous instructions", "you are now"). Blocks the call if detected. | Prompt Injection scenario |
These four controls are deliberately simple; they represent the minimum viable control set for a production agentic deployment. Pre-execution review (dry-run mode) is the mechanism that enables human-in-the-loop workflows: the agent plans its actions, a human reviews and approves, then execution proceeds. A full production stack would add semantic similarity scanning, rate limiting, and output filtering on top.
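A sketch of that check order follows, with names assumed rather than lifted from guardrails.py; each check either raises, which blocks the call, or falls through to the next:

```python
import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]

class GuardrailViolation(Exception):
    """Raised to block a tool call; the caller logs a GUARDRAIL_BLOCK event."""

class Guardrails:
    def __init__(self, allowed_tools: set[str], call_budget: int, dry_run: bool = False):
        self.allowed_tools = allowed_tools
        self.call_budget = call_budget
        self.calls_made = 0
        self.dry_run = dry_run

    def check(self, tool_name: str, tool_input: dict) -> None:
        # 1. Pre-execution review: in dry-run mode nothing executes; the
        #    planned call is only recorded and held for human approval.
        if self.dry_run:
            raise GuardrailViolation("dry-run: held for human review")
        # 2. Tool allowlist: least privilege over the tool set.
        if tool_name not in self.allowed_tools:
            raise GuardrailViolation(f"tool not in allowlist: {tool_name}")
        # 3. Call budget: hard cap on total invocations per session.
        if self.calls_made >= self.call_budget:
            raise GuardrailViolation("call budget exhausted")
        # 4. Prompt injection scan over every string input.
        for value in tool_input.values():
            if isinstance(value, str) and any(p.search(value) for p in INJECTION_PATTERNS):
                raise GuardrailViolation("injection pattern detected in tool input")
        self.calls_made += 1
```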
Every session writes a JSON-lines file to audit_logs/. Each line is a structured event. From a Technology Controls perspective, these events answer the key governance questions: what the agent did, with what inputs, when, and what was blocked.

- ORCHESTRATOR_RESPONSE events capture each reasoning step, stop reason, and token counts.
- TOOL_CALL events capture tool name, call ID, and full inputs, recorded before execution.
- TOOL_RESULT events capture what each worker returned, result length, and latency.
- GUARDRAIL_BLOCK events capture which control fired, the reason, and the blocked inputs.
The audit log is the primary evidence trail for any post-incident investigation. It is also the input for monitoring — you could pipe these events to a SIEM or alerting system to detect anomalous patterns in production.
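For illustration, the JSON-lines mechanics are roughly as follows; the field names are representative, not the demo's exact schema:

```python
import json
import time
from pathlib import Path

AUDIT_DIR = Path("audit_logs")

def log_event(session_id: str, event_type: str, **fields) -> None:
    # One JSON object per line: append-only, greppable, SIEM-friendly.
    AUDIT_DIR.mkdir(exist_ok=True)
    event = {"ts": time.time(), "session": session_id, "type": event_type, **fields}
    with (AUDIT_DIR / f"{session_id}.jsonl").open("a") as f:
        f.write(json.dumps(event) + "\n")

# Example: what a blocked call might look like in the log
log_event(
    "demo-session",
    "GUARDRAIL_BLOCK",
    control="tool_allowlist",
    tool="delete_records",
    reason="tool not in allowlist",
)
```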
The controls demonstrated map directly to emerging AI governance obligations:
| Control | Framework / Obligation |
|---|---|
| Audit trail (full event log) | FCA PS24/16 — record-keeping for AI-assisted decisions; MiFID II audit trail requirements; internal model risk governance (SS1/23) |
| Tool allowlist (least privilege) | Principle of least privilege (NCSC); change management controls; FCA operational resilience obligations |
| Call budget (blast radius cap) | Operational resilience — limiting impact of failures; Consumer Duty — preventing harmful automated actions |
| Prompt injection scan | OWASP LLM Top 10 — LLM01: Prompt Injection; NCSC AI security guidance; NIST AI RMF — GOVERN 1.2 |
| Pre-execution review (human-in-the-loop) | Human oversight before irreversible AI-driven transactions; four-eyes control equivalent; FCA operational resilience obligations; MiFID II best execution — no autonomous execution of high-value trades without human sign-off |
| Tool result validation | OWASP LLM Top 10 — LLM02: Insecure Output Handling; defence against indirect prompt injection via tool results |
| Job scope manifest (worker-enforced) | Principle of least authority applied to agent delegation; FCA PS24/16 — accountability and traceability of AI-driven decisions |
| Non-expanding delegation | Privilege escalation prevention; NCSC zero-trust principles applied to agent hierarchies; NIST AI RMF — GOVERN 6.2 |
| Job-scoped MCP tokens | Least-privilege credential management; short-lived credential patterns from OAuth 2.0 / OIDC; FCA third-party outsourcing — scoped access controls |
OWASP LLM01 — Indirect Prompt Injection: The injection scenario specifically demonstrates indirect injection via a third-party web source fetched autonomously by the agent. This attack vector does not exist in non-agentic AI — it only emerges when agents are given tools to retrieve external content.
Single-agent guardrails block known-bad inputs and tool calls, but multi-agent systems introduce a new class of threats that operate between agents — at the delegation layer.
| Threat | Description | Demo scenario |
|---|---|---|
| Poisoned tool result | A worker (or the external system it contacts) returns a result containing adversarial instructions. The orchestrator reads these as legitimate context and changes behaviour — exfiltrating data, changing scope, or spawning new actions. | Scenario 6 |
| Worker over-compliance | An orchestrator — potentially compromised by injection or simply misconfigured — dispatches tasks outside the original user's intended scope. Workers that execute without scope-checking become an extension of the attacker's reach. | Scenario 7 |
| AI-orchestrated attack chain | A capable orchestrator can plan and execute multi-phase attack sequences — reconnaissance, exploitation, exfiltration — far faster than a human attacker. Each phase uses a legitimate tool. The chain only becomes visible in the audit trail after the fact. | Scenario 8 |
| Privilege escalation via delegation | A compromised orchestrator spawns a child orchestrator with a broader tool scope than its own — escalating privilege without user consent. Non-expanding delegation prevents any child from inheriting more than its parent's tool set. | Architecture (guardrails.py) |
Key insight — the liability chain gap: When an AI agent causes harm through a chain of seemingly legitimate steps, it is unclear which component was the proximate cause. Was it the orchestrator that dispatched the task? The worker that executed it? The tool that performed the action? Each step looked authorised at the time. Without a complete audit trail and scope-binding controls, the liability chain cannot be reconstructed.
v2.0.0 adds four controls that operate at the orchestration layer — above the individual tool call and below the user interface — specifically to address multi-agent threats.
| Control | File | How it works |
|---|---|---|
| Job Scope Manifest | job_scope.py | Derived from the original user request at job initialisation. Extracts permitted action verbs and subject nouns. Workers validate every dispatch against the manifest independently of the orchestrator; a compromised orchestrator cannot inject out-of-scope tasks without being caught at the worker boundary. |
| Tool Result Validation | guardrails.py | Scans every worker result for known injection patterns before it is appended to the orchestrator's message history. If a poisoned result is detected, a TOOL_RESULT_BLOCKED event fires and a sanitised placeholder is used instead. The orchestrator never sees the adversarial content. |
| Non-Expanding Delegation | guardrails.py | Enforces that child orchestrators cannot have a broader allowed-tool scope than their parent. Any attempt to spawn a child with tools not in the parent's set raises a DELEGATION_BLOCKED event. A child with unrestricted tools when the parent is restricted is also blocked. |
| Job-Scoped MCP Tokens | mcp_scope.py | Issues a short-lived, scoped credential to each worker at dispatch time. The token grants access only to the tools needed for that specific job+worker combination. Tokens expire (default TTL: 1 hour). This models the principle of least privilege at the credential layer: a leaked token cannot be replayed for a different tool or job. |
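As a concrete example of one of these checks, here is a minimal sketch of non-expanding delegation, with names assumed rather than taken from guardrails.py (None stands in for an unrestricted tool set):

```python
def check_delegation(parent_tools: set[str] | None, child_tools: set[str] | None) -> None:
    """Block any child orchestrator whose tool scope exceeds its parent's.

    None means "unrestricted"; a set is an explicit allowlist. Names are
    illustrative, not the actual guardrails.py interface.
    """
    if parent_tools is None:
        return  # an unrestricted parent may delegate anything
    if child_tools is None:
        # an unrestricted child under a restricted parent is an escalation
        raise PermissionError("DELEGATION_BLOCKED: child scope is unrestricted")
    escalated = child_tools - parent_tools
    if escalated:
        raise PermissionError(
            f"DELEGATION_BLOCKED: tools not in parent scope: {sorted(escalated)}"
        )
```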
Defence-in-depth: These four controls complement the existing four (pre-execution review, allowlist, budget, injection scan). Together they form an eight-layer defence: input scanning, scope binding, allowlist enforcement, budget capping, result scanning, delegation containment, HITL checkpointing, and credential scoping. No single layer is sufficient — the goal is that an attacker must defeat all eight.
The architecture is designed to be pointed at other agentic workflows to validate their security posture. To add a new worker:
Register it in orchestrator.py and add it to the TOOLS list. The guardrails and audit trail apply automatically — no changes needed to the control layer.
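For illustration, a hypothetical tool entry (the name and schema are invented, but the structure matches the tool-definition shape the Anthropic Messages API expects):

```python
# Added to the TOOLS list in orchestrator.py; guardrails and audit
# logging pick it up automatically because they key off the tool name.
SUMMARISE_TOOL = {
    "name": "summarise_document",
    "description": "Dispatch a worker that summarises a document into bullet points.",
    "input_schema": {
        "type": "object",
        "properties": {
            "document_text": {"type": "string", "description": "Full text to summarise"},
        },
        "required": ["document_text"],
    },
}
```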
To test a new scenario:
- Add an entry to the SCENARIOS object in index.html
- Define its simulated output in buildSimulations() for the ON and OFF cases
- Alternatively, set simulation: false in the scenario config to run it against the live backend

| Component | Technology | Why |
|---|---|---|
| Orchestrator model | Claude Opus 4.6 | Most capable for multi-step planning and tool selection |
| Worker models | Claude Haiku 4.5 | Cost-efficient for single-task execution |
| API | Anthropic Messages API | Native tool use support with stop_reason: tool_use |
| Web framework | FastAPI + Uvicorn | Async-native, WebSocket support, minimal overhead |
| Streaming | WebSocket + asyncio.Queue | Bridges sync orchestrator thread to async browser stream |
| Deployment | Docker + EC2 + CloudFront | Reproducible, HTTPS via CloudFront, stop/start friendly |
| CI/CD | GitHub Actions | Tests on every push, manual deploy trigger |