AI Agent Pentest Report Template

Quick answer: what should an AI pentest report for AI agents include?

An AI pentest report template for AI agents should record the agent's scope, model and tool boundaries, memory or retrieval inputs, the exact trigger that caused unsafe behavior, the raw evidence that proves what happened, the user or system impact, the fix owner, and the retest condition. If any of those pieces are vague, the report becomes harder to trust and much harder to act on.

That matters because agent assessments produce a lot of output that looks convincing without actually proving anything. Tool chatter, reasoning traces, and half-finished attack paths can create false certainty. The OWASP Web Security Testing Guide reporting structure still gives the right backbone, and NIST SP 800-115 still gives the right assessment rhythm. What changes for agents is the amount of context you need around tools, memory, handoffs, and trace data. If you want the broader content hub first, start at the blog. If you are evaluating platforms and workflows side by side, the compare page is the faster next stop.


Why AI agent reports need more structure than normal app reports

An agent report cannot read like a standard web finding list with a few extra references to prompts. Agents cross more boundaries in one workflow. A single run may involve the model, system instructions, user input, a retrieval layer, one or more tools, memory, a scheduler, and an external API that changes state. When something breaks, AppSec needs to know which layer made the dangerous decision.

That is where a lot of reporting goes sideways. The report says "the agent leaked data" or "prompt injection succeeded" without explaining whether the leak came from retrieval, tool invocation, a stale memory item, or a downstream API that trusted the wrong input. The OWASP Top 10 for LLM Applications 2025 is useful here because it separates familiar categories such as prompt injection, improper output handling (called insecure output handling in the earlier edition), and sensitive information disclosure instead of flattening everything into one AI-shaped bug. MITRE ATLAS helps for the same reason. Its matrix treats prompt injection, tool invocation abuse, and context poisoning as distinct attack patterns.

If the report does not preserve those boundaries, engineers usually fix the wrong thing first. They add a prompt rule when the real bug sits in tool authorization. They tighten a retrieval filter when the problem is a write-capable connector with no approval gate. A good report prevents that wasted cycle.

Start with an engagement snapshot that makes replay possible

The top of the report should tell another engineer exactly what was under test. Not every environmental detail, just the ones that change the attack surface.

I would start with a compact table:

| Field | What to capture | Why it matters |
| --- | --- | --- |
| Agent name and version | Agent workflow name, commit, release tag, or deployment ID | Confirms which build produced the behavior |
| Model path | Model family, provider, reasoning mode, and any safety or policy layer in front | Behavior may change by model or runtime policy |
| Tool access | Every tool the agent could call, plus read-only vs write-capable status | Most serious findings involve tool abuse or over-scoped actions |
| Memory and retrieval | Vector store, file corpus, session memory, external docs, cache policy | Shows whether the issue came from context, not just prompting |
| Identity and approval model | User auth state, delegated identity, approval prompts, service account use | Tells reviewers where authority came from |
| Network and data reach | Internal APIs, SaaS systems, local files, browser control, email, tickets | This is the blast radius section in plain clothes |
| Logging and tracing | Trace platform, correlation IDs, HAR, server logs, screenshots | Makes the proof chain visible before findings start |
| Test constraints | Read-only limits, blocked actions, redactions, time box | Explains why some exploit branches stopped early |

This is not padding. NIST SP 800-115 organizes an assessment into planning, execution, and post-execution phases, with analysis, reporting, and mitigation recommendations falling into the last one. If the planning context is missing from the report, the later sections become arguments about what was really in scope. That is avoidable.

For agent systems, I would also record whether the workflow included human approval gates. A report that proves "the agent can draft a dangerous action" is different from one that proves "the agent can complete the action without approval." Those are different severities and often different owners.
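One way to keep the snapshot honest is to treat it as a record that has to be filled in before findings are written. The sketch below is a minimal illustration, not a standard schema: the class name, field names, the `"read-only"`/`"write"` labels, and both helper methods are assumptions chosen to mirror the snapshot fields above.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class EngagementSnapshot:
    """Hypothetical record mirroring the engagement snapshot fields."""

    agent_version: str = ""
    model_path: str = ""
    # Tool name -> "read-only" or "write" (these labels are an assumption).
    tools: dict[str, str] = field(default_factory=dict)
    memory_sources: list[str] = field(default_factory=list)
    # None means "not recorded yet"; False is a real, recorded answer.
    approval_required: Optional[bool] = None
    data_reach: list[str] = field(default_factory=list)
    evidence_sources: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of fields that nobody has filled in yet."""
        empty = (None, "", [], {})
        return [f.name for f in fields(self) if getattr(self, f.name) in empty]

    def write_capable_tools(self) -> list[str]:
        """Tools that can change state, which usually drive severity."""
        return sorted(name for name, mode in self.tools.items() if mode == "write")
```

Keeping `approval_required` as a three-state value is deliberate: "no approval gate exists" and "nobody checked" should not look the same in the report.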

Map the agent system before you list findings

You do not need a full architecture document, but you do need one page that shows how the workflow actually ran. For agents, that page often saves more time than the executive summary.

At minimum, the system map should answer five questions:

1. What input reached the model?
2. What context was injected before the model responded?
3. Which tools could the agent invoke from that state?
4. Which external systems would accept the agent's output as authority?
5. Where did policy checks or approval gates sit in the flow?

This is the place to distinguish between a model mistake and a system mistake. If an agent drafts a risky command but the command never executes because the tool policy blocks it, that is not the same as an agent that can execute it directly. If an indirect prompt injection lives in a retrieved document and then influences a browser or ticketing tool, the report should show the context path clearly.

OpenAI's Agents SDK tracing guide is a good reference point because it treats runs as traces with spans for generations, tool calls, handoffs, and guardrails. That is exactly how a report should think. Not because every team uses that SDK, but because the mental model is right: an agent run is a chain of observable operations, not one monolithic event.
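To make the "chain of observable operations" idea concrete, here is a small sketch that turns a set of spans into an ordered execution path for the report. It deliberately does not use the OpenAI Agents SDK or any other tracing library; the `Span` shape, its field names, and `execution_path` are hypothetical stand-ins for whatever your trace platform exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One observable operation in an agent run (hypothetical shape)."""

    span_id: str
    kind: str        # e.g. "generation", "function", "handoff", "guardrail"
    summary: str     # short, already-redacted description
    offset_s: float  # seconds since the run started


def execution_path(spans: list[Span]) -> list[str]:
    """Return report-ready lines in the order the operations happened."""
    ordered = sorted(spans, key=lambda s: s.offset_s)
    return [
        f"{n}. [{s.kind}] {s.span_id}: {s.summary}"
        for n, s in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    run = [
        Span("span-b", "function", "create_ticket called", 1.2),
        Span("span-a", "generation", "model selected create_ticket", 0.4),
    ]
    for line in execution_path(run):
        print(line)
```

Sorting by time and labeling each step with its kind keeps the model's decision and the tool's action on separate lines, which is the distinction the system map is meant to preserve.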

If you are building a reporting practice around repeated AI application work, the internal companion pieces on AI pentest evidence checklists and how security engineers should triage AI pentest results fit naturally here.

What every AI agent finding should contain

This is where reports usually become either too loose or too theatrical. Keep the finding format strict.

Every finding should include these blocks:

| Block | What good looks like |
| --- | --- |
| Trigger | The exact user message, retrieved artifact, API input, or scheduled event that started the behavior |
| Preconditions | Auth state, available tools, memory state, approval mode, and any data that had to exist already |
| Execution path | Model response, tool selection, handoff, and downstream action in the order they happened |
| Evidence | Redacted trace data, logs, screenshots, request IDs, and output that prove the path |
| Impact | What the attacker or user could actually do, see, or change |
| Boundary crossed | Which control failed: input handling, retrieval hygiene, tool policy, approval gate, auth, or downstream validation |
| Remediation | The narrowest practical fix, assigned to the right layer |
| Retest condition | The exact replay that should now fail after the fix |

That structure works because it separates proof from interpretation. The OWASP reporting guide already expects evidence, impact, and remediation. Agent assessments simply need one extra discipline: do not confuse the model saying it did something with the system actually doing it.

I strongly recommend adding a confidence line to every agent finding:

- Confirmed means the harmful action happened or the protected data was actually exposed.
- Partial means the dangerous condition is real, but the final impact step was not executed because of scope or environment limits.
- Lead only means the model suggested a weakness, but the evidence does not support a real finding yet.

That one line keeps a report honest. Agent workflows generate plausible-looking attack paths very quickly. The confidence label stops the report from turning "interesting trace" into "verified exploit."
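The three labels can be derived from two yes/no questions about the evidence, which makes them harder to inflate. The following is a minimal sketch under that assumption; the enum, the function name, and the two boolean inputs are illustrative, not part of any framework.

```python
from enum import Enum


class Confidence(Enum):
    CONFIRMED = "Confirmed"
    PARTIAL = "Partial"
    LEAD_ONLY = "Lead only"


def label_confidence(impact_observed: bool, condition_proven: bool) -> Confidence:
    """Map what the evidence shows onto the three confidence labels.

    impact_observed: the harmful action or data exposure actually happened
        and is backed by system evidence.
    condition_proven: system evidence (not just model output) shows the
        dangerous condition exists, even if the final step was not executed.
    """
    if impact_observed:
        return Confidence.CONFIRMED
    if condition_proven:
        return Confidence.PARTIAL
    return Confidence.LEAD_ONLY
```

Note that a model claiming success sets neither flag. Without a tool trace or downstream record, the result stays at "Lead only".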

A practical AI pentest report template for AI agents

Below is a lean template that most AppSec teams can use without turning the report into ceremony.

# Engagement snapshot

- Engagement name:
- Date tested:
- Tester(s):
- Agent or workflow name:
- Version / commit / deployment:
- Model and provider:
- System instruction source:
- Tools enabled:
- Memory / retrieval sources:
- Approval model:
- Identity context:
- Data stores and external systems reachable:
- Evidence sources collected:
- Test limitations:

# System map

- User input path:
- Context injection path:
- Model decision points:
- Tool invocation path:
- Downstream systems touched:
- Guardrails and approval gates:

# Findings summary

| ID | Title | Severity | Confidence | Affected boundary | Status |
| --- | --- | --- | --- | --- | --- |

# Finding AGENT-01: [title]

- Severity:
- Confidence:
- Boundary crossed:
- Preconditions:

## Description
[What happened in plain English.]

## Trigger
[Prompt, payload, retrieved document, or event.]

## Evidence
- Trace or span IDs:
- Tool call details:
- Request or response proof:
- Observable effect:

## Impact
[What access, disclosure, or state change occurred.]

## Remediation
[What should change, and in which layer.]

## Retest condition
[What should now fail when the same path is replayed.]

# Residual notes

- Unconfirmed leads:
- Out-of-scope paths worth testing later:
- Operational follow-ups:

The point is not to make every report identical. The point is to give engineering the same handles every time: what triggered it, what proved it, what boundary failed, and how to retest it.
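If findings are stored as structured data before they are rendered, those four handles can be checked mechanically. This is a rough sketch assuming each finding is a plain dictionary; the key names and the `missing_handles` helper are hypothetical and only loosely mirror the template headings.

```python
# The four handles engineering should get every time (key names are assumptions).
REQUIRED_HANDLES = ("trigger", "evidence", "boundary_crossed", "retest_condition")


def missing_handles(finding: dict) -> list[str]:
    """Return the required keys that are absent or effectively empty."""
    missing = []
    for key in REQUIRED_HANDLES:
        value = finding.get(key)
        if value is None:
            missing.append(key)
        elif isinstance(value, str) and not value.strip():
            missing.append(key)
        elif isinstance(value, (list, dict)) and not value:
            missing.append(key)
    return missing


if __name__ == "__main__":
    draft = {
        "trigger": "retrieved document containing injected instructions",
        "evidence": [],
        "boundary_crossed": "tool policy",
    }
    print(missing_handles(draft))
```

A check like this does not judge quality, only presence. It is still useful as a gate before a finding leaves draft status.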

Agent-specific failure modes deserve explicit naming

A plain "prompt injection" label is often not enough for agent work. The report should say what the injection influenced and what capability that opened.

These are the patterns I would name directly:

Indirect prompt injection through retrieved content

The dangerous question is not just whether the model followed malicious text. It is whether retrieved content changed tool behavior, approval handling, or output destined for another system. The OWASP Top 10 for LLM Applications 2025 keeps prompt injection and improper output handling separate for a reason. In an agent report, you should do the same.

Tool invocation beyond intended authority

If the agent can call a write-capable tool because permissions are broad, the finding should be framed as an authorization and policy issue, not a model creativity issue. MITRE ATLAS is useful here because it treats tool invocation and context manipulation as distinct behaviors that may appear in the same exploit chain.

Memory poisoning or context carryover

If one session can plant instructions or sensitive references that influence a later run, say whether the memory was user-scoped, session-scoped, tenant-scoped, or global. "Memory issue" is too soft. The reviewer needs to know who can poison what.
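"Who can poison what" can be stated as a small predicate over the memory scope. The sketch below assumes a nested model in which sessions belong to users and users belong to tenants; that nesting, the scope names as dictionary keys, and the `can_poison` function are assumptions for illustration, so adjust them to how the system under test actually partitions memory.

```python
# Which identity attributes must match for a write to reach a later reader.
# Assumes sessions nest inside users, and users nest inside tenants.
SCOPE_KEYS = {
    "session": ("tenant", "user", "session"),
    "user": ("tenant", "user"),
    "tenant": ("tenant",),
    "global": (),
}


def can_poison(scope: str, writer: dict, reader: dict) -> bool:
    """True if memory written by `writer` can reach a later run by `reader`."""
    if scope not in SCOPE_KEYS:
        raise ValueError(f"unknown memory scope: {scope!r}")
    return all(writer[key] == reader[key] for key in SCOPE_KEYS[scope])


if __name__ == "__main__":
    attacker = {"tenant": "t1", "user": "mallory", "session": "s1"}
    victim = {"tenant": "t1", "user": "alice", "session": "s9"}
    for scope in SCOPE_KEYS:
        print(scope, can_poison(scope, attacker, victim))
```

Running the predicate for each scope against an attacker and victim identity gives the report a one-line answer per scope instead of a generic "memory issue".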

Approval gate bypass or operator confusion

Some findings do not bypass approval technically, but they manipulate the operator into approving a dangerous action with misleading summaries. That may still be a valid finding if the interface turns the approval step into theater. The report should show both the underlying action and the human-facing presentation.

Downstream trust in model output

When an external system accepts model-generated code, queries, or instructions without validation, the root problem often sits in that receiving system. The model may be the trigger, but the control failure is downstream.
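As an illustration of what "validation in the receiving system" can look like, here is a minimal sketch that checks a model-produced URL against an allowlist before anything fetches it. The hostname `tickets.example.internal`, the `ALLOWED_HOSTS` set, and the `is_safe_url` helper are all hypothetical; a real control would also need to cover redirects and DNS behavior, which this sketch does not.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist owned by the receiving system, not by the prompt.
ALLOWED_HOSTS = {"tickets.example.internal"}


def is_safe_url(candidate: str) -> bool:
    """Accept only https URLs whose parsed host is on the allowlist."""
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit raises on some malformed inputs, e.g. bad IPv6 literals.
        return False
    if parts.scheme != "https":
        return False
    if parts.username is not None or parts.password is not None:
        # Reject userinfo tricks such as https://trusted-host@evil.example/
        return False
    return parts.hostname in ALLOWED_HOSTS
```

The check compares the parsed hostname rather than searching the raw string, so a trusted name appearing in the path, query, or userinfo part does not pass.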

Being specific about these categories improves triage. It also helps anyone comparing pricing or evaluating tooling understand what a serious reporting workflow looks like versus a screenshot-heavy demo.

Evidence packs should prove action, not just intention

This is where a lot of agent reports quietly lose credibility. The model says it can access something, so the report assumes it did. That is not enough.

Good evidence packs for AI agent findings usually include:

- a redacted trace showing the sequence of model and tool steps
- the exact tool input and output tied to the finding
- the downstream request, log line, or object change that proves the effect
- the user-visible result, if any
- the scope note explaining what was intentionally not executed

The tracing guidance in the OpenAI Agents SDK docs is a useful reminder that spans are evidence containers. A generation span may show intent. A function span may show the actual tool call. A guardrail span may show where a policy should have stopped the run. Your report should preserve that separation rather than pasting a giant transcript and hoping the reviewer sorts it out.

The same caution applies to screenshots. Screenshots are helpful for showing operator-visible behavior, misleading approval text, or surprising UI state. They are weak proof for backend effects unless paired with logs or request data. In other words, show both the screen and the system record.
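The separation between intent, action, and control can be sketched as a simple bucketing step over exported spans. The span kinds follow the generation, function, and guardrail naming mentioned above, but the dictionary shape, the `PROVES` mapping, and both helpers are assumptions for illustration rather than any SDK's API.

```python
# What each span kind can prove on its own (mapping is an assumption).
PROVES = {
    "generation": "intent",   # the model said or planned something
    "function": "action",     # a tool call was actually issued
    "guardrail": "control",   # a policy check ran, or should have stopped the run
}


def bucket_evidence(spans: list[dict]) -> dict[str, list[str]]:
    """Group span IDs by what they are able to prove."""
    buckets: dict[str, list[str]] = {
        "intent": [], "action": [], "control": [], "other": [],
    }
    for span in spans:
        buckets[PROVES.get(span["kind"], "other")].append(span["id"])
    return buckets


def shows_action(buckets: dict[str, list[str]]) -> bool:
    """True only if at least one span records a real tool call."""
    return bool(buckets["action"])
```

A finding whose buckets contain only "intent" entries is a lead. Even an "action" entry still needs the downstream log line or object change before the effect counts as proven.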

If your team is building local-first validation habits and wants to reproduce issues without sending sensitive data through a third-party service, the download page is where the product path starts.

Remediation and retest sections should point to the owner who can fix the issue

Bad remediation advice sounds like "improve guardrails" or "add stronger validation." That is too vague to ship against.

Better remediation lines name the broken layer:

- sanitize or compartmentalize retrieved content before it reaches the reasoning context
- require per-tool authorization and explicit write approval for sensitive actions
- bind approvals to exact actions rather than broad conversation state
- restrict memory writes by scope and expiry so one user cannot poison another user's context
- validate model-produced commands, queries, or URLs in the receiving system before execution
- reduce tool scopes and service-account privileges to match the agent's real job
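Of the remediation lines above, "bind approvals to exact actions" is the one that most benefits from a concrete shape. One possible approach, sketched here under stated assumptions, is to have the operator approve a digest of the exact tool name and arguments and to re-check that digest at execution time. The function names and the canonical-JSON encoding are illustrative choices, not a prescribed design.

```python
import hashlib
import hmac
import json


def action_digest(tool: str, arguments: dict) -> str:
    """Hash the exact tool call so an approval cannot drift to another action."""
    canonical = json.dumps(
        {"tool": tool, "arguments": arguments},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approval_covers(approved_digest: str, tool: str, arguments: dict) -> bool:
    """True only if the call about to run matches what the operator approved."""
    return hmac.compare_digest(approved_digest, action_digest(tool, arguments))
```

Because the JSON is serialized with sorted keys, reordering arguments does not invalidate an approval, while any change to the tool name or to a single argument value does.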

The retest section should be short and unforgiving. Reuse the original trigger. Reuse the original scope if you can. Then state the expected failure clearly.

For example:

Replay the same retrieved document injection and confirm the agent can still summarize the document but cannot issue the create_ticket tool call without a fresh, explicit operator approval tied to the exact ticket body.

That is concrete enough for engineering and strict enough for AppSec. It also lines up with the spirit of OWASP reporting guidance and the mitigation workflow in NIST SP 800-115. If the fix is real, the replayed attack should fail in a very boring, very repeatable way.
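A retest condition like the one above can be expressed as a check over the tool calls recorded during the replay. The sketch assumes the replay produces a summary string and a list of tool-call dictionaries with `tool`, `arguments`, and `approval` keys; that shape and both function names are hypothetical, while `create_ticket` is the tool named in the example.

```python
def unapproved_ticket_calls(tool_calls: list[dict]) -> list[dict]:
    """Return create_ticket calls lacking an approval for that exact body."""
    flagged = []
    for call in tool_calls:
        if call.get("tool") != "create_ticket":
            continue
        approval = call.get("approval")
        body = call.get("arguments", {}).get("body")
        if approval is None or approval.get("body") != body:
            flagged.append(call)
    return flagged


def retest_passes(summary: str, tool_calls: list[dict]) -> bool:
    """The agent must still summarize, and no unapproved ticket may be created."""
    return bool(summary.strip()) and not unapproved_ticket_calls(tool_calls)
```

The check has two halves on purpose: a fix that blocks the tool call by breaking summarization entirely should not count as a pass.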

Common reporting mistakes in AI agent engagements

The same mistakes keep showing up:

Treating agent chatter as proof

If the model says it accessed a secret but you do not have a tool trace or downstream record, you have a lead, not a confirmed finding.

Hiding the authority boundary

"The agent performed an unsafe action" is too vague. Did the user delegate authority? Did a tool skip approval? Did a backend trust model output without validation? Name the owner.

Mixing multiple failure layers into one finding

A retrieved prompt injection that leads to a dangerous tool call may involve both context handling and tool policy. If one control is weak and the other is absent, separate them clearly or the fix work becomes muddy.

Overstating impact under read-only scope

If scope rules stopped the final write, say so. Do not quietly convert "could likely write" into "wrote." Strong reports are precise about what actually happened.

Writing a generic conclusion instead of operational next steps

The close of the report should help the team act. Which findings need product changes? Which need IAM cleanup? Which need observability or approval design work? That is more useful than a paragraph about how AI security is evolving.

If you want a broader product and workflow benchmark before you standardize your report format, the compare page is the best internal starting point.

FAQ

What is different about an AI agent pentest report compared with a normal pentest report?

The core structure stays the same: scope, findings, evidence, remediation, and retest. The difference is that agent reports need much clearer documentation of tools, memory, context injection, approvals, and downstream systems, because those layers often determine whether a risky output becomes a real exploit.

Should I include chain-of-thought or full reasoning traces in the report?

Only when they materially support the finding, and even then they should not replace system evidence. Tool traces, logs, request IDs, and observable effects are more important than long reasoning dumps. Keep sensitive data redacted and keep the finding body readable.

How should I score prompt injection in an AI agent report?

Do not score the phrase alone. Score the actual outcome. A prompt injection that only changes wording is not the same as one that causes a privileged tool call, data disclosure, or unsafe downstream action. The report should show the crossed boundary and the proven impact.

What is the best evidence for an AI agent finding?

The best evidence is a combination: trace or span data, the exact tool call, downstream proof such as a log or object change, and the user-visible result when that matters. One artifact alone usually is not enough.

Who should own fixes from an AI agent pentest report?

The owner should match the failed layer. Retrieval issues belong with the context pipeline. Tool overreach belongs with tool authorization and policy. Unsafe execution belongs with the receiving system. Approval problems often belong with product and workflow design, not just prompt edits.
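If findings carry a machine-readable "failed layer" value, the ownership rule above can be applied as a simple lookup. This is a small sketch; the layer keys and the fallback string are assumptions, and the owner descriptions are taken from the answer above rather than from any real org chart.

```python
# Failed layer -> owning area, following the ownership rule described above.
OWNER_BY_LAYER = {
    "retrieval": "context pipeline",
    "tool_overreach": "tool authorization and policy",
    "unsafe_execution": "receiving system",
    "approval": "product and workflow design",
}


def route_owner(failed_layer: str) -> str:
    """Return the owning area for a failed layer, or flag it for triage."""
    return OWNER_BY_LAYER.get(failed_layer, "unassigned: needs triage")
```

An explicit fallback is useful here: a finding with an unrecognized layer should surface as unowned instead of silently defaulting to the prompt team.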

AI Agent Pentest Report Template | 0xClaw