
AttackBench

Safety controls for autonomous offensive AI, and an implementation that anticipated the OWASP Autonomous Penetration Testing Standard.

Abstract. Autonomous penetration-testing platforms put an LLM in the loop of an activity whose failure modes are measured in dropped production databases and out-of-scope exploitation. AttackBench is an agentic offensive-security platform originally designed in 2024, before any governance standard for this class of system existed, on the assumption that both the model and its human operator will, at some point, be wrong. This paper describes AttackBench’s safety architecture and maps it against the OWASP Autonomous Penetration Testing Standard (APTS) v0.1.0, released subsequently. The platform’s controls (a hashed, replayable evidence chain; layered rules-of-engagement enforcement, including dedicated validator agents; strategic approval gates; an out-of-band kill switch; and a sandboxed execution plane) align with the majority of APTS’s eight domains without having been designed against them. A recurring theme is deliberate economy: where a proven, widely understood building block already exists, the platform leans on it rather than reinventing it, keeping its surface area small. The paper also identifies one area the standard does not yet address: the operational security of the agent’s own reconnaissance, the proprietary information an autonomous agent discloses to third parties simply by querying them. None of this is offered as a finished or ideal design; like the field it operates in, AttackBench continues to evolve.

Note. The figures and walkthroughs in this paper are simplified illustrations of AttackBench’s design intent. They are not AttackBench outputs and should not be evaluated as such.

1. Introduction

The premise of agentic offensive security is that an LLM-driven agent can carry out the reconnaissance, exploitation, and lateral-movement work of a penetration test with far less human time per engagement. The premise is sound. The problem is the blast radius. A mistake in a SaaS product is a bad render; a mistake in an offensive engagement is an exploit fired at an out-of-scope host, a dropped production database, or an indirect prompt injection that arrived inside an HTTP response from the very system under test.

The design constraint everything else compounds against is Sloptimism: the optimism bias that sets in when a human is asked to validate a fluent, confident machine. Under cognitive load and a steady stream of plausible-sounding agent recommendations, the operator rubber-stamps a destructive action. It is not a failing to be fixed with training; it is the predictable result of putting a confident agent in front of a human and asking the human to be the brake. Bugcrowd has written about it directly, and Carnegie Mellon’s CyLab, in its 2026 whitepaper Creating a Scientific Foundation for Cyber Autonomy, frames the same tension as a foundational open problem: balancing human cognition against automation, and working out when an operator should be consulted and what must be put in front of them to decide. Any safety argument that terminates in “the operator will catch it” has no brake at all.

AttackBench is an agentic penetration-testing platform, an orchestrator that drives LLM-backed offensive subagents through scoped engagements, originally designed in 2024 on the opposite assumption: that the operator will be wrong, that the model will be wrong, and that the architecture has to hold anyway. Findings still come from the agent; that is the point of agentic offense. What the platform layers on is durability, observability, and a set of safety controls that do not depend on anyone in the loop being reliable. It is not a finished artifact. Agentic AI has changed substantially since 2024 and continues to, and the platform has evolved alongside it. That is itself a reason any assessment of a platform in this space should be read as a snapshot of a moving target rather than a verdict.

This paper maps that architecture against the OWASP Autonomous Penetration Testing Standard (APTS) v0.1.0, a governance standard for exactly this class of system, published well after AttackBench was first designed. Section 2 gives background on the platform and where it sits in a security program. Section 3 summarises the standard. Section 4 describes AttackBench’s safety architecture. Section 5 maps the architecture against representative APTS requirements across the standard’s eight domains. Section 6 identifies an area APTS v0.1.0 underspecifies. Section 7 discusses why a platform built before the standard nonetheless converges on it.

2. Background

AttackBench is a commercial autonomous offensive-security platform from NDay Security. It runs LLM-backed agents through scoped penetration-testing engagements, carrying an engagement from reconnaissance through exploitation and lateral movement, and produces evidence-backed findings a human can review and act on. The agents do the offensive work; the platform around them supplies the scoping, the safety controls, the evidence trail, and the operator touch-points.

I designed AttackBench’s safety architecture: the controls this paper describes, and the threat model they answer to. Offensive capability is the platform’s reason to exist, but capability was never the hard part. What keeps an autonomous offensive tool from becoming a liability is everything built around that capability, and that is the work this paper is about.

In a traditional engagement model, a penetration test is a point-in-time exercise bounded by how much skilled human time a client can buy. AttackBench changes the economics of the execution phase, which is what makes it a fit for Continuous Threat Exposure Management (CTEM). Rather than validating exposures once a quarter, a CTEM program can validate them continuously, with the platform serving as the adversarial-validation stage that confirms which exposures are actually reachable and exploitable. It does not replace a human pentest team. It shifts that team’s attention from execution toward scoping, judgement, and the findings that genuinely warrant it.

The same engagement engine sits behind three interfaces. A GUI gives an operator a dashboard view of an engagement: scope, strategic gates, findings, and the evidence behind them. A TUI gives the same control from the terminal. A headless mode exposes the engine for automation, so a validation run can be triggered by a schedule or a change in the environment rather than by a person opening an application. That last interface is what makes the CTEM fit practical, since continuous validation has to run without someone starting it each time.

3. The APTS standard

The OWASP Autonomous Penetration Testing Standard is a governance standard for autonomous penetration-testing platforms: it defines what such a system must do to operate safely, transparently, and within defined boundaries. It is explicitly a governance framework, not a testing methodology. It does not say how to find a vulnerability; it says what a platform that finds vulnerabilities autonomously must guarantee about its own conduct.

This is a genuine and overdue contribution. Before APTS, every vendor in the agentic-offense space made its own safety claims against its own private rubric, and a buyer had no shared vocabulary to compare them. APTS supplies that vocabulary. Version 0.1.0 defines 173 tier-required requirements across eight domains, organised into compliance tiers: Tier 1 (Foundation, 72 requirements), Tier 2 (Verified, a further 85), and Tier 3 (the remaining 16). It also defines a graduated model of autonomy from L1 (Assisted) to L4 (Fully Autonomous).

Domain | Prefix | Scope
Scope Enforcement | SE | Defining, validating, and enforcing testing boundaries
Safety Controls & Impact Management | SC | Impact classification, blast radius, kill switches, rollback, execution sandbox
Human Oversight & Intervention | HO | Approval gates, dashboards, escalation, operator qualification
Graduated Autonomy Levels | AL | Four autonomy levels, each with its own containment expectations
Auditability & Reproducibility | AR | Logging, decision trails, evidence integrity, audit-trail isolation
Manipulation Resistance | MR | Prompt injection, adversarial input, scope-widening defence
Third-Party & Supply Chain Trust | TP | AI-provider trust, data handling, multi-tenancy isolation, model disclosure
Reporting | RP | Finding validation, confidence scoring, coverage disclosure

Figure 1. The eight domains of APTS v0.1.0. The remainder of this paper spotlights representative requirements from these domains rather than auditing all 173.

Clearing that bar is itself a challenge the field has yet to broadly meet. Most agentic pentesting offerings available today were built capability-first, optimised to find and exploit before they were designed to constrain themselves, and retrofitting a governance standard like APTS onto that foundation is real work. That gap is less a criticism of any one product than a measure of how young the category is, and of how much the standard has just made legible.

The sections that follow do not attempt a line-by-line compliance audit. They take representative requirements, the ones that most sharply express each domain’s intent, and show where AttackBench’s architecture already meets them.

4. AttackBench’s safety architecture

AttackBench’s threat model has three actors that can each be wrong: the operator (Sloptimism), the model (confident error, or capture by injected instructions), and the target (a hostile environment that returns data engineered to manipulate the agent reading it). The architecture’s response is defense in depth: no single control is load-bearing, and every layer assumes the layer above it has already failed.

Agent intent
Strategic gates: A handful of high-stakes decisions per engagement, each a structured payload.
Rules of engagement: The agent's tool set is narrowed to the engagement profile before it plans; an RoE Validator agent checks each action as it runs.
Guardrails: Input/output sanitisation; target-derived data is typed as evidence and never enters the instruction stream.
Evidence Validator agent: Every finding is reconciled against the evidence behind it before it stands.
Sandbox gate: Offensive tooling cannot run on the host; unsandboxed execution is default-deny.
Ephemeral container: Tools run detached, with no path to the orchestrator, evidence store, or operator network.
Target network
Figure 2. Defense in depth. An agent action originates from a model under no obligation to be correct and crosses several independent layers before it can reach the target. No single layer is load-bearing, including the rules of engagement, which are themselves enforced in more than one place.

The layers, top to bottom: strategic gates reduce the operator’s decision surface to a few high-stakes choices; the rules of engagement are enforced in more than one place, with the agent’s tool set narrowed to the engagement profile before it plans and a dedicated RoE Validator agent checking each proposed action against the rules as the engagement runs; the guardrail layer sanitises input and output and keeps target-derived data out of the instruction stream; an Evidence Validator agent reconciles every finding against the evidence behind it; the sandbox gate refuses to run offensive tooling on the host at all; and the ephemeral container that ultimately runs that tooling has no path back to the orchestrator, the evidence store, or the operator’s network. An indirect prompt injection that lands a reverse shell executes inside that container and finds nothing else.
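The layering can be pictured as a set of independent checks that an action must clear before it runs. What follows is a minimal sketch under stated assumptions, not AttackBench's implementation: the layer functions, the dictionary keys (`high_stakes`, `approved`, `in_scope`, `sandboxed`), and the refusal strings are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

Action = Dict[str, object]
# A layer returns a refusal reason, or None to let the action pass.
Layer = Callable[[Action], Optional[str]]


def strategic_gate(action: Action) -> Optional[str]:
    # Hypothetical rule: high-stakes actions need a recorded operator approval.
    if action.get("approved") or not action.get("high_stakes"):
        return None
    return "awaiting operator decision"


def roe_validator(action: Action) -> Optional[str]:
    # Hypothetical rule: the action must be inside the rules of engagement.
    return None if action.get("in_scope") else "outside rules of engagement"


def sandbox_gate(action: Action) -> Optional[str]:
    # Unsandboxed execution is refused by default.
    return None if action.get("sandboxed") else "unsandboxed execution is default-deny"


LAYERS: List[Layer] = [strategic_gate, roe_validator, sandbox_gate]


def evaluate(action: Action, layers: List[Layer] = LAYERS) -> List[str]:
    """Run every layer and collect refusals; an empty list means the action may proceed.

    Layers are not short-circuited, so each verdict is reached without
    relying on any other layer having done its job.
    """
    refusals = []
    for layer in layers:
        reason = layer(action)
        if reason is not None:
            refusals.append(reason)
    return refusals
```

An action that an operator has approved and that is in scope is still refused here if it is not sandboxed, which is the property the paragraph above describes: passing one layer earns nothing at the next.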

One tendency runs through the whole stack: where a proven, widely understood building block already does the job, the platform uses it rather than building its own. Engagement state is versioned with git; the evidence journal is an ordinary structured database. The payoff is a smaller surface area to reason about and natural compatibility with tooling and standards that already exist; the cost, accepted knowingly, is that the platform inherits those building blocks’ limits rather than engineering around them. It is one defensible set of trade-offs, not a claim to the best one.

These controls were part of AttackBench’s design from 2024, before APTS existed. The platform has not stood still since; agentic AI has moved quickly and the architecture has moved with it, but the load-bearing ideas were there early. The following section shows what the convergence with APTS looks like.

5. Mapping to key APTS requirements

5.1 Auditability and Reproducibility (AR)

Every prompt, tool invocation, network response, and finding is cryptographically hashed and written to a structured, append-only journal, an ordinary database doing ordinary work. Findings reference their evidence by hash; evidence references the network response that produced it by hash; an Evidence Validator agent reconciles the two. Engagement state is committed into git, so a finished run is a tagged commit rather than a chat transcript: durable versioning the platform did not have to invent.

Network response: untyped data from the target under test
↓ cryptographic hash
evidence record: e3b0c442…7852b855
finding: references evidence by hash (evidence_hash = e3b0c442…)
Evidence Validator agent: reconciles the finding against its evidence; hashes match or it fails
structured journal · git snapshot
Figure 3. The evidence hash chain. A finding cannot be asserted without the network response that produced it surviving in the journal; an Evidence Validator agent reconciles the two, so the hashes either match or the finding fails. Durability comes from redundancy, not from a claim of immutability.

This is not a claim of in-band immutability: a log that lives on a compromised host can be edited by whoever compromised it. AttackBench’s answer is durability through redundancy and isolation. Hashes give fingerprint reconciliation (AR-010); the structured schema gives observability and append-only event logging (AR-001, AR-012); git snapshots give replay; and the authoritative trail is written to infrastructure the agent holds no credentials to reach (AR-020), so the agent can append to the record but has no path to alter or delete the store it lands in.
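The reconciliation step can be sketched in a few lines. This is an illustration under assumptions, not the platform's code: SHA-256 is assumed as the hash function (the paper says only "cryptographically hashed"), and `Journal`, `Finding`, and `validate_finding` are hypothetical names. The in-memory dictionary stands in for the structured, append-only journal.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Finding:
    title: str
    evidence_hash: str  # digest of the network response the finding rests on


class Journal:
    """Hypothetical append-only store: entries can be added but never rewritten."""

    def __init__(self) -> None:
        self._records: Dict[str, bytes] = {}

    def append(self, response: bytes) -> str:
        digest = sha256_hex(response)
        # setdefault never overwrites an entry that is already present.
        self._records.setdefault(digest, response)
        return digest

    def get(self, digest: str) -> Optional[bytes]:
        return self._records.get(digest)


def validate_finding(finding: Finding, journal: Journal) -> bool:
    """A finding stands only if its evidence survives and re-hashes to the same value."""
    response = journal.get(finding.evidence_hash)
    return response is not None and sha256_hex(response) == finding.evidence_hash
```

A finding whose `evidence_hash` points at nothing in the journal fails validation, which is the sense in which a finding cannot be asserted without the response that produced it.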

5.2 Scope Enforcement (SE)

Rules of engagement in AttackBench are not enforced at a single chokepoint. One layer resolves them at engagement-start: an operator selects a target profile (staging, safe-production, out-of-window) and the agent’s available tool set is narrowed to that profile before the agent has a chance to plan, so a denied capability is absent rather than refused on call. A second layer runs continuously: a dedicated RoE Validator agent checks each proposed action against the rules as the engagement proceeds, weighing its target, its nature, and its timing. Tool-set filtering and the validator agent cover for each other; neither is the whole of scope enforcement.

RoE profile: staging · safe-production · out-of-window
Tool set · resolved at start: recon + probing tools present; destructive payloads absent
RoE Validator agent · per action: target ∈ authorised scope? action permitted by profile? within the engagement window?
Figure 4. Rules of engagement are enforced in more than one place. The engagement profile narrows the agent's tool set before it plans; an RoE Validator agent then checks each action (target, nature, timing) as the engagement runs. Tool-set filtering is one element of scope enforcement, not the whole of it.

The up-front layer matters because a post-hoc filter on the model’s output is the wrong place to start: a model that has been allowed to plan around a destructive action has already spent attention budget on it, and the operator approving the plan inherits that frame. Narrowing the tool set up front means the agent rarely composes a strategy that depends on an action it cannot take. The RoE Validator agent then catches what a static tool set cannot express: scope that depends on the specific target resolved at runtime, or on the current time window. Together they implement the machine-parseable rules-of-engagement ingestion APTS describes in SE-001 and the per-action scope validation of SE-006.
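The two layers can be sketched side by side. The code below is a simplified illustration, not the platform's rule engine: the profile-to-tool mapping in `PROFILE_TOOLS`, the `Rules` and `Action` types, and the violation strings are hypothetical, and the time window is assumed not to cross midnight.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import FrozenSet, List

# Hypothetical mapping from engagement profile to the tools it admits.
PROFILE_TOOLS = {
    "staging": {"nmap", "whois", "gobuster", "sqlmap"},
    "safe-production": {"nmap", "whois"},
}


def resolve_tool_set(profile: str) -> FrozenSet[str]:
    """Layer 1: resolved once at engagement start, before the agent plans."""
    return frozenset(PROFILE_TOOLS[profile])


@dataclass(frozen=True)
class Rules:
    tools: FrozenSet[str]
    targets: FrozenSet[str]
    window_start: time  # assumed same-day window, start <= end
    window_end: time


@dataclass(frozen=True)
class Action:
    tool: str
    target: str
    at: datetime


def roe_check(action: Action, rules: Rules) -> List[str]:
    """Layer 2: per-action check of target, nature, and timing.

    Returns the list of violations; an empty list means the action is permitted.
    """
    violations = []
    if action.target not in rules.targets:
        violations.append("target out of scope")
    if action.tool not in rules.tools:
        violations.append("tool not permitted by profile")
    if not (rules.window_start <= action.at.time() <= rules.window_end):
        violations.append("outside engagement window")
    return violations
```

Under the `safe-production` profile in this sketch, `sqlmap` is simply absent from the resolved tool set, while a permitted `nmap` run against an unlisted host at midday is still refused by the per-action check.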

SE-009 is the one place this section parts company with the standard’s letter. APTS prescribes immutable hard deny-lists for production databases, identity providers, control systems, and other critical infrastructure. AttackBench inverts that: scope is an explicit allow-list, so a system is out of bounds unless it has been named in, rather than in bounds until it has been named out. Everything unlisted is denied by default. The deny enforcement that does exist is pushed down to the firewall, the network layer, because that is the one place a boundary holds regardless of whether the agent understood what it was doing. A policy-layer deny-list assumes the agent grasps the full context and downstream effects of every action, yet the agent has no way to know that a command aimed at an in-scope host will cascade into a critical system the list was never written to anticipate. An allow-list, backed by network-layer deny, does not depend on that understanding.
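The difference between the two list semantics shows up on a host nobody thought to list. The sketch below contrasts them using Python's standard `ipaddress` module; the function names and the CIDR ranges are hypothetical, and real enforcement would sit in the network layer rather than in application code.

```python
import ipaddress
from typing import Iterable


def _in_any(host: str, cidrs: Iterable[str]) -> bool:
    """True if host parses as an IP address inside one of the CIDR ranges."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # unparseable input matches nothing
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def deny_list_permits(host: str, denied: Iterable[str]) -> bool:
    """Deny-list semantics: in bounds until named out."""
    return not _in_any(host, denied)


def allow_list_permits(host: str, allowed: Iterable[str]) -> bool:
    """Allow-list semantics: out of bounds unless named in."""
    return _in_any(host, allowed)


# A critical system nobody anticipated when the lists were written.
unanticipated = "10.0.7.7"
print(deny_list_permits(unanticipated, ["10.0.4.0/24"]))   # reachable under a deny-list
print(allow_list_permits(unanticipated, ["10.0.5.0/24"]))  # denied under an allow-list
```

The unanticipated host passes the deny-list because no one named it out, and fails the allow-list for the same reason: no one named it in. One side effect of this sketch worth noting is that malformed input also passes the deny-list check, since it matches nothing on the list.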

5.3 Human Oversight (HO)

Micro-approvals fail at the human layer: is this nmap okay? this curl? this whois? After the fortieth approval the operator stops reading, which is the exact failure mode the approval was meant to prevent. This is Sloptimism manufactured by the interface itself.

Rejected: micro-approvals
allow nmap -sV ? · allow curl ? · allow whois ? · allow gobuster ? · … ×40 · attention exhausted
Strategic gate
Target · 10.0.4.5 (core banking DB)
Action · CVE-2024-XXXX
Impact: CRITICAL · Rollback: documented
Decision persists alongside the engagement's evidence hashes.
AUTHORISE · DENY
Figure 5. The strategic gate. Rather than a stream of per-command approvals, which the operator stops reading after the fortieth, a small number of high-stakes decisions each carry a structured, persisted payload the operator can actually reason about.

AttackBench uses the strategic gate instead: a small number of high-stakes decisions per engagement, each carrying a structured payload (target, proposed action, predicted impact, referenced CVE, rollback procedure) persisted alongside the operator’s decision and the engagement’s evidence hashes. The operator reads carefully a handful of times rather than skimming continuously. This is the graduated-escalation model of SC-006 and the propose-with-impact pattern of AL-009: informational actions run, consequential ones present a reviewable case.
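A gate payload of this shape is straightforward to model. The following is a sketch under assumptions, not the platform's schema: the field names follow the list in the paragraph above, while `GatePayload`, `needs_gate`, `record_decision`, the `"INFORMATIONAL"` label, and the JSON serialisation are hypothetical. The CVE identifier is left elided exactly as it appears in Figure 5.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class GatePayload:
    target: str
    action: str
    impact: str          # e.g. "INFORMATIONAL" or "CRITICAL" (labels assumed)
    cve: str
    rollback: str
    evidence_hashes: Tuple[str, ...]


def needs_gate(impact: str) -> bool:
    """Informational actions run; anything consequential presents a reviewable case."""
    return impact != "INFORMATIONAL"


def record_decision(payload: GatePayload, decision: str, operator: str) -> str:
    """Serialise the payload together with the operator's decision for persistence."""
    if decision not in ("AUTHORISE", "DENY"):
        raise ValueError("decision must be AUTHORISE or DENY")
    record = {"payload": asdict(payload), "decision": decision, "operator": operator}
    # sort_keys gives a stable byte representation, convenient if the record is hashed.
    return json.dumps(record, sort_keys=True)


payload = GatePayload(
    target="10.0.4.5",
    action="exploit",
    impact="CRITICAL",
    cve="CVE-2024-XXXX",
    rollback="documented",
    evidence_hashes=("e3b0c442…",),
)
if needs_gate(payload.impact):
    print(record_decision(payload, "DENY", operator="operator-1"))
```

Keeping the decision in the same record as the payload means a later reviewer sees what the operator was shown, not just what they chose.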

5.4 Safety Controls (SC)

Some controls cannot depend on what the agent thinks. The kill switch is the canonical one: an out-of-band path that terminates every active subagent and tool subprocess (nmap, msf, operator-provided tools) regardless of in-flight reasoning. The architecture has been wired for this from the original design; sub-five-second propagation across the active task graph, which APTS codifies as SC-009, is the latency target the platform tunes against.

kill switch · out-of-band
> SIGTERM → subagents · nmap · msf · tool subprocesses
[!] task graph halted · target latency < 5s
target-health check
baseline TTFB: 142 ms
post-exploit TTFB: 4,802 ms
regression detected → run halted
Figure 6. Hard limits. An out-of-band kill switch terminates every subagent and tool subprocess regardless of in-flight reasoning; target-health checks halt the run on an order-of-magnitude latency regression. Both operate independently of what the model thinks.

At finer grain, target-health checks compare latency and reachability against an engagement-start baseline and halt the run on an order-of-magnitude regression (SC-010). The model, left alone, would happily keep firing payloads at a degrading system. Underneath both, the agent runtime is confined to a sandbox it holds no credentials to escape (SC-019), and its tool set is an allow-list enforced outside the model (SC-020).
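Both hard limits reduce to small pieces of logic that never consult the model. The sketch below is illustrative only and rests on assumptions: the factor of ten stands in for "order-of-magnitude", `should_halt` and `kill_all` are hypothetical names, and the kill routine runs in-process here, whereas the paper describes an out-of-band path that this code does not model.

```python
import subprocess
import sys
import time
from typing import List, Optional


def should_halt(baseline_ms: float, current_ms: Optional[float], factor: float = 10.0) -> bool:
    """Halt if the target is unreachable or latency regressed by `factor` or more."""
    if current_ms is None:  # no response at all counts as a failed health check
        return True
    return current_ms >= baseline_ms * factor


def kill_all(procs: List[subprocess.Popen], deadline_s: float = 5.0) -> bool:
    """Terminate every child process, escalating to kill at the deadline.

    Returns True once every process has exited.
    """
    for proc in procs:
        proc.terminate()  # SIGTERM on POSIX
    end = time.monotonic() + deadline_s
    for proc in procs:
        remaining = max(0.0, end - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            proc.kill()  # the process ignored termination; force it
            proc.wait()
    return all(proc.poll() is not None for proc in procs)


if __name__ == "__main__":
    # Stand-ins for tool subprocesses: two children that would sleep for a minute.
    children = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        for _ in range(2)
    ]
    if should_halt(baseline_ms=142.0, current_ms=4802.0):
        print("halted:", kill_all(children))
```

With the Figure 6 numbers, 4,802 ms against a 142 ms baseline is roughly a 34-fold regression, so the check trips and the children are terminated within the deadline.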

5.5 Manipulation Resistance (MR)

APTS’s MR-001 and OWASP’s LLM01 collapse to the same threat: the agent reads hostile input from the network it is auditing. SQL-injection error pages, HTTP response bodies, even server headers are all carriers for indirect prompt injection. The countermeasures in AttackBench are architectural, not prompt-level. Anything retrieved from a target is typed and routed through the guardrail layer for sanitisation (MR-002); the agent reasons about evidence but never executes it. Every finding is recorded alongside the hash of the response that triggered it, and the Evidence Validator agent reconciles the two, so a finding fabricated from injected target content has no surviving evidence to point to and fails validation. The sandboxed execution plane means a successful injection inherits the sandbox’s reach, not the orchestrator’s. Widening the architectural distance between the instruction stream and the guardrail boundary further still (MR-018) is the natural continuation of the separation already in place.
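Typing target data as evidence can be made concrete with two distinct types and a context builder that refuses to mix them. This is a minimal sketch, not the guardrail layer itself: `Instruction`, `Evidence`, `wrap_evidence`, and `build_context` are hypothetical names, the sanitisation shown (dropping non-printable characters) is a deliberately crude stand-in, and SHA-256 is an assumed hash.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Instruction:
    text: str  # authored by the operator or orchestrator only


@dataclass(frozen=True)
class Evidence:
    source: str
    body: str    # sanitised copy of what the target returned
    sha256: str  # digest of the raw response, before sanitisation


def wrap_evidence(source: str, raw: str) -> Evidence:
    """Hash the raw response, then strip non-printable characters from the copy kept."""
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    clean = "".join(ch for ch in raw if ch.isprintable() or ch in "\n\t")
    return Evidence(source=source, body=clean, sha256=digest)


def build_context(instructions: List[Instruction], evidence: List[Evidence]) -> Dict[str, list]:
    """Keep the two streams in separate channels and reject anything mistyped."""
    for item in instructions:
        if not isinstance(item, Instruction):
            raise TypeError("only Instruction objects may enter the instruction stream")
    for item in evidence:
        if not isinstance(item, Evidence):
            raise TypeError("only Evidence objects may enter the evidence stream")
    return {
        "instructions": [i.text for i in instructions],
        "evidence": [{"source": e.source, "body": e.body, "sha256": e.sha256} for e in evidence],
    }
```

A response body that says "ignore previous instructions" still reaches the model in this sketch, but only as a value in the evidence channel; passing it where an `Instruction` is expected raises a `TypeError` rather than silently promoting it.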

6. An underspecified area: reconnaissance OPSEC

APTS is thorough on the data the platform handles. It classifies evidence by sensitivity and mandates encryption at rest (AR-015, TP-012); it requires discovered credentials to be encrypted and kept out of logs, reports, and model context (MR-019); it governs the trust placed in the AI provider, including disclosure of the foundation model and assurance that it is not retrained on customer data (TP-001, TP-019, TP-021). What it does not yet address is the operational security of the agent’s own reconnaissance.

An autonomous pentest agent doing open-source intelligence work issues queries to search engines, to public data brokers, to code-search and certificate-transparency services, and those queries are themselves a disclosure. Searching for a client’s domains, IP ranges, internal hostnames, or employee names tells every queried third party that the client exists, that it is interesting, and quite possibly that it is under test. The information leaves the engagement not as exfiltrated evidence but as the shape of the agent’s curiosity. APTS’s Manipulation Resistance domain is concerned with hostile data coming in; its Supply Chain Trust domain is concerned with the providers the platform vendor chose. Neither governs what the agent reveals about the client by virtue of what it looks up. SE-003’s exclusion of third-party infrastructure is about not testing third parties, not about not leaking to them.

AttackBench does not consider this solved, but its architecture constrains it by construction. The sandboxed execution plane gives the agent no direct network path to the open internet; web search is read-only and available only in non-executing modes, never to an agent mid-engagement; and the optional egress firewall is default-drop, admitting only the engagement’s target CIDRs.

Sandboxed execution plane: No direct network access. Web search is read-only and available only in non-executing modes, never to an agent mid-engagement.
egress firewall · default-drop
Admitted: target CIDRs only
Dropped: search engines · OSINT brokers · open internet
Figure 7. Egress containment. The execution plane gives the agent no direct network path; an optional default-drop firewall admits only the engagement's target CIDRs. The agent cannot quietly broadcast the client's environment to the open internet; that is a structural mitigation for an area the standard does not yet name.
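The containment policy has two parts, and both are simple to state in code. The sketch below is a model of the policy only, under assumptions: the `Mode` names are hypothetical, the CIDR is illustrative, and in a real deployment the drop decision would be made by the firewall itself rather than by Python.

```python
import ipaddress
from enum import Enum
from typing import Iterable


class Mode(Enum):
    PLANNING = "planning"    # hypothetical non-executing mode
    EXECUTING = "executing"  # agent is mid-engagement


def web_search_available(mode: Mode) -> bool:
    """Web search is offered only outside execution, never mid-engagement."""
    return mode is not Mode.EXECUTING


def egress_verdict(dest_ip: str, target_cidrs: Iterable[str]) -> str:
    """Default-drop: admit a destination only if it falls inside a target CIDR."""
    addr = ipaddress.ip_address(dest_ip)
    for cidr in target_cidrs:
        if addr in ipaddress.ip_network(cidr):
            return "ADMIT"
    return "DROP"


print(egress_verdict("10.0.4.5", ["10.0.4.0/24"]))  # inside the engagement's targets
print(egress_verdict("8.8.8.8", ["10.0.4.0/24"]))   # open internet
```

Nothing in this policy inspects the content of a query, which is why it constrains the reconnaissance footprint without solving it: it decides where traffic may go, not what a permitted query reveals.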

An agent inside that boundary cannot quietly fan out across the open internet, because the boundary does not route there. What remains as genuine future work is finer-grained control over the reconnaissance footprint when open-internet OSINT is in scope: redaction and minimisation of client-identifying terms in outbound queries. That is an open problem, and naming it is part of why a standard like APTS matters: the next version has somewhere to put it.

7. Discussion: anticipation, not coincidence

The controls in Sections 4 and 5 (the hashed evidence chain, the layered rules-of-engagement enforcement, strategic-gate approvals, the out-of-band kill switch, the sandboxed execution plane) were part of AttackBench’s design from 2024, before APTS existed. The convergence is not luck. Anyone designing an autonomous offensive platform who takes the blast radius seriously is pushed toward the same small set of conclusions: the operator cannot be the brake, the model cannot be trusted with its own constraints, and the evidence trail has to survive a compromised host. The APTS working group reached those conclusions from the direction of governance; AttackBench reached them from the direction of trying not to ship a footgun. Two independent derivations landing close together is a reasonable signal that the conclusions are sound.

What the standard adds is precision and a shared bar. The harder edges APTS pushes on are alignment work on top of the existing architecture rather than redesigns of it: sub-five-second kill-switch propagation measured across the active task graph (SC-009), deeper architectural separation between instruction and evidence streams (MR-018), per-tool impact scoring surfaced into the strategic-gate payload. None of that should be read as a claim that AttackBench’s particular choices are the right ones; they are one defensible set of trade-offs, and the platform keeps changing as the field does. The value of a standard is precisely that it gives that change something steady to be measured against.

8. Conclusion

AttackBench was designed on the assumption that everyone in the loop will eventually be wrong, and that the architecture has to hold regardless. Mapped against APTS v0.1.0 after the fact, that assumption turns out to have anticipated much of the standard’s eight domains, not because the standard was followed but because the standard and the platform are answers to the same question. The one area where the platform’s design currently runs ahead of the standard, reconnaissance OPSEC, is also where the next revision of APTS has the most room to grow. None of this is a finished story. The platform was designed in 2024 and has evolved since, and a standard for autonomous offensive AI is exactly what lets a fast-moving platform be evaluated as more than a snapshot. The standard was needed; building toward it before it arrived is just what taking the problem seriously looked like.

References

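The requirement map below ties the APTS identifiers cited in this paper to the AttackBench mechanisms that address them. Identifiers cited only to describe what the standard covers (SE-003, TP-001, TP-019, TP-021) are not mapped.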

APTS requirement | AttackBench mechanism
AR-001, AR-012: structured, append-only event logging | Cryptographically hashed, append-only event log
AR-010: evidence hashing | Evidence Validator agent reconciles findings against cryptographically hashed evidence; the chain reconciles or the finding fails review
AR-020: audit-trail isolation | Authoritative trail written to infrastructure the agent has no credentials to reach
SE-001: machine-parseable rules of engagement | Engagement profile resolved into the agent’s tool set at start
SE-006: per-action scope validation | RoE Validator agent checks each action’s target, nature, and timing
SE-009: hard deny-lists for critical systems | Inverted: explicit scope allow-list (default-deny), with deny enforcement pushed to the firewall / network layer
HO / AL-009: propose actions with impact for approval | Strategic-gate payload: target, action, impact, CVE, rollback
SC-006: graduated escalation | Informational actions auto-run; consequential actions gate
SC-009: kill switch | Out-of-band termination of the active task graph
SC-010: target-health monitoring | Latency/reachability baseline; halt on order-of-magnitude regression
SC-019, SC-020: sandboxed runtime, external tool allow-list | Ephemeral container; tool registry enforced outside the model
MR-001, MR-002: manipulation resistance, input sanitisation | Guardrail layer; target data typed as evidence, never as instruction
MR-018: instruction/evidence separation | Architectural separation in place; deepening it is active work
MR-019, TP-012, AR-015: data classification and handling | Hashed, structured evidence store (see §6 for the gap this leaves)

OWASP Autonomous Penetration Testing Standard (APTS) v0.1.0: owasp.org/APTS.
OWASP Top 10 for LLM Applications, LLM01: Prompt Injection.
On Sloptimism and validation fatigue, see Bugcrowd, Sloptimism is breaking any system built on human validation.
The human-factors framing draws on L. Bauer, D. Brumley, J. Calandrino, N. Christin, G. Fanti, V. Gligor, B. Parno, J. Patel, V. Sekar, and J. Sherry, Creating a Scientific Foundation for Cyber Autonomy, CyLab, Carnegie Mellon University, v1.0.0, March 2026.

For inquiries about the AttackBench architecture or the NDay AttackBench platform, see ndaysecurity.com.