AttackBench
Safety controls for autonomous offensive AI, and an implementation that anticipated the OWASP Autonomous Penetration Testing Standard.
Abstract. Autonomous penetration-testing platforms put an LLM in the loop of an activity whose failure modes are measured in dropped production databases and out-of-scope exploitation. AttackBench is an agentic offensive-security platform originally designed in 2024, before any governance standard for this class of system existed, on the assumption that both the model and its human operator will, at some point, be wrong. This paper describes AttackBench’s safety architecture and maps it against the OWASP Autonomous Penetration Testing Standard (APTS) v0.1.0, released subsequently. The platform’s controls (a hashed, replayable evidence chain; layered rules-of-engagement enforcement, including dedicated validator agents; strategic approval gates; an out-of-band kill switch; and a sandboxed execution plane) align with the majority of APTS’s eight domains without having been designed against them. A recurring theme is deliberate economy: where a proven, widely understood building block already exists, the platform leans on it rather than reinventing it, keeping its surface area small. The paper also identifies one area the standard does not yet address: the operational security of the agent’s own reconnaissance, the proprietary information an autonomous agent discloses to third parties simply by querying them. None of this is offered as a finished or ideal design; like the field it operates in, AttackBench continues to evolve.
Note. The figures and walkthroughs in this paper are simplified illustrations of AttackBench’s design intent. They are not AttackBench outputs and should not be evaluated as such.
1. Introduction
The premise of agentic offensive security is that an LLM-driven agent can carry out the reconnaissance, exploitation, and lateral-movement work of a penetration test with far less human time per engagement. The premise is sound. The problem is the blast radius. A mistake in a SaaS product is a bad render; a mistake in an offensive engagement is an exploit fired at an out-of-scope host, a dropped production database, or an indirect prompt injection that arrived inside an HTTP response from the very system under test.
The design constraint everything else compounds against is Sloptimism: the optimism bias that sets in when a human is asked to validate a fluent, confident machine. Under cognitive load and a steady stream of plausible-sounding agent recommendations, the operator rubber-stamps a destructive action. It is not a failing to be fixed with training; it is the predictable result of putting a confident agent in front of a human and asking the human to be the brake. Bugcrowd has written about it directly, and Carnegie Mellon’s CyLab, in its 2026 whitepaper Creating a Scientific Foundation for Cyber Autonomy, frames the same tension as a foundational open problem: balancing human cognition against automation, and working out when an operator should be consulted and what must be put in front of them to decide. Any safety argument that terminates in “the operator will catch it” has no brake at all.
AttackBench is an agentic penetration-testing platform, an orchestrator that drives LLM-backed offensive subagents through scoped engagements, originally designed in 2024 on the opposite assumption: that the operator will be wrong, that the model will be wrong, and that the architecture has to hold anyway. Findings still come from the agent; that is the point of agentic offense. What the platform layers on is durability, observability, and a set of safety controls that do not depend on anyone in the loop being reliable. It is not a finished artifact. Agentic AI has changed substantially since 2024 and continues to, and the platform has evolved alongside it. That is itself a reason any assessment of a platform in this space should be read as a snapshot of a moving target rather than a verdict.
This paper maps that architecture against the OWASP Autonomous Penetration Testing Standard (APTS) v0.1.0, a governance standard for exactly this class of system, published well after AttackBench was first designed. Section 2 gives background on the platform and where it sits in a security program. Section 3 summarises the standard. Section 4 describes AttackBench’s safety architecture. Section 5 maps the architecture against representative APTS requirements across the standard’s eight domains. Section 6 identifies an area APTS v0.1.0 underspecifies. Section 7 discusses why a platform built before the standard nonetheless converges on it.
2. Background
AttackBench is a commercial autonomous offensive-security platform from NDay Security. It runs LLM-backed agents through scoped penetration-testing engagements, carrying an engagement from reconnaissance through exploitation and lateral movement, and produces evidence-backed findings a human can review and act on. The agents do the offensive work; the platform around them supplies the scoping, the safety controls, the evidence trail, and the operator touch-points.
I designed AttackBench’s safety architecture: the controls this paper describes, and the threat model they answer to. Offensive capability is the platform’s reason to exist, but capability was never the hard part. What keeps an autonomous offensive tool from becoming a liability is everything built around that capability, and that is the work this paper is about.
In a traditional engagement model, a penetration test is a point-in-time exercise bounded by how much skilled human time a client can buy. AttackBench changes the economics of the execution phase, which is what makes it a fit for Continuous Threat Exposure Management (CTEM). Rather than validating exposures once a quarter, a CTEM program can validate them continuously, with the platform serving as the adversarial-validation stage that confirms which exposures are actually reachable and exploitable. It does not replace a human pentest team. It shifts that team’s attention from execution toward scoping, judgement, and the findings that genuinely warrant it.
The same engagement engine sits behind three interfaces. A GUI gives an operator a dashboard view of an engagement: scope, strategic gates, findings, and the evidence behind them. A TUI gives the same control from the terminal. A headless mode exposes the engine for automation, so a validation run can be triggered by a schedule or a change in the environment rather than by a person opening an application. That last interface is what makes the CTEM fit practical, since continuous validation has to run without someone starting it each time.
3. The APTS standard
The OWASP Autonomous Penetration Testing Standard is a governance standard for autonomous penetration-testing platforms: it defines what such a system must do to operate safely, transparently, and within defined boundaries. It is explicitly a governance framework, not a testing methodology. It does not say how to find a vulnerability; it says what a platform that finds vulnerabilities autonomously must guarantee about its own conduct.
This is a genuine and overdue contribution. Before APTS, every vendor in the agentic-offense space made its own safety claims against its own private rubric, and a buyer had no shared vocabulary to compare them. APTS supplies that vocabulary. Version 0.1.0 defines 173 tier-required requirements across eight domains, organised into compliance tiers: Tier 1 (Foundation, 72 requirements), Tier 2 (Verified, a further 85), and Tier 3. It also defines a graduated model of autonomy from L1 (Assisted) to L4 (Fully Autonomous).
| Domain | Prefix | Scope |
|---|---|---|
| Scope Enforcement | SE | Defining, validating, and enforcing testing boundaries |
| Safety Controls & Impact Management | SC | Impact classification, blast radius, kill switches, rollback, execution sandbox |
| Human Oversight & Intervention | HO | Approval gates, dashboards, escalation, operator qualification |
| Graduated Autonomy Levels | AL | Four autonomy levels, each with its own containment expectations |
| Auditability & Reproducibility | AR | Logging, decision trails, evidence integrity, audit-trail isolation |
| Manipulation Resistance | MR | Prompt injection, adversarial input, scope-widening defence |
| Third-Party & Supply Chain Trust | TP | AI-provider trust, data handling, multi-tenancy isolation, model disclosure |
| Reporting | RP | Finding validation, confidence scoring, coverage disclosure |
Figure 1. The eight domains of APTS v0.1.0. The remainder of this paper spotlights representative requirements from these domains rather than auditing all 173.
Clearing that bar is itself a challenge the field has yet to broadly meet. Most agentic pentesting offerings available today were built capability-first, optimised to find and exploit before they were designed to constrain themselves, and retrofitting a governance standard like APTS onto that foundation is real work. That gap is less a criticism of any one product than a measure of how young the category is, and of how much the standard has just made legible.
The sections that follow do not attempt a line-by-line compliance audit. They take representative requirements, the ones that most sharply express each domain’s intent, and show where AttackBench’s architecture already meets them.
4. AttackBench’s safety architecture
AttackBench’s threat model has three actors that can each be wrong: the operator (Sloptimism), the model (confident error, or capture by injected instructions), and the target (a hostile environment that returns data engineered to manipulate the agent reading it). The architecture’s response is defense in depth: no single control is load-bearing, and every layer assumes the layer above it has already failed.
The layers, top to bottom: strategic gates reduce the operator’s decision surface to a few high-stakes choices; the rules of engagement are enforced in more than one place, with the agent’s tool set narrowed to the engagement profile before it plans and a dedicated RoE Validator agent checking each proposed action against the rules as the engagement runs; the guardrail layer sanitises input and output and keeps target-derived data out of the instruction stream; an Evidence Validator agent reconciles every finding against the evidence behind it; the sandbox gate refuses to run offensive tooling on the host at all; and the ephemeral container that ultimately runs that tooling has no path back to the orchestrator, the evidence store, or the operator’s network. An indirect prompt injection that lands a reverse shell executes inside that container and finds nothing else.
One tendency runs through the whole stack: where a proven, widely understood building block already does the job, the platform uses it rather than building its own. Engagement state is versioned with git; the evidence journal is an ordinary structured database. The payoff is a smaller surface area to reason about and natural compatibility with tooling and standards that already exist; the cost, accepted knowingly, is that the platform inherits those building blocks’ limits rather than engineering around them. It is one defensible set of trade-offs, not a claim to the best one.
These controls were part of AttackBench’s design from 2024, before APTS existed. The platform has not stood still since; agentic AI has moved quickly and the architecture has moved with it, but the load-bearing ideas were there early. The following section shows what the convergence with APTS looks like.
5. Mapping to key APTS requirements
5.1 Auditability and Reproducibility (AR)
Every prompt, tool invocation, network response, and finding is cryptographically hashed and written to a structured, append-only journal, an ordinary database doing ordinary work. Findings reference their evidence by hash; evidence references the network response that produced it by hash; an Evidence Validator agent reconciles the two. Engagement state is committed into git, so a finished run is a tagged commit rather than a chat transcript: durable versioning the platform did not have to invent.
This is not a claim of in-band immutability: a log that lives on a compromised host can be edited by whoever compromised it. AttackBench’s answer is durability through redundancy and isolation. Hashes give fingerprint reconciliation (AR-010); the structured schema gives observability and append-only event logging (AR-001, AR-012); git snapshots give replay; and the authoritative trail is written to infrastructure the agent holds no credentials to reach (AR-020), so the agent can append to the record but has no path to alter or delete the store it lands in.
5.2 Scope Enforcement (SE)
Rules of engagement in AttackBench are not enforced at a single chokepoint. One layer resolves them at engagement-start: an operator selects a target profile (staging, safe-production, out-of-window) and the agent’s available tool set is narrowed to that profile before the agent has a chance to plan, so a denied capability is absent rather than refused on call. A second layer runs continuously: a dedicated RoE Validator agent checks each proposed action against the rules as the engagement proceeds, weighing its target, its nature, and its timing. Tool-set filtering and the validator agent cover for each other; neither is the whole of scope enforcement.
The up-front layer matters because a post-hoc filter on the model’s output is the wrong place to start: a model that has been allowed to plan against a destructive action has already spent attention budget on it, and the operator approving the plan inherits that frame. Narrowing the tool set up front means the agent rarely composes a strategy that depends on an action it cannot take. The RoE Validator agent then catches what a static tool set cannot express: scope that depends on the specific target resolved at runtime, or on the current time window. Together they implement the machine-parseable rules-of-engagement ingestion APTS describes in SE-001 and the per-action scope validation of SE-006.
SE-009 is the one place this section parts company with the standard’s letter. APTS prescribes immutable hard deny-lists for production databases, identity providers, control systems, and other critical infrastructure. AttackBench inverts that: scope is an explicit allow-list, so a system is out of bounds unless it has been named in, not safe until it has been named out. Everything unlisted is denied by default. The deny enforcement that does exist is pushed down to the firewall, the network layer, because that is the one place a boundary holds regardless of whether the agent understood what it was doing. A policy-layer deny-list assumes the agent grasps the full context and downstream effects of every action; it has no way to know that a command aimed at an in-scope host will cascade into a critical system the list was never written to anticipate. An allow-list, backed by network-layer deny, does not depend on that understanding.
5.3 Human Oversight (HO)
Micro-approvals fail at the human layer: is this nmap okay? this curl? this whois? After the fortieth approval the operator stops reading, which is the exact failure mode the approval was meant to prevent. This is Sloptimism manufactured by the interface itself.
AttackBench uses the strategic gate instead: a small number of high-stakes decisions per engagement, each carrying a structured payload (target, proposed action, predicted impact, referenced CVE, rollback procedure) persisted alongside the operator’s decision and the engagement’s evidence hashes. The operator reads carefully a handful of times rather than skimming continuously. This is the graduated-escalation model of SC-006 and the propose-with-impact pattern of AL-009: informational actions run, consequential ones present a reviewable case.
5.4 Safety Controls (SC)
Some controls cannot depend on what the agent thinks. The kill switch is the canonical one: an out-of-band path that terminates every active subagent and tool subprocess (nmap, msf, operator-provided tools) regardless of in-flight reasoning. The architecture has been wired for this from the original design; sub-five-second propagation across the active task graph, which APTS codifies as SC-009, is the latency target the platform tunes against.
At finer grain, target-health checks compare latency and reachability against an engagement-start baseline and halt the run on an order-of-magnitude regression (SC-010). The model, left alone, would happily keep firing payloads at a degrading system. Underneath both, the agent runtime is confined to a sandbox it holds no credentials to escape (SC-019), and its tool set is an allow-list enforced outside the model (SC-020).
5.5 Manipulation Resistance (MR)
APTS-MR-001 and OWASP’s LLM01 collapse to the same threat: the agent reads hostile input from the network it is auditing. SQL-injection error pages, HTTP response bodies, even server headers are all carriers for indirect prompt injection. The countermeasures in AttackBench are architectural, not prompt-level. Anything retrieved from a target is typed and routed through the guardrail layer for sanitisation (MR-002); the agent reasons about evidence, it never executes it. Every finding is recorded alongside the hash of the response that triggered it, and the Evidence Validator agent reconciles the two, so a finding fabricated from injected target content has no surviving evidence to point to and fails validation. The sandboxed execution plane means a successful injection inherits the sandbox’s reach, not the orchestrator’s. Widening the architectural distance between the instruction stream and the guardrail boundary further still (MR-018) is the natural continuation of separation already in place.
6. An underspecified area: reconnaissance OPSEC
APTS is thorough on the data the platform handles. It classifies evidence by sensitivity and mandates encryption at rest (AR-015, TP-012); it requires discovered credentials to be encrypted and kept out of logs, reports, and model context (MR-019); it governs the trust placed in the AI provider, including disclosure of the foundation model and assurance that it is not retrained on customer data (TP-001, TP-019, TP-021). What it does not yet address is the operational security of the agent’s own reconnaissance.
An autonomous pentest agent doing open-source intelligence work issues queries to search engines, to public data brokers, to code-search and certificate-transparency services, and those queries are themselves a disclosure. Searching for a client’s domains, IP ranges, internal hostnames, or employee names tells every queried third party that the client exists, that it is interesting, and quite possibly that it is under test. The information leaves the engagement not as exfiltrated evidence but as the shape of the agent’s curiosity. APTS’s Manipulation Resistance domain is concerned with hostile data coming in; its Supply Chain Trust domain is concerned with the providers the platform vendor chose. Neither governs what the agent reveals about the client by virtue of what it looks up. SE-003’s exclusion of third-party infrastructure is about not testing third parties, not about not leaking to them.
AttackBench does not consider this solved, but its architecture constrains it by construction. The sandboxed execution plane gives the agent no direct network path to the open internet; web search is read-only and available only in non-executing modes, never to an agent mid-engagement; and the optional egress firewall is default-drop, admitting only the engagement’s target CIDRs.
An agent inside that boundary cannot quietly fan out across the open internet, because the boundary does not route there. What remains as genuine future work is finer-grained control over the reconnaissance footprint when open-internet OSINT is in scope: redaction and minimisation of client-identifying terms in outbound queries. That is an open problem, and naming it is part of why a standard like APTS matters: the next version has somewhere to put it.
7. Discussion: anticipation, not coincidence
The controls in Sections 4 and 5 (the hashed evidence chain, the layered rules-of-engagement enforcement, strategic-gate approvals, the out-of-band kill switch, the sandboxed execution plane) were part of AttackBench’s design from 2024, before APTS existed. The convergence is not luck. Anyone designing an autonomous offensive platform who takes the blast radius seriously is pushed toward the same small set of conclusions: the operator cannot be the brake, the model cannot be trusted with its own constraints, and the evidence trail has to survive a compromised host. The APTS working group reached those conclusions from the direction of governance; AttackBench reached them from the direction of trying not to ship a footgun. Two independent derivations landing close together is a reasonable signal that the conclusions are sound.
What the standard adds is precision and a shared bar. The harder edges APTS pushes on are alignment work on top of the existing architecture rather than redesigns of it: sub-five-second kill-switch propagation measured across the active task graph (SC-009), deeper architectural separation between instruction and evidence streams (MR-018), per-tool impact scoring surfaced into the strategic-gate payload. None of that should be read as a claim that AttackBench’s particular choices are the right ones; they are one defensible set of trade-offs, and the platform keeps changing as the field does. The value of a standard is precisely that it gives that change something steady to be measured against.
8. Conclusion
AttackBench was designed on the assumption that everyone in the loop will eventually be wrong, and that the architecture has to hold regardless. Mapped against APTS v0.1.0 after the fact, that assumption turns out to have anticipated much of the standard’s eight domains, not because the standard was followed but because the standard and the platform are answers to the same question. The one area where the platform’s design currently runs ahead of the standard, reconnaissance OPSEC, is also where the next revision of APTS has the most room to grow. None of this is a finished story. The platform was designed in 2024 and has evolved since, and a standard for autonomous offensive AI is exactly what lets a fast-moving platform be evaluated as more than a snapshot. The standard was needed; building toward it before it arrived is just what taking the problem seriously looked like.
References
The requirement map below ties every APTS identifier cited in this paper to the AttackBench mechanism that addresses it.
| APTS requirement | AttackBench mechanism |
|---|---|
| AR-001, AR-012: structured, append-only event logging | Cryptographically hashed, append-only event log |
| AR-010: evidence hashing | Evidence Validator agent reconciles findings against cryptographically hashed evidence; the chain reconciles or the finding fails review |
| AR-020: audit-trail isolation | Authoritative trail written to infrastructure the agent has no credentials to reach |
| SE-001: machine-parseable rules of engagement | Engagement profile resolved into the agent’s tool set at start |
| SE-006: per-action scope validation | RoE Validator agent checks each action’s target, nature, and timing |
| SE-009: hard deny-lists for critical systems | Inverted: explicit scope allow-list (default-deny), with deny enforcement pushed to the firewall / network layer |
| HO / AL-009: propose actions with impact for approval | Strategic-gate payload: target, action, impact, CVE, rollback |
| SC-006: graduated escalation | Informational actions auto-run; consequential actions gate |
| SC-009: kill switch | Out-of-band termination of the active task graph |
| SC-010: target-health monitoring | Latency/reachability baseline; halt on order-of-magnitude regression |
| SC-019, SC-020: sandboxed runtime, external tool allow-list | Ephemeral container; tool registry enforced outside the model |
| MR-001, MR-002: manipulation resistance, input sanitisation | Guardrail layer; target data typed as evidence, never as instruction |
| MR-018: instruction/evidence separation | Architectural separation in place; deepening it is active work |
| MR-019, TP-012, AR-015: data classification and handling | Hashed, structured evidence store (see §6 for the gap this leaves) |
OWASP Autonomous Penetration Testing Standard (APTS) v0.1.0: owasp.org/APTS. OWASP Top 10 for LLM Applications, LLM01: Prompt Injection. On Sloptimism and validation fatigue, see Bugcrowd, Sloptimism is breaking any system built on human validation. The human-factors framing draws on L. Bauer, D. Brumley, J. Calandrino, N. Christin, G. Fanti, V. Gligor, B. Parno, J. Patel, V. Sekar, and J. Sherry, Creating a Scientific Foundation for Cyber Autonomy, CyLab, Carnegie Mellon University, v1.0.0, March 2026.
For inquiries about the AttackBench architecture or the NDay AttackBench platform, see ndaysecurity.com.