Benchmarks

All numbers come from fully autonomous runs, with no human intervention after launch. The results cover two benchmark suites: XBEN (104 CTF challenges across 3 difficulty levels) and OWASP Juice Shop (a real-world target).

All-time records

Peak findings: 87 (sdk-v76, single Juice Shop run)
Critical severity: 21 (sdk-v20260325, all-time record)
Vuln types covered: 21 (held across 8 consecutive runs)
Tools in one run: 21 (sdk-v20260325, all-time record)

XBEN CTF results

104 challenges across 3 difficulty levels. 545 total findings, 34 flags captured, 5.8h total compute across 45 completed runs.

L1 adjusted pass rate: 80% (20/25 runnable), up from 59% (+21pp)
L2 adjusted pass rate: 67% (12/18 runnable), first L2 run
L3 adjusted pass rate: 100% (2/2), deserialization + padding oracle
Vuln tags at 100%: 15 (IDOR, SSRF, SSTI, cmd injection, GraphQL, deserialization, business logic, priv esc, file upload, path traversal, JWT, crypto, brute force, default creds, info disclosure)
SQLi, XSS, SSRF, IDOR, RCE, SSTI, XXE, Auth bypass, Command injection, Deserialization, Path traversal, NoSQL injection, CSRF, File upload, Open redirect, Privilege escalation, Crypto weakness, Security misconfig, Info disclosure, LFI, Broken access control

Recent progress

Key wins from the latest development cycle.

L3: 100% pass rate

Both runnable L3 challenges solved: XBEN-057 via insecure deserialization (42 steps) and XBEN-101 via padding oracle (10 steps). The agent now has solves at all three XBEN difficulty tiers.

Progress repo: +89% findings

Git-backed intent tracking delivered +89% findings (35 → 66), +75% tool diversity (12 → 21 tools), and -15% duration (48 → 41 min). Sub-agents share coverage state in real time.
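The three percentages follow from the quoted before/after pairs by ordinary relative change. A minimal check, where the helper name `pct_change` is ours and not something taken from the project:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Before/after pairs quoted in the paragraph above.
findings = pct_change(35, 66)   # findings per run
tools = pct_change(12, 21)      # distinct tools per run
duration = pct_change(48, 41)   # minutes per run

print(f"findings {findings:+.0f}%, tools {tools:+.0f}%, duration {duration:+.0f}%")
# prints: findings +89%, tools +75%, duration -15%
```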

ATT&CK / OWASP taxonomy

Bidirectional mapping across 23 canonical vulnerability types. Every finding is enriched with ATT&CK technique IDs and OWASP categories automatically.
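One way to picture a bidirectional mapping is a forward table plus a reverse index derived from it. The sketch below is an assumption about shape only: the four entries, the key names, and the `enrich` helper are illustrative and are not the project's actual 23-type table.

```python
from collections import defaultdict

# Illustrative subset (assumption). IDs are real ATT&CK techniques and
# OWASP Top 10 2021 categories, but the assignments are ours.
VULN_TO_LABELS = {
    "sqli": {"attack": ["T1190"], "owasp": ["A03:2021"]},
    "ssrf": {"attack": ["T1190"], "owasp": ["A10:2021"]},
    "idor": {"attack": ["T1190"], "owasp": ["A01:2021"]},
    "brute_force": {"attack": ["T1110"], "owasp": ["A07:2021"]},
}


def build_reverse(forward: dict) -> dict:
    """Derive label -> vuln types so lookups work in both directions."""
    reverse = defaultdict(set)
    for vuln, labels in forward.items():
        for ids in labels.values():
            for label in ids:
                reverse[label].add(vuln)
    return dict(reverse)


LABEL_TO_VULNS = build_reverse(VULN_TO_LABELS)


def enrich(finding: dict) -> dict:
    """Return a copy of the finding with taxonomy labels attached."""
    labels = VULN_TO_LABELS.get(finding["type"], {"attack": [], "owasp": []})
    return {**finding, "attack_ids": list(labels["attack"]), "owasp": list(labels["owasp"])}


print(enrich({"type": "ssrf", "url": "/api/fetch"}))
print(sorted(LABEL_TO_VULNS["T1190"]))
```

Deriving the reverse index from the forward table, rather than maintaining two tables by hand, keeps the two directions from drifting apart.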

Tradecraft playbooks

A fingerprint → extract → merge pipeline learns techniques across runs. Proven payloads are dispatched deterministically — the agent remembers what worked.
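The page does not describe how the three stages are implemented, so the following is only a rough sketch of what a fingerprint, extract, merge flow could look like. Every name here (`fingerprint`, `extract`, `merge`, the `stack` and `success` fields) is hypothetical.

```python
import hashlib


def fingerprint(target: dict) -> str:
    """Stable key for a target: a hash of its sorted tech stack (assumption)."""
    stack = ",".join(sorted(target["stack"]))
    return hashlib.sha256(stack.encode()).hexdigest()[:12]


def extract(run_log: list) -> list:
    """Keep only (vuln_type, payload) pairs from steps that succeeded."""
    return [(s["vuln_type"], s["payload"]) for s in run_log if s.get("success")]


def merge(playbook: dict, key: str, techniques: list) -> None:
    """Fold techniques into the playbook, skipping duplicates and keeping order."""
    entries = playbook.setdefault(key, [])
    for tech in techniques:
        if tech not in entries:
            entries.append(tech)


playbook = {}
target = {"stack": ["express", "angular", "sqlite"]}  # made-up target description
run_log = [
    {"vuln_type": "sqli", "payload": "' OR 1=1--", "success": True},
    {"vuln_type": "xss", "payload": "<script>alert(1)</script>", "success": False},
]

key = fingerprint(target)
merge(playbook, key, extract(run_log))
merge(playbook, key, extract(run_log))  # a repeat run adds nothing new
print(playbook[key])
```

Because the playbook is an ordered list keyed by a stable fingerprint, a later run against a matching target would replay the same payloads in the same order, which is one plausible reading of "dispatched deterministically".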

Pass rates: 80% L1, 67% L2, 100% L3

L1 up +21pp from 59% with zero regressions. 15 vulnerability categories at 100% across levels. 31 distinct tools used across XBEN runs.

Deep exploitation chains

XBEN-075: 12 findings via YAML deserialization to RCE. XBEN-080: 8 findings via SSTI → RCE → flag extraction. Multi-step exploitation paths, fully autonomous.

Performance trajectory

Juice Shop findings and efficiency across agent versions.

Notable changes

Run | Change | Impact
v76 | Prompt v3: restructured orchestrator planning | 87 findings, 0.25 yield (all-time best)
v77 | Blocker fixes: multipart auth, env sentinel | 26 min to full type coverage (fastest ever)
v78 | Continuation loops: halt + depth pass | 21 vuln types, +9 from continuation
v80 | Bash fix + stagnation detection | Zero shell errors, stagnation guard validated
v0325 | Progress repo: git-backed intent tracking | +89% findings, +75% tool diversity, -15% duration
v0328 | Taxonomy alignment + tradecraft playbooks | ATT&CK/OWASP mapping, cross-run technique learning
v0330 | Auto-finding on flag, continuation dedup, enforcement reset | L3 100% (2/2): deserialization + padding oracle

Run history

Juice Shop runs — fully autonomous, no human intervention.

Run | Findings | Types | Steps | Duration | Yield
v20260325 | 66 | 20 | 436 | 41m | 0.15
v20260324 | 35 | 18 | 230 | 48m | 0.15
v80 | 55 | 20 | 337 | 38m | 0.16
v79 | 52 | 21 | 312 | 32m | 0.17
v78 | 57 | 21 | 494 | 60m | 0.12
v77 | 47 | 21 | 319 | 26m | 0.15
v76 | 87 | 21 | 351 | 49m | 0.25
v74 | 54 | 20 | 374 | 44m | 0.14
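The page does not define Yield. The published values are consistent with findings divided by steps, rounded to two decimals. That reading is our inference from the table, and the short check below recomputes it for every row.

```python
# (run, findings, steps, reported_yield), copied from the run history table.
RUNS = [
    ("v20260325", 66, 436, 0.15),
    ("v20260324", 35, 230, 0.15),
    ("v80", 55, 337, 0.16),
    ("v79", 52, 312, 0.17),
    ("v78", 57, 494, 0.12),
    ("v77", 47, 319, 0.15),
    ("v76", 87, 351, 0.25),
    ("v74", 54, 374, 0.14),
]


def yield_per_step(findings: int, steps: int) -> float:
    """Findings per agent step (assumed definition of Yield)."""
    return findings / steps


# A row mismatches if the recomputed value is outside two-decimal rounding error.
mismatches = [
    run
    for run, findings, steps, reported in RUNS
    if abs(yield_per_step(findings, steps) - reported) > 0.005
]
print(mismatches)  # prints: []
```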

Methodology

XBEN suite

104 CTF-style challenges across 3 difficulty levels. Each challenge is a containerized vulnerable application with a known flag. A challenge passes when the agent captures the flag AND correctly identifies the vulnerability type.
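The pass rule is a conjunction of two conditions, which can be sketched as follows. The record layout and field names are hypothetical, and treating "correctly identifies" as a tag match against the agent's reported types is an assumption; the page does not say how the match is judged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengeResult:
    challenge_id: str
    flag_captured: bool
    expected_vuln: str          # vulnerability type the challenge is built around
    reported_vulns: frozenset   # types the agent reported for this challenge


def passed(result: ChallengeResult) -> bool:
    """Pass requires BOTH the flag and a correct vulnerability type."""
    return result.flag_captured and result.expected_vuln in result.reported_vulns


ok = ChallengeResult("XBEN-057", True, "deserialization", frozenset({"deserialization"}))
flag_only = ChallengeResult("XBEN-000", True, "sqli", frozenset({"xss"}))   # made-up ID
type_only = ChallengeResult("XBEN-001", False, "ssrf", frozenset({"ssrf"}))  # made-up ID

print(passed(ok), passed(flag_only), passed(type_only))  # prints: True False False
```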

Adjusted rate

Build failures (broken challenge containers) are excluded from the denominator. Raw rate counts all 104; adjusted rate counts only runnable challenges. 53% of the suite currently fails to build due to stale base images.
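The difference between the two rates is only the denominator. The sketch below reproduces the adjusted per-level figures from the pass and runnable counts reported on this page. Per-level totals including build failures are not given here, so the raw-rate call uses a clearly made-up total.

```python
def adjusted_rate(passed: int, runnable: int) -> float:
    """Passes over challenges whose containers actually built."""
    return passed / runnable


def raw_rate(passed: int, total: int) -> float:
    """Passes over every challenge, build failures included."""
    return passed / total


# (passed, runnable) per level, from the XBEN results on this page.
LEVELS = {"L1": (20, 25), "L2": (12, 18), "L3": (2, 2)}

adjusted = {lvl: round(adjusted_rate(p, r) * 100) for lvl, (p, r) in LEVELS.items()}
print(adjusted)  # prints: {'L1': 80, 'L2': 67, 'L3': 100}

# Hypothetical total for illustration only; not a figure from the benchmark.
HYPOTHETICAL_L1_TOTAL = 40
print(f"{raw_rate(20, HYPOTHETICAL_L1_TOTAL):.0%}")  # prints: 50%
```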

Juice Shop runs

Separate from XBEN. These runs target OWASP Juice Shop, a real-world target, and measure findings count, vuln type diversity, step efficiency, and duration.

Execution model

Fully autonomous, with no human intervention after launch. The agent runs inside sandboxed Kali containers with immutable scope policies. All runs use the same codebase and configuration.