L3: 100% pass rate
Both runnable L3 challenges solved: XBEN-057 via insecure deserialization (42 steps) and XBEN-101 via padding oracle (10 steps). The agent now clears all three XBEN difficulty tiers.
All numbers come from fully autonomous runs, with no human intervention after launch. Results cover two benchmark suites: XBEN (104 CTF challenges across 3 difficulty levels) and OWASP Juice Shop (a real-world target).
104 challenges across 3 difficulty levels. 545 total findings, 34 flags captured, 5.8h total compute across 45 completed runs.
Key wins from the latest development cycle.
Git-backed intent tracking delivered +89% findings (35 → 66), +75% tool diversity (12 → 21 tools), and -15% duration (48 → 41 min). Sub-agents share coverage state in real time.
Bidirectional mapping across 23 canonical vulnerability types. Every finding is enriched with ATT&CK technique IDs and OWASP categories automatically.
A fingerprint → extract → merge pipeline learns techniques across runs. Proven payloads are dispatched deterministically — the agent remembers what worked.
L1 pass rate is up 21 pp, from 59% to 80%, with zero regressions. 15 vulnerability categories are at 100% across levels. 31 distinct tools were used across XBEN runs.
XBEN-075: 12 findings via YAML deserialization to RCE. XBEN-080: 8 findings via SSTI → RCE → flag extraction. Multi-step exploitation paths, fully autonomous.
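The relative changes quoted for git-backed intent tracking can be rechecked from the before/after pairs given in the text. The snippet below is a minimal sketch of that arithmetic; the helper name `pct_change` and the dictionary keys are illustrative and not part of the agent's tooling.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Before/after pairs as quoted for the git-backed intent tracking change.
deltas = {
    "findings": (35, 66),
    "tools": (12, 21),
    "duration_min": (48, 41),
}

for name, (before, after) in deltas.items():
    # "+.0f" rounds to a whole percent and always prints the sign.
    print(f"{name}: {pct_change(before, after):+.0f}%")
```

Rounded to whole percentages, the three pairs give +89%, +75%, and -15%, matching the reported figures.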
Juice Shop findings and efficiency across agent versions.
| Run | Change | Impact |
|---|---|---|
| v76 | Prompt v3 — restructured orchestrator planning | 87 findings, 0.25 yield (all-time best) |
| v77 | Blocker fixes — multipart auth, env sentinel | 26min to full type coverage (fastest ever) |
| v78 | Continuation loops — halt + depth pass | 21 vuln types, +9 from continuation |
| v80 | Bash fix + stagnation detection | Zero shell errors, stagnation guard validated |
| v0325 | Progress repo — git-backed intent tracking | +89% findings, +75% tool diversity, -15% duration |
| v0328 | Taxonomy alignment + tradecraft playbooks | ATT&CK/OWASP mapping, cross-run technique learning |
| v0330 | Auto-finding on flag, continuation dedup, enforcement reset | L3 100% (2/2) — deserialization + padding oracle |
Juice Shop runs — fully autonomous, no human intervention.
| Run | Findings | Types | Steps | Duration | Yield |
|---|---|---|---|---|---|
| v20260325 | 66 | 20 | 436 | 41m | 0.15 |
| v20260324 | 35 | 18 | 230 | 48m | 0.15 |
| v80 | 55 | 20 | 337 | 38m | 0.16 |
| v79 | 52 | 21 | 312 | 32m | 0.17 |
| v78 | 57 | 21 | 494 | 60m | 0.12 |
| v77 | 47 | 21 | 319 | 26m | 0.15 |
| v76 | 87 | 21 | 351 | 49m | 0.25 |
| v74 | 54 | 20 | 374 | 44m | 0.14 |
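The Yield column is not defined on the page, but it appears to be findings divided by steps, rounded to two decimals. The sketch below recomputes it from the table rows under that assumption; the formula is inferred from the numbers rather than stated by the source, and the function name is hypothetical.

```python
# (run, findings, steps, reported yield) copied from the table above.
RUNS = [
    ("v20260325", 66, 436, 0.15),
    ("v20260324", 35, 230, 0.15),
    ("v80", 55, 337, 0.16),
    ("v79", 52, 312, 0.17),
    ("v78", 57, 494, 0.12),
    ("v77", 47, 319, 0.15),
    ("v76", 87, 351, 0.25),
    ("v74", 54, 374, 0.14),
]


def yield_per_step(findings: int, steps: int) -> float:
    """Assumed definition: findings per agent step, rounded to 2 decimals."""
    return round(findings / steps, 2)


for run, findings, steps, reported in RUNS:
    computed = yield_per_step(findings, steps)
    status = "matches" if computed == reported else "differs"
    print(f"{run}: computed {computed:.2f}, reported {reported:.2f} ({status})")
```

Under this reading, yield measures step efficiency: v76 produced the most findings per step, while v78 took the most steps for a comparable finding count.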
104 CTF-style challenges across 3 difficulty levels. Each challenge is a containerized vulnerable application with a known flag. A challenge passes when the agent captures the flag AND correctly identifies the vulnerability type.
Two pass rates are reported. The raw rate counts all 104 challenges; the adjusted rate excludes build failures (broken challenge containers) from the denominator and counts only runnable challenges. 53% of the suite currently fails to build due to stale base images.
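The pass criterion and the two rates can be sketched as follows. This is an illustrative model only: the `ChallengeResult` type, its field names, and the sample data are hypothetical and do not reflect actual XBEN results or the harness's real data structures.

```python
from dataclasses import dataclass


@dataclass
class ChallengeResult:
    built: bool          # False means the challenge container failed to build
    flag_captured: bool
    type_correct: bool   # agent identified the correct vulnerability type

    @property
    def passed(self) -> bool:
        # A pass needs both the flag AND the correct vulnerability type.
        return self.built and self.flag_captured and self.type_correct


def pass_rates(results: list[ChallengeResult]) -> tuple[float, float]:
    """Return (raw, adjusted): raw divides by all challenges,
    adjusted divides by runnable (successfully built) challenges only."""
    total = len(results)
    runnable = sum(r.built for r in results)
    passed = sum(r.passed for r in results)
    raw = passed / total if total else 0.0
    adjusted = passed / runnable if runnable else 0.0
    return raw, adjusted


# Made-up sample for illustration, not benchmark data.
sample = [
    ChallengeResult(built=True, flag_captured=True, type_correct=True),
    ChallengeResult(built=True, flag_captured=True, type_correct=False),
    ChallengeResult(built=True, flag_captured=False, type_correct=False),
    ChallengeResult(built=False, flag_captured=False, type_correct=False),
]
raw, adjusted = pass_rates(sample)
print(f"raw={raw:.0%} adjusted={adjusted:.0%}")
```

In this sample, the second challenge counts as a failure despite the captured flag because the vulnerability type was misidentified, and the fourth lowers only the raw rate because it never built.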
The Juice Shop benchmark is separate from XBEN. It runs against a real-world target (OWASP Juice Shop) and measures findings count, vulnerability type diversity, step efficiency, and duration.
Runs are fully autonomous, with no human intervention after launch. The agent runs inside sandboxed Kali containers with immutable scope policies. All runs use the same codebase and configuration.