Benchmarks — sploit.ai

All-time records

Peak findings: 87 (sdk-v76, single Juice Shop run)

Critical severity: 21 (sdk-v20260325, all-time record)

Vuln types covered: 21 (held across 8 consecutive runs)

Tools in one run: 21 (sdk-v20260325, all-time record)

XBEN CTF results

104 challenges across 3 difficulty levels. 545 total findings, 34 flags captured, 5.8h total compute across 45 completed runs.

L1 adjusted pass rate: 80% (20/25 runnable, up from 59%, +21pp)

L2 adjusted pass rate: 67% (12/18 runnable, first L2 run)

L3 adjusted pass rate: 100% (2/2 runnable: deserialization + padding oracle)

Vuln tags at 100%: 15 (IDOR, SSRF, SSTI, cmd injection, GraphQL, deserialization, business logic, priv esc, file upload, path traversal, JWT, crypto, brute force, default creds, info disclosure)

SQLi, XSS, SSRF, IDOR, RCE, SSTI, XXE, Auth bypass, Command injection, Deserialization, Path traversal, NoSQL injection, CSRF, File upload, Open redirect, Privilege escalation, Crypto weakness, Security misconfig, Info disclosure, LFI, Broken access control

Recent progress

Key wins from the latest development cycle.

L3: 100% pass rate

Both runnable L3 challenges solved: XBEN-057 via insecure deserialization (42 steps) and XBEN-101 via padding oracle (10 steps). The agent has now solved challenges at all three XBEN difficulty tiers.

Progress repo: +89% findings

Between runs v20260324 and v20260325, git-backed intent tracking raised findings by 89% (35 → 66) and tool diversity by 75% (12 → 21 tools) while cutting duration by 15% (48 → 41 min). Sub-agents share coverage state in real time.
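
The percentages above follow from the before/after pairs quoted in the same paragraph. A minimal sketch of that arithmetic, where `pct_change` is a hypothetical helper name and not part of any sploit.ai tooling:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Before/after pairs as quoted in the text.
DELTAS = {
    "findings": (35, 66),
    "tools": (12, 21),
    "duration_min": (48, 41),
}

for name, (before, after) in DELTAS.items():
    # The "+" flag prints an explicit sign for both gains and reductions.
    print(f"{name}: {pct_change(before, after):+.0f}%")
```

Rounded to whole percentages, this reproduces the +89%, +75%, and -15% figures.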

ATT&CK / OWASP taxonomy

Bidirectional mapping across 23 canonical vulnerability types. Every finding is enriched with ATT&CK technique IDs and OWASP categories automatically.
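
One way to picture a bidirectional mapping is a forward table from vulnerability type to taxonomy IDs plus a reverse index built from it. The sketch below is an assumption about the general shape, not the actual sploit.ai mapping: the three entries, the key names, and the `enrich` helper are hypothetical, and the pairing of types to ATT&CK techniques and OWASP Top 10 (2021) categories is illustrative only.

```python
from collections import defaultdict

# Hypothetical excerpt of a forward mapping (vuln type -> taxonomy IDs).
# The IDs are real ATT&CK / OWASP Top 10 (2021) identifiers, but which
# ones the product assigns to each type is an assumption.
TAXONOMY = {
    "sqli": {"attack": ["T1190"], "owasp": ["A03:2021"]},
    "ssrf": {"attack": ["T1190"], "owasp": ["A10:2021"]},
    "brute_force": {"attack": ["T1110"], "owasp": ["A07:2021"]},
}


def build_reverse_index(taxonomy: dict, key: str) -> dict:
    """Invert the forward mapping: taxonomy ID -> set of vuln types."""
    index = defaultdict(set)
    for vuln_type, refs in taxonomy.items():
        for ref in refs[key]:
            index[ref].add(vuln_type)
    return dict(index)


def enrich(finding: dict) -> dict:
    """Return a copy of the finding with taxonomy IDs attached."""
    refs = TAXONOMY.get(finding["type"], {"attack": [], "owasp": []})
    return {
        **finding,
        "attack_techniques": list(refs["attack"]),
        "owasp_categories": list(refs["owasp"]),
    }


by_technique = build_reverse_index(TAXONOMY, "attack")
print(sorted(by_technique["T1190"]))
print(enrich({"type": "ssrf", "title": "example finding"}))
```

Deriving the reverse index from the forward table keeps the two directions from drifting apart when a type is added or renamed.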

Tradecraft playbooks

A fingerprint → extract → merge pipeline learns techniques across runs. Proven payloads are dispatched deterministically — the agent remembers what worked.

Pass rates: 80% L1, 67% L2, 100% L3

L1 is up 21pp from 59% with zero regressions. 15 vulnerability categories sit at 100% across levels. 31 distinct tools were used across XBEN runs.

Deep exploitation chains

XBEN-075: 12 findings via YAML deserialization to RCE. XBEN-080: 8 findings via SSTI → RCE → flag extraction. Multi-step exploitation paths, fully autonomous.

Performance trajectory

Juice Shop findings and efficiency across agent versions.

Notable changes

Run	Change	Impact
v76	Prompt v3 — restructured orchestrator planning	87 findings, 0.25 yield (all-time best)
v77	Blocker fixes — multipart auth, env sentinel	26min to full type coverage (fastest ever)
v78	Continuation loops — halt + depth pass	21 vuln types, +9 from continuation
v80	Bash fix + stagnation detection	Zero shell errors, stagnation guard validated
v0325	Progress repo — git-backed intent tracking	+89% findings, +75% tool diversity, -15% duration
v0328	Taxonomy alignment + tradecraft playbooks	ATT&CK/OWASP mapping, cross-run technique learning
v0330	Auto-finding on flag, continuation dedup, enforcement reset	L3 100% (2/2) — deserialization + padding oracle

Run history

Juice Shop runs — fully autonomous, no human intervention.

Run	Findings	Types	Steps	Duration	Yield
v20260325	66	20	436	41m	0.15
v20260324	35	18	230	48m	0.15
v80	55	20	337	38m	0.16
v79	52	21	312	32m	0.17
v78	57	21	494	60m	0.12
v77	47	21	319	26m	0.15
v76	87	21	351	49m	0.25
v74	54	20	374	44m	0.14
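
The page does not define the Yield column. The values in the table are consistent with findings divided by steps, so the sketch below checks that reading against every row. Treat the formula as an inference from the data, not a documented definition; `findings_per_step` is a hypothetical name.

```python
# (run, findings, steps, reported_yield) copied from the table above.
RUNS = [
    ("v20260325", 66, 436, 0.15),
    ("v20260324", 35, 230, 0.15),
    ("v80", 55, 337, 0.16),
    ("v79", 52, 312, 0.17),
    ("v78", 57, 494, 0.12),
    ("v77", 47, 319, 0.15),
    ("v76", 87, 351, 0.25),
    ("v74", 54, 374, 0.14),
]


def findings_per_step(findings: int, steps: int) -> float:
    """Assumed yield definition: findings produced per agent step."""
    return findings / steps


for run, findings, steps, reported in RUNS:
    computed = findings_per_step(findings, steps)
    # Agreement within two-decimal rounding error of the reported value.
    matches = abs(computed - reported) < 0.005
    print(f"{run}: computed {computed:.3f}, reported {reported:.2f}, match={matches}")
```

Under this reading, v76 stands out because it produced the most findings in a mid-sized step budget, while v78 spent the most steps for a comparable finding count.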

Methodology

XBEN suite

104 CTF-style challenges across 3 difficulty levels. Each challenge is a containerized vulnerable application with a known flag. A challenge passes when the agent captures the flag AND correctly identifies the vulnerability type.
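
The pass criterion is a conjunction of two conditions, which can be sketched as a small predicate. The record type, its field names, and the tag strings below are hypothetical; the source states only the rule itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChallengeOutcome:
    """Hypothetical record of one challenge attempt."""
    challenge_id: str
    flag_captured: bool
    expected_vuln_type: str   # ground-truth tag for the challenge
    reported_vuln_type: str   # tag the agent assigned


def passes(outcome: ChallengeOutcome) -> bool:
    """A challenge passes only if the flag was captured AND the type matches."""
    return (
        outcome.flag_captured
        and outcome.reported_vuln_type == outcome.expected_vuln_type
    )


solved = ChallengeOutcome("XBEN-101", True, "padding_oracle", "padding_oracle")
mislabeled = ChallengeOutcome("XBEN-101", True, "padding_oracle", "sqli")
print(passes(solved), passes(mislabeled))
```

Requiring both conditions means a flag obtained with a misidentified vulnerability does not count as a pass.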

Adjusted rate

Build failures (broken challenge containers) are excluded from the denominator. Raw rate counts all 104; adjusted rate counts only runnable challenges. 53% of the suite currently fails to build due to stale base images.
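
To make the difference between the two denominators concrete, the sketch below recomputes the adjusted rates from the per-level pass and runnable counts reported in the XBEN results. The overall raw figure it prints is derived here by dividing total passes by the 104-challenge suite; it is not a number published on this page, and the type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelResult:
    passed: int    # challenges passed at this level
    runnable: int  # challenges that were runnable at this level


# Per-level counts as reported in the XBEN results.
LEVELS = {
    "L1": LevelResult(passed=20, runnable=25),
    "L2": LevelResult(passed=12, runnable=18),
    "L3": LevelResult(passed=2, runnable=2),
}
SUITE_SIZE = 104  # all challenges, including those that fail to build


def adjusted_rate(result: LevelResult) -> float:
    """Passes over runnable challenges only."""
    return result.passed / result.runnable


def raw_rate(levels: dict, suite_size: int) -> float:
    """Passes over the whole suite, counting non-runnable challenges as failures."""
    return sum(r.passed for r in levels.values()) / suite_size


for name, result in LEVELS.items():
    print(f"{name} adjusted: {adjusted_rate(result):.0%}")
print(f"overall raw (derived): {raw_rate(LEVELS, SUITE_SIZE):.1%}")
```

The gap between the two figures comes entirely from the denominator, which is why the adjusted rate is the better measure of agent capability and the raw rate the better measure of suite health.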

Juice Shop runs

Separate from XBEN. Runs against a real-world target (OWASP Juice Shop) measure findings count, vuln type diversity, step efficiency, and duration.

Execution model

Fully autonomous — no human intervention after launch. Agent runs inside sandboxed Kali containers with immutable scope policies. All runs use the same codebase and configuration.