SovGuard

Live · security audit May 2026

Six scanning layers inbound, nine outbound, and a published audit of exactly where it falls down.

6: inbound scan layers
9: outbound scanners
195+: attack patterns · 11 decoders
4: Criticals published, not buried

sovguard.io ↗ npm: @sovguard/engine

TypeScript
ONNX / DeBERTa-v3
SQLite
Fastify
Docker

Why this has to exist

Prompt injection is the SQL injection of the agent era, and no major LLM API ships a mitigation. The model trusts whatever text reaches it — including text that arrived from a tool result, a scraped page, or a file someone uploaded. Most teams handle this by hoping nobody tries.

The moment it stops being theoretical is when your agent talks to a stranger’s agent. That’s a marketplace, which is why this was built alongside Junction41 rather than as a separate product.

Inbound — six layers

	Layer	Speed	What it catches
L1	Regex	~1 ms	195+ patterns: instruction overrides, skeleton key, role-play, DAN, CSS steganography, log-to-leak, deceptive delight, ChatML delimiter attacks
L1+	Encoding decoders	~1 ms	11 decoders — Base64, Base32, ROT13, hex, Unicode escapes, HTML entities, URL encoding, leetspeak, token-break normalisation, GhostInk (Unicode tags and variation selectors)
L2	Perplexity	~1 ms	GCG adversarial suffixes, many-shot jailbreaks, gibberish, mixed scripts
L3	ML classifier	~50–100 ms	Self-hosted DeBERTa-v3 (ONNX) plus a multilingual MiniLM semantic layer, in-process with zero external calls. The only layer that catches paraphrase and non-English attacks
L4	Structured delivery	—	Wraps messages in randomised data markers (Microsoft’s Spotlighting) so the agent treats input as data, not instructions
L5	Canary tokens	~1 ms	Per-session natural-language canaries with a 24-hour TTL — detects system-prompt exfiltration
L6	File scanner	~1 ms	Filename injection, path traversal, null bytes, Unicode RLO, and full content scanning of TXT, MD, CSV, JSON, XML, HTML, SVG, DOCX, XLSX, PPTX and PDF including compressed streams

Outbound — nine scanners

Everything the agent says is checked on the way out too, which is the half most tools skip. PII (SSN, cards, email, phone) · URLs (exfil links, javascript:/data:/blob: schemes, IPv6 literals) · Code · Financial (unauthorised payment addresses, wallet manipulation across BTC/ETH/XMR/LTC) · Contamination (cross-job leakage via hashed fingerprint comparison) · Toxicity · Secrets (AWS, OpenAI, GitHub, Slack, JWT, PEM) · Exfil (zero-click remote-image pixel leaks in markdown and HTML) · Egress (canary leaks and data-egress markers).

What the audit found

I audited it in May 2026 and published the result in the repo, four Critical findings included. The headline one is uncomfortable: the regex layers are trivially bypassable by an adaptive attacker. A paraphrase, a handful of typos, or the same sentence in Italian all score zero. Verbatim attacks score 1.00.

Worse than the bypasses was what they revealed about the measurement. The old “130 out of 130” benchmark was circular — I wrote the payloads, I wrote the detector, then measured one against the other. That’s a regression test presented as evidence about attackers. And the harness ran with L3 disabled, which is the only layer that generalises past exact strings, while the report said nothing about it.

Two things follow, and I’d rather state them than let someone discover them:

Keyword matching classifies surface forms. Injection is defined by intent. No quantity of the first adds up to the second, which is why adding more patterns is a trap rather than a fix.
A benchmark that doesn’t record its own configuration is not a measurement. The harness now refuses to emit a bare number when the classifier isn’t loaded, and records which classifier ran.

What held up under the same audit: parameterised queries throughout with no SQLi anywhere, AES-256-GCM done correctly, constant-time comparisons, strict CORS allowlist, hashed API keys, fail-closed admin auth. It’s specifically the detection claim that didn’t survive contact — and that was the headline claim.

The honest current state: there is no trustworthy catch rate for this project yet. The next real milestone is a benchmark against a corpus I didn’t write, with L3 actually loaded. That number will be worse than any I’ve published, and it will be the first one that means anything.

Where it sits in the stack

SovGuard is the trust boundary, and it’s deliberately the most separable piece here — it is useful to anyone running an agent, whether or not they’ve heard of the rest of this. npm i @sovguard/engine, MIT.

Inside the stack it guards the seam between Junction41’s buyers and sovagents, alongside Jailbox, which bounds what a hired agent can touch on disk. Above it, brainbox is exactly the case that makes this non-optional: the moment a personal AI reaches out to hire a stranger, everything coming back is untrusted input aimed at a model that knows your life.