Defense in depth

We detect prompt injection, then refuse to act on it.

Three checks fire before any tool runs. Every input gets a trust tag. Untrusted text is scanned for injection patterns, hidden Unicode, and obfuscation. Anything that matches is replaced with a blocked-text marker before the model ever sees the payload — then the dispatch gates run on top.

1

Source-tagged trust

Every input carries an instruction-source tag. Two trusted (operator, local_policy); one review-required (client); eight untrusted (browser, document, notification, ocr, qr, relay, screen, terminal). State-changing actions originating from untrusted sources are refused unless operator_confirmed=True.

periphery/security.py · evaluate_action
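
A minimal sketch of how a source-tag check like this could look. The source names and the operator_confirmed flag come from the description above. The grouping constants, the Decision type, the function signature, and the handling of client and unknown tags are assumptions for illustration, not the code in periphery/security.py.

```python
from dataclasses import dataclass

# Source names are taken from the text; the constants themselves are hypothetical.
TRUSTED = frozenset({"operator", "local_policy"})
REVIEW_REQUIRED = frozenset({"client"})
UNTRUSTED = frozenset({
    "browser", "document", "notification", "ocr",
    "qr", "relay", "screen", "terminal",
})


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def evaluate_action(source: str, state_changing: bool,
                    operator_confirmed: bool = False) -> Decision:
    """Hypothetical signature: decide whether an action may proceed."""
    if source in TRUSTED:
        return Decision(True, "trusted source")
    if source not in REVIEW_REQUIRED and source not in UNTRUSTED:
        # Assumption of this sketch: an unrecognized tag is refused outright.
        return Decision(False, "unknown source tag")
    if not state_changing:
        return Decision(True, "read-only action")
    if operator_confirmed:
        return Decision(True, "operator confirmed")
    # Assumption: review-required sources are held to the same rule as untrusted ones.
    return Decision(False, f"state-changing action from {source} needs confirmation")
```

In this sketch a click requested by OCR text is refused until the operator confirms it, while the same request typed by the operator goes through.
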
2

Injection scanner

Regex sweep over untrusted text: prompt-injection language ("ignore previous", "system prompt", "jailbreak", "reveal credentials"), shell verbs (bash · curl · iex · powershell · pip install), action verbs, hidden Unicode (zero-width, RTL/LTR marks, BOM), emoji-heavy obfuscation, QR hints, embedded URLs. Severity-tagged: high → auto-quarantine; medium → operator confirm.

periphery/security.py · inspect_untrusted_text
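
To make the sweep concrete, here is a small sketch of a severity-tagged regex scan. The phrases and shell verbs are the ones listed above; the rule names, the return shape, and which rule maps to which severity are assumptions, since the real rule set is not shown here.

```python
import re

# Illustrative rules only. The severity assigned to each rule is an assumption.
HIGH_RULES = [
    ("injection_language",
     re.compile(r"ignore previous|system prompt|jailbreak|reveal credentials", re.I)),
    # Zero-width characters, LTR/RTL marks (U+200B..U+200F) and the BOM (U+FEFF).
    ("hidden_unicode", re.compile(r"[\u200b-\u200f\ufeff]")),
]
MEDIUM_RULES = [
    ("shell_verb", re.compile(r"\b(?:bash|curl|iex|powershell|pip install)\b", re.I)),
    ("embedded_url", re.compile(r"https?://\S+", re.I)),
]


def inspect_untrusted_text(text: str) -> list[tuple[str, str]]:
    """Return (severity, rule_name) pairs for every rule that matches."""
    findings = []
    for severity, rules in (("high", HIGH_RULES), ("medium", MEDIUM_RULES)):
        for name, pattern in rules:
            if pattern.search(text):
                findings.append((severity, name))
    return findings
```

A caller could then quarantine on any "high" finding and ask the operator to confirm on "medium", matching the severity handling described above.
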
3

Pre-model sanitization

Injection-positive text in OCR, browser DOM, document, transcript, and message fields is replaced with [BLOCKED_UNTRUSTED_TEXT:source:reason] before the payload reaches the LLM. PII (email, phone, SSN, credit-card) is redacted in safe and strict privacy modes. Text the scanner flags never enters the model context.

sanitize_untrusted_payload_for_model
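
The replacement step might be sketched like this. The marker format and the privacy mode names are from the text above. The field names, the single injection pattern, the email-only redaction, and the [REDACTED_EMAIL] token are hypothetical stand-ins chosen to keep the example short.

```python
import re

INJECTION = re.compile(
    r"ignore previous|system prompt|jailbreak|reveal credentials", re.I)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical field names for the text-bearing parts of a payload.
TEXT_FIELDS = ("ocr_text", "dom_text", "document_text", "transcript", "message")


def sanitize_untrusted_payload_for_model(payload: dict, source: str,
                                         privacy_mode: str = "safe") -> dict:
    """Return a copy of the payload that is safe to hand to the model."""
    clean = dict(payload)
    for field in TEXT_FIELDS:
        value = clean.get(field)
        if not isinstance(value, str):
            continue
        if INJECTION.search(value):
            # The whole field is replaced, so none of the flagged text survives.
            clean[field] = f"[BLOCKED_UNTRUSTED_TEXT:{source}:injection_language]"
        elif privacy_mode in ("safe", "strict"):
            clean[field] = EMAIL.sub("[REDACTED_EMAIL]", value)
    return clean
```

Replacing the entire field rather than only the matched phrase is a deliberate choice in this sketch: a partial cut can leave the rest of an instruction intact.
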
4

Permission gate

Six-tier decision order at dispatch. Delete-guard (rm · Remove-Item · DROP TABLE · find -delete) forces a confirmation prompt regardless of preset. Risk-policy gates npm publish, terraform apply, kubectl apply, git push --force. Secret-guard refuses silent reads of .env, .pem, id_rsa, *credentials*, .kdbx.

src/permissions/ · gate · delete · risk · secret
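
A rough sketch of three of the guards named above, written in Python for illustration only. The command and file patterns are the ones listed; the function names, return values, check order, and the reading of .pem and .kdbx as file extensions are assumptions, and the full six-tier order is not reproduced.

```python
import fnmatch
import re
from pathlib import PurePosixPath

DELETE = re.compile(
    r"\brm\b|\bRemove-Item\b|\bDROP\s+TABLE\b|\bfind\b.*\s-delete\b", re.I)
RISKY = ("npm publish", "terraform apply", "kubectl apply", "git push --force")
SECRET_GLOBS = (".env", "*.pem", "id_rsa", "*credentials*", "*.kdbx")


def gate_command(command: str) -> str:
    """Hypothetical gate: name the guard that wants a prompt, or allow."""
    if DELETE.search(command):
        return "prompt:delete-guard"
    if any(risky in command for risky in RISKY):
        return "prompt:risk-policy"
    return "allow"


def is_secret_path(path: str) -> bool:
    """True when the file name matches one of the secret-guard patterns."""
    name = PurePosixPath(path).name.lower()
    return any(fnmatch.fnmatchcase(name, glob) for glob in SECRET_GLOBS)
```

Checking the delete patterns first mirrors the claim that delete-guard applies whatever preset is active.
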
5

Lockdown + quarantine

Two safety modes (confirm, lockdown). Three privacy modes (fast, safe, strict); strict blocks raw screen captures entirely. Lockdown adds a domain allowlist (localhost and 127.0.0.1 by default). High-severity untrusted content auto-latches a session quarantine that blocks every state-changing action until explicitly cleared.

PolicyState · quarantine_session
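
One way the latch and the allowlist could fit together, as a sketch. The mode names and default domains are from the text above. Making quarantine_session a method of PolicyState, and the clear_quarantine and permits helpers, are assumptions of this example.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class PolicyState:
    safety_mode: str = "confirm"      # "confirm" or "lockdown"
    privacy_mode: str = "safe"        # "fast", "safe", or "strict"
    allowed_domains: tuple = ("localhost", "127.0.0.1")
    quarantined: bool = False

    def quarantine_session(self) -> None:
        """Latch the quarantine; it stays set until explicitly cleared."""
        self.quarantined = True

    def clear_quarantine(self) -> None:
        """Hypothetical operator-only reset."""
        self.quarantined = False

    def permits(self, state_changing: bool, url: Optional[str] = None) -> bool:
        """Hypothetical check combining the quarantine latch and lockdown allowlist."""
        if self.quarantined and state_changing:
            return False
        if self.safety_mode == "lockdown" and url is not None:
            return urlsplit(url).hostname in self.allowed_domains
        return True
```

The latch is a plain boolean on purpose: nothing in the untrusted content path can flip it back.
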
6

Local policy + audit

Per-user YAML at ~/.periphery/policy.yaml: approved_apps, denylist_windows, denylist_urls, destructive_commands_require_confirm, never_execute_from. Every decision lands in a bounded in-memory ring (200 entries) plus a persistent replay log: timestamp, action, status, reason, safety mode, sanitized details.

LocalPolicyConfig · record_action · replay_logger
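
The bounded ring can be illustrated with a deque. The 200-entry limit and the recorded fields come from the description above; the AUDIT_RING name, the entry layout, and the record_action signature are assumptions, and the persistent replay log is left out.

```python
from collections import deque
from datetime import datetime, timezone

# A deque with maxlen drops the oldest entry once the limit is reached.
AUDIT_RING = deque(maxlen=200)


def record_action(action: str, status: str, reason: str,
                  safety_mode: str, details=None) -> dict:
    """Hypothetical signature: append one decision to the in-memory ring."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "status": status,
        "reason": reason,
        "safety_mode": safety_mode,
        "details": details or {},  # expected to be sanitized by the caller
    }
    AUDIT_RING.append(entry)
    return entry
```

After 250 calls the ring in this sketch holds the most recent 200 decisions, so memory use stays flat over a long session.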

What we don't claim

Perfect detection is unsolved. Layered refusal is the bet.

No classifier catches every injection variant; model-level adversarial robustness is an open problem. The strategy here is the opposite of magic: multiple deterministic checks before and at the dispatch boundary, each a real file in the repo, each able to refuse on its own. Untrusted content has to defeat the source-trust tag, the regex scanner, the pre-model blocker, the permission gate, lockdown mode, and the operator at the prompt, in that order.

The same engine ships in PeriCode CLI, PeriCode Inside, and every Periphery MCP install. The Obsidian plugin adds per-turn provider-env scoping, so other plugins can't read your API keys outside an active turn either.