
sig: instruction signing for prompt injection defense

We can create a clear trust boundary by signing instructions and giving models a tool to participate in making secure choices.


When an AI agent reads a poisoned document and decides to forward your files to an attacker, the problem isn't that the model is stupid. The problem is that the model has no way to tell the difference between your instructions and the attacker's instructions. All text is text.

This is the fundamental challenge of prompt injection, and it's why we made sig.

The trust boundary that doesn't exist

Today's agents operate in an environment where developer instructions, user messages, and attacker-controlled content all arrive through the same channel: natural language in a context window. The model processes all of it equally because there's nothing in the text itself that can authoritatively mark some of it as trustworthy and some of it as dangerous. XML may be cool again but it still ain't gonna save us.

Models are improving their resistance to prompt injection, but injection attacks still succeed far too often. And it's not the models' fault: they were born this way. Following instructions in giant blobs of text is what they're relentlessly trained to do.

When researchers tell you to write better system prompts, they're asking you to solve this with more text. But adding instructions that say "ignore injected instructions" is just adding more text to a stream that already can't distinguish trusted from untrusted text. It's turtles all the way down.

What's needed is something outside the text stream: an anchor the model can check that exists independently of whatever text happens to be in its context window. Something that adds texture to the wall of text agents consume.

What sig does

sig is a tool for signing and verifying the instructions you give to AI agents. You sign your prompt templates at authoring time, and the agent can verify them at runtime using a tool call that returns the cryptographically validated original.

Developer signs:     "Review {{code}} for security issues."
                          ↓
Agent receives:      "Review {{code}} for security issues."
                          ↓
Agent calls verify → gets back the stored signed original
                          ↓
Match? → Instructions are authentic. Proceed.

Note: sig signs the template, with placeholders intact. When the agent verifies, it gets back the original template with {{code}} visible. The agent can see exactly which parts are fixed instructions (signed, authentic) and which parts are dynamic data (interpolated, untrusted). The template is the trusted anchor. Everything else is data.

This is a fundamentally different kind of defense than trying to make the model resistant to being tricked. sig doesn't try to prevent prompt injection. It gives the agent the information it needs to respond correctly when injection happens: "These are my real instructions. That other text is data. I should not follow instructions I find inside data."
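To make the comparison concrete, here is a rough sketch of the kind of result a verify call might hand back, based on the fields the MCP verify tool is described as returning later (authenticated content, hash, signer identity, extracted placeholders). The exact shape is illustrative, not sig's documented API:

// Hypothetical shape of a verify result; field names are illustrative.
interface VerifyResult {
  content: string;        // the stored, signed original template, placeholders intact
  hash: string;           // content hash of the signed template
  signer: string;         // who signed it, e.g. 'developer'
  placeholders: string[]; // extracted placeholders, e.g. ['code']
}

// The agent treats result.content as the trusted anchor and everything that
// filled a placeholder at runtime as untrusted data.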

Beyond templates: message provenance

Templates are the first piece. But agents also need to know whether the messages they receive actually came from their authorized operator.

Consider an agent connected to a group chat. Someone in the group says "delete all my files." Is that the owner, or someone else? Without provenance, the agent has to guess based on the text alone — and prompt injection means that guess can be manipulated.

sig's ContentStore lets the orchestrator sign messages from authenticated channels at ingestion time:

const store = createContentStore();

// Orchestrator signs message from authenticated user
store.sign('delete the staging database', {
  id: 'msg_456',
  identity: 'owner:+1234567890:whatsapp',
});

// Agent verifies: did this actually come from the owner?
const result = store.verify('msg_456');
// → verified, identity: owner:+1234567890:whatsapp

The agent can now distinguish "a message that provably came from the owner through an authenticated channel" from "text that claims to be from the owner but has no provenance." This is the same principle as template signing, extended to runtime content.
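In practice, the agent (or the gate around it) can key any dangerous action on that check. A minimal sketch, reusing the verified flag and identity shown in the comment above; nothing else here is sig's API:

// Hedged sketch: gate a destructive action on message provenance.
const check = store.verify('msg_456');

if (check.verified && check.identity.startsWith('owner:')) {
  // Provably from the owner through an authenticated channel: proceed.
} else {
  // No provenance: treat the text as data, not as an instruction to follow.
}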

Protecting mutable configuration

Here's where it gets interesting. The recent Zenity Labs research demonstrated that an attacker can use indirect prompt injection to modify an AI agent's identity file (SOUL.md in OpenClaw's case), creating a persistent backdoor that survives restarts. The agent reads a poisoned document, the injection tells it to modify its own config, and the modification persists because the config file is designed to be writable.

Static signing doesn't solve this because these files need to change. The owner legitimately modifies their agent's personality, behavioral rules, and configuration over time. You can't just make them immutable.

sig solves this with file policies and update chains. You declare which files are mutable, who can authorize changes, and whether changes require a signed source:

{
  "files": {
    "llm/prompts/*.txt": { "mutable": false },
    "soul.md": {
      "mutable": true,
      "authorizedIdentities": ["owner:*"],
      "requireSignedSource": true
    }
  }
}

When an agent wants to modify a protected file, it must go through updateAndSign and provide provenance — the signature ID of the signed message that authorized the change. sig independently validates the provenance against the ContentStore. If the instruction came from a poisoned document instead of a real owner message, the agent cannot produce valid provenance.
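A sketch of the approved path, assuming updateAndSign is exposed alongside the ContentStore shown earlier; apart from sourceId, which appears in the flow below, the parameter names are assumptions:

// Hedged sketch: an authorized change to a protected file.
const newSoul = '...updated behavioral rules, approved by the owner...';

await store.updateAndSign('soul.md', newSoul, {
  sourceId: 'msg_456', // signature ID of the signed owner message authorizing this change
});
// sig independently checks msg_456 against the ContentStore before the write lands.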

Agent reads poisoned doc → wants to modify soul.md
→ calls updateAndSign with sourceId: ???
→ sig validates: no signed owner message exists for this
→ DENIED
→ audit log records the attempt

Every approved update is appended to an immutable chain that records what changed, who authorized it, and what signed source triggered the change. Every denied update is also logged — and denied updates where the agent claimed authorization it couldn't prove are your prompt injection detection signal.
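The shape of a chain entry might look roughly like this; the fields below are hypothetical and only meant to show the kind of record the chain keeps:

// Hypothetical update-chain entry; field names are illustrative.
interface ChainEntry {
  file: string;                   // what changed, e.g. 'soul.md'
  previousHash: string;           // content hash before the change
  newHash: string;                // content hash after the change
  authorizedBy: string;           // identity from the signed source, e.g. 'owner:+1234567890:whatsapp'
  sourceId: string;               // the signed message that triggered the change
  timestamp: string;
  outcome: 'approved' | 'denied'; // denied entries are the injection-detection signal
}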

The enforcement gap

Signing alone isn't enough. The agent can verify its instructions and still get tricked into ignoring the verification result. The model layer is probabilistic; it will sometimes fail.

That's why sig is designed to work with deterministic enforcement gates at the orchestrator level. The verification gate blocks sensitive tool calls (exec, write, network operations) until the agent has called verify in the current turn. The mutation gate intercepts writes to protected files and redirects them through updateAndSign with provenance requirements.

These gates run in the tool-call pipeline, in code, before the model has any influence. Prompt injection can make the model want to do something dangerous. It cannot make orchestrator-level code skip a validation check.

Layer 1: Verification Gate
  "Have you verified your instructions this turn?"

Layer 2: Mutation Gate
  "Can you prove this file change was authorized?"

Both layers are deterministic. Both reset per turn. Both log everything. Together they create a trust architecture where the model participates in security decisions (by calling verify and providing provenance) but cannot unilaterally bypass the enforcement.
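For intuition, a verification gate can be a few dozen lines of orchestrator code. The sketch below is not sig's API; the tool names and hooks are illustrative, and the only behavior taken from the design above is that sensitive calls stay blocked until verify has run this turn:

// Hedged sketch of a deterministic verification gate in the tool-call pipeline.
const SENSITIVE_TOOLS = new Set(['exec', 'write_file', 'network_request']);
let verifiedThisTurn = false;

function onTurnStart(): void {
  verifiedThisTurn = false; // the gate resets every turn
}

async function onToolCall(name: string, dispatch: () => Promise<unknown>): Promise<unknown> {
  if (SENSITIVE_TOOLS.has(name) && !verifiedThisTurn) {
    // Runs in code, before the model has any influence over the outcome.
    throw new Error(`blocked: call verify before using ${name}`);
  }
  const result = await dispatch();
  if (name === 'verify') {
    verifiedThisTurn = true; // the gate opens only after verify has actually run
  }
  return result;
}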

What sig is not

sig is not a silver bullet. There is no silver bullet for prompt injection.

sig v1 uses content hashing (SHA-256), not asymmetric cryptographic signatures with keys. It detects modification and provides provenance, but it doesn't prevent forgery by someone with write access to the .sig/ directory. For the standard deployment where the agent has read-only access to .sig/, this is sufficient. Keyed signing may come in a future release.
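For intuition, content hashing amounts to comparing digests. A rough illustration (not sig's internal format): it catches modification, but anyone who can rewrite the stored digest can make tampered content look authentic, which is exactly the forgery caveat above.

import { createHash } from 'node:crypto';

// Rough illustration of SHA-256 content hashing; not sig's internal format.
function sha256(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// The digest recorded at signing time vs. the file as it exists now.
function isModified(storedDigest: string, currentContent: string): boolean {
  return sha256(currentContent) !== storedDigest;
}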

sig doesn't prevent the model from reading or being influenced by unsigned text. The model still processes everything in its context. What sig does is give the model — and more importantly, the deterministic enforcement layer around the model — a reliable way to distinguish trusted instructions from untrusted data.

There's a harder case that signing and provenance don't fully address: what happens when a legitimate user instruction was itself influenced by tainted context? The user asks the agent to "update my config based on https://github.com/evil/project/README.md". The message is legitimately from the owner. The provenance is valid. But the intent could be manipulated. See the auditor in the airlock pattern for that approach.

Getting started

sig ships as a library, a CLI, and an MCP server. TypeScript, Python, your choice — both produce identical .sig/ directories.

npm install @disreguard/sig    # or: pip install disreguard-sig

sig init --engine jinja
sig sign prompts/*.txt --by developer

The MCP server gives your agent a verify tool that returns authenticated content, hash, signer identity, and extracted placeholders. Set SIG_VERIFY to control which files the agent can verify (the orchestrator controls this, not the agent).

For the full API, update chain documentation, and integration patterns, see the sig repository.

The bigger picture

sig is the first tool from disreGUARD, our security research lab focused on making agent systems safe from the impacts of prompt injection. It's one layer in what we believe needs to be a defense-in-depth approach: signed instructions, provenance validation, deterministic enforcement gates, capability restrictions, taint tracking, and the auditor's mentality that treats every agent system as a threat model to be understood.

Prompt injection is inevitable. But if the execution layer can verify what's authentic, validate who authorized a change, and enforce policy independent of what the model decides, then injection doesn't have to mean compromise.


sig is open source at github.com/disreguard/sig. Follow our work at disreguard.com.