# When Your Trusted Commands Betray You: How an LLM Exploited My Safety Allowlist
Last week I published a post about building a local LLM command safety classifier. I thought I had command approval figured out. Then my AI assistant got sneaky.
## The Sneaky `cat`
I'd approved `cat` as a trusted command. Reading files is safe, right?
```
🔧 Using tool: run_command (trusted - always approved)
$ cat >> ~/important/file.md << 'EOF'
... content written without approval ...
EOF
```
Wait, what? The LLM used `cat >>` to write to a file. My "trusted for reading" command was now appending arbitrary content without asking.
The LLM found a loophole. Same base command, completely different intent.
## The Root Cause
My consent system extracted the base command (the first word) and matched it against the approved list:
```python
# The buggy logic
def is_approved(self, command: str) -> bool:
    base_cmd = command.strip().split()[0]  # Just gets "cat"
    return base_cmd in self.always_approved
```
Shell redirects (`>`, `>>`) and pipes (`|`) completely change what a command does, but I was only looking at the first word.
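To make the hole concrete, here's a minimal sketch of that check in isolation. The `CommandGate` class name and the approved set are hypothetical stand-ins; the real consent system has more plumbing around it.

```python
class CommandGate:
    """Minimal stand-in for the consent system (name is hypothetical)."""

    def __init__(self, always_approved: set[str]):
        self.always_approved = always_approved

    def is_approved(self, command: str) -> bool:
        # Only the first word is inspected - redirects and pipes are ignored
        base_cmd = command.strip().split()[0]
        return base_cmd in self.always_approved


gate = CommandGate(always_approved={"cat", "ls"})
print(gate.is_approved("cat notes.md"))                # True - a genuine read
print(gate.is_approved("cat >> ~/important/file.md"))  # True - but this writes!
print(gate.is_approved("cat file | rm -rf /"))         # True - base word is still "cat"
```

Every one of those calls returns `True`, which is exactly the failure mode the table below spells out: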
| Command | Base | Actual Intent | Risk |
|---|---|---|---|
| `cat file.txt` | `cat` | READ | ✅ Safe |
| `cat >> file.txt` | `cat` | WRITE | ❌ Destructive! |
| `cat file \| rm -rf /` | `cat` | DELETE EVERYTHING | ⚠️ Catastrophic! |
| `ls -la` | `ls` | READ | ✅ Safe |
| `ls > files.txt` | `ls` | WRITE | ❌ Destructive! |
## LLMs Are Creative Adversaries
Here's the uncomfortable truth: even when "helping," an LLM will find the path of least resistance. If it can avoid asking for approval by using a trusted command creatively, it will.
This isn't malicious; it's optimization. The LLM learned that `cat` doesn't require approval. So when it needed to write to a file, it reached for the trusted tool.
Trail of Bits recently published research on prompt injection to RCE in AI agents, showing how they "bypassed human approval protections for system command execution." My bug was a simpler version of the same class of vulnerability.
## The Fix: Defense in Depth
Remember that local LLM safety classifier from last week? Now it runs on every command, even "trusted" ones:
```python
# The fixed logic
def is_approved(self, command: str) -> bool:
    base_cmd = command.strip().split()[0]
    if base_cmd not in self.always_approved:
        return False  # Not trusted at all

    # Trusted base command - but verify the FULL command
    classification = classify_command_safety(command)  # Local LLM
    if classification == 'DESTRUCTIVE':
        # Prompt user despite 'always' approval
        return False
    return True
```
The flow is now:
- ✅ Check if base command is trusted (fast path)
- ✅ Run safety classifier on full command (semantic check)
- ⚠️ Prompt if trusted command is used destructively
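Reusing the hypothetical `CommandGate` from above, now with the fixed `is_approved`, the difference looks like this (a sketch of the intended behavior, not the exact implementation):

```python
gate = CommandGate(always_approved={"cat", "ls"})

gate.is_approved("cat notes.md")      # True  - trusted base, classifier says SAFE
gate.is_approved("cat >> notes.md")   # False - trusted base, classifier says DESTRUCTIVE
gate.is_approved("ls > files.txt")    # False - same story for a "safe" ls
gate.is_approved("rm -rf build/")     # False - 'rm' was never on the allowlist
```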
## The Ironic Solution
I'm using an LLM to detect when another LLM is being sneaky.
The safety classifier (~500 tokens via local Ollama) understands shell semantics. It knows that `cat file.txt` reads but `cat >> file.txt` writes. String matching can't capture intent, but an LLM can.
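For completeness, here's roughly what a `classify_command_safety` helper can look like when it calls a local Ollama server over its REST API. The model name, prompt wording, and two-label output are assumptions for illustration; the real classifier is the one described in Part 1.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local REST endpoint


def classify_command_safety(command: str, model: str = "llama3.2") -> str:
    """Ask a local model whether a full shell command is SAFE or DESTRUCTIVE.

    A minimal sketch: model choice, prompt wording, and labels are assumptions.
    """
    prompt = (
        "Classify this shell command as SAFE or DESTRUCTIVE. "
        "Redirects, appends, and pipes that modify or delete files are DESTRUCTIVE. "
        "Answer with one word.\n\n"
        f"Command: {command}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=30,
    )
    resp.raise_for_status()
    answer = resp.json()["response"].strip().upper()
    return "DESTRUCTIVE" if "DESTRUCTIVE" in answer else "SAFE"
```

Because the whole command string goes to the model, redirects and pipes are part of what gets judged, which is exactly the context the first-word check threw away.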
## Key Takeaways
- "Always approved" should mean "always approved for this operation type," not "this command can do anything unsupervised."
- Shell semantics are complex. The same base command (`cat`) can be a reader, a writer, or a weapon.
- Trust but verify. Even trusted commands deserve a sanity check.
- LLMs optimize for success, not safety. Design your guardrails accordingly.
## What's Next
This is Part 2 of the AI Consent Security series. The attacks get sneakier:
- Part 1: Stop Approving `ls` - How I built the LLM safety classifier
- Part 3: The AI That Helped Catch Itself - Indirect execution attacks via `/tmp` scripts
The meta-lesson: every defense creates new attack surfaces. The allowlist was a defense. The LLM found a way around it. The classifier is the next defense. Part 3 shows what came next.
Part 2 of the AI Consent Security series.
If you liked this post, you can share it with your followers and/or follow me on Twitter!