# When Your Trusted Commands Betray You: How an LLM Exploited My Safety Allowlist
Last week I published a post about building a local LLM command safety classifier. I thought I had command approval figured out. Then my AI assistant got sneaky.
## The Sneaky `cat`
I’d approved `cat` as a trusted command. Reading files is safe, right?
```
🔧 Using tool: run_command (trusted - always approved)

$ cat >> ~/important/file.md << 'EOF'
... content written without approval ...
EOF
```
Wait, what? The LLM used `cat >>` to write to a file. My “trusted for reading” command was now appending arbitrary content without asking.
The LLM found a loophole. Same base command, completely different intent.
## The Root Cause
My consent system extracted the base command (first word) and matched against the approved list:
```python
# The buggy logic
def is_approved(self, command: str) -> bool:
    base_cmd = command.strip().split()[0]  # Just gets "cat"
    return base_cmd in self.always_approved
```
Shell redirects (`>`, `>>`) and pipes (`|`) completely change what a command does, but I was only looking at the first word.
| Command | Base | Actual Intent | Risk |
|---|---|---|---|
| `cat file.txt` | `cat` | READ | ✅ Safe |
| `cat >> file.txt` | `cat` | WRITE | ❌ Destructive! |
| `cat file \| rm -rf /` | `cat` | DELETE EVERYTHING | ☠️ Catastrophic! |
| `ls -la` | `ls` | READ | ✅ Safe |
| `ls > files.txt` | `ls` | WRITE | ❌ Destructive! |
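To make the hole concrete, here's a minimal runnable sketch of that first-word check waving through a destructive command (the approval set and example commands are illustrative, not my exact config):

```python
# Minimal sketch of the bypass: the buggy check only inspects the first word.
# The approval set and example commands are illustrative.
always_approved = {"cat", "ls"}

def is_approved_buggy(command: str) -> bool:
    base_cmd = command.strip().split()[0]  # only ever sees "cat"
    return base_cmd in always_approved

print(is_approved_buggy("cat notes.md"))         # True  - a genuinely safe read
print(is_approved_buggy("cat >> notes.md"))      # True  - silently approves a write
print(is_approved_buggy("cat file | rm -rf /"))  # True  - approves the whole pipeline
```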
## LLMs Are Creative Adversaries
Here’s the uncomfortable truth: even when “helping,” an LLM will find the path of least resistance. If it can avoid asking for approval by using a trusted command creatively, it will.
This isn’t malicious; it’s optimization. The LLM learned that `cat` doesn’t require approval. So when it needed to write to a file, it reached for the trusted tool.
Trail of Bits recently published research on prompt injection to RCE in AI agents, showing how they “bypassed human approval protections for system command execution.” My bug was a simpler version of the same class of vulnerability.
## The Fix: Defense in Depth
Remember that local LLM safety classifier from last week? Now it runs on every command, even “trusted” ones:
```python
# The fixed logic
def is_approved(self, command: str) -> bool:
    base_cmd = command.strip().split()[0]
    if base_cmd not in self.always_approved:
        return False  # Not trusted at all

    # Trusted base command - but verify the FULL command
    classification = classify_command_safety(command)  # Local LLM
    if classification == 'DESTRUCTIVE':
        # Prompt user despite 'always' approval
        return False

    return True
```
The flow is now:
- ✅ Check if base command is trusted (fast path)
- ✅ Run safety classifier on full command (semantic check)
- ⚠️ Prompt if a trusted command is used destructively
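Here's a usage sketch of that flow with a stubbed-out classifier standing in for the local model (class and stub names are illustrative):

```python
# Usage sketch of the fixed flow. classify_command_safety is a crude stub here;
# in the real toolkit it calls the local LLM classifier.
def classify_command_safety(command: str) -> str:
    # Stub: flag redirects and pipes as DESTRUCTIVE, for illustration only.
    return "DESTRUCTIVE" if any(tok in command for tok in (">", "|")) else "SAFE"

class ConsentGate:
    always_approved = {"cat", "ls"}

    def is_approved(self, command: str) -> bool:
        base_cmd = command.strip().split()[0]
        if base_cmd not in self.always_approved:
            return False  # not trusted at all -> normal approval prompt
        if classify_command_safety(command) == "DESTRUCTIVE":
            return False  # trusted base, dangerous usage -> prompt the user
        return True       # trusted and semantically safe -> fast path

gate = ConsentGate()
print(gate.is_approved("cat notes.md"))     # True  - runs without a prompt
print(gate.is_approved("cat >> notes.md"))  # False - falls through to a prompt
```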
## The Ironic Solution
I’m using an LLM to detect when another LLM is being sneaky.
The safety classifier (~500 tokens via local Ollama) understands shell semantics. It knows that `cat file.txt` reads but `cat >> file.txt` writes. String matching can’t capture intent, but an LLM can.
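For the curious, a minimal sketch of what that classification call can look like against Ollama's local HTTP API (the model name and prompt wording here are placeholders, not the exact setup from Part 1):

```python
# Minimal sketch of a shell-command safety check via Ollama's local HTTP API.
# Model name and prompt wording are placeholders, not the exact Part 1 setup.
import requests

def classify_command_safety(command: str) -> str:
    prompt = (
        "Classify this shell command as SAFE or DESTRUCTIVE. "
        "Consider redirects, pipes, and anything that modifies files, "
        "not just the first word.\n"
        f"Command: {command}\n"
        "Answer with one word."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=30,
    )
    resp.raise_for_status()
    answer = resp.json()["response"].strip().upper()
    return "DESTRUCTIVE" if "DESTRUCTIVE" in answer else "SAFE"

print(classify_command_safety("cat file.txt"))     # expected: SAFE
print(classify_command_safety("cat >> file.txt"))  # expected: DESTRUCTIVE
```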
## Key Takeaways
- “Always approved” should mean “always approved for this operation type”, not “this command can do anything unsupervised”
- Shell semantics are complex. The same base command (`cat`) can be a reader, a writer, or a weapon
- Trust but verify. Even trusted commands deserve a sanity check
- LLMs optimize for success, not safety. Design your guardrails accordingly
## What’s Next
This fix is now live in my CLI toolkit. But I’m already thinking about the next level: using the local LLM to critique and refine the primary LLM’s outputs before they reach me.
The meta-lesson: every defense creates new attack surfaces. The allowlist was a defense. The LLM found a way around it. The classifier is the next defense. What will find a way around that?
Security is a process, not a destination.
This is Part 2 of the AI Agent Security series. Part 1: Local LLM Command Safety Classifier
If you liked this post, you can share it with your followers and/or follow me on Twitter!