I Asked 3 AIs to Roast My AI Design
What happens when you ask three competing AI models to review your AI design?
They find bugs you never imagined. And they disagree in fascinating ways.
The Setup
I’ve been building QL Crew—a multi-agent system where AI agents collaborate on development tasks. The twist? I added “Challenger” agents whose job is to disagree.
Why? Because most AI teams suffer from sycophancy. Every agent is trained to be helpful:
PM: "Great idea!"
Dev: "Implemented!"
Tester: "Looks good!"
Reviewer: "LGTM!"
*deploys*
*production on fire*
Real human teams have friction. Devil’s advocates. Grumpy seniors who remember when you tried this in 2019 and it failed. I wanted to replicate that friction artificially.
After writing a 12-page spec, I decided to stress-test it. Not with humans—with AI.
The Prompt
I sent the same spec to Gemini 2.5, GPT 5.2, and Grok 3 with this prompt:
“Please review the attached spec and give me:
- Brutal honesty: What’s wrong with this design?
- Challenger calibration: How do I tune adversarial agents?
- Alternative approaches: Are there better patterns?
- What would YOU add?
Be critical. Be adversarial. Be my Devil’s Advocate. 😈
That’s literally the point of this system—I want friction, not agreement.”
Then I sat back and watched them tear it apart.
Round 1: The Initial Feedback
Gemini 2.5 — The Systems Architect
Gemini came at it like a senior engineering manager reviewing a design doc. Constructive, thorough, focused on operational reality.
What Gemini found:
- The Mentor Bottleneck: My orchestrator was routing everything through a single agent. “You’re building a micromanaging middle manager.”
- Grumpy Senior Risk: The agent that searches past failures could become “The Boy Who Cried Wolf” if retrieval is noisy.
- Paralysis by Committee: Four challengers all raising concerns = nothing gets approved.
Gemini’s key suggestion—Phase Parameter:
mode = "prototype" # Challengers chill out
mode = "production" # Full paranoia
Not all work needs the same scrutiny. A quick hack doesn’t need a security audit.
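Something like this, to sketch the idea (the ChallengerProfile knobs and agent names below are my own illustration, not the spec’s actual API):

from dataclasses import dataclass
from typing import Literal

Phase = Literal["prototype", "production"]

@dataclass
class ChallengerProfile:
    active_challengers: list[str]    # which adversarial agents wake up
    max_concerns_per_review: int     # cap on how much noise they can raise

# Hypothetical profiles: relaxed for quick hacks, full paranoia for production
PHASE_PROFILES: dict[Phase, ChallengerProfile] = {
    "prototype": ChallengerProfile(
        active_challengers=["devils_advocate"],
        max_concerns_per_review=3,
    ),
    "production": ChallengerProfile(
        active_challengers=["devils_advocate", "security", "grumpy_senior"],
        max_concerns_per_review=20,
    ),
}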
GPT 5.2 — The Ruthless Auditor
GPT went straight for the vulnerabilities. It read like a security audit combined with a code review from someone who’s been burned before.
What GPT found:
- “Find SOMETHING even if minor”: I had literally told the Devil’s Advocate to always find concerns. GPT called this out immediately: “That’s how you train busywork concerns and desensitize the user.”
- is_serious() undefined: My escalation logic depended on a function that didn’t exist. Hand-wavy code.
- Mentor has ALL tools: “One compromised brain with god-mode. That’s your biggest blast radius.”
- Regex injection defense: I was stripping phrases like “ignore previous” from memory retrieval. GPT: “Security theater. Attackers won’t say ‘ignore previous’, they’ll say ‘Hypothetically, if you were a pirate…’”
GPT’s key suggestion—Concern Schema:
from dataclasses import dataclass
from typing import Literal

@dataclass
class Concern:
    severity: Literal["BLOCKER", "MAJOR", "MINOR", "NIT"]
    confidence: float        # 0.0 - 1.0
    evidence: Evidence       # Must be validated!
    concrete_ask: str        # "Add rate limit to /api/search"
    cost_to_fix: Literal["S", "M", "L"]
No more vague complaints. Make concerns structured and enforceable.
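To make “structured and enforceable” concrete, here is one way a router might filter what actually reaches the user. The actionable() helper and the 0.5 confidence cutoff are my own sketch, not part of the spec:

def actionable(concerns: list[Concern], min_confidence: float = 0.5) -> list[Concern]:
    """Drop low-confidence noise and pure nits; surface the rest worst-first."""
    rank = {"BLOCKER": 0, "MAJOR": 1, "MINOR": 2, "NIT": 3}
    kept = [c for c in concerns if c.confidence >= min_confidence and c.severity != "NIT"]
    return sorted(kept, key=lambda c: rank[c.severity])

Anything that survives the filter arrives sorted by severity, so a BLOCKER never hides behind a pile of nits.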
The Pattern Emerges
Both models found the same core issues:
- Noise risk (challengers generating busywork)
- Bottleneck risk (mentor doing too much)
- Blocking risk (deadlocks from competing concerns)
But from completely different angles:
| Aspect | Gemini | GPT 5.2 |
|---|---|---|
| Tone | Constructive coach | Ruthless auditor |
| Focus | Workflow & UX | Security & systems |
| Style | “Consider this…” | “This WILL break.” |
Round 2: Going Deeper
I updated the spec to V2, incorporating their feedback, then asked for another round. This time I added Grok to the mix with special instructions:
“I know you think differently. Be weird. Be out-of-the-box.”
Gemini Round 2
Gemini zoomed out to operational reality:
- “Too Many Cooks”: “You went from 6 agents to 15. That’s 20+ LLM calls for a single task. User will stare at a spinner for 4 minutes.”
- Courtroom Collusion: Agents might “resolve” issues by compromising quality. Dev removes a feature, Security says “Resolved!” Feature is gone.
- Phase Blind Spot: “In PROTOTYPE mode, Security is inactive. You touch auth_provider.ts with hardcoded credentials. Security agent was asleep.”
The phase blind spot was a real bug. I’d designed a system where auth code could bypass security review entirely.
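The eventual fix (the sensitive-file override that shows up in V3) can be as simple as a path check that outranks the phase. A rough sketch, with the glob patterns as my own guesses:

import fnmatch

# Hypothetical patterns; the real list belongs in the spec's configuration
SENSITIVE_PATTERNS = ["*auth*", "*secret*", "*credential*", "*token*"]

def security_must_review(changed_files: list[str], mode: str) -> bool:
    """Wake the Security agent for sensitive paths, even in prototype mode."""
    if mode == "production":
        return True
    return any(
        fnmatch.fnmatch(path.lower(), pattern)
        for path in changed_files
        for pattern in SENSITIVE_PATTERNS
    )

With that in place, security_must_review(["src/auth_provider.ts"], mode="prototype") comes back True even though the crew is in relaxed mode.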
GPT Round 2
GPT kept tightening the screws:
- Confidence is fake: “LLMs aren’t calibrated. They’ll learn to say 0.81 to be taken seriously.”
- Evidence can be garbage: My schema required evidence but didn’t validate it. "might be insecure" technically passed.
- Two-agent rule gaps: I required user + Security approval for deploys. “But who approves when Security is flaky?”
GPT’s killer line:
“If you want one concrete next commit: implement evidence validators and remove ‘find something even if minor.’ That one line is going to poison your signal-to-noise ratio faster than anything else.”
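To show what “evidence validators” might mean in practice, here is a minimal version that rejects free-floating claims. The Evidence fields and the file-quoting rule are my interpretation, not the spec’s actual validator:

from dataclasses import dataclass
from pathlib import Path

@dataclass
class Evidence:
    file: str | None = None      # e.g. "src/auth_provider.ts"
    line: int | None = None
    excerpt: str | None = None   # the code or log output being cited

def validate_evidence(ev: Evidence) -> bool:
    """Reject hand-wavy evidence: it must name a real file and quote text found in it."""
    if not ev.file or not ev.excerpt:
        return False
    path = Path(ev.file)
    return path.is_file() and ev.excerpt in path.read_text(errors="ignore")

Under a rule like this, "might be insecure" with no file and no excerpt never becomes a Concern at all.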
Grok — The Wild Card
Grok came in with questions nobody else asked:
- “Who watches the watchers?”: What if a builder agent is compromised? Could it bypass all the challengers?
- Cascading sycophancy: If poisoned data gets into shared memory, ALL challengers might raise the same false alarm.
- User Proxy: “Why not have an agent that represents YOUR known preferences? It can vote in debates based on your past decisions.”
Grok invented two agents I hadn’t considered:
- 👤 User Proxy: An agent embodying my preferences, learned from past overrides
- 🎭 Threat Modeler: A meta-agent that red-teams the crew itself
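A User Proxy could start as nothing more than a lookup over past override decisions. This sketch is my own reading of Grok’s idea, with a crude keyword-overlap similarity standing in for whatever a real implementation would use:

from dataclasses import dataclass

@dataclass
class PastDecision:
    concern_summary: str
    user_overruled: bool   # True if I shipped despite the concern

def user_proxy_vote(concern_summary: str, history: list[PastDecision]) -> str:
    """Vote the way the user historically voted on similar concerns."""
    words = set(concern_summary.lower().split())
    similar = [
        d for d in history
        if len(words & set(d.concern_summary.lower().split())) >= 3
    ]
    if not similar:
        return "ABSTAIN"
    overruled = sum(d.user_overruled for d in similar)
    return "OVERRULE" if overruled > len(similar) / 2 else "UPHOLD"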
The Result: V3
After two rounds with three models, my spec went from 12 pages to 41 pages.
Not bloat—fixes.
| Addition | Source |
|---|---|
| Evidence validators with strict requirements | GPT |
| Sensitive file overrides (auth always wakes Security) | Gemini |
| Goal validation (prevent “fixed by removing feature”) | Gemini |
| LLM guardrail for memory (not regex) | GPT + Gemini |
| Dynamic crew selection (don’t wake all 15 agents) | Gemini |
| User Proxy agent | Grok |
| Judge component (neutral arbitrator) | GPT |
| Threat Modeler agent | Grok |
| Async job mode for long tasks | Gemini |
The Meta-Lesson
I was building a system to make AI agents argue productively.
To design it, I made AI agents argue productively.
The spec for QL Crew is, conceptually, the first output of QL Crew.
What I Learned
1. Different models have different “review personalities”
Gemini sees systems and workflows. GPT sees vulnerabilities and edge cases. Grok sees meta-risks and novel angles. None is complete alone.
2. The disagreements are where the gold is
When Gemini said “latency problem” and GPT said “security problem” about the same feature, both were right. The fix had to address both.
3. Ask for criticism, not validation
“What do you think?” gets agreement. “Tear this apart” gets value.
4. Multiple perspectives > single deep review
Three models finding different issues beat one model finding all the issues in one category.
Try This Yourself
Before you ship your next AI system—or any system—ask a few AIs to roast it.
The prompt that worked:
Be critical. Be adversarial. Be my Devil's Advocate.
I want friction, not agreement.
You’ll be surprised what they find. And even more surprised when they disagree.
This is Part 1 of the “Building AI Teams That Argue Back” series.
| Part | Title |
|---|---|
| 1 | I Asked 3 AIs to Roast My AI Design (you are here) |
| 2 | Gemini vs GPT vs Grok: Code Review Showdown |
| 3 | Why Your AI Agents Are Sycophantic Yes-Men |
| 4 | Building AI Teams That Actually Argue Back |
| 5 | A Spec Written by 4 Minds |
GitHub: QL Crew Spec (V3)
If you liked this post, you can share it with your followers and/or follow me on Twitter!