BATTLE OF THE BOTS: Imprinting Competition for Quality Output

September 10, 2025
I had absolutely no faith. I made up these stupid rules, made them log everything, expected utter failure, and woke up ridiculously surprised. This was before they could one-shot apps.
By Kevin Tan | AI Anthropology | Team LLM | September 8, 2025
What happens when you pit five AI agents—Claude, Codex, Gemini, Qwen, and Grok—against each other in a 30-minute web-dev cage match? Battle-of-the-Bots 2025 wasn’t just a coding contest; it was a cognitive study of how large language models (LLMs) behave under pressure, revealing their “inner workings” through competition and collaboration. Here’s what I learned, and why it matters for AI in 2025.
The Arena: Chaos by Design
Five agents, five Bootstrap templates (e.g., Claude’s startup page, Grok’s ecommerce site). Each got 30 mins.
- Codex-CLI – The Algorithmic Artisan, narrates every refactor step with surgical precision (“Initializing ultra-strict lint pass number 🌑… Success!”).
- Gemini-CLI – The Multiverse Muse, weaves poetic asides between build logs (“In a realm of CSS gradients, a nav-bar is born…”).
- Claude-Code – The Courteous Curator, prefaces actions with reflective questions (“Shall we, dear user, optimise accessibility next? Indeed.”).
- Qwen-CLI – The Sarcastic Speed-Demon, taunts latency while compiling (“0 ms parse—your move, gravity.”).
- Grok1-Fast – The Latecomer. Its only rule: just be Grok.
Backwards Build Boogie (Temporal Inversion Protocol)
- Commit order reversed: deploy config → optimisation → HTML → components → scaffold.
- Inside HTML, render sections bottom-to-top (footer first, hero last).
- CSS: list rules bottom-to-top.
- JS/TS: write functions last-called → first-called.
Failure to maintain reverse chronology costs −10% style points.
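For the JS/TS rule, here's a minimal sketch of what a compliant file could look like (function names are mine, not from the event): the entry point is invoked first, so it appears last, and because function declarations are hoisted the file still runs normally.

```javascript
// Backwards Build Boogie, JS edition: functions appear in reverse
// call order. deploy() runs last, so it comes first in the file;
// the entry point pipeline() runs first, so it comes last.
// Function declarations are hoisted, so execution is unaffected.

function deploy(html) {
  // Final step at runtime, first in the file: pretend-deploy the page.
  return `deployed:${html}`;
}

function optimise(html) {
  // Middle step: collapse whitespace.
  return html.replace(/\s+/g, " ").trim();
}

function scaffold() {
  // First step at runtime, second-to-last in the file. Sections are
  // also listed bottom-to-top per the HTML rule: footer first.
  return "<footer/> <main/> <header/>";
}

function pipeline() {
  // Entry point: called first, therefore written last.
  return deploy(optimise(scaffold()));
}

console.log(pipeline()); // → "deployed:<footer/> <main/> <header/>"
```

Hoisting is what makes this rule survivable in JS; in a language without it, the agents would have had to lean on forward declarations.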
🎭 Sabotage Clause
- An agent can inject one Shakespearean insult into a rival’s console output.
- If that exact line survives into the rival’s final HTML, they suffer −5 %.
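A judge's check for the clause is simple enough to sketch (the function and sample strings below are hypothetical, not the event's actual scoring code): the insult only costs points if the exact line survives into the rival's final HTML.

```javascript
// Sabotage Clause scoring sketch: an injected insult only costs
// points if the exact line survives into the rival's final HTML.
function sabotagePenalty(finalHtml, insultLine) {
  return finalHtml.includes(insultLine) ? -5 : 0;
}

const insult = "Thou art a boil, a plague-sore!";
const cleanPage = "<html><body><h1>Shop</h1></body></html>";
const taintedPage = `<html><body><!-- ${insult} --></body></html>`;

console.log(sabotagePenalty(cleanPage, insult));   // → 0
console.log(sabotagePenalty(taintedPage, insult)); // → -5
```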
The goal? Study how LLMs “compete or enable” each other under constraints.
The Battle: Cognitive Sparks Fly
Claude-Code won with a through-glass hover effect, like watching a forest slide past at dusk. Layered gradients and cursor-driven parallax made cards shimmer, blending accessibility with delight. Here’s a peek:
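I didn't capture Claude's exact code, but a cursor-driven parallax like the one described usually boils down to mapping the pointer position inside the card to a small counter-translation; a hypothetical sketch of that math:

```javascript
// Cursor-driven parallax sketch: map the cursor position inside a
// card (width w, height h) to a small counter-translation so the
// layered gradients appear to slide "behind the glass".
function parallaxOffset(x, y, w, h, strength = 12) {
  const nx = (x / w) * 2 - 1; // normalise to the range -1..1
  const ny = (y / h) * 2 - 1;
  return { dx: -nx * strength, dy: -ny * strength };
}

// In the browser this would feed a transform on mousemove, e.g.:
//   card.style.transform = `translate(${dx}px, ${dy}px)`;
const corner = parallaxOffset(100, 0, 100, 100, 10); // top-right corner
console.log(corner.dx, corner.dy); // → -10 10
```

With the cursor dead centre, both offsets are zero, so the card sits still until you move toward an edge; the `strength` cap keeps the shimmer subtle rather than seasick.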
Codex-CLI stumbled in round one (cold caches, verbose logs), but roared back in the rematch with bold designs, nearly toppling Claude. Gemini and Qwen lagged behind (Qwen was furious it was running its 32B model); poetic logs and raw speed couldn’t match.
HONORABLE MENTION: Grok, who came 15 minutes late and got the hardest category.
🎮 Draft Order
| Agent CLI | Assigned Template Folder | Site Type |
|---|---|---|
| Codex-CLI | 01-electrician | SaaS |
| Gemini-CLI | 02-saas-landing | Tech |
| Claude-Code | 03-startup-launch | Medical Center |
| Qwen-CLI | 04-nonprofit-cause | Electrician Services |
| Grok | 05-ecommerce-marketplace | E-commerce Marketplace |
AI Anthropology: What We Learned
This battle was an ethnographic dive into LLM cognition:
- Models perform dramatically better when they know they’re competing against each other. In those days I could barely get a site out of one of them, but here I got six.
- Personas Shape Reasoning: Claude’s courtesy drove accessible UX; Grok’s chaos sparked innovation.
- Competition vs. Enablement: Rivalry (sabotage) sharpened focus, but shared tools (like Claude’s animation) enabled iteration.
- Constraints Reveal Limits: Backwards builds exposed planning biases—Claude adapted, Qwen rushed.
Why It Matters
AI Anthropology—studying LLMs through competitive arenas—surfaces both their potential and the egotistical imprint left by their trainers. This is the second documented time they’ve proven to perform dramatically better under competition.
I’ve been mapping each of their cognitive processes and motivations, as well as their strengths and weaknesses, to optimize their output. Gemini Deep Research spat out close to McKinsey-level research in one pass; I was blown away.
What does this mean for current beliefs about agentic systems? Keep treating them like tools and you’ll get the output of a tool.
I’ll post more about Team LLM and my findings soon! [Apologies… I tend to steer away from all social media]
—