The debate over AI coding agents continues to divide the programming community, with some developers praising their potential to accelerate workflows while others decry frequent errors and inefficiencies that demand extensive human intervention. To evaluate the current capabilities of these tools objectively, Ars Technica conducted a rigorous test by challenging four leading AI coding agents to recreate the classic Windows game Minesweeper as a full-featured web version. The prompt specifically required replication of the original gameplay mechanics, mobile touchscreen support, sound effects, and one original “surprise fun” feature, providing a balanced benchmark that tests both technical accuracy and creative problem-solving without excessive complexity.
Testing Methodology
Four prominent AI coding agents participated in this single-shot evaluation: OpenAI’s Codex powered by GPT-5, Anthropic’s Claude Code with Opus 4.5, Google’s Gemini CLI, and Mistral Vibe. Each agent received the identical prompt through terminal interfaces and generated complete HTML, CSS, and JavaScript implementations directly on local machines. No human debugging occurred during code generation, simulating a realistic “first attempt” scenario for production use. Veteran Minesweeper player Kyle Orland evaluated the results blindly, assessing implementation fidelity, user interface polish, mobile responsiveness, sound integration, and the quality of the novel feature.
Minesweeper serves as an ideal test case because it demands sophisticated logic for mine placement, adjacency counting, recursive flood fills, and win/loss detection—far beyond trivial scripts—yet remains contained enough to complete within minutes. The requirement for a unique “fun” feature further probes each agent’s ability to innovate atop familiar patterns, revealing strengths in both rote replication and genuine creativity.
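For a sense of the logic involved, here is a minimal sketch of the adjacency counting and flood-fill reveal at the heart of any Minesweeper clone; the cell structure and helper names are assumptions for illustration, not code drawn from any agent's actual output.

```javascript
// Minimal sketch of Minesweeper reveal logic, assuming a 2D grid of cell
// objects shaped like { mine: boolean, adjacent: number, revealed: boolean }.
// Not taken from any tested agent's output.
function countAdjacentMines(grid, row, col) {
  let count = 0;
  for (let dr = -1; dr <= 1; dr++) {
    for (let dc = -1; dc <= 1; dc++) {
      if (dr === 0 && dc === 0) continue;
      const cell = grid[row + dr]?.[col + dc];
      if (cell?.mine) count++;
    }
  }
  return count;
}

// Reveal a cell; if it touches zero mines, flood-fill outward. An explicit
// stack avoids recursion-depth problems on large boards. Clicking a mine is
// assumed to be handled by separate game-over logic.
function reveal(grid, row, col) {
  const stack = [[row, col]];
  while (stack.length > 0) {
    const [r, c] = stack.pop();
    const cell = grid[r]?.[c];
    if (!cell || cell.revealed || cell.mine) continue;
    cell.revealed = true;
    cell.adjacent = countAdjacentMines(grid, r, c);
    if (cell.adjacent === 0) {
      for (let dr = -1; dr <= 1; dr++) {
        for (let dc = -1; dc <= 1; dc++) {
          if (dr !== 0 || dc !== 0) stack.push([r + dr, c + dc]);
        }
      }
    }
  }
}
```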
Mistral Vibe: Functional but Incomplete
Mistral Vibe produced a playable game that captured core mechanics like mine generation and number revelation but notably omitted chording, the advanced technique where right-clicking on numbered tiles with fully flagged neighbors clears surrounding safe cells. This absence renders efficient play frustratingly slow, a critical flaw for serious players. Mobile controls relied on awkward long-presses for flagging, triggering unwanted text selection handles, while the decorative “Custom” difficulty button served no function despite implying customizable board sizes.
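For readers unfamiliar with the mechanic, chording logic looks roughly like the following sketch; the cell shape and the reveal() helper (as in the earlier sketch) are assumptions, not a reconstruction of any agent's code.

```javascript
// Sketch of chording: when a revealed numbered cell has as many flagged
// neighbors as its mine count, reveal its remaining unflagged neighbors.
// The cell shape ({ revealed, flagged, adjacent, mine }) and the reveal()
// helper from the earlier sketch are assumptions.
function chord(grid, row, col) {
  const cell = grid[row]?.[col];
  if (!cell || !cell.revealed || cell.adjacent === 0) return;

  const neighbors = [];
  for (let dr = -1; dr <= 1; dr++) {
    for (let dc = -1; dc <= 1; dc++) {
      if (dr === 0 && dc === 0) continue;
      const n = grid[row + dr]?.[col + dc];
      if (n) neighbors.push({ r: row + dr, c: col + dc, cell: n });
    }
  }

  const flaggedCount = neighbors.filter((n) => n.cell.flagged).length;
  if (flaggedCount !== cell.adjacent) return; // flags don't match the number yet

  for (const n of neighbors) {
    // A misplaced flag can still reveal a mine here, just as in the original game.
    if (!n.cell.flagged && !n.cell.revealed) reveal(grid, n.r, n.c);
  }
}
```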
Presentation suffered from missing sound effects, despite the prompt’s explicit request, and an unappealing black smiley face icon that deviated from the iconic yellow design. The sole “fun” addition, a rainbow grid animation upon victory, felt underwhelming and tacked on. Though the result was respectable for an open-weight model, sluggish generation and an incomplete feature set earned Mistral a 4/10, highlighting room for improvement in both speed and attention to detail.
OpenAI Codex: The Standout Performer
OpenAI Codex delivered the strongest entry, flawlessly implementing chording with clear on-screen instructions for both desktop and mobile use, plus the obscure “?” marking cycle for uncertain tiles. Flag placement via intuitive long-press worked seamlessly on touchscreens, making it the most enjoyable handheld experience. Retro beep sound effects evoked 1980s PCs, complete with a toggle for muting, while the emoticon smiley face evolved expressively from happy yellow to shocked red upon mine detonation.
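Long-press flagging of the kind Codex implemented is a common touchscreen pattern; a rough sketch follows, with the timing threshold and DOM attributes chosen arbitrarily for illustration rather than taken from Codex's output.

```javascript
// Rough sketch of long-press flagging for touchscreens, assuming each board
// cell is a DOM element carrying data-row/data-col attributes and that a
// toggleFlag(row, col) function exists elsewhere. The 400 ms threshold is an
// arbitrary choice, not Codex's actual value.
const LONG_PRESS_MS = 400;

function attachTouchFlagging(cellEl, toggleFlag) {
  let pressTimer = null;

  cellEl.addEventListener('touchstart', () => {
    pressTimer = setTimeout(() => {
      toggleFlag(Number(cellEl.dataset.row), Number(cellEl.dataset.col));
      pressTimer = null;
    }, LONG_PRESS_MS);
  });

  const cancel = () => {
    if (pressTimer !== null) clearTimeout(pressTimer);
    pressTimer = null;
  };
  cellEl.addEventListener('touchend', cancel);
  cellEl.addEventListener('touchmove', cancel);

  // Blocking the context menu and text selection avoids the stray selection
  // handles that a bare long-press tends to trigger in mobile browsers.
  cellEl.addEventListener('contextmenu', (e) => e.preventDefault());
  cellEl.style.userSelect = 'none';
  cellEl.style.webkitUserSelect = 'none';
}
```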
The “Lucky Sweep Bonus” feature granted free safe-tile reveals after major cascades, offering strategic relief in guess-heavy endgames without trivializing the difficulty. Codex’s polished terminal interface featured smooth animations and permission controls, though generation took roughly twice as long as its competitors’. These comprehensive strengths propelled it to a top score of 9/10, demonstrating frontier models’ ability to exceed expectations on nuanced tasks.
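The article does not reproduce Codex's source, but the described bonus could plausibly be wired up along these lines; the cascade threshold and random selection are invented for illustration.

```javascript
// Hypothetical sketch of a "Lucky Sweep" style bonus: after a reveal cascade
// clears many tiles at once, uncover one extra safe tile for free. The
// 15-tile threshold and random selection are assumptions; Codex's actual
// implementation is not published.
const CASCADE_THRESHOLD = 15;

function maybeGrantLuckySweep(tilesRevealedThisClick, grid) {
  if (tilesRevealedThisClick < CASCADE_THRESHOLD) return null;

  // Collect hidden, unflagged, mine-free cells and reveal one at random.
  const safeHidden = [];
  grid.forEach((rowCells, r) => rowCells.forEach((cell, c) => {
    if (!cell.revealed && !cell.flagged && !cell.mine) safeHidden.push([r, c]);
  }));
  if (safeHidden.length === 0) return null;

  const [r, c] = safeHidden[Math.floor(Math.random() * safeHidden.length)];
  grid[r][c].revealed = true;
  return [r, c]; // caller can highlight the bonus reveal in the UI
}
```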
Anthropic Claude Code: Polished but Limited
Claude Code impressed with professional visuals, including emoji-based face buttons, crisp bomb and flag graphics, and effective audio cues. However, it too omitted chording, which hampered gameplay flow, the equivalent of leaving a fundamental control out of a classic game. Mobile flag toggling worked adequately but introduced board cropping on larger grids and occasionally grayed-out visuals while powers were in use.
The standout “Power Mode” introduced five abilities: Shield for guess protection, Blast for a guaranteed cascade, X-Ray for temporarily revealing mines, Freeze for pausing the clock, and automatic safe-tile markings at the start of each game. The powers creatively altered the core rules, but their sheer abundance trivialized Expert boards and undermined the challenge. Claude’s supremely fast generation (under five minutes) and intuitive interface secured a solid 7/10, balancing strong aesthetics against mechanical shortcomings.
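Claude's implementation is not published either, but a power inventory like the one described might be modeled roughly as follows; every name, charge count, duration, and game method here is an assumption.

```javascript
// Hypothetical model of a "Power Mode" inventory based on the abilities
// described above. The game methods, charge counts, and durations are all
// assumptions, not Claude's actual code.
const powers = {
  shield: { charges: 1, use: (game) => { game.shieldActive = true; } }, // survive one mine
  blast:  { charges: 1, use: (game) => game.revealSafeRegion() },       // guaranteed cascade
  xray:   { charges: 1, use: (game) => game.showMines(3000) },          // flash mines for 3 s
  freeze: { charges: 1, use: (game) => game.pauseTimer(10000) },        // pause clock for 10 s
};

function usePower(name, game) {
  const power = powers[name];
  if (!power || power.charges <= 0) return false;
  power.charges -= 1;
  power.use(game);
  return true;
}
```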
Google Gemini CLI: Technical Failure
Gemini CLI underperformed badly: despite generation attempts stretching past an hour, it produced only non-functional gray boxes with no interactive playfield. The agent fixated on overcomplicated dependencies such as React libraries and hand-built WAV files, ignoring the simpler Web Audio API even after guidance. A second, HTML5-focused run stalled on audio in the same way, yielding nothing playable.
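The simpler Web Audio route that Gemini passed over really is only a few lines; an illustrative oscillator beep (with arbitrary frequency, waveform, and duration, not taken from any agent's output) looks like this:

```javascript
// Illustrative Web Audio beep of the sort Gemini could have used instead of
// hand-building WAV files. Frequency, waveform, and duration are arbitrary
// choices, not code from any of the tested agents.
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();

function beep(frequency = 440, durationMs = 120) {
  const osc = audioCtx.createOscillator();
  const gain = audioCtx.createGain();
  osc.type = 'square';        // square wave for a retro PC-speaker feel
  osc.frequency.value = frequency;
  gain.gain.value = 0.1;      // keep the volume tame
  osc.connect(gain).connect(audioCtx.destination);
  osc.start();
  osc.stop(audioCtx.currentTime + durationMs / 1000);
}

// Example: a high blip on reveal, a low tone on a mine hit.
// beep(880, 60);
// beep(220, 300);
```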
This hybrid system, leveraging Gemini 2.5 variants (Flash Lite, Flash, and Pro), promised task specialization but delivered only frustration through sluggish planning and execution. Higher-tier Gemini 3 models are available on premium plans, but within this evaluation’s constraints the tool showed fundamental reliability gaps, earning an incomplete 0/10.
Key Insights and Future Implications
OpenAI Codex emerged victorious through mechanical completeness and thoughtful UX, with Claude Code close behind for speed and visuals. Mistral showed promise as an underdog, while Gemini faltered entirely. These results affirm AI agents’ evolution into viable prototyping tools for moderately complex applications, particularly when leveraging well-documented patterns like Minesweeper.
Yet persistent gaps, from chording oversights to mobile UX inconsistencies and unbalanced innovations, underscore these agents’ role as human-augmenting assistants rather than autonomous replacements. Interactive refinement remains essential for production code, but single-shot successes like Codex’s suggest rapid progress toward dependable software-engineering partners. As models advance, expect coding agents to tackle increasingly ambitious projects, reshaping development workflows while demanding vigilant oversight.



