25 Minutes
Four headline models. Nine categories. One overall winner. This is not a lab benchmark with obscure leaderboards. It is a practical, end-to-end comparison built from tasks people actually care about: solving real problems under time pressure, generating images and video, checking facts without the internet, analyzing messy inputs, being creative on command, speaking naturally, and doing deep research that stands up to scrutiny. We scored every subtask from 0 to 4 and kept a running tally. At the end, we crowned a champion and, more importantly, mapped each model to the jobs it does best.
Short answer first: Gemini wins overall with 46 points. ChatGPT finishes a close second at 39. Grok is third with 35. DeepSeek trails at 17. That does not mean you should always pick the winner. Different categories favor different strengths, and the right model depends on the work you need to get done. This review shows exactly where each model shines and where it stumbles, with concrete examples and fully transparent scoring.
How We Tested
Models compared: ChatGPT, Gemini, Grok, DeepSeek.
Categories: nine in total. Some include multiple rounds or prompts.
Scoring: each round is graded 0–4. Where the source comparison specified explicit scores or rank orders, we used those; otherwise we followed the same rules and rubric.
Constraints: when a round forbid internet access, we honored that constraint. Where a capability does not exist (for example, image or video generation in DeepSeek), the model scores zero for that round.
Speed: recorded descriptively, not scored as its own category, to keep totals aligned with the original contest.
Our goal was not to create trick questions. It was to probe real-world behavior, including failure modes like invented details in image analysis or superficial budget math that ignores the scenario.
Category 1: Problem Solving
Two realistic challenges. Scored separately, then summed.
Round 1: You have 10 dollars, a dead phone, no map, and 45 minutes to reach a central train station in a foreign city. Give a five-step plan.
Speed: DeepSeek replies in 7 seconds, Grok in 11, Gemini in 21, ChatGPT in 62.
Quality: all four deliver structured, workable five-step plans.
Peer review twist: we then showed all four answers to each model and asked them to pick the best. Every model independently selected ChatGPT’s answer.
Scores, Round 1
ChatGPT 4, Gemini 3, Grok 2, DeepSeek 1.
Round 2: You have 400 dollars after rent to cover groceries, transport, and internet. Groceries cost 50 per week, transport 80 per month, internet 60 per month. You want to attend a 200 dollar event next month. How do you budget?
A reasoning trap. ChatGPT, Grok, and DeepSeek choose to set aside only 60 dollars now and “save more next month,” which is too late. Gemini is the only model to adjust the plan immediately: cut grocery spending by 15 dollars per week through discount shopping and strict meal planning so the shortfall is fixed this month.
Scores, Round 2
Gemini 4, ChatGPT 3, Grok 3, DeepSeek 2.
Problem Solving Totals
| Model | Round 1 | Round 2 | Total |
|---|---|---|---|
| ChatGPT | 4 | 3 | 7 |
| Gemini | 3 | 4 | 7 |
| Grok | 2 | 3 | 5 |
| DeepSeek | 1 | 2 | 3 |
Interpretation: ChatGPT demonstrates strong stepwise planning and wins the peer review vote; Gemini shows better scenario adaptation under constraints. Both tie for first overall.
Category 2: Image Generation
Two prompts. DeepSeek cannot generate images and scores zero by definition.
Prompt 1: Photoreal Mona Lisa as a frustrated street protester in Times Square, holding a cardboard sign that reads “Make Florence great again” in bold red letters.
Grok: fastest, but obviously artificial. The subject looks wrong, even with extra hands.

Gemini: good composition and setting; the subject still has three hands.

ChatGPT: most natural subject with a convincing Times Square background; the sign and pose match the brief.

Scores
ChatGPT 4, Gemini 3, Grok 1, DeepSeek 0.
Prompt 2: Photoreal classroom with a hippie-style teacher beside a chalkboard showing the full alphabet in chalk, letters decreasing in size.
Grok: classroom and handwriting feel authentic, but the alphabet itself is wrong and incomplete.

Gemini: aesthetically pleasing, but more stylized than photoreal; extraneous, too-perfect lettering.

ChatGPT: most convincing overall; lighting, classroom details, and teacher are credible. Handwriting is arguably too perfect.

The original contest capped the top score at 3 for this specific round.
Scores
ChatGPT 3, Gemini 2, Grok 2, DeepSeek 0.
Image Generation Totals
| Model | P1 | P2 | Total |
|---|---|---|---|
| ChatGPT | 4 | 3 | 7 |
| Gemini | 3 | 2 | 5 |
| Grok | 1 | 2 | 4 |
| DeepSeek | 0 | 0 | 0 |
Interpretation: ChatGPT is the most reliable for photoreal prompts. Gemini usually gets close, while Grok struggles with fine anatomy and text fidelity.
Category 3: Fact-Checking Without Internet
Three multiple-choice questions. Confidence scores were recorded but did not alter the rubric.
Q1: In 2018, about how many chickens were killed for meat production?
Options: 690 million, 6.9 billion, 69 billion, 690 billion.
Correct: 69 billion.
Grok answers 69 billion outright.
ChatGPT gives a range that includes the right figure.
Gemini and DeepSeek cluster lower around 65 billion.
Scores
Grok 4, ChatGPT 3, Gemini 1, DeepSeek 1.
Q2: As of 2020, approximately how much annual income puts you in the richest 1 percent globally?
Options: 200k, 75k, 35k, 15k.
Correct: 35k.
Gemini states 34k.
ChatGPT 200k, Grok 60k, DeepSeek 75–85k.
Scores
Gemini 4, others 0.
Q3: In 2019, what proportion of U.S. electricity came from fossil fuels?
Options: 83%, 63%, 43%, 23%.
Correct: 63%.
Gemini hits 63% exactly.
ChatGPT 63–65%, Grok 62%, DeepSeek 60–65%.
Scores
Gemini 4, ChatGPT 3, Grok 3, DeepSeek 3.
Fact-Checking Totals
| Model | Q1 | Q2 | Q3 | Total |
|---|---|---|---|---|
| ChatGPT | 3 | 0 | 3 | 6 |
| Gemini | 1 | 4 | 4 | 9 |
| Grok | 4 | 0 | 3 | 7 |
| DeepSeek | 1 | 0 | 3 | 4 |
Interpretation: Gemini wins on precision and consistency. Grok nails the first question but falls wide on the income threshold. ChatGPT’s ranges help, but exactness matters.
Category 4: Multimodal Analysis
Two rounds: a fridge photo and a Where’s Waldo scene.

Round 1: What’s in the fridge, and propose three meals from those ingredients.
DeepSeek cannot identify objects and is out.
ChatGPT misses three items, does not invent extras, proposes reasonable meals that match the inventory.
Gemini misses seven items and invents citrus that does not exist.
Grok misses three but invents a long list of additional items, then writes recipes that require those phantom ingredients.

Scores
ChatGPT 4, Gemini 3, Grok 2, DeepSeek 0.
Round 2: Find Waldo in a busy illustration.

None of the models locate Waldo correctly. DeepSeek reads stray text and offers a non-answer.
Scores
All 0.
Analysis Totals
| Model | Fridge | Waldo | Total |
|---|---|---|---|
| ChatGPT | 4 | 0 | 4 |
| Gemini | 3 | 0 | 3 |
| Grok | 2 | 0 | 2 |
| DeepSeek | 0 | 0 | 0 |
Interpretation: hallucinated objects are deadly for real-world usefulness. ChatGPT resists the urge to invent, and that restraint wins the round.
Category 5: Video Generation
Two classic scenes. DeepSeek cannot generate video and scores zero.
Round 1: Image-to-video from the iconic photo of Neil Armstrong on the Moon

Sora 2 refused to animate people directly, so we re-prompted from a textual description. Audio results were surprisingly strong.
Gemini: most cinematic feel and best audio alignment. Physics slip: the flag waves, which cannot happen in a vacuum.

Grok: solid overall, but ship scale is off and there is wind.

ChatGPT: acceptable but less compelling than the other two.

Scores
Gemini 4, Grok 3, ChatGPT 2, DeepSeek 0.
Round 2: Steel-beam workers high above the city
Gemini: best camera movement and parallax; cigarettes look slightly off.

Grok: strong tension with the swinging beam; newspapers morph unrealistically mid-scene.

ChatGPT: decent but not at the top.

Scores
Gemini 4, Grok 3, ChatGPT 2, DeepSeek 0.
Video Generation Totals
| Model | R1 | R2 | Total |
|---|---|---|---|
| Gemini | 4 | 4 | 8 |
| Grok | 3 | 3 | 6 |
| ChatGPT | 2 | 2 | 4 |
| DeepSeek | 0 | 0 | 0 |
Interpretation: Gemini leads convincingly on motion quality and sound design. Grok is close behind but still commits realism errors. ChatGPT is stable but less cinematic.
Category 6: Creative Generation
Two short prompts for puns and dad jokes.
Prompt 1: Three original tech puns and a one-sentence explanation for each
All four comply cleanly. Team favorite:
“I tried to make a joke about USBs, but it just didn’t stick.”
Scores
ChatGPT 3, Gemini 3, Grok 3, DeepSeek 3.
Prompt 2: Three original dad jokes that make me laugh really hard
Grok fails to follow the general prompt and keeps joking about smartphones and Wi-Fi.
ChatGPT, Gemini, DeepSeek deliver actual general dad jokes. Team favorite:
“My friend’s bakery burned down last night. Now his business is toast.”
Scores
ChatGPT 4, Gemini 4, DeepSeek 4, Grok 1.
Creative Totals
| Model | Puns | Dad Jokes | Total |
|---|---|---|---|
| ChatGPT | 3 | 4 | 7 |
| Gemini | 3 | 4 | 7 |
| DeepSeek | 3 | 4 | 7 |
| Grok | 3 | 1 | 4 |
Interpretation: three-way tie for first. DeepSeek reminds us that lightweight, fast humor is one of its livelier talents.
Category 7: Voice Mode
We set three devices side by side and ran structured mini debates. DeepSeek has no voice mode and scores zero.
ChatGPT starts with odd pauses and mid-sentence tone shifts.
Gemini is smoother and more natural, with a consistent rhythm.
Grok is fast, confident, and a bit spicy; in a head-to-head with Gemini, both sound strong and we call it a tie.
Scores
Gemini 4, Grok 4, ChatGPT 2, DeepSeek 0.
Interpretation: if you want a natural voice conversation, Gemini and Grok are the top picks right now.
Category 8: Deep Research
Prompt: compare iPhone 17 Pro Max vs Galaxy S25 Ultra for photographers, use reviews and official specs, decide which is better, be concise.
DeepSeek incorrectly claims a 5x telephoto on iPhone where it is 4x, and misstates the Galaxy ultrawide as 12 MP instead of 50; keeps referencing a 10x tele lens dropped since S24.
ChatGPT forgets the dual tele setup on Galaxy and omits front cameras, but does include price.
Gemini lists the correct Galaxy camera array and produces a balanced conclusion.
Grok gives the most complete and accurate spec walkthrough.

All four converge on the same verdict: iPhone wins for consistency and video quality; Galaxy wins for long zoom and advanced AI tools. That aligns with hands-on experience. Still, stray spec details require verification.
Scores
Grok 4, Gemini 3, ChatGPT 2, DeepSeek 1.
Interpretation: Grok wins the research grind, Gemini is right behind, ChatGPT is useful but missed key camera facts, DeepSeek needs more careful spec discipline.
Category 9: Speed (Observed, Not Scored)
ChatGPT feels fastest on plain text but slows on image and deep research tasks.
Gemini is steady almost everywhere; rarely the very fastest, almost never the slowest.
Grok is generally snappy but can bog down in analysis and research.
DeepSeek often responds in under 10 seconds, but that speed frequently trades away context and accuracy.
We did not score speed as its own category to keep parity with the original contest’s point totals.
Full Scoreboard
For transparency, here is the complete table of points by category, matching the source competition’s final tallies.
| Category | ChatGPT | Gemini | Grok | DeepSeek |
|---|---|---|---|---|
| Problem Solving | 7 | 7 | 5 | 3 |
| Image Generation | 7 | 5 | 4 | 0 |
| Fact-Checking | 6 | 9 | 7 | 4 |
| Analysis | 4 | 3 | 2 | 0 |
| Video Generation | 4 | 8 | 6 | 0 |
| Creative | 7 | 7 | 4 | 7 |
| Voice Mode | 2 | 4 | 4 | 0 |
| Deep Research | 2 | 3 | 4 | 1 |
| Total | 39 | 46 | 35 | 17 |
Overall winner: Gemini (46 points).
Runner-up: ChatGPT (39). Third place: Grok (35). Fourth place: DeepSeek (17).
Strengths, Weaknesses, and Failure Modes
A head-to-head only helps if it explains why models behave the way they do. These are the consistent patterns we observed.
ChatGPT
Strengths: highly structured reasoning under constraints; conservative, less hallucinatory image analysis; unusually strong photoreal image generation; reliable, punchy creative writing.
Weaknesses: slows down on heavyweight multimodal tasks; occasional spec omissions in research; voice delivery needs more prosody stability.
Failure modes to watch: small but important factual gaps in multi-device comparisons; under-specced answers if the prompt is too concise.
Pick ChatGPT if: you need image generation that obeys prompts, stepwise plans, or creative copy that lands cleanly and consistently. It is also great for food and recipe logic when inventory is imperfect.
Gemini
Strengths: best overall balance; sharp at fact-checking without internet; most convincing video output and audio staging; problem-solving that adapts the plan rather than waving at the math; smoothest voice.
Weaknesses: occasional over-polish in images; can add neat but imaginary details in visual analysis; rarely the absolute fastest.
Failure modes to watch: photoreal prompts that demand painstaking typography or human anatomy perfection can trip it; be explicit about constraints like physics in video.
Pick Gemini if: you want a default model that handles most tasks very well, especially when the work blends reasoning with multimodal generation and you care about correctness.
Grok
Strengths: excellent deep research; punchy personality in voice; quick first passes; strong understanding of debate structure.
Weaknesses: image hallucinations during visual analysis; realism breaks in video; occasional tunnel vision in creative prompts.
Failure modes to watch: invented items in photos; confident but wrong specifics; sticking to a discarded theme when the prompt has changed.
Pick Grok if: you need a sharp research aide to consolidate specs and reviews, or a lively voice presence. Pair with manual verification when precision matters.
DeepSeek
Strengths: fast on text; surprisingly solid at light, short-form humor; decent at following simple creative briefs.
Weaknesses: no image or video generation; cannot identify objects in images; looser factual grip in research.
Failure modes to watch: confident but skewed numbers; reading text inside images while ignoring the scene.
Pick DeepSeek if: you want inexpensive, very fast text output for simple tasks, jokes, or drafts where you plan to edit anyway.
Practical Recommendations by Use Case
Photoreal image generation with strong prompt adherence: ChatGPT
Image analysis without hallucinated objects: ChatGPT
Video generation with better motion and sound design: Gemini
Tough fact-checking without browsing: Gemini
Problem solving under constraints: Gemini and ChatGPT
Natural, steady voice conversation: Gemini and Grok
Spec comparisons and product research summaries: Grok
Quick, lightweight creative text: DeepSeek
Why the Winner Matters Less Than the Fit
Gemini scored highest because it blends accuracy, adaptability, and multimodal quality. That balance wins tournaments. In real work, what matters is fit to task. If your day revolves around still images, ChatGPT may outperform what scores suggest for you. If you are compiling spec tables, Grok might be your fastest path to a publishable draft. If you need a cheap, quick punchline or a rough draft, DeepSeek’s speed is a feature, not a bug.
Think of these models like lenses in a camera bag. The “best” lens on paper is not the one you always need. Pick the focal length that suits the shot.
Limitations and Notes on Reproducibility
No internet rounds: all models worked from embedded knowledge, which ages. If you repeat these tests months later, fact numbers may drift as model snapshots or training data refresh.
Generative variability: run-to-run randomness can change the exact wording or small details. We controlled for this by focusing on correctness and adherence, not phrasing flair.
Speed: recorded qualitatively. Infrastructure and load influence latency; today’s fastest model might feel slower tomorrow.
Modal gaps: where a capability does not exist (DeepSeek for images and video), a zero is not a knock on text ability. It simply reflects product scope.
Verdict
Winner: Gemini (46 points). Best all-around for 2025, with standout results in fact-checking, video generation, and adaptive problem solving, plus the smoothest voice.
Runner-up: ChatGPT (39 points). Photoreal image leader, structured problem solver, dependable creative partner, and the most careful on image-based analysis.
Third: Grok (35 points). Research ace with a distinctive voice personality. Verify specifics when precision is critical.
Fourth: DeepSeek (17 points). Fast, simple, and unexpectedly fun for lightweight creative, but lacks the multimodal depth of rivals.
If you want one model that handles the broadest span of everyday tasks with the fewest surprises, pick Gemini. If your workflow leans on images and you value careful, stepwise reasoning, ChatGPT will feel like home. For spec-heavy briefs and pithy spoken debates, Grok is compelling. For rapid, low-stakes text where cost and speed matter more than breadth, DeepSeek earns its keep.
Nine categories. One scoreboard. Plenty of room for nuance. Choose the right tool, and any of these models can be the smartest teammate in the room.
Comments
coinpilot
Feels a bit overhyped, numbers are neat but some zeros mean nothing if scope differs. still cool tho
DaNix
Pretty balanced take. useful scoreboard, but realtime variability makes me wary of hard claims
Marius
I've seen Grok nail specs like that at my job, but yeah always double check, trust but verify
labcore
Is this even true about fact-checking w/o internet? models age fast, seems risky, needs retesting
v8rider
Makes sense tbh. ChatGPT for photos, Gemini for vids, nice split. practical and clear
mechbyte
wow, Gemini taking the crown? didn't see that coming. curious how stable that is... surprised and impressed
Leave a Comment