Which AI Chatbots Hallucinate Most? New Data Reveals

A new reliability study ranks popular AI chatbots by hallucination rates, uptime and user satisfaction, with Gemini and ChatGPT trailing Perplexity, Grok and DeepSeek.

Emma Collins

Ask an AI chatbot for a stock price, a court date, or the name of a company executive, and the answer may arrive with total confidence. That is the unsettling part. The sentence can sound polished, the tone can feel certain, and the facts can still be wrong.

A new reliability analysis from Legal Guardian Digital, an SEO company focused on law firms, puts numbers behind a problem many users already recognize: some popular AI chatbots hallucinate far more often than others. With roughly a quarter of American workers now using AI tools regularly, the difference between a helpful assistant and a convincing source of misinformation is no small detail.

The uncomfortable part: confidence is not accuracy

Large language models do not think like humans. They are trained to predict likely words and phrases based on patterns in enormous volumes of text. When the system has enough context, that can produce fast, useful answers. When it does not, the model may still generate a response that sounds plausible because, statistically, the words fit together.
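To make the word-prediction idea concrete, here is a deliberately tiny sketch. It is not how any commercial chatbot is built: the corpus, the function names, and the "pick the most frequent next word" rule are all assumptions chosen for illustration. Real models are vastly larger and work on probabilities over tokens, but the failure mode is similar in spirit: the continuation that fits best statistically is not necessarily the true one.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, used only for illustration (not real data).
corpus = (
    "the court ruled on monday . "
    "the court ruled on monday . "
    "the board met on friday ."
).split()

# Count which word follows each word in the corpus.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1


def most_likely_next(word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = following.get(word)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def complete(prompt, max_words=5):
    """Extend `prompt` one word at a time, always taking the likeliest next word."""
    words = prompt.split()
    for _ in range(max_words):
        nxt = most_likely_next(words[-1])
        if nxt is None or nxt == ".":
            break
        words.append(nxt)
    return " ".join(words)


print(complete("the board met on"))
```

In this toy corpus the board met on Friday, yet the sketch completes the sentence with "monday", because "on monday" is the more common pattern. The output reads fluently and is still wrong.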

That is what people usually mean when they say an AI chatbot is hallucinating. It is not daydreaming. It is not lying in the human sense. It is producing an answer without a reliable factual foundation, which is why names, dates, legal references, medical details, financial figures, and breaking news still need human verification.

The study compared several well-known AI models by looking at hallucination rates, customer satisfaction, response quality, and uptime. Those factors were combined into an index score from 0 to 100, giving a broader view of which chatbots are most dependable in everyday use.

Google Gemini came out with the highest hallucination rate in the group, reportedly producing inaccurate information in 32% of replies. That figure is especially interesting given reports that Apple is paying Google at least $1 billion a year to use a custom 1.2 trillion parameter Gemini model for a future Siri upgrade expected with iOS 27.

ChatGPT followed closely, with hallucinations appearing in about three out of every 10 responses. Put simply, if those figures hold, ChatGPT would be roughly twice as likely as DeepSeek, which scored 14%, to give a wrong answer in this test. That comparison is likely to get attention, not least because DeepSeek was developed at a fraction of the training cost associated with leading US models.

Perplexity AI performed best on hallucination rate, with false answers reaching users 13% of the time. DeepSeek was close behind at 14%, while Elon Musk's Grok came in at 15%. For users who lean on AI for research, summaries, or quick fact checks, those gaps matter.
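For readers who want to check the arithmetic behind those comparisons, the short sketch below uses only the hallucination rates quoted above. ChatGPT's "about three out of every 10" is treated as 30%, which is an approximation, and the helper names are made up for this example.

```python
# Hallucination rates as reported in the article (ChatGPT's ~3 in 10 taken as 0.30).
rates = {
    "Gemini": 0.32,
    "ChatGPT": 0.30,
    "Grok": 0.15,
    "DeepSeek": 0.14,
    "Perplexity AI": 0.13,
}


def relative_rate(a, b):
    """How many times more often chatbot `a` hallucinated than chatbot `b`."""
    return rates[a] / rates[b]


def expected_wrong(name, n_answers):
    """Expected number of wrong answers in `n_answers` replies, if the rate holds."""
    return rates[name] * n_answers


print(round(relative_rate("ChatGPT", "DeepSeek"), 2))      # 2.14
print(round(relative_rate("Gemini", "Perplexity AI"), 2))  # 2.46
print(round(expected_wrong("ChatGPT", 50), 1))             # 15.0
print(round(expected_wrong("Perplexity AI", 50), 1))       # 6.5
```

On these figures, someone asking 50 questions would expect roughly 15 wrong answers from ChatGPT and six or seven from Perplexity AI, assuming the study's rates carry over to their own use.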

Being online still counts

Accuracy is only part of the story. A chatbot can be brilliant on paper and useless if it is unavailable when someone needs it. On uptime, Perplexity AI and Grok were the only two services in the study that stayed available throughout the test period.

ChatGPT and Gemini were not far behind, with uptime rates of 99.98% and 99.95%, respectively. Even Claude, which had the lowest uptime in the study, remained highly reliable at 99.68%. In practical terms, most of these tools were online almost all the time, but the tiny differences can still matter for businesses that depend on AI workflows.
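Those percentages look almost identical, so it can help to convert them into hours. The sketch below assumes each uptime figure would hold over a full 365-day year, which the study does not claim, since the length of its test period is not stated here. The function name is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(uptime_percent):
    """Convert an uptime percentage into hours of downtime over one year."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


# Uptime figures as reported in the article.
for name, uptime in [("ChatGPT", 99.98), ("Gemini", 99.95), ("Claude", 99.68)]:
    print(f"{name}: about {downtime_hours_per_year(uptime):.1f} hours per year")
```

Under that assumption, 99.98% works out to under two hours of downtime a year, 99.95% to a little over four hours, and 99.68% to about 28 hours.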

User satisfaction told another story. DeepSeek and ChatGPT both received the highest customer satisfaction score at 4.7 out of 5. Perplexity AI followed with 4.6. Meta AI landed at the bottom with 3.4, while several other models clustered around 4.4.

For consistency and quality of responses, Kimi AI led the pack with a score of 4.3 out of 5. ChatGPT, Microsoft Copilot, and Gemini were tied at 4.0. Meta AI again ranked last at 3.4, suggesting that its weaker overall score was not caused by a single poor category.

When all factors were combined, Perplexity AI took the top spot with an index score of 85. Grok placed second with 79, followed by DeepSeek. ChatGPT finished sixth with a score of 50, while Gemini ranked eighth with 41. Meta AI sat at the bottom with 37.
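The article does not say how the four factors were weighted, so the published scores cannot be reproduced from the figures given. Purely as an illustration of how a 0 to 100 index of this kind can be assembled, the sketch below puts each factor on a 0 to 100 scale and takes an equal-weight average. The equal weights, the scaling, and the function name are assumptions, not the study's method.

```python
def composite_index(hallucination_pct, uptime_pct, satisfaction, quality,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Illustrative 0-100 index. Weights and scaling are assumed, not the study's."""
    components = (
        100.0 - hallucination_pct,   # share of replies that were not hallucinated
        uptime_pct,                  # already a percentage
        satisfaction / 5.0 * 100.0,  # 0-5 rating rescaled to 0-100
        quality / 5.0 * 100.0,       # 0-5 rating rescaled to 0-100
    )
    return sum(w * c for w, c in zip(weights, components))


# ChatGPT is the one chatbot for which the article gives all four figures.
chatgpt = composite_index(hallucination_pct=30, uptime_pct=99.98,
                          satisfaction=4.7, quality=4.0)
print(round(chatgpt, 1))
```

With equal weights, ChatGPT's figures come out at roughly 86, far from its published score of 50. That gap suggests the study weights or normalizes the factors differently, for example by scoring each chatbot relative to the best and worst in the group.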

The bigger lesson is not that one chatbot should be trusted blindly and another should be avoided forever. AI tools change quickly. Models are updated, guardrails shift, and performance can improve almost overnight. Still, this kind of ranking is a useful reminder: the most famous chatbot is not always the most reliable one, and the smoothest answer is not always the correct answer.

For anyone using AI at work, the safest approach is simple. Treat chatbots as accelerators, not final authorities. Let them draft, organize, summarize, and brainstorm. But when the answer involves money, health, law, identity, or a decision with real consequences, check the facts before you act.

“I cover emerging technologies, digital innovation, and the intersection of tech and everyday life. My goal is to make complex trends accessible and inspiring.”

Comments

labcore

is the 1.2T Gemini really that bad, or is the study skewed? if Apple paid $1B+ you'd expect better... makes me wanna see raw tests, sample size, methods, hmmm

atomwave

Whoa 32% hallucination?! That’s insane. I use AI to draft emails, but for legal/finance stuff?? nope, double check everything. kinda scary