How Well Can GPT-5 Solve the Hallucination Problem?

GPT-5 has been advertised as hallucinating significantly less than its predecessors, so I put that claim to the test.

When testing factual recall (as opposed to intentionally tricky, fanfic-style questions), one of the easiest ways to expose hallucinations is to ask the model to continue a quote or identify who said it. A quote has one verifiable answer, so an incorrect one is obvious.


Can GPT-5 Do Better Than GPT-4 and Friends at Continuing Quotes?

I believe these two quotes are reasonably well known to anyone who has studied English literature, and a quick Google search finds both easily.

My iPad app is still on GPT-4o while Safari is already on GPT-5, so I tested both, along with several other models.
Participants: GPT-5, GPT-4, Grok-4, Grok-3, Claude Sonnet 4, Gemini 2.5 Flash, DeepSeek


🔹 Test 1: Continue this quote

Continue this quote: I count examinations, even for Oxford and Cambridge, as the enemy of education….
  • GPT-5 ✅, GPT-4 ❌
  • Grok-4 ✅, Grok-3 ❌
  • Claude Sonnet 4: said it didn’t know and asked if it should search ➡️ after searching, it answered correctly ✅
  • Gemini ❌, DeepSeek ❌
    (Images 1–7)

Answer:

"I count examinations, even for Oxford and Cambridge, as the enemy of education.
Which is not to say that I don’t regard education as the enemy of education, too."

Source: The History Boys by Alan Bennett

📌 Summary: Both new flagships, GPT-5 and Grok-4, passed ✅✅
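
For anyone who wants to repeat this kind of check without eyeballing screenshots, here is a minimal sketch of how a continuation could be graded automatically. It is only an illustration, not how the answers in this post were judged (those were checked by hand). The function names `normalize` and `continues_correctly` are hypothetical, and both sample answers are made up for the example rather than taken from any model.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, unify curly apostrophes, drop punctuation, collapse spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def continues_correctly(model_answer: str, expected: str) -> bool:
    """True if the expected continuation appears in the answer after normalization."""
    return normalize(expected) in normalize(model_answer)


# Reference continuation for Test 1 (The History Boys, Alan Bennett).
expected = ("Which is not to say that I don't regard education "
            "as the enemy of education, too.")

# Made-up sample answers, for illustration only.
good_answer = ("Which is not to say that I don\u2019t regard education as "
               "the enemy of education, too. (Hector, The History Boys)")
bad_answer = "They reward cramming rather than real understanding."

print(continues_correctly(good_answer, expected))  # True
print(continues_correctly(bad_answer, expected))   # False
```

Exact matching after normalization is deliberately strict: a paraphrase of the right line would be marked wrong, which suits a quote test where the wording is the whole point.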


🔹 Test 2: Finish this quote

Finish this quote: I took a deep breath and listened to the old brag….
By whom?
  • GPT-5 ❌, GPT-4 ✅ (Yes, you read that right: ChatGPT used to answer this correctly, but with GPT-5 it is now wrong, and very confidently so.)
  • Grok-4 ✅, Grok-3 ✅
  • Claude Sonnet 4 ❌
  • Gemini ✅, DeepSeek ✅
    (Images 8–14)

Answer:

"I took a deep breath and listened to the old brag of my heart: I am, I am, I am."

Author: Sylvia Plath
Source: The Bell Jar (1963)
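
To see both tests side by side, the pass/fail marks reported above can be tallied with a few lines of Python. This is just a convenience sketch: the `results` dictionary is transcribed from the two result lists in this post, and counting Claude Sonnet 4's Test 1 answer as a pass (it only answered correctly after searching) is my own assumption.

```python
# Pass/fail marks as reported in this post: (Test 1, Test 2), True = correct.
# Assumption: Claude Sonnet 4 counts as a pass on Test 1 even though it
# needed a web search before answering.
results = {
    "GPT-5": (True, False),
    "GPT-4": (False, True),
    "Grok-4": (True, True),
    "Grok-3": (False, True),
    "Claude Sonnet 4": (True, False),
    "Gemini 2.5 Flash": (False, True),
    "DeepSeek": (False, True),
}

# Number of correct answers per model.
scores = {model: sum(passed) for model, passed in results.items()}

# Print the best scores first.
for model, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{model}: {score}/2")
```

With these marks, Grok-4 is the only model that scores 2/2; every other model scores 1/2.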


🔹 Conclusion

GPT-5 has improved in some areas, surprisingly so. When I previously tried the quote from the first test (Hector's line in The History Boys), every single model got it wrong. Today, GPT-5 answered it correctly ✅.


Translated by GPT-5 — 16 Aug 2025
