By OpenAI's own testing, its newest reasoning models, o3 and o4-mini, hallucinate significantly more often than o1.
First reported by TechCrunch, OpenAI's system card detailed the PersonQA evaluation results, designed to test for hallucinations. From the results of this evaluation, o3's hallucination rate is 33 percent, and o4-mini's hallucination rate is 48 percent — almost half of the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.
The system card noted how o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, simply saying, "More research is needed to understand the cause of this result."
OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 because they use more computation to "spend more time thinking before they respond," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."
However, the system card for GPT-4.5, which was released in February, shows a 19 percent hallucination rate on the PersonQA evaluation. The same card also compares it to GPT-4o, which had a 30 percent hallucination rate.
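To put the PersonQA figures quoted above side by side, here is a minimal Python sketch. It uses only the rates reported in this story; the dictionary and the `ratio_to_baseline` helper are illustrative names, not anything from OpenAI's system cards, and the figures come from separate cards, so the comparison is rough.

```python
# PersonQA hallucination rates (percent) as quoted in this story.
# Assumption: rates from different system cards are directly comparable.
personqa_rates = {
    "o1": 16,
    "o3": 33,
    "o4-mini": 48,
    "GPT-4.5": 19,
    "GPT-4o": 30,
}


def ratio_to_baseline(rates, baseline="o1"):
    """Return each model's rate divided by the baseline model's rate."""
    base = rates[baseline]
    return {model: rate / base for model, rate in rates.items()}


ratios = ratio_to_baseline(personqa_rates)

# Print models from lowest to highest rate relative to o1.
for model, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{model}: {personqa_rates[model]}% ({ratio:.2f}x o1)")
```

On these numbers, o3 comes out at roughly 2.06 times o1's rate and o4-mini at exactly three times.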
In a statement to Mashable, an OpenAI spokesperson said, “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability.”
Evaluation benchmarks are tricky. They can be subjective, especially if developed in-house, and research has found flaws in their datasets and even in how they evaluate models.
Plus, some rely on different benchmarks and methods to test accuracy and hallucinations. HuggingFace's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents and found much lower hallucination rates across the board for major models on the market than OpenAI's evaluations. GPT-4o scored 1.5 percent, GPT-4.5 preview 1.2 percent, and o3-mini-high with reasoning scored 0.8 percent. It's worth noting o3 and o4-mini weren't included in the current leaderboard.
That's all to say, even industry-standard benchmarks make it difficult to assess hallucination rates.
Then there's the added complexity that models tend to be more accurate when tapping into web search to source their answers. But in order to use ChatGPT search, OpenAI shares data with third-party search providers, and Enterprise customers using OpenAI models internally might not be willing to expose their prompts to that.
Regardless, if OpenAI is saying its brand-new o3 and o4-mini models hallucinate more often than its non-reasoning models, that might be a problem for its users.
UPDATE: Apr. 21, 2025, 1:16 p.m. EDT This story has been updated with a statement from OpenAI.