Apple research finds reasoning models may fall short when faced with complex problems
A new study by Apple has found that reasoning AI models may not be as capable as their benchmark scores suggest, suffering a ‘complete accuracy collapse’ when faced with complex tasks.
In its latest paper, titled ‘The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity’, the iPhone maker argues that while the models may perform well and produce detailed explanations and responses, they still lack general reasoning.
The paper cites popular AI models like OpenAI’s o1 and o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, known for their thinking capabilities such as extended chain-of-thought (CoT) and self-reflection.
While these models appear strong on reasoning benchmarks, Apple says current evaluation methods are flawed: they focus too heavily on whether the final answer is correct, and they rely on benchmarks that may be contaminated by training data.
“Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers.
Current evaluations primarily focus on established mathematical and coding benchmarks, emphasising final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality,” reads the paper.
Instead, the study used controlled puzzle tests to check how these models think.
Apple’s team designed custom puzzle environments, such as Tower of Hanoi, River Crossing, and Blocks World, to analyse not only the final answers, but also the models’ internal reasoning traces.
It found that as problems get harder, the models make more mistakes, revealing a major weakness in their reasoning skills.
“This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs ‘think’. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities,” it added.
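To make the idea of a controlled puzzle environment concrete, here is a minimal Python sketch for Tower of Hanoi. It is an illustration only, not Apple's code: the function names `solve_hanoi` and `check_moves`, the peg numbering, and the move format are all assumptions made for this example. It shows how a checker can replay a model's proposed moves one by one, rather than looking only at the final answer.

```python
from typing import List, Tuple

# Hypothetical move format: (source peg, destination peg), pegs numbered 0..2.
Move = Tuple[int, int]


def solve_hanoi(n: int, src: int = 0, dst: int = 2, aux: int = 1) -> List[Move]:
    """Return the shortest move list for n disks (2**n - 1 moves)."""
    if n == 0:
        return []
    return (
        solve_hanoi(n - 1, src, aux, dst)    # clear the n-1 smaller disks
        + [(src, dst)]                       # move the largest disk
        + solve_hanoi(n - 1, aux, dst, src)  # restack the smaller disks
    )


def check_moves(n: int, moves: List[Move]) -> Tuple[bool, int]:
    """Replay moves; return (solved, count of legal moves before any error)."""
    pegs = [list(range(n, 0, -1)), [], []]  # largest disk at the bottom of peg 0
    for i, (src, dst) in enumerate(moves):
        # Reject unknown pegs and moves from an empty peg.
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or not pegs[src]:
            return False, i
        # Reject placing a larger disk on a smaller one.
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False, i
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1)), len(moves)
```

Because the shortest solution for n disks takes 2**n - 1 moves, the number of disks acts as a single dial for difficulty, and the second value returned by `check_moves` shows how far a proposed solution gets before its first illegal move.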
The researchers have identified three stages of performance: regular language models do better on simple tasks, reasoning models perform well on moderately complex ones, but both types fail when the problems get too hard.
The analysis also found that such models tend to “overthink” easy problems and break down completely on complex ones, highlighting a major flaw in how they handle reasoning.
“These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalisable reasoning. We believe our results can pave the way for future investigations into the reasoning capabilities of these systems,” it added.


