Tuesday Jun 17, 2025

Apple: The Illusion of Thinking – Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Summary of https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

The paper explores the capabilities and limitations of Large Reasoning Models (LRMs), which generate detailed thinking processes, compared to standard Large Language Models (LLMs). The authors use controllable puzzle environments such as Tower of Hanoi and River Crossing to systematically evaluate performance as complexity increases.

Findings indicate that LRMs outperform LLMs on medium-complexity tasks but both struggle and eventually fail at high complexities. Surprisingly, LRMs show a decrease in reasoning effort (measured by tokens) as problems become extremely difficult, and they exhibit limitations in executing precise algorithmic steps.

  • Current Large Reasoning Models (LRMs) face a complete accuracy collapse beyond certain complexity levels when evaluated using controllable puzzle environments. The study found three distinct performance regimes based on problem complexity: standard LLMs perform better at low complexity, LRMs show an advantage at medium complexity, and both types of models fail at high complexity.
  • LRMs exhibit a counter-intuitive scaling limit in their reasoning effort (measured by inference thinking tokens) relative to problem complexity. While reasoning effort initially increases with complexity, it declines as problems approach the complexity threshold where accuracy collapses, even when ample token budget is available.
  • Analysis of the intermediate reasoning traces ("thoughts") reveals complexity-dependent reasoning patterns. For simple problems, LRMs often find correct solutions early but continue exploring incorrect alternatives, a phenomenon termed "overthinking". At moderate complexity, correct solutions tend to emerge later in the thinking process, after exploring incorrect paths. Beyond a certain high complexity threshold, models fail to generate any correct solutions within their thought process.
  • The research questions the reliance on established mathematical and coding benchmarks for evaluating LRMs, noting issues like data contamination and lack of insight into reasoning traces. Controllable puzzle environments were adopted to allow for systematic variation of complexity while maintaining consistent logical structures and enabling detailed analysis of solutions and internal reasoning.
  • Surprising limitations were uncovered in LRMs' ability to perform exact computation and follow explicit algorithms. For instance, providing the solution algorithm for the Tower of Hanoi puzzle did not improve performance or prevent the accuracy collapse. Models also demonstrated inconsistent reasoning, succeeding on some puzzles with higher move counts (Tower of Hanoi with N=5 requires 31 moves) while failing much earlier on others with lower move counts (River Crossing with N=3 has an 11-move solution); see the sketch after this list for why Hanoi's move count grows so quickly.
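
For reference, below is a minimal sketch (not code from the paper) of the standard recursive Tower of Hanoi algorithm in Python; the function name and interface are illustrative. The optimal solution for N disks takes 2^N − 1 moves, which is why N=5 already requires 31 moves, and why even with this explicit procedure a model still has to execute a long, error-prone move sequence.

    # Illustrative sketch, not taken from the paper: the standard recursive
    # Tower of Hanoi solution, whose optimal move count is 2**n - 1.
    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        """Return the optimal move sequence for n disks from src to dst."""
        if moves is None:
            moves = []
        if n == 0:
            return moves
        hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
        moves.append((src, dst))             # move the largest disk to its target
        hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top of it
        return moves

    if __name__ == "__main__":
        for n in (3, 5, 10):
            print(n, len(hanoi(n)))  # 7, 31, 1023 moves: growth follows 2**n - 1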
