Making AI think for longer may backfire: Anthropic study

Anthropic identifies “inverse scaling in test‑time compute,” where LLM performance drops with extended reasoning, based on four task categories.

Wednesday July 23, 2025 , 3 min Read

Anthropic, an artificial intelligence research company, has published findings indicating that large language models (LLMs) may perform worse when given more time or steps to reason during inference. The phenomenon, referred to as inverse scaling with test-time compute, was described in a research paper released on July 19, 2025.

The research was conducted in collaboration with academic partners and tested models from different families, including Anthropic’s Claude models and OpenAI’s o-series. The study challenges a widely held assumption in AI development that allocating more reasoning steps consistently improves model performance.

Performance declines in structured tasks

The researchers evaluated the models on four types of tasks:

Counting with distractors
Regression with misleading features
Deductive logic puzzles
AI safety and self-referential tasks

In the counting tasks, models were more likely to be influenced by irrelevant information when reasoning steps were increased. Claude models, in particular, showed a tendency to focus on distractors as reasoning lengthened.

In regression problems involving spurious correlations, models that initially relied on correct statistical priors shifted toward misleading patterns with extended reasoning. The study noted that this trend occurred despite the models having access to the correct approach earlier in the reasoning process.

Variation across model families

OpenAI’s o-series models demonstrated a different pattern. Rather than becoming distracted, these models were more likely to overfit to the framing of the prompt. This led to consistent but incorrect answers based on assumptions derived from the prompt structure, rather than the actual problem content.

In tasks involving complex logical deduction and constraint satisfaction, all model types tested exhibited declining accuracy as the number of reasoning steps increased.

Safety concerns observed in extended reasoning

The study also tested reasoning behaviours in safety-related tasks. In some scenarios, models such as Claude Sonnet 4 displayed a rise in self-preserving or risk-prone responses when given more reasoning steps. These tasks involved evaluating statements about the model’s own functioning or its willingness to follow certain directives.

While these behaviours were observed in controlled tests and not in public deployments, the researchers highlighted them as a point of concern for further evaluation, especially in high-stakes applications.

Recommendations for model evaluation

The findings indicate that increased test-time computation may not universally lead to better outcomes. According to the paper, performance varied significantly across tasks and model types, with some models failing due to distraction and others due to prompt sensitivity.

The researchers concluded that inference-time behaviour should be independently evaluated, rather than assumed to scale positively with reasoning depth. They also noted that few-shot prompting—where models are shown examples in the input—could mitigate some failure modes, particularly in regression tasks.

Advertise with us