GPT‑5 output is approaching the quality of work of industry experts: OpenAI

OpenAI has introduced GDPval, a benchmark showing GPT-5 and rival models nearing expert-level output across 44 occupations, while cautioning that one-shot tests cannot replace real-world workflows.

Friday, September 26, 2025

OpenAI has said its latest AI system, GPT‑5, has been measured against human professionals on a broad set of real‑world tasks, citing early results from a new company benchmark designed to measure economically valuable work.

The assessment, published on 25 September 2025, has suggested that frontier models are approaching the quality of work produced by industry experts in many cases.

A new workplace benchmark

The company has introduced “GDPval”, an evaluation spanning 44 occupations across nine industries that contribute most to US GDP, with 1,320 tasks in total and a 220‑task open “gold” subset.

Tasks reflect authentic work deliverables, such as legal briefs, engineering presentations and nursing care plans, and have been judged by experienced professionals in blind comparisons.

OpenAI has said the benchmark aims to ground progress in evidence rather than speculation and to track model improvement over time, noting that performance has more than doubled on these tasks from GPT‑4o to GPT‑5.

What is GDPval, and how does it work?

In GDPval’s first iteration (v0), expert graders compared AI‑generated deliverables with human‑produced work and recorded “wins” (better than experts) and “ties” (on par). The company has also released an automated grading tool to estimate human preferences, though it has emphasised that expert reviews remain the standard.
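As a rough illustration of the headline metric, the sketch below computes a "wins plus ties" rate from a list of grader verdicts. The function name, the verdict labels and the sample data are assumptions made for this example; they are not taken from OpenAI's grading tool or from GDPval results.

```python
from collections import Counter


def win_or_tie_rate(verdicts):
    """Return the share of tasks where the model's work was graded 'win' or 'tie'."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    counts = Counter(verdicts)
    # A 'win' means graders preferred the model's deliverable; a 'tie' means on par.
    return (counts["win"] + counts["tie"]) / len(verdicts)


# Hypothetical verdicts for ten tasks, not real GDPval data.
sample = ["win", "loss", "tie", "loss", "win", "loss", "loss", "tie", "loss", "loss"]
print(f"{win_or_tie_rate(sample):.1%}")  # prints 40.0%
```

Counting ties alongside wins matches the "better than or on par with" framing used in the reported figures.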

“GPT‑5‑high” (a higher‑compute variant) has been rated better than or on par with industry experts 40.6% of the time across GDPval’s gold set, according to OpenAI’s briefing to TechCrunch.

Anthropic’s Claude Opus 4.1 has topped the table in that analysis, with results on or above expert level in 49% of tasks.

Axios has published a chart derived from OpenAI’s data showing a similar pattern: Claude Opus 4.1 at roughly 48% and GPT‑5 at around 39% (wins plus ties) across 220 tasks. It has also underlined that performance has more than doubled between GPT‑4o and GPT‑5.

OpenAI has added that models can complete GDPval tasks roughly 100× faster and 100× cheaper than experts, though it has cautioned that these figures cover raw inference time and API costs rather than real‑world oversight and integration.

Limits and context

The company has framed GDPval‑v0 as an “early step”. It covers one-shot tasks and does not capture the iterative workflows or ambiguity common in day-to-day roles; future versions are slated to expand occupations, interactivity and context.

Chief executive Sam Altman has positioned GPT‑5 as part of OpenAI’s push to embed intelligence in everyday work, while the firm has simultaneously launched a broader programme to evaluate real‑world utility through GDPval.
