OpenAI’s benchmark to evaluate Indian cultural nuances in AI
OpenAI’s IndQA comprises 2,278 culturally grounded, reasoning-heavy questions across 12 Indian languages and 10 cultural domains, developed in partnership with 261 domain experts.
ChatGPT maker OpenAI has introduced a new benchmark specifically designed to assess how proficiently artificial intelligence (AI) models understand and reason about culturally significant questions within Indian languages.
This new evaluation tool, IndQA, was created because existing multilingual benchmarks are becoming saturated and often fail to capture the necessary context, culture, and nuanced reasoning required for non-English speakers, who make up about 80% of the global population.
On established multilingual evaluations, top models now cluster near the highest scores, which leaves little headroom to separate them and makes the results less useful for measuring meaningful progress.
These benchmarks often restrict themselves to translation or multiple-choice formats, failing to adequately assess genuine language competence.
India presents an obvious initial focus for such an effort: it is ChatGPT’s second-largest market and home to about a billion people who do not primarily use English. The country has 22 official languages, including at least seven with more than 50 million speakers each.
IndQA encompasses 2,278 questions spanning 12 languages and 10 broad cultural domains. These languages include Bengali, English, Hindi, Hinglish, Kannada, Marathi, Odia, Telugu, Gujarati, Malayalam, Punjabi, and Tamil. Hinglish was deliberately incorporated due to the common use of code-switching in conversations.
The cultural domains covered include architecture & design, arts & culture, everyday life, food & cuisine, history, law & ethics, literature & linguistics, media & entertainment, religion & spirituality, and sports & recreation.
A defining feature of IndQA is its focus on culturally nuanced and reasoning-heavy tasks, contrasting sharply with simpler existing evaluations like MMMLU (Multilingual Massive Multitask Language Understanding) and MGSM (Multilingual Grade School Math).
The rigour of the benchmark stems from the involvement of 261 domain experts from across India.
The benchmark uses a rubric-based grading approach and adversarial filtering against OpenAI’s most powerful models to ensure the questions are genuinely challenging.
Each draft question was tested against OpenAI’s strongest existing models at the time, specifically GPT-4o, OpenAI o3, GPT-4.5, and, partially, GPT-5. Only those questions where a majority of these models failed to produce acceptable answers were retained.
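The retention rule described above can be illustrated with a minimal Python sketch. It is not OpenAI's actual pipeline: the `DraftQuestion` type, the function names, and the idea of recording each model's result as a simple pass/fail flag are assumptions made for illustration, and "majority" is read here as a strict majority of the models tested.

```python
from dataclasses import dataclass


@dataclass
class DraftQuestion:
    """Hypothetical record for one draft question (not an OpenAI type)."""
    text: str
    # Maps a model name to whether its answer was judged acceptable.
    # How "acceptable" is decided is outside this sketch.
    acceptable: dict[str, bool]


def majority_failed(question: DraftQuestion) -> bool:
    """Return True if more than half of the tested models failed."""
    failures = sum(1 for ok in question.acceptable.values() if not ok)
    return failures > len(question.acceptable) / 2


def adversarial_filter(drafts: list[DraftQuestion]) -> list[DraftQuestion]:
    """Keep only the drafts that a majority of models could not answer."""
    return [q for q in drafts if majority_failed(q)]


# Made-up results for two placeholder questions.
drafts = [
    DraftQuestion("question A", {"m1": False, "m2": False, "m3": True, "m4": False}),
    DraftQuestion("question B", {"m1": True, "m2": True, "m3": False, "m4": True}),
]
kept = adversarial_filter(drafts)
print([q.text for q in kept])
```

In this toy run only "question A" survives, because three of the four placeholder models failed it. A side effect of filtering this way is that the retained set is hard for the filtering models by construction, so their scores on it should be read with that selection effect in mind.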
Because the questions are not identical across languages, IndQA is not meant to serve as a direct cross-language leaderboard. It is instead used to chart progress and measure improvement over time within a specific model family or configuration.
While performance on Indian languages has improved significantly over the past couple of years, the new benchmark clearly demonstrates that substantial room for further advancement remains.
Other benchmarks, including IndicMMLU-Pro, MILU (Multi-task Indic Language Understanding), IndicQA, BharatBench, SANSKRITI, and CulturalVQA, help identify limitations in current AI models, particularly for low-resource languages and region-specific contexts.
Edited by Jyoti Narayan


