Microsoft Critique AI, multi-model deep research explained
Microsoft Copilot Researcher adds Critique to check its own work. One part writes, another reviews, but is this reliable? Here's all you need to know!
AI can generate answers. But can it check its own work?
That is the problem Microsoft is trying to solve with Critique, a new multi-model research system introduced inside its Copilot Researcher tool. Alongside it, the company has also rolled out Council, a feature that lets users compare outputs from different AI models side by side.
Together, these features signal a shift in how AI-generated research is being built. Now, let's take a closer look at this latest tech!
One model writes, another judges
Most AI tools today rely on a single model to handle everything. Planning, retrieving information, writing, and presenting results all happen in one pipeline. Critique breaks that structure. Instead of one model doing everything, Microsoft splits the process into two roles.
A generator model handles planning, retrieval, and drafting. A separate reviewer model then evaluates the output using a defined rubric. This reviewer is not meant to rewrite the content entirely. Its role is closer to peer review.
It checks whether the argument is complete, whether sources are reliable, and whether claims are properly grounded in evidence. Microsoft says this separation creates a feedback loop where the final output becomes more structured, more accurate, and more transparent.
The system also draws on partner models, including those from OpenAI and Anthropic, rather than relying on a single in-house model.
When AI answers disagree, Council steps in
While Critique focuses on improving a single output, Council introduces comparison. When enabled, the system runs multiple models in parallel and generates independent reports. A separate judge model then produces a summary that highlights where the models agree, where they differ, and what unique insights each one brings.
This is designed to surface blind spots. Instead of trusting one answer, users can see multiple perspectives and understand how conclusions were reached. For enterprise users working in areas like legal or financial research, this could make decision-making more grounded.
Turning AI outputs into something closer to peer review
At the centre of Critique is its evaluation rubric. The reviewer model assesses outputs based on 3 key areas. It checks whether sources are credible and relevant, whether the response fully addresses the query, and whether claims are backed by clear evidence.
This approach attempts to solve a known issue with long-form AI outputs. They often sound convincing, but lack depth or reliable sourcing. By formalising evaluation, Microsoft is trying to make AI research more disciplined.
The numbers look good, but they come with a catch
Microsoft tested Critique using a benchmark called DRACO, which evaluates systems across 100 complex research tasks. According to the company, the Critique-based system showed a seven-point improvement over its own single-model setup.
It also reported a 13.88% advantage over other systems referenced in the benchmark, particularly in depth and breadth of analysis. These results are promising, but they come with a caveat. The evaluation is company-reported, and independent validation will be important before drawing broader conclusions.
Bluesky’s Attie lets you “Vibe-Code” your feed and take back control of social media
More reliable, but not foolproof
Despite the improvements, some questions remain unresolved. Multi-model systems can increase reliability, but they also add complexity. It is not always clear how disagreements between models should be interpreted, or how much users should trust the judge model summarising them.
There is also the issue of cost and latency. Running multiple models in parallel can be resource-intensive, which may limit adoption outside enterprise settings. Most importantly, the system still depends on underlying models that can produce errors. Reviewing an output does not eliminate the risk; it only reduces it.
The bottom line
Microsoft’s Critique and Council point to a larger shift in AI. The focus is moving from generating answers to validating them. By separating generation from evaluation and introducing multi-model comparison, the company is trying to make AI research more reliable and transparent. Whether this approach becomes the standard will depend on how well it performs in real-world use. For now, it marks an important step. AI is no longer just learning to respond. It is learning to question itself.


