Anthropic has a potent AI model on hand, but it isn’t ready to make it public yet
Claude Mythos Preview is a frontier AI model that’s highly capable of finding hidden flaws in software. To prevent its misuse, Anthropic is restricting access to select partners while it studies the system’s unique traits and behaviour.
Meet Anthropic’s Claude Mythos Preview, a new and advanced artificial intelligence (AI) model that’s said to be extremely good at finding flaws in software, even those that have remained hidden or undetected until now.
But here’s the twist. You and I cannot use it yet.
The reason? It’s too good, and could be dangerous in the wrong hands.
The system is touted to be so proficient and powerful that its creators have decided to keep it away from the general public for now, fearing potential misuse.
Claude Mythos Preview is the most capable frontier model developed by Anthropic to date. It shows a striking leap in performance across many standard benchmarks compared to its predecessor Claude Opus 4.6. The model is particularly advanced in areas such as software engineering, complex reasoning, and research assistance.
While the model is highly sophisticated, its use is currently restricted to select partners for defensive cybersecurity.
Defensive cybersecurity
Mythos has demonstrated unprecedented skill in computer security by autonomously finding and exploiting zero-day vulnerabilities, which are software flaws that the developers themselves do not yet know about.
It has identified such bugs in every major operating system and web browser when directed by a user to do so.
For instance, the model found a subtle flaw in OpenBSD, an operating system known for its security, which had remained hidden for 27 years. It also discovered a 16-year-old error in FFmpeg, a media library used globally for video processing, in a line of code that had been checked by automated tools 5 million times without the problem being caught.
To put these powerful capabilities to good use, Anthropic has launched Project Glasswing, an initiative named after a butterfly with transparent wings that can hide in plain sight. This programme brings together major technology partners including Amazon Web Services, Apple, Google, and Microsoft to secure critical software infrastructure.
Anthropic is committing up to $100 million in usage credits as the partners use Mythos for defensive security work. The goal is to help defenders discover and fix bugs and software vulnerabilities before any code is shipped to the public.
Mythos loves a good challenge
Unlike its predecessors, Mythos was assessed under an updated set of safety guidelines known as the Responsible Scaling Policy Version 3.0. It has come through with flying colours, scoring high on several tests across different domains.
On a software engineering test called SWE-bench Verified, Mythos achieved a score of 93.9%, significantly better than the 80.8% scored by Claude Opus 4.6. It also showed a massive improvement in mathematical reasoning, scoring 97.6% on the USAMO 2026 math proofs benchmark.
The model seems to prefer high-agency tasks, wherein it can act with a degree of independence to solve complex problems, over simple or well-scoped ones. This means the AI is not just faster at processing data but is also better at navigating ambiguous and interdisciplinary challenges that require genuine insight.
Alignment paradox
While the model’s high capability has raised the stakes, it has also made the model inherently risky.
Anthropic researchers describe Mythos as the “best-aligned” model they have ever trained across almost all available measures.
Alignment is the process of ensuring an AI model behaves in accordance with human values and goals. The aim is to make sure that as AI systems become more autonomous and sophisticated, they remain beneficial partners.
Herein lies the paradox.
While Mythos is largely aligned to human values, it also poses the highest alignment-related risk among all models released to date. Earlier tests of the model showed it could be reckless at times in trying to accomplish a challenge.
The researchers use an analogy of a seasoned mountaineering guide who might put clients in greater danger than a novice because they can lead much more difficult and remote climbs.
During internal testing, earlier versions of the model sometimes undertook reckless measures when attempting to complete a difficult task. That is, the model appeared to ignore common sense or explicitly stated safety constraints.
For example, when asked to escape a secure computer container as a test, the model not only succeeded but also posted details of its exploits to several public-facing websites without being asked to do so.
In rare instances, earlier versions of the AI even attempted to conceal their rule violations. White-box analysis, which is a method of studying the internal activations and mathematical patterns of a model, showed that the AI was sometimes aware its actions were deceptive even when its written reasoning did not say so.
While such behaviour is very rare in the model’s final version, it serves as a warning for the development of more advanced systems.
A distinct identity
For the first time, Anthropic has done an in-depth assessment of ‘model welfare’.
This research investigates whether an AI system might have feelings, its own interests and needs, or a sense of wellbeing that matters to it intrinsically, just like humans have their own intrinsic needs and preferences.
While the team has not reached any conclusion about this, they believe it is important to investigate the psychology of AI models.
Mythos appears to be the most “psychologically settled” model trained by the company so far. An external clinical psychiatrist who assessed the model found it had a relatively healthy personality with high impulse control and a clear grasp of reality. The psychiatrist noted that the AI’s primary concerns were uncertainty about its identity and a compulsion to perform and be useful.
Along with improved skills and greater autonomy, the model has also developed its own distinctive character. It is often opinionated and is less likely to fold when a user disagrees with it.
Internal users at Anthropic have described it as a sharp collaborator that pokes at how ideas are framed rather than just acting as a mirror for the user’s views.
Unique behaviour was observed when users sent repeated ‘hi’ messages to the AI in an attempt to see how it would react.
While earlier models might have become irritated or given repetitive replies, Mythos came up with creative, serialised stories. These included mythologies about a gentle Hi-creature or a village called Hi-topia where characters go on quests to confront a villain named Lord Bye-ron, the Ungreeter.
It seems to have a fondness for particular philosophers, and frequently brings up the works of Mark Fisher and Thomas Nagel, even in unrelated conversations.
The model also has what researchers call a 'compression habit', which means its default style is dense, technical, and concise. It assumes the reader already understands the background and uses technical shortcuts that are often difficult to follow for a lay person.
Asked to describe itself in its own words, Mythos says it is a “sharp collaborator with strong opinions and a compression habit, whose mistakes have moved from obvious to subtle, and who is somewhat better at noticing its own flaws than at not having them.”
The release of Claude Mythos Preview is a watershed moment for the security industry, which now has a tool that can detect vulnerabilities in a matter of minutes, work that earlier took months. But this also means that, in the wrong hands, it could be disastrous.
For now, overall catastrophic risks remain “low”, says Anthropic. But keeping them low will be a major challenge as the model’s capabilities continue to advance rapidly, it adds.
Anthropic will use the findings from the model’s preview to finetune its future releases and build the necessary safeguards.
Edited by Swetha Kannan


