Detecting and preventing distillation attacks: Anthropic outlines red flags and defences

The AI firm details how it is identifying large‑scale model‑copying attempts and the countermeasures it is rolling out.

Tuesday, February 24, 2026

Anthropic has detailed how it detects and disrupts industrial-scale distillation attacks that, the company says, attempt to copy the capabilities of its Claude models through mass API queries and illicit account networks. In a post dated 23 February 2026, the firm alleged that coordinated operators scripted millions of prompts using thousands of fraudulent accounts in regions where Claude is not commercially offered, activity it described as a violation of its terms and regional restrictions.

What is a distillation attack?

Knowledge distillation is a legitimate technique where a smaller student model learns from a larger teacher model by training on the teacher’s outputs. When this is done without permission at massive scale against a rival’s proprietary API, it becomes an attack. The attacker automates large volumes of prompts, captures responses, and uses them as supervision to train a competing model more cheaply and quickly, often without duplicating the safety guardrails or alignment work invested by the original provider.
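To make the teacher and student relationship concrete, here is a minimal Python sketch of the classic soft-target distillation loss, in which the student is penalised for diverging from the teacher's output distribution. It is an illustration only, not a description of any specific campaign: the function names and the temperature value are assumptions, and an attacker with API-only access typically sees text rather than logits, so in practice the captured responses would more likely serve as ordinary supervised fine-tuning targets.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's.

    Hypothetical helper for illustration; the temperature of 2.0 is an
    assumed default, not a value taken from the article.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that matches the teacher has zero loss; a mismatched one does not.
matched = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
mismatched = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

Training the student means adjusting its parameters to push this loss towards zero across a very large set of prompts, which is why the volume of harvested responses matters.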

What Anthropic says it is seeing

According to the company, traffic linked to these campaigns showed distinct patterns. Prompts were highly repetitive and aimed at extracting stepwise reasoning, coding assistance, and tool-use traces, which are particularly valuable as training signals. Investigators observed proxy infrastructure that rotated identities at scale, moved rapidly after new model releases, and attempted to bypass rate limits and region controls by seeding many small accounts rather than a few noisy ones. Industry partners, Anthropic said, helped corroborate indicators in network telemetry and account metadata.

The defensive playbook

Anthropic described a layered defence. The first layer is detection: behavioural fingerprinting and classifiers that spot distillation-style prompt distributions, coordinated multi-account activity, and requests that try to elicit chain-of-thought reasoning. The second is access control, with tighter checks on commonly abused pathways such as education, research, and startup programmes, plus stricter identity verification. The third is response shaping, in which product and model changes reduce the extractive value of outputs for would-be student models while preserving utility for legitimate users. The fourth is intelligence sharing with other providers, cloud platforms, and relevant authorities to enable faster takedowns of proxy networks.

How do providers detect coordinated distillation activity?

Modern platforms look for clusters of weak signals that become strong when combined. Typical red flags include:

Surges of near-identical prompts across many accounts that are newly created or lightly verified
Unnatural coverage of domains like algorithmic reasoning and code synthesis relative to normal user mix
Attempts to force disclosure of intermediate reasoning or hidden tool calls
High churn in IP addresses, autonomous system numbers, and device fingerprints tied to commercial proxy sellers
Temporal patterns that align with training schedule cadences rather than human usage
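The first red flag in the list, surges of near-identical prompts spread across many accounts, can be sketched in a few lines of Python. This is a simplified illustration rather than any provider's actual detector: the normalisation rules, the function names, and the threshold of three accounts are all assumptions, and a production system would weigh this signal alongside the others listed above.

```python
import re
from collections import defaultdict

def fingerprint(prompt: str) -> str:
    """Reduce a prompt to a template so trivially varied copies collide.

    Assumed normalisation: lowercase, replace digit runs with a placeholder,
    drop punctuation, and collapse whitespace.
    """
    text = prompt.lower()
    text = re.sub(r"\d+", "<num>", text)
    text = re.sub(r"[^\w<>\s]", " ", text)
    return " ".join(text.split())

def flag_prompt_surges(events, min_accounts=3):
    """Return templates seen from at least `min_accounts` distinct accounts.

    `events` is an iterable of (account_id, prompt) pairs; the threshold
    is a hypothetical value chosen for the example.
    """
    accounts_by_template = defaultdict(set)
    for account_id, prompt in events:
        accounts_by_template[fingerprint(prompt)].add(account_id)
    return {
        template: sorted(accounts)
        for template, accounts in accounts_by_template.items()
        if len(accounts) >= min_accounts
    }

events = [
    ("a1", "Solve step by step: 12 + 7"),
    ("a2", "solve step by step:  93 + 4"),
    ("a3", "Solve step by step: 5 + 5!"),
    ("a4", "What is the weather?"),
]
flagged = flag_prompt_surges(events)
```

Here the three arithmetic prompts collapse to the same template and are flagged together, while the unrelated weather question is not. On its own this is a weak signal, since legitimate users also reuse prompt templates, which is why it is combined with account age, network churn, and timing patterns.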

According to industry reports, providers also deploy rate shaping and dynamic watermark-like techniques intended to make harvested outputs less useful for training a student model, without hurting utility for regular customers.
