OpenAI launches gpt‑oss‑safeguard, open safety reasoning models
OpenAI has released gpt‑oss‑safeguard, open‑weight safety reasoning models that apply developer‑defined policies at inference time, returning decisions and auditable reasoning.
OpenAI has released gpt-oss-safeguard, a research‑preview pair of open‑weight models designed to classify and label AI inputs and outputs against developer‑defined safety policies.
Available in 120B and 20B parameter sizes under an Apache 2.0 license, the models have been made downloadable and aim to give builders finer, transparent control over moderation and risk workflows.
The launch extends OpenAI’s open‑weight line following August’s release of the gpt‑oss base models, with the company adding safety tooling to the programme.
The base models and their documentation set the foundation for this safety‑focused variant.ghts where feasible.
What is gpt‑oss‑safeguard?
OpenAI has described gpt‑oss‑safeguard as an open‑weight “reasoner” for safety classification.
Rather than relying only on pre‑trained categories, the model accepts a developer‑written policy at inference time and classifies content according to that policy, returning both a decision and its chain‑of‑thought reasoning for auditability.
The approach intends to let organisations adapt to emerging harms and niche domains without retraining a bespoke classifier each time.
How it differs from traditional safety classifiers
Conventional moderation tools are typically trained on large labelled datasets aligned to fixed policies; changing those policies has usually required new data and retraining.
OpenAI said gpt‑oss‑safeguard differs by directly “reading” the policy at runtime and reasoning over it, a method inspired by the company’s internal Safety Reasoner and its “deliberative alignment” work.
In practice, developers can iterate on policy text itself to refine outcomes, trading some latency and compute for flexibility and clearer justifications.
In internal tests reported by OpenAI, gpt‑oss‑safeguard outperformed the gpt‑oss base models—and on some multi‑policy accuracy measures even a larger proprietary reasoning model—while proving competitive on public moderation benchmarks.
However, OpenAI cautioned that specialised classifiers trained on tens of thousands of high‑quality examples can still be stronger for certain risks, and that the reasoning approach can be more compute‑intensive.
The company recommended hybrid deployments that use lightweight filters up front and invoke the reasoner where deeper review is warranted.


