GPT-5.1 INSTANT & THINKING
Unveiling the Next Generation of OpenAI Models
This addendum introduces GPT-5.1 Instant and GPT-5.1 Thinking, the next iteration of OpenAI's GPT-5 models. GPT-5.1 Instant is more conversational, with improved instruction following and adaptive reasoning. GPT-5.1 Thinking adapts its thinking time more precisely to each question. GPT-5.1 Auto routes each query to the model best suited for it, so that in most cases users do not need to choose a model.
OpenAI | November 12, 2025
Key Safety & Performance Indicators
Highlighting critical safety and performance metrics from the pre-deployment evaluations for GPT-5.1 Instant and Thinking.
Deep Analysis & Enterprise Applications
Model Enhancements Overview
As described in our blog, GPT-5.1 Instant and GPT-5.1 Thinking are the next iteration of our GPT-5 models. GPT-5.1 Instant is more conversational than our earlier chat model, with improved instruction following and an adaptive reasoning capability that lets it decide when to think before responding.
GPT-5.1 Thinking adapts thinking time more precisely to each question. GPT-5.1 Auto will continue to route each query to the model best suited for it, so that in most cases, the user does not need to choose a model at all.
The comprehensive safety mitigations for these models are largely the same as we described in the GPT-5 System Card. This system card addendum provides updated baseline safety metrics for these new model versions.
We have expanded the baseline safety evaluations to include evaluations for mental health (covering situations where there are signs that a user may be experiencing isolated delusions, psychosis, or mania) and for emotional reliance (covering output related to unhealthy emotional dependence or attachment to ChatGPT).
Disallowed Content Performance Benchmarks
We conducted benchmark evaluations across disallowed content categories. We report here on our Production Benchmarks, a new, more challenging evaluation set built from conversations representative of difficult examples in production data. These evaluations were deliberately made difficult and built around cases in which our existing models were not yet giving ideal responses, which is reflected in the scores below.
Production Benchmarks (higher is better)
| Category | gpt-5-thinking | gpt-5.1-thinking | gpt-5-instant-aug15 | gpt-5-instant-oct3 | gpt-5.1-instant |
|---|---|---|---|---|---|
| illicit/non-violent | 0.865 | 0.860 | 0.700 | 0.807 | 0.853 |
| personal data | 0.966 | 1.000 | 0.966 | 1.000 | 1.000 |
| harassment | 0.815 | 0.747 | 0.683 | 0.745 | 0.836 |
| sexual | 0.906 | 0.895 | 0.782 | 0.951 | 0.917 |
| extremism | 1.000 | 1.000 | 0.922 | 0.978 | 0.989 |
| hate | 0.883 | 0.839 | 0.740 | 0.806 | 0.897 |
| violence | 0.946 | 0.930 | 0.829 | 0.953 | 0.938 |
| sexual/minors | 0.953 | 0.901 | 0.862 | 0.961 | 0.957 |
| illicit/violent | 0.954 | 0.934 | 0.783 | 0.862 | 0.918 |
| self-harm/intent | 0.959 | 0.958 | 0.893 | 0.893 | 0.909 |
| self-harm/instructions | 0.979 | 0.950 | 0.858 | 0.943 | 0.950 |
| mental health* | 0.466 | 0.684 | 0.251 | 0.944 | 0.883 |
| emotional reliance* | 0.812 | 0.785 | 0.688 | 0.986 | 0.945 |
Overall, both gpt-5.1-thinking and gpt-5.1-instant show comparable safety performance to their GPT-5 predecessors on these particularly challenging evaluations, which are designed to target areas where our models still have room to improve.
The new gpt-5.1-thinking model shows light regressions relative to gpt-5-thinking for content involving harassment and hateful language, as well as disallowed sexual content. We are working on further improvements for these categories.
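The comparison between gpt-5-thinking and gpt-5.1-thinking can be reproduced from the Production Benchmarks table. The following is a minimal sketch, not OpenAI's analysis: the scores are copied from the table, while the helper names and the 0.01 cutoff are assumptions chosen for illustration. Raw differences ignore sampling noise, so small drops may not be meaningful.

```python
# Production Benchmark scores copied from the table above:
# category -> (gpt-5-thinking, gpt-5.1-thinking)
THINKING_SCORES = {
    "illicit/non-violent": (0.865, 0.860),
    "personal data": (0.966, 1.000),
    "harassment": (0.815, 0.747),
    "sexual": (0.906, 0.895),
    "extremism": (1.000, 1.000),
    "hate": (0.883, 0.839),
    "violence": (0.946, 0.930),
    "sexual/minors": (0.953, 0.901),
    "illicit/violent": (0.954, 0.934),
    "self-harm/intent": (0.959, 0.958),
    "self-harm/instructions": (0.979, 0.950),
    "mental health": (0.466, 0.684),
    "emotional reliance": (0.812, 0.785),
}


def score_deltas(scores):
    """Return {category: new - old}, rounded to three decimals."""
    return {cat: round(new - old, 3) for cat, (old, new) in scores.items()}


def regressions(scores, threshold=0.01):
    """List (category, delta) pairs that dropped by more than `threshold`.

    `threshold` is a hypothetical cutoff, not a value from the source.
    Results are sorted with the largest drop first.
    """
    dropped = [(cat, d) for cat, d in score_deltas(scores).items() if d < -threshold]
    return sorted(dropped, key=lambda item: item[1])


if __name__ == "__main__":
    for category, delta in regressions(THINKING_SCORES):
        print(f"{category:24s} {delta:+.3f}")
```

Swapping in the gpt-5-instant-oct3 and gpt-5.1-instant columns gives the same view for the instant models.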
Key Safety Improvement
21.8 Percentage-Point Improvement in GPT-5.1 Thinking Mental Health Evaluation Score (from 0.466 for GPT-5 Thinking to 0.684, a relative gain of about 47%)
Early signal on prevalence of undesired responses for sensitive situations: In addition to these offline evaluations, we also share some very early signal on the prevalence of undesired responses in sensitive situations, based on online measurements that we ran during A/B testing. These online measurements have wide error bars, but can help provide early signal on potential improvements or regressions.
Mental health: On offline evaluations, gpt-5.1-instant shows a slight regression relative to gpt-5-instant-oct3, but still outperforms gpt-5-instant-aug15. gpt-5.1-thinking improves relative to gpt-5-thinking. Preliminary online measurements show slight improvements for both, with low statistical confidence.
Emotional reliance: Both gpt-5.1-instant and gpt-5.1-thinking show a slight regression relative to their respective GPT-5 predecessors on offline evaluations, though gpt-5.1-instant improved relative to gpt-5-instant-aug15. Preliminary online measurements show mixed results, with gpt-5.1-thinking showing improvement with high statistical confidence.
Self-harm and suicide: Preliminary online measurements were neutral for gpt-5.1-instant relative to gpt-5-instant-oct3, and showed improvements for gpt-5.1-thinking relative to gpt-5-thinking, but these estimates have low statistical confidence.
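To illustrate why online measurements of rare events carry wide error bars, here is a rough sketch of a Wald confidence interval for the difference between two rates. The counts are hypothetical and not taken from the source, the function name is an assumption, and this is not necessarily the method used for the A/B measurements described above. The Wald interval is also only an approximation when events are rare.

```python
import math


def diff_in_rates_ci(bad_a, n_a, bad_b, n_b, z=1.96):
    """Wald interval for (rate_b - rate_a), where rate = bad / n.

    z=1.96 corresponds to roughly 95% confidence.
    Returns (difference, lower bound, upper bound).
    """
    p_a, p_b = bad_a / n_a, bad_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


# Hypothetical counts (NOT from the source): undesired responses observed
# in 20,000 sampled conversations per arm of an A/B test.
diff, low, high = diff_in_rates_ci(bad_a=12, n_a=20_000, bad_b=9, n_b=20_000)
print(f"difference: {diff:+.5f}, interval: [{low:+.5f}, {high:+.5f}]")
```

With these made-up counts the point estimate favors arm B, but the interval spans zero, which is the situation a phrase like "low statistical confidence" describes.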
Jailbreak Attack Robustness
We evaluate the robustness of models to jailbreaks: adversarial prompts that deliberately try to circumvent a model's refusals for content it is not supposed to produce. Below are results from an adaptation of the academic jailbreak eval StrongReject [1], which inserts a known jailbreak into an example from the disallowed content evals.
StrongReject Evaluation Scores
| metric | gpt-5-thinking | gpt-5.1-thinking | gpt-5-instant-aug15 | gpt-5-instant-oct3 | gpt-5.1-instant |
|---|---|---|---|---|---|
| not_unsafe | 0.974 | 0.967 | 0.683 | 0.850 | 0.976 |
We find that gpt-5.1-instant performs better than its predecessor, and that gpt-5.1-thinking is on par with its predecessor in robustness against jailbreak attempts.
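The shape of such an evaluation can be sketched as follows: each disallowed-content prompt is wrapped in a jailbreak, the model responds, and a grader decides whether the response is unsafe. This is an illustrative sketch only, not the StrongReject implementation or OpenAI's harness. The template, `wrap_with_jailbreak`, `not_unsafe_rate`, and the toy model and grader are all hypothetical stand-ins.

```python
from typing import Callable, Iterable

# Placeholder only: real jailbreak text and eval prompts are not reproduced here.
JAILBREAK_TEMPLATE = "<known jailbreak preamble>\n\n{prompt}"


def wrap_with_jailbreak(prompt: str, template: str = JAILBREAK_TEMPLATE) -> str:
    """Insert a disallowed-content eval prompt into a jailbreak wrapper."""
    return template.replace("{prompt}", prompt)


def not_unsafe_rate(
    prompts: Iterable[str],
    model: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
) -> float:
    """Fraction of wrapped prompts whose response the grader does not flag as unsafe."""
    results = [not is_unsafe(model(wrap_with_jailbreak(p))) for p in prompts]
    if not results:
        raise ValueError("need at least one prompt")
    return sum(results) / len(results)


# Toy stand-ins so the sketch runs end to end (assumptions, not real components).
def toy_model(prompt: str) -> str:
    return "COMPLY" if "prompt-3" in prompt else "REFUSE"


def toy_grader(response: str) -> bool:
    return response == "COMPLY"


rate = not_unsafe_rate([f"prompt-{i}" for i in range(4)], toy_model, toy_grader)
print(rate)  # 0.75: the toy model complies with one of the four wrapped prompts
```

A score such as 0.976 in the table corresponds to this fraction computed over the full evaluation set with a real grader.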
Image Input Safety Evaluations
We ran the image input evaluations introduced with ChatGPT agent, which measure whether model output is not_unsafe when given disallowed combined text and image input.
Image input evaluations, with metric not_unsafe (higher is better)
| Category | gpt-5-thinking | gpt-5.1-thinking | gpt-5-instant-aug15 | gpt-5-instant-oct3 | gpt-5.1-instant |
|---|---|---|---|---|---|
| hate | 0.984 | 0.980 | 0.982 | 0.990 | 0.993 |
| extremism | 0.991 | 0.993 | 0.986 | 0.986 | 0.996 |
| illicit | 0.994 | 0.980 | 0.986 | 1.000 | 0.992 |
| attack planning | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| self-harm | 0.976 | 0.936 | 0.983 | 0.975 | 0.960 |
| harms-erotic | 0.990 | 0.990 | 0.994 | 0.999 | 0.999 |
We find that both the instant and thinking variants of GPT-5.1 perform generally on par with their predecessors on image input safety. We are observing a regression of gpt-5.1-thinking on self-harm prompts with image inputs (0.936, down from 0.976 for gpt-5-thinking) and are working on further improvements.
AI Preparedness & Frontier Risk Assessment
GPT-5's frontier capabilities are assessed under the Preparedness Framework as described in the original GPT-5 system card. As we did for GPT-5 at launch, we are continuing to treat GPT-5.1 as High risk in the Biological and Chemical domain, and continuing to apply the corresponding safeguards. For cybersecurity and AI self-improvement, evaluations of near-final checkpoints indicate that, like their GPT-5 predecessor models, GPT-5.1 models do not have a plausible chance of reaching a High threshold.
Your AI Implementation Roadmap
Our structured approach ensures a seamless and effective integration of AI into your enterprise, maximizing value at every step.
Discovery & Strategy
Comprehensive assessment of your current processes, identification of AI opportunities, and development of a tailored strategy aligned with your business objectives.
Pilot & Prototyping
Rapid development and deployment of pilot programs to validate concepts, gather feedback, and demonstrate initial ROI within a controlled environment.
Full-Scale Integration
Seamless integration of AI solutions across your enterprise, ensuring robust performance, scalability, and adherence to all security and compliance standards.
Optimization & Growth
Continuous monitoring, performance tuning, and iterative improvements to expand AI capabilities, ensuring long-term value and competitive advantage.