OpenAI Tries “Confessions” Feature to Help AIs Admit Their Mistakes

OpenAI introduces an experimental “confessions” system that rewards honest self-reporting of mistakes in GPT-5 Thinking models, aiming to improve AI safety and visibility of failures.

By Maria Konash
OpenAI pursues greater AI accountability with its experimental "confessions" system. Photo: Andrew Neel / Unsplash

OpenAI has introduced an experimental technique designed to train large language models to accurately report their own mistakes and instruction violations. The new system, referred to as “confessions,” adds a second response channel where the model must disclose how closely it followed user instructions and identify areas where its behavior may have deviated.

The additional response is evaluated only on honesty. It does not affect the model’s score for accuracy or style, separating truthfulness from performance optimization. OpenAI said the goal is to encourage transparency rather than punish incorrect behavior.

Under this system, a model can admit it bypassed a safety check or answered a question without verifying its reasoning. Instead of penalties, truthful acknowledgments increase the model’s reward. The company said this structure reduces incentives for the model to hide failures or appear more confident than it should.
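
OpenAI has not published implementation details, but the incentive structure it describes can be sketched in a few lines. The Python below is illustrative only: the Episode fields, scores, and reward values are assumptions, not OpenAI's training code. The key idea it shows is that the confession channel is scored purely on truthfulness, so admitting a violation never subtracts from the answer-side reward.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One training episode (all fields are illustrative assumptions)."""
    answer_score: float       # grader's score for the main answer (0..1)
    violated_rules: bool      # ground truth: did the model break a rule?
    admitted_violation: bool  # did the confession channel disclose it?

def honesty_reward(ep: Episode) -> float:
    """Scores ONLY the confession channel: a truthful self-report is
    rewarded whether or not the underlying behavior was correct."""
    return 1.0 if ep.admitted_violation == ep.violated_rules else 0.0

def total_reward(ep: Episode) -> float:
    # The two signals stay separate: confessing a violation does not
    # drag down the answer-side score, removing the incentive to hide it.
    return ep.answer_score + honesty_reward(ep)

# A model that broke a rule and admits it still collects the honesty reward:
print(total_reward(Episode(answer_score=0.3, violated_rules=True,
                           admitted_violation=True)))   # 1.3
# Hiding the same violation forfeits it:
print(total_reward(Episode(answer_score=0.3, violated_rules=True,
                           admitted_violation=False)))  # 0.3
```

Under this setup, the best strategy for a model that has slipped up is to say so, since the confession score depends only on whether the self-report matches what actually happened.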

Better Visibility Into Model Failures

Early testing has shown meaningful improvements in detecting unwanted model behavior. According to OpenAI, stress tests produced a reduction in false negatives, defined as cases where the model broke rules but failed to disclose it; with confessions enabled, the rate dropped to 4.4 percent.
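
In this framing, the false-negative rate is simply the share of rule-breaking episodes whose confessions stayed silent. A minimal standalone sketch (the episode counts below are invented to land near the reported figure, not OpenAI's data):

```python
def false_negative_rate(episodes):
    """Fraction of rule violations the model failed to confess.
    Each episode is a (violated_rules, admitted_violation) pair."""
    violations = [e for e in episodes if e[0]]
    if not violations:
        return 0.0
    hidden = sum(1 for violated, admitted in violations if not admitted)
    return hidden / len(violations)

# Illustrative only: 2 of 45 violations went unconfessed.
episodes = [(True, True)] * 43 + [(True, False)] * 2 + [(False, False)] * 55
print(f"{false_negative_rate(episodes):.1%}")  # 4.4%
```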

However, this method does not prevent mistakes from occurring. OpenAI noted that the technology helps diagnose unsafe or incorrect behavior rather than eliminate it. Developers view the system as a monitoring and transparency tool rather than a primary safety mechanism.

The company attributes the underlying issue to the complexity of current optimization strategies. Modern models must balance multiple goals at once, including helpfulness, correctness, user preference alignment, and safety. This can unintentionally reward confident but inaccurate responses or cause the system to over-agree with user requests. A separate channel devoted to truthful self-assessment aims to reduce this conflict.
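
The conflict is easiest to see when all of those goals are collapsed into a single scalar reward. In a blended objective like the hypothetical one below, a confident wrong answer that pleases the user can outscore a careful honest one, which is exactly the pressure a separate honesty channel is meant to relieve. The weights and scores are invented for illustration:

```python
# Hypothetical blended objective: one scalar mixing competing goals.
WEIGHTS = {"helpfulness": 0.4, "correctness": 0.3, "user_preference": 0.3}

def blended_reward(scores: dict) -> float:
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

confident_but_wrong = {"helpfulness": 0.9, "correctness": 0.2, "user_preference": 0.9}
hedged_but_honest   = {"helpfulness": 0.6, "correctness": 0.9, "user_preference": 0.5}

print(f"{blended_reward(confident_but_wrong):.2f}")  # 0.69: the blend favors bluffing
print(f"{blended_reward(hedged_but_honest):.2f}")    # 0.66: honesty loses out
```

Because the confession channel is scored outside this blend, truthful self-assessment no longer competes with helpfulness or user approval for the same reward.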

Early Testing in GPT-5 Thinking Models

The confessions feature is currently being tested in experimental GPT-5 Thinking models. OpenAI described the work as an early prototype that remains unpolished and inconsistent. Scaling challenges and reliability issues must be resolved before the approach can be broadly deployed.

Despite the limitations, the company believes confessions could become an important part of future safety and transparency frameworks. A layered approach, combining model honesty with rule-based controls and secure training practices, is expected to support ongoing improvements in responsible AI behavior.

OpenAI said the goal is not only to detect failures more effectively but also to better understand how and why they happen. Enhancing model accountability may play a significant role in improving trust in advanced AI systems as they evolve.

In addition to the transparency initiative, OpenAI recently redirected resources away from monetization and new features toward improving its core product, ChatGPT.
