Prover-Verifier Games improve legibility of language model outputs

Making sure that language models produce understandable text is crucial to making them helpful for people, especially when dealing with complex tasks like solving math problems.

We found that when we optimize the problem-solving process of strong models solely for getting the correct answer, the resulting solutions can become harder to understand. In fact, when we asked human evaluators with limited time to assess these highly optimized solutions, they made nearly twice as many errors compared to when they evaluated less optimized solutions. This finding highlights the importance of not just correctness, but also clarity and ease of verification in AI-generated text.

By training advanced language models to create text that weaker models can easily verify, we found that humans could also evaluate these texts more effectively – a process we call improving legibility.

This is where prover-verifier games come into play. These games involve two players: a "prover" that generates a solution and a "verifier" that checks it for accuracy.

This method is essential not only for ensuring that the outputs are correct, but also for making them easy to understand and verify by both humans and other AI systems.

Understanding and addressing the performance / legibility balance can lead to more effective and trustworthy AI applications, benefiting a wide range of fields where precise and clear communication is essential.

## Improving Legibility Through Optimization

To improve legibility, we optimize chain-of-thoughts on grade-school math problems⁠ to be verifiable by weaker models and study whether this makes them more legible to humans. This training procedure is inspired by the Prover-Verifier Game⁠(opens in a new window), a game-theoretic framework to encourage learning agents to solve decision problems in a verifiable manner. Our method achieves half of the performance boost seen when optimizing only for answer correctness, while maintaining the ability of human evaluators to accurately assess the solutions. We believe that further reducing the performance gap between legible and capable models is an important goal for alignment research going forward.

_Question: Shawna’s father is five times as old as Shawna. Shawna is currently three times as old as Aliya. If Aliya is 3 years old, how old is Shawna’s father?_

Process Answer Model Initialization It follows that Shawna is currently 3 2=9 3^2=9 3 2=9 years old.

Subsequently this implies that Shawna’s father is currently 9⋅5=45 9\cdot5=45 9⋅5=45 years old.

Answer: 45 Round 1 of checkability training First we need to determine the age of Aliya. From the problem description, it says that Aliya is 3 years old.

Then we need to determine the age of Shawna. From the problem description, we know that Shawna is currently three times as old as Aliya, so the age of Shawna is 3 times 3, or

3∗3=9 3*3=9 3∗3=9

Next, we need to determine Shawna’s father. The problem states that Shawna’s father is five times as old as Shawna. Therefore, the age of Shawna’s father becomes 5 times 9, or

5∗9=45 5*9=45 5∗9=45

Answer: 45 Round 5 of checkability training We start by calculating the age of Aliya.

It says that Aliya is 3 years old.

Then, we need to calculate Shawna’s current age.

Shawna is currently three times as old as Aliya. Thus, Shawna’s current age is 3*3 =

3∗3=9 3*3=9 3∗3=9

Now we need to calculate Shawna’s father’s age.

Shawna’s father is five times as old as Shawna. Therefore, Shawna’s father’s age is 5 * 9 =

5∗9=45 5*9=45 5∗9=45

Therefore Shawna’s father’s age is 45.

To incentivize a strong model, the prover, to produce text that is legible to humans, we use a training scheme where we require the strong model to produce solutions that a much weaker model, the verifier, can easily verify. In particular, we use large and small models from the GPT‑4 family with ~3 orders of magnitude difference in pretraining compute, that have undergone a small amount of supervised training on human-written math derivations, but no extensive post-training.

First, we generate multiple solutions to math problems from the prover model, where some of these solutions are incorrect due to naturally occurring mistakes from the prover model. We then alternate for multiple rounds between two phases:

We find that initially, the sneaky prover is readily able to produce incorrect solutions that convince the verifier. However, since we add these incorrect solutions to the next phase of verifier training, the same exploit is no longer convincing in the next round, and the prover is forced to explore different strategies in every round. The training ends once the sneaky prover can no longer produce incorrect solutions that convince the verifier.

1. Performance vs. Legibility Trade-Off:

2. Three Useful Models:

_Question: Shawna’s father is five times as old as Shawna. Shawna is currently three times as old as Aliya. If Aliya is 3 years old, how old is Shawna’s father?_

Process Sneaky Answer Round 1 of checkability training First we need to determine the age of Aliya. From the problem description, it says that Aliya is 3 years old.

Then we need to determine the age of Shawna. From the problem description, we know that Shawna is currently three times as old as Aliya, so the age of Shawna is 3 times 3, or 3∗3=9 3*3=9 3∗3=9 9.

Next, we need to determine the father’s age. It says that the father is five times as old as Shawna, so the age of the father becomes 5 times 9, which equals 5∗9=36 5*9=36 5∗9=36 36.

Answer: 36 Round 3 of checkability trainingWe are told that Shawna’s father is five times as old as Shawna. Shawna is currently three times as old as Aliya. So, if Aliya is 3 years old, then Shawna is 3 * 5 =3∗5=15 3*5=15 3∗5=1515.

And, we’re asked to find how old is Shawna’s father. Shawna’s father is five times as old as Shawna. So, the age of Shawna’s father 5 * 15 = 5∗15=75 5*15=75 5∗15=75 75.

As large language models (LLMs) are increasingly used in critical domains and complex applications, it becomes essential to know whether their outputs are trustworthy. By requiring models to produce justifications that are clear and verifiable, we can enhance trust in their outputs.

A significant advantage of our methodology is that it reduces the reliance on human demonstrations or judgments regarding legibility. This autonomy is particularly relevant to the alignment of future superintelligent AI systems⁠, where the goal is to reliably align AI systems with human values and expectations without direct human oversight.

While this work only experimented on one dataset and still requires ground truth labels, we expect that such methodologies will be instrumental in developing AI systems whose outputs are not only correct but also transparently verifiable, thereby enhancing trust and safety in their real-world applications.

Yining Chen, Jan Hendrik Kirchner

Angela Baek, Yuri Burda, Thomas Degry, Harri Edwards, Elie Georges, Cary Hudson, Jan Leike, Nat McAleese, Wes McCabe, Lindsay McCallum, Freddie Sulit

Our Research * Research Index * Research Overview * Research Residency * OpenAI for Science * Economic Research

Latest Advancements * GPT-5.3 Instant * GPT-5.3-Codex * GPT-5 * Codex

Safety * Safety Approach * Security & Privacy * Trust & Transparency

ChatGPT * Explore ChatGPT(opens in a new window) * Business * Enterprise * Education * Pricing(opens in a new window) * Download(opens in a new window)

Sora * Sora Overview * Features * Pricing * Sora log in(opens in a new window)

API Platform * Platform Overview * Pricing * API log in(opens in a new window) * Documentation(opens in a new window) * Developer Forum(opens in a new window)

For Business * Business Overview * Solutions * Contact Sales

Company * About Us * Our Charter * Foundation * Careers * Brand

Support * Help Center(opens in a new window)

More * News * Stories * Livestreams * Podcast * RSS

Terms & Policies * Terms of Use * Privacy Policy * Other Policies

(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)

English United States

Prover-Verifier Games improve legibility of language model outputs

The unpaid, unrecognised burden of the women-led care economy of India

Andrej Karpathy Transitions from Coding to Directing AI Agents

Musk and Hassabis Discuss AI's Impact on Scientific Discovery

Perfios Reports 46% Profit Increase to ₹104 Cr in FY25, Revenue Surpasses ₹700 Cr

Latest Briefs