We built a neural theorem prover forLean(opens in a new window)that learned to solve a variety of challenging high-school olympiad problems, including problems from theAMC12(opens in a new window)andAIME(opens in a new window)competitions, as well as two problems adapted from theIMO(opens in a new window).
We built a neural theorem prover forLean(opens in a new window)that learned to solve a variety of challenging high-school olympiad problems, including problems from theAMC12(opens in a new window)andAIME(opens in a new window)competitions, as well as two problems adapted from theIMO(opens in a new window).A The prover uses a language model to find proofs of formal statements. Each time we find a new proof, we use it as new training data, which improves the neural network and enables it to iteratively find solutions to harder and harder statements.
We achieved a new state-of-the-art (41.2% vs 29.3%) on theminiF2F(opens in a new window)benchmark, a challenging collection of high-school olympiad problems. Our approach, which we call _statement curriculum learning_, consists of manually collecting a set of statements of varying difficulty levels (without proof) where the hardest statements are similar to the benchmark we target. Initially our neural prover is weak and can only prove a few of them. We iteratively search for new proofs and re-train our neural network on the newly discovered proofs, and after 8 iterations, our prover ends up being vastly superior when tested on miniF2F.
Formal mathematics is an exciting domain to study because of (i) its richness, letting you prove arbitrary theorems which require reasoning, creativity and insight and (ii) its similarity to games—where AI has been spectacularly successful—in that it has an automated way of determining whether a proof is successful (i.e., verified by the formal system). As demonstrated in the trivial example below, proving a formal statement requires generating a sequence of proof steps, each proof step consisting in a call to a tactic.B
These tactics take mathematical terms as arguments and each tactic call will transform the current statement to prove, into statements that are easier to prove, until nothing is left to prove.
_Adapted from AMC12 2000 Problem 5_ # Prove that if ∣x−2∣=p|x - 2| = p∣x−2∣=p, where x<2 x < 2 x<2, then x−p=2−2 p x - p = 2 - 2p x−p=2−2 p.
`theorem amc12_2000_p5 -- ← theorem name (x p : ℝ) -- ← the statement we want (h₀ : x < 2) -- to prove (h₁ : abs (x - 2) = p) : x - p = 2 - 2 * p :=begin -- ← formal proof starts here -- This first tactic requires that the prover invent -- the term: `abs (x - 2) = -(x - 2)`. have h₂ : abs (x - 2) = -(x - 2), { apply abs_of_neg, linarith, }, rw h₁ at h₂, -- At this stage the remaining goal to prove is: -- `x - p = 2 - 2 * p` knowing that `p = -(x - 2)`. linarith,end`
We observe that the capability to generate original mathematical terms required as arguments of tactics, which cannot be done without a neural language model, emerges from our training procedure. The proof below is an example of it: the proof step`use n + 1`(entirely generated by our models) proposes to use`n + 1`as a solution, the rest of the formal proof relying on the`ring_exp`tactic to verify that it is indeed valid.
_Adapted from AMC12B 2020 Problem 6_ # For all integers n≥9 n ≥ 9 n≥9, prove that ((n+2)!−(n+1)!)/n!((n + 2)! −(n + 1)!) / n!((n+2)!−(n+1)!)/n! is a perfect square.
``` theorem amc12b_2020_p6 (n : ℕ) (h0 : 9 ≤ n) : ∃ x : ℕ, (x:ℝ)^2 = (nat.factorial (n + 2) - nat.factorial (n + 1)) / nat.factorial n := begin -- The model directly proposes `n + 1` as solution. use n + 1, field_simp [nat.factorial_ne_zero, pow_succ'], ring_exp end ```
We also observe that our models and search procedure are capable of producing proofs that chain multiple non-trivial reasoning steps. In the proof below, the model starts by using contraposition leading to the existential statement (`∃ (x : ℝ), f x ≠ a * x + b`). It then generates a witness for it with`use (0 : ℝ)`and finishes the proof by leveraging the`norm_num`tactic.
_Adapted from the MATH dataset_ # Let f(x)=A x+B f(x) = Ax + B f(x)=A x+B and g(x)=B x+A g(x) = Bx + A g(x)=B x+A, where A<b r/>e B A <br />e B A<b r/>e B. If f(g(x))−g(f(x))=B−A f(g(x)) - g(f(x)) = B - A f(g(x))−g(f(x))=B−A, prove that A+B=0 A + B = 0 A+B=0.
``` theorem mathd_train_algebra_217 (a b : ℝ) (f g : ℝ → ℝ) (h₀ : ∀ x, f x = a * x + b) (h₁ : ∀ x, f x = b * x + a) (h₂ : a ≠ b) (h₃ : ∀ x, f (g x) - g (f x) = b - a) : a + b = 0 := begin revert h₀ h₁ h₂ h₃, -- Initial contraposition. contrapose!, rintro ⟨h₀, ⟨h₁, h₂⟩⟩, -- The model proposes `0` as witness for the current -- goal that consists in `∃ (x : ℝ), f x ≠ a * x + b`. use (0 : ℝ), simp only [sub_eq_iff_eq_add, h₀, mul_zero, zero_add], norm_num at h₀, end ```
Our models, trained with _statement curriculum learning_, were able to close a variety of problems from training textbooks as well asAMC12(opens in a new window)andAIME(opens in a new window)competitions, and 2 problems adapted from theIMO(opens in a new window). We present below three examples of such generated proofs.
_Adapted from IMO 1964 Problem 2_ # Suppose a a a, b b b, c c c are the sides of a triangle.
Prove that a 2(b+c−a)+b 2(c+a−b)+c 2(a+b−c)≤3 a b c a^2(b + c − a) + b^2(c + a − b) + c^2(a + b − c) \leq 3abc a 2(b+c−a)+b 2(c+a−b)+c 2(a+b−c)≤3 ab c.
``` theorem imo_1964_p2 (a b c : ℝ) (h₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h₁ : c < a + b) (h₂ : b < a + c) (h₃ : a < b + c) : a^2 * (b + c - a) + b^2 * (c + a - b) + c^2 * (a + b - c) ≤ 3 * a * b * c := begin -- Arguments to `nlinarith` are fully invented by our model. nlinarith [sq_nonneg (b - a), sq_nonneg (c - b), sq_nonneg (c - a)] end ```
_Adapted from AIME 1984 Problem 1_ # Prove that a 2+a 4+a 6+a 8+...+a 98=93 a2 + a4 + a6 + a8 + ...+ a98 = 93 a 2+a 4+a 6+a 8+...+a 98=93 if a 1 a1 a 1, a 2 a2 a 2, a 3...a3...a 3... is an arithmetic progression with common difference 1 1 1, and a 1+a 2+a 3+...+a 98=137 a1 + a2 + a3 + ... + a98 = 137 a 1+a 2+a 3+...+a 98=137.
``` theorem aime_1984_p1 (u : ℕ → ℚ) (h₀ : ∀ n, u (n + 1) = u n + 1) (h₁ : ∑ k in finset.range 98, u k.succ = 137) : ∑ k in finset.range 49, u (2 * k.succ) = 93 := begin rw finset.sum_eq_multiset_sum, dsimp [finset.range] at h₁, simp [h₀], ring, norm_num at h₁, norm_num, apply eq_of_sub_eq_zero, { simp only [*, abs_of_pos, add_zero] at *, linarith }, end ```
_Adapted from IMO Longlist 1990 Problem 77_ # For a,b,c a, b, c a,b,c reals, prove that (a 2+a b+b 2)(b 2+b c+c 2)(c 2+c a+a 2)≥(a b+b c+c a)3(a^2 + ab + b^2)(b^2 + bc + c^2)(c^2 + ca + a^2) \geq (ab + bc + ca)^3(a 2+ab+b 2)(b 2+b c+c 2)(c 2+c a+a 2)≥(ab+b c+c a)3.
``` theorem imo_longlist_1990_p77 (a b c : ℝ) : (a * b + b * c + c * a)^3 ≤ (a^2 + a * b + b^2) * (b^2 + b * c + c^2) * (c^2 + c * a + a^2) := begin -- The three initial steps use Cauchy–Schwarz to prove -- `(a * b + b * c) ^ 2 ≤ (a ^ 2 + b ^ 2) * (b ^ 2 + c ^ 2)` -- which is required for the final call to `nlinarith`. let u : euclidean_space ℝ (fin 2) := ![a, b], let v : euclidean_space ℝ (fin 2) := ![b, c], have h₀ := real_inner_mul_inner_self_le u v, simp [u, v, fin.sum_univ_succ, ←pow_two, ←pow_two, le_of_lt, mul_assoc] at h₀, -- The model introduces another required cut (i.e. invent -- the term `0 ≤ (c + a) * (c + a)` and proves it). have h₃ : 0 ≤ (c + a) * (c + a), { nlinarith, }, have h₄ := sq_nonneg (a * b + b * c + c * a), simp [sq, h₀, h₃, mul_add, add_mul] at h₄ ⊢, nlinarith [sq_nonneg (b - a), sq_nonneg (c - b), sq_nonneg (a - c)] end ```
Formal mathematics involves two main challenges that make a naive application of reinforcement learning unlikely to succeed.
In our work, we address the infinite action space problem by sampling actions from a language model as we search for a proof. Language models have the capability to generate the tactic calls as well as the original mathematical terms often required as arguments. Our basis for addressing the lack of self-play is the observation that the key role of self-play in 2-player games is to provide an unsupervised curriculum. Our methodology proposes to replace this unsupervised curriculum with an auxiliary set of problem statements (without requiring proofs) of varying difficulty. We empirically show that, when the difficulty of these auxiliary problems is varied enough, our training procedure is able to solve a curriculum of increasingly difficult problems, eventually generalizing to the set of problems we care about.
While these results are extremely exciting, as they demonstrate that deep learning models are capable of non-trivial mathematical reasoning when interacting with a formal system, we are still very far from best-student performance on these competitions, only occasionally, rather than consistently, closing challenging olympiad problems. We hope nonetheless that our work will motivate research in this domain, in particular towards theIMO Grand Challenge(opens in a new window)and that the _statement curriculum learning_ methodology we propose will help accelerate progress in automated reasoning in general.
1. A These problems are not standard math exercises, they are used to let the best high-school students from the US (AMC12, AIME) or the world (IMO) compete against each other.
2. B The artifacts accepted by the formal system are low-level (like assembly code) and hard for humans to produce. Tactics are search procedures that generate such artifacts from higher level directives to assist formalization.
Stanislas Polu, Jesse Michael Han, Ilya Sutskever
Thanks to our paper co-authors: Igor Babuschkin, Kunhao Zheng and Mantas Baksys.
Thanks to the students of the Xena Project Discord who helped us formalize proofs and statements (in particular: Antoine Labelle, Hanting Zhang, Shing Tak Lam, Paul Lezeau, Sara Diaz, Nikita Golikov, Yael Dillies, Artem Vasilyev, Ollie Perree, and Yourong Zang).
Thanks in particular to Kevin Buzzard and Daniel Selsam for their support and thoughtful feedback since the very beginning of this project.
Building agricultural database for farmers API Jan 12, 2024
Creating websites in minutes with AI Website Builder API May 29, 2025
Delivering LLM-powered health solutions API Jan 4, 2024
Our Research * Research Index * Research Overview * Research Residency * OpenAI for Science * Economic Research
Latest Advancements * GPT-5.3 Instant * GPT-5.3-Codex * GPT-5 * Codex
Safety * Safety Approach * Security & Privacy * Trust & Transparency
ChatGPT * Explore ChatGPT(opens in a new window) * Business * Enterprise * Education * Pricing(opens in a new window) * Download(opens in a new window)
Sora * Sora Overview * Features * Pricing * Sora log in(opens in a new window)
API Platform * Platform Overview * Pricing * API log in(opens in a new window) * Documentation(opens in a new window) * Developer Forum(opens in a new window)
For Business * Business Overview * Solutions * Contact Sales
Company * About Us * Our Charter * Foundation * Careers * Brand
Support * Help Center(opens in a new window)
More * News * Stories * Livestreams * Podcast * RSS
Terms & Policies * Terms of Use * Privacy Policy * Other Policies
(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)(opens in a new window)
OpenAI © 2015–2026 Manage Cookies
English United States