The artificial intelligence industry has been making bold claims about solving mathematical problems, but a team of eleven leading mathematicians just raised the bar significantly. In a notable initiative called “First Proof,” these academics have created what may be the most rigorous test yet of AI’s ability to handle genuine research-level mathematics.
The Challenge: Real Math Problems, Not Contest Questions
Published on February 5, 2026, the First Proof paper presents ten mathematical questions that arose naturally during the authors’ own research. These aren’t textbook exercises or competition problems. They’re actual lemmas, the auxiliary results mathematicians prove as stepping stones toward larger theorems, that the authors encountered while working on bigger research projects.
The questions span diverse mathematical fields: algebraic combinatorics, spectral graph theory, algebraic topology, stochastic analysis, symplectic geometry, representation theory, lattices in Lie groups, tensor analysis, and numerical linear algebra. Each problem has already been solved by its contributor, with a proof of roughly five pages or less, but crucially, none of these solutions has ever been posted online.
The answers were encrypted and uploaded to 1stproof.org, with the decryption key scheduled for release on February 13, 2026, giving AI systems and researchers one week to attempt the problems [1].
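The underlying idea is a simple commit-then-reveal scheme: publish the ciphertext now and the key later, so the solutions are provably fixed before anyone attempts the problems. The sketch below illustrates that pattern with the Fernet symmetric cipher from Python’s cryptography package; this is an assumption chosen for illustration, not the tooling the First Proof team actually used.

```python
# Minimal commit-then-reveal sketch. Assumption: Fernet is used here only as a
# convenient symmetric cipher; the First Proof team's actual scheme is unspecified.
from cryptography.fernet import Fernet

# At submission time: encrypt the solutions and publish only the ciphertext.
key = Fernet.generate_key()      # kept secret until the reveal date
cipher = Fernet(key)

with open("solutions.pdf", "rb") as f:
    plaintext = f.read()

with open("solutions.enc", "wb") as f:
    f.write(cipher.encrypt(plaintext))   # this file can be posted publicly

# On the reveal date: publish `key`, and anyone can decrypt and verify.
with open("solutions.enc", "rb") as f:
    assert Fernet(key).decrypt(f.read()) == plaintext
```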
Why This Matters: Closing the Contamination Gap
Previous AI math benchmarks have faced a fundamental problem: data contamination. When testing systems trained on vast swaths of internet data, how can we be sure they’re actually solving problems rather than recognizing patterns from their training data?
The First Proof initiative tackles this challenge head-on through several key design principles. The problems come from unpublished research, meaning they exist nowhere in any AI training dataset. The questions represent the actual distribution of problems working mathematicians face, not artificially constructed puzzles designed for automatic grading. Each problem requires human expert evaluation since correct answers aren’t necessarily unique. There may be multiple valid proofs or different counterexamples.
Additionally, the experiment allows AI systems unfettered access to internet searches and other resources, mirroring how these tools would actually be used in practice. This is a departure from restrictive testing environments that don’t reflect real-world mathematical research workflows.
The Baking Metaphor: Understanding Mathematical Research
The paper’s title draws from an apt culinary analogy. In baking, the “first proof” is the bulk fermentation stage, when the dough rises as one mass before being divided and shaped into individual loaves. Similarly, this research initiative represents a preliminary effort to assess AI capabilities, with plans to “ferment” these ideas in the community before producing a more structured benchmark.
The authors are transparent about what this experiment does and doesn’t measure. Mathematical research involves multiple stages: identifying important questions to study, developing novel theoretical frameworks and definitions, and finally proving well-formed statements. First Proof focuses exclusively on that final, most measurable stage: finding rigorous proofs for already well-specified questions.
As the researchers acknowledge, this choice reflects a pragmatic first step rather than a complete picture of mathematical creativity. The higher-level tasks of question selection and theory development remain crucial but harder to evaluate systematically.
Early Results: AI Systems Struggle with Novel Problems
The research team conducted preliminary tests using GPT-5.2 Pro and Gemini 3.0 Deepthink. Under a strict one-shot protocol where systems get a single attempt without iterative refinement, the results were sobering. Current AI systems struggled to answer many of the questions correctly.
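For concreteness, the one-shot protocol is easy to state in code. The sketch below is only an illustration of that protocol, not the authors’ actual evaluation harness: ask_model is a hypothetical stand-in for whatever interface a given system exposes, and grading is deliberately left to human experts, as the paper requires.

```python
# Illustrative one-shot evaluation loop (not the authors' actual harness).
# `ask_model` is a hypothetical wrapper around whichever API a system exposes.
from typing import Callable

def one_shot_eval(problems: list[str], ask_model: Callable[[str], str]) -> dict[str, str]:
    """Collect exactly one attempt per problem, with no iterative refinement."""
    attempts = {}
    for statement in problems:
        # A single call per problem: no retries, no feedback, no self-correction.
        attempts[statement] = ask_model(statement)
    return attempts

# Grading is deliberately not automated: correct answers need not be unique,
# so each attempted proof is handed to a human expert for evaluation.
```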
This stands in sharp contrast to recent headlines about AI achievements in mathematics. Google’s Gemini Deep Think achieved gold-level scores on the International Mathematical Olympiad, and various systems have solved previously unsolved “Erdős problems.” However, these accomplishments came with caveats that First Proof helps illuminate.
Olympiad problems, while challenging for humans, don’t reflect the nature of research mathematics. Some purported AI breakthroughs turned out to be sophisticated literature searches rather than original proofs. One recent example from Axiom Math, a start-up focused on AI for mathematics, involved a proof that was later revealed to be a misrepresented result from existing literature.
The Industry Context: Skepticism and Transparency
Daniel Spielman, a Yale professor and one of the project’s architects, voiced a concern shared by many academics in an interview with Scientific American: “Almost all of the papers you see about people using LLMs are written by people at the companies that are producing the LLMs,” he noted. “It comes across as a bit of an advertisement” [2].
First Proof represents an independent assessment by researchers with no employment or consulting relationships with AI companies. The team deliberately avoided any conflicts of interest, and the project received no funding from commercial entities.
The initiative also addresses methodological concerns that have plagued previous benchmarks. FrontierMath, funded by OpenAI, consists of expert-level problems but remains largely private, with OpenAI having preferential access. IMProofBench and RealMath take different approaches, but each faces limitations around automatic grading, question design, or data contamination.
What Makes a Good AI Math Benchmark?
The First Proof researchers articulated several principles that distinguish their approach. Problems should be sampled from the true distribution of questions mathematicians currently work on, and they should require human-graded proofs rather than automatically verifiable answers. Answers must never have appeared online, in talks, or in any other public forum, which eliminates contamination. The questions themselves should be public and examinable by everyone, even if this prevents reuse. And models should have access to outside resources such as internet searches, reflecting realistic working conditions.
This combination of features appears to be unique among current math AI evaluations.
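One hypothetical way to make the checklist concrete is to attach a small record to each problem; the sketch below encodes the criteria as fields, though the authors prescribe no such format.

```python
# Hypothetical per-problem record reflecting the First Proof design criteria;
# the authors do not prescribe any particular data format.
from dataclasses import dataclass

@dataclass
class BenchmarkProblem:
    statement: str                  # public and examinable by everyone
    field: str                      # e.g. "spectral graph theory"
    answer_ciphertext: bytes        # solution withheld until the key is released
    human_graded: bool = True       # proofs judged by experts, not auto-checked
    open_web_allowed: bool = True   # models may use internet search and other tools
```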
Second Proof and Beyond
The current release is explicitly labeled a preliminary effort rather than a formal benchmark. Ten questions aren’t sufficient for statistical reliability, and the team hasn’t specified formal grading criteria. Creating research-level problems with unpublished answers of appropriate length requires substantial effort; a typical mathematician might generate only a few such questions per year.
However, the team plans to release a second set of questions within a few months and is open to testing frontier AI systems on these problems before public release. This could provide a more controlled benchmark while maintaining the contamination-free properties that make the approach valuable.
Looking further ahead, the researchers aim to remove some current constraints, such as proof length limitations, and explore ways to measure performance on other aspects of mathematical research beyond theorem proving.
Implications for the Field
As MIT mathematician Andrew Sutherland noted, the greatest impact of AI on mathematics in 2026 may not come from solving famous open problems but from integration into mathematicians’ daily workflows [2]. Most working mathematicians haven’t yet adopted AI tools extensively, but that may be changing.
If AI systems can reliably solve the kinds of lemmas represented in First Proof, they could become valuable assistants for speeding up tedious aspects of research. However, the preliminary results suggest we’re not there yet, at least not with current publicly available systems operating autonomously.
The initiative also serves a broader purpose: establishing clear, transparent standards for evaluating AI capabilities in specialized domains. As AI systems become more sophisticated, rigorous evaluation methodologies become increasingly important for distinguishing genuine progress from hype.
Conclusion
First Proof represents a significant step toward understanding what AI systems can and cannot do in mathematical research. By focusing on genuine, unpublished problems drawn from active research, the initiative cuts through the noise surrounding AI mathematical capabilities.
The decryption key is scheduled for release on February 13, 2026, providing a definitive test of current systems. Whether AI succeeds or struggles with these problems, the mathematical community will gain valuable data about where the technology stands and what directions future development should take.
More importantly, First Proof establishes a methodology that could be extended beyond mathematics to other domains where rigorous evaluation of AI capabilities matters. In an era of ambitious claims about artificial intelligence, such grounded, transparent assessment frameworks are exactly what the field needs.
References
[1] Abouzaid, M., Blumberg, A. J., Hairer, M., Kileel, J., Kolda, T. G., Nelson, P. D., Spielman, D., Srivastava, N., Ward, R., Weinberger, S., & Williams, L. (2026). “First Proof.” arXiv preprint arXiv:2602.05192. https://arxiv.org/abs/2602.05192
[2] Howlett, J. (2026). “Mathematicians issue a major challenge to AI: Show us your work.” Scientific American. February 9, 2026. https://www.scientificamerican.com/article/mathematicians-launch-first-proof-a-first-of-its-kind-math-exam-for-ai/