Abstract
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge. Within the allowed timeframe of the challenge, Aletheia autonomously solved 6 problems (2, 5, 7, 8, 9, 10) out of 10 according to majority expert assessments; we note that experts were not unanimous on Problem 8 (only). For full transparency, we explain our interpretation of FirstProof and disclose details about our experiments as well as our evaluation. Raw prompts and outputs are available at https://github.com/google-deepmind/superhuman/tree/main/aletheia.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Towards Autonomous Mathematics Research (2026)
- Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erd\H{o}s Problems (2026)
- Overseeing Agents Without Constant Oversight: Challenges and Opportunities (2026)
- Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math (2026)
- Milestones over Outcome: Unlocking Geometric Reasoning with Sub-Goal Verifiable Reward (2026)
- RocqSmith: Can Automatic Optimization Forge Better Proof Agents? (2026)
- PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper