Abstract
Artificial Intelligence (AI) frameworks for automating scientific research have shown strong performance on benchmarks, but their capacity to routinely reproduce results from multiple real-life published studies remains largely untested. We evaluated five advanced AI research frameworks (Kosmos, K-Dense, ToolUniverse, BioAgents from bio.xyz, and the AI Scientist-v2 from Sakana AI) on three real-life tasks (including two recently published papers) spanning uncertainty quantification for molecular property predictions, machine learning on Therapeutic Data Commons benchmarks, and agent-based modeling. AI frameworks demonstrated genuine strengths: generating original hypotheses, competently executing routine data acquisition and coding tasks, providing statistical measures of confidence often absent from the original papers, and producing well-formatted final reports. At the same time, our experiments revealed that real-world scientific tasks remain considerably harder than current benchmarks suggest. No AI framework matched the scope or depth of the original studies, results varied across multiple runs of the same framework with the same prompt, and we documented cases of severe hallucinations in final reports, gaps in literature coverage, and overconfident conclusions. Verification of AI outputs required substantial domain expertise. While these three tasks are only partially representative of the broader scientific landscape, they offer a starting point for developing a more rigorous methodology for evaluation of AI performance than what is currently practiced. We conclude that AI frameworks are already valuable for prototyping research directions and stress-testing completed studies, and some of the limitations documented here appear largely tractable through infrastructure improvements and continued development.
Preprint server: bioRxiv
The authors list and abstract were imported from bioRxiv on 29 Jun 2026.