Real Science Is Harder Than Benchmarks: Evaluating Advanced AI Frameworks on Published Studies. I. Uncertainty Quantification, ML on Therapeutic Data Commons, and Agent-Based Modeling

Abstract

Artificial intelligence (AI) frameworks for automating scientific research have shown strong performance on benchmarks, but their capacity to routinely reproduce results from multiple real-life published studies remains largely untested. We evaluated five advanced AI research frameworks (Kosmos, K-Dense, ToolUniverse, BioAgents from bio.xyz, and the AI Scientist-v2 from Sakana AI) on three real-life tasks (including two recently published papers) spanning uncertainty quantification for molecular property predictions, machine learning on Therapeutic Data Commons benchmarks, and agent-based modeling. AI frameworks demonstrated genuine strengths: generating original hypotheses, competently executing routine data acquisition and coding tasks, providing statistical measures of confidence often absent from the original papers, and producing well-formatted final reports. At the same time, our experiments revealed that real-world scientific tasks remain considerably harder than current benchmarks suggest. No AI framework matched the scope or depth of the original studies, results varied across multiple runs of the same framework with the same prompt, and we documented cases of severe hallucinations in final reports, gaps in literature coverage, and overconfident conclusions. Verification of AI outputs required substantial domain expertise. While these three tasks are only partially representative of the broader scientific landscape, they offer a starting point for developing a more rigorous methodology for evaluating AI performance than is currently practiced. We conclude that AI frameworks are already valuable for prototyping research directions and stress-testing completed studies, and some of the limitations documented here appear largely tractable through infrastructure improvements and continued development.

Preprint server: bioRxiv
The author list and abstract were imported from bioRxiv on 29 Jun 2026.
