Comprehensive evaluation of LLM capabilities for interpretation and analysis of genome-scale metabolic models in metabolic engineering

Authors

Yeoh, J. W., Patro, C. P. K., Wong, L., Poh, C. L.

Abstract

Genome-scale metabolic models (GSMs) underpin pathway and strain engineering by linking genes to metabolic reactions and enabling system-level simulation of cellular fluxes and intervention effects, yet end-to-end analysis workflows remain fragmented, expert-demanding, and slow to adapt. Large language models (LLMs) could lower this barrier by explaining concepts, interpreting GSM files, and turning natural-language instructions into valid analysis code, thereby reducing the time, effort, and expertise required. However, their reliability for these domain-specific tasks remains unexplored. Here, we present a systematic benchmark of four leading LLMs (GPT-4, Gemini, Claude, DeepSeek-R1) across four task areas central to metabolic engineering: domain knowledge, metabolic flux prediction, pathway construction, and flux optimization. For benchmarking, we introduce a standardized, rubric-based evaluation framework that uses multi-LLM automated scoring (an ensemble of LLM-as-a-judge assessments) and two distinct sets of nine task-tailored metrics (one for domain tasks, one for coding-focused tasks), each rated on a 1-5 scale (up to 45 points per task), covering scientific validity and, where applicable, code executability. Across tasks, we reveal consistent strengths (conceptual explanation, code synthesis) and critical failure modes (e.g., context window limitations, incorrect identifier assumptions, strain-dependent reasoning errors, and errors in domain-specific algorithms). In aggregate, DeepSeek-R1 led in domain tasks, narrowly edging GPT-4, Claude, and Gemini, indicating that conceptual biological logic is largely invariant across architectures. In contrast, Gemini achieved the highest score for coding tasks, distinguished by functional execution and by strong error handling, documentation, and readability, followed by GPT-4, Claude, and DeepSeek-R1.
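The rubric arithmetic described above can be illustrated with a minimal sketch. The abstract does not state how the judge ensemble is combined, so averaging each metric across judges is an assumption here, as are the function name task_score and the judge labels; only the nine metrics, the 1-5 scale, and the 45-point maximum come from the text.

```python
from statistics import mean

N_METRICS = 9              # nine task-tailored metrics per task (from the abstract)
MIN_SCORE, MAX_SCORE = 1, 5  # each metric is rated on a 1-5 scale


def task_score(judge_ratings):
    """Combine ratings from several LLM judges into one task score.

    judge_ratings maps a (hypothetical) judge label to a list of nine
    ratings. Averaging across judges is an assumption, not the paper's
    stated method. Returns (per-metric means, total out of 45).
    """
    for judge, ratings in judge_ratings.items():
        if len(ratings) != N_METRICS:
            raise ValueError(f"{judge}: expected {N_METRICS} ratings")
        if any(not MIN_SCORE <= r <= MAX_SCORE for r in ratings):
            raise ValueError(f"{judge}: ratings must lie in 1-5")
    # zip(*...) groups the ratings metric by metric across all judges.
    per_metric = [mean(column) for column in zip(*judge_ratings.values())]
    return per_metric, sum(per_metric)


example = {"judge_a": [5] * 9, "judge_b": [3] * 9}
per_metric, total = task_score(example)
print(per_metric, total)
```

With one judge awarding 5 on every metric and another awarding 3, each metric averages 4 and the task total is 36 of a possible 45.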
We also evaluated LLM self-inspection capability by injecting two subtle but consequential faults: a stoichiometric sign error that causes mass imbalance, and an omitted pathway reaction. We find that conversational "blind search" prompting completely fails to localize these network faults. Instead, robust error localization requires prompts reframed with domain-informed constraints that force the LLM to use tool-assisted code procedures, such as COBRApy mass-balance functions. Together, this work establishes an evidence-based baseline for LLM-enabled GSM analysis and provides actionable guidance for building reliable, automation-ready workflows for pathway and strain design.
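To make the first injected fault concrete, the following is a minimal pure-Python sketch of an elemental mass-balance check. It is a stand-in for, not a copy of, the tool-assisted procedure the abstract mentions: in COBRApy, Reaction.check_mass_balance() reports imbalances in a similar way, returning an empty dict for a balanced reaction. The reaction used here (glucose-6-phosphate isomerase), the metabolite identifiers, and the function name mass_imbalance are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict

# Elemental formulas for the two metabolites (both are C6H11O9P).
# The identifiers are illustrative and not taken from the paper.
FORMULAS = {
    "g6p_c": {"C": 6, "H": 11, "O": 9, "P": 1},
    "f6p_c": {"C": 6, "H": 11, "O": 9, "P": 1},
}


def mass_imbalance(stoichiometry, formulas):
    """Return the net count of each unbalanced element.

    stoichiometry maps metabolite id to coefficient (negative for
    substrates, positive for products). An empty dict means balanced.
    """
    totals = defaultdict(int)
    for metabolite, coefficient in stoichiometry.items():
        for element, count in formulas[metabolite].items():
            totals[element] += coefficient * count
    return {element: n for element, n in totals.items() if n != 0}


correct = {"g6p_c": -1, "f6p_c": 1}   # g6p -> f6p
faulty = {"g6p_c": -1, "f6p_c": -1}   # sign error: product written as substrate

print(mass_imbalance(correct, FORMULAS))
print(mass_imbalance(faulty, FORMULAS))
```

The correct reaction yields an empty result, while the sign-flipped version reports a net loss of every element. An omitted reaction, the second fault type, would not be caught by a per-reaction check of this kind.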

Preprint server: bioRxiv
The author list and abstract were imported from bioRxiv on 09 Jun 2026.
