ProCyon: A Multimodal Foundation Model for Protein Phenotypes

December 19, 2024 By: Owen Queen, Robert Calef, and Marinka Zitnik

Understanding protein phenotypes is a fundamental challenge in biology, crucial for decoding how living systems operate at the molecular level and beyond. Artificial intelligence models have enabled leaps in biology, with breakthroughs like protein structure prediction (AlphaFold), protein sequence modeling (ESM), and variant effect prediction (AlphaMissense). Predicting protein structure from sequence is now highly advanced; however, predicting protein phenotypes—the observable characteristics that connect molecular functions to biological roles—from sequence and/or structure remains an open challenge.

Approximately 20% of human proteins remain entirely uncharacterized, and 40% lack context-specific functional insights; even well-annotated proteins often lack functional characterization across biological contexts and disease states. Research bias further compounds this challenge: 95% of life science publications focus on only 5,000 proteins, leaving vast portions of the human proteome uncharted.

We developed ProCyon, a groundbreaking 11-billion-parameter unified model of protein phenotypes designed to model, generate, and predict phenotypes across five interrelated knowledge domains: molecular functions, therapeutic mechanisms, disease associations, functional protein domains, and molecular interactions. 

ProCyon bridges multimodal protein representations and large language models (LLMs). Modeling protein phenotypes requires a multimodal approach: natural language descriptions of phenotypes, protein sequences, or protein structures alone are insufficient to fully represent phenotypes. Text-based LLMs face challenges annotating proteins with phenotypes due to biases in the scientific literature, which tends to emphasize well-studied proteins. These models also lack explicit representations of proteins: LLMs refer to proteins through standardized naming systems, such as gene symbols from the HUGO Gene Nomenclature Committee (HGNC). However, poorly annotated proteins often lack sufficient coverage in the literature, limiting their representation in LLM training corpora. This gap can be addressed by protein representation learning encoders that operate directly on protein sequences and/or structures to determine a protein’s function. Protein encoders, such as protein language models for sequences and geometric deep learning methods for structures, excel at generating generalizable protein representations. However, these methods are constrained by their reliance on pre-defined functional categories, which limits their scalability to phenotypes of arbitrary complexity. Addressing this challenge requires a general solution that integrates the strengths of LLMs and protein encoders to enable new insights into the human proteome.

Figure 1. ProCyon models, generates, and predicts protein phenotypes by processing arbitrarily interleaved inputs, including text, protein sequences, protein domains, peptide data, and small molecule structures. The model elements include: a) protein retrieval and prioritization, outputting proteins that match the input prompt for a phenotype, b) question-answering, where a prompt can consist of interleaved proteins and phenotypic information, and c) phenotype generation, where a protein and prompt are given as input and the output is a set of generated phenotypes for the protein.

Overview of ProCyon model architecture

We train ProCyon to follow human instructions tailored to protein annotation, employing three core approaches. In Protein retrieval (Figure 1a), ProCyon converts a phenotype description into a ranked list of proteins matching the user’s prompt. For Question answering (Figure 1b), ProCyon processes interleaved multimodal inputs, including protein and phenotype data, to determine whether the protein aligns with the phenotype. In Phenotype generation (Figure 1c), ProCyon takes a prompt about a specific protein and produces a free-text description of the protein’s phenotype, conditioned on a user-defined biological context.
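To make these three modes concrete, the sketch below shows the rough shape a prompt might take in each. The template strings and the `<protein>` placeholder convention are illustrative assumptions, not ProCyon's released prompt format (which comes from the ProCyon-Instruct templates shown in Figure 2a).

```python
# Illustrative prompt shapes for ProCyon's three inference modes.
# Exact wording is hypothetical; see Figure 2a for the real templates.

# a) Protein retrieval: a phenotype description in, ranked proteins out.
retrieval_prompt = (
    "Retrieve proteins associated with the following phenotype: "
    "progressive degeneration of dopaminergic neurons."
)

# b) Question answering: interleaved protein and phenotype, yes/no out.
# "<protein>" marks where protein-encoder embeddings are spliced in.
qa_prompt = (
    "Does the protein <protein> participate in the following phenotype: "
    "regulation of inflammatory response? Answer yes or no."
)

# c) Phenotype generation: protein plus context in, free text out.
generation_prompt = (
    "Describe the molecular function of <protein> "
    "in the context of neuronal tissue."
)
```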

ProCyon’s architecture and inference mechanisms are designed to tackle the challenges of protein phenotypes. The model integrates interleaved inputs of protein sequences, structures, and natural language using a multimodal token composition algorithm. These inputs are processed through combined protein sequence and structure encoders, which feed rich multimodal representations into a pretrained LLM. ProCyon also addresses the one-to-many problem of protein-phenotype prediction: a single protein may participate in multiple distinct biological processes, and a single phenotype often involves numerous related proteins. 
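As a minimal sketch of the interleaving idea, the snippet below replaces placeholder tokens in the text sequence with projected protein-encoder embeddings before the sequence enters the LLM. The placeholder convention, projection layer, and dimensions are assumptions for illustration; the actual multimodal token composition algorithm differs in its details.

```python
import torch
import torch.nn as nn

def compose_tokens(text_ids, text_embed, protein_embs, proj, placeholder_id):
    """Splice projected protein embeddings into a text embedding sequence.

    text_ids:     (seq_len,) token ids containing `placeholder_id` once
                  per protein, in order of appearance.
    text_embed:   nn.Embedding mapping token ids to LLM hidden states.
    protein_embs: list of (d_protein,) tensors from a protein encoder
                  (e.g., a sequence or structure encoder).
    proj:         nn.Linear mapping d_protein -> d_llm.
    """
    tok = text_embed(text_ids)                        # (seq_len, d_llm)
    slots = (text_ids == placeholder_id).nonzero().flatten()
    for slot, p in zip(slots, protein_embs):
        tok[slot] = proj(p)                           # swap in protein token
    return tok                                        # ready for the LLM

# Toy usage with made-up dimensions.
d_protein, d_llm, vocab = 1280, 4096, 32000
PLACEHOLDER = vocab - 1
embed = nn.Embedding(vocab, d_llm)
proj = nn.Linear(d_protein, d_llm)
ids = torch.tensor([5, 17, PLACEHOLDER, 9])           # "... <protein> ..."
seq = compose_tokens(ids, embed, [torch.randn(d_protein)], proj, PLACEHOLDER)
```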

To tackle this complexity, ProCyon uses a multifaceted approach. Training across multiple knowledge domains achieves multiscale phenotypic resolution, enabling precise querying of molecular functions, diseases, pathways, and therapeutics. Additionally, a novel context augmentation strategy enriches instruction prompts with supplemental phenotype information, improving predictive accuracy. ProCyon also supports compositional queries, allowing users to dynamically define phenotypes by combining multiple traits into a single query—an approach far more flexible than static vocabularies used in traditional models. Finally, ProCyon’s inference mechanisms manage one-to-many relationships through contextual protein retrieval and conditional phenotype generation. These mechanisms include multi-protein prioritization and beam search sampling, ensuring robust and accurate phenotype predictions.
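As one concrete illustration of the retrieval side of this design, a scorer can rank every candidate protein against a phenotype prompt by embedding similarity and return a prioritized list. This is a generic cosine-similarity sketch of multi-protein prioritization, not ProCyon's exact scoring head.

```python
import torch
import torch.nn.functional as F

def rank_proteins(prompt_emb, protein_embs, protein_ids):
    """Rank candidate proteins by cosine similarity to a phenotype prompt.

    prompt_emb:   (d,) embedding of the phenotype prompt.
    protein_embs: (n, d) embeddings of n candidate proteins.
    protein_ids:  list of n identifiers (e.g., UniProt accessions).
    Returns (identifier, score) pairs from best to worst match.
    """
    sims = F.cosine_similarity(protein_embs, prompt_emb.unsqueeze(0), dim=1)
    order = torch.argsort(sims, descending=True)
    return [(protein_ids[i], sims[i].item()) for i in order]

# Toy usage with random embeddings and placeholder identifiers.
ranked = rank_proteins(torch.randn(64), torch.randn(3, 64), ["A", "B", "C"])
```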

Zero-shot task transfer and dynamic task specification in ProCyon

In molecular biology, the complexity and uniqueness of most problems make it essential to move beyond predefined task sets. Traditional protein annotation methods rely on fixed, static tasks. For instance, Gene Ontology annotation predictors typically consist of a model trained on protein sequences or structures with classification heads that produce outputs over a fixed set of labels. However, this rigid approach is not scalable to the broader scope of protein annotation, as it cannot account for the vast and evolving range of biological queries.

Different scientists often approach problems differently, so a model must accommodate diverse user expertise and adapt to complex, multi-faceted prompts; it is impossible to anticipate the full space of queries at training time. Inspired by how LLMs process varied user inputs, we designed ProCyon to enable dynamic task specification—a paradigm shift in protein annotation. This capability allows scientists to define tasks dynamically using natural language prompts, enabling the model to generalize to entirely new tasks at inference time and tailor predictions to their specific needs.

Dynamic task specification refers to the model’s ability to switch tasks at inference by leveraging a universal representation of tasks through natural language prompts. This requires redefining task transfer—not simply as transferring across input instances or distributions but as adapting to entirely new tasks with unique conditional specifications. While the concept of task transfer is not new (see classic definitions of instruction tuning in Wei et al., 2022 or Wang et al., 2022), its application in protein annotation addresses the limitations of structured, constrained databases. Using instruction tuning to train ProCyon, we created a novel dataset, ProCyon-Instruct, designed to enable dynamic task specification.
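The ProCyon-Instruct templates (Figure 2a) pair a task definition and optional in-context examples with the query itself, and it is this composition that lets a user specify a previously unseen task at inference time. Below is a hedged mock-up of that structure; the field wording is illustrative, not the released template text.

```python
# Hypothetical mock-up of a dynamically specified task, following the
# template structure in Figure 2a (task definition + in-context example
# + query). All wording here is illustrative.
task_definition = (
    "Task: given a drug and a protein domain, state whether the domain "
    "is the drug's binding site."
)
in_context_example = "Example: drug acarbose; domain <protein>; answer: yes."
query = "Drug miglitol; domain <protein>; answer:"

prompt = "\n".join([task_definition, in_context_example, query])
```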

ProCyon-Instruct dataset with 33 million protein-phenotype instructions

To power ProCyon, we built ProCyon-Instruct, a dataset that bridges five key knowledge domains: molecular functions, disease phenotypes, therapeutics, protein domains, and protein-protein interactions. By unifying diverse protein-phenotype pairs under a structured instruction tuning framework (Figure 2a), ProCyon-Instruct transforms raw data into interleaved natural language instructions that emulate how scientists approach protein analysis.

The innovation lies in addressing a fundamental limitation of existing databases: their reliance on templated, standardized language that lacks diversity and fails to generalize to real-world queries from users with varying expertise. Using task definitions, example descriptions, and phenotype descriptions as input prompts, we rephrase protein-phenotype descriptions along two critical dimensions:

  • Rephrasing style: Direct rephrasing, summarization, or simplification.
  • Expertise level: Tailored to junior scientists, experienced researchers, and expert scientists.

This approach dramatically expanded the linguistic richness of the dataset, producing 33 million protein-phenotype instructions. By leveraging this rephrasing system, ProCyon-Instruct enables ProCyon to generalize to out-of-distribution inputs and support nuanced natural language queries across varying contexts and expertise levels. 
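A minimal sketch of how such a rephrasing grid can be generated, assuming a hypothetical LLM-backed `rephrase` helper; the prompt wording and the helper itself are stand-ins for the actual augmentation pipeline.

```python
from itertools import product

STYLES = ["direct rephrasing", "summarization", "simplification"]
AUDIENCES = ["junior scientist", "experienced researcher", "expert scientist"]

def rephrase(description, style, audience):
    """Hypothetical stand-in for an LLM call: a real pipeline would send
    this prompt to a language model and return the rewritten text."""
    return (
        f"Rewrite the following phenotype description using {style}, "
        f"targeted at a {audience}: {description}"
    )

def augment(description):
    # One variant per (style, expertise) combination: 3 x 3 = 9 per source.
    return [rephrase(description, s, a) for s, a in product(STYLES, AUDIENCES)]

variants = augment("Involved in regulation of autophagosome assembly.")
print(len(variants))  # 9
```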

Figure 2. a) Instruction templates used for transforming protein-phenotype data into natural language instructions, along with task definitions and in-context examples. b) Statistics for samples within ProCyon-Instruct, with a wide coverage of databases across the function, disease, therapeutics, domain, and interaction knowledge domains.

Benchmarking ProCyon against single-modality and multi-modality models

ProCyon consistently demonstrates exceptional performance across 14 types of tasks designed to evaluate contextual protein retrieval, question answering, and zero-shot generalization. Compared to single-modality models, ProCyon outperforms them in 10 of 14 tasks. It achieves a question-answering accuracy of 72.7%, surpassing the best single-modality model, ESM3-MLP, which reaches 67.8% (Figure 3). ProCyon achieves an Fmax of 0.743 on protein retrieval tasks, well above the runner-up (ESM3-MLP, Fmax of 0.618).
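For readers unfamiliar with Fmax: it is the maximum F1 score obtained by sweeping a decision threshold over the retrieval scores. Below is a small self-contained implementation of that definition, micro-averaged over protein-phenotype pairs (one common convention; the paper's exact averaging may differ).

```python
import numpy as np

def fmax(y_true, y_score, n_thresholds=101):
    """Maximum F1 over score thresholds.

    y_true:  binary labels, 1 if the protein matches the phenotype.
    y_score: model scores in [0, 1], same shape as y_true.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    best = 0.0
    for t in np.linspace(0.0, 1.0, n_thresholds):
        pred = y_score >= t
        tp = np.sum(pred & y_true)
        if tp == 0:
            continue  # F1 is zero (or undefined) at this threshold
        precision = tp / pred.sum()
        recall = tp / y_true.sum()
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

print(fmax([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]))  # 1.0: scores separate perfectly
```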

ProCyon also excels against multimodal models, outperforming them in 13 of 14 tasks. For example, it achieves a 30.1% improvement in Fmax over ProtST, the next-best multimodal model, in protein retrieval tasks. In zero-shot retrieval, ProCyon achieves an average Fmax improvement of 27%, highlighting its ability to generalize effectively to unseen phenotypes. This robust performance across multiple knowledge domains underscores ProCyon’s unique ability to harmonize multimodal representations of sequence, structure, and text.

The model’s adaptability is further evident in its ability to process complex queries. For example, ProCyon effectively handles compositional prompts describing pleiotropic phenotypes, outperforming manual aggregation approaches by 3.97% in pathway crosstalk tasks. 

Figure 3. ProCyon demonstrates strong performance across tasks and knowledge domains. Question answering performance is measured by binary accuracy, and protein retrieval performance by Fmax. “Zero-shot capable” refers to the ability to generate predictions for novel phenotypes never seen in the training set. “Dynamic task specification” refers to the ability of the model to perform tasks beyond those it was explicitly trained on.

Generation of protein phenotypes

ProCyon excels at generating free-text descriptions of protein phenotypes, providing insights that are not constrained by pre-defined vocabularies or ontologies. This capability allows ProCyon to describe protein functions across molecular, cellular, and systemic scales. For instance, it predicts novel functions for poorly characterized proteins such as AKNAD1 (Figure 4a). This highlights ProCyon’s potential to illuminate the dark proteome—regions of the human proteome with yet-undetermined functions.

To evaluate its phenotype generation abilities, we benchmarked ProCyon against text-only LLMs on protein-phenotype-centric tasks. ProCyon outperformed LLMs, even frontier models with far more parameters, in question answering and semantic similarity tasks (measured via BERTScore), achieving higher alignment between generated text and reference texts across multiple knowledge domains (Figure 4b). These results demonstrate ProCyon’s unique ability to integrate multimodal data for precise and context-aware phenotype generation.
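BERTScore compares generated and reference text through contextual token embeddings rather than exact word overlap, which makes it better suited to free-text phenotype descriptions. A minimal usage sketch with the open-source `bert-score` package, as a stand-in for the paper's exact evaluation configuration:

```python
# pip install bert-score
from bert_score import score

candidates = ["Regulates autophagosome assembly in neurons."]
references = ["Involved in the regulation of autophagosome assembly."]

# Precision, recall, and F1 computed from contextual token embeddings.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```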

When comparing against text-only LLMs, we find that the method of encoding a protein in text is critical to their performance. For example, a protein can be encoded using its ID in different biological databases or by its raw protein sequence (Figure 4c). We experiment with two canonical protein IDs—HUGO Gene Nomenclature Committee (HGNC or HG) and UniProt (UP)—and an encoding of proteins by their amino acid sequence (AA). A key point is that the full amino acid sequence is the only text encoding that can generalize to arbitrary, possibly novel, proteins. GPT-4 performs worst with AA and best with HG, yet even its best encoding underperforms ProCyon on the BERTScore metric (Figure 4d).
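To make the three encodings concrete, here is how a single protein might be presented under each scheme. TP53 is used purely as a familiar example (its HGNC symbol is TP53 and its UniProt accession is P04637); the amino acid sequence is truncated for display.

```python
# Three text encodings of the same protein, as compared in Figure 4c/d.
protein_encodings = {
    "HG": "TP53",           # HGNC-standardized gene name
    "UP": "P04637",         # UniProt accession
    "AA": "MEEPQSDPSV...",  # raw amino acid sequence (truncated here)
}

# Only the AA encoding generalizes to arbitrary, possibly novel, proteins:
# every protein has a sequence, but a curated ID may not yet exist.
for scheme, text in protein_encodings.items():
    print(f"{scheme}: {text}")
```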

Figure 4. a) Example phenotypes generated by ProCyon for AKNAD1, a protein for which no phenotype annotations are present in ProCyon-Instruct. After performing filtering steps, we validate two of the three generated phenotypes for this protein against the Human Protein Atlas (HPA). b) Benchmarking results against text-only LLMs for question answering (top row) and open-ended text generation (bottom row). ProCyon consistently outperforms baselines across many metrics. c) Example representations of a protein: two explicit representations (protein sequence and structure) and two controlled-vocabulary references (UniProt ID and standardized gene name). d) ProCyon outperforms GPT-4 in phenotype generation; importantly, the performance of GPT-4 is highly dependent on the chosen method of representing the protein, with a large drop from HG (gene name) to AA (amino acid sequence).

New applications in biology

We extensively evaluate ProCyon for biological applications, particularly through its ability to perform zero-shot tasks—solving tasks that were not explicitly present in its training data. One such application is the identification of protein domains that bind to small molecule drugs (Figure 5a). Despite not being trained on domain-level drug binding data, ProCyon accurately identifies the correct binding domain in 24 out of 35 cases (Figure 5b). This performance highlights ProCyon’s adaptability in tackling the complex ligand-binding prediction problem.

ProCyon also excels in retrieval tasks based on complex compositional queries across biological knowledge domains. For example, it differentiates between bupropion’s therapeutic mechanisms in treating major depressive disorder (MDD) and nicotine addiction (Figure 5c). When prompted for MDD, ProCyon retrieves the norepinephrine transporter (NET) as the primary target, while for nicotine addiction, it prioritizes the nicotinic acetylcholine receptor (AChR). This nuanced understanding demonstrates ProCyon’s ability to contextualize its predictions based on multi-scale biological inputs.

Figure 5. a) Application of ProCyon to protein domain-specific drug interaction prediction, specifically for the protein MGAM, which has nine domains. We rank each domain within the protein given a drug target prompt. In this example, the true binding domain (highlighted in orange) is ranked first in the retrieved list. b) Broader results for domain-drug binding prediction, with three other examples: two correctly predicted binding domains (P00742, Q13451) and one where the binding domain is ranked second (Q06187). Results are shown below for prediction across 35 examples, with the correct domain ranked first in the majority (24) of cases. c) ProCyon performs a compositional query given bupropion and two diseases for which it is indicated. The model correctly ranks the nicotinic acetylcholine receptor (AChR) first in the left example and the norepinephrine transporter (NET) first in the right example.

ProCyon can generate candidate phenotypes for poorly characterized proteins recently linked to Parkinson’s disease (PD). By producing detailed phenotype descriptions for these PD-associated proteins, ProCyon identified biological pathways that received high scores in expert reviews and showed significant enrichment in functional disease pathway analyses (Figure 6). 

ProCyon also excels in protein-peptide binding prediction, accurately distinguishing binders from non-binders for ACE2 among candidate peptides from mass spectrometry datasets, paving the way for advancements in peptide-based therapies. Additionally, ProCyon tackles complex phenotypes like pleiotropic roles and pathway crosstalk, including identifying STING’s role in neuronal inflammatory stress—an insight only recently characterized. Beyond these capabilities, ProCyon models the functional impacts of genetic mutations, effectively distinguishing between benign and pathogenic variants, such as PSEN1 mutations associated with Alzheimer’s disease, enabling deeper exploration of disease biology.

Figure 6. ProCyon-predicted phenotypes for poorly annotated proteins implicated in Parkinson’s disease (PD). We define three comparison groups: “PD-Assoc.” is a group of PD genes, “Neuro control” is a group of control genes expressed in neurological tissue, and “General control” is a group of genes not expressed in neurological tissue. Consistently, ProCyon-generated pathways are ranked highly by expert annotation and automated enrichment tests, as shown by the green bars, which indicate significant differences in pathway enrichment between “PD-Assoc.” and the control groups.

Conclusion

ProCyon represents a transformative step toward a universal solution for functional protein biology, offering the potential to unlock new insights into the human proteome. Its ability to perform zero-shot generalization, combined with an intuitive natural language interface, democratizes access to cutting-edge protein analysis tools. ProCyon is an open model, and we make our training datasets, training and inference code, and pretrained models publicly available to encourage future work and exploration.