
ATOMICA: Learning Universal Representations of Molecular Interactions

April 30, 2025 By: Ada Fang and Marinka Zitnik

ATOMICA is a representation learning model that captures intermolecular interactions across all molecular modalities—proteins, nucleic acids, small molecules, and ions. By learning universal representations of how atoms interact across different molecular types, ATOMICA provides a general-purpose model for reasoning about biological systems at atomic resolution.

AI for Science is an emerging research area that integrates machine learning to drive discovery across a broad spectrum of scientific domains. Rather than replacing traditional science, AI augments it by helping scientists model complex systems, uncover hidden structure in data, and generalize from sparse or noisy observations. ATOMICA is part of this wave. It applies geometric deep learning to the fundamental units of biology: molecular interactions.

Molecular interactions govern nearly every biological process, from enzymes catalyzing reactions and proteins binding DNA to the regulation of signaling pathways by ions and lipids. Yet most machine learning models focus on molecules in isolation or specialize in a narrow class of interactions, such as protein-ligand or protein-protein binding. This siloed approach limits their ability to generalize across molecular types (Figure 1).

ATOMICA addresses this challenge. It is a geometric deep learning model trained to learn atomic-scale representations of intermolecular interfaces across a wide spectrum of biomolecular modalities, including small molecules, ions, amino acids, and nucleic acids. Its latent space encodes shared chemical and structural patterns across these interactions and improves as the diversity of training modalities increases.

We apply ATOMICA to identify disease-associated proteins with similar interaction profiles. Because interaction interfaces are often functional sites that are conserved throughout evolution, ATOMICA can also assign metal ions and cofactors to 2,646 previously uncharacterized ligand-binding sites in the dark proteome.

**Figure 1.** ATOMICA learns representations of molecular interactions between small molecules, metal ions, amino acids, and nucleic acids. Capabilities of ATOMICA: a universal latent space of molecular interactions, modeling of disease proteins, and functional annotation of the dark proteome.

ATOMICA’s architecture is a hierarchical geometric graph neural network that models each interaction complex at the all-atom scale. Atoms are connected both within molecules (intramolecular edges) and between interacting molecules (intermolecular edges), and recurring substructures, such as amino acids, nucleotides, and chemical groups, are represented as higher-level “blocks.” Message passing occurs both at the atomic and block levels to enable multi-scale reasoning. Here are the key design choices:

- Atom nodes carry the atomic coordinates from the experimentally determined structure.
- Because we are modeling an interaction complex, there are two types of edges: intramolecular edges between atoms in the same molecule and intermolecular edges between atoms in different molecules.
- Many atoms are part of recurring substructures, such as chemical motifs, amino acids, or nucleotides. In our graph representation, these substructures are captured as “blocks,” with each relevant atom connected to its corresponding block node.
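The graph described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Atom` record, the function names, and the 4.0 Å distance cutoff are hypothetical and are not taken from the ATOMICA codebase, which may define edges differently.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass
class Atom:
    element: str   # element symbol, e.g. "C", "N", "ZN"
    xyz: tuple     # 3D coordinates in angstroms
    molecule: int  # which interacting partner the atom belongs to
    block: int     # index of its block (residue, nucleotide, or motif)


def build_edges(atoms, cutoff=4.0):
    """Connect atoms within `cutoff` (an assumed value) and type each edge."""
    intra, inter = [], []
    for i, j in combinations(range(len(atoms)), 2):
        if dist(atoms[i].xyz, atoms[j].xyz) <= cutoff:
            same_molecule = atoms[i].molecule == atoms[j].molecule
            (intra if same_molecule else inter).append((i, j))
    return intra, inter


def atoms_by_block(atoms):
    """Map each block index to the indices of the atoms attached to it."""
    blocks = {}
    for idx, atom in enumerate(atoms):
        blocks.setdefault(atom.block, []).append(idx)
    return blocks


# Toy complex: two atoms of one residue (molecule 0) and one ion (molecule 1).
atoms = [
    Atom("N", (0.0, 0.0, 0.0), molecule=0, block=0),
    Atom("C", (1.5, 0.0, 0.0), molecule=0, block=0),
    Atom("ZN", (0.0, 2.0, 0.0), molecule=1, block=1),
]
intra, inter = build_edges(atoms)
blocks = atoms_by_block(atoms)
```

In this toy complex the two residue atoms share an intramolecular edge, both are linked to the ion by intermolecular edges, and each atom is attached to exactly one block.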

**Figure 2.** Architecture of ATOMICA. Interaction complexes are modeled at the all-atom scale and the block scale (common chemical motifs, amino acids, and nucleotides). Node features are updated with SE(3)-equivariant message passing.

At the core of ATOMICA is an SE(3)-equivariant geometric graph neural network, which encodes the 3D spatial arrangement of atoms while respecting rotational and translational symmetries (Figure 2). Each node represents an atom, defined by its element type and 3D coordinates, and edges reflect spatial proximity and chemical bonding. The model updates node features through message passing using tensor field networks, which preserve geometric equivariance throughout the computation. As a result, the learned representations do not change when a complex is rotated or translated as a rigid body, a crucial property for modeling molecular structures. After atom-level updates, features are pooled to block-level nodes and message passing is repeated at this higher scale. The final graph embedding integrates atomic details with broader chemical context, allowing the model to generalize across diverse molecular interfaces.
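To make the symmetry property concrete, the sketch below applies a rigid-body motion to a handful of coordinates and checks two things: pairwise distances stay the same (invariance), and the centroid moves exactly as the points do (equivariance). It is not a tensor field network; the helper names and the specific rotation and translation are illustrative assumptions.

```python
import math


def rigid_motion(p, theta=0.7, shift=(3.0, -1.0, 2.0)):
    """Rotate a point about the z-axis by `theta`, then translate by `shift`."""
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    rotated = (c * x - s * y, s * x + c * y, z)
    return tuple(a + b for a, b in zip(rotated, shift))


def pairwise_distances(points):
    """All interatomic distances, a simple rotation- and translation-invariant feature."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]


def centroid(points):
    """Mean position, a simple equivariant quantity."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def close(u, v, tol=1e-9):
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(u, v))


coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 1.0)]
moved = [rigid_motion(p) for p in coords]

# Invariance: distances are unchanged by the rigid motion.
invariant = close(pairwise_distances(coords), pairwise_distances(moved))
# Equivariance: transforming the input transforms the output the same way.
equivariant = close(centroid(moved), rigid_motion(centroid(coords)))
```

An equivariant network keeps this kind of consistency for its intermediate features at every layer, which is what lets a final pooled embedding be independent of the pose of the input complex.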

ATOMICA is trained in a self-supervised manner using denoising and masked block-type prediction on over 2 million interaction complexes drawn from the Protein Data Bank and the Cambridge Structural Database. The complexes span interactions among metal ions, small molecules, proteins, peptides, DNA, and RNA (Figure 3).
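The two self-supervised objectives can be pictured as a corruption step applied to each training complex: perturb the geometry and hide some block types, then ask the model to recover both. The sketch below is a generic stand-in. This post does not specify the noise model or masking rate, so the Gaussian coordinate noise, the 15% mask rate, and the `corrupt` function are all assumptions.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token for a hidden block type


def corrupt(coords, block_types, noise_std=0.5, mask_rate=0.15, seed=0):
    """Return noisy coordinates, the applied noise, masked types, and mask targets."""
    rng = random.Random(seed)
    # Denoising input: add independent Gaussian noise to every coordinate (assumed).
    noise = [tuple(rng.gauss(0.0, noise_std) for _ in range(3)) for _ in coords]
    noisy = [tuple(c + n for c, n in zip(p, d)) for p, d in zip(coords, noise)]
    # Masking input: hide a random subset of block types and remember the answers.
    masked, targets = [], {}
    for i, block in enumerate(block_types):
        if rng.random() < mask_rate:
            masked.append(MASK)
            targets[i] = block
        else:
            masked.append(block)
    return noisy, noise, masked, targets


coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.0)]
block_types = ["HIS", "HIS", "ZN"]
noisy, noise, masked, targets = corrupt(coords, block_types)
```

A model trained on such pairs would be scored on how well it predicts `noise` from `noisy` and the entries of `targets` from `masked`.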

**Figure 3.** Training details of ATOMICA. Top: The model is pretrained on 2,037,972 interaction complexes. Bottom: Training is self-supervised with the tasks of denoising and masked block-type prediction.

To evaluate the impact of cross-modality learning, we compared ATOMICA models pretrained on individual modality pairs to a single model pretrained jointly on all interaction types. We assessed performance using a masked block identity prediction task, which measures how well the model can infer missing structural components from the surrounding context of the interaction interface (Figure 4).

The results show that pretraining on multiple molecular modalities substantially improves embedding quality compared to training on a single interaction type. This improvement is particularly pronounced for low-resource interaction types, such as protein-DNA and protein-RNA interfaces, where training data are limited. For these cases, multimodality pretraining increases prediction accuracy by over 2.5 times relative to single-modality models.

These improvements scale with dataset size, consistent with scaling laws observed in other areas of deep learning. Pretraining on larger and more diverse datasets leads to better generalization, especially for modalities underrepresented in structural databases. This is the first demonstration of such scaling behavior in molecular interaction modeling, showing that ATOMICA can transfer knowledge across different molecular interaction types and that cross-modality training helps the model capture shared structural and chemical patterns that are not modality-specific.

**Figure 4.** Left: schematic of the evaluation comparing pretraining on multiple molecular modalities with pretraining on one pair of molecular modalities. Middle: AUPRC for masked block identity prediction with ATOMICA pretrained on all pairs of molecular modalities compared to one pair of interacting modalities. Right: model performance improves with increasing dataset size.

Protein-protein interaction networks have long been used to study disease mechanisms. In these networks, edges connect proteins that physically interact, and proteins associated with the same disease often cluster together to form connected components, referred to as disease pathways. However, proteins interact not only with other proteins but also with a wide range of molecular partners, including nucleic acids, ions, lipids, and small molecules. It remains unclear how similarities in these diverse interaction interfaces relate to disease involvement. ATOMICA addresses this by constructing ATOMICANets, which are modality-specific networks in which each node represents a protein with a known or predicted interface to a given molecular modality. Edges connect proteins that have similar interaction interfaces, measured using ATOMICA-derived embeddings.
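A minimal sketch of how such a network could be assembled from interface embeddings follows. The post says only that edges connect proteins with similar embeddings, so the cosine similarity measure, the 0.9 threshold, and the toy protein names and vectors here are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def build_network(embeddings, threshold=0.9):
    """Link every pair of proteins whose interface embeddings are similar enough."""
    edges = set()
    for p, q in combinations(sorted(embeddings), 2):
        if cosine(embeddings[p], embeddings[q]) >= threshold:
            edges.add((p, q))
    return edges


# Toy interface embeddings for one modality (made-up values).
embeddings = {
    "protein_A": [1.0, 0.1, 0.0],
    "protein_B": [0.9, 0.2, 0.0],
    "protein_C": [0.0, 0.1, 1.0],
}
edges = build_network(embeddings)
```

Here only `protein_A` and `protein_B` end up connected, since their embeddings point in nearly the same direction while `protein_C` is close to orthogonal to both.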

We constructed five such networks: ATOMICANet-Nucleic-Acid, ATOMICANet-Small-Molecule, ATOMICANet-Ion, ATOMICANet-Lipid, and ATOMICANet-Protein. Across all networks, we observe a consistent trend: proteins with similar interface embeddings are more likely to be implicated in the same disease. This suggests that interaction similarity, even across different molecular modalities, reflects shared functional roles in disease biology (Figure 5). These networks offer a new, interaction-based perspective on disease pathways, extending beyond protein-protein interactions to capture a broader spectrum of biomolecular interfaces.

**Figure 5.** Construction of ATOMICANets from ATOMICA embeddings. Nodes are proteins that have an interaction interface with the given molecular modality.

Proteins that are close in ATOMICANet, meaning they share similar interaction interfaces with a specific molecular modality, are more likely to be involved in the same disease. When mapping known disease-associated proteins onto ATOMICANets, we observe that they form larger connected components than expected by chance (Figure 6). These connected components correspond to disease pathways and reveal shared molecular interaction profiles among disease proteins. For example, in ATOMICANet-Lipid, proteins associated with asthma include several sodium ion channels and G protein-coupled receptors with similar lipid interaction interfaces. In ATOMICANet-Ion, proteins involved in myeloid leukemia include DNA-binding proteins with similar ion coordination sites. In ATOMICANet-Small-Molecule, proteins linked to hypertrophic cardiomyopathy share conserved nucleotide binding sites for ATP, ADP, GTP, or GDP. These findings demonstrate that ATOMICA can identify disease-relevant protein groupings based on shared interaction geometry across diverse molecular modalities, providing a new perspective on disease pathway organization. For the first time, this analysis considers similarity in protein interfaces with molecular partners beyond other proteins, including ions, lipids, nucleic acids, and small molecules.
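The comparison against chance can be sketched as a permutation test: measure the largest connected component formed by the disease proteins, then repeat the measurement for random protein sets of the same size. The post does not state which null model was used; uniform sampling, the trial count, and the toy network below are assumptions.

```python
import random


def largest_component(nodes, adjacency):
    """Size of the largest connected component in the subgraph induced by `nodes`."""
    nodes, seen, best = set(nodes), set(), 0
    for start in nodes:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # depth-first search restricted to `nodes`
            node = stack.pop()
            size += 1
            for neighbor in adjacency.get(node, ()):
                if neighbor in nodes and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best


def empirical_p_value(disease, adjacency, trials=1000, seed=0):
    """Fraction of random same-size sets whose largest component is at least as big."""
    rng = random.Random(seed)
    observed = largest_component(disease, adjacency)
    universe = sorted(adjacency)
    hits = sum(
        largest_component(rng.sample(universe, len(disease)), adjacency) >= observed
        for _ in range(trials)
    )
    return observed, (hits + 1) / (trials + 1)


# Toy network of ten proteins; "a", "b", "c" play the role of disease proteins.
adjacency = {name: set() for name in "abcdefghij"}
for u, v in [("a", "b"), ("b", "c"), ("d", "e")]:
    adjacency[u].add(v)
    adjacency[v].add(u)
observed, p_value = empirical_p_value({"a", "b", "c"}, adjacency)
```

A small `p_value` indicates that the disease proteins are more tightly connected than a random set, which is the pattern reported for the ATOMICANets.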

Targets can also be nominated from ATOMICANets through random walks. For a given disease, the disease proteins are split into seed nodes and a held out set. Since disease proteins are likely to be near other disease proteins in the network, random walks from the seed nodes tend to visit the held out disease proteins. When comparing which disease proteins are most likely visited over many independent random walks from different seed proteins, different ATOMICANets tend to nominate different targets (Figure 6). This reflects the complementary nature of the connectivity captured in each network. Some frequently nominated targets in each network include:

- ATOMICANet-Ion: zinc finger domain-containing proteins involved in transcriptional and DNA replication processes, such as POLD1.
- ATOMICANet-Small-Molecule: ATP-dependent chromatin remodelers such as SMARCA.
- ATOMICANet-Nucleic-Acid: transcription factors and histone acetyltransferases such as CREB-binding protein, which are recruited to DNA by interacting with DNA-bound transcription factors.
- ATOMICANet-Lipid: FLT3, a cell-surface receptor tyrosine kinase whose lipid-interacting region lies in the cell membrane.
- ATOMICANet-Protein: STAT3, which is often found as a homo- or heterodimer.
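The random-walk nomination described above can be sketched as follows: start many short walks from the seed proteins, count how often each non-seed protein is visited, and rank by visit count. The walk length, number of walks, function name, and toy network are assumptions, not the settings used for ATOMICA.

```python
import random
from collections import Counter


def nominate(adjacency, seeds, walks=2000, length=5, top_k=3, seed=0):
    """Rank non-seed nodes by how often short random walks from the seeds visit them."""
    rng = random.Random(seed)
    seed_list = sorted(seeds)
    visits = Counter()
    for _ in range(walks):
        node = rng.choice(seed_list)  # each walk starts at a random seed protein
        for _ in range(length):
            neighbors = sorted(adjacency[node])
            if not neighbors:
                break
            node = rng.choice(neighbors)  # step to a uniformly chosen neighbor
            if node not in seeds:
                visits[node] += 1
    return [name for name, _ in visits.most_common(top_k)]


# Toy network: "h" is a held-out disease protein adjacent to both seeds.
adjacency = {}
for u, v in [("s1", "h"), ("s2", "h"), ("h", "x"), ("x", "y"), ("y", "z")]:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)
seeds = {"s1", "s2"}
nominated = nominate(adjacency, seeds)
```

Because the held-out node sits next to both seeds, it collects the most visits and is nominated first. Running the same procedure on a different ATOMICANet changes the adjacency and therefore the ranking.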

**Figure 6.** Left: the three largest connected components of asthma on ATOMICANet-Lipid, myeloid leukemia on ATOMICANet-Ion, and hypertrophic cardiomyopathy on ATOMICANet-Small-Molecule. Right: disease proteins nominated by random walks across five ATOMICANets for lymphoma.

The “dark proteome” includes proteins with no known structure or function. Many of these lack sequence or structural similarity to annotated proteins, limiting the reach of traditional annotation tools. ATOMICA fills this gap by focusing on conserved binding site geometry (Figure 7).

ATOMICA identifies 2,646 ion and small-molecule binding sites in proteins from the dark proteome. Although these proteins are out of reach for conventional annotation methods, many functional binding sites are conserved at the structural level even when the overall protein fold or sequence is not.

ATOMICA operates on binding interfaces rather than protein sequences. This allows the model to focus on local structural features that are characteristic of specific interaction types. By fine-tuning ATOMICA on known protein-ligand complexes, we trained the model to identify which ions or small molecules are likely to bind a given interface. This approach enables the annotation of conserved functional sites even in uncharacterized proteins.

This interface-level annotation complements existing methods that predict function based on global sequence similarity. ATOMICA provides local predictions at putative binding sites, offering a finer resolution of functional inference. The annotated sites include zinc and magnesium coordination motifs, nucleotide-binding pockets, and conserved structural motifs associated with specific enzymatic activities. Examples include metallopeptidases, phosphatidate cytidylyltransferases, bacterial zinc finger domains, and cytochrome subunits.

**Figure 7.** Annotation of the dark proteome with ATOMICA. The dark proteome refers to poorly characterized proteins that are distinct in sequence and structure. ATOMICA annotates 2,646 ions and small molecules to binding sites of dark proteins.

ATOMICA offers a unified framework to represent and reason about molecular interactions. Its strength lies in its generality: by learning from diverse molecular modalities, it identifies conserved features of binding interfaces that transcend any single interaction type. This enables new applications in protein function annotation, disease pathway discovery, and therapeutic target prediction. The ATOMICA paper, code, and models are openly available:

- Paper (bioRxiv): https://www.biorxiv.org/content/10.1101/2025.04.02.646906
- GitHub: https://github.com/mims-harvard/ATOMICA
- HuggingFace: https://huggingface.co/ada-f/ATOMICA

ATOMICA reflects a broader shift in AI for Science toward modeling interactions, systems, and mechanisms rather than isolated components. It learns how atoms interact across molecular interfaces, capturing the geometric and chemical constraints that govern binding. This interaction-based modeling approach supports a more integrated understanding of biological function. Molecular interactions drive processes such as catalysis, signaling, and regulation, and modeling these interactions at atomic resolution is essential for understanding how biological systems operate and fail. By learning generalizable, structure-informed representations of interactions across molecular modalities, ATOMICA provides a foundation for constructing predictive models that accurately reflect molecular behavior within its biological context.

This blog is adapted from ATOMICA: Learning Universal Representations of Intermolecular Interactions
