MESM: integrating multi-source data for high-accuracy protein-protein interactions prediction through multimodal language models

Dataset

In this study, we utilized multi-label protein–protein interaction data from the SHS27k, SHS148k [48], SYS30k, and SYS60k [41] datasets to evaluate the performance of MESM against other PPI prediction methods. The SHS27k and SHS148k datasets were constructed by randomly selecting proteins with sequence similarity below 40% from the human PPI subset of the STRING database [44], yielding 1690 and 5189 proteins, respectively; the two datasets contain 7624 and 44,488 multi-label PPIs, respectively. To investigate MESM’s performance in predicting PPIs across different species, we selected the SYS30k and SYS60k datasets. These two datasets were constructed following the same strategy as SHS27k and SHS148k, randomly selecting 2685 and 3549 yeast proteins from the yeast PPI subset of the STRING database, with 10,806 and 22,707 multi-label PPIs, respectively. The protein data used for training, including sequence information and PDB structures, were obtained from UniProt [49] and AlphaFold [50]. However, some proteins lack structural data in these databases, necessitating the exclusion of approximately 2.8% of proteins to ensure data integrity [51].

As shown in Table 5, there are seven types of PPIs in these four datasets: Reaction, Binding, Post-translational modification (Ptmod), Activation, Inhibition, Catalysis, and Expression. The distribution of PPI types across the datasets is imbalanced, with certain types having far more samples than others. For example, in the SHS27k dataset, the sample sizes for Reaction, Binding, Activation, and Catalysis greatly exceed those of the other types, each accounting for 18% or more, while Ptmod, Inhibition, and Expression together account for only 19.08%. This imbalanced distribution may cause trained models to perform well on types with larger sample sizes, while classification performance declines significantly for PPI types with fewer samples (such as Expression and Ptmod).

Table 5 Number and ratio of seven PPI types in four datasets

Unsupervised pre-training model for extracting multimodal protein features

Existing PPI prediction methods typically adopt an end-to-end learning strategy [35], where the model is trained directly from protein feature extraction to PPI prediction. Although this approach allows the model to learn directly from the data, end-to-end learning requires the simultaneous completion of feature extraction and task optimization, which is often difficult to achieve effectively, especially when data are insufficient. Furthermore, in protein–protein interaction prediction tasks, graph neural network-based models generally require the complete PPI network graph as input, with larger datasets resulting in greater computational burdens [39]. Therefore, we adopt a stepwise approach for extracting protein features using a pre-training method. As shown in Fig. 1a, we adapt the model framework of the pre-training model MPRL [52] to the protein–protein interaction prediction task to capture more expressive multimodal protein features.

MPRL utilizes the primary and tertiary structures of proteins for unsupervised multimodal protein representation learning. For protein \({P}_{i}\in P\), its primary structure is represented as its amino acid sequence \({S}_{i}\), written as \({S}_{i}=\left({v}_{1},{v}_{2},\cdots ,{v}_{M}\right)\), where \({v}_{m}\) denotes the m-th amino acid residue in the sequence, and M is the total number of amino acid residues in the sequence. Each \({v}_{m}\) belongs to the amino acid set V, which contains 20 amino acid types. The tertiary structure of protein \({P}_{i}\) is the structure formed in three-dimensional space by the folding of the amino acid chain \({S}_{i}\) through non-covalent interactions (such as hydrogen bonds, hydrophobic interactions, etc.), describing the atomic positions and folding states of the protein in three-dimensional space, represented as \({T}_{i}=\left({x}_{1},{y}_{1},{z}_{1};{x}_{2},{y}_{2},{z}_{2};\cdots ;{x}_{M},{y}_{M},{z}_{M}\right)\), where \(\left({x}_{m},{y}_{m},{z}_{m}\right)\) denotes the spatial coordinates of the m-th amino acid residue. Protein sequence features provide fundamental insights into protein information but are limited in capturing complex spatial and dynamic characteristics, failing to represent spatial conformations. In contrast, constructing a topological network through local connections between residues characterizes the three-dimensional folding state of proteins but struggles to capture long-range spatial correlations. Meanwhile, three-dimensional point cloud spatial features precisely describe the global geometric morphology of proteins through the spatial coordinates of residues, enhancing the expression of spatial information but lacking fine-grained representation of residue features. These three types of data provide complementary information at the levels of molecular sequence, structure, and global spatial distribution, addressing the limitations of single data sources in terms of information completeness. Therefore, compared to relying solely on a single protein feature (e.g., sequence features), multimodal features can more effectively enhance protein representation learning and model performance in PPI prediction tasks through the complementary advantages among the features.

Sequence feature extraction

We designed a model called Sequence Variational Autoencoder (SVAE) based on Variational Autoencoder (VAE) [53] to extract sequence features. SVAE extracts short-range patterns through 1D convolution and combines bidirectional GRU to capture long-range dependencies across sequences, adapting to the complex structure and functional properties of proteins. Meanwhile, global average pooling compresses variable-length sequences, making them suitable for proteins of different lengths. Leveraging VAE, SVAE maps high-dimensional protein features into a continuous, smooth, and structured latent space, enabling the generation of biologically plausible sequence features while preserving key functional characteristics.

As shown in Fig. 9a, SVAE mainly consists of two parts:

Fig. 9

a The architecture of SVAE. SVAE is used to extract sequence features. It consists of two parts: an encoder and a decoder. The model is jointly optimized through reconstruction loss and KL divergence for unsupervised learning and data generation. b The architecture of VGAE. VGAE utilizes GCN to encode local interactions and spatial relationships in protein structure graphs. It predicts the existence probability of edges through an inner product decoder and optimizes the model by minimizing reconstruction loss for unsupervised learning. c The architecture of PAE. PAE extracts global features of the 3D point clouds of proteins using PointNet. It maps these features to a low-dimensional latent space and calculates the similarity between two point clouds using Chamfer distance for unsupervised learning and 3D point cloud reconstruction

  1. (1)

    Encoder: The encoder extracts sequence data into fixed-dimensional features \(X\in {\mathbb{R}}^{n\times s\times d}\) through convolutional layers, bidirectional gated recurrent units (BiGRU), and fully connected (FC) layers, where n, s, and d represent the number of proteins, the sequence length, and the amino acid residue feature dimension, respectively. It outputs the mean \(\mu\) and log-variance \(\text{log}({\sigma }^{2})\) of the latent space to generate the latent variable \(Z\in {\mathbb{R}}^{b\times d}\), where b and d represent the batch size and the sequence feature dimension, respectively.

  2. (2)

    Decoder: Starting from the latent variable \(Z\), the decoder generates reconstructed data of the input sequence features through fully connected layers and the nonlinear activation function ReLU.

The loss function of SVAE can be defined as Eq. 1:

$$L=L_{recon}+L_{KL}$$

(1)

where \({L}_{\text{recon}}\) represents the reconstruction loss, which measures the error between the model's reconstructed data and the original data, while \({L}_{\text{KL}}\) represents the KL divergence, which constrains the distribution of the latent variable Z so that q(Z|X) stays close to the standard normal distribution p(Z) = N(0,1). The reconstruction loss \({L}_{\text{recon}}\) can be expressed as Eq. 2:

$$\begin{array}{c}{L}_{\text{recon}}=\text{MSE}\left(x,\widehat{x}\right)=\frac{1}{N}\sum_{i=1}^{N}{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{2}\end{array}$$

(2)

where x denotes the input original data, \(\widehat{x}\) represents the output reconstructed by the model, and N is the number of samples. The KL divergence \({L}_{\text{KL}}\) can be expressed as Eq. 3:

$$\begin{array}{c}{L}_{\text{KL}}=-\frac{1}{2}\left(1+\text{log}\left({\sigma }^{2}\right)-{\mu }^{2}-{\sigma }^{2}\right)\end{array}$$

(3)
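The following is a minimal PyTorch sketch of such a sequence VAE. The layer widths, latent size, and the choice to reconstruct the pooled input features are illustrative assumptions; the actual SVAE architecture and hyperparameters may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVAE(nn.Module):
    """Minimal sequence VAE sketch: Conv1d -> BiGRU -> global average pooling -> (mu, logvar)."""
    def __init__(self, n_residue_feats=21, hidden=128, latent=640):
        super().__init__()
        self.conv = nn.Conv1d(n_residue_feats, hidden, kernel_size=3, padding=1)
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.fc_mu = nn.Linear(2 * hidden, latent)
        self.fc_logvar = nn.Linear(2 * hidden, latent)
        self.decoder = nn.Sequential(                      # FC layers with ReLU, as described above
            nn.Linear(latent, 2 * hidden), nn.ReLU(), nn.Linear(2 * hidden, n_residue_feats)
        )

    def forward(self, x):                                  # x: (batch, seq_len, n_residue_feats)
        h = F.relu(self.conv(x.transpose(1, 2)))           # short-range patterns via 1D convolution
        h, _ = self.gru(h.transpose(1, 2))                 # long-range dependencies via BiGRU
        h = h.mean(dim=1)                                  # global average pooling over the sequence
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar

def svae_loss(x_hat, x_target, mu, logvar):
    recon = F.mse_loss(x_hat, x_target)                    # reconstruction loss, Eq. 2
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL term, Eq. 3
    return recon + kl                                      # total loss, Eq. 1

# Toy usage: 4 proteins, 512 residues each; for simplicity the sketch reconstructs
# the mean-pooled input features rather than the full residue-level sequence.
x = torch.randn(4, 512, 21)
model = SVAE()
x_hat, mu, logvar = model(x)
loss = svae_loss(x_hat, x.mean(dim=1), mu, logvar)
```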

Structure feature extraction

To construct the protein structure graph, we use one-hot encoding to represent the amino acid type of each residue and employ the K-nearest neighbors (KNN) strategy to build edges based on the coordinates of the alpha-carbon atoms of the amino acid residues within the protein.

In this process, we calculate the Euclidean distance between each node and all other nodes and connect each node to its five nearest neighbors, constructing the protein structure graph \({G}_{r}=\left(S,{E}_{r}\right)\), where \({E}_{r}=\{({R}_{i},{R}_{j})\mid d({R}_{i},{R}_{j})\le {d}_{K},\ {R}_{i},{R}_{j}\in R,\ i\ne j\}\) denotes the set of edges whose alpha-carbon Euclidean distance lies within the K-nearest-neighbor threshold \({d}_{K}\).
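A minimal sketch of this graph construction, assuming alpha-carbon coordinates are already available as an array; the function name and return format are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_structure_graph(ca_coords: np.ndarray, k: int = 5):
    """Connect each residue to its k nearest neighbours by alpha-carbon Euclidean distance.

    ca_coords: (M, 3) array of alpha-carbon coordinates.
    Returns a directed edge list of (i, j) pairs; add the reverse pairs for an undirected graph.
    """
    tree = cKDTree(ca_coords)
    # query k+1 neighbours because the closest point to each residue is the residue itself
    _, neighbours = tree.query(ca_coords, k=k + 1)
    return [(i, j) for i, row in enumerate(neighbours) for j in row[1:]]

# Toy example: a 10-residue protein with random coordinates
coords = np.random.rand(10, 3)
print(build_structure_graph(coords, k=5)[:5])
```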

To learn and capture the local interactions and spatial proximity within the protein structure graph \({G}_{r}\), we utilize a model based on Variational Autoencoder (VAE) for unsupervised learning of graph-structured data, referred to as Variational Graph Autoencoder (VGAE) [54]. VGAE utilizes GCN [55] to extract structural features of protein graphs, encoding amino acid residues and their interactions into latent representations, and models uncertainty through variational inference to adapt to the diversity of protein structures. Its self-supervised learning effectively reconstructs protein graphs, and combined with negative sampling, enhances the discrimination of true connections, improving generalization capabilities.

As shown in Fig. 9b, VGAE primarily consists of two parts:

  1. (1)

    GCN Encoder: The GCN layers encode the input graph data, mapping the structural information of the graph into the latent space, and output the mean \(\mu\) and log-variance \(\text{log}({\sigma }^{2})\) of the latent space. Through reparameterization, \(\mu\) and \(\sigma\) are used to obtain the latent representation \(Z\in {\mathbb{R}}^{b\times d}\), where b is the product of the batch size and the number of amino acid residues in the protein, and d is the dimension of the one-hot encoding.

  2. (2)

    Inner Product Decoder: The decoder predicts the probability of edge existence from the node representations Z in the latent space, as shown in Eq. 4:

    $$\begin{array}{c}{\widehat{A}}_{ij}=Sig({Z}_{i}{Z}_{j})\end{array}$$

    (4)

where Sig is the Sigmoid activation function, and \({\widehat{A}}_{ij}\) represents the predicted probability of the edge. \({Z}_{i}\) and \({Z}_{j}\) represent the representation vectors of nodes i and j in the latent space.

The training objective of the model is to minimize the reconstruction loss, which measures the alignment between the predicted edge probabilities and the actual edges. The reconstruction loss is computed from the given positive edges (existing edges, with predicted probability \(\text{Sig}({Z}_{i}{Z}_{j})\)) and negative edges (nonexistent edges generated through negative sampling, with predicted probability \(1-\text{Sig}\left({Z}_{i}{Z}_{j}\right)\)). It is assessed using binary cross-entropy (BCE) over the presence or absence of edges, which is defined as Eq. 5:

$$\begin{array}{c}L=-\frac{1}{\left|{\mathcal{E}}^{+}\right|}{\sum }_{\left(i,j\right)\in {\mathcal{E}}^{+}}\text{logSig}\left({Z}_{i}{Z}_{j}\right) -\frac{1}{\left|{\mathcal{E}}^{-}\right|}{\sum }_{\left(i,j\right)\in {\mathcal{E}}^{-}}\text{log}\left(1-\text{Sig}\left({Z}_{i}{Z}_{j}\right)\right)\end{array}$$

(5)

where \({\mathcal{E}}^{+}\) and \({\mathcal{E}}^{-}\) represent the sets of positive and negative edges respectively.
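One way to implement this component is with PyTorch Geometric's built-in VGAE wrapper, which combines a user-defined GCN encoder with an inner-product decoder and the BCE reconstruction loss over positive and negatively sampled edges. This is a hedged sketch with illustrative layer sizes, not necessarily the configuration used in MESM.

```python
import torch
from torch_geometric.nn import GCNConv, VGAE

class GCNEncoder(torch.nn.Module):
    """Shared GCN layer followed by separate heads for the latent mean and log-std."""
    def __init__(self, in_dim=20, hidden=64, latent=32):
        super().__init__()
        self.conv = GCNConv(in_dim, hidden)
        self.conv_mu = GCNConv(hidden, latent)
        self.conv_logstd = GCNConv(hidden, latent)

    def forward(self, x, edge_index):
        h = self.conv(x, edge_index).relu()
        return self.conv_mu(h, edge_index), self.conv_logstd(h, edge_index)

# Toy residue graph: 30 residues with 20-d one-hot features and random KNN-style edges
x = torch.randn(30, 20)
edge_index = torch.randint(0, 30, (2, 150))

model = VGAE(GCNEncoder())                 # PyG's VGAE uses an inner-product decoder by default
z = model.encode(x, edge_index)            # reparameterized latent representations Z
# BCE over positive edges plus negatively sampled edges (Eq. 5), with KL regularization
loss = model.recon_loss(z, edge_index) + model.kl_loss() / x.size(0)
loss.backward()
```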

Point cloud feature extraction

Suppose the protein consists of N atoms; then the point cloud of the protein can be defined as \(\text{PC}=\{{\text{pc}}_{i}\mid {\text{pc}}_{i}=\left({x}_{i},{y}_{i},{z}_{i}\right),\ i=1,2,\cdots ,N\}\). To ensure that all point clouds have a consistent spatial distribution and size, standardization operations such as centering, scaling, and fixing the number of points are performed on the point cloud data. The point cloud data is represented as \(X\in {\mathbb{R}}^{n\times d}\), where n and d denote the number of points and the three spatial dimensions, respectively. The specific steps are as follows (a code sketch is given after the list):

  1. (1)

    Calculate the centroid of the point cloud by Eq. 6:

    $$\begin{array}{c}c=\frac{1}{N}\sum_{i=1}^{N}{pc}_{i}\end{array}$$

    (6)

  2. (2)

    Translate the point cloud to the origin by Eq. 7:

    $$\begin{array}{c}{pc}_{i}{\prime}={pc}_{i}-c\end{array}$$

    (7)

  3. (3)

    Calculate the maximum norm of the point cloud by Eq. 8:

    $$\begin{array}{c}{r}_{\text{max}}=\underset{i}{\text{max}}\parallel {pc}_{i}{\prime}\parallel \end{array}$$

    (8)

  4. (4)

    Scale the point cloud to fit within the unit sphere by Eq. 9:

    $$\begin{array}{c}{pc}_{i}^{{\prime}{\prime}}=\frac{{pc}_{i}{\prime}}{{r}_{\text{max}}}\end{array}$$

    (9)

  5. (5)

    Ensure each point cloud has the same number of points M: if N < M, pad the point cloud with zero vectors up to M points; if N > M, truncate it to M points. In our experiments, we set M to 2048.
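A minimal sketch of steps (1)-(5), operating on a NumPy array of atom coordinates; the function name is illustrative.

```python
import numpy as np

def normalize_point_cloud(pc: np.ndarray, m: int = 2048) -> np.ndarray:
    """Center, scale to the unit sphere, and pad/truncate a protein point cloud to m points.

    pc: (N, 3) array of atom coordinates.
    """
    pc = pc - pc.mean(axis=0)                    # Eqs. 6-7: translate the centroid to the origin
    r_max = np.linalg.norm(pc, axis=1).max()     # Eq. 8: maximum norm of the centered cloud
    pc = pc / r_max                              # Eq. 9: scale to fit within the unit sphere
    n = pc.shape[0]
    if n < m:                                    # step 5: pad with zero vectors
        pc = np.vstack([pc, np.zeros((m - n, 3))])
    else:                                        # step 5: truncate to the first m points
        pc = pc[:m]
    return pc

print(normalize_point_cloud(np.random.rand(1500, 3)).shape)   # (2048, 3)
```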

We employed an unsupervised learning model based on Autoencoder called PAE [56] to reconstruct the 3D point cloud representation of proteins, thereby capturing their 3D spatial information, including spatial arrangement, folding patterns, and dynamic conformational changes, as well as geometric details and global structures. PAE utilizes the PointNet encoder to extract local and global features of protein point clouds and integrates a Spatial Transformer Network (STN) for rotational alignment to enhance structural consistency, while optimizing reconstruction accuracy through Chamfer distance to better preserve the geometric structure and topological relationships of proteins.

As shown in Fig. 9c, PAE consists of two core components:

  1. (1)

    Encoder: Using PointNet [57] to extract global features from the point cloud, we further map these features to a low-dimensional latent space \(h\in {\mathbb{R}}^{b\times d}\) via fully connected layers. Here, b and d denote the batch size and the feature dimension of the point cloud, respectively.

  2. (2)

    Decoder: The decoder remaps the latent representation from the encoder back to the spatial coordinates of the point cloud data.

The model uses Chamfer Distance (CD) [58] to compute the square of the Euclidean distance from each point in one point set to the nearest point in the other point set and sums the distances to measure the similarity between the two point clouds, which is shown in Eq. 10:

$$\begin{array}{c}CD\left({S}_{1},{S}_{2}\right)=\sum_{x\in {S}_{1}} \underset{y\in {S}_{2}}{min} {\parallel x-y \parallel }_{2}^{2}+\sum_{y\in {S}_{2}} \underset{x\in {S}_{1}}{min} {\parallel x-y \parallel }_{2}^{2}\end{array}$$

(10)

Here, \({S}_{1}\) and \({S}_{2}\) represent the point sets of the point clouds.
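A short PyTorch sketch of the Chamfer distance in Eq. 10; the loop-free pairwise-distance formulation below is one common way to compute it.

```python
import torch

def chamfer_distance(s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
    """Chamfer distance between point sets s1: (N1, 3) and s2: (N2, 3), as in Eq. 10."""
    d = torch.cdist(s1, s2, p=2).pow(2)          # pairwise squared Euclidean distances, (N1, N2)
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()

# Example: compare a point cloud with a slightly perturbed copy of itself
s1 = torch.rand(2048, 3)
s2 = s1 + 0.01 * torch.randn_like(s1)
print(chamfer_distance(s1, s2))
```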

Multimodal feature extraction

After obtaining the 640-dimensional protein sequence features, protein structural features, and three-dimensional spatial features generated by SVAE, VGAE, and PAE, we perform z-score normalization on the feature vectors. This eliminates the dimensional influences among different features, allowing the features to be compared on the same scale and achieving a balanced effect in multimodal fusion. As shown in Fig. 10, we also employ an autoencoder (AE) for unsupervised learning of multimodal protein features, called the Fusion Autoencoder (FAE), with the specific steps as follows:

Fig. 10

The architecture of FAE. FAE performs unsupervised learning by compressing multimodal feature vectors and minimizing reconstruction loss, optimizing the feature fusion process to retain original information

(1) Encoder: The encoder compresses the input multi-modal feature vector Z into a low-dimensional latent space, achieving feature fusion and dimensionality reduction, and outputs the latent representation, which is shown in Eq. 11:

$$\begin{array}{c}H=ReLU\left(\text{Linear}\left(\text{Tanh}\left(\text{Linear}\left(\text{Z}\right)\right)\right)\right)\end{array}$$

(11)

where \(Z\in {\mathbb{R}}^{b\times d}\), with b the batch size and \(d=640\times 3\); \(H\in {\mathbb{R}}^{b\times k}\), with \(k=1024\).

(2) Decoder: The decoder maps the encoded latent representation H back to the original feature space, ensuring that the feature fusion process retains the original information, and outputs the reconstructed feature vector, as shown in Eq. 12:

$$\begin{array}{c}\widehat{Z}=Linear\left(\text{ReLU}\left(\text{Linear}\left(\text{H}\right)\right)\right)\end{array}$$

(12)

FAE uses mean squared error (MSE) loss as the reconstruction objective, minimizing the difference between the original input Z and the reconstructed output \(\widehat{Z}\). MSE loss can be defined as Eq. 13:

$$\begin{array}{c}{L}_{\text{MSE}}=\frac{1}{N}\sum_{i=1}^{N}{\left({Z}_{i}-{\widehat{Z}}_{i}\right)}^{2}\end{array}$$

(13)

By ensuring the effectiveness of the compressed representations and optimizing the reconstruction quality, this model effectively improves the feature fusion process.
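A minimal sketch of FAE following Eqs. 11-13; the hidden-layer widths are assumed for illustration (only the input dimension 640×3 and latent dimension 1024 are stated above).

```python
import torch
import torch.nn as nn

class FAE(nn.Module):
    """Fusion autoencoder over concatenated 640-d sequence, structure, and point-cloud features."""
    def __init__(self, in_dim=640 * 3, latent=1024):
        super().__init__()
        self.encoder = nn.Sequential(              # Eq. 11: Linear -> Tanh -> Linear -> ReLU
            nn.Linear(in_dim, latent), nn.Tanh(),
            nn.Linear(latent, latent), nn.ReLU(),
        )
        self.decoder = nn.Sequential(              # Eq. 12: Linear -> ReLU -> Linear
            nn.Linear(latent, latent), nn.ReLU(),
            nn.Linear(latent, in_dim),
        )

    def forward(self, z):
        h = self.encoder(z)
        return self.decoder(h), h

# z-score normalize each modality, concatenate, then train with MSE reconstruction loss (Eq. 13)
z = torch.randn(8, 640 * 3)
model = FAE()
z_hat, h = model(z)
loss = nn.functional.mse_loss(z_hat, z)
```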

GraphGPS

Graph position encoding [59, 60] aims to provide effective embedding representations for graphs or their subgraphs, enhancing nodes’ understanding of their positions within the overall graph. This encoding method enhances the model’s representational capability by capturing the relationships and structural features among nodes, particularly when handling complex graph data. When two nodes share similar subgraphs or structures, their structural encodings should remain close, which is crucial for many graph neural network tasks. In particular, local structure encoding (Local SE) [61] allows each node to understand its surrounding substructure. This encoding method not only facilitates the transmission of information but also effectively captures the m-hop subgraph features around the nodes. Specifically, local structure encoding enhances nodes’ perception of their surroundings by identifying features related to the nodes, such as node degrees, diagonal elements of the random walk matrix, and predefined substructural statistics. With a structural encoding of radius m, the more similar the m-hop subgraphs between two nodes, the closer their local structure encodings will be. This proximity not only reflects the similarity between nodes but also provides a rich informational foundation for downstream networks. Meanwhile, the design of local structure encoding effectively suppresses the effects of noise and redundant information, thereby enhancing the robustness and performance of the model. We generated local structure encodings according to LSPE [61]. This structural encoding computation effectively integrates the information of node i and its neighboring nodes, allowing for an expression of the node’s local features. These local features play a crucial role in graph learning, helping the model understand the position of the node within the overall structure.
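One common instance of such a local structure encoding, used by LSPE and GraphGPS, is the random-walk encoding, where node i's encoding collects the return probabilities of 1- to m-step random walks starting at i. The dense-matrix sketch below is an assumption-laden illustration (it matches the 20-dimensional structural encoding mentioned in the experimental settings if m = 20, but the exact encoding used by MESM may differ) and is only practical for small graphs.

```python
import torch

def random_walk_se(edge_index: torch.Tensor, num_nodes: int, m: int = 20) -> torch.Tensor:
    """Random-walk structural encoding: diagonal of (D^-1 A)^k for k = 1..m, per node."""
    adj = torch.zeros(num_nodes, num_nodes)
    adj[edge_index[0], edge_index[1]] = 1.0
    deg = adj.sum(dim=1).clamp(min=1.0)
    rw = adj / deg.unsqueeze(1)                  # row-normalized random-walk matrix D^-1 A
    encodings, power = [], rw
    for _ in range(m):
        encodings.append(power.diagonal())       # probability of returning to the node after k steps
        power = power @ rw
    return torch.stack(encodings, dim=1)         # shape (num_nodes, m)

edge_index = torch.randint(0, 50, (2, 200))
print(random_walk_se(edge_index, num_nodes=50).shape)   # (50, 20)
```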

When the paths between nodes are long, information may be compressed during propagation, leading to information loss. The global attention mechanism addresses this by directly connecting nodes, enabling global information interaction and effectively alleviating information compression. As shown in Fig. 11, GraphGPS [59] combines local GCN and global attention mechanisms [62], enhancing GCN’s performance in processing graph data by introducing local structure encoding and global feature capture.

Fig. 11

The architecture of GraphGPS. GraphGPS combines local graph convolutional networks with global attention mechanisms to alleviate the information compressing problem by directly connecting nodes. This effectively aggregates local and global features, enhancing the performance of graph data processing and strengthening the model’s understanding and learning ability of graph structures

The overall computation formula of GraphGPS can be defined as Eq. 14:

$$\begin{array}{c}{\text{X}}^{{\ell}+1}={\text{GPS}}^{{\ell}}\left({\text{X}}^{{\ell}},\text{A}\right)\end{array}$$

(14)

where \({\text{X}}^l\) represents the node feature matrix at layer \(l\), and \(\mathrm {A}\) denotes the adjacency matrix. The specific computation process is as follows:

(1) The local graph convolutional network effectively aggregates feature information from each node and its neighbors to update the node’s feature representation, which is shown in Eq. 15:

$$\begin{array}{c}{\text{H}}_{M}^{{\ell}+1}={\text{GCN}}^{{\ell}}\left({\text{X}}^{{\ell}},\text{A}\right)\end{array}$$

(15)

(2) Use the global attention mechanism to aggregate node features, allowing the model to consider the features of all nodes comprehensively, thus capturing global information, which is shown in Eq. 16:

$$\begin{array}{c}{\text{H}}_{T}^{{\ell}+1}={\text{GlobalAttn}}^{{\ell}}\left({\text{X}}^{{\ell}}\right)\end{array}$$

(16)

(3) Combine local features and global features through MLP to generate the final node feature representation, which is shown in Eq. 17:

$$\begin{array}{c}{\text{H}}^{{\ell}+1}={\text{MLP}}^{{\ell}}\left({\text{H}}_{M}^{{\ell}+1}+{\text{H}}_{T}^{{\ell}+1}\right)\end{array}$$

(17)

This approach helps GraphGPS understand the overall structure of the graph at a higher level during information propagation and feature learning, better combining local neighborhood information with global context and enhancing the model’s learning capacity for graph data.
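A stripped-down sketch of one GPS layer following Eqs. 15-17. The real GraphGPS layer also uses residual connections, normalization, and configurable local/global modules; the sizes below are illustrative.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GPSLayer(nn.Module):
    """Local GCN (Eq. 15) + global self-attention (Eq. 16), combined by an MLP (Eq. 17)."""
    def __init__(self, dim=1024, heads=4):
        super().__init__()
        self.local_gcn = GCNConv(dim, dim)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, edge_index):
        h_m = self.local_gcn(x, edge_index)                 # aggregate each node's neighbourhood
        h_t, _ = self.global_attn(x.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0))
        h_t = h_t.squeeze(0)                                # every node attends to every other node
        return self.mlp(h_m + h_t)                          # Eq. 17: combine local and global features

x = torch.randn(50, 1024)
edge_index = torch.randint(0, 50, (2, 200))
out = GPSLayer()(x, edge_index)
```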

GAT [63] assigns dynamic weights to nodes by aggregating node information, further refining the global and local features output by GraphGPS and extracting more meaningful features from local structures to capture finer-grained interactions and relationships.

First, for node i and one of its neighboring nodes j, apply the shared weight matrix W, as shown in Eq. 18:

$$\begin{array}{c}{h}_{i}^{\prime}=\text{W}{x}_{i},\quad {h}_{j}^{\prime}=\text{W}{x}_{j}\end{array}$$

(18)

Next, calculate the raw attention score for nodes i and j, which can be defined as Eq. 19:

$$\begin{array}{c}{e}_{ij}=\text{LeakyReLU}\left({\text{a}}^{\top }\cdot \left[{h}_{i}^{\prime}+{h}_{j}^{\prime}\right]\right)\end{array}$$

(19)

where \(\text{a}\) is a learnable parameter vector, and \({\text{h}}_{i}’+{\text{h}}_{j}’\) represents the sum of the two features, while \((\cdot )\) denotes the dot product operation.

The raw attention coefficients are normalized over all neighboring nodes of node i using the softmax function to obtain the normalized attention coefficients \({\alpha }_{ij}\), as shown in Eq. 20:

$$\begin{array}{c}{\alpha }_{ij}=\frac{\text{exp}\left({e}_{ij}\right)}{\sum_{k\in \mathcal{N}\left(i\right)\cup i} \text{exp}\left({e}_{ik}\right)}\end{array}$$

(20)

Finally, perform weighted aggregation using \({\alpha }_{ij}\) and the transformed features of its neighboring nodes to obtain the output feature of node i, as shown in Eq. 21:

$$\begin{array}{c}{x}_{i}^{\prime}=\sum_{j\in \mathcal{N}\left(i\right)\cup i}{\alpha }_{ij}{h}_{j}^{\prime}\end{array}$$

(21)
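In practice, this attention step can be realized with an off-the-shelf GAT layer; the usage sketch below relies on PyTorch Geometric's GATConv with illustrative sizes. Note that GATConv scores attention from the concatenation of the two transformed features rather than their sum as written in Eq. 19, so it is an approximation of the formulation above rather than an exact reproduction.

```python
import torch
from torch_geometric.nn import GATConv

x_gps = torch.randn(100, 1024)                   # node features output by the GraphGPS layer
edge_index = torch.randint(0, 100, (2, 400))     # toy PPI edges
gat = GATConv(in_channels=1024, out_channels=1024, heads=1)
x_refined = gat(x_gps, edge_index)               # weighted neighbourhood aggregation, cf. Eq. 21
```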

SubgraphGCN

Message Passing Neural Networks (MPNNs) [64], one of the classic implementations of Graph Neural Networks (GNNs), have been widely applied to protein–protein interaction prediction tasks and have demonstrated exceptional performance. In layer \({\ell}+1\), the feature of node v is calculated as in Eq. 22:

$$\begin{array}{c}h_v^{\left(\ell+1\right)}=\gamma\left(h_v^{\left(\ell\right)},\text{AGGREGATE}\left(\{\phi\left(h_j^{\left(\ell\right)},e_{vj}\right)\mid j\in\mathcal N\left(v\right)\}\right)\right)\end{array}$$

(22)

Here, \(\phi\) denotes the message generation function; AGGREGATE indicates the function used to aggregate messages from neighboring nodes (e.g., sum, mean, or max), and \(\gamma\) is the function that updates the node features. The choice of message, aggregation, and update functions affects the expressive power and performance of MPNNs. When all three functions are injective (one-to-one mappings), MPNNs achieve the same expressive power as the 1-Weisfeiler-Lehman (1-WL) isomorphism test [36].

During the computation of MPNN, each node follows a star pattern during message aggregation, where the central node of the graph is connected to all its neighbors and aggregates messages solely from its direct neighbors. The star pattern around the central node v is defined as Eq. 23:

$$\begin{array}{c}Star\left(v\right)=\left({\mathcal{N}}_{1}\left(v\right),\{\left(v,j\right)\in \mathcal{E}|j\in \mathcal{N}\left(v\right)\}\right)\end{array}$$

(23)

where \({\mathcal{N}}_{1}\left(v\right)\) signifies the subgraph centered at node v, including v and all nodes and edges reachable within one hop. When handling non-isomorphic regular graphs, MPNN may struggle to differentiate these graphs, weakening its discriminative ability. To address this issue, Subgraph-1-WL extends the 1-WL algorithm by generalizing the concept of the star graph Star(v) to the k-hop neighborhood \(G\left[{N}_{k}\left(v\right)\right]\) centered at v, thereby improving the expressive power for graph isomorphism detection. SubgraphGCN [65] combines Subgraph-1-WL and GCN, with its internal node feature update process defined as Eq. 24:

$$\begin{array}{c}{h}_{v}^{\left({\ell}+1\right)}={\text{GCN}}^{\left({\ell}\right)}\left(Su{b}^{\left({\ell}\right)}\left[v\right]\right),\hspace{1em}l=0,\dots ,L-1\end{array}$$

(24)

Here, \({\text{Sub}}^{\left({\ell}\right)}\left[v\right]\) indicates the one-hop subgraph centered at node v. The subgraph generation process primarily includes the following steps: First, based on the edge indices and the number of nodes in the original graph, a 1-hop neighborhood expansion is used to generate local subgraphs for each node. Simultaneously, a Boolean mask matrix is generated to indicate which nodes belong to each subgraph, and the hop counts between nodes are recorded. Next, the dense mask is converted into a sparse format for node and edge indices. Finally, the edges of all subgraphs are merged into a global edge index, while node IDs are remapped to avoid overlaps. After obtaining all subgraphs corresponding to the PPI network graph, SubgraphGCN extracts node features from the global features, integrating the central node features, mean-pooled topological features within the subgraph, and globally aggregated contextual features across subgraphs, resulting in richer and more comprehensive feature representations.

As shown in Fig. 12, SubgraphGCN computes three types of encodings:

Fig. 12

The architecture of SubgraphGCN. SubgraphGCN decomposes large graphs into smaller subgraphs by combining centroid encoding, subgraph encoding, and context encoding, utilizing the 1-hop egonet aggregation method. This enhances the model’s ability to recognize node connection patterns and improves computational efficiency

(1) Centroid encoding \({\text{h}}_{v}^{({\ell}+1)|Centroid}\) focuses on representing the features of the central node within the subgraph. It efficiently encodes the local characteristics of node v by embedding within the node’s star-shaped subgraph. This method effectively integrates the information from direct neighbors, enhancing the self-representation of the node, defined as Eq. 25:

$$\begin{array}{c}\text{h}_v^{(\ell+1)\text{|}Centroid}=\;\mathrm{Emb}\left(v\mid\text{Sub}^{\left(\ell+1\right)}\left[v\right]\right)\end{array}$$

(25)

(2) The subgraph encoding \({\text{h}}_{v}^{({\ell}+1)|\text{Subgraph}}\) aggregates features from the subgraph surrounding node v to form a comprehensive node representation. By applying the mean pooling operation, the subgraph encoding integrates rich information from neighboring nodes, thus providing an in-depth characterization of complex relationships between nodes, defined as Eq. 26:

$$\begin{array}{c}\text{h}_v^{(\ell+1)\vert \text{Subgraph}}=\text{GCN}^{\left(\ell\right)}\left(\text{Sub}^{\left(\ell\right)}\left[v\right]\right)=MEAN\left(\left\{\text{Emb}\left(i\mid\text{Sub}^{\left(\ell\right)}\left[v\right]\right)\mid i\in{\mathcal N}_k\left(v\right)\right\}\right)\end{array}$$

(26)

(3) The context encoding \({\text{h}}_{v}^{({\ell}+1)|\text{Context}}\) aims to analyze the node within different subgraph contexts to capture broader feature information. By summarizing the embeddings of the node in various contexts, context encoding ensures that the node representation encompasses information from multiple perspectives, enhancing overall expressive capability, defined as Eq. 27:

$$\begin{array}{c}{\text{h}}_{v}^{({\ell}+1)|\text{Context}}=MEAN\left(\left\{\text{Emb}\left(v\mid {\text{Sub}}^{\left({\ell}\right)}\left[j\right]\right)\mid \forall j\ \text{s.t.}\ v\in {\mathcal{N}}_{k}\left(j\right)\right\}\right)\end{array}$$

(27)

By combining centroid encoding, subgraph encoding, and context encoding, SubgraphGCN for layer \({\ell}+1\) can be defined as Eq. 28:

$$\begin{array}{c}{\text{h}}_{v}^{\left({\ell}+1\right)}=SUM\left({\text{h}}_{v}^{({\ell}+1)|\text{Centroid}},{\text{h}}_{v}^{({\ell}+1)|\text{Subgraph}}, {\text{h}}_{v}^{({\ell}+1)|\text{Context}}\right)\end{array}$$

(28)

where \(\text{Emb}\left(i\mid {\text{Sub}}^{\left({\ell}\right)}\left[j\right]\right)\) represents the embedding of node i within subgraph \({\text{Sub}}^{\left({\ell}\right)}\left[j\right]\) when applying GCN at layer \({\ell}\).
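The loop below is a didactic, per-node sketch of the three encodings over 1-hop egonets (Eqs. 25-28), built on PyTorch Geometric's k_hop_subgraph utility; it is far slower than a batched implementation and the GCN configuration is illustrative.

```python
import torch
from torch_geometric.nn import GCNConv
from torch_geometric.utils import k_hop_subgraph

def subgraph_gcn_encodings(x, edge_index, gcn: GCNConv, k: int = 1):
    """Per-node centroid, subgraph, and context encodings over k-hop egonets (Eqs. 25-28)."""
    num_nodes = x.size(0)
    out = torch.zeros(num_nodes, gcn.out_channels)
    context_sum = torch.zeros(num_nodes, gcn.out_channels)
    context_cnt = torch.zeros(num_nodes, 1)
    for v in range(num_nodes):
        nodes, sub_edges, mapping, _ = k_hop_subgraph(
            v, k, edge_index, relabel_nodes=True, num_nodes=num_nodes)
        emb = gcn(x[nodes], sub_edges)             # embeddings of every node inside Sub[v]
        centroid = emb[mapping].squeeze(0)         # Eq. 25: embedding of v in its own egonet
        subgraph = emb.mean(dim=0)                 # Eq. 26: mean pooling over the egonet
        out[v] = centroid + subgraph               # partial sum of Eq. 28
        context_sum[nodes] += emb                  # Eq. 27: each node's embedding inside Sub[v]
        context_cnt[nodes] += 1
    return out + context_sum / context_cnt.clamp(min=1)   # Eq. 28: sum of the three encodings

x = torch.randn(30, 1024)
edge_index = torch.randint(0, 30, (2, 120))
h = subgraph_gcn_encodings(x, edge_index, GCNConv(1024, 1024))
```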

By using 1-hop egonet aggregation instead of the traditional star pattern, SubgraphGCN can distinguish nodes of the same degree that exhibit different connection patterns across graphs, significantly enhancing the model’s discriminative power while capturing finer-grained structural features that reflect the complexity of the graph. Decomposing the large graph into smaller, more manageable subgraphs also allows GCN to perform detailed feature aggregation and learning within a limited scope, improving computational efficiency and enabling the model to mine the latent information of each subgraph, so that key features are extracted comprehensively from the raw heterogeneous graph for subsequent analysis and prediction. By systematically performing subgraph extraction, node feature processing, and feature aggregation, SubgraphGCN leverages the strengths of graph neural networks in modeling graph structure and better captures the relationship between local and global structures, a capability that is particularly important for PPI prediction tasks.

Classifier and loss function

Let \(P=\{{P}_{1},\cdots ,{P}_{N}\}\) be the set of proteins and \(E=\left\{\left({P}_{i},{P}_{j}\right)\mid i\ne j\right\}\) the collection of PPIs. A set of type labels \(L=\left\{{L}_{1},{L}_{2},\cdots ,{L}_{7}\right\}\) corresponds to each protein pair \(\left({P}_{i},{P}_{j}\right)\). For each interaction type that exists between \({P}_{i}\) and \({P}_{j}\), the corresponding position in the type label set L is set to 1; otherwise, it is set to 0. We treat proteins and PPIs as nodes and edges, respectively, thus constructing a comprehensive PPI network graph \(G=\left(P,E\right)\). Since each PPI has at least one and at most seven interaction types, we extract the corresponding edge sets \({E}_{k}=\left\{\left({P}_{i},{P}_{j}\right)\mid i\ne j,\ {L}_{k}=1\right\}\) to construct seven independent PPI network graphs \({G}_{k}=\left(P,{E}_{k}\right)\).

As shown in Fig. 13, the final protein representation is \({P}_{i}=\left[X,{z}_{1},{z}_{2},\cdots ,{z}_{7}\right]\), where X represents the multimodal protein features generated by the pre-trained model FAE, and \({z}_{1},{z}_{2},\cdots ,{z}_{7}\) denote the protein features learned by MESM from the seven independent PPI network graphs \({G}_{k}\), with [·] indicating horizontal concatenation. This representation not only preserves the original protein information but also integrates unique protein association information from different interaction perspectives, further enhancing MESM’s ability to learn these seven types of protein interactions.

Fig. 13

The architecture of Classifier. By horizontally concatenating multimodal protein features with features from seven different PPI network graphs, a joint representation is constructed, and interaction features of protein pairs are extracted through element-wise multiplication. Finally, a multilayer perceptron is used to estimate the interaction probabilities between proteins

We perform element-wise multiplication on \({P}_{i}\)  and \({P}_{j}\)  to construct the joint representation between protein pairs, as shown in Eq. 29:

$${\mathrm H}_{{\mathrm P}_{\mathrm i},{\mathrm P}_{\mathrm j}}={\mathrm P}_{\mathrm i}\odot{\mathrm P}_{\mathrm j}$$

(29)

This element-wise multiplication operation effectively combines the feature information of the two proteins, thus extracting their interaction features. This joint representation is then fed into MLP to estimate the interaction probability between proteins \({P}_{i}\) and \({P}_{j}\). The final output is defined as Eq. 30:

$$\begin{array}{c}{\widehat{\text{y}}}_{\left({\text{P}}_{\text{i}},{\text{P}}_{\text{j}}\right)}=Sig\left(\text{MLP}\left({{\text{H}}_{{\text{P}}_{\text{i}},{\text{P}}_{\text{j}}}}\right)\right)\end{array}$$

(30)

To train this model, we adopt the ZLPR (zero-bounded log-sum-exp & pairwise rank-based) [66] loss function, which is a loss function for multi-label classification problems and can be viewed as a natural generalization of “Softmax and cross-entropy.” The form of the ZLPR loss function is defined as Eq. 31:

$$\begin{array}{c}{\mathcal{L}}_{\text{zlpr}}=\text{log}\left(1+\sum_{\text{i}\in {\Omega }_{\text{pos}}} {\text{e}}^{-{\text{s}}_{\text{i}}}\right)+\text{log}\left(1+\sum_{\text{j}\in {\Omega }_{\text{neg}}} {\text{e}}^{{\text{s}}_{\text{j}}}\right)\end{array}$$

(31)

Here, \({s}_{i}\) is the output score of the model for the i-th class \({\lambda }_{i}\); during prediction, if \({s}_{i}>0\), it indicates that \({\lambda }_{i}\) could be a target class, while if \({s}_{i}<0\), it indicates otherwise. The ZLPR loss function is characterized by its ability to effectively capture dependencies between labels, allowing for a reasonable balance of influence between positive and negative classes in the model.
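A hedged PyTorch sketch of the classifier head and the ZLPR loss (Eqs. 29-31). The feature dimensions and MLP widths are hypothetical; the sketch trains on raw type scores, with the sigmoid of Eq. 30 applied only when converting scores to probabilities at prediction time.

```python
import torch
import torch.nn as nn

def zlpr_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """ZLPR multi-label loss (Eq. 31). scores: (batch, 7) raw logits; labels: (batch, 7) in {0, 1}."""
    neg_inf = torch.full_like(scores, float("-inf"))
    pos = torch.where(labels.bool(), -scores, neg_inf)      # keep -s_i for positive classes
    neg = torch.where(labels.bool(), neg_inf, scores)       # keep  s_j for negative classes
    zeros = torch.zeros(scores.size(0), 1, device=scores.device)
    # log(1 + sum exp(.)) computed stably by prepending a zero logit before logsumexp
    pos_term = torch.logsumexp(torch.cat([zeros, pos], dim=1), dim=1)
    neg_term = torch.logsumexp(torch.cat([zeros, neg], dim=1), dim=1)
    return (pos_term + neg_term).mean()

# Joint representation of a protein pair (Eq. 29) scored by an MLP; dimensions are illustrative
dim = 1024 * 8                                    # e.g. multimodal X plus seven graph-specific z_k
p_i, p_j = torch.randn(32, dim), torch.randn(32, dim)
mlp = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(), nn.Linear(512, 7))
scores = mlp(p_i * p_j)                           # element-wise multiplication, then MLP -> 7 scores
labels = torch.randint(0, 2, (32, 7)).float()
loss = zlpr_loss(scores, labels)
```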

Experimental settings and evaluation metrics

In this study, we employed the Adam algorithm [67] to optimize the pre-training of the multimodal protein models. The entire training process lasted 200 epochs, with the learning rate set to 0.001 to ensure effective learning. For the different network architectures, we adopted different batch sizes, setting those of SVAE, VGAE, PAE, and FAE to 256, 256, 32, and 1024, respectively. In constructing the PPI prediction network, we utilized the AdamW optimization algorithm [68], which combines Adam’s momentum-based updates with decoupled weight decay, effectively preventing overfitting. The learning rate was set to 0.0001, and training was conducted for 100 epochs to ensure that the model could fully learn the features of interactions between proteins. Additionally, we used a learning rate scheduler [68] with an initial linear warm-up followed by gradual decay, allowing the model to better adapt to parameter updates at different training stages. We selected a 1024-dimensional protein feature representation and a 20-dimensional structural encoding to capture the diversity and complexity of proteins. For each type of PPI network graph, we used a single layer each of GraphGPS, GAT, SubgraphGCN, and GCN, leveraging these graph neural networks to capture local and global graph structural information more effectively, thus enhancing the modeling capability for complex protein networks.
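A minimal sketch of such an AdamW setup with a linear warm-up followed by linear decay. The warm-up fraction, decay shape, and weight-decay value are assumptions for illustration; only the optimizer choice and base learning rate are stated above.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def warmup_then_decay(total_steps: int, warmup_frac: float = 0.1):
    """Return an LR multiplier: linear increase over the warm-up phase, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
    return lr_lambda

params = [torch.nn.Parameter(torch.randn(10, 10))]    # stand-in for the MESM parameters
optimizer = AdamW(params, lr=1e-4, weight_decay=0.01)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_decay(total_steps=100))
for step in range(100):
    optimizer.step()        # ...after computing the loss and calling loss.backward()
    scheduler.step()        # update the learning rate once per training step
```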

MESM was implemented in Python 3.10, leveraging PyTorch 2.2 for deep learning operations and PyTorch Geometric (PyG) 2.6 for graph-based computations. All experiments were conducted on an NVIDIA A100 GPU under a Linux operating system. We sequentially trained MESM on four datasets, with the model containing a total of 172.49 million trainable parameters.

Table 6 presents the computational cost of MESM, showing that as the PPI network becomes more complex, the model requires more training time and more GPU memory. More specific implementation details about the dataset and code can be obtained from the provided link.

Table 6 The computational cost of training MESM

In terms of dataset partitioning, we adopted three different partitioning strategies [35]: Breadth-First Search (BFS), Depth-First Search (DFS), and Random division, splitting the PPI dataset into training and test sets in an 8:2 ratio. During dataset partitioning, we used different random seeds so that the training and test set combinations differed across training runs, thereby enhancing the model’s generalization ability. Each experiment was repeated five times to obtain the mean and standard deviation of the results, which serve as indicators of predictive performance.

Since the distribution of different PPI types in our dataset is imbalanced, we chose Micro-F1 as the evaluation metric for multi-label classification on imbalanced datasets. Micro-F1 effectively consolidates the predictive performance across multiple classes, particularly excelling when dealing with imbalanced samples. To this end, the precision and recall for the i-th class are defined in Eqs. 32 and 33:

$$\begin{array}{c}Precision_{i}=\frac{{\text{TP}}_{i}}{{\text{TP}}_{i}+{\text{FP}}_{i}}\end{array}$$

(32)

$$\begin{array}{c}Recall_{i}=\frac{{\text{TP}}_{i}}{{\text{TP}}_{i}+{\text{FN}}_{i}}\end{array}$$

(33)

where TP (True Positive) denotes the number of samples correctly classified as positive, FP (False Positive) denotes the number of samples incorrectly classified as positive, and FN (False Negative) denotes the number of samples that are actually positive but not classified as such.

In our research task, Micro-F1 is defined in Eq. 34:

$$\begin{array}{c}{\text{Precision}}_{\text{Micro}}=\frac{\sum_{i=1}^{7}{\text{TP}}_{i}}{\sum_{i=1}^{7}{\text{TP}}_{i}+\sum_{i=1}^{7}{\text{FP}}_{i}},\quad {\text{Recall}}_{\text{Micro}}=\frac{\sum_{i=1}^{7}{\text{TP}}_{i}}{\sum_{i=1}^{7}{\text{TP}}_{i}+\sum_{i=1}^{7}{\text{FN}}_{i}}\end{array}$$

$$\begin{array}{c}\text{Micro-F}1=2\times \frac{{\text{Precision}}_{\text{Micro}}\times {\text{Recall}}_{\text{Micro}}}{{\text{Precision}}_{\text{Micro}}+{\text{Recall}}_{\text{Micro}}}\end{array}$$

(34)

By utilizing these metrics, we can comprehensively evaluate the model’s performance in a multi-label environment, ensuring that it performs well not only in specific classes but also enhances overall generalization capability.
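For reference, micro-averaged F1 over the seven PPI types can be computed directly from multi-label indicator matrices, for example with scikit-learn; the random matrices below are placeholders for actual labels and predictions.

```python
import numpy as np
from sklearn.metrics import f1_score

# y_true, y_pred: (n_pairs, 7) binary indicator matrices over the seven PPI types
y_true = np.random.randint(0, 2, size=(1000, 7))
y_pred = np.random.randint(0, 2, size=(1000, 7))
micro_f1 = f1_score(y_true, y_pred, average="micro")   # pools TP/FP/FN across all classes (Eq. 34)
print(micro_f1)
```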
