Our proposed Co-MLHAN is a self-supervised graph representation learning approach conceived for multilayer heterogeneous attributed networks. As previously discussed, a key novelty of Co-MLHAN is its higher expressiveness w.r.t. existing methods, since heterogeneity is assumed to hold at both node and edge levels, possibly for each layer of the network. This capability of simultaneously handling graphs that are multilayer, heterogeneous, and attributed enables Co-MLHAN to better model complex real-world scenarios, thus incorporating as much information as possible when generating node embeddings.
In the following, we first provide a formal definition of a multilayer heterogeneous attributed graph and of representation learning in such networks, then we move to a detailed description of Co-MLHAN. The notations used in this work are summarized in Table 9, Appendix 1.
Preliminary definitions
A multilayer graph is a set of interrelated graphs, each corresponding to one layer, with a node mapping function between any (selected) pair of layers to indicate which nodes in one graph correspond to nodes in the other one. We assume that each layer can be heterogeneous, i.e., it is characterized by nodes of different types and/or edges of different types, such that any node can be linked to nodes of the same type as well as to nodes of different types, through the same or different relations, and attributed, i.e., its nodes are associated with external information, available as a set of attributes. Therefore, each layer graph has its internal set of edges, dubbed intra-layer or within-layer edges, as well as a set of edges connecting its nodes to nodes of another layer, dubbed inter-layer or across-layer edges. Layers can be seen as different interaction contexts, semantic aspects, or time steps, while the participation of an entity in a layer can be seen as a particular entity instance. Instances of the same entity are connected via pillar-edges. We hereinafter refer to entity instances as nodes in the multilayer network. Figure 1 illustrates an example of a multilayer heterogeneous attributed graph.
Multilayer heterogeneous attributed graph.
We define a multilayer heterogeneous attributed graph as \(G_{{\mathcal{L}}}=\langle {\mathcal{L}}, {\mathcal{V}}, V_{{\mathcal{L}}}, E_{{\mathcal{L}}}, A, R, \phi , \varphi , {\varvec{\mathcal{X}}}_{{\mathcal{L}}}\rangle\), where \({\mathcal{L}} = \{G_{1}, \cdots , G_{\ell } \}\) is the set of layer graphs, indexed in \(L = \{1,\dots , \ell \}\), with \(|{\mathcal{L}}| = \ell \ge 2\), \({\mathcal{V}}\) is the set of entities, \(V_{\mathcal{L}} \subseteq {\mathcal{V}} \times {\mathcal{L}}\) is the set of nodes, \(E_{{\mathcal{L}}}\) is the set of edges, including both intra- and inter-layer edges, A is the set of entity, resp. node, types, R is the set of relation types, \(\phi :{\mathcal{V}}\rightarrow A\) is the entity-type mapping function, \(\varphi :E_{{\mathcal{L}}} \rightarrow R\) is the edge-type mapping function, and \({\varvec{\mathcal{X}}}_{{\mathcal{L}}}\) is a set of matrices storing attributes, or initial features, with \({\varvec{\mathcal{X}}}_{{\mathcal{L}}} = \bigcup _{l=1\ldots \ell } {\varvec{\mathcal{X}}}_{l}\). More specifically, entities, resp. nodes, of each type are assumed to be associated with features stored in layer-specific matrices \({\varvec{\mathcal{X}}}_{l} = \{ {\textbf{X}}^{(a)}_{l} \}\), where each \({\textbf{X}}^{(a)}_{l}\) is the feature matrix associated with entities, resp. nodes, of type \({a} \in A\) in the lth layer. Throughout this work we use the symbol \({\textbf{x}}^{(a)}_{{\langle i,l \rangle }}\) to denote the feature vector of entity \(v_{i}\) of type a in layer \(G_{l}\). We also admit that features can be layer-independent, in which case we indicate with \({\textbf{x}}^{(a)}_{i}\) the feature vector associated with entity \(v_{i}\) of type a in each layer, i.e., \({\textbf{x}}^{(a)}_{\langle i,l \rangle } = {\textbf{x}}^{(a)}_{i}\) for each \(G_{l} \in {\mathcal{L}}\).
We specify that each entity has instances (i.e., nodes) in one or more layers, and appears in at least one layer, i.e., \({\mathcal{V}} = \bigcup _{l=1\ldots \ell } {\mathcal{V}}_{l}\), with \({\mathcal{V}}_{l}\) the set of entities appearing in the lth layer. Likewise, \(A=\bigcup _{l=1\ldots \ell } A_{l}\), with \(A_{l}\) denoting the set of node types of the lth layer, \(R=\bigcup _{l=1\ldots \ell } R_{l}\), with \(R_{l}\) denoting the set of edge types of the lth layer, and \(E_{{\mathcal{L}}} = \bigcup _{r \in R}E_{r}\) \(\subseteq V_{{\mathcal{L}}} \times V_{{\mathcal{L}}}\), with \(E_{r}\) indicating all the edges of type r.
Moreover, \(E_{{\mathcal{L}}}\) can be partitioned into two sets denoting the intra-layer edges and the inter-layer edges. Note that inter-layer edges represent the coupling structure of layers; in our setting, we assume that different coupling constraints between layers might hold, e.g., all layers could be coupled with each other, only adjacent layers could be coupled, layers could follow a temporal relation order, etc. We define the set of layer pairing indices as \(L_{cross}\), where each \(\pi =(l,l') \in L_{cross}\) is a pair of coupled layers denoting an interaction between layers \(G_{l}\) and \(G_{l'}\).
We stress that, in contrast to other approaches, such as (Yang et al. 2021), in our formulation each layer \(G_{l}\) (\(l=1\ldots \ell\)) is a heterogeneous graph at both node and edge levels, i.e., \(|A_{l}|>1\) and \(|R_{l}| > 1\). Moreover, \(A_{l} \subseteq A\), for all \(G_{l} \in {\mathcal{L}}\), and \(R_{l} \subset R\), since inter-layer connections are regarded as different types of edges.
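To make the above definition concrete, the following minimal Python sketch shows one possible in-memory container for a multilayer heterogeneous attributed graph \(G_{\mathcal{L}}\); all names and the chosen data structures are illustrative assumptions, not part of the Co-MLHAN implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


# Hypothetical container mirroring the definition of G_L above;
# names and structures are illustrative only.
@dataclass
class MultilayerHetGraph:
    layers: List[str]                        # layer identifiers (indexing L)
    node_types: List[str]                    # entity/node types (A)
    relation_types: List[str]                # relation types (R)
    # nodes[layer][node_type] -> list of entity ids instantiated in that layer (V_L)
    nodes: Dict[str, Dict[str, List[int]]] = field(default_factory=dict)
    # edges[relation] -> list of ((entity, layer), (entity, layer)) pairs (E_L)
    edges: Dict[str, List[Tuple[Tuple[int, str], Tuple[int, str]]]] = field(default_factory=dict)
    # features[(layer, node_type)] -> feature matrix X^(a)_l
    features: Dict[Tuple[str, str], np.ndarray] = field(default_factory=dict)

    def is_interlayer(self, relation: str) -> bool:
        """A relation is inter-layer if it links nodes lying in different layers."""
        return any(src[1] != dst[1] for src, dst in self.edges.get(relation, []))
```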
Multilayer heterogeneous attributed graph embedding.
Given a multilayer heterogeneous attributed network \(G_{{\mathcal{L}}}\), our goal is to learn an embedding function at entity level \(g : {\mathcal{V}} \rightarrow {\mathbb{R}}^{d}\), where d is the dimension of the latent space, with \(d \ll |{\mathcal{V}}|\). Function g can be derived from an analogous function \(g' : V_{{\mathcal{L}}}\rightarrow {\mathbb{R}}^{d}\), with \(d \ll |V_{{\mathcal{L}}}|\), which is the embedding function at node level. The mapping g, resp. \(g'\), defines the latent representation of each entity \(v_{i} \in {\mathcal{V}}\), resp. node \(\langle i,l \rangle \in V_{{\mathcal{L}}}\), and we use the symbol \({\textbf{z}}_{i}\), resp. \({\textbf{z}}_{\langle i,l \rangle }\), to denote its learned embedding. The learned embeddings are eventually used to support multiple downstream graph mining tasks, e.g., entity/node classification, link prediction, node regression, etc.
Co-MLHAN: contrastive learning framework for multilayer heterogeneous attributed networks
We aim to learn node embeddings in an unsupervised manner, with function g employing graph neural networks and attention mechanisms in order to encode both structural and semantic, heterogeneous and multilayer information in the context of a multi-view contrastive mechanism.
Our proposed approach is based on the infomax principle of maximizing mutual information (Linsker 1988), both in terms of graph structure encoding—complying with the distinction between local and high-order information—and across-layer information—complying with the distinction between inter-layer edges connecting direct neighbors and pillar-edges connecting different instances of the same entity. According to this principle, we define two different structural views on the original graph: one is designed to encode the local structure of nodes and handle heterogeneity, capturing useful information from one-hop neighbors of different types (possibly from different layers), while the other is designed to encode the global structure of nodes and model information from distant nodes in the network, thus capturing useful information from multi-hop neighbors of the same type. Note that we include pillar-edges in the global view, since they are particular connections matching two instances of the same entity, thus enabling across-layer transitions, but they do not represent edges between two direct neighbors.
It should be emphasized that Co-MLHAN is conceived to be general and flexible, so as to exploit all available information while remaining effective even when such information is lacking. For instance, across-layer relations could be limited to a few replicas, nodes may show high variability in the number of neighbors, or one or more types of neighbors could be missing for some nodes.
Figure 2 shows a conceptual overview of our proposed framework. Accordingly, the final embedding for each target entity is learned through three main stages:

1.
Content encoding. Since the initial feature vectors of nodes/entities (\({\textbf{x}}\)) might be of different sizes, the first stage transforms such initial features into a shared low-dimensional latent space (\({\textbf{h}}\)). Moreover, this stage is also concerned with content encoding “from scratch”, i.e., generating initial embeddings from raw data associated with nodes/entities, which might come from multiple and heterogeneous contents, such as categorical or numerical attributes, unstructured text, and multimedia content.

2.
Graph structure encoding. According to the multi-view learning paradigm, the second stage generates two distinct embeddings for each entity, reflecting the graph structure and maximizing the mutual information: (1) embeddings for the local structure (\({\textbf{z}}^{{{\mathrm{NS}}}}\)), including information from all direct neighbors of the nodes being instances of the target entity, and (2) embeddings for the high-order structure (\({\textbf{z}}^{{{\mathrm{MP}}}}\)), including information from pillar-edges and from target nodes that can be reached through composite relations (i.e., metapaths).

3.
Final embedding based on contrastive learning. The third stage requires a joint optimization between the embeddings learned under the two views to generate the final entity embedding (\({\textbf{z}}\)). The contrastive learning mechanism is enforced by choosing suitable positive and negative samples from the original graph.
In the following sections, we elaborate on the graph structure encoding (stage 2) and the generation of the final embedding based on contrastive learning (stage 3). We examine their computational complexity aspects in Appendix 4. For the sake of readability, note also that, since the first stage of content encoding is actually beyond the objectives of this work, we discuss it in Appendix 3.
Graph structure encoding
The second stage models two graph views, named network schema view and metapath view, able to encode the local and global structure surrounding nodes, respectively, while exploiting multilayer information.
The network schema of a heterogeneous graph is an abstraction of the original graph showing the different node types and their direct connections. It is often referred to as a meta template, since it captures node and edge type mapping functions. Formally, a network schema is a directed graph defined over node types A, with edges as relation types from R. In a multilayer heterogeneous network \(G_{{\mathcal{L}}}\), the network schema includes all types A for individual layers and relations R, including both intra- and inter-layer edges. More specifically, we consider all relations involving any node \(\langle i,l \rangle\) of target type, denoted as \(R_{\langle i,l \rangle } \subseteq R\), and all node types a connected to the target node through a relation \(r \in R_{\langle i,l \rangle }\). Hereinafter, we refer to this graph as the network schema graph. Figure 3 shows an example of network schema graph for the multilayer heterogeneous attributed network of Fig. 1.
A metapath is a sequence of connected nodes making two distant nodes in the network reachable, i.e., the terminal or endpoint nodes of a metapath instance. Formally, a metapath \(M_{m}\) is a path defined on the network schema graph, in the form \({a_{1}} \xrightarrow {{r_{1}}} {a_{2}} \xrightarrow {{r_{2}}} \cdots \xrightarrow {{r_{k}}} {a_{k+1}}\), describing a composite relation \({r_{1}} \circ {r_{2}} \circ \dots \circ {r_{k}}\) between node types \({a_{1}}\) and \({a_{k+1}}\). A metapath instance of \(M_{m}\) is a sampling under the guidance of \(M_{m}\), providing a sequence of connected nodes with edges matching the composite relation in \(M_{m}\). Examples of within-layer metapath instances are depicted in Fig. 4a and b. Given a multilayer heterogeneous graph \(G_{{\mathcal{L}}}\) and a metapath \(M_{m}\), let \(N_{m}(i,l)\) denote the metapath-based neighbors of node \(\langle i,l \rangle\) of a certain type a, defined as the set of nodes of type \(a'\) that are connected with node \(\langle i,l \rangle\) through at least one metapath instance of \(M_{m}\) having a as starting node type and \(a'\) as ending node type. Note that, similarly to Wang et al. (2019), the intermediate nodes along metapaths are discarded. A metapath-based graph is a graph comprised of all the metapath-based neighbors. For metapaths with terminal nodes of the same type, the resulting graph is homogeneous at node level. Figure 4c shows an example of single-layer metapath-based graph according to a specific metapath type.
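As an illustration of how metapath-based neighbors can be materialized, the sketch below chains binary adjacency matrices along a metapath and keeps only the terminal node types, in line with the definition of \(N_{m}(i,l)\) above; the function name and the Author-Paper-Author example are hypothetical and not taken from the released implementation.

```python
import numpy as np

def metapath_neighbors(adjacencies):
    """Chain the binary adjacency matrices A_{a1,a2}, A_{a2,a3}, ..., A_{ak,ak+1}
    along a metapath a1 -r1-> a2 -> ... -> ak+1 and return a boolean matrix whose
    (i, j) entry is True iff at least one metapath instance connects terminal
    nodes i (type a1) and j (type ak+1); intermediate nodes are discarded."""
    reach = adjacencies[0].astype(int)
    for A in adjacencies[1:]:
        reach = reach @ A.astype(int)
    return reach > 0

# Example: an Author-Paper-Author (APA) metapath within a single layer.
A_ap = np.array([[1, 0],
                 [1, 1],
                 [0, 1]])                              # 3 authors x 2 papers
apa_neighbors = metapath_neighbors([A_ap, A_ap.T])     # 3 x 3 author-author matrix
```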
Following Wang et al. (2021), given a target entity, the network schema view is used to capture the local structure, by modeling information from all the direct neighbors of the corresponding target nodes, whereas the metapath view is used to capture the global structure, by modeling information from all the nodes connected to the corresponding target nodes through a metapath and from the pillar-edges derived from the corresponding metapath-based graph.
View embedding generation. The two views exploit features associated with different entity types; specifically, the network schema view takes advantage of features of neighbors of any type, while the metapath view takes advantage of features of nodes of target type involved in high-order relations.
Recall that Co-MLHAN produces for each target entity a distinct embedding under each view. Nonetheless, both views share two fundamental steps in the embedding generation: (1) aggregating information of different instances of the same type—i.e., instances of the same relation and instances of the same metapath, respectively—and (2) combining information of different types—i.e., different types of relations and of metapaths, respectively, as well as different layers.
Network schema view embedding
In the network schema view, the embedding of each target node is computed from its direct neighbors, both within and across layers. As mentioned before, the network schema is a multilayer heterogeneous graph, having nodes of different types and relations corresponding to intra- and inter-layer edges involving nodes of target type.
To generate the embeddings under the network schema view, we follow a hierarchical attention approach, consisting of two main steps, which are summarized as follows and depicted in Fig. 5:

(NSVE1)
First, we aggregate information of the same type (i.e., different instances of the same relation type) via node-level attention, learning the importance of each neighbor and obtaining, for each node, an embedding w.r.t. each relation type that involves a node of target type t.

(NSVE2)
Second, we combine information of different types (i.e., different relations in different layers) via type-level attention, learning the most relevant relations and obtaining an embedding for each node under the network schema view. Moreover, we combine information from different layers via across-layer attention, learning the importance of each layer and obtaining, for each entity, a single embedding under the network schema view.
Note that we refer to relation type and not to node type, so as to be consistent in the event that target nodes are connected to a certain node type through multiple relationships. We point out that, in accordance with the infomax principle, the network schema view does not model pillar-edges, since they are processed in the other view. We also specify that intra-layer edges in different layers are seen as different types of relations, reflecting the separation into layers according to a certain aspect. In practice, layers are an additional way of distinguishing the context of relations.
Aggregating information of different instances of the same type (NSVE1). Aggregating information of the same type (i.e., different instances of the same relation type) takes place via node-level attention. This step exploits features of nodes connected to target nodes through a direct link, whether they are of the same type as the target or not.
Given the graph \(G_{{\mathcal{L}}}\), we define a function, denoted as \(N^{(r)}(\cdot )\), that for any entity-layer pair yields its neighborhood under relation type r, regardless of the within-layer or across-layer location of the neighbors. Formally, given a target node \(\langle i,l \rangle\), we define the set of its neighbors under relation \(r \in R_{\langle i,l \rangle }\) as:
$$\begin{aligned} N^{(r)}(i,l) = \{\langle j,l' \rangle \in V_{{\mathcal{L}}} \mid (\langle j,l' \rangle ,\langle i,l \rangle ) \in E_{r} \}. \end{aligned}$$
(1)
Above, note that \(N^{(r)}(i,l)\) returns within-layer or across-layer neighbors of \(\langle i,l\rangle\) under relation r, when \(l=l'\) or \(l\ne l'\), respectively. (Recall that pillar-edges are excluded from the definition of neighbor sets.) Moreover, to ensure the aggregation of the same amount of information, we sample a fixed-size set of neighbors to be processed at each epoch, by setting a threshold value for each type of neighbor (cf. “Experimental settings” section). In our setting, neighbor sampling can be done with or without replacement. Note that this neighbor sampling approach allows for saving computational resources in the case of huge networks.
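A minimal sketch of such fixed-size neighbor sampling is given below; the function and parameter names are illustrative, and the fallback to sampling with replacement when fewer than `size` neighbors are available is an assumption made for this sketch rather than a prescribed behavior.

```python
import numpy as np

def sample_neighbors(neighbors, size, replace=False, rng=None):
    """Draw a fixed number (`size`) of neighbors for one relation type, so that
    the same amount of information is aggregated for every target node.
    `size` plays the role of the per-relation threshold mentioned above."""
    rng = rng if rng is not None else np.random.default_rng()
    neighbors = list(neighbors)
    if not neighbors:
        return []
    # If fewer neighbors than `size` are available, fall back to sampling
    # with replacement (an assumption made for this sketch).
    with_replacement = replace or len(neighbors) < size
    idx = rng.choice(len(neighbors), size=size, replace=with_replacement)
    return [neighbors[i] for i in idx]
```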
We thus define the embedding of entity \(v_{i}\) in layer \(G_{l}\) based on neighbors under relation r as:
$$\begin{aligned} {\textbf{z}}^{N^{(r)}}_{\langle i,l \rangle } = \sigma \left( \sum _{\langle j,l' \rangle \in N^{(r)}( i,l )} \alpha ^{(r)}_{\langle i,l \rangle ,\langle j,l' \rangle } {\textbf{W}}_{\textbf{2}}^{(r)} {\textbf{h}}_{\langle j,l' \rangle }\right) , \end{aligned}$$
(2)
where \({\textbf{z}}^{N^{(r)}}_{\langle i,l \rangle }\) is the embedding of node \(\langle i,l \rangle\) obtained from the neighborhood under relation r, \(\sigma (\cdot )\) is the activation function (default is ELU), \({\textbf{W}}_{{\textbf{2}}}^{(r)}\) is the weight matrix of shape (d, d) associated with one-hop neighbors \(\langle j,l'\rangle\), \({\textbf{h}}_{\langle j,l' \rangle }\) is the feature embedding of node \(\langle j,l' \rangle\), and \(\alpha ^{(r)}_{\langle i,l \rangle ,\langle j,l' \rangle }\) is the normalized attention coefficient for the relation r connecting \(\langle i,l \rangle\) and \(\langle j,l' \rangle\), indicating the importance for \(\langle i,l \rangle\) of information coming from \(\langle j,l'\rangle\), as defined in Eq. 3:
$$\begin{aligned} \begin{aligned} \alpha ^{(r)}_{\langle i,l \rangle ,\langle j,l' \rangle }&= \frac{ \exp \left( e^{(r)}_{\langle i,l\rangle ,\langle j,l'\rangle }\right) }{\sum _{\langle u,l'\rangle \in N^{(r)}(i,l) } \exp \left( e^{(r)}_{\langle i,l\rangle ,\langle u,l'\rangle }\right) }, \\ \\ {\text{with}} \quad e^{(r)}_{\langle i,l\rangle ,\langle j,l'\rangle }&= {\textbf{a}}^{{(r)}^{\mathrm{T}}}\left( LeakyReLU\left( {\textbf{W}}^{(r)}\left[ {\textbf{h}}_{\langle i,l \rangle } \,\Vert\, {\textbf{h}}_{\langle j,l' \rangle }\right] \right) \right) , \end{aligned} \end{aligned}$$
(3)
where \({\textbf{a}}^{(r)} \in {\mathbb{R}}^{d}\) is the learnable weight vector under relation r, \([{\textbf{h}}_{\langle i,l \rangle } \,\Vert\, {\textbf{h}}_{\langle j,l' \rangle }] \in {\mathbb{R}}^{2d}\) is the row-wise concatenation of the column vectors associated with the two node embeddings, and \({\textbf{W}}^{(r)} = [{\textbf{W}}_{{\textbf{1}}}^{(r)} \,\Vert\, {\textbf{W}}_{{\textbf{2}}}^{(r)}] \in {\mathbb{R}}^{d \times 2d}\) is the column-wise concatenation of \({\textbf{W}}_{{\textbf{1}}}^{(r)}\) and \({\textbf{W}}_{{\textbf{2}}}^{(r)}\), both of shape (d, d) and containing the left and right half of the columns of \({\textbf{W}}^{(r)}\), associated with destination and source nodes (one-hop neighbors), respectively.
In Eq. 3, we adopt the same approach as in GATv2 (Brody et al. 2021), which aims to fix the static attention problem of standard Graph Attention Network (GAT) (Velickovic et al. 2018) that limits its expressive power, since the ranking of attended nodes is unconditioned on the query node; on the contrary, GATv2 is a dynamic graph attention variant where the order of internal operations of the scoring function is modified to apply an MLP for computing the score of each attended node.
The self-attention mechanism can be extended similarly to Vaswani et al. (2017) by employing multi-head attention, in order to stabilize the learning process. In this case, operations are independently replicated Q times, with different parameters, and outputs are feature-wise aggregated through an operator denoted with symbol \(\bigoplus\), which usually corresponds to average (default) or concatenation:
$$\begin{aligned} {\textbf{z}}^{N^{(r)}}_{\langle i,l \rangle } = \sigma \left( \underset{q=1\dots Q}{\bigoplus } \left( \sum _{\langle j,l' \rangle \in N^{(r)}(i,l) } \alpha ^{(r,q)}_{\langle i,l \rangle ,\langle j,l' \rangle } {\textbf{W}}^{(r,q)} {\textbf{h}}_{\langle j,l' \rangle }\right) \right) , \end{aligned}$$
(4)
where \({\textbf{W}}^{(r,q)}\) and \(\alpha ^{(r,q)}_{\langle i,l \rangle ,\langle j,l' \rangle }\) denote the weight matrix and the attention coefficient for the qth attention head under relation r, respectively.
Let \({\textbf{z}}_{\langle i,l \rangle }^{N^{(r)}}\) be the embedding of a target node \(\langle i,l \rangle\) obtained from its neighbors in each layer under relation r. Downstream of node-level attention, we thus obtain one embedding per relation, i.e., the set \(\bigcup \nolimits _{r \in R_{\langle i,l \rangle }} \{ {\textbf{z}}_{\langle i,l \rangle }^{N^{(r)}} \}\).
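For illustration, the following PyTorch sketch implements a single-head version of the node-level attention of Eqs. 2–3 for one relation type, using the GATv2-style ordering of operations discussed above and the factorization \({\textbf{W}}^{(r)}[{\textbf{h}}_{\langle i,l\rangle} \,\Vert\, {\textbf{h}}_{\langle j,l'\rangle}] = {\textbf{W}}_{1}^{(r)}{\textbf{h}}_{\langle i,l\rangle} + {\textbf{W}}_{2}^{(r)}{\textbf{h}}_{\langle j,l'\rangle}\); class and variable names are illustrative, not taken from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationNodeAttention(nn.Module):
    """Single-head sketch of the node-level attention of Eqs. 2-3 for one
    relation type r, in the GATv2 style: the non-linearity is applied before
    the scoring vector, so attention is conditioned on the query node."""

    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)   # acts on the target (destination) node
        self.W2 = nn.Linear(dim, dim, bias=False)   # acts on one-hop neighbors (source nodes)
        self.a = nn.Parameter(torch.randn(dim))     # scoring vector a^(r)

    def forward(self, h_target, h_neigh):
        # h_target: (dim,) embedding of node <i,l>; h_neigh: (n_neigh, dim).
        e = F.leaky_relu(self.W1(h_target).unsqueeze(0) + self.W2(h_neigh)) @ self.a  # Eq. 3
        alpha = torch.softmax(e, dim=0)                                               # normalized coefficients
        return F.elu((alpha.unsqueeze(1) * self.W2(h_neigh)).sum(dim=0))              # Eq. 2
```

In Co-MLHAN this operation is repeated for every relation type involving nodes of target type, and extended to Q heads as in Eq. 4.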
Combining information of different types and layers (NSVE2). In order to combine information of different node types according to the different relations with target nodes, we employ type-level attention for each layer separately. For each target node \(\langle i,l \rangle\), we obtain the embedding under the network schema view \({\textbf{z}}^{{{{\mathrm{NS}}}}}_{\langle i,l \rangle }\), as defined in Eq. 5:
$$\begin{aligned} {\textbf{z}}^{{{{\mathrm{NS}}}}}_{\langle i,l \rangle } = \sum _{r \in R_{\langle i, l \rangle }} \beta ^{(r)} {\textbf{z}}^{N^{(r)}}_{\langle i,l \rangle }, \end{aligned}$$
(5)
where \(\beta ^{(r)}\) is the attention coefficient for neighborhood under relation r, which is defined as follows:
$$\begin{aligned} \beta ^{(r)} = \frac{\exp {(w^{(r)})}}{\sum _{r' \in R_{\langle i, l \rangle }} \exp {(w^{(r')})}} \quad {\text{with}} \ \ w^{(r)}= \frac{1}{|{\mathcal{V}}_{l}^{(t)}|} \sum _{\langle i,l \rangle \in {\mathcal{V}}_{l}^{(t)}} {\textbf{a}}^{{{{{\mathrm{NS}}}}}^{\mathrm {T}}} {\text{tanh}}\left( {\textbf{W}}^{{{{{\mathrm{NS}}}}}} {\textbf{z}}^{{N}^{(r)}}_{\langle i,l \rangle } + {\textbf{b}}^{{{{{\mathrm{NS}}}}}}\right) , \end{aligned}$$
(6)
where \({\mathcal{V}}_{l}^{(t)}\) is the set of entities of target type t in layer l; \({\textbf{a}}^{{{{{\mathrm{NS}}}}}} \in {\mathbb{R}}^{d}\) is the typelevel attention vector; \(\textbf{W}^{{{{\mathrm{NS}}}}}\) and \({\textbf{b}}^{{{{\mathrm{NS}}}}}\) are the learnable weight matrix and the bias term, respectively, under the network schema view, shared by all relation types. We hence obtain the set of embeddings \(\bigcup \nolimits _{l \in L} \{ {\textbf{z}}^{{{{\mathrm{ NS}}}}}_{\langle i,l\rangle }\}\) under the network schema view for each target node.
In order to map the learned node embeddings into the same space as the contrastive loss function, we apply an additional level of attention, i.e., across-layer attention. This is designed to evaluate the importance of each layer of \(G_{{\mathcal{L}}}\) and combine the node features layer-wise. We thus obtain an embedding under the network schema view for each target entity \(v_{i}\), as defined in Eq. 7:
$$\begin{aligned} {\textbf{z}}^{{{{\mathrm{NS}}}}}_{i} = \sum _{l \in L} \beta ^{(l)} {\textbf{z}}^{{{{\mathrm{ NS}}}}}_{\langle i,l \rangle }, \end{aligned}$$
(7)
where \(\beta ^{(l)}\) is the learned attention coefficient for layer \(G_{l}\), computed via the same attention model as in Eq. 6, where in this case the learnable weights are shared by all layers.
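The attention model of Eq. 6 is reused, with different parameters, for type-level, semantic-level, and across-layer attention. A compact sketch of this shared pattern is given below; the module name and the hidden size are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Sketch of the attention model in Eq. 6: each group of embeddings
    (one group per relation, metapath, or layer) is scored with a shared
    vector after averaging over the target nodes, and the groups are then
    combined with softmax-normalized coefficients (Eqs. 5 and 7)."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.W = nn.Linear(dim, hidden)            # W^NS (with bias b^NS)
        self.a = nn.Linear(hidden, 1, bias=False)  # attention vector a^NS

    def forward(self, z):
        # z: (num_groups, num_nodes, dim), one slice per relation/layer/metapath.
        w = self.a(torch.tanh(self.W(z))).mean(dim=1)   # (num_groups, 1), Eq. 6
        beta = torch.softmax(w, dim=0)                  # importance of each group
        return (beta.unsqueeze(-1) * z).sum(dim=0)      # (num_nodes, dim)
```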
Metapath view embedding
In the metapath view, the embedding of each target node is computed from its metapath-based neighbors and from the pillar-edges derived from the corresponding metapath-based graph. Recall that each layer of a metapath-based graph is a homogeneous network with nodes corresponding to a subset of target nodes and edges as connections of metapath-based neighbors, including across-layer information matching pillar-edges.
We consider metapaths of any length, starting and ending with nodes of target type; indeed, information on intermediate nodes can be discarded, as it is included in the network schema view. Note that considering multiple metapaths allows us to deal with multiple semantic spaces (Lin et al. 2021), and our framework is designed to handle an arbitrary number of metapaths. Also, in case a layer does not contain any node of target type, the layer is discarded from the resulting multilayer graph. Yet, our framework admits the worst case of \(\ell - 1\) layers missing for a metapath type.
Analogously to the network schema view, the metapath view embedding generation consists of two main steps (Fig. 6):

(MPVE1)
First, we aggregate information of the same type, this time intended as several instances of the same metapath and encoded via metapath-specific Graph Convolutional Networks (GCNs) (Kipf and Welling 2017), obtaining, for each target node, an embedding w.r.t. each metapath type.

(MPVE2)
Second, we combine information of different types (i.e., different metapaths in different layers) and layers (i.e., different metapaths across layers) via semantic attention, learning the importance of each metapath and obtaining an embedding for each target node and entity under the metapath view.
In the following, we first describe the process of metapath view embedding generation according to the basic Co-MLHAN approach. Next, in the “Alternative metapath view embedding: Co-MLHAN-SA” section, we shall describe an alternative strategy, called Co-MLHAN-SA, which differs from Co-MLHAN in the way across-layer information relating to pillar-edges is modeled.
Aggregating information of different instances of the same type (MPVE1). The first step of embedding generation under the metapath view is to aggregate information of the same type, which corresponds to several instances of a given metapath. More specifically, we consider all p metapaths \(\mathcal{M} = \{M_{1}, \dots , M_{p}\}\) involving nodes of target type, where each metapath \(M_{m}\) matches a multilayer graph with at most \(\ell\) layers.
In the metapath view, across-layer dependencies are modeled as particular types of metapaths, i.e., across-layer metapaths. They refer to the same composite relation, with the additional constraint that the terminal nodes belong to different layers, and that the intermediate node matches a pillar-edge, i.e., it corresponds to an entity (of type different from the target one) with both instances involved in the composite relation. An example is illustrated in Fig. 7. We define the set of across-layer metapaths, \({\mathcal{M}}^{\Updownarrow }\), as the union of all metapaths of any type defined over all layer pairs.
To identify the metapath-based neighbors of each node, we define two functions, denoted as \(N^{\Leftrightarrow }(\cdot )\) and \(N^{\Updownarrow }(\cdot )\), which for each node return the within-layer and across-layer neighborhood, respectively. Formally, we define the set of within-layer neighbors of node \(\langle i, l \rangle\), according to the mth (within-layer) metapath type, as:
$$\begin{aligned} N_{m}^{\Leftrightarrow }(i,l)= \{\langle j,l \rangle \in V_{{\mathcal{L}}} \mid \langle j,l \rangle \in N_{m}(i,l)\}. \end{aligned}$$
(8)
Similarly, we define the set of across-layer neighbors of node \(\langle i, l \rangle\), according to the mth (across-layer) metapath type, as follows:
$$\begin{aligned} N_{m}^{\Updownarrow }(i,l) = \{\langle j,l' \rangle \in V_{{\mathcal{L}}} \mid \langle j,l' \rangle \in N_{m}(i,l) , \ l' \ne l\}. \end{aligned}$$
(9)
Note that Eqs. 8 and 9 identify the metapath-based neighborhood of type \(M_{m}\) for node \(\langle i,l \rangle\), with m referring to a within- or across-layer metapath, respectively; in particular, \(N_{m}^{\Leftrightarrow }(i,l) \equiv N_{m}(i,l)\).
Given any target node \(\langle i,l \rangle\), we apply a metapath-specific graph neural network \(f_{m}\) (with K hidden layers) in order to compute its embedding according to the mth metapath; formally, at each kth layer:
$$\begin{aligned} {\textbf{z}}^{(k+1)}_{\langle i,l \rangle } = {\left\{ \begin{array}{ll} f_{m}^{ (k+1)}\left( {\textbf{z}}^{(k)}_{\langle i,l \rangle },\ \ \bigoplus \left\{ {\textbf{z}}_{\langle j,l \rangle }^{(k)} \mid \langle j,l \rangle \in N_{m}^{\Leftrightarrow }(i,l)\right\} \right) &{} {\text{if }} m {\text{ is a within-layer metapath}}\\ \\ f_{m}^{ (k+1)}\left( {\textbf{z}}^{(k)}_{\langle i,l \rangle },\ \ \bigoplus \left\{ {\textbf{z}}_{\langle j,l' \rangle }^{(k)} \mid \langle j,l' \rangle \in N_{m}^{\Updownarrow }(i,l)\right\} \right) &{} {\text{if }}m {\text{ is an across-layer metapath}}\\ \end{array}\right. } \end{aligned}$$
(10)
where \({\textbf{z}}^{(0)}_{\langle i,l \rangle } = {\textbf{h}}_{\langle i,l \rangle }\) is the feature embedding computed in the first stage, and \(\bigoplus\) denotes an arbitrary differentiable function, aggregating feature information from the local neighborhood of nodes [e.g., summation, a pooling operator, or even a neural network (Wang et al. 2020)]. Similarly to Wang et al. (2021), we use a GCN architecture as \(f_{m}\), for all \(M_{m}\) \((m=1\ldots p)\) in Eq. 10, assuming no different contribution from different instances of the same metapath.
More specifically, given the mth within-layer metapath and \({\mathcal{A}} = \{ {\textbf{A}}_{1}, \ldots , {\textbf{A}}_{\ell } \}\) the set of adjacency matrices associated with the corresponding metapath-based graph, with \({\textbf{A}}_{l} \in {\mathbb{R}}^{n_{l} \times n_{l}}\) (\(l=1\ldots \ell\)) the adjacency matrix associated with layer l, the GCN for layer \(G_{l}\) is defined as follows:
$$\begin{aligned} {\textbf{z}}_{\langle i,l \rangle }^{(k+1)}= \sigma \left( \sum _{\langle j,l \rangle \in N^{\Leftrightarrow }(i,l)}\frac{1}{\sqrt{{\widetilde{\textbf{D}}}^{l}_{ii}{\widetilde{\textbf{D}}}^{l}_{jj}}} {{\textbf{W}}^{(k,l)}}^{\mathrm{T}} {\textbf{z}}_{\langle j,l \rangle }^{(k)} \right) \end{aligned}$$
(11)
where \(\sigma (\cdot )\) is a nonlinear activation function (default is \(ReLU(\cdot ) = max(0,\cdot )\)), \({\textbf{W}}^{(k,l)}\) is the trainable weight matrix of shape (d, d) for the mth metapath in the kth convolutional layer, and \({\widetilde{\textbf{D}}}^{l}_{ii}=\sum _{j}{\widetilde{{\textbf{A}}}}^{l}_{ij}\) is the degree matrix derived from \({\widetilde{{\textbf{A}}}}_{l} = {\textbf{A}}_{l} + {\textbf{I}}_{n_{l}}\), with \({\textbf{I}}_{n_{l}}\) the identity matrix of size \(n_{l}\), and \(n_{l}\) the number of nodes of layer \(G_{l}\). The GCN model for across-layer metapaths is built similarly, considering \(N^{\Updownarrow }(\cdot )\) instead of \(N^{\Leftrightarrow }(\cdot )\) and \(\pi\) instead of l.
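A minimal sketch of the per-layer GCN propagation of Eq. 11 (symmetric normalization over the self-loop-augmented adjacency matrix) is reported below, assuming dense NumPy matrices for readability; a practical implementation would typically rely on sparse operations.

```python
import numpy as np

def gcn_layer(A, Z, W, activation=lambda x: np.maximum(0, x)):
    """One metapath-specific GCN layer for a single layer graph G_l (Eq. 11):
    symmetric normalization of the self-loop-augmented adjacency matrix,
    followed by a linear transformation and a non-linearity (ReLU here)."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                       # \tilde{A}_l = A_l + I
    d = A_tilde.sum(axis=1)                       # degrees \tilde{D}_ii
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # D^{-1/2} \tilde{A} D^{-1/2}
    return activation(A_hat @ Z @ W)              # (n_l, d) updated node embeddings
```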
Let \({\textbf{z}}^{(m)}_{\langle i, l \rangle }\) and \({\textbf{z}}^{(m)}_{\langle i,\pi \rangle }\) be the node embeddings associated with the mth within-layer (resp. across-layer) metapath of node \(\langle i,l \rangle\) (resp. layer pair \(\pi\)). Downstream of the metapath-specific GNNs, we obtain the node embeddings \(\{{\textbf{z}}^{(m)}_{\langle i,l \rangle } \mid l \in L, \ m=1\dots p\} \cup \{{\textbf{z}}^{(m)}_{\langle i,\pi \rangle } \mid \langle m,\pi \rangle \in \mathcal{M}^{\Updownarrow }\}\).
Combining information of different types and layers (MPVE2). Once the metapath-specific embeddings are obtained for each target node, we employ semantic-level attention to combine different metapath types, including both intra- and inter-layer information. Given a node \(\langle i,l \rangle\), the embedding under the metapath view is computed as follows:
$$\begin{aligned} {\textbf{z}}^{{{{\mathrm{ MP}}}}}_{\langle i, l \rangle } = \sum _{m = 1}^{p} \beta ^{(m,l)} {\textbf{z}}_{\langle i,l \rangle }^{(m)} + \lambda ^{\Updownarrow } \left( \sum _{m = 1}^{p} \sum _{\pi \,\mid\, l \in \pi } \beta ^{(m,\pi )} {\textbf{z}}_{\langle i,\pi \rangle }^{(m)}\right) , \end{aligned}$$
(12)
where \(\beta\) is the attention coefficient denoting the importance of each type of within-layer and across-layer metapath (cf. Eq. 6) and \(\lambda ^{\Updownarrow } \in [0,1]\) is a balancing coefficient denoting the importance of inter-layer connections.
In order to project the node embedding into the same space as the loss function—analogously to the network schema view—we aggregate the embeddings obtained from each layer with a sum operator, as follows:
$$\begin{aligned} {\textbf{z}}_{i}^{{{{\mathrm{MP}}}}} =\sum _{l \in L} {\textbf{z}}^{{{{\rm MP}}}}_{\langle i, l \rangle }. \end{aligned}$$
(13)
Note that Eq. 13 does not require an additional level of attention, since the layer dependency has already been taken into account by the attention mechanism in Eq. 12. Therefore, Eqs. 12 and 13 can be combined as follows:
$$\begin{aligned} {\textbf{z}}_{i}^{{{{\rm MP}}}} = \underbrace{\sum _{m = 1}^{p} \sum _{l \in L} \beta ^{(m,l)} {\textbf{z}}_{\langle i,l \rangle }^{(m)}}_{\text{within-layer}} + \underbrace{\lambda ^{\Updownarrow } \left( \sum _{m = 1}^{p} \sum _{\pi \in L_{cross}} \beta ^{(m,\pi )} {\textbf{z}}_{\langle i,\pi \rangle }^{(m)}\right) }_{\text{across-layers}}. \end{aligned}$$
(14)
Equation 14 hence enables the direct computation of the final embedding under the metapath view for each entity \(v_{i}\).
Alternative metapath view embedding: Co-MLHAN-SA
Our alternative approach for embedding generation under the metapath view is named Co-MLHAN-SA, where the suffix ‘SA’ refers to the supra-adjacency matrix modeling each metapath-based graph. The supra-adjacency matrix, denoted as \({\textbf{A}}^{\mathrm{sup}}\), has diagonal blocks each representing a layer-specific adjacency matrix (i.e., \({\textbf{A}}_{l} \in {\mathbb{R}}^{n_{l} \times n_{l}}\), with \(l=1\ldots \ell\)), and off-diagonal blocks each corresponding to the inter-layer adjacency matrix \({\textbf{A}}_{\pi }\) for layer pair \(\pi =(l,l')\), with values equal to 1 if an edge between \(\langle i,l \rangle\) and \(\langle j,l' \rangle\) exists, with \(l \ne l'\), and 0 otherwise.
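For illustration, the following sketch assembles a supra-adjacency matrix from per-layer adjacency blocks and pillar-edge couplings; the input format (a dictionary mapping layer pairs to index pairs) and the function name are assumptions made for the example.

```python
import numpy as np

def supra_adjacency(layer_adjs, pillar_pairs):
    """Assemble the supra-adjacency matrix A^sup of a metapath-based graph:
    diagonal blocks are the per-layer adjacency matrices, off-diagonal blocks
    contain a 1 wherever a pillar-edge couples a node in layer l with a node
    in layer l'. `pillar_pairs` maps a layer pair (l, l') to a list of
    (row_in_l, row_in_l') index pairs."""
    sizes = [A.shape[0] for A in layer_adjs]
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    N = int(offsets[-1])
    A_sup = np.zeros((N, N))
    for l, A in enumerate(layer_adjs):                 # diagonal blocks
        A_sup[offsets[l]:offsets[l + 1], offsets[l]:offsets[l + 1]] = A
    for (l, lp), pairs in pillar_pairs.items():        # off-diagonal blocks
        for i, j in pairs:
            A_sup[offsets[l] + i, offsets[lp] + j] = 1
            A_sup[offsets[lp] + j, offsets[l] + i] = 1
    return A_sup
```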
To give an intuition, we model across-layer information downstream of semantic attention, by accounting for another level of attention, i.e., across-layer attention (by analogy with the network schema view).
We thus learn the importance of different (within-layer) metapaths via semantic attention, obtaining an embedding under the metapath view for each node, and we subsequently learn the importance of each layer via across-layer attention, obtaining an embedding under the metapath view for each entity.
As in the basic Co-MLHAN approach, the metapath view embedding generation in Co-MLHAN-SA consists of two main steps (Fig. 8):

(MPVESA1)
First, we aggregate information of the same type, intended as several instances of the same metapath and encoded via metapath-specific GCNs, obtaining, for each node, an embedding w.r.t. each metapath. Unlike MPVE1, the first step of the Co-MLHAN-SA approach hence handles the inter-layer dependencies derived from pillar-edges.

(MPVESA2)
Second, we combine information of different types (i.e., different metapaths in different layers) via semantic attention, learning the importance of each metapath and obtaining an embedding for each target node under the metapath view. Moreover, we combine information from different layers via across-layer attention, learning the importance of each layer and obtaining, for each target entity, a single embedding under the metapath view.
By avoiding the definition of across-layer metapaths \(\mathcal{M}^{\Updownarrow }\), Co-MLHAN-SA requires a limited number of learnable parameters, as it utilizes a metapath-specific GCN shared by all layers \(G_{l}\).
Aggregating information of different instances of the same type (MPVESA1). We still use the notation \(N^{\Leftrightarrow }(\cdot )\) and \(N^{\Updownarrow }(\cdot )\) to indicate the set of within-layer and across-layer neighbors, respectively. While the definition of \(N^{\Leftrightarrow }(\cdot )\) does not change w.r.t. Eq. 8, the definition of \(N^{\Updownarrow }(\cdot )\) in the Co-MLHAN-SA approach is modified in the modeling of pillar-edges, by directly considering all the instances of the same target entity in other layers, as shown in Eq. 15:
$$\begin{aligned} N_{m}^{\Updownarrow }(i,l) = \{\langle i,l' \rangle \in V_{{\mathcal{L}}} \mid l' \ne l\}. \end{aligned}$$
(15)
Similarly to MPVE1, we apply a metapath-specific GNN for aggregating different metapath instances of the same type:
$$\begin{aligned} {\textbf{z}}^{(k+1)}_{\langle i,l \rangle } = f_{m}^{ (k+1)}\left( {\textbf{z}}^{(k)}_{\langle i,l \rangle }, \bigoplus ^{k} \left( \left\{ {\textbf{h}}_{\langle j,l \rangle }^{(k)} \mid \langle j,l \rangle \in N^{\Leftrightarrow }(i,l) \cup N^{\Updownarrow }(i,l) \right\} \right) \right) . \end{aligned}$$
(16)
Unlike MPVE1, the inter-layer dependencies are taken into account by the GNN, employing a modified version of the propagation rule that can handle the supra-adjacency matrix as input. We thus build for each metapath its corresponding metapath-based supra-graph, i.e., a graph where pillar-edges exist between every node and its counterpart in other coupled layers. In our setting, we instantiate \(f_{m}\) with a multilayer GCN model (Zangari et al. 2021), as shown in Eq. 17:
$$\begin{aligned} {\textbf{z}}_{\langle i,l \rangle }^{(k+1)} = \sigma \left( \sum _{\langle j,l' \rangle \in N^{\Leftrightarrow }(i,l) \cup N^{\Updownarrow }(i,l) }\frac{1}{\sqrt{{\widetilde{\textbf{D}}}_{ii}{\widetilde{\textbf{D}}}_{jj}}} {{\textbf{W}}^{(k,m)}}^{\mathrm {T}} \delta (l,l') \ {\textbf{z}}_{\langle j,l' \rangle }^{(k)} \right) , \end{aligned}$$
(17)
where the degree matrix \({\widetilde{\textbf{D}}}\) is built considering both inter-layer and intra-layer links of nodes using the supra-adjacency matrix of the graph, i.e., \({\widetilde{\textbf{D}}}_{ii}=\sum _{j}{\widetilde{{\textbf{A}}}}^{\mathrm{sup}}_{ij}\), where \({\widetilde{{\textbf{A}}}}^{\mathrm{sup}}\) is the supra-adjacency matrix with self-loops added, and \(\delta (l,l')\) is a scoring function denoting the weight coefficient for inter-layer links, ranging between 0 and 1, with values equal to \(\lambda ^{\Updownarrow }\) if \(l \ne l'\), and 1 otherwise.
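A sketch of the propagation rule of Eq. 17 is given below, reusing the supra-adjacency construction above and down-weighting inter-layer links through \(\delta (l,l')\); dense matrices and the function signature are illustrative assumptions, not the released implementation.

```python
import numpy as np

def multilayer_gcn_layer(A_sup, layer_of_node, Z, W, lam,
                         activation=lambda x: np.maximum(0, x)):
    """Sketch of Eq. 17: propagation over the supra-adjacency matrix, where
    each message from <j,l'> to <i,l> is scaled by delta(l,l') (1 within a
    layer, lambda across layers) and normalized by the degrees of the
    self-loop-augmented supra-adjacency matrix.
    `layer_of_node` is an array giving the layer index of each row of A_sup."""
    N = A_sup.shape[0]
    A_tilde = A_sup + np.eye(N)                              # add self-loops
    d = A_tilde.sum(axis=1)                                  # \tilde{D}_ii
    norm = 1.0 / np.sqrt(np.outer(d, d))                     # 1 / sqrt(D_ii D_jj)
    same_layer = layer_of_node[:, None] == layer_of_node[None, :]
    delta = np.where(same_layer, 1.0, lam)                   # delta(l, l')
    coeff = norm * A_tilde * delta                           # per-edge coefficient
    return activation(coeff @ Z @ W)
```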
Let \({\textbf{z}}^{(m)}_{\langle i,l\rangle }\) be the embedding of node \(\langle i,l \rangle\) associated with the mth metapath. We thus obtain the metapath-specific embeddings \(\bigcup \nolimits _{\begin{array}{c} m=1 \dots p \\ l \in L \end{array}} \{{\textbf{z}}^{(m)}_{\langle i,l\rangle }\}\).
Combining information of different types and layers (MPVESA2). Once the metapath-specific embeddings are obtained for each target node, we employ semantic-level attention to combine different metapath types, obtaining for each node \(\langle i, l \rangle\) an embedding under the metapath view, which is defined as follows:
$$\begin{aligned} {\textbf{z}}^{{{{\rm MP}}}}_{\langle i,l \rangle } = \sum _{m =1 \dots p} \beta ^{(m,l)} {\textbf{z}}_{\langle i,l \rangle }^{(m)}, \end{aligned}$$
(18)
where \(\beta ^{(m,l)}\) is an attention coefficient computed as in Eq. 6.
In order to project the node embedding into the same space as the loss function, we apply an additional level of attention, named across-layer attention, similarly to the network schema view, thus obtaining for each entity \(v_{i}\) an embedding under the metapath view:
$$\begin{aligned} {\textbf{z}}_{i}^{{{{\rm MP}}}} = \sum _{l \in L} \beta ^{(l)}{\textbf{z}}^{{{{\rm MP}}}}_{\langle i,l \rangle }, \end{aligned}$$
(19)
where \(\beta ^{(l)}\) is the attention coefficient denoting the importance of the lth layer, computed similarly to Eq. 6.
Final embedding based on contrastive learning
The third stage of the proposed framework is concerned with the exploitation of a contrastive learning mechanism to produce the final entity embeddings, pulling together similar entities and pushing apart dissimilar ones in the embedding space. We combine the contrastive losses computed according to each view, with individual nodes of both positive and negative pairs selected from distinct views.
Given the embeddings \({\textbf{z}}_{i}^{{{{\rm NS}}}}\) (Eq. 7) and \({\textbf{z}}_{i}^{{{{\rm MP}}}}\) (either Eq. 14 or Eq. 19) for each target entity \(v_{i}\), we transform them into the same space in which a contrastive loss function is computed, by employing a simple MLP architecture with one hidden layer, as defined in Eq. 20:
$$\begin{aligned} \begin{aligned} \hat{{\textbf{z}}}_{i}^{{{{\rm NS}}}}&= {\textbf{W}}^{(2)} \sigma ({\textbf{W}}^{(1)}{\textbf{z}}_{i}^{{{{\rm NS}}}} + {\textbf{b}}^{(1)}) + {\textbf{b}}^{(2)}, \\ \hat{{\textbf{z}}}_{i}^{{{{\rm MP}}}}&= {\textbf{W}}^{(2)} \sigma ({\textbf{W}}^{(1)}{\textbf{z}}_{i}^{{{{\rm MP}}}} + {\textbf{b}}^{(1)}) + {\textbf{b}}^{(2)}, \end{aligned} \end{aligned}$$
(20)
where \({\textbf{W}}^{(2)}\), \({\textbf{W}}^{(1)}\), \({\textbf{b}}^{(2)}\) and \({\textbf{b}}^{(1)}\) are learnable weights shared by both views and \(\sigma (\cdot )\) is the activation function (default is ELU).
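A minimal sketch of the shared projection head of Eq. 20 follows; the module name and hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjection(nn.Module):
    """One-hidden-layer MLP of Eq. 20, with parameters shared by both views,
    mapping z^NS and z^MP into the space where the contrastive loss is computed."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)   # W^(1), b^(1)
        self.fc2 = nn.Linear(hidden, dim)   # W^(2), b^(2)

    def forward(self, z):
        return self.fc2(F.elu(self.fc1(z)))
```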
The contrastive loss according to a certain view is computed on pairs of positive and negative samples. While earlier contrastive learning approaches were based on one or more negatives and a single positive for each instance, we follow the more recent trend of using both multiple positive and multiple negative pairs (Khosla et al. 2020; Wang et al. 2021). Each target entity \(v_{i}\) can hence rely on more than one positive (at least itself, under the other view). For positive sampling, the idea is to select the best nodes connected by multiple metapath instances, since metapath-based neighbors have a higher probability of being similar to each other. For negative sampling, we simply regard everything that is not positive as negative.
We first proceed to the selection of positive samples. For this purpose, we count the metapath instances connecting each pair of target entities, considering all metapath types on individual layers, as shown in Eq. 21:
$$\begin{aligned} C_{i,j} = \sum _{l \in L} \sum _{m = 1 \dots p} \left| \{j \mid \langle j,l \rangle \in N_{m}^{\Leftrightarrow }(i,l)\} \right| . \end{aligned}$$
(21)
For each target entity \(v_{i}\), we obtain a set \({\mathcal{S}}_{i}= \{v_{j} \in {\mathcal{V}} \mid C_{i,j}>0 \}\), which is sorted by decreasing values of \(C_{i,j}\). Given a threshold \(T_{pos}\), we select for each entity itself and the best \(T_{pos}-1\) entities as positives, obtaining a subset \(\overline{{\mathcal{S}}}_{i} \subseteq {\mathcal{S}}_{i}\) with \(|\overline{{\mathcal{S}}}_{i}| \le T_{pos}-1\); all the remaining \(|{\mathcal{V}}|-T_{pos}\) entities are regarded as negatives for \(v_{i}\). Therefore, for each entity \(v_{i}\), we define the set of positive samples \({\mathcal{P}}_{i}\) as \({\mathcal{P}}_{i}=\{v_{i}\} \cup \{ v_{j} \mid v_{j} \in \overline{{\mathcal{S}}}_{i} \}\) and the set of negative samples \({\mathcal{N}}_{i}\) as \({\mathcal{N}}_{i}= {\mathcal{V}} {\setminus } {\mathcal{P}}_{i}\).
We stress that for the selection of positives we only exploit structural information, without using any information derived from the encoding of external content (i.e., initial features) of entities. Nonetheless, additional conditions on metapaths in the selection of entity pairs can be defined, e.g., by diversifying the minimum number of instances required to enable the enumeration of a specific metapath. Co-MLHAN is flexible in both the metapath counting method and the overall positive and negative selection strategy.
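The following sketch illustrates one possible implementation of this positive/negative selection, given the count matrix C of Eq. 21 and the threshold \(T_{pos}\); function and variable names are assumptions made for the example.

```python
import numpy as np

def select_positives(C, t_pos):
    """Given the entity-by-entity matrix C of metapath-instance counts (Eq. 21)
    and the threshold T_pos, return for each entity the set of positives
    (itself plus at most T_pos - 1 entities with the largest counts) and the
    complementary set of negatives."""
    n = C.shape[0]
    positives, negatives = [], []
    for i in range(n):
        ranked = np.argsort(-C[i])                                   # decreasing C_{i,j}
        best = [int(j) for j in ranked if C[i, j] > 0 and j != i][: t_pos - 1]
        pos = {i, *best}
        positives.append(pos)
        negatives.append(set(range(n)) - pos)
    return positives, negatives
```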
For the computation of contrastive losses according to a given view, the embedding of each target entity \(v_{i}\) is selected from the given view, while the positive and negative samples are selected from the other view, as defined in Eqs. 22 and 23, and illustrated in Fig. 9:
$$\begin{aligned} L^{{{{\rm NS}}}}= & {} -\log \frac{\sum _{j \in {\mathcal{P}}_{i}} \exp \left( sim \left( \hat{{\textbf{z}}}^{{{{\rm NS}}}}_{i},\hat{{\textbf{z}}}^{{{{\rm MP}}}}_{j}\right) /\tau \right) }{\sum _{u \in {{\mathcal{P}}_{i} \cup {\mathcal{N}}_{i}}} \exp \left( sim \left( \hat{{\textbf{z}}}^{{{{\rm NS}}}}_{i},\hat{{\textbf{z}}}^{{{{\rm MP}}}}_{u}\right) /\tau \right) }, \end{aligned}$$
(22)
$$\begin{aligned} L^{{{{\rm MP}}}}= & {} -\log \frac{\sum _{j \in {\mathcal{P}}_{i}} \exp \left( sim \left( \hat{{\textbf{z}}}^{{{{\rm MP}}}}_{i},\hat{{\textbf{z}}}^{{{{\rm NS}}}}_{j}\right) /\tau \right) }{\sum _{u \in {{\mathcal{P}}_{i} \cup {\mathcal{N}}_{i}}} \exp \left( sim \left( \hat{{\textbf{z}}}^{{{{\rm MP}}}}_{i},\hat{{\textbf{z}}}^{{{{\rm NS}}}}_{u}\right) /\tau \right) }, \end{aligned}$$
(23)
where \(sim(\textbf{v}_{1},\textbf{v}_{2})\) denotes the cosine similarity between two vectors \(\textbf{v}_{1}\) and \(\textbf{v}_{2}\), and \(\tau\) is the temperature parameter, which indicates how concentrated the embeddings are in the representation space; a lower temperature leads the loss to be dominated by smaller distances, so that widely separated representations contribute less. Note that Eqs. 22–23 are independent of the specific strategy of positive and negative selection; we leave the investigation of alternative sampling methods as future work (“Conclusions” section).
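For illustration, a batched sketch of the cross-view contrastive losses of Eqs. 22–23 is given below; the matrix-based formulation over a set of entities and the default value of \(\tau\) are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def cross_view_loss(z_anchor, z_other, pos_mask, tau=0.5):
    """Batched sketch of Eqs. 22-23: anchors come from one view, positives and
    negatives from the other view. `pos_mask` is a boolean (n, n) matrix with
    pos_mask[i, j] = True iff entity j belongs to P_i (each row contains at
    least the entity itself)."""
    z_a = F.normalize(z_anchor, dim=1)
    z_o = F.normalize(z_other, dim=1)
    sim = torch.exp(z_a @ z_o.t() / tau)            # exp(cosine similarity / tau)
    pos = (sim * pos_mask).sum(dim=1)               # numerator: positives only
    return -torch.log(pos / sim.sum(dim=1)).mean()  # averaged over target entities
```

Calling it twice with the roles of the two views swapped yields \(L^{\mathrm{NS}}\) and \(L^{\mathrm{MP}}\), respectively.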
The final contrastive loss is computed as a convex combination of the two contrastive losses to balance the effects of the two views:
$$\begin{aligned} L_{co} = \lambda L^{{{{\rm NS}}}} + (1-\lambda ) L^{{{{\rm MP}}}} \end{aligned}$$
(24)
with \(0< \lambda < 1\). The loss function is completely specified depending on whether an unsupervised or a semi-supervised paradigm is adopted. The extension to the (semi-)supervised case can be done by adding a new term to the final loss, as shown in Eq. 25:
$$\begin{aligned} L_{tot} = \eta L_{co} + L_{sup} \end{aligned}$$
(25)
where \(L_{sup}\) is the (semi-)supervised term, e.g., cross-entropy for classification tasks, jointly optimized with the contrastive term in an end-to-end fashion, and the coefficient \(\eta\), with \(0 \le \eta \le 1\), weights the contrastive term, since in a (semi-)supervised setting the supervised term is expected to be more relevant.
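Putting Eqs. 24–25 together, a minimal sketch of the overall objective is the following (the function name and default values are illustrative):

```python
def total_loss(loss_ns, loss_mp, lam=0.5, eta=1.0, loss_sup=None):
    """Convex combination of the two view losses (Eq. 24), optionally extended
    with a (semi-)supervised term whose relative weight is controlled by eta (Eq. 25)."""
    l_co = lam * loss_ns + (1 - lam) * loss_mp
    return l_co if loss_sup is None else eta * l_co + loss_sup
```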
Similarly to Chen et al. (2020), once the training procedure is completed, the optimized \({\textbf{z}}_{i}^{{{{\rm MP}}}}\) or \({\textbf{z}}_{i}^{{{{\rm NS}}}}\) will eventually be used for downstream tasks. In particular, our default choice is to select the embeddings under the metapath view, since metapaths represent high-order relations between target nodes and pillar-edges capture the information of instances of the same entity, exploiting multilayer dependencies. It should however be noted that the similarity between the two learned embeddings, for any entity, is expected to be high, since, according to our positive selection strategy, each entity \(v_{i}\) includes itself under the other view in its set of positive samples \({\mathcal{P}}_{i}\). Nonetheless, in the “Experimental settings” section, we provide empirical evidence of such embedding similarities. The final learned embeddings optimized via this cross-view contrastive loss can be used for a wide range of analysis tasks—at node, entity, or edge level—such as node/entity classification, graph clustering, and link prediction.