Co-MLHAN: contrastive learning for multilayer heterogeneous attributed networks

Graph representation learning has become a topic of great interest, and many works focus on the generation of high-level, task-independent node embeddings for complex networks. However, existing methods consider only a few aspects of networks at a time. In this paper, we propose a novel framework, named Co-MLHAN, to learn node embeddings for networks that are simultaneously multilayer, heterogeneous and attributed. We leverage contrastive learning as a self-supervised and task-independent machine learning paradigm and define a cross-view mechanism between two views of the original graph which collaboratively supervise each other. We evaluate our framework on the entity classification task. Experimental results demonstrate the effectiveness of Co-MLHAN and its variant Co-MLHAN-SA, showing their capability of exploiting across-layer information in addition to other types of knowledge.

methods. As noted by Khoshraftar and An (2022), GNNs ensure more refined graph representations, higher flexibility in leveraging attributes at node/edge level, and generalization to unseen nodes through task-specific and node-similarity based training, although at the cost of higher memory requirements that might affect scalability.
Since GNNs typically require labels to learn rich representations, and annotating graphs is costly because it requires domain knowledge, self-supervised learning approaches are currently being investigated; coupled with GNNs, they allow embeddings to be learned without relying on labeled data (Hassani and Ahmadi 2020). Among graph self-supervised learning methods, contrast-based methods have more flexible designs and broader applications than other approaches: they train GNNs by discriminating between positive and negative node pairs, i.e., similar and dissimilar instances. Contrastive learning aims to learn effective GNN encoders such that similar nodes are pulled together and dissimilar nodes are pushed apart in the embedding space (Jing et al. 2021).
To the best of our knowledge, there is a lack of methods able to handle networks whose nodes are replicated according to different interaction contexts or semantic aspects, are of different types and/or are connected via different types of relationships, and carry multiple information content. In other terms, networks that are simultaneously multilayer, heterogeneous and attributed are still unexplored in the landscape of graph representation learning, regardless of the particular learning paradigm adopted.
Contributions. To fill the above gap in the literature, in this work we propose a novel Contrastive learning based framework for Multilayer Heterogeneous Attributed Networks (Co-MLHAN), which is designed to learn node/entity embeddings without relying on labeled data. Specifically, we learn node representations by contrasting positive and negative samples belonging to distinct views of the original graph. Inspired by recent advances in multi-view contrastive learning (Hassani and Ahmadi 2020; Jing et al. 2021; Mavromatis and Karypis 2021; Wang et al. 2021), we consider two views of a multilayer heterogeneous attributed network, which capture the local and high-order (global) structure of nodes, respectively, and collaboratively supervise each other.
Our main contributions in this work correspond to addressing the following relevant, interrelated challenges:
• Representation learning for an arbitrary multilayer network such that each layer can have multiple types of nodes and relations (heterogeneous network), and have initial features associated with nodes (attributed network).
• Encoding of the local information of nodes, to account for the size and heterogeneity of the node neighborhoods, so as to handle variability in the number of neighbors and the possible lack of certain types of neighbors.
• Encoding of the high-order information of nodes, by employing meta-paths, to reach relevant information residing multiple hops away, so that nodes of the same type that are not directly connected can be tied to each other.
• Effective integration and exploitation of across-layer information, including the possibility of assigning different weights to different layers or treating them equally, as needed. This also avoids using a simplistic approach based on network flattening, so that dependencies between the layers can be retained, including both the links between the replicas of the nodes in different layers (pillar-edges) and any other inter-layer edges. Moreover, with respect to modeling the across-layer information related to pillar-edges, we also propose a variant of the main method, which will be referred to as Co-MLHAN-SA.
• Joint learning of two embeddings for each node/entity, one under each view, both of which can be used for downstream tasks, such as classification. In this regard, we also provide a qualitative analysis of the interchangeability of the view-specific embeddings.
• High flexibility in the definition of node- and entity-level attributes, as well as in the selection strategy for positive and negative samples.
We experimentally evaluated our Co-MLHAN methods and selected competitors on IMDb movie data, from which we originally built multilayer heterogeneous attributed networks.
Plan of the paper. The remainder of this paper is structured as follows. "Proposed framework" section describes our proposed framework in detail. "Experimental evaluation" section provides our experimental evaluation concerning the entity classification task on IMDb network datasets. "Related work" section discusses related works focusing on GNN-based approaches for representation learning in heterogeneous attributed networks and in multilayer attributed networks. "Conclusions" section contains concluding remarks and provides pointers for future research. Moreover, Appendices 1-4 provide details about the preprocessing of our evaluation network datasets, an insight into the content encoding stage, and a discussion on computational complexity aspects of the proposed framework.

Proposed framework
Our proposed Co-MLHAN is a self-supervised graph representation learning approach conceived for multilayer heterogeneous attributed networks. As previously discussed, a key novelty of Co-MLHAN is its higher expressiveness w.r.t. existing methods, since heterogeneity is assumed to hold at both node and edge levels, possibly for each layer of the network. This capability of handling graphs that are multilayer, heterogeneous, and attributed simultaneously, enables Co-MLHAN to better model complex real-world scenarios, thus incorporating most information when generating node embeddings. In the following, we first provide a formal definition of multilayer heterogeneous attributed graph and representation learning in such networks, then we move to a detailed description of Co-MLHAN. The notations used in this work are summarized in Table 9, Appendix 1.

Preliminary definitions
A multilayer graph is a set of interrelated graphs, each corresponding to one layer, with a node mapping function between any (selected) pair of layers to indicate which nodes in one graph correspond to nodes in the other one. We assume that each layer can be heterogeneous, i.e., is characterized by nodes of different types and/or edges of different types, such that any node can be linked to nodes of the same type as well as to nodes of different types, through the same or different relations, and is attributed, i.e., has nodes associated with external information, available as a set of attributes. Therefore, each layer graph has its internal set of edges, dubbed intra-layer or within-layer edges, as well as a set of edges connecting its nodes to nodes of another layer, dubbed inter-layer or across-layer edges. Layers can be seen as different interaction contexts, semantic aspects, or time steps, while the participation of an entity in a layer can be seen as a particular entity instance. Instances of the same entity are connected via pillar-edges. We hereinafter refer to entity instances as nodes in the multilayer network. Figure 1 illustrates an example of a multilayer heterogeneous attributed graph.
Multilayer heterogeneous attributed graph. We define a multilayer heterogeneous attributed graph as $G_{\mathcal{L}} = \langle \mathcal{L}, \mathcal{V}, V_{\mathcal{L}}, E_{\mathcal{L}}, \mathcal{A}, \mathcal{R}, \phi, \varphi, \mathcal{X}_{\mathcal{L}} \rangle$, where $\mathcal{L} = \{G_1, \dots, G_\ell\}$ is the set of layer graphs, indexed in $L = \{1, \dots, \ell\}$, with $|\mathcal{L}| = \ell \geq 2$, $\mathcal{V}$ is the set of entities, $V_{\mathcal{L}} \subseteq \mathcal{V} \times \mathcal{L}$ is the set of nodes, $E_{\mathcal{L}}$ is the set of edges, including both intra- and inter-layer edges, $\mathcal{A}$ is the set of entity, resp. node, types, $\mathcal{R}$ is the set of relation types, $\phi : \mathcal{V} \to \mathcal{A}$ is the entity-type mapping function, $\varphi : E_{\mathcal{L}} \to \mathcal{R}$ is the edge-type mapping function, and $\mathcal{X}_{\mathcal{L}}$ is a set of matrices storing attributes, or initial features, with $\mathcal{X}_{\mathcal{L}} = \bigcup_{l=1 \dots \ell} \mathcal{X}_l$. More specifically, entities, resp. nodes, of each type are assumed to be associated with features stored in layer-specific matrices $\mathcal{X}_l = \{\mathbf{X}^{(a)}_l\}$, where each $\mathbf{X}^{(a)}_l$ is the feature matrix associated with entities, resp. nodes, of type $a \in \mathcal{A}$ in the $l$-th layer. Throughout this work we use symbol $\mathbf{x}^{(a)}_{\langle i,l \rangle}$ to denote the feature vector of entity $v_i$ of type $a$ in layer $G_l$. We also admit that features can be layer-independent, in which case we indicate with $\mathbf{x}^{(a)}_i$ the feature vector associated with entity $v_i$ of type $a$ in each layer, i.e., $\mathbf{x}^{(a)}_{\langle i,l \rangle} = \mathbf{x}^{(a)}_i$ for each $G_l \in \mathcal{L}$.
Fig. 1 Illustration of a multilayer heterogeneous attributed graph with two layers ($G_l$ and $G_{l'}$), three types of nodes (M (movie), A (actor), D (director)), and different content features for different nodes (e.g., text, images, structured attributes)
We specify that each entity has instances (i.e., nodes) in one or more layers, and appears in at least one layer, i.e., $\mathcal{V} = \bigcup_{l=1 \dots \ell} \mathcal{V}_l$, with $\mathcal{V}_l$ the set of entities appearing in the $l$-th layer. Likewise, $\mathcal{A} = \bigcup_{l=1 \dots \ell} \mathcal{A}_l$, with $\mathcal{A}_l$ denoting the set of node types of the $l$-th layer, $\mathcal{R} = \bigcup_{l=1 \dots \ell} \mathcal{R}_l$, with $\mathcal{R}_l$ denoting the set of edge types of the $l$-th layer, and $E_{\mathcal{L}} = \bigcup_{r \in \mathcal{R}} E_r \subseteq V_{\mathcal{L}} \times V_{\mathcal{L}}$, with $E_r$ indicating all the edges of type $r$.
Moreover, $E_{\mathcal{L}}$ can be partitioned into two sets denoting the intra-layer edges and inter-layer edges. Note that inter-layer edges represent the coupling structure of layers; in our setting, we assume that different coupling constraints between layers might hold, e.g., layers could be coupled with each other, only adjacent layers could be coupled, layers could follow a temporal relation order, etc. We define the set of layer pairing indices as $L_{cross}$, where each $\pi = (l, l') \in L_{cross}$ is a pair of coupled layers denoting an interaction between layers $G_l$ and $G_{l'}$.
We stress that in contrast to other approaches, such as (Yang et al. 2021), in our formulation each layer $G_l$ ($l = 1 \dots \ell$) is a heterogeneous graph at both node and edge levels, i.e., $|\mathcal{A}_l| > 1$ and $|\mathcal{R}_l| > 1$. Moreover, $\mathcal{A}_l \subseteq \mathcal{A}$, for all $G_l \in \mathcal{L}$, and $\mathcal{R}_l \subset \mathcal{R}$, since inter-layer connections are regarded as different types of edges.
Multilayer heterogeneous attributed graph embedding. Given a multilayer heterogeneous attributed network $G_{\mathcal{L}}$, our goal is to learn an embedding function at entity level $g : \mathcal{V} \to \mathbb{R}^d$, where $d$ is the dimension of the latent space, and $d \ll |\mathcal{V}|$. Function $g$ can be derived from an analogous embedding function at node level, $g' : V_{\mathcal{L}} \to \mathbb{R}^d$, with $d \ll |V_{\mathcal{L}}|$. The mapping $g$, resp. $g'$, defines the latent representation of each entity $v_i \in \mathcal{V}$, resp. node $\langle i, l \rangle \in V_{\mathcal{L}}$, and we use symbol $\mathbf{z}_i$, resp. $\mathbf{z}_{\langle i,l \rangle}$, to denote its learned embedding. The learned embeddings are eventually used to support multiple downstream graph mining tasks, e.g., entity/node classification, link prediction, node regression, etc.
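To make the definition above more concrete, the following is a minimal Python sketch of a container for a multilayer heterogeneous attributed graph. It is an illustration under our own assumptions rather than part of the framework: the class name, method names, relation labels, and the choice of deriving coupled layer pairs from the inter-layer edges are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Node = Tuple[str, int]          # (entity id, layer index): one entity instance
Edge = Tuple[Node, Node, str]   # (source node, destination node, relation type)


@dataclass
class MultilayerHeteroGraph:
    """Hypothetical container for a multilayer heterogeneous attributed graph."""
    entity_type: Dict[str, str]                       # entity id -> entity type
    nodes: Set[Node] = field(default_factory=set)     # subset of entities x layers
    edges: List[Edge] = field(default_factory=list)   # intra- and inter-layer edges
    features: Dict[Node, List[float]] = field(default_factory=dict)

    def add_edge(self, src: Node, dst: Node, relation: str) -> None:
        for entity, _layer in (src, dst):
            if entity not in self.entity_type:
                raise KeyError(f"unknown entity: {entity}")
        self.nodes.update((src, dst))
        self.edges.append((src, dst, relation))

    def intra_layer_edges(self) -> List[Edge]:
        return [e for e in self.edges if e[0][1] == e[1][1]]

    def inter_layer_edges(self) -> List[Edge]:
        return [e for e in self.edges if e[0][1] != e[1][1]]

    def pillar_edges(self) -> List[Edge]:
        # Pillar-edges link two instances of the same entity in different layers.
        return [e for e in self.inter_layer_edges() if e[0][0] == e[1][0]]

    def coupled_layer_pairs(self) -> Set[Tuple[int, int]]:
        # Assumption: a layer pair is coupled if an inter-layer edge joins it.
        return {(e[0][1], e[1][1]) for e in self.inter_layer_edges()}


g = MultilayerHeteroGraph(entity_type={"M1": "movie", "A1": "actor"})
g.add_edge(("A1", 1), ("M1", 1), "acts_in@1")      # intra-layer edge, layer 1
g.add_edge(("A1", 2), ("M1", 2), "acts_in@2")      # intra-layer edge, layer 2
g.add_edge(("A1", 1), ("M1", 2), "acts_in@1-2")    # inter-layer edge
g.add_edge(("M1", 1), ("M1", 2), "pillar")         # pillar-edge
```

Labeling the same relation differently per layer (here `acts_in@1` and `acts_in@2`) mirrors the convention that intra-layer edges in different layers are treated as different relation types.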

Co-MLHAN: contrastive learning framework for multilayer heterogeneous attributed networks
We aim to learn node embeddings in an unsupervised manner, with function g employing graph neural networks and attention mechanisms in order to encode both structural and semantic, heterogeneous and multilayer information in the context of a multi-view contrastive mechanism.
Our proposed approach is based on the infomax principle of maximizing mutual information (Linsker 1988), both in terms of graph structure encoding (complying with the distinction between local and high-order information) and across-layer information (complying with the distinction between inter-layer edges connecting direct neighbors and pillar-edges connecting different instances of the same entity). According to this principle, we define two different structural views on the original graph: one is designed to encode the local structure of nodes and handle heterogeneity, capturing useful information from one-hop neighbors of different types (possibly from different layers), and the other is designed to encode the global structure of nodes and model information from distant nodes in the network, thus capturing useful information from multi-hop neighbors of the same type. Note that we include pillar-edges in the global view, since they are particular connections matching two instances of the same entity, thus enabling across-layer transitions, but they do not represent edges between two direct neighbors.
It should be emphasized that Co-MLHAN is conceived to be general and flexible, so as to exploit all available information while remaining effective even when such information is lacking. For instance, across-layer relations could be limited to a few replicas, nodes may show high variability in the number of neighbors, or one or more types of neighbors could be missing for some nodes. Figure 2 shows a conceptual overview of our proposed framework. Accordingly, the final embedding for each target entity is learned through three main stages:
1. Content encoding. Since the initial feature vectors of nodes/entities ($\mathbf{x}$) might be of different sizes, the first stage transforms such initial features into a shared low-dimensional latent space ($\mathbf{h}$). Moreover, this stage is also concerned with content encoding "from scratch", i.e., generating initial embeddings from raw data associated with nodes/entities, which might come from multiple and heterogeneous contents, such as categorical or numerical attributes, unstructured text and multimedia content.
2. Graph structure encoding. According to the multi-view learning paradigm, the second stage generates two distinct embeddings for each entity, reflecting the graph structure and maximizing the mutual information: (1) embeddings for the local structure ($\mathbf{z}^{NS}$), including information from all direct neighbors of the nodes being instances of the target entity, and (2) embeddings for the high-order structure ($\mathbf{z}^{MP}$), including information from pillar-edges and from target nodes that can be reached through composite relations (i.e., meta-paths).
3. Final embedding based on contrastive learning. The third stage requires a joint optimization between the embeddings learned under the two views to generate the final entity embedding ($\mathbf{z}$). The contrastive learning mechanism is enforced by choosing suitable positive and negative samples from the original graph.
In the following sections, we elaborate on the graph structure encoding (stage 2) and the generation of the final embedding based on contrastive learning (stage 3). We examine their computational complexity aspects in Appendix 4. For the sake of readability, note also that, since the first stage of content encoding is actually beyond the objectives of this work, we discuss it in Appendix 3.

Graph structure encoding
The second stage models two graph views, named network schema view and meta-path view, able to encode the local and global structure surrounding nodes, respectively, while exploiting multilayer information.
The network schema of a heterogeneous graph is an abstraction of the original graph showing the different node types and their direct connections. It is often referred to as meta template, since it captures node and edge type mapping functions. Formally, a network schema is a directed graph defined over node types $\mathcal{A}$, with edges as relation types from $\mathcal{R}$. In a multilayer heterogeneous network $G_{\mathcal{L}}$, the network schema includes all types $\mathcal{A}$ for individual layers and relations $\mathcal{R}$, including both intra- and inter-layer edges. More specifically, we consider all relations involving any node $\langle i, l \rangle$ of target type, denoted as $\mathcal{R}_{\langle i,l \rangle} \subseteq \mathcal{R}$, and all node types $a$ connected to the target node through a relation $r \in \mathcal{R}_{\langle i,l \rangle}$. Hereinafter, we refer to this graph as network schema graph. Figure 3 shows an example of network schema graph for the multilayer heterogeneous attributed network of Fig. 1.
A meta-path is a sequence of connected nodes making two distant nodes in the network reachable, i.e., the terminal or endpoint nodes of a meta-path instance. Formally, a meta-path $M_m$ is a path defined on the network schema graph, in the form $a_1 \xrightarrow{r_1} a_2 \xrightarrow{r_2} \cdots \xrightarrow{r_k} a_{k+1}$, describing a composite relation $r_1 \circ r_2 \circ \cdots \circ r_k$ between node types $a_1$ and $a_{k+1}$. A meta-path instance of $M_m$ is a sampling under the guidance of $M_m$ providing a sequence of connected nodes with edges matching the composite relation in $M_m$. Examples of within-layer meta-path instances are depicted in Fig. 4a and b. Given a multilayer heterogeneous graph $G_{\mathcal{L}}$ and a meta-path $M_m$, let $N_m(i, l)$ denote the meta-path based neighbors of node $\langle i, l \rangle$ of a certain type $a$, defined as the set of nodes of type $a'$ that are connected with node $\langle i, l \rangle$ through at least one meta-path instance of $M_m$ having $a$ as starting node type and $a'$ as ending node type. Note that, similarly to Wang et al. (2019), the intermediate nodes along meta-paths are discarded. A meta-path based graph is a graph comprised of all the meta-path based neighbors. For meta-paths with terminal nodes of the same type, the resulting graph is homogeneous at node level. Figure 4c shows an example of a single-layer meta-path based graph according to a specific meta-path type.
Fig. 4 Examples of meta-path instances for types MAM (movie-actor-movie) marked with bold black lines, MDM (movie-director-movie) marked with bold red lines, AMA (actor-movie-actor) marked with bold blue lines, and AMDMA (actor-movie-director-movie-actor) marked with bold green lines, w.r.t. layer $G_l$ in the multilayer graph of Fig. 1 (a), all meta-path instances of type MAM w.r.t. the same layer (b) and the corresponding meta-path based graph for MAM type, with focus on node M1 and its neighbors (c)
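As an illustration of meta-path based neighbors within a single layer, the sketch below derives the MAM meta-path based graph from a toy movie-actor incidence matrix. The matrix values, the function names, and the choice of dropping trivial self-paths (a movie reaching itself through one of its actors) are assumptions made for this example only.

```python
import numpy as np

# Toy incidence for one layer: rows are movies, columns are actors.
# movie_actor[i, j] = 1 if actor j appears in movie i (illustrative values).
movie_actor = np.array([
    [1, 1, 0],   # M1: A1, A2
    [0, 1, 0],   # M2: A2
    [0, 0, 1],   # M3: A3
])


def metapath_adjacency(incidence: np.ndarray) -> np.ndarray:
    """Adjacency of the meta-path based graph for a symmetric 2-hop meta-path."""
    counts = incidence @ incidence.T          # number of meta-path instances
    adjacency = (counts > 0).astype(int)      # neighbor: at least one instance
    np.fill_diagonal(adjacency, 0)            # assumption: drop trivial self-paths
    return adjacency


def metapath_neighbors(adjacency: np.ndarray, i: int) -> list:
    """Indices of the meta-path based neighbors of node i."""
    return np.flatnonzero(adjacency[i]).tolist()


mam = metapath_adjacency(movie_actor)
```

Here M1 and M2 become neighbors because they share actor A2, while M3 has no MAM neighbor; the intermediate actor nodes do not appear in the resulting graph, which is homogeneous at node level.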
Following Wang et al. (2021), given a target entity, the network schema view is used to capture the local structure, by modeling information from all the direct neighbors of the corresponding target nodes, whereas the meta-path view is used to capture the global structure, by modeling information from all the nodes connected to the corresponding target nodes through a meta-path and from the pillar-edges derived by the corresponding meta-path based graph.
View embedding generation. The two views exploit features associated with different entity types; specifically, the network schema view takes advantage of features of neighbors of any type, while the meta-path view takes advantage of features of nodes of target type involved in high-order relations.
Recall that Co-MLHAN produces for each target entity a distinct embedding under each view. Nonetheless, both views share two fundamental steps in the embedding generation: (1) aggregating information of different instances of the same type (i.e., instances of the same relation and instances of the same meta-path, respectively), and (2) combining information of different types (i.e., different types of relations and of meta-paths, respectively), as well as of different layers.

Network schema view embedding
In the network schema view, the embedding of each target node is computed from its direct neighbors, both within and across layers. As mentioned before, the network schema is a multilayer heterogeneous graph, having nodes of different types and relations corresponding to intra- and inter-layer edges involving nodes of target type.
To generate the embeddings under the network schema view, we follow a hierarchical attention approach, consisting of two main steps, which are summarized as follows and depicted in Fig. 5:
(NSVE-1) First, we aggregate information of the same type (i.e., different instances of the same relation type) via node-level attention, learning the importance of each neighbor and obtaining, for each node, an embedding w.r.t. each relation type that involves a node of target type $t$.
(NSVE-2) Second, we combine information of different types (i.e., different relations in different layers) via type-level attention, learning the most relevant relations and obtaining an embedding for each node under the network schema view. Moreover, we combine information from different layers via across-layer attention, learning the importance of each layer and obtaining, for each entity, a single embedding under the network schema view.
Fig. 5 Illustration of the hierarchical attention approach used in the steps of the network schema view embedding (i.e., NSVE-1 and NSVE-2), with focus on the target entity M1 in different layers. From left to right, the NSVE-1 box shows node-level attention w.r.t. target nodes M1, with colored lines denoting different relations and matching different attention weights; the NSVE-2 box shows type-level attention w.r.t. target nodes M1, with different colors, resp. textures, denoting the embeddings obtained from different relations, resp. layers, and across-layer attention w.r.t. target nodes M1, combining the embeddings of different layers
Note that we refer to relation type and not to node type to be consistent in the event that target nodes are connected to a certain node type through multiple relationships. We point out that, in accordance with the infomax principle, the network schema view does not model pillar-edges, since they are processed in the other view. We also specify that intra-layer edges in different layers are seen as different types of relations, reflecting the separation into layers according to a certain aspect. In practice, layers are an additional way for distinguishing the context of relations.
Aggregating information of different instances of the same type (NSVE-1). Aggregating information of the same type (i.e., different instances of the same relation type) takes place via node-level attention. This step exploits features of nodes connected to target nodes through a direct link, whether they are of the same type as the target or not.
Given the graph $G_{\mathcal{L}}$, we define a function, denoted as $N^{(r)}(\cdot)$, that for any entity-layer pair yields its neighborhood under relation type $r$, regardless of the within-layer or across-layer location of the neighbors. Formally, given a target node $\langle i, l \rangle$, we define the set of its neighbors under relation $r \in \mathcal{R}_{\langle i,l \rangle}$ as:
$$N^{(r)}(i, l) = \{\langle j, l' \rangle \in V_{\mathcal{L}} \mid (\langle j, l' \rangle, \langle i, l \rangle) \in E_r\}. \tag{1}$$
Above, note that $N^{(r)}(i, l)$ returns within-layer or across-layer neighbors of $\langle i, l \rangle$ under relation $r$, when $l = l'$ or $l \neq l'$, respectively. (Recall that pillar-edges are excluded from the definition of neighbor sets.) Moreover, to ensure the aggregation of the same amount of information, we sample a fixed number of neighbors to be processed at each epoch by setting a threshold value for each type of neighbor (cf. "Experimental settings" section). In our setting, neighbor sampling can be done with or without replacement. Note that this neighbor sampling approach saves computational resources in the case of huge networks.
We thus define the embedding of entity $v_i$ in layer $G_l$ based on neighbors under relation $r$ as:
$$\mathbf{z}^{N^{(r)}}_{\langle i,l \rangle} = \sigma\Big(\sum_{\langle j,l' \rangle \in N^{(r)}(i,l)} \alpha^{(r)}_{\langle i,l \rangle, \langle j,l' \rangle} \mathbf{W}^{(r)}_2 \mathbf{h}_{\langle j,l' \rangle}\Big), \tag{2}$$
where $\mathbf{z}^{N^{(r)}}_{\langle i,l \rangle}$ is the embedding of node $\langle i, l \rangle$ obtained from its neighborhood under relation $r$, $\sigma(\cdot)$ is the activation function (default is ELU), $\mathbf{W}^{(r)}_2$ is the weight matrix of shape $(d, d)$ associated with one-hop neighbors $\langle j, l' \rangle$, $\mathbf{h}_{\langle j,l' \rangle}$ is the feature embedding of node $\langle j, l' \rangle$, and $\alpha^{(r)}_{\langle i,l \rangle, \langle j,l' \rangle}$ is the normalized attention coefficient for the relation $r$ connecting $\langle i, l \rangle$ and $\langle j, l' \rangle$, indicating the importance for $\langle i, l \rangle$ of information coming from $\langle j, l' \rangle$, as defined in Eq. 3, where $\mathbf{a}^{(r)} \in \mathbb{R}^d$ is the learnable weight vector under relation $r$, $[\mathbf{h}_{\langle i,l \rangle} \| \mathbf{h}_{\langle j,l' \rangle}] \in \mathbb{R}^{2d}$ is the row-wise concatenation of the column vectors associated with the two node embeddings, and $\mathbf{W}^{(r)} = [\mathbf{W}^{(r)}_1 \| \mathbf{W}^{(r)}_2] \in \mathbb{R}^{d \times 2d}$ is the column-wise concatenation of $\mathbf{W}^{(r)}_1$ and $\mathbf{W}^{(r)}_2$, both of shape $(d, d)$ and containing the left and right half of the columns of $\mathbf{W}^{(r)}$, associated with destination and source nodes (one-hop neighbors), respectively. In Eq. 3, we adopt the same approach as in GATv2 (Brody et al. 2021), which aims to fix the static attention problem of the standard Graph Attention Network (GAT) (Velickovic et al. 2018) that limits its expressive power, since the ranking of attended nodes is unconditioned on the query node; on the contrary, GATv2 is a dynamic graph attention variant where the order of internal operations of the scoring function is modified to apply an MLP for computing the score of each attended node.
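The node-level attention step can be illustrated with a minimal NumPy sketch for a single target node and a single relation. The scoring function used below, $\mathbf{a}^{\top} \mathrm{LeakyReLU}(\mathbf{W}[\mathbf{h}_i \| \mathbf{h}_j])$ followed by a softmax over the neighbors, is the standard GATv2 form and is an assumption about the exact shape of Eq. 3; the function names, the LeakyReLU slope, and the random inputs are illustrative only.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def elu(x):
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)


def node_level_attention(h_target, h_neighbors, W1, W2, a):
    """Embed one target node from its sampled neighbors under one relation.

    h_target:    (d,)   feature embedding of the target node
    h_neighbors: (n, d) feature embeddings of its n neighbors
    W1, W2:      (d, d) halves of W = [W1 || W2] (destination / source)
    a:           (d,)   attention vector
    """
    # W [h_i || h_j] = W1 h_i + W2 h_j, computed for all neighbors j at once.
    pre = h_target @ W1.T + h_neighbors @ W2.T             # (n, d)
    scores = leaky_relu(pre) @ a                            # (n,) assumed GATv2 score
    scores = scores - scores.max()                          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()           # softmax over neighbors
    # Eq. 2: activation of the attention-weighted sum of W2 h_j.
    z = elu((alpha[:, None] * (h_neighbors @ W2.T)).sum(axis=0))
    return z, alpha


rng = np.random.default_rng(0)
d, n = 4, 3
z, alpha = node_level_attention(
    rng.normal(size=d), rng.normal(size=(n, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d),
)
```

Because the nonlinearity is applied before the dot product with the attention vector, the ranking of neighbors can depend on the target node, which is the property that motivates the GATv2-style scoring.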
The self-attention mechanism can be extended similarly to Vaswani et al. (2017) by employing multi-head attention, in order to stabilize the learning process. In this case, operations are independently replicated $Q$ times, with different parameters, and outputs are feature-wise aggregated through an aggregation operator, which usually corresponds to average (default) or concatenation (Eq. 4), where $\mathbf{W}^{(r,q)}$ and $\alpha^{(r,q)}_{\langle i,l \rangle, \langle j,l' \rangle}$ denote the weight matrix and the attention coefficient for the $q$-th attention head under relation $r$, respectively.
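The two aggregation choices for multi-head attention can be sketched as follows. This is a minimal illustration, assuming each head has already produced a $d$-dimensional embedding for the same node; the function name and the toy values are hypothetical.

```python
import numpy as np


def combine_heads(head_outputs, mode="mean"):
    """Feature-wise aggregation of Q per-head embeddings, each of shape (d,)."""
    stacked = np.stack(head_outputs)            # (Q, d)
    if mode == "mean":
        return stacked.mean(axis=0)             # (d,): dimension is preserved
    if mode == "concat":
        return stacked.reshape(-1)              # (Q * d,): dimension grows with Q
    raise ValueError(f"unknown mode: {mode}")


heads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]   # Q = 2 toy head outputs
```

Averaging keeps the embedding size fixed at $d$, whereas concatenation yields a vector of size $Q \cdot d$ and therefore changes the input size expected by subsequent steps.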
Let $\mathbf{z}^{N^{(r)}}_{\langle i,l \rangle}$ be the embedding of a target node $\langle i, l \rangle$ obtained from its neighbors in each layer under relation $r$. Downstream of node-level attention, we thus obtain the set of embeddings $\bigcup_{r \in \mathcal{R}_{\langle i,l \rangle}} \{\mathbf{z}^{N^{(r)}}_{\langle i,l \rangle}\}$.

Combining information of different types and layers (NSVE-2).
In order to combine information of different node types according to the different relations with target nodes, we employ type-level attention for each layer separately. For each target node $\langle i, l \rangle$, we obtain the embedding under the network schema view $\mathbf{z}^{NS}_{\langle i,l \rangle}$, as defined in Eq. 5, where $\beta^{(r)}$ is the attention coefficient for the neighborhood under relation $r$, which is defined in Eq. 6. Eq. 6 is computed over the set of entities of target type $t$ in layer $l$; $\mathbf{a}_{NS} \in \mathbb{R}^d$ is the type-level attention vector; $\mathbf{W}_{NS}$ and $\mathbf{b}_{NS}$ are the learnable weight matrix and the bias term, respectively, under the network schema view, shared by all relation types. We hence obtain the set of embeddings $\bigcup_{l \in L} \{\mathbf{z}^{NS}_{\langle i,l \rangle}\}$ under the network schema view for each target node. In order to map the learned node embeddings into the same space of the contrastive loss function, we apply an additional level of attention, i.e., across-layer attention. This is designed to evaluate the importance of each layer of $G_{\mathcal{L}}$ and combine the features of nodes layer-wise. We thus obtain an embedding under the network schema view for each target entity $v_i$, as defined in Eq. 7, where $\beta^{(l)}$ is the learned attention coefficient for layer $G_l$, computed via the same attention model as in Eq. 6, where in this case the learnable weights are shared by all layers.
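The type-level and across-layer attention steps can be sketched with one shared routine. The scoring form below (the average over target nodes of $\mathbf{a}^{\top} \tanh(\mathbf{W}\mathbf{z} + \mathbf{b})$, normalized with a softmax) is an assumption modeled on common semantic-attention designs and may differ from the exact expressions of Eqs. 5-7; the names and random inputs are illustrative.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_combine(embeddings, W, b, a):
    """Weight and sum a stack of embeddings of the same nodes.

    embeddings: (T, n, d) one (n, d) matrix per relation type (or per layer)
    W: (d, d), b: (d,), a: (d,) parameters shared by all T inputs
    Returns the (n, d) combined embeddings and the (T,) attention weights.
    """
    # Assumed score: average over nodes of a^T tanh(W z + b).
    scores = (np.tanh(embeddings @ W.T + b) @ a).mean(axis=1)    # (T,)
    beta = softmax(scores)
    combined = np.tensordot(beta, embeddings, axes=1)             # (n, d)
    return combined, beta


rng = np.random.default_rng(0)
n, d = 5, 4

# Type-level attention: 3 relation types within one layer.
per_relation = rng.normal(size=(3, n, d))
W_ns, b_ns, a_ns = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
z_ns_layer, beta_r = attention_combine(per_relation, W_ns, b_ns, a_ns)

# Across-layer attention: same model, separate parameters, 2 layers.
per_layer = np.stack([z_ns_layer, rng.normal(size=(n, d))])
W_l, b_l, a_l = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
z_ns, beta_l = attention_combine(per_layer, W_l, b_l, a_l)
```

In this sketch the coefficients are shared by all nodes of a layer, so a relation (or layer) receives a single weight rather than one weight per node.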

Meta-path view embedding
In the meta-path view, the embedding of each target node is computed from its meta-path based neighbors and from the pillar-edges derived from the corresponding meta-path based graph. Recall that each layer of a meta-path based graph is a homogeneous network with nodes corresponding to a subset of target nodes and edges as connections of meta-path based neighbors, including across-layer information matching pillar-edges.
We consider meta-paths of any length, starting and ending with nodes of target type; indeed, information of intermediate nodes can be discarded as it is included in the network schema view. Note that considering multiple meta-paths allows us to deal with multiple semantic spaces (Lin et al. 2021), and our framework is designed to handle an arbitrary number of meta-paths. Also, in case a layer does not contain any node of target type, the layer is discarded from the resulting multilayer graph. Yet, our framework admits the worst case of $\ell - 1$ layers missing for a meta-path type.
Analogously to the network schema view, the meta-path view embedding generation consists of two main steps (Fig. 6):
First, we aggregate information of the same type, this time intended as several instances of the same meta-path and encoded via a meta-path-specific Graph Convolutional Network (GCN) (Kipf and Welling 2017), obtaining, for each target node, an embedding w.r.t. each meta-path type.
Second, we combine information of different types (i.e., different meta-paths in different layers) and layers (i.e., different meta-paths across layers) via semantic attention, learning the importance of each meta-path and obtaining an embedding for each target node and entity under the meta-path view.
In the following, we first describe the process of meta-path view embedding generation according to the basic Co-MLHAN approach. Next, in "Alternative meta-path view embedding: Co-MLHAN-SA" section, we shall describe an alternative strategy, called Co-MLHAN-SA, which differs from Co-MLHAN in the way across-layer information relating to pillar-edges is modeled.
Aggregating information of different instances of the same type (MPVE-1). The first step of embedding generation under the meta-path view is to aggregate information of the same type, which corresponds to several instances of a given meta-path. More specifically, we consider all p meta-paths M = {M 1 , . . . , M p } involving nodes of target type, where each meta-path M m matches a multilayer graph with at most ℓ layers.
In the meta-path view, across-layer dependencies are modeled as particular types of meta-paths, i.e., across-layer meta-paths. They refer to the same composite relation, with the additional constraint that the terminal nodes belong to different layers, and that the intermediate node matches a pillar-edge, i.e., it corresponds to an entity (of type different from the target one) with both instances involved in the composite relation. An example is illustrated in Fig. 7. We define the set of across-layer meta-paths, M , as the the union of all meta-paths of any type and defined over all layer-pairs.
To identify the meta-path based neighbors of each node, we define two functions, denoted as N ⇔ (·) and N � (·), which return the intra-layer and inter-layer neighborhood of a node, respectively. Formally, Eq. 8 defines the set of within-layer neighbors of node ⟨i, l⟩ according to the m-th (within-layer) meta-path type; similarly, Eq. 9 defines the set of across-layer neighbors of node ⟨i, l⟩ according to the m-th (across-layer) meta-path type. Eqs. 8 and 9 thus identify the meta-path based neighborhood of type M m for node ⟨i, l⟩, with m referring to a within-layer or an across-layer meta-path, respectively.
Fig. 7 All across-layer meta-path instances of type MAM (a), and the corresponding meta-path based graph for MAM type, with focus on entity M1 and its neighbors (b)
Given any target node ⟨i, l⟩, we apply a meta-path specific graph neural network f m (with K hidden layers) in order to compute its embedding according to the m-th meta-path. At each k-th layer (Eq. 10), the embedding of the node is updated from the embeddings of its meta-path based neighbors, taken from N ⇔ (·) if m is a within-layer meta-path and from N � (·) if m is an across-layer meta-path. Here z (0) ⟨i,l⟩ = h ⟨i,l⟩ is the feature embedding computed in the first stage, and the aggregator is an arbitrary differentiable function that aggregates feature information from the local neighborhood of nodes (e.g., summation, a pooling operator, or even a neural network). Similarly to Wang et al. (2021), we use a GCN architecture as f m , for all M m (m = 1 . . . p) in Eq. 10, assuming no different contribution from different instances of the same meta-path. More specifically, given the m-th within-layer meta-path and A = {A 1 , . . . , A ℓ } as the set of adjacency matrices associated with the corresponding meta-path based graph, with A l ∈ R n l ×n l ( l = 1 . . . ℓ ) being the adjacency matrix associated with layer l, the GCN for layer G l is defined as in Eq. 11, where σ (·) is a non-linear activation function (default is ReLU(·) = max(0, ·) ), W (k,l) is the trainable weight matrix for the m-th meta-path in the k-th convolutional layer, of shape (d, d), and $\tilde{D}^l_{ii} = \sum_j \tilde{A}^l_{ij}$ is the degree matrix derived from $\tilde{A}^l = A^l + I_{n_l}$, with $I_{n_l}$ the identity matrix of size n l and n l the number of nodes of layer G l . The GCN model for across-layer meta-paths is built similarly, considering N � (·) instead of N ⇔ (·) and π instead of l. Let z (m) ⟨i,l⟩ and z (m) ⟨i,π⟩ be the node embeddings associated with the m-th within-layer (resp. across-layer) meta-path of node ⟨i, l⟩ (resp. layer-pair π ). Downstream of the meta-path specific GNNs, we obtain a set of meta-path specific embeddings for each node.
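As a rough illustration of one convolutional layer on a single meta-path based graph, the sketch below adds self-loops, builds the degree matrix, and applies a ReLU activation. The symmetric normalization (inverse square roots of the degrees on both sides) is the standard GCN choice and is an assumption here, since the exact propagation rule is not reproduced in this text; the helper names are hypothetical.

```python
import math


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def normalize_adjacency(A):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]              # degree with self-loops
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(A, Z, W):
    """One layer: ReLU(A_hat @ Z @ W), with A of shape (n, n), Z (n, d), W (d, d)."""
    H = matmul(matmul(normalize_adjacency(A), Z), W)
    return [[max(0.0, v) for v in row] for row in H]


# Two movies connected by one meta-path instance, identity weights.
A = [[0, 1], [1, 0]]
Z = [[2.0, -4.0], [0.0, -2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
print(gcn_layer(A, Z, W))
```

With identity weights each node simply averages itself and its neighbor, and the negative coordinate is zeroed by the activation.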

Combining information of different types and layers (MPVE-2).
Once the meta-path specific embeddings have been obtained for each target node, we employ semantic-level attention for combining different meta-path types, including both intra- and inter-layer information. Given a node ⟨i, l⟩, the embedding under the meta-path view is computed as in Eq. 12, where β is the attention coefficient denoting the importance of each type of within-layer and across-layer meta-path (cf. Eq. 6) and � ∈ [0, 1] is a balancing coefficient denoting the importance of inter-layer connections.
In order to project the node embedding into the same space of the loss function, analogously to the network schema view, we aggregate the embeddings obtained from each layer with a sum operator, as defined in Eq. 13. Note that Eq. 13 does not require an additional level of attention, since the layer dependency has already been taken into account by the attention mechanism in Eq. 12. Therefore, Eqs. 12 and 13 can be combined into Eq. 14, which enables the direct computation of the final embedding under the meta-path view for each entity v i .

Alternative meta-path view embedding: Co-MLHAN-SA
Our alternative approach for embedding generation under the meta-path view is named Co-MLHAN-SA, where the suffix 'SA' refers to the supra-adjacency matrix modeling each meta-path based graph. The supra-adjacency matrix, denoted as A sup , has diagonal blocks each representing a layer-specific adjacency matrix (i.e., A l ∈ R n l ×n l , with l = 1 . . . ℓ ), and off-diagonal blocks each corresponding to the inter-layer adjacency matrix A π for layer-pair π = (l, l ′ ) , with values equal to 1 if an edge between ⟨i, l⟩ and ⟨j, l ′ ⟩ exists, with l ≠ l ′ , and 0 otherwise.
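The block structure of the supra-adjacency matrix can be sketched as follows. The sketch assumes that the only inter-layer edges are pillar-edges between instances of the same entity; the input format (one adjacency matrix and one list of entity ids per layer) and the function name are hypothetical.

```python
def supra_adjacency(layer_adj, layer_nodes):
    """Assemble a supra-adjacency matrix from per-layer adjacency matrices.

    layer_adj[l] is the square 0/1 adjacency matrix of layer l;
    layer_nodes[l][i] is the entity id of row i in layer l.
    Assumption: inter-layer edges are pillar-edges only.
    """
    offsets, total = [], 0
    for nodes in layer_nodes:
        offsets.append(total)
        total += len(nodes)
    A_sup = [[0] * total for _ in range(total)]

    # Diagonal blocks: layer-specific adjacency matrices.
    for A, off in zip(layer_adj, offsets):
        for i, row in enumerate(A):
            for j, value in enumerate(row):
                A_sup[off + i][off + j] = value

    # Off-diagonal blocks: link each node to its counterpart in other layers.
    for l, nodes_l in enumerate(layer_nodes):
        for l2, nodes_l2 in enumerate(layer_nodes):
            if l == l2:
                continue
            index_l2 = {entity: j for j, entity in enumerate(nodes_l2)}
            for i, entity in enumerate(nodes_l):
                if entity in index_l2:
                    A_sup[offsets[l] + i][offsets[l2] + index_l2[entity]] = 1
    return A_sup


# Entity M2 has an instance in both layers, hence one pillar-edge.
A_sup = supra_adjacency(
    [[[0, 1], [1, 0]], [[0, 1], [1, 0]]],
    [["M1", "M2"], ["M2", "M3"]],
)
for row in A_sup:
    print(row)
```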
To give an intuition, we model across-layer information downstream of semantic attention, by accounting for another level of attention, i.e., across-layer attention (by analogy with the network schema view).
We thus learn the importance of different (within layers) meta-paths via semantic attention, obtaining an embedding under the meta-path view for each node and we subsequently learn the importance of each layer via across-layer attention, obtaining an embedding under the meta-path view for each entity.
As in the basic Co-MLHAN approach, the meta-path view embedding generation in Co-MLHAN-SA consists of two main steps (Fig. 8):
(MPVE-SA-1) First, we aggregate information of the same type, intended as several instances of the same meta-path and encoded via meta-path-specific GCNs, obtaining, for each node, an embedding w.r.t. each meta-path. Unlike MPVE-1, the first step of the Co-MLHAN-SA approach hence handles the inter-layer dependencies derived from pillar-edges.
(MPVE-SA-2) Second, we combine information of different types (i.e., different meta-paths in different layers) via semantic attention, learning the importance of each meta-path and obtaining an embedding for each target node under the meta-path view. Moreover, we combine information from different layers via across-layer attention, learning the importance of each layer and obtaining, for each target entity, a single embedding under the meta-path view.
By avoiding the definition of across-layer meta-paths M, Co-MLHAN-SA requires a limited number of learnable parameters, as it utilizes a meta-path specific GCN shared by all layers G l .
Aggregating information of different instances of the same type (MPVE-SA-1). We still use the notation N ⇔ (·) and N � (·) to indicate the set of within-layer and across-layer neighbors, respectively. While the definition of N ⇔ (·) does not change w.r.t. Eq. 8, the definition of N � (·) in the Co-MLHAN-SA approach is modified in the modeling of pillar-edges, by directly considering all the instances of the same target entity in other layers, as shown in Eq. 15. Similarly to MPVE-1, we apply a meta-path specific GNN for aggregating different meta-path instances of the same type (Eq. 16). Unlike MPVE-1, the inter-layer dependencies are taken into account by the GNN, which employs a modified version of the propagation rule that can handle the supra-adjacency matrix as input. We thus build for each meta-path its corresponding meta-path based supra-graph, i.e., a graph where pillar edges exist between every node and its counterpart in other coupled layers. In our setting, we instantiate f m with a multi-layer GCN model (Zangari et al. 2021), as shown in Eq. 17, where the degree matrix is built considering both inter-layer and intra-layer links of nodes using the supra-adjacency matrix of the graph, $\tilde{D}_{ii} = \sum_j \tilde{A}^{sup}_{ij}$, where $\tilde{A}^{sup}$ is the supra-adjacency matrix with self-loops added, and δ(l, l ′ ) is a scoring function denoting the weight coefficient for inter-layer links: it ranges between 0 and 1 if l ≠ l ′ , and equals 1 otherwise.
Let z m ⟨i,l⟩ be the embedding of node ⟨i, l⟩ associated with the m-th meta-path. We thus obtain the set { z m ⟨i,l⟩ : m = 1 . . . p } of meta-path specific embeddings.
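A hedged sketch of how a propagation step might operate on the supra-graph is given below: inter-layer entries are scaled by a weight `delta`, self-loops are added, and the degrees are computed over both intra-layer and inter-layer links. The symmetric normalization, the omission of the weight matrix and activation, and all names (`normalized_supra`, `propagate`, `layer_of`, `delta`) are assumptions for illustration; the exact rule of Eq. 17 is not reproduced in this text.

```python
import math


def normalized_supra(A_sup, layer_of, delta):
    """Weight inter-layer links by delta, add self-loops, normalize symmetrically.

    layer_of[i] is the layer of supra-graph node i; delta in [0, 1] is the
    (assumed) weight of inter-layer links, intra-layer links keep weight 1.
    """
    n = len(A_sup)
    A_tilde = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            weight = 1.0 if layer_of[i] == layer_of[j] else delta
            A_tilde[i][j] = A_sup[i][j] * weight
        A_tilde[i][i] += 1.0                          # self-loop
    deg = [sum(row) for row in A_tilde]               # intra- plus inter-layer degree
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def propagate(A_hat, Z):
    """Neighborhood aggregation A_hat @ Z (weights and activation omitted)."""
    return [[sum(A_hat[i][j] * Z[j][k] for j in range(len(Z)))
             for k in range(len(Z[0]))] for i in range(len(Z))]


# Nodes 0 and 1 live in layer 0, node 2 in layer 1; edge 1-2 is a pillar-edge.
A_sup = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_hat = normalized_supra(A_sup, layer_of=[0, 0, 1], delta=0.5)
print(propagate(A_hat, [[1.0], [1.0], [1.0]]))
```

Setting `delta` to 1 treats pillar-edges exactly like intra-layer edges, while smaller values reduce the contribution of a node's counterparts in other layers.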

Combining information of different types and layers (MPVE-SA-2).
Once the meta-path specific embeddings have been obtained for each target node, we employ semantic-level attention for combining different meta-path types, obtaining for each node ⟨i, l⟩ an embedding under the meta-path view, as defined in Eq. 18, where β (m,l) is an attention coefficient computed as in Eq. 6. In order to project the node embedding into the same space of the loss function, we apply an additional level of attention, named across-layer attention, similarly to the network schema view, thus obtaining for each entity v i an embedding under the meta-path view, as defined in Eq. 19, where β (l) is the attention coefficient denoting the importance of the l-th layer, computed similarly to Eq. 6.
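The two levels of attention can be sketched as two nested weighted sums. In this minimal sketch the raw attention scores are taken as given inputs and turned into coefficients with a softmax; how the scores are actually learned (Eq. 6) is outside this text, so that part is an assumption, as are the function names and the premise that the entity has a node in every layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights, coordinate by coordinate."""
    return [sum(w * v[k] for w, v in zip(weights, vectors))
            for k in range(len(vectors[0]))]


def entity_embedding(z, metapath_scores, layer_scores):
    """z[l][m] is the embedding of the entity's node in layer l under meta-path m."""
    node_embeddings = []
    for l, per_metapath in enumerate(z):
        beta_ml = softmax(metapath_scores[l])         # semantic attention in layer l
        node_embeddings.append(weighted_sum(beta_ml, per_metapath))
    beta_l = softmax(layer_scores)                    # across-layer attention
    return weighted_sum(beta_l, node_embeddings)


# Two layers, two meta-paths each, equal scores (uniform attention).
z = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [4.0, 4.0]]]
print(entity_embedding(z, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]))
```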

Final embedding based on Contrastive Learning
The third stage of the proposed framework is concerned with the exploitation of a contrastive learning mechanism to produce the final entity embeddings, pulling together similar entities and pushing apart dissimilar ones in the embedding space. We combine the contrastive losses computed according to each view, with individual nodes of both positive and negative pairs selected from distinct views.
Given the embeddings z NS i (Eq. 7) and z MP i (either Eq. 14 or Eq. 19) for each target entity v i , we transform them into the same space in which a contrastive loss function is computed, by employing a simple MLP architecture with one hidden layer (Eq. 20), i.e., each embedding z is mapped to W (2) σ (W (1) z + b (1) ) + b (2) , where W (2) , W (1) , b (2) and b (1) are learnable weights shared by both views and σ (·) is the activation function (default is ELU).
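A minimal sketch of this shared projection is shown below, assuming the usual form of a one-hidden-layer MLP (affine map, ELU, affine map). The helper names and the toy weights are illustrative only.

```python
import math


def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def affine(W, b, x):
    """Compute W x + b for a list-of-lists matrix W and vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def project(z, W1, b1, W2, b2):
    """One-hidden-layer MLP; the same weights are applied to both views."""
    hidden = [elu(v) for v in affine(W1, b1, z)]
    return affine(W2, b2, hidden)


# The same (toy) parameters project an embedding from either view.
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.5]
print(project([2.0, 0.0], W1, b1, W2, b2))
print(project([1.0, -1.0], W1, b1, W2, b2))
```

Sharing the parameters means that both views are mapped by the same function, so their projected embeddings are directly comparable in the loss.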
The contrastive loss according to a certain view is computed on pairs of positive and negative samples. While earlier contrastive learning approaches were based on one or more negatives and a single positive for each instance, we follow the more recent trend of using both multiple positive and negative pairs (Khosla et al. 2020; Wang et al. 2021). Each target entity v i can hence rely on more than one positive (at least itself, under the other view). For positive sampling, the idea is to select the best nodes connected by multiple meta-path instances, since meta-path based neighbors have higher probability of being similar to each other. For negative sampling, we simply regard every entity that is not a positive as a negative.
We first proceed to the selection of positive samples. For this purpose, we count the meta-path instances connecting each pair of target entities, considering all meta-path types on individual layers, as shown in Eq. 21. Given a threshold T pos , we select as positives for each entity the entity itself and the best T pos − 1 entities, i.e., a subset of S i with cardinality at most T pos − 1 ; all the remaining |V| − T pos entities are regarded as negatives for v i . Therefore, for each entity v i , we define the set of positive samples P i as the union of v i itself and the selected subset. We stress that for the selection of positives we only exploit structural information, without using any information derived from the encoding of external content (i.e., initial features) of entities. Nonetheless, additional conditions on meta-paths in the selection of entity pairs can be defined, e.g., by diversifying the minimum number of instances required to enable the enumeration of a specific meta-path. Co-MLHAN is flexible in both the meta-path counting method and the overall positive and negative selection strategy.
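The selection rule, and the way a cross-view contrastive loss would consume the resulting positive sets, can be sketched as follows. The selection part follows the description above (the entity itself plus up to T pos − 1 best-connected entities, with ties broken by index as an arbitrary choice). The loss part is an assumption patterned after Wang et al. (2021): for each entity, the summed exponentiated cosine similarities to its positives in the other view are divided by the same sum over all entities, with a temperature `tau`, and the two view-specific losses are mixed by a weight `lam`. All function and parameter names are hypothetical.

```python
import math


def select_positives(counts, t_pos):
    """counts[i][j] = number of meta-path instances between entities i and j."""
    positives = []
    for i, row in enumerate(counts):
        candidates = [j for j, c in enumerate(row) if j != i and c > 0]
        candidates.sort(key=lambda j: (-row[j], j))   # best-connected first
        positives.append({i} | set(candidates[: t_pos - 1]))
    return positives


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def view_loss(anchor_view, other_view, positives, tau):
    """Anchors come from one view, positives and negatives from the other."""
    total = 0.0
    for i, z_i in enumerate(anchor_view):
        sims = [math.exp(cosine(z_i, z_j) / tau) for z_j in other_view]
        pos = sum(sims[j] for j in positives[i])
        total += -math.log(pos / sum(sims))           # negatives: all non-positives
    return total / len(anchor_view)


def contrastive_loss(z_ns, z_mp, positives, tau=0.5, lam=0.5):
    """Convex combination of the two view-specific losses."""
    return (lam * view_loss(z_ns, z_mp, positives, tau)
            + (1.0 - lam) * view_loss(z_mp, z_ns, positives, tau))


counts = [[0, 3, 1, 0], [3, 0, 0, 2], [1, 0, 0, 0], [0, 2, 0, 0]]
print(select_positives(counts, t_pos=2))
z = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(z, z, [{0}, {1}]))
```

With `t_pos=2` each entity keeps itself and its single best-connected entity; an entity with no meta-path based neighbor keeps only itself.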
For the computation of the contrastive loss according to a given view, the embedding of each target entity v i is selected from the given view, while the positive and negative samples are selected from the other view, as defined in Eqs. 22 and 23 and illustrated in Fig. 9. In these equations, sim(v 1 , v 2 ) denotes the cosine similarity between two vectors v 1 and v 2 , and τ is the temperature parameter, which indicates how concentrated the embeddings are in the representation space: a lower temperature leads the loss to be dominated by smaller distances, so that widely separated representations contribute less. Note that Eqs. 22-23 are independent of the specific strategy of positive and negative selection; we leave the investigation of alternative sampling methods as future work ("Conclusions" section). The final contrastive loss is computed as a convex combination of the two contrastive losses, to balance the effects of the two views:
$$\mathcal{L}_{co} = \lambda\,\mathcal{L}_{NS} + (1 - \lambda)\,\mathcal{L}_{MP} \tag{24}$$
with 0 < λ < 1. The loss function is completely specified depending on whether an unsupervised or semi-supervised paradigm is adopted. The extension to the (semi-)supervised case can be done by adding a new term to the final loss:
$$\mathcal{L}_{tot} = \eta\,\mathcal{L}_{co} + \mathcal{L}_{sup} \tag{25}$$
where L sup is the (semi-)supervised term, e.g., cross-entropy for classification tasks, jointly optimized with the contrastive term in an end-to-end fashion, and the coefficient η , 0 ≤ η ≤ 1 , is given to the contrastive term, since in a (semi-)supervised setting the (semi-)supervised term is expected to be more relevant.
Similarly to Chen et al. (2020), once the training procedure is completed, the optimized z MP i or z NS i will eventually be used for downstream tasks. Particularly, our default choice is to select the embeddings under the meta-path view, since meta-paths represent high-order relations between target nodes and pillar edges capture the information of instances of the same entity, exploiting multilayer dependencies. It should however be noted that the similarity between the two learned embeddings, for any entity, is expected to be high, since, according to our positive selection strategy, each entity v i includes itself under the other view in its set of positive samples P i . Nonetheless, in "Experimental settings" section, we shall provide empirical evidence of such embedding similarities. The final learned embeddings optimized via such cross-view contrastive loss can be used for a wide range of analysis tasks at node, entity, or edge level, such as node/entity classification, graph clustering, and link prediction.

Experimental evaluation
In this section, we describe the experimental evaluation of our framework. Our main goal is to evaluate Co-MLHAN and Co-MLHAN-SA on the entity (multi-class) classification task, choosing a target node type among the different node types with replicas in multiple layers and real-world initial features both at node and entity-level. "Data" section introduces the data, "Competing methods" section presents the competing methods, "Experimental settings" section discusses the experimental settings, and "Results" section describes the main results.

Data
To the best of our knowledge, the literature lacks publicly available benchmarks/repositories of networks that are simultaneously multilayer, heterogeneous, and attributed. To overcome this issue and build suitable network data for our evaluation, we resorted to online resources that would fulfill minimal requirements in terms of public availability, domain accessibility, and variety and richness of stored information. In this respect, we selected the Internet Movie Database (IMDb), 2 the most popular and authoritative online resource for movies, TV and celebrities.
Note that IMDb was used in existing studies (e.g., Wang et al. 2019; Fu et al. 2020; Zhao et al. 2020) for the same classification task (based on movie genres) we address in this work; however, the resulting datasets vary widely, which makes a fair comparison hard, and they are incomplete with respect to our requirements (i.e., networks that are both multilayer and heterogeneous at each layer).
We constructed two IMDb network datasets, dubbed IMDb-MLH and IMDb-MLH-mb (where suffix 'mb' stands for 'most balanced'). They both model each of the layers of the multilayer network as heterogeneous (and attributed).
We identify three types of entities, inherited by nodes: movie (for short, M), actor (for short, A) and director (for short, D). Type movie is regarded as the target type, therefore the downstream task is multi-class classification on movie genres, which are 'action', 'comedy' and 'drama'. Tables 1, 2 and 3 summarize the main characteristics of the networks, which are described next, whereas in Appendix 2 we provide a detailed description of the semantics of the constituting elements and the steps involved in data preprocessing.
• IMDb-MLH. Our main network dataset was conceived primarily for comparative evaluation with the competitors. As can be noticed from Table 3, the network is particularly unbalanced w.r.t. the distribution of classes (i.e., movie genres), which reflects a major requirement of one of our competitors, that is, to ensure that the neighbors of each node cover all node types. To fulfill this requirement, we hence had to select from the original dataset nodes of type movie with at least one neighbor of type director (in any layer) and at least one neighbor of type actor (in any layer), while respecting the neighborhood constraint in the monoplex, flattened network. Note that IMDb also contains movie nodes with no links with director or actor nodes, which is however manageable by our methods only. We also filtered out movies with no episode associated with a plot (plots in IMDb are entered by users, and hence it might happen that all episodes of a certain TV series are not associated with plots; or, if available, the plots could be poorly meaningful).
• IMDb-MLH-mb. This network dataset differs from the other one as it aims to reduce class imbalance. To this purpose, we kept the same number of 'comedy' and 'drama' movie nodes as in IMDb-MLH and increased those of the 'action' class, by relaxing the constraint of having at least one neighbor actor and one neighbor director for each movie. Due to this relaxation, we could not use IMDb-MLH-mb for evaluating the competitors, but we exploited the network to further delve into our methods.

Competing methods
We compared Co-MLHAN and Co-MLHAN-SA with two unsupervised learning methods, HeCo (Wang et al. 2021) and NSHE (Zhao et al. 2020), on IMDb-MLH. HeCo is a contrastive multi-view learning method for single-layer heterogeneous attributed graphs. We equipped HeCo with the same meta-paths and the same positives and negatives as used by our methods. NSHE is an unsupervised, non-contrastive GNN-based approach for single-layer heterogeneous attributed graphs, designed to learn embeddings that preserve both pairwise and network schema structure. In contrast to our methods, NSHE generates initial features of nodes by using DeepWalk (Perozzi et al. 2014) for all types of nodes and, if available, combines them with real-world features. We chose these competitors because HeCo and NSHE are the methods sharing the most aspects with ours (cf. "Related work" section). Indeed, they are able to encode local and global node structure separately in an unsupervised manner, thus capturing the heterogeneity of both nodes and relations. Moreover, they respect the network schema of the graph, ensuring that all types of nodes and edges are visited; they can deal with imbalance in the number of neighbors and relations of a certain type within the network schema; and they allow focusing on the generation of embeddings of a specific type while using heterogeneous information. It should however be emphasized that both HeCo and NSHE were designed for heterogeneous attributed monoplex networks, i.e., single-layer graphs. Consequently, we were forced to downgrade our network data through a flattening approach, i.e., by compressing the multilayer graph into a single graph and discarding all replicated edges.
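The flattening step can be illustrated with a small sketch that merges the edge sets of all layers and keeps each replicated edge only once. The input format (undirected edges between entity ids, grouped by layer) and the function name are assumptions for illustration.

```python
def flatten(layers):
    """Merge all layers into one edge set, discarding replicated edges.

    layers maps a layer id to an iterable of undirected (u, v) edges
    expressed on entity ids, so the same edge may occur in several layers.
    """
    flat = set()
    for edges in layers.values():
        for u, v in edges:
            flat.add(tuple(sorted((u, v))))          # orientation-independent key
    return flat


layers = {
    1: [("M1", "A1"), ("M2", "A1")],
    2: [("A1", "M1"), ("M3", "A2")],                 # M1-A1 is replicated
}
flat = flatten(layers)
total = sum(len(edges) for edges in layers.values())
print(sorted(flat))
print("edges lost by flattening:", total - len(flat))
```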

Experimental settings
To model each of our network datasets, intra-layer edges involving nodes of target type (i.e., movie) were considered between nodes of different types only, and pillar edges were considered as the only inter-layer relations, although our framework is designed to model non-pillar edges as well. Meta-paths with both terminal nodes of target type were used in the corresponding view and employed in the meta-path count for the selection of positive samples. For the positive (and negative) selection strategy, we defined two alternatives, named AL3A and AL1A, differing in whether or not they consider constraints on the minimum number of instances of a specific meta-path type (AL stands for 'At Least'). This reflects a different trade-off between the number of positives, which is higher in AL1A, and their meaningfulness, which is expected to be higher in AL3A. The positive statistics corresponding to the two strategies are provided in Table 4.
For all methods, we first learned the embedding for each entity in an unsupervised fashion and then trained a classifier for the final class prediction. We recall that for the final classification task we use the embeddings learned under the meta-path view, since it captures relations between target nodes, although our positive selection strategy and the joint optimization of the loss function entail similar representations. To validate this hypothesis, for each entity v i , we computed the cosine similarity between the embedding under the network schema view ( z NS i ) and the embedding under the meta-path view ( z MP i ). Results on IMDb-MLH confirmed our hypothesis, since we obtained the following statistics on the distribution of similarity measurements: 0.84 as 25th percentile, 0.87 as mean, 0.88 as median, 0.92 as 75th percentile, and 0.97 as maximum value.
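The similarity check between the two views can be sketched as follows, using toy embeddings. The function names are hypothetical, and the quartiles rely on the default interpolation of `statistics.quantiles`, which may differ from the percentile convention used to obtain the figures reported above.

```python
import math
import statistics


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def view_agreement(z_ns, z_mp):
    """Summary statistics of per-entity cosine similarity between the two views."""
    sims = [cosine(a, b) for a, b in zip(z_ns, z_mp)]
    q1, median, q3 = statistics.quantiles(sims, n=4)
    return {
        "p25": q1,
        "mean": statistics.fmean(sims),
        "median": median,
        "p75": q3,
        "max": max(sims),
    }


# Toy embeddings for three entities under the two views.
z_ns = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_mp = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
stats = view_agreement(z_ns, z_mp)
print(stats)
```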
We found the optimal hyperparameters for the representation learning process via grid search. Specifically, we trained the model using the Adam optimization algorithm (Kingma and Ba 2017) with full batch size, for 10,000 epochs, with early stopping based on the contrastive loss value and patience set to 30 (i.e., the training procedure stops if the loss value does not decrease for 30 consecutive epochs), and with λ = 0.5 for the convex combination of the two contrastive losses. The learning rate was set to 0.0001, and dropout regularization with p = 0.3 was applied to the transformed features h. We used Q = 1 attention heads, since GATv2 was shown to work better than multi-head GAT, and temperature value τ = 0.5 . Moreover, we set the hidden dimension (d) for both views to 64, with K = 1 hidden layers in the meta-path view (including multiple layers can often lead to the over-smoothing problem). In the network schema view, for neighborhood sampling, we randomly sampled 7 and 2 nodes of type actor and director, resp., at each epoch with replacement. In the meta-path view, following Wang et al. (2021), we set the threshold for positive selection T pos equal to 5. Finally, we set � = 1 for the inter-layer edges, in order to fully exploit the inter-layer connections represented by pillar-edges. In the case of Co-MLHAN, setting � = 1 to give the maximum importance to the inter-layer edges is justified by the construction of across-layer meta-paths, since their intermediate node corresponds to a pillar edge (between nodes of type actor or director). In the case of Co-MLHAN-SA, we directly had pillar-edges between nodes of type movie, as this proved to be effective in other works, e.g., (Zangari et al. 2021).
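The early-stopping criterion described above (stop when the monitored loss has not decreased for a given number of consecutive epochs) can be sketched independently of any training library. The function name and the idea of feeding it a precomputed sequence of loss values are simplifications for illustration.

```python
def early_stopping_epochs(losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch loss values.

    Training stops at the first epoch where the loss has failed to decrease
    for `patience` consecutive epochs; otherwise it runs to the last epoch.
    """
    best_loss, best_epoch, waited, epoch = float("inf"), -1, 0, -1
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, epoch


history = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
print(early_stopping_epochs(history, patience=3))    # stops before reaching 0.7
print(early_stopping_epochs(history, patience=4))    # reaches the last epoch
```

The toy history shows the trade-off discussed later for the final classifier: a short patience halts training before a late improvement is reached.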
As mentioned before, HeCo and NSHE were trained over the flattened networks, i.e., by discarding multilayer information, since they are conceived for single-layer heterogeneous graphs. While for HeCo we kept the same settings as for Co-MLHAN (cf. "Competing methods" section), for NSHE we selected the same hyperparameters it uses for the IMDb dataset (Zhao et al. 2020). For a fair comparison, we set its embedding dimension to 64. We used the publicly available software implementations for both competitors. 3 Once the final embedding was obtained, we used an MLP with one hidden layer of size 64 as final classifier, trained using the Adam optimization algorithm with full batch size, either for 2000 epochs or until convergence when the early-stopping regularization technique was selected (with a patience value of 300 epochs); in the latter case, since the macro average treats all classes equally, we used F1 score with macro average as quality criterion on the validation set, in order to penalize wrong predictions of the most unbalanced class, i.e., 'action'. We split each dataset into training, test and validation sets, by choosing 70%, 15% and 15% of the entities for each class, respectively. Note that, when early stopping was not used, we just discarded the validation set so as not to vary the training and test sets. The learning rate was set to 0.01.
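A per-class 70%/15%/15% split can be sketched as below. The rounding rule, the seed handling and the function name are assumptions; the actual split procedure may differ in these details.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split entity indices into training, test and validation sets per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test, val = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_test = round(fractions[1] * len(members))
        train += members[:n_train]
        test += members[n_train:n_train + n_test]
        val += members[n_train + n_test:]            # remainder goes to validation
    return train, test, val


labels = ["action"] * 20 + ["comedy"] * 40 + ["drama"] * 40
train, test, val = stratified_split(labels)
print(len(train), len(test), len(val))
```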
We ran our methods and HeCo for 5 independent runs, which differed in random seed assignment, while we ran NSHE only once, due to its computational overhead, thus finally learning 5 and 1 different model weights, respectively. For each trained model, we derived the final network embeddings, to be given as input to the final classifier, and executed the final classifier over 50 independent runs with the same realization of training, test and validation sets. Finally, we computed the average of the performance scores achieved on the test set. Specifically, for each model, we computed F1-score with micro and macro averaging, AUC score, and F1-score of each class. F1-score with micro and macro averaging is used to evaluate the contributions of all classes, considering individual class contributions or treating all classes equally, respectively. ROC AUC (Area Under the Receiver Operating Characteristic Curve) score with OVR (one-vs-rest) averaging strategy is used to indicate the ability of the classifier to distinguish between classes. We also report F1-score for each class to more effectively evaluate how the model performance is affected by the early stopping technique.
Note that for methods from which multiple models were learned (i.e., they were executed over different seeds), we reported the average values for each performance criterion.
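To make the difference between the two averaging schemes concrete, the following sketch computes per-class, micro-averaged and macro-averaged F1 from true and predicted labels. It is a plain re-implementation for illustration (the function name is hypothetical), not the evaluation code used in the experiments.

```python
def f1_scores(y_true, y_pred, classes):
    """Return (per-class F1, micro-averaged F1, macro-averaged F1)."""
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)   # pools all decisions
    macro = sum(per_class.values()) / len(classes)        # every class counts equally
    return per_class, micro, macro


y_true = ["action", "action", "comedy", "comedy", "comedy", "drama", "drama", "drama"]
y_pred = ["action", "comedy", "comedy", "comedy", "drama", "drama", "drama", "drama"]
per_class, micro, macro = f1_scores(y_true, y_pred, ["action", "comedy", "drama"])
print(per_class, micro, macro)
```

In this toy example the minority class 'action' lowers the macro average more than the micro average, which is why the macro-averaged F1 is the stricter criterion on unbalanced data.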

Results
We organize the presentation of our experimental results into four parts: "Evaluation on IMDb-MLH" and "Evaluation on IMDb-MLH-mb" sections concern the evaluation on IMDb-MLH and IMDb-MLH-mb, respectively, whereas "Qualitative inspection of the embeddings" section provides a qualitative analysis of the learned embeddings. Finally, "Summary of results" section summarizes our experimental findings.

Evaluation on IMDb-MLH
We first compared Co-MLHAN and Co-MLHAN-SA with HeCo and NSHE using initial features corresponding to the best top-1000 words by tf-idf and positives selection under the tougher condition AL3A, which assumes fewer but higher-quality positives per node (cf. Appendix 2); moreover, to ensure a fair comparison with our competitors requiring a flattening approach, we used for our methods only features associated with entities (entity-level features, for short EL).
We tested the classifier both with and without the early-stopping technique. In both cases, as shown in Table 5, our proposed methods achieve high performance scores according to all quality criteria, consistently outperforming the competitors. In fact, although the amount of edges that were "lost" due to the flattening approach is relatively small (15% and 20%, resp.), the compression of all layers does not allow the competitors to suitably capture the relations on different layers as well as their inter-layer dependencies. Note that we could not apply our competitors on a single layer of our network, since many entities are missing in each layer; as shown in Table 2, only 504 out of 2807 target entities are shared between the two layers. Co-MLHAN achieves the best performance on almost all the quality criteria (5 out of 6), while Co-MLHAN-SA, the closest runner-up, is the most effective in predicting movies of class 'action' (0.467), which is the least represented class. This slight difference between our two methods might be due to their different modeling of across-layer information w.r.t. pillar-edges: the across-layer meta-paths defined by Co-MLHAN can be more meaningful, as they exploit richer inter-layer information than Co-MLHAN-SA. Moreover, the poor performance of HeCo w.r.t. NSHE shows that the contrastive learning mechanism performed by HeCo is not very effective for this dataset. Particularly, HeCo shows the lowest performance on the 'action' class, indicating that its learned embedding is unable to discriminate the instances of the most unbalanced class.
Impact of early-stopping on the entity classification. Focusing on the results obtained by using the early-stopping technique, the overall performance of our methods turns out to be slightly lower than in the setting without early-stopping.
In particular, from Table 5, we notice that the F1-score values corresponding to the 'action' class decrease for all methods when the early-stopping technique is used. We indeed found out that in some runs the training procedure stops too early because the F1 macro computed on validation set does not improve within the patience value. In this respect, Fig. 10 shows the testing and validation F1 macro scores of the final classifier averaged over 50 runs of the same (i.e., fixed-seed) model of Co-MLHAN, with and without early-stopping technique. When choosing early-stopping, the best-performing epochs are distributed with a mean value of 234 ± 272 , while the 25%, 50% (median) and 75% percentiles are equal to 14, 32 and 421, respectively. Since the increase in the F1 macro occurs around the 400th epoch (Fig. 10, left), the classifier appears to be under-trained in some runs, thus it cannot boost its performance. On the other hand, if the training is not early-stopped, the classifier learns to distinguish more accurately the instances of the most unbalanced class in each run.
The above results would suggest that, in the effort of avoiding overfitting and saving computational resources through the early-stopping technique, the final classifier might be under-trained, leading to an underfitting problem if the patience value is not properly set. In fact, we observed that the F1 macro on the validation set stabilizes around the 1000-th epoch (Fig. 10, right); however, as shown in Fig. 11, the overall benefit gained by a high patience value is marginal: a patience value set to 900 led to 0.644 F1 macro, which just decreases to 0.615 if the patience is set to 300, with only an improvement on the most unbalanced class, as shown in Fig. 12.
We point out that the hyperparameters of the final classifier were not globally optimized, since this goes beyond the main focus of this work; nonetheless, we recall that the classifier is shared by our methods and the competing ones, so as to fulfill fairness in the comparative evaluation. We therefore preferred to speed up the classification stage and set the patience value to 300 for all the experiments employing early-stopping technique on the classifier.
Impact of initial feature selection. We analyzed the behavior of the methods when equipped with all initial real features, i.e., without constraining the size of the initial feature space. We carried out the experiments with the same positives selection strategy as in the previous evaluation. Results corresponding to the early-stopping setting are reported in Table 6 (note that we observed no particular differences when not using early-stopping). Compared to the previous case corresponding to the top-1000 initial features, the performance of all methods tends to decrease due to the higher and sparser dimensionality. An exception is represented by NSHE, which slightly improves, probably due to its feature initialization (Zhao et al. 2020). However, Co-MLHAN and Co-MLHAN-SA still outperform both competitors, with the former achieving the highest F1 micro, F1 macro and AUC values. Moreover, when keeping all words as initial features, our methods report high values on the 'action' class (despite the use of the early-stopping technique), while the competitors maintain similar values to the previous case with top-1000 initial features.
The above results hence suggest that dealing with a full space of initial features can enable Co-MLHAN and Co-MLHAN-SA to better distinguish the movie instances, and in particular that our methods can effectively exploit these features unlike the competitors.

Evaluation on IMDb-MLH-mb
We further evaluated Co-MLHAN and Co-MLHAN-SA using the IMDb-MLH-mb network. More specifically, we investigated the behavior of our methods when equipped with node-level initial features, hereinafter referred to as NL, i.e., with layer-dependent initial features. To this purpose, we first compared the methods under the following  setup: initial features corresponding to the top-1000 words, positives selection AL3A, with and without using early-stopping technique.
As can be noticed from Table 7, performance generally increases w.r.t. the methods initialized with entity-level features, especially in terms of F1 macro, as a direct consequence of a better coverage of the 'action' class. Comparing the results obtained with entity-level (EL) and node-level (NL) features, we observe that, as expected, exploiting initial features at each layer (i.e., node-level case) leads to higher performance of the methods.
Moreover, we observe that the difference between the case with early-stopping and the case without early-stopping decreases on IMDb-MLH-mb, regardless of the layer dependency of the initial features, i.e., EL or NL setting.
Furthermore, we changed the meta-path count strategy for positive selection (AL1A) (refer to Table 4 and Appendix 2 for additional details) to test the sensitivity of our methods, without changing the feature initialization. Results shown in Table 8 reveal a marginal decrease in performance, slightly more evident when using node-level initial features. This might be because, according to AL1A, each entity has a number of positive samples which is on average greater than for the AL3A alternative, but the positives can be less meaningful (cf. Appendix 2); nonetheless, we observed a negligible worsening in the performance.

Qualitative inspection of the embeddings
After discussing the results from a numerical point of view, in this section we aim to visually analyze the final entity embeddings in order to gain insights in terms of patterns and clusters. To this purpose, we used Uniform Manifold Approximation and Projection (UMAP) (McInnes et al. 2018), which is a highly effective non-linear dimensionality reduction algorithm, particularly useful for visualizing relative proximities in high-dimensional data. It is based on manifold learning, which can be seen as a generalization of linear projection frameworks like PCA, sensitive to non-linear structures in data. In recent years, UMAP has gained popularity since it offers several advantages over related algorithms, such as PCA and t-SNE (van der Maaten and Hinton 2008). In particular, compared to the latter, UMAP can achieve a better preservation of the global structure of data in the final projection, it is more efficient, and it has no computational restrictions on the embedding dimension. UMAP defines two main hyperparameters to control the balance between local and global structure: nearest neighbors and minimum distance, denoting the number of local nearest neighbors to process, and how tightly UMAP packs points together, respectively. On the one hand, lower values of minimum distance result in more clustered embeddings, while larger values prevent UMAP from packing points together, leading to a more uniform dispersion of points; on the other hand, lower values of nearest neighbors allow UMAP to concentrate more on the local structure, while higher values enable looking at more neighbors for each point, resulting in a more global representation. Figure 13 shows the two-dimensional UMAP visualization of the initial feature embeddings with tf-idf weighting (Fig. 13a), and of the final embeddings under the meta-path view learned by our methods (Fig. 13b, c), w.r.t. IMDb-MLH. 
We executed UMAP with the following main hyperparameters: size of local neighborhood used for manifold approximation equal to 15, minimum distance between points equal to 0.7, and cosine similarity as proximity measure.
In the initial representation (Fig. 13a), all entities of type movie are grouped closely together regardless of their genre, resulting in a cluttered representation. This is actually not surprising, since their plots are provided by users without meeting quality requirements. Nonetheless, Fig. 13b and c show how the final embeddings learned by our methods allow UMAP to better separate entities of different classes.

Summary of results
In this section, we summarize the main findings of the empirical evaluation of our framework. We evaluated it on two novel network datasets derived from IMDb (cf. Appendix 2), which are simultaneously multilayer, heterogeneous, and attributed. Specifically, we modeled IMDb as a temporal network with two layers, where each layer is heterogeneous and corresponds to years of movie releases. The first network dataset, named IMDb-MLH, was conceived for the comparative evaluation of our framework, since it fulfills the requirements of our competitors. The second network dataset, named IMDb-MLH-mb, was designed to reduce class imbalance and is not applicable to the competitors. Thus, we used it to investigate different input settings of our methods, i.e., Co-MLHAN and Co-MLHAN-SA.
Experimental results on the entity classification task showed that our methods significantly outperform existing competitors, effectively exploiting both external content and multilayer information. We also demonstrated that the overall performance does not degrade even in the (less realistic) case of feature-set size greater than the number of target nodes. In this case, our methods obtained higher values on the most unbalanced class, suggesting that Co-MLHAN and Co-MLHAN-SA can effectively exploit the full space of initial features. To ensure fairness in the evaluation, the final MLP classifier was shared by all methods. Moreover, we investigated the impact of the early-stopping regularization technique on the final classifier, confirming that underfitting phenomena can arise if the patience value is not properly set.
We further inspected the quality of the learned embeddings through a data visualization tool, showing that our cross-view contrastive mechanism is beneficial for the downstream classification task, since instances belonging to different genres are better separated than in the initial embedding with only tf-idf information. As a related aspect, we provided evidence that, as theoretically expected, the embeddings under the meta-path view share a similar structure with the corresponding embeddings under the network schema view, thus enabling the use for downstream tasks of the embeddings learned under one or the other view.
We investigated further properties of our methods using IMDb-MLH-mb. In that stage of evaluation, the difference between the case with and without early-stopping is strongly mitigated by the lower imbalance between classes. We showed that our framework is resilient to the selection of positive samples (AL1A vs. AL3A), and able to effectively exploit node-tailored feature information (NL vs. EL).
It should also be noted that our Co-MLHAN and Co-MLHAN-SA, which differ in the meta-path view, achieved similar performance in all the experiments, showing that both approaches can successfully handle information coming from pillar edges. Specifically, the performance by Co-MLHAN would suggest that defining meta-paths between different layers (i.e., across-layer meta-paths) allows one to suitably integrate high-order relations between nodes in different layers.

Related work
We discuss below the most relevant GNN-based approaches that are designed for different aspects of complex networks and particularly related to our approach. Over the last years, several works focused on the extension of popular GNN models such as GCN (Kipf and Welling 2017) and GAT (Velickovic et al. 2018) to the heterogeneous or multilayer case. Their extension is still an open research problem. In this section, we explore both semi-supervised and unsupervised learning paradigms, with emphasis on contrastive learning approaches in unsupervised contexts.

Representation learning for heterogeneous attributed networks
A major challenge for heterogeneous networks is modeling information from nodes that are reachable via paths of different lengths, possibly involving different semantics and structural relations.
HetGNN introduces a random walk with restart strategy to sample a fixed size of strongly correlated heterogeneous neighbors for each node, and groups them on the basis of their type. It employs two modules of recurrent neural networks, encoding deep feature interactions of heterogeneous contents and content embeddings of different neighboring groups, respectively, which are further combined by an attention mechanism. Co-MLHAN shares with HetGNN the modeling approach to external content encoding.
Other models leverage meta-path based neighbors and they differ in the information captured along the meta-paths. HAN focuses only on the information associated with the endpoint nodes of meta-paths. It employs both node-level and semantic-level attentions. Upon the learned attention values, the model can generate node embeddings by aggregating features from meta-path based neighbors in a hierarchical manner. In addition to the information of the terminal nodes in meta-paths, MAGNN (Fu et al. 2020) also incorporates information from intermediate nodes along the meta-paths. It uses intra-meta-path aggregation to incorporate intermediate nodes, and inter-meta-path aggregation to combine messages from multiple meta-paths. DHGCN (Manchanda et al. 2021) incorporates both the information of the nodes along the meta-paths and the information in the ego-network of the endpoint nodes, i.e., the information coming from the direct neighbors of the terminal nodes. It utilizes a two-step schema-aware hierarchical approach, performing attention-based aggregation of information from the immediate ego-network, and attention-based aggregation of information from the neighbors of target type using meta-path based convolutions. HGT (Hu et al. 2020) considers only the direct neighbors of each node, without manually designing meta-paths, but incorporates information from high-order neighbors of different types through message passing. It introduces a node and edge type dependent attention mechanism and uses meta relations to parameterize the weight matrices for calculating attention over each edge. Co-MLHAN supports a user-specified selection of meta-paths and focuses on meta-path based neighbors of target type. We discard the information of intermediate nodes, according to the idea of differentiating local and high-order information in distinct views.
More recently developed approaches rely on considering node local and high-order structure separately. NSHE (Zhao et al. 2020) introduces a network schema sampling method which generates sub-graphs (i.e., schema instances) and a multi-task learning method with different predictions to handle the heterogeneity within each schema instance, thus preserving pairwise and network schema proximity simultaneously. HeCo employs a cross-view contrastive mechanism upon the definition of two views of the graph, named network schema view and meta-path view, which collaboratively supervise each other. In the network schema view, a node embedding is learned by aggregating the information from its direct neighbors, applying node-level and type-level attention for the same type and different types of nodes, respectively. In the meta-path view, a node embedding is learned by passing messages along multiple meta-paths, applying meta-path specific convolutional networks and semantic-level attention for the same and different types of meta-paths, respectively. VACA-HINE (Khan and Kleinsteuber 2021) aims at jointly learning node embeddings and cluster assignments, using a variational module for the reconstruction of the adjacency matrix in a cluster-aware manner and employing multiple contrastive modules for both local and global information.
Similarly to HeCo, Co-MLHAN adopts a multi-view approach and a contrastive learning mechanism of mutual supervision between two views of the graph, with the addition of across-layer information included in the views.

Representation learning for multilayer networks
Some major challenges for multilayer networks involve modeling multiple types of interactions, including both intra- and inter-layer edges, and exploiting the information of nodes matching the same entity. Here we discuss GNN-based methods focusing on their across-layer information modeling.
mGNN (Grassia et al. 2021) provides a generalization of GNNs to the case of multilayer networks. It deals with outside-layer neighborhood, building an additional layer for the inter-layer relations connecting nodes in different layers. The embedding at each layer is computed propagating node features in both the intra- and inter-layer neighborhood through two independent GNN layers. We share with this approach the capability to deal with general multilayer networks with inter-layer edges not being pillar-edges. Nevertheless, unlike mGNN, Co-MLHAN can handle different types of relations in each layer.
Among the GCN-based approaches, MGCN (Ghorbani et al. 2019) builds a graph convolutional network for each layer employing only links between nodes of the same layer, while an unsupervised term in the loss function also considers inter-layer dependencies. A different GCN-based approach is mGCN (Ma et al. 2019), which models explicit adjacency links among different layers. mGAT (Xie et al. 2020) is an attention-based approach that introduces a regularization term to the loss function to constrain the similarity between each pair of layers. GrAMME (Shanthamallu et al. 2020) provides two different approaches, named GrAMME-SG and GrAMME-Fusion. The former explicitly builds the inter-layer edges between each node in a layer and its counterpart in a different layer, and applies a series of attention layers with the fusion-head method. The latter deals with inter-layer dependencies in a different way, as it builds layer-wise attention models and introduces an additional layer that exploits inter-layer dependencies using only fusion heads. ML-GCN and ML-GAT (Zangari et al. 2021) exploit both within- and outside-layer neighborhoods when computing the embedding on each layer, designing an extension of the GCN and GAT architectures, respectively, to multilayer networks, using the multi-head attention mechanism but without a fusion-head strategy to integrate the inter-layer dependencies. Co-MLHAN employs an attention-based component for learning the importance of each layer. For the modeling of intra-layer information, on the other hand, we do not rule out the use of extensions of GCN or GAT, suggesting that the choice should depend on whether one needs to distinguish between information of different importance.
More recent works introduce contrastive learning to boost the embeddings in multilayer networks. MGCCN uses self-reconstruction, which learns the embedding of each layer by capturing structure and content information, and contrastive fusion, which captures the consistent information in different layers by pulling close positive pairs and pushing away negative pairs in intra-layer and inter-layer connections. Also, it exploits pillar-edges to identify positive pairs. Co-MLHAN shares the approach of allowing different attributes for nodes in different layers and of not employing data augmentation to construct negative pairs. AMC-GNN generates two graph views by data augmentation and compares the embeddings of different layers of GNN encoders to obtain feature representations, learning the importance weights of embeddings in different layers adaptively through the attention mechanism. In contrast to Co-MLHAN, the two views in AMC-GNN are obtained exploiting data augmentation on the original graph. DMGI (Park et al. 2019) integrates the relation-specific embeddings corresponding to different layers by introducing a consensus regularization framework minimizing their disagreements and a universal discriminator for all positive and negative pairs regardless of the relation type. Similar to Co-MLHAN, the views of this approach do not rely on changing the graph structure, but the similarity computation still employs a corruption of the attribute matrix, in contrast to our proposed approach. cM2NE (Xiong et al. 2021) proposes a contrastive learning based embedding framework modeling multiple structural views for each layer. The contrastive learning is performed to extract information for a specific view, across the views of a layer and across the aligned layers. Co-MLHAN has a coarser granularity in the multi-view mechanism, as it is not applied on each layer; on the contrary, our views include by design the across-layer information.
We would like to stress here that all the above approaches are designed for networks with only one type of node.

Representation learning for heterogeneous attributed multilayer networks
In the past few years, interest has started to emerge in combining heterogeneity and multilayer aspects; however, the literature still lacks works focusing on embedding generation for such networks. GATNE (Cen et al. 2019) splits the overall node embedding into three parts: the base embedding and attribute embedding are shared among edges of different types, while the edge embedding is computed by aggregation of neighborhood information with the self-attention mechanism. This approach uses meta-paths via a meta-path based random walk strategy to generate node sequences given as input to a skip-gram model during the optimization. Co-MLHAN also employs meta-paths to capture high-order relations between nodes, although the meta-path types are specified at the modeling stage. Moreover, we learn a single encompassing embedding for each node/entity, incorporating different relation types.
We want to emphasize that most existing works claiming to deal with networks being both heterogeneous and multilayer, actually refer to a multiplicity of nodes or of relations that hold globally over the network, but not necessarily on individual layers. The latter is instead an important aspect that we address in our proposed framework.

Conclusions
In this work, we proposed a self-supervised graph representation learning framework, based on a cross-view contrastive learning mechanism, for networks that are simultaneously multilayer, heterogeneous and attributed. Remarkably, our framework is able to deal with networks where each layer is a heterogeneous graph with attributed nodes, and with both intra- and inter-layer links between nodes. The embeddings of nodes of any given target type are learned by contrasting the encodings generated by two views, i.e., network schema view and meta-path view, which embed local and high-order neighborhood information, respectively. The meta-path view also enables handling across-layer information, which we address with two versions of the framework differing in the modeling of pillar edges: Co-MLHAN, modeling a particular type of meta-paths with terminal nodes belonging to different layers and the intermediate node (of a type different from the target) matching a pillar-edge, and Co-MLHAN-SA, directly considering all the instances of the same target entities in other layers. The learned embeddings are task-independent and hence can eventually be used for different downstream graph mining tasks, at entity/node level, edge level, or graph level. We demonstrated our methods on the task of entity classification, based on originally developed network datasets in the IMDb movie context, and including a comparative evaluation with recently proposed methods for heterogeneous graph embedding, HeCo and NSHE.

Possible extensions and future directions
Although our framework can handle an arbitrary number of layers, this reflects on the number of learnable parameters, thus impacting on the framework complexity. Particularly, for Co-MLHAN, the number of learnable parameters increases with the number of layers in both views, while Co-MLHAN-SA is less sensitive to the number of layers in the meta-path view, but is still affected in the network-schema view, since we distinguish relations of the same type across different layers. To reduce the number of learnable parameters of the framework, one direction would be to modify the network schema view so as to make node-level attention weights for a certain relation type be shared over all layers.
As we discuss in Appendix 4, the computational complexity of our framework does not hinder its scalability, since several steps can be easily parallelized. We leave as future work the training of the models based on a mini-batch setting in combination with sampling methods (Hamilton et al. 2018; Chen et al. 2018; Zeng et al. 2019; Hu et al. 2020).
Another aspect that might be addressed concerns the modeling of meta-paths connecting nodes of different types, where at least one (rather than both) among the starting and ending node is of target type. In this case, the resulting meta-path based graph would not be homogeneous, since the meta-path based neighbors are of different types. Since increasing the number of views is unlikely to be beneficial (as stated in (Hassani and Ahmadi 2020)), the definition of the two views should hence be revised.
A further extension would concern the definition of different selection strategies for the positive and/or the negative samples in the contrastive learning stage. On the one hand, the learned features could be exploited for the selection of positives in addition to structural information, and on the other hand, hard negative sampling techniques could be devised (Kalantidis et al. 2020; Ahrabian et al. 2020; Robinson et al. 2020).
Our framework can also be extended to deal with graph mining tasks other than node/entity classification, such as regression, clustering, and link prediction. For instance, to accomplish the latter, we would need to handle the embeddings downstream of one of the two views at node-level so as to compute pair-wise hidden representations of nodes (upon which a similarity function can be used to predict the link strength of any pair of nodes).
Equally interesting would be to investigate other applications of our framework in different scenarios, having different structural and semantic properties, stressing the flexibility of the proposed framework by identifying datasets with more or less overlap between layers, and possibly with one or more node types without replicas. Contextually, by identifying richer sources of information, we could inspect other learning paradigms, such as multi-modal or multi-task learning, where multiple tasks are solved simultaneously, which has been proven effective for the task of recommendation in heterogeneous networks.
least one MDM or one MAM, i.e., the two movies have at least a director or an actor in common; AL3A (at least 3 actors) increases the meta-path count for each MDM or MAM instance connecting the two entities, requiring at least one MDM or three MAMs, i.e., the two movies have at least a director or at least three actors in common. As a result, AL1A can rely on more positives per entity, but these are less meaningful (they include movie pairs sharing only one actor), while AL3A can rely on fewer but more meaningful positives per entity. Main statistics of the two alternatives are provided in Table 4.
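The two thresholds can be illustrated with a minimal sketch. The function name `is_positive` and its arguments are hypothetical, and the sketch covers only the threshold rule described here, not any further ranking or capping of positives that the full procedure may apply.

```python
def is_positive(n_mdm: int, n_mam: int, strategy: str = "AL3A") -> bool:
    """Decide whether a pair of movies qualifies as a positive pair.

    n_mdm: number of MDM instances (shared directors) connecting the pair.
    n_mam: number of MAM instances (shared actors) connecting the pair.
    strategy: "AL1A" (at least 1 actor) or "AL3A" (at least 3 actors).
    """
    # Minimum number of shared actors required by each alternative
    # (hypothetical encoding of the rule stated in the text).
    min_actors = {"AL1A": 1, "AL3A": 3}[strategy]
    # One shared director is sufficient under both alternatives.
    return n_mdm >= 1 or n_mam >= min_actors


# A pair sharing a single actor is positive under AL1A only.
pair_al1a = is_positive(n_mdm=0, n_mam=1, strategy="AL1A")
pair_al3a = is_positive(n_mdm=0, n_mam=1, strategy="AL3A")
```

Under this rule, every AL3A positive is also an AL1A positive, which is consistent with AL1A yielding more positives per entity on average.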
The across-layer meta-paths are built upon the same meta-path types, with the intermediate node matching a pillar-edge. For instance, as shown in Fig. 7, given a meta-path of type MAM (for each layer), the corresponding across-layer meta-path has the same actor in both layers and the two movies belonging to different layers.
We provide nodes/entities of target type with real-world initial features; for the other two types, we identify initial features associating each node with a one-hot indicator vector (Kipf and Welling 2017). Initial features of movie nodes/entities are extracted from plots of individual episodes, where terms are selected according to their term frequency-inverse document frequency (tf-idf) relevance scores. Specifically, we filter out words that appear in fewer than 10 documents or in more than 60% of the total corpus size. After that, in our experimental settings, we either selected the top-1000 words according to their tf-idf scores, or kept all (unfiltered) words (4085).
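The vocabulary selection just described can be sketched as follows. This is only an illustration under stated assumptions: documents are already tokenized, the per-word relevance is taken as the sum of tf-idf scores over the corpus with a natural-logarithm idf (the aggregation is not specified in the text), and the function name `select_vocabulary` is hypothetical.

```python
import math
from collections import Counter


def select_vocabulary(docs, min_df=10, max_df_ratio=0.6, top_k=1000):
    """Filter words by document frequency, then rank them by tf-idf.

    docs: list of token lists (one list per plot).
    min_df: drop words appearing in fewer than min_df documents.
    max_df_ratio: drop words appearing in more than this share of documents.
    top_k: number of top-ranked words to keep (None keeps all filtered words).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    kept = {w for w, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}
    # Assumption: relevance of a word = sum of its tf-idf over all documents.
    scores = Counter()
    for tokens in docs:
        tf = Counter(t for t in tokens if t in kept)
        for w, f in tf.items():
            scores[w] += f * math.log(n_docs / df[w])
    # Sort by decreasing score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked if top_k is None else ranked[:top_k]


# Toy corpus with thresholds scaled down to its size.
toy_docs = [["a", "b"], ["a", "b", "c"], ["a", "d"], ["a", "d", "d"], ["a", "e"]]
toy_vocab = select_vocabulary(toy_docs, min_df=2, max_df_ratio=0.6, top_k=None)
```

Calling the function with `top_k=None` corresponds to keeping all words that survive the document-frequency filter, while `top_k=1000` corresponds to the reduced feature set.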
We emphasize that Co-MLHAN is conceived to be general and flexible, so as to exploit all available information while remaining effective even when such information is lacking, e.g., in case of poor across-layer relationships, or when one or more types of neighbors are missing for some nodes; for instance, a new TV series could have a single season or the information regarding its cast could be missing. In addition, nodes could show high variability in the number of neighbors, e.g., TV series can be associated with a large cast or not. External information can indeed be available either at node level or entity level, therefore initial features can be layer-dependent and associated with nodes, or layer-independent and associated with entities. For instance, we might handle the plots of the TV series (entities), which we also assign to the respective seasons (nodes) in different years (layers), as well as the plots of the individual seasons, from which we derive the overall plots of the series.

Appendix 3. Content encoding
The first stage in our proposed framework aims to encode contents associated with nodes or entities possibly coming from external sources, which might be of different domains. Note that, for an attributed heterogeneous graph, different types of nodes could be associated with different types of content, and that even nodes of the same type could have information from multiple sources and in different forms, such as structured attributes, unstructured text, and multimedia content. External information can indeed be available either at node level or entity level, therefore initial features can be layer-dependent and associated with nodes, or layer-independent and associated with entities.
As previously introduced, given a type $a \in A$, we denote with $x^{(a)}_{i,l}$ the initial feature vector of node $(i, l)$ (i.e., entity $v_i$ in layer $G_l$), and with $x^{(a)}_i$ the initial feature vector of entity $v_i$. We admit that the initial feature vectors corresponding to different entity/node types could be of different lengths. In that case, the content encoding stage requires a feature transformation step in order to project features of different types to the same latent space, using type-specific transformation matrices. Formally, in case of content-features associated with entities, we obtain the projected feature embedding $h^{(a)}_i$, for entity $v_i$ of type $a$, as follows:

$$h^{(a)}_i = \sigma\left(W^{(a)} x^{(a)}_i + b^{(a)}\right) \qquad (26)$$

where $W^{(a)} \in \mathbb{R}^{d \times d^{(a)}_{in}}$ and $b^{(a)} \in \mathbb{R}^{d}$ are the learnable matrix and bias term for the entity type $a$, respectively, and $x^{(a)}_i$ is the initial feature vector of length $d^{(a)}_{in}$ associated with entity $v_i$. Analogously, in case of content-features associated with nodes, i.e., dependent on the specific layer, we obtain the projected feature embedding $h^{(a)}_{i,l}$, for node $(i, l)$ of type $a$, as follows:

$$h^{(a)}_{i,l} = \sigma\left(W^{(a)}_l x^{(a)}_{i,l} + b^{(a)}_l\right) \qquad (27)$$

where $W^{(a)}_l$ and $b^{(a)}_l$ are the learnable layer-specific matrix and bias term for the entity type $a$, respectively, and $x^{(a)}_{i,l}$ is the initial feature vector of length $d^{(a)}_{in}$ associated with node $(i, l)$.
For both Eqs. 26 and 27, $\sigma(\cdot)$ is a non-linear activation function; by default, we define it as $\mathrm{ELU}(x) = \max(0, x) + \min(0, \mu(\exp(x) - 1))$, with $\mu = 1$. Note also that $d$ is chosen such that $d \le \min_{a \in A} \{d^{(a)}_{in}\}$. Considering the possibility that each entity/node, regardless of its type, could be associated with information coming from multiple and diverse sources, the process of content feature generation would be more articulated, as two aspects should be considered, namely content-specific feature extraction and multi-modal content feature aggregation. Indeed, an aggregation step would be needed to integrate contents from different modalities (i.e., structured attributes, text, images, etc.), and it can effectively be carried out by supplying an autoencoder model with the concatenation of the various content-specific embeddings, or by using an attention layer for their convex combination. Moreover, the aggregation step would be preceded by content-specific feature extraction in case the feature vectors $x$ were not immediately available, and hence suitable methods (e.g., word embeddings or contextualized language models for text, convolutional networks for images, etc.) should be applied to generate features from the raw data associated with nodes/entities.
We also allow that each entity/node, regardless of its type, could be associated with no external information; in this case, initial features could be randomly generated, using identity matrices or sampling from a selected type of distribution (e.g., uniform, normal, exponential). It should however be noted that content feature generation is beyond the objectives of this work; the interested reader can refer to recently developed literature on this topic, such as (Baevski et al. 2022) which proposes a general self-supervised learning framework for generating contextualized latent representation of different modalities, including speech, images and text.
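The type-specific projection described in this appendix can be illustrated with a minimal pure-Python sketch. The weights below are fixed toy values rather than learned parameters, and the names `elu` and `project` are hypothetical; an actual implementation would rely on a deep learning library with trainable parameters.

```python
import math


def elu(x, mu=1.0):
    """ELU activation: x for x > 0, mu * (exp(x) - 1) otherwise."""
    return x if x > 0 else mu * (math.exp(x) - 1.0)


def project(x, W, b, mu=1.0):
    """Project a feature vector x of length d_in to length d.

    W: d rows of d_in weights each (type-specific, or type- and
       layer-specific when features are attached to nodes).
    b: bias vector of length d.
    """
    return [
        elu(sum(w_j * x_j for w_j, x_j in zip(row, x)) + b_k, mu)
        for row, b_k in zip(W, b)
    ]


# Two types with different input lengths mapped to the same d = 2 space.
x_movie = [2.0, 1.0, 5.0]            # d_in = 3 (toy values)
W_movie = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]
x_actor = [1.0, 0.0]                 # d_in = 2 (toy one-hot vector)
W_actor = [[0.5, 0.0], [0.0, 0.5]]
h_movie = project(x_movie, W_movie, [0.0, 0.0])
h_actor = project(x_actor, W_actor, [0.0, 0.0])
```

Both outputs have length $d = 2$ regardless of the input length, which is what allows embeddings of different types to be combined in the subsequent stages.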
after the content encoding stage and the worst case in terms of magnitude of the networks. That is, each entity appears in each layer, i.e., the total number of nodes in the network schema view is $O(|V|\ell)$, and each target node appears in the meta-path based graphs, i.e., the total number of nodes in the meta-path view is $O(|V^{(t)}|\ell)$ for each meta-path. Without loss of generality, we consider that each relation $r \in R$ involves nodes of target type (thus ensuring that $|R|$ relations are considered in the network schema view), and we discard the across-layer meta-paths. Before delving into the details, we recall that the input and output of each sub-module of stages 2 and 3 are $d$-dimensional embeddings, with $d \ll |V_L|$.
As concerns the spatial complexity, the memory requirement is mainly given by the storage of the hidden states (e.g., $z^{NS}$ and $z^{MP}$), the learnable weight matrices ($W$s) and attention vectors ($a$s). In particular, the attention values in NSVE-1 require an overhead of $|E_r|$ for each relation $r$ involving the target nodes. Moreover, we need to store in memory the positive and the negative samples for each entity, i.e., $P_i$ and $N_i$, where $|P_i \cup N_i| = |V^{(t)}|$.
Regarding the time complexity, the graph structure encoding stage requires the computation of embeddings under the network schema and the meta-path view, which can be calculated independently and therefore can be parallelized. The computational complexity of the former view is shared by both Co-MLHAN and Co-MLHAN-SA, while the latter view requires a separate analysis for the two methods. In the following, we analyze the costs of each of the steps performed at the two views.

(NSVE-1) The computational cost of the NSVE-1 step, where node-level attention takes place, depends on an attention mechanism for each relation in each layer. Given a relation type $r \in R$, let $V_r$ be the set of nodes connected through the edges in $E_r$. The computational complexity of Eq. 2 with a single attention head is $T_r = O(|V_r| d^2 + |E_r| d)$ (Brody et al. 2021), where the first term concerns the feature transformation step of GATv2, while the second term corresponds to the cost of calculating a general attention function, which can be parallelized. In the case of $Q$ attention heads, both the first and the second terms are multiplied by a factor of $Q$, where the different heads can still be parallelized. Note that in practice, each target node considers only a subset of neighbors for each relation $r$ due to our sampling strategy, which allows saving computational resources. Hence $|E_r|$ is an upper bound to the number of edges involved in relation $r$. Finally, since we equipped our approaches with the same attention mechanism on each relation $r$, the final time complexity of NSVE-1 is $O(\max(T_{r_1}, T_{r_2}, \ldots, T_{r_{|R|}}))$.

(NSVE-2) NSVE-2 employs the same multilayer perceptron model for type-level and across-layer attention. In particular, under the assumption that each relation $r \in R$ involves nodes of target type, the time complexity of the type-level attention step is $O(|V^{(t)}| d^3 |R|)$, because it involves dense matrix and vector operations. For the across-layer attention case, under the initial hypothesis that each target entity appears in each layer, the time complexity is $O(|V^{(t)}| \ell d^3)$. Also, note that in both cases the attention coefficients can be calculated in parallel, for each relation $r$ and each layer $l$, respectively.
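As a rough worked example of the NSVE bounds, the following sketch counts the dominant operations of the two steps while ignoring constant factors. The function names and the toy sizes are hypothetical, and the numbers are order-of-magnitude estimates derived from the stated bounds rather than measured costs.

```python
def nsve1_cost(relations, d, heads=1, parallel=True):
    """Estimate the NSVE-1 cost from the bound T_r = |V_r| d^2 + |E_r| d.

    relations: list of (num_nodes, num_edges) pairs, one per relation.
    heads: number of attention heads Q (multiplies both terms).
    parallel: if True, relations run concurrently, so the slowest one
              dominates; otherwise the per-relation costs add up.
    """
    per_relation = [heads * (n * d ** 2 + e * d) for n, e in relations]
    return max(per_relation) if parallel else sum(per_relation)


def nsve2_cost(n_target, d, n_relations, n_layers):
    """Estimate the NSVE-2 cost: type-level plus across-layer attention."""
    type_level = n_target * d ** 3 * n_relations
    across_layer = n_target * n_layers * d ** 3
    return type_level + across_layer


# Toy sizes: two relations, embedding length d = 4.
toy_relations = [(10, 20), (5, 100)]
cost_parallel = nsve1_cost(toy_relations, d=4)
cost_sequential = nsve1_cost(toy_relations, d=4, parallel=False)
cost_attention = nsve2_cost(n_target=10, d=2, n_relations=3, n_layers=2)
```

In the toy setting, the second relation dominates because of its larger edge count, which illustrates why limiting the sampled neighbors per relation reduces the cost of this step.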

(MPVE-1) For each meta-path and layer, the complexity of MPVE-1 corresponds to the complexity of GCN (Kipf and Welling 2017), whose cost for $K$ neural layers is $O(K\,\mathrm{nonzero}(A_l)\,d + K |V^{(t)}| d^2)$, where $\mathrm{nonzero}(A_l)$ is the number of non-zero entries in the adjacency matrix of the $l$-th layer. Note that, in practical applications, $K$ assumes small values due to the issue of oversmoothing, and the computations on each layer, meta-path (and across-layer meta-path) are independent of each other, hence they can be easily parallelized.

(MPVE-2) MPVE-2 requires an attention model to compute the importance of each meta-path in each layer. Since the attention mechanism is the same as used in NSVE-2, the cost of MPVE-2 is $O(p |V^{(t)}| \ell d^3)$, where the attention coefficients for each meta-path can be computed in parallel.

(MPVE-SA-1) Regarding Co-MLHAN-SA, the time complexity of MPVE-SA-1 corresponds to the application of ML-GCN (Zangari et al. 2021) with $K$ neural layers. Its computational complexity is $O(K\,\mathrm{nonzero}(A_{sup})\,d + K |V^{(t)}| \ell d^2)$, where $\mathrm{nonzero}(A_{sup})$ is the number of non-zero entries in the $A_{sup}$ matrix. The first term corresponds to the propagation steps, while the second corresponds to the feature transformation steps of ML-GCN.

(MPVE-SA-2) Similarly to MPVE-2, this sub-module requires the application of semantic-level attention, in order to combine the embeddings learned from each multilayer meta-path based graph. Since we discarded across-layer meta-paths in MPVE-2, the computational complexity of this step is the same for both Co-MLHAN and Co-MLHAN-SA, i.e., $O(p |V^{(t)}| \ell d^3)$. Also, similarly to NSVE-2, MPVE-SA-2 requires attending over the information learned at each layer, with a level of across-layer attention, whose complexity is negligible compared to the first term, i.e., $O(|V^{(t)}| \ell d^3)$.
The third stage, based on contrastive learning, requires first a transformation through an MLP, which costs $O(|V^{(t)}| d^2)$, then the loss functions of the two views are computed. For this last step, we need to compute the pairwise cosine-similarities between nodes belonging to different views, which costs $O(|V^{(t)}|^2 d)$.
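The quadratic term of the third stage can be made concrete with a small sketch of the pairwise cosine-similarity computation between the embeddings of the two views. The function name and the toy embeddings are hypothetical, the embeddings are assumed to have non-zero norm, and a practical implementation would use batched matrix operations instead of nested Python loops.

```python
import math


def cosine_similarity_matrix(z_ns, z_mp):
    """Return S with S[i][j] = cos(z_ns[i], z_mp[j]).

    z_ns, z_mp: lists of n embeddings of length d, one list per view.
    The nested loops perform n * n dot products of length d, which
    matches the quadratic cost in the number of target entities.
    Assumes every embedding has non-zero norm.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    ns_unit = [unit(v) for v in z_ns]
    mp_unit = [unit(v) for v in z_mp]
    return [
        [sum(a * b for a, b in zip(u, w)) for w in mp_unit]
        for u in ns_unit
    ]


# Toy embeddings for two target entities under the two views.
S = cosine_similarity_matrix([[1.0, 0.0], [0.0, 2.0]],
                             [[3.0, 0.0], [1.0, 1.0]])
```

Each row of `S` compares one entity under the network schema view with all entities under the meta-path view, which is the set of pairs over which positives and negatives are distinguished.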
To sum up, considering all the above terms, the time complexity of our framework can be characterized in terms of size of the multilayer heterogeneous network and size of the latent space (i.e., embedding length), which is typical in GNN-based approaches. Specifically, in the second stage, the cost is linear in the number of target nodes and edges, while it is cubic in the embedding length, due to the computation of the attention models. In the third stage, the cost becomes quadratic in the number of the target entities, due to the calculation of pairwise node similarities. We remark that our framework is extremely flexible in terms of the choice of each sub-module. In particular, we propose using an attention mechanism only if different instances of the same type are assumed to provide information with different importance. Nonetheless, several steps can be carried out in parallel (e.g., the attention model on each relation $r$, GCN models for each meta-path, type-level, semantic-level and across-layer attention). Thus, in practical applications, the computational complexity of our framework does not hinder its scalability. In this regard, we aim to improve the efficiency of the training process in future work, e.g.,