CoMLHAN: contrastive learning for multilayer heterogeneous attributed networks
Applied Network Science volume 7, Article number: 65 (2022)
Abstract
Graph representation learning has become a topic of great interest, and many works focus on generating high-level, task-independent node embeddings for complex networks. However, existing methods consider only a few aspects of networks at a time. In this paper, we propose a novel framework, named CoMLHAN, to learn node embeddings for networks that are simultaneously multilayer, heterogeneous and attributed. We leverage contrastive learning as a self-supervised and task-independent machine learning paradigm and define a cross-view mechanism between two views of the original graph which collaboratively supervise each other. We evaluate our framework on the entity classification task. Experimental results demonstrate the effectiveness of CoMLHAN and its variant CoMLHANSA, showing their capability of exploiting across-layer information in addition to other types of knowledge.
Introduction
Nowadays, with the ever-increasing growth of interconnected data, a huge number of real-world scenarios and a variety of applications can profitably be modeled using complex networks. In this context, one key aspect is how to incorporate information about the structure of the graph into machine learning models. Graph representation learning approaches have gained increasing attention in recent years, since they are designed to overcome the limitations of traditional, hand-engineered feature extraction methods by learning a mapping that embeds nodes, or entire (sub)graphs, as points in a low-dimensional vector space. This mapping is optimized so that geometric relationships in the learned space reflect the structure of the original graph. The learned embeddings can then be used as feature inputs for downstream machine/deep learning tasks for exploration and/or prediction (e.g., node classification, community detection and evolution, link prediction).
Graph representation learning approaches are conventionally categorized into traditional embedding (a.k.a. “shallow”) methods and Graph Neural Network (GNN)-based methods. As noted in (Khoshraftar and An 2022), GNNs ensure more refined graph representations, higher flexibility in leveraging attributes at node/edge level, and generalization to unseen nodes through task-specific and node-similarity-based training, although at the cost of higher memory requirements that might affect scalability.
Since GNNs typically require labels to learn rich representations, and annotating graphs is costly because it requires domain knowledge, self-supervised learning approaches are currently being investigated; coupled with GNNs, they make it possible to learn embeddings without relying on labeled data (Hassani and Ahmadi 2020). Among graph self-supervised learning methods, contrast-based methods have more flexible designs and broader applications than other approaches (Liu et al. 2021); they train GNNs by discriminating positive and negative node pairs, i.e., similar and dissimilar instances. Contrastive learning aims to learn effective GNN encoders such that similar nodes are pulled together and dissimilar nodes are pushed apart in the embedding space (Jing et al. 2021).
To the best of our knowledge, there is a lack of methods able to handle networks whose nodes are replicated according to different interaction contexts or semantic aspects, are of different types and/or are connected via different types of relationships, and carry multiple information content. In other terms, networks that are simultaneously multilayer, heterogeneous and attributed are still unexplored in the landscape of graph representation learning, regardless of the particular learning paradigm adopted.
Contributions.
To fill the above gap in the literature, in this work we propose a novel Contrastive learning based framework for Multilayer Heterogeneous Attributed Networks (CoMLHAN), which is designed to learn node/entity embeddings without relying on labeled data. Specifically, we learn node representations by contrasting positive and negative samples belonging to distinct views of the original graph. Inspired by recent advances in multi-view contrastive learning (Hassani and Ahmadi 2020; Jing et al. 2021; Mavromatis and Karypis 2021; Wang et al. 2021), we consider two views of a multilayer heterogeneous attributed network, which capture the local and high-order (global) structure of nodes, respectively, and collaboratively supervise each other.
Our main contributions in this work correspond to addressing the following relevant, interrelated challenges:

Representation learning for an arbitrary multilayer network such that each layer can have multiple types of nodes and relations (heterogeneous network), and have initial features associated with nodes (attributed network).

Encoding of the local information of nodes, to account for the size and heterogeneity of the node neighborhoods, so as to handle variability in the number of neighbors and possible lack of certain types of neighbors.

Encoding of the high-order information of nodes, by employing metapaths, to reach relevant information residing multiple hops away, so that nodes of the same type that are not directly connected can be tied to each other.

Effective integration and exploitation of across-layer information, including the possibility of assigning different weights to different layers or treating them equally, as needed. This also avoids using a simplistic approach based on network flattening, so that dependencies between the layers can be retained, including both the links between the replicas of the nodes in different layers (pillar-edges) and any other inter-layer edges. Moreover, with respect to modeling the across-layer information related to pillar-edges, we also propose a variant of the main method, which will be referred to as CoMLHANSA.

Joint learning of embeddings for each node/entity, one under each view, both of which can be used for downstream tasks, such as classification. In this regard, we also provide a qualitative analysis of the interchangeability of the view-specific embeddings.

High flexibility in the definition of node- and entity-level attributes as well as in the selection strategy for positive and negative samples.
We experimentally evaluated our CoMLHAN methods and selected competitors on IMDb movie data, from which we originally built multilayer heterogeneous attributed networks.
Plan of the paper.
The remainder of this paper is structured as follows. “Proposed framework” section describes our proposed framework in detail. “Experimental evaluation” section provides our experimental evaluation concerning the entity classification task on IMDb network datasets. “Related work” section discusses related works focusing on GNN-based approaches for representation learning in heterogeneous attributed networks and in multilayer attributed networks. “Conclusions” section contains concluding remarks and provides pointers for future research. Moreover, Appendices 1–4 provide details about the preprocessing of our evaluation network datasets, an insight into the content encoding stage, and a discussion on computational complexity aspects of the proposed framework.
Proposed framework
Our proposed CoMLHAN is a self-supervised graph representation learning approach conceived for multilayer heterogeneous attributed networks. As previously discussed, a key novelty of CoMLHAN is its higher expressiveness w.r.t. existing methods, since heterogeneity is assumed to hold at both node and edge levels, possibly for each layer of the network. This capability of handling graphs that are simultaneously multilayer, heterogeneous, and attributed enables CoMLHAN to better model complex real-world scenarios, thus incorporating most of the available information when generating node embeddings.
In the following, we first provide a formal definition of multilayer heterogeneous attributed graph and representation learning in such networks, then we move to a detailed description of CoMLHAN. The notations used in this work are summarized in Table 9, Appendix 1.
Preliminary definitions
A multilayer graph is a set of interrelated graphs, each corresponding to one layer, with a node mapping function between any (selected) pair of layers to indicate which nodes in one graph correspond to nodes in the other one. We assume that each layer can be heterogeneous, i.e., is characterized by nodes of different types and/or edges of different types, such that any node can be linked to nodes of the same type as well as to nodes of different types, through the same or different relations, and is attributed, i.e., has nodes associated with external information, available as a set of attributes. Therefore, each layer graph has its internal set of edges, dubbed intra-layer or within-layer edges, as well as a set of edges connecting its nodes to nodes of another layer, dubbed inter-layer or across-layer edges. Layers can be seen as different interaction contexts, semantic aspects, or time steps, while the participation of an entity in a layer can be seen as a particular entity instance. Instances of the same entity are connected via pillar-edges. We hereinafter refer to entity instances as nodes in the multilayer network. Figure 1 illustrates an example of a multilayer heterogeneous attributed graph.
Multilayer heterogeneous attributed graph.
We define a multilayer heterogeneous attributed graph as \(G_{{\mathcal{L}}}=\langle {\mathcal{L}}, {\mathcal{V}}, V_{{\mathcal{L}}}, E_{{\mathcal{L}}}, A, R, \phi , \varphi , {\varvec{\mathcal{X}}}_{{\mathcal{L}}}\rangle\), where \({\mathcal{L}} = \{G_{1}, \cdots , G_{\ell } \}\) is the set of layer graphs, indexed in \(L = \{1,\dots , \ell \}\), with \(|{\mathcal{L}}| = \ell \ge 2\), \({\mathcal{V}}\) is the set of entities, \(V_{\mathcal{L}} \subseteq {\mathcal{V}} \times {\mathcal{L}}\) is the set of nodes, \(E_{{\mathcal{L}}}\) is the set of edges, including both intra- and inter-layer edges, A is the set of entity, resp. node, types, R is the set of relation types, \(\phi :{\mathcal{V}}\rightarrow A\) is the entity-type mapping function, \(\varphi :E_{{\mathcal{L}}} \rightarrow R\) is the edge-type mapping function, and \({\varvec{\mathcal{X}}}_{{\mathcal{L}}}\) is a set of matrices storing attributes, or initial features, with \({\varvec{\mathcal{X}}}_{{\mathcal{L}}} = \bigcup _{l=1\ldots \ell } {\varvec{\mathcal{X}}}_{l}\). More specifically, entities, resp. nodes, of each type are assumed to be associated with features stored in layer-specific matrices \({\varvec{\mathcal{X}}}_{l} = \{ {\textbf{X}}^{(a)}_{l} \}\), where each \({\textbf{X}}^{(a)}_{l}\) is the feature matrix associated with entities, resp. nodes, of type \({a} \in A\) in the l-th layer. Throughout this work we use symbol \({\textbf{x}}^{(a)}_{{\langle i,l \rangle }}\) to denote the feature vector of entity \(v_{i}\) of type a in layer \(G_{l}\). We also admit that features can be layer-independent, in which case we indicate with \({\textbf{x}}^{(a)}_{i}\) the feature vector associated with entity \(v_{i}\) of type a in each layer, i.e., \({\textbf{x}}^{(a)}_{\langle i,l \rangle } = {\textbf{x}}^{(a)}_{i}\) for each \(G_{l} \in {\mathcal{L}}\).
We specify that each entity has instances (i.e., nodes) in one or more layers, and appears in at least one layer, i.e., \({\mathcal{V}} = \bigcup _{l=1\ldots \ell } {\mathcal{V}}_{l}\), with \({\mathcal{V}}_{l}\) the set of entities appearing in the l-th layer. Likewise, \(A=\bigcup _{l=1\ldots \ell } A_{l}\), with \(A_{l}\) denoting the set of node types of the l-th layer, \(R=\bigcup _{l=1\ldots \ell } R_{l}\), with \(R_{l}\) denoting the set of edge types of the l-th layer, and \(E_{{\mathcal{L}}} = \bigcup _{r \in R}E_{r}\) \(\subseteq V_{{\mathcal{L}}} \times V_{{\mathcal{L}}}\), with \(E_{r}\) indicating all the edges of type r.
Moreover, \(E_{{\mathcal{L}}}\) can be partitioned into two sets denoting the intra-layer edges and the inter-layer edges. Note that inter-layer edges represent the coupling structure of layers; in our setting, we assume that different coupling constraints between layers might hold, e.g., layers could be coupled with each other, only adjacent layers could be coupled, layers could follow a temporal relation order, etc. We define the set of layer pairing indices as \(L_{cross}\), where each \(\pi =(l,l') \in L_{cross}\) is a pair of coupled layers denoting an interaction between layer \(G_{l}\) and \(G_{l'}\).
We stress that in contrast to other approaches, such as (Yang et al. 2021), in our formulation each layer \(G_{l}\) (\(l=1\ldots \ell\)) is a heterogeneous graph at both node and edge levels, i.e., \(|A_{l}|>1\) and \(|R_{l}| > 1\). Moreover, \(A_{l} \subseteq A\), for all \(G_{l} \in {\mathcal{L}}\), and \(R_{l} \subset R\), since inter-layer connections are regarded as different types of edges.
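To make the definition concrete, the following is a minimal sketch of a container for such a graph. The class name, field names and relation labels are hypothetical and chosen only for illustration; nodes are stored as (entity, layer) pairs, edges are grouped by relation type, and pillar-edges are derived under the assumption that every pair of layers is coupled.

```python
from dataclasses import dataclass, field
from itertools import combinations

Node = tuple[str, int]  # (entity id, layer index): one instance of an entity


@dataclass
class MultilayerHetGraph:
    """Toy container; names are illustrative and not taken from the paper."""
    entity_type: dict[str, str]                    # phi: entity -> type in A
    nodes: set[Node] = field(default_factory=set)  # V_L, a subset of V x L
    edges: dict[str, set[tuple[Node, Node]]] = field(default_factory=dict)  # E_r per relation r

    def add_edge(self, rel: str, u: Node, v: Node) -> None:
        """Register an edge of relation type `rel` and both of its endpoints."""
        self.nodes.update((u, v))
        self.edges.setdefault(rel, set()).add((u, v))

    def pillar_edges(self) -> set[tuple[Node, Node]]:
        """Pairs of instances of the same entity in two different layers.

        Assumption: all layer pairs are coupled (L_cross contains every pair).
        """
        layers_of: dict[str, list[int]] = {}
        for entity, layer in self.nodes:
            layers_of.setdefault(entity, []).append(layer)
        return {
            ((e, l1), (e, l2))
            for e, layers in layers_of.items()
            for l1, l2 in combinations(sorted(layers), 2)
        }


g = MultilayerHetGraph(entity_type={"m1": "movie", "m2": "movie", "a1": "actor"})
# Intra-layer edges in different layers are kept as distinct relation types.
g.add_edge("acts_in@1", ("a1", 1), ("m1", 1))
g.add_edge("acts_in@2", ("a1", 2), ("m2", 2))
pillars = g.pillar_edges()
```

Here the actor `a1` has one instance per layer, so the only pillar-edge links `("a1", 1)` to `("a1", 2)`.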
Multilayer heterogeneous attributed graph embedding.
Given a multilayer heterogeneous attributed network \(G_{{\mathcal{L}}}\), our goal is to learn an embedding function at entity level \(g : {\mathcal{V}} \rightarrow {\mathbb{R}}^{d}\), where d is the dimension of the latent space, and \(d \ll |{\mathcal{V}}|\). Function g can be derived from an analogous embedding function at node level, \(g' : V_{{\mathcal{L}}}\rightarrow {\mathbb{R}}^{d}\), with \(d \ll |V_{{\mathcal{L}}}|\). The mapping g, resp. \(g'\), defines the latent representation of each entity \(v_{i} \in {\mathcal{V}}\), resp. node \(\langle i,l \rangle \in V_{{\mathcal{L}}}\), and we use symbol \({\textbf{z}}_{i}\), resp. \({\textbf{z}}_{\langle i,l \rangle }\), to denote its learned embedding. The learned embeddings are eventually used to support multiple downstream graph mining tasks, e.g., entity/node classification, link prediction, node regression, etc.
CoMLHAN: contrastive learning framework for multilayer heterogeneous attributed networks
We aim to learn node embeddings in an unsupervised manner, with function g employing graph neural networks and attention mechanisms in order to encode both structural and semantic, heterogeneous and multilayer information in the context of a multi-view contrastive mechanism.
Our proposed approach is based on the infomax principle of maximizing mutual information (Linsker 1988), both in terms of graph structure encoding (complying with the distinction between local and high-order information) and across-layer information (complying with the distinction between inter-layer edges connecting direct neighbors and pillar-edges connecting different instances of the same entity). According to this principle, we define two different structural views on the original graph: one is designed to encode the local structure of nodes and handle heterogeneity, capturing useful information from one-hop neighbors of different types (possibly from different layers), and the other is designed to encode the global structure of nodes and model information from distant nodes in the network, thus capturing useful information from multi-hop neighbors of the same type. Note that we include pillar-edges in the global view, since they are particular connections matching two instances of the same entity, thus enabling across-layer transitions, but they do not represent edges between two direct neighbors.
It should be emphasized that CoMLHAN is conceived to be general and flexible, so as to exploit all available information while remaining effective even when such information is lacking. For instance, across-layer relations could be limited to a few replicas, nodes may show high variability in the number of neighbors, or one or more types of neighbors could be missing for some nodes.
Figure 2 shows a conceptual overview of our proposed framework. Accordingly, the final embedding for each target entity is learned through three main stages:

1. Content encoding. Since the initial feature vectors of nodes/entities (\({\textbf{x}}\)) might be of different sizes, the first stage transforms such initial features into a shared low-dimensional latent space (\({\textbf{h}}\)). Moreover, this stage is also concerned with the content encoding “from scratch”, i.e., generating initial embeddings from raw data associated with nodes/entities, which might come from multiple and heterogeneous contents, such as categorical or numerical attributes, unstructured text and multimedia content.

2. Graph structure encoding. According to the multi-view learning paradigm, the second stage generates two distinct embeddings for each entity, reflecting the graph structure and maximizing the mutual information: (1) embeddings for the local structure (\({\textbf{z}}^{{\mathrm{NS}}}\)), including information from all direct neighbors of the nodes being instances of the target entity, and (2) embeddings for the high-order structure (\({\textbf{z}}^{{\mathrm{MP}}}\)), including information from pillar-edges and from target nodes that can be reached through composite relations (i.e., metapaths).

3. Final embedding based on contrastive learning. The third stage requires a joint optimization between the embeddings learned under the two views to generate the final entity embedding (\({\textbf{z}}\)). The contrastive learning mechanism is enforced by choosing suitable positive and negative samples from the original graph.
In the following sections, we elaborate on the graph structure encoding (stage 2) and the generation of the final embedding based on contrastive learning (stage 3). We examine their computational complexity aspects in Appendix 4. For the sake of readability, note also that, since the first stage of content encoding is actually beyond the objectives of this work, we discuss it in Appendix 3.
Graph structure encoding
The second stage models two graph views, named network schema view and metapath view, able to encode the local and global structure surrounding nodes, respectively, while exploiting multilayer information.
The network schema of a heterogeneous graph is an abstraction of the original graph showing the different node types and their direct connections. It is often referred to as a meta template, since it captures node and edge type mapping functions. Formally, a network schema is a directed graph defined over node types A, with edges as relation types from R. In a multilayer heterogeneous network \(G_{{\mathcal{L}}}\), the network schema includes all types A for individual layers and relations R, including both intra- and inter-layer edges. More specifically, we consider all relations involving any node \(\langle i,l \rangle\) of target type, denoted as \(R_{\langle i,l \rangle } \subseteq R\), and all node types a connected to the target node through a relation \(r \in R_{\langle i,l \rangle }\). Hereinafter, we refer to this graph as network schema graph. Figure 3 shows an example of network schema graph for the multilayer heterogeneous attributed network of Fig. 1.
A metapath is a sequence of connected nodes making two distant nodes in the network reachable, i.e., the terminal or endpoint nodes of a metapath instance. Formally, a metapath \(M_{m}\) is a path defined on the network schema graph, in the form \({a_{1}} \xrightarrow {{r_{1}}} {a_{2}} \xrightarrow {{r_{2}}} \cdots \xrightarrow {{r_{k}}} {a_{k+1}}\), describing a composite relation \({r_{1}} \circ {r_{2}} \circ \dots \circ {r_{k}}\) between node types \({a_{1}}\) and \({a_{k+1}}\). A metapath instance of \(M_{m}\) is a sampling under the guidance of \(M_{m}\) providing a sequence of connected nodes with edges matching the composite relation in \(M_{m}\). Examples of within-layer metapath instances are depicted in Fig. 4a and b. Given a multilayer heterogeneous graph \(G_{{\mathcal{L}}}\) and a metapath \(M_{m}\), let \(N_{m}(i,l)\) denote the metapath-based neighbors of node \(\langle i,l \rangle\) of a certain type a, defined as the set of nodes of type \(a'\) that are connected with node \(\langle i,l \rangle\) through at least one metapath instance of \(M_{m}\) having a as starting node type and \(a'\) as ending node type. Note that, similarly to Wang et al. (2019), the intermediate nodes along metapaths are discarded. A metapath-based graph is a graph comprised of all the metapath-based neighbors. For metapaths with terminal nodes of the same type, the resulting graph is homogeneous at node level. Figure 4c shows an example of single-layer metapath-based graph according to a specific metapath type.
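As an illustration of metapath-based neighbors, the sketch below follows a composite relation step by step within a single layer and keeps only the terminal nodes. The function name, the edge-set representation and the movie/actor relation labels are hypothetical; excluding the start node from its own neighborhood is an assumption made here for simplicity.

```python
from collections import defaultdict


def metapath_neighbors(edges_by_rel, metapath, start):
    """Return the terminal nodes reachable from `start` along `metapath`.

    edges_by_rel: dict mapping a relation name to a set of (source, target) pairs.
    metapath: list of relation names r_1, ..., r_k (the composite relation).
    Intermediate nodes are discarded; only end nodes are returned.
    """
    frontier = {start}
    for rel in metapath:
        adjacency = defaultdict(set)
        for source, target in edges_by_rel.get(rel, ()):
            adjacency[source].add(target)
        frontier = {t for s in frontier for t in adjacency[s]}
    frontier.discard(start)  # assumption: a node is not its own neighbor
    return frontier


edges = {
    "movie->actor": {("m1", "a1"), ("m2", "a1"), ("m3", "a2")},
    "actor->movie": {("a1", "m1"), ("a1", "m2"), ("a2", "m3")},
}
movie_actor_movie = ["movie->actor", "actor->movie"]
neighbors_of_m1 = metapath_neighbors(edges, movie_actor_movie, "m1")
neighbors_of_m3 = metapath_neighbors(edges, movie_actor_movie, "m3")
```

In this toy graph `m1` and `m2` share the actor `a1`, so they become neighbors in the metapath-based graph, while `m3` has no metapath-based neighbor.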
Following Wang et al. (2021), given a target entity, the network schema view is used to capture the local structure, by modeling information from all the direct neighbors of the corresponding target nodes, whereas the metapath view is used to capture the global structure, by modeling information from all the nodes connected to the corresponding target nodes through a metapath and from the pillar-edges derived from the corresponding metapath-based graph.
View embedding generation. The two views exploit features associated with different entity types; specifically, the network schema view takes advantage of features of neighbors of any type, while the metapath view takes advantage of features of nodes of target type involved in high-order relations.
Recall that CoMLHAN produces, for each target entity, a distinct embedding under each view. Nonetheless, both views share two fundamental steps in the embedding generation: (1) aggregating information of different instances of the same type (i.e., instances of the same relation and instances of the same metapath, respectively), and (2) combining information of different types (i.e., different types of relations and of metapaths, respectively) as well as of different layers.
Network schema view embedding
In the network schema view, the embedding of each target node is computed from its direct neighbors, both within and across layers. As mentioned before, the network schema is a multilayer heterogeneous graph, having nodes of different types and relations corresponding to intra- and inter-layer edges involving nodes of target type.
To generate the embeddings under the network schema view, we follow a hierarchical attention approach, consisting of two main steps, which are summarized as follows and depicted in Fig. 5:

(NSVE1) First, we aggregate information of the same type (i.e., different instances of the same relation type) via node-level attention, learning the importance of each neighbor and obtaining, for each node, an embedding w.r.t. each relation type that involves a node of target type t.

(NSVE2) Second, we combine information of different types (i.e., different relations in different layers) via type-level attention, learning the most relevant relations and obtaining an embedding for each node under the network schema view. Moreover, we combine information from different layers via across-layer attention, learning the importance of each layer and obtaining, for each entity, a single embedding under the network schema view.
Note that we refer to relation type rather than node type, so that the formulation remains consistent when target nodes are connected to a certain node type through multiple relationships. We point out that, in accordance with the infomax principle, the network schema view does not model pillar-edges, since they are processed in the other view. We also specify that intra-layer edges in different layers are seen as different types of relations, reflecting the separation into layers according to a certain aspect. In practice, layers are an additional way of distinguishing the context of relations.
Aggregating information of different instances of the same type (NSVE1). Aggregating information of the same type (i.e., different instances of the same relation type) takes place via node-level attention. This step exploits features of nodes connected to target nodes through a direct link, whether they are of the same type as the target or not.
Given the graph \(G_{{\mathcal{L}}}\), we define a function, denoted as \(N^{(r)}(\cdot )\), that for any entity-layer pair yields its neighborhood under relation type r, regardless of the within-layer or across-layer location of the neighbors. Formally, given a target node \(\langle i,l \rangle\), we define the set of its neighbors under relation \(r \in R_{\langle i,l \rangle }\) as:
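One way to write this set that agrees with the surrounding description is sketched below. It assumes that edges of type r are stored as ordered (source, destination) pairs in \(E_{r}\) and that pillar-edges are ruled out by requiring \(j \ne i\); both conventions are assumptions made for illustration.

\[ N^{(r)}(i,l) = \{ \langle j,l' \rangle \in V_{{\mathcal{L}}} \ : \ (\langle j,l' \rangle , \langle i,l \rangle ) \in E_{r}, \ j \ne i \} \]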
Above, note that \(N^{(r)}(i,l)\) returns within-layer or across-layer neighbors of \(\langle i,l\rangle\) under relation r, when \(l=l'\) or \(l\ne l'\), respectively. (Recall that pillar-edges are excluded from the definition of neighbor sets.) Moreover, to ensure the aggregation of the same amount of information, we sample a fixed number of neighbors to be processed at each epoch by setting a threshold value for each type of neighbor (cf. “Experimental settings” section). In our setting, neighbor sampling can be done with or without replacement. Note that this neighbor sampling approach also saves computational resources on very large networks.
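The fixed-size neighbor sampling described above can be sketched as follows. The function name and signature are hypothetical, and returning all available neighbors when fewer than the threshold exist (in the without-replacement case) is an assumption, since the text does not specify that corner case.

```python
import random


def sample_neighbors(neighbors, size, replace=False, rng=None):
    """Draw up to `size` neighbors for one relation type.

    With replacement, exactly `size` items are returned (possibly repeated).
    Without replacement, at most len(neighbors) items are returned (assumption).
    """
    rng = rng or random.Random()
    pool = sorted(neighbors)  # fixed order makes seeded runs reproducible
    if not pool:
        return []
    if replace:
        return [rng.choice(pool) for _ in range(size)]
    return rng.sample(pool, min(size, len(pool)))


rng = random.Random(0)
with_repl = sample_neighbors({"a1", "a2", "a3"}, 5, replace=True, rng=rng)
without_repl = sample_neighbors({"a1", "a2", "a3"}, 2, rng=rng)
capped = sample_neighbors({"a1", "a2", "a3"}, 5, rng=rng)
```

Calling the function once per epoch and per relation type yields a different subset each time, which matches the per-epoch sampling mentioned in the text.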
We thus define the embedding of entity \(v_{i}\) in layer \(G_{l}\) based on neighbors under relation r as:
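A form consistent with the symbols defined next can be sketched as follows, assuming a GAT-style attention-weighted sum over the neighbors in \(N^{(r)}(i,l)\):

\[ {\textbf{z}}^{N^{(r)}}_{\langle i,l \rangle } = \sigma \Big ( \sum _{\langle j,l' \rangle \in N^{(r)}(i,l)} \alpha ^{(r)}_{\langle i,l \rangle ,\langle j,l' \rangle } \, {\textbf{W}}_{{\textbf{2}}}^{(r)} \, {\textbf{h}}_{\langle j,l' \rangle } \Big ) \]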
where \({\textbf{z}}^{N^{(r)}}_{\langle i,l \rangle }\) is the embedding of node \(\langle i,l \rangle\) obtained from neighborhood under relation r, \(\sigma (\cdot )\) is the activation function (default is ELU), \({\textbf{W}}_{{\textbf{2}}}^{(r)}\) is the weight matrix of shape (d, d) associated with one-hop neighbors \(\langle j,l'\rangle\), \({\textbf{h}}_{\langle j,l' \rangle }\) is the feature embedding of node \(\langle j,l' \rangle\) and \(\alpha ^{(r)}_{\langle i,l \rangle ,\langle j,l' \rangle }\) is the normalized attention coefficient for the relation r connecting \(\langle i,l \rangle\) and \(\langle j,l' \rangle\) and indicating the importance for \(\langle i,l \rangle\) of information coming from \(\langle j,l'\rangle\), as defined in Eq. 3:
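Following the GATv2 scoring function, in which the nonlinearity is applied before the attention vector, the coefficient can be sketched as below. The LeakyReLU nonlinearity and the softmax normalization over \(N^{(r)}(i,l)\) are taken from GATv2 and are assumptions with respect to this text.

\[ \alpha ^{(r)}_{\langle i,l \rangle ,\langle j,l' \rangle } = \frac{\exp \big ( {{\textbf{a}}^{(r)}}^{\top } \, \mathrm{LeakyReLU} ( {\textbf{W}}^{(r)} [{\textbf{h}}_{\langle i,l \rangle } \Vert {\textbf{h}}_{\langle j,l' \rangle }] ) \big )}{\sum _{\langle k,l'' \rangle \in N^{(r)}(i,l)} \exp \big ( {{\textbf{a}}^{(r)}}^{\top } \, \mathrm{LeakyReLU} ( {\textbf{W}}^{(r)} [{\textbf{h}}_{\langle i,l \rangle } \Vert {\textbf{h}}_{\langle k,l'' \rangle }] ) \big )} \]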
where \({\textbf{a}}^{(r)} \in {\mathbb{R}}^{d}\) is the learnable weight vector under relation r, \([{\textbf{h}}_{\langle i,l \rangle } \Vert {\textbf{h}}_{\langle j,l' \rangle }] \in {\mathbb{R}}^{2d}\) is the row-wise concatenation of the column vectors associated with the two node embeddings, \({\textbf{W}}^{(r)} = [{\textbf{W}}_{{\textbf{1}}}^{(r)} \Vert {\textbf{W}}_{{\textbf{2}}}^{(r)}] \in {\mathbb{R}}^{d \times 2d}\) is the column-wise concatenation of \({\textbf{W}}_{{\textbf{1}}}^{(r)}\) and \({\textbf{W}}_{{\textbf{2}}}^{(r)}\), both of shape (d, d) and containing the left and right half of the columns of \({\textbf{W}}^{(r)}\), associated with destination and source nodes (one-hop neighbors), respectively (Footnote 1). In Eq. 3, we adopt the same approach as in GATv2 (Brody et al. 2021), which aims to fix the static attention problem of the standard Graph Attention Network (GAT) (Velickovic et al. 2018): in GAT, the ranking of attended nodes is unconditioned on the query node, which limits its expressive power. GATv2, by contrast, is a dynamic graph attention variant where the order of internal operations of the scoring function is modified to apply an MLP for computing the score of each attended node.
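The node-level attention step for a single relation can be illustrated with the plain-Python sketch below. It is not the authors' implementation: the helper names are hypothetical, vectors are lists of floats, and the LeakyReLU slope of 0.2 is an assumption borrowed from common GAT implementations.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(v, slope=0.2):
    return [x if x > 0 else slope * x for x in v]


def elu(v):
    return [x if x > 0 else math.exp(x) - 1.0 for x in v]


def gatv2_coefficients(h_i, neighbors_h, W, a):
    """Softmax of a^T LeakyReLU(W [h_i || h_j]) over the neighbors of i.

    W has shape (d, 2d) and a has length d.
    """
    scores = [
        sum(ak * sk for ak, sk in zip(a, leaky_relu(matvec(W, h_i + h_j))))
        for h_j in neighbors_h
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(h_i, neighbors_h, W, a):
    """ELU(sum_j alpha_ij * W_2 h_j), with W_2 the right half of W's columns."""
    d = len(h_i)
    W2 = [row[d:] for row in W]
    alpha = gatv2_coefficients(h_i, neighbors_h, W, a)
    summed = [0.0] * len(W2)
    for a_ij, h_j in zip(alpha, neighbors_h):
        for k, value in enumerate(matvec(W2, h_j)):
            summed[k] += a_ij * value
    return elu(summed)


W = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # d = 2, so W is 2 x 4
a = [1.0, 1.0]
alpha = gatv2_coefficients([0.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], W, a)
z = aggregate([0.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], W, a)
```

Because the two neighbors carry identical features, they receive equal attention, and the aggregated embedding is simply their transformed feature vector.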
The self-attention mechanism can be extended similarly to Vaswani et al. (2017) by employing multi-head attention, in order to stabilize the learning process. In this case, operations are independently replicated Q times, with different parameters, and outputs are feature-wise aggregated through an operator denoted with symbol \(\bigoplus\), which usually corresponds to average (default) or concatenation:
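Under the assumption that each head repeats the single-head aggregation with its own parameters, and writing \({\textbf{W}}_{{\textbf{2}}}^{(r,q)}\) for the neighbor-side half of the head-specific weight matrix, the multi-head form can be sketched as:

\[ {\textbf{z}}^{N^{(r)}}_{\langle i,l \rangle } = \bigoplus _{q=1}^{Q} \sigma \Big ( \sum _{\langle j,l' \rangle \in N^{(r)}(i,l)} \alpha ^{(r,q)}_{\langle i,l \rangle ,\langle j,l' \rangle } \, {\textbf{W}}_{{\textbf{2}}}^{(r,q)} \, {\textbf{h}}_{\langle j,l' \rangle } \Big ) \]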
where \({\textbf{W}}^{(r,q)}\) and \(\alpha ^{(r,q)}_{\langle i,l \rangle ,\langle j,l' \rangle }\) denote the weight matrix and the attention coefficient for the q-th attention head under relation r, respectively.
Let \({\textbf{z}}_{\langle i,l \rangle }^{N^{(r)}}\) be the embedding of a target node \(\langle i,l \rangle\) obtained from its neighbors in each layer under relation r. After node-level attention, we thus obtain the set of embeddings \(\bigcup \nolimits _{r \in R_{\langle i,l \rangle }} \{ {\textbf{z}}_{\langle i,l \rangle }^{N^{(r)}} \}\).
Combining information of different types and layers (NSVE2). In order to combine information of different node types according to the different relations with target nodes, we employ type-level attention for each layer separately. For each target node \(\langle i,l \rangle\), we obtain the embedding under the network schema view \({\textbf{z}}^{{{{\mathrm{NS}}}}}_{\langle i,l \rangle }\), as defined in Eq. 5:
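A form consistent with the definition of \(\beta ^{(r)}\) given next can be sketched as follows, assuming a plain attention-weighted sum over the relation types in \(R_{\langle i,l \rangle }\):

\[ {\textbf{z}}^{{\mathrm{NS}}}_{\langle i,l \rangle } = \sum _{r \in R_{\langle i,l \rangle }} \beta ^{(r)} \, {\textbf{z}}^{N^{(r)}}_{\langle i,l \rangle } \]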
where \(\beta ^{(r)}\) is the attention coefficient for neighborhood under relation r, which is defined as follows:
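A sketch consistent with the symbols defined next is given below. It assumes the tanh-based scoring with softmax normalization that is common in hierarchical attention for heterogeneous graphs, and \(w^{(r)}\) is an auxiliary symbol introduced here for the unnormalized score.

\[ w^{(r)} = \frac{1}{|{\mathcal{V}}_{l}^{(t)}|} \sum _{v_{i} \in {\mathcal{V}}_{l}^{(t)}} {{\textbf{a}}^{{\mathrm{NS}}}}^{\top } \tanh \big ( {\textbf{W}}^{{\mathrm{NS}}} \, {\textbf{z}}^{N^{(r)}}_{\langle i,l \rangle } + {\textbf{b}}^{{\mathrm{NS}}} \big ), \qquad \beta ^{(r)} = \frac{\exp (w^{(r)})}{\sum _{r' \in R_{\langle i,l \rangle }} \exp (w^{(r')})} \]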
where \({\mathcal{V}}_{l}^{(t)}\) is the set of entities of target type t in layer l; \({\textbf{a}}^{{{{{\mathrm{NS}}}}}} \in {\mathbb{R}}^{d}\) is the type-level attention vector; \(\textbf{W}^{{{{\mathrm{NS}}}}}\) and \({\textbf{b}}^{{{{\mathrm{NS}}}}}\) are the learnable weight matrix and the bias term, respectively, under the network schema view, shared by all relation types. We hence obtain the set of embeddings \(\bigcup \nolimits _{l \in L} \{ {\textbf{z}}^{{{{\mathrm{NS}}}}}_{\langle i,l\rangle }\}\) under the network schema view for each target node.
In order to map the learned node embeddings into the same space of the contrastive loss function, we apply an additional level of attention, i.e., across-layer attention. This is designed to evaluate the importance of each layer of \(G_{{\mathcal{L}}}\) and combine layer-wise the features of nodes. We thus obtain an embedding under the network schema view for each target entity \(v_{i}\), as defined in Eq. 7:
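Assuming a weighted sum over the layer indices in L, analogous to the type-level combination, the entity-level embedding can be sketched as:

\[ {\textbf{z}}^{{\mathrm{NS}}}_{i} = \sum _{l \in L} \beta ^{(l)} \, {\textbf{z}}^{{\mathrm{NS}}}_{\langle i,l \rangle } \]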
where \(\beta ^{(l)}\) is the learned attention coefficient for layer \(G_{l}\), computed via the same attention model as in Eq. 6, where in this case the learnable weights are shared by all layers.
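Since type-level and across-layer attention share the same attention model, both can be illustrated with one routine that scores each group of embeddings (one group per relation, or one per layer), normalizes the scores, and returns the weighted sum. The sketch below is illustrative only: the function name is hypothetical and the tanh scoring is an assumption.

```python
import math


def attention_fuse(z_by_key, W, b, a):
    """Fuse per-key embeddings with learned importance weights.

    z_by_key: dict mapping a key (relation or layer) to a list of embeddings,
              one per target node, in the same node order for every key.
    Returns (beta, fused): beta[key] is the normalized weight of the key and
    fused[i] is the weighted sum of the embeddings of node i over all keys.
    """
    def score(z):
        hidden = [math.tanh(sum(w * x for w, x in zip(row, z)) + bias)
                  for row, bias in zip(W, b)]
        return sum(ak * hk for ak, hk in zip(a, hidden))

    keys = list(z_by_key)
    # Average the score over all target nodes, then apply a softmax over keys.
    raw = {k: sum(score(z) for z in z_by_key[k]) / len(z_by_key[k]) for k in keys}
    top = max(raw.values())
    exps = {k: math.exp(raw[k] - top) for k in keys}
    total = sum(exps.values())
    beta = {k: exps[k] / total for k in keys}
    n, d = len(z_by_key[keys[0]]), len(z_by_key[keys[0]][0])
    fused = [[sum(beta[k] * z_by_key[k][i][j] for k in keys) for j in range(d)]
             for i in range(n)]
    return beta, fused


W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
a = [1.0, 1.0]
same = {"r1": [[1.0, 0.0], [0.0, 1.0]], "r2": [[1.0, 0.0], [0.0, 1.0]]}
beta_same, fused_same = attention_fuse(same, W, b, a)
beta_diff, _ = attention_fuse({"r1": [[2.0, 2.0]], "r2": [[0.0, 0.0]]}, W, b, a)
```

Applying the routine first over relations within each layer and then over layers mirrors the two combination steps of NSVE2.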
Metapath view embedding
In the metapath view, the embedding of each target node is computed from its metapath-based neighbors and from the pillar-edges derived from the corresponding metapath-based graph. Recall that each layer of a metapath-based graph is a homogeneous network with nodes corresponding to a subset of target nodes and edges as connections of metapath-based neighbors, including across-layer information matching pillar-edges.
We consider metapaths of any length, starting and ending with nodes of target type; indeed, information of intermediate nodes can be discarded as it is included in the network schema view. Note that considering multiple metapaths allows us to deal with multiple semantic spaces (Lin et al. 2021), and our framework is designed to handle an arbitrary number of metapaths. Also, in case a layer does not contain any node of target type, the layer is discarded from the resulting multilayer graph. Yet, our framework admits the worst case of \(\ell - 1\) layers missing for a metapath type.
Analogously to the network schema view, the metapath view embedding generation consists of two main steps (Fig. 6):

(MPVE1) First, we aggregate information of the same type, this time intended as several instances of the same metapath and encoded via a metapath-specific Graph Convolutional Network (GCN) (Kipf and Welling 2017), obtaining, for each target node, an embedding w.r.t. each metapath type.

(MPVE2) Second, we combine information of different types (i.e., different metapaths in different layers) and layers (i.e., different metapaths across layers) via semantic attention, learning the importance of each metapath and obtaining an embedding for each target node and entity under the metapath view.
In the following, we first describe the process of metapath view embedding generation according to the basic CoMLHAN approach. Next, in “Alternative metapath view embedding: CoMLHANSA” section, we shall describe an alternative strategy, called CoMLHANSA, which differs from CoMLHAN in the way across-layer information relating to pillar-edges is modeled.
Aggregating information of different instances of the same type (MPVE1). The first step of embedding generation under the metapath view is to aggregate information of the same type, which corresponds to several instances of a given metapath. More specifically, we consider all p metapaths \(\mathcal{M} = \{M_{1}, \dots , M_{p}\}\) involving nodes of target type, where each metapath \(M_{m}\) matches a multilayer graph with at most \(\ell\) layers.
In the metapath view, across-layer dependencies are modeled as particular types of metapaths, i.e., across-layer metapaths. They refer to the same composite relation, with the additional constraint that the terminal nodes belong to different layers, and that the intermediate node matches a pillar-edge, i.e., it corresponds to an entity (of type different from the target one) with both instances involved in the composite relation. An example is illustrated in Fig. 7. We define the set of across-layer metapaths, \({\mathcal{M}}^{\Updownarrow }\), as the union of all metapaths of any type and defined over all layer-pairs.
To identify the metapath based neighbors of each node, we define two functions, denoted as \(N^{\Leftrightarrow }(\cdot )\) and \(N^{\Updownarrow }(\cdot )\), which for each node return the intralayer and interlayer neighborhood, respectively. Formally, we define the set of withinlayer neighbors of the node \(\langle i, l \rangle\), according to mth (withinlayer) metapath type, as:
Similarly, we define the set of acrosslayer neighbors of node \(\langle i, l \rangle\), according to the mth (acrosslayer) metapath type, as follows:
Note that Eqs. 8 and 9 identify the metapath based neighborhood of type \(M_{m}\) for node \(\langle i,l \rangle\), with m referring to a within or acrosslayer metapath, respectively; in particular, \(N_{m}^{\Leftrightarrow }(i,l) \equiv N_{m}(i,l)\).
Given any target node \(\langle i,l \rangle\), we apply a metapath specific graph neural network \(f_{m}\) (with K hidden layers) in order to compute its embedding according to the mth metapath; formally, at each kth layer:
where \({\textbf{z}}^{(0)}_{\langle i,l \rangle } = {\textbf{h}}_{\langle i,l \rangle }\) is the feature embedding computed in the first stage, and \(\bigoplus\) denotes an arbitrary differentiable function, aggregating feature information from the local neighborhood of nodes [e.g., summation, a pooling operator, or even a neural network (Wang et al. 2020)]. Similarly to Wang et al. (2021), we use a GCN architecture as \(f_{m}\), for all \(M_{m}\) \((m=1\ldots p)\) in Eq. 10, assuming no different contribution from different instances of the same metapath.
More specifically, given the mth withinlayer metapath and \({\mathcal{A}} = \{ {\textbf{A}}_{1}, \ldots , {\textbf{A}}_{\ell } \}\) as the set of adjacency matrices associated with the corresponding metapath based graph, being \({\textbf{A}}_{l} \in {\mathbb{R}}^{n_{l} \times n_{l}}\) (\(l=1\ldots \ell\)) the adjacency matrix associated with layer l, the GCN for layer \(G_{l}\) is defined as follows:
where \(\sigma (\cdot )\) is a nonlinear activation function (default is \(\mathrm{ReLU}(\cdot ) = \max (0,\cdot )\)), \({\textbf{W}}^{(k,l)}\) is the trainable weight matrix for the mth metapath in the kth convolutional layer of shape (d, d), and \({\widetilde{\textbf{D}}}^{l}_{ii}=\sum _{j}{\widetilde{{\textbf{A}}}}^{l}_{ij}\) is the degree matrix derived from \({\widetilde{{\textbf{A}}}}_{l} = {\textbf{A}}_{l} + {\textbf{I}}^{l}_{n}\), with \({\textbf{I}}^{l}_{n}\) as the identity matrix of size \(n_{l}\), and \(n_{l}\) the number of nodes of layer \(G_{l}\). The GCN model for across-layer metapaths is built similarly, considering \(N^{\Updownarrow }(\cdot )\) instead of \(N^{\Leftrightarrow }(\cdot )\) and \(\pi\) instead of l.
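As a rough illustration of the layer-wise propagation described above, the following NumPy sketch assumes the standard renormalized GCN rule of Kipf and Welling, \(\sigma ({\widetilde{\textbf{D}}}^{-1/2}{\widetilde{\textbf{A}}}{\widetilde{\textbf{D}}}^{-1/2}{\textbf{Z}}{\textbf{W}})\), applied to a single layer of a metapath-based graph. The function name `gcn_layer` and the toy inputs are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def gcn_layer(A, Z, W):
    """One GCN propagation step on a single layer graph (illustrative sketch).

    A: (n, n) adjacency matrix of one layer of a metapath-based graph
    Z: (n, d) input node embeddings
    W: (d, d) trainable weight matrix
    """
    A_tilde = A + np.eye(A.shape[0])       # add self-loops
    deg = A_tilde.sum(axis=1)              # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(deg)        # deg >= 1 thanks to self-loops
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_hat @ Z @ W)  # ReLU activation

# Toy example (assumed data): a path graph on 3 nodes, 2-dimensional embeddings
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Z = np.eye(3)[:, :2]
W = np.eye(2)
H = gcn_layer(A, Z, W)
```

With identity weights, each output row is simply the degree-normalized average of the node's own embedding and those of its metapath-based neighbors.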
Let \({\textbf{z}}^{(m)}_{\langle i, l \rangle }\) and \({\textbf{z}}^{(m)}_{\langle i,\pi \rangle }\) be the node embeddings associated with the mth within-layer (resp. across-layer) metapath of node \(\langle i,l \rangle\) (resp. layer-pair \(\pi\)). Downstream of the metapath-specific GNNs, we obtain the node embeddings \(\{{\textbf{z}}^{(m)}_{\langle i,l \rangle } \mid l \in L, \ m=1\dots p\} \cup \{{\textbf{z}}^{(m)}_{\langle i,\pi \rangle } \mid \langle m,\pi \rangle \in \mathcal{M}^{\Updownarrow }\}\).
Combining information of different types and layers (MPVE2). Once the metapath-specific embeddings have been obtained for each target node, we employ semantic-level attention for combining different metapath types, including both intra- and inter-layer information. Given a node \(\langle i,l \rangle\), the embedding under the metapath view is computed as follows:
where \(\beta\) is the attention coefficient denoting the importance of each type of within-layer and across-layer metapath (cf. Eq. 6) and \(\lambda ^{\Updownarrow } \in [0, 1]\) is a balancing coefficient denoting the importance of inter-layer connections.
In order to project the node embedding into the same space of the loss function—analogously to the network schema view—we aggregate the embeddings obtained from each layer with a sum operator, which is defined as follows:
Note that Eq. 13 does not require an additional level of attention, since the layer dependency has already been taken into account by the attention mechanism in Eq. 12. Therefore, Eqs. 12 and 13 can be combined as follows:
Equation 14 hence enables the direct computation of the final embedding under the metapath view for each entity \(v_{i}\).
Alternative metapath view embedding: CoMLHANSA
Our alternative approach for embedding generation under the metapath view is named CoMLHANSA, where the suffix ‘SA’ refers to the supraadjacency matrix modeling each metapath based graph. The supraadjacency matrix, denoted as \({\textbf{A}}^{\mathrm{sup}}\), has diagonal blocks each representing a layerspecific adjacency matrix (i.e., \({\textbf{A}}_{l} \in {\mathbb{R}}^{n_{l} \times n_{l}}\), with \(l=1\ldots \ell\)), and offdiagonal blocks each corresponding to the interlayer adjacency matrix \({\textbf{A}}_{\pi }\) for layerpair \(\pi =(l,l')\), with values equal to 1 if an edge between \(\langle i,l \rangle\) and \(\langle j,l' \rangle\) exists, with \(l \ne l'\), and 0 otherwise.
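To make the block structure of the supra-adjacency matrix concrete, the following NumPy sketch assembles it from per-layer and inter-layer adjacency matrices. The helper `supra_adjacency`, its argument layout, and the toy two-layer graph are assumptions made for illustration; inter-layer couplings are treated as undirected.

```python
import numpy as np

def supra_adjacency(layer_adjs, inter_adjs):
    """Assemble a supra-adjacency matrix from intra- and inter-layer blocks.

    layer_adjs: list of (n_l, n_l) intra-layer adjacency matrices
    inter_adjs: dict mapping a layer pair (l, l2), l != l2, to an
                (n_l, n_l2) inter-layer adjacency matrix; pairs that are
                absent are treated as having no inter-layer edges
    """
    n_layers = len(layer_adjs)
    sizes = [A.shape[0] for A in layer_adjs]
    blocks = [[None] * n_layers for _ in range(n_layers)]
    for l in range(n_layers):
        for l2 in range(n_layers):
            if l == l2:
                blocks[l][l2] = layer_adjs[l]            # diagonal block
            elif (l, l2) in inter_adjs:
                blocks[l][l2] = inter_adjs[(l, l2)]
            elif (l2, l) in inter_adjs:
                blocks[l][l2] = inter_adjs[(l2, l)].T    # undirected coupling
            else:
                blocks[l][l2] = np.zeros((sizes[l], sizes[l2]))
    return np.block(blocks)

# Toy example (assumed data): layer 0 has 2 nodes, layer 1 has 3 nodes,
# and one pillar-edge couples node 0 of layer 0 with node 1 of layer 1.
A0 = np.array([[0., 1.], [1., 0.]])
A1 = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])
A01 = np.zeros((2, 3))
A01[0, 1] = 1.0
A_sup = supra_adjacency([A0, A1], {(0, 1): A01})
```

Layers may have different numbers of nodes, so the off-diagonal blocks are rectangular in general.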
To give an intuition, we model acrosslayer information downstream of semantic attention, by accounting for another level of attention, i.e., acrosslayer attention (by analogy with the network schema view).
We thus learn the importance of different (within layers) metapaths via semantic attention, obtaining an embedding under the metapath view for each node and we subsequently learn the importance of each layer via acrosslayer attention, obtaining an embedding under the metapath view for each entity.
Like in the basic CoMLHAN approach, the metapath view embedding generation in CoMLHANSA consists of two main steps (Fig. 8):

(MPVESA1)
First, we aggregate information of the same type, intended as several instances of the same metapath and encoded via metapathspecific GCNs, obtaining, for each node, an embedding w.r.t. each metapath. Unlike MPVE1, the first step of the CoMLHANSA approach hence handles the interlayer dependencies derived from pillaredges.

(MPVESA2)
Second, we combine information of different types (i.e., different metapaths in different layers) via semantic attention, learning the importance of each metapath and obtaining an embedding for each target node under the metapath view. Moreover, we combine information from different layers via acrosslayer attention, learning the importance of each layer and obtaining, for each target entity, a single embedding under the metapath view.
By avoiding the definition of across-layer metapaths \(\mathcal{M}^{\Updownarrow }\), CoMLHANSA requires a limited number of learnable parameters, as it utilizes a metapath-specific GCN shared by all layers \(G_{l}\).
Aggregating information of different instances of the same type (MPVESA1). We still use the notation \(N^{\Leftrightarrow }(\cdot )\) and \(N^{\Updownarrow }(\cdot )\) to indicate the set of withinlayer and acrosslayer neighbors, respectively. While the definition of \(N^{\Leftrightarrow }(\cdot )\) does not change w.r.t. Eq. 8, the definition of \(N^{\Updownarrow }(\cdot )\) of the CoMLHANSA approach is modified in the modeling of pillaredges, by directly considering all the instances of the same target entities in other layers, as shown in Eq. 15:
Similarly to MPVE1, we apply a metapath specific GNN for aggregating different metapath instances of the same type:
Unlike MPVE1, the interlayer dependencies are taken into account by the GNN, employing a modified version of the propagation rule that can handle the supraadjacency matrix as input. We thus build for each metapath its corresponding metapath based supragraph, i.e., a graph where pillar edges exist between every node and its counterpart in other coupled layers. In our setting, we instantiate \(f_{m}\) with a multilayer GCN model (Zangari et al. 2021), as shown in Eq. 17:
where the degree matrix \({\widetilde{\textbf{D}}}\) is built considering both interlayer and intralayer links of nodes using the supraadjacency matrix of the graph, \({\widetilde{\textbf{D}}}_{ii}=\sum _{j=1}{\widetilde{{\textbf{A}}}}^{\mathrm{sup}}_{ij}\), where \({\widetilde{{\textbf{A}}}}^{\mathrm{sup}}\) is the supraadjacency matrix with selfloops added, \(\delta (l,l')\) is a scoring function denoting the weight coefficient for interlayer links, ranging between 0 and 1, with values equal to \(\lambda ^{\Updownarrow }\) if \(\ l \ne l'\), and 1 otherwise.
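The following NumPy sketch gives one plausible reading of a GCN step on a supra-graph, under stated assumptions: degrees are computed from the supra-adjacency matrix with self-loops, and each normalized entry is scaled by \(\delta (l,l')\), which equals \(\lambda ^{\Updownarrow }\) for inter-layer links and 1 otherwise. The function `supra_gcn_layer`, the `layer_of` index vector, and the ReLU activation are illustrative choices, not the exact propagation rule of the multilayer GCN model.

```python
import numpy as np

def supra_gcn_layer(A_sup, layer_of, Z, W, lam=1.0):
    """One GCN-style step on a supra-graph (illustrative sketch).

    A_sup:    (N, N) supra-adjacency matrix over all node instances
    layer_of: (N,) integer array, layer index of each node instance
    Z:        (N, d) input embeddings
    W:        (d, d) weight matrix shared by all layers
    lam:      weight in [0, 1] applied to inter-layer links
    """
    A_tilde = A_sup + np.eye(A_sup.shape[0])            # add self-loops
    deg = A_tilde.sum(axis=1)                           # intra + inter links
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    same_layer = layer_of[:, None] == layer_of[None, :]
    delta = np.where(same_layer, 1.0, lam)              # delta(l, l')
    A_hat = delta * (d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :])
    return np.maximum(0.0, A_hat @ Z @ W)               # assumed ReLU

# Toy example (assumed data): one entity replicated in two layers,
# with the two instances joined by a single pillar-edge.
A_sup = np.array([[0., 1.], [1., 0.]])
layer_of = np.array([0, 1])
Z = np.array([[1.], [2.]])
W = np.array([[1.]])
H_full = supra_gcn_layer(A_sup, layer_of, Z, W, lam=1.0)
H_none = supra_gcn_layer(A_sup, layer_of, Z, W, lam=0.0)
```

Setting `lam=0.0` removes the contribution of pillar-edges to the aggregation, while `lam=1.0` treats them like intra-layer links.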
Let \({\textbf{z}}^{(m)}_{\langle i,l\rangle }\) be the embedding of node \(\langle i,l \rangle\) associated with the mth metapath. We thus obtain the metapath-specific embeddings \(\bigcup \nolimits _{\begin{array}{c} m=1 \dots p \\ l \in L \end{array}} \{{\textbf{z}}^{(m)}_{\langle i,l\rangle }\}\).
Combining information of different types and layers (MPVESA2). Once the metapath-specific embeddings have been obtained for each target node, we employ semantic-level attention for combining different metapath types, obtaining for each node \(\langle i, l \rangle\) an embedding under the metapath view, which is defined as follows:
where \(\beta ^{(m,l)}\) is an attention coefficient computed as in Eq. 6.
In order to project the node embedding into the same space of the loss function, we apply an additional level of attention, named acrosslayer attention, similarly to network schema view, thus obtaining for each entity \(v_{i}\) an embedding under the metapath view:
where \(\beta ^{(l)}\) is the attention coefficient denoting the importance of the lth layer, computed similarly to Eq. 6.
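The two attention levels of MPVESA2 can be sketched as follows, assuming a HAN-style scoring function (a one-layer projection with tanh followed by a dot product with an attention vector, averaged over nodes and normalized with softmax). The function `attention_combine`, its parameters, and the simplifying assumption that every entity appears in every layer are hypothetical and may differ from the attention defined in Eq. 6.

```python
import numpy as np

def attention_combine(embs, W, b, q):
    """Softmax-attention weighted sum over the first axis (illustrative sketch).

    embs: (T, n, d) stack of T embedding matrices (metapaths or layers)
    W, b: (d_att, d) projection matrix and (d_att,) bias
    q:    (d_att,) attention vector
    Returns the combined (n, d) embeddings and the (T,) coefficients beta.
    """
    scores = np.tanh(embs @ W.T + b) @ q         # (T, n) per-node scores
    w = scores.mean(axis=1)                      # (T,) average over nodes
    w = w - w.max()                              # numerical stability
    beta = np.exp(w) / np.exp(w).sum()           # softmax over T
    return np.tensordot(beta, embs, axes=1), beta

# Toy example (assumed sizes): 2 layers, 3 metapaths, 4 entities, dimension 5
rng = np.random.default_rng(0)
n_layers, p, n, d = 2, 3, 4, 5
Z = rng.normal(size=(n_layers, p, n, d))
W_sem, b_sem, q_sem = rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d)
W_lay, b_lay, q_lay = rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d)

# Semantic attention: combine metapaths within each layer (node level)
node_embs = [attention_combine(Z[l], W_sem, b_sem, q_sem)[0]
             for l in range(n_layers)]
# Across-layer attention: combine layers into one embedding per entity
entity_emb, beta_layers = attention_combine(np.stack(node_embs),
                                            W_lay, b_lay, q_lay)
```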
Final embedding based on Contrastive Learning
The third stage of the proposed framework is concerned with the exploitation of a contrastive learning mechanism to produce the final entity embeddings, pulling together similar entities and pushing apart dissimilar ones in the embedding space. We combine the contrastive losses computed according to each view, with individual nodes of both positive and negative pairs selected from distinct views.
Given the embeddings \({\textbf{z}}_{i}^{{{{\rm NS}}}}\) (Eq. 7) and \({\textbf{z}}_{i}^{{{{\rm MP}}}}\) (either Eq. 14 or Eq. 19) for each target entity \(v_{i}\), we transform them into the same space in which a contrastive loss function is computed, by employing a simple MLP architecture with one hidden layer, as defined in Eq. 20:
where \({\textbf{W}}^{(2)}\), \({\textbf{W}}^{(1)}\), \({\textbf{b}}^{(2)}\) and \({\textbf{b}}^{(1)}\) are learnable weights shared by both views and \(\sigma (\cdot )\) is the activation function (default is ELU).
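A minimal NumPy sketch of such a shared projection head is given below, assuming the common form \({\textbf{W}}^{(2)}\sigma ({\textbf{W}}^{(1)}{\textbf{z}} + {\textbf{b}}^{(1)}) + {\textbf{b}}^{(2)}\) with ELU as \(\sigma\). The function names `elu` and `project` and the toy parameters are illustrative assumptions.

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential Linear Unit, applied element-wise."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def project(Z, W1, b1, W2, b2):
    """One-hidden-layer MLP mapping embeddings to the contrastive space.

    Z: (n, d) embeddings under one view; the same W1, b1, W2, b2 are
    meant to be reused for both views (shared parameters).
    """
    return elu(Z @ W1.T + b1) @ W2.T + b2

# Toy example (assumed data): identity weights and zero biases
Z_ns = np.array([[1., 2.], [3., 4.]])
Z_mp = np.array([[-1., 0.5], [2., -3.]])
W1 = W2 = np.eye(2)
b1 = b2 = np.zeros(2)
P_ns = project(Z_ns, W1, b1, W2, b2)   # network schema view
P_mp = project(Z_mp, W1, b1, W2, b2)   # metapath view, same parameters
```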
The contrastive loss according to a certain view is computed on pairs of positive and negative samples. While earlier contrastive learning approaches were based on one or more negatives and a single positive for each instance, we follow the more recent trend of using both multiple positive and negative pairs (Khosla et al. 2020; Wang et al. 2021). Each target entity \(v_{i}\) can hence rely on more than one positive (at least itself, under the other view). For positive sampling, the idea is to select the nodes connected by the largest number of metapath instances, since metapath-based neighbors have a higher probability of being similar to each other. For negative sampling, we simply treat every entity that is not a positive as a negative.
We first proceed to the selection of positive samples. For this purpose, we count the metapath instances connecting each pair of target entities, considering all metapath types on individual layers, as shown in Eq. 21:
For each target entity \(v_{i}\), we obtain a set \({\mathcal{S}}_{i}= \{v_{j} \in {\mathcal{V}} \mid C_{i,j}>0 \}\) which is sorted by decreasing values of \(C_{i,j}\). Given a threshold \(T_{pos}\), we select for each entity itself and the best \(T_{pos}-1\) entities as positives, obtaining a subset \(\overline{{\mathcal{S}}}_{i} \subseteq {\mathcal{S}}_{i}\) with \(|\overline{{\mathcal{S}}}_{i}| \le T_{pos}-1\); all the remaining \(|{\mathcal{V}}|-T_{pos}\) entities are regarded as negatives for \(v_{i}\). Therefore, for each entity \(v_{i}\), we define the set of positive samples \({\mathcal{P}}_{i}\) as \({\mathcal{P}}_{i}=\{v_{i}\} \cup \{ v_{j} \mid v_{j} \in \overline{{\mathcal{S}}}_{i} \}\) and the set of negative samples \({\mathcal{N}}_{i}\) as \({\mathcal{N}}_{i}= {\mathcal{V}} {\setminus } {\mathcal{P}}_{i}\).
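The selection procedure can be sketched in NumPy as follows, assuming the metapath instance counts \(C_{i,j}\) are already available as a matrix. The function `select_positives` and the toy count matrix are hypothetical, and ties between equal counts are broken by entity index as an arbitrary choice.

```python
import numpy as np

def select_positives(C, t_pos):
    """Select positives per entity from metapath instance counts (sketch).

    C:     (n, n) matrix, C[i, j] = number of metapath instances
           connecting entities v_i and v_j
    t_pos: threshold T_pos on the number of positives, self included
    Returns a boolean (n, n) mask with pos[i, j] True if v_j is a
    positive of v_i; the negatives of v_i are given by ~pos[i].
    """
    n = C.shape[0]
    pos = np.eye(n, dtype=bool)                  # each entity is its own positive
    for i in range(n):
        counts = C[i].astype(float)
        counts[i] = 0.0                          # self is handled separately
        candidates = np.flatnonzero(counts > 0)  # the set S_i
        order = candidates[np.argsort(-counts[candidates], kind="stable")]
        pos[i, order[: t_pos - 1]] = True        # best T_pos - 1 entities
    return pos

# Toy example (assumed counts) with T_pos = 2
C = np.array([[0, 3, 1, 0],
              [3, 0, 2, 2],
              [1, 2, 0, 0],
              [0, 2, 0, 0]])
pos = select_positives(C, t_pos=2)
neg = ~pos
```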
We stress that for the selection of positives we only exploit structural information, without using any information derived from the encoding of external content (i.e., initial features) of entities. Nonetheless, additional conditions on metapaths in the selection of entity pairs can be defined, e.g., by diversifying the minimum number of instances required to enable the enumeration of a specific metapath. CoMLHAN is flexible in both the metapath counting method and the overall positive and negative selection strategy.
For the computation of contrastive losses according to a given view, the embedding of each target entity \(v_{i}\) is selected from the given view, while the positive and negative samples are selected from the other view, as defined in Eqs. 22 and 23, and illustrated in Fig. 9:
where \(sim(\textbf{v}_{1},\textbf{v}_{2})\) denotes the cosine similarity between two vectors \(\textbf{v}_{1}\) and \(\textbf{v}_{2}\), and \(\tau\) is the temperature parameter, which indicates how concentrated the embeddings are in the representation space, so that a lower temperature leads the loss to be dominated by smaller distances and widely separated representations contribute less. Note that Eqs. 22–23 are independent from the specific strategy of positive and negative selection; we leave the investigation of alternative sampling methods as future work (“Conclusions” section).
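A minimal NumPy sketch of a cross-view contrastive loss of this kind is shown below. It assumes an InfoNCE-style form with multiple positives, in which the numerator sums \(\exp (sim/\tau )\) over the positives and the denominator sums it over positives and negatives, all taken from the other view. The function `cross_view_loss` and the toy embeddings are assumptions and may differ in detail from Eqs. 22 and 23.

```python
import numpy as np

def cross_view_loss(Z_anchor, Z_other, pos, tau=0.5):
    """Mean contrastive loss with anchors from one view and positive and
    negative samples from the other view (illustrative sketch).

    Z_anchor: (n, d) projected embeddings under the anchor view
    Z_other:  (n, d) projected embeddings under the other view
    pos:      (n, n) boolean mask, pos[i, j] True if v_j is a positive of v_i
    tau:      temperature parameter
    """
    Za = Z_anchor / np.linalg.norm(Z_anchor, axis=1, keepdims=True)
    Zo = Z_other / np.linalg.norm(Z_other, axis=1, keepdims=True)
    sim = np.exp(Za @ Zo.T / tau)          # exp(cosine similarity / tau)
    num = (sim * pos).sum(axis=1)          # positives only
    den = sim.sum(axis=1)                  # positives and negatives
    return float(-np.log(num / den).mean())

# Toy example (assumed data): 3 entities, each with itself as only positive
Z_ns = np.eye(3)
Z_mp = np.eye(3)
pos = np.eye(3, dtype=bool)
loss_ns = cross_view_loss(Z_ns, Z_mp, pos)   # anchors from network schema view
loss_mp = cross_view_loss(Z_mp, Z_ns, pos)   # anchors from metapath view
```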
The final contrastive loss is computed as a convex combination of the two contrastive losses to balance the effects of the two views:
with \(0< \lambda < 1\). The loss function is completely specified depending on whether an unsupervised or semisupervised paradigm is adopted. The extension to the (semi)supervised case can be done by adding a new term to the final loss, as shown in Eq. 25:
where \(L_{sup}\) is the (semi)supervised term, e.g., cross-entropy for classification tasks, jointly optimized with the contrastive term in an end-to-end fashion, and the coefficient \(\eta\), \(0 \le \eta \le 1\), is given to the contrastive term, since in a (semi)supervised setting the (semi)supervised term is expected to be more relevant.
Similarly to Chen et al. (2020), once the training procedure is completed, the optimized \({\textbf{z}}_{i}^{{{{\rm MP}}}}\) or \({\textbf{z}}_{i}^{{{{\rm NS}}}}\) is used for downstream tasks. In particular, our default choice is to select the embeddings under the metapath view, since metapaths represent high-order relations between target nodes and pillar edges capture the information of instances of the same entity, exploiting multilayer dependencies. It should however be noted that the similarity between the two learned embeddings, for any entity, is expected to be high, since, according to our positive selection strategy, each entity \(v_{i}\) includes itself under the other view in its set of positive samples \({\mathcal{P}}_{i}\). In “Experimental settings” section, we shall provide empirical evidence of such embedding similarities. The final learned embeddings optimized via such cross-view contrastive loss can be used for a wide range of analysis tasks (at node, entity, or edge level), such as node/entity classification, graph clustering, and link prediction.
Experimental evaluation
In this section, we describe the experimental evaluation of our framework. Our main goal is to evaluate CoMLHAN and CoMLHANSA on the entity (multiclass) classification task, choosing a target node type among the different node types with replicas in multiple layers and realworld initial features both at node and entitylevel. “Data” section introduces the data, “Competing methods” section presents the competing methods, “Experimental settings” section discusses the experimental settings, and “Results” section describes the main results.
Data
To the best of our knowledge, there is a lack in the literature of publicly available benchmarks/repositories of networks that are simultaneously multilayer, heterogeneous, and attributed. To overcome this issue so as to properly build suitable network data for our evaluation, we resorted to online resources that would fulfill minimal requirements in terms of public availability, domain accessibility, and variety and richness of stored information. In this respect, we ended up selecting the Internet Movie Database (IMDb),^{Footnote 2} the most popular and authoritative online resource for movies, TV and celebrities.
Note that IMDb was used in existing studies (e.g., Wang et al. 2019; Fu et al. 2020; Zhao et al. 2020) for the same classification task (based on movie genres) we address in this work; however, the variety of the resulting datasets makes it hard to perform a fair comparison, beyond being incomplete in terms of our requirements (i.e., networks that are both multilayer and heterogeneous at each layer).
We constructed two IMDb network datasets, dubbed IMDbMLH and IMDbMLHmb (where suffix ‘mb’ stands for ‘most balanced’). They both model each of the layers of the multilayer network as heterogeneous (and attributed).
We identify three types of entities, inherited by nodes: movie (for short, M), actor (for short, A) and director (for short, D). Type movie is regarded as the target type, therefore the downstream task is multiclass classification on movie genres, which are ‘action’, ‘comedy’ and ‘drama’. Tables 1, 2 and 3 summarize the main characteristics of the networks, which are described next, whereas in Appendix 2, we provide a detailed description of the semantics of the constituting elements and the steps involved in data preprocessing.

IMDbMLH. Our main network dataset was conceived primarily for comparative evaluation with the competitors. As it can be noticed from Table 3, the network is particularly unbalanced w.r.t. the distribution of classes (i.e., movie genres), which reflects a major requirement of one of our competitors, that is, to ensure that the neighbors of each node cover all node types. To fulfill this requirement, we hence had to select from the original dataset nodes of type movie with at least one neighbor of type director (in any layer) and at least one neighbor of type actor (in any layer), while respecting the neighborhood constraint in the monoplex, flattened network. Note that IMDb also contains movie nodes with no links with director or actor nodes, which is however manageable by our methods only. We also filtered out movies with no episode associated with a plot (plots in IMDb are entered by users, and hence it might happen that all episodes of a certain TV series are not associated with plots; or, if available, the plots could be poorly meaningful).

IMDbMLHmb. This network dataset differs from the other one as it aims to reduce class imbalance. To this purpose, we kept the same number of ‘comedy’ and ‘drama’ movie nodes as in IMDbMLH and increased those of the ‘action’ class, by relaxing the constraint of having at least one neighbor actor and one neighbor director for each movie. Due to this relaxation, we could not use IMDbMLHmb for evaluating the competitors, but we exploited the network to further delve into our methods.
Competing methods
We compared CoMLHAN and CoMLHANSA with two unsupervised learning methods, HeCo (Wang et al. 2021) and NSHE (Zhao et al. 2020), on IMDbMLH. HeCo is a contrastive multi-view learning based method for single-layer heterogeneous attributed graphs. We equipped HeCo with the same metapaths and the same positives and negatives as used by our methods. NSHE is an unsupervised non-contrastive GNN-based approach for single-layer heterogeneous attributed graphs, which is designed to learn embeddings preserving both pairwise and network schema structure. In contrast to our methods, NSHE generates initial features of nodes by using DeepWalk (Perozzi et al. 2014) for all types of nodes and, if available, combines them with real-world features.
As a motivation behind our choice of competing methods, we note that HeCo and NSHE are the methods sharing the most aspects with ours (cf. “Related work” section). Indeed, they are able to encode local and global node structure separately in an unsupervised manner, thus capturing the heterogeneity of both nodes and relations. Moreover, they respect the network schema of the graph, ensuring that all types of nodes and edges are visited; they can deal with imbalance in the number of neighbors and relations of a certain type within the network schema; and they allow focusing on the generation of embeddings of a specific type while using heterogeneous information.
It should however be emphasized that both HeCo and NSHE were designed for heterogeneous attributed monoplex networks, i.e., singlelayer graphs. Consequently, we were forced to downgrade our network data through a flattening approach, i.e., by compressing the multilayer graph into a single graph discarding all replicated edges.
Experimental settings
To model each of our network datasets, intralayer edges involving nodes of target type (i.e., movie) were considered between nodes of different types only, and pillar edges were considered as the only interlayer relations, although our framework is designed to model nonpillar edges as well. Metapaths with both terminal nodes of target type were used in the corresponding view and employed in metapath count for the selection of positive samples. For the positive (and negative) selection strategy, we defined two alternatives, named AL3A and AL1A, differing in whether or not they consider constraints on the minimum number of instances of a specific metapath type (AL stands for ‘At Least’). This reflects on a different tradeoff between the number of positives, which is higher in AL1A, and their meaningfulness, which is expected to be higher in AL3A. The positive statistics corresponding to the two strategies are provided in Table 4.
For all methods, we first learned the embedding for each entity in an unsupervised fashion and then trained a classifier for the final class prediction. Recall that for the final classification task we use the embeddings learned under the metapath view, since it captures relations between target nodes, although our positive selection strategy and the joint optimization of the loss function are expected to entail similar representations under the two views. To validate this hypothesis, for each entity \(v_{i}\), we computed the cosine similarity between the embedding under the network schema view (\({\textbf{z}}_{i}^{{{{\rm NS}}}}\)) and the embedding under the metapath view (\({\textbf{z}}_{i}^{{{{\rm MP}}}}\)). Results on IMDbMLH confirmed our hypothesis, since we obtained the following statistics on the distribution of similarity measurements: 0.84 as 25th percentile, 0.87 as mean, 0.88 as median, 0.92 as 75th percentile, and 0.97 as maximum value.
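The similarity check between the two views can be reproduced along the following lines. This is a small NumPy sketch in which the function `view_similarity_stats` and the toy embeddings are hypothetical; it computes the per-entity cosine similarity and the same summary statistics as those listed above.

```python
import numpy as np

def view_similarity_stats(Z_ns, Z_mp):
    """Summary statistics of per-entity cosine similarity between views.

    Z_ns, Z_mp: (n, d) entity embeddings under the network schema view
                and the metapath view, with rows aligned by entity
    """
    num = (Z_ns * Z_mp).sum(axis=1)
    den = np.linalg.norm(Z_ns, axis=1) * np.linalg.norm(Z_mp, axis=1)
    sims = num / den
    return {
        "p25": float(np.percentile(sims, 25)),
        "mean": float(sims.mean()),
        "median": float(np.median(sims)),
        "p75": float(np.percentile(sims, 75)),
        "max": float(sims.max()),
    }

# Toy example (assumed data): the second view is a rescaled copy of the first
rng = np.random.default_rng(0)
Z_ns = rng.normal(size=(10, 4))
Z_mp = 2.0 * Z_ns
stats = view_similarity_stats(Z_ns, Z_mp)
```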
We found the optimal hyperparameters for the representation learning process via grid search algorithm. Specifically, we trained the model using the Adam optimization algorithm (Kingma and Ba 2017) with full batch size, for 10,000 epochs, with early stopping technique based on the contrastive loss value and patience set to 30 (i.e., the training procedure stops if loss value does not decrease for 30 consecutive epochs), with \(\lambda =0.5\) for the convex combination of the two contrastive losses. Learning rate was set to 0.0001, and dropout regularization technique with \(p=0.3\) was applied to the transformed features \({\textbf{h}}\).
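The early-stopping criterion described above can be sketched in plain Python as follows. The function `train_with_early_stopping` and the callback `step_fn`, which is assumed to run one full-batch epoch and return the contrastive loss, are hypothetical names introduced for illustration.

```python
def train_with_early_stopping(step_fn, max_epochs=10_000, patience=30):
    """Run training epochs until the loss stops decreasing (sketch).

    step_fn:  callable taking the epoch index, running one epoch and
              returning the loss value for that epoch (assumed interface)
    patience: number of consecutive epochs without a decrease in the
              loss after which training stops
    Returns (best_epoch, best_loss, epochs_run).
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        loss = step_fn(epoch)
        epochs_run = epoch + 1
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # new best: reset
        else:
            wait += 1
            if wait >= patience:
                break                                      # patience exhausted
    return best_epoch, best_loss, epochs_run

# Toy example (assumed loss curve): decreases until epoch 5, then plateaus
result = train_with_early_stopping(lambda e: max(10 - e, 5),
                                   max_epochs=100, patience=3)
```

In the toy run the best loss is reached at epoch 5 and training halts three epochs later, well before the epoch budget is used up.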
We used \(Q=1\) attention heads, since GATv2 proved to work better than multi-head GAT, and temperature value \(\tau = 0.5\). Moreover, we set the hidden dimension (d) for both views to 64, with \(K=1\) hidden layers in the metapath view [including multiple layers can often lead to the oversmoothing problem (Li et al. 2019)]. In the network schema view, for neighborhood sampling, we randomly sampled 7 and 2 nodes of type actor and director, resp., at each epoch with replacement strategy. In the metapath view, following Wang et al. (2021), we set the threshold for positive selection \(T_{pos}\) equal to 5. Finally, we set \(\lambda ^{\Updownarrow }=1\) for the inter-layer edges, in order to fully exploit the inter-layer connections represented by pillar-edges. In case of CoMLHAN, this setting of \(\lambda ^{\Updownarrow }=1\), which gives the maximum importance to the inter-layer edges, is justified by the construction of across-layer metapaths, since their intermediate node corresponds to a pillar edge (between nodes of type actor or director). In case of CoMLHANSA, we directly had pillar-edges between nodes of type movie, as this proved to be effective in other works, e.g., (Zangari et al. 2021).
As mentioned before, HeCo and NSHE were trained over the flattened networks, i.e., by discarding multilayer information, since they are conceived for single-layer heterogeneous graphs. While for HeCo we kept the same settings as for CoMLHAN (cf. “Competing methods” section), for NSHE we selected the same hyperparameters it uses for the IMDb dataset (Zhao et al. 2020). For a fair comparison, we set its embedding dimension to 64. We used the publicly available software implementations for both competitors.^{Footnote 3}
Once the final embeddings were obtained, we used an MLP with one hidden layer of size 64 as the final classifier, trained using the Adam optimization algorithm with full batch size, either for 2000 epochs or until convergence when the early-stopping regularization technique was selected (with a patience value of 300 epochs); in the latter case, we used the F1 score with macro averaging as the quality criterion on the validation set: since the macro average treats all classes equally, it penalizes wrong predictions on the most unbalanced class, i.e., ‘action’. We split each dataset into training, test and validation sets, by choosing 70%, 15% and 15% of the entities for each class, respectively. Note that, when early stopping was not used, we just discarded the validation set so as not to vary the training and test sets. The learning rate was set to 0.01.
We ran our methods and HeCo for 5 independent runs, which differed in random seed assignment, while we ran NSHE only once, due to its computational overhead, thus finally learning 5 and 1 different model weights, respectively. For each trained model, we derived the final network embeddings (to be given as input to the final classifier) and executed the final classifier over 50 independent runs with the same realization of training, test and validation sets. Finally, we computed the average of the performance scores achieved on the test set. Specifically, for each model, we computed F1-score with micro and macro averaging, AUC score, and F1-score of each class. F1-score with micro and macro averaging is used to evaluate the contributions of all classes, considering individual class contributions or treating all classes equally, respectively. ROC AUC (Area Under the Receiver Operating Characteristic Curve) score with OVR (one-vs-rest) averaging strategy is used to indicate the ability of the classifier to distinguish between classes. We also report F1-score for each class to more effectively evaluate how the model performances are affected by the early stopping technique.
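A per-class 70%/15%/15% split of this kind can be sketched in NumPy as follows. The function `stratified_split`, the rounding rule for class sizes, and the toy label vector are illustrative assumptions rather than the exact procedure used in the experiments.

```python
import numpy as np

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split entity indices into training, test and validation sets,
    applying the given fractions separately within each class (sketch).
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, test, val = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_train = int(round(fractions[0] * len(idx)))
        n_test = int(round(fractions[1] * len(idx)))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:n_train + n_test])
        val.extend(idx[n_train + n_test:])       # remainder goes to validation
    return np.array(train), np.array(test), np.array(val)

# Toy example (assumed labels): three unbalanced classes
labels = np.array([0] * 100 + [1] * 20 + [2] * 40)
train_idx, test_idx, val_idx = stratified_split(labels)
```

Fixing `seed` keeps the same realization of the three sets across repeated classifier runs.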
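To illustrate the difference between the two averaging strategies, the following NumPy sketch computes per-class, micro-averaged and macro-averaged F1 scores from predicted labels. The function `f1_scores` and the toy predictions are assumptions made for the example.

```python
import numpy as np

def f1_scores(y_true, y_pred, n_classes):
    """Per-class, micro and macro F1 for single-label predictions (sketch)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = np.zeros(n_classes)
    tp_sum = fp_sum = fn_sum = 0
    for c in range(n_classes):
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom > 0 else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)   # pooled counts
    macro = float(per_class.mean())                       # classes weighted equally
    return micro, macro, per_class

# Toy example (assumed labels): class 2 is small and half misclassified
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
micro, macro, per_class = f1_scores(y_true, y_pred, n_classes=3)
```

In this toy case the macro average is lower than the micro average because the error on the small class weighs as much as the errors on the larger ones.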
Note that for methods from which multiple models were learned (i.e., they were executed over different seeds), we reported the average values for each performance criterion.
Results
We organize the presentation of our experimental results into four parts: “Evaluation on IMDbMLH” and “Evaluation on IMDbMLHmb” sections concern the evaluation on IMDbMLH and IMDbMLHmb, respectively, whereas “Qualitative inspection of the embeddings” section provides a qualitative analysis of the learned embeddings. Finally, “Summary of results” section summarizes our experimental findings.
Evaluation on IMDbMLH
We first compared CoMLHAN and CoMLHANSA with HeCo and NSHE using initial features corresponding to the best top1000 words by tfidf and positives selection under the tougher condition AL3A, which assumes fewer but higherquality positives per node (cf. Appendix 2); moreover, to ensure a fair comparison with our competitors requiring a flattening approach, we used for our methods only features associated with entities (entitylevel features, for short EL).
We tested the classifier both with and without early-stopping technique. In both cases, as shown in Table 5, our proposed methods achieve high performance scores according to all quality criteria, consistently outperforming the competitors. In fact, although the number of edges that were “lost” due to the flattening approach is relatively small (15% and 20%, resp.), the compression of all layers does not allow the competitors to suitably capture the relations on different layers as well as their inter-layer dependencies. Note that we could not apply our competitors on a single layer of our network, since many entities are missing in each layer; as shown in Table 2, only 504 out of 2807 target entities are shared between the two layers.
CoMLHAN achieves the best performance on almost all the quality criteria (5 out of 6), while CoMLHANSA, being the approach with the closest performance, is the most effective in predicting movies of class ‘action’ (0.467), which is the least represented class. The slight difference between our two methods might be due to the different across-layer information modeling w.r.t. pillar-edges. The across-layer metapaths defined by CoMLHAN can be more meaningful, as they exploit richer inter-layer information than CoMLHANSA. Moreover, the poor performance of HeCo w.r.t. NSHE shows that the contrastive learning mechanism performed by HeCo is not very effective for this dataset. In particular, HeCo shows the lowest performance on the ‘action’ class, indicating that its learned embedding is unable to discriminate the instances of the most unbalanced class.
Impact of earlystopping on the entity classification. Focusing on the results obtained by using the earlystopping technique, the overall performance of our methods turns out to be slightly lower than the setting discarding the earlystopping. In particular, from Table 5, we notice that the F1score values corresponding to the ‘action’ class decrease for all methods when the earlystopping technique is used. We indeed found out that in some runs the training procedure stops too early because the F1 macro computed on validation set does not improve within the patience value. In this respect, Fig. 10 shows the testing and validation F1 macro scores of the final classifier averaged over 50 runs of the same (i.e., fixedseed) model of CoMLHAN, with and without earlystopping technique. When choosing earlystopping, the bestperforming epochs are distributed with a mean value of \(234 \pm 272\), while the 25%, 50% (median) and 75% percentiles are equal to 14, 32 and 421, respectively. Since the increase in the F1 macro occurs around the 400th epoch (Fig. 10, left), the classifier appears to be undertrained in some runs, thus it cannot boost its performance. On the other hand, if the training is not earlystopped, the classifier learns to distinguish more accurately the instances of the most unbalanced class in each run.
The above results suggest that, in the effort of avoiding overfitting and saving computational resources through the earlystopping technique, the final classifier might be undertrained, leading to an underfitting problem if the patience value is not properly set. In fact, we observed that the F1 macro on the validation set stabilizes around the 1000th epoch (Fig. 10, right); however, as shown in Fig. 11, the overall benefit gained by a high patience value is marginal: a patience value of 900 led to an F1 macro of 0.644, which only decreases to 0.615 when the patience is set to 300, with the improvement being limited to the most unbalanced class, as shown in Fig. 12.
We point out that the hyperparameters of the final classifier were not globally optimized, since this goes beyond the main focus of this work; nonetheless, we recall that the classifier is shared by our methods and the competing ones, so as to ensure fairness in the comparative evaluation. We therefore preferred to speed up the classification stage and set the patience value to 300 for all the experiments employing the earlystopping technique on the classifier.
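The interplay between the patience value and a late improvement of the validation score can be illustrated with a minimal Python sketch. The function name, the toy validation curve and the patience values below are hypothetical and chosen only to reproduce the qualitative behavior discussed above; they are not the scores of our classifier.

```python
def early_stopping_epoch(val_scores, patience):
    """Scan per-epoch validation scores and stop once no improvement has been
    observed for `patience` consecutive epochs.

    Returns (best_epoch, best_score, stopped_epoch), with 0-based epochs.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # Patience exhausted: training would be interrupted here
            return best_epoch, best_score, epoch
    # The curve ended before the patience was exhausted
    return best_epoch, best_score, len(val_scores) - 1

# Toy curve (hypothetical): a long plateau followed by a late jump in F1 macro
val_f1_macro = [0.30] + [0.29] * 9 + [0.60] * 5

short = early_stopping_epoch(val_f1_macro, patience=3)
long = early_stopping_epoch(val_f1_macro, patience=20)
```

With the short patience the scan stops on the plateau and never reaches the late improvement, whereas the long patience lets it reach the higher score at the cost of more epochs.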
Impact of initial feature selection. We analyzed the behavior of the methods when equipped with all initial real features, i.e., without constraining the size of the initial feature space. We carried out the experiments with the same positives selection strategy as in the previous evaluation. Results corresponding to the earlystopping setting are reported in Table 6 (note that we observed no particular differences when not using earlystopping).
Compared to the previous case corresponding to the top1000 initial features, the performance of all methods tends to decrease due to the higher and sparser dimensionality. An exception is represented by NSHE, which slightly improves, probably due to its feature initialization (Zhao et al. 2020). However, CoMLHAN and CoMLHANSA still outperform both competitors, with the former achieving the highest F1 micro, F1 macro and AUC values. Moreover, when keeping all words as initial features, our methods report high values on the ‘action’ class (despite the use of the earlystopping technique), while the competitors maintain similar values to the previous case with top1000 initial features.
The above results hence suggest that dealing with a full space of initial features can enable CoMLHAN and CoMLHANSA to better distinguish the movie instances, and in particular that our methods can effectively exploit these features unlike the competitors.
Evaluation on IMDbMLHmb
We further evaluated CoMLHAN and CoMLHANSA using the IMDbMLHmb network. More specifically, we investigated the behavior of our methods when equipped with nodelevel initial features, hereinafter referred to as NL, i.e., with layerdependent initial features. To this purpose, we first compared the methods under the following setup: initial features corresponding to the top1000 words, positives selection AL3A, with and without the earlystopping technique.
As can be seen from Table 7, performance generally increases w.r.t. the methods initialized with entitylevel features, especially in terms of F1 macro, as a direct consequence of a better coverage of the ‘action’ class. Comparing the results obtained with entitylevel (EL) and nodelevel (NL) features, we observe that, as expected, exploiting initial features at each layer (i.e., the nodelevel case) leads to higher performance of the methods.
Moreover, we observe that the difference between the case with earlystopping and the case without earlystopping decreases on IMDbMLHmb, regardless of the layer dependency of the initial features, i.e., EL or NL setting.
Furthermore, we changed the metapath count strategy for positive selection (AL1A) (refer to Table 4 and Appendix 2 for additional details) to test the sensitivity of our methods, without changing the feature initialization. Results shown in Table 8 reveal a marginal decrease in performance, slightly more evident when using nodelevel initial features. This might be because, according to AL1A, each entity has a number of positive samples which is on average greater than for the AL3A alternative, but the positives can be less meaningful (cf. Appendix 2); nonetheless, we observed a negligible worsening in the performance.
Qualitative inspection of the embeddings
After discussing the results from a numerical point of view, in this section we aim to visually analyze the final entity embeddings in order to gain insights in terms of patterns and clusters. To this purpose, we used Uniform Manifold Approximation and Projection (UMAP) (McInnes et al. 2018), which is a highly effective nonlinear dimensionality reduction algorithm, particularly useful for visualizing relative proximities in highdimensional data. It is based on manifold learning, which can be seen as a generalization of linear projection frameworks like PCA, sensitive to nonlinear structures in data. In recent years, UMAP has gained popularity since it offers several advantages over related algorithms, such as PCA and tSNE (van der Maaten and Hinton 2008). In particular, compared to the latter, UMAP can achieve a better preservation of the global structure of data in the final projection, it is more efficient, and it has no computational restrictions on the embedding dimension. UMAP defines two main hyperparameters to control the balance between local and global structure: nearest neighbors and minimum distance, denoting the number of local nearest neighbors to process, and how tightly UMAP packs points together, respectively. On the one hand, lower values of minimum distance result in more clustered embeddings, while larger values prevent UMAP from packing points together, leading to a more uniform dispersion of points; on the other hand, lower values of nearest neighbors allow UMAP to concentrate more on the local structure, while higher values enable looking at more neighbors for each point, resulting in a more global representation.
Figure 13 shows the twodimensional UMAP visualization of the initial feature embeddings with tfidf weighting (Fig. 13a), and of the final embeddings under the metapath view learned by our methods (Fig. 13b, c), w.r.t. IMDbMLH. We executed UMAP with the following main hyperparameters: size of local neighborhood used for manifold approximation equal to 15, minimum distance between points equal to 0.7, and cosine similarity as proximity measure.
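A configuration matching the hyperparameters reported above could look as follows. This is a sketch that assumes the umap-learn Python package; the text does not state which implementation was used, and `n_components`, `random_state` and the helper name `project_embeddings` are assumptions added for illustration.

```python
# Hyperparameters as reported in the text; n_components and random_state
# are assumptions added to obtain a reproducible two-dimensional plot.
UMAP_PARAMS = {
    "n_neighbors": 15,    # size of the local neighborhood
    "min_dist": 0.7,      # minimum distance between projected points
    "metric": "cosine",   # proximity measure in the input space
    "n_components": 2,
    "random_state": 0,
}

def project_embeddings(embeddings, params=UMAP_PARAMS):
    """Project an (n_entities, dim) array to 2D with umap-learn."""
    import umap  # third-party package `umap-learn`, imported lazily

    reducer = umap.UMAP(**params)
    return reducer.fit_transform(embeddings)
```

The same parameter set would be reused for the initial tfidf features and for the learned embeddings, so that differences between the plots reflect the representations rather than the projection settings.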
In the initial representation (Fig. 13a), all entities of type movie are grouped closely together regardless of their genre, resulting in a cluttered representation. This is actually not surprising, since their plots are provided by users without meeting quality requirements. Nonetheless, Fig. 13b and c show how the final embeddings learned by our methods allow UMAP to better separate entities of different classes.
Summary of results
In this section, we summarize the main findings of the empirical evaluation of our framework. We evaluated it on two novel network datasets derived from IMDb (cf. Appendix 2), which are simultaneously multilayer, heterogeneous, and attributed. Specifically, we modeled IMDb as a temporal network with two layers, where each layer is heterogeneous and corresponds to years of movie releases. The first network dataset, named IMDbMLH, was conceived for the comparative evaluation of our framework, since it fulfills the requirements of our competitors. The second network dataset, named IMDbMLHmb, was designed to reduce class imbalance and is not applicable to the competitors. Thus, we used it to investigate different input settings of our methods, i.e., CoMLHAN and CoMLHANSA.
Experimental results on the entity classification task showed that our methods significantly outperform existing competitors, effectively exploiting both external content and multilayer information. We also demonstrated that the overall performance does not degrade even in the (less realistic) case of a featureset size greater than the number of target nodes. In this case, our methods obtained higher values on the most unbalanced class, suggesting that CoMLHAN and CoMLHANSA can effectively exploit the full space of initial features. To ensure fairness in the evaluation, the final MLP classifier was shared by all methods. Moreover, we investigated the impact of the earlystopping regularization technique on the final classifier, confirming that underfitting phenomena can arise if the patience value is not properly set.
We further inspected the quality of the learned embeddings through a data visualization tool, showing that our crossview contrastive mechanism is beneficial for the downstream classification task, since instances belonging to different genres are properly clustered w.r.t. the initial embedding with only tfidf information. As a related aspect, we provided evidence that, as theoretically expected, the embeddings under the metapath view share a similar structure with the corresponding embeddings under the networkschema view, thus enabling the use for downstream tasks of the embeddings learned under one or the other view.
We investigated further properties of our methods using IMDbMLHmb. In that stage of evaluation, the difference between the case with and without earlystopping is strongly mitigated by the lower imbalance between classes. We showed that our framework is resilient to the selection of positive samples (AL1A vs. AL3A), and able to effectively exploit nodetailored feature information (NL vs. EL).
It should also be noted that our CoMLHAN and CoMLHANSA, which differ in the metapath view, achieved similar performance in all the experiments, showing that both approaches can successfully handle information coming from pillar edges. Specifically, the performance by CoMLHAN would suggest that defining metapaths between different layers (i.e., acrosslayer metapaths) allows one to suitably integrate highorder relations between nodes in different layers.
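The difference between the two ways of handling pillar edges can be sketched in a few lines of Python. The sketch below is only an illustration under simplifying assumptions: nodes are represented as hypothetical (entity, layer) pairs, a single intermediate node type is considered, and a pillar edge is assumed to exist whenever the same entity appears in two layers. The function names and the toy data are not part of our implementation.

```python
from collections import defaultdict

def across_layer_metapath_neighbors(layer_edges):
    """CoMLHAN-style sketch: <t,l> - <x,l> = <x,l'> - <t',l'> with l != l',
    where '=' is a pillar edge between replicas of the intermediate entity x."""
    by_mid = defaultdict(dict)  # intermediate entity -> {layer: set of targets}
    for layer, edges in layer_edges.items():
        for target, mid in edges:
            by_mid[mid].setdefault(layer, set()).add(target)
    neighbors = defaultdict(set)
    for per_layer in by_mid.values():
        for l1, targets1 in per_layer.items():
            for l2, targets2 in per_layer.items():
                if l1 != l2:
                    for t in targets1:
                        neighbors[(t, l1)] |= {(u, l2) for u in targets2}
    return dict(neighbors)

def same_entity_neighbors(layer_targets):
    """CoMLHANSA-style sketch: link each <t,l> to its own replicas <t,l'>."""
    neighbors = {}
    for layer, targets in layer_targets.items():
        for t in targets:
            neighbors[(t, layer)] = {
                (t, l2) for l2, other in layer_targets.items()
                if l2 != layer and t in other
            }
    return neighbors

# Toy two-layer network (hypothetical): targets t*, intermediate nodes x*
layer_edges = {1: [("t1", "x1"), ("t2", "x2")], 2: [("t3", "x1"), ("t2", "x3")]}
layer_targets = {1: {"t1", "t2"}, 2: {"t2", "t3"}}

via_metapaths = across_layer_metapath_neighbors(layer_edges)
via_replicas = same_entity_neighbors(layer_targets)
```

In the toy network, the first function links t1 and t3 through the replicated intermediate node x1, while the second only links the two replicas of t2.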
Related work
We discuss below the most relevant GNNbased approaches that are designed for different aspects of complex networks and are particularly related to our approach. Over the last years, several works have focused on the extension of popular GNN models such as GCN (Kipf and Welling 2017) and GAT (Velickovic et al. 2018) to the heterogeneous or multilayer case. Their extension is still an open research problem. In this section, we explore both semisupervised and unsupervised learning paradigms, with emphasis on contrastive learning approaches in unsupervised contexts.
Representation learning for heterogeneous attributed networks
A major challenge for heterogeneous networks is modeling information from nodes that are reachable via paths of different lengths, possibly involving different semantics and structural relations.
HetGNN (Zhang et al. 2019) introduces a random walk with restart strategy to sample a fixed size of strongly correlated heterogeneous neighbors for each node, and groups them on the basis of their type. It employs two modules of recurrent neural networks, encoding deep feature interactions of heterogeneous contents and content embeddings of different neighboring groups, respectively, which are further combined by an attention mechanism. CoMLHAN shares with HetGNN the modeling approach to external content encoding.
Other models leverage metapath based neighbors and differ in the information captured along the metapaths. HAN (Wang et al. 2019) focuses only on the information associated with the endpoint nodes of metapaths. It employs both nodelevel and semanticlevel attention. Based on the learned attention values, the model generates node embeddings by aggregating features from metapath based neighbors in a hierarchical manner. In addition to the information of the terminal nodes in metapaths, MAGNN (Fu et al. 2020) also incorporates information from intermediate nodes along the metapaths. It uses intrametapath aggregation to incorporate intermediate nodes, and intermetapath aggregation to combine messages from multiple metapaths. DHGCN (Manchanda et al. 2021) incorporates both the information of the nodes along the metapaths and the information in the egonetwork of the endpoint nodes, i.e., the information coming from the direct neighbors of the terminal nodes. It utilizes a twostep schemaaware hierarchical approach, performing attentionbased aggregation of information from the immediate egonetwork, and attentionbased aggregation of information from the neighbors of target type using metapath based convolutions. HGT (Hu et al. 2020) considers only the direct neighbors of each node, without manually designed metapaths, and incorporates information from highorder neighbors of different types through message passing. It introduces a node and edge type dependent attention mechanism and uses meta relations to parameterize the weight matrices for calculating attention over each edge. CoMLHAN supports a userspecified selection of metapaths and focuses on metapath based neighbors of target type. We discard the information of intermediate nodes, according to the idea of differentiating local and highorder information in distinct views.
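As a minimal illustration of metapath based neighbors of target type in which intermediate nodes are discarded, consider the following Python sketch for a length-2 metapath of the form target, intermediate, target. The function name and the toy edges are hypothetical, and the movie/actor reading of the node identifiers (m*, a*) is an assumption made for the example.

```python
from collections import defaultdict

def metapath_neighbors(edges):
    """Neighbors of target nodes along a length-2 metapath T-X-T.

    `edges` is an iterable of (target, intermediate) pairs. Two target nodes
    are metapath based neighbors if they share an intermediate node; the
    intermediate node itself is not kept in the result.
    """
    targets_of = defaultdict(set)  # intermediate node -> connected targets
    for target, mid in edges:
        targets_of[mid].add(target)
    neighbors = defaultdict(set)
    for group in targets_of.values():
        for t in group:
            neighbors[t] |= group - {t}
    return dict(neighbors)

# Toy edges (hypothetical): m* are target nodes, a* are intermediate nodes
edges = [("m1", "a1"), ("m2", "a1"), ("m2", "a2"), ("m3", "a2"), ("m4", "a3")]
mam = metapath_neighbors(edges)
```

The result is a homogeneous neighborhood over target nodes only, which is the kind of structure on which metapath specific aggregation can then operate.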
More recently developed approaches rely on considering node local and highorder structure separately. NSHE (Zhao et al. 2020) introduces a network schema sampling method which generates subgraphs (i.e., schema instances) and a multitask learning method with different predictions to handle the heterogeneity within each schema instance, thus preserving pairwise and network schema proximity simultaneously. HeCo (Wang et al. 2021) employs a crossview contrastive mechanism upon the definition of two views of the graph, named network schema view and metapath view, which collaboratively supervise each other. In the network schema view, a node embedding is learned by aggregating the information from its direct neighbors, applying nodelevel and typelevel attention for the same type and different types of nodes, respectively. In the metapath view, a node embedding is learned by passing messages along multiple metapaths, applying metapath specific convolutional networks and semanticlevel attention for the same and different types of metapaths, respectively. VACAHINE (Khan and Kleinsteuber 2021) aims at jointly learning node embeddings and cluster assignments, using a variational module for the reconstruction of the adjacency matrix in a clusteraware manner and employing multiple contrastive modules for both local and global information.
Similarly to HeCo, CoMLHAN adopts a multiview approach and a contrastive learning mechanism of mutual supervision between two views of the graph, with the addition of acrosslayer information included in the views.
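The mutual supervision between the two views can be sketched with a small pure-Python example of a cross-view contrastive objective. The sketch follows the general form of co-contrastive losses in the style of HeCo; the temperature `tau`, the balance coefficient `lam`, the use of cosine similarity, the positive sets and the toy embeddings are all assumptions for illustration, and the objective actually optimized by CoMLHAN may differ in its details.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def view_loss(z_anchor, z_other, positives, tau=0.5):
    """Mean over nodes i of -log(sum_{j in P_i} s_ij / sum_k s_ik), where
    s_ij = exp(cos(z_anchor[i], z_other[j]) / tau) and P_i is non-empty."""
    total = 0.0
    for i, zi in enumerate(z_anchor):
        sims = [math.exp(cosine(zi, zk) / tau) for zk in z_other]
        pos = sum(sims[j] for j in positives[i])
        total += -math.log(pos / sum(sims))
    return total / len(z_anchor)

def cross_view_loss(z_ns, z_mp, positives, tau=0.5, lam=0.5):
    """Each view is contrasted against the other; lam balances the two terms."""
    return (lam * view_loss(z_ns, z_mp, positives, tau)
            + (1 - lam) * view_loss(z_mp, z_ns, positives, tau))

# Toy embeddings (hypothetical) under the network schema and metapath views
z_ns = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
z_mp_aligned = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
z_mp_shuffled = [[0.1, 0.9], [0.9, 0.1], [0.0, 1.0]]
positives = {0: {0, 2}, 1: {1}, 2: {0, 2}}  # each node is its own positive

loss_aligned = cross_view_loss(z_ns, z_mp_aligned, positives)
loss_shuffled = cross_view_loss(z_ns, z_mp_shuffled, positives)
```

The loss is lower when the two views agree on the positive pairs, and no data augmentation or corruption is needed to obtain negatives, since all non-positive nodes in the other view act as negatives.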
Representation learning for multilayer networks
Some major challenges for multilayer networks involve modeling multiple types of interactions, including both intra and interlayer edges, and exploiting the information of nodes matching the same entity. Here we discuss GNNbased methods focusing on their acrosslayer information modeling.
mGNN (Grassia et al. 2021) provides a generalization of GNNs to the case of multilayer networks. It deals with outsidelayer neighborhood, building an additional layer for the interlayer relations connecting nodes in different layers. The embedding at each layer is computed propagating node features in both the intra and interlayer neighborhood through two independent GNN layers. We share with this approach the capability to deal with general multilayer networks with interlayer edges not being pillaredges. Nevertheless, unlike mGNN, CoMLHAN can handle different types of relations in each layer.
Among the GCNbased approaches, MGCN (Ghorbani et al. 2019) builds a graph convolutional network for each layer employing only links between nodes of the same layer, while an unsupervised term in the loss function also considers interlayer dependencies. A different GCNbased approach is mGCN (Ma et al. 2019), which models explicit adjacency links among different layers. mGAT (Xie et al. 2020) is an attentionbased approach that introduces a regularization term to the loss function to constrain the similarity between each pair of layers. GrAMME (Shanthamallu et al. 2020) provides two different approaches, named GrAMMESG and GrAMMEFusion. The former explicitly builds the interlayer edges between each node in a layer and its counterpart in a different layer, and applies a series of attention layers with the fusionhead method. The latter deals with interlayer dependencies in a different way, as it builds layerwise attention models and introduces an additional layer that exploits interlayer dependencies using only fusion heads. MLGCN and MLGAT (Zangari et al. 2021) exploit both within and outsidelayer neighborhood when computing the embedding on each layer, designing extensions of the GCN and GAT architectures, respectively, to multilayer networks, using the multihead attention mechanism but without the fusionhead strategy to integrate the interlayer dependencies. CoMLHAN employs an attentionbased component for learning the importance of each layer. For the modeling of intralayer information, on the other hand, we do not rule out the use of extensions of GCN or GAT, and suggest that the choice be adapted to the specific need of distinguishing between information of different importance.
More recent works introduce contrastive learning to boost the embeddings in multilayer networks. MGCCN (Liu et al. 2021) uses selfreconstruction, which learns the embedding of each layer by capturing structure and content information, and contrastive fusion, which captures the consistent information in different layers by pulling close positive pairs and pushing away negative pairs in intralayer and interlayer connections. Also, it exploits pillaredges to identify positive pairs. CoMLHAN shares the approach of allowing different attributes for nodes in different layers and of not employing data augmentation to construct negative pairs. AMCGNN (Shi et al. 2021) generates two graph views by data augmentation and compares the embeddings of different layers of GNN encoders to obtain feature representations, learning the importance weights of embeddings in different layers adaptively through the attention mechanism. In contrast to CoMLHAN, the two views in AMCGNN are obtained by exploiting data augmentation on the original graph. DMGI (Park et al. 2019) integrates the relationspecific embeddings corresponding to different layers by introducing a consensus regularization framework minimizing their disagreements and a universal discriminator for all positive and negative pairs regardless of the relation type. Similarly to CoMLHAN, the views of this approach do not rely on changing the graph structure, but the similarity computation still employs a corruption of the attribute matrix, in contrast to our proposed approach. cM2NE (Xiong et al. 2021) proposes a contrastive learning based embedding framework modeling multiple structural views for each layer. The contrastive learning is performed to extract information for a specific view, across the views of a layer and across the aligned layers. CoMLHAN has a coarser granularity in the multiview mechanism, as it is not applied on each layer; on the contrary, our views include by design the acrosslayer information.
We would like to stress here that all the above approaches are designed for networks with only one type of node.
Representation learning for heterogeneous attributed multilayer networks
In the past few years, interest has started to emerge in combining heterogeneity and multilayer aspects; however, the literature still lacks works focusing on embedding generation for such networks. GATNE (Cen et al. 2019) splits the overall node embedding into three parts: the base embedding and attribute embedding are shared among edges of different types, while the edge embedding is computed by aggregation of neighborhood information with the selfattention mechanism. This approach uses metapaths via a metapath based random walk strategy to generate node sequences given as input to a skipgram model during the optimization. CoMLHAN also employs metapaths to capture highorder relations between nodes, although the metapath types are specified at the modeling stage. Moreover, we learn a single encompassing embedding for each node/entity, incorporating different relation types.
We want to emphasize that most existing works claiming to deal with networks that are both heterogeneous and multilayer actually refer to a multiplicity of nodes or of relations that hold globally over the network, but not necessarily on individual layers. The latter is instead an important aspect that we address in our proposed framework.
Conclusions
In this work, we proposed a selfsupervised graph representation learning framework, based on a crossview contrastive learning mechanism, for networks that are simultaneously multilayer, heterogeneous and attributed. Remarkably, our framework is able to deal with networks where each layer is a heterogeneous graph with attributed nodes, and with both intra and interlayer links between nodes. The embeddings of nodes of any given target type are learned by contrasting the encodings generated by two views, i.e., network schema view and metapath view, which embed local and highorder neighborhood information, respectively. The metapath view also enables handling acrosslayer information, which is addressed by two versions of the framework differing in the modeling of pillar edges: CoMLHAN, modeling a particular type of metapaths with terminal nodes belonging to different layers and the intermediate node (of a different type from the target) matching a pillaredge, and CoMLHANSA, directly considering all the instances of the same target entities in other layers. The learned embeddings are taskindependent and hence can eventually be used for different downstream graph mining tasks, at entity/node level, edge level or graph level. We demonstrated our methods under a task of entity classification, based on originally developed network datasets in the IMDb movie context, and including a comparative evaluation with recently proposed methods for heterogeneous graph embedding, HeCo and NSHE.
Possible extensions and future directions
Although our framework can handle an arbitrary number of layers, this is reflected in the number of learnable parameters, thus affecting the framework complexity. In particular, for CoMLHAN, the number of learnable parameters increases with the number of layers in both views, while CoMLHANSA is less sensitive to the number of layers in the metapath view, but is still affected in the networkschema view, since we distinguish relations of the same type across different layers. To reduce the number of learnable parameters of the framework, one direction would be to modify the network schema view so that nodelevel attention weights for a certain relation type are shared over all layers.
As we discuss in Appendix 4, the computational complexity of our framework does not hinder its scalability, since several steps can be easily parallelized. We leave as future work the training of the models based on a minibatch setting in combination with sampling methods (Hamilton et al. 2018; Chen et al. 2018; Zeng et al. 2019; Hu et al. 2020).
Another aspect that might be addressed concerns the modeling of metapaths connecting nodes of different types, where at least one (rather than both) among the starting and ending node is of target type. In this case, the resulting metapath based graph would not be homogeneous, since the metapath based neighbors are of different types. Since increasing the number of views is unlikely to be beneficial (as stated by Hassani and Ahmadi 2020), the definition of the two views should hence be revised.
A further extension would concern the definition of different selection strategies for the positive and/or the negative samples in the contrastive learning stage. On the one hand, the learned features could be exploited for the positives selection in addition to structural information, and on the other hand, hard negative sampling techniques could be devised (Kalantidis et al. 2020; Ahrabian et al. 2020; Robinson et al. 2020).
Our framework can also be extended to deal with graph mining tasks other than node/entity classification, such as regression, clustering, and link prediction. For instance, to accomplish the latter, we would need to handle the embeddings downstream of one of the two views at nodelevel so as to compute pairwise hidden representations of nodes (upon which a similarity function can be used to predict the link strength of any pair of nodes).
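As a minimal sketch of the last step, a similarity function over a pair of node embeddings could be as simple as a rescaled cosine similarity. The function name, the choice of cosine similarity and the rescaling to the unit interval are assumptions made for illustration; a learned scoring function could equally be used.

```python
import math

def link_score(h_u, h_v):
    """Rescale the cosine similarity of two non-zero embeddings to [0, 1]."""
    dot = sum(a * b for a, b in zip(h_u, h_v))
    norm_u = math.sqrt(sum(a * a for a in h_u))
    norm_v = math.sqrt(sum(b * b for b in h_v))
    cos = dot / (norm_u * norm_v)
    return (1.0 + cos) / 2.0

# Toy embeddings (hypothetical): identical, orthogonal and opposite directions
same = link_score([1.0, 0.0], [1.0, 0.0])
orthogonal = link_score([1.0, 0.0], [0.0, 1.0])
opposite = link_score([1.0, 0.0], [-1.0, 0.0])
```

Higher scores would then be interpreted as a stronger predicted link between the two nodes.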
Equally interesting would be to investigate other applications of our framework in different scenarios, having different structural and semantic properties, stressing the flexibility of the proposed framework by identifying datasets with more or less overlap between layers, and possibly with one or more node types without replicas. Contextually, by identifying richer sources of information, we could inspect other learning paradigms, such as multimodal or multitask learning, where multiple tasks are solved simultaneously, which has been proven effective for the task of recommendation in heterogeneous networks (Li et al. 2020).
Availability of data and materials
The Python code for the proposed methods and the network datasets are available at https://people.dimes.unical.it/andreatagarelli/comlhan/.
Notes
Alternatively, this operation can be carried out as \({\textbf{W}}^{(r)}[{\textbf{h}}_{\langle i,l \rangle } \Vert {\textbf{h}}_{\langle j,l' \rangle }] = {\textbf{W}}_{{\textbf{1}}}^{(r)} {\textbf{h}}_{\langle i,l \rangle } + {\textbf{W}}_{{\textbf{2}}}^{(r)} {\textbf{h}}_{\langle j,l' \rangle }\). Note that in order to reduce the number of parameters, \({\textbf{W}}^{(r)}\) can be constrained to \([{\textbf{W}}_{{\textbf{1}}}^{(r)} \Vert {\textbf{W}}_{{\textbf{1}}}^{(r)}]\).
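The identity in the note above, i.e., that applying a weight matrix to the concatenation of two vectors equals the sum of its two column blocks applied to each vector separately, can be checked numerically with a short Python sketch. The matrix and vector values are arbitrary toy numbers, and the helper names are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def split_columns(W, d):
    """Split W into W1 (first d columns) and W2 (remaining columns)."""
    return [row[:d] for row in W], [row[d:] for row in W]

# Toy values (arbitrary): W has shape 2 x 4, h_i and h_j have dimension 2
W = [[1, 2, 3, 4], [0, -1, 5, 2]]
h_i, h_j = [1, 2], [3, -1]

W1, W2 = split_columns(W, len(h_i))
lhs = matvec(W, h_i + h_j)  # W applied to the concatenation [h_i || h_j]
rhs = [a + b for a, b in zip(matvec(W1, h_i), matvec(W2, h_j))]
```

Constraining both blocks to the same matrix halves the number of parameters of this transformation, at the cost of treating the two endpoints symmetrically before any nonlinearity.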
The HeCo and NSHE source code are publicly available at https://github.com/liunonline/HeCo and https://github.com/AndyJZhao/NSHE, respectively.
References
Ahrabian K, Feizi A, Salehi Y, Hamilton WL, Bose AJ (2020) Structure aware negative sampling in knowledge graphs. CoRR. arXiv:2009.11355
Baevski A, Hsu WN, Xu Q, Babu A, Gu J, Auli M (2022) data2vec: a general framework for selfsupervised learning in speech, vision and language. arXiv. https://doi.org/10.48550/ARXIV.2202.03555
Brody S, Alon U, Yahav E (2021) How attentive are graph attention networks? CoRR. arXiv:2105.14491
Cen Y, Zou X, Zhang J, Yang H, Zhou J, Tang J (2019) Representation learning for attributed multiplex heterogeneous network. CoRR. arXiv:1905.01669
Chen J, Ma T, Xiao C (2018) Fastgcn: Fast learning with graph convolutional networks via importance sampling
Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. https://doi.org/10.48550/ARXIV.2002.05709
Fu X, Zhang J, Meng Z, King I (2020) MAGNN: metapath aggregated graph neural network for heterogeneous graph embedding. CoRR. arXiv:2002.01680
Ghorbani M, Baghshah MS, Rabiee HR (2019) MGCN: semisupervised classification in multilayer graphs with graph convolutional networks. In: Spezzano F, Chen W, Xiao X (eds) ASONAM ’19: international conference on advances in social networks analysis and mining, Vancouver, British Columbia, Canada, 27–30 August, 2019. ACM, pp 208–211. https://doi.org/10.1145/3341161.3342942
Grassia M, Domenico MD, Mangioni G (2021) mGNN: generalizing the graph neural networks to the multilayer case. CoRR. arXiv:2109.10119
Hamilton WL, Ying R, Leskovec J (2018) Inductive representation learning on large graphs. CoRR. arXiv:1706.02216
Hassani K, Ahmadi AHK (2020) Contrastive multiview representation learning on graphs. CoRR. arXiv:2006.05582
Hu Z, Dong Y, Wang K, Sun Y (2020) Heterogeneous graph transformer. CoRR. arXiv:2003.01332
Jing B, Xiang Y, Chen X, Chen Y, Tong H (2021) Graphmvp: multiview prototypical contrastive learning for multiplex graphs. CoRR. arXiv:2109.03560
Kalantidis Y, Sariyildiz MB, Pion N, Weinzaepfel P, Larlus D (2020) Hard negative mixing for contrastive learning. CoRR. arXiv:2010.01028
Khan RA, Kleinsteuber M (2021) A framework for joint unsupervised learning of clusteraware embedding for heterogeneous networks. CoRR. arXiv:2108.03953
Khoshraftar S, An A (2022) A survey on graph representation learning methods. CoRR. https://doi.org/10.48550/arXiv.2204.01855
Khosla P, Teterwak P, Wang C, Sarna A, Tian Y, Isola P, Maschinot A, Liu C, Krishnan D (2020) Supervised contrastive learning. CoRR. arXiv:2004.11362
Kingma DP, Ba J (2017) Adam: a method for stochastic optimization. CoRR. arXiv:1412.6980
Kipf TN, Welling M (2017) Semisupervised classification with graph convolutional networks. In: Proceedings of the 5th international conference on learning representations (ICLR)
Li G, Muller M, Thabet A, Ghanem B (2019) DeepGCNs: can GCNs go as deep as CNNs? In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV)
Li H, Wang Y, Lyu Z, Shi J (2020) Multitask learning for recommendation over heterogeneous information network. IEEE Trans Knowl Data Eng 34:789–802
Lin B, Wang X, Dong Y, Huo C, Ren W, Xu C (2021) Metapaths guided neighbors aggregated network for heterogeneous graph reasoning. https://doi.org/10.48550/ARXIV.2103.06474
Linsker R (1988) Selforganization in a perceptual network. Computer 21(3):105–117. https://doi.org/10.1109/2.36
Liu L, Kang Z, Tian L, Xu W, He X (2021) Multilayer graph contrastive clustering network. CoRR. arXiv:2112.14021
Liu Y, Pan S, Jin M, Zhou C, Xia F, Yu PS (2021) Graph selfsupervised learning: a survey. CoRR. arXiv:2103.00111
Ma Y, Wang S, Aggarwal CC, Yin D, Tang J (2019) Multidimensional graph convolutional networks. In: Proceedings of the 2019 Siam international conference on data mining. SIAM, pp 657–665
Manchanda S, Zheng D, Karypis G (2021) Schemaaware deep graph convolutional networks for heterogeneous graphs. CoRR. arXiv:2105.00644
Mavromatis C, Karypis G (2021) Hemi: multiview embedding in heterogeneous graphs. CoRR. arXiv:2109.07008
McInnes L, Healy J, Melville J (2018) UMAP: uniform manifold approximation and projection for dimension reduction. https://doi.org/10.48550/ARXIV.1802.03426
Park C, Kim D, Han J, Yu H (2019) Unsupervised attributed multiplex network embedding. CoRR. arXiv:1911.06750
Perozzi B, AlRfou R, Skiena S (2014) Deepwalk: online learning of social representations. In: Macskassy SA, Perlich C, Leskovec J, Wang W, Ghani R (eds) Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 701–710
Robinson J, Chuang C, Sra S, Jegelka S (2020) Contrastive learning with hard negative samples. CoRR. arXiv:2010.04592
Shanthamallu US, Thiagarajan JJ, Song H, Spanias A (2020) GrAMME: semisupervised learning using multilayered graph attention models. IEEE Trans Neural Netw Learn Syst 31(10):3977–3988. https://doi.org/10.1109/TNNLS.2019.2948797
Shi S, Xie P, Luo X, Qiao K, Wang L, Chen J, Yan B (2021) Adaptive multilayer contrastive graph neural networks. CoRR. arXiv:2109.14159
van der Maaten L, Hinton G (2008) Visualizing data using tSNE. J Mach Learn Res 9:2579–2605
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Lu, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates Inc, Red Hook
Velickovic P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2018) Graph attention networks. In: Proceedings of the 6th international conference on learning representations (ICLR)
Wang X, Ji H, Shi C, Wang B, Cui P, Yu PS, Ye Y (2019) Heterogeneous graph attention network. CoRR. arXiv:1903.07293
Wang M, Zheng D, Ye Z, Gan Q, Li M, Song X, Zhou J, Ma C, Yu L, Gai Y, Xiao T, He T, Karypis G, Li J, Zhang Z (2020) Deep graph library: a graphcentric, highlyperformant package for graph neural networks. CoRR. arXiv:1909.01315
Wang X, Liu N, Han H, Shi C (2021) Selfsupervised heterogeneous graph neural network with cocontrastive learning. CoRR. arXiv:2105.09111
Xie Y, Zhang Y, Gong M, Tang Z, Han C (2020) MGAT: multiview graph attention networks. Neural Netw 132:180–189. https://doi.org/10.1016/j.neunet.2020.08.021
Xiong H, Yan J, Pan L (2021) Contrastive multiview multiplex network embedding with applications to robust network alignment. In: Zhu F, Ooi BC, Miao C (eds) KDD ’21: The 27th ACM SIGKDD conference on knowledge discovery and data mining, virtual event, Singapore, August 14–18, 2021. ACM, pp 1913–1923. https://doi.org/10.1145/3447548.3467227
Yang G, Kang Y, Zhu X, Zhu C, Xiao G (2021) Info2vec: an aggregative representation method in multilayer and heterogeneous networks. Inf Sci 574:444–460. https://doi.org/10.1016/j.ins.2021.06.013
Zangari L, Interdonato R, Caliò A, Tagarelli A (2021) Graph convolutional and attention models for entity classification in multilayer networks. Appl Netw Sci 6(1):87. https://doi.org/10.1007/s41109-021-00420-4
Zeng H, Zhou H, Srivastava A, Kannan R, Prasanna V (2019) GraphSAINT: graph sampling based inductive learning method. https://doi.org/10.48550/ARXIV.1907.04931
Zhang C, Song D, Huang C, Swami A, Chawla NV (2019) Heterogeneous graph neural network. In: Teredesai A, Kumar V, Li Y, Rosales R, Terzi E, Karypis G (eds) Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining, KDD 2019, Anchorage, AK, USA, August 4–8, 2019. ACM, pp 793–803. https://doi.org/10.1145/3292500.3330961
Zhao J, Wang X, Shi C, Liu Z, Ye Y (2020) Network schema preserving heterogeneous information network embedding. In: Bessiere C (ed) Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020, pp 1366–1372. ijcai.org. https://doi.org/10.24963/ijcai.2020/190
Funding
LM was funded by the PON FSE-FESR Ricerca e Innovazione 2014–2020 (PON R&I), Azione I.1 “Dottorati Innovativi con caratterizzazione industriale”, Avviso n. 1233, July 30, 2020. The paper was partially funded by POR CALABRIA FESR 2014/2020 “Smart Cities Lab” (CUP J89J21018490005, former J89J21009750007).
Author information
Contributions
LM and AT conceived the idea presented in this work. LM, LZ, and AT developed the theoretical definition of the methods. LM, LZ, and AT defined the evaluation methodology and experiments to perform. LM and LZ developed the code and took care of running the experiments. All authors performed evaluation of the results and related discussion. AT supervised the writing, reviewing and editing. All authors participated in the writing process. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Appendix 1. Notations
Table 9 summarizes the main notations used throughout this work.
Appendix 2. Data
In the following we provide details of our datasets built upon IMDb. For the sake of simplicity, we model a temporal network with two layers, corresponding to the years 2020 and 2021 of movie release. Each layer is modeled as heterogeneous (and attributed). Each node type, i.e., movie (M), actor (A) and director (D), can be associated with its own initial features. For instance, a movie can be associated with a rating, one or more genres, its gross revenue and budget, a poster, a trailer, etc., while an actor or a director can be associated with personal data, such as a short biography, a photo, a list of the most famous characters played or works directed, etc. An entity of type movie corresponds to a tvSeries, while a node of type movie corresponds to a specific season in a certain year. Each season is intended as an aggregation of its episodes, i.e., their combined information. Pillar edges between nodes of type movie connect seasons of the same TV series in different years. As previously stated, movie is regarded as the target type; therefore, the classification task is to predict the movie genre among ‘action’, ‘comedy’ and ‘drama’.
An entity of type actor or director corresponds to a specific person in that role. Its nodes are included in a given layer only if the person worked in the related year. Pillar edges between nodes of type actor refer to the same actor who acted in some movies in different years; analogously, pillar edges between nodes of type director refer to the same director who directed some movies in different years. Pillar edges are here considered as the only inter-layer relations, although our framework is designed to model non-pillar edges as well, connecting nodes possibly of different types in different layers (e.g., movies referencing other movies or actors referencing movies).
Intra-layer edges involving nodes of target type are only between nodes of different types, and in particular between nodes of types M and A (M–A meaning “interpreted by” and A–M meaning “starred in”) and between nodes of types M and D (M–D meaning “directed by” and D–M meaning “directed”). Our framework would also allow direct edges between nodes of the same type; for instance, any two movies sharing a certain feature (e.g., “same genre as”, “same running time as”, “same original language as”, etc.) can be connected. We stress that intra-layer edges in different layers are generally seen as different relation types; for instance, if different layers are built according to movie genres, a relation between the same two types of nodes in one layer can assume a different meaning in the other layers.
We select six metapaths, two for each type of entity: MAM (movie–actor–movie) and MDM (movie–director–movie) for type M, from which we derive pairs of movies starring the same actor or directed by the same director, respectively; AMA (actor–movie–actor) and AMDMA (actor–movie–director–movie–actor) for type A, indicating pairs of actors who acted in the same movie or who acted in different movies directed by the same director, respectively; DMD (director–movie–director) and DMAMD (director–movie–actor–movie–director) for type D, identifying pairs of directors who co-directed the same movie or who directed different movies with a common actor, respectively. Metapaths MAM and MDM, involving the target type, are used in the corresponding view and are both employed in the metapath count for the selection of positive samples. Specifically, for each entity pair, AL1A (at least 1 actor) increases the metapath count for each MDM or MAM instance connecting the two entities, requiring at least one MDM or one MAM, i.e., the two movies have at least a director or an actor in common; AL3A (at least 3 actors) increases the metapath count for each MDM or MAM instance connecting the two entities, requiring at least one MDM or three MAMs, i.e., the two movies have at least a director or at least three actors in common. As a result, AL1A can rely on more, but less meaningful, positives per entity (including movie pairs sharing only one actor), while AL3A can rely on fewer, but more meaningful, positives per entity. Main statistics of the two alternatives are provided in Table 4.
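The two positive-selection criteria can be illustrated with a minimal sketch. It assumes that each shared actor (or director) between two movies counts as one MAM (or MDM) instance; the function names `metapath_counts` and `positive_pairs`, the `min_actors` parameter and the toy data are hypothetical and are not taken from the authors' code.

```python
from collections import Counter
from itertools import combinations


def metapath_counts(movie_to_neighbors):
    """Count metapath instances (e.g., MAM) between every pair of movies.

    movie_to_neighbors maps a movie id to the set of its actors (or directors).
    Assumption: each shared neighbor yields one metapath instance.
    """
    counts = Counter()
    for m1, m2 in combinations(sorted(movie_to_neighbors), 2):
        shared = len(movie_to_neighbors[m1] & movie_to_neighbors[m2])
        if shared:
            counts[(m1, m2)] = shared
    return counts


def positive_pairs(mam, mdm, min_actors):
    """Keep pairs with at least one MDM or at least `min_actors` MAM instances.

    The value stored for a kept pair is its total metapath count (MAM + MDM).
    """
    positives = {}
    for pair in set(mam) | set(mdm):
        n_mam, n_mdm = mam.get(pair, 0), mdm.get(pair, 0)
        if n_mdm >= 1 or n_mam >= min_actors:
            positives[pair] = n_mam + n_mdm
    return positives


# Toy data (hypothetical): three movies with their casts and directors.
actors = {"m1": {"a1", "a2", "a3"}, "m2": {"a1", "a2", "a3"}, "m3": {"a1"}}
directors = {"m1": {"d1"}, "m2": {"d2"}, "m3": {"d2"}}

mam = metapath_counts(actors)
mdm = metapath_counts(directors)
al1a = positive_pairs(mam, mdm, min_actors=1)  # AL1A criterion
al3a = positive_pairs(mam, mdm, min_actors=3)  # AL3A criterion
```

In this toy example the pair (m1, m3), which shares a single actor and no director, is a candidate positive under AL1A but not under AL3A.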
The across-layer metapaths are built upon the same metapath types, with the intermediate node matching a pillar edge. For instance, as shown in Fig. 7, given a metapath of type MAM (for each layer), the corresponding across-layer metapath has the same actor in both layers and the two movies belonging to different layers.
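As a rough illustration of the across-layer MAM case, the sketch below pairs a movie node in one layer with a movie node in another layer whenever the same actor entity appears in both casts; the actor's presence in both layers stands in for the pillar edge. The function name `across_layer_mam` and the toy casts are hypothetical.

```python
def across_layer_mam(cast_l1, cast_l2):
    """Pairs (movie in layer 1, movie in layer 2) sharing an actor entity.

    cast_l1 / cast_l2 map a movie node id to the set of actor entities
    connected to it in that layer. An actor present in both layers is
    linked to itself by a pillar edge, so it can serve as the intermediate
    node of an across-layer MAM instance.
    """
    pairs = {}
    for m1, actors1 in cast_l1.items():
        for m2, actors2 in cast_l2.items():
            shared = actors1 & actors2
            if shared:
                pairs[(m1, m2)] = len(shared)
    return pairs


# Toy layers (hypothetical): seasons released in 2020 and in 2021.
cast_2020 = {"s1_2020": {"a1", "a2"}, "s2_2020": {"a3"}}
cast_2021 = {"s1_2021": {"a1"}, "s3_2021": {"a2", "a4"}}

across = across_layer_mam(cast_2020, cast_2021)
```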
We provide nodes/entities of target type with real-world initial features; for the other two types, we identify initial features by associating each node with a one-hot indicator vector (Kipf and Welling 2017). Initial features of movie nodes/entities are extracted from the plots of individual episodes, where terms are selected according to their term-frequency inverse-document-frequency (tf-idf) relevance scores. Specifically, we filter out words that appear in fewer than 10 documents or in more than 60% of the documents in the corpus. After that, in our experimental settings, we either selected the top-1000 words according to their tf-idf scores, or kept all (unfiltered) words (4085).
We emphasize that CoMLHAN is conceived to be general and flexible, so as to exploit all available information while remaining effective even when such information is lacking, e.g., in case of poor across-layer relationships, or when one or more types of neighbors are missing for some nodes; for instance, a new TV series could have a single season, or the information regarding its cast could be missing. In addition, nodes could show high variability in the number of neighbors, e.g., a TV series may or may not be associated with a large cast. External information can indeed be available either at node level or entity level, therefore initial features can be layer-dependent and associated with nodes, or layer-independent and associated with entities. For instance, we might handle the plots of the TV series (entities), which we also assign to the respective seasons (nodes) in different years (layers), as well as the plots of the individual seasons, from which we derive the overall plots of the series.
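The vocabulary selection described above can be sketched in plain Python. The text does not state how per-document tf-idf scores are aggregated into a single ranking, so this sketch assumes that terms are ranked by their tf-idf score summed over all documents, with raw term counts as tf and log(N/df) as idf; the function name `select_vocabulary` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def select_vocabulary(docs, min_df=10, max_df_ratio=0.6, top_k=1000):
    """Filter terms by document frequency, then keep the top_k by tf-idf.

    docs is a list of token lists. Assumption: the ranking score of a term
    is its tf-idf summed over all documents (not specified in the paper).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    kept = {t for t, c in df.items() if min_df <= c <= max_df_ratio * n_docs}
    score = Counter()
    for doc in docs:
        tf = Counter(t for t in doc if t in kept)
        for term, freq in tf.items():
            score[term] += freq * math.log(n_docs / df[term])
    # Sort by decreasing score, breaking ties alphabetically.
    ranked = sorted(score, key=lambda t: (-score[t], t))
    return ranked[:top_k]


# Toy corpus (hypothetical) with thresholds scaled down to its size.
plots = [
    ["spy", "chase", "the"],
    ["spy", "joke", "the"],
    ["love", "the"],
    ["spy", "love", "the", "rare"],
]
vocab = select_vocabulary(plots, min_df=2, max_df_ratio=0.75, top_k=2)
```

With the defaults `min_df=10`, `max_df_ratio=0.6` and `top_k=1000`, the function mirrors the thresholds reported in the text.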
Appendix 3. Content encoding
The first stage in our proposed framework aims to encode contents associated with nodes or entities, possibly coming from external sources, which might be of different domains. Note that, for an attributed heterogeneous graph, different types of nodes could be associated with different types of content, and that even nodes of the same type could have information from multiple sources and in different forms, such as structured attributes, unstructured text, and multimedia content. External information can indeed be available either at node level or entity level, therefore initial features can be layer-dependent and associated with nodes, or layer-independent and associated with entities.
As previously introduced, given a type \(a \in A\), we denote with \({\textbf{x}}_{\langle i,l \rangle }^{(a)}\) the initial feature vector of node \(\langle i,l \rangle\) (i.e., entity \(v_{i}\) in layer \(G_{l}\)), and with \({\textbf{x}}_{i}^{(a)}\) the initial feature vector of entity \(v_{i}\). We allow the initial feature vectors corresponding to different entity/node types to be of different lengths. In this case, the content encoding stage requires a feature transformation step in order to project features of different types to the same latent space, using type-specific transformation matrices. Formally, in case of content features associated with entities, we obtain the projected feature embedding \({\textbf{h}}_{i}^{(a)}\), for entity \(v_{i}\) of type a, as follows:
$${\textbf{h}}_{i}^{(a)} = \sigma \left( {\textbf{W}}^{(a)} {\textbf{x}}_{i}^{(a)} + {\textbf{b}}^{(a)} \right) \qquad (26)$$
where \({\textbf{W}}^{(a)} \in {\mathbb{R}}^{d \times d^{(a)}_{in}}\) and \({\textbf{b}}^{(a)} \in {\mathbb{R}}^{d}\) are the learnable matrix and bias term for the entity type a, respectively, and \({\textbf{x}}^{(a)}_{i}\) is the initial feature vector of length \(d^{(a)}_{in}\) associated with entity \(v_{i}\). Analogously, in case of content features associated with nodes, i.e., dependent on the specific layer, we obtain the projected feature embedding \({\textbf{h}}^{(a)}_{\langle i,l \rangle }\), for node \(\langle i,l \rangle\) of type a, as follows:
$${\textbf{h}}_{\langle i,l \rangle }^{(a)} = \sigma \left( {\textbf{W}}_{l}^{(a)} {\textbf{x}}_{\langle i,l \rangle }^{(a)} + {\textbf{b}}_{l}^{(a)} \right) \qquad (27)$$
where \({\textbf{W}}^{(a)}_{l}\) and \({\textbf{b}}^{(a)}_{l}\) are the learnable layer-specific matrix and bias term for the entity type a, respectively, and \({\textbf{x}}^{(a)}_{\langle i,l \rangle }\) is the initial feature vector of length \(d^{(a)}_{in}\) associated with node \(\langle i,l \rangle\).
For both Eqs. 26 and 27, \(\sigma (\cdot )\) is a nonlinear activation function; by default, we define it as \(ELU(\cdot ) = \max (0, \cdot ) + \min (0,\mu (\exp (\cdot ) - 1))\), with \(\mu =1\). Note also that d is chosen such that \(d \le \min _{a \in A}\{ d^{(a)}_{in}\}\).
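The type-specific projection described above (an affine map followed by ELU) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weights are fixed toy values rather than learned parameters, and the names `elu`, `project` and `params` are hypothetical.

```python
import math


def elu(x, mu=1.0):
    """ELU activation: x if x > 0, else mu * (exp(x) - 1)."""
    return x if x > 0 else mu * (math.exp(x) - 1.0)


def project(x, W, b):
    """Compute h = ELU(W x + b), with W of shape d x d_in and b of length d."""
    return [
        elu(sum(w * xi for w, xi in zip(row, x)) + bi)
        for row, bi in zip(W, b)
    ]


# Toy type-specific parameters (hypothetical values), with d = 2.
# Movies (M) have d_in = 3; actors (A) have d_in = 4 (a one-hot indicator).
params = {
    "M": ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]], [0.0, 1.0]),
    "A": ([[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 1.0, -1.0]], [0.0, 0.0]),
}

h_movie = project([1.0, 2.0, 3.0], *params["M"])
h_actor = project([0.0, 1.0, 0.0, 0.0], *params["A"])
```

Both outputs have length d = 2 even though the input lengths differ, which is the purpose of the type-specific matrices. A layer-specific variant would simply index the parameters by (type, layer) instead of by type.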
Considering the possibility that each entity/node, regardless of its type, could be associated with information coming from multiple and diverse sources, the process of content feature generation would be more articulated, as two aspects should be considered, namely content-specific feature extraction and multimodal content feature aggregation. Indeed, an aggregation step would be needed to integrate contents from different modalities (i.e., structured attributes, text, images, etc.), and it can effectively be carried out by supplying an autoencoder model with the concatenation of the various content-specific embeddings, or by using an attention layer for their convex combination. Moreover, the aggregation step would be preceded by content-specific feature extraction in case the feature vectors \({\textbf{x}}\) were not immediately available, and hence suitable methods (e.g., word embeddings or contextualized language models for text, convolutional networks for images, etc.) should be applied to generate features from the raw data associated with nodes/entities.
We also allow each entity/node, regardless of its type, to be associated with no external information; in this case, initial features could be generated using identity matrices or by sampling randomly from a selected type of distribution (e.g., uniform, normal, exponential). It should however be noted that content feature generation is beyond the objectives of this work; the interested reader can refer to the recent literature on this topic, such as Baevski et al. (2022), which proposes a general self-supervised learning framework for generating contextualized latent representations of different modalities, including speech, images and text.
Appendix 4. Computational complexity aspects
In this section, we discuss the computational complexity aspects of our framework. In our analysis, we assume sparse graphs in both views, dense content features obtained after the content encoding stage, and the worst case in terms of size of the networks. That is, each entity appears in each layer, i.e., the total number of nodes in the network schema view is \({\mathcal{O}}(|{\mathcal{V}}| \ell )\), and each target node appears in the metapath based graphs, i.e., the total number of nodes in the metapath view is \({\mathcal{O}} (|{\mathcal{V}}^{(t)}| \ell )\) for each metapath. Without loss of generality, we consider that each relation \(r \in R\) involves nodes of target type (thus ensuring that \(|R|\) relations are considered in the network schema view), and we discard the across-layer metapaths. Before delving into the details, we recall that the inputs and outputs of each submodule of stages 2 and 3 are \(d\)-dimensional embeddings, with \(d \ll |V_{{\mathcal{L}}}|\).
As concerns the space complexity, the memory requirement is mainly given by the storage of the hidden states (e.g., \({\textbf{z}}^{{{{\rm NS}}}}\) and \({\textbf{z}}^{{{{\rm MP}}}}\)), the learnable weight matrices (\({\textbf{W}}\)s) and the attention vectors (\({\textbf{a}}\)s). In particular, the attention values in NSVE1 require an overhead of \(|E_{r}|\) for each relation r involving the target nodes. Moreover, we need to store in memory the positive and the negative samples for each entity, i.e., \({\mathcal{P}}_{i}\) and \({\mathcal{N}}_{i}\), where \({\mathcal{P}}_{i} \cup {\mathcal{N}}_{i} = {\mathcal{V}}^{(t)}\).
Regarding the time complexity, the graph structure encoding stage requires the computation of embeddings under the network schema view and the metapath view, which can be calculated independently and therefore in parallel. The computational complexity of the former view is shared by both CoMLHAN and CoMLHANSA, while the latter view requires a separate analysis for the two methods. In the following, we analyze the cost of each of the steps performed in the two views.

(NSVE1)
The computational cost of the NSVE1 step, where node-level attention takes place, depends on an attention mechanism for each relation in each layer. Given a relation type \(r \in R\), let \(V_{r}\) be the set of nodes connected through the edges in \(E_{r}\). The computational complexity of Eq. 2 with a single attention head is \(T_{r} ={\mathcal{O}}(|V_{r}| d^{2} + |E_{r}| d)\) (Brody et al. 2021), where the first term concerns the feature transformation step of GATv2, while the second term corresponds to the cost of calculating a general attention function, which can be parallelized. In the case of Q attention heads, both terms are multiplied by a factor of Q, although the different heads can still be computed in parallel. Note that, in practice, each target node considers only a subset of neighbors for each relation r due to our sampling strategy, which saves computational resources; hence, \(|E_{r}|\) is an upper bound on the number of edges involved in relation r. Finally, since we equip our approaches with the same attention mechanism on each relation r, and the relations can be processed in parallel, the final time complexity of NSVE1 is \({\mathcal{O}}(\max (T_{r_{1}}, T_{r_{2}}, \dots , T_{r_{|R|}}))\).

(NSVE2)
NSVE2 employs the same multilayer perceptron model for type-level and across-layer attention. In particular, under the assumption that each relation \(r \in R\) involves nodes of target type, the time complexity of the type-level attention step is \({\mathcal{O}}(|{\mathcal{V}}^{(t)}| d^{3} |R|)\), because it involves dense matrix and vector operations. For the across-layer attention case, under the initial hypothesis that each target entity appears in each layer, the time complexity is \({\mathcal{O}}( |{\mathcal{V}}^{(t)}| \ell d^{3})\). Also, note that in both cases the attention coefficients can be calculated in parallel, for each relation r and each layer l, respectively.

(MPVE1)
For each metapath and layer, the complexity of MPVE1 corresponds to the complexity of GCN (Kipf and Welling 2017), whose cost for K neural layers is \({\mathcal{O}}( K \, nonzero({\textbf{A}}_{l}) \, d + K |{\mathcal{V}}^{(t)}| d^{2} )\), where \(nonzero({\textbf{A}}_{l})\) is the number of nonzero entries in the adjacency matrix of the lth layer. Note that, in practical applications, K assumes small values due to the issue of over-smoothing (Li et al. 2019), and the computations on each layer and metapath (including across-layer metapaths) are independent of each other, hence they can be easily parallelized.

(MPVE2)
MPVE2 requires an attention model to compute the importance of each metapath in each layer. Since the attention mechanism is the same as that used in NSVE2, the cost of MPVE2 is \({\mathcal{O}}(p |{\mathcal{V}}^{(t)}| \ell d^{3} )\), where the attention coefficients for each metapath can be computed in parallel.

(MPVESA1)
Regarding CoMLHANSA, the time complexity of MPVESA1 corresponds to the application of MLGCN (Zangari et al. 2021) with K neural layers. Its computational complexity is \({\mathcal{O}}( K \, nonzero({\textbf{A}}^{\mathrm{sup}}) \, d + K |{\mathcal{V}}^{(t)}| \ell d^{2})\), where \(nonzero({\textbf{A}}^{\mathrm{sup}})\) is the number of nonzero entries in the \({\textbf{A}}^{\mathrm{sup}}\) matrix. The first term corresponds to the propagation steps, while the second corresponds to the feature transformation steps of MLGCN.

(MPVESA2)
Similarly to MPVE2, this submodule requires the application of semantic-level attention, in order to combine the embeddings learned from each multilayer metapath based graph. Since we discarded across-layer metapaths in MPVE2, the computational complexity of this step is the same for both CoMLHAN and CoMLHANSA, i.e., \({\mathcal{O}}(p |{\mathcal{V}}^{(t)}| \ell d^{3} )\). Also, similarly to NSVE2, MPVESA2 needs to attend over the information learned at each layer, through a level of across-layer attention, whose complexity is negligible compared to the first term, i.e., \({\mathcal{O}}(|{\mathcal{V}}^{(t)}| \ell d^{3} )\).
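To make the two terms of the GCN cost used for MPVE1 more concrete, the following sketch implements one propagation step on a sparse adjacency structure and a helper that evaluates the dominant operation count for K layers. It is a simplified illustration, not the authors' implementation: the nonlinearity is omitted, the adjacency is assumed to be already normalized, and the names `gcn_layer` and `gcn_cost` as well as the toy graph are hypothetical.

```python
def gcn_layer(adj, H, W):
    """One simplified GCN step H' = A (H W), without the activation.

    adj maps a node index to a list of (neighbor, weight) pairs, i.e., the
    nonzero entries of the (assumed pre-normalized) adjacency matrix A.
    H holds n feature rows of length d; W is a d x d weight matrix.
    """
    d_in, d_out = len(W), len(W[0])
    # Feature transformation: about n * d * d multiplications.
    HW = [[sum(h[k] * W[k][j] for k in range(d_in)) for j in range(d_out)]
          for h in H]
    # Propagation over nonzero entries: about nonzero(A) * d multiplications.
    out = [[0.0] * d_out for _ in H]
    for i, neighbors in adj.items():
        for j, weight in neighbors:
            for c in range(d_out):
                out[i][c] += weight * HW[j][c]
    return out


def gcn_cost(K, nnz, n, d):
    """Dominant operation count for K layers: K * nnz * d + K * n * d^2."""
    return K * nnz * d + K * n * d * d


# Toy graph (hypothetical): 3 target nodes, 5 nonzero adjacency entries.
adj = {0: [(0, 0.5), (1, 0.5)], 1: [(0, 0.5), (1, 0.5)], 2: [(2, 1.0)]}
H = [[1, 0], [0, 1], [2, 2]]
W = [[1, 0], [0, 1]]  # identity weights, so H W = H

H_next = gcn_layer(adj, H, W)
ops = gcn_cost(K=2, nnz=5, n=3, d=2)
```

The same counting applies to MPVESA1 when the number of nonzero entries of the supra-adjacency matrix and the node count across all layers are plugged in.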
The third stage, based on contrastive learning, first requires a transformation through an MLP, which costs \({\mathcal{O}}(|{\mathcal{V}}^{(t)}| d^{2})\); then the loss functions of the two views are computed. For this last step, we need to compute the pairwise cosine similarities between nodes belonging to different views, which costs \({\mathcal{O}}(|{\mathcal{V}}^{(t)}|^{2} d)\).
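The quadratic term of the third stage comes from comparing every embedding of one view with every embedding of the other. A minimal plain-Python sketch of this pairwise cosine similarity is given below; the function name `cosine_matrix` and the toy embeddings are hypothetical, and the contrastive loss itself is not reproduced here.

```python
import math


def cosine_matrix(Z_ns, Z_mp):
    """Pairwise cosine similarity between two sets of d-dimensional rows.

    Z_ns and Z_mp stand for the embeddings of the n target entities under
    the two views. The nested loops perform n * n dot products of length d,
    which gives the quadratic cost in the number of target entities.
    """
    norms_ns = [math.sqrt(sum(x * x for x in u)) for u in Z_ns]
    norms_mp = [math.sqrt(sum(x * x for x in v)) for v in Z_mp]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv)
         for v, nv in zip(Z_mp, norms_mp)]
        for u, nu in zip(Z_ns, norms_ns)
    ]


# Toy embeddings (hypothetical), n = 2 entities and d = 2 dimensions.
Z_ns = [[1.0, 0.0], [0.0, 2.0]]
Z_mp = [[3.0, 0.0], [1.0, 1.0]]
S = cosine_matrix(Z_ns, Z_mp)
```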
To sum up, considering all the above terms, the time complexity of our framework can be characterized in terms of the size of the multilayer heterogeneous network and the size of the latent space (i.e., embedding length), which is typical in GNN-based approaches. Specifically, in the second stage, the cost is linear in the number of target nodes and edges, while it is cubic in the embedding length, due to the computation of the attention models. In the third stage, the cost becomes quadratic in the number of target entities, due to the calculation of pairwise node similarities. We remark that our framework is extremely flexible in terms of the choice of each submodule. In particular, we propose using an attention mechanism only if different instances of the same type are assumed to provide information with different importance. Moreover, several steps can be carried out in parallel (e.g., the attention model on each relation r, the GCN models for each metapath, and type-level, semantic-level and across-layer attention). Thus, in practical applications, the computational complexity of our framework does not hinder its scalability. In this regard, we aim to improve the efficiency of the training process in future work, e.g., by equipping it with a mini-batch training setting (Hamilton et al. 2018), or by investigating more efficient similarity methods.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Martirano, L., Zangari, L. & Tagarelli, A. CoMLHAN: contrastive learning for multilayer heterogeneous attributed networks. Appl Netw Sci 7, 65 (2022). https://doi.org/10.1007/s41109-022-00504-9
DOI: https://doi.org/10.1007/s41109-022-00504-9
Keywords
 Graph representation learning
 Contrastive learning
 Multilayer networks
 Heterogeneous networks
 Attributed networks
 Entity classification