Graph convolutional and attention models for entity classification in multilayer networks
Applied Network Science volume 6, Article number: 87 (2021)
Abstract
Graph Neural Networks (GNNs) are powerful tools that now reach state-of-the-art performance in a wide range of tasks, such as node classification, link prediction and graph classification. A challenging aspect in this context is to redefine basic deep learning operations, such as convolution, on graph-like structures, where nodes generally have unordered neighborhoods of varying size. State-of-the-art GNN approaches such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) work on monoplex networks only, i.e., on networks modeling a single type of relation among a homogeneous set of nodes. The aim of this work is to generalize such approaches by proposing a GNN framework for representation learning and semi-supervised classification in multilayer networks with attributed entities, an arbitrary number of layers, and arbitrary intra-layer and inter-layer connections between nodes. We instantiate our framework with two new formulations of the GCN and GAT models, namely ML-GCN and ML-GAT, specifically devised for general, attributed multilayer networks. The proposed approaches are evaluated on an entity classification task on nine widely used real-world network datasets coming from different domains and with different structural characteristics. Results show that both our proposed ML-GAT and ML-GCN methods provide effective and efficient solutions to the problem of entity classification in multilayer attributed networks, being faster to learn and offering better accuracy than the competitors. Furthermore, results show how our methods are able to take advantage of the presence of real attributes for the entities, in addition to arbitrary inter-layer connections between the nodes in the various layers.
Introduction
The topic of graph representation learning and its impact on related analysis tasks in network data has attracted great attention over the past few years, leading to one of the fastest growing subfields of research in deep learning. Graph Neural Networks (GNNs) are powerful deep learning tools that have reached state-of-the-art performance in a wide range of tasks on graph-structured data, such as node classification, link prediction, community detection and graph classification (Xu et al. 2019). One of the main challenges addressed by these methods is to redefine basic deep learning operations, such as convolution, on graph structures, where nodes may have neighborhoods that are unordered and of varying size (Bronstein et al. 2017). The graph convolutional network (GCN) model proposed by Kipf and Welling (2017), where convolution on graphs is carried out by aggregating the values of each node's features along with its neighbors' features, paved the way for the development of further methods based on GCNs, designed either for end-to-end learning tasks or for low-dimensional embedding generation; in the latter case, for instance, the graph autoencoder (GAE) model (Kipf and Welling 2016) is one of the earliest GCN-based approaches for unsupervised learning, clustering and link prediction on graphs. Another important advance in deep learning, namely the attention mechanism, has also inspired several studies that apply it to graph-structured data. The graph attention network (GAT) model by Velickovic et al. (2018) exploits a masked self-attention mechanism in order to learn weights between each pair of connected nodes, where self-attention allows for discovering the most representative parts of the input. It should be noted that the above methods work on simple networks only, i.e., networks modeling a single type of relation for a homogeneous set of nodes.
Due to the growing interest in modeling and mining multilayer networks (Kivelä et al. 2014) as well as in developing network embedding models for attributed or feature-rich networks (Gaito et al. 2021; Interdonato et al. 2019), in the last few years a significant effort has been put into studying representation learning models and techniques specifically conceived for multilayer networks with associated attribute information at node level. Nevertheless, representation learning becomes even more challenging for multilayer networks because of the presence of intra-layer and inter-layer relations, different layer characteristics, as well as node features. In particular, these approaches must be able to obtain new latent node representations based on intra- and inter-layer dependencies. For example, in a cross-platform multilayer network one might want to predict a user's gender based on the relationships and properties that users and their contacts have on each platform.
A further source of complexity relates to the partial knowledge that we may have about the attributes associated with the nodes in one or multiple layers of the input multilayer network. Moreover, for the purpose of a classification or prediction task, it is often the case in a real scenario that the labels for a given target concept (i.e., class) are available for only a few nodes.
The aim of this work is to revise the main GNN approaches to address both representation learning and prediction problems in a multilayer network. Our focus is on how to model properties of the multilayer network topology and of the available node attributes, so as to learn an effective representation of the node features within and across the multiple layers of the input network. Furthermore, the aim is also to exploit the learned representations to predict the class labels of the entities (or actors) in the network. A key requirement of our proposed approach is its flexibility with respect to multilayer networks characterized not only by arbitrary node coverage and number of layers, but also by the presence of inter-layer relations that are not constrained to link node instances of the same entity over the layers.
We summarize our contributions as follows:

1. To cope with partial knowledge of attributes as well as class labels at node level, which is usually encountered in real scenarios, we address the end-to-end learning problem of embedding and classification for the entities in a multilayer attributed network following a transductive semi-supervised learning approach. In this context, class labels are known at training time only for a relatively small number of nodes in the multilayer network, while all available structural information and node attributes can be exploited for learning, and the goal is to predict the labels of the unlabeled nodes.

2. We propose a representation learning and node classification framework based on GNN models and designed for arbitrary multilayer attributed networks. In accord with the significant trend in the literature whereby graph convolutional and attention-based approaches are by far the most widely used, the core GNN component of our framework is instantiated both as GCN and GAT.

3. Unlike existing GNN approaches for multiplex or multi-relational graphs, we propose to aggregate topological neighborhood information from different layers directly into the propagation rule of the GNN component, i.e., during its forward learning phase, in order to make the embedding of an entity in a particular layer depend both on its neighbors in that layer (dubbed within-layer neighborhood) and on its neighbors located in other layers where the entity occurs (referred to as outside-layer neighborhood). Therefore, by K sequential applications of our multilayer-designed GNN components, the K-hop within-layer and outside-layer neighborhood structural information for each entity is incorporated in the embedding process.

4. The GNN components designed in the proposed framework are able to incorporate external information associated with the multilayer network, in the form of attributes that can be available at entity level or at node level for each particular layer of the input network.

5. Experimental evidence from widely used multiplex networks and from a real-world attributed multilayer network dataset has shown that both the GAT and the GCN instances of our framework represent effective and efficient solutions to the problem of entity classification in multilayer attributed networks. Our methods were also compared with two recently proposed methods for multi-relational networks based on a GAT model, named GrAMME-SG and GrAMME-Fusion (Shanthamallu et al. 2020): our methods achieve accuracy as good as or better than the competitors (up to a 13% accuracy improvement), while outperforming them in terms of efficiency (with a training time that is two orders of magnitude lower than that of the GrAMME methods in most cases).
Background
In this section we provide background notions on deep learning approaches based on Graph Neural Networks (GNNs). For a better comprehension, we first introduce some preliminary notations.
Preliminary notations. We are given a graph \(G=(V,E)\), where V is the set of nodes, with \(|V|=n\), and E is the set of edges. Besides the adjacency matrix \({\mathbf {A}}\) that represents the graph structure, a further matrix is provided as input, \({\mathbf {X}}\), which stores the feature descriptions of the nodes, i.e., each node \(v_i\) is provided with a vector \({\mathbf {x}}_i \in {\mathbb {R}}^f\), where f is the number of input features. The general goal for GNNs is to learn a function that takes as input the above matrices and yields an output feature representation of the nodes \({\mathbf {Z}} = [{\mathbf {z}}_1, \dots , {\mathbf {z}}_n]^{\mathrm {T}}\), where \({\mathbf {z}}_i \in {\mathbb {R}}^{d}\) denotes the embedding or output feature vector for node \(v_i \in V\) and d is the size of the embedding space. Every GNN layer can be modeled as a nonlinear function \({\mathbf {H}}^{(k+1)} = g(\mathbf{H }^{(k)}, {\mathbf {A}})\), where k is an index for a neural network layer, \(\mathbf{H }^{(0)}=\mathbf{X }\) and \(\mathbf{H }^{(K)}=\mathbf{Z }\), with K the total number of neural network layers. Table 1 summarizes the main notations that will be used throughout this paper.
Deep graph learning. Deep learning frameworks such as convolutional neural networks (CNNs) (LeCun and Bengio 1995), recurrent neural networks (RNNs) (Hochreiter and Schmidhuber 1997) and autoencoders (AEs) (Vincent et al. 2010) have been extremely successful in several machine learning tasks and for a variety of domains, including grid-structured data (e.g., images), sequences, and text data (LeCun et al. 2015). However, they cannot be straightforwardly applied to graph-structured data, because several operations (e.g., convolution) need to be revised to suit this type of data. Indeed, graph-structured data exhibit additional sources of complexity compared with other types of data, including the lack of a natural ordering of the nodes and/or edges, variability in the size and topology of a node's neighborhood, and the opportunity for modeling different types of node relations (Wu et al. 2021).
In recent years, the trend of using deep learning techniques to analyze graphs has contributed to the birth of connectionist models called Graph Neural Networks (GNNs), which aim to extend deep learning to graph-structured data, exploiting its dependencies through a message passing scheme between the nodes (Zhou et al. 2019). In contrast to random walk approaches, which consider only nodes co-occurring in a random walk and optimize the embeddings to encode random walk statistics (Grover and Leskovec 2016; Perozzi et al. 2014), GNNs follow a scheme in which each node iteratively combines its neighbors' features with its own to obtain a new representation. After k iterations (i.e., at the k-th layer of the GNN), node representations have a nonlinear relation with the information in their k-hop neighborhood. Interestingly, the neighborhood aggregation scheme is strictly connected to a random walk process: in fact, as studied in Xu et al. (2018), in a K-layer GNN, the influence distribution of node \(v_i\) (i.e., how much a change in the initial features of any node \(v_j\) affects the final embedding of \(v_i\)) is equivalent, in expectation, to a random walk of length K starting from node \(v_i\); therefore, the influence of \(v_j\) on \(v_i\) is proportional to the probability of visiting node \(v_j\) in a random walk of length K starting from node \(v_i\). Moreover, it should be noted that using a high number of iterations (i.e., K) could lead to an over-smoothing problem, i.e., representations of nodes could become very similar to one another after several iterations, as an effect of an overly expanded range of the node influences.
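The link between neighborhood aggregation, random walks and over-smoothing can be illustrated with a minimal NumPy sketch. It assumes a deliberately simplified, purely linear propagation (no weights, no nonlinearity) with the row-normalized matrix \(\widetilde{{\mathbf {D}}}^{-1}\widetilde{{\mathbf {A}}}\); the toy graph, the feature values and the helper `spread` are illustrative and not taken from the paper.

```python
import numpy as np

# Toy path graph 0 - 1 - 2 (illustrative).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_tilde = A + np.eye(3)                              # add self-loops
P = A_tilde / A_tilde.sum(axis=1, keepdims=True)     # row-stochastic transition matrix

X = np.array([[1., 0.],
              [0., 1.],
              [5., 5.]])                             # arbitrary initial node features

def spread(H):
    """Largest gap between any two nodes on any feature (0 means identical rows)."""
    return float((H.max(axis=0) - H.min(axis=0)).max())

H = X.copy()
spreads = [spread(H)]
for _ in range(50):                                  # 50 linear propagation steps
    H = P @ H
    spreads.append(spread(H))

# Row i of P^K is the distribution of a K-step random walk started at node i.
walk_dist = np.linalg.matrix_power(P, 50)
```

In this simplified setting the node representations collapse onto a single vector as the number of steps grows, and every row of `walk_dist` approaches the same degree-proportional distribution, which is one way to see why a very large K can wash out node-specific information.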
GNNs are end-to-end trainable, i.e., they can be trained in a supervised or unsupervised manner depending on the task to be performed, and they are designed to compute the new embedding state using both structural information of the graph and properties of nodes and edges, through an iterative scheme that aggregates neighborhood properties. This final embedding state can be used to produce an output such as the node labels, or even to obtain the representation of an entire graph through pooling, for example, by summing the representation vectors of all nodes in the graph (Xu et al. 2019).
Two of the most successful approaches are Graph Convolutional Networks (GCN) (Kipf and Welling 2016) and Graph Attention Networks (GAT) (Velickovic et al. 2018), both convolutional-style GNNs, but with different assumptions regarding the contribution of the neighborhood. More specifically, the former model adopts a spectral approach in which the convolution is defined by signal theory filters, while GAT aims to incorporate the attention mechanism in the propagation step, learning the importance of the neighborhood of each node through a masked self-attention strategy. In the following, we briefly review the above two approaches.
Graph convolutional network. A GCN is the counterpart of a convolutional neural network model for graph-structured data that uses a graph spectral approach to convolution. Specifically, it operates through a first-order spectral approximation of the graph by restricting the filters (limiting the order of the Chebyshev polynomial) to operate in the neighborhood at one step away from each node.
Equation (1) shows the propagation rule of a GCN layer, which is the building block of the model, as it aims to learn a function capable of generating new feature representations for each node \(v_i \in V\) by propagating and transforming its own features and those of its neighbors:
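In node-wise form, following the standard formulation of Kipf and Welling (2017) and treating hidden states as row vectors (a notational assumption made here), the rule can be written as:

\[
{\mathbf {h}}_i^{(k+1)} = \sigma \left( \sum _{j \in \Gamma (i) \cup \{i\}} \frac{1}{\sqrt{\widetilde{{\mathbf {D}}}_{ii}\, \widetilde{{\mathbf {D}}}_{jj}}}\; {\mathbf {h}}_j^{(k)}\, {\mathbf {W}} \right) \qquad (1)
\]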
Above, \(\Gamma (i)\) denotes the set of neighbors of node i, \(\sigma (\cdot )\) is a nonlinear activation function (e.g., \(\mathrm {ReLU}(\cdot ) = \max (0,\cdot )\)), \({\mathbf {W}}\) is a layer-specific trainable weight matrix, and \(\widetilde{{\mathbf {D}}}_{ii}=\sum _{j}\widetilde{{\mathbf {A}}}_{ij}\) is the degree matrix derived from \(\widetilde{{\mathbf {A}}} = {\mathbf {A}} + {\mathbf {I}}_n\), where \({\mathbf {A}}\) is the adjacency matrix of the input graph G, and \({\mathbf {I}}_n\) is the identity matrix of size \(n\).
Note that self-loops are added to the graph, and the adjacency and degree matrices are built accordingly (this is known as the renormalization trick), so that each node can also consider its own features, and potential numerical issues can be controlled; specifically, symmetric normalization is used because repeated applications of an unnormalized propagation rule can lead to numerical instability and problems in the calculation of the gradient when used in a deep neural network.
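The propagation step and the renormalization trick can be sketched in a few lines of NumPy. This is a minimal illustration in dense matrix form, assuming ReLU as \(\sigma\); the function name `gcn_layer`, the toy graph and the random weights are hypothetical and not part of the original work.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step in matrix form (illustrative sketch).

    A: (n, n) symmetric adjacency matrix without self-loops
    H: (n, f_in) node states, one row per node
    W: (f_in, f_out) trainable weight matrix
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # renormalization trick: add self-loops
    deg = A_tilde.sum(axis=1)                    # D~_ii = sum_j A~_ij
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization: entry (i, j) becomes A~_ij / sqrt(D~_ii * D~_jj)
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_hat @ H @ W)        # ReLU activation (assumed choice of sigma)

# Toy path graph 0 - 1 - 2 with 3 input features and 2 output features.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 3))
W0 = rng.normal(size=(3, 2))
H1 = gcn_layer(A, H0, W0)

# With identity features and weights the layer returns the normalized adjacency itself.
A_hat = gcn_layer(A, np.eye(3), np.eye(3))
```

Because each row of the normalized matrix only has non-zero entries for a node and its direct neighbors, stacking K such layers reaches exactly the K-hop neighborhood.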
GCN plays a central role in building many complex GNN models, including unsupervised learning architectures such as Graph AutoEncoders (GAEs) (Kipf and Welling 2016; Zhou et al. 2019). A GAE is a framework for unsupervised learning that leverages GCN to encode node information into low-dimensional vectors. Specifically, the encoder that calculates the embeddings consists of two GCN layers with a nonlinear activation function, whereas the decoder, which reconstructs the original adjacency matrix from the node embeddings, is a simple inner product decoder. The model is hence trained by minimizing the reconstruction loss between the original adjacency matrix and the reconstructed one.
Graph attention network. Graph Attention Network (GAT) (Velickovic et al. 2018) is a graph neural network architecture that uses the attention mechanism to learn weights between connected nodes. In contrast to GCN, which uses predetermined weights for the neighbors of a node corresponding to the normalization coefficients described in Eq. (1), GAT modifies the aggregation process of GCN by learning the strength of the connection between neighboring nodes through the attention mechanism (Wu et al. 2021).
The building block is a Graph Attention Layer (GAT layer), which generalizes the attention model to graph-structured data and is agnostic to the particular choice of attention mechanism. Stacking GAT layers several times allows one to develop deep neural network architectures.
In order to learn the weighting factor of each node's features, attention coefficients are computed based on the features of the connected nodes using a function \(a: {\mathbb {R}}^{d} \times {\mathbb {R}}^{d} \mapsto {\mathbb {R}}\). Equation (2) indicates the importance of node j's features to node i:
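With \({\mathbf {W}}\) the shared weight matrix and hidden states written as row vectors (a notational assumption made here), the unnormalized coefficient in the standard GAT formulation reads:

\[
e_{ij} = a\left( {\mathbf {h}}_i {\mathbf {W}},\; {\mathbf {h}}_j {\mathbf {W}} \right) \qquad (2)
\]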
The graph structure is taken into account because, for each node \(v_i\), only its neighborhood is considered when performing the masked attention mechanism. In Velickovic et al. (2018) the attention mechanism is a feedforward neural network, which utilizes the nonlinear activation function LeakyReLU: instead of outputting 0 for all negative values as ReLU does, LeakyReLU outputs a value of \(\beta x\) for any negative input x (where \(\beta\) is a hyperparameter that determines the amount of leak, usually set between 0.01 and 0.2), whereas for positive values of x, it simply outputs x. By allowing a nonzero gradient for negative values, LeakyReLU overcomes the issue of dying neurons that affects the ReLU function.
Then, the attention coefficients are normalized through the softmax function as in Eq. (3), so that the attention weights sum up to 1 over all neighbors of a node, eventually obtaining the normalized attention coefficients \(\alpha\):
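Written out for a node \(v_i\) with neighborhood \(\Gamma (i)\) (whether \(v_i\) itself belongs to \(\Gamma (i)\) depends on the use of self-loops, which is left as an assumption here), the softmax normalization is:

\[
\alpha _{ij} = \mathrm {softmax}_j (e_{ij}) = \frac{\exp (e_{ij})}{\sum _{u \in \Gamma (i)} \exp (e_{iu})} \qquad (3)
\]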
As the next step, each node updates its hidden state by weighting the features of the neighborhood nodes with the attention coefficients according to Eq. (4):
Note that the learnable weight matrix \({\mathbf {W}}\) is pre-applied to every node in order to transform the input features into higher-level features.
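In the same row-vector notation (an assumption made here), the standard single-head GAT update takes the form:

\[
{\mathbf {h}}_i^{(k+1)} = \sigma \left( \sum _{j \in \Gamma (i)} \alpha _{ij}\, {\mathbf {h}}_j^{(k)}\, {\mathbf {W}} \right) \qquad (4)
\]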
To stabilize the learning process of self-attention, the mechanism has been extended similarly to Vaswani et al. (2017) by employing multi-head attention. The operations of the layer are independently replicated Q times (with different parameters) and the outputs are aggregated feature-wise. Equation (5) shows the computation of a linear combination of the features by concatenating the Q attention heads, where \(\alpha _{ij}^{(q)}\) is the normalized attention coefficient computed by the q-th attention mechanism:
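Assuming each head q has its own weight matrix \({\mathbf {W}}^{(q)}\) and keeping the row-vector notation (both are notational assumptions made here), the concatenated multi-head update can be written as:

\[
{\mathbf {h}}_i^{(k+1)} = \mathop {\big \Vert }_{q=1}^{Q}\; \sigma \left( \sum _{j \in \Gamma (i)} \alpha _{ij}^{(q)}\, {\mathbf {h}}_j^{(k)}\, {\mathbf {W}}^{(q)} \right) \qquad (5)
\]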
where \(\Vert\) denotes the concatenation operator. For the combination of the Q independent heads, Velickovic et al. (2018) suggest concatenating them in the hidden layers and averaging them in the final layer; in the latter case, the application of the \(\sigma\) function is delayed until after the averaging.
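The steps of a GAT layer (unnormalized scores, masked softmax, weighted aggregation, multi-head combination) can be put together in a short NumPy sketch. It makes several assumptions that go beyond the text: the attention function is a single-layer network on the concatenated transformed features, self-loops are included in the mask, and ReLU stands in for \(\sigma\). The names `gat_head`, `gat_layer` and `a_vecs` are hypothetical.

```python
import numpy as np

def leaky_relu(x, beta=0.2):
    """LeakyReLU: x for positive inputs, beta * x otherwise."""
    return np.where(x > 0, x, beta * x)

def gat_head(A, H, W, a, beta=0.2):
    """One attention head; returns (alpha, aggregated features before sigma)."""
    Z = H @ W                                        # pre-apply W to every node: (n, d)
    d = Z.shape[1]
    # e_ij = LeakyReLU(a . [z_i || z_j]), split into a source and a target term
    e = leaky_relu((Z @ a[:d])[:, None] + (Z @ a[d:])[None, :], beta)
    mask = (A + np.eye(A.shape[0])) > 0              # masked attention (self-loops assumed)
    e = np.where(mask, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)             # numerical stability of the softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True) # rows sum to 1 over the neighborhood
    return alpha, alpha @ Z

def gat_layer(A, H, Ws, a_vecs, concat=True):
    """Multi-head layer: concatenate heads (hidden layers) or average them (final layer)."""
    outs = [gat_head(A, H, W, a)[1] for W, a in zip(Ws, a_vecs)]
    if concat:
        return np.concatenate([np.maximum(0.0, o) for o in outs], axis=1)
    return np.maximum(0.0, np.mean(outs, axis=0))    # sigma applied after averaging

# Toy path graph 0 - 1 - 2, 5 input features, Q = 2 heads of size d = 4.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 5))
Q, d = 2, 4
Ws = [rng.normal(size=(5, d)) for _ in range(Q)]
a_vecs = [rng.normal(size=2 * d) for _ in range(Q)]

alpha, _ = gat_head(A, H, Ws[0], a_vecs[0])
H_cat = gat_layer(A, H, Ws, a_vecs, concat=True)
H_avg = gat_layer(A, H, Ws, a_vecs, concat=False)
```

Concatenation yields Q times d output features per node, whereas averaging keeps the output at size d, which is why averaging is the natural choice for a final layer whose width must match the number of classes.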
Limitations of GNNs. Although GNNs can exploit the node attributes and topology of the input graph, their learning power for a node classification task could be limited when there is a misalignment between features, graph and class labels. While combining graph and feature information generally leads to an improvement in classification performance, the study in Qian et al. (2021) has shown the importance of graph and feature alignment in GNN models such as GCN, highlighting that when the feature and graph subspaces associated with the data are not aligned, the GCN approach can exhibit a performance degradation, being even outperformed by an MLP model learned from the data features while discarding the network topology. More specifically, if the node connections in the network are not consistent with the associated node features (e.g., two adjacent nodes having significantly different features), then the node-neighborhood aggregation scheme may not be beneficial. As a matter of fact, typical schemes of neighborhood aggregation in GNNs inherently assume the homophily principle, i.e., connected nodes have the same class label or similar features. Another study, proposed in Zhu et al. (2020), has shown that learning on networks with low homophily (i.e., connected nodes have different class labels) is a challenging task for GNNs, which could perform worse than an MLP. However, as reported in other studies, such as Ma et al. (2021), GCN models can still achieve good performance on low-homophily networks, provided that nodes with the same class have similar neighborhoods, and different classes have distinguishable patterns.
Although investigating the aforementioned limitations is not a focus of this work, we take them into account in our experimental evaluation. In particular, we measure the homophily score of the evaluation networks (cf. "Data" section), and we investigate the impact of not using real-world features for our evaluation networks, where class assignment is based exclusively on graph topology (cf. "Evaluation with competing methods" section).
Related work
In order to contextualize our proposal with respect to existing literature, we here discuss some of the recently proposed GNN methodologies specifically designed for multilayer networks.
One of the first frameworks that considers inter-layer edges for embedded representation learning is MANE (Li et al. 2018). However, the optimization problem of node embedding solved in MANE does not account for node attributes, and its overall approach is not end-to-end. By contrast, approaches that aim to extend deep-learning-based methods for single-layer graphs, such as GCN and GAT, are well-suited for modeling both within-layer and inter-layer dependencies to generate embeddings for nodes that are associated with input features, and in addition they have the advantage of being designed to learn node embeddings and a classifier simultaneously via an end-to-end approach. Indeed, node classification approaches for multilayer networks have been recently proposed.
Ghorbani et al. (2019) proposed MGCN to extend the GCN model to multilayer networks. The method builds a GCN for each layer of the network, by utilizing only links between nodes of the same layer, while discarding the inter-layer relations. To consider inter-layer dependencies, the method uses an unsupervised term in the loss function, which measures the ability to reconstruct the network through the inner product of the embeddings. Our proposed method shares with MGCN the design for solving a semi-supervised classification problem where label information is smoothed over the graph structure via regularization, according to Kipf and Welling (2017). However, unlike MGCN, our proposed ML-GCN method is able to incorporate the inter-layer edges within the GCN propagation rules, as well as in the loss function.
While MGCN is an extension of GCN, the GrAMME method in Shanthamallu et al. (2020) extends GAT to multilayer networks. The peculiarity of this approach is the way the Q attention heads are combined: instead of concatenating or averaging them as suggested in Velickovic et al. (2018), GrAMME applies a mechanism called fusion-head, which consists of a weighted combination of the attention heads with learnable parameters. Specifically, in Shanthamallu et al. (2020) two approaches are developed, namely GrAMME-SG and GrAMME-Fusion. The former explicitly builds the inter-layer edges between each node in a layer and its counterpart in a different layer (referred to as pillar edges), and applies a series of GAT layers with the fusion-head method, exploiting the inter-layer dependencies. The GrAMME-Fusion approach deals with inter-layer dependencies in a different way, as it builds layer-wise attention models and introduces an additional layer that exploits inter-layer dependencies using only fusion heads. In the case of nodes with missing attributes, both methods employ random initialization (using a standard normal distribution). The empirical evaluation reported by the authors on several multiplex networks showed that the GrAMME-Fusion method performs better than GrAMME-SG.
Our proposed GAT extension to multilayer networks shares the multi-head attention mechanism with the GrAMME methods, although our approach is closer to GAT as it does not need the fusion-head strategy to integrate the inter-layer dependencies. More importantly, our approach involves both within-layer and outside-layer neighborhoods when computing the embedding of an entity in each layer, while GrAMME-SG involves only pillar edges (in addition to the local neighborhood) in the propagation rule, and GrAMME-Fusion integrates the inter-layer dependencies using only fusion heads. Note that, given their declared superiority with respect to convolutional approaches, in our experimental evaluation we have referred to the GrAMME methods as the main competitors of our proposed methods.
Proposed framework
Given a set \({\mathcal {V}}\) of N entities (e.g., users) and a set \({\mathcal {L}} = \{L_1, \cdots , L_\ell \}\) of layers (e.g., user relational contexts), with \(|{\mathcal {L}}| = \ell \ge 2\), we denote a multilayer network with \(G_{{\mathcal {L}}}=\langle V_{{\mathcal {L}}}, E_{{\mathcal {L}}}, {\mathcal {V}}, {\mathcal {L}}\rangle\), where \(V_{\mathcal {L}} \subseteq {\mathcal {V}} \times {\mathcal {L}}\) is the set of entity-layer pairings or nodes (i.e., to denote which users are present in which layers), and \(E_{{\mathcal {L}}} \subseteq V_{\mathcal {L}} \times V_{\mathcal {L}}\) is the set of undirected edges between nodes within and across layers.^{Footnote 1}
We represent a multilayer network by a set of adjacency matrices \({\mathcal {A}} = \{ {\mathbf {A}}_1, \cdots , {\mathbf {A}}_\ell \}\), with \({\mathbf {A}}_l \in {\mathbb {R}}^{n_l \times n_l}\) (\(l=1..\ell\)), where \(n_l=|V_l|\). Entities are assumed to be associated with features stored in layer-specific matrices \({\mathcal {X}} = \{ {\mathbf {X}}_1, \cdots , {\mathbf {X}}_\ell \}\), with \({\mathbf {X}}_l \in {\mathbb {R}}^{n_l \times f_l}\) and \(f_l\) the number of node features in the l-th layer. We will also use the symbol \({\mathbf {x}}_{(i,l)}\) to denote the feature vector of entity \(v_i\) in layer \(L_l\).
Note that in our multilayer network model there is no prior assumption about the set of valid couplings between the layers, nor about the structure of the layers. Indeed, our framework is theoretically able to consider networks with different coupling constraints between the layers, e.g., temporal networks or cross-platform networks.
It is also worth noticing that the sizes \(f_l\) may differ in principle; however, they must all be aligned to a common size, say f: layers with more than f features are truncated, and layers with fewer than f features are zero-padded. Moreover, to avoid numerical scaling issues, all feature matrices are assumed to be row-normalized within a common interval of values. Furthermore, in case no node attributes are available for \(G_{{\mathcal {L}}}\), each layer-specific feature matrix is assumed to be the identity matrix \({\mathbf {I}}_l \in {\mathbb {R}}^{n_l \times n_l}\). Also, for partially complete feature matrices, value imputation and matrix completion methods can be used; however, this goes beyond the scope of this work.
Node embedding in multilayer network. Given a multilayer network \(G_{{\mathcal {L}}} = \langle V_{{\mathcal {L}}}, E_{{\mathcal {L}}}, {\mathcal {V}}, {\mathcal {L}} \rangle\), we define the multilayer network embedding as the problem of learning low-dimensional latent representations for each node (i.e., entity-layer pair), that is, learning a function \(g: V_{{\mathcal {L}}}\mapsto {\mathbb {R}}^d\) that maps each node into a d-dimensional space, with \(d\ll N\), so that nodes that are similar in \(G_{{\mathcal {L}}}\) have embeddings close to each other.
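The feature alignment step can be sketched as follows. This is a minimal NumPy illustration that assumes row normalization means scaling each row to unit L1 norm (one common convention; the text does not specify the exact scheme). The helper names `align_features` and `row_normalize` are hypothetical.

```python
import numpy as np

def align_features(X, f):
    """Truncate or zero-pad the columns of X so that it has exactly f features."""
    n, f_l = X.shape
    if f_l >= f:
        return X[:, :f].copy()                       # truncation
    return np.hstack([X, np.zeros((n, f - f_l))])    # zero-padding

def row_normalize(X):
    """Scale each row to unit L1 norm; all-zero rows are left unchanged (assumed scheme)."""
    s = np.abs(X).sum(axis=1, keepdims=True)
    s[s == 0] = 1.0                                  # avoid division by zero
    return X / s

# Two layers with f_1 = 3 and f_2 = 1 features, aligned to a common size f = 2.
X1 = np.array([[1., 2., 3.],
               [0., 0., 0.]])
X2 = np.array([[4.],
               [6.]])
f = 2
X1_aligned = row_normalize(align_features(X1, f))
X2_aligned = row_normalize(align_features(X2, f))
```

After alignment every layer exposes the same number of columns, so a single weight matrix of shape (f, d) can be shared across all layers.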
The above definition resembles the classic one of node embeddings, with adaptation to multilayer networks. Moreover, to model similarity of nodes in the multilayer network, we follow the general idea adopted in representation learning on graphs, that is, node embeddings are generated based on neighborhoods, upon the intuition that nodes aggregate information from their neighbors by using a GNN.
However, a major question becomes how to consider a node's neighborhood in the multilayer network to properly generate the embeddings. In this regard, we notice that a major requirement for our proposed framework is to account for node links that are internal as well as external to a particular layer where the nodes occur. To this purpose, our key idea is to incorporate in the GNN propagation rules aggregation over node features (both topological and exogenous to the network, i.e., node attributes) that are computed not only w.r.t. the node's neighbors in the same layer but also w.r.t. the node's neighbors in the other layers.
In this regard, we define two functions, denoted as \(\Gamma\) and \(\Psi\), that for each entity-layer pair, i.e., node, return the neighborhood of the entity that is internal and external to that layer, respectively. Formally, given an entity \(v_i\) in a layer \(L_l\), we define the set of within-layer neighbors of \(v_i\) in layer \(L_l\) as:
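One set-builder formalization consistent with this description, written with the edge set \(E_{{\mathcal {L}}}\) and node set \(V_{{\mathcal {L}}}\) introduced earlier (the exact typesetting is an assumption), is:

\[
\Gamma (i,l) = \left\{ (v_j, L_l) \in V_{{\mathcal {L}}} \;\big |\; \big ( (v_i, L_l), (v_j, L_l) \big ) \in E_{{\mathcal {L}}} \right\} \qquad (6)
\]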
Similarly, we define the set of outside-layer neighbors of \(v_i\) in layers different from \(L_l\) as:
Figure 1 shows an illustration of a multilayer network that our framework is able to deal with: in fact, more generally than in multiplex networks, inter-layer edges can be formed to link not only nodes of the same entity but also nodes of different entities. Both types of inter-layer edges are indeed considered in our definition of outside-layer neighbors (cf. Eq. 7).
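Under the reading that outside-layer neighbors are the endpoints of inter-layer edges incident to the node (an assumption about the exact form, consistent with the surrounding description), the set can be written as:

\[
\Psi (i,l) = \left\{ (v_j, L_m) \in V_{{\mathcal {L}}} \;\big |\; m \ne l,\; \big ( (v_i, L_l), (v_j, L_m) \big ) \in E_{{\mathcal {L}}} \right\} \qquad (7)
\]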
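The two neighborhood functions can be illustrated with a small pure-Python sketch over an undirected edge list of (entity, layer) pairs. It assumes that outside-layer neighbors are the endpoints of inter-layer edges incident to the node; the function names and the toy network are hypothetical.

```python
def within_layer_neighbors(edges, node):
    """Gamma: neighbors of `node` that live in the same layer."""
    _, layer = node
    found = set()
    for u, v in edges:
        for a, b in ((u, v), (v, u)):        # edges are undirected
            if a == node and b[1] == layer:
                found.add(b)
    return found

def outside_layer_neighbors(edges, node):
    """Psi: neighbors of `node` that live in a different layer."""
    _, layer = node
    found = set()
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            if a == node and b[1] != layer:
                found.add(b)
    return found

# Toy multilayer network: nodes are (entity, layer) pairs.
edges = [
    (("u1", "L1"), ("u2", "L1")),   # intra-layer edge in L1
    (("u1", "L1"), ("u1", "L2")),   # inter-layer edge, same entity
    (("u1", "L1"), ("u3", "L2")),   # inter-layer edge, different entities
    (("u2", "L2"), ("u3", "L2")),   # intra-layer edge in L2
]

gamma_u1 = within_layer_neighbors(edges, ("u1", "L1"))
psi_u1 = outside_layer_neighbors(edges, ("u1", "L1"))
gamma_u3 = within_layer_neighbors(edges, ("u3", "L2"))
psi_u3 = outside_layer_neighbors(edges, ("u3", "L2"))
```

Here the node (u1, L1) has one within-layer neighbor and two outside-layer neighbors, one of which belongs to a different entity, which is the case that a strictly multiplex model cannot represent.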
Let us now consider the key constituents in our proposed GNN models, precisely a GCN model and a GAT model for multilayer networks. First, we denote with \({\mathbf {h}}^{(k)}_{(i,l)}\) the hidden state at the k-th layer of the neural network for entity \(v_i\) in layer \(L_l\), and with \({\mathbf {z}}_{(i,l)}={\mathbf {h}}^{(K)}_{(i,l)}\) the final embedding of entity \(v_i\) in \(L_l\), eventually used for a downstream task, such as entity classification. Using the message passing paradigm (Gilmer et al. 2017), we abstract the aggregation scheme of our framework in Eq. (8):
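A generic message passing form consistent with this description, with \(e = ((i,l),(j,m))\) denoting an edge incident to node \((i,l)\) and aggregation assumed to run over both neighborhoods, is:

\[
{\mathbf {h}}^{(k+1)}_{(i,l)} = \phi _{v} \left( {\mathbf {h}}^{(k)}_{(i,l)},\; \bigoplus _{(j,m) \in \Gamma (i,l) \cup \Psi (i,l)} \phi _{e} \left( {\mathbf {h}}^{(k)}_{(i,l)},\, {\mathbf {h}}^{(k)}_{(j,m)},\, {\mathbf {x}}_{e} \right) \right) \qquad (8)
\]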
where the \(\phi _{e}\) function, named message function, is defined edge-wise to generate messages across the edges by combining the edge properties \({\mathbf {x}}_{e}\) and the states of the edge's two end-nodes, i.e., \({\mathbf {h}}_{(i,l)}\) and \({\mathbf {h}}_{(j,m)}\); \(\phi _{v}\), named update function, is a node-wise function used to update the state of a node; \(\bigoplus\) is the aggregation (or reduce) operator, which is usually summation, or alternatively a pooling operator or even a neural network (Wang et al. 2020). Note that in our formulation a node updates its state considering both intra-layer and inter-layer dependencies within the aggregation stage, so that the embedding of each layer is related to the other layers. If the framework is instantiated based on a GAT approach, the message \(\phi _{e}\) of each edge \(((i,l),(j,m))\) received by node \(v_i\) corresponds to the normalized attention coefficient \(\alpha _{(i,l), (j,m)}\) multiplied by the hidden state of node \(v_j\).
Our contribution is to incorporate the above representation models into the propagation rules of both GCN and GAT frameworks, in order to make them suitable for the multilayer network context. Our resulting methods are dubbed ML-GCN and ML-GAT, respectively.
ML-GCN propagation rules. Given a node \(v_i\) in a layer \(L_l\), the first propagation rule is defined as:
where \({\mathbf {h}}_{(i,l)}^{(0)} = {\mathbf {x}}_{(i,l)}\), \(\sigma (\cdot )\) is the ReLU activation function, and \({\mathbf {W}}^{(0)}\) is the initial weight matrix of shape \((f, d)\), shared across all nodes of the multilayer graph. Note that the degree matrix \(\widetilde{{\mathbf {D}}}\) is built considering both interlayer and intralayer connections of nodes using the supra-adjacency matrix of the graph, which can be defined as:
where \({A}_{l,m}\) is an interlayer adjacency matrix built upon the interlayer connections between layer l and layer m (i.e., its entry is 1 if there exists an edge between (i, l) and (u, m) with \(l \ne m\), and 0 otherwise). Moreover, \(\widetilde{{\mathbf {D}}}_{ii}=\sum _{j=1}\widetilde{{\mathbf {A}}}^{\text {sup}}_{ij}\), where \(\widetilde{{\mathbf {A}}}^{\text {sup}}\) is the supra-adjacency matrix with self-loops added.
The above equation is then adapted to produce the propagation rule at the generic k-th layer of the GNN (with \(1 \le k < K\)):
Note that, for \(k=K-1\), the above equation produces the output feature vector for entity \(v_i\) in layer \(L_l\). Also, the weight matrix \({\mathbf {W}}^{(k)}\) has shape \((d, d)\), for every \(k \ge 1\).
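The following pure-Python sketch illustrates how a GCN-style propagation step can operate on a supra-adjacency matrix. It assumes the symmetric normalization of Kipf and Welling (2017) applied to the supra-adjacency matrix with self-loops, undirected interlayer edges, and dense matrices stored as lists; the helper names `supra_adjacency` and `gcn_layer` are hypothetical and the toy data are invented for illustration.

```python
import math

def supra_adjacency(intra, inter, sizes):
    """Place intra[l] on the diagonal blocks and inter[(l, m)] off the diagonal."""
    offsets = [sum(sizes[:l]) for l in range(len(sizes))]
    n = sum(sizes)
    A = [[0] * n for _ in range(n)]
    for l, block in enumerate(intra):
        for i, row in enumerate(block):
            for j, val in enumerate(row):
                A[offsets[l] + i][offsets[l] + j] = val
    for (l, m), block in inter.items():
        for i, row in enumerate(block):
            for j, val in enumerate(row):
                A[offsets[l] + i][offsets[m] + j] = val
                A[offsets[m] + j][offsets[l] + i] = val  # assumption: undirected edges
    return A

def gcn_layer(A_sup, H, W):
    """ReLU(D^-1/2 (A_sup + I) D^-1/2 H W), written with plain loops."""
    n = len(A_sup)
    A_tilde = [[A_sup[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # counts intralayer and interlayer edges
    d_in, d_out = len(W), len(W[0])
    out = []
    for i in range(n):
        agg = [0.0] * d_in
        for j in range(n):
            if A_tilde[i][j]:
                coef = A_tilde[i][j] / math.sqrt(deg[i] * deg[j])
                agg = [a + coef * x for a, x in zip(agg, H[j])]
        out.append([max(0.0, sum(agg[p] * W[p][q] for p in range(d_in)))
                    for q in range(d_out)])
    return out

# Two layers with two nodes each, plus one interlayer edge.
intra = [[[0, 1], [1, 0]], [[0, 1], [1, 0]]]
inter = {(0, 1): [[0, 1], [0, 0]]}
A_sup = supra_adjacency(intra, inter, sizes=[2, 2])
H0 = [[1.0], [1.0], [1.0], [1.0]]   # f = 1 input feature per node
W0 = [[1.0, -1.0]]                  # shape (f, d) = (1, 2)
H1 = gcn_layer(A_sup, H0, W0)
```

Stacking a second call with a weight matrix of shape \((d, d)\) would correspond to the case \(K=2\) used in the experiments.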
MLGAT propagation rules. Given a node \(v_i\) in a layer \(L_l\), the first propagation rule is defined as:
where \(\alpha _{((i,l),(j,m))}\) is the normalized attention coefficient for any edge \(((i,l), (j,m)) \in E_{{\mathcal {L}}}\). Note that we integrate the attention mechanism both on intralayer and interlayer edges, so that our model can selectively integrate the information received from the interlayer and intralayer neighbors.
Similarly to the solution proposed for MLGCN, the initial propagation rule equation is generalized for any k-th layer of the GNN (with \(1 \le k < K\)). In addition, a mechanism of multi-head attention is used to stabilize the learning process of self-attention. Therefore, given Q attention heads, we define two variants of the generic propagation rule, where either the output features of the heads are concatenated:
or the heads are averaged before applying the activation function:
Above, \({\mathbf {h}}_{(j,m)}^{(q, k)}\) denotes the embedding of node \((j, m)\) for the q-th head of the k-th layer of the neural network. In our setting, we averaged the Q attention heads in order to reduce memory usage, and we set \(\sigma\) as the exponential linear unit (ELU) activation function: for positive values of the input b, the function simply outputs b, whereas if the input is negative, the output is \(\exp (b)-1\).
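The averaged multi-head variant can be sketched as follows. The fragment assumes that the raw attention scores are passed through a LeakyReLU and normalized by a softmax over all neighbors of a node (as in the original GAT of Velickovic et al. 2018), and that the neighbor states have already been multiplied by the weight matrix; all function names and toy values are illustrative assumptions.

```python
import math

def leaky_relu(x, beta=0.2):
    return x if x > 0 else beta * x

def elu(b):
    return b if b > 0 else math.exp(b) - 1.0

def attention_coefficients(raw_scores):
    """Softmax over the scores of all neighbors of one node, intralayer and interlayer alike."""
    top = max(raw_scores)
    exps = [math.exp(s - top) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]

def averaged_heads_update(states_per_head, alphas_per_head):
    """Average the Q attention-weighted neighbor sums, then apply ELU."""
    Q = len(alphas_per_head)
    d = len(states_per_head[0][0])
    mean = [0.0] * d
    for states, alphas in zip(states_per_head, alphas_per_head):
        for alpha, h in zip(alphas, states):
            mean = [acc + alpha * x / Q for acc, x in zip(mean, h)]
    return [elu(x) for x in mean]

# One node with two neighbors and Q = 2 heads.
raw = [leaky_relu(1.0), leaky_relu(-5.0)]   # [1.0, -1.0]
alphas = attention_coefficients(raw)
states = [[1.0, -1.0], [1.0, -1.0]]         # assumed already transformed by W
h_new = averaged_heads_update([states, states], [alphas, alphas])
```

Averaging keeps the output dimension at d regardless of Q, whereas concatenation would produce Q times d features per node.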
Entity classification. As downstream task, we consider the problem of classifying the entities of a multilayer network graph in a transductive setting: class labels are available only for a small subset of entities, but the whole network graph, containing both labeled and unlabeled data, is used during the learning process, and the goal of the trained model is to predict the labels of the unlabeled entities. As previously discussed in the "Introduction" section, this setting complies with the realistic assumption of lack of knowledge on a target concept corresponding to the class, for most of the nodes in a network. However, the transductive setting is also challenging, as it requires our GNN-based learning framework to learn representations not only of nodes with labels but also of nodes without labels. We define this type of (multi-class) classification task with partial supervision, i.e., semi-supervised classification task, as follows:
Problem 1
(Entity classification in multilayer network) Given a multilayer network \(G_{{\mathcal {L}}} = \langle V_{{\mathcal {L}}}, E_{{\mathcal {L}}}, {\mathcal {V}}, {\mathcal {L}}\rangle\), and associated input feature matrix \({\mathcal {X}}\), let \(\mathbf {{Y}} \in {\mathbb {R}}^{N \times C}\) denote the binary matrix storing the class labels assigned to each entity in \({\mathcal {V}}\), where C is the number of classes in a predetermined set. Given a small subset of entities \({\mathcal {V}}_{train} \subset {\mathcal {V}}\) that we refer to as training set, we denote as \({\mathbf {Y}}_{train} \in {\mathbb {R}}^{|{\mathcal {V}}_{train}| \times C}\) the corresponding class label matrix. The goal is to predict the label of the entities in \({\mathcal {V}} {\setminus } {\mathcal {V}}_{train}\) using both the multilayer graph structure based on the supra-adjacency matrix and the entity features stored in \({\mathcal {X}}\). That is, we want to obtain the probability distribution matrix \({\widehat{\mathbf {Y}}} \in {\mathbb {R}}^{N \times C}\) so as to derive \({\widehat{\mathcal {Y}}}=\underset{c}{\arg \max }\, {\widehat{\mathbf {Y}}}\), which assigns class labels to the entities in \({\mathcal {V}} {\setminus } {\mathcal {V}}_{train}\).
In order to predict the class label for each entity \(v \in {\mathcal {V}}\), we combine the node embeddings obtained from each layer. Given \({\mathbf {Z}}_{l} \in {\mathbb {R}}^{n_l \times d}\) as the learned embeddings for layer \(L_l\), we obtain the entity representation through the following cross-layer aggregation:
where \(\widetilde{{\mathbf {Z}}} \in {\mathbb {R}}^{N \times d}\) is the final entity embedding matrix, and \(\mu \in {\mathbb {R}}^{\ell }\) is a vector of nonnegative values that can be either predetermined (e.g., uniform distribution, or any arbitrary user-provided distribution), or learned during the training through an optimization procedure. In our setting, we used the configuration with trainable parameters, as described in the following paragraph on parameter learning.
Next, the matrix \(\widetilde{{\mathbf {Z}}}\) is provided as input to a feedforward neural network, whose output is a matrix \({\widehat{\mathbf {Z}}} \in {\mathbb {R}}^{N \times C}\). The final step consists in applying the row-wise softmax function to \({\widehat{\mathbf {Z}}}\) to yield the class prediction of each entity. Algorithm 1 sketches the pseudocode of our classification framework, whereas Fig. 2 depicts an illustration of the framework.
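The prediction pipeline above can be illustrated with a short Python sketch. It assumes that the cross-layer aggregation is a \(\mu\)-weighted sum of the per-layer embeddings and that an entity absent from a layer contributes nothing for that layer; both points are assumptions, since the exact form is given by Eq. (15). For brevity, the feedforward network is skipped and the softmax is applied directly to the aggregated embeddings.

```python
import math

def cross_layer_aggregate(layer_embeddings, mu, num_entities):
    """layer_embeddings[l] maps an entity index to the embedding of its node in layer l."""
    d = len(next(iter(layer_embeddings[0].values())))
    Z = [[0.0] * d for _ in range(num_entities)]
    for weight, emb in zip(mu, layer_embeddings):
        for entity, z in emb.items():
            Z[entity] = [acc + weight * x for acc, x in zip(Z[entity], z)]
    return Z

def row_softmax(logits):
    out = []
    for row in logits:
        top = max(row)
        exps = [math.exp(x - top) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Entity 0 appears in both layers, entity 1 only in the first one.
layers = [{0: [2.0, 0.0], 1: [0.0, 4.0]}, {0: [0.0, 2.0]}]
mu = [0.5, 0.5]                      # uniform initialization, 1 / number of layers
Z_tilde = cross_layer_aggregate(layers, mu, num_entities=2)
probs = row_softmax(Z_tilde)         # stand-in for feedforward network + softmax
predicted = [row.index(max(row)) for row in probs]
```

When \(\mu\) is trainable, the weights in `mu` would be updated by the optimizer together with the other parameters rather than being fixed as here.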
Parameter learning. With the exception of the cross-layer parameters \(\mathbf {\mu }\), which are initialized by a uniform distribution (i.e., \(\mu _{l}=\frac{1}{\ell }\), for all \(l\!=\!1 \dots \ell\)), all other parameters are initialized through Glorot (also known as Xavier) initialization (Glorot and Bengio 2010); this initialization technique is widely used in deep learning tasks and has proven to be effective in several applications (Mishkin and Matas 2016). Particularly, the trainable weight matrix \({\mathbf {W}}\) in MLGAT and MLGCN is subjected to Glorot initialization with normal distribution and with uniform distribution, respectively. Moreover, in MLGAT, each layer of the multilayer network is also parametrized with a weight vector \(\mathbf {{\tilde{a}}} \in {\mathbb {R}}^{2d}\) of the single-layer feedforward neural network used as the attention mechanism (according to Velickovic et al. 2018).
All the above parameters, jointly with those of the neural network downstream of the cross-layer aggregation (cf. Eq. 15), are then updated during the training through an optimization strategy. That is, after the forward step of our learning framework, we calculate the loss function over all labeled examples \({\mathcal {V}}_{train}\). Then the gradient of the loss with respect to all parameters is calculated, and the parameters are finally updated through the optimization strategy (cf. "Experimental evaluation" section). We use the cross-entropy as supervised loss function shown in Eq. (16):
Note that, like Kipf and Welling (2017), we avoid explicit graph-based regularization in the loss function, while our GNN models are learned through the supra-adjacency matrix of the multilayer network, which allows the models to learn representations for the node instances of unlabeled entities.
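A minimal sketch of a cross-entropy loss restricted to the labeled entities is given below. It assumes the standard form, i.e., the negative log-probability of the true class summed over the training entities; whether the sum is further averaged is not specified here and is left out. The function name and the toy matrices are illustrative.

```python
import math

def cross_entropy(Y, Y_hat, train_idx):
    """Negative log-likelihood summed over the labeled (training) entities only."""
    loss = 0.0
    for v in train_idx:
        for y, p in zip(Y[v], Y_hat[v]):
            if y:
                loss -= math.log(p)
    return loss

Y = [[1, 0], [0, 1], [1, 0]]                      # one-hot class labels
Y_hat = [[0.5, 0.5], [0.25, 0.75], [0.1, 0.9]]    # predicted class probabilities
loss = cross_entropy(Y, Y_hat, train_idx=[0, 1])  # entity 2 is unlabeled and ignored
```

Unlabeled entities do not contribute to the loss, yet their nodes still influence the predictions of labeled entities through the propagation over the supra-adjacency matrix.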
In order to show the outcomes of the representation learning process, a two-dimensional projection of the embeddings produced by the proposed approaches is reported in Fig. 3 (MLGAT) and Fig. 4 (MLGCN). The representation is obtained via the t-distributed stochastic neighbor embedding (t-SNE) method, which is a widely used nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space (van der Maaten and Hinton 2008). More specifically, the plots show the embeddings produced by t-SNE on \(\widetilde{{\mathbf {Z}}}\) for the training entities (\(25\%\) of total entities), i.e., the entity representation obtained through the cross-layer aggregation defined in Eq. (15). The embeddings refer to the Koumbia2 network including its real-world node-attributes (cf. "Data" section); different colors in the figures correspond to the two different node labels. The left side of the figures shows the embeddings obtained after one training epoch of tSNE, while the right side shows the final embeddings obtained after 1000 training epochs, extracted downstream of the second (i.e., last) hidden layer. It is easy to see that the embedding is already significant after only one training epoch, with a relatively good separation between the two classes (slightly more evident for MLGAT). The embeddings get clearly better after 1000 training epochs, with an evident separation between the representations of entities belonging to the two classes.
Experimental evaluation
We evaluated our framework for semi-supervised entity classification tasks on several real-world multiplex and multilayer networks from different domains.
In the following we provide details on the evaluation network datasets, on the experimental settings of our proposed methods, and on those of competitors (GrAMMESG and GrAMMEFusion (Shanthamallu et al. 2020)) and baselines (GCN and GAT).
Data
We considered 9 network datasets, from which we derived a total of 19 networks for evaluation (plus their monoplex flattened versions). These datasets come from different domains and present very different structural characteristics. Moreover, all datasets are publicly available and most of them have been previously exploited as benchmarks for a variety of network analysis tasks, including node classification and link prediction. This is a major criterion for our choice, since it allows comparison of our results with the ones reported in previous literature. Eight of these datasets are originally provided as multiplex networks, i.e., networks in which interlayer connections are coupling edges only, connecting a node and its counterparts in other layers. These networks also do not provide attributes associated to the entities. Therefore, in order to stress the ability of our framework to deal with generic multilayer attributed networks, we introduced the Koumbia network dataset (Interdonato et al. 2020), which comes with unconstrained interlayer connections and real-world properties associated to the entities. In the following we briefly describe our evaluation network datasets.
Balance (Siegler 1976) models psychological experimental results on a set of individuals. According to Shanthamallu et al. (2020), four attributes characterize the subjects (the left weight, the left distance, the right weight, and the right distance). The classes correspond to the balance scale of the subjects (tip to the right, tip to the left, or being balanced).
CKMSocial (Coleman et al. 1957) contains social information from physicians (entities) in four towns in Illinois: Peoria, Bloomington, Quincy and Galesburg. It consists of 3 directed layers generated from different sociometric matrices, where the cities are used to label the entities.
Congress (Schlimmer 1987) models the results of bills obtained from the 1984 United States Congressional Voting Records Database. The network has 16 layers corresponding to votes, where for each layer two congressmen are linked if they voted the same. Each congressman is labeled as either democrat or republican.
DKPol (Magnani et al. 2021) (Dansk Politik) is a network with three types of online relations between Danish Members of the Parliament on Twitter. It comes with a ground truth corresponding to affiliations to 10 political parties, which we used as labels.
LeskovecNg (Chen and III 2016) is a 4-layer temporal collaboration network, which contains coauthors of Prof. Jure Leskovec or Prof. Andrew Ng over 20 years, partitioned into 4 different 5-year intervals. Entities are researchers, and on each layer there is an edge between two researchers if they coauthored at least one paper in the 5-year interval. Each researcher is labeled as Leskovec's collaborator or Ng's collaborator.
Starwars^{Footnote 2} is comprised of 6 layers, each corresponding to interactions between Starwars characters in the first 6 episodes of the saga. We manually labeled each character as male, female or droid and used this information as entity labels. Note that the resulting class distribution is very unbalanced as there are 76 males, 12 females and 4 droids.
Terrorist (Everton 2012) models interactions between 79 terrorists drawn from the Noordin's Network dataset. Similarly to Liu et al. (2017), we built a 4-layer network from the following relation types: trust, communication, operational, business and financial ties. We derive two different types of entity labels: (i) 2-class labels corresponding to membership of an individual to the Noordin's splinter group (member or non-member), and (ii) 3-class labels corresponding to the current state defined as the physical condition of the individual (dead, alive, jail). The two versions are dubbed TerroristNoordin and Terroriststatus, respectively.
Vickers (Vickers and Chan 1981) is a 3-layer directed multiplex network modeling the social relations between 29 seventh grade students in a school. We use the gender as entity labels; there are 12 boys and 17 girls.
Koumbia (Interdonato et al. 2020) is a multilayer network extracted from a Sentinel-2^{Footnote 3} satellite image time series, centered on an agricultural landscape in the Koumbia area in Burkina Faso. In this dataset, entities represent segments of the satellite image, and classes correspond either to crop (i.e., segments containing pixels related to cultivated area) or no-crop (i.e., segments containing pixels related to uncultivated areas, such as forests) segments. The network is originally associated with interlayer edges and real-world attributes for the entities, corresponding to the segment statistics of the radiometric bands of the satellite images. We created the input feature matrix by concatenating the average values of ten different radiometric bands for 21 timestamps, obtaining a feature vector of size 210 for each entity. Fig. 5a displays the feature distribution obtained by linearizing the input feature matrix. Note that the geo2net framework^{Footnote 4} presented in Interdonato et al. (2020) is designed to build a multilayer network with an arbitrary number of layers, which model the association of nodes to an arbitrary number of functional classes (e.g., temporal radiometric profiles) by producing fuzzy layer memberships using the fuzzy c-means algorithm (Ross 2009). In our case study, we exploit this functionality in order to take into account versions of the network with a varying number of layers (i.e., 2, 5, 10, 15, 20). We denote each of these networks as Koumbial, where the trailing l stands for the number of layers (e.g., Koumbia5 denotes the version with 5 layers). Figure 5 shows the Koumbia2 (b) and Koumbia5 (c) graphs, whereas Table 4 reports detailed information about interlayer edges. Since the competitors are specifically designed for multiplex networks, for a fair comparison with those methods we will also use a multiplex version of this dataset (i.e., obtained by discarding interlayer connections other than coupling ones), named Koumbialmpx.
Table 2 summarizes the structural properties of all the network datasets, plus the multiplex versions of Koumbia. For the latter, interlayer edge information is reported in Table 4. Note that the graph homophily in Table 2 corresponds to the fraction of edges in a graph connecting nodes with the same class label (Zhu et al. 2020), formally \({|\{(v_i,v_j): (v_i,v_j) \in E \wedge y_i = y_j\}|}/{|E|},\) with \(v_i, v_j \in V\) and \(y_i\), \(y_j\) class labels of \(v_i\), \(v_j\), respectively. This statistic was calculated excluding the interlayer edges, which otherwise would lead to biased homophily scores. Also, average entity frequency corresponds to the fraction of layers on which each entity appears, averaged over all entities. Moreover, Table 3 summarizes information on the monoplex, flattened versions of the network datasets, which will be used for evaluation of baseline methods designed for monoplex networks.
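The two statistics just defined can be computed as in the following sketch, where the edge list is assumed to contain intralayer edges only and each layer is represented by the set of entities appearing in it. The function names and the toy data are illustrative assumptions.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two end nodes share the same class label."""
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

def average_entity_frequency(layer_nodes, num_entities):
    """layer_nodes[l] is the set of entities that appear in layer l."""
    num_layers = len(layer_nodes)
    counts = [sum(1 for nodes in layer_nodes if e in nodes) for e in range(num_entities)]
    return sum(c / num_layers for c in counts) / num_entities

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]        # intralayer edges only
labels = {0: "a", 1: "a", 2: "b", 3: "b"}
homophily = edge_homophily(edges, labels)

layer_nodes = [{0, 1, 2, 3}, {0, 1}]            # entities 2 and 3 are missing in layer 2
frequency = average_entity_frequency(layer_nodes, num_entities=4)
```

In this toy case two of the four edges connect same-label nodes, and two of the four entities appear in only one of the two layers.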
Experimental settings
We conducted all experiments under a transductive learning setting whereby, given an input multilayer network, only a fixed portion of the set of entities for each class was used as labeled data for the training of a GNN model. Recall that, due to the transductive setup, the learning process is nonetheless able to use all node attributes and topological information. For those networks without external information, node attributes were initialized by sampling each attribute from either a Gaussian, an Exponential, or a Uniform distribution; more precisely, for a given choice of the number f of node attributes, either we randomly generated all attribute values from a Gaussian distribution, or we randomly generated one third each of the attributes from Gaussian, Exponential and Uniform distributions.
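The two attribute initialization schemes can be sketched as follows. The distribution parameters (zero mean and unit variance, unit rate, unit interval), the assignment of the remainder columns to the Uniform group when f is not a multiple of three, and the function name are all assumptions made for illustration.

```python
import random

def init_attributes(num_nodes, f, mixed=False, seed=0):
    """Return a num_nodes x f matrix of synthetic node attributes."""
    rng = random.Random(seed)
    samplers = [
        lambda: rng.gauss(0.0, 1.0),      # Gaussian (assumed standard normal)
        lambda: rng.expovariate(1.0),     # Exponential (assumed rate 1)
        lambda: rng.uniform(0.0, 1.0),    # Uniform (assumed on [0, 1])
    ]
    third = f // 3
    X = []
    for _ in range(num_nodes):
        row = []
        for a in range(f):
            # All-Gaussian scheme unless the mixed scheme is requested.
            kind = min(a // third, 2) if mixed and third > 0 else 0
            row.append(samplers[kind]())
        X.append(row)
    return X

X_gauss = init_attributes(num_nodes=5, f=64)
X_mixed = init_attributes(num_nodes=5, f=64, mixed=True)
```

With f = 64, the mixed scheme yields 21 Gaussian, 21 Exponential and 22 Uniform columns under the remainder convention assumed above.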
We used two main settings for the training set size, namely 25% and 5% of the set of entities (we will refer to the setting with 25% of training set size unless otherwise specified). All GNNs were trained using the Adam optimization algorithm (Kingma and Ba 2017) with full batch size, for either 1000 epochs or, when the early-stopping regularization technique was used (with patience value of 50 epochs), until convergence, with learning rate set to 0.005, L2 weight regularization set to 0.0005, and dropout regularization technique with \(p=0.6\) applied to the hidden layers and to the normalized attention coefficients of GCN and GAT based methods, respectively. Note that this introduces stochasticity since we sample the within-layer and outside-layer neighborhood of each node. It should also be noted that, since all our evaluation networks fit into the GPU memory, we performed full batch training (Kipf and Welling 2017, 2016), where the parameters are updated after processing the whole network. Also, regarding the cross-layer aggregation (cf. Eq. 15), the parameters \(\mu\) were initialized with uniform distribution. Both our MLGAT method and GAT use multi-head attention with \(Q=2\) attention heads for each layer, where the heads are averaged in order to save memory resources. Furthermore, the negative slope \(\beta\) for the LeakyReLU function was set to 0.2 (cf. "Background" section). We set \(K=2\) with \(d=32\) features, and the number of input features to \(f=64\).
Concerning the setting of baseline methods, the original GAT and GCN methods were trained over the flattened networks (cf. Table 3), since they were conceived for single-layer graphs. Moreover, since the GrAMME (Shanthamallu et al. 2020) framework can deal with multi-relational/multiplex networks only, in our comparative experiments we used the multiplex versions of the Koumbial networks, named \(Koumbia\_ mpx\), thus discarding interlayer edges connecting nodes representing different entities.
For GrAMMESG and GrAMMEFusion, we set learning rate to 0.01, \(f=64\), \(d=32\), 5 fusion heads for GrAMMEFusion, and used dropout regularization technique.
Note that we used the publicly available software implementations for all the competitors.^{Footnote 5}
Results
We present the results of the experimental evaluation of the proposed framework, which are organized into two evaluation stages:

1. In "Evaluation with competing methods" section, we compare the proposed methods and competitors (i.e., GrAMMESG, GrAMMEFusion, GAT and GCN) on the multiplex networks introduced in "Data" section, in terms of prediction performance, impact of the setting of the training set size and of the early-stopping technique, and impact of the initial input node-features. We also analyze the training execution times of the methods, and we discuss aspects of their computational complexity.
2. In "Evaluation on real-world node-attributes and arbitrary interlayer edges: the Koumbia multilayer network testbed" section, we conduct a thorough analysis of the Koumbia dataset, focusing on the impact of using real node-features. We focus on this network since it is the only one including real-world attributes for the entities and arbitrary interlayer edges, thus allowing us to evaluate to what extent the proposed framework is able to exploit such characteristics.
Evaluation with competing methods
Table 5 reports the average accuracy scores achieved by the proposed MLGAT and MLGCN approaches, and the competing methods (GrAMMESG, GrAMMEFusion, GAT and GCN). For each network and method, the average accuracy was computed over 10 independent runs, where each run corresponded to a different train-test split, with 25% of training entities, and the input features of the entities were randomly initialized with a normal distribution for all the networks.
At first glance, our proposed methods are able to achieve very high, or even nearly optimal accuracy, on most networks. This also mostly holds for GrAMMEFusion and, to a lesser extent, for GrAMMESG. Moreover, the results obtained by the baselines on the corresponding monoplex versions of the network datasets reveal some situations in which the flattening process is actually beneficial for the entity classification task: particularly, as can be observed in Fig. 6, CKMSocial and LeskovecNg layers are mostly structured with disconnected components, where each component in a layer contains nodes belonging to a specific label, so that the monoplex flattened network can better contextualize each node with respect to neighbors of different classes. Note also that the attention-based approaches (i.e., MLGAT and GrAMME) do not suffer from the same issue, and obtain performances close to or even better than the ones of the baselines.
By contrast, multilayer approaches always obtain the best performances on networks with very high values of average entity frequency (i.e., the average percentage of layers on which each entity appears, cf. Table 2), such as Vickers (MLGAT), Congress (GrAMMEFusion) and Balance (MLGCN); an exception is represented by CKMSocial, which is explained by the fact that in this network the nodes in every layer are associated with the same label. The case of Balance is also interesting as it is the only one where MLGCN outperforms not only the competitors but also MLGAT (which is nonetheless the second best performer). This might be ascribed to the peculiar structure of the Balance network, where each layer is composed of 5 disjoint complete connected components and the various node labels are distributed over the components. In fact, MLGAT has worse performance than MLGCN for both the Balance and Congress networks, which are the two with the highest combination of average degree and average density (cf. Table 2). As already observed in Mohan (2021), for networks with high average degree and dense supra-adjacency matrix, GCN-based methods may perform better than random-walk based approaches and GAT based ones due to the stochasticity introduced by the attention mechanism. Furthermore, according to Zhu et al. (2020) and Qian et al. (2021), all models have relatively low performance on the network with the lowest homophily score, i.e., Terroriststatus (0.469). Likewise, all models have good performance on networks with strong homophily, such as LeskovecNg (0.994) and CKMSocial (1.0). On the other hand, note that multilayer models can still perform well on networks with lower homophily.
Finally, a major remark that stands out from Table 5 concerns the results obtained on Koumbia networks, for increasing number of layers. Note that Koumbia is the network dataset including the highest number of entities (2246), and that the Koumbia2 network has no edge overlap between the two layers (i.e., the node set on the two layers is completely disjoint). Notably, our MLGAT followed by MLGCN outperform all the competitors, with GrAMMESG performing even worse than baselines GAT and GCN. Note also that GrAMME proved to be sensitive to the number of layers, to the point that for 10 layers (i.e., Koumbia10mpx) both variants of our major competitor ran out of running time.
Impact of training set size and early-stopping
We investigated a more challenging scenario for the training of the GNN models under study, by using only 5% of the entities as training instances, and the remaining ones for testing. Results are reported in Table 6. As expected, the classification performances of the various methods tend to decrease in almost all the networks. Two particular situations occur with the Starwars and Terroriststatus networks: on the former, only GrAMMEFusion and GCN have worse performance, while on the latter, GrAMME methods even improve w.r.t. Table 5. These can be explained by the fact that both network datasets have a highly unbalanced distribution of class labels; therefore, for the least covered class (i.e., 'droid' for Starwars and 'dead' for Terroriststatus) the number of selected training instances does not significantly change as the percentage of training set size decreases from 25% to 5%. More interestingly, on the largest networks other than Koumbia, i.e., Congress and Balance, MLGCN and MLGAT outperform the other competing models. In general, it turns out that MLGCN and MLGAT tend to be less sensitive than the other methods when the percentage of training set size changes from 25% to 5%.
Another important aspect that we have not considered so far is the opportunity of using the early-stopping regularization which, with the use of a validation set, can be helpful to mitigate overfitting of the GNN models. To this purpose, we carried out a further stage of evaluation where each GNN model was equipped with early-stopping and a patience value of 50 epochs, i.e., the training of a GNN model was terminated if the validation accuracy had not increased for 50 consecutive epochs. Tables 7 and 8 show results corresponding to 25% and 5% of training set size, respectively. Considering first the effect of early-stopping with training set size of 25%, we observe that our MLGAT and MLGCN and their monoplex counterparts improve their performance w.r.t. the scenario without early-stopping (i.e., Table 5) in most of the networks. For instance, MLGAT and MLGCN increase their accuracy on Starwars, from 0.70 to 0.817 and from 0.714 to 0.815, on Terroriststatus, from 0.477 to 0.570 and from 0.502 to 0.545, and on CKMSocial, from 0.954 to 0.962 and from 0.824 to 0.921, respectively. Moreover, on Koumbia networks, MLGAT and MLGCN achieve comparable or even better results (i.e., MLGAT on Koumbia5mpx) than those obtained without early-stopping, while the monoplex counterparts, especially GAT, decrease their performance significantly. By contrast, GrAMME methods tend to benefit less from the use of early-stopping.
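The stopping criterion described above can be sketched in a few lines. In this illustration the training and validation step of each epoch is replaced by a precomputed sequence of validation accuracies, which is an assumption made to keep the example self-contained; the function name is hypothetical.

```python
def run_with_early_stopping(val_accuracies, max_epochs=1000, patience=50):
    """Stop when the validation accuracy has not increased for `patience` epochs."""
    best_acc, best_epoch, waited, epochs_run = float("-inf"), -1, 0, 0
    for epoch, acc in enumerate(val_accuracies):
        if epoch >= max_epochs:
            break
        epochs_run = epoch + 1
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0   # improvement: reset counter
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_acc, epochs_run

# Accuracy improves for three epochs, then plateaus below its best value.
history = [0.5, 0.6, 0.7] + [0.65] * 100
result = run_with_early_stopping(history)
```

In practice the parameters saved at the best epoch would be restored before evaluating on the test entities; that step is not shown here.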
Finally, from the comparison between Tables 6 and 8, corresponding to 5% of training set size, we draw remarks analogous to those discussed above for the scenario with 25% of training set size. Although the variations between results in Table 8 and corresponding results in Table 6 are in general relatively small, some cases are still remarkable, such as the improvement of MLGAT and MLGCN on the two Terrorist networks, Koumbia5mpx and Koumbia10mpx.
Impact of the attribute matrix
To better assess the robustness of our proposed methods, and also to gain insights into those cases in favor of monoplex-based baselines, we replicated the previous analysis by initializing the entity features with mixed distributions, following the other, more realistic approach described in "Experimental settings" section, i.e., one third of the attributes are modeled as normal distributions, one third as uniform distributions, and one third as exponential distributions.
Results of these experiments are reported in Table 9. While GAT and GCN are still the best performing methods on LeskovecNg and CKMSocial (due to the peculiar structural characteristics described in the above section, which make the monoplex version better suited for the task at hand than the multilayer network), it can be noted that their relative performances on DKPol are significantly decreased (about 0.79) w.r.t. the ones observed in Table 5; by contrast, on this network, MLGCN is the best performing method (0.82). Also, the accuracy of GAT (0.77) and GCN (0.67) worsens significantly for Vickers and Balance, respectively. By contrast, MLGCN achieves better accuracy than the corresponding values in Table 5 on DKPol, Congress, and Koumbia2mpx, while both MLGAT and MLGCN improve on Vickers and Starwars.
Overall, comparing the methods' accuracy values averaged over the networks, from Table 9 w.r.t. Table 5, GAT and GCN show a more evident percentage decrease (resp., about \(-\)2% and \(-\)1.5%) than MLGAT and MLGCN. To sum up, while the performances of the monoplex-based baselines change drastically in some cases, indicating more sensitivity to the characteristics of the node attributes, our proposed methods turn out to be more robust, even benefiting from a mixed, thus more realistic, distribution of values for the node attributes in some networks.
Computational complexity aspects and training time analysis
In this section, we first discuss the computational complexity of our methods, MLGAT and MLGCN, assuming a sparse supra-adjacency matrix and that the total number of nodes in a multilayer network is \({\mathcal {O}}(N \ell )\).
The time complexity of MLGCN with K layers results from the addition of two terms: the one corresponding to the propagation steps, which is \({\mathcal {O}}( K nonzero({\mathbf {A}}^{\text {sup}}) f)\), where \(nonzero({\mathbf {A}}^{\text {sup}})\) is the number of nonzero entries in the \({\mathbf {A}}^{\text {sup}}\) matrix, and the one corresponding to the feature transformation steps, which is \({\mathcal {O}}(K N \ell f^2 )\). Therefore, the total cost of MLGCN is \({\mathcal {O}}( K nonzero({\mathbf {A}}^{\text {sup}}) f + K N \ell f^2 )\). The time complexity of MLGAT also takes into account the computation of the attention coefficients. Given Q attention heads, the time complexity of MLGAT with K layers is \({\mathcal {O}}(KQN \ell f^2 + KQ|E_{{\mathcal {L}}}|f)\), where the first term concerns the feature transformation steps, and the second term corresponds to the cost of a general attention mechanism. Note that the computation of the attention coefficients can be parallelized both for the intralayer and interlayer edges, as well as the computation of the Q attention heads.
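To give a feel for the relative weight of the terms in the two bounds, the following sketch evaluates them as plain operation counts, ignoring the constants hidden by the asymptotic notation. The network sizes used in the example are hypothetical and do not correspond to any of the evaluation datasets.

```python
def mlgcn_cost(K, nnz_A_sup, N, num_layers, f):
    """Propagation term plus feature transformation term of the MLGCN bound."""
    propagation = K * nnz_A_sup * f
    transformation = K * N * num_layers * f ** 2
    return propagation + transformation

def mlgat_cost(K, Q, num_edges, N, num_layers, f):
    """Feature transformation term plus attention term of the MLGAT bound."""
    transformation = K * Q * N * num_layers * f ** 2
    attention = K * Q * num_edges * f
    return transformation + attention

# Hypothetical network: 100 entities, 3 layers, 1000 edges (2000 nonzero entries).
gcn_ops = mlgcn_cost(K=2, nnz_A_sup=2000, N=100, num_layers=3, f=64)
gat_ops = mlgat_cost(K=2, Q=2, num_edges=1000, N=100, num_layers=3, f=64)
```

For sparse networks such as this hypothetical one, the feature transformation term dominates both counts, so the cost grows mainly with the number of nodes and with the square of the feature size.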
Concerning the spatial complexity, modeling interlayer dependencies in the propagation rule has the overhead of storing the whole supraadjacency matrix. Moreover, we need to take into account the hidden states and the weight matrices. More precisely, the memory requirement during the training stage for MLGCN is \({\mathcal {O}}(Kf^2 + KNf)\), whereas for the multihead attention MLGAT, this cost is multiplied by a factor Q. Furthermore, the attention functions value requires an overhead of \({\mathcal {O}}(Q E_{{\mathcal {L}}})\). It is also worth noticing that, to improve scalability of our implementations, we could learn our models with minibatch training in combination with neighborhood sampling approaches (e.g., Hamilton etĀ al. 2018), which we leave it as for future work.
We now present the results reported in Table 10, which shows the training times obtained by the proposed MLGAT and MLGCN methods, compared to those of GrAMME-SG and GrAMME-Fusion. We observe that our methods are extremely efficient, especially MLGCN, with training times under 0.3 minutes on all networks. Remarkably, each of MLGAT and MLGCN shows similar training times across all networks, thus hinting at their scalability. Both our methods significantly outperform GrAMME-SG and GrAMME-Fusion: indeed, except for Vickers, the training time of the GrAMME methods is always one or two orders of magnitude higher than that of our methods, which also reveals scalability issues (training times range from the 0.3 minutes of Vickers to 52 minutes on Koumbia for GrAMME-SG and GrAMME-Fusion, respectively).
The advantage of our methods over the GrAMME ones is however quite surprising, since all the methods share a theoretical computational complexity that is linear in the number of nodes and in the number of edges of the multilayer network. In fact, as reported in Shanthamallu et al. (2020), while the cost of GrAMME-SG is actually more sensitive to the number of entities and layers in the network (i.e., linear in the number of entities and edges, but quadratic in the number of layers), the cost of GrAMME-Fusion, thanks to a simplified attention mechanism, is declared as linear in the number of entities and edges of the multilayer network, which is analogous to the cost of our methods. Therefore, we tend to ascribe the performance gap of the competitors to a less efficient implementation w.r.t. our methods, which were developed under the DGL framework, a widely recognized software tool for deep learning on graph data (Wang et al. 2020).
Evaluation on real-world node attributes and arbitrary interlayer edges: the Koumbia multilayer network testbed
A major strength of the proposed MLGAT and MLGCN approaches is that they are designed to deal with general multilayer networks, i.e., with arbitrary interlayer edges, and to exploit external information in the form of attributes associated with the entities. In this section, we present a further evaluation stage that aims to stress our methods by evaluating them on a real-world attributed multilayer network, i.e., the Koumbia multilayer network (Interdonato et al. 2020). By focusing on this network, we investigate the impact of using real-world attributes of the entities (cf. "Data" section and Fig. 5a) on the entity classification task in a practical application context. Moreover, based on the technique described in Interdonato et al. (2020), we take into account different versions of the network with varying number of layers (i.e., 2, 5, 10, 15, 20) and, since the networks include interlayer edges between each pair of layers, we also evaluate how the proposed approach is able to manage an increasing number of interlayer edges.
To this purpose, we compare the performance of our methods in three scenarios that differ in the input features associated with the Koumbia networks with different numbers of layers: the real attributes originally associated with the Koumbia entities, attributes in the form of an identity matrix, and attributes modeled as normal distributions. The three modalities will be denoted with the suffixes Fon, Foff, and normal, respectively. Note that the results of the experiments with the normal distributions differ from those reported in Table 5, since in that case the multiplex version of the network was taken into account (i.e., without considering interlayer edges).
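The three feature scenarios can be sketched as follows. This is only an illustrative construction: the function name, the dimensionality of the random features and the use of a standard normal distribution are assumptions, since the parameters of the normal distributions are not restated in this section.

```python
import numpy as np


def build_features(mode, num_entities, real_features=None, dim=64, seed=0):
    """Build the input feature matrix for one of the three scenarios.

    mode : "Fon"    -> real attributes of the entities
           "Foff"   -> identity matrix (one-hot encoding of each entity)
           "normal" -> features drawn from a normal distribution
                       (standard normal and `dim` are assumptions)
    """
    if mode == "Fon":
        if real_features is None:
            raise ValueError("real_features is required for mode 'Fon'")
        return np.asarray(real_features, dtype=float)
    if mode == "Foff":
        return np.eye(num_entities)
    if mode == "normal":
        rng = np.random.default_rng(seed)
        return rng.standard_normal((num_entities, dim))
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical usage with 5 entities described by 3 real attributes.
real = np.arange(15, dtype=float).reshape(5, 3)
x_on = build_features("Fon", 5, real_features=real)
x_off = build_features("Foff", 5)
x_normal = build_features("normal", 5, dim=8)
```

Note that in the Foff scenario the feature dimension grows with the number of entities, whereas it is fixed in the other two scenarios.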
Tables 11 and 12 show the accuracy and mean reciprocal rank (MRR), averaged over 20 runs, obtained on the Koumbia networks with Fon, Foff and normal types of attributes, for MLGAT and MLGCN, respectively. A detailed plot of the variations of accuracy with respect to the number of layers is also shown in Fig. 7. It can be noted that the Fon versions always obtain significantly better results than the Foff and normal ones for both methods, thus confirming that exploiting real-world node attributes is indeed beneficial for the entity classification task and, more importantly, that the proposed framework is able to correctly exploit such external information in the form of node attributes.
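For reference, the two evaluation criteria can be computed from a matrix of predicted class scores as in the sketch below. Here MRR is taken as the mean, over the test entities, of the reciprocal rank of the true class among the predicted class scores; this reading of MRR for a classification task, as well as the function names, are assumptions made for illustration.

```python
import numpy as np


def accuracy(scores, labels):
    """Fraction of entities whose top-scoring class is the true class."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))


def mean_reciprocal_rank(scores, labels):
    """Mean of 1 / rank of the true class, where rank 1 is the top score."""
    true_scores = scores[np.arange(len(labels)), labels]
    # Rank = 1 + number of classes scored strictly higher than the true one.
    ranks = 1 + np.sum(scores > true_scores[:, None], axis=1)
    return float(np.mean(1.0 / ranks))


# Toy example with 3 entities and 3 classes.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.5, 0.3]])
labels = np.array([0, 1, 1])
acc = accuracy(scores, labels)              # 2 of 3 correct
mrr = mean_reciprocal_rank(scores, labels)  # ranks are 1, 2, 1
```

Unlike accuracy, MRR gives partial credit when the true class is ranked second or third, which is why it is never lower than accuracy.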
MLGAT and MLGCN obtain similar performance on the Fon networks (accuracy around 0.94 and MRR around 0.97). The performance scores also show robustness with respect to the number of layers, and hence of interlayer edges, in the network. Note that the slightly lower performance obtained for Koumbia2 is actually not surprising, as in that case the two layers have disjoint node sets, which negatively affects the performance of the multilayer approaches.
It is also interesting to notice that, while MLGAT performs similarly on the Foff and normal versions (thanks to the attention mechanism), MLGCN shows significantly better performance on the normal version than on the Foff one. This result is in line with previous studies (Kipf and Welling 2017; Velickovic et al. 2018), and, in this specific case, it can also be explained by the fact that the normal distribution can be a relatively good approximation of the real one (cf. Fig. 5).
In order to further analyze the benefit of the attention mechanism exploited by MLGAT, against the convolutional approach used in MLGCN, we perform a further analysis stage, where we evaluate the performance of the two approaches w.r.t. an increasing number of hidden layers K in the neural network (not to be confused with the layers of the multilayer network). Figure 8 shows the accuracy achieved by MLGAT and MLGCN by increasing K, and with different dimensions of the embedding space, i.e., \(d \in \{32, 128\}\). It can be noted that, while for \(K \in \{1,2\}\) both methods obtain similar performance, MLGCN tends to decrease in accuracy for higher K values; in particular, MLGCN accuracy decreases by about 7% when increasing K from 2 to 3 with \(d=32\), and from 5 to 6 with \(d=128\). We tend to explain this behavior by the fact that a higher number of convolutional layers would smooth the difference between intralayer and interlayer neighborhoods, which hence might be treated equally in this process. Conversely, the attention mechanism in MLGAT is considerably more robust to this phenomenon, as revealed by the nearly constant performance of MLGAT even with high values of K.
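The smoothing effect hypothesized above can be illustrated with a deliberately simplified experiment: repeatedly averaging node features over neighborhoods, without learned weights or nonlinearities, shrinks the difference between the features of any two nodes. The sketch below uses a row-normalized adjacency matrix with self-loops on a toy graph; it is not the MLGCN propagation rule, and the graph, the features and the function name are assumptions.

```python
import numpy as np


def propagate(adj, features, num_steps):
    """Apply `num_steps` rounds of mean aggregation over neighborhoods
    (with self-loops) and record, after each round, the largest gap
    between the maximum and minimum value of any feature column."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    p = a_hat / a_hat.sum(axis=1, keepdims=True)  # row normalization
    h = features.copy()
    spreads = []
    for _ in range(num_steps):
        h = p @ h
        spreads.append(float(np.max(h.max(axis=0) - h.min(axis=0))))
    return h, spreads


# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge (2, 3).
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

# Each node starts with a one-hot feature identifying its triangle.
features = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)
_, spreads = propagate(adj, features, num_steps=6)
print(spreads)
```

The recorded spread never increases from one step to the next, since each step replaces every feature value with an average of existing values; after enough steps the two groups of nodes become hard to tell apart from their features alone.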
Conclusions: discussion and future work
We proposed a GNN framework for representation learning and semi-supervised classification in multilayer networks with attributed entities, and with an arbitrary number of layers and of intralayer and interlayer connections between nodes. We instantiated our framework through two new formulations of GAT and GCN models, specifically designed for such general, attributed multilayer networks. We evaluated our MLGAT and MLGCN methods on real-world network datasets coming from different domains and with different structural characteristics. Our results showed that MLGAT and MLGCN models are significantly faster learners than the competitors, and that they outperform both the competitors and the baseline methods in accuracy, especially on arbitrary multilayer networks with a large number of entities and layers. Furthermore, as demonstrated by the evaluation on Koumbia multilayer networks, derived from satellite images, our methods are able to take advantage of the presence of real attributes for the entities, in addition to arbitrary interlayer connections between the nodes in the various layers.
Comparing the GAT and GCN approaches, we observed that, unlike MLGCN, MLGAT performance is not affected when networks are structured as disconnected layers or when most layers tend to contain nodes of the same label. By contrast, MLGCN tends to be more robust than MLGAT when the network shows relatively high density and average degree.
Nevertheless, the approach of integrating within-layer and outside-layer neighborhoods, shared by both MLGCN and MLGAT, might not be well-suited to effectively learn from multilayer networks whose layers show different assortativity with respect to the entity class labels; e.g., a 2-layer network with gender as entity class, such that the first layer is assortative by gender and the second layer shows reverse assortativity by gender. To overcome this limitation, it would be interesting to revise the cross-layer aggregation component in terms of a GNN model as well, and investigate whether this approach would be more effective than simply weighing the embeddings from each particular layer of the network. As a related aspect, the above would also raise the opportunity of evaluating a transfer learning task across the layers of a network: for instance, given one or more selected layers, our methods would train a model on those layers, which would then be fine-tuned on other layers, e.g., for a task of node classification. Moreover, measuring the similarity between the different layers of a multilayer network (e.g., via the subspace alignment measure proposed in Qian et al. 2021) and, more in general, multilayer network simplification approaches (Interdonato et al. 2020) could be considered in order to improve the quality of the final embeddings by reducing the quantity of redundant or noisy content in each layer.
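The 2-layer example given above, with layers of opposite assortativity, can be made concrete with a simple per-layer diagnostic. The sketch below uses the edge homophily ratio, i.e., the fraction of intralayer edges that connect entities with the same class label, as a rough proxy for assortativity; this choice of measure, the function name and the toy data are assumptions made for illustration.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share the same class label."""
    if not edges:
        raise ValueError("the layer has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Toy 2-layer network with gender as the entity class label.
labels = {"a": "F", "b": "F", "c": "M", "d": "M"}
layers = {
    "layer1": [("a", "b"), ("c", "d")],              # same-gender links only
    "layer2": [("a", "c"), ("b", "d"), ("a", "d")],  # cross-gender links only
}
ratios = {name: edge_homophily(edges, labels) for name, edges in layers.items()}
print(ratios)
```

A large gap between the per-layer ratios, as in this toy case, would signal that aggregating the neighborhoods of all layers in the same way may mix contradictory evidence about the class of an entity.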
Further developments of our framework might concern two aspects. The first aspect refers to an extension to heterogeneous, attributed multilayer networks, which are rapidly gaining attention also thanks to a renewed interest in knowledge graphs in many application domains. The second aspect instead refers to the adaptation of our framework to inductive learning tasks, in order to generalize to unseen (portions of) graphs. This would also enable it to deal with dynamic networks, particularly for understanding the growth of a multilayer network in terms of changes in the status and properties of its entities and their connections, or for updating a GNN model on a time-evolving multilayer network without having to learn it from scratch at each new timestamp. In this regard, while the adaptation of the propagation rules of our methods is relatively easy to achieve for an inductive learning task, it would also be meaningful to introduce an unsupervised term in the loss function as a form of regularization to account for the structural information of the unseen portions of a network.
Availability of data and materials
Python code for the proposed methods, as well as the network datasets, will be made publicly available upon publication of the manuscript.
Notes
In this work we will use the term layer to denote either a constituent of a multilayer network or a constituent of the neural network model; the exact meaning is assumed to be clear from the particular context in which the term is used.
The GrAMME, GAT and GCN source codes are publicly available at https://github.com/udayshankars/, https://github.com/Diego999/pyGAT, and https://github.com/tkipf/pygcn, respectively.
Abbreviations
GNN: Graph Neural Network
GCN: Graph Convolutional Network
GAT: Graph Attention Network
MLGCN: Multilayer Graph Convolutional Network
MLGAT: Multilayer Graph Attention Network
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
References
Bronstein MM, Bruna J, LeCun Y, Szlam A, Vandergheynst P (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Process Mag 34(4):18–42
Chen P, Hero AO III (2016) Multilayer spectral graph clustering via convex layer aggregation. In: Proceedings of IEEE global conference on signal and information processing, pp 317–321
Coleman J, Katz E, Menzel H (1957) The diffusion of an innovation among physicians. Sociometry 20(4):253–270
Everton SF (2012) The Noordin top terrorist network. In: Disrupting dark networks. Structural analysis in the social sciences. Cambridge University Press, Cambridge, pp 385–396. https://doi.org/10.1017/CBO9781139136877.019
Gaito S, Interdonato R, Murata T, Sala A, Tagarelli A, Thai MT (2021) Introduction to the special section on reloading feature-rich information networks. IEEE Trans Netw Sci Eng. https://doi.org/10.1109/TNSE.2021.3073824
Ghorbani M, Baghshah MS, Rabiee HR (2019) MGCN: semi-supervised classification in multi-layer graphs with graph convolutional networks. In: Proceedings of IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp 208–211. https://doi.org/10.1145/3341161.3342942
Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: Proceedings of 34th international conference on machine learning, pp 1263–1272
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of thirteenth international conference on artificial intelligence and statistics, pp 249–256
Grover A, Leskovec J (2016) node2vec: scalable feature learning for networks. In: Proceedings of 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 855–864
Hamilton WL, Ying R, Leskovec J (2018) Inductive representation learning on large graphs. CoRR abs/1706.02216. arXiv:1706.02216
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Interdonato R, Atzmueller M, Gaito S, Kanawati R, Largeron C, Sala A (2019) Feature-rich networks: going beyond complex network topologies. Appl Netw Sci 4(1):4:1–4:13
Interdonato R, Gaetano R, Lo Seen D, Roche M, Scarpa G (2020) Extracting multilayer networks from Sentinel-2 satellite image time series. Netw Sci 8(S1):26–42. https://doi.org/10.1017/nws.2019.58
Interdonato R, Magnani M, Perna D, Tagarelli A, Vega D (2020) Multilayer network simplification: approaches, models and methods. Comput Sci Rev 36:100246
Kingma DP, Ba J (2017) Adam: a method for stochastic optimization. CoRR abs/1412.6980
Kipf TN, Welling M (2016) Variational graph auto-encoders. CoRR abs/1611.07308
Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: Proceedings of 5th international conference on learning representations (ICLR)
Kivelä M, Arenas A, Barthelemy M, Gleeson JP, Moreno Y, Porter MA (2014) Multilayer networks. J Complex Netw 2(3):203–271
LeCun Y, Bengio Y (1995) Convolutional networks for images, speech, and time-series. In: Arbib MA (ed) The handbook of brain theory and neural networks. MIT Press, Cambridge
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444. https://doi.org/10.1038/nature14539
Li J, Chen C, Tong H, Liu H (2018) Multi-layered network embedding. In: Proceedings of SIAM international conference on data mining (SDM), pp 684–692. https://doi.org/10.1137/1.9781611975321.77
Liu W, Chen PY, Yeung S, Suzumura T, Chen L (2017) Principled multilayer network embedding. CoRR abs/1709.03551
Magnani M, Hanteer O, Interdonato R, Rossi L, Tagarelli A (2021) Community detection in multiplex networks. ACM Comput Surv 54(3):48:1–48:35. https://doi.org/10.1145/3444688
Ma Y, Liu X, Shah N, Tang J (2021) Is homophily a necessity for graph neural networks? arXiv:2106.06134
Mishkin D, Matas J (2016) All you need is a good init. In: Proceedings of international conference on learning representations (ICLR). arXiv:1511.06422
Mohan APKV (2021) Temporal network embedding using graph attention network. Complex Intell Syst. https://doi.org/10.1007/s40747-021-00332-x
Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: online learning of social representations. In: Macskassy SA, Perlich C, Leskovec J, Wang W, Ghani R (eds) Proceedings of 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 701–710
Qian Y, Expert P, Rieu T, Panzarasa P, Barahona M (2021) Quantifying the alignment of graph and features in deep learning. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/tnnls.2020.3043196
Ross T (2009) Fuzzy logic with engineering applications, 3rd edn. Wiley, Hoboken. https://doi.org/10.1002/9781119994374
Schlimmer JC (1987) Concept acquisition through representational adjustment
Shanthamallu US, Thiagarajan JJ, Song H, Spanias A (2020) GrAMME: semi-supervised learning using multilayered graph attention models. IEEE Trans Neural Netw Learn Syst 31(10):3977–3988. https://doi.org/10.1109/TNNLS.2019.2948797
Siegler RS (1976) Three aspects of cognitive development. Cogn Psychol 8(4):481–520
van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30
Velickovic P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2018) Graph attention networks. In: Proceedings of 6th international conference on learning representations (ICLR)
Vickers M, Chan S (1981) Representing classroom social structure. Victoria Institute of Secondary Education, Melbourne
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11:3371–3408
Wang M, Zheng D, Ye Z, Gan Q, Li M, Song X, Zhou J, Ma C, Yu L, Gai Y, Xiao T, He T, Karypis G, Li J, Zhang Z (2020) Deep graph library: a graph-centric, highly-performant package for graph neural networks. CoRR abs/1909.01315
Wu Z, Pan S, Chen F, Long G, Zhang C, Yu PS (2021) A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 32(1):4–24. https://doi.org/10.1109/tnnls.2020.2978386
Xu K, Hu W, Leskovec J, Jegelka S (2019) How powerful are graph neural networks? In: Proceedings of 7th international conference on learning representations (ICLR)
Xu K, Li C, Tian Y, Sonobe T, Kawarabayashi KI, Jegelka S (2018) Representation learning on graphs with jumping knowledge networks. arXiv:1806.03536
Zhou J, Cui G, Zhang Z, Yang C, Liu Z, Wang L, Li C, Sun M (2019) Graph neural networks: a review of methods and applications. CoRR abs/1812.08434
Zhu J, Yan Y, Zhao L, Heimann M, Akoglu L, Koutra D (2020) Beyond homophily in graph neural networks: current limitations and effective designs. In: Proceedings of the annual conference on neural information processing systems (NeurIPS)
Acknowledgements
Not applicable.
Funding
This work was supported by the French National Centre for Space Studies (CNES), as part of the AMORIS (Analyse et MOdélisation des Réseaux complexes issus de l'Imagerie Satellitaire) project, APR TOSCA 2020.
Author information
Contributions
AT conceived the idea presented in this work. AT and LZ developed the theoretical definition of the methods. AT and RI defined the set of experiments to perform. LZ took care of running the experiments. RI, LZ, and AC performed evaluation of the results and related discussion. All authors participated in the writing process. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
All authors read and approved the final manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zangari, L., Interdonato, R., Calió, A. et al. Graph convolutional and attention models for entity classification in multilayer networks. Appl Netw Sci 6, 87 (2021). https://doi.org/10.1007/s41109-021-00420-4
DOI: https://doi.org/10.1007/s41109-021-00420-4
Keywords
 Graph neural networks
 Multilayer networks
 Entity classification