Graph convolutional and attention models for entity classification in multilayer networks

Graph Neural Networks (GNNs) are powerful tools that nowadays reach state-of-the-art performance in a plethora of different tasks such as node classification, link prediction and graph classification. A challenging aspect in this context is to redefine basic deep learning operations, such as convolution, on graph-like structures, where nodes generally have unordered neighborhoods of varying size. State-of-the-art GNN approaches such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) work on monoplex networks only, i.e., on networks modeling a single type of relation among a homogeneous set of nodes. The aim of this work is to generalize such approaches by proposing a GNN framework for representation learning and semi-supervised classification in multilayer networks with attributed entities, and an arbitrary number of layers and intra-layer and inter-layer connections between nodes. We instantiate our framework with two new formulations of GAT and GCN models, namely ML-GCN and ML-GAT, specifically devised for general, attributed multilayer networks. The proposed approaches are evaluated on an entity classification task on nine widely used real-world network datasets coming from different domains and with different structural characteristics. Results show that both our proposed ML-GAT and ML-GCN methods provide effective and efficient solutions to the problem of entity classification in multilayer attributed networks, being faster to learn and offering better accuracy than the competitors. Furthermore, results show how our methods are able to take advantage of the presence of real attributes for the entities, in addition to arbitrary inter-layer connections between the nodes in the various layers.

network following a transductive semi-supervised learning approach. In this context, class labels are known at training time only for a relatively small amount of nodes in the multilayer network, while all available structural information and node attributes can be exploited for learning, and the goal is to predict the labels of the unlabeled nodes.
2. We propose a representation learning and node classification framework based on GNN models and designed for arbitrary multilayer attributed networks. In accordance with the significant trend in the literature whereby graph convolutional and attention-based approaches are by far the most widely used, the core GNN component of our framework is instantiated both as GCN and as GAT.
3. Unlike existing GNN approaches for multiplex or multirelational graphs, we propose to aggregate topological neighborhood information from different layers directly into the propagation rule of the GNN component, i.e., during its forward learning phase, in order to make the embedding of an entity in a particular layer depend both on its neighbors in that layer (dubbed within-layer neighborhood) and on its neighbors located in other layers where the entity occurs (referred to as outside-layer neighborhood). Therefore, by K sequential applications of our multilayer-designed GNN components, the K-hop within-layer and outside-layer neighborhood structural information for each entity is incorporated into the embedding process.
4. Our designed GNN components are able to incorporate external information associated with the multilayer network, in the form of attributes that can be available at entity level or at node level for each particular layer of the input network.
5. Experimental evidence from widely used multiplex networks and from a real-world attributed multilayer network dataset has shown that both the GAT and the GCN instances of our framework represent effective and efficient solutions to the problem of entity classification in multilayer attributed networks. Our methods were also compared with two recently proposed methods for multirelational networks based on a GAT model, named GrAMME-SG and GrAMME-Fusion (Shanthamallu et al. 2020): our methods are able to achieve accuracy as good as or better than the competitors (up to 13% accuracy improvement), while outperforming them in terms of efficiency (with a training time that is two orders of magnitude lower than the GrAMME methods in most cases).

Background
In this section we provide background notions on deep learning approaches based on Graph Neural Networks (GNNs). For a better comprehension, we first introduce some preliminary notations. Preliminary notations. We are given a graph G = (V, E), where V is the set of nodes, with |V| = n, and E is the set of edges. Besides the adjacency matrix A that represents the graph structure, a further matrix X is provided in input, which stores the feature descriptions of the nodes, i.e., each node v_i is provided with a vector x_i ∈ R^f, where f is the number of input features. The general goal for GNNs is to learn a function that takes in input the above matrices and yields an output feature representation of the nodes Z = [z_1, ..., z_n]^T, where z_i ∈ R^d denotes the embedding or output feature vector for node v_i ∈ V and d is the size of the embedding space. Every GNN layer can be modeled as a non-linear function H^{(k+1)} = g(H^{(k)}, A), where k is an index for a neural network layer, H^{(0)} = X and H^{(K)} = Z, with K the total number of neural network layers. Table 1 summarizes the main notations that will be used throughout this paper.
Deep graph learning. Deep learning frameworks such as convolutional neural networks (CNNs) (LeCun and Bengio 1995), recurrent neural networks (RNNs) (Hochreiter and Schmidhuber 1997) and autoencoders (AEs) (Vincent et al. 2010) have been extremely successful in several machine learning tasks and for a variety of domains, including grid-structured data (e.g., images), sequences, and text data (LeCun et al. 2015). However, they cannot be straightforwardly applied to graph-structured data, because several operations (e.g., convolution) need to be revised to suit such a type of data. Indeed, graph-structured data exhibit complexities at different, and additional, levels compared with other types of data, including the lack of natural orderings of the nodes and/or edges, variability in the size and topology of a node's neighborhood, and the opportunity for modeling different types of node relations (Wu et al. 2021). In recent years, the trend of using deep learning techniques to analyze graphs has contributed to the birth of connectionist models called Graph Neural Networks (GNNs), which aim to extend deep learning to graph-structured data, exploiting its dependencies through a message passing scheme between the nodes (Zhou et al. 2019). In contrast to random walk approaches, which consider only nodes co-occurring in a random walk and optimize the embeddings to encode random walk statistics (Grover and Leskovec 2016; Perozzi et al. 2014), a GNN carries out a scheme in which each node iteratively combines both its neighbors' features and its own to obtain a new representation. After k iterations (i.e., at the k-th layer of the GNN), node representations have a non-linear relation with their k-hop neighborhood information. Interestingly, the neighborhood aggregation scheme is strictly connected to a random walk process: in fact, as studied in Xu et al.
(2018), in a K-layer GNN, the influence distribution of node v_i (i.e., how much a change in the initial features of any node v_j affects the final embedding of v_i) is equivalent, in expectation, to a random walk of length K starting from node v_i; therefore, the influence of v_j on v_i is proportional to the probability of visiting node v_j in a random walk of length K starting from node v_i. Moreover, it should be noted that using a high number of iterations (i.e., a large K) could lead to an over-smoothing problem, i.e., the representations of nodes could become very similar to one another after several iterations, as an effect of an overly expanded range of node influence.
GNNs are end-to-end trainable, i.e., they can be trained in a supervised or unsupervised manner depending on the task to be performed, and they are designed to compute the new embedding state using both structural information of the graph and properties of nodes and edges, through an iterative neighborhood properties aggregation scheme. This final embedding state can be used to produce an output such as the node labels, or even to obtain the representation of an entire graph through pooling, for example, by summing the representation vectors of all nodes in the graph (Xu et al. 2019).
Two of the most successful approaches are Graph Convolutional Networks (GCN) (Kipf and Welling 2016) and Graph Attention Networks (GAT) (Velickovic et al. 2018), both convolutional-style GNNs, but with different assumptions regarding the contribution of the neighborhood. More specifically, the former model adopts a spectral approach in which the convolution is defined by signal theory filters, while GAT aims to incorporate the attention mechanism in the propagation step, learning the importance of the neighborhood of each node through a masked self-attention strategy. In the following, we briefly review the above two approaches.
Graph convolutional network. A GCN is the counterpart of a convolutional neural network model for graph-structured data that uses a graph spectral approach to convolution. Specifically, it operates through a first-order spectral approximation of the graph by restricting the filters (limiting the order of the Chebyshev polynomial) to operate in the neighborhood at one step away from each node.
Equation (1) shows the propagation rule of a GCN layer, which is the building block of the model, as it aims to learn a function capable of generating new feature representations for each node v_i ∈ V by propagating and transforming its own features and those of its neighbors:

h_i^{(k+1)} = σ( Σ_{j ∈ Γ(i) ∪ {i}} ( Ã_ij / √(D̃_ii D̃_jj) ) h_j^{(k)} W^{(k)} ).    (1)

Above, Γ(i) denotes the set of neighbors of node i, σ(·) is a non-linear activation function (e.g., ReLU(·) = max(0, ·)), W^{(k)} is a layer-specific trainable weight matrix, and D̃_ii = Σ_j Ã_ij is the degree matrix derived from Ã = A + I_n, where A is the adjacency matrix of the input graph G, and I_n is the identity matrix of size n.
Note that self-loops are added to the graph, and the adjacency and degree matrices are built accordingly (this is known as renormalization trick), so that each node can also consider its own features, and potential numerical issues can be controlled; specifically, symmetric normalization is used because repeated applications of the propagation rule can lead to numerical instability and problems in the calculation of the gradient when used in the deep neural network.
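The renormalization trick described above can be made concrete with a minimal NumPy sketch (not the original implementation; matrix names and shapes are illustrative toy values):

```python
import numpy as np

def gcn_layer(A, H, W, activation=lambda x: np.maximum(0, x)):
    """One GCN propagation step with the renormalization trick:
    add self-loops (A_hat = A + I) and apply symmetric normalization
    D^{-1/2} A_hat D^{-1/2} before the linear transform and activation."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                        # self-loops
    d = A_hat.sum(axis=1)                        # degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))       # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # symmetrically normalized adjacency
    return activation(A_norm @ H @ W)

# Toy example: 3-node path graph, 2 input features, 2 output features
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.random.rand(3, 2)
W = np.random.rand(2, 2)
H1 = gcn_layer(A, X, W)
```

With ReLU as the activation, each row of `H1` is a non-negative combination of the (normalized) neighborhood features, as in Eq. (1).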
GCN plays a central role in building many complex GNN models, which also include unsupervised learning architectures such as Graph Auto-Encoders (GAEs) (Kipf and Welling 2016; Zhou et al. 2019). A GAE is a framework for unsupervised learning that leverages GCN to encode node information into low-dimensional vectors. Specifically, the encoder that computes the embeddings consists of two GCN layers with a non-linear activation function, whereas the decoder, which reconstructs the original adjacency matrix from the node embeddings, is a simple inner-product decoder. The model is hence trained by minimizing the reconstruction error between the original adjacency matrix and the reconstructed one.
Graph attention network. Graph Attention Network (GAT) (Velickovic et al. 2018) is a graph neural network architecture that uses the attention mechanism to learn weights between connected nodes. In contrast to GCN, which uses predetermined weights for the neighbors of a node corresponding to the normalization coefficients described in Eq. (1), GAT modifies the aggregation process of GCN by learning the strength of the connection between neighboring nodes through the attention mechanism (Wu et al. 2021).
The building block is a Graph Attention Layer (GAT layer) which generalizes the attention model on graph structured data and is agnostic of the particular choice of attention mechanism. Stacking GAT layers several times allows one to develop deep neural network architectures.
In order to learn the weighting factor of each node's features, attention coefficients are computed based on the features of the connected nodes using a function a: R^d × R^d → R. Equation (2) indicates the importance of node j's features to node i:

e_ij = a(h_i, h_j).    (2)

The graph structure is taken into account since, for each node v_i, only its neighborhood is considered, performing a masked attention mechanism. In Velickovic et al. (2018) the attention mechanism is a feed-forward neural network, which utilizes the non-linear activation function LeakyReLU: instead of outputting 0 for all negative values as ReLU does, LeakyReLU outputs βx for any negative input x (where β is a hyperparameter that determines the amount of leak, usually set between 0.01 and 0.2), whereas for positive values of x it simply outputs x. By allowing a non-zero gradient for negative values, LeakyReLU overcomes the issue of dying neurons that affects the ReLU function.
Then, the attention coefficients are normalized through the softmax function as in Eq. (3), so that the attention weights sum up to 1 over all neighbors of a node, eventually obtaining the normalized attention coefficients α:

α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{u ∈ Γ(i)} exp(e_iu).    (3)

As next step, each node updates its hidden state by weighting the features of the neighborhood nodes with the attention coefficients according to Eq. (4):

h'_i = σ( Σ_{j ∈ Γ(i)} α_ij W h_j ).    (4)

Note that the learnable weight matrix W is pre-applied to every node in order to transform the input features into higher-level features.
To stabilize the learning process of self-attention, the mechanism has been extended similarly to Vaswani et al. (2017) by employing multi-head attention. The operations of the layer are independently replicated Q times (with different parameters) and outputs are feature-wise aggregated. Equation (5) shows the computation of a linear combination of the features by concatenating the Q attention heads, where α^{(q)}_ij is the normalized attention coefficient computed by the q-th attention mechanism:

h'_i = ∥_{q=1}^{Q} σ( Σ_{j ∈ Γ(i)} α^{(q)}_ij W^{(q)} h_j ),    (5)

where ∥ denotes the concatenation operator. For the combination of the Q independent heads, Velickovic et al. (2018) suggest concatenating them in the hidden layers and averaging them in the final layer; in the latter case, the application of the σ function is delayed.
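The masked attention scheme of Eqs. (2)-(4) can be sketched in NumPy for a single head (an illustrative toy implementation, assuming the attention function a is the LeakyReLU-activated dot product with a learnable vector, as in Velickovic et al. 2018; all shapes and values are made up):

```python
import numpy as np

def gat_layer(A, H, W, a, beta=0.2):
    """Single-head GAT layer sketch: scores e_ij = LeakyReLU(a^T [W h_i || W h_j])
    are computed only over each node's neighborhood (masked attention), normalized
    with softmax, and used to aggregate the transformed neighbor features."""
    Hw = H @ W                                    # W is pre-applied to every node
    out = np.zeros_like(Hw)
    for i in range(A.shape[0]):
        nbrs = list(np.where(A[i] > 0)[0]) + [i]  # neighborhood plus self-loop
        e = np.array([a @ np.concatenate([Hw[i], Hw[j]]) for j in nbrs])
        e = np.where(e > 0, e, beta * e)          # LeakyReLU with leak beta
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                      # softmax over the neighborhood
        out[i] = alpha @ Hw[nbrs]                 # attention-weighted aggregation
    return out

# Toy usage on a 3-node path graph
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.rand(3, 2)
W = np.random.rand(2, 2)
a = np.random.rand(4)   # attention vector of size 2*d'
out = gat_layer(A, H, W, a)
```

Multi-head attention, as in Eq. (5), would simply run `gat_layer` Q times with independent `W` and `a` parameters and concatenate (or average) the outputs.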
Limitations of GNNs. Although GNNs can exploit node attributes and the topology of the input graph, their learning power for a node classification task could be limited when there is a misalignment between features, graph and class label. While combining graph and feature information generally leads to an improvement in classification performance, the study in Qian et al. (2021) has shown the importance of graph and feature alignment in GNN models such as GCN, highlighting that when features and graph subspaces associated with the data are not aligned, the GCN approach can exhibit a performance degradation, being even outperformed by an MLP model learned from data features while discarding the network topology. More specifically, if the node connections in the network are not consistent with the associated node features (e.g., two adjacent nodes having significantly different features), then the node-neighborhood aggregation scheme may not be beneficial. As a matter of fact, typical schemes of neighborhood aggregation in GNNs inherently assume the homophily principle, i.e., connected nodes have the same class label or similar features. Another study, proposed in Zhu et al. (2020), has shown that learning on networks with low homophily (i.e., connected nodes have different class labels) is a challenging task for GNNs, which could perform worse than an MLP. However, as reported in other studies, such as Ma et al. (2021), GCN models can still achieve good performance on low-homophily networks, provided that nodes with the same class have similar neighborhoods, and different classes have distinguishable patterns.
Although investigating the aforementioned limitations is not a focus of this work, we shall take them into account in our experimental evaluation. In particular, we measure the homophily score of the evaluation networks (cf. "Data" section), and we investigate the impact of not using real-world features for our evaluation networks, where class assignment is based exclusively on graph topology (cf. "Evaluation with competing methods" section).
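One common way to quantify the homophily principle mentioned above is the edge homophily ratio, i.e., the fraction of edges connecting nodes with the same class label (this is one standard definition; the paper's exact homophily score may differ):

```python
def edge_homophily(edges, labels):
    """Edge homophily ratio: fraction of edges whose endpoints share
    the same class label. edges is a list of (i, j) pairs; labels maps
    node index to class label."""
    same = sum(1 for (i, j) in edges if labels[i] == labels[j])
    return same / len(edges)

# Toy 4-node cycle with two classes: edges (0,1) and (2,3) are homophilous
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
labels = [0, 0, 1, 1]
score = edge_homophily(edges, labels)   # 2 of 4 edges -> 0.5
```

A score near 1 indicates strong homophily, while a score near 0 marks a low-homophily network of the kind that challenges standard neighborhood aggregation.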

Related work
In order to contextualize our proposal with respect to existing literature, we here discuss some of the recently proposed GNN methodologies specifically designed for multilayer networks.
One of the first frameworks that considers inter-layer edges for embedded representation learning is MANE. However, the optimization problem of node embedding solved in MANE does not account for node attributes, and its overall approach is not end-to-end. By contrast, approaches that aim to extend deep learning-based methods for single-layer graphs, such as GCN and GAT, are well-suited for modeling both within- and inter-layer dependencies to generate embeddings for nodes that are associated with input features, and in addition they have the advantage of being designed to learn node embeddings and a classifier simultaneously via an end-to-end approach. Indeed, node classification approaches for multilayer networks have been recently proposed. Ghorbani et al. (2019) proposed MGCN to extend the GCN model to multilayer networks. The method builds a GCN for each layer of the network, by utilizing only links between nodes of the same layer, while discarding the inter-layer relations. To consider inter-layer dependencies, the method uses an unsupervised term in the loss function, which measures the ability to reconstruct the network through the inner product of the embeddings. Our proposed method shares with MGCN the design for solving a semi-supervised classification problem where label information is smoothed over the graph structure via regularization, according to Kipf and Welling (2017). However, unlike MGCN, our proposed ML-GCN method is able to incorporate the inter-layer edges within the GCN propagation rules, as well as in the loss function.
While MGCN is an extension of GCN, the GrAMME method in Shanthamallu et al. (2020) extends GAT for multilayer networks. The peculiarity of this approach is the way the Q attention heads are combined: instead of concatenating or averaging them as suggested in Velickovic et al. (2018), in GrAMME a mechanism called fusion-head is applied, which consists in a weighted combination of the attention heads with learnable parameters. Specifically, in Shanthamallu et al. (2020) two approaches are developed, namely GrAMME-SG and GrAMME-Fusion. The former explicitly builds the inter-layer edges between each node in a layer and its counterpart (referred to as pillar edges) in a different layer, and applies a series of GAT layers with the fusion-head method, exploiting the inter-layer dependencies. The GrAMME-Fusion approach deals with inter-layer dependencies in a different way, as it builds layer-wise attention models and introduces an additional layer that exploits inter-layer dependencies using only fusion heads. In the case of nodes with missing attributes, both methods employ random initialization (using a standard normal distribution). The empirical evaluation reported by the authors with several multiplex networks showed that the GrAMME-Fusion method performs better than GrAMME-SG.
Our proposed GAT extension to multilayer networks shares the multi-head attention mechanism with the GrAMME methods, although our approach is closer to GAT as it does not need the fusion-head strategy to integrate the inter-layer dependencies. More importantly, our approach involves both within-layer and outside-layer neighborhoods when computing the embedding of an entity in each layer, while GrAMME-SG involves only pillar edges (in addition to the local neighborhood) in the propagation rule, and GrAMME-Fusion integrates the inter-layer dependencies using only fusion heads. Note that, given their declared superiority with respect to convolutional approaches, in our experimental evaluation we have referred to the GrAMME methods as the main competitors of our proposed methods.

Proposed framework
Given a set V of N entities (e.g., users) and a set L = {L_1, ..., L_ℓ} of layers (e.g., user relational contexts), with |L| = ℓ ≥ 2, we denote a multilayer network with G_L = ⟨V_L, E_L, V, L⟩, where V_L ⊆ V × L is the set of entity-layer pairings, or nodes (i.e., denoting which users are present in which layers), and E_L ⊆ V_L × V_L is the set of undirected edges between nodes within and across layers. We represent a multilayer network by a set of adjacency matrices A = {A_1, ..., A_ℓ}, with A_l ∈ R^{n_l × n_l} (l = 1..ℓ), where n_l = |V_l|. Entities are assumed to be associated with features stored in layer-specific matrices X = {X_1, ..., X_ℓ}, with X_l ∈ R^{n_l × f_l} and f_l the number of node features in the l-th layer. We will also use the symbol x_(i,l) to denote the feature vector of entity v_i in layer L_l.
Note that in our multilayer network model there is no prior assumption either about the set of valid couplings between the layers or about the structure of the layers. Indeed, our framework is theoretically able to handle networks with different coupling constraints between the layers, e.g., temporal networks or cross-platform networks.
It is also worth noticing that the sizes f_l may differ in principle; however, they must all be bounded by a maximum size, say f: truncation (resp. zero-padding) applies to those layers having a greater (resp. lower) number of features than f. Moreover, to avoid numerical scaling issues, all feature matrices are assumed to be row-normalized within a common interval of values. Furthermore, in case no node attributes are available for G_L, each layer-specific feature matrix is assumed to be the identity matrix I_l ∈ R^{n_l × n_l}. Also, for partially complete feature matrices, value imputation and matrix completion methods can certainly be used; however, this goes beyond the scope of this work.
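The feature alignment just described (truncation, zero-padding, row-normalization) can be sketched as follows; this is an illustrative preprocessing helper, not the authors' code, and row-normalization by the per-row maximum is an assumption about "a common interval of values":

```python
import numpy as np

def prepare_layer_features(X_layers, f):
    """Align layer-specific feature matrices to a common width f:
    truncate wider ones, zero-pad narrower ones, then row-normalize.
    Layers with no attributes would instead pass an identity matrix."""
    out = []
    for X in X_layers:
        n, fl = X.shape
        if fl > f:
            X = X[:, :f]                               # truncation
        elif fl < f:
            X = np.hstack([X, np.zeros((n, f - fl))])  # zero-padding
        row_max = X.max(axis=1, keepdims=True)
        row_max[row_max == 0] = 1.0                    # avoid division by zero
        out.append(X / row_max)                        # row-normalization
    return out

# Toy usage: two layers with 3 and 1 features, aligned to f = 2
X1 = np.ones((2, 3))
X2 = np.ones((2, 1))
aligned = prepare_layer_features([X1, X2], f=2)
```

After alignment, every X_l has shape (n_l, f) and values in [0, 1].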
Node embedding in multilayer network. Given a multilayer network G_L = ⟨V_L, E_L, V, L⟩, we define multilayer network embedding as the problem of learning low-dimensional latent representations for each node (i.e., entity-layer pair), that is, learning a function g: V_L → R^d that maps each node into a d-dimensional space, with d ≪ N, so that nodes that are similar in G_L have embeddings close to each other.
The above definition resembles the classic one of node embeddings, with adaptation to multilayer networks. Moreover, to model similarity of nodes in the multilayer network, we follow the general idea adopted in representation learning on graphs, that is, node embeddings are generated based on neighborhoods, upon the intuition that nodes aggregate information from their neighbors by using a GNN.
However, a major question becomes how to consider a node's neighborhood in the multilayer network to properly generate the embeddings. In this regard, we notice that a major requirement for our proposed framework is to account for node links that are internal as well as external to a particular layer where the nodes occur. To this end, our key idea is to incorporate in the GNN propagation rules an aggregation over node features (both topological and exogenous to the network, i.e., node attributes) that are computed not only w.r.t. the node's neighbors in the same layer but also w.r.t. the node's neighbors in the other layers.
In this regard, we define two functions, denoted as Γ and Λ, that for each entity-layer pair, i.e., node, return the neighborhood of the entity that is internal and external to that layer, respectively. Formally, given an entity v_i in a layer L_l, we define the set of within-layer neighbors of v_i in layer L_l as:

Γ(i, l) = { v_j | ((i, l), (j, l)) ∈ E_L }.    (6)

Similarly, we define the set of outside-layer neighbors of v_i in layers different from L_l as:

Λ(i, l) = { (v_j, L_m) | ((i, l), (j, m)) ∈ E_L, m ≠ l }.    (7)

Figure 1 shows an illustration of a multilayer network that our framework is able to deal with: more generally than in multiplex networks, inter-layer edges can link not only nodes of the same entity but also nodes of different entities. Both types of inter-layer edges are indeed considered in our definition of outside-layer neighbors (cf. Eq. 7).
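The two neighborhood functions can be computed directly from the edge list of entity-layer pairs; the sketch below is illustrative (function and symbol names, e.g., Gamma/Lambda for within/outside-layer, are our own labels, since the paper's exact notation is rendered imperfectly in this extraction):

```python
def neighborhoods(edges, i, l):
    """Within-layer (Gamma) and outside-layer (Lambda) neighbors of
    entity i in layer l. edges is a list of undirected edges between
    (entity, layer) nodes."""
    within, outside = set(), set()
    for (a, b) in edges:
        for (u, v) in ((a, b), (b, a)):     # edges are undirected
            if u == (i, l):
                if v[1] == l:
                    within.add(v[0])        # same-layer neighbor entity
                else:
                    outside.add(v)          # (entity, layer) in another layer
    return within, outside

# Toy multilayer edge list: entity 0 in layer 0 has one within-layer
# neighbor (entity 1) and two outside-layer neighbors in layer 1
edges = [((0, 0), (1, 0)), ((0, 0), (0, 1)), ((0, 0), (2, 1))]
gamma, lam = neighborhoods(edges, 0, 0)
```

Note that outside-layer neighbors include both the pillar edge to entity 0's own counterpart in layer 1 and the cross-entity edge to entity 2, matching the generality of Eq. (7).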
Let us now consider the key constituents of our proposed GNN models, namely a GCN model and a GAT model for multilayer networks. First, we denote with h^{(k)}_{(i,l)} the hidden state at the k-th layer of the neural network for entity v_i in layer L_l, and with z_{(i,l)} = h^{(K)}_{(i,l)} the final embedding of entity v_i in L_l, eventually used for a downstream task such as entity classification. Using the message passing paradigm (Gilmer et al. 2017), we abstract the aggregation scheme of our framework in Eq. (8):

h^{(k+1)}_{(i,l)} = φ_v( h^{(k)}_{(i,l)}, ⊕_{(j,m) ∈ Γ(i,l) ∪ Λ(i,l)} φ_e( h^{(k)}_{(i,l)}, h^{(k)}_{(j,m)}, x_e ) ),    (8)

where the φ_e function, named message function, is edge-wise defined to generate messages across the edges, obtained by combining the edge properties x_e and the states of its two end-nodes, i.e., h_{(i,l)} and h_{(j,m)}; φ_v, named update function, is a node-wise function used to update the state of a node; ⊕ is the aggregation (or reduce) operator, which is usually summation, or alternatively a pooling operator or even a neural network (Wang et al. 2020). Note that in our formulation a node updates its state considering both intra-layer and inter-layer dependencies within the aggregation stage, so that the embedding of each layer is related to the other layers. If the framework is instantiated based on a GAT approach, the message φ_e of each edge ((i, l), (j, m)) received by node v_i corresponds to the normalized attention coefficient α_{((i,l),(j,m))} multiplied by the hidden state of node v_j.
Our contribution is to incorporate the above representation models into the propagation rules of both GCN and GAT frameworks, in order to make them suitable for the multilayer network context. Our resulting methods are dubbed ML-GCN and ML-GAT , respectively.
ML-GCN propagation rules. Given a node v_i in a layer L_l, the first propagation rule is defined as:

h^{(1)}_{(i,l)} = σ( Σ_{(j,m) ∈ Γ(i,l) ∪ Λ(i,l) ∪ {(i,l)}} ( 1 / √(D_{(i,l)(i,l)} D_{(j,m)(j,m)}) ) x_{(j,m)} W^{(0)} ),

where σ is the ReLU activation function, and W^{(0)} is the initial weight matrix of shape (f, d), shared across all nodes of the multilayer graph. Note that the degree matrix D is built considering both inter-layer and intra-layer connections of nodes using the supra-adjacency matrix of the graph, which can be defined as the block matrix:

A^{sup} = [ A_1      A_{1,2}  ...  A_{1,ℓ}
            A_{2,1}  A_2      ...  A_{2,ℓ}
            ...
            A_{ℓ,1}  A_{ℓ,2}  ...  A_ℓ  ],

where A_{l,m} is an inter-layer adjacency matrix built upon the inter-layer connections between layer l and layer m (i.e., its (i, u) entry is 1 if there exists an edge between (i, l) and (u, m), with l ≠ m, and 0 otherwise). Moreover, D_ii = Σ_j Ã^{sup}_ij, where Ã^{sup} is the supra-adjacency matrix with self-loops added.
The above equation is then adapted to produce the propagation rule at the generic k-th layer of the GNN (with 1 ≤ k < K):

h^{(k+1)}_{(i,l)} = σ( Σ_{(j,m) ∈ Γ(i,l) ∪ Λ(i,l) ∪ {(i,l)}} ( 1 / √(D_{(i,l)(i,l)} D_{(j,m)(j,m)}) ) h^{(k)}_{(j,m)} W^{(k)} ).

Note that, for k = K − 1, the above equation produces the output feature vector for entity v_i in layer L_l. Also, the shape of the weight matrix W^{(k)} is (d, d), for every k ≥ 1.
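The supra-adjacency construction and an ML-GCN-style propagation step over it can be sketched in NumPy (an illustrative toy, not the authors' implementation; the per-block layout and the GCN-style normalization follow the description above):

```python
import numpy as np

def supra_adjacency(A_layers, A_inter):
    """Assemble the block supra-adjacency matrix: intra-layer matrices A_l
    on the diagonal, inter-layer matrices A_{l,m} (a dict keyed by (l, m))
    off-diagonal, zeros where no inter-layer edges exist."""
    sizes = [A.shape[0] for A in A_layers]
    offs = np.cumsum([0] + sizes)
    S = np.zeros((offs[-1], offs[-1]))
    for l, A_l in enumerate(A_layers):
        S[offs[l]:offs[l+1], offs[l]:offs[l+1]] = A_l
        for m in range(len(A_layers)):
            if l != m and (l, m) in A_inter:
                S[offs[l]:offs[l+1], offs[m]:offs[m+1]] = A_inter[(l, m)]
    return S

def mlgcn_step(S, H, W):
    """One GCN-style propagation step over the supra-adjacency matrix:
    self-loops plus symmetric normalization, then linear transform and ReLU,
    so each node aggregates both within-layer and outside-layer neighbors."""
    S_hat = S + np.eye(S.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(S_hat.sum(axis=1))
    S_norm = S_hat * np.outer(d_inv_sqrt, d_inv_sqrt)
    return np.maximum(0, S_norm @ H @ W)

# Toy network: two layers of 2 entities each, pillar edges between layers
A1 = np.array([[0, 1], [1, 0]], dtype=float)
A2 = np.array([[0, 1], [1, 0]], dtype=float)
A_inter = {(0, 1): np.eye(2), (1, 0): np.eye(2)}
S = supra_adjacency([A1, A2], A_inter)
H = np.random.rand(4, 3)
W = np.random.rand(3, 2)
out = mlgcn_step(S, H, W)
```

Because the propagation operates on the whole supra-adjacency matrix, the embedding of each node mixes information from both its within-layer and outside-layer neighborhoods in a single step.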
ML-GAT propagation rules. Given a node v_i in a layer L_l, the first propagation rule is defined as:

h^{(1)}_{(i,l)} = σ( Σ_{(j,m) ∈ Γ(i,l) ∪ Λ(i,l) ∪ {(i,l)}} α_{((i,l),(j,m))} x_{(j,m)} W^{(0)} ),

where α_{((i,l),(j,m))} is the normalized attention coefficient for any edge ((i, l), (j, m)) ∈ E_L. Note that we integrate the attention mechanism on both intra-layer and inter-layer edges, so that our model can selectively integrate the information received from the inter-layer and intra-layer neighbors.
Similarly to the solution proposed for ML-GCN, the initial propagation rule equation is generalized for any k-th layer of the GNN (with 1 ≤ k < K). In addition, a mechanism of multi-head attention is used to stabilize the learning process of self-attention. Therefore, given Q attention heads, we define two variants of the generic propagation rule, where either the output features of the heads are concatenated:

h^{(k+1)}_{(i,l)} = ∥_{q=1}^{Q} σ( Σ_{(j,m)} α^{(q)}_{((i,l),(j,m))} h^{(q,k)}_{(j,m)} W^{(q,k)} ),

or the heads are averaged before applying the activation function:

h^{(k+1)}_{(i,l)} = σ( (1/Q) Σ_{q=1}^{Q} Σ_{(j,m)} α^{(q)}_{((i,l),(j,m))} h^{(q,k)}_{(j,m)} W^{(q,k)} ).

Above, h^{(q,k)}_{(j,m)} denotes the embedding of node (j, m) for the q-th head at the k-th layer of the neural network. Moreover, in our setting, we averaged the Q attention heads in order to save memory; therefore we set σ as the exponential linear unit (ELU) activation function, i.e., for a positive input b the function simply outputs b, whereas if the input is negative, it outputs exp(b) − 1.
Entity classification. As downstream task, we consider the problem of classifying the entities of a multilayer network graph in a transductive setting, i.e., class labels are only available for a small subset of entities; however, the whole network graph containing both labeled and unlabeled data is used during the learning process, and the goal of the trained model is to predict the labels of the unlabeled entities. As previously discussed in "Introduction" section, this setting complies with the realistic assumption of lack of knowledge on a target concept corresponding to the class, for most of the nodes in a network. However, the transductive setting is also challenging as it requires that our GNN-based learning framework be able to learn representations not only of nodes with labels but also of nodes without labels. We define this type of (multiclass) classification task with partial supervision, i.e., semi-supervised classification task, as follows: Problem 1 (Entity classification in multilayer network) Given a multilayer network G_L = ⟨V_L, E_L, V, L⟩ and an associated input feature matrix X, let Y ∈ R^{N×C} denote the binary matrix storing the class labels assigned to each entity in V, where C is the number of classes in a predetermined set. Given a small subset of entities V_train ⊂ V that we refer to as training set, we denote as Y_train ∈ R^{|V_train|×C} the corresponding class-label matrix. The goal is to predict the labels of the entities in V \ V_train using both the multilayer graph structure, based on the supra-adjacency matrix, and the entity features stored in X. That is, we want to obtain a probability distribution matrix Ŷ ∈ R^{N×C} so as to derive the label assignment argmax_c Ŷ for the entities in V \ V_train.
In order to predict the class label for each entity v ∈ V, we combine the node embeddings obtained from each layer. Given Z_l ∈ R^{n_l × d} as the learned embeddings for layer L_l, we obtain the entity representation through the following cross-layer aggregation:

Z = Σ_{l=1}^{ℓ} µ_l Z_l,    (15)

where Z ∈ R^{N×d} is the final entity embedding matrix, and µ ∈ R^ℓ is a vector of non-negative values that can be either pre-determined (e.g., a uniform distribution, or any arbitrary user-provided distribution) or learned during training through an optimization procedure. In our setting, we used the configuration with trainable parameters, as described in the following paragraph on parameter learning.
The matrix Z is then provided in input to a feed-forward neural network, whose output is a matrix of size N × C.
The final step consists in applying the row-wise softmax function to this output matrix to yield the class prediction for each entity. Algorithm 1 sketches the pseudo-code of our classification framework, whereas Fig. 2 depicts an illustration of the framework.
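The cross-layer aggregation and the final classification steps can be sketched together in NumPy (an illustrative toy: it assumes each Z_l has already been aligned to all N entities, e.g., with zero rows for entities absent from a layer, and uses a single linear output layer in place of the paper's feed-forward network):

```python
import numpy as np

def classify_entities(Z_layers, mu, W_out):
    """Cross-layer aggregation Z = sum_l mu_l * Z_l, followed by a linear
    output layer and a row-wise softmax over the C classes."""
    Z = sum(m * Zl for m, Zl in zip(mu, Z_layers))   # weighted layer combination
    logits = Z @ W_out                               # feed-forward output, N x C
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)         # row-wise softmax
    return probs.argmax(axis=1), probs

# Toy: 2 entities, 2 layers, d = C = 2; second layer contributes nothing
Z_layers = [np.eye(2), np.zeros((2, 2))]
mu = [1.0, 0.5]
W_out = np.eye(2)
preds, probs = classify_entities(Z_layers, mu, W_out)
```

In the actual framework µ would be trainable (see the parameter learning paragraph), but the forward computation has this shape.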
Parameter learning. With the exception of the cross-layer parameters µ, which are initialized by a uniform distribution (i.e., µ_l = 1/ℓ, for all l = 1..ℓ), all other parameters are initialized through Glorot (also known as Xavier) initialization (Glorot and Bengio 2010); this initialization technique is widely used in deep learning tasks and has proven to be effective in several applications (Mishkin and Matas 2016). In particular, the trainable weight matrix W in ML-GAT and in ML-GCN is subjected to Glorot initialization with normal distribution and with uniform distribution, respectively. Moreover, in ML-GAT, each layer of the multilayer network is also parametrized with a weight vector a ∈ R^{2d} of the single-layer feed-forward neural network used as the attention mechanism (according to Velickovic et al. 2018).
All the above parameters, jointly with those of the neural network downstream of the cross-layer aggregation (cf. Eq. 15), are then updated during the training through an optimization strategy. That is, after the forward step of our learning framework, we calculate the loss function over all labeled examples in V_train. Then the gradient of the loss with respect to all parameters is calculated, and the parameters are finally updated through the optimization strategy (cf. "Experimental evaluation" section). As supervised loss function we use the cross-entropy shown in Eq. (16): L = − Σ_{v ∈ V_train} Σ_{c=1}^{C} Y_{vc} ln Ỹ_{vc}. Note that, like Kipf and Welling (2017), we avoid explicit graph-based regularization in the loss function, while our GNN models are learned through the supra-adjacency matrix of the multilayer network, which allows the models to learn representations for node instances of unlabeled entities.
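The supervised loss restricted to the labeled entities can be sketched as below (NumPy, hypothetical names; `Y_prob` is the softmax output, `Y_true` the one-hot label matrix, and the averaging over |V_train| is an assumption):

```python
import numpy as np

def masked_cross_entropy(Y_prob, Y_true, train_idx, eps=1e-12):
    """Cross-entropy computed only over the training entities V_train;
    eps guards against log(0) for confident but wrong predictions."""
    logp = np.log(Y_prob[train_idx] + eps)
    return -np.sum(Y_true[train_idx] * logp) / len(train_idx)
```

Restricting the sum to `train_idx` is what makes the setting semi-supervised: unlabeled entities still shape the representations through message passing, but contribute no loss term.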
In order to show the outcomes of the representation learning process, a two-dimensional projection of the embeddings produced by the proposed approaches is reported in Fig. 3 (ML-GAT) and Fig. 4 (ML-GCN). The representation is obtained via the t-distributed stochastic neighbor embedding (t-SNE) method, a widely used nonlinear dimensionality reduction technique for embedding high-dimensional data into a low-dimensional space for visualization (van der Maaten and Hinton 2008). More specifically, the plots show the embeddings produced by t-SNE on Z for the training entities (25% of total entities), i.e., the entity representation obtained through the cross-layer aggregation defined in Eq. (15). The embeddings refer to the Koumbia-2 network including its real-world node-attributes (cf. "Data" section); different colors in the figures correspond to the two different node labels. The left side of the figures shows the embeddings obtained after one training epoch, while the right one shows the final embeddings obtained after 1000 training epochs, extracted downstream of the second (i.e., last) hidden layer. It is easy to see that the embedding is already significant after only one training epoch, with a relatively good separation between the two classes (slightly more evident for ML-GAT). The embeddings get clearly better after 1000 training epochs, with an evident separation between the representations of entities belonging to the two classes.

Experimental evaluation
We evaluated our framework for semi-supervised entity classification tasks on several real-world multiplex and multilayer networks from different domains.
In the following we provide details on the evaluation network datasets, on the experimental settings of our proposed methods, and on those of the competitors (GrAMME-SG and GrAMME-Fusion (Shanthamallu et al. 2020)) and baselines (GCN and GAT).

Data
We considered 9 network datasets, from which we derived a total of 19 networks for evaluation (plus their monoplex flattened versions). These datasets come from different domains and present very different structural characteristics. Moreover, all datasets are publicly available and most of them have been previously exploited as benchmarks for a variety of network analysis tasks, including node classification and link prediction. This is a major criterion for our choice, since it allows comparison of our results with the ones reported in previous literature. Eight of these datasets are originally provided as multiplex networks, i.e., networks in which inter-layer connections are coupling edges only, connecting a node and its counterparts in other layers. These networks also do not provide attributes associated with the entities. Therefore, in order to stress the ability of our framework to deal with generic multilayer attributed networks, we introduced the Koumbia network dataset, which comes with unconstrained inter-layer connections and real-world properties associated with the entities. In the following we briefly describe our evaluation network datasets.
Balance (Siegler 1976) models psychological experimental results on a set of individuals. According to Shanthamallu et al. (2020), four attributes characterize the subjects (left weight, left distance, right weight, and right distance). The classes correspond to the balance scale of the subjects (tip to the right, tip to the left, or being balanced).
CKM-Social (Coleman et al. 1957) contains social information from physicians (entities) in four towns in Illinois: Peoria, Bloomington, Quincy and Galesburg. It consists of 3 directed layers generated from different sociometric matrices, where the cities are used to label the entities.
Zangari et al. Appl Netw Sci (2021) 6:87
Congress (Schlimmer 1987) models the results of bills obtained from the 1984 United States Congressional Voting Records Database. The network has 16 layers corresponding to votes, where for each layer two congressmen are linked if they voted the same way. Each congressman is labeled as either democrat or republican.
DKPol (Magnani et al. 2021) (Dansk Politik) is a network with three types of online relations between Danish Members of the Parliament on Twitter. It comes with a ground truth corresponding to affiliations to 10 political parties, that we used as labels.
Leskovec-Ng (Chen and III 2016) is a 4-layer temporal collaboration network, which contains coauthors of Prof. Jure Leskovec or Prof. Andrew Ng over 20 years, partitioned into 4 different 5-year intervals. Entities are researchers, and on each layer there is an edge between two researchers if they co-authored at least one paper in the 5-year interval. Each researcher is labeled as Leskovec's collaborator or Ng's collaborator.
Starwars comprises 6 layers, each corresponding to interactions between Star Wars characters in the first 6 episodes of the saga. We manually labeled each character as male, female or droid, and used this information as entity labels. Note that the resulting class distribution is very unbalanced, as there are 76 males, 12 females and 4 droids.
Terrorist (Everton 2012) models interactions between 79 terrorists drawn from the Noordin's Network dataset. Similarly to Liu et al. (2017), we built a 4-layer network from the following relation types: trust, communication, operational, business and financial ties. We derive two different types of entity labels: (i) 2-class labels corresponding to membership of an individual to the Noordin's splinter group (member or non-member), and (ii) 3-class labels corresponding to the current state defined as the physical condition of the individual (dead, alive, jail). The two versions are dubbed Terrorist-Noordin and Terrorist-status, respectively.
Vickers (Vickers and Chan 1981) is a 3-layer directed multiplex network modeling the social relations between 29 seventh grade students in a school. We use the gender as entity labels; there are 12 boys and 17 girls.
Koumbia is a multilayer network extracted from a Sentinel-2 satellite image time series, centered on an agricultural landscape in the Koumbia area in Burkina Faso. In this dataset, entities represent segments of the satellite image, and classes correspond either to crop (i.e., segments containing pixels related to cultivated areas) or no-crop (i.e., segments containing pixels related to uncultivated areas, such as forests). The network is originally associated with inter-layer edges and real-world attributes for the entities, corresponding to the segment statistics of the radiometric bands of the satellite images. We created the input feature matrix by concatenating the average values of ten different radiometric bands for 21 timestamps, obtaining a feature vector of size 210 for each entity. Fig. 5a displays the feature distribution obtained by linearizing the input feature matrix. Note that the geo2net framework is designed to build a multilayer network with an arbitrary number of layers, which model the association of nodes to an arbitrary number of functional classes (e.g., temporal radiometric profiles) by producing fuzzy layer memberships using the fuzzy c-means algorithm (Ross 2009). In our case study, we exploit this functionality in order to take into account versions of the network with a varying number of layers (i.e., 2, 5, 10, 15, 20). We denote each of these networks as Koumbia-l (e.g., Koumbia-5 will denote the version with 5 layers). Figure 5 shows the Koumbia-2 (b) and Koumbia-5 (c) graphs, whereas Table 4 reports detailed information about inter-layer edges. Since the competitors are specifically designed for multiplex networks, for a fair comparison with those methods we will also use a multiplex version of this dataset (i.e., obtained by discarding inter-layer connections other than coupling ones), named Koumbia-l-mpx.
Table 2 summarizes the structural properties of all the network datasets, plus the multiplex versions of Koumbia. For the latter, inter-layer edge information is reported in Table 4. Note that the graph homophily in Table 2 corresponds to the fraction of edges in a graph connecting nodes with the same class label (Zhu et al. 2020), i.e., h = |{(v_i, v_j) ∈ E : y_i = y_j}| / |E|, with y_i, y_j the class labels of v_i, v_j, respectively. This statistic was calculated excluding the inter-layer edges, which would otherwise lead to biased homophily scores. Also, the average entity frequency corresponds to the fraction of layers on which each entity appears, averaged over all entities. Moreover, Table 3 summarizes information on the monoplex, flattened versions of the network datasets, which will be used for the evaluation of baseline methods designed for monoplex networks.
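The edge homophily statistic can be computed in a few lines; the sketch below excludes inter-layer edges by construction, since only intra-layer edge lists are passed in (helper name is hypothetical):

```python
def edge_homophily(intra_layer_edges, labels):
    """Fraction of intra-layer edges whose endpoints share a class
    label (Zhu et al. 2020); `labels` maps an entity id to its class."""
    same = sum(1 for u, v in intra_layer_edges if labels[u] == labels[v])
    return same / len(intra_layer_edges)
```

A score of 1.0 (e.g., CKM-Social) means every edge links same-class nodes, while values near 0.5 on a binary task (e.g., Terrorist-status at 0.469) indicate weak homophily.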

Experimental settings
We conducted all experiments under a transductive learning setting whereby, given an input multilayer network, only a fixed portion of the set of entities for each class was used as labeled data for the training of a GNN model. Recall that, due to the transductive setup, the learning process is nonetheless able to use all node attributes and topological information. For those networks without external information, node attributes were initialized by sampling each attribute from either a Gaussian, an Exponential, or a Uniform distribution; more precisely, for a given choice of the number f of node attributes, either we randomly generated the values of each attribute from a Gaussian distribution, or we randomly generated one third of the attributes each from Gaussian, Exponential and Uniform distributions.
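The two synthetic-attribute initializations described above can be sketched as follows (hypothetical helper; the exact parameters of the distributions are an assumption, as the text does not specify them):

```python
import numpy as np

def init_node_attributes(n, f, rng, mixed=False):
    """Random node attributes for networks without external information:
    either all-Gaussian, or one third each from Gaussian, Exponential
    and Uniform distributions."""
    if not mixed:
        return rng.normal(size=(n, f))
    k = f // 3
    parts = [rng.normal(size=(n, k)),
             rng.exponential(size=(n, k)),
             rng.uniform(size=(n, f - 2 * k))]
    return np.concatenate(parts, axis=1)
```

The `mixed=True` variant corresponds to the "more realistic" setting revisited in the "Impact of the attribute matrix" section.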
We used two main settings for the training set size, namely 25% and 5% of the set of entities (we will refer to the setting with 25% training set size unless otherwise specified). All GNNs were trained using the Adam optimization algorithm (Kingma and Ba 2017) with full batch size, for either 1000 epochs, or until convergence when the early-stopping regularization technique was used (with a patience value of 50 epochs), with learning rate set to 0.005, L2 weight regularization set to 0.0005, and dropout with p = 0.6 applied to the hidden layers and to the normalized attention coefficients of GCN-based and GAT-based methods, respectively. Note that dropout introduces stochasticity, since we sample the within-layer and outside-layer neighborhood of each node. It should also be noted that, since all our evaluation networks fit into the GPU memory, we performed full batch training (Kipf and Welling 2017, 2016), where the parameters are updated after processing the whole network. Also, regarding the cross-layer aggregation (cf. Eq. 15), the parameters μ were initialized with a uniform distribution. Both our ML-GAT method and GAT use multi-head attention with Q = 2 attention heads for each layer, where the heads are averaged in order to save memory resources. Furthermore, the negative slope β of the LeakyReLU function was set to 0.2 (cf. "Background" section). We set K = 2 with d = 32 features, and the number of input features to f = 64.
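For reference, the training hyperparameters listed above can be collected in a single configuration; the values are taken from the text, while the dictionary itself (and its key names) is just an illustrative convention:

```python
# Hyperparameters reported in the experimental settings (illustrative grouping).
TRAIN_CONFIG = {
    "optimizer": "Adam",        # Kingma and Ba (2017), full batch
    "epochs": 1000,             # or early stopping with patience 50
    "patience": 50,
    "learning_rate": 0.005,
    "l2_weight_decay": 0.0005,
    "dropout": 0.6,             # hidden layers / attention coefficients
    "attention_heads": 2,       # Q, averaged to save memory
    "leaky_relu_slope": 0.2,    # beta
    "num_gnn_layers": 2,        # K
    "hidden_dim": 32,           # d
    "input_features": 64,       # f
}
```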
Concerning the setting of the baseline methods, the original GAT and GCN methods were trained over the flattened networks (cf. Table 3), since they were conceived for single-layer graphs. Moreover, since the GrAMME (Shanthamallu et al. 2020) framework can deal with multirelational/multiplex networks only, in our comparative experiments we used the multiplex versions of the Koumbia-l networks, named Koumbia-l-mpx, thus discarding inter-layer edges connecting nodes representing different entities.
For GrAMME-SG and GrAMME-Fusion, we set learning rate to 0.01, f = 64 , d = 32 , 5 fusion heads for GrAMME-Fusion, and used dropout regularization technique.
Note that we used the publicly available software implementations for all the competitors.

Results
We present the results of the experimental evaluation of the proposed framework, which are organized into two evaluation stages:
1. In the "Evaluation with competing methods" section, we compare the proposed methods and the competitors (i.e., GrAMME-SG, GrAMME-Fusion, GAT and GCN) on the multiplex networks introduced in the "Data" section, in terms of prediction performance, of the impact of the setting of the training set size and of the early-stopping technique, and of the initial input node-features. We also analyze the training execution times of the methods, and we discuss aspects of their computational complexity.
2. In the "Evaluation on real-world node-attributes and arbitrary inter-layer edges: the Koumbia multilayer network testbed" section, we conduct a thorough analysis of the Koumbia dataset, focusing on the impact of using real node-features. We focus on this network since it is the only one including real-world attributes for the entities and arbitrary inter-layer edges, thus allowing us to evaluate to what extent the proposed framework is able to exploit such characteristics.

Evaluation with competing methods
Table 5 reports the average accuracy scores achieved by the proposed ML-GAT and ML-GCN approaches and by the competing methods (GrAMME-SG, GrAMME-Fusion, GAT and GCN).
Table 5 Accuracy (mean and standard deviation over 10 runs) obtained by the proposed methods and competitors. Bold values refer to the best results on each network
For each network and method, the average accuracy was computed over 10 independent runs, where each run corresponded to a different train-test split, with 25% of training entities, and the input features of the entities were randomly initialized with a normal distribution for all the networks. At first glance, our proposed methods are able to achieve very high, or even nearly optimal, accuracy on most networks. This also mostly holds for GrAMME-Fusion and, to a lesser extent, for GrAMME-SG.
Moreover, the results obtained by the baselines on the corresponding monoplex versions of the network datasets reveal some situations in which the flattening process is actually beneficial for the entity classification task: particularly, as can be observed in Fig. 6, CKM-Social and Leskovec-Ng layers are mostly structured with disconnected components, where each component in a layer contains nodes belonging to a specific label, so that the monoplex flattened network can better contextualize each node with respect to neighbors of different classes. Note also how the attention-based approaches (i.e., ML-GAT and GrAMME) do not suffer from the same issue, and obtain performances close to or even better than the ones of the baselines.
By contrast, multilayer approaches always obtain the best performances on networks with very high values of average entity frequency (i.e., the average percentage of layers on which each entity appears, cf. Table 2), such as Vickers (ML-GAT), Congress (GrAMME-Fusion) and Balance (ML-GCN); an exception is represented by CKM-Social, which can be explained by the fact that in this network the nodes in every layer are associated with the same label. The Balance case is also interesting as it is the only one where ML-GCN outperforms not only the competitors but also ML-GAT (which is nonetheless the second best performer). This might be ascribed to the peculiar structure of the Balance network, where each layer is composed of 5 disjoint complete connected components and the various node labels are distributed over the components. In fact, ML-GAT has worse performance than ML-GCN for both the Balance and Congress networks, which are the two with the highest combination of average degree and average density (cf. Table 2). As already observed in Mohan (2021), for networks with high average degree and a dense supra-adjacency matrix, GCN-based methods may perform better than random-walk-based and GAT-based approaches, due to the stochasticity introduced by the attention mechanism. Furthermore, according to Zhu et al. (2020) and Qian et al. (2021), all models have relatively low performance on the network with the lowest homophily score, i.e., Terrorist-status (0.469). Likewise, all models have good performance on networks with strong homophily, such as Leskovec-Ng (0.994) and CKM-Social (1.0). On the other hand, note how multilayer models can still perform well on networks with lower homophily. Finally, a major remark that stands out from Table 5 concerns the results obtained on the Koumbia networks, for increasing numbers of layers. Note that Koumbia is the network dataset including the highest number of entities (2246), and that the Koumbia-2 network has no edge overlap between the two layers (i.e., the node sets of the two layers are completely disjoint). Notably, our ML-GAT, followed by ML-GCN, outperforms all the competitors, with GrAMME-SG performing even worse than the baselines GAT and GCN. Note also that GrAMME proved to be sensitive to the number of layers, to the point that for 10 layers (i.e., Koumbia-10-mpx) both variants of our major competitor ran out of the allotted running time.
Fig. 6 Leskovec-Ng and CKM-Social networks before and after the flattening process. In Leskovec-Ng, blue nodes and red nodes correspond to Leskovec's collaborators and to Ng's collaborators, respectively. In CKM-Social, color codes correspond to different cities
Table 6 Accuracy (mean and standard deviation over 10 runs) obtained by the proposed methods and competitors. Training and testing set sizes correspond to 5% and 95% of the entities, respectively. Bold values refer to the best results on each network

Impact of training set size and early-stopping
We investigated a more challenging scenario for the training of the GNN models under study, using only 5% of the entities as training instances and the remaining ones for testing. Results are reported in Table 6. As expected, the classification performances of the various methods tend to decrease on almost all the networks. Two particular situations occur with the Starwars and Terrorist-status networks: on the former, only GrAMME-Fusion and GCN have worse performance, while on the latter, the GrAMME methods even improve w.r.t. Table 5. This can be explained by the fact that both network datasets have a highly unbalanced distribution of class labels; therefore, for the least covered class (i.e., 'droid' for Starwars and 'dead' for Terrorist-status), the number of selected training instances does not significantly change as the training set size decreases from 25 to 5%. More interestingly, on the largest networks other than Koumbia, i.e., Congress and Balance, ML-GCN and ML-GAT outperform the other competing models. In general, it turns out that ML-GCN and ML-GAT tend to be less sensitive than the other methods when the percentage of training set size changes from 25% to 5%.
Another important aspect that we have not considered so far is the opportunity of using early-stopping regularization which, with the use of a validation set, can be helpful to mitigate over-fitting of the GNN models. To this purpose, we carried out a further stage of evaluation where each GNN model was equipped with early-stopping and a patience value of 50 epochs, i.e., the training of a GNN model was terminated if the validation accuracy had not increased for 50 consecutive epochs. Tables 7 and 8 show results corresponding to 25% and 5% of training set size, respectively. Considering first the effect of early-stopping with training set size of 25%, we observe that our ML-GAT and ML-GCN and their monoplex counterparts improve their performance w.r.t.
the scenario without early-stopping (i.e., Table 5) in most of the networks. For instance, ML-GAT and ML-GCN increase their accuracy on Starwars from 0.70 to 0.817 and from 0.714 to 0.815, on Terrorist-status from 0.477 to 0.570 and from 0.502 to 0.545, and on CKM-Social from 0.954 to 0.962 and from 0.824 to 0.921, respectively. Moreover, on the Koumbia networks, ML-GAT and ML-GCN achieve comparable or even better results (i.e., ML-GAT on Koumbia-5-mpx) than those obtained without early-stopping, while the monoplex counterparts, especially GAT, decrease their performance significantly. By contrast, the GrAMME methods tend to benefit less from the use of early-stopping.
Finally, from the comparison between Tables 6 and 8 corresponding to 5% of training set size, we draw analogous remarks to the above discussed for the scenario with 25% of training set size. Although the variations between results in Table 8 and corresponding results in Table 6 are in general relatively small, some cases are still remarkable, such as the improvement of ML-GAT and ML-GCN on the two Terrorist networks, Koumbia-5-mpx and Koumbia-10-mpx.

Impact of the attribute matrix
To better assess the robustness of our proposed methods, and also to gain insights into those cases in favor of the monoplex-based baselines, we replicated the previous analysis by initializing the entity features with mixed distributions, following the other, more realistic approach described in the "Experimental settings" section, i.e., one third of the attributes are modeled with normal distributions, one third with uniform distributions, and one third with exponential distributions.
Results of these experiments are reported in Table 9. While GAT and GCN are still the best performing methods on Leskovec-Ng and CKM-Social (due to the peculiar structural characteristics described in the above section, which make the monoplex version better suited for the task at hand than the multilayer network), it can be noted that their relative performances on DKPol decrease significantly (to about 0.79) w.r.t. the ones observed in Table 5; by contrast, on this network, ML-GCN is the best performing method (0.82). Also, the accuracy of GAT (0.77) and GCN (0.67) worsens significantly on Vickers and Balance, respectively. By contrast, ML-GCN achieves better accuracy than the corresponding values in Table 5 on DKPol, Congress, and Koumbia-2-mpx, while both ML-GAT and ML-GCN improve on Vickers and Starwars.
Overall, comparing the methods' accuracy values averaged over the networks in Table 9 against those in Table 5, GAT and GCN show a more evident percentage decrease (about −2% and −1.5%, respectively) than ML-GAT and ML-GCN. To sum up, while the performances of the monoplex-based baselines change drastically in some cases, indicating a higher sensitivity to the characteristics of the node attributes, our proposed methods turn out to be more robust, even benefiting, on some networks, from a mixed, thus more realistic, distribution of values for the node attributes.

Computational complexity aspects and training time analysis
In this section, we first discuss the computational complexity of our methods, ML-GAT and ML-GCN, assuming a sparse supra-adjacency matrix and that the total number of nodes in a multilayer network is O(Nℓ). The time complexity of ML-GCN with K layers results from the addition of two terms: one corresponding to the propagation steps, which is O(K · nnz(A_sup) · f), where nnz(A_sup) is the number of non-zero entries in the supra-adjacency matrix A_sup, and the other corresponding to the feature transformation steps, which is O(K · N · ℓ · f^2). Therefore, the total cost of ML-GCN is O(K · nnz(A_sup) · f + K · N · ℓ · f^2). The time complexity of ML-GAT also takes into account the computation of the attention coefficients: given Q attention heads, the time complexity of ML-GAT with K layers is O(K · Q · N · ℓ · f^2 + K · Q · |E_L| · f), where the first term concerns the feature transformation steps, and the second term corresponds to the cost of a general attention mechanism. Note that the computation of the attention coefficients can be parallelized both for the intra-layer and inter-layer edges, as can the computation of the Q attention heads.
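As a back-of-the-envelope illustration of the two bounds, the dominant operation counts can be tallied as follows (hypothetical helper functions; constants and lower-order terms are omitted, so the returned values are big-O surrogates rather than exact costs):

```python
def ml_gcn_ops(K, nnz_sup, N, ell, f):
    """Rough operation count for K ML-GCN layers: sparse propagation
    over the supra-adjacency matrix plus dense feature transformations."""
    propagation = K * nnz_sup * f
    transformation = K * N * ell * f * f
    return propagation + transformation

def ml_gat_ops(K, Q, N, ell, f, num_supra_edges):
    """Rough operation count for K ML-GAT layers with Q attention heads:
    feature transformations plus per-edge attention coefficients."""
    transformation = K * Q * N * ell * f * f
    attention = K * Q * num_supra_edges * f
    return transformation + attention
```

Such tallies make it easy to see, for instance, that doubling the feature dimension f grows the transformation term quadratically but the propagation and attention terms only linearly.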
Concerning the spatial complexity, modeling inter-layer dependencies in the propagation rule has the overhead of storing the whole supra-adjacency matrix. Moreover, we need to take into account the hidden states and the weight matrices. More precisely, the memory requirement during the training stage for ML-GCN is O(K · f^2 + K · N · f), whereas for the multi-head attention of ML-GAT this cost is multiplied by a factor Q. Furthermore, storing the attention function values requires an overhead of O(Q · |E_L|). It is also worth noticing that, to improve the scalability of our implementations, we could learn […].

Evaluation on real-world node-attributes and arbitrary inter-layer edges: the Koumbia multilayer network testbed
A major strong point of the proposed ML-GAT and ML-GCN approaches is that they are designed to deal with general multilayer networks, i.e., with arbitrary inter-layer edges, and to exploit external information in the form of attributes associated with the entities. In this section, we present a further evaluation stage that aims to stress our methods by evaluating them on a real-world attributed multilayer network, i.e., the Koumbia multilayer network. By focusing on this network, we delve into the understanding of the impact of using real-world attributes of the entities (cf. "Data" section and Fig. 5a) on the entity classification task in a practical application context. Moreover, based on the geo2net technique described in the "Data" section, we take into account different versions of the network with a varying number of layers (i.e., 2, 5, 10, 15, 20) and, since these networks include inter-layer edges between each pair of layers, we will also evaluate how the proposed approach is able to manage an increasing number of inter-layer edges.
To this purpose, we compare the performance of our methods under three different scenarios relating to the input features associated with the various Koumbia networks with different numbers of layers: the real attributes originally associated with the Koumbia entities, attributes in the form of an identity matrix, and attributes modeled with normal distributions. The three modalities will be denoted with the suffixes Fon, Foff, and normal, respectively. Note that the experiments with the normal distributions have different results with respect to the ones reported in Table 5, since in that case the multiplex version of the network was taken into account (i.e., without considering inter-layer edges).
Tables 11 and 12 show the average accuracy and mean reciprocal rank (MRR), averaged over 20 runs, obtained on the Koumbia networks with Fon, Foff and normal types of attributes, for ML-GAT and ML-GCN, respectively. A detailed plot of the variations of accuracy with respect to the number of layers is also shown in Fig. 7. It can be noted that the Fon versions always obtain significantly better results than the Foff and normal ones for both methods, thus confirming that exploiting real-world node-attributes is indeed beneficial for the entity classification task and, more importantly, that the proposed framework is able to correctly exploit such external information in the form of node attributes.
ML-GAT and ML-GCN obtain similar performances on the Fon networks (accuracy around 0.94 and MRR around 0.97). The performance scores also show robustness with respect to the number of layers, and hence of inter-layer edges, in the network. Note that the slightly lower performance obtained for Koumbia-2 is actually not surprising, as in that case the two layers have disjoint node-sets, which negatively affects the performance of the multilayer approaches.
It is also interesting to notice that, while ML-GAT performs similarly on the Foff and normal versions (thanks to the attention mechanism), ML-GCN shows significantly better performance on the normal version than on the Foff one. This result is in line with previous studies (Kipf and Welling 2017; Velickovic et al. 2018), and, in this specific case, it can also be explained by the fact that the normal distribution can be a relatively good approximation of the real one (cf. Fig. 5).
In order to further analyze the benefit of the attention mechanism exploited by ML-GAT against the convolutional approach used in ML-GCN, we perform a further analysis stage, where we evaluate the performance of the two approaches w.r.t. an increasing number of hidden layers K in the neural network (not to be confused with the layers of the multilayer network). Figure 8 shows the accuracy achieved by ML-GAT and ML-GCN for increasing K, and with different dimensions of the embedding space, i.e., d = {32, 128}. It can be noted that, while for K = {1, 2} all methods obtain similar performance, ML-GCN tends to decrease in accuracy for higher K values; particularly, ML-GCN accuracy decreases by about 7% when increasing K from 2 to 3 with d = 32, and from 5 to 6 with d = 128. We explain this behavior by the fact that a higher number of convolutional layers smooths the difference between intra-layer and inter-layer neighborhoods, which hence might be treated equally in this process. Conversely, the attention mechanism in ML-GAT is considerably more robust to this phenomenon, as revealed by the nearly constant performance of ML-GAT even with high values of K.

Conclusions: discussion and future work
We proposed a GNN framework for representation learning and semi-supervised classification in multilayer networks with attributed entities, and with arbitrary number of layers and intra-layer and inter-layer connections between nodes. We instantiated our framework through two new formulations of GAT and GCN models, specifically designed for the above general, attributed multilayer networks. We evaluated our ML-GAT and ML-GCN methods on real-world network datasets coming from different domains and with different structural characteristics. Our results showed that ML-GAT and ML-GCN models are significantly faster learners than the competitors, and they outperform in accuracy both the competitors and baseline methods especially on arbitrary multilayer networks, with large number of entities and layers. Furthermore, as demonstrated by the evaluation on Koumbia multilayer networks, derived from satellite images, our methods are able to take advantage of the presence of real attributes for the entities, in addition to arbitrary inter-layer connections between the nodes in the various layers.
Comparing the GAT and GCN approaches, we observed that, unlike ML-GCN, ML-GAT performance is not affected when networks are structured as disconnected layers or when most layers tend to contain nodes of the same label. By contrast, ML-GCN tends to be more robust than ML-GAT when the network shows relatively high density and average degree.
Nevertheless, the approach of integrating within-layer and outside-layer neighborhoods shared by both ML-GCN and ML-GAT might not be well-suited to effectively learn from multilayer networks where the various layers show assortativity patterns different from each other according to the entity class labels; e.g., a 2-layer network with gender as entity class, such that the first layer is assortative by gender and the second layer shows reverse assortativity by gender. To overcome this limitation, it would be interesting to revise the cross-layer aggregation component in terms of a GNN model as well, and to investigate whether this approach would be more effective than simply weighing the embeddings from each particular layer of the network. As a related aspect, the above