Sequence-to-sequence modeling for graph representation learning
Applied Network Science volume 4, Article number: 68 (2019)
Abstract
We propose sequence-to-sequence architectures for graph representation learning in both supervised and unsupervised regimes. Our methods use recurrent neural networks to encode and decode information from graph-structured data. Recurrent neural networks require sequences, so we choose several methods of traversing graphs using different types of substructures with various levels of granularity to generate sequences of nodes for encoding. Our unsupervised approaches leverage long short-term memory (LSTM) encoder-decoder models to embed the graph sequences into a continuous vector space. We then represent a graph by aggregating its graph sequence representations. Our supervised architecture uses an attention mechanism to collect information from the neighborhood of a sequence. The attention module lets the model focus on the subgraphs that are crucial for a graph classification task. We demonstrate the effectiveness of our approaches by showing improvements over the existing state-of-the-art approaches on several graph classification tasks.
Introduction
We address the problem of comparing and classifying graphs by learning their latent representation. This problem arises in many domain areas, including bioinformatics, social network analysis, chemistry, neuroscience, and computer vision. For instance, in neuroscience, comparing brain networks represented by graphs helps to identify brains with neurological disorders (Van Wijk et al. 2010). In social network analysis, we may need to compare ego-networks to detect anomalies (Akoglu et al. 2010) or to identify corresponding ego-networks across multiple social networks. Cutting across domains, we may be interested in understanding how to distinguish the structure of a social network from that of a biological network, an authorship network, a computer network, or a citation network (Newman 2003).
For many years, graph kernels (Vishwanathan et al. 2010; Gärtner et al. 2003; Shervashidze et al. 2009, 2011; Borgwardt and Kriegel 2005), feature extraction methods (Berlingerio et al. 2012; Macindoe and Richards 2010; Yan and Han 2002) and graph matching algorithms (Bunke 2000; Riesen et al. 2010) have been the approaches of choice for graph comparison. Recently, remarkable advancements have been made in developing methods for embedding nodes (Grover and Leskovec 2016; Perozzi et al. 2014; Tang et al. 2015; García-Durán and Niepert 2017), edges (Chen et al. 2018; Trivedi et al. 2017; Rossi et al. 2018) and subgraphs (Narayanan et al. 2016; Yanardag and Vishwanathan 2015; Adhikari et al. 2017) into low-dimensional spaces. Moreover, several other approaches (Niepert et al. 2016; Zhang et al. 2018; Duvenaud et al. 2015; Gilmer et al. 2017; Li et al. 2015b; Lee et al. 2018b) have been proposed to learn graph representations for a supervised task such as graph classification or regression. These approaches are aimed at optimizing the performance of the classification task, rather than learning the explicit topology of a graph.
There is still a lack of well-performing approaches for learning the representation of an entire graph. Several challenges need to be addressed in this area. First, the choice of the subgraph structures to be incorporated in graph representation learning has a significant impact on the expressive power of the embedding of an entire graph. Second, choosing the appropriate granularity level of this substructure (e.g., whether to include first- or second-order neighborhoods of a node when building node sequences), so that the graph embedding preserves the relevant structure, is an open problem. The choice may depend on many factors, such as the graph domain, scale, density, and its various structural properties. The types of substructures, from fine to coarse, such as nodes, edges, trees, graphlets, random walks, and communities, can capture local and global features of the graph. The question is which types of substructures, at which level of granularity, are informative enough to capture the general graph structure and recognize similarity between graphs while reducing the loss of information. An additional challenge is, of course, the efficiency of learning the representation of the substructures and aggregating them into a graph embedding. In this work, we investigate these challenges within the context of our proposed architectures.
Moreover, most recent studies (Ying et al. 2018; Niepert et al. 2016; Zhang et al. 2018; Duvenaud et al. 2015; Li et al. 2015b) focus on Graph Neural Networks (GNNs) to investigate the graph representation learning problem. Gilmer et al. (2017) introduced a message passing framework and explored the family of GNNs based on message propagation over the graph topology. In this group of approaches, the hidden representations of the nodes are iteratively updated, using differentiable functions, from the hidden representations of their neighbors. Several concerns regarding scalability and training time arise, since these approaches rely on several iterations of message passing using the entire adjacency matrix. Therefore, in this work we investigate whether it is possible to obtain efficient graph representations by sampling a set of sequences from the graph and relying solely on them to find a hidden representation for the whole graph structure. Recently, Lee et al. (2018b) proposed an approach for graph representation learning using embedding sequences. However, their approach is designed in a supervised manner and is not applicable without task-specific supervision. In this work, we study graph representation learning techniques using recurrent models in both supervised and unsupervised regimes.
In unsupervised representation learning, we only consider a setting in which we have knowledge about the graph structure, such as an adjacency matrix as well as node and edge properties. In supervised representation learning, we assume knowledge about the supervised task, such as graph classification labels or regression values, in addition to the information about the graph structure. In both settings, we use effective models based on a sequence-to-sequence learning framework and demonstrate the capability of our models in learning graph representations with or without incorporating task-specific supervision information.
In the unsupervised regime, we leverage the LSTM sequence-to-sequence learning framework of Sutskever et al. (2014), which uses one LSTM to encode the input sequence into a vector and another LSTM to generate the output sequence from that vector. In some variations of our models, we use the same sequence as both input and output, making this a sequence-to-sequence LSTM autoencoder (Li et al. 2015a). We consider several types of substructures including random walks, shortest paths, and breadth-first search, with various levels of node neighborhood granularity to prepare the input to our models. An unsupervised graph representation approach can be used not only in processing labeled data, such as in graph classification in bioinformatics, but also in many practical applications, such as anomaly detection in social networks or streaming data, as well as in exploratory analysis and scientific hypothesis generation. An unsupervised method for learning graph representations provides a fundamental capability to analyze graphs based on their intrinsic properties. Figure 1 shows the result of using our unsupervised approach to embed the graphs in the Proteins dataset (Borgwardt et al. 2005) into a 100-dimensional space, visualized with t-SNE (Maaten and Hinton 2008). Each point is one of the 1113 graphs in the dataset. The two class labels were not used when learning the graph representations, but we still generally find that graphs of the same class are clustered in the learned space.
In the supervised regime, we propose a supervised version of a sequence-to-sequence learning framework to predict graph classification labels. The supervised model is inspired by the best variation of our unsupervised framework and utilizes the neighborhood of a sequence for the graph classification task. One recurrent neural network is used to gather information from a sequence and another recurrent model is applied to incorporate the information from the neighborhood of the sequence into the graph representation learning. Moreover, our architecture is equipped with a two-level attention mechanism to improve classification performance by leveraging local and global information from the neighborhoods of nodes in a sequence. Recent graph representation approaches (Lee et al. 2018a, b; Veličković et al. 2018) use an attention mechanism to explore the neighbors of a node and detect the most informative ones. In contrast, our proposed architecture utilizes two levels of attention to capture more global information from the neighborhood of the entire sequence in addition to the neighbors of a node.
We demonstrate the efficacy of our supervised and unsupervised approaches in classification tasks for both labeled and unlabeled graphs. Our approach outperforms the state-of-the-art methods on nearly all the considered datasets.
Related work
We discuss prior and relevant work in developing unsupervised and supervised methods for graph comparison.
Unsupervised
In the unsupervised group, existing graph comparison methods can be categorized into three main (not necessarily disjoint) classes: feature extraction, graph kernels, and graph matching. Feature extraction methods compare graphs across a set of features, such as specific subgraphs or numerical properties that capture the topology of the graphs (Berlingerio et al. 2012; Macindoe and Richards 2010; Yan and Han 2002). The efficiency and performance of such methods are highly dependent on the feature selection process. Most graph kernels (Vishwanathan et al. 2010) are based on the idea of R-convolution kernels (Haussler 1999), a way of defining kernels on structured objects by decomposing the objects into substructures and comparing pairs in the decompositions. For graphs, the substructures include graphlets (Shervashidze et al. 2009), shortest paths (Borgwardt and Kriegel 2005), random walks (Gärtner et al. 2003), and subtrees (Shervashidze et al. 2011). Recently, several new graph kernels such as the Deep Graph Kernel (DGK) (Yanardag and Vishwanathan 2015), the optimal-assignment Weisfeiler-Lehman kernel (WL-OA) (Kriege et al. 2016), the Pyramid Match kernel (PM) (Nikolentzos et al. 2017) and the local WL label kernel (LWL) (Morris et al. 2017) have been proposed and evaluated for graph classification tasks. DGK (Yanardag and Vishwanathan 2015) uses methods from unsupervised learning of word embeddings to augment the kernel with substructure similarity. WL-OA (Kriege et al. 2016) is an assignment kernel that finds an optimal bijection between different parts of the graph. The PM kernel (Nikolentzos et al. 2017) finds an approximate correspondence between the sets of vectors of the two graphs. LWL (Morris et al. 2017) is a kernel that considers both local and global features of the graph. While graph kernel methods are effective, their time complexity is quadratic in the number of graphs, and there is no opportunity to customize their representations for supervised tasks.
Graph matching algorithms use the topology of the graphs, their nodes and edges directly, counting matches and mismatches (Bunke 2000; Riesen et al. 2010). These approaches do not consider the global structure of the graphs and are sensitive to noise.
In addition to the three categories above, Graph2vec (Narayanan et al. 2017) is an unsupervised method inspired by document embedding models. This approach finds a representation for a graph by maximizing the likelihood of existing graph subtrees given the graph embedding. Our approach outperforms this method by a large margin. Because Graph2vec considers only subtrees as graph representatives, it does not capture global information in the graph structure, which ultimately affects its performance. Other methods have been developed to learn representations for individual nodes in graphs, such as DeepWalk (Perozzi et al. 2014), node2vec (Grover and Leskovec 2016), and many others. These methods are not directly related to graph comparison because they require aggregating node representations to represent entire graphs. However, we compare to one such representative method in our experiments to show that the aggregation of node embeddings is not informative enough to represent the structure of a graph.
Supervised
We now discuss supervised methods that learn graph representations for the graph classification task. The representations obtained by these approaches are tailored for a supervised task and are not based solely on the graph topology. Most existing approaches (Niepert et al. 2016; Zhang et al. 2018; Ying et al. 2018; Gilmer et al. 2017; Duvenaud et al. 2015; Li et al. 2015b; Bruna et al. 2013; Henaff et al. 2015; Defferrard et al. 2016; Scarselli et al. 2009) are variations of Graph Neural Networks (GNNs) and rely on the idea of message propagation among neighbors. Niepert et al. (2016) developed a framework (PSCN) to learn graph representations by defining receptive fields of neighborhoods and using canonical node ordering. Deep Graph Convolutional Neural Network (DGCNN) (Zhang et al. 2018) is another model that extracts multi-scale node features and applies a consistent pooling layer on unordered nodes. The main difference between DGCNN and PSCN is the way they deal with the node-ordering problem. We compare our models to them in our experiments, outperforming them on all datasets. Ying et al. (2018) proposed a hierarchical representation learning framework via hierarchical GNN pooling layers. Gilmer et al. (2017) proposed a message passing neural network framework, and explored the existing supervised approaches (Gilmer et al. 2017; Duvenaud et al. 2015; Li et al. 2015b) that have been recently used for graph-structured data in chemistry applications, such as molecular property prediction. Duvenaud et al. (2015) introduced a GNN to create “fingerprints” (vectors that encode molecule structure) for graphs derived from molecules. The information about each atom and its neighbors is fed to the neural network, and neural fingerprints are used to predict new features for the graphs. Bruna et al. (2013) proposed spectral networks, generalizations of GNNs on low-dimensional graphs via graph Laplacians. Henaff et al. (2015) and Defferrard et al. (2016) extended spectral networks to high-dimensional graphs. Scarselli et al. (2009) proposed a GNN which extends recursive neural networks and finds node representations using random walks. Li et al. (2015b) extended GNNs with gating recurrent neural networks to predict sequences from graphs. In general, neural message passing approaches can suffer from high computational and memory costs, since they perform multiple iterations of updating hidden node states in graph representations. In contrast, our supervised approach obtains strong performance without the requirement of passing messages between nodes for multiple iterations.
Lee et al. (2018b) proposed the Graph Attention Model (GAM), which uses recurrent neural networks for the graph classification task. The goal is to train a model that is able to distinguish between different classes of graphs. This approach is limited to the graph classification application and cannot address the problems of graph comparison and graph representation learning without task-specific supervision. GAM captures the graph representation by traversing the graph via attention-guided walks: the approach decides how to guide a walk while traversing the graph by choosing the nodes that are more informative for the purpose of graph classification. Finally, the nodes selected during the graph traversal and their corresponding attributes are used for graph classification. In contrast, our architecture is flexible enough to work with different choices of extracted sequences, such as BFS, shortest paths, and random walks, and to benefit from various types of substructures in learning to represent the graphs. GAM mainly focuses on local information obtained via an attention mechanism on the neighbors of nodes and is not capable of considering the global structure of the graph. By using extracted sequences with different orders of neighborhood granularity and applying a two-level attention mechanism, we can retrieve information from both the local and the global structure of the graph, which ultimately improves performance.
Background
We briefly discuss the required background about graphs and LSTM recurrent neural networks.
Graphs. For a graph G=(V,E), V denotes its node set and E⊆V×V denotes its edge set. Edge e∈E is a pair of nodes (v,v^{′})∈V×V and represents an undirected edge. The set of neighbors for a node v is Nbrs(v)={v^{′} | (v,v^{′})∈E}. G is called a labeled graph if there is a labeling function label:V→L that assigns a label from a set of labels L to each node. The graph G is called an unlabeled graph if no labels have been assigned to its nodes.
Long short-term memory (LSTM). An LSTM (Hochreiter and Schmidhuber 1997) is a recurrent neural network (RNN) designed to model long-distance dependencies in sequential data. RNNs are a class of neural networks that use their internal hidden states to process sequences of inputs. We denote the input vector at time t by x_{t} and we denote the hidden vector computed at time t by h_{t}. At each time step, an LSTM computes a memory cell vector c_{t}, an input gate vector i_{t}, a forget gate vector f_{t}, and an output gate vector o_{t}:
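Assuming the standard peephole LSTM formulation, which matches the W, U, K, and b parameters named in this section (the exact parameterization is an assumption), the updates can be written as:

\[
\begin{aligned}
i_{t} &= \sigma(W_{i} x_{t} + U_{i} h_{t-1} + K_{i} c_{t-1} + b_{i}) \\
f_{t} &= \sigma(W_{f} x_{t} + U_{f} h_{t-1} + K_{f} c_{t-1} + b_{f}) \\
c_{t} &= f_{t} \odot c_{t-1} + i_{t} \odot \tanh(W_{c} x_{t} + U_{c} h_{t-1} + b_{c}) \\
o_{t} &= \sigma(W_{o} x_{t} + U_{o} h_{t-1} + K_{o} c_{t} + b_{o}) \\
h_{t} &= o_{t} \odot \tanh(c_{t})
\end{aligned}
\tag{1}
\]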
where ⊙ denotes element-wise multiplication, σ is the logistic sigmoid function, each W is a weight matrix connecting inputs to particular gates (denoted by subscripts), each U is an analogous matrix connecting hidden vectors to gates, each K is a diagonal matrix connecting cell vectors to gates, and each b is a bias. We refer to this as an “LSTM encoder” because it converts an input sequence into a sequence of hidden vectors h_{t}. We will also use a type of LSTM that predicts the next item \(\overline {x}\) in the sequence from h_{t}. This architecture, which we refer to as an “LSTM decoder,” adds the following:
\[\overline{x} = g(h_{t}) \tag{2}\]
where g is a function that takes a hidden vector h_{t} and outputs a predicted observation \(\overline {x}\). With symbolic data, g typically computes a softmax distribution over symbols and returns the max-probability symbol. When using continuous inputs, g could be an affine transform of h_{t} followed by a nonlinearity.
Approach overview
We propose unsupervised and supervised architectures for learning representations of labeled and unlabeled graphs. In our unsupervised architecture, the goal is to learn graph embeddings such that graphs with similar structure lie close to one another in the embedding space. We seek to learn a mapping function \(\Phi :G \to \mathbb {R}^{k}\) that embeds a graph G into a k-dimensional space. We are interested in methods that scale linearly in the number of graphs in a dataset (as opposed to graph kernel methods that require quadratic time).
Our approach uses encoder-decoder models, particularly autoencoders (Hinton and Zemel 1993), which form one of the principal frameworks of unsupervised learning. Autoencoders are typically trained to reconstruct their input in a way that learns useful properties of the data. There are two parts to an autoencoder: an encoder that maps the input to some intermediate representation, and a decoder that attempts to reconstruct the input from this intermediate representation. We need to decide how to represent graphs in a form that can be encoded and then reconstructed. We do this by extracting node sequences with various levels of granularity (increasing orders of node neighborhoods) from the graphs. “Generating sequences from graphs” section describes several methods for doing this. Given node sequences from a graph, we then need an encoding-decoding framework that can handle variable-length sequences. We choose LSTMs for both the encoder and decoder, forming an LSTM autoencoder (Li et al. 2015a). LSTM autoencoders use one LSTM to read the input sequence and encode it to a fixed-dimensional vector, and then use another LSTM to decode the output sequence from the vector. We consider several variations for the encoder and decoder, described in “Sequence-to-sequence encoder-decoder” section. We experiment with two training objectives, described in “Training” section.
Given the trained encoder LSTM_{enc}, we define the graph embedding function Φ(G) as the mean of the vectors output by the encoder over Seq(G), the set of graph sequences extracted from G:
\[\Phi(G) = \frac{1}{|Seq(G)|} \sum_{s \in Seq(G)} LSTM_{enc}(s) \tag{3}\]
Using the mean outperformed max pooling in our experiments, so we only report results using the mean in this paper. We use Φ to represent graphs in our experiments in “Experiments” section, demonstrating state-of-the-art performance for several graph classification tasks.
Our supervised model uses a sequence-to-sequence framework to capture the graph structure with the supervision of a graph classification task. We train our model to predict the label of a graph using randomly selected sets of node sequences over multiple iterations. The model leverages the node embeddings learned by our unsupervised encoder-decoder models in learning to predict graph labels. We describe our supervised approach in “Supervised graph representation learning” section and compare it with other supervised and unsupervised approaches in “Experiments” section.
Generating sequences from graphs
The input of our model is a set of node sequences generated from different types of graph substructures such as Random Walks, Shortest Paths, and Breadth-First Search. Different substructures capture varying aspects of the local vs. global topology of the graph. Moreover, we incorporate information about the neighborhood of a node sequence in order to preserve more information about the region of the graph that is traversed by the sequence. The neighborhood of a sequence consists of the neighbors of nodes in that sequence. In order to incorporate the neighborhood of a sequence in graph representation learning, we assign a label to each node in the sequence based on its neighborhood. The order of the neighborhood (which we call granularity) determines whether it covers the immediate neighbors (first-order neighborhood), the neighbors of neighbors as well (second-order neighborhood), and so on. For instance, the k^{th}-order granularity for a sequence s: v_{1},…,v_{|s|} considers all the nodes in the graph that have a distance of k or less from some v_{i}∈s. We examine the impact of various substructures and orders of granularity on our models.
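The k-th order neighborhood of a sequence can be illustrated with a multi-source breadth-first search. This is a minimal sketch, not the authors' implementation; the function name `sequence_neighborhood` and the adjacency-list dictionary `adj` are hypothetical.

```python
from collections import deque


def sequence_neighborhood(adj, seq, k):
    """Return all nodes within distance k of at least one node in seq."""
    dist = {v: 0 for v in seq}      # every node of the sequence is a source
    queue = deque(dist)             # de-duplicated sources
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue                # do not expand beyond k hops
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


# Hypothetical path graph 0-1-2-3-4-5 stored as adjacency lists.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
first_order = sequence_neighborhood(adj, [0, 1], 1)
second_order = sequence_neighborhood(adj, [0, 1], 2)
```

With k=0 the function returns only the nodes of the sequence itself, which corresponds to the 0-order case.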
Types of substructures
Random Walks (RW): Given a source node u, we generate a random walk w_{u} with fixed length m. Let v_{i} denote the i-th node in w_{u}, starting with v_{0}=u. The node v_{t+1} is drawn from Nbrs(v_{t}) with probability 1/deg(v_{t}), where deg(v_{t}) is the degree of v_{t}. (This is a 0-order Random Walk in our context. A higher-order Random Walk replaces nodes with higher-order neighborhoods; see the following “Substructure granularity” section.)
Shortest Paths (SP): We generate all the shortest paths between each pair of nodes in the graph using the Floyd-Warshall algorithm (Floyd 1962).
Breadth-First Search (BFS): We run the BFS algorithm at each node to generate graph sequences for that node. The graph sequences for the graph include the BFS sequences starting at each node in the graph, limited to a maximum number of edges from the starting node. We give details on the maximum used in our experiments below.
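The three substructure types above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: graphs are unweighted adjacency-list dictionaries, `length` counts nodes rather than edges, and the Floyd-Warshall variant recovers one shortest path per ordered node pair. All function names are hypothetical.

```python
import random
from collections import deque


def random_walk(adj, source, length, rng=random):
    """Uniform random walk; each neighbor is chosen with probability 1/deg."""
    walk = [source]
    while len(walk) < length and adj[walk[-1]]:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def shortest_paths(adj):
    """Floyd-Warshall with next-hop tables; one path per ordered pair."""
    nodes = list(adj)
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    nxt = {u: {v: None for v in nodes} for u in nodes}
    for u in nodes:
        for v in adj[u]:
            if u != v:
                dist[u][v], nxt[u][v] = 1, v
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    paths = []
    for i in nodes:
        for j in nodes:
            if i != j and nxt[i][j] is not None:
                path = [i]
                while path[-1] != j:
                    path.append(nxt[path[-1]][j])
                paths.append(path)
    return paths


def bfs_sequence(adj, start, max_depth):
    """Nodes in BFS visiting order, at most max_depth edges from start."""
    depth = {start: 0}
    order = [start]
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if depth[u] == max_depth:
            continue
        for w in adj[u]:
            if w not in depth:
                depth[w] = depth[u] + 1
                order.append(w)
                queue.append(w)
    return order


# Hypothetical tree 2-0-1-3-4.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
walk = random_walk(adj, 0, 6, random.Random(7))
paths = shortest_paths(adj)
bfs_sequences = {v: bfs_sequence(adj, v, 2) for v in adj}
```

Passing a seeded `random.Random` instance keeps the sampled walks reproducible across runs.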
Substructure granularity
Each of the sequences defined in the previous section can be a sequence of nodes at different orders of granularity. For example, a Random Walk or a Shortest Path can be a sequence of nodes with 0-order granularity, a sequence of nodes with first-order granularity, a sequence of nodes with second-order granularity, and so on.
We use the Weisfeiler-Lehman (WL) algorithm (Weisfeiler and Lehman 1968) to generate labels for nodes, encoding the node neighborhood information (of the corresponding order of granularity) in the label. This algorithm is typically used as a graph isomorphism test. It is known as an iterative node classification or node refinement procedure (Shervashidze et al. 2011). The WL algorithm uses multiset labels to encode the local structure of the graphs. The idea is to create a multiset label for each node using the sorted list of its neighbors’ labels. Then, the sorted list is compressed into a new value. The WL algorithm can iterate this labeling process and add higher-order neighbors to the neighborhood list at each iteration. This labeling process continues until the new multiset labels of graphs in the dataset are different or the number of iterations reaches a specified limit. Each iteration increases the order of substructure granularity by expanding the node neighborhood. The WL labels can enrich the information provided by a node, regardless of whether the original graph is labeled or unlabeled.
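One round of the relabeling described above can be sketched as below. This is a simplified single-graph sketch: the function name `wl_labels` is hypothetical, initial labels are assumed to be integers (for example, all zeros for an unlabeled graph), and a real pipeline would share the compression table across every graph in the dataset.

```python
def wl_labels(adj, labels, iterations):
    """Return one label dict per order of granularity, from 0 to iterations."""
    history = [dict(labels)]
    for _ in range(iterations):
        prev = history[-1]
        # Multiset label: own label plus the sorted labels of the neighbors.
        signatures = {
            v: (prev[v], tuple(sorted(prev[w] for w in adj[v])))
            for v in adj
        }
        # Compress every distinct multiset label into a new integer label.
        distinct = sorted(set(signatures.values()))
        table = {sig: new_label for new_label, sig in enumerate(distinct)}
        history.append({v: table[signatures[v]] for v in adj})
    return history


# Hypothetical unlabeled path graph 0-1-2; every node starts with label 0.
adj = {0: [1], 1: [0, 2], 2: [1]}
history = wl_labels(adj, {v: 0 for v in adj}, iterations=2)
```

After one iteration the two end nodes share a label while the middle node receives a different one, reflecting their different first-order neighborhoods.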
Each unique WL label is converted to a parameter vector in a k-dimensional space where each entry is initialized with a draw from a uniform distribution with range [−1,1]. Overall, depending on the granularity that we consider for nodes, we may include a different number of parameter vectors in the model. WL parameter vectors are updated during training in addition to the model parameters. For each node v in a sequence, Emb(v) denotes the parameter vector assigned to the corresponding WL label of node v.
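As a small illustration of this initialization, the lookup table from WL labels to parameter vectors might be built as follows. The names `init_embeddings` and `emb` are hypothetical, and plain Python lists stand in for trainable tensors.

```python
import random


def init_embeddings(wl_label_set, k, rng=random):
    """Map each unique WL label to a k-dimensional vector drawn from U[-1, 1]."""
    return {
        label: [rng.uniform(-1.0, 1.0) for _ in range(k)]
        for label in sorted(wl_label_set)
    }


emb = init_embeddings({0, 1, 2}, k=4, rng=random.Random(0))
```

In a deep learning framework these vectors would be registered as trainable parameters so that they are updated together with the LSTM weights.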
Sequence-to-sequence encoder-decoder
We discussed three types of substructures used for extracting sequences from graphs and our approach for embedding nodes considering the order of their neighborhoods. We now describe how we will learn our graph embedding functions in our unsupervised setting.
We formulate graph representation learning as training an encoder-decoder on node sequences generated from graphs. The most common type of encoder-decoder is a feed-forward deep neural network, but those suffer from the limitation of requiring fixed-length inputs and an inability to model sequential data. Therefore, we focus in this paper on sequence-to-sequence encoder-decoder architectures, which can support arbitrary-length sequences.
These encoder-decoder models are based on the sequence-to-sequence learning framework of Sutskever et al. (2014), an LSTM-based architecture in which both the inputs and outputs are sequences of variable length. The architecture uses one LSTM as the encoder LSTM_{enc} and another LSTM as the decoder LSTM_{dec}. An input sequence s with length m is given to LSTM_{enc} and its elements are processed one per time step. The hidden vector h_{m} at the last time step m is the fixed-length representation of the input sequence. This vector is provided as the initial vector to LSTM_{dec} to generate the output sequence.
We propose four different versions of sequence-to-sequence encoder-decoder models to investigate their ability to capture the graph structure. Three of the four encoder-decoder models adapt the sequence-to-sequence learning framework for autoencoding simply by using the same sequence for both the input and output. The autoencoders are trained so that LSTM_{dec} reconstructs the input using the final hidden vector from LSTM_{enc}.
In our experiments, we use several graph datasets. We train a single encoder-decoder for each graph dataset. The encoder-decoder is trained on a training set of graph sequences pooled across all graphs in the dataset. After training the encoder-decoder, we obtain the representation Φ(G) for a single graph G by encoding its sequences s∈Seq(G) using LSTM_{enc}, then averaging its encoding vectors, as in Eq. (3).
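The averaging step can be sketched independently of the encoder. In this minimal sketch, `encode` is a placeholder for the trained LSTM_{enc}; `toy_encode` is a hypothetical stand-in used only to make the example runnable, not an LSTM.

```python
def graph_embedding(sequences, encode):
    """Mean of the encoder outputs over all sequences extracted from a graph."""
    vectors = [encode(seq) for seq in sequences]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def toy_encode(seq):
    """Hypothetical stand-in encoder: [sequence length, sum of node ids]."""
    return [float(len(seq)), float(sum(seq))]


phi = graph_embedding([[0, 1], [1, 2, 3]], toy_encode)
```

Because every graph is reduced to one fixed-size vector, the cost of embedding a dataset grows linearly with the number of graphs.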
Encoder-decoder variations
S2S-AE: This is the standard sequence-to-sequence autoencoder inspired by Li et al. (2015a), which we customize for embedding graphs. Figure 2 shows an overview of this model. We use \(h_{t}^{enc}\) to denote the hidden vector at time step t in LSTM_{enc} and \(h_{t}^{dec}\) to denote the hidden vector at time step t in LSTM_{dec}. We define shorthand for Eq. 1 as follows:
\[h_{t}^{enc} = LSTM_{enc}(Emb(v_{t}), h_{t-1}^{enc}) \tag{4}\]
where Emb(v_{t}) takes the role of x_{t} in Eq. 1. The hidden vector at the last time step \(h_{last}^{enc} \in \mathbb {R}^{m}\) denotes the representation of the input sequence, and is used as the hidden vector of the decoder at its first time step:
\[h_{0}^{dec} = h_{last}^{enc} \tag{5}\]
The last cell vector of the encoder is copied over in an analogous way. Then each decoder hidden vector \(h_{t}^{dec}\) is computed based on the hidden vector and the node embedding from the previous time step:
\[h_{t}^{dec} = LSTM_{dec}(Emb(v_{t-1}), h_{t-1}^{dec}) \tag{6}\]
The decoder uses \(h_{t}^{dec}\) to predict the next node embedding \(\overline {Emb({v_{t}})}\) as in Eq. 2. We have two different loss functions to test with this model. First, we consider the node embeddings fixed and compute a loss based on the difference between the predicted node embedding \(\overline {Emb({v_{t}})}\) and the true one Emb(v_{t}). Second, we consider a parameter vector for each embedding and update the node embeddings in addition to the model parameters using a cross entropy function. We discuss training in “Training” section. For Emb(v_{0}), we use a vector of all zeroes.
S2S-AE-PP: In the previous model, LSTM_{dec} predicts the embedding of the node at time step t using \(h_{t-1}^{dec}\) as well as Emb(v_{t−1}), the true node embedding at time step t−1. However, this may enable the decoder to rely too heavily on the previous true node in the sequence, thereby making it easier to reconstruct the input and reducing the need for the encoder to learn an effective representation of the input sequence. We consider a variation (“S2S-AE-PP: sequence-to-sequence autoencoder previous predicted”) in which we use the previous predicted node \({Emb}(\overline {v_{t-1}})\) instead of the previous true one:
\[h_{t}^{dec} = LSTM_{dec}({Emb}(\overline{v_{t-1}}), h_{t-1}^{dec}) \tag{7}\]
This forces the encoder and decoder to work harder to respectively encode and decode the graph sequences. This variation is related to scheduled sampling (Bengio et al. 2015), in which the training process is changed gradually from using true previous symbols to using predicted previous symbols more and more during training. The difference between S2S-AE-PP and the previous model is indicated in Fig. 3.
S2S-AE-PP-WL1,2: This model is similar to S2S-AE-PP except that, for each node in the sequence, we use two different levels of neighborhood granularity. We incorporate the first-order neighborhoods and second-order neighborhoods (neighbors of neighbors) of nodes in learning the graph representation. We use \(x_{1_{t}}\) to denote the embedding of the label produced by one iteration of WL (for the first-order neighborhood) and \(x_{2_{t}}\) for that produced by two iterations of WL (for the second-order neighborhood). Equation 1 is modified to receive both as inputs. For example, the first line of Eq. 1 becomes:
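One way to write the modified input gate, assuming a separate input weight matrix for each granularity level (the superscripted names \(W_{i}^{(1)}\) and \(W_{i}^{(2)}\) are illustrative, not taken from the original):

\[
i_{t} = \sigma\left(W_{i}^{(1)} x_{1_{t}} + W_{i}^{(2)} x_{2_{t}} + U_{i} h_{t-1} + K_{i} c_{t-1} + b_{i}\right)
\]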
The other equations are changed analogously. The embeddings for both the first-order and second-order neighborhoods are learned. Figure 4 shows how this model integrates the information from neighborhoods in order to learn the graph representation.
S2S-N2N-PP: This model is a “neighbors-to-node” (N2N) prediction model and uses random walks as graph sequences (Fig. 5). The idea is to explore the neighborhood of a sequence with the encoder and to predict the sequence itself with the decoder, using the information gathered by the encoder. The encoder is encouraged to collect neighborhood information that distinguishes the sequence from other graph substructures, so that the decoder can reconstruct the original sequence from it. That is, each item in the input sequence is the set of neighbors (their embeddings are averaged) for the corresponding node in the output sequence:
\[h_{t}^{enc} = LSTM_{enc}\left(\frac{1}{|Nbrs(v_{t})|} \sum_{v^{\prime} \in Nbrs(v_{t})} Emb(v^{\prime}),\; h_{t-1}^{enc}\right)\]
where Nbrs(v) returns the set of neighbors of v and we predict the nodes in the random walk via the decoder as in Eq. 7. Unlike the other models, this model is not an autoencoder because the input and output sequences are not the same.
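The construction of the encoder input can be sketched as follows. This is only an illustration of averaging neighbor embeddings along a walk; the function name `neighbor_average`, the plain-list embeddings, and the zero vector used for a node without neighbors are assumptions, not details given in the text.

```python
def neighbor_average(walk, adj, emb):
    """For each node in the walk, return the mean embedding of its neighbors."""
    dim = len(next(iter(emb.values())))
    inputs = []
    for v in walk:
        nbrs = adj[v]
        if not nbrs:
            # Assumption: an isolated node contributes a zero vector.
            inputs.append([0.0] * dim)
            continue
        inputs.append(
            [sum(emb[u][d] for u in nbrs) / len(nbrs) for d in range(dim)]
        )
    return inputs


adj = {0: [1, 2], 1: [0], 2: [0]}
emb = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 3.0]}
encoder_inputs = neighbor_average([1, 0, 2], adj, emb)
```

The decoder targets remain the nodes of the walk itself (here 1, 0, 2), which is why the input and output sequences differ.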
Training
Let S be a sequence training set generated from a set of graphs. The representations of the sequences s∈S are computed using the encoders described in the previous section. We use two different loss functions to train our models: squared error and categorical cross entropy. The goal is to minimize the following loss functions, summed over all examples s∈S, where s:v_{1},…,v_{|s|}.
Squared error
We use the squared error (SE) loss function when the embeddings are fixed and are not treated as trainable parameters of the model. We include a nonlinear transformation to estimate the embedding of the t-th node in s from the hidden vector of the decoder at time step t:
where ReLU is the rectified linear unit activation function and W and b are additional parameters.
Given the predicted node embeddings for the sequence s, the squared error loss function computes the average of the element-wise squared differences between the input and output sequences:
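As a rough sketch of this loss, the code below applies a ReLU-activated affine map to each decoder hidden vector and averages the element-wise squared differences to the true embeddings. The helper names, the list-based vectors, and the choice to average over both time steps and embedding dimensions are assumptions made for illustration.

```python
def predict_embedding(hidden, W, b):
    """ReLU(W h + b): estimate a node embedding from a decoder hidden vector."""
    pre = [sum(w * h for w, h in zip(row, hidden)) + bias for row, bias in zip(W, b)]
    return [max(0.0, x) for x in pre]


def squared_error(true_embs, hidden_states, W, b):
    """Mean of element-wise squared differences over the whole sequence."""
    total, count = 0.0, 0
    for target, hidden in zip(true_embs, hidden_states):
        pred = predict_embedding(hidden, W, b)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
        count += len(target)
    return total / count


W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for a hand-checkable example
b = [0.0, 0.0]
loss = squared_error([[1.0, 1.0]], [[1.0, -1.0]], W, b)
```

In this toy case the prediction is [1.0, 0.0] (the negative entry is clipped by ReLU), so the loss is (0 + 1) / 2.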
Categorical cross entropy
We use the categorical cross entropy (CE) loss function for experiments in which we update the node embeddings during training. We predict the t-th node as follows:
where L is the set of labels. The loss computes the categorical cross entropy between the input embeddings and the predicted output embeddings:
where l denotes the true label of node v_{t}, and the predicted probability of the true label is computed as follows:
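A minimal sketch of this loss follows, assuming the model produces one score per label at every time step and that the predicted probability of the true label comes from a softmax over those scores. How the scores are computed from the decoder state is not shown here; the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_cross_entropy(scores_per_step, true_labels):
    """Sum over time steps of -log p(true label at that step)."""
    return -sum(
        math.log(softmax(scores)[label])
        for scores, label in zip(scores_per_step, true_labels)
    )


# Two labels with equal scores: the true label gets probability 0.5.
ce_loss = sequence_cross_entropy([[0.0, 0.0]], [0])
```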
Supervised graph representation learning
In this setting, graph representation learning is guided by task-specific supervision, which is incorporated into the learning process. Given a dataset D:{G_{1},...,G_{n}}, the goal is to learn a function GrLabel:D→L_{G} that assigns a label from a set of labels L_{G} to each graph in the dataset.
We introduce our supervised method inspired by the most effective unsupervised method proposed in the “Sequencetosequence encoderdecoder” section. As we will show in the “Experiments” section, S2SN2NPP outperforms the other methods in almost all experiments. Therefore, we build our supervised method on S2SN2NPP. We utilize one LSTM for processing a sequence (LSTM_{Seq}) and one bidirectional LSTM for processing the neighborhoods of the sequence (BiLSTM_{Nbh}). The neighborhoods of a sequence consist of the neighbors of nodes in the sequence. The hidden representation obtained by the BiLSTM_{Nbh} from the neighborhoods is given to the LSTM_{Seq} in order to incorporate it in graph representation learning. The BiLSTM_{Nbh} contains an attention mechanism to select the most informative neighborhoods for the purpose of graph classification.
Recently, several approaches have focused on applying attention mechanisms to graph-structured data (Lee et al. 2018a, b; Veličković et al. 2018). The attention mechanism in these methods is mainly designed to explore the neighbors of a node and focus on the most informative neighbors. In contrast, we utilize a two-level attention mechanism to improve our classification performance by leveraging local as well as global information from the neighborhoods of nodes in a sequence. The first-level attention mechanism, Att_{Nbr}, attends over the neighbors of a node to capture more relevant information from its neighbors. The second-level attention module, Att_{Nbh}, attends over the neighborhoods along a sequence of nodes to focus on the overall more informative neighborhoods. Our approach aims to capture more globally relevant information from the graph by the Att_{Nbh} in comparison with the Att_{Nbr}. Figure 6 shows the difference between the two levels of attention mechanism. Overall, given a sequence s, our approach includes two main components: (1) Neighborhood embedding to gather information from the neighborhoods of nodes in sequence s, and (2) Sequence embedding to represent the sequence s by incorporating information from its neighborhoods. We discuss these two components in the following. Figure 7 shows an overview of the proposed approach.
Neighborhood embedding
For each sequence s∈S, where s:v_{1},…,v_{|s|}, extracted in “Generating sequences from graphs” section, information from the neighbors of nodes in the sequence is gathered and passed through the BiLSTM_{Nbh}. An attention module Att_{Nbr} is utilized in order to gather local information from the neighbors of v_{i}∈s. The Att_{Nbr}(Nbrs(v_{i})) function decides which nodes in Nbrs(v_{i}) to attend to in order to enhance their impact during training. We rely on the attention mechanism in our model to relieve the classification task from the burden of considering all the nodes in Nbrs(v_{i}) to be equally informative. For each neighbor n∈Nbrs(v_{i}):
where f is a neural network that computes the attention coefficient for node n. The normalized coefficient, a_{n}, is obtained by a softmax function. The resulting attention readout for node v_{i} is represented by r_{i}. The information about the neighborhood of v_{i} is passed through the BiLSTM_{Nbh}. Finally, BiLSTM_{Nbh} provides a hidden representation from the neighborhood of a sequence.
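The first-level attention step can be sketched as below. The scoring function passed as `score_fn` stands in for the network f, and computing the readout as the attention-weighted sum of neighbor embeddings is an assumption, since the original display is not available here.

```python
import math


def attention_readout(neighbor_embs, score_fn):
    """Return (normalized attention weights, weighted-sum readout)."""
    scores = [score_fn(e) for e in neighbor_embs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # stable softmax
    total = sum(exps)
    weights = [x / total for x in exps]
    dim = len(neighbor_embs[0])
    readout = [
        sum(w * e[d] for w, e in zip(weights, neighbor_embs)) for d in range(dim)
    ]
    return weights, readout


# With a constant score every neighbor receives the same weight,
# so the readout reduces to the plain average of the neighbor embeddings.
weights, r_i = attention_readout([[1.0, 0.0], [0.0, 1.0]], lambda e: 0.0)
```

A learned `score_fn` would instead let the weights differ across neighbors, which is the point of not treating all of Nbrs(v_{i}) as equally informative.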
Sequence embedding
After gathering information from the neighborhood of sequence s, the last hidden representation of the neighborhood, \(h_{s}^{Nbh}\), is used by LSTM_{Seq} to find a representation for the sequence for the purpose of graph classification. LSTM_{Seq} processes the sequence whose neighborhood has been already processed by BiLSTM_{Nbh}:
The second-level attention module, Att_{Nbh}, is applied over the output of BiLSTM_{Nbh} in order to assign weights to parts of the neighborhood along sequence s and reflect the importance of each part for the task-specific prediction:
where f is a neural network that computes a single scalar as the attention coefficient for each hidden state of BiLSTM_{Nbh}. The attention coefficient of Att_{Nbh} is represented by a_{i}. An attention mechanism on the outputs of BiLSTM_{Nbh} provides the capability for the model to focus on the parts of the neighborhoods along the sequence that are influential in the end-to-end training of the model. The concatenation of sequence embedding and attention module is given to a simple feed-forward neural network, g, to find the subgraph representation, rep_{s}, by incorporating sequence s and its neighborhood.
Training
In order to train a scalable end-to-end model, we train the model on a set of extracted sequences from “Generating sequences from graphs” section. The model is trained so that it learns to predict the graph label using the set of chosen sequences. A batch of sequences, b, is selected from a graph G_{k}∈D, and the graph representation is obtained as follows:
Finally, the probability distribution over labels for G_{k} is computed by:
where W is a weight matrix reducing the dimension of the graph representation and b is a bias. The loss function computes the categorical cross entropy between the predicted distribution \(\overline {y}\) and the true label of the graph \(l_{G_{k}} \in L_{G}\):
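The classification step can be illustrated with the sketch below. Averaging the subgraph representations rep_{s} in a batch to obtain the graph representation is an assumption (the aggregation display is missing from this text), as are the function names; the softmax over an affine map and the cross-entropy loss follow the description above.

```python
import math


def mean_pool(reps):
    """Assumed aggregation: average the subgraph representations in a batch."""
    dim = len(reps[0])
    return [sum(r[d] for r in reps) / len(reps) for d in range(dim)]


def classify(graph_rep, W, b):
    """Softmax over W * graph_rep + b, giving a distribution over graph labels."""
    logits = [sum(w * x for w, x in zip(row, graph_rep)) + bias
              for row, bias in zip(W, b)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def graph_loss(probs, true_label):
    """Categorical cross entropy for a single graph."""
    return -math.log(probs[true_label])


graph_rep = mean_pool([[1.0, 3.0], [3.0, 1.0]])
probs = classify(graph_rep, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
cls_loss = graph_loss(probs, 0)
```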
Experiments
In this section, we evaluate our representation learning procedure on both labeled and unlabeled graphs using our supervised and unsupervised models. We use our learned representations for the task of graph classification using several benchmark datasets and compare the accuracy of our models to state-of-the-art approaches.
Datasets
For our classification experiments with labeled graphs, we use six datasets of bioinformatics graphs. For unlabeled graphs, we use six datasets of social network graphs.
Labeled graphs: The bioinformatics benchmarks include several well-known datasets of labeled graphs. MUTAG (Debnath et al. 1991) is a dataset of mutagenic aromatic and heteroaromatic nitro compounds. PTC (Toivonen et al. 2003) contains several compounds classified in terms of carcinogenicity for female and male rats. Enzymes (Borgwardt et al. 2005) includes 100 proteins from each of the 6 Enzyme Commission top-level enzyme classes. Proteins (Borgwardt et al. 2005) consists of graphs classified into enzyme and non-enzyme groups. Nci1 and Nci109 (Wale et al. 2008) are two balanced subsets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines, respectively.
Unlabeled graphs: We use several datasets developed by Yanardag and Vishwanathan (2015) for unlabeled graph classification. COLLAB is a collaboration dataset where each network is generated from ego-networks of researchers in three research fields. Networks are classified based on research field. IMDB-BINARY and IMDB-MULTI include ego-networks for film actors/actresses from various genres on IMDB, and networks are classified by genre. Each graph in the REDDIT datasets corresponds to an online discussion thread. The REDDIT-BINARY dataset includes graphs that are extracted from four different subreddits. These subreddits may belong to a question/answer-based community or a discussion-based community. These community labels are given by the dataset and the task is to classify the graphs based on their community labels. REDDIT-MULTI-5K and REDDIT-MULTI-12K are extracted from five subreddits and eleven subreddits, respectively, where the subreddit labels are given by the dataset. The task in these two datasets is to predict which subreddit a graph belongs to.
Baselines
We compare our approach to several well-known graph kernels: the shortest path kernel (SPK) (Borgwardt and Kriegel 2005), random walk kernel (RWK) (Gärtner et al. 2003), graphlet kernels (GK) (Shervashidze et al. 2009), and Weisfeiler-Lehman subtree kernel (WLSK) (Shervashidze et al. 2011). We also compare with recently proposed kernel methods: Deep Graph Kernels (DGK) (Yanardag and Vishwanathan 2015), WLOA kernel (Kriege et al. 2016), Pyramid Match Kernel (PM) (Nikolentzos et al. 2017) and local WL label (LWL) (Morris et al. 2017). In addition, we compare to four recent supervised methods, PSCN (Niepert et al. 2016), Deep Graph Convolutional Neural Network (DGCNN) (Zhang et al. 2018), Spectral Graph Representation (SGR) (Tsitsulin et al. 2018), and GCN (Kipf and Welling 2017), and to the unsupervised method Graph2vec (Narayanan et al. 2017). Finally, we compare to a graph representation method based on node2vec (Grover and Leskovec 2016); we use it to learn node embeddings and average them over all nodes in a graph. We report the best results from prior work on each dataset, choosing the best from multiple configurations of their methods.
Experimental setup
For the unsupervised setting, we perform 10-fold cross-validation on the graph representations of a dataset using a C-SVM classifier from LIBSVM (Chang and Lin 2011) with a radial basis kernel. Each 10-fold cross-validation experiment is repeated 10 times (with different random splits) and we report average accuracies and standard deviations. We use nested cross-validation for tuning the regularization and kernel hyperparameters of the SVM. For the supervised setting, we again perform 10-fold cross-validation on a dataset, and use the 9 training folds for training our model.
Hyperparameter selection
We treat three labeled bioinformatics graph datasets (MUTAG, PTC, Enzymes) and two unlabeled social network datasets (IMDB-BINARY and REDDIT-BINARY) as development datasets for tuning certain high-level decisions and hyperparameters of our unsupervised approach, though we generally found results to be robust across most values. Figure 8 shows the effect of dimensionality of the graph representation, showing robustness across values larger than 50; thus, we use 100 in all experiments below. The dashed lines in Fig. 8 show accuracy when the node embeddings are fixed and the solid lines when the node embeddings are considered as vector parameters and are updated during training. We use SE (“Squared error” section) when the node embeddings are fixed and CE (“Categorical cross entropy” section) when we update the node embeddings during training. CE consistently outperforms SE and we use CE for all remaining experiments. By using CE, we learn representations of neighborhoods at the right order of granularity and, using those, we learn the representation of the entire graph. We use AdaGrad (Duchi et al. 2011) with learning rate 0.01 and mini-batch size 100.
In the experiments that used BFS for sequence generation, we only consider nodes that are at most 1 edge from the starting node. In some cases, this still leads to extremely long sequences for nodes with many neighbors. We convert these to multiple sequences so that each has a maximum length of 10. When doing so, we still prepend the initial starting node to each of the truncated partial sequences.
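The truncation rule described above can be sketched as follows. The function name `bfs_sequences` is illustrative, and counting the prepended start node toward the maximum length of 10 is an assumption; the text does not say whether the limit includes it.

```python
def bfs_sequences(adj, start, max_len=10):
    """Depth-1 BFS from `start`, split into sequences of at most `max_len`
    nodes, each beginning with the start node."""
    neighbors = list(adj[start])  # nodes at most 1 edge from the start node
    if not neighbors:
        return [[start]]
    chunk = max_len - 1  # assumption: the start node counts toward max_len
    return [
        [start] + neighbors[i:i + chunk]
        for i in range(0, len(neighbors), chunk)
    ]


# A hub with 20 neighbors yields three sequences of lengths 10, 10 and 3.
star = {0: list(range(1, 21))}
seqs = bfs_sequences(star, 0)
```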
When using random walks, we generate multiple random walks from each node. We compared random walk lengths of {3,5,10,15,20}. Figure 9 shows robust performance with length 5 across datasets and we use this length below.
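A simple way to generate such walks is sketched below. The uniform choice of the next neighbor, the interpretation of length as the number of nodes in the walk, and the `walks_per_node` and `seed` parameters are assumptions introduced for this illustration.

```python
import random


def random_walks(adj, walk_len=5, walks_per_node=2, seed=0):
    """Generate `walks_per_node` uniform random walks from every node."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            # Stop early only if the current node has no neighbors.
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks


triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
walks = random_walks(triangle)
```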
Comparing models, type, and granularity of sequences
Figures 10, 11 and 12 show the results of the classification task on the development labeled graphs for our unsupervised models, with varying types of substructures (RW, BFS, SP) and their granularity (WL0, WL1, WL2). WL0 denotes the 0-order granularity, in which we use the original labels of the nodes. WL1 and WL2 denote sequences of first-order and second-order neighborhoods of the nodes, respectively.
Figure 10 shows the accuracy of the graph classification task with different orders of sequence granularity. We show the average of the accuracies of the two autoencoders S2SAE and S2SAEPP (rather than all four, since S2SN2NPP does not have the full range of substructures and S2SAEPPWL1,2 uses both WL labels together and is thus not comparable). The results indicate that using sequences of neighborhoods improves accuracy substantially compared to using sequences of nodes. There is a large gap between the accuracies obtained with first-order neighborhoods and with the original labels (0-order neighborhood) across all sequence types. This provides strong evidence that sequences of first-order neighborhoods can enrich the original labels in a way that improves graph representations by capturing additional local node structure.
The number of unique substructures increases from first-order neighborhoods to second-order neighborhoods. Thus, the number of WL labels assigned to neighborhoods increases with each iteration. Although distinctive labels provide more information about the local structure of nodes, using WL labels with higher iterations does not necessarily lead to better graph representations. For example, the accuracy of Enzymes (Fig. 10) shows a significant drop when using the second iteration of the WL algorithm. We believe that the reason for such a sharp drop in this dataset can be explained by the graphs’ label entropy (Li et al. 2011). Given a graph G and a set of labels L, the label entropy of G is defined as \(H(G)=-\sum _{l\in L} p(l)\log p(l)\). The average label entropy in each dataset is depicted in Table 1. The entropy of the Enzymes dataset increases more than that of MUTAG and PTC from the substructures with first-order neighborhoods to those with second-order neighborhoods. As the entropy increases, the number of unique WL labels in a dataset, and consequently the impurity of the set of labels, increases. When the number of common labels shared by different graphs decreases, the model cannot learn the similarity between graphs because each graph is represented by a set of labels that are nearly unique to it. However, the model can detect the commonality among the topology of graphs (similar substructures) when they have more shared labels.
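The label entropy of a single graph can be computed as in the sketch below, which uses the standard Shannon form with a leading minus sign. The natural logarithm and the function name `label_entropy` are assumptions; the text does not state the log base.

```python
import math
from collections import Counter


def label_entropy(node_labels):
    """Shannon entropy of the empirical label distribution of one graph."""
    counts = Counter(node_labels)
    n = len(node_labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


uniform_two = label_entropy(["a", "a", "b", "b"])   # two equally likely labels
all_unique = label_entropy(["a", "b", "c", "d"])    # every node has its own label
single = label_entropy(["a", "a", "a"])             # one shared label
```

The entropy is largest when every node carries a distinct label, which corresponds to the situation described above in which graphs share few labels.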
Moreover, Fig. 11 shows that the accuracies of classification using different types of sequences are very close to each other when using the first-order and second-order neighborhoods. The difference between the types of sequences is more evident when we use the original labels (0-order neighborhood sequences of nodes). In Enzymes, using shortest paths clearly results in better accuracies than random walks and BFS, regardless of the sequence granularity. For the same type of graphs, Borgwardt and Kriegel (2005) similarly observed that their shortest path kernel was better than walk-based kernels. We suspect that the reason is related to the clustering coefficient, a popular metric in network analysis. The clustering coefficient is the fraction of triangles connected to node v over the total number of connected triples centered on node v. Having many triangles in the ego-network of node v may cause the tottering problem in a walk traversing node v and may generate less discriminative BFS sequences from that ego-network. Shortest paths prevent tottering and capture the global graph structure. BFS sequences mainly consider the local topological features of a graph, and random walks collect a representative sample of nodes rather than of topology (Kurant et al. 2011). Fortunately, using the sequence of substructures augmented with neighborhoods can reduce the effect of sequence type in most settings.
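The local clustering coefficient defined above can be computed as in this short sketch, which assumes an undirected graph stored as a dict of neighbor sets and returns 0.0 for nodes with fewer than two neighbors (a common convention, adopted here as an assumption).

```python
def clustering_coefficient(adj, v):
    """Triangles through v divided by connected triples centered on v."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: no triples are centered on v
    # Each linked pair of neighbors closes one triangle through v.
    triangles = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return triangles / (k * (k - 1) / 2)


# Triangle 0-1-2 with an extra pendant node 3 attached to node 0.
g = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
cc_hub = clustering_coefficient(g, 0)
```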
Figure 12 shows the comparison between different unsupervised models, using the sequences of first-order neighborhoods. Model S2SAEPP is better than Model S2SAE in nearly all cases. As we conjectured above, Model S2SAEPP may force the encoder representation to capture the entire sequence since the decoder has less assistance during reconstruction. Model S2SN2NPP obtains higher accuracy on almost all datasets, showing the benefit of capturing local neighborhoods. With S2SAEPPWL1,2, we only observe improvements over the other S2SAE models on the PTC dataset. This suggests that adding substructures from both the first-order and second-order neighborhoods does not provide more informative graph representations, regardless of the type of substructures. A possible reason is that this model adds too many vector parameters for each neighborhood substructure, so it cannot learn meaningful embeddings for these substructures, which leads to poor graph representations.
Comparison to state-of-the-art
We compare S2SN2NPP and our supervised model to the state-of-the-art in Table 1. We exceed all prior results except on the MUTAG dataset. Our supervised method outperforms other supervised and unsupervised graph representation learning approaches. S2SN2NPP in combination with C-SVM also performs well in graph classification. Because no previous method outperforms all others on every dataset, we use an average ranking measure to compare the performance of all approaches together. Our supervised approach shows robustness, achieving the first ranking among all methods. S2SN2NPP obtains the second ranking, which is strong evidence that our unsupervised method learns the graph structure effectively. The third and fourth rankings are obtained by LWL and WLOA, which are kernel methods whose running time grows quadratically with the number of graphs in the dataset. Table 2 compares our method to prior work on the unlabeled graph datasets. Our approach establishes new state-of-the-art accuracies on all datasets except REDDIT-BINARY.
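An average ranking measure of this kind can be computed as in the sketch below. The method names and accuracy values are made-up placeholders, not results from the tables, and the handling of ties (broken by input order) and the requirement that every method has a score on every dataset are simplifying assumptions.

```python
def average_ranks(accuracies):
    """Map each method to its mean rank across datasets (rank 1 = best)."""
    methods = list(accuracies)
    n_datasets = len(next(iter(accuracies.values())))
    totals = {m: 0.0 for m in methods}
    for d in range(n_datasets):
        # Sort methods by accuracy on dataset d, best first.
        ordered = sorted(methods, key=lambda m: accuracies[m][d], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_datasets for m in methods}


# Placeholder accuracies for three hypothetical methods on two datasets.
toy = {"A": [90.0, 80.0], "B": [85.0, 85.0], "C": [70.0, 70.0]}
ranks = average_ranks(toy)
```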
Figure 13 shows the training time of graph representation learning in our S2SN2NPP approach versus DGCNN. The parameters of DGCNN are configured according to their paper (Zhang et al. 2018). DGCNN requires more time to learn the graph representation for the purpose of classification on all datasets. The difference between the two methods is most pronounced on the COLLAB dataset, which contains more graphs, and more nodes per graph, than the other datasets. Moreover, Fig. 14 shows that the training time of our representation learning approach grows linearly with the number of graphs in a dataset. We report the training time of our approach while incrementally increasing the size of the dataset, which consists of randomly selected graphs from the COLLAB dataset.
Conclusions
We proposed sequence-to-sequence LSTM architectures for learning representations of graphs in both supervised and unsupervised regimes. We trained our models using sequences from different types of substructures (random walks, shortest paths, and breadth-first search) with various levels of granularity (neighborhoods of increasing order). Our experiments demonstrate that our graph representations can increase the accuracy of graph classification tasks in both supervised and unsupervised approaches, achieving, to our knowledge, the best results on several of the datasets considered.
Availability of data and materials
The datasets are selected from the related work mentioned in the experiment section.
Abbreviations
BFS: Breadth-first search
CE: Cross entropy
DGCNN: Deep graph convolutional neural network
DGK: Deep graph kernel
GAM: Graph attention model
GNNs: Graph neural networks
LSTM: Long short-term memory
LWL: Local WL label
PM: Pyramid match kernel
RNN: Recurrent neural network
RW: Random walk
SP: Shortest path
WL: Weisfeiler-Lehman
WLOA: Optimal-assignment Weisfeiler-Lehman
References
Adhikari, B, Zhang Y, Ramakrishnan N, Prakash BA (2017) Distributed representations of subgraphs In: DaMNet.
Akoglu, L, McGlohon M, Faloutsos C (2010) Oddball: Spotting anomalies in weighted graphs In: PAKDD.
Bengio, S, Vinyals O, Jaitly N, Shazeer N (2015) Scheduled sampling for sequence prediction with recurrent neural networks In: NIPS.
Berlingerio, M, Koutra D, Eliassi-Rad T, Faloutsos C (2012) NetSimile: a scalable approach to size-independent network similarity. arXiv.
Borgwardt, KM, Kriegel HP (2005) Shortest-path kernels on graphs In: ICDM.
Borgwardt, K, Ong C, Schönauer S, Vishwanathan S, Smola A, Kriegel H (2005) Protein function prediction via graph kernels. Bioinformatics 21.
Bruna, J, Zaremba W, Szlam A, LeCun Y (2013) Spectral networks and locally connected networks on graphs. CoRR.
Bunke, H (2000) Graph matching: Theoretical foundations, algorithms, and applications In: Vision Interface.
Chang, CC, Lin CJ (2011) Libsvm: a library for support vector machines. ACM TIST 2.
Chen, J, Xu X, Wu Y, Zheng H (2018) Gclstm: Graph convolution embedded lstm for dynamic link prediction. arXiv preprint arXiv:1812.04206.
Debnath, A, Lopez de Compadre R, Debnath G, Shusterman A, Hansch C (1991) Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. J Med Chem.
Defferrard, M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. arXiv.
Duchi, J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. JMLR.
Duvenaud, D, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, Adams RP (2015) Convolutional networks on graphs for learning molecular fingerprints In: NIPS.
Floyd, RW (1962) Algorithm 97: shortest path. Commun ACM.
García-Durán, A, Niepert M (2017) Learning graph representations with embedding propagation In: NIPS.
Gärtner, T, Flach P, Wrobel S (2003) On graph kernels: Hardness results and efficient alternatives In: COLT.
Gilmer, J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. CoRR.
Grover, A, Leskovec J (2016) node2vec: Scalable feature learning for networks In: KDD.
Haussler, D (1999) Convolution kernels on discrete structures. Technical report.
Henaff, M, Bruna J, LeCun Y (2015) Deep convolutional networks on graph-structured data. arXiv.
Hinton, GE, Zemel RS (1993) Autoencoders, minimum description length, and helmholtz free energy In: NIPS.
Hochreiter, S, Schmidhuber J (1997) Long short-term memory. Neural Comput.
Kipf, TN, Welling M (2017) Semi-supervised classification with graph convolutional networks In: ICLR.
Kriege, NM, Giscard PL, Wilson R (2016) On valid optimal assignment kernels and applications to graph classification In: NIPS.
Kurant, M, Markopoulou A, Thiran P (2011) Towards unbiased bfs sampling. IEEE J Sel Areas Commun.
Lee, JB, Rossi RA, Kim S, Ahmed NK, Koh E (2018a) Attention models in graphs: A survey. arXiv preprint arXiv:1807.07984.
Lee, JB, Rossi R, Kong X (2018b) Graph classification using structural attention In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1666–1674. ACM.
Li, J, Luong M, Jurafsky D (2015a) A hierarchical neural autoencoder for paragraphs and documents In: ACL.
Li, G, Semerci M, Yener B, Zaki MJ (2011) Graph classification via topological and label attributes In: MLG.
Li, Y, Tarlow D, Brockschmidt M, Zemel R (2015b) Gated graph sequence neural networks. arXiv.
Maaten, Lvd, Hinton G (2008) Visualizing data using t-SNE. JMLR.
Macindoe, O, Richards W (2010) Graph comparison using fine structure analysis In: SocialCom.
Morris, C, Kersting K, Mutzel P (2017) Glocalized Weisfeiler-Lehman graph kernels: Global-local feature maps of graphs In: ICDM.
Narayanan, A, Chandramohan M, Chen L, Liu Y, Saminathan S (2016) subgraph2vec: Learning distributed representations of rooted subgraphs from large graphs. MLG.
Narayanan, A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S (2017) graph2vec: Learning distributed representations of graphs In: MLG.
Newman, ME (2003) The structure and function of complex networks. SIAM Rev.
Niepert, M, Ahmed M, Kutzkov K (2016) Learning convolutional neural networks for graphs In: ICML.
Nikolentzos, G, Meladianos P, Vazirgiannis M (2017) Matching node embeddings for graph similarity In: AAAI.
Perozzi, B, Al-Rfou R, Skiena S (2014) DeepWalk: Online learning of social representations In: KDD.
Riesen, K, Jiang X, Bunke H (2010) Exact and inexact graph matching: Methodology and applications In: Managing and Mining Graph Data.
Rossi, RA, Zhou R, Ahmed N (2018) Deep inductive graph representation learning. IEEE Trans Knowl Data Eng.
Scarselli, F, Gori M, Tsoi C, Hagenbuchner M, Monfardini G (2009) The graph neural network model. IEEE Trans Neural Netw 20.
Shervashidze, N, Schweitzer P, Leeuwen EJv, Mehlhorn K, Borgwardt KM (2011) Weisfeiler-Lehman graph kernels. JMLR.
Shervashidze, N, Vishwanathan S, Petri T, Mehlhorn K, Borgwardt KM (2009) Efficient graphlet kernels for large graph comparison In: AISTATS.
Sutskever, I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks In: NIPS.
Tang, J, Qu M, Wang M, Zhang M, Yan J, Mei Q (2015) Line: Large-scale information network embedding In: WWW.
Toivonen, H, Srinivasan A, King R, Kramer S, Helma C (2003) Statistical evaluation of the predictive toxicology challenge. Bioinformatics 19.
Tsitsulin, A, Mottin D, Karras P, Bronstein A, Müller E (2018) Sgr: Self-supervised spectral graph representation learning. arXiv preprint arXiv:1811.06237.
Trivedi, R, Dai H, Wang Y, Song L (2017) Knowevolve: Deep temporal reasoning for dynamic knowledge graphs In: ICML.
Van Wijk, BC, Stam CJ, Daffertshofer A (2010) Comparing brain networks of different size and connectivity density using graph theory. PLoS ONE 5.
Veličković, P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y (2018) Graph attention networks In: ICLR.
Vishwanathan, S, Schraudolph N, Kondor R, Borgwardt K (2010) Graph kernels. JMLR.
Wale, N, Watson IA, Karypis G (2008) Comparison of descriptor spaces for chemical compound retrieval and classification. KAIS 14.
Weisfeiler, B, Lehman A (1968) A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia.
Yan, X, Han J (2002) gspan: Graph-based substructure pattern mining In: ICDM.
Yanardag, P, Vishwanathan S (2015) Deep graph kernels In: KDD.
Ying, Z, You J, Morris C, Ren X, Hamilton W, Leskovec J (2018) Hierarchical graph representation learning with differentiable pooling In: Advances in Neural Information Processing Systems, 4805–4815.
Zhang, M, Cui Z, Neumann M, Chen Y (2018) An end-to-end deep learning architecture for graph classification In: AAAI.
Acknowledgements
Not applicable.
Funding
This research was supported in part by the following grants: NSF IIS1515587 and NSF III1514126
Author information
Contributions
AT defined, formalized and implemented the approach under the supervision of TBW. KG contributed ideas in the unsupervised setting and the overall preparation of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Taheri, A., Gimpel, K. & BergerWolf, T. Sequencetosequence modeling for graph representation learning. Appl Netw Sci 4, 68 (2019). https://doi.org/10.1007/s4110901901748
DOI: https://doi.org/10.1007/s4110901901748
Keywords
 Graph representation learning
 Deep learning
 Graph classification
 Recurrent models