An End-to-End Graph Convolutional Kernel Support Vector Machine

A novel kernel-based support vector machine (SVM) for graph classification is proposed. The SVM feature space mapping consists of a sequence of graph convolutional layers, which generates a vector space representation for each vertex, followed by a pooling layer which generates a reproducing kernel Hilbert space (RKHS) representation for the graph. The use of a RKHS offers the ability to implicitly operate in this space using a kernel function without the computational complexity of explicitly mapping into it. The proposed model is trained in a supervised end-to-end manner whereby the convolutional layers, the kernel function and SVM parameters are jointly optimized with respect to a regularized classification loss. This approach is distinct from existing kernel-based graph classification models which instead either use feature engineering or unsupervised learning to define the kernel function. Experimental results demonstrate that the proposed model outperforms existing deep learning baseline models on a number of datasets.


Introduction
The world contains much implicit structure which can be modelled using a graph. For example, an image can be modelled as a graph where objects (e.g. person, chair) are modelled as vertices and their pairwise relationships (e.g. sitting) are modelled as edges [22]. This representation has led to useful solutions for many vision problems including image captioning and visual question answering [2]. Similarly, a street network can be modelled as a graph where locations are modelled as vertices and street segments are modelled as edges. This representation has led to useful solutions for many transportation problems including the placement of electric vehicle charging stations [8].
Given the ubiquity of problems which can be modelled in terms of graphs, performing machine learning on graphs represents an area of great research interest. Advances in the application of deep learning or neural networks to sequence spaces in the context of natural language processing and fixed dimensional vector spaces in the context of computer vision have led to much interest in applying deep learning to graphs. There exist many types of machine learning tasks one may wish to perform on graphs. These include vertex classification, graph classification, graph generation [46] and learning implicit/hidden structures [7]. In this work we focus on the task of graph classification. Examples of graph classification tasks include human activity recognition where human pose is modelled using a skeleton graph [42], visual scene understanding where the scene is modelled using a scene graph [39] and semantic segmentation of three dimensional point clouds where the point cloud is modelled as a graph of geometrically homogeneous elements [23].
Graph convolution is the most commonly used deep learning architecture applied to graphs. This architecture consists of a sequence of convolutional layers where each layer iteratively updates a vector space representation of each vertex. In their seminal work, Gilmer et al. [9] demonstrated that many different convolutional layers can be formulated in terms of a framework containing two steps. In the first step, message passing is performed where each vertex receives messages from adjacent vertices regarding their current representation. In the second step, each vertex performs an update of its representation which is a function of its current representation and the messages it received in the previous step. In order to perform graph classification given a sequence of convolutional layers, the set of vertex representations output from this sequence must be integrated to form a graph representation. This graph representation can subsequently be used to predict a corresponding class label. We refer to this task of integrating vertex representations as vertex pooling and it represents the focus of this article. Note that Gilmer et al. [9] refer to this task as readout.
Performing vertex pooling is made challenging by the fact that different sets of vertex representations corresponding to different graphs may contain different numbers of elements. Furthermore, the elements in a given set are unordered. Therefore one cannot directly apply a feed-forward or recurrent architecture because these require an input lying in a vector space or sequence space respectively. To overcome this challenge most solutions involve mapping the sets of vertex representations to either a vector or sequence space which can then form the input to a feed-forward or recurrent architecture respectively. There exists a wide array of such solutions ranging from computing simple summary statistics such as the mean vertex representation to more complex clustering based methods [44].
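The simplest of these solutions, pooling by a summary statistic, can be sketched in a few lines. This is an illustrative sketch rather than code from this work; the function name `summary_pool` and the example arrays are assumptions. It shows that graphs with different numbers of vertices map to vectors of the same dimension and that the result does not depend on vertex order.

```python
import numpy as np


def summary_pool(vertex_reps: np.ndarray, statistic: str = "mean") -> np.ndarray:
    """Map a set of vertex representations (one per row) to a single vector."""
    if statistic == "mean":
        return vertex_reps.mean(axis=0)
    if statistic == "sum":
        return vertex_reps.sum(axis=0)
    if statistic == "max":
        return vertex_reps.max(axis=0)
    raise ValueError(f"unknown statistic: {statistic}")


# Two graphs with different numbers of vertices, both with dimension m = 3.
small_graph = np.array([[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]])
large_graph = np.array([[0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [4.0, 0.0, 3.0]])

pooled_small = summary_pool(small_graph)
pooled_large = summary_pool(large_graph)

# Reordering the vertices leaves the pooled vector unchanged.
shuffled_large = large_graph[[2, 0, 1]]
pooled_shuffled = summary_pool(shuffled_large)
```

Both pooled vectors lie in the same three dimensional space, so they can be passed to a feed-forward classifier regardless of graph size.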
In this article we propose a novel binary graph classification method which performs vertex pooling by mapping a set of vertex representations to an element in a reproducing kernel Hilbert space (RKHS). A RKHS is a function space for which there exists a corresponding kernel function equalling the dot product in this space. Being a function space where the domain of functions in this space is a Euclidean Space, the RKHS in question is of infinite dimension and in turn has high model capacity. However, the infinite nature of this space makes it challenging to work directly in this space. To overcome this challenge, we use the corresponding kernel function which allows us to implicitly compute the dot product in this space without explicitly mapping to the space in question. This is a commonly used strategy known as the kernel trick. More specifically, the kernel corresponding to the RKHS is used within a support vector machine (SVM) to perform binary graph classification. A useful feature of the proposed pooling method is that the mapping to a RKHS is parametrized by a scale parameter which controls the degree to which different sets of vertex representations can be discriminated.
The proposed graph classification model is trained in a supervised end-to-end manner where the convolutional layers, the kernel function and SVM parameters are jointly optimized with respect to a regularized classification loss. This approach is distinct from existing kernel-based models which instead use feature engineering or unsupervised learning to define the kernel function and only optimize the parameters of the classification method in a supervised manner [43]. Using feature engineering can result in diagonal dominance whereby a graph is determined to only be similar to itself, but not to any other graph [43]. Although unsupervised learning can overcome this problem and improve performance, the kernel may not be optimal for the task at hand given it was learned in an unsupervised as opposed to supervised manner [15]. The proposed solution of optimizing in an end-to-end manner overcomes these limitations.
The remainder of this paper is structured as follows. Section 2 reviews related work on graph kernels and vertex pooling methods. Section 3 describes the proposed graph classification model. Section 4 presents an evaluation of this model through comparison to 12 baseline models on 4 datasets. Finally, section 5 draws some conclusions from this work and discusses possible future research directions.

Related Work
In this work we propose a novel vertex pooling method which performs vertex pooling by mapping to a RKHS. In the following two sections we review related work on vertex pooling methods and graph kernels.

Vertex Pooling
As discussed in the introduction to this article, existing vertex pooling methods generally map the set of vertex representations to a fixed dimensional vector space or sequence space. The simplest methods for performing vertex pooling compute a summary statistic of the set of vertex representations. Commonly used summary statistics include mean, max and sum [4]. Despite the simple nature of these methods, a recent study by Luzhnica et al. [25] demonstrated that in some cases they can outperform more complex methods. Zhang et al. [47] proposed a vertex pooling method which first performs a sorting of vertex representations based on the Weisfeiler-Lehman graph isomorphism algorithm. A subset of these vertex representations is then selected based on this ranking, where the size of this subset is a user specified parameter. Tarlow et al. [24] proposed a vertex pooling method which outputs an element in sequence space. Gilmer et al. [9] proposed to perform vertex pooling by applying the set2set model from Vinyals et al. [36]. The set2set model maps the set of vertex representations to a fixed dimensional vector space representation which is invariant to the order of elements in the set. Ying et al. [44] proposed a vertex pooling method which uses clustering to iteratively integrate vertex representations and outputs an element in a fixed dimensional vector space. Kearnes et al. [16] proposed a vertex pooling method which creates a fuzzy histogram of the vertex representations and outputs an element in a fixed dimensional vector space.

Graph Kernels
As described in the introduction to this article, existing kernel-based graph classification methods use either feature engineering or unsupervised learning to define the kernel. We now review each of these approaches in turn.
The most common approach for feature engineering kernels is the R-convolution framework where the kernel function of two graphs is defined in terms of the similarity of their respective substructures [13]. This framework is similar to the bag-of-words framework used in natural language processing. Substructures used in the R-convolution framework to define kernels include graphlets [33], shortest path properties [1] and random walk properties [34].
The Weisfeiler-Lehman framework is a framework for feature engineering kernels which is inspired by the Weisfeiler-Lehman test of graph isomorphism. In this framework the vertex representations of a given graph are iteratively updated in a similar manner to graph convolution to give a sequence of graphs. A kernel is then defined with respect to this sequence by summing the application of a given kernel, known as the base kernel, to each graph in the sequence. Shervashidze et al. [32] proposed a family of kernels using this framework by considering a set of base kernels including one which measures the similarity of shortest path properties. Rieck et al. [30] proposed a kernel using this framework by considering a base kernel which measures the similarity of topological properties.
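To make the framework concrete, the following is a minimal sketch of one member of this family, the subtree variant, in which the base kernel is the dot product of vertex label histograms. The function names and the dictionary-based graph encoding are assumptions made for illustration and are not taken from the cited works.

```python
from collections import Counter


def _refine(adj, labels, compressed):
    """One Weisfeiler-Lehman step: relabel each vertex by its own label
    together with the sorted multiset of its neighbours' labels."""
    refined = {}
    for vertex, neighbours in adj.items():
        signature = (labels[vertex], tuple(sorted(labels[w] for w in neighbours)))
        # Equal signatures receive equal compressed labels in both graphs.
        refined[vertex] = compressed.setdefault(signature, len(compressed))
    return refined


def wl_subtree_kernel(adj_a, labels_a, adj_b, labels_b, iterations=2):
    """Sum, over the sequence of refined graphs, of the histogram base kernel.

    adj_*: dict mapping each vertex to a list of adjacent vertices.
    labels_*: dict mapping each vertex to a sortable, hashable label.
    """
    labels_a, labels_b = dict(labels_a), dict(labels_b)
    compressed = {}  # shared between the two graphs
    total = 0
    for step in range(iterations + 1):
        hist_a = Counter(labels_a.values())
        hist_b = Counter(labels_b.values())
        total += sum(count * hist_b[label] for label, count in hist_a.items())
        if step == iterations:
            break
        labels_a = _refine(adj_a, labels_a, compressed)
        labels_b = _refine(adj_b, labels_b, compressed)
    return total


# A labelled path, an isomorphic copy with renamed vertices, and a triangle.
path_adj = {0: [1], 1: [0, 2], 2: [1]}
path_labels = {0: "C", 1: "C", 2: "O"}
copy_adj = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
copy_labels = {"x": "O", "y": "C", "z": "C"}
tri_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
tri_labels = {0: "C", 1: "C", 2: "C"}

k_path_path = wl_subtree_kernel(path_adj, path_labels, path_adj, path_labels, iterations=1)
k_path_copy = wl_subtree_kernel(path_adj, path_labels, copy_adj, copy_labels, iterations=1)
k_path_tri = wl_subtree_kernel(path_adj, path_labels, tri_adj, tri_labels, iterations=1)
```

Isomorphic graphs receive the same kernel value as a graph compared with itself, while the triangle only matches the path on the unrefined labels.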
Kriege et al. [21] proposed another framework for feature engineering kernels known as assignment kernels which computes an optimal assignment between graph substructures and sums over a kernel applied to each correspondence in the assignment. The authors proposed a number of kernels using this framework including one based on the Weisfeiler-Lehman graph isomorphism algorithm. Kondor et al. [20] proposed a multiscale kernel which considers vertex features plus topological information through the graph Laplacian. Zhang et al. [48] proposed a kernel based on the return probabilities of random walks. The authors used an approximation of the kernel function so that the method can be applied to large datasets [29].
To overcome the limitations of feature engineering and improve performance, recent works in the field of graph kernels have considered unsupervised learning techniques. These methods generally learn a graph representation in an unsupervised manner and subsequently use this representation to define a kernel. Yanardag et al. [43] proposed a kernel which uses the R-convolution framework to define a set of substructures and subsequently learns an embedding of these substructures in an unsupervised manner using a word2vec type model. Ivanov et al. [15] proposed a kernel which determines two graphs to be similar if their vertices have similar neighbourhoods measured in terms of anonymous walks which are a generalization of random walks. Learning is performed in an unsupervised manner using a word2vec type model. Nikolentzos et al. [26] proposed a graph kernel which first computes sets of vertex representations corresponding to the graphs in question in an unsupervised manner. The similarity of these sets is then computed using the earth mover's distance. The authors noted that these similarities do not yield a positive semidefinite kernel matrix, preventing it from being used in some kernel-based classification methods. To overcome this issue the authors use a version of the support vector machine for indefinite kernel matrices. Similar to Nikolentzos et al. [26], Wu et al. [37] proposed a graph kernel which first computes sets of vertex representations corresponding to the graphs in question in an unsupervised manner. The resulting set of embeddings is in turn used to embed the graph in question by measuring the disturbance distance to sets of embeddings corresponding to random graphs. Finally, this graph representation is used to define a kernel.

Methodology
The proposed graph classification model consists of the following three steps. In the first step, a sequence of graph convolutional layers is applied to the graph in question to generate a corresponding set of vertex representations. In the second step, this set of vertex representations is mapped to a RKHS. In the final step, graph classification is performed using a SVM. Each of these three steps is described in turn in the first three subsections of this section. In the final subsection we describe how the parameters of each step are optimized jointly in an end-to-end manner. Before that, we first introduce some notation and formally define the problem of graph classification.
A graph is a tuple $(V, E)$ where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of edges. Let $\mathcal{G}$ denote the space of graphs. Let $l : V \to \Sigma$ denote a vertex labelling function. In this work we assume that $\Sigma$ is a finite set. Let $G = \{G_1, G_2, \ldots, G_n\}$ denote a set of $n$ graphs and $Y = \{Y_1, Y_2, \ldots, Y_n\}$ denote a corresponding set of graph labels. In this work we assume that graph labels take elements in the set $\{0, 1\}$. In this work we consider the problem of binary graph classification where given $G$ and $Y$ we wish to learn a map $\mathcal{G} \to \{0, 1\}$.

Graph Convolution Layers
A large number of different graph convolutional layers have been proposed. Broadly speaking a graph convolutional layer will update the representation of each vertex in a given graph where this update is a function of the current representation of that vertex plus the representations of its adjacent neighbours. In this section we only briefly review existing graph convolutional layers but the interested reader can find a more indepth analysis in the following review papers [49,38].
Gilmer et al. [9] showed that many different convolutional layers may be reformulated in terms of a framework called Message Passing Neural Networks defined in terms of a message function $M$ and an update function $U$. In this framework vertex representations are updated according to Equation 1 where $h_v^t$ denotes the representation of vertex $v$ output from the $t$-th convolutional layer and $N(v)$ denotes the set of vertices adjacent to $v$. Each vertex representation $h_v^t$ is an element of $\mathbb{R}^m$ where the dimension $m$ may vary from layer to layer. For the input layer, that is $t = 1$, vertex representations equal a one-hot encoding of the vertex labelling function $l$ and therefore the corresponding dimension is $|\Sigma|$. For all subsequent layers the corresponding dimension is a model hyperparameter.
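Equation 1 is not reproduced in this text. For orientation, the update rule of the framework as stated by Gilmer et al. [9] can be written as follows. Edge features are omitted here, which is an assumption made to match the vertex-labelled graphs considered in this work.

```latex
\begin{align}
m_v^{t+1} &= \sum_{w \in N(v)} M_t\left(h_v^t, h_w^t\right), \\
h_v^{t+1} &= U_t\left(h_v^t, m_v^{t+1}\right).
\end{align}
```

The first line is the message passing step and the second line is the update step.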
In the proposed graph classification model we use the functions $M$ and $U$ originally proposed by Hamilton et al. [12] and defined in Equation 2. Here CONCAT is the horizontal vector concatenation operation, $W^t$ and $b^t$ are the weights and biases respectively for the $t$-th convolutional layer, and ReLU is the real valued rectified linear unit non-linearity.
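A layer of this kind can be sketched as follows. Since Equation 2 is not reproduced in this text, the choice of a mean over neighbours as the message aggregator is an assumption, as are the function names `one_hot` and `conv_layer`. NumPy is used only to keep the sketch short.

```python
import numpy as np


def one_hot(labels: np.ndarray, num_labels: int) -> np.ndarray:
    """Input representation: one-hot encoding of each vertex label."""
    encoded = np.zeros((len(labels), num_labels))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


def conv_layer(h, adjacency, weight, bias):
    """h: (n, m_in); adjacency: (n, n) with 0/1 entries;
    weight: (2 * m_in, m_out); bias: (m_out,)."""
    degree = np.maximum(adjacency.sum(axis=1, keepdims=True), 1.0)  # avoid division by zero
    messages = adjacency @ h / degree                   # mean of neighbour representations (assumed)
    combined = np.concatenate([h, messages], axis=1)    # CONCAT
    return np.maximum(combined @ weight + bias, 0.0)    # ReLU


rng = np.random.default_rng(0)
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path on 3 vertices
h0 = one_hot(np.array([0, 0, 1]), num_labels=2)

# Two layers with hidden dimension 25; weights are random placeholders.
w1, b1 = rng.normal(size=(4, 25)), np.zeros(25)
w2, b2 = rng.normal(size=(50, 25)), np.zeros(25)
h1 = conv_layer(h0, adjacency, w1, b1)
h2 = conv_layer(h1, adjacency, w2, b2)
```

Relabelling the vertices permutes the rows of the output in the same way, so the set of output points is unaffected by vertex order.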
A sequence of two convolutional layers was used in the proposed model. A number of studies have found that the use of two layers empirically gives the best performance [19]. This sequence of layers will map a graph $G_i = (V, E)$ to a set of $|V|$ points in $\mathbb{R}^m$ where $m$ is the dimension of the final convolutional layer. Since the number of vertices in a graph may vary, the number of points in $\mathbb{R}^m$ may in turn vary. Let us denote by $\mathrm{Set}$ the space of sets of points in $\mathbb{R}^m$. Given this, the sequence of convolutional layers defines a map $\mathcal{G} \to \mathrm{Set}$.

Mapping to RKHS
The output from the sequence of convolutional layers defined in the previous subsection is an element in the space Set. In this section we propose a method for mapping elements in this space to a reproducing kernel Hilbert space (RKHS). We in turn define a kernel between elements in this space.
A Hilbert space is a vector space with an inner product such that the induced norm turns the space into a complete metric space. A positive-semidefinite kernel on a set $X$ is a function $k : X \times X \to \mathbb{R}$ such that there is a feature space $H$ and a map $\phi : X \to H$ such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$ where $x, y \in X$ and $\langle \cdot, \cdot \rangle$ denotes the dot product in $H$. Equivalently, a function $k : X \times X \to \mathbb{R}$ is a kernel if and only if for every subset $\{x_1, \ldots, x_q\} \subseteq X$, the $q \times q$ matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ is positive semi-definite. Given a kernel $k$, one can define a map $X \to \mathbb{R}^X$ as Equation 3 where the codomain of this map is the space of real valued functions on $X$. Such a space is called a function space.
$$x \mapsto k(\cdot, x) \qquad (3)$$
Given this, it can be proven that $k(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle$. By virtue of this property $\mathbb{R}^X$ is called a reproducing kernel Hilbert space (RKHS) corresponding to the kernel $k$ [31].
Let $k^{R}_{\sigma} : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ be the Gaussian kernel function defined in Equation 4 where $u \in \mathbb{R}$ and $\sigma^2 \in \mathbb{R}_{\geq 0}$.
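The matrix characterization of a positive-semidefinite kernel can be checked numerically for a Gaussian kernel. Equation 4 is not reproduced in this text, so the sketch below assumes the common form $\exp(-\lVert x - y \rVert^2 / (2\sigma^2))$; the function name and the example points are likewise illustrative.

```python
import numpy as np


def gaussian_kernel(x, y, sigma: float) -> float:
    """Gaussian kernel, assuming the form exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-(diff @ diff) / (2.0 * sigma ** 2)))


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.5, 1.5]])

# Kernel matrix K with entries K_ij = k(x_i, x_j).
gram = np.array([[gaussian_kernel(p, q, sigma=1.0) for q in points] for p in points])

# A symmetric matrix is positive semi-definite when its eigenvalues are non-negative.
eigenvalues = np.linalg.eigvalsh(gram)
```

For these points every eigenvalue is non-negative up to floating point error, as the definition requires.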
Given $k^{R}_{\sigma}$, we define a map $F : \mathrm{Set} \times \mathbb{R} \to \mathbb{R}^{\mathbb{R}^m}$ in Equation 5 where $\mathbb{R}^{\mathbb{R}^m}$ is the space of real valued functions on $\mathbb{R}^m$. To illustrate this map consider the element of $\mathrm{Set}$ displayed in Figure 1(a) where the dimension $m$ equals 2. Recall that elements in the space $\mathrm{Set}$ correspond to sets of points in $\mathbb{R}^m$. Figures 1(b) and 1(c) display the elements of $\mathbb{R}^{\mathbb{R}^m}$ resulting from applying the map $F$ to this element of $\mathrm{Set}$ with $\sigma$ parameter values of 0.001 and 0.0005 respectively. The parameter $\sigma$ of the map $F$ is a scale parameter and may be interpreted as follows. As the value of $\sigma$ approaches 0, $F(x, \sigma)$ becomes a sum of a set of indicator functions applied to $x$. In this case distinct elements of the space $\mathrm{Set}$ map to distinct elements of $\mathbb{R}^{\mathbb{R}^m}$ where the distance between these functions measured by the $L^p$ norm is greater than zero. On the other hand, as $\sigma$ approaches $\infty$, differences between the functions are gradually smoothed out and in turn the distance between the functions gradually reduces. Therefore, one can view the parameter $\sigma$ as controlling the discrimination power of the method.
Given the map $F$ defined in Equation 5, we define the kernel $k^{L}_{\sigma} : \mathbb{R}^{\mathbb{R}^m} \times \mathbb{R}^{\mathbb{R}^m} \to \mathbb{R}$ in Equation 6. Note that the final equality in this equation follows from the reproducing property of the RKHS related to $k^{R}_{\sigma}$ and the bilinearity of the inner product [28]. By examination of Equation 6, we see that the kernel $k^{L}_{\sigma}$ equals the dot product between elements in the codomain of the map $F$ which is an infinite dimensional function space. That is, the kernel allows us to operate in this codomain without the computational complexity of explicitly mapping into it.
Theorem 1. The kernel $k^{L}_{\sigma}$ is a positive-semidefinite kernel.
Proof. The kernel $k^{L}_{\sigma}$ is a positive-semidefinite kernel because it is defined in Equation 6 to equal the dot product in the space $\mathbb{R}^{\mathbb{R}^m}$.
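Equations 5 and 6 are not reproduced in this text. The following is a reconstruction consistent with the description above, under the assumption that $F$ places one Gaussian kernel function at each point of the set and sums them without normalization. Here $x, y \in \mathrm{Set}$ and $u$, $v$ range over their points.

```latex
\begin{align}
F(x, \sigma) &= \sum_{u \in x} k^{R}_{\sigma}(\cdot, u), \\
k^{L}_{\sigma}\bigl(F(x, \sigma), F(y, \sigma)\bigr)
  &= \bigl\langle F(x, \sigma), F(y, \sigma) \bigr\rangle \\
  &= \sum_{u \in x} \sum_{v \in y}
     \bigl\langle k^{R}_{\sigma}(\cdot, u), k^{R}_{\sigma}(\cdot, v) \bigr\rangle \\
  &= \sum_{u \in x} \sum_{v \in y} k^{R}_{\sigma}(u, v).
\end{align}
```

The third line uses bilinearity of the inner product and the last line uses the reproducing property, so the kernel reduces to a finite double sum of Gaussian kernel evaluations.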
The kernel $k^{L}_{\sigma}$ has a specific scale which is specified by $\sigma$. In order to adopt a multi-scale approach we consider a set of $s$ scales $\Sigma = \{\sigma_1, \ldots, \sigma_s\}$ to define a corresponding set of kernels $\{k^{L}_{\sigma_1}, \ldots, k^{L}_{\sigma_s}\}$. We combine these kernels using a linear combination defined in Equation 7 where $\{\beta_1, \ldots, \beta_s\} \in \mathbb{R}^s_{\geq 0}$. Let $H$ denote the reproducing kernel Hilbert space (RKHS) corresponding to the kernel $k^{L}_{\Sigma}$ [11].
Theorem 2. The kernel $k^{L}_{\Sigma}$ is a positive-semidefinite kernel.
Proof. The kernel $k^{L}_{\Sigma}$ is a positive-semidefinite kernel because it is the sum of positive-semidefinite kernels and the coefficients $\{\beta_1, \ldots, \beta_s\}$ are all non-negative (see proposition 13.1 in [31]).
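The multi-scale kernel and Theorem 2 can be illustrated numerically. The sketch below assumes that the single-scale kernel is the double sum of Gaussian kernel values over pairs of points and that the Gaussian kernel has the form $\exp(-\lVert u - v \rVert^2 / (2\sigma^2))$; neither formula is reproduced in this text, and all names are illustrative.

```python
import numpy as np


def set_kernel(x_set: np.ndarray, y_set: np.ndarray, sigma: float) -> float:
    """Sum of Gaussian kernel values over all pairs of points, one from each set."""
    sq_dists = ((x_set[:, None, :] - y_set[None, :, :]) ** 2).sum(axis=-1)
    return float(np.exp(-sq_dists / (2.0 * sigma ** 2)).sum())


def multi_scale_kernel(x_set, y_set, sigmas, betas) -> float:
    """Linear combination of single-scale kernels with non-negative weights."""
    return sum(beta * set_kernel(x_set, y_set, sigma) for sigma, beta in zip(sigmas, betas))


rng = np.random.default_rng(0)
# Three sets of vertex representations of different sizes, with m = 2.
vertex_sets = [rng.normal(size=(size, 2)) for size in (3, 5, 4)]
sigmas, betas = [0.5, 2.0], [1.0, 1.0]

gram = np.array(
    [[multi_scale_kernel(a, b, sigmas, betas) for b in vertex_sets] for a in vertex_sets]
)
eigenvalues = np.linalg.eigvalsh(gram)
```

The resulting kernel matrix is symmetric with non-negative eigenvalues even though the three sets contain different numbers of points.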
Let $f : \mathrm{Set} \to \mathbb{R}$ be a map from which we obtain a decision function by $\mathrm{sgn}(f)$. That is, if $f$ returns a positive value we classify the graph in question as 1 and otherwise we classify it as 0. We determine a suitable map $f$ lying in the RKHS $H$ corresponding to the kernel $k^{L}_{\Sigma}$ by Equation 8. Note that the first term in this sum corresponds to the soft margin loss [31] and the second term is a regularization term.
By the representer theorem any solution to Equation 8 can be written in the form of Equation 9 where $\{\alpha_1, \ldots, \alpha_n\} \in \mathbb{R}^n$ [28].
Substituting this into Equation 8 we obtain Equation 10 where optimization of the function $f$ is performed with respect to $\{\alpha_1, \ldots, \alpha_n\} \in \mathbb{R}^n$. Here $\odot$ is the elementwise multiplication operator (Hadamard product), $\mathbf{0}$ is a vector of zeros of size $n$ and $\mathbf{1}$ is a vector of ones of size $n$.
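An objective of this type can be sketched as follows. Equations 8 to 10 are not reproduced in this text, so the exact form used here is an assumption: a summed hinge loss on the kernel expansion plus $\lambda \alpha^\top K \alpha$ as the regularizer, with labels in $\{0, 1\}$ mapped to $\{-1, +1\}$. The function name `svm_objective` is hypothetical.

```python
import numpy as np


def svm_objective(alpha: np.ndarray, gram: np.ndarray, labels01: np.ndarray, lam: float) -> float:
    """Soft margin loss plus regularization, written in terms of the kernel matrix."""
    y = 2.0 * labels01 - 1.0                          # map {0, 1} to {-1, +1} (assumption)
    scores = gram @ alpha                             # f(G_i) = sum_j alpha_j k(G_i, G_j)
    hinge = np.maximum(0.0, 1.0 - y * scores).sum()   # elementwise product, then max with zeros
    penalty = lam * (alpha @ gram @ alpha)            # squared RKHS norm of f
    return float(hinge + penalty)


gram = np.eye(2)                 # toy kernel matrix for two training graphs
labels = np.array([1.0, 0.0])

loss_at_zero = svm_objective(np.zeros(2), gram, labels, lam=0.5)
loss_separated = svm_objective(np.array([1.0, -1.0]), gram, labels, lam=0.5)
```

With all coefficients at zero both examples violate the margin, whereas the second coefficient vector satisfies both margins and only pays the regularization term.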

End-to-End Optimization
As described in the previous subsections, the proposed classification model contains three steps, each with corresponding parameters which require optimization with respect to the objective function defined in Equation 10. The parameters in question are the sets of convolutional layer parameters $W^t$ and $b^t$ defined in Equation 2, the sets of kernel parameters $\sigma_l$ and $\beta_l$ defined in Equation 7, and the set of SVM parameters $\alpha_j$ defined in Equation 9. All of these parameters are unconstrained real values apart from the sets of kernel parameters $\sigma_l$ and $\beta_l$ which are constrained to be positive real values. As such, the optimization problem in question is a constrained optimization problem. In this work we wish to optimize all the above model parameters jointly in an end-to-end manner. We refer to this as the end-to-end optimization problem. Note that if only the SVM parameters were optimized and all other parameters were fixed, the optimization problem could be formulated as a quadratic program by taking the dual and solved in closed-form [31]. This is the most commonly used method for optimizing the parameters of an SVM. In order to solve the end-to-end optimization problem we use a gradient based optimization method. Such methods are the most commonly used methods for optimizing neural network parameters [10]. There are two main approaches that can be used to apply a gradient based optimization method to a constrained optimization problem. The first approach is to project the result of each gradient step back into the feasible region. The second approach is to transform the constrained optimization problem into an unconstrained optimization problem and solve this problem. Such a transformation can be achieved using the Karush-Kuhn-Tucker (KKT) method [27]. In this work we use the former approach. In practice this reduces to passing the parameters $\sigma_l$ and $\beta_l$ through the function $\max(\cdot, 0)$ after each gradient step.
The above optimization can be used in conjunction with any gradient based optimization method such as stochastic gradient descent. In this work the Adam method was used [18].
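The projection step can be sketched as follows. For brevity a plain gradient descent update stands in for Adam, and the parameter names, gradient values and the helper `projected_step` are illustrative assumptions.

```python
import numpy as np


def projected_step(params: dict, grads: dict, learning_rate: float, constrained: tuple) -> dict:
    """One gradient step followed by projection of the constrained
    parameters back onto the feasible region via max(., 0)."""
    updated = {name: value - learning_rate * grads[name] for name, value in params.items()}
    for name in constrained:
        updated[name] = np.maximum(updated[name], 0.0)
    return updated


params = {
    "alpha": np.array([0.2, -0.4]),   # SVM parameters, unconstrained
    "sigma": np.array([1.0]),         # kernel scale, constrained
    "beta": np.array([0.05]),         # kernel weight, constrained
}
grads = {
    "alpha": np.array([1.0, 1.0]),
    "sigma": np.array([-2.0]),
    "beta": np.array([1.0]),
}

updated = projected_step(params, grads, learning_rate=0.1, constrained=("sigma", "beta"))
```

In this example the step would push the kernel weight below zero, so the projection sets it to zero, while the unconstrained SVM parameters are free to become negative.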

Evaluation
In this section we present an evaluation of the proposed end-to-end graph classification model with respect to current state-of-the-art models. This section is structured as follows. Section 4.1 provides implementation details for the proposed model. Section 4.2 describes the baseline models used to compare the proposed model against. Finally, section 4.3 describes the datasets used in this evaluation and compares the performance of all models on these datasets.

Implementation Details
The parameters of the proposed model were initialized as follows. The convolutional layer weights $W^t$ and biases $b^t$ in Equation 2 were initialized using Kaiming initialization [14] and to a value of 0 respectively. The kernel parameters $\{\sigma_1, \ldots, \sigma_s\}$ and $\{\beta_1, \ldots, \beta_s\}$ in Equation 7 were all initialized to a value of 1.
The model hyper-parameters were set as follows. The dimension of the convolutional hidden layers was set equal to 25. The Adam optimizer learning rate was set to its default value of 0.001 and training was performed for 300 epochs. The hyper-parameters $\lambda$ in Equation 10 and $s$ in Equation 7 were selected from the sets {0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0} and {1, 2} respectively by considering classification accuracy on a validation set. The proposed classification model was implemented in Python3 using PyTorch. All experiments were run on an Nvidia GeForce RTX 2080 GPU. As can be observed from Equations 6 and 7, the proposed kernel function reduces to a series of summations. If this were naively implemented using a series of Python for loops this would result in slow learning and inference. To overcome this issue we performed vectorization whereby the kernel was implemented using a series of PyTorch tensor operations. These tensor operations are implemented in C++ as opposed to Python and therefore result in significantly faster learning and inference. The authors will make this code available upon publication of this paper.
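The difference between the two implementation styles can be sketched as follows. NumPy stands in for PyTorch tensor operations here, and the Gaussian form $\exp(-\lVert u - v \rVert^2 / (2\sigma^2))$ and both function names are assumptions, so this is an illustration of the idea rather than the authors' implementation.

```python
import numpy as np


def set_kernel_loops(x_set: np.ndarray, y_set: np.ndarray, sigma: float) -> float:
    """Naive implementation: one Gaussian kernel evaluation per pair of points."""
    total = 0.0
    for u in x_set:
        for v in y_set:
            diff = u - v
            total += np.exp(-(diff @ diff) / (2.0 * sigma ** 2))
    return float(total)


def set_kernel_vectorized(x_set: np.ndarray, y_set: np.ndarray, sigma: float) -> float:
    """Vectorized implementation: all pairwise squared distances at once,
    using ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v."""
    sq_dists = (
        (x_set ** 2).sum(axis=1)[:, None]
        + (y_set ** 2).sum(axis=1)[None, :]
        - 2.0 * x_set @ y_set.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against small negative round-off
    return float(np.exp(-sq_dists / (2.0 * sigma ** 2)).sum())


rng = np.random.default_rng(1)
x_set = rng.normal(size=(20, 25))
y_set = rng.normal(size=(30, 25))

value_loops = set_kernel_loops(x_set, y_set, sigma=3.0)
value_vectorized = set_kernel_vectorized(x_set, y_set, sigma=3.0)
```

Both functions compute the same value; the vectorized version replaces the nested Python loops with a single matrix product and elementwise operations.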
The time and space complexity of classifying a given graph is $O(n)$ where $n$ is the number of graphs in the training dataset. This is a consequence of the summation in Equation 9 over all training examples. The time and space complexity of performing an update of the method parameters using backprop is $O(n^2)$ because this step computes the complete kernel matrix $K$ in Equation 10.

Baseline Methods
As described in the related work section of this paper, existing models for graph classification belong to two main categories of feature engineered kernel and end-to-end deep learning models. Recent studies have found that the latter category of models outperform the former [44]. Therefore for the purposes of this evaluation we only considered end-to-end deep learning models.
A total of 12 baseline methods were considered. We considered so many baseline methods to ensure we were comparing to state of the art; many existing methods claim to outperform each other so it is difficult to determine which methods are in fact state of the art. The baseline methods considered in the evaluation are end-to-end methods but not kernel-based methods. The proposed method is the first end-to-end kernel-based method for graphs.
Implementations for these in PyTorch were obtained from the PyTorch Geometric Python library [6] and can be downloaded directly from the benchmark section of the PyTorch Geometric website. Model parameters were optimized using the Adam optimizer with the default learning rate of 0.001 and run for 300 epochs. For all baseline models a negative log likelihood loss function was used.
Model hyperparameters corresponding to the number and dimension of hidden layers were selected from the sets {1, 2, 3, 4, 5} and {16, 32, 64, 128} respectively by considering the loss on a validation set.
We now briefly describe the architectures of the 12 baseline models; specific implementation details can be found at the benchmark section of the PyTorch Geometric website.
GCN - This model consists of graph convolutional layers proposed by Kipf et al. [19], followed by mean pooling, followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.
SortPool - This model consists of the graph convolutional layers proposed by Hamilton et al. [12], followed by the pooling layer proposed by Zhang et al. [47], followed by a non-linear layer, followed by a dropout layer, followed by a linear layer, followed by a softmax layer.

Datasets and Results
To evaluate the proposed graph classification model we considered four commonly used graph classification datasets obtained from the TU Dortmund University graph dataset repository [17]. The first dataset is the MUTAG dataset which contains 188 graphs corresponding to chemical compounds where there are 7 distinct types of vertices. The classification problem is binary and concerns predicting a particular characteristic of the chemical [3]. The second dataset is the PTC_MR dataset which contains 344 graphs corresponding to chemical compounds where there are 18 distinct types of vertices. The classification problem is binary and concerns predicting a carcinogenicity property. The third dataset is the BZR_MD dataset which contains 306 graphs corresponding to chemical compounds where there are 8 distinct types of vertices. The classification problem is binary and concerns predicting a particular characteristic of the chemical [35]. The final dataset is the PTC_FM dataset which contains 349 graphs corresponding to chemical compounds where there are 18 distinct types of vertices. The classification problem is binary and concerns predicting a carcinogenicity property.
Stratified k-fold cross-validation with a k value of 10 was used to split the data into training and testing sets. One of the k − 1 folds in the training set was randomly selected to be a validation set and classification accuracy on this set was used to select model hyperparameters. The same training, testing and validation splits were used for all graph classification models considered. This is an important point because the performance of a given model may vary as a function of the split used. For each dataset we computed the mean accuracy on the test sets for each method. The results of this analysis are displayed in Table 1. For two of the four datasets, the proposed graph classification model outperformed all baseline methods. In fact, on the MUTAG dataset the proposed model outperformed all baseline methods by a significant margin. For the remaining two datasets, the proposed method outperformed many but not all baseline methods. These positive results demonstrate the utility of the proposed model.
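The splitting protocol described above can be sketched using only the Python standard library. The function names, the round-robin assignment of examples to folds and the toy label vector are assumptions made for illustration; the sketch is not the code used to produce the reported results.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Partition example indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the examples of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset += len(indices)
    return folds


def train_val_test_splits(labels, k, seed=0):
    """For each test fold, pick one remaining fold at random as the validation set."""
    folds = stratified_folds(labels, k, seed)
    rng = random.Random(seed + 1)
    for test_id in range(k):
        remaining = [i for i in range(k) if i != test_id]
        val_id = rng.choice(remaining)
        train = [index for i in remaining if i != val_id for index in folds[i]]
        yield train, folds[val_id], folds[test_id]


labels = [0] * 40 + [1] * 60   # toy binary labels
all_splits = list(train_val_test_splits(labels, k=10))
```

Fixing the seed and reusing `all_splits` for every model is one way to ensure that all models are evaluated on the same training, validation and testing splits.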
It is important to note that the proposed method was compared against a large number of benchmark methods (12). This makes it challenging for any single method to perform best on all datasets. It is difficult to interpret exactly why one deep learning architecture performs better or worse than another on a particular dataset. However, one limitation of the proposed method that may limit its ability to accurately discriminate is that it only models the distribution of node embeddings and not the position of these nodes in the graph. The recent work by You et al. [45] suggests position information is important. The DiffPool method which performed best on the BZR_MD dataset actually uses node position information when performing clustering in the pooling step (this is illustrated in Figure 1 of the original paper by Ying et al. [44]). We hypothesize that position information may not be important for some graph classification tasks while being important for others. This may explain why the proposed method does not uniformly outperform all others. It is also worth noting that the proposed method achieved similar performance to the GIN method on the BZR_MD dataset. In a recent paper by Errica et al. [5], the authors found the GIN method to achieve best results on a number of datasets.

Conclusions and Future Work
This article proposes a novel kernel-based support vector machine (SVM) for graph classification. Unlike existing kernel-based models, the proposed model is trained in a supervised end-to-end manner whereby the convolutional layers, the kernel function and SVM parameters are jointly optimized. The proposed model outperforms existing deep learning models on a number of datasets which demonstrates the utility of the model.
Despite these positive results, the proposed model is not a suitable candidate solution for all graph classification problems. Like all kernel-based models, the proposed model does not natively scale to large datasets. This is a consequence of the fact that training the model requires computing and storing the kernel matrix whose size is quadratic in the number of training examples. This limitation may potentially be overcome by performing an approximation of the kernel function [29]. The authors plan to investigate this research direction in future work.