Complex Network Effects on the Robustness of Graph Convolutional Networks

Vertex classification -- the problem of identifying the class labels of nodes in a graph -- has applicability in a wide variety of domains. Examples include classifying subject areas of papers in citation networks or roles of machines in a computer network. Vertex classification using graph convolutional networks is susceptible to targeted poisoning attacks, in which both graph structure and node attributes can be changed in an attempt to misclassify a target node. This vulnerability decreases users' confidence in the learning method and can prevent adoption in high-stakes contexts. Defenses have also been proposed, focused on filtering edges before creating the model or aggregating information from neighbors more robustly. This paper considers an alternative: we leverage network characteristics in the training data selection process to improve robustness of vertex classifiers. We propose two alternative methods of selecting training data: (1) to select the highest-degree nodes and (2) to iteratively select the node with the most neighbors minimally connected to the training set. In the datasets on which the original attack was demonstrated, we show that changing the training set can make the network much harder to attack. To maintain a given probability of attack success, the adversary must use far more perturbations; often a factor of 2--4 over the random training baseline. These training set selection methods often work in conjunction with the best recently published defenses to provide even greater robustness. While increasing the amount of randomly selected training data sometimes results in a more robust classifier, the proposed methods increase robustness substantially more. We also run a simulation study in which we demonstrate conditions under which each of the two methods outperforms the other, controlling for the graph topology, homophily of the labels, and node attributes.


Introduction
Classification of vertices in graphs is an important problem in a variety of applications, from e-commerce (classifying users for targeted advertising) to security (classifying computer nodes as malicious or not) to bioinformatics (classifying roles in a protein interaction network). In the past several years, numerous methods have been developed for this task (see, e.g., [17,28]). More recently, research has focused on attacks by adversaries [52,11] and robustness to such attacks [43]. If an adversary were able to insert misleading data into the training set (e.g., by generating benign traffic during a data collection period to conceal its behavior at testing/inference time), the chance of successfully evading detection would increase, leaving data analysts unable to respond to potential threats.
To classify vertices in the presence of adversarial activity, we must implement learning systems that are robust to such potential manipulation. If such malicious behavior has low cost to the attacker and imposes high cost on the data analyst, machine learning systems will not be trusted and adopted for use in practice, especially in high-stakes scenarios such as network security and traffic safety. Understanding how to achieve robustness is key to realizing the full potential of machine learning.
Adversaries, of course, will attempt to conceal their manipulation. The first published poisoning attack against vertex classification was an adversarial technique called Nettack [52], which can create perturbations that are subtle while still being extremely effective in decreasing performance on the target vertices. The authors use their poisoning attack against a graph convolutional network (GCN).
From a defender's perspective, we aim to make it more difficult for the attacker to cause node misclassification. In addition to changing the properties of the classifier itself, there may be portions of a complex network that provide more information for learning than others. Complex networks are highly heterogeneous, and random sampling may not be the best way to obtain labels. If there is flexibility in the means of obtaining training data, the defender should leverage what is known about the graph topology. This paper demonstrates that leveraging complex network properties can improve the robustness of GCNs in the presence of adversaries. We focus on two alternative techniques for training data selection. Both methods aim to train with a subset of nodes that are well connected to the held-out set. Here we see a benefit, often raising the number of perturbations required for a given level of attack success by more than a factor of 2. When it is possible to pick a specific subset on which to train, this can provide a significant advantage. Some combination of these methods will likely be useful in developing a more robust vertex classification system.

Scope and Contributions
In this paper, we are specifically interested in targeted poisoning attacks against vertex classifiers, where the data are modified at training time to cause a specific target node to be misclassified. We consider attacks against the structure of the graph, rather than against node attributes. We focus on classification methods with an implicit assumption of homophily. Working within this context, the contributions of this work are as follows:
• We propose two methods, StratDegree and GreedyCover, for selecting training data that result in a greater burden on attackers.
• We demonstrate that the robustness gained via these methods cannot be reliably obtained by simply increasing the amount of randomly selected training data.
• We show that the most robust defense methods are often improved by working in conjunction with GreedyCover.
• We show that there is no consistent tradeoff between the robustness gained from these methods and classification performance.
• In simulation, we study the effects of various generative models and report the impact of class homophily, topological features, and node attribute similarity across classes on classification performance and robustness to attack.
These contributions all point toward interesting future research in this area, such as determining the conditions under which such methods are effective.

Paper Organization
The remainder of this paper is organized as follows. In Section 2, we briefly contextualize our work within the current literature. In Section 3, we describe the vertex classification problem, GCNs, and the Nettack method. Section 4 outlines the methods we investigate to select training data, and Section 5 details the experimental setup, including datasets, attacks, and classification methods. Section 6 documents experimental results on real data, illustrating the effectiveness of the proposed methods. In Section 7, we present the results of a simulation study in which we vary graph topology, node attributes, and homophily, and evaluate the robustness of the methods across the landscape. In Section 8, we conclude with a summary and outline open problems and future work.

Related Work
Adversarial examples in deep neural networks have received considerable attention since they were documented several years ago [38]. Since that time, numerous attack methods have been proposed, largely focused on the image classification domain (though there has been interest in natural language processing as well, e.g., [18]). In addition to documenting adversarial examples, Szegedy et al. demonstrated that such examples can be generated using the limited-memory BFGS (L-BFGS) algorithm, which identifies an adversarial example in an incorrect class with minimal L2-norm distance to the true data. Later, Goodfellow et al. proposed the fast gradient sign method (FGSM), where the attacker starts with a clean image and takes small, equal-sized steps in each dimension (i.e., alters each pixel by the same amount) in the direction maximizing the loss [16]. Another proposed attack, the Jacobian-based Saliency Map Attack (JSMA), iteratively modifies the pixel with the largest impact on the loss [33]. DeepFool, like L-BFGS, minimizes the L2 distance from the true instance while crossing a boundary into an incorrect class, but does so quickly by approximating the classifier as linear, stepping to maximize the loss, then correcting for the true classification surface [29]. Like Nettack, these methods all try to maintain closeness to the original data (L2 norm for L-BFGS and DeepFool, L0 norm for JSMA, and L∞ norm for FGSM). Some of these methods have been adapted for use with graph data. In [44], the authors modify FGSM and JSMA to use integrated gradients and show the result to be effective against vertex classification. In addition, new attacks against vertex classification have been introduced, including a method that uses reinforcement learning to identify modifications to graph structure for an evasion attack [11]. To increase the scale of attacks, the authors of [23] propose an attack that only considers a k-hop neighborhood of the target. This method attacks a simplified GCN, introduced in [42], which applies a logistic regression classifier after k rounds of feature propagation.
Defenses to attacks such as FGSM and JSMA have been proposed, although several have proven insufficient against stronger attacks. A simple improvement is to include adversarial examples in the training data [16]. Another defense is defensive distillation, in which a classifier is trained with a high "temperature" in the softmax, which is reduced for classification [34]. While this was effective against the methods from [38,16,33,29], it was shown in [4] that modifying the attack by changing the constraint function (which ensures the adversarial example is in a given class) renders this defense ineffective. More defenses have been proposed, such as pixel deflection [36] and randomization techniques [45], but many such methods are still found to be vulnerable to attacks [1,2]. Other work has focused on provably robust defenses [41], with empirical performance often close to certifiable claims [9]. Stochastic networks have also shown improved robustness to various attacks [12]. In the wake of growing interest in adversarial robustness, several authors in the community have aggregated best practices for evaluating systems in [5].
More recent work has focused on robustness of GCNs, including work on robustness to attacks on attributes [54] and more robust GCN variants [49].
Multiple authors have considered aggregation techniques that are less sensitive to outliers [15,7]. One approach to a more robust classifier incorporates an attention mechanism that learns the importance of other nodes' features to a node's class [40]. Others have considered using a GCN with modified graph structure to improve robustness, such as using a low-rank approximation of the graph [14] or filtering edges based on attribute values [44].
Another method also considers attribute values, in this case creating a similarity graph from the attributes that augments the given graph structure to preserve node similarity in feature space [19]. The low-rank structure and node similarity concepts are combined in [21] to create a neural network that aims to simultaneously learn the true graph structure from poisoned data and learn a classifier of unlabeled nodes. The authors of [10] explore a similar idea in the context of noisy data and few labels, using link prediction to augment the observed graph. Relations between attribute similarity and node class, including possible heterophily, are also exploited in GNNGuard [48]. More recent work has shown that attacks that are adaptive to defenses easily undermine the robustness increase observed when using non-adaptive attacks [30]. Other recent GCN developments include modifications to deal with heterophily, via classifier design choices [51] and by learning the level of homophily or heterophily in the graph as part of the training procedure [50]. Several attacks [11,52,6,53,44,46] and defenses [16,44,49,14,21] have been incorporated into a software package called DeepRobust, enabling convenient experimentation across a variety of conditions [24,20]. As with neural networks more generally, there has been work on certifiable robustness for GCNs [3,55]. While this paper is focused on targeted attacks, several attacks, such as [53,46], attack the whole graph in order to degrade overall performance. Some attacks in this area allow adding new nodes [37], flipping labels [25], and rewiring edges [26]. In addition, there are many machine learning tasks on graphs other than vertex classification, and work has been done on, for example, edge classification in an adversarial context [47].

Problem Model
We consider the vertex classification problem as described in [52], where we are given an undirected graph G = (V, E) with N = |V| nodes and an N × d matrix of vertex attributes X. Each node has an arbitrary numeric index from 1 to N. For this work, as in [52], we consider only binary attributes. In addition to its d attributes, each node has a label denoting its class. We enumerate classes as integers from 1 to C. Given a subset of labeled instances, the goal is to correctly classify the unlabeled nodes.
The focus of [52] is on GCNs, which make use of the adjacency matrix of the graph A = {a_ij}, where a_ij is 1 if there is an edge between node i and node j and 0 otherwise. The GCN applies a symmetrized one-hop graph convolution [22] to the input layer. That is, if we let D be the diagonal matrix of vertex degrees (i.e., the ith diagonal entry d_ii = Σ_{j=1}^{N} a_ij is the number of edges connected to vertex i), then the output of the first layer of the network is expressed as

H^{(1)} = σ(Â X W_1),  where Â = D^{-1/2} A D^{-1/2},   (1)

where W_1 is a weight matrix, X is the feature matrix whose ith row is x_i^T (the attribute vector for vertex i), and σ is the rectifier function. From the hidden layer to the output layer, a similar graph convolution is performed, followed by a softmax output:

Y = softmax(Â H^{(1)} W_2).

The focus in [52] is on GCNs with a single hidden layer. Each vertex is then classified according to the largest entry in the corresponding row of Y.
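A minimal dense-matrix sketch of this layer, assuming NumPy arrays and the degree matrix as defined in the text (the function name is illustrative):

```python
import numpy as np

def gcn_layer(A, X, W):
    """One symmetrized graph convolution, sigma(A_hat X W), with
    A_hat = D^{-1/2} A D^{-1/2} and D the diagonal degree matrix of A,
    matching the notation above. (Common GCN implementations also add
    self-loops to A before normalizing.)"""
    d = A.sum(axis=1)                       # vertex degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt     # normalized adjacency
    return np.maximum(A_hat @ X @ W, 0.0)   # ReLU activation
```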
The vertex attack proposed in [52] operates on a surrogate model in which the rectifier function is replaced by a linear function, thus approximating the overall network as

Z = softmax(Â² X W),   (2)

where W = W_1 W_2. Nettack uses a greedy algorithm to determine how to perturb both A and X to make the GCN misclassify a target node. The changes are intended to be "unnoticeable," i.e., the degree distribution of G and the co-occurrence of features change negligibly. Using the approximation in (2), Nettack perturbs by either adding or removing edges or turning off binary features so that the classification margin is reduced the most at each step. Note that while it can change the topology and the features, Nettack does not change the labels of any vertices. In this paper, we only consider structural perturbations. Nettack allows either direct attacks, in which the target node itself has its edges and features changed, or indirect influence attacks, where neighbors of the target have their data altered. The classifier is evaluated in a context where only some of the labels are known, and the labeled data are split into training and validation sets. To train the GCN, 10% of the data are selected at random (or by one of the alternative methods outlined in Section 4), and another 10% are selected for validation. The remaining 80% is the test data. After training, nodes are selected for attack from among those that are correctly classified. The goal of the defender is to make a successful attack as expensive as possible.
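The linearity of the surrogate makes the classification margin straightforward to compute. The sketch below is illustrative (the function name and interface are assumptions, with A_hat denoting the symmetrically normalized adjacency matrix):

```python
import numpy as np

def surrogate_margin(A_hat, X, W, target, true_class):
    """Classification margin of `target` under the linearized surrogate
    Z = A_hat @ A_hat @ X @ W (softmax omitted, since it preserves the
    argmax). A negative margin means the target is misclassified."""
    Z = A_hat @ A_hat @ X @ W        # two rounds of linear propagation
    z = Z[target]
    others = np.delete(z, true_class)
    return z[true_class] - others.max()
```

An attacker's greedy step can then be framed as choosing the single perturbation that most reduces this margin.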
As we discuss in Section 5, we also consider attacks other than Nettack, and classifiers other than standard GCNs. While the details differ (e.g., using different criteria to identify perturbations), the overall problem model remains the same.

Proposed Methods
As we investigated classification performance under Nettack, we noted that nodes in the test set with many neighbors in the training set were more likely to be correctly classified. This dependence on labeled neighbors is consistent with previous observations [31]. We observed this effect using the standard method of training data selection used in the original Nettack paper: randomly select 10% for training, 10% for validation, and 80% for testing. This observation suggested that a training set in which the held-out nodes are well represented among the neighborhoods of the training data, providing a kind of "scaffolding" for the unlabeled data, could make the classification more robust.
We considered two methods to test this hypothesis. The first simply chooses the highest-degree nodes (stratified by class) to be in the training set. We refer to this stratified degree-based thresholding method as StratDegree. The other method, GreedyCover, uses a greedy approach in an attempt to ensure every node has at least a minimal number of neighbors in the training set. Starting with an empty training set and a threshold k = 0, we iteratively add a node of a particular class with the largest number of neighbors that are connected to at most k nodes in the training set. The class is randomly selected based on how many nodes of each class are currently in the training set and the number required to achieve class stratification. When there are no such neighbors, we increment k. This procedure continues until we have the desired proportion of the overall dataset for training. Algorithm 1 provides the pseudo-code.
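The greedy procedure can be sketched as follows. This is a simplified illustration rather than the paper's exact Algorithm 1: class stratification is omitted, and ties are broken by node order.

```python
import networkx as nx

def greedy_cover(G, budget):
    """Greedily pick `budget` training nodes so that every node gets at
    least a minimal number of trained neighbors: repeatedly add the node
    with the most neighbors that currently have at most k training-set
    neighbors, incrementing k when no candidate helps."""
    train, k = set(), 0
    max_deg = max(d for _, d in G.degree)
    while len(train) < budget:
        # nodes still "under-covered" at the current threshold k
        uncovered = {v for v in G if v not in train
                     and sum(u in train for u in G[v]) <= k}
        best, best_gain = None, 0
        for v in G:
            if v in train:
                continue
            gain = sum(u in uncovered for u in G[v])
            if gain > best_gain:
                best, best_gain = v, gain
        if best is not None:
            train.add(best)
        elif k < max_deg:
            k += 1  # every remaining node is covered at level k; raise bar
        else:
            # fallback: coverage can no longer improve; take any node left
            train.add(next(v for v in G if v not in train))
    return train
```

On a star graph, for example, the hub is selected first, since it covers every leaf at once.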

Computational Complexity
StratDegree and GreedyCover both have computational costs beyond random sampling. StratDegree requires finding the highest-degree nodes, which, for a constant fraction of the dataset size, can be done in O(N log N) time by sorting the nodes by degree.

Experimental Setup
Each experiment in our study involves (1) a graph dataset, (2) a method for selecting training data, (3) a structure-based attack against vertex classification, and (4) a classification algorithm. We consider several options for each step in this process, as shown in Figure 1. This section details the methods and datasets

Algorithm 1 GreedyCover
Figure 1: Processing chain for experiments. Each experiment takes a dataset, applies some method to split training, validation, and test data, applies an attack to a set of target nodes, then applies a classifier to the attacked dataset.
We evaluate the robustness of vertex classification (in terms of the attacker's required budget at a given attack success rate) across all combinations of dataset, training selection method, attack, and classifier.
we use across the experiments in this paper. We use the DeepRobust library [24] for datasets, attacks, and classifiers.

Datasets
We use the three datasets from the Nettack paper in our experiments, plus one larger citation dataset:
• CiteSeer The CiteSeer dataset has 3312 scientific publications put into 6 classes. The network has 4732 links representing citations between the publications. Each node's feature vector contains 1s and 0s indicating the presence or absence of dictionary words in the paper; the dictionary contains 3703 unique words.
• Cora The Cora dataset consists of 2708 machine learning papers classified into one of seven categories. The citation network consists of 5429 citations. For each paper (vertex) in the network there is a feature vector of 0s and 1s indicating whether it contains each of 1433 unique words.
• PolBlogs The political blogs dataset consists of 1490 blogs labeled as either liberal or conservative. A total of 19,025 links between blogs form the directed edges of the graph. No attributes are used.
• PubMed The PubMed dataset consists of 19,717 papers pertaining to diabetes, classified into one of three classes. The citation network consists of 44,338 citations. For each paper in the network there is a binary feature vector representing the presence of 500 words.

Training Data Selection
To select training data, we use StratDegree and GreedyCover as described in Section 4, as well as random selection. For StratDegree and GreedyCover, we use the proposed algorithms to select 10% of the data, stratified by class. The remaining 90% of the data is randomly split (stratified by class) into validation (10% of all data) and test (80%) sets. For random selection, we also want to determine whether adding more random training data improves classification robustness. Thus, in addition to using stratified random sampling to select 10% of the data for training, we consider larger training sets, increasing to 30% in 5% increments. In all cases, 10% of the data are used for validation and the remainder comprise the test set. We measure the average number of training-set neighbors for nodes outside the training set, i.e., for the training set T ⊂ V, we record

(1/|V \ T|) Σ_{v ∈ V \ T} |{u ∈ T : {u, v} ∈ E}|.

This allows us to evaluate what impact the overall number of connections to the training data has on performance, and whether performance with the proposed training data selection methods matches any trend observed with random training.
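The recorded coverage statistic can be sketched as follows (the function name is illustrative):

```python
import networkx as nx

def avg_train_neighbors(G, train):
    """Average number of training-set neighbors over nodes outside the
    training set T, the coverage statistic described above."""
    outside = [v for v in G if v not in train]
    return sum(sum(u in train for u in G[v]) for v in outside) / len(outside)
```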
Attacks
In addition to Nettack, described in Section 3, we consider the following structure-based attacks:
• Fast Gradient Attack (FGA) Computes the gradient of the loss function at the target node with respect to the adjacency matrix, then perturbs the entry with the largest gradient that points in the correct direction [6].
• Integrated Gradient Attack (IG-Attack) A similar method that integrates the gradient as an entry of the adjacency matrix varies from 1 to 0 (for edge removal) or from 0 to 1 (for edge addition) [44].
• Simplified Gradient Attack (SGA) Computes gradients that only consider a k-hop subgraph around the target [23].
For direct attacks, we use up to 20 edge additions and removals for a target.
For influence attacks, we allow up to 50 perturbations.

Classifiers
We consider the following eight classifier models, some of which were developed with the explicit intent of improving robustness to adversarial attack:
• GCN The original GCN architecture as used in [52].
• Jaccard Before training the GCN, removes edges between nodes that have dissimilar feature vectors [43].
• SVD Uses a GCN in which the adjacency matrix is replaced with a low-rank approximation via truncated singular value decomposition [14].
• ChebNet Uses the spectral graph convolutions [13] of which the convolution operator (1) is a first-order approximation.
• Simple Graph Convolution (SGC) Applies a model similar to the surrogate (2), where the matrix W is learned via logistic regression on the propagated features [42].
• Graph Attention Network (GAT) Includes an attention mechanism based on the importance of each node's neighbors' features [40].
• Robust Graph Convolutional Network (RGCN) Uses Gaussian convolutions, in which the output is drawn from a Gaussian distribution whose parameters are the output of a neural network [49].
• MedianGCN Aggregates neighbors' features based on their median values rather than weighted averages [7].

Training
We tuned classifier hyperparameters for each (classifier, attack, training selection method) triple, first performing a coarse grid search over all hyperparameters, then refining the result: altering each parameter by 10% and choosing the configuration with the best performance. The performance metric is a linear combination of the macro-averaged F1 score before any attack takes place and the average margin of 10 randomly selected targets after 5 perturbations of a direct attack. The resulting hyperparameters were used in all cases with the corresponding classifier, attack, and training selection method.

Evaluation
We evaluate performance based on 25 target nodes. The targets are randomly selected from the set of nodes that are correctly classified when no attack takes place. This procedure is repeated five times, with the train/validation/test splits recomputed each time. Our robustness metric is the adversary's required budget to achieve a given attack success rate. We compute this based on the number of perturbations required to give a target a negative classification margin in its correct class. If the target is never successfully misclassified, we set the required budget to the maximum number of perturbations. The result is averaged across the five trials.
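The per-target budget computation can be sketched as follows (the interface is an assumption for illustration):

```python
def required_budget(margins, max_perturbations):
    """Attacker's required budget for one target: the smallest number of
    perturbations after which the classification margin goes negative.
    `margins[i]` is the target's margin after i+1 perturbations; if the
    margin never goes negative, the budget is capped at
    `max_perturbations`, as described above."""
    for i, m in enumerate(margins):
        if m < 0:
            return i + 1
    return max_perturbations
```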
Results on Real Data

Impact of Training Methods
We first consider influence attacks, where the target node's neighbors are modified rather than the target itself. We apply both Nettack and FGA, replacing Nettack with SGA if the SGC-based classifier is used. We only obtained results using IG-FGSM on the PolBlogs dataset, which can be seen in the appendix. (IG-FGSM did not substantially outperform the other methods for the best-performing classifiers.) In all other cases, IG-FGSM did not finish in the allotted time. Results are illustrated in Figure 2. In addition to the results for standard GCNs, we plot the upper envelope for each method: at a given attack success probability, the largest required budget across all classifiers. See Appendix A for details about the performance of each individual classifier with each training scheme. CiteSeer has a particularly large increase in the attacker's required budget when using GreedyCover: more than doubling it over several rates of attack success. In fact, at low attack success probabilities, GreedyCover with a GCN provides robustness similar to that of any of the classifiers listed in Section 5.4 with random selection. In addition, GreedyCover provides greater robustness when used in conjunction with the most robust defenses, as shown by the upper envelope. There is a somewhat milder effect on the Cora dataset. In this case, GreedyCover still performs best when using Nettack, but the best performance when attacked with FGA comes from StratDegree (though GreedyCover is within one standard error). With PolBlogs, we also see a benefit from both methods, though we start from a much higher baseline in terms of required perturbations. We see an exception with PubMed, where random training performs best. Looking deeper into the data, we see that the target nodes for random data tend to have higher margins on the best-performing classifiers. In all other cases, GreedyCover performs as well as or better than the other training set selection methods. When considering direct attacks, we use all
four attacks, again with SGA replacing Nettack in the appropriate case. Results of these experiments are shown in Figure 3. It is much more difficult to defend against direct attacks; note that the attacker often needs only one or two perturbations to be successful. With the CiteSeer dataset, we once again see higher robustness with GreedyCover and StratDegree, in particular at low attack probabilities. With Nettack and FGA, GreedyCover improves performance when combined with other defenses. With IG-FGSM, on the other hand, the alternative training methods provide little benefit. Across datasets, this attack also has the lowest robustness across defenses, which suggests it is a preferred attack for adversaries. (Note that for CiteSeer, we only obtained data for a GCN classifier when attacked with IG-FGSM when using GreedyCover.) We once again see a detriment in performance with PubMed, though in this case in the regime where perturbing a single edge results in a successful attack.
Observation 6.2. Direct attacks typically benefit less than influence attacks from the alternative training methods.

Impact of Labeled Neighbors
One possibility we considered is that the robustness from the alternative training methods comes entirely from the average number of trained neighbors for nodes in the test set. To test this possibility, we performed the same experiments with more randomly selected training data, as described in Section 5.2. Results of these experiments are shown in Figure 4, using Nettack as an influence attack against a GCN. While using more randomly selected training data does sometimes increase robustness, the effect is not consistent, and in some cases more randomly selected training data is slightly less robust. The one case where additional training data consistently outperforms GreedyCover in terms of robustness is Cora, where training on 30% of the dataset, randomly selected, outperforms the alternatives. In this case, the average numbers of neighbors per target for GreedyCover and StratDegree are 1.084 and 1.135, both between the values for 25% random training (1.029) and 30% (1.237). Thus, increasing the number of neighbors in the training set by adding more randomly selected training data does not necessarily increase classifier robustness to the same extent.
Observation 6.3. Using more training data with random selection does not consistently lead to higher robustness.

Robustness vs. Classification Performance
Another important consideration is whether increased robustness comes at the expense of classification performance. In Figure 5, we show the macro-averaged F1 score for each method using all classifiers. Performance does occasionally vary. In particular, StratDegree results in somewhat lower performance than random training for most classifiers on all datasets. GreedyCover, on the other hand, typically yields performance similar to random selection and occasionally outperforms it, e.g., using SGC on CiteSeer and ChebNet on Cora. This is another data point in favor of GreedyCover: it tends to yield the greatest robustness across datasets and does not seem to greatly hinder overall classification performance.
Observation 6.4. Using GreedyCover yields no consistent reduction in classification performance compared to random training set selection.

Adaptive Attacks
In the image classification literature, numerous published defenses were found to rely primarily on model obfuscation and remain vulnerable to adaptive attacks that take the new model into account [1]. Recent work has raised similar concerns regarding the robustness of published defenses against GNN attacks [30].
If training set selection makes a classifier more robust, one advantage is that it makes no changes to the model class, and thus should not be vulnerable to such oversights.
We applied our training set selection methods to a demonstration provided by Mujkanovic et al. [30], which includes an adaptive attack based on projected gradient descent (PGD) [46]. The code applies the attack with the objective of reducing overall classifier accuracy. We applied the demo to the same datasets used by the authors, CiteSeer and Cora, and achieved the results shown in Figure 6. While the SVD-based method and GNNGuard are both effectively attacked by the PGD-based method, using GreedyCover to select the training data (again using 10% for training, 10% for validation, and 80% for testing) results in higher post-attack accuracy with both classifiers. As defenses to new adaptive methods are published, it will be interesting to consider their use in conjunction with alternative training set selection.

Simulation Study
The results on real data show that GreedyCover often provides greater robustness to attack, but they are by no means conclusive. In this section, we further explore the methods with simulated data to observe performance differences while controlling network properties.

Synthetic Dataset Generation
Synthetic network generation to evaluate network effects on the performance of GNNs has recently received attention in the research community [32]. In this section, we outline a set of simulations that consider four key features of the network: degree distribution, level of clustering, homophily with respect to labels, and information gained via node attributes.

Random Graph Models
We use random graph models that exhibit different properties in terms of clustering, degree distribution, and dependence on attributes. In each case, we use 1200 nodes and an average degree of approximately 10.
• Erdős-Rényi (ER) Graphs: Each pair of nodes shares an edge with probability 1/120.This model yields homogeneous degree distributions and very little clustering.
• Barabási-Albert (BA) Graphs: Each node enters the graph and connects 5 edges to existing nodes with probability proportional to their degrees. The process is initialized with a 6-node star. This model yields graphs with heterogeneous degree distributions and very little clustering.
• Watts-Strogatz (WS) Graphs: A ring lattice, where each node is connected to 5 nodes on either side, has 10% of its edges randomly rewired. This model yields graphs with substantial clustering and homogeneous degree distributions.
• Lancichinetti-Fortunato-Radicchi (LFR) Graphs: A degree sequence is generated with degree distribution p(d) ∝ d^(−3), with average and maximum degree set to d_avg = 10 and d_max = 135. Nodes are randomly assigned to communities, whose sizes are distributed according to p(|C|) ∝ |C|^(−2), with the minimum community size being 10. Nodes create 80% of their connections within the community and 20% outside the community.
If the generated graph has multiple connected components, we use the largest connected component for the experiment.
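The four generators above are available in networkx; the following is a minimal sketch of the generation step (parameter names follow networkx's API, and note that networkx's Barabási-Albert initialization differs slightly from the 6-node star described above):

```python
import networkx as nx

N = 1200  # number of nodes in each simulated graph

# Erdos-Renyi: edge probability 1/120 gives expected average degree ~10.
er = nx.erdos_renyi_graph(N, 1.0 / 120, seed=0)

# Barabasi-Albert: each arriving node attaches 5 edges preferentially.
ba = nx.barabasi_albert_graph(N, 5, seed=0)

# Watts-Strogatz: ring lattice with 5 neighbors on either side, 10% rewired.
ws = nx.watts_strogatz_graph(N, 10, 0.1, seed=0)

# LFR benchmark: power-law degrees (exponent 3) and community sizes (exponent 2);
# mu=0.2 sends ~20% of each node's edges outside its community.
# Generation can fail to converge for some seeds, so we guard it.
try:
    lfr = nx.LFR_benchmark_graph(
        N, tau1=3, tau2=2, mu=0.2,
        average_degree=10, max_degree=135, min_community=10, seed=0,
    )
except nx.ExceededMaxIterations:
    lfr = None

def largest_cc(g):
    """Restrict to the largest connected component, as in the experiments."""
    return g.subgraph(max(nx.connected_components(g), key=len)).copy()

er = largest_cc(er)
```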

Label Assignment
We assign labels with varying levels of homophily. For the "high homophily" scenario, we partition the nodes based on the normalized graph Laplacian L = I − D^(−1/2) A D^(−1/2), where A and D are the adjacency matrix and diagonal degree matrix as in Section 3 [8]. We select the eigenvector u_2 associated with the second-smallest eigenvalue of L. The nodes associated with the N/2 smallest (most negative) entries of u_2 are labeled 0, and the other nodes are labeled 1. Let V_0 and V_1 be the respective subsets of vertices.
For lower homophily graphs, we first compute the difference between the number of within-label edges and the number of cross-label edges, i.e., letting E_ij be the set of edges between nodes in V_i and nodes in V_j, ∆ = |E_00| + |E_11| − |E_01|. Depending on how homophilous we want the graph to be, we swap labels on pairs of nodes until ∆ reaches a given value, based on its value from the initial Laplacian-based partition (e.g., half as homophilous as the original). The node swapping mechanism is detailed in Algorithm 2.
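As a sketch, the Laplacian-based partition can be implemented with a dense eigendecomposition (adequate at N = 1200), and a helper can compute the ∆ described above. The function names here are illustrative, not from the paper's code:

```python
import numpy as np
import networkx as nx

def spectral_labels(g):
    """Assign binary labels from the eigenvector u_2 of the normalized
    Laplacian: the N/2 nodes with the smallest entries of u_2 get label 0,
    the rest get label 1."""
    nodes = list(g.nodes())
    L = nx.normalized_laplacian_matrix(g, nodelist=nodes).toarray()
    eigvals, eigvecs = np.linalg.eigh(L)  # ascending eigenvalues
    u2 = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
    order = np.argsort(u2)
    half = len(nodes) // 2
    labels = {}
    for rank, idx in enumerate(order):
        labels[nodes[idx]] = 0 if rank < half else 1
    return labels

def homophily_gap(g, labels):
    """Delta = (# within-label edges) - (# cross-label edges)."""
    within = sum(1 for u, v in g.edges() if labels[u] == labels[v])
    return within - (g.number_of_edges() - within)
```

On a graph of two dense clusters joined by a single edge, the partition recovers the clusters and ∆ is large and positive; label swaps can then reduce ∆ to a target value.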

Synthetic Attributes
As with the real graphs, we consider binary attribute vectors on the nodes of the synthetic graphs. In each case, we consider nodes with 100 attributes, and we give each attribute a probability of being true depending on the node's label. Probabilities are determined by an exponentially decreasing function. We consider three scenarios. In the most difficult case, the probabilities are the same for both classes. As we make the problem easier, we shift the function that determines the attribute probabilities so that high-probability attributes in class 0 still have relatively high probabilities in class 1, but there is not a perfect match. The shifts were chosen to create cases where a generalized likelihood ratio test (with each attribute having an independent probability parameter, estimated based on 60 cases for each class) achieves accuracy of approximately 0.5, 0.7, and 0.9. We refer to these cases as having uninformative, moderately informative, and highly informative attributes, respectively. In addition, each node has a one-hot encoded attribute indicating its index in the node set.
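One way to realize such class-conditional attribute distributions is sketched below; the base rate, decay rate, and shift values are illustrative assumptions, since the text specifies only the qualitative shape of the probability function:

```python
import numpy as np

def attribute_probabilities(n_attrs=100, base=0.5, rate=0.05, shift=0):
    """Exponentially decreasing attribute probabilities for two classes.
    Class 1 uses a circularly shifted copy of class 0's probabilities;
    shift=0 yields identical distributions (uninformative attributes),
    and larger shifts make the classes easier to distinguish."""
    p0 = base * np.exp(-rate * np.arange(n_attrs))
    p1 = np.roll(p0, shift)
    return p0, p1

def sample_attributes(labels, p0, p1, rng):
    """Draw a binary attribute vector for each node given its label."""
    X = np.zeros((len(labels), len(p0)))
    for i, y in enumerate(labels):
        p = p0 if y == 0 else p1
        X[i] = rng.random(len(p)) < p
    return X
```

In the actual experiments, the shift would be tuned until a likelihood ratio test on held-out samples reaches the target accuracy (0.5, 0.7, or 0.9).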

Results
For all synthetic topologies, we ran experiments using Nettack to perform an influence attack against a GCN trained with data selected by each of the three methods. Robustness results for ER, BA, WS, and LFR graphs are shown in Figure 7. When no attributes are used and homophily is high, we see a much larger performance difference using GreedyCover than StratDegree in the WS graphs, but the two methods yield more similar performance with the other models. For all models, the performance improvement becomes more modest as homophily decreases.
When attributes with the same distribution are added to both classes (i.e., the case of "uninformative" attributes), robustness suffers in most cases. The LFR graphs in particular see a large decrease in robustness using random selection, with a much smaller decrease using the alternative methods. As feature distributions become more distinct between the classes, the difference between the methods becomes smaller, suggesting that the robustness improvements we observe are likely due to structural considerations. With highly informative attributes, we note that the models with homogeneous degree distributions still gain a benefit from StratDegree and GreedyCover when homophily is low, while the models with heterogeneous degree distributions are somewhat hindered by these methods. As in the real data, this is because the targets have higher margins in the case of random training selection. This may happen due to low-degree nodes, which tend to connect to high-degree nodes: when homophily is low, such nodes may become more difficult to predict based on their proximity to hubs, and they are less likely to be selected for labeling. We summarize our observations as follows:
Observation 7.1. With no attributes and high homophily, all models gain robustness from the alternative methods.
Observation 7.2. With no attributes and low homophily, GreedyCover provides robustness for all models, while for BA and LFR, StratDegree improves robustness only at higher attack success rates.
Observation 7.3. The increase in robustness for the alternative methods decreases as homophily decreases and as attributes become better class predictors.
Observation 7.4. With highly informative attributes and low homophily, GreedyCover and StratDegree maintain some increased robustness for homogeneous degree distributions, while they somewhat hinder performance for heterogeneous ones.
We also consider the potential impact of the alternative training data on classifier performance. Results are shown in Figure 8. Since we use two balanced classes in all cases, we use accuracy as the classification metric. For each case, we plot accuracy as a function of heterophilicity [35], computed as

h = |E_01| / (2|E| |V_0| |V_1| / |V|^2).    (5)

The principal performance differences occur with skewed degree distributions when homophily is low.
The denominator in (5) is the expected number of edges between V_0 and V_1 after random rewiring. A high-homophily graph will have relatively low heterophilicity. Note that both ER and BA graphs span the same range of heterophilicity, while LFR graphs can achieve lower heterophilicity and WS graphs can be almost perfectly homophilous. When no attributes are used, performance is similar across methods in the high-homophily (low-heterophilicity) cases, while the alternative methods perform worse in the low-homophily cases. This yields a significant gap in the cases with skewed degree distributions. In particular, LFR graphs maintain 75% accuracy with random training even in the case where there is no homophily (heterophilicity is 1). As in the analogous results in Figure 7, this may be due to low-degree nodes that are unlikely to be chosen as training data, but are more difficult to classify in a less homophilous setting.
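Assuming the denominator takes the standard random-rewiring form 2|E||V_0||V_1|/|V|^2, heterophilicity can be computed as follows (a sketch; the paper's Eq. (5) may differ in normalization details):

```python
import networkx as nx

def heterophilicity(g, labels):
    """Observed cross-label edges divided by the expected number under
    random rewiring: h = |E01| / (2|E| |V0||V1| / |V|^2)."""
    n = g.number_of_nodes()
    m = g.number_of_edges()
    n0 = sum(1 for v in g if labels[v] == 0)
    n1 = n - n0
    e01 = sum(1 for u, v in g.edges() if labels[u] != labels[v])
    expected = 2 * m * n0 * n1 / (n * n)
    return e01 / expected
```

A perfectly "anti-homophilous" graph such as a complete bipartite graph with labels split by side has h > 1, while two dense same-label clusters joined by a single edge have h well below 1.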
As attributes are added to the graphs, we see a decrease in performance when the uninformative attributes are added, though the difference is very small using GreedyCover for the clustered models. As we expect, accuracy increases as the attributes become more informative. As we observed in the robustness results, we see differences between methods diminish as attributes help discriminate the classes.
Observation 7.5. Graphs with skewed degree distributions and low homophily achieve lower accuracy with GreedyCover and StratDegree than random selection, but performance is similar in other cases.
Observation 7.6. For higher homophily graphs, performance differences between methods decrease as attributes become more informative.
When heterophilicity is approximately 1, accuracy is very low without informative attributes. Considering cases where there is at least some homophily and at least moderately informative attributes, the simulation results where robustness does not improve with StratDegree or GreedyCover are summarized in Table 1. As shown in the table, the cases where there is no improvement all have heterogeneous degree distributions, while the homogeneous degree distributions always have some improvement in robustness when attributes are at least moderately informative. In addition, the low homophily cases result in lower accuracy with the alternative methods. Note also that more informative attributes and a lower clustering coefficient hinder the performance benefit.
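The criterion used to populate Table 1 can be sketched as a simple comparison of budget curves; compare_robustness is a hypothetical helper of ours, not from the paper's code:

```python
import numpy as np

def compare_robustness(budget_alt, budget_rand, sd, n_points_required=10):
    """Classify an alternative method's robustness curve against the random
    baseline, following the Table 1 criterion: 'less robust' if the required
    adversary budget drops by at least one standard deviation at
    n_points_required of the sampled attack-success probabilities,
    'similar' if it stays within one standard deviation at that many points,
    otherwise 'more robust'."""
    budget_alt = np.asarray(budget_alt, dtype=float)
    budget_rand = np.asarray(budget_rand, dtype=float)
    sd = np.asarray(sd, dtype=float)
    if np.sum(budget_alt <= budget_rand - sd) >= n_points_required:
        return "less robust"
    if np.sum(np.abs(budget_alt - budget_rand) <= sd) >= n_points_required:
        return "similar"
    return "more robust"
```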
Relating the synthetic data to the real datasets, recall that StratDegree and GreedyCover both failed to provide consistent improvement for PubMed, and struggled with direct attacks against PolBlogs. Looking into the features of these datasets, two points stand out. First, PolBlogs has an especially heavy-tailed degree distribution: there are many nodes with hundreds of edges, which is rare in the other datasets. Second, the PubMed dataset has node attributes that are very useful in identifying the class of a node: using a support vector machine with a radial basis function kernel trained on the attribute vectors alone (with 50% of the nodes used for training), the macro-averaged F_1 score is approximately 0.71 for Cora, approximately 0.75 for CiteSeer, and about 0.87 for PubMed. As with the synthetic data, the cases with the most informative node attributes are hindered by the alternative training methods.

Conclusions
This paper explores the impact of complex network characteristics on the robustness of vertex classification using GCNs. In particular, we consider various scenarios regarding the structural relationship between the training data and the remainder of the network. We consider selecting training data using alternatives to random selection: using the highest-degree nodes (StratDegree) and using nodes that result in more connections to training nodes from the test set (GreedyCover). We see the greatest improvement using GreedyCover against influence attacks, though there are improvements in other cases as well. With direct attacks via IG-Attack, on the other hand, performance is similar across methods, and robustness in the best case (or worst case for the defender) is lower than with the other attacks. We show that the robustness achieved against Nettack with the alternative training methods is not replicated by increasing the amount of randomly selected training data, and that there is no significant tradeoff between classifier performance and robustness using GreedyCover. In addition, we test StratDegree and GreedyCover against an adaptive global poisoning attack and show that GreedyCover yields better post-attack accuracy than random training.

Table 1: Summary of cases where StratDegree and GreedyCover do not improve robustness when using informative attributes on synthetic graphs. The alternative methods are considered less robust than random training selection if the adversary's budget decreases by at least 1 standard deviation for at least 10 out of 20 points on the associated curve in Figure 7 (attack success probability in multiples of 0.05). They are considered similarly robust if the budget is within 1 standard deviation for over 10 such points. For accuracy, the alternative methods result in lower accuracy if the average accuracy in Figure 8 decreases by at least 3% and similar accuracy if it is within 3%. All cases in the table have heterogeneous degree distributions. All cases with lower accuracy have low homophily. The improvement from the alternatives is also degraded as attributes become more informative (from 70% to 90% accuracy based on attributes alone) and as the clustering coefficient decreases.

In simulation, we see other interesting phenomena in the context of influence attacks: GreedyCover increases robustness against Nettack for a diverse set of topologies when label homophily is high and there are no node attributes. We find that GreedyCover and StratDegree cease to be helpful when homophily is very low and degree distributions are heterogeneous, perhaps because there are fewer labels on low-degree nodes that attach to hubs. In all cases, variation between training selection methods becomes less pronounced as node attributes become more helpful in discriminating between classes.
The work documented here points to several open problems and avenues of potential investigation. First, it is interesting that the integrated gradient method is consistently the strongest attack against real data, regardless of how training data are selected. Determining whether some network phenomenon can be exploited to improve robustness against these attacks would be an interesting topic for future work. Considering additional models for topologies and attributes could yield additional insight into where the various methods perform best, with Google's GraphWorld being an important enabling technology [32]. Another interesting question is whether there are certain topology-attribute combinations where there is a true tradeoff between robustness and classification performance. Identifying such cases, in the spirit of [39] but focused on graph data, would be important to understand what could make classification inherently vulnerable to attack. Another potential area to consider is detectability. Attackers try to hide their manipulation of the data; what would be necessary to determine that an attack has been performed on a graph? For example, we observed that direct attacks from Nettack increase triangle count [27]. There may be other network statistics that tend to change when an attack is carried out. These are all interesting questions to consider as the research community continues to expand its knowledge of vulnerability and robustness in graph machine learning.
Table 2: Results of influence attacks against each classifier with the CiteSeer dataset, with attacker budgets of 10, 30, and 50 edge perturbations. Results are included for Nettack (Net), FGA, and IG-FGSM (IG). For each classifier, we train with random (Rand.), StratDegree (SD), and GreedyCover (GC). Each entry is a probability of attack success, thus higher is better for the attacker and lower is better for the defender. To yield the most robust classifier, the defender picks the classifier/training method combination that minimizes the worst-case attack probability. These entries are listed in bold. Entries representing the most robust case for random training are in italic. Entries listed as N/A did not finish in the allotted time. The Jaccard-based classifier performs best, both overall (with GreedyCover) and using random training.

Table 3: Results of influence attacks against each classifier with the Cora dataset, with attacker budgets of 10, 30, and 50 edge perturbations. Results are included for Nettack (Net), FGA, and IG-FGSM (IG). For each classifier, we train with random (Rand.), StratDegree (SD), and GreedyCover (GC). Each entry is a probability of attack success, thus higher is better for the attacker and lower is better for the defender. To yield the most robust classifier, the defender picks the classifier/training method combination that minimizes the worst-case attack probability. These entries are listed in bold. Entries representing the most robust case for random training are in italic. Entries listed as N/A did not finish in the allotted time. The Jaccard-based classifier performs best, both overall and using random training. If we focus on classifiers that achieve the best performance in Figure 5 (i.e., omitting Jaccard and SVD), the best performance is achieved by GCNs with the alternative training methods.

For each classifier, we train with random (Rand.), StratDegree (SD), and GreedyCover (GC). Each entry is a probability of attack success, thus higher is better for the attacker and lower is better for the defender. To yield the most robust classifier, the defender picks the classifier/training method combination that minimizes the worst-case attack probability. These entries are listed in bold. Entries representing the most robust case for random training are in italic. Entries listed as N/A did not finish in the allotted time. Best results overall and with random training are achieved with SVD, while RGCN performs equally well when using StratDegree.
Table 5: Results of influence attacks against each classifier with the PubMed dataset, with attacker budgets of 10, 30, and 50 edge perturbations. Results are included for Nettack (Net), FGA, and IG-FGSM (IG). For each classifier, we train with random (Rand.), StratDegree (SD), and GreedyCover (GC). Each entry is a probability of attack success, thus higher is better for the attacker and lower is better for the defender. To yield the most robust classifier, the defender picks the classifier/training method combination that minimizes the worst-case attack probability. These entries are listed in bold. Entries listed as N/A did not finish in the allotted time. Only results using Jaccard, GCN, and ChebNet were obtained in time. While StratDegree and GreedyCover improve performance with the Jaccard-based classifier, the best performance is achieved by a ChebNet classifier with random training. In our experiments, this classifier with the PubMed data typically has a much higher margin before the attack takes place.
Table 7: Results of direct attacks against each classifier with the Cora dataset, with attacker budgets of 5, 10, and 20 edge perturbations. Results are included for Nettack (Net), FGA, and IG-FGSM (IG). For each classifier, we train with random (Rand.), StratDegree (SD), and GreedyCover (GC). Each entry is a probability of attack success, thus higher is better for the attacker and lower is better for the defender. To yield the most robust classifier, the defender picks the classifier/training method combination that minimizes the worst-case attack probability. These entries are listed in bold. Entries representing the most robust case for random training are in italic. Entries listed as N/A did not finish in the allotted time. While random training with the SVD classifier works best at a low attack budget, Jaccard with StratDegree performs better against better-resourced attackers.
StratDegree will require O(|E| + |V| log |V|) time (for computing degrees and sorting), compared to O(|V|) time for random sampling. Each step in GreedyCover requires finding the vertex with the most neighbors minimally connected to the training set. As written in Algorithm 1, each iteration requires O(|E|) time to count the number of such neighbors each node has, which would result in an overall running time of O(|V||E|). This could be improved using a priority queue, such as a Fibonacci heap, to achieve O(|E| + |V| log |V|) time (O(|V|) logarithmic-time extractions of the minimum and O(|E|) constant-time key updates). Thus, the two proposed methods require moderate overhead compared to the running time for the GCN.
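A direct, unoptimized sketch of the GreedyCover selection loop as described (the paper's Algorithm 1 may differ in details; this version runs in the O(|V||E|) regime rather than using the priority-queue speedup):

```python
import networkx as nx

def greedy_cover(g, k):
    """Iteratively add to the training set the node with the most neighbors
    that are minimally connected to the current training set, i.e., neighbors
    whose count of training-set neighbors equals the current minimum among
    untrained nodes. Returns a set of k training nodes."""
    train = set()
    # cover[v] = number of training-set neighbors of v
    cover = {v: 0 for v in g}
    for _ in range(k):
        candidates = [v for v in g if v not in train]
        min_cover = min(cover[v] for v in candidates)
        # Pick the candidate with the most minimally covered neighbors.
        best = max(
            candidates,
            key=lambda v: sum(
                1 for u in g[v] if u not in train and cover[u] == min_cover
            ),
        )
        train.add(best)
        for u in g[best]:
            cover[u] += 1
    return train
```

On a star graph, for instance, the hub is selected first, since it alone covers every uncovered leaf.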

Observation 6.1. Training with GreedyCover frequently outperforms other training methods, both with GCNs and in conjunction with published defenses.

Figure 2: Robustness to influence attacks using GCNs (solid line) or with the best defense at a given attack success probability (dashed line). Results are shown for the CiteSeer, Cora, PolBlogs, and PubMed datasets, each plotted in a subsequent row, and using both the Nettack/SGA (left column) and FGA (right column) attacks. Insufficient results were returned in the allotted time for IG-FGSM on all datasets, and for FGA on PubMed. Each curve represents the average required budget over 25 randomly selected targets, and error bars are standard errors. Higher is better for the defender. With the exception of the PubMed dataset, GreedyCover performs at least as well as random training selection, and often performs much better.

Figure 3: Robustness to direct attacks using GCNs (solid line) or with the best defense at a given attack success probability (dashed line). Results are shown for the CiteSeer, Cora, PolBlogs, and PubMed datasets, attacked with Nettack/SGA (left column), FGA (center column), and IG-FGSM (right column). Insufficient results were returned in the allotted time for IG-FGSM and FGA on the PubMed dataset, or for IG-FGSM on the CiteSeer dataset when using a GCN with random training or StratDegree. Each curve represents the average required budget over 25 randomly selected targets, and error bars are standard errors. Higher is better for the defender. While GreedyCover performs better when paired with defenses on CiteSeer when attacked with Nettack or FGA, the alternative methods generally increase robustness less than with indirect attacks.

Figure 6: Performance using an adaptive attack for global poisoning with all three training schemes. Results are shown in terms of overall classifier accuracy using a GCN, an SVD-based GCN, and GNNGuard on the CiteSeer and Cora datasets. Bars showing accuracy before poisoning are desaturated, while bars showing accuracy after poisoning are solid. Higher accuracy is better for the defender. In all cases, selecting training data using GreedyCover results in better post-attack accuracy.

Figure 7: Robustness to influence attacks against GCNs on simulated data. Results are shown for ER (first column), BA (second column), WS (third column), and LFR (fourth column) graphs, in cases with no attributes (first row), uninformative attributes (second row), moderately informative attributes (third row), and highly informative attributes (fourth row). Each curve represents the average required budget over 25 randomly selected targets, and error bars are standard errors. Higher is better for the defender. Results are shown for high homophily (solid line) and low homophily (dashed line) cases. As attributes become more helpful in classification, the advantage gained by the alternative training methods is substantially reduced.

Figure 8: Each bar height represents the average F1 score (macro averaged) across 5 separate train/validation/test splits, and error bars are standard errors. Performance is shown for each classifier where experiments completed within the allotted time. Higher is better for the defender. While StratDegree often underperforms random selection, GreedyCover typically shows similar performance.
Figure 5: Classifier performance across datasets when training data are selected using GreedyCover, StratDegree, or varying amounts of random selection. Results are shown for the CiteSeer (upper left), Cora (upper right), PolBlogs (lower left), and PubMed (lower right) datasets.

Table 4: Results of influence attacks against each classifier with the PolBlogs dataset, with attacker budgets of 10, 30, and 50 edge perturbations. Results are included for Nettack (Net), FGA, and IG-FGSM (IG).