Using *naive Bayes* and *relaxation labeling*, we classify nodes as either *white* or *black* using different sample sizes and different evaluation metrics (see section “Experimental setup”). Next, we will discuss our results and answer the three research questions which we raised before.

### RQ1: How does *network structure* affect the overall performance of collective classification?

We analyze to what extent the structure of the network (i.e., *homophily*, *class balance* and *edge density*) impacts classification performance. We measure performance of collective classification using ROCAUC scores, where each value can be interpreted as the probability that the classifier scores a randomly chosen node of the positive class higher than a randomly chosen node of the negative class.
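The probabilistic reading of ROCAUC can be made concrete with a small sketch. It computes the score by comparing every positive-negative pair, counting ties as one half; the function name `roc_auc` and the toy scores are illustrative assumptions, not part of our experimental code.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROCAUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # tie: counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores: a perfect ranking and an uninformative one.
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
uninformative = roc_auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```

A classifier that assigns the same score to every node lands at 0.5, which is the random-classifier baseline.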

#### Overall performance vs. network structure

Figure 2 shows the classification performance on synthetic networks with number of nodes \(N={2000}\) and edge density \(d \in \{0.004, 0.02\}\) (rows)^{Footnote 8}. Class balance is defined by the parameter *B* (columns). Homophily *H* ranges from 0 to 1 (x-axis). Sample size, using random node sampling, is shown as the percentage *pseeds* of nodes (colors), and the overall performance as ROCAUC scores (y-axis). At first glance, Fig. 2 reveals four main patterns. **(i)** As expected, classification performance on neutral networks (\(H=0.5\)) is always similar to a random classifier. **(ii)** Surprisingly, heterophilic networks (\(H<0.5\)) require smaller samples to achieve high and stable classification performance compared to homophilic networks (\(H>0.5\)). **(iii)** ROCAUC scores are neither stable nor consistent (i.e., high variance) in the homophilic regime when samples are very small. In other words, classification performance varies widely. **(iv)** Dense networks (\(d=0.02\)) achieve higher classification performance compared to sparse networks (\(d=0.004\)) around \(H=0.5 \pm 0.3\), i.e., \(\overline{\text {ROCAUC}}_{d=0.02, H=0.5 \pm 0.3}=0.82 > \overline{\text {ROCAUC}}_{d=0.004, H=0.5 \pm 0.3}=0.74\).

#### Why is heterophily easier to predict?

In Fig. 2 we see an asymmetry between homophilic (\(H>0.5\)) and heterophilic (\({H<0.5}\)) regimes for small samples (red lines) and all class balance levels *B* (columns). To explain this discrepancy, we turn to the properties of the sampling error^{Footnote 9} and the network structure: undirected networks with two classes contain only three types of edges, namely black-white, white-white, and black-black. In the heterophilic regime, only one type of edge is prevalent (black-white), while in the homophilic regime two types are equally prevalent (white-white, black-black). In general, for small training samples (e.g., \(pseeds \le {{30}\%}\)), the probability of correctly observing each type of edge is very low. Consequently, the parameter estimation is prone to error. However, the impact of this error depends on the class balance and homophily of the network.

*Balanced networks, B=0.5* First, note that the probability of observing a black-black edge in the synthetic network can be calculated analytically given the homophily (*H*), the class balance (*B*), and the degree exponents of the groups (\(\beta\)) as follows:

$$\begin{aligned} P_{bb} = \frac{B^{2} H (1-\beta _{w})}{Z} \end{aligned} \tag{5}$$

where *Z* is a normalization constant, and \(\beta _{b}\) and \(\beta _{w}\) are the exponents of the degree distribution for the *black* and *white* nodes, respectively. For the detailed analytical derivations and values of \(\beta\) see Karimi et al. (2018). Similarly, the probability of observing a black-white edge is given by:

$$\begin{aligned} P_{bw} = \frac{B(1-B)(1-H) [(1-\beta _{b})+(1-\beta _{w})]}{Z} \end{aligned} \tag{6}$$

In the heterophilic case (\(H = 0.2\)), the probability of observing a black-white edge in the whole graph is 0.8. Thus, the sampling error in a small sample follows \((0.8 |\hat{E}|)^{-\frac{1}{2}}\), where \(|\hat{E}|\) is the total number of edges in the sample. In the homophilic case (\(H = 0.8\)), the probability of observing a black-black edge is 0.4 and a white-white edge is also 0.4. The sampling error for *each* homophilic class is then \((0.4 |\hat{E}|)^{-\frac{1}{2}}\), which is already larger than the error in the heterophilic case by a factor of \(\sqrt{2}\); adding the two homophilic errors together widens the gap further. These sampling errors are reflected in the *estimation error*, calculated here as the squared distance between the model parameter inferred from the training sample (\(P\{.\}\)) and the global network (\(\theta \{.\}\)):

$$\begin{aligned} SE\{.\}=(P\{.\}-\theta \{.\})^2 \end{aligned} \tag{7}$$
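To make these magnitudes tangible, the following minimal sketch evaluates the sampling-error expression \((p\,|\hat{E}|)^{-\frac{1}{2}}\) for both regimes and implements the squared estimation error of Eq. 7. The sample of 100 edges and the function names are hypothetical choices made only for illustration.

```python
import math


def sampling_error(edge_probability, n_sampled_edges):
    """Error scale (p * |E_hat|)^(-1/2) for an edge type observed with probability p."""
    return 1.0 / math.sqrt(edge_probability * n_sampled_edges)


def squared_error(estimate, true_value):
    """Eq. 7: squared distance between the sample estimate and the global parameter."""
    return (estimate - true_value) ** 2


n_edges = 100  # hypothetical number of edges in the training sample

# Heterophilic case (H = 0.2): one prevalent edge type (black-white, p = 0.8).
hetero = sampling_error(0.8, n_edges)

# Homophilic case (H = 0.8): two prevalent edge types (p = 0.4 each).
homo_each = sampling_error(0.4, n_edges)
homo_total = 2 * homo_each

ratio = homo_each / hetero  # equals sqrt(0.8 / 0.4) = sqrt(2), whatever n_edges is
```

Under this expression the ratio between a single homophilic error and the heterophilic error does not depend on the sample size, so the asymmetry persists for any small sample.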

We see these errors in the left-most column of Fig. 3, where the x-axis refers to \(SE_{maj|maj}\), and the y-axis to \(SE_{min|min}\). Note that large errors in homophilic networks (\(H=0.8\)) lead to low overall performance (brown). However, there are some cases where performance is also low even though such errors are small. This means that homophilic networks are more sensitive to the precision of the parameter estimation, because good performance there requires \(P_{maj|maj} = P_{min|min}\).

*Unbalanced networks, B<0.5* In addition to the sampling error explained above, the group size differences and the inherent structure of the network add complexity to the learning process. This happens because of the interplay between homophily and preferential attachment, which enables the formation of all different types of connections. For instance, in *homophilic networks* (\(H=0.8\)), minority nodes will be mainly attracted by other minority nodes. However, due to preferential attachment, minority nodes will also be partly attracted to majority nodes. On the other hand, majority nodes will be mostly connected to other majority nodes due to both mechanisms. Therefore, the estimation error of the conditional probability \(P_{maj|maj}\) is on average lower than the estimation error for \(P_{min|min}\), as shown in the bottom-right plot in Fig. 3. The same principle applies to *heterophilic networks* (\(H=0.2\)). In this case, even though most edges are heterophilic, networks will also contain edges between nodes of the same type, but in significantly different proportions. Since there is only a very limited number of minority nodes, there can only be a very limited number of edges between them. That is not the case for majorities because they can connect to many more majorities. Therefore, though locally they connect to a few other majorities, globally there are many edges within this group. This gives an advantage to small samples because the randomly selected majority nodes are likely to be either disconnected^{Footnote 10} or connected to minority nodes that are in the training sample. Thus, the classifier learns that the network is heterophilic. This explains why heterophilic networks can achieve high overall performance even when estimation errors are high for \(P_{maj|maj}\), as shown in the top-right plot in Fig. 3. This holds as long as \(\frac{P_{maj|maj}}{P_{min|maj}} \times \frac{P_{min|min}}{P_{maj|min}} < 1\); otherwise the classifier believes that the network is extremely homophilic.
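The condition on the conditional probabilities can be checked directly on a training sample. The sketch below assumes that \(P_{x|y}\) is estimated as the share of class-*x* neighbors among all neighbors of class-*y* nodes; the function names, the toy edge lists, and this counting scheme are illustrative assumptions rather than our exact estimation code. The ratio test is written in cross-multiplied form to avoid dividing by a zero estimate.

```python
from collections import Counter


def conditional_probabilities(edges, labels):
    """Estimate P[(x, y)]: share of class-x neighbors among neighbors of class-y nodes."""
    counts = Counter()
    for u, v in edges:
        # Undirected edge: each endpoint sees the other one as a neighbor.
        counts[(labels[v], labels[u])] += 1
        counts[(labels[u], labels[v])] += 1
    classes = sorted(set(labels.values()))
    probs = {}
    for y in classes:
        total = sum(counts[(x, y)] for x in classes)
        for x in classes:
            probs[(x, y)] = counts[(x, y)] / total if total else 0.0
    return probs


def learned_as_heterophilic(probs, majority="maj", minority="min"):
    """(P_maj|maj / P_min|maj) * (P_min|min / P_maj|min) < 1, cross-multiplied."""
    same = probs[(majority, majority)] * probs[(minority, minority)]
    cross = probs[(minority, majority)] * probs[(majority, minority)]
    return same < cross


# Hypothetical labeled sample: three majority nodes and two minority nodes.
labels = {"a": "maj", "b": "maj", "c": "maj", "x": "min", "y": "min"}
hetero_edges = [("a", "x"), ("b", "x"), ("c", "y"), ("a", "b")]
homo_edges = [("a", "b"), ("b", "c"), ("x", "y"), ("a", "x")]

hetero_probs = conditional_probabilities(hetero_edges, labels)
homo_probs = conditional_probabilities(homo_edges, labels)
```

In the first toy sample no minority-minority edge is observed, so the product of same-class probabilities is zero and the sample is read as heterophilic even though \(P_{maj|maj}\) is far from zero.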

Finally, besides these conditional probabilities, class priors are also important in the collective inference. Thus, in the balanced case (\(B=0.5\)), we expect the class priors to be the same: \(P_{min} = P_{maj} = 0.5\); if this condition is not fulfilled, the classifier initially believes that one group is more prevalent than the other^{Footnote 11}. In the unbalanced case (\(B=0.1\)), however, it is enough to identify the minority group correctly, regardless of its actual group size.

#### To what extent do these results depend on the algorithm?

For interpretability reasons we chose the network-only Bayes classifier (nBC) as the relational model, since its model parameters correlate with the homophily and class balance of the network. However, it is unclear whether the results shown in Fig. 2 are to some extent a product of the relational classifier. Therefore, we run the classification algorithm on the same networks while changing the relational model. We choose the LINK classifier (Zheleva and Getoor 2009; Altenburger and Ugander 2018), which learns a regularized logistic regression. The features of a node are the entire row of the adjacency matrix and the outcome variable is the node’s class. In this case, the model parameters are not based on the classes of the nodes (as in nBC), but purely on all nodes in the network. Results using this new setup are shown in Figure A3 in the Additional file 1. We see that the main patterns persist compared to the results using nBC. Classification performance achieves its best scores at the extreme levels of homophily, and it drops when networks are neutral. Also, classification on heterophilic networks is only slightly better than classification on homophilic networks. However, the most notable difference between LINK and nBC is the performance across sample sizes. First, we notice that when using nBC, performance drops drastically when using small training samples on homophilic networks. Second, in this regime performance is not stable (i.e., high variance), see Fig. 2. These two issues do not appear in the results when using LINK, see Figure A3 in the Additional file 1. Therefore, we can conclude that performance, in terms of ROCAUC scores, is mainly driven by the type of network (i.e., the interplay between homophily, class balance, edge density and preferential attachment). When it comes to sample size, nBC gets penalized by small samples since their fluctuations introduce noise in the model parameters, while the parameters of LINK never change.
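The idea behind LINK can be sketched in a few lines: each node is described by its row of the adjacency matrix, and a regularized logistic regression is fitted on the labeled nodes. The code below is only an illustration under simplifying assumptions (plain gradient descent, a hypothetical toy network of two cliques, and arbitrary values for `l2`, `lr`, and `epochs`); it is not the implementation used in our experiments.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_link(adjacency, train_nodes, labels, l2=0.1, lr=0.5, epochs=200):
    """Fit L2-regularized logistic regression on adjacency rows by gradient descent."""
    weights = [0.0] * len(adjacency)  # one weight per node (column of the matrix)
    bias = 0.0
    for _ in range(epochs):
        grad_w = [l2 * w for w in weights]  # gradient of the L2 penalty
        grad_b = 0.0
        for i in train_nodes:
            z = bias + sum(w * x for w, x in zip(weights, adjacency[i]))
            error = sigmoid(z) - labels[i]
            grad_b += error
            for j, x in enumerate(adjacency[i]):
                grad_w[j] += error * x
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict_link(adjacency, node, weights, bias):
    """Predicted probability that `node` belongs to class 1."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, adjacency[node])))


# Hypothetical toy network: two disconnected cliques of four nodes each.
n = 8
adjacency = [[0] * n for _ in range(n)]
for group in ([0, 1, 2, 3], [4, 5, 6, 7]):
    for u in group:
        for v in group:
            if u != v:
                adjacency[u][v] = 1

seed_labels = {0: 0, 1: 0, 4: 1, 5: 1}  # labeled training nodes
weights, bias = train_link(adjacency, [0, 1, 4, 5], seed_labels)
predictions = {v: predict_link(adjacency, v, weights, bias) for v in (2, 3, 6, 7)}
```

Note that the features are node identities rather than neighbor classes, which is why this model has one weight per node in the network instead of class-conditional parameters.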

### RQ2: How does the choice of the sampling technique affect the overall performance of collective classification and its parameter estimation?

In section “RQ1: How does *network structure* affect the overall performance of collective classification?” we learned that certain properties of the network structure help in the parameter estimation even when training samples are very small. Now, we compare random node sampling with three other sampling methods: two are *biased* towards high degree nodes (*random edge sampling* and *degree ranking*), and one is *unbiased* (*partial crawls*). More details are given in section “Sampling: the observed network”.
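The degree bias of these samplers can be illustrated on a toy network. The sketch below gives simplified versions of random node sampling, degree ranking, and random edge sampling; the function names, the stopping rule of the edge sampler, and the star-shaped network are assumptions made for illustration, and partial crawls are left out because their procedure does not reduce to a few lines.

```python
import random


def random_node_sample(degrees, k, rng):
    """Pick k nodes uniformly at random, regardless of degree."""
    return set(rng.sample(sorted(degrees), k))


def degree_ranking_sample(degrees, k):
    """Pick the k highest-degree nodes (ties broken by node id)."""
    return set(sorted(degrees, key=lambda v: (-degrees[v], v))[:k])


def random_edge_sample(edges, k, rng):
    """Visit edges in random order and keep endpoints until at least k nodes are seen.

    May return k + 1 nodes, because both endpoints of the last edge are kept.
    """
    nodes = set()
    for u, v in rng.sample(edges, len(edges)):
        nodes.update((u, v))
        if len(nodes) >= k:
            break
    return nodes


# Hypothetical star network: hub 0 connected to leaves 1..9.
edges = [(0, leaf) for leaf in range(1, 10)]
degrees = {0: 9}
degrees.update({leaf: 1 for leaf in range(1, 10)})

rng = random.Random(42)  # fixed seed so the sketch is reproducible
by_node = random_node_sample(degrees, 3, rng)
by_degree = degree_ranking_sample(degrees, 3)
by_edge = random_edge_sample(edges, 3, rng)
```

In this star network every edge touches the hub, so the edge sampler and the degree ranking always include it, whereas uniform node sampling includes it only with probability 3/10.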

Since the focus is on the sampling techniques, we fix the number of nodes and edge density of networks to \(N={2000}\) and \(d=0.004\), respectively. We also omit results on neutral networks and on large sample sizes, since their performance is either consistent or often very high. Results are shown in Fig. 4. The x-axis represents the sum of the squared estimation errors of conditional probabilities \(P_{maj|maj}\) and \(P_{min|min}\), the y-axis shows the squared estimation error of the class prior \(P_{min}\), and colors represent the overall performance.

*Random nodes vs. other sampling techniques:* First, if we look at the estimation errors from the class prior and the conditional probabilities separately (as shown in Fig. 4) we notice that random edges, degree sampling, and partial crawls are better at estimating conditional probabilities than random nodes. This is because conditional probabilities are based on connections between nodes and all three sampling methods exploit these connections during the sampling. Second, not surprisingly, random node sampling is on average better at estimating class priors since it observes a random sample of nodes, and the class prior only depends on the prevalence of node attributes. Third, on average degree sampling achieves the highest performance (\(\overline{\text {ROCAUC}}\approx 0.91\)) followed by random edges, partial crawls, and random nodes (\(\overline{\text {ROCAUC}}\approx 0.81\)). On the other hand, partial crawls sampling provides the most accurate estimates followed by random edges, random nodes, and degree ranking. However, depending on the structure of the network these sampling techniques may improve or worsen their overall performance and parameter estimation as described below.

*Trade-off between homophily and class balance:* In terms of overall performance, all sampling techniques perform equally well in heterophilic networks in both the balanced and unbalanced regimes (\({\overline{\text {ROCAUC}}_{H=0.2}\approx 0.97}\)). Similarly, all sampling techniques perform equally well in homophilic networks (\(\overline{\text {ROCAUC}}_{H=0.8}\approx 0.76\)). However, in homophilic networks performance increases with class balance: it is low for unbalanced networks (\(\overline{\text {ROCAUC}}_{H=0.8, B=0.1}\approx 0.67\)), and high for balanced networks (\(\overline{\text {ROCAUC}}_{H=0.8, B=0.5}\approx 0.85\)). Last but not least, we also see in Fig. 4 that the most accurate estimates across sampling techniques are obtained in balanced networks, especially when they are also heterophilic (more details in Figure A4 in the Additional file 1).

#### Which sampling technique should we use?

If the goal is to achieve high overall performance (\(\overline{\text {ROCAUC}}\approx 1.0\)) with a small sample, random edge sampling or partial crawls should be used in *heterophilic* networks^{Footnote 12}. In *homophilic* networks, degree sampling should be used as long as the degree of nodes is available; otherwise, random edges should be considered. However, if the goal is to achieve good quality of estimates (\(\sum SE = SE_{min}+SE_{maj|maj}+SE_{min|min}\approx 0\)) with a small sample, then the most accurate estimates are obtained by degree ranking (followed by partial crawls) when networks are *balanced*, and by partial crawls when networks are *unbalanced* (see Figure A4 in the Additional file 1).

### RQ3: How do network structure and the choice of sampling technique influence the direction of bias in collective classification?

Now, we explore how classification mistakes are distributed across both classes. If mistakes are concentrated in one class, the classifier is biased against that class. For example, when data is unbalanced, a majority class classifier will be highly accurate, but misclassify—or be biased against—the minority class. To disentangle *how well the algorithm classifies both minority and majority* classes, we extend the balanced accuracy (Brodersen et al. 2010) to assess the direction of bias. We then compare the true positive rates (TPR) of each class as follows:

$$\begin{aligned} bias = \frac{TPR_{min}}{TPR_{min}+TPR_{maj}} \end{aligned} \tag{8}$$

Since our classification task is on a binary attribute, \(TPR_{min}\) refers to *sensitivity* and \(TPR_{maj}\) refers to \(TNR_{min}\)^{Footnote 13} or *specificity*. This bias score ranges from 0 to 1. Depending on its value, classification can be interpreted as: (a) \(\text {bias}<0.5\): biased towards majorities (or against minorities), (b) \(\text {bias}>0.5\): biased towards minorities (or against majorities), and (c) \(\text {bias}=0.5\): unbiased.
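As a quick illustration of how the bias score behaves, the sketch below computes it from true and predicted labels. The function names and the toy label vectors are hypothetical; only the formula follows Eq. 8.

```python
def true_positive_rate(true_labels, predicted_labels, target):
    """Fraction of nodes of class `target` that are predicted as `target`."""
    relevant = [p for t, p in zip(true_labels, predicted_labels) if t == target]
    if not relevant:
        raise ValueError("no node of class %r in the true labels" % target)
    return sum(p == target for p in relevant) / len(relevant)


def bias_score(true_labels, predicted_labels, minority="min", majority="maj"):
    """Eq. 8: TPR_min / (TPR_min + TPR_maj)."""
    tpr_min = true_positive_rate(true_labels, predicted_labels, minority)
    tpr_maj = true_positive_rate(true_labels, predicted_labels, majority)
    if tpr_min + tpr_maj == 0:
        raise ValueError("bias is undefined when both true positive rates are zero")
    return tpr_min / (tpr_min + tpr_maj)


# Hypothetical unbalanced ground truth: 2 minority and 8 majority nodes.
truth = ["min"] * 2 + ["maj"] * 8

all_majority = bias_score(truth, ["maj"] * 10)              # majority-class classifier
all_correct = bias_score(truth, truth)                      # perfect classifier
half_minority = bias_score(truth, ["min", "maj"] + ["maj"] * 8)
```

The majority-class classifier scores 0 (fully biased against minorities) despite being 80% accurate on this toy data, which is the situation that plain accuracy hides.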

Results on networks with fixed number of nodes (\(N={2000}\)) and fixed density (\(d=0.004\)), using the four sampling techniques, are shown in Fig. 5. Large samples (\(pseeds>{{30}\%}\)) are not shown since bias scores converge at that point for almost all cases^{Footnote 14}.

On average, classification results are unbiased in balanced networks (\(B=0.5\)). Additionally, when class balance decreases (\(B<0.5\)), classification results are often biased towards majority nodes. However, depending on the level of homophily of the network, the bias score decreases considerably in neutral and homophilic networks (\(H \in \{0.5,0.8\}\)), or just slightly in heterophilic networks (\(H=0.2\)). Notice as well that in the homophilic regime the standard deviation is high. This means that the variation with respect to which group is classified correctly is high. These results are consistent across all sampling methods, and indicate that unbiased results are more robust to changes in group size and sampling choice in heterophilic networks than in neutral and homophilic networks.

Surprisingly, there are a few cases where classification is biased against majority nodes (\(\text {bias}>0.5\)), specifically in homophilic networks when nodes are sampled randomly (blue) or by degree (green). In these cases, classification performance is low (\(\text {ROCAUC}<0.6\); see Figure A5(a) and Figure A5(c) in the Additional file 1). On the other hand, when classification results are unbiased (\(\text {bias}=0.5\)) or biased against minority nodes (\(\text {bias}<0.5\)), classification performance is high (\(\text {ROCAUC}>0.8\)) and inversely proportional to the sum of estimation errors (see Figure A5 in the Additional file 1).

Future work should investigate new sampling methods or classifiers that focus on overcoming the bias issue of collective classification, especially in homophilic and neutral networks. For instance, one promising direction that has been shown to improve performance, especially for neutral networks, is to look at friends-of-friends similarities in the parameter estimation (Altenburger and Ugander 2018).