We evaluated whether similarity or relational information among a set of entities can be combined with node attributes in biological data. For example, one might consider networks of proteins, genes, or bacterial species accompanied by additional experimental data. Applying this model to biological problems provides a framework for predicting attribute or connectivity information about a new observation. We do not intend to suggest new biological insights here. Rather, we show that two sources of information can be combined for prediction tasks and that doing so yields an alternative definition of what constitutes a community in the data. Specifically, using the attributed stochastic block model to integrate connectivity and attribute data provides either a partition that takes both sources of information into account or a way to predict one source (connectivity or attributes) in the absence of the other.
Microbiome subject similarity results
Motivation
In the analysis of biological data, it is often useful to cluster subjects based on a set of measured biological features and then to determine what distinguishes the resulting subgroups. One type of biological data that has gained much attention in recent years is metagenomic sequencing data, which is used to profile the composition of a microbiome. We refer to this as the ‘metagenomic profile’, in which each feature is a count for one bacterial species, also known as an operational taxonomic unit (OTU). Lahti et al. studied subjects across a variety of ethnicities, body mass index (BMI) classifications, and age groups to understand differences in the intestinal microbiota (Lahti et al. 2014). Using metagenomic sequencing, counts for 130 OTUs were provided for each subject. From these OTU profiles, we created an attributed network between subjects. Connectivity represents the pairwise correlation between subjects' OTU profiles, and the attributes of each node are the coordinates of a lower-dimensional embedding (described in the next section). We are therefore overlaying two closely related views of the same data and examining how this can enhance results. Our work here is not meant to reveal novel biology, but rather to provide a computational test of the performance of our attributed SBM approach.
Pre-Processing: The data were downloaded from http://datadryad.org/resource/doi:10.5061/dryad.pk75d. We extracted the subset of subjects from Eastern Europe, Southern Europe, Scandinavia, and the United States and, using only these subjects, constructed a between-subject similarity network among the 121 individuals who had a BMI measurement. This resulted in a network of 121 nodes in which each edge is weighted by the Pearson correlation between the microbial compositions of the two subjects. We then removed all edges with weight (correlation) below 0.7. Because our attributed SBM does not allow for edge weights, we ignored the weights of the remaining edges when fitting the model.
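The network construction described above can be sketched as follows. This is a minimal illustration in plain Python rather than the code used for the experiments: the function names `pearson` and `similarity_network` and the toy profiles are hypothetical, and we assume that an edge with correlation exactly 0.7 is kept, since the text only states that weights below 0.7 were removed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length count vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def similarity_network(profiles, threshold=0.7):
    """Return unweighted edges (i, j), i < j, whose correlation is >= threshold."""
    edges = set()
    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            if pearson(profiles[i], profiles[j]) >= threshold:
                edges.add((i, j))  # the weight is dropped, as the model is unweighted
    return edges


# Toy OTU profiles for three hypothetical subjects.
toy_profiles = [
    [1, 2, 3, 4],
    [2, 4, 6, 8.5],
    [4, 3, 2, 1],
]
toy_edges = similarity_network(toy_profiles)
```

In this toy input only the first two subjects have strongly correlated profiles, so they are the only pair joined by an edge.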
Constructing Node Attributes: Each node had a 130-dimensional vector of attributes (counts). We used this information to create a lower-dimensional attribute vector by performing PCA and representing each node with its first 5 principal components. Each dimension of this new attribute vector was then centered and scaled, after which its distribution was approximately Gaussian.
We first visualized the partitions obtained with the classic and attributed stochastic block models in Fig. 8a-b, respectively. In both networks, nodes are colored by their community assignment. Using the classic stochastic block model and the model selection criterion described in (Daudin et al. 2008), 7 blocks were identified. With the attributed stochastic block model, 6 blocks were identified. While we do not have ground truth labels on the nodes, visual inspection suggests that adding the attributes to the inference problem helps to ‘clean up’ the partition. For example, in Fig. 8a, where nodes are colored by the communities identified with the regular SBM, there is mixing between communities two and three. In Fig. 8b, the attributed SBM reduces this mixing by assigning all of the nodes in that region to community three. Similarly, in Fig. 8a, members of community five are peripheral nodes that are well connected to community seven but not to each other. Because these nodes have little internal connectivity, it is not immediately apparent why they were assigned to their own community. The attributed stochastic block model resolves this in Fig. 8b, where the members of communities five and seven from Fig. 8a are mostly merged into community six. The layout for each of these networks was computed using the ‘layout.nicely’ function in the R igraph library.
Microbiome Link Prediction: We performed link prediction on the microbiome subject similarity network as described in the “Link prediction experiments” section. The associated ROC curves are plotted in Fig. 4. All five methods perform better than chance, with the attributed stochastic block model and the classic SBM tied for the highest AUC. The AUC values for the attributed SBM, Jaccard, Adamic-Adar, preferential attachment, and SBM are 0.71, 0.69, 0.69, 0.62, and 0.71, respectively.
We note that the attributed SBM and the SBM perform very similarly on the microbiome network shown in Fig. 4. This likely arises because the node attributes and the connectivity are both derived from the same OTU profiles and are therefore quite similar. In contrast, for the protein interaction network in Fig. 5, the attributes and connectivity are more complementary sources of information, and including the attribute information greatly helps the link prediction task.
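As a rough illustration of the three neighborhood-based baselines and of the AUC used to compare methods, consider the sketch below. It does not reproduce the hold-out protocol of the “Link prediction experiments” section. All function names are hypothetical, and the AUC is computed under the standard rank interpretation (the probability that a true edge receives a higher score than a non-edge, with ties counted as one half), which we assume matches the area under the reported ROC curves.

```python
from math import log


def neighbor_sets(edges, n):
    """Adjacency sets for an undirected graph on nodes 0..n-1."""
    nbrs = [set() for _ in range(n)]
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def jaccard(nbrs, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = nbrs[u] | nbrs[v]
    return len(nbrs[u] & nbrs[v]) / len(union) if union else 0.0


def adamic_adar(nbrs, u, v):
    """Shared neighbors, each down-weighted by the log of its degree."""
    return sum(1.0 / log(len(nbrs[w]))
               for w in nbrs[u] & nbrs[v] if len(nbrs[w]) > 1)


def preferential_attachment(nbrs, u, v):
    """Product of the two node degrees."""
    return len(nbrs[u]) * len(nbrs[v])


def auc(pos_scores, neg_scores):
    """Fraction of (edge, non-edge) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy_nbrs = neighbor_sets([(0, 1), (0, 2), (1, 2), (2, 3)], n=4)
toy_auc = auc([0.9, 0.8], [0.1, 0.8])
```

An AUC of 0.5 corresponds to random ranking, which is why values such as 0.42 indicate performance below chance.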
Microbiome Collaborative Filtering: We performed the collaborative filtering experiments on the microbiome subject similarity network in the manner described in the “Collaborative filtering experiments” section to predict the 5-dimensional attribute vector for each node. The box plots in Fig. 6 show the distribution of root mean squared error (RMSE), \({\mathcal E}\), over the 121 nodes for the attributed SBM (turquoise), neighbor average (orange), weighted neighbor average (blue), and SBM (pink). All four methods have broadly similar error distributions. The edge-based collaborative filtering methods (neighbor average and weighted neighbor average) perform somewhat better than the community model-based predictions (attributed SBM, SBM), and the attributed SBM outperforms the standard SBM. The mean RMSEs for the attributed SBM, neighbor average, weighted neighbor average, and SBM were 2.3, 2.06, 2.05, and 2.45, respectively. The advantage of the edge-based methods likely arises because predicting an attribute vector from an average over a larger collection of nodes introduces more noise than predicting from only the closest neighbors. A similar behavior is observed in the collaborative filtering experiments on the protein interaction network shown in Fig. 7. Together, these two experiments show that, while not quite as accurate as the methods based on closest neighbors, the attributed stochastic block model can nevertheless be used effectively for collaborative filtering tasks. An interesting direction for future work would be to combine these methods in an attempt to achieve further improvements in accuracy.
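The two edge-based baselines can be sketched as follows. This is an illustrative reading rather than the exact procedure of the “Collaborative filtering experiments” section: we assume that the neighbor average is the unweighted mean of the neighbors' attribute vectors, that the weighted variant uses the edge weights (correlations) as averaging weights, and that the per-node error is the RMSE over the attribute dimensions. The names `neighbor_average` and `rmse` are hypothetical.

```python
from math import sqrt


def neighbor_average(node, nbr_weights, attrs, weighted=False):
    """Predict a node's attribute vector from its neighbors.

    nbr_weights[node] maps each neighbor to its edge weight. With
    weighted=False every neighbor contributes equally.
    """
    nbrs = nbr_weights[node]
    if not nbrs:
        raise ValueError("node has no neighbors to average over")
    dim = len(attrs[node])
    total = sum(w if weighted else 1.0 for w in nbrs.values())
    pred = [0.0] * dim
    for j, w in nbrs.items():
        coeff = (w if weighted else 1.0) / total
        for d in range(dim):
            pred[d] += coeff * attrs[j][d]
    return pred


def rmse(pred, truth):
    """Root mean squared error between a predicted and a true vector."""
    return sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


# Toy example: node 0 has neighbors 1 and 2 with edge weights 1.0 and 3.0.
toy_attrs = {0: [0.0, 0.0], 1: [1.0, 2.0], 2: [3.0, 4.0]}
toy_weights = {0: {1: 1.0, 2: 3.0}}
plain = neighbor_average(0, toy_weights, toy_attrs)
weighted = neighbor_average(0, toy_weights, toy_attrs, weighted=True)
plain_error = rmse(plain, toy_attrs[0])
```

When all edge weights are equal, the two predictions coincide, so the two baselines give identical results on an unweighted network.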
Protein interaction network results
We also applied our attributed SBM approach to the protein interaction network presented in (Bonacci et al. 2014). This network represents interactions between proteins, predicted from the literature. Associated with each node (protein) is a classification into one of 6 experimental modifications observed after exposure of cancer cells to a chemotherapeutic drug. While communities in this network should reflect functional relatedness among proteins (e.g. similar biological functions), we also expect members of a community to share similarities in the observed modification type. Each of the 6 modification types is also annotated with whether that type of modification became more or less prominent after treatment with the drug. Since two types of labels are associated with these nodes, we also explored how these two labeling schemes (6 class vs. 2 class) align with the communities returned by the algorithm.
Data Pre-Processing: We downloaded the unweighted protein interaction network data and the modification information from the supplement of (Bonacci et al. 2014). We removed the 13 nodes that were not connected to the largest component of the network and considered only the 82-node largest connected component.
Constructing Node Attributes: Each node is classified with 1 of 6 possible modification types. For each node, we created an attribute vector that captures the modification types of the nodes around it. To do this, we considered the 4th order neighborhood of each node, that is, the set of nodes at most four hops away in the network. To define the value of attribute c for node i, denoted \(x_{ic}\), we counted the number of 4th order neighbors of node i with label c. After defining these attributes for all nodes, we centered and scaled each of the 6 attributes across the nodes to have mean 0 and unit variance.
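The attribute construction above can be sketched with a breadth-first search. This is a minimal illustration under two assumptions that the text does not settle: a node is not counted in its own neighborhood, and scaling uses the population standard deviation. The function names and the toy path graph are hypothetical.

```python
from collections import deque
from math import sqrt


def within_k_hops(adj, source, k=4):
    """Nodes at distance 1..k from source, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[source]  # assumption: the node itself is excluded
    return set(dist)


def neighborhood_label_counts(adj, labels, classes, k=4):
    """x[i][c]: number of nodes within k hops of i that carry label c."""
    counts = {}
    for i in adj:
        reach = within_k_hops(adj, i, k)
        counts[i] = [sum(1 for j in reach if labels[j] == c) for c in classes]
    return counts


def standardize(counts):
    """Center and scale each attribute across nodes (mean 0, unit variance)."""
    nodes = list(counts)
    dim = len(counts[nodes[0]])
    out = {i: [0.0] * dim for i in nodes}
    for d in range(dim):
        col = [counts[i][d] for i in nodes]
        mean = sum(col) / len(col)
        sd = sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        for i in nodes:
            out[i][d] = (counts[i][d] - mean) / sd if sd > 0 else 0.0
    return out


# Toy path graph 0-1-2-3-4-5 with two alternating labels.
toy_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
toy_labels = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}
toy_counts = neighborhood_label_counts(toy_adj, toy_labels, ["a", "b"])
toy_x = standardize(toy_counts)
```

With a radius of four hops on a small network, most neighborhoods overlap heavily, so the resulting attributes vary smoothly across the graph.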
Figure 9a-b shows the results of fitting a classic SBM and an attributed SBM, respectively. These networks are visualized with the Fruchterman-Reingold force-directed layout in the R igraph library. Nodes are colored by their community assignment. The 6 possible modifications arise from 3 biological processes that can each either increase or decrease after exposure to the drug. The node shape reflects whether the experimental modification for a node increased (square) or decreased (circle) after treatment with the chemotherapeutic agent. Again fitting an SBM with the model selection criterion in (Daudin et al. 2008), five communities were identified. With our attributed SBM, nine communities were identified. The attributed SBM produced more communities because it split the purple core community found by the classic SBM into several smaller communities. The implications of this new partition are explored in Fig. 10 with an entropy calculation based on the biological classifications of the proteins.
Using the partitions of the nodes under the classic and attributed stochastic block models, we evaluated how well the communities agree with two prior biological classifications, or labelings, of the nodes (proteins) (Bonacci et al. 2014). The first classification assigns each protein one of six modification types. The second refers to whether the protein has increased or decreased levels of expression. We used these two types of labels to compute a label entropy measure within each community. The expectation is that by incorporating attribute information related to protein function into the community detection problem, we should see a decrease in the entropy of the biological classification labels within communities. In Fig. 10a-b, we plot the entropy for the 2 class and 6 class node classifications, respectively. We define \(\mathbf{E}_{c}\), the entropy for community c, as
$$ \mathbf{E}_{c}=-\sum_{k} p_{k}\log(p_{k}). $$
(13)
Here, k indexes the unique classifications found in community c and \(p_{k}\) is the probability that a node in community c belongs to biological classification k. In these plots, the black and purple curves correspond to the classic and attributed SBM fits, respectively. Under both types of node classification, we see that the attributed SBM succeeds in breaking up one high entropy community (5) from the classic SBM partition into lower entropy communities.
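The entropy in Eq. (13) can be computed per community as in the sketch below. We assume that \(p_{k}\) is estimated by the empirical fraction of nodes in the community carrying label k and that the logarithm is natural. The text does not state the base, and changing it only rescales the values. The function name and the toy partition are hypothetical.

```python
from collections import Counter
from math import log


def community_entropy(partition, labels):
    """Label entropy E_c = -sum_k p_k log(p_k) for every community c.

    partition maps node -> community, labels maps node -> biological class.
    """
    labels_by_community = {}
    for node, community in partition.items():
        labels_by_community.setdefault(community, []).append(labels[node])

    entropy = {}
    for community, members in labels_by_community.items():
        size = len(members)
        entropy[community] = -sum(
            (count / size) * log(count / size)
            for count in Counter(members).values()
        )
    return entropy


# Toy example: community "A" is evenly mixed, community "B" is pure.
toy_partition = {0: "A", 1: "A", 2: "B", 3: "B"}
toy_classes = {0: "up", 1: "down", 2: "up", 3: "up"}
toy_entropy = community_entropy(toy_partition, toy_classes)
```

A community whose nodes all share one label has entropy zero, and an evenly mixed two-label community attains the maximum log 2, so splitting a mixed community into purer ones lowers the entropy.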
Link Prediction in the Protein Interaction Network: We performed link prediction on the protein interaction network using the procedure described in the “Link prediction experiments” section. Given that this protein network is sparse, none of the link prediction methods performed particularly well. The AUC values for the attributed SBM, Jaccard, Adamic-Adar, preferential attachment, and SBM were 0.61, 0.58, 0.58, 0.54, and 0.42, respectively. The associated ROC curves are shown in Fig. 5. In this example, the attributed SBM substantially outperforms the regular SBM, whose AUC falls below the chance level of 0.5. This is most likely due to the complementary and non-overlapping nature of the connectivity and attribute information.
Collaborative Filtering in the Protein Interaction Network: Collaborative filtering was performed using the method described in the “Collaborative filtering experiments” section. Unlike in the microbiome sample similarity network, the edges in this network are unweighted, and hence the neighbor average and weighted neighbor average methods produce the same result. As in the microbiome network in Fig. 6, the edge-based methods (neighbor average and weighted neighbor average) yield a lower error than the community-based methods (attributed SBM, SBM), and the attributed SBM again outperforms the plain SBM. The root mean squared errors for the attributed SBM, neighbor average, weighted neighbor average, and SBM were 0.74, 0.62, 0.62, and 0.94, respectively. Incorporating attributes therefore reduces the error of the SBM-based prediction from 0.94 to 0.74. As in Fig. 6, the box plots in Fig. 7 represent the distribution of errors across the 82 nodes.
The attributed SBM leads to increased network assortativity
We further quantified the quality of the partition obtained by integrating network connectivity and node attributes with an assortativity measure (Newman 2003). Assortativity measures the extent of homophily in a network under some labeling of the nodes. Intuitively, in a network with a high assortativity coefficient, nodes tend to have the same labels as their neighbors. We labeled nodes according to their community assignments under the classic and attributed stochastic block models and computed assortativity using the assortativity function in igraph. In both the microbiome and protein datasets, using the attributed SBM leads to higher assortativity. In Fig. 11 we show that the assortativities for the regular and attributed SBMs in the microbiome data were 0.61 and 0.71, respectively. Similarly, in the protein dataset, the assortativities for the regular and attributed SBMs were -0.122 and 0.714, respectively. The negative assortativity of the regular SBM partition in the protein-protein interaction network implies that the induced labels are not coherent between individual nodes and their neighbors.
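For intuition, the coefficient can be sketched for categorical labels as below. We assume here that the relevant quantity is Newman's assortativity coefficient for discrete node labels on an undirected network, \(r = (\sum_{k} e_{kk} - \sum_{k} a_{k}^{2}) / (1 - \sum_{k} a_{k}^{2})\), where \(e_{kk}\) is the fraction of edge ends joining two nodes with label k and \(a_{k}\) is the fraction of edge ends attached to nodes with label k. This is a plain-Python illustration with a hypothetical function name, not the igraph call used for the reported values.

```python
def nominal_assortativity(edges, labels):
    """Newman's assortativity for categorical labels on an undirected graph."""
    num_ends = 2 * len(edges)
    same_label_ends = 0
    ends_per_label = {}
    for u, v in edges:
        if labels[u] == labels[v]:
            same_label_ends += 2  # both ends of the edge lie in one class
        for node in (u, v):
            ends_per_label[labels[node]] = ends_per_label.get(labels[node], 0) + 1

    observed = same_label_ends / num_ends
    expected = sum((count / num_ends) ** 2 for count in ends_per_label.values())
    if expected == 1.0:
        raise ValueError("undefined when every edge end carries the same label")
    return (observed - expected) / (1.0 - expected)


# Toy example: two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
r_cohesive = nominal_assortativity(toy_edges, toy_communities)

# A single edge between two differently labeled nodes is maximally disassortative.
r_split = nominal_assortativity([(0, 1)], {0: "a", 1: "b"})
```

Values near 1 indicate that most edges fall within communities, while negative values indicate that edges run between communities more often than expected by chance.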