Improved prediction of missing protein interactome links via anomaly detection
Applied Network Science volume 2, Article number: 2 (2017)
Abstract
Interactomes such as protein interaction networks have many undiscovered links between entities. Experimental verification of every link in these networks is prohibitively expensive, so computational methods that direct the search for possible links are of great value. The problem of finding undiscovered links in a network is referred to as the link prediction problem. A popular approach has been to formulate link prediction as a binary classification problem in which class labels indicate the existence or absence of a link (we refer to these as positive links and negative links, respectively) between a pair of nodes in the network. Researchers have successfully applied such supervised classification techniques to determine the presence of links in protein interaction networks. However, it is quite common for protein-protein interaction (PPI) networks to have a large proportion of undiscovered links. Thus, a link prediction approach could incorrectly treat undiscovered positive links as negative links, thereby introducing a bias in the learning. In this paper, we propose to denoise the class of negative links in the training data via a Gaussian process anomaly detector. We show that this significantly reduces the noise due to mislabelled negative links and improves the resulting link prediction accuracy. We evaluate the approach by introducing synthetic noise into the PPI networks and measuring how accurately we can reconstruct the original PPI networks using classifiers trained on both noisy and denoised data. Experiments were performed with five different PPI network datasets, and the results indicate a significant reduction in bias due to label noise and, more importantly, a significant improvement in the accuracy of detecting missing links via classification.
Introduction
Graphical networks can depict many complex systems involving biological, social and informational connections between entities. At the most abstract level, these networks are modelled by graphs in which nodes represent individuals or agents and links denote the interactions or relationships between nodes. Structural properties of biological networks are of great interest as they directly correlate with biological function (Qi and Ge 2006; Wuchty et al. 2003). Various attempts have been made to understand the topological evolution of networks (Albert and Barabási 2002; Dorogovtsev and Mendes 2002). The evolution of networks involves two processes: i) the addition or deletion of nodes and ii) the addition or deletion of edges (links) between nodes. The second process, particularly the addition of new connections to an existing network, has not yet been concretely formalised and revolves around the link-prediction problem. Many applications utilize link prediction to identify new links in large, sparse networks armed only with knowledge of network topology. Therefore, improvements in link prediction accuracy will be of great significance in both science and engineering applications. Link prediction also reflects the extent to which the evolution of a network can be modelled by topological features intrinsic to the network itself.
The link prediction task can be stated as follows: given a network, or a graph, predict what edges will form between nodes in the future. Alternatively, in domains where data collection is costly and the resulting networks are noisy and incomplete, link prediction can be used to identify unobserved edges. In such cases, the problem is also known as the missing link problem.
The objective of this work is to better identify undiscovered (missing and suspicious) links between pairs of nodes in a protein-protein interaction (PPI) network. Link prediction uses the existing protein interaction topology to predict missing links. Discovery of links in biological networks such as gene networks, protein-protein interaction networks and metabolic networks is very costly and time-consuming if done via laboratory experiments, and hence the known connections within these networks remain largely incomplete (Martinez et al. 1999; Sprinzak et al. 2003). Instead of identifying links between all possible pairs of nodes, predictions that focus on already known interactions and are accurate enough can sharply reduce the experimental costs. Discovering protein-protein interactions is a pivotal task for understanding the underlying biological processes behind tasks such as protein function prediction, drug delivery control and disease diagnosis.
Researchers have formulated link prediction as a binary classification problem, where class labels indicate the presence or absence of a link (referred to as positive links or negative links, respectively) between pairs of nodes in the network. In this approach, features based on network topology, such as the common neighbors and the Jaccard coefficient of the two nodes under consideration, are fed to a classifier that predicts the presence or absence of a link. This paper also formulates the link prediction problem as a binary classification problem based on topological features, with a view to improving classification performance. Recently it was found that local community-based features were most effective for link prediction in biological networks, both in monopartite (Cannistraci et al. 2013b) and bipartite networks (Daminelli et al. 2015). Therefore, we have included these in our feature list for the PPI network datasets under consideration.
An unresolved issue with formulating the link prediction problem as a classification problem is the label noise present in the training data. Typically, a set of positive and negative links is randomly chosen from the existing graph and used for training a classifier, which is then used to predict links on the remaining network. However, the absence of a link in the network does not necessarily mean it is a negative link; it may be the case that the link exists but is undiscovered (as commonly occurs with PPI networks). Therefore, including this pair of nodes in the training data as a negative link may introduce label noise and bias the resulting classifier. In this paper, we claim that by using anomaly detection on the negative links of the training data, and by subsequently filtering out the detected anomalous negative links from the training data, we can obtain better classifiers that yield superior link prediction performance. The suggested approach is evaluated on five different PPI networks, with four different classifiers. A comparison of classification with and without anomaly detection is provided, and the results demonstrate that utilizing anomaly detection for filtering suspicious negative links yields superior classifier performance on test data.
Related work
General-purpose neighborhood-based methods have been proposed for link prediction in different kinds of networks: collaboration, social, citation, roadmaps, etc. (Liben-Nowell and Kleinberg 2007; Zhou et al. 2009). Various bio-inspired methods were created either to assess the reliability of interactions in PPI networks, such as Interaction Generality (IG1) (Saito et al. 2002), IG2 (Saito et al. 2003) and IRAP (Chen et al. 2005), or to predict protein function, such as the Czekanowski-Dice Dissimilarity (CDD) (Brun et al. 2003) and FSW (Chen et al. 2006). Later, these techniques were applied to protein interaction prediction (Cannistraci et al. 2013b; Chua et al. 2006). Both approaches rely on the number of neighbors that two non-directly connected nodes have and assign a likelihood score to this pair of nodes.
The simplest techniques are Jaccard’s coefficient (Jaccard 1912), Common Neighbors and Preferential Attachment (Newman 2001). Jaccard’s coefficient assigns higher likelihood scores to the node pairs for which the set of common interactors as a proportion of all available neighbors is higher and Common Neighbors does the same for pairs of nodes that simply share more interactors. Preferential Attachment, on the other hand, gives high scores when both nodes have a large number of neighbors: if one of the nodes has a low number of interactors, the score is reduced. In contrast, Adamic and Adar (2003) and Resource Allocation (Ou et al. 2007) are two similar indices that give more importance to Common Neighbors with low degree.
Various other methods have been proposed to assess the reliability of high-throughput protein interaction data. Kuchaiev et al. (2009) proposed a method for geometric denoising of PPI networks. Cannistraci et al. (2010) proposed a topology-based link prediction method using minimum curvilinear embedding. Cannistraci et al. (2013a) proposed a new valid variation of minimum curvilinear embedding, named non-centred minimum curvilinear embedding. Alanis-Lobato et al. (2013) utilized several measures for the proximity of genes based on the common neighborhood structure of a GI network. However, these methods do not explicitly utilize a classification-based approach to the problem of identifying missing interactions.
Hasan et al. (2006) formulated the link prediction problem as a binary classification problem. The method extracted a set of topological features of the network as input for supervised learning for link prediction. The binary classification approach integrated information from multiple measures to obtain a better prediction. Fire et al. (2011) utilized topological features for supervised learning and ranked the importance of each feature. They proposed a set of simple, computationally efficient topological features that could be analyzed to identify missing links. Cannistraci et al. (2013b) proposed a new paradigm to support link formation called the Local Community Paradigm (LCP), which emphasizes the role of the local network community structure in link formation. They proposed local community-based Cannistraci features for link prediction in PPI networks. Yu et al. (2006) predicted missing links in PPI networks by completing defective cliques. Some methods have been reviewed in Lü and Zhou (2011), and some have been successfully applied for link detection in PPI networks.
Several anomaly detection techniques have been proposed for detecting outlier nodes, edges or substructures in graph data. The techniques may broadly be classified as: i) Feature-based approaches, which utilize structural graph-centric features for outlier detection in the constructed feature space. Essentially, these methods transform the graph anomaly detection problem into the well-understood outlier detection problem (Akoglu et al. 2010; Henderson et al. 2011). ii) Proximity-based approaches, which exploit the graph structure to measure closeness (or proximity) of objects in the graph. These methods capture the simple auto-correlation between these objects, where similar objects are likely to belong to the same class (Jeh and Widom 2002; Brin and Page 1998). iii) Community-based approaches, which utilize clustering methods for graph anomaly detection and rely on finding densely connected groups of 'close-by' nodes in the graph to discover anomalies that have connections across communities (Chakrabarti 2004; Sun et al. 2005; Tong and Lin 2011). iv) Relational learning based approaches, which consist of network-based collective classification algorithms whose main idea is to exploit the relationships between the objects to assign them to classes, where the number of classes is often two: anomalous and normal (Getoor et al. 2001; Jensen et al. 2004). Further details on these approaches can be found in a thorough survey (Akoglu et al. 2015). In this paper we use feature-based anomaly detection techniques to discover suspicious negative links, thereby reducing the impact of label noise introduced by assigning undiscovered positive links to the class of negative links in the training data.
Materials and methods
Network Datasets
We used four protein-protein interaction (PPI) network datasets: Caenorhabditis elegans, Mus musculus, Arabidopsis thaliana and Rattus norvegicus. These are publicly available and were collected from the Protein Interaction Network Analysis (PINA) platform. The platform integrates data from six curated databases and builds a complete, non-redundant dataset for the model organisms ^{1}. Since only interactions reported across multiple datasets were considered after careful curation, in this paper we assume that the reported interactions are relatively noise free. A brief summary of the network characteristics is provided in Table 1:
Methods
Our objective was to minimize the classification bias arising from currently undiscovered edges (positive links) being incorrectly labeled as negative links. To address this bias, we use anomaly detection to remove suspicious negative links (which may be undiscovered positive links) from the training set before classifier training. We then train a link classifier on the filtered dataset obtained after removal of these suspicious negative links. Since we focused on predicting links based only on network topology, we extracted a set of features for node pairs (edges) from the corresponding PPI network with the goal of developing a network topological feature-based classifier. We then performed supervised learning using different machine learning classifiers. The network topology based features utilized for classification are described here.
Topologybased Measures
We briefly describe the set of topology-based measures, or features, that were used in our experiments. A graph-theoretic approach is used to model protein-protein interactions as a network. In this method, a PPI network is represented by an undirected graph G=(V,E), with a set of nodes or vertices V and a set of links or edges E, where vertices represent proteins and edges represent interactions between proteins. In this paper, G will always be an unweighted, undirected graph. Graphs can be characterized by many different topology-based measures, each one reflecting some particular traits of the studied structure. The topology-based measures were chosen based on their successful application in prior work on link prediction (Cannistraci et al. 2013b; Fire et al. 2011; Zhou et al. 2009).
Node-based measures: Let N(v) denote the neighborhood (or open neighborhood) of a node v in a graph G, that is, the set of all nodes adjacent to v. The closed neighborhood of v, denoted by N[v], is simply the set {v}∪N(v). The formal definitions of the neighborhoods used in this study to extract topological measures are:
$$ \begin{aligned} N(v) &= \{u \in V : (u,v) \in E\}\\ N[\!v] &= N(v) \cup \{v\} \end{aligned} $$(1)
Based on the above definitions, the neighborhood subgraphs of v, which are induced by the open and closed neighborhoods of v, are defined as:
$$ \begin{aligned} nbhd\text{-}subgraph(v) &= \{(x,y) \in E : x,y \in N(v)\}\\ nbhd\text{-}subgraph[\!v] &= \{(x,y) \in E : x,y \in N[\!v]\} \end{aligned} $$(2)
Note that the induced subgraphs of the open and closed neighborhoods of a node are very different with respect to their topological properties. The following measures for a node are created using the above neighborhood definitions:

Node degree: The degree of a node in a network is the number of links the node has to other nodes. For an undirected network, the degree of a node v∈V is defined as:
$$ deg(v)= |N(v)| $$(3)
Node subgraphs: This measure denotes the number of links within the open and closed nbhd-subgraphs of each node v, and is defined as:
$$ \begin{aligned} subgraph\text{-}edge\text{-}no(v) &= |nbhd\text{-}subgraph(v)|\\ subgraph\text{-}edge\text{-}no[\!v] &= |nbhd\text{-}subgraph[\!v]| \end{aligned} $$(4)
The density of a subgraph is defined as:
$$ \begin{aligned} density\text{-}nbhd\text{-}subgraph(v)&= \frac {deg(v)} {|nbhd\text{-}subgraph(v)|}\\ density\text{-}nbhd\text{-}subgraph[\!v]&= \frac {deg(v)} {|nbhd\text{-}subgraph[\!v]|} \end{aligned} $$(5)
Note that the formal density of a graph is defined differently; however, the aim of this feature, and of all other features used in the paper, is to be as straightforward and simple as possible. Therefore, we used a somewhat different density that is more closely related to a vertex v.
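The node-based measures in Eqs. (3) to (5) can be illustrated with a minimal Python sketch. The adjacency-dictionary representation, the toy graph, the function names and the convention of returning 0.0 when a neighborhood subgraph has no edges are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations

# Toy undirected graph (hypothetical): node -> set of adjacent nodes.
ADJ = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def count_edges_within(adj, nodes):
    """Number of graph edges whose two endpoints both lie in `nodes`."""
    return sum(1 for x, y in combinations(sorted(nodes), 2) if y in adj[x])

def node_measures(adj, v):
    """Degree, neighborhood-subgraph edge counts and densities of node v."""
    open_nbhd = set(adj[v])          # N(v)
    closed_nbhd = open_nbhd | {v}    # N[v]
    deg = len(open_nbhd)             # Eq. (3)
    e_open = count_edges_within(adj, open_nbhd)      # Eq. (4), open
    e_closed = count_edges_within(adj, closed_nbhd)  # Eq. (4), closed
    # Eq. (5): deg(v) divided by the subgraph edge count.
    # Assumption: return 0.0 when the subgraph has no edges.
    return {
        "deg": deg,
        "subgraph_edge_no_open": e_open,
        "subgraph_edge_no_closed": e_closed,
        "density_open": deg / e_open if e_open else 0.0,
        "density_closed": deg / e_closed if e_closed else 0.0,
    }

measures_c = node_measures(ADJ, "c")
```

For node c of the toy graph, the open neighborhood {a, b, d} contains the single edge a-b, while the closed neighborhood additionally contains the three edges incident to c.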
Edge-based measures: Let u,v∈V where (u,v)∉E. Using the neighborhoods of u and v, we extract various measures. These measures help to determine the likelihood that a link between u and v exists.

Common Neighbors (CN): The common neighbors (CN) measure of u and v is the number of neighbors that u and v share. Two vertices u and v are more likely to connect if they have a larger number of common neighbors. It is defined as (Newman 2001):
$$ CN(u,v)=|N(u) \cap N(v)| $$(6)
Total Neighbors (TN): The total neighbors (TN) measure of u and v is the number of distinct neighbors that u and v have together. The formal definition of TN is:
$$ TN(u,v)=|N(u) \cup N(v)| $$(7)
Jaccard's Coefficient (JC): Jaccard's coefficient (JC) normalizes the number of common neighbors by the number of total neighbors. This gives higher weight to those pairs of nodes which share a higher proportion of common neighbors relative to the total number of neighbors they have. The formal definition of JC is (Jaccard 1912):
$$ JC(u,v)=\frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|} $$(8)
Adamic-Adar Coefficient (AA): This metric refines the simple counting of common neighbors by assigning higher likelihood scores to neighbors that are not shared with many others. It is defined as (Adamic and Adar 2003):
$$ AA(u,v)=\sum_{z \in N(u) \cap N(v)} \frac{1}{\log(|N(z)|)} $$(9)
Resource Allocation Coefficient (RA): The RA coefficient and the AA coefficient have very similar forms, the only difference being that the RA coefficient punishes high-degree common neighbors more heavily than the AA coefficient. It is defined as (Ou et al. 2007):
$$ RA(u,v)=\sum_{z \in N(u) \cap N(v)} \frac{1}{|N(z)|} $$(10)
Preferential Attachment (PA): This measure assigns higher likelihood scores to those pairs of nodes for which one or both nodes have a high degree. The formal definition of PA is (Newman 2001):
$$ PA(u,v)=|N(u)| \cdot |N(v)| $$(11)
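As an illustration of Eqs. (6) to (11), the following Python sketch computes the classical edge-based measures for one candidate pair. The toy graph, the function name and the use of the natural logarithm in AA are assumptions for this example; the paper does not specify the base of the logarithm in Eq. (9).

```python
import math

# Toy undirected graph (hypothetical): node -> set of adjacent nodes.
ADJ = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

def edge_measures(adj, u, v):
    """Classical neighborhood measures for a non-adjacent pair (u, v)."""
    common = adj[u] & adj[v]   # N(u) ∩ N(v)
    union = adj[u] | adj[v]    # N(u) ∪ N(v)
    cn = len(common)                    # Eq. (6)
    tn = len(union)                     # Eq. (7)
    jc = cn / tn if tn else 0.0         # Eq. (8); 0.0 for isolated pairs is an assumption
    # Eq. (9): a common neighbor of two distinct nodes has degree >= 2,
    # so the logarithm is strictly positive. Natural log is an assumption.
    aa = sum(1.0 / math.log(len(adj[z])) for z in common)
    ra = sum(1.0 / len(adj[z]) for z in common)   # Eq. (10)
    pa = len(adj[u]) * len(adj[v])                # Eq. (11)
    return {"CN": cn, "TN": tn, "JC": jc, "AA": aa, "RA": ra, "PA": pa}

features_ad = edge_measures(ADJ, "a", "d")
```

Here the pair (a, d) is not linked and shares the single common neighbor c, which has degree 3, so RA is 1/3 while AA is 1/ln 3: the high degree of c is penalized more strongly by RA.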
LCP-based measures and Cannistraci variants: The local community paradigm suggests that two nodes are more likely to link together if their common first neighbors are members of a strongly inner-linked cohort, or local community. The Cannistraci (LCP-based) variants of the classical neighborhood methods (CN, PA, AA, RA, JC) are defined as (Cannistraci et al. 2013b):
$$ CAR(u,v)= CN(u,v) \cdot LCL(u,v) =CN(u,v) \cdot \sum_{z \in N(u) \cap N(v)}\frac{|\gamma(z)|}{2} $$(12)
$$ CPA(u,v)= e_{u} \cdot e_{v} + e_{u} \cdot CAR(u,v) + e_{v} \cdot CAR(u,v) + CAR(u,v)^{2} $$(13)
$$ CAA(u,v)=\sum_{z \in N(u) \cap N(v)}\frac{|\gamma(z)|}{\log_{2}(|N(z)|)} $$(14)
$$ CRA(u,v)=\sum_{z \in N(u) \cap N(v)}\frac{|\gamma(z)|}{|N(z)|} $$(15)
$$ CJC(u,v)=\frac{CAR(u,v)}{|N(u) \cup N(v)|} $$(16)
where γ(z) refers to the subset of nodes in the neighborhood of z that are also common neighbors of u and v, so that |γ(z)| is the local community degree of z; \(e_{u}\) refers to the external degree of u, and is computed considering the nodes in the neighborhood of u that are not common neighbors of u and v.
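A small Python sketch of Eqs. (12) to (16) may help make the roles of γ(z) and the external degrees concrete. It assumes that γ(z) is the intersection of N(z) with the common neighbors of u and v, and that the external degree of u counts the neighbors of u that are not common neighbors, following the description above; the toy graph and all names are hypothetical.

```python
import math

def build_adj(edges):
    """Adjacency dictionary (node -> set of neighbors) from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def lcp_measures(adj, u, v):
    """Cannistraci (LCP-based) variants for a non-adjacent pair (u, v)."""
    common = adj[u] & adj[v]
    # |gamma(z)|: local community degree of each common neighbor z.
    gamma = {z: len(adj[z] & common) for z in common}
    lcl = sum(gamma.values()) / 2            # links among common neighbors
    car = len(common) * lcl                  # Eq. (12)
    e_u = len(adj[u] - common)               # external degree of u
    e_v = len(adj[v] - common)               # external degree of v
    cpa = e_u * e_v + e_u * car + e_v * car + car ** 2            # Eq. (13)
    caa = sum(g / math.log2(len(adj[z])) for z, g in gamma.items())  # Eq. (14)
    cra = sum(g / len(adj[z]) for z, g in gamma.items())             # Eq. (15)
    union = adj[u] | adj[v]
    cjc = car / len(union) if union else 0.0                      # Eq. (16)
    return {"CAR": car, "CPA": cpa, "CAA": caa, "CRA": cra, "CJC": cjc}

# u and v share the linked common neighbors z1 and z2; u also has neighbor w.
ADJ = build_adj([("u", "z1"), ("u", "z2"), ("v", "z1"),
                 ("v", "z2"), ("z1", "z2"), ("u", "w")])
lcp_uv = lcp_measures(ADJ, "u", "v")
```

In this toy graph the two common neighbors are linked to each other, so LCL is 1 and CAR is 2; removing the edge z1-z2 would drive every LCP-based score except the external-degree term of CPA to zero.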

Friends Measure (FM): The Friends Measure (FM) of u and v is the total number of links between the neighborhoods of u and v. Here we assume that two nodes have a higher chance of being connected if their neighborhoods have more links with each other. The formal definition of FM is (Fire et al. 2011):
$$ FM(u,v)=\sum_{x \in N(u)} \sum_{y \in N(v)} \delta(x,y) $$(17)
where
$$ \delta(x,y)= \left\{\begin{array}{cl} 1 & \text{if}\ x=y \;\text{or}\; (x,y) \in E \;\text{or}\; (y,x) \in E\\ 0 & \text{otherwise} \end{array}\right. $$(18)
Edge subgraph-based measures: The following subgraphs are defined using the neighborhood definitions (Fire et al. 2011): Let u,v∈V.
The above subgraph equations contain information about the number of links between the neighborhoods of u and v, including the inner connections, or links, within each node's neighborhood. The following subgraph equation represents the inner-connection subgraph:

Edge Subgraphs Edges Number: This measure counts the number of links in the above subgraphs:
$$ \begin{aligned} &|nbhd\text{-}subgraph(u,v)|\\ &|nbhd\text{-}subgraph[\!u,v]|\\ &|inner\text{-}subgraph(u,v)| \end{aligned} $$(21)
In this study, we extracted a total of 25 features for each PPI network.
Anomaly detection
We applied multiple anomaly detection techniques, namely Parzen windows, Principal Component Analysis (PCA), Nearest Neighbor (a distance-based method) and a one-class Gaussian process, for removing anomalous negative links from the training data. The details of these methods can be found in (Clifton 2007; 2009; Pimentel et al. 2014). We utilize the link prediction feature set, described in an earlier subsection, for training the anomaly detector. Note that all the methods presented below require only normal data for training; abnormal data is used only for validating the models. In that sense the methods below may be considered unsupervised, as they do not require anomalous data for training. After experimentation, we found that Gaussian process based anomaly detection gave the most reliable results. Hence, we chose the Gaussian process model as the anomaly detector for our classification experiments. Next, we present a brief introduction to all of the methods considered.
Parzen window method: The Parzen window kernel density estimator method (Parzen 1962) is the model adopted here to estimate the probability density function (pdf), p(x), for the training (normal) data. With this method (Bishop 2006), p(x) is estimated using the following steps:

1. Locate a hyperspherical Gaussian window, or kernel, with width σ, on each of the D-dimensional feature vectors in the training dataset, \(x_{i}\), where i = 1, …, N.
2. Evaluate the sum of the Gaussian distributions using the squared Euclidean distances between the test feature vector x and the training vectors \(x_{i}\), normalized by a factor that ensures p(x) integrates to 1.

This gives the following formula for the estimate of p(x):
$$ p(x)=\frac{1}{N(2\pi)^{D/2}\sigma^{D}}\sum_{i=1}^{N}\exp\left(-\frac{\|x-x_{i}\|^{2}}{2\sigma^{2}}\right) $$(22)
By placing a Gaussian kernel over each feature vector \(x_{i}\) in our training dataset, we construct a probability density estimate p(x) that takes higher values where the concentration of training data is greatest. Points in the test set with values of p(x) below a chosen threshold are classified as anomalies.
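The two steps above correspond to the standard Gaussian kernel density estimate, which can be sketched in a few lines of Python. The function names, the toy data and the threshold value are hypothetical, and flagging a point when its density falls below a threshold is an assumed decision rule rather than the authors' exact procedure.

```python
import math

def parzen_density(x, train, sigma):
    """Gaussian Parzen-window estimate of p(x) from training vectors."""
    d = len(x)
    # Normalizer N * (2*pi)^(D/2) * sigma^D makes p(x) integrate to 1.
    norm = len(train) * (2.0 * math.pi) ** (d / 2.0) * sigma ** d
    total = 0.0
    for xi in train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / norm

def is_anomaly(x, train, sigma, threshold):
    """Assumed rule: flag x when its estimated density is below threshold."""
    return parzen_density(x, train, sigma) < threshold

# Hypothetical 2-D feature vectors for the normal class.
TRAIN = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
p_near = parzen_density((0.05, 0.05), TRAIN, sigma=0.5)
p_far = parzen_density((3.0, 3.0), TRAIN, sigma=0.5)
```

The kernel width σ controls how quickly the density decays away from the training data, so in practice it would be tuned together with the threshold on held-out data.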
PCA method: PCA is an orthogonal transformation for transforming the raw data into a space such that the new basis vectors (principal components) are linear combinations of the original basis vectors, are linearly uncorrelated and correspond to the directions of maximal variance of the data, where the first principal component is in the direction of the highest variance, the second in the direction of the highest remaining variance and so on. Anomaly detection is performed with PCA under the assumption that normal data would be best explained by looking at the first few principal components whereas abnormal data would be captured by the remaining principal components (Bishop 2006; Chiang et al. 2001; Marsland 2003). Thus points in the data that have high coefficients for the last few principal components would correspond to anomalous data.
Nearest neighbor method: These approaches rely on the intuition that normal points will have normal neighbours in their vicinity, whereas abnormal points will have fewer normal points in their neighborhood (Hautamäki et al. 2004). Assuming that normal data is partitioned into clusters, the novelty score z(x) of a data point x with respect to a cluster k of width \(\sigma_{k}\) is given by:
where,
and \(\mu_{k}\) is the centre of cluster k, and \(\sigma_{k}\) is defined to be the standard deviation of intra-cluster distances (Clifton 2009). Points with high novelty scores for all clusters are regarded as anomalous.
Gaussian process: Consider a training set \(D =\{(x_{i},y_{i})\}_{i=1}^{n} =(X,y)\), where \(x_{i} \in X \subset \mathbb{R}^{d}\) denotes a feature vector and \(y_{i}\) denotes a scalar output or target. We are interested in identifying the target \(y_{*}\) for a new sample \(x_{*}\). The objective of regression is to find the association between inputs x and target y. To identify this association, we model the mapping as y=f(x)+ε, where f is an unknown function and ε denotes a noise term. One approach is to assume that f is a parametric function f(x;θ) whose parameters θ are tuned on the training data. The major pitfall of this approach is that choosing the wrong functional form can lead to poor predictions. Another approach, based on Gaussian processes, addresses this problem by assigning a prior probability to all possible functions, so that some functions are more likely to be sampled than others. The process is based on the assumption that these functions are drawn from a specified probability distribution. This method requires a training set and may be considered supervised.
The core of GP regression lies in the selection of a prior probability distribution over latent functions, which are sampled from a Gaussian process, i.e., \(f \sim \mathcal {GP} (m(x),\kappa (x,x'))\), where m(x) and κ(x,x′) are the mean and covariance functions, respectively. Without any prior knowledge about the underlying data, the most common choice is a GP with zero mean. A Gaussian process can be described as a generalization of the multivariate Gaussian distribution in which the number of dimensions can extend to infinity. The latent function f is said to follow a Gaussian process if and only if every finite subset of function values is multivariate Gaussian distributed. Therefore, the function values f obey the model below:
$$ f \mid X \sim \mathcal{N}\left(m(X),\kappa(X,X)\right) $$
Furthermore, we assume the noise ε to be Gaussian distributed with mean zero and standard deviation \(\sigma_{n}\), i.e., \(\epsilon \sim \mathcal {N} (0,\sigma _{n}^{2})\). As a result, the output value \(y_{*}\) for a test sample \(x_{*}\) can be deduced in a Bayesian manner by marginalizing over the latent function f. Given training data D, the predictive distribution of \(y_{*}\) is normal, i.e.,
$$ y_{*} \mid X, y, x_{*} \sim \mathcal{N}\left(\mu_{*},\sigma_{*}^{2}\right) $$
where the moments \(\mu_{*}\) and \(\sigma _{*}^{2}\) can be given as closed-form expressions. More details about the GP framework can be found in (Williams and Rasmussen 2006).
Kemmler et al. (2010) have shown how GP regression can be employed for one-class classification problems. They proposed using both the predictive mean \(\mu_{*}\) (GP-Mean) and the negative variance \(-\sigma _{*}^{2}\) (GP-Var) as one-class scores, applied to training data with labels y = 1:
$$ \begin{aligned} \mu_{*} &= k_{*}^{\top}\left(K+\sigma_{n}^{2} I\right)^{-1} y\\ \sigma_{*}^{2} &= k_{**} - k_{*}^{\top}\left(K+\sigma_{n}^{2} I\right)^{-1} k_{*} + \sigma_{n}^{2} \end{aligned} $$
where K=κ(X,X) denotes the kernel matrix of the training set, \(k_{*}=\kappa(X,x_{*})\) represents the vector of kernel values between the training set and the test input, and \(k_{**}=\kappa(x_{*},x_{*})\) is the kernel value of the test input. The correlation of function values, based on the similarity of the input samples, is calculated using the radial basis function (rbf): \(\kappa (x,x')=\exp\left (-\frac { \|x - x'\|^{2}} {2\sigma ^{2}}\right)\).
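The one-class scores can be sketched in plain Python using the standard closed-form GP regression moments with all training targets set to 1. This is an illustrative sketch, not the authors' implementation: the function names, the toy data, the kernel width and noise values, and the small Gaussian-elimination solver (used here only to avoid external dependencies) are all assumptions.

```python
import math

def rbf(a, b, sigma):
    """Radial basis function kernel between two feature vectors."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def gp_one_class_scores(train, x_star, sigma=1.0, noise=0.1):
    """Return (GP-Mean, GP-Var) scores; higher means more 'normal'."""
    n = len(train)
    # K + sigma_n^2 I
    K = [[rbf(train[i], train[j], sigma) + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(xi, x_star, sigma) for xi in train]
    alpha = solve(K, [1.0] * n)   # (K + sigma_n^2 I)^{-1} y with y = 1
    beta = solve(K, k_star)       # (K + sigma_n^2 I)^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = (rbf(x_star, x_star, sigma) + noise ** 2
           - sum(k * b for k, b in zip(k_star, beta)))
    return mean, -var

# Hypothetical feature vectors of the normal class (negative links).
TRAIN = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
near_mean, near_negvar = gp_one_class_scores(TRAIN, (0.05, 0.05))
far_mean, far_negvar = gp_one_class_scores(TRAIN, (5.0, 5.0))
```

A test point far from the training data has kernel values close to zero, so its predictive mean falls towards the zero prior mean and its variance approaches the prior variance plus the noise variance; both scores are therefore lower than for a point inside the training cluster.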
Experimental setup
Since the number of known links is small, we oversample the positive links in the datasets to generate sufficient positive links from each network when required. The set of negative links is much larger, and it has been shown that the subset sampling method used to generate the negative training links impacts the performance of the resulting classifier (Yu et al. 2010). Two predominant sampling methods have been proposed for negative set sampling in PPI networks, namely balanced random sampling and simple random sampling. In simple random sampling, care is taken to ensure that the proteins in the positive set also appear in the negative set. In balanced random sampling, the proteins must occur with the same frequency in both sets. It has further been shown that protein pairs with a higher number of common neighbours are more likely to interact (Alanis-Lobato 2015); therefore, by choosing non-interacting pairs within two hops of each other, we are in effect constructing a negative set that is harder to classify. To ensure that no bias is introduced by the sampling method, we experimented with both balanced random and simple random sampling for choosing our negative set.
Figure 1 shows the workflow of the proposed method. The methodology is organized into two phases. In the first phase, we construct a dataset to train the anomaly detector that filters out the anomalous negative links from the training data of each PPI network. In the second phase, we construct another dataset (disjoint from the dataset in Phase-I) to train a classifier that classifies a pair of nodes as a positive link or a negative link for each PPI network. The two phases are constructed as follows:
Phase-I: In this phase, we first construct the dataset used to train the anomaly detector. Then, we train and evaluate the performance of different anomaly detection methods on each PPI network.

1. We extract positive links from the network and divide them into a validation set and a test set in the ratio 50:50. Note that positive links are not used for training the anomaly detector but only for validation.
2. We extract negative links from the network such that the vertices are within two hops of each other. These are divided into training, validation and test sets in the ratio 60:20:20.
3. Topological features are extracted for the above training, validation and test sets.
4. We train and evaluate different anomaly detection methods and select the best performing one. We use this trained anomaly detector model in Phase-II.
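Step 2 of Phase-I can be sketched as follows. The sketch assumes that "within two hops" means non-adjacent pairs sharing at least one common neighbor, and that the 60:20:20 split is a seeded random shuffle; the function names and the toy path graph are hypothetical.

```python
import random

def two_hop_negative_pairs(adj):
    """Non-adjacent node pairs that share at least one common neighbor."""
    pairs = set()
    for nbrs in adj.values():
        for u in nbrs:
            for v in nbrs:
                if u < v and v not in adj[u]:
                    pairs.add((u, v))
    return sorted(pairs)

def split_60_20_20(items, seed=0):
    """Shuffle and split into training, validation and test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Toy path graph a-b-c-d-e (hypothetical).
ADJ = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
negatives = two_hop_negative_pairs(ADJ)
train, val, test = split_60_20_20(range(10))
```

Restricting negatives to pairs with a common neighbor means that every sampled negative has CN ≥ 1, which is what makes this negative set harder to separate from the positives than uniformly sampled non-edges.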
Phase-II: In this phase, we construct another dataset to train the classifiers. Then, we train and evaluate the performance of different classifiers on each PPI network.

1. We extract positive links from the network and divide them into a training set, a validation set and a test set in the ratio 60:20:20.
2. In order to introduce synthetic noise, we mislabel a fraction of the positive links and assign them labels corresponding to negative links.
3. We extract negative links from the network using simple random sampling, such that the vertices are within two hops of each other.
4. The mislabelled negative links generated in step 2 are merged with the negative links from step 3 to create a noisy dataset with synthetic ground truth. This is divided into training, validation and test sets in the ratio 60:20:20 such that the positive and negative datasets are balanced.
5. Topological features are extracted for the above training and test sets. We call this the unfiltered training dataset.
6. Next, we generate a filtered version of the dataset by using anomaly detection to filter out the noisy negative links generated in step 4.
7. We evaluate different machine learning classifiers on both the filtered and unfiltered training datasets.
8. Prediction accuracy is compared across classifiers trained on the filtered vs. the unfiltered dataset.
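The noise injection and filtering steps of Phase-II (steps 2, 4 and 6) can be sketched as below. The function names, the seeded shuffle, the example link identifiers and the stand-in anomaly predicate are hypothetical; in the workflow described above, the predicate would be the trained anomaly detector applied to the topological features of each link.

```python
import random

def inject_label_noise(positive_links, fraction, seed=0):
    """Split positives into (kept positives, positives mislabelled as negative)."""
    links = list(positive_links)
    random.Random(seed).shuffle(links)
    k = int(round(fraction * len(links)))
    return links[k:], links[:k]

def filter_negatives(negative_links, is_anomalous):
    """Drop negative links that the anomaly detector flags as suspicious."""
    return [link for link in negative_links if not is_anomalous(link)]

# Hypothetical links, represented as (protein, protein) pairs.
positives = [("p%d" % i, "q%d" % i) for i in range(10)]
true_negatives = [("n%d" % i, "m%d" % i) for i in range(7)]

kept_pos, mislabelled = inject_label_noise(positives, fraction=0.3)
noisy_negatives = true_negatives + mislabelled   # unfiltered negative class

# Stand-in for a trained detector: an oracle that knows the injected noise.
flagged = set(mislabelled)
filtered_negatives = filter_negatives(noisy_negatives, lambda l: l in flagged)
```

Because the injected noise is known, the fraction of mislabelled links that the detector removes, and the fraction of genuine negatives it keeps, can be measured directly against the synthetic ground truth.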
Results
Performance evaluation of different anomaly detectors
We trained different anomaly detection methods on the training set of each PPI network and measured their performance on the corresponding test set, where the training and test sets are constructed as described in Phase-I of the previous subsection. We trained the anomaly detection methods on negative links (normal class) only and utilized the positive links (abnormal class) for validation and testing. Each method yields an anomaly score on the validation set, and a detection threshold is chosen by minimizing the false positive and false negative rates on the validation set. We report accuracy metrics on the test set using the chosen optimal threshold in Table 2. We notice that the one-class Gaussian process (gpoc) anomaly detection technique scores better than the other anomaly detectors. Since this was consistent across the datasets, we used the one-class Gaussian process technique for anomaly detection in this paper.
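One simple way to pick such a threshold is to scan candidate values and keep the one minimizing the sum of the two error rates on the validation set. The sketch below assumes that higher scores indicate more normal points (as with the GP-Mean and GP-Var scores) and that the two rates are weighted equally; the text does not specify the exact criterion, and the function name and scores are hypothetical.

```python
def choose_threshold(normal_scores, anomaly_scores):
    """Pick threshold t minimizing FPR + FNR; a point is flagged if score < t.

    FPR: fraction of normal points wrongly flagged as anomalies.
    FNR: fraction of anomalous points that are not flagged.
    """
    candidates = sorted(set(normal_scores) | set(anomaly_scores))
    candidates.append(candidates[-1] + 1.0)   # threshold that flags everything
    best_t, best_err = None, float("inf")
    for t in candidates:
        fpr = sum(s < t for s in normal_scores) / len(normal_scores)
        fnr = sum(s >= t for s in anomaly_scores) / len(anomaly_scores)
        if fpr + fnr < best_err:
            best_t, best_err = t, fpr + fnr
    return best_t, best_err

# Hypothetical validation scores.
val_normal = [0.9, 0.8, 0.7, 0.4]     # negative links (normal class)
val_anomalous = [0.1, 0.2, 0.5]       # positive links (abnormal class)
threshold, error = choose_threshold(val_normal, val_anomalous)
```

With these scores the chosen threshold is 0.7: all three anomalous points are flagged, at the cost of flagging one of the four normal points.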
Now we focus our attention on the dataset described earlier for PhaseII. We apply the anomaly detector trained above only on negative links of noisy PhaseII training set. We validate the performance of the GP anomaly detector (one class gaussian process) using True Positive Rate (TPR) and True Negative Rate (TNR). Results are provided in Table 3. The TNR is slightly lower than TPR because of uncertain labels of negative links i.e. some of the negative links may be positive links and may be detected as outliers.
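For concreteness, the two rates can be computed as in the sketch below. It assumes that the "positive" class for the detector is the set of mislabelled links (true positive links hidden among the negatives) and that the "negative" class is the set of remaining negative links; the function and argument names are illustrative only.

```python
def tpr_tnr(flagged, is_mislabelled):
    """Return (TPR, TNR) for an anomaly detector applied to negative links.

    flagged[i]        -- True if the detector marked link i as an anomaly
    is_mislabelled[i] -- True if link i is a positive link labelled negative
    """
    n_pos = sum(1 for y in is_mislabelled if y)
    n_neg = len(is_mislabelled) - n_pos
    # Mislabelled links that were correctly flagged.
    tp = sum(1 for f, y in zip(flagged, is_mislabelled) if f and y)
    # Remaining negative links that were correctly left alone.
    tn = sum(1 for f, y in zip(flagged, is_mislabelled) if not f and not y)
    return tp / n_pos, tn / n_neg
```

Under this reading, a negative link that is in fact an undiscovered interaction and is flagged by the detector counts against the TNR, which matches the explanation given above.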
Gene Ontology (GO) validation of anomaly detection
We also examined the biological significance of the anomaly detection results using the Gene Ontology (GO) scores of protein pairs in the different PPI networks. Proteins that are involved in the same biological function or share the same biological pathway are more likely to interact with each other than proteins that belong to different pathways, so GO similarity is a useful measure of the quality of our predictions. We calculated the GO score of protein pairs for each of the gene ontology classes, i.e. biological process, cellular component and molecular function, using the Protein Interaction Network Analysis Platform (PINA) ^{2}. As seen in Table 3, the TNR is lower than the TPR because our anomaly detector flags some protein pairs labelled as non-interacting as anomalies. These may be undiscovered interactions, and to test this hypothesis we look at the GO scores of these anomalous protein pairs. In Mus musculus, the anomaly detector extracts 1360 protein pairs as anomalies, of which 612 (45 %) have a GO score greater than 0.5, and 268 of these 612 pairs were found to interact according to different public databases. In Rattus norvegicus, 707 protein pairs were extracted, of which 543 (about 77 %) had GO scores greater than 0.5. In Caenorhabditis elegans, 2826 protein pairs were extracted as anomalies, of which 1254 (about 44 %) had GO scores greater than 0.5. Thus, a substantial proportion of the protein pairs filtered by the anomaly detection technique outlined in this paper have GO scores above 0.5 and may potentially be undiscovered interactions. We further checked the discovered anomalies against the Negatome Database, which contains experimentally supported non-interacting protein pairs. Not a single anomalous negative link discovered by the anomaly detector was found in the Negatome database, which is consistent with the hypothesis that the pairs flagged by the anomaly detector may be undiscovered interactions.
Performance evaluation of different classifiers
After removing the anomalies, we have two training sets, one before and one after filtering out the suspicious negative links, and we trained four standard classifiers (SVM, C5.0, KNN and Naive Bayes) on both training sets for each PPI network. We used three standard metrics, Accuracy, F-measure and Area Under the ROC Curve (AUC), to measure the performance of each classifier. The F-measure captures the trade-off between the precision and recall of a classifier at a particular threshold setting, whereas the AUC is independent of the threshold: it evaluates the classifier as the threshold varies over all possible values. We denote the set of true positives as TP, the set of true negatives as TN, the set of false positives as FP and the set of false negatives as FN. Accuracy and F-measure are computed from the sizes of these four sets.
As mentioned earlier, we experimented with both simple random and balanced random sampling for constructing our negative set; please refer to Tables 4 and 5 for a comparative analysis. The results indicate that the accuracies of the models do not change significantly between the two approaches, so for all further experiments we chose simple random sampling.
We repeated our experiments ten times with randomly selected training and test sets to reduce any bias introduced by a particular random split. We then used the t-test to assess the statistical significance of the differences between the Accuracy scores obtained with and without anomaly detection. In all cases except the C5.0 classifier on the Arabidopsis thaliana dataset, we obtain a p-value < 0.0001, which indicates that the difference is highly statistically significant. The Naive Bayes classifier is weak without anomaly detection but improves the most once anomaly detection is applied, since its poor prior performance leaves more room for improvement after the data are filtered. The remaining classifiers (SVM, C5.0 and KNN) exhibit good classification performance before anomaly detection, but they too show a significant improvement when anomaly detection is used to filter the training set. The sole exception is the C5.0 classifier, which yields 99.36 % accuracy on the Arabidopsis thaliana dataset without anomaly detection and therefore has very little margin for improvement (this is marked with a * in Table 4). A comparison of the accuracies of the different classifiers on each PPI network is shown in Fig. 2. Table 4 reports the classification performance in terms of Accuracy, F-measure and AUC, and all three measures improve with anomaly detection.
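For reference, the standard definitions of these metrics in terms of the four sets are given below. The equations are not preserved in this copy of the text, so this is a reconstruction from the usual textbook forms, with |·| denoting the size of a set.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{|TP| + |TN|}{|TP| + |TN| + |FP| + |FN|} \\
\text{Precision} &= \frac{|TP|}{|TP| + |FP|} \\
\text{Recall}    &= \frac{|TP|}{|TP| + |FN|} \\
\text{F-measure} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```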
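The text does not state which variant of the t-test was used. As one possibility, the sketch below computes Welch's unpaired two-sample t statistic and its approximate degrees of freedom from two lists of accuracy scores, for instance ten runs with and ten runs without anomaly detection; a paired test would be a reasonable alternative if the same splits are used in both conditions. The function name `welch_t` is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

The p-value is then obtained from the Student t distribution with `df` degrees of freedom, which is usually delegated to a statistics library.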
Feature importance
To understand the contribution of each feature to link prediction in the PPI networks, we compared the predictive power of the features. To measure their relative importance, we analysed the information gain with respect to each feature. Information gain is based on the decrease in entropy after a dataset is split on an attribute; in decision tree induction, the attribute with the highest information gain is selected for the split. We obtained the information gain of an attribute as follows: Information gain = (entropy of the distribution before the split) - (entropy of the distribution after the split).
Here, entropy is the Shannon entropy of a discrete probability distribution p on the countable set {x_1, x_2, x_3, ...}, with p_i = p(x_i).
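In symbols, the standard forms of these two quantities are shown below. The equation itself is not preserved in this copy of the text, so the base-2 logarithm and the notation D (dataset), A (attribute) and D_v (the subset of D where A takes value v) are assumptions that follow the usual textbook presentation.

```latex
\begin{align*}
H(p) &= -\sum_{i} p_i \log_2 p_i \\
\text{Information gain}(D, A) &= H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
\end{align*}
```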
By comparing the entropy before and after the split, we obtain a measure of information gain (Han and Kamber 2006). We ranked all the features by their information gain. Table 6 presents the information gain on the training sets of all the PPI networks. It can be seen that CommonNeighbors, the Adamic-Adar coefficient, the Resource Allocation coefficient, Cannistraci-based Preferential Attachment, Friends Measure, the number of links in the inner-subgraph and the number of links in the neighborhood-subgraph are highly influential for almost all of the PPI networks. In PPI networks, proteins that form complexes display common functions. So, if proteins A, B and C share the same function and protein A interacts with B and C, it is very probable that B and C also interact. It is therefore expected that CommonNeighbors would be an influential feature for link prediction in PPI networks (Alanis-Lobato 2015). Proteins that are grouped together into cliques and quasi-cliques in PPI networks are also known to share identical functions, and hence links are more likely to form within a densely connected group of proteins. The number of links in the inner-subgraph and in the neighborhood-subgraph are thus also highly influential features for link prediction in PPI networks. It is also noteworthy that more nuanced neighbor-counting features such as Adamic-Adar and Resource Allocation are more predictive than features that rely only on the number of common neighbors.
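To make the difference between plain neighbor counting and the degree-weighted variants concrete, the following sketch computes three of the features for a candidate pair using their standard definitions (Adamic and Adar 2003; Zhou et al. 2009). The adjacency-dictionary representation and the function name `neighbor_features` are assumptions, and the exact feature implementation used in the experiments may differ.

```python
from math import log


def neighbor_features(adj, u, v):
    """Common-neighbor based scores for the candidate link (u, v), with u != v.

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    shared = adj[u] & adj[v]
    return {
        # Plain count of shared neighbours.
        "common_neighbors": len(shared),
        # Adamic-Adar: shared neighbours weighted by 1 / log(degree).
        # A shared neighbour has degree >= 2, so the logarithm is positive.
        "adamic_adar": sum(1.0 / log(len(adj[z])) for z in shared),
        # Resource allocation: shared neighbours weighted by 1 / degree.
        "resource_allocation": sum(1.0 / len(adj[z]) for z in shared),
    }
```

In the example from the text, where protein A interacts with B and C, the pair (B, C) has one common neighbor, A. Both weighted scores shrink as the degree of A grows, so a shared hub protein contributes less evidence than a shared low-degree protein.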
Performance evaluation across datasets
To evaluate how well our method generalizes across datasets, we conducted experiments in which a model trained on one dataset is tested on all the other datasets. Tables 7, 8, 9 and 10 tabulate the results for models trained on the Arabidopsis thaliana, Caenorhabditis elegans, Mus musculus and Rattus norvegicus datasets respectively. The results demonstrate that, for the most part, models derived from topological features on one dataset using anomaly detection show a gain over models learned without anomaly detection. The one exception is the transfer of performance to C. elegans, which suggests that this network may be topologically somewhat different from the rest. Interestingly, the models trained on C. elegans with anomaly detection do exhibit a strong gain in performance on the other datasets. Overall, though, the method does seem to transfer well across datasets.
Discussion
This paper presents a technique for filtering graph link training data using anomaly detection, for the purpose of link prediction in PPI networks. The resulting predictor compares favourably with the classifier trained on unfiltered data. The central idea is to place a filtering step before the classification step, in which suspicious links are removed from the training data. One issue that needs emphasis is that the choice of anomaly detection technique plays a critical role in the success of the resulting classifier: if the anomaly detector is inaccurate, the classifier may not yield optimum performance. One way to ascertain the efficacy of the anomaly detector is to deliberately mislabel positive links and check whether the detector can find them, which is how we selected our Gaussian Process algorithm. The technique shows the most improvement when the link prediction accuracy is not particularly high before filtering, as this leaves greater room for improvement. The technique still needs to be extended to link prediction in networks with directed edges (e.g. metabolic networks) and weighted edges (e.g. neural networks). Moreover, while it is useful for detecting missing links more efficiently, it may have to be adapted to work for evolving networks in which the links are constantly changing. These ideas are the focus of our future work.
Endnotes
^{1} Downloaded from: http://cbg.garvan.unsw.edu.au/pina/interactome.stat.do on February 10, 2015.
^{2} http://cbg.garvan.unsw.edu.au/pina/interactome.goSimForm.do
References
Adamic, LA, Adar E (2003) Friends and neighbors on the web. Soc Netw 25(3): 211–230.
Albert, R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1): 47.
Alanis-Lobato, G, Cannistraci CV, Ravasi T (2013) Exploitation of genetic interaction network topology for the prediction of epistatic behavior. Genomics 102(4): 202–208.
Alanis-Lobato, G (2015) Mining protein interactomes to improve their reliability and support the advancement of network medicine. Front Genet 6: 296.
Al Hasan, M, Chaoji V, Salem S, Zaki M (2006) Link prediction using supervised learning. In: Proceedings of the SDM’06 Workshop on Link Analysis, Counterterrorism and Security. SIAM, Bethesda.
Akoglu, L, McGlohon M, Faloutsos C (2010) Oddball: Spotting anomalies in weighted graphs. In: Advances in Knowledge Discovery and Data Mining, 410–421. Springer, Berlin, Heidelberg.
Akoglu, L, Tong H, Koutra D (2015) Graph based anomaly detection and description: a survey. Data Mining Knowl Discov 29(3): 626–88.
Bishop, CM (2006) Pattern Recognition and Machine Learning. Springer-Verlag, New York.
Brin, S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1): 107–117.
Brun, C, Chevenet F, Martin D, Wojcik J, Guénoche A, Jacq B (2003) Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol 5(1): 1.
Cannistraci, CV, Ravasi T, Montevecchi FM, Ideker T, Alessio M (2010) Nonlinear dimension reduction and clustering by Minimum Curvilinearity unfold neuropathic pain and tissue embryological classes. Bioinformatics 26(18): i531–i539.
Cannistraci, CV, Alanis-Lobato G, Ravasi T (2013) Minimum curvilinearity to enhance topological prediction of protein interactions by network embedding. Bioinformatics 29(13): i199–i209.
Cannistraci, CV, Alanis-Lobato G, Ravasi T (2013) From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks. Sci Rep 8: 3.
Chakrabarti, D (2004) Autopart: Parameter-free graph partitioning and outlier detection. In: Knowledge Discovery in Databases: PKDD, 112–124. Springer, Berlin, Heidelberg.
Chen, J, Hsu W, Lee ML, Ng SK (2005) Discovering reliable protein interactions from high-throughput experimental data using network topology. Artif Intell Med 35(1): 37–47.
Chen, J, Chua HN, Hsu W, Lee ML, Ng SK, Saito R, Sung WK, Wong L (2006) Increasing confidence of protein-protein interactomes. Genome Inform 17(2): 284–297.
Chua, HN, Sung WK, Wong L (2006) Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 22(13): 1623–1630.
Clifton, LA (2007) Multi-Channel Novelty Detection and Classifier Combination, Ph.D. dissertation, Electrical and Electronic Engineering. Univ. Manchester, Manchester.
Clifton, DA (2009) Novelty detection with extreme value theory in jet engine vibration data. PhD diss, University of Oxford.
Chiang, LH, Braatz RD, Russell EL (2001) Fault detection and diagnosis in industrial systems. Springer-Verlag, London.
Daminelli, S, Thomas JM, Durán C, Cannistraci CV (2015) Common neighbours and the local-community-paradigm for topological link prediction in bipartite networks. New J Phys 17(11): 113037.
Dorogovtsev, SN, Mendes JF (2002) Evolution of networks. Adv Phys 51(4): 1079–187.
Fire, M, Tenenboim L, Lesser O, Puzis R, Rokach L, Elovici Y (2011) Link prediction in social networks using computationally efficient topological features. In: Proceedings of the Third IEEE International Conference on Privacy, Security, Risk and Trust (PASSAT’11) and the Third IEEE International Conference on Social Computing (SocialCom’11), 73–80. IEEE, Boston.
Getoor, L, Friedman N, Koller D, Pfeffer A (2001) Learning probabilistic relational models. In: Relational data mining, 307–335. Springer, Berlin, Heidelberg.
Hautamäki, V, Kärkkäinen I, Fränti P (2004) Outlier Detection Using k-Nearest Neighbour Graph. In: Proceedings of the 17th International Conference on Pattern Recognition, 430–433. IEEE Computer Society, Cambridge.
Han, J, Kamber M (2006) Data Mining Concepts and Techniques. 2nd ed. Morgan Kaufmann, San Francisco.
Henderson, K, Gallagher B, Li L, Akoglu L, Eliassi-Rad T, Tong H, Faloutsos C (2011) It’s who you know: graph mining using recursive structural features. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 663–671. ACM, New York.
Jaccard, P (1912) The distribution of the flora in the alpine zone. New Phytologist 11(2): 37–50.
Jeh, G, Widom J (2002) SimRank: a measure of structural-context similarity. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 538–543. ACM, New York.
Jensen, D, Neville J, Gallagher B (2004) Why collective inference improves relational classification. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 593–598. ACM, New York.
Kemmler, M, Rodner E, Denzler J (2010) One-class classification with gaussian processes. In: Asian Conference on Computer Vision, 489–500. Springer, Berlin, Heidelberg.
Kuchaiev, O, Rašajski M, Higham DJ, Pržulj N (2009) Geometric de-noising of protein-protein interaction networks. PLoS Comput Biol 5(8): e1000454.
Liben-Nowell, D, Kleinberg J (2007) The link prediction problem for social networks. J Am Soc Inf Sci Technol 58(7): 1019–31.
Lü, L, Zhou T (2011) Link prediction in complex networks: a survey. Physica A: Stat Mech Appl 390(6): 1150–70.
Martinez, ND, Hawkins BA, Dawah HA, Feifarek BP (1999) Effects of sampling effort on characterization of food-web structure. Ecology 80(3): 1044–55.
Marsland, S (2003) Novelty detection in learning systems. Neural Comput Surv 3(2): 157–195.
Newman, MEJ (2001) Clustering and preferential attachment in growing networks. Phys Rev E 64(2): 025102.
Ou, Q, Jin YD, Zhou T, Wang BH, Yin BQ (2007) Power-law strength-degree correlation from resource-allocation dynamics on weighted networks. Phys Rev E 75(2): 021102.
Parzen, E (1962) On estimation of a probability density function and mode. Ann Math Stat 33(3): 1065–1076.
Pimentel, MA, Clifton DA, Clifton L, Tarassenko L (2014) A review of novelty detection. Signal Process 99: 215–49.
Qi, Y, Ge H (2006) Modularity and dynamics of cellular networks. PLoS Comput Biol 2(12): e174.
Saito, R, Suzuki H, Hayashizaki Y (2002) Interaction generality, a measurement to assess the reliability of a protein-protein interaction. Nucleic Acids Res 30(5): 1163–1168.
Saito, R, Suzuki H, Hayashizaki Y (2003) Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics 19(6): 756–763.
Sprinzak, E, Sattath S, Margalit H (2003) How reliable are experimental protein-protein interaction data? J Mol Biol 327(5): 919–23.
Sun, J, Qu H, Chakrabarti D, Faloutsos C (2005) Neighborhood formation and anomaly detection in bipartite graphs. In: Proceedings of the Fifth IEEE International Conference on Data Mining, 8. IEEE Computer Society, Washington.
Tong, H, Lin CY (2011) Non-Negative Residual Matrix Factorization with Application to Graph Anomaly Detection. In: Proceedings of the 11th SIAM international conference on data mining (SDM), 143–153. SIAM, Mesa.
Wuchty, S, Oltvai ZN, Barabási AL (2003) Evolutionary conservation of motif constituents in the yeast protein interaction network. Nat Genet 35(2): 176–179.
Williams, CK, Rasmussen CE (2006) Gaussian processes for machine learning. The MIT Press 2(3): 4.
Yu, H, Paccanaro A, Trifonov V, Gerstein M (2006) Predicting interactions in protein networks by completing defective cliques. Bioinformatics 22(7): 823–9.
Yu, J, Guo M, Needham CJ, Huang Y, Cai L, Westhead DR (2010) Simple sequence-based kernels do not predict protein-protein interactions. Bioinformatics 26(20): 2610–2614.
Zhou, T, Lü L, Zhang YC (2009) Predicting missing links via local information. Eur Phys J 71(4): 623–30.
Acknowledgments
The authors would like to thank J.N.U. and U.G.C., India for providing the research fellowship to K.V.S.
Authors’ contributions
L.V. and K.V.S. conceived and designed the algorithm. K.V.S. implemented the algorithm and prepared the figures of the numerical results. K.V.S. and L.V. analyzed and interpreted the results, and wrote the manuscript. Both authors have read and approved the final manuscript.
Competing interests
We declare that we have no competing interests in this work.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 Link prediction
 Anomaly detection
 Protein protein interaction networks