Entropy-based approach to missing-links prediction
Applied Network Science volume 3, Article number: 17 (2018)
Abstract
Link-prediction is an active research field within network theory, aiming at uncovering missing connections or predicting the emergence of future relationships from the observed network structure. This paper represents our contribution to the stream of research concerning missing links prediction. Here, we propose an entropy-based method to predict a given percentage of missing links, by identifying them with the most probable non-observed ones. The probability coefficients are computed by solving suitably defined null-models over the accessible network structure. Upon comparing our likelihood-based, local method with the most popular algorithms over a set of economic, financial and food networks, we find ours to perform best, as pointed out by a number of statistical indicators (e.g. the precision, the area under the ROC curve, etc.). Moreover, the entropy-based formalism adopted in the present paper allows us to straightforwardly extend the link-prediction exercise to directed networks as well, thus overcoming one of the main limitations of current algorithms. The higher accuracy achievable by employing these methods, together with their larger flexibility, makes them strong competitors of available link-prediction algorithms.
Introduction
Link-prediction is an active research field within network theory, aiming at uncovering missing connections (e.g. in incomplete datasets) or predicting the emergence of future relationships from the observed network structure. Loosely speaking, the missing links prediction problem can be stated by asking the following question: given a snapshot of a network, can the next most-likely links to be established be predicted? Such an issue is relevant in many research areas, such as social networks (Liben-Nowell and Kleinberg 2003; Pavlov and Ichise 2007; Berlusconi et al. 2016; Jalili et al. 2017), protein networks (Barzel and Barabási 2013; Singh and Vig 2017), brain networks (Cannistraci et al. 2013), etc.
To this aim, several algorithms have been proposed so far. Overall, “recipes” for link-prediction can be classified as belonging to either of two main classes: similarity-based algorithms or likelihood-based algorithms (Lu and Zhou 2011; Zhao et al. 2015). Both classes of algorithms output a list of scores to be assigned to non-observed links: while the similarity-based ones may employ local (Barabasi and Albert 1999), quasi-local (Cannistraci et al. 2013; Jaccard 1901; Sorensen 1948; Salton and McGill 1983; Adamic and Adar 2003; Zhou et al. 2009) or global information (Katz 1953; Lu et al. 2015; Zhao et al. 2015) (e.g. the nodes' degrees, the degrees of common neighbours and the length of paths connecting any two nodes, respectively), the likelihood-based ones (Guimerá and Sales-Pardo 2009; Tan et al. 2014; Pan et al. 2016) are defined by a likelihood function whose maximization provides the probability that any two nodes are connected. This is usually achieved by assuming that some kind of benchmark information is known and by treating it as a constraint to account for. An alternative classification distinguishes between algorithms employing purely structural information (either binary or weighted (Lu and Zhou 2011)) and algorithms making use of some kind of external information as well (e.g. node attributes (Liao et al. 2015)).
This paper represents our contribution to the stream of research concerning missing links prediction. A novel algorithm is proposed, building upon a series of results concerning constrained entropy-maximization (Park and Newman 2004; Garlaschelli and Loffredo 2008; Squartini and Garlaschelli 2011). In a nutshell, we advance the hypothesis that the tasks of predicting missing links and reconstructing a given network structure share many similarities worth exploring further. The method we propose in the present work makes a first step in this direction, by employing entropy-based null-models to approach the link-prediction problem. As a last remark, we notice that while the problem of missing links prediction is usually associated with the problem of spurious links identification, here we only address the former.
The remainder of the paper is organized as follows. In the “Methods” section an overview of the missing links prediction problem is provided, together with a detailed description of the method we propose here. The “Data” section contains a concise description of the datasets used for testing our methods. In the “Results” section, we compare our method with the most common link-prediction algorithms and we comment on the results in the “Discussion” section.
Methods
In order to fix the formalism, let us briefly reformulate the link-prediction problem ab initio.
Let us indicate with the symbol A the adjacency matrix of the observed network and with the symbol E the corresponding set of observed links: as a consequence, upon indicating with U the set of all node pairs, U∖E will be referred to as the set of non-existent links. In order to fully control a given recipe for link-prediction, the link set is usually partitioned into a training set, E^{T}, and a probe set, E^{P}=E∖E^{T}. The former is used in the “calibration” phase of a given prediction algorithm, while the latter is used for testing it: links belonging to E^{P} are, in fact, removed, thus constituting the actual “prediction target”. We denote with |E^{P}|≡L_{miss} the cardinality of the probe set, corresponding to the number of missing links. Naturally, the adjacency matrix is partitioned as well: the portion of it corresponding to the training set will be indicated with the symbol A^{T}. The union of the missing links set and the non-existent links set, E^{N}=E^{P}∪(U∖E)≡U∖E^{T}, will be referred to as the set of non-observed links.
Link-prediction algorithms output a list of scores to be assigned to non-observed links. Upon indicating with i and j the nodes constituting the extremes of non-observed links, the most traditional recipes are quickly reviewed below. In what follows, we will focus on the algorithms employing either local or quasi-local information.
Link-prediction for undirected networks
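The partition described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `split_links`, the `probe_fraction` and `seed` parameters and the representation of undirected links as sorted tuples are assumptions introduced here, not part of the original method.

```python
import random
from itertools import combinations


def split_links(nodes, edges, probe_fraction=0.1, seed=0):
    """Split the observed links E into a training set E^T and a probe set E^P,
    and build the set of non-observed links E^N = U \\ E^T (undirected case)."""
    rng = random.Random(seed)
    # Store each undirected link once, as a sorted tuple (i, j) with i < j.
    observed = sorted({tuple(sorted(e)) for e in edges})
    n_probe = round(probe_fraction * len(observed))   # L_miss
    probe = set(rng.sample(observed, n_probe))         # E^P (removed links)
    train = set(observed) - probe                      # E^T
    all_pairs = set(combinations(sorted(nodes), 2))    # U
    non_observed = all_pairs - train                   # E^N = E^P u (U \ E)
    return train, probe, non_observed
```

By construction, the probe set is contained in the set of non-observed links, which is the set that any scoring recipe has to rank.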

The simplest recipe to define scores is based on the number of common neighbours (CN) of i and j
$$ s_{ij}^{CN}=|\Gamma(i)\cap\Gamma(j)|; $$(1)
a slightly more elaborate function of it is represented by the Jaccard coefficient (J), which discounts the information encoded into the size of the nodes' neighbourhoods:
$$ s_{ij}^{J}=\frac{|\Gamma(i)\cap\Gamma(j)|}{|\Gamma(i)\cup\Gamma(j)|}=\frac{s_{ij}^{CN}}{k_{i}+k_{j}-s_{ij}^{CN}}; $$(2)
algorithms based on the information provided by the nodes' degrees exist as well. The simplest example is provided by the one inspired by the “preferential attachment” (PA) mechanism, whose generic score reads
$$ s_{ij}^{PA}=k_{i}\cdot k_{j}; $$(3)
others, instead, are defined by the inverse of some function of the neighbours' degrees (according to the original Adamic-Adar (AA) prescription or subsequent variations, such as the “resource allocation” (RA) one)
$$ s_{ij}^{RA}=\sum\limits_{l\in\Gamma(i)\cap\Gamma(j)}\frac{1}{k_{l}},\:s_{ij}^{AA}=\sum\limits_{l\in\Gamma(i)\cap\Gamma(j)}\frac{1}{\ln k_{l}}; $$(4)
modifications of the aforementioned indices have been recently proposed, encoding information on the link density of the neighbourhood of each pair of nodes. These indices are the so-called CAR-based ones (Cannistraci et al. 2013) and prescribe to “correct” the scores above by including a factor γ(l), counting how many neighbours of node l∈Γ(i)∩Γ(j) are also common neighbours of i and j. More explicitly
$$ s_{ij}^{CAR}=s_{ij}^{CN}\cdot\sum\limits_{l\in\Gamma(i)\cap\Gamma(j)}\frac{\gamma(l)}{2}, $$(5)
$$ s_{ij}^{CJC}=\frac{s_{ij}^{CAR}}{|\Gamma(i)\cup\Gamma(j)|}, $$(6)
$$ s_{ij}^{CPA}=\left(e_{i}+s_{ij}^{CAR}\right)\cdot\left(e_{j}+s_{ij}^{CAR}\right), $$(7)
$$ s_{ij}^{CRA}=\sum\limits_{l\in\Gamma(i)\cap\Gamma(j)}\frac{\gamma(l)}{k_{l}}, $$(8)
$$ s_{ij}^{CAA}=\sum\limits_{l\in\Gamma(i)\cap\Gamma(j)}\frac{\gamma(l)}{\ln k_{l}} $$(9)
where e_{i} indicates the external degree of node i, i.e. the number of neighbours of i that are not neighbours of j.
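As an illustration of the local scores of Eqs. 1-4, the following minimal Python sketch evaluates CN, J, PA, RA and AA for one non-observed pair of a toy network. The function name `local_scores`, the dictionary-of-sets representation and the toy adjacency structure are assumptions made here for the sake of the example; the CAR-based indices are not covered.

```python
import math


def local_scores(adj, i, j):
    """Return the CN, J, PA, RA and AA scores of the node pair (i, j).

    `adj` maps each node to the set of its neighbours (undirected network).
    """
    common = adj[i] & adj[j]          # Gamma(i) n Gamma(j)
    union = adj[i] | adj[j]           # Gamma(i) u Gamma(j)
    cn = len(common)
    return {
        "CN": cn,
        "J": cn / len(union) if union else 0.0,
        "PA": len(adj[i]) * len(adj[j]),
        # A common neighbour has degree >= 2, so ln(k_l) > 0 below.
        "RA": sum(1.0 / len(adj[l]) for l in common),
        "AA": sum(1.0 / math.log(len(adj[l])) for l in common),
    }


# Toy undirected network (hypothetical data): links 0-1, 0-2, 0-3, 1-2, 2-3, 3-4.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2, 4},
    4: {3},
}
scores = local_scores(adj, 1, 3)   # nodes 1 and 3 share the neighbours 0 and 2
```

For the pair (1, 3) the union of the neighbourhoods has size k_1 + k_3 - CN = 2 + 3 - 2 = 3, in agreement with the denominator of Eq. 2.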
Entropy-based approach to link-prediction
The rationale of our method is based upon the concept of network reconstructability. In other words, provided that the accessible portion A^{T} of a network is satisfactorily reproduced by a given amount of topological information, it is reasonable to suppose that the latter allows the inaccessible portion to be inferred with reasonable accuracy as well. Invoking the aforementioned concept allows us to rephrase the link imputation problem within the network reconstruction framework, making it possible to employ the techniques developed there.
From a technical point of view, our algorithm is a local, likelihood-based one. It rests upon the information provided by local, topological quantities, which are enforced as constraints of a maximization procedure defined within the Exponential Random Graph (ERG) framework (Park and Newman 2004; Squartini and Garlaschelli 2011). In the case of binary, undirected networks, constraints are represented by the nodes' degrees, i.e. \(\vec{k}\left(\mathbf{A}^{T}\right)\), and the ERG framework leads to the maximization of the likelihood function \(\mathcal{L}=\ln P\left(\mathbf{A}^{T}\right)\) where
$$ P\left(\mathbf{A}^{T}\right)=\prod_{i<j}p_{ij}^{a_{ij}}(1-p_{ij})^{1-a_{ij}} $$(10)
and \(p_{ij}=\frac{x_{i}x_{j}}{1+x_{i}x_{j}}\). The numerical value of the unknown coefficients \(\vec{x}\) is obtained upon solving the system of equations
$$ k_{i}\left(\mathbf{A}^{T}\right)=\sum_{j(\neq i)}\frac{x_{i}x_{j}}{1+x_{i}x_{j}},\quad\forall\:i $$(11)
(see the Appendix: Configuration Models for the derivation of the condition above). Our algorithm, which is trained on A^{T}, prescribes to interpret the probability coefficients \(\{p_{ij}\}_{ij\in E^{N}}\) assigned to the non-observed links as scores to carry out the link-prediction: upon sorting the coefficients \(\{p_{ij}\}_{ij\in E^{N}}\) in decreasing order, the L_{miss} largest ones are naturally interpreted as pointing out the L_{miss} most probable missing links (notice that such a prescription is based on the assumption that the number of missing links is known, although their identity is not: as a consequence, this number is retained). In other words, the reconstructability assumption underlying our method leads us to interpret the non-observed links which have been assigned the largest probability coefficients as the ones that are most likely to appear given the chosen constraints.
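A minimal Python sketch of the scoring step is given below. It assumes that the degree-constrained system is solved by a plain fixed-point iteration, x_i ← k_i / Σ_{j≠i} x_j/(1+x_i x_j); this solver, the function names `fit_ubcm` and `ubcm_scores`, the tolerance and the iteration cap are illustrative choices made here and are not prescribed by the text. The sketch is meant for small networks only and does not handle degree sequences for which no finite solution exists (e.g. a node linked to all the others).

```python
from itertools import combinations


def fit_ubcm(degrees, tol=1e-12, max_iter=100_000):
    """Solve k_i = sum_{j != i} x_i x_j / (1 + x_i x_j) by fixed-point iteration."""
    n = len(degrees)
    norm = sum(degrees) ** 0.5 or 1.0          # sqrt(2L): sparse-limit initial guess
    x = [k / norm for k in degrees]
    for _ in range(max_iter):
        new_x = []
        for i in range(n):
            denom = sum(x[j] / (1.0 + x[i] * x[j]) for j in range(n) if j != i)
            new_x.append(degrees[i] / denom if denom > 0 else 0.0)
        converged = max(abs(a - b) for a, b in zip(new_x, x)) < tol
        x = new_x
        if converged:
            break
    return x


def ubcm_scores(n, train_edges):
    """Score each non-observed pair (i, j), i < j, with p_ij = x_i x_j / (1 + x_i x_j)."""
    train = {tuple(sorted(e)) for e in train_edges}
    degrees = [0] * n
    for i, j in train:
        degrees[i] += 1
        degrees[j] += 1
    x = fit_ubcm(degrees)
    return {
        (i, j): x[i] * x[j] / (1.0 + x[i] * x[j])
        for i, j in combinations(range(n), 2)
        if (i, j) not in train
    }
```

On a 4-node ring, for instance, every node has degree 2 out of 3 possible partners, so the symmetric solution assigns p_ij = 2/3 to every pair, including the two non-observed diagonals.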
Our recipe has a remarkable, equivalent formulation. In fact, the subset Σ^{∗} of L_{miss} links characterized by the largest probability coefficients identifies the subgraph satisfying the relationship
$$ \mathbf{\Sigma}^{*}=\underset{\mathbf{\Sigma}:\:\sum_{ij\in E^{N}}\sigma_{ij}=L_{miss}}{\arg\max}\:P\left(\mathbf{\Sigma}|\mathbf{A}^{T}\right) $$(12)
with \(P\left(\mathbf{\Sigma}|\mathbf{A}^{T}\right)=\prod_{\substack{i<j\\ij\in E^{N}}}p_{ij}^{\sigma_{ij}}(1-p_{ij})^{1-\sigma_{ij}}\propto\prod_{\substack{i<j\\ij\in E^{N}}}\left(\frac{p_{ij}}{1-p_{ij}}\right)^{\sigma_{ij}}\). Since the factor p_{ij}/(1-p_{ij}) grows with p_{ij}, the maximum value of such a product is achieved once the L_{miss} largest factors are selected; the generic entry \(\sigma_{ij}^{*}\) thus obeys the following rule: \(\sigma_{ij}^{*}=1\) if ij belongs to the set of L_{miss} most probable missing links and \(\sigma_{ij}^{*}=0\) otherwise; in other words, Σ^{∗} is the subgraph with largest probability among the ones with precisely L_{miss} links. In the remainder of the paper, this approach will be named after the null-model employed to calculate the link scores, i.e. UBCM (Undirected Binary Configuration Model) (Squartini and Garlaschelli 2011).
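The equivalence between the two formulations can be checked by brute force on a toy example. The sketch below assumes independent links with a Bernoulli probability for each non-observed pair; the probability values and the function names are hypothetical and serve only as an illustration.

```python
from itertools import combinations
from math import prod


def subgraph_probability(p, chosen):
    """Probability of the configuration in which exactly the links in `chosen` exist."""
    return prod(p[link] if link in chosen else 1.0 - p[link] for link in p)


def most_probable_subgraph(p, l_miss):
    """Exhaustive search over all subgraphs with exactly l_miss links."""
    candidates = (frozenset(c) for c in combinations(sorted(p), l_miss))
    return max(candidates, key=lambda c: subgraph_probability(p, c))


def top_links(p, l_miss):
    """The l_miss non-observed links with the largest probability coefficients."""
    return frozenset(sorted(p, key=p.get, reverse=True)[:l_miss])


# Hypothetical probability coefficients of five non-observed links.
p = {(0, 1): 0.7, (0, 2): 0.2, (1, 2): 0.55, (1, 3): 0.4, (2, 3): 0.1}
best = most_probable_subgraph(p, 2)
ranked = top_links(p, 2)
```

Both `best` and `ranked` contain the links (0, 1) and (1, 2), i.e. the two pairs with the largest coefficients; the exhaustive search is of course only feasible on very small examples.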
Link-prediction for directed networks
Remarkably, our algorithm can be generalized to approach the missing links prediction problem in directed networks as well. It is enough to maximize the likelihood \(\mathcal{L}=\ln P\left(\mathbf{A}^{T}\right)\) where, now, \(P\left(\mathbf{A}^{T}\right)=\prod_{i\neq j}p_{ij}^{a_{ij}}(1-p_{ij})^{1-a_{ij}}\) and \(p_{ij}=\frac{x_{i}y_{j}}{1+x_{i}y_{j}}\), by solving the system of equations
$$ k_{i}^{out}\left(\mathbf{A}^{T}\right)=\sum_{j(\neq i)}\frac{x_{i}y_{j}}{1+x_{i}y_{j}},\quad k_{i}^{in}\left(\mathbf{A}^{T}\right)=\sum_{j(\neq i)}\frac{x_{j}y_{i}}{1+x_{j}y_{i}},\quad\forall\:i $$(13)
and consider the coefficients \(\{p_{ij}\}_{ij\in E^{N}}\) as scores to be assigned to the non-observed links (see the Appendix: Configuration Models for the derivation of the condition above). The proper prediction step is still carried out by applying the recipe defined by Eq. 12, with the only difference that, now, the product runs over the directed pairs of nodes. In the remainder of the paper, this approach will be named after the null-model employed to calculate the link scores, i.e. DBCM (Directed Binary Configuration Model) (Squartini and Garlaschelli 2011).
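The directed case can be sketched along the same lines. As before, the fixed-point solver, the function names `fit_dbcm` and `dbcm_scores` and the stopping rule are assumptions made here for illustration; links are represented as ordered pairs (i, j) meaning "i points to j".

```python
def fit_dbcm(k_out, k_in, tol=1e-12, max_iter=100_000):
    """Solve the out- and in-degree constraints for x and y by fixed-point iteration."""
    n = len(k_out)
    norm = sum(k_out) ** 0.5 or 1.0            # sqrt(L): sparse-limit initial guess
    x = [k / norm for k in k_out]
    y = [k / norm for k in k_in]
    for _ in range(max_iter):
        new_x, new_y = [], []
        for i in range(n):
            dx = sum(y[j] / (1.0 + x[i] * y[j]) for j in range(n) if j != i)
            dy = sum(x[j] / (1.0 + x[j] * y[i]) for j in range(n) if j != i)
            new_x.append(k_out[i] / dx if dx > 0 else 0.0)
            new_y.append(k_in[i] / dy if dy > 0 else 0.0)
        delta = max(abs(a - b) for a, b in zip(new_x + new_y, x + y))
        x, y = new_x, new_y
        if delta < tol:
            break
    return x, y


def dbcm_scores(n, train_edges):
    """Score each non-observed directed pair with p_ij = x_i y_j / (1 + x_i y_j)."""
    train = set(train_edges)
    k_out, k_in = [0] * n, [0] * n
    for i, j in train:
        k_out[i] += 1
        k_in[j] += 1
    x, y = fit_dbcm(k_out, k_in)
    return {
        (i, j): x[i] * y[j] / (1.0 + x[i] * y[j])
        for i in range(n)
        for j in range(n)
        if i != j and (i, j) not in train
    }
```

On a directed 3-cycle every node has out- and in-degree 1 out of 2 possible partners, so each of the three non-observed directed pairs receives the score 1/2.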
Notice, instead, that no unambiguous ways to generalize traditional scores exist. Here we have adopted the (directed) extensions listed below, with the aim of accounting for link directionality whenever possible:

when considering directed networks, the concept of common neighbours can be replaced by the concepts of “successors” and “predecessors”, i.e. the nodes respectively “pointed to by” and “pointing to” a given node. Upon indicating the set of “successors” of i with Γ_{S}(i) and the set of “predecessors” of j with Γ_{P}(j), the CN index can be generalized as follows
$$ s_{ij}^{CN}=|\Gamma_{S}(i)\cap\Gamma_{P}(j)|; $$(14)
building upon the directed version of the CN index, the J index reads
$$ s_{ij}^{J}=\frac{|\Gamma_{S}(i)\cap\Gamma_{P}(j)|}{|\Gamma_{S}(i)\cup\Gamma_{P}(j)|}=\frac{s_{ij}^{CN}}{k_{i}^{out}+k_{j}^{in}-s_{ij}^{CN}}; $$(15)
the RA and AA indices can be straightforwardly generalized as follows:
$$ s_{ij}^{RA}=\sum\limits_{l\in\Gamma(i)\cap\Gamma(j)}\frac{1}{k_{l}^{tot}},\:s_{ij}^{AA}=\sum\limits_{l\in\Gamma(i)\cap\Gamma(j)}\frac{1}{\ln k_{l}^{tot}} $$(16)
with \(k_{i}^{tot}=k_{i}^{out}+k_{i}^{in}\);

the PA score admits two different generalizations: one employing the total degree of nodes
$$ s_{ij}^{PA'_{I}}=k_{i}^{tot}\cdot k_{j}^{tot} $$(17)
and the other employing the nodes' out- and in-degrees
$$ s_{ij}^{PA'_{II}}=k_{i}^{out}\cdot k_{j}^{in}; $$(18)
while the CAR-based indices are not straightforwardly generalizable to the directed case, other scores exist aiming at extending the concept of “closed triad” to account for link directionality (Schall 2014):
$$ s_{ij}^{TC}=\sum\limits_{l\in\Gamma(i)\cap\Gamma(j)}w_{i,j,l}\cdot w(l); $$(19)
here, the “triad weight” \(w_{i,j,l}=\frac{\#T_{i\rightarrow j,l}+\#T_{i\leftrightarrow j,l}}{\#T_{i,j,l}}\) is defined by the (global) number #T_{i,j,l} of observed, open triads of the particular kind T_{i,j,l}, the (global) number #T_{i→j,l} of observed, closed triads via a directed link from i to j and the (global) number #T_{i↔j,l} of observed, closed triads via a reciprocal link between i and j; w(l) is, instead, a node-specific weight that can be set either to \(w(l)=\frac{1}{k_{l}}\) or to 1. In order to avoid misinterpretations, we set the weight to 1.
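The successor/predecessor-based scores of Eqs. 14, 15, 17 and 18 can be illustrated with the short Python sketch below. The helper names `neighbourhoods` and `directed_scores`, as well as the toy edge list, are assumptions made here; the RA, AA and TC scores are left out.

```python
from collections import defaultdict


def neighbourhoods(edges):
    """Build the successor and predecessor sets from a list of directed links (i, j)."""
    succ, pred = defaultdict(set), defaultdict(set)
    for i, j in edges:
        succ[i].add(j)
        pred[j].add(i)
    return succ, pred


def directed_scores(succ, pred, i, j):
    """Directed CN, J and the two PA generalizations for the ordered pair (i, j)."""
    common = succ[i] & pred[j]        # Gamma_S(i) n Gamma_P(j)
    union = succ[i] | pred[j]         # Gamma_S(i) u Gamma_P(j)

    def k_tot(v):
        return len(succ[v]) + len(pred[v])

    return {
        "CN": len(common),
        "J": len(common) / len(union) if union else 0.0,
        "PA_I": k_tot(i) * k_tot(j),               # total degrees
        "PA_II": len(succ[i]) * len(pred[j]),      # out-degree of i, in-degree of j
    }


# Toy directed network (hypothetical data).
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 0)]
succ, pred = neighbourhoods(edges)
scores = directed_scores(succ, pred, 0, 3)
```

For the ordered pair (0, 3) both successors of node 0 are predecessors of node 3, hence CN = 2 and J = 1, while the two PA variants give 9 and 4, respectively.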
Testing link-prediction
Once a link-prediction algorithm has been defined, a number of statistical indices exist to test its effectiveness. In what follows we will briefly review the ones we have employed in the present paper to compare the aforementioned algorithms. The first index we have considered is the true positive rate (also known as precision), defined as
$$ \text{TPR}=\frac{L_{r}}{L_{miss}} $$(20)
and quantifying the percentage of missing links that are correctly recovered (L_{r} being the number of rightly identified missing links within the list of the first L_{miss} links with the largest score). A similar-in-spirit index is the accuracy
$$ \text{ACC}=\frac{TP+TN}{|E^{N}|}, $$(21)
quantifying the percentage of correctly classified links (i.e. both the missing ones, TP≡L_{r}, and the non-existent ones that are correctly left out of the list, TN) with respect to the total number of non-observed links. The third index we consider is the traditional area under the ROC curve, or AUC, proxied by the number
$$ \text{AUC}=\frac{n'+0.5\,n''}{n}; $$(22)
n^{′} counts the number of times a missing link is assigned a higher probability than a non-existent one, while n^{″} accounts for the number of times they are assigned an equal probability. The denominator n coincides with the total number of comparisons (i.e. the number of missing links times the number of non-existent links). This index is intended to quantify the probability that any missing link is assigned a score that is larger than the score assigned to any non-existent link. If all scores were i.i.d. the AUC value should be distributed around an expected value of 1/2: therefore, the extent to which the AUC value exceeds 0.5 provides an indication of how much better the algorithm performs than pure chance.
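The three indices can be computed as in the following Python sketch, which takes a dictionary of scores over the non-observed links and the probe set. The function name `evaluate`, the explicit true-negative count and the toy scores are assumptions introduced here; ties at the cut-off of the ranked list are broken by the (arbitrary) input order.

```python
def evaluate(scores, probe):
    """Return (TPR, ACC, AUC) given scores over E^N and the probe set E^P."""
    l_miss = len(probe)
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted = set(ranked[:l_miss])                  # top-L_miss links
    l_r = len(predicted & probe)                      # correctly recovered links
    tpr = l_r / l_miss

    nonexistent = [link for link in scores if link not in probe]
    true_neg = sum(1 for link in nonexistent if link not in predicted)
    acc = (l_r + true_neg) / len(scores)

    # Pairwise comparisons between missing and non-existent links.
    n_higher = n_equal = 0
    for m in probe:
        for q in nonexistent:
            if scores[m] > scores[q]:
                n_higher += 1
            elif scores[m] == scores[q]:
                n_equal += 1
    auc = (n_higher + 0.5 * n_equal) / (l_miss * len(nonexistent))
    return tpr, acc, auc


# Hypothetical scores of five non-observed links, two of which are missing links.
scores = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.5, (1, 3): 0.5, (2, 3): 0.1}
probe = {(0, 1), (1, 2)}
tpr, acc, auc = evaluate(scores, probe)
```

In this toy case one of the two missing links enters the top-2 list (TPR = 0.5), three of the five non-observed links are correctly classified (ACC = 0.6) and, out of six comparisons, four are won and one is tied (AUC = 4.5/6 = 0.75).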
The set of missing links is usually randomly removed: we have followed such a procedure, by 1) randomly removing 10% of the links, 10 times, 2) quantifying the performance of the algorithms above by computing the three aforementioned indices over each sample, 3) averaging these values over the sample set (the sample standard deviation is used to proxy the estimation error).
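The testing protocol can be sketched as follows. For brevity, the sketch uses the common-neighbours score as a stand-in scorer and the precision as the only index; any other scorer with the same signature could be plugged in. All names (`common_neighbours_scorer`, `precision`, `repeated_evaluation`) and default parameters are assumptions made here.

```python
import random
import statistics
from itertools import combinations


def common_neighbours_scorer(nodes, train):
    """Stand-in scorer: CN score of every non-observed pair (undirected network)."""
    adj = {v: set() for v in nodes}
    for i, j in train:
        adj[i].add(j)
        adj[j].add(i)
    return {
        (i, j): len(adj[i] & adj[j])
        for i, j in combinations(sorted(nodes), 2)
        if (i, j) not in train
    }


def precision(scores, probe):
    """Fraction of probe links found among the len(probe) top-scored links."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return len(set(ranked[:len(probe)]) & probe) / len(probe)


def repeated_evaluation(nodes, edges, scorer, n_rep=10, fraction=0.1, seed=0):
    """Remove a random fraction of links n_rep times; return mean and std of the precision."""
    rng = random.Random(seed)
    observed = sorted({tuple(sorted(e)) for e in edges})
    n_probe = max(1, round(fraction * len(observed)))
    values = []
    for _ in range(n_rep):
        probe = set(rng.sample(observed, n_probe))
        train = set(observed) - probe
        values.append(precision(scorer(nodes, train), probe))
    return statistics.mean(values), statistics.stdev(values)
```

The returned pair corresponds to the average index and to the sample standard deviation used as a proxy of the estimation error.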
Data
Our approach to link-prediction has been tested on a number of economic and financial datasets (see Table 1) and on several foodwebs (see Table 2).
As a first dataset, we have considered the World Trade Web (WTW) across a period of 51 years, i.e. from 1950 to 2000. The dataset in (Gleditsch 2002) collects yearly, bilateral, aggregated data on exports and imports (the generic entry \(m_{ij}^{agg}(y)\) is the sum of the single commodity-specific trade exchanges between i and j during the year y). The binary, directed representation of the WTW we have considered here has been obtained by linking any two nodes whenever the corresponding element \(m_{ij}^{agg}(y)\) is strictly positive, i.e. \(a_{ij}(y)=\Theta[m_{ij}^{agg}(y)]\).
As a second dataset, we have considered the Dutch Interbank Network (DIN) across a period of 11 years, i.e. from 1998 to 2008 (’t Veld and van Lelyveld 2014). Such a dataset collects quarterly data on exposures between Dutch banks, larger than 1.5 million euros and with maturity shorter than one year.
As a third dataset, we have considered the eMID (i.e. the electronic Market for Interbank Deposits) network in a series of 61 temporal snapshots, corresponding to the maintenance periods (and ranging from 2005 to 2010). In this case, links represent granted loans (Iori et al. 2006). As for the WTW, the binary, directed representations of both the DIN and eMID have been obtained by linking any two nodes whenever a positive weight is observed between them.
The three real-world systems above are defined by directed connections. In order to evaluate the performance of the link-prediction algorithms considered in the present paper on undirected networks, we have symmetrized the adjacency matrices of these systems, according to the prescription \(a_{ij}^{sym}=a_{ij}+a_{ji}-a_{ij}a_{ji}\).
Foodwebs, instead, are considered in their binary, directed version only: if species i preys on species j, a directed link is drawn from j to i.
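The binarization and symmetrization prescriptions used for the economic and financial datasets can be sketched as follows; the function names and the toy weight matrix are hypothetical.

```python
def binarize(weights):
    """a_ij = Theta(w_ij): 1 if the weight is strictly positive, 0 otherwise."""
    return [[1 if w > 0 else 0 for w in row] for row in weights]


def symmetrize(a):
    """a_ij^sym = a_ij + a_ji - a_ij * a_ji (a link exists if either direction does)."""
    n = len(a)
    return [[a[i][j] + a[j][i] - a[i][j] * a[j][i] for j in range(n)] for i in range(n)]


# Toy weighted, directed matrix (hypothetical data).
weights = [
    [0.0, 2.5, 0.0],
    [0.0, 0.0, 1.0],
    [3.0, 0.4, 0.0],
]
directed = binarize(weights)
undirected = symmetrize(directed)
```

The product term prevents reciprocated links from being counted twice, so that the symmetrized matrix stays binary.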
Results
The performance of our link-prediction algorithm is shown in Figs. 1, 2 and 3: the three panels of Figs. 1 and 2 refer to the WTW, the DIN and eMID respectively, while foodwebs are reported in Fig. 3.
As a general comment, our method performs better than the other algorithms, with respect to all considered indices. The success of the method is particularly evident when considering the AUC index, proxying the probability of (correctly) assigning a larger score to a missing link than to a non-existent link.
We argue the success of our algorithm to rest upon a core result that has been verified in a number of previous works (Squartini and Garlaschelli 2011; Squartini et al. 2011; Cimini et al. 2015): the (purely) topological structure of the networks considered here can be reconstructed, to a large degree of accuracy, by enforcing the information encoded into the degree sequences alone; very likely, thus, the same amount of information also defines an accurate recipe to spot potential missing links. Otherwise stated, the level of “complexity” of the considered networks seems to be largely encoded into the degree sequences, thus requiring (just) their enforcement to be fully accounted for.
The founding principle of our approach is, thus, radically different from the one inspiring other link-prediction algorithms: we aim at finding the (most likely) generative process for the network at hand, while other methods define increasingly detailed procedures with little control on the “quality” of the included information. This becomes evident when considering that other algorithms (i.e. the CN, J, RA, AA and the CAR-based ones) employ a larger amount of information than the UBCM- and DBCM-based ones: while the latter take as input just the nodes' degrees, the former exploit the information provided by the whole set of common neighbours. This may indicate that the information encoded into the neighbourhood of any two nodes, supposedly richer than the one encoded into the degrees alone, is, actually, a mere consequence of lower-order statistics (i.e. the degrees themselves).
This also sheds light on the reason why our algorithm is less sensitive than others to the original value of link density: provided that our entropy-based recipes successfully identify the process generating the networks at hand, the number of observed links is automatically accounted for.
Our comparison also points out that one of the factors determining the goodness of a given link-prediction algorithm concerns how the available information is used. An illustrative example is provided by the performance of the PA algorithm defined by Eq. 17, requiring the same basic knowledge as our entropy-based recipes, i.e. the degree sequences of nodes. As is clear upon inspecting the eMID directed case, the assumption that any two nodes establish a connection with a probability that is proportional to their total degrees fails to capture the process shaping the network structure; entropy-maximization, on the other hand, makes a better use of the available information, by retaining the information on link directionality (that indeed plays a role, completely ignored by the aforementioned PA prescription).
Interestingly, the DBCM recipe described in the “Methods” section induces the “correct”, directed generalization of the PA algorithm, defined by Eq. 23 and outperforming the one defined by Eq. 17. For sparse networks, in fact, the DBCM probability coefficients can be approximated as follows
$$ p_{ij}^{\text{DBCM}}\simeq x_{i}y_{j}\simeq\frac{k_{i}^{out}k_{j}^{in}}{L}, $$(23)
with L the number of links of the training network: a simplified prescription that performs very similarly to the entropy-based algorithm on the DIN and eMID; on dense networks, such as the WTW, the DBCM performs much better, instead. The UBCM, on the other hand, reduces to \(p_{ij}^{\text{UBCM}}\propto k_{i}\cdot k_{j}\), i.e. the undirected PA prescription defined by Eq. 3.
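The sparse-network limit invoked above can be made explicit with a short worked step. It assumes the DBCM form p_{ij}=x_{i}y_{j}/(1+x_{i}y_{j}), the condition x_{i}y_{j}≪1 for all pairs, and neglects the exclusion of the self-term j=i in the sums; L denotes the number of links of the training network.

```latex
\begin{aligned}
p_{ij} &= \frac{x_{i}y_{j}}{1+x_{i}y_{j}} \simeq x_{i}y_{j}, \\
k_{i}^{out} &\simeq x_{i}\sum_{j}y_{j}, \qquad
k_{j}^{in} \simeq y_{j}\sum_{i}x_{i}, \qquad
L=\sum_{i}k_{i}^{out} \simeq \Big(\sum_{i}x_{i}\Big)\Big(\sum_{j}y_{j}\Big), \\
p_{ij} &\simeq x_{i}y_{j} \simeq \frac{k_{i}^{out}\,k_{j}^{in}}{L}.
\end{aligned}
```

The same argument, applied to the undirected model with x_{i}x_{j}≪1, gives p_{ij}≃k_{i}k_{j}/(2L), i.e. a score proportional to the product of the degrees.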
Finally, let us comment on the performance of the TC index. The algorithm employing it has been designed to provide a solution to the problem of forecasting new connections among social network users. By definition, it only predicts new links among disconnected nodes, disregarding all node pairs connected by, e.g., a non-reciprocated link. This explains the poor performance of the algorithm in our context, although it performs satisfactorily on the specific task it was designed for (Schall 2014).
Discussion
Whenever judging the performance of a given link-prediction algorithm, one should consider both the amount of information it requires and the way in which this is employed to carry out the prediction step. While the usual link-prediction algorithms assume the existence of some node-specific tendency at a microscopic level (e.g. social agents tend to close triads), ours focuses on the most likely process that may have generated the considered network. The guessed process is, first, trained on the visible portion of the network and, then, employed to infer the (supposedly) unknown portion of the network: the “homogeneity” assumption underlying the whole procedure leads us to expect that a model satisfactorily reproducing the accessible part of a system is also effective in spotting potential missing links.
One of the most effective recipes to tune generative processes is the one based on entropy-maximization: besides guaranteeing that the available information is encoded in the least-biased way, the ERG framework is also very flexible, being applicable to both undirected and directed networks; other algorithms, on the contrary, rest upon concepts unambiguously defined only for undirected networks (an example is provided by the whole family of CAR indicators, whose core concept, i.e. the “local community links” factor γ(l), does not admit a straightforward generalization).
Although every newly-proposed algorithm is put forward with the ambition of being applicable to different kinds of systems, the effectiveness of a given (null) model depends on the particular system at hand: while economic, financial and food networks seem to be largely explained by the degree sequences, other systems may require a different (or additional) kind of information.
The results obtained so far on undirected, as well as directed, binary networks push us to look for further extensions of the proposed link-prediction technique. Interesting perspectives are represented by bipartite and weighted networks, for which the link- and weight-imputation topics are still little explored.
Appendix: Configuration Models
This Appendix is devoted to the explicit derivation of the undirected and directed versions of the Binary Configuration Model. Let us start by defining the core quantity of our approach, i.e. Shannon entropy
$$ S=-\sum_{\mathbf{A}\in\mathcal{A}}P(\mathbf{A})\ln P(\mathbf{A}), $$
representing a functional of the probability distribution \(\{P(\mathbf{A})\}_{\mathbf{A}\in\mathcal{A}}\) defined over the ensemble of configurations \(\mathcal{A}\). Its constrained maximization represents an inference procedure which has been proved to be maximally non-committal with respect to the missing information. To this aim, let us define the Lagrangean function
$$ \mathscr{L}=S-\sum_{m=0}^{M}\theta_{m}\left[\sum_{\mathbf{A}\in\mathcal{A}}P(\mathbf{A})C_{m}(\mathbf{A})-\langle C_{m}\rangle\right], $$
with C_{m} representing the m-th constraint and C_{0}=〈C_{0}〉=1 summing up the normalization condition. Upon solving the equation \(\frac{\partial\mathscr{L}}{\partial P(\mathbf{A})}=0\) one finds the expression \(P(\mathbf{A})=e^{-1-\vec{\theta}\cdot\vec{C}(\mathbf{A})}\) that can be further rewritten as
$$ P(\mathbf{A})=\frac{e^{-\sum_{m=1}^{M}\theta_{m}C_{m}(\mathbf{A})}}{\sum_{\mathbf{A}'\in\mathcal{A}}e^{-\sum_{m=1}^{M}\theta_{m}C_{m}(\mathbf{A}')}}, $$
a formula defining the Exponential Random Graph formalism in its full generality. Since we are interested in defining a link-prediction algorithm employing only local information, let us now enforce the nodes' degrees as constraints. In the undirected case, this amounts to posing \(\vec{C}\left(\mathbf{A}^{T}\right)=\vec{k}\left(\mathbf{A}^{T}\right)\), which leads to the equivalence \(\sum_{m=1}^{M}\theta_{m}C_{m}\left(\mathbf{A}^{T}\right)=\sum_{i<j}(\theta_{i}+\theta_{j})a_{ij}\left(\mathbf{A}^{T}\right)\) and, upon identifying \(x_{i}\equiv e^{-\theta_{i}}\), further leads to Eq. 10. In the directed case, instead, \(\vec{C}\left(\mathbf{A}^{T}\right)=\left\{\vec{k}^{out}\left(\mathbf{A}^{T}\right),\vec{k}^{in}\left(\mathbf{A}^{T}\right)\right\}\), leading to \(\sum_{m=1}^{M}\theta_{m}C_{m}\left(\mathbf{A}^{T}\right)=\sum_{i\neq j}(\theta_{i}+\lambda_{j})a_{ij}\left(\mathbf{A}^{T}\right)\) and to the directed version of P(A) (a function, now, of \(x_{i}\equiv e^{-\theta_{i}}\) and \(y_{i}\equiv e^{-\lambda_{i}}\)).
The recipe to estimate the unknown parameters comes from another principle, i.e. the likelihood maximization one. Upon maximizing the function
$$ \mathcal{L}(\vec{\theta})=\ln P\left(\mathbf{A}^{T}|\vec{\theta}\right) $$
with respect to the unknowns (i.e. \(\vec{x}\) in the undirected case and \(\left\{\vec{x},\vec{y}\right\}\) in the directed case) the systems of Eqs. 11 and 13 are recovered.
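For the undirected case, the steps sketched in this Appendix can be written out explicitly as follows. This is a worked illustration assuming the identification x_{i}≡e^{-θ_{i}} and the factorization of the normalizing sum over independent node pairs; the directed case follows the same pattern with x_{i}y_{j} replacing x_{i}x_{j}.

```latex
\begin{aligned}
P(\mathbf{A}) &= \frac{\prod_{i<j}(x_{i}x_{j})^{a_{ij}}}{\prod_{i<j}(1+x_{i}x_{j})}
 = \prod_{i<j}p_{ij}^{a_{ij}}(1-p_{ij})^{1-a_{ij}},
 \qquad p_{ij}=\frac{x_{i}x_{j}}{1+x_{i}x_{j}}, \\
\mathcal{L}(\vec{x}) &= \ln P\left(\mathbf{A}^{T}\right)
 = \sum_{i}k_{i}\left(\mathbf{A}^{T}\right)\ln x_{i}-\sum_{i<j}\ln(1+x_{i}x_{j}), \\
\frac{\partial\mathcal{L}}{\partial x_{i}} &= \frac{k_{i}\left(\mathbf{A}^{T}\right)}{x_{i}}
 -\sum_{j(\neq i)}\frac{x_{j}}{1+x_{i}x_{j}}=0
 \;\Longrightarrow\;
 k_{i}\left(\mathbf{A}^{T}\right)=\sum_{j(\neq i)}\frac{x_{i}x_{j}}{1+x_{i}x_{j}}.
\end{aligned}
```

The last condition states that the expected degree of each node under the model equals the degree observed in the training network.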
As a last remark, we stress that the computational complexity of the whole algorithm is the one required for solving the systems of Eqs. 11 and 13. In both cases, the formulation provided in the present paper can be further simplified by limiting ourselves to consider only the distinct values of the degrees (Garlaschelli and Loffredo 2008). This induces the resolution of a reduced system of equations, further lowering the computational complexity of the whole algorithm.
Abbreviations
AA: Adamic-Adar
ACC: Accuracy
AUC: Area under the ROC curve
BDN: Binary directed network
BUN: Binary undirected network
CAA: Cannistraci Adamic-Adar
CAR: Cannistraci-Alanis-Ravasi
CJC: Cannistraci Jaccard
CN: Common neighbours
CPA: Cannistraci preferential attachment
CRA: Cannistraci resource allocation
DBCM: Directed Binary Configuration Model
DIN: Dutch Interbank Network
eMID: Electronic Market for Interbank Deposits
ERG: Exponential Random Graph
J: Jaccard
PA: Preferential attachment
RA: Resource allocation
TC: Triadic closure
TPR: True positive rate
UBCM: Undirected Binary Configuration Model
WTW: World Trade Web
References
Adamic, LA, Adar E (2003) Friends and neighbors on the Web. Soc Networks 25(3):211–230.
Barabasi, AL, Albert R (1999) Emergence of Scaling in Random Networks. Science 286(5439):509–512.
Barzel, B, Barabási AL (2013) Network link prediction by global silencing of indirect correlations. Nat Biotechnol 31(8):720–725.
Berlusconi, G, Calderoni F, Parolini N, Verani M, Piccardi C (2016) Link prediction in criminal networks: a tool for criminal intelligence analysis. PLoS ONE 11(4):e0154244.
Cannistraci, CV, Alanis-Lobato G, Ravasi T (2013) From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks. Sci Rep 3:1613. https://doi.org/10.1038/srep01613.
Cimini, G, Squartini T, Gabrielli A, Garlaschelli D (2015) Estimating topological properties of weighted networks from limited information. Phys Rev E 92(4):040802.
Garlaschelli, D, Loffredo MI (2008) Maximum likelihood: extracting unbiased information from complex networks. Phys Rev E 78(1):015101.
Gleditsch, KS (2002) Expanded trade and GDP data. J Confl Resolut 46:712–724.
Guimerá, R, Sales-Pardo M (2009) Missing and spurious interactions and the reconstruction of complex networks. Proc Natl Acad Sci 106(52):22073–22078. https://doi.org/10.1073/pnas.0908366106.
Iori, G, Jafarey S, Padilla FG (2006) Systemic risk on the interbank market. J Econ Behav Organ 61(4):525–542.
Jaccard, P (1901) Etude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat 37(547):547–579.
Jalili, M, Orouskhani Y, Asgari M, Alipourfard N, Perc M (2017) Link prediction in multiplex online social networks. R Soc Open Sci 4:160863. https://doi.org/10.1098/rsos.160863.
Katz, L (1953) A new status index derived from sociometric analysis. Psychometrika 18(1):39–43.
Liao, H, Zeng A, Zhang YC (2015) Predicting missing links via correlation between nodes. Physica A 436:216–223.
Liben-Nowell, D, Kleinberg J (2003) The link-prediction problem for social networks. J Am Soc Inf Sci 58:1019–1031.
Lu, L, Pan L, Zhou T, Zhang YC, Stanley HE (2015) Toward link predictability of complex networks. Proc Natl Acad Sci 112(8):2325–2330.
Lu, L, Zhou T (2011) Link prediction in complex networks: a survey. Physica A 390:1150–1170. https://doi.org/10.1016/j.physa.2010.11.027.
Pan, L, Zhou T, Lu L, Hu CK (2016) Predicting missing links and identifying spurious links via likelihood analysis. Sci Rep 6(22955). https://doi.org/10.1038/srep22955.
Park, J, Newman MEJ (2004) The statistical mechanics of networks. Phys Rev E 70(6):066117.
Pavlov, M, Ichise R (2007) Finding experts by link prediction in coauthorship networks. FEWS 290:42–55.
Salton, G, McGill MJ (1983) Introduction to modern information retrieval. McGraw-Hill, Auckland.
Schall, D (2014) Link prediction in directed social networks. Soc Netw Anal Min 4(157). https://doi.org/10.1007/s13278-014-0157-9.
Singh, KV, Vig L (2017) Improved prediction of missing protein interactome links via anomaly detection. Appl Netw Sci 2(2). https://doi.org/10.1007/s41109-017-0022-7.
Sorensen, T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biol Skr 5(1):1–34.
Squartini, T, Fagiolo G, Garlaschelli D (2011) Randomizing world trade I. A binary network analysis. Phys Rev E 84(4):046118.
Squartini, T, Garlaschelli D (2011) Analytical maximum-likelihood method to detect patterns in real networks. New J Phys 13(8):083001.
’t Veld, D, van Lelyveld I (2014) Finding the core: network structure in interbank markets. J Bank Financ 49:27–40.
Tan, F, Xia Y, Zhu B (2014) Link prediction in complex networks: a mutual information perspective. PLoS ONE 9(9). https://doi.org/10.1371/journal.pone.0107056.
Zhao, J, Miao L, Yang J, Fang H, Zhang QM, Nie M, Holme P, Zhou T (2015) Prediction of links and weights in networks by reliable routes. Sci Rep 5(12261).
Zhou, T, Lu L, Zhang YC (2009) Predicting missing links via local information. Eur Phys J B 71(4):623–630.
Funding
This work was supported by the EU projects CoeGSS (grant num. 676547), DOLFINS (grant num. 640772), MULTIPLEX (grant num. 317532), Shakermaker (grant num. 687941), SoBigData (grant num. 654024).
Availability of data and materials
Data concerning the World Trade Web are described in Gleditsch (2002) (K. S. Gleditsch. Expanded trade and GDP data. Journal of Conflict Resolution 46, 712-724 (2002)) and can be found at the address http://privatewww.essex.ac.uk/~ksg/exptradegdp.html. Data concerning the foodwebs can be found at the public repository http://vlado.fmf.uni-lj.si/pub/networks/data/bio/foodweb/foodweb.htm. Data concerning the Dutch Interbank Network and eMID cannot be shared because of privacy issues preventing them from being publicly available.
Author information
Contributions
FP and TS developed the method. FP performed the analysis. FP, GC and TS wrote the manuscript. All authors reviewed and approved the manuscript.
Corresponding author
Correspondence to Federica Parisi.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Parisi, F., Caldarelli, G. & Squartini, T. Entropy-based approach to missing-links prediction. Appl Netw Sci 3, 17 (2018). https://doi.org/10.1007/s41109-018-0073-4
PACS numbers
 89.75.Hc; 89.65.Gh; 02.50.Tt