A methodology for multilayer networks analysis in the context of open and private data: biological application

Malek, Maria; Zorzan, Simone; Ghoniem, Mohammad

doi:10.1007/s41109-020-00277-z

Research
Open access
Published: 23 July 2020


Applied Network Science volume 5, Article number: 41 (2020)


Abstract

Recently, an increasing body of work investigates networks with multiple types of links. Variants of such systems were examined decades ago in disciplines such as sociology and engineering, but only recently have they been unified within the framework of multilayer networks. In parallel, many aspects of real systems are increasingly and routinely sensed, measured and described, resulting in many private, but also open data sets. In many domains, publicly available repositories of open data sets constitute a great opportunity for domain experts to contextualise their privately generated data compared to publicly available data in their domain. We propose in this paper a methodology for multilayer network analysis in order to provide domain experts with measures and methods to understand, evaluate and complete their private data by comparing and/or combining them with open data when both are modelled as multilayer networks. We illustrate our methodology through a biological application where interactions between molecules are extracted from open databases and modelled by a multilayer network and where private data are collected experimentally. This methodology helps biologists to compare their private networks with the open data, to assess the connectivity between the molecules across layers and to compute the distribution of the identified molecules in the open network. In addition, the shortest paths which are biologically meaningful are also analysed and classified.

Introduction

Network theory is an important tool for describing and analysing complex systems which are represented as mathematical graphs. It has many applications in social, biological, physical, information and engineering sciences (Fortunato 2010; Newman 2003; Gosak et al. 2017; Seminar 2019; Pavlopoulos et al. 2011; Djemili et al. 2017). For example, it has been used to capture interesting properties of many real networks, e.g. having a heavy-tailed degree distribution, having the small-world property, the existence of nodes playing central roles and/or the existence of modular structures (Newman 2003).

Recently, an increasing body of work investigates networks with multiple types of links, as well as the so-called “networks of networks”. Variants of such systems were examined decades ago in disciplines such as sociology and engineering, but only recently have they been unified, along with other nomenclature, within the framework of multilayer networks defined by Kivelä et al. (2014).

In parallel many aspects of real systems are increasingly and routinely sensed, measured and described, resulting in many private, but also open data sets. By private data we mean data collected internally in a company or institution. Open data refers to the idea that some data should be freely available to everyone to use and republish at will, without restrictions from copyright, patents or other mechanisms of control.

In many domains publicly available repositories of open data sets constitute a great opportunity for domain experts to contextualise their privately generated data compared to publicly available data in their domain.

In this paper we propose a methodology for multilayer network analysis in order to provide domain experts with measures and methods to understand, evaluate and complete their private data by comparing and/or combining them with open data when both are modelled as multilayer networks.

The main contributions of this paper are:

1. We propose a new formalism for multilayer networks that allows fine-grained analysis at two levels: the intra-layer level and the inter-layer level. We show examples of how the definitions of global and local measures, such as density and centralities, can be extended to the inter-layer level and to the whole network.
2. We define the private multilayer network: the graph induced by the private data is extracted so that it can be analysed and compared to the whole network.
3. We define the private egocentric network: the notion of egocentric network, which is defined around a given ego node (Marsden 2002; Djemili et al. 2017), is extended to an egocentric network around a private multilayer network. The private egocentric network can be used to evaluate the connectivity strength between the different layers of private data in comparison to the whole network. It can also help to focus the study of the private network on the space of its neighbours across the layers, especially in the context of very large-scale open networks.
4. We define layer and inter-layer reachability metrics of a given sub-network: these measures are based on the private egocentric network and help to assess the connectivity strength of private data across layers.

We illustrate our methodology through a biological application. The open multilayer network is constructed from open databases where weighted protein-protein, metabolite-metabolite and protein-metabolite interactions are given. The private data are a set of proteins and metabolites collected experimentally, which correspond to a set of nodes in the open multilayer network. We show how the private network is constructed, analysed and compared to the whole (open) network. The private egocentric network is analysed and the layer reachability metrics are computed and discussed. Pathways between pairs of private proteins are then analysed and classified according to their location in the open network (private, egocentric or extra-egocentric). The KEGG (Kyoto Encyclopedia of Genes and Genomes) open data set (Kanehisa and Goto 2000) is also used to describe pathways.

By applying this methodology to the biological data, we show how it can help biologists to complete, assess and interpret their private data by using the open network: weighted interactions between privately collected molecules are added from the open network. The connectivity between the molecules within and across layers is computed, and the distribution of the identified molecules in the open network is observed and interpreted. Reachabilities across layers are computed; in addition, the shortest paths which are biologically meaningful are analysed and classified.

The rest of this paper is organised as follows: in the “Multilayer network analysis elements” section we present the elements and notions we use for multilayer network analysis. Related work is presented in the “Related work” section. The biological application is presented in the “Biological application” section. Finally, we present our conclusion and perspectives in the “Conclusion and perspectives” section.

Multilayer network analysis elements

We first present a new formalism of multilayer networks, as well as examples showing how we adapt global and local measures to the context of multilayer networks. We then give formal definitions of the multilayer egocentric network, of the private multilayer network and of the private egocentric network. Finally, we show how these notions can be used to define the layer and inter-layer reachabilities of a given sub-network.

Notations, properties and metrics

We represent a multilayer network by a tuple that contains a set of vertices, a set of intra-layer edges and a set of inter-layer edges.

Let $\mathbb {N}=(\mathbb {V},\mathbb {E},\mathbb {C})$ be a graph containing l layers (see Fig. 1):

1. $\mathbb {V}=\{V_{1},\ldots ,V_{i},\ldots ,V_{l}\}$ is the set of vertices contained in the layers, where l>1 is the number of layers, V_i is the set of vertices in layer i, $V_{i}=\{v^{i}_{1},\ldots ,v^{i}_{n_{i}}\}$ and n_i=∣V_i∣.
2. $\mathbb {E}=\{E_{1},\ldots ,E_{i},\ldots ,E_{l}\}$ is the set of intra-layer edges: E_i is the set of edges in layer i, $E_{i}=\{(v^{i}_{j}, v^{i}_{k})\mid v^{i}_{j} \in V_{i}, v^{i}_{k} \in V_{i} \}$ (Footnote 1), and we denote ∣E_i∣ by m_i.
3. $\mathbb {C}=\{ C_{i_{1}j_{1}},\ldots ,C_{i_{b}j_{b}} \mid i_{k} \neq j_{k}\}$ is the set of inter-layer links, where b is the number of bipartite components, $C_{ij}=\{(v^{i}_{k},v^{j}_{k'}) \mid v^{i}_{k} \in V_{i}, v^{j}_{k'} \in V_{j}\}$, and we denote ∣C_ij∣ by c_ij.

Fig. 1 Example of a 2-layer network
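As an illustration, the tuple $(\mathbb {V},\mathbb {E},\mathbb {C})$ can be held in a small container. The following is a minimal Python sketch, not the authors' implementation: the class name `MultilayerNetwork`, its methods and the toy vertices are hypothetical, edges are assumed to be undirected, and vertex identifiers are assumed to be unique across layers.

```python
from dataclasses import dataclass, field


@dataclass
class MultilayerNetwork:
    """Hypothetical container for N = (V, E, C)."""
    V: dict = field(default_factory=dict)  # layer id -> set of vertices V_i
    E: dict = field(default_factory=dict)  # layer id -> set of frozenset({u, v}) edges E_i
    C: dict = field(default_factory=dict)  # (i, j) -> set of (u, v) links, u in V_i, v in V_j

    def add_intra_edge(self, layer, u, v):
        # An intra-layer edge joins two vertices of the same layer.
        self.V.setdefault(layer, set()).update((u, v))
        self.E.setdefault(layer, set()).add(frozenset((u, v)))

    def add_inter_edge(self, i, j, u, v):
        # An inter-layer link joins a vertex of layer i to a vertex of layer j, i != j.
        if i == j:
            raise ValueError("inter-layer links must join two different layers")
        self.V.setdefault(i, set()).add(u)
        self.V.setdefault(j, set()).add(v)
        self.C.setdefault((i, j), set()).add((u, v))


# Toy 2-layer network (made-up vertices, for illustration only).
net = MultilayerNetwork()
net.add_intra_edge(1, "a", "b")
net.add_intra_edge(1, "b", "c")
net.add_intra_edge(2, "x", "y")
net.add_inter_edge(1, 2, "a", "x")

n = {i: len(V_i) for i, V_i in net.V.items()}      # n_i = |V_i|
m = {i: len(E_i) for i, E_i in net.E.items()}      # m_i = |E_i|
c = {ij: len(C_ij) for ij, C_ij in net.C.items()}  # c_ij = |C_ij|
```

Keeping the intra-layer edge sets and the bipartite components in separate dictionaries mirrors the separation between the intra-layer and inter-layer levels of analysis.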

This representation makes it possible to adapt global and local metrics so that they take into account both the intra-layer and the inter-layer links. These metrics can then be aggregated into a metric for the whole network. For example, we can propose the following metrics for the density:

Intra-layer density for the layer i: $D_{i}= \frac {m_{i}}{\frac {n_{i} (n_{i}-1)}{2}}$
Inter-layer density for the bipartite component C_ij: $D_{ij}= \frac {c_{ij}}{n_{i} n_{j}}$
Multilayer density: $D= \frac {\sum _{C_{ij} \in \mathbb {C}}{c_{ij}}+\sum _{i=1}^{l}{m_{i}}}{\sum _{C_{ij} \in \mathbb {C}}{n_{i} n_{j}}+ \sum _{i=1}^{l}\frac {n_{i} (n_{i}-1)}{2}}$
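The three densities can be computed directly from the counts n_i, m_i and c_ij. The sketch below is one possible Python transcription of the formulas; the function names and the toy counts are assumptions made for illustration, and every layer is assumed to contain at least two vertices.

```python
def intra_density(n_i, m_i):
    """D_i = m_i / (n_i (n_i - 1) / 2)."""
    return m_i / (n_i * (n_i - 1) / 2)


def inter_density(n_i, n_j, c_ij):
    """D_ij = c_ij / (n_i n_j)."""
    return c_ij / (n_i * n_j)


def multilayer_density(n, m, c):
    """n: layer -> n_i, m: layer -> m_i, c: (i, j) -> c_ij."""
    # Observed links: all inter-layer links plus all intra-layer edges.
    observed = sum(c.values()) + sum(m.values())
    # Possible links: n_i * n_j per bipartite component, n_i (n_i - 1) / 2 per layer.
    possible = sum(n[i] * n[j] for (i, j) in c) + sum(
        n_i * (n_i - 1) / 2 for n_i in n.values()
    )
    return observed / possible


# Toy counts (made up): two layers and one bipartite component.
n = {1: 4, 2: 3}
m = {1: 3, 2: 2}
c = {(1, 2): 6}

D_1 = intra_density(n[1], m[1])             # 3 / 6
D_12 = inter_density(n[1], n[2], c[1, 2])   # 6 / 12
D = multilayer_density(n, m, c)             # (6 + 5) / (12 + 6 + 3)
```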

Likewise, the degree centrality can be generalised to the inter-layer level and to the whole network. The degree centrality and the connectivities of a vertex $v^{i}_{j}$ belonging to the layer V_i are given by:

Intra-layer degree: $CD\left (v^{i}_{j}\right)=\frac {deg_{i}(v^{i}_{j})}{n_{i}-1}$, where n_i=∣V_i∣ and $deg_{i}\left (v^{i}_{j}\right)$ is the degree of $v^{i}_{j}$ in the layer i.
Inter-layer connectivity: we define the connectivity of a vertex in the bipartite component C_ik as $CN_{k}\left (v^{i}_{j}\right)=\frac {deg_{C_{ik}}(v^{i}_{j})}{n_{k}}$, where n_k=∣V_k∣ and $deg_{C_{ik}}\left (v^{i}_{j}\right)$ is the degree of $v^{i}_{j}$ in C_ik.
Multilayer connectivity: we propose to generalise the definition of the connectivity of a node to the whole network: $CN\left (v^{i}_{j}\right)=\frac {deg_{i}(v^{i}_{j}) + \sum _{k \mid C_{ik} \in \mathbb {C}}{CN_{k}(v^{i}_{j})}}{(n_{i}-1) + \sum _{k \mid C_{ik} \in \mathbb {C}}{n_{k}}}$.
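The intra-layer degree and the inter-layer connectivity can be sketched as follows. This is a hedged Python illustration covering only these two local measures: the function names and toy data are hypothetical, intra-layer edges are assumed to be stored as unordered pairs, and inter-layer links as (vertex in layer i, vertex in layer k) pairs.

```python
def intra_degree_centrality(v, V_i, E_i):
    """CD(v) = deg_i(v) / (n_i - 1); E_i is a set of frozenset({u, w}) edges."""
    deg = sum(1 for e in E_i if v in e)
    return deg / (len(V_i) - 1)


def inter_connectivity(v, C_ik, V_k):
    """CN_k(v) = deg_{C_ik}(v) / n_k; C_ik is a set of (u, w) pairs, u in V_i, w in V_k."""
    deg = sum(1 for (u, _) in C_ik if u == v)
    return deg / len(V_k)


# Toy data (made up): layer 1 = {a, b, c}, layer 2 = {x, y}.
V_1 = {"a", "b", "c"}
E_1 = {frozenset(("a", "b")), frozenset(("a", "c"))}
V_2 = {"x", "y"}
C_12 = {("a", "x"), ("b", "x"), ("a", "y")}

cd_a = intra_degree_centrality("a", V_1, E_1)  # a touches both other vertices of layer 1
cd_b = intra_degree_centrality("b", V_1, E_1)
cn_a = inter_connectivity("a", C_12, V_2)      # a is linked to every vertex of layer 2
cn_c = inter_connectivity("c", C_12, V_2)      # c has no inter-layer link
```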

Multilayer egocentric networks

Given a complex network (and more particularly an online social network), the egocentric network defined around an ego node u is a sub-network containing the ego u, its alters (the neighbours) and the set of links between these nodes. In the literature, two cases of online personal networks are identified depending on the distance of the alters from the ego: 1-level and k-level.

Let G=(V,E) be a graph and u a vertex. The 1-level egocentric network of u, G^u=(V^u,E^u), is given by (see Fig. 2):

V^u={x∈V∣(u,x)∈E}∪{u}
E^u={(x,y)∈E∣x∈V^u∧y∈V^u}

Fig. 2 1-level and 2-level ego-networks

We propose an extension of this definition to multilayer networks, which gives access to the alters located in the same layer as the ego as well as to those located in the layers connected to the layer of the ego (see Fig. 3).

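Let u∈V_i. The multilayer egocentric network of u is $\mathbb {N}^{u}=G(V^{u},E^{u})$, with:

$V^{u}=\{x \in V_{i} \mid (u,x) \in E_{i}\} \cup \{u\} \cup \bigcup _{k} \{y \in V_{k} \mid (u,y) \in C_{ik}\}$
$E^{u}=\{(x,y) \in E_{i} \mid x \in V^{u} \wedge y \in V^{u}\} \cup \bigcup _{k} \{(u,y) \in C_{ik}\}$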
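The construction of the 1-level multilayer egocentric network can be sketched in a few lines. The Python code below is an illustrative sketch under stated assumptions rather than the authors' code: the function name and the toy network are hypothetical, links are treated as undirected (so a bipartite component is scanned whichever way its layer pair is ordered), and vertex identifiers are assumed to be unique across layers.

```python
def multilayer_ego_network(u, i, E, C):
    """1-level ego network of vertex u lying in layer i.

    E: layer -> set of frozenset({x, y}) intra-layer edges
    C: (a, b) -> set of (x, y) inter-layer links, x in layer a, y in layer b
    """
    E_i = E.get(i, set())
    # Alters located in the same layer as the ego.
    alters = {x for e in E_i if u in e for x in e if x != u}
    # Alters located in the layers connected to the layer of the ego.
    cross = set()
    for (a, b), links in C.items():
        for x, y in links:
            if a == i and x == u:
                cross.add(y)
            elif b == i and y == u:
                cross.add(x)
    V_u = alters | {u} | cross
    # Intra-layer edges among ego and alters, plus the ego's inter-layer links.
    E_u = {e for e in E_i if e <= V_u} | {frozenset((u, y)) for y in cross}
    return V_u, E_u


# Toy network (made up): path a-b-c-d in layer 1, edge x-y in layer 2.
E = {
    1: {frozenset("ab"), frozenset("bc"), frozenset("cd")},
    2: {frozenset("xy")},
}
C = {(1, 2): {("b", "x"), ("d", "y")}}

V_u, E_u = multilayer_ego_network("b", 1, E, C)
```

In this toy case the ego b keeps its two neighbours in layer 1 and its single neighbour x in layer 2, while the edge c-d is dropped because d is not an alter of b.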

Private multilayer network and private egocentric network

As mentioned before, the purpose of this study is to provide domain experts with measures and methods to understand, evaluate and complete their private data by comparing and/or combining them with open data when both are modelled by multilayer networks. In our case, the private data are a subset of nodes that are identified in the open network. The interactions between these private nodes are extracted from the open network. We therefore propose to study the graph induced by the private data, which has to be constructed, analysed and compared to the whole (open) network (see Fig. 4).

Let $\mathbb {N}=(\mathbb {V},\mathbb {E},\mathbb {C})$ be a multilayer network (extracted from the open data) containing l layers. Let PV={PV₁,..PV_l} be a set of vertices such that PV_i⊂V_i (the private data). We define the private multilayer network $\mathbb {N}[PV]=(\mathbb {PV},\mathbb {PE},\mathbb {PC})$ where:

1. $\mathbb {PE}=\{PE_{1},\ldots ,PE_{i},\ldots ,PE_{l}\}$ is the set of intra-layer edges, where PE_i is the set of edges in layer i given by $PE_{i}=\left \{\left (pv^{i}_{j}, pv^{i}_{k}\right) \in E_{i} \mid pv^{i}_{j} \in PV_{i}, pv^{i}_{k} \in PV_{i} \right \}$.
2. $\mathbb {PC}=\left \{ PC_{i_{1}j_{1}},\ldots ,PC_{i_{b}j_{b}} \mid i_{k} \neq j_{k}\right \}$ is the set of inter-layer links, where $PC_{ij}=\left \{\left (pv^{i}_{k},pv^{j}_{k'}\right) \in C_{ij}\mid pv^{i}_{k} \in PV_{i}, pv^{j}_{k'} \in PV_{j}\right \}$.

In Fig. 4, the blue graph represents the multilayer network $\mathbb {N}$ extracted from the open data, red nodes represent the private data and the red graph illustrates the private multilayer network $\mathbb {N}[PV]$.
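Extracting the private multilayer network amounts to keeping only the edges whose two endpoints are private. A minimal Python sketch follows; the function name `private_network` and the toy data are assumptions made for illustration, with intra-layer edges stored as unordered pairs and inter-layer links as (vertex in layer i, vertex in layer j) pairs.

```python
def private_network(PV, E, C):
    """Return (PE, PC), the intra- and inter-layer edges induced by the private vertices.

    PV: layer -> set of private vertices PV_i
    E:  layer -> set of frozenset({u, v}) intra-layer edges
    C:  (i, j) -> set of (u, v) inter-layer links, u in V_i, v in V_j
    """
    # PE_i: edges of E_i whose two endpoints both belong to PV_i.
    PE = {i: {e for e in E_i if e <= PV.get(i, set())} for i, E_i in E.items()}
    # PC_ij: links of C_ij joining a private vertex of layer i to one of layer j.
    PC = {
        (i, j): {
            (u, v)
            for (u, v) in links
            if u in PV.get(i, set()) and v in PV.get(j, set())
        }
        for (i, j), links in C.items()
    }
    return PE, PC


# Toy open network and private data (made up).
E = {1: {frozenset("ab"), frozenset("bc")}, 2: {frozenset("xy")}}
C = {(1, 2): {("a", "x"), ("c", "y")}}
PV = {1: {"a", "b"}, 2: {"x"}}

PE, PC = private_network(PV, E, C)
```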

We now extend the definition of the egocentric network (which is defined around a given ego node (Marsden 2002; Djemili et al. 2017)) to an egocentric network around a private multilayer network.

We define the private egocentric network as follows:

Let $\mathbb {N}[PV]=(\mathbb {PV},\mathbb {PE},\mathbb {PC})$ be the private multilayer network. We define the private egocentric network $\mathbb {N}^{PV}=G\left (V^{PV},E^{PV}\right)$ by:

$V^{PV}= \bigcup _{u \in PV} \left (\{x \in V_{i} \mid (u,x) \in E_{i}\} \cup \{u\} \cup \bigcup _{k \mid C_{ik} \in \mathbb {C}} \{y \in V_{k} \mid (u,y) \in C_{ik}\}\right)$
$E^{PV}=\bigcup _{u \in PV} \left (\{(x,y) \in E_{i} \mid x \in V^{u} \wedge y \in V^{u} \} \cup \bigcup _{k \mid C_{ik} \in \mathbb {C}} \{(u,y) \in C_{ik}\}\right)$

where, for each private node u, i denotes the layer containing u.

In Fig. 5, red nodes represent the private data and the graph containing red and yellow nodes and edges illustrates the private egocentric network $\mathbb {N}^{PV}$.
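The private egocentric network is the union of the multilayer ego networks of all private nodes. The following Python sketch illustrates this reading of the definition; the function name and toy data are hypothetical, links are treated as undirected, and vertex identifiers are assumed to be unique across layers.

```python
def private_ego_network(PV, E, C):
    """Union of the 1-level multilayer ego networks of all private vertices.

    PV: layer -> set of private vertices
    E:  layer -> set of frozenset({x, y}) intra-layer edges
    C:  (a, b) -> set of (x, y) inter-layer links, x in layer a, y in layer b
    """
    V_pv, E_pv = set(), set()
    for i, private_nodes in PV.items():
        E_i = E.get(i, set())
        for u in private_nodes:
            # Same-layer alters of the private node u.
            alters = {x for e in E_i if u in e for x in e if x != u}
            # Alters of u in the layers connected to layer i.
            cross = set()
            for (a, b), links in C.items():
                for x, y in links:
                    if a == i and x == u:
                        cross.add(y)
                    elif b == i and y == u:
                        cross.add(x)
            V_u = alters | {u} | cross
            V_pv |= V_u
            E_pv |= {e for e in E_i if e <= V_u}
            E_pv |= {frozenset((u, y)) for y in cross}
    return V_pv, E_pv


# Toy open network (made up): path a-b-c-d in layer 1, path x-y-z in layer 2.
E = {
    1: {frozenset("ab"), frozenset("bc"), frozenset("cd")},
    2: {frozenset("xy"), frozenset("yz")},
}
C = {(1, 2): {("b", "x"), ("d", "z")}}
PV = {1: {"a"}, 2: {"x"}}

V_pv, E_pv = private_ego_network(PV, E, C)
```

Here the private nodes a and x are not directly linked, but their ego networks share the non-private node b, which is the kind of cross-layer proximity the private egocentric network is meant to expose.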

Layer and inter-layer reachability of a subnetwork

We define the reachability of a graph for a given layer as follows:

Let $\mathbb {N}=(\mathbb {V},\mathbb {E},\mathbb {C})$ be a multilayer network containing l layers, G=(V,E) a subgraph of $\mathbb {N}$ and i a given layer.

Reachability(G,i) is given by the subgraph $G_{i}=(V^{\prime }_{i},E^{\prime }_{i})$:
- $V^{\prime }_{i}=V \cap V_{i}$
- $E^{\prime }_{i}=E \cap E_{i}$

In order to assess the connection strength between private nodes across layers, we apply the reachability of the private egocentric network computed on a given layer i to another layer j. Let $\mathbb {N}^{PV_{i}}=G\left (V^{PV_{i}},E^{PV_{i}}\right)$ be the private egocentric network computed from the layer i, and let the reachability $Reachability\left (\mathbb {N}^{PV_{i}},j\right)$ to another layer j be the graph $\mathbb {N}^{\prime PV_{i}}_{j}$. Let $V^{\prime }_{j}$ be the set of nodes of $\mathbb {N}^{\prime PV_{i}}_{j}$. We can now evaluate the ratio of reachable private nodes on layer j by computing the precision and the recall as follows (see Figs. 6 and 7):

$$precisionR(i,j)=\frac{\vert V^{\prime}_{j}\cap PV_{j} \vert}{\vert V^{\prime}_{j} \vert}$$

$$recallR(i,j)=\frac{\vert V^{\prime}_{j}\cap PV_{j} \vert}{\vert PV_{j} \vert}$$

precisionR(i,j) gives the ratio of private nodes belonging to layer j that are reachable from layer i to all reachable nodes in the layer j. recallR(i,j) is the ratio of private nodes of the layer j that are reachable from layer i to all private nodes belonging to layer j.
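The two ratios can be computed with plain set operations. The Python sketch below follows the verbal definition given above, that is, the private nodes counted are those of layer j; the function names, the node labels and the convention of returning 0.0 for an empty denominator are assumptions made for illustration.

```python
def reachable_nodes(V_sub, V_layer):
    """Node set of Reachability(G, j): the nodes of the subgraph lying in layer j."""
    return V_sub & V_layer


def precision_recall(V_ego_i, V_j, PV_j):
    """precisionR(i, j) and recallR(i, j).

    V_ego_i: nodes of the private egocentric network computed from layer i
    V_j:     all nodes of layer j
    PV_j:    private nodes of layer j
    """
    reached = reachable_nodes(V_ego_i, V_j)
    hits = reached & PV_j  # private nodes of layer j reachable from layer i
    precision = len(hits) / len(reached) if reached else 0.0
    recall = len(hits) / len(PV_j) if PV_j else 0.0
    return precision, recall


# Toy sets (made up): layer i holds p1, p2; layer j holds m1..m4.
V_ego_i = {"p1", "p2", "m1", "m2", "m3"}
V_j = {"m1", "m2", "m3", "m4"}
PV_j = {"m1", "m4"}

precision, recall = precision_recall(V_ego_i, V_j, PV_j)
```

In this toy case three nodes of layer j are reached, one of which is private, and one of the two private nodes of layer j is reached.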

We also define the inter-layer reachability of a graph for a given bipartite component as follows. Let $\mathbb {N}=(\mathbb {V},\mathbb {E},\mathbb {C})$ be a multilayer network containing l layers, G=(V,E) a subgraph of $\mathbb {N}$ and C_ij a given bipartite component.

InterReachability(G,C_ij) is given by the subgraph $G_{ij}=(V^{\prime }_{ij},E^{\prime }_{ij})$:
- $V^{\prime }_{ij}=(V \cap V_{i}) \cup (V \cap V_{j})$
- $E^{\prime }_{ij}=E \cap C_{ij}$

Given a bipartite component C_ij, we can apply the InterReachability from the private induced multilayer network or from the private egocentric one.

For example, let $\mathbb {N}[PV]$ be the private multilayer network and let $InterReachability(\mathbb {N}[PV],C_{ij})$ be the graph $C^{\prime }_{ij}$. We can evaluate the reachable bipartite edges by computing the ratio $\frac {c^{\prime }_{ij}}{c_{ij}}$, where $c^{\prime }_{ij}= \mid C^{\prime }_{ij} \mid $ and c_ij=∣C_ij∣.
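The ratio of reachable bipartite edges can be sketched in the same way. In the hedged Python example below, the function name and toy data are hypothetical, and links are compared as unordered pairs so that the orientation of a stored link does not matter.

```python
def inter_reachability_edges(E_sub, C_ij):
    """E'_ij = E ∩ C_ij, with links compared as unordered pairs.

    E_sub: set of frozenset({u, v}) edges of the subgraph G
    C_ij:  set of (u, v) inter-layer links of the bipartite component
    """
    return {frozenset(link) for link in C_ij} & E_sub


# Toy bipartite component of the open network and edges of a private network (made up).
C_12 = {("a", "x"), ("b", "x"), ("c", "y"), ("d", "y")}
private_edges = {frozenset(("a", "x")), frozenset(("a", "b"))}

reached = inter_reachability_edges(private_edges, C_12)
ratio = len(reached) / len(C_12)  # c'_ij / c_ij
```

Only the link a-x belongs to both the private network and the bipartite component, so one of the four bipartite edges is reached.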

Related work

Recently, there have been increasingly intense efforts to investigate networks with multiple types of connections as well as the so-called “networks of networks”. Variants of such systems were examined decades ago in disciplines such as sociology and engineering, but only recently have they been unified, along with other nomenclature, within the framework of multilayer networks defined by Kivelä et al.

Kivelä et al. (2014) present a complete review of the field of multilayer networks, detailing the network types, the characteristics of nodes and layers, the notion of aspect and the nature of the coupling between layers.

Many studies currently address themes related to multilayer networks, such as the structure and dynamics of multilayer networks (Boccaletti et al. 2014; Magnani and Rossi 2013; Aleta and Moreno 2019), community detection in multilayer networks (Liu et al. 2018) and visualisation (Mcgee et al. 2019).

Many works also show that experts in multiple domains, such as digital humanities (McGee et al. 2016), biology (Gosak et al. 2017) and techno-anthropology, represent their data as multilayer networks and are aware of the strong need for tools to analyse these data (Kivelä et al. 2019).

In this paper, we propose a methodology for multilayer network analysis in order to provide domain experts with measures and methods to understand, evaluate and complete their private data by comparing and/or combining them with open data when both are modelled as multilayer networks.

This methodology uses a formalism based on a set of graphs, some of which represent layers (see “Notations, properties and metrics” section) while the others are bipartite graphs representing the inter-layer connections. This formalism allows us to clearly separate three types of analysis: the intra-layer one, the inter-layer one and the global one, which aggregates both the intra-layer and inter-layer levels.

In Kivelä et al. (2014), a formalism for the most general type of multilayer network was proposed. An underlying graph that represents this multilayer network is defined, where a node is represented by a tuple containing three identifiers: the node identifier, the layer identifier and the aspect identifier. In addition, two types of edges are proposed: intra-layer edges and inter-layer edges.

Our formalism for multilayer networks allows fine-grained analysis at two levels (see “Notations, properties and metrics” section): the intra-layer level and the inter-layer level. We showed above examples of how the definitions of global and local measures, such as density and centralities, can be extended to the inter-layer level. Measures for the whole network are then computed by aggregating both of these measures.

In many other works (Battiston et al. 2014), a monoplex network is constructed by aggregating data from the different layers of a multilayer network, and the classical definition of node degree is then applied to the resulting monoplex network. However, network aggregation leads to a loss of information. In some other works, the distinction between the layers is maintained and the degree of a node is represented by a vector. It is also possible to define degree and neighbourhood in terms of a focal node and any subset of the layers (Berlingerio et al. 2013).

In addition, we defined layer and inter-layer reachability metrics of a given sub-network. These measures are based on the private egocentric network and help to assess the connectivity strength of private data across layers (see “Layer and inter-layer reachability of a subnetwork” section).

In Kivelä et al. (2014), the measure of node interdependence is defined as the ratio of the number of shortest paths in which two or more layers are used to the total number of shortest paths. It quantifies the value added by multiplexity to the reachability of nodes. The interdependence of a multiplex is computed as the average node interdependence.

Biological application

The aim of this application is to study several sets of biological data collected in experimentally related samples (i.e. cannabis samples). Identified molecules (proteins and metabolites) are measured from the collected biological data in different “omics” experiments: transcriptomics, proteomics and metabolomics. In their experiments, biologists measured, at several time points, contigs (each of which quantifies genes), spots (each of which quantifies one or more proteins) and metabolites. Each gene typically produces one protein, but sometimes more.

At this point we only have nodes (but no edges), corresponding to molecules measured in the experiments. To obtain relationships, biologists frequently use the open STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) database (Szklarczyk et al. 2019), which is the main protein-protein (and so also gene-gene) interaction database, as well as the STITCH (Search Tool for InTeractions of CHemicals) database. STITCH (Szklarczyk et al. 2016) is a twin database including edges between metabolites and metabolites, and also between proteins and metabolites (see Fig. 8). Each interaction in both databases is based on several types of evidence: experimental, coexpression (similar behaviour across several publicly available experiments), text mining (appearing in the same phrase) and pathway (participating in the same known biological network). A combined score aggregating all these types of interactions, whose value is between 0 and 1000, is added to both databases (see Tables 1, 2 and 3).

Table 1 Examples for proteins interactions extracted from the open STRING database used to construct proteins layer
