Ego-zones: non-symmetric dependencies reveal network groups with large and dense overlaps

Kudelka, Milos; Ochodkova, Eliska; Zehnalova, Sarka; Plesnik, Jakub

doi:10.1007/s41109-019-0192-6

Research
Open access
Published: 17 October 2019

Milos Kudelka¹, Eliska Ochodkova¹ (ORCID: orcid.org/0000-0001-8892-8900), Sarka Zehnalova¹ & Jakub Plesnik¹

Applied Network Science volume 4, Article number: 81 (2019)

Abstract

The existence of groups of nodes with common characteristics and the relationships between these groups are important factors influencing the structures of social, technological, biological, and other networks. Uncovering such groups and the relationships between them is, therefore, necessary for understanding these structures. Groups can either be found by detection algorithms based solely on structural analysis or identified on the basis of more in-depth knowledge of the processes taking place in networks. In the first case, these are mainly algorithms detecting non-overlapping communities or communities with small overlaps. The second case concerns identifying ground-truth communities, which also draws on characteristics other than network structure. Recent research into ground-truth communities shows that in real-world networks, there are nested communities or communities with large and dense overlaps which we are not yet able to detect satisfactorily only on the basis of structural network properties.

In our approach, we present a new perspective on the problem of group detection using only the structural properties of networks. Its main contribution is pointing out the existence of large and dense overlaps of detected groups. We use the non-symmetric structural similarity between pairs of nodes, which we refer to as dependency, to detect groups that we call zones. Unlike other approaches, we are able, thanks to non-symmetry, to describe accurately the prominent nodes in the zones that are responsible for large zone overlaps and the reasons why overlaps occur. The individual zones that are detected provide new information associated in particular with the non-symmetric relationships within the group and the roles that individual nodes play in the zone. From the perspective of global network structure, because of the non-symmetric node-to-node relationships, we explore new properties of real-world networks that describe the differences between various types of networks.

Introduction

Frequently solved problems in complex network analysis include the study of network structures. One of the challenges in this area is to design methods capable of detecting groups of nodes that have empirically determined properties that are common in real-world networks.

The procedure associated with this task is community detection, and it is a well-known fact that some real-world networks, e.g., social networks, have a community structure. However, the concept of a network community is not precisely defined. Informally, a network community is often described as a group of nodes that are strongly connected inside the community but weakly connected with other communities. Unfortunately, this definition cannot be applied in real-world situations, where one node may belong to multiple communities. In this case, communities either partially overlap or one community is entirely nested into another community.

There are many different methods used to detect communities in networks. These methods are based on various approaches, the first comprehensive overview of which can be found in Fortunato (2010). This survey is also focused on the methods used for detecting overlapping communities. In this survey, overlaps are understood mostly either as a group of nodes connecting several communities (hubs) or as a connection within a hierarchy described in a way similar to hierarchical clustering by a dendrogram. A detailed overview of overlapping community detection methods can be found in Xie et al. (2013). As the authors point out, a common feature of the methods being investigated is the small fraction of the nodes in the overlaps.

Recent results have shown three essential properties describing network community structure: (a) there are large overlaps that have a higher density compared to the density of overlapping communities (Yang and Leskovec 2012); (b) the similarity of links has a significant influence on the size of communities and their overlaps (Ahn et al. 2010), and (c) on the basis of a close relationship between the high density of triangles and the existence of a community structure, triadic closure as a natural mechanism leads to the emergence of a community structure (Bianconi et al. 2014).

In our approach, we combine ego-network analysis and seed-based community detection methods (Bagrow and Bollt 2005; Clauset 2005) in that we choose a node as a seed for the detection of a group. Our approach differs from these methods in that, as in ego-network analysis, each seed (ego) is the basis of a group, which we call an ego-zone (a zone for short). Ego-zone detection is, similarly to Ahn et al. (2010), based on the analysis of similarities in a network. However, we analyze the similarity of adjacent nodes and, moreover, we treat similarity as non-symmetric, which corresponds better to reality (Tversky 1977). Therefore, the approach presented in this article combines the properties mentioned above with one additional principle: non-symmetric similarity.

To measure similarity, we use dependency (Kudelka et al. 2015). The calculation of the dependency is based on the ratio of the weights of triangles shared by the adjacent nodes and the weights of all the edges of each adjacent node. Using the non-symmetric relation of dependency, in this article we present several key findings based on observations of real-world networks. First, we show that there are three types of nodes in the networks in terms of dependency. They are (1) nodes that are not dependent on any other node, (2) nodes on which no other node depends, and (3) nodes that have around them both dependent nodes and nodes they depend on. In particular, the first type of nodes (independent) describes “key players”, especially in networks with social interaction. The nodes of both the first and the third types significantly affect overlapping groups of nodes, which we call ego-zones. For zones, we define the roles that nodes of each type play in them. We will show that our definition of a zone as a group with two types of internal dependencies and specific roles of nodes, not only in the neighborhood but also in the wider surroundings of the chosen node (ego), leads to overlapping zones. We will explain why overlaps are created and also that they can be large and other zones may be nested inside them. In experiments with both generated and real-world networks, we will show what properties they have in terms of dependency and zones, and how the real-world networks differ from the ones that are generated and among themselves. In experiments with real-world networks, we will explore the relationship between zones and communities in the traditional sense and ground-truth communities. An interesting conclusion is that in some types of networks, it is possible to find zones that correspond to the traditional view of communities, while in others, they correspond to the ground-truth communities. 
In our experiments, we investigate in particular those properties that are related to dependency and detected zones. However, for some comparisons, we also utilize known structural properties of networks, especially average degree, modularity, and average clustering coefficient.

Related work

Groups of nodes that are likely to share common features and/or play similar roles in the network are called clusters or, more often, communities. It is a well-known fact that some real-world networks, e.g., social networks, have a community structure. Community detection is not a well-defined problem because there is no universal definition of community and the nature of communities is not known in advance. The problem is also complicated by the variability of community forms: disjoint, overlapping or, for example, hierarchical communities may appear. As a result, there is no manual on how to use an algorithm, how to evaluate the performance of different algorithms, or how to compare them. Fortunato and Hric (2016) offer a guided tour of the main aspects of this issue, discuss the strengths and weaknesses of popular methods, and provide guidance on how to use them.

One of the first and most frequently cited publications about communities is Girvan and Newman (2002). The authors proposed a community detection algorithm based on edge betweenness, which is a generalization of Freeman’s node betweenness centrality (Freeman 1977). This method is an example of detection methods based on a division of a network (or underlying graph). At the same time, it is an example of a global, or top-down, method. The method is not capable of finding overlapping communities, because each node is assigned to only one community. Other representatives of global methods that produce non-overlapping communities include, among others, one of the oldest algorithms, the Kernighan-Lin algorithm (Kernighan and Lin 1970), the spectral bisection method (Barnes 1982), and hierarchical clustering. Hierarchical clustering uses a symmetric similarity measure because it assumes that communities are made of mutually similar nodes and that this similarity is symmetric. An example of a different approach to hierarchical clustering is the Walktrap algorithm, which is based on a random walk (Pons and Latapy 2005). The novel CAN algorithm (Zhang et al. 2018) has been proposed to reveal community structure using the correlation analysis of nodes. A wide range of methods is further represented by methods based on modularity (Newman and Girvan 2004) and its optimization (Blondel et al. 2008; Guimera et al. 2004). A large number of metrics have been proposed; a detailed survey of the metrics for community detection and evaluation can be found in Chakraborty et al. (2017).

Local (or seed-based) methods begin searching from a random node and then gradually add neighboring nodes one by one on the basis of the optimization of measured metrics or heuristics. This process is named local expansion. From among the many methods, the following can be named: the well-known method of Bagrow and Bollt (2005) or the agglomerative algorithm of Clauset (2005), which uses greedy maximization of local modularity to find local communities. The starting nodes need not only be chosen at random. For instance, in Khorasgani et al. (2010), the community is created as a group of followers assembled around a potential leader.

It is a natural property of many real-world networks, especially social networks, that a node may be a member of multiple communities and not only of one community, which leads to the emergence of overlapping communities. A very popular method is the Clique Percolation Method (Palla et al. 2005), in which the resulting community, named the k-clique community, is the union of all k-cliques that can be reached from each other through a series of adjacent k-cliques. This method, however, assumes the existence of cliques, which seems unrealistic even for social networks. The idea of partitioning edges instead of nodes has also been explored. A node in the original graph is called overlapping if the edges associated with it belong to more than one community (Ahn et al. 2010; Evans and Lambiotte 2010). Local expansion is also used to detect overlapping communities (Lancichinetti et al. 2009; Baumes et al. 2005). Another, dynamic, approach is COPRA, an algorithm that detects overlapping communities in networks by label propagation (Gregory 2010).

An open question is whether the structural view of communities corresponds to real-world communities whose existence is known from non-topological properties of networks (or from the attributes of nodes). A negative answer can be found in Hric et al. (2014). Yang and Leskovec (2015) introduced the concept of ground-truth communities and proposed a methodology that compares and evaluates how well various structural definitions of network communities correspond to ground-truth communities. They allow ground-truth communities to be nested and to overlap. The existence and detection of such nested communities were also addressed by, e.g., Tatti and Gionis (2013).

The community view on groups of nodes is one of the possible ones. A different approach to the analysis of groups of nodes is the egocentric approach. It is focused on the node referred to as the “ego” and its neighbors, known as “alters”. This approach naturally applies mainly to the analysis of social networks. For example, in Abbasi et al. (2012) the authors dealt with the analysis of co-authorship networks and the question of whether the collaboration skills and research performance of researchers were correlated. McAuley and Leskovec (2014) designed an algorithm to automatically detect circles in ego-networks, so that alters may belong to any number of circles, including none. They found circles that were disjoint, overlapping and hierarchically nested.

Our approach to the detection of groups of nodes (ego-zones) is related to Danisch et al. (2013). The authors suspect that a well-chosen set of a few nodes could define a single community. The key idea is that, although one node generally belongs to numerous communities, a small set of appropriate nodes can fully characterize a single community. They work with a similarity measure called the carryover opinion metric.

The term “dependency” can be found in Parshani et al. (2011) and Bashan et al. (2011). The authors work with what are termed “dependency links” and “dependency networks” and analyze the cascading spread of errors in a system. They state that if a node has many neighbors that are dependent on it, then its vulnerability will affect the vulnerability of all of the dependent nodes. This fits in with our concept of ego-zones (see “Ego-zones” section), where we can view ego-zones through the lens of “dependency links”: the removal of the ego from a network can mean, for example, the removal of the entire zone (if it is small and has no sub-zones). Alternatively, it can mean only the break-up of a large zone into sub-zones in which the removed ego does not play an important role (most of the nodes in such a sub-zone are not dependent on this ego).

A similar term, “influence”, is used by Jacob et al. (2016), who propose a graph theory approach, named Dependency Network Analysis, that focuses on the correlation influence between selected brain regions. Partial correlations are used to quantify the level of influence of each node during the performance of a task.

Dependency

If we consider a group of nodes fulfilling a particular purpose or function in a network, then we can expect that the nodes in a group will be similar in terms of this purpose. On the other hand, we can assume that the similarity between two objects in a group does not generally have to be symmetrical. This is based on the assumption that, in assessing the similarity of two objects, it is necessary to take into account not only their common properties but also the properties in which both objects differ (Tversky 1977).

Let us now project this assumption into the structure of a network in order to use this structure to measure the similarity between a pair of adjacent nodes, x and y. Consider all the nodes adjacent to node x or node y. These nodes can be divided into three groups. The first group consists of the shared neighbors of nodes x and y. These neighbors represent triangles shared by both nodes and can be understood as the basis of the similarity. Therefore, a higher number of triangles increases the similarity of nodes x and y. The remaining two groups contain the nodes that are adjacent to only one of the two nodes: those adjacent only to node x and those adjacent only to node y. Here, a higher number of non-shared neighbors of nodes x or y reduces their similarity.
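
The three groups can be made concrete with plain set operations. The following sketch is illustrative only; the function name `neighbor_groups`, the adjacency-dictionary representation, and the toy network are assumptions introduced here and are not part of the original method.

```python
def neighbor_groups(adj, x, y):
    """Split the neighbors of adjacent nodes x and y into three groups.

    adj maps each node to the set of its neighbors (undirected network).
    """
    nx = adj[x] - {y}
    ny = adj[y] - {x}
    shared = nx & ny   # common neighbors: each closes a triangle x-y-v
    only_x = nx - ny   # adjacent to x but not to y
    only_y = ny - nx   # adjacent to y but not to x
    return shared, only_x, only_y


# Toy network (hypothetical): x and y share the neighbor "a".
adj = {
    "x": {"y", "a", "b"},
    "y": {"x", "a", "c", "d"},
    "a": {"x", "y"},
    "b": {"x"},
    "c": {"y"},
    "d": {"y"},
}
shared, only_x, only_y = neighbor_groups(adj, "x", "y")
```

In this toy network, the single shared neighbor supports the similarity of x and y, while the one non-shared neighbor of x and the two non-shared neighbors of y reduce it.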

When formalizing these considerations, let us further assume that we are working with a weighted undirected network. The non-symmetrical similarity of node x to node y will be called a structural dependency, from now on referred to as dependency (Kudelka et al. 2015).

Definition 1

Structural dependency. Let x,y be nodes, then dependency D(x,y) of node x on node y is defined as follows:

$$ D(x, y) = \frac{w(x, y) + \sum_{v_{i} \in CN(x, y)}w(x, v_{i}) \cdot r(x, v_{i}, y)}{\sum_{v_{j} \in N(x)} w(x, v_{j})} $$

(1)

$$ r(x, v_{i}, y) = \frac{w(v_{i}, y)}{w(x, v_{i}) + w(v_{i}, y)}, $$

(2)

where CN(x,y) is the set of all common neighbors of x and y, N(x) is the set of all neighbors of x, w(x,y) is the weight of the edge between x and y, and r(x,v_i,y) is the coefficient of the dependency of node x on node y via the common neighbor v_i.

Equation 1 shows that the numerator contains the weight of the edge between nodes x and y together with the reduced weights of the edges between node x and each shared neighbor. The reduction coefficient (Eq. 2) depends on the weights of the edges between nodes x and y and their shared neighbor, and it increases or decreases with the weight of the edge between the shared neighbor and node y. The denominator contains the sum of the weights of the edges between node x and all its neighbors. When we consider the reverse dependency of node y on node x, the denominator becomes the sum of the weights of the edges between node y and all its neighbors, and the individual weights and reduction coefficients in the numerator change as well. Note, however, that in an undirected network each reduced weight w(x,v_i)·r(x,v_i,y) = w(x,v_i)·w(v_i,y)/(w(x,v_i)+w(v_i,y)) is unchanged when x and y are swapped, so the difference between the two dependencies comes from the denominators. Therefore, the dependency of node x on node y can differ from the dependency of node y on node x.
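
As a minimal sketch of Definition 1, the function below evaluates Eqs. 1 and 2 for a weighted undirected network. The nested-dictionary representation, the function name `dependency`, and the toy weights are assumptions made for illustration; this is not the authors' implementation.

```python
def dependency(w, x, y):
    """Dependency D(x, y) of node x on adjacent node y (Eqs. 1 and 2).

    w is a nested dict: w[u][v] is the weight of the undirected edge u-v,
    stored in both directions.
    """
    if y not in w[x]:
        raise ValueError("x and y must be adjacent")
    common = (set(w[x]) & set(w[y])) - {x, y}   # CN(x, y)
    numerator = w[x][y]
    for v in common:
        r = w[v][y] / (w[x][v] + w[v][y])       # Eq. 2
        numerator += w[x][v] * r
    denominator = sum(w[x].values())            # sum over N(x)
    return numerator / denominator


# Toy weighted network (hypothetical weights), edges stored in both directions.
w = {
    "x": {"y": 2, "v": 1},
    "y": {"x": 2, "v": 3, "z": 4},
    "v": {"x": 1, "y": 3},
    "z": {"y": 4},
}
d_xy = dependency(w, "x", "y")   # (2 + 1 * 3/4) / 3
d_yx = dependency(w, "y", "x")   # (2 + 3 * 1/4) / 9
```

In this toy network, D(x,y) is approximately 0.92 while D(y,x) is approximately 0.31, because node y has additional edge weight leading to neighbors other than x.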

If we work with an unweighted network, then the weights of all the edges will be equal to 1, and all the reduced values will be equal to 0.5. The value of an expression in the numerator will be the same for both dependencies, but the values of denominators can vary. Therefore, even for an unweighted network, the dependency of the nodes is non-symmetric. Thus, our method is designed with weighted networks in mind, but can also be applied to unweighted ones. Moreover, the formulas from Definition 1 can also be used for directed networks; however, this case lies beyond the scope of this article. Therefore, below we will work only with weighted or unweighted undirected networks.
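
Substituting unit weights and the reduction coefficient 0.5 into Eq. 1 gives a compact form for the unweighted case. This follows directly from Definition 1 and is not an additional definition:

$$ D(x, y) = \frac{1 + \frac{1}{2}\,|CN(x, y)|}{|N(x)|}, \qquad D(y, x) = \frac{1 + \frac{1}{2}\,|CN(x, y)|}{|N(y)|}. $$

For example, if x has two neighbors, y has four neighbors, and the two nodes share one common neighbor, then D(x,y) = 1.5/2 = 0.75 and D(y,x) = 1.5/4 = 0.375.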

Figure 1 shows an undirected unweighted network with nine nodes to illustrate different dependencies of neighboring nodes and two zones with their overlap (which will be explained in detail in “Ego-zones” section and the experimental “Zones in generated networks” and “Zones in real-world networks” sections).

IsDependent relationship

To simplify the view on the dependency between two adjacent nodes x and y, let us define the relationship IsDependent as follows:

Definition 2

IsDependent. Let x,y be neighboring nodes, then IsDependent relationship is defined as follows:

IsDependent(x,y)=True if D(x,y)≥0.5; otherwise IsDependent(x,y)=False. The dependency threshold is set to 0.5 to take into account and reasonably balance a mutual dependency between two neighboring network nodes.

This relationship can be used to transform the original network into an unweighted directed network. Figure 2a shows the well-known Karate Club network after the transformation. Edges exist only between nodes where at least one is dependent on the other, and their direction corresponds to the relationship IsDependent. The node size corresponds to the in-degree centrality of the given node. The transformed network in Fig. 2a emphasizes information about the structure of the original network, which is shown in Fig. 2b.
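
The transformation can be sketched as a filter over precomputed dependency values. The code below assumes that a directed edge points from the dependent node to the node it depends on, so that in-degree counts the nodes depending on a given node; this orientation, the function names, and the sample values are assumptions for illustration.

```python
def to_dependency_digraph(dep, threshold=0.5):
    """Return directed edges (x, y) for every pair with D(x, y) >= threshold.

    dep maps ordered pairs of adjacent nodes (x, y) to D(x, y).
    Assumption: an edge points from the dependent node to the node it
    depends on.
    """
    return {(x, y) for (x, y), d in dep.items() if d >= threshold}


def in_degree(edges):
    """Count incoming edges per node (nodes without incoming edges are omitted)."""
    counts = {}
    for _, target in edges:
        counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical dependency values for three adjacent pairs.
dep = {
    ("a", "b"): 0.75, ("b", "a"): 0.30,   # a depends on b only
    ("c", "b"): 0.50, ("b", "c"): 0.50,   # mutual dependency
    ("a", "c"): 0.20, ("c", "a"): 0.10,   # no dependency: the edge disappears
}
edges = to_dependency_digraph(dep)
degrees = in_degree(edges)
```

The pair (a, c) has no dependency in either direction, so it does not appear in the transformed network, whereas the mutually dependent pair (b, c) yields edges in both directions.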

After the network has been transformed into its unweighted directed version, all the neighbors of each node of the network can be, by using Definition 3, divided into four groups described by different types of dependencies (for examples, see Fig. 2).

Definition 3

Four types of dependencies. Let x be a node, then:

OwDep_x is the number of neighbors on which x is dependent, but which are not dependent on x (one-way dependency);
OwIndep_x is the number of neighbors which are dependent on x, but x is not dependent on them (one-way independency);
TwDep_x is the number of neighbors which are dependent on x, and x is dependent on them (two-way dependency);
TwIndep_x is the number of neighbors which are not dependent on x, and x is not dependent on them (two-way independency).
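
The four counts of Definition 3 can be obtained with a single pass over the neighbors of x. In this sketch the IsDependent relationship is passed in as a callable; the function name `dependency_types` and the sample relation are hypothetical.

```python
def dependency_types(x, neighbors, is_dependent):
    """Count the four dependency types of Definition 3 for node x.

    neighbors: iterable of the neighbors of x.
    is_dependent(a, b): True if a is dependent on b (Definition 2).
    """
    counts = {"OwDep": 0, "OwIndep": 0, "TwDep": 0, "TwIndep": 0}
    for n in neighbors:
        x_on_n = is_dependent(x, n)
        n_on_x = is_dependent(n, x)
        if x_on_n and n_on_x:
            counts["TwDep"] += 1      # two-way dependency
        elif x_on_n:
            counts["OwDep"] += 1      # x depends on n only
        elif n_on_x:
            counts["OwIndep"] += 1    # n depends on x only
        else:
            counts["TwIndep"] += 1    # neither depends on the other
    return counts


# Hypothetical IsDependent relation given as a set of ordered pairs.
dependent_pairs = {("x", "a"), ("b", "x"), ("x", "c"), ("c", "x")}
counts = dependency_types(
    "x", ["a", "b", "c", "d"], lambda a, b: (a, b) in dependent_pairs
)
```

Every neighbor falls into exactly one of the four groups, so the counts always sum to the degree of x.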

The nodes that have a non-zero value for OwIndep deserve special attention. These nodes can be divided into two groups (see Fig. 2b). The first group includes (red) nodes that are not dependent on other nodes. The second group contains (yellow) nodes that are dependent on at least one other node.

Definition 4

Prominent nodes. Let x be a node, then:

a node x is called prominent if OwIndep_x>0;
a prominent node x is called strongly-prominent if OwDep_x=0 and TwDep_x=0;
a prominent node x which is not strongly-prominent is called weakly-prominent.
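
Definition 4 reduces to a short decision rule on the counts from Definition 3. The function name `prominence_type` and the string labels below are assumptions chosen for illustration.

```python
def prominence_type(ow_indep, ow_dep, tw_dep):
    """Classify a node by Definition 4 from its dependency counts."""
    if ow_indep > 0:
        if ow_dep == 0 and tw_dep == 0:
            return "strongly-prominent"   # depends on no other node
        return "weakly-prominent"         # depends on at least one node
    return "non-prominent"                # no neighbor depends one-way on it


# Hypothetical counts for three nodes.
kinds = [
    prominence_type(ow_indep=5, ow_dep=0, tw_dep=0),
    prominence_type(ow_indep=2, ow_dep=1, tw_dep=0),
    prominence_type(ow_indep=0, ow_dep=3, tw_dep=1),
]
```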

Strongly-prominent or weakly-prominent nodes play roles of global or local authorities for those network nodes that are unilaterally dependent on them. Below, we will call the nodes in the roles of authorities “centers”. In “Cause of overlaps” section, we show that the existence of prominent nodes is an important aspect causing overlaps between groups.

To determine whether and to what extent a node plays the center role, we define the value of node prominency (see Definition 5). When calculating this value, we measure the degree of independency of the node as the F1 score, based on a confusion matrix in which true positives=OwIndep, false negatives=TwDep, and false positives=OwDep. The point is to assess node x from the perspective of the dependency of its neighbors on it and, conversely, its independency of its neighbors; that is, positives are neighboring nodes dependent on node x, and negatives are the other neighbors.

Definition 5

Prominency. Let x be a node, then its prominency is

$$ Prominency(x) = \frac{2 \cdot {OwIndep}_{x}}{2 \cdot {OwIndep}_{x} + {TwDep}_{x} + {OwDep}_{x}}. $$

(3)

Prominency is not defined for nodes having zero values for all types of dependencies in the formula. In this case, we set Prominency=0.
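
A minimal sketch of Eq. 3, including the convention Prominency=0 for the undefined case, might look as follows. The function name and argument order are assumptions.

```python
def prominency(ow_indep, tw_dep, ow_dep):
    """Prominency of a node (Eq. 3): the F1 score with
    true positives = OwIndep, false negatives = TwDep, false positives = OwDep.
    """
    denominator = 2 * ow_indep + tw_dep + ow_dep
    if denominator == 0:
        return 0.0   # convention for nodes where Eq. 3 is undefined
    return 2 * ow_indep / denominator


# Hypothetical counts.
p_strong = prominency(ow_indep=3, tw_dep=0, ow_dep=0)   # 6 / 6
p_weak = prominency(ow_indep=2, tw_dep=1, ow_dep=1)     # 4 / 6
p_none = prominency(ow_indep=0, tw_dep=2, ow_dep=1)     # 0 / 3
p_undefined = prominency(ow_indep=0, tw_dep=0, ow_dep=0)
```

Note that TwIndep does not enter the formula, so neighbors without any dependency in either direction do not affect the prominency of a node.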

In fact, using prominency, we can divide all network nodes into three prominency types. For strongly-prominent nodes, Prominency=1, and for weakly-prominent nodes, 0<Prominency<1. The remaining network nodes are non-prominent and have Prominency=0. However, prominency should not be seen as a new centrality. For example, there may be nodes that have a comparable degree, but with different types of prominency. Nodes 6, 7 (weakly-prominent), 14, 28 (non-prominent), and 32 (strongly-prominent) in Fig. 2 are examples. Basically, prominency expresses the importance of a node for its neighbors, regardless of the number of these neighbors.

While strongly-prominent nodes are entirely independent, weakly-prominent nodes share their prominency with weakly-prominent or strongly-prominent nodes in their surroundings. In “Zones in real-world networks” section, we analyze 16 real-world networks. One of the key findings is the different proportion between the number of nodes of the three types of prominency for different types of networks (see Fig. 3).

In Fig. 2b, strongly-prominent and weakly-prominent nodes are marked in red and yellow, respectively. In Figs. 18 and 19 in Appendix C, the Les Misérables network is presented as well as the largest connected component of the Net Science network.

The node properties of all three small networks are summarized in Table 1, and the properties associated with the IsDependent relationship are shown in Table 2 (the NetDep property will be explained in “Zones in generated networks” section).

Table 1 Properties of small networks
