Egozones: nonsymmetric dependencies reveal network groups with large and dense overlaps
Applied Network Science volume 4, Article number: 81 (2019)
Abstract
The existence of groups of nodes with common characteristics and the relationships between these groups are important factors influencing the structures of social, technological, biological, and other networks. Uncovering such groups and the relationships between them is, therefore, necessary for understanding these structures. Groups can either be found by detection algorithms based solely on structural analysis or identified on the basis of more in-depth knowledge of the processes taking place in networks. In the first case, these are mainly algorithms detecting nonoverlapping communities or communities with small overlaps. The latter case is about identifying ground-truth communities, also on the basis of characteristics other than only network structure. Recent research into ground-truth communities shows that in real-world networks, there are nested communities or communities with large and dense overlaps which we are not yet able to detect satisfactorily only on the basis of structural network properties.
In our approach, we present a new perspective on the problem of group detection using only the structural properties of networks. Its main contribution is pointing out the existence of large and dense overlaps of detected groups. We use the nonsymmetric structural similarity between pairs of nodes, which we refer to as dependency, to detect groups that we call zones. Unlike other approaches, we are able, thanks to nonsymmetry, to describe accurately the prominent nodes in the zones which are responsible for large zone overlaps and the reasons why overlaps occur. The individual zones that are detected provide new information associated in particular with the nonsymmetric relationships within the group and the roles that individual nodes play in the zone. From the perspective of global network structure, because of the nonsymmetric node-to-node relationships, we explore new properties of real-world networks that describe the differences between various types of networks.
Introduction
Frequently solved problems in complex network analysis include the study of network structures. One of the challenges in this area is to design methods capable of detecting groups of nodes that have empirically determined properties that are common in real-world networks.
The procedure associated with this task is community detection, and it is a well-known fact that some real-world networks, e.g., social networks, have a community structure. However, the concept of a network community is not precisely defined. Informally, a network community is often described as a group of nodes that are strongly connected inside the community but weakly connected with other communities. Unfortunately, this definition cannot be applied in real-world situations, where one node may belong to multiple communities. In this case, communities either partially overlap or one community is entirely nested in another community.
There are many different methods used to detect communities in networks. These methods are based on various approaches, the first comprehensive overview of which can be found in Fortunato (2010). This survey is also focused on the methods used for detecting overlapping communities. In this survey, overlaps are understood mostly either as a group of nodes connecting several communities (hubs) or as a connection within a hierarchy described in a way similar to hierarchical clustering by a dendrogram. A detailed overview of overlapping community detection methods can be found in Xie et al. (2013). As the authors point out, a common feature of the methods being investigated is the small fraction of the nodes in the overlaps.
Recent results have shown three essential properties describing network community structure: (a) there are large overlaps that have a higher density compared to the density of overlapping communities (Yang and Leskovec 2012); (b) the similarity of links has a significant influence on the size of communities and their overlaps (Ahn et al. 2010), and (c) on the basis of a close relationship between the high density of triangles and the existence of a community structure, triadic closure as a natural mechanism leads to the emergence of a community structure (Bianconi et al. 2014).
In our approach, we combine egonetwork analysis and seed-based community detection methods (Bagrow and Bollt 2005; Clauset 2005) in that we choose a node as a seed for the detection of a group. Our approach differs from them in that, as in egonetwork analysis, each seed (ego) is the basis of the group which we call an egozone (a zone in short). Egozone detection is, similarly to Ahn et al. (2010), based on the analysis of similarities in a network. However, we analyze the similarity of adjacent nodes, and moreover, we understand the similarity as nonsymmetric, which corresponds better to reality (Tversky 1977). Therefore, the approach presented in this article combines the properties mentioned above with one additional principle: nonsymmetric similarity.
To measure similarity, we use dependency (Kudelka et al. 2015). The calculation of the dependency is based on the ratio of the weights of triangles shared by the adjacent nodes and the weights of all the edges of each adjacent node. Using the nonsymmetric relation of dependency, in this article we present several key findings based on observations of real-world networks. First, we show that there are three types of nodes in the networks in terms of dependency. They are (1) nodes that are not dependent on any other node, (2) nodes on which no other node depends, and (3) nodes that have around them both dependent nodes and nodes they depend on. In particular, the first type of nodes (independent) describes “key players”, especially in networks with social interaction. The nodes of both the first and the third types significantly affect overlapping groups of nodes, which we call egozones. For zones, we define the roles that nodes of each type play in them. We will show that our definition of a zone as a group with two types of internal dependencies and specific roles of nodes, not only in the neighborhood but also in the wider surroundings of the chosen node (ego), leads to overlapping zones. We will explain why overlaps are created and also that they can be large and other zones may be nested inside them. In experiments with both generated and real-world networks, we will show what properties they have in terms of dependency and zones, and how the real-world networks differ from the generated ones and from each other. In experiments with real-world networks, we will explore the relationship between zones and communities in the traditional sense and ground-truth communities. An interesting conclusion is that in some types of networks, it is possible to find zones that correspond to the traditional view of communities, while in others, they correspond to the ground-truth communities.
In our experiments, we investigate in particular those properties that are related to dependency and detected zones. However, for some comparisons, we also utilize known structural properties of networks, especially average degree, modularity, and average clustering coefficient.
Related work
Groups of nodes that are likely to share common features and/or play similar roles in the network are called clusters or, more often, communities. It is a well-known fact that some real-world networks, e.g., social networks, have a community structure. Community detection is not a well-defined problem because there is no universal definition of community and the nature of communities is not known in advance. The problem is also complicated by the variability of community forms: disjoint, overlapping or, for example, hierarchical communities may appear. As a result, there is no manual on how to use the algorithm, how to evaluate the performance of different algorithms or how to compare them. Fortunato and Hric (2016) offer a guided tour of the main aspects of this issue, discuss the strengths and weaknesses of popular methods, and provide guidance on how to use them.
One of the first and most frequently cited publications about communities is Girvan and Newman (2002). The authors proposed a community detection algorithm based on edge betweenness, which is a generalization of Freeman’s node betweenness centrality (Freeman 1977). This method is an example of detection methods based on a division of a network (or underlying graph). Simultaneously, it is an example of a global, or top-down, method. The method is not capable of finding overlapping communities, because each node is assigned to only one community. Other representatives of global methods, the results of which are nonoverlapping communities, include, among others, one of the oldest algorithms, the Kernighan-Lin algorithm (Kernighan and Lin 1970), the spectral bisection method (Barnes 1982) and hierarchical clustering. The last example uses a symmetrical similarity measure because it assumes that communities are made of mutually similar nodes and this similarity is symmetrical. An example of a different approach to hierarchical clustering is the Walktrap algorithm, which is based on a random walk (Pons and Latapy 2005). The CAN algorithm (Zhang et al. 2018) was proposed to reveal community structure using the correlation analysis of nodes. A wide range of methods is further represented by methods based on modularity (Newman and Girvan 2004) and its optimization (Blondel et al. 2008; Guimera et al. 2004). A large number of metrics have been proposed; a detailed survey of the metrics proposed for community detection and evaluation can be found in Chakraborty et al. (2017).
Local (or seed-based) methods begin searching from a random node and then gradually add neighboring nodes one by one on the basis of the optimization of measured metrics or heuristics. This process is named local expansion. Among the many methods, the following can be named: the well-known method of Bagrow and Bollt (2005) and the agglomerative algorithm of Clauset (2005), which uses greedy maximization of local modularity to find local communities. The starting nodes need not be chosen at random. For instance, in Khorasgani et al. (2010), the community is created as a group of followers assembled around a potential leader.
It is a natural property of many real-world networks, especially social networks, that a node may be a member of multiple communities and not only of one community, which leads to the emergence of overlapping communities. A very popular method is the Clique Percolation Method (Palla et al. 2005), in which the resulting community, named the k-clique community, is the union of all k-cliques that can be reached from each other through a series of adjacent k-cliques. This method, however, assumes the existence of cliques, which appears to be an unrealistic assumption even for social networks. The idea of partitioning edges instead of nodes was also explored. The node in the original graph is called overlapping if the edges associated with it belong to more than one community (Ahn et al. 2010; Evans and Lambiotte 2010). Local expansion is also used to detect overlapping communities (Lancichinetti et al. 2009; Baumes et al. 2005). Another, dynamic, approach is COPRA (Gregory 2010), an algorithm that detects overlapping communities in networks by label propagation.
There is a question whether the structural view on communities corresponds to real-world communities, about the existence of which information is available from nontopological properties of networks (or from the attributes of nodes). A negative answer can be found in Hric et al. (2014). Yang and Leskovec (2015) introduced the concept of ground-truth communities and proposed a methodology that compares and evaluates how well various structural definitions of network communities correspond to ground-truth communities. They allow ground-truth communities to be nested and to overlap. The existence of these nested communities and their detection was also reported by, e.g., Tatti and Gionis (2013).
The community view on groups of nodes is one of the possible ones. A different approach to the analysis of groups of nodes is the egocentric approach. It is focused on the node referred to as the “ego” and its neighbors, known as “alters”. This approach naturally applies mainly to the analysis of social networks. For example, in Abbasi et al. (2012) the authors dealt with the analysis of coauthorship networks and the question of whether the collaboration skills and research performance of researchers were correlated. McAuley and Leskovec (2014) designed an algorithm to automatically detect circles in egonetworks, so that alters may belong to any number of circles, including none. They found circles that were disjoint, overlapping and hierarchically nested.
Our approach to the detection of groups of nodes (egozones) is related to Danisch et al. (2013). The authors suspect that a well-chosen set of a few nodes could define a single community. The key idea is that, although one node generally belongs to numerous communities, a small set of appropriate nodes can fully characterize a single community. They work with a similarity measure called the carryover opinion metric.
The term “dependency” can be found in Parshani et al. (2011); Bashan et al. (2011). The authors work with what are termed “dependency links” and “dependency networks” and analyze the cascade dissemination of errors in a system and state that if a node has a lot of neighbors that are dependent on it, then its vulnerability will affect the vulnerability of all of the dependent nodes. This fits in with our concept of egozones (see “Egozones” section), where we can watch egozones through the lens of “dependency links”, so that the removal of the ego from a network means, for example, the removal of the entire zone (if it is small and has no subzones). Alternatively, it can mean only the breakup of a large zone into subzones, in which the removed ego does not play an important role (most of the nodes in such a subzone are not dependent on this ego).
A similar term, “influence”, is used by Jacob et al. (2016), who propose a graph theory approach that focuses on the correlation influence between selected brain regions, named Dependency Network Analysis. Partial correlations are used to quantify the level of influence of each node during the performance of this task.
Dependency
If we consider a group of nodes fulfilling a particular purpose or function in a network, then we can expect that the nodes in a group will be similar in terms of this purpose. On the other hand, we can assume that the similarity between two objects in a group does not generally have to be symmetrical. This is based on the assumption that, in assessing the similarity of two objects, it is necessary to take into account not only their common properties but also the properties in which both objects differ (Tversky 1977).
Let us now project this assumption into the structure of a network in order to use this structure to measure the similarity between a pair of adjacent nodes, x and y. Consider all the nodes adjacent to node x or node y. These nodes can be divided into three groups. The first group is the shared neighbors of nodes x and y. These neighbors represent triangles shared by both nodes and can be understood as the basis of the similarity. Therefore, a higher number of triangles increases the similarity of nodes x and y. The remaining two groups of nodes include those nodes that are adjacent either to node x or node y. Here, a higher number of nonshared neighbors of nodes x or y reduces their similarity.
When formalizing these considerations, let us further assume that we are working with a weighted undirected network. The nonsymmetrical similarity of node x to node y will be called a structural dependency, from now on referred to as dependency (Kudelka et al. 2015).
Definition 1
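Structural dependency. Let x, y be nodes; then the dependency D(x,y) of node x on node y is defined as follows:
\[D(x,y)=\frac {w(x,y)+\sum _{v_{i}\in CN(x,y)} w(x,v_{i})\cdot r(x,v_{i},y)}{\sum _{v_{j}\in N(x)} w(x,v_{j})} \qquad (1)\]
where CN(x,y) is the set of all common neighbors of x and y, N(x) is the set of all neighbors of x, w(x,y) is the weight of the edge between x and y, and \(r(x,v_{i},y)=\frac {w(v_{i},y)}{w(x,v_{i})+w(v_{i},y)}\) is the coefficient of the dependency of node x on node y via the common neighbor v_{i}.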
Equation 1 shows that the numerator contains the dependency of node x on y with the edge weight between nodes x and y counted in, as well as reduced edge weights between node x and particularly shared neighbors. The reduction is a value dependent on the weight of the edges between nodes x or y and their shared neighbors. The reduction value increases or decreases with an increase or decrease in the weight of the edge between a shared neighbor and node y. The denominator contains the sum of the weights of the edges between node x and all its neighbors. When we consider a reverse dependency of node y on node x, then the denominator will be the sum of the weights of the edges between node y and all its neighbors, and the numerator will also differ because of different weights and reduction values. Therefore, the dependency of node x on node y can differ from the dependency of node y on node x.
If we work with an unweighted network, then the weights of all the edges will be equal to 1, and all the reduced values will be equal to 0.5. The value of an expression in the numerator will be the same for both dependencies, but the values of denominators can vary. Therefore, even for an unweighted network, the dependency of the nodes is nonsymmetric. Thus, our method is designed with weighted networks in mind, but can also be applied to unweighted ones. Moreover, the formulas from Definition 1 can also be used for directed networks; however, this case lies beyond the scope of this article. Therefore, below we will work only with weighted or unweighted undirected networks.
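As an illustration of Definition 1, the following minimal Python sketch computes the dependency on a small toy network. It assumes the reduction coefficient takes the form r(x,v_{i},y)=w(v_{i},y)/(w(x,v_{i})+w(v_{i},y)), which matches the value 0.5 stated above for unweighted networks. The dict-of-dicts graph representation, the helper `unweighted`, and the toy edge list are assumptions made for this example only.

```python
from typing import Dict, Iterable, Tuple

# Assumed representation: node -> {neighbor: edge weight}, symmetric (undirected).
Graph = Dict[str, Dict[str, float]]


def dependency(g: Graph, x: str, y: str) -> float:
    """Dependency D(x, y) of node x on an adjacent node y."""
    if y not in g[x]:
        return 0.0  # assumption: dependency is only considered for adjacent nodes
    numerator = g[x][y]
    for v in set(g[x]) & set(g[y]):              # common neighbors CN(x, y)
        r = g[v][y] / (g[x][v] + g[v][y])        # assumed reduction coefficient
        numerator += g[x][v] * r
    return numerator / sum(g[x].values())        # all edges of x in the denominator


def unweighted(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected graph in which every edge has weight 1."""
    g: Graph = {}
    for a, b in edges:
        g.setdefault(a, {})[b] = 1.0
        g.setdefault(b, {})[a] = 1.0
    return g


# Toy network: a triangle a-b-c plus two pendant nodes d and e attached to c.
g = unweighted([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("c", "e")])
print(dependency(g, "a", "c"))  # 0.75
print(dependency(g, "c", "a"))  # 0.375
```

The numerators of the two calls are identical (1 + 0.5), while the denominators differ (2 edges of a versus 4 edges of c), which is exactly the source of nonsymmetry in the unweighted case.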
Figure 1 shows an undirected unweighted network with nine nodes to illustrate different dependencies of neighboring nodes and two zones with their overlap (which will be explained in detail in “Egozones” section and the experimental “Zones in generated networks” and “Zones in realworld networks” sections).
IsDependent relationship
To simplify the view on the dependency between two adjacent nodes x and y, let us define the relationship IsDependent as follows:
Definition 2
IsDependent. Let x, y be neighboring nodes; then the IsDependent relationship is defined as follows:
IsDependent(x,y)=True if D(x,y)≥0.5; otherwise IsDependent(x,y)=False. The dependency threshold is set to 0.5 to take into account and reasonably balance a mutual dependency between two neighboring network nodes.
This relationship can be used to transform the original network into an unweighted directed network. Figure 2a shows the well-known Karate Club network after the transformation. Edges exist only between nodes where at least one is dependent on the other, and their direction corresponds to the relationship IsDependent. The node size corresponds to the indegree centrality of the given node. The transformed network in Fig. 2a emphasizes information about the structure of the original network, which is shown in Fig. 2b.
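The transformation can be sketched in a few lines of Python. The sketch assumes that dependency values have already been computed for every ordered pair of adjacent nodes and are stored in a hypothetical dictionary `dep_values`; the values below belong to a made-up toy network (a triangle a-b-c with two pendant nodes d and e attached to c), not to the Karate Club network.

```python
def dependency_arcs(dep_values, threshold=0.5):
    """Keep an arc (x, y) whenever IsDependent(x, y) holds, i.e. D(x, y) >= threshold."""
    return {(x, y) for (x, y), d in dep_values.items() if d >= threshold}


def in_degree(arcs):
    """Number of nodes dependent on each node (the node size in the transformed network)."""
    counts = {}
    for _, target in arcs:
        counts[target] = counts.get(target, 0) + 1
    return counts


# Hypothetical dependency values D(x, y) for all ordered pairs of adjacent nodes.
dep_values = {
    ("a", "b"): 0.75, ("b", "a"): 0.75,
    ("a", "c"): 0.75, ("c", "a"): 0.375,
    ("b", "c"): 0.75, ("c", "b"): 0.375,
    ("d", "c"): 1.0, ("c", "d"): 0.25,
    ("e", "c"): 1.0, ("c", "e"): 0.25,
}

arcs = dependency_arcs(dep_values)
print(sorted(arcs))
print(in_degree(arcs))
```

In this toy case, node c receives four incoming arcs and has no outgoing ones, so it would be drawn as the largest node.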
After the network has been transformed into its unweighted directed version, all the neighbors of each node of the network can be, by using Definition 3, divided into four groups described by different types of dependencies (for examples, see Fig. 2).
Definition 3
Four types of dependencies. Let x be a node, then:

OwDep_{x} is the number of neighbors on which x is dependent, but which are not dependent on x (oneway dependency);

OwIndep_{x} is the number of neighbors which are dependent on x, but x is not dependent on them (oneway independency);

TwDep_{x} is the number of neighbors which are dependent on x, and x is dependent on them (twoway dependency);

TwIndep_{x} is the number of neighbors which are not dependent on x, and x is not dependent on them (twoway independency).
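The four counts of Definition 3 can be obtained by classifying each neighbor of x according to the two directions of the IsDependent relationship. The sketch below assumes that the relationship is available as a set of ordered pairs `dep`, where (x, y) means that x is dependent on y; the names `neighbors` and `dep` and the toy data are illustrative assumptions.

```python
def dependency_types(neighbors, dep, x):
    """Count the four dependency types of node x.

    neighbors: node -> iterable of adjacent nodes
    dep: set of ordered pairs (u, v) meaning "u is dependent on v"
    """
    counts = {"OwDep": 0, "OwIndep": 0, "TwDep": 0, "TwIndep": 0}
    for y in neighbors[x]:
        x_on_y = (x, y) in dep
        y_on_x = (y, x) in dep
        if x_on_y and y_on_x:
            counts["TwDep"] += 1      # mutual dependency
        elif x_on_y:
            counts["OwDep"] += 1      # x depends on y only
        elif y_on_x:
            counts["OwIndep"] += 1    # y depends on x only
        else:
            counts["TwIndep"] += 1    # neither depends on the other
    return counts


# Toy network: triangle a-b-c with pendant nodes d and e attached to c.
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d", "e"],
             "d": ["c"], "e": ["c"]}
dep = {("a", "b"), ("b", "a"), ("a", "c"), ("b", "c"), ("d", "c"), ("e", "c")}

print(dependency_types(neighbors, dep, "c"))
print(dependency_types(neighbors, dep, "a"))
```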
The nodes that have a nonzero value for OwIndep deserve special attention. These nodes can be divided into two groups (see Fig. 2b). The first group includes (red) nodes that are not dependent on other nodes. The second group contains (yellow) nodes that are dependent on at least one other node.
Definition 4
Prominent nodes. Let x be a node, then:

a node x is called prominent if OwIndep_{x}>0;

a prominent node x is called stronglyprominent if OwDep_{x}=0 and TwDep_{x}=0;

a prominent node x which is not stronglyprominent is called weaklyprominent.
Stronglyprominent or weaklyprominent nodes play roles of global or local authorities for those network nodes that are unilaterally dependent on them. Below, we will call the nodes in the roles of authorities “centers”. In “Cause of overlaps” section, we show that the existence of prominent nodes is an important aspect causing overlaps between groups.
To determine whether and to what extent a node plays the center role, we define the value of node prominency (see Definition 5). When calculating this value, we measure the degree of independency of the node as the F1 score, based on a confusion matrix in which true positives=OwIndep, false negatives=TwDep, and false positives=OwDep. The point is to assess the network node x from the perspective of how its neighbors depend on it and, conversely, how independent it is of its neighbors: positives are the neighboring nodes dependent on node x, and negatives are its other neighbors.
Definition 5
Prominency. Let x be a node; then its prominency is
\[Prominency(x)=\frac {2\cdot OwIndep_{x}}{2\cdot OwIndep_{x}+OwDep_{x}+TwDep_{x}}\]
Prominency is not defined for nodes having zero values for all types of dependencies in the formula. In this case, we set Prominency=0.
In fact, using prominency, we can divide all network nodes into three prominency types. For stronglyprominent nodes, Prominency=1, and for weaklyprominent nodes, 0<Prominency<1. The remaining network nodes are nonprominent and have Prominency=0. However, prominency should not be seen as a new centrality. For example, there may be nodes that have a comparable degree, but with different types of prominency. Nodes 6, 7 (weaklyprominent), 14, 28 (nonprominent), and 32 (stronglyprominent) in Fig. 2 are examples. Basically, prominency expresses the importance of a node for its neighbors, regardless of the number of these neighbors.
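The following short Python sketch evaluates prominency as the F1 score described above, i.e. 2·OwIndep/(2·OwIndep+OwDep+TwDep), and classifies a node according to Definition 4. The function names and the example counts are illustrative assumptions.

```python
def prominency(ow_indep: int, ow_dep: int, tw_dep: int) -> float:
    """F1 score with TP = OwIndep, FP = OwDep, FN = TwDep; 0 if undefined."""
    denominator = 2 * ow_indep + ow_dep + tw_dep
    if denominator == 0:
        return 0.0  # all three counts are zero: prominency is set to 0
    return 2 * ow_indep / denominator


def prominency_type(ow_indep: int, ow_dep: int, tw_dep: int) -> str:
    """Classify a node as stronglyprominent, weaklyprominent, or nonprominent."""
    if ow_indep == 0:
        return "nonprominent"
    if ow_dep == 0 and tw_dep == 0:
        return "stronglyprominent"
    return "weaklyprominent"


print(prominency(4, 0, 0), prominency_type(4, 0, 0))  # 1.0 stronglyprominent
print(prominency(1, 1, 0), prominency_type(1, 1, 0))  # 0.6666666666666666 weaklyprominent
print(prominency(0, 1, 1), prominency_type(0, 1, 1))  # 0.0 nonprominent
```

Note that the value depends only on the proportions of the three counts, not on the degree of the node, which is consistent with prominency not being a centrality measure.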
While stronglyprominent nodes are entirely independent, weaklyprominent nodes share their prominency with weaklyprominent or stronglyprominent nodes in their surroundings. In “Zones in realworld networks” section, we analyze 16 realworld networks. One of the key findings is the different proportion between the number of nodes of the three types of prominency for different types of networks (see Fig. 3).
In Fig. 2b, stronglyprominent or weaklyprominent nodes are marked in red or yellow. In Figs. 18 and 19 in Appendix C, the Les Misérables network is presented as well as the largest connected component of the Net Science network.
The node properties of all three small networks are summarized in Table 1, and the properties associated with the IsDependent relationship are shown in Table 2 (the NetDep property will be explained in “Zones in generated networks” section).
Egozones
Using the dependency, we are able to describe a node group within a network with specific characteristics exactly and unambiguously. This description is based on one “central” node and dependencies in its surroundings.
Definition 6
Egozone. The egozone is a group of network nodes meeting the following three criteria:

1. the default member of the egozone is any network node called ego;

2. a member of the egozone is any node that is dependent on ego or another node of the egozone; the set of all such nodes including the ego is called the innerzone;

3. a member of an egozone is each node outside the innerzone on which at least one node of the innerzone is dependent; the set of all such nodes is called the outerzone.
Outerzone nodes can be divided into two groups based on whether they are dependent on other nodes in the outerzone.
Definition 7
Outerzone nodes. The outerzone consists of two types of nodes: a liaison is an outerzone node which is not dependent on any other node of the outerzone; a coliaison is an outerzone node which is dependent on at least one other node of the outerzone.
For egozones, an alternative name – dependency zone – can be used in networks other than social ones; below, we will only use zone. The algorithm to detect zones based on an iterative procedure derived from Definition 6, including its scalability, is given in Appendix A.
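The authors' detection algorithm is the one given in Appendix A. The Python sketch below is not that algorithm; it is only a direct, unoptimized reading of Definitions 6 and 7. It assumes that the IsDependent relationship has been precomputed and stored in a hypothetical mapping `depends_on` (node to the set of nodes it depends on), and it interprets point 2 of Definition 6 as a closure over the innerzone.

```python
from collections import deque


def ego_zone(depends_on, ego):
    """Return (innerzone, liaisons, coliaisons) for the given ego.

    depends_on: node -> set of nodes on which that node is dependent
    """
    # Reverse index: node -> set of nodes that are dependent on it.
    dependents = {}
    for u, targets in depends_on.items():
        for v in targets:
            dependents.setdefault(v, set()).add(u)

    # Points 1 and 2: the ego plus every node directly or indirectly dependent on it.
    inner = {ego}
    queue = deque([ego])
    while queue:
        v = queue.popleft()
        for u in dependents.get(v, ()):
            if u not in inner:
                inner.add(u)
                queue.append(u)

    # Point 3: nodes outside the innerzone on which some innerzone node depends.
    outer = set()
    for u in inner:
        outer |= depends_on.get(u, set()) - inner

    # Definition 7: liaisons depend on no other outerzone node; coliaisons do.
    liaisons = {v for v in outer if not (depends_on.get(v, set()) & outer)}
    coliaisons = outer - liaisons
    return inner, liaisons, coliaisons


# Toy dependencies: a and b depend on each other and on c; d and e depend on c.
depends_on = {"a": {"b", "c"}, "b": {"a", "c"}, "d": {"c"}, "e": {"c"}}

print(ego_zone(depends_on, "a"))
print(ego_zone(depends_on, "c"))
```

In this toy example, egos a and b generate the same innerzone {a, b} with c as the liaison, while the zone of ego c contains all five nodes and has an empty outerzone, so the smaller zone lies entirely inside the larger one.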
For illustration, Fig. 4 shows a zone with the red ego and four regular nodes in the yellow innerzone, one liaison and two coliaisons in the blue outerzone, and four green nodes outside the zone. The edge directions represent the dependency between nodes.
Thus, for each node of the network (ego), there exists its innerzone, the size of which depends on the degree of direct or indirect dependency of the surrounding nodes on this ego. A natural characteristic is that there may be, especially in clique-close structures, several egos that have the same innerzone. In this case, as is apparent from point 3 in Definition 6, their outerzones must also be the same, and thus the zones as a whole. We consider those zones with the same innerzone a single zone and refer to them as a multiego zone. Individual pairs of egos in these multiego zones must be dependent on one another (TwDep); otherwise, the individual egos would generate different innerzones. On the other hand, there may be zones with corresponding nodes but with different innerzones, and therefore their outerzones also differ. Such zones are considered different and are referred to as zones with alternative role configurations or alternative zones. For examples of multiego and alternative zones, see Fig. 2.
As we will demonstrate in “Zones in realworld networks” section, depending on the network structure, zones of various sizes may exist, including zones with hundreds of nodes. Nevertheless, there may be trivial zones with a single node (e.g., node 28 in the Karate Club network). In a more detailed view of the zone as a whole, there may be zones that have, for example, more nodes in the outerzone than in the innerzone, or, vice versa, zones that have no outerzone.
From Definition 6 it follows that the zone in the network is unambiguously detected and that, in addition to the group of nodes belonging to the zone, we also receive further information:

1. the first one is about nonsymmetric dependencies within a group. This information results in knowledge of the prominency of the individual nodes in the group;

2. the second one is the roles that individual nodes play in the zone. In particular, the questions are whether and to what extent there are prominent nodes in the group playing the role of egos in the innerzone or liaisons and coliaisons in the outerzone. Both these pieces of information are crucial for a detailed assessment of intra- and intergroup relationships;

3. the third one is the size of the zone and its density. Small densely connected zones are expected to be homogeneous, and their heterogeneity increases as the number of nodes (especially liaisons) in the zone increases and their density decreases.
We can suppose that the more interconnected the group is, the higher the consensus on the purpose or function of the group will be. Nodes that correspond entirely to this purpose have no edges that face outward. Conversely, if the nodes have edges that face outward, they represent this purpose only in part of their egos. Moreover, the more prominent nodes the zone contains, the higher the potential of the zone overlapping with other zones will be, as will be explained in “Cause of overlaps” section.
Although this is an intuition-based estimation, in an experiment in “Zones and groundtruth communities” section with four networks with identified ground-truth communities, we will show that the zones detected in some of these networks are a relatively good match for a nontrivial number of real communities.
We divide the nodes of the zone into four groups. The first group contains the egos, and the second group contains the other nodes of the innerzone. The third and fourth groups contain liaisons and coliaisons, both belonging to the outerzone. An example of the inner and outerzones in the Les Misérables network is shown in Fig. 5. The properties of the nodes associated with zones are summarized in Table 3, the properties of zones are listed in Table 4, and the properties related to zone overlaps are listed in Table 5.
The values in Table 4 show that zones may overlap, so that one node can be a member of multiple zones. It is also clear that the maximum value of zone membership corresponds to the maximum value of membership in the liaison or coliaison role. Here the key parameter is prominency: nodes with a nonzero value of prominency, i.e., stronglyprominent or weaklyprominent nodes, have a nonzero value of OwIndep, so they have neighbors that are dependent on them while they are not dependent on these neighbors. In the zones to which these neighbors belong, there can be a strongly or weaklyprominent node in the liaison role, and a weaklyprominent node in the coliaison role. Thus, the higher the value of OwIndep, the higher the potential of the prominent node for membership in different overlapping zones in the liaison role. In all three networks mentioned above, the node with the maximum membership value is in the liaison role in all the zones it belongs to (except for its own, where it is the ego).
Cause of overlaps
Nodes in the role of liaison or coliaison are the cause of large overlaps. This follows from the fact that these nodes can be in both the outer and innerzones of different zones. If we have such a node in two different zones, in the first of which it is in the outerzone and in the second zone it is in the innerzone, then the nodes from the first zone that are dependent on this node must be in the overlap of these zones as well.
For a better idea, let us have zone Z in which node v is a liaison (or coliaison). Then node v is in its outerzone. Thus, according to point 3 of Definition 6, one or more nodes u_{i} from the zone Z belonging to its innerzone must be dependent on node v. Next, let us have zone Z_{v} in which v is any node of its innerzone (e.g., ego). Then from point 2 of Definition 6 it follows that zone Z_{v} also contains nodes u_{i} which are dependent on v. As a result, nodes v and u_{i} must belong to the overlap of zones Z and Z_{v}.
The emergence of overlaps is illustrated in Fig. 1, where node 4 is the ego of the green zone and node 7 is its liaison; at the same time, node 7 is the ego of the yellow zone. Thus, the overlap of these zones includes nodes 4 and 7 as well as nodes 5 and 6, which are dependent on node 7. The same holds in the opposite direction: node 7 is the ego of the yellow zone, for which node 4 is the liaison. Moreover, the overlap is a multiego zone in which the egos are nodes 5 and 6; nodes 4 and 7 are the liaisons of this zone.
To assess network community structure and the quality of detected zones, we utilize two parameters: the first one is modularity, and the second one is embeddedness. The higher the value of modularity, the clearer the network community structure. Table 1 shows the weighted modularity Q for each network, which comes from the interval [−1,1] and is defined as follows (Blondel et al. 2008):
\[Q=\frac {1}{2m}\sum _{ij}\left [A_{ij}-\frac {k_{i}k_{j}}{2m}\right ]\delta _{c_{i}, c_{j}}\]
where A_{ij} is the edge weight between nodes i and j, c_{i} is the community to which node i is assigned, \(k_{i}=\sum _{j} A_{ij}\) is the sum of weights of the edges attached to node i, Kronecker delta \(\delta _{c_{i}, c_{j}}\) is 1 if c_{i}=c_{j} and 0 otherwise, and \(m=\frac {1}{2}\sum _{ij}A_{ij}\).
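For readers who want to check modularity values on small examples, the following Python sketch evaluates the standard weighted modularity of Blondel et al. (2008), Q = (1/2m) Σ_{ij} [A_{ij} − k_{i}k_{j}/2m] δ(c_{i},c_{j}), by a direct double sum. The dict-of-dicts representation, the helper `build`, and the toy graph (two triangles joined by a single edge) are assumptions made for illustration; the quadratic loop is suitable only for small networks.

```python
def modularity(A, community):
    """Weighted modularity Q of a partition.

    A: node -> {neighbor: weight}, symmetric
    community: node -> community label
    """
    k = {i: sum(A[i].values()) for i in A}   # weighted degree k_i
    two_m = sum(k.values())                  # 2m = sum of all A_ij
    q = 0.0
    for i in A:
        for j in A:
            if community[i] == community[j]:  # Kronecker delta
                q += A[i].get(j, 0.0) - k[i] * k[j] / two_m
    return q / two_m


def build(edges):
    """Undirected graph with unit edge weights."""
    A = {}
    for a, b in edges:
        A.setdefault(a, {})[b] = 1.0
        A.setdefault(b, {})[a] = 1.0
    return A


# Two triangles (1, 2, 3) and (4, 5, 6) connected by the edge 3-4.
A = build([(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)])
partition = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
print(round(modularity(A, partition), 3))  # 0.357
```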
To evaluate the quality of zones, we use the value of embeddedness from the interval [0,1]; see Table 3. Group embeddedness is defined as the ratio between the internal degree of the group and its total degree (Hric et al. 2014). The higher the zone embeddedness value, the more strongly the nodes belong to the group as a whole. For a group of nodes with the sum of internal degrees k_{in} and the sum of total degrees k_{tot}, the group embeddedness ξ is defined as follows:
\[\xi =\frac {k_{in}}{k_{tot}}\]
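The ratio of internal to total degree can be computed directly from an adjacency structure. The sketch below assumes an unweighted, undirected network stored as a dictionary of neighbor sets; the function name and the input format are illustrative choices, not part of the original method description.

```python
def group_embeddedness(adjacency, group):
    """Embeddedness of a node group: internal degree / total degree.

    adjacency: dict mapping node -> set of neighbors (undirected, unweighted).
    group: iterable of nodes forming the group (e.g., a zone).
    """
    members = set(group)
    k_in = 0   # sum of internal degrees (links to other group members)
    k_tot = 0  # sum of total degrees of the group members
    for node in members:
        neighbors = adjacency[node]
        k_tot += len(neighbors)
        k_in += len(neighbors & members)
    # A group of isolated nodes has no degree at all; report 0 by convention.
    return k_in / k_tot if k_tot else 0.0


# Two triangles joined by the edge 2-3.
toy_adjacency = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
xi_triangle = group_embeddedness(toy_adjacency, {0, 1, 2})
```

Here the triangle {0, 1, 2} has an internal degree of 6 and a total degree of 7, so its embeddedness is 6/7; a group covering a whole connected component has embeddedness 1.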
The modularity in Table 1 shows that all three small networks have a community structure. However, the zones that are detected have a lower average value of zone embeddedness and therefore a lower quality than can be expected from communities. This fact is the first factor to indicate that zones, despite their exact and deterministic definition, cannot generally be considered communities. This is because the dependencies in both the inner and outerzones implicitly do not provide a higher degree of interconnection within the zone than outside of it.
For each network, all pairs of overlapping zones were found. Table 5 lists the total number of overlaps, and their maximum and average sizes. The last four columns of the table show the total number of zones that are nested in some overlap of two other zones and their maximum and average sizes. The maximum overlap sizes are 7, 16, and 13, and the maximum zone sizes in the overlaps are 7, 15, and 11. In Fig. 6, for illustration, the Les Misérables network is shown with zone overlaps marked. There exist zones with other nested zones. We will use the terms superzone and subzone to describe such situations. Simply put, in this example, the two listed zones are superzones for all the zones in the overlap, and, vice versa, the zones in the overlap are subzones of both zones.
Figure 6 also shows two essential characteristics that correspond to the observation of realworld networks (Yang and Leskovec 2015). The overlaps of groups can be large and in the overlaps other groups can exist. To gain comprehensible visualization of these relations, we visualize the zone structures as weighted directed networks of subzones and zones that are nested in the overlaps of other zones. The visualized networks also include multiego zones and zones with the same group of nodes, but with alternative role configurations. Figure 7 shows the structures of zones for the Karate Club (A) and Les Misérables (B) networks. Figure 20 in Appendix C shows the same structure for the largest connected component of the Net Science network.
In the experiments described in “Zones in generated networks” and “Zones in realworld networks” sections, we analyze the zone properties of generated and realworld networks. To analyze the overlaps, we detected only those overlapping pairs of zones in each network for which each pair of overlapping zones had at least ten nodes. For each overlap with at least four nodes, we found the largest zone that the overlap contained (if such exists).
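The overlap analysis described above can be sketched as follows, assuming that zones have already been detected and are available as node sets. The function name, the dictionary input, and the reading of the size threshold (each zone of a pair must have at least ten nodes) are assumptions of this example; the pairwise loop is a brute-force illustration rather than an efficient procedure.

```python
from itertools import combinations


def overlaps_with_nested_zones(zones, min_zone_size=10, min_overlap_size=4):
    """Find overlapping zone pairs and the largest zone nested in each overlap.

    zones: dict mapping zone id -> set of nodes.
    Returns a list of (id_a, id_b, overlap, nested_id); nested_id is None
    when the overlap is too small or contains no other zone.
    """
    large = [(zid, nodes) for zid, nodes in zones.items()
             if len(nodes) >= min_zone_size]
    results = []
    for (id_a, nodes_a), (id_b, nodes_b) in combinations(large, 2):
        overlap = nodes_a & nodes_b
        if not overlap:
            continue
        nested_id = None
        if len(overlap) >= min_overlap_size:
            # Zones other than the pair itself that lie entirely in the overlap.
            nested = [(len(nodes), zid) for zid, nodes in zones.items()
                      if zid not in (id_a, id_b) and nodes <= overlap]
            if nested:
                nested_id = max(nested, key=lambda item: item[0])[1]
        results.append((id_a, id_b, overlap, nested_id))
    return results


toy_zones = {
    "A": set(range(0, 12)),
    "B": set(range(6, 18)),
    "C": {6, 7, 8, 9},
    "D": {7, 8},
}
toy_overlaps = overlaps_with_nested_zones(toy_zones)
```

In the toy input, zones A and B share the six nodes 6 to 11, and the largest zone nested in that overlap is C.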
Zones in generated networks
How the network structure affects the properties of zones is a natural question. Typical properties of real-world networks include, for example, scale-freeness related to the power-law distribution of node degree, or community structure. Generative models can be used to explore various properties of networks. We use three models to assess how the properties of zones are related to network properties. The first is the Erdös-Rényi (ER) model, which generates random networks (Erdös and Rényi 1959); the second is the Barabási-Albert (BA) model, which is based on preferential attachment and generates scale-free networks (Albert and Barabási 2002); and the third is the Triadic Closure based (TC) model, which generates scale-free networks with a community structure (Bianconi et al. 2014). We used various settings for the experiments. The results were unweighted undirected networks with 10000 nodes. If a disconnected network was generated, we used the largest connected component for the analysis.
For networks generated using the ER model, we set the probability p of an edge between a pair of nodes to 0.0005 and 0.001. These values lie around the threshold at which the network becomes connected (0.0009). For the BA model, we chose the values 3 and 4 for m, the number of existing nodes to which a new node is connected. In its basic version, the TC model works with two parameters that govern how a new node is connected to the network. In the first step, the new node is connected to a randomly selected network node. In the second step, the first parameter, the probability p, determines how often a neighbor of the selected node is preferred over a randomly selected node as the next connection. The second parameter m defines the total number of connections for the new node and, as a result, the m−1 repeated connections in the second step. A higher value of p increases the local connectedness among nodes and thus emphasizes the community structure of the network. In contrast, a higher value of m increases the network density. For the experiments, we chose 0.7 and 0.97 for p and 2, 3, and 4 for m. The results of the analysis of the generated networks are summarized in Tables 6, 7, 8, 9, and 10.
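The two-step growth rule of the TC model, as described above, can be sketched in a few lines. This is an illustrative reading of the description, not the reference implementation of Bianconi et al. (2014): the function name, the seed network (a clique on m+1 nodes), and the fallback to a random node when the selected node has no free neighbor are all assumptions of this example.

```python
import random


def triadic_closure_network(n, m, p, seed=None):
    """Grow an undirected network with a simple triadic-closure rule.

    n: number of nodes, m: connections per new node,
    p: probability of linking to a neighbor of the first selected node.
    Returns a dict mapping node -> set of neighbors.
    """
    if n <= m + 1:
        raise ValueError("n must exceed the seed size m + 1")
    rng = random.Random(seed)
    adjacency = {v: set() for v in range(n)}

    def connect(a, b):
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Assumed seed network: a clique on the first m + 1 nodes.
    for a in range(m + 1):
        for b in range(a + 1, m + 1):
            connect(a, b)

    for new in range(m + 1, n):
        # Step 1: link to a uniformly chosen existing node.
        anchor = rng.randrange(new)
        connect(new, anchor)
        # Step 2: m - 1 further links, preferring neighbors of the anchor.
        for _ in range(m - 1):
            free_neighbors = sorted(adjacency[anchor] - adjacency[new] - {new})
            if free_neighbors and rng.random() < p:
                target = rng.choice(free_neighbors)
            else:
                target = rng.randrange(new)
                while target in adjacency[new]:
                    target = rng.randrange(new)
            connect(new, target)
    return adjacency


tc_network = triadic_closure_network(n=200, m=3, p=0.7, seed=1)
tc_edge_count = sum(len(nbrs) for nbrs in tc_network.values()) // 2
```

Because every new node attaches to existing nodes, the generated network is connected, and each new node adds exactly m edges on top of the seed clique.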
The values of some parameters point to differences from the results of the analysis for the three small networks in “Egozones” section. These differences are most significant in the ER model, as can be expected (see Table 6).
Table 7 shows that the low maximum values of OwDep and TwDep are a common feature of all the networks that were generated, not only when compared with the three small networks from “Egozones” section but also compared to the realworld networks that will be analyzed in “Zones in realworld networks” section.
Table 7 also shows that ER and BA models generate networks that have low average values of OwDep, TwDep, and OwIndep. On the contrary, the average value of TwIndep is higher than in TC networks. These characteristics can be interpreted as the cause of the fact that in these generated networks there is a very high proportion of small zones, such as trivial zones (only with ego node), dyads, and triads (zones with two or three nodes), as shown in Table 8. Therefore, the average size of the zone is small, as is the average number of nodes in the zones. Networks also have very few (or no) multiego zones. This is because network nodes predominantly have neighbors with which they are not twoway dependent (the twoway dependency of a pair of egos is a condition for both being in the same zone). We will informally refer to networks that predominantly have pairs of twoway independent neighboring nodes as weakly dependent. For this characteristic, we define NetDep – the total network dependency.
Definition 8
Network dependency. Let m^{∗} be the number of edges connecting mutually (two-way) independent nodes, and let m be the number of all the edges of the network. Then the network dependency NetDep is defined as follows:
\[\mathit {NetDep}=1-\frac {m^{*}}{m}\]
The NetDep value is from the interval [0,1], and the lower it is, the higher the proportion of twoway independent neighbors, while the overall dependency of the network becomes weaker. NetDep is low for connected networks with a high proportion of small zones. On the other hand, low NetDep value does not automatically imply a higher fraction of trivial zones. Even though individual nodes may not be dependent on most of their neighbors, they may have neighbors on which they are dependent or vice versa. In this case, as shown in Observation 2, the low NetDep value influences, in particular, the quality (embeddedness) of the zones. Our experiments show that ER and BA networks are weakly dependent because the NetDep value is very low (see Table 7). The same does not apply to TC networks.
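A minimal sketch of the NetDep computation follows, assuming that NetDep is the fraction of edges that do not connect two-way independent nodes (one minus the share of such edges), which matches the behavior described above. The dependency relation itself is defined earlier in the article and is passed in here as a hypothetical predicate `is_dependent(u, v)`, meaning that u is dependent on v; the function name and input format are illustrative.

```python
def network_dependency(edges, is_dependent):
    """NetDep: share of edges whose endpoints are not two-way independent.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    is_dependent: callable (u, v) -> bool, True if u is dependent on v
                  (hypothetical helper standing in for the dependency test).
    """
    edge_list = list(edges)
    if not edge_list:
        raise ValueError("the network has no edges")
    # m*: edges where neither endpoint depends on the other.
    independent_edges = sum(
        1 for u, v in edge_list
        if not is_dependent(u, v) and not is_dependent(v, u)
    )
    return 1 - independent_edges / len(edge_list)


# Toy dependency relation given as an explicit set of ordered pairs.
toy_dependencies = {(1, 2), (2, 1), (3, 2)}
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
toy_netdep = network_dependency(
    toy_edges, lambda u, v: (u, v) in toy_dependencies
)
```

In the toy example, two of the four edges connect two-way independent nodes, so NetDep equals 0.5.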
Table 10 summarizes the results of the detection of zone overlaps and the detection of zones within the overlaps in the networks that are generated. The ER networks have a low maximum zone size and node memberships in zones (see Table 9); as can be seen in Table 10, no overlaps exist in these networks. Conversely, there are large zones in the BA networks, and it can be noticed that despite the relatively large maximum size of the zone overlap (75 or 73 nodes), there are no large zones in the overlaps (the maximum zone size in the overlap is 5 or 7 nodes respectively). The networks generated by the TC model differ from ER and BA networks. There are relatively small overlaps (the maximum overlap size is 24 nodes), but they may contain zones of comparable size (up to 11 nodes). For most of the properties, TC networks resemble the three small networks analyzed above and, as will be seen later, largescale realworld networks.
The small maximum sizes of overlaps in TC networks are related to the low OwDep and TwDep values (see Table 7), which represent the dependency of one node on multiple nodes. The first property increases the chance of there being more nodes in liaison or coliaison roles in the surroundings of the node. These are the primary cause for the overlapping of multiple zones. The TwDep property then ensures that there are pairs of nodes that are part of the same inner or outerzone.
Random networks generated by the ER model are not scale-free and do not have a community structure; BA networks are scale-free but are known to have no community structure (they have low modularity). The question, therefore, is what properties networks must have in order not to contain a high proportion of small zones. A natural expectation may be that, for nontrivial zones to exist, it should suffice for the network to have a community structure. The well-known football network, which has a community structure (its modularity is 0.604; see Fig. 8), shows that this is not sufficient.
The results of the analysis of the football network are summarized in the last row of the tables with generated networks. It can be seen that in the network with 115 nodes, there are only seven nontrivial zones with a maximum size of three. It can be concluded that apart from community structure, more varied occurrences of differentlysized zones are determined by other properties. From the point of view of NetDep and four parameters based on different types of dependencies between pairs of nodes (see Table 7), the football network is closest to the ER networks, less to the BA networks, and the least to the TC networks. The low NetDep value and the low averages of OwDep, OwIndep and TwDep, and, conversely, the high average of TwIndep for the football, ER and BA networks, affect together the small or zero percentage of stronglyprominent and weaklyprominent nodes, and therefore, the low share of the centers in these networks (see Table 6). But as can be seen, the football network does not contain any centers. Consequently, we can assume that the size of the zones (as well as the structure associated with their overlaps) is influenced by two factors.
Observation 1
Zones are not communities: Zones cannot, in general, be considered communities. Large zones exist in networks that have a community structure and, moreover, centers representing both global and local authorities, i.e., strongly and weaklyprominent nodes with a higher degree.
To further assess the properties of the networks concerning the parameters associated with the detected zones, Figs. 9 and 10 show the distributions (CCDF – complementary cumulative distribution function) and plots of selected properties of four of the generated networks described above.
The upper row in Fig. 9 shows the degree distributions and distributions of four zonerelated properties (zone size, liaisonship, coliaisonship and membership). For the ER networks analyzed here, there are no nodes in the coliaison role, for the BA networks only a small number of them and only exceptionally in more than two zones. This indicates that in these two networks, there are virtually no dependencies of weaklyprominent nodes on both types of prominent nodes. But as we show in “Zones in realworld networks” section, the dependency between stronglyprominent and weaklyprominent nodes is a common feature in realworld networks (see Observation 3). An example may be the dependencies of node 4 on nodes 2 and 3, node 33 on node 34 and node 2 on node 1 in the Karate Club network (see Fig. 2).
The lower row shows the distributions of properties related to types of node dependencies on their neighbors. As shown in Table 7, for the BA and TC networks there is a nontrivial number of nodes that have a larger number of neighbors with a higher maximum and average value of the OwIndep property than for the ER networks. This can be interpreted as the existence of network centers; in the BA network, they are nodes with up to hundreds of neighbors, while in the TC networks there are dozens of neighbors. Besides, the TC networks show that the distribution of OwIndep and OwDep properties is almost identical for networks with a stronger community structure (higher modularity). This means that oneway dependencies increase at the expense of independency, which can also be confirmed by the NetDep value in Table 7, which, for TC _{0.7 3}, is equal to 0.358 and, for TC _{0.97 3}, is equal to 0.615 (for BA _{3} it is 0.087).
The upper row in Fig. 10 displays plots showing the relationship between zone size and zone quality measured by the average zone embeddedness. The lower row contains cumulative distributions (CDF) expressing the frequency of occurrences of nodes with a given prominency value. We further show that higher occurrence and diversity in prominency values are a significant characteristic of networks resulting from human interaction (see Observation 5).
Zones in realworld networks
There are three key findings related to the analysis of the three small and ten generated networks. The first is the effect of the weakly dependent network (low NetDep value) on zone quality measured by embeddedness and on the ratio of trivial and generally small zones. The second finding is the existence of zone overlaps. For the small and TC networks, there are larger zones inside overlaps. The third is the finding that zones cannot be considered communities. The prerequisite for the existence of the zones is not only the community structure but also the more complex dependencies in the network. We continue to study these findings when analyzing realworld networks, mainly focusing on differences between them. Besides, we focus more on overlaps and their sizes and densities compared to zone densities, and moreover, on the relationship between zones and nonoverlapping or groundtruth communities, respectively.
For the experiments, we used a total of 16 known networks serving different analytical purposes. They are three collaboration networks (astroph, condmat, condmat2005), five communication (Brightkite, EmailEnron) and social (artist, facebook, new_sites) networks, two technological networks (as22july06, power), four biological networks (ChChMiner, PPDecagon, PPPathways, Yeast) and two networks constructed from groundtruth communities (comamazon, comdblp). For details of the individual datasets, see Appendix B. It is natural that the networks that are analyzed differ in their structure, which is also affected by the way the networks were constructed. The results of the analysis of realworld networks are summarized in Tables 11, 12, 13, 14, and 15.
Table 11 shows the percentage of nodes with a nonzero prominency, i.e., stronglyprominent and weaklyprominent nodes. In these values, the collaboration networks and networks with groundtruth communities differ from others. For each network of these two types, there are higher proportions of weaklyprominent nodes (9 percent or more), which is not the case for other networks. For networks with groundtruth communities, the share of prominent nodes is slightly lower, which is also reflected in the lower average value of coliaisonship (see Table 14). All of these networks represent the result of human activities. In collaboration networks, it is the direct activity of the authors associated with the publication of research articles. For both networks with groundtruth communities, specific activities are involved. In the comamazon dataset, the result of this activity is a network of products that people buy together (copurchasing). At comdblp, the network is constructed on the basis of coauthorship activity related to the participation of people in the same conference, or publishing articles in the same journal. All of these networks have a distinct community structure represented by high modularity and a high clustering coefficient (except comamazon). This is also related to the higher average values of zone embeddedness and NetDep values (see Tables 12 and 13); the lower clustering coefficient of the comamazon network is projected to a smaller average zone size (see Table 13). Here, let us note that this is due to the low average number of nodes in the outerzone (1.995). This, in turn, means that most of the zones are connected to their surroundings more weakly than in other networks; this is confirmed by the very high modularity of this network (0.926).
While social, communication, technological and biological networks share a higher proportion of stronglyprominent nodes, they have a very low fraction of weaklyprominent nodes. The only exception is the facebook network. This network, uniquely among the social networks, has a very low fraction of stronglyprominent nodes and a larger fraction of weaklyprominent nodes. This network is an exception in the whole group of realworld networks, which is probably due to the specific construction of the network by merging more egonetworks. It is possible that the egonetworks were chosen in such a way that almost every ego was dependent on some other ego. As a result, there are only a few stronglyprominent nodes in the network. Note also that the facebook network has high modularity but a very low NetDep value and a very low average of zone embeddedness.
What is noteworthy is the relationship between low NetDep values and zone embeddedness. Tables 12 and 13 show that biological, social, and communication networks have both a low NetDep value and a low average value of zone embeddedness. Table 11 also shows that all of these networks, except facebook, have a low clustering coefficient.
Observation 2
Relationship of NetDep and embeddedness. If the network has a low NetDep value, then it also has a low average of zone embeddedness.
This observation can naturally be interpreted as follows: if the network does not have a high degree of dependency between pairs of nodes, the frequent occurrence of zones that are strongly interconnected inwardly and weakly outwardly cannot be expected. However, the converse does not hold; e.g., the technological network as22july06 has a very low value of zone embeddedness and yet a high NetDep value.
A higher NetDep value is a characteristic feature of technological and collaboration networks, and networks with groundtruth communities. Both technological networks have a low clustering coefficient; the other mentioned networks have a higher clustering coefficient and zone embeddedness. For technological networks, Table 12 also shows that they have (in addition to a lower average degree 4.218 or 2.669, respectively) a lower average TwIndep compared to other networks (1.383 or 0.906, respectively). The two technological networks that were analyzed, therefore, have the highest dependency (the highest NetDep value).
Figures 11 and 12 display the distributions and plots of properties of four selected realworld networks described above (as22july06, comdblp, EmailEnron, PPDecagon). This selection provides both common and distinct characteristics for different types of networks from the perspective of dependencies and detected zones. The distributions and plots of the remaining twelve realworld networks that were analyzed are shown in Appendix C in Figs. 21, 22, 23 and 24.
The upper row of distributions in Fig. 11 shows five properties related to nodes and zones. The four networks that were analyzed (and also the distributions for the other networks in Appendix C) show that from a relatively low value the distributions of three properties (zonesize, membership and liaisonship) roughly copy the shape of node degree distribution. Conversely, there are two characteristics in which the networks differ as described in Observations 3 and 4.
Figure 11 shows that the technological and biological networks (as22july06, PPDecagon and other networks of these types in Appendix C) have a smaller fraction of nodes in the coliaison role than is the case in collaboration networks (including comdblp) and EmailEnron communication network (and also other networks with social interaction in Fig. 20 in Appendix C).
Observation 3
Dependencies in networks with social interaction. In realworld networks, dependencies exist between the liaison nodes, i.e., the mutual dependencies between pairs of coliaisons or the dependency of coliaisons on liaisons. In technological and biological networks, however, there is a low proportion of coliaison nodes; therefore, dependencies between them (e.g., in outerzones) do not exist so often as in networks resulting from human interaction (e.g., collaboration and communication networks, see Fig. 3).
When looking at the lower row of distributions of dependency properties, obviously, the low number of coliaisons is related to a small fraction of nodes, which have at least one twoway dependent neighbor (TwDep property is shown in blue in the lower row of plots).
From the distributions in Fig. 11 and in Figs. 21 and 22 in Appendix C, it can be seen that nodes with two-way independent neighbors prevail in biological networks. This is also reflected in a greater distance of the degree distribution from the zonesize, membership and liaisonship distributions, especially for those biological networks that also have lower modularity (see Table 11 and the degree distributions shown in red in the plots).
Observation 4
Independent neighbors in biological networks. Nodes in biological networks have a higher proportion of neighbors with which they are mutually independent in comparison with technological, collaboration and communication networks.
However, a similar characteristic to that found in a biological network is found in the social networks artist, new_sites and facebook (see Figs. 21 and 22 in Appendix C). The distributions also show that the EmailEnron network and Brightkite communications network have characteristics at the boundary between collaboration and biological networks and the comamazon network at the boundary between technological and collaboration networks.
Differences in the individual types of networks are also shown in plots in Fig. 12. The upper row of plots represents zone quality, measured by the average embeddedness in relation to zone size. When the zone size increases, the average embeddedness is more varied, and for most networks, it is slightly higher. For the comdblp and collaboration networks, however, it is evident that the quality of the large zones is very heterogeneous. This is probably due to differing relationships in large collaborating teams.
The lower row of plots shows cumulative distributions (CDF) expressing the frequency of occurrence of nodes with a given prominency value. Here, the collaboration, communication, and social networks are distinctly different from the technological and biological networks. Plots with the embeddedness and prominency values of the remaining twelve real-world networks are shown in Figs. 23 and 24 in Appendix C.
Observation 5
Prominency in networks with social interaction. The networks resulting from human interaction have a more varied occurrence of prominency values. This suggests that in these networks, there are more complex nonsymmetric dependencies between nodes than in the case of technological and biological networks.
This observation is due to the fact that technological and biological networks, unlike collaboration, communication and social networks, have far fewer weaklyprominent nodes with prominency between 0 and 1, i.e., nodes with the potential to act in the zones as a coliaison. Additionally, the occurrence of different prominency values is most varied in collaboration networks; this implies high flexibility of connections within zones and between zones.
Zones in LFR networks
The last two rows in Tables 11, 12, 13, 14, and 15 with the properties of real-world networks show the results of the analysis of two generated LFR benchmark networks (Lancichinetti and Fortunato 2009). The generator of the LFR networks provides settings ensuring the existence of overlapping communities. Both networks that we analyzed were generated with 10000 nodes. The first network, LFR _{20 500 2000}, was generated to have an average degree of 20, a maximum community size of 500, and a total of 2000 nodes in the overlaps; a total of 186 communities with a minimum size of 7 and an average size of 98 was generated in this network. In the second network, LFR _{7 60 4000}, with an average degree of 7, a maximum community size of 60, and a total of 4000 nodes in the overlaps, a total of 1000 communities with a minimum size of 3 and an average size of 19 was generated. The tables show that the LFR networks, unlike the other generated networks from “Zones in generated networks” section, do not significantly differ from real-world networks in any of the properties that were investigated. The distributions and plots in Fig. 13, which describe other properties of the LFR networks, confirm this. However, two interesting results can be seen.
The first is the maximum size of the zones found in the LFR networks. For the first network (see Table 13), the largest zone is considerably smaller than the largest community (88 vs. 500), while in the second network it is the opposite (72 vs. 60). The second result is a low average value of coliaisonship (0.066 and 0.095, respectively), which is particularly characteristic of some biological, technological, and social networks (see Table 14). Overall, though, the LFR networks that were generated are closest to the biological networks, as shown in the comparison of the LFR network properties in Fig. 13 with the properties of realworld networks in Figs. 21, 22, 23 and 24 in Appendix C. Unlike the biological networks, however, the LFR networks have a higher NetDep and embeddedness value. From this more detailed view, the LFR networks differ from the other realworld networks that were analyzed.
Zone overlaps and their density
Recent research on the properties of node groups in networks with groundtruth communities has shown that (1) groups of nodes may overlap, (2) other overlapping groups may exist in overlaps, and (3) overlaps may be denser than overlapping groups. We have dealt with the first two properties in previous experiments with small and generated networks. Table 15 summarizes the results of detecting pairs of overlapping zones with at least ten nodes and zone detection within overlaps with at least four nodes. As shown through the average size of the overlaps, most of them are small (except for biological networks). However, in all the networks, there is a nontrivial number of zones in larger overlaps, and there are also zones with a high number of nodes. This can be read from both the maximum and the average size of zones within overlaps in the last two columns of the table.
In the next experiment, we investigated the density of zones and zone overlaps of the same size. The goal was to verify that the overlaps are more densely connected than the zones. In the upper row of plots in Fig. 14, the average density of the zones in relation to their size can be seen. The plots in the lower row provide the same information for zone overlaps. The same plots for the remaining twelve real-world networks are shown in Figs. 25 and 26 in Appendix C. When comparing the average densities for zones and overlaps with the same number of nodes, it can be seen that, especially for a smaller number (dozens) of nodes, the average density of the overlaps is, for most networks, higher than the density of zones of the same size. For large overlaps, the situation is similar, although less clear. The exceptions are, in particular, the collaboration networks (astroph, condmat2005, condmat) and the EmailEnron communication network, where the average density of small zones is higher than that of overlaps of the same size. This is because, in these networks, small zones are often formed by cliques.
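The density comparison can be illustrated with a small helper that computes the density of the subgraph induced by a node set, which can be applied both to a zone and to an overlap of two zones. The sketch assumes the usual definition of density (internal edges divided by the number of possible node pairs) and an unweighted, undirected network without self-loops stored as a dictionary of neighbor sets; the function name is illustrative.

```python
def induced_density(adjacency, nodes):
    """Density of the subgraph induced by a node set.

    adjacency: dict mapping node -> set of neighbors (undirected, no self-loops).
    nodes: iterable of nodes (a zone or a zone overlap).
    """
    members = set(nodes)
    n = len(members)
    if n < 2:
        return 0.0  # density is undefined for fewer than two nodes; use 0
    # Every internal edge is seen from both of its endpoints.
    internal_edges = sum(len(adjacency[u] & members) for u in members) / 2
    return internal_edges / (n * (n - 1) / 2)


# Two triangles joined by the edge 2-3.
density_adjacency = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
density_zone = induced_density(density_adjacency, {0, 1, 2, 3})
density_overlap = induced_density(density_adjacency, {0, 1, 2})
```

In this toy case, the four-node set has four of six possible edges, while the three-node subset is a clique with density 1.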
Zones in community structure
As mentioned in “Zones in generated networks” section, zones cannot be considered communities. That is why we prepared an experiment in which we applied the Louvain algorithm to detect nonoverlapping communities across all of the real-world networks. We then found the best matching zone for every detected community. To compare the community and the zone, we utilized the Matthews Correlation Coefficient (MCC) based on a confusion matrix in which true positives are the number of nodes in the intersection of community and zone ZC, false negatives are the number of zone nodes outside the community Z, false positives are the number of community nodes outside the zone C, and true negatives are the remaining number of nodes outside the zone and the community O (see Equation 7). MCC returns a value between −1 and +1; a coefficient of +1 represents a perfect fit and −1 indicates total disagreement between community and zone (in our case of overlapping zones and communities, MCC>0). The results are summarized in Table 16.
\[\mathit {MCC}=\frac {ZC\cdot O-C\cdot Z}{\sqrt {(ZC+C)(ZC+Z)(O+C)(O+Z)}}\qquad (7)\]
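The comparison of a zone with a community can be sketched as follows, using the standard form of the Matthews Correlation Coefficient with the four counts named above (ZC, Z, C, O). The function name, the set-based inputs, and the convention of returning 0 when the denominator vanishes are assumptions of this example.

```python
import math


def zone_community_mcc(zone, community, n_nodes):
    """Matthews Correlation Coefficient between a zone and a community.

    zone, community: iterables of nodes; n_nodes: number of network nodes.
    """
    zone_set, community_set = set(zone), set(community)
    zc = len(zone_set & community_set)   # true positives
    z = len(zone_set - community_set)    # false negatives (zone only)
    c = len(community_set - zone_set)    # false positives (community only)
    o = n_nodes - zc - z - c             # true negatives (in neither group)
    denominator = math.sqrt((zc + c) * (zc + z) * (o + c) * (o + z))
    if denominator == 0:
        return 0.0  # common convention when a marginal count is zero
    return (zc * o - c * z) / denominator


mcc_identical = zone_community_mcc(range(5), range(5), n_nodes=20)
mcc_partial = zone_community_mcc({0, 1, 2, 3}, {2, 3, 4, 5}, n_nodes=10)
```

An identical zone and community give MCC = 1, while the partially overlapping pair in the example gives 4/24 = 1/6.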
For all the networks that were analyzed (except facebook), the maximum and average community sizes are higher than those for zones. For some networks, even the maximum community size is at tens of thousands of nodes, compared to zones with a maximum of dozens to hundreds of nodes. For facebook, this is the opposite, which is probably due to the construction of this network by merging individual egonetworks.
It can also be seen that the Louvain algorithm provides a relatively small number of predominantly large communities; this is a consequence of the resolution limit issue of modularity. To overcome this problem, it is possible to use the ECG approach (Ensemble Clustering for Graphs, Poulin and Théberge (2018)). Moreover, when many small communities exist in the network, there are approaches working better than the Louvain algorithm (e.g., InfoMap algorithm, Rosvall and Bergstrom (2008)). In our experiments, however, the Louvain algorithm is adequate to assess the match between such detected communities and large zones; we will focus on more communities, including the small ones, in “Zones and groundtruth communities” section.
The smaller size of zones in comparison with communities raises the question of whether communities are not formed as a union of several zones. However, the question is not an easy one; to answer it, it would be necessary to find egos whose zones could form the community (e.g., similarly to multiegocentered communities as described in Danisch et al. (2013)).
An interesting result is that at least one zone that exactly matched one of the communities (MCC=1) was found for almost all networks. The average match values range from 0.339 (power network) to over 0.8 (e.g., EmailEnron network). Compared to the other networks, there is very high agreement between communities and zones for three communication networks, including the Brightkite network (0.856) and the anobii network (0.879), which will be described below in “Zones and groundtruth communities” section. Similarly, the average match is 0.727 for the facebook network and 0.716 for the comdblp network. In these networks, therefore, the zones correspond very well to the communities. It can thus be assumed that in these cases, communities are formed around the egos of the corresponding zones.
In the last two columns of Table 16, the average embeddedness for the detected communities and the corresponding zones is shown. In general, the average quality of communities is higher than the quality of the zones. In most cases, however, the difference in the average embeddedness is not large. The lower quality of the zones can be attributed to a different view of a group of nodes. While communities favor internal density and weak interconnections with neighboring communities, the zones are based on nonsymmetric dependencies between nodes. These dependencies give a more comprehensive view, in which they play a role both inside the group (innerzone) and in the dependency outward (outerzone).
Zones and groundtruth communities
Our approach to detecting groups (zones) presented in this article is based on the assumption that groups existing for some purpose, which is represented by an ego (a central node), can be extracted from the structure of a (weighted) network. In our experiments in the “Zones in realworld networks” section, we worked with two networks, comamazon and comdblp; the authors of the article (Yang and Leskovec 2015) identified groundtruth communities in these networks, otherwise referred to as real functional groups, and provided lists of the 5000 groundtruth communities with the highest quality. In the experiment below, we have added two more networks with identified groundtruth communities. The first one is the comyoutube social network, with 1134890 nodes, 2987624 edges, and 5000 communities. The second one is a directed communication network, anobii, which we have transformed into an undirected one so that an edge between two nodes exists only if there are directed edges between them in both directions. After this transformation and the removal of outliers (isolated nodes), the network has 158330 nodes, 785939 edges, and 4797 communities. For details see Appendix B.
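The transformation of the directed anobii network described above can be illustrated with a minimal sketch. The function name `mutual_undirected` and the toy arc list are assumptions made for illustration only; they are not taken from the authors' implementation.

```python
def mutual_undirected(directed_edges):
    """Keep an undirected edge {u, v} only if both u->v and v->u exist.

    Nodes that end up without any edge (isolated nodes) are dropped.
    """
    # Ignore self-loops and duplicate arcs.
    arcs = {(u, v) for u, v in directed_edges if u != v}
    # An undirected edge survives only when the reverse arc is present too.
    edges = {frozenset((u, v)) for (u, v) in arcs if (v, u) in arcs}
    # Only endpoints of surviving edges remain in the network.
    nodes = set().union(*edges) if edges else set()
    return nodes, edges


# Toy example (hypothetical data): a<->b and c<->d are mutual,
# b->c and e->a are one-directional and therefore discarded.
arcs = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("d", "c"), ("e", "a")]
nodes, edges = mutual_undirected(arcs)
```

In this toy example, node `e` becomes isolated after the transformation and is removed together with the one-directional arcs.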
We prepared a similar experiment for these networks as with the nonoverlapping communities above. The aim was to find out exactly how the groundtruth communities correspond to the zones and whether we are able to use zones to describe at least part of these communities existing for some real purpose and thus performing some function. The experiment is performed in two steps for each network. First, for every groundtruth community, the zone that best matches this community is found (measured by MCC, see Equation 7). Note that after this step, multiple communities can match the same zone. In the second step, the bestmatching community is then selected for each zone from the first step. Theoretically, after the first step, different zones may correspond to different communities, but it may also be that a single zone corresponds to more communities. If the reality is close to the first case, then the communities are unique from the perspective of zones, and vice versa, in the latter case, the communities are very similar (redundant in terms of the agreement of more communities with the same zone). We also conducted this experiment with both the LFR networks, for which we know what overlapping communities were generated. The community properties of these networks, together with the results, are presented in Table 17.
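The two-step matching can be sketched as follows. The sketch assumes that the MCC of Equation 7 is the standard Matthews correlation coefficient computed over membership of all n network nodes, and that the second step chooses among the communities that selected the same zone in the first step; both are our reading of the description, and the names `mcc` and `two_step_match` are hypothetical.

```python
from math import sqrt


def mcc(community, zone, n):
    """Standard Matthews correlation coefficient of two node sets in an n-node network."""
    tp = len(community & zone)   # nodes in both
    fp = len(zone - community)   # in the zone only
    fn = len(community - zone)   # in the community only
    tn = n - tp - fp - fn        # in neither
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def two_step_match(communities, zones, n):
    # Step 1: for every community, find the best-matching zone.
    best_zone = {
        c_id: max(zones, key=lambda z_id: mcc(c, zones[z_id], n))
        for c_id, c in communities.items()
    }
    # Step 2: for every zone selected in step 1, keep the best-matching community.
    best_community = {}
    for c_id, z_id in best_zone.items():
        score = mcc(communities[c_id], zones[z_id], n)
        if z_id not in best_community or score > best_community[z_id][1]:
            best_community[z_id] = (c_id, score)
    return best_zone, best_community


# Toy data (hypothetical): c1 and c2 both select zone z1, so c2 is "redundant".
communities = {"c1": {0, 1, 2, 3}, "c2": {0, 1, 2}, "c3": {6, 7, 8}}
zones = {"z1": {0, 1, 2, 3}, "z2": {6, 7, 8, 9}}
best_zone, best_community = two_step_match(communities, zones, n=10)
redundant = len(communities) - len(best_community)
```

The share of communities that are not the best match of any zone (`redundant / len(communities)`) corresponds to the notion of redundancy used in the discussion of the results.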
It can be seen that the comamazon network contains redundant communities from the perspective of zone detection; more than 71% of the communities have no zones that they would match better than other communities. This is probably due to the fact that the network is created from products hierarchically organized into categories and copurchased together; the categories of products in these hierarchies, which are considered communities, may not be very different in many cases. For the other networks, there is a unique zone for almost every community. The two realworld networks (comamazon and comdblp) show a very high match (MCC>0.9) for more than half of the detected unique zones corresponding to groundtruth communities. It is worth noting that for the comdblp network there is a perfect match (MCC=1) for almost 40% of 4850 out of all the 5000 identified groundtruth communities. Figure 15 shows the relationships between the size of the zone and the average match accuracy (MCC) with the corresponding community for all four realworld networks with groundtruth communities, as well as the frequency of occurrence of zones for the given match accuracy. The corresponding results for the LFR networks are shown in Fig. 16.
In the experiments with communities, we showed two key results. The first one is related to Observation 1. Even though zones cannot be considered communities, in some networks (especially communication networks, see Table 16), most zones correspond very accurately to nonoverlapping communities. For example, in the anobii network, the zones describe the nonoverlapping communities detected by the Louvain algorithm well; however, they do not represent the groundtruth communities in this network well. The second result shows that the zones have high potential for describing groundtruth communities in some networks. In the comamazon and comdblp networks, zones that correspond very closely to a considerable number of identified groundtruth communities were detected. Here it is necessary to remember that each zone is the result of the exact analysis of the surroundings of a selected ego. Therefore, we may conclude that especially the smaller groundtruth communities in these networks are grouped around egos; thus, these egos determine the resulting communities very precisely.
On the other hand, especially for large groundtruth communities, there are often no zones that match with sufficient accuracy. In this case, communities could be, e.g., formed as a union of several zones generated by different egos. The analysis of this case was not, however, the subject of our research.
The results show that our approach cannot be seen as universal from the perspective of groundtruth communities. Note, however, that although zone detection is intended for weighted networks, it was applied to unweighted networks in our experiments with realworld networks. The question for further research is what the results of the comparison of zones and groundtruth communities would be in the case of weighted networks.
Summary
One of the interesting results of our experiments with realworld networks is the characterization of these networks based on the dependency and properties of the detected zones. Table 18 recapitulates selected outputs from the experiments.
Key characteristics of our approach can be summarized into seven points that characterize the description and detection of zones.

1. The relationships between pairs of nodes in zones differ and are not symmetric (nonsymmetric similarity).
2. The zone has a clear outer boundary, beyond which nodes are not considered to be members of this zone.
3. The zone contains nodes in different roles. The role of the node is mainly related to its linking in and out (i.e., beyond the boundary) of this zone.
4. Zones may have different sizes. They can be both large (dozens or hundreds of nodes and more) and small (triads, dyads and trivial with a single node).
5. Zones may overlap (one node may be in multiple zones), and overlaps may be large (i.e., overlap size is close to the size of overlapping zones).
6. In most cases, overlaps of zones have a higher density than the zones of the same size.
7. Overlaps of larger zones with nontrivial structures (e.g., other than small cliques) contain in most cases other zones that can also overlap.
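Points 5 and 6 refer to the size and density of zone overlaps. A minimal sketch of how these two quantities could be computed for a pair of zones is given below. Measuring the overlap relative to the smaller zone is an assumption made for this illustration, and the function names and toy network are hypothetical.

```python
def density(nodes, adj):
    """Density of the subgraph induced by `nodes` in an undirected network.

    `adj` maps every node to the set of its neighbors.
    """
    k = len(nodes)
    if k < 2:
        return 0.0
    # Every internal edge is seen from both endpoints, hence the division by 2.
    internal_edges = sum(len(adj[u] & nodes) for u in nodes) // 2
    return 2 * internal_edges / (k * (k - 1))


def relative_overlap(zone_a, zone_b):
    """Overlap size relative to the smaller of the two zones (assumed measure)."""
    return len(zone_a & zone_b) / min(len(zone_a), len(zone_b))


# Toy network (hypothetical): edges 0-1, 1-2, 1-3, 2-3, 3-4, 4-5.
adj = {
    0: {1},
    1: {0, 2, 3},
    2: {1, 3},
    3: {1, 2, 4},
    4: {3, 5},
    5: {4},
}
zone_a = {0, 1, 2, 3}
zone_b = {1, 2, 3, 4, 5}
overlap = zone_a & zone_b

overlap_density = density(overlap, adj)
zone_a_density = density(zone_a, adj)
zone_b_density = density(zone_b, adj)
overlap_share = relative_overlap(zone_a, zone_b)
```

In this toy case the overlap {1, 2, 3} is a triangle and is denser than either zone. Note that point 6 compares overlaps with zones of the same size as the overlap, which would require collecting densities over all detected zones rather than over a single pair.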
Conclusions
As far as we know, we are the first to focus on analyzing the structures of both weighted and unweighted undirected networks through the nonsymmetric similarity between the nodes, which we call dependency. This dependency allows us to clearly describe groups of nodes in the network structure which are organized around one of the group nodes – the ego. To distinguish such groups from communities, we call them egozones and examine them both locally and globally.
The local view extends the possibilities of traditional methods for analyzing egonetworks or egocommunities, respectively. Our approach contributes to this, in particular, by preferring weighted networks, exploring the wider surroundings of the ego, and working with nodes in newly defined roles in the zone. Especially in networks with social interaction, thanks to the zones detected for each ego, the analyst gets comprehensive information about the internal nonsymmetric dependencies in the zones and thus about the influence of each ego on its surroundings and, vice versa, the influence of its surroundings on it.
From a global perspective, our approach brings, in particular, a typology of network nodes that takes into account their importance on the basis of their structural independency – prominency. Prominent nodes are either entirely or, at least, partly independent of their neighbors and have a significant impact on the node dependencies in their surroundings. The way the zones and prominent nodes are defined using nonsymmetric dependencies contributes to understanding why large and dense overlaps of node groups emerge in networks. We consider this finding to be a significant contribution to the analysis of community network structures. Our experiments also show that, in terms of dependencies and overlapping zones, different types of realworld networks have different properties that distinguish them from each other and from generated networks.
The experimental results that we presented also raise questions for future research. Above all, it is a question of more indepth analysis of the relationships between zones exactly determined by their egos and overlapping or nonoverlapping communities in the traditional concept of many different detection methods. Furthermore, it also involves a detailed assessment of how zones and their egos correspond to the existence of groundtruth communities in various types of, in particular, weighted and directed networks.
Appendix A: Zone detection method
The method for detecting all zones in a network with n nodes and m edges consists of two steps. The first step is to calculate the dependency matrix. The second step is the detection of inner and outerzones for each node (ego). It is not easy to precisely define the time complexity for each step, because the calculations are dependent on the network representation, its complex structure, density, and dependencies between nodes. For this reason, we only describe some of the cases and the result of an experiment providing an estimation of complexity based on the analysis of realworld networks.
To compute the dependency matrix, two dependencies must be calculated for each pair of network nodes. This calculation is related to the detection of the common neighbors of two nodes. In general, a dense network is the worst case and the time complexity of finding the common neighbors of two nodes is O(n^{2}). Thus, the computation of the dependency matrix has, in the worstcase scenario, the complexity of O(mn^{2}). For sparse networks, finding common neighbors is related to the average degree of network d, and the time complexity, in this case, is O(md^{2}). Note, however, that we use the IsDependent relationship to detect zones which, thanks to the threshold 0.5, needs to take into account only a sufficient number of common neighbors in the numerator in Eq. 1 to calculate the dependency. The use of this property also affects the time complexity.
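The common-neighbor step underlying the dependency calculation can be sketched as follows. The sketch restricts the computation to adjacent pairs of nodes, an assumption suggested by the factor m in the complexities stated above; it does not reproduce Eq. 1 itself, and the function name is hypothetical.

```python
def common_neighbors_by_edge(adj):
    """Return the set of common neighbors for every edge of an undirected network.

    `adj` maps each node to the set of its neighbors. Node labels must be
    orderable so that each undirected edge is processed only once.
    """
    result = {}
    for x in adj:
        for y in adj[x]:
            if x < y:  # visit each undirected edge once
                # The same set serves both dependencies of the pair (x on y, y on x).
                result[(x, y)] = adj[x] & adj[y]
    return result


# Toy network (hypothetical): a triangle 1-2-3 with a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cn = common_neighbors_by_edge(adj)
```

For sparse networks, each intersection only touches the neighbor sets of the two endpoints, which is why the cost depends on the average degree rather than on n.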
Thanks to the IsDependent relationship, the resulting dependency matrix is a binary one and, therefore, it is the adjacency matrix representing the original network after its transformation into its unweighted directed variant (see Fig. 2a for Karate Club network).
Zone detection is described in Algorithm 1 and, as can be seen, consists of innerzone and outerzone detection, and then the zone is a union of both of them.
The time complexity of the innerzone detection again depends on the network parameters, especially on the density and the number of iterations necessary to be performed. The key is that in each iteration it is necessary to find and test all neighbors of the nodes added in the previous step into the innerzone. To simplify, it is possible to work with two cases to consider complexity. The first and the worst case is a complete network in which each pair of nodes is twoway dependent. The second case is the sequence of oneway dependent nodes.
In the case of a complete unweighted network, we need two iterations to find one innerzone. In the first iteration, because of mutual dependency with ego, all nodes of the network (except for the ego) are added into the innerzone. In the second iteration, for all the added nodes, it is necessary to test whether their neighbors are outside the innerzone. The time complexity of the detection is, therefore, O(n^{2}) and also O(m) because the network is dense.
In the case of a sequence of oneway dependent nodes, the number of iterations corresponds to the position of the node in the sequence. At most n iterations and at least one iteration are needed, each working with only one neighbor. In this case, the time complexity of the innerzone detection is O(n).
The time complexity of the outerzone detection is based on the previously detected innerzone for whose nodes it is necessary to find the nodes on which they are dependent and which are outside the innerzone (and thus form the outerzone). In the worst case, we can assume that the network is dense, and the innerzone contains half of the nodes of the network and the remaining nodes of the network form the outerzone. In this case, it is necessary for each of \(\frac {n}{2}\) nodes in innerzone to test \(\frac {n}{2}\) nodes outside the innerzone; therefore, the outerzone detection will have the time complexity O(n^{2}), i.e., not higher than the worst case for innerzone detection in dense network (O(m)).
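The inner-zone and outer-zone steps described above can be sketched on the binary dependency relation. This is a simplified reading of the textual description, not a reproduction of Algorithm 1: it assumes that the inner-zone grows by repeatedly adding nodes that are dependent on a node already in it, and that the outer-zone collects the nodes outside the inner-zone on which inner-zone nodes are dependent. The name `detect_zone` and the toy relation are hypothetical.

```python
from collections import deque


def detect_zone(ego, depends_on):
    """Return (inner_zone, outer_zone) of `ego`.

    depends_on[x] is the set of nodes y such that x is dependent on y
    (i.e., the directed unweighted network obtained from the dependency matrix).
    """
    # Reverse index: for each node, the nodes that are dependent on it.
    dependents = {}
    for x, targets in depends_on.items():
        for y in targets:
            dependents.setdefault(y, set()).add(x)

    # Inner-zone: start from the ego and iteratively add nodes that are
    # dependent on a node added in a previous step.
    inner = {ego}
    frontier = deque([ego])
    while frontier:
        node = frontier.popleft()
        for x in dependents.get(node, ()):
            if x not in inner:
                inner.add(x)
                frontier.append(x)

    # Outer-zone: nodes outside the inner-zone on which inner-zone nodes depend.
    outer = set()
    for x in inner:
        outer |= depends_on.get(x, set()) - inner
    return inner, outer


# Toy relation (hypothetical): c depends on b; b depends on a and d; a depends on e.
depends_on = {"a": {"e"}, "b": {"a", "d"}, "c": {"b"}, "d": set(), "e": set()}
zone_a = detect_zone("a", depends_on)
zone_c = detect_zone("c", depends_on)
zone_d = detect_zone("d", depends_on)
```

Each node and each dependency is processed a bounded number of times per ego, which is in line with the per-zone cost being bounded by the size of the network in the worst case.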
Assuming that we detect each zone separately, the time complexity of detecting zones for all network nodes is, in the worstcase scenario, O(nm) for dense networks. However, we must also consider cases where many nodes in the network can be, e.g., isolated after the transformation of the original network into its unweighted directed variant. In this case, the zone contains only the ego node, and the time complexity of the inner and outerzone detection is O(1). Therefore, we can expect that the time complexity will be lower for realworld networks.
Figure 17 shows the time needed for the computation of the dependency matrix and the time of detection of zones for each node in the network. Moreover, the green dotted line shows that the estimated time complexity O(m log n) corresponds well to the dependency matrix computation for the realworld networks that were analyzed. Similarly, the blue dotted line shows the estimated time complexity O(m) for zone detection in all of the networks (that is, \(O(\frac {m}{n}) = O(d)\) to detect one zone). Therefore, the time complexities of both the dependency matrix calculation and the zone detection can be assumed to be considerably lower for the analyzed realworld networks than the abovementioned worst cases.
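As a rough illustration of the per-zone estimate, consider the comyoutube network with the node and edge counts quoted in the main text (n = 1134890, m = 2987624). The relation d = 2m/n for the average degree of an undirected network is the standard one; the constant factor 2 is absorbed by the O-notation.

```latex
\[
d = \frac{2m}{n} = \frac{2 \cdot 2987624}{1134890} \approx 5.27,
\qquad
O\!\left(\frac{m}{n}\right) = O(d).
\]
```

Under this estimate, detecting a single zone in such a sparse network touches on the order of a handful of nodes on average, far fewer than in the dense worst case.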
Appendix B: Publicly archived datasets
Zachary’s karate club Zachary (1977) - network of friendships between the 34 members of a karate club at a US university. Available at http://wwwpersonal.umich.edu/~mejn/netdata/
Les Misérables Knuth (1993) - coappearance network of characters in the novel Les Misérables. Available at http://wwwpersonal.umich.edu/~mejn/netdata/
Net Science Newman (2006) - a coauthorship network of scientists working on network theory and experiment. Available at http://wwwpersonal.umich.edu/~mejn/netdata/
American College football Girvan and Newman (2002) - network of American football games. Available at http://wwwpersonal.umich.edu/~mejn/netdata/
as22july06 - a symmetrized snapshot of the structure of the Internet at the level of autonomous systems, reconstructed from BGP tables posted by the University of Oregon Route Views Project. Available at http://wwwpersonal.umich.edu/~mejn/netdata
comdblp Yang and Leskovec (2015) - a coauthorship network where two authors are connected if they publish at least one paper together. Publication venue, e.g., journal or conference, defines an individual groundtruth community. Available at https://snap.stanford.edu/data/comDBLP.html
EmailEnron Yang and Klimmt (2004); Leskovec et al. (2009) - communication network that covers all the email communication within a dataset of around half a million emails. Available at https://snap.stanford.edu/data/emailEnron.html
PPDecagon_ppi Zitnik et al. (2018) - a proteinprotein association network that includes direct (physical) proteinprotein interactions, as well as indirect (functional) associations between human proteins. Available at https://snap.stanford.edu/biodata/datasets/10008/10008PPDecagon.html
artist Rozemberczki et al. (2018) - mutual like networks among verified Facebook pages – the types of sites included TV shows, politicians, athletes and artists among others. Available at https://snap.stanford.edu/data/gemsecFacebook.html
astroph Newman (2001) - weighted network of coauthorships between scientists posting preprints on the Astrophysics EPrint Archive. Available at http://wwwpersonal.umich.edu/~mejn/netdata/
Brightkite Cho et al. (2011) - undirected friendship network of Brightkite users (Brightkite was a locationbased social networking website). Available at https://snap.stanford.edu/data/locBrightkite.html
comamazon Yang and Leskovec (2015) - based on the “Customers Who Bought This Item Also Bought” feature of the Amazon website. If a product i is frequently copurchased with product j, the graph contains an undirected edge from i to j. Each product category provided by Amazon defines a groundtruth community. Available at https://snap.stanford.edu/data/comAmazon.html
condmat Newman (2001) - network of coauthorships between scientists posting preprints on the Condensed Matter EPrint Archive. Available at http://wwwpersonal.umich.edu/~mejn/netdata/
condmat2005 Newman (2001) - updated network of coauthorships between scientists posting preprints on the Condensed Matter EPrint Archive. Available at http://wwwpersonal.umich.edu/~mejn/netdata/
facebook_combined Leskovec et al. (2009) - dataset consisting of ’circles’ (or ’friends lists’) from Facebook. Available at https://snap.stanford.edu/data/egonetsFacebook.html
ChChMiner_drugbankchemchem Wishart et al. (2017) - network of interactions between drugs. Available at https://snap.stanford.edu/biodata/datasets/10001/10001ChChMiner.html
new_sites Rozemberczki et al. (2018) - datasets represent blue verified Facebook page networks of different categories. Nodes represent the pages and edges are mutual likes among them. Available at https://snap.stanford.edu/data/gemsecFacebook.html
Power grid Watts and Strogatz (1998) - unweighted network representing the topology of the Western States Power Grid of the United States. Available at http://wwwpersonal.umich.edu/~mejn/netdata/
PPPathways_ppi Agrawal et al. (2018) - proteinprotein interaction network that contains physical interactions between proteins that are experimentally documented in humans. Available at https://snap.stanford.edu/biodata/datasets/10000/10000PPPathways.html
Yeast Bu et al. (2003) - proteinprotein interaction network in budding yeast. Available at http://vlado.fmf.unilj.si/pub/networks/data/bio/Yeast/Yeast.htm
anobii Aiello et al. (2012) - social network (aNobii.com) of book recommendation. Two types of networks are available. The first is composed of the union of friendship and neighborhood links. The second is a communication network representing message exchanges. Available (on request) at https://www.icwsm.org/2016/datasets/datasets/
comyoutube Mislove et al. (2007) - social network representing a videosharing web site. Available at http://snap.stanford.edu/data/comYoutube.html
Appendix C: Supplementary figures
Availability of data and materials
The generated network data used in this study are available at https://homel.vsb.cz/~kud007/ego_zones_files/. Only publicly archived datasets (see Appendix B) or datasets generated according to known models were used during the study.
Abbreviations
BA model: Barabási-Albert model
CC: clustering coefficient
ER model: Erdös-Rényi model
MCC: Matthews correlation coefficient
NetDep: network dependency
OwDep: one-way dependency
OwIndep: one-way independency
TC: triadic-closure based model
TwDep: two-way dependency
TwIndep: two-way independency
References
Abbasi, A, Chung KSK, Hossain L (2012) Egocentric analysis of coauthorship network structure, position and performance. Inf Process Manag 48(4):671–679.
Agrawal, M, Zitnik M, Leskovec J, et al (2018) Largescale analysis of disease pathways in the human interactome. Pac Symp Biocomput 23:111–122. World Scientific.
Ahn, YY, Bagrow JP, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466(7307):761.
Aiello, LM, Deplano M, Schifanella R, Ruffo G (2012) People are strange when you’re a stranger: Impact and influence of bots on social networks In: ICWSM’12: Proceedings of the 6th AAAI International Conference on Weblogs and Social Media. AAAI.
Albert, R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47.
Bagrow, JP, Bollt EM (2005) Local method for detecting communities. Phys Rev E 72(4):046108.
Barnes, ER (1982) An algorithm for partitioning the nodes of a graph. SIAM J Algebraic Discret Methods 3(4):541–550.
Bashan, A, Parshani R, Havlin S (2011) Percolation in networks composed of connectivity and dependency links. Phys Rev E 83(5):051127.
Baumes, J, Goldberg MK, Krishnamoorthy MS, MagdonIsmail M, Preston N (2005) Finding communities by clustering a graph into overlapping subgraphs. IADIS AC 5:97–104.
Bianconi, G, Darst RK, Iacovacci J, Fortunato S (2014) Triadic closure as a basic generating mechanism of communities in complex networks. Phys Rev E 90(4):042806.
Blondel, VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):10008.
Bu, D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, et al (2003) Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res 31(9):2443–2450.
Chakraborty, T, Dalmia A, Mukherjee A, Ganguly N (2017) Metrics for community analysis: A survey. ACM Comput Surv (CSUR) 50(4):54.
Cho, E, Myers SA, Leskovec J (2011) Friendship and mobility: user movement in locationbased social networks In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’11, 1082–1090. ACM. https://doi.org/10.1145/2020408.2020579.
Clauset, A (2005) Finding local community structure in networks. Phys Rev E 72(2):026132.
Danisch, M, Guillaume JL, Le Grand B (2013) Towards multiegocentred communities: a node similarity approach. Int J Web Based Communities 9(3):299–322.
Erdös, P, Rényi A (1959) On random graphs, i. Publ Math (Debrecen) 6:290–297.
Evans, TS, Lambiotte R (2010) Line graphs of weighted networks for overlapping communities. Eur Phys J B 77(2):265–272.
Fortunato, S (2010) Community detection in graphs. Phys Rep 486(3-5):75–174.
Fortunato, S, Hric D (2016) Community detection in networks: A user guide. Phys Rep 659:1–44.
Freeman, LC (1977) A set of measures of centrality based on betweenness. Sociometry 40(1):35–41. https://doi.org/10.2307/3033543.
Girvan, M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821–7826.
Gregory, S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12(10):103018.
Guimera, R, SalesPardo M, Amaral LAN (2004) Modularity from fluctuations in random graphs and complex networks. Phys Rev E 70(2):025101.
Hric, D, Darst RK, Fortunato S (2014) Community detection in networks: Structural communities versus ground truth. Phys Rev E 90(6):062805.
Jacob, Y, Winetraub Y, Raz G, BenSimon E, OkonSinger H, RosenbergKatz K, Hendler T, BenJacob E (2016) Dependency network analysis (DEPNA) reveals context related influence of brain network nodes. Sci Rep 6:27444.
Kernighan, BW, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J 49(2):291–307.
Khorasgani, RR, Chen J, Zaïane OR (2010) Top leaders community detection approach in information networks In: 4th SNAKDD Workshop on Social Network Mining and Analysis. ACM.
Knuth, DE (1993) The Stanford GraphBase: A Platform for Combinatorial Algorithms In: Proceedings of the Fourth Annual ACMSIAM Symposium on Discrete Algorithms, 41–43, Philadelphia.
Kudelka, M, Zehnalova S, Horak Z, Kromer P, Snasel V (2015) Local dependency in networks. Int J Appl Math Comput Sci 25(2):281–293.
Lancichinetti, A, Fortunato S (2009) Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys Rev E 80(1):016118.
Lancichinetti, A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11(3):033015.
Leskovec, J, Lang KJ, Dasgupta A, Mahoney MW (2009) Community structure in large networks: Natural cluster sizes and the absence of large welldefined clusters. Internet Math 6(1):29–123.
McAuley, J, Leskovec J (2014) Discovering social circles in ego networks. ACM Trans Knowl Discov Data (TKDD) 8(1):4.
Mislove, A, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks In: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, 29–42. ACM, New York.
Newman, ME (2001) The structure of scientific collaboration networks. Proc Natl Acad Sci 98(2):404–409.
Newman, ME (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E 74(3):036104.
Newman, ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113.
Palla, G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814.
Parshani, R, Buldyrev SV, Havlin S (2011) Critical effect of dependency groups on the function of networks. Proc Natl Acad Sci 108(3):1007–1010.
Pons, P, Latapy M (2005) Computing communities in large networks using random walks In: International Symposium on Computer and Information Sciences, 284–293. Springer. https://doi.org/10.1007/11569596_31.
Poulin, V, Théberge F (2018) Ensemble clustering for graphs In: International Conference on Complex Networks and their Applications, 231–243. Springer.
Rosvall, M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci U S A 105(4):1118–1123.
Rozemberczki, B, Davies R, Sarkar R, et al. (2018) Gemsec: Graph embedding with self clustering. arXiv preprint arXiv:1802.03997.
Tatti, N, Gionis A (2013) Discovering nested communities In: Machine Learning and Knowledge Discovery in Databases, 32–47. Springer, Berlin.
Tversky, A (1977) Features of similarity. Psychol Rev 84(4):327.
Watts, DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393(6684):440.
Wishart, DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, Sajed T, Johnson D, Li C, Sayeeda Z, et al (2017) Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic Acids Res 46(D1):1074–1082.
Xie, J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: The stateoftheart and comparative study. ACM Comput Surv (csur) 45(4):43.
Yang, J, Leskovec J (2012) Communityaffiliation graph model for overlapping network community detection In: 2012 IEEE 12th International Conference on Data Mining, 1170–1175. IEEE. https://doi.org/10.1109/icdm.2012.139.
Yang, J, Leskovec J (2015) Defining and evaluating network communities based on groundtruth. Knowl Inf Syst 42(1):181–213.
Yang, Y, Klimmt B (2004) The Enron corpus: A new dataset for email classification research In: European Conference on Machine Learning, 217–226. Springer.
Zachary, WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33(4):452–473.
Zhang, J, Cheng J, Su X, Yin X, Zhao S, Chen X (2018) Correlation Analysis of Nodes Identifies Real Communities in Networks. arXiv preprint arXiv:1804.06005.
Zitnik, M, Agrawal M, Leskovec J (2018) Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34(13):i457–i466.
Acknowledgements
Not applicable.
Funding
This work is partially supported by SGS, VSBTechnical University of Ostrava, under the grants no. SP2018/130 and SP2019/14 and Ministry of Health of the Czech Republic under grants no. MZ CR VES1631852A and MZ VES1632339A.
Author information
Contributions
We use the following notation for different types of contribution: A - algorithm design, I - implementation, P - paper writing, E - experimental evaluation. The authors contributed as follows: MK (A, I, P, E), EO (E, P), JP (I, E), SZ (P). All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Kudelka, M., Ochodkova, E., Zehnalova, S. et al. Egozones: nonsymmetric dependencies reveal network groups with large and dense overlaps. Appl Netw Sci 4, 81 (2019). https://doi.org/10.1007/s4110901901926
DOI: https://doi.org/10.1007/s4110901901926