Uncovering complex overlapping pattern of communities in large-scale social networks

The conventional notion of community that favors a high ratio of internal edges to outbound edges becomes invalid when each vertex participates in multiple communities. Such behavior is commonplace in social networks. The significant overlaps among communities make most existing community detection algorithms ineffective. The lack of effective and efficient tools has resulted in very few empirical studies on large-scale detection and analysis of overlapping community structure in real social networks. We recently developed a scalable and accurate method called the Partial Community Merger Algorithm (PCMA) with linear complexity and demonstrated its effectiveness by analyzing two online social networks, Sina Weibo and Friendster, with 79.4 and 65.6 million vertices, respectively. Here, we report in-depth analyses of the 2.9 million communities detected by PCMA to uncover their complex overlapping structure. Each community usually overlaps with a significant number of other communities and has far more outbound edges than internal edges. Yet, the communities remain well separated from each other. Most vertices in a community are multi-membership vertices, and they can be at the core or the periphery. Almost half of the entire network can be accounted for by an extremely dense network of communities, with the communities being the vertices and the overlaps being the edges. The empirical findings call for rethinking the notion of community, especially the boundary of a community. Realizing that it is how the edges are organized that matters, we suggest the f-core as a suitable concept for overlapping community in social networks. The results shed new light on the understanding of overlapping community.


Introduction
A community in networks is commonly conceived as a group of vertices connected closely with each other but only loosely to the rest of the network. Such communities are widespread in many systems and their detection has attracted much attention in the past two decades (Fortunato, 2010). This vague notion of communities has many possible interpretations. The most common one is based on the ratio of the number of internal edges to the number of outbound edges, i.e., edges that go out of the community. The higher this ratio, the more definite the community is. For example, the widely used methods based on strong/weak community (Radicchi et al, 2004), LS-set (Luccio and Sami, 1969), conductance and network community profile (Jeub et al, 2015; Leskovec et al, 2009), and fitness functions (Baumes et al, 2005; Goldberg et al, 2010; Lancichinetti et al, 2009) favor a higher ratio of internal edges to outbound edges. The idea works well for disjoint communities, and it has also been adopted by algorithms for detecting overlapping communities (Xie et al, 2013). Nonetheless, the number of members belonging to multiple communities, mostly at the periphery, is still expected to be small so that an "overlapping community" remains well separated from its surroundings. However, the actual structure of overlapping communities can be far more complex. In social networks, it is commonplace for an individual to have multiple social circles. This implies that all parts of a social community, peripheral and core, may overlap with a significant number of other communities, and there can be far more outbound edges than internal edges. In what follows, we refer to a community with such properties as an "overlapping community". The existence of these significantly overlapped communities calls for a deeper understanding of what an overlapping community really is, where its boundaries are, and how to detect it.
Analyzing big data sets of real social networks is vital in network science. An immediate problem is that most existing methods are incapable of detecting significantly overlapped groups of vertices, because these groups have too many outbound edges to be identified as well-separated communities. The recently proposed methods of OSLOM (Lancichinetti et al, 2011) and BIGCLAM (Yang and Leskovec, 2013) are useful to some extent in small synthetic networks, but they become inefficient for large-scale networks, which readily reach sizes of millions to billions of vertices. Sampling small subnetworks (Maiya and Berger-Wolf, 2010) would not work either due to the small-world effect (Watts and Strogatz, 1998), e.g. the average distance between any two individuals on Facebook is only 4.74 (Ugander et al, 2011), while the diameter of a social community is usually 3 or 4. A community may be localized, but it can also be widespread in the network. Sampling small subnetworks would preserve particular communities but decompose many others, making it inappropriate for studying the overlaps among communities. Some newly proposed algorithms (Epasto et al, 2017; Lyu et al, 2016; Sun et al, 2017) achieved linear-time complexity, but their validity and accuracy in detecting significantly overlapped communities require further benchmarking and cross-checking. The lack of effective and efficient algorithms has resulted in very few studies on detecting and analyzing overlapping community structure in large-scale social networks. An empirical study was carried out on Facebook (Ferrara, 2012), but only methods for detecting disjoint communities were used. A recent study on Friendster found that about 30% of vertices belonged to multiple communities (Epasto et al, 2017). Yang and Leskovec analyzed metadata groups of some real networks and found that overlaps occur more often at the cores of communities (Yang and Leskovec, 2014, 2015). This is contrary to the traditional notion that overlapping members are mostly at the periphery. Recent studies also revealed that metadata groups may not give the ground truth of structural communities (Hric et al, 2014; Peel et al, 2017).
The present authors recently developed a scalable partial community merger algorithm (PCMA) (Xu, 2016; Xu and Hui, 2018). They tested PCMA against the LFR benchmark (Lancichinetti et al, 2008) and a new benchmark designed for significantly overlapping communities, and established the accuracy and effectiveness of PCMA in detecting communities with significant overlaps, as well as slightly overlapping and disjoint ones. The linear complexity of PCMA enabled the analysis of two huge online social networks with 79.4 and 65.6 million vertices, Sina Weibo and Friendster (see Table 1), without sampling small subnetworks. The ∼2.9 million detected communities were verified to be non-duplicating and to have relatively high values of internal edge density. A surprising finding is that more than 99% of them have more outbound edges than internal edges, and the outbound edges often outnumber the internal edges by many times. The communities overlap significantly, while still keeping relatively clear boundaries. These communities are strong empirical evidence against the traditional notion of an overlapping community. While we focused on developing the algorithm in Ref. (Xu and Hui, 2018), we uncover the complex overlapping pattern of social communities in the present work by examining the data in detail and explain why the communities can still remain well separated from each other. After introducing the four main characteristics of the overlapping pattern, we give a macroscopic picture of the social network structure by grouping edges of the entire network into five categories. The concept and possible better definitions of an overlapping community are discussed. Additional information on the data sets and the detection of communities is given in the appendix.

Table 1 M represents a million. n and m are the numbers of vertices and edges. $\langle k \rangle$ is the average vertex degree. $C_{WS}$ is the average local clustering coefficient. c is the number of communities detected by PCMA. More detailed information is given in the appendix.

Characteristics of overlapping pattern
Characteristic 1. Multi-membership vertices, or overlapping vertices, are often thought to be peripheral members, but a recent study (Yang and Leskovec, 2014) found that they are more likely to be core members. Our analysis of the two large-scale social networks reveals that overlapping vertices can be anywhere in a community, i.e., at the core or at the periphery. In general, a vertex v may belong to $m_v$ communities. The vertices can then be sorted by their values of $m_v = m$ for $m \geq 1$. The belongingness $b_{v,C}$ of a vertex v to a community C can be defined as

$$b_{v,C} = \frac{k^{\mathrm{int}}_{v,C}}{n_C - 1}, \tag{1}$$

where $n_C$ is the community size and $k^{\mathrm{int}}_{v,C}$ is the number of other members in C that are connected to v. A high (low) value of $b_{v,C}$ means that v is closer to the core (periphery) of C. If overlaps occurred more often at the periphery (core), we would expect multi-membership vertices with m > 1 to have a lower (higher) belongingness than those with m = 1. Fig. 1 shows that the belongingness distributions for vertices with different values of m are almost identical, with an insignificant tendency for multi-membership vertices to have a slightly higher belongingness. The results imply that $m_v$ is basically uncorrelated with $b_{v,C}$, and multi-membership vertices exist everywhere in a community with no preference towards the core or the periphery as compared with non-overlapping vertices.

[…] Sina Weibo. For Friendster, the proportion is ∼60%, which is about twice that reported in Ref. (Epasto et al, 2017). A related quantity is

$$P_m = \frac{p_m \cdot m}{\langle m \rangle}, \tag{2}$$

which gives the probability that a member of a community has m memberships.
Here, $\langle m \rangle = \sum_{m=1}^{\infty} p_m \cdot m$ is the mean value of m. Referring to $P_m$ in Fig. 2, $P_{m=1}$ = 18.8% and 12.9% for Sina Weibo and Friendster, respectively, implying that on average more than 80% of the members in a community are multi-membership vertices. This is in sharp contrast to the preconceived idea that only a small fraction of members in a community also belong to other communities. The results reveal that most members of a community have multiple memberships and that they are everywhere in the community.

Characteristic 2. The multi-membership vertices lead to a community overlapping with many other communities. We refer to them as neighbor communities. Fig. 3 shows the relationship between the number of neighbor communities $d_C$ and the size $n_C$ of a community in the two social networks. To extract information, note that the expected number of neighbor communities for a community of size $n_C$ is roughly

$$\bar{d}_C = (\langle m \rangle_C - 1) \cdot n_C \cdot r_{nd}, \tag{3}$$

where $\langle m \rangle_C = \sum_{m=1}^{\infty} P_m \cdot m$ is the expected number of memberships of a member in the community. Although each member connects the community to $\langle m \rangle_C - 1$ other communities of which it is also a member, $(\langle m \rangle_C - 1) \cdot n_C$ overestimates the number of neighbor communities due to duplication, i.e., some members in the community have common neighbor communities. A factor $r_{nd}$ is introduced to represent the non-duplicate rate. Consider the simple case of a size-$n_C$ community with x members all in only one neighbor community. In this case, $(\langle m \rangle_C - 1) \cdot n_C = x$ while $d_C = 1$, implying $r_{nd} = 1/x$. Thus, the value of $r_{nd}$ also indicates the extent of overlap between two communities. For overlaps of just 2 or 3 vertices, $r_{nd}$ drops to 50% or below. The analysis in Fig. 3 confirms that $\bar{d}_C \sim n_C$, but with a slope gradually decreasing with increasing $n_C$. Thus, $r_{nd}$ is negatively correlated with $n_C$. The slopes are around 3 ∼ 4, which are about 30% smaller than the values 4.36 and 5.25 calculated by $(\langle m \rangle_C - 1)$ from empirical data of Sina Weibo and Friendster, respectively. Note that these slopes are very large, e.g. a community of size as small as 30 could overlap with ∼100 other communities concurrently. The resulting non-duplicate rates $r_{nd}$ are above 70%, strongly indicating that most overlaps concern just one vertex.

Characteristic 3. For each community identified by PCMA, we evaluated the total numbers of internal edges $k^{\mathrm{int}}_C$ and outbound edges $k^{\mathrm{out}}_C$,

$$k^{\mathrm{int}}_C = \sum_{v \in C} k^{\mathrm{int}}_{v,C}, \qquad k^{\mathrm{out}}_C = \sum_{v \in C} k^{\mathrm{out}}_{v,C}, \tag{4}$$

where $k^{\mathrm{int}}_{v,C}$ ($k^{\mathrm{out}}_{v,C}$) denotes the number of a vertex v's edges that go inside (outside) the community C. The summations are over all $n_C$ vertices in the community. Note that each internal edge is counted twice as both ends are within the community, and each outbound edge is included only once. Figure 4 shows that the number of outbound edges of a community is not only greater, but often many times greater than the number of internal edges. More than 99% of the 2.9 million communities have more outbound edges than internal edges, in contrast to the traditional notion.

Figure 5 Edges can be classified into five types: (1) intra-community edges; (2) inter-community edges between two overlapped communities; (3) inter-community edges between two communities that do not overlap; (4) edges between vertices with membership m > 0 and isolated vertices (m = 0); (5) edges between isolated vertices. Focusing on the outbound edges of the green community with 5 members (circled), the edges 1b, 2, and 3+4 correspond to categories E1, E2, and E3 outbound edges, respectively, as defined in the text.
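As a rough illustration of the membership statistics used in Characteristics 1 and 2, the following Python sketch computes the membership distribution, the size-biased probability P_m = p_m · m / ⟨m⟩, the number of neighbor communities of a community, and its non-duplicate rate r_nd. The input format (a list of vertex sets), the function names, and the normalization of p_m over vertices with m ≥ 1 are assumptions made for this example; it is not the implementation used in the study.

```python
from collections import Counter, defaultdict


def membership_stats(communities):
    """Return (memberships, p, mean_m, P) for a list of vertex sets.

    memberships: vertex -> set of indices of the communities it belongs to
    p[m]:        fraction of vertices (with m >= 1) that have m memberships
    mean_m:      sum over m of p[m] * m
    P[m]:        probability that a member of a community has m memberships
    """
    memberships = defaultdict(set)
    for idx, members in enumerate(communities):
        for v in members:
            memberships[v].add(idx)

    counts = Counter(len(coms) for coms in memberships.values())
    total = sum(counts.values())
    p = {m: cnt / total for m, cnt in counts.items()}
    mean_m = sum(m * pm for m, pm in p.items())
    P = {m: m * pm / mean_m for m, pm in p.items()}
    return memberships, p, mean_m, P


def neighbor_stats(communities, memberships, idx):
    """Return (d, r_nd) for community idx.

    d is the number of distinct neighbor communities. r_nd is d divided by
    the sum of (m_v - 1) over the members, i.e. the share of membership
    links that lead to distinct neighbor communities.
    """
    neighbors = set()
    extra_memberships = 0
    for v in communities[idx]:
        others = memberships[v] - {idx}
        neighbors |= others
        extra_memberships += len(others)
    d = len(neighbors)
    r_nd = d / extra_memberships if extra_memberships else None
    return d, r_nd


# Toy example (made-up data): three small overlapping communities.
toy = [{1, 2, 3, 4}, {3, 4, 5}, {4, 6, 7}]
memberships, p, mean_m, P = membership_stats(toy)
d, r_nd = neighbor_stats(toy, memberships, 0)
print(mean_m, P, d, r_nd)
```

In the toy example, community 0 has two neighbor communities but three extra memberships among its members, because vertices 3 and 4 both also belong to community 1. The duplication gives r_nd = 2/3.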
To investigate the network structure, we focused on the outbound edges and classified them into 3 categories (see Fig. 5):
• E1: outbound edges from a member to a neighbor community to which the member also belongs;
• E2: outbound edges from a member to a neighbor community that the member does not belong to;
• E3: outbound edges not to a neighbor community.
Their proportions $e_1$, $e_2$, $e_3$, with $e_1 + e_2 + e_3 = 1$, are calculated for each community. Fig. 6 shows the histograms. Typically, the edges to a neighbor community go through the common member(s) of the two communities, as $e_1$ is much greater than $e_2$. In addition, a significant proportion of outbound edges go to neighbor communities. In Sina Weibo, most communities (red region) have $e_1 + e_2 \approx 0.5$. This means that ∼50% of the outbound edges are due to the vertices' multi-membership and that communities are densely connected to their neighbor communities. Note that if a community's outbound edges were randomly connected to vertices in the network, most edges would be of the E3 type.

Characteristic 4. How can communities ever be distinguished when each community overlaps with a significant number of others? The answer is that the overlap size between two communities is usually small, and the connection between them is mostly through the overlap. Table 2 lists the frequency of occurrence of the most common overlap sizes. Out of 232M (million) overlaps among the 2.9M detected communities, more than 80% are of just a single vertex. Fig. 7 shows the actual structure of two detected communities. The outbound edges from community A (left) to its neighbor community B are highly organized through the overlap. Members of B usually only know the overlapping part of A, and vice versa. The overlapped vertex serves as the sole bridge and plays a unique role in passing information between the communities. Yet, there may exist some E2 edges between the communities. In social networks, they are possibly due to the common member introducing members of the two communities to each other. In Fig. 6, $e_2$ is below 10% or even 5% for most communities and far less than $e_1$. It is the small proportion of E2 edges that facilitates the easy separation of communities. The proportion $e_2$ is thus an indicator of the clearness of the boundary between a community and its neighbor communities. We checked every pair of overlapped communities for E2 edges. Results are listed in Table 2. For 37.8% (Sina Weibo) and 30.1% (Friendster) of them, there is not even a single E2 edge. The communities maintain a good separation from their surroundings even though each overlaps with a significant number of neighbor communities.
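The E1/E2/E3 classification of outbound edges can be sketched in Python as follows. This is a minimal illustration for small graphs, not the procedure used on the full data sets. The adjacency format, the function names, and the rule that an edge qualifying as both E1 and E2 is counted as E1 are assumptions of this example.

```python
def build_adjacency(edges):
    """Build an undirected adjacency dict (vertex -> set of neighbors)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def outbound_edge_proportions(adj, communities, idx):
    """Return the proportions e1, e2, e3 of outbound edges of community idx."""
    memberships = {}
    for i, com in enumerate(communities):
        for v in com:
            memberships.setdefault(v, set()).add(i)

    members = communities[idx]
    # Neighbor communities: all other communities sharing at least one member.
    neighbor_ids = {i for v in members for i in memberships[v]} - {idx}

    counts = {"E1": 0, "E2": 0, "E3": 0}
    for v in members:
        for u in adj.get(v, set()):
            if u in members:
                continue  # internal edge, not an outbound edge
            u_coms = memberships.get(u, set())
            if u_coms & memberships[v]:
                counts["E1"] += 1  # u is in a neighbor community that v also belongs to
            elif u_coms & neighbor_ids:
                counts["E2"] += 1  # u is in a neighbor community that v is not in
            else:
                counts["E3"] += 1  # u is in no neighbor community
    total = sum(counts.values())
    return {k: (c / total if total else 0.0) for k, c in counts.items()}


# Toy example (made-up data): two triangles sharing vertex 3,
# one cross edge 2-4, and one edge to an isolated vertex 6.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5), (2, 4), (1, 6)]
adj = build_adjacency(edges)
print(outbound_edge_proportions(adj, [{1, 2, 3}, {3, 4, 5}], 0))
```

For community {1, 2, 3} in the toy graph, the two edges from the shared vertex 3 into the other triangle are E1, the cross edge 2-4 is E2, and the edge 1-6 is E3.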

Mesoscopic view of social network structure
For the 2.9 million detected communities, we can classify all the edges in the two social networks into 5 types (see the caption of Fig. 5). The results are given in Table 3. The number of Type 1 edges suggests that the communities account for 30 ∼ 35% of the entire network in terms of edges. These communities, connected together by the huge number of overlaps, form an extremely dense and tight network by themselves. A further 10 ∼ 20% of the edges connect the overlapped communities (Type 2). Their total number is comparable to that of Type 1, but since they are distributed among the huge number of overlaps, each overlap shares only very few such edges. For example, in Sina Weibo, the 117M Type 2 edges are distributed among 73M overlaps, on average only 1.6 per overlap. The numbers further confirm the structure shown in Fig. 7. The Type 1 and Type 2 edges, together occupying half of the entire network, form an immense network of communities that can be regarded as a hidden skeleton of the social network at the mesoscopic scale. The remaining half of the edges are outside the skeleton, mostly Type 3 or Type 4. The former are "long-range" weak ties connecting different parts of the skeleton, thus making the skeleton an even smaller world. Although the majority of the vertices are outside the skeleton, i.e., vertices with m = 0, the edges among them (Type 5) account for less than 10%. For Friendster it is only 1.5%. These vertices are possibly the inactive users of the two online social network services.
The edge classification helps decompose the entire network and reveals a remarkably high proportion of the significantly overlapped communities. The proportion could be even higher if less tightly connected vertices are also accepted as communities. The immense size of the network of communities confirms its important role in social networks and invites in-depth analyses on the properties of the huge and dense skeleton of social networks.
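The five-type edge classification described in the caption of Fig. 5 can be sketched as below. This is a small-scale Python illustration under stated assumptions: the input format and function name are hypothetical, an edge whose endpoints share a community is always counted as Type 1, and the sub-labels 1a and 1b from the figure are not distinguished. Storing all overlapping pairs in memory, as done here, would need a more careful design at the scale of millions of communities.

```python
from itertools import combinations


def classify_edges(edges, communities):
    """Count edges of Types 1 to 5 for an undirected edge list."""
    memberships = {}
    for i, com in enumerate(communities):
        for v in com:
            memberships.setdefault(v, set()).add(i)

    # overlapping[i] holds the communities sharing at least one vertex with i.
    overlapping = {i: set() for i in range(len(communities))}
    for coms in memberships.values():
        for a, b in combinations(coms, 2):
            overlapping[a].add(b)
            overlapping[b].add(a)

    counts = {t: 0 for t in range(1, 6)}
    for u, v in edges:
        cu = memberships.get(u, set())
        cv = memberships.get(v, set())
        if not cu and not cv:
            counts[5] += 1  # both endpoints isolated (m = 0)
        elif not cu or not cv:
            counts[4] += 1  # one endpoint isolated
        elif cu & cv:
            counts[1] += 1  # intra-community edge
        elif any(overlapping[c] & cv for c in cu):
            counts[2] += 1  # between two overlapped communities
        else:
            counts[3] += 1  # between communities that do not overlap
    return counts


# Toy example (made-up data).
communities = [{1, 2, 3}, {3, 4, 5}, {7, 8, 9}]
edges = [
    (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5),  # inside the two overlapped triangles
    (7, 8), (7, 9), (8, 9),                          # inside the third community
    (2, 4),                                          # between overlapped communities
    (5, 7),                                          # between non-overlapping communities
    (1, 6),                                          # member to isolated vertex
    (6, 10),                                         # between isolated vertices
]
print(classify_edges(edges, communities))
```

Dividing each count by the total number of edges gives proportions of the kind reported in Table 3.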

Rethinking the concept of overlapping community
The strong empirical evidence from the analyses of the two social networks contradicts what we usually think a community is and asks for a reconsideration of the concept of community. Despite a wide variety of definitions, most of them, if not all, share an intuitive idea: members of a community should have some sort of internal cohesion and good separation from the rest of the network. The problem is how the idea should be interpreted, especially what a good separation and the boundary of a community are about.
Many definitions and quality measures of a community interpret "good separation" as follows: the smaller $k^{\mathrm{out}}_C$ (or $k^{\mathrm{out}}_C / k^{\mathrm{int}}_C$), the more definite the community. Examples include the widely used weak community $k^{\mathrm{int}}_C > k^{\mathrm{out}}_C$ […] (Rosvall and Bergstrom, 2008) and label propagation (Raghavan et al, 2007). We argue that comparing $k^{\mathrm{out}}_C / k^{\mathrm{int}}_C$ is ineffective in large-scale networks, for overlapping and disjoint communities alike. As shown in Figs. 4 and 6, there are more outbound edges than internal edges, even if we ignore the neighbor community edges produced by the multi-membership vertices. The point is that a larger value of $k^{\mathrm{out}}_C$ alone does not necessarily mean that the community is less definite. Consider the case in which an arbitrarily large number of outbound edges of a community are randomly distributed over the whole network. The community is then not strongly connected to any particular part of the network, as long as the network size $n \gg k^{\mathrm{out}}_C$. This point has also been discussed in a recent review by Fortunato and Hric (2016). They suggested using edge probabilities instead of the number of edges. A member of a community should have a higher probability of forming edges with the other members than with vertices outside the community. It is generally difficult to infer the edge probability between each pair of vertices. A simplified way is to assume that the edge probabilities within a community (to the outside) are the same and equal to the internal (outbound) edge density

$$\delta^{\mathrm{int}}_C = \frac{k^{\mathrm{int}}_C}{n_C (n_C - 1)}, \qquad \delta^{\mathrm{out}}_C = \frac{k^{\mathrm{out}}_C}{n_C (n - n_C)}, \tag{5}$$

where n and $n_C$ are the network and community sizes, respectively. However, for large networks with $n \gg n_C$, usually $\delta^{\mathrm{out}}_C \to 0$, making the definition $\delta^{\mathrm{int}}_C > \delta^{\mathrm{out}}_C$ not useful. The problem with $k^{\mathrm{out}}_C$ (and so with $\delta^{\mathrm{out}}_C$) is that it counts the outbound edges to the whole network and reports only a summed quantity. What really matters is not the number $k^{\mathrm{out}}_C$, but where the $k^{\mathrm{out}}_C$ outbound edges are distributed.
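To make the argument about edge densities concrete, the following Python sketch computes the internal and outbound edge counts of a community and the corresponding densities. It assumes the density forms δ_int = k_int / (n_C (n_C − 1)) and δ_out = k_out / (n_C (n − n_C)), with k_int counting each internal edge twice as in the text. The function names and the toy numbers are made up for this example.

```python
def build_adjacency(edges):
    """Build an undirected adjacency dict (vertex -> set of neighbors)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def community_edge_counts(adj, members):
    """Return (k_int, k_out); each internal edge is counted twice."""
    k_int = sum(len(adj.get(v, set()) & members) for v in members)
    k_out = sum(len(adj.get(v, set()) - members) for v in members)
    return k_int, k_out


def edge_densities(k_int, k_out, n_c, n):
    """Return (delta_int, delta_out) under the assumed density definitions."""
    delta_int = k_int / (n_c * (n_c - 1))
    delta_out = k_out / (n_c * (n - n_c))
    return delta_int, delta_out


# Toy example: a triangle {1, 2, 3} with two outbound edges.
adj = build_adjacency([(1, 2), (1, 3), (2, 3), (3, 4), (2, 5)])
members = {1, 2, 3}
k_int, k_out = community_edge_counts(adj, members)

# Same community, embedded in a small and in a very large network.
for n in (5, 1_000_000):
    print(n, edge_densities(k_int, k_out, len(members), n))
```

With the same k_out, the outbound density falls from about 0.33 for n = 5 to below 10^-6 for n = 10^6. The comparison of δ_int with δ_out therefore says little about separation in large networks.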
As discussed under Characteristic 4 of the overlapping pattern, a multi-membership vertex may contribute much to $k^{\mathrm{out}}_C$ without blurring the boundary between the community and its neighbors. In contrast, adding a number of outbound edges to a particular vertex outside is sufficient to change the boundary of the community. These two cases are due to different distribution patterns of outbound edges:
• Outbound edges from the same member to vertices outside the community
• Outbound edges from different members to a particular vertex outside the community
For the first case, it does not matter how many outbound edges there are. For the second case, however, the fewer the better. A good definition of overlapping community should be able to distinguish between the two cases. A useful concept here, as discussed in Ref. (Xu and Hui, 2018), is the f-core: a maximal connected subgraph in which each vertex is connected to at least a fraction f of the other vertices in the subgraph,

$$b_{v,C} \geq f, \quad \forall v \in C, \tag{6}$$

with $b_{v,C}$ being the belongingness of v to C as defined in Eq. (1). A vertex is acknowledged as a member of an f-core as long as it has sufficient connections to the other members of the f-core. It is irrelevant whether it is connected to a large number of vertices outside the f-core. This property of the f-core distinguishes the two cases of outbound edges successfully and allows a vertex to belong to multiple f-cores naturally. In contrast, the number-based counterpart called k-core, which requires each vertex to be a neighbor to at least k other vertices in the subgraph, is non-overlapping by definition. The "maximal connected subgraph" in the definition ensures that all vertices outside the f-core have belongingness less than f, as defined in Eq. (6), except when a vertex outside does satisfy the condition but including it would cause some other member(s) of the f-core to be removed.
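The belongingness condition of the f-core can be checked with a few lines of Python. The sketch below assumes that the belongingness of a member is the fraction of the other members it is connected to. It tests only the condition b ≥ f for every member of a given vertex set. It does not check connectedness or maximality, and it is not a detection algorithm. All names are hypothetical.

```python
def build_adjacency(edges):
    """Build an undirected adjacency dict (vertex -> set of neighbors)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def belongingness(adj, v, members):
    """Fraction of the other members of the set that v is connected to."""
    others = members - {v}
    if not others:
        raise ValueError("a community needs at least two members")
    return len(adj.get(v, set()) & others) / len(others)


def satisfies_f_condition(adj, members, f):
    """True if every member has belongingness >= f within the set."""
    return all(belongingness(adj, v, members) >= f for v in members)


# Toy example: two triangles sharing vertex 3, which also has
# several edges to vertices outside both triangles.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5),
         (3, 6), (3, 7), (3, 8)]
adj = build_adjacency(edges)
A, B = {1, 2, 3}, {3, 4, 5}

print(satisfies_f_condition(adj, A, 1.0))      # vertex 3 qualifies despite its outbound edges
print(satisfies_f_condition(adj, B, 1.0))      # vertex 3 is also a full member of B
print(satisfies_f_condition(adj, A | B, 0.6))  # the union fails: vertex 1 reaches only 2 of 4 others
```

The shared vertex 3 passes the condition in both triangles regardless of how many outbound edges it has, while the union of the two triangles fails for f = 0.6. This mirrors the distinction between the two cases of outbound edges.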
The fraction f defines the boundary of the community. A problem is that there is no standard way to determine what value of f should be used. Communities in social networks often show core-periphery structures (Csermely et al, 2013; Rombach et al, 2014; Zhang et al, 2015) and have no definite boundaries. A large value of f extracts the core members of communities, and a small value results in more peripheral vertices being accepted as members. We are of the opinion that the belongingness $b_{v,C}$ is a better way to describe vertex memberships than forcing a vertex to be either inside or outside a community. While the f-core is a good candidate, better definitions of overlapping community may still be possible. The key point is that the definition should take into account the possible ubiquitous presence of multi-membership vertices, as revealed by data analysis:
• The proportion of multi-membership vertices may range from 0 to 100%.
• A vertex may belong to an arbitrary number of communities.
These are the causes of the significant overlaps among communities and of a much greater number of outbound edges than internal edges.

Summary and outlook
We studied the overlapping structure of 2.9 million communities detected in the two huge online social networks. We found four main characteristics:
• Most members of a community have multiple memberships. They are everywhere, at the periphery or in the core.
• A community usually overlaps with a significant number of other communities, and the number is typically several times its size.
• The number of outbound edges of a community is many times greater than the number of internal edges.
• Although communities overlap significantly, they remain relatively well separated from each other. Most overlaps concern just one or sometimes two vertices.
Such a significant overlapping pattern calls for a rethinking of what the boundary of a community really is. We discussed several traditional interpretations and related issues, and suggested the f-core as a possible definition for overlapping community.
Our study also showed a dense and tight network of communities, with the communities taking the role of vertices and the overlaps being the edges. Most overlaps are of just a single vertex. Each of these vertices plays a unique role in passing on information between the communities that it belongs to. This network of communities accounts for almost half of the entire network. It deserves further study on how its structural properties couple to the many phenomena in social dynamics.
Our empirical study revealed new aspects of overlapping community. The results provide researchers with clues for designing effective detection algorithms, generative models, and benchmarks for overlapping communities, especially in social networks. We look forward to more empirical studies powered by new tools, to cross-check the present work and explore areas not covered by PCMA.

Appendix: Datasets
For completeness, we describe the two social networks we analyzed. Table 1 gives the basic information. Sina Weibo is a directed network akin to Twitter. We focused on the embedded friendship network in which two connected individuals are following each other. Instead of sampling small subnetworks, we collected almost the whole giant component of the network, because the structural completeness of the sampled network is vital to the preservation of community structure, especially the overlapping pattern among communities. The network data of Friendster was downloaded from SNAP Datasets (Leskovec and Krevl, 2014).
We detected about 1.3 and 1.6 million communities in the two networks with PCMA (Xu and Hui, 2018). The algorithm is especially suitable for detecting communities in which the vertices have multiple memberships. Information on the detection was reported in detail in Ref. (Xu and Hui, 2018). Using symbols introduced in Ref. (Xu and Hui, 2018), we used a harsh threshold $l \geq 10$ to ensure that the detected communities are reliable (the larger the l, the more reliable the community). A drawback is that many small communities were not included. In the present work, we added communities with $6 \leq l \leq 9$ and $g > 3.0/l$. The results are shown in Fig. 8. The latter condition ensures a relatively high intra-community edge density for these communities, especially for those with low l. Figure 9 shows that all communities, including the newly added ones, have high values of intra-community edge density.
The large values of the proportion of intra-community and E2 edges, as shown in Table 3, […] possible total number of communities in the two networks. However, it should be noted that there is no standard answer as to how many communities there are in a real network. We found that adding or removing the additional communities in the analyses only produces minor changes to the statistics. In particular, it does not change the characteristics of the overlapping pattern we discussed. The 2.9 million detected communities are believed to be adequate and representative.