Overlapping community finding with noisy pairwise constraints

Alghamdi, Elham; Rushe, Ellen; Mac Namee, Brian; Greene, Derek

doi:10.1007/s41109-020-00340-9

Research
Open access
Published: 11 December 2020

Elham Alghamdi ORCID: orcid.org/0000-0001-9487-1752¹,
Ellen Rushe¹,
Brian Mac Namee¹ &
…
Derek Greene¹

Applied Network Science volume 5, Article number: 98 (2020)

Abstract

In many real applications of semi-supervised learning, the guidance provided by a human oracle might be “noisy” or inaccurate. Human annotators will often be imperfect, in the sense that they can make subjective decisions, they might only have partial knowledge of the task at hand, or they may simply complete a labeling task incorrectly due to the burden of annotation. Similarly, in the context of semi-supervised community finding in complex networks, information encoded as pairwise constraints may be unreliable or conflicting due to the human element in the annotation process. This study aims to address the challenge of handling noisy pairwise constraints in overlapping semi-supervised community detection, by framing the task as an outlier detection problem. We propose a general architecture which includes a process to “clean” or filter noisy constraints. Furthermore, we introduce multiple designs for the cleaning process which use different types of outlier detection models, including autoencoders. A comprehensive evaluation is conducted for each proposed methodology, which demonstrates the potential of the proposed architecture for reducing the impact of noisy supervision in the context of overlapping community detection.

Introduction

Complex networks occur in many aspects of life, from social systems to biological processes. Despite their diversity, many networks share common properties and principles of organization (Boccaletti et al. 2006). One essential property that helps us to understand complex networks is the idea of community structure. Finding these sets of nodes or communities provides us with three important capabilities: understanding the structures and functionalities, modeling the dynamic processes in networks, and predicting their future behaviors. Generally, algorithms for detecting communities are unsupervised in nature. That is, they rely solely on the network topology during the detection process, rather than using any prior information or training data regarding the “correct” community structure. One common issue is that these algorithms can fail to uncover groupings that accurately reflect the ground truth in a specific domain, particularly when these communities highly overlap with one another (Ahn et al. 2010).

Recent work has improved the effectiveness of such algorithms by employing ideas from semi-supervised learning (Alghamdi and Greene 2019). This involves harnessing existing background knowledge (e.g. from domain experts or crowdsourcing platforms), which can provide limited supervision for community detection. Often this information takes the form of pairwise constraints between nodes (Basu et al. 2004a). Typically, pairwise constraints are either must-link or cannot-link pairs, indicating that two nodes should either be assigned to the same community or be placed in different communities. As an example, we might be interested in finding social groups based on common interests on social media platforms, such as Facebook or Twitter, in order to target the most influential member of each social group for marketing and recommendation purposes. To improve our ability to achieve this, and go beyond simply looking at connections, we could ask a human annotator whether two users should be in the same group or in different groups, label the pair as must-link or cannot-link accordingly, and then incorporate these labels as constraints into community detection algorithms. By using this kind of knowledge, we can potentially uncover communities of nodes which are otherwise difficult to identify when analyzing complex networks.

Despite the promise of semi-supervised learning, in many real applications the supervision coming from human annotators will be unreliable or “noisy”. For instance, this might occur when using annotation acquired by crowdsourcing platforms (Howe 2008) such as Amazon Mechanical Turk (Kittur et al. 2008). In general, human oracles will often be “imperfect”, in the sense that they can make subjective decisions, they may disagree with one another, they might only have limited knowledge of a domain, or they may simply complete a labeling task incorrectly due to the burden of annotation (Amini and Gallinari 2005; Du and Ling 2010; Sheng et al. 2008). Thus, when such judgements are encoded as pairwise constraints for semi-supervised community detection, the constraints can be unreliable or conflicting, which can create problems when they are used to guide community finding algorithms (Zhu et al. 2015).

In this study, we explore the effect of noisy, incorrectly-labeled constraints on the performance of semi-supervised community finding algorithms for overlapping networks. To mitigate such cases, we treat the noisy constraints as outliers, and use an outlier detection strategy to identify and remove them, which has the effect of “cleaning” the constraints coming from the human oracle. The primary contributions of the paper are as follows:

1. We introduce a general architecture for semi-supervised community finding which incorporates a cleaning methodology to reduce the presence of noisy pairwise constraints, using outlier detection. This architecture can be implemented with any semi-supervised community finding algorithm that might involve querying an imperfect oracle. In this study, we focus on the use of the AC-SLPA algorithm (Alghamdi and Greene 2018).
2. We propose alternative designs for the cleaning methodology, based on different outlier detection models. Each design involves executing two parallel processes to separately reduce noise from must-link constraints and cannot-link constraints.
3. We investigate the performance of combining conventional outlier detection models and deep learning models for identifying noisy constraints.
4. We conduct comprehensive experiments to evaluate these alternative cleaning methods, both as individual components and when integrated within the proposed general architecture, on a range of synthetic and real-world networks containing overlapping community structure.

The remainder of this paper is structured as follows. Section “Related work” provides a summary of relevant work in semi-supervised learning, in the context of both cluster analysis and community finding. In Section “Methods”, we describe the proposed general architecture for community detection which incorporates a cleaning process to reduce noise levels in pairwise constraints, and we propose multiple designs for implementing the cleaning process. In Section “Evaluation”, we discuss four experimental evaluations of these methods. Finally, we conclude our work in Section “Conclusion” with suggestions for further extending this work in new directions.

Related work

To provide context for our work, this section describes related research on semi-supervised techniques in community finding, along with studies that address noisy pairwise constraints in both clustering and community finding.

Semi-supervised learning in community finding

Several types of prior knowledge have been used in semi-supervised strategies to guide the community detection process. The most widely-used approach has been to employ pairwise constraints, either must-link or cannot-link, which indicate that either two nodes must be in the same community or must be in different communities. This strategy has been implemented via several algorithms, including modularity-based methods (Li et al. 2014), spectral partitioning methods (Habashi et al. 2016; Zhang 2013), a spin-glass model (Eaton and Mansbach 2012), matrix factorization methods (Shi et al. 2015), and various other methods (Yang et al. 2017; Zhang et al. 2019). Such approaches have often provided significantly better results on benchmark data, when compared to standard unsupervised algorithms.

Other authors have used different kinds of prior knowledge to provide supervision for community detection. For instance, Ciglan and Nørvåg (2010) developed an algorithm for finding communities with size constraints, where the upper limit size of communities is given as a user-specified input. This algorithm is based on standard label propagation methods for finding disjoint communities. In Wu et al. (2016) an optimization algorithm based on density constraints was proposed. This algorithm constructs an initial skeleton of the community structure by maximizing a criterion function that incorporates constraints to only find communities with intra-cluster densities above a given threshold. The remaining nodes are subsequently classified with respect to this skeleton. Other algorithms have used node labels as prior knowledge to improve the performance of community detection, using an approach which resembles traditional training data in classification (Leng et al. 2013; Liu et al. 2014; Wang et al. 2015). Liu et al. (2015) developed a method that uses a semi-supervised label propagation algorithm based on node labels and negative information, where a node is deemed not to belong to a specific community.

The majority of algorithms in this area have been designed to only find non-overlapping communities, where each node can only belong to a single community. However, many real-world networks naturally contain overlapping community structure (Adamcsek et al. 2006). To the best of our knowledge, little work has been done in the context of finding overlapping communities from a semi-supervised perspective. Dreier et al. (2014) performed some initial work here, using supervision for the purpose of algorithm initialization. Specifically, a small set of seed nodes was selected, whose affinities to a community were provided as prior knowledge in order to infer the affinities of the remaining nodes in the network. On the other hand, Shang et al. (2017) used an expansion method that classifies edges into communities, where this model is trained on a set of predefined seeds. However, there is no external human supervision used during the seed selection or expansion processes. In contrast, for our study, we focus on the problem of semi-supervised community detection based on external supervision by humans who are either part of the network or domain experts, and we encode this supervision as pairwise constraints, since these have proven to be effective in a range of other learning contexts (Basu et al. 2004c; Greene and Cunningham 2007).

Noisy constraints in clustering and community finding

Various algorithms have been proposed for the general task of pairwise constrained clustering, based on a variety of different clustering paradigms (e.g. Basu et al. 2004d; Davidson and Ravi 2005; Li et al. 2009). However, most assume the existence of “perfect” pairwise constraints which will be clean and will not contradict one another. Fewer studies have considered the requirement to handle noisy pairwise constraints. However, some relevant work in clustering has involved the development of new algorithms which are robust to noisy or conflicting pairwise constraints (Basu et al. 2004b; Coleman et al. 2008; Liu et al. 2007; Pelleg and Baras 2007). Other studies have introduced new metrics to assess the quality of constraints, considering aspects such as their informativeness and coherence (Davidson et al. 2006; Wagstaff et al. 2006). These can be used to filter or clean the pairwise constraints prior to clustering. A related study (Zhu et al. 2015) proposed an approach for handling noise by using a random forest classifier to identify incoherent constraints.

In contrast, in the field of semi-supervised community finding, the issue of noisy pairwise constraints has rarely been studied, and algorithms generally assume the veracity of any supervision supplied by an oracle. One related study from Li et al. (2014) initiated the work of handling “conflicting” pairwise constraints in non-overlapping community finding. That is, cases where ($v_{i}, v_{j}$) $\in$ must-link, ($v_{i}, v_{k}$) $\in$ must-link, and ($v_{j}, v_{k}$) $\in$ cannot-link. Such cases of conflict were identified using a dissimilarity index metric to measure the reliability of constraint pairs. However, this type of constraint conflict is in fact legitimate in the context of overlapping communities, as shown in our previous work in Alghamdi and Greene (2018). Therefore, the challenge remains of handling noisy constraints for overlapping community finding in an appropriate manner, which we seek to address in the next section.
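The "conflicting" pattern described above can be made concrete with a short sketch. The code below is purely illustrative and is not the dissimilarity-index method of Li et al. (2014); the function name `find_conflict_triples` and the toy constraint sets are assumptions introduced here. It only enumerates triples in which two must-link pairs share a node while the remaining pair is labelled cannot-link.

```python
from itertools import combinations


def find_conflict_triples(must_link, cannot_link):
    """Return triples (i, j, k) with (i, j), (i, k) in ML and (j, k) in CL.

    Hypothetical helper for illustration; pairs are treated as unordered
    and are assumed to join two distinct nodes.
    """
    ml = {frozenset(pair) for pair in must_link}
    cl = {frozenset(pair) for pair in cannot_link}

    # Map each node to the set of nodes it is must-linked with.
    partners = {}
    for pair in ml:
        a, b = tuple(pair)
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    triples = []
    for i, linked in partners.items():
        for j, k in combinations(sorted(linked), 2):
            if frozenset((j, k)) in cl:
                triples.append((i, j, k))
    return sorted(triples)


# Toy constraint sets (invented for this example).
must_link = [("a", "b"), ("a", "c"), ("d", "e")]
cannot_link = [("b", "c"), ("a", "e")]
conflicts = find_conflict_triples(must_link, cannot_link)
print(conflicts)
```

In a non-overlapping setting every triple returned here would signal an inconsistency. In an overlapping setting the same triple can be perfectly valid, which is why this kind of check cannot be used directly to detect noise.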

Methods

Overview

Before describing our proposed architecture, we first provide a formal definition for the pairwise constraints which are used in this study. These definitions map to those which are widely adopted in the wider semi-supervised learning literature (Chapelle et al. 2006). Given a set of nodes V in a network, we define two constraint types:

1. A must-link constraint specifies that two nodes should be assigned to the same community. Let ML be the must-link constraint set: $\forall$ $v_{i}, v_{j}$ $\in$ V where i $\ne$ j, then the constraint ($v_{i}, v_{j}$) $\in$ ML indicates that two nodes $v_{i}$ and $v_{j}$ must be assigned to the same community.
2. A cannot-link constraint specifies that two nodes should not be assigned to the same community. Let CL be the cannot-link constraint set: $\forall$ $v_{i}, v_{j}$ $\in$ V where i $\ne$ j, then the constraint ($v_{i}, v_{j}$) $\in$ CL indicates that $v_{i}$ and $v_{j}$ must be assigned to two different communities.

As discussed in Alghamdi and Greene (2019), implementing pairwise constraints in the context of overlapping communities is challenging because must-link constraints are no longer transitive. In the case of non-overlapping communities, must-link constraints have a transitive property, where a third must-link relationship can be inferred from two other associated must-link constraint pairs. For instance, if ($v_{i}, v_{j}$) $\in$ ML, and ($v_{i}, v_{k}$) $\in$ ML, then we can also infer that ($v_{j}, v_{k}$) $\in$ ML. This property does not hold for overlapping communities. For instance, node $v_{i}$ might be an overlapping node and in this case there are two possible scenarios for the pair ($v_{j}, v_{k}$): (1) ($v_{j}, v_{k}$) $\in$ CL where node $v_{i}$ is an overlapping node that has a must-link constraint with both $v_{j}$ and $v_{k}$, yet these two nodes belong to two different communities; (2) ($v_{j}, v_{k}$) $\in$ ML where all three nodes are in fact in the same community. This problem has been addressed in detail in Alghamdi and Greene (2019) and therefore is not the main focus of this paper.
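The loss of transitivity can be checked on a three-node toy example. The sketch below assumes, for illustration only, that a perfect oracle labels a pair as must-link when the two nodes share at least one ground-truth community and as cannot-link otherwise; the `membership` dictionary and the `label_pair` helper are hypothetical.

```python
def label_pair(membership, u, v):
    """Assumed oracle rule: ML if u and v share a community, otherwise CL."""
    return "ML" if membership[u] & membership[v] else "CL"


# Ground-truth communities of each node; v_i belongs to both A and B.
membership = {
    "v_i": {"A", "B"},
    "v_j": {"A"},
    "v_k": {"B"},
}

labels = {
    (u, v): label_pair(membership, u, v)
    for u, v in [("v_i", "v_j"), ("v_i", "v_k"), ("v_j", "v_k")]
}
print(labels)
```

Here both pairs involving the overlapping node are must-link, yet the remaining pair is cannot-link, which corresponds to scenario (1) above and is consistent with the ground truth.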

Now we describe our proposed general architecture for semi-supervised community detection which incorporates a methodology to reduce the presence of noisy pairwise constraints using an outlier detection model, as illustrated in Fig. 1. This architecture begins with a set of noisy pairwise constraints provided by a human oracle ($PC-$). The set of noisy pairwise constraints ($PC-$) is composed of must-link ($ML-$) and cannot-link ($CL-$) constraints. These constraints are cleaned to produce a revised set of constraints ($PC+$) (composed of must-link ($ML+$) and cannot-link ($CL+$) constraints) which are fed into the community finding process. The proposed architecture consists of three distinct phases:

1. Phase 1: Feature extraction. After receiving a set of pairwise constraints ($PC-$) from a potentially-noisy oracle, feature vectors are constructed to provide inputs to the outlier detection models, with one vector per constraint pair (for both must-link and cannot-link). These vectors encode various aspects of the relationship between a pair of nodes according to the underlying network topology. These features include standard measures based directly on the network, including: whether the pair of nodes share an edge, their number of common neighbors, the shortest path length between them, and their cosine similarity. We also include more complex features, such as their SimRank similarity (Jeh and Widom 2002), and their similarity as computed on a node2vec embedding generated on the network (Grover and Leskovec 2016).
2. Phase 2: Identifying noisy constraints. This involves executing two parallel processes that use two different outlier detection models to separately eliminate noise from the original must-link set ($ML-$) and cannot-link set ($CL-$). The constructed feature vectors are fed into each model for multiple iterations of cleaning, returning a score for each constraint that determines whether or not it is an outlier (i.e. a noisy constraint).
3. Phase 3: Applying the semi-supervised community detection process. The returned clean pairwise constraint set ($PC+: ML+,CL+$) is passed to a semi-supervised community detection algorithm to be used during the process of finding communities.
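To illustrate Phase 1, the following sketch computes the four simple topological features for a node pair on a small undirected graph stored as adjacency sets. It is a minimal example under stated assumptions rather than the implementation used in this study: the helper names are hypothetical, cosine similarity is assumed to be computed on the two neighbor sets, and the SimRank and node2vec features are omitted.

```python
from collections import deque
from math import sqrt


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns None when target is unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj[node]:
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None


def pair_features(adj, u, v):
    """Feature vector (as a dict) for one constraint pair (u, v)."""
    common = adj[u] & adj[v]
    denom = sqrt(len(adj[u]) * len(adj[v]))
    return {
        "has_edge": int(v in adj[u]),
        "common_neighbors": len(common),
        "shortest_path": shortest_path_length(adj, u, v),
        # Assumed definition: cosine similarity of the binary neighbor sets.
        "cosine": len(common) / denom if denom else 0.0,
    }


# Toy graph (invented for this example).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

features = pair_features(adj, 1, 2)
print(features)
```

One such vector would be built per must-link or cannot-link pair and passed to the outlier detection model for that constraint type.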

In the following sections, we describe the details of the proposed architecture in terms of the outlier detection methods used to identify potentially-noisy constraints (Section “Outlier detection methods”), the different variations of the second phase of the architecture shown in Fig. 1 (see Section “Process for identifying noisy constraints”), and the implementation of the proposed architecture in the context of the AC-SLPA community finding algorithm (see Section “AC-SLPA with noise identification”).

Outlier detection methods

Isolation Forests: This method, proposed by Liu et al. (2008), uses a tree-based ensemble strategy for anomaly detection. The assumption underlying this method is that anomalies will be isolated earlier in their trees as these examples are not only rare, but also have feature values substantially different from the normal data. Random partitions are used in order to separate examples, with the number of partitions acting as the path length. Because anomalous feature values are assumed to differ significantly from those of normal examples, these features will more easily split anomalous examples from normal examples early on in the tree, leading to a shorter path. This shortening effect is compounded by the fact that these examples are also assumed to be rare. In order to compute an anomaly score, the average path length over multiple trees is computed, and normalized by the average path length over all paths. Scores close to 1 are said to be anomalous and scores close to 0 are assumed normal. This algorithm fits the problem of noise detection when there are far fewer noisy labels than normal examples.
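The normalization step can be written out explicitly. The sketch below follows the score defined by Liu et al. (2008), $s(x, n) = 2^{-E(h(x))/c(n)}$ with $c(n) = 2H(n-1) - 2(n-1)/n$, where $H$ is the harmonic number. The function names are hypothetical, the harmonic number is approximated by $\ln(n-1)$ plus the Euler-Mascheroni constant, and the building of the trees themselves is not shown.

```python
from math import log

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """c(n): expected path length of an unsuccessful search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # exact value; the logarithmic approximation is poor here
    harmonic = log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Isolation forest score in (0, 1); values near 1 suggest an anomaly."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n = 256  # sub-sample size used to build each tree (example value)
short_path_score = anomaly_score(2.0, n)   # isolated after few splits
long_path_score = anomaly_score(12.0, n)   # needs many splits to isolate
print(short_path_score, long_path_score)
```

A constraint whose feature vector is isolated after only a few random splits therefore receives a higher score than one that requires many splits.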

One-class SVM: One class Support Vector Machine (OCSVM) (Schölkopf et al. 2000, 2001) is a commonly-used method for anomaly detection which extends support vector algorithms to one-class classification. The reference to “one class” here refers to the assumption that primarily data from the normal class (i.e. non-outliers) will be modeled during training. First, data is transformed by a map $\phi$ to a higher dimensional space by evaluating a kernel function. The algorithm then seeks to find the separating hyperplane in the kernel space between data and the origin with the largest margin. This is achieved by solving the following quadratic program for given training examples ${\pmb {x}}_1,{\pmb {x}}_2,\ldots ,{\pmb {x}}_l$:

$$\begin{aligned} \min \frac{1}{2} \left\| w \right\| ^2 + \frac{1}{\nu l} \sum ^l_i \xi _i - p \end{aligned}$$

(1)

subject to

$$\begin{aligned} (w \cdot \phi ({\pmb {x}}_i)) \geqslant p - \xi _i, \xi _i \geqslant 0 \end{aligned}$$

(2)

where w and p solve the problem. Here, $\xi _i$ refers to the slack variable for a given example ${\pmb {x}}_i$ which softens the margin, allowing for some points to reside outside the margin, essentially relaxing the assumption of complete separability between normal and outlying data. The hyperparameter $\nu \in (0,1)$ controls the number of outliers, with smaller values allowing outliers to have a greater effect on the decision function. The decision function is given by

$$\begin{aligned} f({\pmb {x}})= sgn((w \cdot \phi ({\pmb {x}})) - p) \end{aligned}$$

(3)

where sgn(z) outputs a value of $+1$ for $z \geqslant 0$, indicating normal data and $-1$ otherwise, indicating an outlier.
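As a small illustration of Eqs. (2) and (3), the sketch below evaluates the decision function and the slack of a point for a given solution (w, p). It assumes, for simplicity only, that the map $\phi$ is the identity (a linear kernel), and the values of `w` and `p` are invented rather than obtained by solving the quadratic program.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def decision(w, p, x, phi=lambda v: v):
    """Eq. (3): +1 for normal data, -1 for an outlier."""
    return 1 if dot(w, phi(x)) - p >= 0 else -1


def slack(w, p, x, phi=lambda v: v):
    """Smallest xi >= 0 satisfying the constraint in Eq. (2) for point x."""
    return max(0.0, p - dot(w, phi(x)))


# Hypothetical solution of the quadratic program (not fitted to any data).
w, p = [1.0, 1.0], 1.0

normal_point = [0.8, 0.7]
outlying_point = [0.1, 0.2]
print(decision(w, p, normal_point), slack(w, p, normal_point))
print(decision(w, p, outlying_point), slack(w, p, outlying_point))
```

Points on the normal side of the hyperplane need no slack, while a point on the other side incurs a positive slack that is penalized in Eq. (1).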

Local Outlier Factor: This method is based on the concept of local density in detecting outliers. Given a particular point p, we measure the density of p with respect to the density of its k nearest neighbors. Intuitively, if the local density of p is lower than the local densities of its neighbors, this indicates that p is an outlier. As discussed in Breunig et al. (2000), for a given neighborhood size k, the k-distance(p) for a point p is defined as the distance between p and its k-th neighbor o (i.e. the k-th closest point to p). The k-distance neighborhood $N_k(p)$ is the set of points whose distances do not exceed the k-distance(p). The reachability distance is then defined as:

$$\begin{aligned} reachdist_k(p,o) = max\{k-distance(o), d(p,o)\} \end{aligned}$$

(4)

This means that if p lies within the k-distance neighborhood of o, the k-distance of o is returned; otherwise, the true distance between p and o is returned. In order to calculate the densities of different clusters of points, the “local reachability density” $lrd_k$ is calculated:

$$\begin{aligned} lrd_k(p) = 1/ \frac{\sum _{o \in N_k(p)} reachdist_k(p,o) }{ |N_k(p)|} \end{aligned}$$

(5)

Finally, the local outlier factor (LOF) of point p is defined as:

$$\begin{aligned} LOF_k(p) = \frac{\sum _{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)} }{|N_k(p)|} \end{aligned}$$

(6)
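Equations (4) to (6) can be traced on a handful of points with the brute-force sketch below. It is illustrative only: the function name `lof_scores` and the toy data are assumptions, all pairwise distances are computed explicitly, and the points are assumed to be distinct so that no reachability sum is zero.

```python
from math import dist


def lof_scores(points, k):
    """Local outlier factor of every point, following Eqs. (4) to (6)."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]

    k_distance, neighbors = [], []
    for i in range(n):
        others = sorted((d[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_distance.append(kd)
        # N_k(p): every point within the k-distance (ties may give more than k).
        neighbors.append([j for dj, j in others if dj <= kd])

    def reach_dist(i, o):
        # Eq. (4): reachability distance of point i with respect to o.
        return max(k_distance[o], d[i][o])

    # Eq. (5): local reachability density.
    lrd = [
        len(neighbors[i]) / sum(reach_dist(i, o) for o in neighbors[i])
        for i in range(n)
    ]
    # Eq. (6): average ratio of the neighbors' densities to the point's own.
    return [
        sum(lrd[o] / lrd[i] for o in neighbors[i]) / len(neighbors[i])
        for i in range(n)
    ]


# Four points on a unit square plus one distant point (invented data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print(scores)
```

The four clustered points obtain a score of about 1, while the distant point obtains a score well above 1, marking it as an outlier.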

Autoencoders: An autoencoder (AE) represents a type of neural network architecture that attempts to reconstruct a given input in an effort to learn an informative latent feature representation. Formally, for an input vector x, an attempt is made to find a mapping from x to a reconstruction of itself $x^\prime$. By doing this, a latent representation of the data is created in the hidden layer(s) of the network. The general form of a single hidden layer autoencoder is as follows:

$$\begin{aligned} f(x) =\sigma (x,W^e), \quad g(z) =\sigma (z,W^d), \quad \text {and}\quad x ^\prime =g(f(x)) \end{aligned}$$

(7)

where f(x) is the encoder function for input x, g(z) is the decoder function for encoding z, $\sigma$ is a non-linear function, $W^e$ and $W^d$ are weight matrices for the encoder and decoder respectively and $x ^\prime$ is the reconstruction of the input vector (Goodfellow et al. 2016).

These networks can use a “bottleneck” configuration where the hidden layer(s) of the network compress the data (Goodfellow et al. 2016). The network is trained by minimizing the mean squared error (MSE) between the reconstruction and the input, as shown in Eq. (8):

$$\begin{aligned} MSE(x,x ^\prime ) = \frac{1}{n}\sum _{i=1}^{n}{(x_{i}-x_{i}^\prime )^2} \end{aligned}$$

(8)

Additionally, autoencoders can be constrained to enforce sparsity in the network and therefore no longer require a compressed network capacity. One type of constrained autoencoder adds a sparsity penalty to hidden representations by constraining their absolute value. This penalty term is weighted and added to the cost function. The constrained cost is defined as:

$$\begin{aligned} MSE(x,x ^\prime ) = \frac{1}{n}\sum _{i=1}^{n}{(x_{i}-x_{i}^\prime )^2} + \lambda \sum _{i}|h_i| \end{aligned}$$

(9)

where $\lambda$ is the sparsity penalty and $h = f(x)$ (Goodfellow et al. 2016).
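The forward pass of Eq. (7) and the two costs of Eqs. (8) and (9) can be sketched in a few lines. This is a toy illustration under stated assumptions: $\sigma (v, W)$ is taken to be a sigmoid applied to the product of W and v, biases are omitted, the weights are arbitrary hand-picked values rather than trained ones, and all function names are hypothetical.

```python
from math import exp


def layer(v, weights):
    """sigma(v, W): sigmoid of each row of W dotted with v (assumed form)."""
    return [
        1.0 / (1.0 + exp(-sum(w * a for w, a in zip(row, v))))
        for row in weights
    ]


def mse(x, x_rec):
    """Eq. (8): mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)


def sparse_cost(x, x_rec, h, lam):
    """Eq. (9): reconstruction error plus a weighted L1 penalty on h = f(x)."""
    return mse(x, x_rec) + lam * sum(abs(v) for v in h)


# Untrained example weights: 3 inputs -> 2 hidden units -> 3 outputs.
W_e = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W_d = [[0.7, -0.3], [0.2, 0.4], [-0.6, 0.9]]

x = [1.0, 0.0, 1.0]
h = layer(x, W_e)        # encoder f(x)
x_rec = layer(h, W_d)    # decoder g(f(x))
print(mse(x, x_rec), sparse_cost(x, x_rec, h, lam=0.1))
```

Training would adjust the two weight matrices to minimize one of these costs over all feature vectors; that optimization step is not shown here.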

Autoencoders can be used in a number of capacities. In this work, we propose a number of techniques for noise detection from pairwise constraint sets which make use of autoencoders in different ways. Firstly, we show that autoencoders can be used as an effective outlier detection technique for noise detection in pairwise constraints. Secondly, we demonstrate that autoencoders can also be used as an embedding method to support other outlier detection methods in the identification of noisy constraints.

Process for identifying noisy constraints

In this section, we describe a number of alternative cleaning processes for reducing noise in pairwise constraints, before passing them to a semi-supervised community detection algorithm. These cleaning processes employ some of the outlier detection models described in Section “Outlier detection methods”. It is important to note that pairwise constraints are of two distinct types: must-link and cannot-link. The differences in their respective distributions, which can be seen in Fig. 2, motivate the use of two separate cleaning processes and the exploration of different outlier detection models for each. The selection of models is based on the best performance in detecting noise in constraints, as illustrated in the evaluation section.

In this study, we explore the implementation of the following cleaning processes which are classified into four categories based on the employed outlier detection model:

1. Traditional outlier detection: In this process, a stand-alone outlier detection method is selected (e.g. isolation forest, one-class SVM, local outlier factor) to identify noise in must-link and cannot-link sets separately. The input features are passed to these models, which then return a binary score for each constraint which determines whether or not it is a noisy constraint. See Fig. 3 for an illustration.
2. Outlier detection via deep embedding: Here the neural network autoencoder (AE) is used as an additional component to provide an embedding function for a traditional outlier detection method. In this case, only the encoder function from the autoencoder model is used. After feeding the feature vectors into the encoder function, the model learns to effectively compress the input feature vector into an informative latent feature representation in the hidden layer. This latent representation is then used as an input to an outlier detection method such as isolation forest, one-class SVM, or local outlier factor, which returns a binary score that identifies the noisy constraints. See Fig. 4 for an illustration. This process is conducted for must-link and cannot-link pairs separately with different encoder functions and outlier detection methods. The selection of models is based on experimental results as illustrated in the evaluation section.
3. Deep learning approach: In this case, the neural network autoencoder (AE) is used as an outlier detection technique for identifying noise in pairwise constraints. Different autoencoder models are used for must-link and cannot-link pairs separately. The feature vectors are fed into the autoencoder model, which learns to reconstruct the original constraints from the latent representation. The reconstruction error is given by the difference between the original constraints and the reconstruction. A large error is indicative of an outlier (i.e. a noisy constraint), while a low error indicates a “normal” example (i.e. a correctly-labelled constraint). Finally, we sort the constraints in ascending order (lowest to highest error) in order to determine the top k constraints with the lowest level of error. The expectation is that, as the larger part of pairwise constraints are non-noisy, the autoencoder’s latent representation will be biased towards these examples. This makes the model somewhat robust to outliers. Based on this property, it is then assumed that examples which are noisy will have a high reconstruction error. See Fig. 5 for an illustration of the process.
4. Hybrid cleaning process: For each of the cleaning processes described above, we use separate processes of the same category to identify noise in must-link and cannot-link pairs. However, in this process, we investigate a combination of processes from different categories for must-link and cannot-link pairs. Based on initial experiments, a neural network based cleaning process performed better for must-link pairs than for cannot-link pairs. On the other hand, using outlier detection via deep embedding for cannot-link pairs was found to yield better noise detection performance, when compared to using an autoencoder alone. See Fig. 6 for an illustration of the process.
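The ranking step of the deep learning approach amounts to a sort followed by a cut-off. The sketch below assumes that a reconstruction error has already been computed for every constraint; the function name `keep_lowest_error`, the example errors, and the value of k are all invented for illustration.

```python
def keep_lowest_error(constraints, errors, k):
    """Sort by reconstruction error (ascending) and keep the first k.

    Returns the retained constraints and the ones flagged as noisy.
    """
    ranked = sorted(zip(errors, constraints), key=lambda item: item[0])
    kept = [constraint for _, constraint in ranked[:k]]
    flagged = [constraint for _, constraint in ranked[k:]]
    return kept, flagged


# Example must-link pairs with made-up reconstruction errors.
constraints = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "e")]
errors = [0.02, 0.31, 0.05, 0.04]

kept, flagged = keep_lowest_error(constraints, errors, k=3)
print(kept, flagged)
```

The same routine would be applied separately to the must-link and cannot-link sets, each with its own autoencoder and its own choice of k.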

We see from Fig. 2 that the distributions of correct labels and noisy constraints are more complex in the case of cannot-link constraints; that is, there is a high overlap between the correct and noisy groups. Separating these groups requires a more complex function, as compared to the equivalent case for must-link constraints, which are relatively easy to separate.

AC-SLPA with noise identification

Now we discuss the implementation of the general architecture discussed in Section “Overview” in the context of the existing AC-SLPA algorithm (Alghamdi and Greene 2018) in order to create a robust active semi-supervised SLPA algorithm that can handle the presence of noisy pairwise constraints. The new modified AC-SLPA consists of three stages. The first two stages include the pairwise constraints cleaning process, which are executed iteratively as follows:

Stage 1. Detecting noise in constraints during selection and annotation. At each iteration of AC-SLPA, informative pairs of nodes are selected using the Node Pair Selection method (Alghamdi and Greene 2019) and passed to the noisy oracle to be labelled as pairwise constraints. After generating a set of noisy pairwise constraints ($PC-$), this set is passed to the process of identifying noisy constraints for multiple sub-iterations of cleaning. As a new set of constraints is introduced at each iteration, the outlier detectors are retrained at each one of these iterations and reapplied to the remaining set of constraints. The output constraints of this process are then used to apply the PC-SLPA algorithm. At the end of each run of AC-SLPA, the cleaned pairwise constraint set ($PC+$) is accumulated and mixed with the new chunk of noisy pairwise constraints ($PC-$) in the next iteration. The larger the constraint set passed to the outlier detection model, the better the performance.

Stage 2. Rechecking discarded pairwise constraints. The previous stage of cleaning may result in a number of non-noisy constraints being labelled as noisy. This is more likely to happen when the distribution of noisy constraints overlaps heavily with that of non-noisy constraints. The second stage is designed to recheck the discarded pairwise constraints set ($PC-$), which may contain constraints that were mislabelled as noise, by passing them to the process of identifying noisy constraints for further iterations of cleaning, thus reducing any wastage of the annotation budget. The returned set of constraints from this process is added to the accumulated cleaned pairwise constraints set ($PC+$) from stage 1.

Stage 3. Apply PC-SLPA. The final stage involves applying the semi-supervised community detection process PC-SLPA using the final accumulated cleaned pairwise constraints ($PC+$) obtained from the previous two stages, thus producing a final set of communities. The complete architecture is summarized in Algorithm 1.
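The control flow of the three stages can be outlined as follows. This is only a schematic reading of the description above and not a reproduction of Algorithm 1: `run_pipeline`, `clean`, and `detect_communities` are hypothetical names, node pair selection and oracle querying are replaced by pre-supplied batches, and a single constraint type is handled (the process would be run once for must-link and once for cannot-link constraints).

```python
def run_pipeline(batches, clean, detect_communities, recheck_rounds=1):
    """Schematic three-stage loop.

    clean(constraints) is assumed to retrain an outlier detector on its
    input and return (kept, dropped); detect_communities(constraints)
    stands in for a run of PC-SLPA.
    """
    accepted, discarded = [], []

    # Stage 1: mix each new noisy batch with the accumulated clean set.
    for batch in batches:
        kept, dropped = clean(accepted + batch)
        accepted = kept
        discarded = discarded + dropped
        detect_communities(accepted)  # intermediate community detection run

    # Stage 2: recheck the discarded constraints to recover false alarms.
    for _ in range(recheck_rounds):
        recovered, discarded = clean(discarded)
        accepted = accepted + recovered

    # Stage 3: final community detection on the cleaned constraints.
    return detect_communities(accepted), accepted


# Toy stand-ins: constraints carry a made-up outlier score, and the
# "cleaner" simply thresholds it.
def toy_clean(constraints):
    kept = [c for c in constraints if c[1] < 0.5]
    dropped = [c for c in constraints if c[1] >= 0.5]
    return kept, dropped


batches = [
    [(("a", "b"), 0.1), (("a", "c"), 0.9)],
    [(("b", "d"), 0.2)],
]
result, cleaned = run_pipeline(batches, toy_clean, detect_communities=len)
print(result, cleaned)
```

With a real detector, the recheck in stage 2 can recover constraints because the model is retrained on a different set of examples than in stage 1; the fixed threshold used in this toy example recovers nothing.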

Evaluation

In this section, we describe the datasets and experimental configuration used to validate our proposed method for handling noisy constraints. We conduct four experiments to show its effectiveness, which are applied to synthetic benchmark networks of different sizes with overlapping communities, and real-world networks. Our objectives are as follows: (1) to quantify the ability of each constraint cleaning process to detect noisy constraints prior to community finding; (2) to choose the best architectures of autoencoder to use as a deep embedding function for outlier detection models; (3) to compare all types of constraint cleaning processes after integration with AC-SLPA, in order to evaluate the end-to-end performance of the complete architecture; (4) to examine the performance of the method on real-world data.

Datasets

Synthetic data. We constructed a diverse set of 64 benchmark synthetic networks using the widely-used LFR generator (Lancichinetti et al. 2008). These networks vary in terms of number of nodes $N \in [1000, 5000]$, communities per node (overlapping diversity) $O_m \in [2,8]$, and the fraction of nodes belonging to multiple communities (overlapping density) $O_n \in \{10\%,50\% \}$. These networks contain either small communities (10 to 50 nodes), or large communities (20 to 100 nodes). The mixing parameter $\mu$, which controls the level of community overlap, varies from 0.1 to 0.3. Details of the network generation parameters are in Table 1.

Table 1 Parameter ranges used for the generation of LFR synthetic networks
