Hypergraph clustering by iteratively reweighted modularity maximization

Learning on graphs is a subject of great interest due to the abundance of relational data from real-world systems. Many of these systems involve higher-order (super-dyadic) interactions rather than mere pairwise (dyadic) relationships; examples include co-authorship, co-citation, and metabolic reaction networks. Such super-dyadic relations are more adequately modeled using hypergraphs than graphs. Learning on hypergraphs has thus been garnering increased attention, with potential applications in network analysis, VLSI design, and computer vision, among others. Hypergraph clustering in particular is gaining attention because of its many applications, such as component placement in VLSI, group discovery in bibliographic systems, and image segmentation in computer vision. For clustering on graphs, modularity maximization is known to work well in the pairwise setting. Our primary contribution in this article is a generalization of the modularity maximization framework to clustering on hypergraphs. In doing so, we introduce a null model for graphs generated by hypergraph reduction and prove its equivalence to the configuration model for undirected graphs. The proposed graph reduction technique preserves the node degree sequence of the original hypergraph. The modularity function can then be defined on the reduced graph and maximized using any standard modularity maximization method, such as the Louvain method. We additionally propose an iterative technique that refines the obtained clusters. We demonstrate both the efficacy and efficiency of our methods on several real-world datasets.

is insufficient to capture higher-order information and present it for analysis or learning tasks.
These systems can be more precisely modeled using hypergraphs, where nodes represent the interacting components and hyperedges capture higher-order interactions (Bretto et al. 2013; Klamt et al. 2009; Satchidanand et al. 2014; Lung et al. 2018). A hyperedge can capture a multi-way relation; for example, in a co-authorship network where nodes represent authors, a hyperedge could represent a group of authors who collaborated on a common paper. If this were modeled as a graph, we would be able to see which two authors are collaborating, but would not see whether multiple authors worked on the same paper. This suggests that the hypergraph representation is not only more information-rich but is also conducive to higher-order learning tasks by virtue of its structure. Indeed, research interest in learning on hypergraphs has recently been expanding (Zhang et al. 2018; Kumar et al. 2020; Zhao et al. 2018; Saito et al. 2018; Feng et al. 2018; Chodrow and Mellor 2019).
Analogous to the graph clustering task, hypergraph clustering seeks to discover densely connected components within a hypergraph (Schaeffer 2007). It has been studied by several research communities, with applications such as VLSI placement (Karypis and Kumar 1998), discovering research groups (Kamiński et al. 2019), image segmentation (Kim et al. 2011), de-clustering for parallel databases (Liu and Wu 2001), and modeling eco-biological systems (Estrada and Rodriguez-Velazquez 2005), among others. A few early works on hypergraph clustering (Leordeanu and Sminchisescu 2012; Bulo and Pelillo 2013; Agarwal et al. 2005; Shashua et al. 2006; Liu et al. 2010) are confined to k-uniform hypergraphs, where each hyperedge connects exactly k nodes. However, most real-world hypergraphs have arbitrary-sized hyperedges, which makes these methods unsuitable for several practical applications. Within the machine learning community, Zhou et al. (2007) were among the earliest to look at learning on non-uniform hypergraphs. They sought to support spectral clustering methods (see, for example, Shi and Malik (2000); Ng et al. (2002)) on hypergraphs and defined a suitable hypergraph Laplacian for this purpose. This effort, like many other existing methods for hypergraph learning, makes use of a reduction of the hypergraph to a graph (Agarwal et al. 2006) and has led to follow-up work (Louis 2015). Spectral methods involve expensive computations to determine the eigenvector (multiple eigenvectors in the case of multiple clusters), which makes them less suitable for large hypergraphs.
An alternative methodology for clustering on simple graphs (those with just dyadic relations) is modularity maximization (Newman 2006). This class of methods, in addition to providing a useful metric for evaluating cluster quality through the modularity function, also returns the number of clusters automatically and avoids the expensive eigenvector computation step typically associated with other popular methods such as spectral clustering. In practice, a greedy optimization algorithm known as the Louvain method (Blondel et al. 2008) is commonly used, as it is known to be fast and scalable and can operate on large graphs.
Kindly note that this paper is a significantly extended version of our work titled "A New Measure of Modularity in Hypergraphs: Theoretical Insights and Implications for Effective Clustering" (Kumar et al. 2019), presented at the 8th International Conference on Complex Networks and their Applications.
However, extending the modularity function to hypergraphs is a non-trivial task, as a node-degree preserving null model would be required, analogous to the graph setting. A straightforward procedure would be to leverage clique reduction, to reduce a hypergraph to a simple graph and then apply a conventional modularity-based solution. Such an approach ignores the underlying super-dyadic nature of interactions and thus loses critical information. Additionally, a clique reduction method would not preserve the node degree sequence of the original hypergraph, which is vital for the null model that modularity maximization techniques are typically based on.
Recently, there have been several attempts to define null models on hypergraphs. Chodrow (2019) proposed a Markov chain Monte Carlo based method, in which random hypergraphs are generated by pairwise reshuffling of the edges in the bipartite projection. A more recent study generalizes the celebrated Chung-Lu random graph model (Chung and Lu 2002) to hypergraphs and employs it to solve the problem of hypergraph clustering (Kamiński et al. 2019). The hypergraph modularity objective proposed by Kamiński et al. (2019) only counts the participation of hyperedges completely contained inside a cluster. Though this assumption enables the analytic tractability of the solution, it limits its applicability to real-world hypergraphs, where hyperedges can be of arbitrary size. There exists a parallel line of inquiry where hypergraphs are viewed as simplicial complexes, and null models are defined through the preservation of topological features of interest (Giusti et al. 2016; Courtney and Bianconi 2016; Young et al. 2017). Such models make a strong assumption, that of subset-inclusion, which may not often hold in real-world data.
Unlike edges in graphs, there are different ways to cut a hyperedge. Depending on where a hyperedge is cut, the proportion and assignments of nodes on different sides of the cut will change, influencing the resultant clustering (Veldt et al. 2020). One way of incorporating information from the hypergraph's structural properties is to introduce weights along hyperedges, determined by a measure or function of the input data. For example, Satchidanand et al. (2015) used the Hellinger distance to weight hyperedges for transductive inference tasks. While this is a supervised metric, one can also consider unsupervised hyperedge weighting schemes that incorporate hyperedge information. Building on this idea, we make the following contributions in this work:
• ("Hypergraph modularity" section): We define a null model for graphs generated by hypergraph reduction that preserves the hypergraph node degree sequence. Using this null model and the proposed reduction, we define a modularity function that can be used in conjunction with the popular Louvain method to find hypergraph clusters.
• ("Iterative hyperedge reweighting" section): We propose a generic iterative refinement procedure for hypergraph clustering. This refinement is done by reweighting hyperedges and operates natively on the hypergraph structure.
• ("Evaluation on ground truth" section): We perform extensive experiments with the resultant algorithm, titled Iteratively Reweighted Modularity Maximization (IRMM), on a wide range of real-world datasets and demonstrate both its efficacy and efficiency over state-of-the-art methods. We empirically establish that the hypergraph based methods perform better than their graph-based counterparts.
• ("Results and analysis" section): We investigate the effect of the reweighting procedure and show that the proposed refinements indeed help us to achieve balanced hyperedge cuts. Furthermore, the experimental results demonstrate that the proposed iterative scheme achieves better results than the equivalent non-iterative methods on all datasets.
• ("Results and analysis" section): We examine the scalability of the hypergraph modularity maximization algorithm using synthetic data.

Hypergraphs
Let $V$ be a finite set of nodes and $E$ be a collection of subsets of $V$ that are collectively exhaustive. For $w \in \mathbb{R}_{+}^{|E|}$, $G = (V, E, w)$ is a hypergraph with vertex set $V$ and hyperedge set $E$. Each hyperedge $e$ has a positive weight $w(e)$ associated with it. We denote the number of vertices by $n = |V|$ and the number of hyperedges by $m = |E|$.
While a traditional graph edge has just two nodes, a hyperedge can connect multiple nodes. For a vertex $v$, we write its degree as $d(v) = \sum_{e \in E,\, v \in e} w(e)$. The degree of a hyperedge $e$ is the number of nodes it contains: $\delta(e) = |e|$.
The hypergraph incidence matrix $H$ is given by $h(v, e) = 1$ if vertex $v$ is in hyperedge $e$, and 0 otherwise. $W$, $D_v$ and $D_e$ are the hyperedge weight matrix, vertex degree matrix and edge degree matrix respectively; $W$ and $D_e$ are diagonal matrices of size $m \times m$, and $D_v$ is a diagonal matrix of size $n \times n$.
Clique Reduction: For a given hypergraph, one can compute its clique reduction (Hadley et al. 1992) by substituting each hyperedge with a clique induced by its node set. For a hypergraph with incidence matrix $H$, the adjacency matrix of its clique reduction can be written as
$$A = H W H^T.$$
To remove the self-loops, we may subtract $D_v$ from the above expression. The resultant clique reduction becomes
$$A^{clique} = H W H^T - D_v.$$
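The definitions above can be illustrated with a minimal Python sketch. The list-of-sets representation, the function names `node_degrees` and `clique_reduction`, and the toy hypergraph are assumptions made for illustration; they are not taken from the paper. Each hyperedge adds its weight to every pair of its nodes, which gives the off-diagonal entries of $H W H^T$; the diagonal is never filled in, which corresponds to subtracting $D_v$.

```python
from itertools import combinations

def node_degrees(n, hyperedges, weights):
    """d(v) = sum of w(e) over all hyperedges e that contain v."""
    d = [0.0] * n
    for e, w in zip(hyperedges, weights):
        for v in e:
            d[v] += w
    return d

def clique_reduction(n, hyperedges, weights):
    """Off-diagonal part of H W H^T, i.e. A_clique with self-loops removed."""
    A = [[0.0] * n for _ in range(n)]
    for e, w in zip(hyperedges, weights):
        for u, v in combinations(sorted(e), 2):
            A[u][v] += w
            A[v][u] += w
    return A

# Toy hypergraph (hypothetical): 4 nodes, hyperedges {0,1,2} and {2,3}.
hyperedges = [{0, 1, 2}, {2, 3}]
weights = [1.0, 2.0]
d = node_degrees(4, hyperedges, weights)
A_clique = clique_reduction(4, hyperedges, weights)
clique_degrees = [sum(row) for row in A_clique]
```

On this toy input the hypergraph degrees and the row sums of the clique reduction differ, which is the mismatch that motivates a degree-preserving reduction.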

Modularity
When clustering graphs, it is desirable to cut as few edges within a cluster as possible (or edges with lower weights in the case of weighted graphs). Modularity is a metric of clustering quality that measures whether the number of within-cluster edges is greater than its expected value. In Newman (2006), the modularity function is defined as
$$Q = \frac{1}{2m} \sum_{ij} B_{ij}\, \delta(g_i, g_j). \quad (1)$$
Here, $\delta(\cdot)$ is the Kronecker delta function, and $g_i$, $g_j$ are the clusters to which vertices $i$ and $j$ belong. The factor $\frac{1}{2m}$ will be dropped for the remainder of this work because it is a constant (with $m$ the number of edges) for a given graph and does not affect the maximization of $Q$. $B_{ij} = A_{ij} - P_{ij}$ is called the modularity matrix. $A_{ij}$ denotes the actual, and $P_{ij}$ the expected, number of edges between node $i$ and node $j$, given by a null model. For graphs, the configuration model (Newman 2010) is used, where edges are drawn randomly while keeping the node degrees preserved. For two nodes $i$ and $j$, with (weighted) degrees $k_i$ and $k_j$ respectively, the expected number of edges between them is hence given by
$$P_{ij} = \frac{k_i k_j}{\sum_{j \in V} k_j}.$$
Since the total number of edges in a given network is fixed, maximizing the number of within-cluster edges is the same as minimizing the number of between-cluster edges. This suggests that clustering can be achieved by modularity maximization. Kindly note that in this article, we focus on modularity as defined by Newman (2006). Other definitions of modularity (Courtney and Bianconi 2016) are not in the scope of this work.
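As a small illustration of the modularity function, the following Python sketch evaluates $Q$ for a given cluster assignment on a dense adjacency matrix. The function name `modularity` and the example graph (two triangles joined by a single edge) are hypothetical. The sketch keeps the $\frac{1}{2m}$ factor so that the value is directly comparable to commonly quoted modularity scores.

```python
def modularity(A, labels):
    """Q = (1/2m) * sum_ij (A_ij - k_i * k_j / 2m) over pairs with g_i == g_j."""
    n = len(A)
    k = [sum(row) for row in A]   # (weighted) node degrees
    two_m = sum(k)                # equals 2m, the sum of all degrees
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m

# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1.0

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one cluster per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one cluster
```

Placing all nodes in a single cluster yields a modularity of zero, because the observed and expected edge counts then cancel exactly.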

Hypergraph modularity
One possible way to define hypergraph modularity is to introduce a hypergraph null model and utilize it to define a modularity function. Kamiński et al. (2019) follow this approach and use a generalized version of the Chung-Lu model (Chung and Lu 2002) to define hypergraph modularity. Their modularity function only counts the participation of hyperedges entirely contained inside a cluster. Moreover, it requires separate processing of the hypergraphs induced by hyperedges of different cardinalities. Though such assumptions can provide analytic tractability, they limit applicability to real-world hypergraphs, which can be very large and have varying hyperedge cardinalities.
Another possible way to define hypergraph modularity is to convert the hypergraph to an appropriate graph and then define modularity on the resultant graph. Such an approach can benefit from the existing tools for graphs. In this section, we follow the latter approach to introduce hypergraph modularity.
To introduce the hypergraph modularity, we start by proposing a null model on the graphs generated by reducing hypergraphs. In a reduced graph, we desire the nodes to possess the same degree as in the original hypergraph. In such a reduced graph, the expected number of edges connecting nodes $i$ and $j$ can be given as
$$P^{hyp}_{ij} = \frac{d(i)\, d(j)}{\sum_{v \in V} d(v)}.$$
The proposed null model can be interpreted as a mechanism to generate random graphs where the node degree sequence of a given hypergraph is preserved irrespective of the count and cardinality of hyperedges. In order to define a modularity matrix, we need to obtain a graph reduction in which the node degree sequence remains preserved. One straightforward way could be to use a clique reduction of the original hypergraph. However, during clique reduction, the degree of a node in the resultant graph does not remain the same as its degree in the original hypergraph, as verified below.

Lemma 1 For the clique reduction of a hypergraph with incidence matrix $H$, the degree of a node $i$ in the reduced graph is given by
$$k_i = \sum_{e \in E} h(i, e)\, w(e)\, (\delta(e) - 1),$$
where $\delta(e)$ and $w(e)$ are the degree and weight of a hyperedge $e$ respectively.
Proof For the clique reduction, the adjacency matrix of the resultant graph is given by $A^{clique} = H W H^T$, that is, $A^{clique}_{ij} = \sum_{e \in E} h(i, e)\, h(j, e)\, w(e)$. In the resultant graph, each node has a self-loop that can be removed, since self-loops are not cut during the clustering process. This is achieved by explicitly setting $A^{clique}_{ii} = 0$ for all $i$. Considering this, the degree of a node $i$ in the resultant graph can be written as
$$k_i = \sum_{j \neq i} A^{clique}_{ij} = \sum_{e \in E} h(i, e)\, w(e) \sum_{j \neq i} h(j, e) = \sum_{e \in E} h(i, e)\, w(e)\, (\delta(e) - 1).$$
From the above lemma, we can infer that in the clique reduction of a hypergraph, the degree of a node is not preserved: for each hyperedge $e$, it is overcounted by a factor of $(\delta(e) - 1)$. We can hence scale down the contribution of each hyperedge $e$ to the node degree in the reduced graph by a factor of $(\delta(e) - 1)$. This results in the following reduction equation:
$$A^{hyp} = H W (D_e - I)^{-1} H^T. \quad (3)$$
We can now verify that the above adjacency matrix preserves the hypergraph node degree.

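Proposition 1 For the reduction of a hypergraph given by the adjacency matrix $A^{hyp} = H W (D_e - I)^{-1} H^T$, the degree of a node $i$ in the reduced graph (denoted $k_i$) is equal to its degree $d(i)$ in the original hypergraph.
Proof We have
$$A^{hyp}_{ij} = \sum_{e \in E} \frac{h(i, e)\, h(j, e)\, w(e)}{\delta(e) - 1}.$$
Following an argument similar to that of Lemma 1, we can explicitly set $A^{hyp}_{ii} = 0$ for all $i$. The degree of a node in the reduced graph can then be written as
$$k_i = \sum_{j \neq i} A^{hyp}_{ij} = \sum_{e \in E} \frac{h(i, e)\, w(e)}{\delta(e) - 1} \sum_{j \neq i} h(j, e) = \sum_{e \in E} h(i, e)\, w(e) = d(i).$$
With Eq. 3, we can reduce a given hypergraph to a weighted graph and zero out its diagonal by explicitly setting the diagonal entries to zero. The hypergraph modularity matrix can subsequently be written as
$$B^{hyp}_{ij} = A^{hyp}_{ij} - \frac{d(i)\, d(j)}{\sum_{v \in V} d(v)}.$$
This new modularity matrix can be used in Eq. 1 to obtain an expression for the hypergraph modularity,
$$Q^{hyp} = \frac{1}{2m} \sum_{ij} B^{hyp}_{ij}\, \delta(g_i, g_j), \quad (4)$$
which can then be used in conjunction with a Louvain-style algorithm. The value of $Q^{hyp}$ can be interpreted as follows: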
• A negative value of $Q^{hyp}$ indicates a clustering assignment where a node pair $(i, j)$ from the same cluster participates in fewer than the expected number of hyperedges. This situation may arise when the number of within-cluster edges is lower than the number of across-cluster edges.
• A positive value of $Q^{hyp}$ indicates a clustering assignment where a node pair $(i, j)$ from the same cluster participates in more than the expected number of hyperedges. In graphs, a modularity value higher than 0.3 is typically considered to be significant (Clauset et al. 2004).
• $Q^{hyp} = 0$ indicates a clustering assignment where a node pair $(i, j)$ from the same cluster participates in the expected number of hyperedges. This situation can occur because of the random assignment of nodes to the clusters.
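The node-degree-preserving reduction and the resulting hypergraph modularity can be sketched in a few lines of Python. This is only an illustrative sketch: the names `ndp_reduction` and `hypergraph_modularity` and the toy hypergraph are hypothetical, singleton hyperedges are skipped as an assumption (for them $D_e - I$ is not invertible), and the score is normalized by the total degree, which plays the role of $2m$ in the reduced graph.

```python
from itertools import combinations

def ndp_reduction(n, hyperedges, weights):
    """A_hyp = H W (D_e - I)^{-1} H^T with the diagonal set to zero."""
    A = [[0.0] * n for _ in range(n)]
    for e, w in zip(hyperedges, weights):
        if len(e) < 2:
            continue  # assumption: singleton hyperedges contribute nothing
        share = w / (len(e) - 1)
        for u, v in combinations(sorted(e), 2):
            A[u][v] += share
            A[v][u] += share
    return A

def hypergraph_modularity(n, hyperedges, weights, labels):
    """Q_hyp using B_hyp_ij = A_hyp_ij - d(i) d(j) / sum_v d(v)."""
    A = ndp_reduction(n, hyperedges, weights)
    d = [sum(row) for row in A]   # equals the hypergraph degrees (Proposition 1)
    vol = sum(d)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += A[i][j] - d[i] * d[j] / vol
    return q / vol

# Hypothetical toy hypergraph: two 3-node hyperedges bridged by a 2-node hyperedge.
hyperedges = [{0, 1, 2}, {3, 4, 5}, {2, 3}]
weights = [1.0, 1.0, 1.0]
A_hyp = ndp_reduction(6, hyperedges, weights)
reduced_degrees = [sum(row) for row in A_hyp]
q_hyp = hypergraph_modularity(6, hyperedges, weights, [0, 0, 0, 1, 1, 1])
```

Here the row sums of the reduced adjacency matrix coincide with the hypergraph degrees, and the natural two-cluster split gives a positive score.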
In the rest of the section, we will analyze the properties of the proposed modularity function. We will relate the graph reduction equation to the random walk model for hypergraphs. The relation establishes the link with earlier works on hypergraph clustering, where the random walk strategies were employed (Zhou et al. 2007).

Connection to random walks:
Consider the clique reduction of the hypergraph. We can distribute the weight of each hyperedge uniformly among the edges in its associated clique. All nodes within a single hyperedge are assumed to contribute equally; a given node receives a fraction of the weight of each hyperedge it belongs to. Within the clique of a hyperedge $e$, each node is connected to $\delta(e) - 1$ edges. Hence, by dividing each hyperedge weight by this number, we obtain the normalized weight matrix $W (D_e - I)^{-1}$. Introducing this in the weighted clique formulation results in the proposed reduction $A = H W (D_e - I)^{-1} H^T$.
Another way of interpreting this reduction is to consider a random walk on the hypergraph in the following manner:
• pick a start node $i$
• select a hyperedge $e$ containing $i$, with probability proportional to its weight $w(e)$
• select a new node from $e$ uniformly (there are $\delta(e) - 1$ choices)
The behaviour described above is captured by the following random walk transition model:
$$P(i, j) = \sum_{e \in E} \frac{w(e)\, h(i, e)}{d(i)} \cdot \frac{h(j, e)}{\delta(e) - 1}, \qquad P = D_v^{-1} H W (D_e - I)^{-1} H^T.$$
By comparing the above with the random walk probability matrix for graphs ($P = D^{-1} A$), we can recover the reduction $A = H W (D_e - I)^{-1} H^T$.
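The random walk described above can be written out directly. The following Python sketch builds the transition probabilities for moves to a different node ($j \neq i$), under the assumption that every hyperedge has at least two nodes; the function name `transition_matrix` and the toy hypergraph are hypothetical.

```python
def transition_matrix(n, hyperedges, weights):
    """P(i, j) = sum_e [w(e) / d(i)] * [1 / (delta(e) - 1)] over e containing i and j, j != i."""
    d = [0.0] * n
    for e, w in zip(hyperedges, weights):
        for v in e:
            d[v] += w
    P = [[0.0] * n for _ in range(n)]
    for e, w in zip(hyperedges, weights):
        for i in e:
            for j in e:
                if i != j:
                    # pick e with probability w / d(i), then one of the other nodes uniformly
                    P[i][j] += (w / d[i]) / (len(e) - 1)
    return P, d

# Hypothetical toy hypergraph: hyperedges {0,1,2} (weight 1) and {2,3} (weight 2).
hyperedges = [{0, 1, 2}, {2, 3}]
weights = [1.0, 2.0]
P, d = transition_matrix(4, hyperedges, weights)
row_sums = [sum(row) for row in P]
# Multiplying row i by d(i) recovers the reduced adjacency entries A_ij.
A_from_P = [[d[i] * P[i][j] for j in range(4)] for i in range(4)]
```

Each row of `P` sums to one, and rescaling the rows by the node degrees gives a symmetric matrix that matches the proposed reduction with its diagonal set to zero.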

Iterative hyperedge reweighting
When clustering graphs, it is desired that edges within clusters outnumber edges between clusters. Hence, when trying to improve clustering, we look at minimizing the number of between-cluster edges that get cut. For a hypergraph, this would be done by minimizing the total volume of the hyperedge cut (Zhou et al. 2007). Consider the two-clustering problem, where the task is to divide the vertex set $V$ into two clusters $S$ and $S^c$. Zhou et al. (2007) observed that the volume of the cut $\partial S$ is directly proportional to $\sum_{e} w(e)\, |e \cap S|\, |e \cap S^c|$. For a hyperedge $e$ that has vertices in both $S$ and $S^c$, the product $|e \cap S|\, |e \cap S^c|$ can be interpreted as the number of cut sub-edges within a clique reduction. This product is maximized when the cut is balanced, with an equal number of vertices of $e$ in $S$ and $S^c$. In such a case, there will be $\left(\frac{\delta(e)}{2}\right)^2$ sub-edges getting cut. On the other hand, when all vertices of $e$ go into one partition and the other partition is left empty, the product is zero. Similarly, if one of the vertices of $e$ goes into one partition and the other partition contains the remaining $\delta(e) - 1$ vertices, then the product is $\delta(e) - 1$. A min-cut algorithm would favor cuts that are as unbalanced as possible, as a consequence of the minimization of $|e \cap S|\, |e \cap S^c|$. In the sequel, we present the intuition behind our proposed iterative reweighting technique, followed by its mathematical formulation.
Intuition: When clustering graphs, an edge that gets cut between two clusters has one of its nodes in the first cluster and the other node in the second cluster. In hypergraphs, however, a hyperedge can get cut in multiple ways. When a hyperedge gets cut, if the majority of its vertices go into the first cluster $c_1$ and only a smaller fraction of vertices go into the second cluster $c_2$, then it is more likely that the vertices going into the second cluster are similar to the rest and should be drawn into the first cluster. On the other hand, if a hyperedge gets cut equally across clusters, then its vertices are equally likely to be part of any cluster; hence it is less informative than a hyperedge that gets an unbalanced cut. Building on this idea, we would want to cut the less informative hyperedges (the ones getting balanced cuts) and leave uncut the more informative hyperedges (the ones that got unbalanced cuts).
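The three cases discussed above (balanced cut, a single vertex split off, and no cut) can be checked numerically with a short Python sketch. The function names `cut_subedges` and `cut_volume_proxy` are hypothetical, and the second function only computes the quantity that the cut volume is stated to be proportional to.

```python
def cut_subedges(e, S):
    """|e ∩ S| * |e ∩ S^c|: number of clique sub-edges of e crossing the cut."""
    inside = len(e & S)
    return inside * (len(e) - inside)

def cut_volume_proxy(hyperedges, weights, S):
    """sum_e w(e) * |e ∩ S| * |e ∩ S^c| for a two-way partition (S, S^c)."""
    return sum(w * cut_subedges(e, S) for e, w in zip(hyperedges, weights))

e = {0, 1, 2, 3, 4, 5}                    # a hyperedge with delta(e) = 6
balanced = cut_subedges(e, {0, 1, 2})     # (delta(e) / 2) ** 2
one_out = cut_subedges(e, {0})            # delta(e) - 1
uncut = cut_subedges(e, set(e))           # 0
total = cut_volume_proxy([e, {0, 6}], [1.0, 2.0], {0, 1, 2})
```

For a six-node hyperedge the balanced cut severs nine sub-edges while splitting off a single vertex severs only five, which is why a plain min-cut objective prefers unbalanced cuts.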
This can be done by increasing the weights of hyperedges that get unbalanced cuts, and (relatively) decreasing the weights of hyperedges that get more balanced cuts. We know that an algorithm that tries to minimize the volume of the hyperedge boundary would try to cut as few heavily weighted hyperedges as possible. Since the hyperedges that had more unbalanced cuts get a higher weight, they are less likely to be cut after reweighting, and instead would reside inside a cluster. Hyperedges that had more balanced cuts get a lower weight and, on reweighting, continue to get balanced cuts. Thus, after reweighting and clustering, we would observe fewer hyperedges between clusters and more hyperedges pushed into clusters. Moreover, after reweighting, we expect that the hyperedges still getting cut between clusters should get balanced cuts. The effectiveness of this idea can be seen in the example shown in Fig. 2.
Now, we formally develop a reweighting scheme that satisfies the properties described above: increasing the weight of a hyperedge that received a more unbalanced cut, and decreasing the weight of a hyperedge that received a more balanced cut. Considering the case where a hyperedge gets partitioned into two clusters with $k_1$ and $k_2$ nodes in each partition ($k_1, k_2 \neq 0$), the following equation operationalizes the above-mentioned scheme:
$$t = \left(\frac{1}{k_1} + \frac{1}{k_2}\right) \times \delta(e). \quad (5)$$
Here the multiplicative coefficient $\delta(e)$ seeks to keep $t$ independent of the number of vertices in the hyperedge. Note that for a hyperedge $e$ with two partitions, $\delta(e) = k_1 + k_2$. Figure 1 illustrates an example where $t$ takes two different values depending on the cut.
To see why this satisfies our desired property, note that $t$ is minimized when $k_1$ and $k_2$ are equal. This can be verified by the following proposition.
Proposition 2 For the function $t = \left(\frac{1}{k_1} + \frac{1}{k_2}\right) \times \delta(e)$, the minimum value is $t = 4$, and it is achieved when $k_1 = k_2 = \frac{\delta(e)}{2}$. Here, for a hyperedge $e$, $\delta(e)$ is its cardinality and $k_i$ represents the number of nodes in the $i$-th partition.
Proof Let $k_i \in \mathbb{Z}^+$. Then, using $\delta(e) = k_1 + k_2$,
$$t = \left(\frac{1}{k_1} + \frac{1}{k_2}\right)(k_1 + k_2) = 2 + \frac{k_1}{k_2} + \frac{k_2}{k_1}$$
is minimized when $k_1 = k_2$, since $x + \frac{1}{x} \geq 2$ for $x > 0$ with equality only at $x = 1$, and the resultant value is $t = 4$.
Note: It can be observed that Eq. 5 coincides, up to a constant factor, with the ratio between the arithmetic mean (AM) and harmonic mean (HM) of the two numbers $k_1$ and $k_2$. More precisely, we can write
$$t = 4\, \frac{AM(k_1, k_2)}{HM(k_1, k_2)}.$$
By using the fact that $AM(k_1, k_2) \geq HM(k_1, k_2)$, with $AM(k_1, k_2) = HM(k_1, k_2)$ only when $k_1 = k_2$, we obtain the same result as in Proposition 2.
We can then generalize Eq. 5 to $c$ partitions as follows:
$$w'(e) = \frac{1}{m} \sum_{i=1}^{c} \frac{1}{k_i + 1}\, [\delta(e) + c]. \quad (6)$$
Here, the $+1$ term in the denominator accounts for the cases when $k_i = 0$. To compensate for this extra $+1$, $+c$ has been added to the numerator. Additionally, $m$ is the number of hyperedges, and the division by $m$ is added to normalize the weights (Fig. 1). During the first iteration of the algorithm, we find clusters in the hypergraph using its default weights. At the end of the first iteration, we find the updated weights using Eq. 6. It can be seen that if a hyperedge $e$ does not get a balanced cut, $w'(e)$ will not be minimized, and its value increases with the extent to which the cut is unbalanced. Thus, updating hyperedge weights by Eq. 6 serves our purpose.
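The reweighting function can be sketched in Python as follows. The function names `balance_score` and `new_weight` and the example values are hypothetical; cluster labels are assumed to be integers in the range 0 to c - 1.

```python
def balance_score(k1, k2):
    """Eq. 5: t = (1/k1 + 1/k2) * delta(e), with delta(e) = k1 + k2."""
    return (1.0 / k1 + 1.0 / k2) * (k1 + k2)

def new_weight(e, labels, c, m):
    """Eq. 6: w'(e) = (1/m) * sum_{i=1..c} 1/(k_i + 1) * (delta(e) + c)."""
    counts = [0] * c
    for v in e:
        counts[labels[v]] += 1          # k_i: nodes of e that fall in cluster i
    return sum(1.0 / (k + 1) for k in counts) * (len(e) + c) / m

# Hypothetical example: one 4-node hyperedge under three different 2-cluster assignments.
e = {0, 1, 2, 3}
w_balanced = new_weight(e, [0, 0, 1, 1], c=2, m=1)    # 2 : 2 split
w_unbalanced = new_weight(e, [0, 1, 1, 1], c=2, m=1)  # 1 : 3 split
w_uncut = new_weight(e, [0, 0, 0, 0], c=2, m=1)       # 4 : 0 split
t_balanced = balance_score(2, 2)
t_unbalanced = balance_score(1, 3)
```

In this example the weight increases from the balanced split to the unbalanced split to the uncut hyperedge, which is the ordering the reweighting scheme is designed to produce.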
At step $t + 1$, let $w_t(e)$ be the weight of hyperedge $e$ from the previous iteration. Using Eq. 6, $w'(e)$ can be computed for the current iteration. The weight update equation can be written as
$$w_{t+1}(e) = \alpha\, w_t(e) + (1 - \alpha)\, w'(e). \quad (7)$$
Here, $\alpha$ is a hyperparameter that controls the relative importance of the newly calculated weights and the current weights of the hyperedges. The complete algorithm for modularity maximization on hypergraphs with iterative reweighting, entitled Iteratively Reweighted Modularity Maximization (IRMM), is described in Algorithm 1.
In the rest of the section, we demonstrate the effectiveness of the hyperedge reweighting scheme using a toy example. Initially, when clustering this hypergraph by modularity maximization, the hypergraph had two highly unbalanced cuts. In Fig. 2a, hyperedge $h_2$ gets split by Cut 1, Cut 2 and Cut 3 in 1 : 4, 1 : 4 and 2 : 3 ratios respectively. Similarly, hyperedge $h_3$ gets cut by both Cut 1 and Cut 2 in a 1 : 2 ratio. After applying one iteration of hyperedge reweighting, hyperedge $h_1$ gets split in a 1 : 1 ratio and $h_2$ gets cut in a 1 : 4 ratio (Fig. 2b). In this case, the hyperedge reweighting procedure decreases the number of cuts and leaves the two desired clusters. With the initial clustering, there were single nodes left out from hyperedges $h_2$ and $h_3$, which are pulled back into the larger clusters after reweighting. This example illustrates that the reweighting scheme exhibits the desired behavior, as discussed earlier in this section.
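The overall iteration can be sketched as the following Python loop. This is a hedged sketch rather than a transcription of Algorithm 1: the clustering step is passed in as a hypothetical `cluster_fn` argument (any modularity maximizer that maps an adjacency matrix to a list of cluster labels, such as a Louvain implementation), the moving-average form with `alpha` multiplying the current weights is an assumption, and the Euclidean norm of the weight change is used as an assumed stopping rule.

```python
from itertools import combinations

def ndp_reduction(n, hyperedges, weights):
    """Degree-preserving reduction A = H W (D_e - I)^{-1} H^T with zero diagonal."""
    A = [[0.0] * n for _ in range(n)]
    for e, w in zip(hyperedges, weights):
        if len(e) < 2:
            continue  # assumption: singleton hyperedges are ignored
        for u, v in combinations(sorted(e), 2):
            A[u][v] += w / (len(e) - 1)
            A[v][u] += w / (len(e) - 1)
    return A

def reweight(hyperedges, labels, c):
    """Eq. 6 applied to every hyperedge; labels must lie in 0..c-1."""
    m = len(hyperedges)
    out = []
    for e in hyperedges:
        counts = [0] * c
        for v in e:
            counts[labels[v]] += 1
        out.append(sum(1.0 / (k + 1) for k in counts) * (len(e) + c) / m)
    return out

def irmm(n, hyperedges, weights, cluster_fn, alpha=0.5, threshold=0.01, max_iter=100):
    """Alternate clustering of the reduced graph and hyperedge reweighting."""
    w = list(weights)
    labels = None
    for _ in range(max_iter):
        raw = cluster_fn(ndp_reduction(n, hyperedges, w))
        ids = {g: idx for idx, g in enumerate(sorted(set(raw)))}
        labels = [ids[g] for g in raw]                 # relabel clusters to 0..c-1
        w_prime = reweight(hyperedges, labels, len(ids))
        w_next = [alpha * a + (1 - alpha) * b for a, b in zip(w, w_prime)]
        change = sum((a - b) ** 2 for a, b in zip(w, w_next)) ** 0.5
        w = w_next
        if change < threshold:                         # assumed stopping rule
            break
    return labels, w

# Usage with a stand-in clustering function that always returns a fixed split.
hyperedges = [{0, 1, 2}, {3, 4, 5}, {2, 3}]
fixed_split = lambda A: [0, 0, 0, 1, 1, 1]
labels, w_final = irmm(6, hyperedges, [1.0, 1.0, 1.0], fixed_split)
```

With the stand-in clustering function the weights converge to the values given by Eq. 6, and the two uncut hyperedges end up heavier than the hyperedge that is cut. In practice `cluster_fn` would be replaced by an actual modularity maximization routine.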
We are now in a position to evaluate our ideas empirically.

Evaluation on ground truth
In this section, we present the experiments conducted to validate the proposed methods. We used three popular metrics to evaluate the clustering quality: the Rand Index, the average F1 measure (Yang and Leskovec 2012), and purity. We start with a brief introduction to the Louvain method, followed by details on the experimental setup and the datasets used.
The Louvain method: The Louvain method is a greedy optimization method for detecting communities in large networks (Blondel et al. 2008). The method works on the principle of grouping nodes so as to maximize the overall modularity. Since checking all possible cluster assignments is impractical, the Louvain algorithm uses a heuristic that is known to work well on real-world graphs. The method starts by assigning each node to its own cluster and then merging those clusters that result in the highest modularity gain. Merged clusters are treated as single nodes, and again those cluster pairs are merged that result in the highest modularity gain. When no cluster pairs are left whose merging would further increase the overall network modularity, the algorithm stops and returns the clusters.
Fixing the number of clusters: We use the Louvain algorithm to maximize the hypergraph modularity as per Eq. 4. Since this method uses a node-degree-preserving graph reduction, we refer to it as NDP-Louvain (Node Degree Preserving Louvain). The Louvain algorithm automatically returns the number of clusters. To get a predefined number of clusters $c$, we use agglomerative clustering (Ding and He 2002) on top of the clusters obtained by the Louvain algorithm. For the linkage criterion, we use average linkage. Agglomerative clustering is a bottom-up hierarchical clustering method. The algorithm constructs a dendrogram that exhibits pairwise similarity among clusters. At each step, the two clusters with the shortest distance are merged into a single cluster. The distance between any two clusters $c_i$ and $c_j$ is taken to be the average of all distances $d(x, y)$, where node $x \in c_i$ and node $y \in c_j$.
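The average-linkage post-processing step can be sketched as follows. The function name `average_linkage_merge`, the dense distance matrix `dist`, and the one-dimensional toy data are assumptions for illustration, since the text does not specify which node distance $d(x, y)$ is used.

```python
def average_linkage_merge(clusters, dist, c):
    """Merge clusters (lists of node ids) until only c remain, using average linkage."""
    clusters = [list(cl) for cl in clusters]

    def avg_distance(a, b):
        # mean of d(x, y) over all x in a and y in b
        return sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > c:
        # find the pair of clusters with the smallest average distance
        _, i, j = min(
            (avg_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical example: five nodes placed on a line, distance = absolute difference.
positions = [0, 1, 10, 11, 20]
dist = [[abs(a - b) for b in positions] for a in positions]
merged = average_linkage_merge([[0], [1], [2], [3], [4]], dist, c=2)
```

Starting from singleton clusters, the sketch first merges the two closest pairs and then attaches the remaining node to the nearer group. In the setting described above the initial clusters would be those returned by the Louvain step.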
The proposed methods are shown in the results table as NDP-Louvain and IRMM.

Settings for IRMM
We investigate the effect of the hyperparameter $\alpha$ using a grid search over the interval [0.1, 0.9] with a step size of 0.1. We did not observe any difference in the resultant Rand Index, purity, and F1 scores. While tuning $\alpha$, we witnessed only a minimal difference in the convergence rate over a wide range of values (for example, 0.3 to 0.9 on the TwitterFootball dataset). Since $\alpha$ is a scalar coefficient in a moving average, it does not cause any significant variation in the resulting weights. In our experiments, we set $\alpha = 0.5$. We stop the iterations when the difference between the norms of two subsequent weight assignments is less than a set threshold. In our experiments, we set this threshold to 0.01.

Compared methods
To evaluate the performance of our proposed methods, we compared them against the following baselines.
Clique Reductions: We reduced the original hypergraph using a clique reduction ($A = H W H^T$) and then applied the Louvain method and Spectral Clustering.

Hypergraph-based Spectral Clustering:
We use the hypergraph-based spectral clustering method, as defined in Zhou et al. (2007). The given hypergraph is reduced to a graph $A = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}$. In the results table, this method is referred to as Zhou-Spectral.
PaToH and hMETIS: These are popular hypergraph partitioning algorithms that work on the principle of coarsening the hypergraph before partitioning. The coarsened hypergraph is partitioned using expensive heuristics. In our experiments, we used the original implementations from the corresponding authors.

Datasets
Dataset statistics are furnished in Table 1. For all datasets, we use the largest connected component of the hypergraph for our experiments. All the datasets are classification datasets, where class labels accompany the data points. We use these class labels as a proxy for clusters. A detailed description of the hypergraph construction is given below.
MovieLens: This is a multi-relational dataset provided by GroupLens research, where movies are represented by nodes. We construct a co-director hypergraph by using the director relationship to represent hyperedges. A hyperedge connects a group of nodes if the same individual directed the corresponding movies. Here, the genre of a movie represents the class of the corresponding node.
Cora and Citeseer: These are bibliographic datasets, where the nodes represent papers. In each dataset, a set of nodes is connected by a hyperedge if they involve the same set of words (after removing low frequency and stop words). Different disciplines were used as clusters (Sen et al. 2008).
TwitterFootball: This is a social network taken from the Twitter dataset (Greene et al. 2012). This dataset involves players of 20 football clubs (classes) of the English Premier League. Here, the nodes represent players, and if a set of players are co-listed, then the corresponding nodes are connected by a hyperedge.
Arnetminer: This is a large bibliographic dataset (Tang et al. 2008). Here, the nodes represent papers, and a set of nodes are connected if the corresponding papers are co-cited. The nodes in the hypergraph are accompanied by Computer Science subdisciplines. Different sub-disciplines were used as clusters.

Experiments
For the different datasets, we compare the Rand Index (Rand 1971), purity (Manning et al. 2008), and average F1 scores (Yang and Leskovec 2013) of all the methods discussed earlier. The number of clusters was first set to that returned by the Louvain method, in an unsupervised fashion. This is what would be expected in a real-world setting, where the number of clusters is not given a priori. Table 2 shows the results of this experiment. (Table note: Louvain, NDP-Louvain, and IRMM return the number of clusters on their own. Best performance in each column is boldfaced.)
Secondly, we ran the same set of methods with the number of ground truth classes set as the number of clusters. In the case of the Louvain method, the clusters obtained are merged using the post-processing technique explained earlier. The results of this experiment are given in Table 3. On some datasets, the Louvain method and IRMM return fewer clusters than the number of ground truth classes. In such cases, we do not report the results and leave the entries as "-."
We also plotted the results for a varying number of clusters using the same methodology described above, to assess our method's robustness. The results are shown in Fig. 3. In all datasets but Arnetminer, we set the number of clusters to a minimum value such as two and then increase it by a factor of two. For Arnetminer, since the IRMM method returns a very large number of clusters, we set the initial number of clusters to ten and increase it by a factor of ten. For all datasets, the maximum number of clusters is set to the number of clusters returned by the IRMM method. On some datasets, the Louvain and NDP-Louvain methods return fewer clusters than IRMM. In such cases, the corresponding curves in Fig. 3 are left truncated.
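Two of the metrics used here, the Rand Index and purity, have simple standard definitions that can be sketched in Python. The function names and the four-point example are hypothetical; the average F1 score is omitted from this sketch.

```python
from collections import Counter
from itertools import combinations

def rand_index(truth, pred):
    """Fraction of node pairs on which the two labelings agree (same/same or different/different)."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)

def purity(truth, pred):
    """Each predicted cluster is credited with its most frequent ground-truth class."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(truth)

# Hypothetical example with four nodes.
truth = [0, 0, 1, 1]
pred = [0, 0, 0, 1]
ri = rand_index(truth, pred)
pu = purity(truth, pred)
```

Both scores lie between 0 and 1, with 1 reached when the predicted clusters coincide with the ground-truth classes.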

Results and analysis
We show that the proposed methods, NDP-Louvain and IRMM, perform consistently better on all the datasets (except on one dataset with the RI measure). To test the robustness of the proposed method, we vary the number of clusters and report the results in the latter half of the section. To investigate the effect of the reweighting scheme, we report the distribution of the sizes of hyperedges getting cut. This is followed by testing the scalability of the proposed algorithm against one of the competitive baselines.
Citeseer, Cora, Movielens, TwitterFootball, and Arnetminer have 6, 7, 2, 20, and 10 ground truth classes, respectively. In Fig. 3, the x-axis represents the number of clusters and the y-axis indicates the F1 score.
We start by discussing the empirical evaluation of the proposed methods. From Tables 2 and 3, it is evident that IRMM gives the highest cluster purity scores and average F1 scores across all the datasets, and the highest Rand Index scores on all except the Citeseer dataset. Besides the fact that IRMM significantly outperforms the other methods, we want to emphasize the following two observations:

Superior performance of hypergraph based methods:
It is evident that hypergraph based methods perform consistently better than their clique based equivalents. Results indicate that Zhou-Spectral and NDP-Louvain are better than Spectral and Louvain, respectively. Hence, preserving the super-dyadic structure helps in getting a better cluster assignment.

The proposed iterative reweighting scheme boosts performance:
The proposed hyperedge reweighting scheme improves performance across all datasets. Note that the first iteration of IRMM is NDP-Louvain, and IRMM consistently performs better than the NDP-Louvain method, which shows that balancing the hyperedge cut enhances the cluster quality.

Effect of reweighting on hyperedge cuts
Consider a hyperedge that is cut, that is, one whose nodes are partitioned into different clusters. Looking at Eq. 6, we can see that w(e) is minimized when all the partitions are of equal size, and maximized when one of the partitions is much larger than the others. The iterative reweighting procedure is designed to increase the number of hyperedges with balanced partitioning, and decrease the number of hyperedges with unbalanced partitioning. As iterations pass, hyperedges that are more unbalanced should be pushed into neighbouring clusters, and the hyperedges that lie between clusters should be more balanced.
We analyze the effect of hyperedge reweighting in Fig. 4. For each hyperedge, we compute the relative size of its largest partition and place the hyperedge in a bin of width 0.1. The plot illustrates how the size of each bin varies over the iterations.
$$\text{relative size}(e) = \max_i \frac{\text{number of nodes in cluster } i}{\text{number of nodes in the hyperedge } e}$$
If a hyperedge is a balanced cut, then the proportion of its largest partition is low; we call such hyperedges fragmented. On the other hand, if a hyperedge has a very high proportion of its largest partition, then the hyperedge is not a balanced cut; we call such hyperedges dominated.
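The binning behind this analysis can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the data layout (hyperedges as node lists, a node-to-cluster mapping) and the toy data are hypothetical, and skipping hyperedges that lie entirely inside one cluster is an assumption.

```python
from collections import Counter


def cut_histogram(hyperedges, cluster_of, n_bins=10):
    """Histogram of the relative size of the largest partition of each cut hyperedge.

    Bin k covers relative sizes in [k / n_bins, (k + 1) / n_bins).
    """
    bins = [0] * n_bins
    for e in hyperedges:
        # Size of the largest group of nodes of e that share a cluster.
        largest = max(Counter(cluster_of[v] for v in e).values())
        if largest == len(e):
            continue  # assumption: uncut hyperedges are not counted
        # Integer arithmetic avoids floating-point errors at bin boundaries.
        bins[largest * n_bins // len(e)] += 1
    return bins


# Toy example (hypothetical data): 6 nodes in 3 clusters, 5 hyperedges.
cluster_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
hyperedges = [[0, 1], [0, 2], [0, 1, 2], [0, 2, 4], [0, 1, 2, 3]]
print(cut_histogram(hyperedges, cluster_of))
```

Under this convention, fragmented hyperedges fall into the low bins and dominated hyperedges into the high bins; recomputing the histogram after every iteration gives one curve per bin.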
On the TwitterFootball dataset, the effect of reweighting is distinctly visible as the number of fragmented edges increases with iterations. This behavior confirms our intuition of achieving more balanced cuts with the proposed reweighting procedure. After four iterations, the method converges, as we do not observe any further change in the hyperedge distribution.
A similar trend is observed with the Cora dataset. Here, the number of fragmented edges fluctuates before final convergence.
In the case of the Arnetminer dataset, the change in fragmented and dominated edges is minimal. One possible reason for such behavior could be its significantly large size as compared to the number of ground truth clusters.
In the case of the Citeseer and Movielens datasets, we did not observe convergence of the hyperedge weights within the pre-fixed number of iterations. Though the number of hyperedges in each bin seems to fluctuate with iterations, the algorithm tries to find the best clustering at each step by using the NDP-Louvain algorithm. This results in improved performance of the overall algorithm after the refinement procedure.
On both the Citeseer and Movielens datasets, IRMM returns fewer clusters than NDP-Louvain. NDP-Louvain returns 16 clusters for Citeseer and 13 clusters for Movielens. These numbers are reduced to 13 and 8 for Citeseer and Movielens, respectively. Thus, the refinement procedure tends to minimize the cut value along with balancing the cut.

Scalability of the NDP-Louvain method
To further motivate the extension of modularity maximization methods to the hypergraph clustering problem, we compare the scalability of the NDP-Louvain method against the strongest baseline, Zhou-Spectral. Table 4 shows the CPU times for NDP-Louvain and Zhou-Spectral on the real-world datasets. The runtime of IRMM is not reported, as it is highly dependent on the number of iterations: on TwitterFootball and Cora, our method converged in 4 and 13 iterations, respectively, and for the remaining datasets we experimented with the number of iterations set to 20. We see that while the difference is less pronounced on a smaller dataset like TwitterFootball, it is much greater on the larger datasets. In particular, the runtime on Arnetminer for NDP-Louvain is lower by a significant margin, since it does not have to compute an expensive eigendecomposition.
Note: To compute the eigenvectors for the spectral clustering based methods, we use the eig(.) function from MATLAB. The eig(.) function uses orthogonal similarity transformations to reduce the matrix to upper Hessenberg form, followed by the QR algorithm to find its eigenvectors.
Analysis on synthetic hypergraphs: On the real-world data, modularity maximization showed improved scalability as the dataset size increased. To evaluate this trend, we compared the CPU times for the Zhou-Spectral and NDP-Louvain methods on synthetic hypergraphs of different sizes. For each hypergraph, we first ran NDP-Louvain and found the number of clusters returned, then ran the Zhou-Spectral method with the same number of clusters.
Following the hypergraph generation method used in EDRW: Extended Discriminative Random Walk (Satchidanand et al. 2015), we generated hypergraphs with 2 classes and a homophily of 0.4 (40% of the hyperedges deviate from the expected class distribution). The hyperedge sizes followed a modified power-law distribution, where 75% of the hyperedges contained fewer than 3% of the nodes, 20% of the hyperedges contained 3%-50% of the nodes, and the remaining 5% contained over half the nodes in the dataset. To generate a hypergraph, we first set the number of hyperedges to 1.5 times the number of nodes. For each hyperedge, we sampled its size k from the modified power-law distribution and chose k different nodes based on the homophily of the hypergraph. We generated hypergraphs of sizes ranging from 1000 nodes up to 10000 nodes, at intervals of 500 nodes. Figure 5 shows how the CPU time varies with the number of nodes on the synthetic hypergraphs generated as described above.
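The generation procedure can be sketched roughly as below. Only the three size bands, the 1.5 ratio of hyperedges to nodes, the two classes and the 0.4 value come from the description above; everything else is an assumption made for illustration (uniform sizes within each band, a minimum size of 2, balanced classes, and the particular rule by which a "deviating" hyperedge ignores class membership). It is not the EDRW generator itself, and all function and parameter names are hypothetical.

```python
import random


def sample_size(n_nodes, rng):
    """Draw a hyperedge size from the three-band size distribution."""
    u = rng.random()
    if u < 0.75:      # 75%: fewer than 3% of the nodes
        lo, hi = 2, max(2, 3 * n_nodes // 100 - 1)
    elif u < 0.95:    # 20%: between 3% and 50% of the nodes
        lo, hi = 3 * n_nodes // 100, n_nodes // 2
    else:             # 5%: more than half of the nodes
        lo, hi = n_nodes // 2 + 1, n_nodes
    return rng.randint(lo, hi)  # assumption: uniform within the band


def make_hypergraph(n_nodes, deviate=0.4, seed=0):
    """Return (labels, hyperedges) for a synthetic two-class hypergraph."""
    rng = random.Random(seed)
    labels = [v % 2 for v in range(n_nodes)]  # assumption: balanced classes
    members = [[v for v in range(n_nodes) if labels[v] == c] for c in (0, 1)]
    hyperedges = []
    for _ in range(3 * n_nodes // 2):  # 1.5 hyperedges per node
        k = sample_size(n_nodes, rng)
        home = rng.randrange(2)
        if rng.random() < deviate:
            # Assumption: a deviating hyperedge draws nodes regardless of class.
            edge = rng.sample(range(n_nodes), k)
        else:
            # Assumption: otherwise fill from the home class first, then the other.
            own = min(k, len(members[home]))
            edge = rng.sample(members[home], own)
            edge += rng.sample(members[1 - home], k - own)
        hyperedges.append(edge)
    return labels, hyperedges


labels, hyperedges = make_hypergraph(1000)
print(len(hyperedges), sum(len(e) < 30 for e in hyperedges) / len(hyperedges))
```

Calling `make_hypergraph` for sizes from 1000 to 10000 in steps of 500 would reproduce the range of hypergraph sizes used in the scalability experiment, under the assumptions listed above.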
While NDP-Louvain is shown to run consistently faster than Zhou-Spectral for the same number of nodes, the difference increases as the hypergraph grows larger. In Fig. 5, this is shown by the widening in the gap between the two curves as the number of nodes increases.

Conclusion and future directions
In this paper, we have defined the problem of clustering on hypergraphs and stated the challenges involved in solving it. We started by defining a null model for the graphs generated by hypergraph reduction and showed theoretically its equivalence to the configuration model defined for weighted undirected graphs. Our proposed graph reduction technique preserves the node degree sequence of the original hypergraph. After reducing the hypergraph to a graph, we apply the Louvain algorithm to find clusters. We have motivated the problem of balancing the hypergraph cuts and provided an iterative solution for it. Our extensive set of experiments demonstrates that the proposed methods outperform state-of-the-art approaches. The promising results confirm the need for hypergraph modeling and open up new directions for further research. The proposed graph reduction technique can be used for other tasks such as node classification, link prediction, and node representation learning, which are left as avenues for future research.