A review of stochastic block models and extensions for graph clustering

There have been rapid developments in model-based clustering of graphs, also known as block modelling, over the last ten years or so. We review different approaches and extensions proposed for different aspects in this area, such as the type of the graph, the clustering approach, the inference approach, and whether the number of groups is selected or estimated. We also review models that combine block modelling with topic modelling and/or longitudinal modelling, regarding how these models deal with multiple types of data. How different approaches cope with various issues will be summarised and compared, to facilitate the demand of practitioners for a concise overview of the current status of these areas of literature.


Introduction
Stochastic block models (SBMs) are an increasingly popular class of models in statistical analysis of graphs or networks.They can be used to discover or understand the (latent) structure of a network, as well as for clustering purposes.We introduce them by considering the example in Fig. 1, in which the network consists of 90 nodes and 1192 edges.The nodes are divided into 3 groups, with groups 1, 2 and 3 containing 25, 30 and 35 nodes, respectively.The nodes within the same group are more closely connected to each other, than with nodes in another group.Moreover, the connectivity pattern is rather "uniform".For example, compared to nodes 2 to 25, node 1 does not seem a lot more connected to other nodes, both within the same group or with another group.In fact, this model is generated by taking each pair of nodes at a time, and simulating an (undirected) edge between them.The probability of having such an edge or not is independent of that of any other pair of nodes.For two nodes in the same group, that is, of the same colour, the probability of an edge is 0.8, while for two nodes in different groups, the edge probability is 0.05.
The above example is a simulation from a simple SBM, in which there are two essential components, both of which will be explained in "Stochastic block models" section.The first component is the vector of group memberships, given by (1 1 • • • 1 ( 1 ) Given the group memberships, the block matrix, and the assumptions of an SBM (to be detailed in "Stochastic block models" section), it is straightforward to generate a synthetic network for simulation purposes, as has been done in the example.It will also be straightforward to evaluate the likelihood of data observed, for modelling purposes.However, in applications to real data, neither the group memberships nor the block matrix is observed or given.Therefore, the goal of fitting an SBM to a graph is to infer these two components simultaneously.Subsequently, the usual statistical challenges arise: 1. Modelling: How should the SBM be structured or extended to realistically describe real-world networks, with or without additional information on the nodes or the edges? 2. Inference: Once the likelihood can be computed, how should we infer the group memberships and the block matrix?Are there efficient and scalable inference algorithms? 3. Selection and diagnostics: Can we compute measures, such as Bayesian information criterion (BIC) and marginal likelihood, to quantify and compare the goodness of fit of different SBMs?
Inferring the group memberships is essentially clustering the nodes into different groups, and the number of groups, denoted by K, is quite often unknown prior to modelling and inference.This brings about another challenge: 4. Should we incorporate K as a parameter in the model, and infer it in the inference?
Or should we fit an SBM with different fixed K 's, and view finding the optimal K as a model selection problem?
In this article, we will review the developments of SBMs in the literature, and compare how different models deal with various issues related to the above questions.Such issues include the type of the graph, the clustering approach, the inference approach, and selecting or estimating the (optimal) number of groups.The issue with the number of groups K is singled out because how differently it is related to questions 1-3 depends largely on the specific model reviewed.
Quite often, other types of data, such as textual and/or temporal, appear alongside network data.One famous example is the Enron Corpus, a large database of over 0.6 million emails by 158 employees of the Enron Corporation before the collapse of the company in 2001.While the employees and the email exchanges represent the nodes and the edges of the email network, respectively, the contents and creation times of the emails provide textual and dynamic information, respectively.It has been studied by numerous articles in the literature, including (Zhou et al. 2006;McCallum et al. 2007;Pathak et al. 2008;Fu et al. 2009;Xing et al. 2010;Gopalan et al. 2012;Sachan et al. 2012;DuBois et al. 2013;Xu and Hero III 2013;Matias et al. 2015;Bouveyron et al. 2016;Corneli et al. 2018).Another example is collections of academic articles, in which the network, temporal and textual data come from the references/citations between the articles, their publication years, and their actual contents, respectively.In these cases, further questions can be asked: 5. Can longitudinal and/or textual modelling, especially for cluster purposes, be incorporated into the SBMs, to utilise all information available in the data?6.How are inference, model selection/diagnostics, and the issue with K dealt with under the more complex models?
A relevant field to answering question 5 is topic modelling, in which the word frequencies of a collection of texts/articles are analysed, with the goal of clustering the articles into various topics.While topic modelling and SBMs are applicable to different types of information, textual for the former and relational for the latter, their ultimate goals are the same, which is model-based clustering of non-numerical data.Similar issues to the aforementioned ones for SBMs are also dealt with in various works, which will therefore be reviewed in this article.
While incorporating longitudinal and/or topic modelling into SBMs are possible, it should be noted that they are well-developed and large fields on their own.Due to the scope of this article, we will focus on works that are more recent or relatively straightforward extensions of SBMs to these two fields.We hope, from these reviewed interdisciplinary works, to provide future directions to a comprehensive model that handles multiple types of information simultaeously and deals with the questions raised satisfactorily.
While articles on SBMs form a major body of works in the literature, especially over the last decade or so, they should not be completely separated from other methods or algorithms in statistical network analysis.For example, community detection methods and latent space models are two highly related fields to SBMs, and their developments are interwined with each other.Therefore, several important works, such as reviews of these two topics or articles which make connections with SBMs, will be mentioned as well, for the sake of comprehensiveness.
The rest of this article is organised as follows.A simple version of the SBM is introduced in "Stochastic block models" section.Extensions of the SBM regarding the type of graph are reviewed in "Type of graph and extensions of the SBM" section.Models that relax the usual clustering approach, in which each node is assumed a single group, are introduced in "Clustering approach" section.Related models and methods for graphs to the SBM are discussed in "Related methods for graphs" section.The inference approaches and the related issue of the number of groups are discussed in "Inference approach" and "Number of groups" sections, respectively.Models which incorporate longitudinal modelling are "SBM with longitudinal modelling" section.Topic modelling is briefly introduced in "Topic models" section, and its incorporation in SBMs is reviewed in "SBM with topic modelling" section.A summary and comparison of models are provided in "Comparison" section, and the discussion in "Discussion" section concludes the article.

Stochastic block models
In this section, we shall first formulate a basic version of the stochastic block model (SBM) and mention the concept of stochastic equivalence, illustrated by continuing with the example in "Introduction" section.This will pave the way for "Type of graph and extensions of the SBM" section, where we consider different extensions to accommodate additional information about the graph, to better describe real-world networks, and to potentially lead to more scalable inference algorithms.
To introduce the terminology, we consider a graph G = (N , E), where N is the node set of size n := |N |, and E is the edge list of size M := |E|.In the example in Fig. 1, N = {1, 2, . . ., 90}, n = 90, and M = 1192.We call a pair of nodes a dyad, and consider the existence or absence of an edge for the dyad (p, q), through the use of the n × n adjacency matrix, denoted by Y, which is another useful representation of the graph.If G is undirected, as is the example, Y pq = Y qp = 1 if p and q have an edge between them, 0 otherwise, where M rs represents the (r, s)-th element of matrix M. By construction, under an undirected graph, Y is symmetric along the major diagonal.If G is directed, Y pq = 1 (0) represents an edge (non-edge) from p to q, and is independent of Y qp , for Y need not be symmetric.We assume Y pp = 0, that is, no node has a self-edge, as this is quite often, but not always, assumed in the models reviewed.The adjacency matrix for the example is shown in Fig. 2, where black and white represent 1 and 0, respectively.Graphs with binary adjacency matrices are called binary graphs hereafter.
In the SBM, each node belongs to one of the K(< n) groups, where K = 3 in the example.As the groups are unknown before modelling, for node p = 1, 2, . . ., n also defined is a K-vector Z p , all elements of which are 0, except exactly one that takes the value 1 and represents the group node p belongs to.For instance, as nodes 1, 45 and 90 in the example belong to groups 1, 2 and 3, respectively, we have Z 1 = (1 0 0) T , Z 45 = (0 1 0) T , and Z 90 = (0 0 1) T .Also defined is an n×K matrix Z := (Z 1 Z 2 • • • Z n ) T , such that Z pi is the i-th element of Z p .The group sizes can be derived from Z, and are denoted by Essentially, N i is the sum, or number of non-zero elements, of the i-th column of Z.In the example, N 1 = 25, N 2 = 30, N 3 = 35.Finally, the K × K edge matrix between groups can be derived from Z and Y.It is denoted by E, where E ij represents the number of edges between groups i and j in undirected graphs, and from group i to group j in directed ones.In the example, E is symmetric as G is undirected, and E 11 = 245, E 22 = 341, E 33 = 481, E 12 = E 21 = 37, E 23 = E 32 = 52, and E 31 = E 13 = 36.
In order to describe the generation of the edges of G according to the groups the nodes belong to, a K × K block matrix, denoted by C, is introduced.If G is undirected, for 1 ≤ i ≤ j ≤ K, C ij ∈[ 0, 1] and represents the probability of occurrence of an edge between a node in group i and a node in group j.Here C is symmetric, as is (1) in the example.If G is directed, for 1 ≤ i, j ≤ K, C ij represents the probability of occurrence of a directed edge from a node in group i to a node in group j, and C needs not be symmetric.Note that no rows or columns in C need to sum to 1.
Whether G is undirected or directed, the idea of the block matrix C means that the dyads are conditionally independent given the group memberships Z.In other words, Y pq follows the Bernoulli distribution with success probability Z T p CZ q , and is independent of Y rs for (p, q) = (r, s), given Z p and Z q .This implies that the total number of edges between any two blocks i and j is a Binomial distributed random variable with mean equal to the product of C ij and the number of dyads available.For undirected and directed graphs, the latter term is N i N j /2 and N i N j , respectively.In fact, Fig. 2 can be viewed as a realisation of simulating from the Binomial distribution with the respective means.Conversely, the densities of each pair of blocks in the adjacency matrix, calculated to be ⎛ ⎜ ⎝ 0.817 0.049 0.041 0.049 0.784 0.05 0.041 0.05 0.808 are, as expected, close to (1).

Stochastic equivalence
The assumption that the edge probability of a dyad depends solely on their memberships (and C) is based on the concept of stochastic equivalence (see, for example, Nowicki and Snijders (2001)).In less technical terms, for nodes p and q in the same group, p has the same (and independent) probability of connecting with node r, as q does.This echoes the observation in the example that node 1 does not seem more connected to the whole network than other nodes in the same group are.While such probability depends on the group membership of r, the equality still holds between p and q.

Block modelling and community detection
The concept of stochastic equivalence in itself does not require that nodes in the same group are more connected within themselves, than with nodes in another group.Essentially, the elements along the major diagonal of C are not necessarily higher than the off-diagonal elements.However, this phenomenon, which is also observed in the example, is often the goal of community detection, a closely related topic to SBMs.Therefore, it is sometimes taken into account in the modelling in the assortative or affinity SBM by, for example, Gopalan et al. (2012) and Li et al. (2016).Community detection algorithms and assortative SBMs will be further discussed in "Community detection" and "Assortative SBM" sections, respectively.

Modelling and likelihood
Given Z and C, we can write down the likelihood based on the assumption of edges being Bernoulli distributed conditional on the group memberships.If G is undirected, again assuming no self-edges, the likelihood is If G is directed, we replace the index in the product in (2) from p < q to p = q.The model with this likelihood will be called the Bernoulli SBM hereafter.With a change of index, (2) can be written as where Multiplying over the group indices i, j will become more useful than over the node indices p, q in some cases.As mentioned in "Introduction" section, when applying SBMs to real-world data, usually neither Z nor C is known, and has to be inferred.Therefore, assumptions have to be made before modelling and inference.For p = 1, 2, . . ., n, we assume that the latent variable Z p is independent of Z q apriori.Also, we assume that Pr(Z pi = 1) = θ i , where θ i is the i-th element of the K-vector θ = (θ 1 θ 2 . . .θ K ) T such that K i=1 θ i = 1.Essentially, the latent group Z p follows the multinomial distribution with probabilities θ , which means A further assumption can be made that θ arises from the Dirichlet((α1 K ) distribution, of which the parameter α comes from a Gamma(a, b) prior.This will be useful in "Poisson and degree-corrected SBMs" section.We shall defer actual inference to "Inference approach" section, and first examine the extensions and variants of SBM in the following section.

Type of graph and extensions of the SBM
In this section, we briefly revisit the lineage of the SBM, discuss how it is extended for binary graphs, and introduce models for valued graphs, to answer question 1 in "Introduction" section.Special attention will be paid to two increasingly useful variants: the degree-corrected and the microcanonical SBMs.SBMs originated from their deterministic counterparts.Breiger et al. (1975) illustrated an algorithm to essentially permute the rows and columns of the adjacency matrix Y.The rearranged adjacency matrix contains some submatrices with zeros only, some others with at least some ones.The former and latter kinds of submatrices are summarised by 0 and 1, respectively, in what they called the "blockmodel", which can be viewed as the predecessor of the block matrix C. White and Breiger (1976) followed this line but also calculated the densities of the blocks in some of their examples.The stochastic generalisation of the "blockmodel" was formalised by Holland et al. (1983).While Wang and Wong (1987) applied the SBM to real directed graphs, they assumed that the block structure is known a priori.Snijders and Nowicki (1997) and Nowicki and Snijders (2001) studied a posteriori blocking, meaning that the groups are initially unknown and to be inferred via proper statistical modelling, for 2 and an arbitrary number of groups, respectively.These lead to the Bernoulli SBM for binary graphs described in "Stochastic block models" section.

Binary graphs
Apart from Snijders and Nowicki (1997), binary graphs have been studied by numerous models.In the absence of additional information, such as covariates or attributes associated with the edges or nodes, the focus of any developments has been on proposing alternative models, or optimising the number of groups K.For example, the mixed membership models in Airoldi et al. (2008); Fu et al. (2009); Xing et al. (2010); Fan et al. (2015) and Li et al. (2016) allowed each node to belong to multiple groups ("Clustering approach" section).Miller et al. (2009) and Mørup et al. (2011) proposed latent feature models ("Latent feature models" section) that deviate from the SBM, while Zhou (2015) proposed an edge partition model ("Related methods for graphs" section).Mørup et al. (2011) and Fan et al. (2015) also, alongside Kim et al. (2013) and McDaid et al. (2013), modelled K in different ways ("Modelling" section), whereas (Latouche et al. 2012;Côme and Latouche 2015) and (Yan 2016) derived various criteria for selecting K ("Criteria and model selection" section).Lee and Wilkinson (2018) proposed a model for binary directed acyclic graphs (DAGs), in which the possible combinations for (Y pq , Y qp ) are {(0, 0), (0, 1), (1, 0)}, and no directed cycles of any length are allowed.This means that if we go from a node to an arbitrary neighbour along the direction of the edges and perform this "random walk" recursively, it is not possible to reach the original node.
While longitudinal SBMs deal with multiple layers of graphs (in a temporal order) directly, there are cases in which the multiple layers of graphs are aggregated, and the observed graph contains one layer only, with information about the specific layers lost.Vallès-Català et al. (2016) proposed a multilayer SBM for such a situation.Using the example of 2 layers, they assumed the adjacency matrix Y l of layer l = 1, 2 is generated by a Bernoulli SBM according to block matrix C l independently, and then considered two versions of the model.In the first version, there is an edge between p and q in the graph to be observed if an edge for the dyad exists in either layer, which means In the second, the existence of the edge in the observed graph requires the existence of the corresponding edge in both layers, which menas Vallès-Català et al. (2016) found that the second version is more predictive (than the usual single-layer SBM) when it comes to link prediction or network reconstruction ("Missingness and errors" section) for real-world networks, which may therefore be better described as an aggregation of multiple layers.

Valued graphs
The Bernoulli SBM can be easily generalised or modified for valued graphs.Apart from the aforementioned undirected and directed graphs, Nowicki and Snijders (2001) also studied directed signed graphs, where Y pq can take the value −1, 0, or 1.If we consider the dyad of a node of group i and a node of group j, and further define two K × K matrices D and E such that D ij and E ij represent the probabilities that the dyad takes the value 0 and −1, respectively, subject to C ij +D ij +E ij = 1, the edge probability, with a slight abuse of notation, is where 1{A} is the indicator function of event A. They also studied graphs for tournaments, which means the edge between nodes p and q can represent a match between the two nodes.The possible combinations for (Y pq , Y qp ) are {(−1, 1), (1, −1), (0, 0)}, corresponding to a loss, win and tie for p, respectively.The specification of the edge probability is omitted here because of the higher notational complexity.There is limited adoption of models in the literature for this kind of graphs.Kurihara et al. (2006) considered a graph in which multiple edges between two nodes can be accounted for.Instead of taking the product over all dyads, they assumed each edge arises from the usual Bernoulli(Z T p CZ q ) distribution after drawing the two nodes p and q independently from a multinomial distribution with weights φ = (φ we assume the m-th edge corresponds to the dyad (p m , q m ), the likelihood is While Kurihara et al. (2006) mainly considered bipartite graphs, we consider the two types of nodes as essentially the same set of nodes, to align with the notation in other models.Yang et al. (2011) mainly worked with binary undirected graphs in their dynamic SBM (see "SBM with longitudinal modelling" section), but briefly extended to the valued version, where Y pq , now the discrete number of interactions for dyad (p, q), is being modelled by a geometric distribution: ( 5 ) DuBois et al. ( 2013) also proposed a model based on the SBMs for network dynamics.
In its simplest version, for the dyad (p, q), the interactions are being modelled by a Poisson process with intensity exp Z T p CZ q , where the elements of C are not bounded by 0 and 1 here.Further modifications of the model are discussed in "SBM with longitudinal modelling" section.

Poisson and degree-corrected SBMs
Karrer and Newman (2011) also worked with (undirected) valued graphs, but arguably in a more natural way.They first redefined Y pq to be the number of edges for the dyad (p, q) following a Poisson distribution, and C ij the expected number of edges from a node in group i to a node in group j.The density of Y pq is now They argued that, in the limit of a large sparse graph where the edge probability equals the expected number of edges, this version of the SBM, called the Poisson SBM, is asymptotically equivalent to the Bernoulli counterpart in (2).To further modify the model, a parameter φ p is introduced for each node, subject to a constraint n p=1 φ p 1{Z pi = 1} = 1 for every group i, so that the expected number of edges for the dyad (p, q) is now where . This is termed the degree-corrected (DC) SBM.The parameters φ p and C ij have natural interpretations as their maximum likelihood estimates (MLEs) are the ratio of p's degree to the sum of degrees in p's group, and the total number of edges between groups i and j, respectively.While Karrer and Newman (2011) have also considered self-edges in their model, such treatment is omitted here for easier notational alignment.
Of importance is the reason behind the DC-SBM.Karrer and Newman (2011) argued that the then existing SBMs usually ignores the variation in the node degrees in real-world networks.This is quite evident as, under the original SBM, the expected degree is the same for all nodes in each group, given Z and C. Karrer and Newman (2011) illustrated that the DC-SBM managed to discover the known factions in a karate club network, while the original counterpart failed to do so.Quite a few recent works built upon the DC-SBM in various ways.For instance, Yan et al. (2014) provided an approach to model selection ("Criteria and model selection" section) between the two versions of SBMs.Also, Lu and Szymanski (2019b) introduced additional parameters to φ as a means of taking the assortativeness into account; see "Assortative SBM" section.McDaid et al. (2013) managed to integrate out the model parameters of the Poisson SBM in (6), to form a collapsed SBM.Apart from (2), (4) and the Dirichlet distribution for θ , they also assumed apriori each C ij is independent and identically distributed (i.i.d.) according to a Gamma(γ , β) distribution (with mean γ /β).With a change of indices similar to (3), the likelihood can be written as Now, the joint density of Y and Z, with C and θ integrated out, can be derived: Now ( 9) is a function of Y, Z (through E and N), and the three scalar (hyper-)parameters only.This becomes particularly useful in inference ("Collapsed SBMs and efficient moves" section).Also, (9) can be compared with (11) in the microcanonical SBM below.Furthermore, if α, β and γ are fixed or integrated out, π(Y, Z) is the exponential of the integrated complete data log-likelihood (ICL).It is a useful quantity when it comes to inference ("Variational methods" section) and model selection ("Criteria and model selection" section).Aicher et al. (2015) introduced a unifying framework for modelling binary and valued graphs, by observing that both the Bernoulli and Poisson distributions belong to the exponential family of distributions.For example, (2) can be written as where S(x) = (x, 1) T and η = log (x/(1 − x)) , log (1 − x) T are both vector-valued functions, of the sufficient statistics and the natural parameters of the Bernoulli distribution, respectively.Now, with a different type of edges observed, S(x) and η(x) can be specified according to the appropriate distribution in the exponential family.In the case of the Poisson SBM (6), S(x) = (x, − log(x! ), −1) T and η(x) = (log x, 1, x) T .Aicher et al. (2015) also used this weighted SBM to clarify the meaning of zeros in valued graphs, as they could mean a non-edge, an edge with weight zero, or missing data.To overcome this ambiguity, they extended (10), so that the (log-)likelihood is a mixture of distributions: where ψ is the weight assigned to the two components.They can correspond to, for example, the Bernoulli and the Poisson SBM, for modelling the edge existence and edge weight, respectively.

Microcanonical SBM
Arguably the most important recent developments is the microcanonical SBM and its nested version (Peixoto 2017a;Peixoto 2017b), as these works are a culmination of applying the principle of minimum description length (MDL) (Grünwald 2007;Rosvall and Bergstrom 2007), a fundamental result regarding K that is related to community detection (Peixoto 2013) ("Community detection" section), an efficient MCMC inference algorithm (Peixoto 2014a), a hierarchical structure that models K simultaneously (Peixoto 2014b) ("Modelling" section), and an approach that models Z and C differently from (4), leading to efficient inference algorithm ("Monte Carlo methods" section).The microcanonical SBM is mentioned here because it can be derived from modifying the Poisson SBM.The case for undirected graphs is illustrated here, with (8) utilised again.Next, for 1 ≤ i ≤ j ≤ K and conditional on Z and an extra parameter λ, C ij is assumed to follow the Exponential distribution with rate N i N j /λ.This assumption replaces that according to (4).By doing so, C can be integrated out from the product of (8) and the exponential density of C: where M, as defined in "Stochastic block models" section, is the total number of edges.Now, Z influences (11), or what we call the integrated likelihood, through E and N only.
(We refrain from calling it the marginal likelihood, which we refer to the likelihood with Z also integrated out.)Furthermore, this π(Y|Z, λ) can be split into the product of the two underbraced terms, which is the joint likelihood of a microcanonical model (Peixoto 2017a).It is termed "microcanonical" because of the hard constraints imposed, as Y and Z together fix the value of E, and therefore Further marginalisation of λ, which is straightforward with one-dimensional integration of π(E|Z, λ) in ( 11), results in where no model parameters are involved, which however, as argued by Peixoto (2017a), is not compulsory in the microcanonical formulation.It is also called a nonparametric SBM, not because of having no parameters but because that K is being modelled ("Modelling" section).Finally, the same marginalisation of C (and λ) can be applied to ( 7) to arrive at the microcanonical DC-SBM.x T p β i + pi , where pi is an i.i.d.Gaussian-distruted error.In this way, the covariates x p determine the memberships Z p through the group-specific vectors {β 1 , β 2 , • • • , β K } (and the error terms).Vu et al. (2013) also studied binary graphs as well as directed signed graphs.Furthermore, they proposed a model that connected the SBMs with exponential random graph models (ERGMs), another prominent class of social network analysis models which can be traced back to Holland (1981).In one example of the model by Vu et al. (2013), the edge probability is

Graphs with covariates or attributes
where x are the covariates, and f and g are functions that may depend on x.We do not specify the index of x because the covariates may depend on the nodes or the dyads.The parameter ψ is constant to both x and the groups p and q belong to, while the use of C, each element of which is possibly vector-valued and not bounded by 0 and 1, is to illustrate the dependence on the blocks and alignment with other models.Peixoto (2018a) extended the microcanonical SBM by incorporating attributes x.Specifically, the joint density of Y and x is where π(Y|Z) is given by ( 12), and the model is termed the nonparametric weighted SBM.Again, a hierarchical or nested structure can be incorporated, so that the groups are modelled by another SBM, and so on, if necessary.Please see "Number of groups" section for further details.Stanley et al. (2019) proposed an attribute SBM in which the group memberships Z of the nodes determine both the graph Y and the non-relational attributes x, which are assumed conditionally independent given Z.In the model formulation, in addition to a Bernoulli SBM for an undirected graph, they assumed that the attribute x p for node p comes from a mixture of Gaussian distributions with weights θ , which are the probabilities for generating Z p as defined in "Modelling and likelihood" section.By utilising both x and Y together in inferring Z, Stanley et al. (2019) found that the two types of information can complement each other, and therefore their attribute SBM is useful for link prediction ("Missingness and errors" section).

Clustering approach
Most of the models introduced so far adopt a hard clustering approach, that is, each node belongs to one group.However, for real-world networks, it is not unreasonable to allow a node to belong to multiple groups.In this section, we will look at models that do so, by incoporating a soft clustering approach in an SBM.Care has to be taken regarding how nodes that can belong to more than one group interact to form edges.

Mixed membership SBM
In the mixed membership stochastic block model (MMSBM) by Airoldi et al. (2008), for each node p, the latent variable Z p , which contains exactly one 1, is replaced by a membership vector, also of length K, denoted by θ p .The elements of θ p , which represent weights or probabilities in the groups, have to be non-negative and sum to 1. Using the example in Fig. 1, Z 1 = (1 0 0) T could be replaced by θ 1 = (0.7 0.2 0.1) T , which roughly means that, on average, node 1 spends 70%, 20% and 10% of the time in groups 1, 2 and 3, respectively.Next, each node can belong to different groups when interacting with different nodes.In order to do so, still assuming that G is undirected, for each dyad (p, q), a latent variable Z pq is drawn from the multinomial distribution with probabilities θ p .As a K-vector containing exactly one 1, Z pq now represents the group node p is in when interacting with q. (Also drawn is Z qp from θ q to represent the group node q is in when interacting with p.) Going back to the example, if Z 12 = (0 0 1) T and Z 13 = (1 0 0) T , which are drawn independently from θ 1 , node 1 belongs to groups 3 and 1, respectively, when Y 12 and Y 13 are concerned.As each Z pq is a K-vector, the collection of latent variables Z is now an n × n × K array.The apriori density of Z now becomes We can carry out (Bayesian) inference once we specify the prior distributions.However, we shall defer this to "Inference approach" section.For the derivations of the model for directed graphs, please see Airoldi et al. (2008).What should be noted here is that the main goal of inference is not for the pairwise latent variables Z, but the mixed memberships .
Several articles built upon the MMSBM introduced.Li et al. (2016) proposed a scalable algorithm ("Inference approach" section), Fu et al. (2009) and Xing et al. (2010) incorporated longitudinal modelling ("SBM with longitudinal modelling" section), and Kim et al. (2013) and Fan et al. (2015) focused on modelling K, the number of groups ("Modelling" section).Fan et al. (2016) observed that the assumption that Z pq and Z qp are independent is not quite realistic in real-world networks, as nodes may have higher correlated interactions towards the ones within the same groups.Therefore, they proposed a copula MMSBM for modelling these intra-group correlations.
Godoy-Lorite et al. ( 2016) modified the MMSBM for recommender systems, in which the observed data is the ratings some users give to some items (such as books or movies), and the goal of modelling and inference is to predicting user preferences.The ratings are therefore treated as the observed edges in a bipartite graph, but it is not the existence of the these edges that is being modelled.Rather, it is the value of the ratings, that is, the edge weight, that depends on the depends on the respective mixed memberships of the users and items.By inferring these memberships as well as the block matrix, predictions on user preferences can be made for unobserved combinations of users and items.
The MMSBM is related to, or has been compared with other models for graphs.For example, the latent feature model (Miller et al. 2009) deviated from the hard clustering SBM in a different way than MMSBM did, and therefore made comparison with the latter in terms of performance.For the class of latent feature models, and their practical difference with MMSBM, please see "Latent feature models" section.A close connection with the latent space models (Hoff et al. 2002;Handcock et al. 2007) have been drawn by Airoldi et al. (2008); please see "Latent space models" section.

Overlapping SBM
When applying MMSBMs, while some nodes might have genuine mixed memberships, some other nodes might have single memberships, that is, θ p being a vector with all 0's but one 1.Such phenomenon is being addressed in another group of models, called overlapping SBMs, in a more direct way.For example, if there are K = 2 groups in the network, there will be nodes belonging to group 1 only, some others belonging to group 2 only, and the rest belonging to both groups simultaneously.When K > 2, overlapping between more than two groups is allowed.Unlike the hard clustering SBMs, the groups are not disjoint anymore in these overlapping models, which therefore are an alternative to the MMSBMs, as far as soft clustering is concerned.Theoretically, there are 2 K − 1 choices of membership combinations for each node, as, for each node, there is a binary choice for belonging to each of the K groups, with the only constraint that it has to have at least one grou membership.
The difference with the MMSBM is noted by Peixoto (2015b), who used the MDL approach (Peixoto 2017a) ("Microcanonical SBM" section) for overlapping models.They first considered a variation of the Bernoulli (or Poisson) SBM, in which the memberships are relaxed in the way described in the paragraph above.They then observed that, for sparse graphs, such an overlapping model can be approximated by a non-overlapping model for an augmented graph.Each distinct membership of a single node in the original graph can be considered as a different node with a single membership in the augmented graph.Modelling and inference ("Inference approach" section) are then straightforward.The expected degree for a node p (in the original graph) with membership in multiple groups will be larger than that for nodes in either group, as p received edges associated with each of the groups independently.Contrastly, in MMSBM, such quantity will be the weighted average of the corresponding quantities of the groups p belongs to.A degree-corrected version which incorporates the DC-SBM (Karrer and Newman 2011) is also derived, in which the soft clustering is achieved through hard clustering of the half-edges, rather than the hard clustering of the augmented graph in the non-degreecorrected version.
The increased number of membership choices (from K to 2 K − 1) naturally brings about the increased complexity of the overlapping SBM.However, such complexity is usually not favoured in applications to real-world networks.By comparing the MDL of the non-overlapping and overlapping SBMs, Peixoto (2015b) managed to carry out model selection, and found that the latter is more likely to overfit, and is selected as the better model only in a few cases.This finding is echoed by Xie et al. (2013) in the context of community detection algorithms ("Community detection" section).They observed through a comparative study that, in real-world networks, each overlapping node typically belongs to 2 or 3 groups.Furthermore, the proportion of overlapping nodes is relatively small for real-world networks, usually less than 30%.Ranciati et al. (2017) proposed an overlapping model that is similar to but not an overlapping SBM, because the connections between the nodes, which they called actors, are unknown in the data.Instead, available in the data are whether the actors have attended certain events.Equivalently, the data can be viewed as a bipartite network between the actors and the events.In the proposed model, clustering is applied to the actor nodes, which can belong to one or more groups, hence the overlapping nature of the model.Subsequently, memberships were inferred without the direct knowledge of edges between the actor nodes.

Related methods for graphs
In this section, two classes of models for graphs and one class of models for hypergraphs, are reviewed, with a focus on how they work with network or graph data in different ways to SBMs.Community detection algorithms will also be mentioned, which are a class of methods with a similar (but not identical) goal to SBMs.

Latent feature models
A class of models closely related to SBMs is the latent feature models (Miller et al. 2009;Mørup et al. 2011), in which there are no longer K groups but K features.For example, if Fig. 1 represents a social network where nodes and edges correspond to people and personal connections, respectively, then the K = 3 features could be gender (0 for female and 1 for male), whether they wear glasses, and whether they are left-handed (0) or righthanded (1).Each element of Z p is a binary latent variable without constraint, representing the absence or presence of a latent feature, meaning that the sample space of Z p is the 2 K combinations of 0's and 1's (note the similarity with the number of combinations in an overlapping SBM in "Overlapping SBM" section).Continuing with the example, if node 1 is a female who wears glasses and is right-handed, Z 1 = (0 1 1) T .
The element C ij (1 ≤ i, j ≤ K) in the matrix C represents the probability of an edge from a node with feature i to a node with feature j.In their infinite multiple relational model (IMRM), Mørup et al. (2011) assumed that the feature combinations are independent, which means that the probability of an edge for the dyad Pr No edge from p with feature i to q with feature j|Z, C where P is a matrix such that Miller et al. (2009) specified their latent feature relational model (LFRM) in a slightly different way, by using a weight matrix W in place of C such that Z T p WZ q can take any real value, and a function Not only do ( 16) and ( 17) look similar to the (conditional) edge probability Pr(Y pq = 1|Z, C) = Z T p CZ q in the aforementioned version of SBM, the latent feature models can also be reduced to the SBM when only one feature is allowed, by imposing the constraint The latent feature models should not be confused with the mixed membership models ("Clustering approach" section), where a node can belong to multiple groups with weights.The practical difference is, while the SBM so far and the latent feature model allow one and multiple 1's in each Z p , respectively, the MMSBMs allow non-binary and non-negative weights in θ p , subject to the constraint that these weights sum to 1 for each p. Zhou (2015) proposed a similar model, called the edge partition model (EPM), in which each element of Z and W is assumed to come from the Gamma distribution, resulting in a non-negative value for Z T p WZ q , which is assumed to be the mean rate of interaction, for dyad (p, q).Assuming that the number of interactions is Poisson distributed and that p is connected to q if they have interacted once, we have Palla et al. ( 2012) extended the LFRM by Miller et al. (2009) by introducing subclusters for the latent features in their infinite latent attribute (ILA) model.Universally, there are still K features and an n × K binary matrix Z representing the presence or absence of latent features for the nodes.Additionally, for feature m, there are m) , and an n-vector denoted by V (m) such that V (m) p represents the subcluster that node p belongs to if it has feature m.If we denote the collections of subcluster vectors and weight matrices by V and W, respectively, we have where σ (•) is the same as in Miller et al. (2009) that maps (−∞, ∞) to (0, 1), such as the sigmoid function σ Comparing the models introduced in this section, the EPM (Zhou 2015) is found to outperform the ILA model (Palla et al. 2012), which in turn outperforms the LFRM (Miller et al. 2009), which in turn outperforms the MMSBM (Airoldi et al. 2008) introduced in "Clustering approach" section.However, (Mørup et al. 2011) did not compare their IMRM (Mørup et al. 2011) with the models here, not was there a single comparison between all these latent feature models.

Hypergraph models
Coauthorship or collaboration networks are a popular kind of data that statistical network methods have been applied to (Newman 2001a;2001b;2004;Newman and Girvan 2004;Ji and Jin 2016).However, the graphs are usually constructed by assigning an edge, possibly valued, to two authors if they have coauthored one or more articles.Such representation, however, does not preserve all the information (Ng and Murphy 2018) and may not be very realistic.For example, pairwise edges between nodes (authors) 1, 2 and 3 could mean that each pair have collaborated separately, or that all three of them have written one or more articles as a whole, or a combination of both.Furthermore, when an article is written by, say, more than 20 authors, it is unrealistic to assume that each pair of authors know each other with equal strengths.
A more natural representation of such data is through the use of hypergraph.Specifically, a hyperedge is an unordered subset of the node set N , and when all hyperedges are node pairs, the hypergraph is reduced to a graph.In the example with the three authors, each pair having collaborated separately corresponds to 3 hyperedges: {1, 2}, {2, 3} and {3, 1}, whereas all three of them collaborating together corresponds to 1 hyperedge: {1, 2, 3}.
Hypergraph data can also be modelled with the same goal of clustering the nodes.However, it is not quite direct to extending from SBMs to "connect a random number of two or more nodes", making it more difficult to work with hyperedges.Ng and Murphy (2018) resorted to and extended the latent class analysis (LCA), in which the hyperedges are clustered into the latent groups, and the memberships of the nodes can be seen as a mixture of the memberships of the hyperedges they are in.Lunagómez et al. (2017) considered a geometric representation of the nodes in an Euclidean space to construct a hypergraph.For ease of explanation, we assume the nodes lie on a 2-dimensional plane, and, for each node, a circle of the same radius is drawn.Then for each set of nodes that have their circles overlapped, a hyperedge is assigned.Essentially, instead of clustering the nodes into groups, this model projects them onto an Euclidean space and infers their latent positions.

Latent space models
Projecting the nodes of a graph G to an Euclidean space and discovering their latent positions has also been explored in the literature.Hoff et al. (2002) proposed the latent space model, in which the latent variable associated with node p, still denoted by Z p here, does not correspond to the group membership, but a position represented by, for example, the vector of coordinates in the Euclidean space.Then the probability of nodes p and q having an edge in G, assuming it is undirected, depends on the distance between Z p and Z q : where d(•, •) is a distance measure, possibly with some parameters, satisfying the triangular inequality.So, probability of an edge between p and q decreases with the distance between Z p and Z q .If covariates x pq about the dyad (p, q) are available, they can also be incorporated into the model: , where α and β are extra parameters.They noted that this model formulation is useful for handling undirected graphs because of the symmetry between p and q.For directed graphs, they proposed , where the asymmetric term Z T p Z q /|Z q | is the signed magnitude of Z p in the direction of Z q .
Real-world networks usually exhibit transitivity, which means that, if both nodes A and B are connected to node C, then A and B are likely to be connected.Another common phenonmenon is homophily, which means that nodes with similar attributes are more likely to be connected.They are accounted for by the above model through the use of latent space and dyad-specific covariates x pq , respectively.Handcock et al. (2007) proposed an extension in the form of a latent space cluster model, by considering the apriori distribution of the latent positions.Specifically, for each node p, Z p is assumed to be drawn from a mixture of K Gaussian distributions, each of which has a different mean and covariance matrix to represent a different group/cluster.In this way, the clustering of the nodes are accounted for explicitly.Airoldi et al. (2008) noted the similarity between their MMSBM and the latent space models.For generating an edge between nodes p and q, the terms Z T pq CZ qp and Z T p IZ q are involved in the former and the latter, respectively, where I is an identity matrix.They have also compared their performances when being applied to the same set of data.
A recent development with latent space models is by Sanna Passino and Heard (2019), who first used spectral clustering (von Luxburg 2007) to project, or embed, the graph to a d-dimensional Euclidean space.Then, a Gaussian mixture model, with K components, is fit to these spectral embeddings.Their novelty is the estimation of d and K simultaneously ("Modelling" section) in their inference algorithm.For more details on spectral clustering and its relation to SBM, please see Rohe et al. (2011).

Community detection
Without the pretext of statistical or probabilistic modelling, community detection can be the goal of analysing a network, which is to cluster nodes so that the edge density is high within a group and low between groups.This concept is also called assortativeness.In the context of SBMs, this means C ii (i = 1, 2, . . ., K) is high while C ij is low for j = i.As mentioned in "Stochastic equivalence" section, this is not guaranteed by the concept of stochastic equivalence alone.While SBMs can find communities with high within-group edge densities, they are in fact a more general method that allow other types of structure in the network to be found (Guimerà and Sales-Pardo 2009;McDaid et al. 2013).
The above effect is illustrated by, for example, the difference between the DC-SBM by Karrer and Newman (2011) and the original version, when applied to a real-world network with K = 2.While the former accounted for the variation in the degree and managed to discover the two communities, the latter put the highly connected nodes together in one group, the rest in another.In bigger networks with nodes on the periphery of the network, that is, they are only connected to one or a few nodes which are more central to the network, these peripheral nodes will be put together in a "miscellaneous" group with a low edge density, under the original SBM, instead of the same groups as the more central nodes they are connected to.

Assortative SBM
One way of achieving community detection is to modify the SBMs to align with this goal.In the assortative (or affinity) SBM (Gopalan et al. 2012;Li et al. 2016), a constraint is imposed that C ij = δ for i = j, where δ is a parameter presumed to be smaller than C ii .While reducing the number of parameters in C from K(K + 1)/2 (in the case of undirected graphs, K 2 in the case of directed ones) to K + 1 may not significantly reduce the computational cost unless K is large, it implies assortativeness.However, it should be noted that incorporating assortativeness in SBMs is not a universal solution.For example, it is not sensible when bipartite networks are modelled, in which connectivity is high between groups but zero within groups.Therefore, caution should be taken whenever an assortative SBM is used, although the stochastic gradient method by Li et al. (2016) should be easily generalisable to a non-assortative model.Lu and Szymanski (2019b) proposed a regularised SBM which extends the DC-SBM to control the desired level of assortativeness.The expected number of edges for the dyad (p, q) is now φ p φ q Z T p CZ q if Z p = Z q , (k p − φ p )(k q − φ q )Z T p CZ q otherwise, where k p is the degree of node p.While a different expected number of edges is allowed according to the group memberships, the parameter φ p is regulated by a parameter h, which is a number between 0 and 1, according to φ p = max(hk p , 1).The tuning parameter h is not estimated but varied, to give differnt clustering results corresonding to different levels of assortativeness.A high value of h leads to a more assortative partition and, in the application, recovers the same known factions in the karate club network as in Karrer and Newman (2011).
The assortative models introduced so far are actual SBMs and not merely related methods for graphs.It should be noted that they are introduced here because of the proximity to the goal of community detection.

Non-probabilistic and modularity methods
Another way of achieving community detection is to step away from SBMs, and apply methods which are not based on statistical or probabilistic modelling but mainly on heuristics, and are usually iterative in nature.For example, in the label propogation algorithm by Raghavan et al. (2007), initially each node is randomly assigned to one of the K groups.Then each node takes turn to join the group to which the maximum number of its neighbours belong (with ties broken uniformly randomly).The iterative process continues until no node changes its group membership anymore.Other methods, for example, are based on the edge betweenness centrality measure Girvan and Newman (2002), random walks on the graph (Pons and Latapy 2006), and network flows and information-theoretic principles (Rosvall et al. 2009).
One very popular framework of community detection is the use, and optimisation, of the modularity of a network, by the highly-cited Newman and Girvan (2004).Assuming an undirected (but possibly valued) graph, the formulae of the modularity, denoted by Q, is where k p = n q=1 Y pq is the degree of node p.As it can be interpreted as the number of edges within groups minus the expected number of such edges, the modularity is a very useful measure for how "good" the clustering is, and numerous algorithms have been proposed for its optimisation (Clauset et al. 2004;Newman 2006a;Newman 2006b;Wakita and Tsurumi 2007;Blondel et al. 2008).Moreover, it can be computed for results not based on modularity optimisation, and compared between all methods.Clauset et al. (2004) suggested that, in practice, a modularity above 0.3 is a good indicator of significant community structure.
Newman ( 2016) established a connection between modularity optimisation and the DC-SBM ("Poisson and degree-corrected SBMs" section).Specifically, maximising the likelihood of a special case of the DC-SBM is equivalent to maximising a generalised version of the modularity.This is also noted by, for example, (Lu and Szymanski 2019a), when they tackled the resolution limit problem, which is reviewed below.

Resolution limit
Community detection algorithms are in general easy to implement, and likely to be faster than applying an SBM.However, its disadvantages includes that different initial configurations may lead to different results even under the same algorithm, and that different results give vastly different results.More importantly, modularity optimisation methods suffer from a resolution limit (Fortunato and Barthélemy 2007;Good et al. 2010;Lancichinetti and Fortunato 2011).This means that by maximising the modularity, smaller groups or clusters cannot be detected, especially for large graphs.For example, performing community detection on a network of 1 million nodes may result in the smallest group having 500 nodes, but the algorithm performed on the whole network cannot go deeper to further cluster these 500 nodes (or that modularity will decrease by doing so).The need of a possible additional round of community detection is not ideal as it should have been a systematic part of the initial community detection.
The above issue of resolution limit is tackled by recent works in various ways.Chen et al. (2014) proposed an alternative to modularity, called modularity density, which theoretically resolves the resolution limit problem, and empricially improves results significantly when being applied to community detection algorithms.Lu and Szymanski (2019a) proposed an agglomerative community detection algorithm to detect communities at multiple scales, thus avoiding the resolution limit issue.Peixoto (2013) noted the connection with the resolution limit for SBMs, and went on to resolve the issue in the hierarchical model (Peixoto 2014b) ("Microcanonical SBM" section).

Miscellaneous
Community detection is, in itself, a largely studied topic in the literature, and a few important reviews should be referred to for further exploration of the topic.Fortunato (2010) provided an introduction and comprehensive review of community detection, while Abbe (2018) surveyed the developments that establish the fundamental limits for community detection in SBMs.Schaub et al. (2014) reviewed a spectrum of algorithms according to their different motivations that underpin community detection, with the aim of providing guildlines for selecting appropriate algorithms for the given purposes.In addition to providing an overview of algorithms based on modularity optimisation, Cherifi et al. (2019) reviewed recent advances on two specific topics, namely community detection for time evolving networks, and immunisation strategies in networks with overlapping and non-overlapping community structure.The former is, in the context of SBMs, reviewed in "SBM with longitudinal modelling" section, while the latter is useful for, for example, targeting a small group of nodes to prevent the spread of epidemics in networks.

Inference approach
In this section, the general framework of inference for SBMs is reviewed, in which there are two main classes of methods: Monte Carlo ("Monte Carlo methods" section) and variational ("Variational methods" section), to answer question 2 in "Introduction" section.Greedy methods will be mentioned in "Greedy methods and others" section, while methods for predicting or correcting the edges or non-edges will be presented in "Missingness and errors" section.While the general approach is discussed here, the more specific models or algorithms for estimating or selecting the number of groups K will be mentioned in "Number of groups" section.Nevertheless, inference and the issue of K are, most of the time, intertwined with each other, and are presented in separate sections here for clarity.
We first observe that, by combining (2) and (4), inference is possible by the frequentist approach.This can be achieved via direct maximisation of likelihood or the expectationmaximisation (EM) algorithm (Dempster et al. 1977), both of which are illustrated in Snijders and Nowicki (1997) for a simple case where K = 2.However, we will focus on the more popular and arguably more powerful Bayesian approach here, as did Nowicki and Snijders (2001).What remains is assigning priors to C and θ before inference can be carried out.We assume each element of C has an independent Beta prior, that is where A and B are K × K matrices with all positive hyperparameters.We also assume θ arises from the Dirichlet(α1 K ) distribution, of which the parameter α comes from a Gamma(a, b) prior.
The joint posterior of Z, θ , C and α, up to a proportionality constant, is Under the soft clustering approach in "Clustering approach" section, the weight vectors {θ 1 , θ 2 , . . ., θ K } are assumed to be i.i.d.according to the Dirichlet(α1 K ) distribution.This yields the joint posterior of Z, , C and α: ( Comparing this with (21) illustrates why the hard clustering approach is preferred in the literature of SBMs.As K is a lot smaller than n usually (hence the purpose of clustering), the computational cost mainly depends on the number of latent variables Z.In the hard and soft clustering approaches, this amounts to O(n) and O(n 2 ) iterations, respectively.The quadratic computation cost means that a simple Gibbs sampler is not very scalable in soft clustering (Mørup et al. 2011).

Monte Carlo methods
If algorithmic simplicity is preferred to computational efficiency, inference can be carried out in a straightforward way via Markov chain Monte Carlo (MCMC).More specifically, a simple regular Gibbs sampler can be used, where all the parameters and latent variables (except α) can be updated via individual Gibbs steps; see, for example, Nowicki and Snijders (2001) and Lee and Wilkinson (2018).In quite a few cases, a Gibbs sampler is natural when K is being modelled ("Modelling" section), and this is used by both SBMs (Tang and Yang 2011;2014;Palla et al. 2012;Fan et al. 2015;Zhou 2015) and a latent feature model (Miller et al. 2009) If computational efficiency is the focus of the MCMC algorithm, careful considersations are required so as not to waste computational time on naive moves.Mørup et al. (2011) noted that, in a latent feature model, the possible combinations of latent features is 2 Kn , compared to the K n possible combinations of group memberships in an SBM, and so standard Gibbs samplers are unlikely to be scalable.Therefore, in their algorithm, certain types of moves are employed so that the computational complexity increases linearly with the number of edges, not of node combinations.Li et al. (2016) made a similar claim for their stochastic gradient MCMC algorithm, in which only a small mini-batch of the nodes is required in each iteration, to greatly reduce the computational overhead.These algorithms benefit from the fact that large graphs are usually sparse in reality.

Collapsed SBMs and efficient moves
While both McDaid et al. (2013) and Peixoto (2017a) and Peixoto (2017b) integrated C out in their respective SBMs, they proposed different efficient moves in their MCMC algorithms.In McDaid et al. (2013), one of the following four moves is selected randomly uniformly and performed in each iteration.The first one is a Gibbs move for a randomly selected node.The second is a move that selects two groups at random and proposes to reassign all the nodes in these two groups, in a scheme described by Nobile and Fearnside (2007).The last two are moves that affect the number of groups alongside the memberships, as K is treated as a parameters and estimated.McDaid et al. (2013) also proposed a method to deal with the issue of label-switching, which is circumvented in the spectral clustering model by Sanna Passino and Heard (2019).
Peixoto (2017a); Peixoto (2017b) applied a single-node move proposed by Peixoto (2014a) that works as follows.For node p, whose membership Z p is to be updated, a neighbour q is selected randomly uniformly, whose membership is, say, Z q = i.Then node p is proposed to move to a group j (which could be the same as i) with probability proportional to E ij , that is, the number of edges between groups i and j.Note that this is different from proposing to move p to a group j with probability proportional to the number of neighbours of p in group j.Another move is also proposed for merging groups, as K is being modelled ("Modelling" section).

Variational methods
An alternative to the Monte Carlo methods is the class of variational expectationmaximisation (VEM) methods.The principle of these algorithms is to first provide a lower bound to the marginal log-likelihood, in which the latent variables (Z here) are integrated out, by the use of an approximate variational distribution.This lower bound is then tightened or maximised, with respect to the model parameters (θ , C and α here) as well as those of the variational distribution.We illustrate this using the Bernoulli SBM.By writing η = {θ , C, α}, for any distribution Q of the latent variables Z, we have The second line is due to Jensen's inequality as the logarithm function is concave.As log As the marginal likelihood on the left-hand side is constant to the choice of Q, maximising the first term with respect to Q and η is equivalent to minimising the K-L divergence, thus improving the approximation.While the ideal choice of Q(Z) is the conditional distribution π(Z|Y, η) such that the K-L divergence is 0, the latter is usually intractable in the models discussed here, and so the best tractable choices of Q(Z) are being sought.Usual choices are such that Q(Z) is factorisable, making analytical calculations of the lower bound possible.Finally, the lower bound is iteratively maximised with respect to η in the M step, and (the parameters of ) Q in the E step, in an EM algorithm.
The factorisable variational distributions and the iterative steps have been illustrated by, for example, Kurihara et al. (2006) and the following references in this section.Gopalan et al. (2012) and Kim et al. (2013) considered stochastic optimisation in place of EM algorithms, while Matias et al. (2015) considered variants of the M step.Hayashi et al. (2016) opted for belief propagation as the alternative approach to obtaining the variational distribution Q.
That the variational approach is adopted is due to several reasons in different articles.Airoldi et al. (2008) argued that in the MMSBMs (Airoldi et al. 2008;Fu et al. 2009;Xing et al. 2010) it outperforms (with computational cost O(nK + 2K)) the corresponding MCMC methods (with computation cost O(n 2 )).Similarly, (Vu et al. 2013) argued that their algorithm for SBM is more scalable (O(n)) compared to latent space models (O(n 2 )) ("Latent space models" section).In some other cases, it facilitates the computation of a criterion, such as the ICL π(Y, Z) (Latouche et al. 2012;Matias et al. 2015;Hayashi et al. 2016;Matias and Miele 2017), or the observed data log-likelihood π(Y|η) (Yan et al. 2014).This criterion can be directly used for model selection, or equivalently selecting K ("Criteria and model selection" section).

Greedy methods and others
While most articles in the literature have used either Monte Carlo methods or variational methods, there are a few exceptions.In the DC-SBM (Karrer and Newman 2011), the log-likelihood can be obtained by summing the logarithm of ( 7) over all possible dyads.Instead of integrating out C and φ, they adopted a frequentist approach and substituted their respective MLEs to obtain the objective function π(Y|Z), which is the basis of their label-switching algorithm.In each step, each node is proposed to move from one group to another, and selected is the move that will most increase or least decrease π(Y|Z).Once all nodes have been moved and the group memberships Z are according updated, the objective function is calculated for all the steps this greedy algorithm passed through, and the one with the highest objective is selected as the initial state of another run of the algorithm.The algorithm is stopped when there is no further increase in the objective function.This label-switching algorithm is also incorporated by Xu and Hero III (2013) in the inference algorithm for their dynamic SBM, which will be discussed in "SBM with longitudinal modelling" section.
Côme and Latouche (2015) also proposed a greedy step in their inference algorithm.Before doing so, required is the ICL, log π(Y, Z), by integrating out the parameters η = {θ , C, α}.They worked out an asymptotic version via certain approximations, as well as an exact version under certain priors.The ICL then becomes the objective function in their greedy optimisation algorithm.Similar to Karrer and Newman (2011), in each step, each node is proposed to move from one group to another, and selected is the move that will most increase the ICL.However, if no proposed move results in an increase, Z p remains unchanged.The algorithm terminates again if there is no further increase in the objective.
Yan (2016) combined the work by Karrer and Newman (2011) and Côme and Latouche (2015), by approximating the ICL for the DC-SBM.As the focus is selecting the number of groups by comparing the objective under different values of K, Yan (2016) only provided the calculations for the ICL in the absence of an inference algorithm for the parameters and/or the latent variables.
In the work by van der Pas and van der Vaart (2018), they did not consider the inference algorithm and were mainly concerned with the posterior mode, that is, the value of Z that maximises the posterior density in ( 21), or the collapsed version thereof according to McDaid et al. (2013).They found that this mode converges to the true value of Z (as n → ∞), under the frequentist setup that there is such a true value, and with the condition that the expected degree is at least of order (log n) 2 .The reason that the priors of Z (and of other parameters) are used even in a frequentist setup is that such priors actually play a part in establishing the convergenece.Furthermore, van der Pas and van der Vaart (2018) established a connection of such maximised posterior density, termed Bayesian modularity, with the likelihood modularity defined by Bickel and Chen (2009) ("Criteria and model selection" section).

Missingness and errors
In some cases, the goal of applying an SBM is not (only) the inference of the group memberships Z, but on dealing with partially or errorfully observed graphs.For example, there is no information regarding the interactions between some dyads.Inferring these edges or non-edges, with some (un)certainty, then becomes the goal of inference, and is called the link prediction problem in the literature.In the context of Sbms, this has been looked into by, for example, Guimerà and Sales-Pardo (2009), Zhou (2015) 2019), and also by Miller et al. (2009) for latent feature models.Closely related is the issue with errors in the observed graphs, where the spurious edges may lead to wrong conclusions.Here, network reconstruction becomes another goal of inference.This has been studied by, for example, Guimerà and Sales-Pardo (2009), Priebe et al. (2015) and Peixoto (2018b), in the context of SBMs.

Number of groups
As the main goal of SBMs is to cluster nodes into groups or communities, without a given number of groups K, it is difficult to evaluate the likelihood and infer the group memberships Z. Unless there is prior information on K, it might not be objective to carry out inference for one fixed value of K. Solutions usually come in two main directions.The first is to fit an SBM to multiple values of K, with a measure computed to quantify the goodness of fit, followed by selecting the optimal K. Works in this direction will be reviewed in "Criteria and model selection" section.The second is to model K as yet another parameter and estimate it in the inference, most likely by transdimensional inference algorithms.Works in this direction will be reviewed in "Modelling" section.Together, these methods can be seen as an answer to questions 3 and 4 in "Introduction" section simultaneously.
In some articles, the number of groups is fixed, rather than being estimated.Snijders and Nowicki (1997) set K = 2, enabling the MLE calculations and working out the associated EM algorithm.Tallberg (2005) and Karrer and Newman (2011) used K = 2 and K = 3, respectively, for their data but acknowledged the issue of assuming that K is given, which is usually not the case in practice.Yang et al. (2011) focussed on the dynamic structure of social networks, and fixed K to 2 for two of their datasets, and to 3 for a third dataset, by using prior knowledge on the three sets of data.Similarly, Vu et al. (2013) set K = 5 by following external relevant practices of using five groups for their data.Xu and Hero III (2013) fixed K = 7 for their Enron email network data, also using prior knowledge on the classes the nodes (employees) belonged to.In their model for recommender systems, Godoy-Lorite et al. ( 2016) reported the results for 10 groups of users and 10 groups of items, and found no differences in performance when either number of groups is increased.The data analysed by Zhang et al. (2017) contained individual attributes that allowed them to divide the nodes into K = 3 groups, and their focus was on showing that their dynamic SBM obtained results closer to the ground truth than the static counterpart did.When van der Pas and van der Vaart (2018) illustrated the connection of their Bayesian modularity with the likelihood modularity, through the proximity of the empirical results to those by Bickel and Chen (2009), they used both K = 2 and K = 4 in the application to the same set of data.Nowicki and Snijders (2001) made separate fits to their data using K = 2, 3, 4, 5, and compared the information as well as a parameter representing the "clearness" of the block structure, to aid their decision on K = 3.This set the precedent to using some kind of criterion to select the optimal K.

Criteria and model selection
A measure previously mentioned that requires no modelling is the modularity in (20), which has been compared with the profile log-likelihood by Bickel and Chen (2009), who called it a likelihood modularity.To obtain this criterion, first observe that, in (3), the MLE of C ij is simply Ĉij = E ij /N ij , the edge density between groups i and j.This estimate can then be plugged in (3) to obtain While Bickel and Chen (2009) did not use the profile log-likelihood mainly for model selection, it can be seen as a precedent in moving from the non model-based modularity to a criterion based on SBM.Another criterion used in earlier works is the Bayesian information criterion (BIC) (Airoldi et al. 2008;Fu et al. 2009;Xing et al. 2010).However, according to Latouche et al. (2012) and Côme and Latouche (2015), it is not tractable in most cases as it depends on the observed data log-likelihood log π(Y|η), where η represents the collection of parameters, and is known to misestimate K in certain situations (Yan et al. 2014).Yan (2016), however, drew connection with the minimum description length (MDL) (Peixoto 2013) and dervied a more principled BIC.
Instead of evaluating the BIC, Yan et al. (2014) focussed on approximating the observed data log-likelihood log π(Y|η).They applied what they called belief propagation in their EM algorithm, in order to approximate log π(Y|η) from log π(Y|Z, η), which is derived for both DC-SBM and its original counterpart.By doing so, a (log-)likelihood ratio can be computed, denoted by log π(Y|η 1 ) − log π(Y|η 2 ), where η 1 and η 2 correspond to the parameters under the DC-SBM and the original version, respectively.With its asymptotic behaviour derived, a likelihood ratio test can be performed to answer the question of whether degree correction should be applied to the data in hand.
While log π(Y|η) or its approximation is also being used by, for example, Decelle et al. ( 2011) (who termed it free-energy density in statistical physics terms), more works were based on the ICL log π(Y, Z), which is equivalent to the MDL (Peixoto 2014b; Newman and Reinert 2016) under compatiable assumptions.In order to obtain it, we start with the quantity log π(Y, Z|η) in ( 24), and attempt to integrate η out: However, this is usually not tractable (Côme and Latouche 2015).Daudin et al. (2008) provided an approximation for directed graphs: The above approximation, however, can be improved by noting that some parameters in η, such as C and θ , can be integrated out by considering conjugate priors.This is exactly the case in (9), which in fact is equivalent to log π(Y, Z|η) in (24) but with η = (α, β, γ ).Noting such potential, Latouche et al. (2012) and Côme and Latouche (2015) integrated out parameters whose dimensions depend on K or n, and fixed the values of the remaining scalar parameters in η, but parted ways in their subsequent novelty.While Côme and Latouche (2015) derived an exact ICL, Latouche et al. (2012) proposed to approximate the marginal log-likelihood log π(Y), using the version of (24) without η.After the convergence of the variational algorithm ("Variational methods" section), the first term on the right hand side of (24) can be computed by plugging in the estimated Z.This maximised lower bound E Q log π(Y, Z) − log Q(Z) is then used by Latouche et al. (2012) as the approximation of the marginal log-likelihood, under the assumption that the K-L divergence is close to zero and does not depend on K.This is also being adopted by Aicher et al. (2015), who calculated the Bayes factor of one model to another, according to the difference of their respective approximate marginal log-likelihoods.Similarly, Hayashi et al. (2016) provided an approximation to the marginal log-likelihood, this time an asymptotic one, called the fully factorised information criterion (F 2 IC).
Wang and Bickel (2017) also investigated log-likelihood ratio similar to Yan et al. (2014), but this time for an underfitting/overfitting K against the true K. Using its asymptotic results, they derived a penalty term λ K(K+1) 2 n log n, where λ is a tuning parameter, and subsequently a penalised (log-)likelihood criterion.However, Hu et al. (2019) argued that this penalty tends to underestimate K, and therefore proposed a lighter penalty λn log K + K(K+1) 2 log n in their corrected BIC.Other criteria used include the mean log-likelihood of held-out test data (Gopalan et al. 2012;DuBois et al. 2013;Li et al. 2016), termed perplexity by the latter, as the criterion for selecting K. Finally, Ludkin et al. (2018) considered the ratio of the sum of the squared distances in the k-means clustering for all nodes in different groups, to the sum of squared distances between all node pairs, in what they called the Elbow plot, to determine the number of groups.
Vallès-Català et al. ( 2018) investigated an empirical approach to model selection, which is to maximise the performance in link prediction ("Missingness and errors" section).They found that the results are sometimes inconsistent with those by a criterion formulated in a similar way to those described above, which is the posterior likelihood π(Z|Y) in their case.

Modelling
The formulation of the SBM and the inference algorithm in "Inference approach" section relies on K being specified beforehand.To increase the flexibility of the model, θ can be assumed to arise not from a Dirichlet distribution but from a (Hierarchical) Dirichlet process, which will be introduced in "Hierarchical Dirichlet process" section, as it was first used in topic modelling.In this way, K becomes a random quantity generated by the process, and potentially an infinite number of groups is allowed.By incorporating the Dirichlet process in the SBM, as did Kurihara et al. (2006), Mørup et al. (2011), Tang and Yang (2011), Tang and Yang (2014), Kim et al. (2013) and Fan et al. (2015), K can be estimated along other parameters and latent variables.
A similar structure for models which are not SBMs can be incorporated, so that K can be modelled and inferred.In the latent feature model by Miller et al. (2009), an Indian buffet process is used, while a hierarchical Gamma process is used in the edge partition model by Zhou (2015).
One major issue with modelling and inferring K is that, in the SBMs with C and/or θ not integrated out, the number of parameters changes with K.As we have seen in ( 9) and ( 11), as well as in "Criteria and model selection" section, in some cases they can be marginalised, usually with conjugate priors.Having a likelihood in which the number of parameters is constant to K greatly facilitates the associated inference algorithm, regardless of whether K is modelled or selected by a criterion.Therefore, in recent years, these models have been preferred to the aforementioned models with a nonparametric process for K.In the case where K is being modelled, estimation is very often carried out using MCMC, and care has to be taken when K grows or shrinks in the algorithm.Both McDaid et al. (2013) and Sanna Passino and Heard (2019) included two moves that allow so.The first move proposes to add or remove an empty group, while the second proposes to merge two groups into one or split one into two.
Peixoto (2014a) also allowed K to be modelled, and proposed a progressive way of merging groups.To explain such way, first consider each group i in the original graph, denoted by G (0) , as a node in a graph one level above, denoted by G (1) .Similarly, the total number of edges E (0) ij between groups i and j in G (0) is now the valued edge Y (1)  ij between nodes i and j in G (1) .Then proposing to merge groups in G (0) can be seen as proposing to move the membership of a node in G (1) , and this can be carried out in a similar fashion to that described in "Collapsed SBMs and efficient moves" section.
Similar to McDaid et al. (2013), Newman and Reinert (2016) also considered the possibility of empty groups, through modelling K explicitly as a parameter.Specifically, K represents the number of groups that the nodes can potentially occupy, not the number they actually do according to Z.Such possibility of empty groups ensure the joint posterior of Z and K, with other parameters integrated out in their Poisson SBM, to be computed correctly: Now, applying an MCMC algorithm with π(Z, K|Y) as the target density will give a (marginal) posterior of K, π(K|Y).

Nested model
Peixoto (2017a) presented an example of a synthetic network containing 64 cliques of 10 nodes.The nodes are connected with each other within a clique, while there is no edges between nodes in any two different cliques.The results of fitting the microcanonical SBM ("Microcanonical SBM" section) suggested an optimal of 32 groups, each containing two cliques, mean that the model suffers from underfitting rather than overfitting, due to the fact that the maximum detectable K is (proportional to) √ n (Peixoto 2013).Therefore, Peixoto (2014b) proposed a nested SBM, also adopted by Peixoto (2017a), Peixoto (2017b), which can be described using the aforementioned level-one graph G (1) .While G (0) is characterised by the original SBM, the resulting G (1) is characterised by another SBM, which results in G (2) , and so on, until there is only one group at the top level.Accompanied by the moves described above for modelling and inferring K, they found that this nested SBM managed to overcome the underfitting issue, while discovering a hierarchical structure.

Comparison
In this section, an overall comparison in two aspects is provided between the works reviewed.The first aspect concerns the performances of the models that all applied to two real-world examples.The second aspect concerns the approaches reviewed in "Clustering approach", "Inference approach" and "Number of groups" section.

Real-world examples and performances
A classic undirected network studied in the literature is the karate club reported by Zachary (1977).Due to the nature of the study, it was known that the club has been split into two main factions over a disagreement, eventually forming two smaller clubs.Zachary (1977) managed to observe the friendship among n = 34 of the members, and compiled the undirected network of M = 78 edges, which is plotted in Fig. 3 (there were more members who however did not interact with this biggest component of the network).It has been studied for both community detection methods and SBMs (Karrer and Newman 2011;Yan et al. 2014;Yan 2016;van der Pas and van der Vaart 2018;Lu and Szymanski 2019b).Within the community detection algorithms, Girvan and Newman (2002) and Newman and Girvan (2004) managed to reveal the two true groups, Blondel et al. (2008) stopped short at four groups.But this should not be taken as an indication of superiority automatically, as, in fact, there is also no unanimous agreement within the SBMs.On one hand, the models mostly agreed on the optimal K, as Decelle et al. (2011) found K = 2 using their belief propagation algorithm and the observed data log-likelihood criterion, and Newman and Reinert (2016) found that the posterior π(K|Y) in their DC-SBM is indeed maximised at K = 2. On the other hand, while the DC-SBM (Karrer and Fig. 3 The karate club network by Zachary (1977) Newman 2011) revealed the two known factions that the original SBM failed, the likelihood ratio test by Yan et al. (2014) suggested that there were not enough evidence to suggest that the network was generated by DC-SBM (against the original SBM), even when the inhomogeneous degree distribution prompted the degree correction in the first place.Furthermore, Yan (2016) found that using the ICL criterion selected K = 1 under the DC-SBM, meaning that any blocking leads to overfitting, while K = 4 was selected under the original SBM, with different groups corresponding to different levels of node degrees.
Another example is the directed network of political blogs by Adamic and Glance (2005) during the 2004 U.S. Presidential Election.While it contains 1490 blogs in total, usually only the giant component of 1222 blogs (nodes) with 19089 is analysed.Furthermore, the blogs were labeled as either "liberal" or "conservative" by Adamic and Glance (2005) according to their political leanings, providing some sort of ground truth.By forcing K = 2, the DC-SBM by Karrer and Newman (2011) again showed a high agreement with the ground truth, while the original SBM did not.The evidence for the former is supported by the likelihood ratio test by Yan et al. (2014).Yan (2016) presented a mixed picture in which the DC-SBM always dominated the original SBM, even more so at smaller K, but the ICL criterion actually increased with K, thus potentially requiring some penalty.While Wang and Bickel (2017) selected K = 4 using their penalised loglikelihood, closer investigation revealed that one ground-truth group matched well with one inferred group, while the other ground-truth group is split into three smaller inferred groups.However, Hu et al. (2019) found that, at a certain value of the tuning parameter λ ("Criteria and model selection" section) the penalised log-likelihood selected K = 1, therefore arguing for their corrected BIC that has a heavier penalty.Finally, the nested SBM (Peixoto 2017a) managed to cluster the nodes beneath the two main political factions, and discovered 20 groups in the lowest-level SBM according to the DC-SBM.
While there is an agreement over the big picture, the models usually differ in the fine details.

Approaches
Several aspects of SBMs and related models have been considered so far, namely the type of graph, whether the model is an SBM or not, the inference approach, the clustering approach, the number of groups, and whether there is longitudinal modelling.However, these aspects are not completely independent of each other.For example, opting for a latent feature model (Miller et al. 2009;Mørup et al. 2011) not only deviates from an SBM, but also increases the computation complexity because the number of latent variable combination increases from K n to 2 Kn , thus in turn prompting for an efficient and scalable inference algorithm.Another example is that a soft clustering approach influences partly how K is being modelled or selected.The models considered here either opted for modelling by a Dirichlet process (Kurihara et al. 2006;Kim et al. 2013;Fan et al. 2015) or selecting by a criterion (Airoldi et al. 2008;Fu et al. 2009;Xing et al. 2010;Gopalan et al. 2012;Li et al. 2016).Also, the added computational complexity prompted (Li et al. 2016) to propose an inference algorithm which exploits the sparity of graphs.Finally, using a variational approach more often results in K being selected by a criterion (Airoldi et al. 2008;Fu et al. 2009;Xing et al. 2010;Gopalan et al. 2012;Latouche et al. 2012;Aicher et al. 2015;Matias et al. 2015;Hayashi et al. 2016;Matias and Miele 2017) than not (Kurihara et al. 2006;Kim et al. 2013;Vu et al. 2013).Note that these are not intended as definitive arguments but to highlight that the modelling, inference, and clustering approaches, and the issue with the number of groups, are quite interconnected.For a comprehensive comparison, please see Tables 1 and 2 for models with and without longitudinal modelling, respectively.

SBM with longitudinal modelling
The SBMs and related models introduced so far assume that the graph is observed at one instant, which can be regarded as a cross-section of a graph that is evolving over time.If temporal information of the interactions is available, or the graph is observed at multiple instants, longitudinal or dynamic models for graphs can be applied.Therefore, articles that incoporate longitudinal modelling will be reviewed in this section, providing a partial answer to question 3 in "Introduction" section.
In the dynamic model by Fu et al. (2009) and Xing et al. (2010), which are direct extensions of the MMSBM by Airoldi et al. (2008), the membership probabilities of node p is now indexed by time t, denoted by θ p , and is dependent on θ (t−1) p via a state space model.Similarly, the block matrix now becomes C (t) , and evolves over time according to a separate and independent state space model.The latent pairwise group memberships at time t, Z (t) , is then generated by (t) = θ (t)  1 . Finally, the observed graph Y (t) is assumed to arise from an MMSBM (Airoldi et al. 2008) with parameters C (t) and latent variables Z (t) .Fan et al. (2015) proposed two dynamic models, by combining the Dirichlet process and the MMSBM.In their mixture time variant (MTV) model, at each instant the group memberships Z (t) are drawn according to the membership probability vector (t) , elements of which arises from a Dirichlet process.The process parameters in turn depend   on the history of group memberships {Z (1) , Z (2) , . . ., Z (t−1) }.In the mixture time invariant (MTI) model, at each instant the group memberships p depends on K time-invariant θ (i)  p (i = 1, 2, . . ., K), also from a Dirichlet process, and the previous group memberships Z (t−1) .The graph Y (t) is then generated from an MMSBM with Z (t) , and a universal block matrix C.This means that the historical group memberships influence the distribution of the current ones, in the MTV and MTI models, through the Dirichlet process parameters and the group-specific (and time-invariant) membership probabilities, respectively.Tang and Yang (2011); Tang and Yang (2014) adopted a similar idea in their dynamic SBM with temporal Dirichlet process.As hard clustering is used here, there is one membership vector θ (t) for all nodes at each instant.However, this θ (t) is dependent on the Dirichlet process parameter and the group memberships Z (t−1) at the previous instant, so that the collection {θ (t) } arises from what they called the recurrent Chinese restaurant process.The ingredients for generating the graph Y (t) include the block matrix C, which is constant over time, and the group memberships Z (t) , which are generated by θ (t) in turn.Yang et al. (2011) considered a simpler evolution mechanism of the group memberships in their dynamic SBM.A Markov chain is assumed for Z (t) , which means, for node p, Z (t−1) p moves to Z (t)  p according to a transition matrix, which remains unchanged over time.Also assumed constant over time are the model parameters θ and C. Matias and Miele (2017) proposed a similar model, except that the graph is allowed to be valued, in which case each element in C does not necessarily mean the edge probability.They noted that allowing the group memberships and the block matrix to vary over time will lead to identifiability and label-switching issues, therefore fixing C to be constant over time.
Xu and Hero III (2013) used a dynamic SBM quite different to the ones introduced so far.They considered the observed density of edges between two groups to be a noisy observation of a dynamic system, in which the corresponding element in the block matrix is the state.While C (t) is modelled by a state space model, there is none specified for the time-specific group memberships Z (t) .
Another distinctive model is the autoregressive SBM (Ludkin et al. 2018), in which the group membership Z (t)  p follows a continuous-time Markov chain (CTMC), meaning that node p spends an exponentially distributed time in a group, before moving to another group chosen uniformly at random from the remaining groups.If dyad (p, q) belongs completely to one group, that is, both p and q belong to the same group, the presence or absence of an edge over time is modelled by a separate CTMC, the transition rates of which are universal to all dyads that belong completely to the same group as (p, q).For all the remaining dyads where the group memberships do not coincide, there is one extra set of transition rates governing the independent edge process for each dyad.Essentially, there are n(n − 1)/2 CTMCs modelling the dyads, the parameters of which are determined, as always in SBMs, by the group memberships.
Whilst also using a CTMC in their dynamic model, Zhang et al. (2017) placed a it on the edge existence/absence Y (t) pq , instead of the group membership Z p .Using the DC-SBM (7), the rate that edges are removed depends on the group memberships of p and q, while the rate that edges are added is the product of the edge-removal rate and φ p φ q Z T p CZ q , which is the same term used in the static model ( 7).Using the Markov property of a CTMC, they worked out the probabilities of the graph observed at one time point, given the graph at the previous time point, thus the whole likelihood.For inference, conditional on Z, the MLEs of the other parameters can be obtained analytically (in terms of Z).These estimates are then plugged in to the original log-likelihood to arrive at the profile log-likelihood, which is then maximised using a greedy algorithm, to obtain the MLEs of Z. Peixoto (2015a) proposed two versions of modelling longitudinal or temporal networks, in which group memberships are the same across layers or time, based on the MDL approach (Peixoto 2017a).In the first, the total numbers of edges between the groups across all layers are viewed as a collapsed graph that arises from one SBM (according to, for example, (3) or ( 8)).The edges between groups i and j are then assumed to be uniformly distributed, conditioned only on the number of edges between i and j in each layer.In the second version, each layer is generated from an independent SBM, but nodes are allowed to belong only to a subset of the layers.A node not belonging to a layer means that the node is forbidden to have edges in that layer.Equivalence can be established between special cases of the two versions for non-degree-corrected SBMs, but their degree-corrected counterparts are in general not equivalent.This is because degree variability across layers is allowed for the second version but not the first.For temporal modelling, unlike the models introduced above, there are no temporal dynamics explicitly specified.However, Peixoto (2015a) suggested binning the times so that each bin can be viewed as a layer, within which the large-scale network structure does not change significantly.Depending on the version chosen and whether degree correction is incorporated, how the edges can be assigned to the nodes at different layers is flexibly regulated.Finally, selecting the most appropriate time binning, alongside the choice between the two versions and that of degree correction or not, can be aided by model selection through the use of the MDL.
The models introduced so far model the evolution of a graph, usually a binary one, and observed at different instants.They are however not suitable for recurrent interaction events in a network, of which the times of interaction are random in nature.In the aforementioned DuBois et al. (2013), the interactions for the dyad (p, q) arise from a Poisson process with intensity exp(Z T p CZ q ).The intensity can be extended to depend on not only the group memberships only but also on the historical interactions.Specifically, it is the dot product of a vector summarising the interactions of the dyad and Z T p CZ q , which is now allowed to be vector-valued (essentially making C a 3-dimensional array).However, due to such a generalisation, it is misleading to continue calling such model a Poisson process, because the piecewise constant intensity depends on the history of the process, thus making it self-exciting in nature.Matias et al. (2015) formulated a similar model using conditional Poisson processes.The block matrix C(t) is now a K × K matrix of intensity functions, not probabilities or scalar parameters.Conditional on Z p and Z q , which are assumed to remain unchanged over the course of time, the interactions of the dyad (p, q) follows a nonhomogeneous Poisson process with intensity Z T p C(t)Z q .Similar to Matias and Miele (2017), identifiability and label-switching issues were also discussed.
It should be noted that having both the temporal information in the data and a longitudinal/dynamic SBM does not guarantee that the actual groups will be discovered.Ghasemian et al. ( 2016) studied a dynamic SBM, and derived a threshold as a function of the rate of change and the strength of the groups/communities.They found that, below that threshold and empirically, no efficient inference algorithms can identify the groups better than chance.

Topic models
In this section, we briefly introduce the general form of topic models, in particular latent Dirichlet allocation, to prepare for "SBM with topic modelling" section, where they are incoporated in SBMs.While topic models are not the main focus of this review, we will discuss aspects including the inference approach, dealing with the number of topics, and longitudinal modelling, which have been covered in "Inference approach, "Number of groups", "SBM with longitudinal modelling" section for SBMs, respectively.
The main goal of topic modelling is to cluster a collection of documents into different topics.Each latent topic is represented by its distribution over the words that appear in the documents, with such distribution usually being visualised by wordclouds.The roles of the documents, the topics, and the words are then analogous to those of the nodes, the groups, and the edges, respectively.One major difference is that a document is usually assumed to be independent of other documents apriori, whereas a node in a graph is defined by its interactions, or the lack thereof, with others.Another difference is that soft clustering is the dominating approach in topic modelling, meaning that each document usually has non-binary weights over multiple topics.For example, an article in genetics may be about biology, chemistry, statistics, and medicine, for 40%, 30%, 20% and 10%, respectively.This is in contrast to the (usual) hard clustering approach in SBMs.
We first introduce some terminology and notation, which may look different to that in the literature, but is intended to align with what we have introduced for SBMs.Assume that the there are m documents, and n distinct words in total, denoted by In document p = 1, 2, . . ., m, there are N p words.The q-th word is represented by an n-vector W pq := W pq1 W pq2 . . .W pqn T , in which one element is 1, the rest 0. If W pqk = 1, this means the k-th word in V is the one used as the q-th word in document p.We also define the and which essentially is the whole of the data.Finally, an m × n matrix M, called the document-word frequency matrix, is defined.Each element M pk (p = 1, 2, . . ., m; k = 1, 2, . . ., n) represents the frequency of the word V k in the p-th document.
A common assumption in topic modelling is that there are K latent topics.Associated with the i-th topic is an n-vector which represents the distribution of the vocabulary in this topic.The sequence {φ i } (i = 1, 2, . . ., K) is assumed to be independent and identically distributed (i.i.d.) according to the Dirichlet(β1 K ) distribution.Collectively, we write In a similar fashion, associated with the p-th document is a K-vector θ p , subject to the constraint θ T p 1 K = 1, which represents the distribution of the K topics for this document, or its mixed membership in the topics.The sequence {θ p } (p = 1, 2, . . ., m) is assumed to be i.i.d.according to the Dirichlet(α1 K ) distribution.Collectively, we write := (θ 1 θ 2 • • • θ m ) T , which is an m × K matrix.This is similar to the definitions in "Clustering approach" section.
The main difference between various topic models is the generating mechanism of the words in the documents.The principle in latent Dirichlet allocation (LDA) by Blei et al. (2003) is that a document can belong to different topics when generating each word.Specifically, for document p, associated with the q-th word (q = 1, 2, . . ., N p ) is a Kvector latent variable, denoted by Z pq .Only one element of Z pq is 1, representing the topic the document belongs to for this particular word, the rest of which 0. (This is similar to the latent group membership of a node p for its particular interaction with q in MMSBM (Airoldi et al. 2008).)Collectively, we write T , which is an N p ×K matrix, and Z := {Z 1 , Z 2 , . . ., Z n }, which is a sequence of matrices as well as a collection of all latent variables.Now, with the data W, the latent variables Z, the parameters and , the likelihood can be computed in the usual manner, and inference carried out.As with MMSBM ("Clustering approach" section), the main goal of inference is for (and ) but not Z.Instead of going through the derivations of the posterior in detail, we shall mention works which have worked on various aspects in "Inference approach", "Number of groups" and "SBM with longitudinal modelling" sections.

Inference approach
In the usual approach which is Bayesian inference, in addition to the Dirichlet distribution assumptions made for and , once the priors are assigned to α and β, the joint posterior of Z, , , α and β can be derived.Unlike in SBMs, even with the soft clustering approach, the computational complexity grows linearly, but not quadratically, with the number of documents m.
Similar to what has been described in the beginning of "Inference approach" section, because of the use of conjugate priors for , , α and β, for algorithmic simplicity a Gibbs sampler can be derived, in which the parameters and the latent variables can be updated via individual Gibbs steps.However, Griffiths and Steyvers (2004) were not interested in and and therefore integrated them out, to obtain a collapsed Gibbs sampler.Furthermore, as they fixed the values for α and β, only the latent groups Z are to be inferred.This is similar to collapsing the hard clustering SBMs in, for example, (McDaid et al. 2013), (Côme and Latouche 2015), Peixoto (2017a) and Peixoto (2017b), although the parameters C in the soft clustering MMSBM (Airoldi et al. 2008) are not being integrated out.Other uses of MCMC algorithms include Teh et al. (2005), Ren et al. (2008), Papaspiliopoulos andRoberts (2008), Ahmed and Xing (2010) and (Griffiths and Ghahramani 2011).However, such algorithms are more useful when the number of topics K is being modelled ("Number of topics" section), rather than selected by a criterion, by processes such as the hierarchical Dirichlet process and the Indian buffet process.Blei et al. (2003), along with Wang et al. (2011) used variational inference, the general formulation of which is described in "Variational methods" section.Blei and Lafferty (2006) combined variational inference and a Kalman filter for their dynamic topic model, in which a state space model is incorporated, to be explained in "Longitudinal modelling" section.Arun et al. (2010) focussed on finding the optimal number of topics and did not explicitly state their inference approach.

Number of topics
Using the same notation K as in "Stochastic block models" section is because the number of topics in a topic model is analogous to the number of groups an SBM.Therefore, there also exists the issue of whether K should be fixed, selected by a criterion, or modelled.Of the topic models reviewed here, only Blei and Lafferty (2006) used a fixed number of topics (K = 20).Blei et al. (2003) used the perplexity, which is the likelihood of held-out test data, to determine K.This is (almost) the same as the definition of perplexity in SBMs (Gopalan et al. 2012;DuBois et al. 2013;Li et al. 2016).Griffiths and Steyvers (2004) selected K by the marginal log-likelihood, which is similar to Latouche et al. (2012) for SBMs, although they have not provided the derivations of this quantity log π (W), or approximations thereof.Arun et al. (2010) proposed an empirical measure to find the optimal K.They started with viewing that LDA essentially "factorises" the document-word frequency matrix M into the two stochastic matrices and .(Of course, this does not mean that M = , however.)By defining the vector of document lengths N := (N 1 N 2 • • • N m ) T , they gave the expression of the criterion for selecting K, as the symmetric K-L divergence between the distribution of the singular values of , and the distribution obtained by normalising N T .

Hierarchical Dirichlet process
The Dirichlet process (DP) and its hierarchical version have to be introduced before incorporating into topic models.As it is only used in some of the models reviewed, we focus on the ideas behind these nonparametric Bayesian processes, rather than their technical details.
The topic model introduced in the beginning of this section assumes that, for each document, there is a same number (K) of topics to choose from, and the membership probabilities θ p (p = 1, 2, . . ., m) form a K-vector which comes from a Dirichlet(α1 K ) distribution.Dirichlet process takes this one step further by not requiring K to be prespecified.While the elements of θ p drawn from a DP still sum to 1, K varies between i.i.d.samples and is implied by the length of θ p .This means, for example, θ p = (0.2 0.5 0.3) T and θ p = (0.15 0.45 0.35 0.05) T can be two independent samples from the same DP, implying K = 3 and K = 4, respectively.
To take the DP one step further, Teh et al. (2005) assumed that θ p comes from a DP, the parameters of which come from another DP.Introducting this structure in their hierarchical Dirichlet process (HDP) leads to a clustering effect.In the context of topic models, the topics generated for each document are uncorrelated in a DP, while the same set of topics is shared across all documents in an HDP.
This description of the DP and HDP is an informal one, and has omitted the technical details of actual sampling from a DP; see, for example, Teh et al. (2005), Ren et al. (2008), and Ahmed and Xing (2010).What is to be highlighted here is that K arises naturally when sampling from a DP or HDP, and therefore does not need to be pre-specified.As these nonparametric Bayesian processes concerns the generation of the mixing vectors θ p and is independent of the topic models, it can be incorporated in models that require these membership probabilities, hence its use in the SBMs discussed in "Modelling" section.
It has been argued in "Inference approach" section and shown by, for example, Griffiths and Steyvers ( 2004) that a Gibbs sampler can be derived for the parameters and the latent variables Z of the topic model when K is fixed.Teh et al. (2005) have shown that a Gibbs sampler is also possible under the HDP formulation, hence the preference to using MCMC as the inference algorithm too, in Ren et al. (2008) and Ahmed and Xing (2010).While Papaspiliopoulos and Roberts (2008) proposed two MCMC algorithms to improve the then existing ones for HDP models in general, Wang et al. (2011) pointed out the limitation that such algorithms require multiple passes through all the data and are therefore not very scalable.They proposed a variational infernce algorithm as an alternative.
So far, for a topic model or SBM, incorporating an HDP means that the quantity K corresponds to is potentially infinite and to be modelled.The equivalent for latent feature models exists and is termed the Indian buffet process, and again a Gibbs sampler for such models is possible (Miller et al. 2009).For a detailed introduction and an extensive review, see Griffiths and Ghahramani (2011).

Longitudinal modelling
As temporal information is sometimes available for text data, such as academic articles, longitudinal or dynamic models can be incorporated in a similar way as they are for SBMs or other models for graphs.Blei and Lafferty (2006) proposed a dynamic topic model, in which the distribution of vocabulary for the i-th topic is now indexed by time t, denoted by φ (t) i , and is dependent on φ (t−1) i via a state space model.The incorporation of a state space model is similar to that in Fu et al. (2009); Xing et al. (2010) and Xu and Hero III (2013), mentioned in "SBM with longitudinal modelling" section.The main difference is the generation the membership probabilities, as once the p-th document is generated from θ p there is no need to evolve its distribution over the topics.Instead, there is a "universal" vector of membership probabilities, denoted by ψ (t) , and depends on ψ (t−1) via a separate state space model.Assuming that, without loss of generality, the p-th document is to occur at time t, θ p is generated from ψ (t) with random noise.Words of the document are then generated in the usual way according to θ p and φ (t) .
As mentioned before, in a non-dynamic model that models K, the memberships θ p arises from an HDP.Ren et al. (2008) introduced temporal dependency in their dynamic model, by assuming that the documents occurred sequentially, and that θ p depends on both θ p−1 and an HDP which represents innovation.Ahmed and Xing (2010) modified the HDP in a slightly different way.The membership probabilities now come from, loosely speaking, an evolving DP, which depends on the process at the previous time point through a state space model.They argued that their dynamic model allows topics to be born and die at any instant.

SBM with topic modelling
In this section articles that combine a model for graphs and a topic model are reviewed.Similar to "Topic models" section, different aspects related to modelling and inference approaches are discussed, thus providing answers to questions 5 and 6 in Introduction section.
Before looking at SBMs with topic modelling for graph and textual data, it is useful to look at a recently proposed SBM for textual data by Gerlach et al. (2018).Instead of probabilitistically factorising the document-word frequency matrix M ("Topic models" section), they treated the relations between the words and the documents as a bipartite graph, in which both the words and the documents are the nodes.By doing so, the SBMs reviewed before can be direct applied to the graph.In particular, they fit a nested SBM (Peixoto 2014b), thus constructing a hierarchical structure and allowing the number of groups to be inferred.The proposed model was found to perform better than the traditional approach of LDA.It can be compared with the model by Godoy-Lorite et al. (2016) for recommender systems, who also applied an SBM for seemingly non-relational data that can however be represented as graphs.

Type of data and modelling approach
In the articles reviewed in this section, the modelling approach is largely influenced by the type of data available, in particular whether the textual information is for the edges or nodes in the graph.The availability of such textual information is in turn determined by the nature of the data set itself.Therefore, we will review these two aspects together.

Textual edges
One famous example of data that contains both network and text information is the Enron Corpus, a large database of over 0.6 million emails by 158 employees of the Enron Corporation before the collapse of the company in 2001.The nature of an email network leads to a directed and valued graph, in which the nodes and edges are the employees and the email exchanges, respectively, with the latter of which the texts are associated.In terms of the models reviewed in "Stochastic block models" section, it has been studied by Fu et al. (2009) 2018), which will be discussed individually.Zhou et al. (2006) proposed two models in which the graph is not modelled explicitly.Instead, each email, or communication document in general, is generated by an extension of LDA.In the first model, the membership probabilities θ is not associated with the document, but with the users involved with this document (for example, sender and receiver of an email).Then, the latent topic variable for word q in document p, denoted by Z pq as in "Topic models" section, is generated according to the membership probabilities of the users.There is an additional layer in which users come from different groups (communities), but it is not that each user has a mixed membership over the groups, but that for each group there is a distribution of users representing their participation.In the second model, associated with each group is the membership probabilities θ of the topics, and associated with each topic is a distribution of the users.This time, it is not made clear how a word in the document is generated from the user, who is in turn generated by the topic.
Similar to those by Zhou et al. (2006) is the author-recipient-topic model by McCallum et al. (2007), in which the membership probabilities θ are now specific to each author-recipient pair.For document p, the author and the set of recipients are treated as given.For word q in this document, a recipient is selected at random uniformly, and the latent topic Z pq follows a multinomial distribution according to the aforementioned pairspecific membership probabilities.Finally, the word is generated, as usual, according to the word distribution φ over the selected latent topic.Pathak et al. (2008) extended the author-recipient-topic model, by adding the group element of the authors and recipients.In particular, chosen at random uniformly is not the recipient but the group.Then, similar to Zhou et al. (2006), the authors and recipients are selected according to a group-specific distribution of users.Sachan et al. (2012) introduced a topic user community model, which is different to those by Zhou et al. (2006) despite the similarities in the model names.For each sender, there is a group distribution, and for each sender-group pair, there is a distribution over the topics.There is one latent group for each document by the sender, and the latent topic variables for individual words are generated according to the aforementioned sender-group-specific probabilities.To complete the model specification, the recipients are chosen by a group-specific distribution of the latent group.Bouveyron et al. (2016) combined the SBM and LDA to form the stochastic topic block model (STBM).As in the SBM introduced in "Stochastic block models" section, the groups of nodes p and q arise from a multinomial distribution, and the edge variable Y pq follows a Bernoulli distribution with parameter Z T p CZ q , where Z p and Z q represent the memberships in the groups, not the topics.Now, for each pair of groups i and j, there is a specific vector θ ij representing the memberships in the topics, the collection of which, still denoted by , is a 3-dimensional array.Then, the latent topic variables for individual words of a document from p to q follow a multinomial distribution with parameters Z T p Z q , conditional on Y pq = 1.This means that the latent groups influence both the dyad and potentially the words in the document, if an edge exists.Corneli et al. (2018) extended the STBM by incorporating a dynamic component, and is the only work reviewed in this section with longitudinal modelling, possibly due to complexity of combining a model for a graph and a topic model.Instead of conditioning on Y pq = 1, the occurrences of the documents from p to q arise from a nonhomogeneous Poisson process, according to the collection of group-pair-specific intensity functions and the latent groups Z p and Z q .Furthermore, these intensity functions are piecewise constant, that is, constant within each of the time clusters, which are also latent variables and have to be inferred.Within each time cluster and conditioned on the existence of a document, the latent topic variables and individual words are again generated in the same way as in Bouveyron et al. (2016).Essentially, Corneli et al. (2018) proposed a model for simultaneously clustering three aspects, namely the nodes into groups, the documents into topics, and the occurrences into time clusters.It can also been seen as a generalisation, or even a direct combination, of both the STBM and the dynamic extension of the LDA.

Textual nodes
In another set of models, the entities are usually documents with texts and links between them.Note that a document is defined in its broad sense, as it can be a Wikipedia page, a blog post, or an academic article, the links in the last of which are citations or references.The data structure is then a graph, usually directed, with texts in the nodes.We shall discuss a few of them.Liu et al. (2009) argued that links between documents are not only determined by content similarity, but also by the connections between authors, because authors are naturally more aware of documents in their community and might not be aware of the possibly more relevant documents outside it.They introduced a topic-link LDA model, in which the edge probability between document p and q depends on a linear combination of document similarity and author similarity.The former similarity is the dot product of the topic memberships θ p and θ q of the documents, generated as described in "Topic models" section, while the latter is the dot product of the memberships of the authors, drawn in a similar way from a separate Dirichlet distribution.The coauthorship network of the authors, however, is not incorporated to enrich the information on the author memberships.Chang and Blei (2010) proposed a relational topic model, in which the graph of the documents depends on their content similarity.First, each document is generated according to LDA.Next, for documents p and q, their edge probability depends on the similarity between the latent variables Z p and Z q as defined in "Topic models" section, quantified by a link probability function.Different version of this function are considered.Ho et al. (2012) introduced a model called TopicBlock, in which a latent hierarchical or tree structure is assumed to generate the documents, which are the leaf nodes.Also defined is a hierarchical node h, that is, the root or anything between the root and the leaves.Associated with h is a word distribution, denoted by φ h for alignment with notation in "Topic models" section, as well as a parameter for the edge probability between any two documents that share h as their deepest common ancestor.Three things are then generated for each document.First, the path from the root to the document arises from a nested Chinese Restaurant process, which is related to the DP.Second, the words are generated according to a mixture of the distributions φ h of all the nodes h along the path.Finally, the presence or absence of edge of this document with another document follows a Bernoulli distribution according to the parameter associated with their deepest common ancestor.The use of a hierarchical latent structure to infer the number of groups or topics and for scalability is similar to Peixoto (2017a).

Inference approach
As it is illustrated in "Stochastic block models" and "Topic models" sectiona that a naive MCMC algorithm can be derived for a block model and a topic model, respectively, it is natural and possible to derive a similar algorithm for a model that combines both, if scalability is less of a concern compared to algorithmic simplicity.Each of Zhou et al. (2006);McCallum et al. (2007); Pathak et al. (2008); Ho et al. (2012) and Sachan et al. (2012) have derived a Gibbs sampler for their corresponding model.In particular, Ho et al. ( 2012) followed Griffiths and Steyvers (2004) and integrated out the parameters other than the latent variables in their collapsed Gibbs sampler.
On the other hand, the VEM algorithm described in "Variational methods" section is very general and equally feasible as the alternative.Therefore, it has been used by Liu et al. (2009); Chang and Blei (2010); Bouveyron et al. (2016) and Corneli et al. (2018) as the inference algorithm.Bouveyron et al. (2016) and Corneli et al. (2018) observed that, in their STBM and dynamic counterpart, the equivalent of the lower bound in (23) can be split into two components, one depending on the variational distribution of the latent topic vectors Z and the other not.Therefore they applied an extension of the VEM algorithm, called the classification VEM algorithm, in which the lower bound is maximised alternatively in two steps.In the first step, the lower bound is maximised with respect to the variational distribution and the collection of the word distributions, , in the usual way of a VEM algorithm.In the second step, the maximisation is carried out in a greedy fashion with respect to the remaining latent variables, namely the latent groups in Bouveyron et al. (2016), and the latent groups and latent time clusters in Corneli et al. (2018).

Clustering approach
For data with textual edges, while soft clustering still applies to discovering the topics of these documents, it is less clear for the nodes (users).For example, in Zhou et al. (2006) and Pathak et al. (2008), a distribution of users is associated with each group, rather than the other way round.In such cases, we only report the clustering approach for the documents in Table 3. Sachan et al. (2012) are the only ones who soft clustered both the documents and users, the latter of which can belong to multiple documents and multiple groups.Bouveyron et al. (2016) and Corneli et al. (2018) hard clustered the nodes and soft clustered the documents, which are the predominant approaches in SBMs and topic models, respectively.
For data with textual nodes, the models essentially assume that the same set of latent variables, that is their group memberships influence both the generation of the words and the connection between the documents.The soft clustering or mixed membership approach adopted by LDA is possible to be carried over just to a model which combines LDA with a block model, overriding the usual hard clustering approach for the latter.This is the case for all of Liu et al. (2009); Chang andBlei (2010), andHo et al. (2012).

Number of groups/topics
The number of groups or topics is usually fixed or determined by certain criterion, possibly because of the complexity of the model.For data with textual edges, both clustering the nodes into groups and the documents into topics are required.Zhou et al. (2006) used 6 groups and 20 topics for the Enron data set, while Pathak et al. (2008) used 8 groups and 25 topics, and perplexity is used by both McCallum et al. (2007) and Sachan et al. (2012) as the criterion, the latter of which selected 10 groups and 20 topics.On the other hand, Bouveyron et al. (2016) used a BIC-like criterion to find 10 groups and 5 topics, while Corneli et al. (2018) used the ICL to find 6 groups and 9 topics.It should be noted that, such discrepancy is due to not only the choice of the criterion but also the model itself.
For data with textual nodes, clustering the nodes into groups is equivalent to clustering the documents in topics.Chang and Blei (2010) used various numbers (5,10,15,20,25) of topics for their data, while Liu et al. (2009) used perplexity as the criterion.Ho et al. (2012) are the only ones who used incorporated a DP in their hierarchical model, although the number of groups found is not reported.Furthermore, they fixed the hierarchical level to 2 or 3 for their data sets.

Comparison
A comparison is provided in this section regarding works that incorporate longitudinal and/or topic model in SBMs and other models for graphs.Specifically, we will look at their performances on one widely studied dataset.

Real-world example
The Enron email network dataset (Klimt and Yang 2004), which contains the email communications between its employees from 1999 to 2002 before its bankcrupty in 2001, has been widely studied because of the availability of the temporal information (the time points of the email exchanges), the textual information (the actual contents of the emails), and the subsequent graph (the network of the employees).While the works reviewed have a different size of network, due to the different time periods studied, in most cases the number of nodes n is around 150, except Xu and Hero III (2013) and Sanna Passino and Heard (2019) with n = 184.A few events happened during that period: These events likely affected the dynamics of the employees, which could be reflected from the email exchanges, and were looked into by different works below.

Longitudinal modelling
The dynamic SBMs reviewed in "SBM with longitudinal modelling" section provided various insights into the dynamics of the network.In their dynamic MMSBM, Fu et al. (2009) and Xing et al. (2010) examined the temporal evolution of the mixed memberships of each employee, which can be interpreted as different (latent) roles.They found that, perhaps not surprisingly, that employees with strong connections with multiple groups and/or important positions tended to have multiple active roles most of the time.They also discovered the major changes in the mixed memberships aligned with real-life events, such as 2 and 5 above.Related observations were made by Xu and Hero III (2013), who examined the evolution of the block matrix C (t) instead, and fixed the K = 7 groups according to the known roles of the employees.They discovered that the elements of C (t) spiked at event 1.
Other applications include Matias et al. (2015), who found out that their ICL criterion did not manage to select a reasonably small K under their SBM with conditional Poisson processes.Fixing K = 3, they discovered that two groups had intra-group communications that peaked at different periods, while the remaining group had very little activity (within itself and with other groups).While Fan et al. (2015) also applied their MTI model to the Enron dataset, they only reported that it outperformed the then existing models such as the MMSBM (Airoldi et al. 2008) and the LFRM (Miller et al. 2009).Similar claims of better performance over SBM were made by DuBois et al. (2013) in their dynamic model.

Topic modelling
Next, we look at the topics in the emails exchanged, discovered by models reviewed in "SBM with topic modelling" section.McCallum et al. (2007) found the K = 50 topics according to their author-recipient-topic model matched well with specific issues in the operations of a company, and such observation is echoed by the respective models of Zhou et al. (2006) and Pathak et al. (2008).Upon comparison with simply fitting an SBM to the network, they found that two employees in the same group inferred by the SBM might actually have vastly different roles in the company.This means that, while the two employees wrote to roughly the same set of employees (hence the stochastic equivalence), the difference in their roles might not be revealed without taking into account the texts of the emails.Using their community-author-recipient-topic model, Pathak et al. (2008) showed that the groups found were topically meaningful, which means that different groups wrote emails on different topics, implying a successful incorporation of the textual information.However, Zhou et al. (2006);McCallum et al. (2007), and (Pathak et al. 2008) did not compare the groups they found with the actual roles of the employees.
Apart from making a similar observation to Pathak et al. (2008); Sachan et al. (2012) managed to compute the modularity 20, and found that not only did their topic user community model outperformed the other models mentioned in this section across different values of K, it also achieved a maximimum modularity (at K = 10) greater than 0.3, indicating significant community structure (Clauset et al. 2004).As modularity is not a formal criterion for model selection, Sachan et al. (2012) used perplexity to select an optimal number of groups, which was also around K = 10.
In their STBM, Bouveyron et al. (2016) optimised the BIC-likelihood criterion with respect to the number of groups K and the number of topics simultaneously.The optimal K = 10 (and 5 topics) is larger than that found by an SBM alone (K = 8).Closer investigation revealed that some employees in the same group found by SBM talked about different topics than the rest of the group, hence the splitting of a group into smaller ones found by STBM.This is in agreement with the observation by McCallum et al. (2007).

Longitudinal and topic modelling
In the dynamic STBM by Corneli et al. (2018), model selection was done by maximising the ICL criterion with respect to the number of groups K, the number of topics, and the number of time clusters.Compared to 10 groups and 5 topics in the STBM, 6 groups, 9 topics and 4 time clusters were selected here.The 4 time clusters corresponded well (with a slight lag) to the four periods separated by events 3, 4 and 5 described at the beginning of this section.Furthermore, they found different interactions between groups at different time clusters, with different topics discussed, thus showing the need of incorporating both temporal and textual information.

Approaches
Similar to the comparison for SBMs in "Comparison" section, the modelling approach in topic models influences, and is influenced by other aspects discussed.For example, the use of a hierarchical process for the number of topics usually leads naturally to an MCMC algorithm, in particular a Gibbs sampler.While topic models are not the focus of this review, we related some of them with their counterparts for graphs, as similar issues regarding inference and dealing with K arise in both sets of models.For a comprehensive comparison between the topic models, please see Table 4.

Discussion
In this review we have seen a spectrum of statistical models for graphs, in particular the SBMs, with or without the longitudinal and/or topic modelling.They have been compared in different aspects, including the clustering approach, the inference approach, and how the number of groups K is being coped with.Instead of looking at each model as a unit, we investigated them in a systematic and cross-sectional way, hopefullying presenting the landscape of the literature in a straightforward manner, so that the models most relevant to the reader can be compared with relative ease.It should be noted that, however, these aspects are not completely independent of each other, nor are they independent of the model specification.In particular, by looking at Table 1, a shift in the approach to K, which is highly related to the developments in the models themselves, can be seen over the years.While previous SBMs have relied on nonparametric Bayesian processes used in topic models, more recent SBMs are usually collapsed.With model parameters being integrated out, this allows model selection criteria to be computed (Côme and Latouche 2015), or tractable posterior (McDaid et al. 2013;Newman and Reinert 2016;Peixoto 2017a) to be used with efficient MCMC algorithms (Peixoto 2014b), therefore enabling K to be selected or estimated without trans-dimensional methods.
Two of the recent developments are worth singling out.The first is the nested microcanonical SBM (Peixoto 2017a), which establishes the MDL (Peixoto 2013) and its equivalence with the usual Bayesian inference approach, incorporates an efficient MCMC algorithm (Peixoto 2014a) and a hierarchical structure (Peixoto 2014b) to model K and circumvent the issue with potential underfitting due to the maximum K detectable in an SBM (Peixoto 2013).One possible direction is to apply these methods and techniques to hypergraphs as well, as they are still a growing field to our knowledge.
The second is the increasing adopting of the DC-SBM (Karrer and Newman 2011), not only in modifying the model, but also in aiding model selection in different ways (Yan et al. 2014;Yan 2016;Wang and Bickel 2017;Hu et al. 2019).While it is particularly useful for dealing with the degree heterogeneity, the results are more mixed when it is applied to real-world networks, as seen in "Comparison" section.Moreover, the usefulness of the DC-SBM should not be seen as an indication of the inferiority of the original SBM, as it captures different underlying structures that follow stochastic equivalence.One possible direction is to combine the regularised SBM (Lu and Szymanski 2019b) and the weighted SBM (Aicher et al. 2015), and have both versions of the SBM in an unifying model.Rather than aimlessly increasing the model complexity, one should aim at modifying or extending an SBM to realistically capture the properties of real-world networks.
Future directions can also be made for extending models that combine SBMs and topic models, and can be split into the two broad categories as in "Type of data and modelling approach" section.For graphs with textual edges, while Corneli et al. (2018) currently represents the state-of-the-art among the models reviewed, a mixed membership version for the nodes could be incorporated.The number of groups/topics/time clusters could also be modelled, possibly with the associated parameters, whose dimensions grow with these numbers integrated out, potentially leading to more accurate computations of criterion and/or more efficient inference algorithms.
For graphs with textual nodes, there are a few possible directions.One way is to develop a truly hierarchical model, potentially by incorporating a nested SBM, to infer the number of groups and the depth of the hierarchical structure simultaneously.Another way is to combine Chang and Blei (2010) with an MMSBM to soft cluster the nodes, accompanied by a criterion to infer the number of groups/topics.Methods by, for example, Li et al. (2016) can also be incorporated to obtain a scalable and efficient inference algorithm.The model by Gerlach et al. (2018) can also be applied to account for the graph between words and documents and the graph between documents simultaneously, in an unifying

Fig
Fig. 1 Network of 90 nodes

)
Daudin et al. (2008) also provided an algorithm for the optimisation of the first term on the right hand side.This approximate ICL has been adopted by, for example,Matias et al. (2015),Matias andMiele (2017), andStanley et al. (2019).

Table 1
Models for graphs without longitudinal or topic modelling

Table 2
Stochastic block models with longitudinal modelling

Table 4
Topic models and related clustering models