Improving accuracy of expected frequency of uncertain roles based on efficient ensembling

This study tackles the problem of extracting the node roles in uncertain graphs based on network motifs. Uncertain graphs are useful for modeling information diffusion phenomena because the presence or absence of edges is stochastically determined. In such an uncertain graph, the node role also changes stochastically according to the presence or absence of edges, so approximate calculation using a huge number of samplings is common. However, the calculation load is very large, even for a small graph. We propose a method to extract uncertain node roles with high accuracy and high speed by ensembling a large number of sampled graphs and efficiently searching for all other transitionable roles. This method provides highly accurate results compared to simple sampling and ensembling methods that do not consider the transition to other roles. In our evaluation experiment, we use real-world graphs artificially assigned uniform and non-uniform edge existence probabilities. The results show that the proposed method outperforms an existing method previously reported by the authors, which is the basis of the proposed method, as well as another current method based on the state-of-the-art algorithm, in terms of efficiency and accuracy.

received information to other nodes. Accordingly, extracting the motif-based role of each node can be applied, for example, in identifying important influencers in viral marketing.
Information diffusion over social networks can be treated as an uncertain graph, where the existence of edges between nodes is probabilistic. In the last few years, the study of uncertain graphs has attracted considerable attention in the field of network science. The counting of motifs and roles in uncertain graphs can facilitate a more detailed analysis of a given graph, and it is expected to be used in a wide range of fields such as marketing, urban planning, and protein analysis. For uncertain graphs with $L$ uncertain edges, $2^L$ possible graphs need to be enumerated; in addition, the number of motifs (roles) needs to be counted for each of them, and the counts need to be averaged with weights given by the occurrence probability of each possible graph. However, the number of possible graphs is very large, and even for small graphs it is difficult to compute the exact expectation. Therefore, in general, sampling-based approximations have been adopted.
The LINC algorithm of Ma et al. (2019) is a state-of-the-art technique for counting motifs in uncertain graphs. Instead of counting the number of motifs for all sample graphs from scratch, LINC focuses on the structural similarity between sample graphs, and it efficiently updates the number of motifs by considering only the difference edges between two sample graphs. In a situation of low uncertainty, that is, extremely high or low edge probability, LINC can compute the expected frequency more quickly than can naive sampling-based methods.
The aim of this study is to extract groups of nodes with similar motif-based roles, and promising results to this end have already been reported (Pržulj 2007;Guerrero et al. 2008;Ohnishi et al. 2010;McDonnell et al. 2014;Sarajlić et al. 2016). Therefore, this study follows these frameworks, which consist of three steps: counting motifs or roles and constructing feature vectors, calculating node similarity, and clustering nodes. In the context of an uncertain graph, we need to sample and ensemble either graphs, vectors, similarities, or clusters. It is important to evaluate how far the results deviate from the exact clustering results obtained by processing all possible graphs, depending on the step at which sampling and ensembling are performed. In our previous study (Naito and Fushimi 2021), we proposed an efficient ensemble method, graph-ensemble, which ensembles possible graphs sampled from the given uncertain graph; it then generates a weighted graph we call an ensembled-graph, where the edge weight is the ratio of graphs with edge existence to the total number of sampled graphs; finally, the method counts the roles from this weighted graph by considering the edge weights. Experimental evaluations have compared the vector-ensemble and similarity-ensemble methods, both derived from the LINC algorithm, with the graph-ensemble method. The results show that the graph-ensemble method outputs results similar to those of the previous methods but much faster. On the other hand, the results of subsequent experiments show that the error of the graph-ensemble method with respect to the exact results is, to some extent, larger than that of the vector-ensemble and similarity-ensemble methods. This is because the graph-ensemble method integrates the sampled graphs and counts the roles from the integrated graph; consequently, edges that are absent in some samples are masked by the same edges being present in other samples, and the resulting changes in the motif-roles cannot be considered.
The vector-ensemble and similarity-ensemble methods count roles from each sampled graph, and thus changes in motif roles can be considered. Therefore, there is a need for a method that is as efficient as the graph-ensemble method but also as effective as the vector-ensemble and similarity-ensemble methods. In this study, we reduce the error to a level as small as that of the vector-ensemble method. This is done by considering the change in role due to the probabilistic absence of edges, at the expense of some of the speed that is possible with the graph-ensemble method.
As an extension of the conference version of this work (Naito and Fushimi 2021), we propose the extended graph-ensemble method, add graphs to illustrate our experiments, and compare and evaluate the proposed method along with existing methods from the viewpoint of error. Furthermore, the pseudo-code related to our method is added. This paper is organized as follows: "Related work" section introduces related research. "Problem framework" section sets up the problem addressed in this study. "Existing methods" and "Proposed method: extended graph-ensemble method" sections describe the existing and proposed methods, and "Experimental evaluations" section presents evaluation experiments using each method. Finally, "Conclusion" section summarizes this study and mentions future work.

Related work
In this study, we consider the problem of extracting motif-based roles for uncertain graph nodes. Therefore, we briefly discuss related work in terms of network motifs, role extraction, and uncertain graphs.

Network motifs
Motif counting techniques have been studied for many years, starting with the pioneering work of Milo et al. (2002). Various algorithms in these techniques have been developed for different purposes (Wernicke 2005;Itzhack et al. 2007;Grochow and Kellis 2007;Ahmed et al. 2015;Pinar et al. 2017). Wernicke proposed a hash-based algorithm called ESU, which avoided the need for storing all subgraphs in a hash table and improved the efficiency of motif counting by not counting the same subgraph twice (Wernicke 2005). Itzhack et al. proposed an efficient algorithm to traverse a breadth-first search tree with the target node as the root. It represents the existence of a link in a subgraph as a bit string, and it can efficiently identify motif patterns without checking the isomorphism of each subgraph (Itzhack et al. 2007). This study adopts the algorithm of Itzhack et al. for motif counting from sample graphs. Grochow and Kellis proposed an efficient algorithm for searching for a single motif (Grochow and Kellis 2007). This algorithm constructs a partial mapping from a particular graph to a target motif. In addition, the algorithm introduces a method called symmetric-break to avoid multiple counting of motifs, which greatly improves execution time. Ahmed et al. proposed a parallel algorithm for three- and four-node motifs that does not enumerate all motif instances but counts certain motifs, such as cliques and cycles, and uses the transition relations between motifs to compute all other motifs analytically (Ahmed et al. 2015). Pinar et al. proposed a divide-and-conquer algorithm that identifies the substructure of each found subgraph and divides it into smaller ones (Pinar et al. 2017). However, although it is a very efficient method, it cannot be applied to directed networks.

Role extraction
Extracting node roles from a network is an important research topic. Role extraction methods are largely divided into two types: graph-based and feature-based methods. As graph-based methods, the concepts of regular equivalence (Everett and Borgatti 1994) and structural equivalence (Lorrain and White 1971), together with algorithms for extracting them, have been proposed. These concepts focus on local structures such as relationships among neighboring nodes, similar to network motifs, but extracting exactly equivalent nodes is costly. More recently, by relaxing the concept of equivalence, many feature-based role discovery techniques have been proposed (Henderson et al. 2011;Rossi et al. 2012, 2013;Gilpin et al. 2013).
Feature-based methods transform the graph representation into a feature representation, so in that sense, our method belongs to this category. Some studies defined the motif-based roles (a.k.a. orbits) and the graphlet degree vector for each node, whose elements are the numbers of roles (Pržulj 2007;Guerrero et al. 2008;McDonnell et al. 2014). Pržulj constructed a vector of 73 kinds of orbits obtained from 2- to 5-node graphlets and attempted to quantify the similarity among graphs or nodes (Pržulj 2007). McDonnell et al. proposed a transformation matrix from the motif-frequency vector to the role-frequency vector to efficiently compute the number of roles for each node or the whole graph (McDonnell et al. 2014). Our study also defines the feature vector of each node based on the number of roles of each node, but we count the number of roles based on the algorithm of Itzhack et al., not that of McDonnell et al.
Furthermore, some methods calculated the similarity between the vectors and clustered the nodes into groups. Ohnishi et al. analyzed an inter-firm network using motif-roles and found economically meaningful clusters of nodes (Ohnishi et al. 2010). Sarajlić et al. discovered the core-broker-periphery structure from world trade networks and predicted the economic attributes of each country node (Sarajlić et al. 2016). Following the promising results of the above studies, our role extraction framework consists of counting roles, constructing feature vectors, calculating node similarity, and clustering nodes.

Uncertain graphs
Research on uncertain graphs has been pursued in a wide range of contexts. One important task is the extension of existing graph analysis methods, including node centrality, clustering, embedding, and motif counting, to uncertain graphs.
Pfeiffer et al. extended certain representative structural indices for the deterministic graph, i.e., shortest path length, clustering coefficient, and betweenness centrality ranking, to uncertain graphs by introducing the expected value of each index for the occurrence probability of each possible graph (Pfeiffer and Neville 2011). Such a notion and sampling-based approximation have been widely used in subsequent research on uncertain graphs, including the work in this study.
Ceccarello et al. developed a node clustering method for uncertain graphs and reduced the basic problem to k-center and k-median problems (Ceccarello et al. 2017). In this method, the distances between nodes are defined by the inverse of the connection probability among them, which is efficiently and accurately estimated by the Monte Carlo sampling method.
Hu et al. proposed an embedding method for uncertain graphs, which constructs a matrix of expected proximities of all node pairs in an uncertain graph and reduces the matrix dimensionality via a matrix factorization technique to obtain low-dimensional vectors for the nodes [10]. This method uses the Jaccard coefficient for the set of adjacent nodes when calculating the expected proximity between nodes, that is, it calculates the similarity between nodes based on the local structure. Similarly, our method constructs vectors based on the expected number of motif roles, which represents the local structure. The two procedures run in opposite directions because their purposes differ: Hu's method obtains low-dimensional vectors from the similarity between nodes, whereas our method obtains a similarity matrix from low-dimensional vectors.
Motif counting for uncertain graphs has not yet been thoroughly studied. The following are some of the major studies on the subject. Tran et al. proposed a method to compute an unbiased estimator of the number of motifs from noisy and incomplete data, but the method assumes that all edges have uniform joint probabilities and does not apply to non-uniform probabilities (Tran et al. 2013).
Ma et al. proposed two sampling-based algorithms to obtain basic statistics such as the mean, variance, and probability distribution of motif counts (Ma et al. 2019). The first is a simple sampling method, called PGS, which samples a large number of possible graphs from uncertain graphs and counts the instances of a single motif from each sample graph. However, the method requires a sufficient number of samples to accurately estimate the average number of motifs based on Hoeffding's inequality. The second, more efficient method, called LINC, uses the structural similarity between sample graphs to update the frequency of motifs by examining only edge differences between consecutive samples. It outputs the same results as PGS but runs much faster when the same samples are used. In this work, we consider the LINC algorithm a state-of-the-art technique and propose a more efficient ensemble algorithm than those equipped with a role counting routine by LINC.

Problem framework
This study deals with the problem of extracting node groups with similar motif-roles in uncertain graphs. For convenience of explanation, the case of role extraction based on a motif composed of three nodes and directed edges is described here; however, the method can be applied to role extraction based on a small $k$-node motif that is not limited to $k = 3$, regardless of whether it is directed or undirected. Table 1 summarizes the nomenclature used in this paper. As for $R$, $C$, and $H$, the calligraphic font of the capital letter ($\mathcal{R}, \mathcal{C}, \mathcal{H}$) represents the role vectors, the similarity matrix, and the affiliation matrix of an uncertain graph; the bold font with subscript ($\mathbf{R}_s, \mathbf{C}_s, \mathbf{H}_s$) represents those of a deterministic graph sampled from an uncertain graph; the bold font with over-bar ($\bar{\mathbf{R}}, \bar{\mathbf{C}}, \bar{\mathbf{H}}$) shows those of the ensembled (averaged) version.

Motif-role extraction
First, we formulate the problem of extracting a motif-role from the deterministic graph $G = (V, E)$. Here, $V$ is a set of nodes, $E$ is a set of edges, $N = |V|$ is the number of nodes, and $L = |E|$ is the number of edges. A motif is a graph with a few nodes and edges among them, and it is considered a building block of a large graph. For a set of nodes $U \subset V$ and the edges among them $F = (U \times U) \cap E$, we define $g = (U, F)$ as a motif when $g$ is a connected graph. In this study, we focus on the directed 3-node motif, i.e., $|U| = 3$ and $|F| \le 6$. The number of patterns of edge-existence states between all pairs of $U$ is $2^{|F|} = 2^6$. Among these, 54 patterns are connected ones, and by grouping them according to graph isomorphism, the number of subgraph patterns is 13, i.e., the number of patterns of a directed 3-node motif is 13 as shown in Fig. 1.
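The counts quoted above (64 edge-existence patterns, 54 connected ones, 13 motifs) can be checked by brute force, and the same enumeration also gives the number of node roles. The following is a minimal sketch; the bit layout in `PAIRS` and all function names are assumptions made for illustration and are not the encoding used by the paper's motif2bits function.

```python
from itertools import permutations

# Assumed bit layout: bit i of a 6-bit mask says whether directed edge PAIRS[i] exists.
PAIRS = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1)]

def edges_of(mask):
    """Decode a 6-bit mask into a set of directed edges on nodes {0, 1, 2}."""
    return {PAIRS[i] for i in range(6) if (mask >> i) & 1}

def mask_of(edges):
    """Encode a set of directed edges back into a 6-bit mask."""
    return sum(1 << PAIRS.index(e) for e in edges)

def is_connected(mask):
    """Three nodes are weakly connected iff at least two node pairs are linked."""
    linked_pairs = {frozenset(e) for e in edges_of(mask)}
    return len(linked_pairs) >= 2

def canonical(mask, perms):
    """Smallest mask reachable by relabeling nodes with the given permutations."""
    return min(
        mask_of({(p[a], p[b]) for a, b in edges_of(mask)}) for p in perms
    )

all_perms = list(permutations(range(3)))
root_perms = [p for p in all_perms if p[0] == 0]  # relabelings that keep node 0 fixed

connected = [m for m in range(64) if is_connected(m)]
motifs = {canonical(m, all_perms) for m in connected}   # isomorphism classes
roles = {canonical(m, root_perms) for m in connected}   # classes rooted at node 0

print(len(connected), len(motifs), len(roles))
```

Treating node 0 as the focal node and allowing only relabelings that keep it fixed yields one class per (motif, position) combination, which is how the sketch counts roles.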
Role was first defined as the structural equivalence in the graph, and role discovery as any process that divides nodes into classes of structurally equivalent nodes (Lorrain and White 1971). Relaxing this definition, in this study, following McDonnell et al. (2014), the role is defined based on the structural equivalence in the motif. In the directed 3-node motif, there are 30 types of roles as shown in Fig. 1. In this study, role extraction is accomplished by the following three steps: 1) constructing the role vector for each node, 2) calculating the similarity between role vectors for all node pairs, and 3) extracting node groups based on similarity (see Fig. 2). In step 1, for each node $v$, the number of appearances of each of the $R$ role types is counted, and the appearance frequencies are arranged in the $R$-dimensional vector $\mathbf{r}_v$. The $i$th element of $\mathbf{r}_v$ represents the number of times node $v$ appears as role $i$. In the case of the directed 3-node motif, the number of role types is $R = 30$ as shown in Fig. 1. The matrix in which the role vectors of all $N$ nodes are arranged is expressed as $\mathbf{R} = [\mathbf{r}_1, \ldots, \mathbf{r}_N]^T$, where $\mathbf{r}^T$ represents the transpose of $\mathbf{r}$. In step 2, the cosine similarity between role vectors, $c_{u,v} = \frac{\mathbf{r}_u^T \mathbf{r}_v}{\|\mathbf{r}_u\| \, \|\mathbf{r}_v\|}$, is used to calculate the similarity of all node pairs. Let the similarities of all $N \times N$ node pairs be the similarity matrix $\mathbf{C} = [c_{u,v}]_{u \in V, v \in V}$. In step 3, all nodes are classified into $K$ clusters by the greedy method of k-medoids clustering (Nemhauser et al. 1978), which outputs the affiliation matrix $\mathbf{H} = [h_{u,k}]_{u \in V, k = 1, \ldots, K}$, where $h_{u,k} = 1$ if node $u$ belongs to cluster $k$, and $h_{u,k} = 0$ otherwise. In this way, the role extraction process outputs $K$ clusters, each of which consists of nodes with similar role vectors.
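Steps 2 and 3 can be sketched in a few lines. The code below is only an illustration under stated assumptions: the function names are hypothetical, zero role vectors are given similarity 0, and the greedy selection rule (add the medoid that most increases the total similarity of nodes to their best medoid) is one common reading of greedy k-medoids, not necessarily the exact procedure used in the paper.

```python
import math

def cosine_similarity_matrix(role_vectors):
    """C[u][v] = r_u . r_v / (||r_u|| ||r_v||); zero vectors get similarity 0 (assumption)."""
    norms = [math.sqrt(sum(x * x for x in r)) for r in role_vectors]
    n = len(role_vectors)
    C = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if norms[u] > 0 and norms[v] > 0:
                dot = sum(a * b for a, b in zip(role_vectors[u], role_vectors[v]))
                C[u][v] = dot / (norms[u] * norms[v])
    return C

def greedy_k_medoids(C, K):
    """Pick K medoids one at a time, each maximizing the marginal gain in
    sum_u max_{m in medoids} C[u][m], then assign nodes to their best medoid."""
    n = len(C)
    medoids = []
    best = [0.0] * n  # best similarity of each node to any chosen medoid so far
    for _ in range(K):
        gains = {
            m: sum(max(C[u][m] - best[u], 0.0) for u in range(n))
            for m in range(n) if m not in medoids
        }
        chosen = max(gains, key=gains.get)
        medoids.append(chosen)
        best = [max(best[u], C[u][chosen]) for u in range(n)]
    # Affiliation matrix: H[u][k] = 1 iff medoid k is the most similar one to node u.
    H = [[0] * K for _ in range(n)]
    for u in range(n):
        k = max(range(K), key=lambda j: C[u][medoids[j]])
        H[u][k] = 1
    return medoids, H

# Toy role vectors (3 role types instead of 30): nodes 0,1 and nodes 2,3 are parallel.
toy_vectors = [[2, 0, 0], [4, 0, 0], [0, 3, 4], [0, 6, 8]]
toy_C = cosine_similarity_matrix(toy_vectors)
toy_medoids, toy_H = greedy_k_medoids(toy_C, K=2)
```

In the toy input, the two pairs of parallel vectors have cosine similarity 1 within a pair and 0 across pairs, so they end up in separate clusters.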

Uncertain graph
This study targets uncertain graphs, in which the existence of edges between nodes is probabilistically determined. The uncertain graph $\mathcal{G} = (G, p)$ is defined by the backbone graph $G = (V, E)$, consisting of the node set $V$ and the edge set $E$, and the existence probability of each edge $p : E \to (0, 1]$. Since the uncertain graph can be expressed as a set of its possible graphs, it is expressed as $\mathcal{G} = \{G_i = (V, E_i); E_i \subseteq E\}$. Assuming that the number of uncertain edges is $L$, the number of possible graphs in the uncertain graph is $2^L = |\mathcal{G}|$. Following the related study, the occurrence probability $\Pr[G_i]$ for each possible graph $G_i$ is calculated based on independent Bernoulli trials for all edges: $\Pr[G_i] = \prod_{e \in E_i} p(e) \prod_{e \in E \setminus E_i} (1 - p(e))$.
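To make the size of the exact computation concrete, the sketch below enumerates every possible graph of a tiny uncertain graph together with its occurrence probability under independent Bernoulli trials. The function name `possible_graphs` and the three-edge example are assumptions chosen for illustration.

```python
from itertools import product

def possible_graphs(edge_probs):
    """Yield (edge_set, probability) for each of the 2^L possible graphs.

    edge_probs maps each backbone edge (u, v) to its existence probability p(e).
    """
    edges = list(edge_probs)
    for states in product([True, False], repeat=len(edges)):
        prob = 1.0
        present = set()
        for e, exists in zip(edges, states):
            # Independent Bernoulli trial per edge: p(e) if present, 1 - p(e) if absent.
            prob *= edge_probs[e] if exists else 1.0 - edge_probs[e]
            if exists:
                present.add(e)
        yield frozenset(present), prob

edge_probs = {("a", "b"): 0.5, ("b", "c"): 0.25, ("c", "a"): 1.0}
graphs = list(possible_graphs(edge_probs))
total_probability = sum(p for _, p in graphs)
expected_edge_count = sum(len(g) * p for g, p in graphs)
```

With only three edges there are already eight possible graphs; the count doubles with every additional uncertain edge, which is why exact enumeration is impractical even for small graphs.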

Motif-role extraction in uncertain graph
Next, we formulate the problem of extracting the motif-role from the uncertain graph $\mathcal{G}$. To solve the above role extraction problem exactly for uncertain graphs, it is necessary to perform the three previously listed steps for all possible graphs $\mathcal{G} = \{G_i = (V, E_i); E_i \subseteq E\}$ and ensemble the results in consideration of the occurrence probability $\Pr[G_i]$ of each possible graph $G_i$ as follows: $\mathcal{H} = \bigoplus_{G \in \mathcal{G}} \Pr[G] \, \mathbf{H}_G$. Here, $\bigoplus$ is an operator of the ensemble, and it indicates that the clustering result $\mathbf{H}_G$ of each possible graph $G$ is ensembled in consideration of the weight of the occurrence probability $\Pr[G]$. To obtain an exact ensemble result for an uncertain graph with $L$ uncertain edges, all $2^L$ possible graphs have to be processed and ensembled; this is difficult to carry out even for a small graph. Therefore, approximation by sampling is generally adopted.

Existing methods
This section describes four ensemble methods that sample possible graphs from an uncertain graph and output clustering results. As shown in Fig. 3, the four ensemble methods use $S$ possible graphs, $\{G_1, \ldots, G_S\}$, sampled from the given uncertain graph $\mathcal{G}$. The graph-ensemble method ensembles sampled graphs, generates a weighted graph $\bar{G}$, and counts motif-roles from the weighted graph. The vector-ensemble method ensembles role vectors $\{\mathbf{R}_1, \ldots, \mathbf{R}_S\}$ obtained from each sampled graph and generates an averaged role matrix (vectors) $\bar{\mathbf{R}}$. The similarity-ensemble method ensembles similarity matrices $\{\mathbf{C}_1, \ldots, \mathbf{C}_S\}$ calculated from each role matrix (vectors) and generates an averaged similarity matrix $\bar{\mathbf{C}}$. The cluster-ensemble method ensembles affiliation matrices $\{\mathbf{H}_1, \ldots, \mathbf{H}_S\}$ obtained from each similarity matrix and generates an ensembled affiliation matrix $\bar{\mathbf{H}}$, which is a clustering result. When sampling many graphs, i.e., $S \simeq 2^L$, the ensembled results $\bar{G}$, $\bar{\mathbf{R}}$, $\bar{\mathbf{C}}$, and $\bar{\mathbf{H}}$ become close to the true results $\mathcal{G}$, $\mathcal{R}$, $\mathcal{C}$, and $\mathcal{H}$. The details of these existing methods are described in the following subsections.

Graph-ensemble method
First, we explain the graph-ensemble method proposed in our previous study (Naito and Fushimi 2021). The procedure for outputting the similarity matrix $\bar{\mathbf{C}}$ in the graph-ensemble method (hereinafter, the GE method) is shown in Algorithm 1. In the GE method, ensembling is performed on a group of sample graphs $\{G_1, \ldots, G_S\}$, $G_s = (V, E_s)$, $E_s \subseteq E$, to generate an ensembled graph $\bar{G}$ (see Algorithm 2). Here, $\bar{G} = (V, \bar{E}, \bar{p})$ is a weighted graph with weights $\bar{p}(e) = \sum_{s=1}^{S} \delta(e \in E_s) / S$, which is the sample probability of edge $e$ appearing in the $S$ sample graphs, and $\delta(cond)$ is a Boolean function that returns 1 if the condition $cond$ is True and 0 if it is False.
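The ensembling step amounts to counting, for each edge, the fraction of sampled graphs in which it appears. A minimal sketch follows; the names `sample_graph` and `ensemble_graph`, the seed, and the toy edge sets are assumptions for illustration and do not reproduce Algorithm 2 line by line.

```python
import random
from collections import Counter

def sample_graph(edge_probs, rng):
    """Draw one possible graph: keep each backbone edge independently with probability p(e)."""
    return {e for e, p in edge_probs.items() if rng.random() < p}

def ensemble_graph(sampled_edge_sets):
    """Edge weight = (number of samples containing the edge) / S."""
    S = len(sampled_edge_sets)
    counts = Counter(e for E_s in sampled_edge_sets for e in E_s)
    return {e: c / S for e, c in counts.items()}

# Hand-written samples so that the resulting weights are easy to verify.
toy_samples = [{(1, 2), (2, 3)}, {(1, 2)}, {(1, 2), (3, 1)}, {(2, 3)}]
toy_weights = ensemble_graph(toy_samples)  # (1,2): 3/4, (2,3): 2/4, (3,1): 1/4

# Randomly drawn samples from a small uncertain graph (seed fixed for repeatability).
backbone = {(1, 2): 0.9, (2, 3): 0.5, (3, 1): 0.1}
rng = random.Random(0)
drawn = [sample_graph(backbone, rng) for _ in range(100)]
drawn_weights = ensemble_graph(drawn)
```

Edges that never appear in any sample simply receive no entry, which corresponds to their being absent from the ensembled edge set.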
Next, for an ensembled graph $\bar{G}$, we search for connected-triples based on the algorithm of Itzhack et al. (2007) (Algorithm 3). In Algorithm 3, $\Gamma(u) = \{v; (u, v) \in \bar{E} \wedge (v, u) \in \bar{E}\}$ at Line 6 stands for the set of adjacent nodes of node $u$, and the set updated at Line 7 holds the nodes already searched for in the for-loop at Line 6. That is, the set difference at Line 8 represents the adjacent nodes of node $u$ that have not been searched for at Line 6. Then, for each connected-triple $G^{(m)}$ found, the role of each node is identified and counted in consideration of the edge weight $\bar{p}(e)$ of the ensembled graph. By aligning the number of roles for each node and regarding it as a vector, we construct the $(N \times R)$ role vectors (matrix) $\bar{\mathbf{R}}$ (Algorithm 4). In detail, (1) we represent the presence or absence of the 6 edges between the 3 nodes $u, v, w$ of a connected-triple $G^{(m)}$ by the 6-bit bit string $b_u$ via the motif2bits function; (2) we obtain the role number $i \leftarrow Rcode(b_u)$ from the dictionary $Rcode$, which is a correspondence table between the bit string and the role number; (3) we add the occurrence probability $\Pr[G^{(m)}]$ to the $i$th element $r_{u,i}$ of the role vector of node $u$, where $\Pr[G^{(m)}] \leftarrow \prod_{e \in E^{(m)}} \bar{p}(e) \prod_{e \in E \setminus E^{(m)}} (1 - \bar{p}(e))$ is the occurrence probability of the connected-triple $G^{(m)}$ calculated based on the presence/absence of the 6 edges and their probabilities of existence. For directed 3-node motifs, by bit-shifting the bit string $b_u$ focused on node $u$, the bit strings $b_v, b_w$ focused on the other 2 nodes $v, w$ can be obtained. For motifs with more than 3 nodes, this is not a simple bit shift, but the bit string can be obtained in a similar manner.
Naito and Fushimi Applied Network Science (2022) 7:55
After constructing the role vectors of each node, $\bar{\mathbf{H}}$ is output by classifying each node into clusters based on the matrix $\bar{\mathbf{C}}$, whose elements are the similarities between the role vectors. In this method, the ensembled graph $\bar{G}$ is obtained by ensembling $S$ graphs with $L$ edges. Let $p$ be the average edge existence probability; the expected number of edges in each sample graph is then $pL$, and accordingly an ensembled graph can be obtained in $O(SpL)$. For the one ensembled graph, connected triples are searched for according to the algorithm of Itzhack et al., and the number of roles of all $N$ nodes is counted. Therefore, as with the computational complexity of the 3-node motif count, when the average degree is $d$, the ensembled role vectors $\bar{\mathbf{R}}$ are obtained with a computational complexity of $O(N d^2)$.

Vector-ensemble method
The role-vector-ensemble method (hereinafter, VE method) generates an ensembled role vector $\bar{\mathbf{R}}$ by averaging the role vectors $\{\mathbf{R}_1, \ldots, \mathbf{R}_S\}$ obtained from the sample graphs $G_s$: $\bar{\mathbf{R}} = \frac{1}{S} \sum_{s=1}^{S} \mathbf{R}_s$. Then, the cosine similarity $\bar{\mathbf{C}}$ is calculated from the obtained ensembled role vector $\bar{\mathbf{R}}$. Each node is assigned to one of $K$ clusters based on the similarity matrix, and $\bar{\mathbf{H}}$ is output. When constructing the role vector $\mathbf{R}_s$ from each sample graph $G_s$, the LINC algorithm (Ma et al. 2019), which is the state-of-the-art technique, is used. The LINC algorithm focuses on the difference $D_{s,s'} = (E_s \setminus E_{s'}) \cup (E_{s'} \setminus E_s)$ between the edge sets $E_s$ and $E_{s'}$ of the two sample graphs $G_s$ and $G_{s'}$, and it updates only the number of appearances of the roles related to an edge $e \in D_{s,s'}$ whose existence/absence state has changed. Let $p$ be the average edge appearance probability. The expected number of edges that change state is $2L(p - p^2)$; hence, the algorithm is effective when the uncertainty is small, such as when $p = 0.1$ or $p = 0.9$. In this way, $S$ role vectors (matrices) $\{\mathbf{R}_1, \ldots, \mathbf{R}_S\}$, each of which is an $(N \times R)$ matrix, are efficiently calculated and averaged to obtain the ensembled role vector $\bar{\mathbf{R}}$. Therefore, if $m$ is the average number of motif instances including each edge, the ensembled role vector $\bar{\mathbf{R}}$ is obtained with a computational complexity of $O(S(L(p - p^2)m + NR))$ by the VE method.
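The quantity $2L(p - p^2)$ follows because an edge differs between two independently drawn samples with probability $2p(1-p)$: present in one and absent in the other, in either order. The sketch below illustrates the edge difference, this expectation, and the element-wise averaging used by the VE method. All function names are hypothetical, and the role-count update of LINC itself is not reproduced.

```python
def edge_difference(E_s, E_t):
    """D = (E_s \\ E_t) U (E_t \\ E_s): edges whose state differs between two samples."""
    return (E_s - E_t) | (E_t - E_s)

def expected_changed_edges(L, p):
    """Expected |D| for L edges with a common existence probability p."""
    return 2 * L * (p - p * p)

def average_matrices(matrices):
    """Element-wise mean of S equally sized matrices (lists of lists)."""
    S = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / S for j in range(cols)]
            for i in range(rows)]

diff = edge_difference({(1, 2), (2, 3)}, {(1, 2), (3, 1)})

# Low uncertainty (p near 0 or 1) changes far fewer edges than p = 0.5.
changes = {p: expected_changed_edges(1000, p) for p in (0.1, 0.5, 0.9)}

# Two toy role matrices (2 nodes x 2 role types) averaged as in the VE method.
R_bar = average_matrices([[[1, 0], [2, 2]], [[3, 0], [0, 2]]])
```

For 1000 edges, roughly 180 edges change between consecutive samples at $p = 0.1$ or $p = 0.9$, against 500 at $p = 0.5$, which matches the statement that the difference-based update pays off when uncertainty is small.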

Similarity-ensemble method
In the similarity-ensemble method (hereinafter, SE method), the average of the similarity matrices $\{\mathbf{C}_1, \ldots, \mathbf{C}_S\}$ calculated from the role vectors $\{\mathbf{R}_1, \ldots, \mathbf{R}_S\}$ is calculated, and the ensembled similarity matrix $\bar{\mathbf{C}}$ is generated: $\bar{\mathbf{C}} = \frac{1}{S} \sum_{s=1}^{S} \mathbf{C}_s$. Then, based on the ensembled similarity matrix $\bar{\mathbf{C}}$, all of the nodes are divided into clusters and $\bar{\mathbf{H}}$ is output. In the SE method, the number of roles in sample graphs is counted and updated based on the LINC algorithm, as in the VE method. In this way, $S$ similarity matrices $\{\mathbf{C}_1, \ldots, \mathbf{C}_S\}$, each of which is an $(N \times N)$ matrix, are calculated and then averaged. Therefore, the dominant computational complexity of the SE method to obtain the ensembled similarity matrix $\bar{\mathbf{C}}$ is $O(SN^2)$.

Cluster-ensemble method
The cluster-ensemble method ensembles the clustering results $\{\mathbf{H}_1, \ldots, \mathbf{H}_S\}$ and produces the membership matrix $\bar{\mathbf{H}}$. Unlike ensembling supervised classification results, ensembling unsupervised clustering results is a challenging task because the correspondence between the clusters obtained from different samples is not clear and the degree of correspondence has to be considered. Since no ensemble method for clustering results has been established yet, we do not discuss the issue in this article.

Proposed method: extended graph-ensemble method
This study proposes the extended graph-ensemble method (hereinafter, Ext-GE method) to calculate the expected value of the role frequency of each node under the assumption that a motif stochastically collapses and its nodes shift to other roles. As shown in Fig. 4, the hierarchy can be defined for each role according to the number of edges in the corresponding motif. For example, if any one of the edges outgoing from the Role 15 node disappears, it becomes Motif 7, and if any one of the bidirectional edges between Role 22 nodes disappears, it becomes Motif 5. When an edge is absent stochastically, the upper role changes to the lower role, and the frequency of appearance of the lower role increases. Therefore, when counting the role frequency of each node for $\bar{G} = (V, \bar{E}, \bar{p})$ ensembled with $S$ sample graphs $G_s$, $1 \le s \le S$, the subordinate motifs of the corresponding motif are searched for, and the number of roles included in those motifs is also counted at the same time (Algorithm 5). In the while-loop at Lines 6 to 18 in Algorithm 5, the motif of each connected triple found in the ensembled graph and its subordinate motifs and roles are considered.
In detail, at Line 4, to search for all of the lower motifs without duplication and without omission, we express the edge-existence state as a bit string $b$ via the motif2bits function, as do the above-mentioned methods GE, VE, and SE. By repeatedly performing the bit AND operation at Line 8 and the subtraction at Line 17 for $b_u$, in which all 6 bits are initialized with 1 at Line 5, the subordinate motifs and roles are searched for efficiently and comprehensively. The while-loop repeats at most 64 times in the case of a directed 3-node motif represented by 6 bits. The other parts are the same as the count_roles function in Algorithm 4. The bits2motif function at Line 11 is the inverse function of motif2bits at Line 4, which returns the graph structure whose edge states, i.e., presence or absence, are expressed as $b_u, b_v, b_w$. Algorithm 6 shows the whole picture of the Ext-GE method. The only difference between this method and the GE method (Algorithm 1) is whether the transition to the lower roles is considered when calculating the expected value of the role number at Line 8.
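The combination of a bit AND and a subtraction described above is closely related to the standard idiom for enumerating all sub-patterns of a bit mask. The sketch below uses that idiom, `sub = (sub - 1) & mask`, as a stand-in; it is not a transcription of Algorithm 5. The bit layout, the function names, and the omission of the role lookup (the paper's Rcode dictionary) are all assumptions made for illustration.

```python
# Assumed bit layout for a triple (u, v, w); not necessarily the paper's motif2bits:
# bit 0: u->v, bit 1: v->u, bit 2: u->w, bit 3: w->u, bit 4: v->w, bit 5: w->v
PAIR_MASKS = (0b000011, 0b001100, 0b110000)  # bits belonging to pairs uv, uw, vw

def is_connected_triple(mask):
    """A 3-node pattern is weakly connected iff at least two node pairs are linked."""
    return sum(1 for pm in PAIR_MASKS if mask & pm) >= 2

def subordinate_motifs(mask, weights):
    """Yield (submask, probability) for every connected sub-pattern of `mask`.

    weights[i] is the weight of the edge at bit i in the ensembled graph.
    The probability multiplies w for kept edges and (1 - w) for dropped edges.
    """
    sub = mask
    while True:
        prob = 1.0
        for i in range(6):
            if (mask >> i) & 1:
                prob *= weights[i] if (sub >> i) & 1 else 1.0 - weights[i]
        if is_connected_triple(sub):
            yield sub, prob
        if sub == 0:
            break
        sub = (sub - 1) & mask  # next smaller subset of the edges in `mask`

# Feed-forward triple u->v, u->w, v->w with all edge weights 0.5.
triple_mask = 0b010101
results = list(subordinate_motifs(triple_mask, [0.5] * 6))
connected_probability = sum(p for _, p in results)
```

For this triple, the full pattern and the three two-edge patterns remain connected, each with probability 0.125; the remaining probability mass corresponds to patterns in which the three nodes no longer form a connected triple. In the actual method, each yielded pattern would additionally be mapped to role numbers for the three nodes.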

Experimental evaluations
In this study, we tackle the problem of counting the number of motif-derived roles for the nodes of an uncertain graph and extracting node groups with similar role vectors. To confirm how the approximation of role counts by our method affects the similarity between role vectors and the final clustering result, we evaluate how accurately our method can output them against the true results.

Dataset and settings
In our experimental evaluations, role extraction based on the directed 3-node motif is performed on the following four directed graphs observed in the real world, and the effectiveness and efficiency of the proposed method are confirmed. The graph sizes are shown in Table 2. For these graphs, we set a uniform edge existence probability $p(e) = p \in \{0.1, 0.2, \ldots, 0.9\}$. For the last graph, we set a non-uniform probability $p(e) \sim \mathrm{Beta}(\alpha, \beta)$, $(\alpha, \beta) \in \{(1.5, 5.0), (2.5, 2.5), (5.0, 1.5)\}$. The number of samples is set to $S \in \{10^1, 10^2, 10^3, 10^4\}$, and the number of clusters is set to $K = 10$. By varying the number of samples and clusters, we evaluated the variation in the error of the results and in the execution time. Because we obtained similar results for other numbers of clusters, only the results for $K = 10$ are presented in this study.
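As a small illustration of the non-uniform setting, the sketch below computes the mean $\alpha / (\alpha + \beta)$ of each Beta distribution and draws per-edge probabilities with Python's standard `random.betavariate`. The function names, the seed, and the toy edge list are assumptions; the paper does not state how its random numbers were generated.

```python
import random

def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)

def assign_probabilities(edges, alpha, beta, seed=0):
    """Draw an existence probability for every edge from Beta(alpha, beta)."""
    rng = random.Random(seed)
    return {e: rng.betavariate(alpha, beta) for e in edges}

settings = [(1.5, 5.0), (2.5, 2.5), (5.0, 1.5)]
means = [round(beta_mean(a, b), 2) for a, b in settings]

toy_edges = [(i, i + 1) for i in range(50)]
toy_probs = assign_probabilities(toy_edges, 1.5, 5.0)
```

The three settings thus correspond to mostly low, symmetric, and mostly high edge-existence probabilities.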
In our experiments, the true value for the expected number of roles $\mathcal{R}$ is calculated based on a previous work (Todor et al. 2015), which calculates the expected number of motifs for the uncertain graph; the true similarity matrix $\mathcal{C}$ is calculated from $\mathcal{R}$, and the true clustering result $\mathcal{H}$ is computed from $\mathcal{C}$. As an error measure for role vectors and similarity matrices, we employed the root mean squared error (RMSE); for similarity matrices, it is computed between the true similarity matrix $\mathcal{C} = [c^*_{u,v}]_{u \in V, v \in V}$ and the approximated one $\bar{\mathbf{C}} = [c_{u,v}]_{u \in V, v \in V}$. As a similarity measure for the true clustering result $\mathcal{H}$ and the approximated one $\bar{\mathbf{H}}$, we employed normalized mutual information (hereinafter, NMI) (Kvålseth 2017).
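For reference, a generic RMSE between a true and an approximated matrix can be sketched as follows. The normalization used here (the mean over all matrix entries) is an assumption, since the exact formula is not recoverable from the text above, and the function name is hypothetical.

```python
import math

def rmse(true_matrix, approx_matrix):
    """Root mean squared error over all entries of two equally sized matrices."""
    squared = [
        (t - a) ** 2
        for true_row, approx_row in zip(true_matrix, approx_matrix)
        for t, a in zip(true_row, approx_row)
    ]
    return math.sqrt(sum(squared) / len(squared))

true_C = [[1.0, 0.5], [0.5, 1.0]]
approx_C = [[1.0, 0.0], [0.0, 1.0]]
error = rmse(true_C, approx_C)  # sqrt((0.25 + 0.25) / 4)
```

The same function applies unchanged to $(N \times R)$ role matrices and $(N \times N)$ similarity matrices.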

Error evaluation for role vectors
First, we evaluated our method in terms of the error of the role vectors. Figure 5 illustrates the RMSE in a logarithmic scale with respect to the number of samples $S$. For almost all of the networks we used, we could make the following observations. As the size of the graph increases, the absolute number of role appearances increases; therefore, the error value tends to increase. Conversely, the lower the edge-existence probability, the smaller the absolute number of role appearances; therefore, the error value tends to be smaller. As the number of samples increases, the error decreases, with some exceptions such as the GE method for the Blog and Enron networks. Furthermore, Ext-GE achieves errors smaller than or almost equal to those of VE.

Error evaluation for similarity matrix
Next, we discuss the RMSE of similarity matrices. Figure 6 depicts the RMSE in a logarithmic scale with respect to the number of samples $S$. For almost all of the networks we used, we made the following observations. As the size of the graph increases, the absolute number of role appearances increases; therefore, the error value tends to increase. As the number of samples increases, the errors of Ext-GE and VE decrease, while those of GE and SE do not decrease. Furthermore, Ext-GE achieves smaller errors than VE.

Similarity evaluation for clustering results
Next, we confirm the effectiveness of our method, focusing on the similarity to the true clustering results. Figure 7 shows the NMI with respect to the number of samples $S$. From these figures, we can make the following observations. In almost all cases, when the edge-existence probability is small and the number of samples is large, all methods produce results more similar to the true results (considering the difference in the axes' ranges). SE outputs the worst results; GE sometimes outputs good results depending on the networks; Ext-GE and VE stably output better results than the other methods independent of the networks, probability, and number of samples. The error of the role vectors affects the error of similarity matrices and the final clustering results; therefore, more accurate role vectors are required.

Efficiency
Next, we evaluate our method in terms of computational efficiency. Figure 8 shows the running time required to output the ensembled similarity matrix C from the given uncertain graph, on a logarithmic scale with respect to the number of samples S. From these figures, for all of the networks, Ext-GE is much faster than VE and SE, which are derived from the state-of-the-art LINC algorithm, especially when the number of samples is large.

Non-uniform setting
Finally, to examine the effect of the edge-existence probability setting, we compared the results for Celegans under uniform and non-uniform edge-existence probabilities. For the non-uniform setting, we drew the probability of each edge from a beta distribution, p(e) ∼ Beta(α, β), (α, β) ∈ {(1.5, 5.0), (2.5, 2.5), (5.0, 1.5)}. The mean of this distribution is α/(α + β), so the mean probabilities are about 0.23, 0.5, and 0.77, respectively. Figures 9 and 10 show the RMSE of the role vectors and the NMI of the clusters with respect to the number of samples S. From the results, we can observe that there is no remarkable difference between the uniform and non-uniform settings, i.e., our Ext-GE method achieves much smaller RMSE values and much higher NMI values than the  Although not shown here, there is a similar tendency in the RMSE of the similarity matrices. Furthermore, the computational costs of these methods do not depend on the probability values; in fact, the running times were confirmed to be almost the same.
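
The non-uniform setting can be sketched in Python as follows. The snippet draws one edge-existence probability per edge from Beta(α, β) with the standard-library `random.betavariate` and then samples one possible graph by keeping each edge independently with its probability. The helper names, the toy edge list, and the random seed are hypothetical and serve only as an illustration, not as the experimental code.

```python
import random


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution: alpha / (alpha + beta)."""
    return alpha / (alpha + beta)


def assign_edge_probabilities(edges, alpha, beta, rng):
    """Draw an edge-existence probability p(e) ~ Beta(alpha, beta) per edge."""
    return {edge: rng.betavariate(alpha, beta) for edge in edges}


def sample_graph(edge_probs, rng):
    """Sample one possible graph: keep each edge independently with p(e)."""
    return [edge for edge, p in edge_probs.items() if rng.random() < p]


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]  # hypothetical toy graph

for alpha, beta in [(1.5, 5.0), (2.5, 2.5), (5.0, 1.5)]:
    probs = assign_edge_probabilities(edges, alpha, beta, rng)
    sampled = sample_graph(probs, rng)
    print(alpha, beta, round(beta_mean(alpha, beta), 2), sampled)
```

The printed means are 0.23, 0.5, and 0.77 for the three parameter pairs, matching the values stated above; the sampled edge lists depend on the seed.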
Trajectories of information propagation in social media can be modeled as an uncertain graph with non-uniform edge-existence probabilities. Such an uncertain graph is observed as many instances where edges stochastically appear and disappear, and thus its true structure and true edge-existence probabilities cannot actually be known. Our method reflects this fact and ensembles many observed (sampled) graphs and outputs accurate results close to those obtained from the true structure. Therefore, our method is applicable to real-world uncertain graphs, and it is promising for accurately identifying important nodes in a viral marketing strategy.

Conclusion
In this study, for the task of motif-role extraction from an uncertain graph, we proposed an efficient and effective method, called the extended-graph ensemble method. It involves counting node roles defined by the position in motifs, calculating the similarity between nodes based on role vectors, and dividing all of the nodes into clusters with similar role vectors. This method ensembles sampled graphs and counts roles by considering the transition to lower-layer roles due to stochastically occurring edge-disappearance.
In experiments using real-world networks with artificially assigned uniform and non-uniform edge-existence probabilities, we confirmed the effectiveness and efficiency of our proposed method. The extended-graph ensemble method yields results with smaller errors from the true values and runs faster than the existing methods, including our previously proposed graph-ensemble method and the vector-ensemble and similarity-ensemble methods, both of which are derived from LINC, the state-of-the-art technique. Accordingly, we conclude that the extended-graph ensemble method is the most suitable for the problem addressed in this study.
Future tasks include motif-role extraction that is not limited to 3-node motifs, determination of the appropriate number of samples using Hoeffding's inequality, and more detailed analysis of the clustering results.
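
To indicate how Hoeffding's inequality could guide the choice of the number of samples, the following sketch computes the standard two-sided bound for the mean of independent samples bounded in an interval of width R: with S ≥ R² ln(2/δ) / (2ε²) samples, the sample mean deviates from the expectation by more than ε with probability at most δ. This is only an illustration of the general bound under the assumption that a single bounded quantity is estimated; it is not a procedure evaluated in this study, and the function name and parameters are hypothetical.

```python
import math


def hoeffding_samples(epsilon, delta, value_range=1.0):
    """Smallest S with 2 * exp(-2 * S * epsilon**2 / value_range**2) <= delta.

    epsilon: tolerated absolute error of the sample mean.
    delta: tolerated failure probability.
    value_range: width R of the interval that bounds each sampled value.
    """
    if epsilon <= 0 or not 0 < delta < 1 or value_range <= 0:
        raise ValueError("require epsilon > 0, 0 < delta < 1, value_range > 0")
    bound = value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2)
    return math.ceil(bound)


# For values in [0, 1], error 0.1 and failure probability 0.05:
print(hoeffding_samples(0.1, 0.05))  # 185
```

Because role counts are not confined to [0, 1], `value_range` would have to be set to an upper bound on the count of the role in question, and guaranteeing the error for many nodes and roles simultaneously would additionally require, for example, a union bound.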