 Research
 Open access
DCRST: a parallel algorithm for random spanning trees in network analytics
Applied Network Science volume 9, Article number: 45 (2024)
Abstract
The Mantel Test, introduced in the 1960s, determines whether two distance metrics on a network are related. More recently, DimeCost, an equivalent test with improved computational complexity, was proposed. It was based on computing a random spanning tree of a complete graph on n vertices—the spanning tree was computed using Wilson’s random walk algorithm. In this paper, we describe DCRST, a parallel, divide-and-conquer random walk algorithm to further speed up this computation. Relative to Wilson’s sequential random-walk algorithm, on a system with 48 cores, DCRST was up to 4X faster when first creating random partitions and up to 20X faster without this substep. DCRST is shown to be a suitable replacement for the Mantel and DimeCost tests through a combination of theoretical and statistical results.
Introduction
The Mantel Test (Mantel 1967) is a well-established statistical test for characterizing the relation between two distance metrics (for example, to determine the relationship between drive times and driving distances in a road network). The Mantel Test and its variants have been used to quantify such relationships and are available in a package for the statistical language R (Dray and Dufour 2007). Schneider and Borlund (2007) have reviewed its use in anthropology, psychology, and geography; see also Sokal and Rohlf (1962), Corliss et al. (1974) and Cooper-Ellis and Pielou (1994). Ricaut et al. (2010) used it to determine whether there is a correlation between genetic and discrete trait proximity matrices for individuals in the Egyin Gol necropolis in Mongolia. Smouse et al. (1986) describe the importance of the Mantel Test in biology, on distance metrics of genetic markers, morphological traits, ecological divergence, etc. Kouri et al. (2014) used it in cheminformatics, to study the relationship between bond count distances and Tanimoto distances.
The Mantel Test entails computing Pearson’s correlation coefficient on an \(n \times n\) matrix for a number of permutations (denoted by t), making the overall computation \(\Theta (t \cdot n^2)\), where t (typically 100) may be considered a constant. This computational complexity is prohibitive for modern Big Data applications (social networks, the internet web graph, etc.), where \(n \sim 10^9\). Bourbour et al. (2020) recently presented DimeCost, which instead uses uniform random spanning trees to achieve an improved complexity of \(\Theta (t \cdot n)\).
This paper goes one step further: a key bottleneck in DimeCost is the computation of a uniform random spanning tree of a complete graph using Wilson’s algorithm; we aim to speed this up by performing this step in parallel. While the generation of uniform random spanning trees in parallel has been studied before, this was only done theoretically (Anari et al. 2021). Our proposed algorithm, DCRST (Divide-and-Conquer Random Spanning Tree), has been analyzed theoretically and implemented experimentally, and is shown to be a suitable replacement for sequential random spanning tree generation.
Definitions
Consider a set of n objects and two distance metrics \(d_1\) and \(d_2\), which can be used to compute pairwise distances between all pairs of objects.^{Footnote 1} A distance metric can be interpreted as a symmetric matrix: an \(n \times n\) matrix, say \(D_1\), where \(D_1[i,j]\) is the distance between i and j using the \(d_1\) metric. It may also be viewed as a weighted network: a weighted, undirected, complete graph, \(G_1\), where each vertex corresponds to an object, and an edge from \(v_i\) to \(v_j\) has weight equal to the distance between i and j using the distance metric \(d_1\). Both representations are equivalent: matrix \(D_1\) can be viewed as the weighted adjacency matrix for \(G_1\) as illustrated in Fig. 1.
Mantel test
The Mantel test answers the question: is there a statistically significant relationship between distance metrics \(d_1\) and \(d_2\)? In the road network example, we expect drive times to be closely related to driving distances, and the Mantel test provides a quantitative method for measuring this relationship. Specifically, the Mantel test takes as input the matrices \(D_1\) and \(D_2\) and:

1.
Computes the Pearson’s correlation coefficient (\(r_{init}\)) between \(D_1\) and \(D_2\).

2.
Randomly permutes the rows and columns of \(D_1\) to obtain \(D_1'\).

3.
Computes r between \(D_1'\) and \(D_2\).
Steps 2 and 3 are run t times and the number of times x that \(r > r_{init}\) is recorded. If \(x/t \le p\) (e.g., p could be 0.05), the test asserts that metrics \(d_1\) and \(d_2\) are related (Mantel 1967).
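To make the procedure concrete, here is a minimal C++ sketch of the Mantel test (our illustration, not the authors’ implementation; the names `pearson_upper` and `mantel_test` are our own). It computes Pearson’s r over the upper triangles of the symmetric distance matrices and returns the fraction x/t:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Pearson correlation over the strict upper triangles of two n x n
// symmetric distance matrices.
double pearson_upper(const Matrix& A, const Matrix& B) {
    std::vector<double> x, y;
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = i + 1; j < A.size(); ++j) {
            x.push_back(A[i][j]);
            y.push_back(B[i][j]);
        }
    double n = static_cast<double>(x.size());
    double mx = std::accumulate(x.begin(), x.end(), 0.0) / n;
    double my = std::accumulate(y.begin(), y.end(), 0.0) / n;
    double sxy = 0, sxx = 0, syy = 0;
    for (size_t k = 0; k < x.size(); ++k) {
        sxy += (x[k] - mx) * (y[k] - my);
        sxx += (x[k] - mx) * (x[k] - mx);
        syy += (y[k] - my) * (y[k] - my);
    }
    return sxy / std::sqrt(sxx * syy);
}

// Mantel test: returns x/t, the fraction of t permutations whose
// correlation exceeds r_init; a small value (<= p) suggests the
// metrics are related.
double mantel_test(const Matrix& D1, const Matrix& D2, int t, unsigned seed) {
    double r_init = pearson_upper(D1, D2);
    std::mt19937 gen(seed);
    std::vector<size_t> perm(D1.size());
    std::iota(perm.begin(), perm.end(), 0);
    int x = 0;
    for (int trial = 0; trial < t; ++trial) {
        std::shuffle(perm.begin(), perm.end(), gen);
        // Apply the same permutation to both rows and columns of D1.
        Matrix D1p(D1.size(), std::vector<double>(D1.size()));
        for (size_t i = 0; i < D1.size(); ++i)
            for (size_t j = 0; j < D1.size(); ++j)
                D1p[i][j] = D1[perm[i]][perm[j]];
        if (pearson_upper(D1p, D2) > r_init) ++x;
    }
    return static_cast<double>(x) / t;
}
```

The \(\Theta (t \cdot n^2)\) cost is visible directly: each of the t trials rebuilds and rescans an \(n \times n\) matrix.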
DimeCost
Bourbour et al. (2020) proposed an algorithm called DimeCost, which uses uniform random spanning trees instead of matrices. Recall that the distance matrix is equivalent to a weighted, undirected, complete graph on n vertices. DimeCost computes a random spanning tree of this graph. \(r_{init}\) is then computed using the edges of the spanning tree, and permutations of the \(d_1\) weights are used to perform a test similar to Mantel’s. Bourbour et al. show this method works better than randomly selecting edges, has a lower complexity than the original Mantel test (\(\Theta (t \cdot n)\) instead of \(\Theta (t \cdot n^2)\)), and results in similar correlation values between distance metrics.
Random walks
A key step in DimeCost uses random walks on the graph to obtain a uniform random spanning tree. Given an undirected graph \(G = (V, E)\), a spanning tree is any undirected, acyclic, connected subgraph \(T = (V_T, E_T)\) that spans the graph. In general, undirected graphs have many spanning trees; a uniform random spanning tree is a spanning tree selected from the set of all possible spanning trees such that each tree is equally likely to be chosen. Aldous (1990) and Broder (1989) gave equivalent algorithms using a random walk to generate a uniform random spanning tree by tracking the edges traversed when discovering a vertex. Their algorithm has complexity equal to the mean cover time of the graph—for cliques, this is \(O(n \log n)\) (Lovász 1993). Later, Wilson (1996) gave a different algorithm: the key difference is that at any point in the random walk, a partial tree T has been created (initially just the starting vertex). To expand T, a vertex \(v_i\) not in T is selected uniformly at random, and a random walk is started at \(v_i\) until a path (with cycles erased) is found that reaches T. This path is then added to T, forming a larger partial tree. The algorithm terminates when T consists of \(n - 1\) edges, i.e., when T is a spanning tree. The complexity of Wilson’s algorithm is equal to the mean hitting time of a graph—for cliques, this is O(n) (Lovász 1993). DimeCost uses Wilson’s algorithm.
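Wilson’s algorithm specializes nicely to a clique, since each step of the walk may jump to any other vertex chosen uniformly at random. A minimal sequential sketch (our illustration under those assumptions; the name `wilson_clique` is ours):

```cpp
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Wilson's algorithm specialized to a clique on vertices {0, ..., n-1}:
// repeatedly start a loop-erased random walk from a vertex not yet in
// the partial tree and walk until the tree is hit.
std::vector<std::pair<int, int>> wilson_clique(int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> pick(0, n - 1);
    std::vector<bool> in_tree(n, false);
    std::vector<int> next(n, -1);   // next[v]: successor of v on the current walk
    std::vector<std::pair<int, int>> edges;
    in_tree[pick(gen)] = true;      // root of the partial tree
    for (int v = 0; v < n; ++v) {
        if (in_tree[v]) continue;
        // Random walk from v until the partial tree is reached;
        // overwriting next[] on revisits erases cycles implicitly.
        int u = v;
        while (!in_tree[u]) {
            int w;
            do { w = pick(gen); } while (w == u);   // any other vertex, uniformly
            next[u] = w;
            u = w;
        }
        // Add the loop-erased path from v to the tree.
        for (u = v; !in_tree[u]; u = next[u]) {
            in_tree[u] = true;
            edges.emplace_back(u, next[u]);
        }
    }
    return edges;   // n - 1 edges: a spanning tree
}
```

On a clique every step hits the tree with probability proportional to the tree’s size, which is why the expected running time is O(n).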
Parallel algorithm and analysis
Algorithm outline
We propose the following approach to create random spanning trees.

1
Randomly partition the original clique into k subcliques.

2
Run Wilson’s algorithm on each subclique, forming a forest of k spanning subtrees.

3
“Merge” the subtrees together to form the final tree as follows:

(a)
Consider each subclique to be a “supernode”, mutually connected to form a “supergraph” clique.

(b)
Run Wilson’s algorithm on the “supergraph” to get a “supertree”.

(c)
For each superedge in the supertree, obtain a “real” edge by choosing a vertex uniformly at random from each of the two subtrees and joining them with an edge.

Consider a clique of 24 vertices randomly partitioned in Step 1 into four subgraphs of six vertices each, as shown in Fig. 2.
Step 2 then runs four independent instances of Wilson’s algorithm (one per subgraph) in parallel, resulting in a forest of spanning trees (referred to as “subtrees”) shown in Fig. 3. Recall that Wilson’s algorithm generates uniform random spanning trees; thus this forest of spanning trees is just one of many possible forests.
With these subtrees, we now run the “Merge” step of Step 3. Step 3a creates the “supergraph”, which is a clique of 4 vertices as shown in Fig. 4. Step 3b runs Wilson’s algorithm on the supergraph, to form a “supertree”, also shown in Fig. 4.
Step 3c converts each superedge in the supertree to a “real” edge by choosing uniformly at random a vertex from each tree and joining those two with an edge. The example supertree indicates trees 0 and 2 should be connected with an edge, thus a vertex is chosen uniformly at random from each tree to be connected (vertices 12 and 3). This process, repeated for the remaining superedges, results in the final tree (new edges in red), shown in Fig. 5.
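Putting Steps 1–3 together, the following is a compact sequential sketch of DCRST (our reconstruction from the description above, not the paper’s Fig. 8 pseudocode; in the actual implementation the per-subclique loop is the part run in parallel with OpenMP):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Wilson's algorithm on a clique over an arbitrary vertex set `verts`;
// returns the spanning tree's edges in terms of the original labels.
std::vector<std::pair<int, int>> wilson_clique(const std::vector<int>& verts,
                                               std::mt19937& gen) {
    int n = static_cast<int>(verts.size());
    std::uniform_int_distribution<int> pick(0, n - 1);
    std::vector<bool> in_tree(n, false);
    std::vector<int> next(n, -1);
    std::vector<std::pair<int, int>> edges;
    in_tree[pick(gen)] = true;
    for (int v = 0; v < n; ++v) {
        if (in_tree[v]) continue;
        int u = v;
        while (!in_tree[u]) {                       // loop-erased random walk
            int w;
            do { w = pick(gen); } while (w == u);
            next[u] = w;
            u = w;
        }
        for (u = v; !in_tree[u]; u = next[u]) {     // add the erased path
            in_tree[u] = true;
            edges.emplace_back(verts[u], verts[next[u]]);
        }
    }
    return edges;
}

// DCRST sketch: partition, run Wilson per subclique (parallelizable),
// then merge the subtrees via a supertree.
std::vector<std::pair<int, int>> dcrst(int n, int k, unsigned seed) {
    std::mt19937 gen(seed);
    // Step 1: random partition into k near-equal subcliques.
    std::vector<int> A(n);
    std::iota(A.begin(), A.end(), 0);
    std::shuffle(A.begin(), A.end(), gen);
    std::vector<std::vector<int>> parts(k);
    for (int i = 0; i < n; ++i) parts[i % k].push_back(A[i]);
    // Step 2: one spanning subtree per subclique.
    std::vector<std::pair<int, int>> edges;
    for (auto& p : parts) {
        auto sub = wilson_clique(p, gen);
        edges.insert(edges.end(), sub.begin(), sub.end());
    }
    // Steps 3a-b: supertree over the k supernodes.
    std::vector<int> supernodes(k);
    std::iota(supernodes.begin(), supernodes.end(), 0);
    auto supertree = wilson_clique(supernodes, gen);
    // Step 3c: realize each superedge as an edge between uniformly
    // chosen vertices of the two subtrees.
    for (auto [i, j] : supertree) {
        int u = parts[i][std::uniform_int_distribution<int>(
            0, static_cast<int>(parts[i].size()) - 1)(gen)];
        int v = parts[j][std::uniform_int_distribution<int>(
            0, static_cast<int>(parts[j].size()) - 1)(gen)];
        edges.emplace_back(u, v);
    }
    return edges;   // (n - k) subtree edges + (k - 1) merge edges = n - 1
}
```

Note that the edge count works out automatically: the k subtrees contribute \(n - k\) edges and the supertree contributes \(k - 1\) more, for \(n - 1\) in total.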
Theoretical analysis
We first show that our algorithm meets the following necessary condition for a uniform random spanning tree.
Theorem 1
Let \(G = (V, E)\) be a complete, undirected graph (i.e., a clique), with \(n = |V|\) and \(m = |E|\). Let \((u, v) \in E\) be any edge in G. Let T be a spanning tree, chosen uniformly at random from all possible spanning trees of G (i.e., T is a uniform random spanning tree).
Then, the probability \((u,v) \in T\) is:
$$\begin{aligned} P((u, v) \in T) = \frac{2}{n} \end{aligned}$$(1)
Proof
We prove this by composition:

1
For a clique with n vertices, there are \(n^{n-2}\) spanning trees (Cayley’s formula).

2
Each spanning tree has \(n - 1\) edges.

3
The total number of edges in the graph is:
$$\begin{aligned} m = (n - 1) + (n - 2) + \cdots + 1 = \binom{n}{2} = \frac{n \cdot (n - 1)}{2} \end{aligned}$$(2)
Combining—by symmetry, each of the m edges is equally likely to appear among the \(n - 1\) edges of T—we get:
$$\begin{aligned} P((u, v) \in T) = \frac{n - 1}{m} = \frac{n - 1}{n \cdot (n - 1) / 2} = \frac{2}{n} \end{aligned}$$(3)
\(\square\)
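As a quick sanity check (our addition), consider the smallest nontrivial case, \(n = 3\):

```latex
% K_3 has 3^{3-2} = 3 spanning trees; each tree omits exactly one of
% the m = 3 edges, so any fixed edge appears in 2 of the 3 trees:
P((u, v) \in T) = \frac{n - 1}{m} = \frac{2}{3} = \frac{2}{n}.
```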
We now consider the case where the original graph is partitioned into equalsized disjoint groups, and our proposed algorithm is used to create the final tree T.
Theorem 2
Let \(G = (V, E)\) be a clique, let \((u, v) \in E\) be any edge in G, and let T be a spanning tree our algorithm creates. If the vertices V are partitioned uniformly at random into \(n_1\) groups each of size \(n_2\) (thus \(n = n_1 \cdot n_2\)), then the probability \((u, v) \in T\) is: \(P((u, v) \in T) = \frac{2}{n}\).
Proof
When partitioning, there are two general “types” of edges^{Footnote 2}:

1
Edges within a partition (called “within” edges). Formally: given a subset of vertices \(V_i \subseteq V\), an edge (u, v) is a within edge if and only if \(u \in V_i \wedge v \in V_i\).

2
Edges spanning between partitions (called “between” edges). Formally: given two subsets of vertices \(V_i \subseteq V\) and \(V_j \subseteq V\) with \(i \ne j\), an edge (u, v) is a between edge if and only if \(u \in V_i \wedge v \in V_j\).
We start by considering each case individually.

1.
Within Edges. Due to equal partitions, \(|V_i| = n_2 \quad \forall i \in [1, n_1]\); thus,
$$\begin{aligned} P((u, v) \in T \mid u \in V_i \wedge v \in V_i) = \frac{2}{n_2} \end{aligned}$$(7)
This is a corollary of Theorem 1.

2.
Between Edges. Between any pair of subtrees, there are \(n_2^2\) edges (each vertex, of which there are \(n_2\) in each tree, has an edge to every vertex in the other tree). There are \(\binom{n_1}{2}\) unique pairs of subtrees, thus the total number of “between” edges is:
$$\begin{aligned} \binom{n_1}{2} \cdot n_2^2 \end{aligned}$$(8)
The random spanning tree T has \(n_1 - 1\) “between” edges (since there are only \(n_1\) subtrees to be connected), thus the probability that \((u, v) \in T\) given that \(u \in V_i \wedge v \in V_j, i \ne j\) is:
$$\begin{aligned} P((u, v) \in T \mid u \in V_i \wedge v \in V_j) = \frac{n_1 - 1}{\binom{n_1}{2} \cdot n_2^2} = \frac{n_1 - 1}{\frac{n_1 (n_1 - 1)}{2} \cdot n_2^2} = \frac{2}{n_1 \cdot n_2^2} \end{aligned}$$(9)
We now construct the final probability by using these two cases. To do this, we first calculate the probability that two arbitrary vertices u and v are placed into the same or different partitions.

Let \(E_1\) be the event that u and v are placed into the same partition. Formally, \(u \in V_i \wedge v \in V_i\).

Let \(E_2\) be the event that u and v are placed into different partitions. Formally, \(u \in V_i \wedge v \in V_j, i \ne j\).
Clearly, \(P(E_2) = 1 - P(E_1)\). We first calculate \(P(E_1)\):
Without loss of generality, suppose u is randomly placed into \(V_i\). Since each partition will be of size \(n_2\), and there are n vertices total, the probability that v is also placed into \(V_i\) is \(\dfrac{n_2 - 1}{n - 1}\).
We now combine the two cases above with \(P(E_1)\) and \(P(E_2)\) to solve for the final probability:
$$\begin{aligned} P((u, v) \in T) = P(E_1) \cdot \frac{2}{n_2} + P(E_2) \cdot \frac{2}{n_1 \cdot n_2^2} = \frac{n_2 - 1}{n - 1} \cdot \frac{2}{n_2} + \frac{n_2 (n_1 - 1)}{n - 1} \cdot \frac{2}{n_1 \cdot n_2^2} = \frac{2 (n_1 n_2 - 1)}{n (n - 1)} = \frac{2}{n} \end{aligned}$$
\(\square\)
We now consider the case where the vertices V are randomly split into k groups of possibly unequal sizes, \(g_1\) through \(g_k\). That is, \(n = \sum_{i=1}^{k} g_i\).
Theorem 3
Let \(G = (V, E)\) be a clique, let \((u, v) \in E\) be any edge in G, and let T be a spanning tree our algorithm creates. If the vertices V are partitioned uniformly at random into k groups of possibly unequal size, then the probability \((u, v) \in T\) is: \(P((u, v) \in T) = \frac{2}{n}\)
Proof
The proof follows the same sequence as that for equal partitions.
When partitioning, there are two general “types” of edges:

1
Edges within a partition (called “within” edges). Formally: given a subset of vertices \(V_i \subseteq V\), an edge (u, v) is a within edge if and only if \(u \in V_i \wedge v \in V_i\).

2
Edges spanning between partitions (called “between” edges). Formally: given two subsets of vertices \(V_i \subseteq V\) and \(V_j \subseteq V\) with \(i \ne j\), an edge (u, v) is a between edge if and only if \(u \in V_i \wedge v \in V_j\).
We start by considering each case individually.

1
Within Edges. For any group i, whose vertex set \(V_i\) has size \(g_i\):
$$\begin{aligned} P((u, v) \in T \mid u \in V_i \wedge v \in V_i) = \frac{2}{g_i} \end{aligned}$$(13)
This is a corollary of Theorem 1.

2
Between Edges. Between any two groups i and j, there are \(g_i \cdot g_j\) edges. Since the group sizes \(g_1, \ldots , g_k\) are possibly unequal, we must multiply the sizes of each pair of groups and sum to count the total number of “between” edges:
$$\begin{aligned} \sum _{1\le i < j \le k} \left( g_i \cdot g_j\right) \end{aligned}$$(14)
By Vieta’s formulas, this sum equals the coefficient \(a_{k-2}\) of the polynomial given in (15).
$$\begin{aligned} f(x) = x^k + a_{k-1} x^{k-1} + a_{k-2} x^{k-2} + \cdots + a_1 x + a_0 = (x - g_1) \cdot (x - g_2) \cdot \ldots \cdot (x - g_k) \end{aligned}$$(15)
The random spanning tree T has \(k - 1\) “between” edges (since there are only k subtrees to be connected), thus the probability \((u, v) \in T\) given that (u, v) is a “between” edge is:
$$\begin{aligned} P((u, v) \in T \mid (u,v) \text { is a ``between'' edge}) = \frac{k - 1}{a_{k - 2}} \end{aligned}$$(16)
We now construct the final probability by using these two cases. To do this, we will first calculate the probability two arbitrary vertices u and v are placed into the same or different partitions.

Let \(E_1\) be the event that u and v are placed into the same group i. Formally, \(u \in V_i \wedge v \in V_i\).

Let \(E_2\) be the event that u and v are placed into different groups i and j. Formally, \(u \in V_i \wedge v \in V_j, i \ne j\).
We now calculate \(P(E_1)\) and \(P(E_2)\):
Without loss of generality, suppose u is randomly placed into group i. Since group i has size \(g_i\), and there are n vertices total, the probability that v is also placed into group i is \(\dfrac{g_i - 1}{n - 1}\). A similar argument is made for \(P(E_2)\).
We now combine the two cases above with \(P(E_1)\) and \(P(E_2)\) to solve for the final probability.
While the initial step is complicated, it follows directly from the previous work:

1
We enumerate over all pairwise groups, and need to consider two cases \(i = j\) and \(i \ne j\).

2
The first case, \(i=j\), is simply: \(P(E_1) \cdot P((u, v) \in T \mid E_1)\)

3
The second case, \(i \ne j\), is the same idea: \(P(E_2) \cdot P((u, v) \in T \mid E_2)\). However, when this is placed in the summation, we need to tweak \(P((u, v) \in T \mid E_2)\). This is because the solution found previously using Vieta’s formulas already accounts for the sum, yet we are starting from the sum again.^{Footnote 3} Therefore, (a) only 1 of the \(g_i \cdot g_j\) edges between groups i and j is needed, and (b) \(\dfrac{k - 1}{\binom{k}{2}} = \dfrac{2}{k}\) is the probability an edge will even be needed between groups i and j (i.e., the probability that Wilson’s algorithm chooses to connect the “supernodes” for groups i and j directly with an edge).
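Writing the combination out in full (our reconstruction from the two cases and the tweaks just described), the sum telescopes to the claimed probability:

```latex
% First term: i = j ("within" edges); second term: i < j ("between"
% edges, with the tweaked conditional probability described above).
\begin{aligned}
P((u, v) \in T)
  &= \sum_{i=1}^{k} \frac{g_i (g_i - 1)}{n (n - 1)} \cdot \frac{2}{g_i}
   + \sum_{1 \le i < j \le k} \frac{2 \, g_i g_j}{n (n - 1)}
     \cdot \frac{2}{k} \cdot \frac{1}{g_i g_j} \\
  &= \frac{2 (n - k)}{n (n - 1)}
   + \binom{k}{2} \cdot \frac{4}{k \, n (n - 1)}
   = \frac{2 (n - k) + 2 (k - 1)}{n (n - 1)}
   = \frac{2}{n}.
\end{aligned}
```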
\(\square\)
However, a simple counterexample shows that DCRST does not generate random spanning trees with uniform probability. Specifically, the “star” tree (Fig. 6) will never be generated by DCRST unless \(k = 1\) or \(k = n\) (when DCRST degenerates to Wilson’s algorithm). Thus, DCRST creates a random (but non-uniform) spanning tree of a clique such that edges are chosen with the same probability as in a uniform random spanning tree.
Statistical analysis
This raises the following question: if DCRST does not generate uniform random spanning trees, then is DCRST a suitable replacement for Wilson’s algorithm in DimeCost? To answer this, we compared two versions of DimeCost—one using Wilson’s algorithm, the other using DCRST—on three data sets used by Bourbour et al. In each experiment, we use the Mantel Test to compute the correlation coefficient between a pair of distance matrices (r), and compute 95% confidence intervals (CIs) for DimeCost using both Wilson’s algorithm and DCRST for spanning tree generation. We then examine whether the CIs contain r.
Data set 1: Comparison of distance norms
For this first data set, we randomly generated 25 (x, y) points. For each pair of points (i.e., \(\frac{25\times 24}{2} = 300\) pairs of points), we applied four distance metrics: the \(L_1\), \(L_2\), \(L_3\), and \(L_\infty\) norms; considering these pairwise gives six data sets.^{Footnote 4} The results are summarized in Table 1.
The CIs using DCRST differ slightly from those using Wilson’s algorithm, but still contain r from the Mantel Test.
Data set 2: Line graph
For this example, consider a road network represented by a line graph, modeling a long continuous stretch of an interstate highway where each edge represents a highway segment. In the first graph, the weights of all edges (representing travel distances) are equal to 1. In the second graph (representing travel time), the weights of all edges are again 1 except the last edge, which possesses an unusually high weight, 100. This could represent an accident in that segment of the highway causing times, but not distances, to drastically increase. Both graphs are illustrated for the case of \(n = 6\) in Fig. 7.
We then computed distance matrices for both metrics by computing the shortest path distance between each pair of vertices. Corresponding elements in the two matrices have identical values, except the last row and column, which represent edges to the last vertex. The results are summarized in Table 2.
As before, while the CIs using DCRST differ slightly from those computed using Wilson’s algorithm, they still contain the r value calculated by the Mantel Test. This test previously showed that simply choosing random edges from the complete graph does not suffice, as the CIs generated in this example with random edges do not contain the r value calculated by the Mantel Test (Bourbour et al. 2020).
Data set 3: Random
Lastly, we consider the case of two random distance metrics; i.e., the value returned by \(d_1(a, b)\) and \(d_1(b, a)\) is simply a random number (likewise for \(d_2\)). We expect these two distance metrics to be uncorrelated. The results are summarized in Table 3.
Again, the confidence intervals both contain the correlation coefficient calculated by the Mantel Test. These results suggest that DCRST is indeed a suitable replacement for Wilson’s algorithm in DimeCost.
Performance analysis
Implementation details
We implemented DCRST in C++ along with the OpenMP parallel library. The pseudocode is shown below in Fig. 8.
The function random-partition() was implemented by (1) initializing an array A with entries \(1 \ldots n\) and (2) shuffling A uniformly at random. This array now defines the partitions. If parts \(1 \ldots k\) have sizes \(p_1, \ldots , p_k\), then the first \(p_1\) elements of A are assigned to part 1, the next \(p_2\) elements of A are assigned to part 2, etc. Step (2) of random-partition() involves shuffling an array of size n—the best known serial algorithm for this is Fisher-Yates, which runs in O(n) time (Bacher et al. 2018).
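A minimal sketch of this partitioning step (our illustration; the C++ name `random_partition` and its signature are assumptions, and the Fisher-Yates loop here is the serial O(n) shuffle described above):

```cpp
#include <cassert>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Fill A with entries 1..n, Fisher-Yates shuffle it, then cut A into
// consecutive runs of the requested part sizes (which must sum to n).
std::vector<std::vector<int>> random_partition(int n,
                                               const std::vector<int>& sizes,
                                               unsigned seed) {
    std::vector<int> A(n);
    std::iota(A.begin(), A.end(), 1);          // entries 1..n
    std::mt19937 gen(seed);
    for (int i = n - 1; i > 0; --i) {          // Fisher-Yates: O(n) serial shuffle
        std::uniform_int_distribution<int> d(0, i);
        std::swap(A[i], A[d(gen)]);
    }
    std::vector<std::vector<int>> parts;
    int pos = 0;
    for (int s : sizes) {                      // first p1 entries -> part 1, etc.
        parts.emplace_back(A.begin() + pos, A.begin() + pos + s);
        pos += s;
    }
    return parts;
}
```

The serial shuffle is exactly the bottleneck discussed in the next subsection; MergeShuffle replaces this loop with a parallel permutation.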
Performance on a symmetric multiprocessor architecture
We compared DCRST against the serial implementation of Wilson’s algorithm on various combinations of k and n—in real-world settings, we recommend k be at least as large as the number of CPU cores. The function std::uniform_int_distribution was used to generate the random numbers required in wilsons-algorithm(). This code was executed on Colorado School of Mines’ “Isengard” server, which is a 48-core symmetric multiprocessor architecture with over 350 GB of main memory.
We initially implemented step (2) of random-partition() with a standard sequential shuffle. The speedups are shown in Fig. 9. As can be seen, the speedups are far below linear: even with 200 partitions, we fail to achieve even 3x speedup.
After performing a runtime analysis, we found that the algorithm spent almost 90% of total execution time on generating a random partition. Intuitively, this makes sense, as Wilson’s algorithm is O(n) on cliques, and a serial shuffle (which is a substep of making the partitions) is also O(n)—thus our speedup comes entirely from the lower-order terms hidden by the asymptotic notation.
To address this, we implemented step (2) of random-partition() with a parallel shuffling algorithm instead: MergeShuffle (Bacher et al. 2018), which was chosen because of the availability of an implementation that uses OpenMP. The authors of MergeShuffle suggest it is the current fastest parallel shuffling algorithm.
The speedups are shown in Fig. 10. While this results in an improvement over Wilson’s algorithm, observe that the speedup of DCRST remains sublinear w.r.t. the number of processors, with \(k = 200\) partitions achieving a little less than 4x speedup on the largest clique.
To further assess our algorithm, we simply removed the shuffle step from random-partition(). For timing purposes, this yields the equivalent of shuffling in 0 time. We called this variation NoShuffle, and reran our benchmarks on NoShuffle to compare its performance to the previous results. This is shown in Fig. 11.
As expected, NoShuffle performs significantly better: on some inputs, NoShuffle is more than 20X faster than the serial algorithm. This illustrates the potential for DCRST to improve the performance of generating random spanning trees, should a faster shuffle be found or if random partitions are precomputed.
The results are summarized in Fig. 12. This figure illustrates the speedup of the different variations of DCRST (std::shuffle, MergeShuffle, and NoShuffle) over serial across the different clique sizes, all at \(k = 200\) partitions. NoShuffle is by far the fastest, followed by MergeShuffle, with std::shuffle coming up last.
Conclusions and discussion
We have described DCRST, a parallel divide-and-conquer algorithm for creating random spanning trees on a clique. While DCRST does not result in uniform random spanning trees, we showed experimentally that the impact on the proposed network science application DimeCost is not significant. On a machine with 48 cores, DCRST achieves 4X speedup when using MergeShuffle, and 20X speedup when shuffling is not used. This points to the need for a faster parallel implementation of shuffling.
DimeCost used Wilson’s random-walk algorithm to create a uniform random spanning tree in expected O(n) time. In this paper, we similarly used Wilson’s algorithm to generate a forest of k spanning subtrees and to connect the spanning forest. We thank an anonymous reviewer for pointing us to an algorithm proposed by Aldous (Algorithm 2 from Aldous (1990)) that obtains a uniform random spanning tree in worst-case \(\Theta (n)\) time for a complete graph. This algorithm does not incur the higher variance associated with Wilson’s random walk. It instead uses n calls to a random-number generator, followed by a random shuffle. The need for a random shuffle at the end would continue to be a bottleneck in a parallel implementation.
Finally, from the perspective of its use in practice, we note that DCRST only needs the sizes of the partitions (in order), and returns the list of edges in the computed spanning tree. This means the entire complete graph does not need to be generated up front to run DCRST: only \(n - 1\) rather than \(\approx n^2/2\) edge-weight queries are needed to compute the correlation between distance metrics.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Notes
Note: distance metrics are assumed to be symmetric.
These can be seen in Fig. 2: the “within” edges are drawn, and all “between” edges are not.
Vieta’s formula is used in the simplification of this sum.
The \(L_p\)-norm for \(p \ge 1\) of a vector \(\textbf{x}\) is a commonly used measure of “distance” in machine learning for clustering, and is defined by \(\Vert \textbf{x}\Vert _p = (|x_1|^p + |x_2|^p + \cdots + |x_n|^p)^\frac{1}{p}\). Note that \(L_\infty (\textbf{x}) = \max \{|x_1|, |x_2|, \ldots , |x_n| \}\) is the limit of the \(L_p\) norm as \(p \longrightarrow \infty\).
References
Aldous DJ (1990) The random walk construction of uniform spanning trees and uniform labelled trees. SIAM J Discrete Math. https://doi.org/10.1137/0403039
Anari N, Hu N, Saberi A, Schild A (2021) Sampling arborescences in parallel. In: Innovations in theoretical computer science
Bacher A, Bodini O, Hollender A, Lumbroso J (2018) Mergeshuffle: a very fast, parallel random permutation algorithm, vol 2113
Bourbour S, Mehta DP, Navidi WC (2020) Improved methods to compare distance metrics in networks using uniform random spanning trees (dimecost). Networks. https://doi.org/10.1002/net.21949
Broder A (1989) Generating random spanning trees. https://doi.org/10.1109/sfcs.1989.63516
Cooper-Ellis S, Pielou EC (1994) The interpretation of ecological data: a primer on classification and ordination. The Bryologist. https://doi.org/10.2307/3243925
Corliss JO, Sneath PHA, Sokal RR (1974) Numerical taxonomy: the principles and practice of numerical classification. Trans Am Microsc Soc. https://doi.org/10.2307/3225339
Dray S, Dufour AB (2007) The ade4 package: implementing the duality diagram for ecologists. J Stat Softw 22(4):1–20. https://doi.org/10.18637/jss.v022.i04
Kouri TM, Awale M, Slyby JK, Reymond JL, Mehta DP (2014) “Social” network of isomers based on bond count distance: algorithms. J Chem Inf Model. https://doi.org/10.1021/ci4005173
Lovász L (1993) Random walks on graphs: a survey. Combinatorics 2
Mantel N (1967) The detection of disease clustering and a generalized regression approach. Cancer Res 27:209–220
Ricaut FX, Auriol V, CramonTaubadel NV, Keyser C, Murail P, Ludes B, Crubézy E (2010) Comparison between morphological and genetic data to estimate biological relationship: the case of the Egyin Gol necropolis (Mongolia). Am J Phys Anthropol. https://doi.org/10.1002/ajpa.21322
Schneider JW, Borlund P (2007) Matrix comparison, part 2: measuring the resemblance between proximity measures or ordination results by use of the Mantel and Procrustes statistics. J Am Soc Inf Sci Technol. https://doi.org/10.1002/asi.20642
Smouse PE, Long JC, Sokal RR (1986) Multiple regression and correlation extensions of the mantel test of matrix correspondence. Syst Zool. https://doi.org/10.2307/2413122
Sokal RR, Rohlf FJ (1962) The comparison of dendrograms by objective methods. Taxon. https://doi.org/10.2307/1217208
Wilson DB (1996) Generating random spanning trees more quickly than the cover time. In: Proceedings of the twenty-eighth annual ACM symposium on theory of computing (STOC). https://doi.org/10.1145/237814.237880
Acknowledgements
We would like to thank the anonymous reviewer for suggesting an alternative algorithm to Wilson’s for future consideration.
Funding
This research was selffunded, no sources of funding were sought nor received.
Author information
Authors and Affiliations
Contributions
LH and DM conceived and developed the theory underlying the presented idea. LH contributed all of the computer code and conducted all of the experiments; LH wrote the manuscript with support from DM. LH and DM read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons AttributionNonCommercialNoDerivatives 4.0 International License, which permits any noncommercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/byncnd/4.0/.
About this article
Cite this article
Henke, L., Mehta, D. DCRST: a parallel algorithm for random spanning trees in network analytics. Appl Netw Sci 9, 45 (2024). https://doi.org/10.1007/s41109024006137