Efficient orbit-aware triad and quad census in directed and undirected graphs
Applied Network Science volume 2, Article number: 13 (2017)
Abstract
The prevalence of select substructures is an indicator of network effects in applications such as social network analysis and systems biology. Moreover, subgraph statistics are pervasive in stochastic network models, and they need to be assessed repeatedly in MCMC sampling and estimation algorithms. We present a new approach to count all induced and non-induced four-node subgraphs (the quad census) on a per-node and per-edge basis, complete with a separation into their non-automorphic roles in these subgraphs. It is the first approach to do so in a unified manner, and is based on only a clique-listing subroutine. Computational experiments indicate that, despite its simplicity, the approach outperforms previous, less general approaches.
By way of the more presentable triad census, we additionally show how to extend the quad census to directed graphs. As a byproduct we obtain the asymptotically fastest triad census algorithm to date.
Introduction
The \(\mathcal {F}\)-census of a graph is the frequency distribution of subgraphs from a family \(\mathcal {F}\) of non-isomorphic graphs in an input graph. In this work we focus on four-node subgraphs, i.e. quads.
Discrimination of graphs by a subgraph census was already proposed by Holland and Leinhardt (1970; 1976) in the context of social networks and it is of utmost importance for the effects of exponential random graph models (Robins et al. 2007). While there is extensive work on determining the subgraph census for varying subgraph sizes (Kloks et al. 2000; Kowaluk et al. 2011; Lin et al. 2012) and also for directed graphs (Eppstein et al. 2010), the focus is almost always on the global distribution, i.e. the number of triangles a graph contains, but not on how often a given node is part of a triangle. For many properties characterizing nodes and edges it is however necessary to know the subgraph census on the node or edge level. For example, to calculate a node’s clustering coefficient we need to know in how many triangles it is contained. The same holds for the Jaccard index computed with respect to an edge. Although for these two examples it is not necessary to calculate the frequencies of all non-isomorphic induced 3-node subgraphs, i.e. the triad census, there exist edge weights that take different subgraph configurations into account (Auber et al. 2003) and the running time for most edge metrics (Melançon and Sallaberry 2008) is dominated by calculating the frequencies of particular subgraphs. The k-subgraph census on an edge level finds application in the context of extracting sparse graph representations that amplify group cohesion (Auber et al. 2003; Nick et al. 2013; Nocaj et al. 2015). While the approach by Nick et al. relies on the triad census, Nocaj et al. (2015) show that using a weighted quad census instead results in a superior sparsifier, as quads are more encompassing in reflecting local density. A further scenario where the quad census is of vital importance is the evaluation of graph models with respect to the accuracy with which they resemble observed graphs (Pržulj et al. 2004).
While k-subgraph censuses specific to nodes and edges are not used widely in social network analysis, the situation is different in bioinformatics. So far, however, even here the use is restricted to connected k-node subgraphs, so-called graphlets (Pržulj et al. 2004) or motifs (Milo et al. 2002).
A further differentiation of subgraph censuses consists in the distinction of node and edge automorphism classes (orbits) in each graphlet. For example, in a diamond (i.e. a complete four-node graph minus one edge), there are two node orbits and two edge orbits as shown in Fig. 1. The node orbits are defined by the nodes with degree 2 and 3, respectively. The edge orbits are determined by the edge connecting the nodes with degree 3 and all remaining edges, respectively. This differentiation by orbits is particularly interesting for distinguishing the roles that nodes and edges fill in a quad. For example, two nodes might have the same number of occurrences in a claw, cf. Fig. 1, which would lead to the assumption that they are similar. However, by distinguishing the orbits we might see that one node is always in orbit 11, and therefore in control of e.g. the information flow, while the other is always in orbit 12. That is the reason why the orbit-aware subgraph census has been used to mine central role structures in graphs (Doran 2014), but restricted to triads. Direct applications of the orbit-aware quad census can for example be found in the context of graph clustering (Milenković and Pržulj 2008; Solava et al. 2012).
Due to the importance of subgraph enumeration and censuses in bioinformatics, various computational methods (Hočevar and Demšar 2014; Marcus and Shavitt 2012; Milenković et al. 2008; Wernicke and Rasche 2006) were proposed.
The general approach to determine a subgraph census on the global level is to solve a system of equations that relates the non-induced subgraph frequency of each non-isomorphic k-node subgraph with the number of occurrences in other k-node subgraphs (Eppstein et al. 2010; Eppstein and Spiro 2009; Kloks et al. 2000; Kowaluk et al. 2011; Lin et al. 2012). It is known that the time needed to solve the system of equations for the four-node subgraph census, which we refer to as the quad census, on a global level is \(\mathcal {O}(a(G)m + i(G))\) (Lin et al. 2012), where i(G) is the time needed to calculate the frequency of a given four-node induced subgraph in G, and a(G) is the arboricity, i.e. the minimum number of forests needed to cover E. Following the idea of relating non-induced and induced subgraph counts, Marcus and Shavitt (2012) present a system of equations for the orbit-aware connected quad census on a node level that runs in \(\mathcal {O}(\Delta (G)m+m^{2})\) time with Δ(G) denoting the maximum degree of G. Because of the larger number of algorithms invoked by Marcus and Shavitt’s approach, Hočevar and Demšar (2014) present a different system of equations, again restricted to connected quads, that requires fewer counting algorithms and runs in \(\mathcal {O}(\Delta (G)^{2}m)\) time, but does not determine the non-induced counts.
Contribution: We present the first algorithm to count both induced and non-induced occurrences of all four-node subgraphs (quads). It is based on a fast algorithm for listing a single quad type and is capable of distinguishing the various roles (orbits) of nodes and edges. While this simplifies and generalizes previous approaches, our experimental evaluation indicates that it is also more efficient. Furthermore, using the example of the orbit-aware directed triad census, we show a strategy to extend the orbit-aware quad census computation to directed graphs and thus obtain, along the way, the asymptotically fastest algorithm for graph-level triad census computation.
In the following section we provide basic notation, followed by an introduction of the system of linear equations and the algorithm utilized in section “Determining the orbit-aware quad census”. In section “Runtime experiments” we present a running time comparison on observed and synthetic graphs showing that our approach is more efficient than related methods. Using the example of the triad census, we present in section “Triad census” a strategy to calculate the orbit-aware quad census of directed graphs without changing its asymptotic running time. We finally conclude in section “Conclusion”.
Preliminaries
We consider finite simple undirected graphs G=(V,E) and denote the number of nodes by n=n(G)=|V| and the number of edges by m=m(G)=|E|. The neighborhood of a node v∈V is the set N(v)={w : {v,w}∈E} of all adjacent nodes, its cardinality d(v)=|N(v)| is called the degree of v, and Δ(G)=max _{v∈V} d(v) denotes the maximum degree of G.
For finite simple directed graphs G=(V,E) we denote the outgoing neighborhood of a node v∈V by N ^{+}(v)={w : (v,w)∈E}. The incoming neighborhood N ^{−}(v) is defined analogously and we call N ^{⇔}(v)=N ^{+}(v)∩N ^{−}(v) the mutual neighborhood. The underlying undirected graph G ^{′}=(V,E ^{′}) of a simple directed graph G=(V,E) has the edge set E ^{′}={{u,v} : (u,v)∈E ∨ (v,u)∈E}.
A complete graph with k nodes is denoted by K _{ k }, and K _{3} is also called a triangle. We use \(T(u)\,=\,{N(u)\choose 2} \cap E\) to refer to the set of node pairs completing a triangle with u and T({u,v})=N(u)∩N(v) for the set of nodes completing a triangle with the edge e={u,v}. For the cardinality of these sets we write t(u)=|T(u)| and t(e)=|T(e)|. A triad and a quad are any graphs on exactly three and four nodes, respectively.
A subgraph G ^{′}=(V ^{′},E ^{′}) of G=(V,E), V ^{′}⊆V, is called (node-)induced, if \(E' = {V' \choose 2} \cap E\), and it is called non-induced, if \(E' \subseteq {V' \choose 2} \cap E\).
Two undirected graphs G and G ^{′} are said to be isomorphic, if and only if there exists a bijection π:V(G)→V(G ^{′}) such that {v,w}∈E(G) ⇔ {π(v),π(w)}∈E(G ^{′}). Each permutation of the node set V, including the identity, that is an isomorphism from G onto itself is called an automorphism. The nodes (or edges) that are mapped onto each other by automorphisms form an automorphism class, or orbit.
Determining the orbit-aware quad census
The k-node subgraph census is usually computed via a system of linear equations relating the non-induced and induced k-subgraph frequencies, as the non-induced frequencies are easier to compute. Lin et al. (2012) show that for k=4 all non-induced frequencies, except for K _{4}, can be computed in \(\mathcal {O}(a(G)m)\) time. This implies that the total running time to calculate the quad census at the level of the entire graph is in \(\mathcal {O}(a(G)m+i(G))\), where i(G) is the time needed to compute the induced frequencies for some induced quad-type.
The approach of Lin et al. however, is not suitable to answer questions as to how often a node or an edge is contained in a K _{4}. Furthermore, the automorphism class of the node/edge in the quad is sometimes of interest. All non-isomorphic graphs with four nodes are shown in Fig. 1 and the node/edge labels refer to their automorphism classes (orbits). For example in a diamond all edges of the C _{4} belong to the same orbit while the diagonal edge belongs to another. Analogously the orbits of the nodes can be distinguished.
As our approach also relies on relating non-induced and induced frequencies we will start by presenting how the non-induced frequencies for a node/edge in a given orbit relate to the induced counts. Thereafter, we will present equations to compute the respective non-induced frequencies and prove that our approach matches the running time of Lin et al., implying that it is asymptotically as fast as the fastest algorithm to compute the frequencies on a node and edge level for any induced quad. Note that in the following when we talk about non-induced frequencies we exclude those of the K _{4}, as it equals the induced frequency.
Relation of induced and non-induced frequencies
To establish the relation between induced and non-induced frequencies, the number of times G ^{′} is non-induced in any other graph G with the same number of nodes has to be known. For instance, let us assume that G ^{′} is a P _{3} and G a K _{3} (co-paw and co-claw without the isolated node, cf. Fig. 1). Having the definition of the edge set for non-induced subgraphs in mind, we see that G contains three non-induced P _{3}, as each edge can be removed from a K _{3} to create a P _{3}. Consequently, if we know the total number of non-induced P _{3} and we subtract three times the number of K _{3} we obtain the number of induced P _{3} of the input graph.
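The P _{3}/K _{3} relation can be checked on a small graph. The following is a minimal sketch, not the authors' implementation; the adjacency-set input format and the function name `count_triads` are assumptions made for illustration, and triangles are counted by brute force rather than with an efficient listing algorithm.

```python
from itertools import combinations

def count_triads(adj):
    """Return (non-induced P3, K3, induced P3) counts of an undirected graph.

    adj maps each node to the set of its neighbors (assumed input format).
    """
    # Every pair of neighbors of a centre node spans one non-induced P3.
    noninduced_p3 = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    # Brute-force triangle count over all node triples (illustration only).
    triangles = sum(
        1
        for u, v, w in combinations(list(adj), 3)
        if v in adj[u] and w in adj[u] and w in adj[v]
    )
    # Each K3 contains three non-induced P3, one per removable edge.
    induced_p3 = noninduced_p3 - 3 * triangles
    return noninduced_p3, triangles, induced_p3
```

For a paw (a triangle 0, 1, 2 with a pendant node 3 attached to node 0), the five non-induced P _{3} reduce to two induced ones once the triangle is accounted for.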
Similarly, we can establish systems of equations relating induced and non-induced frequencies on a node and edge level distinguishing the orbits for quads, see Figs. 3 and 4.^{1} Note that both systems of equations are needed since we cannot compute the node from the edge frequencies and vice versa, but from both we can compute the global distribution. In the following we show the correctness for ei _{10}(e).
Induced orbit 10 edge census. Let us assume we want to know how often edge e is in orbit 10 or in other words part of a C _{4}. We know that a C _{4} is a non-induced subgraph of a diamond, K _{4} and of itself, cf. Fig. 2, and that there is no other quad containing a non-induced C _{4}. Let us first concentrate on the diamond. In a diamond we have two different edge orbits; orbit 11, i.e. the edges on the C _{4}, and orbit 12, i.e. the diagonal edge. Figure 2 shows that for every diamond where e is in orbit 12 there is no way to remove an edge, such that this graph becomes a C _{4}, but for each diamond where e is in orbit 11 we can remove the diagonal edge and end up with a C _{4}. Therefore, the non-induced number of subgraphs where e is in orbit 10 contains once the number of induced subgraphs where e is in orbit 11, but not those in orbit 12. As for the case of the C _{4} in a K _{4} all edges are in the same orbit. From a K _{4} we can construct a C _{4} containing a specific edge in two ways. The first is to remove both diagonal edges, cf. Fig. 2; and the second to delete the two horizontal edges. As a consequence the induced number of e being in orbit 10 is given by ei _{10}(e)=en _{10}(e)−ei _{11}(e)−2ei _{13}(e).
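The identity ei _{10}(e)=en _{10}(e)−ei _{11}(e)−2ei _{13}(e) can be verified by brute force on small graphs. The sketch below is only a sanity check under assumed conventions (adjacency sets as input, the hypothetical function name `edge_c4_census`); it enumerates all quads containing the edge and is far slower than the approach described in this paper.

```python
from itertools import combinations

def edge_c4_census(adj, u, v):
    """Brute-force (en_10, ei_10, ei_11, ei_13) for an existing edge {u, v}.

    adj maps each node to the set of its neighbors (assumed input format).
    """
    # Non-induced C4 through {u, v}: cycles u - v - y - x - u.
    en10 = sum(
        1
        for x in adj[u] - {v}
        for y in adj[v] - {u}
        if x != y and y in adj[x]
    )
    ei10 = ei11 = ei13 = 0
    others = [w for w in adj if w not in (u, v)]
    for x, y in combinations(others, 2):
        quad = (u, v, x, y)
        # Degrees inside the induced subgraph on the four nodes.
        deg = {a: sum(1 for b in quad if b in adj[a]) for a in quad}
        edges = sum(deg.values()) // 2
        if edges == 6:
            ei13 += 1  # induced K4
        elif edges == 5 and not (deg[u] == 3 and deg[v] == 3):
            ei11 += 1  # induced diamond with {u, v} on its 4-cycle
        elif edges == 4 and all(d == 2 for d in deg.values()):
            ei10 += 1  # induced C4
    return en10, ei10, ei11, ei13
```

On any simple graph and any of its edges, the returned values should satisfy `ei10 == en10 - ei11 - 2 * ei13`.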
Following this concept all other equations can be derived.
Calculating non-induced frequencies
The calculation of the non-induced frequencies is (computationally) easier than for the corresponding induced counts, except for K _{4}s. This is due to the fact that the non-induced frequencies can be constructed from smaller, with respect to the number of nodes, subgraphs cf. Figs. 3 and 4. In the following we show the correctness of nn _{14}(u) and en _{4}(u,v).
Non-induced orbit 14 node census. To determine nn _{14}(u) we start by enumerating all triangles containing u. Let v and w form a triangle together with u. As u is in orbit 14 we know that each neighbor of v and w that is not u, v or w definitely creates a non-induced paw with u in orbit 14; while this does not necessarily hold for neighbors of u as they might not be connected to v or w (and, if they are, we already gave credit to this). Note that even if a neighbor of either v or w is a neighbor of u as well there is no additional paw with u in orbit 14 and therefore \(nn_{14}(u) = \sum _{\{v,w\}\in T(u)} (d(v)+d(w)-4)\).
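The formula for nn _{14}(u) translates directly into a few lines of code. This is a minimal sketch assuming adjacency sets as input; the function name `nn14` is chosen here for illustration, and T(u) is obtained by testing all neighbor pairs instead of using an efficient triangle listing.

```python
from itertools import combinations

def nn14(adj, u):
    """Non-induced paw count with u in orbit 14, following the formula above.

    adj maps each node to the set of its neighbors (assumed input format).
    """
    total = 0
    for v, w in combinations(list(adj[u]), 2):
        if w in adj[v]:  # {v, w} is in T(u), i.e. u, v, w form a triangle
            # Each neighbor of v or w outside the triangle adds one paw.
            total += len(adj[v]) + len(adj[w]) - 4
    return total
```

In a K _{4} every node lies in three triangles, and each of them contributes 3+3−4=2, which gives six non-induced paws per node.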
Non-induced orbit 4 edge census. Edge e={u,v} is non-induced in orbit 4 for each P _{3} starting at u or v which neither contains e nor closes a K _{3} with e. The number of P _{3}s starting at u not containing e equals \(\sum _{w \in N(u) \setminus \{v\}}(d(w) - 1)\). However, the node v might be a neighbor of w and therefore there is a path of length two (via w) connecting u and v. Since this creates a three-node subgraph, more precisely a triangle, and not a quad we have to adjust for this by subtracting twice the number of triangles containing e. Consequently, \(en_{4}(u,v) = \sum _{w \in N(u)} d(w) + \sum _{w \in N(v)} d(w) - 2(d(u)+d(v)) + 2 - 2t(u,v)\).
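The closed form for en _{4}(u,v) can be evaluated from degrees and the common neighborhood alone. The following sketch assumes adjacency sets as input; the function name `en4` is an illustrative choice, not taken from the authors' code.

```python
def en4(adj, u, v):
    """Non-induced orbit-4 count of the edge {u, v}, following the formula above.

    adj maps each node to the set of its neighbors (assumed input format).
    """
    t_uv = len(adj[u] & adj[v])  # t({u, v}): triangles containing the edge
    deg_sum_u = sum(len(adj[w]) for w in adj[u])
    deg_sum_v = sum(len(adj[w]) for w in adj[v])
    return deg_sum_u + deg_sum_v - 2 * (len(adj[u]) + len(adj[v])) + 2 - 2 * t_uv
```

On the path 0, 1, 2, 3 the two end edges each obtain the value 1, while the middle edge obtains 0, as it is never an end edge of a path on four nodes.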
In the following, we focus on the algorithm calculating all required frequencies to solve the systems of equations.
Listing complete quads
In order to be able to solve the systems of equations we need to compute the non-induced quad counts as well as any of the induced frequencies. This requires an algorithm that is capable of solving the following tasks on a node and edge level:

1. Counting and listing all K _{3}

2. Calculating non-induced C _{4} frequencies

3. Determining induced counts of any quad

We chose to calculate the induced counts for K _{4} to fulfill requirement 3. The reasons are a) to our knowledge there are no algorithms calculating induced counts on a node and edge level for any other quad more efficiently than the algorithm we are presenting here; b) a K _{4} has the property that all nodes and edges lie in the same orbit; c) all non-induced C _{4} can be counted during the execution of our algorithm. Since listing, also known as enumerating, all K _{4} has to solve the subproblem of listing all K _{3}, we will start explaining our algorithm by presenting how K _{3}s can be listed efficiently. Note that this algorithm satisfies requirement 1.
Listing all triangles in a graph is a well-studied topic (Ortmann and Brandes 2014). We show in our previous work (Ortmann and Brandes 2014) that one of the oldest triangle listing algorithms, namely K3 by Chiba and Nishizeki (1985), is in practice the fastest. This algorithm is based on neighborhood intersection computations. To achieve the running time of \(\mathcal {O}(a(G)m)\), Chiba and Nishizeki process the graph in such a way that for each intersection only the neighborhood of the smaller degree node has to be scanned. This is done by processing the nodes sequentially in decreasing order of their degree. The currently processed node marks all its neighbors and is removed from the graph. Then, for each marked node, the number of its marked neighbors is calculated.
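The marking scheme described above can be sketched as follows. This is an illustrative reading of the description, not the authors' or Chiba and Nishizeki's code: the adjacency-set input and the name `list_triangles` are assumptions, and a comparison sort replaces the linear-time bucket sort one would use to obtain the stated bound.

```python
def list_triangles(adj):
    """Return every triangle of an undirected graph exactly once.

    adj maps each node to the set of its neighbors (assumed input format).
    """
    g = {v: set(nb) for v, nb in adj.items()}  # working copy; nodes get removed
    order = sorted(g, key=lambda v: len(g[v]), reverse=True)
    triangles = []
    for u in order:
        marked = set(g[u])  # u marks all of its remaining neighbors
        for v in g[u]:
            for w in g[v]:
                if w in marked:  # edge {v, w} with both endpoints marked
                    triangles.append((u, v, w))
            marked.discard(v)  # unmark v so that {v, w} is reported once
        for v in g[u]:  # remove u from the graph
            g[v].discard(u)
        g[u] = set()
    return triangles
```

Because u is removed after it has been processed, each triangle is reported only when its first node in the processing order is handled.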
Let us think of this algorithm differently. When we process node u and remove it from the graph then every triangle that contains u is an edge where both endpoints are marked, cf. Fig. 5. This perception of the algorithm directly points us to a solution for the second and third requirement. As shown in Fig. 5, when node u is removed from the graph, every K _{4} that contains u becomes a K _{3} where all nodes are marked, implying that K3 can be easily adapted to list all K _{4}s. Chiba and Nishizeki call this extension COMPLETE. Furthermore, only nodes that are connected to a neighbor of u can create a non-induced C _{4} and each C _{4} contains at least two marked nodes. Since all these nodes are processed already during the execution of algorithm K3, counting non-induced C _{4} on a node and edge level can also be done in \(\mathcal {O}(a(G)m)\) time. The corresponding algorithm is called C4 in (Chiba and Nishizeki 1985) and the combination of these different algorithms is presented in Algorithm 1. It runs in \(\mathcal {O}(a(G)^{2}m)\) (Chiba and Nishizeki 1985), and its novelty is that it follows the idea of orienting the graph acyclically, as we already proposed in the context of triangle listing (Ortmann and Brandes 2014). Furthermore, this acyclic orientation allows omitting node removals, and given the proper node ordering, it has the property that the maximum out-degree is bounded by \(\mathcal {O}(a(G))\). Therefore, unlike for algorithm COMPLETE and C4 (Chiba and Nishizeki 1985), no amortized running time analysis is needed to prove that the running time is in \(\mathcal {O}(a(G)^{2}m)\) and \(\mathcal {O}(a(G)m)\), respectively, as we will show next.
Runtime. We will first show that the running time bound of our variant implementation of algorithm C4 is in \(\mathcal {O}(a(G)m)\), therefore we ignore Lines 4, 6, 8, 12–19 and 27 of Algorithm 1 for now.
The running time of the remaining algorithm is given by the following equation:
As we order the nodes by successively removing the node of minimum degree from the graph, which can be computed in \(\mathcal {O}(m)\) using a slightly modified version of the algorithm presented in (Batagelj and Zaveršnik 2003), it holds that Δ ^{+}(G)<2a(G) (Zhou and Nishizeki 1994). The time required to initialize all marks is in \(\mathcal {O}(n)\), orienting the graph is in \(\mathcal {O}(n+m)\), and consequently the total running time is in \(\mathcal {O}(a(G)m)\).
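The node ordering and the resulting out-degree bound can be illustrated with a short sketch. It is not the linear-time bucket-based procedure of Batagelj and Zaveršnik; for brevity it uses a binary heap with lazy deletion, which costs an additional logarithmic factor. The function names and the adjacency-set input are assumptions made for this example.

```python
import heapq

def min_degree_order(adj):
    """Order nodes by repeatedly removing a node of minimum remaining degree."""
    deg = {v: len(nb) for v, nb in adj.items()}
    ids = {v: i for i, v in enumerate(adj)}  # tie-breaker, avoids comparing nodes
    heap = [(d, ids[v], v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, order = set(), []
    while heap:
        d, _, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry
        removed.add(v)
        order.append(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
                heapq.heappush(heap, (deg[w], ids[w], w))
    return order

def max_out_degree(adj, order):
    """Maximum out-degree when every edge points to the later node in order."""
    pos = {v: i for i, v in enumerate(order)}
    return max(
        (sum(1 for w in adj[v] if pos[w] > pos[v]) for v in adj), default=0
    )
```

For a star, whose arboricity is 1, the orientation induced by this order has maximum out-degree 1, in line with the bound Δ ^{+}(G)<2a(G).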
Let us now focus on the time required for listing all K _{4}s and therefore ignore Lines 9–11 and 20–27 of Algorithm 1. The running time of the remaining algorithm is given by the following equation:
By the same arguments it follows that our variant implementation of COMPLETE runs in \(\mathcal {O}(a(G)^{2}m)\). Since Line 4 is in \(\mathcal {O}(a(G)m)\) (Ortmann and Brandes 2014) and solving the systems of equations requires \(\mathcal {O}(n+m)\) time, the overall complexity of Algorithm 1 is in \(\mathcal {O}(a(G)^{2}m)\).
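The role of the acyclic orientation in K _{4} listing can be sketched as follows. This is a simplified illustration and not Algorithm 1: it only lists K _{4}s, uses set intersections instead of marks, and takes the node order as a parameter. The name `list_k4` and the input format are assumptions.

```python
def list_k4(adj, order):
    """List every K4 exactly once using an acyclic orientation.

    adj maps each node to the set of its neighbors; order is a list of all
    nodes, and each edge is oriented towards its later endpoint in order.
    """
    pos = {v: i for i, v in enumerate(order)}
    out = {v: {w for w in adj[v] if pos[w] > pos[v]} for v in adj}
    quads = []
    for u in order:
        for v in out[u]:
            tri = out[u] & out[v]  # nodes closing a triangle with u and v
            for w in tri:
                for x in tri & out[w]:  # x is adjacent to u, v and w
                    quads.append((u, v, w, x))
    return quads
```

Each K _{4} is reported once, with its nodes in increasing position. If the order is obtained by repeatedly removing a minimum-degree node, all out-neighborhoods have size \(\mathcal {O}(a(G))\), which is where the \(\mathcal {O}(a(G)^{2}m)\) bound comes from.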
Before we give experimental evidence that our algorithm is not just asymptotically, but also in practice, superior to the currently fastest orbit-aware quad census algorithm, we want to give a more detailed explanation as to why algorithm C4 runs in \(\mathcal {O}(a(G)m)\) instead of \(\mathcal {O}(a(G)^{2}m)\), although every K _{4} contains three non-induced C _{4}. The reason lies in the fact that COMPLETE belongs to the class of listing algorithms, while C4 is a counting algorithm. Since a listing algorithm has to enumerate every single occurrence of the subgraph of interest, its running time cannot be asymptotically smaller than the number of subgraphs it has to list. For example, no algorithm for listing all triangles in a graph can be asymptotically faster than Θ(n ^{3}) in the worst case, since the complete graph contains \({n \choose 3}\) triangles. However, as counting does not require enumerating every single triangle, there exist algorithms with a lower worst-case complexity, e.g. via matrix multiplication (Coppersmith and Winograd 1990). This difference, and the fact that in the non-induced scenario we can ignore the existence of some edges, explain the asymptotic differences between the two algorithms.
Runtime experiments
We provide experimental evidence that our approach is not only asymptotically faster but also more efficient in practice than the currently fastest orbit-aware quad census algorithm. Comparison is restricted to the orca software (v1.0) implementing the approach of Hočevar and Demšar (2014), as the authors show that it is superior to other software tools in the context of quad census computation. Additionally, it is the only software we are aware of which can compute the orbit-aware quad census on an edge level, even if only for connected quads. To the best of our knowledge, except in the orca code, there is no other documentation of their approach.
Setup and data
We implemented our approach in C++ using the Standard Template Library and compiled the code with the g++ compiler version 4.9.1 set to the highest optimization level. The orca software is freely available as an R package. To avoid measuring errors due to the R and C++ interface communication we extracted the C++ code and cleaned it from all R dependencies.
The tests were carried out on a single 64-bit machine with a 3.60 GHz quad-core Intel Core i7-4790 CPU, 32 GB RAM, running Ubuntu 14.10. The times were measured via the gettimeofday command with a resolution up to 10^{−6} seconds. We ran the executable in a single thread and forced it to one single core, which was dedicated only to this process. Times were averaged over 5 repetitions.
Data. We compared both approaches on a number of real-world networks. The Facebook100 dataset (Traud et al. 2011) comprises 100 Facebook friendship networks of higher educational institutes in the US with network sizes of 762≤n<41K nodes and 16K<m<1.6M edges. Although these networks are rather sparse, they feature a small diameter, thereby implying a high concentration of connected quads. Apart from this we tested the algorithms on a variety of networks from the Stanford Large Network Dataset Collection (Leskovec and Krevl 2014). The downloaded data were taken from different areas to have realistic examples that encompass diverse network structures.
Additionally, we generated synthetic networks from two different models. One class of generated graphs are small worlds, which were created by arranging nodes on a ring, connecting each one with its r nearest neighbors, and then switching each dyad with probability p. The other class of graphs was drawn from a preferential-attachment-like model. Here we added n nodes over time to the initially empty network and each new node v connects to r existing nodes, each of which is chosen either by preferential attachment or, with probability p, randomly from \(\bigcup _{u \in N(v)}N(u)\). We generated graphs with fixed n=20000 and varying average degree as well as graphs with n∈{50000,140000,…,500000} and gradually increasing average degree. Four graphs were generated for each parameter combination.
We refer the reader to (Ortmann and Brandes 2014) for a more detailed description of the utilized graph models, the tested Stanford graphs, the chosen average degree, and parameters r and p.
Results
In Fig. 6 we present the results of our experiments. In the top subfigure we plotted the average running time of orca against the average time needed by our approach for all but the largest Stanford graphs. Each point that lies below the main diagonal indicates that our approach is faster. Consequently, the picture makes it clear that our algorithm is faster than the orca software for each tested network, even though we compute the whole node and edge orbit-aware quad census. The same findings are obtained for the larger graphs taken from SNAP.
The speedup we achieve lies between 1.6 and 10 for the tested graphs. In general, however, the speedup should be in Θ(logΔ(G)) for larger graphs. The reason is that, once n exceeds 30K, the algorithm implemented in the orca software runs in \(\mathcal {O}(\Delta (G)^{2}m\log \Delta (G))\), instead of \(\mathcal {O}(\Delta (G)^{2}m)\). The logarithmic factor originates from the time required for adjacency testing. While the orca software uses an adjacency matrix for these queries for graphs with n≤30K, each query takes \(\mathcal {O}(\log \Delta (G))\) time for larger graphs (binary search), since no adjacency matrix is constructed. In contrast Algorithm 1 requires only \(\mathcal {O}(n)\) additional space to perform adjacency tests in constant time. Note that orca’s algorithm using the adjacency matrix appears to follow the ideas of Chiba and Nishizeki, yet without exploiting the potential of utilizing a proper node ordering. Besides the faster K _{4} algorithm, another important aspect explaining the at least constant speedup of our approach is our system of equations. For both the node and edge orbit-aware quad census Hočevar and Demšar do not calculate the exact non-induced counts. This requires that each induced subgraph with 3 nodes is listed several times and, more importantly, also non-cliques, which is not the case in our approach.
Triad census
So far we have shown a general framework building on relating non-induced and induced frequencies to compute the orbit-aware k-subgraph census on a node and edge level basis using the example of quads. While this approach was restricted to simple undirected graphs, we show in the following how it can be extended to directed graphs. However, since the number of non-isomorphic directed quads is already 218 (Davis 1953), we will introduce this framework in the context of the (directed) triad census. As the required modifications for node and edge orbit-awareness are the same we will restrict our explanations to the node orbit-aware triad census computation. Note that since solving the (directed) quad census relies on non-induced frequencies of smaller subgraphs, i.e. subgraphs of size less than four, the distinction of directed triads is required in order to solve the quad census for directed graphs and therefore some of the following equations are necessary for its computation.
The triad census of a graph denotes the frequency distribution of all non-isomorphic directed triads, cf. Fig. 7, in an input graph and finds application, among others, in the social sciences, e.g. to compare different graphs (Faust 2007; Wasserman and Faust 1994) or to extract distinct roles in networks (Doran 2014). Probably the first algorithm to compute the triad census on a graph level is attributed to Moody (1998) with a running time of \(\mathcal {O}\left (n^{2.376}\right)\) (Coppersmith and Winograd 1990). While this approach relies on matrix multiplication, Batagelj and Mrvar (2001) propose a combinatorial algorithm calculating the triad census in \(\mathcal {O}(\Delta (G)m)\) time. Like Batagelj and Mrvar’s approach, the technique proposed by Eppstein et al. (2010) requires enumerating all connected triads. However, using a proper data structure allows them to further reduce the asymptotic complexity to an amortized \(\mathcal {O}(h(G)m)\) running time where h(G) is the largest integer such that there exist h nodes of degree at least h (Hirsch 2005). Yet the algorithm of Eppstein et al. is still not optimal, as we will show in the following, since, as is the case for the quad census, it is sufficient to list only all complete triads, which is asymptotically faster.
Following the framework presented in the context of the undirected quad census we can relate orbit-aware non-induced and induced triad census frequencies via a system of linear equations as presented in Fig. 8. Since deriving this system of linear equations follows exactly the same strategy we presented earlier we omit the correctness proofs here. Although the system of linear equations in Fig. 8 requires the computation of several induced frequencies, compared to only one in the undirected case, cf. Figs. 3 and 4, we can make the following observation. All the induced orbit frequencies, i.e. 21 to 35, are triangles in the underlying undirected graph. Since each triangle in the underlying undirected representation G ^{′} of G corresponds to a directed triangle in G, and vice versa, we can list all triangles, T(G ^{′}), in G ^{′} and then calculate the orbits of the nodes in each triangle t∈T(G ^{′}) w.r.t. G. Since orbits 0 to 20 can be computed in \(\mathcal {O}(m)\) time, which matches the running time to construct G ^{′}, and since m(G ^{′})≤m(G), the total running time of the orbit-aware triad census on a node level is in \(\mathcal {O}(a(G)m + \sum _{t \in T(G')}o(t))\). The \(\mathcal {O}(a(G)m)\) term is the running time of Chiba and Nishizeki’s algorithm K3 to list all triangles in a graph (Chiba and Nishizeki 1985; Ortmann and Brandes 2014), and o(t) denotes the complexity to compute the orbit of each node in a triangle. In the following we will show that \(o(t) \in \mathcal {O}(1)\) and therefore the time complexity for the computation of the orbit-aware triad census on a node level is in \(\mathcal {O}(a(G)m)\). As the (orbit-aware) triad census on a graph level can be computed from the node level, and since a(G)≤h(G) (Lin et al. 2012), this implies that our approach is not just easier to implement than the currently best algorithm for the triad census computation (Eppstein et al. 2010) on a graph level, but also asymptotically faster.
The idea of working on G ^{′} rather than on G for the computation of the triad census has already been used by Batagelj and Mrvar (2001). In order to relate the undirected triad u,v,w∈V in G ^{′} with its directed version in G they propose to map a triad to a number computed by the following formula
with l(i,j)=1 if (i,j)∈E(G) and 0 otherwise. Since this mapping is unique for each possible triad each number encodes exactly one of the 16 non-isomorphic triads, cf. Fig. 7. Furthermore, as we know for each possible triad the orbits of the nodes, we can extend this mapping to also encode the node orbits, cf. Table 1. Note that Table 1 contains all codes, yet our approach requires only those entries encoding orbits larger than 20. Since Table 1 allows us in constant time to map the code of u,v,w to their orbits, it remains to show that the computation of the encoding can also be done in constant time and therefore \(o(t) \in \mathcal {O}(1)\).
With minor modifications it is possible to enable algorithm K3 to list, besides all nodes, also all edges belonging to a triangle in G ^{′}, while not changing the algorithm's asymptotic running time. If, during the transformation from G to G ^{′}, we further attach to each edge the information how it is directed in G, we can access l(i,j) in constant time. Consequently, we can compute code (u,v,w) and therefore o(t) in \(\mathcal {O}(1)\). Note that if N ^{⇔} and d ^{⇔}(u) are not part of the input they can also be computed during the construction of G ^{′}. Even though the described algorithm can be directly derived from Algorithm 1, for convenience we present in Algorithm 2 the orbit-aware triad census on a node level algorithm. Since the additional work that has to be done compared to plain triangle listing is in \(\mathcal {O}(m)\), we refer the reader to the evaluation of triangle listing algorithms in (Ortmann and Brandes 2014) to get an impression of the practical running times. Note that the presented strategy can also be used for orbit-awareness on an edge level without changing the asymptotic running time and that it can be directly applied to derive the orbit-aware directed quad census.
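To illustrate how such a code can be computed in constant time, the sketch below packs the six indicator values l(i,j) of a triad into one integer between 0 and 63. The bit order used here is an assumption made for this example and need not coincide with the formula of Batagelj and Mrvar or with the codes in Table 1; the arc-set input and the name `triad_code` are likewise hypothetical.

```python
def triad_code(arcs, u, v, w):
    """Pack the six possible arcs among u, v, w into a 6-bit integer.

    arcs is a set of (source, target) pairs (assumed input format).
    The bit order below is an assumption for illustration only.
    """
    def l(i, j):
        return 1 if (i, j) in arcs else 0

    bits = (l(u, v), l(v, u), l(u, w), l(w, u), l(v, w), l(w, v))
    code = 0
    for b in bits:
        code = 2 * code + b  # shift left and append the next indicator
    return code
```

With hashed arc lookups each indicator takes expected constant time, so the code and a subsequent table lookup of the node orbits take constant time per triangle.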
Conclusion
We presented two systems of equations that enable us to efficiently determine the orbit-aware quad census of a graph down to the level of nodes and edges by applying an efficient single-subgraph listing algorithm and its subroutine. We showed how induced and non-induced frequencies relate to one another and that the non-induced frequencies can be computed in \(\mathcal {O}(a(G)m)\) time. This matches the best known running time bound for the more restricted non-induced quad census on the graph level, i.e. oblivious to the specific nodes and edges involved in each quad. With Algorithm 1 we presented a routine that computes all non-induced frequencies and lists all \(K_{4}\) in \(\mathcal {O}(a(G)^{2}m)\) time, which is the asymptotically best known running time bound for listing any induced quad. This implies that the total running time of our approach matches the best known running time for quad census computation on the graph level in sparse graphs (Lin et al. 2012). Our experiments indicate that the simplicity of our system of equations, in combination with this efficient algorithm, outperforms the currently best software for calculating the quad census.
Furthermore, using the example of the orbit-aware directed triad census on the node level, we outlined a strategy to extend the orbit-aware quad census on both the node and edge level to directed graphs. As a by-product, we presented with Algorithm 2 the asymptotically fastest algorithm for the triad census computation on the graph level. We note that both algorithms can be parallelized with little effort.
Endnote
^{1} Note that the preliminary version contained several typing errors.
References
Auber, D, Chiricota Y, Jourdan F, Melançon G (2003) Multiscale visualization of small world networks In: 9th IEEE Symposium on Information Visualization (InfoVis 2003), 20-21 October 2003, Seattle, WA, USA. doi:10.1109/INFVIS.2003.1249011.
Batagelj, V, Mrvar A (2001) A subquadratic triad census algorithm for large sparse networks with small maximum degree. Social Netw. 23(3): 237–243. doi:10.1016/S0378-8733(01)00035-1.
Batagelj, V, Zaveršnik M (2003) An O(m) algorithm for cores decomposition of networks. CoRR cs.DS/0310049: 1–10.
Chiba, N, Nishizeki T (1985) Arboricity and subgraph listing algorithms. SIAM J. Comput. 14(1): 210–223. doi:10.1137/0214017.
Coppersmith, D, Winograd S (1990) Matrix multiplication via arithmetic progressions. J. Symb. Comput. 9(3): 251–280. doi:10.1016/S0747-7171(08)80013-2.
Davis, RL (1953) The number of structures of finite relations. Proc. Amer. Math. Soc. 4(3): 486–495.
Doran, D (2014) Triad-based role discovery for large social systems In: Social Informatics - SocInfo 2014 International Workshops, Barcelona, Spain, November 11, 2014, Revised Selected Papers, 130–143. doi:10.1007/978-3-319-15168-7_18.
Eppstein, D, Spiro ES (2009) The h-index of a graph and its application to dynamic subgraph statistics In: Algorithms and Data Structures, 11th International Symposium, WADS 2009, Banff, Canada, August 21-23, 2009. Proceedings, 278–289. doi:10.1007/978-3-642-03367-4_25.
Eppstein, D, Goodrich MT, Strash D, Trott L (2010) Extended dynamic subgraph statistics using h-index parameterized data structures In: Combinatorial Optimization and Applications - 4th International Conference, COCOA 2010, Kailua-Kona, HI, USA, December 18-20, 2010, Proceedings, Part I, 128–141. doi:10.1007/978-3-642-17458-2_12.
Faust, K (2007) Very local structure in social networks. Sociological Methodology 37(1): 209–256. doi:10.1111/j.1467-9531.2007.00179.x.
Hirsch, JE (2005) An index to quantify an individual’s scientific research output. Proc. of the National Academy of Sciences of the United States of America 102(46): 16569–16572.
Hočevar, T, Demšar J (2014) A combinatorial approach to graphlet counting. Bioinformatics 30(4): 559–565. doi:10.1093/bioinformatics/btt717.
Holland, PW, Leinhardt S (1970) A method for detecting structure in sociometric data. Am. J. Soc. 76(3): 492–513.
Holland, PW, Leinhardt S (1976) Local structure in social networks. Soc. Method. 7: 1–45.
Kloks, T, Kratsch D, Müller H (2000) Finding and counting small induced subgraphs efficiently. Inf. Process. Lett. 74(3-4): 115–121. doi:10.1016/S0020-0190(00)00047-8.
Kowaluk, M, Lingas A, Lundell E (2011) Counting and detecting small subgraphs via equations and matrix multiplication In: Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23-25, 2011, 1468–1476. doi:10.1137/1.9781611973082.114.
Leskovec, J, Krevl A (2014) SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.
Lin, MC, Soulignac FJ, Szwarcfiter JL (2012) Arboricity, h-index, and dynamic algorithms. Theor. Comput. Sci. 426: 75–90. doi:10.1016/j.tcs.2011.12.006.
Marcus, D, Shavitt Y (2012) RAGE - A rapid graphlet enumerator for large networks. Computer Networks 56(2): 810–819. doi:10.1016/j.comnet.2011.08.019.
Melançon, G, Sallaberry A (2008) Edge metrics for visual graph analytics: A comparative study In: 12th International Conference on Information Visualisation, IV 2008, 8-11 July 2008, London, UK, 610–615. doi:10.1109/IV.2008.10.
Milenković, T, Pržulj N (2008) Uncovering biological network function via graphlet degree signatures. Cancer informatics 6: 257.
Milenković, T, Lai J, Pržulj N (2008) GraphCrunch: A tool for large network analyses. BMC Bioinformatics 9: 70. doi:10.1186/1471-2105-9-70.
Milo, R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U (2002) Network motifs: Simple building blocks of complex networks. Science 298(5594): 824–827. doi:10.1126/science.298.5594.824.
Moody, J (1998) Matrix methods for calculating the triad census. Social Networks 20(4): 291–299. doi:10.1016/S0378-8733(98)00006-9.
Nick, B, Lee C, Cunningham P, Brandes U (2013) Simmelian backbones: amplifying hidden homophily in facebook networks In: Advances in Social Networks Analysis and Mining 2013, ASONAM ’13, Niagara, ON, Canada, August 25-29, 2013, 525–532. doi:10.1145/2492517.2492569.
Nocaj, A, Ortmann M, Brandes U (2015) Untangling the hairballs of multi-centered, small-world online social media networks. J. Graph Algorithms Appl. 19(2): 595–618. doi:10.7155/jgaa.00370.
Ortmann, M, Brandes U (2014) Triangle listing algorithms: Back from the diversion In: 2014 Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments, ALENEX 2014, Portland, Oregon, USA, January 5, 2014, 1–8. doi:10.1137/1.9781611973198.1.
Ortmann, M, Brandes U (2016) Quad census computation: Simple, efficient, and orbit-aware In: Advances in Network Science - 12th International Conference and School, NetSci-X 2016, Wroclaw, Poland, January 11-13, 2016, Proceedings, 1–13. doi:10.1007/978-3-319-28361-6_1.
Pržulj, N, Corneil DG, Jurisica I (2004) Modeling interactome: scale-free or geometric? Bioinformatics 20(18): 3508–3515. doi:10.1093/bioinformatics/bth436.
Robins, G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Social Networks 29(2): 173–191. doi:10.1016/j.socnet.2006.08.002.
Solava, RW, Michaels RP, Milenković T (2012) Graphlet-based edge clustering reveals pathogen-interacting proteins. Bioinformatics 28(18): 480–486. doi:10.1093/bioinformatics/bts376.
Traud, AL, Kelsic ED, Mucha PJ, Porter MA (2011) Comparing community structure to characteristics in online collegiate social networks. SIAM Review 53(3): 526–543. doi:10.1137/080734315.
Wasserman, S, Faust K (1994) Social Network Analysis: Methods and Applications vol. 8. Cambridge University Press, New York, NY. Chap. 14.
Wernicke, S, Rasche F (2006) FANMOD: a tool for fast network motif detection. Bioinformatics 22(9): 1152–1153. doi:10.1093/bioinformatics/btl038.
Zhou, X, Nishizeki T (1994) Edge-coloring and f-coloring for various classes of graphs In: Algorithms and Computation, 5th International Symposium, ISAAC ’94, Beijing, P. R. China, August 25-27, 1994, Proceedings. Lecture Notes in Computer Science, vol. 834, 199–207. doi:10.1007/3-540-58325-4_182.
Acknowledgements
We gratefully acknowledge financial support from Deutsche Forschungsgemeinschaft under grant Br 2158/111.
Authors’ contributions
UB designed the research, MO developed the solution and performed the experiments, and MO and UB wrote the paper. Both authors read and approved the final manuscript.
Competing interests
Both authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional information
A preliminary version of this work appeared in the proceedings of NetSci-X 2016 (Ortmann and Brandes 2016).
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 Graphlets
 Motifs
 Subgraph census
 Graph statistics