The k-node subgraph census is usually computed via a system of linear equations relating the non-induced and induced k-subgraph frequencies, as the non-induced frequencies are easier to compute. Lin et al. (2012) show that for k=4 all non-induced frequencies, except for \(K_4\), can be computed in \(\mathcal {O}(a(G)m)\) time. This implies that the total running time to calculate the quad census at the level of the entire graph is in \(\mathcal {O}(a(G)m+i(G))\), where i(G) is the time needed to compute the induced frequencies for some induced quad-type.
The approach of Lin et al., however, is not suitable for answering how often a node or an edge is contained in a \(K_4\). Furthermore, the automorphism class of the node/edge in the quad is sometimes of interest. All non-isomorphic graphs with four nodes are shown in Fig. 1, and the node/edge labels refer to their automorphism classes (orbits). For example, in a diamond all edges of the \(C_4\) belong to the same orbit, while the diagonal edge belongs to another. Analogously, the orbits of the nodes can be distinguished.
As our approach also relies on relating non-induced and induced frequencies, we start by presenting how the non-induced frequencies for a node/edge in a given orbit relate to the induced counts. Thereafter, we present equations to compute the respective non-induced frequencies and prove that our approach matches the running time of Lin et al., implying that it is asymptotically as fast as the fastest algorithm to compute the frequencies on a node and edge level for any induced quad. Note that in the following, when we talk about non-induced frequencies, we exclude that of the \(K_4\), as it equals the induced frequency.
Relation of induced and non-induced frequencies
To establish the relation between induced and non-induced frequencies, we have to know how often a graph \(G'\) occurs as a non-induced subgraph in any other graph G with the same number of nodes. For instance, let us assume that \(G'\) is a \(P_3\) and G a \(K_3\) (co-paw and co-claw without the isolated node, cf. Fig. 1). Having the definition of the edge set for non-induced subgraphs in mind, we see that G contains three non-induced \(P_3\), as each of the three edges can be removed from a \(K_3\) to create a \(P_3\). Consequently, if we know the total number of non-induced \(P_3\) and subtract three times the number of \(K_3\), we obtain the number of induced \(P_3\) of the input graph.
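The three-node case can be checked numerically. The following Python sketch is an illustration only, not part of the original method: the function name `p3_counts` and the toy graph are assumptions, and the triangle count is done by brute force. It counts non-induced \(P_3\) as pairs of neighbors around a center node and then subtracts three per triangle.

```python
from itertools import combinations

def p3_counts(adj):
    """Return (non_induced_p3, triangles, induced_p3) for an undirected graph.

    adj maps each node to the set of its neighbours (hypothetical input format).
    """
    # A non-induced P3 is a centre node together with any two of its neighbours.
    non_induced = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    # Brute-force triangle count; sufficient for a toy example.
    triangles = sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    # Every K3 contains three non-induced P3, none of which is induced.
    induced = non_induced - 3 * triangles
    return non_induced, triangles, induced

# Toy graph (assumption): triangle 0-1-2 with a pendant node 3 attached to 2.
paw = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(p3_counts(paw))  # (5, 1, 2)
```

The two induced \(P_3\) in this toy graph are 0-2-3 and 1-2-3; the three remaining non-induced ones lie inside the triangle.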
Similarly, we can establish systems of equations relating induced and non-induced frequencies on a node and edge level, distinguishing the orbits for quads, see Figs. 3 and 4. Note that both systems of equations are needed since we cannot compute the node frequencies from the edge frequencies and vice versa, but from both we can compute the global distribution. In the following we show the correctness for \(ei_{10}(e)\).
Induced orbit 10 edge census. Let us assume we want to know how often edge e is in orbit 10 or, in other words, part of a \(C_4\). We know that a \(C_4\) is a non-induced subgraph of a diamond, of a \(K_4\) and of itself, cf. Fig. 2, and that there is no other quad containing a non-induced \(C_4\). Let us first concentrate on the diamond. In a diamond we have two different edge orbits: orbit 11, i.e. the edges on the \(C_4\), and orbit 12, i.e. the diagonal edge. Figure 2 shows that for every diamond where e is in orbit 12 there is no way to remove an edge such that this graph becomes a \(C_4\) containing e, but for each diamond where e is in orbit 11 we can remove the diagonal edge and end up with a \(C_4\). Therefore, the non-induced number of subgraphs where e is in orbit 10 contains once the number of induced subgraphs where e is in orbit 11, but not those in orbit 12. In a \(K_4\), all edges are in the same orbit. From a \(K_4\) we can construct a \(C_4\) containing a specific edge in two ways. The first is to remove both diagonal edges, cf. Fig. 2; the second is to delete the two horizontal edges. As a consequence, the induced number of e being in orbit 10 is given by \(ei_{10}(e) = en_{10}(e) - ei_{11}(e) - 2\,ei_{13}(e)\).
Following this concept all other equations can be derived.
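The orbit 10 equation can be verified by brute force on small graphs. The sketch below is illustrative only: the function name `edge_orbit_counts` and the adjacency-set input are assumptions, and it enumerates all node pairs completing e to a quad, which is far slower than the approach described in this paper. It classifies each induced quad by its number of edges and its degrees and counts the two 4-cycles through e separately.

```python
from itertools import combinations

def edge_orbit_counts(adj, u, v):
    """Brute-force (en10, ei10, ei11, ei13) for the edge {u, v}.

    adj maps each node to the set of its neighbours (hypothetical format).
    """
    en10 = ei10 = ei11 = ei13 = 0
    others = [x for x in adj if x != u and x != v]
    for a, b in combinations(others, 2):
        quad = (u, v, a, b)
        # Non-induced C4 through e: the cycles u-v-a-b-u and u-v-b-a-u.
        if b in adj[a]:
            en10 += (a in adj[v] and b in adj[u])
            en10 += (b in adj[v] and a in adj[u])
        # Degrees inside the subgraph induced by the four nodes.
        deg = {x: sum(1 for y in quad if y in adj[x]) for x in quad}
        m = sum(deg.values()) // 2
        if m == 4 and all(d == 2 for d in deg.values()):
            ei10 += 1                      # induced C4
        elif m == 5 and not (deg[u] == 3 and deg[v] == 3):
            ei11 += 1                      # diamond, e is not the diagonal
        elif m == 6:
            ei13 += 1                      # K4
    return en10, ei10, ei11, ei13

# In a K4 every edge lies on two non-induced C4 and on no induced one.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
print(edge_orbit_counts(k4, 0, 1))  # (2, 0, 0, 1)
```

For the \(K_4\) example the equation gives \(2 - 0 - 2 \cdot 1 = 0\), matching the brute-force value of the induced count.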
Calculating non-induced frequencies
The calculation of the non-induced frequencies is (computationally) easier than for the corresponding induced counts, except for \(K_4\)s. This is due to the fact that the non-induced frequencies can be constructed from subgraphs with fewer nodes, cf. Figs. 3 and 4. In the following we show the correctness of \(nn_{14}(u)\) and \(en_{4}(u,v)\).
Non-induced orbit 14 node census. To determine \(nn_{14}(u)\) we start by enumerating all triangles containing u. Let v and w form a triangle together with u. As u is in orbit 14, we know that each neighbor of v and w that is not u, v or w definitely creates a non-induced paw with u in orbit 14, while this does not necessarily hold for neighbors of u, as they might not be connected to v or w (and, if they are, we have already accounted for this). Note that even if a neighbor of either v or w is a neighbor of u as well, there is no additional paw with u in orbit 14, and therefore \(nn_{14}(u) = \sum _{\{v,w\}\in T(u)} (d(v)+d(w) -4)\).
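As a quick illustration of this formula, the following Python sketch evaluates the sum directly. It is a minimal example under stated assumptions: the function name `nn14` and the adjacency-set representation are hypothetical, and the triangles at u are found by checking all neighbor pairs rather than by the listing algorithm discussed later.

```python
from itertools import combinations

def nn14(adj, u):
    """Non-induced orbit-14 count of u: sum of d(v) + d(w) - 4 over triangles {u, v, w}."""
    total = 0
    for v, w in combinations(sorted(adj[u]), 2):
        if w in adj[v]:  # {v, w} closes a triangle with u
            # Each neighbour of v or w outside the triangle adds one paw.
            total += len(adj[v]) + len(adj[w]) - 4
    return total

# Toy graph (assumption): triangle 0-1-2 with a pendant node 3 attached to 2.
paw = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(nn14(paw, 0))  # 1
print(nn14(paw, 2))  # 0
```

Node 2 carries the pendant edge, so it is never in orbit 14 of this paw, which is why its count is zero.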
Non-induced orbit 4 edge census. Edge e={u,v} is non-induced in orbit 4 for each \(P_3\) starting at u or v which neither contains e nor closes a \(K_3\) with e. The number of \(P_3\)s starting at u and not containing e equals \(\sum _{w \in N(u) \setminus \{v\}}(d(w) - 1)\). However, the node v might be a neighbor of w, and therefore there is a path of length two (via w) connecting u and v. Since this creates a three-node subgraph, more precisely a triangle, and not a quad, we have to adjust for this by subtracting the number of triangles containing e, once for the paths starting at u and once for those starting at v. Consequently, \(en_{4}(u,v) = \sum _{w \in N(u)} d(w) + \sum _{w \in N(v)} d(w) - 2(d(u)+d(v))+2 -2t(u,v)\).
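The closed form for \(en_{4}(u,v)\) only needs degrees and the number of common neighbors. The Python sketch below is illustrative; the names `en4` and `neighbour_degree_sum` and the adjacency-set input are assumptions, and t(u,v) is computed here as the size of the common neighborhood.

```python
def neighbour_degree_sum(adj, x):
    """Sum of d(w) over all neighbours w of x."""
    return sum(len(adj[w]) for w in adj[x])

def en4(adj, u, v):
    """Non-induced orbit-4 count of edge {u, v} via the closed formula."""
    t_uv = len(adj[u] & adj[v])  # triangles containing the edge {u, v}
    return (neighbour_degree_sum(adj, u) + neighbour_degree_sum(adj, v)
            - 2 * (len(adj[u]) + len(adj[v])) + 2 - 2 * t_uv)

# Toy graph (assumption): the path 0-1-2-3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(en4(path, 0, 1))  # 1: {0, 1} is an end edge of the path 0-1-2-3
print(en4(path, 1, 2))  # 0: {1, 2} is only ever the middle edge
```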
In the following, we focus on the algorithm calculating all required frequencies to solve the systems of equations.
Listing complete quads
In order to be able to solve the systems of equations we need to compute the non-induced quad counts as well as any of the induced frequencies. This requires an algorithm that is capable of solving the following tasks on a node and edge level:
1. Counting and listing all \(K_3\)
2. Calculating non-induced \(C_4\) frequencies
3. Determining induced counts of any quad
We chose to calculate the induced counts for \(K_4\) to fulfill requirement 3. The reasons are: a) to our knowledge there are no algorithms calculating induced counts on a node and edge level for any other quad more efficiently than the algorithm we are presenting here; b) a \(K_4\) has the property that all nodes and edges lie in the same orbit; c) all non-induced \(C_4\) can be counted during the execution of our algorithm. Since listing, also known as enumerating, all \(K_4\) has to solve the subproblem of listing all \(K_3\), we start explaining our algorithm by presenting how \(K_3\)s can be listed efficiently. Note that this algorithm satisfies requirement 1.
Listing all triangles in a graph is a well-studied topic (Ortmann and Brandes 2014). In our previous work (Ortmann and Brandes 2014) we showed that one of the oldest triangle listing algorithms, namely K3 by Chiba and Nishizeki (1985), is the fastest in practice. This algorithm is based on neighborhood intersection computations. To achieve the running time of \(\mathcal {O}(a(G)m)\), Chiba and Nishizeki process the graph such that for each intersection only the neighborhood of the smaller degree node has to be scanned. This is done by processing the nodes sequentially in decreasing order of their degree. The currently processed node marks all its neighbors and is removed from the graph. Then the number of marked neighbors of each marked node is calculated.
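The marking scheme just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation evaluated in this paper: the function name `k3_marking` and the adjacency-set input are hypothetical, and Python sets stand in for the mark flags and adjacency lists of the original algorithm.

```python
def k3_marking(adj):
    """List all triangles by processing nodes in decreasing order of degree."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    order = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    triangles = []
    for u in order:
        marked = set(adj[u])              # u marks all its neighbours
        for v in list(adj[u]):
            for w in adj[v]:
                if w in marked:           # marked neighbour of a marked node
                    triangles.append((u, v, w))
            marked.discard(v)             # unmark v: each triangle is reported once
        for v in adj[u]:                  # remove u from the graph
            adj[v].discard(u)
        adj[u] = set()
    return triangles

# Toy graph (assumption): the complete graph on four nodes has four triangles.
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
print(len(k3_marking(k4)))  # 4
```

Unmarking v after scanning its neighborhood is what prevents the triangle {u, v, w} from being reported a second time when w is scanned.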
Let us think of this algorithm differently. When we process node u and remove it from the graph, every triangle that contains u becomes an edge where both endpoints are marked, cf. Fig. 5. This view of the algorithm directly points us to a solution for the second and third requirement. As shown in Fig. 5, when node u is removed from the graph, every \(K_4\) that contains u becomes a \(K_3\) where all nodes are marked, implying that K3 can be easily adapted to list all \(K_4\)s. Chiba and Nishizeki call this extension COMPLETE. Furthermore, only nodes that are connected to a neighbor of u can create a non-induced \(C_4\), and each \(C_4\) contains at least two marked nodes. Since all these nodes are already processed during the execution of algorithm K3, counting non-induced \(C_4\) on a node and edge level can also be done in \(\mathcal {O}(a(G)m)\) time. The corresponding algorithm is called C4 in (Chiba and Nishizeki 1985), and the combination of these different algorithms is presented in Algorithm 1. It runs in \(\mathcal {O}(a(G)^{2}m)\) (Chiba and Nishizeki 1985), and its novelty is that it follows the idea of orienting the graph acyclically, as we already proposed in the context of triangle listing (Ortmann and Brandes 2014). Furthermore, this acyclic orientation allows omitting node removals and, given the proper node ordering, it has the property that the maximum outdegree is bounded by \(\mathcal {O}(a(G))\). Therefore, unlike for algorithms COMPLETE and C4 (Chiba and Nishizeki 1985), no amortized running time analysis is needed to prove that the running times are in \(\mathcal {O}(a(G)^{2}m)\) and \(\mathcal {O}(a(G)m)\), respectively, as we will show next.
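To illustrate how an acyclic orientation replaces node removals when listing \(K_4\)s, consider the following Python sketch. It is not Algorithm 1: the function name `list_k4` is hypothetical, \(C_4\) counting is omitted, and for brevity the nodes are ordered by (degree, id), which is an assumption and does not give the outdegree bound of the min-degree removal order used in the text.

```python
def list_k4(adj):
    """List every K4 exactly once using an acyclic orientation of the edges."""
    # Assumed ordering: by (degree, node id); any total order keeps the listing correct.
    ranked = sorted(adj, key=lambda u: (len(adj[u]), u))
    rank = {u: i for i, u in enumerate(ranked)}
    # Orient each edge from the lower-ranked to the higher-ranked endpoint.
    out = {u: {v for v in adj[u] if rank[v] > rank[u]} for u in adj}
    quads = []
    for u in adj:
        marked = out[u]                   # out-neighbours of u play the role of marks
        for v in marked:
            vw = out[v] & marked          # marked out-neighbours of v: triangles at u
            for w in vw:
                for x in out[w] & vw:     # a marked triangle v, w, x completes a K4
                    quads.append((u, v, w, x))
    return quads

# Toy graph (assumption): the complete graph on five nodes contains five K4.
k5 = {i: {j for j in range(5) if j != i} for i in range(5)}
print(len(list_k4(k5)))  # 5
```

Because the orientation follows a total order, each \(K_4\) is reported only from its lowest-ranked node, so no node has to be deleted to avoid duplicates.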
Runtime. We first show that the running time of our variant implementation of algorithm C4 is in \(\mathcal {O}(a(G)m)\); for now, we therefore ignore Lines 4, 6, 8, 12–19 and 27 of Algorithm 1.
The running time of the remaining algorithm is given by the following equation:
$$\begin{array}{@{}rcl@{}} t(\textsf{C4}) & \leq & \sum\limits_{u \in V} \Big(d^{-}(u) + 2\sum\limits_{v \in N^{-}(u)}\big(d^{-}(v) + d^{+}(v)\big)\Big)\\ & = & m + 2 \sum\limits_{v \in V} d^{+}(v) \big(d^{-}(v)+d^{+}(v)\big)\\ & \leq & m + 4m\Delta^{+}(G) \end{array} $$
As we order the nodes by successively removing the node of minimum degree from the graph, which can be computed in \(\mathcal {O}(m)\) using a slightly modified version of the algorithm presented in (Batagelj and Zaveršnik 2003), it holds that \(\Delta^{+}(G) < 2a(G)\) (Zhou and Nishizeki 1994). The time required to initialize all marks is in \(\mathcal {O}(n)\), orienting the graph is in \(\mathcal {O}(n+m)\), and consequently the total running time is in \(\mathcal {O}(a(G)m)\).
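The node ordering can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, and a binary heap with lazy deletion is used instead of the linear-time bucket structure of Batagelj and Zaveršnik, so this version runs in \(\mathcal {O}(m \log n)\). Edges are oriented from the node removed earlier to the node removed later.

```python
import heapq

def min_degree_order(adj):
    """Repeatedly remove a node of minimum remaining degree; return the removal order."""
    deg = {u: len(adj[u]) for u in adj}
    heap = [(d, u) for u, d in deg.items()]
    heapq.heapify(heap)
    removed, order = set(), []
    while heap:
        d, u = heapq.heappop(heap)
        if u in removed or d != deg[u]:
            continue                      # stale heap entry, skip it
        removed.add(u)
        order.append(u)
        for v in adj[u]:
            if v not in removed:
                deg[v] -= 1
                heapq.heappush(heap, (deg[v], v))
    return order

def max_outdegree(adj, order):
    """Maximum outdegree when edges point from earlier to later nodes in the order."""
    rank = {u: i for i, u in enumerate(order)}
    return max(sum(1 for v in adj[u] if rank[v] > rank[u]) for u in adj)

# Toy graph (assumption): a star with centre 0 and five leaves.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
print(max_outdegree(star, min_degree_order(star)))  # 1
```

Although the center of the star has degree five, the leaves are removed first, so no node keeps more than one outgoing edge.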
Let us now focus on the time required for listing all \(K_4\)s and therefore ignore Lines 9–11 and 20–27 of Algorithm 1. The running time of the remaining algorithm is given by the following equation:
$$\begin{array}{@{}rcl@{}} t(\textsf{COMPLETE}) & \leq & \sum\limits_{u \in V} \Big(d^{-}(u) + \sum\limits_{v \in N^{-}(u)}\Big(2d^{+}(v) + \sum\limits_{w \in N^{+}(v)}d^{+}(w)\Big)\Big)\\ & \leq & m + \Delta^{+}(G)\sum\limits_{v \in V} \Big(2d^{+}(v) + \sum\limits_{w \in N^{+}(v)}d^{+}(w)\Big)\\ & \leq & m + 2m\Delta^{+}(G) + \Delta^{+}(G) \sum\limits_{v \in V}d^{-}(v)\Delta^{+}(G)\\ & = & m + 2m\Delta^{+}(G) + m\Delta^{+}(G)^{2} \end{array} $$
By the same arguments it follows that our variant implementation of COMPLETE runs in \(\mathcal {O}(a(G)^{2}m)\). Since Line 4 is in \(\mathcal {O}(a(G)m)\) (Ortmann and Brandes 2014) and solving the systems of equations requires \(\mathcal {O}(n+m)\) time, the overall complexity of Algorithm 1 is in \(\mathcal {O}(a(G)^{2}m)\).
Before we give experimental evidence that our algorithm is superior to the currently fastest orbit-aware quad census algorithm not just asymptotically but also in practice, we want to explain in more detail why algorithm C4 runs in \(\mathcal {O}(a(G)m)\) instead of \(\mathcal {O}(a(G)^{2}m)\), although every \(K_4\) contains three non-induced \(C_4\). The reason lies in the fact that COMPLETE belongs to the class of listing algorithms, while C4 is a counting algorithm. Since a listing algorithm has to enumerate every single occurrence of the subgraph of interest, its running time cannot be asymptotically smaller than the number of subgraphs it has to list. For example, no algorithm for listing all triangles in a graph can be asymptotically faster than \(\Theta(n^{3})\) in the worst case, since the complete graph contains \({n \choose 3}\) triangles. However, as counting does not require enumerating every single triangle, there exist algorithms with a lower worst-case complexity, e.g. via matrix multiplication (Coppersmith and Winograd 1990). This difference, and the fact that in the non-induced scenario we can ignore the existence of some edges, explain the asymptotic difference between the two algorithms.
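The distinction between counting and listing can be made concrete with the matrix-based triangle count, which uses the fact that the trace of the cubed adjacency matrix equals six times the number of triangles. The Python sketch below is illustrative only: the function name is hypothetical, and it uses naive cubic matrix multiplication, so it demonstrates that no triangle is enumerated explicitly but not the improved running time, which would require a fast matrix multiplication routine.

```python
def count_triangles_matrix(a):
    """Count triangles as trace(A^3) / 6 for a symmetric 0/1 adjacency matrix a."""
    n = len(a)
    # A^2[i][j] is the number of paths of length two from i to j.
    a2 = [[sum(a[i][k] * a[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # trace(A^3) = sum over i, j of A^2[i][j] * A[j][i]; no triangle is listed.
    trace = sum(a2[i][j] * a[j][i] for i in range(n) for j in range(n))
    # Each triangle is counted once per start node and per direction.
    return trace // 6

# Toy graph (assumption): the complete graph on four nodes.
k4_matrix = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(count_triangles_matrix(k4_matrix))  # 4
```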