Cliques in high-dimensional random geometric graphs
Applied Network Science volume 5, Article number: 92 (2020)
Abstract
Random geometric graphs have now become a popular object of research. Defined rather simply, these graphs describe real networks much better than classical Erdős–Rényi graphs due to their ability to produce tightly connected communities. The n vertices of a random geometric graph are points in d-dimensional Euclidean space, and two vertices are adjacent if they are close to each other. Many properties of these graphs have been revealed in the case when d is fixed. However, the case of growing dimension d is practically unexplored. This regime corresponds to a real-life situation in which one has a data set of n observations with a significant number of features, a quite common case in data science today. In this paper, we study the clique structure of random geometric graphs when \(n\rightarrow \infty\), \(d \rightarrow \infty\), and the average vertex degree grows significantly slower than n. We show that under these conditions, random geometric graphs do not contain cliques of size 4 a.s. provided that \(d \gg \log ^{1 + \epsilon } n\). As for the cliques of size 3, we present new bounds on the expected number of triangles in the case \(\log ^2 n \ll d \ll \log ^3 n\) that improve previously known results. In addition, we provide new numerical results showing that the underlying geometry can be detected using the number of triangles even for small n.
Introduction
Graphs are a natural way to model many real-world networks. The links between nodes in a network are often occasional, which is why deterministic graphs are often inadequate for network modeling. Random graphs, in which the presence of a link is a random variable, appeared in the 20th century, with the simplest model proposed by Erdős and Rényi (see Erdős and Rényi 1960; Bollobás 2001; Alon and Spencer 2004). In this model, the edges between vertices appear independently with equal probability. However, this model fails to describe some important properties of many real networks, such as the community structure. Many models have been proposed to overcome this and other problems: the Barabási–Albert model (see Albert and Barabási 2002, 1999; Fronczak et al. 2003), random geometric graphs (see Gilbert 1961; Penrose 2003), hyperbolic geometric graphs (see Barthélemy 2011; Krioukov et al. 2010). Random geometric graphs are perhaps the simplest natural model in which the appearance of an edge depends only on the Euclidean distance between the given nodes. These graphs resemble real social, technological and biological networks in many aspects. They might also be useful in statistics and machine learning tasks: the correlations between observations in a data set can be represented by links between corresponding close vertices. Moreover, analyzing the graph in this situation can help to determine the presence of the underlying geometry and, therefore, the possibility of embedding into some geometric space. The reader can find these and other applications in Preciado and Jadbabaie (2009), Haenggi et al. (2009), Pottie and Kaiser (2000), Nekovee (2007), Xiao and Yeh (2011), Higham et al. (2008), Arias-Castro et al. (2015).
Let us define the model of the graphs that interest us and introduce a notation system. We follow the article Devroye et al. (2011) and define a random geometric graph G(n, p, d) as follows. Let \(X_1, \ldots , X_n\) be independent random vectors uniformly distributed on the \((d-1)\)-dimensional sphere \({\mathbb {S}}^{d-1} \subset {\mathbb {R}}^d\). Two distinct vertices \(i\in [n]\) and \(j\in [n]\) are adjacent if and only if \(\langle X_i, X_j \rangle \ge t_{p,d}\); here \(t_{p,d}\) is defined in such a manner that \({\mathbb {P}}(\langle X_i, X_j \rangle \ge t_{p,d}) = p\).
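As an illustration of this definition, the following Python sketch samples G(n, p, d) for small n. It is not the authors' code: all function names are ours. The threshold \(t_{p,d}\) is found by bisection on a midpoint-rule integral of the density of \(\langle X_i, X_j \rangle\), which is proportional to \((1-x^2)^{(d-3)/2}\) on \([-1,1]\). We assume \(d \ge 3\) so that this density is bounded.

```python
import math
import random


def coordinate_density(x, d):
    """Density of <X_i, X_j> for independent uniform points on S^{d-1}, |x| < 1."""
    log_c = math.lgamma(d / 2) - math.lgamma((d - 1) / 2) - 0.5 * math.log(math.pi)
    return math.exp(log_c + 0.5 * (d - 3) * math.log1p(-x * x))


def tail_probability(t, d, steps=4000):
    """P(<X_i, X_j> >= t), approximated by the midpoint rule on [t, 1]."""
    h = (1.0 - t) / steps
    return h * sum(coordinate_density(t + (i + 0.5) * h, d) for i in range(steps))


def threshold(p, d):
    """Solve P(<X_i, X_j> >= t) = p for t by bisection (the tail decreases in t)."""
    lo, hi = -1.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if tail_probability(mid, d) > p:
            lo = mid  # tail too heavy: the threshold must be larger
        else:
            hi = mid
    return (lo + hi) / 2


def random_unit_vector(d, rng):
    """Uniform point on S^{d-1}, obtained by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sample_rgg(n, p, d, seed=0):
    """Return the edge set of one sample of G(n, p, d) and the threshold used."""
    rng = random.Random(seed)
    t = threshold(p, d)
    points = [random_unit_vector(d, rng) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if sum(a * b for a, b in zip(points[i], points[j])) >= t:
                edges.add((i, j))
    return edges, t


edges, t = sample_rgg(n=100, p=0.05, d=10)
print(f"threshold t = {t:.4f}, edges = {len(edges)} (expected about {0.05 * 4950:.0f})")
```

The quadratic loop over pairs is adequate only for small n; it is meant to mirror the definition rather than to be efficient.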
As for fixed d, many properties of random geometric graphs (connectivity, large components, number of small subgraphs) were established by the end of the 20th century. We refer to Penrose (2003) for an extensive study of random geometric graphs; among other papers on this topic we highlight Arias-Castro et al. (2015), Penrose (1999), Appel and Russo (2002), McDiarmid (2003), Müller (2008), McDiarmid and Müller (2011). However, the lack of results regarding the high-dimensional case \(d\rightarrow \infty\) looks quite surprising. This becomes even more remarkable with the growing number of features in data sets and, hence, possible application of high-dimensional random geometric graphs in machine learning problems.
To the best of our knowledge, the first paper treating the case \(d\rightarrow \infty\) is the article of Devroye et al. (2011). The authors studied the clique number in the asymptotic case when \(n\rightarrow \infty , \; d \gg \log n\), and p is fixed (hereinafter \(f(n) \gg g(n)\) (or \(f(n) \ll g(n))\) means \(\lim _{n\rightarrow \infty } \frac{f(n)}{g(n)} = \infty\) (respectively, \(\lim _{n\rightarrow \infty } \frac{f(n)}{g(n)} = 0\))). As was proven in Devroye et al. (2011), in this dense regime, the clique number of the random geometric graph G(n, p, d) is close to that of the Erdős–Rényi graph G(n, p).
Another extremely important paper is the work of Bubeck et al. (2016). The authors considered the thermodynamic regime when a node has a constant number of neighbours on average; their paper focuses mainly on the difference between random geometric and Erdős–Rényi graphs. They have obtained a ‘negative’ result: for \(d \ll \log ^3 n\), the graph G(n, c/n, d) (here c is a constant) is ‘different’ from the Erdős–Rényi graph G(n, c/n) in the sense that the total variation distance between two random models converges to 1 as \(n \rightarrow \infty\). Also, the authors made a conjecture about a ‘positive’ result: for \(d \gg \log ^3 n\), the graphs G(n, p, d) and G(n, p) are close. In order to obtain the ‘negative’ result, they proved that in the case \(d \ll \log ^3 n\) the average number of triangles in G(n, c/n, d) grows at least as a polylogarithmic function of n, which is different from the expected number of triangles in G(n, c/n). The difference between these regimes seems quite interesting, and that is why we concentrate on the case when d is a power of \(\log n\).
Our main interest is the investigation of cliques in the asymptotic case when \(n\rightarrow \infty\) and \(d \gg \log n\). Since the dense regime (when p is fixed) is well studied in Devroye et al. (2011), we will focus on the sparse regime when \(p = p(n) = o(1)\) as \(n\rightarrow \infty\). Also, we assume that p(n) does not go to 0 ‘extremely fast’ (for instance, as \(e^{-n}\)). This, of course, includes the regime considered in Bubeck et al. (2016).
The main contribution of the present paper consists of three results in the sparse regime. The first one, presented in the next section in Theorem 4, states that almost surely the clique number of G(n, p, d) is at most 3 under the condition \(d \gg \log ^{1+\epsilon } n\). The second one shows that the expected number of triangles grows as \(\left( {\begin{array}{c}n\\ 3\end{array}}\right) p^3\) in the case \(d \gg \log ^3 n\). This result is given in Theorem 5. Finally, in Theorem 7 we will present new lower and upper bounds on the expected number of triangles in the case \(\log ^2 n \ll d \ll \log ^3 n\). This lower bound improves the result of Bubeck et al. (2016) since it grows faster than any polylogarithmic function (recall that the lower bound from Bubeck et al. (2016) is polylogarithmic in n).
The latter result clearly shows that for \(d \ll \log ^3 n\), random geometric graphs have a greater tendency to form clusters than was shown in Bubeck et al. (2016). We will present some numerical results showing that this is true even for relatively small n. Hence, the number of triangles can be used as a tool to determine if a given graph has some hidden geometry.
This paper is an extended version of the Complex Networks conference paper Avrachenkov and Bobu (2019). Compared to the short paper, we present here entirely new numerical results and discuss in more detail the application of the present work to real-life machine learning tasks. Also, we have improved Theorems 4 and 5 and give full proofs of the results mentioned in Avrachenkov and Bobu (2019).
Related works
Let us describe the related works (Devroye et al. 2011) and (Bubeck et al. 2016) in more detail. We start with the results of Devroye et al. (2011). Although this paper is devoted to the dense regime, the following two theorems do not require the condition \(p = \mathrm {const}\) and turn out to be useful in our situation. Let us denote by \(N_k(n,p,d)\) the number of cliques of size k in G(n, p, d). The results below establish lower and upper bounds on \({\mathbb {E}} [N_k(n,p,d)]\).
Theorem 1
(Devroye et al. 2011) Introduce
Let \(\delta _n \in (0,2/3]\), and fix \(k \ge 3\). Assume
Define \(\alpha > 0\) as
Then
where \(\displaystyle {\tilde{\Phi }}_k(d,p) = \Phi \left( \frac{\alpha t_{p,d}\sqrt{d} + \delta _n}{\sqrt{1 - \frac{2(k+1)^2 \log (1/{\tilde{p}})}{d}}} \right)\). Here \(\Phi (\cdot )\) denotes the CDF of the standard normal distribution.
Theorem 2
(Devroye et al. 2011) Let \(k \ge 2\) be a positive integer, let \(\delta _n > 0\), and define
Assume
Furthermore, for \(p < 1/2\), define \(\beta = 2\sqrt{\log (4/{\hat{p}})}\) and for \(\beta \sqrt{k/d} < 1\), let \(\alpha = \sqrt{1 - \beta \sqrt{k/d}}\). Then for any \(0< \delta _n < \alpha t_{p,d} \sqrt{d}\), we have
The following result of Bubeck et al. (2016) gives a lower bound on the expected number of triangles and significantly improves Theorem 1 provided that \(d \ll \log ^3 n\).
Theorem 3
(Bubeck et al. 2016) There exists a universal constant \(C > 0\) such that whenever \(p < 1/4\) we have that
Note that if \(p = \theta (n)/n\) with some function \(\theta (n) \ll n^{1-\epsilon }\) and \(d \ll \log ^3 n\), the expected number of triangles grows as a polylogarithmic function. This is totally different from the Erdős–Rényi graph \(G(n,\theta (n)/n)\) where the average number of triangles grows as \(\theta ^3(n)\) with \(n\rightarrow \infty\). This result will be further improved by our Theorem 7.
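For comparison, the Erdős–Rényi baseline mentioned above is easy to evaluate numerically. The sketch below (function name ours, not from the paper) computes the exact expectation \(\left( {\begin{array}{c}n\\ 3\end{array}}\right) p^3\) for \(p = \theta (n)/n\) and compares it with \(\theta ^3(n)/6\), its leading-order value. The choice \(\theta (n) = \ln n\) is only an example.

```python
from math import comb, log


def er_expected_triangles(n, p):
    """Expected number of triangles in the Erdős–Rényi graph G(n, p)."""
    return comb(n, 3) * p ** 3


n = 10_000
theta = log(n)  # example choice theta(n) = ln n
exact = er_expected_triangles(n, theta / n)
leading = theta ** 3 / 6
print(f"exact = {exact:.3f}, theta^3 / 6 = {leading:.3f}")
```

The two values agree up to the factor \((n-1)(n-2)/n^2\), which tends to 1.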
Results
Clique number in the sparse regime
Theorems 1 and 2 allow us to say that for constant p and \(d \gg \log ^7 n\), the clique number of a random geometric graph grows similarly to that of an Erdős–Rényi graph, which is \(2\log _{1/p} n - 2\log _{1/p}\log _{1/p} n + O(1)\). We will show that in the sparse regime, under some conditions on p and d, there is no clique of size 4 in G(n, p, d) a.s. Let us also remark that the condition \(d \gg \log ^7 n\) is necessary only for the dense regime, as Theorems 1 and 2 do not impose any restrictions on d.
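To get a feeling for the dense-regime formula quoted above, one can evaluate its two leading terms. This is a minimal sketch with a function name of our own choosing. The \(O(1)\) term is simply dropped, so the output is an order-of-magnitude indication rather than a prediction of the clique number.

```python
from math import log


def er_clique_number_leading_terms(n, p):
    """2 log_{1/p} n - 2 log_{1/p} log_{1/p} n, with the O(1) term omitted."""
    base = 1.0 / p
    log_n = log(n, base)
    return 2 * log_n - 2 * log(log_n, base)


for n in (10**3, 10**6, 10**9):
    print(n, round(er_clique_number_leading_terms(n, 0.5), 2))
```

Even for \(n = 10^9\) and \(p = 1/2\) the value stays below 60, which illustrates how slowly the clique number grows in the dense regime.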
To apply Theorems 1 and 2, we need a lemma that establishes the growth rate of \(t_{p,d}\), which is crucial for asymptotic analysis in the sparse regime. Since p is the normalized surface area of a spherical cap of angle \(\arccos t_{p,d}\) (an example for a circle is given in Fig. 1 below), we learn from convex geometry that (see Brieden et al. 2001):
The following more explicit bound on \(t_{p,d}\) has been derived in Bubeck et al. (2016).
Lemma 1
There exists a universal constant \(C>0\) such that for all \(0 < p \le 1/2\), we have that
Before we present our main result on the clique number, let us prove a lemma that will be useful later.
Lemma 2
Let \(d \ge \log ^{1+\epsilon }n\) for some fixed \(\epsilon > 0\) and \(p \ge 1/n^m\) for some fixed m. Then for any \(\gamma > 0\), there exists \(n_0\) such that for \(n \ge n_0\)
Proof
First of all, let us notice that \(t_{p,d} \ge 0\), that is why
therefore, it is sufficient to derive a bound on \(e^{-t^2_{p,d} (d-1)/2}\). We next show that this quantity might be approximated by \(\frac{1}{t_{p,d}\sqrt{d}}(1-t^2_{p,d})^{\frac{d-1}{2}}\). Indeed,
and since \(t_{p,d} \rightarrow 0\), we can approximate the logarithm by Taylor series with \(\displaystyle k = \left\lceil 1/\epsilon \right\rceil + 1\) terms:
That implies:
The first term in the sum under the exponent function gives exactly what we needed. Therefore, it can be expressed as follows:
According to (1), the first term in the product on the right-hand side is at most proportional to p:
Further, by Lemma 1, there exists a constant \(C > 0\) such that
for large enough n. The right-hand side converges to 0 as \(n\rightarrow \infty\) since \(1-\epsilon k < 0\) due to the choice of parameter k. That allows us to bound the second term by a constant larger than 1, for instance:
Let us consider the last term of (2). As in the argument above,
But \(\exp \left( Cm^k\log ^{1-\epsilon i}n\right)\) grows slower than \(n^{\gamma /k}\) for any \(\gamma > 0\), which means that for large n the third term of (2) is at most \(n^{\gamma }\):
A final combination of these three bounds finishes the proof:
\(\square\)
Now we can present our result, which states that G(n, p, d) almost surely does not contain cliques of size 4. Recall that \(N_k(n,p,d)\) denotes the number of k-cliques in G(n, p, d).
Theorem 4
Suppose that \(\displaystyle k \ge 4, n^{-m} \le p \le n^{-2/3 - \gamma }\) with some \(m>0\) and \(\gamma > 0\) and \(d \gg \log ^{1+\epsilon }n\) with some \(\epsilon > 0\). Then
Proof
First, let us check that the conditions of Theorem 2 are satisfied with the parameters of our case. It is known that
Thus, in our case
Therefore, since \(t_{p,d} \sqrt{d} \rightarrow \infty\),
Here we have used the result of Lemma 2. Considering \(d \gg \log ^{1+\epsilon } n\), we obtain that \(\beta \rightarrow 0\) as \(n\rightarrow \infty\), that is why for sufficiently large n,
Now we need to select an appropriate parameter \(\delta _n\). Since \(\alpha = \sqrt{1 - \beta \sqrt{k/d}} \rightarrow 1\) as \(n\rightarrow \infty\), we can take \(\delta _n = \log ^{1/2 - \epsilon /2} n\) for sufficiently large n.
It only remains to check the condition
Indeed, since k is constant, for n large enough
But d grows faster than \(\log ^{1+\epsilon } n\), and condition (4) is then satisfied. Thus, it is now possible to apply the bound from Theorem 2.
According to the asymptotic representation of \(\Phi (x)\) and Lemma 2, for sufficiently large n,
with some universal constant \(C_{\Phi } > 0\). Since \(\beta \le 2C \sqrt{m} \sqrt{\log n}\) and \(t_{p,d} \le C\sqrt{\frac{\log (1/p)}{d}}\), we have
The second exponent can be bounded similarly:
Therefore, for sufficiently large n,
Finally, we get that
Let us notice that \(\left( {\begin{array}{c}n\\ k\end{array}}\right) \le \frac{n^k}{k!}\). It is easy to verify that \(k - \left( \frac{2}{3} + \gamma \right) \frac{k(k-1)}{2} < 0\) for \(k \ge 4\) and \(\gamma > 0\). Then for \(k \ge 4\),
It only remains to mention that
The theorem is proved.
\(\square\)
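The sign condition used in the last step of the proof of Theorem 4 can be checked with exact rational arithmetic. The sketch below (function name ours) evaluates the exponent \(k - \left( \frac{2}{3} + \gamma \right) \frac{k(k-1)}{2}\) of n that arises when \(\left( {\begin{array}{c}n\\ k\end{array}}\right) \le n^k/k!\) is combined with \(p \le n^{-2/3-\gamma }\). For \(k = 4\) it equals \(-6\gamma\), whereas for \(k = 3\) it equals \(1 - 3\gamma\), which is positive for small \(\gamma\). This is consistent with triangles surviving while 4-cliques do not.

```python
from fractions import Fraction


def clique_exponent(k, gamma):
    """Exponent of n in n^k * p^(k(k-1)/2) when p = n^(-2/3 - gamma)."""
    return k - (Fraction(2, 3) + gamma) * Fraction(k * (k - 1), 2)


gamma = Fraction(1, 100)  # an arbitrary small positive value, for illustration
for k in range(3, 8):
    print(k, clique_exponent(k, gamma))
```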
Number of triangles in the sparse regime: \(d \gg \log ^3 n\)
As noted in the previous section, in the sparse regime, G(n, p, d) does not contain any complete subgraph larger than a triangle. A natural question arises: how many triangles does G(n, p, d) contain? The next two results give some idea of the expected number of triangles. The first result refers to the case \(d \gg \log ^3n\); in this case, the average number of triangles grows as the function \(\theta (n)\) that determines the probability \(p(n) = \theta (n)/n\).
Our first goal is to obtain a more accurate analogue of Lemma 2.
Lemma 3
Assume \(p \ge n^{-m}\) for some \(m > 0\) and \(d \gg \log ^2 n\). Then the following inequality holds true:
Proof
From (1) we learn that
Let us write the Taylor series of \((1-t^2_{p,d})^{d/2}\):
Lemma 1 and the condition \(d \gg \log ^2n\) guarantee that \(t^4_{p,d} d \rightarrow 0\) as \(n \rightarrow \infty\). This means that for any \(\delta > 0\) and sufficiently large n, the quantity \(\exp \left( O(dt^4_{p,d})\right)\) can be bounded as follows:
The same statement holds true for \(1/\sqrt{1-t^2_{p,d}}\):
Therefore, taking \(\delta < 1 - 1/\sqrt{2}\),
Thus, inequality (5) is proved. \(\square\)
Theorem 5
Let us suppose that \(d \gg \log ^3 n\) and \(p = \theta (n)/n\) with \(n^m \le \theta (n) \ll n\) for some \(m > 0\). Then for any \(0< \epsilon < 1\) and sufficiently large n, the expected number of triangles can be bounded as follows:
Proof
The idea of the proof is quite similar to that of Theorem 4, but it uses both Theorems 1 and 2. Besides, we need a more accurate asymptotic analysis, as the rough bound of Lemma 2 is no longer sufficient for the application of Theorems 1 and 2. We are going to use the more precise Lemma 3.
Upper bound As previously, we first need to verify the conditions of Theorem 2. It is obvious that still
Take \(\delta _n = \frac{\log (1 + \epsilon /4)}{C\sqrt{m} \sqrt{\log n}}\). Then Theorem 2 holds true for
But the right-hand side does not grow faster than \(\log ^3 n\) due to the choice of \(\delta _n\) and an argument similar to that of Theorem 4. Consequently, the above condition is satisfied provided that \(d \gg \log ^3 n\), and now we can apply Theorem 2.
Let us rewrite the bound from this theorem for \(k = 3\):
Of course, the most important term here is \(1 - \Phi (\alpha t_{p,d}\sqrt{d} - \delta _n)\). Similar to the proof of the previous theorem one can get (with the asymptotic representation (3)) that
From (5) we learn that
Further, since \(d \gg \log ^3 n\) and \(\beta < 2C\sqrt{m}\sqrt{\log n}\) (see the proof of Theorem 4), for sufficiently large n,
The second exponent is just a constant for the chosen \(\delta _n\) (here we use the fact that \(\alpha < 1\)):
Putting all together and taking into account that \(\left( {\begin{array}{c}n\\ 3\end{array}}\right) \le \frac{n^3}{6}\),
Lower bound Now we are going to use Theorem 1. First, we need to determine the asymptotic behavior of the function \({\tilde{p}}\):
As one can easily see,
Then \(\beta = \Theta (\sqrt{\log n})\) and \(\beta \sqrt{3/d} \rightarrow 0\) with \(n\rightarrow \infty\). This implies that \(\alpha = \sqrt{1 - \beta \sqrt{k/d}} \rightarrow 1\) as \(n\rightarrow \infty\). Let us take \(\delta _n = \frac{\log (1 + \epsilon )}{2\alpha C\sqrt{m} \sqrt{\log n}}\). Similar to the previous case, the condition
is satisfied if \(d \gg \log ^3 n\).
Recall that the bound of Theorem 1 is written as follows:
where \(\displaystyle {\tilde{\Phi }}_3(d,p) = \Phi \left( \frac{\alpha t_{p,d}\sqrt{d} + \delta _n}{\sqrt{1 - \frac{32 \log (1/{\tilde{p}})}{d}}} \right)\).
Since \(t_{p,d}\sqrt{d} \rightarrow \infty\) and \(\alpha \rightarrow 1\) as \(n \rightarrow \infty\),
Here we used the simple inequality \(1/(1-x) < 1 + 2x\) for \(0< x < 1/2\) and the fact that \(\frac{4(k+1)^2 \log (1/{\tilde{p}})}{d} = O(1/\log ^2 n) \rightarrow 0\) as \(n\rightarrow \infty\). For the same reason,
Hence,
Consequently,
The inequality (5) guarantees that
Further, similarly to the previous case,
Finally, it is easy to check that \(\delta _n^2 \rightarrow 0\) and \(\sqrt{\frac{8k}{d} \log \frac{4}{{\tilde{p}}}} t^2_{p,d} d \rightarrow 0\) under the condition \(d\gg \log ^3 n\). Hence,
That leads us to the final bound:
\(\square\)
Number of triangles in the sparse regime: \(d \ll \log ^3 n\)
The results presented so far rather confirm the similarity of random geometric graphs and Erdős–Rényi graphs. However, from Bubeck et al. (2016) one can learn that these graphs are completely different in the sparse regime if \(d \ll \log ^3 n\). This can be easily deduced from the result of Theorem 3. It states that the expected number of triangles of a random geometric graph grows significantly faster (as a polylogarithmic function of n) than that of the corresponding Erdős–Rényi graph. It turns out that the bound of Theorem 3 can be improved.
In order to make this improvement, we present some results from convex geometry. First of all, it is known that the surface area \(A_d\) of the \((d-1)\)-dimensional sphere \({\mathbb {S}}^{d-1}\) can be calculated as follows (see Blumenson 1960):
\[ A_d = \frac{2\pi ^{d/2}}{\Gamma (d/2)}. \]
Now we need a result providing the expression for the surface area of the intersection of two spherical caps in \({\mathbb {R}}^d\). Let us denote by \(A_d(\theta _1, \theta _2, \theta _{\nu })\) the surface area of the intersection of two spherical caps of angles \(\theta _1\) and \(\theta _2\) with the angle \(\theta _{\nu }\) between axes defining these caps. The paper (Lee and Kim 2014) gives the exact formula for this quantity in terms of the regularized incomplete beta function.
Theorem 6
(Lee and Kim 2014) Let us suppose that \(\theta _{\nu } \in [0,\pi /2)\) and \(\theta _1, \theta _2 \in [0,\theta _{\nu }]\). Then
where \(\theta _{min}\) is defined as follows
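and \(I_x(a,b)\) stands for the regularized incomplete beta function, that is
\[ I_x(a,b) = \frac{\mathrm {B}(x; a, b)}{\mathrm {B}(a,b)} = \frac{1}{\mathrm {B}(a,b)} \int _0^x t^{a-1}(1-t)^{b-1}\,dt. \]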
Theorem 7
Let \(d \gg \log ^2 n\), and assume \(p = \theta (n)/n\) with \(n^m \le \theta (n) \ll n\) for some \(m > 0\). Then there exist constants \(C_l > 0\) and \(C_u > 0\) such that
Proof
Notation and a general plan of the proof Let us make some preparations. Denote by \(E_{i,j}\) the event \(\{\langle X_i, X_j \rangle \ge t_{p,d}\}\) and by \(E_{i,j}(x)\) the event \(\{\langle X_i, X_j \rangle = x\}\). In what follows, we condition on the zero probability event \(E_{i,j}(x)\). It should be understood as conditioning on the event \(\{x - \epsilon \le \langle X_i, X_j \rangle \le x + \epsilon \}\) with \(\epsilon \rightarrow 0\). Using this notation, we can rewrite
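where \(f_d(x)\) is the density of a coordinate of a uniform random point on \({\mathbb {S}}^{d-1}\) (see Bubeck et al. 2016), that is
\[ f_d(x) = \frac{\Gamma (d/2)}{\Gamma ((d-1)/2)\sqrt{\pi }} (1-x^2)^{(d-3)/2}, \quad x \in [-1,1]. \]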
Using the fact that
we can present \(f_d(x)\) as follows:
Here \(C_f(d)\) denotes some function of d with \(1/100 \le C_f(d) \le \sqrt{2}\).
Here is a general outline of the proof. We treat the terms \(T_1\) and \(T_2\) separately and start with \(T_1\). The probability \(P\bigl (E_{2,3} E_{1,3} | E_{1,2}(x) \bigr )\) can be expressed with the normalized surface area of the intersection of two spherical caps. First, we need to bound this quantity. After that, using the representation (8), we will calculate \(T_1\) in terms of the CDF of the standard normal distribution and will estimate its asymptotic behavior. As for \(T_2\), it will be enough to show that \(T_2 = o(T_1)\) as \(n\rightarrow \infty\).
Estimation of term \(T_1\). As mentioned above, we start with \(T_1\). It will be more convenient to write it in the following form:
Conditioning on \(E_{1,2}(\alpha t_{p,d})\), the probability \({\mathbb {P}}\bigl (E_{2,3} E_{1,3} | E_{1,2}(\alpha t_{p,d})\bigr )\) is just the normalized surface area of the intersection of two caps of angle \(\arccos (t_{p,d})\) (and \(\arccos (\alpha t_{p,d})\) is the angle between the axes of these caps).
where \(J_d^{a,b}\) is defined in Theorem 6, and
Because both caps are of the same angle, the parts in the right-hand side of (9) are equal. Therefore, recalling the definition of \(J_d^{\theta _{min}, \arccos (t_{p,d})}\),
Let us make a change of variables: \(\sin ^2 \phi = z\). Using this change and the expression for \(\theta _{min}\), we obtain
Considering the definition of a regularized beta function and the formula \(\Gamma (d+1) = d\Gamma (d)\), we have
Next, we need a simple double bound on the incomplete beta function \(I_u(a,1/2)\):
which can be established by estimation of \((1-t)^{-1/2}\) and subsequent explicit integration. That is why
Here we used the fact that \(1 \le d/(d-2) \le 2\) for \(d \ge 4\). We can transform:
which gives us the following estimation:
It is easy to check that \(\displaystyle \frac{1}{\sqrt{z(1-z)}}\) has a minimum value 1/2 for \(z \in [0, 1)\), and \(\displaystyle \frac{1}{1-z}\) is increasing in z for \(z > 0\). Therefore,
Now we can explicitly compute the integral:
which implies the final bounds on \({\tilde{p}}(\alpha )\):
This means that \(T_1\) can be estimated as follows:
Let us recall (8) and rewrite the ‘essential’ part of previous inequalities.
Of course, we are most interested in the term \(\left( 1 - (2 + \alpha ^2)t^2_{p,d} + 2\alpha t^3_{p,d} \right) ^{(d-1)/2}\), which can be rewritten in the following form:
Since \(t_{p,d} \rightarrow 0\) as \(n\rightarrow \infty\), one can use Taylor series for logarithm:
From (5) one can easily deduce that
where \(1 \le C_e(p,d) \le 12\) is some function that depends only on p and d. Therefore,
We have transformed the main term of (11). Let us deal with ‘unimportant’ parts of (10) and (11). Denote
Since \(1 \le \alpha \le 2\) and \(t_{p,d} \rightarrow 0\), for sufficiently large n and some constants \(C_l > 0\) and \(C_u > 0\),
and
Then, plugging (11), (12), (13) and (14) into (10), we obtain the following final bounds on \(T_1\) at this step:
Expression of bounds on \(T_1\) with the CDF of the standard normal distribution One can easily get that
which implies
Let us treat the integral in the right-hand side of the last equation. Changing the variable \(\beta = (\alpha - t_{p,d})t_{p,d} \sqrt{d-1}\), we obtain that
Since \(t_{p,d}\sqrt{d-1} \rightarrow \infty\) as \(n\rightarrow \infty\),
But the ratio of the second and the first terms in the right-hand side converges to 0 as \(n\rightarrow \infty\). Indeed,
The condition \(t_{p,d} \rightarrow 0\) implies that
and, therefore,
The last equality holds because under the condition \(d \gg \log ^2 n\), it is true that \(dt^4_{p,d} \rightarrow 0\) as \(n\rightarrow \infty\), and \(e^{dt^4_{p,d}} \rightarrow 1\). Putting (17) in (15) gives the final bounds on \(T_1\):
This concludes the first part of the proof.
Estimation of \(T_2\) The second term can be treated much more easily. Indeed, let us bound from above:
But, similarly to the argument in (16), \(\sqrt{d} e^{-4t^2_{p,d}(d-3)/2}\) is \(\displaystyle o\left( p^3 t^2_{p,d} e^{t^3_{p,d} d}\right)\), hence, finally,
It only remains to use the standard asymptotic representation of the binomial coefficient \(\left( {\begin{array}{c}n\\ 3\end{array}}\right) = \frac{n^3}{6}(1+o(1))\) in order to obtain the bounds on the expected number of triangles:
The theorem is proved.
\(\square\)
Let us now discuss the result of this theorem. To keep the expressions readable, we consider only \(p = 1/n\), but the idea extends to any sufficiently small p. First of all, in this case, as we know from Lemma 1, \(t^3_{p,d} d = \Theta \left( \frac{\log ^{3/2}n}{\sqrt{d}}\right)\). If \(d \ll \log ^3 n\), the exponent \(\exp \left( \frac{\log ^{3/2}n}{\sqrt{d}} \right)\) grows faster than any polylogarithmic function of n, which means that the obtained result is better than that of Theorem 3. Unfortunately, the upper bound of Theorem 7 is still \(1/t^2_{p,d}\) times larger than the lower bound, although this margin is much smaller than the ‘main’ term \(e^{t^3_{p,d} d}\). This exponent still grows slower than any power of n, but we believe that for \(d = \Theta (\log n)\), the number of triangles is linear (or almost linear) in n.
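The comparison between \(\exp \left( \frac{\log ^{3/2}n}{\sqrt{d}} \right)\) and a polylogarithmic function is purely asymptotic, and a small computation on the logarithmic scale shows how late it can set in. The sketch below works directly with \(L = \log n\) and drops all constants. The function names and the choices \(d = \log ^2 n\) and \(\log ^3 n\) as the reference polylogarithm are ours, for illustration only.

```python
import math


def log_geometric_factor(log_n, d):
    """Logarithm of exp(log^{3/2} n / sqrt(d)); constants are dropped."""
    return log_n ** 1.5 / math.sqrt(d)


def log_polylog(log_n, k):
    """Logarithm of log^k n."""
    return k * math.log(log_n)


for log_n in (1e2, 1e3, 1e4):
    d = log_n ** 2  # example dimension inside the range log^2 n <= d << log^3 n
    print(log_n, log_geometric_factor(log_n, d), log_polylog(log_n, 3))
```

With these choices the geometric factor is still below \(\log ^3 n\) at \(\log n = 100\) and exceeds it at \(\log n = 1000\). For \(d = \log ^3 n\) the factor is constant.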
Numerical results
Fig. 2. Comparison of clustering in G(n, p, d) and G(n, p) for \(p = \ln (n)/n\). (a) Global clustering coefficient for \(n = 5\,000\) and \(\ln n \le d \le \ln ^{3.5} n\); (b) global clustering coefficient for \(n = 10\,000\) and \(\ln n \le d \le \ln ^{3.5} n\); (c) number of triangles for \(d = \ln n\) and \(100 \le n \le 10\,000\); (d) number of triangles for \(d = \ln ^3 n\) and \(100 \le n \le 10\,000\)
So far, all the presented results are strictly theoretical and refer to the asymptotic case. But real-life networks have a limited number of nodes, and it is not clear whether our theoretical results might be applied to their description. To verify this hypothesis, we conducted a few simple numerical experiments.
Our aim is to compare the clustering coefficient and the number of triangles in random geometric and Erdős–Rényi graphs. Here we use the global clustering coefficient (GCC), or transitivity, which is defined as follows:
\[ \mathrm {GCC} = \frac{3 \times \text{number of triangles}}{\text{number of all triplets}}, \]
where a triplet is a configuration of three nodes connected by at least two edges. This quantity thus represents the proportion of ‘closed’ triplets. In other words, the GCC is the probability that two of my ‘friends’ are also ‘friends’ of each other. In general, this is a good quantitative expression of the clustering level in a network (see Wasserman and Faust 1994).
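A minimal sketch of how the GCC can be computed from an edge list is given below. It uses the standard transitivity formula, three times the number of triangles divided by the number of triplets, where triplets are counted as paths of length two centered at each vertex. The function name and the input convention (vertices labeled \(0, \ldots , n-1\), each undirected edge listed once, no self-loops) are our assumptions.

```python
def global_clustering_coefficient(n, edges):
    """Transitivity of an undirected simple graph given as a list of pairs (i, j)."""
    neighbors = [set() for _ in range(n)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    # A vertex of degree k is the center of k(k-1)/2 triplets.
    triplets = sum(len(nb) * (len(nb) - 1) // 2 for nb in neighbors)
    # Each triangle is seen once from each of its three edges.
    triangles = sum(len(neighbors[i] & neighbors[j]) for i, j in edges) // 3
    return 3 * triangles / triplets if triplets else 0.0


# A triangle with one pendant vertex: 1 triangle, 5 triplets.
print(global_clustering_coefficient(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))
```

A graph without triplets has no well-defined GCC; returning 0.0 in that case is a convention chosen here.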
Figure 2 illustrates the difference between a random geometric graph and an Erdős–Rényi graph in terms of the GCC and the number of triangles. For our experiments, we took \(p = \ln n/n\), a quite popular regime, usually called significantly sparse. Figure 2a, b show the average GCC (over 20 iterations) of G(n, p, d) and G(n, p) for \(n=5000\) and \(n = 10000\), respectively, and for \(\ln n \le d \le \ln ^{3.5}n\). As expected, the difference is large when d is relatively small. As d increases, the difference goes to 0, and for \(d = \ln ^3 n\) (617 and 781, respectively), it equals 0.002. Since the GCC of G(n, p) is simply p for large n, a GCC significantly higher than p gives reason to suppose that the network has an underlying geometry with small d. On the other hand, a GCC close to p means that a low-dimensional graph representation is most likely impossible.
Unfortunately, our results and the result of Bubeck et al. (2016) do not give the exact values of the constants for given n, and we cannot try them in practice. However, we can compare the growth rate of the number of triangles for different d. Figure 2c, d show how fast the number of triangles grows as \(n\rightarrow \infty\) for \(d = \ln n\) and \(d = \ln ^3 n\), respectively. The number of triangles in G(n, p), of course, does not depend on d and grows as \(\ln ^3 n\) in both plots. As for G(n, p, d), the number of triangles grows almost linearly in n for \(d = \ln n\), while \(d = \ln ^3 n\) gives the ‘no geometry’ situation of the corresponding Erdős–Rényi graph, up to a multiplicative factor.
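An experiment of this kind can be reproduced at a small scale with the sketch below. It is not the code used for Fig. 2; the function names, the seed and the values of n and d are ours. As a simplification, the geometric graph connects exactly the \(\mathrm {round}(p \left( {\begin{array}{c}n\\ 2\end{array}}\right) )\) pairs with the largest inner products, i.e. it uses an empirical threshold instead of the exact \(t_{p,d}\).

```python
import math
import random
from itertools import combinations


def geometric_graph_edges(n, p, d, rng):
    """Edges of a geometric graph on S^{d-1} with empirical edge density p."""
    points = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        points.append([c / norm for c in v])
    pairs = list(combinations(range(n), 2))
    # Sort pairs by decreasing inner product and keep the closest ones.
    pairs.sort(key=lambda ij: -sum(a * b for a, b in zip(points[ij[0]], points[ij[1]])))
    return pairs[: round(p * len(pairs))]


def erdos_renyi_edges(n, p, rng):
    """Edges of G(n, p): each pair is kept independently with probability p."""
    return [ij for ij in combinations(range(n), 2) if rng.random() < p]


def count_triangles(n, edges):
    """Number of triangles; each one is seen once from each of its three edges."""
    neighbors = [set() for _ in range(n)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    return sum(len(neighbors[i] & neighbors[j]) for i, j in edges) // 3


rng = random.Random(2020)
n = 300
p = math.log(n) / n
print("G(n, p):", count_triangles(n, erdos_renyi_edges(n, p, rng)))
for d in (3, round(math.log(n)), round(math.log(n) ** 3)):
    print(f"G(n, p, d={d}):", count_triangles(n, geometric_graph_edges(n, p, d, rng)))
```

A single run is noisy, so averaging over several seeds, as done for Fig. 2, gives a more reliable comparison.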
To conclude, we see a possible extension of this theoretical work in obtaining practical results with explicit bounds on the expected number of triangles. Such bounds would help to determine whether a network has an underlying geometry. Moreover, if this is the case, an interesting problem is to determine (perhaps in real-life tasks) the dimension of the underlying geometric space. This would help, for instance, to make embedding of big data sets more efficient by better choice of the embedding dimension.
Conclusion
As we have seen, high-dimensional random geometric graphs in the sparse regime fail to create really large communities. It would be natural to expect that these graphs do not differ in any way from Erdős–Rényi graphs; however, for \(d \ll \log ^3 n\), they show a rather high tendency for clustering (my ‘friends’ are connected with high probability). Is it true that in the opposite case G(n, p, d) and G(n, p) resemble each other? We believe in the conjecture stated in Bubeck et al. (2016), which proposes a positive answer to this question. Since a similar conjecture was proved in that work for the dense regime, the situation does not look hopeless. However, the technique applied in the dense regime cannot be easily extended to the sparse regime. Any result describing the total variation distance between G(n, p, d) and G(n, p) in this regime would be very interesting.
In the present paper and in Devroye et al. (2011), Bubeck et al. (2016), the case \(d \ge \log n\) is always considered. What happens if d grows at a lower pace? What is the value of the clique number? We do not have the answers for this regime. But, obviously, the theoretical framework may differ quite a lot from what we used in our work.
As for triangles, it is not hard to prove that the number of triangles can be approximated by the Poisson distribution with an appropriate parameter that is, of course, the expected number of triangles. Hence, we need sharper bounds on this quantity, especially in the case \(d \ll \log ^3 n\). We are convinced that the upper bound in Theorem 7 cannot be improved, and the statement holds true for \(\log n \ll d \ll \log ^2 n\).
For sure, apart from the description of cliques and communities, many properties of high-dimensional random geometric graphs remain unexplored: connectivity, the existence of the giant component, the chromatic number, to name but a few. But even for fixed d, the results describing these properties require quite complex methods, so we do not expect immediate breakthroughs in this direction.
The previous section presents some numerical results that already might be useful for practical purposes. Let us discuss some possible further work in this direction. Firstly, we believe that the cliques (especially triangles) might be useful for community detection in networks with a geometric structure. Secondly, we think that some of the ideas introduced in this paper can help determine if a network has an underlying geometry. The latter is important because if it is known that the nodes are embedded in some space, then one can hope to make a lower-dimensional representation of the network structure or to use its geometric properties (e. g., two distant nodes cannot have a common neighbor). However, this requires more accurate bounds on the number of triangles with explicit constants. We did not pursue this goal and concentrated only on asymptotic results. Finally, the results obtained above can be helpful for the investigation of possible multiple correlations in data sets with a very large number of features.
Availability of data and materials
Not applicable.
Abbreviations
- CDF: Cumulative distribution function
- GCC: Global clustering coefficient
References
Albert R, Barabási A-L (1999) Emergence of scaling in random networks. Science 286(5439):509–512. https://doi.org/10.1126/science.286.5439.509
Albert R, Barabási A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47–97. https://doi.org/10.1103/RevModPhys.74.47
Alon N, Spencer J (2004) The probabilistic method. Wiley, New York
Appel M, Russo R (2002) The connectivity of a graph on uniform points on [0, 1]^d. Stat Probabil Lett 60(4):351–357. https://doi.org/10.1016/S0167-7152(02)00233-X
Arias-Castro E, Bubeck S, Lugosi G (2015) Detecting positive correlations in a multivariate sample. Bernoulli 21(1):209–241. https://doi.org/10.3150/13-BEJ565
Avrachenkov K, Bobu A (2019) Cliques in high-dimensional random geometric graphs. In: Cherifi H, Gaito S, Mendes J, Moro E, Rocha L (eds) Complex Networks and Their Applications VIII. Complex networks 2019. Springer, Berlin, pp 591–600. https://doi.org/10.1007/978-3-030-36687-2_49
Baccelli F, Błaszczyszyn B (2010) Stochastic geometry and wireless networks: Volume II applications. Found Trends Network 4(1–2):1–312. https://doi.org/10.1561/1300000026
Barthélemy M (2011) Spatial networks. Phys Rep 499(1–3):1–101. https://doi.org/10.1016/j.physrep.2010.11.002
Blumenson L (1960) A derivation of n-dimensional spherical coordinates. Am Math Mon 67(1):63–66. https://doi.org/10.2307/2308932
Bollobás B (2001) Random graphs. Cambridge University Press, Cambridge, UK. https://doi.org/10.1017/CBO9780511814068
Brieden A, Gritzmann P, Kannan R, Klee V, Lovász L, Simonovits M (2001) Deterministic and randomized polynomial-time approximation of radii. Mathematika 48(1–2):63–105. https://doi.org/10.1112/S0025579300014364
Bubeck S, Ding J, Eldan R, Rácz M (2016) Testing for high-dimensional geometry in random graphs. Rand Struct Algorithms 49(3):503–532. https://doi.org/10.1002/rsa.20633
Devroye L, György A, Lugosi G, Udina F (2011) High-dimensional random geometric graphs and their clique number. Electron J Probabil 16:2481–2508. https://doi.org/10.1214/EJP.v16-967
Erdős P, Rényi A (1960) On the evolution of random graphs. Publ Math Inst Hung Acad Sci Ser A 5(1):17–60
Fronczak A, Fronczak P, Hołyst J (2003) Mean-field theory for clustering coefficients in Barabási-Albert networks. Phys Rev E 68(4):046126. https://doi.org/10.1103/PhysRevE.68.046126
Gilbert E (1961) Random plane networks. J Soc Ind Appl Math 9(4):533–543
Haenggi M, Andrews J, Baccelli F, Dousse O, Franceschetti M (2009) Stochastic geometry and random graphs for the analysis and design of wireless networks. IEEE J Sel Areas Commun 27(7):1029–1046. https://doi.org/10.1109/JSAC.2009.090902
Higham D, Rašajski M, Pržulj N (2008) Fitting a geometric graph to a protein-protein interaction network. Bioinformatics 24(8):1093–1099. https://doi.org/10.1093/bioinformatics/btn079
Krioukov D, Papadopoulos F, Kitsak M, Vahdat A, Boguñá M (2010) Hyperbolic geometry of complex networks. Phys Rev E 82(3):036106. https://doi.org/10.1103/PhysRevE.82.036106
Lee Y, Kim W (2014) Concise formulas for the surface area of the intersection of two hyperspherical caps. KAIST Technical Report
McDiarmid C (2003) Random channel assignment in the plane. Rand Struct Algorithms 22(2):187–212. https://doi.org/10.1002/rsa.10077
McDiarmid C, Müller T (2011) On the chromatic number of random geometric graphs. Combinatorica 31(4):423–488. https://doi.org/10.1007/s00493-011-2403-3
McPherson M (2004) A Blau space primer: prolegomenon to an ecology of affiliation. Ind Corp Change 13(1):263–280. https://doi.org/10.1093/icc/13.1.263
Müller T (2008) Two-point concentration in random geometric graphs. Combinatorica 28(5):529. https://doi.org/10.1007/s00493-008-2283-3
Nekovee M (2007) Worm epidemics in wireless ad hoc networks. New J Phys 9(6):189. https://doi.org/10.1088/1367-2630/9/6/189
Penrose M (1999) On k-connectivity for a geometric random graph. Rand Struct Algorithms 15(2):145–164
Penrose M (2003) Random geometric graphs, vol 5. Oxford University Press, Oxford. https://doi.org/10.1093/acprof:oso/9780198506263.001.0001
Pottie G, Kaiser W (2000) Wireless integrated network sensors. Commun ACM 43(5):51–58. https://doi.org/10.1145/332833.332838
Preciado V, Jadbabaie A (2009) Spectral analysis of virus spreading in random geometric networks. In: Proceedings of the 48th IEEE Conference on Decision and Control (CDC) Held Jointly with 2009 28th Chinese Control Conference, pp. 4802–4807. https://doi.org/10.1109/CDC.2009.5400615. IEEE
Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511815478
Xiao H, Yeh E (2011) Cascading link failure in the power grid: A percolation-based analysis. In: 2011 IEEE International Conference on Communications Workshops (ICC), pp. 1–6. https://doi.org/10.1109/iccw.2011.5963573. IEEE
Acknowledgements
Not applicable.
Funding
The authors have no funding sources to acknowledge for this study.
Author information
Contributions
KA suggested the principal ideas and the topic of research. AB designed the research. KA and AB wrote, reviewed and revised the manuscript. Both authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Avrachenkov, K.E., Bobu, A.V. Cliques in high-dimensional random geometric graphs. Appl Netw Sci 5, 92 (2020). https://doi.org/10.1007/s41109-020-00335-6
DOI: https://doi.org/10.1007/s41109-020-00335-6
Keywords
- Random geometric graphs
- High dimension
- Clique number
- Triangles