L^γ-PageRank for semi-supervised learning

Bautista, Esteban; Abry, Patrice; Gonçalves, Paulo

doi:10.1007/s41109-019-0172-x

Published: 19 August 2019


Applied Network Science volume 4, Article number: 57 (2019)


Abstract

PageRank for Semi-Supervised Learning has been shown to leverage data structures and limited tagged examples to yield meaningful classification. Despite successes, classification performance can still be improved, particularly in cases of graphs with unclear clusters or unbalanced labeled data. To address such limitations, a novel approach based on powers of the Laplacian matrix L^γ (γ>0), referred to as L^γ-PageRank, is proposed. Its theoretical study shows that it operates on signed graphs, where nodes belonging to the same class are more likely to share positive edges while nodes from different classes are more likely to be connected by negative edges. It is shown that by selecting an optimal γ, classification performance can be significantly enhanced. A procedure for the automated estimation of the optimal γ, from a single observation of the data, is devised and assessed. Experiments on several datasets demonstrate the effectiveness of both L^γ-PageRank classification and the optimal γ estimation.

Introduction

Context

Graph-based Semi-Supervised Learning (G-SSL) is an important modern tool for classification. While Unsupervised Learning fully relies on the data structure and Supervised Learning demands extensive labeled examples, G-SSL combines limited tagged examples and the data structure to provide satisfactory results. This makes the field of G-SSL of utmost importance, as nowadays large and structured datasets can be readily accessed in comparison to expert data, which may be hard to obtain. Examples where G-SSL provides state-of-the-art results are numerous, ranging from classification of BitTorrent contents and users (Avrachenkov et al. 2012a), text categorization (Subramanya and Bilmes 2008) and medical diagnosis (Zhao et al. 2014) to zombie hunting under the BGP protocol (Fontugne et al. 2019). Algorithmically, PageRank constitutes the reference tool in G-SSL. It has spurred a deluge of theory (Chung 2010; Avrachenkov et al. 2018; Litvak et al. 2009; Chung 2007), applications (Avrachenkov et al. 2008; Graham et al. 2009; Avrachenkov et al. 2012a; Fontugne et al. 2019) and implementations (Andersen et al. 2007; Andersen and Chung 2007). Despite successes, the performance of G-SSL can still be improved, particularly for graphs with unclear clusters or imbalanced labeled datasets, two situations that we aim to address in this work.

Related works

In graphs, a ground truth class is represented by a subset of graph nodes, denoted S_gt. Thus, in graphs, the classification challenge corresponds to finding the binary partition of the graph vertices: $\mathcal {V} = S_{gt}\cup S_{gt}^{c}$. If the data is structured, then S_gt forms a cluster, i.e., a densely and strongly connected graph region that is weakly connected to the rest of the graph. This is exploited by G-SSL methods, which essentially diffuse information placed on the tagged nodes of S_gt through the graph, expecting a concentration of information in S_gt that reveals its members. Among the family of G-SSL propositions obeying this rationale (Zhou et al. 2004; Zhou and Burges 2007; Avrachenkov et al. 2012b), PageRank is considered the state-of-the-art approach in terms of performance, algorithms and theoretical understanding. The PageRank algorithm can be interpreted as random walkers that start from the labeled points and, at each step, diffuse to an adjacent node with probability α or restart at the starting point with probability (1−α). In the limit (of infinite steps), each node is assigned a score proportional to the number of visits to it. Thus, vertices of S_gt are expected to get larger scores as walkers get trapped for a long time by the connected structure of S_gt. The capacity of PageRank to confine the random walks within S_gt depends on a topological parameter of S_gt known as the Cheeger ratio, or conductance, which measures the ratio of external to internal connections of S_gt. More precisely, it is shown in Andersen et al. (2007) that the probability of a PageRank random walker leaving S_gt is upper bounded by the Cheeger ratio of S_gt. In other terms, a small Cheeger ratio designates a cluster that is well separated from the rest of the graph and that PageRank can therefore detect easily. Based on the scores, a binary partitioning via a sweep-cut procedure retrieves an estimate $\hat {S}_{gt}$. This procedure is guaranteed to return an estimate $\hat {S}_{gt}$ with a small Cheeger ratio if a sharp drop in magnitude appears in the sorted scores, in which case $\hat {S}_{gt}$ is potentially a good estimate of the ground truth S_gt (Andersen and Chung 2007). In Zhou and Belkin (2011), an issue affecting G-SSL methods, coined the ‘curse of flatness’, was highlighted. That work proposes to extend PageRank by iterating the random walk Laplacian in the PageRank solution, as a means to enforce Sobolev regularity on the vertex scores and amend the aforementioned problem. However, with this approach, guarantees that a sweep-cut still leads to a meaningful clustering remain unproven, and it admits neither a diffusion nor a topological interpretation, which prevents insight into the properties and quality of the partitions it retrieves. This makes it hard to build upon and to address the issues listed above.

Goals, contributions and outline

In this work, we revisit Laplacian powers as a way to improve G-SSL and to address the issues listed above. As our first contribution, we propose a generalization of PageRank that uses (not necessarily integer) powers of the combinatorial Laplacian matrix L^γ (γ>0). We call this generalization the L^γ-PageRank method. In contradistinction to Zhou and Belkin (2011), our approach (i) enables us to have an explicit closed form expression of the underlying optimization problem (see the L^γ-PageRank definition and solution in Definition 4 and Lemma 4, respectively); (ii) permits a diffusion and a topological interpretation (see the L^γ-PageRank properties in Lemmas 5 and 6). In our approach, we show that, for each γ, a new graph is generated. These new graphs, which we refer to as L^γ-graphs, reweight the links of the original structure and create edges, which can be positive or negative, between initially far-distant nodes. This topological change has the potential to improve classification, as the signed edges introduce what can be seen as agreements (positive edges) or disagreements (negative edges) between nodes, so that clusters can be recast as groups of nodes agreeing among themselves and disagreeing with the rest of the graph. This paper investigates the potential of these L^γ-graphs to better delineate a targeted S_gt, compared to PageRank. The theoretical analysis of our proposition allows us to extend the Cheeger ratio to L^γ-graphs and to prove that if there is an L^γ-graph in which S_gt has a smaller Cheeger ratio, then we can identify it more accurately with our generalized L^γ-PageRank procedure using the sweep-cut technique. Then, by means of numerical investigations, we point out the existence of an optimal γ value that maximizes performance. As a second contribution, we propose an algorithm that estimates the optimal γ directly from the initial graph and the labeled points. Lastly, we demonstrate the classification improvements permitted by L^γ-PageRank on several real world datasets commonly used in classification, as well as the relevance of the estimation procedure for the optimal tuning.

The paper is organized as follows: “State of the art” section sets definitions and recalls classical results on G-SSL. “L^γ-PageRank for semi-Supervised learning” section presents the main contributions of the paper: “The L^γ-graphs” section introduces L^γ-graphs; “The L^γ-PageRank method” section defines L^γ-PageRank and its theoretical analysis; “The selection of γ” section discusses the existence of an optimal γ and its estimation. “L^γ-PageRank in practice” section evaluates L^γ-PageRank and the algorithm for the optimal γ estimation in practice.

State of the art

Preliminaries

Let $\mathcal {G}(\mathcal {V},\mathcal {E}, w)$ denote a weighted undirected graph with no self-loops in which: $\mathcal {V}$ refers to the set of vertices of cardinality $|\mathcal {V}| = N$; $\mathcal {E}$ denotes the set of edges, where a connected pair $u,v \in \mathcal {V}$, denoted u∼v, implies $(u,v), (v,u) \in \mathcal {E}$; and $w : \mathcal {E} \to \mathbb {R}^{+}$ is a weight function. The graph adjacency matrix is denoted by W in which W_uv=w(u,v) if u∼v and W_uv=0 otherwise. For a vertex $u \in \mathcal {V}$ we let $d_{u} = \sum \nolimits _{v} W_{uv}$ denote the degree of u and D=diag(d₁,…,d_N) be the diagonal matrix of degrees. Let Δ_uv denote the geodesic distance between u and v. Given a set of nodes $S \subseteq \mathcal {V}$, we denote by $\mathbbm {1}_{S}$ the indicator function of such set, meaning that $\left ({\mathbbm {1}_{S}}\right)_{u} = 1$ if u∈S and $\left ({\mathbbm {1}_{S}}\right)_{u} = 0$ otherwise. The volume of S is defined to be $vol(S) = \sum \nolimits _{u \in S} d_{u}$. We refer to the volume of the entire graph by vol(G). Let $f : \mathcal {V} \to \mathbb {R}$ be a signal lying on the graph vertices. Graph signals are represented as column vectors, where f_u refers to the signal value at node u. The sum of signal values in the set S is denoted by $f(S) = \sum \nolimits _{u \in S} f_{u}$. We denote by L=D−W the combinatorial graph Laplacian which, by construction, is a real symmetric matrix with eigendecomposition of the form L=QΛQ^T. The positivity of the Dirichlet form $f^{T} L f = \sum \nolimits _{u \sim v} W_{uv}(f_{u} - f_{v})^{2} \geq 0$ implies that L has real non-negative eigenvalues.

A random walk on a graph is a Markov chain where the nodes form the state space. Thus, when a walker is located at a node u at a specific time t, at time step t+1 the walker moves to a neighbor v with probability P_uv, where P=D⁻¹W. If the graph signal χ represents the distribution for the random walk starting point, then the signal x^T=χ^TP^t denotes the distribution of the walker position at time t. Independently of the starting distribution, if the graph is connected and not bipartite, the random walk converges to a stationary distribution π^T=π^TP, where π_u=d_u/vol(G).
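As a small numerical illustration of the convergence statement above, the following sketch builds P=D⁻¹W for a toy graph and iterates x^T ← x^T P. The graph and the number of iterations are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy weighted undirected graph (illustrative): a triangle 0-1-2 plus a pendant node 3.
# It is connected and not bipartite, so the walk converges.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 2],
              [0, 0, 2, 0]], dtype=float)
d = W.sum(axis=1)          # degrees d_u
P = W / d[:, None]         # transition matrix P = D^{-1} W
pi = d / d.sum()           # claimed stationary distribution, pi_u = d_u / vol(G)

# Start every walker at node 0 and iterate x^T <- x^T P.
x = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(200):
    x = x @ P

print(np.allclose(x, pi))  # the walk forgets its starting point
```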

Clustering is the search of groups of nodes that are strongly connected between them and weakly connected to the rest of the graph. The Cheeger ratio is a metric that counts the ratio of external and internal connections of a group of nodes, thus assessing its pertinence as a cluster, while penalizing uninteresting solutions that may fit the cluster criteria, like isolated nodes linked by a few edges. It is defined as follows.

Definition 1

For a set of nodes $S \subseteq \mathcal {V}$, the Cheeger ratio, or conductance, of S is defined as:

$$ h_{S} := \frac{ \sum\nolimits_{u \in S} \sum\nolimits_{v \in S^{c}} W_{uv} }{ \min \{ vol(S), vol(S^{c}) \} }. $$

(1)

Thus, we define clustering as finding the binary partition of the graph vertices: $\mathcal {V} = S \cup S^{c}$ such that S has low h_S.
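To make Definition 1 concrete, here is a minimal sketch that evaluates h_S on a toy graph made of two triangles joined by one edge. The function name `cheeger_ratio` and the graph are illustrative choices, not part of the paper.

```python
import numpy as np

def cheeger_ratio(W, S):
    """Conductance h_S of node set S (Eq. 1) for a symmetric weight matrix W."""
    in_S = np.zeros(W.shape[0], dtype=bool)
    in_S[list(S)] = True
    cut = W[in_S][:, ~in_S].sum()              # weight of edges leaving S
    d = W.sum(axis=1)
    return cut / min(d[in_S].sum(), d[~in_S].sum())

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

h_cluster = cheeger_ratio(W, [0, 1, 2])   # one cut edge over volume 7
h_single = cheeger_ratio(W, [0])          # two cut edges over volume 2
```

On this graph the triangle {0,1,2} has h_S = 1/7, whereas the single node {0} has h_S = 1, which illustrates how the volume normalization penalizes trivial sets.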

PageRank-based semi-Supervised learning

Let $\mathcal {V}_{S_{gt}} \subseteq S_{gt}$ denote the set of nodes tagged as belonging to the ground truth S_gt and let y be the indicator function of $\mathcal {V}_{S_{gt}}$, i.e. y_u=1 if node $u \in \mathcal {V}_{S_{gt}}$ and y_u=0 otherwise. The PageRank G-SSL is defined as the solution to the optimization problem (Avrachenkov et al. 2012b):

$$ \underset{f}{\text{arg min}} \left\{ f^{T} D^{-1}L D^{-1} f + \mu \left(f - y\right)^{T} D^{-1} \left(f-y \right)\right\}. $$

(2)

Optimization problem (2) can be seen as the search of a smooth graph signal in the sense that strongly connected nodes should have similar values (left term), while the labeled data is respected (right term), and a regularization parameter μ tunes the trade off between both terms. Notably, problem (2) is convex with closed form solution given by Avrachenkov et al. (2012b):

$$ f = \mu \left(LD^{-1} + \mu \mathbb{I} \right)^{-1} y. $$

(3)

We present the PageRank solution in this form as it will simplify derivations in the remainder of the paper, but it is not hard to rewrite (3) in its more popular version: $f^{T} =(1 - \alpha) \sum \nolimits _{k = 0}^{\infty } \alpha ^{k} y^{T} P^{k}$ where α=1/(1+μ). The latter form helps to expose the connection between PageRank and diffusion processes. Namely, it corresponds to the equilibrium state of a random walk that decides either to continue with probability α, or to restart from the starting distribution y with probability (1−α). As $y = \sum \nolimits _{u \in \mathcal {V}_{S_{gt}}} \delta _{u}$ is the combination of different random walk initial distributions, it is clear that the PageRank score assigned to a particular node is (up to re-normalization) the probability of finding a walker that started from the labels, at equilibrium, at this node. PageRank diffusion satisfies the following properties (Tsiatas 2012): (i) mass preservation: $\sum \nolimits _{u \in \mathcal {V}} f_{u} = \sum \nolimits _{u \in \mathcal {V}} y_{u}$; (ii) stationarity: f=π if y=π; and (iii) limit behavior: f→π as μ→0 and f→y as μ→∞.
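The equivalence between the closed form (3) and the random-walk series can be checked numerically. The sketch below is only an illustration: the helper name `pagerank_ssl`, the toy graph, the value of μ and the truncation at 500 terms are assumptions made for the example.

```python
import numpy as np

def pagerank_ssl(W, y, mu):
    """Closed-form PageRank G-SSL scores, f = mu (L D^{-1} + mu I)^{-1} y (Eq. 3)."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    A = L / d[None, :] + mu * np.eye(len(d))   # L D^{-1} divides column v by d_v
    return mu * np.linalg.solve(A, y)

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3); node 0 is labeled.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0
y = np.zeros(6)
y[0] = 1.0
mu = 0.5
f = pagerank_ssl(W, y, mu)

# Truncated series (1 - alpha) * sum_k alpha^k y^T P^k with alpha = 1 / (1 + mu).
alpha = 1.0 / (1.0 + mu)
P = W / W.sum(axis=1)[:, None]
term, series = y.copy(), np.zeros(6)
for _ in range(500):
    series += (1.0 - alpha) * term
    term = alpha * (term @ P)                  # next term: alpha^k y^T P^k

print(np.allclose(f, series), f.sum())         # same vector, and mass is preserved
```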

In Andersen et al. (2007), it is shown that the behavior of this type of random walks is tightly related to the cluster structure of graphs. This connection between PageRank and clustering is quantified in the following result.

Lemma 1

Andersen et al. (2007) Let $S \subset \mathcal {V}$ be an arbitrary set with vol(S)≤vol(G)/2. For a labeled point placed at a node u∈S selected with probability proportional to its degree in S, i.e. d_u/vol(S), the PageRank satisfies:

$$ \mathbb{E} [ f(S^{c}) ] \leq \frac{h_{S}}{\mu}. $$

(4)

This lemma implies that if we apply PageRank diffusion to the labels of S_gt and it has a small $h_{S_{gt}}$, then the probability of finding a walker outside S_gt is small and the nodes with largest PageRank value should index S_gt. This is formalized in Andersen et al. (2007) and Andersen and Chung (2007). The former shows that a proxy $\hat {S}_{gt}$ that has small $h_{\hat {S}_{gt}}$ can be found by looking for regions of high concentration of PageRank mass. The latter improves that result, showing that $\hat {S}_{gt}$ can be found more easily by looking for a sharp drop in the PageRank scores. To state their result, we first introduce the sweep-cut technique.

Definition 2

A sweep-cut is a procedure to retrieve a partition $\mathcal {V} = \hat {S}_{gt} \cup \hat {S}^{c}_{gt}$ from the PageRank vector. The procedure is as follows:

Let v₁,…,v_N be a rearrangement of the vertices in descending order, so that the permutation vector q satisfies $q_{v_{i}} = f_{v_{i}}/d_{v_{i}} \geq q_{v_{i+1}} = f_{v_{i+1}}/d_{v_{i+1}}$ ;
Let S_j={v₁,…,v_j} be the set of vertices indexed by the first j elements of q ;
Let $\tau (f) = \min _{j} h_{S_{j}}$ ;
Retrieve $\hat {S}_{gt} = S_{j}$ for the set S_j achieving τ(f).
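The four steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sweep_cut`, the toy graph and μ=0.5 are assumptions, and the trivial set S_N (the whole vertex set) is skipped because its Cheeger ratio is undefined.

```python
import numpy as np

def sweep_cut(W, f):
    """Sweep-cut of Definition 2: returns (S_hat, tau(f))."""
    d = W.sum(axis=1)
    order = np.argsort(-f / d, kind="stable")   # vertices by decreasing q = f / d
    vol_G = d.sum()
    in_S = np.zeros(len(d), dtype=bool)
    best_h, best_j = np.inf, 0
    for j in range(len(d) - 1):                 # assumption: skip S_N = V
        in_S[order[j]] = True                   # S_j = first j + 1 ranked vertices
        cut = W[in_S][:, ~in_S].sum()
        vol_S = d[in_S].sum()
        h = cut / min(vol_S, vol_G - vol_S)
        if h < best_h:
            best_h, best_j = h, j
    return set(order[:best_j + 1].tolist()), best_h

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3); node 0 is labeled.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0
d = W.sum(axis=1)
mu = 0.5
y = np.zeros(6)
y[0] = 1.0
f = mu * np.linalg.solve((np.diag(d) - W) / d[None, :] + mu * np.eye(6), y)

S_hat, tau = sweep_cut(W, f)
print(S_hat, tau)   # recovers the labeled triangle
```

Recomputing the cut from scratch for each prefix keeps the sketch short; an incremental update of the cut and volume would be preferable on large graphs.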

Now, we state the result of Andersen and Chung (2007), showing that if there is a sharp drop in rank at S_j, then the set S_j has small Cheeger ratio.

Lemma 2

Andersen and Chung (2007) Let h∈(0,1), j be any index in [1,N] and α∈(0,1] denote the PageRank restarting probability. Let $C(S_{j},S_{j}^{c}) = \sum \nolimits _{u \in S_{j}} \sum \nolimits _{v \in S_{j}^{c}} W_{uv}$ be the numerator of the Cheeger ratio. Then, S_j satisfies one of the following: (a) $C(S_{j},S_{j}^{c}) < 2h\, vol(S_{j})$; or (b) there is some index k>j such that vol(S_k)≥vol(S_j)(1+h) and $q_{k} \geq q_{j} - \alpha / (h\, vol(S_{j}))$.

In other words, this lemma implies that either S_j has a small Cheeger ratio, or there is no sharp drop at q_j.

Generalization to multiple classes

PageRank G-SSL can be readily generalized to a multi-class setting in which labeled points of K classes are used to find a partition $\mathcal {V} = S_{1} \cup S_{2} \cup \dots \cup S_{K}$. Let $\mathcal {V}_{S_{k}}$ denote the labeled points of class k and the indicator function of $\mathcal {V}_{S_{k}}$ be placed as the k-th column of a matrix Y. Then, the multi-class PageRank is computed in matrix form as (Sokol 2014): $\text {arg min}_{F} \left \{ F^{T} D^{-1} L D^{-1} F + \mu (F - Y)^{T} D^{-1} (F - Y) \right \}$, with classification matrix given in closed form by $F = \mu \left (LD^{-1} + \mu \mathbb {I} \right)^{-1} Y$. This leads a node u to have K associated scores, and it is assigned to the cluster k satisfying $\text {arg max}_{k} F_{uk}$. In Sokol (2014), the following rule explaining the classification is provided: let pr_uv denote the probability that a random walk reaches node v before restarting to node u, then v is assigned to the class k that satisfies the inequality:

$$ \sum\limits_{u \in \mathcal{V}_{S_{k}}} pr_{uv} \geq \sum\limits_{w \in \mathcal{V}_{S_{k'}}} pr_{wv}, ~\hspace{5pt} \forall k' \neq k. $$

(5)

This inequality highlights an important issue of the multi-class approach as the sums depend on the cardinality of the sets of labeled points. Thus, cases of unbalanced number of labeled points can potentially bias the classification.
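The multi-class rule can be illustrated with a short sketch in which each class contributes one column of Y and every node takes the class with the largest score. The toy graph, the choice of one label per class and μ=0.5 are assumptions made for the example.

```python
import numpy as np

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0
d = W.sum(axis=1)
L = np.diag(d) - W
mu = 0.5

# One labeled node per class: node 0 for class 0, node 5 for class 1.
Y = np.zeros((6, 2))
Y[0, 0] = 1.0
Y[5, 1] = 1.0

# F = mu (L D^{-1} + mu I)^{-1} Y, solved for all classes at once.
F = mu * np.linalg.solve(L / d[None, :] + mu * np.eye(6), Y)
labels = F.argmax(axis=1)      # assign node u to arg max_k F_uk
print(labels)
```

With balanced labels each triangle is assigned to its own class. Adding more labeled nodes to one class increases the mass of that column of Y, which is the source of the bias discussed above.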

L^γ-PageRank for semi-Supervised learning

The L^γ-graphs

In this work, we propose to change the graph topology in which the problem is solved as a means to improve classification. We induce such a change by considering powers of the Laplacian matrix, noting that the L^γ operator, for γ>0, generates a new graph for every fixed γ value. More precisely, the Laplacian definition indicates that L^γ=QΛ^γQ^T=D_γ−W_γ codes for a new graph, where [D_γ]_uu=[L^γ]_uu refers to a generalized degree matrix and [W_γ]_uv=−[L^γ]_uv, with u≠v, to a generalized adjacency matrix that satisfies the Laplacian property $\left [D_{\gamma }\right ]_{uu} = \sum \nolimits _{v} \left [W_{\gamma } \right ]_{uv}$ since $L^{\gamma } \mathbbm {1} = 0$. We refer to such graphs as L^γ-graphs.

The L^γ-graphs reweight the edges of the original structure and create links between originally far-distant nodes. Indeed, for $\gamma \in \mathbb {Z}_{> 0}$ the new edges can be related to paths of different lengths. To have a grasp on this, let us take the topology from γ=2 as an example: for L²=(D−W)²=D²+W²−(DW+WD), the elements of the emanating graph are given as $[D_{2}]_{uu} = [L^{2}]_{uu} = D_{uu}^{2} +\sum \nolimits _{v} W_{uv}^{2}$ and $\left [W_{2}\right ]_{uv} = - [L^{2}]_{uv} = (D_{uu}+D_{vv})W_{uv} - \sum \nolimits _{l\neq u,v} W_{ul}W_{lv}$, showing that, in W₂, nodes originally connected get their link reweighted (while remaining positive), whereas those at a 2-hop distance become linked by a negatively weighted edge.
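The construction of an L^γ-graph can be sketched through the eigendecomposition of L. The function name `laplacian_power` and the path graph used to check the γ=2 formulas are illustrative assumptions; eigenvalues are clipped at zero only to remove round-off.

```python
import numpy as np

def laplacian_power(W, gamma):
    """Return (L_gamma, D_gamma, W_gamma), with L^gamma = Q Lambda^gamma Q^T = D_gamma - W_gamma."""
    L = np.diag(W.sum(axis=1)) - W
    lam, Q = np.linalg.eigh(L)              # L is real symmetric
    lam = np.clip(lam, 0.0, None)           # remove tiny negative round-off
    L_g = (Q * lam**gamma) @ Q.T
    D_g = np.diag(np.diag(L_g))             # generalized degrees on the diagonal
    return L_g, D_g, D_g - L_g              # generalized adjacency W_gamma

# Path graph 0 - 1 - 2 - 3 with unit weights.
W = np.zeros((4, 4))
for u in range(3):
    W[u, u + 1] = W[u + 1, u] = 1.0

L2, D2, W2 = laplacian_power(W, 2.0)
print(np.round(W2, 6))
```

On this path, the original edge (0,1) is reweighted to (d_0+d_1)W_01 = 3, the 2-hop pair (0,2) receives the weight −1, and the 3-hop pair (0,3) stays unconnected, in line with the expressions for [W₂]_uv above.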

This change in the topology has the potential to impact clustering, as the emergence of positive and negative edges opens the door for an interpretation in terms of an agreement (positive edge) or a disagreement (negative edge) between datapoints. Hence, clustering can be recast to assume that nodes agreeing should belong to the same cluster and nodes disagreeing should belong to different ones. From this perspective, revisiting the case of γ=2 shows that this is indeed a potentially good topology since, for several graphs, it is more likely that vertices at a 2-hop distance lie in different clusters than in the same one, thus creating a considerable number of disagreements between clusters, which may enhance their separability. This idea is illustrated in Fig. 1, where for a realization of the planted partition model we show that with γ=2 a large number of negative edges appear between clusters.

Thus, in the remainder of the paper we investigate whether, for a target set of nodes S_gt, the detection of S_gt can be enhanced by solving the clustering problem in some of these new graphs.

Remark 1

The graphs emerging in the regime 0<γ<1 have already been studied in Pérez Riascos and Mateos (2014); de Nigris et al. (2017); Bautista et al. (2017), where it is shown that such graphs remain within the class of graphs with only positive edges, hence preserving the random walk framework. In these works, the graphs were shown to embed the so-called Lévy flight, permitting random walkers to perform long-distance jumps in a single step.

The L^γ-PageRank method

The signed graphs emerging from L^γ preclude the employment of the random walk-based approaches to find clusters, as ‘negative transitions’ appear. Thus, the L^γ-graphs call for a technique to find clusters in such graphs. In this subsection, we introduce L^γ-PageRank, a generalization of PageRank to find clusters on the L^γ-graphs. Further, we analyze the L^γ-PageRank theoretical properties and clustering capabilities.

For our analysis, it is useful to first extend some of the graph topological definitions to the L^γ-graphs. Let $vol_{\gamma }(S) = \sum \nolimits _{u \in S} \left [D_{\gamma } \right ]_{uu}$ denote the generalized volume of S. Let π_γ denote a generalized stationary distribution with entries given by (π_γ)_u=[D_γ]_uu/vol_γ(G). It is important to stress that $[D_{\gamma }]_{uu} = \sum \nolimits _{k} \lambda _{k}^{\gamma } Q_{uk}^{2} \geq 0$. Thus, for all γ>0, the generalized volume and the generalized stationary distribution are non-negative quantities.

The Cheeger ratio metric, lacking the ability to account for the sign of edges, cannot be employed to assess the presence of clusters in the L^γ-graphs. Thus, we generalize the Cheeger ratio definition to the new graphs as follows.

Definition 3

For a set of nodes $S \subseteq \mathcal {V}$, the generalized Cheeger ratio, or generalized conductance, of S is defined as:

$$ h_{S}^{(\gamma)} = \frac{ \sum\nolimits_{u \in S} \sum\nolimits_{v \in S^{c}} \left[W_{\gamma}\right]_{uv} }{ \min \left(vol_{\gamma}(S), vol_{\gamma}(S^{c}) \right) }{.} $$

(6)

This generalization of the Cheeger ratio is mathematically sound. First, it is a non-negative quantity since $\sum \nolimits _{u \in S} \sum \nolimits _{v \in S^{c}} \left [W_{\gamma }\right ]_{uv} = \mathbbm {1}_{S}^{T} L^{\gamma } \mathbbm {1}_{S} \geq 0$. Second, the set S attaining the minimum value coincides with a sensible clustering. To show the latter, let the edges in W_γ be split according to their sign as W_γ=W_γ⁺+W_γ⁻. Let ${ \mathcal {A}^{(\gamma)}_{in}(S)} = \sum \nolimits _{u \in S}\sum \nolimits _{w \in S} |\left [W_{\gamma }^{+}\right ]_{uw}|$ be the sum of agreements within S, ${\mathcal {A}^{(\gamma)}_{out}(S)} = \sum \nolimits _{u \in S}\sum \nolimits _{v \in S^{c}} |\left [W_{\gamma }^{+}\right ]_{uv}|$ the agreements between S and S^c, ${ \mathcal {D}^{(\gamma)}_{in}(S)} = \sum \nolimits _{u \in S}\sum \nolimits _{w \in S} |\left [ W_{\gamma }^{-}\right ]_{uw}|$ the disagreements within S, and ${ \mathcal {D}^{(\gamma)}_{out}(S)} = \sum \nolimits _{u \in S}\sum \nolimits _{v \in S^{c}} |\left [ W_{\gamma }^{-} \right ]_{uv}|$ the disagreements between S and S^c. Then we state the following lemma.

Lemma 3

Let $S^{*} = \text {arg min}_{S} h_{S}^{(\gamma)} ~\forall S$ s.t. vol_γ(S)≤vol_γ(G)/2. Then, S^∗ is the set that attains the best balance of:

Maximal $\mathcal {D}^{(\gamma)}_{out}(S^{*})$ and $\mathcal {A}^{(\gamma)}_{in}(S^{*})$;
Minimal $\mathcal {D}^{(\gamma)}_{in}(S^{*})$ and $\mathcal {A}^{(\gamma)}_{out}(S^{*})$.

The proof is provided in Appendix Proof of Lemma 3.

Lemma 3 shows that, for a given L^γ-graph, sets of small generalized Cheeger ratio are bound to have strong between-cluster disagreements and strong within-cluster agreements as well as small between-cluster agreements and small within-cluster disagreements, thus coinciding with our definition of clusters in signed graphs.
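The quantities entering Definition 3 and Lemma 3 can be computed directly. The sketch below is illustrative: the function name `generalized_cut_terms` and the toy graph are assumptions, and only the boundary terms A_out and D_out are evaluated.

```python
import numpy as np

def generalized_cut_terms(W, S, gamma):
    """Return (h, A_out, D_out) for node set S on the L^gamma-graph (Eq. 6)."""
    in_S = np.zeros(W.shape[0], dtype=bool)
    in_S[list(S)] = True
    lam, Q = np.linalg.eigh(np.diag(W.sum(axis=1)) - W)
    L_g = (Q * np.clip(lam, 0.0, None) ** gamma) @ Q.T
    d_g = np.diag(L_g)                        # generalized degrees
    W_g = np.diag(d_g) - L_g                  # signed generalized adjacency
    across = W_g[in_S][:, ~in_S]              # edges between S and its complement
    A_out = across[across > 0].sum()          # agreements leaving S
    D_out = -across[across < 0].sum()         # disagreements leaving S
    h = (A_out - D_out) / min(d_g[in_S].sum(), d_g[~in_S].sum())
    return h, A_out, D_out

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

h1, A1, D1 = generalized_cut_terms(W, [0, 1, 2], 1.0)
h2, A2, D2 = generalized_cut_terms(W, [0, 1, 2], 2.0)
print(h1, h2, A2, D2)
```

For this graph, γ=2 turns the single bridge into one positive edge of weight 6 and four negative 2-hop edges of weight −1, so the generalized Cheeger ratio of the triangle drops from 1/7 to 1/12.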

Now, we introduce the L^γ-PageRank formulation. Departing from the optimization problem in (2), we revamp PageRank to operate on the L^γ-topology as follows.

Definition 4

The L^γ-PageRank G-SSL is defined as the solution to the optimization problem:

$$ \underset{f}{\mathrm{arg min}} \left\{ f^{T} D_{\gamma}^{-1} L^{\gamma} D_{\gamma}^{-1} f + \mu (f - y)^{T} D_{\gamma}^{-1} (f - y) \right\}{.} $$

(7)

The two following Lemmas show that, for any γ>0, the L^γ-PageRank solution exists in closed form and such solution preserves the PageRank properties.

Lemma 4

Let γ>0. Then, problem (7) is convex with closed form solution given as:

$$ f = \mu \left(L^{\gamma}D_{\gamma}^{-1} + \mu \mathbb{I} \right)^{-1} y {.} $$

(8)

The proof is provided in Appendix Proof of lemma 4.

Remark 2

Eq. (8) emphasizes the difference between our approach and the one in Zhou and Belkin (2011): they propose to iterate the operator in the G-SSL solution as $f = \mu \left (\left [ LD^{-1} \right ]^{m}+ \mu \mathbb {I} \right)^{-1} y$, for $m \in \mathbb {Z}_{> 0}$, for which the formulation of the optimization problem having this expression as solution remains unknown.

Remark 3

The solution of L^γ-PageRank in Eq. (8) can be easily cast as a low-pass graph filter, allowing a fast and distributed approximation via Chebyshev polynomials (Shuman et al. 2018).

Lemma 5

Let γ>0. The L^γ-PageRank solution in (8) satisfies the following properties: (i) mass preservation: $\sum \nolimits _{u \in \mathcal {V}} f_{u} = \sum \nolimits _{u \in \mathcal {V}} y_{u}$; (ii) stationarity: f=π_γ if y=π_γ; and (iii) limit behavior: f→π_γ as μ→0 and f→y as μ→∞.

The proof is provided in Appendix Proof of lemma 5.

The previous Lemmas are important because they show that our generalization, for any γ>0, is a well-posed problem. Indeed, the properties of Lemma 5 imply that, while not necessarily modeled by random walkers, L^γ-PageRank remains a diffusion process having π_γ as stationary state and diffusion rate controlled by the μ parameter.
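The closed form (8) and the properties of Lemma 5 can be checked numerically. The following is a minimal dense-matrix sketch; the function name `lgamma_pagerank`, the toy graph and the parameter values are assumptions, and a large graph would call for the filter approximation mentioned in Remark 3 instead of an explicit eigendecomposition.

```python
import numpy as np

def lgamma_pagerank(W, y, mu, gamma):
    """L^gamma-PageRank scores, f = mu (L^gamma D_gamma^{-1} + mu I)^{-1} y (Eq. 8)."""
    lam, Q = np.linalg.eigh(np.diag(W.sum(axis=1)) - W)
    L_g = (Q * np.clip(lam, 0.0, None) ** gamma) @ Q.T
    d_g = np.diag(L_g)                               # generalized degrees
    A = L_g / d_g[None, :] + mu * np.eye(len(d_g))   # L^gamma D_gamma^{-1} + mu I
    return mu * np.linalg.solve(A, y), d_g

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3); node 0 is labeled.
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0
y = np.zeros(6)
y[0] = 1.0

f, d_g = lgamma_pagerank(W, y, mu=0.5, gamma=2.0)
pi_g = d_g / d_g.sum()                               # generalized stationary distribution
f_pi, _ = lgamma_pagerank(W, pi_g, mu=0.5, gamma=2.0)

print(f.sum(), np.allclose(f_pi, pi_g))              # mass preservation and stationarity
```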

Our next result shows that it is hard for such a diffusion process to escape clusters in the L^γ-graphs.

Lemma 6

Let γ>0 and let $S \subset \mathcal {V}$ be an arbitrary set with vol_γ(S)≤vol_γ(G)/2. For a labeled point placed at node u∈S with probability proportional to its generalized degree in S, i.e. $\frac {[D_{\gamma }]_{uu}}{vol_{\gamma }(S)}$, L^γ-PageRank satisfies:

$$ \mathbb{E}\left[f(S^{c}) \right] \leq \frac{h_{S}^{(\gamma)}}{\mu}{.} $$

(9)

The proof is provided in Appendix Proof of lemma 6.

Lemma 6 admits a similar interpretation as Lemma 1. Namely, if L^γ-PageRank is applied to the labeled points of some set S with small $h_{S}^{(\gamma)}$, then diffusion is confined to S and the score values outside of S are expected to be small. Thus, by looking at the nodes with largest score values we should be able to retrieve a good estimation of S. If such score concentration phenomenon takes place, then a sharp drop must appear after sorting the L^γ-PageRank scores in descending order. We will use the following lemma to show that if a sharp drop is present, then the sweep cut procedure applied on the L^γ-PageRank vector retrieves a partition $\hat {S}$ that has small $h_{\hat {S}}^{(\gamma)}$.

Lemma 7

Let q denote the permutation vector and S_j denote the set associated to q_j obtained by applying the sweep-cut procedure on the L^γ-PageRank vector. Then, the partition $\mathcal {V} = S_{j} \cup S_{j}^{c}$ satisfies the inequality:

$$\begin{array}{*{20}l} \mathcal{A}^{(\gamma)}_{out}&(S_j) \left(2 - \frac{(q_{j} - q_{j+1})}{(q_{1} - q_N)}\right) - \mathcal{D}^{(\gamma)}_{out}(S_j)\left(2\frac{(q_{j} - q_{j+1})}{(q_{1} - q_N)} - 1\right) \geq \frac{\mu \left(y(S_j) - f(S_j) \right)}{(q_{1} - q_N)} \\ & \geq \mathcal{A}^{(\gamma)}_{out}(S_j) \left(2\frac{(q_{j} - q_{j+1})}{(q_{1} - q_N)} - 1\right) - \mathcal{D}^{(\gamma)}_{out}(S_j)\left(2 - \frac{(q_{j} - q_{j+1})}{(q_{1} - q_N)}\right){.} \end{array} $$

(10)

The proof is provided in Appendix Proof of lemma 7.

We have that $\sum \nolimits _{u \in S_{j}} \sum \nolimits _{v \in S_{j}^{c}} \left [W_{\gamma }\right ]_{uv} = \mathcal {A}^{(\gamma)}_{out}(S_{j}) - \mathcal {D}^{(\gamma)}_{out}(S_{j}) \geq 0$. Thus, the generalized Cheeger ratio of S_j is small if $\mathcal {A}^{(\gamma)}_{out}(S_{j})$ is not much larger than $\mathcal {D}^{(\gamma)}_{out}(S_{j})$. In the inequality above, we have two cases in which (q_j−q_{j+1})/(q₁−q_N)≈1: (a) q is approximately constant; and (b) q has a drop that satisfies q_j≈q₁ and q_{j+1}≈q_N. The former can only occur if f→π_γ and clearly no cluster can be retrieved from that vector, as confirmed by the inequality growing unbounded. The latter case is what we coin as having a sharp drop between q_j and q_{j+1}. In such a case, the inequality is controlled by the difference y(S_j)−f(S_j) which, due to the mass preserving property and the assumption that q_{j+1}≈q_N, should be small. This guarantees that $\mathcal {A}^{(\gamma)}_{out}(S_{j})$ is not much larger than $\mathcal {D}^{(\gamma)}_{out}(S_{j})$ and hence that S_j has a small $h_{S_{j}}^{(\gamma)}$.

Discussion. The previous results show that L^γ-PageRank is a sensible tool to find clusters in the L^γ-graphs, i.e. groups of nodes with small generalized Cheeger ratio. Thus, revisiting the classification case in which we target a group of nodes S_gt, we have that the smaller the value of $h_{S_{gt}}^{(\gamma)}$, the better the L^γ-PageRank method can recover it. This observation, together with the fact that standard PageRank emerges as the particular case γ=1, indicates that we should be able to enhance the performance of G-SSL in the detection of S_gt by finding the graph, i.e. the γ value, in which $h_{S_{gt}}^{(\gamma)} < h_{S_{gt}}^{(1)}$.

The selection of γ

Case of γ=2: analytic study

In “The L^γ-graphs” section, it was argued that the topology emerging from L² places a negatively weighted link between nodes at a 2-hop distance, thus carrying the potential to place many disagreements between clusters that may enhance their separability. Our next result formalizes this claim, demonstrating that on graphs from the Planted Partition model the L²-graph is expected to improve the generalized Cheeger ratio.

Theorem 1

Consider a Planted Partition model of parameters (p_in, p_out) and cluster sizes $|S_{gt}| = |S_{gt}^{c}| = n$. Then, as n→∞ we have that

$$ \mathbb{E} \left[h^{(2)}_{S_{gt}}\right] = 2 \mathbb{E}\left[ h_{S_{gt}}^{(1)} \right]^{2}, $$

(11)

where $\mathbb {E}\left [h_{S_{gt}}^{(1)} \right ] = p_{out}/(p_{in} + p_{out})$.

The proof is provided in Appendix Proof of theorem 1.

Corollary 1

If p_in≥p_out, then $\mathbb {E} \left [h^{(2)}_{S_{gt}}\right ] \leq \mathbb {E}\left [ h_{S_{gt}}^{(1)} \right ]$, with equality occurring in the case p_in=p_out.

The proof is provided in Appendix Proof of corollary 1.
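Theorem 1 is an asymptotic statement, but it can be probed on a finite sample. The sketch below draws one Planted Partition graph and compares $h^{(2)}_{S_{gt}}$ with $2 (h^{(1)}_{S_{gt}})^{2}$. The values n=300, p_in=0.5, p_out=0.1 and the random seed are arbitrary assumptions, so only approximate agreement should be expected.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_in, p_out = 300, 0.5, 0.1

# Sample a symmetric planted-partition adjacency matrix with two blocks of size n.
block = np.repeat([0, 1], n)
prob = np.where(block[:, None] == block[None, :], p_in, p_out)
upper = np.triu(rng.random((2 * n, 2 * n)) < prob, k=1)   # no self-loops
W = (upper | upper.T).astype(float)

ind = (block == 0).astype(float)          # indicator of S_gt
L = np.diag(W.sum(axis=1)) - W
L2 = L @ L                                # gamma = 2 needs no eigendecomposition

def ratio(L_g):
    """Generalized Cheeger ratio of S_gt for a given Laplacian power."""
    d_g = np.diag(L_g)
    return (ind @ L_g @ ind) / min(d_g @ ind, d_g @ (1.0 - ind))

h1, h2 = ratio(L), ratio(L2)
print(h1, p_out / (p_in + p_out))         # empirical vs. expected h^(1)
print(h2, 2.0 * h1**2)                    # empirical h^(2) vs. Theorem 1 prediction
```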

Theorem 1 and Corollary 1 open the door to investigating, on arbitrary graphs, the cases in which the L²-graph improves the generalized Cheeger ratio of a set. In the next Proposition, we provide a sufficient condition under which this improvement holds.

Proposition 1

Let $\langle D_{S_{gt}} \rangle $ denote the mean degree of S_gt. A sufficient condition on S_gt so that $h^{(2)}_{S_{gt}} \leq h_{S_{gt}}^{(1)}$ is that

$$ \langle D_{S_{gt}} \rangle \geq \max_{u \in S_{gt}} \sum\nolimits_{v \in S_{gt}^{c}} W_{uv} + \max_{w \in S_{gt}^{c}} \sum\nolimits_{\ell \in S_{gt}} W_{w \ell}{.} $$

(12)

The proof is provided in Appendix Proof of proposition 1.

This proposition points in the same direction as Theorem 1, saying that graphs having a cluster structure are bound to benefit from L². Concretely, the first term on the right hand side of the inequality searches, among all the nodes of S_gt, the one that has the maximum number of connections towards $S_{gt}^{c}$. The second term does the reverse for the nodes of $S_{gt}^{c}$. Hence, asking for the nodes of S_gt to have, on average, more connections than the maximum possible boundary implies that S_gt should have a cluster structure.
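The sufficient condition (12) is cheap to evaluate for a candidate set. A minimal sketch, in which the function name `l2_sufficient_condition` and the toy graph are illustrative assumptions:

```python
import numpy as np

def l2_sufficient_condition(W, S):
    """Check Eq. (12): mean degree of S >= max boundary of S + max boundary of S^c."""
    in_S = np.zeros(W.shape[0], dtype=bool)
    in_S[list(S)] = True
    mean_deg = W.sum(axis=1)[in_S].mean()
    boundary = W[in_S][:, ~in_S]              # rows: nodes of S, columns: nodes of S^c
    bound = boundary.sum(axis=1).max() + boundary.sum(axis=0).max()
    return bool(mean_deg >= bound)

# Two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

print(l2_sufficient_condition(W, [0, 1, 2]))  # mean degree 7/3 against a bound of 2
print(l2_sufficient_condition(W, [0]))        # mean degree 2 against a bound of 3
```

Since the condition is only sufficient, a set that fails the check may still have a smaller generalized Cheeger ratio on the L²-graph.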

An algorithm for the estimation of the optimal γ

Numerical experiments show that increasing γ decreases the generalized Cheeger ratio only up to a point, beyond which the ratio starts increasing. Fig. 2a shows an example of this phenomenon: it displays the evolution of $h_{S_{gt}}^{(\gamma)}$ as a function of γ when S_gt corresponds to a digit of the MNIST dataset. The figure shows that an optimal value exists, denoted $\gamma ^{*} = \text {arg min}_{\gamma } h_{S_{gt}}^{(\gamma)}$, raising the question of how to find it. Since the behavior of $h_{S_{gt}}^{(\gamma)}$ depends on S_gt, which is unknown in practice, neither the derivative nor a greedy search can be used to find γ^∗.

A second question is whether the optimal value changes drastically or smoothly with changes in S_gt. We perform the following test: for a given S_gt (same MNIST digit), we remove some percentage of the nodes in S_gt and record the optimal value on subsets of S_gt. More precisely, recall that $h_{S_{gt}}^{(\gamma)} = \mathbbm {1}_{S_{gt}}^{T} L^{\gamma } \mathbbm {1}_{S_{gt}} / \mathbbm {1}_{S_{gt}}^{T} D_{\gamma } \mathbbm {1}_{S_{gt}}$, hence we randomly select some percentage of the entries indexing S_gt in $\mathbbm {1}_{S_{gt}}$, set them to zero and obtain a new indicator function indexing a subset of S_gt. Mean results are evaluated on the original curve and displayed in Fig. 2b. The figure suggests that it is not necessary to know S_gt to find a proxy $\hat {\gamma }$ of γ^∗: it suffices to know a subset of S_gt.

Based on this observation, we propose Algorithm 1 for the estimation of γ^∗. The rationale of the algorithm is to exploit the labeled points and the graph to find a proxy $\hat {S}$ of S_gt on which we can compute the estimate. The procedure consists in letting walkers start from the labeled points and run for a number of steps that is determined by the maximum geodesic distance between the labels. This allows walkers to explore S_gt without escaping too far from it. After running the walk, we list the nodes in descending order according to the probability of finding a walker at a node. We take the first element of the list (the one where it is most likely to find a walker), add it to $\hat {S}$ and remove it from the list, so that the former second element becomes the first in the list. We repeat the procedure until the probability of finding a walker in the nodes forming $\hat {S}$ reaches 0.7.
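The construction of the proxy set described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' Algorithm 1: the walk is taken to be the standard random walk with transition matrix D^{-1}W, walkers start uniformly on the labeled nodes, geodesic distances are counted in hops, and the graph has no isolated nodes. The names `hop_distances` and `proxy_set` are hypothetical.

```python
import numpy as np
from collections import deque

def hop_distances(W, source):
    """Breadth-first hop distances from `source` (-1 marks unreachable nodes)."""
    dist = np.full(W.shape[0], -1)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(W[u]):
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def proxy_set(W, labeled, mass=0.7):
    """Grow a proxy S_hat of the cluster from the labeled nodes."""
    labeled = np.asarray(labeled)
    # Number of steps: largest hop distance between any two labeled nodes.
    steps = max(int(hop_distances(W, s)[labeled].max()) for s in labeled)
    steps = max(steps, 1)  # assumption: walk at least one step
    # Standard random walk (assumption): row-stochastic matrix D^{-1} W.
    P = W / W.sum(axis=1, keepdims=True)
    p = np.zeros(W.shape[0])
    p[labeled] = 1.0 / len(labeled)  # walkers start uniformly on the labels
    for _ in range(steps):
        p = p @ P
    # Add nodes in decreasing order of probability until `mass` is reached.
    order = np.argsort(-p)
    cumulative = np.cumsum(p[order])
    count = int(np.searchsorted(cumulative, mass - 1e-12)) + 1
    return order[:count]
```

The returned indices can then be used in place of S_gt when evaluating the generalized Cheeger ratio over a range of γ values.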

In Table 1, we evaluate the performance of Algorithm 1 on the estimation of γ^∗ for all the digits of the MNIST. The first row displays, as γ^∗, the value of γ (from the input range) attaining the minimum generalized Cheeger ratio. The second row displays the performance of the algorithm when estimating such value. The last three rows show the value of the generalized Cheeger ratio evaluated at γ^∗, $\hat {\gamma }$ and γ=1, respectively. The estimator finds values of $\hat {\gamma }$ whose Cheeger ratios are: (a) significantly smaller than those of γ=1; (b) close to the optimal.
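As an illustration of how the rows of such a table could be produced, the sketch below evaluates the generalized Cheeger ratio $h_{S}^{(\gamma)} = \mathbbm {1}_{S}^{T} L^{\gamma } \mathbbm {1}_{S} / \mathbbm {1}_{S}^{T} D_{\gamma } \mathbbm {1}_{S}$ on a grid of γ values and returns the minimizer. It assumes a dense symmetric weight matrix, computes L^γ through an eigendecomposition, and takes D_γ as the diagonal of L^γ; the function names and the grid are hypothetical.

```python
import numpy as np

def laplacian_power(W, gamma):
    """L^gamma through the eigendecomposition L = Q diag(lambda) Q^T."""
    L = np.diag(W.sum(axis=1)) - W
    lam, Q = np.linalg.eigh(L)
    lam = np.clip(lam, 0.0, None)  # remove tiny negative round-off
    return (Q * lam**gamma) @ Q.T

def generalized_cheeger(W, S, gamma):
    """h_S^(gamma), with D_gamma taken as the diagonal of L^gamma."""
    L_gamma = laplacian_power(W, gamma)
    ind = np.zeros(W.shape[0])
    ind[S] = 1.0
    return (ind @ L_gamma @ ind) / (np.diag(L_gamma) @ ind)

def best_gamma(W, S, grid):
    """Return the grid value minimizing h_S^(gamma), and all the ratios."""
    ratios = [generalized_cheeger(W, S, g) for g in grid]
    return grid[int(np.argmin(ratios))], ratios
```

Calling `best_gamma` with the true set gives the γ^∗ of the input range, while calling it with a proxy set gives an estimate $\hat {\gamma }$; both can then be compared with the ratio at γ=1.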

Table 1 Evaluation of Algorithm 1 on the MNIST Dataset


L^γ-PageRank in practice

Planted partition

Experimental setup and goals. In the following experiment, we show that L^γ-PageRank can increase the performance of G-SSL as the graph approaches the Planted Partition detectability transition. More precisely, it is shown in Mossel et al. (2015) that the Planted Partition possesses a detectability threshold above which unsupervised methods are unable to retrieve a meaningful clustering. Indeed, if the cluster sizes are denoted as $|S_{gt}| = |S_{gt}^{c}| = n$, the mean degree of a node is given as C_avg=C_in+C_out, where C_out=(p_out)(n) and C_in=(p_in)(n−1). It is then possible to recover, in an unsupervised manner, a cluster that is positively correlated with the true partition if (C_in−C_out)²>2(C_in+C_out), and impossible otherwise. As for G-SSL, the work in Zhang et al. (2014) showed that this threshold can be overcome when a fraction of labeled points is introduced to the task. Nonetheless, the performance of G-SSL drastically degrades when approaching the detectability transition.
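The detectability condition can be checked directly from the model parameters. The short sketch below uses only the relations stated above; the helper names are hypothetical, and `degrees_from_ratio` simply splits a given C_avg according to a ratio C_out/C_in.

```python
def mean_degrees(n, p_in, p_out):
    """C_in and C_out for two clusters of n nodes each."""
    return p_in * (n - 1), p_out * n

def degrees_from_ratio(c_avg, ratio):
    """Split C_avg = C_in + C_out for a given ratio C_out / C_in."""
    c_in = c_avg / (1.0 + ratio)
    return c_in, c_avg - c_in

def is_detectable(c_in, c_out):
    """True when (C_in - C_out)^2 > 2 (C_in + C_out)."""
    return (c_in - c_out) ** 2 > 2.0 * (c_in + c_out)
```

For instance, with C_avg=3 the condition reads (C_in−C_out)²>6, so `is_detectable(*degrees_from_ratio(3, 0.05))` holds while `is_detectable(*degrees_from_ratio(3, 0.2))` does not.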

The experimental setup is the following: for a given C_out/C_in, a realization of the Planted Partition is drawn with n=500 and C_avg=3. Then, 1% of labeled points are sampled at random and the L^γ-PageRank method is applied for different values of μ lying on a discrete grid. The clusters are determined via a sweep-cut procedure, and the best performance is retained. The whole procedure is repeated for 10 different realizations of the labeled points. Finally, all the preceding steps are repeated for 100 graph realizations. Performance is assessed in terms of the Matthews Correlation Coefficient (MCC) (Matthews 1975), so that a value of 1 implies perfect agreement with the true partition and 0 a random decision.
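Two ingredients of this setup, drawing a Planted Partition realization and scoring a recovered cluster with the MCC, can be sketched as follows. This is a minimal illustration rather than the code used for the experiments: it assumes a dense 0/1 adjacency matrix and binary cluster labels, and the function names are hypothetical.

```python
import numpy as np

def planted_partition(n, c_in, c_out, rng):
    """Symmetric 0/1 adjacency matrix of two clusters of n nodes each."""
    p_in, p_out = c_in / (n - 1), c_out / n
    labels = np.repeat([0, 1], n)
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p_in, p_out)
    # Draw each pair once (upper triangle), then mirror it.
    upper = np.triu(rng.random((2 * n, 2 * n)) < prob, k=1)
    W = (upper | upper.T).astype(float)
    return W, labels

def mcc(truth, pred):
    """Matthews Correlation Coefficient of two binary labelings."""
    truth, pred = np.asarray(truth, bool), np.asarray(pred, bool)
    tp = float(np.sum(truth & pred))
    tn = float(np.sum(~truth & ~pred))
    fp = float(np.sum(~truth & pred))
    fn = float(np.sum(truth & ~pred))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```

With sparse mean degrees such as C_avg=3, a realization may contain isolated nodes or several connected components, which a full experiment would need to handle.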

Results and discussion.

Figure 3 displays the performance of L^γ-PageRank at recovering the Planted Partition as a function of the ratio C_out/C_in. Standard PageRank (γ=1) performs poorly as the configuration approaches the phase transition (indicated by the vertical line) since $h_{S}^{(1)}$ becomes large. Clearly, the introduction of γ decreases $h_{S_{gt}}^{(\gamma)}$, which, accordingly, enhances the clustering performance. Furthermore, the figure verifies that the smaller the value of $h_{S_{gt}}^{(\gamma)}$ (right plot), the better L^γ-PageRank recovers the true partition (left plot). It is important to remark that, for this experiment, while γ=2 already yields a clear improvement, larger values of γ keep decreasing $h_{S}^{(\gamma)}$ until it reaches a saturation plateau, which designates a region of optimal γ values (γ≥6).

Real world datasets

Experimental setup and goals.

In the following experiment, we assess the performance of L^γ-PageRank and Algorithm 1 on real world datasets.

The experimental setup is as follows: graphs are built by connecting the K-Nearest Neighbors (KNN), with edge weights computed via the Gaussian kernel, so that the weight between points x_u and x_v is given by $W_{uv} = \exp \{ - || \mathbf {x}_{u} - \mathbf {x}_{v} ||^{2}_{2} /\sigma ^{2} \}$. For each class, 2% of labeled points are randomly selected, L^γ-PageRank is applied for a grid of μ values, partitions are retrieved via the sweep-cut, and the best performance, assessed in terms of MCC, is retained. This procedure is repeated for 100 realizations of labeled points, except for the MNIST, on which only 30 realizations are employed. In all cases, classes are balanced in size and the graph construction parameters are selected to provide a good distribution of weights as follows: (a) MNIST (Lecun et al. 1998): Images of handwritten digits (1 to 9). From the entire dataset, 200 images of each digit are selected and used to build the graph with KNN = 10 and σ=10⁴; (b) Gender Images (Hond and Spacek 1997): Images of male and female subjects for gender recognition. From the entire dataset, 200 images of each gender are selected and used to build the graph with KNN = 60 and σ=10⁴. The large value of KNN is chosen to avoid disconnected components; (c) BBC articles (Greene and Cunningham 2006): Word frequency attributes from news media articles. From the entire dataset, 200 business and 200 entertainment articles are used to build the graph with KNN = 5 and σ=50; and (d) Phoneme (The phoneme database): Five attributes to discern nasal sounds from oral sounds. From the entire dataset, 200 oral and 200 nasal sounds are used to build the graph with KNN = 10 and σ=2.
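The graph construction can be sketched as below. This is an illustration under assumptions the text does not spell out: distances are Euclidean, each node selects its K nearest neighbors, and the graph is symmetrized by keeping an edge whenever either endpoint selects the other. The function name `knn_gaussian_graph` is hypothetical.

```python
import numpy as np

def knn_gaussian_graph(X, k, sigma):
    """KNN graph with weights W_uv = exp(-||x_u - x_v||^2 / sigma^2)."""
    sq = np.sum(X**2, axis=1)
    # Pairwise squared Euclidean distances, clipped against round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    nearest = np.argsort(d2, axis=1)[:, :k]
    rows = np.repeat(np.arange(X.shape[0]), k)
    cols = nearest.ravel()
    W = np.zeros_like(d2)
    W[rows, cols] = np.exp(-d2[rows, cols] / sigma**2)
    # Assumed symmetrization: keep an edge if either endpoint selected it.
    return np.maximum(W, W.T)
```

With the MNIST setting quoted above, the call would take the form `knn_gaussian_graph(X, k=10, sigma=1e4)`, where `X` holds one flattened image per row.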

Results and discussion.

Table 2 shows the performance of L^γ-PageRank on the classification of these real world datasets. Clearly, the introduction of γ can significantly improve performance and, in general, the estimation $\hat {\gamma }$ performs close to the optimal value γ^∗. It can be seen that some datasets are more sensitive to γ than others. For instance, in the BBC articles we observe that a small change in γ, going from γ=1 to γ^∗=1.1, increases performance, and going further to $\hat {\gamma } = 1.3$ and γ=2 significantly worsens the classification. On the other hand, the MNIST dataset is less sensitive to γ, obtaining similar performances with larger variations in γ.

Table 2 Performance on real world datasets: each cell reports MCC, 95% confidence interval (parentheses) and the value of γ [square brackets]


It is important to stress that, thus far, we have assumed that the diffusion rate (μ) is tuned to the value that attains the best results. However, when working with real data, clusters may have intricate local structures, e.g. sub-clusters, that play an important role in the way information diffuses and that can make the optimal diffusion rate μ harder to find. As a result, two clusters may have equal Cheeger ratios while one of them is harder to find because its local structure is complex. Digit 8 is an example of this phenomenon: the mean performance for $\hat {\gamma }$ is slightly better than that of γ^∗. This anomaly can be explained as an artifact of using a finite grid on μ: for some realizations of labeled points, the best performance for γ^∗ falls in a region not covered by the grid.

Unbalanced labeled data

Experimental setup and goals

In our last experiment, we show that L^γ-PageRank, adapted to the multi-class setting described in “Generalization to multiple classes” section, can improve the performance of G-SSL in the presence of unbalanced labeled data.

The experimental setup is as follows: graphs with two balanced classes (in size) are built using the datasets from the preceding experiments. The parameters of the graphs’ construction follow the guidelines provided in “Real world datasets” section. For the Planted Partition, the configuration is n=200, C_avg=3, C_out=0.1. Then, unbalanced labeled points are drawn at random: 2% from one class and 6% from the other. Lastly, L^γ-PageRank, in the multi-class setting, is applied for a grid of μ values and the best performance, assessed by MCC, is recorded. For the planted partition, the procedure is repeated over 15 realizations of the labeled points and for 100 graph realizations. For the other datasets, 100 realizations of labeled points are employed.

Results and discussion

Table 3 displays the performance of L^γ-PageRank in the presence of unbalanced labeled data. It is important to stress that, in this framework, a unique value of γ is used to retrieve all the clusters at the same time, precluding the notion of an optimal γ as defined in “The selection of γ” section. However, one value of γ seems to perform better; we denote it as γ=Best. The results confirm that the introduction of γ helps to improve the classification in the presence of unbalanced labeled data.

Table 3 Performance on unbalanced labeled data: each cell reports MCC, 95% confidence interval (parentheses) and the value of γ [square brackets]


Conclusion

This work proposed L^γ-PageRank, an extension of PageRank based on (not necessarily integer) powers of the (combinatorial) Laplacian matrix. Our analysis shows that the added degree of freedom offers more versatility than standard PageRank, providing the potential to address some of the limitations of G-SSL. Precisely, we showed that when clusters are obtained via the sweep-cut procedure, L^γ-PageRank can significantly outperform standard PageRank. Further, we showed that the multi-class approach also benefits from our proposition, as performance was enhanced in the presence of unbalanced labeled data. These improvements were possible because the L^γ (γ>0) operator codes for graphs whose topology can reinforce the separability of clusters. The richness of such graphs comes from the sign of edges, which allows coding for similarities but also emphasizing dissemblance between individuals. Thus, while two nodes can only be disconnected on the initial graph, they can ‘repulse’ each other in these topologies. Notably, we have shown that there is an optimal graph (related to an optimal γ) on which the classification leads to a maximal performance. We proposed a simple yet efficient algorithm to estimate the optimal γ and hence determine the best topology for analyzing a given dataset. The procedures proposed in this work open the door to a more in-depth study of the L^γ-graphs and of what determines their optimal topology. They also pave the way towards the extension of other standard clustering tools, such as Unsupervised Learning via Spectral Clustering, to exploit these richer topologies.

Appendix: Proofs

Proof of Lemma 3

Proof

For an arbitrary and fixed γ>0, let S denote an arbitrary set of vol_γ(S)≤vol_γ(G)/2. Let $r = (\mathcal {A}^{(\gamma)}_{out}(S) - \mathcal {D}^{(\gamma)}_{out}(S)) / (\mathcal {A}^{(\gamma)}_{in}(S) - \mathcal {D}^{(\gamma)}_{in}(S))$. It is easy to show that $h_{S}^{(\gamma)} = r/(r+1)$, which is monotonically increasing with r. Thus, the set S that minimizes r also minimizes $h^{(\gamma)}_{S}$. □

Proof of lemma 4

Proof

It suffices to show the positive semi-definiteness of the functional and to apply the first order optimality condition. Let $\tilde {f} = Q^{T} D_{\gamma }^{-1} f$. Then, the left term satisfies $\sum \nolimits _{j} \lambda _{j}^{\gamma } \tilde {f}_{j}^{2} \geq 0$. It can be shown that $[D_{\gamma }]_{uu} = \sum \nolimits _{j} Q_{uj}^{2} \lambda _{j}^{\gamma } \geq 0$, which guarantees that the right term satisfies $\sum \nolimits _{u} (f_{u} - y_{u})^{2} / [D_{\gamma }]_{uu} \geq 0$. Now, computing the derivative of the functional with respect to f and setting it to 0 leads to: $L^{\gamma } D_{\gamma }^{-1}f + \mu (f - y) = 0$. The lemma is proved after isolating f. □

Proof of lemma 5

Proof

From the proof of Lemma 4 we have that $L^{\gamma }D_{\gamma }^{-1} f + \mu (f - y) = 0$. Left-multiplying by $\mathbbm {1}^{T}$ gives $\mathbbm {1}^{T} L^{\gamma }D_{\gamma }^{-1} f + \mu \mathbbm {1}^{T} f = \mu \mathbbm {1}^{T} y$. Since $\mathbbm {1}^{T} L^{\gamma } = 0$ we have that $\mathbbm {1}^{T} f = \mathbbm {1}^{T} y$, proving (i). We prove property (iii) using the same expression. We only develop the case μ→0 since the case μ→∞ follows the same steps: taking ${\lim }_{\mu \to 0} \left \{ L^{\gamma }D_{\gamma }^{-1}f + \mu (f-y) = 0 \right \}$ leads to $L^{\gamma }D_{\gamma }^{-1} f = 0$, whose solution is proportional to $\pi _{\gamma } = D_{\gamma }\mathbbm {1}/vol_{\gamma }(G)$. Lastly, we prove (ii) by noting that the operator $L^{\gamma }D_{\gamma }^{-1}$ has a non-negative real spectrum as it is similar to $D_{\gamma }^{-1/2} L^{\gamma } D_{\gamma }^{-1/2}$, which is positive semi-definite. Thus, we can use the inverse Laplace transform of the resolvent $\mu (L^{\gamma }D_{\gamma }^{-1} + \mu \mathbb {I})^{-1} = \int \nolimits _{0}^{\infty } e^{-t} e^{-tL^{\gamma }D_{\gamma }^{-1}/\mu } dt$, which, after using its Taylor expansion, allows us to rewrite the PageRank solution as $f = \sum \nolimits _{k = 0}^{\infty } \frac {(-1)^{k}}{\mu ^{k}} \left (L^{\gamma }D_{\gamma }^{-1} \right)^{k} y$. If y=π_γ, only the term k=0 of the previous sum is non-zero, since $L^{\gamma }D_{\gamma }^{-1} \pi _{\gamma } = L^{\gamma } \mathbbm {1}/vol_{\gamma }(G) = 0$, proving (ii). □
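Properties (i) and (ii) are easy to verify numerically. The sketch below isolates f from $L^{\gamma }D_{\gamma }^{-1}f + \mu (f - y) = 0$, which gives $f = \mu (L^{\gamma }D_{\gamma }^{-1} + \mu \mathbb {I})^{-1} y$, and solves the resulting linear system. It assumes a dense symmetric weight matrix without isolated nodes and takes D_γ as the diagonal of L^γ; the function name is hypothetical.

```python
import numpy as np

def lgamma_pagerank(W, y, gamma, mu):
    """Solve (L^gamma D_gamma^{-1} + mu I) f = mu y and return (f, diag(D_gamma))."""
    L = np.diag(W.sum(axis=1)) - W
    lam, Q = np.linalg.eigh(L)
    L_gamma = (Q * np.clip(lam, 0.0, None) ** gamma) @ Q.T
    d_gamma = np.diag(L_gamma)   # generalized degrees [D_gamma]_uu
    A = L_gamma / d_gamma        # divides column j by d_gamma[j]: L^gamma D_gamma^{-1}
    f = mu * np.linalg.solve(A + mu * np.eye(len(y)), y)
    return f, d_gamma
```

On a connected graph, summing the entries of `f` should return the sum of `y` up to round-off (property (i)), and passing `y = d_gamma / d_gamma.sum()` should return the same vector (property (ii)).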

Proof of lemma 6

Proof

Let $y = D_{\gamma }\mathbbm {1}_{S}/vol_{\gamma }(S)$. Using (8) we can see that

$$ \mathbbm{1}_{S^{c}}^{T} f = \sum\limits_{u \in S} \frac{\left[ D_{\gamma} \right]_{uu}}{vol_{\gamma}(S)} \mathbbm{1}_{S^{c}}^{T} \left[ \mu \left(L^{\gamma}D_{\gamma}^{-1} + \mu \mathbb{I} \right)^{-1} \delta_{u} \right], $$

(13)

showing that $\mathbbm {1}_{S^{c}}^{T} f$ can be interpreted as $\mathbb {E}\left [ f(S^{c}) \right ]$ when labels are selected with probability proportional to their generalized degree in S. Using the fact that

$$ \left(L^{\gamma}D_{\gamma}^{-1} + \mu \mathbb{I} \right)^{-1} \left(L^{\gamma}D_{\gamma}^{-1} + \mu \mathbb{I} \right) = \mathbb{I}, $$

(14)

we express

$$ f = \left(\mathbb{I} - \frac{1}{\mu} L^{\gamma}D_{\gamma}^{-1} + \frac{1}{\mu} L^{\gamma}D_{\gamma}^{-1} \left(L^{\gamma} D_{\gamma}^{-1} + \mu \mathbb{I} \right)^{-1} L^{\gamma}D_{\gamma}^{-1} \right)y. $$

(15)

The upper bound is thus obtained by substituting y and summing over S.

$$\begin{array}{*{20}l} \mathbbm{1}_{S}^{T} f &= \frac{ \mathbbm{1}_{S}^{T} D_{\gamma} \mathbbm{1}_{S} }{vol_{\gamma}(S)} - \frac{\mathbbm{1}_{S}^{T} L^{\gamma} \mathbbm{1}_{S}}{\mu~vol_{\gamma}(S)} + \frac{ \mathbbm{1}_{S}^{T} L^{\gamma} \left(L^{\gamma} + \mu D_{\gamma} \right)^{-1} L^{\gamma} \mathbbm{1}_{S} }{\mu~vol_{\gamma}(S)} \\ & \geq \frac{ \mathbbm{1}_{S}^{T} D_{\gamma} \mathbbm{1}_{S} }{vol_{\gamma}(S)} - \frac{\mathbbm{1}_{S}^{T} L^{\gamma} \mathbbm{1}_{S}}{\mu~vol_{\gamma}(S)} \\ & = 1 - \frac{h_{S}^{(\gamma)}}{\mu}. \end{array} $$

(16)

Employing property (i) from Lemma 5 finishes the proof. □
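The inequality in (16), combined with property (i), implies that the mass of f outside S is at most $h_{S}^{(\gamma)}/\mu$ when $y = D_{\gamma }\mathbbm {1}_{S}/vol_{\gamma }(S)$. A small numerical check of this consequence is sketched below, under the same assumptions as before (dense symmetric weights, no isolated nodes, D_γ taken as the diagonal of L^γ); the function name is hypothetical.

```python
import numpy as np

def leak_and_bound(W, S, gamma, mu):
    """Return (mass of f outside S, h_S^(gamma) / mu) for y = D_gamma 1_S / vol_gamma(S)."""
    L = np.diag(W.sum(axis=1)) - W
    lam, Q = np.linalg.eigh(L)
    L_gamma = (Q * np.clip(lam, 0.0, None) ** gamma) @ Q.T
    d_gamma = np.diag(L_gamma)
    ind = np.zeros(W.shape[0])
    ind[S] = 1.0
    vol = d_gamma @ ind                      # vol_gamma(S)
    y = d_gamma * ind / vol                  # labels weighted by generalized degree
    f = mu * np.linalg.solve(L_gamma / d_gamma + mu * np.eye(len(y)), y)
    h = (ind @ L_gamma @ ind) / vol          # generalized Cheeger ratio of S
    return (1.0 - ind) @ f, h / mu
```

The first returned value should never exceed the second, up to numerical tolerance. Note that the leaked mass may be negative, since the L^γ-graphs carry signed edges.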

Proof of lemma 7

Proof

We only show the proof of the lower bound as the upper bound follows a similar derivation. We recast (4) as $L^{\gamma }D_{\gamma }^{-1}f = \mu \left (y - f \right)$. Thus, the set S_j satisfies:

$$\begin{array}{*{20}l} {} \mu \left(y(S_j) - f(S_j) \right) &= \mathbbm{1}_{S_{j}}^{T} L^{\gamma} D_{\gamma}^{-1} f \\ &= \mathbbm{1}_{S_{j}}^{T} L^{\gamma} q \\ &= \sum\limits_{u \in S_{j}, v \in S_{j}^{c}} \left[ W_{\gamma} \right]_{uv} (q_{u} - q_v) \\ &= \sum\limits_{u \in S_{j}, v \in S_{j}^{c}} |\left[ W_{\gamma}^+ \right]_{uv}|(q_{u} - q_v) - \sum\limits_{u \in S_{j}, v \in S_{j}^{c}} |\left[ W_{\gamma}^- \right]_{uv}|(q_{u} - q_v) \\ &\qquad+ \sum\limits_{u \in S_{j}, v \in S_{j}^{c}} |\left[ W_{\gamma}^+ \right]_{uv}| (q_{j} - q_{j+1}) - \sum\limits_{u \in S_{j}, v \in S_{j}^{c}} |\left[ W_{\gamma}^+ \right]_{uv}|(q_{j} - q_{j+1}) \\ &\qquad+ \sum\limits_{u \in S_{j}, v \in S_{j}^{c}} |\left[ W_{\gamma}^- \right]_{uv}|(q_{j} - q_{j+1}) - \sum\limits_{u \in S_{j}, v \in S_{j}^{c}} |\left[ W_{\gamma}^- \right]_{uv}|(q_{j} - q_{j+1}) \\ &\geq (q_{j} - q_{j+1}) \left(2\mathcal{A}^{(\gamma)}_{out}(S_j) + \mathcal{D}^{(\gamma)}_{out}(S_j) \right) \\ &\qquad - (q_{1} - q_N) \left(2\mathcal{D}^{(\gamma)}_{out}(S_j) + \mathcal{A}^{(\gamma)}_{out}(S_j) \right){.}\end{array} $$

(17)

Re-ordering terms finishes the proof. □

Proof of theorem 1

Proof

Let n=|S|. For u,v∈S and w∈S^c the Planted Partition satisfies $\sum \nolimits _{v} W_{uv} \sim B(n-1, p_{in})$ and $\sum \nolimits _{w} W_{uw} \sim B(n, p_{out})$. The key step in the proof is to show that, in the limit n→∞, $\mathbb {E}\left [ h_{S}^{(1)} \right ] = \mathbb {E}\left [ \frac {\mathbbm {1}_{S}^{T} L \mathbbm {1}_{S}}{vol(S)}\right ] = \frac { \mathbb {E}\left [ \mathbbm {1}_{S}^{T} L \mathbbm {1}_{S} \right ]}{\mathbb {E} \left [ vol(S) \right ]} $, and the same for $h_{S}^{(2)}$. By application of the Chebyshev inequality we have that

$$ Pr \left(d_{u} - \mathbb{E}\left[ d_{u} \right] \geq \mathbb{E}\left[ d_{u} \right] \right) \leq \frac{var(d_{u})}{var(d_{u}) + \mathbb{E}\left[d_{u} \right]^{2}}= \mathcal{O}(n^{-1}). $$

(18)

Thus, in the limit of n→∞ we can establish the inequality $d_{u} < 2\mathbb {E}\left [ d_{u} \right ]$ and further that $vol(S) < 2 \mathbb {E}\left [vol(S)\right ]$. The latter allows us to express $\mathbb {E}\left [ h_{S}^{(1)} \right ]$ as follows (Rice 2008):

$$\begin{array}{*{20}l} \mathbb{E}\left[ h_{S}^{(1)} \right] &= \mathbb{E}\left[ \frac{\mathbbm{1}_{S}^{T} L \mathbbm{1}_{S}}{vol(S)}\right] \\ &= \frac{ \mathbb{E}\left[ \mathbbm{1}_{S}^{T} L \mathbbm{1}_{S} \right]}{\mathbb{E} \left[ vol(S) \right]} + \sum\limits_{i = 1}^{\infty} (-1)^{i} \frac{\mathbb{E}[ \mathbbm{1}_{S}^{T} L \mathbbm{1}_{S}] \langle\langle^{i} vol(S) \rangle\rangle + \langle\langle \mathbbm{1}_{S}^{T} L \mathbbm{1}_{S},^{i} vol(S) \rangle\rangle}{\mathbb{E}\left[ vol(S) \right]^{i+1}} \\ &= \frac{ \mathbb{E}\left[ \mathbbm{1}_{S}^{T} L \mathbbm{1}_{S} \right]}{\mathbb{E} \left[ vol(S) \right]} +\sum\limits_{i = 1}^{\infty} (-1)^{i} \frac{\mathbb{E}\left[ \mathbbm{1}_{S}^{T} L \mathbbm{1}_{S} (vol(S) - \mathbb{E}[vol(S)])^{i}\right] }{\mathbb{E}[vol(S)]^{i+1}} \\ &= \frac{ \mathbb{E}\left[ \mathbbm{1}_{S}^{T} L \mathbbm{1}_{S} \right]}{\mathbb{E} \left[ vol(S) \right]} + \sum\limits_{i = 1}^{\infty} (-1)^{i} \mathbb{E}\left[\frac{\mathbbm{1}_{S}^{T} L \mathbbm{1}_{S}}{\mathbb{E}[vol(S)]} \left(\frac{ vol(S)}{\mathbb{E}[vol(S)]} - 1\right)^{i} \right] \\ &= \frac{ \mathbb{E}\left[ \mathbbm{1}_{S}^{T} L \mathbbm{1}_{S} \right]}{\mathbb{E} \left[ vol(S) \right]} + \sum\limits_{i = 1}^{\infty} (-1)^{i} c_{i}{,} \end{array} $$

(19)

where $\langle \langle a,^{i} b \rangle \rangle = \mathbb {E}\left [ (a - \mathbb {E}[a]) (b - \mathbb {E}[b])^{i} \right ]$. The fact that $vol(S) < 2 \mathbb {E}\left [vol(S)\right ]$ and the monotonicity of the expected value imply that the sequence $\sum \nolimits _{i} |c_{i}|$ decreases monotonically. Also, it can be shown that its dominant term satisfies $c_{1} = \mathcal {O}(n^{-2})$. Replacing the expectations and evaluating the limit leads to:

$$ {\lim}_{n \to \infty} \mathbb{E} \left[ h_{S}^{(1)} \right] = \frac{p_{out}}{p_{in} + p_{out}}. $$

(20)

The case of $\mathbb {E}[ h_{S}^{(2)} ]$ follows a similar derivation. Since $\left [ D_{2} \right ]_{uu} = d_{u}^{2} + d_{u}$, the Jensen inequality implies that $\left [ D_{2}\right ]_{uu} < 2 \mathbb {E}\left [ [D_{2}]_{uu} \right ]$ and consequently that $vol_{2}(S) < 2 \mathbb {E}\left [ vol_{2}(S) \right ]$. Thus, we cast:

$$\begin{array}{*{20}l} \mathbb{E}\left[ h_{S}^{(2)} \right] &= \mathbb{E}\left[ \frac{\mathbbm{1}_{S}^{T} L^{2} \mathbbm{1}_{S}}{vol_{2}(S)}\right] \\ &= \frac{ \mathbb{E}\left[ \mathbbm{1}_{S}^{T} L^{2} \mathbbm{1}_{S} \right]}{\mathbb{E} \left[ vol_{2}(S) \right]} + \sum\limits_{i = 1}^{\infty} (-1)^{i} \mathbb{E}\left[\frac{\mathbbm{1}_{S}^{T} L^{2} \mathbbm{1}_{S}}{\mathbb{E}[vol_{2}(S)]} \left(\frac{ vol_{2}(S)}{\mathbb{E}[vol_{2}(S)]} - 1\right)^{i} \right] \\ &= \frac{ \mathbb{E}\left[ \mathbbm{1}_{S}^{T} L^{2} \mathbbm{1}_{S} \right]}{\mathbb{E} \left[ vol_{2}(S) \right]} + \sum\limits_{i = 1}^{\infty} (-1)^{i} c^{(2)}_{i}{.} \end{array} $$

(21)

Let the random variable $O_{u} = \sum \nolimits _{w \in S^{c}} W_{uw}$. Then we have that $\mathbbm {1}_{S}^{T} L^{2} \mathbbm {1}_{S} = 2 \sum \nolimits _{u \in S} \left (O_{u} \right)^{2}$. This fact, in addition to $vol_{2}(S) = \sum \nolimits _{u\in S}d_{u}^{2} + d_{u}$, allows us to show that the sequence $\sum \nolimits _{i} |c^{(2)}_{i}|$ is monotonically decreasing with $c^{(2)}_{1} = \mathcal {O}(n^{-1})$. Replacing the expectations and evaluating the limit leads to:

$$ {\lim}_{n \to \infty} \mathbb{E} \left[ h_{S}^{(2)} \right] = 2 \left(\frac{p_{out}}{p_{in}+p_{out}} \right)^{2}{.} $$

(22)

□
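The two limits (20) and (22) can be illustrated on a single realization of the Planted Partition. The sketch below is a sanity check rather than part of the proof: it assumes a dense regime with fixed p_in and p_out, and the values n=300, p_in=0.3, p_out=0.1 as well as the function name are arbitrary choices made for the illustration.

```python
import numpy as np

def cheeger_1_and_2(W, S):
    """Generalized Cheeger ratios of S for gamma = 1 and gamma = 2."""
    L = np.diag(W.sum(axis=1)) - W
    L2 = L @ L
    ind = np.zeros(W.shape[0])
    ind[S] = 1.0
    h1 = (ind @ L @ ind) / (np.diag(L) @ ind)     # D_1 = diag(L)
    h2 = (ind @ L2 @ ind) / (np.diag(L2) @ ind)   # D_2 = diag(L^2)
    return h1, h2

rng = np.random.default_rng(0)
n, p_in, p_out = 300, 0.3, 0.1                    # illustrative values
labels = np.repeat([0, 1], n)
prob = np.where(labels[:, None] == labels[None, :], p_in, p_out)
upper = np.triu(rng.random((2 * n, 2 * n)) < prob, k=1)
W = (upper | upper.T).astype(float)

h1, h2 = cheeger_1_and_2(W, np.arange(n))
limit1 = p_out / (p_in + p_out)                   # right-hand side of (20)
limit2 = 2 * limit1**2                            # right-hand side of (22)
```

At this finite n the empirical ratios are only close to the limits; the gap is expected to shrink as n grows.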

Proof of corollary 1

Proof

Let p_in=p_out+ε and assume that $h_{S}^{(1)} \geq h_{S}^{(2)}$. Thus p_out/(p_in+p_out)≥2(p_out/(p_in+p_out))², which can be further simplified to 1≥2p_out/(2p_out+ε). We observe that such expression holds for ε≥0 and equality occurs when ε=0. □

Proof of proposition 1

Proof

We search a condition on S that permits $\frac {\mathbbm {1}_{S}^{T} L \mathbbm {1}_{S}}{\mathbbm {1}_{S}^{T} D \mathbbm {1}_{S}} \geq \frac {\mathbbm {1}_{S}^{T} L^{2} \mathbbm {1}_{S}}{\mathbbm {1}_{S}^{T} D_{2} \mathbbm {1}_{S}} $, or equivalently, that satisfies the inequality $\frac {\mathbbm {1}_{S}^{T} D_{2} \mathbbm {1}_{S}}{\mathbbm {1}_{S}^{T} D \mathbbm {1}_{S}}-\frac {\mathbbm {1}_{S}^{T} L^{2} \mathbbm {1}_{S}}{\mathbbm {1}_{S}^{T} L \mathbbm {1}_{S}} \geq 0$. We have

$$\begin{array}{*{20}l} \frac{\mathbbm{1}_{S}^{T} D_{2} \mathbbm{1}_{S}}{\mathbbm{1}_{S}^{T} D \mathbbm{1}_{S}}-\frac{\mathbbm{1}_{S}^{T} L^{2} \mathbbm{1}_{S}}{\mathbbm{1}_{S}^{T} L \mathbbm{1}_{S}} &\geq \frac{\mathbbm{1}_{S}^{T} D^{2} \mathbbm{1}_{S}}{\mathbbm{1}_{S}^{T} D \mathbbm{1}_{S}}-\frac{\mathbbm{1}_{S}^{T} L^{2} \mathbbm{1}_{S}}{\mathbbm{1}_{S}^{T} L \mathbbm{1}_{S}} \\ &\geq \frac{\mathbbm{1}_{S}^{T} D^{2} \mathbbm{1}_{S}}{\mathbbm{1}_{S}^{T} D \mathbbm{1}_{S}} - \left(\max_{u \in S} \sum\limits_{w \in S^{c}} W_{uw} + \max_{\ell \in S^{c}} \sum\limits_{v \in S} W_{\ell v} \right) \\ &\geq \frac{\mathbbm{1}_{S}^{T} D \mathbbm{1}_{S}}{\mathbbm{1}_{S}^{T} \mathbbm{1}_{S}} - \left(\max_{u \in S} \sum\limits_{w \in S^{c}} W_{uw} + \max_{\ell \in S^{c}} \sum\limits_{v \in S} W_{\ell v} \right)\\ &= \frac{vol(S)}{|S|} - \left(\max_{u \in S} \sum\limits_{w \in S^{c}} W_{uw} + \max_{\ell \in S^{c}} \sum\limits_{v \in S} W_{\ell v} \right), \end{array} $$

(23)

where we have used Lehmer’s and Hölder’s inequalities and that $\mathbbm {1}_{S}^{T} L^{2}\mathbbm {1}_{S} = \sum \nolimits _{u \in S}\left (\sum \nolimits _{w \in S^{c}} W_{uw} \right)^{2} + \sum \nolimits _{\ell \in S^{c}} \left (\sum \nolimits _{v \in S} W_{\ell v} \right)^{2} $. Thus, it is sufficient that S satisfies:

$$ \frac{vol(S)}{|S|} - \left(\max_{u \in S} \sum\limits_{w \in S^{c}} W_{uw} + \max_{\ell \in S^{c}} \sum\limits_{v \in S} W_{\ell v} \right) \geq 0{.} $$

(24)

□
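Condition (24) only involves the volume of S and the largest cross-cluster degrees, so it is cheap to evaluate. The sketch below computes its left-hand side on a toy graph (two 5-cliques joined by a single edge, a hypothetical example) and compares the Cheeger ratios for γ=1 and γ=2; the function name is an assumption made for the illustration.

```python
import numpy as np

def sufficient_condition(W, S):
    """Left-hand side of (24): vol(S)/|S| minus the two largest cross degrees."""
    inside = np.zeros(W.shape[0], dtype=bool)
    inside[S] = True
    cross = W[np.ix_(inside, ~inside)]       # weights between S and its complement
    vol = W[inside].sum()                    # sum of the degrees of the nodes in S
    return vol / inside.sum() - (cross.sum(axis=1).max() + cross.sum(axis=0).max())

# Toy graph (hypothetical): two 5-cliques joined by the single edge (4, 5).
W = np.zeros((10, 10))
W[:5, :5] = 1.0
W[5:, 5:] = 1.0
np.fill_diagonal(W, 0.0)
W[4, 5] = W[5, 4] = 1.0
S = np.arange(5)

margin = sufficient_condition(W, S)          # 21/5 - (1 + 1)
L = np.diag(W.sum(axis=1)) - W
ind = np.zeros(10)
ind[S] = 1.0
h1 = (ind @ L @ ind) / (np.diag(L) @ ind)            # 1/21
h2 = (ind @ L @ L @ ind) / (np.diag(L @ L) @ ind)    # 2/110
```

Here the margin is positive and, in agreement with the proposition, the ratio for γ=2 is smaller than the one for γ=1.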

Availability of data and materials

All data generated or analysed during this study are included in the following articles (Lecun et al. 1998; Hond and Spacek 1997; Greene and Cunningham 2006; The phoneme database). The code to replicate the results is available in the GitHub repository, https://github.com/estbautista/Lgamma-PageRank_Paper.

References

Andersen, R, Chung FRK, Lang KJ (2007) Using pagerank to locally partition a graph. Int Math 4:35–64. https://doi.org/10.1080/15427951.2007.10129139.
Andersen, R, Chung F (2007) Detecting sharp drops in pagerank and a simplified local partitioning algorithm. In: Cai J-Y, Cooper SB, Zhu H (eds) Theory and Applications of Models of Computation, 1–12. Springer, Berlin.
Avrachenkov, K, Dobrynin V, Nemirovsky D, Pham SK, Smirnova E (2008) Pagerank based clustering of hypertext document collections. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 873–874. ACM, Singapore.
Avrachenkov, K, Gonçalves P, Legout A, Sokol M (2012a) Classification of Content and Users in BitTorrent by Semi-supervised Learning Methods. In: International Wireless Communications and Mobile Computing Conference (3rd International Workshop on Traffic Analysis and Classification). IEEE, Cyprus. Best paper award.
Avrachenkov, K, Gonçalves P, Mishenin A, Sokol M (2012b) Generalized Optimization Framework for Graph-based Semi-supervised Learning. In: Proceedings of the 2012 SIAM International Conference on Data Mining, 966–974. https://epubs.siam.org/doi/abs/10.1137/1.9781611972825.83.
Avrachenkov, K, Kadavankandy A, Litvak N (2018) Mean Field Analysis of Personalized PageRank with Implications for Local Graph Clustering. J Stat Phys 173(3-4):895–916. https://doi.org/10.1007/s10955-018-2099-5.
Bautista, E, De Nigris S, Abry P, Avrachenkov K, Gonçalves P (2017) Lévy Flights for Graph Based Semi-Supervised Classification. In: 26th Colloquium GRETSI. GRETSI, 2017 - Proceeding of the 26th colloquium. GRETSI, Groupe d’Etudes du Traitement du Signal et des Images, Juan-Les-Pins.
Chung, F (2007) Four cheeger-type inequalities for graph partitioning algorithms. In: Proceedings of ICCM. Scietech Publisher, Hiroshima.
Chung, F (2010) Pagerank as a discrete green’s function. Geom Anal I ALM 17:285–302.
de Nigris, S, Bautista E, Abry P, Avrachenkov K, Goncalves P (2017) Fractional graph-based semi-supervised learning. In: 2017 25th European Signal Processing Conference (EUSIPCO), 356–360. https://doi.org/10.23919/EUSIPCO.2017.8081228.
Fontugne, R, Bautista E, Petrie C, Nomura Y, Abry P, Gonçalves P, Fukuda K, Aben E (2019) BGP Zombies: an Analysis of Beacons Stuck Routes. In: PAM 2019 - 20th Passive and Active Measurements Conference, 1–13. Springer, Puerto Varas. https://hal.inria.fr/hal-01970596.
Graham, FC, Horn P, Tsiatas A (2009) Distributing antidote using pagerank vectors. Int Math 6:237–254.
Greene, D, Cunningham P (2006) Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd International Conference on Machine Learning. ICML ’06, 377–384. ACM, New York. https://doi.org/10.1145/1143844.1143892.
Hond, D, Spacek L (1997) Distinctive descriptions for face processing. In: Clark AF (ed) BMVC. British Machine Vision Association, Essex.
Lecun, Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324. https://doi.org/10.1109/5.726791.
Litvak, N, Scheinhardt W, Volkovich Y, Zwart B (2009) Characterization of tail dependence for in-degree and pagerank. In: Avrachenkov K, Donato D, Litvak N (eds) Algorithms and Models for the Web-Graph, 90–103. Springer, Berlin.
Matthews, BW (1975) Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochim Biophys Acta Protein Struct 405(2):442–451. https://doi.org/10.1016/0005-2795(75)90109-9.
Mossel, E, Neeman J, Sly A (2015) Reconstruction and estimation in the planted partition model. Probab Theory Relat Fields 162(3):431–461. https://doi.org/10.1007/s00440-014-0576-6.
Pérez Riascos, A, Mateos J (2014) Fractional dynamics on networks: Emergence of anomalous diffusion and lévy flights. Phys Rev E 90:032809. https://doi.org/10.1103/PhysRevE.90.032809.
Rice, SH (2008) A stochastic version of the Price equation reveals the interplay of deterministic and stochastic processes in evolution. BMC evolutionary biology 8(1):262.
Shuman, DI, Vandergheynst P, Kressner D, Frossard P (2018) Distributed signal processing via chebyshev polynomial approximation. IEEE Trans Signal Inf Proc Over Netw 4(4):736–751. https://doi.org/10.1109/TSIPN.2018.2824239.
Sokol, M (2014) Graph-based semi-supervised learning methods and quick detection of central nodes. PhD thesis. Université de Nice, Ecole Doctorale STIC, Inria Sophia Antipolis, Maestro.
Subramanya, A, Bilmes J (2008) Soft-supervised learning for text classification. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. EMNLP ’08, 1090–1099. Association for Computational Linguistics, Stroudsburg.
Tsiatas, A (2012) Diffusion and clustering on large graphs. PhD thesis. University of California at San Diego, La Jolla.
The phoneme database. https://www.openml.org/d/1489. Accessed 1 Feb 2019.
Zhang, P, Moore C, Zdeborova L (2014) Phase transitions in semisupervised clustering of sparse networks. Phys Rev E Stat Nonlinear Soft Matter Phys 90. https://doi.org/10.1103/PhysRevE.90.052802.
Zhao, M, Chan RHM, Chow TWS, Tang P (2014) Compact graph based semi-supervised learning for medical diagnosis in alzheimer’s disease. IEEE Signal Proc Lett 21(10):1192–1196. https://doi.org/10.1109/LSP.2014.2329056.
Zhou, X, Belkin M (2011) Semi-supervised learning by higher order regularization. In: Gordon G, Dunson D, Dudík M (eds) Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, 892–900. PMLR, Fort Lauderdale. http://proceedings.mlr.press/v15/zhou11b/zhou11b.pdf.
Zhou, D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global consistency. In: Thrun S, Saul LK, Schölkopf B (eds) Advances in Neural Information Processing Systems 16, 321–328. http://papers.nips.cc/paper/2506-learning-with-local-and-global-consistency.pdf.
Zhou, D, Burges CJC (2007) Spectral clustering and transductive learning with multiple views. In: Proceedings of the 24th International Conference on Machine Learning. ICML ’07, 1159–1166. ACM, New York. https://doi.org/10.1145/1273496.1273642.


Acknowledgements

Not applicable

Funding

This work was supported by CONACyT and the Labex MILyon

Author information

Authors and Affiliations

Univ Lyon, Inria, CNRS, ENS de Lyon, UCB Lyon 1, LIP UMR 5668, Lyon, F-69342, France
Esteban Bautista & Paulo Gonçalves
Univ Lyon, Ens de Lyon, Univ Claude Bernard, CNRS, Laboratoire de Physique, Lyon, F-69342, France
Esteban Bautista & Patrice Abry


Contributions

EB, PA and PG participated equally in designing and developing the project, and in writing the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Esteban Bautista.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Bautista, E., Abry, P. & Gonçalves, P. L^γ-PageRank for semi-supervised learning. Appl Netw Sci 4, 57 (2019). https://doi.org/10.1007/s41109-019-0172-x


Received: 01 March 2019
Accepted: 22 July 2019
Published: 19 August 2019
DOI: https://doi.org/10.1007/s41109-019-0172-x
