Learning attribute and homophily measures through random walks

Antunes, Nelson; Banerjee, Sayan; Bhamidi, Shankar; Pipiras, Vladas

doi:10.1007/s41109-023-00558-3

Research
Open access
Published: 27 June 2023

Applied Network Science volume 8, Article number: 39 (2023)

Abstract

We investigate the statistical learning of nodal attribute functionals in homophily networks using random walks. Attributes can be discrete or continuous. A generalization of various existing canonical models based on preferential attachment is studied (model class $\mathscr {P}$), where new nodes form connections dependent on both their attribute values and popularity as measured by degree. An associated model class $\mathscr {U}$ is described, which is amenable to theoretical analysis and gives access to asymptotics of a host of functionals of interest. Settings where asymptotics for model class $\mathscr {U}$ transfer over to model class $\mathscr {P}$ through the phenomenon of resolvability are analyzed. For the statistical learning, we consider several canonical attribute agnostic sampling schemes such as the Metropolis-Hastings random walk and versions of node2vec (Grover and Leskovec, 2016) that incorporate both classical random walk and non-backtracking propensities, and propose new variants which use attribute information in addition to topological information to explore the network. Estimators for learning the attribute distribution, degree distribution for an attribute type and homophily measures are proposed. The performance of this statistical learning framework is studied on both synthetic networks (model class $\mathscr {P}$) and real world systems, and its dependence on the network topology, degree of homophily or absence thereof, and (un)balanced attributes is assessed.

Introduction

Attributed networks, namely graphs in which nodes and/or edges have attributes, are at the center of network-valued datasets in many modern applications. For example, in real-world network datasets most nodes have values of characteristics of interest; in social networks, users have attributes such as “gender”, “age”, “language”; in citation networks, articles are classified by the main subject, field, sub-field, keywords. Networks also differ in the range of attributes values (cardinality), their types (discrete or continuous) and the size of each group. In one direction, machine learning pipelines such as network representation learning Fan et al. (2021), clustering Chang et al. (2019), classification Lee et al. (2017), and community detection Baroni et al. (2017) have been developed to study the entire network. Another recent direction, specifically related to attributed network valued data, is the use of attribute information, in addition to graph topological information, in improving the performance of exploratory data analytic techniques such as community detection Berahmand et al. (2022) or link prediction tasks Nasiri et al. (2023). Both papers, through careful development of methodological analysis using graph regularization and non-negative matrix factorization, and through detailed empirical analysis, show significant improvement for such machine learning pipelines via incorporating node attribute information. Driven by the scale of data, the main motivation of this paper is network sampling, where limited explorations based on random walks are used to learn network level functionals of attributes.

In real-world networks, the attributes of a node will co-vary and are not independent. One standard phenomenon in many such real world systems is homophily Shrum et al. (1988); McPherson et al. (2001); Mislove et al. (2010), i.e., node pairs with similar attributes being more likely to be connected than node pairs with discordant attributes. For instance, many social networks show this property, which is the tendency of individuals to associate with others who are similar to them; e.g., with respect to the gender, ethnicity, political ideologies. Furthermore, the distribution of user attributes over the network is usually uneven, with coexisting groups of different sizes, e.g., one ethnic group may dominate others Espín-Noboa et al. (2021). On the other hand, another co-variation across neighbors is due to heterophily, where nodes with the same attribute type value repel each other.

Performance of network sampling algorithms in such settings has received some attention including: the bias of several sampling methods in conserving position of nodes and visibility of groups Wagner et al. (2017); the effect of homophily on centrality measures and visibility of minority groups and fairness questions Karimi et al. (2018). More recently the synthetic models that motivate this paper were used in Espín-Noboa et al. (2022) to understand the inequality of node ranking algorithms (e.g. as measured by the Gini coefficient) as well as inequity (e.g. by contrasting the percentage of a given attribute amongst the most popular k%-age of nodes with the true demographic percentage of that group), in particular trying to understand the foundational characteristics of network evolution such as homophily or preferential attachment in (quoting Espín-Noboa et al. (2022)) “reducing, replicating or amplifying” representation of specific groups by these ranking algorithms. In a different direction, Espín-Noboa et al. (2021) uses these synthetic models to understand the accuracy of semi-supervised machine learning tasks such as learning/prediction of attribute labels given partial information on the labels of a subset of seeded vertices; the goal is to understand the impact of homophily/heterophily and preferential attachment driven growth characteristics of the underlying network on the accuracy of a host of popular relational classifiers and collective inference algorithms.

This paper is motivated by the lack of theoretical results in the analysis of attributed network models with homophily and by the need for a learning framework to estimate attribute functionals in real networks. We investigate the following research questions (RQ).

RQ1

How to analyze and extend the existing network models with homophily and derive the main functionals of interest?

We describe a generalization of the directed preferential attachment model with homophily (called model class $\mathscr {P}$) formulated in Karimi et al. (2018) where new nodes connect to existing ones based on the attributes of both end points of the potential edge and centrality of the existing vertex. The network model can generate scale-free networks with discrete or continuous attributed nodes, and different intensities of homophily. The dynamics of the network are as follows. Starting from a fully connected cluster of nodes with attributes, each node that arrives has an attribute generated independently according to a given distribution and connects to a fixed (constant) number of nodes. The probability that a new node connects to an existing node is proportional to the product of the degree (to the power of a parameter) with a function that measures the propensity of the two nodes' attributes to interact. Thus, the model encodes the interplay between the two main mechanisms of tie formation found in social networks: preferential attachment and homophily. Given the importance of this model in applications, theoretical analysis of this model, including stability properties of heterophily and homophily statistics, is of great importance; yet to date the only functional amenable to theoretical analysis has been the degree distribution Karimi et al. (2018); Jordan (2013). We describe a related model of network evolution (called model class $\mathscr {U}$) which is much more amenable to theoretical analysis and a phenomenon we term resolvability which enables one to transfer results from model class $\mathscr {U}$ to model class $\mathscr {P}$; in this paper we specialize to large network limits for degree distribution for an attribute type and homophily and heterophily statistics, deferring a full treatment to Antunes et al. (2023).

RQ2

How to use the existing link trace algorithms to sample the network and take into account the attributes of nodes?

Uniform random sampling of nodes or edges is the “gold standard”, providing unbiased estimates of corresponding attribute functionals. However, owing to both computational and privacy issues in social networks and other settings, such sampling is often infeasible. Other networks that allow random access limit the rate of API (Application Programming Interface) calls, so that creating a sample of sufficient size takes a prohibitively long time. In these cases, link trace sampling schemes, such as random walks (RWs), are typically used; see references in Antunes et al. (2021 2021) for estimation of functionals such as degree distribution and clustering. However, much less is known in the context of estimating quantities influenced by attribute types in homophily networks.

In this work, we consider several existing canonical attribute agnostic sampling schemes proposed in the literature (that do not use the attribute type of nodes to construct the sample) such as the Metropolis-Hastings random walk and versions of node2vec Grover and Leskovec (2016) that incorporate both classical random walk and walks with non-backtracking propensities. These random walks have been designed to preserve structural properties of the network in the sample, such as high degree nodes, clustering and diameter, and not the different types of node attributes. We are interested not only in estimating the proportion of nodes with a given attribute but also in the structural properties of the sub-network spanned by vertices of a specified attribute type including the degree distribution and homophily measures. Our main contribution here is to show that random walks that use edge weights can be attribute aware samplers through the proposal of variants of node2vec where edge weights depend on attributes of its end nodes. This will be especially useful in homophilic networks for analyzing geometric properties involving nodes with minority attributes.

RQ3

How to estimate the attribute functionals and homophily measures through the sampling schemes and evaluate their performance?

We propose estimators for attribute functionals and homophily measures that are based on correcting the bias of the empirical sample quantities through the use of stationary distribution of the RWs associated in sampling nodes and edges.

We study the performance of the considered random walk sampling schemes in terms of estimation error of the attribute distributions and homophily measures across the following four dimensions in both synthetic networks using the model class $\mathscr {P}$ and real world settings: (a) Inherent homophilic propensity of the network and underlying density of attributes; (b) Impact of centrality of nodes as measured by degree in the evolution of the network; (c) Nonlinear impact of incorporating “escape echo chamber” mechanisms in random walks by encouraging walks to jump across edges with discordant attributes; (d) Impact of reducing the backtracking propensity to encourage walks to explore more of the network.

We find that (i) RWs with attribute dependent weights can perform better than attribute agnostic RWs in homophilic networks; (ii) the weights need to balance the movements between/within nodes with different/same attributes; (iii) non-backtracking improves performance, especially in conjunction with attribute dependent weights and low edge density; (iv) methods seem to work comparably well for synthetic and real networks.

This paper is a significant extension of the conference paper Antunes et al. (2023) including: (a) appreciable expansion of the theoretical developments to the network models described in Antunes et al. (2023), including describing the notion of resolvability of such models which allows one to connect them to a different class of models for which asymptotic analysis for a wide range of functionals, such as degree exponent for an attribute type, homophily and heterophily statistics can be undertaken; (b) substantial expansion of the methodological development of the paper, including a new class of functionals (degree distribution for an attribute and homophily measures) to be estimated through network sampling schemes; (c) new network sampling schemes from node2vec variants; (d) further applications of the methodology developed to new network data for evaluation and comparison; and (e) a final section with extensions and future directions of the work.

Attributed network models and homophily functionals

As described above, synthetic models have been used to great effect in understanding the structure and evolution of attributed networks and the impact of ranking, sampling and classification algorithms in such settings. The overarching goal in this section is to describe an extension of the canonical (linear) attributed network models currently considered in the literature. We refer the interested reader to Karimi et al. (2018); Espín-Noboa et al. (2022 2021) and the references therein for further discussion on motivations and use of such models. More concretely in this section:

(a)
We will describe the main synthetic model, termed non-linear preferential attachment (NLPA) model with homophily, and referred to for the rest of the paper as model class $\mathscr {P}$.
(b)
We will give concrete formulations of key network functionals measuring homophily between different groups.
(c)
Understanding (large network) asymptotics for model class $\mathscr {P}$ is non-trivial. We will introduce a related model (referred to as model class $\mathscr {U}$), that seems significantly more amenable to analysis, formalize a notion called resolvability, connecting model classes $\mathscr {P}$ and $\mathscr {U}$ and then describe the explicit results that can be derived for model class $\mathscr {P}$, at least in the linear case using $\mathscr {U}$. Technical justifications of these connections can be found in Antunes et al. (2023).

Fix an attribute (or latent) space ${\mathcal {A}}$ with probability measure $\mu$. Fix a (potentially asymmetric) function $f: {\mathcal {A}}\times {\mathcal {A}}\rightarrow {\mathbb {R}}_+$ which measures propensities of node pairs to interact based on their attributes. Fix $\alpha \ge 0$ describing the role of degree in measuring popularity and integer $m\ge 1$ denoting the number of edges a new vertex has when entering the system, to connect to pre-existing vertices. In principle m could be random and/or dependent on the attribute type, but for simplicity and to match existing literature (e.g. Karimi et al. (2018)) we focus on the fixed m setting (see Antunes et al. (2023) for results when m is attribute dependent). Let N be the number of nodes (vertices) in the network. In the model class $\mathscr {P}$, nodes $\left\{ v_{{ n}}:1\le {n}\le N\right\}$ enter the system sequentially starting at ${n}=1$ with a base connected graph ${\mathcal {G}}_1$ (with every node having an attribute in ${\mathcal {A}}$) with dynamics:

(i)
Every node $v_{{ n}}$ has attribute $a(v_{{ n}}) \in {\mathcal {A}}$ generated independently using $\mu$.
(ii)
Node $v_{{n}}$ enters the system with m edges.
(iii)
The dynamics for connecting each of the m edges are recursively defined as follows: suppose the network has been constructed till stage n with structure ${\mathcal {G}}_{n}$. For any n and $0\le i\le m-1$ and $v\in {\mathcal {G}}_{n}$, let $\deg _i(v,{n})$ denote the degree of v at time n when i of the edges of $v_{{n}+1}$ have connected to ${\mathcal {G}}_{n}$. Conditional on ${\mathcal {G}}_{n}$ and stage i, the probability that the $(i+1)$th edge of $v_{{n}+1}$ connects to $v\in {\mathcal {G}}_{n}$ is proportional to:
$$\begin{aligned} P_{v_{{n}+1} v} \propto f(a(v), a(v_{{n}+1})) [\deg _i(v,{n})]^\alpha . \end{aligned}$$
(1)
Once this edge has connected, all the degrees are updated and the above dynamics is repeated till all m edges have connected to ${\mathcal {G}}_{n}$. When $m=1$, then each new vertex has only one edge to connect to the network and in this case we write $\deg (v,{n}){:}{=} \deg _0(v,{n})$.
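The dynamics (i)–(iii) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `simulate_model_p` is hypothetical, attributes are coded as 0, ..., K-1, `f` is a nested list with `f[a][b]` standing for $f(a,b)$, the base graph is taken to be a complete graph on m+1 nodes, and repeated targets (multi-edges) are allowed.

```python
import random


def simulate_model_p(N, m, alpha, mu, f, seed=0):
    """Grow a network following (1); returns (attrs, edges)."""
    rng = random.Random(seed)
    types = list(range(len(mu)))
    n0 = m + 1  # assumption: base graph is a complete graph on m + 1 nodes
    attrs = [rng.choices(types, weights=mu)[0] for _ in range(n0)]
    edges = [(u, v) for u in range(n0) for v in range(u + 1, n0)]
    deg = [n0 - 1] * n0
    for new in range(n0, N):
        # (i) attribute of the arriving node, drawn independently from mu
        a_new = rng.choices(types, weights=mu)[0]
        # (ii)-(iii) attach m edges one at a time, updating degrees in between
        for _ in range(m):
            weights = [f[attrs[v]][a_new] * deg[v] ** alpha for v in range(new)]
            target = rng.choices(range(new), weights=weights)[0]
            edges.append((new, target))
            deg[target] += 1
        attrs.append(a_new)
        deg.append(m)
    return attrs, edges


# Example with three attribute types and homophilic propensities
f = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
attrs, edges = simulate_model_p(N=200, m=2, alpha=1.0, mu=[0.7, 0.2, 0.1], f=f)
```

Recomputing all weights at every step costs O(N) per edge, which is adequate for networks of moderate size; larger simulations would need a more careful sampling structure.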

We will refer to this as model class $\mathscr {P}$ (or $\mathscr {P}(\alpha , \mu , f)$ when we want to specify all the parameters; we suppress dependence on m to ease notation) and sometimes write $\left\{ {\mathcal {G}}_n:1\le n\le N\right\} \sim \mathscr {P}(\alpha , \mu ,f)$. The model (1) extends various existing models including: Barabási-Albert model Barabási and Albert (1999) ($f\equiv 1$, $\alpha =1$), sublinear PA Krapivsky and Redner (2001) ($f\equiv 1$, $0< \alpha <1$), PA with multiplicative fitness Bianconi and Barabási (2001) ($f(a,a^\prime ) =a$, $\alpha =1$), scale free homophilic model de Almeida et al. (2013) ($f(a,a^\prime ) = 1- |a-a^\prime |$, ${\mathcal {A}}= [0,1]$, $\alpha =1$), and geometric versions with $\alpha =1$, a compact metric space ${\mathcal {A}}$ and an appropriate function f of the distance Flaxman et al. (2007) and Jordan (2013). Most existing studies focus on asymptotics for either the degree distribution or maximal degree. The notation used in the paper is summarized in Table 1.

Table 1 Summary of the main notation

Homophily functionals

When the latent space ${\mathcal {A}}= \left\{ 1,2,\ldots , K\right\}$ is finite, one can define macroscopic measures of homophily, and conversely heterophily Park and Barabási (2007), from an observed network ${\mathcal {G}}$ (either synthetic or empirically observed) on N nodes as follows. Let ${\mathcal {E}}$ denote the total edge set; for $a\in {\mathcal {A}}$, let ${\mathcal {V}}_a$ be the set of nodes of type a, and for $a,a'\in {\mathcal {A}}$, let ${\mathcal {E}}_{aa'}$ be the set of edges between nodes of types a and $a'$. Let $p = |{\mathcal {E}}|/{N \atopwithdelims ()2}$ be the edge density. For $a\in {\mathcal {A}}$, dyadicity

$$\begin{aligned} D_{a} = |{\mathcal {E}}_{aa}| \Big / \bigg ({|{\mathcal {V}}_a| \atopwithdelims ()2} p \bigg ) \end{aligned}$$

(2)

measures the contrast in edges within the cluster of nodes a as compared to a setting where all edges are randomly distributed; thus $D_a > 1$ signals homophilic characteristics of type a nodes while $D_a<1$ signifies heterophilic nature of type a nodes. Similarly, for $a\ne a'$, heterophilicity

$$\begin{aligned} H_{aa'} = |{\mathcal {E}}_{aa'}|/(|{\mathcal {V}}_a||{\mathcal {V}}_{a'}|p) \end{aligned}$$

(3)

denotes propensity of type a nodes to connect to type $a'$ nodes as contrasted with random placement of edges with probability equal to the global edge density. If $H_{aa'} < 1$, nodes of opposite labels do not tend to be connected (homophilic); if $H_{aa'} > 1$, there are more connections between nodes of different labels a and $a'$ (heterophilic).
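As a concrete illustration of (2) and (3), the following sketch computes dyadicity and heterophilicity from a list of node attributes and an undirected edge list. The function name `homophily_measures` and the input format are assumptions made for this example; attribute types with fewer than two nodes are skipped for $D_a$ since the denominator in (2) vanishes.

```python
from collections import Counter
from math import comb


def homophily_measures(attrs, edges):
    """Return (D, H): dyadicity per type and heterophilicity per pair of types."""
    N = len(attrs)
    p = len(edges) / comb(N, 2)  # global edge density
    size = Counter(attrs)        # |V_a| for each attribute type a
    # Number of edges per unordered pair of end-point types, |E_{aa'}|
    pair_count = Counter(tuple(sorted((attrs[u], attrs[v]))) for u, v in edges)
    D = {a: pair_count[(a, a)] / (comb(size[a], 2) * p)
         for a in size if size[a] >= 2}
    H = {(a, b): pair_count[(a, b)] / (size[a] * size[b] * p)
         for a in size for b in size if a < b}
    return D, H


# Toy graph: two types, one edge inside each group and one edge across
D, H = homophily_measures([0, 0, 1, 1], [(0, 1), (2, 3), (1, 2)])
print(D, H)  # {0: 2.0, 1: 2.0} {(0, 1): 0.5}
```

In the toy graph both groups are denser internally than the global density would suggest ($D_a = 2$), while the cross-group value $H_{01} = 0.5$ indicates a homophilic pattern.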

Illustrations of homophilic synthetic networks of the model class $\mathscr {P}(\alpha , \mu , f)$ generated from (1) are given in Fig. 1. The total number of nodes is $N=1000$ and each node has an attribute in ${\mathcal {A}}=\{1, 2, 3\}$ according to the probability mass function (p.m.f.) $\mu = (0.7, 0.2,0.1)$; the propensities of node pairs to connect based on their attributes are $f(a,a)=0.8$, $f(a,a')=0.1$, $a\ne a'=1,2,3$ and $m=2$. The networks are plotted for different values of $\alpha$ in Fig. 1a–c. For instance, with $\alpha =0.2$, the corresponding homophily measures are $D_1=1.364$, $D_2=3.038$, $D_3=7.38$, $H_{12}=0.336$, $H_{13}=0.386$ and $H_{23}=0.399$. Figure 2 shows the case of heterophilic networks with $N=1000$ of the model class $\mathscr {P}(\alpha , (0.7,0.2,0.1), f)$ for different values of $\alpha$ with $f(a,a)=0.2$, $f(a,a')=0.4$, $a\ne a'=1,2,3$ and $m = 2$. For $\alpha =1$, the homophily measures are $D_1=0.750$, $D_2=0.772$, $D_3=0.750$, $H_{12}=1.479$, $H_{13}=1.615$ and $H_{23}=1.873$.

Model class $\mathscr {U}$ and rationale

While model class $\mathscr {P}$ has been heavily used in applications, deriving large network asymptotics of functionals is non-trivial. Next we will describe a related network model (model class $\mathscr {U}$), the rationale for why this might be more amenable to analysis, and then formalize situations where given $\mathscr {P}$, one can construct (using as input the parameters $\alpha , \mu , f$ from $\mathscr {P}$), a corresponding model in class $\mathscr {U}$ such that properties of $\mathscr {P}$ can be read off from (the more easily analyzable) $\mathscr {U}$. For most of this discussion we will only consider the $m=1$ setting, albeit the formulae for asymptotics for various functionals considered below seem to extend, at least in simulations, in a straightforward manner to general m setting.

Since the general setting (with “continuous” attribute space) is more technical, let us explain the basic rationale in the simpler discrete setting where ${\mathcal {S}}= [K]{:}{=}\left\{ 1,2,\ldots , K\right\}$ so that $\mu$ is a p.m.f. Fix a p.m.f. $\nu$ (potentially, and in most cases, different from $\mu$) and consider the attributed network model $\left\{ {\tilde{{\mathcal {G}}}}_n: n\ge 0\right\}$ with dynamics:

$$\begin{aligned} {{\,\mathrm{{\mathbb {P}}}\,}}\left( a(v_{{n}+1}) = a^\star , v_{{n}+1} \leadsto v| {{\tilde{{\mathcal {G}}}}}_{n}\right) {:}{=} \frac{\nu (a^\star )f(a(v), a^\star ) [\deg (v,{n})]^\alpha }{\sum _{a\in [K]}\sum _{v^\prime \in {{\tilde{{\mathcal {G}}}}}_{n} } \nu (a) f(a(v^\prime ), a) [\deg (v^\prime , {n})]^\alpha }. \end{aligned}$$

(4)

Note that the above model is invariant to scaling in $\nu$, so it will be convenient to allow $\nu$ to be a general weight sequence instead of normalizing it to be a probability measure.

The above belongs to a general class of models defined below that we will refer to as $\mathscr {U}(\alpha , \nu , f)$. Thus, here the p.m.f. $\nu$ plays the role of a weight and further, unlike the model $\mathscr {P}$ where each new arriving vertex has its attribute sampled independently of the current state of the network, here the attribute distribution of new vertices depends on the entire state of the current network.

Rationale for technical tractability: Tabling the issue of connection with $\mathscr {P}$ for the next sections, first note that $\mathscr {U}$ can be simulated via dynamics where every vertex essentially behaves independently ((c) below). In brief, if one wanted to simulate model class $\mathscr {U}$ starting from one vertex of type a, then this can be done as follows:

(a)
Every vertex v that enters the system (starting with the root of type a) gives birth independently to child nodes with attributes in continuous time, connected to the vertex.
(b)
For a node of type a, conditional on its degree d, the rate of reproduction of a child node of type $a^\prime$ is $\nu (a^\prime ) f(a,a^\prime ) d^\alpha$, matching the numerator of (4).
(c)
Reproduction dynamics is independent across nodes.

Write $\left\{ {{\,\textrm{BP}\,}}(t):t\ge 0\right\}$ for the (continuous time) process and, for any $n\ge 1$, let $T_n$ be the (random) time such that the size $|{{\,\textrm{BP}\,}}(T_n)| =n$. (BP stands for Branching Process.) Then it is easy to check that $\left\{ {{\,\textrm{BP}\,}}(T_n):1\le n\le N\right\}$ has the same distribution as $\left\{ {\tilde{{\mathcal {G}}}}_n: 1\le n \le N\right\} \sim \mathscr {U}(\alpha , \nu , f)$. Further, the independence in the evolution makes this model much more amenable to analysis, yielding asymptotic information for the process ${{\,\textrm{BP}\,}}$ and thus the model $\mathscr {U}$.
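For experimentation, the discrete-time dynamics (4) can be sketched directly, sampling the pair (parent vertex, attribute of the new vertex) jointly. This is an illustrative sketch only: the name `simulate_model_u` is hypothetical, attributes are coded as 0, ..., K-1 with `f[a][b]` standing for $f(a,b)$, and the process is started from two nodes joined by an edge (an assumption made here so that all initial degrees are positive).

```python
import random


def simulate_model_u(N, alpha, nu, f, base_attrs=(0, 0), seed=0):
    """Grow a tree following (4) with m = 1; returns (attrs, edges)."""
    rng = random.Random(seed)
    K = len(nu)
    attrs = list(base_attrs)  # assumption: two seed nodes joined by one edge
    edges = [(0, 1)]
    deg = [1, 1]
    while len(attrs) < N:
        pairs, weights = [], []
        for v in range(len(attrs)):
            for a_star in range(K):
                # Numerator of (4): nu(a*) f(a(v), a*) deg(v)^alpha
                pairs.append((v, a_star))
                weights.append(nu[a_star] * f[attrs[v]][a_star] * deg[v] ** alpha)
        v, a_star = rng.choices(pairs, weights=weights)[0]
        edges.append((len(attrs), v))
        attrs.append(a_star)
        deg.append(1)
        deg[v] += 1
    return attrs, edges


f = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
attrs, edges = simulate_model_u(N=100, alpha=1.0, nu=[0.5, 0.3, 0.2], f=f)
```

Because only ratios of weights enter (4), `nu` need not be normalized; passing `[1.0, 0.6, 0.4]` in place of `[0.5, 0.3, 0.2]` describes the same model.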

Resolvability

Note that the main model of interest, both as a synthetic test bed in this paper and in pre-existing work, is the model class $\mathscr {P}$. The main goal of this section is to formalize a connection between model classes $\mathscr {P}$ and $\mathscr {U}$. Given $\left\{ {\tilde{{\mathcal {G}}}}_n:0\le n\le N\right\} \sim \mathscr {U}(\alpha , \nu , f)$, for $n\ge 1$ define ${\tilde{\pi }}_n = \frac{1}{n}\sum _{t=1}^n \delta \left\{ a(v_t)\right\}$, i.e. the empirical measure of attributes in ${\tilde{{\mathcal {G}}}}_n$, normalized to be a probability measure so that it can be compared with $\mu$.

Now say that model $\mathscr {P}(\alpha , \mu , f)$ is resolvable if there exists $\nu$ such that for the model class $\mathscr {U}(\alpha , \nu , f)$, the empirical measures of attribute types satisfy: ${\tilde{\pi }}_n \rightarrow \mu$ as $n\rightarrow \infty$. In words, one can choose a weight measure $\nu$ such that the corresponding dynamics for $\mathscr {U}$ with the same $\alpha$ and f drives the empirical distribution to the limiting empirical distribution $\mu$ of model class $\mathscr {P}$ (since every new vertex has attribute distribution $\mu$ independent of the network evolution).

Resolvability in the linear finite attribute case

The linear case ($\alpha =1$) with a finite attribute space ${\mathcal {S}}=[K]$ turns out to be completely resolvable under the following.

Assumption 1

Assume the sampling measure $\mu = (\mu _1, \ldots , \mu _K)$ has all entries strictly positive and assume the affinity kernel $f(a,a^\prime ) >0$, $\forall a, a^\prime \in [K]$.

Fix a model class $\mathscr {P}(\alpha =1, \mu , f)$ satisfying the above Assumption. Let ${\mathcal {P}}([K])$ denote the $K-1$ dimensional simplex of probability mass functions on [K]. Define (in the interior of ${\mathcal {P}}([K])$) the function:

$$\begin{aligned}V_{\mu }(y){:}{=} 1-\frac{1}{2}\sum _{j\in {\mathcal {S}}} \mu _j\left( \log (y_j) + \log \Big (\sum _{k\in {\mathcal {S}}} y_k f(k,j) \Big ) \right) , \qquad y \in {\mathcal {P}}([K]).\end{aligned}$$

By Jordan (2013, P8), under the above Assumption, $V_{\mu }(\cdot )$ has a unique minimizer $\eta {:}{=} \eta (\mu ) = (\eta _1(\mu ), \ldots , \eta _K(\mu ))$ in the interior of ${\mathcal {P}}({\mathcal {S}})$. Now, define

$$\begin{aligned} \nu _a{:}{=} \frac{\mu _a}{\sum _{l=1}^K f(l,a)\eta _l}, \qquad \phi _{a,b}{:}{=} f(a,b)\nu _b, \qquad \phi _a {:}{=} \sum _{b=1}^K \phi _{a,b} = 2- \frac{\mu _a}{\eta _a}, \end{aligned}$$

(5)

where the final identity follows from Jordan (2013, P8). Let $\nu = (\nu _1, \ldots , \nu _K)$. Then the following paraphrases some of the results in Antunes et al. (2023):

1.
Under the above Assumption, model $\mathscr {P}(\alpha =1,\mu ,f)$ is resolvable with one resolving measure $\nu$ given as above. This implies, in particular, that local functionals (such as the degree distribution or PageRank) converge to the same limits as those for $\mathscr {U}(\alpha =1, \nu , f)$. Two specific implications are given next.
2.
For each $a\in [K]$, the empirical p.m.f. ${\textbf{p}}_n^a$ of the degrees of vertices of type a satisfies ${\textbf{p}}_n^a {\mathop {\longrightarrow }\limits ^{P}}{\textbf{p}}_\infty ^a$, where the limit p.m.f. has a power-law tail with exponent $1+2/\phi _a$, i.e. ${\textbf{p}}_\infty ^a(k)\sim k^{-(1+2/\phi _a)}$ as $k\rightarrow \infty$.
3.
Using the objects defined in (5), define the matrix
$$\begin{aligned} {\textbf{M}}= \left( {\textbf{M}}_{a,b}{:}{=}\frac{\phi _{a,b}}{2 - \phi _a}\right) _{a, b\in [K]}. \end{aligned}$$
(6)
Then the homophily and heterophily statistics $\left\{ D_{n,a}: a\in [K]\right\}$ and $\left\{ H_{n, (a,a^\prime )}: a\ne a^\prime \in [K]\right\}$ satisfy the asymptotics,
$$\begin{aligned} D_{n,a} {\mathop {\longrightarrow }\limits ^{P}}\frac{[{\textbf{M}}]_{a,a}}{\mu _a}, \qquad H_{n, (a,a^\prime )} {\mathop {\longrightarrow }\limits ^{P}}\frac{1}{2}\left[ \frac{[{\textbf{M}}]_{a^\prime ,a}}{\mu _a} + \frac{[{\textbf{M}}]_{a, a^\prime }}{\mu _{a^\prime }} \right] \end{aligned}$$
(7)

Remark 1

Result 2 above (the degree distribution asymptotics) was previously derived in Jordan (2013) using stochastic approximation techniques.
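For orientation, here is a brief sketch of one way to see the final identity in (5). It assumes that the minimizer $\eta$ is interior, so that a Lagrange multiplier argument for the constraint $\sum _i y_i = 1$ applies; the multiplier $\lambda$ and the shorthand $S_j(y) = \sum _{k\in [K]} y_k f(k,j)$ are notation introduced only for this sketch, and the argument is not claimed to be the one used in Jordan (2013). Stationarity of $V_{\mu }$ on the simplex at $\eta$ reads

$$\begin{aligned} \frac{1}{2}\left[ \frac{\mu _i}{\eta _i} + \sum _{j\in [K]} \frac{\mu _j f(i,j)}{S_j(\eta )}\right] = \lambda , \qquad i\in [K]. \end{aligned}$$

Multiplying by $\eta _i$ and summing over i gives $\lambda = \frac{1}{2}\big (\sum _i \mu _i + \sum _j \mu _j\big ) = 1$. Since $\nu _j = \mu _j/S_j(\eta )$ by (5), the stationarity condition becomes

$$\begin{aligned} \phi _i = \sum _{j\in [K]} f(i,j)\nu _j = 2 - \frac{\mu _i}{\eta _i}, \qquad i\in [K]. \end{aligned}$$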

The results above are illustrated numerically in Fig. 3 and Tables 2 and 3. We fixed the model class $\mathscr {P}(1, (0.7,0.2,0.1), f)$, where $f(a,a)=0.8$, $f(a,a')=0.1$, for $a\ne a'=1,2,3$ and $m = 1$. The model is resolvable with resolving measure $\nu$ approximately equal to (0.742, 0.189, 0.069). We generate the model classes $\mathscr {P}$ and $\mathscr {U}(\alpha , \nu , f)$ using (1) and (4), respectively, for different network sizes. Figure 3 shows the degree distributions of attribute 2 for both models, which get closer as N increases. In the limit they converge to the same p.m.f. We fit a power-law distribution function using a maximum likelihood approach to the empirical degree distribution tail per attribute of the model class $\mathscr {P}$ for each network size. The respective tail exponents are shown in Table 2 together with the asymptotic limit p.m.f. tail exponent $1+2/\phi _a$. Finally, the empirical and asymptotic dyadicity and heterophilicity measures, given respectively by (2), (3) and (7), are reported in Table 3. The results show that complicated functionals of the model class $\mathscr {P}$ can be easily approximated with good precision even for moderate network sizes.

Table 2 Tail exponent of the degree distribution per attribute of model class $\mathscr {P}(\alpha , \mu , f)$ for different network sizes N and asymptotically ($N\rightarrow \infty$), where $\alpha =1$, $\mu =(0.7,0.2,0.1)$, $f(a,a)=0.8$, $f(a,a')=0.1$, $a\ne a'=1,2,3$ and $m = 1$

Table 3 Homophily measures of model class $\mathscr {P}$ for different network sizes N and asymptotically ($N\rightarrow \infty$), where $\alpha =1$, $\mu =(0.7,0.2,0.1)$, $f(a,a)=0.8$, $f(a,a')=0.1$, $a\ne a'=1,2,3$ and $m = 1$

Random walk samplings in attributed networks

Since many real-world networks can only be crawled, in the sense that only the neighbors of the currently visited node can be explored, we consider sampling procedures that are based on random walks. They are also a core technique for constructing various algorithms to extract information on networks, such as community detection, ranking of nodes and edges, and dimension reduction. We introduce well-known random walks which are attribute agnostic. These random walks have been designed to preserve structural properties of the network and not the representativeness of node attributes in the sample. We are interested (see next section) in estimating the attribute distribution and also structural properties (node degrees) depending on the node attributes. We show next that some random walks that use edge weights can be attribute aware samplers. This will be especially useful in homophilic networks. Throughout this section, for graph ${\mathcal {G}}$ and node $i\in {\mathcal {G}}$, $d_i$ will denote its degree. We assume a static graph and that only a limited set of initial seed nodes $i\in {\mathcal {G}}$, from which the random walk can be initialized, is available. When we say that a node i is sampled, it means that its attribute a(i) (and its degree $d_i$, depending on the quantities of interest) is added to the sample.

Metropolis Hastings Random Walk (MHRW) At each step, if the walk is currently at node i, a neighbor j is selected uniformly at random and the proposed move to j is accepted with probability $\min (1,d_i/d_j)$, else the walk stays at i. Thus proposed moves towards a node of smaller degree are always accepted whilst we reject some of the proposed moves towards higher degree nodes. It is easy to check that the stationary distribution is uniform over the node set, i.e.,

$$\begin{aligned} \pi _i =1/N, \quad 1\le i\le N. \end{aligned}$$

(8)

The stationary distribution over the edge set is

$$\begin{aligned} \pi _{ij} = \frac{1}{N d_i}, \quad (i,j) \in {\mathcal {E}}. \end{aligned}$$

(9)
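The MHRW step described above can be sketched as follows. This is a minimal illustration, assuming the graph is given as an adjacency dictionary mapping each node to a list of its neighbors; the function name `mhrw` and the star graph used in the example are hypothetical choices.

```python
import random
from collections import Counter


def mhrw(adj, start, steps, seed=0):
    """Metropolis-Hastings random walk; returns the list of visited nodes."""
    rng = random.Random(seed)
    walk = [start]
    i = start
    for _ in range(steps):
        j = rng.choice(adj[i])  # propose a uniformly chosen neighbor
        # Accept with probability min(1, d_i / d_j); otherwise stay at i
        if rng.random() < min(1.0, len(adj[i]) / len(adj[j])):
            i = j
        walk.append(i)
    return walk


# Star graph: node 0 is the hub, nodes 1..4 are leaves
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
walk = mhrw(star, start=0, steps=20000)
freq = {v: c / len(walk) for v, c in Counter(walk).items()}
```

On the star graph a classical random walk would spend half of its time at the hub, whereas the visit frequencies in `freq` should all be close to 1/5, in line with (8).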

Node2vec (N2V) As proposed in Grover and Leskovec (2016), in full generality, the transitions of N2V depend on the neighborhood both of the currently visited node, and the node visited prior to the current node. Let the previously and currently visited nodes be k and i, resp. The next visited node j is chosen according to the transition probability proportional to:

$$\begin{aligned} p {(j|k,i)} \propto \left\{ \begin{array}{rl} \beta w_{ij} , &{} \quad k \ne j, (k,j) \notin {\mathcal {E}}, \\ \gamma w_{ij} , &{} \quad k \ne j, (k,j) \in {\mathcal {E}},\\ \theta w_{ij}, &{} \quad k = j, \end{array} \right. \end{aligned}$$

where $w_{ij}$ is the weight of edge (i, j), $\theta$ is the parameter that represents the propensity of the random walk to backtrack, $\gamma$ is the parameter that quantifies the propensity to move to a common neighbor of the currently visited node and the node visited in the last step, and $\beta$ is the parameter for exploring any other neighbor (see Fig. 4). N2V is a second order Markov chain. We now describe specific variants of this random walk, which include some classical versions.

Node2vec-1 (N2V-1): If the network is undirected, unweighted and $\theta =\beta =\gamma$, one obtains the classical RW with the well-known stationary distributions,

$$\begin{aligned} \pi _i = \frac{d_i}{2|{\mathcal {E}}|}, \qquad \pi _{ij} = \frac{1}{|{\mathcal {E}}|} . \end{aligned}$$

(10)

Node2vec-2 (N2V-2): If the network is undirected and $\theta =\beta =\gamma$, one obtains a weighted RW. This walk can use node attributes through weights in contrast to N2V-1. We assume that for each sampled node i, we have access to the attributes of the neighbors of i. If there is a connection between i and j, the weight $w_{ij}$ is a function of a(i) and a(j). In a homophilic network, setting $w_{ij}$ to a lower value if nodes have equal attributes encourages the sampling of nodes with different attributes. The stationary distributions in this case are given by

$$\begin{aligned} \pi _i = \frac{ \sum _j w_{ij}}{\sum _k \sum _j w_{kj}}, \qquad \pi _{ij} = \frac{w_{ij}}{\underset{k < l}{\sum \sum }\ w_{kl}} . \end{aligned}$$

(11)
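The effect of attribute-dependent weights on the stationary distributions in (11) can be checked on a small example. The sketch below is illustrative only: the helper names `attribute_weight` and `weighted_rw_stationary`, the toy graph and the weight values are assumptions, and node labels are assumed to be comparable (integers here) so that each undirected edge is listed once.

```python
def attribute_weight(attr, w_same, w_diff):
    """Return a weight function w(i, j) that depends only on whether a(i) == a(j)."""
    def weight(i, j):
        return w_same if attr[i] == attr[j] else w_diff
    return weight

def weighted_rw_stationary(adj, weight):
    """Stationary node and edge distributions of a weighted RW, following (11)."""
    strength = {i: sum(weight(i, j) for j in adj[i]) for i in adj}
    total = sum(strength.values())  # equals 2 * sum_{k<l} w_kl
    pi_node = {i: s / total for i, s in strength.items()}
    pi_edge = {(i, j): weight(i, j) / (total / 2.0)
               for i in adj for j in adj[i] if i < j}
    return pi_node, pi_edge

# Hypothetical toy graph: triangle 0-1-2 (attribute 1) plus node 3 (attribute 2) attached to 2.
toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
toy_attr = {0: 1, 1: 1, 2: 1, 3: 2}

pi_node, pi_edge = weighted_rw_stationary(toy_adj, attribute_weight(toy_attr, 0.25, 1.0))
pi_node_rw, _ = weighted_rw_stationary(toy_adj, attribute_weight(toy_attr, 1.0, 1.0))
```

In this toy example, down-weighting same-attribute edges raises the stationary probability of the single minority node from 1/8 (classical RW) to 2/7.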

Node2vec-3 (N2V-3): If the network is undirected, without self-loops or multiple edges, and $\beta =\gamma$, $\theta >0$, with equal weights $w_{ij}$, the stationary distributions for nodes and edges are given by (10), see Meng and Masuda (2020). With small $\theta$, the walk approaches the non-backtracking random walk, avoiding 2-hop redundancy in the sample.

Node2vec-4 (N2V-4): We consider next the combination of the last two schemes, with $\beta =\gamma$, $\theta >0$ and weights $w_{ij}$ dependent on the attributes of i and j. In this setting, one major technical hurdle is that, unlike the settings above, there is no explicit formula for the stationary distributions. Analogous to the stationary distributions for N2V-3 matching the usual RW in the stationary regime, it is expected that especially in the small $\theta$ setting, the stationary distributions can still be approximated by those in (11). We explore the efficacy of these approximations for moderate size synthetic networks below.

Node2vec-5 (N2V-5): In this variant the weights $w_{ij}$ are equal to 1 and $\theta$, $\gamma$ and $\beta$ are different. To push the walk toward nodes that are farther away from the previously visited nodes, we consider the case $\theta< \gamma < \beta$. The stationary distributions in this case are not known and we will use the empirical distribution obtained through simulations.

Node2vec-6 (N2V-6): This is the more general variant extending N2V-5 to have weights. Again, the most interesting case is $\theta< \gamma < \beta$. As in N2V-5 the stationary distributions are unknown. However, we include this sampling scheme for a full evaluation of the performance of N2V. We believe that for the network model an approximation can be obtained for stationary distributions through the resolvability of the model classes $\mathscr {P}$ and $\mathscr {U}$. Due to the technical nature of the problem, it is outside the scope of this paper, and will be considered in a future work.
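The general transition rule and the variants N2V-1 to N2V-6 differ only in the choice of $\beta$, $\gamma$, $\theta$ and $w_{ij}$, so they can be sketched with a single step function. The code below is a minimal illustration under stated assumptions: the names `n2v_step` and `n2v_walk` are hypothetical, the first step (with no previous node) treats every neighbor as an "other" neighbor, and $\theta > 0$ is assumed so that the walk can always leave a node of degree 1.

```python
import random

def n2v_step(adj, weight, prev, cur, beta, gamma, theta, rng):
    """Draw the next node j given the previous node k = prev and current node i = cur."""
    neighbours = adj[cur]
    scores = []
    for j in neighbours:
        w = weight(cur, j)
        if j == prev:
            scores.append(theta * w)   # k = j: backtrack
        elif prev is not None and j in adj[prev]:
            scores.append(gamma * w)   # (k, j) is an edge: common neighbour
        else:
            scores.append(beta * w)    # (k, j) is not an edge: other neighbour
    return rng.choices(neighbours, weights=scores, k=1)[0]

def n2v_walk(adj, weight, start, n_steps, beta=1.0, gamma=1.0, theta=1.0, rng=None):
    """Return a path of n_steps nodes starting at `start`."""
    rng = rng or random.Random()
    prev, cur = None, start
    path = [cur]
    for _ in range(n_steps - 1):
        nxt = n2v_step(adj, weight, prev, cur, beta, gamma, theta, rng)
        prev, cur = cur, nxt
        path.append(cur)
    return path

def unit_weight(i, j):
    """Equal weights, as in N2V-1, N2V-3 and N2V-5."""
    return 1.0

# Hypothetical toy graph; an N2V-3-like setting with beta = gamma = 1 and small theta.
toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
path = n2v_walk(toy_adj, unit_weight, start=0, n_steps=50,
                beta=1.0, gamma=1.0, theta=1e-3, rng=random.Random(0))
```

Passing an attribute-dependent `weight` function instead of `unit_weight` gives the weighted variants (N2V-2, -4 and -6) with the same step logic.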

For comparison to RWs, we will also use the following baseline samplings. These can be viewed as “ideal” for sampling purposes and correspond to the limiting distributions of some RWs.

Node Sampling (NS) NS sampling requires full access to the network and is unavailable for many real networks. In the classical NS, nodes are chosen independently and uniformly from the network with replacement.

Edge Sampling (ES) In the classical ES, edges are chosen independently and uniformly from the network with replacement. Since ES selects edges rather than nodes to populate the sample, the node set is constructed by including both incident nodes in the sample when a particular edge is sampled.
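For concreteness, the two baselines can be sketched as follows. This is a minimal illustration assuming the full node and edge lists are available; the function names and toy data are hypothetical.

```python
import random

def node_sampling(nodes, n, rng):
    """NS: draw n nodes independently and uniformly, with replacement."""
    return rng.choices(list(nodes), k=n)

def edge_sampling(edges, n_edges, rng):
    """ES: draw edges independently and uniformly, with replacement,
    and add both endpoints of each drawn edge to the node sample."""
    sample = []
    for i, j in rng.choices(list(edges), k=n_edges):
        sample.extend((i, j))
    return sample

# Hypothetical toy graph given by its undirected edge list.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
toy_nodes = {v for e in toy_edges for v in e}
rng = random.Random(0)
ns_sample = node_sampling(toy_nodes, 10, rng)
es_sample = edge_sampling(toy_edges, 5, rng)
```

Under ES a node enters the sample with probability proportional to its degree, which matches the node distribution in (10).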

Estimation of attribute distributions and homophily measures

We consider here estimation in the case of discrete-valued attributes; the case of continuous-valued attributes is discussed at the end of this work. Our estimators of quantities of interest will be based on one of the following two general estimators. The first estimator is for the proportion p(A) of nodes i with a certain characteristic A(i) taking value A. The characteristic takes discrete values and could be the discrete attribute $a_i=a(i)$ itself, the degree $d_i=d(i)$, the combination of the latter two, etc. The estimator of p(A) for a random walk is defined as follows. Run a random walk (any of the sampling schemes described above) for n steps and let $i_s$ denote the sth node sampled by the random walk, for $1\le s \le n$. Since nodes are sampled with replacement and with probabilities $\pi _i$ in the stationary regime, the proportion p(A) can be estimated as

$$\begin{aligned} {{\widehat{p}}}(A) = \frac{1}{N n} \sum _{s=1}^n \frac{ {\textbf{1}}\{A(i_s) = A\}}{\pi _{i_s}}, \end{aligned}$$

(12)

where ${{\textbf{1}}}\{E\}=1$ if E is true and 0 otherwise Kolaczyk (2009) (Chapter 5). If the total number of nodes N is unknown, its estimator is given by ${\widehat{N}} = (1/n) \sum _s 1/ \pi _{i_s}$, and (12) becomes

$$\begin{aligned} {{\widehat{p}}}(A) = \frac{1}{\sum _{s=1}^n 1/\pi _{i_s}} \sum _{s=1}^n \frac{ {{\textbf{1}}}\{A(i_s) = A\}}{\pi _{i_s}}. \end{aligned}$$

(13)
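The reweighting in (12) and (13) can be illustrated with a short Python sketch. The function names and the toy data are assumptions for this example. Because (13) is a ratio, the stationary probabilities may be supplied up to a constant factor (for example, the degrees $d_i$ for N2V-1), whereas ${\widehat{N}}$ requires normalized probabilities.

```python
def estimate_proportion(sample, pi, characteristic, value):
    """Ratio estimator (13): each sampled node is reweighted by 1 / pi_i."""
    num = sum(1.0 / pi[i] for i in sample if characteristic(i) == value)
    den = sum(1.0 / pi[i] for i in sample)
    return num / den

def estimate_num_nodes(sample, pi):
    """N-hat = (1/n) * sum_s 1 / pi_{i_s}; pi must be normalized here."""
    return sum(1.0 / pi[i] for i in sample) / len(sample)

# Hypothetical toy graph: edges 0-1, 0-2, 1-2, 2-3, so degrees are 2, 2, 3, 1.
toy_degree = {0: 2, 1: 2, 2: 3, 3: 1}
toy_attr = {0: 1, 1: 1, 2: 1, 3: 2}
toy_pi = {i: d / 8.0 for i, d in toy_degree.items()}  # classical RW, as in (10)

# A hand-picked walk path whose visit counts are proportional to the degrees.
toy_sample = [0, 2, 1, 2, 3, 2, 0, 1]

p_hat_2 = estimate_proportion(toy_sample, toy_pi, lambda i: toy_attr[i], 2)
n_hat = estimate_num_nodes(toy_sample, toy_pi)
```

For this idealized sample the estimates recover the true values (one node out of four has attribute 2, and N = 4); a real walk would only do so approximately.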

A direct application of, e.g., (12) yields the following estimators for the proportion p(k, a) of nodes with degree k and attribute a, the proportion p(a) of nodes with attribute a, and the conditional proportion $p(k|a)=p(k,a)/p(a)$ of nodes with degree k among the nodes having attribute a:

$$\begin{aligned} {{\widehat{p}}}(k,a)= & {} \frac{1}{N n} \sum _{s=1}^n \frac{ {\textbf{1}}\{d(i_s)=k, a(i_s) = a\}}{\pi _{i_s}}, \qquad a \in {\mathcal {A}}, \end{aligned}$$

(14)

$$\begin{aligned} {{\widehat{p}}}(a)= & {} \frac{1}{N n} \sum _{s=1}^n \frac{ {{\textbf{1}}}\{a(i_s) = a\}}{\pi _{i_s}}, \qquad a \in {\mathcal {A}}, \end{aligned}$$

(15)

$$\begin{aligned} {{\widehat{p}}}(k|a)= & {} \sum _{s=1}^n \frac{ {{\textbf{1}}}\{d(i_s) = k, a(i_s) = a\}}{\pi _{i_s}} \Big / \sum _{s=1}^n \frac{ {\textbf{1}}\{a(i_s) = a\}}{\pi _{i_s}}, \qquad a \in {\mathcal {A}}. \end{aligned}$$

(16)

We note that the quantities in (14)–(16) are given in terms of the sample obtained through the random walk used with N estimated by ${{\widehat{N}}}$.
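The conditional estimator (16) amounts to a reweighted degree histogram restricted to sampled nodes with attribute a. The sketch below illustrates this, together with the $\ell_2$ error used later in the experiments; the function names and toy data are assumptions for this example.

```python
import math
from collections import defaultdict

def estimate_degree_dist_given_attr(sample, pi, degree, attr, a):
    """Estimator (16): reweighted degree histogram among sampled nodes with attribute a."""
    mass = defaultdict(float)
    for i in sample:
        if attr[i] == a:
            mass[degree[i]] += 1.0 / pi[i]
    total = sum(mass.values())
    if total == 0.0:
        return {}  # attribute a was never sampled
    return {k: m / total for k, m in mass.items()}

def l2_error(p_hat, p_true):
    """Estimation error (sum_k (p_hat(k|a) - p(k|a))^2)^(1/2) over all observed degrees."""
    keys = set(p_hat) | set(p_true)
    return math.sqrt(sum((p_hat.get(k, 0.0) - p_true.get(k, 0.0)) ** 2 for k in keys))

# Hypothetical toy graph: edges 0-1, 0-2, 1-2, 2-3.
toy_degree = {0: 2, 1: 2, 2: 3, 3: 1}
toy_attr = {0: 1, 1: 1, 2: 1, 3: 2}
toy_sample = [0, 2, 1, 2, 3, 2, 0, 1]  # hand-picked path on the toy graph

# The ratio in (16) is scale invariant, so degrees can stand in for pi under N2V-1.
p_hat = estimate_degree_dist_given_attr(toy_sample, toy_degree, toy_degree, toy_attr, 1)
err = l2_error(p_hat, {2: 2.0 / 3.0, 3: 1.0 / 3.0})
```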

The performance of ${{\widehat{p}}}(A)$ in (12) and hence the components of the estimators (14)–(16) can be assessed through their MSE. For fixed A, the MSE of ${{\widehat{p}}}(A)$ is given by $E[({{\widehat{p}}}(A)- p(A))^2]$. In the stationary regime, ${{\widehat{p}}}(A)$ in (12) is an unbiased estimator of p(A) and the MSE is equal to the variance $V[{{\widehat{p}}}(A)]$. The variance of ${{\widehat{p}}}(A)$ can be related to the spectral gap of the RW. More specifically, let P be the associated transition matrix of the random walk with eigenvalues (real by reversibility): $1=\lambda _1\ge \lambda _2 \ge \ldots \ge \lambda _N \ge -1$. The spectral gap is defined as $\delta = 1-\lambda _2$. Equivalently, the relaxation time of the RW is the reciprocal of the spectral gap. A larger spectral gap implies a faster convergence of the RW to its stationary distribution. From Aldous and Fill (2002) (Proposition 4.29), we have

$$\begin{aligned} V ({{\widehat{p}}}(A)) \le \frac{2 \Lambda (A) }{\delta n} \left( 1 + \frac{\delta }{2 n} \right) , \end{aligned}$$

(17)

where $\Lambda (A)=\sum _{i=1}^N {\textbf{1}}\{A(i) = A\} /(N^2 \pi _i)$. The bound on the error in estimating the proportion of nodes with characteristic A thus scales with the inverse of the spectral gap and with $\Lambda (A)$; the latter is small if the probability of sampling nodes with characteristic A is large. We will see in the Experiments section that for N2V-2, if the edge weights $w_{ij}$ are inversely related to the concordance of the attributes, thus encouraging the walk to explore vertices with different attributes, then in some settings this increases $\delta$ and decreases $\Lambda (a)$ (for attributes with small proportions), resulting in a smaller variance of the estimator for the proportion p(a) of nodes with attribute a.
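To make the two ingredients of the bound concrete, the following sketch computes $\Lambda (A)$ for a given stationary distribution and approximates the spectral gap $\delta$ of the classical RW on a small connected graph. It is only an illustration: the function names are hypothetical, the gap is obtained by a plain power iteration on a symmetrized, lazy version of the transition matrix (an implementation choice for this example, not the paper's procedure), and the start vector is assumed not to be orthogonal to the second eigenvector.

```python
import math

def lambda_term(pi, characteristic, value):
    """Lambda(A) = sum_i 1{A(i) = A} / (N^2 * pi_i)."""
    n = len(pi)
    return sum(1.0 / (n * n * p) for i, p in pi.items() if characteristic(i) == value)

def spectral_gap_simple_rw(adj, n_iter=2000):
    """Approximate delta = 1 - lambda_2 for the classical RW on a connected undirected graph."""
    nodes = sorted(adj)
    deg = {i: len(adj[i]) for i in nodes}
    # S = D^{-1/2} A D^{-1/2} has the same eigenvalues as P; its top eigenvector is ~ sqrt(d_i).
    norm = math.sqrt(sum(deg.values()))
    top = {i: math.sqrt(deg[i]) / norm for i in nodes}

    def deflate(x):
        c = sum(x[i] * top[i] for i in nodes)
        return {i: x[i] - c * top[i] for i in nodes}

    def apply_lazy(x):
        # (S + I) / 2 has eigenvalues (lambda + 1) / 2 in [0, 1], in the same order as S.
        return {i: 0.5 * (x[i] + sum(x[j] / math.sqrt(deg[i] * deg[j]) for j in adj[i]))
                for i in nodes}

    x = deflate({i: float(idx + 1) for idx, i in enumerate(nodes)})
    mu = 0.0
    for _ in range(n_iter):
        y = deflate(apply_lazy(x))
        size = math.sqrt(sum(v * v for v in y.values()))
        if size == 0.0:
            break
        mu = sum(x[i] * y[i] for i in nodes) / sum(v * v for v in x.values())
        x = {i: v / size for i, v in y.items()}
    lambda_2 = 2.0 * mu - 1.0
    return 1.0 - lambda_2

# Hypothetical toy graph: triangle 0-1-2 (attribute 1) plus node 3 (attribute 2) attached to 2.
toy_adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
toy_attr = {0: 1, 1: 1, 2: 1, 3: 2}
pi_rw = {0: 2 / 8, 1: 2 / 8, 2: 3 / 8, 3: 1 / 8}        # classical RW, from (10)
pi_weighted = {0: 1 / 7, 1: 1 / 7, 2: 3 / 7, 3: 2 / 7}  # (11) with weights 0.25 (same) and 1 (different)

lam_rw = lambda_term(pi_rw, lambda i: toy_attr[i], 2)
lam_weighted = lambda_term(pi_weighted, lambda i: toy_attr[i], 2)
gap = spectral_gap_simple_rw(toy_adj)
```

In this toy example $\Lambda$ for the minority attribute drops from 0.5 to about 0.22 when same-attribute edges are down-weighted, in line with the discussion above.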

The second estimator is for the proportion p(B) of edges (i, j) with a certain characteristic B(i, j) taking value B. The values B are assumed to be discrete. For the random walk considered above, since edges are sampled with probabilities $\pi _{ij}$ in the stationary regime, the proportion p(B) can be estimated similarly to (12) as

$$\begin{aligned} {{\widehat{p}}}(B) = \frac{1}{(n-1)|{\mathcal {E}}|} \sum _{s=1}^{n-1} \frac{ {{\textbf{1}}}\{B(i_s,i_{s+1}) = B\}}{\pi _{i_s,i_{s+1}}} \end{aligned}$$

(18)

and if needed, the number of edges as

$$\begin{aligned} \widehat{|{\mathcal {E}}|} = \frac{1}{n-1} \sum _{s=1}^{n-1} \frac{1}{\pi _{i_s,i_{s+1}}}. \end{aligned}$$

(19)

A direct application of (18)–(19) is the estimation of the homophily measures $D_a$ and $H_{aa'}$ in (2) and (3) as:

$$\begin{aligned} {{\widehat{D}}}_{a} = \widehat{|{\mathcal {E}}_{aa}|} \Big / \bigg ({ \widehat{|{\mathcal {V}}_a|} \atopwithdelims ()2} {{\widehat{p}}} \bigg ), \qquad \widehat{H}_{aa'} = \widehat{|{\mathcal {E}}_{aa'}|} \Big / \big ( \widehat{ |{\mathcal {V}}_a|} \widehat{|{\mathcal {V}}_{a'}|} {{\widehat{p}}} \big ), \end{aligned}$$

(20)

where $\widehat{| {\mathcal {V}}_{a} |} = {{\widehat{N}}} {{\widehat{p}}}(a)$, $\widehat{p} = \widehat{|{\mathcal {E}}|} / { {{\widehat{N}}} \atopwithdelims ()2}$ and

$$\begin{aligned} \widehat{| {\mathcal {E}}_{aa'} |} = \frac{1}{n-1} \sum _{s=1}^{n-1} \frac{ {{\textbf{1}}}\{(a(i_s), a(i_{s+1}))= (a,a')\vee (a(i_s), a(i_{s+1}))=(a',a) \}}{\pi _{i_s,i_{s+1}}}, \end{aligned}$$

(21)

where $a,a' \in {\mathcal {A}}$. We note again that the quantities in (19)–(21) are given by the sample obtained through the respective random walk used. We are not aware of results of the type (17) that assess the variability of the estimator ${{\widehat{p}}}(B)$ in (18).
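The estimators (19)–(21) can be assembled as in the following sketch. The function names and toy data are assumptions for this example; the code also assumes that consecutive nodes in the path are always joined by an edge (as for the N2V variants, but not for MHRW when a proposal is rejected) and that the estimated group sizes exceed 1 when the dyadicity is computed.

```python
def _inv_edge_prob(pi_edge, i, j):
    # pi_edge is keyed by undirected edges, each stored once in either orientation.
    p = pi_edge[(i, j)] if (i, j) in pi_edge else pi_edge[(j, i)]
    return 1.0 / p

def estimate_edge_count(path, pi_edge, attr=None, pair=None):
    """Estimator (19) when pair is None, estimator (21) when pair = (a, a_prime)."""
    total = 0.0
    for i, j in zip(path[:-1], path[1:]):
        if pair is not None:
            ends = (attr[i], attr[j])
            if ends != pair and ends != pair[::-1]:
                continue
        total += _inv_edge_prob(pi_edge, i, j)
    return total / (len(path) - 1)

def estimate_group_size(path, pi_node, attr, a):
    """|V_a|-hat = N-hat * p-hat(a), which reduces to (1/n) * sum_s 1{a(i_s) = a} / pi."""
    return sum(1.0 / pi_node[i] for i in path if attr[i] == a) / len(path)

def estimate_homophily(path, pi_node, pi_edge, attr, a, a_prime):
    """D-hat_a if a == a_prime, otherwise H-hat_{a a'}, following (20)."""
    n_hat = sum(1.0 / pi_node[i] for i in path) / len(path)
    e_hat = estimate_edge_count(path, pi_edge)
    p_hat = e_hat / (n_hat * (n_hat - 1.0) / 2.0)  # estimated edge density
    e_pair = estimate_edge_count(path, pi_edge, attr, (a, a_prime))
    v_a = estimate_group_size(path, pi_node, attr, a)
    if a == a_prime:
        return e_pair / (v_a * (v_a - 1.0) / 2.0 * p_hat)
    v_b = estimate_group_size(path, pi_node, attr, a_prime)
    return e_pair / (v_a * v_b * p_hat)

# Hypothetical toy graph: edges 0-1, 0-2, 1-2, 2-3 with the classical RW distributions (10).
toy_attr = {0: 1, 1: 1, 2: 1, 3: 2}
toy_pi_node = {0: 2 / 8, 1: 2 / 8, 2: 3 / 8, 3: 1 / 8}
toy_pi_edge = {(0, 1): 0.25, (0, 2): 0.25, (1, 2): 0.25, (2, 3): 0.25}
toy_path = [0, 1, 2, 3, 2, 0, 2, 1, 0]  # hand-picked path crossing each edge twice

e_hat = estimate_edge_count(toy_path, toy_pi_edge)
e_12 = estimate_edge_count(toy_path, toy_pi_edge, toy_attr, (1, 2))
h_12 = estimate_homophily(toy_path, toy_pi_node, toy_pi_edge, toy_attr, 1, 2)
d_1 = estimate_homophily(toy_path, toy_pi_node, toy_pi_edge, toy_attr, 1, 1)
```

Because (20) combines several estimated quantities in a ratio, small errors in the group sizes propagate to the homophily estimates, which is visible even on this short toy path.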

In terms of complexity of the learning framework, the random walks considered in this work are computationally efficient in terms of both space and time requirements Grover and Leskovec (2016). For instance, for each visited node, we need to check the immediate neighbors and their attributes. For the second order random walks (N2V-3, -4, -5 and -6), we additionally need to keep track of the interconnections between the neighbors of the currently visited node; however, the average degree of the graph is usually small for most real world networks. The proposed estimators are obtained from simple weighted sample statistics.

Experiments

In this section, we assess the performance of the sampling methods and estimators in learning the attribute distribution, degree distribution per attribute and homophily measures on synthetic and real-world networks with discrete attributes.

Synthetic network with homophily

We consider the model class $\mathscr {P}(\alpha , \mu , f)$ with $N=2000$ nodes and a discrete attribute taking 3 values. In the generation of the network, each node that enters the system has attribute 1, 2 or 3 with probabilities $\mu _1=0.7$, $\mu _2=0.2$, $\mu _3=0.1$, respectively, and connects to $m=2$ nodes chosen with probability proportional to (1), where $f(a,a)=0.8$, $f (a,a')=0.1$, $a,a'=1,2,3$, $a\ne a'$. We investigate the effect of homophily on the estimation of the quantities of interest in a controlled environment for the two most interesting network topologies: sublinear ($\alpha = 0.2$) and linear ($\alpha = 1$).

Attribute distribution

Setting 1 ($\alpha =0.2$): Fig. 5 shows how the sampling methods perform in learning the attribute distribution using (15) with N unknown. Each boxplot is constructed from 500 estimates, and the length of each walk is 0.15N. MHRW has the important property that its stationary distribution is uniform over all nodes. Thus, in principle, an infinitely long MHRW is equivalent to random node sampling (RNS, the NS baseline above). In practice, MHRW typically requires sample sizes of O(N) to reach the stationary distribution Kumar and Sundaram (2021). This makes MHRW challenging to use on large scale networks with millions of nodes, where the typical sample size is much smaller than the network size. Networks with strong homophily are particularly problematic, since MHRW tends to get stuck among nodes with the same attribute. The classical variant of node2vec, N2V-1, which like MHRW is attribute agnostic, has a stationary distribution that is uniform over all edges, so an infinitely long N2V-1 walk is equivalent to random edge sampling (RES, the ES baseline above). In practice, it suffers from the same drawbacks as MHRW, though to a lesser extent. The poor performance can also be explained through the variance bound (17). Table 4 shows that MHRW has the lowest spectral gap, while N2V-1 has a high value of $\Lambda (3)$ for attribute 3 (this is detailed next for N2V-2).

Table 4 The variation of spectral gap $(\delta )$ and $\Lambda (3)$ from the bound of the variance of ${{\widehat{p}}}(3)$ under model class $\mathscr {P}(\alpha ,\mu ,f)$ and sampling method parameters as described in Fig. 5

Attribute aware samplers such as N2V-2 use node attributes to determine the next node to add to the sample, by comparing the attribute of each candidate node with that of the last node added to the sample. To simplify the exposition, instead of $w_{ij}$ for nodes i and j we write ${{\overline{w}}}_{aa}$ for the weight between nodes with the same attribute and ${{\overline{w}}}_{aa'}$ for the weight between nodes with different attributes. Table 5 shows the effect of the weights on the standard deviation of the N2V-2 estimate for attribute 3. To explain the differences, we turn to the variance bound (17), which scales with the inverse of the spectral gap. If ${{\overline{w}}}_{aa}$ is much smaller than ${{\overline{w}}}_{aa'}=1$, say $\overline{w}_{aa}=0.05$, then N2V-2 moves between different node attributes very frequently and exploration within each attribute is insufficient. In this case, the spectral gap is small, creating a bottleneck for approaching the stationary distribution. As ${{\overline{w}}}_{aa}$ increases, inter-attribute moves become less frequent, accelerating convergence to the stationary distribution. On the other hand, when ${{\overline{w}}}_{aa}$ becomes greater than or equal to ${{\overline{w}}}_{aa'}$, the spectral gap decreases again, to the point where N2V-2 hardly moves from one attribute value to another. The bound also involves the quantity $\Lambda (a)$, which is small if the probability of sampling nodes with attribute a is large. Table 5 also shows the effect of ${{\overline{w}}}_{aa}$ on the value of $\Lambda (a)$ for attribute 3. The tradeoff between $\delta$ and $\Lambda (a)$ explains the smaller standard deviation for attribute 3 obtained by N2V-2 with $\overline{w}_{aa}=0.25$. The convex behavior of the empirical standard deviation as a function of ${{\overline{w}}}_{aa}$ will be explored at the end of this work in the guidelines for setting the weights of attribute aware samplers.

Table 5 Empirical standard deviation of ${{\widehat{p}}}(3)$, and the variation of spectral gap $(\delta )$ and $\Lambda (3)$ (from the bound of the variance of ${{\widehat{p}}}(3)$) with N2V-2 ($w_{ij}=1 , a(i) \ne a(j))$ under model class $\mathscr {P}(\alpha , \mu , f)$ where $N=2000$, $\alpha =0.2$, $\mu = (0.7, 0.2,0.1)$, $f(a,a)=0.8$, $f(a,a')=0.1$, $a\ne a'=1,2,3$ and $m=2$

In N2V-3, the backtracking parameter $\theta$ is set close to zero, $\theta =10^{-3}$, so that a walker arriving at a node of degree 1 can still backtrack in the next time step, which is then the only possible move, and $\beta =\gamma =1$. In this case, N2V-3 tends to explore the network better, avoiding redundant nodes in the sample, which accelerates convergence (see the spectral gap in Table 4). The result is consistent with findings for non-backtracking RWs on regular graphs Alon et al. (2007), where in many cases the spectral gap is “twice as good” as for the classical RW, as is also the case here.

N2V-4 combines features of both attribute aware and non-backtracking samplers. We use the same weights and backtracking parameters as in N2V-2 and N2V-3 above. Since the stationary distribution $\pi _i$ needed in (15) is not known, it is obtained through simulations. The results show that N2V-4 can provide better estimates with lower variability compared to N2V-2 and N2V-3. This can be explained by the increase of the spectral gap while keeping $\Lambda (a)$ small for attribute values 2 and 3 (see Table 4). We have also checked the approximation in (11) for the stationary distribution of N2V-4. The choice is heuristic, but the results show very good accuracy compared to the empirical distribution for this network scenario.

N2V-5 ignores the attributes of nodes while sampling the network. We set $\theta =10^{-3}$, $\gamma = 0.1$ and $\beta =1$, forcing the RW to explore non-common neighbors of the previous and currently visited nodes. The performance is worse than that of N2V-4, in line with the smaller spectral gap and the larger $\Lambda (3)$ (Table 4). N2V-6 is the version of N2V-5 with attribute aware sampling. We now set $\beta w_{ij} = 0.3$ if nodes have equal attributes and 1 otherwise, as in N2V-4, and keep the other parameters used in N2V-5. There is an improvement in performance; however, its variability is similar to that of N2V-4. In both N2V-5 and -6, the stationary distributions used in the estimation are obtained through simulations.

Setting 2 ($\alpha =1$): We next consider the linear model class $\mathscr {P}(1, \mu , f)$, where $\mu$, f, N and the sampling rate are the same as in Setting 1. The boxplots of 500 estimates for each sampling scheme using (15) are given in Fig. 6. In this case, the performance of MHRW is worse due to the existence of high degree nodes, which MHRW tends to avoid, reducing the spectral gap. Note that high degree vertices increase “conductance” in the network (small world phenomenon), and hence avoiding them slows down the mixing of MHRW. For the variants of N2V, the estimates for attributes 2 and 3 tend to be better. This can be explained by the homophily and preferential attachment in the model, which enable different types of attachment propensities, as we now indicate. Nodes with the minority attributes 2 and 3 will mainly be attracted to nodes with the same attribute. However, due to the preferential attachment, they will also be partly attracted to the majority of nodes with attribute 1 (see Fig. 1b). Therefore, the variability in the estimation tends to be smaller for attributes with lower proportions. The ranking of the performance of the sampling methods is the same as in the sublinear case.

Other settings, such as the presence of weak homophily and balanced attributes (i.e., a uniform distribution of attributes in the network), will be investigated with real data.

Degree distribution per attribute

Setting 3 ($\alpha =0.2$): Fig. 7 depicts the boxplots of the estimation error $(\sum _k ({{\widehat{p}}} (k|a) - p(k|a))^2 )^{1/2}$ of the degree distribution per attribute for a sublinear network from 500 estimates under MHRW, N2V-1 to -4, and the baseline sampling methods. Since the stationary distributions of N2V-5 and -6 are not known and their performances approach those of N2V-3 and -4, respectively, we omit them from the plot. The number of nodes sampled is 0.2N and the parameters of N2V-3 and -4 are the same as in Setting 1. N2V-4 achieves the highest performance, especially for attributes 2 and 3 (even compared with RES), due to being attribute aware. We use its empirical stationary distribution and also check the approximation (11), which shows similar boxplots. On the other hand, MHRW performs poorly compared with the baseline RNS. The results for the variants of N2V are consistent with the estimation of the attribute distribution.

Homophily measures

Setting 4 ($\alpha =1$): The homophily measures are $D_1=1.34$, $D_2=3.44$, $D_3=4.87$, $H_{12}=0.28$, $H_{13}=0.37$, $H_{23}=0.56$. Figure 8 shows the estimates of the dyadicity and heterophilicity using N2V variants with known or approximate stationary distribution. The estimators in (20) involve ratios of several quantities, which are sensitive to small deviations. Thus a larger sample size of 0.3N is used to reduce the variability. The other parameters are the same as in Setting 2. We have omitted MHRW from the plots, since it has the worst performance, as well as the baseline RNS. For the heterophilicity measure, N2V-4 achieves the lowest variability, followed by RES. We note that ${{\widehat{H}}}_{aa'}$ in (20) involves the estimation of the number of edges between nodes with different attributes, which, due to the reduced number of these connections, is better estimated with N2V-4 than with RES.