Learning attribute and homophily measures through random walks
Applied Network Science volume 8, Article number: 39 (2023)
Abstract
We investigate the statistical learning of nodal attribute functionals in homophily networks using random walks. Attributes can be discrete or continuous. A generalization of various existing canonical models, based on preferential attachment, is studied (model class \(\mathscr {P}\)), where new nodes form connections dependent on both their attribute values and popularity as measured by degree. An associated model class \(\mathscr {U}\) is described, which is amenable to theoretical analysis and gives access to asymptotics of a host of functionals of interest. Settings where asymptotics for model class \(\mathscr {U}\) transfer over to model class \(\mathscr {P}\) through the phenomenon of resolvability are analyzed. For the statistical learning, we consider several canonical attribute agnostic sampling schemes such as the Metropolis-Hastings random walk and versions of node2vec (Grover and Leskovec, 2016) that incorporate both classical random walk and nonbacktracking propensities, and we propose new variants which use attribute information in addition to topological information to explore the network. Estimators for learning the attribute distribution, the degree distribution for an attribute type and homophily measures are proposed. The performance of this statistical learning framework is studied on both synthetic networks (model class \(\mathscr {P}\)) and real-world systems, and its dependence on the network topology, the degree of homophily or absence thereof, and (un)balanced attributes is assessed.
Introduction
Attributed networks, namely graphs in which nodes and/or edges have attributes, are at the center of network-valued datasets in many modern applications. For example, in real-world network datasets most nodes have values of characteristics of interest; in social networks, users have attributes such as “gender”, “age”, “language”; in citation networks, articles are classified by the main subject, field, subfield, keywords. Networks also differ in the range of attribute values (cardinality), their types (discrete or continuous) and the size of each group. In one direction, machine learning pipelines such as network representation learning Fan et al. (2021), clustering Chang et al. (2019), classification Lee et al. (2017), and community detection Baroni et al. (2017) have been developed to study the entire network. Another recent direction, specifically related to attributed network-valued data, is the use of attribute information, in addition to graph topological information, in improving the performance of exploratory data analytic techniques such as community detection Berahmand et al. (2022) or link prediction tasks Nasiri et al. (2023). Both papers, through careful development of methodological analysis using graph regularization and nonnegative matrix factorization, and through detailed empirical analysis, show significant improvement for such machine learning pipelines via incorporating node attribute information. Driven by the scale of data, the main motivation of this paper is network sampling, where limited explorations based on random walks are used to learn network level functionals of attributes.
In real-world networks, the attributes of a node will covary and are not independent. One standard phenomenon in many such real-world systems is homophily Shrum et al. (1988); McPherson et al. (2001); Mislove et al. (2010), i.e., node pairs with similar attributes being more likely to be connected than node pairs with discordant attributes. For instance, many social networks show this property, which is the tendency of individuals to associate with others who are similar to them, e.g., with respect to gender, ethnicity or political ideology. Furthermore, the distribution of user attributes over the network is usually uneven, with coexisting groups of different sizes, e.g., one ethnic group may dominate others Espín-Noboa et al. (2021). On the other hand, another covariation across neighbors is due to heterophily, where nodes with the same attribute value repel each other.
Performance of network sampling algorithms in such settings has received some attention including: the bias of several sampling methods in conserving position of nodes and visibility of groups Wagner et al. (2017); the effect of homophily on centrality measures and visibility of minority groups and fairness questions Karimi et al. (2018). More recently the synthetic models that motivate this paper were used in Espín-Noboa et al. (2022) to understand the inequality of node ranking algorithms (e.g. as measured by the Gini coefficient) as well as inequity (e.g. by contrasting the percentage of a given attribute amongst the most popular k% of nodes with the true demographic percentage of that group), in particular trying to understand the role of foundational characteristics of network evolution such as homophily or preferential attachment in (quoting Espín-Noboa et al. (2022)) “reducing, replicating or amplifying” representation of specific groups by these ranking algorithms. In a different direction, Espín-Noboa et al. (2021) uses these synthetic models to understand the accuracy of semi-supervised machine learning tasks such as learning/prediction of attribute labels given partial information on the labels of a subset of seeded vertices; the goal is to understand the impact of homophily/heterophily and preferential attachment driven growth characteristics of the underlying network on the accuracy of a host of popular relational classifiers and collective inference algorithms.
This paper is motivated by the lack of theoretical results in the analysis of attributed network models with homophily and by the need for a learning framework to estimate attribute functionals in real networks. We investigate the following research questions (RQ).
RQ1
How to analyze and extend the existing network models with homophily and derive the main functionals of interest?
We describe a generalization of the directed preferential attachment model with homophily (called model class \(\mathscr {P}\)) formulated in Karimi et al. (2018) where new nodes connect to existing ones based on the attributes of both end points of the potential edge and centrality of the existing vertex. The network model can generate scale-free networks with discrete or continuous attributed nodes, and different intensities of homophily. The dynamics of the network are as follows. Starting from a fully connected cluster of nodes with attributes, each node that arrives has an attribute generated independently according to a given distribution and connects to a fixed (constant) number of nodes. The probability that a new node connects to an existing node is proportional to the product of the degree (to the power of a parameter) with a function that measures the propensity of the two nodes' attributes to interact. Thus, the model encodes the interplay between the two main mechanisms of tie formation found in social networks: preferential attachment and homophily. Given the importance of this model in applications, theoretical analysis of this model, including stability properties of heterophily and homophily statistics, is of great importance; yet to date the only functional amenable to theoretical analysis has been the degree distribution Karimi et al. (2018); Jordan (2013). We describe a related model of network evolution (called model class \(\mathscr {U}\)) which is much more amenable to theoretical analysis and a phenomenon we term resolvability which enables one to transfer results from model class \(\mathscr {U}\) to model class \(\mathscr {P}\); in this paper we specialize to large network limits for degree distribution for an attribute type and homophily and heterophily statistics, deferring a full treatment to Antunes et al. (2023).
RQ2
How to use the existing link trace algorithms to sample the network and take into account the attributes of nodes?
Uniform random sampling of nodes or edges is the “gold standard”, providing unbiased estimates of corresponding attribute functionals. However, owing to both computational and privacy issues in social networks and other settings, such sampling is often infeasible. Other networks that allow random access limit the rate of API (Application Programming Interface) calls, implying that creating a sample of sufficient size takes a prohibitive time. In these cases, link trace sampling, such as random walks (RWs), is typically used; see references in Antunes et al. (2021, 2021) for estimation of functionals such as degree distribution and clustering. However, much less is known in the context of estimating quantities influenced by attribute types in homophily networks.
In this work, we consider several existing canonical attribute agnostic sampling schemes proposed in the literature (that do not use the attribute type of nodes to construct the sample) such as the Metropolis-Hastings random walk and versions of node2vec Grover and Leskovec (2016) that incorporate both classical random walk and walks with nonbacktracking propensities. These random walks have been designed to preserve structural properties of the network in the sample, such as high degree nodes, clustering and diameter, and not the different types of node attributes. We are interested not only in estimating the proportion of nodes with a given attribute but also in the structural properties of the subnetwork spanned by vertices of a specified attribute type, including the degree distribution and homophily measures. Our main contribution here is to show that random walks that use edge weights can be attribute aware samplers, through the proposal of variants of node2vec where the weight of an edge depends on the attributes of its end nodes. This will be especially useful in homophilic networks for analyzing geometric properties involving nodes with minority attributes.
RQ3
How to estimate the attribute functionals and homophily measures through the sampling schemes and evaluate their performance?
We propose estimators for attribute functionals and homophily measures that are based on correcting the bias of the empirical sample quantities through the use of the stationary distributions of the RWs associated with sampling nodes and edges.
We study the performance of the considered random walk sampling schemes in terms of estimation error of the attribute distributions and homophily measures across the following four dimensions in both synthetic networks using the model class \(\mathscr {P}\) and real world settings: (a) Inherent homophilic propensity of the network and underlying density of attributes; (b) Impact of centrality of nodes as measured by degree in the evolution of the network; (c) Nonlinear impact of incorporating “escape echo chamber” mechanisms in random walks by encouraging walks to jump across edges with discordant attributes; (d) Impact of reducing the backtracking propensity to encourage walks to explore more of the network.
We find that (i) RWs with attribute dependent weights can perform better than attribute agnostic RWs in homophilic networks; (ii) the weights need to balance the movements between/within nodes with different/same attributes; (iii) nonbacktracking improves performance, especially in conjunction with attribute dependent weights and low edge density; (iv) methods seem to work comparably well for synthetic and real networks.
This paper is a significant extension of the conference paper Antunes et al. (2023) including: (a) appreciable expansion of the theoretical developments to the network models described in Antunes et al. (2023), including describing the notion of resolvability of such models which allows one to connect them to a different class of models for which asymptotic analysis for a wide range of functionals, such as degree exponent for an attribute type, homophily and heterophily statistics can be undertaken; (b) substantial expansion of the methodological development of the paper, including a new class of functionals (degree distribution for an attribute and homophily measures) to be estimated through network sampling schemes; (c) new network sampling schemes from node2vec variants; (d) further applications of the methodology developed to new network data for evaluation and comparison; and (e) a final section with extensions and future directions of the work.
Attributed network models and homophily functionals
As described above, synthetic models have been used to great effect in understanding the structure and evolution of attributed networks and the impact of ranking, sampling and classification algorithms in such settings. The overarching goal in this section is to describe an extension of the canonical (linear) attributed network models currently considered in the literature. We refer the interested reader to Karimi et al. (2018); Espín-Noboa et al. (2022, 2021) and the references therein for further discussion on motivations and use of such models. More concretely in this section:

(a)
We will describe the main synthetic model, termed nonlinear preferential attachment (NLPA) model with homophily, and referred to for the rest of the paper as model class \(\mathscr {P}\).

(b)
We will give concrete formulations of key network functionals measuring homophily between different groups.

(c)
Understanding (large network) asymptotics for model class \(\mathscr {P}\) is nontrivial. We will introduce a related model (referred to as model class \(\mathscr {U}\)), that seems significantly more amenable to analysis, formalize a notion called resolvability, connecting model classes \(\mathscr {P}\) and \(\mathscr {U}\) and then describe the explicit results that can be derived for model class \(\mathscr {P}\), at least in the linear case using \(\mathscr {U}\). Technical justifications of these connections can be found in Antunes et al. (2023).
Fix an attribute (or latent) space \({\mathcal {A}}\) with probability measure μ. Fix a (potentially asymmetric) function \(f: {\mathcal {A}}\times {\mathcal {A}}\rightarrow {\mathbb {R}}_+\) which measures propensities of node pairs to interact based on their attributes. Fix \(\alpha \ge 0\) describing the role of degree in measuring popularity and integer \(m\ge 1\) denoting the number of edges a new vertex has when entering the system, to connect to preexisting vertices. In principle m could be random and/or dependent on the attribute type, but for simplicity and to match existing literature (e.g. Karimi et al. (2018)) we focus on the fixed m setting (see Antunes et al. (2023) for results when m is attribute dependent). Let N be the number of nodes (vertices) in the network. In the model class \(\mathscr {P}\), nodes \(\left\{ v_{{ n}}:1\le {n}\le N\right\}\) enter the system sequentially starting at \({n}=1\) with a base connected graph \({\mathcal {G}}_1\) (with every node having an attribute in \({\mathcal {A}}\)) with dynamics:

(i)
Every node \(v_{{ n}}\) has attribute \(a(v_{{ n}}) \in {\mathcal {A}}\) generated independently using \(\mu\).

(ii)
Node \(v_{{n}}\) enters the system with m edges.

(iii)
The dynamics for connecting each of the m edges are recursively defined as follows: suppose the network has been constructed till stage n with structure \({\mathcal {G}}_{n}\). For any n and \(0\le i\le m-1\) and \(v\in {\mathcal {G}}_{n}\), let \(\deg _i(v,{n})\) denote the degree of v at time n when i of the edges of \(v_{{n}+1}\) have connected to \({\mathcal {G}}_{n}\). Conditional on \({\mathcal {G}}_{n}\) and stage i, the probability that the \((i+1)\)th edge of \(v_{{n}+1}\) connects to \(v\in {\mathcal {G}}_{n}\) is proportional to:
$$\begin{aligned} P_{v_{{n}+1} v} \propto f(a(v), a(v_{{n}+1})) [\deg _i(v,{n})]^\alpha . \end{aligned}$$(1)
Once this edge has connected, all the degrees are updated and the above dynamics are repeated until all m edges have connected to \({\mathcal {G}}_{n}\). When \(m=1\), each new vertex has only one edge to connect to the network and in this case we write \(\deg (v,{n}){:}{=} \deg _0(v,{n})\).
We will refer to this as model class \(\mathscr {P}\) (or \(\mathscr {P}(\alpha , \mu , f)\) when we want to specify all the parameters; we suppress dependence on m to ease notation) and sometimes write \(\left\{ {\mathcal {G}}_n:1\le n\le N\right\} \sim \mathscr {P}(\alpha , \mu ,f)\). The model (1) extends various existing models including: the Barabási-Albert model Barabási and Albert (1999) (\(f\equiv 1\), \(\alpha =1\)), sublinear PA Krapivsky and Redner (2001) (\(f\equiv 1\), \(0< \alpha <1\)), PA with multiplicative fitness Bianconi and Barabási (2001) (\(f(a,a^\prime ) =a\), \(\alpha =1\)), the scale-free homophilic model de Almeida et al. (2013) (\(f(a,a^\prime ) = 1-|a-a^\prime |\), \({\mathcal {A}}= [0,1]\), \(\alpha =1\)), and geometric versions with \(\alpha =1\), a compact metric space \({\mathcal {A}}\) and an appropriate function f of the distance Flaxman et al. (2007) and Jordan (2013). Most existing studies focus on asymptotics for either the degree distribution or maximal degree. The notation used in the paper is summarized in Table 1.
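The growth rule (1) can be sketched in a few lines of Python for a finite attribute space. This is only an illustrative sketch, not the simulation code used for the figures: the function name `simulate_model_p`, the choice of a fully connected base graph on m + 1 nodes, the 0-based type labels and the fact that repeated edges to the same target are allowed are all assumptions made here.

```python
import random

def simulate_model_p(N, m, alpha, mu, f, seed=0):
    """Grow an attributed network following rule (1); returns (attrs, edges).

    mu is a p.m.f. over types 0..K-1 and f(a_old, a_new) > 0 is the affinity.
    """
    rng = random.Random(seed)
    types = list(range(len(mu)))
    # Assumption: the base graph G_1 is a fully connected cluster of m + 1 nodes.
    n0 = m + 1
    attrs = [rng.choices(types, weights=mu)[0] for _ in range(n0)]
    edges = [(i, j) for i in range(n0) for j in range(i)]
    deg = [n0 - 1] * n0
    for new in range(n0, N):
        a_new = rng.choices(types, weights=mu)[0]  # step (i): attribute drawn from mu
        for _ in range(m):                          # steps (ii)-(iii): m edges, one at a time
            # Weight of each existing node v: f(a(v), a(new)) * deg(v)^alpha.
            weights = [f(attrs[v], a_new) * deg[v] ** alpha for v in range(new)]
            v = rng.choices(range(new), weights=weights)[0]
            edges.append((new, v))
            deg[v] += 1                             # degrees are updated after every edge
        attrs.append(a_new)
        deg.append(m)
    return attrs, edges
```

For instance, `simulate_model_p(1000, 2, 0.2, [0.7, 0.2, 0.1], lambda a, b: 0.8 if a == b else 0.1)` mimics the parameter choice of the homophilic illustration discussed later, up to the assumed base graph.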
Homophily functionals
When the latent space \({\mathcal {A}}= \left\{ 1,2,\ldots , K\right\}\) is finite, one can define macroscopic measures of homophily, and conversely heterophily Park and Barabási (2007), from an observed network \({\mathcal {G}}\) (either synthetic or empirically observed) on N nodes as follows. Let \({\mathcal {E}}\) denote the total edge set; for \(a\in {\mathcal {A}}\), let \({\mathcal {V}}_a\) be the set of nodes of type a, and for \(a,a'\in {\mathcal {A}}\), let \({\mathcal {E}}_{aa'}\) be the set of edges between nodes of types a and \(a'\). Let \(p = |{\mathcal {E}}|/\binom{N}{2}\) be the edge density. For \(a\in {\mathcal {A}}\), dyadicity
$$\begin{aligned} D_a = \frac{|{\mathcal {E}}_{aa}|}{\binom{|{\mathcal {V}}_a|}{2}\, p} \end{aligned}$$(2)
measures the contrast in edges within the cluster of nodes of type a as compared to a setting where all edges are randomly distributed; thus \(D_a > 1\) signals homophilic characteristics of type a nodes while \(D_a<1\) signifies heterophilic nature of type a nodes. Similarly, for \(a\ne a'\), heterophilicity
$$\begin{aligned} H_{aa'} = \frac{|{\mathcal {E}}_{aa'}|}{|{\mathcal {V}}_a|\, |{\mathcal {V}}_{a'}|\, p} \end{aligned}$$(3)
denotes propensity of type a nodes to connect to type \(a'\) nodes as contrasted with random placement of edges with probability equal to the global edge density. If \(H_{aa'} < 1\), nodes of different labels tend not to be connected (homophilic); if \(H_{aa'} > 1\), there are more connections between nodes of different labels a and \(a'\) than expected (heterophilic).
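As a concrete illustration, dyadicity and heterophilicity can be computed from a list of node attributes and an edge list. The sketch below assumes the usual Park and Barabási normalizations, namely within-type edge counts divided by \(\binom{|{\mathcal {V}}_a|}{2} p\) and between-type edge counts divided by \(|{\mathcal {V}}_a| |{\mathcal {V}}_{a'}| p\); the function name `homophily_measures` is hypothetical, and every type is assumed to have at least two nodes.

```python
from collections import Counter
from math import comb

def homophily_measures(attrs, edges):
    """Return (D, H): dyadicity per type and heterophilicity per pair of types."""
    N = len(attrs)
    p = len(edges) / comb(N, 2)          # global edge density
    size = Counter(attrs)                # |V_a| for each type a
    count = Counter()                    # |E_{aa'}| keyed by the sorted pair of types
    for u, v in edges:
        a, b = sorted((attrs[u], attrs[v]))
        count[(a, b)] += 1
    # Assumes every type has at least two nodes, so comb(size[a], 2) > 0.
    D = {a: count[(a, a)] / (comb(size[a], 2) * p) for a in size}
    H = {(a, b): count[(a, b)] / (size[a] * size[b] * p)
         for a in size for b in size if a < b}
    return D, H
```

On a toy graph with attributes `[0, 0, 1, 1]` and edges `[(0, 1), (2, 3), (0, 2)]`, the edge density is 0.5, both dyadicities equal 2 and the heterophilicity between the two types equals 0.5, which is the homophilic pattern described above.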
Illustrations of homophilic synthetic networks of the model class \(\mathscr {P}(\alpha , \mu , f)\) generated from (1) are given in Fig. 1. The total number of nodes is \(N=1000\) and each node has an attribute in \({\mathcal {A}}=\{1, 2, 3\}\) according to the probability mass function (p.m.f.) \(\mu = (0.7, 0.2,0.1)\); the propensities of node pairs to connect based on their attributes are \(f(a,a)=0.8\), \(f(a,a')=0.1\), \(a\ne a'=1,2,3\) and \(m=2\). The networks are plotted for different values of \(\alpha\) in Fig. 1a–c. For instance, with \(\alpha =0.2\), the corresponding homophily measures are \(D_1=1.364\), \(D_2=3.038\), \(D_3=7.38\), \(H_{12}=0.336\), \(H_{13}=0.386\) and \(H_{23}=0.399\). Figure 2 shows the case of heterophilic networks with \(N=1000\) of the model class \(\mathscr {P}(\alpha , (0.7,0.2,0.1), f)\) for different values of \(\alpha\) with \(f(a,a)=0.2\), \(f(a,a')=0.4\), \(a\ne a'=1,2,3\) and \(m = 2\). For \(\alpha =1\), the homophily measures are \(D_1=0.750\), \(D_2=0.772\), \(D_3=0.750\), \(H_{12}=1.479\), \(H_{13}=1.615\) and \(H_{23}=1.873\).
Model class \(\mathscr {U}\) and rationale
While model class \(\mathscr {P}\) has been heavily used in applications, deriving large network asymptotics of functionals is nontrivial. Next we will describe a related network model (model class \(\mathscr {U}\)), the rationale for why this might be more amenable to analysis, and then formalize situations where given \(\mathscr {P}\), one can construct (using as input the parameters \(\alpha , \mu , f\) from \(\mathscr {P}\)), a corresponding model in class \(\mathscr {U}\) such that properties of \(\mathscr {P}\) can be read off from (the more easily analyzable) \(\mathscr {U}\). For most of this discussion we will only consider the \(m=1\) setting, albeit the formulae for asymptotics for various functionals considered below seem to extend, at least in simulations, in a straightforward manner to general m setting.
Since the general setting (with “continuous” attribute space) is more technical, let us explain the basic rationale in the simpler discrete setting where \({\mathcal {S}}= [K]{:}{=}\left\{ 1,2,\ldots , K\right\}\) so that \(\mu\) is a p.m.f. Fix a p.m.f. \(\nu\) (potentially, and in most cases, different from \(\mu\)) and consider the attributed network model \(\left\{ {\tilde{{\mathcal {G}}}}_n: n\ge 0\right\}\) with dynamics:
Note that the above model is invariant to scaling in \(\nu\), so it will be convenient to allow \(\nu\) to be a general weight sequence instead of normalizing it to be a probability measure.
The above belongs to a general class of models defined below that we will refer to as \(\mathscr {U}(\alpha , \nu , f)\). Thus, here the p.m.f. \(\nu\) plays the role of a weight and further, unlike the model \(\mathscr {P}\) where each new arriving vertex has attribute sampled independently from the current state of the network, here the distribution of new vertices is closely dependent on the entire state of the current network.
Rationale for technical tractability: Tabling the issue of connection with \(\mathscr {P}\) for the next sections, first note that \(\mathscr {U}\) can be simulated via dynamics where every vertex essentially behaves independently ((c) below). In brief, if one wanted to simulate model class \(\mathscr {U}\) starting from one vertex of type a, then this can be done as follows:

(a)
Every vertex v that enters the system (starting with the root of type a) gives birth independently to child nodes with attributes in continuous time, connected to the vertex.

(b)
For a node of type a, conditional on its degree d, the rate of reproduction of a child node of type \(a^\prime\) is \(\nu (a) f(a,a^\prime ) d^\alpha\).

(c)
Reproduction dynamics is independent across nodes.
Write \(\left\{ {{\,\textrm{BP}\,}}(t):t\ge 0\right\}\) for the (continuous time) process and, for any \(n\ge 1\), let \(T_n\) be the (random) time at which \(|{{\,\textrm{BP}\,}}(T_n)| =n\). (BP stands for Branching Process.) Then it is easy to check that \(\left\{ {{\,\textrm{BP}\,}}(T_n):1\le n\le N\right\}\) has the same distribution as \(\left\{ {\tilde{{\mathcal {G}}}}_n: 1\le n \le N\right\} \sim \mathscr {U}(\alpha , \nu , f)\). Further, the independence in the evolution makes this model much more amenable to analysis, yielding asymptotic information for the process \({{\,\textrm{BP}\,}}\) and thus the model \(\mathscr {U}\).
Resolvability
Note that the main model of interest, both as a synthetic test bed in this paper, and in preexisting work, is the model class \(\mathscr {P}\). The main goal of this section is to formalize a connection between model classes \(\mathscr {P}\) and \(\mathscr {U}\). Given \(\left\{ {\tilde{{\mathcal {G}}}}_n:0\le n\le N\right\} \sim \mathscr {U}(\alpha , \nu , f)\), for \(n\ge 1\) define \({\tilde{\pi }}_n = \sum _{t=1}^n \delta \left\{ a(v_t)\right\}\), i.e. the empirical measure of attributes in \({\tilde{{\mathcal {G}}}}_n\).
Now say that model \(\mathscr {P}(\alpha , \mu , f)\) is resolvable if there exists \(\nu\) such that for the model class \(\mathscr {U}(\alpha , \nu , f)\), the empirical measures of attribute types satisfy: \({\tilde{\pi }}_n \rightarrow \mu\) as \(n\rightarrow \infty\). In words, one can choose a weight measure \(\nu\) such that the corresponding dynamics for \(\mathscr {U}\) with the same \(\alpha\) and f drives the empirical distribution to the limiting empirical distribution \(\mu\) of model class \(\mathscr {P}\) (since every new vertex has attribute distribution \(\mu\) independent of the network evolution).
Resolvability in the linear finite attribute case
The linear case (\(\alpha =1\)) with a finite attribute space \({\mathcal {S}}=[K]\) turns out to be completely resolvable under the following.
Assumption 1
Assume the sampling measure \(\mu = (\mu _1, \ldots , \mu _K)\) has all entries strictly positive and assume the affinity kernel \(f(a,a^\prime ) >0\), \(\forall a, a^\prime \in [K]\).
Fix a model class \(\mathscr {P}(\alpha =1, \mu , f)\) satisfying the above Assumption. Let \({\mathcal {P}}([K])\) denote the \((K-1)\)-dimensional simplex of probability mass functions on [K]. Define (in the interior of \({\mathcal {P}}([K])\)) the function:
By Jordan (2013, P8), under the above Assumption, \(V_{\mu }(\cdot )\) has a unique minimizer \(\eta {:}{=} \eta (\mu ) = (\eta _1(\mu ), \ldots , \eta _K(\mu ))\) in the interior of \({\mathcal {P}}({\mathcal {S}})\). Now, define
where the final identity follows from Jordan (2013, P8). Let \(\nu = (\nu _1, \ldots , \nu _K)\). Then the following paraphrases some of the results in Antunes et al. (2023):

1.
Under the above Assumption, model \(\mathscr {P}(\alpha =1,\mu ,f)\) is resolvable with one resolving measure \(\nu\) given as above. This implies, in particular, that local functionals (such as the degree distribution and PageRank) converge to the same limits as those for \(\mathscr {U}(\alpha =1, \nu , f)\). Two specific implications are given next.

2.
For each \(a\in [K]\), the empirical p.m.f. \({\textbf{p}}_n^a\) of the degrees of vertices of type a satisfies \({\textbf{p}}_n^a {\mathop {\longrightarrow }\limits ^{P}}{\textbf{p}}_\infty ^a\), where the limit p.m.f. has tail exponent \(1+2/\phi _a\), i.e., \({\textbf{p}}_\infty ^a(k)\sim k^{-(1+2/\phi _a)}\) as \(k\rightarrow \infty\).

3.
Using the objects defined in (5), define the matrix
$$\begin{aligned} {\textbf{M}}= \left( {\textbf{M}}_{a,b}{:}{=}\frac{\phi _{a,b}}{2 - \phi _a}\right) _{a, b\in [K]}. \end{aligned}$$(6)
Then the homophily and heterophily statistics \(\left\{ D_{n,a}: a\in [K]\right\}\) and \(\left\{ H_{n, (a,a^\prime )}: a\ne a^\prime \in [K]\right\}\) satisfy the asymptotics,
$$\begin{aligned} D_{n,a} {\mathop {\longrightarrow }\limits ^{P}}\frac{[{\textbf{M}}]_{a,a}}{\mu _a}, \qquad H_{n, (a,a^\prime )} {\mathop {\longrightarrow }\limits ^{P}}\frac{1}{2}\left[ \frac{[{\textbf{M}}]_{a^\prime ,a}}{\mu _a} + \frac{[{\textbf{M}}]_{a, a^\prime }}{\mu _{a^\prime }} \right] \end{aligned}$$(7)
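Once the matrix \({\textbf{M}}\) of (6) is available, the limits in (7) are simple arithmetic. The following sketch only transcribes (7); the function name `homophily_limits` is hypothetical, types are indexed from 0, and the matrix passed in the usage note is a placeholder rather than a value computed from the model.

```python
def homophily_limits(M, mu):
    """Evaluate the limits in (7) from the matrix M of (6) and the p.m.f. mu."""
    K = len(mu)
    # Dyadicity limit: [M]_{a,a} / mu_a.
    D = [M[a][a] / mu[a] for a in range(K)]
    # Heterophilicity limit: (1/2) * ([M]_{a',a} / mu_a + [M]_{a,a'} / mu_{a'}).
    H = {(a, b): 0.5 * (M[b][a] / mu[a] + M[a][b] / mu[b])
         for a in range(K) for b in range(K) if a != b}
    return D, H
```

With the placeholder inputs `M = [[0.5, 0.1], [0.2, 0.2]]` and `mu = [0.6, 0.4]`, the returned heterophilicity is the same for the ordered pairs (0, 1) and (1, 0), reflecting the symmetry of the right-hand side of (7) in a and \(a^\prime\).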
Remark 1
Result 2 above was previously derived in Jordan (2013) using stochastic approximation techniques.
The results above are illustrated numerically in Fig. 3 and Tables 2 and 3. We fixed the model class \(\mathscr {P}(1, (0.7,0.2,0.1), f)\), where \(f(a,a)=0.8\), \(f(a,a')=0.1\), for \(a\ne a'=1,2,3\) and \(m = 1\). The model is resolvable with resolving measure \(\nu\) approximately equal to (0.742, 0.189, 0.069). We generate the model classes \(\mathscr {P}\) and \(\mathscr {U}(\alpha , \nu , f)\) using (1) and (4), respectively, for different network sizes. Figure 3 shows the degree distributions of attribute 2 for both models, which get closer as N increases. In the limit they converge to the same p.m.f. We fit a power-law distribution function using a maximum likelihood approach to the empirical degree distribution tail per attribute of the model class \(\mathscr {P}\) for each network size. The respective tail exponents are shown in Table 2 together with the asymptotic limit p.m.f. tail exponent \(1+2/\phi _a\). Finally, the empirical and asymptotic dyadicity and heterophilicity measures, given respectively by (2), (3) and (7), are reported in Table 3. The results show that complicated functionals of the model class \(\mathscr {P}\) can be easily approximated with good precision even for moderate network sizes.
Random walk samplings in attributed networks
Since many real-world networks can only be crawled, in the sense that only the neighbors of the currently visited node can be explored, we consider sampling procedures that are based on random walks. They are also a core technique for constructing various algorithms to extract information on networks, such as community detection, ranking of nodes and edges, and dimension reduction. We introduce well-known random walks which are attribute agnostic. These random walks have been designed to preserve structural properties of the network and not the representativeness of node attributes in the sample. We are interested (see next section) in estimating the attribute distribution but also structural properties (node degrees) depending on the node attributes. We show next that some random walks that use edge weights can be attribute aware samplers. This will be especially useful in homophilic networks. Throughout this section, for graph \({\mathcal {G}}\) and node \(i\in {\mathcal {G}}\), \(d_i\) will denote its degree. We assume a static graph and that only a limited set of initial seed nodes \(i\in {\mathcal {G}}\) that initialize the random walk is available. When we say that a node i is sampled, it means that its attribute a(i) (and its degree \(d_i\), depending on the quantities of interest) is added to the sample.
Metropolis-Hastings Random Walk (MHRW): At each step, if the walk is currently at node i, a neighbor j is selected uniformly at random and the proposed move to j is accepted with probability \(\min (1,d_i/d_j)\), else the walk stays at i. Thus proposed moves towards a node of smaller degree are always accepted whilst we reject some of the proposed moves towards higher degree nodes. It is easy to check that the stationary distribution is uniform over the node set, i.e., \(\pi _i = 1/N\) for every node \(i\in {\mathcal {G}}\).
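The accept/reject step of MHRW described above can be sketched as follows. This is a minimal illustration under the assumption that the graph is given as a dictionary `adj` of neighbor lists (a hypothetical input format); the function name `mhrw` is also an assumption of this sketch.

```python
import random

def mhrw(adj, start, n_steps, seed=0):
    """Metropolis-Hastings random walk on an undirected graph given as neighbor lists."""
    rng = random.Random(seed)
    i = start
    visits = []
    for _ in range(n_steps):
        j = rng.choice(adj[i])                      # propose a uniformly chosen neighbor
        # Accept with probability min(1, d_i / d_j); otherwise stay at i.
        if rng.random() < min(1.0, len(adj[i]) / len(adj[j])):
            i = j
        visits.append(i)
    return visits
```

On a small connected graph such as `{0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}`, the long-run visit frequencies are close to 1/4 for every node even though the degrees range from 1 to 3, in line with the uniform stationary distribution.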
The stationary distribution over the edge set is
Node2vec (N2V): As proposed in Grover and Leskovec (2016), in full generality, the transitions of N2V depend on the neighborhood both of the currently visited node, and the node visited prior to the current node. Let the previously and currently visited nodes be k and i, resp. The next visited node j, a neighbor of i, is chosen according to the transition probability proportional to:
$$\begin{aligned} {\left\{ \begin{array}{ll} \theta \, w_{ij}, &{} \text {if } j = k,\\ \gamma \, w_{ij}, &{} \text {if } j \text { is a common neighbor of } k \text { and } i,\\ \beta \, w_{ij}, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
where \(w_{ij}\) is the weight of edge (i, j), \(\theta\) is the parameter that represents the propensity for the random walk to backtrack, \(\gamma\) quantifies the propensity to move to a common neighbor of the currently visited node and the node visited in the last step, and \(\beta\) is the parameter for exploring any other neighbor; see Fig. 4. N2V is a second order Markov chain. We now describe specific variants of this random walk, which include some classical versions.
Node2vec1 (N2V1): If the network is undirected, unweighted and \(\theta =\beta =\gamma\), one obtains the classical RW with the well-known stationary distributions,
Node2vec2 (N2V2): If the network is undirected and \(\theta =\beta =\gamma\), one obtains a weighted RW. This walk can use node attributes through weights in contrast to N2V1. We assume that for each sampled node i, we have access to the attributes of the neighbors of i. If there is a connection between i and j, the weight \(w_{ij}\) is a function of a(i) and a(j). In a homophilic network, setting \(w_{ij}\) to a lower value if nodes have equal attributes encourages the sampling of nodes with different attributes. The stationary distributions in this case are given by
Node2vec3 (N2V3): If the network is undirected, without self-loops or multiple edges, and \(\beta =\gamma\), \(\theta >0\), with equal weights \(w_{ij}\), the stationary distributions for nodes and edges are given by (10); see Meng and Masuda (2020). With small \(\theta\), the walk approaches the nonbacktracking random walk, avoiding 2-hop redundancy in the sample.
Node2vec4 (N2V4): We consider next the combination of the last two schemes, with \(\beta =\gamma\), \(\theta >0\) and weights \(w_{ij}\) dependent on the attributes of i and j. In this setting, one major technical hurdle is that, unlike the settings above, there is no explicit formula for the stationary distributions. Analogous to the stationary distributions for N2V3 matching the usual RW in the stationary regime, it is expected that especially in the small \(\theta\) setting, the stationary distributions can still be approximated by those in (11). We explore the efficacy of these approximations for moderate size synthetic networks below.
Node2vec5 (N2V5): In this variant the weights \(w_{ij}\) are equal to 1 and \(\theta\), \(\gamma\) and \(\beta\) are different. To push the exploration of the network towards nodes that are further away from the previously visited nodes, we consider the case \(\theta< \gamma < \beta\). The stationary distributions in this case are not known and we will use the empirical distribution obtained through simulations.
Node2vec6 (N2V6): This is the more general variant extending N2V5 to have weights. Again, the most interesting case is \(\theta< \gamma < \beta\). As in N2V5 the stationary distributions are unknown. However, we include this sampling scheme for a full evaluation of the performance of N2V. We believe that for the network model an approximation can be obtained for stationary distributions through the resolvability of the model classes \(\mathscr {P}\) and \(\mathscr {U}\). Due to the technical nature of the problem, it is outside the scope of this paper, and will be considered in a future work.
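The six variants differ only in how the three bias parameters and the edge weights are set. The following is a minimal sketch of one step of such a second order walk, assuming the unnormalized transition score from the current node to a neighbor k is the product of \(w_{jk}\) and one of \(\theta\), \(\gamma\), \(\beta\), as described above. The function names and the dictionary-based graph representation are illustrative choices, not part of the original framework.

```python
import random

def n2v_transition_probs(adj, weights, prev, cur, theta, gamma, beta):
    """Return {next_node: probability} for a walker at `cur` that came from `prev`.

    adj:     dict node -> set of neighbours (undirected, no self-loops)
    weights: dict frozenset({i, j}) -> w_ij; edges not listed default to 1.0
    prev:    previously visited node, or None on the first step
    """
    scores = {}
    for k in adj[cur]:
        w = weights.get(frozenset((cur, k)), 1.0)
        if k == prev:
            bias = theta              # backtracking move
        elif prev is not None and k in adj[prev]:
            bias = gamma              # common neighbour of prev and cur
        else:
            bias = beta               # any other neighbour
        scores[k] = bias * w
    total = sum(scores.values())      # positive as long as theta, gamma, beta, w > 0
    return {k: s / total for k, s in scores.items()}

def n2v_walk(adj, weights, start, n_steps, theta, gamma, beta, rng=None):
    """Run the walk for n_steps and return the list of visited nodes."""
    rng = rng or random.Random()
    prev, cur = None, start
    path = [start]
    for _ in range(n_steps):
        probs = n2v_transition_probs(adj, weights, prev, cur, theta, gamma, beta)
        nodes = list(probs)
        nxt = rng.choices(nodes, weights=[probs[k] for k in nodes])[0]
        prev, cur = cur, nxt
        path.append(cur)
    return path

# Triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# N2V1: equal parameters, unit weights (classical RW).
print(n2v_transition_probs(adj, {}, prev=0, cur=2, theta=1.0, gamma=1.0, beta=1.0))
# N2V5-like setting: theta < gamma < beta favours the non-common neighbour 3.
print(n2v_transition_probs(adj, {}, prev=0, cur=2, theta=1e-3, gamma=0.1, beta=1.0))
```

Setting all three parameters equal with unit weights recovers N2V1, while passing attribute-dependent entries in `weights` gives N2V2, N2V4 or N2V6 depending on the parameters.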
For comparison to RWs, we will also use the following baseline sampling schemes. These can be viewed as “ideal” for sampling purposes and correspond to the limiting distributions of some RWs.
Node Sampling (NS): NS requires full access to the network and is unavailable for many real networks. In the classical NS, nodes are chosen independently and uniformly from the network with replacement.
Edge Sampling (ES): In the classical ES, edges are chosen independently and uniformly from the network with replacement. Since ES selects edges rather than nodes to populate the sample, the node set is constructed by including both incident nodes in the sample when a particular edge is sampled.
Estimation of attribute distributions and homophily measures
We consider here estimation in the case of discrete-valued attributes; the case of continuous-valued attributes is discussed at the end of this work. Our estimators of quantities of interest will be based on one of the following two general estimators. The first estimator is for the proportion p(A) of nodes i with a certain characteristic A(i) taking value A. The characteristic takes discrete values and could be the discrete attribute \(a_i=a(i)\) itself, the degree \(d_i=d(i)\), the combination of the latter two, etc. The estimator of p(A) for a random walk is defined as follows. Run a random walk (any of the sampling schemes described above) for n steps and let \(i_s\) denote the sth node sampled by the random walk, for \(1\le s \le n\). Since nodes are sampled with replacement and with probabilities \(\pi _i\) in the stationary regime, the proportion p(A) can be estimated as
\[ {\widehat{p}}(A) = \frac{1}{nN} \sum _{s=1}^{n} \frac{{\textbf{1}}\{A(i_s)=A\}}{\pi _{i_s}}, \tag{12} \]
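where \({{\textbf{1}}}\{E\}=1\) if E is true and 0 otherwise Kolaczyk (2009) (Chapter 5). If the total number of nodes N is unknown, its estimator is given by \({\widehat{N}} = (1/n) \sum _s 1/ \pi _{i_s}\), and (12) becomes
\[ {\widehat{p}}(A) = \frac{\sum _{s=1}^{n} {\textbf{1}}\{A(i_s)=A\}/\pi _{i_s}}{\sum _{s=1}^{n} 1/\pi _{i_s}}. \tag{13} \]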
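A direct application of e.g. (12) yields the following estimators for the proportion p(k, a) of nodes with degree k and attribute a, the proportion p(a) of nodes with attribute a, and the conditional proportion \(p(k|a)=p(k,a)/p(a)\) of nodes with degree k among the nodes having attribute a:
\[ {\widehat{p}}(k,a) = \frac{1}{n{\widehat{N}}} \sum _{s=1}^{n} \frac{{\textbf{1}}\{d_{i_s}=k,\, a_{i_s}=a\}}{\pi _{i_s}}, \tag{14} \]
\[ {\widehat{p}}(a) = \frac{1}{n{\widehat{N}}} \sum _{s=1}^{n} \frac{{\textbf{1}}\{a_{i_s}=a\}}{\pi _{i_s}}, \tag{15} \]
\[ {\widehat{p}}(k|a) = \frac{{\widehat{p}}(k,a)}{{\widehat{p}}(a)}. \tag{16} \]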
We note that the quantities in (14)–(16) are given in terms of the sample obtained through the random walk used with N estimated by \({{\widehat{N}}}\).
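As an illustration, the estimators with N replaced by \({\widehat{N}}\) reduce to ratios of inverse-probability weighted sums over the visited nodes. The sketch below assumes the stationary probabilities are available for every sampled node, possibly only up to a constant factor (which cancels in the ratio). The function names are hypothetical.

```python
def estimate_proportion(sample, pi, has_property):
    """Ratio estimate of the proportion of nodes satisfying `has_property`.

    sample:       list of visited nodes i_1, ..., i_n (repeats allowed)
    pi:           dict node -> stationary probability, up to a constant factor
    has_property: function node -> bool, e.g. lambda i: attr[i] == a
    """
    num = sum(1.0 / pi[i] for i in sample if has_property(i))
    den = sum(1.0 / pi[i] for i in sample)
    return num / den

def estimate_degree_given_attribute(sample, pi, degree, attr, k, a):
    """Estimate p(k|a) as the ratio of the estimates of p(k, a) and p(a)."""
    p_ka = estimate_proportion(sample, pi, lambda i: degree[i] == k and attr[i] == a)
    p_a = estimate_proportion(sample, pi, lambda i: attr[i] == a)
    return p_ka / p_a if p_a > 0 else float("nan")

# Path graph 0 - 1 - 2 sampled by a classical RW, for which pi_i is proportional to d_i.
degree = {0: 1, 1: 2, 2: 1}
attr = {0: "a", 1: "b", 2: "a"}
sample = [0, 1, 1, 2]            # a hypothetical sequence of visited nodes
print(estimate_proportion(sample, degree, lambda i: attr[i] == "a"))
print(estimate_degree_given_attribute(sample, degree, degree, attr, k=1, a="a"))
```

Because only ratios of \(1/\pi _{i_s}\) enter, the degrees themselves can be passed as `pi` for the classical RW, without knowing the total number of edges.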
The performance of \({{\widehat{p}}}(A)\) in (12) and hence the components of the estimators (14)–(16) can be assessed through their MSE. For fixed A, the MSE of \({{\widehat{p}}}(A)\) is given by \(E[({{\widehat{p}}}(A) - p(A))^2]\). In the stationary regime, \({{\widehat{p}}}(A)\) in (12) is an unbiased estimator of p(A) and the MSE is equal to the variance \(V[{{\widehat{p}}}(A)]\). The variance of \({{\widehat{p}}}(A)\) can be related to the spectral gap of the RW. More specifically, let P be the associated transition matrix of the random walk with eigenvalues (real by reversibility): \(1=\lambda _1\ge \lambda _2 \ge \ldots \ge \lambda _N \ge -1\). The spectral gap is defined as \(\delta = 1-\lambda _2\). Equivalently, the relaxation time of the RW is the reciprocal of the spectral gap. A larger spectral gap implies a faster convergence of the RW to its stationary distribution. From Aldous and Fill (2002) (Proposition 4.29), we have
where \(\Lambda (A)=\sum _{i=1}^N {\textbf{1}}\{A(i) = A\} /(N^2 \pi _i)\). The error in estimating the proportion of nodes with characteristic A is thus proportional to the inverse of the spectral gap and \(\Lambda (A)\); the latter is small if the probability of sampling nodes with characteristic A is large. We will see in Section Experiments that for N2V2, if edge weights \(w_{ij}\) are inversely related to the concordance of the attributes, thus encouraging the walk to explore vertices with different attributes, then in some settings, this increases \(\delta\) and decreases \(\Lambda (a)\) (for attributes with small proportions), resulting in a smaller variance of the estimator for the proportion p(a) of nodes with attribute a.
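For small networks, both quantities entering the bound can be computed directly. The following is a rough sketch for reversible walks with \(P_{ij}=w_{ij}/\sum _k w_{ik}\) (N2V1 and N2V2), assuming a connected graph with at least two nodes. It finds \(\lambda _2\) by power iteration on a symmetrized, shifted and deflated version of P, which is an implementation choice made here to avoid external libraries. The number of iterations needed depends on the graph, so the default is only a placeholder.

```python
import math

def stationary_distribution(adj_w):
    """pi_i = s_i / sum_j s_j, where s_i is the total edge weight at node i."""
    s = {i: sum(nb.values()) for i, nb in adj_w.items()}
    total = sum(s.values())
    return {i: s[i] / total for i in adj_w}

def spectral_gap(adj_w, iters=2000):
    """Spectral gap 1 - lambda_2 of the walk with P_ij = w_ij / s_i.

    adj_w: dict node -> dict neighbour -> symmetric positive weight.
    """
    nodes = list(adj_w)
    s = {i: sum(adj_w[i].values()) for i in nodes}
    pi = stationary_distribution(adj_w)
    top = {i: math.sqrt(pi[i]) for i in nodes}    # unit eigenvector for eigenvalue 1

    def apply(v):
        # M = (S + I) / 2 with S = D^{-1/2} W D^{-1/2}; S has the eigenvalues of P,
        # and the shift maps them into [0, 1] while preserving their order.
        out = {}
        for i in nodes:
            sv = sum(w * v[j] / math.sqrt(s[i] * s[j]) for j, w in adj_w[i].items())
            out[i] = 0.5 * (sv + v[i])
        return out

    def deflate(v):
        # Remove the component along the top eigenvector.
        c = sum(v[i] * top[i] for i in nodes)
        return {i: v[i] - c * top[i] for i in nodes}

    v = deflate({i: math.sin(1.0 + idx) for idx, i in enumerate(nodes)})
    mu = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v.values()))
        if norm < 1e-300:
            break
        v = {i: x / norm for i, x in v.items()}
        mv = deflate(apply(v))
        mu = sum(v[i] * mv[i] for i in nodes)     # Rayleigh quotient of M
        v = mv
    lambda2 = 2.0 * mu - 1.0
    return 1.0 - lambda2

def lambda_term(pi, has_property):
    """Lambda(A) = sum over nodes with the property of 1 / (N^2 pi_i)."""
    n = len(pi)
    return sum(1.0 / (n * n * pi[i]) for i in pi if has_property(i))

# Complete graph on 4 nodes with unit weights.
k4 = {i: {j: 1.0 for j in range(4) if j != i} for i in range(4)}
print(spectral_gap(k4))
print(lambda_term(stationary_distribution(k4), lambda i: i < 2))
```

Re-running these two functions with attribute-dependent weights in `adj_w` is one way to inspect how a given choice of \(w_{ij}\) moves \(\delta\) and \(\Lambda (a)\) on a small test network.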
The second estimator is for the proportion p(B) of edges (i, j) with a certain characteristic B(i, j) taking value B. The values B are assumed to be discrete. For the random walk considered above, since edges are sampled with probabilities \(\pi _{ij}\) in the stationary regime, the proportion p(B) can be estimated similarly to (12) as
and if needed, the number of edges as
A direct application of (18)–(19) is the estimation of the homophily measures \(D_a\) and \(H_{aa'}\) in (2) and (3) as:
where \(\widehat{ {\mathcal {V}}_{a} } = {{\widehat{N}}} {{\widehat{p}}}(a)\), \(\widehat{p} = \widehat{{\mathcal {E}}} / { \widehat{N} \atopwithdelims ()2}\) and
where \(a,a' \in {\mathcal {A}}\). We note again that the quantities in (19)–(21) are given by the sample obtained through the respective random walk used. We are not aware of results of the type (17) for assessing the variability of the estimator \({{\widehat{p}}}(B)\) in (18).
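For reference, the population quantities targeted by these estimators can be computed on a fully observed network. The sketch below assumes the standard definitions of dyadicity and heterophilicity, namely the number of edges within (respectively between) attribute groups divided by its expectation under the edge density \(p={\mathcal {E}}/{N \atopwithdelims ()2}\). Equations (2) and (3) are not restated in this section, so this form should be checked against them. The sampling-based estimators replace the counts by their estimates.

```python
from math import comb

def homophily_measures(edges, attr):
    """Return (D, H): dyadicity per attribute and heterophilicity per attribute pair.

    edges: list of undirected edges (i, j), each listed once
    attr:  dict node -> attribute value (values must be sortable)
    """
    n = len(attr)
    p = len(edges) / comb(n, 2)                  # edge density
    size = {}
    for v in attr:
        size[attr[v]] = size.get(attr[v], 0) + 1
    count = {}
    for i, j in edges:
        key = tuple(sorted((attr[i], attr[j])))
        count[key] = count.get(key, 0) + 1
    # Observed within-group edges over the number expected under density p.
    D = {a: count.get((a, a), 0) / (comb(c, 2) * p)
         for a, c in size.items() if c >= 2}
    # Observed between-group edges over the number expected under density p.
    H = {(a, b): count.get((a, b), 0) / (size[a] * size[b] * p)
         for a in size for b in size if a < b}
    return D, H

# Two groups of two nodes, one edge inside each group and one edge across.
attr = {0: 1, 1: 1, 2: 2, 3: 2}
edges = [(0, 1), (2, 3), (1, 2)]
print(homophily_measures(edges, attr))
```

Values of D above 1 together with values of H below 1 indicate homophily, which matches the pattern of the measures reported for the synthetic homophilic network.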
In terms of complexity of the learning framework, the random walks considered in this work are computationally efficient in terms of both space and time requirements Grover and Leskovec (2016). For instance, for each visited node, we need to check the immediate neighbors and their attributes. For the second order random walks (N2V3, 4, 5 and 6), we additionally need to keep track of the interconnections between the neighbors of the currently visited node; however, the average degree of the graph is usually small for most real world networks. The proposed estimators are obtained from simple weighted sample statistics.
Experiments
In this section, we assess the performance of the sampling methods and estimators in learning the attribute distribution, degree distribution per attribute and homophily measures on synthetic and real-world networks with discrete attributes.
Synthetic network with homophily
We consider the model class \(\mathscr {P}(\alpha , \mu , f)\) with \(N=2000\) nodes and a discrete attribute taking 3 values. In the generation of the network, each node that enters the system has attribute 1, 2 or 3 with probabilities \(\mu _1=0.7\), \(\mu _2=0.2\), \(\mu _3=0.1\), respectively, and connects to \(m=2\) nodes with probabilities proportional to (1), where \(f(a,a)=0.8\), \(f (a,a')=0.1\), \(a,a'=1,2,3\), \(a\ne a'\). We investigate the effect of homophily on the estimation of the quantities of interest in a controlled environment for the two most interesting network topologies: sublinear (\(\alpha = 0.2\)) and linear (\(\alpha = 1\)).
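A network of this type can be generated along the following lines. This is a sketch under several assumptions that are not stated in this section: the attachment score of an existing node v for a new node with attribute a is taken to be \(f(a(v),a)\,d(v)^{\alpha }\) (our reading of (1)), the m targets are drawn without replacement, and the seed graph is a complete graph on m+1 nodes. The function name is hypothetical.

```python
import random

def generate_attributed_pa(n, m, alpha, mu, f, seed=0):
    """Grow an attributed preferential attachment network.

    n:     final number of nodes
    m:     number of edges added by each new node
    alpha: exponent applied to the degree of the candidate node
    mu:    dict attribute -> probability of a new node having that attribute
    f:     dict (attribute of existing node, attribute of new node) -> propensity
    Returns (adj, attr) with adj a dict node -> set of neighbours.
    """
    rng = random.Random(seed)
    values = list(mu)
    probs = [mu[a] for a in values]

    # Assumed seed graph: complete graph on m + 1 nodes with random attributes.
    attr = {v: rng.choices(values, weights=probs)[0] for v in range(m + 1)}
    adj = {v: set(range(m + 1)) - {v} for v in range(m + 1)}

    for new in range(m + 1, n):
        a_new = rng.choices(values, weights=probs)[0]
        candidates = list(adj)
        scores = [f[(attr[v], a_new)] * len(adj[v]) ** alpha for v in candidates]
        targets = set()
        while len(targets) < m:
            pick = rng.choices(range(len(candidates)), weights=scores)[0]
            targets.add(candidates[pick])
            scores[pick] = 0.0        # draw targets without replacement
        attr[new] = a_new
        adj[new] = set()
        for v in targets:
            adj[new].add(v)
            adj[v].add(new)
    return adj, attr

mu = {1: 0.7, 2: 0.2, 3: 0.1}
f = {(a, b): (0.8 if a == b else 0.1) for a in mu for b in mu}
adj, attr = generate_attributed_pa(2000, 2, 0.2, mu, f, seed=1)
print(len(adj), sum(len(nb) for nb in adj.values()) // 2)
```

The loop recomputes all scores for every new node, which is quadratic in the network size but adequate for networks of the size used here.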
Attribute distribution
Setting 1 (\(\alpha =0.2\)): The evaluation of the sampling methods in learning the attribute distribution using (15), assuming N unknown, is shown in Fig. 5. Each boxplot is constructed from 500 estimates. The length of each walk is 0.15N. MHRW has the important property that its stationary distribution is uniform over all the nodes. Thus, in principle, MHRW is equivalent to RNS of the network for an infinite RW. In practice, MHRW typically requires sample sizes of O(N) to approach the stationary distribution Kumar and Sundaram (2021). It is therefore challenging to use MHRW for large scale networks with millions of nodes, where the typical sample size is much smaller than the network size. Networks with strong homophily are problematic in this case since MHRW tends to get stuck among nodes with the same attribute. The classical variant of node2vec, N2V1, which like MHRW is attribute agnostic, has the property that its stationary distribution is uniform over all the edges. N2V1 is equivalent to RES of the network for an infinite RW. In practice, it suffers from the same drawbacks as MHRW, to a lesser extent. The poor performance can also be explained through the bound on the variance (17). Table 4 shows that MHRW has the lowest spectral gap while N2V1 has a high value \(\Lambda (3)\) for attribute 3 (this is detailed next for N2V2).
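For concreteness, one common form of MHRW with uniform target distribution proposes a uniformly chosen neighbor and accepts the move with probability \(\min (1, d_i/d_j)\), where i is the current node and j the proposal. The sketch below assumes this standard form, since the definition used in this work is given in an earlier section. The function names are illustrative.

```python
import random

def mhrw_accept_prob(deg_cur, deg_prop):
    """Acceptance probability of a move from a node of degree deg_cur
    to a proposed neighbour of degree deg_prop (uniform target)."""
    return min(1.0, deg_cur / deg_prop)

def mhrw_walk(adj, start, n_steps, rng=None):
    """Run the Metropolis-Hastings walk; rejected proposals repeat the current node."""
    rng = rng or random.Random()
    cur = start
    path = [cur]
    for _ in range(n_steps):
        prop = rng.choice(list(adj[cur]))
        if rng.random() < mhrw_accept_prob(len(adj[cur]), len(adj[prop])):
            cur = prop
        path.append(cur)
    return path

# Star graph: moves from a leaf to the hub are accepted with probability 1/4 only.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(mhrw_accept_prob(len(star[1]), len(star[0])))
print(mhrw_walk(star, 1, 10, random.Random(0)))
```

The low acceptance probability for moves into high degree nodes is what makes the walk linger at low degree nodes and lengthens the sample needed to approach the uniform distribution.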
The attribute aware samplers like N2V2 use node attributes to determine the next node to add to the sample, by checking the attribute of the candidate node against the attribute of the last node added to the sample. To simplify the exposition (instead of \(w_{ij}\) for nodes i and j), we write \({{\overline{w}}}_{aa}\) for the weight of edges between nodes with the same attribute, and \({{\overline{w}}}_{aa'}\) for edges between nodes with different attributes. Table 5 shows the effect of the weights on the standard deviation of the estimate for N2V2 for attribute 3. To explain the differences, we turn to the bound on the variance of the estimator (17). The error in estimating the proportion of nodes with an attribute a is upper bounded in terms of the inverse of the spectral gap. If \({{\overline{w}}}_{aa}\) is much smaller than \({{\overline{w}}}_{aa'}=1\), say \(\overline{w}_{aa}=0.05\), then the movements of N2V2 between different node attributes are very frequent and exploration within each attribute is insufficient. In this case, the spectral gap is low, creating a bottleneck for approaching the stationary distribution. As \({{\overline{w}}}_{aa}\) increases, the inter-attribute moves become less frequent, accelerating the convergence to the stationary distribution. On the other hand, when \({{\overline{w}}}_{aa}\) becomes greater than or equal to \({{\overline{w}}}_{aa'}\), the spectral gap decreases, up to the point where N2V2 hardly transits from one attribute value to another. The error in estimating the attribute distribution is also bounded in terms of the quantity \(\Lambda (a)\), which is small if the probability of sampling nodes with attribute a is large. We also observe from Table 5 the effect of \({{\overline{w}}}_{aa}\) on the value \(\Lambda (a)\) for attribute 3. The trade-off between \(\delta\) and \(\Lambda (a)\) explains the smaller standard deviation for attribute 3 of N2V2 with \(\overline{w}_{aa}=0.25\).
The convex behavior of the empirical standard deviation as a function of \({{\overline{w}}}_{aa}\) will be explored at the end of this work in the guidelines for setting the weights of attribute aware samplers.
In N2V3, the backtracking parameter \(\theta\) is set close to zero, \(\theta =10^{-3}\), so that a walker arriving at a node with degree 1 always backtracks in the next time step (this being the only possible move), and \(\beta =\gamma =1\). In this case, N2V3 tends to explore the network better, avoiding redundancy of nodes in the sample, which accelerates the convergence (see the spectral gap in Table 4). The result is consistent with those for non-backtracking RWs on regular graphs Alon et al. (2007), where in many cases the spectral gap is found to be “twice as good” as that of the classical RW, as also in our case.
N2V4 combines features of both attribute aware and nonbacktracking samplers. We use the same weights and backtracking parameters as in N2V2 and N2V3 above. Since the stationary distribution \(\pi _i\) in (15) is not known, it is obtained through simulations. The results show that N2V4 can provide better estimates with lower variability compared to N2V2 and N2V3. This can be explained by the increase of the spectral gap while keeping \(\Lambda (a)\) small for attribute values 2 and 3 (see Table 4). We have confirmed the use of the approximation in (11) for the stationary distribution of N2V4. The choice is heuristic but the results show very good accuracy compared to the empirical distribution for this network scenario.
N2V5 ignores the attributes of nodes while sampling the network. We set \(\theta =10^{-3}\), \(\gamma = 0.1\) and \(\beta =1\), forcing the RW to explore non-common neighbors of the previous and currently visited nodes. The performance is worse than that of N2V4, with a decrease of the spectral gap and an increase of \(\Lambda (3)\) (Table 4). N2V6 is the version of N2V5 with attribute aware sampling. We now set \(\beta w_{ij} = 0.3\) if nodes have equal attributes and 1 otherwise as in N2V4 and keep the other parameters used in N2V5. There is an improvement in performance; however, its variability is similar to that of N2V4. In both N2V5 and 6, the stationary distributions used in the estimation are obtained through simulations.
Setting 2 (\(\alpha =1\)): We next consider the linear model class \(\mathscr {P}(1, \mu , f)\), where \(\mu\), f, N and the sampling rate are the same as in Setting 1. The boxplots of 500 estimates for each sampling scheme using (15) are given in Fig. 6. In this case, the performance of MHRW is worse due to the existence of high degree nodes, which tend to be avoided by MHRW, reducing the spectral gap. Note that high degree vertices increase “conductance” in the network (small world phenomenon) and hence avoiding them increases the mixing time of MHRW. For the variants of N2V the estimates for attributes 2 and 3 tend to be better. This can be explained by the homophily and preferential attachment in the model, which enable different types of attachment propensities as we now indicate. Nodes with the small-proportion attributes 2 and 3 will be mainly attracted by nodes with the same attribute. However, due to the preferential attachment, nodes with attributes of small proportions will also be partly attracted to the majority of nodes with attribute 1 (see Fig. 1b). Therefore, the variability in the estimation tends to be smaller for attributes with lower proportions. The ranking of the performance of sampling methods is the same as in the sublinear case.
Other settings, such as the presence of weak homophily and balanced attributes (i.e. the distribution of attributes in the network being uniform), will be investigated with real data.
Degree distribution per attribute
Setting 3 (\(\alpha =0.2\)): Fig. 7 depicts the boxplots of the estimation error \((\sum _k ({{\widehat{p}}} (k|a) - p(k|a))^2 )^{1/2}\) of the degree distribution per attribute for a sublinear network from 500 estimates under MHRW, N2V1 to 4, and baseline sampling methods. Since the stationary distributions of N2V5 and 6 are not known and the N2V5 and 6 performances approach N2V3 and 4, respectively, we omitted them in the plot. The number of nodes sampled is 0.2N and the parameters of N2V3 and 4 are the same as in Setting 1. N2V4 achieves the highest performance, especially for attributes 2 and 3 (even compared with RES), due to being attribute aware. We use its empirical stationary distribution and also check the approximation (11), which shows similar boxplots. On the other hand, MHRW performs poorly compared with the baseline RNS. The results for the variants of N2V are consistent with the estimation of the attribute distribution.
Homophily measures
Setting 4 (\(\alpha =1\)): The homophily measures are \(D_1=1.34\), \(D_2=3.44\), \(D_3=4.87\), \(H_{12}=0.28\), \(H_{13}=0.37\), \(H_{23}=0.56\). Figure 8 shows the estimates of the dyadicity and heterophilicity using N2V variants with known or approximate stationary distribution. The estimators in (20) involve ratios of several quantities, which are sensitive to small deviations. Thus a larger sample size 0.3N is used to reduce the variability. The other parameters are the same as in Setting 2. We have omitted MHRW from the plots since it has the worst performance, as well as the baseline RNS. For the heterophilicity measure, N2V4 achieves the lowest variability, followed by RES. We note that \({{\widehat{H}}}_{aa'}\) in (20) involves the estimation of the number of edges between different attribute nodes, which due to the reduced number of these connections is better estimated with N2V4 than RES.
Synthetic network with heterophily
Attribute distribution
Setting 6: We consider the model class \(\mathscr {P}(\alpha =1, \mu =(0.7,0.3,0.1), f)\) with \(f (a,a)=0.2\), \(f (a,a')=0.4\), \(a,a'=1,2,3\), \(a\ne a'\). The network size and sampling rate are the same as in the synthetic network with homophily (Settings 1 and 2). The network generated is heterophilic with measures \(D_1 =0.667\), \(D_2 =0.573\), \(D_3 =0.962\), \(H_{12}= 1.261\), \(H_{13}= 1.668\) and \(H_{23}= 1.378\). Figure 9 gives the estimates of the attribute distribution under several sampling schemes using (15). With heterophily, for attribute aware samplers the weights of edges between nodes with equal attributes are set higher than in the homophilic settings: for N2V2 and N2V4 the weights are \(w_{ij} =1\) if nodes have different attributes and 0.8 otherwise. The differences between the sampling methods are now smaller. In this case, even though most edges are heterophilic, the network also contains edges between nodes of the same attribute type (see Fig. 2b). This is especially true for nodes with attribute 1: locally each connects to few other nodes with attribute 1, but globally there are many connections between them. This mixing of different types of edges explains why the different sampling methods can all achieve high overall performance on heterophilic networks. The spectral gap of the random walks increases and the quantity \(\Lambda (\cdot )\) decreases, which also explains the results.
Empirical networks
We analyze four publicly available datasets of real attributed networks from different domains and with different homophily levels. Table 6 shows some key characteristics of interest. The Wikipedia dataset is a hyperlink network where nodes represent U.S. politicians with attributes as either male or female. The Blogs dataset is a network of political blogs from the 2004 U.S. election. Nodes represent blog pages and edges represent hyperlinks between them. Each blog has an attribute of either right-leaning or left-leaning. APS is a scientific network from the American Physical Society where nodes represent articles from two subfields and edges represent citations. Swarthmore is a university network with friendship links between users’ pages with attribute gender (male or female). The estimation of the quantities of interest below is replicated 500 times for each sampling scheme.
Attribute distribution
Figure 10 shows the results of estimation of the attribute distributions using (15) for all data sets. We investigate only the sampling methods with known (or approximately computable) stationary distributions. For N2V3 and 4, we use \(\theta =10^{-3}\), \(\gamma =\beta =1\) throughout this section. The sample size is 0.15N. Wikipedia has unbalanced attributes and moderate homophily. For N2V2 and 4 the weights are \(w_{ij}=0.75\) if nodes have equal attributes and 1 otherwise. MHRW again shows the worst performance on real data. Variants 3 and 4 of N2V present the lowest variability. Blogs is an approximately balanced attribute data set with significant homophily. The weights for the variants of N2V are \(w_{ij}=0.3\) if nodes have equal attributes and 1 otherwise. Due to the high density of edges (i.e., the fraction of existing edges out of all possible edges, \({\mathcal {E}}/{N \atopwithdelims ()2}\)) the performance of N2V3 is similar to N2V1. APS is an unbalanced attribute dataset with strong homophily. In this case \(w_{ij}=0.25\) if nodes have equal attributes and 1 otherwise. Swarthmore is a dataset which is very weakly homophilic. The differences between the sampling methods are less significant here, where we use \(w_{ij}=0.95\) if nodes have equal attributes and 1 otherwise. These empirical networks are heterogeneous with respect to homophily, complementing the settings considered in the synthetic case. In Antunes et al. (2023) we have estimated the attribute distribution of a Facebook webgraph dataset restricted to pages from four attributes (politicians, governmental organizations, television shows and companies) where edges represent mutual likes between sites.
Degree distribution per attribute
Figure 11 depicts the estimation error \((\sum _k ({{\widehat{p}}} (k|a) - p(k|a))^2 )^{1/2}\) of the degree distribution for each attribute of Wikipedia and APS. The parameters for the different sampling methods are the same as for the estimation of the attribute distribution, with sample size 0.2N. The degree distributions for both attributes are heavy-tailed in the two datasets. For the majority attributes the tail exponents are 2.823 and 3, respectively, for Wikipedia and Blogs. The error in the estimation decreases significantly with N2V4, especially for the minority attribute.
Homophily measures
The dyadicity and heterophilicity measures using (20) are given in Fig. 12 for Wikipedia and Blogs. Only N2V variants have been considered in the evaluation, with the same parameters as above and sample size 0.3N. The performance of the sampling methods for Wikipedia is in line with the synthetic model with discrete attribute set. The high density of edges in Blogs, as discussed above, explains the inferior performance of N2V3, similar to that of N2V1, especially in the estimation of \(H_{12}\).
Extensions and future directions
How to sample the network and set the sampling method parameters?
Here are some guidelines on how to sample and learn the attribute functionals of a network. If the homophily level is unknown (or even if it is not known whether the network is homophilic), the network should be sampled with N2V3 to estimate the dyadicity and heterophilicity measures. As seen from our experiments, the backtracking parameter should be close to zero and the other parameters equal to one. In the case that the sample indicates that the network is homophilic, we propose the following approach to set the initial edge weights of attribute aware samplers (N2V2 and N2V4) to estimate the attribute distribution (and additionally the degree distribution). If dyadicity is, say, greater than 1.5 and heterophilicity is less than 0.5, then set the weights to \(w_{ij}=0.3\) if nodes have equal attributes and 1 otherwise. For lower homophily levels, set \(w_{ij}=0.7\) if nodes have equal attributes. (In the case of N2V4, additionally the backtracking parameter should be close to zero.) As observed in the Section Synthetic Network with Homophily (Setting 1), the empirical standard deviation of the estimator of the attribute distribution as a function of the weights \(w_{ij}\) (when nodes have equal attributes) is convex. Thus, the weights can then be tuned in practice as follows, if feasible. (1) Fix the initial set of weights as described above and a minority attribute, and run n (say, greater than 10) independent attribute aware samplers for a number of steps and obtain the empirical standard deviation of the n estimates of the proportion of the minority attribute; (2) The weights of the n samplers are then increased (decreased) with increment \(\Delta\) and the samplers are run again to compute the empirical standard deviation; (3) The previous step is repeated until a turning point (minimum) of the empirical standard deviation is reached, and the corresponding “optimal” weight is output.
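Steps (1) to (3) amount to a one-dimensional search for the minimum of a convex curve. A minimal sketch follows, in which `run_estimate(w, run)` is a hypothetical user-supplied function that runs one attribute aware sampler with same-attribute weight w and returns the estimated proportion of the chosen minority attribute. The bounds `w_min` and `w_max` and the default number of runs are placeholders.

```python
import statistics

def tune_same_attribute_weight(run_estimate, w0, delta, n_runs=12,
                               w_min=0.05, w_max=1.0):
    """Search for the same-attribute weight minimising the empirical standard
    deviation of n_runs independent estimates, starting from w0 with step delta."""
    def spread(w):
        return statistics.stdev([run_estimate(w, run) for run in range(n_runs)])

    best_w, best_sd = w0, spread(w0)            # step (1)
    for step in (+delta, -delta):               # step (2): try increasing, then decreasing
        w = w0 + step
        while w_min <= w <= w_max:              # step (3): continue while the spread improves
            sd = spread(w)
            if sd >= best_sd:
                break
            best_w, best_sd = w, sd
            w += step
        if best_w != w0:
            break   # improved in this direction; by convexity the other cannot help
    return best_w, best_sd

# Toy stand-in for a sampler: estimates whose spread is smallest at w = 0.25.
def toy_estimate(w, run):
    return 0.1 + ((w - 0.25) ** 2 + 0.01) * (-1) ** run

print(tune_same_attribute_weight(toy_estimate, w0=0.3, delta=0.05))
```

With real samplers the empirical standard deviation is itself noisy, so a larger number of runs or a coarser step may be needed before the comparison between neighboring weights is reliable.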
Continuous attributes
The estimators (12) and (18) were defined for node and edge characteristics that are discrete. But they have natural continuous analogues. More specifically, in connection to (12), assume that the characteristic A(i) values are such that \(A\in {\mathbb {R}}^d\). Then, we expect the density g(A) to be estimated by the kernel smoothing as
where \(h>0\) is a bandwidth, \(K:{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) is a kernel function, and the weights \({{\widetilde{w}}}_s\) satisfy
For the density g(a) of continuous attributes \(a(i)\in {\mathbb {R}}\), the estimator (22) was explored briefly in synthetic and real networks in our conference paper Antunes et al. (2023b).
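A one-dimensional version of such a kernel estimator can be sketched as follows. The sketch assumes a Gaussian kernel and weights proportional to \(1/\pi _{i_s}\) normalized to sum to one, by analogy with the discrete ratio estimator. Both are assumptions made here for illustration and should be checked against (22) and (23).

```python
import math

def weighted_kde(sample_values, sample_pi, h):
    """Return a function x -> estimated density at x.

    sample_values: attribute values a(i_s) of the visited nodes
    sample_pi:     stationary probabilities pi_{i_s}, up to a constant factor
    h:             bandwidth (h > 0)
    """
    inv = [1.0 / p for p in sample_pi]
    total = sum(inv)
    weights = [x / total for x in inv]          # assumed form of the weights
    norm = h * math.sqrt(2.0 * math.pi)

    def g_hat(x):
        return sum(w * math.exp(-0.5 * ((x - a) / h) ** 2) / norm
                   for w, a in zip(weights, sample_values))
    return g_hat

# Two visited nodes; the second is three times more likely to be sampled,
# so it receives a third of the weight of the first.
g = weighted_kde([0.0, 10.0], [1.0, 3.0], h=1.0)
print(g(0.0), g(10.0))
```

The resulting estimate integrates to one by construction, since the weights sum to one and the kernel is a probability density.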
Similarly, the continuous analogue of (18) is
where K and h are as in (22), and the weights \({{\widetilde{w}}}_{s,s+1}\) satisfy
Exploring (22) and (24) further is left for future work. For the attribute aware samplers the weights can be taken as \(w_{ij} = |a(i) - a(j)|^{b}\), which allows moving between similar attribute values of nodes while giving more weight to edges with different values. The choice of b is motivated by similar arguments as in the case of discrete attributes. If the weights of edges between different groups are too large, then the convergence is decelerated because exploration within the same attribute group is not sufficient due to the inter-group moves.
Future directions
We expect to show that for the parameter \(m \ge 2\) in model class \(\mathscr {P}\), the networks are ‘expanders’ in the sense that the mixing time of RWs on the network is of a much smaller order than the network size (typically logarithmic in network size) Mihail et al. (2003); Ben-Hamou et al. (2018). This would indicate that, although explicitly finding the stationary distribution is infeasible in most cases (e.g. in N2V4,5,6 discussed above), it can be approximated by observing the RW for a relatively small number of steps. A description of the local limits of neighborhoods of typical vertices in the network Berger et al. (2014); Garavaglia et al. (2022); Banerjee et al. (2023) will then provide tractable recursive distributional equations (e.g. Chen et al. (2017) for Pagerank distribution) characterizing the limiting empirical stationary distribution of the RW (as the network size grows). This representation can be exploited to analyze detailed behavior of this limiting distribution including tail exponents, means, etc.
Random walks are also closely tied to ranking mechanisms such as the Pagerank centrality, and we plan to study the impact of the parameters driving the random walk on such centrality scores, thus looping back to one of the central motivations for studying attributed networks namely fairness of ranking mechanisms Karimi et al. (2018). Other questions, including learning joint distributions of the multivariate attribute distributions, both in terms of developing synthetic models, as well as real world data will also be considered. We considered simple time snapshots of the network process, without directionality information, for estimation in this work, but in future work it will be interesting to exploit the temporality and directionality in network data. Finally, there has been significant recent interest in incorporating higher order interactions (network data and models largely hinge on binary or pairwise interactions) in the evolution of networks and the impact of dynamics such as percolation and epidemics resulting from such interactions Courtney and Bianconi (2017); Battiston et al. (2020); Majhi et al. (2022); Iacopini et al. (2019); Fan et al. (2022); Sun et al. (2023). Exploring versions of such questions incorporating attribute information suggests fascinating new directions of research.
Conclusions
In this paper, we developed a statistical framework for learning attribute functionals through sampling in networks with homophily. First, we proposed a generalization of the preferential attachment model with homophily (model class \(\mathscr {P}\)). We described a related model (model class \(\mathscr {U}\)) that is significantly more amenable to analysis, formalizing the notion of resolvability, which provides explicit information (degree distribution of an attribute, homophily and heterophily statistics) for model class \(\mathscr {P}\) by using model class \(\mathscr {U}\). Second, we introduced link trace samplers (random walks) with weights for networks with restricted access that explore the attribute space better (attribute aware). Third, estimators that correct the bias of the considered sampling methods were proposed for the several attribute and geometric quantities of interest. Fourth, we showed experimental results for synthetic (using model class \(\mathscr {P}\)) and a variety of real world datasets, demonstrating that attribute aware samplers are more efficient and outperform attribute agnostic random walk samplers for several network settings. Finally, we presented extensions of the developed framework including continuous attributes and directions for future work.
Availability of data and materials
All data are publicly available on GitHub at GESIS - Leibniz Institute for the Social Sciences: https://github.com/orgs/gesiscss/repositories.
References
Aldous D, Fill JA (2002) Reversible Markov chains and random walks on graphs. Unfinished monograph, recompiled 2014. http://www.stat.berkeley.edu/~aldous/RWG/book.html
Antunes N, Banerjee S, Bhamidi S, Pipiras V (2023a) Attribute network models, stochastic approximation, and network sampling and ranking. Preprint arXiv:2304.08565v1
Antunes N, Bhamidi S, Pipiras V (2023b) Learning attribute distributions through random walks. In: Cherifi H, Mantegna RN, Rocha LM, Cherifi C, Micciche S (eds) Complex networks and their applications XI. Springer, Cham, pp 17–29
Antunes N, Bhamidi S, Guo T, Pipiras V, Wang B (2021a) Sampling based estimation of in-degree distribution for directed complex networks. J Comput Gr Stat 30(4):863–876
Antunes N, Guo T, Pipiras V (2021b) Sampling methods and estimation of triangle count distributions in large networks. Netw Sci 9:134–156
Alon N, Benjamini I, Lubetzky E, Sodin S (2007) Non-backtracking random walks mix faster. Commun Contemp Math 09(04):585–603
Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Ben-Hamou A, Lubetzky E, Peres Y (2018) Comparing mixing times on sparse random graphs. In: Proceedings of the twenty-ninth annual ACM-SIAM symposium on discrete algorithms, pp 1734–1740. SIAM
Baroni A, Conte A, Patrignani M, Ruggieri S (2017) Efficiently clustering very large attributed graphs. In: 2017 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM), pp 369–376
Berahmand K, Mohammadi M, Saberi-Movahed F, Li Y, Xu Y (2022) Graph regularized nonnegative matrix factorization for community detection in attributed networks. IEEE Trans Netw Sci Eng
Banerjee S, Deka P, Olvera-Cravioto M (2023) Local weak limits for collapsed branching processes with random out-degrees. arXiv preprint arXiv:2302.00562
Battiston F, Cencetti G, Iacopini I, Latora V, Lucas M, Patania A, Young JG, Petri G (2020) Networks beyond pairwise interactions: structure and dynamics. Phys Rep 874:1–92
Berger N, Borgs C, Chayes JT, Saberi A (2014) Asymptotic behavior and distributional limits of preferential attachment graphs. Ann Probab 42(1):1–40
Bianconi G, Barabási AL (2001) Bose-Einstein condensation in complex networks. Phys Rev Lett 86(24):5632
Chang CH, Chang CS, Chang CT, Lee DS, Lu PE (2019) Exponentially twisted sampling for centrality analysis and community detection in attributed networks. IEEE Trans Netw Sci Eng 6(4):684–697
Chen N, Litvak N, Olvera-Cravioto M (2017) Generalized pagerank on directed configuration networks. Random Struct Algorithms 51(2):237–274
Courtney OT, Bianconi G (2017) Weighted growing simplicial complexes. Phys Rev E 95(6):062301
de Almeida ML, Mendes GA, Madras Viswanathan G, da Silva LR (2013) Scale-free homophilic network. Eur Phys J B 86(2):38
Espín-Noboa L, Karimi F, Ribeiro B, Lerman K, Wagner C (2021) Explaining classification performance and bias via network structure and sampling technique. Appl Netw Sci 6(79)
Espín-Noboa L, Wagner C, Strohmaier M, Karimi F (2022) Inequality and inequity in network-based ranking and recommendation algorithms. Sci Rep 12(1), 2012
Fan H, Zhong Y, Zeng G, Sun L (2021) Attributed network representation learning via improved graph attention with robust negative sampling. Appl Intell 51(1):416–426
Fan J, Yin Q, Xia C, Perc M (2022) Epidemics on multilayer simplicial complexes. Proc R Soc A 478(2261):20220059
Flaxman AD, Frieze AM, Vera J (2007) A geometric preferential attachment model of networks II. Internet Math 4(1):87–111
Garavaglia A, Hazra RS, van der Hofstad R, Ray R (2022) Universality of the local limit of preferential attachment models. arXiv preprint arXiv:2212.05551
Grover A, Leskovec J (2016) Node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’16, pp 855–864. Association for Computing Machinery, New York, NY, USA
Iacopini I, Petri G, Barrat A, Latora V (2019) Simplicial models of social contagion. Nat Commun 10(1):2485
Jordan J (2013) Geometric preferential attachment in non-uniform metric spaces. Electron J Probab 18:1–15
Karimi F, Génois M, Wagner C, Singer P, Strohmaier M (2018) Homophily influences ranking of minorities in social networks. Sci Rep 8(1):11077
Krapivsky PL, Redner S (2001) Organization of growing random networks. Phys Rev E 63(6):066123
Kumar S, Sundaram H (2021) Attribute-guided network sampling mechanisms. ACM Trans Knowl Discov Data 15(4)
Kolaczyk ED (2009) Statistical analysis of network data. Springer, New York
Lee DJL, Han J, Chambourova D, Kumar R (2017) Identifying fashion accounts in social networks. In: Proceedings of the KDD workshop on ML meets fashion
Majhi S, Perc M, Ghosh D (2022) Dynamics on higher-order networks: a review. J R Soc Interface 19(188):20220043
Meng L, Masuda N (2020) Analysis of node2vec random walks on networks. Proc R Soc A Math Phys Eng Sci 476(2243):20200447
McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: homophily in social networks. Annu Rev Sociol 27(1):415–444
Mihail M, Papadimitriou C, Saberi A (2003) On certain connectivity properties of the internet topology. In: 44th annual IEEE symposium on foundations of computer science, 2003. Proceedings, pp 28–35. IEEE
Mislove A, Viswanath B, Gummadi KP, Druschel P (2010) You are who you know: inferring user profiles in online social networks. In: Proceedings of the third ACM international conference on web search and data mining, pp 251–260
Nasiri E, Berahmand K, Li Y (2023) Robust graph regularization nonnegative matrix factorization for link prediction in attributed networks. Multimedia Tools Appl 82(3):3745–3768
Park J, Barabási AL (2007) Distribution of node characteristics in complex networks. Proc Natl Acad Sci 104(46):17916–17920
Shrum W, Cheek Jr NH, Mac DS (1988) Friendship in school: gender and racial homophily. Sociol Educ 2:227–239
Sun H, Radicchi F, Kurths J, Bianconi G (2023) The dynamic nature of percolation on networks with triadic interactions. Nat Commun 14(1):1308
Wagner C, Singer P, Karimi F, Pfeffer J, Strohmaier M (2017) Sampling from social networks with attributes. In: Proceedings of the 26th international conference on World Wide Web. WWW ’17, pp 1181–1190, Republic and Canton of Geneva, CHE
Acknowledgements
The authors would like to thank the editors and two anonymous reviewers for their comments that led to significant improvements in the paper.
Funding
S. Banerjee is partially supported by the NSF CAREER award DMS-2141621. S. Bhamidi and V. Pipiras are partially supported by NSF DMS-2113662. S. Banerjee, S. Bhamidi and V. Pipiras are partially supported by NSF RTG grant DMS-2134107.
Author information
Contributions
All authors contributed equally to the manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Antunes, N., Banerjee, S., Bhamidi, S. et al. Learning attribute and homophily measures through random walks. Appl Netw Sci 8, 39 (2023). https://doi.org/10.1007/s41109-023-00558-3
DOI: https://doi.org/10.1007/s41109-023-00558-3