The main focus of our research is a thorough analysis of costs that are associated with RV, RN, and the other sampling methods that we will introduce. In our introduction we mentioned the most obvious cost, sampling a vertex. In many contexts this cost would be equivalent for sampling a vertex from the graph and for sampling a vertex from the neighbors of an already sampled vertex, and we will in fact make this assumption in some of our analyses. However, we suggest that this may not always be the case. Sampling a neighbor may be less expensive, as the set from which the neighbor will be sampled is smaller. Or, perhaps there is a privacy concern related to learning connections that would apply only to sampling a neighbor, which would make sampling a neighbor more expensive. We therefore generalize the sampling costs to two distinct costs: \(C_v\), the cost of sampling a vertex, and \(C_n\), the cost of sampling a neighbor.
Critical \(C_n\)
Let us fix \(C_v=1\) so that we are expressing both \(C_n\) and total cost in terms of \(C_v\). We seek a ‘critical \(C_n\)’ value (\(CC_n\)), that is, a value of \(C_n\) at which RV and RN perform equally well in light of their associated costs. Knowledge of such a value for a specific graph would allow a proper evaluation of whether or not RN should be used instead of RV. \(CC_n\) is, ultimately, a measure of RN’s superiority over RV: the higher the degree of a selected neighbor compared to that of a sampled vertex, the more one would be willing to pay in order to sample the neighbor. Following the same logic, finding some scenario where \(CC_n < 0\) would indicate that (somehow) \(RV > RN\).
\(CC_n\) for expected degree
Obviously, in order to quantify \(CC_n\), we first need to define the metric of success used to evaluate the respective performances of RV and RN. We will first focus on the expected degree of a vertex/neighbor selected by each. We have fixed \(C_v=1\), so a vertex selected by RV, with its corresponding expected degree, requires one cost unit. A neighbor selected by RN costs \(C_v + C_n = 1 + C_n\). Therefore, the \(CC_n\) value that equates the two methods in terms of cost for their respective expected degrees can be calculated as follows:
$$\begin{aligned} RV&= \frac{RN}{1+CC_n} \\ RV+RV(CC_n)&=RN \\ CC_n&= \frac{RN}{RV} - 1 \end{aligned}$$
(2)
There is a strong intuition to this expression. The ratio \(\frac{RN}{RV}\) should capture how much more someone would be willing to pay for a neighbor over a vertex. Also notice that, if \(\frac{RN}{RV} < 2\), \(CC_n < 1\). This means that if \(C_v = C_n\), which would arguably be our natural assumption in most scenarios, sampling with RN would only be preferred to sampling with RV if RN is twice as strong as RV for the desired metric. Otherwise, a more robust cost model would be required in order to justify the intuitive appeal of RN sampling.
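Eq. 2 is straightforward to compute directly when the full graph is available. The following sketch assumes the graph is given as a plain adjacency-list dictionary; the function names are our own illustration, not part of any library.

```python
from statistics import mean

def expected_degree_rv(adj):
    """Expected degree of a uniformly sampled vertex (RV)."""
    return mean(len(nbrs) for nbrs in adj.values())

def expected_degree_rn(adj):
    """Expected degree under RN: a uniform vertex, then a uniform neighbor of it."""
    return mean(mean(len(adj[u]) for u in nbrs) for nbrs in adj.values())

def critical_cn(adj):
    """CC_n = RN/RV - 1 (Eq. 2), with C_v fixed at 1."""
    return expected_degree_rn(adj) / expected_degree_rv(adj) - 1

# A 5-vertex star: hub 0 with leaves 1..4.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(critical_cn(star))  # approx. 1.125 (RN = 3.4, RV = 1.6)
```

On the star, RN is just over twice as strong as RV, so by the observation above RN would (barely) be worth using even at \(C_n = C_v\).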
\(CC_n\) in canonical graphs
We will apply Eq. 2 to a few famous graph types.
d-regular Graph In any perfectly assortative graph RN reduces to RV and \(CC_n=0\). The intuition is obvious. In a graph where RN offers no advantage, any positive cost would be wasted.
Star Graph RV in a star graph is equal to \(2(n-1)/n\) and RN is equal to \(((n-1)^2+1)/n\), so
$$\begin{aligned} CC_n = \frac{1+(n-1)^2}{2(n-1)} - 1 \end{aligned}$$
(3)
As \(n\) increases, \(CC_n\) grows as \((n-1)/2\), linearly in \(n\) and therefore on the same order as the degree of the hub. This linear growth reflects the high \(C_n\) price worth paying to exchange the leaf vertex one would initially sample with high probability for the hub.
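The linear growth can be seen by expanding Eq. 3 directly:

```latex
CC_n = \frac{1+(n-1)^2}{2(n-1)} - 1
     = \frac{n-1}{2} + \frac{1}{2(n-1)} - 1
     \;\sim\; \frac{n}{2} \qquad (n \to \infty)
```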
Complete bipartite graph Assume we have a complete bipartite graph with sides of sizes L and R. All vertices on the L side have degree R, and all vertices on the R side have degree L. \(RV=(LR + RL)/(L+R)\), and \(RN=(L^2+R^2)/(L+R)\). Therefore, in a complete bipartite graph, \(CC_n=(L^2+R^2)/(2LR) - 1\).
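The closed form for \(K_{L,R}\) is a one-liner; the function name below is our own.

```python
def critical_cn_bipartite(L, R):
    """CC_n for the complete bipartite graph K_{L,R}: (L^2 + R^2)/(2LR) - 1."""
    return (L * L + R * R) / (2 * L * R) - 1

# L == R gives an R-regular graph, so CC_n = 0, matching the d-regular case;
# K_{1,n-1} is the star graph, recovering Eq. 3.
```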
\(CC_n\) for different sampling amounts and results
Our second exploration of \(CC_n\) will define it as a function either of how many samples are taken or of some desired result. Importantly, this means that these versions of \(CC_n\) will not be fixed for a given graph. This is an important use of \(CC_n\) because it demonstrates how RN’s value can fluctuate even on the same graph, depending on how long sampling runs or what the desired outcome is. For these analyses we will define three metrics to quantify the success of a sampling method:
1. Total Degrees—We repeatedly sample vertices from a graph with replacement and track the sum of the degrees of all selected vertices. \(CC_n\) for this metric should converge on the \(CC_n\) value based on expected degree defined in Eq. 2.

2. Total Unique Degrees—We repeatedly sample vertices from a graph with replacement and track the sum of the degrees of any new vertices selected. Here we will present resulting values as a percentage of the sum of all degrees in the graph.

3. Max Degree—We repeatedly sample vertices from a graph with replacement and track the maximum degree vertex selected. Here we will present resulting degree values as a percentage of the max-degree vertex in the graph.
The second metric corrects for the inclusion of duplicates in the first metric. If our goal is to immunize a network, for example, we probably can’t take credit for inoculating the same vertex twice. We include the first metric mostly for comparison, but we still suggest it might be useful in some scenarios. For example, in a situation where the goal of sampling is information dissemination, our goal would be to reach as many high-degree vertices as possible in order to spread the information to their neighbors. But we might still appreciate selecting the same vertex multiple times as each selection reiterates the importance of the information and therefore increases the likelihood of it being shared.
In order to calculate \(CC_n\) as a function of sampling iterations, let RV(i) and RN(i) be the resulting values, according to one of the success metrics, of selecting i vertices with RV and with RN respectively. The cost of i vertices selected with RV is i and the cost of i neighbors selected with RN is \(i(1+C_n)\). Therefore, for any i, we can calculate \(CC_n(i)\) as follows:
$$\begin{aligned} \frac{RV(i)}{i}&= \frac{RN(i)}{i(1+CC_n(i))} \\ CC_n(i)&= \frac{RN(i)}{RV(i)} - 1 \end{aligned}$$
(4)
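Eq. 4 lends itself to Monte Carlo estimation. The sketch below assumes vertices are chosen uniformly and a neighbor is chosen uniformly from the first vertex's adjacency list; all names are illustrative, and each of the three metrics above is passed in as a function.

```python
import random

def cc_n_of_i(adj, i, metric, trials=2000, seed=0):
    """Estimate CC_n(i) = RN(i)/RV(i) - 1 (Eq. 4) by averaging over trials."""
    rng = random.Random(seed)
    verts = list(adj)
    rv_total = rn_total = 0.0
    for _ in range(trials):
        rv_sample = [rng.choice(verts) for _ in range(i)]                   # RV: uniform vertices
        rn_sample = [rng.choice(adj[rng.choice(verts)]) for _ in range(i)]  # RN: uniform neighbor
        rv_total += metric(adj, rv_sample)
        rn_total += metric(adj, rn_sample)
    return rn_total / rv_total - 1

def total_degrees(adj, sample):          # metric 1
    return sum(len(adj[v]) for v in sample)

def total_unique_degrees(adj, sample):   # metric 2
    return sum(len(adj[v]) for v in set(sample))

def max_degree(adj, sample):             # metric 3
    return max(len(adj[v]) for v in sample)
```

For the total-degrees metric, the estimate should converge toward the fixed value of Eq. 2 as the number of trials grows.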
To calculate \(CC_n\) as a function of resulting values, assume some resulting value V requires i sampling iterations of RV and j sampling iterations of RN, or \(V=RV(i)=RN(j)\). For this value V, we can calculate \(CC_n(V)\) as
$$\begin{aligned} \frac{V}{i}&= \frac{V}{j(1+CC_n(V))} \\ CC_n(V)&= \frac{i}{j} - 1 \end{aligned}$$
(5)
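Eq. 5 can likewise be estimated by simulation. A sketch for the max-degree metric (helper names are our own), averaging the iterations each method needs before it first sees a vertex of at least the target degree:

```python
import random

def iters_until_degree(adj, target, use_rn, rng, cap=100_000):
    """Iterations until a sampled vertex (RV) or neighbor (RN) has degree >= target."""
    verts = list(adj)
    for i in range(1, cap + 1):
        v = rng.choice(verts)
        if use_rn:
            v = rng.choice(adj[v])  # RN: replace the vertex with a uniform neighbor
        if len(adj[v]) >= target:
            return i
    return cap

def cc_n_of_value(adj, target, trials=2000, seed=1):
    """Estimate CC_n(V) = i/j - 1 (Eq. 5), where i and j are the mean RV and RN
    iteration counts needed to reach a max degree of `target`."""
    rng = random.Random(seed)
    i = sum(iters_until_degree(adj, target, False, rng) for _ in range(trials)) / trials
    j = sum(iters_until_degree(adj, target, True, rng) for _ in range(trials)) / trials
    return i / j - 1
```

On the 5-vertex star with target degree 4, RV needs 5 draws on average while RN needs 1.25, so \(CC_n(V) \approx 3\): reaching the hub is worth paying roughly three extra cost units per sample.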
We experimented with ER and BA graphs with varying parameters as well as the graphs of multiple real-world networks taken from the Koblenz online collection (Kunegis 2013). Figure 3 shows results from some of the experiments on BA graphs. These results were fairly typical for ER and real-world graphs as well.
The results for total degrees correlated with \(RN/RV - 1\) as predicted. For the other two measures, the results are somewhat more interesting. The bottom two charts measure success by max degree. The first chart shows calculated values of \(CC_n\) for samples taken. \(CC_n\) starts off high, because RN gives a higher maximum degree than RV. However, as we continue to take samples, RV will eventually find the max-degree vertex in the graph, and any further sampling for either method accomplishes nothing. This explains the (roughly) monotonically decreasing values of \(CC_n\), ultimately converging on 0. The second chart plots \(CC_n\) as a function of the percent of the maximum degree vertex being sought. This plot is noisier because sampling for a max degree vertex will not normalize as easily with repeated experiments, but the generally increasing nature of the function shows that RN has more relative value compared to RV as the degree of the maximum degree vertex being sought increases.
The middle charts are measuring the sum of all unique degrees accumulated. The chart on the left sees \(CC_n\) steadily decrease as the hubs have already been selected and RN’s value is diminishing. Interestingly, \(CC_n\) is actually negative for a range of values. This corresponds to the point where RN is continuing to sample hubs that have already been selected and therefore has no value, but RV is still finding new leaves. In this range \(RV > RN\), which explains the negative \(CC_n\) value. Then RV also finds all of the vertices it will find and \(CC_n\) levels out at 0: neither method offers any advantage. The chart on the right shows a roughly monotonically decreasing \(CC_n\). As we seek a higher and higher percentage of the total degrees in the graph, RN’s value over RV decreases because of its failure to find leaves. Eventually, when enough of the degrees are being sought, \(CC_n\) becomes negative because of RN’s difficulty finding leaves while RV is continuing to sample all vertices with uniform probability.