The risk of node re-identification in labeled social graphs

Real network datasets provide significant benefits for understanding phenomena such as information diffusion or network evolution. Yet the privacy risks raised by sharing real graph datasets, even when stripped of user identity information, are significant. When nodes have associated attributes, the privacy risks increase. In this paper we quantitatively study the impact of binary node attributes on node privacy by employing machine-learning-based re-identification attacks and exploring the interplay between graph topology and attribute placement. We also analyze the risk to anonymity in epidemic networks subject to different node re-identification attacks. Our experiments show that the population's diversity on the binary attribute consistently degrades anonymity. More interestingly, we show that similarly diverse populations in the SI epidemic model maintain different levels of anonymity under different infection rates.


INTRODUCTION
Real graph datasets are fundamental to understanding a variety of phenomena, such as epidemics, adoption of behavior, crowd management and political uprisings. At the same time, many such datasets capturing computer-mediated social interactions are recorded nowadays by individual researchers or by organizations. However, while the need for real social graphs and the supply of such datasets are well established, the flow of data from data owners to researchers is significantly hampered by serious privacy risks: even when humans' identities are removed, studies have proven repeatedly that de-anonymization is doable with high success rate [18,22,30,43]. Such de-anonymization techniques reconstruct user identities using third-party public data and the graph structure of the naively anonymized social network: specifically, the information about one's social ties, even without the particularities of the individual nodes, is sufficient to re-identify individuals.
Many anonymization methods have been proposed to mitigate the privacy invasion of individuals from the public release of graph data [21]. Naive anonymization schemes employ methods to scrub identities of nodes without modifying the graph structure. Structural anonymization methods change the topology of the original graph while attempting to preserve (at least some of) the original graph characteristics [25,26,36]. Often the utility of an anonymized graph depends not only on preserving essential graph properties of the original graph, but also node attributes such as labels that identify nodes as cheaters or noncheaters in online gaming platforms [4].
However, the effects of node attributes on the risks of re-identifications are not yet well understood. While intuitively any extra piece of information can be a danger to privacy, a rigorous understanding of what topological and attribute properties affect the re-identification risks is needed. In cases such as information dissemination, node attributes may be informed by the local graph topology. How does the interplay between topology and node attributes affect node privacy?
Our work assesses the additional vulnerability to re-identification attacks posed by the attributes of a labeled graph. We consider exactly one binary attribute to understand the lower bound of the damage that node attributes inflict. We focus our empirical study on the interplay between topology and labeling as a leverage point for re-identification. Because our focus is to understand in which conditions node re-identification is feasible, this study is independent of any anonymization technique. We apply machine learning techniques that use both topological and attribute information to re-identify nodes based on a common threat model. Our study involves real-world graphs and synthetic graphs in which we control how labels are placed relative to ties to mimic the ubiquitous phenomenon of homophily found in social graphs [28].
Our empirical results show that the vulnerability to node re-identification depends on the population diversity with respect to the attribute considered. Using information about the distribution of labels in a node's neighborhood provides additional leverage for the re-identification process, even when labels are rudimentary. Furthermore, we quantify the relative importance of attribute-related and topological features in graphs of different characteristics.
The remainder of this paper is organized as follows. Section 2 outlines the related work. The system model to quantify anonymity is presented in Section 3. Section 4 describes the characteristics of the datasets we used in our empirical investigations. We present our results in Section 5 and discuss our contributions in Section 6.

RELATED WORK
The availability of auxiliary data helps reveal the true identities of anonymized individuals, as proven empirically in large privacy violation incidents [7,23]. Similarly, in the case of graph de-anonymization attacks, information from an auxiliary graph is used to re-identify the nodes in an anonymized graph [31]. The quality of such an attack is determined by the rate of correct re-identification of the original nodes in the network. In general, de-anonymization attacks harness structural characteristics of nodes that uniquely distinguish them [21]. Many such attacks can be categorized into seed-based and seed-free, based on the prior seed knowledge available to an attacker [21].
In seed-based attacks, sybil nodes [3] or some known mappings of nodes in an auxiliary graph aid the re-identification of anonymized nodes [18,19,22,30,43]. The effectiveness of such attacks is influenced by the quality of the seeds [39].
In seed-free attacks, the problem of deanonymization is usually modeled as a graph matching problem. Several research efforts have proposed statistical models for the re-identification of nodes without relying on seeds, such as the Bayesian model [33] or optimization models [16,17]. Many heuristics are used in the propagation process of re-identification, exploiting graph characteristics such as degree [8], k-hop neighborhood [48], linkage-covariance [2], eccentricity [31], or community [32].
Recently, there have been efforts to incorporate node attribute information into deanonymization attacks. Gong et al. [6] evaluate the combination of structural and attribute information on link prediction models. Attributes not present may be inferred through prior knowledge and network homophily. Qian et al. [35] apply link prediction and attribute inference to deanonymization by quantifying the prior background information of an attacker using knowledge graphs. In knowledge graphs, edges represent not only links between nodes but also node-attribute links and link relationships among attributes. The deanonymization attack in [14] maps node-attribute links between an anonymized graph and its auxiliary graph. In addition to structural similarity, nodes are matched by attribute difference: the union of the node's attributes in the anonymized and auxiliary graphs divided by their intersection.
However, the success rate of a de-anonymization process is often reported in the literature as dependent on the chosen heuristic of the attack, which is typically designed with knowledge of the anonymization technique. Comparing the strengths of different anonymization techniques thus becomes challenging, if not impossible. Recently, Sharad [39] proposed a general threat model to measure the quality of a deanonymization attack which is independent of the anonymization scheme. He proposed a machine learning framework to benchmark perturbation-based graph anonymization schemes. This framework explores the hidden invariants and similarities to re-identify nodes in the anonymized graphs [40,41]. Importantly, this framework can be easily tuned to model various types of attacks.
Several researchers propose theoretical frameworks to examine how vulnerable or deanonymizable any (anonymized) graph dataset is, given its structure [15,16,20,34]. However, some techniques are based on Erdős-Rényi (ER) models [34], while others make impractical assumptions about the seed knowledge [15]. Ji et al. [20] also introduced a configuration model to quantify the deanonymizability of graph datasets by considering the topological importance of nodes. The same set of authors analyzed the impact of attributes on graph data anonymity [14]. They show a significant loss of anonymity when more node-attribute relations are shared between anonymized and auxiliary graph data. Specifically, they measure the entropy present in node-attribute mappings available to an attacker. As the entropy decreases, the graph loses node anonymity.
The main aspects distinguishing this study from existing works are as follows: i) In our work, we study the inherent conditions in graphs that provide resistance/vulnerability to a general node re-identification attack based on machine learning techniques. ii) To the best of our knowledge, this is the first work that quantifies the privacy impact of node attributes under an attribute attachment model biased towards homophily. iii) We analyze the interplay between the intrinsic vulnerability of the graph structure and attribute information.

METHODOLOGY
Our main objective is to quantitatively estimate the vulnerability to re-identification attacks added by node attributes. In particular, we ask: Given a graph topology, how much better does a node reidentification attack perform when the node attributes are included in the attack compared to when there is no node attribute information available to the attacker?
We are interested in measuring the intrinsic vulnerability of a graph with attributes on nodes, in the absence of any particular anonymization technique on topology or node attributes. The intuition is that particular graphs are inherently more private: for example, in a regular graph, nodes are structurally indistinguishable. Adding attributes to nodes, however, may contribute extra information that could make the re-identification attack more successful. As another example, in a highly disassortative network (such as a sexual relationships network), knowing the attribute values (i.e., gender) of a few nodes will quickly lead to correctly inferring the attribute values of the majority of nodes, thus possibly contributing to the re-identification of more nodes. Thus, the questions we address in this study also include: How does the distribution of node attributes affect the intrinsic vulnerability to a re-identification attack of a labeled graph topology?
To answer these questions, we developed a machine-learning-based re-identification attack inspired by that presented in [39]. We use the same threat model (Section 3.1) that aims at finding a bijective mapping between nodes in two different graphs. We mount a machine-learning-based attack (Section 3.2), in which the algorithm learns the correct mapping between some pairs of nodes from the two graphs, and estimates the mapping of the rest of the dataset. As input data, we use both real and synthetic datasets (as presented in Section 4).

The Threat Model
The threat model we consider is the classical threat model in this context [34]: The attacker aims to match nodes from two networks whose edge sets are correlated. We assume each node is associated with a binary valued attribute, and this attribute is publicly available. Common examples of such attributes are gender, professional level (i.e., junior or senior), or education level (i.e., higher education or not).
For clarity, consider the following example: an attacker has access to two networks of individuals in an organization that represent the communication patterns (e.g., email) and friendship information available from an online social network. Individuals in the communication network are described by professional seniority (e.g., junior or senior), while individuals in the friendship network are described by gender. These graphs are structurally overlapping, in that some individuals are present in both graphs, even if their identities have been removed. The attacker's task is to find a bijective mapping between the two subsets of nodes in the two graphs that correspond to the individuals present in both networks.

Machine Learning Attack
We assume that the adversary has a sanitized graph $G_{san}$ that could be associated with an auxiliary graph $G_{aux}$ for the re-identification attack (as depicted in Figure 1). As in the scenario discussed above, $G_{san}$ could be the communication network, while $G_{aux}$ is the friendship network of a set of individuals in an organization.
In order to model this scenario using real data, we split a real dataset graph $G = (V, E)$ into two subgraphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ whose node sets overlap in $V_\alpha = V_1 \cap V_2$. The fraction of the overlap α is measured by the Jaccard coefficient of the two node sets: $\alpha = \frac{|V_1 \cap V_2|}{|V_1 \cup V_2|}$. In the shared subgraph induced by the nodes in $V_\alpha$, nodes will preserve their edges with nodes from $V_\alpha$ but might have different edges to nodes that are part of only one of the two subgraphs. In an optimistic scenario, an attacker has access to a part of the original graph (e.g., $G_1$) as auxiliary data and to an unperturbed subgraph (e.g., $G_2$) as the sanitized data whose nodes the attacker wants to re-identify. We use $G_1$ and $G_2$ as baseline graphs to measure the impact of attributes on the de-anonymizability of network data. It is also possible to split $G_1$ and $G_2$ recursively into multiple overlapping graphs, maintaining the same values of the overlap parameters as above. This allows us to assess the feasibility of the de-anonymization process for large networks by significantly reducing the size of $G_1$ and $G_2$.
The resulting graphs are now the equivalent of the email/friendship networks we used as an example above. The overlap is the knowledge repository that the attacker uses for de-anonymization [11]. Part of this knowledge will be made available to the machine learning algorithms.
Previous work shows that the larger α, the more successful the attack. However, the relative success of attacks under different anonymization schemes is observed to be independent of α [39]. In order to experiment with a homogeneous attack, we set α = 0.2 and build $V_\alpha$ from a breadth-first-search tree rooted at the highest-degree node (BFS-HD) in $G$. While other alternatives are certainly possible, we chose this approach for two reasons. First, it appears that the threat model we used is quite sensitive to the sampling process when generating $G_1$ and $G_2$ [34]. To avoid sampling bias, we chose a BFS-HD split to have a deterministic set of nodes in $V_\alpha$. Second, we empirically found that BFS-HD provides the maximally informed seeds for an adversary to propagate the re-identification process, thus providing a best-case scenario for the attacker.
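The BFS-HD split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is an adjacency dictionary, ties for the highest degree are broken by the smallest node id, and the nodes outside the overlap are simply alternated between the two subgraphs (the text does not specify how they are distributed). The function names are hypothetical.

```python
from collections import deque


def bfs_hd_overlap(adj, alpha):
    """Collect about alpha * |V| nodes by BFS from the highest-degree node."""
    target = max(1, round(alpha * len(adj)))
    # Deterministic root: highest degree, ties broken by smallest node id.
    start = max(sorted(adj), key=lambda u: len(adj[u]))
    seen, order, queue = {start}, [start], deque([start])
    while queue and len(order) < target:
        u = queue.popleft()
        for v in sorted(adj[u]):  # sorted for a deterministic visit order
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
                if len(order) == target:
                    break
    # May return fewer nodes if the root's component is smaller than target.
    return set(order)


def induced(adj, nodes):
    """Subgraph induced by `nodes`, as an adjacency dictionary."""
    return {u: adj[u] & nodes for u in nodes}


def jaccard(a, b):
    """Jaccard coefficient of two node sets."""
    return len(a & b) / len(a | b)


def split_graph(adj, alpha):
    """Split a graph into two subgraphs that share the BFS-HD overlap."""
    overlap = bfs_hd_overlap(adj, alpha)
    rest = sorted(set(adj) - overlap)
    # ASSUMPTION: remaining nodes are alternated between the two sides.
    v1 = overlap | set(rest[0::2])
    v2 = overlap | set(rest[1::2])
    return induced(adj, v1), induced(adj, v2), overlap
```

Because every node of the original graph ends up in at least one subgraph, the Jaccard coefficient of the two node sets equals the size of the overlap divided by the number of nodes in the original graph.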
Node Signatures.
Since we are employing machine learning techniques to re-identify nodes in a graph, we need to represent nodes as feature vectors. We define the features of a node u using a combination of two vectors built from its neighborhood degree distribution (NDD) and neighborhood attribute distribution (NAD) (as depicted in Figure 2).
NDD is a vector of non-negative integers where $NDD_u^q[k]$ represents the number of u's neighbors at distance q with degree k. We concatenate the binned version of $NDD_u^1$ with the binned version of $NDD_u^2$ to define node u's NDD signature. We use a bin size of 50, which was shown empirically [39] to capture the high degree variations of large social graphs. For each q, we use 21 bins, which corresponds to a largest node degree of 1050. All larger values are placed in the last bin. This binning strategy is designed to capture the aggregate structure of ego networks, which is expected to be robust against edge perturbation [38]. NAD is defined by $NAD_u^q[i]$, which represents the number of u's neighbors at distance q with attribute value i. It has been shown experimentally that the use of neighbor attributes as features often improves the accuracy of edge classification tasks [27].
We use the notation GS to represent the prediction results obtained from input features built from the topology alone (e.g., NDD), and GS(LBL) to represent the results obtained from features built from both the topology and the attribute information (e.g., the concatenation of the NDD and NAD vectors).
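The signature construction above can be sketched in plain Python. This is a minimal illustration rather than the authors' code: the adjacency-dictionary representation, the function names, and the 0/1 encoding of the binary attribute are assumptions made for the example, while the bin size of 50 and the 21 bins per distance follow the text.

```python
def neighbors_at_distance(adj, u, q):
    """Nodes whose shortest-path distance from u is exactly q."""
    frontier, seen = {u}, {u}
    for _ in range(q):
        frontier = {w for v in frontier for w in adj[v]} - seen
        seen |= frontier
    return frontier


def ndd(adj, u, q, bin_size=50, n_bins=21):
    """Binned degree distribution of u's neighbors at distance q."""
    vec = [0] * n_bins
    for v in neighbors_at_distance(adj, u, q):
        # Bin j holds degrees k with j*b <= k < (j+1)*b; overflow goes last.
        vec[min(len(adj[v]) // bin_size, n_bins - 1)] += 1
    return vec


def nad(adj, labels, u, q, values=(0, 1)):
    """Count of u's neighbors at distance q for each attribute value."""
    hood = neighbors_at_distance(adj, u, q)
    return [sum(1 for v in hood if labels[v] == val) for val in values]


def signature(adj, labels, u, with_labels=True):
    """GS features (NDD only) or GS(LBL) features (NDD followed by NAD)."""
    sig = ndd(adj, u, 1) + ndd(adj, u, 2)
    if with_labels:
        sig += nad(adj, labels, u, 1) + nad(adj, labels, u, 2)
    return sig
```

With these defaults a GS signature has 42 entries and a GS(LBL) signature has 46, so the attribute information adds only four features per node.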

Random Forest Classification.
Note that the nodes in $G_{san} \cap G_{aux}$, common to both graphs, can be recognized as being the same node (identical) in the two graphs based on their node identifier. Non-identical nodes are unique to either $G_{san}$ or $G_{aux}$ and do not exist in the overlap. In the classification task, we wish to output 1 for an identical node pair and 0 for a non-identical node pair. This is the ground truth against which we measure the accuracy of the learning algorithms.
We generate examples for the training phase of the deanonymization attack by randomly picking node pairs from the sanitized ($G_{san}$) and the auxiliary ($G_{aux}$) graphs, respectively. In most cases, we have an unbalanced dataset, with the degree of imbalance depending on the overlap parameter α, where the majority class is non-identical node pairs. We use the reservoir sampling technique [9] to take ℓ = 1000 balanced sub-samples from the population S, and the SMOTE algorithm [5] as an over-sampling technique for each sub-sample. A forest of 100 random decision trees is trained on each sample to learn the features. The Gini index is used as the impurity measure for the random forest classification. Given the size α of the overlap, we measure the quality of the classifier on the task of differentiating two nodes as identical or not.
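One plausible reading of the sub-sampling step is sketched below: a single pass over the labeled node pairs that keeps one reservoir per class, so that each sub-sample holds the same number of identical and non-identical pairs. The per-class reservoir design, the function name, and the `(features, label)` tuple layout are assumptions for illustration; SMOTE over-sampling and the random forest itself are omitted.

```python
import random


def balanced_subsample(pairs, k_per_class, seed=0):
    """Draw up to k_per_class pairs of each label with reservoir sampling.

    `pairs` is any iterable of (features, label) tuples, where label 1
    marks an identical node pair and label 0 a non-identical one.
    """
    rng = random.Random(seed)
    reservoirs = {0: [], 1: []}
    seen = {0: 0, 1: 0}  # number of items of each class seen so far
    for features, label in pairs:
        reservoir = reservoirs[label]
        if seen[label] < k_per_class:
            reservoir.append((features, label))
        else:
            # Algorithm R: keep the new item with probability k / (i + 1).
            j = rng.randint(0, seen[label])  # inclusive on both ends
            if j < k_per_class:
                reservoir[j] = (features, label)
        seen[label] += 1
    return reservoirs[1] + reservoirs[0]
```

Calling the function with ℓ different seeds would yield ℓ sub-samples, each small enough to train on even when the full population S of node pairs does not fit in memory.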

Metrics.
Figure 2: Example of a node signature defined as a combined feature vector made up from NDD and NAD vectors. In the NDD vector, each bin value corresponds to the number of nodes whose degree falls in the bin range, such that the j-th bin holds the nodes whose degree $k_j$ satisfies $j \times b \le k_j < (j + 1) \times b$. If the degree exceeds the maximum range, such nodes are included in the last bin. Further, both 1-hop and 2-hop NDDs are calculated and merged. For example, node x has no 1-hop neighbor nodes with degree in the range 1-2, and one 2-hop neighbor node whose degree is in the range 1-2. In the NAD vector, each element corresponds to the number of nodes with the given attribute. Both 1-hop and 2-hop NADs are calculated and merged. Node x has one 1-hop neighbor node and two 2-hop neighbor nodes with the attribute Red. Note that the node value represents the associated degree, and the border color represents the node attribute Red or Blue.
We measure the accuracy of the classifier in determining whether a randomly chosen pair of nodes (with one node in $G_{san}$ and another in $G_{aux}$) is identical or not. We use the F1-score to evaluate the quality of the classifier. The F1-score is the harmonic mean of precision and recall, typical metrics for the prediction output of machine learning algorithms.
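For reference, the F1-score on the binary identical/non-identical task can be computed as in the short sketch below. The function name and the convention of returning 0 when there are no true positives are choices made for this example.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention chosen here: no true positives gives F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```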
For each data sample, we perform 5 × 2 cross-validation to evaluate the classifier and record the mean F1-score. We thus build two vectors of mean F1-scores, each of size ℓ = 1000 (as described above), one for the labeled (GS(LBL)) and one for the unlabeled network topology (GS). An important aspect of these vectors is that they are related, in the sense that the i-th element in one vector represents the same sample as the i-th element of the other vector. This is important for the pairwise comparison of the two mean F1-score vectors.
We perform a standard T-test on these two vectors and report the T-statistic value. The T-statistic value is a measure of how close to the hypothesis an estimated value is. In our case, the hypothesis is the prediction accuracy of the node identities in the unlabeled graph (GS) and the estimated value is the prediction accuracy in the labeled graph (GS(LBL)). Thus, a large T-statistic value implies a significantly better prediction accuracy of node identities in GS(LBL) than in GS. In such cases, we can say that the network with node attributes is more vulnerable to node re-identification. This value serves as our statistical measurement to quantify the vulnerability cost of node attributes.
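A minimal sketch of the comparison follows. The text speaks of a standard T-test; because the two vectors are matched element by element, this example assumes the paired (dependent-samples) form of the statistic, which is an interpretation and not something the text states. The function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(labeled_scores, unlabeled_scores):
    """t-statistic of the per-sample differences GS(LBL) - GS.

    ASSUMPTION: the paired form of the test is used because the i-th
    entries of both vectors come from the same sample.
    """
    diffs = [a - b for a, b in zip(labeled_scores, unlabeled_scores)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    # Undefined (division by zero) when all differences are identical.
    return mean_diff / (sd_diff / math.sqrt(n))
```

A positive value indicates that the labeled signatures score higher on average, and the magnitude grows with both the size and the consistency of the gain across samples.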

DATASETS
Because our work is empirically driven, a larger set of test datasets promises a better understanding of the relations between vulnerability to re-identification attacks and the particular characteristics of the node attributes (such as fractions of attributes of a particular value or the assignment of attributes to topologically related nodes). In this respect, real datasets are always preferable to synthetic ones, as they potentially encapsulate phenomena that are missing in the graph generative models. As an example, until very recently, the relation between the local degree assortativity coefficient and node degree was not captured in graph topology generators [37].
However, relying only on real datasets has its limitations, due to the scarcity of relevant data (in this case, networks with binary node attributes) and the difficulty of covering the relevant space of graph metrics when relying only on available real datasets. Thus, in this work, we combine real networks (described in Section 4.1) with synthetic networks generated from the real datasets. For generating synthetic labelled networks, we employ ERGMs [12,47] and a controlled node-labeling algorithm as described in Section 4.2.

Real Network Datasets
We chose six publicly available datasets from four different contexts and generated eight networks with binary node attributes.
• polblogs [1] is an interaction network between political blogs during the lead-up to the 2004 US presidential election. This dataset includes ground-truth labels identifying each blog as either conservative or liberal.
• fb-dartmouth, fb-michigan, and fb-caltech [46] are Facebook social networks extant at three US universities in 2005. A number of node attributes such as dorm, gender, graduation year, and academic major are available. We chose two such attributes that could be represented as binary attributes: gender and occupation, where for occupation we could identify the attribute values "student" and "faculty". From each dataset, we obtained two networks with the same topology but different node attribute distributions.
• pokec-1 [44] is a sample of an online social network in Slovakia. While the Facebook samples are university networks, Pokec is a general social platform whose membership comprises 30% of the Slovakian population. pokec-1 is a one-fortieth sample. This dataset has gender information available as a node attribute.
• amazon-products [24] is a bi-modal projection of categories in an Amazon product co-purchase network. Nodes are labeled as "book" or "music"; edges signify that the two items were purchased together.
As Table 1 shows, the networks generated from these datasets have different graph characteristics. For example, the density (d) of the graphs varies across three orders of magnitude, while degree assortativity ranges from disassortative (for polblogs, r = −0.22, where there are more interactions between popular and obscure blogs than expected by chance) to assortative (as expected for social networks). All topologies except for amazon-products have a small average path length.
The metrics p and τ shown in Table 1 are inspired by the synthetic node labeling algorithm used for generating synthetic graphs (presented later), and they also show high variation across different networks. Intuitively, p captures the diversity of attribute values in the node population (with p = 0.5 showing equal representation of the attributes) while τ captures the homophily phenomenon (that functions as an attraction force between nodes with identical attribute values). The homophilic attraction metric τ varies from 0 in pokec-1 (thus, no higher than chance preference for social ties with people of the same gender in Slovakia) to 0.99 in amazon-products (books are purchased together with other books much more strongly than given by chance). The diversity metric p varies from the overrepresentation of males in the US academic Facebook networks (8% female representation) to an almost perfectly balanced political representation in the polblogs dataset (where p = 0.48). Note that we only consider p as the smaller of the two group proportions, due to the symmetric nature of attributes in our experiments.
This wide variation in graph metric values is what motivated our choice of this set of real networks. We opted to include the three Facebook networks from similar contexts to also capture more subtle variations in network characteristics.

Synthetic Graphs
In order to be able to control graph characteristics and node attribute distributions, we also generated a number of synthetic graphs comparable with the real datasets just described. The graph generation included two aspects: topology generation, for which we opted for ERGMs, and node attribute assignments, for which we implemented the technique proposed in [42].

Varying Topology via ERGMs.
Exponential-family random graph models (ERGMs), or p-star models [12,47], are used in social network analysis to specify, within a set of structural parameters, probability distributions over networks. Their primary use is to describe structural and local forces that shape the general topology of a network. This is achieved by using a selected set of parameters that encompass different structural forces (e.g., homophily, degree correlation/assortativity, clustering, and average path length). Once the model has converged, we can obtain maximum-likelihood estimates, perform model comparison and goodness-of-fit tests, and generate simulated networks tied to the relationship between the original network and the probability distribution provided by the ERGM.
Our interest in ERGMs lies in simulating graphs that retain selected structural information from the original graph, in order to generate a diverse set of graph structures. We used R [45] and the statnet suite [10], which contains several packages for network analysis, to produce ERGMs and simulate graphs from our real-world network datasets. In this case, we focused on three structural aspects of the graphs: clustering coefficient, average path length, and degree correlation/assortativity.
For the ERGM based on clustering coefficient, we used the edges and triangle parameters in the statnet package. The edges parameter measures the probability of linkage or no linkage between nodes, and the triangle term looks at the number of triangles or triad formations in the original graph. For the average path length model, the edges and twopath terms were used. The twopath term measures the number of 2-paths in the original network and produces a probability distribution of their formation for the converged ERGM. Lastly, for the assortativity measure, the terms edges and degcor were used to produce the models. The degcor term considers the degree correlation of all pairs of tied nodes (for more on ERGMs see [13,29]). These terms proved to be our best choices for preserving, to a certain extent, the desired structural information. Although the creation of ERGMs is a trial-and-error process, the selected terms were successful in producing models for each of the original networks. After a successful model convergence, a simulated graph was generated for each model, constraining the number of edges to that of the original graph. It is worth mentioning that the built-in simulate function in the statnet suite offers no way of forcibly constraining the aspects of the original graph we want to control. Thus, we experience variation, in some cases more than others. The difference between the original and the simulated graphs seemed more prominent for smaller networks (see Table 1 and Table 2 for comparison) than for models based on the larger networks, which came closer to the real values of the original graphs.

Synthetic Labeling.
A simple model that parameterizes a labeled graph with a tendency towards homophily (ties disproportionately between those of similar attribute background) is an "attraction" model [42]. In the basic case of a binary attribute variable and a constant tendency to inbreed, two parameters, p and τ, both in the (0,1) interval, characterize the distribution of ties within and between the two groups. The first is the proportion of the population that takes on one value of the attribute (with 1 − p the proportion taking on the other value). The second parameter, the inbreeding coefficient or probability, expresses the degree to which a tie whose source is in one group is "attracted" to a target in that group. When τ = 0, there is no special attraction and ties within and between groups occur in chance proportions. When τ > 0, ties occur disproportionately within groups, increasingly so as τ approaches 1. Given a total number of ties, values for p and τ determine the number of ties/edges that are between groups, which we denote δ.
In the process of generating synthetic node attributes, we first randomly assign two arbitrary values (i.e., R and B) as labels to all the nodes in the graph for a given p, 1 − p split. Then, we repeatedly draw an R node and a B node at random and swap their labels if doing so decreases the number of R-B ties. This process converges when the total number of cross-group ties is reduced to δ for a particular value of τ.
Table 1: Graph properties of the real network datasets. All graphs are undirected, and nodes are annotated with a binary-valued attribute. For example, nodes in the polblogs network have the attribute party with values conservative and liberal. For simplicity, binary values are presented using the notation of R and B, together with the distributions of such values over nodes and edges. p and τ are the estimated parameter values of the attraction model. Density (d) is the fraction of all possible edges that are present; transitivity (C) is the fraction of all possible triangles that are present in the network; degree assortativity (r) measures the similarity of relations depending on the associated node degree; average path length (κ) is the average shortest path length between any pair of nodes.
As an example, Figure 3 shows the proportion of cross-group ties on the synthetic labeled networks generated from the polblogs topology. The proportion of cross-group ties increases with p and decreases with τ. When p reaches its maximum ($p_{max} = 0.5$, due to the symmetric nature of binary attribute values), the proportion of cross-group ties is relatively large at the minimum inbreeding coefficient τ.
It should be noted that convergence is not guaranteed for all possible combinations of p and τ. The swapping procedure holds constant all graph properties except the mapping of nodes to labels, and consequently, it may not be possible to find a mapping of nodes to labels that achieves a target number of ties between groups (when that number is low, as it is for higher values of τ). Table 2 presents the graph characteristics of the synthetically generated labeled graphs.
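The swapping procedure can be sketched as below. The target count δ is computed here as 2p(1 − p)(1 − τ)|E|, which is an assumption derived from the description (at τ = 0 the cross-group share equals the chance proportion 2p(1 − p), and it shrinks to zero as τ approaches 1); the exact expression used for δ is not reproduced in the text. The function names and the step limit, which stands in for the non-guaranteed convergence, are likewise hypothetical.

```python
import random


def cross_ties(edges, labels):
    """Number of edges whose endpoints carry different labels."""
    return sum(1 for u, v in edges if labels[u] != labels[v])


def attraction_labeling(nodes, edges, p, tau, max_steps=20000, seed=0):
    """Randomly label a p / (1 - p) split, then swap R/B pairs greedily."""
    rng = random.Random(seed)
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    n_r = round(p * len(shuffled))
    reds, blues = shuffled[:n_r], shuffled[n_r:]
    labels = {u: "R" for u in reds}
    labels.update({u: "B" for u in blues})
    # ASSUMPTION: target number of cross-group ties under the attraction model.
    delta = round(2 * p * (1 - p) * (1 - tau) * len(edges))
    current = cross_ties(edges, labels)
    for _ in range(max_steps):
        if current <= delta or not reds or not blues:
            break
        i, j = rng.randrange(len(reds)), rng.randrange(len(blues))
        r, b = reds[i], blues[j]
        labels[r], labels[b] = "B", "R"      # tentative swap
        new = cross_ties(edges, labels)
        if new < current:                    # keep only improving swaps
            current = new
            reds[i], blues[j] = b, r
        else:
            labels[r], labels[b] = "R", "B"  # revert the swap
    return labels, current, delta
```

If the returned cross-tie count is still above δ after `max_steps` draws, the requested (p, τ) combination was not reached on that topology, which mirrors the convergence caveat in the text.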

EMPIRICAL RESULTS
Our objective is not to measure the success of re-identification attacks on original datasets in which node identities have been removed: it was demonstrated long ago [3] that naive anonymization of graph datasets does not provide privacy. Instead, our objective is to quantify the exposure provided by node attributes on top of the intrinsic vulnerability of the particular graph topology under attack.

In our experiments, we leverage the real and synthetic networks described above. We mount the machine learning attack described in Section 3.2 to re-identify nodes using features based on both graph topology and node attributes. Our first guiding question is thus: How much risk of node re-identification is added to a network dataset by its binary node attributes?

Figure 4 presents the accuracy of node re-identification in the original graph topology GS and in the same topology augmented with node attributes GS(LBL). As expected, the re-identification attack (generally) performs better when node attributes are used in the attack. Surprising to us, however, is the relatively small vulnerability cost that node attributes introduce. For example, the occupation attribute has a barely noticeable benefit to the attacker in fb-dartmouth. More interestingly, the same attribute performs differently for the other two Facebook networks considered: for fb-caltech the occupation label functions as noise, leading to a small decrease in the F1-score. For fb-michigan, on the other hand, the occupation label significantly improves the attacker's performance.

Table 2: Basic statistics of generated ERGM networks and the population of node pairs. Note that dc, cc, and apl denote the parameters used to generate ERGM graphs based on assortativity (degree correlation), clustering coefficient, and average path length, respectively. We generated a total of ≈ 500 million identical and non-identical node pairs over three ERGM graph spaces of the six real social network datasets. S is the population of generated node pairs for a given graph topology.

The Vulnerability Cost of Node Attributes
Another observation from this figure is that different node attributes applied to the same topology have different outcomes: see, for example, the case of the fb-michigan topology, where the difference between the impacts of the gender and the occupation attributes is the largest. We thus formulate a new question: What placement of attributes onto nodes reveals more information?

Diversity Matters, Homophily Not
To understand how the placement of attribute values on nodes affects vulnerability, we generate synthetic node attributes in a controlled manner. By varying p (the diversity ratio) and τ (the bias of nodes with same-value attributes to be connected by an edge), we can study the effect of these parameters on node re-identification. Figure 5 presents the T-statistics of the F1-scores for node re-identification attacks on the original topology vs. labeled versions of the original topology. In addition to the original topologies, Figure 5 also presents results on various synthetic networks generated as presented in Section 4.2.
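As an illustration of how a T-statistic can compare two sets of F1-scores, the sketch below computes Welch's two-sample t-statistic. The choice of Welch's variant is an assumption made for this example, and the F1 values are made-up placeholders rather than results from our experiments.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's two-sample t-statistic for mean(sample_a) - mean(sample_b)."""
    na, nb = len(sample_a), len(sample_b)
    # Standard error of the difference, allowing unequal variances.
    se = sqrt(variance(sample_a) / na + variance(sample_b) / nb)
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical F1-scores from repeated attack runs (illustrative only).
f1_labeled = [0.80, 0.82, 0.84]    # attack on the labeled topology
f1_unlabeled = [0.70, 0.72, 0.74]  # attack on the unlabeled topology
t = welch_t(f1_labeled, f1_unlabeled)
```

A positive value indicates that the attack on the labeled topology scores higher on average; larger magnitudes indicate a difference that is large relative to the run-to-run variation.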
We observe three phenomena. First, it appears that p is positively correlated with the T-statistic value measuring the re-identification impact of attributes. That is, the larger the diversity p, the more vulnerable to re-identification the labeled nodes become on average. Intuitively, in a highly skewed attribute population, while the minority nodes will be identified more quickly due to node attributes, the majority remains protected. On the other hand, when p = 0.5, a network has two equal-sized sets of nodes, each taking one of the two attribute values. The effect is explained by the fact that the NAD feature vector captures more diverse information in the attributes of neighbors when p is larger. This is also the explanation for why the node attributes contribute so much more to vulnerability in the polblogs dataset, which has a large diversity (p = 0.48, thus almost equal numbers of conservative and liberal blogs). Note that the effect of p on the added vulnerability remains consistent across all topologies (real and synthetic) tested.
The second observation is that there is no visible pattern in how τ influences the vulnerability added by binary node attributes. While this is disappointing from the perspective of storytelling, it is potentially encouraging for data sharing, as it suggests that datasets that record homophily (or influence; the debate is irrelevant in this context) do not have to be anonymized by damaging this pattern. As a specific example, the privacy of a dataset that records an information dissemination phenomenon could be provided without perturbing the cascading-related ties.
The third class of observations is related to the relative effect of the topological characteristics on the added vulnerability. Both amazon-products and pokec-1 are orders of magnitude sparser than the other datasets considered. This means that the topological information available to the machine learning algorithm is limited. In this situation, the addition of the attribute information turns out to be very significant: the T-statistic values for these datasets are significantly larger than for the other datasets, with values over 400 in some cases.
Another topological effect is noticed when comparing the real pokec-1 topology with the ERGM-generated ones in Figure 5d: the node attribute contributes much more to the vulnerability of the original topology compared to the synthetic topologies. The reason for this unusual behavior may lie in the different clustering coefficients of the networks, as seen in Tables 1 and 2: the ERGM-generated topologies have clustering coefficients one order of magnitude higher than the original topology (for the same graph density), which leads to more diverse NDD feature vectors for the networks with higher clustering and thus richer training information. This in turn leads to better accuracy in node re-identification in the unlabeled ERGM topologies (with higher clustering) than in the original topology. For example, in pokec-1 the maximum F1-score for the ERGM-dc topology is 0.92, while for the original topology it is 0.76. Thus, the relative benefit of the node attribute is significantly higher when the topology features are poorer.
… is responsible for accurately classifying a large proportion of examples.
We make three observations from this figure. First, most of the NAD features (together with the node's attribute value) that represent node attribute information prove to be important in all datasets.

Topology Leaks
Second, among the NDD features, only a small number contribute consistently to accurate prediction. As shown in Figures 6c-6i, the first bin of the 1-hop and 2-hop NDD vectors contributes the most. That is, a high impact on the re-identification of a node is brought by the number of its neighbors with degrees between 1 and 50. This behavior is observed even in large networks such as pokec-1 and amazon-products, which have a larger range of node degrees.
Third, Figure 6 suggests which features explain the effect of diversity p on node re-identification in labeled networks. On datasets with large diversity (such as polblogs or pokec-1), the topological information contributes less than on datasets with low diversity (such as fb-caltech (gender)). This is because high diversity correlates with richer NAD feature vectors, and thus the relative importance of the NAD features increases.
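The binned neighbor degree distribution (NDD) discussed above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the first bin covers degrees 1 to 50 as in the text, while the number of bins, the capping of high degrees into the last bin, and the function names are hypothetical choices made for the example.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def degree_histogram(adj, nodes, bin_size, n_bins):
    """Histogram of node degrees; bin 0 holds degrees 1..bin_size."""
    hist = [0] * n_bins
    for n in nodes:
        idx = min((len(adj[n]) - 1) // bin_size, n_bins - 1)  # cap at last bin
        hist[idx] += 1
    return hist

def ndd_vector(adj, node, bin_size=50, n_bins=3):
    """Concatenate the 1-hop and 2-hop binned degree histograms of a node."""
    one_hop = adj[node]
    # 2-hop neighbors: neighbors of neighbors, excluding 1-hop and the node.
    two_hop = set().union(*(adj[n] for n in one_hop)) - one_hop - {node}
    return (degree_histogram(adj, one_hop, bin_size, n_bins)
            + degree_histogram(adj, two_hop, bin_size, n_bins))

# Toy example with a tiny bin size so that both bins are exercised.
adj = build_adjacency([(0, 1), (0, 2), (0, 3), (3, 4)])
vec_center = ndd_vector(adj, 0, bin_size=2, n_bins=2)
vec_leaf = ndd_vector(adj, 4, bin_size=2, n_bins=2)
```

With the default bin size of 50, the first entry of each half of the vector is exactly the quantity highlighted above: the number of 1-hop (or 2-hop) neighbors whose degree lies between 1 and 50.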

SUMMARY AND DISCUSSIONS
Our work shows that the addition of even a single binary attribute to nodes in a network graph increases its vulnerability to re-identification. Previous work showed that vulnerability increases with the addition of multiple, multi-category attributes [14]. We measure the vulnerability increase and study how it is affected by network and attribute properties.
The increase in vulnerability derives from the fact that the machine learning attack makes use of the interaction between topology and the distribution of node labels. Using information about the distribution of labels in a node's neighborhood provides additional leverage for the re-identification process even when labels are rudimentary.
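A minimal sketch of such a neighborhood label distribution for a binary attribute is given below. The exact layout of the NAD feature vector is an assumption made for illustration: here it holds the node's own label followed by the counts of label-0 and label-1 nodes among its 1-hop and 2-hop neighbors.

```python
def nad_vector(adj, label, node):
    """Own label, then (count of 0s, count of 1s) at 1-hop and at 2-hop.

    adj maps each node to the set of its neighbors; label maps each
    node to 0 or 1. The vector layout is a hypothetical choice.
    """
    one_hop = adj[node]
    # 2-hop neighbors: neighbors of neighbors, excluding 1-hop and the node.
    two_hop = set().union(*(adj[n] for n in one_hop)) - one_hop - {node}
    features = [label[node]]
    for hood in (one_hop, two_hop):
        ones = sum(label[n] for n in hood)
        features += [len(hood) - ones, ones]
    return features

# Toy graph: node 0 is connected to 1, 2, 3; node 3 is also connected to 4.
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 4}, 4: {3}}
label = {0: 1, 1: 0, 2: 1, 3: 0, 4: 1}
nad_center = nad_vector(adj, label, 0)
nad_leaf = nad_vector(adj, label, 4)
```

Even though both label values are rudimentary, nodes 0 and 4 share the same label yet receive different vectors, which is the kind of additional leverage the attack can exploit.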
Furthermore, we find that a population's diversity on the binary attribute consistently degrades anonymity and increases vulnerability. Diversity means a more even distribution of the binary attribute, which produces a more varied set of neighborhood distributions that a particular node may exhibit. Consequently, nodes are more easily distinguished from one another by virtue of their differing neighborhood distributions of labels.
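One way to see why an even split yields more varied neighborhoods is the following back-of-the-envelope sketch, which is an illustration rather than part of our analysis. It assumes that each neighbor of a degree-d node independently carries label 1 with probability p (that is, no homophily), and it measures the variety of possible neighborhoods by the Shannon entropy of the number of label-1 neighbors.

```python
from math import comb, log2

def neighbor_count_entropy(d, p):
    """Entropy (bits) of the number of label-1 neighbors of a degree-d node.

    Assumption: neighbors are labeled independently with probability p,
    so the count follows a Binomial(d, p) distribution.
    """
    h = 0.0
    for k in range(d + 1):
        q = comb(d, k) * p**k * (1 - p) ** (d - k)
        if q > 0:
            h -= q * log2(q)
    return h

# Compare a skewed and a balanced population for a node of degree 10.
h_skewed = neighbor_count_entropy(10, 0.1)
h_balanced = neighbor_count_entropy(10, 0.5)
```

Under this independence assumption, the balanced population gives the higher entropy, so the label counts in a node's neighborhood carry more distinguishing information when p is close to 0.5.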
One puzzle remains. There is no consistent discernible impact of homophily, as measured by the inbreeding coefficient, on vulnerability. Our procedure for investigating the impact of homophily simply involves swapping labels without disturbing ties. Therefore, both local and global (unlabeled) topologies remain constant as we decrease the number of cross-group ties to achieve a target value implied by a particular inbreeding coefficient for a given proportional split along the binary attribute. This procedure disturbs the local labeled topology, but because the machine learning attack uses information from that local topology, it apparently can adapt to the changes and make equally successful predictions regardless of the value of the inbreeding coefficient. Perhaps that is why many different factors in attacks on the labeled graphs have some degree of responsibility for success, and no relatively small subset gets the lion's share of the credit.