The phantom alignment strength conjecture: practical use of graph matching alignment strength to indicate a meaningful graph match

The alignment strength of a graph matching is a quantity that gives the practitioner a measure of the correlation of the two graphs, and it can also give the practitioner a sense for whether the graph matching algorithm found the true matching. Unfortunately, when a graph matching algorithm fails to find the truth because of weak signal, there may be “phantom alignment strength” from meaningless matchings that, by random noise, have fewer disagreements than average (sometimes substantially fewer); this alignment strength may give the misleading appearance of significance. A practitioner needs to know what level of alignment strength may be phantom alignment strength and what level indicates that the graph matching algorithm obtained the true matching and is a meaningful measure of the graph correlation. The Phantom Alignment Strength Conjecture introduced here provides a principled and practical means to approach this issue. We provide empirical evidence for the conjecture, and explore its consequences.


Introduction
This paper is about graph matchability in practice. Specifically, when given two graphs and an unobserved "true" bijection (also called "true matching" or "true alignment") between their vertices, will exact (i.e. optimal) graph matching and approximate graph matching algorithms provide us with the matching which is the "truth"? How might we know in actual practice whether the "truth" has been found? Our work is in response to the latter question. The main contribution here is our formulation of the Phantom Alignment Strength Conjecture in Section 4, followed up in the same section with the practical implications of this conjecture in deciding when alignment strength is high enough to indicate truth. This conjecture is also interesting as a theoretical matter, completely aside from its consequences.
Graphs (networks) are a commonly used data modality for encoding relationships, interactions, and dependencies in data in an incredibly broad range of the sciences and engineering; this includes sociology (e.g., social network analysis [1]), neuroscience connectomics [2,3], biology (e.g., biological interaction networks [4,5]), and automated knowledge discovery [6], to name just a few application areas.
The graph matching problem is, given two graphs with the same number of vertices, to find the bijection between the vertex sets that minimizes the number of adjacency "disagreements" between the graphs. Often there is an underlying "true" bijection that the graph matching is attempting to recover/approximate. Sometimes part of this true bijection is known a priori, in which case minimizing the number of disagreements over the remainder of the bijection is called seeded graph matching. Graph matching and seeded graph matching are formally defined in Section 2.
Graph matching and seeded graph matching are used in a wide variety of places, and we mention just a few. Information about the interactions amongst objects of interest is sometimes split across multiple networks or multiple layers of the same network [7]. In many applications, such as neuroscience connectomics where, for example, DT-MRI derived graphs can be generated by aligning scans to a common template before uncovering the underlying edge structure [8], the vertices across networks or across layers are a priori aligned and identified. These aligned vertex labels can then be used to create joint network inference procedures that can leverage the signal across multiple networks for more powerful statistical inference [9,10,11,12]. In many other applications, the vertex labels across networks or across layers are unknown or noisily observed. Social networks provide a canonical example of this, where common users across different social network platforms may use different user names and their user profiles may not be linked across networks. Discovering this latent correspondence (in the social network example, this is anchoring profiles to a common user across networks) is a key inference task [13,14] for leveraging the information across networks for subsequent inference, and it is a key consideration for understanding the degree of user anonymity [15] across platforms.
For a thorough survey of the relevant graph matching literature, see [16,17,18]. The graph matching problem is computationally complex. Indeed, the simpler graph isomorphism problem has been shown to be of quasi-polynomial complexity [19]. Allowing loopy, weighted, directed graphs makes graph matching equivalent to the NP-hard quadratic assignment problem. Due to its practical importance and computational difficulty, a large branch of the graph matching literature is devoted to developing algorithms to efficiently, but approximately, solve the graph matching problem; see, for example, [20,21,22,23,24,25,26,27,28] among myriad others.
Somewhat dual to the algorithmic development literature, a large branch of the modern graph matching literature is devoted to theoretically exploring the question of graph matchability, also called graph de-anonymization; this is the question of determining when there is enough signal present for graph matching to recover the "true" bijection. Many of the recent papers in this area have introduced latent alignment across graphs by correlating the edges across networks between common pairs of vertices, focusing on understanding the phase transition between matchable and non-matchable networks in terms of the level of correlation across networks and/or the sparsity level of the networks; see, for example, [29,30,31,32,33,34,35,36,37,38].
In [39], a novel measure of graph correlation between two random graphs called total correlation is introduced; it is neatly partitioned into an inter-graph contribution (the "edge correlation" that had been the previous focus in the literature) and a novel intra-graph contribution. Furthermore, they introduce a statistic called alignment strength, which is 1 minus a normalized count of the number of disagreements in an optimal/true graph match; they prove under mild conditions that alignment strength is a strongly consistent estimator of total correlation. Experimental results in [39] suggest that the matchability phase transition, as well as the complexity of the problem, is a function of this more nuanced total correlation rather than simply the cross-graph edge correlation/edge sparsity that had been the previous focus in the literature.
Analyses mining the matchability phase transition in the literature that also have considered similarity across generative network models beyond simple sparsity have thus far focused on simple community-structured network models [40,41,42], or have proceeded by removing the heterogeneous within-graph model information and simply using the across-graph edge correlation [43]. Recently, there have been numerous papers in the literature at the interface between algorithm development and mining matchability phase transitions; see, for instance, [44,38,37]. A common theme of many of these results is that, under assumptions on the across-graph edge correlation and network sparsity, algorithms are designed to efficiently (or approximately efficiently) match graphs with corresponding theoretical guarantees on the performance of the algorithms in recovering the latent alignment.
However, the question remains how a practitioner knows in practice whether or not a graph matching has successfully recovered the truth. This issue is not resolved by asymptotic analysis with hidden constants. Nor, in general, are the underlying parameters known to the practitioner. It seems that the graph alignment statistic is a very natural metric to use in deciding if the truth is found. Unfortunately, when there is an absence of signal, an optimal (or approximately optimal) graph matching will find spurious and random alignment strength due to chance. Indeed, this meaningless alignment strength can be high and misleading. How do we gauge whether or not it is high enough to signal that truth is found?
After formally defining seeded graph matching and alignment strength in Section 2 and defining the correlated Bernoulli random graph model (and attendant parameters) in Section 3, we then address this issue with our Phantom Alignment Strength Conjecture in Section 4, and in the ensuing discussion in Section 4. Then, in Section 5, we present empirical evidence for the conjecture using synthetic and real data, and comparing to theoretical results; Section 5 begins with a thorough summary. This is followed in Section 6 by notable mentions, and future directions.

Seeded graph matching, alignment strength
In the seeded graph matching setting, we are given two simple graphs, say they are $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, such that $|V_1| = |V_2|$; let $\Pi$ denote the set of bijections $V_1 \to V_2$. It is usually understood that there exists a "true" bijection $\varphi^* \in \Pi$ which represents a natural correspondence between the vertices in $V_1$ and the vertices in $V_2$; for example, $V_1$ and $V_2$ might be the same people, with $E_1$ indicating which pairs exchanged emails and $E_2$ indicating pairs that communicated in a different medium. Or $G_1$ may be the electrical connectome (brain graph) of a worm and $G_2$ might be the chemical connectome of the same worm, both graphs sharing the same vertex set of neurons. The vertex set $V_1$ is partitioned into two disjoint sets, $S$ "seeds" (possibly empty) and $N$ "nonseeds"; denote $s := |S|$ and $n := |N|$. (When $s = 0$ this is the conventional graph matching problem.) The graphs $G_1$ and $G_2$ are observed, and the values of $\varphi^*$ are observed on the set of seeds $S$; however, the values of $\varphi^*$ are not observed on the nonseeds $N$, and one of several important tasks is to estimate $\varphi^*$.
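Let $\Pi_S$ denote the set of all bijections $V_1 \to V_2$ that agree with $\varphi^*$ on the seeds $S$. For any $\varphi \in \Pi_S$, its match ratio is defined to be $\frac{1}{n} |\{v \in N : \varphi(v) = \varphi^*(v)\}|$, i.e. the fraction of the nonseeds that are correctly matched by $\varphi$. (It is common to multiply the match ratio by 100 to express it as a percentage.)
For any set $V$, let $\binom{V}{2}$ denote the set of two-element subsets of $V$; for each $i = 1, 2$ and any $\{u, v\} \in \binom{V_i}{2}$, let $u \sim_{G_i} v$ and $u \not\sim_{G_i} v$ denote adjacency and, respectively, nonadjacency of $u$ and $v$ in $G_i$. Next, let $\mathbb{1}$ denote the indicator function for its subscript, and let $\mathfrak{n} := n + s$ denote the total number of vertices. Given any $\varphi \in \Pi$, we define the full number of disagreements through $\varphi$ to be
$$D'(\varphi) := \sum_{\{u,v\} \in \binom{V_1}{2}} \left( \mathbb{1}_{u \sim_{G_1} v \text{ and } \varphi(u) \not\sim_{G_2} \varphi(v)} + \mathbb{1}_{u \not\sim_{G_1} v \text{ and } \varphi(u) \sim_{G_2} \varphi(v)} \right)$$
and, given any $\varphi \in \Pi_S$, we define the restricted number of disagreements through $\varphi$ to be
$$D(\varphi) := \sum_{\{u,v\} \in \binom{N}{2}} \left( \mathbb{1}_{u \sim_{G_1} v \text{ and } \varphi(u) \not\sim_{G_2} \varphi(v)} + \mathbb{1}_{u \not\sim_{G_1} v \text{ and } \varphi(u) \sim_{G_2} \varphi(v)} \right).$$
The seeded graph matching problem is to find
$$\hat\varphi \in \arg\min_{\varphi \in \Pi_S} D'(\varphi), \tag{3}$$
and the idea is that $\hat\varphi$ is an estimate for the true bijection $\varphi^*$. Unfortunately, except in the smallest instances, computing $\hat\varphi$ is intractable. A state-of-the-art algorithm SGM from [20] is commonly used to approximately solve the optimization problem in (3); we denote its output $\hat\varphi_{SGM}$ ($\in \Pi_S$). It is an approximation of $\hat\varphi$ and, hence, an approximation of $\varphi^*$.
For any $\varphi \in \Pi_S$, the full alignment strength $\mathrm{str}'(\varphi)$ and the restricted alignment strength $\mathrm{str}(\varphi)$ are defined as
$$\mathrm{str}'(\varphi) := 1 - \frac{D'(\varphi)}{\frac{1}{\mathfrak{n}!} \sum_{\phi \in \Pi} D'(\phi)} \quad \text{and} \quad \mathrm{str}(\varphi) := 1 - \frac{D(\varphi)}{\frac{1}{n!} \sum_{\phi \in \Pi_S} D(\phi)}. \tag{4}$$
Although the denominators of (4) have exponentially many summands, alignment strength is easily computed as follows. For $i = 1, 2$, define the full density of $G_i$ as $d'G_i := |E_i| / \binom{\mathfrak{n}}{2}$ and the restricted density of $G_i$ as $dG_i :=$ the number of edges of $G_i$ induced by $N$, divided by $\binom{n}{2}$. It holds that
$$\mathrm{str}'(\varphi) = 1 - \frac{D'(\varphi) / \binom{\mathfrak{n}}{2}}{d'G_1 (1 - d'G_2) + (1 - d'G_1)\, d'G_2} \quad \text{and} \quad \mathrm{str}(\varphi) = 1 - \frac{D(\varphi) / \binom{n}{2}}{dG_1 (1 - dG_2) + (1 - dG_1)\, dG_2}; \tag{5}$$
see [39] for the derivation of (5) from (4).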
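As an illustration of how restricted alignment strength and match ratio can be computed from their definitions, here is a minimal Python sketch. It assumes graphs are given as symmetric 0/1 adjacency matrices (lists of lists) and a bijection as a list `phi` with `phi[v]` the image of vertex `v`. All function names are hypothetical, and the restricted density of the second graph is taken over the image of the nonseeds, which is an interpretation on our part.

```python
from itertools import combinations, permutations


def disagreements(adj1, adj2, phi, vertices):
    """Count adjacency disagreements through phi over pairs of `vertices`."""
    return sum(
        adj1[u][v] != adj2[phi[u]][phi[v]] for u, v in combinations(vertices, 2)
    )


def density(adj, vertices):
    """Fraction of the vertex pairs within `vertices` that are adjacent."""
    pairs = list(combinations(vertices, 2))
    return sum(adj[u][v] for u, v in pairs) / len(pairs)


def alignment_strength(adj1, adj2, phi, nonseeds):
    """Restricted alignment strength via the closed form with densities.

    Assumes the two restricted graphs are not both empty or both complete
    (otherwise the denominator is zero).
    """
    n_pairs = len(nonseeds) * (len(nonseeds) - 1) // 2
    d1 = density(adj1, nonseeds)
    d2 = density(adj2, [phi[v] for v in nonseeds])
    # Average disagreement rate over all bijections of the nonseeds.
    expected = d1 * (1 - d2) + (1 - d1) * d2
    return 1 - (disagreements(adj1, adj2, phi, nonseeds) / n_pairs) / expected


def match_ratio(phi, phi_true, nonseeds):
    """Fraction of nonseeds on which phi agrees with the true bijection."""
    return sum(phi[v] == phi_true[v] for v in nonseeds) / len(nonseeds)


def mean_strength_over_all_bijections(adj1, adj2, vertices):
    """Brute-force average of alignment strength (tiny graphs, no seeds)."""
    vals = [
        alignment_strength(adj1, adj2, list(p), vertices)
        for p in permutations(vertices)
    ]
    return sum(vals) / len(vals)
```

On a small graph matched to itself, the identity bijection has alignment strength 1, while the average over all bijections is 0 up to rounding, as the normalization by the average number of disagreements implies.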
The importance of alignment strength to a practitioner is twofold. First, the alignment strength of $\varphi^*$ (and its proxies $\hat\varphi$ and $\hat\varphi_{SGM}$) may be thought of as a measure of how similar the structure of the graphs $G_1$ and $G_2$ are through the "true" bijection; indeed, if the number of disagreements under $\varphi^*$ (and its proxies $\hat\varphi$ and $\hat\varphi_{SGM}$) is about equal to the average over all bijections then its alignment strength is near 0 (as clearly seen from the definition in (4)) and, at the other extreme, if $\varphi^*$ (and its proxies $\hat\varphi$ and $\hat\varphi_{SGM}$) is nearly an isomorphism between $G_1$ and $G_2$ then its alignment strength is near 1. It was proven in [39] that the full alignment strength of the "true" bijection, $\mathrm{str}'(\varphi^*)$, is a strongly consistent estimator of $\varrho_T$, which is a parameter called the total correlation between the two graphs $G_1$ and $G_2$, defined in Section 3.
Another way that alignment strength is of much importance to a practitioner is in providing confidence that $\hat\varphi_{SGM}$ or $\hat\varphi$ is a good estimate of $\varphi^*$, the "truth." If $\mathrm{str}(\hat\varphi_{SGM})$ or $\mathrm{str}(\hat\varphi)$ is high enough then we may be confident that a meaningful match capturing similar graph structure has been found, and therefore $\hat\varphi_{SGM}$ or $\hat\varphi$ is approximately or exactly $\varphi^*$. But how high is high enough?
Indeed, these issues in the use of alignment strength become vastly more complicated by the possibility of phantom alignment strength. This is a phenomenon that occurs when, in the presence of weak signal, meaningless matchings have many fewer disagreements than average (sometimes very substantially fewer) due to random noise, and $\hat\varphi$ and/or $\hat\varphi_{SGM}$ is one of these meaningless matchings: optimal in the optimization problem, but meaningless as estimates of $\varphi^*$. Indeed, the alignment strength of $\hat\varphi$ and/or $\hat\varphi_{SGM}$ may be elevated enough to give the misleading appearance of significance when, in reality, they don't at all resemble $\varphi^*$. This will be illustrated in Section 5.
The purpose of this paper is to give a principled, practical means of approaching the decision of what level of alignment strength for $\hat\varphi$ and/or $\hat\varphi_{SGM}$ indicates that they are a good approximation of $\varphi^*$, in which case the alignment strength reflects the amount of meaningful similar structure between $G_1$ and $G_2$, beyond the random similarity between completely unrelated graphs.
(A note on terminology: We define both full alignment strength and restricted alignment strength since each will end up being important at a different time. The Phantom Alignment Strength Conjecture of Section 4 requires restricted alignment strength specifically; indeed, since full alignment strength includes the seeds, this would dilute the desired effect, falsifying the conjecture's conclusion. However, after we have confidence that our graph matching is the true matching, it is then full alignment strength that will be a better estimator of total correlation introduced in Section 3.)
In the correlated Bernoulli random graph model, the random graphs $G_1$ and $G_2$ share a common vertex set $V$; the parameters are a Bernoulli parameter $p_{\{u,v\}} \in [0, 1]$ for each $\{u, v\} \in \binom{V}{2}$ and an edge correlation parameter $\varrho_e \in [0, 1]$. For each $\{u, v\} \in \binom{V}{2}$ and each $i = 1, 2$, the probability of $u \sim_{G_i} v$ is the Bernoulli parameter $p_{\{u,v\}}$, and the Pearson correlation for random variables $\mathbb{1}_{u \sim_{G_1} v}$ and $\mathbb{1}_{u \sim_{G_2} v}$ is the edge correlation parameter $\varrho_e$. Other than these dependencies, the rest of the adjacencies are independent.
The distribution of the pair of random graphs $G_1, G_2$ is determined by the above (see [39]). Of course, the identity function is the "true" matching $\varphi^*$ between $G_1$ and $G_2$.
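One way to realize a pair of graphs from this model is sketched below in Python. For each vertex pair, it samples the adjacency in the first graph and then the adjacency in the second graph conditionally, which preserves the marginal probability and gives Pearson correlation equal to the edge correlation. The function name `sample_correlated_bernoulli` and the adjacency-matrix representation are assumptions made for illustration.

```python
import random


def sample_correlated_bernoulli(p, rho_e, rng=None):
    """Sample adjacency matrices (adj1, adj2) of correlated Bernoulli graphs.

    p     : symmetric matrix (list of lists) of Bernoulli parameters in [0, 1]
    rho_e : edge correlation in [0, 1]
    rng   : optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    m = len(p)
    adj1 = [[0] * m for _ in range(m)]
    adj2 = [[0] * m for _ in range(m)]
    for u in range(m):
        for v in range(u + 1, m):
            puv = p[u][v]
            e1 = rng.random() < puv
            # Conditional probability of the edge in the second graph:
            # puv + rho_e * (1 - puv) given an edge in the first graph,
            # puv * (1 - rho_e) given a non-edge. The marginal stays puv.
            cond = puv + rho_e * (1 - puv) if e1 else puv * (1 - rho_e)
            e2 = rng.random() < cond
            adj1[u][v] = adj1[v][u] = int(e1)
            adj2[u][v] = adj2[v][u] = int(e2)
    return adj1, adj2
```

With `rho_e = 1` the two graphs coincide, and with `rho_e = 0` they are independent.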
(If the Bernoulli parameters are all equal, then the random graphs $G_1$ and $G_2$ are each said to be Erdos-Renyi, so the correlated Erdos-Renyi random graph model is a special case of the correlated Bernoulli random graph model.)
Important functions of the model parameters are as follows. The Bernoulli mean and Bernoulli variance are, respectively, defined as
$$\mu := \frac{1}{\binom{|V|}{2}} \sum_{\{u,v\} \in \binom{V}{2}} p_{\{u,v\}} \quad \text{and} \quad \sigma^2 := \frac{1}{\binom{|V|}{2}} \sum_{\{u,v\} \in \binom{V}{2}} \left( p_{\{u,v\}} - \mu \right)^2.$$
Assume that $\mu$ is not equal to 0 nor 1. The heterogeneity correlation is defined in [39] as
$$\varrho_h := \frac{\sigma^2}{\mu (1 - \mu)};$$
it is in the unit interval $[0, 1]$; see [39]. Also pointed out in [39] is that $\varrho_h$ is 0 if and only if all Bernoulli parameters are equal (i.e. the graphs are Erdos-Renyi) and $\varrho_h$ is 1 if and only if all Bernoulli parameters are $\{0, 1\}$-valued. In particular, if $\varrho_h$ is 1 then $G_1$ and $G_2$ are almost surely isomorphic. The total correlation $\varrho_T$ is defined in [39] to satisfy the relationship
$$1 - \varrho_T = (1 - \varrho_e)(1 - \varrho_h). \tag{7}$$
In the following key result, Theorem 1, which was proved in [39], let us consider a probability space that incorporates correlated Bernoulli random graph distributions for each of the number of vertices $n = 1, 2, 3, \ldots$. Thus, the parameters are functions of $n$, but to prevent notation clutter we omit notating the dependence on $n$. The symbol $\xrightarrow{a.s.}$ denotes almost sure convergence.
Theorem 1 Under mild conditions on the model parameters (see [39]), it holds that $\mathrm{str}'(\varphi^*) - \varrho_T \xrightarrow{a.s.} 0$.
Theorem 1 together with Equation 7 shows that the alignment strength of the true bijection captures (asymptotically) an underlying correlation between the random graphs that can be neatly (and symmetrically, per Equation 7) partitioned into an inter-graph contribution (edge correlation) and an intra-graph contribution (heterogeneity correlation).
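To make the partition into edge correlation and heterogeneity correlation concrete, the following short Python sketch computes the Bernoulli mean, the heterogeneity correlation, and the total correlation from a list of Bernoulli parameters (one per vertex pair) and an edge correlation. The function name `total_correlation` is hypothetical; the formulas are the ones stated above.

```python
def total_correlation(params, rho_e):
    """Return (mu, rho_h, rho_T) for the given Bernoulli parameters.

    params : Bernoulli parameters, one per unordered vertex pair
    rho_e  : edge correlation in [0, 1]
    Assumes the Bernoulli mean is neither 0 nor 1.
    """
    params = list(params)
    mu = sum(params) / len(params)
    var = sum((p - mu) ** 2 for p in params) / len(params)
    rho_h = var / (mu * (1 - mu))          # heterogeneity correlation
    rho_t = 1 - (1 - rho_e) * (1 - rho_h)  # from 1 - rho_T = (1 - rho_e)(1 - rho_h)
    return mu, rho_h, rho_t
```

For equal parameters (the Erdos-Renyi case) this gives heterogeneity correlation 0 and total correlation equal to the edge correlation; for {0, 1}-valued parameters it gives total correlation 1 regardless of the edge correlation.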
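Next, instead of considering a sequence of correlated Bernoulli random graphs, let us dig down deeper one probabilistic level. Specifically, suppose that for each $\{u, v\} \in \binom{V}{2}$ there exists a $[0, 1]$-valued distribution $F_{\{u,v\}}$ such that the Bernoulli parameter $p_{\{u,v\}}$ (in the correlated Bernoulli random graph model) is an independent random variable with distribution $F_{\{u,v\}}$. Denote the mean of this distribution $\mu_{F_{\{u,v\}}}$, denote the variance of this distribution $\sigma^2_{F_{\{u,v\}}}$, and (if we have $\mu_{F_{\{u,v\}}}$ not 0 nor 1) define the heterogeneity correlation of the distribution to be
$$\varrho_{F_{\{u,v\}}} := \frac{\sigma^2_{F_{\{u,v\}}}}{\mu_{F_{\{u,v\}}} \left( 1 - \mu_{F_{\{u,v\}}} \right)}.$$
Theorem 2 Given an edge correlation parameter $\varrho_e \in [0, 1]$ and, for each $\{u, v\} \in \binom{V}{2}$, given a $[0, 1]$-valued distribution $F_{\{u,v\}}$ such that the Bernoulli parameter $p_{\{u,v\}}$ is independently distributed as $F_{\{u,v\}}$, then the distribution of the associated correlated Bernoulli random graphs $(G_1, G_2)$ is completely specified by $\varrho_e$ and, for all $\{u, v\} \in \binom{V}{2}$, the values of $\mu_{F_{\{u,v\}}}$ and $\varrho_{F_{\{u,v\}}}$.
Proof: Consider any $\{u, v\} \in \binom{V}{2}$; the Bernoulli coefficient $p_{\{u,v\}}$, call it $X$, has distribution $F_{\{u,v\}}$. For any $p \in [0, 1]$, conditioning on $X = p$, the joint probabilities of combinations of $u, v$ adjacency in $G_1, G_2$ are computed in a straightforward way (see [39] Appendix A) in the table:
$$\begin{array}{c|cc} & u \sim_{G_2} v & u \not\sim_{G_2} v \\ \hline u \sim_{G_1} v & p^2 + \varrho_e\, p(1-p) & (1 - \varrho_e)\, p(1-p) \\ u \not\sim_{G_1} v & (1 - \varrho_e)\, p(1-p) & (1-p)^2 + \varrho_e\, p(1-p) \end{array}$$
Probabilities of these adjacency combinations, relative to the underlying distribution $F_{\{u,v\}}$, are computed by integrating/summing the conditional probabilities (in the table) times the density/mass of $F_{\{u,v\}}$, obtaining, with $F := F_{\{u,v\}}$,
$$P[u \sim_{G_1} v \text{ and } u \not\sim_{G_2} v] = (1 - \varrho_e)\, E[X(1 - X)] = (1 - \varrho_e) \left( \mu_F (1 - \mu_F) - \sigma^2_F \right) = (1 - \varrho_e)(1 - \varrho_F)\, \mu_F (1 - \mu_F),$$
and likewise for $P[u \not\sim_{G_1} v \text{ and } u \sim_{G_2} v]$. Then, for each $i = 1, 2$, because $P[u \sim_{G_i} v] = EX = \mu_{F_{\{u,v\}}}$, we have all four adjacency combinations as functions of $\varrho_e$, $\mu_{F_{\{u,v\}}}$, and $\varrho_{F_{\{u,v\}}}$. The result follows from the independence across all pairs of vertices.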
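The content of Theorem 2 can be checked numerically on a single vertex pair. The Python sketch below assumes a discrete distribution for the Bernoulli parameter, given by hypothetical `support` and `weights` lists, and uses the standard joint distribution of two Bernoulli(p) indicators with Pearson correlation equal to the edge correlation. Two different distributions with the same mean and variance should then yield the same four adjacency probabilities.

```python
def adjacency_probabilities(support, weights, rho_e):
    """Joint adjacency probabilities for one vertex pair when the Bernoulli
    parameter is drawn from a discrete distribution.

    Returns (both adjacent, only in G1, only in G2, neither adjacent).
    """
    pairs = list(zip(support, weights))
    both = sum(w * (p * p + rho_e * p * (1 - p)) for p, w in pairs)
    only_one = sum(w * (1 - rho_e) * p * (1 - p) for p, w in pairs)
    neither = sum(w * ((1 - p) ** 2 + rho_e * p * (1 - p)) for p, w in pairs)
    return both, only_one, only_one, neither


# Two distributions with mean 0.5 and variance 0.04 but different shapes.
two_point = adjacency_probabilities([0.3, 0.7], [0.5, 0.5], 0.2)
three_point = adjacency_probabilities([0.1, 0.5, 0.9], [0.125, 0.75, 0.125], 0.2)
```

Here `two_point` and `three_point` agree up to rounding, even though the underlying distributions differ, because they share the same mean and heterogeneity correlation.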
In the Phantom Alignment Strength Conjecture we assume all distributions $F_{\{u,v\}}$ are the same; call the common distribution $F$. Note that the Bernoulli mean $\mu$ and heterogeneity correlation $\varrho_h$ are now random variables, and if $n$ is large, then $\mu$ and $\varrho_h$ will respectively be good estimators of $\mu_F$ and $\varrho_F$. A very important consequence of Theorem 2 is that the only information that matters regarding $F$ is contained (well-estimated) in the quantities $\mu$ and $\varrho_h$.

Phantom Alignment Strength Conjecture, consequences
In this section, we propose the Phantom Alignment Strength Conjecture, which is the central purpose of this paper. We then discuss its consequences; the conjecture gives us a principled and practical way to decide if we should be convinced that the output of a graph matching algorithm well-approximates the true matching.
Henceforth we use the term alignment strength to refer to the restricted alignment strength.
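Consider correlated Bernoulli random graphs $G_1, G_2$ such that there are a "moderate" number $n$ of nonseed vertices (say $n \geq 300$), $s$ seeds (selected discrete uniformly from the $\mathfrak{n} := n + s$ vertices), and Bernoulli parameters are independently realized from any fixed $[0, 1]$-valued distribution with moderate mean $\mu_F$ (say $.05 < \mu_F < .95$). The Phantom Alignment Strength Conjecture states that, subject to caveats, as discussed in Section 6, there exists a phantom alignment strength value $q \equiv q(n, s, \mu_F) \in [0, 1]$ such that $\mathrm{str}(\hat\varphi)$ has "negligible" variance and is approximately a function of the total correlation $\varrho_T$ and, specifically, it holds that, with "high probability,"
$$\hat\varphi = \varphi^* \text{ and } \mathrm{str}(\hat\varphi) \approx \varrho_T \quad \text{if } \varrho_T > q, \qquad \hat\varphi \neq \varphi^* \text{ and } \mathrm{str}(\hat\varphi) \approx q \quad \text{if } \varrho_T \leq q.$$
Moreover, the conjecture states that, when using the seeded graph matching algorithm SGM of [20] (given $n, s, \mu_F$ as above), there exists $q_{SGM} \equiv q_{SGM}(n, s, \mu_F) \in [0, 1]$ such that $q_{SGM} \geq q$, and $\mathrm{str}(\hat\varphi_{SGM})$ has "negligible" variance and is approximately a function of the total correlation $\varrho_T$ and, specifically, it holds that, with "high probability,"
$$\hat\varphi_{SGM} = \varphi^* \text{ and } \mathrm{str}(\hat\varphi_{SGM}) \approx \varrho_T \quad \text{if } \varrho_T > q_{SGM}, \qquad \hat\varphi_{SGM} \neq \varphi^* \text{ and } \mathrm{str}(\hat\varphi_{SGM}) \approx q \quad \text{if } \varrho_T \leq q_{SGM}.$$
Note that both $\mathrm{str}(\hat\varphi)$ and $\mathrm{str}(\hat\varphi_{SGM})$ are conjectured to be an approximately piecewise linear function of $\varrho_T$ with two pieces, one piece with slope 0 and one piece with slope 1. However, $\mathrm{str}(\hat\varphi)$ is continuous and shaped like a hockey stick (see Figure 2f), whereas for $\mathrm{str}(\hat\varphi_{SGM})$ there can be a discontinuity (see Figure 2b); but the function value of the linear portion with slope 0 is the same for $\mathrm{str}(\hat\varphi_{SGM})$ as it is for $\mathrm{str}(\hat\varphi)$, namely it is the phantom alignment strength value $q$.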
There are important consequences of the Phantom Alignment Strength Conjecture for the practitioner. Suppose that a practitioner has two particular graphs $G_1, G_2$ with $n$ nonseed vertices and $s$ seeds that can be considered as realized from a correlated Bernoulli random graph model, and the practitioner wants to seeded graph match them, computing $\hat\varphi_{SGM}$ as an approximation of the true matching $\varphi^*$. How can the practitioner tell if $\hat\varphi_{SGM}$ is $\varphi^*$? This conjecture provides a principled, practical mechanism. The practitioner should realize two independent Erdos-Renyi graphs $H_1$ and $H_2$ with $n$ nonseed vertices, $s$ seeds, and adjacency probability parameter $p$ equal to the combined density of $G_1$ and $G_2$. Then use SGM to seeded graph match $H_1$ and $H_2$; the alignment strength of the resulting bijection (between $H_1$ and $H_2$) is approximately $q \equiv q(n, s, \mu)$, since the total correlation in generating $H_1$ and $H_2$ is 0, by design. Then, when subsequently seeded graph matching $G_1$ and $G_2$, if $\mathrm{str}(\hat\varphi_{SGM})$ exceeds $q$ by more than some predetermined and fixed $\epsilon > 0$, then that would indicate that $\hat\varphi_{SGM} = \varphi^*$ and, if $\mathrm{str}(\hat\varphi_{SGM})$ is less than this, then there is no confidence that $\hat\varphi_{SGM}$ is $\varphi^*$. Moreover, in the former case the practitioner can have confidence in approximating $\mathrm{str}(\hat\varphi_{SGM}) \approx \varrho_T$, and in the latter case there wouldn't be confidence in this approximation. (In the former case, note that the full alignment strength $\mathrm{str}'(\hat\varphi_{SGM})$ would then be an even better estimate of $\varrho_T$.) (If some of the model assumptions are violated and the Bernoulli mean of $G_1$ may be different from $G_2$, then it may be better not to combine their densities, but rather to realize $H_1$ and $H_2$ as Erdos-Renyi graphs with respective adjacency parameter equal to their respective densities.)
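The procedure just described can be sketched in Python as follows. This is a toy illustration under stated assumptions rather than the experimental code of this paper: the matcher is passed in as a callable, and the example matcher `brute_force_match` is an exact exhaustive search that stands in for SGM and is feasible only for a handful of nonseeds. Seeds are assumed to be vertices `0..s-1` fixed to the identity, and all function names are hypothetical.

```python
import random
from itertools import combinations, permutations


def er_graph(m, p, rng):
    """Adjacency matrix of an Erdos-Renyi graph on m vertices."""
    adj = [[0] * m for _ in range(m)]
    for u, v in combinations(range(m), 2):
        adj[u][v] = adj[v][u] = int(rng.random() < p)
    return adj


def disagreements(a1, a2, phi, verts):
    return sum(a1[u][v] != a2[phi[u]][phi[v]] for u, v in combinations(verts, 2))


def strength(a1, a2, phi, nonseeds):
    """Restricted alignment strength via the closed form with densities."""
    pairs = len(nonseeds) * (len(nonseeds) - 1) // 2
    image = [phi[v] for v in nonseeds]
    d1 = sum(a1[u][v] for u, v in combinations(nonseeds, 2)) / pairs
    d2 = sum(a2[u][v] for u, v in combinations(image, 2)) / pairs
    expected = d1 * (1 - d2) + (1 - d1) * d2
    if expected == 0:
        return 0.0  # degenerate graphs; a convention chosen here, not from the text
    return 1 - (disagreements(a1, a2, phi, nonseeds) / pairs) / expected


def brute_force_match(a1, a2, s):
    """Exact seeded matching by exhaustive search (stand-in for SGM)."""
    m = len(a1)
    best, best_d = None, None
    for tail in permutations(range(s, m)):
        phi = list(range(s)) + list(tail)
        d = disagreements(a1, a2, phi, range(m))  # full disagreements
        if best_d is None or d < best_d:
            best, best_d = phi, d
    return best


def phantom_strength(n, s, p, matcher, reps, rng):
    """Mean alignment strength when matching independent Erdos-Renyi graphs,
    i.e. graphs whose total correlation is 0 by design."""
    nonseeds = list(range(s, s + n))
    vals = []
    for _ in range(reps):
        h1, h2 = er_graph(n + s, p, rng), er_graph(n + s, p, rng)
        vals.append(strength(h1, h2, matcher(h1, h2, s), nonseeds))
    return sum(vals) / len(vals)


def looks_meaningful(observed_strength, q, eps):
    """Decision rule: alignment strength must exceed q by more than eps."""
    return observed_strength > q + eps
```

In use, `p` would be set to the combined density of the two observed graphs, `matcher` would wrap whichever SGM implementation is being used on the real data, and the observed alignment strength would be compared against the estimated value with a tolerance fixed in advance.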

Empirical evidence in favor of the Phantom Alignment Strength Conjecture
In this section we provide empirical evidence for the Phantom Alignment Strength Conjecture. A summary is as follows: We begin in Section 5.1 with a scale small enough ($n$ is just on the order of tens) to solve seeded graph matching and attain optimality. Although the Phantom Alignment Strength Conjecture does not apply because $n$ is so small, we nonetheless see many ingredients of the conjecture. Then, in Section 5.2, we use synthetic data on a scale for the conjecture to be applicable, and we empirically demonstrate the conjecture for many types of Bernoulli parameter distributions: unimodal, bimodal, symmetric, skewed, etc. The SGM algorithm is employed for seeded graph matching, since exact optimality is unattainable in practice.
In Section 5.3, the alignment strength of completely uncorrelated Erdos-Renyi graphs (graph matched with SGM, using no seeds), taken as a function of n, is empirically demonstrated to be the same order of growth (in terms of n) as the theoretical bound for matchability (as a function of n), which suggests that the two quantities are the same, in excellent accordance with the conjecture.
Then, in Section 5.4, we observe that when there is block structure and differing distributions for the Bernoulli parameters by block (thus the conjecture hypotheses are not adhered to) then the conjecture's claims may fail to hold, to some degree. Nonetheless, there is still a phantom alignment strength that allows for a procedure similar to what we recommend in Section 4 to be successfully used for deciding when alignment strength is significant enough to indicate that the seeded graph matching has found the truth.
Real data is then used for demonstration in Section 5.5 and Section 5.6. Specifically, in Section 5.5, we use a human connectome at many different resolution levels, and graph match it to a manually noised copy of itself.
Then, in Section 5.6, we consider several pairs of real-data graphs (titled Wikipedia, Enron, and C Elegans) whose vertices are the same objects, and the adjacencies in each pair of graphs represent relationships between the objects across two different modalities.
All of these experiments serve as strong empirical evidence for the Phantom Alignment Strength Conjecture, and motivate its use.

Of hockey sticks and phantom alignment strength
We begin with an experiment in which the value of $n$ is well below what is required in the statement of the Phantom Alignment Strength Conjecture. However, $n$ is small enough here to enable us to compute $\hat\varphi$ exactly, using the integer programming formulation from [39]. We will be able to see many features of the Phantom Alignment Strength Conjecture, and we will also see that phantom alignment strength is not just an artifact of the SGM algorithm.
For each value of $\varrho_e$ from 0 to 1 in increments of .025, we did 100 independent repetitions of the following experiment. We realized a pair of correlated Bernoulli random graphs on $\mathfrak{n} = 30$ vertices with edge correlation $\varrho_e$ and, for each pair of vertices, the associated Bernoulli parameter was 0.5. (In particular, the graphs are correlated Erdos-Renyi.) Since here $\sigma^2 = 0$, we have that $\varrho_h = 0$, and thus $\varrho_T = \varrho_e$. We chose $s = 15$ seeds discrete uniformly at random, so there were $n = 15$ nonseeds. For each experiment, we solved the seeded graph matching problem to optimality (indeed, $n = 15$ is small enough to do so), obtaining $\hat\varphi$. If it happened that $\hat\varphi = \varphi^*$ then we plotted a green asterisk in Figure 1 for the resulting alignment strength $\mathrm{str}(\hat\varphi)$ against the total correlation $\varrho_T$ and, if $\hat\varphi \neq \varphi^*$, we plotted a red asterisk for the resulting alignment strength $\mathrm{str}(\hat\varphi)$ against the total correlation $\varrho_T$. The black diamonds in Figure 1 are the mean alignment strengths for the 100 repetitions, plotted for each value of $\varrho_e$.
Figure 1: For each $\varrho_e$ from 0 to 1 in increments of .025, alignment strength of $\hat\varphi$ for 100 independent realizations when all Bernoulli probabilities were 0.5 (in particular, $\varrho_T = \varrho_e$), with $n = 15$ nonseeds, $s = 15$ seeds; a green asterisk if $\hat\varphi = \varphi^*$, else a red asterisk.
It is readily seen from Figure 1 that the variance for the alignment strength of $\hat\varphi$ is quite high, which is reason to not formulate the Phantom Alignment Strength Conjecture until $n$ is much larger. Other than this, observe that if we substitute "mean of the alignment strength of $\hat\varphi$" into the conjecture in place of "alignment strength of $\hat\varphi$" then the conjecture would hold here. Indeed, when $\varrho_T \gtrsim 0.44 \equiv q$ we very generally had that $\hat\varphi = \varphi^*$, and when $\varrho_T \lesssim 0.44$ we very generally had that $\hat\varphi \neq \varphi^*$. (This boundary is not sharp, but is close.) Also, note that when $\varrho_T \gtrsim 0.44$, the mean of the alignment strength was approximately equal to $\varrho_T$. Furthermore, when $\varrho_T \lesssim 0.44$, we see that the (mean) alignment strength of $\hat\varphi$ is the phantom alignment strength (mean) of $\approx 0.44$. Indeed, in this latter case, the alignment strength of $\hat\varphi$ is a misleadingly high value, and is not meaningful.

Of hockey sticks and broken hockey sticks
In this section, we use synthetic data that meets the hypotheses of the Phantom Alignment Strength Conjecture. Our setup was as follows. We chose the number of nonseeds to be $n = 1000$, and we repeated an experiment for all combinations of the following:
• Each pair of Beta distribution parameters $\alpha, \beta$ from the table of five pairs labelled A, B, C, D, E,
• Each $\mu_F$ (the mean of the scaled/translated Beta distribution) from .1 to .9 in increments of .1,
• Each number of seeds $s = 0, 10, 20, 50, 250, 1000$,
• Each value of edge correlation $\varrho_e$ from 0 to 1 in increments of 0.025,
• Each value of $\delta$ from 0 to $\delta_{max} := \min\{ \frac{\alpha+\beta}{\alpha} \mu_F, \frac{\alpha+\beta}{\beta} (1 - \mu_F) \}$ in increments of $\frac{1}{10} \delta_{max}$.
For each combination of the above, we realized a pair of correlated Bernoulli random graphs on $n + s$ vertices, with edge correlation $\varrho_e$ and, for each pair of vertices, the associated Bernoulli parameter was independently realized from the distribution $\delta \left[ \mathrm{Beta}(\alpha, \beta) - \frac{\alpha}{\alpha+\beta} \right] + \mu_F$. Note the following:
• This distribution has support interval of length $\delta$, has mean $\mu_F$, and the support interval is contained in the interval $[0, 1]$.
• The distribution is uniform when $\alpha, \beta$ is 1, 1, is bimodal when $\alpha, \beta$ is 0.5, 0.5, is symmetric unimodal when $\alpha, \beta$ is 2, 2, and is skewed in the other two cases, in different directions, one where the mode is an endpoint of the support and one where the mode is interior of the support.
• The Bernoulli mean $\mu$ is approximately $\mu_F$, since $\binom{n+s}{2}$ is very large for these purposes.
The $s$ seeds were chosen discrete uniformly at random from the $n + s$ vertices, and we computed $\hat\varphi_{SGM}$ via the SGM algorithm for seeded graph matching. In Figure 2 we plotted alignment strength $\mathrm{str}(\hat\varphi_{SGM})$ against total correlation $\varrho_T$ for all of the pairs of graphs generated in the case where $\mu_F = 0.5$, in different subfigures for the different values of $s = 0, 10, 20, 50, 250, 1000$; green dots indicate when $\hat\varphi_{SGM} = \varphi^*$, blue and red dots indicate when $\hat\varphi_{SGM} \neq \varphi^*$, blue when $\hat\varphi_{SGM}$ agreed with $\varphi^*$ on at least 85% of the nonseeded vertices (i.e. "match ratio ≥ 85%"), and red when $\hat\varphi_{SGM}$ agreed with $\varphi^*$ on less than 85% of the nonseeded vertices.
Figure 2: Alignment strength $\mathrm{str}(\hat\varphi_{SGM})$ plotted against total correlation $\varrho_T$ for the synthetic data experiments in Section 5.2, separated according to the number of seeds $s$. The number of nonseeds was $n = 1000$, and only the case of $\mu_F = 0.5$ is shown here. Match ratio of each experiment is color coded green, blue, or red according to the figure's "Match Ratio" legend. Subfigures (g) and (h) are zooms into subfigures (c) and (d), to increase the granularity so that the thresholding is better seen.
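The scaled and translated Beta distribution used for the Bernoulli parameters in this section can be sketched in Python as below. The form of the distribution (a Beta variate, centred at its mean, scaled by δ and shifted to the target mean) is inferred from the stated support length, mean, and expression for δ_max; the function names are hypothetical.

```python
import random


def delta_max(alpha, beta, mu):
    """Largest support length keeping the support inside [0, 1]."""
    return min((alpha + beta) / alpha * mu, (alpha + beta) / beta * (1 - mu))


def sample_bernoulli_parameter(alpha, beta, mu, delta, rng):
    """Draw delta * (Beta(alpha, beta) - alpha / (alpha + beta)) + mu.

    The result has mean mu and lies in an interval of length delta,
    which is contained in [0, 1] whenever 0 <= delta <= delta_max.
    """
    centred = rng.betavariate(alpha, beta) - alpha / (alpha + beta)
    return delta * centred + mu
```

For example, with `alpha = beta = 1` and `mu = 0.5`, `delta_max` is 1 and the sampler with `delta = 1` reduces to the uniform distribution on [0, 1]; with `delta = 0` every Bernoulli parameter equals `mu`, which is the Erdos-Renyi case.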
Note that in Figure 2, each of (a)-(f) is a plot of 2255 points, each point represented with a filled circle, and the crowding of the points makes them resemble lines; so, in Figure 2, we also included (g) and (h), which are zooms of a portion of (c) and (d), respectively. With the increased granularity in (g) and (h), we see that if we ignore some outlier red and green dots, then there is a better defined transition from red to green than would appear in (c) and (d).
The Phantom Alignment Strength Conjecture is well motivated by the results illustrated in Figure 2. In particular, alignment strength $\mathrm{str}(\hat\varphi_{SGM})$ exhibits very low variance and is approximately a piecewise-linear function of total correlation $\varrho_T$. There appears to be a critical value $q_{SGM}$, dependent on the number of seeds $s$ in these experiments, for which the following holds. When total correlation $\varrho_T$ is above $q_{SGM}$ then $\hat\varphi_{SGM} = \varphi^*$ and $\mathrm{str}(\hat\varphi_{SGM}) \approx \varrho_T$, and when total correlation $\varrho_T$ is below $q_{SGM}$ then $\hat\varphi_{SGM} \neq \varphi^*$, evidenced by $\mathrm{str}(\hat\varphi_{SGM}) \not\approx \varrho_T$, and $\mathrm{str}(\hat\varphi_{SGM})$ is constant, at a phantom alignment strength level. When there are enough seeds, we see that the two pieces of the function join to become continuous, suggesting that $\hat\varphi_{SGM} = \hat\varphi$ is then achieved for all $\varrho_T$, and the value of $q_{SGM}$ is then $q$.
Also note that the five different Beta distributions from which Bernoulli parameters were realized (the five pairs of Beta parameters labelled A, B, C, D, E) in these experiments were collected into each of the subfigures of Figure 2, and the experiment results for these different distributions are indistinguishable from each other in the figures. This is in accordance with Theorem 2, and is reflected in the Phantom Alignment Strength Conjecture claim that the phantom alignment strength is just a function of $n, s, \mu_F$, and that it isn't relevant what distribution is used to obtain the Bernoulli parameters.
Also note the phase transition from matchable to non-matchable which takes place when $\varrho_T$ gets to $q_{SGM}$, and this phase transition becomes better and better defined as the number of seeds goes up.
For the other values of $\mu_F$, the figures exhibited the same overall type of structure, although the phantom alignment strength values were different. In the interest of space, we only present here the $\mu_F = 0.5$ experiment figures.

Phantom alignment strength vs theoretical matchability threshold
Among other assertions, the Phantom Alignment Strength Conjecture asserts, under conditions, that the alignment strength str( φSGM ) when T = 0, called the "phantom alignment strength," is equal to the total correlation threshold for matchability of exact seeded graph matching (i.e. the particular value such that φ = ϕ * or not according as T is greater than this value or not); indeed, we have denoted this common quantity q. In this section, we will compare alignment strength str( φSGM ) when T = 0 to the matchability threshold proved in [45].
Consider a probability space with a sequence of correlated Bernoulli random graphs, one for each number of vertices n = 1, 2, 3, . . ., with s = 0 seeds and all Bernoulli parameters equal to a fixed value p (i.e. correlated Erdos-Renyi random graphs). When we say that a sequence of events happens "almost always" we mean that, with probability 1, all but a finite number of the events occur. The following result was stated and proved in [45]; although stated there in terms of e , we write T instead, since here, where h = 0, we have that T = e .

For each value of p = 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, and each of 500 values of n between 500 and 4000 (as mentioned, s = 0), we plotted realizations of alignment strength str( φSGM ) vs the value of n, for uncorrelated ( e = 0) pairs of Erdos-Renyi graphs where each Bernoulli parameter is p, hence T = 0 (since e = 0 and h = 0). Figure 3 shows the plots for p = 0.05, 0.1, 0.5. For each value of p, a curve f p of the form given in the caption of Figure 3 was fitted to the plotted points, and f p is also drawn in Figure 3. For each value of p, note the near-perfect fit of f p to the associated points plotted in Figure 3, and note that the value of d p is close to zero. Indeed, this suggests, as conjectured in the Phantom Alignment Strength Conjecture, that the phantom alignment strength (i.e. str( φSGM ) when T = 0) exists as a value q which coincides with the amount of total correlation needed for φ = ϕ * .

Block settings
The setting of the Phantom Alignment Strength Conjecture in Section 4 specifically concerned correlated Bernoulli random graphs G 1 , G 2 such that there are n nonseed vertices, s seed vertices (selected discrete uniformly from the n + s vertices in total), and Bernoulli parameters for each pair of vertices are selected independently from any fixed distribution with mean µ .
Let us consider a block setting, which differs from the above in that there is a positive integer K, and the vertex set V is first randomly partitioned into K blocks B 1 , B 2 , . . ., B K as follows: There is a given probability vector $\pi \in [0, 1]^K$ such that $\sum_{i=1}^{K} \pi_i = 1$, and each vertex in V is independently placed in block B i with probability π i for i = 1, 2, . . ., K. Next, suppose there is a unit-interval-valued (i.e. [0, 1]-valued) distribution F i,j for each i = 1, 2, . . ., K and j = i, i + 1, . . ., K such that, for each 1 ≤ i ≤ j ≤ K and each u ∈ B i and v ∈ B j , the Bernoulli parameter p {u,v} is independently realized from distribution F i,j . Let M be the K × K symmetric matrix with i, jth entry equal to the mean of distribution F i,j .
We consider the following choices for n, s, π, and M : In Experiment "A," we took F 1,1 to be the point mass distribution at 0.3, F 1,2 to be the point mass distribution at 0.4, and F 2,2 to be the point mass distribution at 0.5. For each value of edge correlation e from 0 to 1 in increments of 0.001, we realized Bernoulli parameters and then we realized associated correlated Bernoulli random graphs.
In Figure 4, we plotted alignment strength str( φSGM ) against total correlation T ; green dots indicate when φSGM = ϕ * , (else) light blue when φSGM agreed with ϕ * on at least 85% of the nonseeded vertices, (else) dark blue when φSGM agreed with ϕ * on at least 50%, (else) red when φSGM agreed with ϕ * on less than 50% of the nonseeded vertices. We then repeated the experiment with the only difference being that F 2,2 was the uniform distribution on the interval [0, 1], so (n, s, π, M ) are the same as above; the resulting plot is Figure 5 (alignment strength str( φSGM ) vs T , same dot color scheme as above). Let us call this Experiment "B." Next, we repeated the above experiment for all eight possible combinations of: and we superimposed all of the alignment strength vs total correlation plots in Figure 6 (same dot color scheme as above); we will call this Experiment "C." Again, the underlying (n, s, π, M ) are the same as in the previous experiments.
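The dot color scheme used in these plots amounts to a simple threshold rule on the fraction of nonseeded vertices that are matched correctly. A minimal sketch follows, assuming matchings are stored as Python lists indexed by vertex; the function and argument names are ours.

```python
def dot_color(phi_sgm, phi_true, nonseeds):
    """Classify one seeded graph match by its agreement with the true bijection."""
    correct = sum(1 for v in nonseeds if phi_sgm[v] == phi_true[v])
    if correct == len(nonseeds):
        return "green"        # the match is exactly the truth on the nonseeds
    fraction = correct / len(nonseeds)
    if fraction >= 0.85:
        return "light blue"   # at least 85% of nonseeds correct
    if fraction >= 0.50:
        return "dark blue"    # at least 50% of nonseeds correct
    return "red"              # less than 50% of nonseeds correct
```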
Note that Figure 4, Figure 5, and Figure 6 (for respective experiments A, B, and C) are not similar, even though they originate from the same values of n, s, π, and M . Thus, the Phantom Alignment Strength Conjecture does not extend directly to the case of nontrivial block structure.
However, also note that when SGM was broadly failing to get the truth in experiments A, B, and C (i.e. the red dots in Figure 4, Figure 5, and Figure 6), the alignment strength was almost constant, at a value of around 0.12. This suggests a decision procedure (analogous to the procedure described in Section 4) for deciding if G 1 , G 2 from an (n, s, π, M )-block model are graph matched with some truth. The procedure would be to realize H 1 and H 2 as correlated Bernoulli random graphs where e = 0, where the n + s vertices are apportioned to the blocks in proportion to π, and where, for every pair of vertices, the Bernoulli parameter is taken as the entry of M associated with the block memberships of the two vertices, and then the s seeds are chosen uniformly at random. The alignment strength of the seeded graph match of H 1 to H 2 can then be used as a phantom alignment strength value in the sense that, if the alignment strength of the seeded graph match of G 1 to G 2 is more than some ε > 0 greater than this phantom alignment strength value, then we decide that there is at least some truth present in the seeded graph match of G 1 to G 2 .

What made the block structure more complicated? We will next provide some insight. Indeed, Experiment B was constructed in an extreme way in order to cause particular mischief. The value of h in Experiment A was approximately 0.0129, and the value of h in Experiment B was approximately 0.2277; in particular, that is why the value of T was never below approximately 0.22 in Experiment B, as is clear from Figure 5. However, in Experiment B when e = 0, all of the vertices in the first block are stochastic twins; they share the same probabilities of adjacency as each other to all of the vertices in the graph, and all adjacencies are collectively independent. Thus the "true" bijection (the identity) has no signal in that case. (One might even say that the "truth" isn't very "truthy.") As such, the total correlation in that case, approximately 0.2277, does not contribute to matchability vis-a-vis the first block. As positive edge correlation e is increasingly added in to Experiment B, the first block achieves matchability on the strength of only the edge correlation, and the second block achieves matchability on the strength of edge correlation together with heterogeneity correlation. In this manner, total correlation does not tell a uniform story across all vertices. This is in contrast to the hypotheses of the Phantom Alignment Strength Conjecture (and the setup in the empirical matchability experiments in the paper [39]) where the Bernoulli parameters were realized from one distribution. Note that with Experiment C, there is more variety in h (for the eight experiments the values of h ranged from approximately 0.0161 to approximately 0.30); there is still some lack of demarcation between matchable and nonmatchable in terms of total correlation, but the situation is improved somewhat from the left tail of the figure, and total correlation has more influence as a unified quantity.
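The null-model step of the block decision procedure described above can be sketched as follows. This is a minimal illustration under our own assumptions: graphs are stored as sets of pairs (u, v) with u < v, the apportionment of vertices to blocks uses a simple deterministic rounding, and the tolerance `eps` is a hypothetical user-chosen value. The seeded graph matching step itself is not shown.

```python
import random

def apportion_blocks(n_total, pi):
    """Assign vertices 0..n_total-1 to blocks in proportion to pi."""
    bounds, cumulative = [], 0.0
    for share in pi:
        cumulative += share
        bounds.append(cumulative)
    blocks = []
    for v in range(n_total):
        x = (v + 0.5) / n_total
        # First block whose cumulative share exceeds x (last block as fallback).
        blocks.append(next((i for i, c in enumerate(bounds) if x < c), len(pi) - 1))
    return blocks

def sample_graph(blocks, M, rng):
    """Each pair {u, v} is an edge independently with probability M[block(u)][block(v)]."""
    n = len(blocks)
    return {(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < M[blocks[u]][blocks[v]]}

def sample_null_pair(n_total, s, pi, M, rng):
    """Realize H1, H2 with edge correlation 0 (independent draws) and s uniform seeds."""
    blocks = apportion_blocks(n_total, pi)
    h1 = sample_graph(blocks, M, rng)
    h2 = sample_graph(blocks, M, rng)
    seeds = rng.sample(range(n_total), s)
    return h1, h2, seeds, blocks

def some_truth_present(str_g, str_phantom, eps):
    """Decide 'some truth' when the observed strength exceeds the phantom value by eps."""
    return str_g > str_phantom + eps
```

A seeded graph match of the two returned graphs would supply `str_phantom`, which is then compared against the alignment strength observed for the real pair.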
We did additional experiments with other values of (n, s, π, M ) and found comparable results to what appears above.

Real data; matching graphs to noisy renditions
Recall that the Phantom Alignment Strength Conjecture is formulated under the assumption that each pair of vertices has a Bernoulli parameter that is a realization of a distribution which is common to all of the pairs of vertices. How realistic is this assumption in practice? And, more to the point of the practitioner, do the conclusions of the conjecture apply to real data, in general?
In this section we consider a human connectome at different resolution levels. (This connectome has been featured in [46,47].) Diffusion-weighted Magnetic Resonance Imaging (dMRI) brain scans were collected from one hundred and fourteen humans at the Beijing Normal University [48]. Fiber tracts, which trace axonal pathways through a three-spatial-dimensional cuboid array of 1 × 1 × 1 mm³ voxels of the dMRI scan, are estimated using the ndmg pipeline [49].
For each value of n = 70, 107, 277, 582, 3230, the graph G n was formed in the following manner. Starting from the original cuboid array of voxels, n equally spaced "contractile" voxels were selected, and each voxel in the array was merged with its nearest contractile voxel [50]; the n such groupings of voxels (centered at their contractile voxel) are the n vertices of the graph G n . For any two vertices in G n , we declare them adjacent precisely when there exists a fiber that runs through any voxel of one vertex and also any voxel of the other vertex for any of the one hundred and fourteen individuals.
Given any graph G = (V, E), and also given any noise parameter ρ ∈ [0, 1], we can instantiate a graph G′, called a ρ-noised rendition of G, on the same vertex set V as follows. Denote the density of G by $d_G := |E| / \binom{|V|}{2}$. First, instantiate an independent Erdos-Renyi graph H on V with Bernoulli parameter d G; i.e. each pair of vertices is an edge independently of the others with probability d G. Next, for each pair of vertices {u, v}, perform an independent Bernoulli trial; with probability ρ set u adjacent/ not adjacent (resp.) to v in G′ according as u adjacent/ not adjacent (resp.) to v in G, and with probability 1 − ρ set u adjacent/ not adjacent (resp.) to v in G′ according as u adjacent/ not adjacent (resp.) to v in H. In this manner, G′ is a mixture of G and noise graph H. When graph matching G to a ρ-noised rendition of G, clearly ϕ * is the identity function V to V .

For each of n = 70, 107, 277, 582, 3230, we did the following experiment. For each value of the noise parameter ρ from 0 to 1 in increments of 0.025, we did 20 repetitions of instantiating a ρ-noised rendition of G n , then seeded graph matched G n to it using the SGM algorithm after selecting 10% of the n vertices (discrete uniform randomly) as seeds. The mean alignment strength str( φSGM ) (the mean being over the 20 repetitions) vs noise parameter ρ was plotted in five respective figures (for the five different values of n) in the left side of Figure 7; green dots indicate when φSGM = ϕ * , (else) light blue when φSGM agreed with ϕ * on at least 85% of the nonseeded vertices, (else) dark blue when φSGM agreed with ϕ * on at least 50%, (else) red when φSGM agreed with ϕ * on less than 50% of the nonseeded vertices.
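The construction of a ρ-noised rendition can be sketched in a few lines. This is a minimal illustration, assuming a graph is stored as a set of vertex pairs (u, v) with u < v over at least two vertices; the function name and the data representation are ours.

```python
import random
from itertools import combinations

def noised_rendition(vertices, edges, rho, rng):
    """Return the edge set of a rho-noised rendition of the graph (vertices, edges)."""
    pairs = list(combinations(sorted(vertices), 2))
    density = len(edges) / len(pairs)          # density of the original graph
    noised = set()
    for pair in pairs:
        in_noise_graph = rng.random() < density  # adjacency in the noise graph H
        copy_original = rng.random() < rho       # independent Bernoulli(rho) trial
        present = (pair in edges) if copy_original else in_noise_graph
        if present:
            noised.add(pair)
    return noised
```

With `rho = 1` the rendition is the original graph, and with `rho = 0` it is an Erdos-Renyi graph with the density of the original graph.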
We then repeated the above experiments, with the only difference being that in place of G n we used an Erdos-Renyi graph instantiation, the Erdos-Renyi using the Bernoulli parameter d G n (the density of the connectome G n ).The resulting plots are in the right hand side of Figure 7. Simple calculations of the distributions show that the pairs of graphs being seeded graph matched here in these repeated experiments are precisely correlated Erdos-Renyi graphs with the parameter ρ being precisely the edge correlation e , which is equal to T since h = 0.
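The "simple calculations" mentioned above can be sketched for a single pair of vertices. We make the idealizing assumption that the noise graph uses exactly the same Bernoulli parameter p, with 0 < p < 1, as the Erdos-Renyi graph being noised; the indicator names X, Y, B, X′ are introduced here for illustration only.

```latex
% X = 1 if {u,v} is an edge of the Erdos-Renyi graph, Y = 1 if it is an edge of the
% noise graph H, B = 1 if the rendition copies the original graph at {u,v}.
% X, Y, B are independent.
\[
X \sim \mathrm{Bernoulli}(p), \quad Y \sim \mathrm{Bernoulli}(p), \quad
B \sim \mathrm{Bernoulli}(\rho), \qquad X' = B X + (1 - B) Y .
\]
\begin{align*}
\mathbb{E}[X'] &= \rho p + (1 - \rho) p = p, \\
\mathbb{E}[X X'] &= \rho\, \mathbb{E}[X^2] + (1 - \rho)\, \mathbb{E}[X]\, \mathbb{E}[Y]
                 = \rho p + (1 - \rho) p^2, \\
\operatorname{Cov}(X, X') &= \rho p + (1 - \rho) p^2 - p^2 = \rho\, p (1 - p), \\
\operatorname{Corr}(X, X') &= \frac{\rho\, p (1 - p)}{\sqrt{p (1 - p)}\, \sqrt{p (1 - p)}} = \rho .
\end{align*}
```

So each pair of vertices has the same marginal edge probability p in both graphs, and the correlation between the two edge indicators is the noise parameter ρ.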
To emphasize: the left hand side of Figure 7 is from seeded graph matching the connectome to a noisy connectome, and the right hand side of Figure 7 is from seeded graph matching synthetic data of the same density as the connectome to a noisy version of this synthetic data. The latter amounts precisely to seeded graph matching pairs of correlated Bernoulli random graphs in which the noise parameter is the total correlation, so the plots in the right hand side of Figure 7 are of the Section 5.2 variety (except that the alignment strength values are averaged over 20 instantiations).
Notice that the plots in the left hand side of Figure 7 and their respective counterparts in the right hand side of Figure 7 look remarkably similar in many important ways. The main difference seems to be that the seeded graph matching success and alignment strength values clearly exhibit thresholding in the synthetic data, whereas the thresholding is less pronounced and more gradual in the connectome data, although the sharpness of the connectome thresholding seems to be catching up as the number of vertices increases. Aside from this, there still seems to be a reasonable phantom alignment strength for the connectome data.

Real data; matching same objects under different modalities
In this section, we illustrate the ideas in this paper using three real data sets from [20]; they are the Wikipedia, Enron, and C. Elegans pairs of graphs. Each is an example of a pair of graphs with the same underlying objects (thus there is a natural "true" bijection), and adjacencies between objects in the respective graphs are relationships among the objects in two different modalities.
The Wikipedia pair of graphs G 1 , G 2 from [20] were created in the year 2009.The vertices of G 1 are the English language Wikipedia articles hyperlinked from the Wikipedia article "Algebraic Geometry," and all Wikipedia articles hyperlinked from these articles; in total, there are n = 1382 vertices.These vertices/articles each have directly corresponding articles in the French language Wikipedia, and these are the vertices of G 2 .Every pair of vertices/articles in G 1 are adjacent in G 1 precisely when one of the articles hyperlinks to the other article in the English language Wikipedia, and every pair of vertices in G 2 are adjacent in G 2 precisely when one of the articles links to the other in the French language Wikipedia.Thus G 1 and G 2 are simple, undirected graphs, and the "true" bijection is the function mapping English articles to their French versions.
For each value of s = 0, 5, 50, 150, 250, 382, 500, we did 100 replicates of uniformly sampling s seeds from the n vertices and seeded graph matching G 1 to G 2 using SGM; the alignment strength str( φSGM ), averaged over the 100 replicates, is plotted (in blue) vs the number of seeds s in Figure 8. In the same figure, the match ratio (the number of nonseeds correctly matched, divided by the number of nonseeds), averaged over the 100 replicates, is plotted (in purple) vs the number of seeds s. In addition, for each value of s = 0, 5, 50, 150, 250, 382, 500, we did 100 replicates of realizing uncorrelated pairs of Erdos-Renyi graphs H 1 , H 2 , each with 1382 vertices, with the Bernoulli parameter of H 1 equal to the density of G 1 and the Bernoulli parameter of H 2 equal to the density of G 2 , then uniformly sampling s seeds from the 1382 vertices and seeded graph matching H 1 to H 2 using SGM; the alignment strength str( φSGM ), averaged over the 100 replicates, is plotted (in green) vs the number of seeds s in Figure 8. These values represent the phantom alignment strength values at the respective seed levels. Note that, as the number of seeds went from 0 to 5 to 50, the jump in match ratio coincides with a jump in the gap between seeded graph matching alignment strength and the phantom alignment strength. (Even when s = 0 there is some truth in the graph match; the match ratio was 0.0151, approximately 21 nonseed vertices matched correctly, whereas chance is 1/1382, one nonseed vertex matched correctly.)

The C. Elegans pair of graphs G e , G ch from [51,20] are connectomes mapping out the neural structure of the roundworm Caenorhabditis Elegans. C. Elegans is of interest to neuroscientists due to its well studied genetics [52], comparatively simple nervous system [53], and a growing understanding of the correspondence between the two [54,55]. Like in humans, communication in the C. Elegans nervous system occurs via synapses, or junctions, between pairs of neurons. Neuronal synapses in the C. Elegans connectome can be classified in two ways [51]: an electrical synapse is a channel through which electrical impulses traverse, whereas chemical synapses are junctions through which neurotransmitters flow. We consider n = 279 somatic neurons of the hermaphrodite C. Elegans as the vertices of each graph. For each pair of vertices/neurons, they are adjacent in G e precisely when there is an electrical synapse between them, and they are adjacent in G ch precisely when there is a chemical synapse between them.
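The quantities tracked in the Wikipedia experiment, which is reused for the graph pairs that follow, can be sketched as below. Several assumptions are ours: graphs are sets of pairs (u, v) with u < v on vertices 0..n-1, matchings are lists indexed by vertex, alignment strength is taken to be one minus the number of disagreements divided by the average number of disagreements over all bijections (as defined earlier in the paper, computed here over the full vertex set), and `matcher` is a hypothetical stand-in for a seeded graph matching routine such as SGM.

```python
import random
from itertools import combinations

def density(n, edges):
    return len(edges) / (n * (n - 1) / 2)

def alignment_strength(n, e1, e2, phi):
    """1 - disagreements(phi) / (average disagreements over all bijections)."""
    disagreements = 0
    for u, v in combinations(range(n), 2):
        a, b = phi[u], phi[v]
        mapped = (a, b) if a < b else (b, a)
        disagreements += ((u, v) in e1) != (mapped in e2)
    d1, d2 = density(n, e1), density(n, e2)
    # Closed form for the average; it is zero only in degenerate cases
    # (e.g. both graphs empty), which this sketch does not handle.
    average = (n * (n - 1) / 2) * (d1 * (1 - d2) + (1 - d1) * d2)
    return 1 - disagreements / average

def match_ratio(phi, phi_true, seeds):
    """Number of nonseeds correctly matched, divided by the number of nonseeds."""
    nonseeds = [v for v in range(len(phi)) if v not in seeds]
    return sum(phi[v] == phi_true[v] for v in nonseeds) / len(nonseeds)

def erdos_renyi(n, p, rng):
    return {pair for pair in combinations(range(n), 2) if rng.random() < p}

def phantom_baseline(n, d1, d2, s, matcher, replicates, rng):
    """Mean alignment strength from matching uncorrelated Erdos-Renyi pairs."""
    total = 0.0
    for _ in range(replicates):
        h1, h2 = erdos_renyi(n, d1, rng), erdos_renyi(n, d2, rng)
        seeds = set(rng.sample(range(n), s))
        phi = matcher(n, h1, h2, seeds)   # hypothetical SGM-like callable
        total += alignment_strength(n, h1, h2, phi)
    return total / replicates
```

As a check of the arithmetic in the s = 0 case, a match ratio of 0.0151 over 1382 nonseeds corresponds to 0.0151 × 1382 ≈ 20.9, i.e. about 21 correctly matched vertices.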
We conducted the same experiments as we did for the Wikipedia graphs, except that the numbers of seeds s considered were s = 0, 1, 5, 10, 20, 50, 75, 100, 150, 200, and we matched G e and G ch . The resulting plots are in Figure 9; alignment strength of the C. Elegans seeded graph match in blue, phantom alignment strength in green, match ratio in purple. Note that seeded graph matching did poorly, as evidenced by low match ratio, even when the number of seeds was huge (200 seeds and 79 nonseeds), and correspondingly the gap between seeded graph matching alignment strength and phantom alignment strength was small.

The Enron graphs from [20] arose in the following manner. Enron was a large and highly respected energy company that dissolved spectacularly in 2001 amid systemic fraud. The United States Justice Department released a trove of email messages between company employees. The graphs G 130 , G 131 , and G 132 have as vertices n = 184 Enron employees and, for each pair of vertices/employees, the vertices are adjacent in G 130 precisely when there is an email from one employee to the other in week number 130 of the email corpus, they are adjacent in G 131 precisely when there is an email from one employee to the other in week number 131, and they are adjacent in G 132 precisely when there is an email from one employee to the other in week number 132. The paper [56] identified an anomaly going into week 132, and [20] used match ratio differences between pairs of these graphs to highlight this anomaly.
We conducted the same experiments for each of the pairs G 130 , G 131 and G 131 , G 132 and G 130 , G 132 as we did for the Wikipedia graphs, except that the numbers of seeds s considered were s = 0, 1, 5, 10, 20, 50, 60, 90, 100. The resulting plots are in Figure 10. As noted in [20], the match ratio from matching G 130 to G 131 is highest of the three, since the anomaly had not yet occurred. The next highest match ratio was from matching G 131 to G 132 , then came matching G 130 to G 132 . Note that the gap between seeded graph matching alignment strength and phantom alignment strength was ordered the same way; highest was G 130 to G 131 , then was G 131 to G 132 , and then was G 130 to G 132 . Indeed, the gap was larger when the match ratio was higher. (Note that the match ratios here differ a bit from those in the paper [20], Figure 8; that figure was inadvertently from a nonsimple graph version of the data, and here we created a simple graph.)

Notable mentions and future directions, plus caveats
The applications of graph matching are broad and many, and getting the right answer is only valuable when we know that we have the right answer. This paper provides principled tools that can help the practitioner decide if seeded graph matching has found the true bijection.
The first caveat, which is also a future direction, is that we are presenting a conjecture, and not a theorem. Indeed, the Phantom Alignment Strength Conjecture, as formulated in Section 4, includes terms in quotes; "moderate," "high probability," "very different," and "negligible." Ironing these terms out with specifics is part of the puzzle of proving the conjecture, and is an important next task. It may be a hard task, and we expect this paper to stimulate more experimentation, fine-tuning, and eventually a proof of the conjecture.
Part of the first caveat is the consideration that we don't have a proof of the conjecture as of now, and our experimentation is wide but not exhaustive, and thus there may be additional hypotheses or limitations to the conjecture statement.
A second caveat is that the conjecture is expressed in terms of an underlying model for a pair of random graphs, and we need to consider if particular real data that we may encounter (beyond the examples that we used here) can more generally be considered as arising from such a model. Also, when there are multiple blocks with Bernoulli parameters being realized from different distributions for different blocks, we saw in Section 5.4 that total correlation became a much less reliable tool for determining matchability. More work is needed to explore this further; the paper [39], when presenting empirical evidence for the relationship between total correlation and matchability, restricted its attention to the setting hypothesized in our Phantom Alignment Strength Conjecture, which excludes multiple blocks. Indeed, in the setting of our conjecture, the role of total correlation in matchability is starkly visible. See Figure 2d, where the x-axis is total correlation, and compare to Figure 11. Figure 11 is the same data plotted in Figure 2d, except that the x-axis is used for edge correlation instead of total correlation. The contrast between these two figures is quite dramatic. Indeed, the current-literature-standard yardstick of edge correlation failed miserably in capturing matchability, whereas total correlation captured matchability perfectly here. These two figures are a powerful illustration of the role of total correlation in matchability.
Figure 11: The same as the plot in Figure 2d, except that the x-axis in this figure is edge correlation instead of total correlation. The contrast between these two figures is quite stark, and highlights the utility of total correlation with regard to matchability.
While more work remains to be done, we have here presented principled tools that can be of significant help to the practitioner now.
A large measure of inspiration for this paper came from Figures 1, 2, and 3 of [39] (on which half of us are co-authors). Those figures displayed the results of graph matchings of many simulations of pairs of correlated Bernoulli random graphs under conditions similar to those of the Phantom Alignment Strength Conjecture. One axis of each figure tracked edge correlation, and the second axis tracked heterogeneity correlation; green, yellow, and red dots were located at coordinates corresponding to parameters where the graph matchings were always the truth, mostly the truth, and often not the truth, respectively. It was striking to observe that the regions of red and green were sharply demarcated by a level curve of total correlation, with little yellow between the red and green. These figures starkly demonstrated the role of total correlation in matchability, as well as thresholding behavior. Together with the theoretical results of [39] tying alignment strength to total correlation (when graph matching gets truth), we had important ingredients for the "hockey stick" at the heart of the Phantom Alignment Strength Conjecture.

The correlated Bernoulli random graph model
Definition 1 Given a positive integer n and a vertex set V such that |V | = n, the parameters of the correlated Bernoulli random graph model are Bernoulli parameters p {u,v} ∈ [0, 1] for each $\{u, v\} \in \binom{V}{2}$, and an edge correlation parameter e ∈ [0, 1]. The pair of random graphs (G 1 , G 2 ) has a correlated Bernoulli random graph distribution when the following holds: G 1 and G 2 each have vertex set V . For each $\{u, v\} \in \binom{V}{2}$,

Figure 3: Phantom alignment strength as a function of n, fitted to f p (n) := d p + c p log n n .

Figure 8: Matching English and French Wikipedia graphs
Figure 5: Experiment B in Section 5.4; same as Experiment A except that F 2,2 is uniform [0, 1].