
Characterizing the hypergraph-of-entity and the structural impact of its extensions

Abstract

The hypergraph-of-entity is a joint representation model for terms, entities and their relations, used as an indexing approach in entity-oriented search. In this work, we characterize the structure of the hypergraph, from a microscopic and macroscopic scale, as well as over time with an increasing number of documents. We use a random walk based approach to estimate shortest distances and node sampling to estimate clustering coefficients. We also propose the calculation of a general mixed hypergraph density measure based on the corresponding bipartite mixed graph. We analyze these statistics for the hypergraph-of-entity, finding that hyperedge-based node degrees are distributed as a power law, while node-based node degrees and hyperedge cardinalities are log-normally distributed. We also find that most statistics tend to converge after an initial period of accentuated growth in the number of documents. We then repeat the analysis over three extensions—materialized through synonym, context, and tf_bin hyperedges—in order to assess their structural impact in the hypergraph. Finally, we focus on the application-specific aspects of the hypergraph-of-entity, in the domain of information retrieval. We analyze the correlation between the retrieval effectiveness and the structural features of the representation model, proposing ranking and anomaly indicators, as useful guides for modifying or extending the hypergraph-of-entity.

Introduction

Complex networks have frequently been studied as graphs, but only recently has attention been given to the study of complex networks as hypergraphs (Estrada and Rodriguez-Velazquez 2005). The hypergraph-of-entity (Devezas and Nunes 2019) is a hypergraph-based model used to represent combined data (Bast et al. 2016, §2.1.3). That is, it is a joint representation of corpora and knowledge bases, integrating terms, entities and their relations. It attempts to solve, by design, the issues of representing combined data through inverted indexes and quad indexes. The hypergraph-of-entity, together with its random walk score (Devezas and Nunes 2019, §4.2.2), is also an attempt to generalize several tasks of entity-oriented search. This includes ad hoc document retrieval and ad hoc entity retrieval, as well as the recommendation-like tasks of related entity finding and entity list completion. However, there is a tradeoff. On the one hand, the random walk score acts as a general ranking function. On the other hand, it performs below traditional baselines like TF-IDF (term frequency \(\times\) inverse document frequency). Since ranking is particularly dependent on the structure of the hypergraph, a characterization is a fundamental step towards improving the representation model and, with it, the retrieval performance.

Accordingly, our focus was on studying the structural features of the hypergraph. This is a task that presents some challenges, from both a practical and a theoretical perspective. While there are many tools (Csardi and Nepusz 2006; Bastian et al. 2009) and formats (Himsolt 1997; Brandes et al. 2001) for the analysis and transfer of graphs, hypergraphs still lack clear frameworks to perform these functions, making their analysis less trivial. Even formats like GraphML (Brandes et al. 2001) only support undirected hypergraphs. Furthermore, several aspects of hypergraphs are still under study, some of which are trivial in graph theory. For example, the adjacency matrix is a well-established representation of a graph, whereas recent work is still focused on defining an adjacency tensor for representing general hypergraphs (Ouvrard et al. 2017). As a scientific community, we have been analyzing graphs since 1735 and, even now, innovative ideas in graph theory are still being researched (Aparicio et al. 2018). However, the concept of hypergraph is much younger, dating from 1970 (Berge 1970), and thus there are still many open challenges and contribution opportunities.

In this work, which is an extended version of our previous characterization work (Devezas and Nunes 2019), we take a practical application of hypergraphs in the domain of information retrieval, the hypergraph-of-entity, as an opportunity to establish a basic framework for the analysis of hypergraphs. We expand on our previous work by analyzing the impact of two extensions (synonymy and contextual similarity) that had previously been defined for this representation model (Devezas and Nunes 2019), and we also introduce and characterize a new extension, based on the idea of segmenting the document into different sets of terms according to discretizations of term frequency (TF-bins, or term frequency bins). The main contributions of this work are the following:

  • Analysis of multiple versions of real-world hypergraph data structures being developed for information retrieval;

  • Proposal of a practical analysis framework for hypergraphs;

  • Proposal of estimation approaches for the computation of shortest paths and clustering coefficients in hypergraphs;

  • Proposal of a computation approach for the density of general mixed hypergraphs based on a corresponding bipartite graph representation;

  • Example of an application in the context of information retrieval, where structural features were measured over different hypergraph-based models and presented in context with the performance of each model.

The remainder of this document is organized as follows. In “Reference work” section, we begin by providing an overview on the analysis of the inverted index, knowledge bases and hypergraphs, covering the three main aspects of the hypergraph-of-entity. In “Hypergraph characterization approach” section, we describe our characterization approach, covering shortest distance estimation with random walks and clustering coefficient estimation with node sampling, as well as proposing a general mixed hypergraph density formula by establishing a parallel with the corresponding bipartite mixed graph. In “Analyzing the hypergraph-of-entity base model” section, we present the results of a characterization experiment of the hypergraph-of-entity for a subset of the INEX (INitiative for the Evaluation of XML Retrieval) 2009 Wikipedia collection and, in “Analyzing the structural impact of different index extensions” section, we explore the effect of including synonyms, contextual similarity, or TF-bins in the structure of the hypergraph. In “An application to information retrieval” section, we assess the retrieval effectiveness of the representation model, analyzing the correlations between the evaluation metrics and the structural features (“Correlating evaluation metrics and structural features” section), and proposing ranking and anomaly indicators based on our conclusions (“Design rules for modifying or extending the hypergraph-of-entity” section). Finally, in “Conclusion” section, we close with the conclusions and future work.

Reference work

The hypergraph-of-entity is a representation model for indexing combined data, jointly modeling unstructured textual data from corpora and structured interconnected data from knowledge bases. As such, before analyzing a hypergraph from this model, we surveyed existing literature on inverted index analysis, as well as knowledge base analysis. We then surveyed literature specifically on the analysis of hypergraphs, particularly focusing on statistics like clustering coefficient, shortest path lengths and density.

Analyzing inverted indexes

There are several models based on the inverted index that combine documents and entities (Bhagdev et al. 2008; Bast and Buchhold 2013) and that are comparable with the hypergraph-of-entity. There has also been work that analyzed the inverted index, particularly regarding query evaluation speed and space requirements (Voorhees 1986; Zobel et al. 1998).

Voorhees (1986) compared the efficiency of the inverted index with the top-down cluster search. She analyzed the storage requirements of four test collections, measuring the total number of documents and terms, as well as the average number of terms per document. She then analyzed the disk usage per collection, measuring the number of bytes for document vectors and the inverted index. Finally, she measured CPU time in number of instructions and the I/O time in number of data pages accessed at least once, also including the query time in seconds.

Zobel et al. (1998) took a similar approach to compare the inverted index and signature files. First, they characterized two test collections, measuring size in megabytes, number of records and distinct words, as well as the record length, and the number of words, distinct words and distinct words without common terms per record. They also analyzed disk space, memory requirements, ease of index construction, ease of update, scalability and extensibility.

For the hypergraph-of-entity characterization, we do not focus on measuring efficiency, but rather on studying the structure and size of the hypergraph.

Analyzing knowledge bases

Several studies have characterized the entities and triples in knowledge bases. In particular, given the graph structure of RDF (resource description framework), we are interested in understanding which statistics are relevant, for instance, to discriminate between typed nodes.

Halpin (2009) took advantage of Microsoft’s Live.com query log to reissue entity and concept queries over their FALCON-S semantic web search engine. They then studied the results, characterizing their source, triple structure, RDF and OWL (web ontology language) classes and properties, and the power-law distributions of the number of URIs, both returned as results and as part of the triples linking to the results. They focused mostly on measuring the frequency of different elements or aggregations (e.g., top-10 domain names for the URIs, most common data types, top vocabulary URIs).

Ge et al. (2010) defined an object link graph, induced from the RDF graph, based on paths linking objects (or entities), as long as they are either direct or established through blank nodes. They then studied this graph for the Falcons Crawl 2008 and 2009 datasets (FC08 and FC09), which included URLs from domains like bio2rdf.org or dbpedia.org. They characterized the object link graph based on density, using the average degree as an indicator, as well as connectivity, analyzing the largest connected component and the diameter. They repeated the study for characterizing the structural evolution of the object link graph, as well as its domain-specific structures (according to URL domains). Comparing two snapshots of the same data enabled them to find evidence of the scale-free nature of the network. While the graph almost doubled in size from FC08 to FC09, the average degree remained the same and the diameter actually decreased.

Fernández et al. (2016) focused on studying the structural features of RDF data, identifying redundancy through common structural patterns, proposing several specific metrics for RDF graphs. In particular, they proposed several subject and object degrees, accounting for the number of links from/to a given subject/object (outdegree and indegree), the number of links from a \(\langle\)subject, predicate\(\rangle\) (partial outdegree) and to a \(\langle\)predicate, object\(\rangle\) (partial indegree), the number of distinct predicates from a subject (labeled outdegree) and to an object (labeled indegree), and the number of objects linked from a subject through a single predicate (direct outdegree), as well as the number of subjects linking to an object through a single predicate (direct indegree). They also measured predicate degree, outdegree and indegree. They proposed common ratios to account for shared structural roles of subjects, predicates and objects (e.g., subject-object ratio). Global metrics were also defined for measuring the maximum and average outdegree of subject and object nodes for the whole graph. Another analysis approach was focused on the predicate lists per subject, measuring the ratio of repeated lists and their degree, as well as the number of lists per predicate. Finally, they also defined several statistics to measure typed subjects and classes, based on the rdf:type predicate.

While we study a hypergraph that jointly represents terms, entities and their relations, we focus on a similar characterization approach, that is more based on structure and less based on measuring performance.

Analyzing hypergraphs

Hypergraphs (Berge 1970) have been around since 1970. While this concept was introduced by Claude Berge in that year, there had been other contributions surrounding the topic, namely in extremal graph and set theory. Post-1970, the work by Erdös (1971) and Brown et al. (1973) illustrates the intersection between extremal graph theory and hypergraph theory, while, pre-1970, we can also find contributions like Sperner’s theorem (Sperner 1928), in extremal set theory, or the Turán number (Turán 1941, 1961), in extremal graph theory. Interestingly, hypergraphs have remained somewhat fringe in network science, perhaps due to Paul Erdös's resistance to the concept (Berge 1970):

At the Balatonfüred Conference (1969), P. Erdös and A. Hajnal asked us why we would use hypergraphs for problems that can be also formulated in terms of graphs. The answer is that by using hypergraphs, one deals with generalizations of familiar concepts. Thus, hypergraphs can be used to simplify as well as to generalize.

Although Erdös himself, who was interested in exploring the representation of graphs using set intersections (Erdös et al. 1966), also studied hypergraph problems, he avoided this designation, only sparsely using it (Brown et al. 1973):

By an r-graph we mean a fixed set of vertices together with a class of unordered subsets of this fixed set, each subset containing exactly r elements and called an r-tuple. In the language of Berge (1970) this is a simple uniform hypergraph of rank r.

Hypergraphs are data structures that can capture higher-order relations. As such, they either present conceptually different or multiple counterparts to the equivalent graph statistics. Take for instance the node degree. While graphs only have a node degree, indegree and outdegree, hypergraphs can also have a hyperedge degree, which is the number of nodes in a hyperedge (Klamt et al. 2009). The hyperedge degree also exists for directed hyperedges, in the form of a tail degree and a head degree. The tail degree is based on the cardinality of the source node set and the head degree is based on the cardinality of the target node set. In this work we will rely on the degree, clustering coefficient, average path length, diameter and density to characterize the hypergraph-of-entity.

Building on the work by Gallo et al. (1993), who extended Dijkstra’s algorithm to hypergraphs, and the work by Ausiello et al. (1992), who tackled the same problem using a dynamic approach, Gao et al. (2015) have also proposed two algorithms for computing shortest paths in hypergraphs. The first, HyperEdge-based Dynamic Shortest Path (HE-DSP), like Gallo et al., proposed an extension to Dijkstra’s algorithm. The second, Dimension Reduction Dynamic Shortest Path (DR-DSP), relied on an induced graph with the same vertex set, adding weighted edges when a hyperedge containing the two vertices exists in the corresponding hypergraph, while selecting the minimum weight over all available hyperedges for the pair of vertices.

In this work, we focus on approximated computation approaches, which are useful for large-scale hypergraphs. Ribeiro et al. (2012) proposed the use of multiple random walks to find shortest paths in power law networks. They found that random walks had the ability to observe a large fraction of the network and that two random walks, starting from different nodes, would intersect with a high probability. Głąbowski et al. (2012) contributed with a shortest path computation solution based on ant colony optimization, clearly structuring it as pseudocode, while providing several configuration options. Parameters included the number of ants, the influence of pheromones and other data in determining the next step, the speed of evaporation of the pheromones, the initial, minimum and maximum pheromone levels, the initial vertex and an optional end vertex. Li (2011) studied the computation of shortest paths in electric networks based on random walk models and ant colony optimization, proposing a current reinforced random walk model inspired by the previous two. In this work, we also use a random walk based approach to approximate shortest paths and estimate the average path length and diameter of the graph.

Gallagher and Goldberg (2013, Eq. 4) provide a comprehensive review on clustering coefficients for hypergraphs. The proposed approach for computing the clustering coefficient in hypergraphs accounted for a pair of nodes, instead of a single node, as is more common in graphs. Based on these two-node clustering coefficients, the node clustering coefficient was then calculated. Two-node clustering coefficients measured the fraction of common hyperedges between two nodes, through the intersection of the incident hyperedge sets for the two nodes. The authors then provided different normalization approaches, either based on the union, the maximum or minimum cardinality, or the square root of the product of the cardinalities of the hyperedge sets. The clustering coefficient for a node was then computed based on the average two-node clustering coefficient for the node and its neighbors.

The codegree Turán density (Mubayi and Zhao 2007) \(\gamma (\mathcal {F})\) can be computed for a family \(\mathcal {F}\) of k-uniform hypergraphs, also known as k-graphs. It is calculated based on the codegree Turán number \(\hbox {co-ex}(n, \mathcal {F})\)—the extremal number based on the codegree in a hypergraph, instead of the degree in a graph—which takes as parameters the number of nodes n and the family \(\mathcal {F}\) of k-graphs. In turn, the codegree Turán number is calculated based on the minimum number of nodes, taken from all sets of \(r-1\) vertices of each hypergraph \(H_n\) that, when united with an additional vertex, will form a hyperedge from H. The codegree density for a family \(\mathcal {F}\) of hypergraphs is then computed based on \(\limsup _{n \rightarrow \infty } \frac{\hbox {co-ex}(n, \mathcal {F})}{n}\). Since this was the only concept of density we found associated with hypergraphs or, more specifically, a family of k-uniform hypergraphs, we opted to propose our own density formulation (“Hypergraph characterization approach” section). Furthermore, the hypergraph-of-entity is a single general mixed hypergraph. In other words, it is not a family of hypergraphs, it contains hyperedges of multiple degrees (it’s not k-uniform, but general) and it contains undirected and directed hyperedges (it’s mixed). Accordingly, we propose a density calculation based on the counterpart bipartite graph of the hypergraph, where hyperedges are translated to bridge nodes.

Methodology

In this section, we introduce general concepts and definitions, formally providing mathematical support for this analysis. Next, we present the characterization methodology and propose approaches to estimate shortest distances, clustering coefficients and density. Finally, we describe the methodology for a practical application of this analysis framework in the domain of information retrieval.

General concepts and definitions

We provide a mathematical framework, where we formalize several concepts and definitions, including relevant classes of hypergraphs, as well as useful properties and statistics, that we rely upon across this manuscript.

Classes of hypergraphs

In this section we formally define hypergraph, distinguishing between undirected, directed and mixed, as well as uniform and general.

Definition 1

(Hypergraph) Let v be a vertex and V be a set of vertices such that \(v \in V\), with \(n = |V|\) being the number of vertices. Let \(E = E_U \cup E_D\) be the set of all hyperedges, where \(E_U\) represents the subset of undirected hyperedges \(e_U \in E_U\) and \(E_D\) the subset of directed hyperedges \(e_D \in E_D\), with \(m = |E_U| + |E_D| = |E|\) being the total number of hyperedges. Let also a set \(e_U \subseteq V\) be an undirected hyperedge and a tuple of sets \(e_D = (t, h)\) be a directed hyperedge formed by a tail set \(t \subseteq V\) (source) and a head set \(h \subseteq V\) (target). A hypergraph is then a tuple H = (V, E).

Definition 2

(Hypergraph direction) Under this notation, a hypergraph H = (V, E) is said to be:

  • Undirected, when \(E = E_U\) or, equivalently, \(E_D = \emptyset\);

  • Directed, when \(E = E_D\) or, equivalently, \(E_U = \emptyset\);

  • Mixed, when \(E_U \ne \emptyset \wedge E_D \ne \emptyset\).

Definition 3

(Hypergraph uniformity) A uniform or k-uniform hypergraph is characterized by all of its hyperedges being defined over the same number k of vertices. For an undirected hyperedge \(e_U\) it means \(|e_U| = k\), while for a directed hyperedge \(e_D = (t, h)\) it means \(|t| + |h| = k\).

On the other hand, a non-uniform hypergraph is said to be a general hypergraph, which contains hyperedges of diverse cardinalities.

*Please refer to Banerjee and Char (2017) for more information on directed uniform hypergraphs.
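
As an illustration of Definitions 1 to 3, the following minimal Python sketch classifies a small hypergraph by direction and checks its uniformity. The representation (frozensets for undirected hyperedges, (tail, head) pairs for directed ones) and the names `direction` and `is_uniform` are assumptions made for this example only; they are not taken from the hypergraph-of-entity implementation.

```python
from typing import FrozenSet, List, Tuple

Undirected = FrozenSet[str]
Directed = Tuple[FrozenSet[str], FrozenSet[str]]  # (tail, head)


def direction(e_u: List[Undirected], e_d: List[Directed]) -> str:
    """Classify a hypergraph as in Definition 2."""
    if e_u and e_d:
        return "mixed"
    if e_d:
        return "directed"
    # Assumption: a hypergraph without any hyperedges is reported as undirected.
    return "undirected"


def is_uniform(e_u: List[Undirected], e_d: List[Directed], k: int) -> bool:
    """Check k-uniformity as in Definition 3: |e_U| = k and |t| + |h| = k."""
    return all(len(e) == k for e in e_u) and all(
        len(t) + len(h) == k for t, h in e_d
    )


# Toy hypergraph: one undirected hyperedge and one directed hyperedge.
E_U = [frozenset({"a", "b", "c"})]
E_D = [(frozenset({"a"}), frozenset({"b", "d"}))]

print(direction(E_U, E_D))      # mixed
print(is_uniform(E_U, E_D, 3))  # True
```

Any hypergraph for which `is_uniform` fails for every k is a general hypergraph in the sense used above.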

Definition 4

(Hyperedge incidence) Let \(v \in V\) have the following sets of incident hyperedges:

  • \(E_v = E_{U_v} \cup E_{D_v}\) as the set of all incident hyperedges to v, ignoring direction;

  • \(E_v^- = E_{U_v} \cup E_{D_v}^-\) as the set of all incoming hyperedges to v;

  • \(E_v^+ = E_{U_v} \cup E_{D_v}^+\) as the set of all outgoing hyperedges from v.

Hypergraph statistics

In this section, we formally describe the hypergraph statistics that we rely upon for our analysis framework. In particular we describe the different degrees that can be computed for a vertex, the cardinalities of hyperedges, the diameter and average shortest path length, the clustering coefficient, and the density.

Definition 5

(Vertex-based vertex degree) Let \(d_v(v)\) be the degree of a vertex measured based on the number of adjacent vertices.

Vertex-based degree (ignoring direction) is given by:

$$\begin{aligned} d_v(v) = \sum _{e_U \in E_{U_v}} |e_U| + \sum _{(t, h) \in E_{D_v}} \left( |t| + |h| \right) \end{aligned}$$

Vertex-based indegree is given by:

$$\begin{aligned} d_v^-(v) = \sum _{e_U \in E_{U_v}} |e_U| + \sum _{(t, h) \in E_{D_v}^-} |t| \end{aligned}$$

And vertex-based outdegree is given by:

$$\begin{aligned} d_v^+(v) = \sum _{e_U \in E_{U_v}} |e_U| + \sum _{(t, h) \in E_{D_v}^+} |h| \end{aligned}$$
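
The three vertex-based degrees can be sketched in Python as follows. The sketch applies the formulas of Definition 5 literally, so the full cardinality of each incident hyperedge is summed. It assumes that a directed hyperedge is incoming to v when v belongs to its head and outgoing from v when v belongs to its tail; the function name and the toy data are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Undirected = FrozenSet[str]
Directed = Tuple[FrozenSet[str], FrozenSet[str]]  # (tail, head)


def vertex_based_degrees(
    v: str, e_u: List[Undirected], e_d: List[Directed]
) -> Tuple[int, int, int]:
    """Return (d_v, d_v_in, d_v_out) for vertex v, following Definition 5."""
    # Undirected incident hyperedges contribute to all three degrees.
    undirected = sum(len(e) for e in e_u if v in e)
    # Ignoring direction: every directed hyperedge with v in its tail or head.
    total = undirected + sum(
        len(t) + len(h) for t, h in e_d if v in t or v in h
    )
    # Assumption: incoming hyperedges are those with v in the head.
    indegree = undirected + sum(len(t) for t, h in e_d if v in h)
    # Assumption: outgoing hyperedges are those with v in the tail.
    outdegree = undirected + sum(len(h) for t, h in e_d if v in t)
    return total, indegree, outdegree


E_U = [frozenset({"a", "b", "c"})]
E_D = [(frozenset({"a"}), frozenset({"b", "d"}))]

print(vertex_based_degrees("b", E_U, E_D))  # (6, 4, 3)
print(vertex_based_degrees("a", E_U, E_D))  # (6, 3, 5)
```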

Definition 6

(Hyperedge-based vertex degree) Let \(d_h(v)\) be the degree of a vertex measured based on the number of incident hyperedges.

Hyperedge-based degree (ignoring direction) is given by:

$$\begin{aligned} d_h(v) = |E_v| \end{aligned}$$

Hyperedge-based indegree is given by:

$$\begin{aligned} d_h^-(v) = |E_v^-| \end{aligned}$$

And hyperedge-based outdegree is given by:

$$\begin{aligned} d_h^+(v) = |E_v^+| \end{aligned}$$
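
A minimal Python sketch of the hyperedge-based degrees, which only count the incident hyperedges of Definition 4 instead of summing their sizes. As an assumption of this example, a directed hyperedge counts as incoming when v is in its head and as outgoing when v is in its tail; the function name and data are illustrative.

```python
from typing import FrozenSet, List, Tuple

Undirected = FrozenSet[str]
Directed = Tuple[FrozenSet[str], FrozenSet[str]]  # (tail, head)


def hyperedge_based_degrees(
    v: str, e_u: List[Undirected], e_d: List[Directed]
) -> Tuple[int, int, int]:
    """Return (d_h, d_h_in, d_h_out) for vertex v, following Definition 6."""
    n_undirected = sum(1 for e in e_u if v in e)
    # |E_v|: undirected plus all directed incident hyperedges.
    total = n_undirected + sum(1 for t, h in e_d if v in t or v in h)
    # |E_v^-|: undirected plus incoming directed hyperedges (v in head).
    indegree = n_undirected + sum(1 for t, h in e_d if v in h)
    # |E_v^+|: undirected plus outgoing directed hyperedges (v in tail).
    outdegree = n_undirected + sum(1 for t, h in e_d if v in t)
    return total, indegree, outdegree


E_U = [frozenset({"a", "b", "c"})]
E_D = [(frozenset({"a"}), frozenset({"b", "d"}))]

print(hyperedge_based_degrees("b", E_U, E_D))  # (2, 2, 1)
print(hyperedge_based_degrees("d", E_U, E_D))  # (1, 1, 0)
```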

Definition 7

(Hyperedge cardinality) Let c(e) be the cardinality of a hyperedge measured based on the number of nodes it contains. Let \(e_U\) be an undirected hyperedge and \(e_D = (t, h)\) be a directed hyperedge.

Undirected hyperedge cardinality is given by:

$$\begin{aligned} c(e_U) = |e_U| \end{aligned}$$

Directed hyperedge cardinality is given by:

$$\begin{aligned} c(e_D) = |t| + |h| \end{aligned}$$

In order to index hyperedges based on their number of nodes, we also use the notation \(E_U^a\) to represent sets of undirected hyperedges of cardinality \(a = |e_U|\), as well as \(E_D^{a,b}\) to represent sets of directed hyperedges with a tail of size \(a = |t|\) and a head of size \(b = |h|\).
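
The indexing notation \(E_U^a\) and \(E_D^{a,b}\) can be illustrated with a short Python sketch that groups hyperedges by cardinality. The function name `index_by_cardinality` and the toy hyperedges are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Undirected = FrozenSet[str]
Directed = Tuple[FrozenSet[str], FrozenSet[str]]  # (tail, head)


def index_by_cardinality(
    e_u: List[Undirected], e_d: List[Directed]
) -> Tuple[Dict[int, List[Undirected]], Dict[Tuple[int, int], List[Directed]]]:
    """Group undirected hyperedges by a = |e_U| and directed ones by (|t|, |h|)."""
    by_a: Dict[int, List[Undirected]] = defaultdict(list)
    by_ab: Dict[Tuple[int, int], List[Directed]] = defaultdict(list)
    for e in e_u:
        by_a[len(e)].append(e)
    for t, h in e_d:
        by_ab[(len(t), len(h))].append((t, h))
    return dict(by_a), dict(by_ab)


E_U = [frozenset({"a", "b", "c"}), frozenset({"a", "b"}), frozenset({"c", "d"})]
E_D = [(frozenset({"a"}), frozenset({"b", "d"}))]

E_U_index, E_D_index = index_by_cardinality(E_U, E_D)
print(len(E_U_index[2]), len(E_U_index[3]))  # 2 1
print(len(E_D_index[(1, 2)]))                # 1
```

The sizes of these groups are exactly the counts needed to plot a hyperedge cardinality distribution.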

Definition 8

(Diameter/avg. short. path len.) Let L be the set of shortest path lengths between all pairs of connected nodes. Let \(\ell _{u, v} \in L\) be the length of the shortest path between nodes u and v from the vertex set V. For \(e_{U_i}, e_{U_j} \in E_U\) and \(e_{D_i}, e_{D_j} \in E_D\), we define L as follows:

$$\begin{aligned} L = \left\{ \ell _{u, v} : \,\, u \in e_{U_i} \wedge v \in e_{U_j} \,\, \vee \,\, u \in t \wedge (t, \cdot ) \in e_{D_i} \wedge v \in h \wedge (\cdot , h) \in e_{D_j} \right\} \end{aligned}$$

The diameter is then given by:

$$\begin{aligned} \max L \end{aligned}$$

And the average shortest path length is given by:

$$\begin{aligned} \displaystyle \frac{1}{|L|} \sum _{\ell _{i,j} \in L} \ell _{i,j}. \end{aligned}$$

Definition 9

(Clustering coefficient) The clustering coefficient measures the degree to which nodes tend to agglomerate in dense groups. We compute this metric based on the following approach by Gallagher and Goldberg (2013). Let \(E_v = E_{U_v} \cup E_{D_v}\) be the set of incident hyperedges to v, ignoring direction. Let N(v) be the set of all vertices adjacent to v (i.e., sharing a hyperedge, while ignoring direction).

The clustering coefficient cc(u, v) for a pair of nodes u and v is given by:

$$\begin{aligned} \displaystyle cc(u, v) = \frac{|E_u \cap E_v|}{|E_u \cup E_v|} \end{aligned}$$

The clustering coefficient cc(v) for a single node v is given by:

$$\begin{aligned} \displaystyle cc(v) = \frac{1}{|N(v)|} \sum _{u \in N(v)} cc(u, v) \end{aligned}$$

And the clustering coefficient cc(H) for the hypergraph is given by:

$$\begin{aligned} \displaystyle cc(H) = \frac{1}{|V|} \sum _{v \in V} cc(v) \end{aligned}$$
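
The three levels of Definition 9 can be sketched in Python as follows. Since direction is ignored, each hyperedge is passed as a plain node set (for a directed hyperedge, the union of tail and head). The function name `clustering`, the convention that a node without neighbors has a coefficient of zero, and the toy data are assumptions of this example.

```python
from typing import Dict, List, Sequence, Set, Tuple


def clustering(
    vertices: Sequence[str], hyperedges: List[Set[str]]
) -> Tuple[Dict[str, float], float]:
    """Return (cc(v) for every v, cc(H)) using the Jaccard-based cc(u, v)."""
    # E_v: indices of the hyperedges incident to each vertex.
    incident: Dict[str, Set[int]] = {v: set() for v in vertices}
    for i, edge in enumerate(hyperedges):
        for v in edge:
            incident[v].add(i)

    def cc_pair(u: str, v: str) -> float:
        union = incident[u] | incident[v]
        return len(incident[u] & incident[v]) / len(union) if union else 0.0

    cc_node: Dict[str, float] = {}
    for v in vertices:
        # N(v): all vertices sharing at least one hyperedge with v.
        neighbors: Set[str] = set()
        for i in incident[v]:
            neighbors |= hyperedges[i]
        neighbors.discard(v)
        # Assumption: isolated vertices get a clustering coefficient of zero.
        cc_node[v] = (
            sum(cc_pair(u, v) for u in neighbors) / len(neighbors)
            if neighbors
            else 0.0
        )
    cc_h = sum(cc_node.values()) / len(cc_node) if cc_node else 0.0
    return cc_node, cc_h


V = ["a", "b", "c", "d"]
H = [{"a", "b", "c"}, {"a", "b", "d"}]
cc_node, cc_h = clustering(V, H)
print(cc_node["a"], cc_h)
```

In this toy hypergraph, a and b share both hyperedges, so cc(a, b) = 1, while cc(a, c) = cc(a, d) = 1/2, which gives cc(a) = 2/3 and cc(H) = 7/12.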

Definition 10

(Density) We transform a hypergraph H = (V, E) into its corresponding bipartite graph \(G_H = (\mathcal {V}, \mathcal {E})\), using the density of \(G_H\) as an indicator of density for H.

The vertices \(\mathcal {V}\) of \(G_H\) are based on the vertices V and hyperedges E from H and are given by:

$$\begin{aligned} \mathcal {V} = V \cup \{ v_e : e \in E \} \end{aligned}$$

The edges \(\mathcal {E} = \mathcal {E}_U \cup \mathcal {E}_D\) of \(G_H\) are established based on all pairs of vertices connected by a hyperedge \(E = E_U \cup E_D\) from H.

The undirected edges \(\mathcal {E}_U\) of \(G_H\) are given by:

$$\begin{aligned} \mathcal {E}_U = \{ (u, v_e), (v_e, w) : e \in E_U \wedge u \in e \wedge w \in e \} \end{aligned}$$

And the directed edges \(\mathcal {E}_D\) of \(G_H\) are given by:

$$\begin{aligned} \mathcal {E}_D = \{ (u, v_e), (v_e, w) : e=(t, h) \in E_D \wedge u \in t \wedge w \in h \} \end{aligned}$$

Density D(H), or simply D, is then given by:

$$\begin{aligned} \displaystyle D = D(G_H) = \frac{2|\mathcal {E}_U| + |\mathcal {E}_D|}{2|\mathcal {V}|\left( |\mathcal {V}| - 1\right) } \end{aligned}$$
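
The density of Definition 10 can be computed without materializing \(G_H\), as in the following Python sketch. It assumes that each undirected hyperedge e contributes |e| undirected edges to \(\mathcal {E}_U\) (one per member, linking it to the bridge node \(v_e\)) and that each directed hyperedge (t, h) contributes |t| + |h| directed edges to \(\mathcal {E}_D\). This counting, the function name and the toy data are assumptions of this example.

```python
from typing import FrozenSet, List, Sequence, Tuple

Undirected = FrozenSet[str]
Directed = Tuple[FrozenSet[str], FrozenSet[str]]  # (tail, head)


def density(
    vertices: Sequence[str], e_u: List[Undirected], e_d: List[Directed]
) -> float:
    """Density of the bipartite graph induced by a general mixed hypergraph."""
    # |V| of the bipartite graph: original vertices plus one bridge node per hyperedge.
    n_bipartite = len(vertices) + len(e_u) + len(e_d)
    if n_bipartite < 2:
        return 0.0
    # Assumed edge counts: one edge per (vertex, hyperedge) incidence.
    undirected_edges = sum(len(e) for e in e_u)
    directed_edges = sum(len(t) + len(h) for t, h in e_d)
    return (2 * undirected_edges + directed_edges) / (
        2 * n_bipartite * (n_bipartite - 1)
    )


V = ["a", "b", "c", "d"]
E_U = [frozenset({"a", "b", "c"})]
E_D = [(frozenset({"a"}), frozenset({"b", "d"}))]

print(density(V, E_U, E_D))
```

For this toy hypergraph, \(|\mathcal {V}| = 4 + 2 = 6\), \(|\mathcal {E}_U| = 3\) and \(|\mathcal {E}_D| = 3\), so \(D = (2 \cdot 3 + 3) / (2 \cdot 6 \cdot 5) = 0.15\).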

Hypergraph characterization approach

Graphs can be characterized at a microscopic, mesoscopic and macroscopic scale. The microscopic analysis is concerned with statistics at the node-level, such as the degree or clustering coefficient. The mesoscopic analysis is concerned with statistics and patterns at the subgraph-level, such as communities, network motifs or graphlets. The macroscopic analysis is concerned with statistics at the graph-level, such as average clustering coefficient or diameter. In this work, our analysis of the hypergraph is focused on the microscopic and macroscopic scales. We compute several statistics for the whole hypergraph, as well as for snapshot hypergraphs that depict growth over time. Some of these statistics are new to hypergraphs, when compared to traditional graphs. For instance, nodes in directed graphs have an indegree and an outdegree. However, nodes in directed hypergraphs have four degrees, based on incoming and outgoing nodes, as well as on incoming and outgoing hyperedges. While in graphs all edges are binary, leading to only one other node, in hypergraphs hyperedges are n-ary, leading to multiple nodes, and thus different degree statistics. While some authors use ‘degree’ to refer to node and hyperedge degrees (Yu and Sun 2018, §4; Klamt et al. 2009, §Network Statistics in Hypergraphs), in this work we opted to use the ‘degree’ designation when referring to nodes and the ‘cardinality’ designation when referring to hyperedges. This is to avoid any confusion, for instance, between a “hyperedge-induced” node degree and a hyperedge cardinality.

We analyze the base model, as well as three models based on the synonyms, contextual similarity and TF-bins extensions. For the full hypergraph of each of the four models, we compute the following global statistics:

  • Number of nodes, in total and per type;

  • Number of hyperedges, in total, per direction, and per type;

  • Average degree;

  • Average clustering coefficient;

  • Average path length;

  • Diameter;

  • Density.

We also plot the following distributions for the full hypergraph:

  • Node degree distributions per node type:

    • Node-based node degree;

    • Hyperedge-based node degree.

  • Hyperedge cardinality distributions per hyperedge type.

Then, we define a temporal analysis framework based on an increasing number of documents (i.e., time passes as documents are added to the hypergraph-of-entity index). We prepare several snapshots, with a different number of documents each, for each of the four models. We then compute and plot the following statistics for each snapshot, showing its evolution as the number of documents increases:

  • Average node degree over time;

  • Average hyperedge cardinality over time;

  • Average diameter and average path length over time;

  • Average clustering coefficient over time;

  • Average density over time;

  • Size over time:

    • Number of nodes;

    • Number of hyperedges;

    • Space in disk;

    • Space in memory.

Finally, we also measure the run time for several operations, in order to understand the efficiency cost and the evolution of its behavior for an increasing number of documents:

  • Index creation time;

  • Global statistics computation time;

  • Node degrees computation time;

  • Hyperedge cardinalities computation time.

In order to support large-scale hypergraphs, we compute the average path length, diameter, clustering coefficient, and density using approximated strategies. We estimate shortest distances based on random walks, the clustering coefficient based on node sampling, and the density based on a bipartite graph induced from the hypergraph, although without the need to explicitly create this graph. The following sections will detail these approaches.

Estimating shortest distances with random walks

Ribeiro et al. (2012) found that, in power law networks, there is a high probability that two random walk paths, usually starting from different nodes, will intersect and share a small fraction of nodes. We took advantage of this conclusion, adapting it to a hypergraph, in order to compute a sample of shortest paths and their length, used to estimate the average path length and diameter. We considered two (ordered) sets \(S_1 \subset V\) and \(S_2 \subset V\) of nodes sampled uniformly at random, each of size \(s = |S_1| = |S_2|\). We then launched r random walks of length \(\ell\) from each pair of nodes \(S_1^i\) and \(S_2^i\). For a given pair of random walk paths, we iterated over the nodes in the path starting from \(S_1^i\), until we found a node in common with the path starting from \(S_2^i\). At that point, we merged the two paths based on the common node, discarding the suffix of the first path and the prefix of the second path. We computed the length of these paths, keeping only the minimum length over the r repeats. As the number of iterations r increased, we progressively approximated the shortest path for the pair of nodes. Despite the inherent estimation error, this method can be used to study even large-scale hypergraphs. Precision can be controlled by tuning the number of sampled nodes and random walks, which will eventually lead to convergence for large values. This approach enabled us to generate a sample of approximated shortest path lengths, which could be used to compute the estimated diameter (its maximum) and the estimated average path length (its mean), in a scenario where high precision is not critical. This is true for instance for a quick or initial analysis of a hypergraph. Given the repeated research iterations over the hypergraph-of-entity and the multitude of tests carried out over this model, a quick estimation approach is ideal.
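
The following Python sketch illustrates one possible reading of this estimation procedure; it is not the implementation used in this work. It assumes that hyperedge direction is ignored, that a random walk step picks a uniformly random incident hyperedge and then a uniformly random other node in it, and that the merged path connects the two start nodes through the first node of the first walk that also occurs in the second walk. All names (`random_walk`, `merged_length`, `estimate_distances`) and the toy chain hypergraph are hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Optional, Tuple


def random_walk(start: int, incident: Dict[int, List[int]],
                members: Dict[int, List[int]], length: int,
                rng: random.Random) -> List[int]:
    """Walk up to `length` steps: random incident hyperedge, then random other node."""
    path = [start]
    for _ in range(length):
        edges = incident.get(path[-1], [])
        if not edges:
            break
        candidates = [n for n in members[rng.choice(edges)] if n != path[-1]]
        if not candidates:
            break
        path.append(rng.choice(candidates))
    return path


def merged_length(p1: List[int], p2: List[int]) -> Optional[int]:
    """Steps from p1[0] to p2[0] through the first node of p1 that occurs in p2."""
    first_pos: Dict[int, int] = {}
    for j, node in enumerate(p2):
        first_pos.setdefault(node, j)  # keep the earliest position in p2
    for i, node in enumerate(p1):
        if node in first_pos:
            return i + first_pos[node]
    return None  # the two walks never intersected


def estimate_distances(nodes: List[int], incident: Dict[int, List[int]],
                       members: Dict[int, List[int]], s: int, r: int,
                       length: int, seed: int = 0) -> Dict[Tuple[int, int], int]:
    """Estimate distances for s sampled node pairs, keeping the minimum over r repeats."""
    rng = random.Random(seed)
    s1, s2 = rng.sample(nodes, s), rng.sample(nodes, s)
    estimates: Dict[Tuple[int, int], int] = {}
    for u, v in zip(s1, s2):
        if u == v:
            continue
        lengths = []
        for _ in range(r):
            d = merged_length(random_walk(u, incident, members, length, rng),
                              random_walk(v, incident, members, length, rng))
            if d is not None:
                lengths.append(d)
        if lengths:
            estimates[(u, v)] = min(lengths)
    return estimates


# Toy chain hypergraph: hyperedge k contains nodes k and k + 1.
members = {k: [k, k + 1] for k in range(5)}
incident = {n: [k for k, ns in members.items() if n in ns] for n in range(6)}
est = estimate_distances(list(range(6)), incident, members, s=4, r=50, length=10, seed=1)
if est:
    print("estimated diameter:", max(est.values()))
    print("estimated average path length:", mean(est.values()))
```

Under these assumptions every estimate is an upper bound on the true distance, since the merged path is a valid path between the two start nodes, and increasing r can only lower the estimate.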

Estimating clustering coefficients with node sampling

In a graph, the clustering coefficient is usually computed for a single node and averaged over the whole graph. As shown by Gallagher and Goldberg (2013, §I.A.), in hypergraphs the clustering coefficient is computed, at the most atomic level, for a pair of nodes. The clustering coefficient for a node is then computed based on the averaged two-node clustering coefficients between the node and each of its neighbors (cf. Gallagher and Goldberg (2013, Eq.4)). Three options were provided for calculating the two-node clustering coefficient, one of them based on the Jaccard index between the neighboring hyperedges of each node (Gallagher and Goldberg 2013, Eq.1), which we use in this work. While a global understanding of the clustering coefficient is useful for characterizing overall local connectivity in the hypergraph, the existence of a random hypergraph generation model, like the Watts–Strogatz model (Watts and Strogatz 1998) for graphs, would provide further interpretations at a mesoscale. We leave this open and instead focus on the macroscale.

Continuing with the philosophy of large-scale hypergraph support in our analysis framework, as opposed to computing the clustering coefficient for all nodes, we estimated the clustering coefficients for a smaller sample \(S \subseteq V\) of nodes. Furthermore, for each sampled node \(s_i \in S\), we also sampled its neighbors \(N_S(s_i)\) for computing the two-node clustering coefficients. We then applied the described equations to obtain the clustering coefficients for each node \(s_i\) and a global clustering coefficient based on the overall average. For \(S = V \wedge N_S(s_i) = N(s_i)\), where \(N_S\) denotes the sampled neighbors and N the full set of neighbors, we would obtain the exact clustering coefficient. Again, this approach offers two parameters that can be controlled as a tradeoff between efficiency and effectiveness.
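A minimal Python sketch of this sampling strategy follows. It assumes the Jaccard-based two-node coefficient (shared incident hyperedges over all incident hyperedges of the pair) and a node coefficient given by the plain average over the sampled neighbors; the function names, parameters and toy hypergraph are hypothetical and only meant as an illustration.

```python
import random


def build_incidence(hyperedges):
    """Map each node to the set of hyperedge ids that contain it."""
    incidence = {}
    for edge_id, nodes in hyperedges.items():
        for node in nodes:
            incidence.setdefault(node, set()).add(edge_id)
    return incidence


def node_clustering(node, hyperedges, incidence, max_neighbors, rng):
    """Average Jaccard-based two-node coefficient over (sampled) neighbors."""
    edges_u = incidence[node]
    neighbors = set()
    for edge_id in edges_u:
        neighbors |= hyperedges[edge_id]
    neighbors.discard(node)
    if not neighbors:
        return 0.0
    neighbors = sorted(neighbors)
    if len(neighbors) > max_neighbors:
        neighbors = rng.sample(neighbors, max_neighbors)
    total = 0.0
    for other in neighbors:
        edges_v = incidence[other]
        total += len(edges_u & edges_v) / len(edges_u | edges_v)
    return total / len(neighbors)


def estimate_clustering(hyperedges, sample_size, max_neighbors, seed=0):
    """Global coefficient as the mean over a uniform sample of nodes."""
    rng = random.Random(seed)
    incidence = build_incidence(hyperedges)
    nodes = sorted(incidence)
    sample = rng.sample(nodes, min(sample_size, len(nodes)))
    values = [node_clustering(n, hyperedges, incidence, max_neighbors, rng) for n in sample]
    return sum(values) / len(values)


# With sample sizes covering all nodes and neighbors, the result is exact.
hyperedges = {"e1": {"a", "b", "c"}, "e2": {"a", "b"}}
exact = estimate_clustering(hyperedges, sample_size=10, max_neighbors=10)
print(exact)
```

In the toy example, nodes a and b each obtain (1 + 1/2) / 2 = 0.75 and node c obtains 0.5, so the exact global coefficient is 2/3. Lowering `sample_size` or `max_neighbors` trades accuracy for speed.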

Computing the density of general mixed hypergraphs

A general mixed hypergraph is general (or non-uniform) in the sense that its hyperedges can contain an arbitrary number of vertices, and it is mixed in the sense that it can contain hyperedges that are either undirected or directed. We compute a hypergraph’s density by analogy with its corresponding bipartite graph, which contains all nodes from the hypergraph, along with connector nodes representing the hyperedges.

Consider the hypergraph H = (V, E), with \(n = |V|\) nodes and \(m = |E|\) hyperedges. Also consider the set of all undirected hyperedges \(E_U\) and directed hyperedges \(E_D\), where \(E = E_U \cup E_D\). Their subsets \(E_U^k\) and \(E_D^{k_1,k_2}\) should also be respectively considered, where \(E_U^k\) is the subset of undirected hyperedges with k nodes and \(E_D^{k_1,k_2}\) is the subset of directed hyperedges with \(k_1\) tail (source) nodes, \(k_2\) head (target) nodes and \(k = k_1 + k_2\) nodes, assuming the hypergraph only contains directed hyperedges between disjoint tail and head sets. This means that the union of \(E_U = E_U^1 \cup E_U^2 \cup E_U^3 \cup \cdots\) and \(E_D = E_D^{1,1} \cup E_D^{1,2} \cup E_D^{2,1} \cup E_D^{2,2} \cup \cdots\) forms the set of all hyperedges E. We use it as a way to distinguish between hyperedges with different degrees. This is important because, depending on the degree k, the hyperedge will contribute differently to the density, when considering the corresponding bipartite graph. For instance, one undirected hyperedge with degree k = 4 will contribute with four edges to the density. Accordingly, we derive the density of a general mixed hypergraph as shown in Eq. 1.

$$\begin{aligned} D = \frac{2 \sum _k k |E_U^k| + \sum _{k_1,k_2} (k_1 + k_2) |E_D^{k_1,k_2}| }{ 2 (n + m) (n + m - 1) } \end{aligned}$$
(1)

In practice, this is nothing more than a comprehensive combination of the density formulas for undirected and directed graphs. On one side, we consider the density of a mixed graph that results from the combination of an undirected simple graph and a directed simple graph. That is, each pair of nodes can be connected, at most, by an undirected edge and two directed edges of opposing directions. On the other side, we use hypergraph notation to directly obtain the required statistics from the corresponding mixed bipartite graph, thus calculating the analogous density for a hypergraph.
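As a worked illustration of Eq. 1, the following Python sketch computes the density from the hyperedge cardinalities alone, without materializing the bipartite graph. The function name and its argument layout (a list of undirected cardinalities and a list of tail and head sizes) are assumptions made for this example.

```python
def mixed_hypergraph_density(n, undirected_sizes, directed_sizes):
    """Density of a general mixed hypergraph following Eq. 1.

    n                -- number of nodes
    undirected_sizes -- cardinality k of each undirected hyperedge
    directed_sizes   -- (k1, k2) tail and head sizes of each directed hyperedge
    """
    m = len(undirected_sizes) + len(directed_sizes)
    total = n + m  # nodes of the corresponding bipartite graph
    if total < 2:
        return 0.0
    numerator = 2 * sum(undirected_sizes) + sum(k1 + k2 for k1, k2 in directed_sizes)
    return numerator / (2 * total * (total - 1))


# Four nodes, one undirected hyperedge with k = 4 and one directed
# hyperedge with one tail node and two head nodes.
density = mixed_hypergraph_density(4, [4], [(1, 2)])
print(density)  # 11/60, roughly 0.183
```

Here the numerator is 2 × 4 + (1 + 2) = 11 and the denominator is 2 × 6 × 5 = 60, since the bipartite graph has n + m = 6 nodes.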

Contextualizing through a practical application

In order to study the usefulness of the analysis framework that we propose, we explore it in the context of an information retrieval application. In particular, our use case is based on ad hoc document retrieval (leveraging entities). For this retrieval task, given a keyword query, the goal is to retrieve and rank the documents that best answer the information need of the user. As an entity-oriented search task, the approach must take into account entities, mentioned in documents, and their relations to improve retrieval performance. Evaluation is then done based on a set of topics (whose title is usually used as the keyword query), along with a set of relevance judgments, containing relevance grades assigned by the judges on multiple retrieved documents.

In this experiment, we attempt to identify individual properties of the hypergraph that correlate with the retrieval performance scores that we compute. We identify indicator properties that help us rank our models by effectiveness, as well as identify models that might be low performers. Although this is also a contribution of this work, we consider it to be secondary, compared to the analysis framework that we propose.

Data modeling

In this section, we begin by presenting the test collection that we use to build several hypergraphs based on the hypergraph-of-entity model. Then, we provide an overview of the hypergraph-of-entity, describing the construction approach of the hypergraphs that we study, and a description of the random walk score. Finally, we present the motivation to characterize this unified model for entity-oriented search.

INEX 2009 Wikipedia collection

In this work, we characterize hypergraphs built based on different versions of the hypergraph-of-entity model, relying upon the INEX 2009 Wikipedia collection (Schenkel et al. 2007). We also explore an application in the domain of information retrieval, where assessment is dependent on the topics and relevance judgments from the INEX 2010 Ad Hoc track. In this section, we describe this test collection, including the main dataset and the subset prepared for the analysis and information retrieval application, as well as the associated topics and relevance judgments, also known as qrels (query relevance set).

Main dataset The INEX 2009 Wikipedia collection (Footnote 2) is an XML version of articles from the English Wikipedia, based on the dump from October 8, 2008, and incorporating semantic annotations from the 2008-w40-2 version of YAGO (Yet Another Great Ontology) (Footnote 3). Like DBpedia (Footnote 4), YAGO is a semantic knowledge base, containing structured data from Wikipedia, WordNet and GeoNames. The INEX 2009 Wikipedia collection is provided in multiple tar.bz2 archives that contain nearly 2.7 million articles, requiring 50.7 GB of disk space when uncompressed and only 5.5 GB when compressed, and it relies on over 5800 classes from YAGO, including people, movies, and cities. Each XML document also contains links to other articles, corresponding to the hyperlinks found in the Wikipedia dump. In total, there are nearly 102 million XML elements in the collection. In order to build the hypergraph, we rely on the text nodes of the \(\texttt {<bdy>}\) element, as well as on the \(\texttt {<link>}\) elements to create semantic triples that capture the different entity names based on mentions. The structure of the hypergraph will be further detailed in “Hypergraph-of-entity representation and retrieval model” section. For our application to information retrieval (“An application to information retrieval” section), we also rely on the qrels for the INEX 2010 Ad Hoc track (Footnote 5), in a study to determine possible correlations between the effectiveness of ad hoc document retrieval (leveraging entities) and the properties of the hypergraphs. Provided relevance grades are binary (0 for irrelevant and 1 for relevant).

INEX 2009 10T-NL subset Due to the space and time complexity of the hypergraph-of-entity, we prepared a smaller subset of the INEX 2009 Wikipedia collection, that we could use to circumvent performance issues. In fact, characterizing the corresponding hypergraph-of-entity for a smaller subset will enable us to identify weaknesses in our model that could help us improve the scalability or retrieval effectiveness of future versions. The subset was created based on a random sample of 10 topics (‘10T’). In particular, the following topics were considered: 2010003, 2010014, 2010023, 2010032, 2010038, 2010040, 2010049, 2010057, 2010079, 2010096. We then included only documents mentioned in the relevance judgments for the selected topics, optionally considering linked documents (in this case, we did not include linked documents—accordingly, ‘NL’ stands for “no linked”).

Hypergraph-of-entity representation and retrieval model

The hypergraph-of-entity (Devezas and Nunes 2019) is a unified model for entity-oriented search. It provides a joint representation for corpora and knowledge bases, through a general mixed hypergraph, containing the types of nodes and hyperedges described in Table 1. Ranked retrieval then relies on a universal ranking function, called the random walk score, that supports multiple entity-oriented search tasks, by simply controlling the input (e.g., keyword or entity query) and output (e.g., documents or entities): ad hoc document retrieval (leveraging entities), ad hoc entity retrieval, and entity list completion.

Table 1 Hypergraph-of-entity nodes and hyperedges for the base model and the extensions

Representation model

In this work, we explore multiple hypergraph-of-entity versions of the representation model, including:

  • Base model, with term and entity nodes, and document, related_to and contained_in hyperedges;

  • Synonyms model, extending the base model with synonym hyperedges;

  • Contextual similarity model, extending the base model with context hyperedges;

  • TF-bins models, extending the base model with tf_bin hyperedges, according to the selected number of bins (we experiment with 2–10 TF-bins).

Each of the analyzed hypergraphs is built by indexing the INEX 2009 Wikipedia collection, based on the text in the \(\texttt {<bdy>}\) element and semantic triples formed from \(\texttt {<link>}\) elements, where the subject is the entity described by the current article and the object is the entity described by the linked article. No predicates are considered, as these are not a part of the model.

Synonyms are context-based. Our goal is for disambiguation of context to happen naturally through the additional information provided by terms and entities grouped through document hyperedges, as well as from the related_to hyperedges between entities. A given synonym will be more frequently visited by a random walk, when a higher number of paths from the query nodes (which establish context) also lead the walker there.

Contextual similarity is defined for terms that are frequently surrounded by similar sequences of terms, i.e., that are used in a similar context. In order to establish a relation of contextual similarity, we rely on word2vec (Mikolov et al. 2013) to obtain a distributed representation of words (i.e., a word embedding—a vector of latent features that semantically represents a word). After obtaining the word embeddings, we simply use a k-nearest neighbors approach to find the k most similar words based on cosine similarity, ensuring a similarity above 0.5. The original term, as well as the k-nearest neighbors are then grouped in a context hyperedge.
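The neighbor selection step can be sketched in Python as follows. The sketch assumes that the word embeddings have already been trained and are available as a dictionary from terms to vectors; the two-dimensional toy vectors and the function names are hypothetical and only serve to illustrate the cosine similarity threshold.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_hyperedge(term, embeddings, k, threshold=0.5):
    """Group a term with its k most similar terms above the threshold."""
    scored = [
        (cosine(embeddings[term], vector), other)
        for other, vector in embeddings.items()
        if other != term
    ]
    scored = [(score, other) for score, other in scored if score > threshold]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return {term} | {other for _, other in scored[:k]}


# Toy embeddings (hypothetical values, not produced by word2vec).
embeddings = {
    "car": (1.0, 0.0),
    "automobile": (0.9, 0.1),
    "truck": (0.8, 0.3),
    "banana": (0.0, 1.0),
}
print(context_hyperedge("car", embeddings, k=2))
print(context_hyperedge("banana", embeddings, k=2))
```

In this toy example, "banana" has no neighbor above the 0.5 threshold, so its set contains only the term itself and would not add any new connection to the hypergraph.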

Term frequency bins (or TF-bins) are computed as follows. For each document, we calculate the term frequency and, for a given number of bins n, we compute the percentiles \(P_n = \{ 100\frac{x}{n} \mid x \in \mathbb {Z}^+ \wedge x \le n \}\), assigning them the weight \(w(x) = \frac{x}{n}\). So, for example, if we consider n = 4 bins, then we compute the percentiles \(P_4 = \{ 25, 50, 75, 100 \}\), resulting in four values of TF (term frequency). Let us for instance consider the following term frequency for 10 documents: 1, 1, 1, 1, 2, 2, 2, 2, 2, 3. This would result in the value 1 for the 25 percentile, 2 for the 50 and 75 percentiles, and 3 for the 100 percentile. We would then form the TF intervals ]0, 1], ]1, 2], ]2, 2] and ]2, 3], with the interval ]2, 2] having no matches in \(\mathbb {Z}^+\), thus making it redundant. Per document, and for each non-empty interval, a weighted hyperedge was then created to group terms with a similar term frequency (i.e., within the same TF-bin). This can be used by the ranking function, to issue biased random walks, controlling the flow in a way that the walker will be driven towards documents with a higher TF for the query terms.
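The binning example above can be reproduced with a short Python sketch. The percentile method is not specified in the text, so the sketch assumes nearest-rank percentiles, which yield the same values (1, 2, 2 and 3) for this example; the function name, the term labels and the output layout are hypothetical.

```python
def tf_bins(term_frequencies, n_bins):
    """Group terms into TF-bins delimited by nearest-rank percentiles.

    Returns one entry per non-empty interval ]lower, upper], with the
    weight w(x) = x / n_bins of the percentile that closes the interval.
    """
    values = sorted(term_frequencies.values())
    bins = []
    lower = 0
    for x in range(1, n_bins + 1):
        # Nearest-rank percentile: ceil(x / n_bins * N), in integer arithmetic.
        rank = (x * len(values) + n_bins - 1) // n_bins
        upper = values[rank - 1]
        terms = {t for t, tf in term_frequencies.items() if lower < tf <= upper}
        if terms:  # empty intervals such as ]2, 2] are skipped
            bins.append({"interval": (lower, upper), "weight": x / n_bins, "terms": terms})
        lower = upper
    return bins


# The ten term frequencies used in the example: 1, 1, 1, 1, 2, 2, 2, 2, 2, 3.
frequencies = dict(zip(
    ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"],
    [1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
))
bins = tf_bins(frequencies, n_bins=4)
for b in bins:
    print(b["interval"], b["weight"], sorted(b["terms"]))
```

Three hyperedges result for this document: one for ]0, 1] with weight 0.25, one for ]1, 2] with weight 0.5, and one for ]2, 3] with weight 1.0, while the redundant interval ]2, 2] produces none.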

Retrieval model

Ranked retrieval is done based on RWS (random walk score). A query can be formed by any combination of the elements represented in the hypergraph, as can the results that we score. Most commonly, we define the following three tasks:

  • Ad hoc document retrieval, which takes a keyword query as input (mapped to a set of term nodes) and ranks a set of documents, through their hyperedges, as output;

  • Ad hoc entity retrieval, which also takes a keyword query as input, but instead ranks a set of entities, through their nodes, as output;

  • Entity list completion, which takes an entity query as input (mapped to a set of entity nodes) and ranks a set of entities, through their nodes, as output.

In this work, however, we only explore the task of ad hoc document retrieval, to illustrate a practical application of our hypergraph analysis framework. Regardless of the retrieval task, the random walk score always runs over the whole hypergraph, scoring each node and hyperedge, based on multiple random walks launched from a set of seed nodes that are either a direct or an expanded representation of the query. The random walk score \(RWS(\ell , r, \Delta _{nf}, \Delta _{ef}, exp.)\) is a universal ranking function where, for each seed node, r random walks of length \(\ell\) are launched. Each node and hyperedge has a zero score by default, storing the number of visits by random walkers. This is then normalized between zero and one, by dividing by the overall maximum number of visits. The probability resulting from the normalization is then multiplied by the probability of the seed node being a good representative of the query; this is given by the fraction of query nodes linked to the seed node (always one for a direct representation of the query) and the total number of neighbors of the seed node (Devezas and Nunes 2019, §4.2). The parameters \(\Delta _{nf}\) and \(\Delta _{ef}\) are not used in the experiments we present here and thus are set to zero. The exp. parameter determines whether we use a direct or an expanded query representation. We set it to false, thus disabling expansion and using the existing nodes for the terms in the query as the seed nodes.
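The visit counting and normalization can be sketched in Python as shown below. This is a simplified illustration and not the actual implementation of the random walk score: it assumes undirected hyperedges only, a direct query representation (so the seed weight is one and is omitted), no use of \(\Delta _{nf}\) and \(\Delta _{ef}\), and a maximum number of visits taken over nodes and hyperedges together. The function name and the toy index are hypothetical.

```python
import random
from collections import Counter


def random_walk_score(seeds, hyperedges, r, length, seed=0):
    """Count visits of r random walks of a given length per seed node."""
    rng = random.Random(seed)
    incidence = {}
    for edge_id, nodes in hyperedges.items():
        for node in nodes:
            incidence.setdefault(node, []).append(edge_id)

    node_visits, edge_visits = Counter(), Counter()
    for seed_node in seeds:
        for _ in range(r):
            current = seed_node
            for _ in range(length):
                edge_id = rng.choice(incidence[current])
                edge_visits[edge_id] += 1
                current = rng.choice(sorted(hyperedges[edge_id]))
                node_visits[current] += 1

    # Normalize by the overall maximum number of visits (assumption: the
    # maximum is shared by nodes and hyperedges).
    max_visits = max(list(node_visits.values()) + list(edge_visits.values()), default=0)
    if max_visits == 0:
        return {}, {}
    node_scores = {n: v / max_visits for n, v in node_visits.items()}
    edge_scores = {e: v / max_visits for e, v in edge_visits.items()}
    return node_scores, edge_scores


# Toy index with three document hyperedges; the query is the term "walk".
hyperedges = {
    "doc1": {"graph", "walk", "entity"},
    "doc2": {"graph", "search"},
    "doc3": {"music"},
}
node_scores, edge_scores = random_walk_score(["walk"], hyperedges, r=100, length=5)
ranking = sorted(edge_scores, key=edge_scores.get, reverse=True)
print(ranking)
```

For ad hoc document retrieval, only the scores of document hyperedges are ranked. In this toy index, "doc3" is unreachable from the seed node and therefore receives no score.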

Why characterize the hypergraph-of-entity?

While the hypergraph-of-entity is able to serve as a unified framework for entity-oriented search, it is still severely outperformed by baselines like Lucene TF-IDF and BM25 (cf. Table 6). As such, we rely on hypergraph analysis to gain further insights on the structure, and to identify possible changes that could lead to a more effective and efficient model. Briefly, the reasons to characterize the hypergraph-of-entity are the following:

  • It supports decision making in the design iterations over the retrieval model;

  • Statistics like the average path length will help us tune the random walk score length parameter, and the clustering coefficient will help us understand how many repeated random walks to issue;

  • Understanding the evolution of the hypergraph, as the number of documents increases, also gives us insights on how to measure the impact of the pruning that we apply to the model (e.g., removing redundancies, or retaining only document keywords).

Analyzing the hypergraph-of-entity base model

We indexed a subset of the INEX 2009 Wikipedia collection (Schenkel et al. 2007) given by the 7487 documents appearing in the relevance judgments of 10 random topics. We then computed global statistics (macroscale), local statistics (microscale) and temporal statistics. Temporal statistics were based on an increasingly larger number of documents, by creating several snapshots of the index, through a ‘limit’ parameter, until all documents were considered.

Global statistics In Table 2, we present several global statistics about the hypergraph-of-entity, in particular the number of nodes and hyperedges, discriminated by type, the average degree, the average clustering coefficient, the average path length, the diameter and the density. The average clustering coefficient was computed based on a sample of 5000 nodes and a sample of 100,000 neighbors for each of those nodes. The average path length and the diameter were computed based on a sample of shortest distances between 30 random pairs of nodes and the intersections of 1000 random walks of length 1000 launched from each element of the pair. Finally, the density was computed based on Eq. 1. As we can see, for the 7487 documents the hypergraph contains 607,213 nodes and 253,154 hyperedges of different types, an average degree lower than one (0.83) and a low clustering coefficient (0.11). It is also extremely sparse, with a density of 3.9e−06. Its diameter is 17 and its average path length is 8.4, almost double when compared to a social network like Facebook (Backstrom et al. 2011).

Table 2 Global statistics for the base model

Local statistics Figure 1 illustrates the node degree distributions. In Fig. 1a, the node degree is based on the number of connected nodes, with the distribution approximating a log-normal behavior. In Fig. 1b, the node degree is based on the number of connected hyperedges, with the distribution approximating a power law. This shows the usefulness of considering both of the node degrees in the hypergraph-of-entity, as they are able to provide different information.

Fig. 1 Node degree distributions for the base model (log–log scale)

Figure 2 illustrates the hyperedge cardinality distribution. For document hyperedges, cardinality is log-normally distributed, while for related_to hyperedges the behavior is slightly different, with low cardinalities having a higher frequency than they would in a log-normal distribution. Finally, the cardinality distribution of contained_in hyperedges, while still heavy-tailed, presents an initial linear behavior, followed by a power law behavior. The maximum cardinality for this type of hyperedge is also 16, which is a lot lower when compared to document hyperedges and related_to hyperedges, which have cardinality 8167 and 3084, respectively. This is explained by the fact that contained_in hyperedges establish a directed connection between a set of terms and an entity that contains those terms, being limited by the maximum number of words in an entity.

Fig. 2 Hyperedge cardinality distribution based on the total number of nodes for the base model (log–log scale)

Temporal statistics In order to compute temporal statistics, we first generated 14 snapshots of the index based on a limit L of documents, for \(L \in \{1, 2, 3, 4, 5, 10, 25, 50, 100, 1000, 2000, 3000, 5000, 8000\}\). Each snapshot was built based on the natural order of the documents found within the tar.bz2 archives, up to a limit L, while the archives were accessed in directory order (i.e., the same as ls -U in Linux). This perfectly mimicked index growth, as documents were incrementally preprocessed and added to the hypergraph-of-entity.

Figure 3 illustrates the node-based and hyperedge-based average node degrees over time (represented as the number of documents in the index at a given instant). As we can see, both functions tend to converge, although this is clearer for the node-based degree, which reaches nearly 4000 nodes, through only 9 hyperedges, on average. Figure 4 illustrates the average undirected hyperedge cardinality over time, with a convergence behavior that approximates 300 nodes per hyperedge, after rising to an average of 411.88 nodes for L = 25 documents.

Fig. 3 Average node degree over time for the base model

Fig. 4 Average hyperedge cardinality over time for the base model

Figure 5 illustrates the evolution of the average path length and the diameter of the hypergraph over time. For a single document, these values reached 126.1 and 491, respectively, while, for just two documents, they immediately lowered to 3.8 and 10. For higher values of L, both statistics increased slightly, reaching 7.2 and 15 for the maximum number of documents. Notice that these last values measure the same quantities as those reported in Table 2 (8.4 and 17, respectively), despite differing in value. This is due to the precision errors in our estimation approach, resulting in a difference of 1.2 and 2, respectively, which is tolerable when computation resources are limited. In Fig. 6, we illustrate the evolution of the clustering coefficient, which rapidly decreases from 0.59 to 0.11. The low average path length and clustering coefficient point towards a weak community structure, possibly due to the coverage of diverse topics. However, we would require a random hypergraph generation model, like the Watts–Strogatz model (Watts and Strogatz 1998) for graphs, in order to properly interpret the statistics.

Fig. 5 Average estimated diameter and average shortest path over time for the base model

Fig. 6 Average estimated clustering coefficient over time for the base model

Figure 7 illustrates the evolution of the density over time. The density is consistently low, starting from 1.37e−03 and progressively decreasing to 3.91e−06 as the number of documents increases. This shows that the hypergraph-of-entity is an extremely sparse representation, with limited connectivity, which might benefit precision in a retrieval task.

Fig. 7 Average density over time for the base model

Figure 8 displays the number of nodes (Fig. 8a) and hyperedges (Fig. 8b) created over time, as the index grew. Both presented a sub-linear growth behavior, reaching 4566 nodes and 803 hyperedges for 10 documents, 238,141 nodes and 89,348 hyperedges for 2000 documents, and 607,213 nodes and 253,154 hyperedges for the whole collection of 7487 documents. The ratio of hyperedges per node evolved from 0.18, to 0.38, to 0.42, always staying below one. This means that the number of hyperedges increased more slowly than the number of nodes in absolute terms, even though the ratio itself grew. Moreover, we know that nodes represent terms and entities, which will eventually converge to a finite vocabulary, further decreasing index growth rate.
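The reported ratios follow directly from the node and hyperedge counts, as the following short Python check shows. The variable names are illustrative; the counts are the ones given above.

```python
# (nodes, hyperedges) per number of indexed documents, as reported above.
snapshots = {
    10: (4566, 803),
    2000: (238141, 89348),
    7487: (607213, 253154),
}

ratios = {docs: round(edges / nodes, 2) for docs, (nodes, edges) in snapshots.items()}
print(ratios)  # {10: 0.18, 2000: 0.38, 7487: 0.42}
```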

Fig. 8 Number of nodes and hyperedges over time for the base model

As shown in Fig. 9, we also measured the space usage of the hypergraph, both on disk (Fig. 9a) and in memory (Fig. 9b). On disk, the smallest snapshot required 43.8 KiB for one document, while the largest snapshot required 181.9 MiB for the whole subset. Average disk space over all snapshots was 37.5 MiB ± 58.9 MiB. In memory, for our particular application (Footnote 6), the smallest snapshot used 1.0 GiB for one document, including the overhead of the data structures, and the largest snapshot used 2.3 GiB for the whole subset. Average memory space over all snapshots was 1.3 GiB ± 461.1 MiB. Memory also grew faster for the first 1000 documents, apparently leading to an expected convergence, although we could not observe it for such a small subset.

Fig. 9 Required space for storing and loading the base model over time

Finally, Fig. 10 illustrates the base model run times of the following operations for an increasing number of documents: index creation (Fig. 10a); the computation of the global statistics (Fig. 10b), also shown in Table 2; the computation of all node degrees (Fig. 10c); and the computation of all hyperedge cardinalities (Fig. 10d). As we can see, the most significant increase in run time happens around 1000 documents, with the exception of the global statistics computation, which shows an increased run time for the first added documents. A possible reason for this anomaly is that this is the first analysis operation that we run after creating the index, which might influence the caching mechanisms of the system, thus reducing run time after the first documents and then resuming regular behavior. Indexing time took 1m09s for 1000 documents and 4m13s for a maximum of 8000 documents. The computation of global statistics took 17m26s for 1000 documents and 41m18s for a maximum of 8000 documents. Node degrees were computed in 4m27s for 1000 documents, taking 20m55s at most, while hyperedge cardinalities were computed in only 19s for 1000 documents, taking 44s at most, making it the most efficient statistic to compute.

Fig. 10 Base model run time statistics

Analyzing the structural impact of different index extensions

In this section, we extend our previous characterization work (Devezas and Nunes 2019) by taking into consideration the index extensions, applied over the hypergraph-of-entity base model, as described by Devezas and Nunes (2019, §4.1.2). In “Synonyms” and “Contextual similarity” sections, we study the structural impact of synonyms and context, respectively. In “Term frequency bins” section, we propose a new grouping of terms based on the discretization of the term frequency (TF-bins), studying the structural impact of this index extension, while also considering different numbers of bins.

Synonyms

The base model for the hypergraph-of-entity establishes n-ary connections, both directed and undirected, among nodes that represent terms and entities. Most visibly, document hyperedges group all terms and entities mentioned in a document, a lot like a bag of words and entities that integrates both unstructured and structured evidence. This model can easily be extended with synonyms, that establish new bridges between documents. In particular, we used the synsets from WordNet 3.0 (Miller 1995), based on the first sense of each term in the hypergraph, and only considering its noun form. Each synset was modeled as a synonym hyperedge. In this section, we characterize the hypergraph-of-entity when using the synonyms extension. We repeat the analysis described in “Analyzing the hypergraph-of-entity base model” section, but only cover results that show a different behavior from the base model.

Table 3 shows the global statistics for the synonyms model. As we can see, the number of terms increased from 323,672 (cf. Table 2) to 326,671. This means that 2999 synonym terms that did not originally belong to the collection were added. The number of undirected hyperedges increased significantly, with 10,650 new synonymy relations. The average degree slightly increased, with the average clustering coefficient and the density remaining stable. The diameter also remained at 17; however, the average path length decreased by almost one unit, from 8.37 to 7.53, bringing nodes closer together through the relation of synonymy. This is an indicator of the usefulness of using synonyms to establish new bridges between documents. In fact, we found 4558 new paths created by this extension, resulting in 65.29 documents linked on average per synonym. Besides global statistics, we also identified four interesting changes or new characteristics when compared to the base model:

  • Term node degree distribution;

  • Synonym hyperedge cardinality distribution;

  • Average hyperedge cardinality over time;

  • Average estimated diameter and average path length over time.

Table 3 Global statistics for the synonyms model

Term node degree distribution Figure 11 illustrates the node-based node degree distribution for entity and term nodes in the hypergraph-of-entity with the synonyms extension. While the behavior for entity nodes is similar to the base model, term nodes show a combination of a power law like behavior for the lower degrees, with a log-linear behavior for the remaining degrees. This is due to the introduction of synonyms from WordNet, which, as we can see in Fig. 12, follow a distribution close to a power law.

Fig. 11 Node degree distribution, based on connected nodes, for the synonyms model (log–log scale)

Fig. 12 WordNet 3.0 noun synonyms distribution (log–log scale)

Synonym hyperedge cardinality distribution Figure 13 illustrates the distribution of synonyms per hyperedge. As we can see, most synonym hyperedges either contain two or three terms, while less than 100 hyperedges contain more than five synonyms. Most synonymy relations are ternary and, while there is not enough data to conclude it, the overall behavior approximates a power law.

Fig. 13 Synonym hyperedge cardinality distribution (log–log scale)

Average hyperedge cardinality over time Consistent with the fact that most synsets introduced as undirected hyperedges have a low cardinality (two or three elements), the average hyperedge cardinality over time is overall lower than the base model. This is visible when comparing Fig. 14 with Fig. 4. Additionally, the behavior also changed from a fast growth and convergence behavior, in the base model, to a consistent sub-linear growth behavior. While convergence is not immediately clear in the synonyms model, the trend does point to such behavior.

Fig. 14 Average hyperedge cardinality over time for the synonyms model

Average estimated diameter and average path length over time With synonymy relations, both the average path length and the diameter start at a lower value than the base model, for only one document. Apart from the initial values, when comparing Fig. 15 with Fig. 5, we find a similar behavior, although the average path length decreases from 8.37, in the base model, to 7.53, in the synonyms model, when comparing a representation of the whole collection (cf. Tables 2 and 3). Despite the similar behavior, a difference of one unit is quite significant in a network (e.g., in a social network like Facebook, the average path length is 4.74 (Backstrom et al. 2012), while in the original small-world study by Milgram (1967) and Travers and Milgram (1977) the average path length was 6.2).

Fig. 15 Average estimated diameter and average shortest path over time for the synonyms model

Temporal statistics of run times Finally, Fig. 16 illustrates the synonyms model run times of the following operations for an increasing number of documents: index creation (Fig. 16a); the computation of the global statistics (Fig. 16b), also shown in Table 3; the computation of all node degrees (Fig. 16c); and the computation of all hyperedge cardinalities (Fig. 16d). As we can see, similarly to what happened for the base model, the most significant increase in run time happens around 1000 documents, with the exception of the global statistics computation, which shows an increased run time for the first added documents. We suspect that the same caching mechanisms described for the base model are responsible for this anomaly. In Fig. 16c, we also find a slight decrease in run time from 5000 to 8000 documents, which we do not find significant, as it was perhaps due to temporary load on the virtual machine. Indexing time took 1m13s for 1000 documents and 4m22s for a maximum of 8000 documents. The computation of global statistics took 17m07s for 1000 documents and 39m13s for a maximum of 8000 documents. Node degrees were computed in 4m11s for 1000 documents, taking 19m03s at most, while hyperedge cardinalities were computed in only 20s for 1000 documents, taking 44s at most, thus remaining the most efficient statistic to compute, as in the base model.

Fig. 16 Synonyms model run time statistics

Contextual similarity

Another way in which we extended the base model was by using the contextual similarity between terms, established from the k-nearest neighbors according to word embeddings. For this particular analysis, word embeddings were obtained through word2vec, trained on a larger subset of the INEX 2009 Wikipedia collection, built from the documents mentioned in the relevance judgments for all 52 topics. The extracted vectors were of size 100, using sliding windows of 5 words to establish context, and ignoring words that appeared only once. Only the two nearest neighbors with a similarity above 0.5 were considered to build the similarity graph. Contextual similarity hyperedges were then derived from this graph by iterating over each term and building sets that included the original term as well as incoming and outgoing terms.
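The derivation of context hyperedges from the similarity graph can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function name `context_hyperedges`, the input format (a mapping from each term to candidate neighbors with similarities, as an embedding model might return) and the toy similarities are all assumptions made for the example.

```python
from collections import defaultdict


def context_hyperedges(neighbors, k=2, min_sim=0.5):
    """Build one context set per term from a directed k-NN similarity graph.

    `neighbors` (hypothetical input format) maps a term to a list of
    (other_term, similarity) pairs.
    """
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for term, candidates in neighbors.items():
        # Keep the k most similar candidates, then apply the threshold.
        top_k = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]
        for other, sim in top_k:
            if sim > min_sim and other != term:
                outgoing[term].add(other)   # edge term -> other
                incoming[other].add(term)   # seen from the other endpoint

    # One set per term that takes part in at least one edge:
    # the term itself plus its outgoing and incoming neighbors.
    terms = set(outgoing) | set(incoming)
    return {t: {t} | outgoing[t] | incoming[t] for t in terms}


toy_neighbors = {
    "car": [("automobile", 0.9), ("truck", 0.7), ("road", 0.6)],
    "truck": [("car", 0.8), ("lorry", 0.4)],
}
toy_edges = context_hyperedges(toy_neighbors)
```

In the toy input, "road" is dropped because it is only the third nearest neighbor of "car", and "lorry" is dropped because its similarity is below the threshold.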

Table 4 shows the global statistics for the context model. As we can see, the number of terms significantly increased from 323,672 (cf. Table 2) to 413,527. This means that 89,855 contextually similar terms that did not originally belong to the collection were added; they were, however, part of the larger 52 topics collection, otherwise no new terms would have been added. The number of undirected hyperedges also increased significantly, with 157,217 new context relations. The average degree also increased from 0.83 to 1.18, with the average clustering coefficient remaining stable and the density decreasing from 3.88e\({-}\)06 to 2.75e\({-}\)06. The diameter significantly decreased from 17 to 3, as did the average path length, which decreased from 8.37 to 1.93, strongly approximating nodes through the relation of contextual similarity. This is an indicator of the impact of using word embeddings to establish new bridges between documents, although we need to assess whether retrieval effectiveness will be affected by context as a kind of noise introduced in the process rather than a good discriminative feature. We found 42,145 new paths created by this extension, resulting in 23.03 documents linked on average per context. Notice that, although synonyms established a lower number of bridges, they also connected a higher number of documents on average (2.83 \(\times\) more than context). Only by studying retrieval effectiveness will we be able to assess which characteristic translates into a better performance in the model. Besides global statistics, we also identified four interesting changes or new characteristics when compared to the base model:

  • Term node degree distribution;

  • Context hyperedge cardinality distribution;

  • Average hyperedge cardinality over time;

  • Average estimated diameter and average path length over time.

Table 4 Global statistics for the contextual similarity model

Term node degree distribution Figure 17 illustrates the node-based node degree distribution for entity and term nodes in the hypergraph-of-entity with the context extension. The behavior for entity nodes is similar to the base model and to the synonyms model. However, as in the synonyms model, term nodes show a combination of a power-law-like behavior for the lower degrees and a log-linear behavior for the remaining degrees. Given the higher number of terms introduced through contextual similarity, we also find a distribution plot that is visually denser.

Fig. 17 Node degree distribution, based on connected nodes, for the context model (log–log scale)

Context hyperedge cardinality distribution Figure 18 illustrates the distribution of terms per context hyperedge. As we can see, the behavior approximates a power law, with only a few context hyperedges containing around 50 nodes and one of them even reaching 156 nodes.

Fig. 18 Context hyperedge cardinality distribution (log–log scale)

Average hyperedge cardinality over time Given the high number of introduced context hyperedges, most of them with a low cardinality, the average hyperedge cardinality was driven down, as we can see in Fig. 19. In a similar way to the synonym hyperedges, the behavior also changed from a fast growth and convergence behavior, in the base model, to a consistent sub-linear growth behavior.

Fig. 19 Average hyperedge cardinality over time for the context model

Average estimated diameter and average path length over time Perhaps one of the most interesting results of this analysis is the impact of index extensions on the diameter and average path length. This is particularly visible with the context extension: the diameter decreased from 17, in the base and synonyms models, to only 3, in the context model. A similar behavior was identified for the average path length, which decreased from 8.37 in the base model and 7.53 in the synonyms model to only 1.93 in the context model. This behavior over time is seen in Fig. 20, where, contrary to the base and synonyms models, we can find shorter geodesics immediately for a low number of documents. As an increasing part of the collection is considered, the length of the geodesics increases. This might be correlated with an increasing diversity of topics, thus being indicative of the discriminative power of the context extension, an aspect that should be further investigated in the future.

Fig. 20 Average estimated diameter and average shortest path over time for the context model

Temporal statistics of run times Finally, Fig. 21 illustrates the contextual similarity model run times of the following operations for an increasing number of documents: index creation (Fig. 21a); the computation of the global statistics (Fig. 21b), also shown in Table 4; the computation of all node degrees (Fig. 21c); and the computation of all hyperedge cardinalities (Fig. 21d). As we can see, similarly to what happened for the base model, the most significant increase in run time happens around 1000 documents. Unlike in the base model and the synonyms model, the global statistics computation does not show an increased run time for the first added documents. This further supports the hypothesis that the increase was an anomaly caused by initial caching or load issues, particularly since the synonyms model is structurally quite similar to the context model. Indexing took 1m35s for 1000 documents and 5m05s for a maximum of 8000 documents. The computation of global statistics took 5m44s for 1000 documents and 24m20s for a maximum of 8000 documents. Node degrees were computed in 5m15s for 1000 documents, taking 24m37s at most, while hyperedge cardinalities were computed in only 24s for 1000 documents, taking 56s at most, which again makes them the most efficient statistic to compute, as in the base model and the synonyms model.

Fig. 21 Contextual similarity model run time statistics

Term frequency bins

In this section, we analyze the TF-bins extension, which is based on the discretization of the term frequency per document. This way, term frequency can be added to the hypergraph-of-entity, while having a low impact in scalability (i.e., we remain focused on forming groups of nodes to minimize the space complexity of the representation model).
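To make the discretization concrete, the sketch below groups the terms of one document into TF-bins. The binning scheme is an assumption: it uses nearest-rank quantiles over the term frequencies of the document and half-open intervals ]lower, upper], and it drops intervals that end up empty. The function name `tf_bin_hyperedges` and the toy document are hypothetical.

```python
from collections import Counter


def tf_bin_hyperedges(tokens, bins=2):
    """Group the terms of one document into term-frequency bins.

    Returns a list of ((lower, upper), terms) pairs, one per non-empty
    interval ]lower, upper]. Quantile-based edges are an assumption.
    """
    tf = Counter(tokens)
    if not tf:
        return []
    values = sorted(tf.values())
    n = len(values)
    hyperedges = []
    lower = 0  # every observed frequency is at least 1
    for i in range(1, bins + 1):
        # Nearest-rank quantile: index ceil(i * n / bins) - 1 in sorted order.
        upper = values[(i * n + bins - 1) // bins - 1]
        members = {term for term, freq in tf.items() if lower < freq <= upper}
        if members:  # an interval such as ]1, 1] contains no term
            hyperedges.append(((lower, upper), members))
        lower = upper
    return hyperedges


toy_doc = "a a a a b b c d".split()
two_bins = tf_bin_hyperedges(toy_doc, bins=2)
four_bins = tf_bin_hyperedges(toy_doc, bins=4)
```

With two bins, the toy document yields the groups {c, d} and {a, b}. With four bins, the second interval is ]1, 1], which is empty, so only three groups remain. Each group adds a single hyperedge regardless of how many terms it contains, which is why the extension has a low impact on the size of the representation.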

Table 5 shows the global statistics for the TF-bins model. As we can see, the number of nodes is the same as the original model, also remaining unchanged with the number of bins. The number of undirected hyperedges increased from 14,938 to 29,884 for two TF-bins, or to 43,426 with ten bins. The average degree slightly increased from 0.83 to 0.88 for two TF-bins per document, and then to 0.93 for ten TF-bins, with the average clustering coefficient remaining stable and the density increasing from 3.88e\({-}\)06 to 7.58e\({-}\)06 for two TF-bins, and then again slightly to 7.86e\({-}\)06 for ten TF-bins. The diameter decreased from 17 to 13 for two TF-bins, and 14 for ten TF-bins, as did the average path length, which decreased from 8.37 to 6.83 and 6.90 for two and ten TF-bins, respectively. When considering two TF-bins, we found 156,200 new paths created by this extension, resulting in 30.64 documents linked on average per TF-bin. When the number of bins increased to ten, the number of new paths decreased to 153,979, but the average number of documents linked per TF-bin increased to 37.99. Besides global statistics, we also identified seven interesting changes or new characteristics when compared to the base model:

  • TF-bin hyperedge cardinality distribution per number of bins;

  • Number of undirected hyperedges per number of bins;

  • TF-bin hyperedges per number of bins;

  • Diameter and average path length per number of bins;

  • Average hyperedge cardinality over time per number of bins;

  • Average density over time per number of bins;

  • Average estimated diameter and average path length over time per number of bins.

Notice that, contrary to the synonyms and context extensions, the TF-bins extension did not affect the behavior of term node degree distribution, since it does not introduce external terms to the collection.

Table 5 Global statistics for the TF-bins model (bins = 2 and bins = 10)

TF-bin hyperedge cardinality distribution Figure 22 illustrates the cardinality distribution of tf_bin hyperedges, for different numbers of bins. The behavior is similar to that of the related_to hyperedges; however, as the number of bins increases, lower values of cardinality become more frequent and the behavior starts tending towards a power law.

Fig. 22 TF-bin hyperedge cardinality distribution (log–log scale)

Number of hyperedges per number of bins As expected, in Fig. 23a, we find a growth in the number of undirected hyperedges, from 29,884, for two bins, to 43,426, for ten bins. The same happens for the tf_bin hyperedges (Fig. 23b), which are responsible for propelling such growth. The number of hyperedges generated by additional TF-bins will eventually converge, since there is a limited number of terms per document to segment. However, for this collection, it is clear that the number of TF-bins can range from two to ten, while always generating new hyperedges, increasing the granularity at which term frequency will contribute to the model.

Fig. 23 Number of hyperedges, per number of bins, for the TF-bins model

Diameter and average path length per number of bins As shown in Fig. 24, both the diameter and the average path length, which correspond to the maximum and average geodesic distances in the hypergraph, show a high variability with the number of bins. In particular, the diameter and average path length both reach their maximum values of 18 and 8.30 when using 6 TF-bins. The minimum diameter of 11 is reached when using 9 TF-bins, while the minimum average path length of 5.93 is reached when using 7 TF-bins. This suggests that the number of bins might influence retrieval effectiveness, if varying the diameter and the average path length also affects performance directly.

Fig. 24 Geodesic-based metrics, per number of bins, for the TF-bins model

Average hyperedge cardinality over time Figure 25 shows the evolution of the average hyperedge cardinality for different numbers of bins. The behavior is similar to the base model (cf. Fig. 4), which is equivalent to having one TF-bin. As the number of TF-bins increases, the overall average hyperedge cardinality decreases, which is the expected behavior. This is less visible as the number of bins reaches a higher value, at which point the overall cardinality is less affected, showing a progressively smaller decrease. While the number of TF-bins affects this characteristic of the hypergraph, the overall behavior is maintained.

Fig. 25 Average hyperedge cardinality over time, per number of bins, for the TF-bins model

Fig. 26 Average density over time, per number of bins, for the TF-bins model

Average density over time The average density shown in Fig. 26 follows a similar behavior to the base model (cf. Fig. 7), regardless of the number of TF-bins. However, there is a small variation for the interval of approximately 100–1000 documents, after which it is once again reduced to the same value for the different numbers of TF-bins. It is perhaps the diversity in term frequency introduced for documents in this interval that promotes such a difference. This would explain the creation of a higher number of tf_bin hyperedges, without empty TF intervals (e.g., ]2, 2]).

Average estimated diameter and average shortest path over time Figure 27 shows the evolution of the diameter and average path length, over an increasing number of documents and TF-bins. Apart from both metrics reaching higher values for a single document as well as for five TF-bins, the behavior is similar to the base model (cf. Fig. 5).

Fig. 27 Average estimated diameter and average shortest path over time, per number of bins, for the TF-bins model

Temporal statistics of run times Finally, Figures 28 and 29 illustrate the TF-bins model run times of the following operations for an increasing number of documents: index creation (Fig. 28a); the computation of the global statistics (Fig. 28b), also shown in Table 5; the computation of all node degrees (Fig. 29a); and the computation of all hyperedge cardinalities (Fig. 29b). As we can see, similarly to what happened for the base model and the synonyms model, the most significant increase in run time happens around 1000 documents, with the exception of the global statistics computation, which shows an increased run time for the first added documents. Indexing took 1m11s for 1000 documents and 4m27s for a maximum of 8000 documents. The computation of global statistics took 16m38s for 1000 documents and 52m50s for a maximum of 8000 documents. Node degrees were computed in 3m54s for 1000 documents, taking 32m23s at most, while hyperedge cardinalities were computed in only 19s for 1000 documents, taking 50s at most, which again makes them the most efficient statistic to compute, in line with the other studied models.

Fig. 28 TF-bins models run time statistics (part 1)

Fig. 29 TF-bins models run time statistics (part 2)

An application to information retrieval

So far, we have analyzed the structural impact of different index extensions with regard to the characteristics of the hypergraph. However, there is little value in understanding the behavior of structural features without the context of their application, which in this case is in the area of information retrieval (Devezas and Nunes 2019). Thus, we assess the effectiveness of each model, with different extensions and parameter configurations, through a classical information retrieval evaluation process, based on the 10 topic subset of the INEX 2009 Wikipedia collection (INEX 2009 10T-NL).

We launched three evaluation runs per index configuration, i.e., for different versions of the HGoE (hypergraph-of-entity) representation model based on different extensions. We relied on the RWS ranking function, experimenting with different random walk lengths \(\ell \in \{1,2,3\}\), and a fixed configuration for the remaining parameters: r = 10,000, expansion disabled (i.e., without seed node selection (Devezas and Nunes 2019, §4.2.1)), and weights enabled (i.e., considering tf_bin hyperedge weights, the only available weights in the indexes).
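To give an intuition for the roles of the walk length \(\ell\) and the number of repeats r, the sketch below counts hyperedge visits over repeated random walks on a small undirected hypergraph. It is a deliberately simplified illustration and not the RWS definition from Devezas and Nunes (2019, §4.2.2): it ignores hyperedge direction, weights and seed expansion, and the function name `random_walk_visits` and the toy hypergraph are hypothetical.

```python
import random
from collections import Counter


def random_walk_visits(hyperedges, seeds, length=2, repeats=1000, rng=None):
    """Count hyperedge visits over `repeats` walks of `length` steps per seed.

    `hyperedges` maps a hyperedge id to its set of nodes. Each step moves
    from the current node to a uniformly chosen incident hyperedge, and
    then to a uniformly chosen node of that hyperedge.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    members = {edge_id: sorted(nodes) for edge_id, nodes in hyperedges.items()}
    incident = {}
    for edge_id, nodes in members.items():
        for node in nodes:
            incident.setdefault(node, []).append(edge_id)

    visits = Counter()
    for seed in seeds:
        if seed not in incident:
            continue  # a seed outside the hypergraph cannot start a walk
        for _ in range(repeats):
            node = seed
            for _ in range(length):
                edge_id = rng.choice(incident[node])
                visits[edge_id] += 1
                node = rng.choice(members[edge_id])
    return visits


toy_hypergraph = {
    "d1": {"hypergraph", "entity", "search"},
    "d2": {"entity", "retrieval"},
    "d3": {"retrieval", "index"},
}
one_step = random_walk_visits(toy_hypergraph, ["entity"], length=1, repeats=500)
three_steps = random_walk_visits(toy_hypergraph, ["entity"], length=3, repeats=500)
```

With a single step, only the hyperedges that contain the seed ("d1" and "d2") can be visited, whereas longer walks can also reach "d3", which is two hops away.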

Table 6 shows the MAP (mean average precision), NDCG@p (normalized discounted cumulative gain at a cutoff of p), and P@n (precision at a cutoff of n), computed for the relevance judgments provided by the INEX 2010 Ad Hoc track (Arvola et al. 2010). As we can see, by analyzing the maximum values per column (in bold), the TF-bin models were able to obtain significantly better results overall, when compared to the base model, the synonyms model, and the context model. None of the HGoE models is yet able to outperform the baselines, although TF-bins are able to approximate TF-IDF with regard to NDCG@10 and P@10. The hypergraph-based models still need to be iterated on and improved, and herein lies the usefulness of computing the properties of the hypergraph structures and analyzing the hypergraph-of-entity. While there is no clear pattern of effectiveness correlated with the number of bins, if we consider the NDCG@10 scores, the best model for \(\ell = 1\) is TF-bins\(_2\), the best model for \(\ell = 2\) is TF-bins\(_4\), and the best model for \(\ell = 3\) is TF-bins\(_6\). This might indicate that a higher number of bins works best with a longer random walk length. However, the MAP and P@10 metrics do not agree with this hypothesis, so further investigation is required.

Table 6 Evaluating the different models in the ad hoc document retrieval task
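For reference, the three metrics can be sketched for a single topic as below, with MAP being the mean of the average precision over all topics. These are the textbook definitions rather than the evaluation tool used in this work; in particular, NDCG has several variants, and the linear gain with a \(\log_2\) discount used here is an assumption. The toy ranking and judgments are hypothetical.

```python
import math


def precision_at(ranking, relevant, n):
    """Fraction of the top n results that are relevant."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant result."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at(ranking, gains, p):
    """DCG of the top p results divided by the DCG of an ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(doc, 0) for doc in ranking[:p]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:p])
    return actual / ideal if ideal > 0 else 0.0


toy_ranking = ["d3", "d1", "d7", "d2"]
toy_gains = {"d1": 1, "d2": 1}   # binary judgments for one topic
toy_relevant = set(toy_gains)

p_at_2 = precision_at(toy_ranking, toy_relevant, 2)
ap = average_precision(toy_ranking, toy_relevant)
ndcg_at_4 = ndcg_at(toy_ranking, toy_gains, 4)
```

All three metrics reward relevant documents near the top of the ranking, but P@n ignores the order within the cutoff, while the average precision and NDCG@p do not.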

In order to better understand whether there is a direct relation between any of the computed structural features of the hypergraph and the effectiveness of the retrieval model, we first summarize the structural features for each model in Table 7. By comparing each feature with the evaluation metrics from Table 6, we are able to find some indicators of (in)effectiveness in a graph-based retrieval model. According to Table 6, context was the worst performing model, over all values of \(\ell\). The context model also has the highest average degree and clustering coefficient, as well as the lowest average path length and diameter (cf. Table 7). This indicates that a higher local connectivity and an overall lower distance between nodes might not be beneficial for retrieval effectiveness. We also observe that the TF-bin models, which have the best performance, also have a lower clustering coefficient than the base, synonyms and context models, ranging between 0.0994 and 0.1034.

Table 7 Comparing the global statistics for the different models

We also studied the structural impact of each extension, through the relative change to individual features, in comparison to the base model. Figure 30 shows a heatmap based on the change percentages with regard to the base model, which, by definition, has a 0% change over all features, in comparison to itself. As we can see, the context model underwent the most evident overall change, with a −467% change in diameter, and a −333% change in average path length. This model is of particular interest, as it resulted in the worst retrieval performance, when compared to the remaining models. Interestingly, this is also visible in its structural features. The clustering coefficient for the context model also showed a substantial increase in relation to the base model, with a change of 19%, as did the degree, with a change of 29%. When looking at the density for all models, there was no change for the synonyms model, but there was a positive change of around 50% (in green) for the TF-bins models, and there was a negative change of −41% for the context model. The number of nodes did not change for the TF-bins models, but there was a slight increase for synonyms (as new terms from synsets were added), and a more significant increase for the context model. The number of edges showed a consistently larger increase for TF-bins models, as the number of bins increased, with the synonyms model showing a slight increase, and the context model once again showing a more significant increase.

Fig. 30 Relative change of structural features when compared to the base model
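The text does not state the formula behind the change percentages. As a worked check, and purely as an assumption, normalizing the difference by the value of the extended model (rather than by the base value) approximately reproduces the figures quoted for the context model. The function name `relative_change` is hypothetical, and the input values are the ones reported in the text for the base and context models.

```python
def relative_change(base, extended):
    """Change from `base` to `extended`, normalized by the extended value.

    Assumption: this normalization is inferred from the quoted percentages,
    it is not a formula given in the source.
    """
    return (extended - base) / extended


base_model = {"diameter": 17, "avg_path_length": 8.37, "density": 3.88e-06}
context_model = {"diameter": 3, "avg_path_length": 1.93, "density": 2.75e-06}

changes = {
    feature: relative_change(base_model[feature], context_model[feature])
    for feature in base_model
}
for feature, change in changes.items():
    print(f"{feature}: {change:+.1%}")
```

Under this reading, the diameter changes by about −467% and the density by about −41%, while the average path length comes out within one percentage point of the quoted −333% (the inputs used here are rounded). Normalizing by the base value instead would give roughly −82%, −77% and −29%, which does not match the heatmap values.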

Correlating evaluation metrics and structural features

In Table 8 we further organize this approach, by comparing the evaluation results of each metric with the values of each structural feature. By using Spearman’s rank correlation coefficient (\(\rho\)), we can verify whether the retrieval model’s performance ranking given by the evaluation metrics (our ground truth) can compare with the ranking given by any of the structural features, as computed for each model. Let us first follow up with the indicators we put forth in the manual comparison of the two tables.

Table 8 Spearman’s \(\rho\) between evaluation metrics and structural features
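As a reminder of what the coefficient measures, the sketch below computes Spearman's \(\rho\) as the Pearson correlation between rank vectors, using average ranks for ties (ties do occur here, e.g., the base and synonyms models share the same density). It is a plain-Python illustration and not the tool used to produce Table 8; the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks in ascending order, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def spearman_rho(feature_values, metric_values):
    """Spearman's rho between a structural feature and an evaluation metric."""
    return pearson(average_ranks(feature_values), average_ranks(metric_values))
```

A value of \(\rho\) close to −1 for a feature means that ordering the models by that feature in ascending order approximates ordering them by the metric in descending order, which is how the sign of each entry in Table 8 should be read.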

We proposed that a high average degree and clustering coefficient would result in a low MAP, NDCG@10 and P@10, which does not necessarily mean that either feature is a good overall discriminator of model performance. In fact, the average degree does not show correlation consistency among the different evaluation metrics and parameter configurations. On the other hand, the clustering coefficient is negatively correlated with each evaluation metric over the different random walk length parameter configurations, ranging between −0.61 and −0.36. This makes the clustering coefficient a weak, but consistent indicator of the performance of graph-based retrieval models (i.e., higher values of the clustering coefficient indicate a low retrieval effectiveness). Absolute correlation is not particularly high, since retrieval performance does not solely depend on the structure of the graph, but also on the semantics of the representation model.

We also proposed that a low average path length and diameter would be indicative of low model performance. While the average path length and diameter correlations with the evaluation metrics are mostly positive, these are not sufficiently consistent to be considered good global indicators of performance. There are, however, special cases when the average path length serves as a slight indicator of performance, namely for \(\ell > 1\) and for the top 10 results. For \(\ell = 1\), there is a slight negative correlation that could be explained by the fact that this model only relies on the immediate neighborhood within the hypergraph and does not depend on short paths for connectivity. The diameter, on the other hand, always shows a positive correlation with the evaluation metrics, but its absolute value is overall too low and inconsistent to provide a good discriminative indicator of retrieval performance.

With a behavior similar to that of the clustering coefficient, but with the opposite sign, the density, which we had overlooked in the manual comparison, also turns out to be a good indicator of model performance. In particular, the worst performing model (context model) also has the lowest density of 2.75e\({-}\)06, followed by the base model and the synonyms model, tied at a density of 3.88e\({-}\)06, and then by the TF-bin models, with densities ranging from 7.58e\({-}\)06 to 7.86e\({-}\)06. While the density is a good discriminator of graph-based retrieval models, its granularity is low, only properly distinguishing between models with an obvious difference in performance.

Design rules for modifying or extending the hypergraph-of-entity

After the analysis of the impact of structural features in the performance of the retrieval models, we reflect on the implications of our findings. We use these findings to prepare a set of rules that serve as indicators or as a guide for the continued redesign of the hypergraph-of-entity. In particular, the guidelines we propose should be helpful in the process of comparing different versions based on modifications or extensions to our model. We propose two classes of indicators:

Ranking indicators:

Structural features that can be used to rank different graph-based models with regard to their predicted retrieval performance.

Anomaly indicators:

Structural features that cannot be used to rank graph-based models based on retrieval performance, but can, however, be useful for identifying anomalous models with a high chance of a low performance.

Table 9 shows the identified ranking and anomaly indicators according to the analysis carried out at the beginning of this section. The clustering coefficient and the density were both identified as ranking indicators with an approximate certainty rate of 50%, based on an ascending and descending order, respectively. The degree and diameter were identified as anomaly indicators, with the degree being used to identify abnormally high values, for example larger than two standard deviations (\(2\sigma\)) above the mean (\(\mu\)), and the diameter being used to identify abnormally low values, for example less than two standard deviations below the mean.

Table 9 Indicators of graph-based retrieval model performance
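The anomaly rule can be sketched as a simple threshold test over the values of one structural feature across candidate models. The function name `anomalous_models`, the use of the sample standard deviation, and the feature values for the models labeled A to I are all hypothetical; they are not taken from Table 7.

```python
from statistics import mean, stdev


def anomalous_models(values, direction, k=2.0):
    """Return the models whose feature value lies beyond mu +/- k * sigma.

    `values` maps a model name to one structural feature. Use
    direction="high" for the degree and direction="low" for the diameter.
    """
    mu = mean(values.values())
    sigma = stdev(values.values())  # sample standard deviation (assumption)
    if direction == "high":
        return [name for name, v in values.items() if v > mu + k * sigma]
    return [name for name, v in values.items() if v < mu - k * sigma]


# Hypothetical feature values for nine candidate models.
toy_degrees = {"A": 0.83, "B": 0.85, "C": 0.88, "D": 0.89, "E": 0.90,
               "F": 0.91, "G": 0.92, "H": 0.93, "I": 1.60}
toy_diameters = {"A": 17, "B": 17, "C": 13, "D": 14, "E": 15,
                 "F": 16, "G": 18, "H": 14, "I": 3}

high_degree = anomalous_models(toy_degrees, "high")
low_diameter = anomalous_models(toy_diameters, "low")
```

One caveat of this kind of rule is that an outlier also inflates \(\sigma\): with the sample standard deviation, no value among n models can deviate from the mean by more than \((n-1)/\sqrt{n}\) standard deviations, so a \(2\sigma\) threshold can only fire when at least six models are being compared.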

Conclusion

We characterized the hypergraph-of-entity representation model, based on the structural features of the hypergraph. We analyzed the node degree distributions, based on nodes and hyperedges, and the hyperedge cardinality distributions, illustrating their distinctive behavior. We also analyzed the temporal behavior, as documents were added to the index, studying average node degree and hyperedge cardinality, estimated average path length, diameter and clustering coefficient, as well as density and space usage requirements. We expanded on the characterization work by analyzing different model extensions based on synonymy, contextual similarity, and a new concept of TF-bins, and we also measured the run time of several operations like indexing and the computation of properties. Our contributions included the application of two strategies for the approximation of statistics based on the shortest distance, as well as the clustering coefficient. We also proposed a simple approach for computing the density of a general mixed hypergraph, based on an induced bipartite mixed graph. Finally, we focused on the application of this characterization work, which, we proposed, should inform the design of graph-based representation models for information retrieval. In particular, we studied the change in structural features, when compared to the base model, as well as the correlations between retrieval effectiveness metrics (MAP, NDCG@10, P@10) and structural features (e.g., average degree, clustering coefficient). While structural features rarely presented a higher than 50% absolute correlation with any of the evaluation metrics, we identified some of them as indicators useful for ranking the retrieval models according to their effectiveness, or for identifying anomalies that lead to low effectiveness. More importantly, we have provided an analysis framework for hypergraphs that can easily be implemented and applied to both small and large-scale hypergraphs. 
We have also provided a characterization based on this framework, illustrating the behavior of several statistics, for instance showing that, while the degree distribution based on hyperedges still follows a power law, like in real-world networks represented as graphs, the degree distribution based on nodes instead approximates a log-normal distribution. During the development of this work, we have also found that:

  • Little attention has been given to hypergraph characterization in the real world;

  • The community is still lacking in tools to analyze hypergraphs:

    • There is no de facto library for hypergraph analysis;

    • Few file formats support hypergraphs, namely with directed hyperedges.

  • Polyadism introduces additional complexity and calls for novel metrics that take the information within collective relations into account.

Future work In the future, we would like to further explore the computation of density, since the bipartite-based density we proposed, although useful, only accounts for hyperedges already in the hypergraph. We would also like to study the parameterization of the two estimation approaches we proposed, based on random walks and node sampling. Despite their straightforward definition, these approaches also require further evaluation, in order to understand what the expected error will be for different configurations. Another open challenge is the definition of a random hypergraph generation model, which would be useful to improve characterization. Additionally, several opportunities exist in the study of the hypergraph at a mesoscale, be it identifying communities, network motifs or graphlets, or exploring patterns unique to hypergraphs. It would also be interesting to include centrality metrics in the correlation analysis, in order to understand, for instance, whether closeness or betweenness might impact retrieval effectiveness in the hypergraph-of-entity, and to consider multiple combinations of extensions, as opposed to a single one, as we have done here. Finally, regarding the hypergraph-of-entity model, it would also be useful to repeat the analysis we describe in this work based on additional test collections, so as to support or disprove the results we found. Perhaps future TREC or CLEF tracks could provide relevance judgments for multiple tasks in entity-oriented search, which would be useful to boost the study of generality in information retrieval.

Availability of data and materials

The INEX 2009 Wikipedia collection analysed during the current study is available at the Max-Planck-Institut für Informatik website for the Databases and Information Systems department, https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/software/inex/. The topics and relevance judgments for the INEX 2010 Ad Hoc track are available at the INEX website, http://inex.mmci.uni-saarland.de/data/documentcollection.html. The remaining datasets that were generated and analysed during the current study are available from the corresponding author on reasonable request. The software required to replicate this study is available with the name Army ANT, under the BSD 3-Clause license, at https://github.com/feup-infolab/army-ant.

Notes

  1. Tail and head are used in analogy to an arrow, not a list.

  2. https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/software/inex/.

  3. https://yago-knowledge.org/.

  4. https://wiki.dbpedia.org/.

  5. https://inex.mmci.uni-saarland.de/data/documentcollection.html.

  6. We relied on the Grph Java library, available at http://www.i3s.unice.fr/~hogie/software/index.php?name=grph, to represent the hypergraph in memory.

Abbreviations

HGoE:

Hypergraph-of-entity

INEX:

INitiative for the Evaluation of XML Retrieval

MAP:

Mean average precision

NDCG@p:

Normalized discounted cumulative gain at a cutoff of p

OWL:

Web ontology language

P@n:

Precision at a cutoff of n

qrels:

Query relevance set

RDF:

Resource description framework

RWS:

Random walk score

TF:

Term frequency

TF-bin:

Term frequency bin

TF-IDF:

Term frequency \(\times\) inverted document frequency

YAGO:

Yet Another Great Ontology

References

  • Aparicio D, Ribeiro P, Silva F (2018) Graphlet-orbit transitions (GOT): a fingerprint for temporal network comparison. PLoS ONE 13:0205497. https://doi.org/10.1371/journal.pone.0205497

  • Arvola P, Geva S, Kamps J, Schenkel R, Trotman A, Vainio J (2010) Overview of the INEX 2010 ad hoc track. In: Comparative evaluation of focused retrieval—9th international workshop of the inititative for the evaluation of XML retrieval, INEX 2010, Vugh, The Netherlands, 13–15 December 2010, Revised Selected Papers, pp 1–32. https://doi.org/10.1007/978-3-642-23577-1_1

  • Ausiello G, Giaccio R, Italiano GF, Nanni U (1992) Optimal traversal of directed hypergraphs. ICSI, Berkeley, CA

  • Backstrom L, Boldi P, Rosa M, Ugander J, Vigna S (2011) Four degrees of separation. CoRR arXiv:1111.4570

  • Backstrom L, Boldi P, Rosa M, Ugander J, Vigna S (2012) Four degrees of separation. In: Web science 2012, WebSci ’12, Evanston, IL, USA—22–24 June 2012, pp 33–42. https://doi.org/10.1145/2380718.2380723

  • Banerjee A, Char A (2017) On the spectrum of directed uniform and non-uniform hypergraphs. arXiv preprint arXiv:1710.06367

  • Bast H, Buchhold B, Haussmann E et al (2016) Semantic search on text and knowledge bases. Found Trends® Inf Retriev 10(2–3):119–271

  • Bast H, Buchhold B (2013) An index for efficient semantic full-text search. In: Proceedings of the 22nd ACM international conference on conference on information and knowledge management, pp 369–378. https://doi.org/10.1145/2505515.2505689

  • Bastian M, Heymann S, Jacomy M (2009) Gephi: an open source software for exploring and manipulating networks. In: Proceedings of the third international conference on weblogs and social media, ICWSM 2009, San Jose, CA, USA, 17–20 May 2009. http://aaai.org/ocs/index.php/ICWSM/09/paper/view/154

  • Berge C (1970) Graphes et hypergraphes. Monographies universitaires de mathématiques. Dunod, Paris

  • Bhagdev R, Chapman S, Ciravegna F, Lanfranchi V, Petrelli D (2008) Hybrid search: effectively combining keywords and semantic searches. In: European semantic web conference. Springer, pp 554–568

  • Brandes U, Eiglsperger M, Herman I, Himsolt M, Marshall MS (2001) GraphML progress report: structural layer proposal. In: International symposium on graph drawing. Springer, pp 501–512

  • Brown W, Erdős P, Sós V (1973) Some extremal problems on r-graphs. In: New directions in the theory of graphs (Proceedings third Ann Arbor Conference, University Michigan, Ann Arbor, MI, 1971), pp 53–63

  • Csardi G, Nepusz T et al (2006) The igraph software package for complex network research. InterJ Compl Syst 1695(5):1–9

  • Devezas J, Nunes S (2019) Hypergraph-of-entity: a unified representation model for the retrieval of text and knowledge. Open Comput Sci 9(1):103–127. https://doi.org/10.1515/comp-2019-0006

  • Devezas JL, Nunes S (2019) Characterizing the hypergraph-of-entity representation model. In: Complex networks and their applications VIII, volume 2: proceedings of the eighth international conference on complex networks and their applications, COMPLEX NETWORKS 2019, Lisbon, Portugal, 10–12 Dec 2019, pp 3–14. https://doi.org/10.1007/978-3-030-36683-4_1

  • Erdős P (1971) On some extremal problems on r-graphs. Discrete Math 1(1):1–6. https://doi.org/10.1016/0012-365X(71)90002-1

  • Erdős P, Goodman AW, Pósa L (1966) The representation of a graph by set intersections. Can J Math 18:106–112

  • Estrada E, Rodriguez-Velazquez JA (2005) Complex networks as hypergraphs. arXiv preprint arXiv:physics/0505137

  • Fernández JD, Martínez-Prieto MA, de la Fuente Redondo P, Gutiérrez C (2016) Characterizing RDF datasets. J Inf Sci 1:1–27

  • Gallagher SR, Goldberg DS (2013) Clustering coefficients in protein interaction hypernetworks. In: ACM conference on bioinformatics, computational biology and biomedical informatics. ACM-BCB 2013, Washington, DC, USA, 22–25 Sept 2013, p 552. https://doi.org/10.1145/2506583.2506635

  • Gallo G, Longo G, Pallottino S (1993) Directed hypergraphs and applications. Discrete Appl Math 42(2):177–201. https://doi.org/10.1016/0166-218X(93)90045-P

  • Gao J, Zhao Q, Ren W, Swami A, Ramanathan R, Bar-Noy A (2015) Dynamic shortest path algorithms for hypergraphs. IEEE/ACM Trans Netw 23(6):1805–1817. https://doi.org/10.1109/TNET.2014.2343914

  • Ge W, Chen J, Hu W, Qu Y (2010) Object link structure in the semantic web. In: The semantic web: research and applications, 7th extended semantic web conference, ESWC 2010, Heraklion, Crete, Greece, 30 May–3 June 2010, proceedings, part II, pp 257–271. https://doi.org/10.1007/978-3-642-13489-0_18

  • Głąbowski M, Musznicki B, Nowak P, Zwierzykowski P (2012) Shortest path problem solving based on ant colony optimization metaheuristic. Image Process Commun 17(1–2):7–17

  • Halpin H (2009) A query-driven characterization of linked data. In: Proceedings of the WWW2009 workshop on linked data on the web, LDOW 2009, Madrid, Spain, 20 April 2009

  • Himsolt M (1997) GML: a portable graph file format. Technical report, Universität Passau

  • Klamt S, Haus U, Theis FJ (2009) Hypergraphs and cellular networks. PLoS Comput Biol. https://doi.org/10.1371/journal.pcbi.1000385

  • Li D (2011) Shortest paths through a reinforced random walk. Technical report, University of Uppsala

  • Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems 26: 27th annual conference on neural information processing systems 2013. Proceedings of a meeting held 5–8 December 2013, Lake Tahoe, Nevada, USA, pp 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality

  • Milgram S (1967) The small world problem. Psychol Today 2(1):60–67

  • Miller GA (1995) WordNet: a lexical database for English. Commun ACM 38(11):39–41. https://doi.org/10.1145/219717.219748

  • Mubayi D, Zhao Y (2007) Co-degree density of hypergraphs. J Combin Theory Ser A 114(6):1118–1132. https://doi.org/10.1016/j.jcta.2006.11.006

  • Ouvrard X, Goff JL, Marchand-Maillet S (2017) Adjacency and tensor representation in general hypergraphs part 1: e-adjacency tensor uniformisation using homogeneous polynomials. CoRR arXiv:1712.08189

  • Ribeiro BF, Basu P, Towsley D (2012) Multiple random walks to uncover short paths in power law networks. In: 2012 Proceedings IEEE INFOCOM workshops, Orlando, FL, USA, 25–30 March 2012, pp 250–255. https://doi.org/10.1109/INFCOMW.2012.6193500

  • Schenkel R, Suchanek FM, Kasneci G (2007) YAWN: a semantically annotated Wikipedia XML corpus. In: Datenbanksysteme in Business, Technologie und Web (BTW 2007), 12. Fachtagung des GI-Fachbereichs “Datenbanken und Informationssysteme” (DBIS), proceedings, 7.-9. März 2007, Aachen, Germany, pp 277–291

  • Sperner E (1928) Ein Satz über Untermengen einer endlichen Menge. Math Z 27(1):544–548

  • Travers J, Milgram S (1977) An experimental study of the small world problem. Social networks. Elsevier, Washington, DC, pp 179–197

  • Turán P (1941) On an extremal problem in graph theory. Matematikai és Fizikai Lapok 48:436–452

  • Turán P (1961) Research problems. Magyar Tud Akad Mat Kutato Internat Közl 6:417–423

  • Voorhees EM (1986) The efficiency of inverted index and cluster searches. In: SIGIR’86, Proceedings of the 9th annual international ACM SIGIR conference on research and development in information retrieval, Pisa, Italy, 8–10 Sept 1986, pp 164–174. https://doi.org/10.1145/253168.253203

  • Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440

  • Yu W, Sun N (2018) Establishment and analysis of the supernetwork model for Nanjing Metro Transportation System. Complexity 2018:4860531. https://doi.org/10.1155/2018/4860531

  • Zobel J, Moffat A, Ramamohanarao K (1998) Inverted files versus signature files for text indexing. ACM Trans Database Syst 23(4):453–490. https://doi.org/10.1145/296854.277632

Acknowledgements

We would like to thank Bruno Martins, from INESC-ID and the University of Lisbon, for his suggestion on integrating the concept of term frequency into the hypergraph-of-entity in the form of bins.

Funding

José Devezas is supported by research grant PD/BD/128160/2016, provided by the Portuguese national funding agency for science, research and technology, Fundação para a Ciência e a Tecnologia (FCT), within the scope of Operational Program Human Capital (POCH), supported by the European Social Fund and by national funds from MCTES.

Author information

Contributions

JLD and SSN jointly discussed and developed the ideas presented in this work. JLD was responsible for the data processing and analysis, and for writing the manuscript. JLD and SSN jointly reviewed the manuscript, with SSN as the main contributor to this process. The resulting discussion led to the heatmap depicting the relative change of structural features, which was prepared by JLD.

Corresponding author

Correspondence to José Devezas.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Devezas, J., Nunes, S. Characterizing the hypergraph-of-entity and the structural impact of its extensions. Appl Netw Sci 5, 79 (2020). https://doi.org/10.1007/s41109-020-00320-z

  • DOI: https://doi.org/10.1007/s41109-020-00320-z

Keywords