From: Characterizing the hypergraph-of-entity and the structural impact of its extensions
Type | Description | Observation |
---|---|---|
Nodes | ||
term | Represents a single word from the original document | In this work, the preprocessing pipeline includes: sentence segmentation; lowercase filtering; replacement of URL, time, money and number expressions, each with a common placeholder; stemming via the Porter stemmer |
entity | Represents an entity from the list of extracted entities and/or provided triples | For the INEX collection, each mention of an entity is modeled through this type of node (we consider disambiguation to be a part of the ranking) |
Hyperedges (base model) | ||
document | Represents a document through the set of all its terms and entities | Undirected hyperedge |
related_to | Represents a semantic relation between multiple entities | Undirected hyperedge. In this implementation, the relation is derived from all triples in the collection, by grouping by subject |
contained_in | Represents a relation between a set of terms and an entity | Directed hyperedge. In this implementation, this relation exists between terms that are part of an entity name or mention and the corresponding entity node |
Hyperedges (extensions) | ||
synonym | Represents a relation of synonymy between a set of terms | Undirected hyperedge. Present in the Synonyms model. The first synset from WordNet 3.0 is obtained for each noun term; missing terms are added to the model and the hyperedge is created |
context | Represents a relation of contextual similarity between a set of terms | Undirected hyperedge. Present in the Contextual similarity model. This is computed based on the top similar terms according to word2vec embeddings |
tf_bin | Represents a set of terms within the same term frequency interval, for a given document | Undirected hyperedge. Present in the TF-bins model. The number of TF-bins per document is a parameter that can be set during indexing |
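
The node and hyperedge types listed in the table can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hyperedge` class, the helper functions `document_edge`, `contained_in_edge` and `tf_bin_edges`, and in particular the equal-width binning of term frequencies are assumptions made here for illustration, since the table only states that the number of TF-bins per document is a parameter.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Node = Tuple[str, str]  # (node type, name), e.g. ("term", "graph")


@dataclass(frozen=True)
class Hyperedge:
    """Hypothetical container for one hyperedge of the model."""
    kind: str                            # "document", "contained_in", "tf_bin", ...
    head: FrozenSet[Node]                # all nodes (undirected) or target nodes (directed)
    tail: FrozenSet[Node] = frozenset()  # source nodes; left empty when undirected

    @property
    def directed(self) -> bool:
        return bool(self.tail)


def document_edge(terms: List[str], entities: List[str]) -> Hyperedge:
    """Undirected hyperedge over all terms and entities of one document."""
    nodes = {("term", t) for t in terms} | {("entity", e) for e in entities}
    return Hyperedge("document", frozenset(nodes))


def contained_in_edge(entity: str, name_terms: List[str]) -> Hyperedge:
    """Directed hyperedge from the terms of an entity name to the entity node."""
    return Hyperedge(
        "contained_in",
        head=frozenset({("entity", entity)}),
        tail=frozenset(("term", t) for t in name_terms),
    )


def tf_bin_edges(terms: List[str], n_bins: int) -> List[Hyperedge]:
    """Group a document's terms into at most n_bins term frequency intervals.

    Assumption: intervals are equal-width between the minimum and maximum
    term frequency in the document; the source does not specify the binning.
    """
    tf = Counter(terms)
    if not tf:
        return []
    lo, hi = min(tf.values()), max(tf.values())
    width = (hi - lo) / n_bins or 1  # fall back to 1 when all frequencies are equal
    bins = {}
    for term, freq in tf.items():
        idx = min(int((freq - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins.setdefault(idx, set()).add(("term", term))
    return [Hyperedge("tf_bin", frozenset(nodes)) for _, nodes in sorted(bins.items())]
```

For example, calling `tf_bin_edges` with two bins on a document where one term occurs four times and two others occur once or twice would, under this binning assumption, place the two low-frequency terms in one hyperedge and the high-frequency term in another. Empty bins produce no hyperedge.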