Shannon entropy in time-varying semantic networks of titles of scientific papers

Recent work has employed information theory in social and complex networks. Studies often discuss entropy in the degree distributions of a network. However, no specific work on entropy exists for clique networks. This work extends a previous study on this topic. We propose a method for calculating the entropy of a clique network, together with its minimum and maximum values, in temporal semantic networks based on the titles of scientific papers. In addition, the critical network was extracted at selected moments. We use the titles of scientific papers published in Nature and Science over a ten-year period. The results show the diversity of vocabulary over time, based on the entropy values of vertices and edges. In each critical network, we discover the paths that connect important words and an interesting modular structure.


Introduction
Information theory has evolved in recent decades and has been applied in different fields, such as biology, economics and quantum-confined systems (Mousavian et al. 2016; Mishra and Ayyub 2019; Nascimento and Prudente 2018; Brillouin 2013). Recently, some authors have introduced these concepts to measure the information contained in the distributions of degrees and geodesic distances of real networks, classical models and semantic networks, in order to classify and differentiate these systems by the heterogeneity of their links (Solé and Valverde 2004; Ji et al. 2008; Viol et al. 2019).
In the study of real networks, it is necessary to model the dynamics of vertices and edges entering and leaving the network. The main models include the modeling of a system by a clique network, e.g., movie actor networks (Barabasi and Albert 1999), co-authorship networks (Newman 2001), concept networks (Caldeira et al. 2006) and semantic networks (Teixeira et al. 2010; Pereira et al. 2011, 2016; Grilo et al. 2017). The latter considers a network composed of words, concepts or entities with semantic meaning represented by the vertices, with edges consisting of connections between two words that appear in the same unit of meaning, that is, in a sentence (phrase), paragraph or title of the analyzed discourse (Grilo et al. 2017). Semantic networks modeled by a clique network can provide interesting answers for the study of the organization of human language. Teixeira et al. (2010) proposed the incidence-fidelity (IF) index to obtain a critical configuration of the semantic network of an oral discourse. Cunha et al. (2015) applied this index to networks of scientific paper titles based on publications in high-impact-factor journals. The semantic network of titles (SNT) is formed by the union of the titles of publications of a scientific journal over a given period of time, where the words are vertices of the network and the edges connect words that belong to the same title (Pereira et al. 2011). Within this context, Casteigts et al. (2012) formalized the concept of the time-varying graph (TVG).
Despite the growing interest in Shannon entropy, no studies have applied this measure to clique semantic networks. Therefore, this work proposes a method that calculates the vertex and edge entropy of an SNT and their maximum and minimum limits for entropy values according to the initial conditions. The findings can be generalized for any clique network.
This work synthesizes the methodology presented in Cunha et al. (2020) and expands the possibilities of its application and results. The dataset includes the titles from the journals Nature and Science from 1999 to 2008. The networks are built as a TVG and analyzed using the sliding time window proposed by Cunha et al. (2020), explained here in more detail. The TVG is then called a time-varying semantic network of titles (TVSNT).
In addition to the entropy calculation, the incidence-fidelity (IF) index (Teixeira et al. 2010) is applied to find the critical network in prominent time windows and to show the connections between the most important vertices.
The results are explored according to the meanings of these indices, and the two systems are compared through the correlations between their entropy values.

Network of cliques
Considering their substantial applicability, clique networks fit the modeling of various social systems. We will provide a brief review of the semantic networks of cliques and the semantic networks based on titles of scientific papers.
Semantic network of cliques. Following the definition in Grilo et al. (2017) and the premise of Caldeira et al. (2006), we consider a semantic network of cliques to be a system of knowledge representation established by a specific context and imbued with intended functionality. The vertices are words, concepts or entities with semantic meaning; the smallest unit of meaning is the sentence (e.g., a phrase of a text or discourse, the title of a scientific paper, or the keywords of a paper); and the edges consist of connections between two words that appear in the same sentence.
According to this definition, a word changes its meaning depending on its neighbors in a sentence. Thus, a network is the union of these minimal units of meaning, i.e., the union of cliques. An increasing number of studies investigate semantic networks of cliques: Caldeira et al. (2006) analyzed the structure of meaningful concepts in written discourse; Teixeira et al. (2010) and Lima-Neto et al. (2018) applied semantic clique networks to analyze the relationships between words that emerge in oral speech from a critical network, that is, a configuration obtained using the IF index, in which the network displays the most information with the least residue (Teixeira et al. 2010); Nascimento et al. (2016) analyzed a semantic network formed by the keywords of doctoral theses in the area of Physics Teaching in Brazil from 1996 to 2006; and Andrade et al. (2019) employed degree, closeness and betweenness centralities to assess the coherence and consistency of a university program proposal against its course syllabi, working with semantic networks of the titles of scientific papers.
SNT. An SNT is a semantic network of cliques in which each clique represents one title and its words are the clique's vertices. Consequently, an edge represents the connection between two words that belong to the same title. Several authors have proposed important methodologies for the study of SNTs: Pereira et al. (2011, 2016) investigated the topological structure of an SNT of scientific papers as a method to analyze the diffusion efficiency of information; Henrique et al. (2014) employed an SNT to compare the titles of journal papers in mathematics education in English and Portuguese; Cunha et al. (2013) considered a TVSNT and observed an effect on the network memory; Cunha et al. (2015) applied the IF index to the SNTs of 15 high-impact-factor journals and identified the corresponding critical network for each journal; Pereira et al. (2016) examined the evolution of density during the construction of semantic networks as an indicator of the diversity of scientific journal concepts; and Grilo et al. (2017) proposed a method that analyzes the robustness of an SNT using vertex removal strategies, enabling the identification of a critical removal fraction at which the topological structure of the network changes.
Note that Pereira et al. (2011) pioneered the study of SNTs, proposing rules for manual treatment and a method for data collection, construction and analysis of the networks. Fadigas and Pereira (2013) used the same dataset to apply specific indices they proposed for clique networks and to characterize the networks topologically using these indices.
Indices used in this paper. For each title network, the properties of the clique networks were utilized (Fadigas and Pereira 2013), as shown in Table 1.

IF index
Based on the premise of Caldeira et al. (2006), words that occur together in the same sentence are associatively evoked to construct the idea being presented. According to Teixeira et al. (2010), under this criterion, pairs whose association is not significant are also included in the network and mask the structure formed by the strongest associations. Filtering is therefore necessary to ensure that only the associations most relevant to the discourse are considered in the construction of the network.
To filter a clique semantic network and obtain the critical network, Teixeira et al. (2010) created the IF index, shown in Eq. 3. The IF index generates a network with a critical configuration that contains the maximum amount of information with the minimum amount of textual residue. This index measures how "strong" and "faithful" the relationship between a pair of words is. For a given pair of words, the index considers the frequency of appearance in the text (incidence I, Eq. 1) and the frequency of appearance in the context in which at least one word of the pair is evoked (fidelity F, Eq. 2). The IF index is the product of these two indices:

I_(α,β) = S_(α,β) / n_q (1)

F_(α,β) = S_(α,β) / (S_α + S_β − S_(α,β)) (2)

IF_(α,β) = I_(α,β) · F_(α,β) (3)

In Eqs. 1, 2 and 3, α and β represent the words in a word pair; C_i is the set of sentences that contain the word i; S_α, S_β and S_(α,β) are the numbers of sentences in which the word α, the word β and the pair (α, β), respectively, appear; and n_q is the total number of sentences in the text. Once the IF index is calculated for all pairs of words, the semantic network becomes weighted at the edges.

The following notation is used throughout (see Table 1):
n_0: number of vertices in the initial configuration, n_0 ≥ n.
#i: frequency of vertex i in the initial configuration, i.e., the number of titles that contain the word i.
#(i, j): frequency of edge (i, j) in the initial configuration, i.e., the number of titles that contain the words i and j, with 1 ≤ #(i, j) ≤ n_q, i, j = 1, 2, ..., n and i ≠ j.
q_min: number of vertices of the smallest title (clique) in the initial configuration, 1 ≤ q_min ≤ n.
q_max: number of vertices of the largest clique in the initial configuration.
⟨k⟩: average degree of an undirected network, where k_i is the degree of a vertex i, i.e., the number of edges incident on vertex i.
k_i^hub: degree values of the hubs, i.e., vertices of very high degree; σ is the standard deviation of the degree distribution.

"Initial configuration" refers to the isolated cliques, and "final configuration" refers to the built network of cliques. These indices are valid for each time window considered.
Considering that IF_L is the minimum allowable value for the IF index in the network, the filtering is performed by removing the edges with IF < IF_L; only edges with IF ≥ IF_L remain in the network.
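As an illustration of Eqs. 1–3 and of this filtering step, the computation can be sketched as follows (a minimal sketch in Python; the function names are ours, and titles are assumed to be given as sets of already-treated words):

```python
from itertools import combinations

def if_index(titles):
    """Compute the incidence-fidelity (IF) index for every word pair.

    A sketch following Eqs. 1-3: `titles` is a list of sentences, each
    represented as a set of words. Returns {(a, b): IF} for each pair
    of words that co-occurs in at least one title.
    """
    n_q = len(titles)          # total number of sentences (titles)
    s = {}                     # S_alpha: number of titles containing each word
    s_pair = {}                # S_(alpha,beta): titles containing both words
    for title in titles:
        words = set(title)
        for w in words:
            s[w] = s.get(w, 0) + 1
        for a, b in combinations(sorted(words), 2):
            s_pair[(a, b)] = s_pair.get((a, b), 0) + 1
    result = {}
    for (a, b), s_ab in s_pair.items():
        incidence = s_ab / n_q                     # Eq. 1
        fidelity = s_ab / (s[a] + s[b] - s_ab)     # Eq. 2
        result[(a, b)] = incidence * fidelity      # Eq. 3
    return result

def filter_edges(if_values, if_l):
    """Keep only edges with IF >= IF_L (the filtering described above)."""
    return {pair: v for pair, v in if_values.items() if v >= if_l}
```

Sweeping `if_l` upward and watching the network's connectivity is then one way to locate the critical value IF_c discussed next.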
Critical Network. Critical networks were employed to investigate mechanisms inherent to human language in oral speeches (Teixeira et al. 2010; Lima-Neto et al. 2018). There exists a value IF_L = IF_c at which the network abruptly changes its connectivity. This phenomenon can be verified with the average shortest path in Fig. 1. Figure 2 shows the critical network for the TVSNT from scientific papers of Nature, w_8,1 at t = 8.

Temporal networks
Brief history. Time plays an important role in the analysis of systems whose elements connect. Within the scope of social and complex networks, previous works have introduced temporal parameters into networks. Doreian and Stokman (1997) applied models of evolution to study the development of social structures; Barabâsi et al. (2002) highlighted dynamic and structural mechanisms in a co-authorship network and topologically characterized it over time; Li et al. (2007) proposed a model of a scientific collaboration network to verify the scale-free pattern in the weight distributions of the network edges over time; Tang et al. (2010) introduced the concepts of paths and temporal distances and the small-world phenomenon in a temporal graph, based on the condition of high edge agglomeration and low average temporal distance of nodes, in networks of mobile agents and social and biological systems. In 2012, Nicosia et al. (2012) and Casteigts et al. (2012) formalized several concepts and metrics employed in the study of dynamic networks to create the concept of the TVG, which enables the modeling and analysis of networks whose edges and/or vertices vary over time. The TVG also enabled the integration of the vast collection of concepts, formalisms and results obtained in previous works (Nicosia et al. 2012). Amblard et al. (2011) investigated co-authorship relationships and citations among authors of scientific articles; Silva et al. (2012) analyzed the temporal evolution of brain signals in neuron networks of freely acting rats; Cunha et al. (2013) investigated the memory effect in the time series of a network of titles in the journal Nature; Paranjape et al. (2017) defined temporal network motifs as induced subgraphs on sequences of edges; and Holme and Saramäki (2012, 2013) introduced several applications, suggestions for algorithms and specific metrics for networks that vary over time.
Holme and Saramäki (2013) discussed the optimal transport structure and the relationship between temporal length and geometric length in a temporal network; Cunha et al. (2020) proposed a method to analyze a TVG from a sliding time window and build a time series of network indexes (see also Sousa et al. 2020). A TVG is formally defined as in Eq. 4:

G = (V, E, 𝒯, ϒ, σ) (4)

In Eq. 4, V = {v_1, v_2, ..., v_n} is the set of vertices and E = {e_1, e_2, ..., e_m} is the set of edges of the system, where e_k = (i, j), with i ≠ j and i, j = 1, 2, ..., n. For these sets, n = |V| and m = |E|. The time set 𝒯 ⊂ ℕ, 𝒯 = {t_1, t_2, t_3, ..., t, ..., t_(ℓ−1), t_ℓ}, represents the system lifetime, which is discrete in time. Each element of 𝒯 represents a date or time instant. The interval between the extreme dates is the total time T = t_ℓ − t_1 + 1. ϒ: E × 𝒯 → {0, 1} is the presence function, which guarantees the existence of a given edge at a given time t ∈ 𝒯; and σ is the latency function, which represents the time required to form an edge.
Time sliding window function. The analysis of a TVG can be performed using the sliding window function w_τ,s, where τ is the size of the time window and s is the step taken by the window in time (Cunha et al. 2020). Figure 3 shows examples of the use of the function w_τ,s for network analysis.
Assuming the values of τ and s are constant and chosen by the researcher, the number of windows that fit into the TVG is a function of τ, s and T, as shown in Eq. 5:

n_w = ⌊(T − τ)/s⌋ + 1 (5)

In this equation, n_w is the total number of windows, i.e., the number of networks to be analyzed.
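The window count and the resulting time intervals can be sketched as follows (a minimal illustration; the function name and the 1-indexed time convention are our assumptions):

```python
def sliding_windows(T, tau, s):
    """Enumerate the windows w_{tau,s} over a TVG of total time T.

    The number of windows follows n_w = floor((T - tau) / s) + 1
    (Eq. 5). Returns a list of (start, end) inclusive time intervals,
    with time instants numbered 1..T.
    """
    n_w = (T - tau) // s + 1
    return [(1 + k * s, k * s + tau) for k in range(n_w)]
```

For example, with the values used later in this work (T = 514 weeks, τ = 8, s = 1), this yields n_w = 507 windows, the first covering weeks 1 to 8.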

Information entropy
The formalism of information as an entropy measure was introduced by Claude Shannon in 1948. According to Shannon's theory, the information measure of a variable depends only on its probability distribution (Shannon 1948). Consequently, the theory has been applied in several areas, such as biology, economics, and confined quantum systems (Mousavian et al. 2016; Mishra and Ayyub 2019; Nascimento and Prudente 2018). The theory may compose a methodological link that unites different areas (Zenil et al. 2016), including statistical and thermodynamic physics, in which several recent works have shown the importance of information entropy (Zurek 2018; Gao et al. 2019).
The mathematical concept of information considers that the information contained in a message is associated with the number of possible values or states of this message (Shannon 1948). For example, if the system has only one possible state (e.g., the degree of the vertices in a regular network), no information is obtained upon inspection. As the number of possible states of a system increases, the amount of information in the system increases; that is, more is learned upon discovering its actual state.
Entropy is the expected value of the uncertainty of a random variable X (a system state) with respect to a probability distribution, as shown in Eq. 6:

H(X) = −k Σ_i p_i log p_i (6)

In Eq. 6, X is a random variable, p_i is the probability of the state i of this variable (with Σ_i p_i = 1), and k is a constant; taking the logarithm to base 2 (with k = 1), the entropy value is given in bits, the convention adopted in this work. Each calculated entropy value has an associated maximum and minimum value. When these limits are known, they help to evaluate how much the real value deviates from these idealized situations.
In a probability distribution for the state of the random variable X, the minimal entropy situation occurs when the uncertainty is minimal. For example, when only one possible state for X exists, we are 100% certain about this state, so H(X) = 0. The maximum entropy situation occurs when all N possible states of the variable have an equal probability of occurrence, i.e., p_i = 1/N and H(X) = −Σ_{i=1}^{N} (1/N) log_2(1/N) = log_2 N. Thus, the entropy of a random variable X with N possible states lies within these limits, as shown in Eq. 7:

0 ≤ H(X) ≤ log_2 N bits (7)
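Equation 6 and the bounds in Eq. 7 can be illustrated directly (a minimal sketch; the function name is ours):

```python
from math import log2

def shannon_entropy(probs):
    """Shannon entropy H(X) = -sum_i p_i log2(p_i), in bits (Eq. 6).

    Terms with p_i = 0 contribute nothing, by the usual convention
    0 * log2(0) = 0.
    """
    return -sum(p * log2(p) for p in probs if p > 0)
```

A single certain state, `shannon_entropy([1.0])`, gives 0 bits; a uniform distribution over N states, e.g. `shannon_entropy([0.25] * 4)`, gives log2(N) = 2 bits, the extremes of Eq. 7.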

Dataset, collection and treatment
The dataset is composed of the titles of articles published in the journals Nature and Science from 1999 to 2008 (Pereira et al. 2011). These journals have high impact factors and similar publication frequencies in the collected period. The words in these titles were treated according to the treatment rules proposed in Pereira et al. (2011) and organized so that each week of publications (journal issue) corresponds to a text file in which each line corresponds to a title. The network is then built from these files.

Building a TVSNT
The SNT is modeled as a TVG, where V is the set of distinct words and E is the set of pairs of words in the same title; 𝒯 is the collected period, given in weeks, since a week is the minimum publication period of the journals. For Nature, T = 514 weeks, and for Science, T = 512 weeks. The presence function ϒ indicates whether two words occur in the same title at least once at a given instant. In this work, we do not use the latency function σ, which is constant.
The sliding time window w_τ,s is defined initially with τ = 8 weeks and s = 1 week, i.e., w_8,1. The network parameters discussed here are calculated in each window.

Application of information entropy to the TVSNT
Entropy in title networks. According to Cunha et al. (2020), two random variables can be obtained from the formation process of a titles (cliques) network: the vertex and the edge. The probabilities of occurrence of the vertex i and the edge (i, j) are calculated for each time window considered, according to Eq. 8 and Eq. 9, respectively, where the time instant t corresponds to the number of the window:

p_i(t) = #i(t) / n_0(t) (8)

p_(i,j)(t) = #(i, j)(t) / m_0(t) (9)

Here n_0(t) is the total number of vertex occurrences and m_0(t) is the total number of edge occurrences in the initial configuration of window t. Equations 10 and 11 express the Shannon entropies for these distributions, where H_v(t) and H_e(t) represent the entropies of the vertices and edges, respectively, at the given time t:

H_v(t) = −Σ_i p_i(t) log_2 p_i(t) (10)

H_e(t) = −Σ_(i,j) p_(i,j)(t) log_2 p_(i,j)(t) (11)

To improve the understanding of the calculation of information entropy in a TVSNT, Fig. 4 presents an example of a network of cliques and its formation process, the associated probabilities, and the entropy values of vertices and edges. Figure 4a shows the cliques in the initial configuration, and Fig. 4b presents the network of cliques built by juxtaposition and overlapping processes (Fadigas and Pereira 2013).
Limiting values for entropy. The factors that contribute to the increase and reduction of entropy in a system are highlighted here. The minimum entropy value is associated with maximum certainty about the variable. Two factors contribute strongly to this certainty: (i) a minimum number of possible states for the variable and (ii) greater repetition of one or a few of the possible states. On the other hand, the maximum entropy is associated with minimum certainty about the variable, i.e., with the maximum number of possible states, each state having the lowest possible probability.
The limits shown in Eq. 7 may not apply to the entropy associated with the construction of clique networks. In this section, the extremes are calculated based on the boundary conditions for the formation of the networks.
The following conditions were employed for the investigated journals: the number of cliques in the initial configuration n_q, the size of the largest clique q_max, the size of the smallest clique q_min, the number of vertices n, and the number of vertices in the initial configuration n_0.
To calculate the limits, we will assume the existence of configurations that maximize and minimize the entropy.
Step 1: We imagine n_q initially empty cliques and n_0 vertices available to be distributed among them, where n_0 ≥ n. Of the n_0 vertices, n are necessarily distinct.
Step 2: The n vertices are distributed among the n_q cliques without vertex repetition within each clique, where the number of vertices per clique q_i does not exceed the maximum value q_max and is not less than the minimum value q_min, i.e., q_min ≤ q_i ≤ q_max.
This moment is referred to as Configuration 1. The distribution is performed so that there is no repetition of vertices or edges, using Eq. 12. In this configuration, all vertices and edges are distinct, and the number of edges is minimal. In the final network, there are x cliques of size q and y cliques of size (q + 1); thus,

q = ⌊n/n_q⌋,  y = n − q·n_q,  x + y = n_q,  xq + y(q + 1) = n (12)

Configuration 1 generates the highest vertex entropy, H_v^max = log_2 n, because it guarantees the disposition of all vertices without repetition, and the lowest edge entropy, since it guarantees the smallest number of edges.
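Solving Eq. 12 for a given n and n_q can be sketched as follows (a minimal illustration; the function name is ours):

```python
def configuration1(n, n_q):
    """Solve Eq. 12: distribute n distinct vertices into n_q cliques
    of sizes q and q + 1, with no repetition of vertices or edges.

    Returns (q, x, y): x cliques of size q and y cliques of size q + 1.
    """
    q = n // n_q          # base clique size
    y = n - q * n_q       # cliques that get one extra vertex
    x = n_q - y           # cliques of the base size
    assert x * q + y * (q + 1) == n   # consistency with Eq. 12
    return q, x, y

# In this configuration all n vertices are distinct, so the vertex
# entropy attains its maximum H_v^max = log2(n).
```

For example, n = 10 vertices over n_q = 3 cliques gives two cliques of size 3 and one of size 4.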
The repetition of a variable also contributes to the reduction of its entropy. In clique networks, this phenomenon does not occur for edges, because the repetition of an edge implies that the edge exists in more than one clique. The two vertices that compose the edge are then forced to connect to all the other vertices of those cliques, which causes a considerable increase in the number of edges, that is, in the number of possible states, and consequently an increase in entropy.
We build Configuration 2 as follows. Step 3: From Configuration 1, the remaining n_0 − n repeated vertices are added one by one, with the repetition concentrated as much as possible on the first vertices added.
Step 4: If n_0 − n ≥ n_q − 1, a repeated vertex will exist in all cliques. After this distribution, if (n_0 − n) − (n_q − 1) ≥ n_q − 1, the process continues by choosing another vertex to repeat across the cliques.
Step 5: The process is repeated until fewer than n_q − 1 vertices remain to be added; these are then distributed as a single vertex repeated in as many cliques as they can fit. Step 6: The value n_q − 1 is subtracted from the number of vertices not yet added until this subtraction yields a number n′ ≤ n_q − 1. The last vertex is then repeatedly added, clique by clique, into n′ cliques.
Configuration 2 increases the probability of some vertices as much as possible, reducing the vertex entropy to the smallest value that respects the boundary conditions of the problem.
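One plausible reading of Steps 3–6 can be sketched as follows (a minimal illustration; the function names are ours). Since a vertex appears at most once per clique and already sits in one clique after Configuration 1, each vertex can receive at most n_q − 1 extra copies:

```python
from math import log2

def configuration2_counts(n0, n, n_q):
    """Sketch of Steps 3-6: add the n0 - n repeated vertices so that
    repetition is concentrated on as few vertices as possible.

    Returns the occurrence count of each of the n vertices, summing
    to n0; the entropy of these counts approximates H_v^min.
    """
    counts = [1] * n          # Configuration 1: every vertex once
    extra = n0 - n            # repeated vertices still to place
    v = 0
    while extra > 0 and v < n:
        add = min(extra, n_q - 1)   # at most n_q - 1 extra copies each
        counts[v] += add
        extra -= add
        v += 1
    return counts

def entropy_from_counts(counts):
    """Shannon entropy (bits) of the frequency distribution."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts)
```

Concentrating the repetitions in this way yields a lower vertex entropy than the uniform distribution of Configuration 1, as the boundary conditions allow.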
For the maximum edge entropy, the number of edges should be increased as much as possible while avoiding their repetition. For this purpose, Step 7: the vertices are distributed according to the initial conditions to obtain a configuration with x cliques of size q_max and y cliques of size q_min, possibly with one clique of size q_D, where q_min < q_D < q_max; this is referred to as Initial Configuration 3.
Step 8: The remaining n_0 − n repeated vertices are added to separate cliques so as to avoid the repetition of edges within the cliques (Final Configuration 3).
This procedure increases the number of maximum-size cliques, which causes an increase in the number of distinct edges and, consequently, in their entropy.
Using Fig. 4 as a starting point, we summarize in Fig. 5 the process to calculate the maximum and minimum limits for the entropy of vertices and edges.

Case n < n_q

For the TVG of this work, with w_8,1, n ≥ n_q in every window. For larger time windows, n < n_q may occur. In this case, some adjustments are required to calculate the limits; for example, in Configuration 1, q = 0, q + 1 = 1, y = n and x = n_q − n. This contradicts the boundary condition, since q = 0 < q_min. Thus, some of the n_0 − n repeated vertices will need to be distributed among the cliques so that each clique reaches the number of vertices q = q_min.

Results and discussion

The moments where entropy decreases from its maximum may indicate trends in the journal's vocabulary over time. The vertex entropy values are higher and vary substantially less than the edge entropy values, which shows that the windows contain clique networks with minimal edge overlap. Moreover, in various intervals, H_v and H_e have opposite growth trends. An increasing H_e implies the generation of new edges, which is possible due to the increment of repeated vertices in the cliques, causing H_v to decrease. In some of the study periods, an opposite growth trend was also observed between the journals for the edge entropy: one journal reached a high entropy value while the other had a low entropy value.
Although entropy measures are sensitive to sample size, we use the entire dataset collected in the study period. This approach enables a proper comparison of the two journals, which have similar entropy values. Note that the vertex entropies satisfy H_v ≅ log_2 n in every time window of both journals. The edge entropies, in contrast, deviate from their corresponding maximum in certain periods. Figures 7 and 8 show how the entropies are correlated with their respective maximum and minimum values.
We note a strong correlation between the entropy of the vertices and its maximum values, followed by the entropy of the edges with its minimum values, for both journals. (In Figs. 7 and 8, the line shows the linear fit for the points and highlights the difference between the correlation of the vertex entropies and that of the edge entropies; α is the linear fit coefficient and ρ is Pearson's correlation coefficient.) This suggests that, over time, the vocabulary of the journals maintained a high diversification for w_8,1, although for τ = T the vocabulary is not as diversified.
The entropy values calculated here do not require the use of a null model (i.e., a random network) for comparison. The process of constructing Configurations 1, 2 and 3 is already randomized. A network of cliques has high clustering, which means that a corresponding random network does not exist, since the clustering coefficient tends to zero (C → 0) in random networks (Watts and Strogatz 1998).
For each time window, the critical network was found using the incidence-fidelity index (Teixeira et al. 2010). These networks allowed us to identify the most relevant vertices (i.e., words) considering their connections. Figures 9 and 10 show the critical networks for Nature at t = 223 (the highest H_e) and t = 7 (the lowest H_e), together with the vertices considered hubs (k_i^hub ≥ ⟨k⟩ + 2σ).
As in Teixeira et al. (2010), the TVSNT studied in this work presented a critical network for IF_L = IF_c ≈ 10^−3. We highlight, on the one hand, that in the critical network derived from a network with high entropy, the hubs are poorly connected to each other, indicating greater diversity of the vocabulary. On the other hand, in the critical network derived from a network with low entropy, the hubs are strongly connected to each other, indicating the robustness and recurrence of the vocabulary.
An increase in the entropy measure may be associated with the emergence of new ideas, represented by the diversity of the vocabulary and of the connections between the words of the titles used to build the semantic network, while a decrease in the entropy measure may be associated with the robustness and consolidation of the ideas and interests of the authors and editors of a journal in a given time window.

Conclusions
The results of this study show a strong correlation between the entropy values and their respective maximum values, especially for the vertex entropy. It is thus reasonable to say that calculating the maximum entropy is practically equivalent to estimating the entropy itself. Figure 6 shows that the journals have a greater diversity of words than of word pairs: given a journal's vocabulary in a window, the number of possible word-pair combinations is far greater than the number of pairs actually repeated in titles.
When applying the IF index, we noticed that in the critical network it is possible to identify the main themes and how they are linked via their vocabulary (specifically, greater vocabulary diversity for networks with high entropy, and robustness and recurrence of vocabulary for networks with low entropy).
The measurement of vocabulary diversity and the diversity of connections between words in a semantic network of scientific article titles allows us to follow (i) the emergence of new ideas over time, represented by the increase in vocabulary diversity of titles or (ii) the robustness and consolidation of ideas and interests of authors and editors of a journal in a given time frame.
The method for constructing clique semantic networks is consistent with previous work regarding the vocabulary diversity of high-impact scientific journals. The study of vertex and edge entropies in clique networks can be combined with the study of the emergence of communities in these networks and with correlations with other indicators specific to this type of network (e.g., reference diameter and fragmentation; Fadigas and Pereira 2013).
Abbreviations IF: Incidence-fidelity; SNT: Semantic network of titles; TVG: Time-varying graph; TVSNT: Time-varying semantic network of titles