Fractal dimension analogous scale-invariant derivative of Hirsch’s index

We propose a scale-invariant derivative of the h-index, the “h-dimension”, which is analogous to the fractal dimension of the h-index and is intended for institutional performance analysis. The design of the h-dimension comes from the self-similar characteristics of the citation structure. We applied the h-dimension to data from 134 Japanese national universities and research institutes and found well-performing medium-sized research institutes, among which we identified multiple organizations related to natural disasters. This result is reasonable considering that Japan is frequently hit by earthquakes, typhoons, volcanic eruptions and other natural disasters. However, these characteristic institutes are overshadowed by larger universities if we depend on the existing h-index. The scale-invariant property of the proposed method helps to understand the nature of academic activities, which should promote fair and objective evaluation of research activities to maximize intellectual, and eventually economic, opportunity.

In this study, we propose a derivative of J. E. Hirsch's index (Hirsch 2005) as h-dimension, which is analogous to the fractal dimension of Hirsch's index (h-index hereafter) and has a property of being data-size invariant. h-dimension is developed as an institutional index, and it is not for individual researchers.
The authors, who are affiliated with the Council for Science, Technology and Innovation (CSTI), a department of the Japan Cabinet Office, are in charge of establishing evidence-based policy making (EBPM) by analyzing human and monetary resource investment, academic output, and economic and social outcome. This analysis is to be integrated into a data analysis platform called "e-CSTI", which is now partially available to the general public at https://e-csti.go.jp/en. The purpose of this system is to analyze and evaluate governmental policies and actions; it is not to be used for judging individual institutes or researchers or for allocating resources to them.
In Japan there are more than 100 national universities and research institutes. These organizations have divergent research activities in both scale and topics. Within the data set of this study, the number of articles differs by a factor of approximately 1.0e+04 between organizations. The research topics are also highly diversified: the research fields of major national universities are uniformly distributed from mathematics to human studies, but there are institutes specialized in a particular topic such as particle physics, medical science of a specific part of the human body, or genetics.
Among various methods and indices related to citation analysis, the h-index proposed in Hirsch (2005) has the advantage of robustly detecting stochastic features of citation data. Since then, multiple variations of the h-index have been created to meet different objectives and data variations. Alonso et al. (2009) is a comprehensive review of such early works, which introduces various applications of the original index and its derivatives. Among more recent works based on the h-index is Amane Koizumi's h5-index (Koizumi 2018), which has recently been gathering attention (see https://www.elsevier.com/__data/assets/pdf_file/0020/53327/ELSV-13013-Elsevier-Research-Metrics-Book-r12-WEB.pdf, page 40). The h5-index was developed to measure institutional performance by setting a time window of five years (hence the name "h5"). To be precise, let X(t) be the set of articles produced by some organization from its founding up to some year t; then the h5-index is defined as the h-index calculated on the article set X(t) − X(t − 5), provided that t ≥ 5. The definition of the h-index is given in "Self-similarity of citation and h-dimension" section. The h5-index is always associated with a particular interval of five years, for example, from 2014 to 2018, and it can be any five years if adequate data is available. Koizumi's work also directly inspired this study.
Still, we see difficulties in using the h-index (or h5-index) for institutional performance evaluation because these indices are heavily correlated with the number of articles on which they are defined. This means we cannot compare institutes with different numbers of researchers. Another difficulty comes from the fact that the citation network acquires vertices and links over time, and each research organization is at its own stage of development.
Fujita and Usami Applied Network Science (2022) 7:5
Self-similarity, or the fractal-like property, is an almost universal concept for analyzing and understanding complex structures. We consider this concept a primary principle for analyzing citation networks. We developed our h-index derivative, the h-dimension or h_d, to take advantage of this property. As a result, the index is scale-invariant by construction, which supports fair and accurate institutional evaluation. There is substantial criticism of the evaluation of individual researchers based on the h-index. Koltun and Hafner (2021) report the declining effectiveness of scientometric measures: the correlation of the h-index with scientific awards has dropped due to changing authorship patterns. Waltman (2016) claims that productivity over expenditure, or achievement per budget, is the key factor to be evaluated, and that any scientometric index which does not consider monetary cost is meaningless or harmful. Waltman and van Eck (2012) claim that the h-index is prone to noise, which makes the index lose its order-preserving property and consistency.
However, because this study is strictly institution-oriented, we consider such criticism not very relevant. Moreover, we believe that a set of academic articles which are richly linked by citation is more valuable than a disconnected one, and that creating such a knowledge-circulating system is an important mission of research institutions.
The present study is organized as follows: in "Self-similarity of citation and h-dimension" section, we examine the structural characteristics of citation, in particular its statistical self-similarity. This examination leads us to the fractal dimension of the h-index, the h-dimension or h_d. In "h-dimension and its implication" section we apply the proposed index h_d to the prepared data to study its properties and implications. By referring to h_d, we found well-performing medium-sized institutes, which are obscured by larger organizations if we depend on the original h-index. We also check the effect of the research field on h_d and find that it is not heavily affected by the research field selection. In "Conclusion and future work" section we summarize this study and discuss future work.

Self-similarity of citation and h-dimension
In this section, we propose "h-dimension", a scale-invariant derivative of the h-index, by considering the self-similar characteristics of the citation network. It is analogous to the so-called fractal dimension of the h-index, hence it is named as such.
As we briefly discussed in "Introduction" section, the citation network represents the propagation of knowledge, which means understanding structural characteristics of the citation network will give us insight into the flow of knowledge. For the purpose of structural discussion, we begin with formulating the citation network.
Let X = {u, v, ...} be the set of articles we are concerned with, and E = {(u, v), (u, w), ...} be the set of citation relations between the elements of X; (u, v) ∈ E implies (v, u) ∉ E because citation is asymmetric. Note that a citation relation (u, v) means that the article u is cited by the article v. The pair of sets (X, E) defines the "citation network" as an acyclic directed graph. Let c be a function from X to ℕ, the set of natural numbers, defined as c(u) = |{v | (u, v) ∈ E}|; the function c counts how many times u is cited, that is, the in-degree of the vertex u.
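As a minimal illustration of the definitions above, the in-degree function c can be computed directly from a list of citation pairs. The article labels and the helper name `citation_counts` are hypothetical and are not taken from the study; each pair (u, v) is read as "u is cited by v".

```python
from collections import Counter

def citation_counts(articles, edges):
    """Return c(u) for every article u, where (u, v) in edges means u is cited by v."""
    counts = Counter(u for u, _ in edges)
    # Articles that are never cited get an explicit count of zero.
    return {u: counts.get(u, 0) for u in articles}

# Toy network: "A" is cited by "B" and "C", and "B" is cited by "C".
X = ["A", "B", "C"]
E = [("A", "B"), ("A", "C"), ("B", "C")]
c = citation_counts(X, E)
```

Only the values of c are needed for the h-index, so the rest of the analysis can work on a plain list of citation counts.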
Let D be a distribution function in general. For simplicity, let D(b) be the probability that x < b, where x is the random variable on which D is defined.
The h-index h is defined as follows: let U(x) be the subset of X such that all the members of U(x) have x or more incoming links; then
$$h = \max\{x : |U(x)| \ge x\}. \tag{1}$$
Equation 1 can be written more simply by using the empirical distribution function D as follows: let n = |X| be the number of articles, and let $\mathcal{D}$ be the set-valued function $\mathcal{D}(x) = \{u \mid u \le nD(x)\}$. Then h can be written as
$$h = \max\{x : x \in \mathcal{D}(x)\}. \tag{2}$$
Note that Eq. 2 describes the h-index as a fixed point based on an empirical distribution function, and this fact directly yields an effective h-index calculation algorithm deployed in this study, which is described in "Appendix" with a sample code.
Let H be the subset of X which defines the h-index, namely,
$$H = \{u \mid c(u) \ge h\}. \tag{3}$$
The citation network evolves with time as new articles are added to it. Therefore, it is natural to identify the increment of n, the size of the data, or |E| with ongoing time t, which is a common approach to modeling how a citation network (or other real-life complex network) is built (see Price 1976 or Barabasi and Albert 1999 for example). Each research institute or university has its own history. An older institute is likely to have a larger set of articles if the number of researchers is similar. On the other hand, the expected h-index of a larger set of articles is larger if the citation distribution is the same (footnote 1).
Footnote 1: The median of the personal h-indices of the researchers affiliated with the institute is a good candidate for institutional performance measurement. However, the median is not available because a significant number of research-related people never appear as authors of academic papers. The proposed index of this study has the advantage that it depends only on the academic paper database.
Therefore, a simple comparison of the h-indices of different institutes' data may only mean that one data set has more articles, which means the raw h-index is inadequate for institutional evaluation. Actually, the same difficulty exists in the case of the personal h-index, which is addressed in the original paper of Hirsch (2005) by dividing the raw index value by the number of years of being an active researcher.
The h5-index, which sets a fixed five-year window to collect the data, overcomes this difficulty by taking a snapshot of uniformly controlled exposure. Despite this improvement, we still have the following concern about the growth of the citation network.
A research institute goes through its own process of development. Suppose a newly established institute N is in its early stage, while another institute M is in its mature stage, and these two institutes share a similar index value. If we were to conclude from this index that N and M achieved similar performance, the conclusion would have very limited significance because N is doing better. A comparison between cases at similar stages of development can be meaningful; otherwise, we are uncertain despite the controlled observation window.
Figure 1 is a visualization of an artificially created acyclic directed graph with an in-degree distribution similar to that of a typical citation network. The layout is configured to place a cited article below its citing article. Because the citing article necessarily comes after the cited one, the network "grows" upwards like a tree grows to the sky.
Inside Fig. 1, we can identify a sub-network by selecting all the nodes that refer to "B" directly or indirectly. This sub-network is similar to the original network (which consists of the articles that cite "A" directly or indirectly) in the sense that they share statistical properties, for example, the degree distribution. Benoit Mandelbrot refers to this property as statistical self-similarity in his book (Mandelbrot 1977), Chapter XII, p. 276.
In fact, the fractal-like property of complex networks has been gathering attention (for example Song et al. 2005; Corominas-Murtra et al. 2013 or Zhao et al. 2006). The hierarchical characteristic of an acyclic network is sometimes called "rank" structure from the fact that it is often embedded in a one-dimensional ordered space (see Newman 2018 14.7, p. 564).
The h-index defining set H of Eq. 3 and its associated links E_H also define a subgraph (H, E_H), which occupies the lower part of Fig. 1. The in-degree distribution D_H of this subgraph is obtained from the original distribution function D as
$$D_H(u) = \frac{D(u)}{D(h)}$$
provided that h is the h-index value of the whole citation network and u ≥ h. Therefore, the subgraph (H, E_H) shares its degree distribution with the original network (X, E) except that it lacks the long tail, or lower-degree part, of the distribution.

Fig. 1
A citation network artificially constructed for the purpose of illustration. Each dot represents an article, and each straight line is a citation relation. To visualize citation direction, a cited article is placed below the citing article. The article "A" (bottom red dot) comes first within this network, then "B" comes after it to cite "A". The network above "A" (the whole network) is similar to the one above "B" in the sense that they share some statistical properties like degree distribution
Obviously, the empirical data necessarily have finite steps of similarity, i.e. we cannot go down to statistically similar sub-networks infinitely. In other words, the level of detail of the observed network is limited. Due to this self-similarity and the nature of the citation relation, adding a new node (or growth of the network) also means adding detail to the network.
If the sub-network H converged to a stationary state as t → ∞, such a terminal state would constitute a fair foundation for evaluation that is indifferent to the data scale or the institute's history. However, it is unlikely that such a stationary state exists because of the statistical self-similarity of the citation structure (footnote 2). It means that infinitely detailed observation leads to an infinitely larger observed value, which makes comparison impossible. This difficulty is analogous to the measurement of coastline length, which famously diverges to infinity as the mesh of the survey becomes finer for more detailed measurement (see Mandelbrot 1977, chapter 2).
Fortunately, we already know how to treat a measurement of self-similar structure that diverges with the observation scale, which is the fractal dimension.
According to this knowledge, we propose the h-dimension h_d as follows:
$$h_d = \frac{\log h}{\log s} \tag{4}$$
where h is the h-index (or h5-index) value, which is the objective measurement, and s is the size of the network, which is the inverse of the scale of observation. The number of articles cannot be used as s because it only counts the vertices which have incoming links. In practice, s can be obtained as the sum of the citation counts, $s = \sum_{u \in X} c(u)$.
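The definition can be sketched in a few lines of Python, reading Eq. 4 as h_d = log h / log s (which is consistent with the later relation h = s^{h_d}) and taking s as the sum of citation counts. The function names and the toy citation list are illustrative assumptions, and raising an error when h = 0 or s < 2 is a choice made here, since the source does not discuss those cases.

```python
import math

def h_index(citations):
    """Largest h such that at least h articles have h or more citations."""
    ranked = sorted(citations, reverse=True)
    # In descending order, the condition c >= rank holds exactly on the first h ranks.
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def h_dimension(citations):
    """h_d = log(h) / log(s), with s the total number of citations."""
    h = h_index(citations)
    s = sum(citations)
    if h < 1 or s < 2:
        # Assumption of this sketch: treat degenerate data as undefined.
        raise ValueError("h_d is undefined when h = 0 or s < 2")
    return math.log(h) / math.log(s)

toy = [3, 0, 6, 1, 5]      # hypothetical citation counts of five articles
toy_h = h_index(toy)       # three articles have at least 3 citations
toy_hd = h_dimension(toy)  # log(3) / log(15)
```

The base of the logarithm cancels in the ratio, so natural logarithms are used only for convenience.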

h-dimension and its implication
In this section, we examine the properties and implications of h_d. In the first subsection, we describe the preparation process of the data. In "Application and examination of h-dimension" section we apply the proposed index h_d to the collected data and analyze the theoretical and empirical properties of h_d, checking in particular whether it is scale-invariant. In "Research field and h-dimension" section we examine the effect of the research field selection on h_d and give some intuitive understanding of the proposed index through statistical analysis. In "Adversary strategy against h-dimension" section, we examine h_d from the "opposite side of the game" and try to construct strategies to deceive the index.

Data preparation
We used a bibliometric dataset of research articles published in 2014-2018 by national university corporations, national research and development agencies and inter-university research institute corporations in Japan. As a starting point for data collection, a list of institutional identifiers was prepared by using a global research identifier database (Data-Science I 2019). As a result, 134 GRID (Global Research Identifier Database) ids were obtained, including those of all 86 national university corporations in Japan. We used the Dimensions analytics API (https://docs.dimensions.ai/dsl/) as the platform, which permits us to extract fundamental bibliometric data on a research institute for a given period and a specific research field by a simple query. The query language is similar to common SQL with some extensions.
Since a single query returns at most 1,000 results, it is necessary to add the operator skip, followed by the offset, to obtain all the data when the total count of the results exceeds 1,000. This iteration over the offset can be done up to a maximum of 50,000 results. No query for a specific research institute and research field gave more than 50,000 results. Therefore, by iterating such queries over the 134 research institutes and 22 research fields, we successfully collected the dataset for further analysis.
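The offset iteration described above can be sketched as a generic pagination loop. This is only an illustration under stated assumptions: `run_query` is a hypothetical callable that sends one query string and returns a list of records, and appending `limit ... skip ...` to the query text is our reading of the paging operators mentioned in the text, not a verified excerpt of the authors' scripts.

```python
def fetch_all(run_query, base_query, page_size=1000, max_results=50000):
    """Collect all records of a query by stepping the offset page by page.

    run_query  : hypothetical callable, takes a query string, returns a list of records
    base_query : query text without paging operators (assumed format)
    """
    records = []
    for skip in range(0, max_results, page_size):
        page = run_query(f"{base_query} limit {page_size} skip {skip}")
        records.extend(page)
        if len(page) < page_size:
            # A short page means the result set is exhausted.
            break
    return records
```

In the study's setting this loop would be repeated for every combination of institute and research field, and the unique publication ids would then be merged per institute.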
The number of citations for each article is as of 14 March 2020 in Dimensions, when we collected all the data. The dataset of this study consists of the citation counts of 550,602 papers. To measure the h5-index of each research institute, a set of unique publication ids over all research fields and their numbers of citations was used. The h5-index was then measured according to the definition given by Hirsch (2005), using the algorithm described in "Appendix". Technically, the only difference between the h5-index and the h-index is that we used the research articles of the five years from 2014 to 2018 in order to measure the recent activities of a research institute instead of its whole activities. Figure 2 shows the h-index (h5-index) values of the 134 institutes and their numbers of institutional articles in a log-log plot. It is clearly seen that the h5-index and the number of articles show a strong positive correlation, whose coefficient is 0.85. Figure 2 also empirically denies the existence of the stationary terminal state of the h-index defining sub-network H, which is discussed at the end of "Self-similarity of citation and h-dimension" section. Thanks to this improvement, we could find well-performing medium-sized organizations with the number of institutional articles ranging from several hundred to ten thousand. Among these organizations we could identify several research institutes focused on natural disasters, shown as red "x" marks in Figs. 2 and 3. Several institutions scored even higher than the disaster-related organizations; we do not refer to them any further to avoid identifying them.

Application and examination of h-dimension
This result is reasonably understood because Japan is frequently hit by various natural disasters, like typhoons and earthquakes. In contrast, these characteristic leading institutes are overshadowed by larger organizations in Fig. 2. It is evidently seen that the proposed index is useful to evaluate institutional performance by a relatively simple calculation.
However, we cannot easily conclude that the result of Fig. 3 is actually representing institutional performance. The organizations in the data have different research fields, which are known to have different citation conventions.
In the following part, we will show that the proposed index h_d is robust to differences in research field configuration. This is due to the scale-invariant design of h_d and the property of the original h-index of selecting an essential part of the data.

Research field and h-dimension
In general, the research field is a major factor in citation outcomes; for example, Qian et al. (2017) show that even the sub-fields within computer science have a significant effect on citation rates. Various field-normalized bibliometric methods have been devised and utilized to compensate for such variation for the purpose of fair and accurate evaluation (see Ahlgren and Sjögårde 2015; Bornmann and Haunschild 2016; Reddy et al. 2020).
To begin with, we estimate the effect of the research field selection on h_d. The value of the h-index is determined by the distribution of citations as described in Eq. 2. The data we are analyzing actually have a joint distribution over research institutes, research fields, and citation counts. Here we define three random variables as follows: let the research field be F, the institute G, and the citation count C. Let P(X) be the probability that a statement X is true; namely, P(F = f, G = g, C = c) denotes the probability that F = f, G = g, and C = c for some research field f, institute g, and citation value c.
Then the distribution function D(h) of Eq. 2 can be written as the marginalization of the original data: $D(h) = \sum_{G} \sum_{F} P(F, G, C < h)$. The research field distribution of some institute g_i can be written as $\sum_{C} P(F, G = g_i, C)$. To see the effect of research field difference, we must obtain a citation distribution which is independent of the particular institute g_i. This can be performed by calculating
$$\sum_{f} P(C \mid F = f)\, P(F = f \mid G = g_i) \tag{5}$$
of the data. The aggregated term of Eq. 5 is the conditional distribution of research field f multiplied by the proportion of the field f in the research institute g_i. Note that Eq. 5 uses the distribution of all the articles that belong to research field f, and not the observed distribution of the particular institute g_i. Adding this term over the research fields gives a randomly controlled citation distribution under the condition of the research field selection of the particular institute g_i.
Figure 4 is a scatter plot of h_d of the randomly controlled data of Eq. 5 (the horizontal axis) and the corresponding institute's observed h_d value. In practice, the control is obtained by averaging 200 runs of the randomly selected results. The plot shows a weak positive correlation between the controlled expectation and the observed value, which is represented by the green segment. However, the residual reached nearly 90 percent of the observed variance. We also checked Kendall's rank correlation coefficient between the controlled expectation and the data, which was 0.2. In summary, only 10 to 20 percent of the institutional h_d variation is explained by research field selection. Much of the h_d variation comes from outside of the research field selection.
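One way to realize the randomized control is sketched below: for each run, citation counts are drawn from the pooled articles of each research field in the same proportions as the institute's own articles, h_d is computed on the draw, and the runs are averaged. Sampling with replacement, the data layout (`field_pools` as a dictionary of citation lists) and all function names are assumptions of this sketch; the source only states that 200 randomly selected results were averaged.

```python
import math
import random
from collections import Counter

def h_index(citations):
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def h_dimension(citations):
    # Undefined (math domain or division error) when h = 0 or the citation sum is below 2.
    return math.log(h_index(citations)) / math.log(sum(citations))

def controlled_h_dimension(institute_fields, field_pools, runs=200, seed=0):
    """Average h_d of random draws that keep only the institute's field mix.

    institute_fields : one field label per article of the institute
    field_pools      : field label -> citation counts of all articles in that field
    """
    rng = random.Random(seed)
    field_sizes = Counter(institute_fields)
    values = []
    for _ in range(runs):
        sample = []
        for field, size in field_sizes.items():
            # Draw from the field-wide pool, not from the institute's own articles.
            sample.extend(rng.choices(field_pools[field], k=size))
        values.append(h_dimension(sample))
    return sum(values) / len(values)
```

Comparing this controlled value with the observed h_d of the same institute, one institute per point, gives a scatter plot of the kind described for Fig. 4.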
We will further analyze how this almost research-field-invariant property of h_d is realized. Table 1 lists the mean citation, h5-index, and h-dimension of each research field, calculated from the prepared data as described in "Data preparation" section. We can see that the mean citations in the 3rd ("Mean") column and the h5-index in the 4th column vary greatly. Also, these two indices have a strong positive correlation with the size of the data.
In contrast, the proposed index h_d listed in the far right column of Table 1 shows nearly constant values.
As we discussed in "Self-similarity of citation and h-dimension" section, the expectation of the original h-index is a monotonically increasing function of the number of articles. Therefore, the positive correlation of the h-index with the number of articles in Table 1 is natural.
However, h_d, although scale-invariant by design, is not guaranteed to yield uniform values because different research fields may have different citation distributions.
In Fig. 5, we can see that each research field shows an approximate power-law distribution above its h-index value (see footnote 3). Although these distributions differ in their less frequently cited part, the h-index defining parts have very similar distributions to each other. We will see this relation in more detail with a simple analytic calculation and an empirical check as follows. The h_d definition of Eq. 4 can be transformed to $h = s^{h_d}$, where h_d serves as the exponent over the inverse of the observation scale s. Then, $s = n\frac{a}{a-1}$ because the expectation of a power-law distribution with exponent a is $\frac{a}{a-1}$. This is not valid for the overall distribution of citation data, but it holds above the h-index defining point.
Consequently it holds that
$$h = k\,n^{h_d} \tag{6}$$
for a constant $k = \left(\frac{a}{a-1}\right)^{h_d}$.
By substituting D(x) of Eq. 2 with $bx^{-a}$, where b is a normalization constant that makes it compatible with a probability distribution, we obtain $h = nbh^{-a}$, which is transformed to
$$h = (bn)^{\frac{1}{a+1}}. \tag{7}$$
Equation 7 is not a new result; it is already described in Pratelli et al. (2012). From Eqs. 6 and 7 we have $n^{\frac{1}{a+1}} \sim kn^{h_d}$. Therefore
$$h_d \sim \frac{1}{a+1} \tag{8}$$
is obtained.
To confirm the relation between h_d and the power-law exponent in Eq. 8, we estimated the power-law exponents of the 134 data sets by applying Hill's estimator (Hill 1975) and compared them with h_d.
The estimation result is shown in Fig. 6, where we can see that the mode is approximately a = 1.2. Alternatively, we can estimate the power-law exponent by way of the h-index: comparing Eq. 7 with the fit function of Fig. 2 gives $0.47 = \frac{1}{a+1}$, which yields a = 1.14. The two estimates agree quite well.
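For readers unfamiliar with Hill's estimator, a minimal sketch follows. It uses the standard form of the estimator on the k largest observations; the choice of k and the synthetic Pareto sample are assumptions made here for illustration, since the source does not state which threshold was used for the 134 data sets.

```python
import math
import random

def hill_estimator(values, k):
    """Hill estimate of the tail exponent a from the k largest positive values."""
    ordered = sorted((v for v in values if v > 0), reverse=True)
    if not 0 < k < len(ordered):
        raise ValueError("k must satisfy 0 < k < number of positive values")
    threshold = ordered[k]  # the (k+1)-th largest value
    log_excess = sum(math.log(x / threshold) for x in ordered[:k])
    if log_excess == 0:
        raise ValueError("the k largest values are all equal to the threshold")
    return k / log_excess

# Synthetic check on a Pareto sample whose tail exponent is set to 1.2.
rng = random.Random(1)
sample = [rng.paretovariate(1.2) for _ in range(20000)]
estimate = hill_estimator(sample, k=2000)
```

On real citation counts, ties and the discreteness of the data make the estimate sensitive to k, so it is usually inspected over a range of k values rather than at a single one.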
However, it should be noted that the exponent of the power-law distribution and h_d are based on totally different principles. The power-law exponent is a result of fitting a [...]
In summary, the h_d values of the research fields are stable because all the research fields share very similar distributions above the h-index defining values, which we can see in Fig. 5.
Following the definition of h_d and the self-similar characteristics of the citation network, we can infer that a network generated by unifying multiple statistically similar networks (which means these networks share the same h_d value) will also yield the same h_d value. This is consistent with the fact that the original networks are statistically similar subgraphs of the newly generated network. In other words, h_d is invariant to the unification of identically distributed data sets.
As a result, we can claim that the difference in research field selection has a limited effect on the institutional h_d, which means most of the h_d variation does not originate from the research fields.
If we stay in the realm of self-similar citation structure, Eq. 8 will give us a rough idea of what h_d is trying to measure. A high h_d, or a small exponent of the power-law distribution, implies that there is an active and effective knowledge production process which provides readily available knowledge with quality and quantity to create new knowledge. In terms of citation structure, it is a richly connected network.
Fig. 6 Estimated power values of the Japanese institutes' data set distribution. The plot shows probability density to see the mode, which is approximately 1.2
Fig. 7 Two artificially constructed acyclic directed graphs with the same number of vertices and different h_d values. The graphs are built by adding a vertex repeatedly, just like a real-life complex network. Older vertices are shown in red, then yellow and green, and the newest in blue, and the vertex size is proportional to its in-degree. The other layout configurations are identical to Fig. 1
A pair of artificially constructed acyclic directed graphs with the same number of vertices and different h_d values are visualized in Fig. 7. The left (lower h_d) graph consists of several communities which are sparsely connected with each other, and the higher h_d graph is more tightly connected. If these two graphs represent citation relations, the higher h_d graph seems to have more foundational research activities shared among the researchers than the lower h_d one.

Adversary strategy against h-dimension
In this subsection we discuss the properties of h_d by trying to conceive a work-around to inflate the index.
Following Eq. 4, the h-dimension is defined as
$$h_d = \frac{\log h}{\log \sum_{u \in X} c(u)}. \tag{9}$$
Therefore, the basic heuristics of the strategy are (1) to gain the h-index value, and (2) to keep the sum of citations small.
For a given citation sum C, the largest possible h-index is the integer part of √C. Thus, the h-dimension has a range of 0 ≤ h_d ≤ 0.5.
The network with the maximum h_d value has a unique degree distribution. Suppose the h-index defining set has h vertices; then all of these vertices have an in-degree of h, and no other vertex has any incoming link. Let us refer to a network that has an h-dimension of 0.5 as "h_d-optimized".
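The bound can be checked numerically with a small sketch. It builds a citation list of the "h_d-optimized" shape (h articles cited exactly h times, all others uncited) and compares it with the same list after one extra citation that does not raise the h-index. The value h = 20 and the helper names are arbitrary illustrative choices, and h_d is read as log h / log s with s the citation sum.

```python
import math

def h_index(citations):
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def h_dimension(citations):
    h, s = h_index(citations), sum(citations)
    if h < 1 or s < 2:
        raise ValueError("h_d is undefined for this data")
    return math.log(h) / math.log(s)

h = 20
optimized = [h] * h + [0] * 100  # h articles cited h times, 100 uncited articles
natural = optimized + [1]        # one more citation that does not contribute to h

hd_optimized = h_dimension(optimized)  # log(20) / log(400)
hd_natural = h_dimension(natural)      # log(20) / log(401)
```

The first value sits at the upper bound of 0.5 and the second falls just below it, which is the effect described in the text: every citation outside the h-index defining core lowers h_d.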
In a naturally grown citation structure, when an article is cited it becomes more likely to be found and cited by other researchers. But in order to reach the h_d-optimized status, the natural citation mechanism has to be stopped to avoid citations beyond the targeted h-index value, because any citation that does not contribute to the h-index brings down the h-dimension.
Another way to gain h_d is to decrease the denominator while keeping the numerator of Eq. 9 by eliminating articles which do not contribute to the institutional h-index. This is unlikely to be implemented because all newly published articles have no citations.
These two by-pass strategies are naive, and much more sophisticated work-arounds will be devised eventually. But for the time being, we consider that h_d is not particularly easy to work around if we stay in the world of self-similarity.

Conclusion and future work
We proposed an index, the h-dimension or h_d, as a derivative of the h-index which is analogous to the fractal dimension of the original h-index. Unlike the original h-index, h_d is not only invariant to the number of articles by design but also robust to differences in research fields, if not completely independent of them.
Due to the self-similar property of the citation network structure, the h-index is strongly and positively correlated with the number of articles, which grows as time goes on. Most of the difficulties in comparing research organizations come from this fractal property, and we already have an excellent tool to analyze this problem, the "fractal dimension", which was named by Benoit Mandelbrot. h_d is defined as the fractal dimension of the h-index.
We prepared a citation data set of 134 Japanese national universities, national research and development agencies and inter-university research institutes for the years 2014 to 2018 by using the Dimensions analytics API of DigitalScience, Inc., and applied h_d to the data. By virtue of the scale-invariant property of h_d, we could find several medium-sized research institutes that perform excellently, among which we could identify multiple organizations focused on natural disasters, a reasonable result considering the natural environment of Japan. These characteristic institutes are obscured by major organizations if we depend on the conventional h-index.
We carried out several analyses of the properties of h_d from various angles. We examined the effect of the research field on the h_d value and found that the original h-index (and therefore h_d as well) is closely related to the exponent of the power-law distribution, which is quite similar among research fields above their h-index values. This is the reason why h_d is quite robust against differences in the research field. This property can practically exclude the effect of research field differences from institutional h_d values.
We also examined how visually different a graph with a higher h_d is from one with a lower h_d, and found that the lower h_d graph is separated into multiple loosely inter-connected communities. In order to understand the behavior of h_d from a different point of view, we also tried to "attack" h_d and gain the value without following the supposed growth procedure of the citation structure, and found that the mechanism behind the self-similarity of the citation structure is natural and hard to destroy.
International comparison of research organizations based on h_d would be an interesting research topic. The flip side of the discovery of the medium-sized institutes also constitutes a theme for future investigation, i.e., the reason why the organizations which produced more than 1.0e+04 articles achieved consistently low h_d values.
Although a single measure can never be the final solution for achieving fair and accurate institutional evaluation, we believe the h-dimension can help in making good strategic decisions to maximize intellectual or economic opportunity. We hope that network studies based on the principle of fractals will become even more active; if this study could encourage that, the authors could not be more delighted.
the rank $h$ of such $x_h$ is the value to be obtained. We will refer to this process as the "sorting method" hereafter.
The major part of the computational cost of the sorting method lies in sorting the data set $X$, which is expected to be $O(n \log n)$.
The advantage of the sorting method is that it is reasonably fast and straightforward. The problem is its inefficiency: sorting the elements beyond the $h$-th is unnecessary, and in fact the first $h$ elements need not be sorted either.
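For reference, the sorting method can be sketched in Python as follows. This is a minimal illustration, assuming the usual reading of the method (sort the values in descending order and take the largest rank $h$ whose value $x_h$ is at least $h$) and non-negative integer citation counts; the function name `h_index_sorted` is hypothetical and does not come from the paper.

```python
def h_index_sorted(citations):
    """Sorting method: rank the values in descending order and return the
    largest rank h whose value is still at least h."""
    ranked = sorted(citations, reverse=True)  # dominant O(n log n) step
    h = 0
    for rank, value in enumerate(ranked, start=1):
        if value >= rank:
            h = rank   # the rank-th largest value still supports h = rank
        else:
            break      # every later value is smaller, so stop scanning
    return h
```

The scan after the sort stops at the first rank that fails the test, which makes the inefficiency visible: the full ordering produced by `sorted` is never actually used.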
The algorithm described here is a direct consequence of the fact that the h-index is a fixed point of the empirical distribution function, which is formulated as Eq. 2 in the "Self-similarity of citation and h-dimension" section. Figure 8 illustrates the outline of how the proposed algorithm works. The value to be calculated lies at the crossing point of the diagonal green line and the empirical distribution function (purple). To find this point, we start from a value $P_0$ selected randomly from the whole data and repeat the following process: randomly select $P_i$ from the data segment $I_{i-1}$. If the rank of $P_i$ is larger (or smaller) than the value of $P_i$ itself, the fixed point is in the segment above (or below) $P_i$, which is set as $I_i$. Repeating this process eventually reaches the fixed point of Eq. 2.
However, from Fig. 8, it may seem that the algorithm requires totally sorted data. Therefore, we give a non-visual description in the remainder of this section.
In the context of information science, the proposed algorithm is a derivative of an algorithm commonly referred to as "quickselect", which outputs the $k$-th largest (or smallest) element from given data of size $n$ in an average computational time of $O(n)$. Quickselect itself is a variation of quicksort; both were developed by the same person, C. Hoare (published in 1961 as Hoare 1961). Quickselect works on a given rank $k$, which is not known when the h-index is to be calculated.

Fig. 8 An intuitive outline of the algorithm. The vertical axis shows the rank; upward means a larger rank number, i.e., a lower rank. The horizontal axis shows the value of the data. The purple curve shows the empirical distribution function of the data. To aid comparison between a value and its rank, a green diagonal straight line is added. Calculating the h-index value means finding the point where the rank and the value are equal (or nearest), which is the crossing point $H$. In the beginning, $P_0$ is randomly selected from the whole data. Since $P_0$ has a larger rank than its value, $H$ must be in the segment $I_0$ $(\ge P_0)$. The next value $P_1$ is randomly selected from $I_0$. $P_1$ has a smaller rank than its value, therefore $H$ should be in the segment $I_1$ $(\le P_1)$. Then $P_2$, which has a larger rank than its value, is randomly selected from $I_1$, and $P_2$ sets the next segment $I_2$. The process is repeated two more times to finally reach $H$. The black winding curve at the bottom of the figure depicts the search path.
1. The algorithm takes two arguments: the data array to be processed and an additional numeral, count, which keeps the temporary value of the index while processing. count is initialized to 0.
2. Create two empty arrays, upper and lower. Create a numeral eq initialized to 0.
3. Pick a single element from the data as pivot and compare it with each of the data's elements. If pivot < element, push the element into the upper array. If pivot > element, push it into lower. If pivot = element, increment eq by one and discard the element.
4. If upper.length + count − pivot is equal to 0 or 1, return pivot and exit. This is the BINGO situation.
5. Otherwise, if upper.length + count > pivot, the h-index should be found in upper. Consequently, return to step 1 with upper as the data argument and the numeral argument count unchanged.
6. Otherwise, the h-index should be found in lower. Therefore, restart from step 1 with lower as the new data. As eq + upper.length elements were found above lower, the numeral argument must be incremented to count + eq + upper.length.
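The steps above can be sketched in Python as follows. This is an illustrative variant under stated assumptions, not the authors' implementation. It assumes non-negative integer citation counts and the standard definition of the h-index (the largest $h$ such that at least $h$ values are $\ge h$). The names `h_index_select` and `rng` are hypothetical. The sketch also departs from step 4 in two ways: the BINGO test is widened to cover tied values, and `count` is returned when the segment becomes empty, so that an h-index that is not itself a data value is still found.

```python
import random

def h_index_select(data, rng=random):
    """Quickselect-style h-index for non-negative integers (sketch)."""
    segment = list(data)
    count = 0  # elements already known to lie above the current segment
    while segment:
        pivot = rng.choice(segment)                 # random pivot (step 3)
        upper = [x for x in segment if x > pivot]
        lower = [x for x in segment if x < pivot]
        eq = len(segment) - len(upper) - len(lower)
        above = count + len(upper)                  # values strictly > pivot
        if above > pivot:
            # More than `pivot` values exceed pivot: h lies in upper (step 5).
            segment = upper
        elif above + eq >= pivot:
            # At least `pivot` values are >= pivot and at most `pivot` exceed
            # it, so h equals pivot (generalised BINGO test, an assumption).
            return pivot
        else:
            # Fewer than `pivot` values are >= pivot: h lies below (step 6).
            count = above + eq
            segment = lower
    return count  # empty segment: h equals the number of values above it
```

For example, `h_index_select([10, 8, 5, 4, 3])` returns 4 whichever pivots are drawn, because each branch only discards values that cannot change the answer.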
Each time we go back to the beginning of the algorithm, the data to be processed is expected to be half the size (see footnote 4). Consequently, the total number of comparison and data separation operations is expected to be proportional to $n$ (Eq. 10). Figure 9 plots the performance measurement, in which the processing time of the proposed algorithm responds linearly to the input size, as expected from Eq. 10. In comparison, the sorting method shows a downward-convex response, which is again expected from the burden of the sort operation. The two methods were tested with identical randomly generated data sets of designated sizes, and twenty runs were averaged for each data point.

Fig. 9 Performance of the proposed algorithm (plus symbol); the horizontal axis is the input data size and the vertical axis is the averaged elapsed time. The result of the sorting method (cross symbol) is added for comparison and shows a downward-convex shape. $n \log(n)$ fits the sorting-method result with a residual of 7.7e-04, which is significantly better than the linear fit (shown) with a residual of 1.6e-03.
Because the total computational time is the sum of the computational times needed to process each segmented data set, if the segmentation is consistently unbalanced, for example into a single-element array and the rest, the total computational time becomes $\frac{n(n+1)}{2}$. This means the worst case takes $O(n^2)$ of computational time, which is noted in most information science textbooks (see Press et al. 2007, Chapter 8). A time complexity of $O(n^2)$ cannot be tolerated if we are to process large-scale data.
Fortunately, we have virtually no need to consider this worst case, as the computational time converges to $O(n)$ in probability. We give a brief proof of convergence as follows. Let $a_i$ be the size of the $i$-th segmented data while the algorithm is running. As we select the pivot randomly, the range of $a_{i+1}$ is $[0, a_i - 1]$, and within this range $a_{i+1}$ is distributed uniformly. Let the expected upper bound of each segment size $a_i$ be $R(a_i)$ (the lower bound is always 0, which corresponds to the BINGO situation). Additionally, for any random variable $x$, let $E(x)$ be its expectation and $\mathrm{Var}(x)$ its variance, as usual.
Clearly, $R(a_{i+1}) = E(a_i)$, which is $\frac{R(a_i)}{2}$ because $a_i$ is also distributed within its range. Therefore $\mathrm{Var}(a_{i+1}) = \frac{\mathrm{Var}(a_i)}{4}$, and the first partition size has a variance of $\mathrm{Var}(a_1) = \frac{n^2}{4}$. The variance of the overall computation then follows as $\mathrm{Var}\left(\sum_i a_i\right) \le \sum_i \mathrm{Var}(a_i) = \frac{n^2}{3}$. Let $C(n) = \sum_i a_i$ be the total computational time to process data of size $n$. From the definition of $O(n)$, $C(n) \neq O(n)$ is rewritten as $\forall \epsilon\; C(n) > \epsilon n$. Because $\epsilon$ is arbitrary, we can set it to a monotonic function $\epsilon(n)$ that is unbounded above and rewrite the condition as $C(n) > \epsilon(n)\, n$. From the fact that $\mathrm{Var}(C(n)) \le n^2$, the bound $P(C(n) > O(n)) \le \frac{1}{\epsilon(n)}$ follows directly from Chebyshev's inequality, which implies our claim that the computational time converges to $O(n)$ in probability.
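The two quantitative steps of this argument can be written out explicitly. The following is only a sketch: it takes the recursion $\mathrm{Var}(a_{i+1}) = \mathrm{Var}(a_i)/4$, the starting value $\mathrm{Var}(a_1) = n^2/4$, and the bound $\mathrm{Var}(C(n)) \le n^2$ from the text as given. Stating Chebyshev's inequality around the mean $E(C(n))$ and restricting to $\epsilon(n) \ge 1$ are assumptions made here for illustration.

```latex
% Geometric series behind the variance bound, then Chebyshev's inequality.
\begin{align*}
\sum_{i \ge 1} \mathrm{Var}(a_i)
  &= \frac{n^2}{4} \sum_{j \ge 0} \left(\frac{1}{4}\right)^{j}
   = \frac{n^2}{4} \cdot \frac{1}{1 - \tfrac{1}{4}}
   = \frac{n^2}{3} \le n^2, \\
P\bigl(\lvert C(n) - E(C(n)) \rvert \ge \epsilon(n)\, n\bigr)
  &\le \frac{\mathrm{Var}(C(n))}{\epsilon(n)^2\, n^2}
   \le \frac{1}{\epsilon(n)^2}
   \le \frac{1}{\epsilon(n)}
   \quad \text{for } \epsilon(n) \ge 1 .
\end{align*}
```

Since $\epsilon(n)$ is unbounded above, the right-hand side tends to 0 as $n \to \infty$, which is the convergence in probability claimed in the text.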
The above discussion also ensures that we encounter only finitely many cases with $C(n) > O(n)$ even as $n \to \infty$. In fact, the probability $P\left(C(n) = \frac{n(n+1)}{2}\right)$ is $\frac{1}{n!}$, which is less than 1.0e-150 when $n$ is only 1.0e+02, whereas the typical data we contemplate has a size of up to 1.0e+08. Indeed, as seen from Fig. 9, the proposed algorithm runs more than ten times faster than the conventional sorting method even when the data size is only 1.0e+05.
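The linear behaviour of $C(n)$ can also be checked empirically by counting partition work instead of measuring wall-clock time. The sketch below is hypothetical and is not the measurement behind Fig. 9. It assumes uniformly random non-negative integer data, uses a tie-aware termination test that generalises the BINGO test, and the names `partition_cost` and `mean_cost_ratio` are introduced here for illustration only.

```python
import random

def partition_cost(data, rng):
    """Return (h, cost), where cost is the sum of the segment sizes a_i
    scanned over all partition passes (the quantity C(n) in the text)."""
    segment, count, cost = list(data), 0, 0
    while segment:
        cost += len(segment)                     # one pass over this segment
        pivot = rng.choice(segment)
        upper = [x for x in segment if x > pivot]
        lower = [x for x in segment if x < pivot]
        above = count + len(upper)               # values strictly > pivot
        at_least = above + len(segment) - len(upper) - len(lower)
        if above > pivot:
            segment = upper                      # h lies above the pivot
        elif at_least >= pivot:
            return pivot, cost                   # tie-aware BINGO test
        else:
            count, segment = at_least, lower     # h lies below the pivot
    return count, cost

def mean_cost_ratio(n, runs=20, seed=0):
    """Average C(n) / n over `runs` random data sets of size n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        data = [rng.randrange(n) for _ in range(n)]
        total += partition_cost(data, rng)[1]
    return total / (runs * n)
```

Under these assumptions, calling `mean_cost_ratio` for several values of `n` should give ratios that stay roughly constant as `n` grows, whereas the worst case $\frac{n(n+1)}{2}$ would make the ratio grow like $n/2$.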
Here is a sample implementation of our proposed algorithm in LISP, which is virtually a direct translation from the algorithm description.