The role of highly intercited papers on scientific impact: the Mexican case

The present paper explores the relationship between highly intercited papers in the k-max of citation networks and an author’s impact from the Mexican National System of Researchers (SNI). We investigate whether a more interconnected network, a higher k of the k-max, explains the variation of the total number of citations, controlling for personal characteristics such as SNI level, area of expertise, and the number of publications. We find that the k-max is positively and significantly correlated with impact. In this context, we find that the share of self and collaborator-citations increases with the magnitude of the k-max and women tend to have less interlinked cores of their citation networks than men (smaller k’s). Interestingly, we find that women tend to have a higher share of third-party citations while men tend to have a higher share of self and collaborator-citations, for all k’s and areas of expertise. We conduct a Blinder–Oaxaca decomposition to better understand the citation gender gap and find that much of it can be explained through the differences in observable characteristics (including the k-max) between women and men.

Page 2 of 20 Dorantes-Gilardi et al. Applied Network Science (2022) 7:58 which a small cluster of authors disproportionately cites each other's papers. The strategic use of self-citations to boost an author's impact has been widely studied (Kacem et al. 2020; Van Noorden and Singh Chawla 2019; Wallace et al. 2012). For instance, King et al. (2017) find that men self-cite 56% more than women, creating an asymmetric cumulative advantage between genders. Reciprocity between authors within their closest social circle has been less studied and only recently has gained importance.  investigate whether authors that present high citation reciprocity (exchange of citations between authors) outperform their peers and find that only those in the lowest part of the citation distribution do benefit from this strategy to boost their visibility.
Therefore, an author's impact is determined by a mix of personal attributes and the author's social network. It is well established that collaboration in academia is mostly beneficial for all parties, since it can improve their productivity and impact through the acquisition of resources, the learning of new abilities, a boost in citations, and a longer career (Van Der Wal et al. 2021; Wallace et al. 2012; Fortunato et al. 2018; Paraskevopoulos et al. 2021). The study of real-world networks has stressed the need for centrality, ranking, and structural organization measures that uncover complex connectivity patterns that are usually hidden and that have proven useful to characterize different network structures and configurations (Alvarez-Hamelin et al. 2005). One macro-level measure of interconnection within a network is its k-core, defined as the maximal set of nodes that each have degree at least k within the set (Kong et al. 2019; Seidman 1983). The concept of k-core has proven helpful in a variety of financial, biological, and community detection topics (Kong et al. 2019; Giatsidis et al. 2011; Burleson-Lesser et al. 2020). In practice, the maximal k for which a k-core exists, the so-called main core or k-max, is of particular interest, since the nodes belonging to it provide a "structure" to the network due to the strength of their relations (Burleson-Lesser et al. 2020). We will only consider the k-max throughout the paper.
In this paper, we explore whether an author's k-max (magnitude of k) correlates with the number of citations, once we control for other individual characteristics, such as the number of papers, area of expertise, the average number of co-authors per paper, and career length, among others. We construct personal citation networks where the nodes are papers and links are citations that may come from a paper of the same author (self-citation), a paper of a co-author (collaborator-citation), or someone else (third-party citation).
Our goal is to uncover whether a more interlinked inner core (higher k) is associated with more citations. To the best of our knowledge, this is the first paper to investigate whether the topology of citation networks, captured through differences in the k-max, explains variations in an author's impact. We also investigate possible mechanisms that affect the magnitude of the k-max and possible strategies for increasing it, such as self-citations.
A personal network (ego network) intrinsically embeds social mechanisms and can generate different benefits to the focal node depending on its structure. Vacca (2020) explains that tightly-knit ego networks generate bonding social capital and may result in higher levels of cooperation and support among members. Unlike other networks (e.g. biological or social), in citation networks, the links cannot be severed; they are permanent, and the cost of maintaining the link (a citation) is zero once it appears. Given this particularity, we presume that a more cohesive inner core of a personal citation network could only benefit the focal author. Furthermore, the possible adverse effects of a highly cohesive network, such as greater social pressure or limits on individual freedom (Vacca 2020), are not a matter of concern.
Since the k-max of our citation networks contains papers that receive at least k citations, and these may come from different sources (self, collaborator, or third-party), we hypothesize that an author may partially have the ability to increase the k through self-citations or if collaborators reciprocate citations as well. Nevertheless, it is worth mentioning that we do not imply that all self or collaborator-citations are artificially boosting the impact or the k. For instance, an author could naturally have a large k if several of their papers are impactful within a small community and are usually co-cited.
We would expect differences between the citation networks of early-career and senior researchers and across fields, but it is not evident how they differ between genders, particularly in the innermost core (k-max). The present paper aims to contribute to research on how gender-differentiated patterns in citation networks can translate into permanence and promotion in academia.
Moreover, we use a Blinder-Oaxaca (BO) decomposition to explain further how the k-max and other variables contribute either positively or negatively to the citation gender gap, in the fashion of the wage gender gap literature. Thus, we examine how much of the gap can be explained by differences in observable characteristics or endowments (including the k-max) and how much is due to those characteristics having different effects on citations (coefficients). To our knowledge, this paper is the first, in the growing literature on gender bias in academia, to apply such a decomposition approach. Thus, this study not only contributes specifically to the gender bias literature in academia but may also help policy-makers design policies targeting gender equality.
The paper is organized as follows: section "The Mexican National System of Researchers" explains the Mexican National System of Researchers briefly; section "Data" presents the data and how the citation network is constructed; section "Network topology and impact" shows the results of the relationship between an author's network topology and the number of citations; section "Gender differences in k-max" shows some measures that relate to the k-max and whether there are gender differences; section "Illustrative cases" presents particular researchers to better understand the dynamics of citation patterns and k-max. Finally, section "Conclusions" provides concluding comments.

The Mexican National System of Researchers
The National System of Researchers (Sistema Nacional de Investigadores, SNI) was created in 1984 and conceived initially to mitigate the acute income loss of faculty doing full-time research due to the economic crisis and aimed to support research activities across the country (Sandoval-Romero and Larivière 2020; Francisco et al. 2020). The evaluation process to enter the SNI and be promoted relies on peer review committees and assesses a mix of the number of publications and their impact. The SNI is divided into five levels in which members are classified: Candidates, SNI I, SNI II, SNI III, and Emeritus. There are seven evaluation committees depending on the researchers' area of expertise (Table 1).
The SNI provides a monthly monetary compensation determined by the Federal Government and depends solely on the level: Candidate receiving the lowest stimulus and Emeritus the highest. This compensation serves as a salary complement and represents, on average, 30% of the income but can represent up to 50% of it (Sandoval-Romero and Larivière 2020). This reward system has proven to incentivize the production of academic work. For instance, Rodríguez Miramontes et al. (2017) find that between 1991 and 2011, 83% of articles published by Mexican researchers were written by at least one member of the SNI in that period (Sandoval-Romero and Larivière 2020).
The evaluation periods are shorter for early-career researchers (Candidates and SNI I) and longer for seniors (SNIs II and III); thus, the highest rescission rates are within Candidate (41%) and SNI I (19.3%) (Sandoval-Romero and Larivière 2020). The first stages of the SNI remain the bottleneck overall, but especially for women since they are mainly represented in those levels (Appendix A).
Female presence in the SNI has increased but remains insufficient; in 2018, only 37% of active members were women (CONACYT 2018). However, the percentage of female researchers is heterogeneous across areas (Appendix A), ranging from 22% in Area 1 (Physics, Mathematics, and Earth Sciences) to 52% in Area 3 (Medicine and Health). Furthermore, comparing the percentage of female researchers across levels, we observe that it substantially decreases as the level increases, going from 44% in the lowest level (Candidate) to 23% in the highest one (SNI III).

Data
We have access to public information on the researchers who were part of the Mexican National System of Researchers in 2018. This database contains 28,639 researchers, and we matched 11,039 authors to their corresponding information in the Microsoft Academic Graph (MAG) dataset using their full names and institution. Even though the researchers in SNI are not randomly distributed and do not represent the whole population of researchers in Mexico, we verified that our matched sample has the same characteristics as the SNI population, such as the proportion of women per area or proportion of women per SNI level (Appendix A). For each author, we retrieved from MAG the number of citations and publications and their institutional rank. An institution's rank is a measure constructed by MAG and roughly defined as the logarithm of the probability of an entity being "important", where importance is calculated using its relationships with other entities in the graph.
Considering that we are interested in the role of k-max as a determinant of success, we only kept those researchers with at least 100 citations in the MAG dataset; in this way, we obtained denser citation networks. Researchers with fewer citations tend to have a k-max with lower k because citation distributions tend to be highly skewed, with most authors having no or very few citations. As a result, our final baseline sample consists of 2363 researchers in all areas.
It is worth noting that our final sample is not representative of the whole population of SNI researchers since we are only considering those who have many citations (at least 100); however, since the authors we identify in MAG are representative of the SNI population, our sample of 2363 authors should also be representative of the highly cited authors in the SNI population. For this reason, the distribution of our sample across areas differs from that of the population of SNI researchers, because different areas have different citation patterns. For instance, Ioannidis et al. (2019) find that the median number of citations in General Arts, Humanities & Social Sciences is 28, while in Biology it is 140 and in Chemistry 129. We show in Appendix C frequency tables by area and SNI level of our sample.
As shown in Table 2, the mean number of citations in the sample is 602.2, where women have on average 119.7 fewer citations than men, and this difference is statistically significant. We observe the same pattern for third-party citations (women have 64.7 fewer citations than men) and collaborator citations. Women cite their own papers less than men do, consistent with the literature (King et al. 2017). The average number of publications is 42.5 papers, 33.4 for female researchers and 47.1 for male researchers. Moreover, women have on average a less interconnected k-max (k = 3.4) than men (k = 3.7). Figure 1a presents the fraction of women in each of the four SNI levels: Candidate and levels I, II and III. At the Candidate level, 51% of researchers are women and 49% are men, and the higher the SNI level, the lower the representation of women. Medicine and Health (Area 3) is the area of knowledge with the largest fraction of women, followed by Humanities and Behavioral Sciences (Area 4), while Engineering and Industry (Area 7) has the lowest fraction (Fig. 1b).

Citation networks
For each author with at least 100 citations in MAG, we construct a citation network as follows: (i) we consider all articles where the author appears, and (ii) we consider all the articles that cite at least one article of the author in question. For simplicity, we remove the directionality of the links, as we are only interested in the level of the network's interconnectivity. In this manner, the citation network is an undirected network where nodes represent articles and links represent citations from one article to another. We note that for every link, there must be at least one incident node representing an article of the focal author (we do not consider citations between two articles in which the focal author has no authorship).
Next, we partition the nodes into three different classes: self, collaborator, and third-party author, depending on whether the focal author is among the paper's co-authors, a collaborator of the focal author is among them, or neither, respectively. Classes are mutually exclusive, meaning that a node can only belong to one of them; if the paper represented by the node has both the focal author and a collaborator as co-authors, we consider it a self-citation.
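The construction and partition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: the function names, the toy data, and the author names are hypothetical; a collaborator is assumed to be anyone who co-authored at least one paper with the focal author; and a citation is assumed to be kept only when the cited paper belongs to the focal author.

```python
from typing import Dict, List, Set, Tuple

def classify(authors: Set[str], focal: str, collaborators: Set[str]) -> str:
    """Class of one paper; 'self' takes precedence over 'collaborator'."""
    if focal in authors:
        return "self"
    if authors & collaborators:
        return "collaborator"
    return "third-party"

def build_network(papers: Dict[str, Set[str]],
                  citations: List[Tuple[str, str]],
                  focal: str):
    """Return (undirected edges, node -> class) for the focal author."""
    own = {p for p, authors in papers.items() if focal in authors}
    # Assumption: a collaborator is anyone who co-authored a paper with focal.
    collaborators = set().union(*(papers[p] for p in own)) - {focal}
    edges = set()
    for citing, cited in citations:
        # Assumption: keep a citation only if the cited paper is by focal.
        if cited in own and citing != cited:
            edges.add(frozenset((citing, cited)))  # direction is dropped
    nodes = own | {n for e in edges for n in e}
    classes = {n: classify(papers[n], focal, collaborators) for n in nodes}
    return edges, classes

# Hypothetical toy data: paper id -> set of author names.
papers = {
    "p1": {"ana", "bo"},   # by the focal author and a collaborator
    "p2": {"ana"},         # by the focal author alone
    "c1": {"bo", "cy"},    # by a collaborator of the focal author
    "c2": {"dee"},         # by an unrelated author
    "x": {"eve"},          # cites only c2, so it never enters the network
}
citations = [("p2", "p1"), ("c1", "p1"), ("c2", "p1"), ("c2", "p2"), ("x", "c2")]
edges, classes = build_network(papers, citations, focal="ana")
```

In the toy data, p1 counts as a self node even though a collaborator also signs it, which mirrors the precedence rule stated in the text.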
Finally, we obtain the largest k for which there exists a k-core using the graph-tool library (Peixoto 2014). This allows us to compute the proportion of nodes of each class in both the complete network and the main core.
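To make the notion of the main core concrete, the following sketch computes the k-max of a small undirected network by repeatedly peeling off the node of lowest remaining degree. The paper relies on graph-tool for this step; the plain-Python version below is only an illustrative stand-in with no dependencies, and the function names and the toy network are hypothetical.

```python
from collections import defaultdict

def core_numbers(edges):
    """Core number of every node of an undirected simple graph (peeling)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {n: len(nb) for n, nb in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Remove the node with the smallest remaining degree.
        n = min(remaining, key=degree.__getitem__)
        k = max(k, degree[n])  # the core level never decreases
        core[n] = k
        remaining.remove(n)
        for m in adj[n]:
            if m in remaining:
                degree[m] -= 1
    return core

def main_core(edges):
    """Return (k_max, set of nodes in the main core)."""
    core = core_numbers(edges)
    if not core:
        return 0, set()
    k_max = max(core.values())
    return k_max, {n for n, c in core.items() if c == k_max}

# Hypothetical toy network: a triangle (a, b, c) plus one pendant paper d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
k_max, core_nodes = main_core(toy_edges)
print(k_max, sorted(core_nodes))  # prints: 2 ['a', 'b', 'c']
```

The linear scan for the minimum keeps the sketch short at the cost of quadratic time; a bucket-based peeling would be preferable for large networks.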

Network topology and impact
As explained above, an author's citation network structure can be seen as the result of citations coming from third-party authors, collaborators, and the researcher's self-citation behavior. Different citation behaviors could translate into a different network topology that may have direct and indirect effects on academic impact. Therefore, our variables of interest for each researcher are the number of citations, third-party citations, collaborator citations, and self-citations for all the papers published until August of 2021. Our measure of network structure is the maximal value of k for which a k-core exists, calculated using the citation network of each author. Intuitively, the k-max of the citation network captures the innermost core of highly intercited papers. In our case, due to the construction of the network, the k-max should always contain papers authored by the SNI member we are considering. It is not sufficient to have the highest number of total citations to have a dense k-max (high k); several of the author's papers must be co-cited simultaneously.
Our dependent variables are over-dispersed count variables that can be analyzed either by log-transforming them and then using Ordinary Least Squares or by using a Negative Binomial regression model without such transformation. We use both models and include controls such as productivity (logarithm of the number of papers), the logarithm of the rank of the affiliation institution, the researcher's area, level in the SNI, career length, and the logarithm of the average number of co-authors per paper. Table 3 shows the results of our estimation. Independently of the model employed, there is a positive and significant correlation between a higher k of the k-max and the researcher's citations (total, third-party, collaborator, and self-citations). In other words, an author whose citation network has a highly interconnected subnetwork tends to have more citations. Interestingly, the coefficient is smaller for third-party and collaborator citations but grows for self-citations. There is also a significant and positive correlation between the number of publications and every type of citation, while the correlation is negative and significant for the rank of the affiliation institution. This last result is consistent with previous findings that show a positive correlation between institutional prestige and the probability of becoming a top-cited scientist in the long run.
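As a hedged illustration, the two specifications described above can be written as follows. The notation is assumed here and does not come from the source: $C_i$ is a citation count of researcher $i$, $k_i$ the k of the k-max, $P_i$ the number of papers, $R_i$ the rank of the affiliation institution, $A_i$ the average number of co-authors per paper, $T_i$ the career length, and $\gamma$, $\delta$ the area and SNI-level effects. That $k_i$ enters linearly and that the Negative Binomial variance takes the quadratic (NB2) form with dispersion $\alpha$ are also assumptions.

```latex
% Shared linear predictor (notation assumed for illustration)
\eta_i = \beta_0 + \beta_1 k_i + \beta_2 \ln P_i + \beta_3 \ln R_i
       + \beta_4 \ln A_i + \beta_5 T_i
       + \gamma_{\mathrm{area}(i)} + \delta_{\mathrm{level}(i)}

% (1) OLS on the log-transformed count
\ln C_i = \eta_i + \varepsilon_i

% (2) Negative Binomial on the untransformed count
C_i \sim \mathrm{NegBin}(\mu_i, \alpha), \qquad
\ln \mu_i = \eta_i, \qquad
\operatorname{Var}(C_i) = \mu_i + \alpha \mu_i^{2}
```

Under either form, a positive $\beta_1$ corresponds to the positive association between the k of the k-max and citations reported in Table 3; over-dispersion corresponds to $\alpha > 0$.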
If we consider the area of knowledge, researchers in Medicine & Health (Area 3) and Humanities & Behavioral Sciences (Area 4) consistently have fewer citations than those in Engineering & Industry (Area 7, the omitted category). This result contrasts with the results in Gonzalez-Brambila and Veloso (2007), who show that researchers in Health Sciences receive the largest number of citations per four years of all SNI areas among researchers that were part of the SNI from 1991 to 2002.
By SNI level, we see that researchers in the most prestigious level (SNI III) receive between 59% and 72% more citations relative to the mean of Candidates, and SNI II researchers between 48% and 52% more citations than Candidates. Finally, there is a positive and significant correlation between the average number of co-authors per paper and total, third-party and collaborator citations, but the relationship becomes negative for self-citations, independently of the model used.
We show in section "Data" that there are significant differences or gaps between female and male researchers. Considering the results above, we can examine how the different determinants contribute to the academic impact of female and male researchers. We use the logarithm of the total number of a researcher's citations as a measure of academic impact. Following the seminal works of Oaxaca (1973) and Blinder (1973), we use the Blinder-Oaxaca (BO) decomposition method to study the citation gap. This method has been widely applied in economics to study gender/racial wage gaps.
We first estimate two group-specific regression models (see Table 13 in Appendix D) and then perform the decomposition. As shown in Table 4, the decomposition output reports the mean predictions of the logarithm of citations for men and women and their difference. In our sample, the mean of ln(citations) is 5.976 for men and 5.882 for women, yielding a citation gap of 0.094. The citation gap is divided into three parts in the first column of the decomposition output (endowments, coefficients and interaction).
The first term reflects the mean increase in women's citations if they had the same characteristics as men (effects due to women having different endowments). The increase of 0.142 indicates that differences in productivity (logarithm of the number of papers), the logarithm of the rank of the affiliation institution, the researcher's area, level in the SNI, career length, and the logarithm of the average number of co-authors per paper account for about 151% of the citation gap (0.142/0.094).
The second term quantifies the change in women's citations when applying the men's coefficients to the women's characteristics (effects due to those characteristics having different influences on citations, i.e., coefficients). The overall difference in citations decreases when applying men's coefficients. As shown in Table 12 of Appendix D, the average number of co-authors per paper, career length, and a higher level of SNI (relative to Candidate) boost citations more for women than for men (i.e., the coefficients are greater for women), which explains why the gap decreases when the coefficients of men are applied to these factors (column 3 of Table 4). The third term is the interaction term that measures the simultaneous effect of differences in endowments and coefficients, and is not significant. Overall, these results show that differences in endowments between women and men explain the citation gap.
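The three terms described above correspond to the standard threefold Blinder-Oaxaca identity, written here from the viewpoint of women. The notation is assumed for illustration: $\bar{Y}_g$ is the mean of ln(citations) for group $g$, $\bar{X}_g$ the vector of mean characteristics (including the constant), and $\hat{\beta}_g$ the coefficients of the group-specific regression, with $M$ for men and $W$ for women.

```latex
\bar{Y}_M - \bar{Y}_W
  = \underbrace{(\bar{X}_M - \bar{X}_W)^{\top} \hat{\beta}_W}_{\text{endowments}}
  + \underbrace{\bar{X}_W^{\top} (\hat{\beta}_M - \hat{\beta}_W)}_{\text{coefficients}}
  + \underbrace{(\bar{X}_M - \bar{X}_W)^{\top} (\hat{\beta}_M - \hat{\beta}_W)}_{\text{interaction}}

% With the reported values:
% gap        = 5.976 - 5.882 = 0.094
% endowments = 0.142, so 0.142 / 0.094 \approx 1.51 (about 151% of the gap)
```

Because the three terms add up to the gap, an endowments term larger than the gap implies that the coefficients and interaction terms together are negative (about 0.094 - 0.142 = -0.048, up to rounding), which is how a share above 100% can arise.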

Gender differences in k-max
In section "Network topology and impact", we explored the relationship between an author's characteristics and academic impact. In particular, we found that a higher k of the k-max of an author's citation network has a positive and significant association with the number of citations. Thus, if a less interconnected core is associated with fewer citations, we would like to know the extent of gender differences in the k-max.
In Fig. 2, we present the probability density function of the maximal k values for which there is a k-core (k-max), distinguishing between women and men. We observe that women tend to have less interconnected cores in their citation networks: the average k is 3.4 for women and 3.7 for men, and the maximum value of k for women (13) is smaller than for men (15). Conducting a Kolmogorov-Smirnov test (Appendix B), we find evidence that the k-max distributions of women and men are not equal and that women tend to have a less interlinked inner core of their citation networks than men. Thus, if higher values of k are associated with more citations and the advantages this entails for an author, women would not be benefiting as much as men through this channel, confirming the results of the Blinder-Oaxaca decomposition (Table 4).
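For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic is the largest vertical distance between the two empirical cumulative distribution functions. The sketch below computes only that statistic, not the p-value reported in Appendix B; the function name and the toy k-max values are hypothetical and are not taken from the data.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two non-empty samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every distinct value of the pooled sample.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical k-max values for two small groups of authors.
kmax_women = [2, 3, 3, 4]
kmax_men = [3, 4, 4, 5]
d_stat = ks_statistic(kmax_women, kmax_men)
```

A statistic of 0 means the two empirical distributions coincide, and 1 means the samples do not overlap at all. Since the k-max is an integer with many ties, any p-value derived from the continuous-case distribution should be read with caution.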
If women have less interlinked citation networks, we would like to investigate possible determinants of a higher k in the k-max and whether women and men display different patterns in these. Table 5 shows the correlation matrices of the k-max magnitude and the number of nodes and edges in both the whole network and the k-max subnetwork for women and men, respectively. For both genders, we find a positive relationship between the number of edges in the network and the k-max, stronger for men (0.591) than for women (0.431). There is also a positive relationship between the number of nodes in the network and the k-max, stronger for men than for women (0.503 and 0.302, respectively). However, we find that more interlinked cores tend to have fewer nodes and more edges in the citation network.
Considering the type of citations an author can receive (self, collaborator and third-party citations), we analyze how these correlate with the k-max. Our interest is to shed light on whether an author's strategic behavior, through an increase in self-citations and collaborator-citations, can increase the magnitude of k. In Table 6, we show the correlation matrices of the k-max and the share of each type of citation out of the total citations. We find that self-citations have the strongest positive correlation coefficient with k-max, followed by collaborator-citations to a lesser extent. Interestingly, we find that third-party citations are negatively correlated with k-max, which may indicate that authors able to gather more self and collaborator-citations tend to have more total citations ultimately.
In Fig. 3 we show k-max and type of citations in our data, considering the share of each type with respect to the total number of citations. As seen in Table 6, there is a positive correlation between the largest k for which there is a k-core and self-citations and between the largest k and collaborator-citations, with a higher slope for the first one than for the second one. On the contrary, there is a negative association between the largest k and third-party citations. However, there is no visible difference in the correlation coefficients between women and men.
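The quantities behind Table 6 and Fig. 3 can be illustrated with a short sketch: the share of each citation type out of an author's total, and the Pearson correlation between a share and the k-max across authors. The function names and the toy numbers are hypothetical, and the Pearson coefficient is an assumption, since the text does not name the correlation measure used.

```python
from math import sqrt

def citation_shares(n_self, n_collab, n_third):
    """Shares of self, collaborator and third-party citations (sum to 1)."""
    total = n_self + n_collab + n_third
    if total == 0:
        raise ValueError("author has no citations")
    return n_self / total, n_collab / total, n_third / total

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical authors: (k-max, self, collaborator, third-party citations).
authors = [(2, 5, 15, 180), (3, 10, 30, 160), (5, 20, 50, 130), (8, 30, 70, 100)]
kmax = [a[0] for a in authors]
self_share = [citation_shares(*a[1:])[0] for a in authors]
third_share = [citation_shares(*a[1:])[2] for a in authors]
r_self = pearson(kmax, self_share)    # positive in this toy data
r_third = pearson(kmax, third_share)  # negative in this toy data
```

Because the three shares sum to one for every author, a rise in the self and collaborator shares necessarily comes with a fall in the third-party share, so the signs in such a table are not independent of each other.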
We explore further whether, given a k-max, there are differences between genders in the share of each type of citation they receive. Figure 4 shows the median of each share of citations differentiated by gender, for each k. Strikingly, for less dense citation networks (k < 3), there are no observable differences between women and men for any type of citation. However, as k increases, the gap between genders widens. Women tend to have more third-party citations for each k. Men lean towards higher shares of self and collaborator-citations, for each k, suggesting a higher probability of reciprocating citations.
One may argue that some areas tend to have more collaborators and many papers, leading to more interconnected citation networks. Thus, we explore whether, given an area, there are observable gender differences in the median of each share of citations (Fig. 5). Area 1 (Physics, Mathematics, and Earth Sciences) has the highest share of self and collaborator-citations and the lowest share of third-party citations, more so for men than for women. On the contrary, Area 5 (Social and Economic Sciences) has the lowest share of self and collaborator-citations and the highest share of third-party citations. Therefore, there are marked differences across fields in the shares of citation types that could translate into disparities in success between women and men, both within Areas and between them.
Table 5 Correlation matrix of nodes, edges and k-max. *p < 0.05, **p < 0.01, ***p < 0.001
Table 6 Correlation matrix of type of citations and k-max, % of each type with respect to total citations. *p < 0.05, **p < 0.01, ***p < 0.001
We do not argue that all self and collaborator-citations artificially boost the magnitude of the k-max. For instance, an author with many papers or several collaborators benefits simply due to that. Still, this is a finding that would be worth further exploring.

Illustrative cases
This section presents an illustrative example of the relationship between highly interconnected citation networks and the number of citations by collaborators and third-party authors. We compare two researchers with different citation networks, looking into the citation patterns discussed in the previous sections. Researchers A and B are both physicists working in materials science and have 6,625 and 4,178 citations across 112 and 227 articles, respectively. The largest k for which there is a k-core is 8 for Researcher A and 5 for Researcher B. Thus, A has more citations and a more interconnected citation network than B, who has fewer citations and a lower k-max. In Fig. 6A, we can see the difference in the interconnectivity of the citation networks of both researchers: Researcher A has a more densely connected core than Researcher B.
The k-core decomposition of both authors shows that Researcher A presents a denser network for various levels of k, meaning that articles that cite their work usually cite several of their articles. Interestingly, the distribution of the number of citations for Researcher A, while higher, is more compact, with fewer outliers than that of Researcher B (Fig. 6B). Researcher B has only 25% of their articles with more than 13 citations, while Researcher A has 42%. However, the standard deviation of the number of citations of Researchers A and B is 62 and 76, respectively, with Researcher B having more outlier articles. Moreover, Researcher A publishes more often in lower-ranked venues than Researcher B, even though they share the same field (Fig. 6C). On average, Researcher B publishes in journals with rank 7,685, while Researcher A publishes in journals with rank 10,268.
Finally, the source of citations of both researchers is quite different. While 82.6% of citations come from third-party authors for B, A only receives 59% from them (Fig. 6A). Researcher A receives 31.8% of their citations from collaborators, while Researcher B receives only 13.8%, less than half the proportion of Researcher A. The same pattern occurs for self-citations: Researcher A has 8.9% while Researcher B has 3.5%. These findings are consistent with our findings in the previous sections, where a more interconnected citation network core positively correlates with self and collaborator-citations and negatively with third-party citations.

Conclusions
In this work, we study the relationship between the size of the interconnected core of an author's citation network and the number of citations the author receives. We find a positive relationship between the size of the main core in the citation network of an author (magnitude of k of the k-max), a proxy for the size of their interlinked articles, and their number of citations. We observe that more interlinked citation networks correlate with a large share of self and collaborator-based citations, and a low share of third-party citations as a percentage of the total number of citations. We argue that this could serve as a mechanism to directly or indirectly boost citations, with the caveat that it could also occur naturally in some research areas and exceptional cases. For instance, there are notable differences across fields in the share of citation types, in which Area 1 (Physics, Mathematics, and Earth Sciences) has the highest share of self and collaborator-based citations, as well as the lowest share of third-party citations, more so for men than for women. We show a statistically significant difference between the level of interconnectivity of men's and women's citation networks, where women tend to have less interlinked inner cores. Thus, if women tend to have consistently less interlinked citation networks, this could limit the permanence and promotion of women in academic careers.
We also explore the citation gender gap through a Blinder-Oaxaca (BO) decomposition. We examine how much of the gap can be explained by differences in observable characteristics or endowments (including k-max) and how much is due to those characteristics having different effects on citations (coefficients). Our results show that differences in endowments between women and men explain much of the citation gap.
In this sense, further research could explore how citation reciprocity differs between genders and how the topology of the citation network evolves with the career. These would shed light on how strategic behavior, other than self-citations, affects an academic career and whether we can find significant gender-differentiated determinants. We do not affirm that all self and collaborator-citations artificially boost the k-max. For instance, an author with many papers or several collaborators benefits simply due to that. However, this is a finding that would be worth further exploring.

Appendix A: SNI population and matched sample
The matched sample corresponds to all SNI researchers we identified in the MAG data through the normalized name and institution (Tables 7, 8, 9, 10, 11, 12, 13).