Community structure in co-inventor networks affects time to first citation for patents

We have investigated community structure in the co-inventor network of a given cohort of patents and related this structure to the dynamics of how these patents acquire their first citation. A statistically significant difference in the time lag until first citation is linked to whether or not this citation comes from a patent whose listed inventors share membership in the same communities as the inventors of the cited patent. Although the inventor-community structures identified by different community-detection algorithms differ in several aspects, including the community-size distribution, the magnitude of the difference in time to first citation is robustly exhibited. Our work is able to quantify the expected acceleration of knowledge flow within inventor communities and thereby further establishes the utility of network-analysis tools for studying innovation dynamics.


Introduction and Motivation
Inventions can be codified in patent applications that, if granted, bestow certain rights to the assignees. Patent documents contain citations to other patents as part of the requirement to acknowledge the state of prior art and to delimit the invention's legal scope. These citations are understood to represent knowledge flows between inventors and have been used widely to study various economic and social aspects of innovation dynamics [1][2][3].
The degree to which social and working relationships between inventors influence patenting and knowledge propagation in technological-innovation space has been the subject of recent interest [4][5][6][7][8][9][10]. We shed new light on this topic from a network-analysis perspective by studying the effect of community structure in the co-inventor network on patent-citation dynamics. Intuitively, knowledge about inventions can be expected to be transmitted, and thus become available for utilization, faster within groups of inventors that have collaborated before. One of the main aims of our present work is to verify this expectation and quantify the acceleration of knowledge flow through inventor communities. We use the time lag until first citation [11,12] as a proxy measure for the fastest speed at which knowledge about an invention can propagate. The motivation is an analogy with an electrical pulse, whose leading edge is the real carrier of information independent of the rest of its line shape: the time to first citation is more representative of the speed of information flow than any aggregate or average citation measure.
Here our approach is different. We use established community-detection algorithms [15,16] to identify communities of inventors in the co-inventor network that has been constructed by projection from a bipartite inventor-and-patent network. This method can directly reveal communities based on collaborative inventive activity without relying on any externally observed relationship proxies. On the other hand, the algorithmic identification of communities may introduce biases arising from idiosyncrasies inherent to particular detection methods. Our present work also investigates how robustly community effects in patent-citation dynamics can be identified using the generally different community structures obtained by various established detection algorithms.
Community structures in citation networks for scientific articles have been analyzed recently (see, for example, Refs. [17][18][19]), and opportunities for similar studies on the large amounts of rich patent-citation data now available are also beginning to be realized [20,21]. Our study offers an example of the type of interesting insights that can be gained by applying network-analysis-based community-detection methods in the important area of technical innovation.
The remainder of this article is structured as follows. In the following Methods section, we start by discussing the data used to construct the co-inventor network and the five community-detection algorithms utilized in our work. Basic properties of communities identified on the largest connected component of the co-inventor network by the different algorithms are compared before details are given on how the citation-lag distributions are assembled. The subsequent Results and Discussion section presents a detailed analysis of the observed distributions for the time to first citation, focusing in particular on how these depend on whether inventors from the citing and the cited patent share membership in one of the algorithmically detected communities. Our Conclusions are summarized in the final section. A list of abbreviations used throughout this article is given in Table 1.

Methods
The availability of disambiguated US-patent data [22] makes it possible to construct the co-inventor network associated with a particular cohort of patents. To be specific, we use patents granted by the US Patent and Trademark Office (USPTO) during 1995-1999 and assigned to one of the following classes from the US Patent Classification System (USPC) [23]: 257 (Active solid-state devices), 326 (Electronic digital logic circuitry), or 438 (Semiconductor device manufacturing: process) (see footnote 1). The projection of the bipartite patent-and-inventor network onto inventors yields the co-inventor network [5] associated with this cohort. Going further than using a simple projection as was done in previous work [5,6], we introduce edge weights w_ij reflecting the frequency and intensity of the co-inventing activity [24] between two inventors i and j; see Eq. (1). The sum in Eq. (1) is over all patents in the chosen cohort that list more than one inventor, and δ = 1 (0) if inventor i is (not) among the n_α inventors listed on the particular patent α. In the following, we restrict ourselves to considering only the largest connected component (LCC) of the thus-constructed co-inventor network, which can be expected to capture the most relevant aspects of inventors' connectedness [8,25]. Amounting to approximately 31% of the total co-inventor network by number of nodes, the LCC comprises a diverse assortment of inventors, as evidenced by the spread of firms their patents are assigned to. For example, IBM, Hitachi LTD, Motorola INC, Kabushiki Kaisha Toshiba, and Texas Instruments represent 14%, 7.2%, 4.6%, 4.1%, and 4.0% of nodes in the LCC, respectively. Other companies each account for less than 4% of the nodes in the component. Table 2 provides an overview of relevant summary statistics pertaining to the patent cohort and co-inventor network considered here.

Footnote 1: The objectives of a particular investigation will generally determine the choice of cohort. Ours is motivated by three main needs: (i) to leave a sufficiently long time window for citation accrual, (ii) to have a sufficiently large data set to reduce statistical noise, and (iii) to include patents from similar fields that are granted around the same time to minimize variations in the inventors' work environment.
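The network construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It assumes that the edge weight takes the widely used collaboration-strength form w_ij = Σ_α δ_i^α δ_j^α / (n_α − 1), which is one reading consistent with the description of Eq. (1) but is not confirmed by the text. The function names and the toy patent data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_inventor_weights(patents):
    """Project a patent -> inventors mapping onto weighted inventor pairs.

    ASSUMPTION: each patent with n > 1 inventors adds 1 / (n - 1) to the
    weight of every inventor pair it lists (collaboration-strength form).
    """
    weights = defaultdict(float)
    for inventors in patents.values():
        team = sorted(set(inventors))
        n = len(team)
        if n < 2:
            continue  # single-inventor patents contribute no edges
        for i, j in combinations(team, 2):
            weights[(i, j)] += 1.0 / (n - 1)
    return dict(weights)


def largest_connected_component(weights):
    """Return the node set of the largest connected component."""
    adjacency = defaultdict(set)
    for i, j in weights:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    stack.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy cohort: patent id -> listed inventors
patents = {"P1": ["a", "b"], "P2": ["a", "b", "c"], "P3": ["d", "e"], "P4": ["f"]}
weights = co_inventor_weights(patents)
lcc = largest_connected_component(weights)
```

In this toy cohort, inventors a and b co-invent twice (once in a team of two, once in a team of three), so their edge carries more weight than the edges to c, and the component containing a, b, and c is the LCC.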
We have used five established community-detection algorithms to analyze the LCC of the co-inventor network: Greedy [26], Louvain [27], Infomap [28], Random Walks [29], and Propagating Labels [30]. Although different algorithms generally yield different community structures, clear similarities are exhibited between the structures obtained by conceptually related approaches such as Greedy and Louvain on the one hand, and Infomap, Random Walks, and Propagating Labels on the other. In particular, we observe the well-known resolution-limit issue [31] where the communities delineated by approaches that maximize modularity [32] (Greedy and Louvain) are typically larger in size and generally subsume the many smaller communities identified by other approaches (Infomap, Random Walks, Propagating Labels). This may be inferred graphically from visualizations of community structures, such as those shown in Fig. 1. A more quantitative comparison is possible based on the size distributions for communities obtained by application of each of the five different detection algorithms to the LCC, given in Fig. 2, and the corresponding community-structure-related summary statistics provided in Table 3. We also use the adjusted Rand index (ARI) [33] to measure similarity between community structures generated by different algorithms as well as those from different runs of the same algorithm. ARI scores close to 1 are obtained when comparing the results of multiple runs of the same algorithm on the LCC of the co-inventor network, indicating the generally very good or, for Propagating Labels, at least satisfactory robustness of each method. The ARI value of 0.8 found when comparing the community structures generated from the Greedy and Louvain algorithms attests to their high degree of similarity. A much lower degree of similarity is exhibited between the partitionings arising from the Infomap and Propagating-Labels methods, as well as between those of Greedy/Louvain and Random Walks. All other pairwise comparisons yield very small ARI scores of 0.2 or below.
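For readers unfamiliar with the adjusted Rand index, the following sketch computes it from two community assignments using the standard pair-counting definition. It is an illustration only: the function name and the toy label lists are hypothetical, and the return value for the degenerate case of two trivial partitions is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI of two partitions given as equally long lists of community labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same nodes")
    n = len(labels_a)
    if n < 2:
        return 1.0  # convention chosen for this sketch
    # Pairs of nodes placed together in both partitions
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # chance agreement
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        return 1.0  # both partitions trivial (all singletons or one block)
    return (together_both - expected) / (maximum - expected)


# Identical groupings with swapped label names agree perfectly
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Groupings that never place the same pair together score below zero
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical partitions regardless of how communities are labeled, and it is close to 0 when two partitions agree no more often than expected by chance.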
Having identified inventor communities in the LCC of the co-inventor network, we track the citations acquired by patents from the chosen cohort over a ten-year period from each individual patent's time of grant. Following the usual convention [1,34], we assign the time of citation to be the application date of the citing patent. Hence, citations to patents in the cohort considered here can only originate from patents applied for before 2010. All citation-related data are sourced directly from the USPTO [35]. For the purpose of the present work, we specifically consider the time ∆t_1 elapsed after each patent's time of grant until it acquires its first citation [11,12] and classify this citation as either a self-citation, an in-community citation, or an out-of-community citation based on the previously obtained community structure. Here a self-citation occurs if any of the inventors listed on the citing patent is also listed as an inventor on the cited patent, which is the case independently of any community structure. An in-community citation is a non-self-citation from a patent where at least one of the listed inventors shares membership in one of the algorithmically identified communities with at least one of the inventors listed on the cited patent. An out-of-community citation is from a patent whose listed inventors all belong to different communities than the inventors of the cited patent. To focus most precisely on how community structure affects the knowledge flow through the co-inventor network, we only consider citations from patents that have at least one of their listed inventors belonging to the LCC of the co-inventor network defined via the cohort of cited patents (see footnote 2). Table 2 provides citation-related summary statistics for the analyzed patent cohort.
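The three citation types defined above can be expressed as a short decision rule. The sketch below is one possible implementation, not the authors' code. The function name, the `community_of` mapping (inventor to community label, covering LCC inventors only), and the "excluded" label for citing patents without any LCC inventor are all assumptions made for illustration.

```python
def classify_first_citation(citing_inventors, cited_inventors, community_of):
    """Classify a citation as self, in-community, or out-of-community.

    community_of maps each inventor in the LCC to a community label;
    inventors outside the LCC are simply absent from the mapping.
    """
    citing, cited = set(citing_inventors), set(cited_inventors)
    if citing & cited:
        return "self"  # shared inventor, independent of community structure
    citing_communities = {community_of[i] for i in citing if i in community_of}
    cited_communities = {community_of[i] for i in cited if i in community_of}
    if not citing_communities:
        return "excluded"  # no citing inventor belongs to the LCC
    if citing_communities & cited_communities:
        return "in-community"
    return "out-of-community"


# Hypothetical community assignment for three LCC inventors
community_of = {"a": 0, "b": 0, "c": 1}
kinds = [
    classify_first_citation(["a", "x"], ["a", "b"], community_of),
    classify_first_citation(["b"], ["a"], community_of),
    classify_first_citation(["c"], ["a"], community_of),
    classify_first_citation(["z"], ["a"], community_of),
]
```

Testing for a shared inventor first ensures that self-citations never count as in-community citations, matching the definition of the latter as non-self-citations.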
Distributions of time lags ∆t_1 to first citation obtained for each subset of citation type (self, in-community, and out-of-community) are juxtaposed in Fig. 3. Their distinctive properties are discussed in greater detail in the next section. Note that, because the time of citation is the application date of the citing patent, and the time lag ∆t_1 is measured from the grant date of the cited patent, negative values of ∆t_1 are possible and, in fact, occur quite frequently. That patents have often already accrued citations by the time of grant from other patents that were applied for before that date but granted afterwards is a well-known feature of patent-citation dynamics [34].

Results and Discussion
As is apparent from the examples shown in Fig. 3, distributions of time lags to first citation are generally broad and skewed. We find that log-normal distributions provide a successful fit to their line shapes, except for the distinctive peak for the ∆t_1 = 0 bin exhibited by the time-lag distribution of first citations that are self-citations. Further research is needed to elucidate the origin of the inflated probability for a zero time lag for self-citations; we can only speculate at this point that it could be the result of larger firms' patenting strategies.
The analysis of time-lag distributions for in-community vs. out-of-community first citations enables us to discuss the influence of community structure on citation dynamics, for each set of communities identified by the five different algorithms used in this work. Specifically, we consider distributional averages (mean and median values), as well as the most probable value (mode). Table 4 gives the results obtained for these by two different methods: direct calculation using the empirical data for the time-lag distributions, and the values derived from parameters of the fitted log-normal distributions. For the mean and median values of time lags for in-community first citations, there is excellent agreement, within uncertainties, between the two approaches for all five community structures considered. Similarly good agreement exists between the medians and modes of time lags for out-of-community first citations. The disagreement between the modes calculated directly and from the log-normal fit for in-community first-citation time lags obtained based on the Infomap and Propagating Labels algorithms is due to the slightly inflated probability for the ∆t_1 = 0 bin exhibited in these cases. See Figs. 3 and 4. The magnitude of the inflated peak at ∆t_1 = 0 for in-community first-citation time lags is generally comparable to the size of statistical fluctuations and overall much smaller than that observed in the time-lag distribution of first self-citations. The consistently higher mean obtained for the out-of-community first-citation time lag using the raw data as compared with that derived from the log-normal-distribution fit parameters can be traced to the fact that the fit systematically underestimates probabilities in the tail of the time-lag distribution for this type of citation. Overall, the magnitude for the means and medians found in this work for the first-citation time lag agrees very well with that reported previously in the literature [11,12,36,37].

Footnote 2: Excluding citations by inventors outside the LCC amounts to neglecting trivial out-of-community citations that would likely influence the time-lag distribution for this type of citation mostly by shifting weight to larger ∆t_1 [6][7][8]. If at all relevant, including them could only further accentuate the differences between characteristic values (mean, median, and mode) for the time lag to first citation found for in-community and out-of-community citations.
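The relation between log-normal fit parameters and the mean, median, and mode can be made concrete with a short sketch. Several assumptions are made here that the text does not confirm: the parameters μ and σ are estimated by maximum likelihood from the logarithms of the lags, rather than by fitting a histogram, and a constant `shift` is added to make all lags positive before taking logarithms. The function name and sample values are hypothetical.

```python
import math


def lognormal_summary(lags, shift=0.0):
    """Mean, median, and mode implied by a log-normal fit to lags + shift.

    ASSUMPTION: mu and sigma^2 are the maximum-likelihood estimates, i.e.
    the mean and (population) variance of log(lag + shift).
    """
    if any(x + shift <= 0 for x in lags):
        raise ValueError("shift must make every lag strictly positive")
    logs = [math.log(x + shift) for x in lags]
    mu = sum(logs) / len(logs)
    sigma2 = sum((v - mu) ** 2 for v in logs) / len(logs)
    return {
        "mean": math.exp(mu + sigma2 / 2) - shift,
        "median": math.exp(mu) - shift,
        "mode": math.exp(mu - sigma2) - shift,
    }


# Two hypothetical lags whose logarithms are 0 and 2, so mu = 1, sigma^2 = 1
summary = lognormal_summary([1.0, math.e ** 2])
# The same values expressed as lags measured from a shifted origin
shifted = lognormal_summary([0.0, math.e ** 2 - 1.0], shift=1.0)
```

For any log-normal distribution the mode lies below the median, which lies below the mean, so a fit that underestimates the tail lowers the fitted mean relative to the raw-data mean while leaving the median largely unaffected.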
The specific community structure determined for the co-inventor network is observed to depend sensitively on the employed algorithm. The Greedy and Louvain algorithms yield a comparatively small number of larger communities, whereas the Infomap and Propagating-Labels algorithms identify mostly much smaller communities. The community structure obtained using Random Walks lies somewhat in between these two extremes. See Fig. 2 and Table 3. Interestingly, the distributional properties of the time lag for in-community first citations, and even more so those pertaining to out-of-community first citations, are found to be quite similar irrespective of the significant differences between the underlying inventor-community structures. Figure 4 shows the distributions of time lags for in-community citations based on the community structures obtained from running the Greedy, Louvain, Random-Walks, and Propagating-Labels algorithms on the LCC of the co-inventor network. Corresponding results for the community structure identified by the Infomap algorithm are given in Fig. 3. The distributions of the time lag for out-of-community first citations obtained based on the communities identified by the five different algorithms are visually barely distinguishable and therefore not shown. This suggests that algorithms producing larger communities have generally agglomerated parts of the co-inventor network that are separate communities as far as citing behavior is concerned. Hence, our study provides another real-world verification of the resolution limit [31] associated with algorithms based on optimizing modularity.
Comparison of the median and mean values for the time lags to first citation found for the in-community and out-of-community types, respectively, shows systematic differences. See Table 4. In particular, the median time lags to in-community first citations are about 2-3 months shorter than the corresponding medians for out-of-community first citations. The difference between the means of the time lag for in-community and out-of-community first citations is about 2-5 months. Given that we have analyzed time lags to first citation for patents whose inventors are all connected by prior co-authorship of patents, this indicates a strong influence of community structure on patent-citation dynamics and the associated knowledge flow. Further comparison with the mean, median, and mode values found for the time lag to first citations that are self-citations is very instructive. See Table 5. The mean and median calculated using the raw data, including the inflated peak at ∆t_1 = 0, are found to be essentially the same as for in-community citations. Hence, on average, in-community first citations and first self-citations occur after similar time periods that are, again on average, shorter by about 3 months than the time lag for out-of-community first citations. However, the distributions of time lags for the in-community and self types of first citations differ significantly. This can be illustrated by analyzing the adjusted distribution of time lags for self-citations where the inflated probability at ∆t_1 = 0 has been replaced with an interpolated value, or with the peak removed entirely. The mean, median, and mode values found for this modified distribution agree almost exactly with those found for out-of-community citations. See again Table 5. This interesting combination of properties invites further investigation. A tentative speculation could be made that both the out-of-community type of first citation and the first self-citations outside the ∆t_1 = 0 peak originate largely from examiners, whereas the in-community type and the self-citations from the inflated ∆t_1 = 0 peak are generally made by inventors themselves. As inventor and examiner citations have been separately identified on USPTO patents granted since 2001, a follow-up study utilizing a large-enough cohort of post-2001 patents with a sufficiently long time window for citation accrual should soon be possible.
We employed Welch's t-test [38] to establish whether the difference between the means of the citation lag for in-community and out-of-community citations is statistically significant. Results are summarized in Table 4. The t-scores, ranging between 6.6 and 8.7, indicate that the differences in the means of citation lags are highly statistically significant. Although the t-test is expected to yield reliable scores for non-normal distributions [39], we also applied it to the distribution of the logarithm of the time lag to first citation (after shifting the latter by a constant to eliminate negative values), which is normal to a good approximation. In this case as well, similarly large t-scores reject the possibility of the means being equal with high confidence.
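As a reminder of what the Welch test computes, the following sketch evaluates the t statistic and the Welch-Satterthwaite degrees of freedom for two samples with possibly unequal variances and sizes. The function name and the sample values are hypothetical and serve only as an illustration; the sign convention (first sample minus second) is a choice made here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for mean(sample_a) - mean(sample_b).

    Returns the statistic and the Welch-Satterthwaite degrees of freedom.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof


# Hypothetical lags (in months) for two small groups of first citations
out_lags = [2.0, 4.0, 6.0, 8.0, 10.0]
in_lags = [1.0, 2.0, 3.0, 4.0, 5.0]
t_score, dof = welch_t(out_lags, in_lags)
```

Because the denominator shrinks as the sample sizes grow, a fixed difference in means produces ever larger t-scores for larger samples.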
The high t-values generated in our application of the Welch test are likely a consequence of the very large sample sizes (1 415 data points from in-community citations, 12 272 from out-of-community citations) that increase sensitivity to small differences. To illustrate the effect sample size has on the outcome of the t-test, we independently drew a random subset of 300 citations from each of the in-community and out-of-community datasets and performed Welch's t-test on these small samples. This was repeated 500 times to obtain averages of such t-values, which are also listed in Table 4. In the case of the Infomap-generated community partition, the mean t-value resulting from this procedure is 2.9, and 83% of the small-sample simulations find the difference between the distributions' means to be significant at the 5% level. This indicates that sample sizes of at least 300 first citations are needed to detect with sufficient confidence the time delay reported here between those originating from within and outside of co-inventor communities.
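The subsampling procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: subsets are drawn without replacement, significance at the 5% level is judged with the normal-approximation threshold |t| > 1.96 instead of an exact p-value, and the data are synthetic Gaussian stand-ins, not the patent data. All names are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for mean(sample_a) - mean(sample_b)."""
    se2 = variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(se2)


def subsampled_welch(in_lags, out_lags, size=300, repeats=500, t_crit=1.96, seed=0):
    """Average t over repeated small subsamples and share of significant runs.

    ASSUMPTION: t_crit = 1.96 approximates the two-sided 5% threshold.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        sub_in = rng.sample(in_lags, size)    # drawn without replacement
        sub_out = rng.sample(out_lags, size)  # drawn independently
        scores.append(welch_t(sub_out, sub_in))
    share_significant = sum(abs(t) > t_crit for t in scores) / repeats
    return mean(scores), share_significant


# Synthetic stand-in data (NOT the patent data): two broad lag distributions
data_rng = random.Random(1)
in_lags = [data_rng.gauss(20.0, 15.0) for _ in range(1500)]
out_lags = [data_rng.gauss(25.0, 15.0) for _ in range(1500)]
mean_t, share = subsampled_welch(in_lags, out_lags)
```

Varying `size` in such a simulation shows how quickly the share of significant outcomes drops when fewer first citations are available.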
As further evidence for the inventor-community-related cause of acceleration in patent-citation dynamics, we performed the following control experiment. Starting with the Infomap-generated community partition of the LCC, inventors were randomly re-assigned to communities within this given structure. As a result, the high-level morphology (number of communities and their individual sizes) of the partition was left unchanged, but the groupings of inventors within each community became completely random, as indicated by the ARI value of 0.0002 obtained when comparing the initial and randomized partitions. Repeating the analysis of first citations, we observe that the total number of in-community citations occurring in the randomized community structure has dropped precipitously to 129, i.e., about 10% of the in-community citations present in the original structure. Furthermore, the mean (median) time to a first in-community citation in the randomized structure is determined to be 23.9 months (18 months), which is no longer significantly shorter than the corresponding value of 24.4 months (19 months) found for the out-of-community first citations. The Welch's t-test score of 0.24 also indicates that there is no significant difference between the means of the distributions of times to first in-community and out-of-community citations in the situation with randomized inventor assignment to communities. The significantly reduced total number of in-community citations, and the disappearing difference between mean times to first citations that originate from within and outside of communities, both convincingly indicate that the originally established inventor communities are indeed the platforms for accelerated patent-citation dynamics.
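One straightforward way to realize such a size-preserving randomization is to permute the community labels among the inventors. The sketch below illustrates this reading of the control experiment; it is not taken from the authors' code, and the function name and toy partition are hypothetical.

```python
import random
from collections import Counter


def shuffle_memberships(community_of, seed=None):
    """Randomly re-assign inventors to communities, keeping community sizes.

    Permuting the list of labels leaves the number of communities and the
    size of each one unchanged, while the grouping of inventors is random.
    """
    rng = random.Random(seed)
    inventors = list(community_of)
    labels = [community_of[i] for i in inventors]
    rng.shuffle(labels)
    return dict(zip(inventors, labels))


# Hypothetical partition with communities of sizes 3, 2, and 1
original = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 2}
randomized = shuffle_memberships(original, seed=42)
original_sizes = Counter(original.values())
randomized_sizes = Counter(randomized.values())
```

Re-running the citation classification and the Welch test on the randomized partition then yields the null comparison against which the original result is judged.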
We close this section by discussing several confounding variables whose influence may weaken our inference of a general non-trivial community-structure effect on citation dynamics.

Inventor team size: All other things being equal, inventors working in larger teams will have a greater chance of acquiring in-community first citations to their patents, which would also be likely to occur more quickly. However, the scope for this trivial mechanism causing the observed effect is severely limited by the fact that large inventor teams are actually quite rare. See, e.g., data presented in Ref. [40]. The average community size found in our work certainly exceeds the average inventor-team size given for the period 1995-1999 in Fig. 4 of that article.

Firm-level associations: Certain inventor communities may just be a reflection of the inventors' employment at the same firm and, in their case, intra-firm information channels acting in parallel to prior co-invention activity may drive the observed acceleration in the in-community citation dynamics. However, for large firms, especially those with multiple geographically separated R&D centers, this alternative mechanism could be ineffective. Disentangling trivial firm-association effects from knowledge flows established via real inventor collaboration would be an interesting direction for future research.

Niche-technology associations: The patent cohort studied in this work relates to the broad and technologically crowded semiconductor industry, and our detected communities may correlate with very specific technology types. The fact that citations are likely to come first from inventors in the same community could then arise simply because inventors that are working in the same industrial niche are likely to build on each other's technological advances before inventors working in technologically more distant fields. To clarify this issue, the distributions of technology specializations across the respective in-community and out-of-community citing-patent cohorts would need to be studied in greater detail using a suitable proxy measure for technological similarity that could be defined either within [41] or beyond [42] existing classification schemes.

Conclusions
We have investigated community structure on the co-inventor network associated with a particular cohort of USPTO patents where edge weights reflect the frequency of inventors' collaborative patenting activity. Five established community-detection algorithms (Greedy, Louvain, Infomap, Random Walks, and Propagating Labels) were deployed to identify communities on this network's largest connected component. The sizes and numbers of communities found by the different algorithms varied, with some similarities exhibited by algorithms using related methodology. Table 3 and Fig. 2 provide details enabling a quantitative comparison between the properties of the algorithmically detected community structures.
To investigate the effect of inventor communities on patent-citation dynamics, we analyzed the time lag to the first citation received by the patents associated with inventors from the largest connected component of the co-inventor network. Only citations originating from patents co-authored by at least one of the inventors from the largest connected component were counted for the present study. Three different types of first citation were distinguished: self-citations, in-community (non-self-)citations, and out-of-community citations. The distributions of time lags for each type of first citation were observed to have distinctive properties. Figure 3 shows results obtained based on the community structure found by the Infomap algorithm. The mean, median, and mode values for each type of distribution were determined to enable a quantitative comparison of the speeds of knowledge flow within and between different inventor communities. Results for all community structures investigated in the present work are summarized in Table 4. The median time delay to first citation for the out-of-community type turns out to be typically 3 months longer than for the in-community type. Self-citations and in-community-type citations have approximately the same median time lag, even though the distributions of time lags for these two types of first citation are markedly different. The difference between the mean time lag observed for out-of-community and in-community first citations is generally even larger than that found for the corresponding median values. Although the communities identified by the different detection algorithms utilized in this work differ in some detail, the observed influence of community structure on patent-citation dynamics was found to agree closely, even on a quantitative level. Thus our results provide a rather general quantification for the accelerated knowledge flow through inventor communities formed via collaborative patenting. 
Furthermore, the observation that association with distinct inventor communities based on previous co-authorship on patents results in faster citation of an inventor's later patents by members of that community provides strong further evidence for the fundamental importance of such collaboration-based social connections [6,8,9,43], which previous work was not always able to observe clearly [4].
Our focus on the time lag to patents' first citation was motivated by the expectation that this quantity will likely be the best proxy measure for a real difference in time scales for knowledge propagation within and outside of co-inventor communities. It would be interesting to also compare the mean time lags between later (i.e., second, third, . . . nth) in-community and out-of-community citations. Such a study should be able to observe how the in-community advantage for knowing earlier about an invention diminishes over its repeated utilization.
The results of our work point the way to other interesting directions for future research. For example, analyzing the characteristics of the algorithmically identified inventor communities could yield useful information regarding the structure of effective innovation teams, extending previous work that focused only on direct collaborations between inventors [40]. Studying the dependence of the observed acceleration of information flow through inventor communities on the field and type of inventions may yield a measure for the speed at which the knowledge frontier moves in different parts of innovation space. Community-detection methods could also be deployed to elucidate relationships shaping innovation activity beyond the network of inventors, e.g., on the level of firms [44] or other organizations. Thus opportunities abound for the useful application of modern network-analysis tools to innovation economics [45] and related social-science studies [46].

Subset of these patents whose first citation is a self-citation 2 584 -

Table 3 Comparison of inventor-community structures found by different algorithms. Summary statistics provided here pertain to the communities obtained by applying each of the five indicated community-detection algorithms to the largest connected component of the co-inventor network considered in this work. Similarity between community structures is quantified in terms of the adjusted Rand index (ARI) [33]. Listed ARI values are averages calculated for pairs of community structures generated by multiple runs of the respective algorithms that are being compared. The total number of first citations classified as in-community citations based on each of the five community partitions is also given.

Figure 1 Community structure within the co-inventor network. The upper (lower) panel indicates communities in the largest connected component of the co-inventor network identified by the Louvain (Infomap) algorithm using different colors (but with some colors being reused in the lower panel due to the overall large number of communities yielded by Infomap). Note how certain groups of communities identified as separate by Infomap are clustered together by Louvain. The size of the circle indicating a node is proportional to the number of edges attached to that node. Images created using Gephi [47].

Table 3 for community-structure-related summary statistics.