Distribution of labor, productivity and innovation in collaborative science

In this paper, we investigate the process of scientific discovery using an under-exploited source of information: the Polymath projects. Polymath projects are an original attempt to solve a series of mathematical problems collectively and in a collaborative online environment. To investigate the Polymath experiment, we analyze all the posts related to the projects that have resulted in a peer-reviewed publication. We focus in particular on the organization of the scientific labor and on the innovations that result from the contributions of the different authors. We find that a high presence of occasional contributors increases the productivity of the most active users and the overall productivity of the forums (i.e., the number of posts grows super-linearly with the number of contributors). We argue that, in large-scale collaborations, the serendipitous interaction between occasional contributors can be crucial to the scientific process, and individual contributions from occasional participants can open new directions of research.

Despite the success of these platforms and even though academic institutions have long insisted on the idea of open and participatory science, there are few actual examples of large-scale collective production in science.
The conceptual framework of online collaborative structures raises several important questions when applied to scientific production: Is science the craft of many or of few? Can research be conducted in a large-scale open collaborative environment? Can science be based on collaboration rather than competition? These questions are discussed in the book "Reinventing discovery" by Nielsen (2011), which presents several examples of collective problem solving. Among the cases described by Nielsen, one, the Polymath project, has attracted our attention, because it can be studied not only ethnographically but also computationally, since all its contributions are available as digital records.
The first Polymath project was proposed in 2009 by mathematician Tim Gowers, who, with a post on his blog, invited mathematicians to find a combinatorial proof of the density version of the Hales-Jewett Theorem, using a dedicated thread of discussion. Since then, fifteen other Polymath projects have been launched, six of which have resulted in one or more peer-reviewed publications signed with the collective name "Polymath Collaboration". The Polymath blogs not only enable the study of an important project in collaborative science, but also provide an unprecedented playground for the in-depth study of discovery processes.
In our work, we present a comprehensive statistical analysis of the Polymath ecosystem, looking specifically at the activities of participants and the content that they produce. Our findings are based on all projects that have resulted in a peer-reviewed publication (projects 1, 4, 5, 8, and 15). We did not examine the other projects because they were abandoned by their contributors at an early stage and their data are insufficient to support a robust analysis.
After discussing some related work, we present in Sect. 2 the data and methods used in our analysis. In Sect. 3 we present our results. In Sects. 3.1 and 3.2 we analyze the internal structure of collaboration and its role in productivity patterns. We identify a clear hierarchy in participation patterns with a hyperactive elite responsible for 80% of the work. At the same time, we show that collaborative architecture plays an important role in promoting individual production: Indeed, we observe a dynamic of super-production in which the presence of occasional participants helps to increase the productivity of the elite.
After analyzing the organization of work in open science, we focus on the mechanisms of scientific discovery. A mathematical discovery is the rigorous verification of a formal statement, realized by bringing together a set of pre-existing theorems, conjectures, axioms, and so on. It is therefore part of a larger category of innovation processes in which the introduction of new ideas and concepts is crucial to intellectual progress. Innovation processes can be described by the notion of "adjacent possible expansion" introduced by Kauffman (2000). This term refers to the expansion or restructuring of the possible knowledge space, triggered by the introduction of novel concepts. This type of process has been shown to leave distinctive traces in the statistical properties of the knowledge produced, expressed by two key laws first observed in linguistics: Zipf's law and Heaps' law (Tria et al. 2018). In Sect. 3.3 we show that Polymath's discovery dynamics exhibit the markers of adjacent possible expansion processes, similar to literary production and musical innovation. Finally, in Sect. 3.4 we examine the triggering factors for innovation and show that no rule determines a priori who the key innovators will be, as peripheral users can sometimes steer collective work in new directions.
However, few works focus on large-scale scientific collaboration and even fewer on Polymath projects. In addition to the reflections by Gowers himself (Gowers and Nielsen 2009), a descriptive analysis of the Polymath 1 project can be found in Barany (2010), which provides a qualitative discussion of the rules that Polymath contributors developed to organize their work. For a more quantitative analysis of the initiative, but limited to the first project, see Cranshaw and Kittur (2011). Kloumann et al. (2016) are, to our knowledge, the only authors who have presented a statistical analysis of multiple Polymath projects. Their analysis compares full Polymath projects with the side initiatives of "Mini-Polymath projects", which are smaller collaborations concerning Math Olympics questions that, while quite difficult, have known solutions. A detailed description of the collective problem-solving approach in the third Mini-Polymath has also been provided by Pease and Martin (2012).
Drawing on this literature, our paper develops in three directions. First, we confirm and extend the results on the distribution of labor obtained by Cranshaw and Kittur (2011) on Polymath 1, for the five projects that achieved a final peer-reviewed publication. We also extend this research by considering interaction patterns among contributors. Second, we investigate the productivity of collective intelligence in collaborative systems. Following Sornette's work on GitHub (Sornette et al. 2014), we study the superlinearity of production as a function of the number of users. Finally, based on innovation studies in online systems, such as the one presented in Tria et al. (2018), we introduce an innovation measure for the mathematical production process and identify the actors responsible for introducing innovations in the Polymath ecosystem.

Data and methods
We collected all the posts from Polymath projects 1, 4, 5, 8, and 15, starting from the links listed on the Polymath project wiki page (The Polymath Wiki 2021). The corpus for each project consists of a collection of posts identified by publication date, author, text, and parent post (for posts written in response to another contribution). The posts were published primarily on three blogs: Timothy Gowers's blog (2021), Terence Tao's blog (2021) and The Polymath blog (2021). Each of these blogs entails different technical restrictions on author interaction. On Gowers's blog, comments can only be posted in the main threads, limiting the depth of discussion and preventing authors from responding to comments in sub-threads. In contrast, Tao's blog allows comments up to a depth of 4. The Polymath blog does not appear to limit nested comments at any level and shows comments up to a depth of 10.
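As an illustration of the corpus structure described above, the following minimal sketch represents a post by its publication date, author, text, and parent post, and computes the nesting depth of a comment. The names `Post`, `post_id`, `parent_id`, and `reply_depth` are hypothetical and chosen only for this example; they are not taken from the original data pipeline.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class Post:
    post_id: int                      # hypothetical identifier, for illustration only
    date: datetime
    author: str
    text: str
    parent_id: Optional[int] = None   # None for a post in the main thread


def reply_depth(post_id: int, posts_by_id: Dict[int, Post]) -> int:
    """Count the parent links between a post and its top-level ancestor."""
    depth = 0
    current = posts_by_id[post_id]
    while current.parent_id is not None:
        current = posts_by_id[current.parent_id]
        depth += 1
    return depth


# Toy thread: post 2 answers post 1, post 3 answers post 2.
toy_posts = [
    Post(1, datetime(2009, 2, 1), "author_a", "Main post"),
    Post(2, datetime(2009, 2, 2), "author_b", "First reply", parent_id=1),
    Post(3, datetime(2009, 2, 3), "author_a", "Nested reply", parent_id=2),
]
posts_by_id = {p.post_id: p for p in toy_posts}
depths = {p.post_id: reply_depth(p.post_id, posts_by_id) for p in toy_posts}
```

With such a representation, the depth limits of the different blogs would appear as an upper bound on the values in `depths`.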

Demography of the projects
The five projects we analyzed contain between 545 and 3363 posts and the number of contributors varies from 57 to 199. Detailed information about each project can be found in Fig. 1. The network in Fig. 1 represents the bipartite network of contributors and projects: In the graph, every edge represents an author's participation in a project. The size of the contributors' nodes represents the number of their contributions. The graph shows that there is a small core of very active authors who have participated in almost all projects, and a periphery of occasional contributors working on a single project.

Contents' identification
Since we are interested in reconstructing the collaborative processes that led to the discovery of a mathematical solution, we need to identify the mathematical objects used in the posts. Natural language processing techniques performed poorly on this task and tended to identify non-mathematical terms, such as features of each participant's personal language patterns. Therefore, we built a mathematical vocabulary by means of a two-step protocol. First, we collected the titles of all Wikipedia pages labeled as "mathematics" (Lists of mathematics topics 2021). Second, we added to the list all the expressions "Theorem of *", "*'s conjecture", etc. extracted from the corpora. The dictionary we obtained contains 25,035 mathematical concepts. Table 1 shows the number of independent mathematical concepts retrieved in the projects: in expressions like "theorem of *", the sub-string "theorem" is not considered as an independent concept. Through this mathematical dictionary, the content of each post can be qualified by the set of mathematical concepts that it contains: $K_i = \{kw_1, \ldots, kw_m\}$. A post will thus be generally characterized by its time (t), its author (α) and its content (K).
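The dictionary-based tagging described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `build_matcher` and `extract_concepts` are hypothetical, matching is case-insensitive, and longer expressions are tried first so that a word such as "theorem" is not counted separately when it occurs inside a longer dictionary entry.

```python
import re


def build_matcher(vocabulary):
    """Compile one regular expression that prefers the longest dictionary entry."""
    ordered = sorted({v.lower() for v in vocabulary}, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b"
    return re.compile(pattern)


def extract_concepts(text, matcher):
    """Return the set K of dictionary concepts found in a post."""
    # findall scans left to right without overlaps, so a sub-string that is
    # part of a longer matched expression is not reported on its own.
    return set(matcher.findall(text.lower()))


# Toy vocabulary, for illustration only.
vocabulary = ["theorem", "Hales-Jewett theorem", "density", "graph"]
matcher = build_matcher(vocabulary)
concepts = extract_concepts(
    "A combinatorial proof of the density Hales-Jewett theorem.", matcher
)
```

In this toy example, `concepts` contains "density" and "hales-jewett theorem", while the bare word "theorem" is not reported as an independent concept.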

Topic extraction
We then aggregate different mathematical concepts into topics. This aggregation allows us to study the collaboration between authors and the structure of the collective labor.
To define topics, we create a co-occurrence network for each Polymath project, where the nodes represent different mathematical concepts. In this network, two concepts are connected if there is at least one post in which they were discussed together. The network is weighted according to the number of co-occurrences in different posts. Since this network is highly connected and extremely complex, we filter the edges to highlight relevant structures. In order to do so, we compute the Planar Maximally Filtered Graph (PMFG) proposed in Tumminello et al. (2005). We then define our topics as the clusters of mathematical concepts identified by the Louvain community detection algorithm (Blondel et al. 2008) on the PMFG. Table 2 shows the number of topics extracted for each project and the modularity of the partition of keywords in the filtered co-occurrence network. Using this definition of topics, we label each post with the topic whose keywords appear most frequently in the text. In case of a tie, no label is assigned to the post. Therefore, in addition to its publication time (t), author (α), and content (K), each post is also characterized by a topic label (T).
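Two steps of this procedure are simple enough to sketch: building the weighted co-occurrence network and labeling a post once the topics are known. The sketch below deliberately omits the PMFG filtering and the Louvain clustering, and it assumes that "most frequently" is measured by the number of distinct topic keywords present in the post. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(concept_sets):
    """Weight of edge (a, b) = number of posts in which a and b appear together."""
    weights = Counter()
    for concepts in concept_sets:
        for a, b in combinations(sorted(concepts), 2):
            weights[(a, b)] += 1
    return weights


def label_post(concepts, topics):
    """Return the topic sharing most keywords with the post, or None on a tie."""
    # ASSUMPTION: frequency is counted over distinct keywords of the post.
    scores = {name: len(concepts & keywords) for name, keywords in topics.items()}
    best = max(scores.values(), default=0)
    winners = [name for name, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]


# Toy data, for illustration only.
post_concepts = [{"a", "b"}, {"a", "b", "c"}, {"c"}]
weights = cooccurrence_weights(post_concepts)
topics = {"T1": {"a", "b"}, "T2": {"c", "d"}}
labels = [label_post(k, topics) for k in post_concepts]
```

In the full pipeline described in the text, the `topics` dictionary would be obtained from the communities detected on the filtered network rather than written by hand.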

Similarity and innovation
We first define the semantic similarity between two posts using the Jaccard measure between their contents: $J(i, j) = |K_i \cap K_j| / |K_i \cup K_j|$. We tweak this similarity by considering the temporal distance among the posts, thus introducing the semantic-temporal similarity, in which the temporal distance between two posts is scaled by $\tau_0$, the average time distance among all the pairs of posts (within each project). According to this measure, two posts that are similar in content but distant in time will be less similar than according to the standard Jaccard measure. We use the semantic-temporal similarity measure to define an innovation index for each post. First we define two separate indicators for each post:
• The in-debate index $\nu_i$ measures the similarity between a post and the contents published before it. It is calculated as the average semantic-temporal similarity with the previous posts.
• The impact index $\xi_i$ measures how much the content of a post is reproduced in the posts following it. It is calculated as the average Jaccard similarity with the following posts.
An innovative post is characterized by a low in-debate index (i.e., it is different from the earlier content) and a high impact (i.e., it influences the following contents, which are therefore similar to it). For this reason we define the innovation index for each post as:
$I_i = -\xi_i \log(\nu_i)$ (4)

Organization of labor
As usual in collaborative systems, only a few contributors do most of the work (Barabasi 2003). When we analyze the number of contributions made by each author, we find a power-law distribution (Fig. 2B) and high Gini indices (Fig. 2C). In Fig. 2A we represent this distribution in the form of the Lorenz curve: authors are ordered by the number of contributions and curves represent the cumulative fraction of posts produced by the corresponding fraction of ranked authors. From the figure, we can see that the most active 10% of contributors produce 80% of the posts (with the exception of project 4, which is characterized by a lower Gini index and where 20% of the contributors produce 80% of the posts). Following the procedure described in Bassolas et al. (2019), we use the Lorenz curve to categorize authors hierarchically: We take the derivative of the Lorenz curve at the point (1,1) and set an initial threshold at the point where the corresponding tangent line crosses the horizontal axis (see Fig. 2A). The authors beyond this threshold represent the most productive elite of the project. We remove these elite contributors and we repeat the procedure recursively, identifying a group we define as the first shell (highly active authors but outside the hyperactive elite) at the first iteration, and the peripheral shells (namely shells E3, E4, E5, E6 in Fig. 2D) at subsequent iterations. In Fig. 2C, we display the number of contributors in the elite group and in the first shell, while Fig. 2D shows the percentage of authors in each hierarchical category. We can see that, according to this classification, the elite group contains fewer than 10% of the authors while the peripheral shells are consistently the most represented. In the following we will refer to authors belonging to the elite and the first shell as the active core.

Interactions between the authors
To better understand the division of labor in Polymath, we investigated the distribution of interactions between authors. In particular, we focused on how the active core authors, as defined in Sect. 3.1, interact with the peripheral shells.
In order to do so, based on the dependencies between posts, we defined a comments interaction network $CIN = (V, E, W)$ with the following properties: each node $i \in V$ represents an author, an edge $(i, j) \in E$ represents the existence of at least one comment by author i to a post of author j, and the weight $W_{ij}$ associated with the edge $(i, j)$ represents the number of times author i replied to a post of author j. To understand whether such interactions are highly concentrated in the active core of elite authors or more spread towards peripheral contributors, we compared the obtained graphs with a stochastic network model preserving, on average, the activity level of each node. Similarly to Roth et al. (2013), we hence simulate K networks $\{\hat{W}_k = (V, E_k, Q_k)\}_{k \in \{1, \ldots, K\}}$ with an expected degree for each node equal to that of the corresponding author in our dataset, keeping the same number of nodes $n = |V|$ and edges $m = |E| = |E_k|$ for all $k \in \{1, \ldots, K\}$. To do so we draw the weights $Q^k_{ij}$ from a multinomial distribution with parameters m and $p = \{p_{ij}\}_{i,j \in V}$ such that $p_{ij} \propto d_i^{out} d_j^{in}$, where $d_i^{out}$ is the out-degree of node i and $d_j^{in}$ is the in-degree of node j in the comments interaction network. Figure 3A shows the distribution of the fraction of in-core links (i.e., the fraction of messages from elite contributors to other elite contributors) in our K = 100 simulations and compares these distributions with the actual fraction of in-core links in our dataset. Figure 3B shows the same comparison for the in-periphery links, i.e., the fraction of messages written by peripheral contributors directed to other peripheral contributors. Both plots show a peculiar division of labor in the Polymath project: both core-to-core and periphery-to-periphery links are more represented than in random simulations, underlining that authors are more likely to reply to contributors who participate in the discovery process to a similar extent.
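A degree-preserving null model of this kind can be sketched with the standard library alone. The sketch makes several assumptions that are not stated in the text: degrees are weighted (they count replies), the number of draws is the total number of replies, self-replies are excluded, and the probability of a pair is taken proportional to the product of the out-degree of the sender and the in-degree of the receiver. The function names are hypothetical.

```python
import random
from collections import Counter


def simulate_null(weights, n_sim=100, seed=0):
    """weights: dict (i, j) -> number of replies from author i to author j."""
    rng = random.Random(seed)
    d_out, d_in = Counter(), Counter()
    for (i, j), w in weights.items():
        d_out[i] += w
        d_in[j] += w
    m = sum(weights.values())  # ASSUMPTION: one draw per observed reply
    pairs = [(i, j) for i in d_out for j in d_in if i != j]
    probs = [d_out[i] * d_in[j] for i, j in pairs]  # p_ij proportional to d_out * d_in
    # Drawing m pairs with replacement is one sample of the multinomial distribution.
    return [Counter(rng.choices(pairs, weights=probs, k=m)) for _ in range(n_sim)]


def in_group_fraction(weights, group):
    """Share of the replies written by group members that target group members."""
    from_group = sum(w for (i, _), w in weights.items() if i in group)
    inside = sum(w for (i, j), w in weights.items() if i in group and j in group)
    return inside / from_group if from_group else 0.0


# Toy reply network, for illustration only.
observed = {("a", "b"): 5, ("b", "a"): 4, ("c", "a"): 1, ("a", "c"): 1, ("d", "c"): 1}
core = {"a", "b"}
simulations = simulate_null(observed, n_sim=100, seed=42)
observed_fraction = in_group_fraction(observed, core)
simulated_fractions = [in_group_fraction(s, core) for s in simulations]
```

Comparing `observed_fraction` with the spread of `simulated_fractions` mirrors the comparison between real values and simulation boxplots described for Fig. 3.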
As mentioned in the Data and Methods section, some of the blogs we studied limit the depth of response structures. We qualitatively observed a shift from a non-hierarchical structure in the very first project (i.e., only the presence of second-level comments and no deeper structures) to a more structured organization of posts in later projects. To evaluate the robustness of the results presented in Fig. 3, we compared them with the results obtained with a different definition of network interactions. We define a topic interaction network $TIN(T) = (V, E(T), W(T))$ with the following properties: the node set V still represents the set of authors, and an edge $(i, j) \in E$ represents the fact that authors i and j published a post on the same topic at a distance no bigger than T posts (when posts are ordered chronologically). The weight $W_{ij}(T)$, associated with the edge $(i, j)$, represents the number of times authors i and j published a post on the same topic in the time window defined by parameter T. Notice that, by definition, such a network is undirected. Once again, in order to study authors' interactions, we need to compare them with a set of simulated networks. To do so, it is now sufficient to draw only the values $\{Q_{ij}\}_{i,j \in V, j \geq i}$, as we want the network to be undirected, with $Q^k_{ij} = Q^k_{ji}$ for all the K simulations. Therefore, we draw the values $\{Q_{ij}\}_{i,j \in V, j \geq i}$ from a multinomial distribution of parameters m and $p = \{p_{ij}\}_{i,j \in V, j \geq i}$ such that $p_{ij} \propto d_i d_j$, where $d_i$ is the degree of node i in the topic interaction network. The resulting distribution of in-core and in-periphery interactions is shown in Fig. 4. We notice that, [...] -Lovin 1987). This is however surprising in a scientific context where interactions are generally assumed to be based on cumulative advantage processes (Merton 1968).
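The construction of the topic interaction network can be sketched as follows. The sketch assumes that posts are given in chronological order as (author, topic) pairs, that unlabeled posts carry the topic `None`, and that pairs formed by the same author are ignored; the function name `topic_interaction_network` is hypothetical.

```python
from collections import Counter


def topic_interaction_network(posts, window):
    """posts: chronologically ordered list of (author, topic); topic may be None.

    Returns undirected edge weights keyed by frozenset({author_i, author_j}).
    """
    weights = Counter()
    for idx, (author_a, topic_a) in enumerate(posts):
        if topic_a is None:
            continue  # unlabeled posts do not create interactions
        # Look at the posts published at most `window` positions later.
        for author_b, topic_b in posts[idx + 1: idx + 1 + window]:
            if topic_b == topic_a and author_b != author_a:
                weights[frozenset((author_a, author_b))] += 1
    return weights


# Toy sequence of labeled posts, for illustration only.
toy_sequence = [("a", "t1"), ("b", "t1"), ("c", "t2"), ("a", "t1"), ("b", None)]
tin_narrow = topic_interaction_network(toy_sequence, window=1)
tin_wide = topic_interaction_network(toy_sequence, window=2)
```

Increasing `window` corresponds to increasing the parameter T, which is how the different time windows of Fig. 4 would be obtained.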

Collective intelligence at work
Fig. 4 Interactions on the same topic: boxplots represent the outcome of simulations while ⋄ markers represent real values coming from data. A Percentage of links stemming from a core node and ending in a core node over the total number of edges stemming from core nodes. B Percentage of links stemming from peripheral nodes and ending in peripheral nodes over the total number of edges stemming from peripheral nodes. The increasing transparency corresponds to the increase of the time window
Gargiulo et al. Applied Network Science (2022) 7:19

Several studies on collaborative systems have shown a super-linear effect of collaboration: The very expression "collective intelligence" suggests that the collective productivity (in our case the number of posts) is higher than the sum of the individual productions. To test this feature dynamically, we count the daily number of posts and the daily number of participants for all projects, obtaining the time series $\{n_{post}(t_0), n_{post}(t_1), \ldots\}$ and $\{n_{user}(t_0), n_{user}(t_1), \ldots\}$, where $t_0, t_1, \ldots$ represent different days. To reduce noise, we smooth these time series with a 7-day rolling window. By plotting the pairs $(n_{user}(t), n_{post}(t))$, we obtain the curves representing the relationship between the number of users and the number of posts. Figure 5A shows a pronounced superlinear growth of the number of posts with the number of users, aggregated for all projects: $n_{post} = n_{user}^{\gamma}$ (with exponent $\gamma = 1.46$). Our results are similar to those of Sornette et al. (2014) for GitHub. Figure 5B, C suggest that contributions have positive super-linear effects, even when they are relatively marginal. In Fig. 5B, we show that the average individual daily production (for all contributors with more than 10 posts in all the projects) grows with the number of users active on that day. Figure 5C displays the average daily productivity of the active core as a function of the number of users in the peripheral shells. We observe that a strong presence of peripheral users boosts the productivity of the most active users.
In Fig. 5, we show the results obtained by aggregating all the projects. The individual analysis of each blog shows similar trends with very small variations in the growth exponents (blog1: γ = 1.30, blog4: γ = 1.22, blog5: γ = 1.46, blog8: γ = 1.65, blog15: γ = 1.50). Since the blog platforms are diverse, the robustness of these results suggests that super-productivity is an intrinsic characteristic of collaborative science, regardless of the communication medium.
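The estimation of a growth exponent from daily counts can be sketched as follows. The fitting procedure is not specified in the text, so the sketch assumes an ordinary least-squares fit in log-log space with a free intercept, and it skips days without users or posts. The function names `rolling_mean` and `fit_exponent` are hypothetical.

```python
import math


def rolling_mean(values, window=7):
    """Average of each block of `window` consecutive daily values."""
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]


def fit_exponent(n_user, n_post):
    """Slope of log(n_post) against log(n_user), estimated by least squares."""
    # ASSUMPTION: the fit includes an intercept; zero counts are discarded.
    points = [
        (math.log(u), math.log(p))
        for u, p in zip(n_user, n_post)
        if u > 0 and p > 0
    ]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return s_xy / s_xx


# Synthetic data following an exact power law, for illustration only.
users = list(range(1, 11))
posts = [u ** 1.5 for u in users]
gamma = fit_exponent(users, posts)
smoothed = rolling_mean([1, 1, 1, 1, 1, 1, 1, 8], window=7)
```

An exponent larger than 1 on real data would correspond to the super-linear regime discussed above.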

Statistical properties of scientific discoveries
While in the previous sections we analyzed collaborative patterns in open science, we now focus on the analysis of the scientific discovery process itself. First, we analyze the statistical properties of the mathematical concepts used in the projects. As described in the Methods Section, we have assigned a set of mathematical concepts to each post. We first test whether our corpus follows the basic laws of linguistic patterns: Zipf's law and Heaps' law. Zipf's law expresses the relationship between the frequency and the ranking of words. It states that the frequency of a word decays as a power of its rank, $f \sim r^{\alpha}$, with a negative exponent $\alpha$. For example, looking at the Project Gutenberg corpus (a large sample of English literature), one can observe a value $\alpha \sim -1$ for low values of r and $\alpha \sim -2$ for high values of r. Heaps' law concerns the entry of innovative concepts into a text and expresses the relationship between the number of different words (i.e., the vocabulary size v) and the total number of words used (i.e., the length l of the text). It describes an initial linear growth followed by a sublinear regime, according to the power law $v \sim l^{\alpha}$: in the Gutenberg corpus $\alpha \sim 1$ for low values of l and $\alpha \sim 0.44$ for high values of l have been observed. In Fig. 6, we see that not only are these laws respected in our corpus, but also all projects have the same behavior and exponents, Zipf's exponents being $\alpha = -0.36$ and $\alpha = -2$ and Heaps' exponents being $\alpha = 0.9$ and $\alpha = 0.4$. These values are consistent with those from the Gutenberg corpus (Tria et al. 2018), although the first exponent of Zipf's law in our corpora is smaller in absolute value, due to the fact that we removed the non-mathematical expressions and stop words. This consistency means that, statistically, the creative process of scientific discovery follows the same basic rules that characterize literary production.
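The two curves behind these laws are straightforward to compute from the chronological stream of concepts. The following minimal sketch builds the rank-frequency curve used for Zipf's law and the vocabulary-growth curve used for Heaps' law; the function names are hypothetical and the exponents would still have to be fitted on the resulting curves.

```python
from collections import Counter


def zipf_curve(concept_stream):
    """Frequencies sorted by rank: position 0 holds the most frequent concept."""
    counts = Counter(concept_stream)
    return sorted(counts.values(), reverse=True)


def heaps_curve(concept_stream):
    """Number of distinct concepts seen after each successive occurrence."""
    seen, curve = set(), []
    for concept in concept_stream:
        seen.add(concept)
        curve.append(len(seen))
    return curve


# Toy stream of concepts in order of appearance, for illustration only.
stream = ["a", "b", "a", "c", "a", "b"]
frequencies = zipf_curve(stream)
vocabulary_growth = heaps_curve(stream)
```

Plotting `frequencies` against rank and `vocabulary_growth` against position, both on logarithmic axes, gives the kind of curves on which the exponents are read.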
Second, we focused on the typical timing of the discovery process, based on the hypothesis that posts that are close in time would also tend to be similar in terms of content. In Fig. 7 we show the average Jaccard similarity between all pairs of posts published within a given time delay. We observe a power-law decay of similarity with time, $J \sim \Delta t^{-\gamma}$ (with $\gamma = 0.2$), once again similar for all projects. Thus, for all projects, there exists a typical time window in which the debate remains focused on the same topic before switching to a new one.
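The average similarity as a function of the time delay can be sketched as follows. The sketch assumes numeric timestamps and a fixed bin width for the delays, neither of which is specified in the text, and it treats two empty concept sets as having zero similarity. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of concepts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_by_delay(posts, bin_width):
    """posts: list of (time, concept_set). Returns {delay_bin: mean Jaccard}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for (t1, k1), (t2, k2) in combinations(posts, 2):
        delay_bin = int(abs(t2 - t1) // bin_width)
        sums[delay_bin] += jaccard(k1, k2)
        counts[delay_bin] += 1
    return {b: sums[b] / counts[b] for b in sums}


# Toy posts with numeric timestamps, for illustration only.
toy = [(0, {"a", "b"}), (1, {"a", "b"}), (5, {"a", "c"})]
decay = similarity_by_delay(toy, bin_width=2)
```

On real data, the values of `decay` plotted against the delay on logarithmic axes would show the slow decay described above.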

Innovation patterns
Finally, we analyze how innovation affects the discovery mechanism, by using the innovation measure we defined in Sect. 2.4. As observed in Fig. 8A, the distributions of the innovation values are long-tailed, meaning that few posts have a much larger innovative content compared to the others: high innovation is rare, but statistically significant.
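To make the innovation measure concrete, the sketch below computes, for each post, the in-debate index, the impact index, and their combination into the innovation index. The exact functional form of the time discount is not recoverable from the text, so an exponential discount with scale τ0 is assumed purely for illustration. The treatment of the first post, of the last post, and of a vanishing in-debate index is also an assumption (the sketch returns `None` in these cases), and all function names are hypothetical.

```python
import math
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of concepts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pair_distance(times):
    """tau_0: average time distance over all pairs of posts."""
    gaps = [abs(t2 - t1) for t1, t2 in combinations(times, 2)]
    return sum(gaps) / len(gaps)


def innovation_indices(posts, tau0):
    """posts: chronologically ordered list of (time, concept_set)."""
    n = len(posts)
    result = []
    for i, (t_i, k_i) in enumerate(posts):
        if i == 0 or i == n - 1:
            result.append(None)  # needs both earlier and later posts
            continue
        # ASSUMPTION: exponential discount of the Jaccard similarity with time.
        nu = sum(
            jaccard(k_i, k_j) * math.exp(-(t_i - t_j) / tau0)
            for t_j, k_j in posts[:i]
        ) / i
        xi = sum(jaccard(k_i, k_j) for _, k_j in posts[i + 1:]) / (n - 1 - i)
        result.append(-xi * math.log(nu) if nu > 0 else None)
    return result


# Toy posts with numeric timestamps, for illustration only.
toy = [(0, {"a"}), (1, {"a", "b"}), (2, {"a", "b"}), (3, {"b"})]
tau0 = mean_pair_distance([t for t, _ in toy])
indices = innovation_indices(toy, tau0)
```

A post that shares little with what came before it but is echoed by later posts receives a large value, which is the behavior the index is designed to capture.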
We define posts in the top quartile of the innovation distribution as innovative. Then, referring to the definition of activity shells introduced in Sect. 3.1, we examine which actors lead innovation. Since the groups vary in size, we compare the number of innovations observed in each class with their multinomial expectation, namely the probability that a post is innovative (25%) multiplied by the number of posts produced by the group. We calculate the z-score between the observed and expected values. While the previous results showed a fairly homogeneous behavior between the different projects, here we observe significant differences. In projects 1, 4, and 8, the elite produces more innovation than expected. In project 15, the first shell is the main driver of innovation. Finally, in project 5, the peripheral shells are the largest producers of innovation. This result highlights that in large-scale collaborations no rule determines a priori who will be the main innovators. An innovator can be a member of the hyper-active elite, but sometimes serendipitous interactions of peripheral participants can also have a large impact on the discovery process: an isolated contribution of an occasional participant can be responsible for opening a large adjacent possible and giving a new direction to the work.
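The comparison between observed and expected numbers of innovative posts can be sketched as follows. The expectation is the one given in the text (25% of the posts of the group); the standard deviation used for the z-score is not specified, so the sketch assumes the binomial one. The function name and the toy figures are hypothetical.

```python
import math


def innovation_z_score(n_posts, n_innovative, p=0.25):
    """z-score of the observed number of innovative posts of a group."""
    expected = p * n_posts
    # ASSUMPTION: binomial standard deviation around the expected count.
    std = math.sqrt(n_posts * p * (1 - p))
    return (n_innovative - expected) / std


# Toy figures, for illustration only: (posts written, innovative posts).
groups = {"elite": (400, 130), "first shell": (200, 50), "periphery": (100, 15)}
z_scores = {name: innovation_z_score(n, k) for name, (n, k) in groups.items()}
```

A positive score means that a group produces more innovative posts than its volume of activity alone would suggest, and a negative score means fewer.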

Conclusion
Over the past few decades, we have witnessed the rise of large online collaborations such as Linux, GitHub, and Wikipedia. In 2009, the first Polymath project was launched with the goal of exploiting online collaborative environments to solve mathematical problems.
In this work, we have investigated how the path to scientific discovery develops in this collaborative environment, how the labor is organized between authors, and which actors are the main innovators. Our results, which are consistent with previous works, show that productivity is highly skewed between contributors and that there is a small hyper-productive elite that publishes the bulk of the contributions. Nonetheless, peripheral contributors also play a significant role, as content production grows super-linearly with the number of discussants.
Our analysis shows that, in Polymath projects, peripheral contributions boost the activity of other authors in a rather indirect way. Although interactions between the elite and the rest of the participants are relatively limited (as both peripheral and hyper-productive authors tend to interact mainly with authors with similar levels of activity), we have demonstrated that peripheral authors often play a crucial role in bringing new and innovative ideas to the debate. Our analysis has also shown that innovators cannot be defined only by their productivity level. Sometimes, occasional contributors can play a key role in innovation and be responsible for steering the research in new directions.
In this exploration of the Polymath ecosystem, we focused on four main directions: classifying contributors by their involvement in the projects, analyzing interactions among contributors, examining the impact of large-scale collaboration on productivity, and finally identifying the actors responsible for innovation. We conducted only a limited analysis of the content of the posts and of the semantic relationships between them. In a follow-up study, it would be interesting to examine the relationships between contributions based on the similarity of the content they produced rather than considering only their direct interactions in the response network. This structure would allow us to analyze the thematic cooperation patterns between contributors. This similarity network would also allow us to characterize the internal composition of users' "opinions" on solution techniques and their complex dynamics. Finally, it would be interesting in future work to compare the results obtained from the Polymath dataset with other online collaborative environments, in particular to analyze the relationship between the level of participation and innovation, which remains largely unexplored in the literature.