Online community management as social network design: testing for the signature of management activities in online communities
Applied Network Science volume 2, Article number: 30 (2017)
Abstract
Online communities are used across several fields of human activity as environments for large-scale collaboration. The most successful ones employ professionals, sometimes called “community managers” or “moderators”, for tasks including onboarding new participants, mediating conflict, and policing unwanted behaviour. Network scientists routinely model interaction across participants in online communities as social networks. We interpret the activity of community managers as (social) network design: they take action oriented at shaping the network of interactions in a way conducive to their community’s goals. It follows that, if such action is successful, we should be able to detect its signature in the network itself.
Growing networks where links are allocated by a preferential attachment mechanism are known to converge to networks displaying a power law degree distribution. Growth and preferential attachment are both reasonable first-approximation assumptions to describe interaction networks in online communities. Our main hypothesis is that managed online communities are characterised by indegree distributions that deviate from the power law form; such deviation constitutes the signature of successful community management. Our secondary hypothesis is that said deviation happens in a predictable way, once community management practices are accounted for. If true, these hypotheses would give us a simple test for the effectiveness of community management practices.
We investigate the issue using (1) empirical data on three small online communities and (2) a computer model that simulates a widely used community management activity called onboarding. We find that onboarding produces indegree distributions that systematically deviate from power law behaviour for low values of the indegree; we then explore the implications and possible applications of the finding.
Introduction
Organizations running online communities typically employ community managers, tasked with encouraging participation and resolving conflict (Rheingold 1993). These are participants, typically in small numbers (one or two members in the smaller communities) who recognise some central command, and carry out its directives. We shall henceforth call such directives policies.
Putting policies in place for online communities is costly, in terms of both recruiting and training community managers and providing software tools. This raises the question of what benefits organisations running online communities expect from policies, and why they choose certain policies and not others. In what follows we outline and briefly discuss the set of assumptions that underpin our investigation.
We model online communities as social networks of interactions across participants. An implicit assumption in our work is that the topology of the interaction network of an online community affects its ability to reach its objectives (which can be framed as the maximization of some objective function^{1}; see for instance Tapscott and Williams 2008; Slegg 2014).
Community managers may thus devise a course of action to alter the interaction patterns of their communities, so as to favor and support the achievement of the community’s objectives.
The actions can be encoded as a set of simple instructions for community managers to execute. Computer scientists might think of such instructions as algorithms; economists call them mechanisms; professional online community managers call them policies. In this paper we use this third term.
All this implies that the decision to deploy a particular policy on an online community is a network design exercise. An organisation decides to employ a community manager to shape the interaction network of its community in a way that helps its own ultimate goals. And yet, interaction networks in online communities cannot simply be designed; they are the result of many independent decisions, made by individuals who do not respond to the organization’s command structure. An online community management policy is then best understood as an attempt to “influence” emergent social dynamics; to use a more synthetic expression, it can be best understood as the attempt to design for emergence. Its paradoxical nature is at the heart of its appeal.
We are interested in detecting the mathematical signature of specific policies in the network topology. We consider a simple policy called onboarding (Rheingold 1993; Shirky 2008). As a new participant becomes active (e.g. by posting her first post), community managers are instructed to leave her a comment that contains (a) positive feedback and (b) suggestions to engage with other participants that she might share interests with.
We model online conversations as social networks, and look for the effect of onboarding on the topology of those networks. We proceed as follows:

1. We initially examine data from three small online communities. Only two of them deploy a policy of onboarding. We observe that, indeed, the shape of the degree distribution of these two differs from that of the third.

2. We propose an experiment protocol to determine whether onboarding policies can explain the differences observed between the degree distributions of the first two online communities and that of the third one.

3. Based on a generalized preferential attachment model (Dorogovtsev and Mendes 2002), we simulate the growth of online communities. Variants to the model cover the relevant cases: the absence of onboarding policies and their presence, with varying degrees of effectiveness.

4. We run the experiment protocol against the degree distributions generated by the computer model, and discuss its results.
The “Related works” section briefly examines the two strands of literature that we mostly draw upon. The “Materials and methods” section presents some data from real-world online communities; it then proceeds to describe our main experiment, a computer simulation of interaction in online communities with and without onboarding. The “Results” section presents the experiment’s results. The “Discussion and conclusions” section discusses them.
Related works
The extraordinary successes of online communities in deploying large-scale, decentralized projects have led many scholars to conjecture that online communities exhibit emergent behavior, and to call such behavior collective intelligence, after an influential book by Pierre Lévy (1997). This name was adopted by a research community that aims at providing tools for better collective sense-making and decision-making, such as argument maps (representations of the logical structure of a debate, with all redundancy eliminated) (Shum 2003) and attention-mediation metrics (indicators that signal what, in an online debate, is most worth reading and responding to; the number of Likes on Facebook is one such metric) (Klein 2012).
Alongside positive studies, scholars have researched the normative aspects of online community management. The monograph by Kraut et al. (2012) confirms the importance of online community management practices, and proposes a categorization of, and a critical look at, existing practices. Others have tried to develop systematic approaches to community building (Diplaris et al. 2011) and to produce technological innovation to support it (Shum 2003; De Liddo et al. 2012). These tools are meant to facilitate and encourage participation in online communities, and to make it easier for individuals to extract knowledge from them.
Starting in the 2000s, online communities became the object of another line of enquiry, stemming from network science. Network representation of relationships across groups of humans has yielded considerable insights in social sciences since the work of the sociometrists in the 1930s, and continues to do so; phenomena like effective spread of information, innovation adoption, and brokerage have all been addressed in a network perspective (Borgatti et al. 2009; Burt 2009). As new datasets encoding human interaction became available, many online communities came to be represented as social networks. This was the case for social networking sites, like Facebook (Lewis et al. 2008; Nick 2013); microblogging platforms like Twitter (Kunegis et al. 2013; Java et al. 2007; Hodas and Lerman 2014); news-sharing services like Digg (Hodas and Lerman 2014); collaborative editing projects like Wikipedia (Laniado et al. 2011); discussion forums like the Java forum (Zhang et al. 2007); and bug reporting services for software developers like Bugzilla (Zanetti et al. 2012). Generally, such networks represent participants as nodes. Edges represent a relationship or interaction. The nature of interaction varies across online communities: one edge can stand for friendship in Facebook; a follower-followed relationship, retweet or mention in Twitter; a vote or comment in Digg and the Java forum; talk in Wikipedia; a comment in Bugzilla.
In contrast to collective intelligence scholars, network scientists typically do not address the issue of community management, and treat social networks drawn from online interaction as fully emergent. In this paper, we employ a network approach to investigate whether the work of community managers leaves a footprint detectable by quantitative analysis. To our knowledge, no other work has attempted this investigation. In particular, we exploit a result from the theory of evolving networks: the seminal work by Barabási and Albert (1999) shows that the assumptions of growth and preferential attachment, when taken together, result in a network whose degree distribution converges to a power law (Barabasi 2005; Barabási et al. 1999). The model was later generalized in various ways and tested across a broad range of networks, including social networks (Dorogovtsev and Mendes 2002).
We use this generalized model as a baseline. We acknowledge that there are real-world human communication networks that do not appear to have been generated by it (see for example Leskovec and Horvitz 2008). In very large social networks, for example, limitations to human cognition as expressed by Dunbar numbers might truncate the distribution.
The baseline model implies that the indegree distribution of the interaction network in an online community follows a power law by default. The action of online community managers, as they attempt to further the goals of the organisation that runs the online community, will result in its degree distribution deviating from the baseline power law in predictable ways. Such deviation can be interpreted as the signature that the policy is working well.
The most important difficulty with this method is the absence of a counterfactual: if a policy is enacted in the online community, the baseline degree distribution corresponding to the absence of the policy is not observable, and vice versa. This rules out a direct proof that the policy “works”. Hence our choice to combine empirical data and computer simulations.
In a previous paper (Cottica et al. 2016), we tested whether power law models are a good fit for the untransformed indegree distributions of interaction networks in online communities. The approach presented in this paper is more general, in that we transform the indegree distributions before applying the same test. This is meant to take explicitly on board the node attractiveness parameter mentioned in Dorogovtsev and Mendes (2002).
Materials and methods
In this section we introduce the empirical data, the experiment protocol and the simulation model we use in the experiment.
Empirical data
We examine data from three real-world online communities. We obtained the data from the organisations managing them; in fact, one of the authors is directly involved in two of them, Edgeryders and Matera 2019. The three are roughly comparable in size; all are used by practitioners and interested citizens to publicly discuss issues that have a collective dimension; and all arise around a shared interest rather than personal ties. The last point is important, since “topical” and “social” online interaction patterns have been shown to be different (Grabowicz et al. 2013).
Online communities are modelled as interaction networks, in which nodes are registered users and edges represent comments. The presence of an edge from Alice to Bob indicates that Alice has commented on content authored by Bob at least once. The resulting graphs are directed (“Alice comments on Bob” is not equivalent to “Bob comments on Alice”) and weighted (Alice can write multiple comments to Bob’s content; the edge’s weight is equal to the number of comments written). Table 1 presents some descriptive statistics about them.

InnovatoriPA^{2} is a community of (mostly) Italian civil servants discussing how to introduce and foster innovation in the public sector. It does not employ any special onboarding or moderation policy.

Edgeryders^{3} is a community of (mostly) European citizens, discussing public policy issues from the perspective of grassroots activism and social innovation. It enacts a policy of onboarding new members.

Matera 2019^{4} is a community of (mostly) citizens of the Italian city of Matera and the surrounding region, discussing the city’s policies. It, too, enacts an onboarding policy. The two policies are exactly the same; Matera 2019 has modelled its community management policies on those of Edgeryders.
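The interaction-network representation described above can be illustrated with a minimal Python sketch that aggregates comment records into weighted directed edges. The function names, the (commenter, author) record format and the sample records are hypothetical; counting indegree as the number of distinct commenters, rather than the number of comments received, is also an assumption of this sketch and not something the data description specifies.

```python
from collections import Counter


def build_interaction_network(comments):
    """Aggregate (commenter, author) records into weighted directed edges."""
    weights = Counter()
    for commenter, author in comments:
        weights[(commenter, author)] += 1  # each comment adds 1 to the edge weight
    return weights


def indegree(weights):
    """Indegree of each author: number of distinct users who commented on them."""
    degree = Counter()
    for _commenter, author in weights:
        degree[author] += 1
    return degree


# Hypothetical records, for illustration only.
comments = [("alice", "bob"), ("alice", "bob"), ("bob", "alice"), ("carol", "bob")]
network = build_interaction_network(comments)
degrees = indegree(network)
```

In this toy example the edge from alice to bob has weight 2, while the edge from bob to alice is a separate edge with weight 1, which reflects the directedness of the graph.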
A glance at the visualizations of the three interaction networks (Fig. 1; descriptive statistics in Table 1) suggests that they have very different topologies. Innovatori PA displays more obviously visible hubs than the other two.
Testing goodness-of-fit
We fitted power laws to the indegree distributions of these three online communities, as of early December 2014. Next, we tested the hypothesis that these indegree distributions follow a power law, as predicted by Dorogovtsev and Mendes (2002). Figure 2 illustrates the computation workflow.
To do so, we first fitted power functions to the entire support of each indegree distribution. We emphasize indegree, as opposed to outdegree, because directedness is implicit in the idea of preferential attachment, and because the indegree distribution is the one to follow a power law in online conversation networks (Dorogovtsev and Mendes 2002).
We next fitted power functions to the right tail of each indegree distribution, i.e. for any degree k(n)≥k _{min}, where k _{min} is the indegree that minimizes the Kolmogorov-Smirnov distance (hereafter denoted as D) between the fitted function and the data with indegree k(n)≥k _{min}. Figure 3 shows the similarity between the indegree distributions of the interaction networks of real-life and simulated online communities, with and without onboarding.
Finally, we ran goodness-of-fit tests for each indegree distribution and its fitted power functions. The method we followed throughout the paper is borrowed from Clauset et al. (2009). In the rest of this section we briefly describe it.
We start from a null hypothesis that the observed distribution is generated by a power function with exponent α, on the domain k≥k _{min}. We denote by D the Kolmogorov-Smirnov distance between the observed distribution and the power function that best fits it. Next, we use the best-fit power function to generate a large number (N) of distributions. We denote by D _{G} the Kolmogorov-Smirnov distance between each of them and its own best-fit power function. Finally, we compare D with D _{G} for each of the generated distributions.
Such comparison is summarised in a p-value, computed as the fraction of the N generated distributions for which D _{G}>D. A p-value close to 1 indicates that the power function is a good fit for the data: hence the null hypothesis is not rejected. A p-value close to zero indicates that the power function is a bad fit for the data, and the null hypothesis is rejected. The rejection threshold is set, conservatively, at 0.1.
The computation described above is conducted in a systematic way. The observed data are fitted twice, first over the whole indegree distribution (that is, over the interval k≥1), and then over the interval k≥k _{min}. The goodness-of-fit procedure is then run on both fitted power functions.
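The goodness-of-fit procedure just described can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the code used for the paper: the support is truncated at a hypothetical K_MAX, the exponent is estimated with the closed-form approximation for discrete data given by Clauset et al. (2009), which is rough for small k _{min}, and all function names are invented for the example.

```python
import bisect
import math
import random

K_MAX = 10_000  # support truncated at K_MAX: an assumption of this sketch


def fit_alpha(tail, k_min):
    """Approximate maximum-likelihood exponent for discrete data."""
    return 1.0 + len(tail) / sum(math.log(k / (k_min - 0.5)) for k in tail)


def model_cdf(alpha, k_min):
    """CDF of p(k) proportional to k**-alpha on k_min..K_MAX."""
    weights = [k ** -alpha for k in range(k_min, K_MAX + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    return cdf


def ks_distance(tail, alpha, k_min):
    """Largest gap between the empirical CDF and the model CDF."""
    ordered = sorted(tail)
    cdf = model_cdf(alpha, k_min)
    top = min(ordered[-1], K_MAX)
    return max(
        abs(bisect.bisect_right(ordered, k) / len(ordered) - cdf[k - k_min])
        for k in range(k_min, top + 1)
    )


def sample(alpha, k_min, n, rng):
    """Draw n values from the fitted model by inverse-transform sampling."""
    cdf = model_cdf(alpha, k_min)
    last = len(cdf) - 1
    return [k_min + min(bisect.bisect_left(cdf, rng.random()), last) for _ in range(n)]


def gof_p_value(data, k_min=1, n_synthetic=100, seed=0):
    """Fraction of synthetic datasets whose distance D_G exceeds the observed D."""
    rng = random.Random(seed)
    tail = [k for k in data if k >= k_min]
    alpha = fit_alpha(tail, k_min)
    d_observed = ks_distance(tail, alpha, k_min)
    exceed = 0
    for _ in range(n_synthetic):
        synthetic = sample(alpha, k_min, len(tail), rng)
        # Each synthetic dataset is compared with its own best-fit model.
        d_generated = ks_distance(synthetic, fit_alpha(synthetic, k_min), k_min)
        exceed += d_generated > d_observed
    return exceed / n_synthetic
```

Under the rule stated above, a returned value below 0.1 would lead to rejecting the power-law hypothesis. Calling `gof_p_value` once with `k_min=1` and once with the distance-minimizing k _{min} mirrors the two fits described in the text.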
Results on empirical data
Results are summarised in Table 1. When we consider the interval k≥1, we find that the indegree distribution of the Innovatori PA network, the unmoderated one, is consistent with the expected behavior of an evolving network with preferential attachment: we cannot reject the null hypothesis that it was generated by a power law. For the other two communities, both with onboarding policies, the null hypothesis is strongly rejected. On the other hand, when we consider only the tail of the degree distributions, i.e. k≥k _{min}, all three communities display a behavior that is consistent with preferential attachment.
These results are consistent with the objectives of the onboarding policy, which consists in helping newcomers find their way around a community that they don’t know yet. A successfully onboarded new user will generally have some extra interaction with existing active members. All things being equal, we can expect extra edges to appear in the network, and to interfere with the indegree distribution that would appear in the absence of onboarding; this would explain the non-power-law distributions of Edgeryders and Matera 2019. Extra edges target mostly low connectivity nodes: onboarding targets newcomers, and focuses on helping them through the first few successful interactions. Highly active (therefore highly connected) members do not need to be onboarded. This may explain why all three communities display power law behavior in the upper tail of their indegree distributions, regardless of onboarding.
The difference observed between the two communities with onboarding policies and the one without might be caused not by the policy itself, but by some other unobserved variable. For example, variations in user experience design choices are associated with different (network) patterns of inter-user communication in Hodas and Lerman (2014). Cultural differences across the different user bases could also be playing a role. The available evidence is compatible with the hypothesis that onboarding policies in online communities leave a signature in the indegree distribution of their interaction networks, but it cannot prove that hypothesis.
Experiment protocol
To explore the issue further, we generate and compare computer simulations of interaction networks in online communities that are identical except for the presence and effectiveness of onboarding policies. In this way, we isolate the effect of onboarding on the interaction network from that of any other effect that might be at work in the real world. The mechanics of the model are described in “The simulation model” section.
We proceed as follows.
First, we simulate the evolution of the interaction network of a large number of online communities. We divide them into a control group (no onboarding policy) and a treatment group (presence of onboarding policy). Specifically, we simulate the evolution of the interaction network of:

One hundred communities with no onboarding policy. These will constitute the control group of our simulated communities.

One hundred communities for each pair of values of ν _{1} and ν _{2}, with ν _{1},ν _{2}∈{0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. These will constitute our treatment groups.

For each of these networks, we compute the indegree distribution.
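The experimental design above can be enumerated with a short Python sketch, which makes the bookkeeping explicit: one control variant plus one treatment variant per pair of parameter values. The dictionary layout and variable names are hypothetical, chosen only for illustration.

```python
from itertools import product

NU_VALUES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
RUNS_PER_VARIANT = 100

# Control group: no onboarding at all (encoded here with nu1 = nu2 = None).
variants = [{"onboarding": False, "nu1": None, "nu2": None}]

# Treatment groups: one variant for each (nu1, nu2) pair.
variants += [
    {"onboarding": True, "nu1": nu1, "nu2": nu2}
    for nu1, nu2 in product(NU_VALUES, repeat=2)
]

total_networks = len(variants) * RUNS_PER_VARIANT
```

With six values for each parameter this gives 36 treatment variants and one control variant, hence 37 variants and 3700 simulated networks.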
Next, we define the following hypotheses.

Let C be the network of interaction in an online community. Denote the indegree of node n in the network by k(n). Let F be the best-fit power-law model for the distribution of the transformed indegree
$$ q(k) = k + mA \tag{1} $$
where k is the indegree of a node in C, m is the number of new edges added to the network at each timestep, and A is a node attractiveness parameter.

Hypothesis 1. The distribution of q(k) is generated by F for any k>1.

Hypothesis 2. The distribution of q(k) is generated by F for any k≥k _{min}, where k _{min} is the indegree that minimizes the Kolmogorov-Smirnov distance between the fitted function and the data over k≥k _{min}.
Hypotheses 1 and 2 are similar in scope, but different in strength. Hypothesis 1 rests on the more restrictive condition that the indegree distribution is a good fit for a power function over its whole domain; Hypothesis 2 requires only that the distribution be a good fit for a power function over its upper tail. This makes Hypothesis 2 much harder to reject. For example, for Edgeryders and Matera 2019, Hypothesis 1 is rejected, whereas Hypothesis 2 is not rejected.
Both hypotheses are based on the asymptotic form taken by the stationary indegree distribution of networks growing by preferential attachment in Dorogovtsev and Mendes (2002). The result holds even if preferential attachment is not the sole mode of network evolution, and regardless of how edge sources are chosen.
The exact formulation for Hypothesis 1 in Dorogovtsev and Mendes (2002) is k>>1, which we approximate with k>1 because onboarding only targets newcomers to an online community, and therefore low-degree nodes in the network. In other words, onboarding’s influence on the goodness-of-fit of the transformed indegree distribution to the power law model is strongest on its lower tail.
Finally, we test Hypotheses 1 and 2 on each of the 3700 indegree distributions generated. We do this using the goodness-of-fit tests proposed by Clauset et al. (2009) and illustrated in detail in the Appendix. We expect to obtain the following:

In the control group, both Hypothesis 1 and Hypothesis 2 are true.

In the treatment group with fully effective onboarding Hypothesis 1 is false and Hypothesis 2 is true.

In the intermediate situations of partially ineffective onboarding, Hypothesis 1 can be true or false, according to the value of ν _{1} and ν _{2}. Hypothesis 2 is true.
Disproving Hypotheses 1 and 2 implies that, in the context of the model, the micro-level behaviour prescribed by the onboarding policy onto the community manager gives rise to an indegree distribution that is no longer power-law-shaped. The real-world implications of such a result are discussed in “Discussion and conclusions” section.
The simulation model
Our computer model simulates the growth of an interaction network in an online community with and without onboarding. It follows closely the practices of real-world online community management as we know them, for example as reported in the Edgeryders and Matera 2019 online communities. The purpose of this is to check what effect this micro behaviour has on the network and its degree distribution.
Without onboarding
We use the model without onboarding to generate the networks in our control group. The mechanism used for the network to grow is based on preferential attachment, consistent with the Barabási-Albert tradition. We follow the more general formulation of Dorogovtsev and Mendes (2002).

A (directed) network is initialized, consisting of two reciprocally connected nodes u,v (thus comprising two directed links u→v,v→u).

At each time step,

one new node – representing a participant in the online community – appears in the network.

m new edges, representing comments, appear in the network. The source of each edge is drawn at random from the uniform distribution of the existing nodes. Its target is chosen according to the following rule: the probability that the new edge points to node s is proportional to k(s)+A, where k(s) is the indegree of s and A is a parameter representing additional attractiveness of the node.

Specifying edge sources represents a departure from Dorogovtsev and Mendes (2002), where edge sources are left unspecified. We need to specify edge sources in order to conform to the data model of the network analysis software we are using; this, however, does not have any analytical implications, as both Dorogovtsev and Mendes (2002) and we focus on the indegree distribution.
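The growth mechanism without onboarding can be sketched in Python as follows. This is a minimal illustration, not the simulation code used for the paper. Several details are assumptions made for the sketch: the newcomer is eligible both as a source and as a target as soon as it appears, self-loops and repeated edges are allowed and counted in the indegree, and the function name and integer node labels are hypothetical.

```python
import random


def grow_network(n_nodes=2000, m=1, A=1.0, seed=None):
    """Grow a directed network by preferential attachment on indegree."""
    rng = random.Random(seed)
    indegree = [1, 1]            # nodes 0 and 1 start reciprocally connected
    edges = [(0, 1), (1, 0)]
    while len(indegree) < n_nodes:
        indegree.append(0)       # one new participant joins at each time step
        for _ in range(m):
            # Source: uniform over existing nodes (newcomer included, by assumption).
            source = rng.randrange(len(indegree))
            # Target: probability proportional to k(s) + A.
            weights = [k + A for k in indegree]
            target = rng.choices(range(len(indegree)), weights=weights)[0]
            indegree[target] += 1
            edges.append((source, target))
    return indegree, edges


indegree, edges = grow_network(n_nodes=200, m=1, A=1.0, seed=1)
```

Recomputing the weights at every step keeps the sketch short at the cost of quadratic running time, which remains manageable for networks of a few thousand nodes.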
With onboarding
We use a variant of the above model that includes onboarding to generate the networks in our treatment group. The variant consists simply of the model without onboarding, to which further steps are added as follows.

At each timestep, in addition to the m edges mentioned above, one additional edge is directed towards the newcomer node. This is meant to represent the community manager’s onboarding action described in “Introduction” section.

At each timestep, with probability ν _{1}, one edge is added. Its source is the newcomer node; its target is chosen according to the following rule: the probability that the new edge points to node s is proportional to k(s)+A where A is a parameter representing additional attractiveness of the node. This is meant to represent the newcomer’s reaction to the community manager’s onboarding activity; as a result of the latter, the newcomer becomes active and reaches out to someone in the community, as advised by the community manager. We assume that community managers will normally incline to point newcomers to existing users who are reputed to be interesting conversationalists, and that the characteristic of being interesting conversationalists is correlated with node indegree. Parameter ν _{1} can be thought of as representing onboarding effectiveness. More skilled community managers will be more persuasive in inducing newcomers to reach out and engage in the conversation taking place in the online community.

At each timestep, one more edge is added with probability ν _{2}. Its source is drawn at random from the uniform distribution of the existing nodes; its target is the newcomer node. This represents a successful onboarding outcome: the new participant, by becoming active, has attracted the attention of some existing participant, who has engaged with her. No longer isolated, she is now in conversation. ν _{2} can be thought of as representing community responsiveness. As it increases, the efforts of newcomers to engage in conversation become more likely to be reciprocated.
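The onboarding variant can be sketched in the same spirit. Again this is an illustration under stated assumptions rather than the authors' implementation: node 0 is assumed to play the role of the community manager (the text does not say which node is the source of the onboarding edge), the newcomer is eligible as source and target of ordinary edges, self-loops and repeated edges are allowed, and all names are hypothetical.

```python
import random


def grow_with_onboarding(n_nodes=2000, m=1, A=1.0, nu1=1.0, nu2=1.0, seed=None):
    """Preferential-attachment growth with the three onboarding steps added."""
    rng = random.Random(seed)
    indegree = [1, 1]            # nodes 0 and 1 start reciprocally connected
    edges = [(0, 1), (1, 0)]
    manager = 0                  # assumption: node 0 acts as community manager

    def preferential_target():
        # Probability of choosing node s is proportional to k(s) + A.
        weights = [k + A for k in indegree]
        return rng.choices(range(len(indegree)), weights=weights)[0]

    def add_edge(source, target):
        indegree[target] += 1
        edges.append((source, target))

    while len(indegree) < n_nodes:
        indegree.append(0)
        newcomer = len(indegree) - 1
        for _ in range(m):                      # baseline edges, as without onboarding
            add_edge(rng.randrange(len(indegree)), preferential_target())
        add_edge(manager, newcomer)             # the manager's onboarding comment
        if rng.random() < nu1:                  # onboarding effectiveness
            add_edge(newcomer, preferential_target())
        if rng.random() < nu2:                  # community responsiveness
            add_edge(rng.randrange(len(indegree)), newcomer)
    return indegree, edges


full_in, full_edges = grow_with_onboarding(n_nodes=100, nu1=1.0, nu2=1.0, seed=1)
none_in, none_edges = grow_with_onboarding(n_nodes=100, nu1=0.0, nu2=0.0, seed=1)
```

Because every newcomer receives the onboarding edge, no node in the treatment networks has indegree zero, which is one direct way in which the lower end of the indegree distribution differs from the control case.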
Results
Following the protocol outlined above, we evolved 100 networks for each of the 37 variants of the model. For all networks, we set network size to 2000 nodes; A=1; and m=1. These choices are discussed in the Appendix. Our results are summarized in Table 2.
Goodness-of-fit of the power-law model
For each network evolved we computed two best-fit power-law models, one for k>1 and the other for k≥k _{min}, where k _{min} is the indegree that minimizes the Kolmogorov-Smirnov distance between the fitted function and the data over k≥k _{min}. On each of these models, we ran a goodness-of-fit test as described in “Experiment protocol” section. This resulted in two distributions of p-values for our control group, plus two more for each of our 36 treatment groups. Tables 3 and 4 report descriptive statistics for these distributions.
From Table 3, we conclude that onboarding seems to have some effect on the goodness-of-fit of the generated data to their respective best-fit power-law models when k>1. The effect goes in the direction of reducing the p-values and increasing the number of rejections to almost 100%.
It is worth looking at the average p-values generated by each combination of ν _{1} and ν _{2}. These are shown in Table 4.
We run t-tests of the null hypothesis that the average p-value in the control group is equal to the average p-value in each of the treatment groups. This results in a strong rejection of the null for any combination of ν _{1} and ν _{2} (6.5<T<7.5 in all cases). It seems unquestionable that introducing onboarding to an online community has a measurable negative impact on the probability that a power-law model is a good fit for its interaction network’s indegree distribution.
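The comparison of average p-values between groups can be illustrated with a small Python sketch of a two-sample t statistic. The text does not say which variant of the t-test was used; the unequal-variance (Welch) form below is an assumption, and the two lists of p-values are made-up numbers for illustration only, not results from the paper.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Two-sample t statistic that does not assume equal variances."""
    var_a = variance(sample_a) / len(sample_a)
    var_b = variance(sample_b) / len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)


# Hypothetical p-values from goodness-of-fit tests, for illustration only.
control_p = [0.62, 0.48, 0.71, 0.55, 0.39]
treatment_p = [0.02, 0.00, 0.05, 0.01, 0.03]
t_statistic = welch_t(control_p, treatment_p)
```

A large positive statistic indicates that the control group's average p-value is higher than the treatment group's, which is the direction of the effect reported above.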
When we consider only the upper tail of the distribution generated by Eq. 1, the effect of introducing onboarding on the goodness-of-fit is much less clear. In Table 5 we show what happens when we choose the scaling range so as to minimize the Kolmogorov-Smirnov distance between the degree distributions themselves and their best-fit power-law models. In the control group, the goodness-of-fit-to-power-law test fails in 13 of the 100 runs. In the treatment groups, rejections vary from 18 to 36, depending on the values of ν _{1} and ν _{2}.
Average p-values of goodness-of-fit tests when k≥k _{min} are shown in Table 6. They are all well within the do-not-reject range.
Tables 5 and 6 tell two different stories. Table 5 is inconclusive: in both the control and the treatment groups, we do not reject Hypothesis 2 most of the time, as expected, but must still reject it in a relatively large number of cases (13 in the control group, 18 to 36 in the treatment groups). Table 6 indicates that the average p-value in all groups is comfortably within the do-not-reject range, and in this sense behaves entirely according to Hypothesis 2.
Lower bounds
Our results show a limited, albeit statistically significant, effect of onboarding on the value of k _{min}, the value of k that minimizes the Kolmogorov-Smirnov distance between the data generated by the computer simulation and the best-fit power-law model. Table 7 shows, for each value of ν _{1} and ν _{2}, the average value of k _{min}, and the result (expressed as a p-value) of a t-test on the null hypothesis that such average value is the same as the corresponding statistic in the control group, against the alternative hypothesis that the former is greater than the latter.
A glance at Fig. 4 shows that over 80% of the indegree distributions from interaction networks in the control group, compared with only 50 to 60% of those in the treatment group, fit a power-law model best for k _{min}≤2. Within the treatment group, no significant variability seems to be associated with the increase of either ν _{1} or ν _{2}.
Exponents
We find that introducing onboarding to an online community has a positive and significant effect on the value of the exponent of the best-fit power-law model for the indegree distribution of its interaction network, as computed on k>1. This is consistent with the theoretical results by Dorogovtsev and Mendes (2002), who proved that introducing a fraction of non-preferential attachment edges in evolving networks with preferential attachment does not suppress the power-law dependence of its degree distribution, but only increases the scaling exponent thereof.
This result holds when the best-fit power-law model is computed over k≥k _{min}, where k _{min} is, as usual, the value of k that minimizes the Kolmogorov-Smirnov distance between the simulated indegree distribution and its best-fit power-law model. When it is computed over the whole support of the indegree distribution (k≥1), it also holds, except for ν _{1}=1; in this case, the values of the exponents in the control and in the treatment groups are not statistically distinguishable. Tables 8 and 9 show, for each value of ν _{1} and ν _{2}, the average value of the scaling parameter α, and the result (expressed as a p-value) of a t-test on the null hypothesis that such average value is the same as the corresponding statistic in the control group, against the alternative hypothesis that the former is greater than the latter. Table 8 refers to k≥1, whereas Table 9 refers to k≥k _{min}.
The influence of ν1 and ν2
We now turn to the question of the role played by ν _{1} and ν _{2} within the treatment group. Figure 5 show the cumulate density functions of the pvalues in the control and treatment groups as ν _{1} and ν _{2} vary in the case of k>1.
Onboarding effectiveness ν _{1} and community responsiveness ν _{2} do not seem seem to affect the goodnessoffit to power law of indegree distributions much. This is clearest in Table 3, as well as in Fig. 5. This is likely to be simply an effect of the large impact of onboarding: the percentage of nonpower law distributions is already close to 100% and cannot increase any further.
To dig deeper into this result, consider what happens when we allow our lower bound k _{min} to vary, so as to maximize the indegree distribution’s goodness-of-fit to a power-law model. From Table 5, we observe that the number of rejections tends to decrease as ν _{1} and ν _{2} increase. This tendency is mirrored by the tendency of the average p-values in our goodness-of-fit tests to increase with both ν _{1} and ν _{2} (Table 6). Figure 6 shows this more clearly.
Intuitively, the community manager’s act of onboarding new members introduces a different law of motion into an evolving network otherwise based on preferential attachment. This shows up as an increase in the number of indegree distributions that are not a good fit for a power function. However, if the community manager is successful, her action will prompt more activity (by the newcomer, via the parameter ν _{1}), and this extra activity does follow a preferential attachment rule. This pushes the shape of the indegree distribution back towards the power function. In the left chart of Fig. 6, the curves representing the values of ν _{1} are pushed down as ν _{1} increases, towards the curve described by the control group.
As for ν _{2}, we would expect it to act in the opposite direction to ν _{1}. This is because, if the community is highly responsive (high ν _{2}), more edges will be generated that do not follow a preferential attachment rule, but simply target the one newcomer node. This effect, however, is in practice dampened by a nonlinear response of the goodness-of-fit tests to additional edges targeting the newcomer: adding the second non-preferential attachment edge does not have as much effect on the test as adding the first one. This shows up in the right part of Fig. 6, where the curves representing different values of ν _{2} are bunched together.
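The interplay of the two parameters can be made concrete with a minimal sketch of one possible reading of the mechanism described above. It is not the authors' implementation. It assumes that at each time step one newcomer joins and one baseline edge is allocated with probability proportional to indegree plus an attractiveness constant; that the moderator's edge always targets the newcomer; that with probability ν _{1} the newcomer sends one extra edge by preferential attachment; and that with probability ν _{2} one extra edge targets the newcomer. The function name is hypothetical, only indegrees are tracked, and self-loops are not excluded.

```python
import random


def simulate_indegrees(steps, nu1, nu2, onboarding=True, attractiveness=1.0, seed=0):
    """Grow a network and return the list of node indegrees.

    Assumed reading of the model, for illustration only.
    """
    rng = random.Random(seed)
    indegree = [1, 1]  # two initial nodes linked in both directions

    def preferential_target():
        # Probability proportional to indegree + attractiveness.
        weights = [k + attractiveness for k in indegree]
        return rng.choices(range(len(indegree)), weights=weights)[0]

    for _ in range(steps):
        indegree.append(0)
        newcomer = len(indegree) - 1
        indegree[preferential_target()] += 1  # baseline edge
        if onboarding:
            indegree[newcomer] += 1  # moderator welcomes the newcomer
            if rng.random() < nu1:
                # Newcomer becomes active: preferential attachment edge.
                indegree[preferential_target()] += 1
            if rng.random() < nu2:
                # Community responds: non-preferential edge to the newcomer.
                indegree[newcomer] += 1
    return indegree


control = simulate_indegrees(200, nu1=0.0, nu2=0.0, onboarding=False)
treated = simulate_indegrees(200, nu1=0.6, nu2=0.4)
```

Under these assumptions, raising ν _{1} adds edges that follow the rich-get-richer rule, while raising ν _{2} adds edges concentrated on nodes of low indegree, which is the opposition described in the text.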
Regression analysis is unable to either confirm or falsify the intuition from Fig. 6. We generated 6 dummy variables, each taking value 1 when ν _{1}=c and 0 otherwise, with c∈{0.0,0.2,0.4,0.6,0.8,1}; next, we generated 6 more dummy variables for the same values of ν _{2}. We then estimated a linear regression model with the p-value of our goodness-of-fit test (computed for k>1) as the dependent variable and the 12 dummy variables as its predictors. The results are:

1. Coefficients on predictors corresponding to different values of ν _{1} are non-significant, with one exception: the coefficient on the variable corresponding to ν _{1}=0.4 is positive (as expected) and weakly significant (p-value: 0.026).

2. Coefficients on predictors corresponding to different values of ν _{2} are non-significant.

3. Coefficients on interaction terms between ν _{1} and ν _{2} are not significant.

4. We ran F-tests of joint significance of the group of predictors corresponding to different values of ν _{1}; different values of ν _{2}; and the interaction terms thereof. The null hypothesis of non-significance was not rejected by any of the tests.
Similar results hold when p-values are computed for k>k _{min}.
Discussion and conclusions
We examine data from three online communities. We find that the interaction networks of the two that employ onboarding policies are topologically different from that of the one that does not. We then turn to the question of whether this difference might encode the mathematical signature of the onboarding policy itself. To answer this, we build a time-dependent simulation model of an online community, in two (otherwise identical) versions: with and without onboarding of new members. We find that onboarding policies induce a poorer fit of power-law models to the indegree distributions of the resulting interaction networks. This effect shows in all key parameters that describe power-law models. When onboarding is enacted:

More simulated networks fail the test of goodness-of-fit to a power-law distribution. For k>1, almost all fail it.

p-values of the best-fit power-law models are lower.

The values of k that minimise the Kolmogorov-Smirnov distance between the best-fit power-law models and the observed data are greater.

Scaling parameters are greater: onboarding makes the allocation of incoming edges more equal.
Furthermore, we find that varying our onboarding effectiveness (ν _{1}) and community responsiveness (ν _{2}) does not have a large impact on the outcome of the simulation.
We next turn to a discussion of these results, and their potential for real-world application.
Accounting for degree distribution shape in the interaction networks of online communities
Our simulation model incorporates two forces. The first one is preferential attachment; the second is onboarding. The former is meant to represent the rich-get-richer effect observed in many real-world social networks; the latter is meant to represent the onboarding action of moderators and community managers. The former’s effect is known to lead to the emergence of an indegree distribution that approximates a power-law model. The latter’s effect is more subtle, because it is in turn composed of two effects. The first one consists of the direct action of the moderator, which always targets the newcomer; the second one of the actions that might be undertaken as a result of a well-executed onboarding policy.
The direct action of the moderators creates edges pointing to nodes not selected by preferential attachment. In the model, this increases the number of community-created edges, which do target nodes selected via preferential attachment. In sum, onboarding increases connectivity; adds extra edges according to a non-preferential attachment rule; and, except in the case of ν _{1}=ν _{2}=0, also adds edges according to preferential attachment. Its net effect on the goodness-of-fit is hard to determine a priori; in practice ν _{1} and ν _{2} turn out to have a surprisingly small effect.
In fact, our results suggest that a highly effective community manager and a highly responsive community can drive the degree distribution closer to the power law state. This appears to reflect the generation of more edges allocated by preferential attachment as a consequence of the onboarding activity, though the differences are too small for solid statistical analysis.
The behaviour of the community manager as encoded in our simulation model accounts for the different results in tests relating to Hypothesis 1 (k>1) and Hypothesis 2 (k>k _{min}). Both in the model and in real life, onboarding always targets newcomers to online communities. By doing so, moderators hope to help shy newcomers turn into confident, active community members. This, however, does not prevent everyone else from receiving incoming edges, allocated by preferential attachment. Therefore, we expect the degree distributions generated by our model to be power-law shaped, but with power-law behaviour “drowned out” by non-preferential attachment edges being created at low levels of k. This is indeed what we observe, in the form of a stronger rejection of Hypothesis 1 than of Hypothesis 2 (compare Tables 3 and 4 with Tables 5 and 6).
Applications and limitations
We undertook this research work in the hope of discovering a simple empirical test that could be used to assess the presence and effectiveness of online community management policies, onboarding among them. The guiding idea is that the agency of online community managers and moderators follows a logic other than the rich-get-richer dynamics that spontaneously arise in many social networks. Such dynamics are associated with power-law shaped degree distributions, which we can regard as the default state for social interaction networks. We conjecture that enacting community management policies, such as onboarding, alters the shape of the online community’s interaction network and its degree distribution. We furthermore conjecture that the precise nature of such deviations can be interpreted, and ultimately translated into statistical tests.
Our results are in accordance with the first of the two conjectures. The second, however, is only very partially confirmed.
Throughout the paper, we test indegree distributions for goodness-of-fit to a power function. The test’s null hypothesis is that such distributions follow a power law. If the test does not reject the null, we conclude from the “Results” section that no onboarding is at work. If the test does reject the null, however, we cannot draw any conclusion. This result is compatible with the presence of onboarding, but also with any number of other processes that might be at work.
This is not a major concern for our purposes. The use case we have in mind for our empirical test is this: an organisation has instructed its community manager to onboard new members as they join, and wishes to assess the quality of her work. The organisation already knows which policy it is enacting; what it does not know is how well it works. Even in this case, a do-not-reject test result tells the organisation that the community manager is not carrying out the work, but a reject test result cannot confirm that she is, and certainly cannot assess her performance.
Nor does the goodness-of-fit test tell an organisation whether the performance of its community manager and the responsiveness of its community are improving over time. Improvement in the community manager’s performance is captured by increases in ν _{1}; improvement in community responsiveness is captured by increases in ν _{2}. We have shown that their values do not have a detectable effect on the test.
Directions for future research
There are several directions in which our work could be taken further. The first is a full and systematic exploration of the parameter space, with the goal of assessing our results’ robustness with respect to model specification. In this paper we restrict ourselves to the presence and effectiveness of the onboarding action in a baseline model which is closely modeled on Dorogovtsev’s and Mendes’s results (Dorogovtsev and Mendes 2002); it would be useful to test for how these results carry through as we alter other parameters of the model, such as the number of edges m created at each time step, and the additional attractiveness parameter A.
Secondly, we could attempt to make the model into a more realistic description of a real-world online community. Such an attempt would draw attention to how some real-world phenomena, when incorporated in the model, influence its results. It would also carry the advantage of allowing online community management professionals to more easily interact with the model and critique it.
Finally, we could attempt to gauge the influence of onboarding and other community management policies on network topology by indicators other than the shape of its degree distribution, such as the presence of subcommunities.
Endnotes
^{1} The literature on stochastic actor-oriented models goes several steps further, and models interaction in a social network assuming that all participants pursue goals with respect to their position in the network (Snijders 1996). We do not explore this direction in the present paper because such models require the assumption of invariant network size. In our context, that would be a zero-growth online community. We reject such an assumption as too unrealistic.
^{2} See http://www.innovatoripa.it
^{3} See https://edgeryders.eu
Appendix
A1. Testing for goodness-of-fit of a power law distribution
The goodness-of-fit tests we employed were built following a procedure indicated by Clauset et al. (2009, pp. 15–18). What follows summarizes it in the context of the paper. The test’s null hypothesis is that the empirical data are distributed according to a power-law model; the alternative hypothesis is that they are not.
First, we fit the data for the degree distribution of a network generated by our model to a discrete power-law model, using maximum likelihood estimation. When we are testing for goodness-of-fit of the entire degree distribution, we set the lower bound of the fitted power-law model to 1; when we are testing for goodness-of-fit of the distribution’s upper tail only, we choose a lower bound such that the Kolmogorov-Smirnov distance D between the power-law model and the empirical data is minimized. Formally, define:
$$D = \max_{k \geq k_{\min}} \left| S(k) - P(k) \right|$$
Here, S(k) is the cumulative distribution function of the data for the observations with value at least k _{min}, and P(k) is the cumulative distribution function for the power-law model that best fits the data in the region k≥k _{min}. The value of k _{min} that minimizes the function D is the estimate for the model’s lower bound.
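The distance D can be computed directly once a scaling parameter and a lower bound are given. The sketch below assumes a discrete power law with probability mass proportional to k^{-α} for k≥k _{min}, so that the model's cumulative distribution function is expressed through the Hurwitz zeta function. The zeta function is approximated here by a truncated sum with an integral tail correction, and the function names are hypothetical. The maximum likelihood fit of α is assumed to have been carried out beforehand.

```python
from collections import Counter


def hurwitz_zeta(alpha, q, terms=1000):
    """Approximate sum_{n>=0} (n + q)^(-alpha) for alpha > 1.

    Truncated sum plus the first Euler-Maclaurin tail terms.
    """
    head = sum((q + n) ** -alpha for n in range(terms))
    start = q + terms
    return head + start ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * start ** -alpha


def ks_distance(data, alpha, k_min):
    """Kolmogorov-Smirnov distance between the tail of `data`
    (values >= k_min) and a discrete power law with exponent alpha."""
    tail = [k for k in data if k >= k_min]
    counts = Counter(tail)
    n_tail = len(tail)
    normalisation = hurwitz_zeta(alpha, k_min)
    distance = 0.0
    seen = 0
    for k in range(k_min, max(tail) + 1):
        seen += counts.get(k, 0)
        empirical_cdf = seen / n_tail  # S(k)
        model_cdf = 1.0 - hurwitz_zeta(alpha, k + 1) / normalisation  # P(k)
        distance = max(distance, abs(empirical_cdf - model_cdf))
    return distance
```

To estimate the lower bound, one would evaluate `ks_distance` for each candidate k _{min}, refitting α each time, and keep the candidate with the smallest distance.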
Next, we generate a large number of power-law distributed synthetic datasets with the same scaling parameter, standard deviation and lower bound as those of the distribution that best fits the empirical data. We fit each of these synthetic datasets to its own power-law model and calculate the D statistic of each one relative to its own model. Finally, we count what fraction of the values of D thus computed is larger than the value of D computed for the empirical data. This fraction is interpretable as a p-value: the probability that data generated by our estimated best-fit power-law model will be more distant from the model than our empirical data (“distant” in the Kolmogorov-Smirnov sense). A p-value close to zero indicates that it is quite unlikely that the estimated power-law model would generate empirical data so distant from the fitted power function; a p-value close to one, on the contrary, indicates that the estimated power-law model is quite likely to generate data that are further away from the fitted power function than the ones we collected.
Generating artificial datasets requires a treatment for the region below k _{min} that differs from that of the one above it. We proceed as follows. Assume that our observed dataset has n observations in total and n _{tail} observations such that k≥k _{min}. To generate a synthetic dataset with n observations, we repeat the following procedure n times:
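The final counting step is simple enough to state in code. The sketch below assumes that the distances of the synthetic datasets have already been computed, each relative to its own fitted model; the function name and the numbers in the usage line are hypothetical.

```python
def bootstrap_p_value(empirical_distance, synthetic_distances):
    """Fraction of synthetic datasets whose Kolmogorov-Smirnov
    distance exceeds the one measured on the empirical data."""
    larger = sum(1 for d in synthetic_distances if d > empirical_distance)
    return larger / len(synthetic_distances)


# Hypothetical distances, for illustration only.
p_value = bootstrap_p_value(0.10, [0.05, 0.20, 0.30, 0.08])
```

A value near zero means that few synthetic datasets sit as far from their model as the empirical data do, which counts against the power-law hypothesis.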

With probability n _{tail}/n we generate a random number k _{i} with k _{i}≥k _{min}, drawn from a power law with the same scaling parameter as our best-fit model.

Otherwise, with probability 1−n _{tail}/n, we select one element uniformly at random from among the elements of the observed dataset in the region k<k _{min}.
At the end of the process, we will have a synthetic dataset that follows the estimated power-law model for k≥k _{min}, but has the same non-power-law distribution below k _{min}.
This test requires that we decide how many synthetic datasets to generate for each test, and what threshold value to use for rejecting the null hypothesis. Again following Clauset et al. (2009), we make the following decisions:
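The procedure can be sketched as follows. Each draw comes from the fitted tail with probability equal to the fraction of observations lying in the tail, and is otherwise resampled from the observed values below the lower bound. Tail draws use the continuous approximation to the discrete power law given by Clauset et al. (2009), which is an assumption of this sketch and not necessarily the generator used for the paper. Function names are hypothetical.

```python
import random


def draw_power_law(alpha, k_min, rng):
    """Approximate draw from a discrete power law with k >= k_min,
    obtained by rounding a continuous power-law variate."""
    u = rng.random()  # uniform in [0, 1)
    return int((k_min - 0.5) * (1.0 - u) ** (-1.0 / (alpha - 1.0)) + 0.5)


def synthetic_dataset(observed, alpha, k_min, rng):
    """Semi-parametric bootstrap sample of the same size as `observed`."""
    body = [k for k in observed if k < k_min]
    n = len(observed)
    tail_share = (n - len(body)) / n
    sample = []
    for _ in range(n):
        if rng.random() < tail_share:
            # Tail: follow the fitted power-law model.
            sample.append(draw_power_law(alpha, k_min, rng))
        else:
            # Body: resample from the observed values below k_min.
            sample.append(rng.choice(body))
    return sample


rng = random.Random(42)
observed = [1, 1, 1, 2, 2, 3, 5, 8, 13, 21]  # hypothetical indegrees
sample = synthetic_dataset(observed, alpha=2.5, k_min=3, rng=rng)
```

Keeping the region below the lower bound empirical means that each synthetic dataset differs from the model only where the model is claimed to hold.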

We set the number of artificial datasets generated to 2500. This corresponds to an accuracy of about 0.01, based on an analysis of the expected worst-case performance of the test.

We conservatively set the rejection threshold at 0.01.
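The figure of 2500 datasets is consistent with the rule of thumb given by Clauset et al. (2009): for p-values accurate to within about ε of the true value, at least 1/(4ε²) synthetic datasets are needed. With ε=0.01 (the symbols below are notation introduced for this illustration):

```latex
n_{\text{synthetic}} \;\geq\; \frac{1}{4\,\varepsilon^{2}}
  \;=\; \frac{1}{4 \times (0.01)^{2}}
  \;=\; 2500
```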
A2. Choosing parameter values
The simulation’s computational intensity prevented us from conducting a thorough exploration of its behaviour across the whole parameter space. It follows that we had to pick values for some parameters. In this section we briefly discuss our choice of parameter values. The choice of m=1 implies that the number of edges in the networks in our control group will be equal to the number of nodes; we initialize the network with two nodes connected by two edges (one in each direction), then add one node and one edge at each time step. A glance at Fig. 1 shows that this is unrealistic. The real-world online communities described in Section Empirical data all display a number of edges which is a multiple of the number of nodes.
We justify this choice as follows: we have no pretence at realism. Rather, we are interested in pitting two phenomena against each other: preferential attachment, which tends to generate rich-get-richer dynamics; and onboarding, which tends to introduce a measure of equality. We modeled onboarding by having one single incoming edge target the only newcomer to the community at each timestep; we therefore chose to have one single non-onboarding-generated edge at each timestep. It seems reasonable that our choice would make these two forces roughly equivalent to each other, and make the impact of onboarding on the indegree distribution easier to detect.
The choice of A=1 follows from another, and more fundamental, modelling choice. We mimic Dorogovtsev’s and Mendes’s approach, where the network being modeled is directed and the probability that a new edge targets a node with indegree k is proportional to k+A (Dorogovtsev and Mendes 2002); this contrasts with Barabási’s and Albert’s approach, which models the network as undirected and assumes that the probability that a new edge targets a node is proportional to its total degree. In a Dorogovtsev-Mendes type model, new nodes have, by construction, indegree zero, whereas in a Barabási-Albert type model new nodes have total degree one. It follows that, in a Dorogovtsev-Mendes type model, the parameter A tunes the “traction” of preferential attachment: the higher its value, the weaker the grip of pure preferential attachment. For A=0, Dorogovtsev-Mendes type models degenerate into “multiple star networks”, where the probability that a newcomer receives an edge is zero, and all edges target the nodes initially in the network for all time.
Setting A=1, we make the probability that a newcomer receives its first edge equal to one half of the probability that an incumbent participant who already has one incoming edge receives its second one, one third of the probability that an incumbent participant who already has two incoming edges receives its third one, and so on. One can check that this behaviour mimics that of the simplest, and best known, Barabási-Albert type model.
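These ratios can be checked numerically with a short sketch. It assumes that the probability of a node receiving the next edge is proportional to its indegree plus A, which is the reading that the one-half and one-third ratios imply; the function name is hypothetical.

```python
def attachment_probabilities(indegrees, attractiveness=1.0):
    """Probability that each node receives the next edge, assuming
    a weight of indegree + attractiveness for every node."""
    weights = [k + attractiveness for k in indegrees]
    total = sum(weights)
    return [w / total for w in weights]


# A newcomer (indegree 0) next to incumbents with indegree 1 and 2.
newcomer, one_edge, two_edges = attachment_probabilities([0, 1, 2])
ratio_to_one_edge = newcomer / one_edge
ratio_to_two_edges = newcomer / two_edges
```

Calling the function with `attractiveness=0.0` gives the newcomer zero probability, which corresponds to the degenerate case in which only the initial nodes ever receive edges.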
References
Barabási, AL (2005) The origin of bursts and heavy tails in human dynamics. Nature 435(7039): 207–211.
Barabási, AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439): 509–512.
Barabási, AL, Albert R, Jeong H (1999) Mean-field theory for scale-free random networks. Physica A Stat Mech Appl 272(1): 173–187.
Borgatti, SP, Mehra A, Brass DJ, Labianca G (2009) Network analysis in the social sciences. Science 323(5916): 892–895.
Burt, RS (2009) Structural Holes: The Social Structure of Competition. Harvard University Press, Cambridge.
Clauset, A, Shalizi CR, Newman ME (2009) Power-law distributions in empirical data. SIAM Rev 51(4): 661–703.
Cottica, A, Melançon G, Renoust B (2016) Testing for the signature of policy in online communities. In: International Workshop on Complex Networks and Their Applications, 41–54. Springer.
De Liddo, A, Sándor Á, Shum SB (2012) Contested collective intelligence: Rationale, technologies, and a human-machine annotation study. Comput Supported Coop Work (CSCW) 21(4–5): 417–448.
Diplaris, S, Sonnenbichler A, Kaczanowski T, Mylonas P, Scherp A, Janik M, Papadopoulos S, Ovelgoenne M, Kompatsiaris Y (2011) Emerging, Collective Intelligence for Personal, Organisational and Social Use. In: Bessis N, Xhafa F (eds), 527–573. Springer, Berlin.
Dorogovtsev, SN, Mendes JFF (2002) Evolution of networks. Adv Phys 51(4): 1079–1187.
Grabowicz, PA, Aiello LM, Eguíluz VM, Jaimes A (2013) Distinguishing topical and social groups based on common identity and bond theory. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, 627–636. ACM, New York.
Hodas, NO, Lerman K (2014) The simple rules of social contagion. Sci Rep 4: 4343. doi:10.1038/srep04343.
Java, A, Song X, Finin T, Tseng B (2007) Why we twitter: understanding microblogging usage and communities. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, 56–65. ACM, New York.
Klein, M (2012) Enabling large-scale deliberation using attention-mediation metrics. Comput Supported Coop Work (CSCW) 21(4–5): 449–473.
Kraut, RE, Resnick P, Kiesler S, Burke M, Chen Y, Kittur N, Konstan J, Ren Y, Riedl J (2012) Building Successful Online Communities: Evidence-based Social Design. MIT Press, Cambridge.
Kunegis, J, Blattner M, Moser C (2013) Preferential attachment in online networks: measurement and explanations. In: Proceedings of the 5th Annual ACM Web Science Conference, 205–214. ACM, New York.
Laniado, D, Tasso R, Volkovich Y, Kaltenbrunner A (2011) When the wikipedians talk: Network and tree structure of wikipedia discussion pages. In: ICWSM. AAAI Press, Palo Alto.
Leskovec, J, Horvitz E (2008) Planetary-scale views on a large instant-messaging network. In: Proceedings of the 17th International Conference on World Wide Web, 915–924. ACM.
Lewis, K, Kaufman J, Gonzalez M, Wimmer A, Christakis N (2008) Tastes, ties, and time: A new social network dataset using facebook.com. Soc Networks 30(4): 330–342.
Nick, B (2013) Toward a better understanding of evolving social networks. PhD thesis.
Pierre, L (1997) Collective intelligence: Mankind’s emerging world in cyberspace. Perseus Books, Cambridge.
Rheingold, H (1993) The Virtual Community: Homesteading on the Electronic Frontier. MIT Press, Cambridge.
Shirky, C (2008) Here Comes Everybody: The Power of Organizing Without Organizations. Penguin, London.
Shum, SB (2003) The roots of computer supported argument visualization. In: Visualizing Argumentation, 3–24. Springer, London.
Slegg, J (2014) Facebook News Feed Algorithm Change Reduces Visibility of Page Updates. http://searchenginewatch.com/sew/news/2324814/facebooknewsfeedalgorithmtweakreducesvisibilityofpageupdates. Accessed 13 Aug 2017.
Snijders, TA (1996) Stochastic actor-oriented models for network change. J Math Sociol 21(1–2): 149–172.
Tapscott, D, Williams AD (2008) Wikinomics: How Mass Collaboration Changes Everything. Penguin, London.
Zanetti, MS, Sarigol E, Scholtes I, Tessone CJ, Schweitzer F (2012) A quantitative study of social organisation in open source software communities. In: Jones AV (ed) 2012 Imperial College Computing Student Workshop, 116–122. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany. http://drops.dagstuhl.de/opus/volltexte/2012/3774.
Zhang, J, Ackerman MS, Adamic L (2007) Expertise networks in online communities: structure and algorithms. In: Proceedings of the 16th International Conference on World Wide Web, 221–230. ACM, New York.
Acknowledgement
The authors gratefully acknowledge the invaluable contributions of Giovanni Ponti, Raffaele Miniaci, Noemi Salantiu, LeeSean Huang, the faculty and students at University of Alicante and everybody at Masters of Networks 3. We also acknowledge support of the OPENCARE and CATALYST European projects.
Funding
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement no 688670).
Author information
Contributions
AC designed the study and carried out the statistical analyses. AC drafted the manuscript with the help of GM and BR. GM and BR implemented the algorithm generating artificial networks (with or without onboarding), and ran the parallelization of the algorithms carrying out the experiment and producing statistical data. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Alberto Cottica.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Availability of data and materials
Code and data can be found at https://github.com/albertocottica/communitiesnetworkdesign/ (last access 21/06/2017).
Authors’ information
AC is a co-founder of Edgeryders and an economist finishing his PhD in economics at the University of Alicante, Spain. His research interests co-evolve with his role as community manager in diverse social environments.
GM is full professor at the Computer Science department, Université de Bordeaux, France. His research activities develop around visual network analysis, with a strong interest in multidisciplinary projects with social scientists.
BR is a postdoctoral researcher at the National Institute of Informatics (NII), Japan, and at the CNRS UMI 3527 Japanese-French Laboratory for Informatics (JFLI), Japan. He was a research engineer at the National Audiovisual Institute (Ina) in Paris, France, from 2009 to 2012, and received his PhD in 2014 from the University of Bordeaux, France. His research interests cover network analysis and visualization, and multimedia analytics.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Keywords
 Collective intelligence
 Online communities
 Network structure