
Penalized inference of the hematopoietic cell differentiation network via high-dimensional clonal tracking



During their lifespan, stem- or progenitor cells have the ability to differentiate into more committed cell lineages. Understanding this process can be key in treating certain diseases. However, up until now only limited information about the cell differentiation process is known.


The goal of this paper is to present a statistical framework able to describe the cell differentiation process at the single clone level and to provide a corresponding inferential procedure for parameter estimation and structure reconstruction of the differentiation network.


We propose a multidimensional, continuous-time Markov model with density-dependent transition probabilities linear in sub-population sizes and rates. The inferential procedure is based on an iterative calculation of approximated solutions for two systems of ordinary differential equations, describing process moments evolution over time, that are analytically derived from the process’ master equation. Network sparsity is induced by adding a SCAD-based penalization term in the generalized least squares objective function.


The methods proposed here have been tested by means of a simulation study and then applied to a data set derived from a gene therapy clinical trial, in order to investigate hematopoiesis in humans, in-vivo. The hematopoietic structure estimated contradicts the classical dichotomy theory of cell differentiation and supports a novel myeloid-based model recently proposed in the literature.


Over the past decade, gene therapy (GT) has proved its potential as a next-generation therapy for many diseases that were untreatable by conventional therapies (Naldini 2011). GT can be used to treat cellular defects due to a mutated gene by providing a fully functional copy of it or by equipping target cells with a new cellular function through genetic engineering. Most clinical approaches are based on the delivery of exogenous DNA molecules by viral vectors using retrovirus- or lentivirus-derived systems. The advent of next-generation sequencing (NGS) platforms (i.e. Roche/454 pyrosequencing and Illumina sequencing technology) substantially improved the accuracy and resolution of viral integration site (IS) analyses (Biasco et al. 2011). This technological progress has resulted in the availability of IS data from single transduction experiments that are two orders of magnitude denser than previously possible. As a result, IS research has diversified from characterizing the mechanisms driving the virus integration process and its interactions with the host cell genome to investigating other biological questions. For example, retrovirus IS distribution over the genome has been used as an indicator of active gene enhancers and regulatory regions, involved in hematopoietic stem cell commitment (Romano et al. 2016) or as a tool to follow individual cell fate in-vivo (Biasco et al. 2016; Scala et al. 2018).

In clinical settings, GT has been successfully used, for example, to treat hematological diseases such as Wiskott-Aldrich Syndrome (WAS), an inherited immunodeficiency caused by mutations in the gene encoding WASP protein (Aiuti et al. 2013). In this context, to ensure life-long curative potential and limit possible treatment side effects, hematopoietic stem/progenitor cells (HSC) are harvested from patients’ bone marrow (BM), corrected by means of virus-based manipulation and then re-infused into patients. Treated cells acquire a unique label, represented by IS genomic coordinates, and this label is inherited by all cellular offspring generated by either duplication or differentiation into more committed cell types. In other words, IS can be used as a molecular marker to track individual HSCs and their evolution.

The set of cells, among all lineages under investigation, sharing a specific genomic marker, and therefore deriving from a common HSC ancestor, is defined as a clone. The analysis of in-vivo clone evolution by means of periodic IS analysis performed on patients’ BM or peripheral blood (PB) samples is called clonal tracking. In the experimental data analysed in this paper, 15 different cell sub-populations (also called cell types or lineages), distributed along the hematopoietic hierarchy, have been collected from three patients affected by WAS during their first three years after GT treatment. Lineage-specific population sizes have been measured by means of read count values (Biasco et al. 2016), returned by NGS platforms. Given the number of lineages, samples and patients, this study provides a unique opportunity to reveal novel insights into human hematopoiesis.

In the literature, various mathematical approaches for the quantitative analysis of hematopoiesis have been proposed. For example, stochastic models for simplified hierarchical structures reduced to two categories, stem cells reserve and contributing clones, have been developed in (Abkowitz et al. 1990; Catlin et al. 2001) and in (Becker et al. 1963) for cat and mouse models, respectively. In (Marciniak-Czochra and Stiehl 2013) the authors proposed a more complex multi-compartment model described by a set of deterministic functions, aimed at evaluating the mechanisms of regulation governing reconstitution after HSC transplantation in humans. However, in all these approaches, the authors compared how known and alternative hierarchies support various types of experimental data, rather than making hierarchy estimation a goal of the inferential procedure itself. In this respect, an interesting discovery-oriented approach applied to GT for WAS data has been recently proposed in (Scala et al. 2018). By means of additive Bayesian network modeling of IS detection, simultaneous structural learning and association estimation have been performed, in order to investigate differences in lineage dependence at early and late phases after treatment. Dichotomizing IS measurements is motivated by the necessity to alleviate the noisy nature of IS analysis, due to technical factors such as DNA amplification and sequencing, but it also discards valuable information about clone size dynamics. Standard Bayesian network algorithms are in principle capable of both parameter inference and model structure learning. These methods are based on modeling conditional independence among nodes, usually using a contingency table parametrization in the case of binary variables. In the context of clonal tracking data, the results derived from such modeling approaches are difficult to interpret from a biological perspective.
In general, the metrics adopted by learning methods, such as mutual information, make no distinction between positive and negative association. However, the two have opposite biological interpretations: a positive dependence suggests a differentiation path connecting two nodes, whereas a negative one supports the hypothesis that the nodes belong to alternative differentiation branches. In addition, coefficients measuring dependence strength are difficult to compare with each other and are particularly sensitive to the substantial stochastic variation of the process itself and to other effects, such as saturation.

The goals of this paper are to propose a statistical framework able to model the cell differentiation process (CDP) measured at single clone resolution, to provide an inferential procedure able to perform both process parameters estimation and model reconstruction and finally, to investigate the hematopoietic process in humans. Our proposal derives from the definition of a novel generative stochastic process for clonal tracking data, defined over a network of lineages (nodes). The model is able to properly address the stochastic nature of the cell differentiation process and given a specific setting for the network parameters, generate complete evolution of clone size dynamics among all lineages. Although the application in this paper is clearly geared towards the cell differentiation process, the methodology underlying the analysis is completely general and can be applied to any stochastic process that involves differentiation, replication and extinction, such as political systems, corporate organizational development or even insect colonies.

A description of the continuous-time, density-dependent Markov model for CDP, along with the underlying assumptions, is given in “Stochastic cell differentiation model” section. In “Approximate generalized method-of-moments estimation” section an efficient generalized least squares estimation procedure, relying on a first-order Euler’s method approximation for the evolution of first and second order process moments, is derived. In “Model selection” section a sparsity-inducing penalty term is incorporated in the estimation procedure in order to reconstruct the differentiation structure of the systems under investigation. An overview of the inference algorithm is given in “Schematic overview of the inferential procedure” section. In “Simulation study” section the performance of our proposal is verified by means of a simulation study. In “Investigating human hematopoiesis in vivo” section the experimental data previously mentioned are described in more detail and then analyzed. Finally, our findings are discussed in “Discussion” section. “Conclusion” section is dedicated to final considerations, possible extensions and future directions to improve the methodology.

Stochastic cell differentiation model

For each clone or integration site l, a CDP can be defined as an N-dimensional Markov process, \(\boldsymbol {X}^{l}(t)\), such that each element \({X^{l}_{i}}(t)\) of \(\boldsymbol {X}^{l}(t)=\left (X^{l}_{1}(t), \dots, X^{l}_{N}(t)\right)\) corresponds to the number of cells (counts) of type (or lineage) \(C_{i}, i=1,\dots,N\), present in clone l at time t. For notational convenience, we will drop the explicit dependence on each individual clone l in this section, as clones can be thought of as independent copies of each other. Given an initial state vector, \(\boldsymbol {x}_{0}\), the process evolves according to a random sequence of events, divided into three categories: duplications, deaths and differentiations. Individual cells are assumed to be independent of each other and cells belonging to the same lineage are assumed to obey the same law. Single event rates are assumed to be non-negative and constant over time. A graphical representation of cellular events is available in Fig. 1a and a detailed description follows.

Fig. 1

Cellular events and network representation of a fully connected 3-dimensional CDP. a Process configuration modification associated with duplication (top), death (middle) and differentiation (bottom) of a type 1 cell (circled in yellow). b Each node represents a lineage, arrows represent cell events and labels give the corresponding event rates. Each lineage has two self-referring arrows, labelled with αi and δi, corresponding to cell duplication and death respectively. Connections between nodes are differentiation paths

Cell duplication: \(1C_{i} \xrightarrow {\alpha _{i}} 2C_{i}\)

The net effect, or process state change induced by the duplication of a cell of type Ci is the increment, of one unit, of Ci population size. Duplication rates, α=(αi,i=1,…,N), correspond approximately to the probability that a generic cell of type Ci undergoes duplication, in a time unit. The transition probability associated to a duplication event in lineage Ci occurring in time interval [t,t+Δt) for process X(t) being in state xt, is given by:

$$P(X_{i}(t+\Delta t)=x_{i,t}+1 | X_{i}(t)=x_{i,t}) \approx x_{i,t} \alpha_{i}\Delta t. $$

Cell death: \(C_{i}\xrightarrow {\delta _{i}} \emptyset \)

The net effect corresponding to a single death event is the decrease of one unit in the Ci population size. Death rates in vector δ=(δi,i=1,…,N), are the probabilities that a generic cell of type Ci dies, in a time unit. The transition probability associated to the generic death event in a time interval [t,t+Δt) is:

$$P(X_{i}(t+\Delta t)=x_{i,t}-1 | X_{i}(t)=x_{i,t}) \approx x_{i,t} \delta_{i}\Delta t. $$

Cell differentiation: \(C_{i} \xrightarrow {\lambda _{i,j}} C_{j}\)

According to the biological literature, it is possible to distinguish between two different models of differentiation: asymmetric cell division and signalling induced differentiation. In the first case, cell division gives rise to two daughter cells with distinct features and fates. In the second case, differentiation is a process induced by a set of cell-to-cell signals, leading to conformational and receptor modifications, and is not coupled with a duplication event. The kind of differentiation considered in this work consists in the transition of a single cell from lineage \(C_{i}\) to lineage \(C_{j}\), and is equivalent to assuming signalling induced differentiation. Similarly to duplication and death events, event rates \(\boldsymbol {\lambda }=(\lambda _{i,j}, i,j=1,\dots,N, i\neq j)\) measure the probabilities that a generic cell of type \(C_{i}\) undergoes differentiation into \(C_{j}\), in a time unit. The transition probability associated with a single differentiation event \(C_{i} \rightarrow C_{j}\) in the time interval [t,t+Δt), for a system being in state \(\boldsymbol {x}_{t}\), is given by:

$$P(X_{i}(t+\Delta t)= x_{i,t}-1, X_{j}(t+\Delta t)=x_{j,t}+1 | X_{i}(t)= x_{i,t}, X_{j}(t)=x_{j,t}) \approx x_{i,t} \lambda_{i,j}\Delta t. $$

In a CDP involving N lineages, the complexity of the system depends on the number of positive differentiation rates. This could vary from a minimum of 0 up to a maximum of N(N−1) in a fully interconnected system.
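The three event types above fully specify how a clone evolves, so a trajectory can be simulated event by event. The sketch below uses Gillespie's direct method, a standard way to sample continuous-time Markov jump processes. The source does not state which simulation algorithm the authors use, so the choice of method, the function name `simulate_cdp` and the 0-based lineage indices are assumptions made for illustration only.

```python
import random


def simulate_cdp(x0, alpha, delta, lam, t_end, rng):
    """Simulate one clone up to time t_end with Gillespie's direct method.

    x0    : initial cell counts, one entry per lineage
    alpha : per-cell duplication rates, one entry per lineage
    delta : per-cell death rates, one entry per lineage
    lam   : dict {(i, j): rate} for differentiation C_i -> C_j (0-based)
    rng   : a random.Random instance
    Returns the list of cell counts at time t_end.
    """
    x = list(x0)
    t = 0.0
    while True:
        # Each event is (rate, list of (lineage, change)); rates are linear
        # in the size of the lineage the event originates from.
        events = []
        for i, xi in enumerate(x):
            events.append((xi * alpha[i], [(i, +1)]))   # duplication
            events.append((xi * delta[i], [(i, -1)]))   # death
        for (i, j), rate in lam.items():
            events.append((x[i] * rate, [(i, -1), (j, +1)]))  # differentiation
        total = sum(rate for rate, _ in events)
        if total == 0.0:
            return x                      # no event can occur any more
        t += rng.expovariate(total)       # waiting time to the next event
        if t > t_end:
            return x
        # Pick the event with probability proportional to its rate.
        u = rng.random() * total
        acc = 0.0
        for rate, change in events:
            acc += rate
            if u < acc:
                for lineage, step in change:
                    x[lineage] += step
                break
```

Because an empty lineage has zero event rate, counts can never become negative, and when differentiation is the only active event the total clone size is conserved.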

Readers familiar with the literature on modelling systems of coupled biochemical reactions may recognize similarities between those models and the presentation made so far for the CDP. Pursuing this parallelism, the set of cell events can be interpreted as first-order mass-action kinetics (reactions), whilst the transition probabilities per time unit, \(X_{i}(t)\alpha _{i}\), \(X_{i}(t)\delta _{i}\) and \(X_{i}(t)\lambda _{i,j}\), correspond to the propensity functions (Wilkinson 2006; Purutcuoglu and Wit 2008). Defining θ=(α,δ,λ) as the r-dimensional column vector of parameters, the set of propensity functions can be expressed in compact matrix notation as an r-dimensional column vector, obtained from the product D(X(t))θ, where D(X(t)) is an r×r diagonal matrix whose diagonal replicates the elements of vector X(t), one entry per event. Finally, the state changes associated with cell events can be recast into a net effect or stoichiometric matrix, V, an r×N integer matrix whose k-th row is the state change caused by event k.

In order to clarify the elements just introduced, we present a derivation for a fully connected CDP of size N=3, which is graphically represented in Fig. 1b. In total, r=3+3+(3×2)=12 distinct cell events can be defined and accordingly, the parameter vector is given by: \(\boldsymbol {\theta }=(\alpha _{1},\alpha _{2},\alpha _{3},\delta _{1},\delta _{2},\delta _{3},\lambda _{2,1},\lambda _{3,1},\lambda _{1,2},\lambda _{3,2},\lambda _{1,3},\lambda _{2,3})^{\intercal }\). The net effect matrix is: \(\boldsymbol {V}=\left [\begin {array}{c} \boldsymbol {V}_{{\text {dupl}}} \\ \boldsymbol {V}_{{\text {death}}} \\ \boldsymbol {V}_{{\text {diff}}} \\ \end {array}\right ]\) where

$$\boldsymbol{V}_{{\text{dupl}}} = \left[\begin{array}{ccc}1 & 0 & 0 \\0 & 1 & 0 \\0 & 0 & 1 \\ \end{array}\right];~ \boldsymbol{V}_{{\text{death}}}= \left[\begin{array}{ccc}-1 & 0 & 0 \\0 & -1 & 0 \\0 & 0 & -1 \\ \end{array}\right];~\boldsymbol{V}_{{\text{diff}}}= \left[\begin{array}{ccc}1 & -1 & 0 \\ 1 & 0 & -1 \\-1 & 1 & 0 \\ 0 & 1 & -1 \\ -1 & 0 & 1 \\ 0 & -1 & 1 \\ \end{array}\right] $$

and D(X)=diag(X1,X2,X3,X1,X2,X3,X2,X3,X1,X3,X1,X2). Because X(t) has been defined as a Markov process, given an initial condition vector, the Kolmogorov forward equations determine the time evolution of the process probability distribution P(X(t),t). An alternative and equivalent formulation of the Kolmogorov equations, widely used for modelling physical and chemical dynamic systems, is known as the master equation (Kampen 1981; Risken 1984; Gardiner 1985).
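The net effect matrix V and the propensity vector D(X)θ of the N=3 example follow a regular pattern, so they can be assembled programmatically for any N. The following is a minimal sketch under the event ordering used in θ above (duplications, then deaths, then differentiations grouped by target lineage); the helper names `build_cdp` and `propensities` are hypothetical and lineages are indexed from 0.

```python
def build_cdp(n):
    """Return (V, source) for a fully connected CDP with n lineages.

    V is the r x n net effect matrix, one row per event.
    source[k] is the lineage whose count multiplies parameter k,
    i.e. the k-th diagonal entry of D(X) is X[source[k]].
    """
    V, source = [], []
    for sign in (+1, -1):                 # duplications first, then deaths
        for i in range(n):
            row = [0] * n
            row[i] = sign
            V.append(row)
            source.append(i)
    for j in range(n):                    # differentiation C_i -> C_j,
        for i in range(n):                # grouped by target lineage j
            if i != j:
                row = [0] * n
                row[i] = -1               # one cell leaves lineage i
                row[j] = +1               # and enters lineage j
                V.append(row)
                source.append(i)
    return V, source


def propensities(x, theta, source):
    """Return the vector D(x) theta of event rates in state x."""
    return [x[source[k]] * theta[k] for k in range(len(theta))]


V, source = build_cdp(3)
```

For n=3 this gives r=12 events; the row for λ2,3 is (0, -1, 1), since that event removes one cell from lineage 2 and adds one to lineage 3.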

For the CDP described in this paper, the master equation is given by:

$$\begin{array}{*{20}l} \frac{dP(\boldsymbol{X}_{t},t)}{dt}= & \sum_{k=1}^{r} \left\{\left[D({\boldsymbol{X}_{t}-\boldsymbol{V}_{\boldsymbol{k,\cdot}}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}_{t}-\boldsymbol{V}_{\boldsymbol{k,\cdot}},t\right) - \left[D({\boldsymbol{X}_{t}})\boldsymbol{\theta}\right]_{k} P \left(\boldsymbol{X}_{t},t\right)\right\} \end{array} $$

The solution of (1) involves the evaluation of the evolution of P(X(t),t) over the whole set of admissible configurations for process X(t). Clearly, for systems of realistic size and complexity this does not represent a feasible option. However, starting from (1), important information about the dynamics of characteristic statistical features of the system can be obtained. In particular, as shown in Appendix 1, two coupled sets of ordinary differential equations (ODEs) are derived, describing the time evolution of lineage population size averages, \(\mathrm {E}[X_{i}(t)], i=1,\dots,N\), and variances-covariances \(\Sigma _{X_{i},X_{j}}(t), i,j= 1, \dots,N\):

$$\begin{array}{*{20}l} \frac{d \mathrm{E}[\!X_{i}(t)]}{dt}= & \sum_{k=1}^{r} v_{k,i} \left[D(\mathrm{E} \left[{\boldsymbol{X}(t)}\right]) \boldsymbol{\theta} \right]_{k} \end{array} $$

with initial condition:

$$\begin{array}{*{20}l} \mathrm{E}[\!X_{i}(t_{0})]=x_{i,0} \end{array} $$


$$\begin{array}{*{20}l} \frac{d \Sigma_{X_{i},X_{j}}(t) }{dt} = & \sum_{k=1}^{r} v_{k,j} \left[D\left(\mathrm{E} \left[X_{i}(t) {\boldsymbol{X}(t)} \right]\right) \boldsymbol{\theta} \right]_{k} + \sum_{k=1}^{r} v_{k,i} \left[D\left(\mathrm{E} \left[X_{j}(t) {\boldsymbol{X}(t)} \right]\right) \boldsymbol{\theta} \right]_{k} \\ & + \sum_{k=1}^{r} v_{k,i} v_{k,j} \left[D(\mathrm{E} \left[{\boldsymbol{X}(t)}\right]) \boldsymbol{\theta} \right]_{k} - \mathrm{E}[{X_{i}}(t)] \sum_{k=1}^{r} v_{k,j} \left[D(\mathrm{E} \left[{\boldsymbol{X}(t)}\right]) \boldsymbol{\theta} \right]_{k} \\ & - \mathrm{E}[{X_{j}}(t)] \sum_{k=1}^{r} v_{k,i} \left[D(\mathrm{E} \left[{\boldsymbol{X}(t)}\right]) \boldsymbol{\theta} \right]_{k} ; \end{array} $$

with initial conditions:

$$\begin{array}{*{20}l} \mathrm{E}[{X_{i}(t_{0})}]=x_{i,0};~~~~ \mathrm{E}[{X_{i}(t_{0})}{X_{j}(t_{0})}]=x_{i,0}x_{j,0}. \end{array} $$

In the next section, starting from (2) and (3), an inference procedure is presented.


In this section, a two-step inference procedure for parameter estimation and model selection is described. Most studies that address biological questions through clonal tracking experiments provide information about the simultaneous evolution of several clones, observed at a limited set of time points. Suppose that in a single experiment a total of L clones have been tracked, with S observations at times \(t_{1}<\dots <t_{S}\). Each clone trajectory \(\boldsymbol {x}^{\boldsymbol {l}} = \left (\boldsymbol {x}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {1}}}, \dots, \boldsymbol {x}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {S}}}\right)\), with l=1,…,L, is then treated as an independent realization of a unique, common CDP. Conditioning on the generic \(\boldsymbol {x}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s}}}\), the mean and variance-covariance values for lineage counts at time \(t_{s+1}\) can be obtained by solving (2) and (3) with the following initial conditions:

$$\begin{array}{*{20}l} \mathrm{E}\left[X^{l}_{i}(t_{s})\right]=x^{l}_{i,{t_{s}}} ;~~~~~~ \mathrm{E}\left[X^{l}_{i}(t_{s}){X^{l}_{j}(t_{s})}\right]=x^{l}_{i,{t_{s}}} x^{l}_{j,{t_{s}}} \end{array} $$

A computationally efficient, albeit approximated, solution for \(\mathrm {E}\left [X^{l}_{i}(t_{s+1})\right ]\) and \(\Sigma _{X^{l}_{i},X^{l}_{j}}(t_{s+1})\) can be calculated by Euler’s method. Accordingly, for the lineage specific mean population count, the following expression is derived:

$$\begin{array}{*{20}l} \mathrm{E}\left[{X^{l}_{i}(t_{s+1})}\right] \simeq \mathrm{E}\left[{X^{l}_{i}(t_{s})}\right]+ \frac{d \mathrm{E}\left[{X^{l}_{i}(t_{s})}\right]}{dt} [t_{s+1}-t_{s}]. \end{array} $$

By substituting the second term on the right-hand-side (RHS) of (5) with (2) and including initial condition defined in (4), equation (5) becomes:

$$\begin{array}{*{20}l} \mathrm{E}\left[{X^{l}_{i}(t_{s+1})}| \boldsymbol{X}^{\boldsymbol{l}}(t_{s})=\boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}\right] \simeq x^{l}_{i,{t_{s}}}+ \sum_{k=1}^{r} v_{k,i} \cdot \left[D\left({\boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}}\right) \cdot \boldsymbol{\theta} \right]_{k} \cdot [t_{s+1}-t_{s}] \end{array} $$

or, in an equivalent matrix notation:

$$\begin{array}{*{20}l} \mathrm{E}\left[\boldsymbol{X}^{\boldsymbol{l}}(t_{s+1})\right] \simeq \boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}+ \boldsymbol{V}^{\boldsymbol{\intercal}} \cdot D\left(\boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}\right) \cdot \boldsymbol{\theta} \cdot [t_{s+1}-t_{s}]. \end{array} $$

A similar reasoning can be applied to approximate the solutions for variance-covariance indexes, \(\Sigma _{X^{l}_{i},X^{l}_{j}}(t_{s+1})\), introduced in (3):

$$\begin{array}{*{20}l} \Sigma_{X^{l}_{i},X^{l}_{j}}(t_{s+1}) \simeq \Sigma_{X^{l}_{i},X^{l}_{j}}(t_{s}) + \frac{d \Sigma_{X^{l}_{i},X^{l}_{j}}(t_{s}) }{dt} [t_{s+1}-t_{s}]. \end{array} $$

The first term on the RHS of (7) equals 0, by the definition of variances-covariances as second central moments and the initial conditions in (4), while the second term simplifies to:

$$\begin{array}{*{20}l} \frac{d \Sigma_{X^{l}_{i},X^{l}_{j}}(t_{s}) }{dt} = & \sum_{k=1}^{r} v_{k,i} \cdot v_{k,j} \cdot \left[D\left({\boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}}\right) \cdot \boldsymbol{\theta} \right]_{k} \end{array} $$

or, in matrix notation:

$$\begin{array}{*{20}l} \boldsymbol{\Sigma}_{\boldsymbol{X}^{\boldsymbol{l}}}(t_{s+1}) \simeq \boldsymbol{V}^{\boldsymbol{\intercal}} \cdot D\left(\boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}\right) \cdot Diag\left(\boldsymbol{\theta}\right) \cdot \boldsymbol{V} \cdot [t_{s+1}-t_{s}]. \end{array} $$

Let us now define \(\boldsymbol {\Delta } \boldsymbol {X}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s}}}\) as:

$$\begin{array}{*{20}l} \boldsymbol{\Delta} \boldsymbol{X}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}=\boldsymbol{X}^{\boldsymbol{l}}(t_{s+1})-\boldsymbol{X}^{\boldsymbol{l}}(t_{s}) \end{array} $$

a collection of N-dimensional r.v. modelling the state increment occurring in the time interval \([t_{s}, t_{s+1}]\). It is straightforward to show, by linearity of the expectation operator, that:

$$\begin{array}{*{20}l} \mathrm{E}\left[\boldsymbol{\Delta} \boldsymbol{X}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}~|~ \boldsymbol{X}^{\boldsymbol{l}}(t_{s})=\boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}\right] \simeq \boldsymbol{V}^{\boldsymbol{\intercal}} \cdot D\left(\boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}\right) \cdot \boldsymbol{\theta} \cdot [t_{s+1}-t_{s}] \end{array} $$

and, by the invariance of the variance-covariance matrix under shifts in location:

$$\begin{array}{*{20}l} \boldsymbol{\Sigma}_{\boldsymbol{\Delta} \boldsymbol{X}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}|\boldsymbol{X}^{\boldsymbol{l}}(t_{s})=\boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}} \simeq \boldsymbol{V}^{\boldsymbol{\intercal}} \cdot D\left(\boldsymbol{x}^{\boldsymbol{l}}_{\boldsymbol{t}_{\boldsymbol{s}}}\right) \cdot Diag\left(\boldsymbol{\theta}\right) \cdot \boldsymbol{V} \cdot [t_{s+1}-t_{s}]. \end{array} $$
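The two first-order approximations for the conditional mean and covariance of the increment are simple sums over events, which the following sketch evaluates directly. It is only an illustration: the function name `increment_moments`, the `source` list encoding the diagonal of D(x), and the two-lineage rates are hypothetical and not taken from the paper.

```python
def increment_moments(x, theta, V, source, dt):
    """Euler approximation of E[dX | x] and Cov[dX | x] over a step dt.

    x      : current cell counts per lineage (length N)
    theta  : event rates (length r)
    V      : r x N net effect matrix (row k = state change of event k)
    source : source[k] is the lineage whose count scales event k, so the
             k-th diagonal entry of D(x) is x[source[k]]
    """
    n, r = len(x), len(theta)
    a = [x[source[k]] * theta[k] for k in range(r)]      # D(x) theta
    # Mean: V^T D(x) theta dt
    mean = [dt * sum(V[k][i] * a[k] for k in range(r)) for i in range(n)]
    # Covariance: V^T D(x) Diag(theta) V dt
    cov = [[dt * sum(V[k][i] * V[k][j] * a[k] for k in range(r))
            for j in range(n)] for i in range(n)]
    return mean, cov


# Hypothetical two-lineage system: duplication and death in both lineages
# plus a single differentiation path C_1 -> C_2.
V = [[1, 0], [0, 1], [-1, 0], [0, -1], [-1, 1]]
source = [0, 1, 0, 1, 0]
theta = [0.2, 0.1, 0.05, 0.1, 0.3]  # alpha_1, alpha_2, delta_1, delta_2, lambda_{1,2}
mean, cov = increment_moments([100, 20], theta, V, source, dt=0.5)
```

In this example the differentiation event is the only one that changes both lineages, so it alone generates the negative off-diagonal covariance, while every event contributes to the variances.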

The piece-wise constant nature of the propensity functions over time allows some considerations concerning the quality of the approximation provided by Euler’s method in (10) and (11). In the time elapsing between consecutive cellular events, the propensity functions do not vary, since they depend only on the current process configuration and on the parameters θ, assumed constant over time. Their values can change only when a cellular event occurs. It follows that if the set of times t1,…,tS corresponds to the sequence of event times, the probabilities \(D\left (\boldsymbol {x}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s}}}\right) \cdot \boldsymbol {\theta } \cdot [t_{s+1}-t_{s}]\) associated with the set of possible state variations are constant as well. It is thus possible to consider the state increment as a regular discrete random variable and conclude that, under this specific sampling scheme, (10) and (11) are exact results rather than approximations. When the time between consecutive observations increases, it becomes more likely that more than one event occurs between them, decreasing the quality of the approximation. The impact of the sampling time intervals will be investigated by means of a simulation study in “Simulation study” section.

Approximate generalized method-of-moments estimation

Formulas (10) and (11) state that the first-order approximations for both the conditional expectation and the variance-covariance of the increments can be calculated as linear combinations of the propensity functions at time \(t_{s}\), which are in turn linear with respect to both the process state and the parameter vector. This result suggests estimating θ through a linear regression problem of type

$$ \boldsymbol{dx}\simeq\boldsymbol{M} \boldsymbol{\theta} + \boldsymbol{\varepsilon}; ~~~\mathrm{E}[\boldsymbol{\varepsilon}]=\boldsymbol{0};~~~ \boldsymbol{\Sigma}_{\boldsymbol{\varepsilon}}=\boldsymbol{W} $$


$$ \boldsymbol{dx}=\left[\begin{array}{c} \boldsymbol{dx}^{\boldsymbol{1}}_{\boldsymbol{t}_{\boldsymbol{0}}} \\ \boldsymbol{dx}^{\boldsymbol{1}}_{\boldsymbol{t}_{\boldsymbol{1}}}\\ \vdots \\ \boldsymbol{dx}^{\boldsymbol{L}}_{\boldsymbol{t}_{\boldsymbol{S-1}}} \\ \end{array} \right] ~ \boldsymbol{M}=\left[\begin{array}{c} \boldsymbol{M}^{\boldsymbol{1}}_{\boldsymbol{t}_{\boldsymbol{0}}} \\ \boldsymbol{M}^{\boldsymbol{1}}_{\boldsymbol{t}_{\boldsymbol{1}}}\\ \vdots \\ \boldsymbol{M}^{\boldsymbol{L}}_{\boldsymbol{t}_{\boldsymbol{S-1}}} \\ \end{array} \right] ~\text{and} ~ \boldsymbol{W}= \left[\begin{array}{cccc} \boldsymbol{W}^{\boldsymbol{1}}_{\boldsymbol{t}_{\boldsymbol{0}}} & 0 & \dots & 0 \\ 0 & \boldsymbol{W}^{\boldsymbol{1}}_{\boldsymbol{t}_{\boldsymbol{1}}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \boldsymbol{W}^{\boldsymbol{L}}_{\boldsymbol{t}_{\boldsymbol{S-1}}} \\ \end{array}\right], $$

In these definitions, (i) dx is a [L·S·N]-dimensional column vector in which the generic element \(\boldsymbol {dx}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s}}}=\text {vec}(\boldsymbol {x}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s+1}}}-\boldsymbol {x}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s}}})\) is an N-dimensional column vector of observed increments in cell counts, interpretable as a realization of the r.v. defined in (9); (ii) M is a [L·S·N]×r predictor matrix in which the generic element \(\boldsymbol {M}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s}}}=\boldsymbol {V}^{\boldsymbol {\intercal }} \cdot D(\boldsymbol {x}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s}}}) \cdot [t_{s+1}-t_{s}] \) is an N×r matrix; (iii) the covariance matrix W is a block diagonal matrix describing the dependence between lineage count increments belonging to the same clone and time point, and the independence among all others. Each block \(\boldsymbol {W}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s}}}\) is an N×N matrix and is approximated by (11) as \(\boldsymbol {V}^{\boldsymbol {\intercal }} \cdot D(\boldsymbol {x}^{\boldsymbol {l}}_{\boldsymbol {t}_{\boldsymbol {s}}}) \cdot \text {Diag}\left (\boldsymbol {\theta }\right) \cdot \boldsymbol {V} \cdot [t_{s+1}-t_{s}]\).

Both duplication and death events involve only a single lineage, and the associated columns of M come in pairs that are numerically equal but of opposite sign, causing rank deficiency. To overcome this issue, a new set of parameters, \(\boldsymbol {\gamma }=\boldsymbol {\alpha }-\boldsymbol {\delta }\), named net duplication rates, is defined. Unlike the other parameters, the elements of γ take values in \(\mathbb {R}\). In particular, \(\gamma _{i}\) is positive if cells in lineage \(C_{i}\) duplicate at a higher rate than they die (\(\alpha _{i}>\delta _{i}\)), and negative if the reverse is true (\(\alpha _{i}<\delta _{i}\)). Therefore, the following modifications are needed. The parameter vector θ reduces to \(\boldsymbol {\theta }^{\boldsymbol {*}}=(\boldsymbol {\gamma },\boldsymbol {\lambda })\), an \(r^{*}\)-dimensional vector with \(r^{*}=r-N\); \(\boldsymbol {V}^{\boldsymbol {*}}\) is the reduced net effect matrix of dimension \(r^{*}\times N\); and \(\boldsymbol {M}^{\boldsymbol {*}}\) is a \([L \cdot S \cdot N]\times r^{*}\) matrix equal to M with the columns related to death events removed. Although it is not possible to infer duplication and death rates simultaneously, these modifications have limited impact on the precision of the estimators, since the product \(\boldsymbol {M}^{\boldsymbol {*}}\boldsymbol {\theta }^{\boldsymbol {*}}\) is equal to \(\boldsymbol {M}\boldsymbol {\theta }\).
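The effect of replacing duplication and death rates by net duplication rates can be checked numerically on a single regression block. The sketch below builds one block of the predictor matrix for a hypothetical fully connected two-lineage system, drops the death events, and compares the fitted increments under the two parametrizations; all names and rate values are illustrative assumptions.

```python
def design_block(x, V, source, dt):
    """Return the N x r block V^T D(x) dt for one clone and one time step."""
    return [[V[k][i] * x[source[k]] * dt for k in range(len(V))]
            for i in range(len(x))]


def matvec(A, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


# Event order: alpha_1, alpha_2, delta_1, delta_2, lambda_{2,1}, lambda_{1,2}
V = [[1, 0], [0, 1], [-1, 0], [0, -1], [1, -1], [-1, 1]]
source = [0, 1, 0, 1, 1, 0]
theta = [0.30, 0.10, 0.25, 0.20, 0.05, 0.40]
x, dt = [50, 8], 0.1

# Full block: the duplication and death columns differ only in sign,
# which is what makes the full predictor matrix rank deficient.
M = design_block(x, V, source, dt)

# Reduced block: drop the death events and use gamma = alpha - delta.
keep = [0, 1, 4, 5]
V_star = [V[k] for k in keep]
source_star = [source[k] for k in keep]
theta_star = [theta[0] - theta[2], theta[1] - theta[3], theta[4], theta[5]]
M_star = design_block(x, V_star, source_star, dt)

full = matvec(M, theta)                 # M theta
reduced = matvec(M_star, theta_star)    # M* theta*
```

Both products give the same expected increment, which is why only the difference between duplication and death rates is identifiable from the mean equation.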

It is now possible to define the constrained generalized least squares (CGLS) estimator as:

$$ \hat{\boldsymbol{\theta}^{\boldsymbol{*}}}= {\underset{\boldsymbol{\theta}^{\boldsymbol{*}}}{\arg\min}}\,(\boldsymbol{dx}-\boldsymbol{M}^{\boldsymbol{*}} \boldsymbol{\theta}^{\boldsymbol{*}})^{\intercal} \boldsymbol{W}^{\boldsymbol{*-1}}(\boldsymbol{dx}-\boldsymbol{M}^{\boldsymbol{*}} \boldsymbol{\theta}^{\boldsymbol{*}})~s.t.~\boldsymbol{\lambda} \geq 0 $$

Due to the non-negativity constraint on the differentiation rates λ and the dependence of \(\boldsymbol {W}^{\boldsymbol {*}}\) on the unknown parameters \(\boldsymbol {\theta }^{\boldsymbol {*}}\), a closed-form solution for \(\hat {\boldsymbol {\theta }^{\boldsymbol {*}}}\) is not available. A description of the iterative procedure for solving the CGLS minimization problem is presented in Appendix 2.

Model selection

Biologically meaningful and realistic cell differentiation structures are characterized by a limited number of connections between lineages. From a statistical modelling point of view, this type of result can be encouraged by promoting sparsity in the differentiation parameter vector. Within the class of differentiation rates, it is usually possible to distinguish two types of relations between cell lineages: relations with differentiation rates known to be equal to zero due to the hierarchical cellular structure, and relations that do not violate hierarchical constraints and whose rates can be either zero or positive. The first category includes all rates related to the differentiation of mature lineages into stem/progenitor cells. The associated parameters are simply set to zero a priori. For the second category, a penalization procedure based on the smoothly clipped absolute deviation (SCAD) penalty function (Fan 1997) is applied. SCAD penalization has several advantages over another widely used technique, the least absolute shrinkage and selection operator (LASSO). Although the LASSO has many excellent properties and very efficient implementations, it is a biased estimator. This bias affects in particular parameters that are truly non-zero, and does not disappear when the sample size increases. The SCAD penalty, instead, retains the penalization rate of the LASSO for small coefficients, but continuously relaxes the rate of penalization as the absolute value of the coefficient increases, leading to asymptotically unbiased estimates (Fan and Li 2001). The SCAD penalty function is defined as:

$$ p_{\eta}(\theta)= \left\{\begin{array}{ll} \eta \lvert \theta \rvert, & \text{if}\ 0 \leq \lvert \theta \rvert \leq \eta \\ \frac{(\xi^{2}-1)\eta^{2}-(\lvert \theta \rvert-\xi\eta)^{2}}{2(\xi-1)}, & \text{if}\ \eta \leq \lvert \theta \rvert \leq \xi\eta \\ \frac{(\xi+1)\eta^{2}}{2}, & \text{if}\ \lvert \theta \rvert \geq \xi\eta\end{array}\right. $$

where η>0 and ξ>2. As for many other penalization procedures, the role of the threshold parameters, η and ξ, is to tune the degree of sparseness in the final model and valid setting criteria for them are needed to ensure the accuracy of the estimator. Setting ξ=3.7, the SCAD penalty has been demonstrated to give a satisfactory performance in a variety of variable selection problems (Fan and Li 2001) and it has therefore been adopted in this paper.
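The piecewise definition of the SCAD penalty translates directly into code. The sketch below is a plain transcription of the three branches, with ξ=3.7 as the default; the function name `scad_penalty` is an illustrative choice.

```python
def scad_penalty(theta, eta, xi=3.7):
    """SCAD penalty p_eta(theta) for a single coefficient.

    eta > 0 sets the threshold, xi > 2 controls how quickly the
    penalty flattens out for large coefficients.
    """
    a = abs(theta)
    if a <= eta:
        # LASSO-like linear penalty for small coefficients
        return eta * a
    if a <= xi * eta:
        # Quadratic transition that gradually relaxes the penalty
        return ((xi ** 2 - 1) * eta ** 2 - (a - xi * eta) ** 2) / (2 * (xi - 1))
    # Constant penalty: large coefficients are no longer shrunk
    return (xi + 1) * eta ** 2 / 2
```

The three branches agree at the knots |θ|=η and |θ|=ξη, so the penalty is continuous, and it is constant beyond ξη, which is what removes the shrinkage bias on large coefficients.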

The optimal value of η has been chosen according to a generalized cross-validation (GCV) minimization criterion (Golub et al. 1979; Tibshirani 1996). The GCV is defined as:

$$ {GCV}_{\eta}=\frac{1}{n} \frac{\lVert \boldsymbol{dx}-\boldsymbol{\widehat{dx}} \rVert^{2}}{(1-e/n)^{2}} $$

where \(\boldsymbol {\widehat {dx}}=\boldsymbol {M}^{\boldsymbol {*}} \hat {\boldsymbol {\theta }^{\boldsymbol {*}}}_{\boldsymbol {P}}, e=tr[\boldsymbol {M}^{\boldsymbol {*}}(\boldsymbol {M}^{\boldsymbol {*\intercal }} {\boldsymbol {W}^{\boldsymbol {*-1}}} \boldsymbol {M}^{\boldsymbol {*}}+\boldsymbol {P}_{\boldsymbol {\eta }})^{\boldsymbol {-1}} \boldsymbol {M}^{\boldsymbol {*\intercal }} {\boldsymbol {W}^{\boldsymbol {*-1}}}]\) corresponds to the number of effective parameters and Pη is an r×r parameter penalization matrix described in detail in Appendix 3. Finally, given a particular value for η, the parameter estimates are calculated by minimizing the following objective function:

$$ \hat{\boldsymbol{{\theta}}^{\boldsymbol{*}}}_{\boldsymbol{P}}= {\underset{\boldsymbol{\theta}^{\boldsymbol{*}}}{\arg\min}}\,(\boldsymbol{dx}-\boldsymbol{M}^{\boldsymbol{*}} \boldsymbol{\theta}^{\boldsymbol{*}})^{\intercal} \boldsymbol{W}^{\boldsymbol{*}-1}(\boldsymbol{dx}-\boldsymbol{M}^{\boldsymbol{*}} \boldsymbol{\theta}^{\boldsymbol{*}})+n \sum p_{\eta} (\lambda_{i,j}) ~s.~t.~ \boldsymbol{\lambda} \geq 0 $$

In the penalized CGLS (PCGLS) algorithm described in Appendix 3, the penalization function is included in an iterative procedure able to perform model selection and parameters estimation simultaneously.
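As an illustration of the GCV statistic defined above, the sketch below computes the effective number of parameters e and the GCV score for one candidate η. It is a hedged Python/NumPy sketch, not the authors' code: the function name `gcv_score` is hypothetical, and the penalized estimate `theta_hat`, the covariance matrix `W` and the penalization matrix `P` are assumed to have been produced elsewhere (for example by the PCGLS procedure).

```python
import numpy as np

def gcv_score(dx, M, W, P, theta_hat):
    """Return (GCV, e) for one candidate eta (illustrative helper).

    dx        : (n,) vector of observed increments
    M         : (n, r) regression matrix
    W         : (n, n) covariance matrix of the errors
    P         : (r, r) penalization matrix for the current eta
    theta_hat : (r,) penalized estimate obtained with this eta
    """
    n = dx.shape[0]
    W_inv_M = np.linalg.solve(W, M)      # W^{-1} M without forming W^{-1}
    A = M.T @ W_inv_M + P                # M^T W^{-1} M + P_eta
    # tr[M A^{-1} M^T W^{-1}] = tr[A^{-1} M^T W^{-1} M] by cyclicity of the trace
    e = np.trace(np.linalg.solve(A, M.T @ W_inv_M))
    resid = dx - M @ theta_hat
    gcv = (resid @ resid) / n / (1.0 - e / n) ** 2
    return gcv, e
```

With P set to zero and M of full column rank, e reduces to the number of columns of M, which is a convenient sanity check. Selecting η then amounts to evaluating this score on each candidate value and keeping the minimizer.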

Schematic overview of the inferential procedure

In this section, we outline the whole inferential procedure in pseudo-code notation. The algorithm can be split into two major, consecutive parts that have been introduced in the “Approximate generalized method-of-moments estimation” and “Model selection” sections: CGLS and PCGLS. A detailed description of each of them can be found in Appendices 2 and 3, respectively. Algorithm 1 starts with the calculation of the increments vector dx and the regression matrix M in (13). It receives as input a set η of candidate values for the SCAD tuning parameter. The initial values for the CGLS iterative procedure are calculated by solving a constrained ordinary least squares problem (COLS), in which errors are assumed independent and homoscedastic. The COLS estimates \({\hat {\boldsymbol {\theta }^{\boldsymbol {*0}}}}\) are then used to calculate a first estimate for the covariance matrix \(\hat {\boldsymbol {W}^{\boldsymbol {*}}}\). By means of an iterative procedure, estimates for \({\hat {\boldsymbol {\theta }^{\boldsymbol {*}}}}\) and \(\hat {\boldsymbol {W}^{\boldsymbol {*}}}\) are then sequentially refined, until the convergence criteria are satisfied. The final \(\hat {\boldsymbol {\theta }^{\boldsymbol {*}}}\) returned by the CGLS part is then used as the parameter starting value in the PCGLS procedure, aimed at reconstructing the true, sparse model configuration by shrinking small coefficients to zero. In the PCGLS algorithm, parameter estimates and model structure identification are simultaneously updated. Once the convergence criterion is met, generalized cross-validation (GCV) statistics are calculated as shown in (16) to select the network configuration corresponding to the minimum cross-validation error.

Simulation study

In this section we present a simulation study to evaluate the performance of the inference procedure. The settings used in the simulation study closely correspond to the gene therapy dataset analyzed in the “Investigating human hematopoiesis in vivo” section, using the most recent model of hematopoiesis that has been suggested for non-human primates (Goyal et al. 2015). A simulated hierarchical differentiation process (SHDP) of size N=15 has been designed with 3 hierarchical layers where differentiation paths are only allowed between adjacent levels and in a unidirectional way. The top layer, constituted by a single “stem cell” lineage (node: 1), is characterized by a positive net duplication rate. The middle layer is composed of 7 partially interconnected lineages (nodes: 2-8), all derived from differentiation events occurring in top lineage cells and able to generate bottom level cell types. The net duplication rates in this layer are heterogeneous and can be either positive or negative. Finally, the 7 lineages in the bottom layer (nodes: 9-15) have no differentiation potential and they all die faster than they duplicate. A biological interpretation is that the top layer corresponds to stem cells, responsible for the generation of a set of progenitors specific to the myeloid and lymphoid branches in the BM (second layer). Each progenitor is then able to give rise to a small subset of committed cells, circulating in the PB (third layer) and characterized by a limited lifespan. A graphical representation of the SHDP model is given in Fig. 2 along with a matrix representation of rate intensities, hierarchical constraints and the differentiation rates.

Fig. 2

Configuration of the Simulated Hierarchical Differentiation Process. a Network configuration of the SHDP. The assumed structure mimics a potential model for human hematopoiesis according to both literature and experimental data. Network layout has been generated with a force directed layout. b Parameters heatmap. Main diagonal elements correspond to net duplication rates. Off-diagonal elements correspond to differentiation rates (row → column). Light-Grey entries are differentiation rates fixed to 0 but in the set of the potentially present differentiation paths (67). Dark-Grey entries are excluded from the inferential procedure due to violation of hierarchical constraints (112). The colored boxes correspond to: red, top to middle layer connections; blue, middle layer inter-connections; pink, middle to bottom layer connections

The aims of the simulation study are (i) to verify the performance of the proposed inferential procedure in the case of a hierarchically structured system and (ii) to measure the impact of different sampling time interval lengths on both estimation precision and model selection, i.e., the reconstruction of the true underlying differentiation process. Each experiment is composed of N=1000 simulated clone evolutions, all generated starting from the same initial state vector consisting of a single hematopoietic stem cell. Continuous-time clone dynamics are simulated by means of the Gillespie algorithm (Gillespie 1977; Wilkinson 2006) according to the SHDP configuration in Fig. 2. Process states are then recorded at equispaced time intervals of length dt=(0.1,0.2,0.5,0.7,1), up to the end time-point fixed at tend=4. Based on these fixed time observations, state increment vectors are calculated and given as input to the algorithm described in the “Schematic overview of the inferential procedure” section. In the GCV procedure we consider a sequence of candidate values η from 0.001 to 0.1 with 0.001 step sizes.
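The simulation step described above can be sketched as follows. This is a minimal Python illustration of the Gillespie algorithm for a single clone, not the authors' C++ implementation. It assumes linear propensities (duplication, death and differentiation rates multiplied by the current sub-population size) and assumes that a differentiation event moves one cell from the source to the target lineage. The function name and the two-lineage rate values in the usage lines are hypothetical.

```python
import numpy as np

def gillespie_clone(x0, alpha, delta, lam, dt, t_end, rng):
    """Simulate one clone and record its state at 0, dt, 2*dt, ..., up to t_end.

    alpha, delta : per-lineage duplication and death rates
    lam          : matrix of differentiation rates (row -> column)
    """
    x = np.array(x0, dtype=np.int64)
    n = x.size
    times = np.arange(0.0, t_end + 1e-9, dt)
    out = np.empty((times.size, n), dtype=np.int64)
    src, dst = np.nonzero(lam)               # allowed differentiation paths
    t, k = 0.0, 0
    while k < times.size:
        # Linear propensities: rate times current sub-population size
        rates = np.concatenate([alpha * x, delta * x, lam[src, dst] * x[src]])
        total = rates.sum()
        t_next = t + rng.exponential(1.0 / total) if total > 0 else np.inf
        # The state is constant until the next event: record it at every
        # sampling time that falls before that event
        while k < times.size and times[k] < t_next:
            out[k] = x
            k += 1
        if k == times.size:
            break
        t = t_next
        ev = rng.choice(rates.size, p=rates / total)
        if ev < n:                           # duplication in lineage ev
            x[ev] += 1
        elif ev < 2 * n:                     # death in lineage ev - n
            x[ev - n] -= 1
        else:                                # differentiation src -> dst (assumed cell transfer)
            j = ev - 2 * n
            x[src[j]] -= 1
            x[dst[j]] += 1
    return times, out

# Usage with hypothetical rates for a stem lineage and one progeny lineage
rng = np.random.default_rng(0)
alpha = np.array([1.0, 0.2])
delta = np.array([0.2, 0.25])
lam = np.array([[0.0, 0.1], [0.0, 0.0]])
times, states = gillespie_clone([1, 0], alpha, delta, lam, dt=0.5, t_end=4.0, rng=rng)
increments = np.diff(states, axis=0)         # state increments between sampling times
```

Repeating the call for many clones and stacking the increments gives the kind of input the inferential procedure works on.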

In total, 100 independent experiments have been analyzed, each composed of 1000 simulated clones. Results are graphically represented in two figures. Figure 3 shows the impact of sampling interval length on the ability to correctly estimate process parameters. The performance regarding four specific rates is reported: a positive and a negative net duplication rate, one positive differentiation rate and one absent differentiation coefficient. The distribution of the estimates obtained from the 100 replicates is represented as a boxplot. In Fig. 4 the accuracy in terms of network reconstruction for increasing dt values is summarized. The distributions of two indices, recall and precision, are plotted by means of boxplots. In particular, we focus on verifying how reliable our proposal is in estimating absent connections among lineages. Precision measures the proportion of true differentiations among the identified differentiations, while recall (also known as sensitivity) is the proportion of identified differentiations among the true differentiations.
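The two indices can be computed from the true and estimated rate matrices as in the sketch below. It is an illustrative Python helper, not part of the original analysis. It takes an absent path (λi,j=0) as the positive class, following the focus on absent connections stated above, and restricts the comparison to rates that are actually subject to estimation. The function name, the `estimable` mask and the `tol` argument are assumptions.

```python
import numpy as np

def absent_path_precision_recall(lam_true, lam_hat, estimable, tol=0.0):
    """Precision and recall for the detection of absent differentiation paths.

    lam_true, lam_hat : true and estimated differentiation rate matrices
    estimable         : boolean mask of rates subject to estimation
                        (entries fixed to 0 by hierarchical constraints are False)
    tol               : rates with absolute value <= tol are treated as absent
    """
    true_absent = (np.abs(lam_true) <= tol) & estimable
    est_absent = (np.abs(lam_hat) <= tol) & estimable
    tp = np.sum(true_absent & est_absent)
    # Precision: share of estimated-absent paths that are truly absent
    precision = tp / est_absent.sum() if est_absent.sum() else float("nan")
    # Recall: share of truly absent paths that were estimated as absent
    recall = tp / true_absent.sum() if true_absent.sum() else float("nan")
    return precision, recall
```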

Fig. 3

Distribution of parameter estimates for increasing dt values. For each experiment included in the SHDP simulation study the inferential procedure has been applied by setting the dt value to (0.1, 0.2, 0.5, 0.7, 1). Each boxplot describes the distribution of 100 independent estimates obtained for a specific parameter/dt pair. Four parameters are shown: a) a positive net duplication rate, γ1 (0.8); b) a negative net duplication rate, γ11 (-0.05); c) a positive differentiation rate, λ1,2 (0.1); d) an absent differentiation path, λ3,2 (0)

Fig. 4

Network reconstruction accuracy for different dt values. For each experiment included in the SHDP simulation study, the best reconstructed network selected by the inferential procedure has been compared to the true model configuration. Each replicate has been analysed by setting dt value to (0.1, 0.2, 0.5, 0.7, 1). Precision and recall values have been calculated considering only differentiation rates subjected to estimation (excluding λi,j fixed to 0 due to hierarchical constraints) and measure the capability to correctly estimate absent differentiation paths, λi,j=0. Each boxplot corresponds to 100 independent precision/recall values

From a computational point of view, the Gillespie algorithm has been implemented in C++ (Stroustrup 1997) with the support of the Eigen library (Guennebaud et al. 2010). The inferential and penalization procedures are implemented in R (R Core Team 2015) by means of custom scripts that rely on the Matrix package for efficient dense and sparse matrix manipulations (Bates and Maechler 2015). QP problems (21) and (22) are solved by means of IBM ILOG CPLEX Optimizer, freely available under the IBM Academic Initiative program (IBM 2010). The simulated clone trajectories included in a single experiment (1000 clones) are generated in approximately 4 minutes. The inferential procedure takes from 5 to 12 minutes to complete on a single dataset, using the candidate values for the SCAD parameter (η) mentioned above. In general, with a small dt value (0.1), the amount of data to be processed is about 10 times higher than with dt=1, increasing the computational burden and time. This aspect is slightly counter-balanced by the fact that with higher dt values, the number of iterations required for convergence is higher (3.8 vs. 7.1 with dt equal to 0.1 and 1, respectively). For the settings tested and reported in this manuscript, no convergence issues have been observed.

Investigating human hematopoiesis in vivo

In this section, we return to the motivating Wiskott-Aldrich syndrome (WAS) gene therapy (GT) clinical study. The aim is to infer the network structure of the hematopoietic process in humans, along with lineage-specific duplication, death and differentiation rates. Technical and experimental protocols used to collect the data have been described in Aiuti et al. (2013); Biasco et al. (2016); Scala et al. (2018) and are briefly summarized below.

At time 0, corrected HSCs harvested from BM are re-infused in 3 patients previously treated with bone marrow suppressive drugs enhancing immunosuppression in order to ensure a higher level of engraftment for corrected HSCs. In the patient’s body, marked HSCs start to duplicate, die and possibly differentiate into functionally more specialized cells, passing on the copy of the WASP gene to all the offspring generated, reconstituting a functional hematopoietic heritage. Viral IS selection is itself a quasi-random process and the probability that two integration events occur in the same genomic position in two distinct cells is negligible (Ambrosi et al. 2008; Biasco et al. 2011; Pellin and Di Serio 2016). Therefore, IS coordinates can be used as a molecular marker to monitor the in-vivo evolution of a single HSC and of its progeny. BM and PB samples have been taken for the 3 patients at 1, 2 and 3 years after treatment, enriched by means of magnetic cell sorting (MACS) technology according to a set of antibodies known to be lineage-specific. Finally, these samples were sequenced by means of the Illumina Miseq platform (Biasco et al. 2011). A bioinformatic pipeline starting from the sequencing output detects the IS coordinate (labels) and quantifies by means of reads count values the label distributions over lineages and time.

In total, 37,637 distinct clones have been tracked, covering 15 cell types divided into three hierarchical levels: (i) the HSC level: CD34; (ii) the BM level: CD3, CD14, CD15, CD19, CD56, CD61, GLYCO and (iii) the PB level: CD3, CD4, CD8, CD14, CD15, CD19, CD56. In order to limit the potential bias introduced by the low recapture probability of clones, which is known to affect clonal tracking data, we kept only the 1083 clones with more than 15 observations across all lineages and time-points.

In accordance with the current state of the biological literature, the following assumptions have been made: (i) lineages in the HSC level can differentiate into any cell type in the BM level; (ii) lineages in the BM level can be connected to any cell type in the BM and PB levels; and (iii) lineages in the PB level cannot differentiate.

DNA library preparation for NGS-based sequencing requires a linear amplification step known to be a source of noise potentially affecting cell count measurements. To investigate the reliability of the available information, part of the HSC sample has been sequenced after a time interval in which it is reasonable to assume that no or few cell events occurred, so that the observed clones can be assumed to be of size 1. Based on 3104 ISs, a median absolute deviation (MAD) statistic equal to 6.1 has been calculated, leading to a process noise estimate \(\hat {\sigma }^{2}=(1.48 \times MAD)^{2}=81.5\) (Rousseeuw and Croux 1993). This value has been incorporated in the inference and model selection procedures by modifying the variance-covariance matrix as \(\hat {\boldsymbol {W}^{\boldsymbol {*}}}={\boldsymbol {W}^{\boldsymbol {*}}}+\hat {\sigma }^{2} \boldsymbol {I}\). In order to facilitate the biological interpretation of the results obtained, a filtered version (only λi,j≥0.1) of the estimated human hematopoiesis structure is given in Fig. 5. The full model can be found in Appendix 4.
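The noise estimate above follows from simple arithmetic on the MAD. The short Python sketch below reproduces it and shows how such a variance could be added to a covariance matrix. The helper names are hypothetical, and the read counts themselves are not available here, so only the reported MAD value of 6.1 is used in the check.

```python
import numpy as np

def noise_variance_from_mad(values, scale=1.48):
    """Robust variance estimate (scale * MAD)^2 from raw measurements (illustrative)."""
    values = np.asarray(values, dtype=float)
    mad = np.median(np.abs(values - np.median(values)))
    return (scale * mad) ** 2

def add_measurement_noise(W, sigma2):
    """Return W + sigma2 * I, the noise-adjusted covariance matrix."""
    return W + sigma2 * np.eye(W.shape[0])

# Arithmetic check using the MAD reported in the text
sigma2 = (1.48 * 6.1) ** 2
print(round(sigma2, 1))  # 81.5
```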

Fig. 5

Estimated (filtered) human hematopoietic differentiation process. a The network configuration of the human hematopoietic differentiation process. Nodes representing lineages positive for the same surface marker and collected from bone marrow and peripheral blood (circled) have the same color. Lymphoid lineages have a blue/green color, whereas myeloid cell types are red/orange and violet. The node corresponding to the stem cell (CD34) is dark blue and circled in black. Glycophorin positive cells (Glyco) and CD61 are respectively grey and pink. Network layout has been generated with a force directed layout, excluding λi,j≤0.1. b Parameters heatmap. Main diagonal elements correspond to net duplication rates. Off-diagonal elements correspond to differentiation rates (row → column). Light-Grey entries are differentiation rates estimated as 0 by SCAD penalization or smaller than 0.1. Dark-Grey entries are excluded from the inferential procedure due to violation of hierarchical constraints (112)


In the simulation study presented in the “Simulation study” section, a candidate model of hematopoiesis of realistic complexity has been considered. The inferential procedure presented in the “Approximate generalized method-of-moments estimation” section relies on iteratively updated approximate solutions for 2 coupled systems of ODEs with initial conditions. The accuracy of such approximations is known to decrease as the time distance between consecutive observations increases. This is particularly relevant given the sampling scheme adopted in the experimental study. For these reasons, different values for dt have been considered and the clones’ evolution has been observed until tend=4.

As shown in Fig. 3, dt affects the estimation precision. This result was expected and can be attributed to the loss of approximation quality provided by Euler’s method for moments evolution with higher dt values. Bias consistently increases with interval length and, for the settings considered in this paper, parameter estimates are centered around the true values only for dt=0.1. The only exception is λ3,2, shown in Fig. 3d, for which the performance is of good quality and similar across all dt settings, despite the fact that the overall amount of information available on clone dynamics decreases tenfold. This remarkable feature is the result of the penalization procedure, which attenuates the dt effect by shrinking small coefficients to zero. In terms of model reconstruction accuracy, data in Fig. 4 show that precision values are close to 1 for all settings evaluated, meaning that we are very confident that \(\hat {\lambda }_{i,j}=0\) corresponds to truly absent differentiation paths. This is consistent across dt. On the other hand, recall behaviour suggests that our proposal is not able to identify all absent differentiations (i.e. λi,j=0), but on average only 4% more are missed with dt=1 (78%) compared to dt=0.1 (82%).
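The link between dt and bias can be illustrated on a deliberately simplified case: a single lineage whose mean follows dm/dt = γm, without noise or differentiation. If an exact increment m(t+dt) − m(t) is interpreted through one Euler step, m(t)γdt, the rate that is implicitly recovered is (exp(γdt) − 1)/dt rather than γ. The Python sketch below evaluates this toy quantity for γ=0.8, the value of γ1 in Fig. 3. It is an assumption-laden illustration of the mechanism, not a reproduction of the simulation results, and the function name is hypothetical.

```python
import numpy as np

def euler_implied_rate(gamma, dt):
    """Rate recovered from an exact increment of dm/dt = gamma*m
    when a single Euler step of length dt is assumed (toy example)."""
    return np.expm1(gamma * dt) / dt

for dt in (0.1, 0.2, 0.5, 0.7, 1.0):
    implied = euler_implied_rate(0.8, dt)
    print(f"dt={dt}: implied rate={implied:.3f}, relative bias={implied / 0.8 - 1:.1%}")
```

In this toy setting the discrepancy grows monotonically with dt, which matches the direction of the effect discussed above.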

In view of the above encouraging results, in particular regarding the network structure, it is possible to give the following interpretation of the final model for human hematopoiesis shown in Fig. 5. According to the hematological classification, the following branches can be defined: (i) the lymphoid branch, including CD3 and CD19 in BM and CD3, CD4, CD8 and CD19 in PB; (ii) the myeloid branch composed of CD14, CD15 and CD56 in both BM and PB; (iii) Glycophorin positive cells, corresponding to Glyco BM and (iv) the CD61 positive lineage. The flexibility of such a classification is currently debated and evidence for the presence of progenitors in BM straddling multiple branches emerged in multiple independent studies (Kawamoto et al. 2010; Kawamoto et al. 2010; Aiuti et al. 2013). Our results support this hypothesis. The complexity of the relationship among BM lineages is high and characterized by relevant cross-branch differentiation paths, mainly in the myeloid to lymphoid direction, such as CD14 → CD19, CD14 → GLYCO and CD56 → CD19. A possible explanation can be found in a recent investigation aimed at dissecting the heterogeneous CD34 HSC population, suggesting the presence of intermediate stages, named Myeloid PluriPotent then followed by a Multi Lymphoid Progenitor (Biasco et al. 2016; Scala et al. 2018). All lineages in BM, with the exception of CD15, have connections with their committed homologous subpopulation in the PB compartment, as biologically expected.

The full model represented in Appendix 4 displays a much higher level of complexity with respect to Fig. 5. As shown in Appendix 5, a considerable number of low differentiation rates are present. We consider them mostly related to the intrinsic sampling issues associated with clonal tracking experiments. In fact, it is difficult to obtain a consistent detection of all clones contributing to a given lineage across all time-points. The missing observation of clones in an intermediate cell population, say Ck, connecting lineages Ci and Cj for example, leads the inference algorithm to estimate a weak direct differentiation rate Ci → Cj, in addition to the true Ci → Ck → Cj path. This problem affects in particular BM data, where the bone marrow aspiration location, not always maintained unchanged over the follow-up period, can strongly affect clone capture probabilities. We set a threshold at 0.1, which we consider to offer a good balance between model complexity and interpretability of the results.


In this paper, we presented a statistical model for the cell differentiation process along with an inferential procedure able to provide parameter estimation and cell differentiation network reconstruction. The model has been defined as a continuous-time Markov chain with density-dependent transition probabilities, and considers three categories of cellular events: duplications, deaths, and differentiations. Starting from a special formulation of the Kolmogorov forward equation, two coupled sets of ODEs have been derived, describing the time evolution of the first and second central moments of the process. ODE solutions have been approximated by Euler’s method, allowing for parameter estimation via a general linear regression setup.

In order to take into account the dependence among process components due to differentiation events and the non-negativity constraints on a subset of parameters, estimation is performed by means of an iterative generalized least squares procedure, solved using an efficient quadratic programming approach. However, a biologically meaningful differentiation network is expected to be characterized by a limited number of connections between lineages. To encourage a data-driven parsimonious solution and to provide an estimate of the best candidate differentiation pathway, a penalization step based on the SCAD penalty function and a GCV criterion have been introduced.

Inferential procedures have been tested in a simulation study, mimicking a realistic candidate structure for human hematopoiesis. As expected, in the case of frequently repeated observations of the process state over time, Euler’s method provides a good approximation for moments evolution and, consequently, both parameter and model structure estimates are more accurate. When the time elapsing between consecutive observations increases, the quality of the estimates decreases, in particular for a subset of parameters. Despite this, model structure reconstruction is still reliable. The main limitation of the proposed model for cell differentiation, from a biological point of view, is represented by the linearity of the propensity functions, potentially allowing for unlimited clone expansion. However, it is worth noting that patients are immunodepressed at the time of treatment and that the time necessary to reach a stable steady state equilibrium for the total lineage populations is on the order of years.

Finally, the structure of human hematopoiesis and lineage-specific parameters have been investigated by applying the developed method to a recent Wiskott-Aldrich syndrome gene therapy clinical study. The obtained result supports a recently proposed complex, interconnected myeloid/lymphoid branching model over previous simpler alternatives.

In future work we aim to extend the statistical framework presented in this paper to other, more flexible, dynamics formulations, such as Gompertz and logistic growth models. Additional attention will be paid to the inferential procedure, where the computationally efficient Euler’s method will be substituted with more complex alternatives, able to better approximate the solution of the ODE systems. From an application perspective, we consider of particular interest the potential comparison of the results obtained from the analysis of gene therapy data for WAS to those retrieved from different ongoing clinical trials, such as gene therapy for Adenosine Deaminase deficiency (ADA), Metachromatic Leukodystrophy (MLD) or sickle cell disease.

Appendix 1: Derivation of moment equations

By means of the summation operator, \(\sum _{\boldsymbol {x} \in \boldsymbol {\tilde {x}}}\), spanning over the whole set of possible states for process X(t), \(\boldsymbol {\tilde {x}}= \mathbb {Z}^{N} \), it is possible to derive a functional connection between the evolution of the expected population size of each process component and the dynamics of the process probability distribution P(X(t),t):

$$\begin{aligned} \frac{d \mathrm{E}[{X_{i}(t)}]}{dt}= & \frac{d \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} P(\boldsymbol{X}(t)=\boldsymbol{x},t)}{dt}\\ = & \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} \frac{dP(\boldsymbol{X}(t)=\boldsymbol{x},t)}{dt}\\ \end{aligned} $$

The evolution of P(X(t),t) can be expressed by means of the master equation introduced in (1):

$$\begin{array}{*{20}l} \frac{d \mathrm{E}[{X_{i}(t)}]}{dt}= & \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} \sum_{k=1}^{r} \left\{ \left[ D({\boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot}}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot}},t \right) - \left[D(\boldsymbol{x}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) \right\} \end{array} $$

Because the summation operator \(\sum _{\boldsymbol {x} \in \boldsymbol {\tilde {x}}}\) spans over all possible state configurations, the order of the summation operators in the RHS can be inverted:

$${{}\begin{aligned} \frac{d \mathrm{E}[{X_{i}(t)}]}{dt}= & \sum_{k=1}^{r} \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} \{\left[D({\boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot}}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot}},t \right) - \left[D(\boldsymbol{{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) \}\\ = & \sum_{k=1}^{r} \left\{\sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} \left[ D({\boldsymbol{x}\,-\,\boldsymbol{V}_{\boldsymbol{k,\cdot}}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{x}\!-\boldsymbol{V}_{\boldsymbol{k,\cdot}},t \right) - \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} \left[D(\boldsymbol{\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) \right\} \end{aligned}} $$

Now, the summation variable in the first term of the RHS can be shifted without affecting the sum domain, since it covers all the possible state configurations:

$${{}\begin{aligned} \frac{d \mathrm{E}[{X_{i}(t)}]}{dt}= & \sum_{k=1}^{r} \left\{\sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} (x_{i}\,+\,v_{k,i}) \left[D({\boldsymbol{x}}) \boldsymbol{\theta}\right]_{k} P \left(\boldsymbol{X}(t)\,=\, \boldsymbol{x},t \right) \,-\, \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} \left[ D(\boldsymbol{{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right)\right\} \\ = & \sum_{k=1}^{r} \left\{\sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} \left[ D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) + \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} v_{k,i} \left[ D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right)-\right. \\ & \left.\sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} \left[ D(\boldsymbol{{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) \right\} \\ = & \sum_{k=1}^{r} \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} v_{k,i} \left[ D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right)\\ \end{aligned}} $$

Given the known property for expected value of function f(x) of a r.v. x with probability distribution \(P(x), \mathrm {E}[f(x)]= \sum _{\boldsymbol {x}} f(x) P(x)\):

$${\begin{aligned} \frac{d \mathrm{E}[{X_{i}(t)}]}{dt}= & \sum_{k=1}^{r} \mathrm{E} \left[v_{k,i} \left[D({\boldsymbol{X}(t)}) \boldsymbol{\theta} \right]_{k} \right]\\ \end{aligned}} $$

Finally, by linearity of expectation:

$$\begin{array}{*{20}l} \frac{d \mathrm{E}[{X_{i}(t)}]}{dt}= & \sum_{k=1}^{r} v_{k,i} \left[D(\mathrm{E} \left[{\boldsymbol{X}(t)}\right]) \boldsymbol{\theta}\right]_{k} \end{array} $$

A similar approach can be extended to define a system of ODEs for the time evolution for second order moments of X(t):

$${\begin{aligned} ~~ & \frac{d \mathrm{E}[X_{i}(t){X_{j}(t)}]}{dt}= \frac{d \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} x_{j} P(\boldsymbol{X}(t)= \boldsymbol{x},t)}{dt}\\ = & \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} x_{j} \frac{dP(\boldsymbol{X}(t)= \boldsymbol{x},t)}{dt}\\ = & \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} x_{j} \sum_{k=1}^{r} \{ \left[ D({\boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot }}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot}},t \right) - \left[D(\boldsymbol{{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) \}\\[-3pt] = & \sum_{k=1}^{r} \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} x_{j} \{ \left[ D({\boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot}}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot}},t \right) - \left[D(\boldsymbol{{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) \}\\[-3pt] = & \sum_{k=1}^{r} \left\{ \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} x_{j} \left[ D({\boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot}}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x}-\boldsymbol{V}_{\boldsymbol{k,\cdot}},t \right) -\right. 
\\[-3pt] & \left.\sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} x_{j} \left[D(\boldsymbol{{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right)\right\} \\[-3pt] = & \sum_{k=1}^{r} \left\{ \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} (x_{i}+v_{k,i}) (x_{j}+v_{k,j}) \left[ D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) - \right.\\[-3pt] & \left.\sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} x_{j} \left[D(\boldsymbol{{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right)\right\} \\[-3pt] = & \sum_{k=1}^{r} \left\{ \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} x_{j} \left[D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) + \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} v_{k,j} x_{i} \left[D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) + \right.\\[-3pt] & \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} v_{k,i} x_{j} \left[D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) + \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} v_{k,i} v_{k,j} \left[D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right)-\\[-3pt] & \left.\sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} x_{i} x_{j} \left[D(\boldsymbol{{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right)\right\} \\[-3pt] = & \sum_{k=1}^{r} \left\{ \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} v_{k,j} x_{i} \left[D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) + \sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} v_{k,i} x_{j} \left[D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) + \right.\\[-3pt] & 
\left.\sum_{\boldsymbol{x} \in \boldsymbol{\tilde{x}}} v_{k,i} v_{k,j} \left[D({\boldsymbol{x}}) \boldsymbol{\theta} \right]_{k} P \left(\boldsymbol{X}(t)= \boldsymbol{x},t \right) \right\}\\ = & \sum_{k=1}^{r} \mathrm{E} \left[v_{k,j} X_{i}(t) \left[D({\boldsymbol{X}(t)}) \boldsymbol{\theta} \right]_{k} \right] + \sum_{k=1}^{r} \mathrm{E} \left[v_{k,i} X_{j}(t) \left[D({\boldsymbol{X}(t)}) \boldsymbol{\theta} \right]_{k} \right] + \\[-3pt] &\sum_{k=1}^{r} \mathrm{E} \left[v_{k,i} v_{k,j} \left[D({\boldsymbol{X}(t)}) \boldsymbol{\theta} \right]_{k} \right]\\[-3pt] = & \sum_{k=1}^{r} v_{k,j} \left[D(\mathrm{E} \left[X_{i}(t) {\boldsymbol{X}(t)} \right]) \boldsymbol{\theta} \right]_{k} + \sum_{k=1}^{r} v_{k,i} \left[D(\mathrm{E} \left[X_{j}(t) {\boldsymbol{X}(t)} \right]) \boldsymbol{\theta} \right]_{k} + \\[-3pt] &\sum_{k=1}^{r} v_{k,i} v_{k,j} \left[D(\mathrm{E} \left[{\boldsymbol{X}(t)}\right]) \boldsymbol{\theta} \right]_{k} \end{aligned}} $$

The generic element of the covariance matrix \(\Sigma _{X_{i},X_{j}}(t)\), describing the covariance between the X(t) components Xi(t) and Xj(t), can be expressed as a combination of first and second order moments:

$$\begin{array}{*{20}l} \Sigma_{X_{i},X_{j}}(t) = \mathrm{E}[X_{i}(t){X_{j}(t)}] -\mathrm{E}[{X_{i}(t)}] \mathrm{E}[{X_{j}(t)}] \end{array} $$

Differentiating both sides with respect to time, and applying the product rule to the second term, yields a system of ODEs for the evolution of the covariance matrix elements:

$$\begin{array}{*{20}l} \frac{d \Sigma_{X_{i},X_{j}}(t) }{dt} = \frac{d \mathrm{E}[X_{i}(t){X_{j}(t)}]}{dt} - \left(\mathrm{E}[{X_{i}(t)}] \frac{d \mathrm{E}[{X_{j}(t)}]}{dt} + \mathrm{E}[{X_{j}(t)}] \frac{d \mathrm{E}[{X_{i}(t)}]}{dt} \right) \end{array} $$

Finally, substituting the RHS elements with the corresponding expression derived in (18), the following is obtained:

$$\begin{array}{*{20}l} \frac{d \Sigma_{X_{i},X_{j}}(t) }{dt} = & \sum_{k=1}^{r} v_{k,j} \left[D(\mathrm{E} \left[X_{i}(t) {\boldsymbol{X}(t)} \right]) \boldsymbol{\theta} \right]_{k} + \sum_{k=1}^{r} v_{k,i} \left[D(\mathrm{E} \left[X_{j}(t) {\boldsymbol{X}(t)} \right]) \boldsymbol{\theta} \right]_{k} +\\ & \sum_{k=1}^{r} v_{k,i} v_{k,j} \left[D(\mathrm{E} \left[{\boldsymbol{X}(t)}\right]) \boldsymbol{\theta} \right]_{k} - \mathrm{E}[{X_{i}(t)}] \sum_{k=1}^{r} v_{k,j} \left[D(\mathrm{E} \left[{\boldsymbol{X}(t)}\right]) \boldsymbol{\theta} \right]_{k} - \\ & \mathrm{E}[{X_{j}(t)}] \sum_{k=1}^{r} v_{k,i} \left[D(\mathrm{E} \left[{\boldsymbol{X}(t)}\right]) \boldsymbol{\theta} \right]_{k} \end{array} $$
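The mean and covariance equations can be integrated numerically once the propensities are specified. The following is a minimal sketch, not the authors' implementation. It assumes that every event k has a single source cell type s(k), so that \([D(\boldsymbol{x})\boldsymbol{\theta}]_{k} = \theta_{k} x_{s(k)}\); under this assumption the covariance ODE above reduces to \(d\Sigma/dt = A\Sigma + \Sigma A^{\intercal} + B\), with \(A = \sum_{k} \theta_{k} \boldsymbol{v}_{k} \boldsymbol{e}_{s(k)}^{\intercal}\) and \(B = \sum_{k} \theta_{k} \mathrm{E}[X_{s(k)}(t)] \boldsymbol{v}_{k} \boldsymbol{v}_{k}^{\intercal}\). The function names, the `source` vector, the toy network and its rates, and the fixed-step Runge-Kutta scheme are all illustrative choices.

```python
import numpy as np

def moment_rhs(m, S, V, source, theta):
    """Right-hand sides of the mean and covariance ODEs for linear propensities.

    V      : r x N net-effect matrix, row k is v_k
    source : source[k] is the index s(k) of the cell type driving event k (assumption)
    theta  : rate of each event
    """
    V = np.asarray(V, dtype=float)
    n = V.shape[1]
    A = np.zeros((n, n))  # drift matrix, so that dm/dt = A m
    B = np.zeros((n, n))  # sum_k theta_k * m_{s(k)} * v_k v_k^T
    for v_k, s_k, th_k in zip(V, source, theta):
        A[:, s_k] += th_k * v_k
        B += th_k * m[s_k] * np.outer(v_k, v_k)
    return A @ m, A @ S + S @ A.T + B

def integrate_moments(V, source, theta, m0, S0, t_end, n_steps=1000):
    """Classical fixed-step RK4 integration of the coupled mean/covariance system."""
    m = np.array(m0, dtype=float)
    S = np.array(S0, dtype=float)
    h = t_end / n_steps
    for _ in range(n_steps):
        k1m, k1S = moment_rhs(m, S, V, source, theta)
        k2m, k2S = moment_rhs(m + 0.5 * h * k1m, S + 0.5 * h * k1S, V, source, theta)
        k3m, k3S = moment_rhs(m + 0.5 * h * k2m, S + 0.5 * h * k2S, V, source, theta)
        k4m, k4S = moment_rhs(m + h * k3m, S + h * k3S, V, source, theta)
        m = m + h / 6.0 * (k1m + 2 * k2m + 2 * k3m + k4m)
        S = S + h / 6.0 * (k1S + 2 * k2S + 2 * k3S + k4S)
    return m, S

# Toy network with hypothetical rates: type 0 duplicates, dies or
# differentiates into type 1; type 1 dies.
V = [[+1, 0], [-1, 0], [-1, +1], [0, -1]]
source = [0, 0, 0, 1]
theta = [0.30, 0.10, 0.15, 0.05]
mean_t, cov_t = integrate_moments(V, source, theta,
                                  m0=[5.0, 0.0], S0=np.zeros((2, 2)), t_end=2.0)
```

For a pure death process started from a fixed number of cells, the output can be checked against the binomial mean and variance, which is a convenient sanity test before moving to larger networks.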

Appendix 2: Constrained generalized least square procedure

Algorithm 2, given in pseudo-code, starts by computing the increments vector dx and the predictors matrix M according to (13).

Initial parameter values for the iterative procedure are obtained by solving a constrained ordinary least squares (COLS) problem, in which errors are assumed independent and homoscedastic. The COLS estimates \({\hat {\boldsymbol {\theta }^{\boldsymbol {*0}}}}\) are then used to compute a first estimate of the covariance matrix \(\hat {\boldsymbol {W}^{\boldsymbol {*}}}\). The estimates \({\hat {\boldsymbol {{\theta }^{\boldsymbol {*}}}}}\) and \(\hat {\boldsymbol {W}^{\boldsymbol {*}}}\) are then refined in turn until the convergence criterion on the parameter vector is satisfied. From an optimization point of view, both the COLS and the CGLS estimation steps are quadratic programming (QP) problems, that is, optimization problems in which a quadratic function is minimized (or maximized) subject to a set of linear constraints on the variables. In general, a QP problem with n variables and m constraints can be formulated as follows.

Given an n-dimensional vector c, an n×n symmetric matrix Q, an m×n matrix A and an m-dimensional vector b, the goal is to find the n-dimensional vector x such that:

$$ \boldsymbol{\hat{x}}= \underset{\boldsymbol{x}}{\text{arg\,min}} \left(\frac{1}{2} \boldsymbol{x}^{\intercal} \boldsymbol{Q} \boldsymbol{x} + \boldsymbol{c}^{\intercal} \boldsymbol{x} \right) ~s.t.~\boldsymbol{A} \boldsymbol{x} \leq \boldsymbol{b}. $$

The CGLS problem defined in (14) can be converted into a QP problem of type (20) by setting \(\boldsymbol {x}=\boldsymbol {\theta }^{\boldsymbol {*}}, \boldsymbol {Q}=2 (\boldsymbol {M}^{\boldsymbol {*}\intercal } {\boldsymbol {W}^{\boldsymbol {*-1}}} \boldsymbol {M}^{\boldsymbol {*}}), \boldsymbol {c}=-2 (\boldsymbol {dx}^{\intercal } {\boldsymbol {W}^{\boldsymbol {*-1}}} \boldsymbol {M}^{\boldsymbol {*}})^{\intercal }\), \(\boldsymbol {b}=\boldsymbol {0}_{\boldsymbol {r}^{\boldsymbol {*}}}\) and defining A as an r×r diagonal matrix with elements \(A_{i,i}=0\) if the ith element of θ is a net duplication rate (unconstrained) and \(A_{i,i}=-1\) if it is a differentiation rate (constrained to be non-negative). The QP problem then becomes:

$$ {\hat{\boldsymbol{\theta}^{\boldsymbol{*}}}}= {\underset{\boldsymbol{\theta}^{\boldsymbol{*}}}{\arg\min}} \left[ \boldsymbol{\theta}^{\boldsymbol{*}\intercal} (\boldsymbol{M}^{\boldsymbol{*}\intercal} {\boldsymbol{W}^{\boldsymbol{*-1}}} \boldsymbol{M}^{\boldsymbol{*}}) \boldsymbol{\theta}^{\boldsymbol{*}} -2 (\boldsymbol{dx}^{\intercal} {\boldsymbol{W}^{\boldsymbol{*-1}}} \boldsymbol{M}^{\boldsymbol{*}}) \boldsymbol{\theta}^{\boldsymbol{*}} \right] ~s.t.~\boldsymbol{A} \boldsymbol{\theta}^{\boldsymbol{*}}\leq \boldsymbol{0}_{\boldsymbol{r}^{\boldsymbol{*}}}. $$
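Because A is diagonal with entries 0 or −1, the linear constraints are simple sign constraints on a subset of the parameters, and the QP can be illustrated without a dedicated solver. The sketch below builds Q and c from M, W and dx and minimizes the quadratic objective by projected gradient descent. This is a swapped-in technique chosen for brevity, not the solver used for the analysis; the function name, the boolean mask `nonneg` (True for differentiation rates) and the small example data are hypothetical. W is assumed symmetric, so the linear coefficient vector is \(-2\boldsymbol{M}^{\intercal}\boldsymbol{W}^{-1}\boldsymbol{dx}\).

```python
import numpy as np

def cgls_qp(M, W, dx, nonneg, n_iter=5000):
    """Minimise 0.5 x'Qx + c'x subject to x_i >= 0 wherever nonneg[i] is True."""
    M = np.asarray(M, dtype=float)
    W = np.asarray(W, dtype=float)
    dx = np.asarray(dx, dtype=float)
    nonneg = np.asarray(nonneg, dtype=bool)
    W_inv_M = np.linalg.solve(W, M)            # W^{-1} M, without forming W^{-1}
    Q = 2.0 * M.T @ W_inv_M                    # Q = 2 M' W^{-1} M
    c = -2.0 * W_inv_M.T @ dx                  # c = -2 M' W^{-1} dx (W symmetric)
    step = 1.0 / np.linalg.eigvalsh(Q).max()   # 1 / Lipschitz constant of the gradient
    x = np.zeros(M.shape[1])
    for _ in range(n_iter):
        x = x - step * (Q @ x + c)                  # gradient step
        x[nonneg] = np.maximum(x[nonneg], 0.0)      # projection onto the feasible set
    return x

# Hypothetical example: first parameter unconstrained (net duplication rate),
# second parameter non-negative (differentiation rate).
M = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
W = np.diag([1.0, 2.0, 1.0])
dx = np.array([1.0, 0.5, -1.0])
theta_hat = cgls_qp(M, W, dx, nonneg=[False, True])
```

Passing the identity matrix for W gives the COLS starting values. For large or ill-conditioned problems a proper QP solver is preferable, since projected gradient can converge slowly.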

For COLS, it is sufficient to remove \(\boldsymbol{W}^{\boldsymbol{*-1}}\) from the formulas for Q and c; all other terms remain unchanged. For large systems and/or a large number of observations, the inverse \(\boldsymbol{W}^{\boldsymbol{*-1}}\) can be computed efficiently by exploiting the following property of block-diagonal matrices:

$$\left[\begin{array}{cccc} \hat{\boldsymbol{W}}^{\boldsymbol{*}^{\boldsymbol{iter}}}_{\boldsymbol{t}_{\boldsymbol{0}}} & 0 & \dots & 0 \\ 0 & \hat{\boldsymbol{W}}^{\boldsymbol{*}^{\boldsymbol{iter}}}_{\boldsymbol{t}_{\boldsymbol{1}}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \hat{\boldsymbol{W}}^{\boldsymbol{*}^{\boldsymbol{iter}}}_{\boldsymbol{t}_{\boldsymbol{S-1}}} \\ \end{array}\right]^{-1}= \left[\begin{array}{cccc} \hat{\boldsymbol{W}}^{\boldsymbol{*}^{\boldsymbol{iter}^{-1}}}_{\boldsymbol{t}_{\boldsymbol{0}}} & 0 & \dots & 0 \\ 0 & \hat{\boldsymbol{W}}^{\boldsymbol{*}^{\boldsymbol{iter}^{-1}}}_{\boldsymbol{t}_{\boldsymbol{1}}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \hat{\boldsymbol{W}}^{\boldsymbol{*}^{\boldsymbol{iter}^{-1}}}_{\boldsymbol{t}_{\boldsymbol{S-1}}} \\ \end{array}\right] $$

In real data analyses, this property substantially reduces both the computational complexity and the memory requirements of the estimation algorithm.
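The block-wise inversion can be sketched as follows. The helper names are hypothetical, and the dense assembly is included only to check the identity on a small example; in practice the blocks would be kept separate (or stored in a sparse format). Inverting S blocks of size N costs on the order of S·N³ operations, against (S·N)³ for the full matrix.

```python
import numpy as np

def invert_block_diagonal(blocks):
    """Invert a block-diagonal matrix by inverting each diagonal block separately."""
    return [np.linalg.inv(np.asarray(b, dtype=float)) for b in blocks]

def assemble(blocks):
    """Place square blocks on the diagonal of a dense matrix (used here only for checking)."""
    blocks = [np.asarray(b, dtype=float) for b in blocks]
    size = sum(b.shape[0] for b in blocks)
    out = np.zeros((size, size))
    pos = 0
    for b in blocks:
        k = b.shape[0]
        out[pos:pos + k, pos:pos + k] = b
        pos += k
    return out

# Two hypothetical per-timepoint covariance blocks.
W_blocks = [np.array([[2.0, 1.0], [1.0, 3.0]]), np.array([[4.0]])]
W_inv_blocks = invert_block_diagonal(W_blocks)
```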

Appendix 3: Penalized constrained generalized least square procedure

The minimization problem in (17) can be formulated as a QP problem similarly to what is presented in Appendix 2, by modifying the definition of matrix Q as follows:

$$\boldsymbol{Q}_{\boldsymbol{P}}=2 (\boldsymbol{M}^{\boldsymbol{*}\intercal} {\boldsymbol{W}^{\boldsymbol{*-1}}} \boldsymbol{M}^{\boldsymbol{*}}+\boldsymbol{P}_{\boldsymbol{\eta}}) $$

where

$$\boldsymbol{P}_{\boldsymbol{\eta}}= \left[\begin{array}{cc} \boldsymbol{P}_{\boldsymbol{\gamma}} & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{P}_{\boldsymbol{\eta,\lambda}} \\ \end{array}\right] $$

is an r×r diagonal penalization matrix, with elements defined as:

$$\boldsymbol{P}_{\boldsymbol{\gamma}}= \boldsymbol{0}_{\boldsymbol{N,N}};~~~~~~\boldsymbol{P}_{\boldsymbol{\eta,\lambda}}= n \left[\begin{array}{ccc} \frac{p^{\prime}_{\eta}(\lambda_{1,2})}{\lambda_{1,2}} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \frac{p^{\prime}_{\eta}(\lambda_{N-1,N})}{\lambda_{N-1,N}} \\ \end{array}\right] $$

and the derivative of the SCAD penalty function (15), with \((u)_{+}=\max(u,0)\), is given by:

$$p^{\prime}_{\eta}(\theta)= \eta \left\{ I(\theta \leq \eta)+ \frac{(\xi \eta - \theta)_{+}}{(\xi -1)\eta}I(\theta > \eta) \right\}. $$
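As an illustration, the SCAD derivative and the diagonal of \(\boldsymbol{P}_{\boldsymbol{\eta,\lambda}}\) can be computed as in the sketch below. It follows the standard SCAD definition of Fan and Li (2001), in which the numerator of the second term is truncated at zero so that the derivative vanishes for θ ≥ ξη. The function names, the default ξ = 3.7 (the value suggested by Fan and Li, not necessarily the one used in this analysis) and the small `eps` guard against division by zero are assumptions; `n` is the multiplier appearing in the definition of \(\boldsymbol{P}_{\boldsymbol{\eta,\lambda}}\).

```python
import numpy as np

def scad_derivative(theta, eta, xi=3.7):
    """Derivative of the SCAD penalty, evaluated at |theta| (eta > 0, xi > 2)."""
    theta = np.abs(np.asarray(theta, dtype=float))
    # Linearly decaying part, truncated at zero for theta >= xi * eta.
    tail = np.maximum(xi * eta - theta, 0.0) / ((xi - 1.0) * eta)
    return eta * np.where(theta <= eta, 1.0, tail)

def penalty_diagonal(lam, eta, n, xi=3.7, eps=1e-8):
    """Diagonal entries n * p'_eta(lambda) / lambda of the penalty matrix."""
    lam = np.maximum(np.abs(np.asarray(lam, dtype=float)), eps)  # avoid division by zero
    return n * scad_derivative(lam, eta, xi) / lam

# Hypothetical current estimates of three differentiation rates.
diag_entries = penalty_diagonal([0.05, 0.8, 5.0], eta=0.5, n=100)
```

Small rates receive a large quadratic penalty and are pushed towards zero, whereas rates larger than ξη are left unpenalized, which is the property that makes SCAD less biased than a lasso-type penalty for large coefficients.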

The QP problem for PCGLS becomes:

$$ {\hat{\boldsymbol{\theta}^{\boldsymbol{*}}}_{\boldsymbol{P}}}= {\underset{\boldsymbol{\theta}^{\boldsymbol{*}}}{\arg\min}} \left[ \boldsymbol{\theta}^{\boldsymbol{*}\intercal} (\boldsymbol{M}^{\boldsymbol{*}\intercal} {\boldsymbol{W}^{\boldsymbol{*-1}}} \boldsymbol{M}^{\boldsymbol{*}}+\boldsymbol{P}_{\boldsymbol{\eta}}) \boldsymbol{\theta}^{\boldsymbol{*}} -2 (\boldsymbol{dx}^{\intercal} {\boldsymbol{W}^{\boldsymbol{*-1}}} \boldsymbol{M}^{\boldsymbol{*}}) \boldsymbol{\theta}^{\boldsymbol{*}}\right] ~s.t.~\boldsymbol{A} \boldsymbol{\theta}^{\boldsymbol{*}} \leq \boldsymbol{0}_{\boldsymbol{r}^{\boldsymbol{*}}}. $$

Algorithm 3 takes as input the parameter estimates \({\hat {\boldsymbol {\theta }^{\boldsymbol {*}}}}\) obtained from the CGLS procedure, the state increments vector dx, the predictors matrix M and a vector \(\boldsymbol{\eta}\) of candidate values for the tuning parameter η. For each value in \(\boldsymbol{\eta}\), the parameters are initialized to \({\hat {\boldsymbol {\theta }^{\boldsymbol {*}}}}\), and an iterative procedure consisting of (i) estimation of the covariance matrix, (ii) calculation of the penalty matrix and (iii) optimization of the objective function is repeated until the convergence criterion on the parameter estimates is met. The GCV statistic is then computed from the final estimates and stored. Finally, the rate estimates corresponding to the minimum GCV statistic are returned.
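The control flow of this procedure can be sketched as a generic loop over the candidate tuning values. The sketch below is only a skeleton under stated assumptions: the four callables (`estimate_cov`, `penalty_matrix`, `solve_qp`, `gcv`) are hypothetical placeholders for the covariance estimation, penalty construction, QP optimization and GCV computation, none of which is reimplemented here, and the convergence rule (maximum absolute change below `tol`) is an illustrative choice.

```python
import numpy as np

def pcgls_path(theta_init, etas, estimate_cov, penalty_matrix, solve_qp, gcv,
               tol=1e-6, max_iter=100):
    """Return (gcv_score, eta, theta) for the candidate eta with the smallest GCV."""
    best = None
    for eta in etas:
        theta = np.array(theta_init, dtype=float)   # restart from the CGLS estimates
        for _ in range(max_iter):
            W = estimate_cov(theta)                  # (i) covariance matrix
            P = penalty_matrix(theta, eta)           # (ii) penalty matrix
            theta_new = np.asarray(solve_qp(W, P), dtype=float)  # (iii) optimization
            converged = np.max(np.abs(theta_new - theta)) < tol
            theta = theta_new
            if converged:
                break
        score = gcv(theta, eta)                      # model-selection statistic
        if best is None or score < best[0]:
            best = (score, eta, theta)
    return best
```

Keeping the four steps as arguments makes the loop easy to test with simple stand-ins before plugging in the actual estimation routines.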

Appendix 4: Full network representation of the estimated human hematopoietic differentiation process

Fig. 6

Estimated human hematopoietic differentiation process. a The network configuration of the human hematopoietic differentiation process. Nodes representing lineages positive for the surface marker and collected from bone marrow and peripheral blood (circled) have the same color. Lymphoid lineages have a blue/green color, whereas myeloid cell types are red/orange and violet. The node corresponding to the stem cell (CD34) is dark blue and circled in black. Glycophorin-positive cells (Glyco) and CD61 are grey and pink, respectively. The network layout was generated with a force-directed algorithm. b Parameters heatmap. Main diagonal elements correspond to net duplication rates. Off-diagonal elements correspond to differentiation rates (row → column). Light-grey entries are differentiation rates estimated as 0 by SCAD penalization (23). Dark-grey entries are excluded from the inferential procedure because they violate the hierarchical constraints (112)

Appendix 5: Histogram of estimated differentiation rates

Fig. 7

Distribution of differentiation rate estimates. Histogram of \(\boldsymbol {\hat {\lambda }}\), corresponding to the differentiation rates subjected to the penalization procedure. The vertical red dashed line (0.1) marks the threshold used to select the edges included in Fig. 5

Appendix 6: Abbreviations

  • BM: bone marrow

  • CDP: cell differentiation process

  • CGLS: constrained generalized least squares

  • GCV: generalized cross-validation

  • GT: gene therapy

  • HSC: hematopoietic stem cells

  • IS: integration site

  • MACS: magnetic cell sorting

  • MAD: mean absolute deviation

  • NGS: next-generation sequencing

  • ODE: ordinary differential equation

  • PB: peripheral blood

  • RHS: right-hand side

  • SCAD: smoothly clipped absolute deviation

  • SHDP: simulated hierarchical differentiation process

  • WAS: Wiskott-Aldrich syndrome

Appendix 7: Mathematical symbols

  • α,αi: duplication rate

  • δ,δi: death rate

  • ΔXl(t): random variable for clone l increments

  • λ,λi,j: differentiation rate

  • γ,γi: net duplication rate

  • η>0,ξ>2: SCAD tuning parameters

  • θ,θ: parameters vector

  • Ci: cell type

  • D(): diagonal matrix with appropriate process component repetition

  • \(\boldsymbol {dx},\boldsymbol {dx}^{\boldsymbol {l}}_{\boldsymbol {t}}\): increments vector

  • \(\frac {d P(\boldsymbol {X}_{t},t) }{dt}\): ODE describing evolution over time of process probability distribution

  • \(\frac {d \mathrm {E}[{X_{i}(t)}]}{dt}\): ODE describing evolution over time of mean for cell type i counts

  • \(\frac {d \Sigma _{X_{i},X_{j}}(t) }{dt}\): ODE describing evolution over time of the covariance between cell type i and cell type j counts

  • \(\mathrm {E}[{X_{i}(t)}], \mathrm {E}\left [{X^{l}_{i}(t)}\right ]\): expected value for process state at time t

  • GCV: Generalized Cross Validation

  • L,l: number of clones

  • \(\boldsymbol {M}, \boldsymbol {M}^{\boldsymbol {l}}_{\boldsymbol {t}}, \boldsymbol {M*}\): predictors matrix

  • N: number of cell types

  • Pη: parameters penalization matrix

  • P(X(t),t): process probability distribution

  • r: total number of cell events

  • S,s: total number of timepoints and related index

  • t,ts: time and timepoints

  • V,Vk,vi,j: net effect matrix

  • \(\boldsymbol {W}, \boldsymbol {W}^{\boldsymbol {l}}_{\boldsymbol {t}},\boldsymbol {W}^{\boldsymbol {*}}\): covariance matrix

  • X(t),Xi(t): stochastic process for cell differentiation process

  • \(\boldsymbol {x}^{\boldsymbol {l}}, \boldsymbol {x}^{\boldsymbol {l}}_{\boldsymbol {t}}, x^{l}_{i,t}\): observed state for clone l

  • \(\boldsymbol {X}^{\boldsymbol {l}}(t), X^{l}_{i}(t)\): stochastic process for clone l dynamics

  • xt,xi,t: observed state at time t

  • \(\phantom {\dot {i}\!}\Sigma _{X^{l}}(t), \Sigma _{X_{i},X_{j}}(t)\): variance-covariance matrix

Availability of data and materials

Additional information about the study, the experimental procedures and the data can be found at


  • Abkowitz, JL, Linenberger ML, Newton MA, Shelton GH, Ott RL, Guttorp P (1990) Evidence for the maintenance of hematopoiesis in a large animal by the sequential activation of stem-cell clones. Proc Natl Acad Sci 87(22):9062–9066.

    Article  Google Scholar 

  • Aiuti, A, Biasco L, Scaramuzza S, Ferrua F, Cicalese MP, Baricordi C, Dionisio F, Calabria A, Giannelli S, Castiello MC, Bosticardo M, Evangelio C, Assanelli A, Casiraghi M, Di Nunzio S, Callegaro L, Benati C, Rizzardi P, Pellin D, Di Serio C, Schmidt M, Von Kalle C, Gardner J, Mehta N, Neduva V, Dow DJ, Galy A, Miniero R, Finocchi A, Metin A, Banerjee PP, Orange JS, Galimberti S, Valsecchi MG, Biffi A, Montini E, Villa A, Ciceri F, Roncarolo MG, Naldini L (2013) Lentiviral hematopoietic stem cell gene therapy in patients with wiskott-aldrich syndrome. Science 341(6148).

    Article  Google Scholar 

  • Ambrosi, A, Cattoglio C, Di Serio C (2008) Retroviral integration process in the human genome: is it really non-random? a new statistical approach. PLoS Comput Biol 4(8):1000144.

    Article  MathSciNet  Google Scholar 

  • Bates, D, Maechler M (2015) Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.2-2. Accessed 25 Oct 2019.

  • Biasco, L, Ambrosi A, Pellin D, Bartholomae C, Brigida I, Roncarolo MG, Di Serio C, von Kalle C, Schmidt M, Aiuti A (2011) Integration profile of retroviral vector in gene therapy treated patients is cell-specific according to gene expression and chromatin conformation of target cell. EMBO Mol Med 2(5):1757–4684.

    Google Scholar 

  • Becker, A, McCulloch E, Till J (1963) Cytological demonstration of the clonal nature of spleen colonies derived from transplanted mouse marrow cells. Nature 197:452–454.

    Article  Google Scholar 

  • Biasco, L, Pellin D, Scala S, Dionisio F, Basso-Ricci L, Leonardelli L, Scaramuzza S, Baricordi C, Ferrua F, Cicalese MP, et al. (2016) In vivo tracking of human hematopoiesis reveals patterns of clonal dynamics during early and steady-state reconstitution phases. Cell Stem Cell 19:107–119.

    Article  Google Scholar 

  • Catlin, SN, Abkowitz JL, Guttorp P (2001) Statistical inference in a two-compartment model for hematopoiesis. Biometrics 57(2):546–553.

    Article  MathSciNet  Google Scholar 

  • Fan, J (1997) Comments on wavelets in statistics: a reviews by a. antoniadis. J Ital Stat Assoc 6:131–138.

    Article  Google Scholar 

  • Fan, J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360.

    Article  MathSciNet  Google Scholar 

  • Gardiner, CW (1985) Handbook of Stochastic Methods. Springer, New York.

    Google Scholar 

  • Gillespie, DT (1977) Exact stochastic simulation of coupled chemical reactions. J Phys Chem 81(25):2340–2361.

    Article  Google Scholar 

  • Golub, GH, Heath M, Wahba G (1979) Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21(2):215–223.

    Article  MathSciNet  Google Scholar 

  • Goyal, S, Kim S, Chen IS, Chou T (2015) Mechanisms of blood homeostasis: lineage tracking and a neutral model of cell populations in rhesus macaques. BMC Biol 13(1):85.

    Article  Google Scholar 

  • Guennebaud, G, Jacob B, et al. (2010) Eigen v3. Accessed 25 Oct 2019.

  • IBM (2010) Userś Manual for CPLEX. IBM ILOG CPLEX V12.1. Accessed 25 Oct 2019.

  • Kampen, NGV (1981) Stochastic Processes in Physics and Chemistry. North-Holland, Amsterdam.

    MATH  Google Scholar 

  • Kawamoto, H, Ikawa T, Masuda K, Wada H, Katsura Y (2010) A map for lineage restriction of progenitors during hematopoiesis: the essence of the myeloid-based model. Immunol Rev 238(1):23–36.

    Article  Google Scholar 

  • Kawamoto, H, Wada H, Katsura Y (2010) A revised scheme for developmental pathways of hematopoietic cells: the myeloid-based model. Int Immunol 22(2):65–70.

    Article  Google Scholar 

  • Marciniak-Czochra, A, Stiehl T (2013) Mathematical models of hematopoietic reconstitution after stem cell transplantation In: Model Based Parameter Estimation, 191–206.. Springer, Berlin.

    Chapter  Google Scholar 

  • Naldini, L (2011) Ex vivo gene transfer and correction for cell-based therapies. Nat Rev Genet 12(5):301–15.

    Article  Google Scholar 

  • Pellin, D, Di Serio C (2016) A novel scan statistics approach for clustering identification and comparison in binary genomic data. BMC Bioinformatics 17(11):320.

    Article  Google Scholar 

  • Purutcuoglu, V, Wit E (2008) Bayesian inference for the mapk erk pathway by considering the dependency of the kinetic parameters. Bayesian Anal 3(4):851–886.

    Article  MathSciNet  Google Scholar 

  • R Core Team (2015) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. R Foundation for Statistical Computing. Accessed 25 Oct 2019.

    Google Scholar 

  • Risken, H (1984) The Fokker-Planck Equation. Springer, New York.

    Book  Google Scholar 

  • Romano, O, Peano C, Tagliazucchi GM, Petiti L, Poletti V, Cocchiarella F, Rizzi E, Severgnini M, Cavazza A, Rossi C, et al (2016) Transcriptional, epigenetic and retroviral signatures identify regulatory regions involved in hematopoietic lineage commitment. Sci Rep 6:24724.

    Article  Google Scholar 

  • Rousseeuw, PJ, Croux C (1993) Alternatives to the median absolute deviation. J Am Stat Assoc 88(424):1273–1283.

    Article  MathSciNet  Google Scholar 

  • Scala, S, Basso-Ricci L, Dionisio F, Pellin D, Giannelli S, Salerio FA, Leonardelli L, Cicalese MP, Ferrua F, Aiuti A, et al (2018) Dynamics of genetically engineered hematopoietic stem and progenitor cells after autologous transplantation in humans. Nat Med 24(11):1683.

    Article  Google Scholar 

  • Stroustrup, B (1997) The C++ Programming Language. 3rd edn. Addison-Wesley, Boston.

    MATH  Google Scholar 

  • Tibshirany, R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 58:267–288.

    MathSciNet  Google Scholar 

  • Wilkinson, DJ (2006) Stochastic Modelling for Systems Biology. Chapman and Hall, London.

    MATH  Google Scholar 

Download references


The authors acknowledge support from EU COST Action COSTNET on Statistical Network Science (CA15109).

Author information

DP and EW developed the modeling and inference procedures. CdS critically reviewed the work and provided useful suggestions. DP wrote the draft and EW edited it. LB and AA provided the data and the biomedical interpretation of the results. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ernst C. Wit.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Cite this article

Pellin, D., Biasco, L., Aiuti, A. et al. Penalized inference of the hematopoietic cell differentiation network via high-dimensional clonal tracking. Appl Netw Sci 4, 115 (2019).
