Skip to main content

Venture capital investments through the lens of network and functional data analysis


In this paper we characterize the performance of venture capital-backed firms based on their ability to attract investment. The aim of the study is to identify relevant predictors of success built from the network structure of firms’ and investors’ relations. Focusing on deal-level data for the health sector, we first create a bipartite network among firms and investors, and then apply functional data analysis to derive progressively more refined indicators of success captured by a binary, a scalar and a functional outcome. More specifically, we use different network centrality measures to capture the role of early investments for the success of the firm. Our results, which are robust to different specifications, suggest that success has a strong positive association with centrality measures of the firm and of its large investors, and a weaker but still detectable association with centrality measures of small investors and features describing firms as knowledge bridges. Finally, based on our analyses, success is not associated with firms’ and investors’ spreading power (harmonic centrality), nor with the tightness of investors’ community (clustering coefficient) and spreading ability (VoteRank).


In their pursuit of entrepreneurial opportunities, new firms rely on a variety of financing sources. When internal means of financing (e.g. owner capital and cash flow) are insufficient to support the growth of the business, firms will seek external capital (Hall and Lerner 2010; Cosh et al. 2009; Mina et al. 2013). In the context of high-risk and technology-intensive sectors, venture capital (VC) is the kind of external capital that can provide not only financial resources, but also substantial knowledge of product markets, and useful connections with other firms and investors (Da Rin et al. 2013).

In the VC investment model, success implies a positive exit outcome of the investee firm through an IPO or trade sale, generating (ideally optimal) returns for the investors (Gompers 1995; Poulsen and Stegemoller 2008). From the viewpoint of the firm, there are indeed other indicators of positive performance, which include new patents (Lerner 2000; Lahr and Mina 2016), new products (Hellmann and Puri 2000), and growth by number of employees and sales (Bertoni et al. 2011), depending on the stage of development of the business.

In this study we take a different approach, and consider the ability of the firm to raise funds as an indication that the firm is achieving those milestones that mark the path to exit events, and remains through time a promising vehicle of future returns. We implement this procedural view of success by investigating specific features of the network of firms and investors built from deal-level data.

A number of studies employ network tools to describe interactions among these economic agents. For instance, Bonaventura et al. (2020) use networks measures to link the likelihood of long-term positive economic performance to the flow of employees (and the associated transfer of know-how) across firms. Network tools can also be used to assess the probability that an investor will invest in a certain firm (Liang anf Yuan 2016), and to study how suppliers of finance can exchange opportunities to invest in a portfolio firm, spread financial risk and share knowledge (Bygrave 1988). Investors’ networks appear to be especially important for the likelihood of positive returns and successful exits (Hochberg et al. 2007). In this contribution we use network analysis to build, first, a binary definition of success by which we partition funding trajectories in two clusters capturing a ‘high’ and a ‘low’ funding regime. Then, we derive richer characterizations of firm success by considering the total amount of money raised as a scalar outcome, and the funding trajectory itself as a functional outcome (Esposito et al. 2021).

We retrieve data from CB Insights (2022), which provides records of transactions in venture capital markets and stock market listings from 1949. Notably, the very first transaction in our dataset corresponds to the public listing of five Japanese firms in May 1949, while the first recorded venture capital transaction concerns an investment by Greylock Partners in Worthington Biochemical – labelled by CB Insights as ‘Growth Equity’ in 1967. However, since data until 2000 are incomplete, we focus on the period 2000–2020 in order to minimize the impact of missing data on our analysis. Moreover, as different sectors may follow different investment patterns (Dushnitsky and Lenox 2006), we restrict our analysis to the healthcare sector only, which is one of the richest in venture capital investments and one that has shown to be among the least sensitive to market oscillations (Pisano 2006).

The reminder of this article is organized as follows. After a description of the complex, time-varying structure of the bipartite network of investors and firms, we show how our definitions of success can be related to standard definitions in the literature. Next, we introduce statistics computed on the projections of the bipartite network, and show their association with our definitions of success. We do this using a temporal window of 10 years along which we consider a firm’s financing rounds after its first investment, and demonstrate the advantages of our approach. Since the length of the temporal window employed in our analysis is somewhat arbitrary, we repeat it with lengths varying from 5 to 12 years – ascertaining the robustness of our findings. We apply a similar stability check varying the set of covariates employed in our regression models. Finally, we discuss our main results and provide some concluding remarks.

Network characterization

We build a bipartite network with 83258 nodes divided into investors (32796) and firms (50462) – notably, the two categories are not exclusive; 2155 nodes are labeled as both firms and investors. An investor and a firm node are connected by an undirected link whenever there exists a transaction in the CB Insights’ database regarding the two, i.e. a record of the investor investing in the firm, and for which we have information regarding the amount of money invested and the date of the deal. We end up with 63035 links. Because an investor can invest in the same firm multiple times, the network is also a multigraph. Moreover, we account for the temporal dimension in a cumulative fashion: once created, links persist, so the snapshots of the graph that we compute for each year between 2000 and 2020 contain all the links (and their nodes) concerning transactions recorded prior to and including that year.

By projecting the bipartite network on the firms’ layer, we produce a projected graph which is employed to compute the statistics described in Table 1. In particular, we assume that two firms are linked if they are invested in by the same investor within seven years. This time span balances a trade-off between the average VC investment window, which recent evidence places below five years (Sethuram et al. 2021), and the realistic advantages that may derive from the expertise of a common investor – even after its exit from one of the firms. As a robustness check, we run our pipelines also linking two firms if they are invested by the same investor within either five or ten years (see Additional file 1).

By projecting the bipartite network on the investors’ layer, we produce a second projected graph. Here two investors are linked if they have invested in the same firm in the same financing round. In order to use also investors’ centrality measures as potential predictors for firms’ success, we compute the maximum, the minimum and the median of the centrality distribution of the ‘early’ investors in each firm, i.e. those who participated in the firm’s first recorded funding round.

Table 1 Statistics computed on the projected graphs of investors and firms.

Figure 1 gives a snapshot of the firms’ projected network in 2020, aggregated at the geographical resolution of countries. The size of each node, which represents a country, is given by the number of firms located in that country. Roughly 83% of the firms are either North American or European (around 60% belong to the US market), while the remaining 17% is mostly composed of Asian firms. In terms of sub-sectoral specialization, around 50% of the firms operate within the sub-sectors of Medical Devices & Equipment, Biotechnology, and Internet Software & Services - the Medical Devices & Equipment sub-sector alone accounts for 20% of the nodes in network. About 85% of the firms in the network are either active or acquired, with the remaining portion being either inactive or having completed an IPO.

Fig. 1
figure 1

Projection of the bipartite network on the firms’ layer, aggregated by country


Assessing centrality measures against standard definitions of success

We start by assessing whether the centrality measures computed from our networks correlate with firms’ success according to a standard definition, reproducing the exercise presented in Bonaventura et al. (2020). We classify a firm as successful if, in a time window of given size from its first investment, (i) it has been acquired, (ii) it has been listed in the public market, or (iii) it has merged with another firm. Therefore, as compared to Bonaventura et al. (2020), we differ both in the creation of the networks and, slightly, in the definition of success (we do not label a firm as successful if it has made an acquisition).

Bonaventura et al. (2020) start by considering the pool of open deals available to investors each month – these are the firms which (i) have not yet received funding, (ii) have not yet been acquired, and (iii) have not yet been listed in the stock exchange market. Next, they propose an investment strategy by picking, each month, the top n firms in terms of closeness centrality, and check whether – over a time span of 6, 7 or 8 years – this strategy is more successful than just randomly picking n firms from the pool of open deals. To assess statistical significance, the authors apply a test based on a hypergeometric null distribution (the number of successful firms in a random sample of size n from a population whose overall size and prevalence of successful firms are known).

In our analysis, we link firms and investors according to observed investments. Therefore, we act as an investor who has already seen the first investment in the firms of interest. Figure 2 shows success rates computed using this approach for different centrality measures with \(n=25\). For each measure, firms are sorted in descending order to pick the top 25 – except for VoteRank, which captures an inverse centrality, so firms are sorted in ascending order. Results are aggregated computing mean success rates and their standard deviations over 11 years (from 2000 to 2010), and shown grouping centrality measures based on their correlation structure (see below). As a preliminary to analyses to be presented later in the paper, we note that group 4 emerging from the dendrogram in Fig. 3 contains centrality measures that produce similar and rather good success rates; centrality measures from groups 3 and 5 produce more heterogeneous success rates, and centrality measures from group 1 produce poorer success rates.

Fig. 2
figure 2

Success rate for different centrality measures adopting the methodology proposed by Bonaventura et al. (2020). Bars represent mean success rates over 11 years (from 2000 to 2010), and are color-coded according to the partition of centrality measures obtained from the dendrogram in Fig. 3. Black lines mark \(\pm 1\) standard deviation intervals about the means

Extracting centrality signals

After assessing centrality measures against standard definitions of success, and before turning to regression exercises for alternative and progressively more refined success outcomes, we pre-process our covariates as follows. After log-transforming those that present markedly right-skewed distributions, we scale them all and analyze their correlation structure by building a feature dendrogram (Pearson absolute correlation, complete linkage; see Fig. 3). The dendrogram highlights seven groups. Within each of them our covariates are highly correlated, but across the groups we can identify distinct, fairly uncorrelated signals. The first group includes investors’ minimum and median eigenvector, Newman betweenness, PageRank, betweenness centrality and investors’ minimum degree. The second group comprises the maximum, the minimum and the median of investors’ VoteRank. The third group, which is the largest one, contains information on firms’ degree and core number and investors’ summary statistics on average neighbor degree and closeness centrality, together with investors’ maximum PageRank, Newman betweenness centrality, degree and betweenness centrality. The fourth group features firms’ eigenvector and PageRank centrality and investors’ maximum eigenvector centrality. The fifth group describes firms as knowledge bridges (number of investors, clustering coefficient and Newman betweenness centrality). The sixth group comprises firms’ harmonic centrality and summary statistics of investors’ harmonic centrality. Finally, the seventh group includes summary statistics of investors’ clustering coefficients.

Fig. 3
figure 3

Dendrogram of firms’ and investors’ centrality measures (absolute correlation distance, complete linkage). Seven groups of features are highlighted. The first from the bottom contains centrality measures for smallest and median investors; the second includes investors’ VoteRank statistics; the third contains firms’ and investors’ closeness and average neighborhood centrality, as well as measures of firms’ and big investors’ centrality; the fourth includes firms’ and big investors’ eigenvector centrality measures, as well as firms’ PageRank; the fifth contains features describing firms as knowledge bridges; the sixth includes harmonic centrality measures; the seventh includes investors’ clustering coefficients

We leverage these groups to guide feature selection for our regression exercises. For each response type (binary, scalar, functional; more on that below), we reduce the initial set of covariates to seven predictors, selecting one per group through an exhaustive search for the combination that optimizes the goodness of fit. We later consider further (sub-optimal) combinations comprising one covariate per group, as a check on the stability of our analysis.

Refining the definition of success

Each firm has its own funding history. After its birth, the firm collects resources over time, thus being characterized by the trajectory of the amount of money it is able to attract. We treat these trajectories with tools from Functional Data Analysis (FDA), a field of statistics that studies observations that come in the form of functions over a continuous domain (Ramsay and Silverman 2005; Kokoszka and Reimherr 2017). Specifically, we focus on cumulative functions of the money raised over time by each firm, which by construction are monotonically non-decreasing. These are aligned, so that their time domain begins the year in which each firm receives its first investment (regardless of the calendar year it corresponds to). We restrict attention to firms that have received at least two investments during their time domain, as to avoid flat trajectories. Moreover, we only consider firms whose sub-sector is known. In the following, we show results obtained using a time window of 10 years from first investment, employing a total of 3072 firms. Subsequently, we also explore alternative window sizes (see Additional file 1: Figure S1).

Two regimes

Our first definition of success is based on separating firm funding trajectories in two clusters, identifying high (successful) and low investment regimes. Because of the heterogeneity among healthcare sub-sectors, we run a functional k-means clustering algorithm (Jacques and Preda 2014) with \(k=2\) separately for each sub-sector, using 100 random initialization of the centroids. As an example, Fig. 4 shows the results of the clustering algorithm for the drug discovery sub-sector. Throughout all sub-sectors, the algorithm partitions 519 (16.89%) firms in the high-regime cluster and 2553 firms in the low-regime cluster. This binary definition of success is therefore rather conservative, as a small minority of firms are labelled as successful.

Fig. 4
figure 4

k-means functional clustering (\(k=2\)) of the funding trajectories of firms belonging to the drug discovery sub-sector, observed for 10 years after the first recorded round of investments. Blue and red dashed lines represent firms in the high (‘successful’) and low regimes, respectively. Bold curves represent cluster centroids

We consider a logistic regression model defined as:

$$\begin{aligned} \log \left( \frac{P(y_i=1)}{1-P(y_i=1)}\right) =\beta _0+\sum _{j=1}^{p} \beta _j x_{ij} \quad i=1,\dots n \end{aligned}$$

where n is the number of observations; \(y_i\), \(i=1,\dots n\), are the binary responses indicating membership to the high (\(y_i=1\)) or low (\(y_i=0\)) regime clusters; \(\beta _0\) is an intercept and \(x_{ij}\), \(i=1,\dots n\), \(j=1,\dots ,p\) (\(p=7\)), are scalar covariates. We fit all possible configurations of this model, comprising one covariate per group, and select the best ones in terms of their log-likelihood.

Since the data set is unbalanced, results on the best model configuration, which has a log-likelihood of \(-862.98\) and a McFadden’s pseudo \(R^2\) (McFadden 1973) of 0.1402, are bound to be driven by the majority class. To mitigate this problem, for 1000 independent times, we randomly subsample the more abundant low-regime class as to match the numerosity of the high-regime one, and re-fit the logistic regression using the best combination of regressors found on the whole data set (Hastie et al. 2009). The average log-likelihood across these fit replications is \(-418.37\), which is substantially higher than the score of the unbalanced fit, while the average McFadden’s pseudo \(R^2\) is 0.1499, with a maximum of 0.1970. Figure 5 displays the scatter plots of the magnitude of the estimated coefficients and their significance. Firms’ closeness (group 3) and PageRank (group 4) centrality show a positive and significant impact on the probability of belonging to the high-regime class. Interestingly, firms’ clustering coefficient (group 5) seems to have a negative and significant impact. This is probably due to the conspicuous number of more peripheral firms which do not belong to the giant component of our network, and have a clustering coefficient equal to 1. The harmonic centrality of the least important investor in a firm (group 6) seems to have a positive, mildly significant effect. Finally, the covariates selected to represent the groups 1, 2 and 7 do not show a significant association with the binary response.

Fig. 5
figure 5

Scatter plots of logistic regression coefficient estimates (horizontal) and significance (vertical; \(-log(p\)-value)). The eight scatter plots correspond to the intercept (upper left) and the covariates selected in each of the seven covariate groups. Each point in a scatter plot represents one of 1000 fits run on data balanced by subsampling the majority low-regime class. Light blue solid lines mark averages across the fits, and light blue dashed lines are \(\pm 1\) standard deviations about them. Red vertical solid lines mark 0 coefficient values. Red horizontal dashed lines mark significance values associated to a p-value of 0.1

Next, we investigate whether the binary definition of success derived from firms’ funding trajectories is related to a more standard definition, i.e., their eventual exit in IPO, acquisition or merger. This is captured by the confusion matrix in Table 2. Notice that the overall accuracy is quite large (0.71) and the precision is 0.57. This means that, if we classified a firm as successful according to our definition, there would be a good chance that it would be selected as successful also by standard definitions. Nevertheless, the number of false negatives is not negligible (recall is only 0.31).

Table 2 Confusion matrix for standard vs trajectory-based binary definition. Accuracy: 0.71; recall: 0.31; precision: 0.57

Aggregate amount of money raised

The evidence of a relationship between the success of a firm and the network features obtained using our trajectory-based binary response is promising. However, our binary definition of success is very rough and the unbalance in the data forces us to run the analysis relying on reduced sample sizes. Moreover, the confusion matrix in Table 2 highlights a less-than-perfect match with a more standard binary definition, suggesting that we may be missing some important components of what makes a firm successful. Thus, we next consider a scalar proxy for success, defined as the cumulative end point of a firm’s funding trajectory, i.e. the sum of the investments it has received through a 10 year window since the first investment.

We consider a regression model defined as:

$$\begin{aligned} y_i =\beta _0+\sum _{j=1}^{p} \beta _j x_{ij} + \sum _{c=1}^{k} \gamma _c x_{ic} +\epsilon _i \quad i=1,\dots ,n \end{aligned}$$

where n is the number of observations; \(y_i\), \(i=1,\dots n\), are the scalar responses (aggregate amount of money raised); \(\beta _0\) is an intercept; \(x_{ij}\), \(i=1,\dots ,n\), \(j=1,\dots ,p\) (\(p=7\)), are scalar covariates; \(x_{ic}\), \(i=1,\dots ,n\) and \(c=1,\dots ,k\) (\(k=5\)), are scalar controls and \(\epsilon _i\), \(i=1,\dots n\), are i.i.d. Gaussian model errors. As for the logistic regression, we fit all possible configurations of this model, comprising one covariate per group. However, in order to properly gauge the network features, in all the fits we include controls for the size of the first investment (a scalar variable) and the industry of the firms (a categorical variable indicating the membership to a given market sector). The left column of Table 3 shows results for the fit of the best model configuration in terms of \(R^2\). Additional fits with different model specifications (also controlling for firms’ geographical location and year of birth) produce similar results (see Additional file 1: Table S1). In particular, we also run a linear regression using as response differential money raised; that is, the difference between aggregate money raised and the amount of money received in the first investment (in this specification the latter is not included as a control). Results for this fit are shown in the right column of Table 3. As in the case of the logistic regression for our binary outcome, and for both the scalar responses considered, the covariate from group 3 (here, the maximum of the average neighborhood degree among a firm’s investors) has a positive and significant effect, while the covariate from group 5 (a firm’s clustering coefficient) has a negative and significant one. The covariate from group 1 (here, the median of the investors’ PageRank) also has a positive and significant effect. The covariate from group 4 (firms’ eigenvector centrality) has a positive and significant effect, but only with differential money raised as response. The other covariates (from groups 2, 6 and 7) do not show a significant association with the scalar responses. Notably, the regressions explain approximately \(48\%\) and approximately \(22\%\) of the in-sample variability of aggregate and differential money raised, respectively.

Table 3 Linear regression results

Next, as we did for our binary response, we assess whether aggregate money raised is linked to the standard definition of success (exit in IPO, acquisition or merger within 10 years from the first investment). Figure 6 contrasts the boxplots of aggregate money raised between ‘typically defined’ successful and unsuccessful firms. Firms that are successful based on the standard definition indeed do appear to raise more money (higher median, shorter left tail). This is partially expected, since the highest amount of money is usually invested in the latest stages, which are more likely to happen if the firm is promising and headed towards a successful outcome (an IPO, for instance). However, some successful exits, such as acquisitions, may also happen at an early stage of the firm life-cycle – resulting in an exchange of shares but not in fresh new money for the firm. This may explain why our binary classification in low and high regimes misses many ‘typically successful’ firms; the majority of the latter correspond to acquisitions, not IPOs.

Fig. 6
figure 6

Boxplots of aggregate money raised (log) for ‘traditionally successful’ (right; 1) and ‘traditionally unsuccessful’ (left; 0) firms

Exploiting the trajectories

Our scalar outcome, which measures the aggregate money raised by a firm within a period of 10 years (the selected window size), does not capture how the investments in the firm distribute across such period – something that may be very important in delineating success. Moreover, using aggregate money raised implicitly assumes that the right time to investigate the dependence of success on network features is at the end of the period considered.

We tackle these issues by refining the target outcome and considering the full cumulative investment trajectories – instead of their end point. Thus, we run a function-on-scalar regression (Kokoszka and Reimherr 2017), modeling trajectories as a function of the same set of covariates selected with our procedure for the aggregate money raised response. The function-on-scalar regression can be written as:

$$\begin{aligned} Y_i(t) = \beta _0(t) + \sum _{j=1}^{p} \beta _j(t) x_{ij} + \epsilon _i(t) \quad i=1,\dots n \end{aligned}$$

where n is the number of observations; \(Y_i(t)\), \(i=1,\dots n\), are the aligned trajectories; \(\beta _0(t)\) is a functional intercept; \(x_{ij}\), \(i=1,\dots n\) and \(j=1,\dots ,p\) (\(p=7\)), are scalar covariates, and \(\epsilon _i(t)\), \(i=1,\dots n\), are i.i.d. Gaussian model errors.

In this model, the regression coefficient of a scalar covariate j, \(\beta _j(t)\), is itself a curve describing the relationship between the covariate and the functional response, which varies along its domain (time). Together with the functional coefficients, we also estimate their standard errors, which can be employed to build confidence bands around the curves (Goldsmith et al. 2016). Figure 7 shows the estimated functional coefficients and their 95% confidence bands. Results of the fit are in line with the previous ones, showing positive and significant effects for the covariates from groups 1, 3 and 4, and a negative and significant effect for the covariate from group 5. Interestingly, the positive effect of the covariate from group 1 is increasing, but its derivative is decreasing over time – leveling off after approximately 2 years from the first investment. Also, the positive effect of the covariate from group 3 is initially increasing but, around 4 years from the first investment, starts to decrease. The positive effect of the covariate from group 4 and the negative effect of the covariate from group 5 both increase in size across the whole time domain. Finally, the effects of the covariates from groups 2, 6 and 7 are not statistically different from 0.

Fig. 7
figure 7

Function-on-scalar regression. Blue solid lines represent coefficient curve estimates, surrounded by blue confidence bands (constructed through point-wise estimated standard errors, 95% confidence level). Red dashed lines mark 0. The estimated intercept can be interpreted as the sheer effect of time

Stability checks

In order to validate our previous results we extend the analysis in two ways. First, we vary the length of the funding trajectories, re-running our pipelines for all window sizes between 5 and 12 years. Second, we take advantage of the correlation structure characterizing the covariates at our disposal to measure the stability of coefficient estimates with respect to perturbations in model configurations.

We note that varying the window size induces a change in the number of firms in our sample (see Additional file 1: Figure S1). This is due to the fact that, when aligning firms to the year of their first investment, we consider also firms which are born very recently. The number of firms that already have a trajectory long enough to cover the entire window depends on the size considered; the longer the window, the smaller the number of firms that can be used in the fits. Figure 8 shows the variation in coefficient estimates when fitting the linear regression in Eq. 2 with different window sizes. The coefficients for covariates from groups 1 and 3 remain distinctly positive for all window sizes. The coefficients for the covariate from group 5 remain distinctly negative for all window sizes. In contrast, the coefficients for covariates from groups 2, 4, 6 and 7 change sign depending on the window size, and have 95% confidence intervals that overlap 0 for most or all window sizes, indicating less stable effects. Additional file 1: Figure S2 and S4 show the same analysis conducted for the logistic and the function-on-scalar regressions in Eqs. 1 and 3, respectively.

Fig. 8
figure 8

Stability analysis under window size changes; linear regression. Blue dots represent coefficient estimates for each window size, with \(\pm 1.96\times SE\) bands (dashed blue vertical lines). Horizontal red dashed lines mark 0. The number of firms included in the fits decreases as the window size increases, possibly introducing a selection bias

Figure 9 (left panel) shows the variation in coefficient estimates when perturbing the model configuration; that is, when taking all possible combinations of seven covariates, one from each group. In this analysis the window size is held fixed at 10 years. Notably, for group 3 the average coefficient estimate is more than one standard deviation above 0, indicating that the positive effect of the group tends to persist regardless of the specific covariates selected within the groups. In contrast, the effects of the other groups appear to depend, in size and sign, on the model configuration. This must be interpreted with care though. For instance, in this analysis group 1 and group 5 covariates show both positive and negative coefficient estimates depending on the model configuration, suggesting a less consistent role for these groups than for group 3. However, \(pagerank\_median\) does in fact have a significant positive effect within the context of the ‘best’ linear regression configuration (see Table 3) – which is also confirmed when considering the function-on-scalar regression (see Fig. 7). Also, and rather interestingly, the distribution of coefficient estimates for group 5 present a positive and a negative mode – with a trough about 0 (Fig. 9, right panel). This suggests that the group has indeed a consistent effect across model configurations, except that the sign will depend on which among the group members carries such effect – because some of the highly correlated covariates in the group are in fact counter-varying.

Fig. 9
figure 9

Stability analysis under perturbations of the model specification; linear regression. Left panel: blue dots represent averages of the coefficient estimates obtained changing the covariates considered in each group, with \(\pm 1\) standard deviation bands (dashed blue vertical lines). The horizontal red dashed line marks 0. Right panel: distribution of the coefficients estimates associated to covariates from group 5. While 0 (dashed vertical red line) is within 1 SD (dashed vertical blue lines) from the average (solid vertical blue line), the distribution is bimodal with a trough about 0


By exploiting and combining techniques from the fields of network and functional data analysis, we propose progressively more refined definitions of a firm’s success, and associate them with different network features through regression fits.

Logistic regression results for our binary outcome suggest a strong role for centrality measures belonging to groups 3 (firms and big investors), 4 (firms’ eigenvector centrality measures), and 5 (firms as knowledge bridges) – and a weaker role for centrality measures in group 6 (harmonic centrality measures). In terms of ‘best’ representatives selected within such groups, a firm’s closeness centrality from group 3 reflects the width of its investors’ portfolio (if a firm is part of a big portfolio, its separation from other firms within the network will be lower), PageRank from group 4 is a proxy of a firm’s importance and plays a role similar to eigenvector centrality in undirected networks, and a firm’s clustering coefficient from group 5 reflects tightness in community links. Concerning its estimated negative impact on success, we note that a firm’s clustering coefficient is negatively correlated to its number of investors – both because clustered firms typically belong to portfolios characterized by a high redundancy of investors, and because our network contains many isolated firms (outside the giant component) whose clustering coefficient equals 1, and who are unlikely to succeed.

Linear regression results for our scalar outcome confirm a strong role for groups 1, 3 and 5. The covariate selected from group 1 and group 4, the median of investors’ PageRank and the firms’ eigenvector centrality respectively, have positive estimated effects – suggesting that influential median investors may favor a firm’s success, as well as a firm’s own influence. The covariate selected from group 3, the maximum of investors’ average neighbor degrees, also has a positive estimated effect – suggesting that for a firm’s success forming many connections is not as critical as being connected with investors that are themselves strongly connected, as this may increase the level of capitalization in a later stage of the firm’s life. The covariate selected from group 5 is again a firm’s clustering coefficient, with a negative estimated effect on success. Our stability analysis also provides evidence that the effects of covariates from groups 3 and 5 are consistent across model specifications.

Function-on-scalar regression results for our functional outcome also confirm a strong role for groups 1, 3, 4 and 5. Interestingly, when profiled over time through this richer analysis, the positive effects of the median of investors’ PageRank (group 1) and of the maximum of investors’ average neighbor degrees (group 3) increase early in the life of a firm – but then level off. In contrast, the positive effect of a firm’s eigenvector centrality (group 4) and the negative effect of its clustering coefficient (group 5) increase throughout the temporal domain. This may mean that being connected to important and well connected investors (i.e., those with a median PageRank and large average neighbor degree) is more important early on, whereas being in a far-reaching portfolio of investors (i.e., having a small clustering coefficient and high eigenvector centrality) has a stronger impact on success later in the life of a firm.

Our analysis can be expanded in several ways. First, we limit our study to the healthcare sector, while it may be interesting to investigate other market sectors, and compare the results. Second, meso-scale communities may be analyzed in terms of their longitudinal evolution, as to characterize successful clusters of firms from a topological point of view. Third, in our analysis we gather information from the first round of funding and predict the future success of the firm, but it may be interesting to do so experimenting with different funding rounds, or with models that capture the dynamic evolution of the network, such as tools from topological data analysis (Hensel et al. 2021). This would also shed light on whether the way investment decisions travel within the network of investors has an impact on future success of firms, an application of graph kernels (Rosenfeld et al. 2020). Also, we could consider the network as being formed by the decision of rational agents operating within an economic framework, as in reinforcement learning (Jiang et al. 2018), and employ concepts taken from the field of game theory and goal recognition design to study investment decisions among partially informed agents (Keren et al. 2020).

Availability of data and materials

The data that support the findings of this study are available from CB Insights but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of CB Insights. Code for replication of our study is shared in the GitHub repository


  • Bertoni F, Colombo MG, Grilli L (2011) Venture capital financing and the growth of high-tech start-ups: disentangling treatment from selection effects. Res Policy 40(7):1028–1043

    Article  Google Scholar 

  • Bonaventura M, Ciotti V, Panzarasa P, Liverani S, Lacasa L, Latora V (2020) Predicting success in the worldwide start-up network. Sci Rep 10(1):1–6

    Article  Google Scholar 

  • Bygrave WD (1988) The structure of the investment networks of venture capital firms. J Bus Ventur 3(2):137–157

    Article  Google Scholar 

  • CB Insights (2022)

  • Cosh A, Cumming D, Hughes A (2009) Outside enterpreneurial capital. Econ J 119(540):1494–1533

    Article  Google Scholar 

  • Da Rin M, Hellmann T, Puri M (2013) A survey of venture capital research. In: Constantinides GM, Harris M, Stulz RM (eds) Handbook of the economics of finance, vol 2. Elsevier, Amsterdam, pp 573–648

    Chapter  Google Scholar 

  • Dushnitsky G, Lenox MJ (2006) When does corporate venture capital investment create firm value? J bus ventur 21(6):753–772

    Article  Google Scholar 

  • Esposito C, Gortan M, Testa L, Chiaromonte F, Fagiolo G, Mina A, Rossetti G (2021) Can you always reap what you sow? Network and functional data analysis of venture capital investments in health-tech companies. In: International conference on complex networks and their applications, Springer, pp 744–755

  • Goldsmith J, Scheipl F, Huang L, Wrobel J, Gellar J, Harezlak J, McLean M, Swihart B, Xiao L, Crainiceanu C, et al (2016) Refund: Regression with functional data. R Package Vers, pp 0.1-16

  • Gompers PA (1995) Optimal investment, monitoring, and the staging of venture capital. J Finance 50(5):1461–1489

    Article  Google Scholar 

  • Hall BH, Lerner J (2010) The financing of r &d and innovation. In: Hall BH, Rosenberg N (eds) Handbook of the economics of innovation, vol 1. Elsevier, Amsterdam, pp 609–639

    Chapter  Google Scholar 

  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction. Springer, New York

    Book  Google Scholar 

  • Hellmann T, Puri M (2000) The interaction between product market and financing strategy: the role of venture capital. Rev Financ stud 13(4):959–984

    Article  Google Scholar 

  • Hensel F, Moor M, Rieck B (2021) A survey of topological machine learning methods. Front Artif Intell 4:52

    Article  Google Scholar 

  • Hochberg YV, Ljungqvist A, Lu Y (2007) Whom you know matters: venture capital networks and investment performance. J Finance 62(1):251–301

    Article  Google Scholar 

  • Jacques J, Preda C (2014) Functional data clustering: a survey. Adv Data Anal Classif 8(3):231–255

    Article  MathSciNet  Google Scholar 

  • Jiang J, Dun C, Huang T, Lu Z (2018) Graph convolutional reinforcement learning. arXiv preprint arXiv:1810.09202

  • Keren S, Xu H, Kwapong K, Parkes D, Grosz B (2020) Information shaping for enhanced goal recognition of partially-informed agents. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 9908–9915

  • Kokoszka P, Reimherr M (2017) Introduction to functional data analysis. Chapman and Hall/CRC, Boca Raton

    Book  Google Scholar 

  • Lahr H, Mina A (2016) Venture capital investments and the technological performance of portfolio firms. Res Policy 45(1):303–318

    Article  Google Scholar 

  • Lerner J (2000) Assessing the contribution of venture capital. RAND J Econ 31(4):674–692

    Article  Google Scholar 

  • Liang YE, Yuan STD (2016) Predicting investor funding behavior using crunchbase social network features. Int Res

  • Marchiori M, Latora V (2000) Harmony in the small-world. Phys A Statistical Mech Appl 285(3–4):539–546

    Article  Google Scholar 

  • McFadden D, et al (1973) Conditional logit analysis of qualitative choice behavior

  • Mina A, Lahr H, Hughes A (2013) The demand and supply of external finance for innovative firms. Ind Corp Change 22(4):869–901

    Article  Google Scholar 

  • Newman ME (2005) A measure of betweenness centrality based on random walks. Soc Netw 27(1):39–54

    Article  Google Scholar 

  • Page L, Brin S, Motwani R, Winograd T (1999) The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab

    Google Scholar 

  • Pisano GP (2006) Science Business: the promise, the reality, and the future of biotech. Harvard Business Press, Boston

    Google Scholar 

  • Poulsen AB, Stegemoller M (2008) Moving from private to public ownership: selling out to public firms versus initial public offerings. Financ Manage 37(1):81–101

    Article  Google Scholar 

  • Ramsay JO, Silverman BW (2005) Functional data analysis, 2nd edn. Springer, New York

    Book  Google Scholar 

  • Rosenfeld N, Szanto A, Parkes DC (2020) A kernel of truth: determining rumor veracity on twitter by diffusion pattern alone. In: Proceedings of the web conference 2020, pp 1018–1028

  • Sethuram S, Taussig M, Gaur A (2021) A multiple agency view of venture capital investment duration: the roles of institutions, foreignness, and alliances. Glob Strategy J 11(4):578–619

    Article  Google Scholar 

  • Zhang J-X, Chen D-B, Dong Q, Zhao Z-D (2016) Identifying a set of influential spreaders in complex networks. Sci Rep 6:27823

    Article  Google Scholar 

Download references


We are grateful to Enrico Stivella for useful feedback.


F.C., C.E., G.F., A.M. and L.T. acknowledge support from the Sant’Anna School of Advanced Studies. F.C. acknowledges support from Penn State University. G.R. acknowledges support from the scheme “INFRAIA-01-2018-2019: Research and Innovation action”, Grant Agreement n. 871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics”.

Author information

Authors and Affiliations



All authors conceived ideas and analysis approaches. C.E., M.G. and L.T. retrieved and processed data, implemented pipelines and performed statistical analyses. All authors interpreted findings. F.C., M.G. and L.T. wrote the manuscript. F.C., G.F., A.M. and G.R. supervised the research. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Lorenzo Testa or Francesca Chiaromonte.

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Supplementary material containing additional analysis and robustness checks.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Esposito, C., Gortan, M., Testa, L. et al. Venture capital investments through the lens of network and functional data analysis. Appl Netw Sci 7, 42 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: