This study’s primary purpose is to reveal how knowledge propagates and localizes in the drug development cycle by analyzing drug pipeline, supply chain, and ownership data. For this research, we constructed an MLN that simultaneously represents three types of relationships between companies and institutions regarding the knowledge flow in the drug pipeline. Before moving to the MLN analysis, we define the network of each layer.
Drug pipeline network
We represent the drug pipeline data acquired from Clarivate Analytics as a drug pipeline network \(G_1\) in Fig. 4. In the graph \(G_1=(V_1,E_1)\), \(V_1\) is a node set constructed from the companies, government institutions, and educational institutions listed in the drug pipeline data, and \(E_1\) is an edge set constructed from the licensor–licensee relations on drug pipelines. We write a directed edge in the drug pipeline network as \(i\rightarrow j\), where node i is the licensor and node j is the licensee of the drug candidate in the drug pipeline data. As mentioned in Sect. 2, nodes are business entities of various types, categorized as government institutions, educational institutions, private companies, and public companies. One drug pipeline does not always define exactly one edge in the drug pipeline network, because the originator of a drug candidate may share the license with two companies. We denote the adjacency matrix of \(G_{1}\) as \(A^{[1]}=\left( a_{ij} ^{[1]}\right)\), where the element \(a_{{ij}}^{{[1]}}\) corresponds to the number of edges from node i to j. The superscript represents the layer index. The in- and out-degrees of node i are defined by \(k^{[1]} _{\text {in},i} =\sum _j a_{ji} ^{[1]}\) and \(k^{[1]} _{\text {out},i} = \sum _j a_{ij} ^{[1]}\). In the graph \(G_1\), \(k^{[1]} _{\text {in},i}\) and \(k^{[1]} _{\text {out},i}\) represent the total number of drug pipelines that company i is dealing with and the total number of drug pipelines that company i has discovered, respectively. Moreover, the drug pipeline data contain cases in which the licensee of a drug pipeline is also its licensor. In other words, the graph \(G_1\) contains self-loops, and the number of self-loops of node i is computed as \(\ell ^{[1]} _i= a_{ii} ^{[1]}\). To distinguish license transfers between different entities from these self-loops, we define the in- and out-degrees with self-loops removed, i.e., the number of edges of node i from/to other nodes, as
$$\begin{aligned} m^{[1]} _{\text {in},i}=k^{[1]} _{\text {in},i} - \ell ^{[1]} _i \quad \text {and}\quad m^{[1]} _{\text {out},i}=k^{[1]} _{\text {out},i} - \ell ^{[1]} _i~. \end{aligned}$$
(1)
Thus, \(m^{[1]} _{\text {in},i}\) and \(m^{[1]} _{\text {out},i}\) represent the number of drug candidate licenses that company i has obtained from others and the number that company i has transferred to others, respectively. Therefore, in the graph \(G_1\), the edges between different nodes \((i\ne j)\) are knowledge flows in the drug pipelines, whereas the self-loops \((i=j)\) represent the localization of knowledge within one company.
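As an illustration of Eq. (1), the following minimal Python sketch counts in-degrees, out-degrees, self-loops, and the loop-free degrees from a list of licensor-licensee pairs. The function name `layer_degrees` and the toy edge list are hypothetical and are not taken from the Clarivate Analytics data; repeated pairs are treated as parallel edges.

```python
from collections import Counter

def layer_degrees(edges):
    """Per-node k_in, k_out, self-loops, m_in, m_out for a directed multigraph.

    `edges` is a list of (licensor, licensee) pairs, i.e. edges i -> j.
    Repeated pairs count as parallel edges, matching a_ij > 1.
    """
    k_in, k_out, loops = Counter(), Counter(), Counter()
    for i, j in edges:
        k_out[i] += 1          # i is the licensor (source)
        k_in[j] += 1           # j is the licensee (target)
        if i == j:
            loops[i] += 1      # self-loop: knowledge stays in one company
    nodes = set(k_in) | set(k_out)
    return {
        v: {
            "k_in": k_in[v],
            "k_out": k_out[v],
            "loops": loops[v],
            "m_in": k_in[v] - loops[v],    # licenses obtained from others
            "m_out": k_out[v] - loops[v],  # licenses transferred to others
        }
        for v in nodes
    }

# Hypothetical toy data: A keeps one candidate in-house, licenses two
# candidates to B, and obtains one candidate from C.
toy_edges = [("A", "A"), ("A", "B"), ("A", "B"), ("C", "A")]
stats = layer_degrees(toy_edges)
```

In this toy example, company A has one self-loop, so its loop-free in-degree counts only the license obtained from C and its loop-free out-degree counts only the two licenses passed to B.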
Furthermore, the drug pipeline data are a snapshot: each record captures a pipeline at one particular development stage. Thus, we use the status of the drug development cycle as an edge attribute, \(E_1 = \bigcup _{p}E_1 ^{p}\), where \(E_1 ^p\) represents the drug pipelines at the status \(p\in \{\)Discovery, Phase I–III Clinical, Pre-registration, Registered, Launched\(\}\). Accordingly, we add the status index p to the definitions of the node characteristics, such as the in-degree of company i, \(k^{[1,p]} _{\text {in},i}\). For example, a launched drug without a transfer of license is represented as a self-loop with the Launched attribute, and \(\ell ^{[1,p]} _i\) at \(p=\{\text {Launched}\}\) represents the number of launched drugs that company i both discovered and launched. Although the data do not allow us to determine when the license of a drug candidate was transferred, we can observe the tendency of knowledge flows in the drug pipeline between business entities in the pharmaceutical area.
Supply chain network
The supplier–customer relationship is the most important relationship at the company level in the real economy. We represent the supply chain data acquired from the S&P Capital IQ dataset as a supply-chain network \(G_2\) in Fig. 4. In the graph \(G_2=(V_2,E_2)\), \(V_2\) is a node set constructed from the companies listed in the supply chain data, and \(E_2\) is an edge set constructed from their supplier–customer relations. The supply chain network is an unweighted directed network representing supply chain business relationships. We write a directed edge as \(i\rightarrow j\) when company i is a supplier to company j. The in- and out-degrees of node i are defined by \(k^{[2]} _{\text {in},i} =\sum _j a_{ji} ^{[2]}\) and \(k^{[2]} _{\text {out},i} = \sum _j a_{ij} ^{[2]}\). Moreover, we define the in- and out-degrees with self-loops removed, i.e., the number of edges of node i from/to other nodes, as \(m^{[2]} _{\text {in},i}=k^{[2]} _{\text {in},i} - \ell ^{[2]} _i\) and \(m^{[2]}_{\text {out},i}=k^{[2]} _{\text {out},i} - \ell ^{[2]} _i\). In the graph \(G_2\), \(m^{[2]} _{\text {in},i}\) and \(m^{[2]} _{\text {out},i}\) represent the number of suppliers and the number of customers of company i, respectively.
Ownership network
We define the ownership network \(G_3\) in Fig. 4 based on the S&P Capital IQ dataset. In the graph \(G_3=(V_3, E_3)\), \(V_3\) is a node set constructed from the companies listed in the ownership data, and \(E_3\) is an edge set constructed from their ownership relations. We draw an edge from company i to company j when company j has a stake in company i. The in- and out-degrees of node i are defined by \(k^{[3]} _{\text {in},i} =\sum _j a_{ji} ^{[3]}\) and \(k^{[3]} _{\text {out},i} = \sum _j a_{ij} ^{[3]}\). Moreover, we define the in- and out-degrees with self-loops removed, i.e., the number of edges of node i from/to other nodes, as \(m^{[3]} _{\text {in},i}=k^{[3]} _{\text {in},i} - \ell ^{[3]} _i\) and \(m^{[3]}_{\text {out},i}=k^{[3]} _{\text {out},i} - \ell ^{[3]} _i\). In the graph \(G_3\), \(m^{[3]} _{\text {in},i}\) and \(m^{[3]} _{\text {out},i}\) represent the number of companies in which company i holds shares and the number of shareholders of company i, respectively. Furthermore, the number of owners of each company is limited to 100 in the S&P Capital IQ dataset. Although the S&P Capital IQ dataset includes the ownership ratio, we define the ownership network as an unweighted directed network to increase the number of duplicated nodes between layers. Note that the ownership network in this paper represents the dependency flow, which runs in the opposite direction to the usual convention, because we assume that a firm’s knowledge tends to flow to its owners.
Combining datasets
To construct the MLN in Fig. 4, we must combine the drug pipeline data with the companies’ supply chain and ownership data. Because a company’s name is not always identical in the two datasets, we matched names between them by using information about the country and industry after removing abbreviations such as Inc and Corp. When multiple candidates arose, we checked them manually.
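The matching step can be sketched roughly as follows. This is only an illustrative outline, not the procedure actually used: the suffix list (beyond Inc and Corp), the tuple layout `(name, country, industry)`, the function names, and the toy records are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical suffix list; the text names only "Inc" and "Corp" as examples.
SUFFIXES = {"inc", "corp", "ltd", "co", "llc", "plc"}

def normalize(name):
    """Lowercase, drop punctuation, and strip trailing legal-form tokens."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def match_names(pipeline_rows, capiq_rows):
    """Match (name, country, industry) rows between two datasets.

    Returns (unique, ambiguous): unique maps a pipeline name to its single
    candidate; ambiguous maps a name to several candidates that would need
    a manual check.
    """
    index = defaultdict(list)
    for name, country, industry in capiq_rows:
        index[(normalize(name), country, industry)].append(name)
    unique, ambiguous = {}, {}
    for name, country, industry in pipeline_rows:
        candidates = index.get((normalize(name), country, industry), [])
        if len(candidates) == 1:
            unique[name] = candidates[0]
        elif len(candidates) > 1:
            ambiguous[name] = candidates
    return unique, ambiguous

# Hypothetical toy records, not real companies.
pipeline = [("Acme Pharma Inc.", "US", "Pharmaceuticals"),
            ("Beta Corp", "JP", "Biotechnology")]
capiq = [("ACME Pharma, Inc", "US", "Pharmaceuticals"),
         ("Beta Corp.", "JP", "Biotechnology"),
         ("Beta Co., Ltd.", "JP", "Biotechnology")]
unique, ambiguous = match_names(pipeline, capiq)
```

Here the first record matches exactly one candidate, while the second has two candidates after normalization and would be passed to the manual check.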
MLN representation
In this subsection, we generalize the definitions of our networks using an MLN framework. Generally, an MLN is defined as a pair \({\mathcal {M}}=({\mathcal {G}},{\mathcal {C}})\), where \({\mathcal {G}}=\{G_{\alpha }; \alpha \in \{1,\cdots , M \} \}\) is a family of graphs \(G_{\alpha }=(V_{\alpha },E_{\alpha })\), \(V_{\alpha }\) and \(E_{\alpha }\) are the sets of nodes and edges of layer \(G_{\alpha }\), and M is the number of layers. \({\mathcal {C}}=\{E_{\alpha \beta }\subseteq V_{\alpha }\times V_{\beta }; \alpha ,\beta \in \{1,\cdots , M \}, \alpha \ne \beta \}\) is the set of interconnections between the nodes of different layers \(G_{\alpha }\) and \(G_{\beta }\) with \(\alpha \ne \beta\). We define an MLN with \(M=3\) layers, composed of companies and institutions as nodes and three types of interactions between them, by using the drug pipeline, supply chain, and ownership data. Because each layer of our MLN is defined by a type of interaction, the layers share overlapping nodes but have no interconnections between nodes of different layers: \({\mathcal {C}}=\emptyset\). The graphs \(G_{\alpha }\) for each layer are defined as
- \(G_{1}\): Drug pipeline network
- \(G_{2}\): Supply-chain network
- \(G_{3}\): Ownership network
where the definitions for each layer are explained in previous subsections. Figure 4 displays the conceptual representation of MLN.
The adjacency matrix of each layer \(G_{\alpha }\) is denoted by \(A^{[\alpha ]}=\left( a_{ij} ^{[\alpha ]}\right)\), where the element \(a_{ij} ^{[\alpha ]}\) corresponds to the number of edges from node i to j in the \(\alpha\)-th layer. The in- and out-degrees of a node i of the MLN are defined as vectors:
$$\begin{aligned} \varvec{k}_{\text {in},i} = \left( k^{[1]} _{\text {in},i}, ~~k^{[2]} _{\text {in},i}, ~~k^{[3]} _{\text {in},i}\right) ~~~\text {and}~~~~\varvec{k}_{\text {out},i} = \left( k^{[1]} _{\text {out},i}, ~~k^{[2]} _{\text {out},i}, ~~k^{[3]} _{\text {out},i}\right) , \end{aligned}$$
(2)
where \(k^{[\alpha ]} _{\text {in},i}\) and \(k^{[\alpha ]} _{\text {out},i}\) are the in- and out-degrees of node i in the \(\alpha\)-th layer, i.e., \(k^{[\alpha ]} _{\text {in},i} =\sum _j a_{ji} ^{[\alpha ]}\) and \(k^{[\alpha ]} _{\text {out},i} = \sum _j a_{ij} ^{[\alpha ]}\). Similarly, we denote the number of self-loops of node i in the \(\alpha\)-th layer as \(\ell ^{[\alpha ]} _i= a_{ii} ^{[\alpha ]}\). In this paper, self-loops must also be counted as flows, because self-loops in the drug pipeline network \(G_1\) correspond to the accumulation of knowledge within a company. Thus, we count the flows in each layer using both the edges between two different nodes and the self-loops. The number of edges of node i from/to different nodes is denoted as
$$\begin{aligned} m^{[\alpha ]} _{\text {in},i}=k^{[\alpha ]} _{\text {in},i} - \ell ^{[\alpha ]} _i ~~~~~~\text {and}~~~~~~~m^{[\alpha ]} _{\text {out},i}=k^{[\alpha ]} _{\text {out},i} - \ell ^{[\alpha ]} _i~, \end{aligned}$$
(3)
and the total numbers of such edges and of self-loops in the \(\alpha\)-th layer are
$$\begin{aligned} M_{\alpha }=\frac{1}{2}\sum _{i,j=1,i\ne j} ^{N_{\alpha }} a_{ij} ^{[\alpha ]}~~~~~~\text {and}~~~~~~~L_{\alpha }=\sum _i^{N_{\alpha }} a_{ii} ^{[\alpha ]}~. \end{aligned}$$
(4)
where \(N_{\alpha }\) is the number of nodes in the \(\alpha\)-th layer. For the first layer, we additionally attach the status index p to these characteristics, for example, the in-degree of the i-th node \(k^{[1,p]} _{\text {in},i}\) and the total number of edges \(M^{p}_1\).
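The degree vectors of Eq. (2) can be illustrated with a short Python sketch. It assumes, purely for the example, that each layer is stored as a list of directed `(source, target)` pairs and that a node absent from a layer simply has degree zero there; the helper name `degree_vectors` and the toy layers are hypothetical.

```python
def degree_vectors(layers, node):
    """Return (k_in, k_out) vectors of `node` across all layers.

    `layers` is a list of edge lists; position 0 holds layer 1 (drug
    pipeline), position 1 layer 2 (supply chain), position 2 layer 3
    (ownership). Each edge is a directed (source, target) pair.
    """
    k_in = tuple(sum(1 for _, j in edges if j == node) for edges in layers)
    k_out = tuple(sum(1 for i, _ in edges if i == node) for edges in layers)
    return k_in, k_out

# Hypothetical toy layers sharing the node labels A, B, C.
pipeline_layer = [("A", "A"), ("A", "B")]     # licensor -> licensee
supply_layer = [("B", "A"), ("C", "A")]       # supplier -> customer
ownership_layer = [("A", "C")]                # owned company -> owner
k_in_A, k_out_A = degree_vectors(
    [pipeline_layer, supply_layer, ownership_layer], "A"
)
```

Because the layers have no interconnections, each component of the vector is computed independently from its own layer.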
Bow tie structure
As shown in the edge-level analysis in Sect. 2, knowledge in the drug pipeline seems to circulate at the country level; however, whether the flows connect at the company level is unclear. Generally, the giant weakly connected component (GWCC) of a directed network can be decomposed into the giant strongly connected component (GSCC), which is the largest strongly connected component (SCC) in the GWCC, and its upstream and downstream portions (IN and OUT). This is known as the bow tie decomposition, introduced for the Web by Broder et al. (2000). This decomposition helps us understand the hierarchical and circular flows of the networks from a macroscopic perspective.
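A compact way to see what the bow tie decomposition computes is the following Python sketch. It is a simple reachability-based version intended for small graphs, not an efficient or authoritative implementation; the function names, the `OTHER` label for GWCC nodes outside GSCC, IN, and OUT, and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict, deque

def reach(adj, sources):
    """Nodes reachable from `sources` by breadth-first search over `adj`."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges):
    """Split the GWCC of a directed graph into GSCC, IN, OUT, and OTHER."""
    fwd, bwd, und = defaultdict(set), defaultdict(set), defaultdict(set)
    nodes = set()
    for i, j in edges:
        nodes.update((i, j))
        if i != j:  # self-loops do not affect connectivity
            fwd[i].add(j)
            bwd[j].add(i)
            und[i].add(j)
            und[j].add(i)
    # Giant weakly connected component: largest component ignoring direction.
    remaining, gwcc = set(nodes), set()
    while remaining:
        comp = reach(und, [next(iter(remaining))])
        remaining -= comp
        if len(comp) > len(gwcc):
            gwcc = comp
    # Largest SCC in the GWCC: the SCC of v is the set of nodes that v can
    # reach and that can also reach v.
    remaining, gscc = set(gwcc), set()
    while remaining:
        v = next(iter(remaining))
        scc = reach(fwd, [v]) & reach(bwd, [v])
        remaining -= scc
        if len(scc) > len(gscc):
            gscc = scc
    out_part = reach(fwd, list(gscc)) - gscc   # downstream of the GSCC
    in_part = reach(bwd, list(gscc)) - gscc    # upstream of the GSCC
    other = gwcc - gscc - in_part - out_part   # tendrils, tubes, etc.
    return {"GSCC": gscc, "IN": in_part, "OUT": out_part, "OTHER": other}

# Hypothetical toy graph: a 3-cycle core, one upstream node, one downstream
# node, one tendril hanging off the upstream node, and a separate small pair.
toy = [("a", "b"), ("b", "c"), ("c", "a"), ("s", "a"), ("c", "t"),
       ("s", "x"), ("y", "z")]
parts = bow_tie(toy)
```

For larger networks, a linear-time SCC algorithm such as Tarjan's would replace the repeated searches used here.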
Community structure
Besides the macroscopic structure captured by the bow tie decomposition, community detection is a powerful tool for explaining the structural properties of densely connected networks. We compared the structural properties between layers by using node attributes such as country and primary industry. The detailed analysis of the community structure of each network is presented in Sect. 6.2; here we explain the method used to extract the community structure of a network.
To find communities in the GWCC of each layer, we use the map equation method of Rosvall and Bergstrom (2008), known as Infomap, which is one of the best-performing community detection methods according to Lancichinetti and Fortunato (2009). The map equation method is a flow-based, information-theoretic approach: it searches for the module partition \({\mathcal {M}}\), dividing n nodes into m communities, that minimizes the description length of a random walk on the network. The average single-step description length is defined as
$$\begin{aligned} L({\mathcal {M}})=q_{\curvearrowleft }H({\mathcal {Q}})+\sum ^{m}_{i=1}p_{i\circlearrowright }H({\mathcal {P}}_i)~. \end{aligned}$$
(5)
The first term arises from the movements of the random walker between communities, where \(q_{\curvearrowleft }\) is the probability that the random walker switches communities, and \(H({\mathcal {Q}})\) is the average description length of the community index codewords, given by the Shannon entropy. The second term arises from the intra-community movements of the random walker, where the weight \(p_{i\circlearrowright }\) represents the fraction of movements within community i, and \(H({\mathcal {P}}_i)\) represents the entropy of the movements within community i. Furthermore, this method has been extended to the hierarchical map equation of Rosvall and Bergstrom (2011), which decomposes a network into communities and sub-communities.
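To make the two terms of Eq. (5) concrete, the sketch below evaluates the two-level description length for a given partition. It is deliberately simplified and rests on stated assumptions: the graph is treated as undirected and unweighted, so node visit rates are proportional to degree and no teleportation is needed, whereas the layers in this paper are directed. It only scores a partition and does not perform the Infomap search. The function names and the toy graph are hypothetical.

```python
from math import log2
from collections import defaultdict

def plogp(p):
    """p * log2(p), with the convention 0 * log 0 = 0."""
    return p * log2(p) if p > 0 else 0.0

def map_equation(edges, module_of):
    """Two-level L(M) for an undirected, unweighted graph (assumption).

    `edges` is a list of (u, v) pairs without self-loops; `module_of`
    maps each node to its community label. Every community is assumed
    to contain at least one node with an edge.
    """
    degree = defaultdict(int)
    exit_links = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            exit_links[module_of[u]] += 1
            exit_links[module_of[v]] += 1
    two_w = sum(degree.values())
    modules = set(module_of.values())
    # Exit probability of each community and the total switching rate q.
    q_exit = {m: exit_links[m] / two_w for m in modules}
    q = sum(q_exit.values())
    # First term: q * H(Q), the cost of naming communities when switching.
    index_term = 0.0
    if q > 0:
        index_term = -q * sum(plogp(qi / q) for qi in q_exit.values())
    # Second term: sum over communities of p_i * H(P_i).
    module_term = 0.0
    for m in modules:
        visits = [degree[n] / two_w for n in degree if module_of[n] == m]
        p_within = q_exit[m] + sum(visits)
        entropy = -plogp(q_exit[m] / p_within)
        entropy -= sum(plogp(p / p_within) for p in visits)
        module_term += p_within * entropy
    return index_term + module_term

# Hypothetical toy graph: two triangles joined by a single bridge edge.
toy = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
one_module = {n: 0 for n in range(1, 7)}
two_modules = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
L_one = map_equation(toy, one_module)
L_two = map_equation(toy, two_modules)
```

On this toy graph the partition into the two triangles yields a shorter description length than the single-community partition, which is the criterion the method uses to prefer one partition over another.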
We detect the hierarchical communities by using the multi-coding Infomap method, and we use the “Level” index to represent the hierarchy of communities; communities at the 2nd level are sub-communities of the communities at the 1st level. To characterize the hierarchical communities, we use the node attributes country, company type, and bow tie component for the drug pipeline layer, and country, primary industry, and bow tie component for the supply-chain and ownership layers.
Interlayer degree correlations
It is reasonable to assume that a hub node in the drug pipeline network could also be a hub in the supply chain or ownership network. To verify this assumption, we investigate the degree correlations of each node between different layers. However, the edges in the drug pipeline layer \(E_1 ^{p}\) are characterized by the status \(p\in \{\)Discovery, Phase I–III Clinical, Pre-registration, Registered, Launched\(\}\), and drug pipelines that have not reached the launched stage could disappear during the drug development cycle. Therefore, we focus only on the launched case here. Furthermore, the number of each company’s incoming edges in the drug pipeline layer equals the number of drugs with which it is dealing. From this perspective, we divide the edges of interest in the drug pipeline layer into two types:
- Self-loops, \(\ell _i^{[1,p]}\) at \(p=\{\text {Launched}\}\), corresponding to the number of launched drugs that company i both discovered as the licensor and launched itself. In this paper, we take \(\ell _i^{[1,p]}\) at \(p=\{\text {Launched}\}\) to represent closed innovation.
- In-degrees with self-loops removed, \(m_{\text {in},i}^{[1,p]}\) at \(p=\{\text {Launched}\}\), corresponding to the number of launched drugs of which company i was not the original licensor but which it holds as the licensee. In this paper, we take \(m_{\text {in},i}^{[1,p]}\) at \(p=\{\text {Launched}\}\) to represent open innovation.
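The two quantities above can be counted from status-tagged pipeline edges and then compared with a degree from another layer, as in the following sketch. The Pearson coefficient is used here only as an assumed example of a correlation measure, since this section does not specify one; the function names, the triple layout `(licensor, licensee, status)`, and all toy values are hypothetical.

```python
from collections import Counter
from math import sqrt

def innovation_counts(pipeline_edges, status="Launched"):
    """Count closed (self-loop) and open (incoming, non-loop) launched drugs.

    `pipeline_edges` holds (licensor, licensee, status) triples.
    """
    closed, opened = Counter(), Counter()
    for licensor, licensee, p in pipeline_edges:
        if p != status:
            continue                  # keep only the launched stage
        if licensor == licensee:
            closed[licensee] += 1     # discovered and launched in-house
        else:
            opened[licensee] += 1     # obtained from another entity
    return closed, opened

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical toy pipeline edges and supply-chain in-degrees.
toy_pipeline = [("A", "A", "Launched"), ("A", "A", "Launched"),
                ("B", "A", "Launched"), ("B", "B", "Launched"),
                ("C", "B", "Discovery"), ("C", "C", "Launched")]
supply_in_degree = {"A": 5, "B": 2, "C": 1}

closed, opened = innovation_counts(toy_pipeline)
companies = sorted(supply_in_degree)
r_closed = pearson([closed[c] for c in companies],
                   [supply_in_degree[c] for c in companies])
```

The same comparison can be repeated with the open-innovation counts and with the degrees of the ownership layer.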
Node and edge overlap
Although the dynamics of knowledge flows in the drug pipeline between companies remain unclear, we can raise a possible hypothesis for the edge-level similarity. If drug pipelines are developed in an overly closed setting, in which pharmaceutical companies do no business with other companies in the same industry, no edge-level similarity to the supply-chain layer should appear. However, alliances between pharmaceutical companies, i.e., open innovation in the pharmaceutical industry, could share not only markets but also drug pipelines, which is expected to appear as edge-level similarity to the supply-chain layer. Furthermore, if a drug pipeline tends to be transferred to the owner of its licensor, we might observe edge-level similarity to the ownership network. Although knowledge in the pharmaceutical industry might flow along with the flow of control, it is challenging to observe this in our MLN when it is caused by mergers and acquisitions (M&A). To test the above hypothesis, we compute the overlap of nodes and edges as
$$\begin{aligned} O(X_{\alpha },X_{\beta })=\frac{\left| X_{\alpha }\cap X_{\beta }\right| }{\left| X_{\alpha }\cup X_{\beta }\right| }~, \end{aligned}$$
(6)
where \(X_{\alpha }\) is the set of nodes/edges of the \(\alpha\)-th layer. That is, we measure the overlap as the number of nodes/edges appearing in both layers divided by the number of distinct nodes/edges appearing in at least one of the two layers. We ignore multiple edges and self-loops in the drug pipeline layer.
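Equation (6) is a Jaccard-type index, and a minimal Python sketch of it follows. The helper `simple_edges`, which drops self-loops and collapses parallel edges before the comparison, and the toy edge lists are hypothetical illustrations.

```python
def overlap(x_alpha, x_beta):
    """Overlap O(X_alpha, X_beta) = |intersection| / |union| of two sets."""
    a, b = set(x_alpha), set(x_beta)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def simple_edges(edges):
    """Drop self-loops and collapse parallel edges into single edges."""
    return {(i, j) for i, j in edges if i != j}

def node_set(edges):
    """All nodes that appear as an endpoint of some edge."""
    return {v for edge in edges for v in edge}

# Hypothetical toy layers.
pipeline_edges = [("A", "A"), ("A", "B"), ("A", "B"), ("C", "A")]
supply_edges = [("A", "B"), ("B", "C")]

edge_overlap = overlap(simple_edges(pipeline_edges), simple_edges(supply_edges))
node_overlap = overlap(node_set(pipeline_edges), node_set(supply_edges))
```

In this toy case both layers contain the same three nodes, but only one of the three distinct directed edges is shared.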
To evaluate the statistical significance of the edge overlap more precisely, we use a statistical test to compute the probability (p value) that the number of overlapping edges expected under the null hypothesis is larger than the observed value. Here, the null hypothesis is that the two layers have no edge overlap. First, we assume that the number x of overlapping edges of the \(\alpha\)-th layer follows the binomial distribution, \(x\sim \text {B}(n,p(\beta |\alpha ))\), where \(n=\left| E_{\alpha }\cup E_{\beta }\right|\). We define the conditional probability that an edge of the knowledge flow layer (\(\beta =1\)) also connects the same two nodes in the \(\alpha\)-th layer as
$$\begin{aligned} p(\beta =1|\alpha )=p_{1}\times p_{\alpha }=1\times \frac{\left<{\bar{k}}^{[\alpha ]}\right>}{\left| V_{\alpha }\right| }~, \end{aligned}$$
(7)
where \(\langle {\bar{k}}^{[\alpha ]}\rangle\) is half the average total degree of the \(\alpha\)-th layer. Therefore, the last factor on the right-hand side of this equation corresponds to the probability of finding an edge between a randomly selected pair of nodes. Here, the drug pipeline layer (\(\beta =1\)) is treated as an independent variable with respect to the \(\alpha\)-th layer, so \(p_{1}=1\) by definition.
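A rough sketch of this test in Python is given below. Several points are assumptions rather than the exact procedure of this paper: half the average total degree is computed as the number of edges divided by the number of nodes, the tail probability is taken inclusively as \(P(X\ge x_{\text {obs}})\), and the function names are hypothetical. The binomial terms are evaluated in log space so that large n does not overflow.

```python
from math import exp, lgamma, log

def edge_probability(num_edges_alpha, num_nodes_alpha):
    """p(beta=1 | alpha) from Eq. (7), taking p_1 = 1.

    Half the average total (in + out) degree equals |E| / |V|, so the
    probability is |E| / |V|^2 (an assumption about how it is computed).
    """
    half_mean_degree = num_edges_alpha / num_nodes_alpha
    return half_mean_degree / num_nodes_alpha

def binom_tail(x_obs, n, p):
    """Upper tail P(X >= x_obs) for X ~ B(n, p)."""
    if x_obs <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for k in range(x_obs, n + 1):
        # log of C(n, k) * p^k * (1 - p)^(n - k)
        log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                   + k * log(p) + (n - k) * log(1.0 - p))
        total += exp(log_pmf)
    return min(total, 1.0)

# Hypothetical toy numbers: a layer with 200 edges on 100 nodes,
# a union of 300 edges, and 12 observed overlapping edges.
p_edge = edge_probability(200, 100)
p_value = binom_tail(12, 300, p_edge)
```

A small p value indicates that the observed number of overlapping edges is unlikely to arise from randomly placed edges with the same density.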