The primary purpose of this study is to reveal the dynamics of the propagation and localization of knowledge in the drug development cycle by analyzing drug pipeline, supply chain, and ownership data. To this end, we constructed a multilayer network (MLN) that simultaneously represents three types of relationships between companies and institutions relevant to the knowledge flow in the drug pipeline. Before moving to the MLN analysis, we define the network of each layer.
Drug pipeline network
We represent the drug pipeline data acquired from Clarivate Analytics as a drug pipeline network \(G_1\) in Fig. 4. In the graph \(G_1=(V_1,E_1)\), \(V_1\) is the node set constructed from the companies, government institutions, and educational institutions listed in the drug pipeline data, and \(E_1\) is the edge set constructed from the licensor–licensee relations on drug pipelines. We denote a directed edge in the drug pipeline network as \(i\rightarrow j\), where node i is the licensor and node j is the licensee of the drug candidate in the drug pipeline data. As mentioned in Sect. 2, the nodes are business entities of various types, categorized as government institution, educational institution, private company, or public company. A single drug pipeline does not always define a single edge in the drug pipeline network, because the originator of a drug candidate may share the license with two companies. We denote the adjacency matrix of \(G_{1}\) as \(A^{[1]}=\left( a_{ij} ^{[1]}\right)\), where the element \(a_{{ij}}^{{[1]}}\) is the number of edges from node i to node j, and the superscript is the layer index. The in- and out-degrees of node i are defined by \(k^{[1]} _{\text {in},i} =\sum _j a_{ji} ^{[1]}\) and \(k^{[1]} _{\text {out},i} = \sum _j a_{ij} ^{[1]}\). In the graph \(G_1\), \(k^{[1]} _{\text {in},i}\) and \(k^{[1]} _{\text {out},i}\) represent the total number of drug pipelines that company i is dealing with and the total number of drug pipelines that company i has discovered, respectively. Moreover, the drug pipeline data contain cases in which the licensee of a drug pipeline is also its licensor. In other words, the graph \(G_1\) contains self-loops, and the number of self-loops of node i is \(\ell ^{[1]} _i= a_{ii} ^{[1]}\). To distinguish license transfers between different entities from these self-loops, we denote the number of edges of node i from/to other nodes as the in/out-degree with self-loops removed,
$$\begin{aligned} m^{[1]} _{\text {in},i}=k^{[1]} _{\text {in},i} - \ell ^{[1]} _i \quad \text {and}\quad m^{[1]} _{\text {out},i}=k^{[1]} _{\text {out},i} - \ell ^{[1]} _i~. \end{aligned}$$
(1)
Thus, \(m^{[1]} _{\text {in},i}\) and \(m^{[1]} _{\text {out},i}\) represent the number of drug candidate licenses that company i has obtained from others and the number that company i has transferred to others, respectively. Therefore, in the graph \(G_1\), the edges with \(i\ne j\) are knowledge flows in the drug pipelines, whereas the self-loops (\(i=j\)) represent the localization of knowledge within one company.
Furthermore, the drug pipeline data are a snapshot that records each pipeline at a particular development stage. Thus, we use the status in the drug development cycle as an edge attribute, \(E_1 = \bigcup _{p}E_1 ^{p}\), where \(E_1 ^p\) represents the drug pipelines at the status \(p\in \{\)Discovery, Phase I–III Clinical, Preregistration, Registered, Launched\(\}\). Accordingly, we add the status index p to the definitions of characteristics such as the in-degree of company i, \(k^{[1,p]} _{\text {in},i}\). For example, a launched drug whose license has not been transferred is represented as a self-loop with the Launched attribute in the network, and \(\ell ^{[1,p]} _i\) at \(p=\{\text {Launched}\}\) is the number of drugs that company i both discovered and launched. Although the data do not allow us to determine when the license of a drug candidate was transferred, we can observe the tendency of knowledge flows in the drug pipeline between business entities in the pharmaceutical area.
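As an illustration, the degree quantities of the drug pipeline layer can be computed from licensor–licensee records as in the following minimal Python sketch. The records, entity names, and the function name `layer_statistics` are hypothetical and are not taken from the Clarivate Analytics data; the sketch only assumes that each record carries a licensor, a licensee, and a status.

```python
from collections import Counter

# Hypothetical records: (licensor, licensee, status). Illustrative only.
records = [
    ("A", "A", "Launched"),           # A discovered and launched: self-loop
    ("A", "B", "Launched"),           # license shared with two companies
    ("A", "C", "Launched"),           #   -> two edges for one pipeline
    ("B", "B", "Discovery"),
    ("C", "B", "Phase II Clinical"),
]

def layer_statistics(records, status=None):
    """Return k_in, k_out, self-loops, m_in, m_out per node.

    If `status` is given, only edges with that status are counted,
    which corresponds to adding the index p to each quantity.
    """
    a = Counter()  # a[(i, j)] = number of edges from i to j
    for licensor, licensee, p in records:
        if status is None or p == status:
            a[(licensor, licensee)] += 1

    k_in, k_out, loops = Counter(), Counter(), Counter()
    for (i, j), weight in a.items():
        k_out[i] += weight
        k_in[j] += weight
        if i == j:
            loops[i] += weight

    nodes = set(k_in) | set(k_out)
    return {
        v: {
            "k_in": k_in[v],
            "k_out": k_out[v],
            "loops": loops[v],
            "m_in": k_in[v] - loops[v],    # licenses obtained from others
            "m_out": k_out[v] - loops[v],  # licenses transferred to others
        }
        for v in nodes
    }

launched = layer_statistics(records, status="Launched")
print(launched["A"])
```

In this toy example, company A has one Launched self-loop and two outgoing Launched edges, so its out-degree with self-loops removed is 2.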
Supply chain network
The supplier–customer relationship is the most important relationship at the company level in the real economy. We represent the supply chain data acquired from the S&P Capital IQ dataset as a supply chain network \(G_2\) in Fig. 4. In the graph \(G_2=(V_2,E_2)\), \(V_2\) is the node set constructed from the companies listed in the supply chain data, and \(E_2\) is the edge set constructed from their supplier–customer relations. The supply chain network is an unweighted directed network representing supply chain business relationships. We denote a directed edge as \(i\rightarrow j\) when company i is a supplier to company j. The in- and out-degrees of node i are defined by \(k^{[2]} _{\text {in},i} =\sum _j a_{ji} ^{[2]}\) and \(k^{[2]} _{\text {out},i} = \sum _j a_{ij} ^{[2]}\). Moreover, we denote the number of edges of node i from/to other nodes as the in/out-degree with self-loops removed, \(m^{[2]} _{\text {in},i}=k^{[2]} _{\text {in},i} - \ell ^{[2]} _i\) and \(m^{[2]}_{\text {out},i}=k^{[2]} _{\text {out},i} - \ell ^{[2]} _i\). In the graph \(G_2\), \(m^{[2]} _{\text {in},i}\) and \(m^{[2]} _{\text {out},i}\) represent the numbers of suppliers and customers of company i, respectively.
Ownership network
We define the ownership network \(G_3\) in Fig. 4 based on the S&P Capital IQ dataset. In the graph \(G_3=(V_3, E_3)\), \(V_3\) is the node set constructed from the companies listed in the ownership data, and \(E_3\) is the edge set constructed from their ownership relations. We denote an edge from company i to company j when company j has a stake in company i. The in- and out-degrees of node i are defined by \(k^{[3]} _{\text {in},i} =\sum _j a_{ji} ^{[3]}\) and \(k^{[3]} _{\text {out},i} = \sum _j a_{ij} ^{[3]}\). Moreover, we denote the number of edges of node i from/to other nodes as the in/out-degree with self-loops removed, \(m^{[3]} _{\text {in},i}=k^{[3]} _{\text {in},i} - \ell ^{[3]} _i\) and \(m^{[3]}_{\text {out},i}=k^{[3]} _{\text {out},i} - \ell ^{[3]} _i\). In the graph \(G_3\), \(m^{[3]} _{\text {in},i}\) and \(m^{[3]} _{\text {out},i}\) represent the number of companies in which company i holds shares and the number of shareholders of company i, respectively. Furthermore, the number of owners of each company is limited to 100 in the S&P Capital IQ dataset. Although the S&P Capital IQ dataset includes the ownership ratio, we define the ownership network as an unweighted directed network to increase the number of nodes shared between layers. Note that the edges of the ownership network in this paper represent the dependency flow, which is opposite to the direction typically used, because we assume that a firm’s knowledge tends to flow to its owners.
Combining datasets
To construct the MLN in Fig. 4, we must combine the drug pipeline data with the companies’ supply chain and ownership data. Because a company’s name is not always the same in the two datasets, we matched names between the two datasets by using information about the country and industry after removing abbreviations such as Inc and Corp. When multiple candidates arose, we checked them manually.
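The matching step can be sketched as follows. This is only an illustration under stated assumptions: the suffix list, the record fields (`id`, `name`, `country`), and the function names are hypothetical, and the sketch keys on the normalized name and country only, whereas the procedure described above also uses industry information.

```python
import re
from collections import defaultdict

# Assumed list of legal-form abbreviations to strip; the actual list
# used in the study is not specified beyond "Inc" and "Corp".
SUFFIXES = {"inc", "corp", "co", "ltd", "llc", "plc"}

def normalize_name(name):
    """Lower-case, drop punctuation, and strip trailing abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def match_entities(pipeline_rows, capiq_rows):
    """Match pipeline entities to Capital IQ entities by (name, country).

    Returns (matched, needs_review): unique matches, and cases with
    several candidates that are left for manual checking.
    """
    index = defaultdict(list)
    for row in capiq_rows:
        key = (normalize_name(row["name"]), row["country"])
        index[key].append(row["id"])

    matched, needs_review = {}, {}
    for row in pipeline_rows:
        key = (normalize_name(row["name"]), row["country"])
        candidates = index.get(key, [])
        if len(candidates) == 1:
            matched[row["id"]] = candidates[0]
        elif len(candidates) > 1:
            needs_review[row["id"]] = candidates
    return matched, needs_review
```

Entities left in `needs_review` correspond to the cases with multiple candidates that were checked manually.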
MLN representation
In this subsection, we generalize the definitions of our networks using an MLN framework. Generally, an MLN is a pair \({\mathcal {M}}=({\mathcal {G}},{\mathcal {C}})\), where \({\mathcal {G}}=\{G_{\alpha }; \alpha \in \{1,\cdots , M \} \}\) is a family of graphs \(G_{\alpha }=(V_{\alpha },E_{\alpha })\), \(V_{\alpha }\) is the set of nodes of layer \(G_{\alpha }\), and M is the number of layers. \({\mathcal {C}}=\{E_{\alpha \beta }\subseteq V_{\alpha }\times V_{\beta }; \alpha ,\beta \in \{1,\cdots , M \}, \alpha \ne \beta \}\) is the set of interconnections between the nodes of different layers \(G_{\alpha }\) and \(G_{\beta }\) with \(\alpha \ne \beta\). We define an MLN with \(M=3\), composed of companies and institutions as nodes and three types of interactions between them, by using the drug pipeline, supply chain, and ownership data. Because each layer of our MLN is defined by a type of interaction, nodes overlap between layers but there are no interconnections between the nodes of different layers: \({\mathcal {C}}=\emptyset\). The graphs \(G_{\alpha }\) for each layer are defined as

\(G_{1}\): Drug pipeline network

\(G_{2}\): Supply chain network

\(G_{3}\): Ownership network
where the definitions for each layer are explained in the previous subsections. Figure 4 displays the conceptual representation of the MLN.
The adjacency matrix of each layer \(G_{\alpha }\) is denoted by \(A^{[\alpha ]}=\left( a_{ij} ^{[\alpha ]}\right)\), where the element \(a_{ij} ^{[\alpha ]}\) is the number of edges from node i to node j in the \(\alpha\)th layer. The in- and out-degrees of a node i of the MLN are defined as vectors:
$$\begin{aligned} \varvec{k}_{\text {in},i} = \left( k^{[1]} _{\text {in},i}, ~~k^{[2]} _{\text {in},i}, ~~k^{[3]} _{\text {in},i}\right) ~~~\text {and}~~~~\varvec{k}_{\text {out},i} = \left( k^{[1]} _{\text {out},i}, ~~k^{[2]} _{\text {out},i}, ~~k^{[3]} _{\text {out},i}\right) , \end{aligned}$$
(2)
where \(k^{[\alpha ]} _{\text {in},i}\) and \(k^{[\alpha ]} _{\text {out},i}\) are the in- and out-degrees of node i in the \(\alpha\)th layer, i.e., \(k^{[\alpha ]} _{\text {in},i} =\sum _j a_{ji} ^{[\alpha ]}\) and \(k^{[\alpha ]} _{\text {out},i} = \sum _j a_{ij} ^{[\alpha ]}\). Similarly, we denote the number of self-loops of node i in the \(\alpha\)th layer as \(\ell ^{[\alpha ]} _i= a_{ii} ^{[\alpha ]}\). In this paper, self-loops must be counted as flows because self-loops in the drug pipeline network \(G_1\) correspond to the accumulation of knowledge. Thus, we count the flows in each layer as both the edges between two distinct nodes and the self-loops. The number of edges of node i from/to other nodes is denoted as
$$\begin{aligned} m^{[\alpha ]} _{\text {in},i}=k^{[\alpha ]} _{\text {in},i} - \ell ^{[\alpha ]} _i ~~~~~~\text {and}~~~~~~~m^{[\alpha ]} _{\text {out},i}=k^{[\alpha ]} _{\text {out},i} - \ell ^{[\alpha ]} _i~, \end{aligned}$$
(3)
and the total numbers of edges between distinct nodes and of self-loops in the \(\alpha\)th layer are
$$\begin{aligned} M_{\alpha }=\frac{1}{2}\sum _{i,j=1,i\ne j} ^{N_{\alpha }} a_{ij} ^{[\alpha ]}~~~~~~\text {and}~~~~~~~L_{\alpha }=\sum _i^{N_{\alpha }} a_{ii} ^{[\alpha ]}~, \end{aligned}$$
(4)
where \(N_{\alpha }\) is the number of nodes in the \(\alpha\)th layer. Notably, we add the status index p to the definitions of the first-layer characteristics, such as the in-degree of the ith node \(k^{[1,p]} _{\text {in},i}\) and the total number of edges \(M^{p}_1\).
Bow tie structure
As shown in the edge-level analysis in Sect. 2, the knowledge flow in the drug pipeline appears to circulate at the country level; however, whether the flows connect at the company level is unclear. Generally, the giant weakly connected component (GWCC) of a directed network can be decomposed into the giant strongly connected component (GSCC), which is the largest strongly connected component (SCC) in the GWCC, and its upstream and downstream portions (IN and OUT). This is known as the bow tie decomposition of the Web (Broder et al. 2000). The decomposition helps us understand the hierarchical and circular flows of the networks from a macroscopic perspective.
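The decomposition can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than the implementation used in this study; the function names and the label OTHER for the remaining GWCC nodes (those that are neither in the GSCC nor in IN or OUT) are choices made for this sketch. The SCC of a node is obtained as the intersection of its forward- and backward-reachable sets, which is simple but slower than Tarjan-type algorithms.

```python
from collections import defaultdict

def reachable(start_nodes, adjacency):
    """Return all nodes reachable from start_nodes (inclusive)."""
    seen = set(start_nodes)
    stack = list(start_nodes)
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(edges):
    """Decompose the GWCC of a directed graph into GSCC, IN, OUT, OTHER."""
    succ, pred, undirected = defaultdict(set), defaultdict(set), defaultdict(set)
    nodes = set()
    for i, j in edges:
        nodes.update((i, j))
        if i != j:  # self-loops do not affect connectivity
            succ[i].add(j)
            pred[j].add(i)
            undirected[i].add(j)
            undirected[j].add(i)

    # Giant weakly connected component: largest component ignoring direction.
    remaining, gwcc = set(nodes), set()
    while remaining:
        component = reachable([next(iter(remaining))], undirected)
        remaining -= component
        if len(component) > len(gwcc):
            gwcc = component

    # Giant strongly connected component: largest SCC inside the GWCC.
    remaining, gscc = set(gwcc), set()
    while remaining:
        v = next(iter(remaining))
        component = reachable([v], succ) & reachable([v], pred)
        remaining -= component
        if len(component) > len(gscc):
            gscc = component

    out_part = reachable(gscc, succ) - gscc  # downstream of the GSCC
    in_part = reachable(gscc, pred) - gscc   # upstream of the GSCC
    other = gwcc - gscc - in_part - out_part
    return {"GWCC": gwcc, "GSCC": gscc, "IN": in_part,
            "OUT": out_part, "OTHER": other}
```

If a layer contains no directed cycle, every SCC is a single node and the GSCC returned by this sketch is an arbitrary one of them, so the decomposition is informative only when a nontrivial GSCC exists.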
Community structure
Besides the macroscopic structure captured by the bow tie decomposition, community detection is a powerful tool for explaining the structural properties of densely connected networks. We compared the structural properties between layers by using node attributes such as country and primary industry. The detailed analysis of the community structure of each network is presented in Sect. 6.2; here, we explain the method used to extract the community structure of a network.
To find communities in the GWCC of each layer, we use the map equation method (Rosvall and Bergstrom 2008), known as Infomap, which is one of the best-performing community detection methods (Lancichinetti and Fortunato 2009). The map equation method is a flow-based, information-theoretic approach that seeks the module partition \({\mathcal {M}}\), dividing n nodes into m communities, that minimizes the description length of a random walk on the network. The average single-step description length is defined as
$$\begin{aligned} L({\mathcal {M}})=q_{\curvearrowleft }H({\mathcal {Q}})+\sum ^{m}_{i=1}p_{i\circlearrowright }H({\mathcal {P}}_i)~. \end{aligned}$$
(5)
The first term arises from the movements of the random walker across modules, where \(q_{\curvearrowleft }\) is the probability that the random walker switches communities, and \(H({\mathcal {Q}})\) is the average description length of the community index codewords, given by the Shannon entropy. The second term arises from the intra-community movements of the random walker, where the weight \(p_{i\circlearrowright }\) represents the fraction of the movements within community i, and \(H({\mathcal {P}}_i)\) represents the entropy of the intra-community movements. Furthermore, this method has been extended to a hierarchical map equation (Rosvall and Bergstrom 2011) that decomposes a network into communities and subcommunities.
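To make the two terms of Eq. (5) concrete, the following Python sketch evaluates the two-level description length for a given partition. It is a simplified illustration, not the Infomap software: it assumes an undirected, unweighted network without self-loops, for which the stationary visit rate of a node is proportional to its degree, whereas the layers analyzed here are directed. The function names and the toy network are hypothetical.

```python
import math
from collections import defaultdict

def entropy(rates):
    """Shannon entropy (bits) of the rates after normalizing them to sum to 1."""
    total = sum(rates)
    if total <= 0:
        return 0.0
    return -sum((r / total) * math.log2(r / total) for r in rates if r > 0)

def map_equation(edges, module_of):
    """Two-level description length L(M) for an undirected, unweighted graph."""
    degree = defaultdict(int)
    exit_count = defaultdict(int)  # edge ends leaving each module
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            exit_count[module_of[u]] += 1
            exit_count[module_of[v]] += 1

    two_m = sum(degree.values())
    visit = {v: d / two_m for v, d in degree.items()}  # node visit rates
    modules = {module_of[v] for v in degree}
    q = {c: exit_count[c] / two_m for c in modules}    # module exit rates

    # First term: movements between modules.
    q_total = sum(q.values())
    length = q_total * entropy(list(q.values()))

    # Second term: movements within each module, including the exit codeword.
    for c in modules:
        rates = [visit[v] for v in visit if module_of[v] == c]
        p_within = q[c] + sum(rates)
        length += p_within * entropy([q[c]] + rates)
    return length

# Toy network: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_module = {v: 0 for v in range(6)}
two_modules = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(map_equation(toy_edges, one_module), map_equation(toy_edges, two_modules))
```

For the toy network, splitting the two triangles into separate modules gives a shorter description length than keeping all nodes in one module, which is the sense in which the map equation favors partitions that trap the random walker.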
We detect the hierarchical communities by using the multicoding Infomap method, and we use the “Level” index to represent the hierarchy of communities; communities at the 2nd level are subcommunities of those at the 1st level. To characterize the hierarchical communities, we use the node attributes country, company type, and bow tie component for the drug pipeline layer, and country, primary industry, and bow tie component for the supply chain and ownership layers.
Interlayer degree correlations
It is reasonable to assume that a hub node in the drug pipeline network could also be a hub in the supply chain or ownership network. To verify this assumption, we investigate the degree correlations of each node between different layers. However, we must note that the edges in the drug pipeline layer \(E_1 ^{p}\) are characterized by the status \(p\in \{\)Discovery, Phase I–III Clinical, Preregistration, Registered, Launched\(\}\), and that drug pipelines that have not reached the launched stage could disappear during the drug development cycle. Therefore, we focus only on the launched case here. Furthermore, the number of incoming edges of each company in the drug pipeline layer equals the number of drugs with which it is dealing. From this perspective, we divide the edges of interest in the drug pipeline layer into two types:

Self-loops, \(\ell _i^{[1,p]}\) at \(p=\{\text {Launched}\}\), corresponding to the number of launched drugs that company i discovered as the licensor and has launched itself. In this paper, we define \(\ell _i^{[1,p]}\) at \(p=\{\text {Launched}\}\) as representing closed innovation.

In-degrees with self-loops removed, \(m_{\text {in},i}^{[1,p]}\) at \(p=\{\text {Launched}\}\), corresponding to the number of launched drugs of which company i was not the original licensor but which it holds as the licensee. In this paper, we define \(m_{\text {in},i}^{[1,p]}\) at \(p=\{\text {Launched}\}\) as representing open innovation.
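A degree correlation between two layers can be computed over the nodes that appear in both, as in the following sketch. The choice of coefficient is an assumption of this sketch: the text above does not state which correlation measure is used, and the Pearson coefficient is used here only for simplicity. The function names and the dictionaries in the usage example are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def interlayer_correlation(degree_a, degree_b):
    """Correlate two per-node degree dictionaries over their shared nodes."""
    shared = sorted(set(degree_a) & set(degree_b))
    if len(shared) < 2:
        return float("nan")
    return pearson([degree_a[v] for v in shared],
                   [degree_b[v] for v in shared])
```

For example, `interlayer_correlation(closed, suppliers)` would compare the Launched self-loop counts of the drug pipeline layer with the in-degrees of the supply chain layer, where `closed` and `suppliers` map company identifiers to those values.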
Node and edge overlap
Although the dynamics of knowledge flows in the drug pipeline between companies remain unclear, we can propose a hypothesis about edge-level similarity. When drug pipelines develop in an overly closed setting, such that pharmaceutical companies do not do business with others in the same industry, edge-level similarity to the supply chain layer does not appear. However, alliances between pharmaceutical companies, i.e., open innovation in the pharmaceutical industry, could share not only markets but also drug pipelines, which would appear as edge-level similarity to the supply chain layer. Furthermore, if a drug pipeline tends to be transferred to the owner of its licensor, we might observe edge-level similarity to the ownership network. Although knowledge in the pharmaceutical industry might flow along with the flow of control, such a flow is difficult to observe in our MLN when it is caused by mergers and acquisitions (M&A). To test the above hypothesis, we compute the overlap of nodes and edges as
$$\begin{aligned} O(X_{\alpha },X_{\beta })=\frac{\left| X_{\alpha }\cap X_{\beta }\right| }{\left| X_{\alpha }\cup X_{\beta }\right| }~, \end{aligned}$$
(6)
where \(X_{\alpha }\) is the set of nodes/edges in the \(\alpha\)th layer. That is, we measure the overlap as the fraction of nodes/edges appearing in both layers over the number of nodes/edges appearing in at least one of the two layers. We ignore multiple edges and self-loops in the drug pipeline layer.
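The overlap measure amounts to a Jaccard index on node or edge sets, which the following short Python sketch illustrates. The function names and the toy edge lists are hypothetical; `simple_edges` implements the stated convention of ignoring multiple edges and self-loops.

```python
def overlap(x_alpha, x_beta):
    """Jaccard overlap |X_a & X_b| / |X_a | X_b| of two collections."""
    a, b = set(x_alpha), set(x_beta)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def simple_edges(edges):
    """Collapse multiple edges and drop self-loops, keeping direction."""
    return {(i, j) for i, j in edges if i != j}

def node_set(edges):
    """All endpoints that appear in an edge list."""
    return {v for edge in edges for v in edge}

# Toy layers (illustrative only).
pipeline = [("A", "B"), ("A", "B"), ("A", "A"), ("B", "C")]
supply = [("A", "B"), ("C", "D")]

edge_overlap = overlap(simple_edges(pipeline), simple_edges(supply))
node_overlap = overlap(node_set(pipeline), node_set(supply))
print(edge_overlap, node_overlap)
```

In the toy layers, one of the three distinct directed edges is shared, and three of the four nodes are shared.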
To evaluate the statistical significance of the edge overlap more precisely, we use a statistical test to compute the probability (p value) that the expected number of overlapped edges is larger than the observed value. Here, the null hypothesis is that there is no edge overlap between the two layers. First, we assume that the number x of overlapping edges of the \(\alpha\)th layer obeys the binomial distribution, \(x\sim \text {B}(n,p(\beta \mid \alpha ))\), where \(n=\left| E_{\alpha }\cup E_{\beta }\right|\). We define the conditional probability that an edge connects two nodes in both the knowledge flow layer (\(\beta =1\)) and the \(\alpha\)th layer as
$$\begin{aligned} p(\beta =1\mid \alpha )=p_{1}\times p_{\alpha }=1\times \frac{\left\langle {\bar{k}}^{[\alpha ]}\right\rangle }{\left| V_{\alpha }\right| }~, \end{aligned}$$
(7)
where \(\langle {\bar{k}}^{[\alpha ]}\rangle\) is half of the average total degree of the \(\alpha\)th layer. Therefore, the last term on the right-hand side of this equation corresponds to the probability of finding an edge between a randomly selected pair of nodes. Here, the drug pipeline layer (\(\beta =1\)) is an independent variable with respect to the \(\alpha\)th layer, so that \(p_{1}=1\) by definition.
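The test can be sketched numerically as follows. Two readings are assumed here and should be checked against the actual analysis: first, since the total (in plus out) degrees of a layer sum to \(2|E_{\alpha }|\), half of the average total degree equals \(|E_{\alpha }|/|V_{\alpha }|\); second, the p value is taken as the upper tail \(P(X\ge x_{\text {obs}})\) of the binomial distribution. The function names are hypothetical, and the sum is evaluated in log space to stay stable for large n.

```python
import math

def edge_probability(n_edges, n_nodes):
    """p_alpha = <k_bar> / |V|, with <k_bar> = |E| / |V| (assumed reading)."""
    return (n_edges / n_nodes) / n_nodes

def binomial_upper_tail(x_obs, n, p):
    """P(X >= x_obs) for X ~ B(n, p), summed in log space."""
    if x_obs <= 0:
        return 1.0
    if x_obs > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_terms = []
    for k in range(x_obs, n + 1):
        log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                      - math.lgamma(n - k + 1))
        log_terms.append(log_choose + k * math.log(p)
                         + (n - k) * math.log(1.0 - p))
    peak = max(log_terms)  # factor out the largest term before exponentiating
    total = math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)
    return min(1.0, total)
```

A small p value from `binomial_upper_tail(x_obs, n, edge_probability(...))` indicates that the observed number of overlapping edges is unlikely to arise from randomly placed edges alone.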