Modeling self-propagating malware with epidemiological models
Applied Network Science volume 8, Article number: 52 (2023)
Abstract
Self-propagating malware (SPM) is responsible for large financial losses and major data breaches with devastating social impacts that cannot be overstated. Well-known campaigns such as WannaCry and Colonial Pipeline have been able to propagate rapidly on the Internet and cause widespread service disruptions. To date, the propagation behavior of SPM is still not well understood. As a result, our ability to defend against these cyber threats is still limited. Here, we address this gap by performing a comprehensive analysis of a newly proposed epidemiologically inspired model for SPM propagation, the Susceptible-Infected-Infected Dormant-Recovered (SIIDR) model. We perform a theoretical analysis of the SIIDR model by deriving its basic reproduction number and studying the stability of its disease-free equilibrium points in a homogeneously mixed system. We also characterize the SIIDR model on arbitrary graphs and discuss the conditions for stability of disease-free equilibrium points. We obtain access to 15 WannaCry attack traces generated under various conditions, derive the model’s transition rates, and show that SIIDR fits the real data well. We find that the SIIDR model outperforms more established compartmental models from epidemiology, such as SI, SIS, and SIR, at modeling SPM propagation.
Introduction
Self-propagating malware (SPM) is one of today’s most concerning cybersecurity threats. Over the past years, SPM has resulted in huge financial losses and data breaches with high economic and societal impacts. For instance, the infamous WannaCry (Mike Azzara 2021) attack, first discovered in 2017 and still actively used by attackers today, was estimated to have affected more than 200,000 computers across 150 countries worldwide, with economic damages ranging from hundreds of millions to billions of dollars. In May 2021, the Colonial Pipeline (Wikipedia 2023a) cyberattack caused the shutdown of the entire Colonial gasoline pipeline system for several days. It affected consumers and airlines along the East Coast of the United States and was deemed a national security threat. Another remarkable worldwide SPM attack is Petya (Wikipedia 2023b), first discovered in 2016 when it started spreading through phishing emails. Petya represents a family of various types of ransomware responsible for estimated economic damages of over 10 million dollars (Wikipedia 2023b).
Given the current cybercrime landscape, with new threats emerging daily, tools designed for modeling SPM behavior become crucial. Indeed, a deep understanding of self-propagating malware characteristics provides us opportunities to identify threats, test control strategies, and design proactive defenses against attacks. A large body of research on the subject so far has been devoted to the design of methods to detect and mitigate self-propagating malware. Proposed techniques include network traffic signatures (Kim and Karp 2004; Kumar and Lim 2020; Ongun et al. 2021; Newsome et al. 2005) and host-level binary analysis (Chen and Bridges 2017; Ben Said et al. 2018) used to identify anomalous behavior, software-defined networking (SDN) for ransomware threat detection and mitigation (Akbanov et al. 2019; Alotaibi and Vassilakis 2021), as well as evasion-resilient methods for detecting adaptive worms (Li and Stafford 2014; Newsome et al. 2005; Ongun et al. 2021). However, less attention has been dedicated to comparing and finding the most suitable models to capture SPM behavior. Additionally, the majority of existing works on SPM modeling focus on theoretical analyses of infection spreading (Guillén et al. 2017; Guillén and del Rey 2018; Mishra and Saini 2007; Martínez Martínez et al. 2021), lacking a thorough real-world evaluation of these models.
In this paper, we model the behavior of a well-known SPM attack, WannaCry, based on real-world attack traces. The similarities between the behavior of biological and computer viruses enable us to leverage compartmental models from epidemiology. We adopt a novel compartmental epidemic model called SIIDR (Chernikova et al. 2022), and conduct a thorough analysis to show that it can be used to accurately model SPM spreading dynamics.
First, we study the model assuming a homogeneous mixing of hosts and analytically derive its basic reproduction number \(R_0\) (Dietz 1993; Kephart and White 1993; Van den Driessche and Watmough 2008). \(R_0\) is the number of secondary cases generated by an infectious seed in a fully susceptible population. It describes the epidemic threshold, thus, the conditions necessary for a macroscopic outbreak (\(R_0 > 1\)) (Fraser et al. 2009; Van den Driessche and Watmough 2008). We also investigate equilibrium or fixed points of SIIDR as they provide insights on how to contain or suppress the spreading.
Additionally, computer networks are often represented as graphs, where nodes denote the hosts in the network and edges represent the communication links between them. In any static graph, the propagation of contagion processes depends not only on the transition rates of SPM but also on the spectral properties of the graph (Newman 2018). To discuss the characteristics of SIIDR that determine the ability of SPM to propagate through the network in these settings, we represent the SIIDR model as a Non-Linear Dynamical System (NLDS), thereby relaxing the homogeneous mixing assumption.
Finally, we reconstruct the dynamics of WannaCry spreading by analysing real traffic logs. We use the Akaike Information Criterion (AIC) (Akaike 1974) to compare how different compartmental models fit the derived epidemic traces. We show that SIIDR captures malware spreading better than classical epidemic models such as SI, SIS, and SIR. Indeed, the investigation of real WannaCry attacks showed that consecutive infection attempts originating from the same host are delayed by a variable time interval. This finding suggests the existence of a “dormant” infected state, in which infected hosts temporarily cease to pass the infection to their neighbors. Furthermore, by calibrating the model to the real data via an Approximate Bayesian Computation technique, we determine the transition rates (i.e., model parameters) that characterize WannaCry propagation.
To summarize, our contributions are the following:

We derive the basic reproduction number of the SIIDR model (Chernikova et al. 2022) and discuss the stability conditions of the disease-free equilibrium points of the system of ODEs that represents SIIDR under a homogeneous mixing assumption.

We derive the conditions for stability of the SIIDR disease-free equilibrium points on arbitrary graphs, thus relaxing the homogeneous mixing assumption.

We reconstruct the spreading dynamics of an actual SPM (WannaCry) using real-world traces obtained by running a vulnerable version of Windows in a virtual environment.

We show that SIIDR outperforms several classical models in terms of capturing WannaCry behavior, and derive the model’s transition rates from actual attacks.
We organize the rest of the paper as follows: first, we provide background information about the WannaCry malware and the most common compartmental models of epidemiology. We also define the threat model and problem statement. Then we introduce the SIIDR model, and discuss the derivation of \(R_0\) and the stability of the disease-free equilibrium points. In addition, we present the experimental results that support the findings of the paper. Table 1 includes common terminology used in the paper.
Background and problem statement
WannaCry malware
WannaCry is a self-propagating malware attack that targets computers running the Microsoft Windows operating system by encrypting data and demanding ransom in Bitcoin. It automatically spreads through the network and scans for vulnerable systems, using the EternalBlue exploit to gain access, and the DoublePulsar backdoor tool to install and execute a copy of itself. WannaCry has a ‘kill switch’ that appears to work as follows: part of WannaCry’s infection routine involves sending a request that checks for a web domain. If the request returns showing that the domain is alive or online, the ‘kill switch’ is activated, prompting WannaCry to exit the system and no longer proceed with its propagation and encryption routines. Otherwise, if the malicious program cannot connect to the domain, it encrypts the computer’s data, then attempts to exploit a vulnerability in the Server Message Block protocol to spread out to random computers on the Internet, and laterally to computers on the same network (Wikipedia 2023c).
Epidemiological models
Compartmental epidemiological models are used to model the spread of infectious diseases (Brauer 2008; Keeling and Rohani 2008). This approach segments the population into groups (compartments) describing the various stages of infection. The compartmental structure varies according to the disease under study and the application of the model. Following disease evolution, individuals can transition at specific rates among compartments. Generally speaking, these transitions can be either spontaneous (e.g., recovery process) or resulting from interactions (e.g., infection process). In their simplest formulation, compartmental models assume homogeneous mixing. Said differently, each individual is potentially in contact with everyone else (Vespignani 2012).
The most common compartmental models are the SI, SIS, SIR and SEIR models. In Appendix 1 we will briefly review the formulation of these models by neglecting demographic changes in the population (i.e., the number of individuals is assumed to be fixed). More in detail, we represent them as systems of Ordinary Differential Equations (ODEs). This is a common approach to model epidemics in continuous time, even though it approximates the number of individuals in different compartments as continuous functions.
Problem statement and threat model
The objective of our work is to provide a rigorous mathematical analysis of realistic SPM attacks, and thus lay the foundation for efficient defense strategies against these prevalent threats. Several works propose models to capture the behavior of SPM (Guillén et al. 2017; Guillén and del Rey 2018; Mishra and Saini 2007; Martínez Martínez et al. 2021); however, the vast majority of them offer only theoretical analysis and do not incorporate information about real-world SPM traces. Thus, they lack validation in real-world scenarios. Additionally, it is hard to compare these models with others when their performance on real-world data is not reported. Existing work that uses actual malware traces for modeling SPM (Levy et al. 2020) leverages minimal epidemiological models that, in their simplicity, fail to fully capture malware characteristics. To this end, here we use a more advanced compartmental model (called SIIDR) to describe epidemics resulting from SPM and apply it to real-world attack traces from a well-known malware, WannaCry.
Besides studying different epidemiological models according to their suitability to describe WannaCry epidemics, our second goal is to infer the parameters of the SIIDR epidemic model for different malware variants. Parameter inference is crucial for enabling attack simulations on real networks to measure the impact of the attack, as well as the effectiveness of defensive measures. Indeed, once the parameters of the attack are known, an analyst could estimate the basic reproduction number of the attack, and understand whether the attack might result in a macroscopic outbreak. Similarly, a defender might configure its network topology by performing edge or node hardening (Le et al. 2015; Tong et al. 2012; Torres et al. 2021), minimizing the leading eigenvalue of the graph to prevent the damage from self-propagating malware attacks, or using anomaly detection methods to detect the malware propagation (Ongun et al. 2021).
In this work, our focus is on modeling SPM propagation inside a local network (e.g., enterprise network, campus network) since we do not have global visibility on SPM propagation across different networks. We assume that the attacker gets a foothold inside the local network through a single initially infected host. From the ‘patient zero’ victim, the attack can propagate and infect other vulnerable machines in the subnet. We initially assume a homogeneous mixing model, meaning that every machine can contact all others. This is a valid assumption because in a subnet every machine is able to scan every other internal IP within the same subnet. We assume that none of the machines is immune to the exploited vulnerability at the beginning of the attack; thus, all of them may become infected during SPM propagation. Infectious machines become recovered when the malware is successfully detected and an efficient recovery process removes it. We assume that these machines cannot be reinfected. We then relax the homogeneous mixing assumption and characterize the behavior of the model on arbitrary graphs, considering that contacts between any two nodes in a network do not occur randomly with equal probability; instead, each node communicates with a particular subset of nodes in the network.
Related work
Numerous works propose to simulate and model malware propagation at different levels of fidelity and scalability (Perumalla and Sundaragopalan 2004). The research on modeling malware and worm propagation includes hardware testbeds (Vahdat et al. 2002; White et al. 2002), emulation systems (Durst et al. 1999; Wei et al. 2010), packet-level simulations (Riley et al. 2004; Szymanski et al. 2003), fully-virtualized environments (Perumalla and Sundaragopalan 2004), mixed abstraction simulations (Guo et al. 2000; Kiddle et al. 2003), and epidemic models. In our work we focus on this last line of research. Similarly, Mishra and Jha (2010) introduce the SEIQRS (Susceptible-Exposed-Infectious-Quarantined-Recovered-Susceptible) model for viruses and study the effect of the quarantined compartment on the number of recovered nodes. In their paper, the authors focus on the analysis of the threshold that determines the outcome of the disease. Mishra and Pandey (2014) introduce the SEISV model for viruses with a vaccinated state, while Mishra and Saini (2007) study the SEIRS model to characterize the malicious objects’ free equilibrium, formulating the stability of the results in terms of the threshold parameter. Toutonji et al. (2012) propose a VEISV (Vulnerable-Exposed-Infectious-Secured-Vulnerable) model and use the reproduction rate to derive global and local stability. With the help of simulation, they show the positive impact of increasing security countermeasures in the vulnerable state on worm-exposed and infectious propagation waves. Guillén et al. (2019) introduce a SCIRAS (Susceptible-Carrier-Infectious-Recovered-Attacked-Susceptible) model. The authors study the local and global stability of its equilibrium points and compute the basic reproductive number. Ojha et al. (2021) develop a new SEIQRV (Susceptible-Exposed-Infected-Quarantined-Recovered-Vaccinated) model to capture the behavior of malware attacks in wireless sensor networks. In their work, the authors obtain the equilibrium points of the proposed model, analyze the system stability under different conditions, and verify the performance of the model through simulations. Zheng et al. (2020) introduce the SLBQR (Susceptible-Latent-Breaking out-Quarantined-Recovered) model considering vaccination strategies with temporary immunity as well as quarantine strategies. The authors study the stability of the model, investigate a strategy based on quarantines aimed at suppressing the spread of the virus, and discuss the effect of the vaccination on permanent immunity. In order to verify their findings, the authors simulate the model exploring a range of temporary immune times and quarantine rates.
Recently, several attempts have been made to enhance the realism of the epidemic models. For instance, Guillén et al. (2017) study the SEIRS model with an improved incidence rate (i.e., new infected hosts per time unit). Additionally, the equilibrium points are computed and their local and global stability are studied. Finally, the authors derive the explicit expression of the basic reproductive number and propose efficient measures to control the epidemics. Martínez Martínez et al. (2021) introduce a dynamic version of SEIRS. The authors look at the performance of the model with different sets of parameters, propose optimal values, and discuss its applicability to model real-world malware. Gan et al. (2020) propose a dynamical SIP (Susceptible-Infected-Protected) model, find an equilibrium point, and discuss its local and global stability. Additionally, the authors perform numerical simulations of the model to demonstrate the dependency on parameter values. Yao et al. (2018) present a time-delayed worm propagation model with variable infection rate. They analyze the stability of equilibrium and the threshold of Hopf bifurcation. The authors carry out the numerical analysis and simulation of the model.
Some papers explore malware propagation on networks comprised of different types of devices. For instance, Guillén and del Rey (2018) consider the special class of carrier devices whose operating systems are not targeted by malware (for example, iOS devices for Android malware); the authors introduce a new compartment (Carrier) to account for these devices, and analyze efficient control measures based on the basic reproductive number. Zhu et al. (2012) take into consideration the ability of viruses to infect not only computers, but also many kinds of external removable devices; in their model, internal devices can be in Susceptible, Infected, and Recovered states, while removable devices can be in Susceptible and Infected states.
None of these previous works performs model fitting to real-world malware scenarios; they only consider theoretical analyses of the proposed models. The closest to our work is Levy et al. (2020); the authors use real traces to fit malware propagation with SIR, a simplistic model that, as we show, performs poorly compared to SIIDR and fails to capture self-propagating malware dynamics.
Analysis of the SIIDR model
In this section, we introduce the main characteristics of WannaCry propagation dynamics and the proposed modeling framework (the SIIDR model), and we discuss its basic reproduction number and the stability of its disease-free equilibrium points. Table 2 includes common terminology used in this section.
SPM modeling with the SIIDR model
A detailed analysis of the WannaCry traces (Chernikova et al. 2022) revealed the following characteristics:

The time interval \(\Delta t\) between two consecutive malicious attempts from the same infected IP is not constant and has high variability. This intuition is supported by the results in Fig. 1, where we show the quartile coefficient of dispersion (QCoD) of these \(\Delta t\) for different WannaCry variants. The QCoD is defined as \((Q_3 - Q_1) / (Q_3 + Q_1)\). As a benchmark, we show the hypothetical QCoD of exponentially distributed \(\Delta t\) with the same mean observed in the data. We chose the exponential distribution since the time intervals between Poisson-like events happening at a constant rate follow this distribution. From the figure we see that the QCoD of \(\Delta t\) obtained from the data is much higher (\(\sim 50\%\) more across variants) than the one we would expect to see with constant-rate events.

The time interval \(\Delta t\) between the last attack from an infected IP and the end of the collected trace is large. The average values of \(\Delta t\) between two consecutive malicious attempts and \(\Delta t\) between the last attack attempt from an infected IP and the end of the epidemic are shown in Fig. 2. The mean value of \(\Delta t\) in the second case is much larger than the mean \(\Delta t\) between two consecutive attack attempts.
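The dispersion measure used in the first observation can be sketched in a few lines of Python. This is only an illustration, not the analysis pipeline used for Fig. 1: the function names, the sample intervals, and the choice of the "inclusive" quantile method are assumptions. The sketch also makes explicit that the QCoD of an exponential distribution does not depend on its rate, because both quartiles scale with the mean.

```python
import math
import statistics


def qcod(samples):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return (q3 - q1) / (q3 + q1)


def exponential_qcod():
    """QCoD of an exponential distribution; the rate cancels out."""
    q1 = math.log(4 / 3)  # first quartile, in units of the mean
    q3 = math.log(4)      # third quartile, in units of the mean
    return (q3 - q1) / (q3 + q1)


# Hypothetical inter-attempt intervals in seconds (made-up numbers).
intervals = [1, 2, 3, 4, 5, 6, 7]
print("empirical QCoD:", qcod(intervals))
print("exponential benchmark:", exponential_qcod())
```

Comparing the empirical value against the benchmark for each variant is the kind of check that Fig. 1 summarizes.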
Based on the first observation, an infected dormant state \(I_D\) is included to capture the heterogeneous distribution of time windows between two malicious attack attempts. Therefore, an infected node can become dormant for some period of time and resume its malicious activity later. The second observation supports the presence of a Recovered state: once nodes recover, they will not become infectious or susceptible again, at least within a certain observation period. The transition diagram corresponding to the SIIDR model is illustrated in Fig. 3. Interacting with the infectious, a susceptible node can become infected with rate \(\beta\), and afterwards, it may either recover with rate \(\mu\), or move to the dormant state with rate \(\gamma _1\). From the dormant state, it may become actively infectious again with rate \(\gamma _2\).
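The evolution of the system can be modeled through the following system of ODEs:
\[\begin{aligned} \frac{dS}{dt} &= -\beta \frac{S I}{N}, \\ \frac{dI}{dt} &= \beta \frac{S I}{N} - (\mu + \gamma _1) I + \gamma _2 I_D, \\ \frac{dI_D}{dt} &= \gamma _1 I - \gamma _2 I_D, \\ \frac{dR}{dt} &= \mu I, \end{aligned} \tag{1}\]
with \(N=S(t)+I(t)+I_D(t)+R(t)\), where the total size of the population N is constant. It is important to stress that this system of ODEs assumes homogeneous mixing in the host population.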
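To make the dynamics concrete, the following is a minimal simulation sketch of the homogeneous-mixing SIIDR model, assuming the transitions described above (infection at rate \(\beta\), recovery at rate \(\mu\), and switching between the active and dormant infected states at rates \(\gamma _1\) and \(\gamma _2\)). The integrator (a classical fourth-order Runge-Kutta step), the function names, and all parameter values are illustrative assumptions and are not fitted to any WannaCry trace.

```python
def siidr_rhs(state, beta, mu, gamma1, gamma2):
    """Right-hand side of the homogeneous-mixing SIIDR equations."""
    s, i, i_d, r = state
    n = s + i + i_d + r
    new_infections = beta * s * i / n
    ds = -new_infections
    di = new_infections - (mu + gamma1) * i + gamma2 * i_d
    did = gamma1 * i - gamma2 * i_d
    dr = mu * i
    return (ds, di, did, dr)


def rk4_step(state, dt, *params):
    """Advance the state by one classical Runge-Kutta step of size dt."""
    def shifted(base, slope, h):
        return tuple(b + h * k for b, k in zip(base, slope))

    k1 = siidr_rhs(state, *params)
    k2 = siidr_rhs(shifted(state, k1, dt / 2), *params)
    k3 = siidr_rhs(shifted(state, k2, dt / 2), *params)
    k4 = siidr_rhs(shifted(state, k3, dt), *params)
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(state, beta, mu, gamma1, gamma2, dt=0.01, steps=5000):
    """Return the list of (S, I, I_D, R) states visited by the integrator."""
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(state, dt, beta, mu, gamma1, gamma2)
        trajectory.append(state)
    return trajectory


# Illustrative values only: one infected seed among 100 hosts.
trajectory = simulate((99.0, 1.0, 0.0, 0.0), beta=0.5, mu=0.1, gamma1=0.2, gamma2=0.3)
final = trajectory[-1]
print("final (S, I, I_D, R):", tuple(round(x, 2) for x in final))
```

Because the four derivatives sum to zero, the total population stays constant along the trajectory, which is a quick sanity check for any implementation of the model.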
SIIDR equilibrium points
When modeling SPM, we are interested in equilibrium states in which the number of infected individuals equals 0 and does not change over time (i.e., disease-free equilibrium points). Thus, we need to derive the constant solutions of the ODE system corresponding to the SIIDR model (Perko 2013).
Definition 1
An equilibrium point or fixed point of the system of ODEs \(\dot{X} = f(X)\) is a solution \(E^*\) that does not change with time, i.e., \(f(E^*) = 0\).
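For the SIIDR model we can find the equilibrium points by solving the following system:
\[\begin{aligned} -\beta \frac{S I}{N} &= 0, \\ \beta \frac{S I}{N} - (\mu + \gamma _1) I + \gamma _2 I_D &= 0, \\ \gamma _1 I - \gamma _2 I_D &= 0, \\ \mu I &= 0, \end{aligned}\]
given that \(S + I + I_D + R = N\).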
Thus, we find the disease-free equilibrium points of the SIIDR model as \(E^* = (S, 0, 0, R)\), where \(I = I_D = 0\) and \(S + R = N\). A particular case is the beginning of the propagation process, when the number of recovered individuals is 0: \(R = 0\) and \(E^* = (N, 0, 0, 0)\). Therefore, we perform further analyses of the SIIDR model based on this equilibrium point. The SIIDR model has no endemic equilibrium point (i.e., one with \(I \ne 0\)). An endemic equilibrium exists only when \(\mu = 0\) (the SIID model) and is equal to \((0, I^*, \frac{\gamma _1 I^*}{\gamma _2}, 0)\).
The basic reproduction number
The basic reproduction number \(R_0\) is the number of secondary cases generated by a single infectious seed in a fully susceptible population (Keeling and Rohani 2008). \(R_0\) defines the epidemic threshold, that is the condition for a macroscopic outbreak. If \(R_0 > 1\), on average, infected individuals are able to sustain the spreading. If \(R_0<1\), on average, the disease will die out before any macroscopic outbreak.
One way to derive the basic reproduction number is to use the next-generation matrix approach (Diekmann et al. 1990, 2010; Blackwood and Childs 2018). This states that the basic reproduction number is the largest eigenvalue of the next-generation matrix. The method takes into consideration the dynamics of compartments linked to new infections. For example, the number of infected individuals \(x_i\) in compartment i, \(i \in \{1, \dots , k\}\), where k is the number of compartments with infected individuals, changes as follows:
\[\frac{dx_i}{dt} = F_i(X) - V_i(X),\]
where \(F_i(X)\) is the rate of appearance of new infections in compartment i, \(V_i(X) = V_i^-(X) - V_i^+(X)\), \(V_i^+(X)\) is the rate of transfer of individuals into compartment i by all other means, and \(V_i^-(X)\) represents the rate of transfer of individuals out of compartment i. If \(E^*\) is a disease-free equilibrium, then we can define a next-generation matrix:
\[G = F V^{-1},\]
where:
\[F = \left[ \frac{\partial F_i(E^*)}{\partial x_j} \right] , \quad V = \left[ \frac{\partial V_i(E^*)}{\partial x_j} \right] , \quad i, j \in \{1, \dots , k\}.\]
In the case of the SIIDR model, with infected compartments \((I, I_D)\), the matrix G can be represented at one of the disease-free equilibrium points \(DFE=(N,0,0,0)\) as follows:
\[G = \begin{pmatrix} \beta & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} \mu + \gamma _1 & -\gamma _2 \\ -\gamma _1 & \gamma _2 \end{pmatrix}^{-1} = \begin{pmatrix} \beta / \mu & \beta / \mu \\ 0 & 0 \end{pmatrix}. \tag{2}\]
Let \(\vec {v}\) be an eigenvector of the matrix G, and \(\lambda\) its corresponding eigenvalue. The eigenvalue equation is (Bhatia 1997):
\[G \vec {v} = \lambda \vec {v},\]
where \(\vec {v}\) is a nonzero vector, therefore \(\det [\lambda I - G] = 0\). Using G from equation 2, we obtain:
\[\det [\lambda I - G] = \begin{vmatrix} \lambda - \beta / \mu & -\beta / \mu \\ 0 & \lambda \end{vmatrix} = \lambda \left( \lambda - \frac{\beta }{\mu } \right) = 0,\]
which results in: 1) \(\lambda = 0\) or 2) \(\lambda = \beta / \mu\). According to the next-generation matrix method (Diekmann et al. 1990, 2010; Blackwood and Childs 2018), the reproduction number \(R_0\) is the largest eigenvalue of the next-generation matrix G; hence, \(R_0 = \frac{\beta }{\mu }\), which is the same as the \(R_0\) of the SIR model. In other words, the introduction of the new compartment \(I_D\) does not alter the conditions for a macroscopic outbreak. We note that, in general, the disease-free equilibrium might contain individuals already immune to the disease, i.e., \(E^* =(N-R,0,0,R)\). This might be due to a wave of infections caused by previous introductions of the virus. In this more general case we have: \(R_0 = \frac{\beta }{\mu }\left( 1-\frac{R}{N} \right)\), where the term in parentheses is the susceptible fraction of the population.
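As a numerical cross-check of this result, the sketch below builds the next-generation matrix \(G = FV^{-1}\) for the two infected compartments \((I, I_D)\) and returns its largest eigenvalue. The entries of F and V are written down here as an assumption derived from the transition rates of the model at \((N,0,0,0)\); the function name and the parameter values are illustrative.

```python
def next_generation_r0(beta, mu, gamma1, gamma2):
    """Largest eigenvalue of G = F V^{-1} for the infected states (I, I_D)."""
    # New-infection matrix F and transition matrix V at the DFE (S = N).
    f = [[beta, 0.0], [0.0, 0.0]]
    v = [[mu + gamma1, -gamma2], [-gamma1, gamma2]]

    # Invert the 2x2 matrix V explicitly.
    det_v = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    v_inv = [[v[1][1] / det_v, -v[0][1] / det_v],
             [-v[1][0] / det_v, v[0][0] / det_v]]

    # Matrix product G = F V^{-1}.
    g = [[sum(f[r][k] * v_inv[k][c] for k in range(2)) for c in range(2)]
         for r in range(2)]

    # Eigenvalues of a 2x2 matrix from its trace and determinant.
    trace = g[0][0] + g[1][1]
    det_g = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    disc = max(trace * trace - 4.0 * det_g, 0.0)
    return (trace + disc ** 0.5) / 2.0


# Illustrative rates: the result should match beta / mu for any gamma1, gamma2 > 0.
print(next_generation_r0(beta=0.5, mu=0.1, gamma1=0.2, gamma2=0.3))
```

Changing `gamma1` and `gamma2` leaves the returned value unchanged, which mirrors the observation that the dormant compartment does not alter the outbreak threshold.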
Stability analysis of SIIDR equilibrium points
A particularly important characteristic of a disease-free equilibrium point is its stability (Hirsch and Smale 1974), which indicates whether the system will be able to return to the equilibrium point after small perturbations. For example, a small perturbation can be a slight increase in the number of initially infected nodes.
Let us consider the system of ODEs that captures the dynamics of our SIIDR model (see Eqs. 1), governed by:
\[\frac{dX}{dt} = f(X), \quad X = (S, I, I_D, R).\]
Let \(X = E^*\) be a fixed point of f(X), that is, \(f(E^*) = 0\). Furthermore, let us assume that the system’s initial state at \(t = 0\) is \(X = X^0\). In this context, the stability of \(E^*\) can be assessed by answering the following question: if the system starts near \(E^*\), how close will it remain to \(E^*\)? Beyond this intuition, stability is more formally defined as follows (Hirsch and Smale 1974):
Definition 2
The equilibrium point \(E^*\) is stable if for any \(\epsilon > 0\), there exists a \(\delta > 0\) such that: if the system’s initial state \(X^0\) lies in the ball of radius \(\delta\) around \(E^*\) (i.e., \(\Vert X^0 - E^* \Vert < \delta\)), then solutions \(X^t\) exist for all \(t > 0\), and they stay in the ball of radius \(\epsilon\) around \(E^*\) (i.e., \(\Vert X^t - E^* \Vert < \epsilon\)).
In addition:
Definition 3
We say that \(E^*\) is locally asymptotically stable if it is stable and the solutions \(X^t\) with initial state \(X^0\) in the ball of radius \(\delta\) converge to \(E^*\) as \(t \rightarrow \infty\).
And:
Definition 4
We say that \(E^*\) is stable in the sense of Lyapunov (i.e., Lyapunov stable) when there exists a continuously differentiable function L(X) such that:
If \(\dot{L}(X) <0\) and \(\dot{L}(X) = 0\) only when \(X=E^*\), then \(E^*\) is locally asymptotically stable.
We next analyze the stability of the SIIDR disease-free equilibrium points and show that they are Lyapunov stable if the reproduction number \(R_0\) is smaller than or equal to one. We formally state and prove this in the following theorem:
Theorem 1
If \(R_0 \le 1\) the disease-free equilibrium point \(E^*\) of the SIIDR system of ODEs is Lyapunov stable.
Proof
Let \(L(X) = I + I_D\). L is a valid Lyapunov function because it is a nonnegative, continuously differentiable scalar function that equals 0 at the disease-free equilibrium point (\(I = I_D = 0\)). The time derivative of L is the following:
\[\dot{L} = \dot{I} + \dot{I}_D = \beta \frac{S I}{N} - \mu I = \left( \beta \frac{S}{N} - \mu \right) I, \tag{4}\]
where we used Eqs. 1 that describe the evolution of I and \(I_D\). Therefore, \(\dot{L} \le 0\) (Eq. 4) when:
\[\beta \frac{S}{N} - \mu \le 0.\]
Given the basic reproduction number \(R_0 = \frac{\beta S}{\mu N}\), we obtain:
\[\mu \left( R_0 - 1 \right) \le 0. \tag{5}\]
Eq. 5 holds when \(R_0 \le 1\). Hence, \(\dot{L} \le 0\) when \(R_0 \le 1\). Furthermore, \(\dot{L}(E^*) = 0\) (since \(I = 0\) when \(X = E^*\)), which concludes the proof that \(E^*\) is a Lyapunov stable disease-free equilibrium point.
Note that \(\dot{L}(X) = 0\) when \(I = 0\), even if \(X \ne E^*\) (for instance, if \(I_D \ne 0\)). Thus, \(E^*\) is not locally asymptotically stable (see Definition 4). \(\square\)
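The argument of the proof can be illustrated numerically. The sketch below tracks \(L = I + I_D\) along a forward-Euler trajectory of the SIIDR equations (infection at rate \(\beta\), recovery at rate \(\mu\), dormancy switching at rates \(\gamma _1\) and \(\gamma _2\)) and checks that it never increases when \(\beta \le \mu\). The integrator, the step size, the function name, and the rates are assumptions chosen only for illustration; a numerical run of this kind is a sanity check and does not replace the proof.

```python
def lyapunov_trace(beta, mu, gamma1, gamma2, state, dt=0.01, steps=20000):
    """Return L = I + I_D along a forward-Euler SIIDR trajectory."""
    s, i, i_d, r = state
    n = s + i + i_d + r
    values = [i + i_d]
    for _ in range(steps):
        infection = beta * s * i / n
        ds = -infection
        di = infection - (mu + gamma1) * i + gamma2 * i_d
        did = gamma1 * i - gamma2 * i_d
        dr = mu * i
        s, i, i_d, r = s + dt * ds, i + dt * di, i_d + dt * did, r + dt * dr
        values.append(i + i_d)
    return values


# Illustrative rates with beta / mu = 0.8 and five initially infected hosts.
values = lyapunov_trace(0.08, 0.1, 0.2, 0.3, (95.0, 5.0, 0.0, 0.0))
# Small tolerance to absorb floating-point rounding.
non_increasing = all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
print("L never increases:", non_increasing)
```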
SIIDR analysis on arbitrary graphs
Our analysis in previous sections was performed under the homogeneous-mixing assumption (Bansal et al. 2007; Vespignani 2012). In this limit, all hosts are well mixed and potentially in contact. The homogeneous approximation might be a good representation of the contact dynamics in a local subnet where each machine can contact any other. However, the contact patterns in larger networks are complex. Indeed, many real networks (including the Internet) feature, among other properties, a heterogeneous connectivity distribution consisting of a few highly connected ‘hubs’, while the vast majority of nodes have much lower connectivity (Albert and Barabási 2002; Pastor-Satorras et al. 2015). In this section, we analyze the epidemiological dynamics of the SIIDR model on arbitrary graphs that capture heterogeneity in host contact patterns. In this case, the propagation of malware can be modeled with a discrete-time Non-Linear Dynamical System (Chakrabarti et al. 2008; Prakash et al. 2011).
An NLDS is specified by the vector of probabilities at time step \(t+1\) as \(P_{t+1} = g(P_{t})\), where g is a nonlinear continuous function operating on a vector \(P_{t}\). We define the system equations based on the transition diagram of the model (Fig. 3).
First, we compute the probability that node i does not get infected at time step t, \(\zeta _{i, t}(I)\). This happens when, for each of its neighbors, either (1) the neighbor is not in state I, or (2) the neighbor is in state I but fails to infect i, which occurs with probability \((1-{\tilde{\beta }})\), where \(\tilde{\beta }\) is the attack transmission probability over a contact link. We note that \(\tilde{\beta }\) is generally different from the infection rate \(\beta\) introduced above. Indeed, we can approximate \(\beta = \tilde{\beta } \langle k \rangle _t\), where \(\langle k \rangle _t\) is the average contact rate per unit time. Hence:
\[\zeta _{i, t}(I) = \prod _{j \in \mathcal {N}(i)} \left( 1 - \tilde{\beta } P_{I,j,t} \right) , \tag{6}\]
where \(\mathcal {N}(i)\) denotes the set of neighbors of node i and \(P_{I,j,t}\) is the probability that node j is in state I at time t.
Next, we develop the equations for probabilities P of node i to be in each of the possible states (\(S, I, I_D, R\)) at time step \(t+1\).
For generality and clarity, we denote by \(\alpha _{XY}\) the probability of a node to transition from state X to Y, while \(\alpha _{XX}\) is the probability of a node to remain in state X. With this notation, the probability equations for each state are as follows:
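State S: A node i is in state S at time \(t+1\) if it was in state S at time t and it did not get infected:
\[P_{S,i,t+1} = P_{S,i,t} \, \zeta _{i,t}(I). \tag{7}\]
State I: A node i is in state I at time \(t+1\) if either: 1) it was in state S at time t and was successfully infected, or 2) it was in state I at time t and it remained there (i.e., it did not transition to states R or \(I_D\)), or 3) it was in state \(I_D\) at time t and transitioned to state I.
\[P_{I,i,t+1} = P_{S,i,t} \left( 1 - \zeta _{i,t}(I) \right) + \alpha _{II} P_{I,i,t} + \alpha _{I_D I} P_{I_D,i,t}. \tag{8}\]
State \(I_D\): A node i is in state \(I_D\) at time \(t+1\) if either: 1) it was in state I at time t and transitioned to state \(I_D\), or 2) it was in state \(I_D\) at time t and it remained there.
\[P_{I_D,i,t+1} = \alpha _{I I_D} P_{I,i,t} + \alpha _{I_D I_D} P_{I_D,i,t}. \tag{9}\]
State R: We can compute \(P_{R,i,t}\) using the relation:
\[P_{R,i,t} = 1 - P_{S,i,t} - P_{I,i,t} - P_{I_D,i,t}. \tag{10}\]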
Now we can write down the system of equations for SIIDR using Eqs. 7–10 to define \(P_{t}\), the probability vector that completely describes the evolution of the system at any time step t:
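As a rough illustration of the update rules described above, the sketch below performs one synchronous NLDS step on a small graph. It follows the verbal description of the four states; the data layout (dictionaries keyed by node), the key names such as `"I_ID"`, and the numeric probabilities are assumptions made for this example and are not taken from the paper.

```python
def nlds_step(adjacency, probs, beta_t, alpha):
    """One synchronous update of the per-node state probabilities.

    adjacency: dict mapping each node to a list of its neighbours
    probs: dict mapping each node to {"S": .., "I": .., "ID": .., "R": ..}
    beta_t: per-contact transmission probability
    alpha: transition probabilities "I_ID" (I to I_D), "I_R" (I to R)
           and "ID_I" (I_D to I)
    """
    stay_i = 1.0 - alpha["I_ID"] - alpha["I_R"]
    stay_id = 1.0 - alpha["ID_I"]
    updated = {}
    for node, p in probs.items():
        # Probability that no infectious neighbour transmits to this node.
        zeta = 1.0
        for nb in adjacency[node]:
            zeta *= 1.0 - beta_t * probs[nb]["I"]
        new_s = p["S"] * zeta
        new_i = p["S"] * (1.0 - zeta) + stay_i * p["I"] + alpha["ID_I"] * p["ID"]
        new_id = alpha["I_ID"] * p["I"] + stay_id * p["ID"]
        updated[node] = {"S": new_s, "I": new_i, "ID": new_id,
                         "R": 1.0 - new_s - new_i - new_id}
    return updated


# Hypothetical three-node path 0 - 1 - 2 with node 0 initially infectious.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
probs = {
    0: {"S": 0.0, "I": 1.0, "ID": 0.0, "R": 0.0},
    1: {"S": 1.0, "I": 0.0, "ID": 0.0, "R": 0.0},
    2: {"S": 1.0, "I": 0.0, "ID": 0.0, "R": 0.0},
}
alpha = {"I_ID": 0.2, "I_R": 0.1, "ID_I": 0.3}
after = nlds_step(adjacency, probs, beta_t=0.3, alpha=alpha)
print(after[1])
```

After one step, only the direct neighbour of the seed has a non-zero infection probability, while the node two hops away is still fully susceptible.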
Stability analysis
The next step in our analysis of the SIIDR propagation on complex networks represented as arbitrary graphs is to define the disease-free equilibrium points and analyze their stability.
Definition 5
An equilibrium point of an NLDS is the probability vector \(P^*\) that satisfies \(P_{t+1} = P_t = P^*\) for any t (Verhulst 2006).
Thus, for the SIIDR model we can define the diseasefree equilibrium point as follows:
One way to analyze the stability of the equilibrium point of a nonlinear dynamical system is to approximate its dynamics at that point as a linear dynamical system (i.e., linearization) (Sayama 2015). In this case, the system behavior in an infinitesimally small area about the equilibrium point is approximated with a Jacobian matrix.
The largest eigenvalue \(\lambda _J\) (in absolute value) of the Jacobian matrix indicates whether the equilibrium point of the system is stable. Since we consider time to be discrete, if \(|\lambda _J| < 1\), the equilibrium point is asymptotically stable: even if small perturbations occur, the system asymptotically returns to the equilibrium point. If \(|\lambda _J| > 1\), the equilibrium point is unstable and the system diverges away from it. If \(|\lambda _J| = 1\), the linearization is inconclusive and the system may either diverge from or converge to the equilibrium point (Bof et al. 2018; Dahleh et al. 2004; Haddad and Chellaboina 2011; Sayama 2015).
The Jacobian matrix of SIIDR modeled as an NLDS and an analysis of its eigenvalues are presented in Appendix 3. We show that one of the eigenvalues of the Jacobian has value 1. This result is particularly significant. Asymptotic stability requires all the eigenvalues of the Jacobian matrix to be less than 1 in absolute value. Since the Jacobian matrix has at least one eigenvalue of value 1, the equilibrium point of the NLDS cannot be asymptotically stable. However, the equilibrium point can still be Lyapunov stable.
We show that the equilibrium points of SIIDR are indeed Lyapunov stable using Lyapunov’s second stability criterion.
Definition 6
The equilibrium point \(P^*\) of the NLDS \(P_{t+1} = g(P_{t})\) is Lyapunov stable if there exists a continuous function L, such that for any t:
Theorem 2
The equilibrium points of SIIDR represented as NLDS of the form (11) are Lyapunov stable if:
where \(\lambda _A\) is the largest eigenvalue of the adjacency matrix, and \(\tilde{\beta }\) and \(\mu\) are the probabilities of infection and recovery, respectively.
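As a rough illustration of how the quantities in Theorem 2 can be evaluated on a concrete graph, the sketch below estimates \(\lambda _A\) by power iteration and forms the ratio \(s = \lambda _A \tilde{\beta } / \mu\), the threshold used later in the Phase transition section, where the stable regime corresponds to \(s < 1\). The function names are hypothetical, the graph is assumed to be undirected and connected, and the identity shift is an implementation choice to keep the iteration from oscillating on bipartite graphs.

```python
from math import sqrt

def largest_eigenvalue(adj, iters=10_000, tol=1e-12):
    """Leading eigenvalue of an undirected graph's adjacency matrix.

    adj[i] lists the neighbours of node i. Power iteration is run on
    A + I, whose dominant eigenvalue is lambda_A + 1.
    """
    n = len(adj)
    x = [1.0 / sqrt(n)] * n          # unit-norm, strictly positive start
    lam = 0.0
    for _ in range(iters):
        y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]  # (A + I) x
        norm = sqrt(sum(v * v for v in y))
        new_lam = norm - 1.0         # since ||x|| = 1, ||(A + I) x|| -> lambda_A + 1
        x = [v / norm for v in y]
        if abs(new_lam - lam) < tol:
            break
        lam = new_lam
    return new_lam

def stability_threshold(adj, beta, mu):
    """Ratio s = lambda_A * beta / mu; s < 1 is the regime discussed in the text."""
    return largest_eigenvalue(adj) * beta / mu

# Complete graph on 4 nodes: lambda_A = 3.
k4 = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
s = stability_threshold(k4, beta=0.1, mu=0.5)
```

For large graphs a sparse eigensolver would be the more practical choice; the loop above is only meant to make the definition concrete.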
Experimental results
In this section, we present the reconstruction of WannaCry dynamics from network logs captured with the Zeek monitoring tool (The Zeek Project 2023). Additionally, we show supporting results that confirm that the SIIDR model fits WannaCry traces best. We also present our experiments for parameter estimation, providing statistics from the posterior distribution of the SIIDR transition rates. These results expand on those presented in our previous work, where we introduced the SIIDR model (Chernikova et al. 2022). Moreover, we study the basic reproduction number \(R_0\) of the reconstructed attacks to understand its correlation with SPM dynamics (in particular, propagation speed). We also discuss the structural and practical identifiability of SIIDR parameters, an issue that is common in epidemiological modeling. Finally, we experimentally demonstrate that the condition for Lyapunov stability of the disease-free equilibrium point holds when the networks are modeled as arbitrary graphs, relaxing the homogeneous mixing assumption.
WannaCry malware traces
We obtained realistic WannaCry attack traces by running the malware in a controlled virtual environment consisting of 51 virtual machines, configured with a version of Windows vulnerable to the EternalBlue SMB exploit. The external traffic generated by the VMs was blocked to isolate the environment and prevent external malware spread. The infection started from an initial victim IP, and then the attack propagated through the network as the infected IPs began to scan other IPs. In these experiments, two WannaCry parameters were varied: the number of threads used for scanning, which was set to 1, 4, or 8, and the time interval between scans, which was set to 500 ms, 1 s, 5 s, 10 s, or 20 s. Combining these two parameters resulted in 15 WannaCry traces. While running WannaCry with this setup, the log traces were collected with the open-source Zeek network monitoring tool.
WannaCry reconstruction
To reconstruct the WannaCry dynamics, we use Zeek communication logs and consider only communication between internal IPs. Since WannaCry attempts to exploit the SMB vulnerability, we label as malicious all connection attempts on destination port 445. The first attempt to establish a malicious connection is considered the start of the epidemic, and the end corresponds to the last communication event in the network. Each IP that tries to establish a malicious connection for the first time at time t is considered infected at time t. The cumulative number of infected IPs over time represents the curve of the WannaCry epidemic.
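The reconstruction rule just described can be sketched in a few lines of Python. The sketch assumes the logs have already been parsed into `(timestamp, source IP, destination IP, destination port)` tuples restricted to internal traffic (in a Zeek `conn.log` these would correspond to the `ts`, `id.orig_h`, `id.resp_h`, and `id.resp_p` fields). The function name and the sample records are hypothetical.

```python
def infection_curve(records, port=445):
    """Cumulative infection curve from connection records.

    records: iterable of (ts, src_ip, dst_ip, dst_port) tuples.
    Returns (curve, end_ts), where curve is a list of
    (time, cumulative number of infected IPs) and end_ts is the
    timestamp of the last communication event.
    """
    first_seen = {}
    end_ts = None
    for ts, src, _dst, dport in sorted(records):
        end_ts = ts
        # An IP counts as infected at its first connection attempt on the port.
        if dport == port and src not in first_seen:
            first_seen[src] = ts
    times = sorted(first_seen.values())
    curve = [(t, count) for count, t in enumerate(times, start=1)]
    return curve, end_ts

records = [
    (0.0, "10.0.0.1", "10.0.0.2", 445),
    (1.0, "10.0.0.1", "10.0.0.3", 445),
    (2.5, "10.0.0.2", "10.0.0.3", 445),
    (3.0, "10.0.0.3", "10.0.0.1", 80),
    (9.0, "10.0.0.2", "10.0.0.1", 53),
]
curve, end_ts = infection_curve(records)
```

In this toy input the epidemic starts at the first entry of `curve` and ends at `end_ts`; the third host never scans on port 445 and is therefore never counted as infected.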
WannaCry dynamics
We show the dynamics of WannaCry variants characterized by different numbers of scanning threads and time between scans in Fig. 4. These dynamics represent the cumulative number of infected nodes over the duration of the epidemic. The trace that corresponds to 1 thread and 20 s sleeping time, wc_1_20s, behaves unusually: it has a very small number of infected nodes until the end of the attack, when the count jumps to 7 infected nodes at once. For all other WannaCry variants, we observe that the attack reaches the maximum number of infected nodes quickly and is not able to infect any other nodes for a large time window before the end of the epidemic. These graphs are consistent with the assumption that, after an IP enters the recovered state, it can no longer return to the susceptible or infected states. For modeling and parameter estimation experiments, we exclude the time windows after which the number of infections does not change. Additionally, we present the number of contacted and infected IPs in Table 3. Interestingly, the overall percentage of infected nodes is small (around 25% on average) for all variants. A possible reason is that some of the machines that do not get infected may be immune to the malware.
Model selection
We select the model that fits WannaCry traces best among several representative compartmental epidemiological models: SI, SIS, SIR, SEIR, and SIIDR, assuming a homogeneous mixing of machines. These models have different numbers of parameters and, therefore, different a priori explanatory power. The SIIDR model is the one with the largest number of parameters. To allow for a fair comparison among models, we use the Akaike Information Criterion (AIC) as a metric to measure their performance. The AIC is calculated from the maximum likelihood estimate and the number of free model parameters, thus allowing comparison of models with different numbers of parameters. More information about the AIC can be found in the “Model selection” section in Appendix 5. We perform model selection for all WannaCry traces. The lowest AIC score corresponds to the best model. We run the experiments on a uniform grid of model parameter values between 0 and 1. We select the lowest AIC score for each WannaCry trace and each compartmental model. The results are shown in Table 4. In bold, we highlight the minimum AIC value across all models for each WannaCry trace. The SIIDR model has the lowest AIC score for all traces except for wc_1_20s. For instance, the AIC score associated with the SEIR model for the wc_8_5s WannaCry trace is equal to 87, the SIS model score is 104, the SIR model score is 35, whereas for the SIIDR model the AIC is the lowest and has the value of −121. This trend holds for all other WannaCry traces except for wc_1_20s, where the SEIR model provides the best fit. However, this variant is an outlier. Therefore, we can conclude that, among the compartmental models considered, the SIIDR model fits the WannaCry attack traces best.
For each compartmental model and each WannaCry trace, we plot the reconstruction curve of the number of infected nodes using the parameters corresponding to the lowest AIC score along with the true dynamics of infected nodes. The results are shown in Fig. 5. In the case of the SIS model, the orange line (representing the simulated dynamics of the number of infected nodes) is far from the blue one, which shows the empirical dynamics, for all malware traces. In the case of the SIR and SEIR models, the numbers of simulated infections are closer to the real ones; however, the SIIDR curve is the closest to the actual dynamics.
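The selection rule can be made concrete with the standard definition \(\mathrm{AIC} = 2k - 2\ln \hat{L}\), where k is the number of free parameters and \(\hat{L}\) the maximized likelihood (Akaike 1974). The sketch below is illustrative only: the log-likelihood values are invented, the parameter counts per model are an assumption (one rate for SI up to four for SIIDR), and the function names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L_hat)."""
    return 2 * n_params - 2 * log_likelihood

def select_model(fits):
    """fits maps model name -> (maximized log-likelihood, number of free parameters).

    Returns the name with the lowest AIC and the full score table.
    """
    scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Invented log-likelihoods for a single trace, for illustration only.
fits = {
    "SI": (-60.0, 1),
    "SIS": (-55.0, 2),
    "SIR": (-20.0, 2),
    "SEIR": (-18.0, 3),
    "SIIDR": (-10.0, 4),
}
best, scores = select_model(fits)
```

The penalty term 2k is what keeps a model with more parameters from winning on likelihood alone: in this toy input SIIDR is selected only because its likelihood gain outweighs its larger parameter count.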
Parameter estimation
We approximated the posterior distribution of SIIDR transition rates using the ABC-SMC-MNN technique (Filippi et al. 2013). The details of this technique are described in “Posterior distribution of transition rates” in Appendix 5. The mean values and standard deviations of the posterior distribution of SIIDR transition rates (\(\beta\), \(\mu\), \(\gamma _1\), \(\gamma _2\)) are reported in Table 5. The parameter dt is the integration step, which is calculated as \(dt = (t_N - t_0)/T\), where \(t_N\) is the last timestamp, \(t_0\) is the first timestamp, and T is the number of timestamps in the WannaCry trace. dt differs by variant due to the different propagation speeds. The attack transmission probability \(\tilde{\beta }\) is related to the attack transmission rate \(\beta\) as follows: \(\beta = \tilde{\beta } \langle k \rangle _t\), where \(\langle k \rangle _t\) is the average contact rate per unit time. In the WannaCry traces we have one communication or contact per dt; hence, the transmission probability \(\tilde{\beta }\) over a contact link also equals \(\beta\).
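To give a feel for approximate Bayesian computation in this setting, the sketch below uses plain ABC rejection sampling, a much simpler relative of ABC-SMC-MNN and not the procedure used for Table 5. It assumes a deterministic, homogeneous-mixing SIIDR model integrated with Euler steps, uniform priors on (0, 1) for all four rates, and a root-mean-square distance between simulated and observed cumulative infections. All function names, the tolerance, and the synthetic "observed" curve are hypothetical.

```python
import random

def simulate_siidr(beta, mu, gamma1, gamma2, n_nodes, steps, dt=0.5, i0=1.0):
    """Euler integration of homogeneous-mixing SIIDR.

    Returns the cumulative number of infections after each step.
    """
    S, I, ID = n_nodes - i0, i0, 0.0
    cumulative = [i0]
    for _ in range(steps):
        new_inf = beta * S * I / n_nodes * dt   # S -> I
        to_dormant = gamma1 * I * dt            # I -> I_D
        to_active = gamma2 * ID * dt            # I_D -> I
        recovered = mu * I * dt                 # I -> R
        S -= new_inf
        I += new_inf + to_active - to_dormant - recovered
        ID += to_dormant - to_active
        cumulative.append(cumulative[-1] + new_inf)
    return cumulative

def abc_rejection(observed, n_nodes, n_samples=2000, eps=1.0, seed=0):
    """Keep parameter draws whose simulated curve is within eps of the data."""
    rng = random.Random(seed)
    steps = len(observed) - 1
    accepted = []
    for _ in range(n_samples):
        theta = tuple(rng.random() for _ in range(4))  # beta, mu, gamma1, gamma2
        sim = simulate_siidr(*theta, n_nodes=n_nodes, steps=steps)
        dist = (sum((a - b) ** 2 for a, b in zip(sim, observed)) / len(observed)) ** 0.5
        if dist < eps:
            accepted.append(theta)
    return accepted

# Synthetic "observed" curve generated from known rates.
observed = simulate_siidr(0.6, 0.3, 0.2, 0.4, n_nodes=50, steps=40)
accepted = abc_rejection(observed, n_nodes=50)
if accepted:
    posterior_mean = [sum(col) / len(accepted) for col in zip(*accepted)]
```

Sequential variants such as ABC-SMC refine this idea by shrinking the tolerance over several populations and perturbing accepted particles with a kernel, which is far more sample-efficient than the single-pass rejection shown here.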
Based on estimated values of transition rates we calculated the basic reproduction number \(R_0\) for all WannaCry traces. We also calculate the SPM propagation speed for all WannaCry traces as the average number of new infections per 100 s. The results are illustrated in Fig. 6. As expected, we observe that higher SPM propagation speed corresponds to a higher basic reproduction number \(R_0\).
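The propagation speed defined above, the average number of new infections per 100 s, can be computed directly from the first-infection timestamps. The sketch assumes that the initially infected node is not counted as a new infection and that timestamps are in seconds; the function name and sample values are hypothetical.

```python
def propagation_speed(infection_times, window=100.0):
    """Average number of new infections per `window` seconds."""
    times = sorted(infection_times)
    if len(times) < 2:
        return 0.0
    duration = times[-1] - times[0]
    if duration <= 0:
        return 0.0
    new_infections = len(times) - 1   # assumption: exclude the initial victim
    return new_infections / duration * window

speed = propagation_speed([0.0, 50.0, 120.0, 400.0])
```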
The mean values of the parameters’ posterior distribution can be further used to simulate SPM with the SIIDR model. This provides an opportunity to create synthetic, but realistic, WannaCry scenarios and evaluate whether existing defenses are successful in preventing and stopping the malware from propagating in the networks. However, we notice that some of the WannaCry attack variants affect only a small number of nodes. For example, the wc_8_5s trace has only 4 infected nodes at the end of the trace, which constitutes 14% of all nodes. Consequently, ABC-SMC-MNN is expected to perform worse in the estimation of transition rates for such traces. Thus, parameters estimated from the traces with higher numbers of infections are more reliable.
Identifiability of SIIDR transition rates
Since the goals of modeling with SIIDR include inference about the underlying propagation process, we are interested in estimating the distribution of SIIDR parameters whose model outputs best fit the observed data. However, parameter estimation can only produce robust results if the model is identifiable, meaning that it is possible to obtain a unique solution for all unknown parameters given the model structure and output. If, on the other hand, the parameters are not identifiable, considerably different parameter values may yield nearly identical model outputs (Chis et al. 2011; Tuncer and Le 2018).
Because observed data are often uncertain, parameter identifiability is a relevant issue in epidemiological modeling (Chowell 2017; Gallo et al. 2022; Weitz and Dushoff 2015; Valdez et al. 2015). A lack of identifiability in the model parameters may prevent reliable predictions of the epidemic dynamics. It is therefore crucial to investigate parameter identifiability and its limitations, and to propose solutions to improve it.
There exist notions of structural and practical identifiability. Structural identifiability is a property of the model structure itself, given that the model is error-free and the observed data has no noise. Practical identifiability is connected to the quality of the data used for parameter estimation. It measures whether there is enough information to infer the transition rates (Dankwa et al. 2022).
We addressed the structural identifiability of SIIDR parameters using the method of differential algebra (Chis et al. 2011; Miao et al. 2011) with the help of the DAISY (Bellu et al. 2007) and SIAN (Hong et al. 2020; Ilmer et al. 2021) software and obtained the following result:
Theorem 3
All parameters of the SIIDR model are globally structurally identifiable when incidence represents the output of the model and the size of the population N is known. Otherwise, parameters N and \(\beta\) are structurally non-identifiable, while \(\mu , \gamma _1\) and \(\gamma _2\) remain identifiable.
Therefore, we consider the SIIDR model to be structurally identifiable, since the size of a computer network is usually known. More information about SIIDR structural identifiability along with the results from the DAISY software can be found in Appendix 2.
However, even when the model parameters are structurally identifiable, they may still be non-identifiable in practice due to the limited number of observed variables, the quality of the data used for estimation, and the complexity of the model (the number of parameters that are jointly estimated).
To investigate practical identifiability, we looked at the joint posterior distribution of SIIDR parameters. The plots can be found in “Joint posterior distributions of SIIDR parameters” in Appendix 2. For some of the WannaCry variants, there is a correlation between parameters \(\beta\) and \(\mu\). Additionally, some of the joint posterior distributions are multimodal. Although on average non-identifiability is not dominant, it might appear in some parts of the phase space of the SIIDR model. One reason for this behavior is that incidence is the only output of the fitted model, and it appears to be insufficient to characterize the full dynamics of the model. Another is that SIIDR has four parameters estimated jointly, so multiple sets of parameters may lead to the same model output. Hence, measuring data about other states, rather than only the number of infected nodes as a function of time, should characterize the system dynamics more extensively and improve the practical identifiability of SIIDR.
Threshold evaluation
In this section, we evaluate the conditions under which the equilibrium points of the SIIDR model are Lyapunov stable. Specifically, we are interested in the equilibrium point that corresponds to the start of the epidemic, when every node in the network is susceptible with probability 1, i.e., the probability vector over the SIIDR states is \(P^* = \{\vec {1},\vec {0}, \vec {0}, \vec {0}\}\). We study the stability of this point after the infection of the initial node by SPM (i.e., the system initial state \(P_0\) lies in the ball of radius \(\delta\) around \(P^*\)) by looking at the density of recovered nodes with respect to the stability threshold s and the associated infection propagation dynamics \(P_t\).
We evaluate the stability conditions on a variety of synthetic and real-world networks described in the following subsection.
Graphs characteristics
We consider synthetic networks generated with the Barabási-Albert (BA) (Barabási and Albert 1999), Erdős-Rényi (ER) (Erdős and Rényi 1959), Watts-Strogatz (WS) (Watts and Strogatz 1998), Configuration Model (CM) (Newman 2003), and Scale-free (SF) (Barabási 2009) models along with three real-world graphs (Leskovec et al. 2005; Leskovec and Mcauley 2012; Leskovec et al. 2007). The real-world graphs include networks generated using Facebook data (Facebook), autonomous systems peering information inferred from Oregon route views (Oregon), and anonymized traffic data about incoming and outgoing emails between members of a European research institution (Email). All synthetic graphs have 1000 nodes and different topological characteristics. ER graphs have leading eigenvalues that range from 11 to 999, BA networks have leading eigenvalues between 35 and 508, and WS graphs between 10 and 900. ER, BA, and WS networks have only one connected component. Among them, the graphs with smaller leading eigenvalues have a larger diameter and average path length, and smaller density and transitivity. CM and SF networks have more connected components, and the values of their other topological characteristics are similar to those of ER, BA, and WS graphs with small leading eigenvalues.
More details about the topological characteristics of considered networks are presented in Table 6.
Phase transition
To illustrate the results of Theorem 2, we plot the final number of recovered nodes in the network with respect to the threshold values \(s = \lambda _A \beta / \mu\) in the range from 0 to 2. We obtain these results by fixing the transition rates \(\mu = 0.5, \gamma _1 = 0.5, \gamma _2 = 0.5\) and changing the value of \(\beta\). For ER, BA, and WS graphs, infection propagation starts from one initially infected node; for SF, CM, and real-world networks, the fraction of infected nodes at t = 1 is 0.05. We average results over 100 stochastic realizations that we run considering 50 different seeds. The resulting phase transition plots are shown in Figs. 7, 8, 9, and 10.
For all types of graphs, the total fraction of recovered nodes is negligible for values of \(s<1\). As predicted by the theory, the epidemic threshold is \(s\sim 1\). In the case of SF, CM, and real networks (see Fig. 10), the threshold appears to lie below \(s = 1\). Note, however, that to obtain macroscopic outbreaks in these graphs we started the simulations with \(5\%\) of the nodes initially infected, instead of a single one as for the other networks. Hence, for these networks too, the phase transition takes place at \(s\sim 1\).
In general, networks with larger diameters and average path lengths, and smaller density and transitivity, have a smaller fraction of recovered nodes during the infection propagation.
These results demonstrate that for all t the solution \(P_t\) stays in some ball of radius \(\epsilon\) around the starting equilibrium point \(P^* = P_0\) when \(s < 1\); therefore, it is Lyapunov stable. Moreover, we see that SIIDR behaves the same as the SIR model in terms of the stability of equilibrium points: when the threshold s is less than one, the SIIDR system solution converges to the DFE as t tends to infinity. This can be explained by the fact that the SIIDR model is very similar to the SIR model except for the particular configuration of transition rates.
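A scaled-down version of this experiment can be sketched as a discrete-time stochastic simulation. The code below is an illustration under stated assumptions, not the setup behind Figs. 7 to 10: only nodes in state I transmit, each infected neighbour transmits independently with probability \(\beta\) per step, and the example graph is a small complete graph whose leading eigenvalue \(n - 1\) is known in closed form. Function names, graph size, and number of runs are hypothetical.

```python
import random

def simulate_siidr_graph(adj, beta, mu, gamma1, gamma2, seeds, rng, max_steps=10_000):
    """Stochastic SIIDR on a graph; returns the final fraction of recovered nodes."""
    state = ["S"] * len(adj)
    for s in seeds:
        state[s] = "I"
    for _ in range(max_steps):
        if not any(x in ("I", "ID") for x in state):
            break                                  # epidemic has died out
        nxt = list(state)
        for i, x in enumerate(state):
            if x == "S":
                k = sum(1 for j in adj[i] if state[j] == "I")
                if k and rng.random() < 1.0 - (1.0 - beta) ** k:
                    nxt[i] = "I"
            elif x == "I":
                u = rng.random()
                if u < mu:
                    nxt[i] = "R"                   # recover
                elif u < mu + gamma1:
                    nxt[i] = "ID"                  # go dormant
            elif x == "ID" and rng.random() < gamma2:
                nxt[i] = "I"                       # wake up
        state = nxt
    return state.count("R") / len(adj)

def recovered_vs_threshold(adj, lam, s_values, mu=0.5, gamma1=0.5, gamma2=0.5,
                           runs=20, seed=0):
    """Average final recovered fraction for each threshold value s = lam * beta / mu."""
    rng = random.Random(seed)
    results = []
    for s in s_values:
        beta = s * mu / lam
        total = sum(
            simulate_siidr_graph(adj, beta, mu, gamma1, gamma2,
                                 [rng.randrange(len(adj))], rng)
            for _ in range(runs)
        )
        results.append((s, total / runs))
    return results

n = 30
complete = [[j for j in range(n) if j != i] for i in range(n)]  # lambda_A = n - 1
sweep = recovered_vs_threshold(complete, lam=n - 1, s_values=[0.5, 1.0, 2.0])
```

Plotting the second element of each pair in `sweep` against s gives a coarse phase-transition curve; resolving the transition sharply would require larger graphs, a finer grid of s values, and more realizations.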
Conclusions
We performed a comprehensive analysis of a new compartmental model, SIIDR, that captures the behavior of self-propagating malware. We showed that SIIDR fits real-world WannaCry traces much better than existing compartmental models such as SI, SIS, SIR, and SEIR (which were previously studied in the literature). Additionally, we estimated the posterior distribution of the model’s parameters for real attack traces and showed how they characterize the WannaCry behavior. We also analytically derived the conditions under which SPM is expected to become an epidemic and discussed the stability of the model’s disease-free equilibrium points. Our work demonstrates the impact of modeling the propagation of SPM, simulating real attacks on networks, and evaluating defensive techniques.
Availability of data and materials
The datasets supporting the conclusions of this article are available in the GitHub repository: https://github.com/achernikova/siidr/. WannaCry data is available from the corresponding author on reasonable request.
Code availability
The code is available in the GitHub repository: https://github.com/achernikova/siidr/.
References
Abbey H (1952) An examination of the Reed-Frost theory of epidemics. Hum Biol 24(3):201–33
Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723. https://doi.org/10.1109/TAC.1974.1100705
Akbanov M, Vassilakis VG, Logothetis MD (2019) Ransomware detection and mitigation using software-defined networking: the case of WannaCry. Comput Electr Eng 76:111–121. https://doi.org/10.1016/j.compeleceng.2019.03.012
Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74(1):47–97. https://doi.org/10.1103/RevModPhys.74.47
Alotaibi FM, Vassilakis VG (2021) SDN-based detection of self-propagating ransomware: the case of BadRabbit. IEEE Access 9:28039–28058. https://doi.org/10.1109/ACCESS.2021.3058897
Azzara M (2021) What is WannaCry Ransomware and how does it work? https://www.mimecast.com/blog/all-you-need-to-know-about-wannacry-ransomware/
Bansal S, Grenfell B, Meyers L (2007) When individual behaviour matters: homogeneous and network models in epidemiology. J R Soc Interface 4(16):879–891. https://doi.org/10.1098/rsif.2007.1100
Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Barabási AL (2009) Scale-free networks: a decade and beyond. Science 325(5939):412–413. https://doi.org/10.1126/science.1173299
Bellu G, Saccomani MP, Audoly S et al (2007) Daisy: a new software tool to test global identifiability of biological and physiological systems. Comput Methods Progr Biomed 88(1):52–61
Ben Said N, Biondi F, Bontchev V et al (2018) Detection of Mirai by syntactic and behavioral analysis. In: IEEE 29th International symposium on software reliability engineering (ISSRE), pp 224–235. https://doi.org/10.1109/ISSRE.2018.00032
Bhatia R (1997) Matrix analysis, vol 169. Springer, New York
Blackwood JC, Childs LM (2018) An introduction to compartmental modeling for the budding infectious disease modeler. Lett Biomath
Bof N, Carli R, Schenato L (2018) Lyapunov theory for discrete time systems. arXiv preprint arXiv:1809.05289
Brauer F (2008) Compartmental models in epidemiology. Math Epidemiol 19–79
Chakrabarti D, Wang Y, Wang C et al (2008) Epidemic thresholds in real networks. ACM Trans Inf Syst Secur 10(4):1–26
Chen Q, Bridges RA (2017) Automated behavioral analysis of malware: a case study of WannaCry ransomware. In: 16th IEEE international conference on machine learning and applications (ICMLA), pp 454–460. https://doi.org/10.1109/ICMLA.2017.0119
Chernikova A, Gozzi N, Boboila S et al (2022) Cyber network resilience against selfpropagating malware attacks. In: Proceedings 27th European symposium on research in computer security (ESORICS)
Chis OT, Banga JR, Balsa-Canto E (2011) Structural identifiability of systems biology models: a critical comparison of methods. PLoS ONE 6(11):e27755
Chowell G (2017) Fitting dynamic models to epidemic outbreaks with quantified uncertainty: a primer for parameter uncertainty, identifiability, and forecasts. Infect Dis Model 2(3):379–398
Dahleh M, Dahleh MA, Verghese G (2004) Lectures on dynamic systems and control. A+ A 4(100):1–100
Dankwa EA, Brouwer AF, Donnelly CA (2022) Structural identifiability of compartmental models for infectious disease transmission is influenced by data type. Epidemics 41:100643
Diekmann O, Heesterbeek JAP, Metz JA (1990) On the definition and the computation of the basic reproduction ratio \(R_0\) in models for infectious diseases in heterogeneous populations. J Math Biol 28(4):365–382
Diekmann O, Heesterbeek J, Roberts MG (2010) The construction of next-generation matrices for compartmental epidemic models. J R Soc Interface 7(47):873–885
Dietz K (1993) The estimation of the basic reproduction number for infectious diseases. Stat Methods Med Res 2(1):23–41
Durst R, Champion T, Witten B et al (1999) Testing and evaluating computer intrusion detection systems. Commun ACM 42(7):53–61
Erdős P, Rényi A (1959) On random graphs I. Publ Math Debrecen 6:290–297
Filippi S, Barnes CP, Cornebise J et al (2013) On optimality of kernels for approximate Bayesian computation using sequential Monte Carlo. Stat Appl Genet Mol Biol 12(1):87–107
Fraser C, Donnelly CA, Cauchemez S et al (2009) Pandemic potential of a strain of influenza A (H1N1): early findings. Science 324(5934):1557–1561
Gallo L, Frasca M, Latora V et al (2022) Lack of practical identifiability may hamper reliable predictions in COVID-19 epidemic models. Sci Adv 8(3):eabg5234
Gan C, Feng Q, Zhang X et al (2020) Dynamical propagation model of malware for cloud computing security. IEEE Access 8:20325–20333
Guillén JH, del Rey AM (2018) Modeling malware propagation using a carrier compartment. Commun Nonlinear Sci Numer Simul 56:217–226
Guillén JH, del Rey AM, Encinas LH (2017) Study of the stability of a SEIRS model for computer worm propagation. Phys A 479:411–421
Guillén JH, del Rey AM, Casado-Vara R (2019) Security countermeasures of a SCIRAS model for advanced malware propagation. IEEE Access 7:135472–135478
Guo Y, Gong W, Towsley D (2000) Time-stepped hybrid simulation (TSHS) for large scale networks. In: Proceedings IEEE INFOCOM 2000. Conference on computer communications. Nineteenth annual joint conference of the IEEE computer and communications societies (Cat. No. 00CH37064). IEEE, pp 441–450
Haddad WM, Chellaboina V (2011) Nonlinear dynamical systems and control: a Lyapunov-based approach. Princeton University Press, Princeton
Higham DJ (2001) An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Rev 43(3):525–546
Hirsch M, Smale S (1974) Differential equations, dynamical systems, and linear algebra. Academic Press, Oxford
Hong H, Ovchinnikov A, Pogudin G et al (2020) Global identifiability of differential models. Commun Pure Appl Math 73(9):1831–1879
Ilmer I, Ovchinnikov A, Pogudin G (2021) Web-based structural identifiability analyzer. In: Computational methods in systems biology: 19th international conference, CMSB 2021, Bordeaux, France, September 22–24, 2021, Proceedings 19. Springer, pp 254–265
Keeling M, Rohani P (2008) Modeling infectious diseases in humans and animals. Princeton University Press, Princeton
Kephart JO, White SR (1993) Measuring and modeling computer virus prevalence. In: Proceedings 1993 IEEE computer society symposium on research in security and privacy. IEEE, pp 2–15
Kiddle C, Simmonds R, Williamson C et al (2003) Hybrid packet/fluid flow network simulation. In: Seventeenth workshop on parallel and distributed simulation, 2003. (PADS 2003). Proceedings. IEEE, pp 143–152
Kim HA, Karp B (2004) Autograph: toward automated, distributed worm signature detection. In: 13th USENIX security symposium (USENIX Security 04). USENIX Association, San Diego, CA
Kumar A, Lim TJ (2020) Early detection of Mirai-like IoT bots in large-scale networks through sub-sampled packet traffic analysis. In: Advances in information and communication: proceedings of the 2019 future of information and communication conference (FICC), vol 2. Springer, pp 847–867
Le LT, Eliassi-Rad T, Tong H (2015) MET: a fast algorithm for minimizing propagation in large graphs with small eigen-gaps. In: Proceedings of the 2015 SIAM International conference on data mining (SDM), pp 694–702
Leskovec J, Mcauley J (2012) Learning to discover social circles in ego networks. Adv Neural Inf Process Syst 25
Leskovec J, Kleinberg J, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pp 177–187
Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: densification and shrinking diameters. ACM Trans Knowl Discov Data (TKDD) 1(1):2-es
Levy N, Rubin A, Yom-Tov E (2020) Modeling infection methods of computer malware in the presence of vaccinations using epidemiological models: an analysis of real-world data. Int J Data Sci Anal 10(4):349–358
Li J, Stafford S (2014) Detecting smart, self-propagating Internet worms. In: IEEE Conference on communications and network security, pp 193–201. https://doi.org/10.1109/CNS.2014.6997486
Martínez Martínez I, Florián Quitián A, Díaz-López D et al (2021) MalSEIRS: Forecasting malware spread based on compartmental models in epidemiology. Complexity
McKinley TJ, Vernon I, Andrianakis I et al (2018) Approximate Bayesian computation and simulation-based inference for complex stochastic epidemic models. Stat Sci 33(1):4–18
Miao H, Xia X, Perelson AS et al (2011) On identifiability of nonlinear ode models and applications in viral dynamics. SIAM Rev 53(1):3–39
Minter A, Retkute R (2019) Approximate Bayesian computation for infectious disease modelling. Epidemics 29:100368
Mishra BK, Jha N (2010) SEIQRS model for the transmission of malicious objects in computer network. Appl Math Model 34(3):710–715
Mishra BK, Pandey SK (2014) Dynamic model of worm propagation in computer network. Appl Math Model 38(7–8):2173–2179
Mishra BK, Saini DK (2007) SEIRS epidemic model with delay for transmission of malicious objects in computer network. Appl Math Comput 188(2):1476–1482
Newman M (2018) Networks. Oxford University Press, Oxford
Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45(2):167–256. https://doi.org/10.1137/s003614450342480
Newsome J, Karp B, Song D (2005) Polygraph: automatically generating signatures for polymorphic worms. In: IEEE Symposium on security and privacy (S&P), pp 226–241. https://doi.org/10.1109/SP.2005.15
Ojha RP, Srivastava PK, Sanyal G et al (2021) Improved model for the stability analysis of wireless sensor network against malware attacks. Wirel Pers Commun 116(3):2525–2548
Ongun T, Spohngellert O, Miller BA et al (2021) PORTFILER: port-level network profiling for self-propagating malware detection. In: Proceedings of the 9th IEEE conference on communications and network security (CNS), pp 182–190
Pastor-Satorras R, Castellano C, Van Mieghem P et al (2015) Epidemic processes in complex networks. Rev Mod Phys 87:925–979. https://doi.org/10.1103/RevModPhys.87.925
Perko L (2013) Differential equations and dynamical systems, vol 7. Springer Science & Business Media, New York
Perumalla KS, Sundaragopalan S (2004) High-fidelity modeling of computer network worms. In: 20th Annual computer security applications conference. IEEE, pp 126–135
Prakash B, Chakrabarti D, Faloutsos M et al (2011) Threshold conditions for arbitrary cascade models on arbitrary networks. Knowl Inf Syst 33:537–546
Riley GF, Ammar MH, Fujimoto RM et al (2004) A federated approach to distributed network simulation. ACM Trans Model Comput Simul (TOMACS) 14(2):116–148
Sayama H (2015) Introduction to the modeling and analysis of complex systems. Open SUNY, New York
Szymanski BK, Liu Y, Gupta R (2003) Parallel network simulation under distributed genesis. In: Seventeenth workshop on parallel and distributed simulation, 2003. (PADS 2003). Proceedings. IEEE, pp 61–68
The Zeek Project (2023) Zeek network monitoring tool. https://docs.zeek.org/en/master/script-reference/log-files.html. Accessed 11 July 2022
Tong H, Prakash BA, Eliassi-Rad T et al (2012) Gelling, and melting, large graphs by edge manipulation. In: Proceedings of the 21st ACM conference on information and knowledge management (CIKM), pp 245–254
Toni T, Welch D, Strelkowa N et al (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface 6(31):187–202
Torres L, Chan K, Tong H et al (2021) Nonbacktracking eigenvalues under node removal: X-centrality and targeted immunization. SIAM J Math Data Sci 3:656–675
Toutonji OA, Yoo SM, Park M (2012) Stability analysis of VEISV propagation modeling for network worm attack. Appl Math Model 36(6):2751–2761
Tuncer N, Le TT (2018) Structural and practical identifiability analysis of outbreak models. Math Biosci 299:1–18
Vahdat A, Yocum K, Walsh K et al (2002) Scalability and accuracy in a large-scale network emulator. ACM SIGOPS Op Syst Rev 36(SI):271–284
Valdez LD, Aragão Rêgo H, Stanley HE et al (2015) Predicting the extinction of Ebola spreading in Liberia due to mitigation strategies. Sci Rep 5(1):12172
Van den Driessche P, Watmough J (2008) Further notes on the basic reproduction number. Math Epidemiol 59–178
Verhulst F (2006) Nonlinear differential equations and dynamical systems. Springer Science & Business Media, Utrecht
Vespignani A (2012) Modelling dynamical processes in complex socio-technical systems. Nat Phys 8(1):32–39
Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440–442
Wei S, Hussain A, Mirkovic J et al (2010) Tools for worm experimentation on the deter testbed. Int J Commun Netw Distrib Syst 5(1–2):151–171
Weitz JS, Dushoff J (2015) Modeling post-death transmission of Ebola: challenges for inference and opportunities for control. Sci Rep 5(1):8751
White B, Lepreau J, Stoller L et al (2002) An integrated experimental environment for distributed systems and networks. ACM SIGOPS Op Syst Rev 36(SI):255–270
Wikipedia (2023a) Colonial Pipeline ransomware attack. URL https://en.wikipedia.org/wiki/Colonial_Pipeline_ransomware_attack. Accessed 7 May 2022
Wikipedia (2023b) Petya and NotPetya. URL https://en.wikipedia.org/w/index.php?. Accessed 7 May 2022
Wikipedia (2023c) WannaCry ransomware attack. URL https://en.wikipedia.org/w/index.php?title=WannaCry_ransomware_attack&oldid=1086034703. Accessed 7 May 2022
Yao Y, Fu Q, Yang W et al (2018) An epidemic model of computer worms with time delay and variable infection rate. Secur Commun Netw 2018
Zheng Y, Zhu J, Lai C (2020) A SEIQR model considering the effects of different quarantined rates on worm propagation in mobile internet. Math Probl Eng
Zhu Q, Yang X, Ren J (2012) Modeling and analysis of the spread of computer virus. Commun Nonlinear Sci Numer Simul 17(12):5117–5124
Acknowledgements
We acknowledge Jason Hiser and Jack W. Davidson from University of Virginia for providing us access to the WannaCry attack traces.
Funding
Open access funding provided by Northeastern University Library. This research was sponsored by the U.S. Army Combat Capabilities Development Command Army Research Laboratory under Cooperative Agreement Number W911NF1320045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Combat Capabilities Development Command Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Author information
Contributions
NG and NP proposed the SIIDR model. AC, NG, and NP contributed to the methodology of the paper. AC and NG performed the experiments. All authors contributed to the discussion and writing of the paper, and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1 Compartmental models of epidemiology
SI model
The SI model is used to describe diseases where infection is permanent. It features two compartments and one transition. The susceptible compartment S represents healthy individuals who can become infected by interacting with infectious individuals in compartment I (\(S + I \rightarrow 2I\)). It translates into the following system of ODEs:
\[\frac{dS}{dt} = -\beta \frac{S I}{N}, \qquad \frac{dI}{dt} = \beta \frac{S I}{N}\]
Due to the homogeneous mixing assumption, the per capita rate at which susceptible individuals get infected can be written as the probability of interacting with an infected individual (I/N) times the transmission rate of the disease \(\beta\). The state diagram for the SI model is shown in Fig. 11.
SIS model
The SIS model features two compartments and two transitions. Besides the infection process of the SI model, the SIS model also has a recovery process: infected individuals spontaneously recover at rate \(\mu\) and become susceptible to the disease again (\(I \rightarrow S\)). Hence SIS models are used for diseases that can infect individuals multiple times. The system of ODEs associated with the SIS model is:
\[\frac{dS}{dt} = -\beta \frac{S I}{N} + \mu I, \qquad \frac{dI}{dt} = \beta \frac{S I}{N} - \mu I\]
Note that, unlike infection, the recovery process is spontaneous and does not require any interaction. Hence, each infected individual has an average duration of infection of \(\mu ^{-1}\). The state diagram for the SIS model is shown in Fig. 12.
SIR model
The SIR model describes diseases that give permanent (or long-lasting) immunity. It features three compartments and two transitions. Unlike in the SIS model, within the SIR framework infected individuals that are no longer infectious transition to the recovered compartment R at rate \(\mu\). The system of differential equations corresponding to the SIR model is the following:
\[\frac{dS}{dt} = -\beta \frac{S I}{N}, \qquad \frac{dI}{dt} = \beta \frac{S I}{N} - \mu I, \qquad \frac{dR}{dt} = \mu I\]
The state diagram for the SIR model is represented in Fig. 13.
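The SIR dynamics described above can be explored numerically. The following is a minimal sketch, assuming a forward-Euler discretization with a fixed step `dt`; the function name, the parameter values, and the choice of integrator are illustrative assumptions and are not taken from the paper.

```python
def integrate_sir(beta, mu, s0, i0, r0, dt, steps):
    """Forward-Euler integration of the homogeneous-mixing SIR equations."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    for _ in range(steps):
        # Infection needs contact (S * I / N); recovery is spontaneous (mu * I).
        new_infections = beta * s * i / n * dt
        new_recoveries = mu * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical parameter values, chosen only for illustration.
traj = integrate_sir(beta=0.5, mu=0.1, s0=990, i0=10, r0=0, dt=0.1, steps=1000)
```

Because every individual that leaves one compartment enters another, the total S + I + R stays constant along the trajectory, which is a quick sanity check for any implementation.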
SEIR model
The SEIR model describes diseases where susceptible individuals S, after interacting with infected individuals I, first enter an exposed state E and only later become infectious themselves. It features four compartments and three transitions. The system of differential equations corresponding to the SEIR model is the following:
The state diagram for the SEIR model is represented in Fig. 14.
Appendix 2 Identifiability of SIIDR transition rates
The SIIDR model can be represented as follows:
where \(t_0 \le t \le T\), \(\dot{X}(t)\) is a system of ODEs, X(t) is a vector of time-varying disease states and the unique solution to the system \(\dot{X}(t)\), \(\theta \in \Theta\) is a vector of constant unknown model parameters, Y(t) is a vector of time-dependent model outputs, g is the measurement equation which defines the relationship between X(t), \(\theta\) and Y, and \(X_0\) is a vector of the known initial conditions.
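For orientation, a representation consistent with the symbols defined above would be the generic state-space form below. This is a reconstruction under assumptions: the name \(f\) for the right-hand side of the ODE system is introduced here for illustration and does not come from the source.

```latex
\begin{aligned}
\dot{X}(t) &= f\big(X(t), \theta\big), \\
Y(t)       &= g\big(X(t), \theta\big), \\
X(t_0)     &= X_0 .
\end{aligned}
```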
Definition 7
A parameter \(\theta\) is structurally globally identifiable if \(\forall\) \(\theta ^* \in \Theta\):
Definition 8
A parameter \(\theta\) is structurally locally identifiable if \(\forall\) \(\theta ^* \in \Theta\), there exists a neighbourhood \(\Omega (\theta )\) such that
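The conditions in Definitions 7 and 8 can be written out as below. This follows the standard formulation of structural identifiability in terms of the model output and is offered as a sketch rather than the source's exact typesetting; writing the output as \(Y(t, \theta)\) to make its dependence on the parameters explicit is a notational assumption.

```latex
% Global identifiability: equal outputs force equal parameters.
Y(t, \theta) = Y(t, \theta^*) \;\Longrightarrow\; \theta = \theta^*

% Local identifiability: the same implication, restricted to a
% neighbourhood \Omega(\theta) of \theta.
\theta^* \in \Omega(\theta) \ \text{and} \ Y(t, \theta) = Y(t, \theta^*)
\;\Longrightarrow\; \theta = \theta^*
```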
A variety of methods exists to evaluate the structural and practical identifiability of parameters. In our work, we leveraged the method of differential algebra implemented in DAISY (Bellu et al. 2007) and SIAN (Hong et al. 2020; Ilmer et al. 2021) software to address the structural identifiability of SIIDR. We looked at the joint posterior distribution of SIIDR parameters to address the issue of practical identifiability. We discuss SIIDR identifiability results in the following subsections.
Differential algebra approach for structural identifiability
In this section, we show the results for structural identifiability of the SIIDR parameters obtained with the differential algebra approach implemented in the DAISY software. Figures 15 and 16 show the input and output of DAISY when the number of infected nodes is the output variable Y(t). Figures 17 and 18 show the DAISY results when the sum of infected, infected dormant, and recovered nodes is the output variable. When the size of the population N is known, we can exclude it from the ODEs and treat the ratio \(\beta /N\) as the unknown parameter. In both cases, all parameters of the SIIDR model are globally structurally identifiable. Figures 19 and 20 show the results when N is an unknown parameter. In this situation, the parameters \(\beta\) and N are not identifiable, whereas \(\mu , \gamma _1, \gamma _2\) remain identifiable.
Joint posterior distributions of SIIDR parameters
In this subsection, we illustrate the joint posterior distributions of the SIIDR parameters. The plots for the wc_4_500ms variant are in Figs. 21 and 22. In this case, the joint posterior distribution has multiple modes, which means that the parameter values are not uniquely identifiable. The results for wc_8_20s are illustrated in Figs. 23 and 24. In this situation, the \(\beta\) and \(\mu\) parameters are correlated. In Figs. 25 and 26 we show the results for the wc_1_5s variant. Here the \(\beta\) and \(\mu\) parameters are not correlated in the joint posterior distribution, and there is no multimodality.
Appendix 3 Linearization of SIIDR as NLDS
The Jacobian matrix \(\mathcal {J}\) at the equilibrium point \(P^*\) is defined as:
where \(\mathcal {J}_{i,j}=[\nabla g(P^*)]_{i,j}=\left. \frac{\partial g_i}{\partial p_j}\right| _{P=P^*}\).
We calculate the partial first order derivatives of our equation system and obtain the Jacobian matrix:
The size of the Jacobian matrix is \(4N \times 4N\), where N is the number of nodes in the graph. Every row has 4 elements of size \(N \times N\). We use the following notation: \(\mathbb {I}\) is the identity matrix of size \(N \times N\) and \(\mathbb {O}\) is a matrix of size \(N \times N\) with all zeros. \(\mathbb {A}\) is the adjacency matrix of the network represented as a graph, of size \(N \times N\).
The first row is a linear combination of the other rows, thus:
Let us represent the Jacobian matrix as follows:
where \(Q_1\), \(Q_2\), \(Q_3\), O are matrices of size \(N \times N\), \(2N \times N\), \(2N \times 2N\), \(2N \times N\) respectively:
Let \(\vec {v}\) of size \(3N \times 1\) and \(\lambda _J\) be the eigenvector and the eigenvalue of J respectively. Then we can define \(\vec {v}\) to be composed of \(\vec {v_1}\) of size \(N\times 1\) and \(\vec {v_2}\) of size \(2N \times 1\):
\(\vec {v}\) and \(\lambda _J\) satisfy the following equation:
which results in:
Eq. 23 implies that:
From Eq. 25 we have:

1. \(\vec {v_2} = \vec {0}\), or
2. \(\vec {v_2}\) is the eigenvector of \(Q_3\) and \(\lambda _J\) is the eigenvalue of \(Q_3\).
We look at the first case in more detail: if \(\vec {v_2} = \vec {0}\), from Eq. 24 we obtain that \(Q_1 \cdot \vec {v_1} = \lambda _J \cdot \vec {v_1}\). That means either: (a) \(\vec {v_1} = \vec {0}\), which is not feasible, because in this case \(\vec {v}= \vec {0}\), or (b) \(\lambda _J\) is the eigenvalue of \(Q_1\).
Thus, the eigenvalues of the Jacobian matrix are the eigenvalues of matrix \(Q_1\) (when \(\vec {v_2} = \vec {0}\)) together with the eigenvalues of matrix \(Q_3\). Given the structure of \(Q_1\) (i.e., the identity matrix of size \(N\times N\)), all eigenvalues of \(Q_1\) are equal to 1. Thus, we can conclude that the Jacobian matrix has at least one eigenvalue equal to 1.
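The argument above is an instance of a general fact: the spectrum of a block-triangular matrix is the union of the spectra of its diagonal blocks. The sketch below checks this numerically with NumPy. The block layout, the random stand-in for \(Q_3\), and the off-diagonal block named `coupling` are assumptions made for illustration; only the identity structure of \(Q_1\) is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # hypothetical number of nodes

q1 = np.eye(n)                           # identity block, as stated for Q_1
q3 = rng.normal(size=(2 * n, 2 * n))     # arbitrary stand-in for Q_3
coupling = rng.normal(size=(n, 2 * n))   # arbitrary off-diagonal block
zeros = np.zeros((2 * n, n))

# Block upper-triangular arrangement (an assumption about the layout).
jac = np.block([[q1, coupling], [zeros, q3]])

eig_j = np.linalg.eigvals(jac)
eig_blocks = np.concatenate([np.linalg.eigvals(q1), np.linalg.eigvals(q3)])


def same_spectrum(a, b, tol=1e-6):
    """True if every value in a has a match in b within tol, and vice versa."""
    gaps = np.abs(a[:, None] - b[None, :])
    return bool(gaps.min(axis=1).max() < tol and gaps.min(axis=0).max() < tol)


print(same_spectrum(eig_j, eig_blocks))
print(int(np.sum(np.isclose(eig_j, 1.0))))  # eigenvalues equal to 1
```

The identity block contributes N eigenvalues equal to 1 regardless of what the other blocks contain, which is the property the linearization relies on.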
Appendix 4 SIIDR stability as the system of NLDS
Theorem 4
The equilibrium points of SIIDR represented as NLDS of the form (11) are Lyapunov stable if:
where \(\lambda _A\) is the largest eigenvalue of the adjacency matrix, \(\tilde{\beta }\) and \(\mu\) are probabilities of infection and recovery respectively.
Proof
System (11) can be reduced to the first three equations because of linear dependency of \(P_{R,i,t+1}\) on other equations, and has the following representation in the matrix form:
where matrices C and \(\mathcal {P}^T_tBP_t\) of size \(3N \times 3N\) correspond to the linear and nonlinear part of the system, respectively. \(\mathcal {P^T} = \{\mathcal {P}_1^T,\mathcal {P}_2^T,\mathcal {P}_3^T\}\) is a \(3N \times 9N\) matrix, where \(\mathcal {P}_i^T\) is a \(3N \times 3N\) matrix with nonzero \(i_{th}\) row \(P_S, P_I, P_{I_D}\):
\(B = \{B_i\}_{i = 1}^3\) is a \(9N \times 3N\) matrix where \(B_i = \{b_{kl}\}_{k, l = 1}^3\) has the size of \(3N\times 3N\). Based on our system representation (11) matrix C is the following:
and matrix B is:
where \(\mathbb {A}\) is the adjacency matrix of the corresponding graph.
Let L be the continuous function equal to \(P^T K\), where K is the \(3N \times 1\) matrix:
Then
L is positive definite because it is equal to the sum, over all nodes in the graph, of the probabilities of being infected or infected dormant. The finite difference (13) in this case is equal to:
where:
and
thus,
which results in the condition:
or
where \(P_I\) is the \(1 \times N\) vector of node probabilities to be infected, \(P_S\) is the \(1 \times N\) vector of node probabilities to be susceptible, and \(\mathbb {A}\) is the adjacency matrix of the corresponding graph. Expression (27) means that the sum of probabilities of nodes to recover should be greater than the sum of probabilities of nodes to become infected at each time step for the equilibrium points of the system (11) to be Lyapunov stable.
As long as the maximum value of probabilities in the vector \(P_S\) is 1, it is true that:
So if we prove that:
the condition (27) will be satisfied.
This condition can also be formulated by incorporating the nodes’ degrees as follows:
where \(\mathbb {D}\) is the \(1 \times N\) vector where each element \(d_i\) is equal to the degree of the node i in the graph.
As long as the maximum value of probabilities in the vector \(P_I\) is 1, it is true that:
So if we prove that:
the condition (27) will be satisfied. Condition (32) can be rewritten as follows:
or
It is known that the largest eigenvalue \(\lambda _A\) has the following lower bound in the case of an arbitrary graph:
where \(d_{ave}\) is the average degree of the graph. Therefore it is true that
Hence if the following condition:
is satisfied, then the disease-free equilibrium (DFE) point will be Lyapunov stable on an arbitrary graph. \(\square\)
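The lower bound \(\lambda _A \ge d_{ave}\) used in the proof follows from the Rayleigh quotient of the adjacency matrix evaluated at the all-ones vector, which equals the average degree. The sketch below checks the bound on a small random undirected graph; the graph size, edge probability, and seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30  # hypothetical number of nodes

# Random symmetric 0/1 adjacency matrix without self-loops.
upper = np.triu(rng.random((n, n)) < 0.2, k=1)
adj = (upper | upper.T).astype(float)

degrees = adj.sum(axis=1)
d_ave = degrees.mean()

# eigvalsh is appropriate because adj is symmetric.
lambda_a = np.linalg.eigvalsh(adj).max()

print(d_ave, lambda_a, degrees.max())
```

The largest eigenvalue also never exceeds the maximum degree, so it is sandwiched between the average and the maximum degree of the graph.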
Appendix 5 Model fitting and parameter estimation
In this section, we present the methodology used to compare different epidemic models in reproducing real WannaCry attack traces. Our method leverages the Akaike Information Criterion (AIC; Akaike 1974) to select the model that best fits the spreading caused by WannaCry malware. We also discuss how we estimate the posterior distribution of the SIIDR transition rates using an Approximate Bayesian Computation approach based on Sequential Monte Carlo (ABC-SMC; Filippi et al. 2013; McKinley et al. 2018; Toni et al. 2009).
Model selection
We use the AIC as the guiding criterion to compare SIIDR to other epidemiological models, namely SI, SIS, and SIR. The AIC is calculated from the number of free parameters k and the maximized value L of the model's likelihood function as follows:
\[\mathrm {AIC} = 2k - 2\ln (L)\]
The first term introduces a penalty that increases with the number of parameters and thus discourages overfitting. The second term rewards the goodness of fit that is assessed by the likelihood function. For the likelihood function, we use the least squares estimation. The best model is the one with the lowest AIC. In the case of the least squares estimation, the AIC can be expressed as:
where:
and \(\hat{\epsilon }_i\) are the estimated residuals:
with \(I^{sim}_t\) being the cumulative number of infected nodes from model simulations, and \(I_t^{real}\) the cumulative number of infected nodes from realworld observations, at time interval t.
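A least-squares AIC can be computed directly from the two cumulative infection curves. The sketch below uses the common form \(2k + n \ln (\mathrm {RSS}/n)\), where RSS is the sum of squared residuals and n is the number of time intervals; this specific form is an assumption and may differ from the paper's expression by an additive constant. The function name is hypothetical.

```python
import math

import numpy as np


def aic_least_squares(simulated, observed, n_params):
    """AIC under least squares: 2k + n * ln(RSS / n).

    Assumes RSS > 0; a perfect fit would make the logarithm diverge.
    """
    simulated = np.asarray(simulated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    residuals = simulated - observed          # one residual per time interval
    rss = float(np.sum(residuals ** 2))       # residual sum of squares
    n = residuals.size
    return 2 * n_params + n * math.log(rss / n)


# Toy cumulative infection counts, invented for illustration only.
observed = [1, 2, 4, 8]
close_fit = [1, 2, 5, 9]    # RSS = 2
poor_fit = [1, 4, 8, 12]    # RSS = 36

print(aic_least_squares(close_fit, observed, n_params=1))
print(aic_least_squares(poor_fit, observed, n_params=1))
```

With the same number of parameters, the curve with the smaller residuals gets the lower AIC; with the same residuals, each extra parameter raises the score by 2.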
We use stochastic simulations (Higham 2001) to obtain a numerical approximation of the propagation process described by the system of ODEs. Generally, statistical methods such as stochastic simulations are a good approximation for larger systems, while in the case of smaller systems stochastic fluctuations become more important. The transitions among compartments are implemented through chain binomial processes (Abbey 1952). At step t the number of entities in compartment X transiting to compartment Y is sampled from a binomial distribution \(Pr^{Bin}(X(t), p_{X \rightarrow Y}(t))\), where \(p_{X \rightarrow Y}(t)\) is the transition probability. If multiple transitions can happen from X (e.g., \(X \rightarrow Y\), \(X \rightarrow Z\)), a multinomial distribution is used (e.g., \(Pr^{Mult}(X(t), p_{X \rightarrow Y}(t), p_{X \rightarrow Z}(t))\)).
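The chain binomial scheme described above can be sketched in a few lines with NumPy. The example uses the SIR model from Appendix 1 to keep it short. The conversion from rates to per-step probabilities, \(p = 1 - e^{-\text {rate} \cdot dt}\), is an assumption, as are the function names and parameter values; the helper `split_competing` shows the multinomial draw for a compartment with several possible exits.

```python
import numpy as np


def chain_binomial_sir(beta, mu, n, i0, steps, dt=1.0, seed=0):
    """Stochastic SIR where each transition count is a binomial draw."""
    rng = np.random.default_rng(seed)
    s, i, r = n - i0, i0, 0
    history = [(s, i, r)]
    for _ in range(steps):
        # Per-step transition probabilities (assumed rate-to-probability map).
        p_inf = 1.0 - np.exp(-beta * i / n * dt)
        p_rec = 1.0 - np.exp(-mu * dt)
        new_inf = rng.binomial(s, p_inf)   # S -> I
        new_rec = rng.binomial(i, p_rec)   # I -> R
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        history.append((s, i, r))
    return np.array(history)


def split_competing(rng, count, probs):
    """Multinomial draw for several exits from one compartment.

    The last multinomial category is 'stay in place' and is dropped.
    """
    stay = 1.0 - sum(probs)
    draws = rng.multinomial(count, list(probs) + [stay])
    return draws[:-1]


hist = chain_binomial_sir(beta=0.5, mu=0.1, n=1000, i0=10, steps=100)
exits = split_competing(np.random.default_rng(0), 100, [0.2, 0.3])
```

Drawing both exits from a single multinomial, instead of two independent binomials, guarantees that no more individuals leave a compartment than it currently holds.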
The model selection methodology is summarized in Algorithm 1. We start by creating a uniform grid of possible parameter values (lines 2–5). For each model and each set of parameter values \(p =(\beta ,\mu ,\gamma _1,\gamma _2)\) we perform several stochastic experiments simulating the model dynamics (the run_stochastic_avg procedure). Each stochastic realization consists of a time series, where \(S(t), I(t), I_D(t), R(t)\) represent the number of nodes in each state at time interval t during the simulation. The cumulative infection \(I_{sim}\) consists of the total number of nodes in states \(I, I_D\), and R, and is also a time series across all time intervals dt. Next, we compute the AIC using equation (38) by comparing the simulated dynamics to the observed ones. We select the minimum AIC score for each model; the best model is the one with the minimum AIC score overall.
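The grid search at the core of this procedure can be sketched as follows. To keep the example deterministic and short, a discrete-time SIR recursion stands in for the averaged stochastic runs, the grid covers only \((\beta , \mu )\), and the "observed" curve is synthetic. All function names, grid values, and the least-squares AIC form \(2k + n \ln (\mathrm {RSS}/n)\) are assumptions made for illustration; this is not the paper's Algorithm 1.

```python
import itertools
import math


def simulate_cumulative(beta, mu, n=1000, i0=10, steps=40):
    """Cumulative infections (N - S) from a discrete-time SIR recursion."""
    s, i = float(n - i0), float(i0)
    cumulative = [n - s]
    for _ in range(steps):
        new_inf = beta * s * i / n
        new_rec = mu * i
        s -= new_inf
        i += new_inf - new_rec
        cumulative.append(n - s)
    return cumulative


def aic(sim, obs, k):
    """Least-squares AIC; RSS is floored to avoid log(0) on a perfect fit."""
    rss = sum((a - b) ** 2 for a, b in zip(sim, obs))
    n = len(obs)
    return 2 * k + n * math.log(max(rss, 1e-12) / n)


def select_parameters(observed, betas, mus):
    """Score every grid point and return (best AIC, best parameters)."""
    scored = [
        (aic(simulate_cumulative(b, m), observed, k=2), (b, m))
        for b, m in itertools.product(betas, mus)
    ]
    return min(scored)


# Synthetic 'observed' trace generated from a known grid point.
observed = simulate_cumulative(0.4, 0.1)
best_score, best_params = select_parameters(
    observed, betas=[0.2, 0.3, 0.4, 0.5], mus=[0.05, 0.1, 0.2]
)
print(best_params)
```

Repeating the same search for each candidate model and comparing the per-model minima gives the overall selection described in the text.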
SIIDR Parameters associated with the best AIC score
In Table 7 we show the SIIDR parameters associated with the minimum AIC score for all WC variants.
Posterior distribution of transition rates
To find the best set of parameters for the SIIDR model we can approximate the posterior distribution of the parameters using Approximate Bayesian Computation (ABC) techniques (Minter and Retkute 2019). These techniques are based on the Bayes rule for determining the posterior distribution of parameters given the data:
\[P(\theta \mid D) = \frac{P(D \mid \theta ) P(\theta )}{P(D)} \qquad (40)\]
where \(P(\theta )\) is the prior distribution of parameters that represents our belief about them and \(P(D \mid \theta )\) is the likelihood function, i.e., the probability density function of the data given the parameters. The marginal likelihood of the data P(D) does not depend on \(\theta\), and therefore the posterior distribution \(P(\theta \mid D)\) is proportional to the numerator in (40).
ABC methods are useful when the likelihood function is unknown or is not feasible to estimate analytically. The simplest version of ABC techniques is called the rejection algorithm and is illustrated in Algorithm 2. Despite its simplicity, the rejection algorithm is generally slow to converge. Indeed, each iteration is independent of the previous ones, and the prior distribution from which parameters are sampled is never updated. Furthermore, it is often difficult to decide, a priori, a reasonable threshold value \(\epsilon\) that guarantees both fast convergence and accurate results.
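The rejection scheme can be sketched as below: draw a parameter from the prior, simulate, and keep the draw only if the simulated curve lies within \(\epsilon\) of the data. The toy setting is an assumption made for illustration: a deterministic discrete-time SI curve with a single parameter \(\beta\), a uniform prior on [0, 1], and root-mean-square error as the distance. None of these choices, nor the function names, come from the paper's Algorithm 2.

```python
import numpy as np

N_NODES, I0, STEPS = 1000, 10, 30  # hypothetical toy setting


def simulate_si(beta):
    """Deterministic discrete-time SI curve of infected counts."""
    i = float(I0)
    curve = [i]
    for _ in range(STEPS):
        i += beta * i * (N_NODES - i) / N_NODES
        curve.append(i)
    return np.array(curve)


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


def abc_rejection(observed, eps, n_accept, rng, max_tries=100_000):
    """Keep prior draws whose simulated curve is within eps of the data."""
    accepted = []
    for _ in range(max_tries):
        beta = rng.uniform(0.0, 1.0)          # sample from the prior
        if rmse(simulate_si(beta), observed) <= eps:
            accepted.append(beta)             # approximate posterior sample
            if len(accepted) == n_accept:
                break
    return np.array(accepted)


observed = simulate_si(0.3)  # synthetic data with a known beta
posterior = abc_rejection(observed, eps=20.0, n_accept=50,
                          rng=np.random.default_rng(0))
print(posterior.mean(), posterior.std())
```

Most draws are rejected because the prior is never updated, which is the inefficiency that motivates sequential variants.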
As an alternative to the rejection algorithm, we use here a more advanced ABC technique that leverages Sequential Monte Carlo (ABC-SMC) (Toni et al. 2009; McKinley et al. 2018). The ABC-SMC approach iteratively constructs generations of prior distributions by decreasing the rejection threshold over time. At the first generation, a given number of parameter sets (i.e., particles) is accepted from the starting prior distribution, while each prior distribution used in following generations is obtained as a weighted sample from the previous generation \(\theta ^*\) perturbed through a kernel \(K(\theta \mid \theta ^*)\). Common choices for the kernel are the uniform and multivariate normal distributions. A kernel with a large variance prevents the algorithm from getting stuck in local modes, but causes a large number of particles to be rejected, which is inefficient. Therefore, we use the multivariate normal distribution, where the covariance matrix is calculated considering the M nearest neighbors (MNN) of the particles from the previous generation (Filippi et al. 2013). The ABC-SMC-MNN algorithm is illustrated in Algorithm 3.
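A one-parameter version of the sequential scheme can be sketched as follows. It is a simplification under stated assumptions: the perturbation kernel is a Gaussian whose variance is twice the weighted variance of the previous generation, which is a simpler stand-in for the M-nearest-neighbors covariance and not the paper's Algorithm 3. The toy SI simulator, the uniform prior on [0, 1], the threshold schedule, and all function names are likewise illustrative assumptions.

```python
import numpy as np

N_NODES, I0, STEPS = 1000, 10, 30  # hypothetical toy setting


def simulate_si(beta):
    i = float(I0)
    curve = [i]
    for _ in range(STEPS):
        i += beta * i * (N_NODES - i) / N_NODES
        curve.append(i)
    return np.array(curve)


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


def abc_smc(observed, thresholds, n_particles, rng, low=0.0, high=1.0):
    particles, weights, sigma = None, None, None
    for eps in thresholds:  # decreasing rejection thresholds
        if particles is not None:
            mean = np.average(particles, weights=weights)
            var = np.average((particles - mean) ** 2, weights=weights)
            sigma = np.sqrt(2.0 * var)  # kernel width (stand-in for MNN)
        new_p, new_w, count = np.empty(n_particles), np.empty(n_particles), 0
        while count < n_particles:
            if particles is None:
                theta = rng.uniform(low, high)      # first generation: prior
                w = 1.0
            else:
                theta = rng.normal(rng.choice(particles, p=weights), sigma)
                if not low <= theta <= high:
                    continue                        # zero prior density
            if rmse(simulate_si(theta), observed) > eps:
                continue
            if particles is not None:
                # Weight = prior / kernel mixture; constants cancel on
                # normalization because the prior is uniform.
                kernel = np.exp(-0.5 * ((theta - particles) / sigma) ** 2)
                w = 1.0 / np.sum(weights * kernel)
            new_p[count], new_w[count] = theta, w
            count += 1
        particles, weights = new_p, new_w / new_w.sum()
    return particles, weights


observed = simulate_si(0.3)
particles, weights = abc_smc(observed, thresholds=[100.0, 50.0, 20.0],
                             n_particles=100, rng=np.random.default_rng(0))
print(np.average(particles, weights=weights))
```

Each generation proposes new particles near previously accepted ones, so far fewer simulations are wasted at the tightest threshold than with plain rejection.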
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Chernikova, A., Gozzi, N., Perra, N. et al. Modeling self-propagating malware with epidemiological models. Appl Netw Sci 8, 52 (2023). https://doi.org/10.1007/s41109-023-00578-z
DOI: https://doi.org/10.1007/s41109-023-00578-z