 Research
 Open Access
 Published:
Modeling and maximizing influence diffusion in social networks for viral marketing
Applied Network Science volume 3, Article number: 6 (2018)
Abstract
Modeling influence diffusion in social networks is an important challenge. We investigate influencediffusion modeling and maximization in the setting of viral marketing, in which a node’s influence is measured by the number of nodes it can activate to adopt a new technology or purchase a new product. One of the fundamental problems in viral marketing is to find a small set of initial adopters who can trigger the most further adoptions through wordofmouthbased influence propagation in the network. We propose a novel multiplepath asynchronous threshold (MAT) model, in which we quantify influence and track its diffusion and aggregation. Our MAT model captures not only direct influence from neighboring influencers but also indirect influence passed along by messengers. Moreover, our MAT framework models influence attenuation along diffusion paths, temporal influence decay, and individual diffusion dynamics. Our work is an important step toward a more realistic diffusion model. Further, we develop an effective and efficient heuristic to tackle the influencemaximization problem. Our experiments on four reallife networks demonstrate its excellent performance in terms of both influence spread and time efficiency. Our work provides preliminary but significant insights and implications for diffusion research and marketing practice.
Introduction
People live in various social networks, and share information and ideas with friends in the form of wordofmouth (WOM) communication. New technologies and various social media rapidly penetrate into every aspect of our daily life, and provide us new channels and great convenience to exchange information and express opinions. They disseminate massive volumes of information over different social media, and spread influence to each other. As social media becomes prevalent, its influence on business, politics and society becomes evident and significant. How new innovations, behaviors, and diseases spread through social networks has a long history of study in social sciences. Research in this area has exploded and drawn considerable attention from many disciplines over the last decade. Many models of information and influence diffusion have been proposed for a wide variety of applications, such as viral marketing (Kempe et al. 2003; Bhagat et al. 2012), cascading behavior and prediction (Leskovec et al. 2007; Cheng et al. 2014), information spreading (Morales et al. 2014), outbreak detection (Leskovec et al. 2007), etc. In this paper, we look into modeling influence diffusion in the setting of viral marketing in business.
In essence, viral marketing is a process of influence diffusion over social networks. An effective viral marketing campaign requires that marketers identify individuals with high social networking potential. A general setting can be depicted as follows: a company would like to market a new product, in the hopes that it would be adopted by as many people as possible in a target social network. The company chooses a small number of influential individuals as the initial adopters/seeds by giving them free/discounted samples of the product and encourages them to recommend the product to their friends, hoping that their friends would be influenced to purchase the product, then influence their friends, and so on. As a result, the influence would propagate through the network and trigger widespread adoption in the end of the diffusion process. Viral marketing is termed to describe such a marketing technique that induces the users in a social network to pass on a marketing message (viral ad) to others so as to achieve the largest influence spread in terms of product sales or brand awareness.
Viral marketing is driven by WOM communication and enhanced by network effects. Surveybased statistical research has shown very strong support for the hypothesis that network linkage can directly affect product/service adoption (Hill et al. 2006; Iyengar et al. 2011). The crucial factor to the success of viral marketing is for the marketers to identify the most influential set of initial adopters. Researchers investigated different approaches to seeding campaigns for marketing practice (Hill et al. 2006; Iyengar et al. 2011; Libai et al. 2013; Rand and Rust 2011; Peres 2014). These studies significantly enhance our understanding of WOM behavior and effects on promoting viral marketing. However, the approach these researchers commonly use or rely on is explanatory modeling. They apply various field experiments and/or statistical models to data for testing some causal hypotheses or correlations between variables. For example, researchers find support for the hypothesis that referrals from strong ties are more influential in receivers’ decisionmaking than those from weak ties (Brown and Reinegen 1987), and that highdegree seeding remains the most successful strategy (Hinz et al. 2011). Nevertheless, can we create a robust seeding strategy that outperforms the highdegree seeding? Is it possible to find the optimal or suboptimal seeding strategy? Can we estimate the cascade of adoption given any set of initial adopters? Most of these qualitative approaches and empirical studies fail to give answers to these important questions in an operational manner due to the lack of an explicit/realistic influencediffusion model.
While most marketing researchers focus on investigating WOM diffusion processes using various explanatory models and/or qualitative methods, researchers from computer science and other related fields aim to explicitly build influencediffusion models so as to maximize and/or predict influence spread. Influence maximization as an important algorithmic technique for viral marketing was first posed by Domingos and Richardson (2001, 2002), in which they applied Markov random fields to model the influence among customers and then choose the best marketing plan to maximize the profit. In their seminal paper (Kempe et al. 2003), Kempe et al. formulate it as a discrete optimization problem: given a social network, a stochastic diffusion model, and a number K (also called budget), the objective is to find the seed set of K initial adopters (of a new product) who can trigger the largest cascade of further adoptions, in which the influence diffusion process unfolds in discrete time steps as described by the diffusion model.
This influencemaximization (IM) problem has attracted a great deal of attention and extensive studies. To address this problem, two key components are indispensable. First, we need to build a realistic influencediffusion model that captures in detail the diffusion process and the activation mechanism of WOM marketing. Second, we need to develop an effective and efficient algorithm that enables us to find the optimal or suboptimal seed set under the diffusion model. Kempe et al. (2003) approach the problem under two widelystudied stochastic diffusion models: independent cascade (IC) model and linear threshold (LT) model. They show that this optimization problem is NPhard for both the IC and LT models, and provide the first provable approximation guarantees for heuristic algorithms. Their work laid theoretical and algorithmic foundations for understanding influence diffusion and addressing the IM problem. Since then, there has been a large amount of followup work in two main directions: (1) extensive research has been done to study various extensions to the IC and LT models (Bhagat et al. 2012; Goyal et al. 2010; Chen et al. 2011; Borodin et al. 2010; He et al. 2012; Gayraud et al. 2015); (2) a large number of algorithms have been proposed to improve efficiency in finding the Kseed set (Leskovec et al. 2007; Chen et al. 2010; Goyal et al. 2011b; Jiang et al. 2011; Wang et al. 2012). These extensions are the substantial complement of the classic IC and LT models. However, there exist some significant limitations in these models.
First, they approach a social entity’s adoption likelihood from a confined scope, considering only the direct influence from the activated neighbors of the entity. However, a social entity’s adoption decision could be influenced by other adopters who are not social neighbors (Fang et al. 2013), e.g. through structural equivalence (Burt 1987) and threedegreesofinfluence (Christakis and Fowler 2007). In reality, as described in the network coproduction model (Kozinets et al. 2010), WOM messages do not flow unidirectionally from seeded consumers to others, but rather may be exchanged among any connected consumers in the network. In other words, a node in the network could be influenced and activated by inactive neighbors who pass on the influence from other influencers. The IC and LT models fail to capture such indirect influence passed along by messengers who may or may not be activated. The findings of Fang et al. (2013) suggest that diffusion models relying exclusively on direct influence are limited in predictive power. We argue that messengers play an important role in influence diffusion for viral marketing, which should be taken into account to build a more realistic influencediffusion model.
Second, both IC and LT models have an undesirable timeinvariant property (Chen et al. 2013). Delaying the activation attempts of any node in the IC model and delaying adding the influence weights of some outlinks of a newly activated node in the LT model would not change the distribution of the final active set. This timeinvariant property is not realistic. Goyal et al. (2010) measure the number of actions that propagate between pairs of neighbors in Flickr at different time intervals. They find that there is an exponential decay in the number of actions propagated as the time elapses. In practice, the time constraint and spreading speed are always critical concerns of marketers since they are closely related to profit and competition. In fact, influence not only decays with time but also attenuates along diffusion paths. As indicated in the threedegreesofinfluence phenomenon (Christakis and Fowler 2007), while it keeps propagating up to a social horizon of three hops, peer influence dissipates along the diffusion path. Supportive experimental studies have continued to appear (Morales et al. 2014; Bond et al. 2012; Liu et al. 2012).
Third, the IC and LT models (and most generalizations) fail to capture the individual temporal diffusion dynamics. In the IC model, each newly activated node gets a single chance to simultaneously execute an activation attempt to each of its inactive outneighbors in one time step. In the LT model, an inactive node checks each of its inneighbors at the same time to compute the total influence weight in any time step. Both of them are synchronous diffusion models. In fact, WOM communication generally takes place in an asynchronous way. Some people are much more active in WOM messaging than others, and people communicate much more frequently with their family members and close friends than regular friends. Even if a person sends out a message simultaneously to all her friends on Twitter, they may log in and check out the message at different times. As shown by Iribarren and Moro in their viral email experiment (Iribarren and Moro 2009), there is large heterogeneity in individual human response time which has great impact on the dynamics of collective information diffusion.
There is another subtle but more fundamental issue in existing models. WOM’s effectiveness as an information source for consumers can be broken down into WOM’s reach and impact. Unlike epidemic spreading in which each exposure acts independently, WOM’s impact is usually derived from social reinforcement, where repeated exposures continue to have large marginal effects on adoption (Centola 2010; Romero et al. 2011). We thus consider product adoption as a threestage process of influence accumulation. The first stage is awareness, where an inactive entity gets exposed to the WOM message. The second stage is aggregation, in which the entity gets reinforced with more and more WOM exposures. The last stage is activation, where the entity adopts the product when the accumulated influence is greater than a certain threshold of the entity. We refer to this procedure as the 3A process, and attempt to explicitly reveal this process in an incrementally aggregate manner.
We aim at developing a more realistic influencediffusion model to address these issues. We arrive at a multiplepath asynchronous threshold (MAT) model. In this model, we naturally integrate the three stages of the 3A process, and quantify in an incremental manner the aggregate consequences from informational influence to activation on the basis of complex WOM communication. We consider both direct and indirect influence, take into account influence attenuation along diffusion paths and influence decay with time, and model the individual temporal diffusion dynamics using a contactfrequencybased Poisson process. Further, we develop a novel approximation algorithm to address the influencemaximization problem under our MAT model, and conduct extensive experiments on four reallife networks. Our work provides preliminary but important insights and implications for viral marketing and diffusion research in other fields.
Related research
Viral marketing is an important and costeffective marketing technique in business. There are a large number of studies in the literature addressing influence diffusion and viral marketing. Detailed surveys can be found in (Chen et al. 2013; Kempe et al. 2015; AlSumaidan and Ykhlef 2016). Here we focus on several representative models and algorithms most relevant to our work.
Diffusion models
Kempe et al. published a seminal paper on maximizing influence spread in social networks (Kempe et al. 2003). They formulate influence maximization as a discrete optimization problem, and investigate the stepbystep dynamics of adoption under two basic diffusion models: the independent cascade (IC) model and the linear threshold (LT) model.
The IC model takes the social network G=(V,E), the influence probability p(·) on each link, and the initial seed set S_{0} as input, and the influencediffusion process unfolds in discrete time steps according to the following randomized propagation rule. At each time step t_{ i } (i≥0), each newlyactivated node u is given a single chance to activate each currently inactive outneighbor v independently. It succeeds with the prespecified influence probability p_{ uv }. If successful, then v becomes active in step t_{i+1}. But whether or not u succeeds, it cannot make any further attempts to activate v in subsequent steps. The diffusion process runs until no new nodes are activated. Such diffusion behavior is referred to as simple contagion in social science, in the sense that activation could be triggered by a single influencer independently. This model is suited for epidemic spread since an individual may become infected once exposed to the virus. For viral marketing, it does not fit very well. With respect to the 3A process described previously, the IC model reflects the brandawareness stage, but fails to explicitly capture the aggregation or activation stages since most people cannot be convinced to adopt a new product after a single WOM interaction.
In the LT model, every link (u→v) of the social network G is associated with an influence weight b_{ uv }∈ [0,1], indicating the influence significance of u on v. Before the diffusion process starts, each node v is assigned a threshold θ_{ v } uniformly at random in the range [ 0,1]. The threshold can be interpreted as the personal latent tendency of a node to adopt a new product. Once the initial seed set S_{0} is given, the diffusion process unfolds deterministically in discrete time steps. In any step, an inactive node v becomes active if and only if the total influence weight of its active inneighbors \(N_{v}^{in}\) is greater than or equal to its threshold θ_{ v }, i.e., \(\sum _{u\in N_{v}^{in}} b_{uv}\geq \theta _{v}.\) The diffusion process continues until no further activation is possible. As opposed to the IC model, the LT model captures complex contagion, in which an individual is usually not activated until she receives positive reinforcement from multiple influencers. The LT model is more suitable for viral marketing than the IC model in this regard.
These two classic models lie at the core of a large number of generalizations in the subsequent work. Bhagat et al. (2012) adapt the LT model to what they called linear threshold with color (LTC) model. They show that there exist tattlers who are activated without adopting the product themselves. These tattlers serve as information bridges in influence propagation and significantly affect product adoption. It is no doubt that distinguishing tattlers from adopters is an important step toward a more realistic model. However, a tattler does not have to be activated in order to spread influence in reality. For example, Tom bought an iPhone, and told his friend Jeff that it is cool and he likes it. When Jeff chats with their friend Nick at lunch, he simply tells Nick that Tom just bought an iPhone and he really likes it. Jeff does not have to be activated or form his opinion about the iPhone, but he does pass Tom’s influence on to Nick indirectly. Just like Jeff, many tattlers are simply messengers who pass WOM messages on to their friends and implicitly pass indirect/passive/informational influence around. This is an important feature of WOM communication, as described in the network coproduction model (Kozinets et al. 2010). In fact, messengers play an important role in WOM marketing, and should be distinguished and included explicitly in influencediffusion modeling.
To incorporate the temporal aspects observed in diffusion dynamics, Chen et al. (2012) investigate timecritical influence maximization. They associate a degreebased or randomlyselected meeting probability to each edge to reflect the timedelayed diffusion dynamics. It is a realistic step to incorporate temporal diffusion dynamics of individual nodes. However, this model has limitations on the following two respects. First, the meeting probability would be meaningless if there is no deadline constraint or the deadline constraint is set to a relatively large number. As noted in previous section, the IC model has a timeinvariant property, i.e., delaying the activation attempts of any nodes would not change the distribution of the final active set. In other words, the meeting probability has an effect in their model only because of the deadline constraint, which apparently underestimate the impact of individual temporal dynamics on influence diffusion. We argue that it would make more sense to associate the temporal diffusion dynamics with the temporal influence decay. Second, they fail to capture the contact frequency in their meeting probability. For most weighted social networks, the weight on a link is a natural and important indicator that describes the strength of the relationship and/or the contact frequency. A realistic diffusion model should incorporate the weight information to better capture the individual temporal diffusion dynamics.
There are many other extensions of the classic IC and LT models to different scenarios. Chen et al. (2011) extend the IC model to incorporate the emergence and propagation of negative opinions. Chen et al. (2015) adapt the IC model to the topicaware influence propagation, in which the influence probability on each link is determined by the respective topic distribution. The LT model has been extended to address the influencemaximization problem under competition (Borodin et al. 2010; He et al. 2012). Gayraud et al. (2015) extend both the IC and LT models to study influence diffusion in dynamic networks. However, all the aforementioned models are probabilistic models. None of them explicitly quantify the influence or directly compute the aggregation of influence to reveal the 3A process from awareness →aggregation →activation.
Influence maximization
The IM problem is formally described as follows: Given a social network G=(V,E) with a total of n nodes (i.e., V=n), a stochastic diffusion model on G, and a number K(K≪n) that specifies the number of seed nodes (e.g., initial adopters of a new product), find the seed set of K nodes to be activated first so that they can trigger the largest cascade of further adoptions in the network. More precisely, let S_{0} denote a seed set, ϕ(S_{0}) denote the final active set generated by the stochastic diffusion model, and σ(S_{0})=E(ϕ(S_{0})) denote the influence spread of seed set S_{0}, which is the expected size of the final active sets of all random runs of influence diffusion under the given diffusion model. Then the IM problem is to find the optimal seed set S^{∗} such that
The IM problem is NPhard for both the IC and LT models (Kempe et al. 2003). Moreover, the influence spread function σ(·) in both the IC and LT models has two important properties: monotonicity and submodularity. A set function f:2^{V}→R is monotone if for any subsets S⊆T⊆V, f(S)≤f(T), which in this context means that adding more nodes to a seed set cannot reduce the size of the final activated set. The set function f is submodular if for any subsets S⊆T⊆V and any element u∈V∖T, f(S∪{u})−f(S)≥f(T∪{u})−f(T). Submodularity can be understood as diminishing marginal return, which in this context means the marginal gain of node u when added to a seed set T cannot exceed the marginal gain when adding it to a subset S⊆T. It is shown that the influence spread function σ(·) for any general cascade model or general threshold model is always monotone, but the general models may not satisfy the submodularity property (Chen et al. 2013). As an example, consider the fixed threshold model on a simple graph that contains only three nodes {u,v,i} and two links {(u→i), (v→i)} with influence weight b_{ ui }=b_{ vi }=0.5. In a fixed threshold model, each node has a fixed threshold instead of a threshold uniformly selected at random for each run of influence diffusion. For this example, node i has a fixed threshold θ_{ i }=0.8. Then we can calculate the influence spread and obtain: σ(∅)=0, σ({u})=σ({v})=1 (since b_{ ui }=b_{ vi }=0.5<0.8=θ_{ i }, node i is not activated), and σ({u,v})=3 (since b_{ ui }+b_{ vi }=1>0.8=θ_{ i }, node i is activated). Now let us set S=∅ and T={u}. Then we have σ(S∪{v})−σ(S)=1<2=σ(T∪{v})−σ(T). It indicates the marginal gain of node v when added to T exceeds the marginal gain when adding it to S (S⊆T). Therefore, the influence spread function σ(·) for this fixed threshold model is not submodular.
After exploiting these properties, Kempe et al. present a simple greedy hillclimbing algorithm to find the “optimal”Knode seed set. As it relies on Monte Carlo (MC) simulation to compute the influence spread, we refer to it as MCGreedy hereafter. As shown in Algorithm 1, MCGreedy starts with an empty seed set (Line 2), and runs K rounds to generate a set of K seed nodes. In each round, it sweeps over each node u∈V∖S to compute the influence spread σ(S∪{u}). Since the exact computation of influence spread under the IC and LT models is ♯Phard (Chen et al. 2010), they run MC simulation R times to estimate the influence spread (Line 610). Finally, the node that together with current seed set S generates the maximum influence spread is added to S (Line 12).
For any model (such as the IC and LT models) in which the influence spread function σ(·) is monotone and submodular, MCGreedy approximates the optimal within a factor of (1−1/e−ε) for any ε>0. This is a performance guarantee at least 63% of optimal. However, this algorithm suffers from a very high computational cost. It requires O(nK) evaluations of influence spread, and each evaluation needs to run R simulations of the diffusion process. Not only is each simulation generally timeconsuming, but R also has to be large enough to maintain the effectiveness of the algorithm. This performance inefficiency makes it infeasible even for mediumsized networks of tens of thousands of nodes and edges.
To improve its efficiency, one strategy is to reduce the number of influencespread evaluations. For example, Leskovec et al. (2007) exploit submodularity and present a CostEffective Lazy Forward (CELF) scheme that significantly reduces the number of evaluations. Unfortunately, CELF and its extensions CELF++ (Goyal et al. 2011a) are still not scalable due to the overhead of Monte Carlo simulation. Therefore new heuristic algorithms that avoid the simulation are needed. This strategy requires exploiting specific aspects of the diffusion model and the network structure. Wang et al. (2012) propose a maximum influence arborescence (MIA) algorithm for the IC model. The main idea is to use local arborescence structures of each node to approximate the influence propagation. Chen et al. (2010) present a local directed acyclic graph (LDAG) algorithm for the LT model, in which they construct a LDAG for each node and restrict influence diffusion to a node only through its LDAG. Goyal et al. (2011b) present a more efficient heuristic called SIMPATH for the LT model. This algorithm is built upon the CELF optimization.
Methodology
In our previous work (Wang and Street 2014; 2015a), we develop a reachabilitybased influencediffusion model to exploit the implicit knowledge of influencebased connectivity and vertexpair similarity encoded in the network graph topology. This model has been successfully used to uncover influence centrality and community structure in social networks (Wang and Street 2015a, b). In this reachabilitybased model, a node’s influence significance is measured by the total amount of information the node spreads over its neighborhood within a prespecified depth limit. Roughly speaking, it focuses on information’s visibility (e.g., brand awareness) regardless of information’s effect (e.g., adoption). In this paper, we investigate activationbased influencediffusion modeling for viral marketing, in which a node’s influence significance is measured by the total number of nodes it can activate to adopt a new technology or purchase a new product. We adapt our reachabilitybased influencediffusion model to the viralmarketing scenario and propose a novel, more realistic activationbased influencediffusion model. We call it the multiplepath asynchronous threshold (MAT) model, in which we quantify influence and track its diffusion and aggregation. We integrate direct and indirect influence, and instantiate the 3A process to model complex WOM communication and its effects. Moreover, we naturally incorporate in our MAT model the influence attenuation along diffusion paths, influence decay with time, and individual temporal diffusion dynamics based on the relationship strength or contact frequency between a node of interest and its neighbors. Further, we develop an effective and efficient heuristic, called IVGreedy, to address the influencemaximization problem.
MAT Model
Our model is applicable to both undirected/directed and unweighted/weighted networks. Without loss of generality, let G=(V,E,W) be a directed and weighted network with V=n and E=m. The weight w_{ uv }∈W associated with a directed edge (u→v) represents the relationship strength of node u over node v, e.g., the frequency that u calls or emails v. The relationship strength is usually asymmetric in directed networks. For any unweighted network, we regard it as a weighted network with a weight of 1 on each link. For any undirected edge between two nodes, we replace it with a pair of directed links pointing to each other, associated with the same weight as the original weight on the undirected edge. The influence is propagated through outlinks.
We categorize all the nodes in the network into two types: influencers and messengers. An influencer is an active node (product adopter) that can originate and spread its influence in the network. A messenger is an inactive node that may acquire influence and also pass on to her friends the influence it receives from influencers or other messengers. Once a messenger acquires influence that is greater than or equal to its threshold, it is activated and turns into an influencer who starts to spread out its own influence to others. It is noted that an influencer can not only diffuse its own influence to others but also act as a messenger passing along the influence from other influencers. Our model captures an important characteristic of WOM communication (Kozinets et al. 2010) in that: anyone (either active or inactive) can pass along WOM messages and potentially influence the recipient. In other words, a node can be activated by not only direct influence from its active neighbors (influencers) but also indirect influence passed along by its inactive neighbors (messengers). This is a distinguishing, more realistic feature built in our MAT model.
From a graphtheoretic point of view, the influence diffusion from each influencer can be regarded as a batch of random walks around the neighborhood of the influencer. Each walk explores one diffusion path from the influencer and passes on the influence to the nodes it visits along the path. All walks together constitute an influencediffusion tree with the influencer as the root node. To model the complex WOM communication in a more realistic manner, we incorporate in the diffusion tree/process of each influencer several important rules/mechanisms as follows:

No cycles or loops are allowed. It makes sense since no one should repeatedly send out the same WOM message to her friends when the message she sent out circulates back to herself;

A node may be visited multiple times along different diffusion paths. It resembles that the message/influence may be passed on to the same person via different chains of friends. This is a more realistic imitation of WOM messaging and social reinforcement in terms of repeated exposures (Romero et al. 2011);

The diffusion tree of one influencer may contain other influencers. That is because influencers also act as messengers passing along the message/influence from other influencers;

Influence dissipates along diffusion paths. Moreover, the diffusion process stops after three hops from the influencer to match the wellknown threedegreesofinfluence phenomenon (Christakis and Fowler 2007);

Influence diffusion from a parent node to its child nodes is asynchronous. This is an important mechanism implemented in our model to capture the individual diffusion dynamics.
It is worth pointing out that: 1) the influencediffusion processes of different influencers are independent of each other; 2) influence not only attenuates along diffusion paths but also decays over time. We elaborate below how we quantify influence and keep track of its diffusion and aggregation following the rules described above.
The relationship strength, as indicated by the weight on edges, may vary significantly among family members, close friends, casual acquaintances, and so on. Stronger relationship implies stronger influence potential in general. To quantify the relationship strength on influence, we define a normalized influence weight that measures the fraction of influence a node receives from a specific inneighbor relative to the total influence it may receive from all of its inneighbors. Given a directed edge u→v with a raw weight w_{ uv }, and letting \(N_{v}^{in}\) denote the set of node v’s inneighbors, the normalized influence weight \(\hat {w}_{uv}\) is defined as
To quantify the influence attenuation along a diffusion path, we define a depthassociated attenuation coefficient
where d is the depth (number of hops) from an influencer to the node of interest along a diffusion path. It is the same as what we define in the reachabilitybased influencediffusion model (Wang and Street 2014; 2015a). It can be interpreted as the probability of an influencer’s influence reaching a node d hops away, as analogous to the probability of a center node linking to a node at a fixed distance d of its concentric scales of resolution, which is proportional to d^{−2} as depicted in (Easley and Kleinberg 2010). From a probabilistic perspective, letting a random variable X_{ i } denote the total amount of influence an influencer i spreads, and letting a random variable Y denote any diffusion path from i, then we can for now quantify influencer i’s total influence as the conditional expectation
where y is a diffusion path from influencer i to a destination node j. E[ X_{ i }Y=y] is the expected influence j acquires along the path y, which is estimated by the chain product of the normalized influence weights of the corresponding links that constitute the diffusion path y. P{Y=y} is the probability that i’s influence reaches j along the path y, which is estimated by the depthassociated attenuation coefficient at the depth from i to j along the path y. The depthassociated attenuation is actually a compounding factor that incorporates the decreasing reaching probability, trustworthiness decay and information corruption. As indicated in the threedegreesofinfluence phenomenon (Christakis and Fowler 2007), the influence ceases to have a noticeable effect on people beyond three degrees of separation. Therefore, we set the depth limit d_{ max } to 3 for each influencer. In other words, the influence of any influencer can propagate up to a social horizon of three hops.
Influence also dissipates over time. For example, right after we purchase a new product, we usually feel the excitement and urge to tell our friends (potentially with stronger attempt to influence them). Our enthusiasm and influence potential fade away as time elapses. We model the temporal influence decay as a timeassociated attenuation coefficient β(t) as an exponential function of time
where λ is a userspecified tunable parameter of decay rate. It can be used to account for different products on various social media. When a node is newly activated, a timer is attached to this new influencer and the time is set to 0 at that moment so that we can track the number of hops/steps its influence propagates in its neighborhood. It is noted that the timing for that node acting as a messenger of other influencers has no change.
The next is to model the individual diffusion dynamics. Many realworld networks are thought to be scalefree, in the sense that the degree distribution follows a power law. Therefore, at the network level, it is often seen that some people have much more friends and are much more active in WOM messaging than others. At the individual level, people usually communicate much more frequently with their family members and close friends than casual acquaintances. The contact frequency among friends is strongly correlated with the relationship strength. As a result, WOM messaging from one person to each of her friends generally takes place in an asynchronous manner. To capture the individual temporal diffusion dynamics, we model the heterogeneity of WOM messaging as a Poisson process. For a directed edge (u→v), let X_{ uv } denote the number of times node u makes contact with node v during one time step. We assume that X_{ uv } follows a Poisson distribution with a rate of μ_{ uv }. Its probability mass function can be written as
Statistically, μ_{ uv } is the average number of times u makes contact with v per time step. It can be estimated using outlink weights of each node to differentiate their activeness and vertexpair diffusion dynamics. In practice, the weights on edges may have different meanings and properties, and their values may vary significantly. We use 01 scaling to bring them into alignment. Specifically, μ_{ uv } is computed as
where \(\hat {\mu }\) is a tunable parameter that measures the overall activeness of the network, and μ_{ u } and \(\tilde {w}_{uv}\) can be interpreted as the global activeness of node u and local activeness of u→v, respectively. A sensible range for \(\hat {\mu }\) is (0,2]. The larger \(\hat {\mu }\), the more active the network is. When \(\hat {\mu }=2\), the probability that a node contacts with its outlink neighbor at least once in one time step is greater than or equal to 0.865. When \(\hat {\mu }\) is too small, the network would be too inactive to facilitate viral marketing. It can be learned using datadriven approaches in practice. We set it to 1 by default. Let w_{ u } denote the sum of all outlink weights of node u, and define w_{ max }= maxi∈Vw_{ i } and w_{ min }= mini∈Vw_{ i }. Let w_{u:max} and w_{u:min} denote the maximum and minimum weight among node u’s outlinks, respectively. Then μ_{ u } and \(\tilde {w}_{uv}\) can be calculated as follows
These normalization schemes enable us to differentiate a node’s activeness at both the network and the individual levels. Specifically, for a scalefree network, there exist a few hubs, which are nodes with a degree that greatly exceeds the average. These highdegree nodes are usually more active and more influential, and play an important role in WOM marketing. The global activeness μ_{ u } has a desirable interpretation of the degree distribution. Hubs in a scalefree network would have a significantly larger μ_{ u } than other nodes. On the other hand, if the network of interest is not scalefree, μ_{ u } would be more randomly or normally distributed. Degreebased seeding strategy would be no longer effective. In case w_{ max }=w_{ min } (which rarely occurs), we set μ_{ u } to 0.5 for all u∈V. Similarly, \(\tilde {w}_{uv}\) captures node u’s local activeness, which is proportional to the weight on the link from u to its respective neighbor. If w_{u:max}=w_{u:min} (e.g., unweighted networks), we set \(\tilde {w}_{uv}\) to 0.5.
Once the rate μ_{ uv } is determined, it is straightforward to calculate the probability with which the WOM message propagates from node u to its outneighbor v at one time step as
If the propagation is not realized at time step t_{ i }, then the probability for the message to be transmitted at time step t_{i+1} is unchanged due to the memoryless property of Poisson distribution. If u has delayed passing on the message to v for delay_{ max } time steps (maximum delay), it is assumed that u has no intention to pass the message on to v at all. This is a sensible mechanism reflecting the fact that not everyone is actively engaged in WOM messaging with each of its neighbors at any time step. In addition, we set delay_{ max } to 3 time steps in alignment with the depth limit d_{ max } of 3, which implies that the message would have no noticeable influence even if it is finally propagated after it has been held for 3 or more time steps.
Now we can quantify the influence along a diffusion path at a specific time step. Let \(\sigma _{i:x\rightarrow y}^{t,d}\) denote the amount of influence (originated from an influencer i) that node x passes on to node y at time step t and depth d. Then following a diffusion path from influencer i→j→k→l, the influence that nodes j, k and l acquire from node i is respectively calculated as
In each equation above, the first term on the right hand side is the temporal influence decay, the second term is the depthassociated influence attenuation, and the rest is the chain product of the normalized influence weights on the corresponding links that constitute the diffusion path from the influencer to the node of interest.
The diffusion process starts with an initial set of influencers (seed nodes) S_{0} with S_{0}=K, and unfolds in discrete time steps. At each time step, the influence is propagated one hop from a node u (parent node) to each outneighbor v (child node) with a probability p_{ uv }. Each seed node is assigned an influence value of 1. Each inactive node v is initialized with an influence value of 0, and selects an activation threshold θ_{ v } uniformly at random in the range [ 0,1]. Let A denote the set of v’s inneighbors who pass on influence to v, and let f_{ v }(A) denote the threshold function of v, that is, the total influence that v receives from nodes in A. Formally, we define f_{ v }(A) as
where b_{ uv } represents the total amount of influence that v receives from u, and \(N_{v}^{in}\) denotes the set of inneighbors of v. Whenever f_{ v }(A)≥θ_{ v }, v is activated and turns into an influencer with an influence value of 1. Then it not only continues passing other influencers’ influence as a messenger, but also starts to spread out its own influence as an influencer. The diffusion process stops when the number of hops of influence diffusion of each influencer (including the seed nodes and all activated nodes) reaches the depth limit (d_{ max }=3 by default) and no new activation is possible.
Influence Maximization under MAT Model
The influencemaximization (IM) problem under our MAT model can be described in a general framework as follows: Given a social network G=(V,E,W), a budget K denoting the seedset size, and the MAT model on G associated with parameters λ (temporal decay rate), d_{ max } (depth limit), p_{ uv } for each (u→v)∈E (individual diffusion probability) and delay_{ max } (maximum delay) as input, find a seed set S_{0}⊆V with S_{0}≤K such that the expected influence spread σ(S_{0}) is maximized under the MAT model. For viral marketing, the seed nodes are the initial adopters of a new product, and the influence spread is measured by the total number of people in the network who adopt the product in the end of the diffusion process.
The IM problem is NPhard under the MAT model, which can be proved by reduction. If we set λ=0, d_{ max }=1, delay_{ max }=0, and p_{ uv }=1 if u is active and p_{ uv }=0 if u is inactive for any u∈V, the MAT model then reduces to the classic LT model. Therefore, the IM problem over this class of instances is equivalent to the IM problem under the classic LT model, which is known to be NPhard (Kempe et al. 2003). This concludes the proof.
It is easy to show that the influence spread function σ(·) for the MAT model is monotone. For each run of the diffusion process, the threshold θ_{ v } (for each v∈V) is randomly selected and fixed. When the seed set S_{0} grows, the final active set ϕ(S_{0}) also grows under these fixed thresholds, due to the monotonicity of f_{ v }(S). Finally, σ(S_{0}) is simply the average of the size of all final active sets among all possible threshold values selected in different runs, and thus is monotone.
However, it is hard to prove that σ(·) for the MAT model is also submodular. The widelyused approach is to construct a liveedge graph model (Chen et al. 2013) that is equivalent to the diffusion model of interest. In a liveedge graph, any active node should be reachable from at least one seed node along at least one liveedge path consisting entirely of active nodes chained up from one to another. It implies that any active node (except seed nodes) should have at least one active neighbor, which makes it infeasible to apply this approach to our MAT model since a node without any active neighbors may still be activated by indirect influence in the MAT model. For now, we conjecture that the influence spread function σ(·) for our MAT model is submodular. We leave the proof as future work.
To address the IM problem, we need to find the “optimal” seed set. Let us first clarify two important concepts: aggregation and overlap. Aggregation means that an inactive node is reinforced with more WOM exposures and receives more influence. When it accumulates enough influence (greater than or equal to its threshold), it becomes active and starts to spread its own influence as an influencer. Overlap means that an influencer spreads its influence to any nodes that have been activated by other influencers. On one hand, if seed nodes are closely clustered, they may potentially have large overlaps, which leads to deterioration in performance. On the other hand, if seed nodes are totally isolated, they can not achieve aggregation effect, which results in a detriment to new activations as well. Therefore, the ideal selection of the K seed nodes would be the K most influential nodes such that: (1) each of them achieves individual influence spread as large as possible; (2) they should be far enough away from each other to minimize potential overlaps; and (3) they should be close enough to reinforce aggregation effects. Using these guidelines in our seeding strategy, we develop a novel algorithm, IVGreedy, to tackle the IM problem under the MAT model.
IVGreedy algorithm
As discussed in the previous section, MCGreedy (Kempe et al. 2003) is a simple greedy hillclimbing algorithm, using Monte Carlo (MC) simulation to estimate the influence spread. Although its computational cost is extremely high due to the overhead of MC simulation, it achieves the largest influence spread under the IC and LT models among all stateoftheart algorithms (Chen et al. 2013). Regardless of its low efficiency, the greedy strategy used in MCGreedy makes sense and is highly effective. We develop IVGreedy borrowing the greedy strategy of MCgreedy but replacing MC simulation by using the influence vector of each node.
The influence vector of a node captures where and how much influence that node spreads over its neighborhood based on a nonstochastic version of the MAT model, in which we do not consider individual diffusion dynamics (i.e., let p_{ uv }=1 for any u∈V), or the activation of the nodes (i.e., only i is an influencer; all other nodes remain messengers). In addition, we ignore the temporal diffusion decay (i.e., let λ=0) when generating the influence vectors such that our IVGreedy is more robust with different decay rates. We develop a modified depthlimited search algorithm, called IVBuilder, to explore the diffusion tree of each node individually, and generate an influence vector for each node, which records its influence spread to all the nodes in its neighborhood within a depth limit of 3 (by default). IVbuilder uses the same approach as we use to generate the influence matrix for community detection in (Wang and Street 2014; 2015a). The only difference exists in the weight normalization schemes. For community detection, we normalize each inweight of a node by the maximum inweight of that node to capture the relative susceptibility of the node to its inneighbors, and the influence vectors are used to differentiate vertexpair similarity. For viral marketing, we normalize each inweight of a node by the total inweight of that node to quantify the absolute influence transmitted from each inneighbor, and the influence vectors are used to find out which nodes are more likely to be activated. Due to space limit, we refer the interested reader to (Wang and Street 2015a) for implementation details on influencevector generation.
We use these influence vectors as a proxy for the stochastic influencediffusion process to avoid the expensive MC simulation. Specifically, We define the influence score of a node as the sum of all elements in its influence vector, which represents an estimate of its total influence spread. This influence score can be used to differentiate the influence significance of the nodes in the network. An intuitive solution is to select the topK nodes of the highest influence scores as the seeds. Unfortunately, it does not work well. Like those highdegree nodes, the nodes of high influence scores may be clustered or too close to each other, leading to large overlaps of influence spread. We need to somehow separate them from each other to minimize overlaps. The strategy that we use in IVGreedy is a greedy strategy similar to MCGreedy, in which we sweep over the influence vector of each node to repeatedly pick the node with the maximum marginal gain and add it to the seed set until all K seeds are found. The pseudocode is shown in Algorithm 2.
In IVGreedy, each node is indicated by its node index from 1 to n. IVGreedy starts with an empty seed set (Line 2), and then generate the influence vector IV_{ u } for each node u∈V (Line 3). IV_{ u } is an nelement array, in which element IV_{ u }(j) represents the amount of influence node u exerts on node j. For each node u, we get its influence score R_{ u } by summing up all elements in its influence vector IV_{ u } (Lines 46). We find the node v that has the highest influence score (Line 7), and add it to the seed set S as the first seed (Line 8). Then we make a copy of IV_{ v } to an array AR, which is used as a representation of the collective influence distribution of currently selected seeds. To find the next seed, we sweep over each node i∈V∖S and compute the marginal gain g_{ i } of adding i to S individually. We compute its benefit factor c=1−AR(i) (Line 13). Recall that AR(i) represents the amount of influence i receives so far, which can be also regarded as the probability activating i. For example, if AR(i)=1, it means that i has been activated, then selecting i as a seed has no benefit to the overall influence spread. Then for each element IV_{ i }(j), we get p=c×IV_{ i }(j) (Line 15), which represents the marginal gain that node j gets. However, remember that the threshold function f_{ v }(S) is defined within the range [ 0,1]. If the amount of influence that node v receives is greater than 1, it is set to 1 since v has been activated and more influence on v is of no more effect or benefit. Therefore, we add p to AR(j) to get q (Line 16), which represents the accumulated influence that node j receives. If q is greater than 1, we set the actual marginal gain p=1−AR(j) (Lines 1719). Then we add p to g_{ i }, which represents the total marginal gain of adding node i to S. Once we get the marginal gain of each node i∈V∖S (Lines 1122), we select the node that produces the maximum marginal gain as a seed and add it to S (Lines 2324). Then we update AR by calculating the benefit factor c and adding the marginal gain to each element in AR (Lines 2528). We repeat the process described above (K−1) times to find the (K−1) seeds, which along with the first seed node constitute a full seed set S.
Experiments
We conduct experiments on four widelyused reallife network datasets to evaluate our MAT model and the performance of IVGreedy, and compare it against a set of baseline algorithms on both influence spread and time efficiency. The code is written in Visual Basic and all experiments are carried out on a regular desktop PC with Intel(R) Core(TM) i54670 CPU @ 3.40 GHz and 8.0 GB memory under Windows 7 64bit OS.
Network description
To evaluate the applicability of our model and algorithms, we employ four reallife networks with different combinations of link directionality and weights. We list their statistics in Table 1. PGP (Boguna et al. 2004) is an undirected/unweighted network of users of the PrettyGoodPrivacy algorithm for secure information interchange. Each node represents a user, and each edge connects a pair of users of interest who have assigned public keys of another based on trust between them. It is a single connected component with relative clear community structure. NetHEPT (HEPT Collaboration Network 2009) is an undirected/weighted collaboration network from the paper lists extracted from “High Energy Physics (Theory)” section of the eprint arXiv from 1991 to 2003. Each node represents a unique author, and each edge represents coauthorship of the two authors of interest, weighted by the number of papers they have coauthored. This network has been frequently used in previous work (Chen et al. 2010; Goyal et al. 2011b). WikiVote (Wikipedia Whovotewhom Network 2010; Leskovec et al. 2010) is a directed/unweighted whovotewhom network from Wikipedia. Nodes in the network represent Wikipedia users and a directed edge from nodes i to j represents that user i voted on user j. This link direction does not reflect the direction of influence flow. User i voting on user j is actually because j has influence on i, as analogous to that of citing paper i and cited paper j. Therefore, we reverse the link direction to reflect the actual influence flow. This network has a giant component and a set of 23 small ones. The last network dataset is C.elegans (White et al. 1986). It is a directed/weighted neural network of the nematode worm C.elegans.
Baseline heuristics
The simplest and naive baseline is to select nodes uniformly at random (hereafter referred to as RANDOM). Not surprisingly, this algorithm is very unstable and does not perform well in terms of influence spread. The most frequently used is the degreecentrality heuristic (hereafter referred to as DEGREE), in which the seed nodes v are chosen in descending order of outdegrees \(d_{v}^{out}\). This seeding strategy is simple but effective. Empirical studies (Kempe et al. 2003; Hinz et al. 2011) show that DEGREE results in larger influence spread than other centralitybased heuristics, such as distance centrality and betweenness centrality. A common feature of these centralitybased algorithms is that they rely solely on one specific structural property of the network without considering the diffusion dynamics. In fact, many of the highestdegree nodes may be clustered and have potentially large overlaps of influence spread, which leads to deterioration in performance.
The next baseline is the TopK algorithm. We present its pseudocode in Algorithm 3. It sweeps over each node u∈V to compute the influence spread of each node individually (Lines 29), using Monte Carlo simulation (Lines 47). Then, it selects the top K nodes with the largest individual influence spread (Lines 1013). It is worth noting that the topK nodes that produce the largest influence spread individually is usually not the same as the K seed nodes that produce the largest influence spread together. For example, if two top influencers are so close to each other that their influence spreads have a large overlap, it is not a good idea to select both of them as seed nodes. While the DEGREE algorithm relies solely on the structural properties of the network without considering the diffusion dynamics, the TopK algorithm relies solely on the diffusion dynamics without considering the network structure.
MCGreedy (as shown in Algorithm 1) is the last baseline heuristic. It considers both the diffusion dynamics and the network structure (implicitly). One may notice that the TopK algorithm is actually the first round of the K rounds in MCGreedy except that MCGreedy selects only top one influencer instead of top K influencers. It continues exhaustively running MC simulation to find the next seed node that produces the maximum marginal gain in influence spread. This step iterates until a set of K seed nodes is found. MCGreedy usually achieves the largest influence spread of the K seed nodes at the collective level. However, it suffers from a very high computational cost, which makes it infeasible even for mediumsized networks.
Performance comparison
We run experiments on the four network datasets to compare the performance of IVGreedy against the four baseline algorithms in terms of influence spread and time efficiency. As discussed in the previous section, we set both d_{ max } and delay_{ max } to 3, and treat them as builtin parameters of the MAT model. We also set the network activeness parameter \(\hat {\mu }\) to 1 as defaulted. The temporal decay rate λ is regarded as a userspecified parameter. The larger λ, the faster temporal decay of influence. We set it to 0.2 for performance comparison of different algorithms, and then evaluate its effect on influence spread with different values. In addition, both MCGreedy and TopK need to run MC simulation R_{1} times in each round of influencespread estimation. Once an algorithm produces the seed set it finds, we run MC simulation R_{2} times to estimate the influence spread of each algorithm for comparison. In our experiments, we set both R_{1} and R_{2} to 1000. In particular, due to the extremely low efficiency of MCGreedy, we only report its results for C.elegans dataset.
Influence spread
We illustrate in Fig. 1 the experimental results of the four network datasets. Each curve shows the variation of the influence spread with respect to the seedset size K. K is a userspecified parameter. In practice, K is closely related to the marketing budget. For diffusion research, we choose K primarily based on the network size. Specifically, we vary the number of seeds K from 5 to 100 in an increment of 5 for PGP, NetHEPT and WikiVote since they have 10680, 15233, and 7115 nodes, respectively. K is varied from 1 to 15 in an increment of 1 for C.elegans since it has only 453 nodes.
It is not surprising that RANDOM has the worst performance on all datasets. The degreecentrality heuristic, DEGREE, greatly improves the influence spread, which is 48 times better than RANDOM. It makes sense that the wellconnected nodes (social hubs) facilitate influence diffusion with their high reach to others. TopK is analogous to DEGREE in the sense of targeting the group of most influential nodes who achieve largest influence spread individually. It achieves an improvement of 28.4%,3.8%,6.5%, and 7.3% over DEGREE on PGP, NetHEPT, WikiVote, and C.elegans, respectively. However, its improvement on influence spread comes with a huge sacrifice on efficiency since it relies on expensive MC simulation. Both DEGREE and TopK fail to capture the fact that many of the highestdegree or most influential nodes may be clustered, which results in large overlaps of influence spread. IVGreedy incorporates the mechanism to minimize overlaps, and achieves the best performance in the comparison with RANDOM, DEGREE, and TopK. The larger seedset size, the greater improvement. Specifically, for K=100 for PGP/NetHEPT/WikiVote and K=15 for C.elegans, IVGreedy beats DEGREE by 35.1%,14.9%,8.6% and 10.8%, and outperforms TopK by 5.3%,10.7%,1.6% and 3.3%, respectively. It is only inferior to MCGreedy by 3% on C.elegans. As shown in the comparison of running time, IVGreedy runs several orders of magnitude faster than MCGreedy (and TopK) on the other hand.
Moreover, we list in Table 2 the 95% confidence intervals (CI) for the expected influence spread of the four network datasets, based on 1000 runs of MC simulation for each dataset. We also perform ttest at the 0.05 significance level on IVGreedy versus DEGREE and IVGreedy versus TopK, respectively. The corresponding pvalues for each dataset are calculated and included in Table 2 as well. Clearly, both the confidence intervals and the pvalues indicate that the improvement of IVGreedy over DEGREE and TopK is statistically significant for all datasets.
Running time
An analysis of the time complexity of the proposed algorithms is given as follows. Let n and m denote the number of nodes and edges in the network, respectively. For DEGREE, it is straightforward to sweep over each node K times to select the top K nodes of the highest degrees as the seed set. Therefore, the time complexity of DEGREE is O(Kn). IVGreedy (as shown in Algorithm 2) needs to call IVbuilder to generate an influence vector for each node first. Let L denote the average length of influence vectors. L is determined by the average node outdegree b, depth limit d_{ max } and and the community structure. As elaborated in our previous work (Wang and Street 2014; 2015a), IVbuilder may take O(b^{d}) time exploring the diffusion tree of each node. In most cases, the time complexity of IVbuilder is O(Ln), and L is way smaller than n, especially for largescale networks. Therefore, we usually do not maintain each influence vector in an nelement array in practice, but only keep the nonzero influence values in a compact dynamic array instead. Then in Line 11 of IVGreedy, the number of iterations is reduced to L from n. Therefore, the time complexity of IVGreedy is O(KLn).
Both TopK and MCGreedy reply on MC simulation. Each run of MC simulation takes O(m) time. TopK (as shown in Algorithm 3) needs to sweep over each node to compute the influence spread individually by running MC simulation R times for each node, and thus it has a time complexity of O(nRm). As we can see, TopK already has a very high computational cost. Unfortunately, MCGreedy (as shown in Algorithm 1) is much more time consuming than TopK. TopK is just the first round in MCGreedy finding the first seed. This process has to repeat K rounds to find the Knode seed set. In each round, the node that generates the maximum marginal gain is selected and added to the seed set. Therefore, the time complexity of MCGreedy is O(KnRm). In fact, in the k^{th} round of repetition, the seedset size increases to k. The time that each run of MC simulation in the k^{th} round takes is more than k times of that in the first round. In other words, the actual running time of MCGreedy could be several K^{2} times of that of TopK, which makes it infeasible even for mediumsized networks of thousands of nodes.
In Fig. 2, we illustrate the running time of the algorithms on the four network datasets. A seedset size of K=100 is set for the PGP, NetHEPT, and WikiVote datasets. K is set to 15 for the C.elegans dataset. It is noted that the Yaxis is in logarithmic scale. Clearly, DEGREE is the fastest algorithm. It finishes almost instantly on all datasets. IVGreedy significantly outperforms TopK. Specifically, IVGreedy is 63, 21, 155 and 590 times faster than TopK on PGP, NetHEPT, WikiVote and C.elegans, respectively. Not surprisingly, MCGreedy is the slowest one. As we can see on the C.elegans dataset of only 453 nodes and 2025 edges, while IVGreedy takes only 0.2 seconds to find a set of 15 seed nodes, TopK takes 2 minutes 4 seconds, but MCGreedy takes more than 25 hours. Due to such a high computational cost, we are not able to run it on the other three datasets. Overall, IVGreedy is the best performing algorithm in terms of both influence spread and running time.
Adoption rate
Marketers are concerned about not only the influence spread but also the adoption rate, which is usually measured by the number of adopters in a given time period. How fast the influence spreads is a critical factor for a viral marketing campaign, especially when there exist competing products in the same social network. It is interesting to observe the rate of adoption under our MAT model. We show in Fig. 3 the influence spread over time on the four network datasets with IVGreedy as the seeding strategy. Again, the seedset size is set to 100 for PGP, NetHEPT and WikiVote, and 15 for C.elegans.
As we can see, the adoption rate shows similar pattern on all datasets. The influence diffuses at a highspeed rate at early stage, and the speed decreases monotonically. The influence spread reaches the saturation level in about 10 to 15 time steps. At t=5, the influence spread arrives at 87%, 84%, 93.4%, and 98.3% of the maximum on PGP, NetHEPT, WikiVote, and C.elegans, respectively. At t=10, this percentage increases to 97.4%, 97.3%, 99.5%, and 100%, respectively. It is observed that WikiVote and C.elegans achieve higher adoption rates than PGP and NetHEPT, which can be roughly explained by the high cohesiveness of WikiVote and the small network size of C.elegans. This follows our intuition. In practice, marketers need to carefully determine the period of a time step. It could be a day, a week, or a month, etc. The influence decay rate and individual contact frequency may vary accordingly.
Parameter analysis and model comparison
The temporal decay rate λ is a userspecified parameter in our MAT model. The value of λ can be varied to account for different products on different social networks. For a particular application, it is possible to use a datadriven approach to learn the appropriate value of λ. We evaluate the impact of λ on influence spread using different decay rates under the MAT model. We also include the classic LT model (Kempe et al. 2003) as a baseline for model comparison. DEGREE is a fairly effective and generic heuristic that is applicable to both the MAT model and the LT model. Moreover, it always produces the same seed set, which enables us to perform a fair comparison on different models. Therefore, we use DEGREE in this experiment. We set λ to 0.1, 0.2, and 0.3, respectively. The experimental results are illustrated in Fig. 4.
As expected, λ in our MAT model has significant impact on influence spread. A larger λ indicates faster temporal decay of influence, and thus results in smaller influence spread. It is good to see that λ in the MAT model enables us to gauge the influence spread in a reasonably large range on all datasets. When λ increase from 0.1 to 0.3, influence spread drops 45.4%, 48.1%, 36.7% and 25.4% on PGP, NetHEPT, WikiVote and C.elegans, respectively.
Bhagat et al. (2012) evaluate the influence spread of several models. Their findings suggest that the classic LT model overestimates the influence spread by large amounts. As we can see, when λ=0.2, the influence spread under the MAT model is already 15%, 24%, 21.3% and 19.9% smaller than the LT model on PGP, NetHEPT, WikiVote and C.elegans, respectively. When λ=0.3, the influence spread under the MAT model drops to 35.4%, 43.2%, 36.4% and 31% smaller than the LT model on PGP, NetHEPT, WikiVote and C.elegans, respectively. We believe that our MAT model is more powerful than the LT model on prediction of influence spread. It is our future work to develop an effective datadriven approach to learn λ from training data and then use it in the MAT model to predict the diffusion.
Conclusion
In this paper, we propose a novel multiplepath asynchronous threshold (MAT) model for viral marketing in social networks. It differs from existing diffusion models in several aspects. We quantitatively measure influence and keep track of its spread and aggregation during the diffusion process. Our MAT model captures both direct and indirect influence, depthassociated influence attenuation, temporal influence decay, and individual diffusion dynamics. Our work is an important step toward a more realistic diffusion model. Further, we develop an effective and efficient heuristic, IVGreedy, to tackle the influencemaximization problem. Our experiments on four reallife networks demonstrate its excellent performance in terms of both influence spread and time efficiency.
Our work provides preliminary but significant insights and implications for marketing practice. Firs of all, unlike other diffusion models relying exclusively on the direct influence from influencers, our model draw managerial attention to messengers who play an important role in spreading indirect influence in viral marketing. Our model offers an algorithmic view of the complex WOM communications, as described in the network coproduction model (Kozinets et al. 2010). Second, our work provides pragmatic implications for how marketers should plan and leverage WOM campaigns. They need to take into consideration depthassociated influence attenuation and temporal influence decay when determining the time period/steps. Specifically, it is necessary for the marketers to find out the decay rate attuned to the product and/or social media of interest. Third, to fully grasp the effect of the WOM marketing, marketers need to look beyond the network structure and incorporate the weight information on edges to capture the network activeness and individual diffusion dynamics. Finally, our work suggests that marketers can create an efficient seeding strategy that achieves larger influence spread than the highdegree seeding. They need to consider both individual influence spread and the network structure to minimize the potential overlaps.
Our work opens up several directions for future work on diffusion research. First, the conjecture on submodularity of the influencespread function under the MAT model needs rigorous proof. Second, scalable heuristics need to be developed for largescale networks, especially for dense networks and large seedset size. More importantly, it is necessary to validate the MAT model with more reallife datasets from diverse domains. It is desirable to have the dataset that contains both network connectivity and action traces of a large set of instances. Then we can partition the instances into a training set and a test set, and use the training set to learn model parameters, such as the temporal influence decay rate and the network activeness. Finally, we use those learned parameter values in the MAT model to predict the influence spread for the instances in the test set, and evaluate the prediction accuracy based on the actual influence spread. This would be a practical mechanism to not only validate the model but also build an effective prediction model. A lot of interesting work can follow.
References
AlSumaidan, L, Ykhlef M (2016) Toward information diffusion model for viral marketing in business. Int J Adv Comput Sci Appl 7(2):637–646.
Bhagat, S, Goyal A, Lakshmanan L (2012) Maximizing product adoption in social networks In: 5^{th} ACM International Conference on Web Search and Data Mining, 603–612.. ACM, New York.
Boguna, M, PastorSatorras R, DiazGuilera A, Arenas A (2004) Models of social networks based on social distance attachment. Phys Rev E 70:056122.
Bond, RM, Fariss CJ, Jones JJ, Kramer ADI, Marlow C, Settle JE, Fowler JH (2012) A 61millionperson experiment in social influence and political mobilization. Nature 489:295–298.
Borodin, A, Eilmus Y, Oren J (2010) Threshold models for competitive influence in social networks In: 6^{th} International Workshop on Internet and Network Economics, 539–550.. Springer, Berlin.
Brown, J, Reinegen P (1987) Social ties and wordofmouth referral behavior. J Consum Res 14(3):350–362.
Burt, RS (1987) Social contagion and innovation: Cohesion versus structural equivalence. Am J Sociol 92(6):1287–1335.
Centola, D (2010) The spread of behavior in an online social network experiment. Science 329(5996):1194–1197.
Chen, S, Fan J, Li G, Feng J, Tan K, Tang J (2015) Online topicaware influence maximization. Proc VLDB Endowment 8(6):666–677.
Chen, W, Collins A, Cummings R, Ke T, Liu Z, Rincon D, Sun X, Wang Y, Wei W, Yuan Y (2011) Influence maximization in social networks when negative opinions may emerge and propagate In: 2011 SIAM International Conference on Data Mining, 379–390.. SIAM, Philadelphia.
Chen, W, Lakshmanan LVS, Castillo C (2013) Information and Influence Propagation in Social Networks. Morgan & Claypool Publishers, California.
Chen, W, Lu W, Zhang N (2012) Timecritical influence maximization in social networks with timedelayed diffusion process In: 26^{th} AAAI Conference on Artificial Intelligence, 592–598.. AAAI Press, Palo Alto.
Chen, W, Yuan Y, Zhang L (2010) Scalable influence maximization in social networks under the linear threshold model In: IEEE 10^{th} International Conference on Data Mining, 88–97.. IEEE, Piscataway.
Cheng, J, Adamic LA, Dow PA, Kleinberg J, Leskovec J (2014) Can cascades be predicted? In: 23^{rd} International Conference on World Wide Web, 925–936.. ACM, New York.
Christakis, NA, Fowler JH (2007) The spread of obesity in a large social network over 32 years. N Engl J Med 357:370–379.
Domingos, P, Richardson M (2001) Mining the network value of customers In: 7^{th} ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 57–66.. ACM, New York.
Easley, D, Kleinberg J (2010) Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, Cambridge.
Fang, X, Hu PJ, Li L, Tsai W (2013) Predicting adoption probabilities in social networks. Inf Syst Res 24(1):128–145.
Gayraud, N, Pitoura E, Tsaparas P (2015) Diffusion maximization in evolving social networks In: 2015 ACM Conference on Online Social Networks, 125–135.. ACM, New York.
Goyal, A, Bonchi F, Lakshmanan LVS (2010) Learning influence probabilities in social networks In: 3^{rd} ACM International Conference on Web Search and Data Mining, 241–250.. ACM, New York.
Goyal, A, Lu W, Lakshmanan LVS (2011) CELF++: Optimizing the greedy algorithm for influence maximization in social networks In: 20^{th} International Conference on World Wide Web, 47–48.. ACM, New York.
Goyal, A, Lu W, Lakshmanan LVS (2011) SIMPATH: An efficient algorithm for influence maximization under linear threshold model In: IEEE 11^{th} International Conference on Data Mining, 211–220.. IEEE, Piscataway.
He, X, Song G, Chen W, Jiang Q (2012) Influence blocking maximization in social networks under the competitive linear threshold model In: 2012 SIAM Conference on Data Mining, 463–474.. SIAM, Philadelphia.
HEPT Collaboration Network (2009). http://research.microsoft.com/enus/people/weic/graphdata.zip. Accessed 16 Apr 2016.
Hill, S, Provost F, Volinsky C (2006) Networkbased marketing: Identifying likely adopters via consumer networks. Stat Sci 21(2):256–276.
Hinz, O, Skiera B, Barrot C, Becker J (2011) Seeding strategies for viral marketing: An empirical comparison. J Mark 75(6):55–71.
Iribarren, JL, Moro E (2009) Impact of human activity patterns on the dynamics of information diffusion. Phys Rev Lett 103:038702.
Iyengar, R, Van den Bulte C, Valente TW (2011) Opinion leadership and social contagion in new product diffusion. Mark Sci 30(2):195–212.
Jiang, Q, Song G, Cong G, Wang Y, Si W, Xie K (2011) Simulated annealing based influence maximization in social networks In: 25^{th} AAAI Conference on Artificial Intelligence, 127–132.. AAAI Press, Palo Alto.
Kempe, D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network In: 9^{th} ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 137–146.. ACM, New York.
Kempe, D, Kleinberg J, Tardos E (2015) Maximizing the spread of influence through a social network. Theory Comput 11(4):105–147.
Kozinets, R, de Valck K, Wojnicki A, Wilner S (2010) Networked narratives: Understanding wordofmouth in online communities. J Mark 74(2):71–89.
Leskovec, J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance NS (2007) Costeffective outbreak detection in networks In: 13^{th} ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 420–429.. ACM, New York.
Leskovec, J, Lang KJ, Mahoney MW (2010) Empirical comparison of algorithms for network community detection In: 19^{th} International Conference on World Wide Web, 631–640.. ACM, New York.
Leskovec, J, McGlohon M, Faloutsos, Glance N, Hurst M (2007) Patterns of cascading behavior in large blog graphs In: 2007 SIAM International Conference on Data Mining, 551–556.. SIAM, Philadelphia.
Libai, B, Muller E, Peres R (2013) Decomposing the value of wordofmouth seeding programs: Acceleration vs. expansion. J Mark Res 50:161–176.
Liu, L, Tang J, Han J, Yang S (2012) Learning influence from heterogeneous social networks. Data Min Knowl Disc 25:511–544.
Morales, AJ, Borondo J, Losada JC, Benito RM (2014) Efficiency of human activity on information spreading on twitter. Soc Networks 39:1–11.
Peres, R (2014) The impact of network characteristics on the diffusion of innovations. Physica A 402:330–343.
Rand, W, Rust RT (2011) Agentbased modeling in marketing: Guidelines for rigor. Int J Res Mark 28(3):181–193.
Richardson, M, Domingos P (2002) Mining knowledgesharing sites for viral marketing In: 8^{th} ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 61–70.. ACM, New York.
Romero, DM, Meeder B, Kleinberg J (2011) Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on Twitter In: 20^{th} International Conference on World Wide Web, 695–704.. ACM, New York.
Wang, C, Chen W, Wang Y (2012) Scalable influence maximization for independent cascade model in largescale social networks. Data Min Knowl Discov 25(3):545–576.
Wang, W, Street WN (2014) A novel algorithm for community detection and influence ranking in social networks In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 555–560.. IEEE, Piscataway.
Wang, W, Street WN (2015a) Modeling influence diffusion to uncover influence centrality and community structure in social networks. Soc Netw Anal Min 5:15.
Wang, W, Street WN (2015b) Finding hierarchical communities in complex networks using influenceguided label propagation In: 15^{th} International Conference on Data Mining Workshops, 547–556.. IEEE, Piscataway.
White, JG, Southgate E, Thomson JN, Brenner S (1986) The structure of the nervous system of the nematode Caenorhabditis elegans. Philos Trans R Soc B 314(1165):1–340.
Wikipedia Whovotewhom Network (2010). https://snap.stanford.edu/data/wikiVote.html. Accessed 5 June 2016.
Acknowledgements
We thank the anonymous reviewers for their insightful remarks and suggestions that allow us to improve the quality of this paper significantly.
Funding
Not applicable.
Availability of data and materials
The datasets used in this article are all publicly available and cited in the references.
Author information
Affiliations
Contributions
WW and WNS equally contributed to this work. Both authors read and approved the final manuscript.
Corresponding author
Correspondence to Wenjun Wang.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Received
Accepted
Published
DOI
Keywords
 Influence diffusion
 Influence maximization
 Social network
 Viral marketing