diff options
| -rw-r--r-- | paper/sections/model.tex | 154 |
1 files changed, 89 insertions, 65 deletions
diff --git a/paper/sections/model.tex b/paper/sections/model.tex index 5097c63..98712d7 100644 --- a/paper/sections/model.tex +++ b/paper/sections/model.tex @@ -1,55 +1,35 @@ -\paragraph{Influence Cascades} +We consider a graph $G= (V, E)$ and we define $m\defeq |V|$. A \emph{cascade +model} is a discrete-time homogeneous Markov process over $\{0,1\}^m$. Denoting +by $X^t$ the state of the process at time step $t-1$, we interpret $X^t_j = 1$ +for some node $j\in V$ as ``node $j$ is infected at time step $t-1$''. An +\emph{influence cascade} is simply a realisation of the random process +described by a cascade model. -We consider a weighted graph $G$ over the set of vertices $V$ and we define -$m\defeq |V|$. The edge weights are given by an (unknown) weight matrix -$\Theta:V^2\to\reals^+$. $\Theta_{i,j} = 0$ simply expresses that there is no -edge from $i$ to $j$ For $j\in V$, we write $\theta_j$ the $j$-th column of -$\Theta$: this is the indicator vector of the parents of $j$ in $G$. +In what follows, we describe two widely used cascade models, the Independent +Cascade model (IC) and the Linear Threshold model (LT). In these models, the +transition probabilities of the Markov process depend on unknown parameters +describing how much the nodes in $G$ influence each other. -An \emph{influence cascade} is a homogeneous discrete-time Markov process over -$\{0,1\}^m$. More precisely, we represent the set of nodes infected at time -step $t-1\in\ints^+$ by a random vector $X^{t}\in\{0,1\}^{m}$ where $X^{t}_k -= 1$ can be interpreted as \emph{node $k$ was infected at time step $t-1$}. - -\paragraph{Problem} In what follows, we assume that we know how to express the -transition probabilities as a function of the weight matrix $\Theta$ and the -goal is to infer this matrix by observing influence cascades. - -We use standard maximum likelihood estimation (MLE). However, we assume that -the parameter we wish to recover, namely the matrix $\Theta$, is sparse in the -following sense: each node in the graph $G$ has at most $s$ parents. In other -words, the columns of $G$ have at most $s$ non-zero entries. To factor this -additional assumption in the maximum likelihood estimation we use the now -classic $\ell_1$ regularization. For an observation $(x_1,\ldots,x_n)$, we -estimate $\hat{\Theta}$ by: -\begin{equation}\label{eq:mle} - \hat{\Theta} = \argmax_{\Theta}\mathcal{L}(\Theta\,|\, x^1,\ldots, x^n) - + \lambda\|\Theta\|_1 -\end{equation} -where $\mathcal{L}$ is the log-likelihood function and $\lambda$ is the -regularization factor. - -The goal of this paper is to characterize $\hat{\Theta}$ in terms of: -\begin{itemize} - \item \emph{efficiency}: how well does $\hat{\Theta}$ approximate the true - matrix $\Theta$? In particular can we bound its accuracy as a function - of the number of observations? - \item \emph{model consistency}: assuming that the true matrix $\Theta$ is - $s$-sparse as defined above, can we recover the support of $\Theta$ - from the estimator $\hat{\Theta}$, and if so, how many observations are - required? -\end{itemize} - -We now turn to two widely used influence cascade models, the Independent -Cascade Model (IC) and the Linear Threshold Model (LT) and specify the MLE -problem \eqref{eq:mle} for these models. +Recovering these parameters from observed influence cascades is the central +question of the present work and is important for the following two reasons: +\begin{enumerate} + \item knowing the influence parameters allows for future prediction of + influence cascades. + \item when the set of edges $E$ is unknown, it can be recovered from the + influence parameters. +\end{enumerate} +We note that recovering the edges in $E$ from observed influence casccades is +a well-identified problem known as the \emph{Graph Inference} problem. However, +recovering the influence parameters is no less important and has been seemingly +overlooked so far. The machinery developped in this work addresses both +problems. \subsection{Independent Cascade Model} In the independent cascade model, nodes can be either uninfected, active or infected. All nodes start either as uninfected or active. At each time step -$t$, for each edge $(i,j)$ where $j$ is uninfected and $i$ is active, $j$ -attempts to infect $j$ with probability $p_{i,j}$. If $i$ succeeds, $j$ will +$t$, for each edge $(i,j)$ where $j$ is uninfected and $i$ is active, $i$ +attempts to infect $j$ with probability $p_{i,j}\in[0,1]$. If $i$ succeeds, $j$ will become active at time step $t+1$. Regardless of $i$'s success, node $i$ will be infected at time $t+1$: nodes stay active for only one time step. The cascade continues until no active nodes remain. @@ -58,35 +38,79 @@ If we denote by $X^t$ the indicator variable of the set of active nodes at time step $t-1$, then if $j$ is uninfected at time step $t-1$, we have: \begin{displaymath} \P\big[X^{t+1}_j = 1\,|\, X^{t}\big] - = 1 - \prod_{i = 1}^m (1 - p_{i,j})^{X^t_i} - = 1 - e^{\inprod{\theta_j}{X^t}} + = 1 - \prod_{i = 1}^m (1 - p_{i,j})^{X^t_i}. \end{displaymath} -where we defined $\Theta_{i,j} \defeq \log(1-p_{i,j})$. +Defining $\Theta_{i,j} \defeq \log(1-p_{i,j})$, this can be rewritten +as: +\begin{equation}\label{eq:ic} + \tag{IC} + \P\big[X^{t+1}_j = 1\,|\, X^{t}\big] + = 1 - \prod_{i = 1}^m e^{\Theta_{i,j}X^t_i} + = 1 - e^{\inprod{\theta_j}{X^t}} +\end{equation} +where we defined $\theta_j\defeq (\Theta_{1,j},\ldots,\Theta_{m,j})$. \subsection{Linear Threshold Model} In the Linear Threshold Model, each node $j\in V$ has a threshold $t_j$ drawn -uniformly from the interval $[0,1]$. We assume that the weights given by -$\Theta$ are such that for each node $j\in V$, $\sum_{i=1}^m \Theta_{i,j} \leq -1$. +uniformly from the interval $[0,1]$. Furthermore, there is a weight +$\Theta_{i,j}\in[0,1]$ for each edge $(i,j)$. We assume that the weights are +such that for each node $j\in V$, $\sum_{i=1}^m \Theta_{i,j} \leq 1$. -Nodes can be either infected or uninfected. At each time step, each uninfected -node $j$ becomes infected if the sum of the weights $\Theta_{i,j}$ where $i$ is -an infected parent of $j$ is larger than $j$'s threshold $t_j$. Denoting by -$X^t$ the indicator variable of the set of infected nodes at time step $t-1$, -if $j\in V$ is uninfected at time step $t-1$, then: -\begin{displaymath} +Nodes can be either uninfected or active. At each time step, each uninfected +node $j$ becomes active if the sum of the weights $\Theta_{i,j}$ for $i$ an +active parent of $j$ is larger than $j$'s threshold $t_j$. Denoting by $X^t$ +the indicator variable of the set of active nodes at time step $t-1$, if +$j\in V$ is uninfected at time step $t-1$, then: +\begin{equation} + \label{eq:lt} + \tag{LT} \P\big[X^{t+1}_j = 1\,|\, X^{t}\big] = \sum_{i=1}^m \Theta_{i,j}X^t_i = \inprod{\theta_j}{X^t} -\end{displaymath} +\end{equation} +where we defined again $\theta_j\defeq (\Theta_{1,j},\ldots,\Theta_{m,j})$. -\subsection{Unification} +\subsection{Maximum Likelihood Estimation} + +In both models, the unknown influence parameters are described by an $m\times +m$ matrix $\Theta$. Furthermore, the information of the edges of $G$ is fully +contained in $\Theta$: +\begin{displaymath} + (i,j)\in E\Leftrightarrow \Theta_{i,j} \neq 0 +\end{displaymath} -$X_j^{t+1}$ is a Bernoulli variable whose parameter can be written: -a function $f$ of $\inprod{\theta_j}{x^t}$, where $\inprod{a}{b}$ is the inner -product of $a$ and $b$ in $\reals^m$: -$\theta_j$ and $x^t$: +Given observations $(x^1,\ldots,x^n)$ of a cascade model, we can recover +$\Theta$ via Maximum Likelihood Estimation (MLE). Denoting by $\mathcal{L}$ the +log-likelihood function, we consider the following $\ell_1$-regularized MLE +problem: +\begin{displaymath} + \hat{\Theta} \in \argmax_{\Theta} \mathcal{L}(\Theta\,|\,x^1,\ldots,x^n) + + \lambda\|\Theta\|_1 +\end{displaymath} +where $\lambda$ is the regularization factor which helps at preventing +overfitting as well as controlling for the sparsity of the solution. -\begin{equation}\label{eq:markov} - \P\big[x_j^{t+1} = 1\,|\, x^t\big] \defeq f(\inprod{\theta_j}{x^t}). +Note that both the (IC) and the (LT) models are decomposable in the following +sense: the state evolution of each node at each time step happens independently +of the following nodes \emph{and} the state evolution of node $j\in V$ only +depends on $\theta_j$, the $j$-th column of $\Theta$. Consequently, the +log-likelihood function $\mathcal{L}$ can be written as a sum of $m$ terms, +each term $j\in\{1,\ldots,m\}$ only involving $\theta_j$. Since this is equally +true for $\|\Theta\|_1$, each column $\theta_j$ of $\Theta$ can be estimated by +a separate optimization program: +\begin{equation}\label{eq:pre-mle} + \hat{\theta}_j \in \argmax_{\theta} \mathcal{L}_j(\theta_j\,|\,x^1,\ldots,x^n) + + \lambda\|\theta_j\|_1 \end{equation} + +Furthermore, the state evolution of a node $j\in V$ has the same structure in +both models: the transition from uninfected to active at time step $t+1$ is +controlled by a Bernoulli variable whose parameter can be written +$f(\inprod{\theta_j}{x^t})$ for some function $f$. Hence, if we denote by $n_j$ +the first step at which $j$ becomes active, we can rewrite the MLE program +\eqref{eq:pre-mle}: +\begin{multline} + \hat{\theta}_j \in \argmax_{\theta} + \sum_{t=1}^{n_j-1}\log\big(1-f(\inprod{\theta_j}{x^t})\big)\\ +\log f(\inprod{\theta_j}{x^{n_j}})+ \lambda\|\theta_j\| +\end{multline} |
