finale/sections/model.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68

The GLC model is described over a directed graph $G = (V, \Theta)$. Denoting by
$k=|V|$ the number of nodes in the graph, $\Theta\in\R_{+}^{k\times k}$ is the
matrix of edge weights. Note that $\Theta$ implicitly defines the edge set $E$ of
the graph through the following equivalence:
\begin{displaymath}
    (u,v)\in E\Leftrightarrow \Theta_{u,v} > 0,\quad
    (u,v)\in V^2
\end{displaymath}

The time is discretized and indexed by a variable $t\in\N$. The nodes can be in
one of three states: \emph{susceptible}, \emph{infected} or \emph{immune}.
Let us denote by $S_t$ the set of nodes susceptible at the beginning of time
step $t\in\N$ and by $I_t$ the set of nodes who become infected at this time
step. The following dynamics:
\begin{displaymath}
    S_0 = V,\quad S_{t+1} = S_t \setminus I_t
\end{displaymath}
expresses that the nodes infected at a time step are no longer susceptible
starting from the next time step (they become part of the immune nodes).

The dynamics of $I_t$ are described by a random Markovian process. Let us
denote by $x_t$ the indicator vector of $I_t$ at time step $t\in\N$. $x_0$ is
drawn from a \emph{source distribution} $p_s:\{0,1\}^n\to[0,1]$. For $t\geq 1$,
we have:
\begin{equation}
    \label{eq:markov}
    \forall i\in S_t,\quad
    \P\big(x_{i}^{t} = 1\,|\, x^{t-1}\big) = f(\bt_i\cdot x^{t-1})
\end{equation}
where $\bt_i$ is the $i$th column of $\Theta$. The function $f:\R\to[0,1]$ can
be interpreted as the inverse link function of the model. Finally, the
transitions in \cref{eq:markov} occur independently for each $i$. A cascade
continues until no infected nodes remains.

We refer the reader to \cite{pouget} for a more complete description of the
model and examples of common contagion models which can be interpreted as
specific instances of the GLC model.

It follows from Section 2, that a source distribution $p_s$ and
\cref{eq:markov} together completely specify the distribution $p$ of a cascade
$\mathbf{x} = (x_t)_{t\geq 0}$:
\begin{equation}
    \label{eq:dist}
    \mathcal{L}_{\Theta}(\bx)
    = p_s(x^0)\prod_{\substack{t\geq 1 \\ i\in S_t}}
    f(\bt_i\cdot x^{t-1})^{x^t_i}\big(1-f(\theta_i\cdot x^{t-1})\big)^{1-x_i^t}
\end{equation}

\paragraph{MLE estimation.}
It follows from the form of \cref{eq:dist} that the log-likelihood of $\Theta$
given cascade data can be written as the sum of $k$ terms, each term $i$ only
depending on $\bt_i$. Hence, each $\bt_i$ can be learnt separately by solving
a node-specific optimization problem. Specifically, for a given node $i$, if we
concatenate together all the time steps $t$ where this node was susceptible and
write $y^t = x^{t+1}_i$, \emph{i.e} whether or not the node became infected at
the next time step, the MLE estimator for $\bt_i$ is obtained by solving the
following optimization problem:
\begin{equation}
    \label{eq:mle}
    \hat{\theta}\in \argmax_\theta \sum_{t} y^t\log f(\theta\cdot x^t)
    + (1-y^t) \log \big(1 - f(\theta\cdot x^t)\big)
\end{equation}
It is interesting to note that at the node-level, doing MLE inference for the
GLC model is exactly amounts to fitting a Generalized Linear Model. When $f$ is
log-concave as is the case in most examples of GLC models, then the above
optimization problem becomes a convex optimization problem which can be solved
exactly and efficiently. The code to perform MLE estimation can be found in the
appendix, file \textsf{mle.py}.