finale/sections/model.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69

The GLC model is described over a directed graph $G = (V, \Theta)$. Denoting by
$k=|V|$ the number of nodes in the graph, $\Theta\in\R_{+}^{k\times k}$ is the
matrix of edge weights. Note that $\Theta$ implicitly defines the edge set $E$ of
the graph through the following equivalence:
\begin{displaymath}
    (u,v)\in E\Leftrightarrow \Theta_{u,v} > 0,\quad
    (u,v)\in V^2
\end{displaymath}

The time is discretized and indexed by a variable $t\in\N$. The nodes can be in
one of three states: \emph{susceptible}, \emph{infected} or \emph{immune}.
Let us denote by $S_t$ the set of nodes susceptible at the beginning of time
step $t\in\N$ and by $I_t$ the set of nodes who become infected at this time
step. The following dynamics:
\begin{displaymath}
    S_0 = V,\quad S_{t+1} = S_t \setminus I_t
\end{displaymath}
expresses that the nodes infected at a time step are no longer susceptible
starting from the next time step (they become part of the immune nodes).

The dynamics of $I_t$ are described by a random Markovian process. Let us
denote by $x_t$ the indicator vector of $I_t$ at time step $t\in\N$. $x_0$ is
drawn from a \emph{source distribution} $p_s:\{0,1\}^n\to[0,1]$. For $t\geq 1$,
we have:
\begin{equation}
    \label{eq:markov}
    \forall i\in S_t,\quad
    \P\big(x_{i}^{t} = 1\,|\, x^{t-1}\big) = f(\bt_i\cdot x^{t-1})
\end{equation}
where $\bt_i$ is the $i$th column of $\Theta$. The function $f:\R\to[0,1]$ can
be interpreted as the inverse link function of the model. Finally, the
transitions in \cref{eq:markov} occur independently for each $i$. A cascade
continues until no infected nodes remains. As noted in \cite{pouget} many
commonly studied contagion models can be cast as specific instances of the GLC
model.

\Cref{eq:markov} and a source distribution $p_s$ together completely specify
the probability distribution of a cascade $\mathbf{x} = (x_t)_{t\geq 0}$ given
$\Theta$ and allow us to write the log-likelihood of the model:
\begin{equation}
    \label{eq:dist}
    \begin{split}
        \mathcal{L}(\Theta\,|\, \mathbf{x}) = &\log p_s(x^0)\\
    & + \sum_{t\geq 1}\sum_{i\in S_t}\Big(x_i^t\log f(\bt_i\cdot x^{t-1})\\
    &+ (1-x_i^t)\log\big(1-f(\bt_i\cdot x^{t-1})\big)\Big)
    \end{split}
\end{equation}

\paragraph{MLE estimation.}
It follows from the form of \cref{eq:dist} that the log-likelihood of $\Theta$
given cascade data can be written as the sum of $k$ terms, each term $i$ only
depending on $\bt_i$. Hence, each $\bt_i$ can be learnt separately by solving
a node-specific optimization problem. Specifically, for a given node $i$, if we
concatenate together all the time steps $t$ where this node was susceptible and
write $y^t = x^{t+1}_i$, \emph{i.e} whether or not the node became infected at
the next time step, the MLE estimator for $\bt_i$ is obtained by solving the
following optimization problem:
\begin{equation}
    \label{eq:mle}
    \begin{split}
        \hat{\bt}_i\in \argmax_\theta \sum_{t} &y^t\log f(\theta\cdot x^t)\\
        &+ (1-y^t) \log \big(1 - f(\theta\cdot x^t)\big)
    \end{split}
\end{equation}
It is interesting to note that at the node-level, doing MLE inference for the
GLC model is exactly amounts to fitting a Generalized Linear Model. When $f$ is
log-concave as is the case in most examples of GLC models, then the above
optimization problem becomes a convex optimization problem which can be solved
exactly and efficiently.