\begin{figure}
\centering
\includegraphics[scale=.8]{graphical.pdf}
\caption{Graphical model representation of the Network Inference Problem in the
case of a Gaussian product prior, with edge weights $\theta_{ij}$, observed
cascade indicator vectors $X^c_t$, edge prior parameters $\mu_{ij}$ and
$\sigma_{ij}$. The source distribution, parameterized by $\phi$, is considered
fixed here.}
\label{fig:graphical}
\end{figure}
\subsection{Advantages of the Bayesian Framework}
In this section, we develop a Bayesian approach to the Network Inference Problem
by placing priors on the edge weights of the graph. The quantity of interest is
the posterior distribution, given through Bayes' rule by:
\begin{equation}
\label{eq:bayesrule}
p(\Theta \mid \bx) \propto \text{prior}(\Theta) \times \mathcal{L}_\Theta(\bx)
\end{equation}
where $\mathcal{L}_\Theta(\bx)$ is the likelihood expressed in
Eq.~\ref{eq:dist}.
One advantage of the Bayesian approach is its ability to convey distributional
information about our belief for each parameter, rather than the point
estimates provided by MLE. For example, examining the entropy of the posterior
over each parameter allows us to quantify how uncertain we are of each edge
parameter's value. In the next section, we explore how to exploit this
knowledge to improve the rate at which we decrease our uncertainty by focusing
on the most relevant parts of the network.
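As an illustration, the following Python sketch ranks edges by the differential
entropy of their approximate posterior, assuming (as in the variational scheme
developed below) that each edge weight has an independent Gaussian posterior
with parameters $\mu_{ij}$ and $\sigma_{ij}$; all names and values are
illustrative.
\begin{verbatim}
import numpy as np

# Differential entropy of N(mu, sigma^2) is 0.5 * log(2 * pi * e * sigma^2),
# so per-edge posterior uncertainty depends only on the scale sigma.
def edge_entropy(sigma):
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

# Hypothetical 2-node network: nan marks absent self-loops.
sigma = np.array([[np.nan, 0.3],
                  [0.8, np.nan]])
entropy = edge_entropy(sigma)
most_uncertain = np.unravel_index(np.nanargmax(entropy), entropy.shape)
\end{verbatim}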
Another advantage of the Bayesian approach is the ability to encode domain
knowledge through well-chosen prior distributions. For example, there is an
extensive literature~\cite{} on parametric representations of social networks,
which attempt to reproduce certain properties of such networks: density of
triangles, diameter, degree distribution, clustering coefficient, etc.
Accounting for known graph properties, such as reciprocal links or the high
density of triangles, has the potential to greatly increase the information we
leverage from each cascade. Of course, such priors no longer allow us to
perform inference in parallel, a property leveraged in prior work.
\subsection{Inference}
Depending on the link function $f$, the GLC model may not possess conjugate
priors (e.g.~the IC model). Even if conjugate priors exist, they may be
restricted to product form. In these cases, we resort to sampling algorithms
(MCMC) and approximate Bayesian methods (variational inference), which we
cover here.
\paragraph{MCMC}
The Metropolis-Hastings algorithm, an instance of Markov chain Monte Carlo
(MCMC), allows us to draw samples from the posterior using only the
un-normalized posterior density. The advantages of this method are the ability
to sample from the exact posterior and the wide availability of software
packages that work `out-of-the-box'. However, vanilla MCMC scales poorly and
is unsuitable for Bayesian learning of large networks ($\geq 100$ nodes).
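To make this concrete, a minimal random-walk Metropolis-Hastings sketch in
Python is given below; \texttt{log\_post} stands in for the un-normalized log
posterior of Eq.~\ref{eq:bayesrule} and is assumed to be supplied by the user,
so this is a sketch rather than a reference implementation.
\begin{verbatim}
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples=10000,
                        step=0.05, rng=None):
    # Random-walk proposals with a symmetric Gaussian kernel;
    # log_post is the un-normalized log posterior over edge weights.
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp_prop = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        samples.append(theta.copy())
    return np.array(samples)
\end{verbatim}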
\paragraph{Variational Inference}
Variational inference algorithms fit an approximate family of distributions to
the exact posterior. The variational objective decomposes into a divergence
term against the prior and an expected likelihood term:
\begin{equation}
\begin{split}
\mathcal{V}(\mathbf{\Theta}, \mathbf{\Theta'}, \{\mathbf{x}_c\}) = &-
\text{KL}(q_{\mathbf{\Theta'}}, p_{\mathbf{\Theta}}) \\ &+ \sum_{c = 1}^C
\E_{q_{\mathbf{\Theta'}}} \log \mathcal{L}(\mathbf{x}_c | \mathbf{\Theta})
\end{split}
\end{equation}
where $p_{\mathbf{\Theta}}$ is the prior distribution, parametrized by prior
parameters $\Theta = (\mathbf{\mu}^0, \mathbf{\sigma}^0)$,
$q_{\mathbf{\Theta'}}$ is the approximate posterior distribution, parametrized
by variational parameters $\Theta' = (\mathbf{\mu}, \mathbf{\sigma})$,
$\log \mathcal{L}(\mathbf{x}_c | \Theta)$ is the log-likelihood written in
Eq.~\ref{eq:dist}, and $\text{KL}(p, q)$ is the Kullback-Leibler divergence
between distributions $p$ and $q$. The variational objective is a lower bound
on the log marginal likelihood, maximized over the variational parameters:
\begin{equation}
\max_{\mathbf{\Theta'}} \mathcal{V}(\mathbf{\Theta}, \mathbf{\Theta'},
\{\mathbf{x}_c\}) \leq \log p_\Theta(\{ \mathbf{x}_c\})
\end{equation}
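For intuition, the sketch below estimates $\mathcal{V}$ with a single
reparameterized sample from a factorized Gaussian $q$; \texttt{log\_lik} and
\texttt{kl\_to\_prior} are hypothetical helpers standing in for
Eq.~\ref{eq:dist} and the KL term, and a practical implementation would average
several samples or use the analytical approximation discussed next.
\begin{verbatim}
import numpy as np

def elbo_estimate(mu, sigma, mu0, sigma0, cascades,
                  log_lik, kl_to_prior, rng=None):
    # Single-sample Monte Carlo estimate of the variational objective.
    rng = np.random.default_rng() if rng is None else rng
    # Reparameterization: theta = mu + sigma * eps with eps ~ N(0, I).
    theta = mu + sigma * rng.standard_normal(mu.shape)
    expected_ll = sum(log_lik(theta, c) for c in cascades)
    return expected_ll - kl_to_prior(mu, sigma, mu0, sigma0)
\end{verbatim}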
Contrary to MCMC, which outputs samples from the exact posterior given all
observed data, the variational inference approach allows us to process data in
batches and provides an analytical approximation to the posterior, thus
improving scalability. In many cases, however, the expectation term has no
closed form, and approximating it by sampling does not scale well with the
number of parameters. We can instead borrow ideas from Bohning~\cite{} to
propose a linear/quadratic approximation to the log-likelihood for which the
expectation term can be written analytically.
\subsection{Example}
We now develop the variational inference algorithm for the IC model (cf.\
Section~\ref{sec:model}). As mentioned above, the IC model with link function
$f: z \mapsto 1 - e^{-z}$ has no conjugate priors. We consider here a product
of truncated Gaussian priors:
\begin{equation}
\label{eq:gaussianprior}
\text{prior}(\Theta) = \prod_{ij} \mathcal{N}^+(\theta_{ij} | \mu_{ij},
\sigma_{ij})
\end{equation}
where $\mathcal{N}^+(\cdot)$ is a Gaussian truncated to lie on $\mathbb{R}^+$,
since $\Theta$ is the transformed parameter $z \mapsto -\log(1 - z)$. This
model is represented in the graphical model of Figure~\ref{fig:graphical}.
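For concreteness, the change of variables relating the IC transmission
probability $p_{ij} \in [0, 1)$ and the edge weight $\theta_{ij}$ under the
link function above is
\begin{equation}
p_{ij} = f(\theta_{ij}) = 1 - e^{-\theta_{ij}},
\qquad
\theta_{ij} = -\log(1 - p_{ij}) \in \mathbb{R}^+,
\end{equation}
which is why the prior must place its mass on $\mathbb{R}^+$.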
The product-form of the prior implies that the KL term is entirely decomposable:
\begin{equation}
\text{KL}(q_{\mathbf{\Theta'}}, p_{\mathbf{\Theta}}) = \sum_{ij}
\text{KL}\left(\mathcal{N}^+(\mu_{ij}, \sigma_{ij}), \mathcal{N}^+(\mu^0_{ij},
\sigma^0_{ij})\right)
\end{equation}
Since an easy closed-form formula exists for the KL divergence between two
Gaussians, we approximate the truncated Gaussians by their non-truncated
counterparts:
\begin{equation}
\label{eq:kl}
\text{KL}(q_{\mathbf{\Theta'}}, p_{\mathbf{\Theta}}) \approx \sum_{i,j} \log
\frac{\sigma^0_{ij}}{\sigma_{ij}}
+ \frac{\sigma_{ij}^2 + (\mu_{ij} - \mu^0_{ij})^2}{2 (\sigma^0_{ij})^2}
- \frac{1}{2}
\end{equation}
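A direct transcription of this approximation in Python reads as follows;
\texttt{mu} and \texttt{sigma} denote the variational parameters, \texttt{mu0}
and \texttt{sigma0} the prior parameters, all arrays over edges.
\begin{verbatim}
import numpy as np

def gaussian_kl(mu, sigma, mu0, sigma0):
    # Closed-form KL between the non-truncated Gaussian surrogates,
    # summed over all edges (approximation of Eq. (kl)).
    return np.sum(np.log(sigma0 / sigma)
                  + (sigma ** 2 + (mu - mu0) ** 2) / (2.0 * sigma0 ** 2)
                  - 0.5)
\end{verbatim}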