% finale/sections/active.tex
In this section we study the problem of Graph Inference in the context of
Active Learning: instead of assuming that the source of a cascade is drawn from
a source distribution $p_s$, we assume that after each cascade observation, the
experimenter is able to select the node which will become the source of the
next cascade.

This setting is motivated by the following \emph{data stream} scenario: if the
cascades are observed sequentially in an online manner at a very high rate,
then it is impossible for the experimenter to process and use all the available
data for learning.  One can instead cherry-pick from the stream the cascades
originating at the nodes which are most helpful for the learning problem.

We note that the above scenario is more accurately described as \emph{Online
Active Learning}: not only does the experimenter decide the node at which each
cascade originates, but the decisions are also made sequentially, based only on
the past observations. We address both the online and active aspects below.
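The resulting protocol alternates between selecting a source, observing the
cascade it seeds, and updating the posterior. The following Python sketch is
purely illustrative: \texttt{observe\_cascade} and \texttt{update\_posterior}
are hypothetical stand-ins (here the update simply shrinks the variance of the
probed row), not the actual inference algorithm described later.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of nodes (toy example)

# Hypothetical posterior state: per-edge mean and variance estimates.
mean = np.zeros((n, n))
var = np.ones((n, n))  # high initial uncertainty on every edge

def select_source(var):
    """Pick the node whose outgoing-edge variances sum highest."""
    return int(np.argmax(var.sum(axis=1)))

def observe_cascade(source):
    """Stand-in for observing a cascade seeded at `source`."""
    return rng.random(n) < 0.5  # placeholder infection indicators

def update_posterior(mean, var, source, cascade):
    """Placeholder Bayesian update: uncertainty on the probed row shrinks."""
    var[source] *= 0.5
    return mean, var

# Online active learning loop: select, observe, update.
for _ in range(10):
    i_star = select_source(var)
    cascade = observe_cascade(i_star)
    mean, var = update_posterior(mean, var, i_star, cascade)
```

With this toy update rule the policy cycles through the nodes, probing
whichever row of the variance matrix is currently largest.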

\paragraph{Bayesian Experimental Design.} Bayesian Inference provides a natural
framework for Active Learning: the posterior distribution quantifies the amount
of uncertainty left in the parameters given the observations, and can be used
to design criteria for deciding which experiment to perform next. This is
known as \emph{Bayesian Experimental Design}. We refer the reader to
\cite{chaloner} for a survey of the relevant literature.

A well-studied criterion in Bayesian Experimental Design is the mutual
information between the parameters and the observation, \emph{i.e.} the
difference between the entropy of the prior and the posterior distributions. In
the specific case of Graph Inference, we define the utility function $U$ for
selecting node $i$ given the current prior distribution over $\Theta$:
\begin{displaymath}
    U(i) = I(\Theta, \mathbf{x}\,|\, x^0 = e_i) = H(\Theta) - H(\Theta\,|\,
    \mathbf{x}, x^0 = e_i)
\end{displaymath}
where $x^0 = e_i$ simply expresses that at time $t=0$ only node $i$ is infected
and $\mathbf{x}$ is a cascade whose conditional distribution
$\mathbf{x}\,|\,\Theta$ is given by \cref{eq:dist}. Given the utility function
$U$, our online active learning policy selects at each time step a node $i^*$
such that $i^*\in\argmax_{i\in V} U(i)$.

Unfortunately, the utility function $U$ defined above is computationally hard
to evaluate. We propose instead the following surrogate utility function
$\tilde{U}$:
\begin{displaymath}
    \tilde{U}(i) = \sum_{j\in V} \var(\Theta_{i,j}),\quad i\in V
\end{displaymath}
The justification for this surrogate utility function is given by
\cref{prop:active}, which shows that maximizing $\tilde{U}$ amounts to
maximizing a lower bound on $U$. Note that when performing Variational
Inference with Gaussian approximations as described in \cref{sec:inference},
computing $\tilde{U}$ simply requires summing one row of the matrix of current
variance estimates.
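Concretely, with the variance estimates stored in a matrix, evaluating
$\tilde{U}$ and the resulting selection rule reduce to one row sum per node
followed by an argmax. The numbers below are hypothetical placeholders for the
output of the variational algorithm:

```python
import numpy as np

# Hypothetical variance estimates of the Gaussian variational approximation;
# entry (i, j) is the current estimate of var(Theta_{i,j}).
var = np.array([
    [0.1, 0.4, 0.2],
    [0.3, 0.1, 0.9],
    [0.2, 0.2, 0.1],
])

# Surrogate utility: U~(i) = sum over j of var(Theta_{i,j}), one row sum each.
utility = var.sum(axis=1)

# Policy: seed the next cascade at a node maximizing the surrogate utility.
i_star = int(np.argmax(utility))
```

For the values above, node $1$ is selected because the second row carries the
largest total variance, i.e., the most remaining uncertainty about its
outgoing edges.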

\begin{proposition}
    \label{prop:active} For all $i\in V$, $\tilde{U}(i) \leq U(i)$.
\end{proposition}

\begin{proof} \end{proof}

\begin{remark} In the first part of the proof, we show that the mutual
    information obtained by truncating a cascade to its first time step is
    a lower bound on the complete mutual information. It has been noted in
    a different context \cite{} that for the purpose of graph inference, the
    first step of a cascade is the most informative.
\end{remark}

% TODO (Online Bayesian Updates): discuss SGD on data streams, cite
% "SGD as an online algorithm for data streams", and tie this to our
% Variational Inference algorithm.