In this section we study the problem of Graph Inference in the context of
Active Learning: instead of assuming that the source of a cascade is drawn from
a source distribution $p_s$, we assume that after each cascade observation, the
experimenter is able to select the node which will become the source of the
next cascade.
This setting is motivated by the following \emph{data stream} scenario: if the
cascades are observed sequentially in an online manner at a very high rate,
then it is impossible for the experimenter to process and use all the available
data for learning. One can instead cherry-pick from the stream the cascades
originating at the nodes which are most helpful for the learning problem.
We note that the above scenario is more accurately described as
\emph{Online Active Learning}: not only does the experimenter decide the node
at which each cascade originates, but the decisions are also made sequentially,
based only on past observations. We address both the online and active aspects
below.
\paragraph{Bayesian Experimental Design.} Bayesian Inference provides a natural
framework for Active Learning: indeed, the posterior distribution quantifies
the amount of uncertainty left in the parameters given the observations and can
be used to design criteria to decide which experiment to perform next. This is
known as \emph{Bayesian Experimental Design}. We refer the reader to
\cite{chaloner} for a survey of the relevant literature.
A well-studied criterion in Bayesian Experimental Design is the mutual
information between the parameters and the observation, \emph{i.e.}\ the
difference between the entropy of the prior and that of the posterior
distribution. In the specific case of Graph Inference, we define the utility
function $U$ for selecting node $i$ given the current prior distribution $\Theta$:
\begin{displaymath}
U(i) = I(\Theta, \bx\,|\, x^0 = e_i) = H(\Theta) - H(\Theta\,|\,
\bx, x^0 = e_i)
\end{displaymath}
where $x^0 = e_i$ simply expresses that at time $t=0$ only node $i$ is infected
and $\bx$ is a cascade whose conditional distribution $\bx\,|\,\Theta$ is given
by \cref{eq:dist}. Given utility function $U$, our online active learning
policy simply selects at each time step a node $i^*$ such that
$i^*\in\argmax_{i\in V} U(i)$.
Unfortunately, the utility function $U$ defined above is computationally hard
to evaluate. We propose instead the following surrogate utility function
$\tilde{U}$:
\begin{displaymath}
\tilde{U}(i) = \sum_{j\in V} \var(\Theta_{i,j}),\quad i\in V
\end{displaymath}
This surrogate utility function is partially motivated by \cref{prop:active}
which shows that $U(i)$ can be lower-bounded by a simple node-level criterion
expressed as a sum over the outgoing edges of node $i$.
From a computational perspective, $\tilde{U}$ is much easier to evaluate than
$U$. In particular, when the Bayesian Inference method maintains the matrix of
variances of the approximate posterior distribution, as described in
\cref{sec:example}, computing $\tilde{U}$ simply amounts to summing this
matrix along its rows.
\begin{proposition}
\label{prop:active}
Assume that the prior distribution $\Theta$ has a product form. Then:
\begin{displaymath}
U(i) \geq \sum_{j\in V} I\big(\Theta_{i,j},
\mathcal{B}(\Theta_{i,j})\big)
\quad i\in V
\end{displaymath}
where $\mathcal{B}(\Theta_{i,j})$ denotes a Bernoulli variable of
parameter $\Theta_{i,j}$.
\end{proposition}
\begin{proof}
The data processing inequality implies that for any function $f$ and any two
random variables $X$ and $Y$:
\begin{displaymath}
I(X, Y) \geq I(X, f(Y)).
\end{displaymath}
Applying this inequality to $\Theta$ and $\bx$ with $f:\bx\mapsto x^1$
which truncates a cascade to its first time step, we get:
\begin{displaymath}
U(i) \geq I(\Theta, x^1\,|\, x^0 = e_i)
\end{displaymath}
Using the chain rule for mutual information and the fact that, conditioned on
$\Theta$ and $x^0$, the coordinates $x^1_j$ of $x^1$ are mutually independent:
\begin{displaymath}
U(i) \geq \sum_{j\in V} I(\Theta, x^1_j\,|\, x^0 = e_i)
\end{displaymath}
Since $\Theta$ has a product form and since $x^1_j$ only depends on
$\Theta_{i,j}$ (conditioned on $x^0 = e_i$) we can further write:
\begin{displaymath}
U(i)\geq \sum_{j\in V} I(\Theta_{i,j}, x^1_j\,|\,x^0 = e_i).
\end{displaymath}
This completes the proof since, conditioned on $x^0 = e_i$, $x^1_j$ is
a Bernoulli variable of parameter $\Theta_{i,j}$.
\end{proof}
\begin{remark} In the first part of the proof, we show that the mutual
information computed from a cascade truncated to its first time step
lower-bounds the complete mutual information. It has been noted in
\cite{Abrahao:13}, albeit for a slightly different cascade model, that for
the purpose of graph inference the head of a cascade is the most informative
part. This further motivates our proposed approximation of $U$.
\end{remark}
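To make the lower bound concrete, note that for a Bernoulli observation $I(\Theta_{i,j}, \mathcal{B}(\Theta_{i,j})) = h(\mathbb{E}[\Theta_{i,j}]) - \mathbb{E}[h(\Theta_{i,j})]$, where $h$ is the binary entropy function. A minimal sketch, assuming hypothetical Beta priors on the outgoing edges of a node (not the paper's actual prior), estimates each term by Monte Carlo:

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, with 0 log 0 = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def bernoulli_mutual_info(theta_samples):
    """Monte Carlo estimate of I(Theta, B(Theta)) from prior samples,
    using I(Theta, X) = h(E[Theta]) - E[h(Theta)]."""
    return binary_entropy(theta_samples.mean()) - binary_entropy(theta_samples).mean()

# Hypothetical (alpha, beta) parameters of Beta priors, one per outgoing edge.
rng = np.random.default_rng(0)
priors = [(2.0, 5.0), (1.0, 1.0), (10.0, 10.0)]
lower_bound = sum(
    bernoulli_mutual_info(rng.beta(a, b, size=100_000)) for a, b in priors
)
# By the proposition, `lower_bound` under-estimates U(i) for this node.
```

Each term is nonnegative by concavity of $h$ (Jensen's inequality), and vanishes exactly when the prior on that edge is a point mass, matching the intuition that already-known edges contribute no information.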
\paragraph{Online Bayesian Updates.} The online aspect of the setting requires
updating the posterior distribution after each cascade observation without
revisiting past data. Stochastic gradient descent is naturally suited to this
regime: it can be interpreted as an online algorithm for processing data
streams, performing one gradient step per observation in constant memory. In
particular, our variational inference algorithm can be run in this streaming
fashion by taking a single stochastic gradient step on the variational
parameters for each newly observed cascade.
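A minimal sketch of such a streaming update follows, assuming a placeholder gradient: `grad_elbo` below is a stand-in, not the actual per-cascade ELBO gradient, whose form depends on the chosen variational family.

```python
import numpy as np

def grad_elbo(params, cascade):
    # Placeholder gradient; a real implementation differentiates the
    # per-cascade evidence lower bound with respect to `params`.
    return cascade - params

def online_update(params, cascade, step_size):
    """Single online SGD step: past cascades are never revisited."""
    return params + step_size * grad_elbo(params, cascade)

rng = np.random.default_rng(0)
params = np.zeros(4)
for t in range(1, 101):  # cascades arrive one at a time from the stream
    cascade = rng.binomial(1, 0.3, size=4).astype(float)
    # Robbins-Monro step sizes 1/t ensure convergence of the averages.
    params = online_update(params, cascade, step_size=1.0 / t)
```

With this particular placeholder gradient and $1/t$ step sizes, the loop computes a running average of the observed cascades; the point of the sketch is only the constant-memory, one-pass structure of the update.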