In this section, we study the problem of Graph Inference in the context of Active Learning: instead of assuming that the source of a cascade is drawn from a source distribution $p_s$, we assume that after each cascade observation, the experimenter is able to select the node which will become the source of the next cascade. This setting is motivated by the following \emph{data stream} scenario: if the cascades are observed sequentially in an online manner at a very high rate, then it is impossible for the experimenter to process and use all the available data for learning. One can instead cherry-pick from the stream the cascades originating at the nodes which are most helpful for the learning problem. We note that the above scenario is more accurately described as \emph{Online Active Learning}: not only does the experimenter decide the node at which each cascade originates, but the decisions are also made sequentially, based only on past observations. We address both the online and the active aspects below.

\paragraph{Bayesian Experimental Design.} Bayesian Inference provides a natural framework for Active Learning: indeed, the posterior distribution quantifies the amount of uncertainty left in the parameters given the observations, and can be used to design criteria for deciding which experiment to perform next. This approach is known as \emph{Bayesian Experimental Design}. We refer the reader to \cite{chaloner} for a survey of the relevant literature. A well-studied criterion in Bayesian Experimental Design is the mutual information between the parameters and the observation, \emph{i.e.}\ the difference between the entropy of the prior distribution and the expected entropy of the posterior distribution. In the specific case of Graph Inference, we define the utility function $U$ for selecting node $i$ given the current prior distribution $\Theta$:
\begin{displaymath}
U(i) = I(\Theta, \bx\,|\, x^0 = e_i) = H(\Theta) - H(\Theta\,|\, \bx, x^0 = e_i)
\end{displaymath}
where $x^0 = e_i$ simply expresses that at time $t=0$ only node $i$ is infected, and $\bx$ is a cascade whose conditional distribution $\bx\,|\,\Theta$ is given by \cref{eq:dist}. Given the utility function $U$, our online active learning policy simply selects at each time step a node $i^*$ such that $i^*\in\argmax_{i\in V} U(i)$.

Unfortunately, the utility function $U$ defined above is computationally hard to evaluate, since the conditional entropy requires marginalizing over all possible cascades. We propose instead the following surrogate utility function $\tilde{U}$:
\begin{displaymath}
\tilde{U}(i) = \sum_{j\in V} \var(\Theta_{i,j}),\quad i\in V
\end{displaymath}
The justification of this surrogate utility function is given by \cref{prop:active}, which shows that maximizing $\tilde{U}$ amounts to maximizing a lower bound on $U$. Note that when performing Variational Inference with Gaussian approximations as described in \cref{sec:inference}, computing $\tilde{U}$ simply amounts to summing one row of the matrix of current variance estimates.

\begin{proposition}
\label{prop:active}
For all $i\in V$, $\tilde{U}(i) \leq U(i)$.
\end{proposition}
\begin{proof}
\end{proof}

\begin{remark}
In the first part of the proof, we show that the mutual information computed from a cascade truncated to its first time step is a lower bound on the complete mutual information. It has been noted in a different context \cite{} that for the purpose of graph inference, the first step of a cascade is the most informative.
\end{remark}
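To make the selection step concrete, the following Python sketch computes the surrogate utility $\tilde{U}$ from the Gaussian variational approximation and selects the next source node. It is illustrative only: the function and variable names, in particular the matrix \texttt{var\_est} of variance estimates, are our own placeholders and not part of the algorithm's specification.
\begin{verbatim}
import numpy as np

def select_next_source(var_est):
    """Select the next cascade source via the surrogate utility.

    var_est: (V, V) array; var_est[i, j] is the current estimate
    of Var(Theta_{i,j}) under the Gaussian variational
    approximation of Section "Inference".
    """
    # U_tilde(i) = sum_j Var(Theta_{i,j}): the sum of row i.
    u_tilde = var_est.sum(axis=1)
    # Pick i* in argmax_{i in V} U_tilde(i).
    return int(np.argmax(u_tilde))
\end{verbatim}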
\paragraph{Online Bayesian Updates.} The online aspect of the problem is handled by observing that stochastic gradient descent is itself an online algorithm for data streams \cite{}: each observation is used for a single gradient step and can then be discarded. Applied to the Variational Inference algorithm of \cref{sec:inference}, this means that each selected cascade triggers one stochastic gradient update of the variational parameters, after which the refreshed variance estimates are immediately available to evaluate $\tilde{U}$ and select the source of the next cascade.
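As an illustration of the full loop, the following sketch combines active source selection with online variational updates. Here \texttt{observe\_cascade} and \texttt{vi\_sgd\_step} are hypothetical placeholders for, respectively, the cascade-generating environment and one stochastic gradient update of the variational parameters; the actual update rule is the one specified in \cref{sec:inference}.
\begin{verbatim}
import numpy as np

def online_active_learning(n_rounds, mean_est, var_est,
                           observe_cascade, vi_sgd_step):
    """Online active learning loop (illustrative sketch only).

    mean_est, var_est: (V, V) arrays holding the means and
    variances of the Gaussian variational approximation of the
    posterior over Theta.
    """
    for _ in range(n_rounds):
        # Active step: pick the source maximizing U_tilde.
        i_star = int(np.argmax(var_est.sum(axis=1)))
        # Observe one cascade originating at node i_star.
        cascade = observe_cascade(source=i_star)
        # Online step: a single SGD update of the variational
        # parameters; the cascade is then discarded.
        mean_est, var_est = vi_sgd_step(mean_est, var_est, cascade)
    return mean_est, var_est
\end{verbatim}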