In this section we study the problem of Graph Inference in the context of Active Learning: instead of assuming that the source of a cascade is drawn from a source distribution $p_s$, we assume that after each cascade observation, the experimenter is able to select the node which will become the source of the next cascade. This setting is motivated by the following \emph{data stream} scenario: if cascades are observed sequentially in an online manner at a very high rate, then it is impossible for the experimenter to process and use all the available data for learning. One can instead cherry-pick from the stream the cascades originating at the nodes which are most helpful for the learning problem. We note that the above scenario is more accurately described as \emph{Online Active Learning}: not only does the experimenter decide the node at which each cascade originates, but the decisions are also made sequentially, based only on past observations. We address both the online and active aspects below.

\paragraph{Bayesian Experimental Design.} Bayesian Inference provides a natural framework for Active Learning: indeed, the posterior distribution quantifies the amount of uncertainty left in the parameters given the observations, and can be used to design criteria to decide which experiment to perform next. This is known as \emph{Bayesian Experimental Design}. We refer the reader to \cite{chaloner} for a survey of the relevant literature. A well-studied criterion in Bayesian Experimental Design is the mutual information between the parameters and the observation, \emph{i.e.}, the difference between the entropy of the prior and that of the posterior distribution. In the specific case of Graph Inference, we define the utility function $U$ for selecting node $i$ given the current prior distribution $\Theta$:
\begin{displaymath}
U(i) = I(\Theta, \mathbf{x}\,|\, x^0 = e_i) = H(\Theta) - H(\Theta\,|\, \mathbf{x}, x^0 = e_i)
\end{displaymath}
where $x^0 = e_i$ simply expresses that at time $t=0$ only node $i$ is infected, and $\bx$ is a cascade whose conditional distribution $\bx\,|\,\Theta$ is given by \cref{eq:dist}. Given the utility function $U$, our online active learning policy simply selects at each time step a node $i^*$ such that $i^*\in\argmax_{i\in V} U(i)$.

Unfortunately, the utility function $U$ defined above is computationally hard to evaluate. We propose instead the following surrogate utility function $\tilde{U}$:
\begin{displaymath}
\tilde{U}(i) = \sum_{j\in V} \var(\Theta_{i,j}),\quad i\in V
\end{displaymath}
This surrogate utility function is partially motivated by \cref{prop:active}, which shows that $U(i)$ can be lower-bounded by a simple node-level criterion expressed as a sum over the outgoing edges of node $i$. From a computational perspective, $\tilde{U}$ is much simpler to evaluate than $U$. In particular, when the Bayesian Inference method maintains the matrix of variances of the approximate posterior distribution, as described in \cref{sec:example}, computing $\tilde{U}$ simply amounts to summing this matrix along its rows.

\begin{proposition}
\label{prop:active}
Assume that the prior distribution $\Theta$ has a product form. Then:
\begin{displaymath}
U(i) \geq \sum_{j\in V} I\big(\Theta_{i,j}, \mathcal{B}(\Theta_{i,j})\big), \quad i\in V
\end{displaymath}
where $\mathcal{B}(\Theta_{i,j})$ denotes a Bernoulli variable of parameter $\Theta_{i,j}$.
\end{proposition}

\begin{proof}
The data processing inequality implies that for any function $f$ and any two random variables $X$ and $Y$:
\begin{displaymath}
I(X, Y) \geq I(X, f(Y)).
\end{displaymath}
Applying this inequality to $\Theta$ and $\bx$ with $f:\bx\mapsto x^1$, which truncates a cascade to its first time step, we get:
\begin{displaymath}
U(i) \geq I(\Theta, x^1\,|\, x^0 = e_i).
\end{displaymath}
Using the chain rule for mutual information and the fact that the coordinates $x^1_j$ of $x^1$ are mutually independent:
\begin{displaymath}
U(i) \geq \sum_{j\in V} I(\Theta, x^1_j\,|\, x^0 = e_i).
\end{displaymath}
Since $\Theta$ has a product form and since $x^1_j$ only depends on $\Theta_{i,j}$ (conditioned on $x^0 = e_i$), we can further write:
\begin{displaymath}
U(i) \geq \sum_{j\in V} I(\Theta_{i,j}, x^1_j\,|\,x^0 = e_i).
\end{displaymath}
This completes the proof since, conditioned on $x^0 = e_i$, $x^1_j$ is a Bernoulli variable of parameter $\Theta_{i,j}$.
\end{proof}

\begin{remark}
In the first part of the proof, we show that the mutual information obtained after truncating a cascade to its first time step lower-bounds the mutual information of the complete cascade. It has been noted in \cite{Abrahao:13}, although for a slightly different cascade model, that for the purpose of graph inference the head of a cascade is its most informative part. This further motivates our proposed approximation of $U$.
\end{remark}

\paragraph{Computational Considerations.} Given the online nature of the Active Learning scenario described above, it is crucial that the algorithm used to perform Bayesian Inference supports online updates. This is the case when using Stochastic Gradient Descent to optimize the Variational Inference objective, as described in Section 3.2 and as used in the experiments in Section 5. However, contrary to the standard application of SGD, each data point is processed only once. It has been noted in prior work (see for example \cite{bottou}) that when SGD is run on an infinite data stream, with each data point processed only once, it directly optimizes the expected loss under the distribution of the input data stream (as opposed to the empirical distribution of a fixed input data set, as in standard offline learning). In our case, as we learn the graph in an active manner, the distribution of the input data stream converges to the uniform distribution, which shows the consistency of the resulting inference method.
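To make the above concrete, the following sketch outlines one possible implementation of the resulting loop in Python. It is a minimal illustration, not the implementation used in the experiments: \texttt{observe\_cascade} (a cascade oracle) and \texttt{sgd\_step} (a single SGD update of the variational posterior) are hypothetical placeholders, and \texttt{var\_matrix} holds the matrix of posterior variances described in \cref{sec:example}.
\begin{verbatim}
import numpy as np

def select_source(var_matrix):
    # Surrogate utility U-tilde(i): sum of the posterior variances of
    # Theta_{i,j} over the outgoing edges of node i (row sums).
    utilities = var_matrix.sum(axis=1)
    return int(np.argmax(utilities))

def online_active_learning(var_matrix, observe_cascade, sgd_step,
                           n_rounds):
    # One-pass online loop: each observed cascade is processed once.
    for _ in range(n_rounds):
        i_star = select_source(var_matrix)  # i* maximizing U-tilde
        cascade = observe_cascade(i_star)   # cascade with x^0 = e_{i*}
        var_matrix = sgd_step(cascade)      # single SGD update of the
                                            # variational posterior
    return var_matrix
\end{verbatim}
Note that selecting the source via row sums of the variance matrix costs $O(|V|^2)$ per round, which is negligible compared to evaluating the exact mutual information criterion $U$.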