\begin{figure}
\centering
\includegraphics[scale=.8]{graphical.pdf}
\caption{Graphical model representation of the Network Inference Problem in the case of a Gaussian product prior, with edge weights $\theta_{ij}$, observed cascade indicator vectors $X^c_t$, and edge prior parameters $\mu_{ij}$ and $\sigma_{ij}$. The source distribution, parameterized by $\phi$, is considered fixed here.}
\label{fig:graphical}
\end{figure}

\subsection{Advantages of the Bayesian Framework}
In this section, we develop a Bayesian approach to the Network Inference Problem by placing priors on the edge weights of the graph. The quantity of interest is the posterior distribution, given by Bayes' rule:
\begin{equation}
\label{eq:bayesrule}
p(\Theta | \bx) \propto \text{prior}(\Theta) \times \mathcal{L}_\Theta(\bx)
\end{equation}
where $\mathcal{L}_\Theta(\bx)$ is the likelihood expressed in Eq.~\ref{eq:dist}.

One advantage of the Bayesian approach is that it conveys distributional information about our belief for each parameter, rather than the point estimates accessible by MLE. For example, the entropy of the posterior on each parameter quantifies how uncertain we are of that edge parameter's value. In the next section, we explore how to exploit this knowledge to reduce our uncertainty faster by focusing on the most relevant parts of the network.

Another advantage of the Bayesian approach is the ability to encode domain knowledge through well-chosen prior distributions. For example, there is an extensive literature~\cite{} on parametric representations of social networks, which attempt to reproduce certain properties of such networks: density of triangles, diameter, degree distribution, clustering coefficient, etc. Accounting for known graph properties, such as reciprocal links or the high density of triangles, has the potential to greatly increase the information we leverage from each cascade. Of course, such priors no longer allow us to perform inference in parallel, which was leveraged in prior work.

\subsection{Inference}
Depending on the link function $f$, the GLC model may not possess conjugate priors (e.g.~the IC model). Even when conjugate priors exist, they may be restricted to product form. In these cases, we resort to sampling algorithms (MCMC) and approximate Bayesian methods (variational inference), which we cover here.

\paragraph{MCMC} The Metropolis-Hastings algorithm allows us to draw samples from the posterior directly using the un-normalized posterior distribution. The advantages of this method are the ability to sample from the exact posterior and the wide availability of software packages that work `out-of-the-box'. However, vanilla MCMC scales poorly and is unsuitable for Bayesian learning of large networks ($\geq 100$ nodes).
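To make this baseline concrete, the sketch below (in Python, for illustration only, not part of our method) draws posterior samples of the edge-weight matrix with a random-walk Metropolis-Hastings kernel. The function \texttt{log\_likelihood} is a placeholder standing in for the GLC likelihood of Eq.~\ref{eq:dist}, and the proposal step size, prior parameters, and initialization are arbitrary illustrative choices rather than tuned settings.

\begin{verbatim}
import numpy as np

def metropolis_hastings(log_likelihood, cascades, n_nodes,
                        n_samples=5000, step=0.05,
                        mu0=0.1, sigma0=0.5, seed=0):
    """Random-walk Metropolis-Hastings over the edge-weight matrix Theta.

    log_likelihood(theta, cascades) is a placeholder for the GLC
    log-likelihood; the prior is a product of Gaussians truncated to R+
    with fixed parameters (mu0, sigma0).
    """
    rng = np.random.default_rng(seed)

    def log_prior(theta):
        # Truncation: zero density (log = -inf) outside R+.
        if np.any(theta < 0):
            return -np.inf
        return -np.sum((theta - mu0) ** 2) / (2 * sigma0 ** 2)

    theta = np.full((n_nodes, n_nodes), mu0)  # start at the prior mean
    log_post = log_prior(theta) + log_likelihood(theta, cascades)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        lp = log_prior(proposal)
        log_post_prop = (lp + log_likelihood(proposal, cascades)
                         if np.isfinite(lp) else -np.inf)
        # Symmetric proposal: accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < log_post_prop - log_post:
            theta, log_post = proposal, log_post_prop
        samples.append(theta.copy())
    return np.array(samples)
\end{verbatim}

Each iteration evaluates the likelihood over all observed cascades, which is what makes this vanilla sampler impractical for large networks.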
\paragraph{Variational Inference} Variational inference algorithms consist of fitting an approximate family of distributions to the exact posterior. The variational objective decomposes as the sum of a divergence term with the prior and a likelihood term:
\begin{equation}
\begin{split}
\mathcal{V}(\mathbf{\Theta}, \mathbf{\Theta'}, \{\mathbf{x}_c\}) = &- \text{KL}(q_{\mathbf{\Theta'}}, p_{\mathbf{\Theta}}) \\
&+ \sum_{c = 1}^C \E_{q_{\mathbf{\Theta'}}} \log \mathcal{L}(\mathbf{x}_c | \mathbf{\Theta})
\end{split}
\end{equation}
where $p_{\mathbf{\Theta}}$ is the prior distribution, parametrized by the prior parameters $\mathbf{\Theta} = (\mathbf{\mu}^0, \mathbf{\sigma}^0)$; $q_{\mathbf{\Theta'}}$ is the approximate posterior distribution, parametrized by the variational parameters $\mathbf{\Theta'} = (\mathbf{\mu}, \mathbf{\sigma})$; $\log \mathcal{L}(\mathbf{x}_c | \mathbf{\Theta})$ is the log-likelihood of Eq.~\ref{eq:dist}; and $\text{KL}(p, q)$ is the Kullback-Leibler divergence between distributions $p$ and $q$. The variational objective lower-bounds the log marginal likelihood, and we maximize it over the variational parameters:
\begin{equation}
\max_{\mathbf{\Theta'}} \mathcal{V}(\mathbf{\Theta}, \mathbf{\Theta'}, \{\mathbf{x}_c\}) \leq \log p_{\mathbf{\Theta}}(\{ \mathbf{x}_c\})
\end{equation}
Contrary to MCMC, which outputs samples from the exact posterior given all observed data, variational inference can process the data in batches and provides an analytical approximation to the posterior, thus improving scalability. In many cases, however, the expectation term has no closed form, and approximating it by sampling scales poorly with the number of parameters. We can instead borrow ideas from B\"ohning~\cite{} and use a linear/quadratic approximation to the log-likelihood for which the expectation term can be written analytically.

\subsection{Example}
We develop the variational inference algorithm for the IC model (cf.~Section~\ref{sec:model}). As mentioned above, the IC model with link function $f: z \mapsto 1 - e^{-z}$ has no conjugate priors. We consider here a truncated Gaussian product prior:
\begin{equation}
\label{eq:gaussianprior}
\text{prior}(\Theta) = \prod_{ij} \mathcal{N}^+(\theta_{ij} | \mu_{ij}, \sigma_{ij})
\end{equation}
where $\mathcal{N}^+(\cdot)$ is a Gaussian truncated to lie on $\mathbb{R}^+$, since $\Theta$ is a transformed parameter $z \mapsto -\log(1 - z)$. This model is represented graphically in Figure~\ref{fig:graphical}. The product form of the prior implies that the KL term decomposes entirely over edges. Since a simple closed-form formula exists for the KL divergence between two Gaussians, we approximate the truncated Gaussians by their non-truncated counterparts:
\begin{equation}
\label{eq:kl}
\begin{split}
\text{KL}(q_{\mathbf{\Theta'}}, p_{\mathbf{\Theta}}) &= \sum_{ij} \text{KL}\left(\mathcal{N}^+(\mu_{ij}, \sigma_{ij}), \mathcal{N}^+(\mu^0_{ij}, \sigma^0_{ij})\right) \\
&\approx \sum_{ij} \log \frac{\sigma^0_{ij}}{\sigma_{ij}} + \frac{\sigma^2_{ij} + {(\mu_{ij} - \mu_{ij}^0)}^2}{2{(\sigma^0_{ij})}^2} - \frac{1}{2}
\end{split}
\end{equation}
In practice, we maximize the variational objective by stochastic gradient ascent: the gradient of the expectation term is estimated with the reparametrization trick, and the cascades are processed in mini-batches, as sketched below.
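To illustrate the reparametrization-trick update, the sketch below (in Python, for illustration only) performs one stochastic-gradient ascent step on the variational objective under the Gaussian product approximation used in Eq.~\ref{eq:kl}. The function \texttt{grad\_log\_lik} is a hypothetical placeholder for the gradient of the IC log-likelihood of Eq.~\ref{eq:dist} with respect to the edge weights; the clipping used to mimic the truncation, the learning rate, and the prior parameters are illustrative choices rather than the exact procedure described here.

\begin{verbatim}
import numpy as np

def vi_step(mu, log_sigma, batch, grad_log_lik, n_cascades,
            mu0=0.1, sigma0=0.5, n_mc=8, lr=1e-2, rng=None):
    """One stochastic-gradient ascent step on the variational objective.

    q factorizes as a product of Gaussians N(mu_ij, sigma_ij^2) over the
    edge weights; the truncation to R+ is only mimicked by clipping.
    grad_log_lik(theta, batch) is a placeholder for the gradient of the
    IC log-likelihood with respect to theta.
    """
    rng = rng or np.random.default_rng()
    sigma = np.exp(log_sigma)

    # Reparametrization trick: theta = mu + sigma * eps, eps ~ N(0, I).
    g_mu_lik = np.zeros_like(mu)
    g_sig_lik = np.zeros_like(sigma)
    for _ in range(n_mc):
        eps = rng.standard_normal(mu.shape)
        theta = np.maximum(mu + sigma * eps, 0.0)  # crude stand-in for truncation
        g = grad_log_lik(theta, batch)
        g_mu_lik += g / n_mc
        g_sig_lik += g * eps / n_mc
    scale = n_cascades / len(batch)  # re-weight the mini-batch of cascades

    # Exact gradients of the (non-truncated) Gaussian KL term of Eq. (kl).
    g_mu_kl = (mu - mu0) / sigma0 ** 2
    g_sig_kl = sigma / sigma0 ** 2 - 1.0 / sigma

    # Ascent on ELBO = -KL + (C / |batch|) * sum over the batch of E[log-lik].
    mu = mu + lr * (scale * g_mu_lik - g_mu_kl)
    log_sigma = log_sigma + lr * (scale * g_sig_lik - g_sig_kl) * sigma
    return mu, log_sigma
\end{verbatim}

A full training loop would iterate this update over random mini-batches of cascades until the objective stabilizes; replacing the Monte Carlo estimate with the B\"ohning-style quadratic approximation removes the sampling step entirely.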