| author | Thibaut Horel <thibaut.horel@gmail.com> | 2015-11-06 19:10:40 -0500 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2015-11-06 19:10:40 -0500 |
| commit | fe51cea4aebf43265def42c40934280c7ad06c62 (patch) | |
| tree | 92b9337708395b451f23e726f50f3bcfd19087cc | |
| parent | e482995948fd81f4cdd8cc5f587c223114fbb844 (diff) | |
| download | cascades-fe51cea4aebf43265def42c40934280c7ad06c62.tar.gz | |
Pass over 4.2
| -rw-r--r-- | finale/mid_report.tex | 146 |
1 file changed, 73 insertions, 73 deletions
diff --git a/finale/mid_report.tex b/finale/mid_report.tex
index 67aedd0..bd7f475 100644
--- a/finale/mid_report.tex
+++ b/finale/mid_report.tex
@@ -364,73 +364,72 @@ asymptotically consistent under mild assumptions.
 \subsection{Bayesian Approach}
 
-\paragraph{Notation}
-Let's adopt a Bayesian approach to the previous problem. To fix the notation
-introduced above, we let $G=(V, \Theta)$ be a weighted directed graph with
-$\Theta \in \R_+^{k \times k}$. We place a prior $\mathcal{D}$ on $\Theta$. Let
-$x_0$ be a source node, sampled from distribution $p_s$ over the vertices $V$ of
-the graph $\mathcal{G}$. Since we assume for now that the state of the source
-node is entirely observed, the choice of $p_s$ is not relevant to the inference
-of $\Theta$. Finally, we let $x_t$ (resp. $\mathbbm{1}_{S_t}$) be the binary
-vectors encoding the infected (resp.~susceptible) nodes at time $t$:
-\begin{alignat*}{2}
-  \theta & \sim \mathcal{D}\qquad &&\text{(prior on $\theta$)} \\
-  x_0 & \sim p_s &&\text{(random source distribution)}\\
-  \forall t\geq 0, x_{t+1} | x_t, \theta &\sim Ber(f(\theta\cdot x_t)\cdot
-\mathbbm{1}_{S_t})
-  &&\text{(cascade process)}
-\end{alignat*}
+In this section, we adopt a Bayesian approach to the Network Inference problem.
+Since we assume that the state of the source $x^0$ is observed, the choice of
+$p_s$ is not relevant to the inference of $\Theta$ (as long as it satisfies the
+conditions of Proposition 2). We enrich the model described in \cref{eq:dist}
+by placing a prior $\mathcal{D}$ on $\Theta$.
 
-\paragraph{Graph prior}
-Prior work has focused on MLE estimation regularized using lasso, summarized in
-the section above. This is equivalent to choosing an independent Laplace prior
-for $\theta$:
-$$(\Theta_{i,j})\sim \prod_{i,j} Laplace(0,1)$$
-In the context of social network learning however, we can place more informative
-priors. We can:
+\paragraph{Choice of a Graph Prior.} Following the standard assumption that in
+many applications graphs are sparse (\emph{i.e.}\ contain few edges), prior
+work has focused on MLE estimation with LASSO regularization ($\ell_1$-norm
+penalization). This can be interpreted in the Bayesian setting as MAP
+estimation with an independent Laplace prior on $\Theta$:
+\begin{displaymath}
+\Theta\sim \prod_{i,j} Laplace(0,\lambda)
+\end{displaymath}
+However, in many application domains we know much more about the possible
+structure of the graph $G$ to be learned, and by placing more informative
+priors on $\Theta$ we can:
 \begin{itemize}
-\item Take into account prior knowledge of edges' existence (or lack thereof)
-\item Take into account common graph structures, such as triangles, clustering
+\item take into account prior knowledge of edges' existence (or lack thereof);
+\item take into account knowledge of the graph structure, such as density of
+  triangles, diameter, degree distribution, clustering coefficient, etc.
 \end{itemize}
-A common prior for graph is the Exponential Random Graph Model (ERGM), which
-allows flexible representations of networks and Bayesian inference. The
-distribution of an ERGM family is defined by feature vector $s(G)$ and by the
+A commonly used prior for graphs is the Exponential Random Graph Model (ERGM),
+which allows flexible representations of networks and Bayesian inference. The
+distribution of an ERGM family is defined by a feature vector $s(G)$ and by the
 probability distribution:
-$$P(G | \Theta) \propto \exp \left( s(G)\cdot \Theta \right)$$
+\begin{displaymath}
+P(G | w) \propto \exp \left( s(G)\cdot w \right)
+\end{displaymath}
+
+\paragraph{Inference.} If the prior has a product form, \emph{i.e.}\ it can be
+factorized node-by-node, then the MAP estimation problem has the same
+decomposability property as the MLE estimator described in Section 4.1: it is
+still possible to learn the columns $\bt_i$ of $\Theta$ by solving independent
+optimization problems.
 
-Though straightforward MCMC could be applied here, recent
-work~\cite{caimo2011bayesian, koskinen2010analysing, robins2007recent} has shown
-that ERGM inference has slow convergence and lack of robustness, developping
-better alternatives to naive MCMC formulations. Experiments using such a prior
-are ongoing, but we present results only for simple product distribution-type
-priors here.
+However, for a general ERGM the prior might not have a product form. For
+example, a prior could encourage the presence of triangles or stars by
+including the number of triangles or stars in the feature vector $s(G)$; such
+a prior clearly does not factorize node-by-node. Though straightforward MCMC
+could be applied here, recent work~\cite{caimo2011bayesian,
+koskinen2010analysing, robins2007recent} has shown that ERGM inference suffers
+from slow convergence and a lack of robustness, and develops better
+alternatives to the naive MCMC formulation.
 
-\paragraph{Inference}
-We can sample from the posterior by MCMC\@ using the PyMC library, which works
-relatively well (see paragraph on \emph{experiments}. If the prior is a product
-distribution, then we can simply do Bayesian GLM regression, as in
-Section~\ref{MLEsec}. However, computing the posterior edge-by-edge does not
-accomodate for complex priors, which could encourage for example the presence of
-triangles or two stars. For such priors, we must sample from the posterior of
-the Markov chain directly, which is what we have done here. This method is very
-slow however, and scales badly as a result of the number of cascades and number
-of nodes. We could greatly benefit from using an alternative method:
+\paragraph{Experiments.}
+Experiments using an ERGM prior are ongoing, hence we only present results for
+MCMC on product-type priors. We used the library PyMC to sample from the
+posterior distribution using MCMC (see the paragraph \emph{Results} below).
+This method is very slow, however, and scales badly in the number of cascades
+and the number of nodes. We could greatly benefit from using an alternative
+method:
 \begin{itemize}
 \item EM\@. This approach was used in \cite{linderman2014discovering,
-simma2012modeling} to learn
-the parameters of a Hawkes process, a closely related inference problem.
+  simma2012modeling} to learn the parameters of a Hawkes process, a closely
+  related inference problem.
 \item Variational Inference. This approach was used
-in~\cite{linderman2015scalable} as an extension to the paper cited in the
-previous bullet point. Considering the scalabilty of their approach, we hope to
-apply their method to our problem here, due to the similarity of the two
-processes, and to the computational constraints of running MCMC over a large
-parameter space.
+  in~\cite{linderman2015scalable} as an extension to
+  \cite{linderman2014discovering}. Considering the scalability of their
+  approach, we hope to apply their method to our problem here, due to the
+  similarity of the two processes, and to the computational constraints of
+  running MCMC over a large parameter space.
 \end{itemize}
-
-
 \begin{figure}
 \subfloat[][50 cascades]{
 \includegraphics[scale=.4]{../simulation/plots/2015-11-05_22:52:30.pdf}}
@@ -453,28 +452,29 @@ confounded with the y axis.}
 \label{betapriorbayeslearning}
 \end{figure}
 
-\paragraph{Experiments}
-
-We ran some experiments on a simple network with 4 nodes with $\binom{4}{2}=6$
-parameters to learn with the MCMC package PyMC\@. We plot in
-Figure~\ref{betapriorbayeslearning} the progressive learning of $\Theta$ for
-increasing numbers of observations. Of note, since the IC model does not include
-self-loops, the diagonal terms are never properly learned, which is expected but
-not undesirable. We notice that the existence or not of an edge is (relatively)
-quickly learned, with the posterior on edges with no weight converging to $0$
-after $100$ cascades. To get a concentrated posterior around the true non-zero
-edge weigth requires $1000$ cascades, which is unreasonably high considering the
-small number of parameters that we are learning in this experiment.
+\paragraph{Results.} We ran experiments on a simple network with 4 nodes and
+$\binom{4}{2}=6$ parameters to learn, using the MCMC package PyMC\@. We plot
+in Figure~\ref{betapriorbayeslearning} the progressive learning of $\Theta$ for
+increasing numbers of observations. Of note, since the GLMC model does not
+include self-loops, the diagonal terms are never properly learned; this is
+expected and not a problem. We notice that whether or not an edge exists is
+learned (relatively) quickly, with the posterior on edges with no weight
+converging to $0$ after $100$ cascades. Obtaining a concentrated posterior
+around the true non-zero edge weight requires $1000$ cascades, which is
+unreasonably high considering the small number of parameters we are learning
+in this experiment.
 
 \subsection{Active Learning}
 
-In this setup, $S$ is no longer drawn from a random distribution $p_s$ but is
-chosen by the designer of the experiment. Once a source is drawn, a cascade is
-drawn from the Markovian process, $\mathcal{D'}$. We wish to choose the source
-which maximises the information gain on the parameter $\Theta$, so as to speed
-up learning. We suggest to choose the source which maximises the mutual
-information between $\theta$ and $X$ at each time step. The mutual information
-can be computed node-by-node by sampling:
+Finally, we propose an Active Learning formulation of the Network Inference
+problem. In this setup, the source $x^0$ is no longer drawn from a random
+distribution $p_s$; instead, a single source $i$ is chosen by the designer of
+the experiment. Once a source is chosen, a cascade is drawn from the Markovian
+process $\mathcal{D'}$. We wish to choose the source which maximises the
+information gain on the parameter $\Theta$, so as to speed up learning. We
+suggest choosing the source which maximises the mutual information between
+$\Theta$ and $X$ at each time step. The mutual information can be computed
+node-by-node by sampling:
 \begin{equation*}
 I((x_t)_{t\geq 1}, \Theta | x_0 = i) = \mathbb{E}_{\substack{\Theta \sim D_t \\ x | \Theta, i \sim D'}}\log p(x|\Theta) - \mathbb{E}_{\substack{\Theta \sim D_t \\ x | \Theta, i \sim D'}}\log p(x)
 \end{equation*}
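To make the LASSO-as-Laplace-MAP equivalence from the diff concrete, here is a minimal Python sketch of the column-wise MAP estimation it describes. This is a hypothetical illustration, not the report's code: the sigmoid choice for $f$, the name `map_estimate_column`, the penalty weight `lam`, and the derivative-free Powell solver are all assumptions (the report itself sampled the posterior with PyMC).

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_estimate_column(X_prev, y_next, lam=0.5):
    """MAP estimate of one column theta_i of Theta under an independent
    Laplace prior, i.e. L1-penalized logistic regression.

    X_prev: (T, k) array of infection state vectors x_t
    y_next: (T,) array of node i's state at time t+1
    lam:    Laplace prior scale (hypothetical value)
    """
    k = X_prev.shape[1]
    eps = 1e-12

    def neg_log_posterior(theta):
        p = sigmoid(X_prev @ theta)  # assumes f is the sigmoid
        nll = -np.sum(y_next * np.log(p + eps)
                      + (1.0 - y_next) * np.log(1.0 - p + eps))
        return nll + lam * np.abs(theta).sum()  # Laplace prior = L1 penalty

    # Powell is derivative-free, so the non-smooth L1 term is not an issue.
    return minimize(neg_log_posterior, np.zeros(k), method="Powell").x
```

Because a product prior keeps the posterior decomposable, each column of $\Theta$ can be estimated by an independent call like this; a non-product ERGM prior breaks that decomposition, which is why the report falls back to sampling.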
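The mutual information criterion can be approximated by a nested Monte Carlo average over posterior draws of $\Theta$. The sketch below is a hypothetical illustration: `simulate` and `log_lik` stand in for a cascade simulator and the cascade log-likelihood (neither appears in the diff), and the marginal $p(x)$ is approximated by averaging $p(x \mid \Theta')$ over the same posterior draws.

```python
import numpy as np

def estimate_mutual_information(i, theta_samples, simulate, log_lik):
    """Monte Carlo estimate of I((x_t)_{t>=1}, Theta | x_0 = i).

    theta_samples: posterior draws of Theta (e.g. from MCMC)
    simulate(theta, i): draws one cascade x from source i under theta
    log_lik(x, theta): log p(x | theta) for the cascade model
    """
    total = 0.0
    for theta in theta_samples:
        x = simulate(theta, i)
        # First term: log-likelihood under the theta that generated x.
        ll_gen = log_lik(x, theta)
        # Second term: log of the posterior-averaged marginal p(x),
        # computed with a log-sum-exp shift for numerical stability.
        lls = np.array([log_lik(x, th) for th in theta_samples])
        m = lls.max()
        log_marginal = m + np.log(np.mean(np.exp(lls - m)))
        total += ll_gen - log_marginal
    return total / len(theta_samples)

def best_source(n_nodes, theta_samples, simulate, log_lik):
    """Pick the source maximising the estimated information gain."""
    return max(range(n_nodes),
               key=lambda i: estimate_mutual_information(
                   i, theta_samples, simulate, log_lik))
```

Each outer draw pairs one posterior sample with one simulated cascade, and the inner loop costs a further pass over all samples, so the estimator needs on the order of $M^2$ likelihood evaluations per candidate source; this quadratic cost is one more reason the scalability concerns raised above matter here.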
