| author | Thibaut Horel <thibaut.horel@gmail.com> | 2015-11-06 19:10:40 -0500 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2015-11-06 19:10:40 -0500 |
| commit | fe51cea4aebf43265def42c40934280c7ad06c62 (patch) | |
| tree | 92b9337708395b451f23e726f50f3bcfd19087cc | |
| parent | e482995948fd81f4cdd8cc5f587c223114fbb844 (diff) | |
| download | cascades-fe51cea4aebf43265def42c40934280c7ad06c62.tar.gz | |
Pass over 4.2
| -rw-r--r-- | finale/mid_report.tex | 146 |
1 file changed, 73 insertions, 73 deletions
diff --git a/finale/mid_report.tex b/finale/mid_report.tex
index 67aedd0..bd7f475 100644
--- a/finale/mid_report.tex
+++ b/finale/mid_report.tex
@@ -364,73 +364,72 @@ asymptotically consistent under mild assumptions.
 \subsection{Bayesian Approach}
 
-\paragraph{Notation}
-Let's adopt a Bayesian approach to the previous problem. To fix the notation
-introduced above, we let $G=(V, \Theta)$ be a weighted directed graph with
-$\Theta \in \R_+^{k \times k}$. We place a prior $\mathcal{D}$ on $\Theta$. Let
-$x_0$ be a source node, sampled from distribution $p_s$ over the vertices $V$ of
-the graph $\mathcal{G}$. Since we assume for now that the state of the source
-node is entirely observed, the choice of $p_s$ is not relevant to the inference
-of $\Theta$. Finally, we let $x_t$ (resp. $\mathbbm{1}_{S_t}$) be the binary
-vectors encoding the infected (resp.~susceptible) nodes at time $t$:
-\begin{alignat*}{2}
-  \theta & \sim \mathcal{D}\qquad &&\text{(prior on $\theta$)} \\
-  x_0 & \sim p_s &&\text{(random source distribution)}\\
-  \forall t\geq 0, x_{t+1} | x_t, \theta &\sim Ber(f(\theta\cdot x_t)\cdot
-\mathbbm{1}_{S_t})
-  &&\text{(cascade process)}
-\end{alignat*}
+In this section, we adopt a Bayesian approach to the Network Inference problem.
+Since we assume that the state of the source $x^0$ is observed, the choice of
+$p_s$ is not relevant to the inference of $\Theta$ (as long as it satisfies the
+conditions of Proposition 2). We enrich the model described in \cref{eq:dist}
+by placing a prior $\mathcal{D}$ on $\Theta$.
 
-\paragraph{Graph prior}
-Prior work has focused on MLE estimation regularized using lasso, summarized in
-the section above. This is equivalent to choosing an independent Laplace prior
-for $\theta$:
-$$(\Theta_{i,j})\sim \prod_{i,j} Laplace(0,1)$$
-In the context of social network learning however, we can place more informative
-priors. We can:
+\paragraph{Choice of a Graph Prior.} Following the standard assumption that in
+many applications graphs are sparse (\emph{i.e.}\ contain few edges), prior
+work has focused on MLE estimation with LASSO regularization ($\ell_1$-norm
+penalization). This can be interpreted in the Bayesian setting as MAP
+estimation with an independent Laplace prior on $\Theta$:
+\begin{displaymath}
+\Theta\sim \prod_{i,j} Laplace(0,\lambda)
+\end{displaymath}
+However, in many application domains we know much more about the possible
+structure of the graph $G$ to be learned, and by placing more informative
+priors on $\Theta$ we can:
 \begin{itemize}
-\item Take into account prior knowledge of edges' existence (or lack thereof)
-\item Take into account common graph structures, such as triangles, clustering
+\item take into account prior knowledge of edges' existence (or lack thereof);
+\item take into account knowledge of the graph structure, such as density of
+  triangles, diameter, degree distribution, clustering coefficient, etc.
 \end{itemize}
-A common prior for graph is the Exponential Random Graph Model (ERGM), which
-allows flexible representations of networks and Bayesian inference. The
-distribution of an ERGM family is defined by feature vector $s(G)$ and by the
+A commonly used prior for graphs is the Exponential Random Graph Model (ERGM),
+which allows flexible representations of networks and Bayesian inference. The
+distribution of an ERGM family is defined by a feature vector $s(G)$ and by the
 probability distribution:
-$$P(G | \Theta) \propto \exp \left( s(G)\cdot \Theta \right)$$
+\begin{displaymath}
+P(G | w) \propto \exp \left( s(G)\cdot w \right)
+\end{displaymath}
+
+\paragraph{Inference.} If the prior has a product form, \emph{i.e.}\ it can be
+factorized node-by-node, then the MAP estimation problem has the same
+decomposability property as the MLE estimator described in Section 4.1: it is
+still possible to learn the columns $\bt_i$ of $\Theta$ by solving independent
+optimization problems.
 
-Though straightforward MCMC could be applied here, recent
-work~\cite{caimo2011bayesian, koskinen2010analysing, robins2007recent} has shown
-that ERGM inference has slow convergence and lack of robustness, developping
-better alternatives to naive MCMC formulations. Experiments using such a prior
-are ongoing, but we present results only for simple product distribution-type
-priors here.
+However, for a general ERGM the prior might not have a product form. For
+example, a prior could encourage the presence of triangles or stars by
+including the number of triangles or stars in the feature vector $s(G)$; such
+a prior clearly does not factorize node-by-node. Though straightforward MCMC
+could be applied here, recent work~\cite{caimo2011bayesian,
+koskinen2010analysing, robins2007recent} has shown that ERGM inference suffers
+from slow convergence and a lack of robustness, and develops better
+alternatives to the naive MCMC formulation.
 
-\paragraph{Inference}
-We can sample from the posterior by MCMC\@ using the PyMC library, which works
-relatively well (see paragraph on \emph{experiments}. If the prior is a product
-distribution, then we can simply do Bayesian GLM regression, as in
-Section~\ref{MLEsec}. However, computing the posterior edge-by-edge does not
-accomodate for complex priors, which could encourage for example the presence of
-triangles or two stars. For such priors, we must sample from the posterior of
-the Markov chain directly, which is what we have done here. This method is very
-slow however, and scales badly as a result of the number of cascades and number
-of nodes. We could greatly benefit from using an alternative method:
+\paragraph{Experiments.}
+Experiments using an ERGM prior are ongoing, hence we only present results for
+MCMC on product-type priors. We used the library PyMC to sample from the
+posterior distribution using MCMC (see the paragraph \emph{Results} below).
+This method is very slow, however, and scales badly in the number of cascades
+and the number of nodes. We could greatly benefit from using an alternative
+method:
 \begin{itemize}
 \item EM\@. This approach was used in \cite{linderman2014discovering,
-simma2012modeling} to learn
-the parameters of a Hawkes process, a closely related inference problem.
+  simma2012modeling} to learn the parameters of a Hawkes process, a closely
+  related inference problem.
 \item Variational Inference. This approach was used
-in~\cite{linderman2015scalable} as an extension to the paper cited in the
-previous bullet point. Considering the scalabilty of their approach, we hope to
-apply their method to our problem here, due to the similarity of the two
-processes, and to the computational constraints of running MCMC over a large
-parameter space.
+  in~\cite{linderman2015scalable} as an extension to
+  \cite{linderman2014discovering}. Considering the scalability of their
+  approach, we hope to apply their method to our problem here, due to the
+  similarity of the two processes, and to the computational constraints of
+  running MCMC over a large parameter space.
 \end{itemize}
-
-
 \begin{figure}
 \subfloat[][50 cascades]{
 \includegraphics[scale=.4]{../simulation/plots/2015-11-05_22:52:30.pdf}}
@@ -453,28 +452,29 @@ confounded with the y axis.}
 \label{betapriorbayeslearning}
 \end{figure}
 
-\paragraph{Experiments}
-
-We ran some experiments on a simple network with 4 nodes with $\binom{4}{2}=6$
-parameters to learn with the MCMC package PyMC\@. We plot in
-Figure~\ref{betapriorbayeslearning} the progressive learning of $\Theta$ for
-increasing numbers of observations. Of note, since the IC model does not include
-self-loops, the diagonal terms are never properly learned, which is expected but
-not undesirable. We notice that the existence or not of an edge is (relatively)
-quickly learned, with the posterior on edges with no weight converging to $0$
-after $100$ cascades. To get a concentrated posterior around the true non-zero
-edge weigth requires $1000$ cascades, which is unreasonably high considering the
-small number of parameters that we are learning in this experiment.
+\paragraph{Results.} We ran experiments on a simple network with 4 nodes and
+$\binom{4}{2}=6$ parameters to learn, using the MCMC package PyMC\@. We plot
+in Figure~\ref{betapriorbayeslearning} the progressive learning of $\Theta$ for
+increasing numbers of observations. Of note, since the GLMC model does not
+include self-loops, the diagonal terms are never properly learned; this is
+expected and not a problem. We notice that whether or not an edge exists is
+learned (relatively) quickly, with the posterior on edges with no weight
+converging to $0$ after $100$ cascades. Obtaining a concentrated posterior
+around the true non-zero edge weight requires $1000$ cascades, which is
+unreasonably high considering the small number of parameters we are learning
+in this experiment.
 
 \subsection{Active Learning}
 
-In this setup, $S$ is no longer drawn from a random distribution $p_s$ but is
-chosen by the designer of the experiment. Once a source is drawn, a cascade is
-drawn from the Markovian process, $\mathcal{D'}$. We wish to choose the source
-which maximises the information gain on the parameter $\Theta$, so as to speed
-up learning. We suggest to choose the source which maximises the mutual
-information between $\theta$ and $X$ at each time step. The mutual information
-can be computed node-by-node by sampling:
+Finally, we propose an Active Learning formulation of the Network Inference
+problem. In this setup, the source $x^0$ is no longer drawn from a random
+distribution $p_s$; instead, a single source $i$ is chosen by the designer of
+the experiment. Once a source is chosen, a cascade is drawn from the Markovian
+process $\mathcal{D'}$. We wish to choose the source which maximises the
+information gain on the parameter $\Theta$, so as to speed up learning. We
+suggest choosing the source which maximises the mutual information between
+$\Theta$ and $X$ at each time step. The mutual information can be computed
+node-by-node by sampling:
 \begin{equation*}
 I((x_t)_{t\geq 1}, \Theta | x_0 = i) = \mathbb{E}_{\substack{\Theta \sim D_t \\ x | \Theta, i \sim D'}}\log p(x|\Theta) - \mathbb{E}_{\substack{\Theta \sim D_t \\ x | \Theta, i \sim D'}}\log p(x)
 \end{equation*}
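To make the LASSO-as-Laplace-MAP equivalence from the diff concrete, here is a minimal Python sketch of the column-wise MAP estimation it describes. This is a hypothetical illustration, not the report's code: the sigmoid choice for $f$, the name `map_estimate_column`, the penalty weight `lam`, and the derivative-free Powell solver are all assumptions (the report itself sampled the posterior with PyMC).

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_estimate_column(X_prev, y_next, lam=0.5):
    """MAP estimate of one column theta_i of Theta under an independent
    Laplace prior, i.e. L1-penalized logistic regression.

    X_prev: (T, k) array of infection state vectors x_t
    y_next: (T,) array of node i's state at time t+1
    lam:    Laplace prior scale (hypothetical value)
    """
    k = X_prev.shape[1]
    eps = 1e-12

    def neg_log_posterior(theta):
        p = sigmoid(X_prev @ theta)  # assumes f is the sigmoid
        nll = -np.sum(y_next * np.log(p + eps)
                      + (1.0 - y_next) * np.log(1.0 - p + eps))
        return nll + lam * np.abs(theta).sum()  # Laplace prior = L1 penalty

    # Powell is derivative-free, so the non-smooth L1 term is not an issue.
    return minimize(neg_log_posterior, np.zeros(k), method="Powell").x
```

Because a product prior keeps the posterior decomposable, each column of $\Theta$ can be estimated by an independent call like this; a non-product ERGM prior breaks that decomposition, which is why the report falls back to sampling.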
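The mutual information criterion can be approximated by a nested Monte Carlo average over posterior draws of $\Theta$. The sketch below is a hypothetical illustration: `simulate` and `log_lik` stand in for a cascade simulator and the cascade log-likelihood (neither appears in the diff), and the marginal $p(x)$ is approximated by averaging $p(x \mid \Theta')$ over the same posterior draws.

```python
import numpy as np

def estimate_mutual_information(i, theta_samples, simulate, log_lik):
    """Monte Carlo estimate of I((x_t)_{t>=1}, Theta | x_0 = i).

    theta_samples: posterior draws of Theta (e.g. from MCMC)
    simulate(theta, i): draws one cascade x from source i under theta
    log_lik(x, theta): log p(x | theta) for the cascade model
    """
    total = 0.0
    for theta in theta_samples:
        x = simulate(theta, i)
        # First term: log-likelihood under the theta that generated x.
        ll_gen = log_lik(x, theta)
        # Second term: log of the posterior-averaged marginal p(x),
        # computed with a log-sum-exp shift for numerical stability.
        lls = np.array([log_lik(x, th) for th in theta_samples])
        m = lls.max()
        log_marginal = m + np.log(np.mean(np.exp(lls - m)))
        total += ll_gen - log_marginal
    return total / len(theta_samples)

def best_source(n_nodes, theta_samples, simulate, log_lik):
    """Pick the source maximising the estimated information gain."""
    return max(range(n_nodes),
               key=lambda i: estimate_mutual_information(
                   i, theta_samples, simulate, log_lik))
```

Each outer draw pairs one posterior sample with one simulated cascade, and the inner loop costs a further pass over all samples, so the estimator needs on the order of $M^2$ likelihood evaluations per candidate source; this quadratic cost is one more reason the scalability concerns raised above matter here.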
