author     Thibaut Horel <thibaut.horel@gmail.com>  2015-03-09 13:50:20 -0400
committer  Thibaut Horel <thibaut.horel@gmail.com>  2015-03-09 13:50:33 -0400
commit     843f75943d25f4e180493142b6da0968621b9a78 (patch)
tree       a1c7e5fa8898e663f4009715bd8101ac5696d7c8 /old_work/formalisation.tex
parent     c73f5ffb14020f8997488d1edf6594833fcbbef7 (diff)
download   cascades-843f75943d25f4e180493142b6da0968621b9a78.tar.gz
Big reorganisation of the repo
Diffstat (limited to 'old_work/formalisation.tex')
-rw-r--r--  old_work/formalisation.tex  317
1 file changed, 317 insertions, 0 deletions
diff --git a/old_work/formalisation.tex b/old_work/formalisation.tex
new file mode 100644
index 0000000..c2ba8c0
--- /dev/null
+++ b/old_work/formalisation.tex
@@ -0,0 +1,317 @@
+\documentclass[10pt]{article}
+\usepackage[utf8x]{inputenc}
+\usepackage{fullpage}
+\usepackage{amsmath, amssymb, bbm, amsthm}
+\usepackage[pagebackref=true,breaklinks=true,colorlinks=true,backref=false]{hyperref}
+\usepackage[capitalize, noabbrev]{cleveref}
+\usepackage{microtype}
+\input{defs}
+
+\date{}
+\title{Cracking Cascades: Applying Sparse Recovery to Graph Reconstruction}
+\author{Jean Pouget-Abadie, Thibaut Horel}
+
+\begin{document}
+\maketitle
+
+\section{Preliminaries: What is a measurement?}
+
+There is an unknown graph $G=(V,E)$ that we wish to reconstruct by observing
+information cascades. An information cascade is a random discrete-time process
+where we get to observe the set of vertices infected at each time step.
+
+The key contribution is to formulate this problem as a sparse recovery problem
+akin to those found in the compressed sensing literature. The parallel between
+the two problems is as follows:
+\begin{itemize}
+ \item the unknown \emph{signal} $s$ is a vector in $\{0,1\}^{V\times V}$
+        with $s_{u,v} = 1$ if and only if $(u,v)\in E$.
+ \item the \emph{sensing matrix} is given by the cascades. The \emph{sensing
+ matrix} maps the signal to the observation space. Since the cascades
+ are random time processes, they define a sequence of random
+ observations.
+ \item the \emph{observations} are the set of vertices infected at each time
+ step in each cascade.
+\end{itemize}
+
+Let us define $n\eqdef |V|$. If we assume that all the vertices in $G$ have
+a degree bounded by $d\in\N_+$, then the signal $s$ is $nd$-sparse. If the
+sensing matrix has good random properties, we can hope to recover $s$ from
+$O(nd\log n)$ random observations.
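+
+To fix ideas with purely illustrative numbers: for $n = 10^4$ and $d = 10$,
+the signal has $n^2 = 10^8$ coordinates, of which at most $nd = 10^5$ are
+non-zero, so on the order of $nd\log n \approx 10^6$ observations would
+suffice, rather than the $10^8$ needed without exploiting sparsity.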
+
+Two main goals:
+\begin{itemize}
+    \item understand to what extent the probabilistic cascade models translate
+ into incoherence properties of the sensing matrix and allow for an
+ accurate reconstruction of the unknown graph with a small number of
+ observations.
+ \item compare the empirical performance of this compressed sensing approach
+ to prior graph reconstruction approaches.
+\end{itemize}
+
+\begin{remark}
+\begin{itemize}
+ \item It seems that we will need a \emph{complete} probabilistic model for
+ the cascades. By complete I mean that we need to specify a joint
+ distribution on the source of the cascade (\emph{e.g.} the source is
+ chosen uniformly at random) and on the cascade diffusion process. Such
+ a model is probably not very realistic, but is necessary to obtain
+ theoretical guarantees. The unrealism of the model is not a problem
+ \emph{per se}; the eventual validity of this approach will be confirmed
+ experimentally anyway.
+ \item It seems likely that we will want the rows of the sensing matrix to
+ be independent realizations of the same distribution. However,
+ \textbf{this will never be the case} if we have one row for each time
+ step of each cascade: the rows within the same cascade are obviously
+ highly correlated. We might have to \emph{aggregate} them (for example
+        through concatenation?) in a sensible way so as to have one row per
+        cascade.
+ \item Something which makes this problem very different from the usual
+ compressed sensing problems is that the measurements depend on the
+ signal we wish to recover: cascades propagate along the edges contained
+        in the signal. More precisely, not only are the rows of the sensing
+        matrix correlated with each other (if we do not aggregate them by
+        cascade), they are also correlated with the signal. Is this a problem?
+        I don't know, this looks scary though.
+ \item A slightly more general model which we will need in the independent
+ cascade case considers a signal $s\in[0,1]^{V\times V}$. We can always
+ go from this model to the binary case by first recovering the
+ non-binary $s$ and then applying a thresholding procedure. But if we
+ only care about recovering the support of $s$, can we do better than
+ this two-step process? (TO DO: read stuff about support recovery).
+\end{itemize}
+\end{remark}
+
+Previous stuff:
+\begin{itemize}
+\item Introduction
+\item Simple graphs (cycle, star, path, tree?)
+\item Isotropic condition
+\item RIP condition
+\end{itemize}
+
+One last property we are going to look for in measurements is the {\bf Restricted Isometry Property} (RIP), which was introduced in \cite{???} and is ubiquitous in the compressed sensing literature. Suppose we are given a measurement matrix $M$. For every $T \in \mathbb{N}$, we define $\delta_T$ to be the smallest quantity such that, for any submatrix $M_T$ of $M$ obtained by extracting $T$ columns of $M$, and for any vector $c$ such that $\| c\|_2 = 1$:
+
+$$1 - \delta_T \leq \| M_T c \|_2^2 \leq 1 + \delta_T$$
+
+For small $\delta_T$, the above equation defines a `loose' orthonormality property for the columns of $M$.
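+
+For small matrices, $\delta_T$ can be estimated by brute force over all
+$T$-column submatrices, which is a convenient sanity check when experimenting
+with simulated measurement matrices. The following Python sketch is our own
+illustration (the sizes and the Gaussian example are arbitrary), not part of
+the formal development:
+
+\begin{verbatim}
+import itertools
+import numpy as np
+
+def rip_constant(M, T):
+    """Empirical delta_T: largest deviation of ||M_T c||_2^2 from 1 over
+    all T-column submatrices M_T and all unit vectors c."""
+    delta = 0.0
+    for cols in itertools.combinations(range(M.shape[1]), T):
+        # Extremes of ||M_T c||_2^2 over unit c are the extreme squared
+        # singular values of the submatrix M_T.
+        s = np.linalg.svd(M[:, list(cols)], compute_uv=False)
+        delta = max(delta, abs(s[0] ** 2 - 1), abs(1 - s[-1] ** 2))
+    return delta
+
+# Toy example: a Gaussian matrix with unit-norm columns.
+rng = np.random.default_rng(0)
+M = rng.standard_normal((40, 12))
+M /= np.linalg.norm(M, axis=0)
+print(rip_constant(M, T=3))
+\end{verbatim}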
+
+\section{Warm up: the voter model}
+
+
+In the voter model, there are two types of nodes, {\it red} and {\it blue}. We fix a horizon $T$. At $t=0$, we decide on a set of {\it blue} nodes, called the seeding set. At each time step $t$, each node $u$ chooses one of its neighbors uniformly at random, with probability $1/\deg(u)$, and adopts the color of that neighbor. A cascade in the voter model is the evolution of the set of $blue$ nodes over the time steps $0 \leq t \leq T$.
+
+For simplicity, we suppose that every node has a probability $1/p_\text{init}$ of being part of the seeding set of a cascade. The following analysis can easily be adapted to a non-uniform probability vector. We also suppose that the graph includes self-loops, meaning that a node has the option to keep its color for the next round. This is a common assumption which does not change any of the following results.
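+
+For concreteness, here is a minimal Python sketch of one voter-model cascade;
+the adjacency-list representation, the parameter names and the toy graph are
+our own choices for illustration and are not part of the note:
+
+\begin{verbatim}
+import numpy as np
+
+def voter_cascade(adj, p_init, T, rng):
+    """One voter-model cascade on a graph given by an adjacency list
+    `adj` (self-loops included).  Returns the blue-indicator vectors
+    M^0, ..., M^T."""
+    n = len(adj)
+    blue = (rng.random(n) < 1.0 / p_init).astype(int)  # seeding set
+    history = [blue.copy()]
+    for _ in range(T):
+        new_blue = np.empty(n, dtype=int)
+        for u in range(n):
+            v = rng.choice(adj[u])   # uniform neighbor, prob. 1/deg(u)
+            new_blue[u] = blue[v]    # adopt that neighbor's color
+        blue = new_blue
+        history.append(blue.copy())
+    return history
+
+# Toy usage on a 4-cycle with self-loops.
+rng = np.random.default_rng(1)
+adj = [[0, 1, 3], [1, 0, 2], [2, 1, 3], [3, 2, 0]]
+print(voter_cascade(adj, p_init=2, T=3, rng=rng))
+\end{verbatim}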
+
+\paragraph{Notation}
+Denote by $(M^t_k)_{1 \leq k \leq m; 1 \leq t \leq T}$ the indicator vectors of the $blue$ nodes in cascade $k$ at time step $t$, where $m$ is the number of observed cascades. In other words, $M^t_k$ is a vector in $\mathbb{R}^n$ such that:
+
+\[ (M^t_k)_j = \left\{
+ \begin{array}{l l}
+ 1 & \quad \text{if $j$ is $blue$ in cascade $k$ at time step $t$}\\
+ 0 & \quad \text{otherwise}
+ \end{array} \right.\]
+
+Consider a node $i$, and let $\theta_i$ be the vector representing the neighbors of $i$ in the graph $\mathcal{G}$. In other words, $\theta_i$ is a vector in $\mathbb{R}^n$ such that:
+
+\[ \theta_{ij} = \left\{
+ \begin{array}{l l}
+    1/\deg(i) & \quad \text{if $j$ is a parent of $i$ in $\mathcal{G}$}\\
+ 0 & \quad \text{otherwise}
+ \end{array} \right.\]
+
+Let $b^t_k$ be the vector whose $i$-th entry is the probability that node $i$ stays or becomes $blue$ in cascade $k$ at time step $t$:
+
+\[ (b^t_k)_i = \left\{
+    \begin{array}{l l}
+    (\text{number of $blue$ neighbors of $i$ in cascade $k$ at time $t-1$})/\deg(i) & \quad \text{if $t\neq 0$}\\
+    1/p_{\text{init}} & \quad \text{otherwise}
+  \end{array} \right.\]
+
+Our objective is to find the neighbors of $i$ in $\mathcal{G}$ from $(M^t_k)$, i.e.\ to recover the vector $\theta_i$. Notice that for each cascade $k$ and each $t < T$, we have:
+
+$$M^t_k \cdot \theta_i = (b^{t+1}_k)_i$$
+
+We can stack these equalities in matrix form: let $M$ be the matrix whose rows are the observed indicator vectors $M^t_k$ and let $\vec b$ be the vector whose entries are the corresponding probabilities $(b^{t+1}_k)_i$:
+
+$$M \cdot \theta_i = \vec b$$
+
+Note that if $M$ had full column rank, then we could recover $\theta_i$ from $\vec b$. This is however an unreasonable assumption, even after having observed many cascades. We must explore which assumptions on $M$ allow us to recover $\theta_i$ from a small number of cascades. Further note that we do not directly observe $\vec b$. Instead, we observe a Bernoulli vector whose $j^{\text{th}}$ entry is equal to $1$ with probability $b_j$ and $0$ otherwise. We will denote this vector $\vec w$, and denote by $\vec \epsilon$ the noise vector such that $\vec w = \vec b + \vec \epsilon$. We can therefore reformulate the problem with our observables:
+
+$$M \cdot \theta_i + \vec \epsilon = \vec w$$
+
+We hope to show that we can exploit the sparsity of vector $\theta_i$ to apply common sparse recovery techniques.
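+
+As a first illustration of this reduction, the sketch below feeds synthetic
+data of this form to an off-the-shelf $\ell_1$-regularized least-squares
+solver (scikit-learn's \texttt{Lasso}). This is only meant to show the
+plumbing; the data, the regularization level and the final thresholding are
+hypothetical choices, not a method analyzed in this note:
+
+\begin{verbatim}
+import numpy as np
+from sklearn.linear_model import Lasso
+
+def recover_theta(M, w, alpha=0.01):
+    """Sketch: estimate a sparse, non-negative theta_i from Bernoulli
+    observations w with mean b = M theta_i, via l1-regularized least
+    squares."""
+    lasso = Lasso(alpha=alpha, positive=True, fit_intercept=False)
+    lasso.fit(M, w)
+    return lasso.coef_
+
+# Hypothetical toy data: 200 rows, 50 candidate parents, 3 true parents.
+rng = np.random.default_rng(2)
+M = rng.integers(0, 2, size=(200, 50)).astype(float)
+theta_true = np.zeros(50)
+theta_true[[3, 17, 42]] = 1 / 3
+b = M @ theta_true
+w = rng.binomial(1, np.clip(b, 0, 1)).astype(float)  # observed colors
+print(np.nonzero(recover_theta(M, w) > 0.05)[0])     # estimated parents
+\end{verbatim}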
+
+\section{Independent Cascade Model}
+
+\subsection{The Model}
+
+Recall the independent cascade model, where nodes have 3 states: susceptible,
+active, and inactive. All nodes start either as susceptible or active. At each
+time step $t$, for every (active, susceptible)-pair $(j,i)$, $j$ attempts to
+infect $i$ with probability $p_{j,i}$ and then becomes inactive. If $j$ succeeds, $i$
+will become active at time step $t+1$. The cascade continues until no active
+nodes remain.
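+
+A minimal simulation sketch of this process is given below; the matrix
+representation of the $p_{j,i}$'s, the function name and the toy example are
+our own illustrative choices:
+
+\begin{verbatim}
+import numpy as np
+
+def independent_cascade(P, seeds, rng):
+    """One independent-cascade run.  P[j, i] is the infection probability
+    p_{j,i}; `seeds` is the initially active set.  Returns the list of
+    active sets A^0, A^1, ..."""
+    n = P.shape[0]
+    susceptible = np.ones(n, dtype=bool)
+    susceptible[list(seeds)] = False
+    active = set(seeds)
+    history = [set(active)]
+    while active:
+        newly_active = set()
+        for j in active:                          # each active node makes
+            for i in np.nonzero(susceptible)[0]:  # one attempt, then
+                if rng.random() < P[j, i]:        # becomes inactive
+                    newly_active.add(i)
+        susceptible[list(newly_active)] = False
+        active = newly_active
+        history.append(set(active))
+    return history
+
+# Toy usage: 5 nodes, a hypothetical probability matrix, seed node 0.
+rng = np.random.default_rng(3)
+P = 0.3 * rng.random((5, 5))
+np.fill_diagonal(P, 0.0)
+print(independent_cascade(P, seeds={0}, rng=rng))
+\end{verbatim}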
+
+Consider a node $i$. If $p_{j,i}$ is the probability that node $j$ infects $i$
+at the next step, let $\theta_{j,i} := \log (1 - p_{j,i})$. Note that
+$\theta_{j,i} = 0 \Leftrightarrow p_{j,i} = 0$. Also note that:
+
+$$\sum_{j\in S} \theta_{j,i} = \log \prod_{j \in S} (1 - p_{j,i}) = \log \mathbb{P}(\text{$i$ not infected by any } j \in S)$$
+
+Our objective is to recover the vector $\theta_{*,i}$ for every node $i$.
+
+\subsection{Formulating the sparse recovery problem}
+
+The idea behind what follows is the assumption that $p_{*,i}$ and therefore $\theta_{*,i}$ are sparse vectors: a node has few neighbors (parents in the directed graph) compared to the size of the graph.
+
+Let $\mathbbm{1}_S$ be the indicator vector of the set $S$ of nodes that are active at
+a certain time step $t$ of a certain cascade. Under the previous assumption, we
+have:
+$$\inprod{\mathbbm{1}_S, \theta_{*,i}} = \log \mathbb{P}( \text{i not infected at } t+1| S \text{ active at } t)$$
+
+Suppose we observe a series of cascades $(A_k^t)$, where $A_k^t$ are the nodes
+active in cascade $k$ at time $t$ (these are exactly the nodes which became
+infected at time $t-1$). Recall that once a node has been infected (becomes
+active), it cannot be infected in the future (is inactive). Let us therefore
+look, for every cascade $k$, at all the sets $A_k^t$ with $t < t_i^k$, where
+$t_i^k$ is the time at which $i$ becomes active in cascade $k$ (with $t_i^k = + \infty$ if $i$ was not infected in that cascade).
+If we stack the corresponding rows $\mathbbm{1}_{A_k^t}$ into a matrix $M$ (for
+measurement), we have:
+
+$$M \theta_{*,i} = b$$
+where $b$ is a column vector whose entries are the corresponding quantities $\log \mathbb{P}(\text{$i$ was not infected at } t+1| S \text{ active at } t)$. Note that even if we had access to $b$ directly, if $M$ does not have full column rank then solving for $\theta_{*,i}$ is not trivial! However, under sparsity assumptions on $\theta_{*,i}$, we can apply results from the sparse recovery/sparse coding/compressed sensing literature.
+
+In reality, we do not have access to $b$. We have access to the realization vector $\vec w$ of the events $\{\text{$i$ was infected at } t+1 \mid S \text{ active at } t\}$:
+
+$$\vec w\sim B(1 - e^{M \theta_{*,i}})$$
+
+where the operation $f: \vec v \rightarrow 1 - e^{\vec v}$ is applied element-wise and $B(\vec p)$ is a vector whose entries are Bernoulli variables with parameters given by the entries of $\vec p$.
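+
+For concreteness, the sketch below shows how the rows of $M$ and the entries
+of $\vec w$ could be assembled for a fixed node $i$ from simulated cascades.
+It assumes each cascade is given as a list of active sets $A^0, A^1, \dots$,
+as produced by the simulator sketched earlier; this layout is our own
+assumption:
+
+\begin{verbatim}
+import numpy as np
+
+def build_measurements(cascades, i, n):
+    """Assemble the measurement matrix M and observation vector w for
+    node i.  Rows are indicators of the active sets A_k^t observed while
+    i was still susceptible; w records whether i was infected at t+1."""
+    rows, obs = [], []
+    for history in cascades:
+        for t in range(len(history) - 1):
+            A_t = history[t]
+            if not A_t:
+                break
+            if any(i in past for past in history[:t + 1]):
+                break                        # i already active: stop
+            row = np.zeros(n)
+            row[list(A_t)] = 1.0             # indicator of the active set
+            rows.append(row)
+            obs.append(1.0 if i in history[t + 1] else 0.0)
+    return np.array(rows), np.array(obs)
+\end{verbatim}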
+
+\subsection{Maximum likelihood problem}
+
+The maximum likelihood problem takes the following form:
+\begin{displaymath}
+ \begin{split}
+    \max_\theta &\sum_{k,t}\left(w_{k,t}\log\left(1-\exp\inprod{A_k^t, \theta}\right)
+    + (1-w_{k,t})\log\exp\inprod{A_k^t,\theta}\right)\\
+    \text{s.t. }& \|\theta\|_1 \leq \kappa
+ \end{split}
+\end{displaymath}
+which we can rewrite as:
+\begin{displaymath}
+ \begin{split}
+    \max_\theta &\sum_{k,t}w_{k,t}\log\left(\exp\left(-\inprod{A_k^t, \theta}\right)-1\right)
+    + \sum_{k,t}\inprod{A_k^t,\theta}\\
+    \text{s.t. }& \|\theta\|_1 \leq \kappa
+ \end{split}
+\end{displaymath}
+In this form it is easy to see that this is a convex program: the objective
+function is the sum of a linear function and a concave function.
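+
+As an illustration, this program can be handed directly to a generic convex
+solver. The sketch below uses \texttt{cvxpy} on hypothetical inputs (the
+function name and the variable \texttt{budget}, which plays the role of the
+$\ell_1$ bound, are ours); it relies on the solver's composition rules
+recognizing $\log(1 - \exp(\cdot))$ of an affine expression as concave:
+
+\begin{verbatim}
+import cvxpy as cp
+
+def mle_theta(A, w, budget):
+    """A: (#measurements, n) 0/1 matrix of active sets; w: 0/1 outcomes
+    (1 if i was infected at the next step); budget: the l1 bound.
+    Returns an estimate of theta_{*,i}, whose entries are
+    log(1 - p_{j,i}) <= 0."""
+    n = A.shape[1]
+    theta = cp.Variable(n, nonpos=True)      # theta = log(1 - p) <= 0
+    x = A @ theta
+    # log-likelihood: w * log(1 - exp(x)) + (1 - w) * x
+    loglik = cp.sum(cp.multiply(w, cp.log(1 - cp.exp(x)))
+                    + cp.multiply(1 - w, x))
+    prob = cp.Problem(cp.Maximize(loglik), [cp.norm1(theta) <= budget])
+    prob.solve()
+    return theta.value
+
+# Hypothetical usage with M, w assembled as in the previous subsection:
+# theta_hat = mle_theta(M, w, budget=5.0)
+\end{verbatim}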
+
+\paragraph{Questions} To check: is the program also convex in the $p_{j,i}$'s?
+What is the theory of sparse recovery for maximum likelihood problems? (TO DO:
+read about Bregman divergence, I think it could be related).
+
+\subsection{Results under strong assumptions}
+
+What can we hope for, etc. (if $M$ has full column rank, or if the same sets $S$ occur frequently).
+
+\subsection{Result under RIP condition}
+
+For ease of notation, let $\theta := \theta_{*,i}$. The following resolution of the above problem borrows heavily from \cite{2005math......3066C}. We can see $w$ as a bounded perturbation of $f(M\theta)$:
+$$w = f(M\theta) + \epsilon, \text{ where } \|\epsilon\|_{\infty} < 1$$
+
+Define the following program, which we can solve with all available information $M$ and $w$:
+\begin{equation}
+(N_1) \quad \min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|e^{M\theta} - (1 - w) \|_2 \leq \|\epsilon \|_\infty \nonumber
+\end{equation}
+
+While we can hope for the columns of $M$ to be `loosely' orthogonal, we must normalize them if we hope for $\delta_T$ to be small. Let $M'$ be $M$ with normalized columns; we therefore define $\epsilon'$ such that:
+$$w = f(M'\theta) + \epsilon'$$
+We see that $\epsilon'= w - (w - \epsilon)^{\vec \lambda}$ (element-wise operation), where $\lambda_i = 1/\sqrt{\sum \text{col}(M)_i}$, so that $\|\epsilon'\|_{\infty} < 1$. We can define the following program:
+
+\begin{equation}
+(N_2) \quad \min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|e^{M'\theta} - (1 - w) \|_2 \leq \|\epsilon' \|_\infty \nonumber
+\end{equation}
+
+With the corresponding $\delta_T'$ defined for $M'$, we are going to prove the following:
+
+\begin{theorem}
+
+Suppose that all nodes in our graph have {\bf degree bounded by $S$}, that no edge between $i$ and its neighbors has probability $1$, in the sense that $\forall j,\ p_{j,i} < 1 - \eta$ where $0<\eta<1$, and that $\delta'_{3S} + 3 \delta'_{4S} < 2$. If $\theta_0$ is the true vector for node $i$, the solution $\hat \theta$ to $(N_2)$ obeys:
+$$\|\hat \theta - \theta_0 \|_2 \leq \frac{C_S}{\eta} \|\epsilon' \|_{\infty} $$ where $C_S$ is well behaved ($\approx 10$ for reasonable values of $\delta'_{4S}$).
+\end{theorem}
+
+
+\begin{proof}
+The proof borrows heavily from \cite{2005math......3066C}. Define $h := \hat \theta - \theta_0$. We list the results as they are given in \cite{2005math......3066C} or as they can be adapted to our case.
+
+\begin{itemize}
+\item $\|e^{M'\hat \theta} - e^{M' \theta_0} \|_2 \leq \| e^{M'\hat \theta} - ( 1 - w) \|_2 + \| e^{M'\theta_0} - (1 - w) \|_2 \leq 2 \|\epsilon' \|_{\infty} $
+\item $\|h\|_2^2 \leq (1 + |T_0|/N) \|h_{T_{01}}\|_2^2$, where $N$ is an arbitrary positive integer, and $h_{T_{01}}$ is the vector $h$ with all entries set to 0, except those corresponding to the non-zero entries of $\theta_0$ (whose support we denote $T_0$) and the $N$ largest entries of $h_{T_0^c}$.
+\item $\|e^{M'\hat \theta} - e^{M'\theta_0} \|_2 \geq \eta \|M'h\|_2 \geq \eta\, C_{T_0, N} \cdot \|h_{T_{01}} \|_2$, where $C_{T_0,N} := \sqrt{1 - \delta'_{N + |T_0|}} - \sqrt{|T_0| / N} \sqrt{1 + \delta'_N}$
+\end{itemize}
+The first bullet point is obtained by the triangle inequality. The second bullet point is identical to the proof in \cite{2005math......3066C}. The first inequality in the third bullet point holds because, by assumption, all entries of $M'\theta$ are greater than $\ln \eta$ (remember that the $e^{\vec v}$ operation is done element-wise). The rest follows as in \cite{2005math......3066C}, and we obtain that:
+$$\|h\|_2 \leq 2 \frac{\sqrt{1 + T_0/N}}{ \eta \cdot C_{T_0, N}} \|\epsilon'\|_{\infty} = \frac{C_S}{\eta} \|\epsilon'\|_{\infty}$$
+For $\delta'_{4S} \leq \frac{1}{5}$, $C_S \approx 8.82$; for $\delta'_{4S} \leq \frac{1}{4}$, $C_S \approx 10.47$.
+\end{proof}
+
+\begin{remark} A few notes on the result:
+\begin{itemize}
+\item With this naive reduction of $w$ to an $\epsilon$-perturbation of $f(b)$, the best $\|\epsilon\|_{\infty}$ we can hope for is $1 - \eta$. Seeing $w$ as more than an arbitrary perturbation will most likely yield a much better upper bound.
+\item The upper bound is in $\log$-space for the true probabilities:
+$$\frac{C_S}{\eta} > \sum \left[ \log \left(\frac{1 - p_j}{1 - p_j'} \right) \right]^2 > \frac{\eta \log(1/\eta)}{1 - \eta} \| \frac{1 - p_j}{1 - p'_j} - 1 \| $$
+
+This bullet-point needs to be finished
+\item To what extent can we approximate our measurement matrix by a matrix with independent Bernoulli entries? What $\delta_T$'s can we hope for? Which constant $C_S$ can we hope for? How does it change with the rows of $M$?
+\end{itemize}
+
+\end{remark}
+
+
+
+\subsection{Results under isotropic condition}
+
+
+
+% Given $M$ and $w$, we can solve the following maximum log-likelihood problem:
+% \begin{equation}
+% \max_{\|\theta_{*,i}\|_0 \leq k} <w,\log f(M \theta_{*,i}) + <1 - w, M\theta_{*,i}>
+% \label{eq:max_loglik_obj}
+% \end{equation}
+
+% Since optimizing under the $\|\theta\|_0 \leq k$ is hard, we solve instead the Lagrangian relaxation:
+
+% \begin{equation}
+% \max_{\theta_{*,i}} <w,\log f(M \theta_{*,i}) + <1 - w, M\theta_{*,i}> + \lambda \|\theta_{*,i}\|_0
+% \label{eq:lagra_max_loglik_obj}
+% \end{equation}
+
+
+% Eq~\ref{eq:lagra_max_loglik_obj} is concave in $\theta_{*,i}$ and can therefore be solved efficiently. We would like to exploit the sparsity of $\theta_{*,i}$ to show that we recover the true sparse $\theta_{*,i}$ with few measurements (rows of M).
+
+
+\bibliography{sparse}{}
+\bibliographystyle{plain}
+
+\end{document}
+
+
+
+
+% Let us now consider the matrix M, whose rows are the vectors $\mathbbm{1}_S$. If we could estimate the vector $X := \mathbb{P}(i \text{ not infected at } t | S \text{ active at } t)$, we could solve for $\theta$:
+% $$M \theta_{*i} = X, \text{ where }\theta_{*,i} \text{ is a sparse vector}$$
+
+% There are many powerful tools from sparse coding theory which would allow us to recover the sparsest $\theta$ under certain assumptions on $M$. One such assumption is incoherence of the columns of $M$. In other words, if we can show that for all columns $M_i, M_j$ of M, we have $|<M_i,M_s>| \leq \mu \|M_i\| \|M_j\|$, i.e M is $\mu-incoherent$, then for any node $i$ with fewer than $k = \frac{1}{2 \mu}$ parents, we can recover $\theta_{*i}$ exactly. To get a better intuition of what this means, if the entries of $M$ were independent gaussian variables, we would need approximately $k \ln(\frac{n}{k})$ measurements to recover $\theta_{*i}$ exactly.
+
+% This would seem to suggest that the quality of observed cascades is much more important than quantity--a notion which has not truly been explored in graph reconstruction as of yet. Another stronger assumption on the columns of M is separability (see Moitra's work in Topic Modeling) leading to even stronger results.
+
+% In reality of course, we do not know $X$. Instead, we have access to $\hat X$ which estimates $X$ in the following way. For any set S in any cascade that was active before i was, we obtain a binary signal, $\hat X_S:=0$ if $i$ was infected immediately afterwards and $\hat X_S:=1$ otherwise. As the number of times set S was observed active, we have $$\frac{1}{m} \sum \hat X_S \rightarrow \mathbb{P}(i \text{ not infected at }t+1 | S \text{ active at } t)$$.
+
+% Certain sets might have been infected many times in our set of cascades, certain sets might have been infected only once. One possible way to rewrite the program above is: $$\hat X_j = H(w_j - e^{(M \theta_{*i})_j}), \text{ where } w\sim {\cal U}[0,1] \text{ and H is the step function } \mathbbm{1}_{\cdot \geq 0}$$
+
+% This problem seems to be closely related to the growing field of 1-bit compressed sensing, with at least one difference: when scientists consider 1bitCS with noise, they are usually in a position to decide which measurement to do next to recover $\theta_{*i}$, whereas in this framework, we are limited to an observer status\footnote{Research in the other direction could be very interesting: see discussion about exploration-exploitation with Thibaut}.
+
+% I need to do more reading on the subject, namely to understand what guarantees people have found for sample complexity.
+