Diffstat (limited to 'notes/formalisation.tex')
 -rw-r--r--   notes/formalisation.tex | 317
 1 file changed, 0 insertions(+), 317 deletions(-)
diff --git a/notes/formalisation.tex b/notes/formalisation.tex
deleted file mode 100644
index c2ba8c0..0000000
--- a/notes/formalisation.tex
+++ /dev/null

\documentclass[10pt]{article}
\usepackage[utf8x]{inputenc}
\usepackage{fullpage}
\usepackage{amsmath, amssymb, bbm, amsthm}
\usepackage[pagebackref=true,breaklinks=true,colorlinks=true,backref=false]{hyperref}
\usepackage[capitalize, noabbrev]{cleveref}
\usepackage{microtype}
\input{defs}

\date{}
\title{Cracking Cascades: Applying Sparse Recovery to Graph Reconstruction}
\author{Jean Pouget-Abadie, Thibaut Horel}

\begin{document}
\maketitle

\section{Preliminaries: What is a measurement?}

There is an unknown graph $G=(V,E)$ that we wish to reconstruct by observing
information cascades. An information cascade is a random discrete-time process
in which we observe the set of vertices infected at each time step.

The key contribution is to formulate this problem as a sparse recovery problem
akin to those found in the compressed sensing literature. The parallel between
the two problems is as follows:
\begin{itemize}
    \item the unknown \emph{signal} $s$ is a vector in $\{0,1\}^{V\times V}$
        with $s_{u,v} = 1$ if and only if $(u,v)\in E$.
    \item the \emph{sensing matrix} is given by the cascades. The sensing
        matrix maps the signal to the observation space. Since the cascades
        are random time processes, they define a sequence of random
        observations.
    \item the \emph{observations} are the sets of vertices infected at each
        time step of each cascade.
\end{itemize}

Let us define $n\eqdef |V|$. If we assume that all the vertices of $G$ have
degree bounded by $d\in\N_+$, then the signal $s$ is $nd$-sparse. Assuming
that the sensing matrix has good random properties, we can hope to recover $s$
from $O(nd\log n)$ random observations.
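To make the correspondence concrete, here is a minimal sketch (in Python, on a
hypothetical toy graph; none of the names or values below are part of the
formal development) that flattens the edge set of a bounded-degree directed
graph into the signal $s\in\{0,1\}^{V\times V}$ and checks its sparsity:
\begin{verbatim}
# Illustrative sketch only: maps a small directed graph with degree bound d
# to the signal vector s in {0,1}^{V x V} described above.
import numpy as np

n, d = 6, 1                     # hypothetical sizes
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]  # directed cycle

s = np.zeros(n * n, dtype=int)  # s_{u,v} = 1 iff (u,v) is an edge
for (u, v) in edges:
    s[u * n + v] = 1

assert s.sum() <= n * d         # the signal is (at most) nd-sparse
print("signal length:", s.size, "non-zeros:", int(s.sum()))
\end{verbatim}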
Two main goals:
\begin{itemize}
    \item understand to which extent the probabilistic cascade models translate
        into incoherence properties of the sensing matrix and allow for an
        accurate reconstruction of the unknown graph from a small number of
        observations.
    \item compare the empirical performance of this compressed sensing approach
        to prior graph reconstruction approaches.
\end{itemize}

\begin{remark}
\begin{itemize}
    \item It seems that we will need a \emph{complete} probabilistic model for
        the cascades. By complete I mean that we need to specify a joint
        distribution on the source of the cascade (\emph{e.g.} the source is
        chosen uniformly at random) and on the cascade diffusion process. Such
        a model is probably not very realistic, but it is necessary to obtain
        theoretical guarantees. The unrealism of the model is not a problem
        \emph{per se}; the eventual validity of this approach will be confirmed
        experimentally anyway.
    \item It seems likely that we will want the rows of the sensing matrix to
        be independent realizations of the same distribution. However,
        \textbf{this will never be the case} if we have one row for each time
        step of each cascade: the rows within the same cascade are obviously
        highly correlated. We might have to \emph{aggregate} them (for example
        through concatenation?) in a sensible way so as to have one row per
        cascade.
    \item Something which makes this problem very different from the usual
        compressed sensing problems is that the measurements depend on the
        signal we wish to recover: cascades propagate along the edges contained
        in the signal. More precisely, not only are the rows of the sensing
        matrix correlated with each other (if we do not aggregate them by
        cascade), they are also correlated with the signal itself. Is this a
        problem? I don't know; this looks scary though.
    \item A slightly more general model, which we will need in the independent
        cascade case, considers a signal $s\in[0,1]^{V\times V}$. We can always
        go from this model to the binary case by first recovering the
        non-binary $s$ and then applying a thresholding procedure. But if we
        only care about recovering the support of $s$, can we do better than
        this two-step process? (TO DO: read about support recovery.)
\end{itemize}
\end{remark}

Previous stuff:
\begin{itemize}
\item Introduction
\item Simple graphs (cycle, star, path, tree?)
\item Isotropic condition
\item RIP condition
\end{itemize}

One last property we are going to look for in our measurements is the
\textbf{Restricted Isometry Property} (RIP), which was introduced in \cite{???}
and is ubiquitous in the compressed sensing literature. Suppose we are given
a measurement matrix $M$. For every $T \in \mathbb{N}$, we define $\delta_T$ to
be the smallest quantity such that, for any submatrix $M_T$ of $M$ obtained by
extracting $T$ columns of $M$, and for any vector $c$ with $\| c\|_2 = 1$:
$$1 - \delta_T \leq \| M_T c \|_2^2 \leq 1 + \delta_T$$
For small $\delta_T$, the above equation defines a ``loose'' orthonormality
property for the columns of $M$.
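The definition can be checked numerically on small instances. The following
sketch (illustrative only: the Gaussian matrix and all parameter values are
hypothetical stand-ins, not one of our cascade matrices) estimates $\delta_T$
by enumerating every set of $T$ columns of a column-normalized matrix:
\begin{verbatim}
# Brute-force estimate of the RIP constant delta_T for a small matrix M
# with unit-norm columns; only feasible for tiny T and n.
import itertools
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 40, 12, 3                      # hypothetical dimensions
M = rng.standard_normal((m, n))
M /= np.linalg.norm(M, axis=0)           # normalize the columns

delta_T = 0.0
for cols in itertools.combinations(range(n), T):
    sv = np.linalg.svd(M[:, list(cols)], compute_uv=False)
    # largest deviation of ||M_T c||_2^2 from 1 over unit vectors c
    delta_T = max(delta_T, abs(sv.max() ** 2 - 1), abs(1 - sv.min() ** 2))

print("estimated delta_T:", round(delta_T, 3))
\end{verbatim}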
\section{Warm up: the voter model}

In the voter model, there are two types of nodes, \emph{red} and \emph{blue}.
We fix a horizon $T$. At $t=0$, we decide on a set of \emph{blue} nodes, called
the seeding set. At each time step $t$, each node $u$ chooses one of its
neighbors uniformly at random (each with probability $1/\deg(u)$) and adopts
the color of that neighbor. A cascade in the voter model is the evolution of
the set of \emph{blue} nodes over the time steps $0 \leq t \leq T$.

For simplicity, we suppose that every node has probability $1/p_\text{init}$ of
being part of the seeding set of a cascade. The following analysis is easily
adapted to a non-uniform probability vector. We also suppose that the graph
includes self-loops, meaning that a node has the option to keep its color for
the next round. This is a common assumption which does not change any of the
following results.

\paragraph{Notation}
Denote by $(M^t_k)_{1 \leq k \leq m;\, 0 \leq t \leq T}$ the indicator vectors
of the blue nodes in cascade $k$ at time step $t$. In other words, $M^t_k$ is
a vector in $\mathbb{R}^n$ such that:
\[ (M^t_k)_j = \left\{
  \begin{array}{l l}
    1 & \quad \text{if $j$ is blue in cascade $k$ at time step $t$}\\
    0 & \quad \text{otherwise}
  \end{array} \right.\]

Consider a node $i$, and let $\theta_i$ be the vector representing the
neighbors of $i$ in the graph $\mathcal{G}$. In other words, $\theta_i$ is
a vector in $\mathbb{R}^n$ such that:
\[ \theta_{ij} = \left\{
  \begin{array}{l l}
    1/\deg(i) & \quad \text{if $j$ is a parent of $i$ in $\mathcal{G}$}\\
    0 & \quad \text{otherwise}
  \end{array} \right.\]

Let $b^t_k$ be the probability that node $i$ stays or becomes blue in cascade
$k$ at time step $t$:
\[ b^t_k = \left\{
  \begin{array}{l l}
    (\text{number of blue neighbors of $i$ in cascade $k$ at time $t-1$})/\deg(i) & \quad \text{if $t\neq 0$}\\
    1/p_{\text{init}} & \quad \text{otherwise}
  \end{array} \right.\]

Our objective is to find the neighbors of $i$ in $\mathcal{G}$ from $(M^t_k)$,
i.e.\ to recover the vector $\theta_i$. Notice that for every cascade $k$ and
every time step $t \leq T-1$:
$$M^t_k \cdot \theta_i = b^{t+1}_k$$

We can stack these equalities into a matrix $M$ whose rows are the observed
indicator vectors $M^t_k$, with the corresponding probabilities collected in
the vector $\vec b$:
$$M \cdot \theta_i = \vec b$$

Note that if $M$ had full column rank, we could recover $\theta_i$ from
$\vec b$. This is however an unreasonable assumption, even after having
observed many cascades. We must explore which assumptions on $M$ allow us to
recover $\theta_i$ from a small number of cascades. Further note that we do not
directly observe $\vec b$. Instead, we observe a Bernoulli vector whose $j$-th
entry equals $1$ with probability $b_j$ and $0$ otherwise. We denote this
vector by $\vec w$, and denote by $\vec \epsilon$ the noise vector such that
$\vec w = \vec b + \vec \epsilon$. We can therefore reformulate the problem in
terms of our observables:
$$M \cdot \theta_i + \vec \epsilon = \vec w$$

We hope to show that we can exploit the sparsity of the vector $\theta_i$ to
apply common sparse recovery techniques.
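A minimal simulation sketch, under the assumptions above (self-loops, uniform
seeding); the graph, the parameter values, and the variable names are all
hypothetical. It generates voter-model cascades and stacks, for a target node
$i$, the rows $M^t_k$ together with the corresponding Bernoulli outcomes
$\vec w$:
\begin{verbatim}
# Sketch: simulate voter-model cascades on a small graph with self-loops and
# assemble, for a target node i, the measurement rows M_k^t and outcomes w.
import numpy as np

rng = np.random.default_rng(1)
n, num_cascades, T, p_seed = 8, 20, 5, 0.3      # hypothetical parameters
# neighbors of each node (here: a cycle plus a self-loop at every node)
nbrs = [[u, (u - 1) % n, (u + 1) % n] for u in range(n)]

i = 0                                   # target node whose parents we want
rows, w = [], []
for _ in range(num_cascades):
    blue = rng.random(n) < p_seed       # seeding set at t = 0
    for t in range(T):
        rows.append(blue.astype(float))             # row M_k^t
        # every node adopts the color of a uniformly chosen neighbor
        new_blue = np.array([blue[rng.choice(nbrs[u])] for u in range(n)])
        w.append(float(new_blue[i]))    # is i blue at time t + 1?
        blue = new_blue

M, w = np.array(rows), np.array(w)
print(M.shape, w.shape)                 # rows of M pair with entries of w
\end{verbatim}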
\section{Independent Cascade Model}

\subsection{The Model}

Recall the independent cascade model, where nodes have three states:
susceptible, active, and inactive. All nodes start either susceptible or
active. At each time step $t$, for every (active, susceptible) pair $(j,i)$,
$j$ attempts to infect $i$ with probability $p_{j,i}$ and then becomes
inactive. If $j$ succeeds, $i$ becomes active at time step $t+1$. The cascade
continues until no active nodes remain.

Consider a node $i$. If $p_{j,i}$ is the probability that node $j$ infects $i$
at the next step, let $\theta_{j,i} := \log (1 - p_{j,i})$. Note that
$\theta_{j,i} = 0 \Leftrightarrow p_{j,i} = 0$. Also note that:
$$\sum_{j\in S} \theta_{j,i} = \log \prod_{j \in S} (1 - p_{j,i}) = \log \mathbb{P}(i \text{ not infected by any } j \in S)$$

Our objective is to recover the vector $\theta_{*,i}$ for every node $i$.

\subsection{Formulating the sparse recovery problem}

The idea behind what follows is the assumption that $p_{*,i}$, and therefore
$\theta_{*,i}$, is a sparse vector: a node has few neighbors (parents in the
directed graph) compared to the size of the graph.

Let $\mathbbm{1}_S$ be the indicator vector of the nodes that are active at
a given time step $t$ of a given cascade. Under the previous assumption, we
have:
$$\inprod{\mathbbm{1}_S, \theta_{*,i}} = \log \mathbb{P}( i \text{ not infected at } t+1 \mid S \text{ active at } t)$$

Suppose we observe a series of cascades $(A_k^t)$, where $A_k^t$ is the set of
nodes active in cascade $k$ at time $t$ (these are exactly the nodes which
became infected at time $t-1$). Recall that once a node has been infected
(becomes active), it cannot be infected again in the future (it becomes
inactive). Let us therefore consider, for every cascade $k$, the sets $A_k^t$
with $t < t_i^k$, where $t_i^k$ is the time at which $i$ was infected in
cascade $k$ ($t_i^k = + \infty$ if $i$ was not infected in that cascade). If we
stack the corresponding rows $\mathbbm{1}_{A_k^t}$ into a matrix $M$ (for
measurement), we have:
$$M \theta_{*,i} = b$$
where $b$ is the column vector whose entries are
$\log \mathbb{P}(i \text{ not infected at } t+1 \mid S \text{ active at } t)$.
Note that even if we had access to $b$ directly, solving for $\theta_{*,i}$ is
not trivial when $M$ does not have full column rank. However, under sparsity
assumptions on $\theta_{*,i}$, we can apply results from the sparse
recovery/sparse coding/compressed sensing literature.

In reality, we do not have access to $b$. We have access to the realization
vector $\vec w$ of the events
$\{i \text{ was infected at } t+1 \mid S \text{ active at } t\}$:
$$\vec w\sim B(1 - e^{M \theta_{*,i}})$$
where the map $f: \vec v \mapsto 1 - e^{\vec v}$ is applied element-wise and
$B(\vec p)$ is a vector whose entries are Bernoulli variables with parameters
the entries of $\vec p$.

\subsection{Maximum likelihood problem}

The maximum likelihood problem takes the following form:
\begin{displaymath}
  \begin{split}
    \max_\theta &\sum_{k,t}\left(w_{k,t}\log\left(1-\exp\inprod{A_k^t, \theta}\right)
    + (1-w_{k,t})\log\exp\inprod{A_k^t,\theta}\right)\\
    \text{s.t. }& \|\theta\|_1 \leq k
  \end{split}
\end{displaymath}
which we can rewrite as:
\begin{displaymath}
  \begin{split}
    \max_\theta &\sum_{k,t}w_{k,t}\log\left(\exp\left(-\inprod{A_k^t, \theta}\right)-1\right)
    + \sum_{k,t}\inprod{A_k^t,\theta}\\
    \text{s.t. }& \|\theta\|_1 \leq k
  \end{split}
\end{displaymath}
In this form it is easy to see that this is a convex program: the objective
function is the sum of a linear function and a concave function.
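For completeness, here is the one-dimensional computation behind the concavity
claim (a verification sketch added here, not taken from any reference). For
a fixed measurement row, write $a \eqdef \inprod{A_k^t, \theta}$ and
$g(a) \eqdef \log\left(e^{-a} - 1\right)$, on the domain $a < 0$ (which holds
whenever at least one active node can infect $i$). Then
$$ g'(a) = \frac{-e^{-a}}{e^{-a} - 1}, \qquad
   g''(a) = \frac{-e^{-a}}{\left(e^{-a} - 1\right)^2} < 0, $$
so $g$ is concave. Since $\theta \mapsto \inprod{A_k^t,\theta}$ is affine and
the weights $w_{k,t}$ are non-negative, each term
$w_{k,t}\, g(\inprod{A_k^t,\theta})$ is concave in $\theta$; adding the linear
term and the convex constraint $\|\theta\|_1 \leq k$ indeed yields a convex
program.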
\paragraph{Questions} To check: is the program also convex in the
$p_{j,i}$'s? What is the theory of sparse recovery for maximum likelihood
problems? (TO DO: read about Bregman divergences, I think they could be
related.)

\subsection{Results under strong assumptions}

What can we hope for, etc.\ (For example, if $M$ has full column rank or if the
same sets $S$ occur frequently.)

\subsection{Result under RIP condition}

For ease of notation, let $\theta := \theta_{*,i}$. The following resolution of
the above problem borrows heavily from \cite{2005math......3066C}. We can see
$w$ as a bounded perturbation of $f(b)$:
$$w = f(M\theta) + \epsilon, \text{ where } \|\epsilon\|_{\infty} < 1$$

Define the following program, which we can solve with the available information
$M$ and $w$:
\begin{equation}
(N_1) \quad \min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|e^{M\theta} - (1 - w) \|_2 \leq \|\epsilon \|_\infty \nonumber
\end{equation}

While we can hope for the columns of $M$ to be ``loosely'' orthogonal, we must
normalize them if we hope for $\delta_T$ to be small. Let $M'$ correspond to
$M$ with normalized columns, and define $\epsilon'$ accordingly, so that:
$$w = f(M'\theta) + \epsilon'$$
We see that $\epsilon'= w - (w - \epsilon)^{\vec \lambda}$ (element-wise
operation), where $\lambda_j = 1/\sqrt{\sum_k M_{k,j}}$, so that
$\|\epsilon'\|_{\infty} < 1$. We can then define the following program:
\begin{equation}
(N_2) \quad \min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|e^{M'\theta} - (1 - w) \|_2 \leq \|\epsilon' \|_\infty \nonumber
\end{equation}

With the corresponding $\delta'_T$ defined for $M'$, we are going to prove the
following:

\begin{theorem}
Suppose that all nodes in our graph have \textbf{degree bounded by $S$}, that
no edge between $i$ and its neighbors has probability $1$, i.e.\
$\forall j,\ p_{j,i} < 1 - \eta$ for some $0<\eta<1$, and that
$\delta'_{3S} + 3 \delta'_{4S} < 2$. If $\theta_0$ is the true vector for node
$i$, then the solution $\hat \theta$ to $(N_2)$ obeys:
$$\|\hat \theta - \theta_0 \|_2 \leq \frac{C_S}{\eta} \|\epsilon' \|_{\infty}$$
where $C_S$ is well behaved ($\approx 10$ for reasonable values of
$\delta'_{4S}$).
\end{theorem}

\begin{proof}
The proof borrows heavily from \cite{2005math......3066C}. Define
$h := \hat\theta - \theta_0$. We list the results as they are given in
\cite{2005math......3066C} or as they can be adapted to our case.
\begin{itemize}
\item $\|e^{M'\hat\theta} - e^{M' \theta_0} \|_2 \leq \| e^{M'\hat\theta} - ( 1 - w) \|_2 + \| e^{M'\theta_0} - (1 - w) \|_2 \leq 2 \|\epsilon' \|_{\infty}$
\item $\|h\|_2^2 \leq (1 + |T_0|/N) \|h_{T_{01}}\|_2^2$, where $N$ is an
arbitrary positive integer and $h_{T_{01}}$ is the vector $h$ with all entries
masked to $0$, except those corresponding to the non-zero entries of $\theta_0$
and the $N$ largest entries of $h_{T_0^c}$.
\item $\|e^{M'\hat\theta} - e^{M'\theta_0} \|_2 \geq \eta \|M'h\|_2 \geq \eta\, C_{T_0, N} \cdot \|h_{T_{01}} \|_2$, where $C_{T_0,N} := \sqrt{1 - \delta'_{N + T_0}} - \sqrt{T_0 / N} \sqrt{1 + \delta'_N}$
\end{itemize}
The first bullet point follows from the triangle inequality. The second bullet
point is identical to the proof in \cite{2005math......3066C}. The first
inequality in the third bullet point holds because, by assumption, all entries
of $M'\theta$ are greater than $\ln \eta$ (remember that the $e^{\vec v}$
operation is applied element-wise). The rest follows as in
\cite{2005math......3066C}, and we obtain:
$$\|h\|_2 \leq 2 \frac{\sqrt{1 + T_0/N}}{ \eta \cdot C_{T_0, N}} \|\epsilon'\|_{\infty} = \frac{C_S}{\eta} \|\epsilon'\|_{\infty}$$
For $\delta'_{4S} \leq \frac{1}{5}$, $C_S \approx 8.82$; for
$\delta'_{4S} \leq \frac{1}{4}$, $C_S \approx 10.47$.
\end{proof}

\begin{remark} A few notes on the result:
\begin{itemize}
\item With this naive reduction of $w$ to an $\epsilon$-perturbation of $f(b)$,
the best $\|\epsilon\|_{\infty}$ we can hope for is $1 - \eta$. Treating $w$ as
more than an arbitrary perturbation will most likely yield a much better upper
bound.
\item The upper bound is stated in $\log$-space for the true probabilities:
$$\frac{C_S}{\eta} > \sum_j \left[ \log \left(\frac{1 - p_j}{1 - p_j'} \right) \right]^2 > \frac{\eta \log(1/\eta)}{1 - \eta} \left\| \frac{1 - p_j}{1 - p'_j} - 1 \right\|$$
This bullet point needs to be finished.
\item To what extent can we approximate our measurement matrix by a matrix with
independent Bernoulli entries? Which $\delta_T$'s can we hope for? Which
constant $C_S$ can we hope for? How does it change with the rows of $M$?
\end{itemize}
\end{remark}
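As a numerical check of the constants quoted in the proof above, the following
sketch assumes the choice $T_0 = S$ and $N = 3S$ (so that
$\delta'_{N+T_0} = \delta'_{4S}$ and $\delta'_N = \delta'_{3S}$) and takes
$\delta'_{3S} \approx \delta'_{4S}$; with these assumptions it reproduces the
values $8.82$ and $10.47$:
\begin{verbatim}
# Check of C_S = 2*sqrt(1 + T_0/N) / C_{T_0,N} under the (assumed) choice
# T_0 = S, N = 3S and delta'_{3S} ~ delta'_{4S}.
from math import sqrt

def C_S(delta_4S, delta_3S=None):
    delta_3S = delta_4S if delta_3S is None else delta_3S
    # C_{T_0,N} = sqrt(1 - delta'_{N+T_0}) - sqrt(T_0/N) * sqrt(1 + delta'_N)
    C_T0_N = sqrt(1 - delta_4S) - sqrt(1 / 3) * sqrt(1 + delta_3S)
    return 2 * sqrt(1 + 1 / 3) / C_T0_N

print(round(C_S(1 / 5), 2))   # ~ 8.82
print(round(C_S(1 / 4), 2))   # ~ 10.47
\end{verbatim}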
\subsection{Results under isotropic condition}

% Given $M$ and $w$, we can solve the following maximum log-likelihood problem:
% \begin{equation}
% \max_{\|\theta_{*,i}\|_0 \leq k} <w,\log f(M \theta_{*,i})> + <1 - w, M\theta_{*,i}>
% \label{eq:max_loglik_obj}
% \end{equation}

% Since optimizing under the constraint $\|\theta\|_0 \leq k$ is hard, we solve instead the Lagrangian relaxation:

% \begin{equation}
% \max_{\theta_{*,i}} <w,\log f(M \theta_{*,i})> + <1 - w, M\theta_{*,i}> + \lambda \|\theta_{*,i}\|_0
% \label{eq:lagra_max_loglik_obj}
% \end{equation}

% Eq~\ref{eq:lagra_max_loglik_obj} is concave in $\theta_{*,i}$ and can therefore be solved efficiently. We would like to exploit the sparsity of $\theta_{*,i}$ to show that we recover the true sparse $\theta_{*,i}$ with few measurements (rows of $M$).

\bibliography{sparse}{}
\bibliographystyle{plain}

\end{document}

% Let us now consider the matrix $M$, whose rows are the vectors $\mathbbm{1}_S$. If we could estimate the vector $X := \mathbb{P}(i \text{ not infected at } t \mid S \text{ active at } t)$, we could solve for $\theta$:
% $$M \theta_{*,i} = X, \text{ where } \theta_{*,i} \text{ is a sparse vector}$$

% There are many powerful tools from sparse coding theory which would allow us to recover the sparsest $\theta$ under certain assumptions on $M$. One such assumption is incoherence of the columns of $M$: if we can show that for all columns $M_i, M_j$ of $M$ we have $|<M_i,M_j>| \leq \mu \|M_i\| \|M_j\|$, i.e.\ $M$ is $\mu$-incoherent, then for any node $i$ with fewer than $k = \frac{1}{2 \mu}$ parents, we can recover $\theta_{*,i}$ exactly. To get a better intuition of what this means, if the entries of $M$ were independent Gaussian variables, we would need approximately $k \ln(\frac{n}{k})$ measurements to recover $\theta_{*,i}$ exactly.

% This would seem to suggest that the quality of the observed cascades is much more important than their quantity, a notion which has not truly been explored in graph reconstruction as of yet. Another, stronger assumption on the columns of $M$ is separability (see Moitra's work in topic modeling), leading to even stronger results.

% In reality of course, we do not know $X$. Instead, we have access to $\hat X$, which estimates $X$ in the following way. For any set $S$ in any cascade that was active before $i$ was, we obtain a binary signal: $\hat X_S:=0$ if $i$ was infected immediately afterwards and $\hat X_S:=1$ otherwise. As the number of times the set $S$ was observed active grows, we have $$\frac{1}{m} \sum \hat X_S \rightarrow \mathbb{P}(i \text{ not infected at }t+1 \mid S \text{ active at } t).$$

% Certain sets might have been infected many times in our set of cascades, certain sets might have been infected only once. One possible way to rewrite the program above is: $$\hat X_j = H(w_j - e^{(M \theta_{*,i})_j}), \text{ where } w\sim {\cal U}[0,1] \text{ and } H \text{ is the step function } \mathbbm{1}_{\cdot \geq 0}$$

% This problem seems to be closely related to the growing field of 1-bit compressed sensing, with at least one difference: when scientists consider 1-bit CS with noise, they are usually in a position to decide which measurement to take next to recover $\theta_{*,i}$, whereas in this framework, we are limited to an observer status\footnote{Research in the other direction could be very interesting: see discussion about exploration-exploitation with Thibaut}.

% I need to do more reading on the subject, namely to understand what guarantees people have found for sample complexity.
