Diffstat (limited to 'notes/formalisation.tex')
 -rw-r--r--   notes/formalisation.tex | 317
 1 file changed, 0 insertions(+), 317 deletions(-)
diff --git a/notes/formalisation.tex b/notes/formalisation.tex
deleted file mode 100644
index c2ba8c0..0000000
--- a/notes/formalisation.tex
+++ /dev/null

\documentclass[10pt]{article}
\usepackage[utf8x]{inputenc}
\usepackage{fullpage}
\usepackage{amsmath, amssymb, bbm, amsthm}
\usepackage[pagebackref=true,breaklinks=true,colorlinks=true,backref=false]{hyperref}
\usepackage[capitalize, noabbrev]{cleveref}
\usepackage{microtype}
\input{defs}

\date{}
\title{Cracking Cascades: Applying Sparse Recovery to Graph Reconstruction}
\author{Jean Pouget-Abadie, Thibaut Horel}

\begin{document}
\maketitle

\section{Preliminaries: What is a measurement?}

There is an unknown graph $G=(V,E)$ that we wish to reconstruct by observing
information cascades. An information cascade is a random discrete-time process
in which we observe the set of vertices infected at each time step.

The key contribution is to formulate this problem as a sparse recovery problem
akin to those found in the compressed sensing literature. The parallel between
the two problems is as follows:
\begin{itemize}
    \item the unknown \emph{signal} $s$ is a vector in $\{0,1\}^{V\times V}$
        with $s_{u,v} = 1$ if and only if $(u,v)\in E$.
    \item the \emph{sensing matrix} is given by the cascades. The sensing
        matrix maps the signal to the observation space. Since the cascades
        are random time processes, they define a sequence of random
        observations.
    \item the \emph{observations} are the sets of vertices infected at each
        time step of each cascade.
\end{itemize}

Let us define $n\eqdef |V|$. If we assume that all the vertices of $G$ have
degree bounded by $d\in\N_+$, then the signal $s$ is $nd$-sparse. Assuming
that the sensing matrix has good random properties, we can hope to recover $s$
from $O(nd\log n)$ random observations.
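To make the correspondence concrete, here is a minimal sketch (in Python, on a
hypothetical toy graph; none of the names or values below are part of the
formal development) that flattens the edge set of a bounded-degree directed
graph into the signal $s\in\{0,1\}^{V\times V}$ and checks its sparsity:
\begin{verbatim}
# Illustrative sketch only: maps a small directed graph with degree bound d
# to the signal vector s in {0,1}^{V x V} described above.
import numpy as np

n, d = 6, 1                     # hypothetical sizes
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]  # directed cycle

s = np.zeros(n * n, dtype=int)  # s_{u,v} = 1 iff (u,v) is an edge
for (u, v) in edges:
    s[u * n + v] = 1

assert s.sum() <= n * d         # the signal is (at most) nd-sparse
print("signal length:", s.size, "non-zeros:", int(s.sum()))
\end{verbatim}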
Two main goals:
\begin{itemize}
    \item understand to which extent the probabilistic cascade models translate
        into incoherence properties of the sensing matrix and allow for an
        accurate reconstruction of the unknown graph from a small number of
        observations.
    \item compare the empirical performance of this compressed sensing approach
        to prior graph reconstruction approaches.
\end{itemize}

\begin{remark}
\begin{itemize}
    \item It seems that we will need a \emph{complete} probabilistic model for
        the cascades. By complete I mean that we need to specify a joint
        distribution on the source of the cascade (\emph{e.g.} the source is
        chosen uniformly at random) and on the cascade diffusion process. Such
        a model is probably not very realistic, but it is necessary to obtain
        theoretical guarantees. The unrealism of the model is not a problem
        \emph{per se}; the eventual validity of this approach will be confirmed
        experimentally anyway.
    \item It seems likely that we will want the rows of the sensing matrix to
        be independent realizations of the same distribution. However,
        \textbf{this will never be the case} if we have one row for each time
        step of each cascade: the rows within the same cascade are obviously
        highly correlated. We might have to \emph{aggregate} them (for example
        through concatenation?) in a sensible way so as to have one row per
        cascade.
    \item Something which makes this problem very different from the usual
        compressed sensing problems is that the measurements depend on the
        signal we wish to recover: cascades propagate along the edges contained
        in the signal. More precisely, not only are the rows of the sensing
        matrix correlated with each other (if we do not aggregate them by
        cascade), they are also correlated with the signal itself. Is this a
        problem? I don't know; this looks scary though.
    \item A slightly more general model, which we will need in the independent
        cascade case, considers a signal $s\in[0,1]^{V\times V}$. We can always
        go from this model to the binary case by first recovering the
        non-binary $s$ and then applying a thresholding procedure. But if we
        only care about recovering the support of $s$, can we do better than
        this two-step process? (TO DO: read about support recovery.)
\end{itemize}
\end{remark}

Previous stuff:
\begin{itemize}
\item Introduction
\item Simple graphs (cycle, star, path, tree?)
\item Isotropic condition
\item RIP condition
\end{itemize}

One last property we are going to look for in our measurements is the
\textbf{Restricted Isometry Property} (RIP), which was introduced in \cite{???}
and is ubiquitous in the compressed sensing literature. Suppose we are given
a measurement matrix $M$. For every $T \in \mathbb{N}$, we define $\delta_T$ to
be the smallest quantity such that, for any submatrix $M_T$ of $M$ obtained by
extracting $T$ columns of $M$, and for any vector $c$ with $\| c\|_2 = 1$:
$$1 - \delta_T \leq \| M_T c \|_2^2 \leq 1 + \delta_T$$
For small $\delta_T$, the above equation defines a ``loose'' orthonormality
property for the columns of $M$.
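The definition can be checked numerically on small instances. The following
sketch (illustrative only: the Gaussian matrix and all parameter values are
hypothetical stand-ins, not one of our cascade matrices) estimates $\delta_T$
by enumerating every set of $T$ columns of a column-normalized matrix:
\begin{verbatim}
# Brute-force estimate of the RIP constant delta_T for a small matrix M
# with unit-norm columns; only feasible for tiny T and n.
import itertools
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 40, 12, 3                      # hypothetical dimensions
M = rng.standard_normal((m, n))
M /= np.linalg.norm(M, axis=0)           # normalize the columns

delta_T = 0.0
for cols in itertools.combinations(range(n), T):
    sv = np.linalg.svd(M[:, list(cols)], compute_uv=False)
    # largest deviation of ||M_T c||_2^2 from 1 over unit vectors c
    delta_T = max(delta_T, abs(sv.max() ** 2 - 1), abs(1 - sv.min() ** 2))

print("estimated delta_T:", round(delta_T, 3))
\end{verbatim}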
\section{Warm up: the voter model}

In the voter model, there are two types of nodes, \emph{red} and \emph{blue}.
We fix a horizon $T$. At $t=0$, we decide on a set of \emph{blue} nodes, called
the seeding set. At each time step $t$, each node $u$ chooses one of its
neighbors uniformly at random (each with probability $1/\deg(u)$) and adopts
the color of that neighbor. A cascade in the voter model is the evolution of
the set of \emph{blue} nodes over the time steps $0 \leq t \leq T$.

For simplicity, we suppose that every node has probability $1/p_\text{init}$ of
being part of the seeding set of a cascade. The following analysis is easily
adapted to a non-uniform probability vector. We also suppose that the graph
includes self-loops, meaning that a node has the option to keep its color for
the next round. This is a common assumption which does not change any of the
following results.

\paragraph{Notation}
Denote by $(M^t_k)_{1 \leq k \leq m;\, 0 \leq t \leq T}$ the indicator vectors
of the blue nodes in cascade $k$ at time step $t$. In other words, $M^t_k$ is
a vector in $\mathbb{R}^n$ such that:
\[ (M^t_k)_j = \left\{
  \begin{array}{l l}
    1 & \quad \text{if $j$ is blue in cascade $k$ at time step $t$}\\
    0 & \quad \text{otherwise}
  \end{array} \right.\]

Consider a node $i$, and let $\theta_i$ be the vector representing the
neighbors of $i$ in the graph $\mathcal{G}$. In other words, $\theta_i$ is
a vector in $\mathbb{R}^n$ such that:
\[ \theta_{ij} = \left\{
  \begin{array}{l l}
    1/\deg(i) & \quad \text{if $j$ is a parent of $i$ in $\mathcal{G}$}\\
    0 & \quad \text{otherwise}
  \end{array} \right.\]

Let $b^t_k$ be the probability that node $i$ stays or becomes blue in cascade
$k$ at time step $t$:
\[ b^t_k = \left\{
  \begin{array}{l l}
    (\text{number of blue neighbors of $i$ in cascade $k$ at time $t-1$})/\deg(i) & \quad \text{if $t\neq 0$}\\
    1/p_{\text{init}} & \quad \text{otherwise}
  \end{array} \right.\]

Our objective is to find the neighbors of $i$ in $\mathcal{G}$ from $(M^t_k)$,
i.e.\ to recover the vector $\theta_i$. Notice that for every cascade $k$ and
every time step $t \leq T-1$:
$$M^t_k \cdot \theta_i = b^{t+1}_k$$

We can stack these equalities into a matrix $M$ whose rows are the observed
indicator vectors $M^t_k$, with the corresponding probabilities collected in
the vector $\vec b$:
$$M \cdot \theta_i = \vec b$$

Note that if $M$ had full column rank, we could recover $\theta_i$ from
$\vec b$. This is however an unreasonable assumption, even after having
observed many cascades. We must explore which assumptions on $M$ allow us to
recover $\theta_i$ from a small number of cascades. Further note that we do not
directly observe $\vec b$. Instead, we observe a Bernoulli vector whose $j$-th
entry equals $1$ with probability $b_j$ and $0$ otherwise. We denote this
vector by $\vec w$, and denote by $\vec \epsilon$ the noise vector such that
$\vec w = \vec b + \vec \epsilon$. We can therefore reformulate the problem in
terms of our observables:
$$M \cdot \theta_i + \vec \epsilon = \vec w$$

We hope to show that we can exploit the sparsity of the vector $\theta_i$ to
apply common sparse recovery techniques.
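A minimal simulation sketch, under the assumptions above (self-loops, uniform
seeding); the graph, the parameter values, and the variable names are all
hypothetical. It generates voter-model cascades and stacks, for a target node
$i$, the rows $M^t_k$ together with the corresponding Bernoulli outcomes
$\vec w$:
\begin{verbatim}
# Sketch: simulate voter-model cascades on a small graph with self-loops and
# assemble, for a target node i, the measurement rows M_k^t and outcomes w.
import numpy as np

rng = np.random.default_rng(1)
n, num_cascades, T, p_seed = 8, 20, 5, 0.3      # hypothetical parameters
# neighbors of each node (here: a cycle plus a self-loop at every node)
nbrs = [[u, (u - 1) % n, (u + 1) % n] for u in range(n)]

i = 0                                   # target node whose parents we want
rows, w = [], []
for _ in range(num_cascades):
    blue = rng.random(n) < p_seed       # seeding set at t = 0
    for t in range(T):
        rows.append(blue.astype(float))             # row M_k^t
        # every node adopts the color of a uniformly chosen neighbor
        new_blue = np.array([blue[rng.choice(nbrs[u])] for u in range(n)])
        w.append(float(new_blue[i]))    # is i blue at time t + 1?
        blue = new_blue

M, w = np.array(rows), np.array(w)
print(M.shape, w.shape)                 # rows of M pair with entries of w
\end{verbatim}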
\section{Independent Cascade Model}

\subsection{The Model}

Recall the independent cascade model, where nodes have three states:
susceptible, active, and inactive. All nodes start either susceptible or
active. At each time step $t$, for every (active, susceptible) pair $(j,i)$,
$j$ attempts to infect $i$ with probability $p_{j,i}$ and then becomes
inactive. If $j$ succeeds, $i$ becomes active at time step $t+1$. The cascade
continues until no active nodes remain.

Consider a node $i$. If $p_{j,i}$ is the probability that node $j$ infects $i$
at the next step, let $\theta_{j,i} := \log (1 - p_{j,i})$. Note that
$\theta_{j,i} = 0 \Leftrightarrow p_{j,i} = 0$. Also note that:
$$\sum_{j\in S} \theta_{j,i} = \log \prod_{j \in S} (1 - p_{j,i}) = \log \mathbb{P}(i \text{ not infected by any } j \in S)$$

Our objective is to recover the vector $\theta_{*,i}$ for every node $i$.

\subsection{Formulating the sparse recovery problem}

The idea behind what follows is the assumption that $p_{*,i}$, and therefore
$\theta_{*,i}$, is a sparse vector: a node has few neighbors (parents in the
directed graph) compared to the size of the graph.

Let $\mathbbm{1}_S$ be the indicator vector of the nodes that are active at
a given time step $t$ of a given cascade. Under the previous assumption, we
have:
$$\inprod{\mathbbm{1}_S, \theta_{*,i}} = \log \mathbb{P}( i \text{ not infected at } t+1 \mid S \text{ active at } t)$$

Suppose we observe a series of cascades $(A_k^t)$, where $A_k^t$ is the set of
nodes active in cascade $k$ at time $t$ (these are exactly the nodes which
became infected at time $t-1$). Recall that once a node has been infected
(becomes active), it cannot be infected again in the future (it becomes
inactive). Let us therefore consider, for every cascade $k$, the sets $A_k^t$
with $t < t_i^k$, where $t_i^k$ is the time at which $i$ was infected in
cascade $k$ ($t_i^k = + \infty$ if $i$ was not infected in that cascade). If we
stack the corresponding rows $\mathbbm{1}_{A_k^t}$ into a matrix $M$ (for
measurement), we have:
$$M \theta_{*,i} = b$$
where $b$ is the column vector whose entries are
$\log \mathbb{P}(i \text{ not infected at } t+1 \mid S \text{ active at } t)$.
Note that even if we had access to $b$ directly, solving for $\theta_{*,i}$ is
not trivial when $M$ does not have full column rank. However, under sparsity
assumptions on $\theta_{*,i}$, we can apply results from the sparse
recovery/sparse coding/compressed sensing literature.

In reality, we do not have access to $b$. We have access to the realization
vector $\vec w$ of the events
$\{i \text{ was infected at } t+1 \mid S \text{ active at } t\}$:
$$\vec w\sim B(1 - e^{M \theta_{*,i}})$$
where the map $f: \vec v \mapsto 1 - e^{\vec v}$ is applied element-wise and
$B(\vec p)$ is a vector whose entries are Bernoulli variables with parameters
the entries of $\vec p$.

\subsection{Maximum likelihood problem}

The maximum likelihood problem takes the following form:
\begin{displaymath}
  \begin{split}
    \max_\theta &\sum_{k,t}\left(w_{k,t}\log\left(1-\exp\inprod{A_k^t, \theta}\right)
    + (1-w_{k,t})\log\exp\inprod{A_k^t,\theta}\right)\\
    \text{s.t. }& \|\theta\|_1 \leq k
  \end{split}
\end{displaymath}
which we can rewrite as:
\begin{displaymath}
  \begin{split}
    \max_\theta &\sum_{k,t}w_{k,t}\log\left(\exp\left(-\inprod{A_k^t, \theta}\right)-1\right)
    + \sum_{k,t}\inprod{A_k^t,\theta}\\
    \text{s.t. }& \|\theta\|_1 \leq k
  \end{split}
\end{displaymath}
In this form it is easy to see that this is a convex program: the objective
function is the sum of a linear function and a concave function.
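For completeness, here is the one-dimensional computation behind the concavity
claim (a verification sketch added here, not taken from any reference). For
a fixed measurement row, write $a \eqdef \inprod{A_k^t, \theta}$ and
$g(a) \eqdef \log\left(e^{-a} - 1\right)$, on the domain $a < 0$ (which holds
whenever at least one active node can infect $i$). Then
$$ g'(a) = \frac{-e^{-a}}{e^{-a} - 1}, \qquad
   g''(a) = \frac{-e^{-a}}{\left(e^{-a} - 1\right)^2} < 0, $$
so $g$ is concave. Since $\theta \mapsto \inprod{A_k^t,\theta}$ is affine and
the weights $w_{k,t}$ are non-negative, each term
$w_{k,t}\, g(\inprod{A_k^t,\theta})$ is concave in $\theta$; adding the linear
term and the convex constraint $\|\theta\|_1 \leq k$ indeed yields a convex
program.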
\paragraph{Questions} To check: is the program also convex in the
$p_{j,i}$'s? What is the theory of sparse recovery for maximum likelihood
problems? (TO DO: read about Bregman divergences, I think they could be
related.)

\subsection{Results under strong assumptions}

What can we hope for, etc.\ (For example, if $M$ has full column rank or if the
same sets $S$ occur frequently.)

\subsection{Result under RIP condition}

For ease of notation, let $\theta := \theta_{*,i}$. The following resolution of
the above problem borrows heavily from \cite{2005math......3066C}. We can see
$w$ as a bounded perturbation of $f(b)$:
$$w = f(M\theta) + \epsilon, \text{ where } \|\epsilon\|_{\infty} < 1$$

Define the following program, which we can solve with the available information
$M$ and $w$:
\begin{equation}
(N_1) \quad \min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|e^{M\theta} - (1 - w) \|_2 \leq \|\epsilon \|_\infty \nonumber
\end{equation}

While we can hope for the columns of $M$ to be ``loosely'' orthogonal, we must
normalize them if we hope for $\delta_T$ to be small. Let $M'$ correspond to
$M$ with normalized columns, and define $\epsilon'$ accordingly, so that:
$$w = f(M'\theta) + \epsilon'$$
We see that $\epsilon'= w - (w - \epsilon)^{\vec \lambda}$ (element-wise
operation), where $\lambda_j = 1/\sqrt{\sum_k M_{k,j}}$, so that
$\|\epsilon'\|_{\infty} < 1$. We can then define the following program:
\begin{equation}
(N_2) \quad \min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|e^{M'\theta} - (1 - w) \|_2 \leq \|\epsilon' \|_\infty \nonumber
\end{equation}

With the corresponding $\delta'_T$ defined for $M'$, we are going to prove the
following:

\begin{theorem}
Suppose that all nodes in our graph have \textbf{degree bounded by $S$}, that
no edge between $i$ and its neighbors has probability $1$, i.e.\
$\forall j,\ p_{j,i} < 1 - \eta$ for some $0<\eta<1$, and that
$\delta'_{3S} + 3 \delta'_{4S} < 2$. If $\theta_0$ is the true vector for node
$i$, then the solution $\hat \theta$ to $(N_2)$ obeys:
$$\|\hat \theta - \theta_0 \|_2 \leq \frac{C_S}{\eta} \|\epsilon' \|_{\infty}$$
where $C_S$ is well behaved ($\approx 10$ for reasonable values of
$\delta'_{4S}$).
\end{theorem}

\begin{proof}
The proof borrows heavily from \cite{2005math......3066C}. Define
$h := \hat\theta - \theta_0$. We list the results as they are given in
\cite{2005math......3066C} or as they can be adapted to our case.
\begin{itemize}
\item $\|e^{M'\hat\theta} - e^{M' \theta_0} \|_2 \leq \| e^{M'\hat\theta} - ( 1 - w) \|_2 + \| e^{M'\theta_0} - (1 - w) \|_2 \leq 2 \|\epsilon' \|_{\infty}$
\item $\|h\|_2^2 \leq (1 + |T_0|/N) \|h_{T_{01}}\|_2^2$, where $N$ is an
arbitrary positive integer and $h_{T_{01}}$ is the vector $h$ with all entries
masked to $0$, except those corresponding to the non-zero entries of $\theta_0$
and the $N$ largest entries of $h_{T_0^c}$.
\item $\|e^{M'\hat\theta} - e^{M'\theta_0} \|_2 \geq \eta \|M'h\|_2 \geq \eta\, C_{T_0, N} \cdot \|h_{T_{01}} \|_2$, where $C_{T_0,N} := \sqrt{1 - \delta'_{N + T_0}} - \sqrt{T_0 / N} \sqrt{1 + \delta'_N}$
\end{itemize}
The first bullet point follows from the triangle inequality. The second bullet
point is identical to the proof in \cite{2005math......3066C}. The first
inequality in the third bullet point holds because, by assumption, all entries
of $M'\theta$ are greater than $\ln \eta$ (remember that the $e^{\vec v}$
operation is applied element-wise). The rest follows as in
\cite{2005math......3066C}, and we obtain:
$$\|h\|_2 \leq 2 \frac{\sqrt{1 + T_0/N}}{ \eta \cdot C_{T_0, N}} \|\epsilon'\|_{\infty} = \frac{C_S}{\eta} \|\epsilon'\|_{\infty}$$
For $\delta'_{4S} \leq \frac{1}{5}$, $C_S \approx 8.82$; for
$\delta'_{4S} \leq \frac{1}{4}$, $C_S \approx 10.47$.
\end{proof}

\begin{remark} A few notes on the result:
\begin{itemize}
\item With this naive reduction of $w$ to an $\epsilon$-perturbation of $f(b)$,
the best $\|\epsilon\|_{\infty}$ we can hope for is $1 - \eta$. Treating $w$ as
more than an arbitrary perturbation will most likely yield a much better upper
bound.
\item The upper bound is stated in $\log$-space for the true probabilities:
$$\frac{C_S}{\eta} > \sum_j \left[ \log \left(\frac{1 - p_j}{1 - p_j'} \right) \right]^2 > \frac{\eta \log(1/\eta)}{1 - \eta} \left\| \frac{1 - p_j}{1 - p'_j} - 1 \right\|$$
This bullet point needs to be finished.
\item To what extent can we approximate our measurement matrix by a matrix with
independent Bernoulli entries? Which $\delta_T$'s can we hope for? Which
constant $C_S$ can we hope for? How does it change with the rows of $M$?
\end{itemize}
\end{remark}
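As a numerical check of the constants quoted in the proof above, the following
sketch assumes the choice $T_0 = S$ and $N = 3S$ (so that
$\delta'_{N+T_0} = \delta'_{4S}$ and $\delta'_N = \delta'_{3S}$) and takes
$\delta'_{3S} \approx \delta'_{4S}$; with these assumptions it reproduces the
values $8.82$ and $10.47$:
\begin{verbatim}
# Check of C_S = 2*sqrt(1 + T_0/N) / C_{T_0,N} under the (assumed) choice
# T_0 = S, N = 3S and delta'_{3S} ~ delta'_{4S}.
from math import sqrt

def C_S(delta_4S, delta_3S=None):
    delta_3S = delta_4S if delta_3S is None else delta_3S
    # C_{T_0,N} = sqrt(1 - delta'_{N+T_0}) - sqrt(T_0/N) * sqrt(1 + delta'_N)
    C_T0_N = sqrt(1 - delta_4S) - sqrt(1 / 3) * sqrt(1 + delta_3S)
    return 2 * sqrt(1 + 1 / 3) / C_T0_N

print(round(C_S(1 / 5), 2))   # ~ 8.82
print(round(C_S(1 / 4), 2))   # ~ 10.47
\end{verbatim}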
\subsection{Results under isotropic condition}

% Given $M$ and $w$, we can solve the following maximum log-likelihood problem:
% \begin{equation}
% \max_{\|\theta_{*,i}\|_0 \leq k} <w,\log f(M \theta_{*,i})> + <1 - w, M\theta_{*,i}>
% \label{eq:max_loglik_obj}
% \end{equation}

% Since optimizing under the constraint $\|\theta\|_0 \leq k$ is hard, we solve instead the Lagrangian relaxation:

% \begin{equation}
% \max_{\theta_{*,i}} <w,\log f(M \theta_{*,i})> + <1 - w, M\theta_{*,i}> + \lambda \|\theta_{*,i}\|_0
% \label{eq:lagra_max_loglik_obj}
% \end{equation}

% Eq~\ref{eq:lagra_max_loglik_obj} is concave in $\theta_{*,i}$ and can therefore be solved efficiently. We would like to exploit the sparsity of $\theta_{*,i}$ to show that we recover the true sparse $\theta_{*,i}$ with few measurements (rows of $M$).

\bibliography{sparse}{}
\bibliographystyle{plain}

\end{document}

% Let us now consider the matrix $M$, whose rows are the vectors $\mathbbm{1}_S$. If we could estimate the vector $X := \mathbb{P}(i \text{ not infected at } t \mid S \text{ active at } t)$, we could solve for $\theta$:
% $$M \theta_{*,i} = X, \text{ where } \theta_{*,i} \text{ is a sparse vector}$$

% There are many powerful tools from sparse coding theory which would allow us to recover the sparsest $\theta$ under certain assumptions on $M$. One such assumption is incoherence of the columns of $M$: if we can show that for all columns $M_i, M_j$ of $M$ we have $|<M_i,M_j>| \leq \mu \|M_i\| \|M_j\|$, i.e.\ $M$ is $\mu$-incoherent, then for any node $i$ with fewer than $k = \frac{1}{2 \mu}$ parents, we can recover $\theta_{*,i}$ exactly. To get a better intuition of what this means, if the entries of $M$ were independent Gaussian variables, we would need approximately $k \ln(\frac{n}{k})$ measurements to recover $\theta_{*,i}$ exactly.

% This would seem to suggest that the quality of the observed cascades is much more important than their quantity, a notion which has not truly been explored in graph reconstruction as of yet. Another, stronger assumption on the columns of $M$ is separability (see Moitra's work in topic modeling), leading to even stronger results.

% In reality of course, we do not know $X$. Instead, we have access to $\hat X$, which estimates $X$ in the following way. For any set $S$ in any cascade that was active before $i$ was, we obtain a binary signal: $\hat X_S:=0$ if $i$ was infected immediately afterwards and $\hat X_S:=1$ otherwise. As the number of times the set $S$ was observed active grows, we have $$\frac{1}{m} \sum \hat X_S \rightarrow \mathbb{P}(i \text{ not infected at }t+1 \mid S \text{ active at } t).$$

% Certain sets might have been infected many times in our set of cascades, certain sets might have been infected only once. One possible way to rewrite the program above is: $$\hat X_j = H(w_j - e^{(M \theta_{*,i})_j}), \text{ where } w\sim {\cal U}[0,1] \text{ and } H \text{ is the step function } \mathbbm{1}_{\cdot \geq 0}$$

% This problem seems to be closely related to the growing field of 1-bit compressed sensing, with at least one difference: when scientists consider 1-bit CS with noise, they are usually in a position to decide which measurement to take next to recover $\theta_{*,i}$, whereas in this framework, we are limited to an observer status\footnote{Research in the other direction could be very interesting: see discussion about exploration-exploitation with Thibaut}.

% I need to do more reading on the subject, namely to understand what guarantees people have found for sample complexity.
