\documentclass[10pt]{article}
\usepackage[utf8x]{inputenc}
\usepackage{amsmath, amssymb, bbm, amsthm}
\usepackage[pagebackref=true,breaklinks=true,colorlinks=true,backref=false]{hyperref}
\usepackage[capitalize, noabbrev]{cleveref}
\usepackage{microtype}
\input{defs}
\date{}
\title{Cracking Cascades: Applying Sparse Recovery to Graph Reconstruction}
\author{Jean Pouget-Abadie, Thibaut Horel}

\begin{document}
\maketitle

\section{Preliminaries: What is a measurement?}

There is an unknown graph $G=(V,E)$ that we wish to reconstruct by observing information cascades. An information cascade is a random discrete-time process in which we observe the set of vertices infected at each time step. The key contribution is to formulate this problem as a sparse recovery problem akin to those found in the compressed sensing literature. The parallel between the two problems is as follows:
\begin{itemize}
\item the unknown \emph{signal} $s$ is a vector in $\{0,1\}^{V\times V}$ with $s_{u,v} = 1$ iff. $(u,v)\in E$.
\item the \emph{sensing matrix} is given by the cascades: it maps the signal to the observation space. Since the cascades are random time processes, they define a sequence of random observations.
\item the \emph{observations} are the sets of vertices infected at each time step of each cascade.
\end{itemize}
Let us define $n\eqdef |V|$. If we assume that all the vertices in $G$ have degree bounded by $d\in\N_+$, then the signal $s$ is $nd$-sparse. Assuming that the sensing matrix has good random properties, we can hope to recover $s$ with $O(nd\log n)$ random observations (a toy numerical illustration of this idealized setting is given after the remark below).

Two main goals:
\begin{itemize}
\item understand to what extent the probabilistic cascade models translate into incoherence properties of the sensing matrix and allow for an accurate reconstruction of the unknown graph from a small number of observations.
\item compare the empirical performance of this compressed sensing approach to prior graph reconstruction approaches.
\end{itemize}

\begin{remark}
\begin{itemize}
\item It seems that we will need a \emph{complete} probabilistic model for the cascades. By complete I mean that we need to specify a joint distribution on the source of the cascade (\emph{e.g.} the source is chosen uniformly at random) and on the cascade diffusion process. Such a model is probably not very realistic, but it is necessary to obtain theoretical guarantees. The unrealism of the model is not a problem \emph{per se}; the eventual validity of this approach will be confirmed experimentally anyway.
\item It seems likely that we will want the rows of the sensing matrix to be independent realizations of the same distribution. However, \textbf{this will never be the case} if we have one row for each time step of each cascade: the rows within the same cascade are obviously highly correlated. We might have to \emph{aggregate} them (for example through concatenation?) in a sensible way so as to end up with one row per cascade.
\item Something which makes this problem very different from the usual compressed sensing problems is that the measurements depend on the signal we wish to recover: cascades propagate along the edges contained in the signal. More precisely, not only are the rows of the sensing matrix correlated with each other (if we do not aggregate them by cascade), they are also correlated with the signal itself. Is this a problem? I don't know, but it looks scary.
\item A slightly more general model, which we will need in the independent cascade case, considers a signal $s\in[0,1]^{V\times V}$. We can always go from this model to the binary case by first recovering the non-binary $s$ and then applying a thresholding procedure. But if we only care about recovering the support of $s$, can we do better than this two-step process? (TO DO: read about support recovery.)
\end{itemize}
\end{remark}
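To make the $O(nd\log n)$ observation count above concrete, here is the promised toy sketch: an $nd$-sparse $0/1$ signal recovered by basis pursuit ($\ell_1$ minimization written as a linear program, solved with \texttt{scipy}). The sizes are made-up toy values, and the Gaussian sensing matrix is an idealized stand-in chosen only for illustration; it is \emph{not} the cascade-induced operator studied in the rest of these notes.

\begin{verbatim}
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 20, 2                        # number of nodes and degree bound (toy values)
N = n * n                           # dimension of the vectorized adjacency signal s
m = int(2 * n * d * np.log(n))      # O(nd log n) measurements

# nd-sparse 0/1 signal standing in for the unknown adjacency vector
s = np.zeros(N)
s[rng.choice(N, size=n * d, replace=False)] = 1.0

# Idealized Gaussian sensing matrix (NOT the cascade-induced operator)
A = rng.standard_normal((m, N)) / np.sqrt(m)
b = A @ s

# Basis pursuit:  min ||x||_1  s.t.  Ax = b,  as an LP in (x, u) with -u <= x <= u
c = np.concatenate([np.zeros(N), np.ones(N)])
A_ub = np.block([[np.eye(N), -np.eye(N)], [-np.eye(N), -np.eye(N)]])
A_eq = np.hstack([A, np.zeros((m, N))])
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * N), A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * N + [(0, None)] * N, method="highs")
s_hat = res.x[:N]
print("measurements:", m, "dimension:", N)
print("max reconstruction error:", np.abs(s_hat - s).max())
\end{verbatim}

With these (arbitrary) dimensions the linear program typically recovers $s$ exactly; the whole question of these notes is whether cascade-induced measurements behave anywhere near as well.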
Previous stuff:
\begin{itemize}
\item Introduction
\item Simple graphs (cycle, star, path, tree?)
\item Isotropic condition
\item RIP condition
\end{itemize}

One last property we are going to look for in measurements is the {\bf Restricted Isometry Property} (RIP), which was introduced in \cite{???} and is ubiquitous in the compressed sensing literature. Suppose we are given a measurement matrix $M$. For every $T \in \mathbb{N}$, we define $\delta_T$ to be the smallest quantity such that, for any submatrix $M_T$ of $M$ obtained by extracting $T$ columns of $M$, and for any vector $c$ such that $\| c\|_2 = 1$:
$$1 - \delta_T \leq \| M_T c \|_2^2 \leq 1 + \delta_T$$
For small $\delta_T$, the above equation expresses a `loose' orthonormality property for the columns of $M$.

\section{Warm up: the voter model}

In the voter model, there are two types of nodes, {\it red} and {\it blue}. We fix a horizon $T$. At $t=0$, we decide on a set of {\it blue} nodes, called the seeding set. At each time step $t$, each node $u$ chooses one of its neighbors uniformly at random, i.e.\ each with probability $1/\deg(u)$, and adopts the color of that neighbor. A cascade in the voter model is the evolution of the set of {\it blue} nodes over the time steps $0 \leq t \leq T$. For simplicity, we suppose that every node has a probability $1/p_\text{init}$ of being part of the seeding set of a cascade. The following analysis can easily be adapted to a non-uniform probability vector. We also suppose that the graph includes self-loops, meaning that each node has the option to keep its color for the next round. This is a common assumption which does not change any of the following results.

Denote by $(M^t_k)_{1 \leq k \leq m;\, 1 \leq t \leq T}$ the indicator vectors of the blue nodes in cascade $k$ at time step $t$: $M^t_k$ is a vector in $\mathbb{R}^n$ such that:
\[ (M^t_k)_j = \left\{ \begin{array}{l l} 1 & \quad \text{if $j$ is {\it blue} in cascade $k$ at time step $t$}\\ 0 & \quad \text{otherwise} \end{array} \right.\]
Consider a node $i$, and let $\theta_i$ be a vector in $\mathbb{R}^n$ such that:
\[ \theta_{ij} = \left\{ \begin{array}{l l} 1/\deg(i) & \quad \text{if $j$ is a parent of $i$}\\ 0 & \quad \text{otherwise} \end{array} \right.\]
Our objective is to find the parents of $i$ from $(M^t_k)$, i.e.\ to recover the vector $\theta_i$. Notice that for a given cascade $k$, regardless of $i$'s color at time step $t$, the probability that it stays or becomes {\it blue} at time step $t+1$ is
$$\frac{\text{number of blue neighbors of } i}{\deg(i)} = M^t_k \cdot \theta_i$$
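A quick numerical check of this identity, assuming a small random undirected graph with self-loops and an arbitrary blue/red configuration (both made up purely for illustration):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Small random undirected graph with self-loops (toy instance, not from the notes)
A = (rng.random((n, n)) < 0.5).astype(int)
A = np.triu(A, 1)
A = A + A.T + np.eye(n, dtype=int)
deg = A.sum(axis=1)
theta = A / deg[:, None]            # theta[i, j] = 1/deg(i) iff j is a neighbor of i

def voter_step(blue):
    """One synchronous voter-model update: each node copies a uniform neighbor."""
    new = np.empty_like(blue)
    for u in range(n):
        new[u] = blue[rng.choice(np.flatnonzero(A[u]))]
    return new

# Empirical P(i blue at t+1 | configuration at t) vs. the predicted M^t . theta_i
blue = (rng.random(n) < 0.5).astype(int)    # arbitrary configuration M^t
trials = 20000
freq = sum(voter_step(blue) for _ in range(trials)) / trials
print(np.c_[freq, theta @ blue])            # the two columns should agree
\end{verbatim}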
\section{Independent Cascade Model}
\subsection{The Model}

Recall the independent cascade model, where nodes have three states: susceptible, active, and inactive. All nodes start either as susceptible or active. At each time step $t$, for every (active, susceptible) pair $(j,i)$, $j$ attempts to infect $i$ with probability $p_{j,i}$ and then becomes inactive. If $j$ succeeds, $i$ becomes active at time step $t+1$. The cascade continues until no active nodes remain.

Consider a node $i$. If $p_{j,i}$ is the probability that node $j$ infects $i$ at the next step, let $\theta_{j,i} := \log (1 - p_{j,i})$. Note that $\theta_{j,i} = 0 \Leftrightarrow p_{j,i} = 0$. Also note that:
$$\sum_{j\in S} \theta_{j,i} = \log \prod_{j \in S} (1 - p_{j,i}) = \log \mathbb{P}(i \text{ not infected by any } j \in S)$$
Our objective is to recover the vector $\theta_{*,i}$ for every node $i$.

\subsection{Formulating the sparse recovery problem}

The idea behind what follows is the assumption that $p_{*,i}$, and therefore $\theta_{*,i}$, are sparse vectors: a node has few neighbors (parents in the directed graph) compared to the size of the graph. Let $\mathbbm{1}_S$ be the indicator vector of the nodes that are active at a certain time step $t$ of a certain cascade. Under the previous assumption, we have:
$$\inprod{\mathbbm{1}_S, \theta_{*,i}} = \log \mathbb{P}( i \text{ not infected at } t+1 \mid S \text{ active at } t)$$
Suppose we observe a series of cascades $(A_k^t)$, where $A_k^t$ is the set of nodes active in cascade $k$ at time $t$ (these are exactly the nodes which became infected at time $t-1$). Recall that once a node has been infected (becomes active), it cannot be infected again in the future (it becomes inactive). For each cascade $k$, let $t_i^k$ denote the time step at which $i$ becomes active, with $t_i^k = + \infty$ if $i$ was never infected in that cascade, and consider the sets $(A_k^t)_{t < t_i^k}$, i.e.\ the active sets observed while $i$ was still susceptible. If we stack the corresponding rows $\mathbbm{1}_{A_k^t}$ into a matrix $M$ (for measurement), we have:
$$M \theta_{*,i} = b$$
where $b$ is the column vector whose entries are $\log \mathbb{P}(i \text{ not infected at } t+1 \mid A_k^t \text{ active at } t)$. Note that even if we had access to $b$ directly, if $M$ does not have full column rank then solving for $\theta_{*,i}$ is not trivial! However, under sparsity assumptions on $\theta_{*,i}$, we could apply results from the sparse recovery/sparse coding/compressed sensing literature.

In reality, we do not have access to $b$. We only observe the realization vector $\vec w$ of the events $\{i \text{ was infected at } t+1 \mid S \text{ active at } t\}$:
$$\vec w\sim B(1 - e^{M \theta_{*,i}})$$
where the operation $f: \vec v \mapsto 1 - e^{\vec v}$ is applied element-wise and $B(\vec p)$ is a vector whose entries are Bernoulli random variables with parameters given by the entries of $\vec p$.

\subsection{Maximum likelihood problem}

Writing $w_{k,t}$ for the entry of $\vec w$ corresponding to the active set $A_k^t$, the maximum likelihood problem takes the following form:
\begin{displaymath} \begin{split} \max_\theta &\sum_{k,t}\left(w_{k,t}\log\left(1-\exp\inprod{\mathbbm{1}_{A_k^t}, \theta}\right) + (1-w_{k,t})\inprod{\mathbbm{1}_{A_k^t},\theta}\right)\\ \text{s.t. }& \|\theta\|_1 \leq \kappa \end{split} \end{displaymath}
which we can rewrite as:
\begin{displaymath} \begin{split} \max_\theta &\sum_{k,t}w_{k,t}\log\left(\exp\left(-\inprod{\mathbbm{1}_{A_k^t}, \theta}\right)-1\right) + \sum_{k,t}\inprod{\mathbbm{1}_{A_k^t},\theta}\\ \text{s.t. }& \|\theta\|_1 \leq \kappa \end{split} \end{displaymath}
In this form it is easy to see that this is a convex program: the objective function is the sum of a linear function and a concave function.

\paragraph{Questions to check.} Is the program also convex in the $p_{j,i}$'s? What is the theory of sparse recovery for maximum likelihood problems? (TO DO: read about Bregman divergences; I think they could be related.)
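As a sanity check on the last two subsections, here is a minimal end-to-end sketch: it simulates independent cascades on a synthetic graph, builds $M$ and $w$ for one target node $i$, and fits $\theta_{*,i}$ by penalized maximum likelihood. The graph, the common infection probability $0.4$, the number of cascades, and the penalty weight are all made up for illustration; the $\ell_1$ penalty replaces the hard constraint of the program above and is linear on the feasible set, since $\theta \leq 0$ entrywise.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 40, 3                      # toy graph size and in-degree (not from the notes)

# Sparse directed graph: p[j, i] = probability that an active j infects i
p = np.zeros((n, n))
for node in range(n):
    parents = rng.choice(np.setdiff1d(np.arange(n), [node]), size=d, replace=False)
    p[parents, node] = 0.4

def cascade(p, p_init=0.1):
    """Run one independent cascade; return the list of newly active sets A^t."""
    active = rng.random(n) < p_init
    susceptible = ~active
    history = [np.flatnonzero(active)]
    while active.any():
        # each active j tries once to infect each susceptible i, then turns inactive
        prob_not = np.prod(1 - p[active][:, susceptible], axis=0)
        newly = np.zeros(n, dtype=bool)
        newly[np.flatnonzero(susceptible)] = rng.random(susceptible.sum()) > prob_not
        susceptible &= ~newly
        active = newly
        history.append(np.flatnonzero(active))
    return history

# Measurement matrix M and observations w for one target node i
i = 0
rows, obs = [], []
for _ in range(300):              # number of cascades (arbitrary)
    hist = cascade(p)
    for t in range(len(hist) - 1):
        if i in np.concatenate(hist[:t + 1]):
            break                 # i already active: no further measurements
        rows.append(np.isin(np.arange(n), hist[t]).astype(float))
        obs.append(1.0 if i in hist[t + 1] else 0.0)
M, w = np.array(rows), np.array(obs)

# Penalized maximum likelihood for theta_{*,i}; entries are log(1 - p_{j,i}) <= 0,
# so ||theta||_1 = -sum(theta) and the l1 penalty is linear on the feasible set.
lam = 1.0
def negloglik(theta):
    a = M @ theta
    ll = w * np.log(np.maximum(1 - np.exp(a), 1e-12)) + (1 - w) * a
    return -(ll.sum() + lam * theta.sum())

res = minimize(negloglik, -0.05 * np.ones(n), method="L-BFGS-B",
               bounds=[(None, 0.0)] * n)
p_hat = 1 - np.exp(res.x)
print("true parents of i      :", sorted(np.flatnonzero(p[:, i])))
print("largest recovered p_hat:", sorted(np.argsort(p_hat)[-d:]))
\end{verbatim}

Nothing here is tuned; the point is only that the measurement construction and the convex program above fit together as stated.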
\subsection{Results under strong assumptions}

What can we hope to have etc. (If $M$ has full column rank or if the same sets $S$ occur frequently.)

\subsection{Result under RIP condition}

For ease of notation, let $\theta := \theta_{*,i}$. The following resolution of the above problem borrows heavily from \cite{2005math......3066C}. We can see $w$ as a bounded perturbation of $f(M\theta)$:
$$w = f(M\theta) + \epsilon, \text{ where } \|\epsilon\|_{\infty} < 1$$
Define the following program, which we can solve with the available information $M$ and $w$:
\begin{equation} (N_1) \quad \min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|e^{M\theta} - (1 - w) \|_2 \leq \|\epsilon \|_\infty \nonumber \end{equation}
While we can hope for the columns of $M$ to be `loosely' orthogonal, we must normalize them if we hope for $\delta_T$ to be small. Let $M'$ be the matrix $M$ with normalized columns; we accordingly define $\epsilon'$ such that:
$$w = f(M'\theta) + \epsilon'$$
We see that $\epsilon'= w - (w - \epsilon)^{\vec \lambda}$ (element-wise operation), where $\lambda_j = 1/\sqrt{\sum_i M_{i,j}}$ is the inverse of the $\ell_2$-norm of the $j$-th column of $M$ (recall that the entries of $M$ are in $\{0,1\}$), so that $\|\epsilon'\|_{\infty} < 1$. We can define the following program:
\begin{equation} (N_2) \quad \min_\theta \|\theta\|_1 \quad \text{subject to} \quad \|e^{M'\theta} - (1 - w) \|_2 \leq \|\epsilon' \|_\infty \nonumber \end{equation}
With the corresponding $\delta_T'$ defined for $M'$, we are going to prove the following:
\begin{theorem}
Suppose that all nodes in our graph have {\bf degree bounded by $S$}, that no edge between $i$ and its neighbors has probability $1$, in the sense that $\forall j,\ p_{j,i} < 1 - \eta$ for some $0<\eta<1$, and that $\delta'_{3S} + 3 \delta'_{4S} < 2$. If $\theta_0$ is the true vector for node $i$, the solution $\hat \theta$ to $(N_2)$ obeys:
$$\|\hat \theta - \theta_0 \|_2 \leq \frac{C_S}{\eta} \|\epsilon' \|_{\infty}$$
where $C_S$ is well behaved ($\approx 10$ for reasonable values of $\delta'_{4S}$).
\end{theorem}
\begin{proof}
The proof borrows heavily from \cite{2005math......3066C}. Define $h := \hat\theta - \theta_0$. We list the results as they are given in \cite{2005math......3066C}, or as they can be adapted to our case.
\begin{itemize}
\item $\|e^{M'\hat\theta} - e^{M' \theta_0} \|_2 \leq \| e^{M'\hat\theta} - ( 1 - w) \|_2 + \| e^{M'\theta_0} - (1 - w) \|_2 \leq 2 \|\epsilon' \|_{\infty}$
\item $\|h\|_2^2 \leq (1 + |T_0|/N) \|h_{T_{01}}\|_2^2$, where $N$ is an arbitrary positive integer, and $h_{T_{01}}$ is the vector $h$ with all entries masked to $0$, except those corresponding to the non-zero entries of $\theta_0$ and the $N$ largest entries of $h_{T_0^c}$.
\item $\|e^{M'\hat\theta} - e^{M'\theta_0} \|_2 \geq \eta \|M'h\|_2 \geq \eta\, C_{T_0, N} \cdot \|h_{T_{01}} \|_2$, where $C_{T_0,N} := \sqrt{1 - \delta'_{N + T_0}} - \sqrt{T_0 / N} \sqrt{1 + \delta'_N}$
\end{itemize}
The first bullet point is obtained by the triangle inequality. The second bullet point is identical to the proof in \cite{2005math......3066C}. The first inequality in the third bullet point holds because, by assumption, all entries of $M'\hat\theta$ and $M'\theta_0$ are greater than $\ln \eta$ (remember that the $e^{\vec v}$ operation is applied element-wise). The rest follows as in \cite{2005math......3066C}, and we obtain:
$$\|h\|_2 \leq 2 \frac{\sqrt{1 + T_0/N}}{ \eta \cdot C_{T_0, N}} \|\epsilon'\|_{\infty} = \frac{C_S}{\eta} \|\epsilon'\|_{\infty}$$
For $\delta'_{4S} \leq \frac{1}{5}$, $C_S \approx 8.82$; for $\delta'_{4S} \leq \frac{1}{4}$, $C_S \approx 10.47$.
\end{proof}
\begin{remark}
A few notes on the result:
\begin{itemize}
\item With this naive reduction of $w$ to an $\epsilon$-perturbation of $f(M\theta)$, the best $\|\epsilon\|_{\infty}$ we can hope for is $1 - \eta$. Treating $w$ as more than an arbitrary perturbation will most likely yield a much better upper bound.
\item The upper bound is in $\log$-space for the true probabilities:
$$\frac{C_S}{\eta} > \sum \left[ \log \left(\frac{1 - p_j}{1 - p'_j} \right) \right]^2 > \frac{\eta \log(1/\eta)}{1 - \eta} \left\| \frac{1 - p_j}{1 - p'_j} - 1 \right\|$$
This bullet point needs to be finished.
\item To what extent can we approximate our measurement matrix by a matrix with independent Bernoulli entries? What $\delta_T$'s can we hope for? Which constant $C_S$ can we hope for? How does it change with the number of rows of $M$? (A quick numerical estimate of $\delta_T$ for such matrices is sketched right after this remark.)
\end{itemize}
\end{remark}
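As a first step toward the last question of the remark, here is a quick Monte Carlo sketch that lower-bounds $\delta_T$ by sampling random $T$-column submatrices of a column-normalized matrix. The $0/1$ Bernoulli matrix is a crude proxy for activation-indicator rows, and the Gaussian matrix is the classical compressed sensing baseline; all sizes and densities are arbitrary.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
m, n, T, trials = 200, 400, 10, 2000   # arbitrary sizes

def estimate_delta(M, T, trials):
    """Monte Carlo lower bound on delta_T for the column-normalized version of M."""
    Mn = M / np.linalg.norm(M, axis=0)
    delta = 0.0
    for _ in range(trials):
        cols = rng.choice(Mn.shape[1], size=T, replace=False)
        sv = np.linalg.svd(Mn[:, cols], compute_uv=False)
        delta = max(delta, abs(sv[0] ** 2 - 1), abs(sv[-1] ** 2 - 1))
    return delta

bernoulli = (rng.random((m, n)) < 0.3).astype(float)  # 0/1 rows, like activation indicators
gaussian = rng.standard_normal((m, n))                # classical compressed sensing baseline
print("delta_T estimate, 0/1 Bernoulli entries:", estimate_delta(bernoulli, T, trials))
print("delta_T estimate, Gaussian entries     :", estimate_delta(gaussian, T, trials))
\end{verbatim}

Since the $0/1$ columns are not centered, they are all positively correlated, so we should expect a much larger estimate in the first case than in the Gaussian case; whether aggregating or re-centering the rows improves this is part of the question above.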
\subsection{Results under isotropic condition}

% Given $M$ and $w$, we can solve the following maximum log-likelihood problem:
% \begin{equation}
% \max_{\|\theta_{*,i}\|_0 \leq k}
% \label{eq:max_loglik_obj}
% \end{equation}
% Since optimizing under the constraint $\|\theta\|_0 \leq k$ is hard, we solve instead the Lagrangian relaxation:
% \begin{equation}
% \max_{\theta_{*,i}} + \lambda \|\theta_{*,i}\|_0
% \label{eq:lagra_max_loglik_obj}
% \end{equation}
% Eq~\ref{eq:lagra_max_loglik_obj} is concave in $\theta_{*,i}$ and can therefore be solved efficiently.

We would like to exploit the sparsity of $\theta_{*,i}$ to show that we can recover the true sparse $\theta_{*,i}$ with few measurements (rows of $M$).

\bibliography{sparse}{}
\bibliographystyle{plain}

\end{document}

% Let us now consider the matrix $M$, whose rows are the vectors $\mathbbm{1}_S$. If we could estimate the vector $X := \mathbb{P}(i \text{ not infected at } t+1 \mid S \text{ active at } t)$, we could solve for $\theta$:
% $$M \theta_{*,i} = X, \text{ where } \theta_{*,i} \text{ is a sparse vector}$$
% There are many powerful tools from sparse coding theory which would allow us to recover the sparsest $\theta$ under certain assumptions on $M$. One such assumption is incoherence of the columns of $M$. In other words, if we can show that for all columns $M_i, M_j$ of $M$ we have $|\inprod{M_i, M_j}| \leq \mu \|M_i\| \|M_j\|$, i.e.\ $M$ is $\mu$-incoherent, then for any node $i$ with fewer than $k = \frac{1}{2 \mu}$ parents, we can recover $\theta_{*,i}$ exactly. To get a better intuition of what this means, if the entries of $M$ were independent Gaussian variables, we would need approximately $k \ln(\frac{n}{k})$ measurements to recover $\theta_{*,i}$ exactly.
% This would seem to suggest that the quality of observed cascades is much more important than quantity, a notion which has not truly been explored in graph reconstruction as of yet. Another, stronger assumption on the columns of $M$ is separability (see Moitra's work on topic modeling), leading to even stronger results.
% In reality of course, we do not know $X$. Instead, we have access to $\hat X$ which estimates $X$ in the following way. For any set $S$ in any cascade that was active before $i$ was, we obtain a binary signal: $\hat X_S:=0$ if $i$ was infected immediately afterwards and $\hat X_S:=1$ otherwise. As the number of times set $S$ was observed active grows, we have
% $$\frac{1}{m} \sum \hat X_S \rightarrow \mathbb{P}(i \text{ not infected at }t+1 \mid S \text{ active at } t)$$
% Certain sets might have been observed active many times in our set of cascades, others might have been observed only once.
% One possible way to rewrite the program above is:
% $$\hat X_j = H(w_j - e^{(M \theta_{*,i})_j}), \text{ where } w\sim {\cal U}[0,1] \text{ and } H \text{ is the step function } \mathbbm{1}_{\cdot \geq 0}$$
% This problem seems to be closely related to the growing field of 1-bit compressed sensing, with at least one difference: when scientists consider 1-bit CS with noise, they are usually in a position to decide which measurement to take next to recover $\theta_{*,i}$, whereas in this framework, we are limited to an observer status\footnote{Research in the other direction could be very interesting: see discussion about exploration-exploitation with Thibaut.}.
% I need to do more reading on the subject, namely to understand what guarantees people have found for sample complexity.