\documentclass[10pt]{article}
\usepackage{fullpage, amsmath, amsthm, amssymb}
\usepackage{graphicx, algorithm, algpseudocode}
\usepackage[utf8x]{inputenc}
\newtheorem{theorem}{Theorem}
\title{Sparse Recovery for Graph Reconstruction}
\author{Eric Balkanski \and Jean Pouget-Abadie}
\date{}
\begin{document}
\maketitle

\section{Introduction}
Given a set of observed cascades, the \textbf{graph reconstruction problem} consists of finding the underlying graph on which these cascades spread. We explore two models of cascade creation, namely the \textbf{voter model} and the \textbf{independent cascade model}.

\section{Related Work}
In previous work, this problem has been formulated in several ways, including as a convex optimization problem and as a maximum likelihood problem. However, no known algorithm for graph reconstruction offers both theoretical guarantees and a reasonable required sample size.

\section{The Voter Model}
\subsection{Description}
In the voter model, there are two types of nodes, \textsc{red} and \textsc{blue}. We fix a horizon $T > 0$. At $t=0$, we choose a set of \textsc{blue} nodes, called the seed set. Every other node is colored \textsc{red}. At each time step, each node $u$ chooses one of its neighbors uniformly at random, i.e.\ with probability $1/\deg(u)$, and adopts the color of that neighbor. A cascade in the voter model can be described as the evolution of the set of \textsc{blue} nodes over the time steps $0 \leq t \leq T$.
Several variants of the model can be formulated: nodes may or may not have self-loops (and hence maintain their color with some probability), and the probability that a node belongs to the seed set may be uniform or non-uniform, independent or correlated across nodes. For simplicity, we suppose here that nodes do not have self-loops and that each node belongs to the seed set independently with probability $p_{\text{init}}$, for some $0 < p_{\text{init}} < 1$.

\subsection{Formulation}
We formulate the graph reconstruction problem from voter-model cascades as a sparse recovery problem. We introduce the following notation.

\paragraph{Notation} Denote by $(\vec m^t_k)_{1 \leq k \leq m,\ 0 \leq t \leq T}$ the indicator vectors of the \textsc{blue} nodes in cascade $k$ at time step $t$. In other words, $\vec m^t_k$ is a vector in $\mathbb{R}^n$ such that:
\[ (\vec m^t_k)_j = \left\{ \begin{array}{l l} 1 & \quad \text{if $j$ is \textsc{blue} in cascade $k$ at time step $t$}\\ 0 & \quad \text{otherwise} \end{array} \right.\]
Let $\vec x_i$ be the vector representing the neighbors (parents) of node $i$ in the graph ${\cal G}$. In other words, $\vec x_i$ is a vector in $\mathbb{R}^n$ such that:
\[ (\vec x_{i})_j = \left\{ \begin{array}{l l} 1/\deg(i) & \quad \text{if $j$ is a parent of $i$ in ${\cal G}$}\\ 0 & \quad \text{otherwise} \end{array} \right.\]
Let $w^t_{k,i}$ be the scalar representing the probability that node $i$ stays or becomes \textsc{blue} in cascade $k$ at time step $t$, conditioned on the state of the cascade at time step $t-1$:
\[ w^t_{k,i} = \left\{ \begin{array}{l l} \frac{\text{number of \textsc{blue} neighbors of $i$ in cascade $k$ at time step $t-1$}}{\deg(i)} & \quad \text{if $t \neq 0$}\\ p_{\text{init}} & \quad \text{if $t = 0$} \end{array} \right.\]
Our objective is to recover the neighbors of each node $i$ in ${\cal G}$, which corresponds to recovering the vectors $(\vec x_i)_{i \in {\cal G}}$.
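For concreteness, consider a hypothetical node $i$ with exactly three neighbors $a$, $b$, and $c$, two of which, say $a$ and $b$, are \textsc{blue} in cascade $k$ at time step $t-1$. Then
\[ (\vec x_i)_a = (\vec x_i)_b = (\vec x_i)_c = \frac{1}{3}, \qquad (\vec x_i)_j = 0 \ \text{for every other $j$}, \qquad w^t_{k,i} = \frac{2}{3}. \]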
Notice that for each cascade $k$ and every time step $0 \leq t \leq T - 1$, we have:
\begin{equation} \langle \vec m_k^t, \vec x_i \rangle = \mathbb{P}\left(\text{node $i$ is \textsc{blue} in cascade $k$ at time step $t+1$} \mid \vec m^t_k\right) = w_{k,i}^{t+1} \nonumber \end{equation}
We can concatenate these equalities in matrix form: let $M$ be the matrix whose rows are the row vectors $(\vec m^t_k)^T$ and, for each node $i$, let $\vec w_i$ be the vector whose entries are the corresponding probabilities, so that
$$M \vec x_i = \vec w_i$$
Note that if $M$ had full column rank, then we could recover $\vec x_i$ from $\vec w_i$. This is, however, an unreasonable assumption, even after observing many cascades. We must therefore explore which assumptions on $M$ allow us to recover $\vec x_i$ when the number of observed cascades is small. Further note that we do not directly observe $\vec w_i$. Instead, we observe a Bernoulli vector, whose $j^{\text{th}}$ entry is equal to $1$ with probability $(\vec w_i)_j$ and $0$ otherwise, corresponding to whether node $i$ became or stayed \textsc{blue} at the next time step. We denote this vector by $\vec b_i$, and we denote by $\vec \epsilon_i$ the zero-mean noise vector such that:
$$\vec w_i = \vec b_i + \vec \epsilon_i$$
For each node $i$, we can therefore reformulate the problem in terms of our observables:
\begin{equation} \label{eq:sparse_voter} M \vec x_i = \vec b_i + \vec \epsilon_i \end{equation}
If the vector $\vec x_i$ is sufficiently sparse, i.e.\ node $i$ has sufficiently few neighbors, then we hope to apply common sparse recovery techniques to solve for $\vec x_i$, as illustrated at the end of this section.

\subsection{Example}
\begin{figure}
\centering
\includegraphics[width=0.3\textwidth]{images/voter.png}
\caption{A cascade in the voter model with time steps $t = 0,1,2$ over a graph with 5 vertices}
\label{fig:voter}
\end{figure}
To recover the neighbors of $v_1$ in the example of Figure~\ref{fig:voter}, we obtain the following matrix $M$:
\begin{align*}
& \hspace{0.35cm} v_2 \hspace{0.2cm} v_3 \hspace{0.2cm} v_4 \hspace{0.2cm} v_5 \\
M = & \left( \begin{array}{cccc}
0 & 0 & 1 & 1 \\
1 & 1 & 0 & 0 \\
\end{array} \right) \quad
\begin{array}{l}
\text{time step 0} \\
\text{time step 1} \\
\end{array}
\end{align*}
and the vector $\vec b_1$:
\begin{align*}
\vec b_1 = & \left( \begin{array}{c}
1 \\
1 \\
\end{array} \right) \quad
\begin{array}{l}
\text{time step 1} \\
\text{time step 2} \\
\end{array}
\end{align*}
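One common sparse recovery technique for problems of the form of equation~\eqref{eq:sparse_voter} is $\ell_1$-regularized least squares; for instance, since the entries of $\vec x_i$ are non-negative, one may solve
\[ \hat{\vec x}_i \in \arg\min_{\vec x \geq 0} \ \frac{1}{2} \| M \vec x - \vec b_i \|_2^2 + \lambda \| \vec x \|_1, \]
where $\lambda > 0$ controls the sparsity of the solution. The following is a minimal sketch of this approach, assuming NumPy and scikit-learn are available; the function name \texttt{recover\_neighbors}, the regularization strength \texttt{alpha}, and the support threshold are illustrative choices rather than prescriptions.
\begin{verbatim}
# Minimal sketch: recover the support of x_i from M x_i ~ b_i
# via non-negative l1-regularized least squares (Lasso).
import numpy as np
from sklearn.linear_model import Lasso

def recover_neighbors(M, b, alpha=0.01, threshold=1e-3):
    # M: one row per observed (cascade, time step) pair, one column per node
    # b: 0/1 outcomes for node i at the following time step
    # positive=True encodes that the entries of x_i are non-negative;
    # fit_intercept=False because the model M x_i = w_i has no offset.
    lasso = Lasso(alpha=alpha, positive=True,
                  fit_intercept=False, max_iter=10000)
    lasso.fit(M, b)
    x_hat = lasso.coef_
    # Declare j a neighbor of i if its estimated weight is non-negligible.
    neighbors = np.flatnonzero(x_hat > threshold)
    return neighbors, x_hat
\end{verbatim}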
\section{The Independent Cascade Model}
\subsection{Description}
In the independent cascade model, nodes have three possible states: susceptible, infected or active\footnote{The words `infected' and `active' are used interchangeably.}, and inactive. All nodes start either as susceptible or infected. The infected nodes at $t=0$ constitute the seed set. Similarly to the voter model, we suppose that each node is part of the seed set independently with probability $p_{\text{init}}$. At each time step $t$, for every pair of nodes $(j,i)$ where $j$ is infected and $i$ is susceptible, $j$ attempts to infect $i$ with probability $p_{j,i}$. If $j$ succeeds, $i$ becomes active at time step $t+1$. Regardless of $j$'s success, node $j$ becomes inactive at time step $t+1$: nodes stay active for only one time step. The cascade continues until a time step $T_k$ at which no active nodes remain.

\subsection{Formulation}
We formulate the graph reconstruction problem from independent-cascade-model cascades as a sparse recovery problem. We introduce the following notation.

\paragraph{Notation} We slightly adapt the definition of $\vec m^t_k$:
\[ (\vec m^t_k)_j = \left\{ \begin{array}{l l} 1 & \quad \text{if $j$ is \textsc{active} in cascade $k$ at time step $t$}\\ 0 & \quad \text{otherwise} \end{array} \right.\]
Similarly to the voter model, if $\vec p_i$ is the vector representing the `parents' of node $i$ in the graph ${\cal G}$, such that $(\vec p_i)_j$ is the probability that node $j$ infects $i$, then $\vec \theta_i$ is the vector:
\[ (\vec \theta_{i})_j = \left\{ \begin{array}{l l} \log ( 1- p_{j,i}) & \quad \text{if $j$ is a parent of $i$ in ${\cal G}$}\\ 0 & \quad \text{otherwise} \end{array} \right.\]
Note that if $p_{j,i} = 0$, then $(\vec \theta_i)_j = 0$, so we can write $\vec \theta_i = \log (1 - \vec p_i)$ element-wise.
For $t>0$, let $w^t_{k,i}$ be the logarithm of the probability that node $i$ does \emph{not} become \textsc{active} in cascade $k$ at time step $t$, conditioned on the state of cascade $k$ at time step $t-1$. As long as node $i$ has not yet been infected in cascade $k$, we can observe, as in the voter model, that:
$$\langle \vec m_k^t, \vec \theta_i \rangle = w^{t+1}_{k,i}$$
As in the voter model, by concatenating in a matrix $M_i$ all row vectors $\vec m^t_k$ such that $t < t^k_i$, where $t^k_i$ is the infection time of node $i$ in cascade $k$, we can write:
$$M_i \vec \theta_i = \vec w_i$$
where the entries of $\vec w_i$ are the corresponding values $w^{t+1}_{k,i}$. Once again, we do not have access to $\vec w_i$. Instead, for $t < t^k_i$, we only observe whether node $i$ became \textsc{active} at time step $t+1$, an event which, conditioned on the state of the cascade at time step $t$, occurs with probability $1 - e^{w^{t+1}_{k,i}}$.
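Note that once an estimate $\hat{\vec \theta}_i$ of $\vec \theta_i$ has been obtained by some sparse recovery procedure, the infection probabilities themselves can be read off by inverting the definition of $\vec \theta_i$ element-wise:
\[ \hat p_{j,i} = 1 - e^{(\hat{\vec \theta}_i)_j}. \]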