\documentclass[10pt]{article} \usepackage{fullpage, amsmath, amsthm, amssymb} \usepackage{graphicx, algorithm, algpseudocode} \usepackage[utf8]{inputenc} \newtheorem{theorem}{Theorem} \title{Sparse Recovery for Graph Reconstruction} \author{Eric Balkanski \and Jean Pouget-Abadie} \date{} \begin{document} \maketitle \section{Introduction} Given a set of observed cascades, the graph reconstruction problem consists of finding the underlying graph on which these cascades spread. We explore different models of cascade creation, namely the voter model and the independent cascade model. Our objective is therefore to approximately compute the `inverse' of the cascade generation function from as few observations as possible. \section{Related Work} Several works have tackled the graph reconstruction problem in variants of the independent cascade model. We briefly summarize their results and approaches below. \begin{itemize} \item Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause were among the first to tackle the problem of graph reconstruction from cascade observations \cite{gomez}. Their first method, \textsc{NetInf}, and a series of similar methods published subsequently, optimizes a heuristic objective that approximates an NP-hard maximum-likelihood formulation. Unfortunately, \textsc{NetInf} has no known theoretical guarantees. One difficulty in comparing ourselves to their heuristic is that they assume a continuous-time independent cascade model, which is close to but not adaptable to our framework. Namely, as we will see in Section~\ref{sec:icc}, nodes in our model cannot influence nodes beyond the time step at which they are active. \item Their paper was later followed by a publication from Praneeth Netrapalli and Sujay Sanghavi \cite{netrapalli}, which provides two algorithms with theoretical guarantees. \begin{itemize} \item The first algorithm they discuss solves a convex, parallelizable optimization problem.
The accompanying theoretical guarantee is close-to-perfect reconstruction of `strong' edges after observing ${\cal O}(d^2 \log n)$ cascades, with high probability. This result has several drawbacks, however. \begin{itemize} \item Researchers believe the true number of cascades required is closer to ${\cal O}(d \log n)$. \item The constant hidden in the big ${\cal O}$ is a high-degree polynomial in the model parameters, of the order of billions even for very generous settings of those parameters. \item The authors make a restrictive assumption on the probabilities of each edge, which they call `correlation decay': $\forall i, \sum_{j \in {\cal N}(i)} p_{j,i} < 1 -\alpha$, where $0 < \alpha < 1$. \end{itemize} \item The second algorithm, to which we will compare ourselves in Section~\ref{sec:exp}, is \textsc{Greedy}. It achieves perfect reconstruction after ${\cal O}(d \log n)$ observed cascades with high probability. However, this guarantee has been proven only for tree graphs. The authors mention that it does fairly well in practice on other types of graphs. \end{itemize} \item Finally, a recent paper by Bruno Abrahao, Flavio Chierichetti, Robert Kleinberg, and Alessandro Panconesi gives several algorithms and a different approach to proving theoretical guarantees. Their strongest result is for bounded-degree graphs, where they show that perfect reconstruction can be obtained after observing ${\cal O}(\Delta^9 \log^2 \Delta \log n)$ cascades, where $\Delta$ is the maximum degree of the graph ${\cal G}$. This is weaker than the first algorithm of \cite{netrapalli}, but does not require the `correlation decay' assumption. \end{itemize} What we aim for with the following framework is a theoretical guarantee of close-to-perfect reconstruction after observing ${\cal O}(\Delta \log n)$ cascades, with high probability, without relying on restrictive assumptions such as `correlation decay'.
We also want our method to be easily implementable in practice and to compare favorably to the above algorithms. \section{The Voter Model} \subsection{Description} In the voter model, there are two types of nodes, \textsc{red} and \textsc{blue}. We fix a horizon $T > 0$. At $t=0$, we decide on a set of \textsc{blue} nodes, called the seed set. Every other node is colored \textsc{red}. At each time step, each node $u$ chooses one of its neighbors uniformly at random, i.e., each with probability $1/\text{deg}(u)$, and adopts the color of that neighbor. A cascade in the voter model can be described as the evolution of the set of \textsc{blue} nodes at each time step $0\leq t\leq T$. Several variants of the model can be formulated: nodes may or may not have self-loops (maintaining their color with a certain probability), and the probability that a node is part of the seed set may be uniform or non-uniform, independent or correlated across nodes. For simplicity, we suppose here that nodes do not have self-loops and that each node is part of the seed set independently with probability $p_{\text{init}}$, for $0 < p_{\text{init}} < 1$. \subsection{Formulation} We are going to cast the graph reconstruction problem from voter-model cascades as a sparse recovery problem. We introduce the following notation: \paragraph{Notation} Denote by $(\vec m^t_k)_{1 \leq k \leq m;\, 0 \leq t \leq T}$ the indicator vectors of the \textsc{blue} nodes in cascade $k$ at time step $t$. In other words, $\vec m^t_k$ is a vector in $\mathbb{R}^n$ such that: \[ (\vec m^t_k)_j = \left\{ \begin{array}{l l} 1 & \quad \text{if $j$ is \textsc{blue} in cascade $k$ at time step $t$}\\ 0 & \quad \text{otherwise} \end{array} \right.\] Let $\vec x_i$ be the vector representing the neighbors of a node $i$ in the graph ${\cal G}$.
In other words, $\vec x_i$ is a vector in $\mathbb{R}^n$ such that: \[ (\vec x_{i})_j = \left\{ \begin{array}{l l} 1/\text{deg}(i) & \quad \text{if $j$ is a parent of $i$ in ${\cal G}$}\\ 0 & \quad \text{otherwise} \end{array} \right.\] Let $w^t_{k,i}$ be the scalar representing the probability that node $i$ stays or becomes \textsc{blue} in cascade $k$ at time step $t$. \[ w^t_{k,i} = \left\{ \begin{array}{l l} \frac{\text{Number of \textsc{blue} neighbors of $i$ in $(k, t-1)$}}{\text{deg}(i)} & \quad \text{if $t\neq 0$}\\ p_{\text{init}} & \quad \text{otherwise} \end{array} \right.\] Our objective is to recover the neighbors of each node $i$ in ${\cal G}$, i.e., to recover the vectors $(\vec x_i)_{i \in {\cal G}}$. Notice that for each cascade $k$ and every time step $t \leq T - 1$, we have: \begin{equation} \langle \vec m_k^t, \vec x_i \rangle = \mathbb{P}(\text{node $i$ is \textsc{blue} in cascade $k$ at $t+1$}) = w_{k,i}^{t+1} \nonumber \end{equation} We can collect these equalities in a matrix $M$ whose rows are the row vectors $(\vec m^t_k)^T$ and, for each node $i$, a vector $\vec w_i$ whose entries are the corresponding probabilities: $$M \vec x_i = \vec w_i$$ Note that if $M$ had full column rank, then we could recover $\vec x_i$ from $\vec w_i$. This is, however, an unreasonable assumption, even after having observed many cascades. We must explore which assumptions on $M$ allow us to recover $\vec x_i$ when the number of observed cascades is small. Further note that we do not directly observe $\vec w_i$. Instead, we observe a Bernoulli vector whose $j^{\text{th}}$ entry is equal to $1$ with probability $(\vec w_i)_j$ and $0$ otherwise, corresponding to whether node $i$ became or stayed \textsc{blue} at the next time step.
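This observation process can be simulated end to end. The following sketch (our illustration, not code from any of the cited papers) generates voter-model cascades on a small hypothetical graph, stacks the indicator rows $\vec m^t_k$ into $M$, records the Bernoulli outcomes for one node, and recovers that node's neighborhood; plain least squares stands in for a proper sparse recovery solver, which suffices here because $\mathbb{E}[\vec b_i \mid M] = M \vec x_i$. The graph, cascade count, and decision threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5-node undirected graph (adjacency matrix); purely
# illustrative, not the graph from Figure 1.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
], dtype=float)
n = A.shape[0]
deg = A.sum(axis=1)

def voter_cascade(p_init=0.5, T=3):
    """One voter-model cascade: return blue-indicator vectors m^0, ..., m^T."""
    blue = (rng.random(n) < p_init).astype(float)   # independent seed set
    history = [blue.copy()]
    for _ in range(T):
        new_blue = np.empty(n)
        for u in range(n):
            # u picks a uniformly random neighbor and copies its color
            nbrs = np.flatnonzero(A[u])
            new_blue[u] = blue[rng.choice(nbrs)]
        blue = new_blue
        history.append(blue.copy())
    return history

# Stack the rows m_k^t into M; observe the Bernoulli outcomes b for node i.
i = 0
rows, obs = [], []
for _ in range(300):                    # number of observed cascades
    hist = voter_cascade()
    for t in range(len(hist) - 1):
        rows.append(hist[t])            # row vector m_k^t
        obs.append(hist[t + 1][i])      # did node i become/stay blue at t+1?
M = np.array(rows)
b = np.array(obs)

# Since E[b | M] = M x_i, least squares concentrates around
# x_i = A[i] / deg[i]; thresholding its entries recovers the neighbors.
x_hat, *_ = np.linalg.lstsq(M, b, rcond=None)
support = set(np.flatnonzero(x_hat > 0.5 / deg[i]))
```

With enough cascades, `support` matches the true neighbor set of node `i`; with few cascades, $M$ is far from full column rank, which is exactly the regime where sparsity must be exploited.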
We will denote this vector $\vec b_i$, and denote by $\vec \epsilon_i$ the zero-mean noise vector such that: $$\vec w_i = \vec b_i + \vec \epsilon_i$$ For each node $i$, we can therefore reformulate the problem in terms of our observables: \begin{equation} \label{eq:sparse_voter} M \vec x_i = \vec b_i + \vec \epsilon_i \end{equation} If the vector $\vec x_i$ is sufficiently sparse, i.e., node $i$ has sufficiently few neighbors, then we hope to apply common sparse recovery techniques to solve for $\vec x_i$. \subsection{Example} \begin{figure} \centering \includegraphics[width=0.3\textwidth]{images/voter.png} \caption{A cascade in the voter model with time steps $t = 0,1,2$ over a graph with 5 vertices} \end{figure} To recover the neighbors of $v_1$, we obtain the following matrix $M$ for the example in Figure 1: \begin{align*} &\hspace{0.35cm} v_2 \hspace{0.2 cm} v_3 \hspace{0.2 cm} v_4 \hspace{0.2 cm} v_5 &\\ \vspace{1 cm} M = & \left( \begin{array}{cccc} 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ \end{array} \right) & \begin{array}{l} \hspace{ -4.5 cm} \text{time step 0} \\ \hspace{ - 4.5cm} \text{time step 1} \\ \end{array} \end{align*} and the vector $\vec b_1$: \begin{align*} \vec b_1 = & \left( \begin{array}{c} 1 \\ 1 \\ \end{array} \right) & \begin{array}{l} \hspace{ - 5cm} \text{time step 1} \\ \hspace{ - 5cm} \text{time step 2} \\ \end{array} \end{align*} \section{The Independent Cascade Model} \label{sec:icc} \subsection{Description} In the independent cascade model, nodes have three possible states: susceptible, infected (or active\footnote{The words `infected' and `active' are used interchangeably.}), and inactive. All nodes start either as susceptible or infected. The infected nodes at $t=0$ constitute the seed set. As in the voter model, we will suppose that each node is part of the seed set independently with probability $p_{\text{init}}$.
At each time step $t$, for every pair of nodes $(j,i)$ where $j$ is infected and $i$ is susceptible, $j$ attempts to infect $i$ with probability $p_{j,i}$. If $j$ succeeds, $i$ becomes active at time step $t+1$. Regardless of $j$'s success, node $j$ becomes inactive at time $t+1$: nodes stay active for only one time step. The cascade continues until a time step $T_k$ at which no active nodes remain. \subsection{Formulation} We are going to cast the graph reconstruction problem from independent-cascade-model cascades as a sparse recovery problem. We introduce the following notation: \paragraph{Notation} We adapt slightly the definition of $(\vec m^t_k)$: \[ (\vec m^t_k)_j = \left\{ \begin{array}{l l} 1 & \quad \text{if $j$ is \textsc{active} in cascade $k$ at time step $t$}\\ 0 & \quad \text{otherwise} \end{array} \right.\] Similarly to the voter model, if $\vec p_i$ is the vector representing the `parents' of node $i$ in the graph ${\cal G}$, such that $(\vec p_i)_j$ is the probability that node $j$ infects $i$, then $\vec \theta_i$ is the vector: \[ (\vec \theta_{i})_j = \left\{ \begin{array}{l l} \log ( 1- p_{j,i}) & \quad \text{if $j$ is a parent of $i$ in ${\cal G}$}\\ 0 & \quad \text{otherwise} \end{array} \right.\] Note that if $p_{j,i} = 0$, then $(\vec \theta_i)_j = 0$, so we can write $\vec \theta_i = \log (1 - \vec p_i)$ element-wise. For $t>0$, let $w^t_{k,i}$ be the scalar representing the logarithm of the probability that node $i$ does \emph{not} become \textsc{active} in cascade $k$ at time step $t$, conditioned on the state of cascade $k$ at time step $t-1$.
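The point of the log transform is that it makes the model linear: conditioned on the active set at time $t$, node $i$ escapes infection with probability $\prod_{j \text{ active}} (1 - p_{j,i})$, whose logarithm is exactly the inner product $\langle \vec m^t_k, \vec \theta_i \rangle$. A minimal numpy check of this identity, using hypothetical infection probabilities of our choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical infection probabilities p_{j,i} from each of 6 nodes j onto a
# fixed node i; zeros mark nodes that are not parents of i.
p = np.array([0.0, 0.3, 0.0, 0.5, 0.2, 0.0])
theta = np.log1p(-p)       # (theta_i)_j = log(1 - p_{j,i}); stays 0 where p = 0

# Indicator vector of the nodes active at some time step t (drawn arbitrarily).
m = rng.integers(0, 2, size=p.size).astype(float)

# Probability that i escapes infection at the next step, computed two ways:
direct = np.prod((1.0 - p) ** m)    # product of (1 - p_{j,i}) over active j
linear = np.exp(m @ theta)          # exp of the inner product <m, theta_i>

assert np.isclose(direct, linear)
```

Because non-parents contribute $\log(1 - 0) = 0$ to the sum, inactive and non-adjacent nodes drop out automatically, and recovering the sparse vector $\vec \theta_i$ recovers both the parent set and the probabilities $p_{j,i}$.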
As long as node $i$ has not yet been infected in cascade $k$, we can write, as in the voter model: $$\langle \vec m_k^t, \vec \theta_i \rangle = w^{t+1}_{k,i}$$ As in the voter model, by concatenating in a matrix $M_i$ all row vectors $\vec m^t_{k}$ such that $t < t^k_i$, where $t^k_i$ is the infection time of node $i$ in cascade $k$, we can write: $$M_i \vec \theta_i = \vec w_i$$ Once again, we do not have access to $\vec w_i$. Instead, for $t