Diffstat (limited to 'paper/sections/intro.tex')
| -rw-r--r-- | paper/sections/intro.tex | 62 |
1 file changed, 59 insertions, 3 deletions
diff --git a/paper/sections/intro.tex b/paper/sections/intro.tex
index a0933a5..8e36a37 100644
--- a/paper/sections/intro.tex
+++ b/paper/sections/intro.tex
@@ -1,3 +1,4 @@
+\begin{comment}
 A recent line of research has focused on applying advances in sparse recovery to graph analysis. A graph can be interpreted as a signal that one seeks to `compress' or `sketch' and then `recover'. However, we could also consider the situation where the graph is unknown to us and we have access to only a few measurements with which to recover it. Which real-life processes allow us to `measure' a graph? A diffusion process taking place on a graph can provide valuable information about the existence of edges and their weights. By observing the sequence of nodes which become `infected' over time, without knowledge of who has `infected' whom, can we recover the graph on which the process takes place? The spread of a particular behavior through a network is known as an {\it Influence Cascade}. In this context, the {\it Graph Inference} problem is the recovery of the graph's edges from the observation of a few influence cascades. We propose to show how sparse recovery can be applied to solve this recently introduced graph inference problem.
@@ -5,16 +6,71 @@ A diffusion process taking place on a graph can provide valuable information abo
 Tackling the graph inference problem means constructing a polynomial-time algorithm which, from very few observations or {\it cascades}, recovers a large majority of the edges correctly with high probability. Prior work has shown that the graph inference problem can be solved with ${\cal O}(poly(s) \log m)$ observed cascades, where $s$ is the maximum degree and $m$ is the number of nodes in the graph. Almost miraculously, the number of cascades required to reconstruct the graph is logarithmic in the number of nodes.
 Results in the sparse recovery literature lead us to believe that $\Omega(s \log m/s)$ cascade observations should be necessary to recover the graph. In fact, we prove this lower bound in a very general sense: any non-adaptive graph inference algorithm which succeeds in recovering the graph with constant probability requires $\Omega(s \log m/s)$ observations. We show an almost tight upper bound by applying standard sparse recovery techniques and assumptions: ${\cal O}(s \log m)$ observations are sufficient to recover the graph. We show that the edge weights themselves can also be recovered under the same assumptions. Throughout this paper, we focus on the three main discrete-time diffusion processes: the independent cascade model, the voter model, and the linear threshold model...
+\end{comment}
-\subsection{Related Work}
-\paragraph{Past work}
+A diffusion process taking place over a graph provides valuable information
+about the presence and weights of its edges. \emph{Influence cascades} are a
+specific type of diffusion process in which a particular infection behavior
+spreads over the nodes of the graph. By observing only the ``infection times''
+of the nodes in the graph, one might hope to recover the underlying graph and
+the parameters of the cascade model. This problem is known in the literature
+as the \emph{Graph Inference problem}.
+
+More precisely, solving the Graph Inference problem involves designing an
+algorithm which takes as input a set of observed cascades (realizations of
+the diffusion process) and recovers with high probability a large fraction of
+the graph edges. The goal is then to understand the relationship between the
+number of observations, the probability of success, and the accuracy of the
+reconstruction. Prior work has shown that the typical required number of
+observed cascades is $\O(poly(s)\log m)$, where $m$ is the number of nodes in
+the graph and $s$ is the maximum degree.
+
+A more recent line of research has focused on applying advances in sparse
+inverse problems to the graph inference problem. Indeed, the graph can be
+interpreted as a ``sparse signal'' that is sensed through influence cascades
+and then recovered. The challenge is that influence cascade models typically
+lead to non-linear inverse problems. The sparse recovery literature suggests
+that $\O(s\log\frac{m}{s})$ cascade observations should be sufficient to
+recover the graph. However, the best known upper bound to this day is
+$\O(s^2\log m)$.
+
+The contributions of this paper are the following:
+\begin{itemize}
+    \item we formulate the Graph Inference problem in the context of
+    discrete-time influence cascades as a sparse inverse problem for
+    a specific type of Generalized Linear Model. This formulation notably
+    encompasses the well-studied Independent Cascade Model and Voter Model.
+    \item we give an algorithm which recovers the graph edges using
+    $\O(s\log m)$ cascades. Furthermore, we show that our algorithm is also
+    able to recover the edge weights (the parameters of the influence
+    model), a problem which seems to have been overlooked so far.
+    \item we show that our algorithm is robust in cases where the signal to
+    recover is only approximately $s$-sparse by proving guarantees in the
+    \emph{stable recovery} setting.
+    \item we provide an almost tight lower bound of
+    $\Omega(s\log \frac{m}{s})$ observations.
+\end{itemize}
+
+The organization of the paper is as follows: ...
+
+\paragraph{Related Work}
 The study of edge prediction in graphs has been an active field of research for over a decade. \cite{GomezRodriguez:2010} was one of the first papers to study graph prediction from cascades. They introduce the {\scshape netinf} algorithm, which approximates the likelihood of cascades represented as a continuous process. The algorithm was improved and modified in later work.
 Besides empirical validation on generic networks, {\scshape netinf} is not known to have any theoretical recovery guarantees. \cite{Netrapalli:2012} studied solely the independent cascade model and obtained the first ${\cal O}(\Delta^2 \log m)$ guarantee on general networks. Their algorithm is based around the same likelihood function we suggest, without the $\ell_1$-norm penalty. However, the analysis depended strongly on a restrictive {\it correlation decay} assumption, which limits the number of new infections at every step. In this restricted setting, they show a lower bound of roughly $\Omega(\Delta \log (m/\Delta))$ observations for perfect support recovery with constant probability. The work of \cite{Abrahao:13} studies the same continuous-model framework as \cite{GomezRodriguez:2010} and obtains a ${\cal O}(\Delta^9 \log^2 \Delta \log m)$ support recovery algorithm. Their work also studies the information leveraged by different `parts' of the cascade, showing that a surprisingly large amount of information can be gleaned from the first `infections' of the cascade. We will reach a similar conclusion in section~\ref{sec:assumptions}. However, their model assumes a single source, selected at random, which is less realistic in practice. Oftentimes, `patient 0' is unknown to us, and a multi-source model intuitively captures the more common situation of delayed observation of a cascade.
-The recent work of \cite{Daneshmand:2014} is very similar to our own, they consider a $\ell1$-norm penalty to their objective function, adapting standard results from sparse recovery to obtain a ${\cal O}(\Delta^3 \log m)$ algorithm under an irrepresentability condition. With stronger assumptions, they match the \cite{Netrapalli:2012} bound of ${\cal O}(\Delta^2 \log m)$, by exploiting a similar proof technique based around the KKT conditions of the objective function.
-Their work has the merit of studying a general framework of continuous functions. Similarly to \cite{Abrahao:13}, they place themselves in the restrictive single-source context.
+Closest to this work is a recent paper by \cite{Daneshmand:2014}, which adds
+an $\ell_1$-norm penalty to the objective function and adapts standard
+results from sparse recovery to obtain a ${\cal O}(\Delta^3 \log m)$
+algorithm under an irrepresentability condition. With stronger assumptions,
+they match the \cite{Netrapalli:2012} bound of ${\cal O}(\Delta^2 \log m)$ by
+exploiting a similar proof technique based around the KKT conditions of the
+objective function. Their work has the merit of studying a general framework
+of continuous functions. Similarly to \cite{Abrahao:13}, they place
+themselves in the restrictive single-source context.
+\begin{comment}
 \paragraph{Our contributions}
 Though our work follows closely the spirit of \cite{Netrapalli:2012} and \cite{Daneshmand:2014}, we believe we provide several significant improvements on their work. We study sparse recovery under less restrictive assumptions and obtain the first ${\cal O}(\Delta \log m)$ algorithm for graph inference from cascades. We also study the seemingly overlooked problem of weight recovery, as well as the relaxed sparsity setting. Finally, we show these results are almost tight by proving in section~\ref{sec:lowerbound} the first lower bound on the number of observations required to recover the edges and edge weights of a graph in the general case. For simplicity, we study the two best-known diffusion processes, as outlined in \cite{Kempe:03}, but many arguments can be extended to more general diffusion processes.
+\end{comment}
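The sparse-recovery intuition that the intro leans on (an $s$-sparse unknown, roughly $s \log m$ measurements, an $\ell_1$ penalty) can be sketched on the *linear* analogue of the problem. This is a toy illustration only, not the cascade-based algorithm of the paper: the Gaussian sensing matrix stands in for the non-linear cascade observations, and the dimensions, ISTA solver, and regularization value are arbitrary choices for the demo.

```python
import numpy as np

# Toy linear-sensing analogue of sparse graph recovery: the unknown vector
# plays the role of one row of the adjacency matrix (s nonzero edge weights
# out of m potential neighbors). All constants here are illustrative.
rng = np.random.default_rng(0)
m, s, n = 200, 5, 60  # nodes, max degree, number of "observations"

x_true = np.zeros(m)
support = rng.choice(m, size=s, replace=False)
x_true[support] = rng.uniform(0.5, 1.0, size=s)  # edge weights

# Random Gaussian measurements stand in for the (non-linear) cascade sensing.
A = rng.standard_normal((n, m)) / np.sqrt(n)
y = A @ x_true

# ISTA: solves min_x 0.5*||Ax - y||^2 + lam*||x||_1 by a gradient step
# followed by soft-thresholding.
lam = 0.01
step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of gradient
x = np.zeros(m)
for _ in range(3000):
    z = x - step * (A.T @ (A @ x - y))
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)

# Edges recovered = coordinates with non-negligible estimated weight.
recovered = set(int(i) for i in np.flatnonzero(np.abs(x) > 0.25))
true_support = set(int(i) for i in support)
```

With $n = 60 \ll m = 200$ measurements the $\ell_1$ penalty drives the off-support coordinates to zero, which is exactly the role it plays in the penalized-likelihood approach discussed above; the cascade setting replaces the linear model $y = Ax$ with a Generalized Linear Model.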
