paper/sections/intro.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158

%\begin{comment}
%A recent line of research has focused on applying advances in sparse recovery to
%graph analysis. A graph can be interpreted as a signal that one seeks to
%`compress' or `sketch' and then `recovered'. However, we could also consider the
%situation where the graph is unknown to us, and we dispose of few measurements
%to recover the signal. Which real-life processes allow us to `measure' the
%graph?

%A diffusion process taking place on a graph can provide valuable information
%about the existence of edges and their edge weights. By observing the sequence
%of nodes which become `infected' over time without knowledge of who has
%`infected' whom, can we recover the graph on which the process takes place? The
%spread of a particular behavior through a network is known as an {\it Influence
%Cascade}. In this context, the {\it Network Inference}\ problem is the recovery
%of the graph's edges from the observation of few influence cascades.
%{\color{red} Cite references} We propose to show how sparse recovery can be
%applied to solve this recently introduced network inference problem.

%{\color{red} Graph inference to Network inference}
%Tackling the graph inference problem means constructing a polynomial-time
%algorithm which recovers with high probability a large majority of edges
%correctly from very few observations or {\it cascades}. Prior work shown that
%the graph inference problem can be solved in ${\cal O}(poly(s) \log m)$ number
%of observed cascades, where $s$ is the maximum degree and $m$ is the number of
%nodes in the graph. Almost miraculously, the number of cascades required to
%reconstruct the graph is logarithmic in the number of nodes of the graph.
%Results in the sparse recovery literature lead us to believe that $\Omega(s \log
%m/s)$ cascade observations should be sufficient to recover the graph. In fact,
%we prove this lower bound in a very general sense: any non-adaptive graph
%inference algorithm which succeeds in recovering the graph with constant
%probability requires $\Omega(s \log m/s)$ observations. We show an almost tight
%upper-bound by applying standard sparse recovery techniques and assumptions:
%${\cal O}(s \log m)$ are sufficient to recover the graph. We show that the edge
%weights themselves can also be recovered under the same assumptions.

 %Throughout this paper, we focus on the three main discrete-time diffusion
 %processes: the independent cascade model, the voter model, and the linear
 %threshold model\dots
 %\end{comment}

Graphs have been extensively studied for their propagative abilities:
connectivity, routing, gossip algorithms, etc.
A diffusion process taking place over a graph provides valuable information
about the presence and weights of its edges. \emph{Influence cascades} are a
specific type of diffusion processes in which a particular infectious behavior
spreads over the nodes of the graph.  By only observing the ``infection times''
of the nodes in the graph, one might hope to recover the underlying graph and
the parameters of the cascade model. This problem is known in the literature as
the \emph{Network Inference problem}.

More precisely, solving the Network Inference problem involves designing an
algorithm taking as input a set of observed cascades (realisations of the
diffusion process) and recovers with high probability a large fraction of the
graph's edges. The goal is then to understand the relationship between the
number of observations, the probability of success, and the accuracy of the
reconstruction.

The Network Inference problem can be decomposed and analyzed ``node-by-node''.
Thus, we will focus on a single node of degree $s$ and discuss how to identify
its parents among the $m$ nodes of the graph. Prior work has shown that the
required number of observed cascades is $\O(poly(s)\log m)$
\cite{Netrapalli:2012, Abrahao:13}.

A more recent line of research~\cite{Daneshmand:2014} has focused on applying
advances in sparse recovery to the graph inference problem. Indeed, the graph
can be interpreted as a ``sparse signal'' measured through influence cascades
and then recovered.  The challenge is that influence cascade models typically
lead to non-linear inverse problems. The sparse recovery literature suggests
that $\Omega(s\log\frac{m}{s})$ cascade observations should be sufficient to
recover the graph~\cite{donoho2006compressed, candes2006near}. However, the
best known upper bound to this day is $\O(s^2\log m)$~\cite{Netrapalli:2012,
Daneshmand:2014}


The contributions of this paper are the following:
\begin{itemize}
    \item we formulate the Graph Inference problem in the context of
        discrete-time influence cascades as a sparse recovery problem for
        a specific type of Generalized Linear Model. This formulation notably
        encompasses the well-studied Independent Cascade Model and Voter Model.
    \item we give an algorithm which recovers the graph's edges using $\O(s\log
        m)$ cascades. Furthermore, we show that our algorithm is also able to
        efficiently recover the edge weights (the parameters of the influence
        model) up to an additive error term,
    \item we show that our algorithm is robust in cases where the signal to
        recover is approximately $s$-sparse by proving guarantees in the
        \emph{stable recovery} setting.
    \item we provide an almost tight lower bound of $\Omega(s\log \frac{m}{s})$
        observations required for sparse recovery.
\end{itemize}

The organization of the paper is as follows: we conclude the introduction by a
survey of the related work. In Section~\ref{sec:model} we present our model of
Generalized Linear Cascades and the associated sparse recovery formulation.  Its
theoretical guarantees are presented for various recovery settings in
Section~\ref{sec:results}. The lower bound is presented in
Section~\ref{sec:lowerbound}. Finally, we conclude with experiments in
Section~\ref{sec:experiments}.

\paragraph{Related Work}

The study of edge prediction in graphs has been an active field of research for
over a decade~\cite{Nowell08, Leskovec07, AdarA05}.~\cite{GomezRodriguez:2010}
introduced the {\scshape Netinf} algorithm, which approximates the likelihood of
cascades represented as a continuous process.  The algorithm was improved in
later work~\cite{gomezbalduzzi:2011}, but is not known to have any theoretical
guarantees beside empirical validation on synthetic networks.
\citet{Netrapalli:2012} studied the discrete-time version of the independent
cascade model and obtained the first ${\cal O}(s^2 \log m)$ recovery guarantee
on general networks. The algorithm is based on a likelihood function similar to
the one we propose, without the $\ell_1$-norm penalty. Their analysis depends on
a {\it correlation decay\/} assumption, which limits the number of new infections
at every step. In this setting, they show a lower bound of the number of
cascades needed for support recovery with constant probability of the order
$\Omega(s \log (m/s))$. They also suggest a {\scshape Greedy} algorithm, which
achieves a ${\cal O}(s \log m)$ guarantee in the case of tree graphs. The work
of~\cite{Abrahao:13} studies the same continuous-model framework as
\cite{GomezRodriguez:2010} and obtains an ${\cal O}(s^9 \log^2 s \log m)$
support recovery algorithm, without the \emph{correlation decay} assumption.
\cite{du2013uncover} propose a similar algorithm to ours for recovering the
weights of the graph under a continuous-time independent cascade model, without
proving theoretical guarantees.

Closest to this work is a recent paper by \citet{Daneshmand:2014}, wherein the
authors consider a $\ell_1$-regularized objective function. They adapt standard
results from sparse recovery to obtain a recovery bound of ${\cal O}(s^3 \log
m)$ under an irrepresentability condition~\cite{Zhao:2006}. Under stronger
assumptions, they match the~\cite{Netrapalli:2012} bound of ${\cal O}(s^2 \log
m)$, by exploiting similar properties of the convex program's KKT conditions.
In contrast, our work studies discrete-time diffusion processes including the
Independent Cascade model under weaker assumptions. Furthermore, we analyze both
the recovery of the graph's edges and the estimation of the model's parameters,
and achieve close to optimal bounds.

The work of~\cite{du2014influence} is slightly orthogonal to ours since they
suggest learning the \emph{influence} function, rather than the networks
parameters directly.
%\begin{comment}
%Their work has the merit of studying a generalization of the discrete-time
%independent cascade model to continuous functions. Similarly to
%\cite{Abrahao:13}, they place themselves in the restrictive single-source
%context.
%\end{comment}

\begin{comment}
\paragraph{Our contributions}
Though our work follows closely the spirit of \cite{Netrapalli:2012} and
\cite{Daneshmand:2014}, we believe we provide several significant improvements
to their work. We study sparse recovery under less restrictive assumptions and
obtain the first ${\cal O}(s \log m)$ algorithm for graph inference from
cascades. We also study the seemingly overlooked problem of weight recovery as
well as the setting of the relaxed sparsity setting. Finally, we show these
results are almost tight, by proving in section~\ref{sec:lowerbound} the first
lower bound on the number of observations required to recover the edges and the
edge weights of a graph in the general case. We study the case of the two best
known diffusion processes for simplicity as outlined in \cite{Kempe:03}, but
many arguments can be extended to more general diffusion processes.
\end{comment}