Diffstat (limited to 'paper')
-rw-r--r--   paper/sections/appendix.tex     25
-rw-r--r--   paper/sections/discussion.tex   37
-rw-r--r--   paper/sections/lowerbound.tex   103
-rw-r--r--   paper/sections/model.tex        63
-rw-r--r--   paper/sections/results.tex      75
5 files changed, 173 insertions, 130 deletions
diff --git a/paper/sections/appendix.tex b/paper/sections/appendix.tex
index 79baf2f..64953a3 100644
--- a/paper/sections/appendix.tex
+++ b/paper/sections/appendix.tex
@@ -74,6 +74,31 @@ the proof. Proposition~\ref{prop:fi} follows. \end{proof} +\subsection{Lower Bound} + +Let us consider an algorithm $\mathcal{A}$ which verifies the recovery +guarantee of Theorem~\ref{thm:lb}: there exists a probability distribution over +measurements such that for all vectors $\theta^*$, \eqref{eq:lb} holds w.p. +$\delta$. This implies by the probabilistic method that for all distributions +$D$ over vectors $\theta$, there exists an $n\times m$ measurement matrix $X_D$ +such that \eqref{eq:lb} holds w.p. $\delta$ ($\theta$ is now the random +variable). + +Consider the following distribution $D$: choose $S$ uniformly at random from a +``well-chosen'' set of $s$-sparse supports $\mathcal{F}$ and $t$ uniformly at +random from +$X \defeq\big\{t\in\{-1,0,1\}^m\,|\, \mathrm{supp}(t)\in\mathcal{F}\big\}$. +Define $\theta = t + w$ where +$w\sim\mathcal{N}(0, \alpha\frac{s}{m}I_m)$ and $\alpha = \Omega(\frac{1}{C})$. + +Consider the following communication game between Alice and Bob: \emph{(1)} +Alice sends $y\in\reals^m$ drawn from a Bernoulli distribution of parameter +$f(X_D\theta)$ to Bob. \emph{(2)} Bob uses $\mathcal{A}$ to recover +$\hat{\theta}$ from $y$. It can be shown that at the end of the game Bob now +has a quantity of information $\Omega(s\log \frac{m}{s})$ about $S$. By the +Shannon-Hartley theorem, this information is also upper-bounded by $\O(n\log +C)$. These two bounds together imply the theorem. + \subsection{Other continuous time processes binned to ours: proportional hazards model}
diff --git a/paper/sections/discussion.tex b/paper/sections/discussion.tex
index 52eb6c5..03e7ff2 100644
--- a/paper/sections/discussion.tex
+++ b/paper/sections/discussion.tex
@@ -1,12 +1,14 @@ Solving the Graph Inference problem with sparse recovery techniques opens new avenues for future work. Firstly, the sparse recovery literature has already studied regularization patterns beyond the $\ell_1$-norm, notably the -thresholded and adaptive lasso \cite{vandegeer:2011, Zou:2006}. Another goal +thresholded and adaptive lasso~\cite{vandegeer:2011, Zou:2006}. Another goal would be to obtain confidence intervals for our estimator, similarly to what has been obtained for the Lasso in the recent series of papers \cite{javanmard2014, zhang2014}. -Finally, the linear threshold model is a commonly studied diffusion process and can also be cast as a \emph{generalized linear cascade} with inverse link function $z \mapsto \mathbbm{1}_{z > 0}$: +Finally, the linear threshold model is a commonly studied diffusion process and +can also be cast as a \emph{generalized linear cascade} with inverse link +function $z \mapsto \mathbbm{1}_{z > 0}$: $ \label{eq:lt} X^{t+1}_j = \text{sign} \left(\inprod{\theta_j}{X^t} - t_j \right)
@@ -16,21 +18,33 @@ This model therefore falls into the 1-bit compressed sensing framework guarantees obtained for 1-bit compressed sensing with specific measurements \cite{Gupta:2010, Plan:2014}. Whilst they obtained bounds of the order ${\cal O}(n \log \frac{m}{s})$, no current theory exists for recovering -positive bounded signals from biniary measurememts. This research direction +positive bounded signals from binary measurements.
This research direction may provide the first clues to solve the ``adaptive learning'' problem: if we are allowed to adaptively \emph{choose} the source nodes at the beginning of each cascade, how much can we improve the current results? \begin{comment} -The Linear Threshold model can \emph{also} be cast a generalized linear cascade model. However, as we show below, its link function is non-differentiable and necessitates a different analysis. In the Linear Threshold Model, each node $j\in V$ has a threshold $t_j$ from the interval $[0,1]$ and for each node, the sum of incoming weights is less than $1$: $\forall j\in V$, $\sum_{i=1}^m \Theta_{i,j} \leq 1$. +The Linear Threshold model can \emph{also} be cast as a generalized linear cascade +model. However, as we show below, its link function is non-differentiable and +necessitates a different analysis. In the Linear Threshold Model, each node +$j\in V$ has a threshold $t_j$ from the interval $[0,1]$ and for each node, the +sum of incoming weights is less than $1$: $\forall j\in V$, $\sum_{i=1}^m +\Theta_{i,j} \leq 1$. -Nodes have two states: susceptible or active. At each time step, each susceptible -node $j$ becomes active if the sum of the incoming weights from active parents is greater than $j$'s threshold $t_j$. Nodes remain active until the end of the cascade, which is reached when no new susceptible nodes become active. As such, the source nodes are chosen, the process is entirely deterministic. Let $X^t$ be the indicator variable of the set of active nodes at time step $t-1$, then: +Nodes have two states: susceptible or active. At each time step, each +susceptible node $j$ becomes active if the sum of the incoming weights from +active parents is greater than $j$'s threshold $t_j$. Nodes remain active until +the end of the cascade, which is reached when no new susceptible nodes become +active. Once the source nodes are chosen, the process is entirely +deterministic. Let $X^t$ be the indicator variable of the set of active nodes +at time step $t-1$, then: \begin{equation} X^{t+1}_j = \mathbbm{1}_{\sum_{i=1}^m \Theta_{i,j}X^t_i > t_j} = \mathbbm{1}_{\inprod{\theta_j}{X^t} > t_j} \end{equation} -where we defined again $\theta_j\defeq (\Theta_{1,j},\ldots,\Theta_{m,j})$. In other words, for every step in the linear threshold model, we observe the following signal: +where we defined again $\theta_j\defeq (\Theta_{1,j},\ldots,\Theta_{m,j})$. In +other words, for every step in the linear threshold model, we observe the +following signal: \begin{equation} \label{eq:lt}
@@ -38,5 +52,12 @@ where we defined again $\theta_j\defeq (\Theta_{1,j},\ldots,\Theta_{m,j})$. In o X^{t+1}_j = \text{sign} \left(\inprod{\theta_j}{X^t} - t_j \right) \end{equation} -The link function of the linear threshold model is the sign function: $z \mapsto \mathbbm{1}_{z > 0}$. This model therefore falls into the 1-bit compressed sensing model \cite{Boufounos:2008} framework. Several recent papers study the theoretical guarantees obtained for 1-bit compressed sensing with specific measurements \cite{Gupta:2010}, \cite{Plan:2014}. Whilst they obtained bounds of the order ${\cal O}(n \log \frac{n}{s}$), no current theory exists for recovering positive bounded signals from bernoulli hyperplanes. We leave this research direction to future work. +The link function of the linear threshold model is the sign function: $z +\mapsto \mathbbm{1}_{z > 0}$. This model therefore falls into the 1-bit +compressed sensing model \cite{Boufounos:2008} framework.
Several recent papers +study the theoretical guarantees obtained for 1-bit compressed sensing with +specific measurements \cite{Gupta:2010}, \cite{Plan:2014}. Whilst they obtained +bounds of the order ${\cal O}(n \log \frac{n}{s})$, no current theory exists +for recovering positive bounded signals from Bernoulli hyperplanes. We leave +this research direction to future work. \end{comment}
diff --git a/paper/sections/lowerbound.tex b/paper/sections/lowerbound.tex
index a7ae21c..36fbbbe 100644
--- a/paper/sections/lowerbound.tex
+++ b/paper/sections/lowerbound.tex
@@ -10,10 +10,10 @@ problems in \cite{pw11, pw12, bipw11}. \begin{theorem} \label{thm:lb} - Let us consider a cascade model of the form \eqref{eq:glm} and a recovery algorithm - $\mathcal{A}$ which takes as input $n$ random cascade measurements and - outputs $\hat{\theta}$ such that with probability $\delta>\frac{1}{2}$ - (over the measurements): + Let us consider a cascade model of the form \eqref{eq:glm} and a recovery + algorithm $\mathcal{A}$ which takes as input $n$ random cascade + measurements and outputs $\hat{\theta}$ such that with probability + $\delta>\frac{1}{2}$ (over the measurements): \begin{equation} \label{eq:lb} \|\hat{\theta}-\theta^*\|_2\leq
@@ -23,57 +23,58 @@ problems in \cite{pw11, pw12, bipw11}. = \Omega(s\log\frac{m}{s}/\log C)$. \end{theorem} -This theorem should be contrasted with Theorem~\ref{thm:approx_sparse}: up to an additive -$s\log s$ factor, the number of measurements required by our algorithm is -tight. The proof of Theorem~\ref{thm:lb} follows an approach similar to \cite{pw12}. -We only present a sketch of the proof here and refer the reader to their paper -for more details. +This theorem should be contrasted with Theorem~\ref{thm:approx_sparse}: up to +an additive $s\log s$ factor, the number of measurements required by our +algorithm is tight. The proof of Theorem~\ref{thm:lb} follows an approach +similar to \cite{pw12}. We present a sketch of the proof in the Appendix and +refer the reader to their paper for more details. -Let us consider an algorithm $\mathcal{A}$ which verifies the recovery -guarantee of Theorem~\ref{thm:lb}: there exists a probability distribution over -measurements such that for all vectors $\theta^*$, \eqref{eq:lb} holds w.p. -$\delta$. This implies by the probabilistic method that for all distribution -$D$ over vectors $\theta$, there exists an $n\times m$ measurement matrix $X_D$ with -such that \eqref{eq:lb} holds w.p. $\delta$ ($\theta$ is now the -random variable). +%Let us consider an algorithm $\mathcal{A}$ which verifies the recovery +%guarantee of Theorem~\ref{thm:lb}: there exists a probability distribution over +%measurements such that for all vectors $\theta^*$, \eqref{eq:lb} holds w.p. +%$\delta$. This implies by the probabilistic method that for all distribution +%$D$ over vectors $\theta$, there exists an $n\times m$ measurement matrix $X_D$ +%with such that \eqref{eq:lb} holds w.p. $\delta$ ($\theta$ is now the random +%variable). -Consider the following distribution $D$: choose $S$ -uniformly at random from a ``well-chosen'' set of $s$-sparse supports -$\mathcal{F}$ and $t$ uniformly at random from $X -\defeq\big\{t\in\{-1,0,1\}^m\,|\, \mathrm{supp}(t)\in\mathcal{F}\big\}$. Define -$\theta = t + w$ where $w\sim\mathcal{N}(0, \alpha\frac{s}{m}I_m)$ and $\alpha -= \Omega(\frac{1}{C})$.
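For concreteness, the hard input distribution $D$ above can be sampled as follows. This is a minimal Python sketch with illustrative sizes; it uses a uniformly random $s$-sparse support in place of the linear-code family $\mathcal{F}$ of \cite{pw11}, and all names and constants are hypothetical rather than taken from the paper:

    import numpy as np

    def sample_hard_instance(m=1000, s=10, C=100, seed=0):
        """Sample theta = t + w from the lower-bound distribution D (illustrative)."""
        rng = np.random.default_rng(seed)
        # Support S: the proof draws S from a linear-code family F of s-sparse
        # supports; a uniformly random support suffices to illustrate the shape of D.
        S = rng.choice(m, size=s, replace=False)
        # t is a {-1, 0, 1} vector supported exactly on S.
        t = np.zeros(m)
        t[S] = rng.choice([-1.0, 1.0], size=s)
        # Gaussian perturbation w ~ N(0, alpha * (s/m) * I_m) with alpha = Theta(1/C).
        alpha = 1.0 / C
        w = rng.normal(scale=np.sqrt(alpha * s / m), size=m)
        return S, t + w

The argument then plays the Alice-and-Bob game on such instances: the Bernoulli measurements give Bob $\Omega(s\log\frac{m}{s})$ bits about $S$ while carrying at most $\O(n\log C)$ bits, which is where the two bounds meet.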
+%Consider the following distribution $D$: choose $S$ uniformly at random from a +%``well-chosen'' set of $s$-sparse supports $\mathcal{F}$ and $t$ uniformly at +%random from +%$X \defeq\big\{t\in\{-1,0,1\}^m\,|\, \mathrm{supp}(t)\in\mathcal{F}\big\}$. +%Define $\theta = t + w$ where +%$w\sim\mathcal{N}(0, \alpha\frac{s}{m}I_m)$ and $\alpha = \Omega(\frac{1}{C})$. -Consider the following communication game between Alice and Bob: \emph{(1)} Alice sends $y\in\reals^m$ drawn from a Bernouilli distribution of parameter -$f(X_D\theta)$ to Bob. \emph{(2)} Bob uses $\mathcal{A}$ to recover $\hat{\theta}$ from $y$. -It can be shown that at the end of the game Bob now has a quantity of -information $\Omega(s\log \frac{m}{s})$ about $S$. By the Shannon-Hartley -theorem, this information is also -upper-bounded by $\O(n\log C)$. These two bounds together imply the theorem. +%Consider the following communication game between Alice and Bob: \emph{(1)} +%Alice sends $y\in\reals^m$ drawn from a Bernouilli distribution of parameter +%$f(X_D\theta)$ to Bob. \emph{(2)} Bob uses $\mathcal{A}$ to recover +%$\hat{\theta}$ from $y$. It can be shown that at the end of the game Bob now +%has a quantity of information $\Omega(s\log \frac{m}{s})$ about $S$. By the +%Shannon-Hartley theorem, this information is also upper-bounded by $\O(n\log +%C)$. These two bounds together imply the theorem. -\begin{comment} -\begin{lemma} - \label{lemma:upperinf} - Let $S'=\mathrm{supp}(\hat{t})$. If $\alpha = \Omega(\frac{1}{C})$, $I(S, S') - = \O(n\log C)$. -\end{lemma} +%\begin{comment} +%\begin{lemma} + %\label{lemma:upperinf} + %Let $S'=\mathrm{supp}(\hat{t})$. If $\alpha = \Omega(\frac{1}{C})$, $I(S, + %S') = \O(n\log C)$. +%\end{lemma} -\begin{lemma} - \label{lemma:lowerinf} - If $\alpha = \Omega(\frac{1}{C})$, $I(S, S') = \Omega(s\log\frac{m}{s})$. -\end{lemma} +%\begin{lemma} + %\label{lemma:lowerinf} + %If $\alpha = \Omega(\frac{1}{C})$, $I(S, S') = \Omega(s\log\frac{m}{s})$. +%\end{lemma} -Lemmas \ref{lemma:lowerinf} and \ref{lemma:upperinf} together give Theorem -\ref{thm:lb}. +%Lemmas \ref{lemma:lowerinf} and \ref{lemma:upperinf} together give Theorem +%\ref{thm:lb}. -Formally, let $\mathcal{F}\subset \{S\subset [1..m]\,|\, |S|=s\}$ be a family -of $s$-sparse supports such that: -\begin{itemize} - \item $|S\Delta S'|\geq s$ for $S\neq S'\in\mathcal{F}$ . - \item $\P_{S\in\mathcal{F}}[i\in S]=\frac{s}{m}$ for all $i\in [1..m]$ and - when $S$ is chosen uniformly at random in $\mathcal{F}$ - \item $\log|\mathcal{F}| = \Omega(s\log\frac{m}{s})$ -\end{itemize} -such a family can be obtained by considering a linear code (see details in -\cite{pw11}. Let . -\end{comment} +%Formally, let $\mathcal{F}\subset \{S\subset [1..m]\,|\, |S|=s\}$ be a family +%of $s$-sparse supports such that: +%\begin{itemize} + %\item $|S\Delta S'|\geq s$ for $S\neq S'\in\mathcal{F}$ . + %\item $\P_{S\in\mathcal{F}}[i\in S]=\frac{s}{m}$ for all $i\in [1..m]$ and + %when $S$ is chosen uniformly at random in $\mathcal{F}$ + %\item $\log|\mathcal{F}| = \Omega(s\log\frac{m}{s})$ +%\end{itemize} +%such a family can be obtained by considering a linear code (see details in +%\cite{pw11}. Let . +%\end{comment}
diff --git a/paper/sections/model.tex b/paper/sections/model.tex
index acec974..b704b9e 100644
--- a/paper/sections/model.tex
+++ b/paper/sections/model.tex
@@ -107,9 +107,9 @@ In the independent cascade model, nodes can be either susceptible, contagious or immune.
At $t=0$, all source nodes are ``contagious'' and all remaining nodes are ``susceptible''. At each time step $t$, for each edge $(i,j)$ where $j$ is susceptible and $i$ is contagious, $i$ attempts to infect $j$ with -probability $p_{i,j}\in(0,1]$; the infection attempts are mutually independent. +probability $p_{i,j}\in[0,1]$; the infection attempts are mutually independent. If $i$ succeeds, $j$ will become contagious at time step $t+1$. Regardless of -$i$'s success, node $i$ will be immune at time $t+1$. In other words, nodes +$i$'s success, node $i$ will be immune at time $t+1$, so that nodes stay contagious for only one time step. The cascade process terminates when no contagious nodes remain.
@@ -130,9 +130,10 @@ Defining $\Theta_{i,j} \defeq \log(1-p_{i,j})$, this can be rewritten as: Therefore, the independent cascade model is a Generalized Linear Cascade model with inverse link function $f : z \mapsto 1 - e^z$. -In Section~\ref{sec:results}, we show bounds on the parameter of the graph. -Fortunately, a bound on $\|\hat\theta - \theta^*\|_2$ directly implies a bound -on $\|\hat p - p^*\|_2$ as shown by the following lemma: +In Section~\ref{sec:results}, we show bounds on the \emph{transformed} +parameters $\Theta$ of the graph. Fortunately, a bound on $\|\hat\theta - +\theta^*\|_2$ directly implies a bound on $\|\hat p - p^*\|_2$ as shown by the +following lemma: \begin{lemma} $\|\hat{\theta} - \theta^* \|_2 \geq \|\hat{p} - p^*\|_2$.
@@ -146,15 +147,14 @@ $|\log (1 - p) - \log (1-p')| \geq \max(1 - \frac{1-p}{1-p'}, \subsubsection{The Linear Voter Model} In the Linear Voter Model, nodes can be either \emph{red} or \emph{blue}. -Without loss of generality, we can suppose that the -\emph{blue} nodes are contagious. The parameters of the graph are normalized -such that $\forall i, \ \sum_j \Theta_{i,j} = 1$ and we assume that -$\Theta_{i,i}$ is always non-zero, meaning that all nodes have self-loops. -Each round, every node $j$ independently chooses -one of its neighbors with probability $\Theta_{i,j}$ and adopts their color. The -cascades stops at a fixed horizon time $T$ or if all nodes are of the same color. -If we denote by $X^t$ the indicator variable of the set of blue nodes at time -step $t$, then we have: +Without loss of generality, we can suppose that the \emph{blue} nodes are +contagious. The parameters of the graph are normalized such that $\forall i, \ +\sum_j \Theta_{i,j} = 1$ and we assume that $\Theta_{i,i}$ is always non-zero, +meaning that all nodes have self-loops. Each round, every node $j$ +independently chooses one of its neighbors with probability $\Theta_{i,j}$ and +adopts its color. The cascade stops at a fixed horizon time $T$ or if all +nodes are of the same color. If we denote by $X^t$ the indicator variable of +the set of blue nodes at time step $t$, then we have: \begin{equation} \mathbb{P}\left[X^{t+1}_j = 1 | X^t \right] = \sum_{i=1}^m \Theta_{i,j} X_i^t = \inprod{\Theta_j}{X^t}
@@ -176,11 +176,12 @@ the discrete-time independent cascade model, $X^k_j = 1 \implies X^{k+1}_j = 1$. Let $\exp(p)$ be an exponentially-distributed random variable of parameter $p$ -and let $p_{i,j}$ be the rate of transmission along directed edge $(i,j)$ in the -CICE model. By the memoryless property of the exponential, if $X^k_j \neq 1$: +and let $\Theta_{i,j}$ be the rate of transmission along directed edge $(i,j)$ +in the CICE model.
By the memoryless property of the exponential, if $X^k_j +\neq 1$: \begin{align*} \mathbb{P}(X^{k+1}_j = 1 | X^k) & = \mathbb{P}(\min_{i \in {\cal N}(j)} - \exp(p_{i,j}) \leq \epsilon) \\ + \exp(\Theta_{i,j}) \leq \epsilon) \\ & = \mathbb{P}(\exp( \sum_{i=1}^m \Theta_{i,j} X^t_i) \leq \epsilon) \\ & = 1 - e^{- \epsilon \Theta_j \cdot X^t} \end{align*}
@@ -193,15 +194,14 @@ Generalized Linear Cascade model with inverse link function $f:z\mapsto \subsubsection{Logistic Cascades} \label{sec:logistic_cascades} -Consider the specific case of ``logistic cascades'' (where $f$ is the logistic -function). This model captures the idea that there is a threshold around which -each additional neighbor's contribution becomes significant: logistic -cascades can be thought of approximating the more-commonly studied Linear -Threshold model~\cite{Kempe:03}. As we will see later in the -analysis, the Hessian of our optimization program simplifies in the case of a -logistic inverse link function to $\nabla^2\mathcal{L}(\theta^*) = -\frac{1}{|\mathcal{T}|}XX^T$ where $X$ is the observation matrix $[x^1 \ldots -x^\mathcal{|T|}]$. +``Logistic cascades'' (where the inverse link function $f$ is the logistic +function) capture the idea that there is a threshold around which each +additional neighbor's contribution becomes significant: they constitute a +differentiable approximation to the more-commonly studied Linear Threshold +model~\cite{Kempe:03}. As we will see later in the analysis, the Hessian of our +optimization program simplifies in the case of a logistic inverse link function +to $\nabla^2\mathcal{L}(\theta^*) = \frac{1}{|\mathcal{T}|}XX^T$ where $X$ is +the observation matrix $[x^1 \ldots x^{|\mathcal{T}|}]$. % \subsection{The Linear Threshold Model}
@@ -234,18 +234,17 @@ x^\mathcal{|T|}]$. \caption{Illustration of the sparse-recovery approach. Our objective is to recover the unknown weight vector $\theta_j$ for each node $j$. We observe a Bernoulli realization of the $f$ transform of the matrix-vector product, where -the measurement matrix encodes which nodes are ``contagious''} +the measurement matrix encodes which nodes are ``contagious'' at each time step.} \end{figure} Inferring the model parameter $\Theta$ from observed influence cascades is the central question of the present work. Recovering the edges in $E$ from observed -influence cascades is a well-identified problem known as the \emph{Graph +influence cascades is a well-identified problem known as the \emph{Network Inference} problem. However, recovering the influence parameters is no less -important and has been seemingly overlooked so far. In this work we focus on -recovering $\Theta$, noting that the set of edges $E$ can then be recovered -through the following equivalence: $(i,j)\in E\Leftrightarrow \Theta_{i,j} -\neq 0$ +important. In this work we focus on recovering $\Theta$, noting that the set of +edges $E$ can then be recovered through the following equivalence: $(i,j)\in +E\Leftrightarrow \Theta_{i,j} \neq 0$. Given observations $(x^1,\ldots,x^n)$ of a cascade model, we can recover $\Theta$ via Maximum Likelihood Estimation (MLE). Denoting by $\mathcal{L}$ the
diff --git a/paper/sections/results.tex b/paper/sections/results.tex
index a0d3159..f68ecee 100644
--- a/paper/sections/results.tex
+++ b/paper/sections/results.tex
@@ -42,18 +42,23 @@ by~\cite{bickel2009simultaneous}. A discussion of the $(S,\gamma)$-{\bf(RE)} assumption in the context of generalized linear cascade models can be found in Section~\ref{sec:re}.
In our -setting we require that the (RE)-condition holds for the Hessian of the +setting we require that the {\bf(RE)}-condition holds for the Hessian of the log-likelihood function $\mathcal{L}$: it essentially captures the fact that the binary vectors of the set of active nodes are not \emph{too} collinear. {\color{red} Rewrite the minimal assumptions necessary} We will also need the following assumption on the inverse link function $f$ of the generalized linear cascade model: +%\begin{equation} + %\tag{LF} + %\max\left(\left|\frac{f'}{f}(\inprod{\theta^*}{x})\right|, + %\left|\frac{f'}{1-f}(\inprod{\theta^*}{x})\right|\right)\leq + %\frac{1}{\alpha} +%\end{equation} \begin{equation} - \tag{LF} - \max\left(\left|\frac{f'}{f}(\inprod{\theta^*}{x})\right|, - \left|\frac{f'}{1-f}(\inprod{\theta^*}{x})\right|\right)\leq - \frac{1}{\alpha} + \tag{LF} + \max \left( \| (\log f)' \|_{\infty}, \|(\log (1-f))' \|_{\infty} \right) + \leq \frac{1}{\alpha} \end{equation} for some $\alpha\in(0, 1)$ and for all $x\in\reals^m$ such that $f(\inprod{\theta^*}{x})\notin\{0,1\}$.
@@ -147,13 +152,12 @@ solve the Network Inference problem. \label{cor:variable_selection} Under the same assumptions as Theorem~\ref{thm:main}, let $\hat {\cal S}_\eta \defeq \{ j \in \{1,\ldots, m\} : \hat{\theta}_j > \eta\}$ for $\eta > 0$. For -$\epsilon>0$ and $\epsilon < \eta$, let ${\cal S}^*_{\eta + \epsilon} \defeq \{ -i \in \{1,\ldots,m\} :\theta_i^* > \eta +\epsilon \}$ be the set of all true -`strong' parents. Suppose the number of measurements verifies: $ n > -\frac{9s\log m}{\alpha\gamma^2\epsilon^2}$. Then with probability -$1-\frac{1}{m}$, ${\cal S}^*_{\eta + \epsilon} \subseteq \hat {\cal S}_\eta -\subseteq {\cal S}^*$. In other words we recover all `strong' parents and no -`false' parents. +$0< \epsilon < \eta$, let ${\cal S}^*_{\eta + \epsilon} \defeq \{ i \in +\{1,\ldots,m\} :\theta_i^* > \eta +\epsilon \}$ be the set of all true `strong' +parents. Suppose the number of measurements satisfies: $ n > \frac{9s\log +m}{\alpha\gamma^2\epsilon^2}$. Then with probability $1-\frac{1}{m}$, ${\cal +S}^*_{\eta + \epsilon} \subseteq \hat {\cal S}_\eta \subseteq {\cal S}^*$. In +other words, we recover all `strong' parents and no `false' parents. \end{corollary} \begin{proof}
@@ -191,13 +195,10 @@ $ be the best $s$-approximation to $\theta^*$. Then we pay a cost proportional to $\|\theta^* - \theta^*_{\lfloor s\rfloor}\|_1$ for recovering the weights of non-exactly sparse vectors. This cost is simply the ``tail'' of -$\theta^*$: the sum of the $m-s$ smallest coordinates of $\theta^*$. - -In other words, the closer $\theta^*$ is to being sparse, the smaller the -price, and we recover the results of Section~\ref{sec:main_theorem} in the -limit of exact sparsity. These results are formalized in the following theorem, -which is also a consequence of Theorem 1 in \cite{Negahban:2009}. - +$\theta^*$: the sum of the $m-s$ smallest coordinates of $\theta^*$. We recover +the results of Section~\ref{sec:main_theorem} in the limit of exact sparsity. +These results are formalized in the following theorem, which is also a +consequence of Theorem 1 in \cite{Negahban:2009}.
\begin{theorem} \label{thm:approx_sparse} Suppose the {\bf(RE)} assumption holds for the Hessian $\nabla^2 f(\theta^*)$
@@ -238,21 +239,19 @@ then similarly: ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta In this section, we discuss the main assumption of Theorem~\ref{thm:main}, namely the restricted eigenvalue condition introduced by~\cite{bickel2009simultaneous}. -We then compare to the irrepresentability condition considered in -\cite{Daneshmand:2014}. \paragraph{Interpretation} There exists a large class of sufficient conditions under which sparse recovery -is achievable in the context of regularized estimation. A good survey of these -different assumptions can be found in \cite{vandegeer:2009}. - +is achievable in the context of regularized estimation~\cite{vandegeer:2009}. The restricted eigenvalue condition is one of the weakest such assumptions. It can be interpreted as a restricted form of non-degeneracy. Since we apply it to the Hessian of the log-likelihood function $\nabla^2 \mathcal{L}(\theta)$, it essentially reduces to a form of restricted strong convexity, that Lemma~\ref{lem:negahban} ultimately relies on. Observe that the Hessian of -$\mathcal{L}$ can be seen as a re-weighted Gram matrix of the observations: +Lemma~\ref{lem:negahban} ultimately relies on. + +Observe that the Hessian of $\mathcal{L}$ can be seen as a re-weighted +\emph{Gram matrix} of the observations: \begin{multline*} \nabla^2\mathcal{L}(\theta^*) = \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}x^t(x^t)^T
@@ -269,8 +268,7 @@ and the $(S, \gamma)$-({\bf RE}) condition on the Hessian in Theorem~\ref{thm:main} and Theorem~\ref{thm:approx_sparse} reduces to a condition on the \emph{gram matrix} of the observations $X^T X = \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}x^t(x^t)^T$ with $\gamma \leftarrow -\gamma\cdot c$. The (RE)-condition is then equivalent to the assumption made in -the Lasso analysis of~\cite{bickel:2009}. +\gamma\cdot c$. \paragraph{(RE) with high probability}
@@ -281,19 +279,18 @@ this probabilistic model. Several recent papers show that large classes of correlated designs obey the restricted eigenvalue property with high probability \cite{raskutti:10, rudelson:13}. -The (RE)-condition is well-behaved in the following sense: if it holds for the -expected Hessian matrix $\E[\nabla^2\mathcal{L}(\theta^*)]$, then it holds for -the finite sample Hessian matrix $\nabla^2\mathcal{L}(\theta^*)$ with high +The {\bf(RE)}-condition is well-behaved in the following sense: if it holds for +the expected Hessian matrix $\E[\nabla^2\mathcal{L}(\theta^*)]$, then it holds +for the finite sample Hessian matrix $\nabla^2\mathcal{L}(\theta^*)$ with high probability. -Note that the expected Hessian matrix is exactly the Fisher -Information matrix of the generalized linear cascade model which captures the -amount of information about $\theta^*$ conveyed by the random observations. -The (RE)-condition has an even more natural interpretation when $f$ and $1-f$ -are strictly log-convex, since the expected \emph{gram matrix} $A \equiv -\mathbb{E}[X^T X]$ is a matrix whose entry $a_{i,j}$ is the average number of -times node $i$ and node $j$ are infected at the same time in the cascade -process, and $a_{i,i}$ is the average number of times node $i$ is infected.
+If $f$ and $1-f$ are strictly log-convex, then the previous observation yields +the following natural interpretation of the {\bf(RE)}-condition: the expected +\emph{Gram matrix} $A \equiv \mathbb{E}[X^T X]$ is a matrix whose entry +$a_{i,j}$ is the average fraction of times node $i$ and node $j$ are infected +at the same time during several runs of the cascade process. Note that the +diagonal terms $a_{i,i}$ are simply the average fraction of times the +corresponding node $i$ is infected. Therefore, under an assumption which only involves the probabilistic model and not the actual observations, Proposition~\ref{prop:fi} allows us to draw the
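To make this co-infection reading of the expected Gram matrix concrete, the following minimal Python sketch (hypothetical variable names, not the authors' code) estimates $A = \mathbb{E}[X^T X]$ from observed active-set indicator vectors and reads off the fractions $a_{i,j}$ and $a_{i,i}$:

    import numpy as np

    def empirical_gram(active_sets):
        """A = (1/T) X^T X: a_{i,j} is the fraction of observed steps in which
        nodes i and j are active together; a_{i,i} is the fraction for node i alone."""
        X = np.asarray(active_sets, dtype=float)  # shape (T, m): one 0/1 row per step
        return X.T @ X / X.shape[0]

    # toy usage with m = 4 nodes observed over three steps
    A = empirical_gram([[1, 0, 1, 0],
                        [1, 1, 0, 0],
                        [0, 1, 1, 0]])
    print(A[0, 0])  # node 0 active in 2/3 of the steps
    print(A[0, 1])  # nodes 0 and 1 active together in 1/3 of the steps

The $(S,\gamma)$-{\bf(RE)} condition then asks that this matrix be non-degenerate along (approximately) sparse directions.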
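Finally, the estimator analyzed in these sections can be illustrated end to end on the logistic-cascade special case, where $\ell_1$-penalized MLE is plain sparse logistic regression. The Python sketch below is illustrative only: the sizes, penalty, and threshold are arbitrary choices rather than the paper's experimental setup, and i.i.d. random active sets stand in for real cascade dynamics. It recovers the `strong' parents of a single node $j$ by proximal gradient descent:

    import numpy as np

    rng = np.random.default_rng(0)

    # illustrative sizes: candidate parents, observed steps, true parents of node j
    m, n, s = 100, 4000, 5
    theta_true = np.zeros(m)
    parents = rng.choice(m, size=s, replace=False)
    theta_true[parents] = 2.0

    # generalized linear cascade measurements with logistic inverse link:
    # x^t is the indicator of active nodes, X^{t+1}_j ~ Bernoulli(sigmoid(<theta_j, x^t>))
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    X = (rng.random((n, m)) < 0.2).astype(float)
    y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

    # l1-penalized MLE: minimize (1/n) * logistic loss + lam * ||theta||_1 via ISTA
    lam = 0.005        # small illustrative penalty (the theory scales it like sqrt(log m / n))
    theta, step = np.zeros(m), 0.5
    for _ in range(1000):
        grad = X.T @ (sigmoid(X @ theta) - y) / n
        theta = theta - step * grad
        theta = np.sign(theta) * np.maximum(np.abs(theta) - step * lam, 0.0)

    print("true parents     :", sorted(parents.tolist()))
    print("recovered parents:", np.flatnonzero(theta > 0.5).tolist())

Thresholding the estimate, as in Corollary~\ref{cor:variable_selection}, returns the support: strong parents keep large positive coefficients while spurious coordinates are shrunk to (or near) zero by the $\ell_1$ penalty.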