-rw-r--r--  paper/paper.tex                  4
-rw-r--r--  paper/sections/discussion.tex   22
-rw-r--r--  paper/sections/experiments.tex  15
-rw-r--r--  paper/sections/intro.tex        55
-rw-r--r--  paper/sections/lowerbound.tex    6
-rw-r--r--  paper/sections/model.tex        23
-rw-r--r--  paper/sections/results.tex      84
-rw-r--r--  paper/sparse.bib                24
8 files changed, 173 insertions, 60 deletions
diff --git a/paper/paper.tex b/paper/paper.tex
index 6e94164..20e8f08 100644
--- a/paper/paper.tex
+++ b/paper/paper.tex
@@ -127,8 +127,8 @@
\label{sec:linear_threshold}
\input{sections/discussion.tex}
-\section{Appendix}
-\input{sections/appendix.tex}
+%\section{Appendix}
+%\input{sections/appendix.tex}
\bibliography{sparse}
diff --git a/paper/sections/discussion.tex b/paper/sections/discussion.tex
index 25d9c53..b2947a1 100644
--- a/paper/sections/discussion.tex
+++ b/paper/sections/discussion.tex
@@ -1,7 +1,15 @@
\paragraph{Future Work}
-Solving the Graph Inference problem with sparse recovery techniques opens new venues for future work. Firstly, the sparse recovery literature has already studied regularization patterns beyond the $\ell-1$-norm, notably the thresholded and adaptive lasso \cite{vandegeer:2011} \cite{Zou:2006}. Another series of papers that are directly relevant to the Graph Inference setting have shown that confidence intervals can be established for the lasso. Finally, the linear threshold model is a commonly studied diffusion process and can also be cast as a \emph{generalized linear cascade} with inverse link function $z \mapsto \mathbbm{1}_{z > 0}$:
+Solving the Graph Inference problem with sparse recovery techniques opens new
+venues for future work. Firstly, the sparse recovery literature has already
+studied regularization patterns beyond the $\ell_1$-norm, notably the
+thresholded and adaptive lasso \cite{vandegeer:2011, Zou:2006}. Another goal
+would be to obtain confidence intervals for our estimator, similarly to what
+has been obtained for the Lasso in the recent series of papers
+\cite{javanmard2014, zhang2014}.
+
+Finally, the linear threshold model is a commonly studied diffusion process and can also be cast as a \emph{generalized linear cascade} with inverse link function $z \mapsto \mathbbm{1}_{z > 0}$:
\begin{equation}
\label{eq:lt}
@@ -9,7 +17,15 @@ Solving the Graph Inference problem with sparse recovery techniques opens new ve
X^{t+1}_j = \text{sign} \left(\inprod{\theta_j}{X^t} - t_j \right)
\end{equation}
-This model therefore falls into the 1-bit compressed sensing model \cite{Boufounos:2008} framework. Several recent papers study the theoretical guarantees obtained for 1-bit compressed sensing with specific measurements \cite{Gupta:2010}, \cite{Plan:2014}. Whilst they obtained bounds of the order ${\cal O}(n \log \frac{n}{s}$), no current theory exists for recovering positive bounded signals from bernoulli hyperplanes. This research direction may provide the first clues to solve the ``active learning'' problem: if we are allowed to adaptively \emph{choose} the source nodes at the beginning of each cascade, can we improve on current results?
+This model therefore falls into the 1-bit compressed sensing model
+\cite{Boufounos:2008} framework. Several recent papers study the theoretical
+guarantees obtained for 1-bit compressed sensing with specific measurements
+\cite{Gupta:2010, Plan:2014}. While these works obtain bounds of the order
+${\cal O}(n \log \frac{n}{s})$, no current theory exists for recovering
+positive bounded signals from Bernoulli hyperplanes. This research direction
+may provide the first clues to solve the ``adaptive learning'' problem: if we
+are allowed to adaptively \emph{choose} the source nodes at the beginning of
+each cascade, how much can we improve the current results?
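To make the 1-bit structure concrete, here is a minimal sketch (our own illustration; the graph size, the uniform thresholds, and the progressive-activation convention are assumptions, not the paper's code) of one linear-threshold step, in which every inactive node $j$ contributes the single-bit observation $\mathbbm{1}\{\inprod{\theta_j}{X^t} > t_j\}$ of its incoming weight vector:

import numpy as np

rng = np.random.default_rng(0)
m = 50                                          # number of nodes (illustrative)
Theta = rng.uniform(0.0, 1.0, size=(m, m))      # Theta[i, j]: weight of edge i -> j
Theta /= Theta.sum(axis=0, keepdims=True)       # incoming weights of each node sum to 1 (<= 1)
thresholds = rng.uniform(0.0, 1.0, size=m)      # per-node thresholds t_j
X = (rng.uniform(size=m) < 0.1).astype(float)   # X^t: indicator of active nodes

activations = Theta.T @ X                       # <theta_j, X^t> for every node j
# Each inactive node reports the 1-bit measurement 1{<theta_j, X^t> > t_j};
# already-active nodes are kept active (progressive activation).
X_next = np.maximum(X, (activations > thresholds).astype(float))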
\begin{comment}
The Linear Threshold model can \emph{also} be cast a generalized linear cascade model. However, as we show below, its link function is non-differentiable and necessitates a different analysis. In the Linear Threshold Model, each node $j\in V$ has a threshold $t_j$ from the interval $[0,1]$ and for each node, the sum of incoming weights is less than $1$: $\forall j\in V$, $\sum_{i=1}^m \Theta_{i,j} \leq 1$.
@@ -29,4 +45,4 @@ where we defined again $\theta_j\defeq (\Theta_{1,j},\ldots,\Theta_{m,j})$. In o
\end{equation}
The link function of the linear threshold model is the sign function: $z \mapsto \mathbbm{1}_{z > 0}$. This model therefore falls into the 1-bit compressed sensing model \cite{Boufounos:2008} framework. Several recent papers study the theoretical guarantees obtained for 1-bit compressed sensing with specific measurements \cite{Gupta:2010}, \cite{Plan:2014}. Whilst they obtained bounds of the order ${\cal O}(n \log \frac{n}{s}$), no current theory exists for recovering positive bounded signals from bernoulli hyperplanes. We leave this research direction to future work.
-\end{comment} \ No newline at end of file
+\end{comment}
diff --git a/paper/sections/experiments.tex b/paper/sections/experiments.tex
index 7cb852c..48ffc02 100644
--- a/paper/sections/experiments.tex
+++ b/paper/sections/experiments.tex
@@ -13,7 +13,10 @@
& \includegraphics[scale=.23]{figures/kronecker_l2_norm_nonsparse.pdf}\\
(a) Barabasi-Albert & (b) Watts-Strogatz & (c) sparse Kronecker & (d) non-sparse Kronecker
\end{tabular}
-\captionof{figure}{Figures (a) and (b) report the $f1$-score in $\log$ scale for 2 graphs: (a) Barabasi-Albert graph, $300$ nodes, $16200$ edges. (b) Watts-Strogatz graph, $300$ nodes, $4500$ edges. Figures (c) and (d) report the $\ell2$-norm $\|\hat \Theta - \Theta\|_2$ for a Kronecker graph which is: (c) exactly sparse (d) non-exactly sparse}
+\captionof{figure}{Figures (a) and (b) report the F$1$-score in $\log$ scale
+for two graphs: (a) a Barabasi-Albert graph with $300$ nodes and $16200$ edges;
+(b) a Watts-Strogatz graph with $300$ nodes and $4500$ edges. Figures (c) and (d) report the
+$\ell_2$-norm error $\|\hat \Theta - \Theta\|_2$ for a Kronecker graph which is (c) exactly sparse and (d) not exactly sparse.}
\label{fig:four_figs}
\end{table*}
@@ -35,8 +38,14 @@ We did not benchmark against other known algorithms (\textsc{netrate} \cite{gome
\paragraph{Graph Estimation}
In the case of the \textsc{lasso}, \textsc{mle} and \textsc{sparse mle} algorithms, we construct the edges of $\hat {\cal G} : \cup_{j \in V} \{i : \Theta_{ij} > 0.1\}$, \emph{i.e} by thresholding. The true positives are the edges which appear both in ${\cal G}$ and $\hat {\cal G}$ and the true negatives are the edges which appear in neither. Finally, we report the F1-score$=2 \text{precision}\cdot\text{recall}/(\text{precision}+\text{recall})$, which considers the number of true edges recovered by the algorithm over the total number of edges returned by the algorithm (\emph{precision}) with the number of true edges recovered by the algorithm over the total number of edges it should have recovered (\emph{recall}).
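For concreteness, a minimal sketch of this evaluation (the function name and the handling of empty edge sets are our own choices; the threshold $0.1$ is the one used above):

import numpy as np

def f1_score_edges(Theta_hat, Theta_true, tau=0.1):
    """Precision, recall and F1-score of the thresholded edge estimate."""
    pred = Theta_hat > tau              # estimated edge set
    true = Theta_true > 0               # true edge set
    tp = np.sum(pred & true)            # edges present in both graphs
    precision = tp / max(np.sum(pred), 1)
    recall = tp / max(np.sum(true), 1)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)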
-Over all experiments, \textsc{sparse mle} achieves higher rates of precision, recall, and f1-score. \textsc{sparse mle} is also robust to approximate sparsity, and displays a faster convergence of the $\ell2$-norm than any other benchmark. Interestingly, both \textsc{mle} and \textsc{sparse mle} perform exceptionally well on the Watts-Strogatz graph. The recovery rate converges at around $5000$ cascades, which is more than $15$ times the number of nodes. By contrast, \textsc{sparse mle} achieves a reasonable $f1$-score of $.75$ for roughly $500$ observed cascades.
+Over all experiments, \textsc{sparse mle} achieves higher rates of precision,
+recall, and F$1$-score. \textsc{sparse mle} is also robust to approximate
+sparsity, and displays a faster convergence of the $\ell_2$-norm error than any other
+benchmark. Interestingly, both \textsc{mle} and \textsc{sparse mle} perform
+exceptionally well on the Watts-Strogatz graph. The recovery rate converges at
+around $5000$ cascades, which is more than $15$ times the number of nodes. By
+contrast, \textsc{sparse mle} achieves a reasonable F$1$-score of $.75$ for roughly $500$ observed cascades.
\paragraph{Quantifying robustness}
-The previous experiments only considered graphs with strong edges. To test the algorithms in the approximately sparse case, we add sparse edges to the previous graphs according to a bernoulli variable of parameter $1/3$ for every non-edge, and drawing a weight uniformly from $[0,0.1]$. The results are reported in Figure~\ref{fig:four_figs} by plotting the convergence of the $\ell2$-norm error, and show that both the \textsc{lasso}, followed by \textsc{sparse mle} are the most robust to noise. \ No newline at end of file
+The previous experiments only considered graphs with strong edges. To test the algorithms in the approximately sparse case, we add weak edges to the previous graphs: for every non-edge, with probability $1/3$ we add an edge whose weight is drawn uniformly from $[0,0.1]$. The results are reported in Figure~\ref{fig:four_figs} by plotting the convergence of the $\ell_2$-norm error, and show that the \textsc{lasso}, followed by \textsc{sparse mle}, is the most robust to noise.
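A minimal sketch of this perturbation (the function name is ours, existing edges are left untouched, and the diagonal is not treated specially):

import numpy as np

def add_weak_edges(Theta, rng, p=1/3, w_max=0.1):
    """For every non-edge, add a weak edge with probability p, with weight
    drawn uniformly from [0, w_max], as in the robustness experiment above."""
    non_edge = (Theta == 0)
    mask = non_edge & (rng.uniform(size=Theta.shape) < p)
    weights = rng.uniform(0.0, w_max, size=Theta.shape)
    return Theta + mask * weights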
diff --git a/paper/sections/intro.tex b/paper/sections/intro.tex
index f4f2bd5..bbbed11 100644
--- a/paper/sections/intro.tex
+++ b/paper/sections/intro.tex
@@ -8,11 +8,11 @@ Tackling the graph inference problem means constructing a polynomial-time algori
Throughout this paper, we focus on the three main discrete-time diffusion processes: the independent cascade model, the voter model, and the linear threshold model...
\end{comment}
-Graphs have been extensively studied in their ability to propagate
-
-Networks and social networks in particular have become an increasingly common
-subject of study in the recent years.
-
+Graphs have been extensively studied for their propagative abilities:
+connectivity, routing, gossip algorithms, etc.
+%One question is: can we
+%understand and predict a diffusion process from the graph? Conversely, can we
+%learn a graph from the diffusion process?
A diffusion process taking place over a graph provides valuable information
about the presence and weights of its edges. \emph{Influence cascades} are a
specific type of diffusion processes in which a particular infection behavior
@@ -68,19 +68,44 @@ Section~\ref{sec:experiments}.
\paragraph{Related Work}
-The study of edge prediction in graph has been an active field of research for over a decade \cite{Nowell08} \cite{Leskovec07} \cite{AdarA05}. \cite{GomezRodriguez:2010} introduced the submodular {\scshape netinf} algorithm, which approximates the likelihood of cascades represented as a continuous process. The algorithm was later improved in later work \cite{gomezbalduzzi:2011}, but is not known to have any recovery guarantees beside empirical validation on synthetic networks. \cite{Netrapalli:2012} studied the discrete-time version of the independent cascade model and obtained the first ${\cal O}(s^2 \log m)$ guarantee on general networks. The algorithm is based around the same likelihood function we suggest, without the $\ell1$-norm penalty. However, the analysis depended strongly on a restrictive {\it correlation decay} assumption, which strongly limits the number of new infections at every step. In this restricted setting, they show a complex lower bound of the number of cascades needed for perfect support recovery with constant probability of the order $\Omega(s \log (m/s))$. They also suggest a {\scshape greedy} algorithm, which matches ${\cal O}(s \log m)$ guarantee in the case of \emph{tree} graphs.
+The study of edge prediction in graphs has been an active field of research for
+over a decade \cite{Nowell08, Leskovec07, AdarA05}.
+\cite{GomezRodriguez:2010} introduced the {\scshape Netinf} algorithm, which
+approximates the likelihood of cascades represented as a continuous process.
+The algorithm was improved in later work \cite{gomezbalduzzi:2011}, but is not
+known to have any recovery guarantees beside empirical validation on synthetic
+networks. \cite{Netrapalli:2012} studied the discrete-time version of the
+independent cascade model and obtained the first ${\cal O}(s^2 \log m)$
+recovery guarantee on general networks. The algorithm is based around the same
+likelihood function we suggest, without the $\ell_1$-norm penalty. Their
+analysis depends on a {\it correlation decay} assumption, which limits the
+number of new infections at every step. In this setting, they show a lower
+bound of $\Omega(s \log (m/s))$ on the number of cascades needed for support
+recovery with constant probability. They also suggest a {\scshape
+Greedy} algorithm, which matches the ${\cal O}(s \log m)$ guarantee in the case
+of tree graphs. The work of \cite{Abrahao:13} studies the same continuous-model framework as
+\cite{GomezRodriguez:2010} and obtains an ${\cal O}(s^9 \log^2 s \log m)$ support
+recovery algorithm without the \emph{correlation decay} assumption.
+
+\begin{comment}
+They assume a single-source model, where only one source is selected at random, which is less realistic in practice since `patient 0' is rarely unknown to us.
+\end{comment}
-The work of \cite{Abrahao:13} study the same continuous-model framework as \cite{GomezRodriguez:2010} and obtain a ${\cal O}(s^9 \log^2 s \log m)$ support recovery algorithm without the \emph{correlation decay} assumption. Their work also studies the information leveraged by different `parts' of the cascade, showing that a surprisingly important amount of information can be gleaned from the first and last `infections' of the cascade. They assume a single-source model, where only one source is selected at random, which is less realistic in practice since `patient 0' is rarely unknown to us.
+Closest to this work is a recent paper by \cite{Daneshmand:2014}, wherein they
+consider an $\ell_1$-regularized objective function. They adapt standard results
+from sparse recovery to obtain a recovery bound of ${\cal O}(s^3 \log m)$ for
+their algorithm under an irrepresentability condition. Under stronger
+assumptions, they match the \cite{Netrapalli:2012} bound of ${\cal O}(s^2 \log
+m)$, by exploiting similar properties of the convex program's KKT conditions.
+In contrast, our work studies discrete-time diffusion processes including the
+Independent Cascade model under weaker assumptions. Furthermore, we analyze
+both the recovery of the graph's edges and the estimation of the model's
+parameters, and achieve close-to-optimal bounds.
-Closest to this work is a recent paper by \cite{Daneshmand:2014}, they consider
-a $\ell_1$-norm penalty to their objective function, adapting standard results
-from sparse recovery to obtain a ${\cal O}(s^3 \log m)$ algorithm under an
-irrepresentability condition. With stronger assumptions, they match the
-\cite{Netrapalli:2012} bound of ${\cal O}(s^2 \log m)$, by exploiting
-a similar properties of the objective's KKT conditions. Their work has the merit of studying a generalization of the discrete-time independent cascade model to continuous functions. Similarly to \cite{Abrahao:13}, they place themselves in
+\begin{comment}
+Their work has the merit of studying a generalization of the discrete-time independent cascade model to continuous functions. Similarly to \cite{Abrahao:13}, they place themselves in
the restrictive single-source context.
-
-TODO: add related works on lower bounds.
+\end{comment}
\begin{comment}
\paragraph{Our contributions}
diff --git a/paper/sections/lowerbound.tex b/paper/sections/lowerbound.tex
index fdee069..b714766 100644
--- a/paper/sections/lowerbound.tex
+++ b/paper/sections/lowerbound.tex
@@ -1,5 +1,5 @@
In this section, we will consider the stable sparse recovery setting of
-Section~\ref{sec:relaxing}. Our goal is to obtain an information-theoretic
+Section~\ref{sec:relaxing_sparsity}. Our goal is to obtain an information-theoretic
lower bound on the number of measurements necessary to approximately recover
the parameter $\theta$ of a cascade model from observed cascades. Similar lower
bounds were obtained for sparse linear inverse problems in \cite{pw11, pw12,
@@ -7,7 +7,7 @@ bipw11}
\begin{theorem}
\label{thm:lb}
- Let us consider a cascade model of the form XXX and a recovery algorithm
+ Let us consider a cascade model of the form \eqref{eq:glm} and a recovery algorithm
$\mathcal{A}$ which takes as input $n$ random cascade measurements and
outputs $\hat{\theta}$ such that with probability $\delta<\frac{1}{2}$
(over the measurements):
@@ -19,7 +19,7 @@ bipw11}
= \Omega(s\log\frac{m}{s}/\log C)$.
\end{theorem}
-This theorem should be contrasted with Corollary~\ref{cor:relaxing}: up to an additive
+This theorem should be contrasted with Theorem~\ref{thm:approx_sparse}: up to an additive
$s\log s$ factor, the number of measurements required by our algorithm is
tight.
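To make the additive gap explicit, note the elementary identity
\begin{displaymath}
  s\log m \;=\; s\log\frac{m}{s} \;+\; s\log s ,
\end{displaymath}
so the ${\cal O}(s\log m)$ measurements required by our algorithm exceed the $\Omega(s\log\frac{m}{s})$ lower bound by an additive term of order $s\log s$.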
diff --git a/paper/sections/model.tex b/paper/sections/model.tex
index dbcbd8d..cf69960 100644
--- a/paper/sections/model.tex
+++ b/paper/sections/model.tex
@@ -52,10 +52,11 @@ A \emph{generalized linear cascade model} is a cascade model such that for each
susceptible node $j$ in state $s$ at time step $t$, the probability of $j$
becoming ``contagious'' at time step $t+1$ conditioned on $X^t$ is a Bernoulli
variable of parameter $f(\theta_j \cdot X^t)$:
-\begin{displaymath}
+\begin{equation}
+ \label{eq:glm}
\mathbb{P}(X^{t+1}_j = 1|X^t)
= f(\theta_j \cdot X^t)
-\end{displaymath}
+\end{equation}
where $f: \reals \rightarrow [0,1]$
\end{definition}
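As an illustration only (the edge-orientation convention, the handling of non-susceptible nodes, and all names are our own assumptions, not the paper's code), a single transition of such a model can be simulated as follows:

import numpy as np

def glc_step(Theta, X, susceptible, f, rng):
    """One transition of a generalized linear cascade, Eq. (eq:glm).
    Theta[i, j]: weight of edge i -> j, so column j collects theta_j.
    X: 0/1 vector of contagious nodes at time t.
    susceptible: boolean mask of nodes that may become contagious at t+1."""
    probs = f(Theta.T @ X)                       # f(<theta_j, X^t>) for every node j
    draws = rng.uniform(size=X.shape) < probs    # independent Bernoulli draws
    X_next = X.copy()
    X_next[susceptible] = draws[susceptible].astype(float)
    return X_next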
@@ -128,9 +129,11 @@ with inverse link function $f : z \mapsto 1 - e^z$.
In the Linear Voter Model, nodes can be either \emph{red} or \emph{blue}. Both
states are symmetric, but without loss of generality, we can suppose that the
-\emph{blue} nodes are contagious. The parameters of the graph are normalized such that $\forall i,
-\ \sum_j \Theta_{i,j} = 1$. Each round, every node $j$ independently chooses
-one of its neighbors with probability $\Theta_{ij}$ and adopts their color. The
+\emph{blue} nodes are contagious. The parameters of the graph are normalized
+such that $\forall j, \ \sum_i \Theta_{i,j} = 1$ and we assume that
+$\Theta_{j,j}$ is always non-zero, meaning that all nodes have self-loops.
+Each round, every node $j$ independently chooses
+one of its neighbors $i$ with probability $\Theta_{i,j}$ and adopts its color. The
cascades stops at a fixed horizon time $T$ or if all nodes are of the same color.
If we denote by $X^t$ the indicator variable of the set of blue nodes at time
step $t$, then we have:
@@ -139,7 +142,7 @@ step $t$, then we have:
\tag{V}
\end{equation}
-Therefore, the independent cascade model is a Generalized Linear Cascade model
+Thus, the linear voter model is a Generalized Linear Cascade model
with inverse link function $f: z \mapsto z$.
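For completeness, the link function can be read off directly from the dynamics: node $j$ turns blue exactly when its chosen neighbor is blue, so
\begin{displaymath}
  \mathbb{P}(X^{t+1}_j = 1 \,|\, X^t)
  \;=\; \sum_{i} \Theta_{i,j}\, X^t_i
  \;=\; \inprod{\theta_j}{X^t},
\end{displaymath}
which is the generalized linear cascade form~\eqref{eq:glm} with the identity link $f(z) = z$.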
% \subsection{The Linear Threshold Model}
@@ -204,6 +207,13 @@ where we denote by ${\cal T}_i$ the time steps at which node $i$ is susceptible
+ (1 - x_i^{t+1})\log\big(1-f(\inprod{\theta_i}{x^t})\big)
\end{multline*}
+In the case of the voter model, the measurements include all time steps until
+we reach the time horizon $T$ or the graph coalesces to a single state. In the
+case of the independent cascade model, the measurements include all time steps
+until node $i$ becomes contagious, after which its behavior is deterministic.
+Contrary to prior work, we express our results in terms of the number of
+measurements and not the number of cascades.
+
A sufficient condition for program~\eqref{eq:pre-mle} to be a convex program is
that $\mathcal{L}_i$ is a concave function. This will be the case if, for
example, $f$ and $(1-f)$ are both log-concave functions. Remember that
@@ -211,4 +221,3 @@ a twice-differentiable function $f$ is log concave iff. $f''f \leq f'^2$. It is
easy to verify this property for $f$ and $(1-f)$ in the Independent Cascade
Model and Voter Model.
-Note that in the case of the voter model, TODO horizon time. In the case of the independent cascade model...
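A minimal numerical sketch of the resulting $\ell_1$-penalized likelihood program for a single node $i$ under the voter-model link $f(z) = z$ (the solver, the normalization by $n$, the clipping constant, and the box constraints are our own implementation choices, not part of the paper; since $\theta \geq 0$, the $\ell_1$ penalty reduces to the sum of the entries and the objective is smooth):

import numpy as np
from scipy.optimize import minimize

def sparse_mle_voter(X_prev, x_next, lam):
    """X_prev: (n, m) array whose rows are the active-set vectors x^t observed
    while node i was susceptible; x_next: (n,) 0/1 outcomes of node i at the
    following step; lam: l1 regularization parameter."""
    n, m = X_prev.shape
    eps = 1e-9

    def objective(theta):
        z = np.clip(X_prev @ theta, eps, 1 - eps)   # f(<theta_i, x^t>) = <theta_i, x^t>
        loglik = np.sum(x_next * np.log(z) + (1 - x_next) * np.log(1 - z))
        # theta >= 0, so ||theta||_1 = sum(theta) and the penalty is smooth
        return -loglik / n + lam * np.sum(theta)

    res = minimize(objective, x0=np.full(m, 1.0 / m),
                   method="L-BFGS-B", bounds=[(0.0, 1.0)] * m)
    return res.x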
diff --git a/paper/sections/results.tex b/paper/sections/results.tex
index 8eda03f..e5aa861 100644
--- a/paper/sections/results.tex
+++ b/paper/sections/results.tex
@@ -35,8 +35,13 @@ write $S\defeq\text{supp}(\theta^*)$ and $s=|S|$. Recall, that $\theta_i$ is the
\end{equation}
\end{definition}
-This condition, well-known in the sparse recovery literature, captures the idea that the binary vectors of active nodes are not \emph{too} colinear with each other, an essentially necessary condition for recovering $\theta$. In fact, the $(S,\gamma)$-{\bf(RE)} assumption is closely linked to the \emph{Fisher Information Matrix}, which captures the amount of information carried by an observable random variable (here the set of active nodes) about an unknown parameter $\theta$ on which the random variable depends.
-A discussion of the $(S,\gamma)$-{\bf(RE)} assumption in the context of generalized linear cascade models can be found in Section~\ref{sec:re}. We will also need the following assumption on the inverse link function $f$ of the generalized linear cascade model:
+A discussion of the $(S,\gamma)$-{\bf(RE)} assumption in the context of
+generalized linear cascade models can be found in Section~\ref{sec:re}. In our
+setting, we require this property to hold for the Hessian of the
+log-likelihood function $\mathcal{L}$; it essentially captures the idea that the binary
+vectors of active nodes are not \emph{too} collinear.
+
+We will also need the following assumption on the inverse link function $f$ of the generalized linear cascade model:
\begin{equation}
\tag{LF}
\max\left(\left|\frac{f'}{f}(\inprod{\theta^*}{x})\right|,
@@ -48,10 +53,16 @@ $f(\inprod{\theta^*}{x})\notin\{0,1\}$.
In the voter model, $\frac{f'}{f}(z) = \frac{1}{z}$ and $\frac{f'}{f}(1-z)
= \frac{1}{1-z}$. Hence (LF) will hold as soon as $\alpha\leq \Theta_{i,j}\leq
-1-\alpha$ for all $(i,j)\in E$. Similarly, in the Independent Cascade Model,
-$\frac{f'}{f}(z) = \frac{1}{1-e^z}$ and $\frac{f'}{1-f}(z) = -1$ and (LF) holds
-if $p_{i, j}\geq 1-\alpha$ for all $(i, j)\in E$. Remember that in this case we
-have $\Theta_{i,j} = \log(1-p_{i,j})$. These assumptions are reasonable: if an edge has a weight very close to 0, then the ``infection'' will never happen along that edge for our set of observations and we can never hope to recover it.
+1-\alpha$ for all $(i,j)\in E$, which is always satisfied for some $\alpha > 0$.
+Similarly, in the Independent Cascade Model, $\frac{f'}{f}(z)
+= \frac{1}{1-e^{-z}}$ and $\frac{f'}{1-f}(z) = -1$, so (LF) is not restrictive.
+
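Spelled out, the two computations behind these claims are elementary and recorded here only as a check:
\begin{align*}
  &\text{Voter model: } f(z) = z,
  &&\frac{f'}{f}(z) = \frac{1}{z},
  &&\frac{f'}{1-f}(z) = \frac{1}{1-z};\\
  &\text{Independent cascade model: } f(z) = 1 - e^{z},
  &&\frac{f'}{f}(z) = \frac{-e^{z}}{1-e^{z}} = \frac{1}{1-e^{-z}},
  &&\frac{f'}{1-f}(z) = \frac{-e^{z}}{e^{z}} = -1.
\end{align*}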
+\begin{comment}
+Remember that in this case we have $\Theta_{i,j} = \log(1-p_{i,j})$. These
+assumptions are reasonable: if an edge has a weight very close to 0, then the
+``infection'' will never happen along that edge for our set of observations and
+we can never hope to recover it.
+\end{comment}
\begin{theorem}
\label{thm:main}
@@ -69,7 +80,10 @@ $\alpha > 0$. For any $\delta\in(0,1)$, let $\hat{\theta}$ be the solution of
\end{equation}
\end{theorem}
-Theorem~\ref{thm:main} states that, under the two previous assumptions, the decay rate of the error term is ${\cal O}(1/\sqrt{n})$, which is believed to be optimal. Note how we have expressed the bound in the number of measurements $n$, which is different from the number of cascades! For example, in the case of the voter model with horizon time $T$ and for $N$ cascades, we can expect $N\times T$ measurements.
+Note that we have expressed the convergence rate in the number of measurements
+$n$, which is different from the number of cascades. For example, in the case
+of the voter model with horizon time $T$ and for $N$ cascades, we can expect
+a number of measurements proportional to $N\times T$.
Before moving to the proof of Theorem~\ref{thm:main}, note that interpreting it
in the case of the Independent Cascade Model requires one more step. Indeed, to
@@ -164,7 +178,10 @@ Choosing $\lambda\defeq 2\sqrt{\frac{\log m}{\alpha n^{1-\delta}}}$ concludes th
proof.
\end{proof}
-Note how the proof of Lemma~\ref{lem:ub} relied crucially on Azuma-Hoeffding's inequality, which allows us to handle correlated observations, and obtain bounds on the number of measurements rather than the number of cascades. We now show how to use Theorem~\ref{thm:main} to recover the support of $\theta^*$, that is, to solve the Graph Inference problem.
+Note how the proof of Lemma~\ref{lem:ub} relied crucially on Azuma-Hoeffding's
+inequality, which allows us to handle correlated observations. We now show how
+to use Theorem~\ref{thm:main} to recover the support of $\theta^*$, that is, to
+solve the Graph Inference problem.
\begin{corollary}
\label{cor:variable_selection}
@@ -223,26 +240,32 @@ which is also a consequence of Theorem 1 in \cite{Negahban:2009}.
\begin{theorem}
\label{thm:approx_sparse}
-Let $\theta^*_{\lfloor s \rfloor}$ be the best s-sparse approximation to the true vector $\theta^*$. Suppose the {\bf(RE)} assumption holds for the Hessian $\nabla^2 f(\theta^*)$ and for the following set
+Let $\theta^*_{\lfloor s \rfloor}$ be the best $s$-sparse approximation to the
+true vector $\theta^*$. Suppose the relaxed {\bf(RE)} assumption holds for the
+Hessian $\nabla^2 f(\theta^*)$ and for the following set:
\begin{align}
\nonumber
{\cal C}' \defeq & \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 + 4 \|\theta^*_{\lfloor s \rfloor}\|_1 \} \\ \nonumber
& \cap \{ \|X\|_1 \leq 1 \}
\end{align}
-By solving \eqref{eq:pre-mle} for $\lambda \defeq 2\sqrt{\frac{\log m}{\alpha n^{1 - \delta} p_{\min}}}$ we have:
-\begin{align}
-\|\hat p - p^* \|_2 \leq 3 \frac{\sqrt{s}\lambda_n}{\gamma_n} + 2 \sqrt{\frac{\lambda_n}{\gamma_n}} \|p^*_{\lfloor s \rfloor}\|_1
-\end{align}
+By solving \eqref{eq:pre-mle} for $\lambda \defeq 2\sqrt{\frac{\log m}{\alpha n^{1 - \delta}}}$ we have:
+\begin{align*}
+ \|\hat \theta - \theta^* \|_2 \leq
+ \frac{3}{\gamma} \sqrt{\frac{s\log m}{\alpha n^{1-\delta}}}
+ + \frac{2}{\gamma} \sqrt[4]{\frac{s\log m}{\alpha n^{1-\delta}}} \|\theta^* - \theta^*_{\lfloor s \rfloor}\|_1
+\end{align*}
\end{theorem}
As before, edge recovery is a consequence of upper-bounding $\|\theta - \hat \theta\|_2$
\begin{corollary}
-If $\theta$ is not exactly s-sparse and the number of measurements verifies:
-\begin{equation}
-n > \frac{36 \| p^*_{\lfloor s\rfloor}\|_1}{p_{\min}\gamma^2 \epsilon^4} s \log m
+ Under the same assumptions as Theorem~\ref{thm:approx_sparse}, if the number of
+ measurements verifies
+ \begin{equation}
+ n > \frac{9}{\alpha\gamma^2\epsilon^2}\left(1+
+ \frac{4}{\epsilon^2}\| \theta^* - \theta^*_{\lfloor
+ s\rfloor}\|_1\right)s\log m
\end{equation}
-then similarly: ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta \subset {\cal S}^*$ w.h.p.
+then similarly: ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta
+\subset {\cal S}^*$ w.p. at least $1-\frac{1}{m}$.
\end{corollary}
@@ -303,7 +326,7 @@ eigenvalue property, then the finite sample hessian also verifies the
restricted eigenvalue property with overwhelming probability. It is
straightforward to show this holds when $n \geq C s^2 \log m$
\cite{vandegeer:2009}, where $C$ is an absolute constant. By adapting Theorem
-8 \cite{rudelson:13}\footnote{this result still needs to be confirmed!}, this
+8 of \cite{rudelson:13}, this
can be reduced to:
\begin{displaymath}
n \geq C s \log m \log^3 \left( \frac{s \log m}{C'} \right)
@@ -312,7 +335,9 @@ where $C, C'$ are constants not depending on $(s, m, n)$.
\paragraph{(RE) vs Irrepresentability Condition}
-\cite{Daneshmand:2014} rely on an `incoherence' condition on the hessian of the likelihood function. It is in fact easy to see that their condition is equivalent to the more commonly called {\it (S,s)-irrepresentability} condition:
+\cite{Daneshmand:2014} rely on an `incoherence' condition on the hessian of the
+likelihood function. Their condition is equivalent to what is more commonly
+called the {\it (S,s)-irrepresentability} condition:
\begin{definition}
Following similar notation, let $Q^* \defeq \nabla^2 f(\theta^*)$. Let $S$ and $S^c$ be the set of indices of all the parents and non-parents respectively and $Q_{S,S}$, $Q_{S^c,S}$, $Q_{S, S^c}$, and $Q_{S^c, S^c}$ the induced sub-matrices. Consider the following constant:
@@ -322,7 +347,20 @@ Following similar notation, let $Q^* \defeq \nabla^2 f(\theta^*)$. Let $S$ and $
The (S,s)-irrepresentability holds if $\nu_{\text{irrepresentable}}(S) < 1 - \epsilon$ for $\epsilon > 0$
\end{definition}
-This condition has been shown to be essentially necessary for variable selection \cite{Zhao:2006}. Several recent papers \cite{vandegeer:2011}, \cite{Zou:2006}, argue this condition is unrealistic in situations where there is a correlation between variables. Consider the following simplified example from \cite{vandegeer:2011}:
+It is intuitive that the irrepresentability condition is stronger than the
+{\bf(RE)} assumption. In fact, a slightly modified result from
+\cite{vandegeer:2009} shows that a `strong' irrepresentability condition
+directly {\it implies} the {\bf(RE)} condition for $\ell_2$-recovery:
+
+\begin{proposition}
+\label{prop:irrepresentability}
+If the irrepresentability condition holds with $\epsilon > \frac{2}{3}$, then the restricted eigenvalue condition holds with constant $\gamma_n \geq \frac{ (1 - 3(1 -\epsilon))^2 \lambda_{\min}^2}{4s}n$, where $\lambda_{\min} > 0$ is the smallest eigenvalue of $Q^*_{S,S}$, on which the results of \cite{Daneshmand:2014} also depend.
+\end{proposition}
+
+Furthermore, recent papers \cite{vandegeer:2011, Zou:2006} argue that
+the irrepresentability condition is unrealistic in situations where
+variables are correlated. Consider the following simplified example from
+\cite{vandegeer:2011}:
\begin{equation}
\nonumber
\left(
@@ -334,12 +372,6 @@ I_s & \rho J \\
\end{equation}
where $I_s$ is the $s \times s$ identity matrix, $J$ is the all-ones matrix and $\rho \in \mathbb{R}^+$. It is easy to see that $\nu_{\text{irrepresentable}}(S) = \rho s$ and $\lambda_{\min}(Q) \geq 1 - \rho$, such that for any $\rho > \frac{1}{s}$ and $\rho < 1$, the restricted eigenvalue holds trivially but the (S,s)-irrepresentability does not hold.
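A concrete instance (the numbers are our own) makes the gap visible:
\begin{displaymath}
  s = 10,\ \rho = \tfrac{1}{5}:\qquad
  \nu_{\text{irrepresentable}}(S) = \rho s = 2 > 1,
  \qquad \lambda_{\min}(Q) \geq 1 - \rho = \tfrac{4}{5} > 0 ,
\end{displaymath}
so the restricted eigenvalue condition holds while the (S,s)-irrepresentability condition fails.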
-It is intuitive that the irrepresentability condition is stronger than the {\bf(RE)} assumption. In fact, a slightly modified result from \cite{vandegeer:2009} shows that a `strong' irrepresentability condition directly {\it implies} the {\bf(RE)} condition for $\ell2$-recovery:
-
-\begin{proposition}
-\label{prop:irrepresentability}
-If the irrepresentability condition holds with $\epsilon > \frac{2}{3}$, then the restricted eigenvalue condition holds with constant $\gamma_n \geq \frac{ (1 - 3(1 -\epsilon))^2 \lambda_{\min}^2}{4s}n$, where $\lambda_{\min} > 0$ is the smallest eigenvalue of $Q^*_{S,S}$, on which the results of \cite{Daneshmand:2014} also depend.
-\end{proposition}
\begin{comment}
\begin{lemma}
diff --git a/paper/sparse.bib b/paper/sparse.bib
index d1622fc..5df4b59 100644
--- a/paper/sparse.bib
+++ b/paper/sparse.bib
@@ -478,4 +478,26 @@ year = "2009"
timestamp = {Thu, 16 Feb 2012 12:06:08 +0100},
biburl = {http://dblp.uni-trier.de/rec/bib/conf/stoc/Kleinberg00},
bibsource = {dblp computer science bibliography, http://dblp.org}
-} \ No newline at end of file
+}
+
+@article{zhang2014,
+ title={Confidence intervals for low dimensional parameters in high dimensional linear models},
+ author={Zhang, Cun-Hui and Zhang, Stephanie S},
+ journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
+ volume={76},
+ number={1},
+ pages={217--242},
+ year={2014},
+ publisher={Wiley Online Library}
+}
+
+@article{javanmard2014,
+ title={Confidence intervals and hypothesis testing for high-dimensional regression},
+ author={Javanmard, Adel and Montanari, Andrea},
+ journal={The Journal of Machine Learning Research},
+ volume={15},
+ number={1},
+ pages={2869--2909},
+ year={2014},
+ publisher={JMLR.org}
+}