Diffstat (limited to 'paper/sections/results.tex')
-rw-r--r--  paper/sections/results.tex  75
1 file changed, 36 insertions, 39 deletions
diff --git a/paper/sections/results.tex b/paper/sections/results.tex
index a0d3159..f68ecee 100644
--- a/paper/sections/results.tex
+++ b/paper/sections/results.tex
@@ -42,18 +42,23 @@ by~\cite{bickel2009simultaneous}.
A discussion of the $(S,\gamma)$-{\bf(RE)} assumption in the context of
generalized linear cascade models can be found in Section~\ref{sec:re}. In our
-setting we require that the (RE)-condition holds for the Hessian of the
+setting we require that the {\bf(RE)}-condition holds for the Hessian of the
log-likelihood function $\mathcal{L}$: it essentially captures the fact that the
binary vectors of the set of active nodes are not \emph{too} collinear.
{\color{red} Rewrite the minimal assumptions necessary}
We will also need the following assumption on the inverse link function $f$ of
the generalized linear cascade model:
+%\begin{equation}
+ %\tag{LF}
+ %\max\left(\left|\frac{f'}{f}(\inprod{\theta^*}{x})\right|,
+ %\left|\frac{f'}{1-f}(\inprod{\theta^*}{x})\right|\right)\leq
+ %\frac{1}{\alpha}
+%\end{equation}
\begin{equation}
- \tag{LF}
- \max\left(\left|\frac{f'}{f}(\inprod{\theta^*}{x})\right|,
- \left|\frac{f'}{1-f}(\inprod{\theta^*}{x})\right|\right)\leq
- \frac{1}{\alpha}
+ \tag{LF}
+ \max \left( \| (\log f)' \|_{\infty}, \|(\log (1-f))' \|_{\infty} \right)
+ \leq \frac{1}{\alpha}
\end{equation}
for some $\alpha\in(0, 1)$ and for all $x\in\reals^m$ such that
$f(\inprod{\theta^*}{x})\notin\{0,1\}$.
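For illustration only (the logistic inverse link below is an assumption of this
sketch and is not specified in the diff above): taking $f(z) = (1+e^{-z})^{-1}$,
one gets
\begin{equation*}
  (\log f)'(z) = 1 - f(z) \in (0,1),
  \qquad
  (\log(1-f))'(z) = -f(z) \in (-1,0),
\end{equation*}
so that $\max\left(\|(\log f)'\|_{\infty}, \|(\log(1-f))'\|_{\infty}\right)
\leq 1 \leq \frac{1}{\alpha}$, and the {\bf(LF)} condition holds for every
$\alpha\in(0,1)$.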
@@ -147,13 +152,12 @@ solve the Network Inference problem.
\label{cor:variable_selection}
Under the same assumptions as Theorem~\ref{thm:main}, let $\hat {\cal S}_\eta
\defeq \{ j \in \{1,\ldots, m\} : \hat{\theta}_j > \eta\}$ for $\eta > 0$. For
-$\epsilon>0$ and $\epsilon < \eta$, let ${\cal S}^*_{\eta + \epsilon} \defeq \{
-i \in \{1,\ldots,m\} :\theta_i^* > \eta +\epsilon \}$ be the set of all true
-`strong' parents. Suppose the number of measurements verifies: $ n >
-\frac{9s\log m}{\alpha\gamma^2\epsilon^2}$. Then with probability
-$1-\frac{1}{m}$, ${\cal S}^*_{\eta + \epsilon} \subseteq \hat {\cal S}_\eta
-\subseteq {\cal S}^*$. In other words we recover all `strong' parents and no
-`false' parents.
+$0< \epsilon < \eta$, let ${\cal S}^*_{\eta + \epsilon} \defeq \{ i \in
+\{1,\ldots,m\} :\theta_i^* > \eta +\epsilon \}$ be the set of all true `strong'
+parents. Suppose the number of measurements satisfies $ n > \frac{9s\log
+m}{\alpha\gamma^2\epsilon^2}$. Then with probability $1-\frac{1}{m}$, ${\cal
+S}^*_{\eta + \epsilon} \subseteq \hat {\cal S}_\eta \subseteq {\cal S}^*$. In
+other words, we recover all `strong' parents and no `false' parents.
\end{corollary}
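As a purely illustrative instantiation (all numerical values below are
hypothetical and not taken from the paper): with $m = 10^3$ candidate parents,
$s = 10$, $\alpha = \gamma = \tfrac{1}{2}$ and a separation $\epsilon = 0.1$,
the corollary asks for
\begin{equation*}
  n > \frac{9 s\log m}{\alpha\gamma^2\epsilon^2}
    = \frac{9\cdot 10\cdot\log(10^3)}{\tfrac{1}{2}\cdot\tfrac{1}{4}\cdot 10^{-2}}
    \approx 5\times 10^{5}
\end{equation*}
measurements, after which thresholding $\hat\theta$ at any $\eta > \epsilon$
recovers, with probability $1-\tfrac{1}{m}$, every parent whose weight exceeds
$\eta+\epsilon$ and no non-parent.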
\begin{proof}
@@ -191,13 +195,10 @@ $
be the best $s$-approximation to $\theta^*$. Then we pay a cost proportional
to $\|\theta^* - \theta^*_{\lfloor s\rfloor}\|_1$ for recovering the weights of
non-exactly sparse vectors. This cost is simply the ``tail'' of
-$\theta^*$: the sum of the $m-s$ smallest coordinates of $\theta^*$.
-
-In other words, the closer $\theta^*$ is to being sparse, the smaller the
-price, and we recover the results of Section~\ref{sec:main_theorem} in the
-limit of exact sparsity. These results are formalized in the following theorem,
-which is also a consequence of Theorem 1 in \cite{Negahban:2009}.
-
+$\theta^*$: the sum of the $m-s$ smallest coordinates of $\theta^*$. We recover
+the results of Section~\ref{sec:main_theorem} in the limit of exact sparsity.
+These results are formalized in the following theorem, which is also a
+consequence of Theorem 1 in \cite{Negahban:2009}.
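For concreteness (an illustrative computation that does not appear in the diff
above): writing $\theta^*_{(1)},\dots,\theta^*_{(m)}$ for the coordinates of
$\theta^*$ sorted by decreasing magnitude, the tail is
\begin{equation*}
  \|\theta^* - \theta^*_{\lfloor s\rfloor}\|_1 = \sum_{j=s+1}^{m} |\theta^*_{(j)}|,
\end{equation*}
so that if, say, $|\theta^*_{(j)}|\leq C j^{-q}$ for some $q>1$, the tail is at
most $C s^{1-q}/(q-1)$, and it vanishes in the limit of exact sparsity.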
\begin{theorem}
\label{thm:approx_sparse}
Suppose the {\bf(RE)} assumption holds for the Hessian $\nabla^2 f(\theta^*)$
@@ -238,21 +239,19 @@ then similarly: ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta
In this section, we discuss the main assumption of Theorem~\ref{thm:main}, namely
the restricted eigenvalue condition introduced by~\cite{bickel2009simultaneous}.
-We then compare to the irrepresentability condition considered in
-\cite{Daneshmand:2014}.
\paragraph{Interpretation}
There exists a large class of sufficient conditions under which sparse recovery
-is achievable in the context of regularized estimation. A good survey of these
-different assumptions can be found in \cite{vandegeer:2009}.
-
+is achievable in the context of regularized estimation~\cite{vandegeer:2009}.
The restricted eigenvalue condition is one of the weakest such assumptions. It
can be interpreted as a restricted form of non-degeneracy. Since we apply it to
the Hessian of the log-likelihood function $\nabla^2 \mathcal{L}(\theta)$, it
essentially reduces to a form of restricted strong convexity, that
-Lemma~\ref{lem:negahban} ultimately relies on. Observe that the Hessian of
-$\mathcal{L}$ can be seen as a re-weighted Gram matrix of the observations:
+Lemma~\ref{lem:negahban} ultimately relies on.
+
+Observe that the Hessian of $\mathcal{L}$ can be seen as a re-weighted
+\emph{Gram matrix} of the observations:
\begin{multline*}
\nabla^2\mathcal{L}(\theta^*)
= \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}x^t(x^t)^T
@@ -269,8 +268,7 @@ and the $(S, \gamma)$-({\bf RE}) condition on the Hessian in
Theorem~\ref{thm:main} and Theorem~\ref{thm:approx_sparse} reduces to a
condition on the \emph{gram matrix} of the observations $X^T X =
\frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}x^t(x^t)^T$ with $\gamma \leftarrow
-\gamma\cdot c$. The (RE)-condition is then equivalent to the assumption made in
-the Lasso analysis of~\cite{bickel:2009}.
+\gamma\cdot c$.
\paragraph{(RE) with high probability}
@@ -281,19 +279,18 @@ this probabilistic model. Several recent papers show that large classes of
correlated designs obey the restricted eigenvalue property with high
probability \cite{raskutti:10, rudelson:13}.
-The (RE)-condition is well-behaved in the following sense: if it holds for the
-expected Hessian matrix $\E[\nabla^2\mathcal{L}(\theta^*)]$, then it holds for
-the finite sample Hessian matrix $\nabla^2\mathcal{L}(\theta^*)$ with high
+The {\bf(RE)}-condition is well-behaved in the following sense: if it holds for
+the expected Hessian matrix $\E[\nabla^2\mathcal{L}(\theta^*)]$, then it holds
+for the finite sample Hessian matrix $\nabla^2\mathcal{L}(\theta^*)$ with high
probability.
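Schematically, and with all constants deliberately left unspecified (this is
only meant to illustrate the shape of the transfer results in
\cite{raskutti:10, rudelson:13}, not to restate them):
\begin{equation*}
  \E[\nabla^2\mathcal{L}(\theta^*)] \text{ satisfies } (S,\gamma)\text{-}{\bf(RE)}
  \;\Longrightarrow\;
  \nabla^2\mathcal{L}(\theta^*) \text{ satisfies } (S,\tfrac{\gamma}{2})\text{-}{\bf(RE)}
  \text{ w.h.p.},
\end{equation*}
provided the number of observations exceeds a constant multiple of $s\log m$.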
-Note that the expected Hessian matrix is exactly the Fisher
-Information matrix of the generalized linear cascade model which captures the
-amount of information about $\theta^*$ conveyed by the random observations.
-The (RE)-condition has an even more natural interpretation when $f$ and $1-f$
-are strictly log-convex, since the expected \emph{gram matrix} $A \equiv
-\mathbb{E}[X^T X]$ is a matrix whose entry $a_{i,j}$ is the average number of
-times node $i$ and node $j$ are infected at the same time in the cascade
-process, and $a_{i,i}$ is the average number of times node $i$ is infected.
+If $f$ and $1-f$ are strictly log-convex, then the previous observation yields
+the following natural interpretation of the {\bf(RE)}-condition: the expected
+\emph{Gram matrix} $A \equiv \mathbb{E}[X^T X]$ is a matrix whose entry
+$a_{i,j}$ is the average fraction of times node $i$ and node $j$ are infected
+at the same time during several runs of the cascade process. Note that the
+diagonal terms $a_{i,i}$ are simply the average fraction of times the
+corresponding node $i$ is infected.
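Concretely, since each $x^t$ is the binary indicator vector of the set of
active nodes at time $t$, this interpretation is just an explicit rewriting of
the definition of $A$ (no additional claim is made here):
\begin{equation*}
  a_{i,j} = \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}\E\left[x^t_i x^t_j\right]
          = \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}}
            \Pr\left[x^t_i = 1 \text{ and } x^t_j = 1\right],
\end{equation*}
with the diagonal case $j=i$ giving the average fraction of times node $i$ is
active.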
Therefore, under an assumption which only involves the probabilistic model and
not the actual observations, Proposition~\ref{prop:fi} allows us to draw the