| author | Thibaut Horel <thibaut.horel@gmail.com> | 2015-02-05 23:22:36 -0500 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2015-02-05 23:22:36 -0500 |
| commit | 190cd5633e7e3237be118c18d374b5542aa38be4 (patch) | |
| tree | cec7f4bce86caa234585539657592821098da4d8 /paper | |
| parent | 1b76bfe6f4c0335fcf04a9f683949a3d412e05cd (diff) | |
| download | cascades-190cd5633e7e3237be118c18d374b5542aa38be4.tar.gz | |
Polishing
Diffstat (limited to 'paper')
| -rw-r--r-- | paper/paper.tex | 4 |
| -rw-r--r-- | paper/sections/discussion.tex | 22 |
| -rw-r--r-- | paper/sections/experiments.tex | 15 |
| -rw-r--r-- | paper/sections/intro.tex | 55 |
| -rw-r--r-- | paper/sections/lowerbound.tex | 6 |
| -rw-r--r-- | paper/sections/model.tex | 23 |
| -rw-r--r-- | paper/sections/results.tex | 84 |
| -rw-r--r-- | paper/sparse.bib | 24 |
8 files changed, 173 insertions, 60 deletions
diff --git a/paper/paper.tex b/paper/paper.tex
index 6e94164..20e8f08 100644
--- a/paper/paper.tex
+++ b/paper/paper.tex
@@ -127,8 +127,8 @@ \label{sec:linear_threshold}
 \input{sections/discussion.tex}
-\section{Appendix}
-\input{sections/appendix.tex}
+%\section{Appendix}
+%\input{sections/appendix.tex}
 \bibliography{sparse}
diff --git a/paper/sections/discussion.tex b/paper/sections/discussion.tex
index 25d9c53..b2947a1 100644
--- a/paper/sections/discussion.tex
+++ b/paper/sections/discussion.tex
@@ -1,7 +1,15 @@
 \paragraph{Future Work}
-Solving the Graph Inference problem with sparse recovery techniques opens new venues for future work. Firstly, the sparse recovery literature has already studied regularization patterns beyond the $\ell-1$-norm, notably the thresholded and adaptive lasso \cite{vandegeer:2011} \cite{Zou:2006}. Another series of papers that are directly relevant to the Graph Inference setting have shown that confidence intervals can be established for the lasso. Finally, the linear threshold model is a commonly studied diffusion process and can also be cast as a \emph{generalized linear cascade} with inverse link function $z \mapsto \mathbbm{1}_{z > 0}$:
+Solving the Graph Inference problem with sparse recovery techniques opens new
+avenues for future work. Firstly, the sparse recovery literature has already
+studied regularization patterns beyond the $\ell_1$-norm, notably the
+thresholded and adaptive lasso \cite{vandegeer:2011, Zou:2006}. Another goal
+would be to obtain confidence intervals for our estimator, similarly to what
+has been obtained for the Lasso in the recent series of papers
+\cite{javanmard2014, zhang2014}.
+
+Finally, the linear threshold model is a commonly studied diffusion process and can also be cast as a \emph{generalized linear cascade} with inverse link function $z \mapsto \mathbbm{1}_{z > 0}$:
 \begin{equation}
 \label{eq:lt}
@@ -9,7 +17,15 @@ Solving the Graph Inference problem with sparse recovery ve
   X^{t+1}_j = \text{sign} \left(\inprod{\theta_j}{X^t} - t_j \right)
 \end{equation}
-This model therefore falls into the 1-bit compressed sensing model \cite{Boufounos:2008} framework. Several recent papers study the theoretical guarantees obtained for 1-bit compressed sensing with specific measurements \cite{Gupta:2010}, \cite{Plan:2014}. Whilst they obtained bounds of the order ${\cal O}(n \log \frac{n}{s}$), no current theory exists for recovering positive bounded signals from bernoulli hyperplanes. This research direction may provide the first clues to solve the ``active learning'' problem: if we are allowed to adaptively \emph{choose} the source nodes at the beginning of each cascade, can we improve on current results?
+This model therefore falls into the 1-bit compressed sensing model
+\cite{Boufounos:2008} framework. Several recent papers study the theoretical
+guarantees obtained for 1-bit compressed sensing with specific measurements
+\cite{Gupta:2010, Plan:2014}. Whilst they obtained bounds of the order
+${\cal O}(n \log \frac{n}{s})$, no current theory exists for recovering
+positive bounded signals from Bernoulli hyperplanes. This research direction
+may provide the first clues to solve the ``adaptive learning'' problem: if we
+are allowed to adaptively \emph{choose} the source nodes at the beginning of
+each cascade, how much can we improve the current results?
 \begin{comment}
 The Linear Threshold model can \emph{also} be cast a generalized linear cascade model.
 However, as we show below, its link function is non-differentiable and necessitates a different analysis.
 In the Linear Threshold Model, each node $j\in V$ has a threshold $t_j$ from the interval $[0,1]$ and for each node, the sum of incoming weights is less than $1$: $\forall j\in V$, $\sum_{i=1}^m \Theta_{i,j} \leq 1$.
@@ -29,4 +45,4 @@ where we defined again $\theta_j\defeq (\Theta_{1,j},\ldots,\Theta_{m,j})$. In o
 \end{equation}
 The link function of the linear threshold model is the sign function: $z \mapsto \mathbbm{1}_{z > 0}$. This model therefore falls into the 1-bit compressed sensing model \cite{Boufounos:2008} framework. Several recent papers study the theoretical guarantees obtained for 1-bit compressed sensing with specific measurements \cite{Gupta:2010}, \cite{Plan:2014}. Whilst they obtained bounds of the order ${\cal O}(n \log \frac{n}{s}$), no current theory exists for recovering positive bounded signals from bernoulli hyperplanes. We leave this research direction to future work.
-\end{comment}
\ No newline at end of file
+\end{comment}
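For illustration, a single linear threshold update as written in the hunk above can be read as one 1-bit measurement of the unknown weight vector $\theta_j$. A minimal Python sketch of one such measurement follows; all names and sizes are hypothetical and this is not part of the commit or the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 50                                     # number of nodes (hypothetical)
theta_j = np.zeros(m)                      # incoming weights of node j
support = rng.choice(m, size=5, replace=False)
theta_j[support] = rng.dirichlet(np.ones(5)) * 0.9   # sum of incoming weights <= 1
t_j = rng.uniform(0.0, 1.0)                # threshold of node j

x_t = (rng.random(m) < 0.3).astype(float)  # indicator of contagious nodes at time t

# X^{t+1}_j = sign(<theta_j, X^t> - t_j): a single 1-bit observation of theta_j.
x_next_j = 1.0 if theta_j @ x_t - t_j > 0 else 0.0
print(x_next_j)
```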
diff --git a/paper/sections/experiments.tex b/paper/sections/experiments.tex
index 7cb852c..48ffc02 100644
--- a/paper/sections/experiments.tex
+++ b/paper/sections/experiments.tex
@@ -13,7 +13,10 @@
 & \includegraphics[scale=.23]{figures/kronecker_l2_norm_nonsparse.pdf}\\
 (a) Barabasi-Albert & (b) Watts-Strogatz & (c) sparse Kronecker & (d) non-sparse Kronecker
 \end{tabular}
-\captionof{figure}{Figures (a) and (b) report the $f1$-score in $\log$ scale for 2 graphs: (a) Barabasi-Albert graph, $300$ nodes, $16200$ edges. (b) Watts-Strogatz graph, $300$ nodes, $4500$ edges. Figures (c) and (d) report the $\ell2$-norm $\|\hat \Theta - \Theta\|_2$ for a Kronecker graph which is: (c) exactly sparse (d) non-exactly sparse}
+\captionof{figure}{Figures (a) and (b) report the F$1$-score in $\log$ scale
+for 2 graphs: (a) Barabasi-Albert graph, $300$ nodes, $16200$ edges. (b)
+Watts-Strogatz graph, $300$ nodes, $4500$ edges. Figures (c) and (d) report the
+$\ell_2$-norm $\|\hat \Theta - \Theta\|_2$ for a Kronecker graph which is: (c) exactly sparse (d) non-exactly sparse}
 \label{fig:four_figs}
 \end{table*}
@@ -35,8 +38,14 @@ We did not benchmark against other known algorithms (\textsc{netrate} \cite{gome
 \paragraph{Graph Estimation}
 In the case of the \textsc{lasso}, \textsc{mle} and \textsc{sparse mle} algorithms, we construct the edges of $\hat {\cal G} : \cup_{j \in V} \{i : \Theta_{ij} > 0.1\}$, \emph{i.e} by thresholding. The true positives are the edges which appear both in ${\cal G}$ and $\hat {\cal G}$ and the true negatives are the edges which appear in neither. Finally, we report the F1-score$=2 \text{precision}\cdot\text{recall}/(\text{precision}+\text{recall})$, which considers the number of true edges recovered by the algorithm over the total number of edges returned by the algorithm (\emph{precision}) with the number of true edges recovered by the algorithm over the total number of edges it should have recovered (\emph{recall}).
-Over all experiments, \textsc{sparse mle} achieves higher rates of precision, recall, and f1-score. \textsc{sparse mle} is also robust to approximate sparsity, and displays a faster convergence of the $\ell2$-norm than any other benchmark. Interestingly, both \textsc{mle} and \textsc{sparse mle} perform exceptionally well on the Watts-Strogatz graph. The recovery rate converges at around $5000$ cascades, which is more than $15$ times the number of nodes. By contrast, \textsc{sparse mle} achieves a reasonable $f1$-score of $.75$ for roughly $500$ observed cascades.
+Over all experiments, \textsc{sparse mle} achieves higher rates of precision,
+recall, and F$1$-score. \textsc{sparse mle} is also robust to approximate
+sparsity, and displays a faster convergence of the $\ell_2$-norm than any other
+benchmark. Interestingly, both \textsc{mle} and \textsc{sparse mle} perform
+exceptionally well on the Watts-Strogatz graph. The recovery rate converges at
+around $5000$ cascades, which is more than $15$ times the number of nodes. By
+contrast, \textsc{sparse mle} achieves a reasonable F$1$-score of $.75$ for roughly $500$ observed cascades.
 \paragraph{Quantifying robustness}
-The previous experiments only considered graphs with strong edges. To test the algorithms in the approximately sparse case, we add sparse edges to the previous graphs according to a bernoulli variable of parameter $1/3$ for every non-edge, and drawing a weight uniformly from $[0,0.1]$.
-The results are reported in Figure~\ref{fig:four_figs} by plotting the convergence of the $\ell2$-norm error, and show that both the \textsc{lasso}, followed by \textsc{sparse mle} are the most robust to noise.
\ No newline at end of file
+The previous experiments only considered graphs with strong edges. To test the algorithms in the approximately sparse case, we add sparse edges to the previous graphs according to a Bernoulli variable of parameter $1/3$ for every non-edge, drawing a weight uniformly from $[0,0.1]$. The results are reported in Figure~\ref{fig:four_figs} by plotting the convergence of the $\ell_2$-norm error, and show that both the \textsc{lasso}, followed by \textsc{sparse mle}, are the most robust to noise.
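For concreteness, the thresholded edge-recovery evaluation described in the experiments hunk above (threshold $0.1$, then precision, recall and F1) can be sketched as follows; the variable names and toy data are hypothetical, not the paper's implementation.

```python
import numpy as np

def f1_score(theta_true, theta_hat, threshold=0.1):
    """F1 of edge recovery obtained by thresholding the estimated weights."""
    true_edges = theta_true > 0
    pred_edges = theta_hat > threshold
    tp = np.sum(true_edges & pred_edges)
    precision = tp / max(np.sum(pred_edges), 1)
    recall = tp / max(np.sum(true_edges), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy usage: a random sparse weight matrix and a noisy estimate of it.
rng = np.random.default_rng(0)
theta = (rng.random((20, 20)) < 0.1) * rng.uniform(0.2, 0.8, size=(20, 20))
theta_hat = theta + rng.normal(0.0, 0.05, size=theta.shape)
print(f1_score(theta, theta_hat))
```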
diff --git a/paper/sections/intro.tex b/paper/sections/intro.tex
index f4f2bd5..bbbed11 100644
--- a/paper/sections/intro.tex
+++ b/paper/sections/intro.tex
@@ -8,11 +8,11 @@ Tackling the graph inference problem means constructing a polynomial-time algori
 Throughout this paper, we focus on the three main discrete-time diffusion processes: the independent cascade model, the voter model, and the linear threshold model...
 \end{comment}
-Graphs have been extensively studied in their ability to propagate
-
-Networks and social networks in particular have become an increasingly common
-subject of study in the recent years.
-
+Graphs have been extensively studied for their propagative abilities:
+connectivity, routing, gossip algorithms, etc.
+%One question is: can we
+%understand and predict a diffusion process from the graph? Conversely, can we
+%learn a graph from the diffusion process?
 A diffusion process taking place over a graph provides valuable information
 about the presence and weights of its edges. \emph{Influence cascades} are
 a specific type of diffusion processes in which a particular infection behavior
@@ -68,19 +68,44 @@ Section~\ref{sec:experiments}.
 \paragraph{Related Work}
-The study of edge prediction in graph has been an active field of research for over a decade \cite{Nowell08} \cite{Leskovec07} \cite{AdarA05}. \cite{GomezRodriguez:2010} introduced the submodular {\scshape netinf} algorithm, which approximates the likelihood of cascades represented as a continuous process. The algorithm was later improved in later work \cite{gomezbalduzzi:2011}, but is not known to have any recovery guarantees beside empirical validation on synthetic networks. \cite{Netrapalli:2012} studied the discrete-time version of the independent cascade model and obtained the first ${\cal O}(s^2 \log m)$ guarantee on general networks. The algorithm is based around the same likelihood function we suggest, without the $\ell1$-norm penalty. However, the analysis depended strongly on a restrictive {\it correlation decay} assumption, which strongly limits the number of new infections at every step. In this restricted setting, they show a complex lower bound of the number of cascades needed for perfect support recovery with constant probability of the order $\Omega(s \log (m/s))$. They also suggest a {\scshape greedy} algorithm, which matches ${\cal O}(s \log m)$ guarantee in the case of \emph{tree} graphs.
+The study of edge prediction in graphs has been an active field of research for
+over a decade \cite{Nowell08} \cite{Leskovec07} \cite{AdarA05}.
+\cite{GomezRodriguez:2010} introduced the {\scshape Netinf} algorithm, which
+approximates the likelihood of cascades represented as a continuous process.
+The algorithm was improved in later work \cite{gomezbalduzzi:2011}, but is not
+known to have any recovery guarantees besides empirical validation on synthetic
+networks.
+\cite{Netrapalli:2012} studied the discrete-time version of the
+independent cascade model and obtained the first ${\cal O}(s^2 \log m)$
+recovery guarantee on general networks. The algorithm is based around the same
+likelihood function we suggest, without the $\ell_1$-norm penalty. Their
+analysis depends on a {\it correlation decay} assumption, which limits the
+number of new infections at every step. In this setting, they show a lower
+bound of the number of cascades needed for support recovery with constant
+probability of the order $\Omega(s \log (m/s))$. They also suggest a {\scshape
+Greedy} algorithm, which matches the ${\cal O}(s \log m)$ guarantee in the case
+of tree graphs. The work of \cite{Abrahao:13} studies the same continuous-model framework as
+\cite{GomezRodriguez:2010} and obtains an ${\cal O}(s^9 \log^2 s \log m)$ support
+recovery algorithm without the \emph{correlation decay} assumption.
+
+\begin{comment}
+They assume a single-source model, where only one source is selected at random, which is less realistic in practice since `patient 0' is rarely unknown to us.
+\end{comment}
-The work of \cite{Abrahao:13} study the same continuous-model framework as \cite{GomezRodriguez:2010} and obtain a ${\cal O}(s^9 \log^2 s \log m)$ support recovery algorithm without the \emph{correlation decay} assumption. Their work also studies the information leveraged by different `parts' of the cascade, showing that a surprisingly important amount of information can be gleaned from the first and last `infections' of the cascade. They assume a single-source model, where only one source is selected at random, which is less realistic in practice since `patient 0' is rarely unknown to us.
+Closest to this work is a recent paper by \cite{Daneshmand:2014}, wherein they
+consider an $\ell_1$-regularized objective function. They adapt standard results
+from sparse recovery to obtain a recovery bound of ${\cal O}(s^3 \log m)$ on
+their algorithm under an irrepresentability condition. Under stronger
+assumptions, they match the \cite{Netrapalli:2012} bound of ${\cal O}(s^2 \log
+m)$, by exploiting similar properties of the convex program's KKT conditions.
+In contrast, our work studies discrete-time diffusion processes including the
+Independent Cascade model under weaker assumptions. Furthermore, we analyze
+both the recovery of the graph's edges and the estimation of the model's
+parameters and achieve close to optimal bounds.
-Closest to this work is a recent paper by \cite{Daneshmand:2014}, they consider
-a $\ell_1$-norm penalty to their objective function, adapting standard results
-from sparse recovery to obtain a ${\cal O}(s^3 \log m)$ algorithm under an
-irrepresentability condition. With stronger assumptions, they match the
-\cite{Netrapalli:2012} bound of ${\cal O}(s^2 \log m)$, by exploiting
-a similar properties of the objective's KKT conditions. Their work has the merit of studying a generalization of the discrete-time independent cascade model to continuous functions. Similarly to \cite{Abrahao:13}, they place themselves in
+\begin{comment}
+Their work has the merit of studying a generalization of the discrete-time independent cascade model to continuous functions. Similarly to \cite{Abrahao:13}, they place themselves in
 the restrictive single-source context.
-
-TODO: add related works on lower bounds.
+\end{comment}
 \begin{comment}
 \paragraph{Our contributions}
diff --git a/paper/sections/lowerbound.tex b/paper/sections/lowerbound.tex
index fdee069..b714766 100644
--- a/paper/sections/lowerbound.tex
+++ b/paper/sections/lowerbound.tex
@@ -1,5 +1,5 @@
 In this section, we will consider the stable sparse recovery setting of
-Section~\ref{sec:relaxing}. Our goal is to obtain an information-theoretic
+Section~\ref{sec:relaxing_sparsity}. Our goal is to obtain an information-theoretic
 lower bound on the number of measurements necessary to approximately recover
 the parameter $\theta$ of a cascade model from observed cascades. Similar lower
 bounds were obtained for sparse linear inverse problems in \cite{pw11, pw12,
@@ -7,7 +7,7 @@ bipw11}
 \begin{theorem}
   \label{thm:lb}
-  Let us consider a cascade model of the form XXX and a recovery algorithm
+  Let us consider a cascade model of the form \eqref{eq:glm} and a recovery algorithm
  $\mathcal{A}$ which takes as input $n$ random cascade measurements and
  outputs $\hat{\theta}$ such that with probability $\delta<\frac{1}{2}$ (over
  the measurements):
@@ -19,7 +19,7 @@ bipw11}
  = \Omega(s\log\frac{m}{s}/\log C)$.
 \end{theorem}
-This theorem should be contrasted with Corollary~\ref{cor:relaxing}: up to an additive
+This theorem should be contrasted with Theorem~\ref{thm:approx_sparse}: up to an additive
 $s\log s$ factor, the number of measurements required by our algorithm is
 tight.
diff --git a/paper/sections/model.tex b/paper/sections/model.tex
index dbcbd8d..cf69960 100644
--- a/paper/sections/model.tex
+++ b/paper/sections/model.tex
@@ -52,10 +52,11 @@ A \emph{generalized linear cascade model} is a cascade model such that for each
 susceptible node $j$ in state $s$ at time step $t$, the probability of $j$
 becoming ``contagious'' at time step $t+1$ conditioned on $X^t$ is a Bernoulli
 variable of parameter $f(\theta_j \cdot X^t)$:
-\begin{displaymath}
+\begin{equation}
+  \label{eq:glm}
   \mathbb{P}(X^{t+1}_j = 1|X^t) = f(\theta_j \cdot X^t)
-\end{displaymath}
+\end{equation}
 where $f: \reals \rightarrow [0,1]$
 \end{definition}
@@ -128,9 +129,11 @@ with inverse link function $f : z \mapsto 1 - e^z$.
 In the Linear Voter Model, nodes can be either \emph{red} or \emph{blue}. Both
 states are symmetric, but without loss of generality, we can suppose that the
-\emph{blue} nodes are contagious. The parameters of the graph are normalized such that $\forall i,
-\ \sum_j \Theta_{i,j} = 1$. Each round, every node $j$ independently chooses
-one of its neighbors with probability $\Theta_{ij}$ and adopts their color. The
+\emph{blue} nodes are contagious. The parameters of the graph are normalized
+such that $\forall i, \ \sum_j \Theta_{i,j} = 1$ and we assume that
+$\Theta_{i,i}$ is always non-zero, meaning that all nodes have self-loops.
+Each round, every node $j$ independently chooses
+one of its neighbors with probability $\Theta_{i,j}$ and adopts their color. The
 cascades stops at a fixed horizon time $T$ or if all nodes are of the same
 color. If we denote by $X^t$ the indicator variable of the set of blue nodes at
 time step $t$, then we have:
@@ -139,7 +142,7 @@ step $t$, then we have:
   \tag{V}
 \end{equation}
-Therefore, the independent cascade model is a Generalized Linear Cascade model
+Thus, the linear voter model is a Generalized Linear Cascade model
 with inverse link function $f: z \mapsto z$.
 % \subsection{The Linear Threshold Model}
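For illustration, one time step of a generalized linear cascade (equation (eq:glm) above) can be simulated as below, using the two inverse link functions introduced in model.tex: $f(z) = 1 - e^z$ with $\Theta_{i,j} = \log(1 - p_{i,j})$ for the independent cascade model, and $f(z) = z$ for the voter model. This is a hypothetical sketch, not the authors' code, and it ignores the full susceptible/contagious bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)

def cascade_step(Theta, x_t, f):
    """One generalized linear cascade step: node j turns contagious w.p. f(<theta_j, x_t>)."""
    z = Theta.T @ x_t                     # <theta_j, x_t> for every node j
    return (rng.random(len(x_t)) < f(z)).astype(float)

f_ic = lambda z: 1.0 - np.exp(z)          # independent cascade: f(z) = 1 - e^z, with z <= 0
f_voter = lambda z: z                     # voter model: f(z) = z

m = 10
p = (rng.random((m, m)) < 0.3) * 0.4      # hypothetical infection probabilities p_{i,j}
Theta_ic = np.log(1.0 - p)                # Theta_{i,j} = log(1 - p_{i,j}) as in the text
x0 = np.zeros(m)
x0[0] = 1.0                               # a single source node
print(cascade_step(Theta_ic, x0, f_ic))   # nodes newly infected at t = 1

W = rng.random((m, m)) + np.eye(m)                # positive weights with self-loops
Theta_voter = W / W.sum(axis=0, keepdims=True)    # each node's incoming weights sum to 1
print(cascade_step(Theta_voter, x0, f_voter))
```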
@@ -204,6 +207,13 @@ where we denote by ${\cal T}_i$ the time steps at which node $i$ is susceptible
   + (1 - x_i^{t+1})\log\big(1-f(\inprod{\theta_i}{x^t})\big)
 \end{multline*}
+In the case of the voter model, the measurements include all time steps until
+we reach the time horizon $T$ or the graph coalesces to a single state. In the
+case of the independent cascade model, the measurements include all time steps
+until node $i$ becomes contagious after which its behavior is deterministic.
+Contrary to prior work, we express our results in terms of the number of
+measurements and not the number of cascades.
+
 A sufficient condition for program~\eqref{eq:pre-mle} to be a convex program
 is that $\mathcal{L}_i$ is a concave function. This will be the case if, for
 example, $f$ and $(1-f)$ are both log-concave functions. Remember that
@@ -211,4 +221,3 @@ a twice-differentiable function $f$ is log concave iff. $f''f \leq f'^2$.
 It is easy to verify this property for $f$ and $(1-f)$ in the Independent
 Cascade Model and Voter Model.
-Note that in the case of the voter model, TODO horizon time. In the case of the independent cascade model...
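The log-likelihood shown in the hunk above, combined with an $\ell_1$ penalty as in program (eq:pre-mle), suggests the following hedged sketch for a single node of the voter model ($f(z) = z$). The solver choice, normalization, and all names are assumptions of ours; the commit contains no implementation.

```python
import numpy as np
from scipy.optimize import minimize

def sparse_mle_voter(X, y, lam, eps=1e-9):
    """Sketch of the l1-regularized MLE for one node in the voter model.

    X: n x m matrix of active-node indicators over the node's susceptible steps.
    y: length-n 0/1 vector of the node's observed outcomes.
    Since the weights are constrained to [0, 1], the l1 penalty equals their sum,
    so a box-constrained smooth solver suffices for this illustration.
    """
    n, m = X.shape

    def objective(theta):
        z = np.clip(X @ theta, eps, 1.0 - eps)
        loglik = np.sum(y * np.log(z) + (1.0 - y) * np.log(1.0 - z))
        return -loglik / n + lam * np.sum(theta)

    theta0 = np.full(m, 1.0 / m)
    res = minimize(objective, theta0, method="L-BFGS-B", bounds=[(0.0, 1.0)] * m)
    return res.x

# Toy usage on synthetic voter-model measurements.
rng = np.random.default_rng(0)
n, m = 500, 30
theta_true = np.zeros(m)
theta_true[:3] = [0.4, 0.3, 0.2]
X = (rng.random((n, m)) < 0.3).astype(float)
y = (rng.random(n) < np.clip(X @ theta_true, 0.0, 1.0)).astype(float)
print(np.round(sparse_mle_voter(X, y, lam=0.05)[:5], 2))
```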
diff --git a/paper/sections/results.tex b/paper/sections/results.tex
index 8eda03f..e5aa861 100644
--- a/paper/sections/results.tex
+++ b/paper/sections/results.tex
@@ -35,8 +35,13 @@ write $S\defeq\text{supp}(\theta^*)$ and $s=|S|$. Recall, that $\theta_i$ is the
   \end{equation}
 \end{definition}
-This condition, well-known in the sparse recovery literature, captures the idea that the binary vectors of active nodes are not \emph{too} colinear with each other, an essentially necessary condition for recovering $\theta$. In fact, the $(S,\gamma)$-{\bf(RE)} assumption is closely linked to the \emph{Fisher Information Matrix}, which captures the amount of information carried by an observable random variable (here the set of active nodes) about an unknown parameter $\theta$ on which the random variable depends.
-A discussion of the $(S,\gamma)$-{\bf(RE)} assumption in the context of generalized linear cascade models can be found in Section~\ref{sec:re}. We will also need the following assumption on the inverse link function $f$ of the generalized linear cascade model:
+A discussion of the $(S,\gamma)$-{\bf(RE)} assumption in the context of
+generalized linear cascade models can be found in Section~\ref{sec:re}. In our
+setting we will require that this property holds for the Hessian of the
+log-likelihood function $\mathcal{L}$ and essentially captures that the binary
+vectors of the set of active nodes are not \emph{too} collinear.
+
+We will also need the following assumption on the inverse link function $f$ of the generalized linear cascade model:
 \begin{equation}
 \tag{LF}
 \max\left(\left|\frac{f'}{f}(\inprod{\theta^*}{x})\right|,
@@ -48,10 +53,16 @@ $f(\inprod{\theta^*}{x})\notin\{0,1\}$.
 In the voter model, $\frac{f'}{f}(z) = \frac{1}{z}$ and $\frac{f'}{f}(1-z)
 = \frac{1}{1-z}$. Hence (LF) will hold as soon as $\alpha\leq \Theta_{i,j}\leq
-1-\alpha$ for all $(i,j)\in E$. Similarly, in the Independent Cascade Model,
-$\frac{f'}{f}(z) = \frac{1}{1-e^z}$ and $\frac{f'}{1-f}(z) = -1$ and (LF) holds
-if $p_{i, j}\geq 1-\alpha$ for all $(i, j)\in E$. Remember that in this case we
-have $\Theta_{i,j} = \log(1-p_{i,j})$. These assumptions are reasonable: if an edge has a weight very close to 0, then the ``infection'' will never happen along that edge for our set of observations and we can never hope to recover it.
+1-\alpha$ for all $(i,j)\in E$ which is always satisfied for some $\alpha$.
+Similarly, in the Independent Cascade Model, $\frac{f'}{f}(z)
+= \frac{1}{1-e^{-z}}$ and $\frac{f'}{1-f}(z) = -1$ and (LF) is not restrictive.
+
+\begin{comment}
+Remember that in this case we have $\Theta_{i,j} = \log(1-p_{i,j})$. These
+assumptions are reasonable: if an edge has a weight very close to 0, then the
+``infection'' will never happen along that edge for our set of observations and
+we can never hope to recover it.
+\end{comment}
 \begin{theorem}
 \label{thm:main}
@@ -69,7 +80,10 @@ $\alpha > 0$. For any $\delta\in(0,1)$, let $\hat{\theta}$ be the solution of
   \end{equation}
 \end{theorem}
-Theorem~\ref{thm:main} states that, under the two previous assumptions, the decay rate of the error term is ${\cal O}(1/\sqrt{n})$, which is believed to be optimal. Note how we have expressed the bound in the number of measurements $n$, which is different from the number of cascades! For example, in the case of the voter model with horizon time $T$ and for $N$ cascades, we can expect $N\times T$ measurements.
+Note that we have expressed the convergence rate in the number of measurements
+$n$, which is different from the number of cascades. For example, in the case
+of the voter model with horizon time $T$ and for $N$ cascades, we can expect
+a number of measurements proportional to $N\times T$.
 Before moving to the proof of Theorem~\ref{thm:main}, note that interpreting it
 in the case of the Independent Cascade Model requires one more step. Indeed, to
@@ -164,7 +178,10 @@ Choosing $\lambda\defeq 2\sqrt{\frac{\log m}{\alpha n^{1-\delta}}}$ concludes the
 proof.
 \end{proof}
-Note how the proof of Lemma~\ref{lem:ub} relied crucially on Azuma-Hoeffding's inequality, which allows us to handle correlated observations, and obtain bounds on the number of measurements rather than the number of cascades. We now show how to use Theorem~\ref{thm:main} to recover the support of $\theta^*$, that is, to solve the Graph Inference problem.
+Note how the proof of Lemma~\ref{lem:ub} relied crucially on Azuma-Hoeffding's
+inequality, which allows us to handle correlated observations. We now show how
+to use Theorem~\ref{thm:main} to recover the support of $\theta^*$, that is, to
+solve the Graph Inference problem.
 \begin{corollary}
 \label{cor:variable_selection}
@@ -223,26 +240,32 @@ which is also a consequence of Theorem 1 in \cite{Negahban:2009}.
 \begin{theorem}
 \label{thm:approx_sparse}
-Let $\theta^*_{\lfloor s \rfloor}$ be the best s-sparse approximation to the true vector $\theta^*$.
-Suppose the {\bf(RE)} assumption holds for the Hessian $\nabla^2 f(\theta^*)$ and for the following set
+Suppose the relaxed {\bf(RE)} assumption holds for the Hessian $\nabla^2
+f(\theta^*)$ and for the following set:
 \begin{align}
 \nonumber {\cal C}' \defeq & \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 + 4 \|\theta^*_{\lfloor s \rfloor}\|_1 \} \\
 \nonumber & \cap \{ \|X\|_1 \leq 1 \}
 \end{align}
-By solving \eqref{eq:pre-mle} for $\lambda \defeq 2\sqrt{\frac{\log m}{\alpha n^{1 - \delta} p_{\min}}}$ we have:
-\begin{align}
-\|\hat p - p^* \|_2 \leq 3 \frac{\sqrt{s}\lambda_n}{\gamma_n} + 2 \sqrt{\frac{\lambda_n}{\gamma_n}} \|p^*_{\lfloor s \rfloor}\|_1
-\end{align}
+By solving \eqref{eq:pre-mle} for $\lambda \defeq 2\sqrt{\frac{\log m}{\alpha n^{1 - \delta}}}$ we have:
+\begin{align*}
+  \|\hat \theta - \theta^* \|_2 \leq
+  \frac{3}{\gamma} \sqrt{\frac{s\log m}{\alpha n^{1-\delta}}}
+  + \frac{2}{\gamma} \sqrt[4]{\frac{s\log m}{\alpha n^{1-\delta}}} \|\theta^* - \theta^*_{\lfloor s \rfloor}\|_1
+\end{align*}
 \end{theorem}
 As before, edge recovery is a consequence of upper-bounding $\|\theta - \hat \theta\|_2$
 \begin{corollary}
-If $\theta$ is not exactly s-sparse and the number of measurements verifies:
-\begin{equation}
-n > \frac{36 \| p^*_{\lfloor s\rfloor}\|_1}{p_{\min}\gamma^2 \epsilon^4} s \log m
+  Under the same assumptions as Theorem~\ref{thm:approx_sparse} the number of
+  measurements verifies: \begin{equation}
+    n > \frac{9}{\alpha\gamma^2\epsilon^2}\left(1+
+    \frac{4}{\epsilon^2}\| \theta^* - \theta^*_{\lfloor
+    s\rfloor}\|_1\right)s\log m
 \end{equation}
-then similarly: ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta \subset {\cal S}^*$ w.h.p.
+then similarly: ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta
+\subset {\cal S}^*$ w.p. at least $1-\frac{1}{m}$.
 \end{corollary}
@@ -303,7 +326,7 @@ eigenvalue property, then the finite sample hessian also verifies the
 restricted eigenvalue property with overwhelming probability. It is
 straightforward to show this holds when $n \geq C s^2 \log m$
 \cite{vandegeer:2009}, where $C$ is an absolute constant. By adapting Theorem
-8 \cite{rudelson:13}\footnote{this result still needs to be confirmed!}, this
+8 \cite{rudelson:13}, this
 can be reduced to:
 \begin{displaymath}
 n \geq C s \log m \log^3 \left( \frac{s \log m}{C'} \right)
@@ -312,7 +335,9 @@ where $C, C'$ are constants not depending on $(s, m, n)$.
 \paragraph{(RE) vs Irrepresentability Condition}
-\cite{Daneshmand:2014} rely on an `incoherence' condition on the hessian of the likelihood function. It is in fact easy to see that their condition is equivalent to the more commonly called {\it (S,s)-irrepresentability} condition:
+\cite{Daneshmand:2014} rely on an `incoherence' condition on the hessian of the
+likelihood function. Their condition is equivalent to the more commonly called
+{\it (S,s)-irrepresentability} condition:
 \begin{definition}
 Following similar notation, let $Q^* \defeq \nabla^2 f(\theta^*)$. Let $S$ and $S^c$ be the set of indices of all the parents and non-parents respectively and $Q_{S,S}$, $Q_{S^c,S}$, $Q_{S, S^c}$, and $Q_{S^c, S^c}$ the induced sub-matrices. Consider the following constant:
@@ -322,7 +347,20 @@ Following similar notation, let $Q^* \defeq \nabla^2 f(\theta^*)$. Let $S$ and $
 The (S,s)-irrepresentability holds if $\nu_{\text{irrepresentable}}(S) < 1 - \epsilon$ for $\epsilon > 0$
 \end{definition}
Several recent papers \cite{vandegeer:2011}, \cite{Zou:2006}, argue this condition is unrealistic in situations where there is a correlation between variables. Consider the following simplified example from \cite{vandegeer:2011}: +It is intuitive that the irrepresentability condition is stronger than the +{\bf(RE)} assumption. In fact, a slightly modified result from +\cite{vandegeer:2009} shows that a `strong' irrepresentability condition +directly {\it implies} the {\bf(RE)} condition for $\ell_2$-recovery: + +\begin{proposition} +\label{prop:irrepresentability} +If the irrepresentability condition holds with $\epsilon > \frac{2}{3}$, then the restricted eigenvalue condition holds with constant $\gamma_n \geq \frac{ (1 - 3(1 -\epsilon))^2 \lambda_{\min}^2}{4s}n$, where $\lambda_{\min} > 0$ is the smallest eigenvalue of $Q^*_{S,S}$, on which the results of \cite{Daneshmand:2014} also depend. +\end{proposition} + +Furthermore, recent papers \cite{vandegeer:2011}, \cite{Zou:2006}, argue that +the irrepresentability condition is unrealistic in situations where there is +a correlation between variables. Consider the following simplified example from +\cite{vandegeer:2011}: \begin{equation} \nonumber \left( @@ -334,12 +372,6 @@ I_s & \rho J \\ \end{equation} where $I_s$ is the $s \times s$ identity matrix, $J$ is the all-ones matrix and $\rho \in \mathbb{R}^+$. It is easy to see that $\nu_{\text{irrepresentable}}(S) = \rho s$ and $\lambda_{\min}(Q) \geq 1 - \rho$, such that for any $\rho > \frac{1}{s}$ and $\rho < 1$, the restricted eigenvalue holds trivially but the (S,s)-irrepresentability does not hold. -It is intuitive that the irrepresentability condition is stronger than the {\bf(RE)} assumption. In fact, a slightly modified result from \cite{vandegeer:2009} shows that a `strong' irrepresentability condition directly {\it implies} the {\bf(RE)} condition for $\ell2$-recovery: - -\begin{proposition} -\label{prop:irrepresentability} -If the irrepresentability condition holds with $\epsilon > \frac{2}{3}$, then the restricted eigenvalue condition holds with constant $\gamma_n \geq \frac{ (1 - 3(1 -\epsilon))^2 \lambda_{\min}^2}{4s}n$, where $\lambda_{\min} > 0$ is the smallest eigenvalue of $Q^*_{S,S}$, on which the results of \cite{Daneshmand:2014} also depend. -\end{proposition} \begin{comment} \begin{lemma} diff --git a/paper/sparse.bib b/paper/sparse.bib index d1622fc..5df4b59 100644 --- a/paper/sparse.bib +++ b/paper/sparse.bib @@ -478,4 +478,26 @@ year = "2009" timestamp = {Thu, 16 Feb 2012 12:06:08 +0100}, biburl = {http://dblp.uni-trier.de/rec/bib/conf/stoc/Kleinberg00}, bibsource = {dblp computer science bibliography, http://dblp.org} -}
diff --git a/paper/sparse.bib b/paper/sparse.bib
index d1622fc..5df4b59 100644
--- a/paper/sparse.bib
+++ b/paper/sparse.bib
@@ -478,4 +478,26 @@ year = "2009"
   timestamp = {Thu, 16 Feb 2012 12:06:08 +0100},
   biburl = {http://dblp.uni-trier.de/rec/bib/conf/stoc/Kleinberg00},
   bibsource = {dblp computer science bibliography, http://dblp.org}
-}
\ No newline at end of file
+}
+
+@article{zhang2014,
+  title={Confidence intervals for low dimensional parameters in high dimensional linear models},
+  author={Zhang, Cun-Hui and Zhang, Stephanie S},
+  journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
+  volume={76},
+  number={1},
+  pages={217--242},
+  year={2014},
+  publisher={Wiley Online Library}
+}
+
+@article{javanmard2014,
+  title={Confidence intervals and hypothesis testing for high-dimensional regression},
+  author={Javanmard, Adel and Montanari, Andrea},
+  journal={The Journal of Machine Learning Research},
+  volume={15},
+  number={1},
+  pages={2869--2909},
+  year={2014},
+  publisher={JMLR. org}
+}
