| author | jeanpouget-abadie <jean.pougetabadie@gmail.com> | 2015-01-28 22:27:09 -0500 |
|---|---|---|
| committer | jeanpouget-abadie <jean.pougetabadie@gmail.com> | 2015-01-28 22:27:09 -0500 |
| commit | de7df38121cc8a39ee72899c9ed0628421dab7e1 | |
| tree | cc532c434848b0abce0732d1a4438d1a890ed294 | |
| parent | a9972e9bf028bc337ffd448791b242f1c92cc21c | |
| download | cascades-de7df38121cc8a39ee72899c9ed0628421dab7e1.tar.gz | |
updated results section
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | paper/paper.tex | 5 |
| -rw-r--r-- | paper/sections/appendix.tex | 36 |
| -rw-r--r-- | paper/sections/assumptions.tex | 7 |
| -rw-r--r-- | paper/sections/results.tex | 74 |
| -rw-r--r-- | paper/sparse.bib | 39 |

5 files changed, 132 insertions, 29 deletions
diff --git a/paper/paper.tex b/paper/paper.tex
index 04bb46f..d2fdcb9 100644
--- a/paper/paper.tex
+++ b/paper/paper.tex
@@ -62,6 +62,7 @@
 \newtheorem{lemma}{Lemma}
 \newtheorem{corollary}{Corollary}
 \newtheorem{remark}{Remark}
+\newtheorem{proposition}{Proposition}
 
 \begin{document}
@@ -108,6 +109,7 @@ Summarize contributions:
 \input{sections/results.tex}
 
 \section{A lower bound}
+\label{sec:lowerbound}
 
 \section{Assumptions}
 \label{sec:assumptions}
@@ -117,6 +119,9 @@ Summarize contributions:
 
 \section{Discussion}
 
+\section{Appendix}
+\input{sections/appendix.tex}
+
 \bibliography{sparse}
 \bibliographystyle{icml2015}
diff --git a/paper/sections/appendix.tex b/paper/sections/appendix.tex
new file mode 100644
index 0000000..80692d0
--- /dev/null
+++ b/paper/sections/appendix.tex
@@ -0,0 +1,36 @@
+\subsection*{Proof of support recovery lemma}
+
+
+\subsection*{Upper bound for $\|\nabla f(\theta^*)\|_\infty$}
+
+We show an upper bound for $2 \|\nabla f(\theta^*)\|_\infty$. If we only consider the first measurement of every cascade, this can be done easily. Let $N$ be the number of cascades (to distinguish it from $n$, the total number of measurements). Begin by noting that:
+
+\begin{align}
+\nabla_k f(\theta) & = \sum^n_{i=1} \frac{b^i x^i_k}{1 - e^{\langle x^i, \theta \rangle}} - \sum^n_{i=1} x^i_k \nonumber \\
+& = \sum_{i=1}^n x^i_k \left( \frac{b^i}{\mathbb{P}(\text{node } \alpha \text{ infected})} - 1\right) \nonumber
+\end{align}
+
+\begin{lemma}
+\label{lem:subgaussian_variable}
+$\nabla f(\theta^*)$ is an $N/p_{\min}$-subgaussian variable, where $p_{\min}$ is the smallest non-zero link weight to node $\alpha$.
+\end{lemma}
+
+\begin{proof}
+We show that its expectation is $0$ by conditioning on the seed set $S$:
+
+\begin{align}
+\mathbb{E} \left( \nabla_k f(\theta) \right) & = \sum_{i=1}^N \mathbb{E} \left[ x^i_k \left( \frac{b^i}{\mathbb{P}(\text{node } \alpha \text{ infected})} - 1\right) \right] \nonumber \\
+& = \sum_{i=1}^N \mathbb{E}_S \left[ \mathbb{E}\left[x^i_k \left( \frac{b^i}{\mathbb{P}(\alpha \text{ infected})} - 1\right) \middle| S \right] \right] \nonumber \\
+& = \sum_{i=1}^N \mathbb{E}\left[x^i_k \left( \frac{ \mathbb{E} \left[ b^i \middle| S \right]}{\mathbb{P}(\alpha \text{ infected})} - 1\right) \right] \nonumber \\
+& = 0 \nonumber
+\end{align}
+Therefore, each coordinate $\nabla_k f(\theta^*)$ is a sum of zero-mean variables, each bounded by $1/p_{\min}$. It follows that $\nabla f(\theta^*)$ is $N/p_{\min}$-subgaussian.
+\end{proof}
+
+By a union bound and the standard tail characterization of sub-gaussian variables:
+
+\begin{equation}
+\mathbb{P}(\| \nabla f(\theta) \|_{\infty} > \lambda) \leq 2 \exp \left( -\frac{\lambda^2 p_{\min}}{2n} + \log p \right)
+\end{equation}
+
+In conclusion, for $\delta>0$, $\lambda := 2 \sqrt{\frac{n^{\delta + 1} \log p}{p_{\min}}}$ is a valid regularizer with probability $1 - \exp(-n^\delta \log p)$: plugging this $\lambda$ into the bound above gives an exponent of $-2 n^{\delta} \log p + \log p$, and $2\exp(-2 n^{\delta} \log p + \log p) \leq \exp(-n^{\delta} \log p)$ as soon as $n^\delta \geq 2$.
diff --git a/paper/sections/assumptions.tex b/paper/sections/assumptions.tex
index eccd095..d6fed83 100644
--- a/paper/sections/assumptions.tex
+++ b/paper/sections/assumptions.tex
@@ -4,3 +4,10 @@ Proving the restricted eigenvalue assumption for correlated measurements is non-
 
 \subsection{The Irrepresentability Condition}
 
+If our objective is to recover only the support of the graph, and to do so exactly, then the irrepresentability condition\footnote{Referred to as {\it incoherence} in the text.} used in \cite{Daneshmand:2014} is believed to be necessary.
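For reference, a standard formulation of the irrepresentability (incoherence) condition (in our notation, not necessarily the exact form used in \cite{Zhao:2006} or \cite{Daneshmand:2014}) requires, for the relevant symmetric matrix $\Sigma$, true support $S$, and some $\epsilon > 0$,

\begin{equation} \nonumber
\left\| \Sigma_{S^c S} \left( \Sigma_{S S} \right)^{-1} \mathrm{sign}(\theta^*_S) \right\|_\infty \leq 1 - \epsilon.
\end{equation}

The condition couples every column outside the support with the true support, which is why it tends to fail when measurements are strongly correlated.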
+
+However, if we only wish to recover all edges above a certain minimum weight, then bounding the $\ell_2$-error suffices. Furthermore, it is intuitive that the irrepresentability condition is stronger than our suggested {\bf(RE)} assumption. In fact, by slightly adapting the following result from \cite{vandegeer:2009}, we can show that a `strong' irrepresentability condition directly {\it implies} the {\bf(RE)} condition for $\ell_2$-recovery.
+
+\begin{proposition}
+If the irrepresentability condition holds with $\epsilon > \frac{2}{3}$, then the restricted eigenvalue condition holds with constant $(1 - 3\epsilon)^2 \lambda_{\min} n/s$.
+\end{proposition}
\ No newline at end of file
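To make the two conditions concrete, here is a small numerical sanity check. Everything in it (the toy matrix, support size, and sampling scheme) is made up for illustration and is not taken from the paper; it computes the classical irrepresentability statistic and a crude Monte-Carlo estimate of the {\bf(RE)} constant over the cone ${\cal C}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, p, s = 500, 20, 3
A = rng.standard_normal((n_samples, p))
Sigma = A.T @ A / n_samples          # toy symmetric matrix standing in for the Hessian
S, Sc = np.arange(s), np.arange(s, p)

# Irrepresentability statistic: max row-wise l1 norm of Sigma_{S^c S} Sigma_{S S}^{-1}
M = Sigma[np.ix_(Sc, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])
print("irrepresentability statistic (want < 1):", np.abs(M).sum(axis=1).max())

# Crude Monte-Carlo lower estimate of min ||Sigma x||_2^2 over unit vectors in the
# cone C = {x : ||x_{S^c}||_1 <= 3 ||x_S||_1}; the ratio is scale-invariant, so the
# additional l1-ball constraint does not change it.
best = np.inf
for _ in range(20000):
    x = rng.standard_normal(p)
    budget = 3 * np.abs(x[S]).sum()
    norm_sc = np.abs(x[Sc]).sum()
    if norm_sc > budget:              # shrink the off-support part into the cone
        x[Sc] *= budget / norm_sc
    x /= np.linalg.norm(x)
    best = min(best, np.sum((Sigma @ x) ** 2))
print("Monte-Carlo estimate of the (RE) constant over the cone:", best)
```

The proposition above says that, when the first quantity is controlled strongly enough, the second is automatically bounded below.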
diff --git a/paper/sections/results.tex b/paper/sections/results.tex
index 4e129fc..5c2fe5f 100644
--- a/paper/sections/results.tex
+++ b/paper/sections/results.tex
@@ -1,21 +1,35 @@
-In this section, we exploit standard techniques in sparse recovery and leverage the simple nature of Generalized Linear models to address the standard problem of edge detection as well as the less frequently studied problem of coefficient estimation. We discuss both standard diffusion processes, and extend our analysis beyond sparse graphs to approximately sparse graphs.
+In this section, we exploit standard techniques in sparse recovery and leverage the simple nature of generalized linear models (GLMs) to address the standard problem of edge detection. We extend prior work by showing that the edge weights of the graph can also be recovered. We further relax the sparsity constraint: it is more realistic to assume that the graph has few `strong' edges, characterized by weights closer to 1, and many `weak' edges, characterized by weights closer to 0.
 
-\paragraph{Recovering Edges vs. Recovering Coefficients}
-Recovering the edges of the graph or estimating the parents of a node comes down to recovering the support (non-zero coefficients) of $\Theta$, a process known as {\it variable selection}. However, there have been a series of papers arguing that the Lasso is an inappropriate variable selection method (see H.Zou and T.Hastie, Sarah van de Geer ...). In fact, the irrepresentability condition (discussed in \cite{Daneshmand:2014} and introduced in \cite{Zhao:2006}) is essentially necessary for variable selection and rarely holds in practical situations where correlation between variable occurs. We defer an extended analysis of this situation to Section~\ref{sec:assumptions}.
+\subsection{Recovering Edges and Edge Weights}
+Recovering the edges of the graph can be formalized as recovering the support of $\Theta$, a problem known as {\it variable selection}. As we have seen above, we can optimize Eq.~\ref{eq:pre-mle} node by node. Our objective is to recover the parents of each node, i.e.\ the non-zero coefficients of $\theta_i$ for all $i$. For the rest of the analysis, we fix a single node $i$ and, for ease of presentation, leave the index $i$ implicit: $p_{i,j} = p_j$, $\theta_{i,j} = \theta_j$, etc.
 
-Our approach is different. Rather than trying to perform variable selection by finding $\{j: \theta_j \neq 0\}$, we seek to upper-bound $\|\hat \theta - \theta^* \|_2$. We first apply standard techniques to obtain ${\cal O}(\sqrt{\frac{s \log m}{n}})$ in the case of sparse vectors, which is tight to a certain extent as we will show in Section ???.
+There has been a series of papers arguing that the standard Lasso is an inappropriate exact variable selection method \cite{Zou:2006,vandegeer:2011}, since it relies on the essentially necessary irrepresentability condition introduced in \cite{Zhao:2006}. However, this condition, on which the analysis of \cite{Daneshmand:2014} relies, rarely holds in practical situations where correlation between variables occurs, and several alternatives have been suggested (the adaptive Lasso, the thresholded Lasso, etc.). We defer an extended analysis of the irrepresentability assumption to Section~\ref{sec:assumptions}.
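As a concrete illustration of the node-by-node estimation just described, the sketch below fits the incoming weights of a single node from cascade measurements. It is only a stand-in: it uses scikit-learn's off-the-shelf $\ell_1$-penalized logistic regression in place of the paper's exact objective Eq.~\ref{eq:pre-mle}, and the data layout (rows of infected indicators $x^i$, labels $b^i$) is our assumption about the measurement format.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_incoming_weights(X, b, lam):
    """Sketch of per-node estimation (stand-in for Eq. (pre-mle), not the paper's exact likelihood).

    X   : (n, m) 0/1 matrix; row i marks the nodes infected at step i of some cascade.
    b   : (n,)   0/1 vector; b[i] = 1 if the target node alpha became infected at the next step.
    lam : l1 regularization strength (scikit-learn uses C = 1/lam).
    """
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / lam, fit_intercept=False)
    clf.fit(X, b)
    return clf.coef_.ravel()

def strong_parents(theta_hat, eta):
    """Keep only coefficients above the threshold eta (cf. the thresholding corollary below)."""
    return np.flatnonzero(np.abs(theta_hat) > eta)
```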
-It is easy to see that `parent selection' is a direct consequence of the previous result: by thresholding $\hat \theta$, one recovers all `strong' parents without false positives, as shown in corollary~\ref{cor:variable_selection}.
+Our approach is different. Rather than trying to perform variable selection directly by finding $\{j: \theta_j \neq 0\}$, we seek to upper-bound $\|\hat \theta - \theta^* \|_2$. It is easy to see that recovering all `strong' edges of the graph is a direct consequence of this analysis: by thresholding away the weak coefficients of $\hat \theta$, one recovers all `strong' parents without false positives, as shown in Corollary~\ref{cor:variable_selection}.
 
-\paragraph{Main Theorem}
-Interestingly, obtaining oracle inequalities depends on the following restricted eigenvalue condition\footnote{which is less restrictive? Irrepresentability implies compatibility? But do we have comptability and do they have irrepresentability?} for a symetric matrix $\Sigma$ and set ${\cal C} := \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 \} \cap \{ \|X\|_1 \leq 1 \}$
+We will first apply standard techniques to obtain an ${\cal O}(\sqrt{\frac{s \log m}{n}})$ $\ell_2$-norm upper bound in the case of sparse vectors. We will then extend this analysis to non-sparse vectors. In Section~\ref{sec:lowerbound}, we show that our results are almost tight.
+
+\subsection{Main Theorem}
+
+We begin our analysis with the following simple lemma:
+\begin{lemma}
+\label{lem:theta_p_upperbound}
+$\|\theta - \theta^* \|_2 \geq \|p - p^*\|_2$
+\end{lemma}
+\begin{proof}
+Using the inequality $\forall x>0, \; \log x \geq 1 - \frac{1}{x}$, we have $|\log (1 - p) - \log (1-p')| \geq \max(1 - \frac{1-p}{1-p'}, 1 - \frac{1-p'}{1-p}) \geq \max( p-p', p'-p)$. The result follows easily.
+\end{proof}
+
+In other words, an upper bound on the estimation error of the `effective' parameters $\theta_{i,j} \defeq \log(1-p_{i,j})$ immediately provides an upper bound on the estimation error of the true parameters $(p_{i,j})_{i,j}$.
+
+Interestingly, finding such an upper bound is commonly studied in sparse recovery and essentially relies on the restricted eigenvalue condition for a symmetric matrix $\Sigma$ and the set ${\cal C} := \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 \} \cap \{ \|X\|_1 \leq 1 \}$:
 \begin{equation} \nonumber
 \forall X \in {\cal C}, \| \Sigma X \|_2^2 \geq \gamma_n \|X\|_2^2 \qquad \text{\bf (RE)}
 \end{equation}
-We cite the following Theorem from \cite{Negahban:2009}
+We compare this condition to the irrepresentability condition used in prior work in Section~\ref{sec:assumptions}. We cite the following theorem from \cite{Negahban:2009}:
 
 \begin{theorem}
 \label{thm:neghaban}
@@ -25,58 +39,60 @@ Suppose the true vector $\theta^*$ has support S of size s and the {\bf(RE)} ass
 \end{equation}
 \end{theorem}
 
+In Section~\ref{subsec:icc}, we exhibit a valid $\lambda_n$ of order ${\cal O}(\sqrt{n})$. It is also reasonable to assume $\gamma_n = \Omega(n)$, as discussed in Section~\ref{sec:assumptions}, yielding an ${\cal O}(1/\sqrt{n})$ decay rate in the number of measurements. We believe it is more natural to express these results in terms of the number of measurements $n$, i.e.\ the cumulative number of steps across all cascades, rather than the number of cascades $N$.
+
 
-\paragraph{Relaxing the Sparsity Constraint}
+\subsection{Relaxing the Sparsity Constraint}
 
-We can relax the sparsity constraint and express the upper-bound as a function of the best sparse approximation to $\theta^*$. This result is particularly relevant in cases where $\theta^*$ has few `strong' parents, and many `weak' parents, which is frequent in social networks. The results of section~\ref{subsec:icc} and section~\ref{subsec:ltm} that follow can be easily extended to the approximately-sparse case.
+In many situations, however, and for social networks in particular, the graph is not exactly $s$-sparse. A more realistic situation is one where each node has a few `strong' parents and many `weaker' parents. Rather than obtaining an impossibility result in this situation, we show that we pay only a small price for relaxing the sparsity constraint. If we let $\theta^*_{\lfloor s \rfloor}$ be the best $s$-sparse approximation to $\theta^*$, defined as
+$$\theta^*_{\lfloor s \rfloor} \defeq \min_{\|\theta\|_0 \leq s} \|\theta - \theta^*\|_1$$
+then we pay ${\cal O} \left(\sqrt{\frac{\lambda_n}{\gamma_n}} \|\theta^*_{\lfloor s \rfloor}\|_1 \right)$ for recovering the weights of non-exactly sparse vectors. Since $\|\theta^*_{\lfloor s \rfloor}\|_1$ is the sum of the $\|\theta^*\|_0 - s$ weakest coefficients of $\theta^*$, the closer $\theta^*$ is to being sparse, the smaller the price. These results are formalized in the following theorem:
 \begin{theorem}
 \label{thm:approx_sparse}
-Let $\theta_s^* \defeq \min_{\|\theta\|_0 \leq s} \|\theta - \theta^*\|_1$ be the best s-sparse approximation to the true vector $\theta^*$. Suppose the {\bf(RE)} assumption holds for the Hessian $\nabla^2 f(\theta^*)$ and for the following set
+Let $\theta^*_{\lfloor s \rfloor}$ be the best $s$-sparse approximation to the true vector $\theta^*$. Suppose the {\bf(RE)} assumption holds for the Hessian $\nabla^2 f(\theta^*)$ and for the following set
 \begin{align} \nonumber
-{\cal C}' \defeq & \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 + 4 \|\theta_s^*\|_1 \} \\ \nonumber
+{\cal C}' \defeq & \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 + 4 \|\theta^*_{\lfloor s \rfloor}\|_1 \} \\ \nonumber
 & \cap \{ \|X\|_1 \leq 1 \}
 \end{align}
 By solving \eqref{eq:pre-mle} for $\lambda_n \geq 2 \|\nabla f(\theta^*)\|_{\infty}$ we have:
-\begin{equation}
-\|\hat \theta - \theta^* \|_2 \leq 3 \frac{\sqrt{s}\lambda_n}{\gamma_n} + 2 \sqrt{\frac{\lambda_n}{\gamma_n}} \|\theta^*_s\|_1
-\end{equation}
+\begin{align}
+\|\hat p - p^* \|_2 \leq 3 \frac{\sqrt{s}\lambda_n}{\gamma_n} + 2 \sqrt{\frac{\lambda_n}{\gamma_n}} \|p^*_{\lfloor s \rfloor}\|_1
+\end{align}
 \end{theorem}
-As we can see, we pay a $\Omega \left(\sqrt{\frac{\lambda_n}{\gamma_n}} \|\theta^*_s\|_1 \right)$ price for not being exactly s-sparse, where $\|\theta^*_s\|_1$ is the sum of the $\|\theta^*\|_0 -s$ weakest coefficients of $\theta^*$.
+This follows from a more general version of Theorem~\ref{thm:neghaban} in \cite{Negahban:2009}, from Lemma~\ref{lem:theta_p_upperbound}, and from the simple observation that $\| \theta^*_{\lfloor s \rfloor}\|_1 \leq \| p^*_{\lfloor s \rfloor} \|_1$.
+The results of Section~\ref{subsec:icc} can easily be extended to the approximately-sparse case.
+
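To get a feel for the extra term, plugging the scalings discussed above (a valid $\lambda_n$ of order $\sqrt{n}$, say $\lambda_n = c\sqrt{n}$, and $\gamma_n = \gamma n$) into the bound of Theorem~\ref{thm:approx_sparse} gives

\begin{equation} \nonumber
\|\hat p - p^* \|_2 \leq \frac{3c}{\gamma} \sqrt{\frac{s}{n}} + 2 \sqrt{\frac{c}{\gamma}} \, \frac{\|p^*_{\lfloor s \rfloor}\|_1}{n^{1/4}},
\end{equation}

so the estimation term decays at the usual $1/\sqrt{n}$ rate while the approximation term decays at the slower $n^{-1/4}$ rate.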
 \subsection{Independent Cascade Model}
 \label{subsec:icc}
-We analyse the previous conditions in the case of the Independent Cascade model. Lemma 1. provides a ${\cal O}(\sqrt{n})$-upper-bound w.h.p. on $\|\nabla f(\theta^*)\|$
+We analyse the previous conditions in the case of the Independent Cascade model. Lemma~\ref{lem:icc_lambda_upper_bound} provides an ${\cal O}(\sqrt{n})$ upper bound w.h.p.\ on $\|\nabla f(\theta^*)\|_{\infty}$.
 
 \begin{lemma}
 \label{lem:icc_lambda_upper_bound}
 For any $\delta > 0$, with probability $1-e^{-n^\delta \log m}$, $\|\nabla f(\theta^*)\|_{\infty} \leq 2 \sqrt{\frac{n^{\delta + 1} \log m}{p_{\min}}}$
 \end{lemma}
 
-The following corollaries follows immediately from Theorem~\ref{thm:neghaban} and Lemma~\ref{lem:icc_lambda_upper_bound}.
+We include a proof of this result in the Appendix. The following corollaries follow immediately from Theorem~\ref{thm:neghaban} and Lemma~\ref{lem:icc_lambda_upper_bound}.
 
 \begin{corollary}
-Assume that ${\bf (RE)}$ holds with $\gamma_n = n \gamma$ for $\gamma > 0$. Then for $\lambda_n := 2 \sqrt{\frac{n^{\delta + 1} \log m}{p_{\min}}}$ and with probability $1-e^{n^\delta \log p}$
+Assume that ${\bf (RE)}$ holds with $\gamma_n = n \gamma$ for $\gamma > 0$. Then for $\lambda_n := 2 \sqrt{\frac{n^{\delta + 1} \log m}{p_{\min}}}$ and with probability $1-e^{-n^\delta \log m}$
 \begin{equation}
-\|\hat \theta - \theta^* \|_2 \leq \frac{2}{\gamma} \sqrt{\frac{s \log p}{p_{\min} n^{1-\delta}}}
+\|\hat p - p^* \|_2 \leq \frac{2}{\gamma} \sqrt{\frac{s \log m}{p_{\min} n^{1-\delta}}}
 \end{equation}
 \end{corollary}
 
-The following corollary follows trivially and gives the first $\Omega(s \log p)$ algorithm for graph reconstruction on general graphs.
+The following corollary follows trivially and gives the first $\Omega(s \log p)$ algorithm for graph reconstruction on general graphs. The proofs are included in the Appendix.
 
 \begin{corollary}
 \label{cor:variable_selection}
-Assume that ${\bf (RE)}$ holds with $\gamma_n = n \gamma$ for $\gamma > 0$. Suppose that after solving Eq.~\ref{eq:mle}, we construct the set $\hat {\cal S}_\eta := \{ j \in [1..p] : \hat \theta_j > \eta\}$ for $\eta > 0$. Suppose the number of measurements verifies:
+Assume that ${\bf (RE)}$ holds with $\gamma_n = n \gamma$ for $\gamma > 0$. Suppose that after solving for $\hat \theta$, we construct the set $\hat {\cal S}_\eta := \{ j \in [1..p] : \hat p_j > \eta\}$ for $\eta > 0$. Suppose the number of measurements satisfies:
 \begin{equation}
-n > \frac{4}{\gamma^2 p_{\min} \eta^2} s \log p
+n > \frac{4}{\gamma^2 p_{\min} \eta^2} s \log m
 \end{equation}
-Then with probability $1-e^{n^\delta \log p}$, $\hat {\cal S}_\eta = {\cal S}^*_{2\eta}$, where ${\cal S}^*_{2\eta} = \{ j \in [1..p] :\theta^*_j > 2 \eta \} $. In other words, we incur no false positives (false parents) and recover all `strong' parents such that $\theta^*_j > 2 \eta$.
+Then with probability $1-e^{-n^\delta \log m}$, $\hat {\cal S}_\eta = {\cal S}^*_{2\eta}$, where ${\cal S}^*_{2\eta} = \{ j \in [1..p] : p^*_j > 2 \eta \}$. In other words, we incur no false positives (false parents) and recover all `strong' parents such that $p^*_j > 2 \eta$.
 \end{corollary}
 
-Note that $n$ is the number of measurements and not the number of cascades. This is an improvement over prior work since we expect several measurements per cascade. We claim that graph recovery is better understood as a function of $n$, the cumulative number of steps in each cascade, rather than as a function $N$, the number of individual cascades.
-
-
+Note that $n$ is the number of measurements and not the number of cascades. This is an improvement over prior work since we expect several measurements per cascade.
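For a rough sense of scale, with purely illustrative numbers (not values from the paper): taking $s = 5$, $m = 10^3$, $p_{\min} = 0.1$, $\gamma = 1$ and a threshold $\eta = 0.1$, the condition of Corollary~\ref{cor:variable_selection} requires

\begin{equation} \nonumber
n > \frac{4}{\gamma^2 p_{\min} \eta^2} \, s \log m = \frac{4 \times 5 \times \log(10^3)}{1 \times 0.1 \times 0.01} \approx 1.4 \times 10^5
\end{equation}

measurements, i.e.\ cumulative cascade steps rather than cascades, to recover every parent of weight greater than $2\eta = 0.2$ with no false positives.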
-\subsection{Linear Threshold Model}
-\label{subsec:ltm}
diff --git a/paper/sparse.bib b/paper/sparse.bib
index 10ae8ca..c67cd20 100644
--- a/paper/sparse.bib
+++ b/paper/sparse.bib
@@ -119,3 +119,42 @@ year = {2006},
 biburl = {http://dblp.uni-trier.de/rec/bib/conf/kdd/AbrahaoCKP13},
 bibsource = {dblp computer science bibliography, http://dblp.org}
 }
+
+
+@article{vandegeer:2009,
+author = "van de Geer, Sara A. and Bühlmann, Peter",
+doi = "10.1214/09-EJS506",
+fjournal = "Electronic Journal of Statistics",
+journal = "Electron. J. Statist.",
+pages = "1360--1392",
+publisher = "The Institute of Mathematical Statistics and the Bernoulli Society",
+title = "On the conditions used to prove oracle results for the Lasso",
+url = "http://dx.doi.org/10.1214/09-EJS506",
+volume = "3",
+year = "2009"
+}
+
+@article{vandegeer:2011,
+author = "van de Geer, Sara and Bühlmann, Peter and Zhou, Shuheng",
+doi = "10.1214/11-EJS624",
+fjournal = "Electronic Journal of Statistics",
+journal = "Electron. J. Statist.",
+pages = "688--749",
+publisher = "The Institute of Mathematical Statistics and the Bernoulli Society",
+title = "The adaptive and the thresholded Lasso for potentially misspecified models (and a lower bound for the Lasso)",
+url = "http://dx.doi.org/10.1214/11-EJS624",
+volume = "5",
+year = "2011"
+}
+
+@article{Zou:2006,
+author = {Zou, Hui},
+title = {The Adaptive Lasso and Its Oracle Properties},
+journal = {Journal of the American Statistical Association},
+volume = {101},
+number = {476},
+pages = {1418-1429},
+year = {2006},
+doi = {10.1198/016214506000000735},
+URL = {http://dx.doi.org/10.1198/016214506000000735},
+}
