| author | jeanpouget-abadie <jean.pougetabadie@gmail.com> | 2015-01-31 16:13:27 -0500 |
|---|---|---|
| committer | jeanpouget-abadie <jean.pougetabadie@gmail.com> | 2015-01-31 16:13:27 -0500 |
| commit | 7df4228e025d5810cb1689073d42b6d8d265b018 (patch) | |
| tree | 16c4ac9784382cdd4d84e0d90565bd848046ca38 | |
| parent | 57b3fb8c1fdd998c0245672599de2249d44cfdf2 (diff) | |
| download | cascades-7df4228e025d5810cb1689073d42b6d8d265b018.tar.gz | |
remodeling results section
| -rw-r--r-- | paper/sections/assumptions.tex | 2 |
| -rw-r--r-- | paper/sections/model.tex | 16 |
| -rw-r--r-- | paper/sections/results.tex | 106 |
3 files changed, 63 insertions, 61 deletions
diff --git a/paper/sections/assumptions.tex b/paper/sections/assumptions.tex
index 58ea977..5e412a4 100644
--- a/paper/sections/assumptions.tex
+++ b/paper/sections/assumptions.tex
@@ -1,3 +1,5 @@
+%There have been a series of papers arguing that the standard Lasso is an inappropriate exact variable selection method \cite{Zou:2006}, \cite{vandegeer:2011}, since it relies on the essentially necessary irrepresentability condition, introduced in \cite{Zhao:2006}. This is the condition on which the analysis of \cite{Daneshmand:2014} relies. It has been noted that this condition is rather stringent and rarely holds in practical situations where correlation between variables occurs. Several alternatives have been suggested, including the adaptive and thresholded lasso, which relax this assumption. We defer an extended analysis of the irrepresentability assumption to Section~\ref{sec:assumptions}.
+
 In this section, we compare the main assumption of \cite{Daneshmand:2014}, commonly known as the {\it irrepresentability condition}, to the restricted eigenvalue condition. We argue that the restricted eigenvalue condition is more likely to hold in practical situations.
 
 \subsection{The Irrepresentability Condition}
diff --git a/paper/sections/model.tex b/paper/sections/model.tex
index ae49830..6acfade 100644
--- a/paper/sections/model.tex
+++ b/paper/sections/model.tex
@@ -13,7 +13,7 @@ We note that recovering the edges in $E$ from observed influence cascades is a
 well-identified problem known as the \emph{Graph Inference} problem. However,
 recovering the influence parameters is no less important and has been
 seemingly overlooked so far. The machinery developed in this work addresses both
-problems, in the context of generalized linear cascade models, defined below. We show that this definition encompasses a large class of well-known cascade models.
+problems for a large class of well-known cascade models.
 
 \subsection{Generalized Linear Cascade Models}
 
@@ -28,11 +28,11 @@ $$\mathbb{P}(X^t = 1|{\cal F}_{< t}) = \exp(blabla$$
 
 \subsection{Examples}
 \label{subsec:examples}
 
-In this section, we show that both the Independent Cascade Model and the Voter model are Generalized Linear Cascades. Discussion about the Linear Threshold model has been deferred to Section~\ref{sec:linear_threshold}
+In this section, we show that both the Independent Cascade Model and the Voter model are Generalized Linear Cascades. We discuss the Linear Threshold model in Section~\ref{sec:linear_threshold}.
 
 \subsubsection{Independent Cascade Model}
 
-In the independent cascade model, nodes can be either susceptible, active/infected or inactive. At $t=0$, all source nodes are ``active'' and all remaining nodes are ``susceptible''. At each time step $t$, for each edge $(i,j)$ where $j$ is susceptible and $i$ is active, $i$ attempts to infect $j$ with probability $p_{i,j}\in[0,1]$. If $i$ succeeds, $j$ will become active at time step $t+1$. Regardless of $i$'s success, node $i$ will be inactive at time $t+1$. In other words, nodes stay active for only one time step. The cascade process terminates When no active nodes remain.
+In the independent cascade model, nodes can be either susceptible, active or inactive. At $t=0$, all source nodes are ``active'' and all remaining nodes are ``susceptible''. At each time step $t$, for each edge $(i,j)$ where $j$ is susceptible and $i$ is active, $i$ attempts to infect $j$ with probability $p_{i,j}\in[0,1]$. If $i$ succeeds, $j$ will become active at time step $t+1$.
+Regardless of $i$'s success, node $i$ will be inactive at time $t+1$. In other words, nodes stay active for only one time step. The cascade process terminates when no active nodes remain.
 If we denote by $X^t$ the indicator variable of the set of active nodes at time step $t-1$, then if $j$ is susceptible at time step $t-1$, we have:
 \begin{displaymath}
@@ -49,7 +49,11 @@ Defining $\Theta_{i,j} \defeq \log(1-p_{i,j})$, this can be rewritten as:
 \subsubsection{The Voter Model}
 
-In the Voter Model, nodes have two states
+In the Voter Model, nodes can be either red or blue, where ``blue'' is the source state. The parameters of the graph are normalized such that $\forall i, \ \sum_j \Theta_{i,j} = 1$. In each round, every node $j$ chooses one of its neighbors $i$ with probability $\Theta_{i,j}$ and adopts its color. The cascade stops at a fixed time horizon $T$ or when all nodes are of the same color. If we denote by $X^t$ the indicator variable of the set of blue nodes at time step $t-1$, then we have:
+\begin{equation}
+\mathbb{P}\left[X^{t+1}_j = 1 | X^t \right] = \sum_{i=1}^m \Theta_{i,j} X_i^t = \inprod{\theta_j}{X^t}
+\tag{V}
+\end{equation}
 
 \subsection{Maximum Likelihood Estimation}
 
@@ -95,3 +99,7 @@ the first step at which $j$ becomes active, we can rewrite the MLE program
 \sum_{t=1}^{n_j-1}\log\big(1-f(\inprod{\theta_j}{x^t})\big)\\
 \log f(\inprod{\theta_j}{x^{n_j}})+ \lambda\|\theta_j\|
 \end{multline}
+
+
+
+
diff --git a/paper/sections/results.tex b/paper/sections/results.tex
index 042a06d..2831a51 100644
--- a/paper/sections/results.tex
+++ b/paper/sections/results.tex
@@ -1,28 +1,24 @@
-In this section, we exploit standard techniques in sparse recovery and leverage the simple nature of generalized linear models (GLMs) to address the standard problem of edge detection. We extend prior work by showing that edge weights of the graph can be recovered under similar assumptions, as well as by considering the non-exactly sparse case.
+In this section, we exploit standard techniques in sparse recovery and leverage the simple nature of generalized linear cascades to address the standard problem of edge detection. We extend prior work by showing that the edge weights of the graph can be recovered under similar assumptions, and we further extend these results to the non-exactly sparse case.
 
-\subsection{Recovering Edges and Edge weights}
-Recovering the edges of the graph can be formalized as recovering the support of the edge weights $\Theta$. This problem is known as {\it variable selection}. Since Eq.~\ref{eq:pre-mle} can be solved node by node, the objective is also known as `parent' recovery. For each node $i$, we seek to find the indices $\{j: \theta_{ij} \neq 0\}$. For the rest of the analysis, we suppose that we consider a single node $i$. For ease of presentation, the index $i$ will be implied.
+\subsection{Objectives}
 
-There have been a series of papers arguing that the standard Lasso is an inappropriate exact variable selection method \cite{Zou:2006}, \cite{vandegeer:2011}, since it relies on the essentially necessary irrepresentability condition, introduced in \cite{Zhao:2006}. This is the condition on which the analysis of \cite{Daneshmand:2014} relies. It has been noted this condition is rather stringent and rarely holds in practical situations where correlation between variables occurs. Several alternatives have been suggested, including the adaptive and thresholded lasso, which relax this assumption.
-We defer an extended analysis of the irrepresentability assumption to Section~\ref{sec:assumptions}.
+As mentioned previously, our objective is twofold:
-Our approach is different. Rather than trying to focus on exact variable selection, we seek to upper-bound the $\ell2$-norm $\|\hat \theta - \theta^* \|_2$. It is easy to see that recovering all `strong' edges of the graph follows directly from this analysis. By thresholding all weak $\hat \theta$, one recovers all `strong' parents without false positives, as shown in corollary~\ref{cor:variable_selection}.
-
-We will first apply standard techniques to obtain a ${\cal O}(\sqrt{\frac{s \log m}{n}})$ $\ell2$-norm upper-bound in the case of sparse vectors. We then extend this analysis to non-sparse vectors. In section~\ref{sec:lowerbound}, we show that our results are almost tight.
+\begin{enumerate}
+\item We seek to recover the edges of the graph. In other words, for each node $j$, we wish to identify the set of indices $\{i : \Theta_{i,j} \neq 0\}$.
+\item We seek to recover the edge weights $\Theta$ of ${\cal G}$, i.e.\ to upper-bound the error $\|\hat \Theta - \Theta \|_2$.
+\end{enumerate}
+It is easy to see that solving the second objective also solves the first: if $\|\hat \Theta - \Theta \|_2$ is sufficiently small, we can recover all large coordinates of $\Theta$ by keeping only the coordinates of $\hat \Theta$ above a certain threshold.
 
 \subsection{Main Theorem}
 
-We begin our analysis with the following simple lemma:
+For the remainder of this section, we consider a single node $j$. For ease of presentation, the index $j$ is made implicit: $\theta_i \leftarrow \Theta_{i,j}$ and $\theta \leftarrow \Theta_{\cdot,j}$. We begin our analysis with the following lemma:
 \begin{lemma}
-\label{lem:theta_p_upperbound}
-$\|\theta - \theta^* \|_2 \geq \|p - p^*\|_2$
+\label{lem:icc_lambda_upper_bound}
+For any $\delta > 0$, with probability $1-e^{-n^\delta \log m}$, $\|\nabla {\cal L}(\theta^*)\|_{\infty} \leq 2 \sqrt{\frac{\log m}{n^{1 - \delta} p_{\min}}}$
 \end{lemma}
-\begin{proof}
-Using the inequality $\forall x>0, \; \log x \geq 1 - \frac{1}{x}$, we have $|\log (1 - p) - \log (1-p')| \geq \max(1 - \frac{1-p}{1-p'}, 1 - \frac{1-p'}{1-p}) \geq \max( p-p', p'-p)$. The result follows easily.
-\end{proof}
-
-In other words, finding an upper-bound for the estimation error of the `effective' parameters $\theta_{i,j} \defeq \log(1-p_{i,j})$ provides immediately an upper-bound for the estimation error of the true parameters $(p_{i,j})_{i,j}$.
-Interestingly, finding such an upper-bound is commonly studied in sparse recovery and essentially relies on the restricted eigenvalue condition for a symetric matrix $\Sigma$ and set ${\cal C} := \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 \} \cap \{ \|X\|_1 \leq 1 \}$
+Consider the well-known \emph{restricted eigenvalue condition} for a symmetric matrix $\Sigma$ and set ${\cal C} := \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 \} \cap \{ \|X\|_1 \leq 1 \}$
 \begin{equation}
 \nonumber
@@ -30,23 +26,38 @@ Interestingly, finding such an upper-bound is commonly studied in sparse recover
 \tag{RE}
 \end{equation}
 
-We compare this condition to the irrepresentability condition used in prior work in section~\ref{sec:lowerbound}. We cite the following theorem from \cite{Negahban:2009}
+The following theorem is a consequence of Theorem 1 in \cite{Negahban:2009} and Lemma~\ref{lem:icc_lambda_upper_bound}.
 \begin{theorem}
 \label{thm:neghaban}
-Suppose the true vector $\theta^*$ has support S of size s and the {\bf(RE)} assumption holds for the Hessian $\nabla^2 f(\theta^*)$, then by solving \eqref{eq:pre-mle} for $\lambda_n \geq 2 \|\nabla f(\theta^*)\|_{\infty}$ we have:
+Suppose the true vector $\theta^*$ has support $S$ of size $s$ and that the {\bf (RE)} assumption holds for the Hessian $\nabla^2 {\cal L}(\theta)$ with $\gamma > 0$. Then for $\delta > 0$, by solving \eqref{eq:pre-mle} with $\lambda \defeq 2\sqrt{\frac{\log m}{n^{1 - \delta} p_{\min}}}$, we have with probability $1-\exp(-n^\delta \log m)$:
 \begin{equation}
-\|\hat \theta - \theta^* \|_2 \leq 3 \frac{\sqrt{s}\lambda_n}{\gamma_n}
+\|\hat \theta - \theta \|_2 \leq \frac{3}{\gamma} \sqrt{\frac{s \log m}{p_{\min} n^{1-\delta}}}
 \end{equation}
 \end{theorem}
 
-In section~\ref{subsec:icc}, we find a ${\cal O}(\sqrt{n})$ upper-bound for valid $\lambda_n$. It is also reasonable to assume $\gamma_n = \Omega(n)$, as discussed in section~\ref{sec:assumptions}, yielding a ${\cal O}(1/\sqrt{n})$ decay rate per measurement. The authors believe it is more natural to express these results as the number of measurements $N$, i.e. cumulative number of steps in each cascades, rather the number of cascades $n$.
+The proof is included in the Appendix. Note that as $n$ goes to infinity, the Hessian $\nabla^2 {\cal L}(\theta)$ converges to the \emph{Fisher Information Matrix} of the cascade model as a function of $\theta$. The assumptions of Theorem~\ref{thm:neghaban} are discussed in Section~\ref{sec:assumptions}. The following corollary follows easily and gives the first graph reconstruction algorithm requiring only ${\cal O}(s \log m)$ measurements on general graphs.
+
+\begin{corollary}
+\label{cor:variable_selection}
+Assume that ${\bf (RE)}$ holds with $\gamma > 0$ and that $\theta$ is s-sparse. Consider the set $\hat {\cal S}_\eta \defeq \{ i \in [1..p] : \hat \theta_i > \eta\}$ for $\eta > 0$. For $0 < \epsilon < \eta$, let ${\cal S}^*_{\eta + \epsilon} \defeq \{ i \in [1..p] : \theta_i > \eta + \epsilon \}$ be the set of all true `strong' parents. Suppose the number of measurements satisfies:
+\begin{equation}
+n > \frac{36}{p_{\min}\gamma^2 \epsilon^2} s \log m
+\end{equation}
+Then with probability $1-\frac{1}{m}$, ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta \subset {\cal S}^*$. In other words, we recover all `strong' parents and no `false' parents.
+\end{corollary}
+
+\begin{proof}
+Suppose $\theta$ is exactly s-sparse. By choosing $\delta = 0$, if $n>\frac{36}{p_{\min}\gamma^2 \epsilon^2} s \log m$, then $\|\hat \theta-\theta\|_2 < \epsilon/2 < \eta$ with probability $1-\frac{1}{m}$. If $\theta_i = 0$ and $\hat \theta_i > \eta$, then $\|\hat \theta - \theta\|_2 \geq |\hat \theta_i-\theta_i| > \eta$, which is a contradiction. Therefore, we get no false positives. If $\theta_i \geq \eta + \epsilon$, then $|\hat \theta_i - \theta_i| < \epsilon/2$ implies $\hat \theta_i > \eta + \epsilon/2 > \eta$. Therefore, we get all strong parents.
+\end{proof}
+
+Note that in order to allow for \emph{perfect recovery}, we must estimate $p_{\min}$ within $\epsilon$ and set $\eta \defeq p_{\min} - \epsilon$.
 
 \subsection{Relaxing the Sparsity Constraint}
 
-In practice, exact sparsity is rarely verified. For social networks in particular, it is more realistic to assume that each node has few strong `parents' and many `weaker' parents. Rather than obtaining an impossibility result in this situation, we show that we pay a small price for relaxing the sparsity constraint.
-If we let $\theta^*_{\lfloor s \rfloor}$ be the best s-sparse approximation to $\theta^*$ defined as
+In practice, exact sparsity rarely holds. For social networks in particular, it is more realistic to assume that each node has few strong `parents' and many `weak' parents. Rather than obtaining an impossibility result in this situation, we show that we pay a small price for relaxing the sparsity constraint. If we let $\theta^*_{\lfloor s \rfloor}$ be the best s-sparse approximation to $\theta^*$ defined as
 $$\theta^*_{\lfloor s \rfloor} \defeq \min_{\|\theta\|_0 \leq s} \|\theta - \theta^*\|_1$$
-then we pay ${\cal O} \left(\sqrt{\frac{\lambda_n}{\gamma_n}} \|\theta^*_s\|_1 \right)$ for recovering the weights of non-exactly sparse vectors. Since $\|\theta^*_{\lfloor s \rfloor}\|_1$ is the sum of the $\|\theta^*\|_0 -s$ weakest coefficients of $\theta^*$, the closer $\theta^*$ is to being sparse, the smaller the price. These results are formalized in the following theorem:
+then we pay a price proportional to $\|\theta^*_{\lfloor s \rfloor}\|_1$, i.e.\ the sum of the coefficients of $\theta^*$ other than the $s$ largest, for recovering the weights of non-exactly sparse vectors. In other words, the closer $\theta^*$ is to being sparse, the smaller the price. These results are formalized in the following theorem, which is also a consequence of Theorem 1 in \cite{Negahban:2009} and Lemma~\ref{lem:icc_lambda_upper_bound}.
 
 \begin{theorem}
 \label{thm:approx_sparse}
@@ -56,55 +67,36 @@ Let $\theta^*_{\lfloor s \rfloor}$ be the best s-sparse approximation to the tru
 {\cal C}' \defeq & \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 + 4 \|\theta^*_{\lfloor s \rfloor}\|_1 \} \\ \nonumber
 & \cap \{ \|X\|_1 \leq 1 \}
 \end{align}
-By solving \eqref{eq:pre-mle} for $\lambda_n \geq 2 \|\nabla f(\theta^*)\|_{\infty}$ we have:
+By solving \eqref{eq:pre-mle} with $\lambda \defeq 2\sqrt{\frac{\log m}{n^{1 - \delta} p_{\min}}}$, we have:
 \begin{align}
 \|\hat p - p^* \|_2 \leq 3 \frac{\sqrt{s}\lambda_n}{\gamma_n} + 2 \sqrt{\frac{\lambda_n}{\gamma_n}} \|p^*_{\lfloor s \rfloor}\|_1
 \end{align}
 \end{theorem}
 
-This follows from a more general version of Theorem~\ref{thm:neghaban} in \cite{Negahban:2009}, from Lemma~\ref{lem:theta_p_upperbound} and the simple observation that $\| \theta^*_{\lfloor s \rfloor}\|_1 \leq \| p^*_{\lfloor s \rfloor} \|_1$
-The results of section~\ref{subsec:icc} can be easily extended to the approximately-sparse case.
-
-
-\subsection{Independent Cascade Model}
-\label{subsec:icc}
-We analyse the previous conditions in the case of the Independent Cascade model. Lemma~\ref{lem:icc_lambda_upper_bound} provides a ${\cal O}(\sqrt{n})$-upper-bound w.h.p. on $\|\nabla f(\theta^*)\|$
-\begin{lemma}
-\label{lem:icc_lambda_upper_bound}
-For any $\delta > 0$, with probability $1-e^{-n^\delta \log m}$, $\|\nabla f(\theta^*)\|_{\infty} \leq 2 \sqrt{\frac{n^{\delta + 1} \log m}{p_{\min}}}$
-\end{lemma}
-
-We include a proof of this result in the Appendix. The following corollaries follow immediately from Theorem~\ref{thm:neghaban}, Theorem~\ref{thm:approx_sparse} and Lemma~\ref{lem:icc_lambda_upper_bound}:
-
-\begin{corollary}
-Assume that ${\bf (RE)}$ holds with $\gamma_n = n \gamma$ for $\gamma > 0$.
-Then for $\lambda_n \defeq 2 \sqrt{\frac{n^{\delta + 1} \log m}{p_{\min}}}$ and with probability $1-e^{n^\delta \log m}$:
-\begin{equation}
-\|\hat p - p^* \|_2 \leq \frac{3}{\gamma} \sqrt{\frac{s \log m}{p_{\min} n^{1-\delta}}} + 2 \sqrt[4]{\frac{\log m}{n^{1-\delta} \gamma^2 p_{\min}}} \| p^*_{\lfloor s \rfloor} \|_1
-\end{equation}
-Note that if $p$ is exactly s-sparse, $\| p^*_{\lfloor s \rfloor} \|_1 = 0$:
-\begin{equation}
-\|\hat p - p^* \|_2 \leq \frac{3}{\gamma} \sqrt{\frac{s \log m}{p_{\min} n^{1-\delta}}}
-\end{equation}
-\end{corollary}
-
-The following corollary follows easily and gives the first $\Omega(s \log p)$ algorithm for graph reconstruction on general graphs. The proofs are included in the Appendix.
+As before, edge recovery is a consequence of upper-bounding $\|\theta - \hat \theta\|_2$.
 \begin{corollary}
-\label{cor:variable_selection}
-Assume that ${\bf (RE)}$ holds with $\gamma_n = n \gamma$ for $\gamma > 0$ and that $p$ is s-sparse. Suppose that after solving for $\hat p$, we construct the set $\hat {\cal S}_\eta \defeq \{ j \in [1..p] : \hat p_j > \eta\}$ for $\eta > 0$. For $\epsilon>0$ and $\epsilon < \eta$, let ${\cal S}^*_{\eta + \epsilon} \defeq \{ j \in [1..p] :p^*_j > \eta +\epsilon \}$ be the set of all true `strong' parents. Suppose the number of measurements verifies:
-\begin{equation}
-n > \frac{36}{p_{\min}\gamma^2 \epsilon^2} s \log m
-\end{equation}
-Then with probability $1-\frac{1}{m}$, ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta \subset {\cal S}^*$. In other words we recover all `strong' parents and no `false' parents. Note that if $\theta$ is not exactly s-sparse and the number of measurements verifies:
+If $\theta$ is not exactly s-sparse and the number of measurements satisfies:
 \begin{equation}
 n > \frac{36 \| p^*_{\lfloor s\rfloor}\|_1}{p_{\min}\gamma^2 \epsilon^4} s \log m
 \end{equation}
 then similarly: ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta \subset {\cal S}^*$ w.h.p.
 \end{corollary}
+
+\subsection{Examples}
+
+\subsubsection{Independent Cascade Model}
+
+We begin our analysis with the following simple lemma:
+\begin{lemma}
+\label{lem:theta_p_upperbound}
+$\|\theta - \theta^* \|_2 \geq \|p - p^*\|_2$
+\end{lemma}
 \begin{proof}
-Suppose $p$ is exactly s-sparse. By choosing $\delta = 0$, if $n>\frac{36}{p_{\min}\gamma^2 \epsilon^2} s \log m$, then $\|p-p^*\|_2 < \epsilon < \eta$ with probability $1-\frac{1}{m}$. If $p^*_j = 0$ and $\hat p > \eta$, then $\|p - p^*\|_2 \geq |\hat p_j-p^*_j| > \eta$, which is a contradiction. Therefore we get no false positives. If $p^*_j = \eta + \epsilon$, then $|\hat p_j - (\eta+\epsilon)| < \epsilon/2 \implies p_j > \eta + \epsilon/2$. Therefore, we get all strong parents. The analysis in the non-sparse case is identical.
+Using the inequality $\forall x>0, \; \log x \geq 1 - \frac{1}{x}$, we have $|\log (1 - p) - \log (1-p')| \geq \max(1 - \frac{1-p}{1-p'}, 1 - \frac{1-p'}{1-p}) \geq \max( p-p', p'-p)$. The result follows easily.
 \end{proof}
 
-Note that $n$ is the number of measurements and not the number of cascades. This is an improvement over prior work since we expect several measurements per cascade.
+In other words, finding an upper-bound for the estimation error of the `effective' parameters $\theta_{i,j} \defeq \log(1-p_{i,j})$ immediately provides an upper-bound for the estimation error of the true parameters $(p_{i,j})_{i,j}$.
+\subsubsection{The Voter Model}
\ No newline at end of file
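The independent cascade dynamics described in the model.tex hunk above translate directly into a short simulation. The sketch below is illustrative only: the dense matrix `P` of infection probabilities, the function name, and the use of `numpy` are assumptions made for this example, not notation from the paper.

```python
import numpy as np

def simulate_icm(P, sources, rng=None):
    """One independent-cascade run following the dynamics in model.tex.

    P[i, j] is the probability that an active node i infects a susceptible
    node j. Nodes stay active for exactly one time step and the cascade
    stops when no active nodes remain. Returns the activation indicators
    X^0, X^1, ...
    """
    rng = np.random.default_rng() if rng is None else rng
    m = P.shape[0]
    active = np.zeros(m, dtype=bool)
    active[list(sources)] = True
    infected_ever = active.copy()          # susceptible <=> not infected_ever
    history = [active.copy()]

    while active.any():
        # P(j active at t+1 | X^t) = 1 - prod_{i active} (1 - p_ij)
        #                          = 1 - exp(<theta_j, X^t>),  theta_ij = log(1 - p_ij)
        log_survival = np.log1p(-P[active, :]).sum(axis=0)
        p_infect = 1.0 - np.exp(log_survival)
        newly = (rng.random(m) < p_infect) & ~infected_ever
        infected_ever |= newly
        active = newly                     # previously active nodes become inactive
        history.append(active.copy())
    return history
```

Working in log-space via `log1p` mirrors the reparametrization $\Theta_{i,j} \defeq \log(1-p_{i,j})$ used in the diff.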

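The voter model paragraph in model.tex also admits a short simulation sketch. It assumes the columns $\Theta_{\cdot,j}$ each sum to one, so that node $j$ picks neighbor $i$ with probability $\Theta_{i,j}$; the function names and the fixed-horizon loop are illustrative choices, not the paper's.

```python
import numpy as np

def voter_step(Theta, blue, rng):
    """One round of the voter model sketched in model.tex.

    Assumes Theta[:, j] is a probability distribution over the neighbors of
    node j; `blue` is the indicator vector X^t of blue nodes. Each node
    adopts the color of one neighbor drawn from Theta[:, j], so that
    P(X^{t+1}_j = 1 | X^t) = <Theta[:, j], X^t>.
    """
    m = Theta.shape[0]
    picks = np.array([rng.choice(m, p=Theta[:, j]) for j in range(m)])
    return blue[picks].copy()

def simulate_voter(Theta, blue0, T, rng=None):
    """Run the cascade until the horizon T or until all nodes agree."""
    rng = np.random.default_rng() if rng is None else rng
    blue = np.asarray(blue0, dtype=bool)
    history = [blue.copy()]
    for _ in range(T):
        if blue.all() or not blue.any():
            break
        blue = voter_step(Theta, blue, rng)
        history.append(blue.copy())
    return history
```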
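The results.tex changes recover each node's parents by solving the $\ell_1$-regularized MLE node by node and thresholding the estimate, as in the corollary on `strong' parents. The rough sketch below specializes to the independent cascade link $1-\exp(\langle\theta_j, x^t\rangle)$; the L-BFGS-B solver, the non-positivity bound, the clipping constant, and the helper names are choices made for this example and are not prescribed by the paper.

```python
import numpy as np
from scipy.optimize import minimize

def fit_parents_icm(X_prev, y, lam):
    """l1-regularized MLE for one node j under the independent cascade link.

    X_prev : (n, m) 0/1 array of activation vectors x^t observed while j
             was susceptible.
    y      : (n,) 0/1 array, y[t] = 1 iff j became active at step t+1.
    lam    : regularization weight lambda.
    Returns an estimate of theta_j, with theta_ij = log(1 - p_ij) <= 0.
    """
    n, m = X_prev.shape

    def objective(theta):
        z = X_prev @ theta                        # <theta_j, x^t>, always <= 0
        log_infect = np.log(np.clip(1.0 - np.exp(z), 1e-12, None))
        loglik = np.sum(y * log_infect + (1 - y) * z)
        return -loglik + lam * np.sum(-theta)     # ||theta||_1 = -sum(theta) here

    theta0 = -0.1 * np.ones(m)
    res = minimize(objective, theta0, method="L-BFGS-B", bounds=[(None, 0.0)] * m)
    return res.x

def strong_parents(theta_hat, eta):
    """Thresholded parent set: indices whose estimated weight exceeds eta."""
    return set(np.flatnonzero(np.abs(theta_hat) > eta))
```

Thresholding $|\hat\theta_i|$ (equivalently $1-e^{\hat\theta_i}$) plays the role of the set $\hat{\cal S}_\eta$ in the corollary.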