
In this section, we exploit standard techniques in sparse recovery and leverage the simple nature of generalized linear models (GLMs) to address the standard problem of edge detection. We extend prior work by showing that the edge weights of the graph can also be recovered. We further relax the sparsity constraint: it is more realistic to assume that the graph has a few `strong' edges, characterized by weights closer to 1, and many `weak' edges, characterized by weights closer to 0.

\subsection{Recovering Edges and Edge Weights}
Recovering the edges of the graph can be formalized as recovering the support of $\Theta$, a problem known as {\it variable selection}. As we have seen above, we can optimize Eq.~\ref{eq:pre-mle} node by node. Our objective is to recover the parents of each node, i.e., the non-zero coefficients of $\theta_i$ for all $i$. For the rest of the analysis, we consider a single node $i$. For ease of presentation, the index $i$ will be implied: $p_{i,j} = p_j$, $\theta_{i,j} = \theta_j$, and so on.

A series of papers has argued that the standard Lasso is an inappropriate exact variable selection method \cite{Zou:2006}, \cite{vandegeer:2011}, since it relies on the essentially necessary irrepresentability condition introduced in \cite{Zhao:2006}. However, this condition, on which the analysis of \cite{Daneshmand:2014} relies, rarely holds in practical situations where variables are correlated, and several alternatives have been suggested (the adaptive lasso, the thresholded lasso, etc.). We defer an extended analysis of the irrepresentability assumption to Section~\ref{sec:assumptions}.

Our approach is different. Rather than trying to perform variable selection directly by finding $\{j: \theta_j \neq 0\}$, we seek to upper-bound $\|\hat \theta - \theta^* \|_2$. Recovering all `strong' edges of the graph is a direct consequence of this analysis: by thresholding the weak coefficients of $\hat \theta$, one recovers all `strong' parents without false positives, as shown in Corollary~\ref{cor:variable_selection}.

We will first apply standard techniques to obtain a ${\cal O}(\sqrt{\frac{s \log m}{n}})$ $\ell_2$-norm upper-bound in the case of sparse vectors. We will then extend this analysis to non-sparse vectors. In Section~\ref{sec:lowerbound}, we show that our results are almost tight.

\subsection{Main Theorem}

We begin our analysis with the following simple lemma:
\begin{lemma}
\label{lem:theta_p_upperbound}
$\|\theta - \theta^* \|_2 \geq \|p - p^*\|_2$
\end{lemma}
\begin{proof}
Using the inequality $\forall x>0, \; \log x \geq 1 - \frac{1}{x}$, we have $|\log (1 - p) - \log (1-p')| \geq \max(1 - \frac{1-p}{1-p'}, 1 - \frac{1-p'}{1-p}) \geq \max( p-p', p'-p)$. The result follows easily.
\end{proof}

In other words, an upper-bound on the estimation error of the `effective' parameters $\theta_{i,j} \defeq \log(1-p_{i,j})$ immediately provides an upper-bound on the estimation error of the true parameters $(p_{i,j})_{i,j}$.
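
To spell out the last step of the proof of Lemma~\ref{lem:theta_p_upperbound}, recall that $\theta_j = \log(1-p_j)$, so that the scalar inequality applies coordinate by coordinate. Squaring and summing over $j$ gives
\begin{equation}
\nonumber
\|\theta - \theta^*\|_2^2 = \sum_j \left|\log(1-p_j) - \log(1-p^*_j)\right|^2 \geq \sum_j |p_j - p^*_j|^2 = \|p - p^*\|_2^2
\end{equation}
As a sanity check on a single coordinate, with the purely illustrative values $p = 0.5$ and $p' = 0.2$, we get $|\log(0.5) - \log(0.8)| = \log(1.6) \approx 0.47 \geq 0.3 = |p - p'|$.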

Interestingly, finding such an upper-bound is commonly studied in sparse recovery and essentially relies on the restricted eigenvalue condition for a symmetric matrix $\Sigma$ and set ${\cal C} \defeq \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 \} \cap \{ \|X\|_1 \leq 1 \}$:

\begin{equation}
\nonumber
\forall X \in {\cal C}, \| \Sigma X \|_2^2 \geq \gamma_n \|X\|_2^2 
\tag{RE}
\end{equation}
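
As a small illustration of the set ${\cal C}$ (the dimension, support and vectors below are hypothetical and chosen only for concreteness), take $S = \{1\}$ in $\mathbb{R}^3$. For $X = (0.4, 0.3, 0.2)$ and $X' = (0.1, 0.3, 0.2)$ we have
\begin{align}
\nonumber
\|X_{S^c}\|_1 = 0.5 & \leq 3 \|X_S\|_1 = 1.2, \quad \|X\|_1 = 0.9 \leq 1 \\ \nonumber
\|X'_{S^c}\|_1 = 0.5 & > 3 \|X'_S\|_1 = 0.3
\end{align}
so that $X \in {\cal C}$ and $X' \notin {\cal C}$. The {\bf(RE)} condition therefore only constrains directions whose $\ell_1$ mass is sufficiently concentrated on the support $S$.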

We compare this condition to the irrepresentability condition used in prior work in Section~\ref{sec:lowerbound}. We cite the following theorem from \cite{Negahban:2009}:

\begin{theorem}
\label{thm:neghaban}
Suppose the true vector $\theta^*$ has support $S$ of size $s$ and the {\bf(RE)} assumption holds for the Hessian $\nabla^2 f(\theta^*)$. Then, by solving \eqref{eq:pre-mle} for $\lambda_n \geq 2 \|\nabla f(\theta^*)\|_{\infty}$, we have:
\begin{equation}
\|\hat \theta - \theta^* \|_2 \leq 3 \frac{\sqrt{s}\lambda_n}{\gamma_n}
\end{equation}
\end{theorem}

In Section~\ref{subsec:icc}, we find a ${\cal O}(\sqrt{n})$ upper-bound for valid $\lambda_n$. It is also reasonable to assume $\gamma_n = \Omega(n)$, as discussed in Section~\ref{sec:assumptions}, yielding a ${\cal O}(1/\sqrt{n})$ decay rate per measurement. We believe it is more natural to express these results in terms of the number of measurements $N$, i.e., the cumulative number of steps over all cascades, rather than the number of cascades $n$.
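
To make the resulting rate explicit, suppose, as a sketch, that $\lambda_n = c_\lambda \sqrt{n}$ is a valid choice and that $\gamma_n \geq c_\gamma n$, for constants $c_\lambda, c_\gamma > 0$ (these constants are introduced here for illustration only and absorb all dependence on quantities other than $n$). Substituting into the bound of Theorem~\ref{thm:neghaban} gives
\begin{equation}
\nonumber
\|\hat \theta - \theta^* \|_2 \leq 3 \frac{\sqrt{s}\lambda_n}{\gamma_n} \leq \frac{3 c_\lambda}{c_\gamma} \sqrt{\frac{s}{n}}
\end{equation}
so that, under these assumptions, quadrupling $n$ halves the bound on the estimation error.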


\subsection{Relaxing the Sparsity Constraint}

In many situations however, and for social networks in particular, the graph is not exactly $s$-sparse. A more realistic situation is one where each node has a few strong `parents' and many `weaker' parents. Rather than obtaining an impossibility result in this situation, we show that we pay a small price for relaxing the sparsity constraint. If we let $\theta^*_{\lfloor s \rfloor}$ be the best $s$-sparse approximation to $\theta^*$ defined as
$$\theta^*_{\lfloor s \rfloor} \defeq \min_{\|\theta\|_0 \leq s} \|\theta - \theta^*\|_1$$
then we pay ${\cal O} \left(\sqrt{\frac{\lambda_n}{\gamma_n}} \|\theta^*_{\lfloor s \rfloor}\|_1 \right)$ for recovering the weights of non-exactly sparse vectors. Since $\|\theta^*_{\lfloor s \rfloor}\|_1$ is the sum of the $\|\theta^*\|_0 -s$ weakest coefficients of $\theta^*$, the closer $\theta^*$ is to being sparse, the smaller the price. These results are formalized in the following theorem:

\begin{theorem}
\label{thm:approx_sparse}
Let $\theta^*_{\lfloor s \rfloor}$ be the best $s$-sparse approximation to the true vector $\theta^*$. Suppose the {\bf(RE)} assumption holds for the Hessian $\nabla^2 f(\theta^*)$ and for the following set:
\begin{align}
\nonumber
{\cal C}' \defeq & \{X \in \mathbb{R}^p : \|X_{S^c}\|_1 \leq 3 \|X_S\|_1 + 4 \|\theta^*_{\lfloor s \rfloor}\|_1 \} \\ \nonumber
& \cap \{ \|X\|_1 \leq 1 \}
\end{align}
By solving \eqref{eq:pre-mle} for $\lambda_n \geq 2 \|\nabla f(\theta^*)\|_{\infty}$ we have:
\begin{align}
\|\hat p - p^* \|_2 \leq 3 \frac{\sqrt{s}\lambda_n}{\gamma_n} + 2 \sqrt{\frac{\lambda_n}{\gamma_n}} \|p^*_{\lfloor s \rfloor}\|_1
\end{align}
\end{theorem}

This follows from a more general version of Theorem~\ref{thm:neghaban} in \cite{Negahban:2009}, from Lemma~\ref{lem:theta_p_upperbound} and the simple observation that $\| \theta^*_{\lfloor s \rfloor}\|_1 \leq \| p^*_{\lfloor s \rfloor} \|_1$.
The results of Section~\ref{subsec:icc} can be easily extended to the approximately-sparse case.
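
To see how the additional term of Theorem~\ref{thm:approx_sparse} scales, suppose, as a sketch, that $\lambda_n = c_\lambda \sqrt{n}$ is a valid choice and that $\gamma_n \geq c_\gamma n$, where the constants $c_\lambda, c_\gamma > 0$ are hypothetical and introduced only for this illustration. Then
\begin{equation}
\nonumber
2 \sqrt{\frac{\lambda_n}{\gamma_n}} \|p^*_{\lfloor s \rfloor}\|_1 \leq 2 \sqrt{\frac{c_\lambda}{c_\gamma}} \, n^{-1/4} \, \|p^*_{\lfloor s \rfloor}\|_1
\end{equation}
Under these assumptions, the price of non-sparsity decays as $n^{-1/4}$, which is slower than the $n^{-1/2}$ decay of the first term, and it vanishes when $\|p^*_{\lfloor s \rfloor}\|_1 = 0$.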


\subsection{Independent Cascade Model}
\label{subsec:icc}
We analyse the previous conditions in the case of the Independent Cascade model. Lemma~\ref{lem:icc_lambda_upper_bound} provides a ${\cal O}(\sqrt{n})$ upper-bound, with high probability, on $\|\nabla f(\theta^*)\|_{\infty}$:
\begin{lemma}
\label{lem:icc_lambda_upper_bound}
For any $\delta > 0$, with probability at least $1-e^{-n^\delta \log m}$, $\|\nabla f(\theta^*)\|_{\infty} \leq 2 \sqrt{\frac{n^{\delta + 1} \log m}{p_{\min}}}$.
\end{lemma}

We include a proof of this result in the Appendix. The following corollaries follow immediately from Theorem~\ref{thm:neghaban}, Theorem~\ref{thm:approx_sparse} and Lemma~\ref{lem:icc_lambda_upper_bound}:

\begin{corollary}
Assume that ${\bf (RE)}$ holds with $\gamma_n = n \gamma$ for $\gamma > 0$. Then for $\lambda_n \defeq 2 \sqrt{\frac{n^{\delta + 1} \log m}{p_{\min}}}$ and with probability $1-e^{-n^\delta \log m}$:
\begin{equation}
\|\hat p - p^* \|_2 \leq \frac{3}{\gamma} \sqrt{\frac{s \log m}{p_{\min} n^{1-\delta}}} + 2 \sqrt[4]{\frac{\log m}{n^{1-\delta} \gamma^2 p_{\min}}} \| p^*_{\lfloor s \rfloor} \|_1
\end{equation}
Note that if $p^*$ is exactly $s$-sparse, then $\| p^*_{\lfloor s \rfloor} \|_1 = 0$ and the bound reduces to:
\begin{equation}
\|\hat p - p^* \|_2 \leq \frac{3}{\gamma} \sqrt{\frac{s \log m}{p_{\min} n^{1-\delta}}}
\end{equation}
\end{corollary}

The following corollary follows easily and gives the first ${\cal O}(s \log m)$ algorithm for graph reconstruction on general graphs. The proofs are included in the Appendix.

\begin{corollary}
\label{cor:variable_selection}
Assume that ${\bf (RE)}$ holds with $\gamma_n = n \gamma$ for $\gamma > 0$ and that $\theta^*$ is $s$-sparse. Suppose that after solving for $\hat \theta$, we construct the set $\hat {\cal S}_\eta \defeq \{ j \in [1..p] : \hat p_j > \eta\}$ for $\eta > 0$. For $0 < \epsilon < \eta$, let ${\cal S}^*_{\eta + \epsilon} \defeq \{ j \in [1..p] :p^*_j > \eta +\epsilon \}$ be the set of all true `strong' parents. Suppose the number of measurements satisfies:
\begin{equation}
n > \frac{36}{p_{\min}\gamma^2 \epsilon^2} s \log m
\end{equation}
Then with probability $1-\frac{1}{m}$, ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta \subset {\cal S}^*$. In other words, we recover all `strong' parents and no `false' parents. Note that if $\theta^*$ is not exactly $s$-sparse and the number of measurements satisfies:
\begin{equation}
n > \frac{36 \| p^*_{\lfloor s\rfloor}\|_1}{p_{\min}\gamma^2 \epsilon^4} s \log m
\end{equation}
then similarly: ${\cal S}^*_{\eta + \epsilon} \subset \hat {\cal S}_\eta \subset {\cal S}^*$ w.h.p.
\end{corollary}

\begin{proof}
By choosing $\delta = 0$, if $n>\frac{36}{p_{\min}\gamma^2 \epsilon^2} s \log m$, then $\|\hat p-p^*\|_2 < \epsilon < \eta$ with probability $1-\frac{1}{m}$. If $p^*_j = 0$ and $\hat p_j > \eta$, then $\|\hat p - p^*\|_2 \geq |\hat p_j-p^*_j| > \eta$, which is a contradiction. Therefore we get no false positives. If $p^*_j > \eta + \epsilon$, then $|\hat p_j - p^*_j| < \epsilon \implies \hat p_j > \eta$. Therefore, we get all strong parents.
\end{proof}
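
For concreteness, consider the hypothetical values $\eta = 0.3$ and $\epsilon = 0.1$, and suppose that the estimation error satisfies $\|\hat p - p^*\|_2 < \epsilon$, so that $|\hat p_j - p^*_j| < 0.1$ for every $j$. Then:
\begin{align}
\nonumber
p^*_j = 0 & \implies \hat p_j < 0.1 < \eta \implies j \notin \hat {\cal S}_\eta \\ \nonumber
p^*_j = 0.45 > \eta + \epsilon & \implies \hat p_j > 0.35 > \eta \implies j \in \hat {\cal S}_\eta
\end{align}
A parent of intermediate strength, e.g. $p^*_j = 0.35$, may or may not be selected, but since $p^*_j > 0$ it is never a false positive.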

Note that $n$ is the number of measurements and not the number of cascades. This is an improvement over prior work since we expect several measurements per cascade.
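
To give a sense of the order of magnitude of the bound in Corollary~\ref{cor:variable_selection} in the exactly sparse case, consider the purely hypothetical values $s = 10$, $m = 1000$, $p_{\min} = 0.1$, $\gamma = 1$ and $\epsilon = 0.1$, taking $\log$ to be the natural logarithm:
\begin{equation}
\nonumber
\frac{36}{p_{\min}\gamma^2 \epsilon^2} s \log m = \frac{36 \times 10 \times \log(1000)}{0.1 \times 1 \times 0.01} \approx 2.5 \times 10^{6}
\end{equation}
These values are not taken from any experiment; they only illustrate that the bound is linear in $s$, logarithmic in $m$ and quadratic in $1/\epsilon$.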