Diffstat (limited to 'notes')
 notes/defs.tex           |  17
 notes/formalisation.pdf  | bin 179913 -> 197923 bytes
 notes/formalisation.tex  | 136
 3 files changed, 139 insertions, 14 deletions
diff --git a/notes/defs.tex b/notes/defs.tex
new file mode 100644
index 0000000..ca5dd55
--- /dev/null
+++ b/notes/defs.tex
@@ -0,0 +1,17 @@
+\newtheorem{theorem}{Theorem}[section]
+\newtheorem{lemma}[theorem]{Lemma}
+\newtheorem{proposition}[theorem]{Proposition}
+\newtheorem{corollary}[theorem]{Corollary}
+\theoremstyle{definition}
+\newtheorem*{remark}{Remark}
+
+\DeclareMathOperator{\E}{\mathbb{E}}
+\let\P\relax
+\DeclareMathOperator{\P}{\mathbb{P}}
+\newcommand{\ex}[1]{\E\left[#1\right]}
+\newcommand{\prob}[1]{\P\left[#1\right]}
+\newcommand{\inprod}[1]{\left\langle #1 \right\rangle}
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\eps}{\varepsilon}
+\newcommand{\eqdef}{\mathbin{\stackrel{\rm def}{=}}}
diff --git a/notes/formalisation.pdf b/notes/formalisation.pdf
index 02e0a47..2d64bb0 100644
--- a/notes/formalisation.pdf
+++ b/notes/formalisation.pdf
Binary files differ
diff --git a/notes/formalisation.tex b/notes/formalisation.tex
index 53786d3..00eb891 100644
--- a/notes/formalisation.tex
+++ b/notes/formalisation.tex
@@ -1,13 +1,10 @@
\documentclass[10pt]{article}
-\usepackage{fullpage, amsmath, amssymb, bbm, amsthm}
-
-\newtheorem{theorem}{Theorem}[section]
-\newtheorem{lemma}[theorem]{Lemma}
-\newtheorem{proposition}[theorem]{Proposition}
-\newtheorem{corollary}[theorem]{Corollary}
-\newenvironment{remark}[1][Remark]{\begin{trivlist}
-\item[\hskip \labelsep {\bfseries #1}]}{\end{trivlist}}
-
+\usepackage[utf8]{inputenc}
+\usepackage{amsmath, amssymb, bbm, amsthm}
+\usepackage[pagebackref=true,breaklinks=true,colorlinks=true,backref=false]{hyperref}
+\usepackage[capitalize, noabbrev]{cleveref}
+\usepackage{microtype}
+\input{defs}
\date{}
\title{Cracking Cascades: Applying Sparse Recovery to Graph Reconstruction}
@@ -18,6 +15,73 @@
\section{Preliminaries: What is a measurement?}
+There is an unknown graph $G=(V,E)$ that we wish to reconstruct by observing
+information cascades. An information cascade is a random discrete-time process
+where we get to observe the set of vertices infected at each time step.
+
+The key contribution is to formulate this problem as a sparse recovery problem
+akin to those found in the compressed sensing literature. The parallel between
+the two problems is as follows:
+\begin{itemize}
+    \item the unknown \emph{signal} $s$ is a vector in $\{0,1\}^{V\times V}$
+        with $s_{u,v} = 1$ if and only if $(u,v)\in E$.
+    \item the \emph{sensing matrix}, which maps the signal to the observation
+        space, is given by the cascades. Since the cascades are random time
+        processes, they define a sequence of random observations.
+    \item the \emph{observations} are the sets of vertices infected at each
+        time step in each cascade.
+
+Let us define $n\eqdef |V|$. If we assume that every vertex of $G$ has degree
+bounded by $d\in\N_+$, then the signal $s$ is $nd$-sparse. Assuming that the
+sensing matrix has good random properties, we can hope to recover $s$ with
+$O(nd\log n)$ random observations.
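
As a sanity check on this encoding, the signal $s$ can be built in a few lines (a minimal sketch; the helper name and the star-graph example are illustrative, not from the notes):

```python
import numpy as np

def graph_signal(n, edges):
    """Flatten the edge set of a directed graph on n vertices into the
    signal s in {0,1}^(n*n), with s[u*n + v] = 1 iff (u,v) is an edge."""
    s = np.zeros(n * n, dtype=int)
    for (u, v) in edges:
        s[u * n + v] = 1
    return s

# Star graph: vertex 0 points to every other vertex.
n = 5
edges = [(0, v) for v in range(1, n)]
s = graph_signal(n, edges)
# With maximum degree d, the signal has at most n*d non-zero entries.
```

Here the star graph gives 4 non-zero entries out of $n^2 = 25$.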
+
+There are two main goals:
+\begin{itemize}
+    \item understand to what extent the probabilistic cascade models translate
+        into incoherence properties of the sensing matrix and allow for an
+        accurate reconstruction of the unknown graph from a small number of
+        observations.
+ \item compare the empirical performance of this compressed sensing approach
+ to prior graph reconstruction approaches.
+\end{itemize}
+
+\begin{remark}
+\begin{itemize}
+    \item It seems that we will need a \emph{complete} probabilistic model for
+        the cascades. By complete I mean that we need to specify a joint
+        distribution on the source of the cascade (\emph{e.g.} the source is
+        chosen uniformly at random) and on the cascade diffusion process. Such
+        a model is probably not very realistic, but it is necessary to obtain
+        theoretical guarantees. The unrealism of the model is not a problem
+        \emph{per se}; the validity of this approach will in any case be
+        confirmed experimentally.
+    \item It seems likely that we will want the rows of the sensing matrix to
+        be independent realizations of the same distribution. However,
+        \textbf{this will never be the case} if we have one row for each time
+        step of each cascade: the rows within the same cascade are obviously
+        highly correlated. We might have to \emph{aggregate} them (for example
+        through concatenation?) in a sensible way, so as to end up with one
+        row per cascade.
+    \item Something which makes this problem very different from the usual
+        compressed sensing problems is that the measurements depend on the
+        signal we wish to recover: cascades propagate along the edges
+        contained in the signal. More precisely, not only are the rows of the
+        sensing matrix correlated with each other (if we do not aggregate them
+        by cascade), they are also correlated with the signal itself. Is this
+        a problem? I don't know, but it looks scary.
+ \item A slightly more general model which we will need in the independent
+ cascade case considers a signal $s\in[0,1]^{V\times V}$. We can always
+ go from this model to the binary case by first recovering the
+ non-binary $s$ and then applying a thresholding procedure. But if we
+ only care about recovering the support of $s$, can we do better than
+ this two-step process? (TO DO: read stuff about support recovery).
+\end{itemize}
+\end{remark}
+
+Previous stuff:
\begin{itemize}
\item Introduction
\item Simple graphs (cycle, star, path, tree?)
@@ -45,24 +109,43 @@ Recap on what the model is
\subsection{Results under isotropic condition}
\section{Independent Cascade Model}
+
\subsection{The Model}
-Recall the independent cascade model, where nodes have 3 states: susceptible, active, and inactive. All nodes start either as susceptible or active. At each time step t, for every (active, susceptible)-pair $(j,i)$, $j$ infects $i$ with probability $p_{j,i}$. At t+1, $j$ becomes inactive and $i$ becomes active. The cascade continues until no active nodes remain.
-Consider a node i. If $p_{j,i}$ is the probability that node $j$ infects $i$ at the next step, let $\theta_{i,j} := \log (1 - p_{j,i})$. Note that $\theta_{j,i} = 0 \Leftrightarrow p_{j,i} = 0$. Also note that note that
+Recall the independent cascade model, where nodes have three states:
+susceptible, active, and inactive. All nodes start either as susceptible or
+active. At each time step $t$, for every (active, susceptible)-pair $(j,i)$,
+$j$ attempts to infect $i$ and succeeds with probability $p_{j,i}$; $j$ then
+becomes inactive. If $j$ succeeds, $i$ becomes active at time step $t+1$. The
+cascade continues until no active nodes remain.
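
The dynamics above can be sketched as a short simulation (a minimal sketch; the function name and the matrix encoding of the $p_{j,i}$ are illustrative assumptions, not from the notes):

```python
import random

def simulate_cascade(p, sources, rng=random.Random(0)):
    """One run of the independent cascade model.

    p[j][i] is the probability that active node j infects susceptible node i.
    Returns the list of active sets A^0, A^1, ... where A^t is the set of
    nodes that are active at time step t."""
    n = len(p)
    susceptible = set(range(n)) - set(sources)
    active = set(sources)
    history = [set(active)]
    while active:
        newly_active = set()
        for j in active:                 # each active node attempts once...
            for i in list(susceptible):  # ...on every susceptible node
                if rng.random() < p[j][i]:
                    newly_active.add(i)
        susceptible -= newly_active
        active = newly_active            # previously active nodes turn inactive
        if active:
            history.append(set(active))
    return history

# Path 0 -> 1 -> 2 with sure infections: the cascade sweeps along the path.
p = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
print(simulate_cascade(p, sources=[0]))  # [{0}, {1}, {2}]
```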
+
+Consider a node $i$. If $p_{j,i}$ is the probability that node $j$ infects $i$
+at the next step, let $\theta_{j,i} := \log (1 - p_{j,i})$. Note that
+$\theta_{j,i} = 0 \Leftrightarrow p_{j,i} = 0$. Also note that:
$$\sum_{j\in S} \theta_{j,i} = \log \prod_{j \in S} (1 - p_{j,i}) = \log \mathbb{P}\left(i \text{ not infected by any } j \in S\right)$$
Our objective is to recover the vector $\theta_{*,i}$ for every node $i$.
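
A quick numerical sanity check of the additivity identity above (the values of the $p_{j,i}$ are made up for illustration):

```python
import math

# p[j] = p_{j,i}: infection probabilities towards a fixed node i (made-up values)
p = {1: 0.5, 2: 0.25, 3: 0.0}
theta = {j: math.log(1 - pj) for j, pj in p.items()}

S = [1, 2, 3]
lhs = sum(theta[j] for j in S)
# log P(i not infected by any j in S), by independence of the attempts
rhs = math.log(math.prod(1 - p[j] for j in S))
assert abs(lhs - rhs) < 1e-12
```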
\subsection{Formulating the sparse recovery problem}
+
The idea behind what follows is the assumption that $p_{*,i}$ and therefore $\theta_{*,i}$ are sparse vectors: a node has few neighbors (parents in the directed graph) compared to the size of the graph.
-Let $\mathbbm{1}_S$ be the indicator variable for the nodes that are active at a certain time step t of a certain cascade. Under the previous assumption, we have: $$< \mathbbm{1}_S, \theta_{*,i}> = \log \mathbb{P}( \text{i not infected at } t+1| S \text{ active at } t)$$
+Let $\mathbbm{1}_S$ be the indicator vector of the set $S$ of nodes active at
+a certain time step $t$ of a certain cascade. With the notation above, we
+have:
+$$\inprod{\mathbbm{1}_S, \theta_{*,i}} = \log \mathbb{P}\left(i \text{ not infected at } t+1 \mid S \text{ active at } t\right)$$
-Suppose we observe a series of cascades $(I_k^t)$, where $I_k^t$ are the nodes infected in cascade $k$ at time $t$. Recall that once a node has been infected (becomes active), it cannot be infected in the future (is inactive). Let us therefore look at all the sets $(I_k^t)_{t \leq t_i^k} $ such that for every cascade $k$, $t_i^k > t$, where $t_i^k = + \infty$ if $i$ was not infected in that cascade. If we concatenate these rows $\mathbbm{1}_{I_K^t}$ into a matrix $M$ (for measurement), we have:
+Suppose we observe a series of cascades $(A_k^t)$, where $A_k^t$ is the set of
+nodes active in cascade $k$ at time $t$ (these are exactly the nodes which
+became infected at time $t-1$). Recall that once a node has been infected
+(becomes active), it cannot be infected again later (it becomes inactive). For
+each cascade $k$, let $t_i^k$ be the time at which $i$ was infected, with
+$t_i^k = + \infty$ if $i$ was never infected in that cascade, and consider the
+sets $(A_k^t)_{t < t_i^k}$. If we stack the corresponding rows
+$\mathbbm{1}_{A_k^t}$ into a matrix $M$ (for measurement), we have:
$$M \theta_{*,i} = b$$
-
where $b$ is the column vector with entries $\log \mathbb{P}\left(i \text{ not infected at } t+1 \mid S \text{ active at } t\right)$. Note that even if we had direct access to $b$, when $M$ does not have full column rank, solving for $\theta_{*,i}$ is not trivial! However, under sparsity assumptions on $\theta_{*,i}$, we could apply results from the sparse recovery/sparse coding/compressed sensing literature.
In reality, we do not have access to $b$. We only have access to the realization vector $\vec w$ of the events $\{i \text{ infected at } t+1 \mid S \text{ active at } t\}$:
@@ -71,6 +154,31 @@ $$\vec w\sim B(1 - e^{M \theta_{*,i}})$$
where the operation $f: \vec v \mapsto 1 - e^{\vec v}$ is applied element-wise and $B(\vec p)$ denotes a vector of independent Bernoulli variables whose parameters are the entries of $\vec p$.
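
The measurement model can be sketched as follows (the active sets and parameters are made-up illustrations; the rows of $M$ are the indicator vectors $\mathbbm{1}_{A_k^t}$):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 6                                  # number of nodes
theta = np.zeros(n)                    # theta_{*,i} for a fixed target node i
theta[[1, 3]] = np.log(1 - 0.4)        # nodes 1 and 3 are parents of i, p = 0.4

# Rows of M: indicator vectors of the active sets observed while i was susceptible.
active_sets = [{1}, {3}, {1, 3}, {2, 4}]
M = np.array([[1 if j in S else 0 for j in range(n)] for S in active_sets])

b = M @ theta                          # log P(i not infected at t+1 | S active at t)
w = (rng.random(len(b)) < 1 - np.exp(b)).astype(int)  # w ~ B(1 - e^{M theta})
```

The last row, whose active set contains no parent of $i$, has $b = 0$ and thus infection probability $1 - e^0 = 0$: the corresponding entry of $\vec w$ is always $0$.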
+\subsection{Maximum likelihood problem}
+
+The maximum likelihood problem takes the following form, where $w_{k,t}$ is
+the entry of $\vec w$ indicating whether $i$ was infected at time $t+1$ of
+cascade $k$:
+\begin{displaymath}
+  \begin{split}
+    \max_\theta &\sum_{k,t}\left(w_{k,t}\log\left(1-\exp\inprod{\mathbbm{1}_{A_k^t}, \theta}\right)
+    + (1-w_{k,t})\log\exp\inprod{\mathbbm{1}_{A_k^t},\theta}\right)\\
+    \text{s.t. }& \|\theta\|_1 \leq \lambda
+  \end{split}
+\end{displaymath}
+which we can rewrite as:
+\begin{displaymath}
+  \begin{split}
+    \max_\theta &\sum_{k,t}w_{k,t}\log\left(\exp\left(-\inprod{\mathbbm{1}_{A_k^t}, \theta}\right)-1\right)
+    + \sum_{k,t}\inprod{\mathbbm{1}_{A_k^t},\theta}\\
+    \text{s.t. }& \|\theta\|_1 \leq \lambda
+  \end{split}
+\end{displaymath}
+(We write $\lambda$ for the sparsity budget to avoid a clash with the cascade
+index $k$.) In this form it is easy to see that this is a convex program: the
+objective function is the sum of a linear function and a concave function of
+$\theta$, since $x\mapsto\log(e^{-x}-1)$ is concave on $x<0$.
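
A numerical spot-check of the concavity claim on random data (a midpoint test along a segment of feasible points $\theta \le 0$; all names and data are illustrative assumptions):

```python
import numpy as np

def loglik(theta, M, w):
    """MLE objective: sum_r w_r*log(1 - e^{<m_r,theta>}) + (1 - w_r)*<m_r,theta>."""
    x = M @ theta  # each entry is <m_r, theta>, negative for non-empty rows, theta < 0
    return np.sum(w * np.log1p(-np.exp(x)) + (1 - w) * x)

rng = np.random.default_rng(1)
M = rng.integers(0, 2, size=(20, 5)).astype(float)
M[:, 0] = 1.0  # ensure every "active set" is non-empty so that <m_r, theta> < 0
w = rng.integers(0, 2, size=20).astype(float)

t1 = -rng.random(5) - 0.1
t2 = -rng.random(5) - 0.1
mid = loglik((t1 + t2) / 2, M, w)
# Concavity: the value at the midpoint dominates the average of the endpoints.
assert mid >= (loglik(t1, M, w) + loglik(t2, M, w)) / 2 - 1e-9
```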
+
+\paragraph{Questions} Is the program also convex in the $p_{j,i}$'s? What is
+the theory of sparse recovery for maximum likelihood problems? (TO DO: read
+about Bregman divergences, I think they could be related.)
+
\subsection{Results under strong assumptions}
What can we hope to obtain, etc. (if $M$ has full column rank, or if the same sets $S$ occur frequently).