Diffstat (limited to 'finale')
 finale/mid_report.tex | 93
 1 file changed, 63 insertions(+), 30 deletions(-)
diff --git a/finale/mid_report.tex b/finale/mid_report.tex
index 75f5f4a..b7d9608 100644
--- a/finale/mid_report.tex
+++ b/finale/mid_report.tex
@@ -195,15 +195,19 @@ specific instances of the GLC model.
\section{Identifiability}
-A given source distribution $p_s$ and \cref{eq:markov} completely specifies the
-distribution $p$ of a cascade $\mathbf{x} = (x_t)_{t\geq 0}$:
-\begin{displaymath}
+It follows from Section 2 that a source distribution $p_s$ and
+\cref{eq:markov} together completely specify the distribution $p$ of a cascade
+$\mathbf{x} = (x_t)_{t\geq 0}$:
+\begin{equation}
+ \label{eq:dist}
p(\bx;\Theta) = p_s(x^0)\prod_{t\geq 1}\prod_{i\in S_t}
 f(\bt_i\cdot x^{t-1})^{x^t_i}\big(1-f(\bt_i\cdot x^{t-1})\big)^{1-x_i^t}
-\end{displaymath}
+\end{equation}
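To make \cref{eq:dist} concrete, the following sketch evaluates the log of the product term for an observed cascade. All names here are our own, and the sigmoid is an arbitrary illustrative choice of $f$, not part of the model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cascade_log_likelihood(cascade, theta, f=sigmoid):
    """Log of the product over t >= 1 in the cascade distribution,
    i.e. everything except the log p_s(x^0) term.

    cascade: list of 0/1 numpy vectors x^0, x^1, ..., where x^t marks the
    nodes newly infected at step t.  theta[u, v] weights the edge u -> v,
    so theta_v . x^{t-1} equals (x^{t-1} @ theta)[v].  The susceptible
    set S_t contains the nodes not yet infected before step t.
    """
    infected = cascade[0].astype(bool)
    ll = 0.0
    for t in range(1, len(cascade)):
        p = f(cascade[t - 1] @ theta)          # infection probabilities
        for v in np.flatnonzero(~infected):    # v ranges over S_t
            ll += np.log(p[v]) if cascade[t][v] else np.log1p(-p[v])
        infected |= cascade[t].astype(bool)
    return ll
```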
-We see this model as parametrized by $\Theta$. $f$ and $p_s$ are considered
-fixed.
+In what follows, we consider $p_s$ and $f$ to be fixed. Hence the only
+parameter of the model is the matrix $\Theta$, which we wish to learn. We
+note that the problem of learning $\Theta$ may be ill-posed when the model
+is unidentifiable. We recall the definition of identifiability below.
\begin{definition}
We say that the model $\mathcal{P}
@@ -215,9 +219,10 @@ fixed.
\end{displaymath}
\end{definition}
-In our case, identifiability is not guaranteed. In fact there are two factors
-which can lead to unidentifiability: lack of information or ambiguity. We
-illustrate both factors in the subsequent two examples.
+For the GLC model, identifiability is not guaranteed. To develop an intuition
+about why this might be the case, we present below examples of
+unidentifiability because of two obstructions: lack of information and parent
+ambiguity.
\begin{figure}
\begin{center}
@@ -234,11 +239,12 @@ adversarially.}
\begin{example}[Lack of information]
Let us consider the graph in \cref{fig:ident} and the source distribution
$p_s$ defined by $p_s\big((0, 1, 0)\big) = 1$. In other words, node $2$ is
- always the only infected node at $t=0$. Then it is clear that all
+ always the only infected node at $t=0$. Then, it is clear that all
  cascades die at $t=1$, at which point node 3 becomes infected with probability
$f(\theta_{2,3})$. This probability does not depend on $\theta_{1,3}$.
- Hence all matrices $\Theta$ compatible with the graph and for which
- $\theta_{2,3}$ is fixed yield the same cascade distribution.
+  Hence all matrices $\Theta$ compatible with the graph (defining the same
+  edge set) and which agree on $\theta_{2,3}$ yield the same cascade
+  distribution according to \cref{eq:dist}.
\end{example}
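A minimal numeric check of this example, with a sigmoid chosen arbitrarily for $f$ and node indices $1,2,3$ mapped to positions $0,1,2$ (both choices are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_node3_infected(theta13, theta23, f=sigmoid):
    """P(node 3 infected at t=1) given x^0 = (0, 1, 0): f(theta_3 . x^0)."""
    x0 = (0, 1, 0)                 # node 2 is the only source
    return f(theta13 * x0[0] + theta23 * x0[1])

# varying theta_{1,3} leaves the cascade distribution unchanged
assert p_node3_infected(0.1, 0.7) == p_node3_infected(5.0, 0.7)
```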
\vspace{.5em}
@@ -256,17 +262,20 @@ for some $c\in \R_+$ define the same cascade probability distribution.
\vspace{.5em}
-The examples show that for a model to be identifiable, we need conditions on
-the source distribution, in particular in relation to how well it plays with
-the reachability structure of the graph.
+These examples show that the question of identifiability is closely related
+to the reachability of nodes from the source. The following proposition
+shows that, under the right assumptions on the source distribution, we can
+guarantee identifiability of the model.
-We will abuse the notation $p_s$ and for $u\in V$ write:
+We will slightly abuse the notation $p_s$ and for $u\in V$ write:
\begin{displaymath}
p_s(w) \eqdef \sum_{\bx: \bx_w = 1} p_s(\bx)
\end{displaymath}
-Furthermore we write $u\leadsto v$ if there is a directed path from $u$ to $v$
-in $G$ and $u\leadsto v\leadsto w$ if there is a directed path from $u$ to $w$
-passing through $v$ in $G$.
+and similarly for a set of vertices $S$, $p_s(S)$ is defined by replacing the
+condition $\bx_w = 1$ by $\bx_S = 1$ in the summation condition in the above
+equation. Furthermore, we write $u\leadsto v$ if there is a directed path from
+$u$ to $v$ in $G$ and $u\leadsto v\leadsto w$ if there is a directed path from
+$u$ to $w$ passing through $v$ in $G$.
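The relations $u\leadsto v$ and $u\leadsto v\leadsto w$ can be sketched as a plain breadth-first search; the function names below are ours, and we take every node to reach itself via the empty path:

```python
from collections import deque

def reachable(adj, u):
    """Set of nodes w with u ~> w in the directed graph adj.

    adj maps each node to an iterable of its out-neighbours; a node is
    considered to reach itself (length-zero path).
    """
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return seen

def leads_through(adj, u, v, w):
    """True iff u ~> v ~> w: a directed path from u to w through v."""
    return v in reachable(adj, u) and w in reachable(adj, v)
```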
\begin{proposition}
Under the assumptions:
@@ -276,22 +285,46 @@ Under the assumptions:
\item $\forall (u,v)\in V^2$, there exists $w$ such that $p_s(w)\neq 0$ and
$w\leadsto u\leadsto v$.
\end{enumerate}
-then the model is identifiable.
+then the GLC model is identifiable.
\end{proposition}
-2. prevents parent ambiguity at the source. 1. prevents parent ambiguity from
- occurring later on in the cascade even when there is no ambiguity at the
- source (could be relaxed to ``absence of forks''). 3. prevents lack of
- information.
+\begin{proof}[Proof sketch] We only give a quick proof sketch here and defer
+ the details to the final version of this report. Let us consider $\Theta
+ \neq \Theta'$. The goal is to prove that $p(\cdot\,;\,\Theta)\neq
+  p(\cdot\,;\,\Theta')$. Since $\Theta \neq \Theta'$ there exists a pair
+ $(u,v)$ such that $\Theta_{u,v} \neq \Theta'_{u,v}$. Using assumption 3. we
+ are guaranteed that with non-zero probability there will exist a time-step
+ $t$ such that $x^t_u = 1$. We then distinguish two cases:
+ \begin{itemize}
+ \item $t=0$, in this case assumption 2. guarantees that with non-zero
+ probability $u$ will be the only infected node at $t=0$.
+ \item $t\geq 1$, in this case assumption 1. guarantees that with
+ non-zero probability $u$ will be the only infected node at
+ time-step $t$ (because the transitions in \cref{eq:markov} occur
+ independently).
+ \end{itemize}
+ In both cases, it then follows from the form of \cref{eq:dist} after
+ conditioning on $x^{t'}$ for $t' < t$ that the probability that $v$ will be
+ infected at $t+1$ is different under $\Theta$ and under $\Theta'$.
+\end{proof}
-\begin{proof}
-\end{proof}
+\begin{remark}
+Intuitively, the assumptions of the proposition can be interpreted as follows:
+assumption 2. prevents parent ambiguity at the source; assumption 1. prevents
+parent ambiguity from occurring at later time-steps in the cascade even when
+there is no ambiguity at the source; finally, assumption 3. prevents lack of
+information.
+
+When the graph $G$ is strongly connected, examples of simple source
+distributions which satisfy the assumptions of the proposition include a single
+source chosen uniformly at random among the vertices of the graph, or a source
+where each node in $G$ is included independently with some probability $p>0$.
-\begin{remark} The conditions are the weakest under which there is
- identifiability (we can show that they are necessary and sufficient).
- Reasonable source distributions which imply the conditions are uniform
- random, independent Bernouilli.
+Finally, assumption 1. can be relaxed to a weaker assumption, under which
+Proposition 2 becomes an exact characterization of the identifiability of
+the model. Again, we defer the details to the final version of this report.
\end{remark}
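As a sanity check, assumption 3. can be verified by brute force for a given graph and source support. The sketch below (helper names are our own) reuses breadth-first reachability; for the full-support sources just mentioned on a strongly connected $G$, the check passes:

```python
from collections import deque
from itertools import product

def _reach(adj, u):
    """Nodes reachable from u by a directed path (u reaches itself)."""
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return seen

def satisfies_assumption_3(adj, nodes, support):
    """True iff for every pair (u, v) some w with p_s(w) > 0 has w ~> u ~> v.

    support: the set of nodes w with p_s(w) > 0.  For a uniform single
    source or an independent-Bernoulli source, support is all of nodes.
    """
    reach = {x: _reach(adj, x) for x in set(nodes) | set(support)}
    return all(
        v in reach[u] and any(u in reach[w] for w in support)
        for u, v in product(nodes, repeat=2)
    )

# a directed 3-cycle is strongly connected: any full-support source works
cycle = {1: [2], 2: [3], 3: [1]}
assert satisfies_assumption_3(cycle, {1, 2, 3}, {1, 2, 3})
```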
\section{Inference}