\documentclass{article}
\usepackage{fullpage, amsmath, amssymb}
\usepackage{algpseudocode, algorithm, algorithmicx}
\title{Extensions}
\begin{document}
\section{Bayesian Framework}
\subsection{Bayesian Learning}
Let $\theta$ be the parameters of the graph $\mathcal{G}$, let $S$ be a source node, and let $X =
(X_1, \dots, X_T)$ (where $T$ is potentially a stopping time) be the vectors
encoding a cascade on $\mathcal{G}$. Assuming a prior $\mathcal{D}$ on
$\theta$, we have the following setup:
\begin{alignat*}{2}
\theta & \sim \mathcal{D}\qquad &&\text{(prior on $\theta$)} \\
X_0 = S & \sim \mathcal{D}'' &&\text{(random source distribution)}\\
\forall t\geq 0,\quad X_{t+1} \mid X_t, \theta &\sim \mathcal{D}'
&&\text{(cascade process)}
\end{alignat*}
Prior work has focused on regularized lasso estimation, which is equivalent to
choosing an independent Laplace prior for $\theta$. In the context of social
network learning, however, we can place more informative priors. For instance, we can
\begin{itemize}
\item Take into account the existence of edges
\item Take into account the distinction between weak and strong ties
\item Take into account common graph structures, such as triangles
\end{itemize}
We can sample from the posterior by MCMC.
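As a concrete illustration, the snippet below is a minimal random-walk Metropolis sketch for sampling $\theta$ from the posterior. It assumes an independent Laplace prior (the lasso-equivalent choice; the more informative priors listed above would replace \texttt{log\_prior}) and a generalized-linear activation rule for the cascade process $\mathcal{D}'$; both the data format and the likelihood are illustrative assumptions rather than a fixed part of the framework.
\begin{verbatim}
import numpy as np

def log_prior(theta, scale=1.0):
    # Independent Laplace prior (lasso-equivalent). Edge-existence or
    # weak/strong-tie information would replace this term.
    return -np.abs(theta).sum() / scale

def log_likelihood(theta, cascades):
    # Illustrative GLM-cascade likelihood: an inactive node j activates at
    # step t+1 with probability sigmoid(sum_i theta[i, j] * x_t[i]).
    ll = 0.0
    for snapshots in cascades:                    # each snapshot: 0/1 vector
        for x_t, x_next in zip(snapshots[:-1], snapshots[1:]):
            p = 1.0 / (1.0 + np.exp(-(x_t @ theta)))
            newly = (x_next == 1) & (x_t == 0)    # nodes that just activated
            still = (x_next == 0) & (x_t == 0)    # nodes that stayed inactive
            ll += np.log(p[newly] + 1e-12).sum()
            ll += np.log(1.0 - p[still] + 1e-12).sum()
    return ll

def metropolis_posterior(cascades, m, n_samples=1000, step=0.05, seed=0):
    # Random-walk Metropolis over the m x m weight matrix theta.
    rng = np.random.default_rng(seed)
    theta = np.zeros((m, m))
    logp = log_prior(theta) + log_likelihood(theta, cascades)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal((m, m))
        logp_prop = log_prior(proposal) + log_likelihood(proposal, cascades)
        if np.log(rng.random()) < logp_prop - logp:   # accept/reject
            theta, logp = proposal, logp_prop
        samples.append(theta.copy())
    return samples
\end{verbatim}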
\subsection{Active Learning}
In this setup, $S$ is no longer drawn from the distribution $\mathcal{D}''$
but is chosen by the designer of the experiment. Once a source is chosen, a
cascade is drawn from the Markovian process $\mathcal{D}'$. We wish to choose
the source which maximises the information gain on the parameter $\theta$, so as
to speed up learning. We suggest choosing, at each round, the source which
maximises the mutual information between $\theta$ and $X$:
\begin{algorithm}
\caption{Active Bayesian Learning through Cascades (ABC)}
\begin{algorithmic}[1]
\State $\mathcal{D}_0 \gets \mathcal{D}$ \Comment{Initial prior on $\theta$}
\For{$k = 0, 1, 2, \dots$}
\State $i \leftarrow \arg\max_{i} I\big(\theta; (X_t)_{t \geq 1} \mid X_0 = i\big)$
\Comment{Maximize conditional mutual information}
\State Observe $(X_t)_{t\geq1} \mid X_0 = i$ \Comment{Observe a cascade from the
chosen source node}
\State $\mathcal{D}_{k+1} \gets{}$ posterior of $\theta \sim \mathcal{D}_k$
given $(X_t)_{t\geq1}$ \Comment{Update the distribution of $\theta$}
\EndFor
\end{algorithmic}
\end{algorithm}
Note that the mutual information can be estimated for each candidate source node by sampling:
$$I\big((X_t)_{t\geq 1}; \theta \mid X_0 = i\big) = \mathbb{E}_{\theta \sim \mathcal{D}_k;\; X \mid
\theta, i \sim \mathcal{D}'}\big[\log p(X\mid\theta)\big] - \mathbb{E}_{\theta \sim \mathcal{D}_k;\; X \mid \theta,
i \sim \mathcal{D}'}\Big[\log \mathbb{E}_{\tilde\theta\sim \mathcal{D}_k}\big[p(X\mid\tilde\theta)\big]\Big]$$
The update of the posterior can be done implicitly by MCMC, since we only need
to sample $\theta\sim \mathcal{D}_k$ in order to run the algorithm.
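A minimal sketch of this estimate is given below: both outer expectations are replaced by Monte Carlo averages over posterior draws, and the inner expectation is estimated with a log-sum-exp over fresh draws. The callables \texttt{sample\_theta}, \texttt{sample\_cascade} and \texttt{log\_lik} are placeholders for whatever sampler and likelihood the model provides.
\begin{verbatim}
import numpy as np

def estimate_mutual_information(source, sample_theta, sample_cascade, log_lik,
                                n_outer=200, n_inner=50):
    # Nested Monte Carlo estimate of I((X_t)_{t>=1}; theta | X_0 = source).
    #   sample_theta()           -> one draw theta ~ D_k (e.g. from an MCMC chain)
    #   sample_cascade(theta, s) -> one cascade started at source s under theta
    #   log_lik(x, theta)        -> log p(x | theta)
    first_term, second_term = [], []
    for _ in range(n_outer):
        theta = sample_theta()
        x = sample_cascade(theta, source)
        # E_{theta, X | theta}[ log p(X | theta) ]
        first_term.append(log_lik(x, theta))
        # log E_{theta' ~ D_k}[ p(X | theta') ], estimated with fresh draws
        inner = [log_lik(x, sample_theta()) for _ in range(n_inner)]
        second_term.append(np.logaddexp.reduce(inner) - np.log(n_inner))
    return np.mean(first_term) - np.mean(second_term)

# The ABC loop then selects argmax_i estimate_mutual_information(i, ...)
# and refreshes the posterior sampler after observing the new cascade.
\end{verbatim}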
\section{Gram matrix}
In the ICML paper, we showed that the graph could be recovered as long as the
expected Gram matrix $\mathcal{G}$ of observations had its restricted
eigenvalues bounded away from $0$. Recall that the Gram matrix $G$ is
given by $X^T X$, where $X$ is an $n\times m$ matrix, $n$ being the number of
measurements and $m$ the number of nodes; if $X_i$ is the column vector
indicating the nodes infected at step $i$, then the rows of $X$ are the
$X_i^T$. It follows that $$\mathcal{G} = \mathbb{E}\Big[\sum_i X_i X_i^T\Big]$$
Note that the indices of the sum are themselves random variables! Furthermore,
recall that the restricted eigenvalue condition requires a lower bound on the
quadratic form $v^T \mathcal{G} v / \|v\|_2^2$ over the cone of vectors
satisfying $\|v_{S^c}\|_1 \leq 3 \|v_S\|_1$.
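For reference, the empirical counterpart of $\mathcal{G}$ is easy to accumulate from observed cascades. The sketch below does so and checks the smallest eigenvalue on a candidate support, which is only a crude proxy for the restricted eigenvalue, since the actual condition quantifies over the whole cone.
\begin{verbatim}
import numpy as np

def empirical_gram(cascades, m):
    # G_hat = average over cascades of sum_t x_t x_t^T; each snapshot x_t is
    # the 0/1 vector of nodes infected at step t, and the number of steps
    # (the indices of the sum) varies from cascade to cascade.
    G = np.zeros((m, m))
    for snapshots in cascades:
        for x_t in snapshots:
            G += np.outer(x_t, x_t)
    return G / len(cascades)

def min_eig_on_support(G, support):
    # Smallest eigenvalue of G restricted to the coordinates in `support`;
    # a necessary (not sufficient) check for the restricted eigenvalue condition.
    return np.linalg.eigvalsh(G[np.ix_(support, support)]).min()
\end{verbatim}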
\subsection{Voter model}
In the voter model, $\mathbb{P}[X_j^{t+1} = 1 \mid X^t] = \sum_{i=1}^m A_{i,j}
X^t_i$, where $A$ is the weighted adjacency matrix of the graph, and the
process is run up to a fixed horizon $T$, a hyperparameter of the model.
Therefore, $\mathbb{E}[X^{t+1} \mid X^t] = A^T X^t$ and $\mathcal{G} = \sum_i {(A^T)}^i\,
\mathbb{E}[X_0 X_0^T]\, A^i$. In the case of the single-source model,
$\mathbb{E}[X_0X_0^T] = \frac{1}{m} I_m$ and it follows that $$\mathcal{G} =
\frac{1}{m} \sum_{i=1}^T {(A^T)}^i A^i$$
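This closed form is cheap to evaluate. The sketch below computes it for a small weighted cycle (an arbitrary illustrative graph) and reports the smallest eigenvalue of $\mathcal{G}$, which is the quantity the recovery condition cares about.
\begin{verbatim}
import numpy as np

def voter_expected_gram(A, T):
    # G = (1/m) * sum_{i=1..T} (A^T)^i A^i, for a uniform single source,
    # i.e. E[X_0 X_0^T] = I / m.
    m = A.shape[0]
    G = np.zeros((m, m))
    for i in range(1, T + 1):
        Ai = np.linalg.matrix_power(A, i)
        G += Ai.T @ Ai
    return G / m

m = 6
P = np.roll(np.eye(m), 1, axis=1)   # cyclic shift matrix
A = 0.5 * (P + P.T)                 # each node listens equally to its two neighbours
print(np.linalg.eigvalsh(voter_expected_gram(A, T=4)).min())
\end{verbatim}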
\section{Submodularity of Generalized Linear Cascades}
For which link functions is the resulting influence function submodular? We know
it to be true for the IC and LT models; what about the logistic cascade model?
The answer is no. If we take the example of three nodes $(A, B, C)$ with the
possibility to influence a remaining node $D$, with respective edge weights $a$,
$b$, and $c$, then we see that the following inequality must hold:
$$2\sigma(a+b) \geq \sigma(a+b+c) + \sigma(a)$$
We see that for $a=0.5$, $b=0.5$, and $c=1$, the inequality is violated.
Interestingly, if we let the scale parameter of the sigmoid go to infinity, the
inequality becomes harder to violate. (TH) Note that the LT model is
NOT submodular for every fixed value of the thresholds, but only in expectation
over the random draw of these thresholds.
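The violation is easy to check numerically. The sketch below evaluates both sides of the inequality at the values above, and also prints the gap for a few values of the scale parameter.
\begin{verbatim}
import numpy as np

def sigma(x, scale=1.0):
    return 1.0 / (1.0 + np.exp(-scale * x))

a, b, c = 0.5, 0.5, 1.0
lhs = 2 * sigma(a + b)               # 2 * sigma(1.0)        ~ 1.462
rhs = sigma(a + b + c) + sigma(a)    # sigma(2) + sigma(0.5) ~ 1.503
print(lhs, rhs, lhs >= rhs)          # prints False: the inequality is violated

# Gap lhs - rhs as the scale parameter grows; for these values it eventually
# becomes positive again.
for s in [1, 2, 5, 10]:
    print(s, 2 * sigma(a + b, s) - (sigma(a + b + c, s) + sigma(a, s)))
\end{verbatim}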
\section{Logistic regression}
Let's try to fit a logit model to the cascades. The variance of the parameter
estimates is approximated by $\hat V (\hat \beta) = {\big(\sum_t p_t(1-p_t)X_t X_t^T\big)}^{-1}$,
where $p_t = \sigma(X_t^T \hat\beta)$.
Let's have a look at the matrix we are inverting. If none of the parents of the
node of interest are active at step $t$, then the $t$-th term is $0$.
Similarly, if the probability that the node becomes active is $1$, then the term
cancels out. We can therefore write $$A = \sum_{t \,:\, p_t \notin \{0,1\}} p_t (1-p_t)
X_t X_t^T$$
Furthermore, we notice that $x^TAx = \sum_{t \,:\, p_t \notin\{0,1\}} p_t(1-p_t)\, x^TX_t
X_t^Tx = \sum_{t \,:\, p_t \notin \{0,1\}} p_t(1-p_t){(X_t^Tx)}^2$. This quadratic
form is zero as soon as we can find one node $a$ which can never be active when
the parents of our node are active (think of a line graph or a cycle with a single
source law): taking $x = e_a$ gives $x^T A x = 0$, so $A$ is singular.
Suppose now that with high probability, each parent has been active at least
once. If we consider the union of all nodes which were active just before our
node was, then by restricting to this subspace of parameters we can avoid the
previous pitfall. But is the restricted matrix invertible nonetheless?
Consider a line graph and a node $i \notin \{2, n-2\}$ along that line. The two
parents of $i$ can never be infected at the same time: which one is infected
depends on whether the source $s \geq i$ or $s \leq i$. The matrix $A$ is
therefore block-diagonal.
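The sketch below hand-constructs design rows matching this situation on a hypothetical $5$-node line with the target node in the middle (sources on the left only ever activate the left parent beforehand, and symmetrically on the right; the $p_t$ are placeholder values in $(0,1)$) and confirms that the resulting matrix $A$ is rank-deficient, with the block structure described above.
\begin{verbatim}
import numpy as np

m = 5          # nodes 0..4 on a line, the target (regressed) node is node 2

# Feature vectors observed at steps before the target activates:
#   sources to the left  -> only the left parent (node 1) can be active;
#   sources to the right -> only the right parent (node 3) can be active.
left_rows  = [np.array([1, 1, 0, 0, 0]), np.array([0, 1, 0, 0, 0])]
right_rows = [np.array([0, 0, 0, 1, 1]), np.array([0, 0, 0, 1, 0])]
rows = left_rows + right_rows
p = [0.6, 0.7, 0.6, 0.7]               # placeholder probabilities in (0, 1)

A = sum(pi * (1 - pi) * np.outer(x, x) for pi, x in zip(p, rows))
print(np.linalg.matrix_rank(A), "of", m)   # rank 4 of 5: A is singular
# The nonzero pattern splits into a "left" block {0, 1} and a "right" block
# {3, 4}; the column of the target itself is identically zero.
\end{verbatim}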
\end{document}