finale/project_proposal.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96

\documentclass[8pt]{article}
\usepackage{fullpage, amsmath, amssymb, amsthm}

\title{Regression Analysis with Network data}
\author{Jean Pouget-Abadie, Thibaut Horel}
\date{}

\begin{document}
\maketitle

\subsection*{The Network Inference problem}

The network inference problem concerns itself with learning the edges and the
edge weights of an unknown network. Each edge weight $\theta_e, e\in E$ is a
parameter to be estimated. The information at our disposal is the result of a
cascade process on the network. Here, we will focus on the Generalized Linear
Cascade (GLC) model introduced in~\cite{paper} presented below.

\paragraph{The GLC model}

Consider a graph $\mathcal{G} = (N, E)$, where nodes $N$ can be one of three
possible states: susceptible, infected, and immune.  Let $x^t$ be the indicator
variable of ``infected nodes'' at time step $t$.  A \emph{generalized linear
cascade model} is a cascade model such that for each ``susceptible'' node $i$ at
time step $t$, the probability of $i$ becoming ``infected'' at time step $t+1$
conditioned on $x^t$ is a Bernoulli variable of parameter $f(\theta_i
\cdot X^t)$:
\begin{equation}
    \mathbb{P}(x^{t+1}_i = 1|x^t) = f(\theta_i \cdot x^t)
\end{equation}
where $f: \mathbb{R} \rightarrow [0,1]$. There is no a priori rule for the
transition from and to the immune and susceptible state, but the most commonly
studied GLC model, known as the Independent Cascade model, assumes that all
infected nodes at time step $t$ are immune for all time steps $t' \geq t+1$.
Under this assumption, a cascade ends when no ``infected'' nodes remain. The
designer is also free to decide the initial state of graph $\mathcal{G}$ at the
beginning of every cascade process. For simplicity, we will suppose that, at the
start of each cascade, one node is chosen uniformly at random to be
``infected'', and that all other nodes are in the ``susceptible'' state.

\bigskip
The network inference problem can therefore be formalized as finding the support
of the edge weight vector $\theta$ from the observation of ${(x^c_t)}_{c\in
\mathcal{C}, t\in \mathcal{T}_C}$, where $\mathcal{C}$ indexes the cascade
number, and $\mathcal{T}_C$ the time step in that cascade. Prior work has
observed that the network inference problem is decomposable node-by-node
i.e.~solving for $\beta_{i} = \{\theta_{i,j} | j \in N\}$ for node $i$. This can
easily be formulated as the following maximum likelihood program:
$$ \theta_{i} \in \arg \max_\theta \log
\mathcal{L}_i(\theta_i\,|\,x^1,\ldots,x^n) = \frac{1}{|{\cal T}_i|} \sum_{t\in
{\cal T}_i } x_i^{t+1}\log f(\theta_i\cdot x^{t}) + (1 -
x_i^{t+1})\log\big(1-f(\theta_i \cdot x^t)\big)$$
where $\mathcal{T}_i = \{\text{node i is ``susceptible'' at time step } t\}$.
The reader will note the above problem boils down to fitting a generalized
linear model. In particular, if $f$ is the sigmoid function, we are performing
logistic regression:
$$
\begin{cases}
y_i^* = \theta_i \cdot x_i + \epsilon \text{~where~} \epsilon\sim
Logistic(0, 1) \text{~and~} x = {(x^t)}_{t \in \mathcal{T}_i}\\
y_i = [y_i^* > 0] \text{~where~} y_i^t = x_i^{t+1}
\end{cases}
$$

\subsection*{Objectives}

\begin{itemize}
\item Try a Bayesian approach to estimate these parameters. Use the posterior
predictive distribution to obtain confidence intervals for the edge parameters.
Validate this with bootstrapping. How does this perform in different networks?
Can you intuitively link certain node-level/graph-level properties with the
resulting variance on the estimated parameter?
\item Do the previous observations correspond with the theoretical result, given
by the Fisher information matrix: $$\hat \beta \sim \mathcal{N}(\beta,
I{(\theta)}^{-1})$$ where $I(\theta) = - {\left(\frac{\partial^2\log
\mathcal{L}}{\partial \theta^2} \right)}^{-1}$
\item Are there networks in which the Fisher information matrix is singular?
What happens to the estimation of $\beta$ in this case?
\item What if the generative process is generated with a different link
function? Is there a regularization scheme which can mitigate any bias/exploding
variance in the estimated parameters?
\end{itemize}

\subsection*{Program plan}

The project will be a series of simulations to answer each of the above
questions? When possible, we will try to explain the results found in the
simulation with a simplified analysis on toy-networks. Thibaut and I have worked
together in the past, and have kept our contributions balanced.

\begin{thebibliography}{1}
\bibitem{paper} Pouget-Abadie, J. and Horel, T. \emph{Inferring Graphs from
Cascades: A Sparse Recovery Framework}, ICML 2015
\end{thebibliography}

\end{document}