finale/project_proposal.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67

\documentclass[10pt]{article}
\usepackage{fullpage, amsmath, amssymb, amsthm}

\title{Regression Analysis with Network data}
\author{Jean Pouget-Abadie, Thibaut Horel}
\date{}

\begin{document}
\maketitle

\subsection*{The Network Inference problem}

The network inference problem concerns itself with learning the edges and the
edge weights of an unknown network. Each edge weight $\theta_e, e\in E$ is a
parameter to be estimated. The information at our disposal is the result of a
cascade process on the network. Here, we will focus on the Generalized Linear
Cascade (GLC) model introduced in~\cite{} presented below.

\paragraph{The GLC model}

Consider a graph $\mathcal{G} = (N, E)$, where nodes $N$ can be one of three
possible states: susceptible, infected, and immune.  Let $x^t$ be the indicator
variable of ``infected nodes'' at time step $t$.  A \emph{generalized linear
cascade model} is a cascade model such that for each ``susceptible'' node $i$ at
time step $t$, the probability of $i$ becoming ``infected'' at time step $t+1$
conditioned on $x^t$ is a Bernoulli variable of parameter $f(\theta_i
\cdot X^t)$:
\begin{equation}
    \mathbb{P}(x^{t+1}_i = 1|x^t) = f(\theta_i \cdot x^t)
\end{equation}
where $f: \mathbb{R} \rightarrow [0,1]$. There is no a priori rule for the
transition from and to the immune and susceptible state, but the most commonly
studied GLC model, known as the Independent Cascade model, assumes that all
infected nodes at time step $t$ are immune for all time steps $t' \geq t+1$.
Under this assumption, a cascade ends when no ``infected'' nodes remain. The
designer is also free to decide the initial state of graph $\mathcal{G}$ at the
beginning of every cascade process. For simplicity, we will suppose that, at the
start of each cascade, one node is chosen uniformly at random to be
``infected'', and that all other nodes are in the ``susceptible'' state.

\bigskip
The network inference problem can therefore be formalized as finding the support
of the edge weight vector $\theta$ from the observation of ${(x^c_t)}_{c\in
\mathcal{C}, t\in \mathcal{T}_C}$, where $\mathcal{C}$ indexes the cascade
number, and $\mathcal{T}_C$ the time step in that cascade. Prior work has
observed that the network inference problem is decomposable node-by-node
i.e.~solving for $\beta_{i} = \{\theta_{i,j} | j \in N\}$ for node $i$. This can
easily be formulated as the following maximum likelihood program:
$$ \theta_{i} \in \arg \max_\theta \log
\mathcal{L}_i(\theta_i\,|\,x^1,\ldots,x^n) = \frac{1}{|{\cal T}_i|} \sum_{t\in
{\cal T}_i } x_i^{t+1}\log f(\theta_i\cdot x^{t}) + (1 -
x_i^{t+1})\log\big(1-f(\theta_i \cdot x^t)\big)$$
where $\mathcal{T}_i = \{\text{node i is ``susceptible'' at time step } t\}$.
The reader will note the above problem boils down to fitting a generalized
linear model. In particular, if $f$ is the sigmoid function, we are performing
logistic regression:
$$
\begin{cases}
y_i^* = \theta_i \cdot x^t + \epsilon, \text{~where~} \epsilon\sim
Logistic(0, 1) \\
y_i = [y_i^* > 0] \text{~where~} y_i^t = x_i^{t+1}
\end{cases}
$$

\subsection*{Objectives}

\end{document}