\documentclass[10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{fullpage, amsmath, amssymb, amsthm}
\usepackage{bbm}
\DeclareMathOperator*{\argmax}{argmax}
\title{Regression Analysis with Network Data}
\author{Thibaut Horel \and Jean Pouget-Abadie}
\begin{document}
\maketitle
\section{Background: Cascade Models and Network Inference}
The Network Inference Problem consists in learning the edges and the edge
weights of an unknown network on a set of vertices $V$. Formally, the set of
parameters to learn is a matrix $\theta\in\mathbb{R}^{V\times V}$ where
$\theta_{u,v}$ is the weight corresponding to the pair of vertices $(u, v)$.
The observations at our disposal for this learning task are samples drawn from
a probabilistic cascade model. In this project, we will focus on the
Generalized Linear Cascade (GLC) model introduced in~\cite{pouget}, which we
briefly present below.
\paragraph{GLC model.} The nodes in $V$ can be in one of three possible states:
susceptible, infected, and immune. Let $x^t\in\{0,1\}^V$ be the indicator
vector of ``infected'' nodes at time step $t$ and $S_t$ be the set of
susceptible nodes at time step $t$. A \emph{generalized linear cascade model}
is a discrete-time random process in which the conditional law of $x^{t+1}$
given $x^t$ takes the following form: for each ``susceptible'' node $i$ at time
step $t$, the indicator $x^{t+1}_i$ of $i$ becoming ``infected'' at time step
$t+1$ is, conditioned on $x^t$, a Bernoulli variable with parameter
$f(\theta_i \cdot x^t)$:
\begin{equation}
\label{eq:model}
\mathbb{P}(x^{t+1}_i = 1\,|\,x^t) = f(\theta_i \cdot x^t),
\quad i\in S_t
\end{equation}
for some function $f: \mathbb{R} \rightarrow [0,1]$ and where $\theta_i
= (\theta_{u, i})_{u\in V}$. There is no a priori rule governing transitions to
and from the immune state, but the most commonly studied GLC model, known as
the Independent Cascade model~\cite{Kempe:03}, assumes that every node infected
at time step $t$ is immune at all time steps $t' \geq t+1$. Under this
assumption, a cascade ends when no ``infected'' nodes remain. The designer is
also free to choose the initial state of the network at the beginning of each
cascade. For simplicity, we will suppose that, at the start of each cascade,
one node is chosen uniformly at random to be ``infected'', and that all other
nodes are ``susceptible''.
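As an illustration, the following sketch (in Python; the function and variable
names are our own, and the choice $f(z) = 1/(1+e^{-z})$ is only one possible
instance of the model) simulates a single cascade under the Independent
Cascade immunity rule and the seeding assumption described above.
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_cascade(theta, rng):
    """Simulate one Independent Cascade under the GLC model with f = sigmoid.

    theta[u, v] is the weight of the pair (u, v); returns the list of
    infection indicator vectors x^0, x^1, ... until the cascade ends.
    """
    n = theta.shape[0]
    susceptible = np.ones(n, dtype=bool)
    x = np.zeros(n, dtype=int)
    x[rng.integers(n)] = 1              # one seed chosen uniformly at random
    susceptible &= (x == 0)
    history = [x.copy()]
    while x.any():                      # ends when no infected nodes remain
        p = sigmoid(x @ theta)          # p[i] = f(theta_i . x^t)
        draws = rng.random(n) < p
        new_x = np.zeros(n, dtype=int)
        new_x[susceptible & draws] = 1  # susceptible nodes become infected
        susceptible &= (new_x == 0)     # infected nodes are immune afterwards
        x = new_x
        history.append(x.copy())
    return history

# Example usage (hypothetical parameters):
# rng = np.random.default_rng(0)
# theta = rng.normal(scale=1.0, size=(20, 20))
# cascades = [simulate_cascade(theta, rng) for _ in range(500)]
\end{verbatim}
Note that the sigmoid gives $f(0) = 1/2$, so a susceptible node with no
infected neighbours flips a fair coin at every step; adding a negative
intercept to $\theta_i\cdot x^t$, or choosing an $f$ with $f(0)=0$, avoids
this behaviour.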
\paragraph{Inference.} A cascade is the collection of vectors $x^t$ for all
time steps until the cascade ends. Given a sample of independent cascades, the
parameters $\theta$ can be estimated using maximum likelihood estimation. Prior
work has observed that the MLE problem decomposes node by node, \emph{i.e.},
$\theta_i$ can be learned independently for each node $i$ by solving the
following optimization program:
\begin{equation}
\label{eq:mle}
\hat{\theta}_{i} \in \argmax_{\theta_i}
\log \mathcal{L}_i(\theta_i), \quad
\log \mathcal{L}_i(\theta_i) =
\sum_{t:\,i\in S_t} x_i^{t+1}\log f(\theta_i\cdot x^{t})
+ \big(1 - x_i^{t+1}\big)\log\big(1-f(\theta_i \cdot x^t)\big).
\end{equation}
The reader will observe that the above problem reduces to fitting a generalized
linear model. It is important to note that even though the cascades are
independent, the vectors $x^t$ observed within a cascade are not independent
since they follow the Markov process described in \eqref{eq:model}. In the
particular case where $f$ is the sigmoid function, we are performing
logistic regression:
\begin{displaymath}
\begin{cases}
y^t = \theta_i \cdot x^t + \varepsilon^t
&\text{where } \varepsilon^t\sim \text{Logistic}(0, 1)\\
x_i^{t+1} = \mathbf{1}\{y^t > 0\}
&\text{for every $t$ such that } i\in S_t
\end{cases}
\end{displaymath}
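Concretely, when $f$ is the sigmoid the per-node program~\eqref{eq:mle} is
a concave maximization problem and can be solved with any off-the-shelf
optimizer. The sketch below (reusing \texttt{simulate\_cascade} from the
previous sketch; the helper name \texttt{fit\_node\_mle} is ours) builds the
design matrix from the time steps at which node $i$ is susceptible and
minimizes the negative log-likelihood.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

def fit_node_mle(i, cascades):
    """Estimate theta_i by maximizing the node-i log-likelihood (Eq. 2),
    with f = sigmoid.  `cascades` is a list of histories as returned by
    simulate_cascade.
    """
    X, y = [], []
    for history in cascades:
        ever_infected = np.zeros_like(history[0], dtype=bool)
        for x_t, x_next in zip(history[:-1], history[1:]):
            ever_infected |= x_t.astype(bool)
            if not ever_infected[i]:   # node i is still susceptible at time t
                X.append(x_t)
                y.append(x_next[i])
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    def neg_log_lik(theta_i):
        z = X @ theta_i
        # -log f(z) = log(1 + e^-z)  and  -log(1 - f(z)) = log(1 + e^z)
        return np.sum(y * np.logaddexp(0.0, -z)
                      + (1.0 - y) * np.logaddexp(0.0, z))

    res = minimize(neg_log_lik, np.zeros(X.shape[1]), method="L-BFGS-B")
    return res.x   # estimate of the column (theta_{u,i})_{u in V}
\end{verbatim}
Since no gradient is supplied, \texttt{L-BFGS-B} falls back on finite
differences; passing the analytic gradient would be faster but is omitted to
keep the sketch short.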
\section{Problem Statement}
A growing body of work (see for example \cite{Netrapalli:2012,
Daneshmand:2014, pouget}) has focused on solving the Network Inference Problem
under different cascade models and on obtaining estimators with
close-to-optimal convergence rates. However, these works often rely on
hard-to-interpret assumptions which hide subtle properties of the
probabilistic model \eqref{eq:model}. The overarching goal of this project is
to analyze the fundamental statistical properties of observations coming from
model \eqref{eq:model} and of the MLE estimator \eqref{eq:mle}. In particular,
we are interested in the following questions:
\begin{itemize}
\item Is the model \eqref{eq:model} identifiable and is the MLE estimator
\eqref{eq:mle} consistent? Are there networks which are fundamentally
ambiguous and for which the Network Inference Problem is unsolvable?
\item Can a Bayesian approach be used to estimate the parameters? The
posterior distribution would yield credible intervals for the edge
parameters, which can then be validated against bootstrap confidence
intervals.
\item Are there networks which are harder to learn than others? Can the
variance of the MLE estimator be related to certain graph theoretic
properties of the network?
\item Do the previous observations match the asymptotic prediction given
by the Fisher information matrix:
\begin{displaymath}
\hat{\theta}_i \sim \mathcal{N}\big(\theta_i, I(\theta_i)^{-1}\big)
\quad \text{where}\;
I(\theta_i) = -\,\mathbb{E}\left[\frac{\partial^2\log \mathcal{L}_i}
{\partial \theta_i\,\partial \theta_i^\top}\right]
\end{displaymath}
\item Are there networks in which the Fisher information matrix is singular?
What happens to the estimation of $\theta_i$ in this case? (A numerical
sketch for examining the last two questions appears after this list.)
\item Is the estimator \eqref{eq:mle} robust to the choice of $f$? What if the
observations are generated with $f_1$, but \eqref{eq:mle} is solved with
$f_2\neq f_1$? Is there a regularization scheme which can mitigate any bias
or lack of convergence in the estimated parameters?
\end{itemize}
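To explore the last two questions numerically, the following sketch (again
with the sigmoid choice of $f$ and hypothetical helper names) computes the
observed information matrix of the node-$i$ log-likelihood at a given
parameter value and flags near-singularity through its smallest eigenvalue.
The matrix \texttt{X} is the design matrix built inside
\texttt{fit\_node\_mle} above, whose rows are the vectors $x^t$ for the time
steps at which node $i$ is susceptible.
\begin{verbatim}
import numpy as np

def observed_information(theta_i, X):
    """Observed information of the node-i log-likelihood at theta_i
    (sigmoid f): -d^2 log L_i / d theta_i^2 = X^T diag(p (1 - p)) X."""
    p = 1.0 / (1.0 + np.exp(-X @ theta_i))
    w = p * (1.0 - p)
    return (X * w[:, None]).T @ X

def is_near_singular(info, tol=1e-8):
    """Flag (near-)singular information matrices via the smallest eigenvalue."""
    eigvals = np.linalg.eigvalsh(info)   # info is symmetric
    return eigvals.min() < tol * max(eigvals.max(), 1.0)
\end{verbatim}
Under the asymptotic approximation above, the diagonal of the inverse of this
matrix gives approximate variances for the components of $\hat{\theta}_i$,
which can then be compared with simulation-based and bootstrap estimates.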
\paragraph{Roadmap.} We plan to answer the above questions through a mixture
of theoretical and simulation-based approaches. We expect some of these
questions to be hard from a theoretical standpoint. In those cases, the
difficulty can be mitigated by relying on simulations or by focusing on
specific cases: toy networks (star graph, cycle, complete graph, etc.) or
simpler cascade models (the Voter Model, for example). We have worked together
in the past on related questions \cite{pouget} and plan to keep our
contributions to the project balanced, without a clear a priori separation of
tasks.
\bibliography{sparse}
\bibliographystyle{abbrv}
\end{document}