\documentclass[10pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{fullpage, amsmath, amssymb, amsthm}
\usepackage{bbm}
\DeclareMathOperator*{\argmax}{argmax}
\title{Regression Analysis with Network Data}
\author{Thibaut Horel \and Jean Pouget-Abadie}
\begin{document}
\maketitle

\section{Background: Cascade Models and Network Inference}

The Network Inference Problem is that of learning the edges and the edge weights of an unknown network on a set of vertices $V$. Formally, the set of parameters to learn is a matrix $\theta\in\mathbb{R}^{V\times V}$, where $\theta_{u,v}$ is the weight associated with the pair of vertices $(u, v)$. The observations at our disposal for this learning task are samples drawn from a probabilistic cascade model. In this project, we focus on the Generalized Linear Cascade (GLC) model introduced in~\cite{pouget}, which we briefly present below.

\paragraph{GLC model.} The nodes in $V$ can be in one of three states: susceptible, infected, and immune. Let $x^t$ be the indicator vector of infected nodes at time step $t$ and let $S_t$ be the set of susceptible nodes at time step $t$. A \emph{generalized linear cascade model} is a discrete-time random process in which the conditional law of $x^{t+1}$ given $x^t$ takes the following form: for each susceptible node $i$ at time step $t$, the event that $i$ becomes infected at time step $t+1$, conditioned on $x^t$, is a Bernoulli random variable with parameter $f(\theta_i \cdot x^t)$:
\begin{equation}
\label{eq:model}
\mathbb{P}(x^{t+1}_i = 1\,|\,x^t) = f(\theta_i \cdot x^t), \quad i\in S_t
\end{equation}
for some function $f: \mathbb{R} \rightarrow [0,1]$, where $\theta_i = (\theta_{u, i})_{u\in V}$.
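The transition rule \eqref{eq:model} is easy to simulate. The following Python sketch (an illustration of the model, not code from any cited work; the function name \texttt{glc\_step} and the choice of the sigmoid for $f$ are ours) samples one step of the process:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glc_step(theta, x, susceptible, f=sigmoid, rng=None):
    """One transition of a generalized linear cascade.

    theta: (n, n) weight matrix; theta[u, i] is the weight of the pair (u, i).
    x: (n,) 0/1 indicator vector of infected nodes at time t.
    susceptible: (n,) boolean mask of the susceptible nodes S_t.
    Returns the 0/1 indicator of nodes newly infected at time t+1.
    """
    rng = np.random.default_rng() if rng is None else rng
    # p[i] = f(theta_i . x^t), with theta_i the i-th column of theta
    p = f(theta.T @ x)
    new_x = np.zeros_like(x)
    # each susceptible node becomes infected independently with prob. p[i]
    new_x[susceptible] = (rng.random(susceptible.sum()) < p[susceptible]).astype(int)
    return new_x
```

Iterating this step, and marking previously infected nodes as immune, yields an Independent Cascade trajectory as described below.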
There is no a priori rule governing transitions to and from the immune and susceptible states, but the most commonly studied GLC model, known as the Independent Cascade model \cite{Kempe:03}, assumes that all nodes infected at time step $t$ are immune at every time step $t' \geq t+1$. Under this assumption, a cascade ends when no infected nodes remain. The designer is also free to choose the initial state of the graph at the beginning of every cascade. For simplicity, we will assume that, at the start of each cascade, one node chosen uniformly at random is infected, and that all other nodes are susceptible.

\paragraph{Inference.} A cascade is the collection of vectors $x^t$ for all time steps until the cascade ends. Given a sample of independent cascades, the parameters $\theta$ can be estimated by maximum likelihood. Prior work has observed that the MLE problem decomposes node by node, \emph{i.e.}, $\theta_i$ can be learned independently for each node $i$ by solving the following optimization program:
\begin{equation}
\label{eq:mle}
\hat{\theta}_{i} \in \argmax_{\theta_i} \log \mathcal{L}_i(\theta_i) = \sum_{t:i\in S_t} x_i^{t+1}\log f(\theta_i\cdot x^{t}) + (1 - x_i^{t+1})\log\big(1-f(\theta_i \cdot x^t)\big)
\end{equation}
The reader will observe that this problem amounts to fitting a generalized linear model. It is important to note that even though the cascades are independent, the vectors $x^t$ observed within a cascade are not independent, since they follow the Markov process described in \eqref{eq:model}.
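To make the GLM reduction concrete, here is a minimal sketch of solving \eqref{eq:mle} for one node when $f$ is the sigmoid, in which case the program is exactly a logistic regression. The helper name \texttt{fit\_node}, the plain gradient-ascent solver, and its step-size settings are our own illustrative choices; the rows of \texttt{X} are the vectors $x^t$ observed while node $i$ was susceptible, and \texttt{y} holds the corresponding outcomes $x_i^{t+1}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_node(X, y, n_iter=1000, lr=1.0):
    """MLE of theta_i by gradient ascent on the log-likelihood (2).

    X: (T, n) array whose rows are the x^t observed while node i
       was susceptible; y: (T,) array of the outcomes x_i^{t+1}.
    With f = sigmoid the log-likelihood is concave, so gradient
    ascent converges to the maximizer.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # gradient of the logistic log-likelihood: X^T (y - sigmoid(X theta))
        grad = X.T @ (y - sigmoid(X @ theta))
        theta += lr * grad / len(y)  # averaged-gradient step
    return theta
```

On synthetic data generated from \eqref{eq:model} with a known $\theta_i$, this recovers the parameters up to the usual $O(1/\sqrt{T})$ statistical error, which is the behavior the experiments below are meant to probe.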
In the particular case where $f$ is the sigmoid function, we are performing logistic regression:
\begin{displaymath}
\begin{cases}
y^t = \theta_i \cdot x^t + \varepsilon^t &\text{where } \varepsilon^t\sim \text{Logistic}(0, 1)\\
x_i^{t+1} = \mathbf{1}\{y^t > 0\}
\end{cases}
\end{displaymath}

\section{Problem Statement}

A growing body of work (see for example \cite{Netrapalli:2012, Daneshmand:2014, pouget}) has focused on solving the Network Inference Problem under different cascade models and on obtaining estimators with close-to-optimal convergence rates. However, these works often rely on hard-to-interpret assumptions which hide subtle properties of the probabilistic model \eqref{eq:model}. The overarching goal of this project is to analyze the fundamental statistical properties of observations drawn from model \eqref{eq:model} and of the MLE estimator \eqref{eq:mle}. In particular, we are interested in the following questions:
\begin{itemize}
\item Is the model \eqref{eq:model} identifiable, and is the MLE estimator \eqref{eq:mle} consistent? Are there networks which are fundamentally ambiguous, for which the Network Inference Problem is unsolvable?
\item Can a Bayesian approach be used to estimate the parameters? Can the posterior predictive distribution yield confidence intervals for the edge parameters, and can these intervals be validated by bootstrapping?
\item Are some networks harder to learn than others? Can the variance of the MLE estimator be related to graph-theoretic properties of the network?
\item Do the previous observations match the asymptotic theory, given by the Fisher information matrix:
\begin{displaymath}
\hat{\theta}_i \sim \mathcal{N}\big(\theta_i, I(\theta_i)^{-1}\big) \quad \text{where}\; I(\theta_i) = - \mathbb{E}\left[\frac{\partial^2\log \mathcal{L}_i}{\partial \theta_i\,\partial \theta_i^\top} \right]
\end{displaymath}
\item Are there networks in which the Fisher information matrix is singular?
What happens to the estimation of $\theta_i$ in this case?
\item Is the estimator \eqref{eq:mle} robust to the choice of $f$? What if the observations are generated with $f_1$, but \eqref{eq:mle} is solved with $f_2\neq f_1$? Is there a regularization scheme which can mitigate the resulting bias or lack of convergence in the estimated parameters?
\end{itemize}

\paragraph{Roadmap.} We plan to answer the above questions with a mixture of theoretical and simulation-based approaches. We expect some of these questions to be hard from a theoretical standpoint. In those cases, the difficulty can be mitigated by relying on simulations or by focusing on special cases: toy networks (star graph, cycle, complete graph, etc.) or simpler cascade models (the Voter Model, for example). We have worked together in the past on related questions \cite{pouget} and plan to keep our contributions to the project balanced, without a clear a priori separation of tasks.

\bibliography{sparse}
\bibliographystyle{abbrv}
\end{document}