\documentclass[8pt]{article}
\usepackage{fullpage, amsmath, amssymb, amsthm}
\title{Regression Analysis with Network Data}
\author{Jean Pouget-Abadie, Thibaut Horel}
\date{}
\begin{document}
\maketitle

\subsection*{The Network Inference Problem}
The network inference problem is that of learning the edges and edge weights of an unknown network: each edge weight $\theta_e$, $e \in E$, is a parameter to be estimated. The information at our disposal is the outcome of a cascade process on the network. Here, we focus on the Generalized Linear Cascade (GLC) model introduced in~\cite{} and presented below.

\paragraph{The GLC model} Consider a graph $\mathcal{G} = (N, E)$ in which each node is in one of three possible states: susceptible, infected, or immune. Let $x^t$ be the indicator vector of the infected nodes at time step $t$. A \emph{generalized linear cascade model} is a cascade model such that, for each susceptible node $i$ at time step $t$, the event that $i$ becomes infected at time step $t+1$, conditioned on $x^t$, is a Bernoulli variable with parameter $f(\theta_i \cdot x^t)$:
\begin{equation}
\mathbb{P}(x^{t+1}_i = 1 \,|\, x^t) = f(\theta_i \cdot x^t),
\end{equation}
where $f: \mathbb{R} \rightarrow [0,1]$ and $\theta_i = (\theta_{i,j})_{j \in N}$ denotes the weight vector of the edges incoming to node $i$. The model places no a priori restriction on transitions to and from the immune state, but the most commonly studied GLC model, the Independent Cascade model, assumes that every node infected at time step $t$ is immune at all time steps $t' \geq t+1$. Under this assumption, a cascade ends when no infected node remains. The designer is also free to choose the initial state of $\mathcal{G}$ at the beginning of each cascade. For simplicity, we will assume that, at the start of each cascade, one node chosen uniformly at random is infected and all other nodes are susceptible.

\bigskip

The network inference problem can therefore be formalized as recovering the support of the edge weight vector $\theta$ from the observations ${(x^t_c)}_{c \in \mathcal{C},\, t \in \mathcal{T}_c}$, where $\mathcal{C}$ indexes the cascades and $\mathcal{T}_c$ the time steps of cascade $c$. Prior work has observed that the network inference problem decomposes node by node, i.e., one can solve separately for each weight vector $\theta_i$. For node $i$, this is easily formulated as the following maximum likelihood program:
$$
\hat\theta_{i} \in \arg\max_{\theta_i}\, \log \mathcal{L}_i(\theta_i \,|\, x^1, \ldots, x^n) = \arg\max_{\theta_i}\, \frac{1}{|\mathcal{T}_i|} \sum_{t \in \mathcal{T}_i} \Big[ x_i^{t+1} \log f(\theta_i \cdot x^{t}) + (1 - x_i^{t+1}) \log\big(1 - f(\theta_i \cdot x^t)\big) \Big],
$$
where $\mathcal{T}_i = \{t : \text{node } i \text{ is susceptible at time step } t\}$. Note that the above problem boils down to fitting a generalized linear model. In particular, if $f$ is the sigmoid function, we are performing logistic regression, with the latent-variable formulation
$$
\begin{cases}
y_i^{t,*} = \theta_i \cdot x^t + \epsilon, \text{~where~} \epsilon \sim \mathrm{Logistic}(0, 1), \\
x_i^{t+1} = \mathbf{1}\big[ y_i^{t,*} > 0 \big].
\end{cases}
$$
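\medskip

To make the generative process concrete, the following sketch simulates a single Independent Cascade under a GLC model with a sigmoid link. This is a minimal illustration under our own conventions, not the reference implementation of~\cite{}: the names (\texttt{simulate\_cascade}, \texttt{theta}, \ldots) are ours, and \texttt{theta} is assumed to be an $n \times n$ matrix whose $i$-th row is $\theta_i$. Note that the sigmoid satisfies $f(0) = 1/2$, so a realistic choice of \texttt{theta} would include a strongly negative bias term; we gloss over this here.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_cascade(theta, rng):
    """One Independent Cascade; theta[i] = weights of the edges incoming to i."""
    n = theta.shape[0]
    state = np.zeros(n, dtype=int)       # 0: susceptible, 1: infected, 2: immune
    state[rng.integers(n)] = 1           # one seed node, chosen uniformly at random
    snapshots = []
    while (state == 1).any():
        x = (state == 1).astype(float)   # indicator vector x^t of infected nodes
        snapshots.append(x)
        p = sigmoid(theta @ x)           # P(x_i^{t+1} = 1 | x^t) = f(theta_i . x^t)
        newly_infected = (rng.random(n) < p) & (state == 0)
        state[state == 1] = 2            # infected nodes become immune (IC assumption)
        state[newly_infected] = 1
    snapshots.append(np.zeros(n))        # final snapshot: no infected node remains
    return snapshots                     # the sequence x^0, x^1, ..., x^T

# Usage: rng = np.random.default_rng(0); xs = simulate_cascade(theta, rng)
\end{verbatim}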
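Since the per-node problem with a sigmoid link is exactly logistic regression, $\hat\theta_i$ can be computed with any off-the-shelf GLM solver. The sketch below, operating on cascades in the format produced above, is one way to do it with \texttt{scikit-learn}; the helper \texttt{fit\_node} is hypothetical, and we assume node $i$ is observed both becoming infected and staying susceptible at least once, since a single-class fit is ill-posed.

\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_node(i, cascades):
    """Estimate theta_i from a list of cascades, each a list x^0, ..., x^T."""
    X, y = [], []
    for xs in cascades:
        infected = False
        for x_t, x_next in zip(xs, xs[1:]):
            infected = infected or x_t[i] == 1
            if not infected:              # keep only steps where i is susceptible
                X.append(x_t)             # regressors: indicators of infected nodes
                y.append(int(x_next[i]))  # response: did i become infected?
    # penalty=None: plain unregularized MLE ('none' on older scikit-learn versions)
    model = LogisticRegression(penalty=None, fit_intercept=False)
    return model.fit(np.array(X), np.array(y)).coef_.ravel()  # estimate of theta_i
\end{verbatim}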
\subsection*{Objectives}
\begin{itemize}
\item Try a Bayesian approach to estimating these parameters: use the posterior distribution to obtain credible intervals for the edge parameters, and validate them against bootstrapped confidence intervals (see the bootstrap sketch below). How does this perform on different networks? Can you intuitively link node-level or graph-level properties to the variance of the estimated parameters?
\item Do the previous observations agree with the theoretical result given by the Fisher information matrix (see the information-matrix sketch below),
$$\hat\theta_i \sim \mathcal{N}\big(\theta_i, I(\theta_i)^{-1}\big), \quad \text{where } I(\theta_i) = -\,\mathbb{E}\left[\frac{\partial^2 \log \mathcal{L}_i}{\partial \theta_i\, \partial \theta_i^\top}\right]?
$$
\item Are there networks for which the Fisher information matrix is singular? What happens to the estimation of $\theta_i$ in this case?
\end{itemize}
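As one concrete route for the first objective, the sketch below computes nonparametric bootstrap confidence intervals by resampling entire cascades with replacement and refitting the estimator each time. It reuses the hypothetical \texttt{fit\_node} helper above; the number of replicates and the percentile construction are common defaults, not prescriptions from~\cite{}.

\begin{verbatim}
import numpy as np

def bootstrap_ci(i, cascades, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for theta_i, resampling whole cascades."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(len(cascades), size=len(cascades))
        estimates.append(fit_node(i, [cascades[k] for k in idx]))
    estimates = np.stack(estimates)
    lower = np.percentile(estimates, 100 * alpha / 2, axis=0)
    upper = np.percentile(estimates, 100 * (1 - alpha / 2), axis=0)
    return lower, upper   # coordinate-wise (1 - alpha) intervals for theta_i
\end{verbatim}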
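For the last two questions, recall that with a sigmoid link the observed information for node $i$ has the closed form $I(\theta_i) = X^\top W X$, where the rows of $X$ are the snapshots $x^t$ for $t \in \mathcal{T}_i$ and $W = \mathrm{diag}\big(p_t(1 - p_t)\big)$ with $p_t = f(\theta_i \cdot x^t)$. The sketch below (our own helper, not from~\cite{}) computes it; a rank-deficient $I(\theta_i)$ arises, for instance, when two candidate parents of node $i$ are always infected simultaneously in the observed snapshots, leaving the corresponding coordinates of $\theta_i$ unidentified.

\begin{verbatim}
import numpy as np

def observed_information(theta_i, X):
    """I(theta_i) = X^T W X for the logistic model; rows of X are the x^t."""
    p = 1.0 / (1.0 + np.exp(-X @ theta_i))   # p_t = f(theta_i . x^t)
    w = p * (1.0 - p)
    return X.T @ (w[:, None] * X)

# Asymptotic standard errors, valid only when I is nonsingular:
#   se = np.sqrt(np.diag(np.linalg.inv(I)))
# If np.linalg.matrix_rank(I) < len(theta_i), some coordinates of theta_i are
# not identified by the data and the normal approximation breaks down.
\end{verbatim}

\end{document}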