\documentclass[10pt]{article}
\usepackage{fullpage, amsmath, amssymb, amsthm}

\title{Regression Analysis with Network Data}
\author{Jean Pouget-Abadie, Thibaut Horel}
\date{}

\begin{document}
\maketitle

\subsection*{The Network Inference Problem}

The network inference problem consists in learning the edges and the edge weights of an unknown network. Each edge weight $\theta_e$, $e \in E$, is a parameter to be estimated. The information at our disposal is the outcome of a cascade process running on the network. Here, we focus on the Generalized Linear Cascade (GLC) model introduced in~\cite{} and presented below.

\paragraph{The GLC model}
Consider a graph $\mathcal{G} = (N, E)$, whose nodes can be in one of three possible states: susceptible, infected, and immune. Let $x^t \in \{0,1\}^N$ be the indicator vector of the infected nodes at time step $t$. A \emph{generalized linear cascade model} is a cascade model such that, conditioned on $x^t$, the state of each ``susceptible'' node $i$ at time step $t+1$ is a Bernoulli random variable with parameter $f(\theta_i \cdot x^t)$:
\begin{equation}
\mathbb{P}(x^{t+1}_i = 1 \,|\, x^t) = f(\theta_i \cdot x^t)
\end{equation}
where $f: \mathbb{R} \rightarrow [0,1]$ and $\theta_i$ denotes the vector of weights of the edges incident to node $i$.

There is no a priori rule governing transitions to and from the ``immune'' and ``susceptible'' states, but the most commonly studied GLC model, known as the Independent Cascade model, assumes that all nodes infected at time step $t$ are immune at every time step $t' \geq t+1$. Under this assumption, a cascade ends when no ``infected'' node remains. The designer is also free to choose the initial state of the graph $\mathcal{G}$ at the beginning of every cascade. For simplicity, we will assume that, at the start of each cascade, one node chosen uniformly at random is ``infected'' and all other nodes are ``susceptible''.
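For concreteness, the Independent Cascade model fits this definition (the edge probabilities $p_{j,i}$ below are introduced only for this example and are not part of the notation above). If every node $j$ infected at time step $t$ independently infects a susceptible neighbor $i$ at time step $t+1$ with probability $p_{j,i}$, then $i$ escapes infection only if all attempts fail, so that
$$
\mathbb{P}(x_i^{t+1} = 1 \,|\, x^t) \;=\; 1 - \prod_{j \,:\, x_j^t = 1} (1 - p_{j,i}) \;=\; 1 - \exp\Big(-\sum_{j \in N} \theta_{i,j}\, x_j^t\Big) \;=\; 1 - e^{-\theta_i \cdot x^t},
$$
after the reparametrization $\theta_{i,j} = -\ln(1 - p_{j,i})$. The Independent Cascade model is therefore a generalized linear cascade with $f(z) = 1 - e^{-z}$.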
\bigskip

The network inference problem can therefore be formalized as recovering the support of the edge weight vector $\theta$ from the observations ${(x^{c,t})}_{c \in \mathcal{C},\, t \in \mathcal{T}_c}$, where $\mathcal{C}$ indexes the cascades and $\mathcal{T}_c$ the time steps within cascade $c$. Prior work has observed that the network inference problem decomposes node by node, i.e.~it suffices to solve for $\theta_i = (\theta_{i,j})_{j \in N}$ separately for each node $i$. This can easily be formulated as the following maximum likelihood program:
$$
\hat{\theta}_i \in \arg\max_{\theta_i}\; \log \mathcal{L}_i(\theta_i \,|\, x^1,\ldots,x^n) = \frac{1}{|\mathcal{T}_i|} \sum_{t \in \mathcal{T}_i} x_i^{t+1} \log f(\theta_i \cdot x^t) + \big(1 - x_i^{t+1}\big) \log\big(1 - f(\theta_i \cdot x^t)\big)
$$
where $\mathcal{T}_i = \{t \,:\, \text{node } i \text{ is ``susceptible'' at time step } t\}$.
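As a quick sketch of why this program is tractable (the differentiability and log-concavity assumptions on $f$ are ours, made only for this remark), suppose that $f$ is differentiable and that $\log f$ and $\log(1-f)$ are concave. Each summand above is then a concave function of $\theta_i$, being the composition of a concave function with the affine map $\theta_i \mapsto \theta_i \cdot x^t$, so the maximum likelihood program is a concave maximization problem. Its gradient follows from the chain rule:
$$
\nabla_{\theta_i} \log \mathcal{L}_i(\theta_i \,|\, x^1,\ldots,x^n) = \frac{1}{|\mathcal{T}_i|} \sum_{t \in \mathcal{T}_i} \left( \frac{x_i^{t+1}}{f(\theta_i \cdot x^t)} - \frac{1 - x_i^{t+1}}{1 - f(\theta_i \cdot x^t)} \right) f'(\theta_i \cdot x^t)\, x^t,
$$
so the per-node problem can be handed to any standard first-order method for convex optimization.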
The reader will note that the above problem boils down to fitting a generalized linear model. In particular, if $f$ is the sigmoid function $f(z) = 1/(1 + e^{-z})$, we are performing logistic regression:
$$
\begin{cases}
y_i^{t,*} = \theta_i \cdot x^t + \epsilon, & \text{where } \epsilon \sim \mathrm{Logistic}(0, 1)\\
y_i^{t} = \big[\, y_i^{t,*} > 0 \,\big], & \text{where } y_i^t = x_i^{t+1}
\end{cases}
$$
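To spell out this equivalence (a standard computation, reproduced here only for convenience), the latent-variable formulation above induces exactly the Bernoulli transition probability of the GLC model with $f$ the sigmoid:
$$
\mathbb{P}(y_i^t = 1 \,|\, x^t) = \mathbb{P}(\theta_i \cdot x^t + \epsilon > 0) = \mathbb{P}(\epsilon > -\theta_i \cdot x^t) = \frac{1}{1 + e^{-\theta_i \cdot x^t}} = f(\theta_i \cdot x^t),
$$
using the fact that the $\mathrm{Logistic}(0,1)$ distribution is symmetric about $0$ with cumulative distribution function $z \mapsto 1/(1 + e^{-z})$. In this case the maximum likelihood program above is the usual (concave) logistic regression log-likelihood, and substituting $f' = f(1-f)$ into the gradient above yields the familiar form $\frac{1}{|\mathcal{T}_i|} \sum_{t \in \mathcal{T}_i} \big(x_i^{t+1} - f(\theta_i \cdot x^t)\big)\, x^t$.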
\subsection*{Objectives}

\end{document}