| author | Thibaut Horel <thibaut.horel@gmail.com> | 2015-10-14 21:38:46 -0400 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2015-10-14 21:38:46 -0400 |
| commit | bcba8817e73e7e9f779eb50a8a4df4b204c64d26 (patch) | |
| tree | 143f3ac0309a95c62972cd693cdae4bf98c39f47 /finale/project_proposal.tex | |
| parent | 67607518a47aed0ee63009695eb81979ab05ca92 (diff) | |
| download | cascades-bcba8817e73e7e9f779eb50a8a4df4b204c64d26.tar.gz | |
Cleaning up the project proposal
Diffstat (limited to 'finale/project_proposal.tex')
| -rw-r--r-- | finale/project_proposal.tex | 164 |
1 file changed, 95 insertions, 69 deletions
diff --git a/finale/project_proposal.tex b/finale/project_proposal.tex
index 5e0d21c..c8607e6 100644
--- a/finale/project_proposal.tex
+++ b/finale/project_proposal.tex
@@ -1,96 +1,122 @@
-\documentclass[8pt]{article}
+\documentclass[10pt]{article}
+\usepackage[utf8x]{inputenc}
 \usepackage{fullpage, amsmath, amssymb, amsthm}
+\usepackage{bbm}
+\DeclareMathOperator*{\argmax}{argmax}
-\title{Regression Analysis with Network data}
-\author{Jean Pouget-Abadie, Thibaut Horel}
-\date{}
+\title{Regression Analysis with Network Data}
+\author{Thibaut Horel \and Jean Pouget-Abadie}
 \begin{document}
 \maketitle
-\subsection*{The Network Inference problem}
+\section{Background: Cascade Models and Network Inference}
-The network inference problem concerns itself with learning the edges and the
-edge weights of an unknown network. Each edge weight $\theta_e, e\in E$ is a
-parameter to be estimated. The information at our disposal is the result of a
-cascade process on the network. Here, we will focus on the Generalized Linear
-Cascade (GLC) model introduced in~\cite{paper} presented below.
+The Network Inference Problem concerns itself with learning the edges and the
+edge weights of an unknown network on the set of vertices $V$. Formally, the
+set of parameters to learn is a matrix $\theta\in\mathbb{R}^{V\times V}$ where
+$\theta_{u,v}$ is the weight corresponding to the pair of vertices $(u, v)$.
+The observations at our disposal for this learning task are samples drawn from
+a probabilistic cascade model. In this project, we will focus on the
+Generalized Linear Cascade (GLC) model introduced in~\cite{pouget}, which we
+briefly present below.
-\paragraph{The GLC model}
-
-Consider a graph $\mathcal{G} = (N, E)$, where nodes $N$ can be one of three
-possible states: susceptible, infected, and immune. Let $x^t$ be the indicator
-variable of ``infected nodes'' at time step $t$. A \emph{generalized linear
-cascade model} is a cascade model such that for each ``susceptible'' node $i$ at
-time step $t$, the probability of $i$ becoming ``infected'' at time step $t+1$
-conditioned on $x^t$ is a Bernoulli variable of parameter $f(\theta_i
-\cdot X^t)$:
+\paragraph{GLC model.} The nodes in $V$ can be in one of three possible states:
+susceptible, infected, and immune. Let $x^t$ be the indicator variable of
+``infected nodes'' at time step $t$ and $S_t$ be the set of susceptible nodes
+at time step $t$. A \emph{generalized linear cascade model} is a discrete-time
+random process in which the conditional law of $x^{t+1}$ given $x^t$ takes the
+following form: for each ``susceptible'' node $i$ at time step $t$, the
+probability of $i$ becoming ``infected'' at time step $t+1$ conditioned on
+$x^t$ is a Bernoulli variable of parameter $f(\theta_i \cdot x^t)$:
 \begin{equation}
-    \mathbb{P}(x^{t+1}_i = 1|x^t) = f(\theta_i \cdot x^t)
+    \label{eq:model}
+    \mathbb{P}(x^{t+1}_i = 1\,|\,x^t) = f(\theta_i \cdot x^t),
+    \quad i\in S_t
 \end{equation}
-where $f: \mathbb{R} \rightarrow [0,1]$. There is no a priori rule for the
-transition from and to the immune and susceptible state, but the most commonly
-studied GLC model, known as the Independent Cascade model, assumes that all
+for some function $f: \mathbb{R} \rightarrow [0,1]$ and where $\theta_i
+= (\theta_{u, i})_{u\in V}$. There is no a priori rule for the transitions to
+and from the immune and susceptible states, but the most commonly studied GLC
+model, known as the Independent Cascade model \cite{Kempe:03}, assumes that all
 infected nodes at time step $t$ are immune for all time steps $t' \geq t+1$.
 Under this assumption, a cascade ends when no ``infected'' nodes remain. The
 designer is also free to decide the initial state of graph $\mathcal{G}$ at the
-beginning of every cascade process. For simplicity, we will suppose that, at the
-start of each cascade, one node is chosen uniformly at random to be
+beginning of every cascade process. For simplicity, we will suppose that, at
+the start of each cascade, one node is chosen uniformly at random to be
 ``infected'', and that all other nodes are in the ``susceptible'' state.
-\bigskip
-The network inference problem can therefore be formalized as finding the support
-of the edge weight vector $\theta$ from the observation of ${(x^c_t)}_{c\in
-\mathcal{C}, t\in \mathcal{T}_C}$, where $\mathcal{C}$ indexes the cascade
-number, and $\mathcal{T}_C$ the time step in that cascade. Prior work has
-observed that the network inference problem is decomposable node-by-node
-i.e.~solving for $\beta_{i} = \{\theta_{i,j} | j \in N\}$ for node $i$. This can
-easily be formulated as the following maximum likelihood program:
-$$ \theta_{i} \in \arg \max_\theta \log
-\mathcal{L}_i(\theta_i\,|\,x^1,\ldots,x^n) = \frac{1}{|{\cal T}_i|} \sum_{t\in
-{\cal T}_i } x_i^{t+1}\log f(\theta_i\cdot x^{t}) + (1 -
-x_i^{t+1})\log\big(1-f(\theta_i \cdot x^t)\big)$$
-where $\mathcal{T}_i = \{\text{node i is ``susceptible'' at time step } t\}$.
-The reader will note the above problem boils down to fitting a generalized
-linear model. In particular, if $f$ is the sigmoid function, we are performing
+\paragraph{Inference.} A cascade is the collection of vectors $x^t$ for all
+time steps until the cascade ends. Given a sample of independent cascades, the
+parameters $\theta$ can be estimated using maximum likelihood estimation. Prior
+work has observed that the MLE problem can be decomposed node-by-node,
+\emph{i.e.}, $\theta_i$ can be learned independently for each node $i$ by
+solving the following optimization program:
+\begin{equation}
+    \label{eq:mle}
+    \hat{\theta}_{i} \in \argmax_\theta
+    \log \mathcal{L}_i(\theta_i) =
+    \sum_{t:i\in S_t} x_i^{t+1}\log f(\theta_i\cdot x^{t}) +
+    (1 - x_i^{t+1})\log\big(1-f(\theta_i \cdot x^t)\big)
+\end{equation}
+The reader will observe that the above problem reduces to fitting a generalized
+linear model. It is important to note that even though the cascades are
+independent, the vectors $x^t$ observed within a cascade are not independent,
+since they follow the Markov process described in \eqref{eq:model}. In the
+particular case where $f$ is the sigmoid function, we are performing
 logistic regression:
-$$
+\begin{displaymath}
 \begin{cases}
-y_i^* = \theta_i \cdot x_i + \epsilon \text{~where~} \epsilon\sim
-Logistic(0, 1) \text{~and~} x = {(x^t)}_{t \in \mathcal{T}_i}\\
-y_i = [y_i^* > 0] \text{~where~} y_i^t = x_i^{t+1}
+    y^t = \theta_i \cdot x^t + \varepsilon^t
+    &\text{where } \varepsilon^t\sim \text{Logistic}(0, 1)\\
+    \tilde{y}^t = \mathbf{1}\{y^t > 0\} &\text{where } x_i^{t+1} = \tilde{y}^t
 \end{cases}
-$$
+\end{displaymath}
-\subsection*{Objectives}
+\section{Problem Statement}
+A growing body of work (see for example \cite{Netrapalli:2012,
+Daneshmand:2014, pouget}) has focused on solving the Network Inference Problem
+under different cascade models and obtaining estimators with close-to-optimal
+convergence rates. However, these works often rely on hard-to-interpret
+assumptions which can hide subtle properties of the probabilistic model
+\eqref{eq:model}. The overarching goal of the current project is to analyze the
+fundamental statistical properties of observations coming from model
+\eqref{eq:model} and of the MLE estimator \eqref{eq:mle}. In particular, we are
+interested in the following questions:
 \begin{itemize}
-\item Try a Bayesian approach to estimate these parameters. Use the posterior
-predictive distribution to obtain confidence intervals for the edge parameters.
-Validate this with bootstrapping. How does this perform in different networks?
-Can you intuitively link certain node-level/graph-level properties with the
-resulting variance on the estimated parameter?
-\item Do the previous observations correspond with the theoretical result, given
-by the Fisher information matrix: $$\hat \beta \sim \mathcal{N}(\beta,
-I{(\theta)}^{-1})$$ where $I(\theta) = - {\left(\frac{\partial^2\log
-\mathcal{L}}{\partial \theta^2} \right)}^{-1}$
+\item Is the model \eqref{eq:model} identifiable and is the MLE estimator
+    \eqref{eq:mle} consistent? Are there networks which are fundamentally
+    ambiguous and for which the Network Inference Problem is unsolvable?
+\item Use a Bayesian approach to estimate the parameters. Use the posterior
+    predictive distribution to obtain confidence intervals for the edge
+    parameters. Validate this with bootstrapping.
+\item Are there networks which are harder to learn than others? Can the
+    variance of the MLE estimator be related to certain graph-theoretic
+    properties of the network?
+\item Do the previous observations match the theoretical result given
+by the Fisher information matrix:
+\begin{displaymath}
+    \hat{\theta}_i \sim \mathcal{N}(\theta_i, I{(\theta_i)}^{-1})
+    \quad \text{where}\;
+    I(\theta_i) = - \mathbb{E}\left[\frac{\partial^2\log \mathcal{L}_i}
+    {\partial \theta_i^2}\right]
+\end{displaymath}
 \item Are there networks in which the Fisher information matrix is singular?
-What happens to the estimation of $\beta$ in this case?
-\item What if the generative process is generated with a different link
-function? Is there a regularization scheme which can mitigate any bias/exploding
-variance in the estimated parameters?
+What happens to the estimation of $\theta_i$ in this case?
+\item Is the estimator \eqref{eq:mle} robust to the choice of $f$? What if the
+    observations are generated with $f_1$, but \eqref{eq:mle} is solved with
+    $f_2\neq f_1$? Is there a regularization scheme which can mitigate any bias
+    or lack of convergence in the estimated parameters?
 \end{itemize}
-\subsection*{Program plan}
-
-The project will be a series of simulations to answer each of the above
-questions. When possible, we will try to explain the results found in the
-simulation with a simplified analysis on toy-networks. Thibaut and I have worked
-together in the past, and have kept our contributions balanced.
+\paragraph{Roadmap.} We plan to answer the above questions through a mixture
+of theoretical and simulation-based approaches. We expect some of these
+questions to be hard from the theoretical standpoint. In those cases, the
+hardness can be mitigated by using simulations or by focusing on specific
+cases: toy networks (star graph, cycle, complete graph, etc.) or simpler
+cascade models (the Voter Model, for example).
-\begin{thebibliography}{1}
-\bibitem{paper} Pouget-Abadie, J. and Horel, T. \emph{Inferring Graphs from
-Cascades: A Sparse Recovery Framework}, ICML 2015
-\end{thebibliography}
+\bibliography{sparse}
+\bibliographystyle{abbrv}
 \end{document}
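The GLC dynamics described in the proposal above (a susceptible node i becomes infected at step t+1 with probability f(theta_i . x^t); infected nodes are immune thereafter, and the cascade starts from one uniformly random seed) can be sketched as a short simulation. This is an illustrative sketch, not the authors' code: the function names and the sigmoid link are assumptions, and note that with a sigmoid link a susceptible node with no infected neighbors is still infected with probability f(0) = 1/2, one of the subtleties of the model.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_cascade(theta, rng, f=sigmoid):
    """Run one Independent-Cascade-style GLC process.

    theta[u][i] is the weight of the pair (u, i).  x^t marks the nodes
    newly infected at step t; infected nodes are immune afterwards.
    Returns the list of indicator vectors x^0, x^1, ... until the
    cascade dies out.
    """
    n = len(theta)
    seed = rng.randrange(n)  # one seed chosen uniformly at random
    x = [1 if i == seed else 0 for i in range(n)]
    susceptible = set(range(n)) - {seed}
    history = [x]
    while any(x):  # the cascade ends when no infected nodes remain
        nxt = [0] * n
        for i in list(susceptible):
            dot = sum(theta[u][i] * x[u] for u in range(n))
            # P(x^{t+1}_i = 1 | x^t) = f(theta_i . x^t)
            if rng.random() < f(dot):
                nxt[i] = 1
                susceptible.discard(i)
        history.append(nxt)
        x = nxt
    return history
```

By construction each node is infected at most once, and the returned history always ends with an all-zero vector (the step at which no new infection occurred).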

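The node-by-node MLE decomposition in the proposal can likewise be illustrated with plain gradient ascent on the per-node log-likelihood: under the sigmoid link the score is the familiar logistic-regression residual (y - p) x^t, summed over the time steps at which node i is still susceptible. This is a hypothetical sketch under those assumptions; `fit_node` and its hyperparameters are not from the proposal.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_node(i, cascades, n, lr=0.1, epochs=500):
    """Estimate theta_i = (theta_{u,i})_u by gradient ascent on the
    per-node log-likelihood, assuming the sigmoid link f.

    cascades: list of cascades; each cascade is a list of 0/1
    infection-indicator vectors x^0, x^1, ... of length n.
    Only time steps at which node i is still susceptible contribute.
    """
    theta_i = [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for hist in cascades:
            susceptible = hist[0][i] == 0  # seed nodes are never susceptible
            for t in range(len(hist) - 1):
                if not susceptible:
                    break  # i is immune from its infection time onward
                x_t, y = hist[t], hist[t + 1][i]
                p = sigmoid(sum(theta_i[u] * x_t[u] for u in range(n)))
                for u in range(n):
                    # score of the Bernoulli GLM with logistic link
                    grad[u] += (y - p) * x_t[u]
                if y == 1:
                    susceptible = False  # i was just infected
        for u in range(n):
            theta_i[u] += lr * grad[u]
    return theta_i
```

On separable toy data (a node whose infection always precedes i's, and one whose infection never does) the estimated weights drift toward plus and minus infinity, a small instance of the lack-of-convergence issue raised in the last question of the proposal.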