From 335d2bdfbfc406be36f3f1f240733cf7094b6147 Mon Sep 17 00:00:00 2001
From: Stratis Ioannidis
Date: Sun, 10 Feb 2013 22:29:04 -0800
Subject: merging 3

---
 problem.tex | 81 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 59 insertions(+), 22 deletions(-)

diff --git a/problem.tex b/problem.tex
index 546500d..6027379 100644
--- a/problem.tex
+++ b/problem.tex
@@ -1,50 +1,87 @@
 \label{sec:prel}
-\subsection{Experimental Design}
-
-The theory of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum} studies how an experimenter \E\ should select the parameters of a set of experiments she is about to conduct. In general, the optimality of a particular design depends on the purpose of the experiment, \emph{i.e.}, the quantity \E\ is trying to learn or the hypothesis she is trying to validate. Due to their ubiquity in statistical analysis, a large literature on the subject focuses on learning \emph{linear models}, where \E\ wishes to fit a linear function to the data she has collected.
+\subsection{Linear Regression and Experimental Design}
 
-More precisely, putting cost considerations aside, suppose that \E\ wishes to conduct $k$ among $n$ possible experiments. Each experiment $i\in\mathcal{N}\defeq \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$, normalized so that $\|x_i\|_2\leq 1$. Denote by $S\subseteq \mathcal{N}$, where $|S|=k$, the set of experiments selected; upon its execution, experiment $i\in S$ reveals an output variable (the ``measurement'') $y_i$, related to the experiment features $x_i$ through a linear function, \emph{i.e.},
+The theory of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum,chaloner1995bayesian} considers the following formal setting. % studies how an experimenter \E\ should select the parameters of a set of experiments she is about to conduct. In general, the optimality of a particular design depends on the purpose of the experiment, \emph{i.e.}, the quantity \E\ is trying to learn or the hypothesis she is trying to validate. Due to their ubiquity in statistical analysis, a large literature on the subject focuses on learning \emph{linear models}, where \E\ wishes to fit a linear function to the data she has collected.
+Putting cost considerations aside, suppose that an experimenter \E\ wishes to conduct $k$ among $n$ possible experiments. Each experiment $i\in\mathcal{N}\defeq \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$, normalized so that $\|x_i\|_2\leq 1$. Denote by $S\subseteq \mathcal{N}$, where $|S|=k$, the set of experiments selected; upon its execution, experiment $i\in S$ reveals an output variable (the ``measurement'') $y_i$, related to the experiment features $x_i$ through a linear function, \emph{i.e.},
 \begin{align}\label{model}
 \forall i\in\mathcal{N},\quad y_i = \T{\beta} x_i + \varepsilon_i
 \end{align}
 where $\beta$ is a vector in $\reals^d$, commonly referred to as the \emph{model}, and $\varepsilon_i$ (the \emph{measurement noise}) are independent, normally distributed random variables with mean 0 and variance $\sigma^2$.
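+% Editorial sketch, not in the original: a matrix restatement of \eqref{model}.
+% The shorthand $\varepsilon_S\defeq[\varepsilon_i]_{i\in S}$ is introduced only here;
+% $X_S$ and $y_S$ are as defined below.
+In matrix form, \eqref{model} reads $y_S = X_S\beta + \varepsilon_S$; in particular,
+conditioned on $\beta$, the vector $y_S$ is a multidimensional normal r.v. with
+mean $X_S\beta$ and covariance $\sigma^2 I$.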
-The purpose of these experiments is to allow \E\ to estimate the model $\beta$. In particular, under \eqref{model}, the maximum likelihood estimator of $\beta$ is the \emph{least squares} estimator: for $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ the matrix of experiment features and
-$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ the observed measurements,
-\begin{align} \hat{\beta} &=\max_{\beta\in\reals^d}\prob(y_S;\beta) =\argmin_{\beta\in\reals^d } \sum_{i\in S}(\T{\beta}x_i-y_i)^2 \nonumber\\
-& = (\T{X_S}X_S)^{-1}X_S^Ty_S\label{leastsquares}\end{align}
+The purpose of these experiments is to allow \E\ to estimate the model $\beta$. In particular,
+assume that the experimenter \E\ has a \emph{prior}
+distribution on $\beta$, \emph{i.e.}, $\beta$ has a multivariate normal prior
+with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d\times d}$ (where $\sigma^2$ is the noise variance).
+Then, \E\ estimates $\beta$ through \emph{maximum a posteriori estimation}, \emph{i.e.}, by finding the parameter that maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following minimization \cite{hastie}:
+\begin{align}
+  \hat{\beta} = \argmin_{\beta\in\reals^d} \big(\sum_{i\in S} (y_i - \T{\beta}x_i)^2
+  + \T{\beta}R\beta\big) = (R+\T{X_S}X_S)^{-1}\T{X_S}y_S \label{ridge}
+\end{align}
+where $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ is the matrix of experiment features and
+$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ is the vector of observed measurements.
+This optimization, commonly known as \emph{ridge regression}, includes an additional quadratic penalty term compared to standard least squares estimation.
+% under \eqref{model}, the maximum likelihood estimator of $\beta$ is the \emph{least squares} estimator: for $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ the matrix of experiment features and
+%$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ the observed measurements,
+%\begin{align} \hat{\beta} &=\max_{\beta\in\reals^d}\prob(y_S;\beta) =\argmin_{\beta\in\reals^d } \sum_{i\in S}(\T{\beta}x_i-y_i)^2 \nonumber\\
+%& = (\T{X_S}X_S)^{-1}X_S^Ty_S\label{leastsquares}\end{align}
 %The estimator $\hat{\beta}$ is unbiased, \emph{i.e.}, $\expt{\hat{\beta}} = \beta$ (where the expectation is over the noise variables $\varepsilon_i$). Furthermore, $\hat{\beta}$ is a multidimensional normal random variable with mean $\beta$ and covariance matrix $(X_S\T{X_S})^{-1}$.
-
 Note that the estimator $\hat{\beta}$ is a linear map of $y_S$; as $y_S$ is a multidimensional normal r.v., so is $\hat{\beta}$ (the randomness coming from
-the noise terms $\varepsilon_i$). In particular, $\hat{\beta}$ has mean $\beta$
-(\emph{i.e.}, it is an \emph{unbiased estimator}) and covariance
-$(\T{X_S}X_S)^{-1}$.
+the noise terms $\varepsilon_i$). In particular, $\hat{\beta}$ has
+%mean $\beta$ (\emph{i.e.}, it is an \emph{unbiased estimator}) and
+covariance
+$(R+\T{X_S}X_S)^{-1}$.
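+% Editorial sketch, not in the original: the standard MAP computation behind
+% \eqref{ridge}; it uses only the Gaussian prior and noise model stated above.
+To see how \eqref{ridge} arises, take negative logarithms in
+$p(\beta\mid y_S)\propto p(y_S\mid\beta)\,p(\beta)$ and drop additive constants:
+\begin{align*}
+ -\log p(\beta\mid y_S) = \tfrac{1}{2\sigma^2}\sum_{i\in S}(y_i-\T{\beta}x_i)^2
+ + \tfrac{1}{2\sigma^2}\T{\beta}R\beta + \mathrm{const}.
+\end{align*}
+Scaling by $2\sigma^2$ recovers the objective in \eqref{ridge}, and the first-order
+condition $\T{X_S}(X_S\beta-y_S)+R\beta=0$ gives the closed form
+$\hat{\beta}=(R+\T{X_S}X_S)^{-1}\T{X_S}y_S$.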
 Let $V:2^\mathcal{N}\to\reals$ be a \emph{value function}, quantifying how informative a set of experiments $S$ is in estimating $\beta$. The standard optimal experimental design problem amounts to finding a set $S$ that maximizes $V(S)$ subject to the constraint $|S|\leq k$.
-A variety of different value functions are used in experimental design~\cite{pukelsheim2006optimal}; almost all make use of the covariance $(\T{X_S}X_S)^{-1}$ of the estimator $\hat{\beta}$. A value function preferred because of its relationship to entropy is the \emph{$D$-optimality criterion}: %which yields the following optimization problem
+A variety of different value functions are used in experimental design~\cite{pukelsheim2006optimal}; almost all make use of the covariance $(R+\T{X_S}X_S)^{-1}$ of the estimator $\hat{\beta}$. A value function with a natural information-theoretic interpretation is the \emph{information gain}: %\emph{$D$-optimality criterion}: %which yields the following optimization problem
 \begin{align}
-  V(S) &= \frac{1}{2}\log\det \T{X_S}X_S \label{dcrit} %\\
+  V(S) = I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S). \label{informationgain}
 \end{align}
-defined as $-\infty$ when $\mathrm{rank}(\T{X_S}X_S)<d$.
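+% Editorial sketch, not in the original: a closed form for \eqref{informationgain},
+% via the Gaussian entropy formula $\tfrac{1}{2}\log\det(2\pi e\,\Sigma)$; the display below is ours.
+Both terms in \eqref{informationgain} are entropies of multidimensional normal
+r.v.'s, with covariances proportional to $R^{-1}$ and $(R+\T{X_S}X_S)^{-1}$,
+respectively; the proportionality constants cancel, yielding
+\begin{align*}
+ V(S) = \tfrac{1}{2}\log\det\big(R+\T{X_S}X_S\big) - \tfrac{1}{2}\log\det R,
+\end{align*}
+which is finite whenever $R\succ 0$, even if $\T{X_S}X_S$ is rank deficient,
+unlike the $D$-optimality criterion it replaces.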