| author | Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)> | 2013-07-06 15:23:34 -0700 |
|---|---|---|
| committer | Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)> | 2013-07-06 15:23:34 -0700 |
| commit | 49995b4aecef20bd138dea3bf66d55dfccc8164d (patch) | |
| tree | f6ba48f71044402349dc67ef50ffa41370611743 | |
| parent | d71b6f325ded0ca101976e6b5c3b0fa72be4bfbd (diff) | |
| download | recommendation-49995b4aecef20bd138dea3bf66d55dfccc8164d.tar.gz | |
chaloner citation
| -rw-r--r-- | problem.tex | 18 |
1 file changed, 9 insertions, 9 deletions
diff --git a/problem.tex b/problem.tex
index 0b10107..2e4db20 100644
--- a/problem.tex
+++ b/problem.tex
@@ -19,14 +19,14 @@ etc.). The magnitude of the coefficient $\beta_i$ captures the effect that featu
 The purpose of these experiments is to allow \E\ to estimate the model $\beta$.
 In particular, assume that the experimenter \E\ has a {\em prior} distribution on $\beta$,
 \emph{i.e.}, $\beta$ has a multivariate normal prior
-with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance and the prior on $\beta$).
+with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance).
 Then, \E\ estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$.
 Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization \cite{hastie}:
 \begin{align}
 \hat{\beta} = \argmax_{\beta\in\reals^d} \prob(\beta\mid y_S) =\argmin_{\beta\in\reals^d} \big(\sum_{i\in S} (y_i - \T{\beta}x_i)^2 + \T{\beta}R\beta\big) = (R+\T{X_S}X_S)^{-1}X_S^Ty_S \label{ridge}
 \end{align}
-where $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ is the matrix of experiment features and
-$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ the observed measurements.
+where the last equality is obtained by setting $\nabla_{\beta}\prob(\beta\mid y_S)$ to zero and solving the resulting linear system; in \eqref{ridge}, $X_S\defeq[x_i]_{i\in S}\in \reals^{|S|\times d}$ is the matrix of experiment features and
+$y_S\defeq[y_i]_{i\in S}\in\reals^{|S|}$ are the observed measurements.
 This optimization, commonly known as \emph{ridge regression}, includes an additional quadratic penalty term compared to the standard least squares estimation.
 % under \eqref{model}, the maximum likelihood estimator of $\beta$ is the \emph{least squares} estimator: for $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ the matrix of experiment features and
 %$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ the observed measurements,
@@ -46,17 +46,17 @@ which is the entropy reduction on $\beta$ after the revelation of $y_S$ (also kn
 Hence, selecting a set of experiments $S$ that maximizes $V(S)$ is equivalent to finding the set of experiments that minimizes
 the uncertainty on $\beta$, as captured by the entropy reduction of its estimator.
-Under the linear model \eqref{model}, and the Gaussian prior, the information gain takes the form:
+Under the linear model \eqref{model}, and the Gaussian prior, the information gain takes the following form (see, \emph{e.g.}, \cite{chaloner1995bayesian}):
 \begin{align}
- V(S) &= \frac{1}{2}\log\det(R+ \T{X_S}X_S) \label{dcrit} %\\
+ I(\beta;y_S)&= \frac{1}{2}\log\det(R+ \T{X_S}X_S) - \frac{1}{2}\log\det R\label{dcrit} %\\
 \end{align}
-This value function is known in the experimental design literature as the
+Maximizing $I(\beta;y_S)$ is therefore equivalent to maximizing $\log\det(R+ \T{X_S}X_S)$, which is known in the experimental design literature as the
 Bayes $D$-optimality criterion \cite{pukelsheim2006optimal,atkinson2007optimum,chaloner1995bayesian}.
-Note that the estimator $\hat{\beta}$ is a linear map of $y_S$; as $y_S$ is a multidimensional normal r.v., so is $\hat{\beta}$ (the randomness coming from the noise terms $\varepsilon_i$).
-In particular, $\hat{\beta}$ has %mean $\beta$% (\emph{i.e.}, it is an \emph{unbiased estimator}) and
-covariance $(R+\T{X_S}X_S)^{-1}$. As such, maximizing $V(S)$ can alternatively be seen as a means of reducing the uncertainty on estimator $\hat{\beta}$ my minimizing the product of the eigenvalues of its covariance.
+Note that the estimator $\hat{\beta}$ is a linear map of $y_S$; as $y_S$ is a multidimensional normal r.v., so is $\hat{\beta}$ (the randomness coming from the noise terms $\varepsilon_i$ and the prior on $\beta$).
+In particular, $\hat{\beta}$ has
+covariance $\sigma^2(R+\T{X_S}X_S)^{-1}$. As such, maximizing $I(\beta;y_S)$ can alternatively be seen as a means of reducing the uncertainty on estimator $\hat{\beta}$ my minimizing the product of the eigenvalues of its covariance.
 %An alternative interpretation, given that $(R+ \T{X_S}X_S)^{-1}$ is the covariance of the estimator $\hat{\beta}$, is that it tries to minimize the
 %which is indeed a function of the covariance matrix $(R+\T{X_S}X_S)^{-1}$.
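As an editorial aside (not part of the patch), the gradient step referenced in the new "where the last equality is obtained by setting $\nabla_{\beta}\prob(\beta\mid y_S)$ to zero" line can be spelled out as a short sketch; it assumes only the paper's transpose macro `\T{...}`:

```latex
% Sketch of the step behind the last equality in \eqref{ridge}:
% set the gradient of the argmin objective to zero and solve for beta.
\begin{align*}
\nabla_{\beta}\Big(\sum_{i\in S}(y_i - \T{\beta}x_i)^2 + \T{\beta}R\beta\Big)
  &= -2\,\T{X_S}(y_S - X_S\beta) + 2R\beta = 0 \\
\Longrightarrow\quad (R + \T{X_S}X_S)\,\beta &= \T{X_S}y_S
  \quad\Longrightarrow\quad \hat{\beta} = (R + \T{X_S}X_S)^{-1}\T{X_S}y_S .
\end{align*}
```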

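The second hunk's quantities are also easy to check numerically. The sketch below is purely illustrative and not part of the repository: the dimensions, the prior matrix `R`, and the noise level `sigma` are arbitrary toy choices. It computes the MAP/ridge estimate of \eqref{ridge}, the information gain of \eqref{dcrit}, and the posterior covariance $\sigma^2(R+\T{X_S}X_S)^{-1}$ that the patch corrects.

```python
# Editorial sketch (not from the paper's code): illustrate the Bayes
# D-optimality objective and the posterior covariance of the ridge/MAP
# estimator.  d, n, sigma, and R below are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 4, 30, 0.5                       # feature dim, #experiments, noise std
R = np.eye(d)                                  # prior precision matrix (toy choice)
X_S = rng.normal(size=(n, d))                  # stacked experiment features x_i
beta = rng.normal(size=d)                      # "true" model for the simulation
y_S = X_S @ beta + sigma * rng.normal(size=n)  # y_i = beta^T x_i + eps_i

A = R + X_S.T @ X_S
beta_hat = np.linalg.solve(A, X_S.T @ y_S)     # MAP / ridge estimate, eq. (ridge)

# Information gain of eq. (dcrit): 1/2 log det(R + X^T X) - 1/2 log det R
info_gain = 0.5 * (np.linalg.slogdet(A)[1] - np.linalg.slogdet(R)[1])

# Posterior covariance sigma^2 (R + X^T X)^{-1}; its log-determinant equals,
# up to constants, the negative of the D-optimality value, so maximizing the
# information gain shrinks the product of the covariance eigenvalues.
post_cov = sigma**2 * np.linalg.inv(A)

print("information gain (nats):", round(info_gain, 3))
print("log prod. of covariance eigenvalues:", round(np.linalg.slogdet(post_cov)[1], 3))
print("beta_hat:", np.round(beta_hat, 3))
```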