path: root/problem.tex
author     Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)>  2013-07-06 15:23:34 -0700
committer  Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)>  2013-07-06 15:23:34 -0700
commit     49995b4aecef20bd138dea3bf66d55dfccc8164d (patch)
tree       f6ba48f71044402349dc67ef50ffa41370611743 /problem.tex
parent     d71b6f325ded0ca101976e6b5c3b0fa72be4bfbd (diff)
download   recommendation-49995b4aecef20bd138dea3bf66d55dfccc8164d.tar.gz
chaloner citation
Diffstat (limited to 'problem.tex')
-rw-r--r--  problem.tex  18
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/problem.tex b/problem.tex
index 0b10107..2e4db20 100644
--- a/problem.tex
+++ b/problem.tex
@@ -19,14 +19,14 @@ etc.). The magnitude of the coefficient $\beta_i$ captures the effect that featu
The purpose of these experiments is to allow \E\ to estimate the model $\beta$. In particular,
assume that the experimenter \E\ has a {\em prior}
distribution on $\beta$, \emph{i.e.}, $\beta$ has a multivariate normal prior
-with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance and the prior on $\beta$).
+with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance).
Then, \E\ estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization \cite{hastie}:
\begin{align}
\hat{\beta} = \argmax_{\beta\in\reals^d} \prob(\beta\mid y_S) =\argmin_{\beta\in\reals^d} \big(\sum_{i\in S} (y_i - \T{\beta}x_i)^2
+ \T{\beta}R\beta\big) = (R+\T{X_S}X_S)^{-1}X_S^Ty_S \label{ridge}
\end{align}
-where $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ is the matrix of experiment features and
-$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ the observed measurements.
+where the last equality is obtained by setting $\nabla_{\beta}\log\prob(\beta\mid y_S)$ to zero and solving the resulting linear system; in \eqref{ridge}, $X_S\defeq[x_i]_{i\in S}\in \reals^{|S|\times d}$ is the matrix of experiment features and
+$y_S\defeq[y_i]_{i\in S}\in\reals^{|S|}$ are the observed measurements.
This optimization, commonly known as \emph{ridge regression}, includes an additional quadratic penalty term compared to the standard least squares estimation.
% under \eqref{model}, the maximum likelihood estimator of $\beta$ is the \emph{least squares} estimator: for $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ the matrix of experiment features and
%$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ the observed measurements,
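(Aside, not part of the diff: the closed form in \eqref{ridge} can be sanity-checked numerically. The minimal Python/NumPy/SciPy sketch below uses synthetic data; the dimension d, sample size n, identity prior matrix R, and noise level sigma are illustrative assumptions, not values from the paper.)

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n, sigma = 4, 50, 0.5                      # illustrative dimension, |S|, and noise level
R = np.eye(d)                                 # illustrative prior matrix
beta = rng.normal(size=d)                     # illustrative "true" model
X = rng.normal(size=(n, d))                   # experiment features X_S
y = X @ beta + sigma * rng.normal(size=n)     # noisy measurements y_S

# Closed form from \eqref{ridge}: beta_hat = (R + X_S^T X_S)^{-1} X_S^T y_S
beta_hat = np.linalg.solve(R + X.T @ X, X.T @ y)

# Direct minimization of the penalized least-squares objective in \eqref{ridge}
objective = lambda b: np.sum((y - X @ b) ** 2) + b @ R @ b
beta_numerical = minimize(objective, np.zeros(d)).x

print(np.allclose(beta_hat, beta_numerical, atol=1e-4))   # True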
@@ -46,17 +46,17 @@ which is the entropy reduction on $\beta$ after the revelation of $y_S$ (also kn
Hence, selecting a set of experiments $S$ that
maximizes $V(S)$ is equivalent to finding the set of experiments that minimizes
the uncertainty on $\beta$, as captured by the entropy reduction of its estimator.
-Under the linear model \eqref{model}, and the Gaussian prior, the information gain takes the form:
+Under the linear model \eqref{model}, and the Gaussian prior, the information gain takes the following form (see, \emph{e.g.}, \cite{chaloner1995bayesian}):
\begin{align}
- V(S) &= \frac{1}{2}\log\det(R+ \T{X_S}X_S) \label{dcrit} %\\
+ I(\beta;y_S)&= \frac{1}{2}\log\det(R+ \T{X_S}X_S) - \frac{1}{2}\log\det R\label{dcrit} %\\
\end{align}
-This value function is known in the experimental design literature as the
+Maximizing $I(\beta;y_S)$ is therefore equivalent to maximizing $\log\det(R+ \T{X_S}X_S)$, which is known in the experimental design literature as the Bayes
$D$-optimality criterion
\cite{pukelsheim2006optimal,atkinson2007optimum,chaloner1995bayesian}.
-Note that the estimator $\hat{\beta}$ is a linear map of $y_S$; as $y_S$ is a multidimensional normal r.v., so is $\hat{\beta}$ (the randomness coming from the noise terms $\varepsilon_i$).
-In particular, $\hat{\beta}$ has %mean $\beta$% (\emph{i.e.}, it is an \emph{unbiased estimator}) and
-covariance $(R+\T{X_S}X_S)^{-1}$. As such, maximizing $V(S)$ can alternatively be seen as a means of reducing the uncertainty on estimator $\hat{\beta}$ my minimizing the product of the eigenvalues of its covariance.
+Note that the estimator $\hat{\beta}$ is a linear map of $y_S$; as $y_S$ is a multidimensional normal r.v., so is $\hat{\beta}$ (the randomness coming from the noise terms $\varepsilon_i$ and the prior on $\beta$).
+In particular, $\hat{\beta}$ has
+covariance $\sigma^2(R+\T{X_S}X_S)^{-1}$. As such, maximizing $I(\beta;y_S)$ can alternatively be seen as a means of reducing the uncertainty on the estimator $\hat{\beta}$ by minimizing the product of the eigenvalues of its covariance.
%An alternative interpretation, given that $(R+ \T{X_S}X_S)^{-1}$ is the covariance of the estimator $\hat{\beta}$, is that it tries to minimize the
%which is indeed a function of the covariance matrix $(R+\T{X_S}X_S)^{-1}$.
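(Aside, not part of the diff: the identity behind \eqref{dcrit}, i.e., that the entropy reduction from the Gaussian prior with covariance sigma^2 R^{-1} to the Gaussian posterior with covariance sigma^2 (R + X_S^T X_S)^{-1} equals 1/2 log det(R + X_S^T X_S) - 1/2 log det R, can also be checked numerically. The sketch below reuses the illustrative R, X, and sigma from the previous aside.)

import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 4, 50, 0.5                      # illustrative sizes and noise level
R = np.eye(d)                                 # illustrative prior matrix
X = rng.normal(size=(n, d))                   # illustrative experiment features X_S

def gaussian_entropy(cov):
    # Differential entropy of a multivariate normal with covariance `cov`
    dim = cov.shape[0]
    return 0.5 * (dim * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

prior_cov = sigma**2 * np.linalg.inv(R)                 # prior covariance of beta
posterior_cov = sigma**2 * np.linalg.inv(R + X.T @ X)   # posterior covariance given y_S

entropy_reduction = gaussian_entropy(prior_cov) - gaussian_entropy(posterior_cov)
info_gain = 0.5 * (np.linalg.slogdet(R + X.T @ X)[1] - np.linalg.slogdet(R)[1])

print(np.allclose(entropy_reduction, info_gain))        # True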