-rw-r--r-- general.tex | 15
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/general.tex b/general.tex
index 65b3c8f..5162887 100644
--- a/general.tex
+++ b/general.tex
@@ -6,7 +6,7 @@
%as our objective. Moreover,
We extend Theorem~\ref{thm:main} to a more general
-Bayesian setting where it is assumed that the experimenter \E\ has a {\em prior}
+Bayesian setting, where it is assumed that the experimenter \E\ has a {\em prior}
distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior
with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d\times d}$ (where $\sigma^2$ is the noise variance).
\E\ estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization \cite{hastie}:
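For concreteness, the maximization above reduces to a generalized ridge regression with a closed-form solution; the following NumPy sketch is illustrative only, with variable names `X`, `y`, and `R` our own rather than taken from the paper:

```python
import numpy as np

def map_estimate(X, y, R):
    """MAP estimate of beta under the prior beta ~ N(0, sigma^2 R^{-1})
    and Gaussian noise of variance sigma^2.  The sigma^2 factors cancel,
    leaving the generalized ridge solution (X^T X + R)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + R, X.T @ y)

# With R = I_d this is ordinary ridge regression with penalty ||beta||_2^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(200)
beta_hat = map_estimate(X, y, np.eye(3))
```

Passing a general positive definite `R` gives the Bayesian case discussed below; `R = I_d` recovers the ridge setting.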
@@ -34,7 +34,7 @@ clearly follows from \eqref{bayesianobjective}
by setting $R=I_d$. Hence, the optimization discussed thus far can be interpreted as
a maximization of the information gain when the prior distribution has
a covariance $\sigma^2 I_d$, and the experimenter is solving a ridge regression
-problem with penalty term $\norm{x}_2^2$.
+problem with penalty term $\norm{\beta}_2^2$.
Our results can be extended to the general Bayesian case, by
replacing $I_d$ with the positive semidefinite matrix $R$. First, we re-set the
@@ -63,7 +63,7 @@ where $\mu$ is the smallest eigenvalue of $R$.
\subsection{$D$-Optimality and Beyond}
-We now reexamine the classical $D$-optimality in \eqref{dcrit}, which is~\ref{bayesianobjective} with $R$ replaced by
+We now reexamine the classical $D$-optimality in \eqref{dcrit}, which is given by objective~\eqref{bayesianobjective} with $R$ replaced by
the zero matrix.
Since \eqref{dcrit} may take arbitrarily small negative values, to define a meaningful approximation one would consider the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$. %, for some strictly increasing, on-to function $f:\reals_+\to\reals_+$.
However, the following lower bound implies that such an optimization goal cannot be attained under the constraints of truthfulness, budget feasibility, and individual rationality.
@@ -75,13 +75,12 @@ For any $M>1$, there is no $M$-approximate, truthful, budget feasible, individua
\input{proof_of_lower_bound1}
\end{proof}
-Beyond $D$-optimality,
-{\bf STRATIC, fill in..}
+Beyond $D$-optimality, several other objectives, such as $E$-optimality (maximizing the smallest eigenvalue of $\T{X_S}X_S$) or $T$-optimality (maximizing $\mathrm{trace}(\T{X_S}X_S)$), are encountered in the literature \cite{pukelsheim2006optimal}, though they do not relate to entropy as $D$-optimality does. We leave the task of approaching the maximization of such objectives from a strategic point of view as an open problem.
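The three criteria just named ($D$, $E$, and $T$) are easy to compare numerically; a minimal NumPy sketch, with all function names ours and for illustration only:

```python
import numpy as np

def d_criterion(X_S):
    # D-optimality: log-determinant of the information matrix X_S^T X_S.
    sign, logdet = np.linalg.slogdet(X_S.T @ X_S)
    return logdet if sign > 0 else -np.inf

def e_criterion(X_S):
    # E-optimality: smallest eigenvalue of X_S^T X_S
    # (eigvalsh returns eigenvalues in ascending order).
    return np.linalg.eigvalsh(X_S.T @ X_S)[0]

def t_criterion(X_S):
    # T-optimality: trace of X_S^T X_S.
    return float(np.trace(X_S.T @ X_S))

# For X_S = diag(1, 2) the information matrix is diag(1, 4):
# log-determinant log 4, smallest eigenvalue 1, trace 5.
X_S = np.diag([1.0, 2.0])
```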
\subsection{Beyond Linear Models}
Selecting experiments that maximize the information gain in the Bayesian setup
leads to a natural generalization to other learning examples beyond linear
-regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$ take values in some general set $\Omega$, called the \emph{query space}, and measurements $y_i\in\reals$ are given by
+regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$ take values in some general set $\Omega$, called the \emph{query space}. Measurements $y_i\in\reals$ are given by
\begin{equation}\label{eq:hypothesis-model}
y_i = h(x_i) + \varepsilon_i
\end{equation}
@@ -89,8 +88,8 @@ where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings
$h:\Omega\to\reals$, called the \emph{hypothesis space}, and $\varepsilon_i$
are random variables in $\reals$, not necessarily identically distributed, that
are independent \emph{conditioned on $h$}. This model is quite broad, and
-captures many learning tasks, such as support vector machines (SVM), linear
-discriminant analysis (LDA), and the following tasks:
+captures many learning tasks, such as support vector machines (SVM) and linear
+discriminant analysis (LDA); we give a few concrete examples below:
\begin{enumerate}
\item\textbf{Generalized Linear Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of linear maps $\{h(x) = \T{\beta}x \text{ s.t. } \beta\in \reals^d\}$, and $\varepsilon_i$ are independent zero-mean normal variables, where $\expt{\varepsilon_i^2}=\sigma_i^2$.
\item\textbf{Logistic Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of maps $\{h(x) = \frac{e^{\T{\beta} x}}{1+e^{\T{\beta} x}} \text{ s.t. } \beta\in\reals^d\}$, and $\varepsilon_i$ are independent conditioned on $h$ such that $$\varepsilon_i=\begin{cases} 1- h(x_i),& \text{w.~prob.}\;h(x_i)\\-h(x_i),&\text{w.~prob.}\;1-h(x_i)\end{cases}$$
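The noise structure of the logistic example can be sanity-checked by simulation: conditioned on $h$, each $y_i$ is Bernoulli with mean $h(x_i)$, so the residuals $\varepsilon_i = y_i - h(x_i)$ have zero conditional mean. A hedged sketch, with all variable names our own:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20000, 2))   # features x_i in Omega = R^2
beta = np.array([0.5, -1.0])

# h(x) = exp(beta^T x) / (1 + exp(beta^T x)), the logistic hypothesis.
h = 1.0 / (1.0 + np.exp(-X @ beta))

# eps_i = 1 - h(x_i) w.p. h(x_i), and -h(x_i) otherwise, so
# y_i = h(x_i) + eps_i is equivalently Bernoulli(h(x_i)).
y = (rng.random(len(h)) < h).astype(float)
eps = y - h                           # conditionally zero-mean noise
```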