 general.tex | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/general.tex b/general.tex
index 65b3c8f..5162887 100644
--- a/general.tex
+++ b/general.tex
@@ -6,7 +6,7 @@
 %as our objective.
 
 Moreover, we extend Theorem~\ref{thm:main} to a more general
-Bayesian setting where it is assumed that the experimenter \E\ has a {\em prior}
+Bayesian setting, where it is assumed that the experimenter \E\ has a {\em prior}
 distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance).
 \E\ estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$.
 Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization \cite{hastie}:
@@ -34,7 +34,7 @@
 clearly follows from \eqref{bayesianobjective} by setting $R=I_d$. Hence,
 the optimization discussed thus far can be interpreted as a maximization of the information gain
 when the prior distribution has a covariance $\sigma^2 I_d$, and the experimenter is solving a ridge regression
-problem with penalty term $\norm{x}_2^2$.
+problem with penalty term $\norm{\beta}_2^2$.
 Our results can be extended to the general Bayesian case, by replacing
 $I_d$ with the positive semidefinite matrix $R$. First, we re-set the
@@ -63,7 +63,7 @@
 where $\mu$ is the smallest eigenvalue of $R$.
 
 \subsection{$D$-Optimality and Beyond}
-We now reexamine the classical $D$-optimality in \eqref{dcrit}, which is~\ref{bayesianobjective} with $R$ replaced by
+We now reexamine the classical $D$-optimality in \eqref{dcrit}, which is given by objective~\eqref{bayesianobjective} with $R$ replaced by
 the zero matrix. Since \eqref{dcrit} may take arbitrarily small negative values, to define a meaningful approximation one would consider the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$.
 %, for some strictly increasing, onto function $f:\reals_+\to\reals_+$.
 However, the following lower bound implies that such an optimization goal cannot be attained under the constraints of truthfulness, budget feasibility, and individual rationality.
@@ -75,13 +75,12 @@ For any $M>1$, there is no $M$-approximate, truthful, budget feasible, individua
 \input{proof_of_lower_bound1}
 \end{proof}
-Beyond $D$-optimality,
-{\bf STRATIC, fill in..}
+Beyond $D$-optimality, several other objectives, such as $E$-optimality (maximizing the smallest eigenvalue of $\T{X_S}X_S$) or $T$-optimality (maximizing $\mathrm{trace}(\T{X_S}X_S)$), are encountered in the literature \cite{pukelsheim2006optimal}, though they do not relate to entropy as $D$-optimality does. We leave the maximization of such objectives from a strategic point of view as an open problem.
 
 \subsection{Beyond Linear Models}
 Selecting experiments that maximize the information gain in the Bayesian setup leads to a natural generalization to other learning examples beyond linear
-regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$ take values in some general set $\Omega$, called the \emph{query space}, and measurements $y_i\in\reals$ are given by
+regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$ take values in some general set $\Omega$, called the \emph{query space}. Measurements $y_i\in\reals$ are given by
 \begin{equation}\label{eq:hypothesis-model}
 y_i = h(x_i) + \varepsilon_i
 \end{equation}
@@ -89,8 +88,8 @@
 where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings $h:\Omega\to\reals$, called the \emph{hypothesis space}, and $\varepsilon_i$ are random variables in $\reals$, not necessarily identically distributed, that are independent \emph{conditioned on $h$}.
 This model is quite broad, and
-captures many learning tasks, such as support vector machines (SVM), linear
-discriminant analysis (LDA), and the following tasks:
+captures many learning tasks, such as support vector machines (SVM) and linear
+discriminant analysis (LDA); we give a few concrete examples below:
 \begin{enumerate}
 \item\textbf{Generalized Linear Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of linear maps $\{h(x) = \T{\beta}x \text{ s.t. } \beta\in \reals^d\}$, and $\varepsilon_i$ are independent zero-mean normal variables, where $\expt{\varepsilon_i^2}=\sigma_i^2$.
 \item\textbf{Logistic Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of maps $\{h(x) = \frac{e^{\T{\beta} x}}{1+e^{\T{\beta} x}} \text{ s.t. } \beta\in\reals^d\}$, and $\varepsilon_i$ are independent conditioned on $h$ such that $$\varepsilon_i=\begin{cases} 1-h(x_i),& \text{w.~prob.}\;h(x_i)\\ -h(x_i),&\text{w.~prob.}\;1-h(x_i)\end{cases}$$
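As context for the MAP objective changed above: under the Gaussian prior $\beta\sim\mathcal{N}(0,\sigma^2 R^{-1})$ and Gaussian noise, the MAP estimate has the standard closed form $\hat\beta=(\T{X_S}X_S+R)^{-1}\T{X_S}y_S$ (the noise variance $\sigma^2$ cancels), and $R=I_d$ recovers ridge regression with penalty $\norm{\beta}_2^2$. A minimal NumPy sketch; the matrix `R`, the true parameter, and all dimensions are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
R = np.diag([1.0, 2.0, 4.0])            # hypothetical prior precision matrix
beta_true = np.array([1.0, -0.5, 2.0])  # hypothetical ground-truth parameter

X = rng.normal(size=(n, d))              # selected feature vectors X_S
y = X @ beta_true + rng.normal(size=n)   # linear model with unit-variance noise

# MAP estimate under the Gaussian prior: minimizes ||y - X b||^2 + b' R b
beta_map = np.linalg.solve(X.T @ X + R, X.T @ y)

# Setting R = I_d recovers ordinary ridge regression with penalty ||b||_2^2
beta_ridge = np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)
```

Both estimates satisfy the first-order optimality condition of their penalized least-squares objectives, which is an easy sanity check on the derivation.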

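The scalarizations discussed in the $D$-optimality subsection can all be evaluated directly from the information matrix $\T{X_S}X_S$. A small sketch with a hypothetical design matrix (sizes and data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X_S = rng.normal(size=(10, 3))   # hypothetical selected design matrix X_S
M = X_S.T @ X_S                  # information matrix X_S' X_S

V = np.linalg.det(M)             # D-optimality value V(S) = det(X_S' X_S)
logV = np.linalg.slogdet(M)[1]   # log-determinant form, as in the D-criterion
E = np.linalg.eigvalsh(M).min()  # E-optimality: smallest eigenvalue
T = np.trace(M)                  # T-optimality: trace
```

Using `slogdet` rather than `log(det(M))` is the numerically safer choice when $\det\T{X_S}X_S$ can be very small, which is exactly the regime where the criterion takes arbitrarily small negative values.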
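The logistic-regression instance of the hypothesis model can also be simulated directly: with $h(x)=e^{\T{\beta}x}/(1+e^{\T{\beta}x})$, the stated noise distribution makes $y_i=h(x_i)+\varepsilon_i$ a Bernoulli draw in $\{0,1\}$ with mean $h(x_i)$. A minimal sketch; the parameter vector and sample size are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([0.8, -1.2])           # hypothetical model parameter
x = rng.normal(size=(1000, 2))         # query points in Omega = R^d

h = 1.0 / (1.0 + np.exp(-(x @ beta)))  # h(x) = e^{b'x} / (1 + e^{b'x})

# eps_i = 1 - h(x_i) w.p. h(x_i), and -h(x_i) otherwise, so that
# y_i = h(x_i) + eps_i is a Bernoulli(h(x_i)) draw in {0, 1}
y = (rng.random(1000) < h).astype(float)
eps = y - h                            # the realized zero-mean noise eps_i
```

This makes the conditional-independence requirement concrete: given $h$, each $\varepsilon_i$ is an independent, zero-mean (but not identically distributed) random variable.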