 general.tex | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/general.tex b/general.tex
index 65b3c8f..5162887 100644
--- a/general.tex
+++ b/general.tex
@@ -6,7 +6,7 @@
 %as our objective.
 
 Moreover, we extend Theorem~\ref{thm:main} to a more general
-Bayesian setting where it is assumed that the experimenter \E\ has a {\em prior}
+Bayesian setting, where it is assumed that the experimenter \E\ has a {\em prior}
 distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance).
 \E\ estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$.
 Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization \cite{hastie}:
@@ -34,7 +34,7 @@
 clearly follows from \eqref{bayesianobjective} by setting $R=I_d$. Hence,
 the optimization discussed thus far can be interpreted as a maximization of the information gain
 when the prior distribution has a covariance $\sigma^2 I_d$, and the experimenter is solving a ridge regression
-problem with penalty term $\norm{x}_2^2$.
+problem with penalty term $\norm{\beta}_2^2$.
 Our results can be extended to the general Bayesian case, by replacing
 $I_d$ with the positive semidefinite matrix $R$. First, we re-set the
@@ -63,7 +63,7 @@
 where $\mu$ is the smallest eigenvalue of $R$.
 
 \subsection{$D$-Optimality and Beyond}
-We now reexamine the classical $D$-optimality in \eqref{dcrit}, which is~\ref{bayesianobjective} with $R$ replaced by
+We now reexamine the classical $D$-optimality in \eqref{dcrit}, which is given by objective~\eqref{bayesianobjective} with $R$ replaced by
 the zero matrix. Since \eqref{dcrit} may take arbitrarily small negative values, to define a meaningful approximation one would consider the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$.
 %, for some strictly increasing, onto function $f:\reals_+\to\reals_+$.
 However, the following lower bound implies that such an optimization goal cannot be attained under the constraints of truthfulness, budget feasibility, and individual rationality.
@@ -75,13 +75,12 @@ For any $M>1$, there is no $M$-approximate, truthful, budget feasible, individua
 \input{proof_of_lower_bound1}
 \end{proof}
-Beyond $D$-optimality,
-{\bf STRATIC, fill in..}
+Beyond $D$-optimality, several other objectives, such as $E$-optimality (maximizing the smallest eigenvalue of $\T{X_S}X_S$) or $T$-optimality (maximizing $\mathrm{trace}(\T{X_S}X_S)$), are encountered in the literature \cite{pukelsheim2006optimal}, though they do not relate to entropy as $D$-optimality does. We leave the maximization of such objectives from a strategic point of view as an open problem.
 
 \subsection{Beyond Linear Models}
 Selecting experiments that maximize the information gain in the Bayesian setup leads to a natural generalization to other learning examples beyond linear
-regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$ take values in some general set $\Omega$, called the \emph{query space}, and measurements $y_i\in\reals$ are given by
+regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$ take values in some general set $\Omega$, called the \emph{query space}. Measurements $y_i\in\reals$ are given by
 \begin{equation}\label{eq:hypothesis-model}
 y_i = h(x_i) + \varepsilon_i
 \end{equation}
@@ -89,8 +88,8 @@
 where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings $h:\Omega\to\reals$, called the \emph{hypothesis space}, and $\varepsilon_i$ are random variables in $\reals$, not necessarily identically distributed, that are independent \emph{conditioned on $h$}.
 This model is quite broad, and
-captures many learning tasks, such as support vector machines (SVM), linear
-discriminant analysis (LDA), and the following tasks:
+captures many learning tasks, such as support vector machines (SVM) and linear
+discriminant analysis (LDA); we give a few concrete examples below:
 \begin{enumerate}
 \item\textbf{Generalized Linear Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of linear maps $\{h(x) = \T{\beta}x \text{ s.t. } \beta\in \reals^d\}$, and $\varepsilon_i$ are independent zero-mean normal variables, where $\expt{\varepsilon_i^2}=\sigma_i^2$.
 \item\textbf{Logistic Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of maps $\{h(x) = \frac{e^{\T{\beta} x}}{1+e^{\T{\beta} x}} \text{ s.t. } \beta\in\reals^d\}$, and $\varepsilon_i$ are independent conditioned on $h$ such that $$\varepsilon_i=\begin{cases} 1-h(x_i),& \text{w.~prob.}\;h(x_i)\\ -h(x_i),&\text{w.~prob.}\;1-h(x_i)\end{cases}$$
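As context for the MAP objective changed above: under the Gaussian prior $\beta\sim\mathcal{N}(0,\sigma^2 R^{-1})$ and Gaussian noise, the MAP estimate has the standard closed form $\hat\beta=(\T{X_S}X_S+R)^{-1}\T{X_S}y_S$ (the noise variance $\sigma^2$ cancels), and $R=I_d$ recovers ridge regression with penalty $\norm{\beta}_2^2$. A minimal NumPy sketch; the matrix `R`, the true parameter, and all dimensions are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
R = np.diag([1.0, 2.0, 4.0])            # hypothetical prior precision matrix
beta_true = np.array([1.0, -0.5, 2.0])  # hypothetical ground-truth parameter

X = rng.normal(size=(n, d))              # selected feature vectors X_S
y = X @ beta_true + rng.normal(size=n)   # linear model with unit-variance noise

# MAP estimate under the Gaussian prior: minimizes ||y - X b||^2 + b' R b
beta_map = np.linalg.solve(X.T @ X + R, X.T @ y)

# Setting R = I_d recovers ordinary ridge regression with penalty ||b||_2^2
beta_ridge = np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)
```

Both estimates satisfy the first-order optimality condition of their penalized least-squares objectives, which is an easy sanity check on the derivation.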

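The scalarizations discussed in the $D$-optimality subsection can all be evaluated directly from the information matrix $\T{X_S}X_S$. A small sketch with a hypothetical design matrix (sizes and data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X_S = rng.normal(size=(10, 3))   # hypothetical selected design matrix X_S
M = X_S.T @ X_S                  # information matrix X_S' X_S

V = np.linalg.det(M)             # D-optimality value V(S) = det(X_S' X_S)
logV = np.linalg.slogdet(M)[1]   # log-determinant form, as in the D-criterion
E = np.linalg.eigvalsh(M).min()  # E-optimality: smallest eigenvalue
T = np.trace(M)                  # T-optimality: trace
```

Using `slogdet` rather than `log(det(M))` is the numerically safer choice when $\det\T{X_S}X_S$ can be very small, which is exactly the regime where the criterion takes arbitrarily small negative values.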
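The logistic-regression instance of the hypothesis model can also be simulated directly: with $h(x)=e^{\T{\beta}x}/(1+e^{\T{\beta}x})$, the stated noise distribution makes $y_i=h(x_i)+\varepsilon_i$ a Bernoulli draw in $\{0,1\}$ with mean $h(x_i)$. A minimal sketch; the parameter vector and sample size are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([0.8, -1.2])           # hypothetical model parameter
x = rng.normal(size=(1000, 2))         # query points in Omega = R^d

h = 1.0 / (1.0 + np.exp(-(x @ beta)))  # h(x) = e^{b'x} / (1 + e^{b'x})

# eps_i = 1 - h(x_i) w.p. h(x_i), and -h(x_i) otherwise, so that
# y_i = h(x_i) + eps_i is a Bernoulli(h(x_i)) draw in {0, 1}
y = (rng.random(1000) < h).astype(float)
eps = y - h                            # the realized zero-mean noise eps_i
```

This makes the conditional-independence requirement concrete: given $h$, each $\varepsilon_i$ is an independent, zero-mean (but not identically distributed) random variable.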