1 files changed, 32 insertions, 59 deletions
diff --git a/general.tex b/general.tex
index 5162887..3c7eddf 100644
--- a/general.tex
+++ b/general.tex
@@ -1,49 +1,14 @@
-\subsection{Bayesian Experimental Design}\label{sec:bed}
+\subsection{Strategic Experimental Design with non-homotropic prior}\label{sec:bed}
 
 %In this section, we extend our results to Bayesian experimental design
 %\cite{chaloner1995bayesian}. We show that objective function \eqref{modified}
 %has a natural interpretation in this context, further motivating its selection
 %as our objective.  Moreover, 
 
-We extend Theorem~\ref{thm:main} to a more general
-Bayesian setting, where it is assumed that the experimenter  \E\ has a {\em prior}
-distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior
-with zero mean  and covariance $\sigma^2R^{-1}\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance). 
-\E\ estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization \cite{hastie}: 
-\begin{displaymath}
-    \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
-    + \T{\beta}R\beta
-\end{displaymath}
-This optimization, commonly known as \emph{ridge regression}, includes an additional penalty term compared to the least squares estimation \eqref{leastsquares}.
-
-Let $\entropy(\beta)$ be the entropy of $\beta$ under this distribution, and
-$\entropy(\beta\mid y_S)$ the  entropy of $\beta$ conditioned on the experiment
-outcomes $y_S$, for some $S\subseteq \mathcal{N}$. In this setting, a natural
-objective, originally proposed by Lindley \cite{lindley1956measure}, is to
-select a set of experiments $S$ that maximizes her \emph{information gain}:
-\begin{displaymath}
-    I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S).
-\end{displaymath}
-Assuming normal noise variables, the information gain is  equal (up to a constant) to the following value function \cite{chaloner1995bayesian}:
-\begin{align}
-V(S)  = \frac{1}{2}\log\det(R + \T{X_S}X_S)\label{bayesianobjective}
-\end{align}
-Our objective for \EDP\ 
-%\eqref{modified} 
-clearly follows from \eqref{bayesianobjective}
-by setting $R=I_d$. Hence, the optimization discussed thus far can be interpreted as
-a maximization of the information gain when the prior distribution has
-a covariance $\sigma^2 I_d$, and the experimenter is solving a ridge regression
-problem with penalty term $\norm{\beta}_2^2$. 
-
-Our results can be extended to the general Bayesian case, by
-replacing $I_d$ with the positive semidefinite matrix $R$. First, we re-set the
-origin of the value function so that $V(\emptyset) = 0$:
-\begin{align}\label{eq:normalized}
-    \tilde{V}(S)
-    & = \frac{1}{2}\log\det(R + \T{X_S}X_S) - \frac{1}{2}\log\det R\\
-    & = \frac{1}{2}\log\det(I_d + R^{-1}\T{X_S}X_S)\notag
-\end{align}
+If the general case where the prior distribution of the experimenter on the
+model $\beta$ in \eqref{model} is not homotropic and has a generic covariance
+matrix $R$, the value function takes the general form given by
+\eqref{dcrit}.
 
 Applying the mechanism described in algorithm~\ref{mechanism} and adapting the
 analysis of the approximation ratio, we get the following result which extends
@@ -51,7 +16,7 @@ Theorem~\ref{thm:main}:
 
 \begin{theorem}
  There exists a truthful, individually rational and budget feasible mechanism for the objective
- function $\tilde{V}$ given by \eqref{eq:normalized}. Furthermore, for any $\varepsilon
+ function $V$ given by \eqref{dcrit}. Furthermore, for any $\varepsilon
  > 0$, in time $O(\text{poly}(n, d, \log\log \varepsilon^{-1}))$,
  the algorithm computes a set $S^*$ such that:
  \begin{displaymath}
@@ -61,11 +26,15 @@ Theorem~\ref{thm:main}:
 where $\mu$ is the smallest eigenvalue of $R$.
 \end{theorem}
 
-\subsection{$D$-Optimality and Beyond}
+\subsection{Other Experimental Design Criteria}
 
-We now reexamine the classical $D$-optimality in \eqref{dcrit}, which is given by objective ~\eqref{bayesianobjective} with $R$ replaced by
-the zero matrix. 
-Since  \eqref{dcrit} may take arbitrarily small negative values, to define a meaningful approximation one would consider  the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$. %, for some strictly increasing, on-to function $f:\reals_+\to\reals_+$. 
+A value function which is frequently used in experimental design is the
+$D$-optimality criterion obtained by replacing $R$ by the zero matrix in
+\eqref{dcrit}:
+\begin{equation}\label{eq:d-optimal}
+V(S) = \log\det(X_S^TX_S)
+\end{equation}
+Since \eqref{eq:d-optimal} may take arbitrarily small negative values, to define a meaningful approximation one would consider  the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$. %, for some strictly increasing, on-to function $f:\reals_+\to\reals_+$. 
 However, the following lower bound implies that such an optimization goal cannot be attained under the constraints of  truthfulness, budget feasibility, and individual rationality.
 \begin{lemma}
 For any $M>1$, there is no $M$-approximate, truthful, budget feasible, individually rational mechanism for a budget feasible reverse auction with value function $V(S) = \det{\T{X_S}X_S}$.
@@ -96,22 +65,27 @@ discriminant analysis (LDA); we give a few concrete examples below:
 \item\textbf{Learning Binary Functions with Bernoulli Noise.} $\Omega = \{0,1\}^d$, and $\mathcal{H}$ is some subset of $2^{\Omega}$, and $$\varepsilon_i =\begin{cases}0, &\text{w.~prob.}\;p\\\bar{h}(x_i)-h(x_i), &\text{w.~prob.}\;1-p\end{cases}$$
 \end{enumerate}
 
-In this setup, assume that the experimenter has a prior distribution on the hypothesis $h\in \mathcal{H}$. Then, the information gain objective can be written again as the mutual information between $\beta$ and $y_S$.
+In this setup, assuming that the experimenter has a prior distribution on the
+hypothesis $h\in \mathcal{H}$, the information gain objective can be written
+again as the mutual information between $h$ and $y_S$.
 \begin{align}\label{general}
-V(S) = \entropy(\beta) -\entropy(\beta\mid y_S),\quad S\subseteq\mathcal{N} 
+V(S) = \entropy(h) -\entropy(h\mid y_S),\quad S\subseteq\mathcal{N} 
 \end{align}
-This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$. Though, in general, mutual information is not a submodular function, this specific setup leads indeed to a submodular formulation.
+This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$.
+Though, in general, mutual information is not a submodular function, this
+specific setup leads to a submodular formulation.
+
 \begin{lemma}
 The value function given by the information gain \eqref{general} is submodular.
 \end{lemma}
 \begin{proof}
 Using the chain rule for the conditional entropy we get:
 \begin{equation}\label{eq:chain-rule}
-    V(S) = H(y_S) - H(y_S \mid \beta) 
-    = H(y_S) - \sum_{i\in S} H(y_i \mid \beta)
+    V(S) = H(y_S) - H(y_S \mid h) 
+    = H(y_S) - \sum_{i\in S} H(y_i \mid h)
 \end{equation}
 where the second equality comes from the independence of the $y_i$'s
-conditioned on $\beta$. Recall that the joint entropy of a set of random
+conditioned on $h$. Recall that the joint entropy of a set of random
 variables is a submodular function. Thus, the value function is written in
 \eqref{eq:chain-rule} as the sum of a submodular function and a modular function.
 \end{proof}
@@ -134,13 +108,12 @@ the following corollary:
     an approximation ratio of $8.34$. 
 \end{corollary}
 
-Note however that, in many scenarios covered by
-this model (including the last two examples above), even computing the entropy
-under a given set might be a hard task---\emph{i.e.}, the value query model may
-not apply. Hence, identifying learning tasks in the above class for which
-truthful or universally truthful constant approximation mechanisms exist, or
-studying these problems in the context of stronger query models such as the
-demand model \cite{dobz2011-mechanisms,bei2012budget} remains an interesting
-open question. 
+Note however that, in many scenarios covered by this model (including the last
+two examples above), even computing the entropy under a given set might be
+a hard task---\emph{i.e.}, the value query model may not apply. Hence,
+identifying learning tasks in the above class for which truthful or universally
+truthful constant approximation mechanisms exist, or studying these problems in
+the context of stronger query models such as the demand model
+\cite{dobz2011-mechanisms,bei2012budget} remains an interesting open question. 
 
 %TODO: Independent noise model. Captures models such as logistic regression, classification, etc. Arbitrary prior. Show that change in the entropy is submodular (cite Krause, Guestrin).