\subsection{Strategic Experimental Design with a Non-homotropic Prior}\label{sec:bed}
In the general case where the prior distribution of the experimenter on the model $\beta$ in \eqref{model} is not homotropic but has a generic covariance matrix $R$, the value function takes the general form given by \eqref{dcrit}. Applying the mechanism described in Algorithm~\ref{mechanism} and adapting the analysis of the approximation ratio, we obtain the following result, which extends Theorem~\ref{thm:main}:
\begin{theorem}
There exists a truthful, individually rational, and budget feasible mechanism for the objective function $V$ given by \eqref{dcrit}. Furthermore, for any $\varepsilon > 0$, in time $O(\text{poly}(n, d, \log\log \varepsilon^{-1}))$, the algorithm computes a set $S^*$ such that
\begin{displaymath}
OPT \leq \frac{5e-1}{e-1}\frac{2\mu}{\log(1+\mu)}V(S^*) + 5.1 + \varepsilon,
\end{displaymath}
where $\mu$ is the smallest eigenvalue of $R$.
\end{theorem}

\subsection{Other Experimental Design Criteria}
A value function frequently used in experimental design is the $D$-optimality criterion, obtained by replacing $R$ with the zero matrix in \eqref{dcrit}:
\begin{equation}\label{eq:d-optimal}
V(S) = \log\det(\T{X_S}X_S).
\end{equation}
Since \eqref{eq:d-optimal} may take arbitrarily small negative values, to define a meaningful approximation one would consider the (equivalent) maximization of $V(S) = \det(\T{X_S}X_S)$. However, the following lower bound implies that such an optimization goal cannot be attained under the constraints of truthfulness, budget feasibility, and individual rationality.
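As a concrete numerical illustration, the criterion can be evaluated with a few lines of linear algebra. The sketch below is a minimal Python rendition, assuming \eqref{dcrit} takes the standard Bayesian $D$-optimal form $V(S)=\log\det(R+\T{X_S}X_S)$ (which is consistent with recovering \eqref{eq:d-optimal} when $R=0$); the function name and the data are purely illustrative.

```python
import numpy as np

def bayes_d_value(X, S, R):
    """Assumed Bayesian D-criterion V(S) = log det(R + X_S^T X_S).

    With R = 0 this reduces to the classical D-optimality
    objective; the Gram matrix is then singular whenever
    |S| < d, and the value is -inf (log of a zero determinant).
    """
    Xs = X[list(S)]                    # rows of X indexed by S
    sign, logdet = np.linalg.slogdet(R + Xs.T @ Xs)
    return logdet if sign > 0 else float("-inf")

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))        # n = 6 experiments, d = 3
R = 0.5 * np.eye(3)                    # prior covariance, smallest eigenvalue mu = 0.5

v_full = bayes_d_value(X, {0, 1, 2, 3}, R)          # finite: R + X_S^T X_S is PD
v_rank_def = bayes_d_value(X, {0, 1}, np.zeros((3, 3)))  # rank-deficient, R = 0: -inf
print(v_full, v_rank_def)
```

Note how, with $R=0$, any set of fewer than $d$ experiments already drives the objective to $-\infty$, illustrating why \eqref{eq:d-optimal} is unbounded below.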
\begin{lemma}
For any $M>1$, there is no $M$-approximate, truthful, budget feasible, individually rational mechanism for a budget feasible reverse auction with value function $V(S) = \det(\T{X_S}X_S)$.
\end{lemma}
\begin{proof}
\input{proof_of_lower_bound1}
\end{proof}
Beyond $D$-optimality, several other objectives, such as $E$-optimality (maximizing the smallest eigenvalue of $\T{X_S}X_S$) or $T$-optimality (maximizing $\mathrm{trace}(\T{X_S}X_S)$), are encountered in the literature \cite{pukelsheim2006optimal}, though they do not relate to entropy as $D$-optimality does. We leave the maximization of such objectives from a strategic point of view as an open problem.

\subsection{Beyond Linear Models}
Selecting experiments that maximize the information gain in the Bayesian setup generalizes naturally to learning settings beyond linear regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$, take values in some general set $\Omega$, called the \emph{query space}. Measurements $y_i\in\reals$ are given by
\begin{equation}\label{eq:hypothesis-model}
y_i = h(x_i) + \varepsilon_i,
\end{equation}
where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings $h:\Omega\to\reals$, called the \emph{hypothesis space}, and the $\varepsilon_i$ are real-valued random variables, not necessarily identically distributed, that are independent \emph{conditioned on $h$}. This model is quite broad and captures many learning tasks, such as support vector machines (SVM) and linear discriminant analysis (LDA); we give a few concrete examples below:
\begin{enumerate}
\item\textbf{Generalized Linear Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of linear maps $\{h(x) = \T{\beta}x \text{ s.t. } \beta\in \reals^d\}$, and the $\varepsilon_i$ are independent zero-mean normal variables with $\expt{\varepsilon_i^2}=\sigma_i^2$.
\item\textbf{Logistic Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of maps $\{h(x) = \frac{e^{\T{\beta} x}}{1+e^{\T{\beta} x}} \text{ s.t. } \beta\in\reals^d\}$, and the $\varepsilon_i$ are independent conditioned on $h$ with
$$\varepsilon_i=\begin{cases} 1- h(x_i),& \text{w.~prob.}\;h(x_i)\\-h(x_i),&\text{w.~prob.}\;1-h(x_i)\end{cases}$$
\item\textbf{Learning Binary Functions with Bernoulli Noise.} $\Omega = \{0,1\}^d$, $\mathcal{H}$ is some subset of $2^{\Omega}$, and
$$\varepsilon_i =\begin{cases}0, &\text{w.~prob.}\;p\\\bar{h}(x_i)-h(x_i), &\text{w.~prob.}\;1-p\end{cases}$$
where $\bar{h}(x) = 1 - h(x)$ denotes the pointwise complement of $h$.
\end{enumerate}
In this setup, assuming that the experimenter has a prior distribution on the hypothesis $h\in \mathcal{H}$, the information gain objective can again be written as the mutual information between $h$ and $y_S$:
\begin{align}\label{general}
V(S) = \entropy(h) -\entropy(h\mid y_S),\quad S\subseteq\mathcal{N}.
\end{align}
This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$. Although mutual information is not, in general, a submodular function, this specific setup leads to a submodular formulation.
\begin{lemma}
The value function given by the information gain \eqref{general} is submodular.
\end{lemma}
\begin{proof}
By the symmetry of mutual information and the chain rule for conditional entropy, we get
\begin{equation}\label{eq:chain-rule}
V(S) = \entropy(y_S) - \entropy(y_S \mid h) = \entropy(y_S) - \sum_{i\in S} \entropy(y_i \mid h),
\end{equation}
where the second equality follows from the independence of the $y_i$'s conditioned on $h$. Recall that the joint entropy of a set of random variables is a submodular function. Thus, the value function in \eqref{eq:chain-rule} is the sum of a submodular function and a modular function, and is therefore submodular.
\end{proof}
This lemma implies that learning an \emph{arbitrary hypothesis under an arbitrary prior}, when the noise is conditionally independent, leads to a submodular value function. Hence, we can apply the previously known results to get the following corollary:
\begin{corollary}
For Bayesian experimental design with the objective given by the information gain \eqref{general}, there exists a randomized, budget feasible, individually rational, and universally truthful mechanism. This mechanism has complexity $O\big(\text{poly}(n,d)\big)$ and an approximation ratio of $7.91$. In cases where maximizing \eqref{general} can be done in polynomial time in the full-information setup, there exists a randomized, budget feasible, individually rational, and truthful mechanism for Bayesian experimental design. This mechanism has complexity $O\big(\text{poly}(n,d)\big)$ and an approximation ratio of $8.34$.
\end{corollary}
Note, however, that in many scenarios covered by this model (including the last two examples above), even computing the entropy for a given set might be hard; \emph{i.e.}, the value query model may not apply. Hence, identifying learning tasks in the above class for which truthful or universally truthful constant-approximation mechanisms exist, or studying these problems in the context of stronger query models such as the demand model \cite{dobz2011-mechanisms,bei2012budget}, remains an interesting open question.
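The submodularity of the information gain can also be checked numerically in the one case where it has a simple closed form: the Gaussian linear model of the first example. There, with $\beta\sim\mathcal{N}(0,R)$ and (for simplicity in this sketch) a common noise variance $\sigma^2$, the mutual information is $V(S)=\tfrac12\log\det(I + X_S R \T{X_S}/\sigma^2)$. The script below exhaustively verifies diminishing marginal gains on a small random instance; all names and data are illustrative.

```python
import numpy as np
from itertools import combinations

def info_gain(X, S, R, sigma2=1.0):
    """I(beta; y_S) for y_i = x_i^T beta + eps_i, beta ~ N(0, R),
    eps_i ~ N(0, sigma2): V(S) = 1/2 log det(I + X_S R X_S^T / sigma2)."""
    Xs = X[list(S)]
    M = np.eye(len(S)) + Xs @ R @ Xs.T / sigma2
    return 0.5 * np.linalg.slogdet(M)[1]   # V(empty) = 0: det of 0x0 matrix is 1

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))            # n = 5 experiments, d = 3
R = np.eye(3)                              # prior covariance

# Diminishing returns: V(S+i) - V(S) >= V(T+i) - V(T) for all S subset of T.
for size in range(4):
    for S in combinations(range(4), size):
        for extra in set(range(4)) - set(S):
            T = set(S) | {extra}
            for i in set(range(5)) - T:
                gain_S = info_gain(X, set(S) | {i}, R) - info_gain(X, set(S), R)
                gain_T = info_gain(X, T | {i}, R) - info_gain(X, T, R)
                assert gain_S >= gain_T - 1e-9
print("diminishing marginal gains hold on all tested chains")
```

The check mirrors the proof of the lemma: the Gaussian joint entropy $\entropy(y_S)$ is submodular, and subtracting the modular term $\sum_{i\in S}\entropy(y_i\mid h)$ preserves submodularity.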