| -rw-r--r-- | general.tex | 91 |
1 files changed, 32 insertions, 59 deletions
diff --git a/general.tex b/general.tex index 5162887..3c7eddf 100644 --- a/general.tex +++ b/general.tex @@ -1,49 +1,14 @@ -\subsection{Bayesian Experimental Design}\label{sec:bed} +\subsection{Strategic Experimental Design with non-homotropic prior}\label{sec:bed} %In this section, we extend our results to Bayesian experimental design %\cite{chaloner1995bayesian}. We show that objective function \eqref{modified} %has a natural interpretation in this context, further motivating its selection %as our objective. Moreover, -We extend Theorem~\ref{thm:main} to a more general -Bayesian setting, where it is assumed that the experimenter \E\ has a {\em prior} -distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior -with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance). -\E\ estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization \cite{hastie}: -\begin{displaymath} - \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2 - + \T{\beta}R\beta -\end{displaymath} -This optimization, commonly known as \emph{ridge regression}, includes an additional penalty term compared to the least squares estimation \eqref{leastsquares}. - -Let $\entropy(\beta)$ be the entropy of $\beta$ under this distribution, and -$\entropy(\beta\mid y_S)$ the entropy of $\beta$ conditioned on the experiment -outcomes $y_S$, for some $S\subseteq \mathcal{N}$. In this setting, a natural -objective, originally proposed by Lindley \cite{lindley1956measure}, is to -select a set of experiments $S$ that maximizes her \emph{information gain}: -\begin{displaymath} - I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S). 
-\end{displaymath} -Assuming normal noise variables, the information gain is equal (up to a constant) to the following value function \cite{chaloner1995bayesian}: -\begin{align} -V(S) = \frac{1}{2}\log\det(R + \T{X_S}X_S)\label{bayesianobjective} -\end{align} -Our objective for \EDP\ -%\eqref{modified} -clearly follows from \eqref{bayesianobjective} -by setting $R=I_d$. Hence, the optimization discussed thus far can be interpreted as -a maximization of the information gain when the prior distribution has -a covariance $\sigma^2 I_d$, and the experimenter is solving a ridge regression -problem with penalty term $\norm{\beta}_2^2$. - -Our results can be extended to the general Bayesian case, by -replacing $I_d$ with the positive semidefinite matrix $R$. First, we re-set the -origin of the value function so that $V(\emptyset) = 0$: -\begin{align}\label{eq:normalized} - \tilde{V}(S) - & = \frac{1}{2}\log\det(R + \T{X_S}X_S) - \frac{1}{2}\log\det R\\ - & = \frac{1}{2}\log\det(I_d + R^{-1}\T{X_S}X_S)\notag -\end{align} +In the general case where the prior distribution of the experimenter on the +model $\beta$ in \eqref{model} is not homotropic and has a generic covariance +matrix $R$, the value function takes the general form given by +\eqref{dcrit}. Applying the mechanism described in algorithm~\ref{mechanism} and adapting the analysis of the approximation ratio, we get the following result which extends @@ -51,7 +16,7 @@ Theorem~\ref{thm:main}: \begin{theorem} There exists a truthful, individually rational and budget feasible mechanism for the objective - function $\tilde{V}$ given by \eqref{eq:normalized}. Furthermore, for any $\varepsilon + function $V$ given by \eqref{dcrit}. Furthermore, for any $\varepsilon > 0$, in time $O(\text{poly}(n, d, \log\log \varepsilon^{-1}))$, the algorithm computes a set $S^*$ such that: \begin{displaymath} @@ -61,11 +26,15 @@ Theorem~\ref{thm:main}: where $\mu$ is the smallest eigenvalue of $R$. 
\end{theorem} -\subsection{$D$-Optimality and Beyond} +\subsection{Other Experimental Design Criteria} -We now reexamine the classical $D$-optimality in \eqref{dcrit}, which is given by objective ~\eqref{bayesianobjective} with $R$ replaced by -the zero matrix. -Since \eqref{dcrit} may take arbitrarily small negative values, to define a meaningful approximation one would consider the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$. %, for some strictly increasing, on-to function $f:\reals_+\to\reals_+$. +A value function which is frequently used in experimental design is the +$D$-optimality criterion obtained by replacing $R$ by the zero matrix in +\eqref{dcrit}: +\begin{equation}\label{eq:d-optimal} +V(S) = \log\det(X_S^TX_S) +\end{equation} +Since \eqref{eq:d-optimal} may take arbitrarily small negative values, to define a meaningful approximation one would consider the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$. %, for some strictly increasing, on-to function $f:\reals_+\to\reals_+$. However, the following lower bound implies that such an optimization goal cannot be attained under the constraints of truthfulness, budget feasibility, and individual rationality. \begin{lemma} For any $M>1$, there is no $M$-approximate, truthful, budget feasible, individually rational mechanism for a budget feasible reverse auction with value function $V(S) = \det{\T{X_S}X_S}$. @@ -96,22 +65,27 @@ discriminant analysis (LDA); we give a few concrete examples below: \item\textbf{Learning Binary Functions with Bernoulli Noise.} $\Omega = \{0,1\}^d$, and $\mathcal{H}$ is some subset of $2^{\Omega}$, and $$\varepsilon_i =\begin{cases}0, &\text{w.~prob.}\;p\\\bar{h}(x_i)-h(x_i), &\text{w.~prob.}\;1-p\end{cases}$$ \end{enumerate} -In this setup, assume that the experimenter has a prior distribution on the hypothesis $h\in \mathcal{H}$. Then, the information gain objective can be written again as the mutual information between $\beta$ and $y_S$. 
+In this setup, assuming that the experimenter has a prior distribution on the +hypothesis $h\in \mathcal{H}$, the information gain objective can be written +again as the mutual information between $h$ and $y_S$. \begin{align}\label{general} -V(S) = \entropy(\beta) -\entropy(\beta\mid y_S),\quad S\subseteq\mathcal{N} +V(S) = \entropy(h) -\entropy(h\mid y_S),\quad S\subseteq\mathcal{N} \end{align} -This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$. Though, in general, mutual information is not a submodular function, this specific setup leads indeed to a submodular formulation. +This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$. +Though, in general, mutual information is not a submodular function, this +specific setup leads to a submodular formulation. + \begin{lemma} The value function given by the information gain \eqref{general} is submodular. \end{lemma} \begin{proof} Using the chain rule for the conditional entropy we get: \begin{equation}\label{eq:chain-rule} - V(S) = H(y_S) - H(y_S \mid \beta) - = H(y_S) - \sum_{i\in S} H(y_i \mid \beta) + V(S) = H(y_S) - H(y_S \mid h) + = H(y_S) - \sum_{i\in S} H(y_i \mid h) \end{equation} where the second equality comes from the independence of the $y_i$'s -conditioned on $\beta$. Recall that the joint entropy of a set of random +conditioned on $h$. Recall that the joint entropy of a set of random variables is a submodular function. Thus, the value function is written in \eqref{eq:chain-rule} as the sum of a submodular function and a modular function. \end{proof} @@ -134,13 +108,12 @@ the following corollary: an approximation ratio of $8.34$. \end{corollary} -Note however that, in many scenarios covered by -this model (including the last two examples above), even computing the entropy -under a given set might be a hard task---\emph{i.e.}, the value query model may -not apply. 
Hence, identifying learning tasks in the above class for which -truthful or universally truthful constant approximation mechanisms exist, or -studying these problems in the context of stronger query models such as the -demand model \cite{dobz2011-mechanisms,bei2012budget} remains an interesting -open question. +Note however that, in many scenarios covered by this model (including the last +two examples above), even computing the entropy under a given set might be +a hard task---\emph{i.e.}, the value query model may not apply. Hence, +identifying learning tasks in the above class for which truthful or universally +truthful constant approximation mechanisms exist, or studying these problems in +the context of stronger query models such as the demand model +\cite{dobz2011-mechanisms,bei2012budget} remains an interesting open question. %TODO: Independent noise model. Captures models such as logistic regression, classification, etc. Arbitrary prior. Show that change in the entropy is submodular (cite Krause, Guestrin).
