\subsection{Bayesian Experimental Design}\label{sec:bed}
In this section, we extend our results to Bayesian experimental design \cite{chaloner1995bayesian}. We show that the objective function \eqref{modified} has a natural interpretation in this context, further motivating its selection as our objective. Moreover, we extend Theorem~\ref{thm:main} to a more general Bayesian setting.

In the Bayesian setting, it is assumed that the experimenter has a prior distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d\times d}$, where $R$ is a positive definite matrix and $\sigma^2$ is the noise variance.
The experimenter estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following minimization \cite{hastie}:
\begin{displaymath}
    \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
    + \T{\beta}R\beta
\end{displaymath}
This optimization, commonly known as \emph{ridge regression}, differs from the least squares estimation \eqref{leastsquares} by the additional penalty term $\T{\beta}R\beta$.
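To see where the ridge objective comes from, write $p(\beta\mid y_S)$ for the posterior density of $\beta$ given the observations $y_S$ (a notation used only for this remark). Under the Gaussian noise and prior assumptions above, a standard computation gives
\begin{displaymath}
    -\log p(\beta\mid y_S) = \frac{1}{2\sigma^2}\sum_i (y_i - \T{\beta}x_i)^2
    + \frac{1}{2\sigma^2}\T{\beta}R\beta + \mathrm{const},
\end{displaymath}
where the constant does not depend on $\beta$; minimizing the right-hand side over $\beta$ is equivalent to the ridge regression problem above.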


Let $\entropy(\beta)$ be the entropy of $\beta$ under the prior distribution, and $\entropy(\beta\mid y_S)$ the entropy of $\beta$ conditioned on the experiment outcomes $y_S$, for some $S\subseteq \mathcal{N}$. In this setting, a natural objective, originally proposed by Lindley \cite{lindley1956measure}, is to select a set of experiments $S$ that maximizes the \emph{information gain}:
$$ I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S). $$

Assuming normal noise variables, the information gain is equal, up to an additive constant that does not depend on $S$, to the following value function \cite{chaloner1995bayesian}:
\begin{align}
V(S)  = \frac{1}{2}\log\det(R + \T{X_S}X_S)\label{bayesianobjective}
\end{align}
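Indeed, under the assumptions above the posterior of $\beta$ given $y_S$ is Gaussian with covariance $\sigma^2(R + \T{X_S}X_S)^{-1}$, and the differential entropy of a Gaussian is determined by the log-determinant of its covariance. A direct computation therefore gives
\begin{displaymath}
    I(\beta; y_S) = \frac{1}{2}\log\det\left(R + \T{X_S}X_S\right) - \frac{1}{2}\log\det R,
\end{displaymath}
which coincides with \eqref{bayesianobjective} up to the constant $\frac{1}{2}\log\det R$.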
Our objective \eqref{modified} follows from \eqref{bayesianobjective} by setting $R=I_d$. Hence, our optimization can be interpreted as maximizing the information gain when the prior distribution has covariance $\sigma^2 I_d$ and the experimenter solves a ridge regression problem with penalty term $\norm{\beta}_2^2$.

Moreover, our results can be extended to the general Bayesian case by replacing $I_d$ with a positive definite matrix $R$.


When a general matrix $R$ appears in the value function, it is natural to study the modified value function
\begin{displaymath}
    \tilde{V}(S) = \frac{1}{2}\log\det(R + \T{X_S}X_S) - \frac{1}{2}\log\det R,
\end{displaymath}
which is exactly the information gain $I(\beta; y_S)$, for two reasons:
\begin{itemize}
    \item $\tilde{V}(\emptyset) = 0$: the value function is normalized, a standard assumption in mechanism design.
    \item $\tilde{V}(S) = \frac{1}{2}\log\det(I_d + R^{-1}\T{X_S}X_S)$, since $\det(R + \T{X_S}X_S) = \det R \cdot \det(I_d + R^{-1}\T{X_S}X_S)$; hence our results apply and yield an $\alpha$-approximation ratio for $\tilde{V}$, with the value of $\alpha$ given in the theorem below. If we work with $V$ instead of $\tilde{V}$, then
    \begin{displaymath}
        V(S) = \frac{1}{2}\log\det R + \tilde{V}(S),
    \end{displaymath}
    and thus
    \begin{displaymath}
        OPT(V) = \frac{1}{2}\log\det R + OPT(\tilde{V}).
    \end{displaymath}
    We can find a set $S$ such that $OPT(\tilde{V}) \leq \alpha \tilde{V}(S)$, so that
    \begin{displaymath}
        OPT(V) \leq \frac{1}{2}\log\det R + \alpha\tilde{V}(S).
    \end{displaymath}
    This, however, does not yield an $\alpha$-approximation ratio for $V$, because $\log\det R$ can be negative; it only gives an \emph{asymptotic} approximation ratio.
\end{itemize}

\begin{theorem}
 For the function $\tilde{V}$ defined above, there is a truthful, budget
 feasible mechanism which achieves an approximation ratio of:
 \begin{displaymath}
    \frac{5e-1}{e-1}\frac{2\mu}{\log(1+\mu)} + 5.1
 \end{displaymath}
where $\mu$ is the smallest eigenvalue of $R$.
\end{theorem}

Note that the situation improves when $R \geq I_d$: in this case, the smallest eigenvalue of $R$ is at least 1, hence $\log\det R\geq 0$, and an approximation ratio for $\tilde{V}$ yields the same approximation ratio for $V$ (see the discussion above).
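To make the last claim precise, suppose $\log\det R \geq 0$ and $S$ satisfies $OPT(\tilde{V}) \leq \alpha\tilde{V}(S)$ for some $\alpha\geq 1$; then
\begin{displaymath}
    OPT(V) = \frac{1}{2}\log\det R + OPT(\tilde{V})
    \leq \frac{1}{2}\log\det R + \alpha\tilde{V}(S)
    \leq \alpha\left(\frac{1}{2}\log\det R + \tilde{V}(S)\right) = \alpha V(S).
\end{displaymath}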

\subsection{Beyond Linear Models}
Selecting experiments that maximize the information gain in the Bayesian setup leads to a natural generalization to learning problems beyond linear regression. In particular, suppose that the measurements $y_i$ are conditionally independent given $\beta$, with an arbitrary prior on $\beta$; this independent noise model captures settings such as logistic regression and classification. In this case, the information gain $I(\beta; y_S)$, \emph{i.e.}, the reduction in the entropy of $\beta$, remains a submodular function of $S$, as shown by Krause and Guestrin \cite{krause2005near}.
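Here, submodularity means that the marginal value of an additional experiment can only decrease as the selected set grows: for all $S\subseteq T\subseteq \mathcal{N}$ and every experiment $i\notin T$,
\begin{displaymath}
    I(\beta; y_{S\cup\{i\}}) - I(\beta; y_S) \;\geq\; I(\beta; y_{T\cup\{i\}}) - I(\beta; y_T).
\end{displaymath}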