-rw-r--r--  general.tex  | 125
-rw-r--r--  notes.bib    |   1
2 files changed, 72 insertions(+), 54 deletions(-)
diff --git a/general.tex b/general.tex
index 12f88e8..4f9aaad 100644
--- a/general.tex
+++ b/general.tex
@@ -1,71 +1,61 @@
\subsection{Bayesian Experimental Design}\label{sec:bed}
-In this section, we extend our results to Bayesian experimental design \cite{chaloner1995bayesian}. We show that objective function \eqref{modified} has a natural interpration in this context, further motivating its selection as our objective. Moreover, we extend Theorem~\ref{thm:main} to a more general Bayesian setting.
+In this section, we extend our results to Bayesian experimental design
+\cite{chaloner1995bayesian}. We show that objective function \eqref{modified}
+has a natural interpretation in this context, further motivating its selection
+as our objective. Moreover, we extend Theorem~\ref{thm:main} to a more general
+Bayesian setting.
In the Bayesian setting, it is assumed that the experimenter has a prior distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d\times d}$, where $\sigma^2$ is the noise variance and $R$ is a positive definite matrix.
-The experimenter estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization \cite{hastie}: FIX!
+The experimenter estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following minimization \cite{hastie}:
\begin{displaymath}
\hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
+ \T{\beta}R\beta
\end{displaymath}
This optimization, commonly known as \emph{ridge regression}, includes an additional penalty term compared to the least squares estimation \eqref{leastsquares}.
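For concreteness, here is a short informal sketch of how this penalized objective arises, assuming the Gaussian noise and the Gaussian prior above (with covariance $\sigma^2R^{-1}$). Taking logarithms in Bayes' rule,
\begin{displaymath}
 -2\sigma^2\log p(\beta \mid y_S) = \sum_i (y_i - \T{\beta}x_i)^2 + \T{\beta}R\beta + \mathrm{const},
\end{displaymath}
so maximizing the posterior is equivalent to minimizing the penalized least-squares objective above.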
-
-Let $\entropy(\beta)$ be the entropy of $\beta$ under this distribution, and $\entropy(\beta\mid y_S)$ the entropy of $\beta$ conditioned on the experiment outcomes $Y_S$, for some $S\subseteq \mathcal{N}$. In this setting, a natural objective, originally proposed by Lindley \cite{lindley1956measure}, is to select a set of experiments $S$ that maximizes her \emph{information gain}:
-$$ I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S). $$
-
+Let $\entropy(\beta)$ be the entropy of $\beta$ under this distribution, and
+$\entropy(\beta\mid y_S)$ the entropy of $\beta$ conditioned on the experiment
+outcomes $y_S$, for some $S\subseteq \mathcal{N}$. In this setting, a natural
+objective, originally proposed by Lindley \cite{lindley1956measure}, is to
+select a set of experiments $S$ that maximizes her \emph{information gain}:
+\begin{displaymath}
+ I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S).
+\end{displaymath}
Assuming normal noise variables, the information gain is equal (up to a constant) to the following value function \cite{chaloner1995bayesian}:
\begin{align}
V(S) = \frac{1}{2}\log\det(R + \T{X_S}X_S)\label{bayesianobjective}
\end{align}
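A brief sketch of why this holds: under the above assumptions the posterior covariance of $\beta$ given $y_S$ is $\sigma^2(R + \T{X_S}X_S)^{-1}$, and the differential entropy of a $d$-dimensional Gaussian with covariance $\Sigma$ is $\frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)$, so that
\begin{align*}
 I(\beta;y_S) &= \entropy(\beta)-\entropy(\beta\mid y_S)
 = \frac{1}{2}\log\det\left(\sigma^2R^{-1}\right)
   - \frac{1}{2}\log\det\left(\sigma^2(R + \T{X_S}X_S)^{-1}\right)\\
 &= \frac{1}{2}\log\det(R + \T{X_S}X_S) - \frac{1}{2}\log\det R,
\end{align*}
where the second term does not depend on $S$; this is the constant mentioned above.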
-Our objective \eqref{,,,} clearly follows from \eqref{bayesianobjective} by setting $R=I_d$. Hence, our optimization can be interpreted as a maximization of the information gain when the prior distribution has a covariance $\sigma^2 I_d$, and the experimenter is solving a ridge regression problem with penalty term $\norm{x}_2^2$.
+Our objective \eqref{modified} clearly follows from \eqref{bayesianobjective}
+by setting $R=I_d$. Hence, our optimization can be interpreted as
+a maximization of the information gain when the prior distribution has
+a covariance $\sigma^2 I_d$, and the experimenter is solving a ridge regression
+problem with penalty term $\norm{\beta}_2^2$.
-Moreover, our results can be extended to the general Bayesian case, by replacing $I_d$ with the positive semidefinite matrix $R$:
-
-\thibaut{Discussion about the value function below}
-\stratis{The text below is unpolished/not written for external consumption. Rather than trying to motivate $R>I$, it is probably better to figure out a lower bound.}
+Moreover, our results extend to the general Bayesian case, in which $I_d$ is
+replaced by a positive definite matrix $R$. First, we set the
+origin of the value function so that $V(\emptyset) = 0$; we define:
+\begin{align}\label{eq:normalized}
+ \tilde{V}(S)
+ & = \frac{1}{2}\log\det(R + \T{X_S}X_S) - \frac{1}{2}\log\det R\\
+ & = \frac{1}{2}\log\det(I_d + R^{-1}\T{X_S}X_S)\notag
+\end{align}
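Although not needed for the analysis, $\tilde{V}$ is straightforward to evaluate numerically. The snippet below is a minimal illustrative sketch (assuming NumPy; \texttt{X\_S} holds the selected feature vectors $x_i$ as rows, and $R$ is positive definite):
\begin{verbatim}
import numpy as np

def normalized_value(X_S, R):
    """Evaluate tilde{V}(S) = 1/2 * log det(I_d + R^{-1} X_S^T X_S).

    X_S : (|S|, d) array whose rows are the selected feature vectors x_i.
    R   : (d, d) positive definite matrix.
    """
    d = R.shape[0]
    M = np.eye(d) + np.linalg.solve(R, X_S.T @ X_S)
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * logdet

# Sanity checks: the empty set has value 0, and adding rows never decreases it.
rng = np.random.default_rng(0)
X, R = rng.normal(size=(5, 3)), 2.0 * np.eye(3)
assert abs(normalized_value(X[:0], R)) < 1e-12
assert normalized_value(X[:2], R) <= normalized_value(X[:4], R)
\end{verbatim}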
-When there is an $R$ in the value function, it seems to make more sense to
-study the modified value function:
-\begin{displaymath}
- \tilde{V}(S) = \frac{1}{2}\log\det(R + \T{X_S}X_S) - \frac{1}{2}\log\det R
-\end{displaymath}
-For two reasons:
-\begin{itemize}
- \item $\tilde{V}(\emptyset) = 0$: the value function is normalized, I think
- this assumption is needed somewhere in mechanism design.
- \item $\tilde{V}(S) = \frac{1}{2}\log\det(I_d + R^{-1}\T{X_S}X_S)$, so we
- can apply our result to get an $\alpha$ approximation ratio (see the value
- of $\alpha$ below). If we take $V$ instead of $\tilde{V}$ then one can
- write:
- \begin{displaymath}
- V(S) = \frac{1}{2}\log\det R + \tilde{V}(S)
- \end{displaymath}
- thus:
- \begin{displaymath}
- OPT(V) = \frac{1}{2}\log\det R + OPT(\tilde{V})
- \end{displaymath}
- we can find $S^*$ such that $OPT(\tilde{V}) \leq \alpha \tilde{V}(S)$, so:
- \begin{displaymath}
- OPT(V) = \frac{1}{2}\log\det R + \alpha\tilde{V}(S)
- \end{displaymath}
- But this does not give an $\alpha$ approximation ratio for $V$, because
- $\log\det R$ can be negative. This is only an \emph{asymptotic}
- approximation ratio\ldots.
-\end{itemize}
+Applying the mechanism described in Algorithm~\ref{mechanism} and adapting the
+analysis of the approximation ratio, we obtain the following result, which
+extends Theorem~\ref{thm:main}:
\begin{theorem}
- For the function $\tilde{V}$ defined above, there is a truthful, budget
- feasible mechanism which achieves an approximation ratio of:
+ There exists a truthful and budget-feasible mechanism for the objective
+ function $\tilde{V}$ \eqref{eq:normalized}. Furthermore, for any $\varepsilon
+ > 0$, in time $O(\text{poly}(|\mathcal{N}|, d, \log\log \varepsilon^{-1}))$,
+ this mechanism computes a set $S^*$ such that:
\begin{displaymath}
- \frac{5e-1}{e-1}\frac{2\mu}{\log(1+\mu)} + 5.1
- \end{displaymath}
+ OPT(\tilde{V}, \mathcal{N}, B) \leq
+ \frac{5e-1}{e-1}\frac{2\mu}{\log(1+\mu)}\,\tilde{V}(S^*) + 5.1 + \varepsilon
+ \end{displaymath}
where $\mu$ is the smallest eigenvalue of $R$.
\end{theorem}
-Note that everything becomes nice when $R \geq I_d$. In this case, the smallest
-eigenvalue is larger than 1. Hence $\log\det R\geq 0$ and an approximation on
-$\tilde{V}$ gives an approximation ration on $V$ (see discussion above).
-
\subsection{Beyond Linear Models}
Selecting experiments that maximize the information gain in the Bayesian setup leads to a natural generalization to learning problems beyond linear regression. In particular, in the more general PAC learning setup \cite{...}, the features $x_i$, $i\in \mathcal{N}$, take values in some general set $\Omega$, called the \emph{query space}, and the measurements $y_i\in\reals$ are given by
\begin{equation}\label{eq:hypothesis-model}
@@ -78,15 +68,44 @@ where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings
\item\textbf{Learning Binary Functions with Bernoulli Noise.} $\Omega = \{0,1\}^d$, and $\mathcal{H}$ is some subset of $2^{\Omega\times\{0,1\}}$, and $$\varepsilon_i =\begin{cases}0, &\text{w.~prob.}~p\\\bar{h}(x_i)-h(x_i), &\text{w.~prob.}~1-p\end{cases}$$
\end{enumerate}
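For instance, in the second example above (assuming $\bar{h}$ denotes the complement of $h$, as the notation suggests), conditioned on the hypothesis each outcome satisfies $y_i = h(x_i)$ with probability $p$ and $y_i = \bar{h}(x_i)$ otherwise, so
\begin{displaymath}
 \entropy(y_i \mid h) = -p\log p - (1-p)\log(1-p),
\end{displaymath}
the binary entropy of $p$; terms of exactly this form make up the modular part in the submodularity proof below.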
-In this setup, assume that the experimenter has a prior distribution on the hypothesis $h\in \mathcal{H}$. Then, the information gain objective can be written again as the mutual information between $\beta$ and $Y_S$.
-\begin{align}V(S) = \entropy(\beta) -\entropy(\beta\mid Y_S),\quad S\subseteq\mathcal{N} \label{general} \end{align}
+In this setup, assume that the experimenter has a prior distribution on the hypothesis $h\in \mathcal{H}$. Then, the information gain objective can again be written as the mutual information between $h$ and $y_S$:
+\begin{align}\label{general}
+V(S) = \entropy(h) -\entropy(h\mid y_S),\quad S\subseteq\mathcal{N}
+\end{align}
This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$. Although mutual information is not submodular in general, this specific setup does lead to a submodular value function.
-\begin{theorem}The value function given by the information gain \eqref{general} is submodular.
-\end{theorem}
+
+\begin{lemma}
+The value function given by the information gain \eqref{general} is submodular.
+\end{lemma}
+
\begin{proof}
-The theorem is proved in a slightly different context in \cite{guestrin}; we repeat the proof here for the sake of completeness: \stratis{add one-line proof}
+This lemma is proved in a slightly different context in \cite{guestrin}; we
+repeat the argument here for completeness. By symmetry of the mutual
+information and the chain rule for conditional entropy, we get:
+\begin{align}
+ V(S) = \entropy(y_S) - \entropy(y_S \mid h)
+ = \entropy(y_S) - \sum_{i\in S} \entropy(y_i \mid h)\label{eq:chain-rule}
+\end{align}
+where the second equality comes from the independence of the $y_i$'s
+conditioned on $h$. Recall that the joint entropy of a set of random
+variables is a submodular function. Thus, our value function is written in
+\eqref{eq:chain-rule} as the sum of a submodular function and a modular function:
+it is therefore submodular.
\end{proof}
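As an illustrative sanity check (not needed for the proof), the diminishing-returns property can also be verified numerically in the linear--Gaussian special case $V(S) = \frac{1}{2}\log\det(I_d + \T{X_S}X_S)$. The sketch below assumes NumPy and random data:
\begin{verbatim}
import numpy as np

def value(X, S):
    """V(S) = 1/2 * log det(I_d + X_S^T X_S)  (linear-Gaussian special case)."""
    d = X.shape[1]
    X_S = X[sorted(S)]
    return 0.5 * np.linalg.slogdet(np.eye(d) + X_S.T @ X_S)[1]

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
n = X.shape[0]
for _ in range(200):
    T = set(int(j) for j in rng.choice(n, size=4, replace=False))
    S = set(sorted(T)[:2])                       # S is a subset of T
    i = next(j for j in range(n) if j not in T)  # an element outside T
    gain_S = value(X, S | {i}) - value(X, S)
    gain_T = value(X, T | {i}) - value(X, T)
    assert gain_S >= gain_T - 1e-9               # diminishing returns
\end{verbatim}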
-This theorem that implies that learning an \emph{arbitrary hypothesis, under an arbitrary prior} when noise is conditionally independent leads to a submodular value function. Hence the universally truthful 7.91-approximate mechanism by Chen \emph{et al.} \cite{chen} immediately applies in the value query model. Moreover, in cases where maximizing \eqref{general} can be done in polynomial time in the full-information setup, the truthful $8.34$-appoximate mechanism of Chen \emph{et al.}~applies again. However that, in many scenarios covered by this model (inlcuding the last two examples above), even computing the entropy under a given set might be a hard task---\emph{i.e.}, the value query model may not apply. Hence, identifying learning tasks in the above class for which truthful or universally truthful constant approximation mechanisms exist, or studying these problems in the context of stronger query models such as the demand model \cite{dobz2011-mechanisms,bei2012budget} remains an interesting open question.
+This lemma implies that learning an \emph{arbitrary hypothesis, under an
+arbitrary prior}, when the noise is conditionally independent, leads to a submodular
+value function. Hence the universally truthful 7.91-approximate mechanism of
+Chen \emph{et al.} \cite{chen} immediately applies in the value query model.
+Moreover, in cases where maximizing \eqref{general} can be done in polynomial
+time in the full-information setup, the truthful $8.34$-approximate mechanism of
+Chen \emph{et al.}~applies as well. Note, however, that in many scenarios covered by
+this model (including the last two examples above), even computing the entropy
+conditioned on a given set of observations might be a hard task---\emph{i.e.}, the
+value query model may not apply. Hence, identifying learning tasks in the above
+class for which truthful or universally truthful constant-approximation
+mechanisms exist, or studying these problems in the context of stronger query
+models such as the demand model \cite{dobz2011-mechanisms,bei2012budget},
+remains an interesting open question.
%TODO: Independent noise model. Captures models such as logistic regression, classification, etc. Arbitrary prior. Show that change in the entropy is submodular (cite Krause, Guestrin).
diff --git a/notes.bib b/notes.bib
index 74fa406..b9eb5c2 100644
--- a/notes.bib
+++ b/notes.bib
@@ -5,7 +5,6 @@
publisher={Cambridge University Press}
}
-
@article{vandenberghe1998determinant,
title={Determinant maximization with linear matrix inequality constraints},
author={Vandenberghe, L. and Boyd, S. and Wu, S.P.},