\subsection{Bayesian Experimental Design}\label{sec:bed}
In this section, we extend our results to Bayesian experimental design
\cite{chaloner1995bayesian}. We show that objective function \eqref{modified}
has a natural interpretation in this context, further motivating its selection
as our objective. Moreover, we extend Theorem~\ref{thm:main} to a more general
Bayesian setting.
In the Bayesian setting, it is assumed that the experimenter has a prior
distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior
with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d\times d}$, where $\sigma^2$ is the noise variance and $R$ is a positive definite matrix.
The experimenter estimates $\beta$ through \emph{maximum a posteriori estimation}, \emph{i.e.}, by finding the parameter that maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation amounts to the following minimization \cite{hastie}:
\begin{displaymath}
\hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
+ \T{\beta}R\beta
\end{displaymath}
This optimization, commonly known as \emph{ridge regression}, adds the penalty term $\T{\beta}R\beta$ to the least squares objective \eqref{leastsquares}.
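As a concrete illustration (a minimal Python sketch; the function name and data are our own), the minimizer above admits the closed form $\hat{\beta} = (\T{X_S}X_S + R)^{-1}\T{X_S}y_S$, obtained by setting the gradient of the objective to zero:
\begin{verbatim}
import numpy as np

def map_estimate(X, y, R):
    """MAP estimate under the prior N(0, sigma^2 R^{-1}):
    minimizes ||y - X beta||^2 + beta^T R beta, whose unique
    minimizer solves (X^T X + R) beta = X^T y."""
    return np.linalg.solve(X.T @ X + R, X.T @ y)

# With R = I_d this is standard ridge regression.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + 0.1 * rng.standard_normal(20)
print(map_estimate(X, y, np.eye(3)))  # close to beta
\end{verbatim}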
Let $\entropy(\beta)$ be the entropy of $\beta$ under the prior distribution, and
$\entropy(\beta\mid y_S)$ the conditional entropy of $\beta$ given the experiment
outcomes $y_S$, for some $S\subseteq \mathcal{N}$. In this setting, a natural
objective, originally proposed by Lindley \cite{lindley1956measure}, is to
select a set of experiments $S$ that maximizes her \emph{information gain}:
\begin{displaymath}
I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S).
\end{displaymath}
Assuming normal noise variables, the information gain is equal (up to a constant) to the following value function \cite{chaloner1995bayesian}:
\begin{align}
V(S) = \frac{1}{2}\log\det(R + \T{X_S}X_S)\label{bayesianobjective}
\end{align}
Our objective \eqref{modified} clearly follows from \eqref{bayesianobjective}
by setting $R=I_d$. Hence, our optimization can be interpreted as
a maximization of the information gain when the prior distribution has
a covariance $\sigma^2 I_d$, and the experimenter is solving a ridge regression
problem with penalty term $\norm{\beta}_2^2$.
Moreover, our results can be extended to the general Bayesian case, by
replacing $I_d$ with the positive definite matrix $R$. First, we shift the
origin of the value function so that $\tilde{V}(\emptyset) = 0$:
\begin{align}\label{eq:normalized}
\tilde{V}(S)
& = \frac{1}{2}\log\det(R + \T{X_S}X_S) - \frac{1}{2}\log\det R\\
& = \frac{1}{2}\log\det(I_d + R^{-1}\T{X_S}X_S)\notag
\end{align}
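The equality of the two expressions in \eqref{eq:normalized} can be checked numerically; the following Python sketch (with hypothetical names, assuming $R$ is positive definite) computes $\tilde{V}(S)$ both ways:
\begin{verbatim}
import numpy as np

def normalized_value(X_S, R):
    """Computes (1/2) log det(R + X^T X) - (1/2) log det R,
    and checks it against (1/2) log det(I + R^{-1} X^T X)."""
    d = R.shape[0]
    M = X_S.T @ X_S
    v1 = 0.5 * (np.linalg.slogdet(R + M)[1] - np.linalg.slogdet(R)[1])
    v2 = 0.5 * np.linalg.slogdet(np.eye(d) + np.linalg.solve(R, M))[1]
    assert np.isclose(v1, v2)
    return v1
\end{verbatim}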
Applying the mechanism described in Algorithm~\ref{mechanism} and adapting the
analysis of the approximation ratio, we obtain the following extension of
Theorem~\ref{thm:main}:
\begin{theorem}
There exists a truthful, individually rational, and budget feasible mechanism for the objective
function $\tilde{V}$ given by \eqref{eq:normalized}. Furthermore, for any $\varepsilon
> 0$, in time $O(\text{poly}(n, d, \log\log \varepsilon^{-1}))$,
the mechanism computes a set $S^*$ such that:
\begin{displaymath}
OPT \leq
\frac{5e-1}{e-1}\frac{2\mu}{\log(1+\mu)}\tilde{V}(S^*) + 5.1 + \varepsilon
\end{displaymath}
where $\mu$ is the smallest eigenvalue of $R$.
\end{theorem}
\subsection{$D$-Optimality and Beyond}
Ideally, motivated by the $D$-optimality criterion, we would like to design a mechanism that maximizes \eqref{dcrit} within a good approximation ratio. In what follows, we consider a slightly more general objective:
\begin{center}
\textsc{ExperimentalDesignProblem} (EDP)
\begin{subequations}
\begin{align}
\text{Maximize}\quad V(S) &= \log\det(I_d+\T{X_S}X_S) \label{modified} \\
\text{subject to}\quad \sum_{i\in S} c_i&\leq B
\end{align}\label{edp}
\end{subequations}
\end{center} where $I_d\in \reals^{d\times d}$ is the identity matrix.
We present our results for this version of the objective function because it is simple and captures the variants we will need later (including an arbitrary matrix in place of $I_d$, or a zero matrix, which corresponds to \eqref{dcrit}).
\EDP{} \emph{and} the corresponding problem with objective \eqref{dcrit} are NP-hard; to see this, note that \textsc{Knapsack} reduces to EDP with dimension $d=1$: map each item with value $v_i$ to an experiment with $x_i=\sqrt{v_i}$ and the same cost, so that maximizing \eqref{modified} amounts to maximizing $\log(1+\sum_{i\in S} v_i)$. Moreover, \eqref{modified} is submodular, monotone, and satisfies $V(\emptyset) = 0$.
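Submodularity here means diminishing marginal gains: for $S\subseteq T$ and $i\notin T$, $V(S\cup\{i\})-V(S)\geq V(T\cup\{i\})-V(T)$. This property can be sanity-checked numerically; below is a minimal Python sketch on random data (names and data are our own):
\begin{verbatim}
import numpy as np
from itertools import combinations

def V(X, S):
    """EDP objective: log det(I_d + X_S^T X_S)."""
    d = X.shape[1]
    X_S = X[sorted(S)]
    return np.linalg.slogdet(np.eye(d) + X_S.T @ X_S)[1]

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))
i = 5
for k in range(4):
    for S in map(set, combinations(range(4), k)):
        T = S | {4}  # S is a subset of T; i is in neither
        assert V(X, S | {i}) - V(X, S) >= V(X, T | {i}) - V(X, T) - 1e-9
\end{verbatim}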
As \eqref{dcrit} may take arbitrarily small negative values, to define a meaningful approximation ratio one would instead consider the (equivalent) maximization of $V(S) = \det(\T{X_S}X_S)$.
However, the following lower bound implies that such an optimization goal cannot be attained under the constraints of truthfulness, budget feasibility, and individual rationality.
\begin{lemma}
For any $M>1$, there is no $M$-approximate, truthful, budget feasible, individually rational mechanism for a budget feasible reverse auction with value function $V(S) = \det(\T{X_S}X_S)$.
\end{lemma}
\begin{proof}
\input{proof_of_lower_bound1}
\end{proof}
\subsection{Beyond Linear Models}
Selecting experiments that maximize the information gain in the Bayesian setup
leads to a natural generalization to learning tasks beyond linear
regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$, take values in some general set $\Omega$, called the \emph{query space}, and measurements $y_i\in\reals$ are given by
\begin{equation}\label{eq:hypothesis-model}
y_i = h(x_i) + \varepsilon_i
\end{equation}
where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings
$h:\Omega\to\reals$, called the \emph{hypothesis space}, and $\varepsilon_i$
are random variables in $\reals$, not necessarily identically distributed, that
are independent \emph{conditioned on $h$}. This model is quite broad, and
captures many learning tasks, such as support vector machines (SVM) and linear
discriminant analysis (LDA), as well as the following:
\begin{enumerate}
\item\textbf{Generalized Linear Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of linear maps $\{h(x) = \T{\beta}x \text{ s.t. } \beta\in \reals^d\}$, and $\varepsilon_i$ are independent zero-mean normal variables with $\expt{\varepsilon_i^2}=\sigma_i^2$.
\item\textbf{Logistic Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of maps $\{h(x) = \frac{e^{\T{\beta} x}}{1+e^{\T{\beta} x}} \text{ s.t. } \beta\in\reals^d\}$, and $\varepsilon_i$ are independent conditioned on $h$ such that $$\varepsilon_i=\begin{cases} 1- h(x_i),& \text{w.~prob.}\;h(x_i)\\-h(x_i),&\text{w.~prob.}\;1-h(x_i)\end{cases}$$ (see the sketch following this list).
\item\textbf{Learning Binary Functions with Bernoulli Noise.} $\Omega = \{0,1\}^d$, $\mathcal{H}$ is some subset of $2^{\Omega}$, and $$\varepsilon_i =\begin{cases}0, &\text{w.~prob.}\;p\\\bar{h}(x_i)-h(x_i), &\text{w.~prob.}\;1-p\end{cases}$$ where $\bar{h}(x_i) = 1 - h(x_i)$.
\end{enumerate}
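For concreteness, the following Python sketch (our own illustration) samples a measurement from the logistic regression example above; the noise definition there makes $y_i$ a Bernoulli draw with mean $h(x_i)$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

def logistic_measurement(x, beta):
    """y = h(x) + eps, where eps is 1 - h(x) w.p. h(x)
    and -h(x) w.p. 1 - h(x); hence y ~ Bernoulli(h(x))."""
    p = 1.0 / (1.0 + np.exp(-beta @ x))  # h(x)
    return rng.binomial(1, p)
\end{verbatim}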
In this setup, assume that the experimenter has a prior distribution on the hypothesis $h\in \mathcal{H}$. Then, the information gain objective can again be written as the mutual information between $h$ and $y_S$:
\begin{align}\label{general}
V(S) = \entropy(h) -\entropy(h\mid y_S),\quad S\subseteq\mathcal{N}
\end{align}
This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$. Though mutual information is not, in general, a submodular function, this specific setup does lead to a submodular formulation.
\begin{lemma}
The value function given by the information gain \eqref{general} is submodular.
\end{lemma}
\begin{proof}
By the symmetry of mutual information and the chain rule for conditional entropy, we get:
\begin{equation}\label{eq:chain-rule}
V(S) = \entropy(y_S) - \entropy(y_S \mid h)
= \entropy(y_S) - \sum_{i\in S} \entropy(y_i \mid h)
\end{equation}
where the second equality follows from the independence of the $y_i$'s
conditioned on $h$. Recall that the joint entropy of a set of random
variables is a submodular function. Thus, \eqref{eq:chain-rule} expresses the
value function as the sum of a submodular function and a modular function, and it is therefore submodular.
\end{proof}
This lemma implies that learning an \emph{arbitrary hypothesis, under an
arbitrary prior}, leads to a submodular value function whenever the noise is
conditionally independent. Hence, we can apply the results of Chen
\emph{et al.}~\cite{chen} to obtain the following corollary:
\begin{corollary}
For Bayesian experimental design with the objective given by the
information gain \eqref{general}, there exists a randomized, budget
feasible, individually rational, and universally truthful mechanism. This
mechanism has complexity $O\big(\text{poly}(n,d)\big)$ and an
approximation ratio of $7.91$.
In cases where maximizing \eqref{general} can be done in polynomial time in
the full-information setup, there exists a randomized, budget feasible,
individually rational, and truthful mechanism for Bayesian experimental
design. This mechanism has complexity $O\big(\text{poly}(n,d)\big)$ and
an approximation ratio of $8.34$.
\end{corollary}
Note, however, that in many scenarios covered by this model (including the last
two examples above), even computing the entropy conditioned on a given set of
observations may be hard---\emph{i.e.}, the value query model may
not apply. Hence, identifying learning tasks in the above class for which
truthful or universally truthful constant approximation mechanisms exist, or
studying these problems in the context of stronger query models such as the
demand model \cite{dobz2011-mechanisms,bei2012budget} remains an interesting
open question.