\subsection{Strategic Experimental Design with Non-Homotropic Prior}\label{sec:bed}
%In this section, we extend our results to Bayesian experimental design
%\cite{chaloner1995bayesian}. We show that objective function \eqref{modified}
%has a natural interpretation in this context, further motivating its selection
%as our objective. Moreover,
In the general case where the experimenter's prior distribution on the
model $\beta$ in \eqref{model} is not homotropic but has a generic covariance
matrix $R$, the value function takes the general form given by
\eqref{dcrit}.
Applying the mechanism described in Algorithm~\ref{mechanism} and adapting the
analysis of the approximation ratio \emph{mutatis mutandis}, we obtain the following result, which extends
Theorem~\ref{thm:main}.
\begin{theorem}
For any $\delta>0$, there exists a $\delta$-truthful, individually rational
and budget feasible mechanism for the objective function $V$ given by
\eqref{dcrit}. Furthermore, let us denote by $\lambda$ (resp. $\Lambda$) the
smallest (resp. largest) eigenvalue of $R$, then for any $\varepsilon > 0$, in time
$O(\text{poly}(n, d, \log\log\frac{B\Lambda}{\varepsilon\delta b\lambda}))$,
the mechanism computes a set $S^*$ such that:
\begin{displaymath}
OPT \leq
\bigg(\frac{6e-2}{e-1}\frac{1/\lambda}{\log(1+1/\lambda)}+ 4.66\bigg)V(S^*)
+ \varepsilon.
\end{displaymath}
\end{theorem}
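To make the objective concrete, the following is a small numerical sketch. The exact closed form of \eqref{dcrit} is defined earlier in the paper; here we assume, as a hypothetical stand-in, the standard Bayesian $D$-optimality value $V(S) = \log\det(R^{-1} + \T{X_S}X_S) - \log\det(R^{-1})$ for prior covariance $R$ and unit noise variance, which is normalized so that $V(\emptyset)=0$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))    # public features, one row per agent
R = np.diag([2.0, 1.0, 0.5])       # generic (non-homotropic) prior covariance

def value(X_S, R):
    """Assumed Bayesian D-optimality value: information gain of the
    measurements y_S about beta, for prior N(0, R) and unit noise."""
    R_inv = np.linalg.inv(R)
    _, post = np.linalg.slogdet(R_inv + X_S.T @ X_S)
    _, prior = np.linalg.slogdet(R_inv)
    return post - prior

print(value(X[:0], R))   # empty set -> 0.0
print(value(X, R) > 0)   # observing all rows has positive value -> True
```

Since $\T{X_S}X_S$ is positive semidefinite, adding rows can only increase the determinant, so this value is monotone and nonnegative.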
\subsection{Non-Bayesian Setting}
In the non-Bayesian setting, \emph{i.e.}, when the experimenter has no prior
distribution on the model, the covariance matrix $R$ is the zero matrix. In this case,
the ridge regression estimator \eqref{ridge} reduces to ordinary least squares (\emph{i.e.}, linear regression),
and the $D$-optimality criterion reduces to the entropy of $\hat{\beta}$, given by:
\begin{equation}\label{eq:d-optimal}
V(S) = \log\det(\T{X_S}X_S).
\end{equation}
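A quick numerical sketch of \eqref{eq:d-optimal} (illustrative code, not part of the paper's analysis):

```python
import numpy as np

def d_optimality(X_S):
    """log det(X_S^T X_S); -inf when the Gram matrix is singular."""
    sign, logdet = np.linalg.slogdet(X_S.T @ X_S)
    return logdet if sign > 0 else -np.inf

X = np.eye(2)                   # two orthogonal unit-norm experiments
print(d_optimality(X))          # 0.0
print(d_optimality(0.1 * X))    # ~ -9.21: shrinking the rows drives V below any bound
print(d_optimality(X[:1]))      # -inf: fewer than d rows -> singular Gram matrix
```

Note that, unlike the Bayesian criterion, this objective is unbounded below and is $-\infty$ whenever the selected rows do not span $\reals^d$.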
A natural question is whether it is possible to design
a deterministic mechanism in this setting. Since \eqref{eq:d-optimal} may take
arbitrarily negative values, to define a meaningful approximation one
would instead consider the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$.
However, the following lower bound shows that this objective
cannot be attained under the constraints of truthfulness, budget feasibility,
and individual rationality.
\begin{lemma}
For any $M>1$, there is no $M$-approximate, truthful, budget feasible,
individually rational mechanism for a budget feasible reverse auction with
value function $V(S) = \det{\T{X_S}X_S}$.
\end{lemma}
\begin{proof}
\input{proof_of_lower_bound1}
\end{proof}
%Beyond $D$-optimality, several other objectives such as $E$-optimality (maximizing the smallest eigenvalue of $\T{X_S}X_S$) or $T$-optimality (maximizing $\mathrm{trace}(\T{X_S}{X_S}))$ are encountered in the literature \cite{pukelsheim2006optimal}, though they do not relate to entropy as $D$-optimality. We leave the task of approaching the maximization of such objectives from a strategic point of view as an open problem.
\subsection{Beyond Linear Models}
Selecting experiments that maximize the information gain in the Bayesian setup
naturally generalizes to learning problems beyond linear
regression. In particular, consider the following variant of the standard PAC learning setup \cite{valiant}: assume that the features $x_i$, $i\in \mathcal{N}$, take values in some generic set $\Omega$, called the \emph{query space}.
Measurements $y_i\in\reals$ are given by
\begin{equation}\label{eq:hypothesis-model}
y_i = h(x_i) + \varepsilon_i
\end{equation}
where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings
$h:\Omega\to\reals$, called the \emph{hypothesis space}. As before, we assume that the experimenter has a prior distribution on the
hypothesis $h\in \mathcal{H}$; we also assume that $\varepsilon_i$
are random variables in $\reals$, not necessarily identically distributed, that
are independent \emph{conditioned on $h$}. As before, the features $x_i$ are public, and the goal of the experimenter is to (a) retrieve measurements $y_i$ and (b) estimate $h$ as accurately as possible.
This model is quite broad, and
captures many classic machine learning tasks; we give a few concrete examples below:
\begin{enumerate}
\item\textbf{Generalized Linear Regression.} In this case, $\Omega=\reals^d$, $\mathcal{H}$ is the set of linear maps $\{h(x) = \T{\beta}x \text{ s.t. } \beta\in \reals^d\}$, and $\varepsilon_i$ are independent zero-mean variables (not necessarily identically distributed).
\item\textbf{Learning Binary Functions with Bernoulli Noise.} When learning a binary function under noise, the experimenter wishes to determine $h$ by testing its output on different inputs; however, the output may be corrupted with probability $p$. Formally, $\Omega = \{0,1\}^d$, $\mathcal{H}$ is some subset of binary functions $h:\Omega\to\{0,1\}$, and $$\varepsilon_i =\begin{cases}0, &\text{w.~prob.}\;1-p\\\bar{h}(x_i)-h(x_i), &\text{w.~prob.}\;p\end{cases}$$
\item\textbf{Logistic Regression.} Logistic regression aims to learn a hyperplane separating $+1$--labeled values from $-1$--labeled values; again, values can be corrupted, and the probability that a label is flipped drops with the distance from the hyperplane. Formally, $\Omega=\reals^d$, $\mathcal{H}$ is the set of maps $\{h(x) = \mathop{\mathrm{sign}}(\beta^T x) \text{ for some } \beta\in\reals^d\}$, and $\varepsilon_i$ are independent conditioned on $\beta$ such that $$\varepsilon_i=\begin{cases} -2\cdot\id_{\beta^Tx>0}, & \text{w.~prob.} \frac{1}{1+e^{\beta^Tx}}\;\\ +2\cdot\id_{\beta^Tx<0},&\text{w.~prob.}\frac{e^{\beta^Tx}}{1+e^{\beta^Tx}}\;\end{cases}$$
\end{enumerate}
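To illustrate the second example concretely, here is a minimal simulation of the measurement model \eqref{eq:hypothesis-model} with Bernoulli noise; the single-coordinate hypothesis and all parameter values are hypothetical choices made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
d, p = 4, 0.2                     # dimension and corruption probability

# Hypothetical hypothesis: h reads a single (hidden) coordinate.
def h(x, j=2):
    return x[j]

# Measurement model: y_i = h(x_i) + eps_i, where eps_i flips the
# binary output (eps_i = h_bar(x_i) - h(x_i)) with probability p.
def measure(x):
    y = h(x)
    return 1 - y if rng.random() < p else y

xs = rng.integers(0, 2, size=(1000, d))
ys = np.array([measure(x) for x in xs])
# Empirically, about a (1 - p) fraction of measurements match h(x_i).
agreement = np.mean(ys == xs[:, 2])
print(round(agreement, 2))
```

The experimenter observes only the corrupted labels $y_i$ and must infer which coordinate (more generally, which $h\in\mathcal{H}$) generated them.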
We can again define the information gain as an objective to maximize:
\begin{align}\label{general}
V(S) = \entropy(h) -\entropy(h\mid y_S),\quad S\subseteq\mathcal{N}
\end{align}
This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$.
In general, the information gain is not a submodular function. However, when the errors $\varepsilon_i$ are independent conditioned on $h$, the following lemma holds:
\begin{lemma}
The value function given by the information gain \eqref{general} is submodular.
\end{lemma}
\begin{proof}
A more general statement for graphical models is shown in~\cite{krause2005near}; in short, using the chain rule for the conditional entropy we get:
\begin{equation}\label{eq:chain-rule}
V(S) = H(y_S) - H(y_S \mid h)
= H(y_S) - \sum_{i\in S} H(y_i \mid h)
\end{equation}
where the second equality comes from the independence of the $y_i$'s
conditioned on $h$. Recall that the joint entropy of a set of random
variables is a submodular function. Thus, the value function is written in
\eqref{eq:chain-rule} as the sum of a submodular function and a modular function.
\end{proof}
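As a numerical sanity check of the lemma, the following sketch verifies diminishing marginal returns for the Gaussian linear special case, where the information gain has the closed form $V(S) = \tfrac{1}{2}\log\det(I + X_S R \T{X_S})$ under unit noise (an assumed specialization, since \eqref{general} is stated for general hypotheses):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
n, d = 6, 3
X = rng.standard_normal((n, d))   # public features
R = np.eye(d)                     # hypothetical prior covariance of beta

def gain(S):
    """Information gain H(y_S) - H(y_S | beta) for the Gaussian linear
    case with unit noise: 1/2 log det(I + X_S R X_S^T)."""
    Xs = X[sorted(S)]
    _, logdet = np.linalg.slogdet(np.eye(len(Xs)) + Xs @ R @ Xs.T)
    return 0.5 * logdet

# Submodularity: the marginal gain of adding experiment i to S is at
# least its marginal gain when added to the larger set S + {j}.
ok = True
for size in range(n - 1):
    for S in map(set, combinations(range(n), size)):
        for i in set(range(n)) - S:
            for j in set(range(n)) - S - {i}:
                lhs = gain(S | {i}) - gain(S)
                rhs = gain(S | {i, j}) - gain(S | {j})
                ok &= lhs >= rhs - 1e-9
print(ok)  # True
```

The check exhausts all pairs $(S, i, j)$ for a small instance; a single violation would falsify submodularity, while passing is consistent with the lemma.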
This lemma implies that learning an \emph{arbitrary hypothesis, under an
arbitrary prior} when noise is conditionally independent leads to a submodular
value function. Hence, we can apply the previously known results by \citeN{singer-mechanisms} and \citeN{chen} to get
the following corollary:
\begin{corollary}
For Bayesian experimental design with an objective given by the
information gain \eqref{general}, there exists a randomized, polynomial-time, budget
feasible, individually rational, and universally truthful mechanism with
a $7.91$ approximation ratio, in expectation.
In cases where maximizing \eqref{general} can be done in polynomial time in
the full-information setup, there exists a deterministic, polynomial-time, budget feasible,
individually rational, and truthful mechanism for Bayesian experimental
design with an $8.34$ approximation ratio.
\end{corollary}
Note however that, in many scenarios covered by this model (including the last
two examples above), even computing the entropy of a given set of measurements
may be computationally hard---\emph{i.e.}, the value query model may not apply. Hence,
identifying learning tasks in the above class for which truthful or universally
truthful constant approximation mechanisms exist, or studying these problems in
the context of stronger query models such as the demand model
\cite{dobz2011-mechanisms,bei2012budget} remains an interesting open question.