1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
|
In the classic setting of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum},
an {\em experimenter} \E\ has access to a population of $n$ potential experiment subjects.
Each subject $i\in \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$,
known to the experimenter.
\E\ wishes to perform an experiment that measures a certain inherent property of the subjects: the outcome for a subject $i$ is denoted $y_i$, which is unknown to \E\ before the experiment is performed.
Typically, \E\ has a hypothesis of the relationship between $x_i$'s and $y_i$'s, such as, say linear, \emph{i.e.}, $y_i \approx \T{\beta} x_i$; conducting the experiments and obtaining the measurements $y_i$ lets \E\ estimate $\beta$. %The goal of experimental design amounts to determining which subjects to experiment upon to produce the best possible such estimate.
The above experimental design scenario has many applications, from medical testing to marketing research and others.
There is a rich literature about estimation procedures, as well as for means for quantifying the quality of the produced estimate \cite{pukelsheim2006optimal}. There is also an extensive theory on how to select subjects
if \E\ can conduct only a limited number of experiments, so the estimation process returns $\beta$
that approximates the true parameter of the underlying population \cite{ginebra2007measure,le1996comparison,chaloner1995bayesian,boyd2004convex}.
We depart from this classical set up by viewing experimental design in a strategic setting, and by studying mechanism design issues.
In our setup, experiments cannot be manipulated and hence measurements are considered precise.\footnote{Thus, the experiments of our interest are statistically significant ones where each experiment provides a reliable outcome.} However, there
is a cost $c_i$ associated with experimenting on
subject $i$ which varies from subject to subject. This may be viewed as the
cost subject $i$ incurs when tested, and hence $i$ needs to be reimbursed; or, it might be viewed as the incentive for $i$
to participate in the experiment; or, it might be the inherent value of the data. This economic aspect has always been inherent in experimental design: experimenters often work within strict budgets and design creative incentives. However, we are not aware of principled study of this setting from a strategic point of view. When subjects are strategic, they may have an incentive to mis-report their cost.
Our contributions are as follows.
\begin{itemize}
\item
We formulate the problem of experimental design subject to a given budget, in presence of strategic agents who specify their costs. In particular, we focus on linear regression. This is naturally viewed as a budget feasible mechanism design problem. The objective function is sophisticated and is related to the covariance of the $x_i$'s. In particular, we formulate the {\em Experimental Design Problem} (\EDP) as follows: the experimenter \E\ wishes to find set $S$ of subjects to maximize
\begin{align}V(S) = \log\det(I_d+\sum_{i\in S}x_i\T{x_i}) \label{obj}\end{align}
with a budget constraint $\sum_{i\in S}c_i\leq B$, where $B$ is \E's budget. %, and other {\em strategic constraints} we don't list here.
The objective function, which is the key, is obtained by optimizing the information gain in $\beta$ when it is learned through linear regression methods, and is related to the so-called $D$-optimality criterion.
\item
The above objective is submodular.
There are several recent results in budget feasible
mechanisms~\cite{singer-mechanisms,chen,singer-influence,bei2012budget,dobz2011-mechanisms}, and some apply to the submodular optimization in
\EDP.
There is a randomized, 7.91-approximate polynomial time mechanism for maximizing a general submodular function that is universally truthful, \emph{i.e.}, it is sampled from a distribution among truthful mechanisms. Also, there is a $8.34$-approximate exponential time deterministic mechanism.
There are however no known deterministic, truthful, polynomial time mechanisms for general submodular functions.
For specific combinatorial problems such as \textsc{Knapsack} or \textsc{Coverage}, there exist
deterministic, truthful, polynomial constant-approximation algorithms~\cite{singer-mechanisms,chen, singer-influence}, but they do not work for the linear-algebraic objective function in \EDP.
%{\bf S+T: could we verify that the above sentence is correct in its implication?}
We present the first known, polynomial time truthful mechanism for \EDP{}. Our mechanism is a constant factor ($\approx 19.68$) approximation for \EDP{}. From a technical perspective, we present a convex relaxation to \EDP\; and show that it is within a constant factor approximation of a multi-linear relaxation for \EDP{}. This allows us to adopt the prior work in budget feasible mechanisms by Chen \emph{et al.}~\cite{chen} and Singer~\cite{singer-influence}. In contrast to this, no truthful algorithms are possible for \EDP{} within a factor 2 approximation. {\bf FIX the last sentence}
\item
Our approach to mechanisms for experimental design --- by
optimizing the information gain in parameters like $\beta$ which are estimated through the data analysis process --- is general. We give examples of this approach beyond linear regression to logistic regression, learning binary functions and others and derive budgeted mechanism design problems with submodular optimization where prior work immediately applies and gives constant factor approximations. We leave it open to get deterministic truthful polynomial time mechanisms for these problems, like we did for \EDP.
\end{itemize}
In what follows, we described related work in Section~\ref{sec:related}. We describe the basics of experimental design and budget feasible mechanisms in Section~\ref{sec:peel} and define \EDP formally. In Section~\ref{sec:main} we present our mechanism for \EDP and prove our main results. In~\ref{sec:ext} we present applications of out general framework.
\junk{
\stratis{maximizing other ``optimality criteria'' (E-optimality, G-optimality, etc.) Improved upper lower bound for $R\neq I$. General entropy gain optimization}
\begin{itemize}
\item already existing field of experiment design: survey-like setup, what
are the best points to include in your experiment? Measure of the
usefulness of the data: variance-reduction or entropy-reduction.
\item nowadays, there is also a big focus on purchasing data: paid surveys,
mechanical turk, etc. that add economic aspects to the problem of
experiment design
\item recent advances (Singer, Chen) in the field of budgeted mechanisms
\item we study ridge regression, very widely used in statistical learning,
and treat it as a problem of budgeted experiment design
\item we make the following contributions: ...
\item extension to a more general setup which includes a wider class of
machine learning problems
\end{itemize}
}
\input{related}
|