 In the classic setting of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum},
 an {\em experimenter}  \E\ has access to a population of $n$ potential experiment subjects. 
Each subject $i\in  \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$, 
known to the experimenter. 
\E\ wishes to measure a certain inherent property of the subjects by performing an experiment: the outcome $y_i$ of the experiment on a subject $i$ is unknown to \E\ before the experiment is performed.

Typically, \E\ has a hypothesis on the relationship between $x_i$'s and $y_i$'s. Due to its simplicity, as well as its ubiquity in statistical analysis, a large body of work has focused on linear hypotheses: \emph{i.e.}, it is assumed that there exists a $\beta\in\reals^d$ such that  
$$y_i =  \T{\beta} x_i+\varepsilon_i,$$ for all $i\in \{1,\ldots,n\},$ where $\varepsilon_i$ are zero-mean, i.i.d.~random variables. Conducting the experiments and obtaining the measurements $y_i$ lets \E\  estimate  $\beta$, \emph{e.g.}, through linear regression. %, \emph{i.e.}, the model underlying the data, and the experimenter's goal is to obtain such an estimate as accurately as possible. %The goal of experimental design amounts to determining which subjects to experiment upon to produce the best possible such estimate.
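As a concrete illustration, the following minimal numerical sketch (synthetic data; all identifiers are ours) simulates the linear hypothesis and recovers $\beta$ via ridge regression:

```python
import numpy as np

# Minimal sketch of the linear hypothesis y_i = beta^T x_i + eps_i,
# with beta recovered by ridge regression on synthetic data.
rng = np.random.default_rng(0)
n, d = 500, 3
beta = np.array([1.0, -2.0, 0.5])              # unknown model parameter
X = rng.normal(size=(n, d))                    # subject features x_i
y = X @ beta + rng.normal(scale=0.05, size=n)  # noisy outcomes y_i

# Ridge estimate: (lam * I + X^T X)^{-1} X^T y (reduces to OLS as lam -> 0).
lam = 1.0
beta_hat = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
```

With enough subjects and low noise, `beta_hat` is close to the true $\beta$; the quality of the estimate is exactly what experimental design seeks to optimize when only some subjects can be measured.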
 
The above experimental design scenario has many applications. Regression over personal data collected through surveys or experimentation is the cornerstone of marketing research, as well as of research in a variety of experimental sciences such as medicine and sociology. Crucially, statistical analysis of user data is also a widespread practice among Internet companies, which routinely apply machine learning techniques to vast records of user data to perform inference and classification tasks integral to their daily operations.
Beyond linear regression, there is a rich literature on estimation procedures, as well as on means of quantifying the quality of the produced estimate~\cite{pukelsheim2006optimal}.  There is also an extensive theory on how to select subjects
when \E\ can conduct only a limited number of experiments, so that the estimation process returns an estimate of $\beta$
that approximates the true parameter of the underlying population \cite{ginebra2007measure,le1996comparison,chaloner1995bayesian,boyd2004convex}.

We depart from this classical setup by viewing experimental design in a strategic setting, and by studying the budgeted mechanism design issues that arise.
In our setting, experiments cannot be manipulated and hence measurements are reliable.
\E{} has a total budget of $B$ to conduct all the experiments.
There is a cost $c_i$ associated with experimenting on subject $i$, which varies from subject to subject. This cost $c_i$ is determined by subject $i$: it may be viewed as the cost $i$ incurs when tested, for which she needs to be reimbursed; as the incentive required for $i$ to participate in the experiment; or as the intrinsic worth of the data to the subject. The economic aspect of paying subjects has always been inherent in experimental design: experimenters often work within strict budgets and design creative incentives, and subjects often negotiate better incentives or higher payments.
However, we are not aware of a principled study of this setting from a strategic point of view, in which subjects declare their costs and thereby determine their payment. Such a setting is increasingly realistic, given the growth of these experiments over the Internet.

Our contributions are as follows.
\begin{itemize}
\item
We initiate the study of experimental design in the presence of a budget and strategic subjects. 
 In particular, we formulate the  {\em  Experimental Design Problem} (\SEDP) as
 follows: the experimenter \E\ wishes to find a set $S$ of subjects to maximize 
\begin{align}V(S) = \log\det\Big(I_d+\sum_{i\in S}x_i\T{x_i}\Big) \label{obj}\end{align}
subject to a budget constraint $\sum_{i\in S}c_i\leq B$, where $B$ is \E's budget. When subjects are strategic, the above problem can be naturally approached  as a \emph{budget feasible mechanism design} problem, as introduced by \citeN{singer-mechanisms}.

\smallskip
The objective function, which is key, is formally derived by maximizing the information gain on $\beta$ when the latter is learned through ridge regression, and is related to the so-called $D$-optimality criterion~\cite{pukelsheim2006optimal,atkinson2007optimum}.
\item
We present a polynomial-time, $\delta$-truthful mechanism for \SEDP{}, yielding a constant-factor ($\approx 12.98$) approximation to the optimal value of \eqref{obj}. In contrast, we show that no truthful, budget-feasible mechanism can attain an approximation ratio better than 2 for \SEDP{}.

\smallskip
We note that the objective \eqref{obj} is submodular. Applying previous results on budget-feasible mechanism design for general submodular objectives~\cite{singer-mechanisms,chen} would therefore yield either a deterministic, truthful, constant-approximation mechanism that requires exponential time, or a randomized, (universally) truthful, poly-time mechanism whose constant approximation ratio holds only \emph{in expectation} (\emph{i.e.}, its approximation guarantee on a given instance may in fact be unbounded).
\end{itemize}
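To make the objective \eqref{obj} concrete, the following sketch evaluates $V(S)$ and runs a naive cost-benefit greedy under the budget constraint. This is an illustration only, not the mechanism developed in this paper (in particular, it carries no truthfulness guarantee), and all identifiers are ours.

```python
import numpy as np

def V(S, X):
    """V(S) = log det(I_d + sum_{i in S} x_i x_i^T), the objective in (1)."""
    d = X.shape[1]
    A = np.eye(d)
    for i in S:
        A += np.outer(X[i], X[i])
    # slogdet is numerically preferable to log(det(A)) for large matrices.
    return np.linalg.slogdet(A)[1]

def greedy_budgeted(X, c, B):
    """Naive greedy: repeatedly add the affordable subject with the best
    marginal gain per unit cost, until no subject fits the budget."""
    S, spent, remaining = [], 0.0, set(range(len(c)))
    while True:
        best, best_ratio = None, 0.0
        for i in remaining:
            if spent + c[i] <= B:
                ratio = (V(S + [i], X) - V(S, X)) / c[i]
                if ratio > best_ratio:
                    best, best_ratio = i, ratio
        if best is None:
            return S
        S.append(best)
        spent += c[best]
        remaining.remove(best)
```

Note that $V(\emptyset) = \log\det I_d = 0$, and that adding a subject never decreases $V$, consistent with the monotonicity and submodularity of \eqref{obj}.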



From a technical perspective, we propose a convex optimization problem and establish that its optimal value is within a constant factor of the optimal value of \EDP.
In particular, we show that our relaxed objective is within a constant factor of the so-called multi-linear extension of \eqref{obj}, which in turn can be related to \eqref{obj} through pipage rounding. We establish the constant-factor bound relative to the multi-linear extension by bounding the partial derivatives of the two functions; we achieve the latter by exploiting convexity properties of matrix functions over the convex cone of positive semidefinite matrices.
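For concreteness, the multi-linear extension referred to above is the standard one: for $y\in[0,1]^n$,

```latex
F(y) \;=\; \mathbb{E}_{S\sim y}\big[V(S)\big]
     \;=\; \sum_{S\subseteq\{1,\ldots,n\}} V(S)\,
           \prod_{i\in S} y_i \prod_{i\notin S} (1-y_i),
```

\emph{i.e.}, the expected value of $V$ when each subject $i$ is included in $S$ independently with probability $y_i$.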

Our convex relaxation of \EDP{} involves maximizing a self-concordant function subject to linear constraints. Its optimal value can be computed to arbitrary accuracy in polynomial time using the so-called barrier method. However, the outcome of this computation is not necessarily monotone, a property needed in designing a truthful mechanism. Nevertheless, we construct an algorithm that solves the above convex relaxation and is ``almost'' monotone; we achieve this by applying the barrier method to a set of perturbed constraints, over which our objective is ``sufficiently'' concave. In turn, we show how to employ this algorithm to design a poly-time, $\delta$-truthful, constant-approximation mechanism for \EDP{}.



In what follows, we describe related work in Section~\ref{sec:related}. We briefly review experimental design and budget-feasible mechanisms in Section~\ref{sec:peel}, and define \SEDP\ formally. We present our convex relaxation of \EDP{} in Section~\ref{sec:approximation}, and use it to construct our mechanism in Section~\ref{sec:main}. We conclude in Section~\ref{sec:concl}. All proofs of our technical results are provided in the appendix.


\input{related}