diff options
Diffstat (limited to 'intro.tex')
| -rw-r--r-- | intro.tex | 31 |
1 files changed, 20 insertions, 11 deletions
@@ -8,34 +8,40 @@ Each subject $i\in \{1,\ldots,n\}$ is associated with a set of parameters (or f known to the experimenter. \E\ wishes to measure a certain inherent property of the subjects by performing an experiment: the outcome $y_i$ of the experiment on a subject $i$ is unknown to \E\ before the experiment is performed. -Typically, \E\ has a hypothesis on the relationship between $x_i$'s and $y_i$'s. Due to its simplicity, as well as its ubiquity in statistical analysis, a large body of work has focused on linear hypotheses: \emph{i.e.}, it is assumed that there exists a $\beta\in\reals^d$ such that $y_i = \T{\beta} x_i+\varepsilon_i$, for all $i\in \{1,\ldots,n\},$ where $\varepsilon_i$ are zero-mean, i.i.d.~random variables. Conducting the experiments and obtaining the measurements $y_i$ lets \E\ estimate $\beta$, \emph{e.g.}, through linear regression. %, \emph{i.e.}, the model underlying the data, and the experimenter's goal is to obtain such an estimate as accurately as possible. %The goal of experimental design amounts to determining which subjects to experiment upon to produce the best possible such estimate. +Typically, \E\ has a hypothesis on the relationship between $x_i$'s and $y_i$'s. Due to its simplicity, as well as its ubiquity in statistical analysis, a large body of work has focused on linear hypotheses: \emph{i.e.}, it is assumed that there exists a $\beta\in\reals^d$ such that +$$y_i = \T{\beta} x_i+\varepsilon_i,$$ for all $i\in \{1,\ldots,n\},$ where $\varepsilon_i$ are zero-mean, i.i.d.~random variables. Conducting the experiments and obtaining the measurements $y_i$ lets \E\ estimate $\beta$, \emph{e.g.}, through linear regression. %, \emph{i.e.}, the model underlying the data, and the experimenter's goal is to obtain such an estimate as accurately as possible. %The goal of experimental design amounts to determining which subjects to experiment upon to produce the best possible such estimate. The above experimental design scenario has many applications. Regression over personal data collected through surveys or experimentation is the cornerstone of marketing research, as well as research in a variety of experimental sciences such as medicine and sociology. Crucially, the statistical analysis of user data is also a widely spread practice among Internet companies, which routinely use machine learning techniques over vast records of user data to perform inference and classification tasks integral to their daily operations. Beyond linear regression, there is a rich literature about estimation procedures, as well as for means of quantifying the quality of the produced estimate~\cite{pukelsheim2006optimal}. There is also an extensive theory on how to select subjects if \E\ can conduct only a limited number of experiments, so the estimation process returns $\beta$ that approximates the true parameter of the underlying population \cite{ginebra2007measure,le1996comparison,chaloner1995bayesian,boyd2004convex}. -We depart from this classical setup by viewing experimental design in a strategic setting, and by studying mechanism design issues. +We depart from this classical setup by viewing experimental design in a strategic setting, and by studying budgeted mechanism design issues. In our setting, experiments cannot be manipulated and hence measurements are reliable. %\footnote{Thus, the experiments of our interest are statistically significant ones where each experiment provides a reliable outcome.} -However, there -is a cost $c_i$ associated with experimenting on -subject $i$ which varies from subject to subject. This may be viewed as the -cost subject $i$ incurs when tested and for which she needs to be reimbursed; or, it might be viewed as the incentive for $i$ to participate in the experiment; or, it might be the intrinsic worth of the data to the user. - - This economic aspect has always been inherent in experimental design: experimenters often work within strict budgets and design creative incentives. However, we are not aware of a principled study of this setting from a strategic point of view. When subjects are strategic, they may have an incentive to misreport their cost, leading to the need for a sophisticated choice of experiments and payments. Arguably, user incentiviation is of particular pertinence due to the extent of statistical analysis over user data on the Internet. %, which has led to the rise of several different research efforts in studying data markets \cite{...}. + \E{} has a total budget of $B$ to conduct all the experiments. +There is a cost $c_i$ associated with experimenting on +subject $i$ which varies from subject to subject. This cost $c_i$ is determined by the subject $i$: it may be viewed as the +cost $i$ incurs when tested and for which she needs to be reimbursed; or, it might be viewed as the incentive for $i$ to participate in the experiment; or, it might be the intrinsic worth of the data to the user. The economic aspect of paying subjects has always been inherent in experimental design: experimenters often work within strict budgets and design creative incentives. Subjects often negotiate better incentives or higher payments. +However, we are not aware of a principled study of this setting from a strategic point of view, when subjects declare their costs and therefore determine their payment. Such a setting is increasingly realistic, given the growth of these experiments over the Internet and associated data markets. +% When subjects are strategic, they may have an incentive to misreport their cost, leading to the need for a sophisticated choice of experiments and payments. Arguably, user incentiviation is of particular pertinence due to the extent of statistical analysis over user data on the Internet. %, which has led to the rise of several different research efforts in studying data markets \cite{...}. Our contributions are as follows. \begin{itemize} \item -We formulate the problem of experimental design subject to a given budget, in the presence of strategic agents who may lie about their costs. %In particular, we focus on linear regression. This is naturally viewed as a budget feasible mechanism design problem, in which the objective function %is sophisticated and +We initiate the study of experimental design problem in presence of budgets and strategic subjects. +%formulate the problem of experimental design subject to a given budget, in the presence of strategic agents who may lie about their costs. %In particular, we focus on linear regression. This is naturally viewed as a budget feasible mechanism design problem, in which the objective function %is sophisticated and %is related to the covariance of the $x_i$'s. In particular, we formulate the {\em Strategic Experimental Design Problem} (\SEDP) as follows: the experimenter \E\ wishes to find set $S$ of subjects to maximize \begin{align}V(S) = \log\det\Big(I_d+\sum_{i\in S}x_i\T{x_i}\Big) \label{obj}\end{align} subject to a budget constraint $\sum_{i\in S}c_i\leq B$, where $B$ is \E's budget, as well as the usual constraints of truthfulness and individual rationality. %, and other {\em strategic constraints} we don't list here. + +\smallskip The objective function, which is the key, is formally obtained by optimizing the information gain in $\beta$ when the latter is learned through linear regression, and is related to the so-called $D$-optimality criterion~\cite{pukelsheim2006optimal,atkinson2007optimum}. \item -We present the first known polynomial time truthful mechanism for \SEDP{}, yielding a constant factor ($\approx 12.98$) approximation. In contrast to this, we show that no truthful, budget-feasible algorithms are possible for \SEDP{} within a factor 2 approximation. +We present a polynomial time, truthful mechanism for \SEDP{}, yielding a constant factor ($\approx 12.98$) approximation. In contrast to this, we show that no truthful, budget-feasible algorithms are possible for \SEDP{} within a factor 2 approximation. + +\smallskip We note that the objective \eqref{obj} is submodular. Using this fact, applying previous results for budget feasible mechanisms under general submodular objectives would yield either a truthful deterministic mechanism that requires exponential time, or a poly-time algorithm that is not deterministic~\cite{singer-mechanisms,chen}. \end{itemize} @@ -51,7 +57,10 @@ We note that the objective \eqref{obj} is submodular. Using this fact, applying %Though such mechanisms were known to exist for combinatorial problems with specific submodular objectives such as \textsc{Knapsack} or \textsc{Coverage}~\cite{singer-mechanisms,chen, singer-influence}, these do not readily apply to the more complicated linear-algebraic objective function \eqref{obj} of \SEDP. %{\bf S+T: could we verify that the above sentence is correct in its implication?} -From a technical perspective, we present a convex relaxation of \eqref{obj}, and show that it is within a constant factor from the so-called multi-linear relaxation of \eqref{obj}. This allows us to adopt the approach followed by prior work in budget feasible mechanisms by Chen \emph{et al.}~\cite{chen} and Singer~\cite{singer-influence}. %{\bf FIX the last sentence} +From a technical perspective, we present a convex relaxation of \eqref{obj}, and show that it is within a constant factor from the so-called multi-linear relaxation of \eqref{obj}. + + +%This allows us to adopt the approach followed by prior work in budget feasible mechanisms by Chen \emph{et al.}~\cite{chen} and Singer~\cite{singer-influence}. %{\bf FIX the last sentence} %\item %Our approach to mechanisms for experimental design --- by |
