 definitions.tex |  1
 intro.tex       | 55
 paper.tex       |  2
 problem.tex     | 39
 related.tex     | 23
 5 files changed, 74 insertions(+), 46 deletions(-)
diff --git a/definitions.tex b/definitions.tex
index 2c364da..22a1f46 100644
--- a/definitions.tex
+++ b/definitions.tex
@@ -24,6 +24,7 @@
\newcommand{\thibaut}[1]{\textcolor{blue}{Thibaut: #1}}
\newcommand{\T}[1]{#1^T}
\newcommand{\EDP}{EDP}
+\newcommand{\SEDP}{SEDP}
\newcommand{\E}{{\tt E}}
\newcommand{\id}{\mathbbm{1}}
\newcommand{\junk}[1]{}
diff --git a/intro.tex b/intro.tex
index d43d62b..f5545a9 100644
--- a/intro.tex
+++ b/intro.tex
@@ -1,53 +1,58 @@
-The statistical analysis of user data is a widely spread practice among Internet companies, which routinely use machine learning techniques over vast records of user data to perform inference and classification tasks integral to their daily operations. Statistical analysis of personal data collected through surveys or experimentation is also the cornerstone of marketing research, as well as research in a variety of experimental sciences such as medicine and sociology.
+%The statistical analysis of user data is a widespread practice among Internet companies, which routinely use machine learning techniques over vast records of user data to perform inference and classification tasks integral to their daily operations. Statistical analysis of personal data collected through surveys or experimentation is also the cornerstone of marketing research, as well as research in a variety of experimental sciences such as medicine and sociology.
-This state of affairs has motivated several recent studies of \emph{data markets}, in which an analyst wishes to purchase data from a set of users \cite{...}. Eah user discloses her data to the analyst only if she receives an appropriate compensation. Assuming that the analyst has a limited budget, a natural question to ask is how she should allocate her budget across different users.
+%This state of affairs has motivated several recent studies of \emph{data markets}, in which an analyst wishes to purchase data from a set of users \cite{...}. Each user discloses her data to the analyst only if she receives an appropriate compensation. Assuming that the analyst has a limited budget, a natural question to ask is how she should allocate her budget across different users.
In the classic setting of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum},
an {\em experimenter} \E\ has access to a population of $n$ potential experiment subjects.
Each subject $i\in \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$,
known to the experimenter.
-\E\ wishes to perform an experiment that measures a certain inherent property of the subjects: the outcome for a subject $i$ is denoted $y_i$, which is unknown to \E\ before the experiment is performed.
-Typically, \E\ has a hypothesis of the relationship between $x_i$'s and $y_i$'s, such as, say linear, \emph{i.e.}, $y_i \approx \T{\beta} x_i$; conducting the experiments and obtaining the measurements $y_i$ lets \E\ estimate $\beta$. %The goal of experimental design amounts to determining which subjects to experiment upon to produce the best possible such estimate.
+\E\ wishes to measure a certain inherent property of the subjects by performing an experiment: the outcome $y_i$ of the experiment on a subject $i$ is unknown to \E\ before the experiment is performed.
+
+Typically, \E\ has a hypothesis on the relationship between $x_i$'s and $y_i$'s. Due to its simplicity, as well as its ubiquity in statistical analysis, a large body of work has focused on linear hypotheses: \emph{i.e.}, it is assumed that there exists a $\beta\in\reals^d$ such that $y_i = \T{\beta} x_i+\varepsilon_i$, for all $i\in \{1,\ldots,n\},$ where $\varepsilon_i$ are zero-mean, i.i.d.~random variables. Conducting the experiments and obtaining the measurements $y_i$ lets \E\ estimate $\beta$, \emph{e.g.}, through linear regression. %, \emph{i.e.}, the model underlying the data, and the experimenter's goal is to obtain such an estimate as accurately as possible. %The goal of experimental design amounts to determining which subjects to experiment upon to produce the best possible such estimate.
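As a concrete illustration of this setup, the following Python sketch (the dimensions, noise level, and true $\beta$ are invented purely for illustration) generates linear measurements and recovers $\beta$ by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3                       # number of subjects and feature dimension
beta = np.array([1.0, -2.0, 0.5])   # true model, unknown to the experimenter
X = rng.normal(size=(n, d))         # feature vectors x_i, known to E
eps = 0.1 * rng.normal(size=n)      # zero-mean i.i.d. noise eps_i
y = X @ beta + eps                  # measurements y_i = beta^T x_i + eps_i

# Conducting all experiments lets E recover beta via ordinary least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta, atol=0.05)
```

With many experiments the estimate concentrates around the true $\beta$; which experiments are worth buying under a budget is precisely the question studied below.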
The above experimental design scenario has many applications, from medical testing to marketing research and others.
-There is a rich literature about estimation procedures, as well as for means of quantifying the quality of the produced estimate \cite{pukelsheim2006optimal}. There is also an extensive theory on how to select subjects
+There is a rich literature about estimation procedures, as well as for means of quantifying the quality of the produced estimate~\cite{pukelsheim2006optimal}. There is also an extensive theory on how to select subjects
if \E\ can conduct only a limited number of experiments, so the estimation process returns $\beta$
that approximates the true parameter of the underlying population \cite{ginebra2007measure,le1996comparison,chaloner1995bayesian,boyd2004convex}.
We depart from this classical setup by viewing experimental design in a strategic setting, and by studying mechanism design issues.
-In our setting, experiments cannot be manipulated and hence measurements are considered reliable. %\footnote{Thus, the experiments of our interest are statistically significant ones where each experiment provides a reliable outcome.}
+In our setting, experiments cannot be manipulated and hence measurements are reliable. %\footnote{Thus, the experiments of our interest are statistically significant ones where each experiment provides a reliable outcome.}
However, there
is a cost $c_i$ associated with experimenting on
subject $i$ which varies from subject to subject. This may be viewed as the
-cost subject $i$ incurs when tested and for which she needs to be reimbursed; or, it might be viewed as the incentive for $i$
-to participate in the experiment; or, it might be the inherent value of the data. This economic aspect has always been inherent in experimental design: experimenters often work within strict budgets and design creative incentives. However, we are not aware of principled study of this setting from a strategic point of view. When subjects are strategic, they may have an incentive to misreport their cost and the choice of experiments and payments need to be more sophisticated.
+cost subject $i$ incurs when tested and for which she needs to be reimbursed; or, it might be viewed as the incentive for $i$ to participate in the experiment; or, it might be the inherent value of the data.
+
+ This economic aspect has always been inherent in experimental design: experimenters often work within strict budgets and design creative incentives. However, we are not aware of a principled study of this setting from a strategic point of view. When subjects are strategic, they may have an incentive to misreport their cost, leading to the need for a sophisticated choice of experiments and payments.
Our contributions are as follows.
\begin{itemize}
\item
-We formulate the problem of experimental design subject to a given budget, in the presence of strategic agents who may lie about their costs. In particular, we focus on linear regression. This is naturally viewed as a budget feasible mechanism design problem, in which the objective function %is sophisticated and
-is related to the covariance of the $x_i$'s. In particular, we formulate the {\em Experimental Design Problem} (\EDP) as follows: the experimenter \E\ wishes to find set $S$ of subjects to maximize
+We formulate the problem of experimental design subject to a given budget, in the presence of strategic agents who may lie about their costs. %In particular, we focus on linear regression. This is naturally viewed as a budget feasible mechanism design problem, in which the objective function %is sophisticated and
+%is related to the covariance of the $x_i$'s.
 In particular, we formulate the {\em Strategic Experimental Design Problem} (\SEDP) as follows: the experimenter \E\ wishes to find a set $S$ of subjects to maximize
\begin{align}V(S) = \log\det\Big(I_d+\sum_{i\in S}x_i\T{x_i}\Big) \label{obj}\end{align}
-subject to a budget constraint $\sum_{i\in S}c_i\leq B$, where $B$ is \E's budget. %, and other {\em strategic constraints} we don't list here.
-The objective function, which is the key, is obtained by optimizing the information gain in $\beta$ when it is learned through linear regression methods, and is related to the so-called $D$-optimality criterion.
+subject to a budget constraint $\sum_{i\in S}c_i\leq B$, where $B$ is \E's budget, as well as the usual constraints of truthfulness and individual rationality. %, and other {\em strategic constraints} we don't list here.
+The objective function, which is central to our formulation, is formally obtained by optimizing the information gain in $\beta$ when the latter is learned through linear regression, and is known as the $D$-optimality criterion in the experimental design literature~\cite{pukelsheim2006optimal,atkinson2007optimum}.
\item
-The above objective is submodular.
-There are several recent results in budget feasible
-mechanisms~\cite{singer-mechanisms,chen,singer-influence,bei2012budget,dobz2011-mechanisms}, and some apply to the submodular optimization in
-\EDP.
-There is a randomized, 7.91-approximate polynomial time mechanism for maximizing a general submodular function that is universally truthful, \emph{i.e.}, it is sampled from a distribution among truthful mechanisms. Also, there is a $8.34$-approximate exponential time deterministic mechanism.
-There are however no known deterministic, truthful, polynomial time mechanisms for general submodular functions.
-For specific combinatorial problems such as \textsc{Knapsack} or \textsc{Coverage}, there exist
-deterministic, truthful, polynomial constant-approximation algorithms~\cite{singer-mechanisms,chen, singer-influence}, but they do not work for the linear-algebraic objective function in \EDP.
+We present the first known polynomial time truthful mechanism for \SEDP{}, yielding a constant factor ($\approx 12.98$) approximation. In contrast, we show that no truthful, budget-feasible mechanism can approximate \SEDP{} within a factor of 2.
+\end{itemize}
+
+We note that the objective \eqref{obj} is submodular. Using this fact, previous work by \citeN{singer-mechanisms} and \citeN{chen} on budget feasible mechanisms for submodular maximization yields an $8.34$-approximate deterministic mechanism for \SEDP{} that is not polynomial time unless P=NP. Alternatively, previous work by \citeN{chen} on general submodular objectives also yields a randomized, 7.91-approximate polynomial time mechanism for \SEDP{} that is, however, only \emph{universally truthful}, \emph{i.e.}, it is sampled from a distribution over truthful mechanisms. In contrast, our result is the first deterministic constant factor approximation mechanism for \SEDP{} that is both polytime and truthful.
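The submodularity claim can be spot-checked numerically. The Python sketch below (random features, purely illustrative) verifies the inequality $V(S)+V(T)\geq V(S\cup T)+V(S\cap T)$ for the objective \eqref{obj} on all pairs of size-3 subsets of a small instance:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
d, n = 3, 6
X = rng.normal(size=(n, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # ||x_i||_2 <= 1

def V(S):
    """log det(I_d + sum_{i in S} x_i x_i^T), the value of a set of subjects."""
    M = np.eye(d) + sum(np.outer(X[i], X[i]) for i in S)
    return np.log(np.linalg.det(M))

# Check V(S) + V(T) >= V(S u T) + V(S n T) on all pairs of size-3 subsets.
for A in combinations(range(n), 3):
    for Bset in combinations(range(n), 3):
        S, T = set(A), set(Bset)
        assert V(S) + V(T) >= V(S | T) + V(S & T) - 1e-9
```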
+% either a randomized, 7.91-approximate polynomial time mechanism for maximizing a general submodular function that is universally truthful, \emph{i.e.}, it is sampled from a distribution among truthful mechanisms.
+%There are several recent results in budget feasible
+%mechanisms~\cite{singer-mechanisms,chen,singer-influence,bei2012budget,dobz2011-mechanisms}, and some apply to the submodular optimization in
+%\EDP.
+%There is a randomized, 7.91-approximate polynomial time mechanism for maximizing a general submodular function that is universally truthful, \emph{i.e.}, it is sampled from a distribution among truthful mechanisms. Also, there is a $8.34$-approximate exponential time deterministic mechanism.
+%There are however no known deterministic, truthful, polynomial time mechanisms for general submodular functions.
Though such mechanisms were known to exist for combinatorial problems with specific submodular objectives such as \textsc{Knapsack} or \textsc{Coverage}~\cite{singer-mechanisms,chen,singer-influence}, these do not readily apply to the more complicated linear-algebraic objective function \eqref{obj} of \SEDP.
%{\bf S+T: could we verify that the above sentence is correct in its implication?}
-We present the first known, polynomial time truthful mechanism for \EDP{}. Our mechanism is a constant factor ($\approx 12.98$) approximation for \EDP{}. In contrast to this, we show that no truthful, budget-feasible algorithms are possible for \EDP{} within a factor 2 approximation. From a technical perspective, we present a convex relaxation of \eqref{obj}, and show that it is within a constant factor from the so-called multi-linear relaxation of \eqref{obj}. This allows us to adopt the approach followed by prior work in budget feasible mechanisms by Chen \emph{et al.}~\cite{chen} and Singer~\cite{singer-influence}. %{\bf FIX the last sentence}
+From a technical perspective, we present a convex relaxation of \eqref{obj}, and show that it is within a constant factor of the so-called multi-linear relaxation of \eqref{obj}. This allows us to adopt the approach followed in prior work on budget feasible mechanisms by Chen \emph{et al.}~\cite{chen} and Singer~\cite{singer-influence}. %{\bf FIX the last sentence}
-\item
-Our approach to mechanisms for experimental design --- by
- optimizing the information gain in parameters like $\beta$ which are estimated through the data analysis process --- is general. We give examples of this approach beyond linear regression to a general class that includes logistic regression and learning binary functions, and show that the corresponding budgeted mechanism design problem is also expressed through a submodular optimization. Hence, prior work \cite{chen,singer-mechanisms} immediately applies, and gives randomized, universally truthful, polynomial time, constant factor approximation mechanisms for problems in this class. Getting deterministic, truthful, polynomial time mechanisms with a constant approximation factor for this class or specific problems in it, like we did for \EDP, remains an open problem.
-\end{itemize}
+%\item
+%Our approach to mechanisms for experimental design --- by
+% optimizing the information gain in parameters like $\beta$ which are estimated through the data analysis process --- is general. We give examples of this approach beyond linear regression to a general class that includes logistic regression and learning binary functions, and show that the corresponding budgeted mechanism design problem is also expressed through a submodular optimization. Hence, prior work \cite{chen,singer-mechanisms} immediately applies, and gives randomized, universally truthful, polynomial time, constant factor approximation mechanisms for problems in this class. Getting deterministic, truthful, polynomial time mechanisms with a constant approximation factor for this class or specific problems in it, like we did for \EDP, remains an open problem.
In what follows, we describe related work in Section~\ref{sec:related}. We briefly review experimental design and budget feasible mechanisms in Section~\ref{sec:peel}, where we also define \SEDP\ formally. In Section~\ref{sec:main} we present our mechanism for \SEDP\ and prove our main results. Applications of our general framework are presented in Section~\ref{sec:ext}.
diff --git a/paper.tex b/paper.tex
index dfe876b..467b9ef 100644
--- a/paper.tex
+++ b/paper.tex
@@ -33,7 +33,7 @@ S. Muthukrishnan \affil{Rutgers University, Microsoft Research}}
\section{Preliminaries}\label{sec:peel}
\input{problem}
-\section{Mechanism for \EDP{}}
+\section{Mechanism for \SEDP{}}
\input{main}
\section{Extension to Other Problems}\label{sec:ext}
\input{general}
diff --git a/problem.tex b/problem.tex
index e820f85..546500d 100644
--- a/problem.tex
+++ b/problem.tex
@@ -1,7 +1,7 @@
\label{sec:prel}
\subsection{Experimental Design}
-The theory of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum} studies how an experimenter \E\ should select the parameters of a set of experiments she is about to conduct. In general, the optimality of a particular design depends on the purpose of the experiment, \emph{i.e.}, the quantity \E\ is trying to learn or the hypothesis she is trying to validate. Due to their ubiquity in statistical analysis, a large literature on the subject focuses on learning \emph{linear models}, where \E\ wishes to fit a linear map to the data she has collected.
+The theory of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum} studies how an experimenter \E\ should select the parameters of a set of experiments she is about to conduct. In general, the optimality of a particular design depends on the purpose of the experiment, \emph{i.e.}, the quantity \E\ is trying to learn or the hypothesis she is trying to validate. Due to their ubiquity in statistical analysis, a large literature on the subject focuses on learning \emph{linear models}, where \E\ wishes to fit a linear function to the data she has collected.
More precisely, putting cost considerations aside, suppose that \E\ wishes to conduct $k$ among $n$ possible experiments. Each experiment $i\in\mathcal{N}\defeq \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$, normalized so that $\|x_i\|_2\leq 1$. Denote by $S\subseteq \mathcal{N}$, where $|S|=k$, the set of experiments selected; upon its execution, experiment $i\in S$ reveals an output variable (the ``measurement'') $y_i$, related to the experiment features $x_i$ through a linear function, \emph{i.e.},
\begin{align}\label{model}
@@ -27,6 +27,7 @@ A variety of different value functions are used in experimental design~\cite{puk
\begin{align}
V(S) &= \frac{1}{2}\log\det \T{X_S}X_S \label{dcrit} %\\
\end{align}
+where we define $V(S)=-\infty$ when $\mathrm{rank}(\T{X_S}X_S)<d$, so that $V$ takes values in the extended reals.
As $\hat{\beta}$ is a multidimensional normal random variable, the
$D$-optimality criterion is equal (up to a constant) to the negative of the
entropy of $\hat{\beta}$. Hence, selecting a set of experiments $S$ that
@@ -39,8 +40,7 @@ the uncertainty on $\beta$, as captured by the entropy of its estimator.
%\end{align}
%There are several reasons
- Value function \eqref{dcrit} is undefined when $\mathrm{rank}(\T{X_S}X_S)<d$; in this case, we take $V(S)=-\infty$ (so that $V$ takes values in the extended reals).
-Note that \eqref{dcrit} is a submodular set function, \emph{i.e.},
+ Note that \eqref{dcrit} is a submodular set function, \emph{i.e.},
$V(S)+V(T)\geq V(S\cup T)+V(S\cap T)$ for all $S,T\subseteq \mathcal{N}$; it is also monotone, \emph{i.e.}, $V(S)\leq V(T)$ for all $S\subset T$. %In addition, the maximization of convex relaxations of this function is a well-studied problem \cite{boyd}.
\subsection{Budget Feasible Reverse Auctions}
@@ -71,8 +71,7 @@ In addition to the above, mechanism design in budget feasible reverse auctions s
\item \emph{Truthfulness.} An agent has no incentive to misreport her true cost. Formally, let $c_{-i}$
be a vector of costs of all agents except $i$. % $c_{-i}$ being the same). Let $[p_i']_{i\in \mathcal{N}} = p(c_i',
% c_{-i})$, then the
-A mechanism is truthful iff for every $i \in \mathcal{N}$ and every two cost vectors $(c_i,c_{-i})$ and $(c_i',c_{-i})$
- $p_i(c_i,c_{-i}) - s_i(c_i,c_{-i})\cdot c_i \geq p_i(c_i',c_{-i}) - s(c_i',c_{-i})\cdot c_i$.
+A mechanism is truthful iff $p_i(c_i,c_{-i}) - s_i(c_i,c_{-i})\cdot c_i \geq p_i(c_i',c_{-i}) - s_i(c_i',c_{-i})\cdot c_i$ for every $i \in \mathcal{N}$ and every two cost vectors $(c_i,c_{-i})$ and $(c_i',c_{-i})$.
\item \emph{Budget Feasibility.} The sum of the payments should not exceed the
budget constraint:
%\begin{displaymath}
@@ -85,7 +84,8 @@ A mechanism is truthful iff for every $i \in \mathcal{N}$ and every two cost vec
\begin{displaymath}
OPT \leq \alpha V(S).
\end{displaymath}
- The approximation ratio captures the \emph{price of truthfulness}, \emph{i.e.}, the relative value loss incurred by adding the truthfulness constraint. % to the value function maximization.
+ % The approximation ratio captures the \emph{price of truthfulness}, \emph{i.e.}, the relative value loss incurred by adding the truthfulness constraint. % to the value function maximization.
+ We stress here that the approximation ratio is defined with respect to \emph{the maximal value in the full-information case}.
\item \emph{Computational efficiency.} The allocation and payment function
should be computable in polynomial time in the number of
agents $n$. %\thibaut{Should we say something about the black-box model for $V$? Needed to say something in general, but not in our case where the value function can be computed in polynomial time}.
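As an illustration of how these requirements interact, the Python sketch below implements a simple proportional-share-style allocation rule (illustrative only, after Singer's mechanism for symmetric objectives; it is not the mechanism of this paper) and computes payments numerically as Myerson critical thresholds; the final loop spot-checks the truthfulness inequality defined above:

```python
import numpy as np

B = 10.0  # experimenter's budget (illustrative)

def allocate(costs):
    """Proportional-share-style rule: sort the bids and keep the largest
    prefix whose j-th cheapest bid is at most B/j (monotone in each bid)."""
    order = np.argsort(costs)
    k = 0
    for j, i in enumerate(order, start=1):
        if costs[i] <= B / j:
            k = j
    return set(int(i) for i in order[:k])

def threshold(costs, i, tol=1e-6):
    """Myerson critical payment: the largest bid at which i stays selected.
    It depends only on the other agents' bids, not on i's own report."""
    lo, hi = 0.0, B
    while hi - lo > tol:
        mid = (lo + hi) / 2
        bids = costs.copy()
        bids[i] = mid
        if i in allocate(bids):
            lo = mid
        else:
            hi = mid
    return lo

# Spot-check truthfulness: p_i(c) - s_i(c) c_i >= p_i(c') - s_i(c') c_i.
rng = np.random.default_rng(0)
costs = rng.uniform(0.5, 6.0, size=5)
S = allocate(costs)
for i in range(len(costs)):
    u_true = threshold(costs, i) - costs[i] if i in S else 0.0
    for bid in (0.5 * costs[i], 2.0 * costs[i]):
        bids = costs.copy()
        bids[i] = bid
        u_dev = threshold(bids, i) - costs[i] if i in allocate(bids) else 0.0
        assert u_true >= u_dev - 1e-4
```

Since the allocation rule is monotone (lowering one's bid never causes exclusion), paying each selected agent her critical threshold makes misreporting unprofitable, which is the content of the assertion above.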
@@ -130,7 +130,7 @@ biometric information (\emph{e.g.}, her red cell blood count, a genetic marker,
etc.). The cost $c_i$ is the amount the subject deems sufficient to
incentivize her participation in the study. Note that, in this setup, the feature vectors $x_i$ are public information that the experimenter can consult prior to the experiment design. Moreover, though a subject may lie about her true cost $c_i$, she cannot lie about $x_i$ (\emph{i.e.}, all features are verifiable upon collection) or $y_i$ (\emph{i.e.}, she cannot falsify her measurement).
-Ideally, motivated by the $D$-optimality criterion, we would like to design a mechanism that maximizes \eqref{dcrit} within a good approximation ratio. In what follows, we consider a slightly more general objective as follows:
+Ideally, motivated by the $D$-optimality criterion, we would like to design a mechanism that maximizes \eqref{dcrit} within a good approximation ratio. The latter objective, however, can take negative values, making it ill-suited for stating multiplicative approximation guarantees. As such, we consider the following full information problem:
\begin{center}
\textsc{ExperimentalDesignProblem} (EDP)
\begin{subequations}
@@ -140,19 +140,36 @@ Ideally, motivated by the $D$-optimality criterion, we would like to design a me
\end{align}\label{edp}
\end{subequations}
\end{center} where $I_d\in \reals^{d\times d}$ is the identity matrix.
-\junk{One possible interpretation of \eqref{modified} is that, prior to the auction, the experimenter has free access to $d$ experiments whose features form an orthonormal basis in $\reals^d$. However, \eqref{modified} can also be motivated in the context of \emph{Bayesian experimental design} \cite{chaloner1995bayesian}. In short, the objective \eqref{modified} arises naturally when the experimenter retrieves the model $\beta$ through \emph{ridge regression}, rather than the linear regression \eqref{leastsquares} over the observed data; we explore this connection in Section~\ref{sec:bed}.
-}
+The objective~\eqref{modified} is non-negative, and can be motivated in the context of \emph{Bayesian experimental design} \cite{chaloner1995bayesian}. In short, it arises naturally when the experimenter retrieves the model $\beta$ through \emph{ridge regression}, rather than the linear regression \eqref{leastsquares} over the observed data; we explore this connection in Section~\ref{sec:bed}.
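The ridge-regression connection can be checked numerically. Under a standard Gaussian prior $\beta\sim N(0,I_d)$ and unit-variance noise (assumptions made only for this sketch), the entropy reduction from observing the selected measurements equals $\frac{1}{2}\log\det(I_d+\T{X_S}X_S)$, \emph{i.e.}, the objective~\eqref{modified} up to a factor $1/2$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 3, 8
X = rng.normal(size=(k, d)) / np.sqrt(d)   # features of the selected experiments

# Prior beta ~ N(0, I_d) and unit-variance noise: the ridge-regression
# (Bayesian) posterior on beta is Gaussian with covariance (I_d + X^T X)^{-1}.
prior_cov = np.eye(d)
post_cov = np.linalg.inv(np.eye(d) + X.T @ X)

def gaussian_entropy(cov):
    m = cov.shape[0]
    return 0.5 * (m * np.log(2 * np.pi * np.e) + np.log(np.linalg.det(cov)))

# Entropy reduction (information gain) from observing the measurements:
info_gain = gaussian_entropy(prior_cov) - gaussian_entropy(post_cov)
assert np.isclose(info_gain, 0.5 * np.log(np.linalg.det(np.eye(d) + X.T @ X)))
```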
+
+
+
+
+%We present our results with this version of the objective function because it is simple and captures the versions
+%we will need later (including an arbitrary matrix in place of $I_d$ or a zero matrix which will correspond to ~\eqref{dcrit} ).
+
+%\begin{center}
+%\textsc{ExperimentalDesignProblem} (EDP)
+%\begin{subequations}
+%\begin{align}
+%\text{Maximize}\quad V(S) &= \log\det(I_d+\T{X_S}X_S) \label{modified} \\
+%\text{subject to}\quad \sum_{i\in S} p_i&\leq B\\
+%p_i&\geq c_i, \quad i\in S\\
+%\end{align}\label{edp}
+%\end{subequations}
+%\end{center}
-We present our results with this version of the objective function because it is simple and captures the versions
-we will need later (including an arbitrary matrix in place of $I_d$ or a zero matrix which will correspond to ~\eqref{dcrit} ).
%Note that maximizing \eqref{modified} is equivalent to maximizing \eqref{dcrit} in the full-information case. In particular, $\det(\T{X_S}X_S)> \det(\T{X_{S'}}X_{S'})$ iff $\det(I_d+\T{X_S}X_S)>\det(I_d+\T{X_{S'}}X_{S'})$. In addition,
\EDP{} \emph{and} the corresponding problem with objective \eqref{dcrit} are NP-hard; to see this, note that \textsc{Knapsack} reduces to EDP with dimension $d=1$, by mapping each item with value $v_i$ and weight $w_i$ to an experiment with $x_i=\sqrt{v_i}$ and cost $c_i=w_i$. Moreover, \eqref{modified} is submodular, monotone, and satisfies $V(\emptyset) = 0$.
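The $d=1$ correspondence can be verified by brute force on a toy instance: for scalar features, the objective is $\log(1+\sum_{i\in S}x_i^2)$, which is increasing in $\sum_{i\in S}x_i^2$, so maximizing it under the budget coincides with a \textsc{Knapsack} instance with values $x_i^2$ and weights $c_i$ (all numbers below are illustrative):

```python
import numpy as np
from itertools import chain, combinations

x = np.array([0.9, 0.7, 0.5, 0.3])   # scalar features (d = 1)
c = np.array([4.0, 3.0, 2.0, 1.0])   # costs
B = 5.0                              # budget

def subsets(n):
    return chain.from_iterable(combinations(range(n), r) for r in range(n + 1))

def V(S):
    # EDP objective for d = 1: log(1 + sum_{i in S} x_i^2)
    return np.log(1.0 + sum(x[i] ** 2 for i in S))

feasible = [S for S in subsets(len(x)) if sum(c[i] for i in S) <= B]
best_edp = max(feasible, key=V)
best_knap = max(feasible, key=lambda S: sum(x[i] ** 2 for i in S))  # Knapsack
assert set(best_edp) == set(best_knap)
```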
+Finally, we define the strategic version of \EDP{} (denoted by \SEDP) by adding the additional constraints of truthfulness, individual rationality, and normal payments, as described in the previous section.
+
%, allowing us to use the extensive machinery for the optimization of submodular functions, as well as recent results in the
% context of budget feasible auctions \cite{chen,singer-mechanisms}.
%\stratis{A potential problem is that all of the above properties hold also for, \emph{e.g.}, $V(S)=\log(1+\det(\T{X_S}X_S))$\ldots}
+
\begin{comment}\junk{
As \eqref{dcrit} may take arbitrarily small negative values, to define a meaningful approximation one would consider the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$. %, for some strictly increasing, on-to function $f:\reals_+\to\reals_+$.
However, the following lower bound implies that such an optimization goal cannot be attained under the constraints of truthfulness, budget feasibility, and individual rationality.
diff --git a/related.tex b/related.tex
index 16b2309..cf27b32 100644
--- a/related.tex
+++ b/related.tex
@@ -1,27 +1,32 @@
\section{Related work}
\label{sec:related}
-Budget feasible mechanism design was originally proposed by Singer \cite{singer-mechanisms}. Singer considers the problem of maximizing an arbitrary submodular function subject to a budget constraint in the \emph{value query} model, \emph{i.e.} assuming an oracle providing the value of the submodular objective on any given set.
- Singer shows that there exists a randomized, 112-approximation mechanism for submodular maximization that is \emph{universally truthful} (\emph{i.e.}, it is a randomized mechanism sampled from a distribution over truthful mechanisms). Chen \emph{et al.}~\cite{chen} improve this result by providing a 7.91-approximate mechanism, and show a corresponding lower bound of $2$ among universally truthful mechanisms for submodular maximization.
+\subsection{Budget Feasible Mechanisms for General Submodular Functions}
+Budget feasible mechanism design was originally proposed by \citeN{singer-mechanisms}. Singer considers the problem of maximizing an arbitrary submodular function subject to a budget constraint in the \emph{value query} model, \emph{i.e.}, assuming an oracle providing the value of the submodular objective on any given set.
+ Singer shows that there exists a randomized, 112-approximation mechanism for submodular maximization that is \emph{universally truthful} (\emph{i.e.}, it is a randomized mechanism sampled from a distribution over truthful mechanisms). \citeN{chen} improve this result by providing a 7.91-approximate mechanism, and show a corresponding lower bound of $2$ among universally truthful mechanisms for submodular maximization.
-\sloppy
-In contrast to the above results, no deterministic, truthful, constant approximation mechanism that runs in polynomial time is presently known for submodular maximization. However, assuming access to an oracle providing the optimum in the full-information setup, Chen \emph{et al.},~provide a truthful, $8.34$-approximate mechanism; in cases for which the full information problem is NP-hard, as the one we consider here, this mechanism is not poly-time, unless P=NP. Chen \emph{et al.}~also prove a $1+\sqrt{2}$ lower bound for truthful mechanisms, improving upon an earlier bound of 2 by Singer \cite{singer-mechanisms}.
+In contrast to the above results, no deterministic, truthful, constant approximation mechanism that runs in polynomial time is presently known for submodular maximization. However, assuming access to an oracle providing the optimum in the full-information setup, Chen \emph{et al.}~provide a truthful, $8.34$-approximate mechanism; in cases for which the full information problem is NP-hard, such as the one we consider here, this mechanism is not poly-time, unless P=NP. Chen \emph{et al.}~also prove a $1+\sqrt{2}$ lower bound for truthful mechanisms, improving upon an earlier bound of 2 by \citeN{singer-mechanisms}.
-\fussy
-Improved bounds, as well as deterministic polynomial mechanisms, are known for specific submodular objectives. For symmetric submodular functions, a truthful mechanism with approximation ratio 2 is known, and this ratio is tight \cite{singer-mechanisms}. Singer also provides a 7.32-approximate truthful mechanism for the budget feasible version of \textsc{Matching}, and a corresponding lower bound of 2 \cite{singer-mechanisms}. Improving an earlier result by Singer, Chen \emph{et al.}~\cite{chen}, give a truthful, $2+\sqrt{2}$-approximate mechanism for \textsc{Knapsack}, and a lower bound of $1+\sqrt{2}$. Finally, a truthful, 31-approximate mechanism is also known for the budgeted version of \textsc{Coverage} \cite{singer-mechanisms,singer-influence}.
+\subsection{Budget Feasible Mechanism Design on Specific Problems}
+Improved bounds, as well as deterministic polynomial time mechanisms, are known for specific submodular objectives in the value query model. For symmetric submodular functions, a truthful mechanism with approximation ratio 2 is known, and this ratio is tight \cite{singer-mechanisms}. Singer also provides a 7.32-approximate truthful mechanism for the budget feasible version of \textsc{Matching}, and a corresponding lower bound of 2 \cite{singer-mechanisms}. Improving an earlier result by Singer, \citeN{chen} give a truthful, $2+\sqrt{2}$-approximate mechanism for \textsc{Knapsack}, and a lower bound of $1+\sqrt{2}$. Finally, a truthful, 31-approximate mechanism is also known for the budgeted version of \textsc{Coverage} \cite{singer-mechanisms,singer-influence}. Our results therefore extend the set of problems for which a deterministic, polynomial time mechanism is known to include \SEDP.
+
+\subsection{Beyond Submodular Objectives}
Beyond submodular objectives, it is known that no truthful mechanism with approximation ratio smaller than $n^{1/2-\epsilon}$ exists for maximizing fractionally subadditive functions (a class that includes submodular functions) assuming access to a value query oracle~\cite{singer-mechanisms}. Assuming access to a stronger oracle (the \emph{demand} oracle), there exists
a truthful, $O(\log^3 n)$-approximate mechanism
\cite{dobz2011-mechanisms} as well as a universally truthful, $O(\frac{\log n}{\log \log n})$-approximate mechanism for subadditive maximization
\cite{bei2012budget}. Moreover, in a Bayesian setup, assuming a prior distribution among the agent's costs, there exists a truthful mechanism with a 768/512-approximation ratio \cite{bei2012budget}. %(in terms of expectations)
+
+
+\subsection{Data Markets}
A series of recent papers \cite{mcsherrytalwar,approximatemechanismdesign,xiao:privacy-truthfulness,chen:privacy-truthfulness} consider the related problem of retrieving data from an \textit{unverified} database: the auctioneer cannot verify the data reported by individuals and therefore must incentivize them to report truthfully.
-McSherry and Talwar \cite{mcsherrytalwar} argue that \emph{differentially private} mechanisms offer a form of \emph{approximate truthfulness}: if users have a utility that depends on their privacy, reporting their data untruthfully can only increase their utility by a small amount. Xiao \cite{xiao:privacy-truthfulness}, improving upon earlier work by Nissim \emph{et al.}~\cite{approximatemechanismdesign}, constructs mechanisms that
+\citeN{mcsherrytalwar} argue that \emph{differentially private} mechanisms offer a form of \emph{approximate truthfulness}: if users have a utility that depends on their privacy, reporting their data untruthfully can only increase their utility by a small amount. \citeN{xiao:privacy-truthfulness}, improving upon earlier work by~\citeN{approximatemechanismdesign}, constructs mechanisms that
simultaneously achieve exact truthfulness as well as differential privacy. Eliciting private data through a \emph{survey} \cite{roth-liggett}, whereby individuals first decide whether to participate in the survey and then report their data,
- also falls under the unverified database setting \cite{xiao:privacy-truthfulness}. In the \emph{verified} database setting, Ghosh and Roth~\cite{ghosh-roth:privacy-auction} and Dandekar \emph{et al.}~\cite{pranav} consider budgeted auctions where users have a utility again captured by differential privacy. Our work departs from the above setups in that utilities do not involve privacy, whose effects are assumed to be internalized in the costs reported by the users; crucially, we also assume that experiments are tamper-proof, and individuals can misreport their costs but not their values.
+ also falls under the unverified database setting \cite{xiao:privacy-truthfulness}. In the \emph{verified} database setting, \citeN{ghosh-roth:privacy-auction} and~\citeN{pranav} consider budgeted auctions where users have a utility again captured by differential privacy. Our work departs from the above setups in that utilities do not involve privacy, whose effects are assumed to be internalized in the costs reported by the users; crucially, we also assume that experiments are tamper-proof, and individuals can misreport their costs but not their values.
\sloppy
-Our work is closest to the survey setup of Roth and Schoenebeck~\cite{roth-schoenebeck}, who also consider how to sample individuals with different features who report a hidden value at a certain cost. The authors assume a prior on the joint distribution between costs and features, and wish to obtain an unbiased estimate of the expectation of the hidden value under the constraints of truthfulness, budget feasibility and individual rationality. Our work departs by learning a more general statistic (a linear model) than data means. We note that, as in \cite{roth-schoenebeck}, costs and features can be arbitrarily correlated (our results are prior-free).
+Our work is closest to the survey setup of~\citeN{roth-schoenebeck}, who also consider how to sample individuals with different features who report a hidden value at a certain cost. The authors assume a prior on the joint distribution between costs and features, and wish to obtain an unbiased estimate of the expectation of the hidden value under the constraints of truthfulness, budget feasibility and individual rationality. Our work departs by learning a more general statistic (a linear model) than data means. We note that, as in \cite{roth-schoenebeck}, costs and features can be arbitrarily correlated (our results are prior-free).
\fussy