summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorStratis Ioannidis <stratis@stratis-Latitude-E6320.(none)>2013-02-11 09:37:09 -0800
committerStratis Ioannidis <stratis@stratis-Latitude-E6320.(none)>2013-02-11 09:37:09 -0800
commit8c09cfd7da709aab03fb004b58ecd8e1eb4fb553 (patch)
treefe24e8514094cfcd172fce175bf6df60d0031d9a
parent114a6b8eac3e6addebe84b831c5eafbec7bc9ef4 (diff)
downloadrecommendation-8c09cfd7da709aab03fb004b58ecd8e1eb4fb553.tar.gz
muthu
-rw-r--r--abstract.tex19
-rw-r--r--intro.tex31
-rw-r--r--problem.tex30
-rw-r--r--related.tex8
4 files changed, 57 insertions, 31 deletions
diff --git a/abstract.tex b/abstract.tex
index f005dbf..eebeb97 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -1,21 +1,24 @@
%We initiate the study of mechanisms for \emph{experimental design}.
In the classical {\em experimental design} setting,
-an experimenter \E\ with a budget $B$ has access to a population of $n$ potential experiment subjects $i\in \{1,\ldots,n\}$, each associated with a vector of features $x_i\in\reals^d$ as well as a cost $c_i>0$.
+an experimenter \E\
+%with a budget $B$
+has access to a population of $n$ potential experiment subjects $i\in \{1,\ldots,n\}$, each associated with a vector of features $x_i\in\reals^d$.
+%as well as a cost $c_i>0$.
Conducting an experiment with subject $i$ reveals an unknown value $y_i\in \reals$ to \E. \E\ typically assumes some
hypothetical relationship between $x_i$'s and $y_i$'s, \emph{e.g.}, $y_i \approx \T{\beta} x_i$, and estimates
$\beta$ from experiments.
%conducting the experiments and obtaining the measurements $y_i$ allows
%\E\ can estimate $\beta$.
-\E 's goal is to select which experiments to conduct, subject to her budget constraint.
+As a proxy for various practical constraints, \E{} may select subjects to select for the experiments.
+%\E 's goal is to select which experiments to conduct, subject to her budget constraint.
%, to obtain the best estimate possible for $\beta$.
-We initiate the study of mechanisms for experimental design. In this setting,
-subjects are \emph{strategic} and may lie about their costs. In particular, we formulate the {\em Experimental Design Problem} (\EDP) as finding a set $S$ of subjects that maximize $V(S) = \log\det(I_d+\sum_{i\in S}x_i\T{x_i})$ under the constraint $\sum_{i\in S}c_i\leq B$; our objective function corresponds to the information gain in $\beta$ when it is learned through linear regression methods, and is related to the so-called $D$-optimality criterion. We present the first known
-deterministic, polynomial time, truthful, budget feasible mechanism for \EDP{}.
-Our mechanism yields a constant factor ($\approx 19.68$) approximation, and we show that no truthful, budget-feasible algorithms are possible within a factor 2 approximation.
-Our approach here generally applies to a wider class of learning problems and
-obtains polynomial time universally truthful (\emph{i.e.}, randomized) budget feasible mechanism, also within a constant factor approximation.
+We initiate the study of budgeted mechanisms for experimental design. In this setting, \E{} has a budget $B$.
+Each subject $i$ declares associated cost $c_i >0$ to be part of the experiment, and must be paid at least their cost. Further, the subjects
+are \emph{strategic} and may lie about their costs . In particular, we formulate the {\em Strategic Experimental Design Problem} (\SEDP) as finding a set $S$ of subjects for the experiment that maximizes $V(S) = \log\det(I_d+\sum_{i\in S}x_i\T{x_i})$ under the constraint $\sum_{i\in S}c_i\leq B$; our objective function corresponds to the information gain in parameter $\beta$ that is learned through linear regression methods, and is related to the so-called $D$-optimality criterion.
+We present a deterministic, polynomial time, truthful, budget feasible mechanism for \EDP{}.
+By applying previous work on budget feasible mechanisms with submodular objective, one could have derived either an exponential time deterministic mechanism or a randomized polynomial time mechanism. Our mechanism yields a constant factor ($\approx 12.68$) approximation, and we show that no truthful, budget-feasible algorithms are possible within a factor $2$ approximation. We also show how to apply our approach to a wide class of learning problems.
diff --git a/intro.tex b/intro.tex
index dd48aa7..ce66c1d 100644
--- a/intro.tex
+++ b/intro.tex
@@ -8,34 +8,40 @@ Each subject $i\in \{1,\ldots,n\}$ is associated with a set of parameters (or f
known to the experimenter.
\E\ wishes to measure a certain inherent property of the subjects by performing an experiment: the outcome $y_i$ of the experiment on a subject $i$ is unknown to \E\ before the experiment is performed.
-Typically, \E\ has a hypothesis on the relationship between $x_i$'s and $y_i$'s. Due to its simplicity, as well as its ubiquity in statistical analysis, a large body of work has focused on linear hypotheses: \emph{i.e.}, it is assumed that there exists a $\beta\in\reals^d$ such that $y_i = \T{\beta} x_i+\varepsilon_i$, for all $i\in \{1,\ldots,n\},$ where $\varepsilon_i$ are zero-mean, i.i.d.~random variables. Conducting the experiments and obtaining the measurements $y_i$ lets \E\ estimate $\beta$, \emph{e.g.}, through linear regression. %, \emph{i.e.}, the model underlying the data, and the experimenter's goal is to obtain such an estimate as accurately as possible. %The goal of experimental design amounts to determining which subjects to experiment upon to produce the best possible such estimate.
+Typically, \E\ has a hypothesis on the relationship between $x_i$'s and $y_i$'s. Due to its simplicity, as well as its ubiquity in statistical analysis, a large body of work has focused on linear hypotheses: \emph{i.e.}, it is assumed that there exists a $\beta\in\reals^d$ such that
+$$y_i = \T{\beta} x_i+\varepsilon_i,$$ for all $i\in \{1,\ldots,n\},$ where $\varepsilon_i$ are zero-mean, i.i.d.~random variables. Conducting the experiments and obtaining the measurements $y_i$ lets \E\ estimate $\beta$, \emph{e.g.}, through linear regression. %, \emph{i.e.}, the model underlying the data, and the experimenter's goal is to obtain such an estimate as accurately as possible. %The goal of experimental design amounts to determining which subjects to experiment upon to produce the best possible such estimate.
The above experimental design scenario has many applications. Regression over personal data collected through surveys or experimentation is the cornerstone of marketing research, as well as research in a variety of experimental sciences such as medicine and sociology. Crucially, the statistical analysis of user data is also a widely spread practice among Internet companies, which routinely use machine learning techniques over vast records of user data to perform inference and classification tasks integral to their daily operations.
Beyond linear regression, there is a rich literature about estimation procedures, as well as for means of quantifying the quality of the produced estimate~\cite{pukelsheim2006optimal}. There is also an extensive theory on how to select subjects
if \E\ can conduct only a limited number of experiments, so the estimation process returns $\beta$
that approximates the true parameter of the underlying population \cite{ginebra2007measure,le1996comparison,chaloner1995bayesian,boyd2004convex}.
-We depart from this classical setup by viewing experimental design in a strategic setting, and by studying mechanism design issues.
+We depart from this classical setup by viewing experimental design in a strategic setting, and by studying budgeted mechanism design issues.
In our setting, experiments cannot be manipulated and hence measurements are reliable. %\footnote{Thus, the experiments of our interest are statistically significant ones where each experiment provides a reliable outcome.}
-However, there
-is a cost $c_i$ associated with experimenting on
-subject $i$ which varies from subject to subject. This may be viewed as the
-cost subject $i$ incurs when tested and for which she needs to be reimbursed; or, it might be viewed as the incentive for $i$ to participate in the experiment; or, it might be the intrinsic worth of the data to the user.
-
- This economic aspect has always been inherent in experimental design: experimenters often work within strict budgets and design creative incentives. However, we are not aware of a principled study of this setting from a strategic point of view. When subjects are strategic, they may have an incentive to misreport their cost, leading to the need for a sophisticated choice of experiments and payments. Arguably, user incentiviation is of particular pertinence due to the extent of statistical analysis over user data on the Internet. %, which has led to the rise of several different research efforts in studying data markets \cite{...}.
+ \E{} has a total budget of $B$ to conduct all the experiments.
+There is a cost $c_i$ associated with experimenting on
+subject $i$ which varies from subject to subject. This cost $c_i$ is determined by the subject $i$: it may be viewed as the
+cost $i$ incurs when tested and for which she needs to be reimbursed; or, it might be viewed as the incentive for $i$ to participate in the experiment; or, it might be the intrinsic worth of the data to the user. The economic aspect of paying subjects has always been inherent in experimental design: experimenters often work within strict budgets and design creative incentives. Subjects often negotiate better incentives or higher payments.
+However, we are not aware of a principled study of this setting from a strategic point of view, when subjects declare their costs and therefore determine their payment. Such a setting is increasingly realistic, given the growth of these experiments over the Internet and associated data markets.
+% When subjects are strategic, they may have an incentive to misreport their cost, leading to the need for a sophisticated choice of experiments and payments. Arguably, user incentiviation is of particular pertinence due to the extent of statistical analysis over user data on the Internet. %, which has led to the rise of several different research efforts in studying data markets \cite{...}.
Our contributions are as follows.
\begin{itemize}
\item
-We formulate the problem of experimental design subject to a given budget, in the presence of strategic agents who may lie about their costs. %In particular, we focus on linear regression. This is naturally viewed as a budget feasible mechanism design problem, in which the objective function %is sophisticated and
+We initiate the study of experimental design problem in presence of budgets and strategic subjects.
+%formulate the problem of experimental design subject to a given budget, in the presence of strategic agents who may lie about their costs. %In particular, we focus on linear regression. This is naturally viewed as a budget feasible mechanism design problem, in which the objective function %is sophisticated and
%is related to the covariance of the $x_i$'s.
In particular, we formulate the {\em Strategic Experimental Design Problem} (\SEDP) as follows: the experimenter \E\ wishes to find set $S$ of subjects to maximize
\begin{align}V(S) = \log\det\Big(I_d+\sum_{i\in S}x_i\T{x_i}\Big) \label{obj}\end{align}
subject to a budget constraint $\sum_{i\in S}c_i\leq B$, where $B$ is \E's budget, as well as the usual constraints of truthfulness and individual rationality. %, and other {\em strategic constraints} we don't list here.
+
+\smallskip
The objective function, which is the key, is formally obtained by optimizing the information gain in $\beta$ when the latter is learned through linear regression, and is related to the so-called $D$-optimality criterion~\cite{pukelsheim2006optimal,atkinson2007optimum}.
\item
-We present the first known polynomial time truthful mechanism for \SEDP{}, yielding a constant factor ($\approx 12.98$) approximation. In contrast to this, we show that no truthful, budget-feasible algorithms are possible for \SEDP{} within a factor 2 approximation.
+We present a polynomial time, truthful mechanism for \SEDP{}, yielding a constant factor ($\approx 12.98$) approximation. In contrast to this, we show that no truthful, budget-feasible algorithms are possible for \SEDP{} within a factor 2 approximation.
+
+\smallskip
We note that the objective \eqref{obj} is submodular. Using this fact, applying previous results for budget feasible mechanisms under general submodular objectives would yield either a truthful deterministic mechanism that requires exponential time, or a poly-time algorithm that is not deterministic~\cite{singer-mechanisms,chen}.
\end{itemize}
@@ -51,7 +57,10 @@ We note that the objective \eqref{obj} is submodular. Using this fact, applying
%Though such mechanisms were known to exist for combinatorial problems with specific submodular objectives such as \textsc{Knapsack} or \textsc{Coverage}~\cite{singer-mechanisms,chen, singer-influence}, these do not readily apply to the more complicated linear-algebraic objective function \eqref{obj} of \SEDP.
%{\bf S+T: could we verify that the above sentence is correct in its implication?}
-From a technical perspective, we present a convex relaxation of \eqref{obj}, and show that it is within a constant factor from the so-called multi-linear relaxation of \eqref{obj}. This allows us to adopt the approach followed by prior work in budget feasible mechanisms by Chen \emph{et al.}~\cite{chen} and Singer~\cite{singer-influence}. %{\bf FIX the last sentence}
+From a technical perspective, we present a convex relaxation of \eqref{obj}, and show that it is within a constant factor from the so-called multi-linear relaxation of \eqref{obj}.
+
+
+%This allows us to adopt the approach followed by prior work in budget feasible mechanisms by Chen \emph{et al.}~\cite{chen} and Singer~\cite{singer-influence}. %{\bf FIX the last sentence}
%\item
%Our approach to mechanisms for experimental design --- by
diff --git a/problem.tex b/problem.tex
index 668cd57..3ad3270 100644
--- a/problem.tex
+++ b/problem.tex
@@ -13,7 +13,7 @@ For example, each $i$ may correspond to a human subject; the
feature vector $x_i$ may correspond to a normalized vector of her age, weight,
gender, income, \emph{etc.}, and the measurement $y_i$ may capture some
biometric information (\emph{e.g.}, her red cell blood count, a genetic marker,
-etc.). The magnitude of the coefficient $\beta_i$ captures the effect that feature $i$ has on the measured variable, and its sign captures whether the corellation is positive or negative.
+etc.). The magnitude of the coefficient $\beta_i$ captures the effect that feature $i$ has on the measured variable, and its sign captures whether the correlation is positive or negative.
The purpose of these experiments is to allow \E\ to estimate the model $\beta$. In particular,
@@ -35,12 +35,15 @@ This optimization, commonly known as \emph{ridge regression}, includes an additi
%The estimator $\hat{\beta}$ is unbiased, \emph{i.e.}, $\expt{\hat{\beta}} = \beta$ (where the expectation is over the noise variables $\varepsilon_i$). Furthermore, $\hat{\beta}$ is a multidimensional normal random variable with mean $\beta$ and covariance matrix $(X_S\T{X_S})^{-1}$.
Note that the estimator $\hat{\beta}$ is a linear map of $y_S$; as $y_S$ is
a multidimensional normal r.v., so is $\hat{\beta}$ (the randomness coming from
-the noise terms $\varepsilon_i$). In particular, $\hat{\beta}$ has %mean $\beta$% (\emph{i.e.}, it is an \emph{unbiased estimator}) and
- covariance
-$(R+\T{X_S}X_S)^{-1}$.
+the noise terms $\varepsilon_i$).
+%In particular, $\hat{\beta}$ has %mean $\beta$% (\emph{i.e.}, it is an \emph{unbiased estimator}) and
+% covariance
+%$(R+\T{X_S}X_S)^{-1}$.
Let $V:2^\mathcal{N}\to\reals$ be a \emph{value function}, quantifying how informative a set of experiments $S$ is in estimating $\beta$. The classical experimental design problem amounts to finding a set $S$ that maximizes $V(S)$ subject to the constraint $|S|\leq k$.
-A variety of different value functions are used in literature~\cite{pukelsheim2006optimal}; almost all make use of the covariance $(R+\T{X_S}X_S)^{-1}$ of the estimator $\hat{\beta}$. A value function that has natural advantages is the \emph{information gain}: %\emph{$D$-optimality criterion}: %which yields the following optimization problem
+A variety of different value functions are used in literature~\cite{pukelsheim2006optimal}.
+%; almost all make use of the covariance $(R+\T{X_S}X_S)^{-1}$ of the estimator $\hat{\beta}$.
+A value function that has natural advantages is the \emph{information gain}: %\emph{$D$-optimality criterion}: %which yields the following optimization problem
\begin{align}
V(S)= I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S). \label{informationgain}
\end{align}
@@ -52,7 +55,7 @@ Under the linear model \eqref{model}, and the Gaussian prior, the information ga
\begin{align}
V(S) &= \frac{1}{2}\log\det(R+ \T{X_S}X_S) \label{dcrit} %\\
\end{align}
-which is indeed a function of the covariance matrix $(R+\T{X_S}X_S)^{-1}$.
+%which is indeed a function of the covariance matrix $(R+\T{X_S}X_S)^{-1}$.
%defined as $-\infty$ when $\mathrm{rank}(\T{X_S}X_S)<d$.
%As $\hat{\beta}$ is a multidimensional normal random variable, the
%$D$-optimality criterion is equal (up to a constant) to the negative of the
@@ -66,14 +69,17 @@ which is indeed a function of the covariance matrix $(R+\T{X_S}X_S)^{-1}$.
%\end{align}
%There are several reasons
%In addition, the maximization of convex relaxations of this function is a well-studied problem \cite{boyd}.
-Our analysis will focus on the case of a \emph{homotropic} prior, in which the prior covariance is the identity matrix, \emph{i.e.}, $$R=I_d\in \reals^{d\times d}.$$ Intuitively, this corresponds to the simplest prior, in which no direction of $\reals^d$ is a priori favored; equivalently, it also corresponds to the case where ridge regression estimation \eqref{ridge} performed by $\E$ has a penalty term $\norm{\beta}_2^2$.
+Our analysis will focus on the case of a \emph{homotropic} prior, in which the prior covariance is the identity matrix, \emph{i.e.}, $R=I_d\in \reals^{d\times d}.$ Intuitively, this corresponds to the simplest prior, in which no direction of $\reals^d$ is a priori favored; equivalently, it also corresponds to the case where ridge regression estimation \eqref{ridge} performed by $\E$ has a penalty term $\norm{\beta}_2^2$. In Section 5, we will address other $R$'s.
%Note that \eqref{dcrit} is a submodular set function, \emph{i.e.},
%$V(S)+V(T)\geq V(S\cup T)+V(S\cap T)$ for all $S,T\subseteq \mathcal{N}$; it is also monotone, \emph{i.e.}, $V(S)\leq V(T)$ for all $S\subset T$.
\subsection{Budgeted-Feasible Experimental Design: Full Information Case}
-We depart from the above classic experimental design setting by assuming that each experiment is associated with a cost $c_i\in\reals_+$. Moreover, the experimenter $\E$ is limited by a budget $B\in \reals_+$.
+Beyond the cardinality constraint in classical experimental design discussed above, is
+the budgeted version.
+%depart from the above classic experimental design setting by assuming that
+Each experiment is associated with a cost $c_i\in\reals_+$. Moreover, the experimenter $\E$ is limited by a budget $B\in \reals_+$.
The cost $c_i$ can capture, \emph{e.g.}, the amount the subject $i$ deems sufficient to
incentivize her participation in the experiment.
In the full-information case, the experiment costs are common knowledge; as such, the optimization problem that the experimenter wishes to solve is:
@@ -104,6 +110,14 @@ for all $S\subseteq\mathcal{N}$, since the matrix $\T{X_S}X_S$ is positive semi-
the optimal value achievable in the full-information case.
\subsection{Experimental Design Problem: Strategic Case}
+We study the strategic case when $c_i$'s are specified by the subjects and they are strategic.
+In this case, the costs $c_i$ are {\em not} common knowledge and they are reported by the
+% their reporting can be manipulated by the
+experiment subjects.
+ Moreover, though a subject may lie about her true cost $c_i$, she cannot lie about $x_i$ (\emph{i.e.}, all public features are verifiable prior to the experiment) nor $y_i$ (\emph{i.e.}, the subject cannot falsify her measurement).
+The subjects report $c_i$ in order to maximize their expected {\em utility} which is
+$p_i-c_i$.
+
When the experiment subjects are strategic, the experimental design problem becomes a special case of a \emph{budget feasible reverse auction}, as introduced by \citeN{singer-mechanisms}. %comprises a set of items $\mathcal{N}=\{1,\ldots,n\}$ as well as a single buyer. Each item has a cost $c_i\in\reals_+$. Moreover, the buyer has a positive value function $V:2^{\mathcal{N}}\to \reals_+$, as well as a budget $B\in \reals_+$. In the \emph{full information case}, the costs $c_i$ are common
%knowledge; the objective of the buyer in this context is to select a set $S$
diff --git a/related.tex b/related.tex
index a4956e9..e2c3138 100644
--- a/related.tex
+++ b/related.tex
@@ -20,10 +20,10 @@ a truthful, $O(\log^3 n)$-approximate mechanism
\subsection{Data Markets}
- A series of recent papers \cite{mcsherrytalwar,approximatemechanismdesign,xiao:privacy-truthfulness,chen:privacy-truthfulness} consider the related problem of retrieving data from an \textit{unverified} database: the auctioneer cannot verify the data reported by individuals and therefore must incentivize them to report truthfully.
-\citeN{mcsherrytalwar} argue that \emph{differentially private} mechanisms offer a form of \emph{approximate truthfulness}: if users have a utility that depends on their privacy, reporting their data untruthfully can only increase their utility by a small amount. \citeN{xiao:privacy-truthfulness}, improving upon earlier work by~\citeN{approximatemechanismdesign}, constructs mechanisms that
-simultaneously achieve exact truthfulness as well as differential privacy. Eliciting private data through a \emph{survey} \cite{roth-liggett}, whereby individuals first decide whether to participate in the survey and then report their data,
- also falls under the unverified database setting \cite{xiao:privacy-truthfulness}. In the \emph{verified} database setting, \citeN{ghosh-roth:privacy-auction} and~\citeN{pranav} consider budgeted auctions where users have a utility again captured by differential privacy. Our work departs from the above setups in that utilities do not involve privacy, whose effects are assumed to be internalized in the costs reported by the users; crucially, we also assume that experiments are tamper-proof, and individuals can misreport their costs but not their values.
+ A series of recent papers \cite{mcsherrytalwar,approximatemechanismdesign,xiao:privacy-truthfulness,chen:privacy-truthfulness} consider the related problem of retrieving data from an \textit{unverified} database, where strategic users may misreport their data to a data analyst to ensure their privacy. \citeN{mcsherrytalwar} argue that \emph{differentially private} mechanisms offer a form of \emph{approximate truthfulness}: if users have a utility that depends on their privacy, reporting their data untruthfully can only increase their utility by a small amount. %\citeN{xiao:privacy-truthfulness}, improving upon earlier work by~\citeN{approximatemechanismdesign}, constructs mechanisms that simultaneously achieve exact truthfulness as well as differential privacy.
+We depart by assuming that experiment outcomes are tamper-proof, cannot be manipulated.
+A different set of papers \cite{ghosh-roth:privacy-auction,roth-liggett,pranav} consider a setting where data cannot be misreported, but the utility of users is a function of the differential privacy guarantee an analyst provides them. In contrast, any privacy costs in our setup are internalized in the costs $c_i$. %Eliciting private data through a \emph{survey} \cite{roth-liggett}, whereby individuals first decide whether to participate in the survey and then report their data,
+% also falls under the unverified database setting \cite{xiao:privacy-truthfulness}. In the \emph{verified} database setting, \citeN{ghosh-roth:privacy-auction} and~\citeN{pranav} consider budgeted auctions where users have a utility again captured by differential privacy. Our work departs from the above setups in that utilities do not involve privacy, whose effects are assumed to be internalized in the costs reported by the users; crucially, we also assume that experiments are tamper-proof, and individuals can misreport their costs but not their values.
\sloppy
Our work is closest to the survey setup of~\citeN{roth-schoenebeck}, who also consider how to sample individuals with different features who report a hidden value at a certain cost. The authors assume a prior on the joint distribution between costs and features, and wish to obtain an unbiased estimate of the expectation of the hidden value under the constraints of truthfulness, budget feasibility and individual rationality. Our work departs by learning a more general statistic (a linear model) than data means. We note that, as in \cite{roth-schoenebeck}, costs and features can be arbitrarily correlated (our results are prior-free).