From 335d2bdfbfc406be36f3f1f240733cf7094b6147 Mon Sep 17 00:00:00 2001
From: Stratis Ioannidis
Date: Sun, 10 Feb 2013 22:29:04 -0800
Subject: merging 3

---
 problem.tex | 81 ++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 59 insertions(+), 22 deletions(-)

diff --git a/problem.tex b/problem.tex
index 546500d..6027379 100644
--- a/problem.tex
+++ b/problem.tex
@@ -1,50 +1,87 @@
 \label{sec:prel}
-\subsection{Experimental Design}
-
-The theory of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum} studies how an experimenter \E\ should select the parameters of a set of experiments she is about to conduct. In general, the optimality of a particular design depends on the purpose of the experiment, \emph{i.e.}, the quantity \E\ is trying to learn or the hypothesis she is trying to validate. Due to their ubiquity in statistical analysis, a large literature on the subject focuses on learning \emph{linear models}, where \E\ wishes to fit a linear function to the data she has collected.
+\subsection{Linear Regression and Experimental Design}
 
-More precisely, putting cost considerations aside, suppose that \E\ wishes to conduct $k$ among $n$ possible experiments. Each experiment $i\in\mathcal{N}\defeq \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$, normalized so that $\|x_i\|_2\leq 1$. Denote by $S\subseteq \mathcal{N}$, where $|S|=k$, the set of experiments selected; upon its execution, experiment $i\in S$ reveals an output variable (the ``measurement'') $y_i$, related to the experiment features $x_i$ through a linear function, \emph{i.e.},
+The theory of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum,chaloner1995bayesian} considers the following formal setting. % studies how an experimenter \E\ should select the parameters of a set of experiments she is about to conduct. In general, the optimality of a particular design depends on the purpose of the experiment, \emph{i.e.}, the quantity \E\ is trying to learn or the hypothesis she is trying to validate. Due to their ubiquity in statistical analysis, a large literature on the subject focuses on learning \emph{linear models}, where \E\ wishes to fit a linear function to the data she has collected.
+Putting cost considerations aside, suppose that an experimenter \E\ wishes to conduct $k$ among $n$ possible experiments. Each experiment $i\in\mathcal{N}\defeq \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$, normalized so that $\|x_i\|_2\leq 1$. Denote by $S\subseteq \mathcal{N}$, where $|S|=k$, the set of experiments selected; upon its execution, experiment $i\in S$ reveals an output variable (the ``measurement'') $y_i$, related to the experiment features $x_i$ through a linear function, \emph{i.e.},
 \begin{align}\label{model}
 \forall i\in\mathcal{N},\quad y_i = \T{\beta} x_i + \varepsilon_i
 \end{align}
 where $\beta$ is a vector in $\reals^d$, commonly referred to as the \emph{model}, and $\varepsilon_i$ (the \emph{measurement noise}) are independent, normally distributed random variables with mean 0 and variance $\sigma^2$.
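+% Editorial sketch, not in the original: a matrix restatement of \eqref{model}.
+% The shorthand $\varepsilon_S\defeq[\varepsilon_i]_{i\in S}$ is introduced only here;
+% $X_S$ and $y_S$ are as defined below.
+In matrix form, \eqref{model} reads $y_S = X_S\beta + \varepsilon_S$; in particular,
+conditioned on $\beta$, the vector $y_S$ is a multidimensional normal r.v. with
+mean $X_S\beta$ and covariance $\sigma^2 I$.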
-The purpose of these experiments is to allow \E\ to estimate the model $\beta$. In particular, under \eqref{model}, the maximum likelihood estimator of $\beta$ is the \emph{least squares} estimator: for $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ the matrix of experiment features and
-$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ the observed measurements,
-\begin{align} \hat{\beta} &=\max_{\beta\in\reals^d}\prob(y_S;\beta) =\argmin_{\beta\in\reals^d } \sum_{i\in S}(\T{\beta}x_i-y_i)^2 \nonumber\\
-& = (\T{X_S}X_S)^{-1}X_S^Ty_S\label{leastsquares}\end{align}
+The purpose of these experiments is to allow \E\ to estimate the model $\beta$. In particular,
+assume that the experimenter \E\ has a \emph{prior}
+distribution on $\beta$, \emph{i.e.}, $\beta$ has a multivariate normal prior
+with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d\times d}$ (where $\sigma^2$ is the noise variance).
+Then, \E\ estimates $\beta$ through \emph{maximum a posteriori estimation}, \emph{i.e.}, by finding the parameter that maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following minimization \cite{hastie}:
+\begin{align}
+  \hat{\beta} = \argmin_{\beta\in\reals^d} \big(\sum_{i\in S} (y_i - \T{\beta}x_i)^2
+  + \T{\beta}R\beta\big) = (R+\T{X_S}X_S)^{-1}\T{X_S}y_S \label{ridge}
+\end{align}
+where $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ is the matrix of experiment features and
+$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ is the vector of observed measurements.
+This optimization, commonly known as \emph{ridge regression}, includes an additional quadratic penalty term compared to standard least squares estimation.
+% under \eqref{model}, the maximum likelihood estimator of $\beta$ is the \emph{least squares} estimator: for $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ the matrix of experiment features and
+%$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ the observed measurements,
+%\begin{align} \hat{\beta} &=\max_{\beta\in\reals^d}\prob(y_S;\beta) =\argmin_{\beta\in\reals^d } \sum_{i\in S}(\T{\beta}x_i-y_i)^2 \nonumber\\
+%& = (\T{X_S}X_S)^{-1}X_S^Ty_S\label{leastsquares}\end{align}
 %The estimator $\hat{\beta}$ is unbiased, \emph{i.e.}, $\expt{\hat{\beta}} = \beta$ (where the expectation is over the noise variables $\varepsilon_i$). Furthermore, $\hat{\beta}$ is a multidimensional normal random variable with mean $\beta$ and covariance matrix $(X_S\T{X_S})^{-1}$.
-
 Note that the estimator $\hat{\beta}$ is a linear map of $y_S$; as $y_S$ is a multidimensional normal r.v., so is $\hat{\beta}$ (the randomness coming from
-the noise terms $\varepsilon_i$). In particular, $\hat{\beta}$ has mean $\beta$
-(\emph{i.e.}, it is an \emph{unbiased estimator}) and covariance
-$(\T{X_S}X_S)^{-1}$.
+the noise terms $\varepsilon_i$). In particular, $\hat{\beta}$ has
+%mean $\beta$ (\emph{i.e.}, it is an \emph{unbiased estimator}) and
+covariance
+$(R+\T{X_S}X_S)^{-1}$.
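+% Editorial sketch, not in the original: the standard MAP computation behind
+% \eqref{ridge}; it uses only the Gaussian prior and noise model stated above.
+To see how \eqref{ridge} arises, take negative logarithms in
+$p(\beta\mid y_S)\propto p(y_S\mid\beta)\,p(\beta)$ and drop additive constants:
+\begin{align*}
+ -\log p(\beta\mid y_S) = \tfrac{1}{2\sigma^2}\sum_{i\in S}(y_i-\T{\beta}x_i)^2
+ + \tfrac{1}{2\sigma^2}\T{\beta}R\beta + \mathrm{const}.
+\end{align*}
+Scaling by $2\sigma^2$ recovers the objective in \eqref{ridge}, and the first-order
+condition $\T{X_S}(X_S\beta-y_S)+R\beta=0$ gives the closed form
+$\hat{\beta}=(R+\T{X_S}X_S)^{-1}\T{X_S}y_S$.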
 Let $V:2^\mathcal{N}\to\reals$ be a \emph{value function}, quantifying how informative a set of experiments $S$ is in estimating $\beta$. The standard optimal experimental design problem amounts to finding a set $S$ that maximizes $V(S)$ subject to the constraint $|S|\leq k$.
-A variety of different value functions are used in experimental design~\cite{pukelsheim2006optimal}; almost all make use of the covariance $(\T{X_S}X_S)^{-1}$ of the estimator $\hat{\beta}$. A value function preferred because of its relationship to entropy is the \emph{$D$-optimality criterion}: %which yields the following optimization problem
+A variety of different value functions are used in experimental design~\cite{pukelsheim2006optimal}; almost all make use of the covariance $(R+\T{X_S}X_S)^{-1}$ of the estimator $\hat{\beta}$. A value function with a natural information-theoretic interpretation is the \emph{information gain}: %\emph{$D$-optimality criterion}: %which yields the following optimization problem
 \begin{align}
-  V(S) &= \frac{1}{2}\log\det \T{X_S}X_S \label{dcrit} %\\
+  V(S) = I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S). \label{informationgain}
 \end{align}
-defined as $-\infty$ when $\mathrm{rank}(\T{X_S}X_S)<d$.
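+% Editorial sketch, not in the original: a closed form for \eqref{informationgain},
+% via the Gaussian entropy formula $\tfrac{1}{2}\log\det(2\pi e\,\Sigma)$; the display below is ours.
+Both terms in \eqref{informationgain} are entropies of multidimensional normal
+r.v.'s, with covariances proportional to $R^{-1}$ and $(R+\T{X_S}X_S)^{-1}$,
+respectively; the proportionality constants cancel, yielding
+\begin{align*}
+ V(S) = \tfrac{1}{2}\log\det\big(R+\T{X_S}X_S\big) - \tfrac{1}{2}\log\det R,
+\end{align*}
+which is finite whenever $R\succ 0$, even if $\T{X_S}X_S$ is rank deficient,
+unlike the $D$-optimality criterion it replaces.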