\label{sec:prel} \subsection{Linear Regression and Experimental Design}\label{sec:edprelim} The theory of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum,chaloner1995bayesian} considers the following formal setting. In general, the optimality of a particular design depends on the purpose of the experiment, \emph{i.e.}, the quantity the experimenter \E\ is trying to learn or the hypothesis she is trying to validate. Due to their ubiquity in statistical analysis, a large part of the literature on the subject focuses on learning \emph{linear models}, where \E\ wishes to fit a linear function to the data she has collected. Suppose that an experimenter \E\ wishes to conduct $k$ of $n$ possible experiments. Each experiment $i\in\mathcal{N}\defeq \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$, normalized so that $b\leq \|x_i\|^2_2\leq 1$ for some $b>0$. Denote by $S\subseteq \mathcal{N}$, where $|S|=k$, the set of experiments selected; upon its execution, experiment $i\in S$ reveals an output variable (the ``measurement'') $y_i$, related to the experiment features $x_i$ through a linear function, \emph{i.e.}, $y_i = \T{\beta} x_i + \varepsilon_i$, where $\beta\in\reals^d$ is commonly referred to as the \emph{model}, and the $\varepsilon_i$ (the \emph{measurement noise}) are independent, normally distributed random variables with mean 0 and variance $\sigma^2$. For example, each $i$ may correspond to a human subject; the feature vector $x_i$ may be a normalized vector of her age, weight, gender, income, \emph{etc.}, and the measurement $y_i$ may capture some biometric information (\emph{e.g.}, her red blood cell count, a genetic marker, \emph{etc.}).
The magnitude of the coefficient $\beta_i$ captures the effect that feature $i$ has on the measured variable, and its sign captures whether the correlation is positive or negative. The purpose of these experiments is to allow \E\ to estimate the model $\beta$. In particular, assume that the experimenter \E\ has a {\em prior} distribution on $\beta$: namely, $\beta$ has a multivariate normal prior with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d\times d}$ (where $\sigma^2$ is the noise variance). Then, \E\ estimates $\beta$ through \emph{maximum a posteriori estimation}, \emph{i.e.}, by finding the parameter that maximizes the posterior distribution of $\beta$ given the observations $y_S$. Under the linearity assumption and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following optimization \cite{hastie}: \begin{align} \begin{split} \hat{\beta} = \argmax_{\beta\in\reals^d} \prob(\beta\mid y_S) &=\argmin_{\beta\in\reals^d} \big(\sum_{i\in S} (y_i - \T{\beta}x_i)^2 + \T{\beta}R\beta\big)\\ & = (R+\T{X_S}X_S)^{-1}\T{X_S}y_S \label{ridge} \end{split} \end{align} where the last equality is obtained by setting the gradient of the minimized objective to zero and solving the resulting linear system; in \eqref{ridge}, $X_S\defeq[x_i]_{i\in S}\in \reals^{|S|\times d}$ is the matrix of experiment features and $y_S\defeq[y_i]_{i\in S}\in\reals^{|S|}$ is the vector of observed measurements. This optimization, commonly known as \emph{ridge regression}, includes an additional quadratic penalty term $\T{\beta}R\beta$ compared to standard least squares estimation.
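As a sanity check, the closed form in \eqref{ridge} can be verified numerically. The sketch below is illustrative only: the synthetic instance (dimensions, noise level) and the choice of prior matrix $R=I_d$ are our own assumptions, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic instance (dimensions and noise level are assumptions):
# n experiments with d-dimensional features, normalized so that ||x_i||_2 = 1.
n, d, sigma = 20, 3, 0.1
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
beta = rng.normal(size=d)                   # the (unknown) model
y = X @ beta + sigma * rng.normal(size=n)   # y_i = beta^T x_i + eps_i

# MAP / ridge estimate of eq. (ridge); here the prior matrix R is I_d.
R = np.eye(d)
beta_hat = np.linalg.solve(R + X.T @ X, X.T @ y)

def objective(b):
    """The ridge objective: sum_i (y_i - b^T x_i)^2 + b^T R b."""
    return np.sum((y - X @ b) ** 2) + b @ R @ b

# beta_hat minimizes the objective: random perturbations never improve it.
for _ in range(100):
    assert objective(beta_hat) <= objective(beta_hat + 1e-3 * rng.normal(size=d))
```

Since $R+\T{X_S}X_S\succ 0$, the objective is strictly convex and the linear system has a unique solution.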
Let $V:2^\mathcal{N}\to\reals$ be a \emph{value function}, quantifying how informative a set of experiments $S$ is in estimating $\beta$. The classical experimental design problem amounts to finding a set $S$ that maximizes $V(S)$ subject to the constraint $|S|\leq k$. A variety of different value functions are used in the literature~\cite{pukelsheim2006optimal,boyd2004convex}; one that has natural advantages is the \emph{information gain} $V(S)= I(\beta;y_S) = \entropy(\beta)-\entropy(\beta\mid y_S)$, \emph{i.e.}, the reduction in the entropy of $\beta$ after the revelation of $y_S$ (also known as the mutual information between $y_S$ and $\beta$).
Hence, selecting a set of experiments $S$ that maximizes $V(S)$ is equivalent to finding the set of experiments that minimizes the uncertainty on $\beta$, as captured by the entropy reduction of its estimator. Under the linear model and the Gaussian prior, the information gain takes the following form (see, \emph{e.g.}, \cite{chaloner1995bayesian}): \begin{align} I(\beta;y_S)&= \frac{1}{2}\log\det(R+ \T{X_S}X_S) - \frac{1}{2}\log\det R\label{dcrit} \end{align} Maximizing $I(\beta;y_S)$ is therefore equivalent to maximizing $\log\det(R+ \T{X_S}X_S)$, which is known in the experimental design literature as the Bayes $D$-optimality criterion \cite{pukelsheim2006optimal,atkinson2007optimum,chaloner1995bayesian}. As $\hat{\beta}$ is a linear map of the multivariate normal random vector $y_S$, it is itself normal, with covariance $\sigma^2(R+\T{X_S}X_S)^{-1}$; maximizing $I(\beta;y_S)$ can thus alternatively be seen as minimizing the product of the eigenvalues of this covariance (\emph{i.e.}, its determinant). \subsection{Budget-Feasible Experimental Design: Full-Information Case} We now assume that the experimenter \E\ has a budget $B\in\reals_+$, and that each experiment $i\in\mathcal{N}$ has a cost $c_i\in\reals_+$; for now, the costs are common knowledge. As \eqref{dcrit} can take negative values, making it unfit for stating multiplicative approximation guarantees, we consider the following full-information problem: \begin{center} \textsc{ExperimentalDesignProblem} (\EDP) \begin{subequations} \begin{align} \text{Maximize}\quad V(S) &= \log\det(I_d+\T{X_S}X_S) \label{modified} \\ \text{subject to}\quad \sum_{i\in S} c_i&\leq B\label{lincon} \end{align}\label{edp} \end{subequations} \end{center} where $I_d\in \reals^{d\times d}$ is the identity matrix. Without loss of generality, we assume $c_i\leq B$ for all $i\in\mathcal{N}$, as no experiment with $c_i>B$ can be in an $S$ satisfying \eqref{lincon}. Denote by \begin{equation}\label{eq:non-strategic} OPT = \max_{S\subseteq\mathcal{N}} \Big\{V(S) \;\Big|\; \sum_{i\in S}c_i\leq B\Big\} \end{equation} the optimal value achievable in the full-information case. \EDP{}, as defined above, is NP-hard; to see this, note that \textsc{Knapsack} reduces to \EDP{} with dimension $d=1$ by mapping the weight $w_i$ of each item to an experiment with $x_i^2=w_i$.
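The information gain \eqref{dcrit}, and hence the Bayes $D$-optimality objective, is straightforward to evaluate numerically via a log-determinant. In the sketch below, the synthetic feature matrix and the choice $R=I_d$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative feature matrix for a candidate set S of 5 experiments.
d = 3
X_S = rng.normal(size=(5, d))
X_S /= np.linalg.norm(X_S, axis=1, keepdims=True)  # normalize: ||x_i||_2 = 1

def info_gain(X_S, R):
    """I(beta; y_S) of eq. (dcrit): (1/2)[log det(R + X_S^T X_S) - log det R]."""
    ld1 = np.linalg.slogdet(R + X_S.T @ X_S)[1]
    ld2 = np.linalg.slogdet(R)[1]
    return 0.5 * (ld1 - ld2)

# With prior matrix R = I_d, this is half the Bayes D-optimality objective
# log det(I_d + X_S^T X_S).
R = np.eye(d)
v = np.linalg.slogdet(np.eye(d) + X_S.T @ X_S)[1]
assert np.isclose(info_gain(X_S, R), 0.5 * v)

# Observing more experiments never decreases the information gain.
assert info_gain(X_S, R) >= info_gain(X_S[:-1], R) >= 0.0
```

Using `slogdet` rather than `det` avoids overflow and returns the log-determinant directly.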
The value function \eqref{modified} has the following properties, which are proved in Appendix~\ref{app:properties}. First, it is non-negative, \emph{i.e.}, $V(S)\geq 0$ for all $S\subseteq\mathcal{N}$. Second, it is monotone, \emph{i.e.}, $V(S)\leq V(T)$ for all $S\subseteq T$, with $V(\emptyset)=0$. Finally, it is submodular, \emph{i.e.}, $V(S\cup \{i\})-V(S)\geq V(T\cup \{i\})-V(T)$ for all $S\subseteq T\subseteq \mathcal{N}$ and $i\in \mathcal{N}$. These properties imply that a greedy algorithm yields a constant approximation ratio for \EDP. In particular, consider the greedy algorithm in which, for $S\subseteq\mathcal{N}$ the set constructed thus far, the next element included is the one that maximizes the \emph{marginal-value-per-cost}, \emph{i.e.}, $ i = \argmax_{j\in\mathcal{N}\setminus S}{(V(S\cup\{j\}) - V(S))}/{c_j}.$ This is repeated until adding the next element to $S$ would exceed the budget $B$. Denote by $S_G$ the set constructed by this heuristic, and let $i^*=\argmax_{i\in\mathcal{N}} V(\{i\})$ be the element of maximum singleton value. Then, the following algorithm: \begin{equation}\label{eq:max-algorithm} \textbf{if}\; V(\{i^*\}) \geq V(S_G)\; \textbf{return}\; \{i^*\} \;\textbf{else return}\; S_G \end{equation} yields an approximation ratio of $\frac{5 e }{e-1}$~\cite{singer-mechanisms}; this can be further improved to $\frac{e}{e-1}$ using more complicated greedy set constructions~\cite{krause-submodular,sviridenko-submodular}.
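The greedy selection rule and the algorithm of \eqref{eq:max-algorithm} can be sketched as follows. The instance (costs, budget, features) is illustrative, and $V$ is the \EDP{} objective \eqref{modified}:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative EDP instance: n candidate experiments, costs c_i, budget B.
n, d, B = 15, 3, 1.0
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
c = rng.uniform(0.1, 0.5, size=n)

def V(S):
    """EDP objective: V(S) = log det(I_d + X_S^T X_S)."""
    X_S = X[sorted(S)]
    return np.linalg.slogdet(np.eye(d) + X_S.T @ X_S)[1]

def greedy(B):
    """Repeatedly add the element of highest marginal-value-per-cost,
    stopping once the next chosen element would exceed the budget."""
    S, spent = set(), 0.0
    while len(S) < n:
        j = max((j for j in range(n) if j not in S),
                key=lambda j: (V(S | {j}) - V(S)) / c[j])
        if spent + c[j] > B:
            break
        S, spent = S | {j}, spent + c[j]
    return S

S_G = greedy(B)
i_star = max(range(n), key=lambda i: V({i}))  # best single experiment
# The algorithm of eq. (eq:max-algorithm): best of S_G and the top singleton.
S_out = {i_star} if V({i_star}) >= V(S_G) else S_G
```

Each greedy step re-evaluates $V$ on candidate sets; a production implementation would instead update the log-determinant incrementally via rank-one updates.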
\subsection{Budget-Feasible Experimental Design: Strategic Case} We study the following \emph{strategic} setting, in which the costs $c_i$ are {\em not} common knowledge and their reporting can be manipulated by the experiment subjects. The latter are strategic and wish to maximize their utility, namely, the difference between the payment they receive and their true cost. We note that, though subjects may misreport $c_i$, they cannot lie about $x_i$ (\emph{i.e.}, all public features are verifiable prior to the experiment) or $y_i$ (\emph{i.e.}, a subject cannot falsify her measurement). In this setting, experimental design reduces to a \emph{budget feasible reverse auction}, as introduced by \citeN{singer-mechanisms}; we review the formal definition in Appendix~\ref{app:budgetfeasible}. In short, given a budget $B$ and a value function $V:2^{\mathcal{N}}\to\reals_+$, a \emph{reverse auction mechanism} $\mathcal{M} = (S,p)$ comprises (a) an \emph{allocation function} $S:\reals_+^n \to 2^\mathcal{N}$, determining the set of experiments to be purchased, and (b) a \emph{payment function} $p:\reals_+^n\to \reals_+^n$, determining the payments $[p_i(c)]_{i\in \mathcal{N}}$ received by the experiment subjects. We seek mechanisms that are \emph{normalized} (unallocated experiments receive zero payment), \emph{individually rational} (payments for allocated experiments are no less than their costs), have \emph{no positive transfers} (payments are non-negative), and are \emph{budget feasible} (the sum of payments does not exceed the budget $B$). We relax the notion of truthfulness to \emph{$\delta$-truthfulness}, requiring that reporting one's true cost is an \emph{almost-dominant} strategy: no subject can increase her utility by reporting a cost that differs by more than $\delta>0$ from her true cost. Under this definition, a mechanism is truthful when $\delta=0$.
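The four desiderata above can be stated programmatically. The following minimal sketch (the function name and the toy outcomes are our own illustrations, not part of any mechanism in the text) checks them for a candidate outcome $(S,p)$:

```python
def check_mechanism(S, p, c, B):
    """Check the four desiderata for a reverse-auction outcome (S, p)
    at reported cost profile c: normalized, individually rational,
    no positive transfers, and budget feasible."""
    n = len(c)
    normalized = all(p[i] == 0 for i in range(n) if i not in S)
    individually_rational = all(p[i] >= c[i] for i in S)
    no_positive_transfers = all(p[i] >= 0 for i in range(n))
    budget_feasible = sum(p[i] for i in S) <= B
    return (normalized and individually_rational
            and no_positive_transfers and budget_feasible)

# Toy outcome: experiments 0 and 2 purchased at payments covering their costs.
assert check_mechanism(S={0, 2}, p=[0.4, 0.0, 0.5, 0.0], c=[0.3, 0.2, 0.5, 0.6], B=1.0)
# Underpaying an allocated subject violates individual rationality.
assert not check_mechanism(S={1}, p=[0.0, 0.1, 0.0, 0.0], c=[0.3, 0.2, 0.5, 0.6], B=1.0)
```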
In addition, we would like the allocation $S(c)$ to be of maximal value; however, $\delta$-truthfulness, as well as the hardness of \EDP{}, preclude achieving this goal. Hence, we seek mechanisms that are \emph{$(\alpha,\beta)$-approximate}, \emph{i.e.}, for which there exist $\alpha\geq 1$ and $\beta>0$ such that $OPT \leq \alpha V(S(c))+\beta$, and that are \emph{computationally efficient}, in that $S$ and $p$ can be computed in polynomial time. We note that the constant-approximation algorithm \eqref{eq:max-algorithm} breaks truthfulness. Though this is not the case for all submodular functions (see, \emph{e.g.}, \cite{singer-mechanisms}), it is the case for the objective of \EDP{}: we show this in Appendix~\ref{sec:non-monotonicity}. This motivates our study of more complex mechanisms.
\begin{comment}
\subsection{Experimental Design} In the context of experimental design, an \emph{experiment} is a random variable $E$ sampled from a distribution $P_\beta$, where $\beta\in \Omega$ is an unknown parameter. An experimenter wishes to learn the parameter $\beta$, and can choose among a set of possible experiments, all of which have distributions parametrized by the same $\beta$. The problem of optimal experimental design amounts to determining an experiment that maximizes the information revealed about the parameter $\beta$. Though a variety of different measures of information exist in the literature (see, \emph{e.g.}, \cite{ginebra,chaloner}), the so-called \emph{value of information} \cite{lindley} is commonly used in traditional Bayesian experimental design. In particular, in the Bayesian setup, it is assumed that $\beta$ is sampled from a known prior distribution. The value of an experiment $E$ is then defined as the expected change in the entropy of $\beta$ (\emph{i.e.}, the mutual information between $E$ and $\beta$), given by \begin{align} \mutual(\beta; E) = \entropy(\beta) - \entropy(\beta \mid E).\label{voi} \end{align} \subsection{Linear Regression} In this paper, we focus on \emph{linear regression} experiments, aiming to discover a linear function from data. In particular, we consider a set of $n$ users $\mathcal{N} = \{1,\ldots, n\}$. Each user $i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$, $\norm{x_i}_2\leq 1$, and an undisclosed piece of information $y_i\in\reals$. For example, the features could be the age, weight, or height of user $i$, while the latter can be her propensity to contract a disease.
The variables $x_i$ and $y_i$ are related through the following relationship: \begin{align} y_i = \T{\beta} x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},\label{model} \end{align} where $\beta\in\reals^d$ is the unknown parameter and the $\varepsilon_i\in\reals$ are independent, normally distributed random variables with zero mean and variance $\sigma^2_0$. The experimenter wishes to learn $\beta$ by observing the undisclosed $y_i$ of a subset of users $i\in S$. Hence, the set of possible experiments comprises all random variables $y_{S}\in\reals^{|S|}$, $S\subseteq \mathcal{N}$. Learning $\beta$ has many interesting applications that make linear regression ubiquitous in data mining, statistics, and machine learning. On one hand, the function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. Moreover, the signs and relative magnitudes of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated to a disease. \subsection{Value of Information} In the Bayesian setting, it is commonly assumed that $\beta$ follows a multivariate normal distribution of mean zero and covariance matrix $\sigma_1^2 I_d$. Under this prior and the linear model \eqref{model}, the value of information \eqref{voi} of an experiment $y_S$ is given by \cite{boyd,chaloner} \begin{align}\label{vs} V(S) & \defeq I(\beta;y_S) = \frac{1}{2}\log\det\left(I_d + \mu\sum_{i\in S} x_i\T{x_i}\right), \end{align} where $\mu = \sigma_1^2/\sigma_0^2$. Some intuition behind the Gaussian prior on $\beta$, and the role that the parameter $\mu$ plays, can be obtained from how the experimenter estimates $\beta$ upon observing $y_S$.
In particular, the experimenter can estimate $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$. It is easy to show that, under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following minimization problem: \begin{displaymath} \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_{i\in S} (y_i - \T{\beta}x_i)^2 + \frac{1}{\mu}\norm{\beta}_2^2 \end{displaymath} This optimization, commonly known as \emph{ridge regression}, reduces to a least squares fit for $\mu=\infty$. For finite $\mu$, ridge regression acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors with the same square error, the one with the smallest norm is preferred. This is consistent with the Gaussian prior on $\beta$, which implies that vectors with small norms are more likely. \subsection{Linear and Ridge Regression} Linear regression, a true workhorse of statistical analysis, fits a linear function to a set of observable input and output variables. In particular, consider a set of $n$ value pairs $(x_i,y_i)$, $i\in \mathcal{N} =\{1,\ldots,n\}$.
The variables $x_i\in \reals^d$ are usually referred to as \emph{input} variables or \emph{covariates}, and $y_i\in \reals$ are referred to as \emph{output} or \emph{response} variables. For example, the former could be the age, weight, or height of a person, while the latter can be their propensity to contract a disease. Assume that input and output variables are related through the following relationship: \begin{displaymath} y_i = \T{\beta} x_i + \varepsilon_i,\quad\forall i\in\mathcal{N}, \end{displaymath} where $\beta\in\reals^d$ is called the \emph{model parameter vector} and $\varepsilon_i\in\reals$ are independent, normally distributed random variables with zero mean and variance $\sigma^2$. The purpose of linear regression is to recover the vector $\beta$ upon the observation of the pairs $(x_i,y_i)$, $i=1,\ldots,n$. Learning $\beta$ has many interesting applications that make linear regression ubiquitous in data mining, statistics, and machine learning. On one hand, the function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. Moreover, the signs and relative magnitudes of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated to a disease. In the absence of any additional information, a natural method for estimating $\beta$ is a least squares fit. However, in a more general setup, prior knowledge about $\beta$, captured by a distribution over $\reals^d$, can also be taken into account. In this setup, after observing the data, $\beta$ can be retrieved through \emph{maximum a posteriori estimation}: computing the parameters which maximize the posterior distribution of $\beta$ given the observations. A common prior assumption for linear regression is that $\beta$ follows a multivariate normal distribution of mean zero and covariance matrix $\kappa I_d$.
Under this assumption, maximum a posteriori estimation leads to the following minimization problem, commonly known as \emph{ridge regression}: \begin{displaymath} \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2 + \frac{1}{\mu}\norm{\beta}_2^2 \end{displaymath} where $\mu = \frac{\kappa}{\sigma^2}$ is referred to as the \emph{regularization parameter}. For $\mu=\infty$, the above minimization recovers a least squares fit. Compared to the latter, ridge regression acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors with the same square error, the one with the smallest norm is preferred. This is consistent with the Gaussian prior on $\beta$, which favors vectors with small norms, and is known in practice to give better prediction results over new data than model parameters computed through a least squares fit. \subsection{Notations} Throughout the paper, we will make use of the following notations: if $x$ is a (column) vector in $\reals^d$, $\T{x}$ denotes its transposed (row) vector. Thus, the standard inner product between two vectors $x$ and $y$ is simply $\T{x}y$, and $\norm{x}_2 = \sqrt{\T{x}x}$ denotes the $L_2$ norm of $x$. We will also often use the following order over symmetric matrices: if $A$ and $B$ are two $d\times d$ real symmetric matrices, we write $A\leq B$ iff \begin{displaymath} \forall x\in\reals^d,\quad \T{x}Ax \leq \T{x}Bx, \end{displaymath} that is, iff $B-A$ is symmetric positive semi-definite. This order lets us define the notion of a \emph{decreasing} or \emph{convex} matrix function similarly to their real counterparts. In particular, let us recall that matrix inversion is decreasing and convex over symmetric positive definite matrices. \subsection{Data model} There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user $i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$ and an undisclosed piece of information $y_i\in\reals$. We assume that the data has already been normalized so that $\norm{x_i}_2\leq 1$ for all $i\in\mathcal{N}$. The experimenter selects a set of users and asks them to reveal their private pieces of information. We are interested in a \emph{survey setup}: the experimenter has not seen the data yet, but must decide which users to select; his goal is to learn the model underlying the data. Here, we assume a linear model: \begin{displaymath} \forall i\in\mathcal{N},\quad y_i = \T{\beta}x_i + \varepsilon_i \end{displaymath} where $\beta\in\reals^d$ and each $\varepsilon_i\in\reals$ follows a normal distribution of mean $0$ and variance $\sigma^2$. Furthermore, we assume the errors to be independent across users: the $(\varepsilon_i)_{i\in\mathcal{N}}$ are mutually independent. After observing the data, the experimenter could simply perform linear regression to learn the model parameter $\beta$. However, in a more general setup, the experimenter has prior knowledge about $\beta$, captured by a distribution over $\reals^d$.
After observing the data, the experimenter performs \emph{maximum a posteriori estimation}: computing the point which maximizes the posterior distribution of $\beta$ given the observations. Here, we assume, as is often done, that the prior distribution is a multivariate normal distribution of mean zero and covariance matrix $\kappa I_d$. Maximum a posteriori estimation then leads to the following minimization problem, the well-known \emph{ridge regression}: \begin{displaymath} \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2 + \frac{1}{\mu}\norm{\beta}_2^2 \end{displaymath} where $\mu = \frac{\kappa}{\sigma^2}$ is the regularization parameter. Ridge regression can thus be seen as linear regression with a regularization term which prevents $\beta$ from having a large $L_2$-norm. \subsection{Value of data} Because the users' private variables $y_i$ have not yet been observed when the experimenter decides which users to include in his experiment, we treat $\beta$ as a random variable whose distribution is updated after observing the data. Let us recall that if $\beta$ is a random variable over $\reals^d$ whose probability distribution has a density function $f$ with respect to the Lebesgue measure, its (differential) entropy is given by: \begin{displaymath} \mathbb{H}(\beta) \defeq - \int_{b\in\reals^d} \log f(b) f(b)\text{d}b \end{displaymath} A standard way to measure the reduction in uncertainty induced by the observation of data is through this entropy. This leads to the following definition of the value of data, called the \emph{value of information}: \begin{displaymath} \forall S\subset\mathcal{N},\quad V(S) = \mathbb{H}(\beta) - \mathbb{H}(\beta\,|\, Y_S) \end{displaymath} where $Y_S = \{y_i,\,i\in S\}$ is the set of observed data.
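Under the Gaussian prior and noise model, both entropies in this definition are available in closed form, so $V(S)$ can be evaluated directly. A minimal numerical sketch follows; the data, the helper names, and the posterior covariance formula $\sigma^2(\T{X_S}X_S + \frac{\sigma^2}{\kappa}I_d)^{-1}$ (the standard Bayesian linear regression posterior, not stated explicitly in the text) are illustrative assumptions.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate normal with covariance cov."""
    d = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(cov))

def value_of_information(X_S, kappa, sigma):
    """V(S) = H(beta) - H(beta | Y_S) for the Gaussian linear model.

    Assumes the standard Bayesian linear regression posterior covariance
    (X_S^T X_S / sigma^2 + I_d / kappa)^{-1}.
    """
    d = X_S.shape[1]
    prior_cov = kappa * np.eye(d)
    post_cov = np.linalg.inv(X_S.T @ X_S / sigma**2 + np.eye(d) / kappa)
    return gaussian_entropy(prior_cov) - gaussian_entropy(post_cov)

# Illustrative data for a selected set S of 10 users in dimension 3.
rng = np.random.default_rng(2)
X_S = rng.normal(size=(10, 3))
kappa, sigma = 1.0, 0.5
V = value_of_information(X_S, kappa, sigma)
```

Observing data can only reduce uncertainty about $\beta$, so the computed value is positive.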
\begin{theorem} Under the ridge regression model described above, the value of data is equal to: \begin{align*} \forall S\subset\mathcal{N},\; V(S) & = \frac{1}{2}\log\det\left(I_d + \mu\sum_{i\in S} x_i\T{x_i}\right)\\ & \defeq \frac{1}{2}\log\det A(S) \end{align*} \end{theorem} \begin{proof} Let us denote by $X_S$ the matrix whose rows are the vectors $(\T{x_i})_{i\in S}$, and by $E_S\defeq(\varepsilon_i)_{i\in S}$ the corresponding noise vector. Observe that $A(S)$ can simply be written as: \begin{displaymath} A(S) = I_d + \mu \T{X_S} X_S \end{displaymath} Let us recall that the entropy of a multivariate normal variable $B$ over $\reals^d$ with covariance matrix $\Sigma$ is given by: \begin{equation}\label{eq:multivariate-entropy} \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma\big) \end{equation} By the symmetry of mutual information, we get that: \begin{displaymath} V(S) = \mathbb{H}(\beta) - \mathbb{H}(\beta\,|\,Y_S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta) \end{displaymath} Conditioned on $\beta$, $Y_S$ follows a multivariate normal distribution of mean $X_S\beta$ and covariance matrix $\sigma^2 I_{|S|}$. Hence: \begin{equation}\label{eq:h1} \mathbb{H}(Y_S\,|\,\beta) = \frac{1}{2}\log\left((2\pi e)^{|S|} \det(\sigma^2I_{|S|})\right) \end{equation} Marginally, $Y_S$ also follows a multivariate normal distribution of mean zero. Let us compute its covariance matrix, $\Sigma_Y$: \begin{align*} \Sigma_Y & = \expt{Y_S\T{Y_S}} = \expt{(X_S\beta + E_S)\T{(X_S\beta + E_S)}}\\ & = \kappa X_S \T{X_S} + \sigma^2I_{|S|} \end{align*} Thus, we get that: \begin{equation}\label{eq:h2} \mathbb{H}(Y_S) = \frac{1}{2}\log\left((2\pi e)^{|S|} \det(\kappa X_S \T{X_S} + \sigma^2 I_{|S|})\right) \end{equation} Combining \eqref{eq:h1} and \eqref{eq:h2}, we get: \begin{displaymath} V(S) = \frac{1}{2}\log\det\left(I_{|S|}+\frac{\kappa}{\sigma^2}X_S \T{X_S}\right) \end{displaymath} Finally, Sylvester's determinant theorem, $\det(I_{|S|} + \mu X_S\T{X_S}) = \det(I_d + \mu \T{X_S}X_S)$, yields the result.
\end{proof} It is also interesting to look at the marginal contribution of a user to a set: the increase in value induced by adding a user to an already existing set of users. We have the following lemma. \begin{lemma}[Marginal contribution] \begin{displaymath} \Delta_i V(S)\defeq V(S\cup\{i\}) - V(S) = \frac{1}{2}\log\left(1 + \mu \T{x_i} A(S)^{-1}x_i\right) \end{displaymath} \end{lemma} \begin{proof} We have: \begin{align*} V(S\cup\{i\}) & = \frac{1}{2}\log\det A(S\cup\{i\})\\ & = \frac{1}{2}\log\det\left(A(S) + \mu x_i \T{x_i}\right)\\ & = V(S) + \frac{1}{2}\log\det\left(I_d + \mu A(S)^{-1}x_i \T{x_i}\right)\\ & = V(S) + \frac{1}{2}\log\left(1 + \mu \T{x_i} A(S)^{-1}x_i\right) \end{align*} where the last equality follows from Sylvester's determinant formula. \end{proof} Because $A(S)$ is symmetric positive definite, the marginal contribution is positive, which proves that the value function is nondecreasing in $S$. Furthermore, if $S\subset S'$, then $A(S)\preceq A(S')$ in the Loewner order; since matrix inversion reverses this order on positive definite matrices, the marginal contribution of a fixed user is nonincreasing in $S$. This is the \emph{submodularity} of the value function. The marginal contribution also reveals which points are the most valuable: those aligned with directions along which the spread of the already selected points is small, \emph{i.e.}, directions in which $A(S)^{-1}$ has large eigenvalues. \end{comment}