\subsection{Experimental Design}

The theory of experimental design \cite{pukelsheim2006optimal,atkinson2007optimum}  studies how an experimenter should select the parameters of a set of experiments she is about to conduct. In general, the optimality of a particular design depends on the purpose of the experiment, \emph{i.e.}, the quantity the experimenter is trying to learn or the hypothesis she is trying to validate. Due to their ubiquity in statistical analysis, a large literature on the subject focuses on learning \emph{linear models}, whereby the experimenter wishes to  fit  a linear map to the data she has collected.

More precisely, putting cost considerations aside, suppose that an experimenter wishes to conduct $k$ among $n$ possible experiments. Each experiment $i\in\mathcal{N}\defeq \{1,\ldots,n\}$ is associated with a set of parameters (or features) $x_i\in \reals^d$, normalized so that $\|x_i\|_2\leq 1$. Denote by $S\subseteq \mathcal{N}$, where $|S|=k$, the set of experiments selected; upon its execution, experiment $i\in S$ reveals an output variable (the ``measurement'') $y_i$,  related to the experiment features $x_i$ through a linear function, \emph{i.e.},
\begin{align}
     y_i = \T{\beta} x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},\label{model}
\end{align}
where $\beta$ is a vector in $\reals^d$, commonly referred to as the \emph{model}, and the $\varepsilon_i$ (the \emph{measurement noise}) are independent, normally distributed random variables with zero mean and variance $\sigma^2$. 

The purpose of these experiments is to allow the experimenter to estimate the model $\beta$. In particular, under \eqref{model}, the maximum likelihood estimator of $\beta$ is the \emph{least squares} estimator: for $X_S=[x_i]_{i\in S}\in \reals^{|S|\times d}$ the matrix of experiment features and
$y_S=[y_i]_{i\in S}\in\reals^{|S|}$ the observed measurements, 
\begin{align} \hat{\beta} &=\argmax_{\beta\in\reals^d}\prob(y_S;\beta) =\argmin_{\beta\in\reals^d } \sum_{i\in S}(\T{\beta}x_i-y_i)^2 \nonumber\\
& = (\T{X_S}X_S)^{-1}\T{X_S}y_S\label{leastsquares}\end{align} 
%The estimator $\hat{\beta}$ is unbiased, \emph{i.e.}, $\expt{\hat{\beta}} = \beta$ (where the expectation is over the noise variables $\varepsilon_i$). Furthermore, $\hat{\beta}$ is a multidimensional normal random variable with mean $\beta$ and covariance matrix $(X_S\T{X_S})^{-1}$. 

Note that the estimator $\hat{\beta}$ is a linear map of $y_S$; as $y_S$ is
a multidimensional normal r.v., so is $\hat{\beta}$ (the randomness coming from
the noise terms $\varepsilon_i$). In particular, $\hat{\beta}$ has mean $\beta$
(\emph{i.e.}, it is an \emph{unbiased estimator}) and covariance
$\sigma^2(\T{X_S}X_S)^{-1}$.
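As a sanity check of \eqref{leastsquares} (an illustrative sketch on synthetic data, not part of the paper's development), one can draw a random design and noisy measurements from the linear model and verify that the closed form matches a library least-squares solver:

```python
import numpy as np

# Synthetic instance of the linear model y_i = beta^T x_i + eps_i.
rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.1
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize so ||x_i||_2 <= 1
beta = rng.normal(size=d)
y = X @ beta + sigma * rng.normal(size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y          # closed-form estimator
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # library least squares
assert np.allclose(beta_hat, beta_lstsq)
```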

Let $V:2^\mathcal{N}\to\reals$ be a \emph{value function}, quantifying how informative a set of experiments $S$ is in estimating $\beta$. The standard optimal experimental design problem amounts to finding a set $S$ that maximizes $V(S)$ subject to the constraint $|S|\leq k$. 

A variety of value functions are used in experimental design~\cite{pukelsheim2006optimal}; almost all are functionals of the covariance $\sigma^2(\T{X_S}X_S)^{-1}$ of the estimator $\hat{\beta}$. One value function, preferred for its relationship to entropy, is the \emph{$D$-optimality criterion}:
\begin{align}
 V(S) &= \frac{1}{2}\log\det \T{X_S}X_S. \label{dcrit}
\end{align}
As $\hat{\beta}$ is a multidimensional normal random variable, the
$D$-optimality criterion equals (up to an additive constant) the negative
entropy of $\hat{\beta}$. Hence, selecting a set of experiments $S$ that
maximizes $V(S)$ is equivalent to finding the set of experiments that minimizes
the uncertainty about $\beta$, as captured by the entropy of its estimator.
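This equivalence can be made explicit: since $\hat{\beta}$ is normal with covariance $\sigma^2(\T{X_S}X_S)^{-1}$, the Gaussian entropy formula yields
\begin{align*}
\entropy(\hat{\beta}) &= \frac{1}{2}\log\Big((2\pi e)^d \det\big(\sigma^2(\T{X_S}X_S)^{-1}\big)\Big)\\
&= \frac{d}{2}\log(2\pi e\sigma^2) - \frac{1}{2}\log\det\T{X_S}X_S,
\end{align*}
which is a constant minus $V(S)$.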

%As discussed in the next section, in this paper, we work with a modified measure of information function, namely 
%\begin{align}
%V(S) & = \frac{1}{2} \log\det \left(I + \T{X_S}X_S\right)
%\end{align}
%There are several  reasons

 Value function \eqref{dcrit} is undefined when $\mathrm{rank}(\T{X_S}X_S)<d$; in this case, we take $V(S)=-\infty$ (so that $V$ takes values in the extended reals).
Note that \eqref{dcrit} is a submodular set function, \emph{i.e.}, 
$V(S)+V(T)\geq V(S\cup T)+V(S\cap T)$ for all $S,T\subseteq \mathcal{N}$; it is also monotone, \emph{i.e.}, $V(S)\leq V(T)$ for all $S\subseteq T$.
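Both properties can be checked numerically on small instances. The sketch below (synthetic data, ours) evaluates the regularized variant $\frac{1}{2}\log\det(I_d+\T{X_S}X_S)$, which is finite on every subset, and verifies the submodularity and monotonicity inequalities by brute force:

```python
import numpy as np
from itertools import chain, combinations

# Brute-force check of submodularity and monotonicity of the regularized
# log-determinant value function on random normalized features.
rng = np.random.default_rng(1)
d, n = 2, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

def V(S):
    XS = X[sorted(S)]
    return 0.5 * np.linalg.slogdet(np.eye(d) + XS.T @ XS)[1]

subsets = [set(s) for s in
           chain.from_iterable(combinations(range(n), k) for k in range(n + 1))]
for S in subsets:
    for T in subsets:
        # V(S) + V(T) >= V(S u T) + V(S n T)
        assert V(S) + V(T) >= V(S | T) + V(S & T) - 1e-9
    for i in range(n):
        # monotonicity: adding an experiment never decreases V
        assert V(S | {i}) >= V(S) - 1e-9
```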

\subsection{Budget Feasible Reverse Auctions}
A \emph{budget feasible reverse auction}
\cite{singer-mechanisms} comprises a set of items $\mathcal{N}=\{1,\ldots,n\}$ as well as a single buyer. Each item $i$ has a cost $c_i\in\reals_+$. Moreover, the buyer has a positive value function $V:2^{\mathcal{N}}\to \reals_+$, as well as a budget $B\in \reals_+$. In the \emph{full information case}, the costs $c_i$ are common
knowledge; the objective of the buyer in this context is to select a set $S$
maximizing the value $V(S)$ subject to the constraint $\sum_{i\in S} c_i\leq
B$. We write:
\begin{equation}\label{eq:non-strategic}
    OPT(V,\mathcal{N}, B) = \max_{S\subseteq\mathcal{N}} \left\{V(S) \mid
    \sum_{i\in S}c_i\leq B\right\}
\end{equation}
for the optimal value achievable in the full-information case.

In the \emph{strategic case}, each item in $\mathcal{N}$ is held by a different strategic agent, whose cost is a priori private.
A \emph{mechanism} $\mathcal{M} = (f,p)$ comprises (a) an \emph{allocation function}
$f:\reals_+^n \to 2^\mathcal{N}$ and (b) a \emph{payment function}
$p:\reals_+^n\to \reals_+^n$. Given the
vector of costs $c=[c_i]_{i\in\mathcal{N}}$, the allocation function $f$ determines the set in
$\mathcal{N}$ of items to be purchased, while the payment function
returns a vector of payments $[p_i]_{i\in \mathcal{N}}$. Let $s_i(c) = \id_{i\in f(c)}$ be the binary indicator of $i\in f(c)$. As in
\cite{singer-mechanisms, chen}, we study mechanisms that are normalized
($i\notin f(c)$ implies $p_i(c)=0$), individually rational ($p_i(c)\geq c_i\cdot s_i(c)$) and have
no positive transfers ($p_i(c)\geq 0$). 

In addition to the above, mechanism design in budget feasible reverse auctions seeks mechanisms that have the following properties \cite{singer-mechanisms,chen}:
\begin{itemize}
    \item \emph{Truthfulness.} An agent has no incentive to misreport her true cost. Formally, let $c_{-i}$ be the vector of costs of all agents except $i$. A mechanism is truthful iff, for every $i \in \mathcal{N}$ and every two cost vectors $(c_i,c_{-i})$ and $(c_i',c_{-i})$,
    $p_i(c_i,c_{-i}) - s_i(c_i,c_{-i})\cdot c_i \geq p_i(c_i',c_{-i}) - s_i(c_i',c_{-i})\cdot c_i$.
    \item \emph{Budget Feasibility.} The sum of the payments should not exceed the
    budget constraint:
    \begin{displaymath}
        \sum_{i\in\mathcal{N}} p_i \leq B.
    \end{displaymath}
    \item \emph{Approximation ratio.} The value of the allocated set should not
    be too far from the optimal value of the full information case
    \eqref{eq:non-strategic}. Formally, for the allocated set $S=f(c)$, there must exist some $\alpha\geq 1$
    such that:
    \begin{displaymath}
        OPT(V,\mathcal{N}, B) \leq \alpha V(S).
    \end{displaymath}
    The approximation ratio captures the \emph{price of truthfulness}, \emph{i.e.},  the relative value loss incurred by adding the truthfulness constraint. % to the value function maximization.
    \item \emph{Computational efficiency.} The allocation and payment functions
    should be computable in time polynomial in the number of agents $n$.
\end{itemize}

As noted in \cite{singer-mechanisms, chen}, budget feasible reverse auctions are  \emph{single parameter} auctions: each agent has only one
private value. In this case, Myerson's Theorem \cite{myerson} gives
a characterization of truthful mechanisms.

\begin{lemma}[Myerson \cite{myerson}]\label{thm:myerson}
\sloppy A normalized mechanism $\mathcal{M} = (f,p)$ for a single parameter auction is
truthful iff:
%\begin{enumerate}
%\item 
(a) $f$ is \emph{monotone}, \emph{i.e.}, for any agent $i$ and $c_i' \leq c_i$, for any
fixed costs $c_{-i}$ of agents in $\mathcal{N}\setminus\{i\}$, $i\in f(c_i,
c_{-i})$ implies $i\in f(c_i', c_{-i})$, and (b)
%\item
 agents are paid \emph{threshold payments}, \emph{i.e.}, for all $i\in f(c)$, $p_i(c)=\inf\{c_i': i\in f(c_i', c_{-i})\}$.
%\end{enumerate}
\end{lemma}
\fussy
Myerson's Theorem is particularly useful when designing a budget feasible truthful
mechanism, as it allows us to focus on designing a monotone allocation function. The
mechanism will then be truthful as long as we pay each agent her threshold payment---the caveat being that these payments need to sum to at most $B$.
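To make the lemma concrete, here is a minimal sketch (ours, not the paper's mechanism; the allocation rule and instance are purely illustrative): a monotone greedy-by-cost allocation, with each winner's threshold payment computed by binary search over her reported cost.

```python
# Greedy-by-cost allocation: accept agents in increasing order of reported
# cost while the running total stays within the budget B. Lowering one's
# report can only help one win, so this rule is monotone.
def allocate(costs, B):
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    chosen, total = set(), 0.0
    for i in order:
        if total + costs[i] <= B:
            chosen.add(i)
            total += costs[i]
    return chosen

def threshold_payment(costs, B, i, hi=10.0, iters=60):
    # p_i = sup { c_i' : i in f(c_i', c_{-i}) }, found by binary search.
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        bid = list(costs)
        bid[i] = mid
        if i in allocate(bid, B):
            lo = mid
        else:
            hi = mid
    return lo

costs, B = [1.0, 2.0, 4.0], 5.0
S = allocate(costs, B)  # agents 0 and 1 are allocated
payments = {i: threshold_payment(costs, B, i) for i in S}
```

On this instance the thresholds come out to $3$ and $4$, which sum to $7>B$: paying thresholds alone does not guarantee budget feasibility, which is precisely the caveat noted above.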

\subsection{Budget Feasible Experimental Design}
In this paper, we approach the problem of optimal experimental design from the
perspective of a budget feasible reverse auction, as defined above.
 In particular, we assume the experimenter has a budget
$B\in\reals_+$ and plays the role of the buyer. 
Each experiment $i\in
\mathcal{N}$ corresponds to a strategic agent, whose cost $c_i$ is
private. In order to obtain the measurement $y_i$, the experimenter  needs to
pay agent $i$ a price that exceeds her cost. 

For example, each $i$ may correspond to a human participant; the
feature vector $x_i$ may correspond to a normalized vector of her age, weight,
gender, income, \emph{etc.}, and the measurement $y_i$ may capture some
biometric information (\emph{e.g.}, her red cell blood count, a genetic marker,
etc.). The cost $c_i$ is the amount the participant deems sufficient to
incentivize her participation in the study. Note that, in this setup, the feature vectors $x_i$ are public information that the experimenter can consult prior to designing the experiment. Moreover, though a participant may lie about her true cost $c_i$, she cannot lie about $x_i$ (\emph{i.e.}, all features are verifiable upon collection) or $y_i$ (\emph{i.e.}, she cannot falsify her measurement).

%\subsection{D-Optimality Criterion}
Ideally, motivated by the $D$-optimality criterion, we would like to design a mechanism that maximizes \eqref{dcrit} within a good approximation ratio. As \eqref{dcrit} may take arbitrarily small negative values, to define a meaningful approximation one considers the (equivalent) maximization of $V(S) = \det\T{X_S}X_S$.
However, the following lower bound implies that such an optimization goal cannot be attained under the constraints of truthfulness, budget feasibility, and individual rationality.

\begin{lemma}
For any $M>1$, there is no $M$-approximate, truthful, budget feasible, individually rational mechanism for a budget feasible reverse auction with value function $V(S) = \det\T{X_S}X_S$.
\end{lemma}
\begin{proof}
\input{proof_of_lower_bound1}
\end{proof}

This negative result motivates us to study a problem with a modified objective:
\begin{center}
\textsc{ExperimentalDesign} (EDP)
\begin{subequations}
\begin{align}
\text{Maximize}\quad V(S) &= \log\det(I_d+\T{X_S}X_S) \label{modified} \\
\text{subject to}\quad \sum_{i\in S} c_i&\leq B
\end{align}\label{edp}
\end{subequations}
\end{center} where $I_d\in \reals^{d\times d}$ is the identity matrix.
One possible interpretation of \eqref{modified} is that, prior to the auction, the experimenter has free access to $d$ experiments whose features form an orthonormal basis in $\reals^d$. However, \eqref{modified} can also be motivated in the context of \emph{Bayesian experimental design} \cite{chaloner1995bayesian}. In short, the objective \eqref{modified} arises naturally when the experimenter retrieves the model $\beta$ through \emph{ridge regression}, rather than the linear regression \eqref{leastsquares} over the observed data; we explore this connection in Section~\ref{sec:bed}. 
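The first interpretation can be checked directly: appending to $X_S$ the $d$ rows of an orthonormal basis yields a design whose Gram determinant is exactly $\det(I_d+\T{X_S}X_S)$. A quick numeric sketch (synthetic data, ours):

```python
import numpy as np

# Appending d "free" experiments whose features form an orthonormal basis
# (here, the rows of the identity) to the selected design X_S gives a Gram
# matrix equal to I_d + X_S^T X_S, matching the modified objective.
rng = np.random.default_rng(3)
d, k = 4, 7
XS = rng.normal(size=(k, d))
XS /= np.linalg.norm(XS, axis=1, keepdims=True)  # normalize ||x_i||_2 <= 1

X_aug = np.vstack([XS, np.eye(d)])
lhs = np.linalg.det(X_aug.T @ X_aug)
rhs = np.linalg.det(np.eye(d) + XS.T @ XS)
assert np.allclose(lhs, rhs)
```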

Note that maximizing \eqref{modified} is equivalent to maximizing \eqref{dcrit} in the full-information case. In particular, $\det(\T{X_S}X_S)> \det(\T{X_{S'}}X_{S'})$ iff $\det(I_d+\T{X_S}X_S)>\det(I_d+\T{X_{S'}}X_{S'})$. In addition, \eqref{edp} (and the equivalent problem with objective \eqref{dcrit}) are NP-hard; to see this, note that \textsc{Knapsack} reduces to EDP with dimension $d=1$ by mapping the weight $w_i$ of each item to an experiment with $x_i=w_i$. Nevertheless, \eqref{modified} is submodular, monotone, and satisfies $V(\emptyset) = 0$, allowing us to use the extensive machinery for the optimization of submodular functions, as well as recent results in the context of budget feasible auctions \cite{chen,singer-mechanisms}. 
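Since \eqref{modified} is monotone submodular, the standard cost-benefit greedy heuristic applies in the full-information case. The sketch below (synthetic instance, ours; the paper's truthful mechanism is more involved) adds, while the budget allows, the affordable experiment with the largest marginal gain per unit cost:

```python
import numpy as np

# Cost-greedy heuristic for EDP in the full-information case: maximize
# V(S) = log det(I + X_S^T X_S) subject to sum of costs <= B.
rng = np.random.default_rng(2)
n, d = 20, 3
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
c = rng.uniform(0.5, 2.0, size=n)  # costs, known in the full-information case
B = 4.0

def V(S):
    XS = X[sorted(S)]
    return np.linalg.slogdet(np.eye(d) + XS.T @ XS)[1]

S, spent = set(), 0.0
while True:
    best, best_ratio = None, 0.0
    for i in set(range(n)) - S:
        if spent + c[i] <= B:
            ratio = (V(S | {i}) - V(S)) / c[i]  # marginal gain per unit cost
            if ratio > best_ratio:
                best, best_ratio = i, ratio
    if best is None:
        break
    S.add(best)
    spent += c[best]
```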



\begin{comment}
\subsection{Experimental Design}



In the context of experimental design, an \emph{experiment} is a random variable $E$ sampled from a distribution $P_\beta$, where $\beta\in \Omega$ is an unknown parameter. An experimenter wishes to learn parameter $\beta$, and can choose among a set of possible different experiments, all of which have distributions parametrized by the same $\beta$. 

The problem of optimal experimental design amounts to determining an experiment that maximizes the information revealed about parameter $\beta$. 
Though a variety of different measures of information exist in the literature (see, \emph{e.g.}, \cite{ginebra,chaloner}), the so-called \emph{value of information} \cite{lindley} is commonly used in traditional Bayesian experimental design. In particular, in the Bayesian setup, it is assumed that $\beta$ is sampled from a well-known prior distribution. The value of an experiment $E$ is then defined as the expected change in the entropy of $\beta$ (\emph{i.e.}, the mutual information between $E$ and $\beta$), given by
\begin{align}
 \mutual(\beta; E) =  \entropy(\beta) - \entropy(\beta \mid E).\label{voi}
\end{align} 
%where $H(\beta)$ is the entropy of the prior distribution and $\entropy(\beta \mid E)$ is the conditional entropy con
\subsection{Linear Regression}
In this paper, we focus on \emph{linear regression} experiments, aiming to discover a linear function from user data. In particular, we consider a set of $n$ users $\mathcal{N} = \{1,\ldots, n\}$. Each user
$i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$, $\norm{x_i}_2\leq 1$, and an
undisclosed piece of information $y_i\in\reals$. 
For example, the features could be the age, weight, or height of user $i$, while the latter can be her propensity to contract a disease. 

The variables $x_i$ and $y_i$ are related through the following relationship:
\begin{align}
     y_i = \T{\beta} x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},\label{model}
\end{align}
where $\beta\in\reals^d$ is the unknown parameter and $\varepsilon_i\in\reals$ are independent, normally distributed random variables, with zero mean and variance $\sigma^2_0$. 
The experimenter wishes to learn $\beta$ by observing the undisclosed $y_i$ of a subset of users $i\in S$. Hence, the set of possible experiments comprises all random variables $y_{S}\in\reals^{|S|}$, $S\subseteq \mathcal{N}$.

Learning $\beta$ has many interesting applications, which make linear regression ubiquitous in data mining, statistics, and machine learning. On one hand, the function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. Moreover, the sign and relative magnitudes of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated with a disease. 


\subsection{Value of Information}
In the Bayesian setting,  
it is commonly assumed that $\beta$  follows a
multivariate normal distribution of mean zero and covariance matrix $\sigma_1^2
I_d$. Under this prior and the linear model \eqref{model}, the value of information \eqref{voi} of an experiment $Y_S$ is given by \cite{boyd,chaloner}
    \begin{align}\label{vs}
        V(S)
        & \defeq I(\beta;y_S) = \frac{1}{2}\log\det\left(I_d
            + \mu\sum_{i\in S} x_i\T{x_i}\right), 
     %   & \defeq \frac{1}{2}\log\det A(S)
    \end{align}
where $\mu = \sigma_1^2/\sigma_0^2$.

Some intuition behind the Gaussian prior on $\beta$ and the role that parameter $\mu$ plays can be obtained from how the experimenter estimates $\beta$ upon the observation of $y_S$. In particular, the experimenter can estimate $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$.
It is easy to show that, under the linearity assumption \eqref{model} and the gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization
problem:
\begin{displaymath}
    \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_{i\in S} (y_i - \T{\beta}x_i)^2
    + \frac{1}{\mu}\norm{\beta}_2^2
\end{displaymath}
This optimization, commonly known as \emph{ridge regression}, reduces to a least squares fit for $\mu=\infty$. For finite $\mu$, ridge regression
acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors with the same square error, the one with the smallest norm is preferred.   This is consistent with the Gaussian prior on $\beta$, which implies that vectors with small norms are more likely. %In practice, ridge regression is known to give better prediction results over new data than model parameters computed through a least squares fit.

\subsection{A Budgeted Auction}\label{sec:auction}

TODO Explain the optimization problem, why it has to be formulated as an auction
problem. Explain the goals:
\begin{itemize}
    \item truthful
    \item individually rational
    \item budget feasible
    \item has a good approximation ratio
\end{itemize}

\thibaut{TODO Explain what is already known: it is ok when the function is submodular. When
should we introduce the notion of submodularity?}

\subsection{Linear and Ridge Regression}

Linear regression, a true workhorse of statistical analysis, fits a linear function to a set of observable input and output variables. In particular, consider a set of $n$ value pairs $(x_i,y_i)$, $i\in \mathcal{N} =\{1,\ldots,n\}$. The variables $x_i\in \reals^d$ are usually referred to as \emph{input} variables or \emph{covariates} and $y_i\in \reals$ are referred to as  \emph{output} or \emph{response} variables. For example, the former could be the age, weight, or height of a person, while the latter can be their propensity to contract a disease. Assume that input and output variables are related through the following relationship:
 \begin{displaymath}
     y_i = \T{\beta} x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
\end{displaymath}
where $\beta\in\reals^d$, is called the \emph{model parameter vector} and $\varepsilon_i\in\reals$ are independent, normally distributed random variables, with zero mean and variance $\sigma^2$. 

The purpose of linear regression is to recover the vector $\beta$ upon the observation of the pairs $(x_i,y_i)$, $i=1,\ldots,n$. Learning $\beta$ has many interesting applications, which make linear regression ubiquitous in data mining, statistics, and machine learning. On one hand, the function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. Moreover, the sign and relative magnitudes of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated with a disease. 




In the absence of any additional information, a natural method for estimating $\beta$ is through a least squares fit. However, in a more general setup,
prior knowledge about $\beta$, captured by a distribution over
$\reals^d$, can also be taken into account. In this setup, after observing the data, $\beta$ can be retrieved through \emph{maximum a posteriori estimation}: computing the parameters which maximize the
posterior distribution of $\beta$ given the observations.

A common prior assumption for linear regression is that $\beta$ follows a
multivariate normal distribution of mean zero and covariance matrix $\kappa
I_d$. Under this assumption, maximum a posteriori estimation leads to the following maximization
problem, commonly known as \emph{ridge regression}:
\begin{displaymath}
    \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
    + \frac{1}{\mu}\norm{\beta}_2^2
where 
$\mu
= \frac{\kappa}{\sigma^2}$ is referred to as the \emph{regularization parameter}.  For $\mu=\infty$, the above minimization recovers a least squares fit. Compared to the latter, ridge regression
acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors with the same square error, the one with the smallest norm is preferred.   This is consistent with the Gaussian prior on $\beta$, which favors vectors with small norms, and is known in practice to give better prediction results over new data than model parameters computed through a least squares fit.

\stratis{$1/\mu$ is awkward, especially since we do not cover linear regression. Can we use $\mu$ instead?}




\subsection{Notations}


Throughout the paper, we will make use of the following notations: if $x$ is
a (column) vector in $\reals^d$, $\T{x}$ denotes its transposed (line)
vector. Thus, the standard inner product between two vectors $x$ and $y$ is
simply $\T{x}y$. $\norm{x}_2 = \sqrt{\T{x}x}$ will denote the $L_2$ norm of $x$.

We will also often use the following order over symmetric matrices: if $A$ and
$B$ are two $d\times d$ real symmetric matrices, we
write that $A\leq B$ iff:
\begin{displaymath}
    \forall x\in\reals^d,\quad
    \T{x}Ax \leq \T{x}Bx
\end{displaymath}
That is, iff $B-A$ is symmetric positive semi-definite.

This order let us define the notion of a \emph{decreasing} or \emph{convex}
matrix function similarly to their real counterparts. In particular, let us
recall that the matrix inversion is decreasing and convex over symmetric
definite positive matrices.

\stratis{This section should either be removed or compactified. Short notations like $*$ for transpose can be introduced where they first appear. Better yet,
 can we revert to $^T$ or something similar for transpose? I know $*$ is used in real analysis etc., but $*$ is used for optimal values too when dealing with optimization.  Also, do we need the extension of decreasing and convex to matrix function? Does it have ``global scope''? If it is used only once, we might as well place this notation there.}


\subsection{Experiment Design}
There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user
$i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$ and an
undisclosed piece of information $y_i\in\reals$. We assume that the data
has already been normalized so that $\norm{x_i}_2\leq 1$ for all
$i\in\mathcal{N}$.

The experimenter is going to select a set of users and ask them to reveal their
private piece of information. We are interested in a \emph{survey setup}: the
experimenter has not seen the data yet, but he wants to know which users he
should be selecting. His goal is to learn the model underlying the data. 

\stratis{Describe the experiment setup. The value function can be part of this now, as it is the means through which the experimenter quantifies the value of a solution.}

\subsection{Data model}

As in the previous subsection, each user $i\in\mathcal{N}$ has a public feature vector $x_i\in\reals^d$, normalized so that $\norm{x_i}_2\leq 1$, and an undisclosed $y_i\in\reals$; the experimenter selects users and asks them to reveal $y_i$ in order to learn the model underlying the data. Here,
we assume a linear model:
\begin{displaymath}
    \forall i\in\mathcal{N},\quad y_i = \T{\beta}x_i + \varepsilon_i
\end{displaymath}
where $\beta\in\reals^d$ and $\varepsilon_i\in\reals$ follows a normal
distribution of mean $0$ and variance $\sigma^2$. Furthermore, we assume the
error $\varepsilon$ to be independent of the user:
$(\varepsilon_i)_{i\in\mathcal{N}}$ are mutually independent.

After observing the data, the experimenter could simply do linear regression to
learn the model parameter $\beta$. However, in a more general setup, the
experimenter has a prior knowledge about $\beta$, a distribution over
$\reals^d$. After observing the data, the experimenter performs
\emph{maximum a posteriori estimation}: computing the point which maximizes the
posterior distribution of $\beta$ given the observations.

Here, we will assume, as it is often done, that the prior distribution is
a multivariate normal distribution of mean zero and covariance matrix $\kappa
I_d$. Maximum a posteriori estimation leads to the following maximization
problem:
\begin{displaymath}
    \beta_{\text{max}} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
    + \frac{1}{\mu}\norm{\beta}_2^2
\end{displaymath}
which is the well-known \emph{ridge regression}. $\mu
= \frac{\kappa}{\sigma^2}$ is the regularization parameter. Ridge regression
can thus be seen as linear regression with a regularization term which
prevents $\beta$ from having a large $L_2$-norm.

\subsection{Value of data}

Because the user private variables $y_i$ have not been observed yet when the
experimenter has to decide which users to include in his experiment, we treat
$\beta$ as a random variable whose distribution is updated after observing the
data.

Let us recall that if $\beta$ is random variable over $\reals^d$ whose
probability distribution has a density function $f$ with respect to the
Lebesgue measure, its entropy is given by:
\begin{displaymath}
    \mathbb{H}(\beta) \defeq - \int_{b\in\reals^d} \log f(b) f(b)\text{d}b
\end{displaymath}

A standard way to measure the decrease in uncertainty induced by the
observation of data is the entropy reduction. This leads to the following
definition of the value of data, called the \emph{value of information}:
\begin{displaymath}
    \forall S\subset\mathcal{N},\quad V(S) = \mathbb{H}(\beta)
    - \mathbb{H}(\beta\,|\,
    Y_S)
\end{displaymath}
where $Y_S = \{y_i,\,i\in S\}$ is the set of observed data.

\begin{theorem}
    Under the ridge regression model described above, the value of data
    is equal to:
    \begin{align*}
        \forall S\subset\mathcal{N},\; V(S)
        & = \frac{1}{2}\log\det\left(I_d
            + \mu\sum_{i\in S} x_i\T{x_i}\right)\\
        & \defeq \frac{1}{2}\log\det A(S)
    \end{align*}
\end{theorem}

\begin{proof}

Let us denote by $X_S$ the matrix whose rows are the vectors $(\T{x_i})_{i\in
S}$. Observe that $A(S)$ can simply be written as:
\begin{displaymath}
    A(S) = I_d + \mu \T{X_S} X_S
\end{displaymath}

Let us recall that the entropy of a multivariate normal variable $B$ over
$\reals^d$ with covariance matrix $\Sigma$ is given by:
\begin{equation}\label{eq:multivariate-entropy}
    \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma\big)
\end{equation}

By symmetry of the mutual information between $\beta$ and $Y_S$ (a consequence
of the chain rule for entropy), we get that:
\begin{displaymath}
    V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta)
\end{displaymath}

Conditioned on $\beta$, $Y_S$ follows a multivariate normal distribution of
mean $X_S\beta$ and of covariance matrix $\sigma^2 I_n$, where $n = |S|$.
Hence:
\begin{equation}\label{eq:h1}
    \mathbb{H}(Y_S\,|\,\beta) 
    = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
\end{equation}

Marginally, $Y_S$ also follows a multivariate normal distribution of mean
zero. Let us compute its covariance matrix $\Sigma_Y$, writing $Y_S = X_S\beta
+ E$ with $E = (\varepsilon_i)_{i\in S}$:
\begin{align*}
    \Sigma_Y & = \expt{Y_S\T{Y_S}} = \expt{(X_S\beta + E)\T{(X_S\beta + E)}}\\
             & = \kappa X_S \T{X_S} + \sigma^2I_n
\end{align*}
Thus, we get that:
\begin{equation}\label{eq:h2}
    \mathbb{H}(Y_S) 
    = \frac{1}{2}\log\left((2\pi e)^n \det(\kappa X_S \T{X_S} + \sigma^2 I_n)\right)
\end{equation}

Combining \eqref{eq:h1} and \eqref{eq:h2} we get:
\begin{displaymath}
    V(S) = \frac{1}{2}\log\det\left(I_n+\frac{\kappa}{\sigma^2}X_S
    \T{X_S}\right)
\end{displaymath}

Finally, we can use Sylvester's determinant theorem to get the result.
\end{proof}
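The final step of the proof, trading the $n\times n$ determinant for the $d\times d$ one via Sylvester's determinant theorem, is easy to check numerically. The sketch below (illustrative dimensions and $\mu$, numpy only) compares the two log-determinant expressions for $V(S)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, mu = 8, 3, 0.5
X_S = rng.normal(size=(n, d))   # rows are the x_i of the selected users

# V(S) written over the n x n observation covariance ...
_, logdet_n = np.linalg.slogdet(np.eye(n) + mu * X_S @ X_S.T)
# ... and over the d x d matrix A(S); equal by Sylvester's theorem
_, logdet_d = np.linalg.slogdet(np.eye(d) + mu * X_S.T @ X_S)

assert np.isclose(0.5 * logdet_n, 0.5 * logdet_d)
```

In practice the $d\times d$ form is the useful one, since the number of candidate users $n$ typically far exceeds the dimension $d$.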

It is also interesting to look at the marginal contribution of a user to a set: the
increase of value induced by adding a user to an already existing set of users.
We have the following lemma.

\begin{lemma}[Marginal contribution]
    \begin{displaymath}
        \Delta_i V(S)\defeq V(S\cup\{i\}) - V(S) 
        = \frac{1}{2}\log\left(1 + \mu \T{x_i} A(S)^{-1}x_i\right)
    \end{displaymath}
\end{lemma}

\begin{proof}
    We have:
    \begin{align*}
        V(S\cup\{i\}) & = \frac{1}{2}\log\det A(S\cup\{i\})\\
        & = \frac{1}{2}\log\det\left(A(S) + \mu x_i \T{x_i}\right)\\
        & = V(S) + \frac{1}{2}\log\det\left(I_d + \mu A(S)^{-1}x_i
    \T{x_i}\right)\\
        & = V(S) + \frac{1}{2}\log\left(1 + \mu \T{x_i} A(S)^{-1}x_i\right)
    \end{align*}
    where the last equality comes from Sylvester's determinant formula.
\end{proof}
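The lemma can likewise be verified numerically. The following sketch (arbitrary illustrative data) builds $V$ from the theorem's formula and compares the direct difference $V(S\cup\{i\}) - V(S)$ with the closed-form marginal contribution:

```python
import numpy as np

rng = np.random.default_rng(2)
d, mu = 4, 1.5
xs = rng.normal(size=(6, d))    # candidate feature vectors x_i

def A(S):
    """A(S) = I_d + mu * sum_{i in S} x_i x_i^T"""
    return np.eye(d) + mu * sum(np.outer(xs[i], xs[i]) for i in S)

def V(S):
    """V(S) = (1/2) log det A(S)"""
    return 0.5 * np.linalg.slogdet(A(S))[1]

S, i = [0, 1, 2], 3
delta_direct = V(S + [i]) - V(S)
delta_formula = 0.5 * np.log(1 + mu * xs[i] @ np.linalg.solve(A(S), xs[i]))

assert np.isclose(delta_direct, delta_formula)
```

The closed form only requires a linear solve against $A(S)$, which is what makes greedy selection over many candidates cheap.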

Because $A(S)$ is symmetric positive definite, the marginal contribution is
positive, which proves that the value function is monotone (set-increasing).
Furthermore, it is easy to see that if $S\subset S'$, then $A(S)\leq A(S')$ in
the Loewner order. Since matrix inversion is order-reversing for this order,
the marginal contribution of a fixed user is a set-decreasing function: this
is the \emph{submodularity} of the value function.
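Both properties, monotonicity and diminishing marginal contributions, can be observed directly on random data. A minimal sketch (illustrative values, numpy only), using the value function from the theorem:

```python
import numpy as np

rng = np.random.default_rng(3)
d, mu = 4, 1.0
xs = rng.normal(size=(7, d))    # candidate feature vectors x_i

def V(S):
    """V(S) = (1/2) log det (I_d + mu * sum_{i in S} x_i x_i^T)"""
    A = np.eye(d) + mu * sum(np.outer(xs[i], xs[i]) for i in S)
    return 0.5 * np.linalg.slogdet(A)[1]

def delta(i, S):
    """Marginal contribution of user i to set S."""
    return V(list(S) + [i]) - V(S)

S, S_big, i = [0, 1], [0, 1, 2, 3], 4
assert delta(i, S) >= 0                          # monotonicity
assert delta(i, S) >= delta(i, S_big) - 1e-12    # diminishing returns
```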

Intuitively, the most valuable points are those aligned with directions along
which the already selected points have little spread: observing such points
reduces the posterior uncertainty about $\beta$ the most.
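This intuition is easy to illustrate numerically. In the sketch below (a toy two-dimensional example with arbitrary values), the existing points all lie along the first axis; a new point in the orthogonal, unexplored direction has a strictly larger marginal contribution than one aligned with the existing data:

```python
import numpy as np

mu, d = 1.0, 2
# Existing observations all lie along the first coordinate axis
xs_existing = [np.array([1.0, 0.0]), np.array([2.0, 0.0])]
A = np.eye(d) + mu * sum(np.outer(x, x) for x in xs_existing)

def marginal(x):
    """Marginal contribution (1/2) log(1 + mu x^T A^{-1} x) of a new point x."""
    return 0.5 * np.log(1 + mu * x @ np.linalg.solve(A, x))

aligned = np.array([1.0, 0.0])      # same direction as existing data
orthogonal = np.array([0.0, 1.0])   # unexplored direction

assert marginal(orthogonal) > marginal(aligned)
```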

\subsection{Auction}

TODO Explain the optimization problem, why it has to be formulated as an auction
problem. Explain the goals:
\begin{itemize}
    \item truthful
    \item individually rational
    \item budget feasible
    \item has a good approximation ratio
\end{itemize}

TODO Explain what is already known: it is ok when the function is submodular. When
should we introduce the notion of submodularity?

\end{comment}