 -rw-r--r--  definitions.tex |  4
 -rw-r--r--  paper.tex       |  2
 -rw-r--r--  problem.tex     | 73
3 files changed, 67 insertions, 12 deletions
diff --git a/definitions.tex b/definitions.tex
index ca6b3ba..139a6db 100644
--- a/definitions.tex
+++ b/definitions.tex
@@ -13,4 +13,6 @@
 \newcommand{\mse}{\mathop{\mathrm{MSE}}}
 \DeclareMathOperator{\trace}{tr}
 \DeclareMathOperator*{\argmax}{arg\,max}
-
+\DeclareMathOperator*{\argmin}{arg\,min}
+\newcommand{\reals}{\ensuremath{\mathbbm{R}}}
+\newcommand{\stratis}[1]{\textcolor{red}{Stratis: #1}}
diff --git a/paper.tex b/paper.tex
--- a/paper.tex
+++ b/paper.tex
@@ -2,7 +2,7 @@
 \usepackage[utf8]{inputenc}
 \usepackage{amsmath,amsfonts}
 \usepackage{algorithm}
-\usepackage{algpseudocode}
+\usepackage{algpseudocode,bbm,color,verbatim}
 \input{definitions}
 \title{Budgeted Auctions for Experiment Design}
 \begin{document}
diff --git a/problem.tex b/problem.tex
index fb9f8e1..e25e5c3 100644
--- a/problem.tex
+++ b/problem.tex
@@ -1,7 +1,8 @@
+
 \subsection{Notations}
 Throughout the paper, we will make use of the following notations: if $x$ is
-a (column) vector in $\mathbf{R}^d$, $x^*$ denotes its transposed (line)
+a (column) vector in $\reals^d$, $x^*$ denotes its transposed (line)
 vector. Thus, the standard inner product between two vectors $x$ and $y$ is
 simply $x^* y$. $\norm{x}_2 = x^*x$ will denote the $L_2$ norm of $x$.
@@ -9,7 +10,7 @@ We will also often use the following order over symmetric matrices: if $A$
 and $B$ are two $d\times d$ real symmetric matrices, we write that $A\leq B$ iff:
 \begin{displaymath}
-  \forall x\in\mathbf{R}^d,\quad
+  \forall x\in\reals^d,\quad
   x^*Ax \leq x^*Bx
 \end{displaymath}
 That is, iff $B-A$ is symmetric semi-definite positive.
@@ -19,11 +20,62 @@ matrix function similarly to their real counterparts. In particular, let us
 recall that the matrix inversion is decreasing and convex over symmetric
 definite positive matrices.
+
+\stratis{This section should either be removed or compactified. Short
+notations like $*$ for transpose can be introduced where they first appear.
+Better yet, can we revert to $^T$ or something similar for transpose? I know
+$*$ is used in real analysis etc., but $*$ is used for optimal values too
+when dealing with optimization. Also, do we need the extension of decreasing
+and convex to matrix function? Does it have ``global scope''? If it is used
+only once, we might as well place this notation there.}
+
+\subsection{Linear and Ridge Regression}
+
+Linear regression, a true workhorse of statistical analysis, fits a linear
+function to a set of observable input and output variables. In particular,
+consider a set of $n$ value pairs $(x_i,y_i)$, $i\in \mathcal{N}
+=\{1,\ldots,n\}$. The variables $x_i\in \reals^d$ are usually referred to as
+\emph{input} variables or \emph{covariates}, and $y_i\in \reals$ are referred
+to as \emph{output} or \emph{response} variables. For example, the former
+could be the age, weight, or height of a person, while the latter could be
+their propensity to contract a disease. Assume that input and output
+variables are related as follows:
+\begin{displaymath}
+  y_i = \beta^* x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
+\end{displaymath}
+where $\beta\in\reals^d$ is called the \emph{model parameter vector} and the
+$\varepsilon_i\in\reals$ are independent, normally distributed random
+variables with zero mean and variance $\sigma^2$.
+
+The purpose of linear regression is to recover the vector $\beta$ upon
+observing the pairs $(x_i,y_i)$, $i=1,\ldots,n$. Learning $\beta$ has many
+interesting applications that make linear regression ubiquitous in data
+mining, statistics, and machine learning.
+On one hand, the function itself can be used for \emph{prediction},
+\emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$.
+Moreover, the sign and relative magnitude of the coefficients of $\beta$ can
+aid in identifying how different inputs affect the output---establishing,
+\emph{e.g.}, that weight, rather than age, is more strongly correlated with a
+disease.
+
+In the absence of any additional information, a natural method for estimating
+$\beta$ is through a least squares fit. In a more general setup, however,
+prior knowledge about $\beta$, captured by a distribution over $\reals^d$,
+can also be taken into account. In that case, after observing the data,
+$\beta$ can be retrieved through \emph{maximum a posteriori estimation}:
+computing the parameters which maximize the posterior distribution of $\beta$
+given the observations.
+
+A common prior assumption for linear regression is that $\beta$ follows a
+multivariate normal distribution of mean zero and covariance matrix $\kappa
+I_d$. Under this assumption, maximum a posteriori estimation leads to the
+following minimization problem, commonly known as \emph{ridge regression}:
+\begin{displaymath}
+  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
+  + \frac{1}{\mu}\norm{\beta}_2^2
+\end{displaymath}
+where $\mu = \frac{\kappa}{\sigma^2}$ is referred to as the
+\emph{regularization parameter}. For $\mu=\infty$, the above minimization
+recovers a least squares fit. Compared to the latter, ridge regression acts
+as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for
+$\beta$: between two vectors with the same squared error, the one with the
+smaller norm is preferred. This is consistent with the Gaussian prior on
+$\beta$, which favors vectors with small norms, and, in practice, ridge
+regression is known to give better predictions on new data than model
+parameters computed through a least squares fit.
+
+\stratis{$1/\mu$ is awkward, especially since we do not cover linear
+regression. Can we use $\mu$ instead?}
+
+\subsection{Experiment Design}
+There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user
+$i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$ and an
+undisclosed piece of information $y_i\in\reals$. We assume that the data
+has already been normalized so that $\norm{x_i}_2\leq 1$ for all
+$i\in\mathcal{N}$.
+
+The experimenter is going to select a set of users and ask them to reveal
+their private piece of information. We are interested in a \emph{survey
+setup}: the experimenter has not seen the data yet, but he must decide which
+users to select. His goal is to learn the model underlying the data.
+
+\stratis{Describe the experiment setup. The value function can be part of
+this now, as it is the means through which the experimenter quantifies the
+value of a solution.}
+
+\begin{comment}
 \subsection{Data model}
 There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user
-$i\in\mathcal{N}$ has a public vector of features $x_i\in\mathbf{R}^d$ and an
-undisclosed piece of information $y_i\in\mathbf{R}$. We assume that the data
+$i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$ and an
+undisclosed piece of information $y_i\in\reals$. We assume that the data
 has already been normalized so that $\norm{x_i}_2\leq 1$ for all
 $i\in\mathcal{N}$.
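Note (not part of the patch): the ridge regression objective added above is stated without its solution. As a point of reference, setting the gradient of the objective to zero gives the standard closed-form estimator, written here using only the symbols already defined in the added text:

\begin{displaymath}
  \hat{\beta} = \Big(\sum_{i\in\mathcal{N}} x_i x_i^* + \tfrac{1}{\mu} I_d\Big)^{-1}
  \sum_{i\in\mathcal{N}} y_i x_i ,
\end{displaymath}

which is well defined for every $\mu > 0$, since the matrix being inverted is positive definite; regularization thus guarantees invertibility even when the $x_i$ do not span $\reals^d$.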
@@ -35,7 +87,7 @@ we assume a linear model:
 \begin{displaymath}
   \forall i\in\mathcal{N},\quad y_i = \beta^* x_i + \varepsilon_i
 \end{displaymath}
-where $\beta\in\mathbf{R}^d$ and $\varepsilon_i\in\mathbf{R}$ follows a normal
+where $\beta\in\reals^d$ and $\varepsilon_i\in\reals$ follows a normal
 distribution of mean $0$ and variance $\sigma^2$. Furthermore, we assume the
 error $\varepsilon$ to be independent of the user:
 $(\varepsilon_i)_{i\in\mathcal{N}}$ are mutually independent.
@@ -43,7 +95,7 @@ $(\varepsilon_i)_{i\in\mathcal{N}}$ are mutually independent.
 
 After observing the data, the experimenter could simply do linear regression
 to learn the model parameter $\beta$. However, in a more general setup, the
 experimenter has a prior knowledge about $\beta$, a distribution over
-$\mathbf{R}^d$. After observing the data, the experimenter performs
+$\reals^d$. After observing the data, the experimenter performs
 \emph{maximum a posteriori estimation}: computing the point which maximizes
 the posterior distribution of $\beta$ given the observations.
@@ -52,13 +104,14 @@ a multivariate normal distribution of mean zero and covariance matrix $\kappa
 I_d$. Maximum a posteriori estimation leads to the following maximization
 problem:
 \begin{displaymath}
-  \beta_{\text{max}} = \argmax_{\beta\in\mathbf{R}^d} \sum_i (y_i - \beta^*x_i)^2
+  \beta_{\text{max}} = \argmax_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
   + \frac{1}{\mu}\sum_i \norm{\beta}_2^2
 \end{displaymath}
 which is the well-known \emph{ridge regression}. $\mu =
 \frac{\kappa}{\sigma^2}$ is the regularization parameter. Ridge regression
 can thus be seen as linear regression with a regularization term which
 prevents $\beta$ from having a large $L_2$-norm.
+\end{comment}
 
 \subsection{Value of data}
 
@@ -67,11 +120,11 @@ experimenter has to decide which users to include in his experiment, we treat
 $\beta$ as a random variable whose distribution is updated after observing
 the data.
 
-Let us recall that if $\beta$ is random variable over $\mathbf{R}^d$ whose
+Let us recall that if $\beta$ is random variable over $\reals^d$ whose
 probability distribution has a density function $f$ with respect to the
 Lebesgue measure, its entropy is given by:
 \begin{displaymath}
-  \mathbb{H}(\beta) \defeq - \int_{b\in\mathbf{R}^d} \log f(b) f(b)\text{d}b
+  \mathbb{H}(\beta) \defeq - \int_{b\in\reals^d} \log f(b) f(b)\text{d}b
 \end{displaymath}
 
 A usual way to measure the decrease of uncertainty induced by the observation
@@ -104,7 +157,7 @@ S}$. Observe that $A_S$ can simply be written as:
 \end{displaymath}
 
 Let us recall that the entropy of a multivariate normal variable $B$ over
-$\mathbf{R}^d$ of covariance $\Sigma I_d$ is given by:
+$\reals^d$ of covariance $\Sigma I_d$ is given by:
 \begin{equation}\label{eq:multivariate-entropy}
   \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma I_d\big)
 \end{equation}
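Note (not part of the patch): a short worked consequence of Equation~\eqref{eq:multivariate-entropy} may help connect it to the entropy definition above. Here $\Sigma_0$ and $\Sigma_1$ are placeholder names, introduced only for this note, for the prior and posterior covariance matrices of $\beta$; for a Gaussian prior and Gaussian noise, the posterior covariance does not depend on the observed values, so the decrease of uncertainty brought by the observations is

\begin{displaymath}
  \mathbb{H}(\beta) - \mathbb{H}(\beta \mid \text{observations})
  = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma_0\big)
  - \frac{1}{2}\log\big((2\pi e)^d \det \Sigma_1\big)
  = \frac{1}{2}\log\frac{\det \Sigma_0}{\det \Sigma_1} .
\end{displaymath}

The $(2\pi e)^d$ factors cancel, so the decrease of uncertainty reduces to a ratio of determinants of the prior and posterior covariances.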
