-rw-r--r--  definitions.tex                           |   5
-rw-r--r--  paper.tex                                 |   4
-rw-r--r--  papers/chaloner.pdf                       | bin 0 -> 1696615 bytes
-rw-r--r--  papers/experimental_design/degroot1.pdf   | bin 0 -> 1469466 bytes
-rw-r--r--  papers/experimental_design/ginebra.pdf    | bin 0 -> 412755 bytes
-rw-r--r--  papers/experimental_design/kiefer.pdf     | bin 0 -> 6393019 bytes
-rw-r--r--  papers/experimental_design/lindley.pdf    | bin 0 -> 1940302 bytes
-rw-r--r--  papers/experimental_design/max-ent00.pdf  | bin 0 -> 106829 bytes
-rw-r--r--  problem.tex                               | 103
9 files changed, 107 insertions, 5 deletions
diff --git a/definitions.tex b/definitions.tex
index 139a6db..378781d 100644
--- a/definitions.tex
+++ b/definitions.tex
@@ -1,9 +1,12 @@
+\newcommand{\mutual}{\ensuremath{{I}}}
+\newcommand{\entropy}{\ensuremath{{H}}}
\newtheorem{lemma}{Lemma}
\newtheorem{fact}{Fact}
\newtheorem{example}{Example}
\newtheorem{prop}{Proposition}
\newtheorem{theorem}{Theorem}
-\newcommand*{\defeq}{\stackrel{\text{def}}{=}}
+%\newcommand*{\defeq}{\stackrel{\text{def}}{=}}
+\newcommand*{\defeq}{\equiv}
\newcommand{\var}{\mathop{\mathrm{Var}}}
\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
diff --git a/paper.tex b/paper.tex
index 4577bfa..46ad60a 100644
--- a/paper.tex
+++ b/paper.tex
@@ -4,7 +4,7 @@
\usepackage{algorithm}
\usepackage{algpseudocode,bbm,color,verbatim}
\input{definitions}
-\title{Budgeted Auctions for Experiment Design}
+\title{Budgeted Auctions for Experimental Design}
\begin{document}
\maketitle
@@ -16,7 +16,7 @@
\input{problem}
\section{Main result}
\input{main}
-\section{General setup}
+\section{Extension to Other Problems}
\input{general}
\section{Conclusion}
\input{conclusion}
diff --git a/papers/chaloner.pdf b/papers/chaloner.pdf
new file mode 100644
index 0000000..184eb39
--- /dev/null
+++ b/papers/chaloner.pdf
Binary files differ
diff --git a/papers/experimental_design/degroot1.pdf b/papers/experimental_design/degroot1.pdf
new file mode 100644
index 0000000..efe4131
--- /dev/null
+++ b/papers/experimental_design/degroot1.pdf
Binary files differ
diff --git a/papers/experimental_design/ginebra.pdf b/papers/experimental_design/ginebra.pdf
new file mode 100644
index 0000000..d1010d5
--- /dev/null
+++ b/papers/experimental_design/ginebra.pdf
Binary files differ
diff --git a/papers/experimental_design/kiefer.pdf b/papers/experimental_design/kiefer.pdf
new file mode 100644
index 0000000..c1e9423
--- /dev/null
+++ b/papers/experimental_design/kiefer.pdf
Binary files differ
diff --git a/papers/experimental_design/lindley.pdf b/papers/experimental_design/lindley.pdf
new file mode 100644
index 0000000..dd73e3d
--- /dev/null
+++ b/papers/experimental_design/lindley.pdf
Binary files differ
diff --git a/papers/experimental_design/max-ent00.pdf b/papers/experimental_design/max-ent00.pdf
new file mode 100644
index 0000000..410626e
--- /dev/null
+++ b/papers/experimental_design/max-ent00.pdf
Binary files differ
diff --git a/problem.tex b/problem.tex
index e25e5c3..3faef3b 100644
--- a/problem.tex
+++ b/problem.tex
@@ -1,6 +1,106 @@
+\subsection{Experimental Design}
+In the context of experimental design, an \emph{experiment} is a random variable $E$ sampled from a distribution $P_\beta$, where $\beta\in \Omega$ is an unknown parameter. An experimenter wishes to learn the parameter $\beta$, and can choose among a set of different possible experiments, all of whose distributions are parametrized by the same $\beta$.
+
+The problem of optimal experimental design amounts to determining an experiment that maximizes the information revealed about the parameter $\beta$.
+Though a variety of different measures of information exist in the literature (see, \emph{e.g.}, \cite{ginebra}), the so-called \emph{value of information} \cite{lindley} is commonly used in traditional Bayesian experimental design \cite{lindley}. In particular, in the Bayesian setup, it is assumed that $\beta$ is sampled from a known prior distribution. The value of an experiment $E$ is then defined as the expected reduction in the entropy of $\beta$ (\emph{i.e.}, the mutual information between $E$ and $\beta$), given by
+\begin{align}
+ \mutual(\beta; E) = \entropy(\beta) - \entropy(\beta \mid E).\label{voi}
+\end{align}
+%where $H(\beta)$ is the entropy of the prior distribution and $\entropy(\beta \mid E)$ is the conditional entropy con
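+As a simple illustration, suppose that $\beta\sim\mathcal{N}(0,\sigma_1^2)$ is scalar and that the experiment is the noisy observation $E = \beta + \varepsilon$, with $\varepsilon\sim\mathcal{N}(0,\sigma_0^2)$ independent of $\beta$ (a toy special case of the regression setting introduced below). Both terms in \eqref{voi} are then entropies of Gaussian random variables, and a direct computation gives
+\begin{align*}
+ \mutual(\beta; E) = \frac{1}{2}\log\left(2\pi e\,\sigma_1^2\right) - \frac{1}{2}\log\left(2\pi e\,\frac{\sigma_1^2\sigma_0^2}{\sigma_1^2+\sigma_0^2}\right) = \frac{1}{2}\log\left(1+\frac{\sigma_1^2}{\sigma_0^2}\right),
+\end{align*}
+so the value of the experiment increases with the prior uncertainty $\sigma_1^2$ and decreases with the observation noise $\sigma_0^2$.
+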
+\subsection{Linear Regression}
+In this paper, we focus on \emph{linear regression} experiments, which aim to learn a linear function from user data. In particular, we consider a set of $n$ users $\mathcal{N} = \{1,\ldots, n\}$. Each user
+$i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$, $\norm{x_i}_2\leq 1$, and an
+undisclosed piece of information $y_i\in\reals$.
+For example, the features could be the age, weight, or height of user $i$, while $y_i$ could be her propensity to contract a disease.
+
+The variables $x_i$ and $y_i$ are related through the following linear model:
+\begin{align}
+ y_i = \beta^T x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},\label{model}
+\end{align}
+where $\beta\in\reals^d$ is the unknown parameter and the $\varepsilon_i\in\reals$ are independent, normally distributed random variables with zero mean and variance $\sigma^2_0$.
+The experimenter wishes to learn $\beta$ by observing the undisclosed values $y_i$ of a subset of users $S\subseteq\mathcal{N}$. Hence, the set of possible experiments comprises all random vectors $y_{S}=(y_i)_{i\in S}\in\reals^{|S|}$, for $S\subseteq \mathcal{N}$.
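+
+In matrix notation (introduced here only for convenience), writing $X_S\in\reals^{|S|\times d}$ for the matrix whose rows are the feature vectors $x_i$, $i\in S$, and $\varepsilon_S=(\varepsilon_i)_{i\in S}$, an experiment can be written compactly as
+\begin{align*}
+ y_S = X_S\beta + \varepsilon_S,\qquad \varepsilon_S\sim\mathcal{N}(0,\sigma_0^2 I_{|S|}),
+\end{align*}
+so that, in particular, $X_S^TX_S = \sum_{i\in S} x_ix_i^T$.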
+
+Learning $\beta$ has many interesting applications that make linear regression ubiquitous in data mining, statistics, and machine learning. On one hand, the learned function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. On the other hand, the sign and relative magnitude of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated with a disease.
+
+
+\subsection{Value of Information}
+In the Bayesian setting,
+it is commonly assumed that $\beta$ follows a
+multivariate normal distribution with mean zero and covariance matrix $\sigma_1^2
+I_d$. Under this prior and the linear model \eqref{model}, the value of information \eqref{voi} of an experiment $y_S$ is given by \cite{...}
+ \begin{align}\label{vs}
+ V(S)
+ & \defeq \mutual(\beta;y_S) = \frac{1}{2}\log\det\left(I_d
+ + \mu\sum_{i\in S} x_ix_i^T\right),
+ % & \defeq \frac{1}{2}\log\det A(S)
+ \end{align}
+where $\mu = \sigma_1^2/\sigma_0^2$.
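+
+For completeness, we sketch the standard computation behind \eqref{vs}. Under the prior $\beta\sim\mathcal{N}(0,\sigma_1^2 I_d)$ and the linear model \eqref{model}, the posterior distribution of $\beta$ given $y_S$ is again Gaussian, with covariance matrix
+\begin{align*}
+ \Sigma_S = \left(\frac{1}{\sigma_1^2}I_d + \frac{1}{\sigma_0^2}\sum_{i\in S}x_ix_i^T\right)^{-1} = \sigma_1^2\left(I_d + \mu\sum_{i\in S}x_ix_i^T\right)^{-1},
+\end{align*}
+which does not depend on the realization of $y_S$. Since the entropy of a $d$-dimensional Gaussian with covariance $\Sigma$ equals $\frac{1}{2}\log\left((2\pi e)^d\det\Sigma\right)$, we obtain
+\begin{align*}
+ \entropy(\beta)-\entropy(\beta\mid y_S) = \frac{1}{2}\log\frac{\det\left(\sigma_1^2 I_d\right)}{\det\Sigma_S} = \frac{1}{2}\log\det\left(I_d + \mu\sum_{i\in S}x_ix_i^T\right).
+\end{align*}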
+
+Some intuition behind the Gaussian prior on $\beta$, and the role played by the parameter $\mu$, can be obtained from how the experimenter estimates $\beta$ upon observing $y_S$. In particular, the experimenter can estimate $\beta$ through \emph{maximum a posteriori} estimation, \emph{i.e.}, by finding the parameter that maximizes the posterior distribution of $\beta$ given the observations $y_S$.
+It is easy to show that, under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, taking the negative logarithm of the posterior reduces maximum a posteriori estimation to the following minimization
+problem:
+\begin{displaymath}
+ \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_{i\in S} (y_i - \beta^T x_i)^2
+ + \frac{1}{\mu}\norm{\beta}_2^2.
+\end{displaymath}
+This optimization, commonly known as \emph{ridge regression}, reduces to a least squares fit for $\mu=\infty$. For finite $\mu$, ridge regression
+acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors attaining the same squared error, the one with the smaller norm is preferred. This is consistent with the Gaussian prior on $\beta$, which implies that vectors with small norms are more likely. %In practice, ridge regression is known to give better prediction results over new data than model parameters computed through a least squares fit.
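+
+For concreteness, we note that this minimization admits the standard closed-form solution (obtained by setting the gradient of the objective to zero)
+\begin{align*}
+ \hat{\beta} = \left(\sum_{i\in S} x_ix_i^T + \frac{1}{\mu} I_d\right)^{-1}\sum_{i\in S} y_i\,x_i,
+\end{align*}
+which coincides with the posterior mean of $\beta$ given $y_S$; up to the factor $\mu$, the matrix being inverted is the same one that appears in \eqref{vs}.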
+
+\subsection{A Budgeted Auction}
+
+TODO Explain the optimization problem, why it has to be formulated as an auction
+problem. Explain the goals:
+\begin{itemize}
+ \item truthful
+ \item individually rational
+ \item budget feasible
+ \item has a good approximation ratio
+\end{itemize}
+
+TODO Explain what is already known: it is ok when the function is submodular (see the sketch below). When
+should we introduce the notion of submodularity?
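+
+A fact that is likely to be relevant here is that the value function in \eqref{vs} is monotone and submodular; we sketch a short, standard argument. Recall that a set function $f:2^{\mathcal{N}}\to\reals$ is \emph{submodular} if
+\begin{align*}
+ f(S\cup\{i\}) - f(S) \;\geq\; f(T\cup\{i\}) - f(T)\qquad\text{for all } S\subseteq T\subseteq\mathcal{N} \text{ and } i\notin T,
+\end{align*}
+and \emph{monotone} if $f(S)\leq f(T)$ whenever $S\subseteq T$. Monotonicity of $V$ follows since adding the positive semidefinite term $\mu\, x_ix_i^T$ cannot decrease the determinant in \eqref{vs}. For submodularity, write $V(S) = \mutual(\beta;y_S) = \entropy(y_S) - \entropy(y_S\mid\beta)$: the first term is a submodular function of $S$ (conditioning reduces entropy), while the second is modular, since the $y_i$ are independent given $\beta$; the difference is therefore submodular.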
+
+
+\begin{comment}
+\subsection{Linear and Ridge Regression}
+
+Linear regression, a true workhorse of statistical analysis, fits a linear function to a set of observable input and output variables. In particular, consider a set of $n$ value pairs $(x_i,y_i)$, $i\in \mathcal{N} =\{1,\ldots,n\}$. The variables $x_i\in \reals^d$ are usually referred to as \emph{input} variables or \emph{covariates} and $y_i\in \reals$ are referred to as \emph{output} or \emph{response} variables. For example, the former could be the age, weight, or height of a person, while the latter can be their propensity to contract a disease. Assume that input and output variables are related through the following relationship:
+ \begin{displaymath}
+ y_i = \beta^* x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
+\end{displaymath}
+where $\beta\in\reals^d$ is called the \emph{model parameter vector} and $\varepsilon_i\in\reals$ are independent, normally distributed random variables with zero mean and variance $\sigma^2$.
+
+The purpose of linear regression is to recover the vector $\beta$ upon observing the pairs $(x_i,y_i)$, $i=1,\ldots,n$. Learning $\beta$ has many interesting applications that make linear regression ubiquitous in data mining, statistics, and machine learning. On one hand, the function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. On the other hand, the sign and relative magnitude of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated with a disease.
+
+
+
+
+In the absence of any additional information, a natural method for estimating $\beta$ is through a least squares fit. However, in a more general setup,
+ prior knowledge about $\beta$, captured by a distribution over
+$\reals^d$, can also be taken into account. In this setup, after observing the data, $\beta$ can be retrieved through \emph{maximum a posteriori estimation}: computing the parameters which maximize the
+posterior distribution of $\beta$ given the observations.
+
+A common prior assumption for linear regression is that $\beta$ follows a
+multivariate normal distribution of mean zero and covariance matrix $\kappa
+I_d$. Under this assumption, maximum a posteriori estimation leads to the following minimization
+problem, commonly known as \emph{ridge regression}:
+\begin{displaymath}
+ \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
+ + \frac{1}{\mu}\norm{\beta}_2^2
+\end{displaymath}
+where
+$\mu
+= \frac{\kappa}{\sigma^2}$ is referred to as the \emph{regularization parameter}. For $\mu=\infty$, the above minimization recovers a least squares fit. Compared to the latter, ridge regression
+acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors with the same square error, the one with the smallest norm is preferred. This is consistent with the Gaussian prior on $\beta$, which favors vectors with small norms, and is known in practice to give better prediction results over new data than model parameters computed through a least squares fit.
+
+\stratis{$1/\mu$ is awkward, especially since we do not cover linear regression. Can we use $\mu$ instead?}
+
+
+
\subsection{Notations}
+
Throughout the paper, we will make use of the following notations: if $x$ is
a (column) vector in $\reals^d$, $x^*$ denotes its transposed (line)
vector. Thus, the standard inner product between two vectors $x$ and $y$ is
@@ -70,7 +170,6 @@ should be selecting. His goal is to learn the model underlying the data.
\stratis{Describe the experiment setup. The value function can be part of this now, as it is the means through which the experimenter quantifies the value of a solution.}
-\begin{comment}
\subsection{Data model}
There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user
@@ -111,7 +210,6 @@ which is the well-known \emph{ridge regression}. $\mu
= \frac{\kappa}{\sigma^2}$ is the regularization parameter. Ridge regression
can thus be seen as linear regression with a regularization term which
prevents $\beta$ from having a large $L_2$-norm.
-\end{comment}
\subsection{Value of data}
@@ -243,4 +341,5 @@ TODO Explain what is already known: it is ok when the function is submodular. Wh
should we introduce the notion of submodularity?
\end{itemize}
+\end{comment}