-rw-r--r--  definitions.tex                           |   5
-rw-r--r--  paper.tex                                 |   4
-rw-r--r--  papers/chaloner.pdf                       | bin 0 -> 1696615 bytes
-rw-r--r--  papers/experimental_design/degroot1.pdf   | bin 0 -> 1469466 bytes
-rw-r--r--  papers/experimental_design/ginebra.pdf    | bin 0 -> 412755 bytes
-rw-r--r--  papers/experimental_design/kiefer.pdf     | bin 0 -> 6393019 bytes
-rw-r--r--  papers/experimental_design/lindley.pdf    | bin 0 -> 1940302 bytes
-rw-r--r--  papers/experimental_design/max-ent00.pdf  | bin 0 -> 106829 bytes
-rw-r--r--  problem.tex                               | 103
9 files changed, 107 insertions, 5 deletions
diff --git a/definitions.tex b/definitions.tex
index 139a6db..378781d 100644
--- a/definitions.tex
+++ b/definitions.tex
@@ -1,9 +1,12 @@
+\newcommand{\mutual}{\ensuremath{{I}}}
+\newcommand{\entropy}{\ensuremath{{H}}}
\newtheorem{lemma}{Lemma}
\newtheorem{fact}{Fact}
\newtheorem{example}{Example}
\newtheorem{prop}{Proposition}
\newtheorem{theorem}{Theorem}
-\newcommand*{\defeq}{\stackrel{\text{def}}{=}}
+%\newcommand*{\defeq}{\stackrel{\text{def}}{=}}
+\newcommand*{\defeq}{\equiv}
\newcommand{\var}{\mathop{\mathrm{Var}}}
\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
diff --git a/paper.tex b/paper.tex
index 4577bfa..46ad60a 100644
--- a/paper.tex
+++ b/paper.tex
@@ -4,7 +4,7 @@
\usepackage{algorithm}
\usepackage{algpseudocode,bbm,color,verbatim}
\input{definitions}
-\title{Budgeted Auctions for Experiment Design}
+\title{Budgeted Auctions for Experimental Design}
\begin{document}
\maketitle
@@ -16,7 +16,7 @@
\input{problem}
\section{Main result}
\input{main}
-\section{General setup}
+\section{Extension to Other Problems}
\input{general}
\section{Conclusion}
\input{conclusion}
diff --git a/papers/chaloner.pdf b/papers/chaloner.pdf
new file mode 100644
index 0000000..184eb39
--- /dev/null
+++ b/papers/chaloner.pdf
Binary files differ
diff --git a/papers/experimental_design/degroot1.pdf b/papers/experimental_design/degroot1.pdf
new file mode 100644
index 0000000..efe4131
--- /dev/null
+++ b/papers/experimental_design/degroot1.pdf
Binary files differ
diff --git a/papers/experimental_design/ginebra.pdf b/papers/experimental_design/ginebra.pdf
new file mode 100644
index 0000000..d1010d5
--- /dev/null
+++ b/papers/experimental_design/ginebra.pdf
Binary files differ
diff --git a/papers/experimental_design/kiefer.pdf b/papers/experimental_design/kiefer.pdf
new file mode 100644
index 0000000..c1e9423
--- /dev/null
+++ b/papers/experimental_design/kiefer.pdf
Binary files differ
diff --git a/papers/experimental_design/lindley.pdf b/papers/experimental_design/lindley.pdf
new file mode 100644
index 0000000..dd73e3d
--- /dev/null
+++ b/papers/experimental_design/lindley.pdf
Binary files differ
diff --git a/papers/experimental_design/max-ent00.pdf b/papers/experimental_design/max-ent00.pdf
new file mode 100644
index 0000000..410626e
--- /dev/null
+++ b/papers/experimental_design/max-ent00.pdf
Binary files differ
diff --git a/problem.tex b/problem.tex
index e25e5c3..3faef3b 100644
--- a/problem.tex
+++ b/problem.tex
@@ -1,6 +1,106 @@
+\subsection{Experimental Design}
+In the context of experimental design, an \emph{experiment} is a random variable $E$ sampled from a distribution $P_\beta$, where $\beta\in \Omega$ is an unknown parameter. An experimenter wishes to learn the parameter $\beta$, and can choose among a set of different possible experiments, all of whose distributions are parametrized by the same $\beta$.
+
+The problem of optimal experimental design amounts to determining an experiment that maximizes the information revealed about the parameter $\beta$.
+Though a variety of different measures of information exist in the literature (see, \emph{e.g.}, \cite{ginebra}), the so-called \emph{value of information} \cite{lindley} is commonly used in traditional Bayesian experimental design \cite{lindley}. In particular, in the Bayesian setup, it is assumed that $\beta$ is sampled from a known prior distribution. The value of an experiment $E$ is then defined as the expected reduction in the entropy of $\beta$ (\emph{i.e.}, the mutual information between $E$ and $\beta$), given by
+\begin{align}
+ \mutual(\beta; E) = \entropy(\beta) - \entropy(\beta \mid E).\label{voi}
+\end{align}
+%where $H(\beta)$ is the entropy of the prior distribution and $\entropy(\beta \mid E)$ is the conditional entropy con
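+As a simple illustration, suppose that $\beta\sim\mathcal{N}(0,\sigma_1^2)$ is scalar and that the experiment is the noisy observation $E = \beta + \varepsilon$, with $\varepsilon\sim\mathcal{N}(0,\sigma_0^2)$ independent of $\beta$ (a toy special case of the regression setting introduced below). Both terms in \eqref{voi} are then entropies of Gaussian random variables, and a direct computation gives
+\begin{align*}
+ \mutual(\beta; E) = \frac{1}{2}\log\left(2\pi e\,\sigma_1^2\right) - \frac{1}{2}\log\left(2\pi e\,\frac{\sigma_1^2\sigma_0^2}{\sigma_1^2+\sigma_0^2}\right) = \frac{1}{2}\log\left(1+\frac{\sigma_1^2}{\sigma_0^2}\right),
+\end{align*}
+so the value of the experiment increases with the prior uncertainty $\sigma_1^2$ and decreases with the observation noise $\sigma_0^2$.
+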
+\subsection{Linear Regression}
+In this paper, we focus on \emph{linear regression} experiments, which aim to learn a linear function from user data. In particular, we consider a set of $n$ users $\mathcal{N} = \{1,\ldots, n\}$. Each user
+$i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$, $\norm{x_i}_2\leq 1$, and an
+undisclosed piece of information $y_i\in\reals$.
+For example, the features could be the age, weight, or height of user $i$, while $y_i$ could be her propensity to contract a disease.
+
+The variables $x_i$ and $y_i$ are related through the following linear model:
+\begin{align}
+ y_i = \beta^T x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},\label{model}
+\end{align}
+where $\beta\in\reals^d$ is the unknown parameter and the $\varepsilon_i\in\reals$ are independent, normally distributed random variables with zero mean and variance $\sigma^2_0$.
+The experimenter wishes to learn $\beta$ by observing the undisclosed values $y_i$ of a subset of users $S\subseteq\mathcal{N}$. Hence, the set of possible experiments comprises all random vectors $y_{S}=(y_i)_{i\in S}\in\reals^{|S|}$, for $S\subseteq \mathcal{N}$.
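+
+In matrix notation (introduced here only for convenience), writing $X_S\in\reals^{|S|\times d}$ for the matrix whose rows are the feature vectors $x_i$, $i\in S$, and $\varepsilon_S=(\varepsilon_i)_{i\in S}$, an experiment can be written compactly as
+\begin{align*}
+ y_S = X_S\beta + \varepsilon_S,\qquad \varepsilon_S\sim\mathcal{N}(0,\sigma_0^2 I_{|S|}),
+\end{align*}
+so that, in particular, $X_S^TX_S = \sum_{i\in S} x_ix_i^T$.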
+
+Learning $\beta$ has many interesting applications that make linear regression ubiquitous in data mining, statistics, and machine learning. On one hand, the learned function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. On the other hand, the sign and relative magnitude of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated with a disease.
+
+
+\subsection{Value of Information}
+In the Bayesian setting,
+it is commonly assumed that $\beta$ follows a
+multivariate normal distribution with mean zero and covariance matrix $\sigma_1^2
+I_d$. Under this prior and the linear model \eqref{model}, the value of information \eqref{voi} of an experiment $y_S$ is given by \cite{...}
+ \begin{align}\label{vs}
+ V(S)
+ & \defeq \mutual(\beta;y_S) = \frac{1}{2}\log\det\left(I_d
+ + \mu\sum_{i\in S} x_ix_i^T\right),
+ % & \defeq \frac{1}{2}\log\det A(S)
+ \end{align}
+where $\mu = \sigma_1^2/\sigma_0^2$.
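+
+For completeness, we sketch the standard computation behind \eqref{vs}. Under the prior $\beta\sim\mathcal{N}(0,\sigma_1^2 I_d)$ and the linear model \eqref{model}, the posterior distribution of $\beta$ given $y_S$ is again Gaussian, with covariance matrix
+\begin{align*}
+ \Sigma_S = \left(\frac{1}{\sigma_1^2}I_d + \frac{1}{\sigma_0^2}\sum_{i\in S}x_ix_i^T\right)^{-1} = \sigma_1^2\left(I_d + \mu\sum_{i\in S}x_ix_i^T\right)^{-1},
+\end{align*}
+which does not depend on the realization of $y_S$. Since the entropy of a $d$-dimensional Gaussian with covariance $\Sigma$ equals $\frac{1}{2}\log\left((2\pi e)^d\det\Sigma\right)$, we obtain
+\begin{align*}
+ \entropy(\beta)-\entropy(\beta\mid y_S) = \frac{1}{2}\log\frac{\det\left(\sigma_1^2 I_d\right)}{\det\Sigma_S} = \frac{1}{2}\log\det\left(I_d + \mu\sum_{i\in S}x_ix_i^T\right).
+\end{align*}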
+
+Some intuition behind the Gaussian prior on $\beta$, and the role played by the parameter $\mu$, can be obtained from how the experimenter estimates $\beta$ upon observing $y_S$. In particular, the experimenter can estimate $\beta$ through \emph{maximum a posteriori} estimation, \emph{i.e.}, by finding the parameter that maximizes the posterior distribution of $\beta$ given the observations $y_S$.
+It is easy to show that, under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, taking the negative logarithm of the posterior reduces maximum a posteriori estimation to the following minimization
+problem:
+\begin{displaymath}
+ \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_{i\in S} (y_i - \beta^T x_i)^2
+ + \frac{1}{\mu}\norm{\beta}_2^2.
+\end{displaymath}
+This optimization, commonly known as \emph{ridge regression}, reduces to a least squares fit for $\mu=\infty$. For finite $\mu$, ridge regression
+acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors attaining the same squared error, the one with the smaller norm is preferred. This is consistent with the Gaussian prior on $\beta$, which implies that vectors with small norms are more likely. %In practice, ridge regression is known to give better prediction results over new data than model parameters computed through a least squares fit.
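+
+For concreteness, we note that this minimization admits the standard closed-form solution (obtained by setting the gradient of the objective to zero)
+\begin{align*}
+ \hat{\beta} = \left(\sum_{i\in S} x_ix_i^T + \frac{1}{\mu} I_d\right)^{-1}\sum_{i\in S} y_i\,x_i,
+\end{align*}
+which coincides with the posterior mean of $\beta$ given $y_S$; up to the factor $\mu$, the matrix being inverted is the same one that appears in \eqref{vs}.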
+
+\subsection{A Budgeted Auction}
+
+TODO Explain the optimization problem, why it has to be formulated as an auction
+problem. Explain the goals:
+\begin{itemize}
+ \item truthful
+ \item individually rational
+ \item budget feasible
+ \item has a good approximation ratio
+\end{itemize}
+
+TODO Explain what is already known: it is ok when the function is submodular (see the sketch below). When
+should we introduce the notion of submodularity?
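+
+A fact that is likely to be relevant here is that the value function in \eqref{vs} is monotone and submodular; we sketch a short, standard argument. Recall that a set function $f:2^{\mathcal{N}}\to\reals$ is \emph{submodular} if
+\begin{align*}
+ f(S\cup\{i\}) - f(S) \;\geq\; f(T\cup\{i\}) - f(T)\qquad\text{for all } S\subseteq T\subseteq\mathcal{N} \text{ and } i\notin T,
+\end{align*}
+and \emph{monotone} if $f(S)\leq f(T)$ whenever $S\subseteq T$. Monotonicity of $V$ follows since adding the positive semidefinite term $\mu\, x_ix_i^T$ cannot decrease the determinant in \eqref{vs}. For submodularity, write $V(S) = \mutual(\beta;y_S) = \entropy(y_S) - \entropy(y_S\mid\beta)$: the first term is a submodular function of $S$ (conditioning reduces entropy), while the second is modular, since the $y_i$ are independent given $\beta$; the difference is therefore submodular.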
+
+
+\begin{comment}
+\subsection{Linear and Ridge Regression}
+
+Linear regression, a true workhorse of statistical analysis, fits a linear function to a set of observable input and output variables. In particular, consider a set of $n$ value pairs $(x_i,y_i)$, $i\in \mathcal{N} =\{1,\ldots,n\}$. The variables $x_i\in \reals^d$ are usually referred to as \emph{input} variables or \emph{covariates} and $y_i\in \reals$ are referred to as \emph{output} or \emph{response} variables. For example, the former could be the age, weight, or height of a person, while the latter can be their propensity to contract a disease. Assume that input and output variables are related through the following relationship:
+ \begin{displaymath}
+ y_i = \beta^* x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
+\end{displaymath}
+where $\beta\in\reals^d$ is called the \emph{model parameter vector} and $\varepsilon_i\in\reals$ are independent, normally distributed random variables with zero mean and variance $\sigma^2$.
+
+The purpose of linear regression is to recover the vector $\beta$ upon observing the pairs $(x_i,y_i)$, $i=1,\ldots,n$. Learning $\beta$ has many interesting applications that make linear regression ubiquitous in data mining, statistics, and machine learning. On one hand, the function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. On the other hand, the sign and relative magnitude of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated with a disease.
+
+
+
+
+In the absence of any additional information, a natural method for estimating $\beta$ is through a least squares fit. However, in a more general setup,
+ prior knowledge about $\beta$, captured by a distribution over
+$\reals^d$, can also be taken into account. In this setup, after observing the data, $\beta$ can be retrieved through \emph{maximum a posteriori estimation}: computing the parameters which maximize the
+posterior distribution of $\beta$ given the observations.
+
+A common prior assumption for linear regression is that $\beta$ follows a
+multivariate normal distribution of mean zero and covariance matrix $\kappa
+I_d$. Under this assumption, maximum a posteriori estimation leads to the following minimization
+problem, commonly known as \emph{ridge regression}:
+\begin{displaymath}
+ \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
+ + \frac{1}{\mu}\norm{\beta}_2^2
+\end{displaymath}
+where
+$\mu
+= \frac{\kappa}{\sigma^2}$ is referred to as the \emph{regularization parameter}. For $\mu=\infty$, the above minimization recovers a least squares fit. Compared to the latter, ridge regression
+acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors with the same square error, the one with the smallest norm is preferred. This is consistent with the Gaussian prior on $\beta$, which favors vectors with small norms, and is known in practice to give better prediction results over new data than model parameters computed through a least squares fit.
+
+\stratis{$1/\mu$ is awkward, especially since we do not cover linear regression. Can we use $\mu$ instead?}
+
+
+
\subsection{Notations}
+
Throughout the paper, we will make use of the following notations: if $x$ is
a (column) vector in $\reals^d$, $x^*$ denotes its transposed (line)
vector. Thus, the standard inner product between two vectors $x$ and $y$ is
@@ -70,7 +170,6 @@ should be selecting. His goal is to learn the model underlying the data.
\stratis{Describe the experiment setup. The value function can be part of this now, as it is the means through which the experimenter quantifies the value of a solution.}
-\begin{comment}
\subsection{Data model}
There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user
@@ -111,7 +210,6 @@ which is the well-known \emph{ridge regression}. $\mu
= \frac{\kappa}{\sigma^2}$ is the regularization parameter. Ridge regression
can thus be seen as linear regression with a regularization term which
prevents $\beta$ from having a large $L_2$-norm.
-\end{comment}
\subsection{Value of data}
@@ -243,4 +341,5 @@ TODO Explain what is already known: it is ok when the function is submodular. Wh
should we introduce the notion of submodularity?
\end{itemize}
+\end{comment}