From 2d0ffdfbfa3a25c6ba123c9627ae3bc6cc529b3c Mon Sep 17 00:00:00 2001
From: Stratis Ioannidis
Date: Mon, 29 Oct 2012 23:49:11 -0700
Subject: chaloner

---
 problem.tex | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 101 insertions(+), 2 deletions(-)

diff --git a/problem.tex b/problem.tex
index e25e5c3..3faef3b 100644
--- a/problem.tex
+++ b/problem.tex
@@ -1,6 +1,106 @@
+\subsection{Experimental Design}
+In the context of experimental design, an \emph{experiment} is a random variable $E$ sampled from a distribution $P_\beta$, where $\beta\in \Omega$ is an unknown parameter. An experimenter wishes to learn the parameter $\beta$, and can choose among a set of different possible experiments, all of which have distributions parametrized by the same $\beta$.
+
+The problem of optimal experimental design amounts to determining an experiment that maximizes the information revealed about the parameter $\beta$.
+Though a variety of information measures exist in the literature (see, \emph{e.g.}, \cite{ginebra}), the so-called \emph{value of information} \cite{lindley} is the one most commonly used in traditional Bayesian experimental design. In particular, in the Bayesian setup, it is assumed that $\beta$ is sampled from a known prior distribution. The value of an experiment $E$ is then defined as the expected reduction in the entropy of $\beta$ (\emph{i.e.}, the mutual information between $E$ and $\beta$), given by
+\begin{align}
+  \mutual(\beta; E) = \entropy(\beta) - \entropy(\beta \mid E).\label{voi}
+\end{align}
+%where $H(\beta)$ is the entropy of the prior distribution and $\entropy(\beta \mid E)$ is the conditional entropy con
+
+\subsection{Linear Regression}
+In this paper, we focus on \emph{linear regression} experiments, which aim to learn a linear function of user data. In particular, we consider a set of $n$ users $\mathcal{N} = \{1,\ldots, n\}$. Each user $i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$, $\norm{x_i}_2\leq 1$, and an undisclosed piece of information $y_i\in\reals$. For example, the features could be the age, weight, or height of user $i$, while the latter could be her propensity to contract a disease.
+
+The variables $x_i$ and $y_i$ are related through the following linear relationship:
+\begin{align}
+  y_i = \beta^T x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},\label{model}
+\end{align}
+where $\beta\in\reals^d$ is the unknown parameter and the $\varepsilon_i\in\reals$ are independent, normally distributed random variables with zero mean and variance $\sigma^2_0$.
+The experimenter wishes to learn $\beta$ by observing the undisclosed values $y_i$ for users $i$ in some subset $S$. Hence, the set of possible experiments comprises all random variables $y_{S}\in\reals^{|S|}$, $S\subseteq \mathcal{N}$.
+
+Learning $\beta$ has many interesting applications that make linear regression ubiquitous in data mining, statistics, and machine learning. On the one hand, the function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. On the other hand, the sign and relative magnitude of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated with a disease.
+
+\subsection{Value of Information}
+In the Bayesian setting, it is commonly assumed that $\beta$ follows a multivariate normal distribution with mean zero and covariance matrix $\sigma_1^2 I_d$. Under this prior and the linear model \eqref{model}, the value of information \eqref{voi} of an experiment $y_S$ is given by (see \cite{...})
+\begin{align}\label{vs}
+  V(S)
+  & \defeq \mutual(\beta; y_S) = \frac{1}{2}\log\det\left(I_d +
+  \mu\sum_{i\in S} x_ix_i^T\right),
+  % & \defeq \frac{1}{2}\log\det A(S)
+\end{align}
+where $\mu = \sigma_1^2/\sigma_0^2$.
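+
+For completeness, here is a brief sketch of where \eqref{vs} comes from; this is a standard Gaussian computation, and we write $X_S\in\reals^{|S|\times d}$ for the matrix whose rows are the vectors $x_i^T$, $i\in S$ (a notation used only in this sketch). Under the above prior and the model \eqref{model}, the posterior covariance of $\beta$ given $y_S$ is $(\sigma_1^{-2}I_d + \sigma_0^{-2}X_S^T X_S)^{-1}$, so
+\begin{align*}
+  \mutual(\beta; y_S)
+  &= \entropy(\beta) - \entropy(\beta \mid y_S)
+   = \frac{1}{2}\log\frac{\det(\sigma_1^2 I_d)}{\det\big((\sigma_1^{-2}I_d + \sigma_0^{-2}X_S^T X_S)^{-1}\big)}\\
+  &= \frac{1}{2}\log\det\Big(I_d + \frac{\sigma_1^2}{\sigma_0^2} X_S^T X_S\Big)
+   = \frac{1}{2}\log\det\Big(I_d + \mu\sum_{i\in S} x_i x_i^T\Big),
+\end{align*}
+since the additive constants $\tfrac{d}{2}\log(2\pi e)$ in the two Gaussian entropies cancel and $X_S^T X_S = \sum_{i\in S} x_i x_i^T$.
+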
+Some intuition behind the Gaussian prior on $\beta$, and the role that the parameter $\mu$ plays, can be obtained from how the experimenter estimates $\beta$ upon observing $y_S$. In particular, the experimenter can estimate $\beta$ through \emph{maximum a posteriori estimation}, \emph{i.e.}, by finding the parameter that maximizes the posterior distribution of $\beta$ given the observations $y_S$.
+It is easy to show that, under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following minimization problem:
+\begin{displaymath}
+  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_{i\in S} (y_i - \beta^T x_i)^2 + \frac{1}{\mu}\norm{\beta}_2^2.
+\end{displaymath}
+This optimization, commonly known as \emph{ridge regression}, reduces to a least squares fit for $\mu=\infty$. For finite $\mu$, ridge regression acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors with the same squared error, the one with the smaller norm is preferred. This is consistent with the Gaussian prior on $\beta$, which implies that vectors with small norms are more likely. %In practice, ridge regression is known to give better prediction results over new data than model parameters computed through a least squares fit.
+
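+As a quick sanity check (we only sketch this here; it follows by setting the gradient of the objective above to zero), the ridge estimate admits the closed form
+\begin{displaymath}
+  \hat{\beta} = \Big(\sum_{i\in S} x_i x_i^T + \frac{1}{\mu} I_d\Big)^{-1} \sum_{i\in S} y_i x_i.
+\end{displaymath}
+The matrix being inverted equals $\frac{1}{\mu}\big(I_d + \mu\sum_{i\in S} x_i x_i^T\big)$; it is positive definite (hence invertible) for any finite $\mu>0$, and, up to the factor $1/\mu$, it is exactly the matrix whose determinant appears in \eqref{vs}.
+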
+\subsection{A Budgeted Auction}
+
+TODO Explain the optimization problem and why it has to be formulated as an auction problem. Explain the goals:
+\begin{itemize}
+  \item truthful
+  \item individually rational
+  \item budget feasible
+  \item has a good approximation ratio
+\end{itemize}
+
+TODO Explain what is already known: it is ok when the function is submodular. When should we introduce the notion of submodularity?
+
+
+\begin{comment}
+\subsection{Linear and Ridge Regression}
+
+Linear regression, a true workhorse of statistical analysis, fits a linear function to a set of observable input and output variables. In particular, consider a set of $n$ value pairs $(x_i,y_i)$, $i\in \mathcal{N} =\{1,\ldots,n\}$. The variables $x_i\in \reals^d$ are usually referred to as \emph{input} variables or \emph{covariates}, and $y_i\in \reals$ are referred to as \emph{output} or \emph{response} variables. For example, the former could be the age, weight, or height of a person, while the latter can be their propensity to contract a disease. Assume that the input and output variables are related through the following relationship:
+\begin{displaymath}
+  y_i = \beta^* x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
+\end{displaymath}
+where $\beta\in\reals^d$ is called the \emph{model parameter vector} and the $\varepsilon_i\in\reals$ are independent, normally distributed random variables with zero mean and variance $\sigma^2$.
+
+The purpose of linear regression is to recover the vector $\beta$ upon the observation of the pairs $(x_i,y_i)$, $i=1,\ldots,n$. Learning $\beta$ has many interesting applications that make linear regression ubiquitous in data mining, statistics, and machine learning. On the one hand, the function itself can be used for \emph{prediction}, \emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$. On the other hand, the sign and relative magnitude of the coefficients of $\beta$ can aid in identifying how different inputs affect the output---establishing, \emph{e.g.}, that weight, rather than age, is more strongly correlated with a disease.
+
+In the absence of any additional information, a natural method for estimating $\beta$ is through a least squares fit. In a more general setup, however, prior knowledge about $\beta$, captured by a distribution over $\reals^d$, can also be taken into account. In this setup, after observing the data, $\beta$ can be retrieved through \emph{maximum a posteriori estimation}: computing the parameters which maximize the posterior distribution of $\beta$ given the observations.
+
+A common prior assumption for linear regression is that $\beta$ follows a multivariate normal distribution of mean zero and covariance matrix $\kappa I_d$. Under this assumption, maximum a posteriori estimation leads to the following minimization problem, commonly known as \emph{ridge regression}:
+\begin{displaymath}
+  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2 + \frac{1}{\mu}\norm{\beta}_2^2,
+\end{displaymath}
+where $\mu = \frac{\kappa}{\sigma^2}$ is referred to as the \emph{regularization parameter}. For $\mu=\infty$, the above minimization recovers a least squares fit. Compared to the latter, ridge regression acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\beta$: among two vectors with the same squared error, the one with the smaller norm is preferred. This is consistent with the Gaussian prior on $\beta$, which favors vectors with small norms, and is known in practice to give better prediction results over new data than model parameters computed through a least squares fit.
+
+\stratis{$1/\mu$ is awkward, especially since we do not cover linear regression. Can we use $\mu$ instead?}
+
+
 \subsection{Notations}
 
 Throughout the paper, we will make use of the following notations: if $x$ is a (column) vector in $\reals^d$, $x^*$ denotes its transposed (line) vector. Thus, the standard inner product between two vectors $x$ and $y$ is
@@ -70,7 +170,6 @@ should be selecting. His goal is to learn the model underlying the data.
 \stratis{Describe the experiment setup. The value function can be part of this now, as it is the means through which the experimenter quantifies the value of a solution.}
 
-\begin{comment}
 \subsection{Data model}
 
 There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user
@@ -111,7 +210,6 @@ which is the well-known \emph{ridge regression}.
 $\mu = \frac{\kappa}{\sigma^2}$ is the regularization parameter. Ridge regression can thus be seen as linear regression with a regularization term which prevents $\beta$ from having a large $L_2$-norm.
-\end{comment}
 
 \subsection{Value of data}
@@ -243,4 +341,5 @@ TODO Explain what is already known: it is ok when the function is submodular. Wh
 should we introduce the notion of submodularity?
 \end{itemize}
 
+\end{comment}
--
cgit v1.2.3-70-g09d2