 -rw-r--r--  definitions.tex |  4
 -rw-r--r--  paper.tex       |  2
 -rw-r--r--  problem.tex     | 73
3 files changed, 67 insertions, 12 deletions
diff --git a/definitions.tex b/definitions.tex
index ca6b3ba..139a6db 100644
--- a/definitions.tex
+++ b/definitions.tex
@@ -13,4 +13,6 @@
 \newcommand{\mse}{\mathop{\mathrm{MSE}}}
 \DeclareMathOperator{\trace}{tr}
 \DeclareMathOperator*{\argmax}{arg\,max}
-
+\DeclareMathOperator*{\argmin}{arg\,min}
+\newcommand{\reals}{\ensuremath{\mathbbm{R}}}
+\newcommand{\stratis}[1]{\textcolor{red}{Stratis: #1}}
diff --git a/paper.tex b/paper.tex
--- a/paper.tex
+++ b/paper.tex
@@ -2,7 +2,7 @@
 \usepackage[utf8]{inputenc}
 \usepackage{amsmath,amsfonts}
 \usepackage{algorithm}
-\usepackage{algpseudocode}
+\usepackage{algpseudocode,bbm,color,verbatim}
 \input{definitions}
 \title{Budgeted Auctions for Experiment Design}
 \begin{document}
diff --git a/problem.tex b/problem.tex
index fb9f8e1..e25e5c3 100644
--- a/problem.tex
+++ b/problem.tex
@@ -1,7 +1,8 @@
+
 \subsection{Notations}
 Throughout the paper, we will make use of the following notations: if $x$ is
-a (column) vector in $\mathbf{R}^d$, $x^*$ denotes its transposed (line)
+a (column) vector in $\reals^d$, $x^*$ denotes its transposed (line)
 vector. Thus, the standard inner product between two vectors $x$ and $y$ is
 simply $x^* y$. $\norm{x}_2 = x^*x$ will denote the $L_2$ norm of $x$.
@@ -9,7 +10,7 @@ We will also often use the following order over symmetric matrices: if $A$
 and $B$ are two $d\times d$ real symmetric matrices, we write that $A\leq B$ iff:
 \begin{displaymath}
-  \forall x\in\mathbf{R}^d,\quad
+  \forall x\in\reals^d,\quad
   x^*Ax \leq x^*Bx
 \end{displaymath}
 That is, iff $B-A$ is symmetric semi-definite positive.
@@ -19,11 +20,62 @@ matrix function similarly to their real counterparts. In particular, let us
 recall that the matrix inversion is decreasing and convex over symmetric
 definite positive matrices.
+
+\stratis{This section should either be removed or compactified. Short
+notations like $*$ for transpose can be introduced where they first appear.
+Better yet, can we revert to $^T$ or something similar for transpose? I know
+$*$ is used in real analysis etc., but $*$ is used for optimal values too
+when dealing with optimization. Also, do we need the extension of decreasing
+and convex to matrix function? Does it have ``global scope''? If it is used
+only once, we might as well place this notation there.}
+
+\subsection{Linear and Ridge Regression}
+
+Linear regression, a true workhorse of statistical analysis, fits a linear
+function to a set of observable input and output variables. In particular,
+consider a set of $n$ value pairs $(x_i,y_i)$, $i\in \mathcal{N}
+=\{1,\ldots,n\}$. The variables $x_i\in \reals^d$ are usually referred to as
+\emph{input} variables or \emph{covariates}, and $y_i\in \reals$ are referred
+to as \emph{output} or \emph{response} variables. For example, the former
+could be the age, weight, or height of a person, while the latter could be
+their propensity to contract a disease. Assume that input and output
+variables are related as follows:
+\begin{displaymath}
+  y_i = \beta^* x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
+\end{displaymath}
+where $\beta\in\reals^d$ is called the \emph{model parameter vector} and the
+$\varepsilon_i\in\reals$ are independent, normally distributed random
+variables with zero mean and variance $\sigma^2$.
+
+The purpose of linear regression is to recover the vector $\beta$ upon
+observing the pairs $(x_i,y_i)$, $i=1,\ldots,n$. Learning $\beta$ has many
+interesting applications that make linear regression ubiquitous in data
+mining, statistics, and machine learning.
+On one hand, the function itself can be used for \emph{prediction},
+\emph{i.e.}, to compute the output value $y$ of a new input $x\in \reals^d$.
+Moreover, the sign and relative magnitude of the coefficients of $\beta$ can
+aid in identifying how different inputs affect the output---establishing,
+\emph{e.g.}, that weight, rather than age, is more strongly correlated with a
+disease.
+
+In the absence of any additional information, a natural method for estimating
+$\beta$ is through a least squares fit. In a more general setup, however,
+prior knowledge about $\beta$, captured by a distribution over $\reals^d$,
+can also be taken into account. In that case, after observing the data,
+$\beta$ can be retrieved through \emph{maximum a posteriori estimation}:
+computing the parameters which maximize the posterior distribution of $\beta$
+given the observations.
+
+A common prior assumption for linear regression is that $\beta$ follows a
+multivariate normal distribution of mean zero and covariance matrix $\kappa
+I_d$. Under this assumption, maximum a posteriori estimation leads to the
+following minimization problem, commonly known as \emph{ridge regression}:
+\begin{displaymath}
+  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
+  + \frac{1}{\mu}\norm{\beta}_2^2
+\end{displaymath}
+where $\mu = \frac{\kappa}{\sigma^2}$ is referred to as the
+\emph{regularization parameter}. For $\mu=\infty$, the above minimization
+recovers a least squares fit. Compared to the latter, ridge regression acts
+as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for
+$\beta$: between two vectors with the same squared error, the one with the
+smaller norm is preferred. This is consistent with the Gaussian prior on
+$\beta$, which favors vectors with small norms, and, in practice, ridge
+regression is known to give better predictions on new data than model
+parameters computed through a least squares fit.
+
+\stratis{$1/\mu$ is awkward, especially since we do not cover linear
+regression. Can we use $\mu$ instead?}
+
+\subsection{Experiment Design}
+There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user
+$i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$ and an
+undisclosed piece of information $y_i\in\reals$. We assume that the data
+has already been normalized so that $\norm{x_i}_2\leq 1$ for all
+$i\in\mathcal{N}$.
+
+The experimenter is going to select a set of users and ask them to reveal
+their private piece of information. We are interested in a \emph{survey
+setup}: the experimenter has not seen the data yet, but he must decide which
+users to select. His goal is to learn the model underlying the data.
+
+\stratis{Describe the experiment setup. The value function can be part of
+this now, as it is the means through which the experimenter quantifies the
+value of a solution.}
+
+\begin{comment}
 \subsection{Data model}
 There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user
-$i\in\mathcal{N}$ has a public vector of features $x_i\in\mathbf{R}^d$ and an
-undisclosed piece of information $y_i\in\mathbf{R}$. We assume that the data
+$i\in\mathcal{N}$ has a public vector of features $x_i\in\reals^d$ and an
+undisclosed piece of information $y_i\in\reals$. We assume that the data
 has already been normalized so that $\norm{x_i}_2\leq 1$ for all
 $i\in\mathcal{N}$.
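Note (not part of the patch): the ridge regression objective added above is stated without its solution. As a point of reference, setting the gradient of the objective to zero gives the standard closed-form estimator, written here using only the symbols already defined in the added text:

\begin{displaymath}
  \hat{\beta} = \Big(\sum_{i\in\mathcal{N}} x_i x_i^* + \tfrac{1}{\mu} I_d\Big)^{-1}
  \sum_{i\in\mathcal{N}} y_i x_i ,
\end{displaymath}

which is well defined for every $\mu > 0$, since the matrix being inverted is positive definite; regularization thus guarantees invertibility even when the $x_i$ do not span $\reals^d$.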
@@ -35,7 +87,7 @@ we assume a linear model:
 \begin{displaymath}
   \forall i\in\mathcal{N},\quad y_i = \beta^* x_i + \varepsilon_i
 \end{displaymath}
-where $\beta\in\mathbf{R}^d$ and $\varepsilon_i\in\mathbf{R}$ follows a normal
+where $\beta\in\reals^d$ and $\varepsilon_i\in\reals$ follows a normal
 distribution of mean $0$ and variance $\sigma^2$. Furthermore, we assume the
 error $\varepsilon$ to be independent of the user:
 $(\varepsilon_i)_{i\in\mathcal{N}}$ are mutually independent.
@@ -43,7 +95,7 @@ $(\varepsilon_i)_{i\in\mathcal{N}}$ are mutually independent.
 
 After observing the data, the experimenter could simply do linear regression
 to learn the model parameter $\beta$. However, in a more general setup, the
 experimenter has a prior knowledge about $\beta$, a distribution over
-$\mathbf{R}^d$. After observing the data, the experimenter performs
+$\reals^d$. After observing the data, the experimenter performs
 \emph{maximum a posteriori estimation}: computing the point which maximizes
 the posterior distribution of $\beta$ given the observations.
@@ -52,13 +104,14 @@ a multivariate normal distribution of mean zero and covariance matrix $\kappa
 I_d$. Maximum a posteriori estimation leads to the following maximization
 problem:
 \begin{displaymath}
-  \beta_{\text{max}} = \argmax_{\beta\in\mathbf{R}^d} \sum_i (y_i - \beta^*x_i)^2
+  \beta_{\text{max}} = \argmax_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
   + \frac{1}{\mu}\sum_i \norm{\beta}_2^2
 \end{displaymath}
 which is the well-known \emph{ridge regression}. $\mu =
 \frac{\kappa}{\sigma^2}$ is the regularization parameter. Ridge regression
 can thus be seen as linear regression with a regularization term which
 prevents $\beta$ from having a large $L_2$-norm.
+\end{comment}
 
 \subsection{Value of data}
 
@@ -67,11 +120,11 @@ experimenter has to decide which users to include in his experiment, we treat
 $\beta$ as a random variable whose distribution is updated after observing
 the data.
 
-Let us recall that if $\beta$ is random variable over $\mathbf{R}^d$ whose
+Let us recall that if $\beta$ is random variable over $\reals^d$ whose
 probability distribution has a density function $f$ with respect to the
 Lebesgue measure, its entropy is given by:
 \begin{displaymath}
-  \mathbb{H}(\beta) \defeq - \int_{b\in\mathbf{R}^d} \log f(b) f(b)\text{d}b
+  \mathbb{H}(\beta) \defeq - \int_{b\in\reals^d} \log f(b) f(b)\text{d}b
 \end{displaymath}
 
 A usual way to measure the decrease of uncertainty induced by the observation
@@ -104,7 +157,7 @@ S}$. Observe that $A_S$ can simply be written as:
 \end{displaymath}
 
 Let us recall that the entropy of a multivariate normal variable $B$ over
-$\mathbf{R}^d$ of covariance $\Sigma I_d$ is given by:
+$\reals^d$ of covariance $\Sigma I_d$ is given by:
 \begin{equation}\label{eq:multivariate-entropy}
   \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma I_d\big)
 \end{equation}
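Note (not part of the patch): a short worked consequence of Equation~\eqref{eq:multivariate-entropy} may help connect it to the entropy definition above. Here $\Sigma_0$ and $\Sigma_1$ are placeholder names, introduced only for this note, for the prior and posterior covariance matrices of $\beta$; for a Gaussian prior and Gaussian noise, the posterior covariance does not depend on the observed values, so the decrease of uncertainty brought by the observations is

\begin{displaymath}
  \mathbb{H}(\beta) - \mathbb{H}(\beta \mid \text{observations})
  = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma_0\big)
  - \frac{1}{2}\log\big((2\pi e)^d \det \Sigma_1\big)
  = \frac{1}{2}\log\frac{\det \Sigma_0}{\det \Sigma_1} .
\end{displaymath}

The $(2\pi e)^d$ factors cancel, so the decrease of uncertainty reduces to a ratio of determinants of the prior and posterior covariances.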
