| Field | Value | Date |
|---|---|---|
| author | Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)> | 2012-10-30 08:28:13 -0700 |
| committer | Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)> | 2012-10-30 08:28:13 -0700 |
| commit | 2d54b5d27ffd2782aec0bab8200b5b9e55585805 (patch) | |
| tree | 7921aeb0ed2a13a0e9ecb29c2ee26dbe0acaf6ae /problem.tex | |
| parent | dacff6f8d498ef281066742305db90d1121d7f3b (diff) | |
| parent | fe8aa3a0134a9493a2443a0965392254a7125094 (diff) | |
| download | recommendation-2d54b5d27ffd2782aec0bab8200b5b9e55585805.tar.gz | |
conf
Diffstat (limited to 'problem.tex')
| Mode | File | Lines changed |
|---|---|---|
| -rw-r--r-- | problem.tex | 55 |
1 file changed, 27 insertions(+), 28 deletions(-)
```diff
diff --git a/problem.tex b/problem.tex
index c3fd38f..db7108b 100644
--- a/problem.tex
+++ b/problem.tex
@@ -15,7 +15,7 @@ For example, the features could be the age, weight, or height of user $i$, while
 The variables $x_i$ and $y_i$ are related through the following relationship:
 \begin{align}
-  y_i = \beta^T x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},\label{model}
+  y_i = \T{\beta} x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},\label{model}
 \end{align}
 where $\beta\in\reals^d$ is the unknown parameter and $\varepsilon_i\in\reals$ are independent, normally distributed random variables, with zero mean and variance $\sigma^2_0$.
 The experimenter wishes to learn $\beta$ by observing the undisclosed $y_i$ of a subset of users $i\in S$. Hence, the set of possible experiments comprises all random variables $y_{S}\in\reals^{|S|}$, $S\subseteq \mathcal{N}$.
@@ -31,7 +31,7 @@ I_d$. Under this prior and the linear model \eqref{model}, the value of informat
 \begin{align}\label{vs}
 V(S) & \defeq I(\beta;y_S) = \frac{1}{2}\log\det\left(I_d
-  + \mu\sum_{i\in S} x_ix_i^T\right),
+  + \mu\sum_{i\in S} x_i\T{x_i}\right),
 % & \defeq \frac{1}{2}\log\det A(S)
 \end{align}
 where $\mu = \sigma_1^2/\sigma_0^2$.
@@ -40,7 +40,7 @@ Some intuition behind the Gaussian prior on $\beta$ and the role that parameter
 It is easy to show that, under the linearity assumption \eqref{model} and the gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization problem:
 \begin{displaymath}
-  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
+  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
   + \frac{1}{\mu}\sum_i \norm{\beta}_2^2
 \end{displaymath}
 This optimization, commonly known as \emph{ridge regression}, reduces to a least squares fit for $\mu=\infty$. For finite $\mu$, ridge regression
@@ -55,18 +55,17 @@ problem. Explain the goals:
 \item individually rational
 \item budget feasible
 \item has a good approximation ratio
-
-TODO Explain what is already known: it is ok when the function is submodular. When
-should we introduce the notion of submodularity?
 \end{itemize}
+\thibaut{TODO Explain what is already known: it is ok when the function is submodular. When
+should we introduce the notion of submodularity?}
 
 \begin{comment}
 \subsection{Linear and Ridge Regression}
 Linear regression, a true workhorse of statistical analysis, fits a linear function to a set of observable input and output variables. In particular, consider a set of $n$ value pairs $(x_i,y_i)$, $i\in \mathcal{N} =\{1,\ldots,n\}$. The variables $x_i\in \reals^d$ are usually referred to as \emph{input} variables or \emph{covariates} and $y_i\in \reals$ are referred to as \emph{output} or \emph{response} variables. For example, the former could be the age, weight, or height of a person, while the latter can be their propensity to contract a disease. Assume that input and output variables are related through the following relationship:
 \begin{displaymath}
-  y_i = \beta^* x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
+  y_i = \T{\beta} x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
 \end{displaymath}
 where $\beta\in\reals^d$, is called the \emph{model parameter vector} and $\varepsilon_i\in\reals$ are independent, normally distributed random variables, with zero mean and variance $\sigma^2$.
@@ -85,7 +84,7 @@ multivariate normal distribution of mean zero and covariance matrix $\kappa
 I_d$.
 Under this assumption, maximum a posteriori estimation leads to the following maximization problem, commonly known as \emph{ridge regression}:
 \begin{displaymath}
-  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
+  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - T{\beta}x_i)^2
   + \frac{1}{\mu}\sum_i \norm{\beta}_2^2
 \end{displaymath}
 where
@@ -102,16 +101,16 @@ acts as a sort of ``Occam's razor'', favoring a \emph{parsimonious} model for $\
 Throughout the paper, we will make use of the following notations: if $x$ is
-a (column) vector in $\reals^d$, $x^*$ denotes its transposed (line)
+a (column) vector in $\reals^d$, $\T{x}$ denotes its transposed (line)
 vector. Thus, the standard inner product between two vectors $x$ and $y$ is
-simply $x^* y$. $\norm{x}_2 = x^*x$ will denote the $L_2$ norm of $x$.
+simply $\T{x}y$. $\norm{x}_2 = \T{x}x$ will denote the $L_2$ norm of $x$.
 
 We will also often use the following order over symmetric matrices: if $A$
 and $B$ are two $d\times d$
 and $B$ are two $d\times d$ real symmetric matrices, we write that $A\leq B$ iff:
 \begin{displaymath}
 \forall x\in\reals^d,\quad
-  x^*Ax \leq x^*Bx
+  \T{x}Ax \leq \T{x}Bx
 \end{displaymath}
 That is, iff $B-A$ is symmetric semi-definite positive.
@@ -127,7 +126,7 @@ definite positive matrices.
 Linear regression, a true workhorse of statistical analysis, fits a linear function to a set of observable input and output variables. In particular, consider a set of $n$ value pairs $(x_i,y_i)$, $i\in \mathcal{N} =\{1,\ldots,n\}$. The variables $x_i\in \reals^d$ are usually referred to as \emph{input} variables or \emph{covariates} and $y_i\in \reals$ are referred to as \emph{output} or \emph{response} variables. For example, the former could be the age, weight, or height of a person, while the latter can be their propensity to contract a disease. Assume that input and output variables are related through the following relationship:
 \begin{displaymath}
-  y_i = \beta^* x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
+  y_i = \T{\beta} x_i + \varepsilon_i,\quad\forall i\in\mathcal{N},
 \end{displaymath}
 where $\beta\in\reals^d$, is called the \emph{model parameter vector} and $\varepsilon_i\in\reals$ are independent, normally distributed random variables, with zero mean and variance $\sigma^2$.
@@ -136,9 +135,9 @@ The purpose of linear regression is to recover the vector $\beta$ upon the obser
 
-In the absense of any additional information, a natural method for estimating $\beta$ is through a least squares fit. However, in a more general setup,
+In the absence of any additional information, a natural method for estimating $\beta$ is through a least squares fit. However, in a more general setup,
 a prior knowledge about $\beta$, captured by a distribution over
-$\reals^d$, can also be taken into account. In this setup, after observing the data, $\beta$ can be retreived through \emph{maximum a posteriori estimation}: computing the parameters which maximize the
+$\reals^d$, can also be taken into account. In this setup, after observing the data, $\beta$ can be retrieved through \emph{maximum a posteriori estimation}: computing the parameters which maximize the
 posterior distribution of $\beta$ given the observations.
 
 A common prior assumption for linear regression is that $\beta$ follows a
@@ -146,7 +145,7 @@ multivariate normal distribution of mean zero and covariance matrix $\kappa
 I_d$.
 Under this assumption, maximum a posteriori estimation leads to the following maximization problem, commonly known as \emph{ridge regression}:
 \begin{displaymath}
-  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
+  \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
   + \frac{1}{\mu}\sum_i \norm{\beta}_2^2
 \end{displaymath}
 where
@@ -184,7 +183,7 @@ experimenter has not seen the data yet, but he wants to know which users he
 should be selecting. His goal is to learn the model underlying the data.
 Here, we assume a linear model:
 \begin{displaymath}
-  \forall i\in\mathcal{N},\quad y_i = \beta^* x_i + \varepsilon_i
+  \forall i\in\mathcal{N},\quad y_i = \T{\beta}x_i + \varepsilon_i
 \end{displaymath}
 where $\beta\in\reals^d$ and $\varepsilon_i\in\reals$ follows a normal
 distribution of mean $0$ and variance $\sigma^2$. Furthermore, we assume the
@@ -203,7 +202,7 @@ a multivariate normal distribution of mean zero and covariance matrix $\kappa
 I_d$. Maximum a posteriori estimation leads to the following maximization
 problem:
 \begin{displaymath}
-  \beta_{\text{max}} = \argmax_{\beta\in\reals^d} \sum_i (y_i - \beta^*x_i)^2
+  \beta_{\text{max}} = \argmax_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
   + \frac{1}{\mu}\sum_i \norm{\beta}_2^2
 \end{displaymath}
 which is the well-known \emph{ridge regression}. $\mu
@@ -241,17 +240,17 @@ where $Y_S = \{y_i,\,i\in S\}$ is the set of observed data.
 \begin{align*}
 \forall S\subset\mathcal{N},\; V(S) & = \frac{1}{2}\log\det\left(I_d
-  + \mu\sum_{i\in S} x_ix_i^*\right)\\
+  + \mu\sum_{i\in S} x_i\T{x_i}\right)\\
 & \defeq \frac{1}{2}\log\det A(S)
 \end{align*}
 \end{theorem}
 
 \begin{proof}
-Let us denote by $X_S$ the matrix whose rows are the vectors $(x_i^*)_{i\in
+Let us denote by $X_S$ the matrix whose rows are the vectors $(\T{x_i})_{i\in
 S}$. Observe that $A_S$ can simply be written as:
 \begin{displaymath}
-  A_S = I_d + \mu X_S^* X_S
+  A_S = I_d + \mu \T{X_S} X_S
 \end{displaymath}
 
 Let us recall that the entropy of a multivariate normal variable $B$ over
@@ -275,19 +274,19 @@ distribution of mean $X\beta$ and of covariance matrix $\sigma^2 I_n$. Hence:
 $(Y_S)$ also follows a multivariate normal distribution of mean zero. Let us
 compute its covariance matrix, $\Sigma_Y$:
 \begin{align*}
-  \Sigma_Y & = \expt{YY^*} = \expt{(X_S\beta + E)(X_S\beta + E)^*}\\
-  & = \kappa X_S X_S^* + \sigma^2I_n
+  \Sigma_Y & = \expt{Y\T{Y}} = \expt{(X_S\beta + E)\T{(X_S\beta + E)}}\\
+  & = \kappa X_S \T{X_S} + \sigma^2I_n
 \end{align*}
 Thus, we get that:
 \begin{equation}\label{eq:h2}
 \mathbb{H}(Y_S)
-  = \frac{1}{2}\log\left((2\pi e)^n \det(\kappa X_S X_S^* + \sigma^2 I_n)\right)
+  = \frac{1}{2}\log\left((2\pi e)^n \det(\kappa X_S \T{X_S} + \sigma^2 I_n)\right)
 \end{equation}
 Combining \eqref{eq:h1} and \eqref{eq:h2} we get:
 \begin{displaymath}
 V(S) = \frac{1}{2}\log\det\left(I_n+\frac{\kappa}{\sigma^2}X_S
-  X_S^*\right)
+  \T{X_S}\right)
 \end{displaymath}
 Finally, we can use Sylvester's determinant theorem to get the result.
@@ -300,7 +299,7 @@ We have the following lemma.
 \begin{lemma}[Marginal contribution]
 \begin{displaymath}
 \Delta_i V(S)\defeq V(S\cup\{i\}) - V(S)
-  = \frac{1}{2}\log\left(1 + \mu x_i^*A(S)^{-1}x_i\right)
+  = \frac{1}{2}\log\left(1 + \mu \T{x_i} A(S)^{-1}x_i\right)
 \end{displaymath}
 \end{lemma}
@@ -308,10 +307,10 @@ We have the following lemma.
 We have:
 \begin{align*}
 V(S\cup\{i\}) & = \frac{1}{2}\log\det A(S\cup\{i\})\\
-  & = \frac{1}{2}\log\det\left(A(S) + \mu x_i x_i^*\right)\\
+  & = \frac{1}{2}\log\det\left(A(S) + \mu x_i \T{x_i}\right)\\
   & = V(S) + \frac{1}{2}\log\det\left(I_d + \mu A(S)^{-1}x_i
-  x_i^*\right)\\
-  & = V(S) + \frac{1}{2}\log\left(1 + \mu \T{x_i} A(S)^{-1}x_i\right)
+  \T{x_i}\right)\\
+  & = V(S) + \frac{1}{2}\log\left(1 + \mu \T{x_i} A(S)^{-1}x_i\right)
 \end{align*}
 where the last equality comes from Sylvester's determinant formula.
 \end{proof}
```
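The value-of-information formula $V(S) = \frac{1}{2}\log\det(I_d + \mu\sum_{i\in S} x_i x_i^T)$ and the marginal-contribution lemma $\Delta_i V(S) = \frac{1}{2}\log(1 + \mu\, x_i^T A(S)^{-1} x_i)$ quoted in the hunks above are easy to sanity-check numerically. The sketch below does so on random data and also checks the Sylvester determinant identity used in the proof of the theorem; it assumes NumPy, and the problem sizes and variable names are illustrative rather than anything taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, mu = 3, 10, 0.5               # illustrative dimension, number of users, mu = sigma_1^2 / sigma_0^2
X = rng.normal(size=(n, d))         # row i plays the role of the feature vector x_i of user i

def A(S):
    """A(S) = I_d + mu * sum_{i in S} x_i x_i^T."""
    return np.eye(d) + mu * sum(np.outer(X[j], X[j]) for j in S)

def V(S):
    """Value of information V(S) = 1/2 * log det A(S)."""
    return 0.5 * np.linalg.slogdet(A(S))[1]

S, i = [0, 1, 2, 3], 7

# Marginal contribution: V(S u {i}) - V(S) = 1/2 * log(1 + mu * x_i^T A(S)^{-1} x_i)
lhs = V(S + [i]) - V(S)
rhs = 0.5 * np.log1p(mu * X[i] @ np.linalg.solve(A(S), X[i]))
print(np.isclose(lhs, rhs))         # True

# Sylvester's determinant identity used in the proof:
# det(I_n + c * X_S X_S^T) = det(I_d + c * X_S^T X_S)
XS = X[S]
print(np.isclose(np.linalg.det(np.eye(len(S)) + mu * XS @ XS.T),
                 np.linalg.det(np.eye(d) + mu * XS.T @ XS)))   # True
```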
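Several hunks also touch the ridge-regression objective $\hat{\beta} = \arg\min_{\beta} \sum_i (y_i - \beta^T x_i)^2 + \frac{1}{\mu}\|\beta\|_2^2$, which has the closed-form solution $\hat{\beta} = (X^T X + I_d/\mu)^{-1} X^T y$ and reduces to a plain least-squares fit as $\mu \to \infty$. A minimal sketch of that closed form, again assuming NumPy and synthetic data with illustrative names:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, mu, sigma = 200, 3, 10.0, 0.1              # illustrative sizes; mu plays the role of sigma_1^2 / sigma_0^2
beta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))                      # row i is x_i
y = X @ beta_true + sigma * rng.normal(size=n)   # y_i = beta^T x_i + eps_i

# Ridge regression / MAP estimate under the Gaussian prior on beta:
# beta_hat = argmin ||y - X beta||^2 + (1/mu) * ||beta||^2
#          = (X^T X + I_d / mu)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X + np.eye(d) / mu, X.T @ y)
print(beta_hat)                                  # close to beta_true for mild regularization and enough samples
```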
