| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-08-27 09:36:47 -0700 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-08-27 09:36:47 -0700 |
| commit | 429adf8b0db780e579570c664edce09c1a117a37 (patch) | |
| tree | cc4f15c5b94de651fc9daa9a627592bc22e7ddab | |
| parent | 0ae6c4f22a3d3683408f6fb2857831102f934dcc (diff) | |
| download | recommendation-429adf8b0db780e579570c664edce09c1a117a37.tar.gz | |
More notes: finished the Gaussian case, rewrite of the introduction
| -rw-r--r-- | notes.tex | 164 |
1 file changed, 115 insertions, 49 deletions
@@ -13,60 +13,45 @@ \newcommand{\tr}[1]{#1^*}
 \newcommand{\ip}[2]{\langle #1, #2 \rangle}
 \newcommand{\mse}{\mathop{\mathrm{MSE}}}
-\newcommand{\trace}{\mathop{\mathrm{tr}}}
+\DeclareMathOperator{\trace}{tr}
 
 \begin{document}
 
-\section{Problem}
+\section{Introduction}
 
-\paragraph{Data Model}
+\subsection{Framework}
+We are given a set of users; each user has a vector of features which is known
+to us and allows us to distinguish this user from the others. We are interested
+in studying a \emph{survey} in which a subset of users will be selected and
+queried.
 
-There is a database $D$ of $N$ users. Each user $i\in\{1,\ldots, N\}$ is
-represented by a vector of features $x_i\in\mathbf{R}^d$. To each user
-$i$, is also associated a variable of interest $y_i$, which can be inferred
-from $x_i$.
+We assume there is a hypothesis function such that, when a user is submitted
+a query, his answer will be equal to the function applied to his feature
+vector, with some noise added. The noise terms are independent across users,
+which is equivalent to saying that, conditioned on the hypothesis, the
+answers are independent.
 
-We will thus assume that there is a \emph{model} describing the relation
-between $y_i$ and $x_i$, which in the most general form can be written as:
-\begin{displaymath}
-    y_i = h(x_i)
-\end{displaymath}
-where $h$ is an unknown function.
+The hypothesis is unknown to us, and the answers of the users will be used to
+learn it. Since we do not know the hypothesis, we maintain a distribution over
+the hypothesis set, which is updated after the survey. This distribution
+conveys the uncertainty we have about the hypothesis; more precisely, the
+uncertainty is captured by the \emph{entropy} of the distribution.
 
-The database is valuable, because by observing the database, or a subset $S$ of
-the database, it is possible to learn the unknown model function $h$ (or to
-improve a prior knowledge of this function) and thus, to predict the variable
-of interest associated with a given user.
+Thus, the problem takes the form of an \emph{active learning} setup: we want to
+select the most interesting users before conducting the survey. Here, ``most
+interesting'' refers to a notion of \emph{value} implied by the model: the
+value of a subset of users will be defined as the decrease of entropy caused by
+the observation of the answers of this group of users.
 
-\paragraph{Value} Consequently, the value $V(S)$ of a subset $S$ of the
-database will be defined as the decrease of \emph{uncertainty} about $h$
-induced by observing $S$. Depending on the definition of uncertainty used, it
-is possible to distinguish two general approaches to define $V(S)$:
-\begin{itemize}
-    \item \emph{Information theoretic:} the knowledge of $h$ we have is
-        represented by a probabilistic distribution, and the \emph{entropy} of
-        $h$ is used as a measure of the uncertainty. The observation of the set
-        $S$ allows us to update the distribution of $h$, and the uncertainty
-        becomes the conditional entropy of $h$ given the observation. This
-        leads to the definition of the value:
-        \begin{displaymath}
-            V(S) = H(h) - H(h\,|\,S)
-        \end{displaymath}
-    \item \emph{Statistical:} in this case, we maintain a predictor $\tilde{y}$
-        which given a user's feature vector $x$ gives a prediction
-        $\tilde{y}(x)$ of his variable of interest. The uncertainty is then
-        measured by the variance of the predictor at a given point $x$,
-        $\sigma^2_x$. After observing a subset $S$ of the database, the
-        predictor is updated, and its new variance is denoted by
-        $\sigma_{x|S}^2$. Thus, we could define the value of $S$ by $V(S)
-        = \sigma_x^2 - \sigma_{x|S}^2$. However, this definition depends on the
-        point $x$. If we knew the distribution of $x$, one could simply take
-        the expectation of the previous quantity as a definition of the value.
-        Another way to solve the issue, is to average this quantity over the
-        database, leading to the definition of the value:
-        \begin{displaymath}
-            V(S) = \frac{1}{N}\sum_{x\in D} \sigma_x^2 - \sigma_{x|A}^2
-        \end{displaymath}
-\end{itemize}
+This definition of value is motivated by the information-theoretic
+interpretation of entropy: suppose we have an oracle to whom we can ask
+yes/no questions about the hypothesis. The entropy of the hypothesis
+distribution is then, up to one bit, the expected number of questions we need
+to ask to fully identify the hypothesis. If we have to pay for each question
+asked to the oracle, our definition of value directly measures the cost
+decrease implied by observing a set of users.
+
+\subsection{Problem}
 
 Once you have defined a notion of value, there are several problems you can
 address:
@@ -90,7 +75,7 @@ know about the value functions defined above, etc.]}
 These problems have already been studied, and optimal or near-optimal
 solutions are already known when the value function is submodular.
 
-\section{Data value in the Gaussian world}
+\section{Value of data in the Gaussian world}
 
 In this section we will assume a multivariate Gaussian model:
 \begin{displaymath}
@@ -119,6 +104,8 @@ where:
     C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma}
     = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
 \end{displaymath}
+
+By expanding the logarithm of the density, we get:
 \begin{displaymath}
     H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d}
     b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
@@ -133,13 +120,92 @@ One can notice that:
 where $b$ follows a multivariate normal distribution of variance $\Sigma$ and
 mean zero.
 \begin{displaymath}
-    \expt{b^*\Sigma^{-1}b} = \expt{tr(\Sigma^{-1}bb^*)} = tr(\Sigma^{-1}\Sigma)
+    \expt{b^*\Sigma^{-1}b} = \expt{\trace\big(\Sigma^{-1}bb^*\big)}
+    = \trace\big(\Sigma^{-1}\Sigma\big) = d
 \end{displaymath}
+
+Finally:
 \begin{displaymath}
     H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2}
     = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
 \end{displaymath}
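As a quick numerical sanity check of this closed form (an editorial sketch, not part of the commit: it assumes NumPy/SciPy, and `Sigma` is an arbitrary positive-definite covariance):

```python
# Check H(beta) = 1/2 * log((2*pi*e)^d * det(Sigma)) against SciPy's
# closed-form entropy for a centered multivariate normal.
import numpy as np
from scipy.stats import multivariate_normal

d = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # arbitrary positive-definite covariance

closed_form = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))
assert np.isclose(closed_form, multivariate_normal(cov=Sigma).entropy())
```

For larger `d`, computing `det` directly can overflow; the checks below use `np.linalg.slogdet` instead.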
+
+\subsubsection*{Conditional entropy}
+
+Let $S$ be a subset of $D$ of size $n$. Let us denote by $x_1,\ldots,x_n$ the
+points in $S$, by $Y_1, \ldots, Y_n$ the associated random variables of
+interest, and by $X$ the $n\times d$ matrix whose rows are
+$x_1^*,\ldots,x_n^*$.
+
+Using Bayes' rule for conditional entropy, we get:
+\begin{displaymath}
+    H(\beta\,|\,Y_1,\ldots ,Y_n) = H(Y_1,\ldots,Y_n\,|\,\beta) +
+    H(\beta) - H(Y_1,\ldots, Y_n)
+\end{displaymath}
+
+Conditioned on $\beta$, $(Y_1,\ldots,Y_n)$ follows a multivariate normal
+distribution of mean $X\beta$ and of covariance matrix $\sigma^2 I_n$. Hence:
+\begin{displaymath}
+    H(Y_1,\ldots,Y_n\,|\,\beta)
+    = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
+\end{displaymath}
+
+Marginally, $(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of
+mean 0. Let us denote by $\Sigma_Y$ its covariance matrix:
+\begin{align}
+    \Sigma_Y & = \expt{YY^*} = \expt{(X\beta + E)(X\beta + E)^*}\\
+    & = X\Sigma X^* + \sigma^2I_n
+\end{align}
+where $E$ denotes the noise vector and the cross terms vanish because $\beta$
+and $E$ are independent and centered. This yields the following expression
+for the joint entropy:
+\begin{displaymath}
+    H(Y_1,\ldots, Y_n)
+    = \frac{1}{2}\log\left((2\pi e)^n \det(X\Sigma X^* + \sigma^2 I_n)\right)
+\end{displaymath}
+
+Combining the above equations and using Sylvester's determinant identity
+$\det(I_n + AB) = \det(I_d + BA)$, we get:
+\begin{align}
+    H(\beta\,|\, Y_1,\ldots,Y_n)
+    & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma
+    \det\left(I_d + \Sigma \frac{X^*X}{\sigma^2}\right)^{-1}\right)\\
+    & = - \frac{1}{2}\log\left((2\pi e)^{-d}
+    \det\left(\Sigma^{-1} + \frac{X^*X}{\sigma^2}\right)\right)
+\end{align}
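This expression can also be checked numerically (same assumptions as the sketch above; `sigma2` stands for $\sigma^2$ and the data are arbitrary):

```python
# Check H(beta|Y) = H(Y|beta) + H(beta) - H(Y) against the closed form
# -1/2 * log((2*pi*e)^(-d) * det(Sigma^{-1} + X^T X / sigma^2)).
import numpy as np

d, n, sigma2 = 3, 5, 0.5
rng = np.random.default_rng(1)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # prior covariance of beta
X = rng.standard_normal((n, d))  # feature vectors of the queried users

def normal_entropy(cov):
    """Entropy of a centered multivariate normal with covariance `cov`."""
    k = cov.shape[0]
    return 0.5 * (k * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

lhs = (normal_entropy(sigma2 * np.eye(n))                       # H(Y|beta)
       + normal_entropy(Sigma)                                  # H(beta)
       - normal_entropy(X @ Sigma @ X.T + sigma2 * np.eye(n)))  # H(Y)
M = np.linalg.inv(Sigma) + X.T @ X / sigma2
rhs = -0.5 * (-d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(M)[1])
assert np.isclose(lhs, rhs)
```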
+
+\subsubsection*{Value of data}
+
+By combining the formula for the entropy and the conditional entropy, we get
+that the value of a set $S$ of points is equal to:
+\begin{align}
+    V(S) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma\right) +
+    \frac{1}{2}\log\left((2\pi e)^{-d} \det
+    \left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)\right)\\
+    & = \frac{1}{2}\log\left(\det\left(I_d +
+    \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\right)
+\end{align}
+where $X_S$ denotes the matrix associated with the points of $S$.
+
+\subsubsection*{Marginal contribution}
+
+Here, we want to compute the marginal contribution of a point $x$ to a set
+$S$ of users:
+\begin{displaymath}
+    \Delta_xV(S) = V(S\cup \{x\}) - V(S)
+\end{displaymath}
+
+Using that:
+\begin{displaymath}
+    X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
+\end{displaymath}
+we get:
+\begin{align}
+    \Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d +
+    \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)
+    - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
+    & = \frac{1}{2}\log\det\left(I_d + \frac{xx^*}{\sigma^2}\left(\Sigma^{-1}
+    +\frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
+    & = \frac{1}{2}\log\left(1 + \frac{1}{\sigma^2}x^*\left(\Sigma^{-1}
+    + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
+\end{align}
+where the last equality follows from the matrix determinant lemma
+$\det(I_d + uv^*) = 1 + v^*u$.
+
 \section{Submodularity}
 
 In this section, we will consider that we are given a \emph{universe} set $U$.
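Finally, the rank-one expression for $\Delta_x V(S)$ derived in the marginal-contribution subsection above can be checked against a direct recomputation of $V(S\cup\{x\}) - V(S)$. Again an illustrative sketch; in practice one would keep a factorization of $\Sigma^{-1} + X_S^*X_S/\sigma^2$ up to date rather than forming explicit inverses:

```python
# Check the rank-one update Delta_x V(S) against V(S u {x}) - V(S)
# computed directly from the determinant formula for V.
import numpy as np

d, n, sigma2 = 4, 6, 0.3
rng = np.random.default_rng(2)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)    # prior covariance of beta
X_S = rng.standard_normal((n, d))  # feature vectors of the set S
x = rng.standard_normal(d)         # candidate point to add

def value(X):
    """V(S) = 1/2 * log det(I_d + Sigma X^T X / sigma^2)."""
    return 0.5 * np.linalg.slogdet(np.eye(d) + Sigma @ X.T @ X / sigma2)[1]

direct = value(np.vstack([X_S, x])) - value(X_S)
M_inv = np.linalg.inv(np.linalg.inv(Sigma) + X_S.T @ X_S / sigma2)
rank_one = 0.5 * np.log1p(x @ M_inv @ x / sigma2)
assert np.isclose(direct, rank_one)
```

The rank-one form is what makes greedy selection cheap: each candidate's marginal gain costs one quadratic form instead of a fresh determinant.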
