| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-08-27 09:36:47 -0700 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-08-27 09:36:47 -0700 |
| commit | 429adf8b0db780e579570c664edce09c1a117a37 (patch) | |
| tree | cc4f15c5b94de651fc9daa9a627592bc22e7ddab | |
| parent | 0ae6c4f22a3d3683408f6fb2857831102f934dcc (diff) | |
| download | recommendation-429adf8b0db780e579570c664edce09c1a117a37.tar.gz | |
More notes: finished the Gaussian case, rewrite of the introduction
| -rw-r--r-- | notes.tex | 164 |
1 file changed, 115 insertions, 49 deletions
@@ -13,60 +13,45 @@ \newcommand{\tr}[1]{#1^*}
 \newcommand{\ip}[2]{\langle #1, #2 \rangle}
 \newcommand{\mse}{\mathop{\mathrm{MSE}}}
-\newcommand{\trace}{\mathop{\mathrm{tr}}}
+\DeclareMathOperator{\trace}{tr}
 
 \begin{document}
 
-\section{Problem}
+\section{Introduction}
 
-\paragraph{Data Model}
+\subsection{Framework}
+We are given a set of users; each user has a vector of features which is known
+to us and allows us to distinguish this user from the others. We are interested
+in studying a \emph{survey} in which a subset of users will be selected and
+queried.
 
-There is a database $D$ of $N$ users. Each user $i\in\{1,\ldots, N\}$ is
-represented by a vector of features $x_i\in\mathbf{R}^d$. To each user
-$i$, is also associated a variable of interest $y_i$, which can be inferred
-from $x_i$.
+We assume there is a hypothesis function such that, when a user is submitted
+a query, his answer will be equal to the function applied to his feature
+vector, with some noise added. The noise terms are independent across users,
+which is equivalent to saying that, conditioned on the hypothesis, the
+answers are independent.
 
-We will thus assume that there is a \emph{model} describing the relation
-between $y_i$ and $x_i$, which in the most general form can be written as:
-\begin{displaymath}
-    y_i = h(x_i)
-\end{displaymath}
-where $h$ is an unknown function.
+The hypothesis is unknown to us, and the answers of the users will be used to
+learn it. Since we do not know the hypothesis, we maintain a distribution over
+the hypothesis set, which is updated after the survey. This distribution
+conveys the uncertainty we have about the hypothesis; more precisely, the
+uncertainty is captured by the \emph{entropy} of the distribution.
 
-The database is valuable, because by observing the database, or a subset $S$ of
-the database, it is possible to learn the unknown model function $h$ (or to
-improve a prior knowledge of this function) and thus, to predict the variable
-of interest associated with a given user.
+Thus, the problem takes the form of an \emph{active learning} setup: we want to
+select the most interesting users before conducting the survey. Here, ``most
+interesting'' refers to a notion of \emph{value} implied by the model: the
+value of a subset of users will be defined as the decrease of entropy caused by
+the observation of the answers of this group of users.
 
-\paragraph{Value} Consequently, the value $V(S)$ of a subset $S$ of the
-database will be defined as the decrease of \emph{uncertainty} about $h$
-induced by observing $S$. Depending on the definition of uncertainty used, it
-is possible to distinguish two general approaches to define $V(S)$:
-\begin{itemize}
-    \item \emph{Information theoretic:} the knowledge of $h$ we have is
-        represented by a probabilistic distribution, and the \emph{entropy} of
-        $h$ is used as a measure of the uncertainty. The observation of the set
-        $S$ allows us to update the distribution of $h$, and the uncertainty
-        becomes the conditional entropy of $h$ given the observation. This
-        leads to the definition of the value:
-        \begin{displaymath}
-            V(S) = H(h) - H(h\,|\,S)
-        \end{displaymath}
-    \item \emph{Statistical:} in this case, we maintain a predictor $\tilde{y}$
-        which given a user's feature vector $x$ gives a prediction
-        $\tilde{y}(x)$ of his variable of interest. The uncertainty is then
-        measured by the variance of the predictor at a given point $x$,
-        $\sigma^2_x$. After observing a subset $S$ of the database, the
-        predictor is updated, and its new variance is denoted by
-        $\sigma_{x|S}^2$. Thus, we could define the value of $S$ by $V(S)
-        = \sigma_x^2 - \sigma_{x|S}^2$. However, this definition depends on the
-        point $x$. If we knew the distribution of $x$, one could simply take
-        the expectation of the previous quantity as a definition of the value.
-        Another way to solve the issue, is to average this quantity over the
-        database, leading to the definition of the value:
-        \begin{displaymath}
-            V(S) = \frac{1}{N}\sum_{x\in D} \sigma_x^2 - \sigma_{x|A}^2
-        \end{displaymath}
-\end{itemize}
+This definition of value is motivated by the information-theoretic
+interpretation of entropy: suppose we have an oracle to whom we can ask
+yes/no questions about the hypothesis. The entropy of the hypothesis
+distribution is then, up to one bit, the expected number of questions we need
+to ask to fully identify the hypothesis. If we have to pay for each question
+asked to the oracle, our definition of value directly measures the cost
+decrease implied by observing a set of users.
+
+\subsection{Problem}
 
 Once you have defined a notion of value, there are several problems you can
 address:
@@ -90,7 +75,7 @@ know about the value functions defined above, etc.]}
 These problems have already been studied, and optimal or near-optimal
 solutions are already known when the value function is submodular.
 
-\section{Data value in the Gaussian world}
+\section{Value of data in the Gaussian world}
 
 In this section we will assume a multivariate Gaussian model:
 \begin{displaymath}
@@ -119,6 +104,8 @@ where:
     C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma}
     = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
 \end{displaymath}
+
+By expanding the logarithm of the density, we get:
 \begin{displaymath}
     H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d}
     b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
@@ -133,13 +120,92 @@ One can notice that:
 where $b$ follows a multivariate normal distribution of variance $\Sigma$ and
 mean zero.
 \begin{displaymath}
-    \expt{b^*\Sigma^{-1}b} = \expt{tr(\Sigma^{-1}bb^*)} = tr(\Sigma^{-1}\Sigma)
+    \expt{b^*\Sigma^{-1}b} = \expt{\trace\big(\Sigma^{-1}bb^*\big)}
+    = \trace\big(\Sigma^{-1}\Sigma\big) = d
 \end{displaymath}
+
+Finally:
 \begin{displaymath}
     H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2}
     = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
 \end{displaymath}
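As a quick numerical sanity check of this closed form (an editorial sketch, not part of the commit: it assumes NumPy/SciPy, and `Sigma` is an arbitrary positive-definite covariance):

```python
# Check H(beta) = 1/2 * log((2*pi*e)^d * det(Sigma)) against SciPy's
# closed-form entropy for a centered multivariate normal.
import numpy as np
from scipy.stats import multivariate_normal

d = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # arbitrary positive-definite covariance

closed_form = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))
assert np.isclose(closed_form, multivariate_normal(cov=Sigma).entropy())
```

For larger `d`, computing `det` directly can overflow; the checks below use `np.linalg.slogdet` instead.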
+
+\subsubsection*{Conditional entropy}
+
+Let $S$ be a subset of $D$ of size $n$. Let us denote by $x_1,\ldots,x_n$ the
+points in $S$, by $Y_1, \ldots, Y_n$ the associated random variables of
+interest, and by $X$ the $n\times d$ matrix whose rows are
+$x_1^*,\ldots,x_n^*$.
+
+Using Bayes' rule for conditional entropy, we get:
+\begin{displaymath}
+    H(\beta\,|\,Y_1,\ldots ,Y_n) = H(Y_1,\ldots,Y_n\,|\,\beta) +
+    H(\beta) - H(Y_1,\ldots, Y_n)
+\end{displaymath}
+
+Conditioned on $\beta$, $(Y_1,\ldots,Y_n)$ follows a multivariate normal
+distribution of mean $X\beta$ and of covariance matrix $\sigma^2 I_n$. Hence:
+\begin{displaymath}
+    H(Y_1,\ldots,Y_n\,|\,\beta)
+    = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
+\end{displaymath}
+
+Marginally, $(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of
+mean 0. Let us denote by $\Sigma_Y$ its covariance matrix:
+\begin{align}
+    \Sigma_Y & = \expt{YY^*} = \expt{(X\beta + E)(X\beta + E)^*}\\
+    & = X\Sigma X^* + \sigma^2I_n
+\end{align}
+where $E$ denotes the noise vector and the cross terms vanish because $\beta$
+and $E$ are independent and centered. This yields the following expression
+for the joint entropy:
+\begin{displaymath}
+    H(Y_1,\ldots, Y_n)
+    = \frac{1}{2}\log\left((2\pi e)^n \det(X\Sigma X^* + \sigma^2 I_n)\right)
+\end{displaymath}
+
+Combining the above equations and using Sylvester's determinant identity
+$\det(I_n + AB) = \det(I_d + BA)$, we get:
+\begin{align}
+    H(\beta\,|\, Y_1,\ldots,Y_n)
+    & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma
+    \det\left(I_d + \Sigma \frac{X^*X}{\sigma^2}\right)^{-1}\right)\\
+    & = - \frac{1}{2}\log\left((2\pi e)^{-d}
+    \det\left(\Sigma^{-1} + \frac{X^*X}{\sigma^2}\right)\right)
+\end{align}
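This expression can also be checked numerically (same assumptions as the sketch above; `sigma2` stands for $\sigma^2$ and the data are arbitrary):

```python
# Check H(beta|Y) = H(Y|beta) + H(beta) - H(Y) against the closed form
# -1/2 * log((2*pi*e)^(-d) * det(Sigma^{-1} + X^T X / sigma^2)).
import numpy as np

d, n, sigma2 = 3, 5, 0.5
rng = np.random.default_rng(1)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)  # prior covariance of beta
X = rng.standard_normal((n, d))  # feature vectors of the queried users

def normal_entropy(cov):
    """Entropy of a centered multivariate normal with covariance `cov`."""
    k = cov.shape[0]
    return 0.5 * (k * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

lhs = (normal_entropy(sigma2 * np.eye(n))                       # H(Y|beta)
       + normal_entropy(Sigma)                                  # H(beta)
       - normal_entropy(X @ Sigma @ X.T + sigma2 * np.eye(n)))  # H(Y)
M = np.linalg.inv(Sigma) + X.T @ X / sigma2
rhs = -0.5 * (-d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(M)[1])
assert np.isclose(lhs, rhs)
```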
+
+\subsubsection*{Value of data}
+
+By combining the formula for the entropy and the conditional entropy, we get
+that the value of a set $S$ of points is equal to:
+\begin{align}
+    V(S) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma\right) +
+    \frac{1}{2}\log\left((2\pi e)^{-d} \det
+    \left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)\right)\\
+    & = \frac{1}{2}\log\left(\det\left(I_d +
+    \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\right)
+\end{align}
+where $X_S$ denotes the matrix associated with the points of $S$.
+
+\subsubsection*{Marginal contribution}
+
+Here, we want to compute the marginal contribution of a point $x$ to a set
+$S$ of users:
+\begin{displaymath}
+    \Delta_xV(S) = V(S\cup \{x\}) - V(S)
+\end{displaymath}
+
+Using that:
+\begin{displaymath}
+    X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
+\end{displaymath}
+we get:
+\begin{align}
+    \Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d +
+    \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)
+    - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
+    & = \frac{1}{2}\log\det\left(I_d + \frac{xx^*}{\sigma^2}\left(\Sigma^{-1}
+    +\frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
+    & = \frac{1}{2}\log\left(1 + \frac{1}{\sigma^2}x^*\left(\Sigma^{-1}
+    + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
+\end{align}
+where the last equality follows from the matrix determinant lemma
+$\det(I_d + uv^*) = 1 + v^*u$.
+
 \section{Submodularity}
 
 In this section, we will consider that we are given a \emph{universe} set $U$.
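Finally, the rank-one expression for $\Delta_x V(S)$ derived in the marginal-contribution subsection above can be checked against a direct recomputation of $V(S\cup\{x\}) - V(S)$. Again an illustrative sketch; in practice one would keep a factorization of $\Sigma^{-1} + X_S^*X_S/\sigma^2$ up to date rather than forming explicit inverses:

```python
# Check the rank-one update Delta_x V(S) against V(S u {x}) - V(S)
# computed directly from the determinant formula for V.
import numpy as np

d, n, sigma2 = 4, 6, 0.3
rng = np.random.default_rng(2)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)    # prior covariance of beta
X_S = rng.standard_normal((n, d))  # feature vectors of the set S
x = rng.standard_normal(d)         # candidate point to add

def value(X):
    """V(S) = 1/2 * log det(I_d + Sigma X^T X / sigma^2)."""
    return 0.5 * np.linalg.slogdet(np.eye(d) + Sigma @ X.T @ X / sigma2)[1]

direct = value(np.vstack([X_S, x])) - value(X_S)
M_inv = np.linalg.inv(np.linalg.inv(Sigma) + X_S.T @ X_S / sigma2)
rank_one = 0.5 * np.log1p(x @ M_inv @ x / sigma2)
assert np.isclose(direct, rank_one)
```

The rank-one form is what makes greedy selection cheap: each candidate's marginal gain costs one quadratic form instead of a fresh determinant.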
