author    Thibaut Horel <thibaut.horel@gmail.com>  2012-08-27 09:36:47 -0700
committer Thibaut Horel <thibaut.horel@gmail.com>  2012-08-27 09:36:47 -0700
commit    429adf8b0db780e579570c664edce09c1a117a37 (patch)
tree      cc4f15c5b94de651fc9daa9a627592bc22e7ddab
parent    0ae6c4f22a3d3683408f6fb2857831102f934dcc (diff)
More notes: finished the Gaussian case, rewrite of the introduction
-rw-r--r--  notes.tex  164
1 file changed, 115 insertions, 49 deletions
diff --git a/notes.tex b/notes.tex
index a0e5a48..f6d1cb7 100644
--- a/notes.tex
+++ b/notes.tex
@@ -13,60 +13,45 @@
\newcommand{\tr}[1]{#1^*}
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\mse}{\mathop{\mathrm{MSE}}}
-\newcommand{\trace}{\mathop{\mathrm{tr}}}
+\DeclareMathOperator{\trace}{tr}
\begin{document}
-\section{Problem}
+\section{Introduction}
-\paragraph{Data Model}
+\subsection{Framework}
+We are given a set of users; each user has a vector of features which is
+known to us and distinguishes this user from the others. We are interested in
+studying a \emph{survey} in which a subset of the users will be selected and
+queried.
-There is a database $D$ of $N$ users. Each user $i\in\{1,\ldots, N\}$ is
-represented by a vector of features $x_i\in\mathbf{R}^d$. To each user
-$i$, is also associated a variable of interest $y_i$, which can be inferred
-from $x_i$.
+We assume there is a hypothesis function such that, when queried, the answer
+of a specific user is the function applied to their feature vector, with some
+noise added. The noise is assumed to be independent across users, which is
+equivalent to saying that, conditioned on the hypothesis, the answers are
+independent.
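+
+Concretely, writing $h$ for the hypothesis, $x_i$ for the feature vector of
+user $i$, and $\varepsilon_i$ for the noise term, the answer $Y_i$ of user
+$i$ takes the form:
+\begin{displaymath}
+    Y_i = h(x_i) + \varepsilon_i
+\end{displaymath}
+where the $\varepsilon_i$ are independent across users.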
-We will thus assume that there is a \emph{model} describing the relation
-between $y_i$ and $x_i$, which in the most general form can be written as:
-\begin{displaymath}
- y_i = h(x_i)
-\end{displaymath}
-where $h$ is an unknown function.
+The hypothesis is unknown to us, and the answers of the users will be used to
+learn it. Since we do not know the hypothesis, we maintain a distribution
+over the hypothesis set, which is updated after the survey. This distribution
+conveys the uncertainty we have about the hypothesis; more precisely, the
+uncertainty is captured by the \emph{entropy} of the distribution.
-The database is valuable, because by observing the database, or a subset $S$ of
-the database, it is possible to learn the unknown model function $h$ (or to
-improve a prior knowledge of this function) and thus, to predict the variable
-of interest associated with a given user.
+The problem thus takes the form of an \emph{active learning} setup: we want
+to select the most interesting users before conducting the survey. Here,
+``most interesting'' refers to a notion of \emph{value} implied by the model:
+the value of a subset of users is defined as the decrease in entropy caused
+by observing the answers of this group of users, as formalized below.
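+
+In symbols, if $h$ denotes the hypothesis and $S$ the subset of users whose
+answers are observed:
+\begin{displaymath}
+    V(S) = H(h) - H(h\,|\,S)
+\end{displaymath}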
-\paragraph{Value} Consequently, the value $V(S)$ of a subset $S$ of the
-database will be defined as the decrease of \emph{uncertainty} about $h$
-induced by observing $S$. Depending on the definition of uncertainty used, it
-is possible to distinguish two general approaches to define $V(S)$:
-\begin{itemize}
- \item \emph{Information theoretic:} the knowledge of $h$ we have is
- represented by a probabilistic distribution, and the \emph{entropy} of
- $h$ is used as a measure of the uncertainty. The observation of the set
- $S$ allows us to update the distribution of $h$, and the uncertainty
- becomes the conditional entropy of $h$ given the observation. This
- leads to the definition of the value:
- \begin{displaymath}
- V(S) = H(h) - H(h\,|\,S)
- \end{displaymath}
- \item \emph{Statistical:} in this case, we maintain a predictor $\tilde{y}$
- which given a user's feature vector $x$ gives a prediction
- $\tilde{y}(x)$ of his variable of interest. The uncertainty is then
- measured by the variance of the predictor at a given point $x$,
- $\sigma^2_x$. After observing a subset $S$ of the database, the
- predictor is updated, and its new variance is denoted by
- $\sigma_{x|S}^2$. Thus, we could define the value of $S$ by $V(S)
- = \sigma_x^2 - \sigma_{x|S}^2$. However, this definition depends on the
- point $x$. If we knew the distribution of $x$, one could simply take
- the expectation of the previous quantity as a definition of the value.
- Another way to solve the issue, is to average this quantity over the
- database, leading to the definition of the value:
- \begin{displaymath}
- V(S) = \frac{1}{N}\sum_{x\in D} \sigma_x^2 - \sigma_{x|A}^2
- \end{displaymath}
-\end{itemize}
+This definition of value is motivated by the information-theoretic
+interpretation of entropy: suppose we have an oracle to whom we can ask
+yes/no questions about the hypothesis. Then, the entropy of the hypothesis
+distribution is (up to one bit) the expected number of questions we need to
+ask to fully identify the hypothesis. If we must pay for each question asked
+to the oracle, our definition of value directly measures the decrease in cost
+implied by observing a set of users.
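+
+For instance, if the distribution is uniform over a finite set of $2^k$
+hypotheses, its entropy (measured in bits) is $k$, and $k$ yes/no questions
+(a binary search over this set) are necessary and sufficient to identify the
+hypothesis exactly.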
+
+\subsection{Problem}
Once a notion of value has been defined, there are several problems one can
address:
@@ -90,7 +75,7 @@ know about the value functions defined above, etc.]}
These problems have already been studied, and optimal or near-optimal
solutions are known when the value function is submodular.
-\section{Data value in the Gaussian world}
+\section{Value of data in the Gaussian world}
In this section we will assume a multivariate Gaussian model:
\begin{displaymath}
@@ -119,6 +104,8 @@ where:
C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma}
= \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
\end{displaymath}
+
+By expanding the logarithm, we get:
\begin{displaymath}
H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d}
b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
@@ -133,13 +120,92 @@ One can notice that:
where $b$ follows a multivariate normal distribution with covariance matrix
$\Sigma$ and mean zero.
\begin{displaymath}
- \expt{b^*\Sigma^{-1}b} = \expt{tr(\Sigma^{-1}bb^*)} = tr(\Sigma^{-1}\Sigma)
+ \expt{b^*\Sigma^{-1}b} = \expt{\trace\big(\Sigma^{-1}bb^*\big)}
+ = \trace\big(\Sigma^{-1}\Sigma\big)
= \trace(I_d) = d
\end{displaymath}
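+where the first equality holds because a scalar equals its trace and the
+trace is invariant under cyclic permutations, and the second follows by
+linearity from $\expt{bb^*} = \Sigma$.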
+
+Finally:
\begin{displaymath}
H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2}
= \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
\end{displaymath}
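+
+As a sanity check, for $d = 1$ and $\Sigma = s^2$, this is the classical
+entropy $\frac{1}{2}\log(2\pi e s^2)$ of a one-dimensional Gaussian
+distribution.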
+
+\subsubsection*{Conditional entropy}
+
+Let $S$ be a subset of size $n$ of the set $D$ of users. Let us denote by
+$x_1,\ldots,x_n$ the feature vectors of the users in $S$ and by $Y_1, \ldots,
+Y_n$ the associated random variables of interest.
+
+Using Bayes' rule for conditional entropy, we get:
+\begin{displaymath}
+ H(\beta\,|\,Y_1,\ldots ,Y_n) = H(Y_1,\ldots,Y_n\,|\,\beta)
+ +H(\beta) - H(Y_1,\ldots, Y_n)
+\end{displaymath}
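+which follows from expanding the joint entropy in two ways:
+$H(\beta, Y_1,\ldots,Y_n) = H(Y_1,\ldots,Y_n\,|\,\beta) + H(\beta)
+= H(\beta\,|\,Y_1,\ldots,Y_n) + H(Y_1,\ldots,Y_n)$.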
+
+Conditioned on $\beta$, $(Y_1,\ldots,Y_n)$ follows a multivariate normal
+distribution with mean $X\beta$ and covariance matrix $\sigma^2 I_n$, where
+$X$ denotes the $n\times d$ matrix whose rows are $x_1^*,\ldots,x_n^*$.
+Hence:
+\begin{displaymath}
+ H(Y_1,\ldots,Y_n\,|\,\beta)
+ = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
+\end{displaymath}
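+which simplifies to $\frac{n}{2}\log(2\pi e\sigma^2)$ since
+$\det(\sigma^2 I_n) = \sigma^{2n}$.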
+
+Marginally, $(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution
+with mean $0$, since both $\beta$ and the noise are centered. Let us denote
+by $\Sigma_Y$ its covariance matrix:
+\begin{align}
+ \Sigma_Y & = \expt{YY^*} = \expt{(X\beta + E)(X\beta + E)^*}\\
+ & = X\Sigma X^* + \sigma^2I_n
+\end{align}
+where the cross terms vanish because $\beta$ and the noise vector $E$ are
+independent and centered. This yields the following expression for the joint
+entropy:
+\begin{displaymath}
+ H(Y_1,\ldots, Y_n)
+ = \frac{1}{2}\log\left((2\pi e)^n \det(X\Sigma X^* + \sigma^2 I_n)\right)
+\end{displaymath}
+
+Combining the three equations above, and using Sylvester's determinant
+identity $\det(I_n + AB) = \det(I_d + BA)$ to rewrite $\det(X\Sigma X^*
++ \sigma^2 I_n)$ as $\sigma^{2n}\det(I_d + \Sigma X^*X/\sigma^2)$, we get:
+\begin{align}
+ H(\beta\,|\, Y_1,\ldots,Y_n)
+ & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma
+ \det\left(I_d + \Sigma \frac{X^*X}{\sigma^2}\right)^{-1}\right)\\
+ & = - \frac{1}{2}\log\left((2\pi e)^{-d}
+\det\left(\Sigma^{-1} + \frac{X^*X}{\sigma^2}\right)\right)
+\end{align}
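+
+Note that $\big(\Sigma^{-1} + X^*X/\sigma^2\big)^{-1}$ is exactly the
+posterior covariance matrix of $\beta$ in Bayesian linear regression: the
+conditional entropy above is simply the entropy of the Gaussian posterior
+distribution of $\beta$ given the answers.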
+
+\subsubsection*{Value of data}
+
+By combining the formulas for the entropy and the conditional entropy, and
+denoting by $X_S$ the matrix whose rows are the feature vectors of the users
+in $S$, we get that the value of the set $S$ is equal to:
+\begin{align}
+ V(S) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma\right)
+ + \frac{1}{2}\log\left((2\pi e)^{-d} \det
+ \left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)\right)\\
+ & = \frac{1}{2}\log\left(\det\left(I_d
+ + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\right)
+\end{align}
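+
+For instance, for a single user with feature vector $x$ and the isotropic
+prior $\Sigma = s^2 I_d$, the matrix determinant lemma gives:
+\begin{displaymath}
+    V(\{x\}) = \frac{1}{2}\log\left(1 + \frac{s^2\|x\|^2}{\sigma^2}\right)
+\end{displaymath}
+so that, in this case, a user is all the more valuable as the norm of his
+feature vector is large compared to the noise level.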
+
+\subsubsection*{Marginal contribution}
+
+Here, we want to compute the marginal contribution of a user with feature
+vector $x$ to a set $S$ of users:
+\begin{displaymath}
+ \Delta_xV(S) = V(S\cup \{x\}) - V(S)
+\end{displaymath}
+
+Using that:
+\begin{displaymath}
+ X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
+\end{displaymath}
+we get, using the matrix determinant lemma $\det(I_d + uv^*) = 1 + v^*u$ for
+the last step:
+\begin{align}
+ \Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d
+ + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)
+ - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
+ & = \frac{1}{2}\log\det\left(I_d + \frac{xx^*}{\sigma^2}\left(\Sigma^{-1} +
+\frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
+& = \frac{1}{2}\log\left(1 + \frac{1}{\sigma^2}x^*\left(\Sigma^{-1}
++ \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
+\end{align}
+
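+Since $X_S^*X_S$ only grows (in the positive semidefinite order) as users are
+added to $S$, the matrix $\big(\Sigma^{-1} + X_S^*X_S/\sigma^2\big)^{-1}$
+shrinks in the same order, and the last expression shows that $\Delta_xV(S)$
+is non-increasing in $S$: this is precisely the submodularity of $V$, the
+notion studied in the next section.
+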
\section{Submodularity}
In this section, we assume that we are given a \emph{universe} set $U$.