 notes.tex | 123 ++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 95 insertions(+), 28 deletions(-)
diff --git a/notes.tex b/notes.tex
index 92eabb5..a0e5a48 100644
--- a/notes.tex
+++ b/notes.tex
@@ -18,38 +18,53 @@
\section{Problem}
-\begin{itemize}
-\item database $D = (X_i)_{1\leq i\leq n}$
-\item $(X_i)_{1\leq i\leq n}\in\big(\mathbf{R^d}\big)^n$ sampled in an i.i.d
- fashion from the probability distribution $\mu$
-\end{itemize}
+\paragraph{Data Model}
+
+There is a database $D$ of $N$ users. Each user $i\in\{1,\ldots, N\}$ is
+represented by a vector of features $x_i\in\mathbf{R}^d$. Each user $i$ is
+also associated with a variable of interest $y_i$, which can be inferred
+from $x_i$.
-We assume that there is a common variable $P$ which is the underlying
-phenomenon that we are interesting in, and all the variables in the database,
-are noisy realisations of this phenomenon:
+We will thus assume that there is a \emph{model} describing the relation
+between $y_i$ and $x_i$, which in its most general form can be written as:
\begin{displaymath}
- X_i = P + N_i
+ y_i = h(x_i)
\end{displaymath}
-where the $N_i$ are pairwise independent and also independent from $H$.
+where $h$ is an unknown function.
+The database is valuable because, by observing it or a subset $S$ of it, it
+is possible to learn the unknown model function $h$ (or to refine prior
+knowledge of this function) and thus to predict the variable of interest
+associated with a given user.
-We are interested in defining the $\emph{value}$ of a subset $S$ of the
-database $D$. The approaches to such a definition can be classified in two main
-families:
+\paragraph{Value} Consequently, the value $V(S)$ of a subset $S$ of the
+database will be defined as the decrease of \emph{uncertainty} about $h$
+induced by observing $S$. Depending on the measure of uncertainty used, two
+general approaches to defining $V(S)$ can be distinguished:
\begin{itemize}
- \item \emph{Information theoretic:} We use the entropy of $P$ as a measure
- of the uncertainty. Then the value of a subset $S$ is the decrease of
- entropy induced by revealing it:
+    \item \emph{Information theoretic:} our knowledge of $h$ is
+        represented by a probability distribution, and the \emph{entropy} of
+        $h$ is used as a measure of the uncertainty. The observation of the
+        set $S$ allows us to update the distribution of $h$, and the
+        uncertainty becomes the conditional entropy of $h$ given the
+        observation. This leads to the definition of the value:
\begin{displaymath}
- V(S) = H(P) - H(P\,|\,S)
+ V(S) = H(h) - H(h\,|\,S)
\end{displaymath}
- \item \emph{Statistical:} in this case you are trying to estimate the
- phenomenon. The variance of the estimation is used as a measure of
- the uncertainty. The value of a set $S$ can be defined as the average
- reduction of the variance over all the points in the database once you
- observe $S$:
+    \item \emph{Statistical:} in this case, we maintain a predictor $\tilde{y}$
+        which, given a user's feature vector $x$, gives a prediction
+        $\tilde{y}(x)$ of their variable of interest. The uncertainty is then
+        measured by the variance of the predictor at a given point $x$,
+        $\sigma^2_x$. After observing a subset $S$ of the database, the
+        predictor is updated, and its new variance is denoted by
+        $\sigma_{x|S}^2$. Thus, we could define the value of $S$ by $V(S)
+        = \sigma_x^2 - \sigma_{x|S}^2$. However, this definition depends on
+        the point $x$. If we knew the distribution of $x$, we could simply
+        take the expectation of this quantity as the definition of the value.
+        Another way to solve the issue is to average it over the database
+        (illustrated in the sketch after this list), leading to the
+        definition of the value:
\begin{displaymath}
- V(S) = \frac{1}{|D|}\sum_{x\in D} \sigma_x^2 - \sigma_{x|A}^2
+            V(S) = \frac{1}{N}\sum_{x\in D} \big(\sigma_x^2 - \sigma_{x|S}^2\big)
\end{displaymath}
\end{itemize}
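+
+As an illustration of the statistical definition, here is a minimal Python
+sketch, assuming the Gaussian linear model $y = \beta^*x + \epsilon$ studied
+in the next section (the database and all parameter values are synthetic):
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(0)
+d, N, n, sigma2 = 3, 100, 10, 0.5
+Sigma = np.eye(d)                 # prior covariance of beta
+D = rng.standard_normal((N, d))   # synthetic database of N feature vectors
+S = D[:n]                         # features of the observed subset
+
+# Posterior covariance of beta after observing S (Bayesian linear model)
+Sigma_post = np.linalg.inv(np.linalg.inv(Sigma) + S.T @ S / sigma2)
+
+# Variance of the predictor at x is x^T Sigma x before observing S and
+# x^T Sigma_post x after; average the reduction over the whole database.
+V = np.mean(np.einsum('ij,jk,ik->i', D, Sigma - Sigma_post, D))
+print(V)
+\end{verbatim}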
@@ -59,9 +74,9 @@ address:
\item \emph{budget auctions:} the points in the database are associated
with users. Each user $i$ has a cost $c_i$ for revealing his data. You
are given a budget $B$ and you want to select a subset $S$ of users and
- a payment scheme which is truthful. The value of the subset
- $S$ should be as big as possible with the constraint that the payments
- should stay within budget.
+ a payment scheme which is truthful. The value of the subset $S$ should
+ be as big as possible with the constraint that the payments should stay
+ within budget.
\item \emph{value sharing:} the goal is to find a value sharing method to
split the whole value of a subset $S$ among its participants.
\item \emph{k-best auctions:} this one is very similar to \emph{budget
@@ -69,10 +84,62 @@ address:
on the number of people in the subset $S$.
\end{enumerate}
+\emph{[TODO: add references to the papers mentioning these problems, what we
+know about the value functions defined above, etc.]}
+
These problems have already been studied, and optimal or near-optimal solutions
-are already known when the value function is submodular. Fortunately, it can be
-shown that the information theoretic value defined above is submodular.
+are already known when the value function is submodular.
+
+\section{Data value in the Gaussian world}
+
+In this section we will assume a multivariate Gaussian model:
+\begin{displaymath}
+ y = \beta^*x + \epsilon
+\end{displaymath}
+
+The prior distribution of $\beta$ is a multivariate normal distribution with
+mean zero and covariance $\Sigma$; the noise $\epsilon$ is modeled as
+a normal distribution with mean zero and variance $\sigma^2$.
+Note that this is the probabilistic model underlying ridge regression. To
+compute the value of a subset $S$ as defined above, we have to compute the
+differential entropy of the posterior distribution of $\beta$ after observing
+the set $S$, which is exactly the distribution that leads to the ridge
+regression estimator. Thus the following computation can be seen as a study
+of how the ridge regression estimator evolves as one observes more points.
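+
+As a numerical companion to this derivation, here is a minimal Python sketch
+(the design matrix and parameters are synthetic) that performs the Bayesian
+ridge update and computes the information theoretic value
+$V(S) = H(\beta) - H(\beta\,|\,S)$, in which the $(2\pi e)^d$ factors cancel:
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(0)
+d, n, sigma2 = 3, 10, 0.5
+Sigma = np.eye(d)                 # prior covariance of beta
+X = rng.standard_normal((n, d))   # features of the observed subset S
+
+# Posterior covariance after observing S: (Sigma^{-1} + X^T X / sigma^2)^{-1}
+Sigma_post = np.linalg.inv(np.linalg.inv(Sigma) + X.T @ X / sigma2)
+
+# V(S) = H(beta) - H(beta | S) = (log det Sigma - log det Sigma_post) / 2
+V = 0.5 * (np.linalg.slogdet(Sigma)[1] - np.linalg.slogdet(Sigma_post)[1])
+print(V)
+\end{verbatim}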
+
+Let us start by computing the entropy of the multivariate normal distribution:
+\begin{displaymath}
+ H(\beta) = -\frac{1}{C}
+ \int_{b\in\mathbf{R}^d} \exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)
+    \log\left(\frac{1}{C}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)\right)db
+\end{displaymath}
+where:
+\begin{displaymath}
+ C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma}
+ = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
+\end{displaymath}
+\begin{displaymath}
+ H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d}
+ b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
+\end{displaymath}
+
+One can notice that:
+\begin{displaymath}
+ \frac{1}{C}\int_{b\in\mathbf{R}^d}
+ b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
+ = \expt{b^*\Sigma^{-1}b}
+\end{displaymath}
+where $b$ follows a multivariate normal distribution with covariance $\Sigma$
+and mean zero.
+\begin{displaymath}
+    \expt{b^*\Sigma^{-1}b} = \expt{tr(\Sigma^{-1}bb^*)}
+    = tr\big(\Sigma^{-1}\expt{bb^*}\big) = tr(\Sigma^{-1}\Sigma) = tr(I_d) = d
+\end{displaymath}
+Substituting this back into the expression for $H(\beta)$ gives:
+\begin{displaymath}
+    H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2}
+    = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
+\end{displaymath}
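+
+As a quick sanity check of this closed form (a sketch assuming NumPy and
+SciPy are available; the covariance below is synthetic), one can compare it
+numerically against a library implementation, and also verify the trace
+identity $\expt{b^*\Sigma^{-1}b} = d$ by Monte Carlo:
+\begin{verbatim}
+import numpy as np
+from scipy.stats import multivariate_normal
+
+rng = np.random.default_rng(1)
+d = 4
+A = rng.standard_normal((d, d))
+Sigma = A @ A.T + d * np.eye(d)   # arbitrary positive-definite covariance
+
+# Closed form: H(beta) = (1/2) log((2 pi e)^d det Sigma)
+closed_form = 0.5 * (d * np.log(2 * np.pi * np.e)
+                     + np.linalg.slogdet(Sigma)[1])
+reference = multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy()
+print(closed_form, reference)     # the two values agree
+
+# Monte Carlo check of E[b^* Sigma^{-1} b] = tr(Sigma^{-1} Sigma) = d
+b = rng.multivariate_normal(np.zeros(d), Sigma, size=200000)
+quad = np.einsum('ij,jk,ik->i', b, np.linalg.inv(Sigma), b)
+print(quad.mean())                # close to d
+\end{verbatim}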
\section{Submodularity}
In this section, we will consider that we are given a \emph{universe} set $U$.