Diffstat (limited to 'notes.tex')
-rw-r--r--  notes.tex | 123
1 files changed, 95 insertions, 28 deletions
@@ -18,38 +18,53 @@ \section{Problem}
-\begin{itemize}
-\item database $D = (X_i)_{1\leq i\leq n}$
-\item $(X_i)_{1\leq i\leq n}\in\big(\mathbf{R^d}\big)^n$ sampled in an i.i.d.
-  fashion from the probability distribution $\mu$
-\end{itemize}
+\paragraph{Data Model}
+
+There is a database $D$ of $N$ users. Each user $i\in\{1,\ldots,N\}$ is
+represented by a vector of features $x_i\in\mathbf{R}^d$. To each user
+$i$ is also associated a variable of interest $y_i$, which can be inferred
+from $x_i$.
 
-We assume that there is a common variable $P$ which is the underlying
-phenomenon that we are interested in, and all the variables in the database
-are noisy realisations of this phenomenon:
+We will thus assume that there is a \emph{model} describing the relation
+between $y_i$ and $x_i$, which in the most general form can be written as:
 \begin{displaymath}
-  X_i = P + N_i
+  y_i = h(x_i)
 \end{displaymath}
-where the $N_i$ are pairwise independent and also independent from $P$.
+where $h$ is an unknown function.
 
+The database is valuable because, by observing the database or a subset $S$
+of the database, it is possible to learn the unknown model function $h$ (or
+to improve prior knowledge of this function) and thus to predict the variable
+of interest associated with a given user.
+
-We are interested in defining the \emph{value} of a subset $S$ of the
-database $D$. The approaches to such a definition can be classified into
-two main families:
+\paragraph{Value} Consequently, the value $V(S)$ of a subset $S$ of the
+database will be defined as the decrease of \emph{uncertainty} about $h$
+induced by observing $S$. Depending on the definition of uncertainty used, it
+is possible to distinguish two general approaches to defining $V(S)$:
 \begin{itemize}
-  \item \emph{Information theoretic:} we use the entropy of $P$ as a measure
-    of the uncertainty. Then the value of a subset $S$ is the decrease of
-    entropy induced by revealing it:
+  \item \emph{Information theoretic:} the knowledge of $h$ we have is
+    represented by a probability distribution, and the \emph{entropy} of
+    $h$ is used as a measure of the uncertainty. The observation of the set
+    $S$ allows us to update the distribution of $h$, and the uncertainty
+    becomes the conditional entropy of $h$ given the observation. This
+    leads to the definition of the value:
     \begin{displaymath}
-      V(S) = H(P) - H(P\,|\,S)
+      V(S) = H(h) - H(h\,|\,S)
     \end{displaymath}
-  \item \emph{Statistical:} in this case you are trying to estimate the
-    phenomenon. The variance of the estimation is used as a measure of
-    the uncertainty. The value of a set $S$ can be defined as the average
-    reduction of the variance over all the points in the database once you
-    observe $S$:
+  \item \emph{Statistical:} in this case, we maintain a predictor $\tilde{y}$
+    which, given a user's feature vector $x$, gives a prediction
+    $\tilde{y}(x)$ of his variable of interest. The uncertainty is then
+    measured by the variance of the predictor at a given point $x$,
+    $\sigma^2_x$. After observing a subset $S$ of the database, the
+    predictor is updated, and its new variance is denoted by
+    $\sigma_{x|S}^2$. Thus, we could define the value of $S$ by
+    $V(S) = \sigma_x^2 - \sigma_{x|S}^2$. However, this definition depends
+    on the point $x$. If we knew the distribution of $x$, we could simply
+    take the expectation of the previous quantity as a definition of the
+    value. Another way to solve the issue is to average this quantity over
+    the database, leading to the definition of the value:
     \begin{displaymath}
-      V(S) = \frac{1}{|D|}\sum_{x\in D} \sigma_x^2 - \sigma_{x|A}^2
+      V(S) = \frac{1}{N}\sum_{x\in D} \sigma_x^2 - \sigma_{x|S}^2
     \end{displaymath}
 \end{itemize}
@@ -59,9 +74,9 @@ address:
 \item \emph{budget auctions:} the points in the database are associated
   with users. Each user $i$ has a cost $c_i$ for revealing his data. You
   are given a budget $B$ and you want to select a subset $S$ of users and
-  a payment scheme which is truthful. The value of the subset
-  $S$ should be as big as possible with the constraint that the payments
-  should stay within budget.
+  a payment scheme which is truthful. The value of the subset $S$ should
+  be as big as possible with the constraint that the payments should stay
+  within budget.
 \item \emph{value sharing:} the goal is to find a value sharing method to
   split the whole value of a subset $S$ among its participants.
 \item \emph{k-best auctions:} this one is very similar to \emph{budget
@@ -69,10 +84,62 @@ address:
   on the number of people in the subset $S$.
 \end{enumerate}
 
+\emph{[TODO: add references to the papers mentioning these problems, what we
+know about the value functions defined above, etc.]}
+
 These problems have already been studied, and optimal or near-optimal solutions
-are already known when the value function is submodular. Fortunately, it can be
-shown that the information theoretic value defined above is submodular.
+are already known when the value function is submodular.
+
+\section{Data value in the Gaussian world}
+
+In this section we will assume a multivariate Gaussian model:
+\begin{displaymath}
+  y = \beta^*x + \epsilon
+\end{displaymath}
+
+The prior distribution of $\beta$ is a multivariate normal distribution with
+mean zero and covariance $\Sigma$; $\epsilon$ is the noise, which is modeled
+by a normal distribution with mean zero and variance $\sigma^2$.
+Note that this model is the probabilistic model used in ridge regression. To
+compute the value of a subset $S$ as defined above, we have to compute the
+conditional differential entropy of the posterior distribution after observing
+the set $S$, which is exactly the distribution which leads to the ridge
+regression estimator. Thus the following computation can be seen as the study
+of how the ridge regression estimator evolves as we observe more points.
+
+Let us start by computing the entropy of the multivariate normal distribution:
+\begin{displaymath}
+  H(\beta) = -\frac{1}{C}
+  \int_{b\in\mathbf{R}^d} \exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)
+  \log\left(\frac{1}{C}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)\right)db
+\end{displaymath}
+where:
+\begin{displaymath}
+  C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma}
+  = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
+\end{displaymath}
+\begin{displaymath}
+  H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d}
+  b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
+\end{displaymath}
+
+One can notice that:
+\begin{displaymath}
+  \frac{1}{C}\int_{b\in\mathbf{R}^d}
+  b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
+  = \expt{b^*\Sigma^{-1}b}
+\end{displaymath}
+where $b$ follows a multivariate normal distribution with covariance $\Sigma$
+and mean zero.
+\begin{displaymath}
+  \expt{b^*\Sigma^{-1}b} = \expt{tr(\Sigma^{-1}bb^*)} = tr(\Sigma^{-1}\Sigma)
+  = tr(I_d) = d
+\end{displaymath}
+\begin{displaymath}
+  H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2}
+  = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
+\end{displaymath}
 
 \section{Submodularity}
 
 In this section, we will consider that we are given a \emph{universe} set $U$.
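The closed form for the Gaussian entropy derived in the diff above, $H(\beta) = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)$, can be sanity-checked numerically. A minimal sketch, assuming NumPy and SciPy are available; the dimension and the covariance matrix are arbitrary choices made only for this check:

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 3
# Arbitrary positive-definite covariance for the check (A @ A.T is
# symmetric positive definite by construction).
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
Sigma = A @ A.T

# Closed form derived in the notes: H = (1/2) log((2 pi e)^d det(Sigma))
H_closed = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

# SciPy's differential entropy of N(0, Sigma)
H_scipy = multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy()

assert np.isclose(H_closed, H_scipy)
```

Note that the $+\frac{d}{2}$ term from the trace computation is what turns $(2\pi)^d$ into $(2\pi e)^d$; the agreement with SciPy would fail if the trace were taken to be $1$ instead of $d$.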
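Under the same Gaussian model, the information-theoretic value $V(S) = H(\beta) - H(\beta\,|\,S)$ has a closed form: for Bayesian linear regression the posterior covariance after observing the users in $S$ is $(\Sigma^{-1} + X_S^*X_S/\sigma^2)^{-1}$, the ridge regression covariance, so $V(S) = \frac{1}{2}\log(\det\Sigma/\det\Sigma_{\mathrm{post}})$. A sketch with NumPy only, where the feature vectors, noise variance, and prior are arbitrary synthetic choices; it also illustrates the diminishing-returns (submodularity) behaviour that the notes rely on:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 6
sigma2 = 0.5                  # noise variance sigma^2 (arbitrary test value)
Sigma = np.eye(d)             # prior covariance of beta (arbitrary test value)
X = rng.normal(size=(n, d))   # one synthetic feature vector x_i per user

def value(S):
    """V(S) = H(beta) - H(beta | S) = (1/2) log(det Sigma / det Sigma_post),
    where Sigma_post = (Sigma^-1 + X_S^T X_S / sigma^2)^-1 is the posterior
    (ridge regression) covariance after observing the users in S."""
    Xs = X[sorted(S)]
    Sigma_post = np.linalg.inv(np.linalg.inv(Sigma) + Xs.T @ Xs / sigma2)
    _, logdet_post = np.linalg.slogdet(Sigma_post)
    _, logdet_prior = np.linalg.slogdet(Sigma)
    return 0.5 * (logdet_prior - logdet_post)

def gain(S, i):
    """Marginal value of adding user i to the observed set S."""
    return value(S | {i}) - value(S)

# Observing nothing is worth nothing, and the marginal gain of a fixed user
# shrinks as the already-observed set grows (diminishing returns).
assert np.isclose(value(set()), 0.0)
assert gain({0}, 5) >= gain({0, 1, 2, 3}, 5)
```

The diminishing-returns inequality is exactly the submodularity property of this entropy-based value, which is what makes the greedy and auction-based selection schemes mentioned earlier near-optimal.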
