Diffstat (limited to 'notes.tex')
-rw-r--r--  notes.tex | 123
1 files changed, 95 insertions, 28 deletions
@@ -18,38 +18,53 @@ \section{Problem}
-\begin{itemize}
-\item database $D = (X_i)_{1\leq i\leq n}$
-\item $(X_i)_{1\leq i\leq n}\in\big(\mathbf{R^d}\big)^n$ sampled in an i.i.d.
-  fashion from the probability distribution $\mu$
-\end{itemize}
+\paragraph{Data Model}
+
+There is a database $D$ of $N$ users. Each user $i\in\{1,\ldots,N\}$ is
+represented by a vector of features $x_i\in\mathbf{R}^d$. To each user
+$i$ is also associated a variable of interest $y_i$, which can be inferred
+from $x_i$.
 
-We assume that there is a common variable $P$ which is the underlying
-phenomenon that we are interested in, and all the variables in the database
-are noisy realisations of this phenomenon:
+We will thus assume that there is a \emph{model} describing the relation
+between $y_i$ and $x_i$, which in the most general form can be written as:
 \begin{displaymath}
-  X_i = P + N_i
+  y_i = h(x_i)
 \end{displaymath}
-where the $N_i$ are pairwise independent and also independent from $P$.
+where $h$ is an unknown function.
 
+The database is valuable because, by observing the database or a subset $S$
+of the database, it is possible to learn the unknown model function $h$ (or
+to improve prior knowledge of this function) and thus to predict the variable
+of interest associated with a given user.
+
-We are interested in defining the \emph{value} of a subset $S$ of the
-database $D$. The approaches to such a definition can be classified into
-two main families:
+\paragraph{Value} Consequently, the value $V(S)$ of a subset $S$ of the
+database will be defined as the decrease of \emph{uncertainty} about $h$
+induced by observing $S$. Depending on the definition of uncertainty used, it
+is possible to distinguish two general approaches to defining $V(S)$:
 \begin{itemize}
-  \item \emph{Information theoretic:} we use the entropy of $P$ as a measure
-    of the uncertainty. Then the value of a subset $S$ is the decrease of
-    entropy induced by revealing it:
+  \item \emph{Information theoretic:} the knowledge of $h$ we have is
+    represented by a probability distribution, and the \emph{entropy} of
+    $h$ is used as a measure of the uncertainty. The observation of the set
+    $S$ allows us to update the distribution of $h$, and the uncertainty
+    becomes the conditional entropy of $h$ given the observation. This
+    leads to the definition of the value:
     \begin{displaymath}
-      V(S) = H(P) - H(P\,|\,S)
+      V(S) = H(h) - H(h\,|\,S)
     \end{displaymath}
-  \item \emph{Statistical:} in this case you are trying to estimate the
-    phenomenon. The variance of the estimation is used as a measure of
-    the uncertainty. The value of a set $S$ can be defined as the average
-    reduction of the variance over all the points in the database once you
-    observe $S$:
+  \item \emph{Statistical:} in this case, we maintain a predictor $\tilde{y}$
+    which, given a user's feature vector $x$, gives a prediction
+    $\tilde{y}(x)$ of his variable of interest. The uncertainty is then
+    measured by the variance of the predictor at a given point $x$,
+    $\sigma^2_x$. After observing a subset $S$ of the database, the
+    predictor is updated, and its new variance is denoted by
+    $\sigma_{x|S}^2$. Thus, we could define the value of $S$ by
+    $V(S) = \sigma_x^2 - \sigma_{x|S}^2$. However, this definition depends
+    on the point $x$. If we knew the distribution of $x$, we could simply
+    take the expectation of the previous quantity as a definition of the
+    value. Another way to solve the issue is to average this quantity over
+    the database, leading to the definition of the value:
     \begin{displaymath}
-      V(S) = \frac{1}{|D|}\sum_{x\in D} \sigma_x^2 - \sigma_{x|A}^2
+      V(S) = \frac{1}{N}\sum_{x\in D} \sigma_x^2 - \sigma_{x|S}^2
     \end{displaymath}
 \end{itemize}
@@ -59,9 +74,9 @@ address:
 \item \emph{budget auctions:} the points in the database are associated
   with users. Each user $i$ has a cost $c_i$ for revealing his data. You
   are given a budget $B$ and you want to select a subset $S$ of users and
-  a payment scheme which is truthful. The value of the subset
-  $S$ should be as big as possible with the constraint that the payments
-  should stay within budget.
+  a payment scheme which is truthful. The value of the subset $S$ should
+  be as big as possible with the constraint that the payments should stay
+  within budget.
 \item \emph{value sharing:} the goal is to find a value sharing method to
   split the whole value of a subset $S$ among its participants.
 \item \emph{k-best auctions:} this one is very similar to \emph{budget
@@ -69,10 +84,62 @@ address:
   on the number of people in the subset $S$.
 \end{enumerate}
 
+\emph{[TODO: add references to the papers mentioning these problems, what we
+know about the value functions defined above, etc.]}
+
 These problems have already been studied, and optimal or near-optimal solutions
-are already known when the value function is submodular. Fortunately, it can be
-shown that the information theoretic value defined above is submodular.
+are already known when the value function is submodular.
+
+\section{Data value in the Gaussian world}
+
+In this section we will assume a multivariate Gaussian model:
+\begin{displaymath}
+  y = \beta^*x + \epsilon
+\end{displaymath}
+
+The prior distribution of $\beta$ is a multivariate normal distribution with
+mean zero and covariance $\Sigma$; $\epsilon$ is the noise, which is modeled
+by a normal distribution with mean zero and variance $\sigma^2$.
+Note that this model is the probabilistic model used in ridge regression. To
+compute the value of a subset $S$ as defined above, we have to compute the
+conditional differential entropy of the posterior distribution after observing
+the set $S$, which is exactly the distribution which leads to the ridge
+regression estimator. Thus the following computation can be seen as the study
+of how the ridge regression estimator evolves as we observe more points.
+
+Let us start by computing the entropy of the multivariate normal distribution:
+\begin{displaymath}
+  H(\beta) = -\frac{1}{C}
+  \int_{b\in\mathbf{R}^d} \exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)
+  \log\left(\frac{1}{C}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)\right)db
+\end{displaymath}
+where:
+\begin{displaymath}
+  C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma}
+  = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
+\end{displaymath}
+\begin{displaymath}
+  H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d}
+  b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
+\end{displaymath}
+
+One can notice that:
+\begin{displaymath}
+  \frac{1}{C}\int_{b\in\mathbf{R}^d}
+  b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
+  = \expt{b^*\Sigma^{-1}b}
+\end{displaymath}
+where $b$ follows a multivariate normal distribution with covariance $\Sigma$
+and mean zero.
+\begin{displaymath}
+  \expt{b^*\Sigma^{-1}b} = \expt{tr(\Sigma^{-1}bb^*)} = tr(\Sigma^{-1}\Sigma)
+  = tr(I_d) = d
+\end{displaymath}
+\begin{displaymath}
+  H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2}
+  = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
+\end{displaymath}
 
 \section{Submodularity}
 
 In this section, we will consider that we are given a \emph{universe} set $U$.
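The closed form for the Gaussian entropy derived in the diff above, $H(\beta) = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)$, can be sanity-checked numerically. A minimal sketch, assuming NumPy and SciPy are available; the dimension and the covariance matrix are arbitrary choices made only for this check:

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 3
# Arbitrary positive-definite covariance for the check (A @ A.T is
# symmetric positive definite by construction).
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
Sigma = A @ A.T

# Closed form derived in the notes: H = (1/2) log((2 pi e)^d det(Sigma))
H_closed = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

# SciPy's differential entropy of N(0, Sigma)
H_scipy = multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy()

assert np.isclose(H_closed, H_scipy)
```

Note that the $+\frac{d}{2}$ term from the trace computation is what turns $(2\pi)^d$ into $(2\pi e)^d$; the agreement with SciPy would fail if the trace were taken to be $1$ instead of $d$.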
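Under the same Gaussian model, the information-theoretic value $V(S) = H(\beta) - H(\beta\,|\,S)$ has a closed form: for Bayesian linear regression the posterior covariance after observing the users in $S$ is $(\Sigma^{-1} + X_S^*X_S/\sigma^2)^{-1}$, the ridge regression covariance, so $V(S) = \frac{1}{2}\log(\det\Sigma/\det\Sigma_{\mathrm{post}})$. A sketch with NumPy only, where the feature vectors, noise variance, and prior are arbitrary synthetic choices; it also illustrates the diminishing-returns (submodularity) behaviour that the notes rely on:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 6
sigma2 = 0.5                  # noise variance sigma^2 (arbitrary test value)
Sigma = np.eye(d)             # prior covariance of beta (arbitrary test value)
X = rng.normal(size=(n, d))   # one synthetic feature vector x_i per user

def value(S):
    """V(S) = H(beta) - H(beta | S) = (1/2) log(det Sigma / det Sigma_post),
    where Sigma_post = (Sigma^-1 + X_S^T X_S / sigma^2)^-1 is the posterior
    (ridge regression) covariance after observing the users in S."""
    Xs = X[sorted(S)]
    Sigma_post = np.linalg.inv(np.linalg.inv(Sigma) + Xs.T @ Xs / sigma2)
    _, logdet_post = np.linalg.slogdet(Sigma_post)
    _, logdet_prior = np.linalg.slogdet(Sigma)
    return 0.5 * (logdet_prior - logdet_post)

def gain(S, i):
    """Marginal value of adding user i to the observed set S."""
    return value(S | {i}) - value(S)

# Observing nothing is worth nothing, and the marginal gain of a fixed user
# shrinks as the already-observed set grows (diminishing returns).
assert np.isclose(value(set()), 0.0)
assert gain({0}, 5) >= gain({0, 1, 2, 3}, 5)
```

The diminishing-returns inequality is exactly the submodularity property of this entropy-based value, which is what makes the greedy and auction-based selection schemes mentioned earlier near-optimal.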
