 notes.tex | 506 +++++++++++++++++++++++++++++++++++++++----------------------------------
 1 file changed, 277 insertions(+), 229 deletions(-)
diff --git a/notes.tex b/notes.tex
index 12e04a5..ec0435c 100644
--- a/notes.tex
+++ b/notes.tex
@@ -1,4 +1,4 @@
-\documentclass{article}
+\documentclass[twocolumn]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amsthm,amsfonts}
\usepackage{comment}
@@ -15,48 +15,42 @@
\newcommand{\mse}{\mathop{\mathrm{MSE}}}
\DeclareMathOperator{\trace}{tr}
\DeclareMathOperator*{\argmax}{arg\,max}
-\title{Value of data: an approach to the economics of user data}
+\title{Value of data}
\begin{document}
\maketitle
\section{Introduction}
-With the recent development of communication technologies, user data is now
-present everywhere. It is also clear that there is some notion of value
-attached to user data: this is particularly obvious when you consider the
-market of targeted advertisement on the Web. However, there is no clear notion
-of what a good definition of the value of data should be.
-
-The goal of this work is to propose a framework to study the value of data.
-This framework, inspired both by the fields of Information Theory and
-Statistical Learning, leads to a natural definition of the value of data, which
-can then be used to answer economic questions regarding the use of this data in
-a data market.
+The goal of this work is to propose a framework to study the value of user
+data. Although it is clear that there is some notion of value attached to
+user data (for example, user data can be used to generate revenue through
+online advertising, recommender systems, etc.), it is not clear which
+definition of value should be used for formal work on this notion. After
+proposing a definition of value, we study how generic economic problems behave
+with regard to this definition. Finally, we study the computational
+feasibility of these problems.
\section{Data model}
+\label{sec:data-model}
There is a set of users and an experimenter. Each user has a vector of public
-features (e.g. age, height, movie ratings, binary features, labels, etc.) and
-a private piece of information, an undisclosed variable.
-
-The experimenter has a prior knowledge of how the undisclosed variable relates
-to the public vector of features, this prior knowledge is called the
-\emph{hypothesis}. The experimenter wants to test and refine his hypothesis
-against the users' data: he is going to select a set of users and ask them to
-reveal their private variable. Based on the observed data, he will be able to
-update his prior knowledge.
+features (e.g. age, height, binary features, labels, etc.) and a private piece
+of information: an undisclosed variable.
-Because the users' data helps the experimenter refining his hypothesis, there
-is a notion of value attached to a group of users which quantifies how much
-uncertainty about the hypothesis is removed by observing their data.
+The experimenter wants to learn the relationship between the public features
+and the private variable. Before conducting the experiment, he has prior
+knowledge of this relationship, called his \emph{hypothesis}; the experiment
+consists in selecting a set of users and asking them to reveal their private
+variables. Based on the observed data, the experimenter updates his hypothesis.
-The users also have a cost for revealing their data (for example for privacy
-reasons, or because it requires them some time to provide the information to
-the experimenter). The experimenter has finite resources: there is a budget
-constraint on how much he can spend.
+For the experimenter, there is a notion of value attached to a group of users:
+how much their data teaches him about the hypothesis, that is, how much it
+reduces his uncertainty about the hypothesis. For the users, there is a cost
+attached to revealing their data. The experimenter also has a budget
+constraint on the amount of money he can spend.
The problems arising in this setup are natural: the experimenter wants to
-maximize his utility, the value of the set of users he selects, but he needs to
+maximize his utility, i.e., the value of the set of users he selects, but he needs to
compensate the users by taking into account their costs and his budget
constraint. The users' costs can either be public, which directly leads to
combinatorial optimization problems, or private, in which case a notion of
@@ -67,35 +61,32 @@ Formally, there is a set of users indexed by a set $\mathcal{I}$. The public
feature vector of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set
$E$, his undisclosed variable is denoted by $y_i$ and belongs to some space
$A$. The cost of user $i$ for revealing his data is a positive real number
-$c_i\in\mathbf{R}^+$.
+$c_i\in\mathbf{R}_+$. The budget of the experimenter is denoted by $B$.
The prior knowledge of the experimenter takes the form of a random variable $H$
over $A^E$ or a subset of $A^E$ called the \emph{hypothesis set}. This random
-variable expresses $y_i$ as a function of $x_i$ through this equation:
+variable expresses his uncertainty about the true hypothesis $h$. The true
+hypothesis gives the relationship between the feature vector of user $i$ and
+his private variable through the equation:
\begin{equation}\label{eq:hypothesis-model}
- y_i = H(x_i) + \varepsilon_i
+ y_i = h(x_i) + \varepsilon_i
\end{equation}
-where $\varepsilon_i$ is a random variable over $A$ independent from $H$.
+where $\varepsilon_i$ is a random variable over $A$.
\emph{TODO: explain why this model is not restrictive: $y$ can always be
written as a deterministic function of $x$ plus something independent of
$x$}
-The framework for the experimenter is Bayesian: his prior knowledge the
-distribution of $H$ is the prior distribution. Based on the disclosed variables
-of the selected users, he updates his knowledge to get the posterior
-distribution.
-
\emph{Examples.}
\begin{enumerate}
\item if the hypothesis set is finite (the experimenter has a few
        deterministic models he wants to choose from), observing data allows him
to rule out parts of the hypothesis set. In this case, the uncertainty
- of the experimenter can simply be measured by the size of the
+ of the experimenter could simply be measured by the size of the
hypothesis set.
\item if $A=\{0,1\}$ and the hypothesis set is the set of all binary
- functions on $E$, the learning task of the experimenter implied by the
- above setup is a binary classification problem.
+ functions on $E$, the learning task of the experimenter is a binary
+ classification problem.
\item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and the hypothesis set is the set
of linear functions from $E$ to $A$: $\mathcal{L}(A,E)$, the learning
task is a linear regression. The prior knowledge of the experimenter is
@@ -103,228 +94,275 @@ distribution.
equivalent to regularized linear regression (e.g. ridge regression).
\end{enumerate}
-\section{Value of data}
+\section{Economics of data}
+\label{sec:economics}
-The experimenter does not know the function which appears in
-\eqref{eq:answer-model}. Instead, he has a given set of \emph{hypotheses}
-$\mathcal{H}$ and a prior knowledge of the true hypothesis $h$ which is modeled
-by a random variable $H$ over $\mathcal{H}$. The data model in
-\eqref{eq:answer-model} expresses that conditioned on the hypothesis, the
-answers to the experimenter's query are independent.
+The goal of this section is to discuss in more detail the optimization
+problems mentioned at the end of Section~\ref{sec:data-model}.
-The prior distribution can also be seen as an expression of the uncertainty of
-the experimenter about the true hypothesis $h$. The uncertainty of the
-distribution $P$ of $H$ can be classically measured by its entropy
-$\mathbb{H}(H)$\footnote{Here we choose to write the entropy of a discrete
-distribution, but our results do not rely on this assumption: if the
-distribution is continuous, one simply needs to replace the entropy by the
-conditional entropy.}:
-\begin{displaymath}
- \mathbb{H}(H) = -\sum_{h\in A^E}P(H = h)\log\big(P(H = h)\big)
-\end{displaymath}
+The value function (the utility function of the experimenter) will be denoted
+by $V$ and is simply a function mapping a set of users $S\subset \mathcal{I}$
+to $V(S)\in\mathbf{R}_+$. The choice of a specific value function will be
+discussed in Section~\ref{sec:data-value}.
-Given a subset $S\subset I$ of users, we denote by $\mathcal{A}_S$ the set of
-answers given by the users in $S$ according to the experimenter's knowledge of
-the hypothesis:
-\begin{displaymath}
- \mathcal{A}_S = \{Y_i,\, i\in S\} \quad\mathrm{with}
- \quad Y_i = H(x_i) + \varepsilon_i
-\end{displaymath}
+\subsection{Optimal observation selection}
-The goal of the experimenter is to use the answers of the users to learn the
-true hypothesis. This setup is the one commonly found in Bayesian Active
-Learning with Noise. In this setup the value of a data, or value of information,
-used when using the entropy as a measure of uncertainty is simply the decrease
-of entropy implied by observing the information. In our case, this leads to
-defining the value function $V$:
-\begin{equation}\label{eq:value}
- \forall S\subset I,\; V(S)
- = \mathbb{H}(H) - \mathbb{H}(H\,|\,\mathcal{A}_S)
+In this problem, the costs $(c_i)_{i\in \mathcal{I}}$ of the users are public.
+When selecting the set $S\subset\mathcal{I}$ of users, the experimenter has to
+pay exactly $\sum_{i\in S} c_i$. Hence, the optimization problem consists in
+selecting $S^*$ defined by:
+\begin{equation}\label{eq:observation-selection}
+ S^* = \argmax_{S\subset\mathcal{I}}\Big\{V(S)\;\Big|\;
+ \sum_{i\in S} c_i \leq B\Big\}
\end{equation}
-where $\mathbb{H}(H\,|\,\mathcal{A}_S)$ is the conditional entropy of $H$ given
-$\mathcal{A}_S$. One can also recognize that the definition of the entropy
-given above is simply the mutual information $I(H;\mathcal{A}_S)$ between $H$
-and $\mathcal{A}_S$.
-
-Using the \emph{information never hurts} principle, it is
-easy to see that the value function defined in \eqref{eq:value} is positive and
-set increasing. Thus one could extend the definition of the value to be any
-increasing function of the mutual information.
-The motivation of this definition of value makes sense when considering the
-information theoretic interpretation of the entropy: if we consider that the
-experimenter has access to an oracle to whom he can ask yes/no questions. Then,
-the entropy of the distribution is exactly the number of questions he needs to
-ask to fully know the hypothesis. If he needs to pay for each question asked to
-the oracle, then our definition of value directly relates to the cost decrease
-implied by observing a set of users.
-
-\section{Economics of data}
+This is a set function maximization problem under a knapsack constraint. If
+$V$ can be any function, then this problem is obviously NP-hard. A common
+assumption to make on $V$ is that it is submodular (see
+Section~\ref{sec:submodularity}), which extends the notion of convexity to set
+functions. However, maximizing a submodular function under a knapsack
+constraint is still NP-hard.
-Independently of the chosen definition of value, several optimization problems,
-motivated by economic considerations naturally arise in the above setup.
+Sviridenko (2004) gave a polynomial-time algorithm with approximation ratio
+(1-1/e) for this problem, when the value function is non-decreasing and
+submodular.
-\subsection{Optimal observation selection}
+Note that this problem covers the case where all users have the same cost $c$.
+In this case, letting $B' = \left\lfloor B/c\right\rfloor$, the problem
+becomes a maximization problem under a cardinality constraint:
+\begin{displaymath}
+    S^* = \argmax_{S\subset\mathcal{I}}\left\{V(S)\;|\;
+    |S| \leq B'\right\}
+\end{displaymath}
+for which Nemhauser (1978) gave a greedy polynomial-time algorithm achieving
+the optimal approximation ratio of (1-1/e), in the case of a non-decreasing
+submodular function.
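+
+\emph{Sketch.} As an illustration, a minimal greedy selection routine for this
+cardinality-constrained case, written in Python; the value oracle \texttt{V}
+(a function on sets of users) and the name \texttt{greedy\_select} are
+assumptions made for this sketch, not part of the model:
+\begin{verbatim}
+def greedy_select(users, V, k):
+    """Greedily pick up to k users, each time adding the user with the
+    largest marginal contribution V(S + {u}) - V(S).  For a non-decreasing
+    submodular V, this greedy rule achieves the (1 - 1/e) approximation
+    ratio of Nemhauser (1978)."""
+    S = set()
+    for _ in range(k):
+        candidates = [u for u in users if u not in S]
+        if not candidates:
+            break
+        best = max(candidates, key=lambda u: V(S | {u}) - V(S))
+        S.add(best)
+    return S
+\end{verbatim}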
-If we assume that the experimenter has a cost $c$ for each user that he
-includes in his experiment, his goal will be to optimize the value of his
-experiment while minimizing his cost. Because the cost does not depend on the
-users, this is equivalent to maximize the value while minimizing the size of
-the chosen set of users. Hence, the optimization problems takes the following
-form:
-\begin{equation}\label{eq:observation-selection}
- S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \vert S\vert \leq k}} V(S)
-\end{equation}
-\emph{Note: (1-1/e) approximation algorithm known since Nemhauser (1978)}
+\subsection{Budget feasible auctions}
-The fact that there is a function defining the value of a set of users
-strongly suggests that some users are more valuable than others. It is then
-natural that each user $i$ has a specific cost $c_i$. The above optimization
-problem then takes the form:
-\begin{equation}\label{eq:cost-observation-selection}
- S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \sum_{i\in S}c_i\leq B}} V(S)
-\end{equation}
+Here, the costs of the users are private. Before the beginning of the
+experiment, they report a cost $c_i'$ which is not necessarily equal to their
+true cost: a user may decide to lie to receive more money; however, by doing
+so, he risks not being included in the experiment.
-\emph{Note: this is known as the budgeted maximization problem, or maximization
- problem under knapsack/linear constraint. This problem is much harder than
- the previous one. A (1-1/e) approximation algorithm was known in the case
- when the value function is the one used in the Max-Coverage problem. But
-for general, non-decreasing submodular functions, this is a result by
-Sviridenko (2004) : A note on maximizing a submodular set function subject to
-knapsack constraint}
-
-\subsection{Budget feasible auctions}
+The strategic way in which users may report their costs roots this problem in
+auction theory and mechanism design. Formally, the experimenter wants to
+design an allocation function $f:
+\mathbf{R}_+^{\mathcal{I}} \rightarrow 2^{\mathcal{I}}$, which, given the
+reported costs of the users, selects the set of users to be included in the
+experiment, and a payment function $p: \mathbf{R}_+^{\mathcal{I}} \rightarrow
+\mathbf{R}_+^{\mathcal{I}}$, which, given the reported costs, returns the
+vector of payments to allocate to each user.
-In this section, the experimenter wants to compensate users for joining the
-experiment. Because the payments made by the experimenter are based on the
-costs reported by the users, reporting a false cost could be a strategy for a user
-to maximize his revenue.
+For notational convenience, we will assume the costs $\{c_i,
+i\in\mathcal{I}\}$ of the users are given, and we will denote by $\{s_i,
+i\in\mathcal{I}\}$ the characteristic function of $f(\{c_i,
+i\in\mathcal{I}\})$, that is, $s_i = 1$ iff $i\in f(\{c_i, i\in\mathcal{I}\})$.
+The payment received by user $i$ will be denoted by $p_i$.
-Therefore, this is an auction setup: the experimenter observes the costs
-reported by the users $c_i, i\in\mathcal{I}$ and select a subset $S$ of users,
-as well as payments $p_i, i\in S$ such that:
+The mechanism should satisfy the following conditions:
\begin{itemize}
- \item \textbf{individually rational}
- \item \textbf{truthful}
- \item \textbf{budget-constrained}: $\sum_{i\in S} p_i \leq B$
- \item \textbf{approximation}: $OPT \leq \alpha V(S)$ where $OPT$ is the solution
- to the budgeted maximization problem in \eqref{eq:cost-observation-selection}
+    \item \textbf{Normalized}: if $s_i = 0$ then $p_i = 0$.
+    \item \textbf{Individually rational}: $p_i \geq s_ic_i$.
+    \item \textbf{Truthful}: $p_i - s_ic_i \geq p_i' - s_i'c_i$, where $p_i'$
+        and $s_i'$ are the payment and allocation of user $i$ had he reported
+        a cost $c_i'$ different from his true cost $c_i$ (keeping the costs
+        reported by the other users the same).
+    \item \textbf{Budget feasible}: the payments should be within budget:
+        \begin{displaymath}
+            \sum_{i\in \mathcal{I}} s_ip_i \leq B
+        \end{displaymath}
\end{itemize}
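+
+\emph{Sketch.} As an illustration, these conditions can be checked numerically
+for a candidate mechanism. The following Python helper is only a hypothetical
+sketch (the names \texttt{f}, \texttt{p} and \texttt{check\_mechanism} are
+assumptions); truthfulness is tested against a finite grid of deviations only,
+which gives a necessary condition, not a proof:
+\begin{verbatim}
+def check_mechanism(f, p, costs, B, deviations, tol=1e-9):
+    """f(costs) returns the selected set of users, p(costs) the payments."""
+    n = len(costs)
+    S, pay = f(costs), p(costs)
+    s = [1 if i in S else 0 for i in range(n)]
+    normalized = all(pay[i] == 0 for i in range(n) if s[i] == 0)
+    rational = all(pay[i] >= s[i] * costs[i] - tol for i in range(n))
+    feasible = sum(s[i] * pay[i] for i in range(n)) <= B + tol
+    truthful = True
+    for i in range(n):
+        for c in deviations:  # user i reports c instead of his true cost
+            report = list(costs)
+            report[i] = c
+            S2, pay2 = f(report), p(report)
+            s2 = 1 if i in S2 else 0
+            # utility when truthful must be at least utility when deviating
+            if pay[i] - s[i] * costs[i] < pay2[i] - s2 * costs[i] - tol:
+                truthful = False
+    return normalized, rational, feasible, truthful
+\end{verbatim}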
-\emph{Note: Yaron Singer (2010) Budget feasible auctions, has a 117.7
- polynomial approximation for this problem. It is proven that no algorithm
-can do better than $2-\epsilon$ for any $\epsilon$.}
+Yaron Singer (2010) proved a lower bound of 2 on the approximation ratio
+achievable for this problem. He also gave a randomized general algorithm with
+an approximation ratio of 117.7, although for specific problems (coverage,
+knapsack, etc.) better ratios can be attained.
-\subsection{Value sharing}
+State of the art: Chen, Gravin and Lu give a lower bound of $1+\sqrt{2}$ and
+an upper bound of 8.34, when the fractional non-strategic optimization problem
+can be solved in polynomial time.
-\section{Value of data in the Gaussian world}
+\section{Value of data}
+\label{sec:data-value}
+
+Here, we discuss a choice for the value function appearing in the problems
+discussed in Section~\ref{sec:economics}. Such a value function should be at
+least normalized, positive and non-decreasing. It should furthermore capture
+a notion of \emph{uncertainty reduction} related to the learning task of the
+experimenter.
-In this section we will assume a multivariate Gaussian model:
+The prior knowledge of the experimenter, that is, the distribution of the
+random variable $H$ over the hypothesis set, conveys his uncertainty about the
+true hypothesis. A common measure of uncertainty is given by the entropy. If
+we denote by $P_H$ the probability distribution of $H$, its entropy is defined
+by\footnote{Here we choose to write the entropy of a discrete distribution, but
+ our results do not rely on this assumption: if the distribution is
+ continuous, one simply needs to replace the entropy by the differential
+entropy.}:
\begin{displaymath}
- y = \beta^*x + \epsilon
+ \mathbb{H}(H) = -\sum_{h\in A^E}P_H(h)\log\big(P_H(h)\big)
\end{displaymath}
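+For instance, if the hypothesis set is finite, of size $n$, and the prior
+distribution of $H$ is uniform over it, then $\mathbb{H}(H) = \log n$: the
+uncertainty is simply a function of the number of hypotheses still considered
+possible, which matches the first example of Section~\ref{sec:data-model}.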
-The prior distribution of $\beta$ is a multivariate normal distribution of mean
-zero and covariance $\Sigma$. $\epsilon$ is the noise which is modeled with
-a normal distribution of mean zero and variance $\sigma^2$.
+We will denote by $Y_i$, $i\in\mathcal{I}$ the answer of user $i$ according to
+the experimenter's knowledge:
+\begin{equation}\label{eq:data-model}
+ Y_i = H(x_i) + \varepsilon_i
+\end{equation}
+and $Y_S$ will denote the set of answers from the users in a subset $S$ of
+$\mathcal{I}$.
-Note that this model is the probabilistic model used in ridge regression. To
-compute the value of a subset $S$ as defined above, we have to compute the
-differential conditional entropy of the posterior distribution after observing
-the set $S$, which is exactly the distribution which leads to the ridge
-regression estimator. Thus the following computation can be seen as the study
-of how the ridge regression estimator evolves as you observe more points.
+We can now define the value of a group $S$ of users as being the decrease of
+entropy induced by observing $S$:
+\begin{equation}\label{eq:value}
+    \forall S\subset \mathcal{I},\; V(S)
+ = \mathbb{H}(H) - \mathbb{H}(H\,|\,Y_S)
+\end{equation}
+where $\mathbb{H}(H\,|\,Y_S)$ is the conditional entropy of $H$ given $Y_S$.
+One can also note that the definition of the value given in \eqref{eq:value} is
+simply the mutual information $I(H;Y_S)$ between $H$ and $Y_S$. Submodularity
+is preserved by composition on the left with a non-decreasing concave function
+(see Section~\ref{sec:submodularity}). Hence, the definition of the value of
+$S$ can be extended to any $f\big(V(S)\big)$ where $f$ is a non-decreasing
+concave function.
-Let us start by computing the entropy of the multivariate normal distribution:
-\begin{displaymath}
- H(\beta) = -\frac{1}{C}
- \int_{b\in\mathbf{R}^d} \exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)
- \log\left(-\frac{1}{C}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)\right)db
-\end{displaymath}
-where:
-\begin{displaymath}
- C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma}
- = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
-\end{displaymath}
+This notion of value, also known as the \emph{value of information} (TODO: ref),
+can be motivated by the information theoretic interpretation of the entropy:
+suppose that the experimenter has access to an oracle to whom he can ask
+yes/no questions. Then, the entropy of the distribution is essentially the
+number of questions he needs to ask to fully identify the hypothesis. If he
+needs to pay for each question asked to the oracle, then our definition of
+value directly relates to the cost decrease implied by observing a set of
+users.
-By expanding the logarithm, we get:
-\begin{displaymath}
- H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d}
- b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
-\end{displaymath}
+Using the \emph{information never hurts} principle, it is easy to see that the
+value function defined by \eqref{eq:value} is positive and non-decreasing
+(with regard to set inclusion).
-One can notice that:
-\begin{displaymath}
- \frac{1}{C}\int_{b\in\mathbf{R}^d}
- b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
- = \expt{b^*\Sigma^{-1}b}
-\end{displaymath}
-where $b$ follows a multivariate normal distribution of variance $\Sigma$
-and mean zero.
+Furthermore, if we add the natural assumption that the
+$(\varepsilon_i)_{i\in\mathcal{I}}$ are mutually independent and independent
+of $H$, which amounts to saying that, conditioned on the hypothesis, the
+$(Y_i)_{i\in\mathcal{I}}$ are independent, we get the following fact.
+
+\begin{fact}\label{value-submodularity}
+    If the $(Y_i)_{i\in\mathcal{I}}$ defined in \eqref{eq:data-model} are
+    independent conditioned on $H$, then $V$ defined in \eqref{eq:value} is
+    submodular.
+\end{fact}
+
+\begin{proof}
+ Using the chain rule, one can rewrite the value of $S$ as:
+ \begin{displaymath}
+ V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\, H)
+ \end{displaymath}
+ We can now use the conditional independence of $Y_S$ to write the
+ conditional entropy as a sum:
+ \begin{displaymath}
+ V(S) = \mathbb{H}(Y_S) - \sum_{s\in S}\mathbb{H}(Y_s\,|\, H)
+ \end{displaymath}
+
+    It is well known that the joint entropy of a set of random variables is
+    submodular: the marginal increase $\mathbb{H}(Y_{S\cup\{i\}})
+    - \mathbb{H}(Y_S) = \mathbb{H}(Y_i\,|\,Y_S)$ is non-increasing in $S$ by
+    the \emph{information never hurts} principle. Thus the last equation
+    expresses $V$ as the sum of a submodular function and of a modular
+    (additive) function. As a consequence, $V$ is submodular.
+\end{proof}
+
+\section{Value of data in the linear regression setup}
+
+In this section we will assume a linear model: the feature vectors belong to
+$\mathbf{R}^d$ and the private variables belong to $\mathbf{R}$. The private
+variable is a linear combination of the features, up to additive noise:
\begin{displaymath}
- \expt{b^*\Sigma^{-1}b} = \expt{\trace\big(\Sigma^{-1}bb^*\big)}
- = \trace\big(\Sigma^{-1}\Sigma\big)
- = 1
+ y = \beta^*x + \epsilon
\end{displaymath}
+The noise $\epsilon$ is normally distributed, zero-mean and of variance
+$\sigma^2$. Furthermore, the noise is drawn independently for each user.
+
+The hypothesis set of the experimenter is the set of all linear forms from
+$\mathbf{R}^d$ to $\mathbf{R}$. Because a linear form can be uniquely
+represented as the inner product with a fixed vector, it is equivalent to say
+that the hypothesis set is $\mathbf{R}^d$ and that the experimenter's
+hypothesis is a random variable $\beta$ over $\mathbf{R}^d$.
-Finally:
+A common assumption made in linear regression is that $\beta$ is normally
+distributed, zero-mean, and its covariance matrix is denoted by $\Sigma$. This
+prior distribution conveys the idea that $\beta$ should have a small $L^2$
+norm, that is, that the coefficients should be shrunk towards zero. Indeed, it
+is easy to prove that choosing the $\hat\beta$ which maximizes the \emph{a
+posteriori} distribution given the observations under a normal prior is
+equivalent to solving the following optimization problem:
\begin{displaymath}
- H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{1}{2}
- = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
+ \hat\beta = \arg\min_{\beta\in\mathbf{R}^d}\|Y-X\beta\|^2 + \lambda
+ \|\beta\|^2
\end{displaymath}
+where $\lambda$ can be expressed as a function of $\Sigma$ and $\sigma^2$. This
+optimization problem is known as ridge regression, and is simply a least
+squares fit of the data penalizing the vectors $\beta$ with a large $L^2$
+norm, which is consistent with the prior distribution.
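+
+\emph{Derivation sketch.} Assuming, for illustration, an isotropic prior
+$\Sigma = \tau^2 I_d$, the posterior density of $\beta$ given the observations
+$Y = X\beta + \epsilon$ satisfies:
+\begin{displaymath}
+    -\log p(\beta\,|\,Y) = \frac{1}{2\sigma^2}\|Y - X\beta\|^2
+    + \frac{1}{2\tau^2}\|\beta\|^2 + \mathrm{const}
+\end{displaymath}
+so that maximizing the posterior is the ridge regression problem above with
+$\lambda = \sigma^2/\tau^2$. For a general covariance $\Sigma$, the same
+computation gives the penalty $\sigma^2\beta^*\Sigma^{-1}\beta$ instead of
+$\lambda\|\beta\|^2$.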
-\subsubsection*{Conditional entropy}
+\begin{fact}
+ Under the linear regression model, with a multivariate normal prior, the
+ value of data of a set $S$ of users is given by:
+ \begin{equation}\label{eq:linear-regression-value}
+ V(S) = \frac{1}{2}\log\det\left(I_d
+ + \frac{\Sigma}{\sigma^2}X_S^*X_S\right)
+ \end{equation}
+    where $X_S$ is the matrix whose rows are the row vectors $x_s^*$ for $s$
+    in $S$.
+\end{fact}
-Let $S$ be a subset of $D$ of size $n$. Let us denote by $x_1,\ldots,x_n$ the
-points in $S$ and by $Y_1, \ldots, Y_n$ the associated random variables of
-interest.
+\begin{proof}
+Let us recall that the entropy of a multivariate normal variable $B$ in
+dimension $d$ and of covariance matrix $K$ (the mean is not relevant) is given
+by:
+\begin{equation}\label{eq:multivariate-entropy}
+    \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det K\big)
+\end{equation}
-Using the Bayes rule for conditional entropy, we get that:
+Using the chain rule as in the proof of Fact~\ref{value-submodularity} we get
+that:
\begin{displaymath}
- H(\beta\,|\,Y_1,\ldots ,Y_n) = H(Y_1,\ldots,Y_n\,|\,\beta)
- +H(\beta) - H(Y_1,\ldots, Y_n)
+ V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta)
\end{displaymath}
-Conditioned on $\beta$, $(Y_1,\ldots,Y_n)$ follows a multivariate normal
+Conditioned on $\beta$, $Y_S$ follows a multivariate normal
distribution of mean $X_S\beta$ and of covariance matrix $\sigma^2 I_n$,
where $n = |S|$. Hence:
-\begin{displaymath}
- H(Y_1,\ldots,Y_n\,|\,\beta)
+\begin{equation}\label{eq:h1}
+ \mathbb{H}(Y_S\,|\,\beta)
= \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
-\end{displaymath}
+\end{equation}
-$(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of mean 0. Let us
-denote by $\Sigma_Y$ its covariance matrix:
-\begin{align}
- \Sigma_Y & = \expt{YY^*} = \expt{(X\beta + E)(X\beta + E)^*}\\
- & = X\Sigma X^* + \sigma^2I_n
-\end{align}
-which yields the following expression for the joint entropy:
+$Y_S$ also follows a multivariate normal distribution of mean zero. Let us
+compute its covariance matrix, $\Sigma_Y$, using the independence between
+$\beta$ and the noise vector $\varepsilon_S = (\varepsilon_s)_{s\in S}$:
+\begin{align*}
+    \Sigma_Y & = \expt{Y_SY_S^*}
+    = \expt{(X_S\beta + \varepsilon_S)(X_S\beta + \varepsilon_S)^*}\\
+    & = X_S\Sigma X_S^* + \sigma^2I_n
+\end{align*}
+Thus, we get that:
+\begin{equation}\label{eq:h2}
+ \mathbb{H}(Y_S)
+ = \frac{1}{2}\log\left((2\pi e)^n \det(X_S\Sigma X_S^* + \sigma^2 I_n)\right)
+\end{equation}
+
+Combining \eqref{eq:h1} and \eqref{eq:h2} we get:
\begin{displaymath}
- H(Y_1,\ldots, Y_n)
- = \frac{1}{2}\log\left((2\pi e)^n \det(X\Sigma X^* + \sigma^2 I_n)\right)
+ V(S) = \frac{1}{2}\log\det\left(I_n+\frac{1}{\sigma^2}X_S\Sigma
+ X_S^*\right)
\end{displaymath}
-Combining the above equations, we get:
-\begin{align}
- H(\beta\,|\, Y_1,\ldots,Y_n)
- & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma
- \det\left(I_d + \Sigma \frac{X^*X}{\sigma^2}\right)^{-1}\right)\\
- & = - \frac{1}{2}\log\left((2\pi e)^d
-\det\left(\Sigma^{-1} + \frac{X^*X}{\sigma^2}\right)\right)
-\end{align}
-
-\subsubsection*{Value of data}
-
-By combining the formula for entropy and the conditional entropy, we get that
-the value of a set $S$ of points is equal to:
-\begin{align}
- V(S) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma\right)
- + \frac{1}{2}\log\left((2\pi e)^d \det
- \left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)\right)\\
- & = \frac{1}{2}\log\left(\det\left(I_d
- + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\right)
-\end{align}
+Finally, we can use Sylvester's determinant identity, $\det(I_n + AB)
+= \det(I_d + BA)$ for $A\in\mathbf{R}^{n\times d}$ and
+$B\in\mathbf{R}^{d\times n}$, to get the result.
+\end{proof}
+\emph{Remarks.}
+\begin{enumerate}
+    \item It is known that, given a fixed positive definite matrix $A$ and
+        a family of symmetric positive semi-definite matrices $(M_s)$,
+        defining the value of a set $S$ to be $\log\det(A + \sum_{s\in S}
+        M_s)$ yields a non-decreasing, submodular value function. Noting that:
+        \begin{displaymath}
+            X_S^*X_S = \sum_{s\in S}x_sx_s^*
+        \end{displaymath}
+        and that $V(S) = \frac{1}{2}\log\det\big(\Sigma^{-1}
+        + \frac{1}{\sigma^2}X_S^*X_S\big) + \frac{1}{2}\log\det\Sigma$, it is
+        clear that our value function is non-decreasing and submodular. The
+        positivity follows from a direct application of the spectral theorem.
+ \item This value function can be computed up to a fixed decimal precision in
+ polynomial time.
+\end{enumerate}
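+
+\emph{Sketch.} As an illustration of the last remark, a minimal computation of
+$V(S)$ from \eqref{eq:linear-regression-value} in Python with NumPy; the
+function name and the use of \texttt{slogdet} are implementation choices made
+for this sketch:
+\begin{verbatim}
+import numpy as np
+
+def value(X_S, Sigma, sigma2):
+    """V(S) = 1/2 * logdet(I_d + Sigma X_S^* X_S / sigma^2), in nats.
+
+    X_S: (n, d) matrix whose rows are the feature vectors of the users in S.
+    Sigma: (d, d) prior covariance of beta.  sigma2: noise variance."""
+    d = Sigma.shape[0]
+    M = np.eye(d) + Sigma @ (X_S.T @ X_S) / sigma2
+    sign, logdet = np.linalg.slogdet(M)
+    return 0.5 * logdet
+\end{verbatim}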
\subsubsection*{Marginal contribution}
Here, we want to compute the marginal contribution of a point $x$ to a set $S$ of
@@ -338,17 +376,24 @@ Using that:
X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
\end{displaymath}
we get:
-\begin{align}
+\begin{align*}
\Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d
- + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)
- - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
- & = \frac{1}{2}\log\det\left(I_d + \frac{xx^*}{\sigma^2}\left(\Sigma^{-1} +
-\frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
-& = \frac{1}{2}\log\left(1 + \frac{1}{\sigma^2}x^*\left(\Sigma^{-1}
-+ \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
-\end{align}
+ + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)\\
+ & - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
+ & = \frac{1}{2}\log\det\left(I_d + xx^*\left(\sigma^2\Sigma^{-1} +
+X_S^*X_S\right)^{-1}\right)\\
+& = \frac{1}{2}\log\left(1 + x^*\left(\sigma^2\Sigma^{-1}
++ X_S^*X_S\right)^{-1}x\right)
+\end{align*}
+
+\emph{Remark.} This formula shows that, given a set $S$, users do not all
+bring the same contribution to the set $S$. The contribution of a new point
+$x$ depends on its norm with respect to the bilinear form defined by the
+matrix $(\sigma^2\Sigma^{-1} + X_S^*X_S)^{-1}$, which reflects how well $x$
+\emph{aligns} with the already observed points.
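+
+\emph{Sketch.} A minimal numerical check of this rank-one formula against the
+direct definition $\Delta_x V(S) = V(S\cup\{x\}) - V(S)$, reusing the
+\texttt{value} function sketched above; the data below is random and only
+serves as an illustration:
+\begin{verbatim}
+import numpy as np
+
+def marginal_contribution(x, X_S, Sigma, sigma2):
+    """1/2 * log(1 + x^* (sigma^2 Sigma^{-1} + X_S^* X_S)^{-1} x)."""
+    M = sigma2 * np.linalg.inv(Sigma) + X_S.T @ X_S
+    return 0.5 * np.log(1.0 + x @ np.linalg.solve(M, x))
+
+rng = np.random.default_rng(0)
+d, n = 3, 5
+Sigma, sigma2 = np.eye(d), 0.5
+X_S, x = rng.normal(size=(n, d)), rng.normal(size=d)
+direct = value(np.vstack([X_S, x]), Sigma, sigma2) - value(X_S, Sigma, sigma2)
+assert np.isclose(direct, marginal_contribution(x, X_S, Sigma, sigma2))
+\end{verbatim}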
\section*{Appendix: Submodularity}
+\label{sec:submodularity}
In this section, we will consider that we are given a \emph{universe} set $U$.
A set function $f$ is a function defined on the power set of $U$, $\mathfrak{P}(U)$.
@@ -393,8 +438,11 @@ increasing functions.
Thus, by concavity of $R$:
\begin{displaymath}\label{eq:base}
- \forall\,V,\quad\frac{R\big(f(S)\big)-R\big(f(S\cup V)\big)}{f(S)-f(S\cup V)}
+ \begin{split}
+ \forall\,V,\quad\frac{R\big(f(S)\big)-R\big(f(S\cup V)\big)}{f(S)-f(S\cup
+ V)}\\
\leq\frac{R\big(f(T)\big)-R\big(f(T\cup V)\big)}{f(T)-f(T\cup V)}
+ \end{split}
\end{displaymath}
$f$ is decreasing, so multiplying this last inequality by