\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amsthm,amsfonts}
\usepackage{comment}
\newtheorem{lemma}{Lemma}
\newtheorem{fact}{Fact}
\newtheorem{example}{Example}
\newtheorem{prop}{Proposition}
\newcommand{\var}{\mathop{\mathrm{Var}}}
\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
\newcommand{\norm}[1]{\lVert#1\rVert}
\newcommand{\tr}[1]{#1^*}
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\mse}{\mathop{\mathrm{MSE}}}
\DeclareMathOperator{\trace}{tr}
\title{Value of data: an approach to the economics of user data}
\begin{document}
\maketitle
\section{Introduction}
With the recent development of communication technologies, user data is now ubiquitous. It is also clear that some notion of value is attached to user data: this is particularly obvious when one considers the market for targeted advertising on the Web. However, there is no consensus on what a good definition of the value of data should be. The goal of this work is to propose a framework to study the value of data. This framework, inspired by both Information Theory and Statistical Learning, leads to a natural definition of the value of data, which can then be used to answer economic questions regarding the use of this data in a data market.
\section{Value of data}
There is a set of users that can be queried. We assume that when a user is queried, his answer is a function of his intrinsic characteristics, distorted by some noise. An experimenter is about to conduct a survey: he will select a subset of users, query them, and use the answers to learn the function underlying the users' answers. Because the experimenter does not know this function beforehand, he maintains a probability distribution over the set of possible functions; this distribution conveys his uncertainty about the underlying function. Observing the users' answers during the survey allows him to update his knowledge of the underlying function, thus reducing his uncertainty. A classical measure of uncertainty is the entropy, which we use to define the value of the data of a subset of users: \emph{the value of a subset of users is the decrease in entropy induced by observing these users' answers to a query}.

This definition is motivated by the information-theoretic interpretation of entropy: suppose the experimenter has access to an oracle to whom he can ask yes/no questions. Then the entropy of the distribution measures the number of questions he needs to ask to fully identify the underlying function. If he has to pay for each question asked to the oracle, our definition of value directly measures the cost decrease implied by observing a set of users.

We will now formalize the above definition. The set of users is indexed by an index set $I$. The intrinsic characteristics of user $i$ are denoted by $x_i$, a vector belonging to a feature space $E$. The answer of a user, when queried, belongs to the answer space $A$. The underlying function $f$, expressing how the answer relates to the feature vector, is a mapping from $E$ to $A$. The answer of user $i$ will be denoted by $y_i$.
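The following toy instance, included purely for illustration, makes the oracle interpretation concrete.
\begin{example}
Suppose the experimenter's prior is uniform over $8$ candidate functions, and that the answers of the users in a subset $S$ always leave exactly $2$ candidates consistent with the observations. Taking logarithms in base $2$, the uncertainty drops from $\log 8 = 3$ bits to $\log 2 = 1$ bit: observing $S$ saves the experimenter two yes/no questions to the oracle, so the value of $S$, as formalized below, is $2$ bits.
\end{example}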
With these notations, the answer model takes the form:
\begin{displaymath}
y_i = f(x_i) + \varepsilon_i
\end{displaymath}
where $\varepsilon_i$, the error distorting the answer of user $i$, is a random variable over $A$. We assume that the errors are mutually independent. The experimenter's prior knowledge of $f$ is a random variable $F$ over $A^E$. The entropy of $F$ is denoted by $H(F)$:
\begin{displaymath}
H(F) = -\sum_{f\in A^E} P(F = f)\log\big(P(F = f)\big)
\end{displaymath}
If $S$ is a subset of users, we will denote by $A_S$ the set of answers given by these users, $A_S = \{y_i,\, i\in S\}$. We can now give the definition of the value $V:2^I\rightarrow \mathbf{R}^+$:
\begin{displaymath}
\forall S\in 2^I,\; V(S) = H(F) - H(F\,|\, A_S)
\end{displaymath}
where $H(F\,|\, A_S)$ denotes the conditional entropy of $F$ given $A_S$.
\section{Economics of data}
Once a value function is defined, several natural economic problems arise; they can be stated independently of the particular definition of the value function:
\begin{enumerate}
\item \emph{budget-feasible auctions:} each user $i$ has a cost $c_i$ for revealing his data. Given a budget $B$, the goal is to select a subset of users and a payment scheme such that the value of the selected subset is maximal, under the constraint that the sum of the payments stays within the budget. The mechanism must be truthful.
\item \emph{cost sharing:}
\item \emph{k-best auctions:} this problem is very similar to \emph{budget-feasible auctions}, except that the budget constraint is replaced by a constraint on the number of users in the subset $S$.
\end{enumerate}
\textbf{TODO} add references. Explain that our value function is submodular and write the theorems we get when applying previous results to our specific case.
\section{Value of data in the Gaussian world}
In this section we assume a multivariate Gaussian model:
\begin{displaymath}
y = \beta^*x + \varepsilon
\end{displaymath}
where $x$ and $\beta$ belong to $\mathbf{R}^d$. The prior distribution of $\beta$ is a multivariate normal distribution with mean zero and covariance $\Sigma$. The noise $\varepsilon$ is normally distributed with mean zero and variance $\sigma^2$. Note that this is the probabilistic model used in ridge regression. To compute the value of a subset $S$ as defined above, we have to compute the conditional differential entropy of $\beta$ given the answers of the users in $S$; the corresponding posterior distribution is precisely the one that leads to the ridge regression estimator. The following computation can thus be seen as a study of how the ridge regression estimator evolves as more points are observed.

Let us start by computing the entropy of the multivariate normal distribution:
\begin{displaymath}
H(\beta) = -\frac{1}{C} \int_{b\in\mathbf{R}^d} \exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right) \log\left(\frac{1}{C}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)\right)db
\end{displaymath}
where:
\begin{displaymath}
C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma} = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
\end{displaymath}
By expanding the logarithm, we get:
\begin{displaymath}
H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d} b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
\end{displaymath}
One can notice that:
\begin{displaymath}
\frac{1}{C}\int_{b\in\mathbf{R}^d} b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db = \expt{b^*\Sigma^{-1}b}
\end{displaymath}
where $b$ follows a multivariate normal distribution with covariance $\Sigma$ and mean zero.
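This expectation can be evaluated with the usual trace identity: a scalar equals its own trace, the trace is linear (so it commutes with the expectation), and the covariance of the centered Gaussian vector $b$ is by definition
\begin{displaymath}
\expt{bb^*} = \Sigma.
\end{displaymath}
Hence: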
\begin{displaymath}
\expt{b^*\Sigma^{-1}b} = \expt{\trace\big(\Sigma^{-1}bb^*\big)} = \trace\big(\Sigma^{-1}\expt{bb^*}\big) = \trace\big(\Sigma^{-1}\Sigma\big) = \trace(I_d) = d
\end{displaymath}
Finally:
\begin{displaymath}
H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2} = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
\end{displaymath}
\subsubsection*{Conditional entropy}
Let $S$ be a subset of $I$ of size $n$. Let us denote by $x_1,\ldots,x_n$ the feature vectors of the users in $S$, by $Y_1, \ldots, Y_n$ the associated answer random variables, and by $X$ the $n\times d$ matrix whose $i$-th row is $x_i^*$. Using Bayes' rule for conditional entropy, we get that:
\begin{displaymath}
H(\beta\,|\,Y_1,\ldots ,Y_n) = H(Y_1,\ldots,Y_n\,|\,\beta) +H(\beta) - H(Y_1,\ldots, Y_n)
\end{displaymath}
Conditioned on $\beta$, $(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of mean $X\beta$ and of covariance matrix $\sigma^2 I_n$. Hence:
\begin{displaymath}
H(Y_1,\ldots,Y_n\,|\,\beta) = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
\end{displaymath}
$(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of mean 0. Let us denote by $\Sigma_Y$ its covariance matrix:
\begin{align}
\Sigma_Y & = \expt{YY^*} = \expt{(X\beta + E)(X\beta + E)^*}\\
& = X\Sigma X^* + \sigma^2I_n
\end{align}
where $E$ denotes the noise vector $(\varepsilon_1,\ldots,\varepsilon_n)^*$; the cross terms vanish because $\beta$ and the noise are independent and centered. This yields the following expression for the joint entropy:
\begin{displaymath}
H(Y_1,\ldots, Y_n) = \frac{1}{2}\log\left((2\pi e)^n \det(X\Sigma X^* + \sigma^2 I_n)\right)
\end{displaymath}
Combining the above equations and using Sylvester's determinant identity, $\det\left(I_n + \frac{X\Sigma X^*}{\sigma^2}\right) = \det\left(I_d + \Sigma\frac{X^*X}{\sigma^2}\right)$, we get:
\begin{align}
H(\beta\,|\, Y_1,\ldots,Y_n) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma \det\left(I_d + \Sigma \frac{X^*X}{\sigma^2}\right)^{-1}\right)\\
& = \frac{1}{2}\log\left((2\pi e)^d \det\left(\Sigma^{-1} + \frac{X^*X}{\sigma^2}\right)^{-1}\right)
\end{align}
\subsubsection*{Value of data}
By combining the formula for the entropy and the one for the conditional entropy, we get that the value of a set $S$ of points, whose design matrix is denoted by $X_S$, is equal to:
\begin{align}
V(S) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma\right) - \frac{1}{2}\log\left((2\pi e)^d \det \left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
& = \frac{1}{2}\log\left(\det\Sigma\det\left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)\right)\\
& = \frac{1}{2}\log\left(\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\right)
\end{align}
\subsubsection*{Marginal contribution}
Here, we want to compute the marginal contribution of a point $x$ to a set $S$ of users:
\begin{displaymath}
\Delta_xV(S) = V(S\cup \{x\}) - V(S)
\end{displaymath}
Using that:
\begin{displaymath}
X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
\end{displaymath}
we get:
\begin{align}
\Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right) - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
& = \frac{1}{2}\log\det\left(I_d + \frac{xx^*}{\sigma^2}\left(\Sigma^{-1} + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
& = \frac{1}{2}\log\left(1 + \frac{1}{\sigma^2}x^*\left(\Sigma^{-1} + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
\end{align}
where the last equality follows from the identity $\det(I_d + uv^*) = 1 + v^*u$.
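The following short numerical sketch, which is not part of the formal development, illustrates the last two formulas on synthetic data; it assumes the Python library \texttt{numpy}, and all names and values are chosen for the illustration only. It checks that the two expressions of the marginal contribution coincide, and that the marginal contribution of a fixed point shrinks as $S$ grows, the diminishing-returns property discussed in the appendix.
\begin{verbatim}
import numpy as np

# Illustration only: value V(S) = 1/2 log det(I_d + Sigma X_S^* X_S / sigma^2)
# and marginal contribution of a new point x, on synthetic data.
rng = np.random.default_rng(0)
d, n, sigma2 = 3, 5, 0.5          # feature dimension, |S|, noise variance
Sigma = np.diag([1.0, 2.0, 0.5])  # prior covariance of beta
X_S = rng.normal(size=(n, d))     # feature vectors of the users in S (rows)
x = rng.normal(size=d)            # feature vector of an additional user

def value(X):
    # V(S) = 1/2 * log det(I_d + Sigma X^T X / sigma^2)
    _, logdet = np.linalg.slogdet(np.eye(d) + Sigma @ X.T @ X / sigma2)
    return 0.5 * logdet

def marginal(X, x):
    # closed form: 1/2 * log(1 + x^T (Sigma^{-1} + X^T X / sigma^2)^{-1} x / sigma^2)
    M = np.linalg.inv(np.linalg.inv(Sigma) + X.T @ X / sigma2)
    return 0.5 * np.log(1.0 + x @ M @ x / sigma2)

# the two expressions of the marginal contribution agree
assert np.isclose(value(np.vstack([X_S, x])) - value(X_S), marginal(X_S, x))
# and the marginal contribution decreases as S grows (diminishing returns)
print(marginal(X_S[:1], x), marginal(X_S, x))
\end{verbatim}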
\section*{Appendix: Submodularity}
In this section, we are given a \emph{universe} set $U$. A set function $f$ is a function defined on the power set of $U$, $\mathfrak{P}(U)$. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{increasing} if it is increasing with respect to inclusion, that is:
\begin{displaymath}
\forall\,S\subseteq T,\quad f(S)\leq f(T)
\end{displaymath}
A \emph{decreasing} function on $\mathfrak{P}(U)$ is defined similarly.

A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{submodular} if it satisfies the diminishing returns property, that is, if the marginal increment obtained when adding a point to a set is a decreasing set function. More formally, for any point $x$ in $U$, the marginal increment of $f$ with respect to $x$ is the set function defined as:
\begin{displaymath}
\Delta_x f(S) = f(S\cup\{x\}) - f(S)
\end{displaymath}
Then, $f$ is \emph{submodular} if and only if, for all $x$ in $U$, $\Delta_x f$ is a decreasing set function. Similarly, a \emph{supermodular} function is one whose marginal increments are increasing set functions.
\begin{prop}
Let $R:\mathbf{R}\rightarrow \mathbf{R}$ be a decreasing concave function and $f:\mathfrak{P}(U)\rightarrow\mathbf{R}$ be a decreasing supermodular function. Then the composed function $R\circ f$ is increasing and submodular.
\end{prop}
\begin{proof}
The increasingness of $R\circ f$ follows immediately from the decreasingness of $R$ and $f$. For the submodularity, let $S$ and $T$ be two sets such that $S\subseteq T$. By decreasingness of $f$, we have:
\begin{displaymath}
\forall\,V,\quad f(T)\leq f(S)\quad\mathrm{and}\quad f(T\cup V)\leq f(S\cup V)
\end{displaymath}
In what follows we may assume that $f(S)\neq f(S\cup V)$ and $f(T)\neq f(T\cup V)$: if $f(T)=f(T\cup V)$, the final inequality below holds since its left-hand side is non-positive and its right-hand side is zero, and if $f(S)=f(S\cup V)$, supermodularity combined with the decreasingness of $f$ forces $f(T)=f(T\cup V)$ as well. Since the slope of a chord of a concave function decreases when both of its endpoints increase, the concavity of $R$ gives:
\begin{displaymath}
\forall\,V,\quad\frac{R\big(f(S)\big)-R\big(f(S\cup V)\big)}{f(S)-f(S\cup V)} \leq\frac{R\big(f(T)\big)-R\big(f(T\cup V)\big)}{f(T)-f(T\cup V)}
\end{displaymath}
$f$ is decreasing, so multiplying this last inequality by $f(S)-f(S\cup V)\geq 0$ and $f(T)-f(T\cup V)\geq 0$ yields:
\begin{multline}\label{eq:base}
\forall V,\quad\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big)\\
\leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)
\end{multline}
$f$ is supermodular, so:
\begin{displaymath}
f(S\cup V)-f(S)\leq f(T\cup V)-f(T)
\end{displaymath}
$R\circ f$ is increasing, so:
\begin{displaymath}
R\big(f(S)\big)-R\big(f(S\cup V)\big)\leq 0
\end{displaymath}
By combining the two previous inequalities, we get:
\begin{multline*}
\forall V,\quad\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)\\
\leq \Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big)
\end{multline*}
Injecting this last inequality into \eqref{eq:base} gives:
\begin{multline*}
\forall V,\quad\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)\\
\leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)
\end{multline*}
Dividing both sides by $f(S)-f(S\cup V)>0$ yields:
\begin{displaymath}
\forall V,\quad R\big(f(S)\big)-R\big(f(S\cup V)\big) \leq R\big(f(T)\big)-R\big(f(T\cup V)\big)
\end{displaymath}
that is, the marginal increments of $R\circ f$ are decreasing set functions, which is exactly the submodularity of $R\circ f$.
\end{proof}
\end{document}