\documentclass{article} \usepackage[utf8]{inputenc} \usepackage{amsmath,amsthm,amsfonts} \usepackage{comment} \newtheorem{lemma}{Lemma} \newtheorem{fact}{Fact} \newtheorem{example}{Example} \newtheorem{prop}{Proposition} \newcommand{\var}{\mathop{\mathrm{Var}}} \newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]} \newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]} \newcommand{\norm}[1]{\lVert#1\rVert} \newcommand{\tr}[1]{#1^*} \newcommand{\ip}[2]{\langle #1, #2 \rangle} \newcommand{\mse}{\mathop{\mathrm{MSE}}} \DeclareMathOperator{\trace}{tr} \DeclareMathOperator*{\argmax}{arg\,max} \title{Value of data: an approach to the economics of user data} \begin{document} \maketitle \section{Introduction} With the recent development of communication technologies, user data is now everywhere. It is also clear that some notion of value is attached to user data: this is particularly obvious when one considers the market for targeted advertising on the Web. However, there is no consensus on what a good definition of the value of data should be. The goal of this work is to propose a framework to study the value of data. This framework, inspired both by Information Theory and Statistical Learning, leads to a natural definition of the value of data, which can then be used to answer economic questions regarding the use of this data in a data market. \section{Data model} There is a set of users and an experimenter who is about to conduct a survey. The survey process is the following: the experimenter selects a subset of users and queries them, that is, asks the same question to each user in the selected subgroup. To each user are attached some intrinsic characteristics, known to the experimenter, which impact his answer to the experimenter's query. More precisely, the answer of a user is a function of his intrinsic characteristics distorted by some noise. Formally, we have a set of users indexed by a set $\mathcal{I}$.
The intrinsic characteristics of user $i\in\mathcal{I}$ are an element $x_i$ of a feature set $E$. When the experimenter queries user $i$, he gets the answer $y_i$: \begin{equation}\label{eq:answer-model} y_i = h(x_i) + \varepsilon_i \end{equation} where $h$ is a mapping from the feature set $E$ to the answer space $A$ and $\varepsilon_i$ is a random variable over $A$. We assume that the errors are mutually independent. \emph{Examples.} \begin{enumerate} \item if $A=\{0,1\}$ and $h$ is a binary function on $E$, the model above is a binary classification setup; \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and $h$ is a linear function, the model above is a linear regression setup. \end{enumerate} \section{Value of data} The experimenter does not know the function $h$ which appears in \eqref{eq:answer-model}. Instead, he has a given set of \emph{hypotheses} $\mathcal{H}$ and a prior knowledge of the true hypothesis $h$, which is modeled by a random variable $H$ over $\mathcal{H}$. The data model in \eqref{eq:answer-model} expresses that, conditioned on the hypothesis, the answers to the experimenter's query are independent. The prior distribution can also be seen as an expression of the uncertainty of the experimenter about the true hypothesis $h$.
The uncertainty of the distribution $P$ of $H$ can classically be measured by its entropy $\mathbb{H}(H)$\footnote{Here we write the entropy of a discrete distribution, but our results do not rely on this assumption: if the distribution is continuous, one simply needs to replace the entropy by the differential entropy.}: \begin{displaymath} \mathbb{H}(H) = -\sum_{h\in \mathcal{H}}P(H = h)\log\big(P(H = h)\big) \end{displaymath} Given a subset $S\subset \mathcal{I}$ of users, we denote by $\mathcal{A}_S$ the set of answers given by the users in $S$ according to the experimenter's knowledge of the hypothesis: \begin{displaymath} \mathcal{A}_S = \{Y_i,\, i\in S\} \quad\mathrm{with} \quad Y_i = H(x_i) + \varepsilon_i \end{displaymath} The goal of the experimenter is to use the answers of the users to learn the true hypothesis. This setup is the one commonly found in Bayesian Active Learning with Noise. In this setup, when the entropy is used as a measure of uncertainty, the value of data, or value of information, is simply the decrease in entropy implied by observing the information. In our case, this leads to defining the value function $V$: \begin{equation}\label{eq:value} \forall S\subset \mathcal{I},\; V(S) = \mathbb{H}(H) - \mathbb{H}(H\,|\,\mathcal{A}_S) \end{equation} where $\mathbb{H}(H\,|\,\mathcal{A}_S)$ is the conditional entropy of $H$ given $\mathcal{A}_S$. One can also recognize that the value defined above is simply the mutual information $I(H;\mathcal{A}_S)$ between $H$ and $\mathcal{A}_S$. Using the \emph{information never hurts} principle, it is easy to see that the value function defined in \eqref{eq:value} is nonnegative and set increasing. Thus one could extend the definition of the value to be any increasing function of the mutual information.
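To make the definition in \eqref{eq:value} concrete, the following sketch evaluates $V$ in a toy discrete setup (all specifics here are illustrative assumptions, not from the text: three users, hypotheses ranging over all binary labelings of the users with a uniform prior, and answers flipped with probability $0.1$):

```python
import itertools

import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy setup (illustrative): 3 users, hypotheses are all binary labelings
# of the users, uniform prior, each answer is the true label flipped
# with probability eps.
n_users = 3
eps = 0.1
hypotheses = list(itertools.product([0, 1], repeat=n_users))
prior = np.full(len(hypotheses), 1.0 / len(hypotheses))

def value(S):
    """V(S) = H(H) - H(H | A_S), i.e. the mutual information I(H; A_S)."""
    S = list(S)
    cond = 0.0
    # Enumerate all possible answer vectors of the users in S.
    for answers in itertools.product([0, 1], repeat=len(S)):
        # Likelihood P(answers | h): independent flips with probability eps.
        lik = np.array([
            np.prod([eps if a != h[i] else 1.0 - eps
                     for a, i in zip(answers, S)])
            for h in hypotheses])
        joint = lik * prior
        p_ans = joint.sum()
        if p_ans > 0:
            cond += p_ans * entropy(joint / p_ans)  # P(a) * H(H | A_S = a)
    return entropy(prior) - cond

print(value([]), value([0]), value([0, 1, 2]))
```

As expected, observing nobody has zero value, and the value grows with the observed set, in line with the \emph{information never hurts} principle.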
The motivation for this definition of value comes from the information theoretic interpretation of the entropy: consider an experimenter who has access to an oracle to whom he can ask yes/no questions. The entropy of the distribution is then, up to one bit, the expected number of questions he needs to ask to fully identify the hypothesis. If he needs to pay for each question asked to the oracle, our definition of value directly relates to the cost decrease implied by observing a set of users. \section{Economics of data} Independently of the chosen definition of value, several optimization problems, motivated by economic considerations, naturally arise in the above setup. \subsection{Optimal observation selection} If we assume that the experimenter pays a cost $c$ for each user that he includes in his experiment, his goal will be to optimize the value of his experiment while minimizing his cost. Because the cost does not depend on the user, this is equivalent to maximizing the value while minimizing the size of the chosen set of users. Hence, the optimization problem takes the following form: \begin{equation}\label{eq:observation-selection} S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \vert S\vert \leq k}} V(S) \end{equation} \emph{Note: a (1-1/e) approximation algorithm (the greedy algorithm) has been known since Nemhauser et al. (1978).} The fact that there is a function defining the value of a set of users strongly suggests that some users are more valuable than others. It is then natural to let each user $i$ have a specific cost $c_i$. The above optimization problem then takes the form: \begin{equation}\label{eq:cost-observation-selection} S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \sum_{i\in S}c_i\leq B}} V(S) \end{equation} \emph{Note: this is known as the budgeted maximization problem, or maximization problem under a knapsack/linear constraint. This problem is much harder than the previous one.
A (1-1/e) approximation algorithm was previously known when the value function is the Max-Coverage objective; for general nondecreasing submodular functions, this is a result of Sviridenko (2004), ``A note on maximizing a submodular set function subject to a knapsack constraint''.} \subsection{Budget feasible auctions} In this section, the experimenter wants to compensate users for joining the experiment. Because the payments made by the experimenter are based on the costs reported by the users, reporting a false cost could be a strategy for a user to maximize his revenue. Therefore, this is an auction setup: the experimenter observes the costs reported by the users, $c_i, i\in\mathcal{I}$, and selects a subset $S$ of users, as well as payments $p_i, i\in S$, such that the mechanism is: \begin{itemize} \item \textbf{individually rational}: $p_i \geq c_i$ for every selected user $i\in S$ \item \textbf{truthful}: reporting his true cost is a dominant strategy for every user \item \textbf{budget-constrained}: $\sum_{i\in S} p_i \leq B$ \item \textbf{approximation}: $OPT \leq \alpha V(S)$, where $OPT$ is the optimal value of the budgeted maximization problem in \eqref{eq:cost-observation-selection} \end{itemize} \emph{Note: Yaron Singer (2010), Budget feasible auctions, gives a polynomial-time 117.7-approximation for this problem. It is also proven that no algorithm can do better than $2-\epsilon$ for any $\epsilon>0$.} \subsection{Value sharing} \section{Value of data in the Gaussian world} In this section we assume a multivariate Gaussian model: \begin{displaymath} y = \beta^*x + \varepsilon \end{displaymath} The prior distribution of $\beta$ is a multivariate normal distribution of mean zero and covariance $\Sigma$. The noise $\varepsilon$ is modeled by a normal distribution of mean zero and variance $\sigma^2$. Note that this model is the probabilistic model used in ridge regression.
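The last remark can be made concrete: under this model, the posterior mean of $\beta$ given the observed answers coincides with a ridge regression estimator with penalty $\lambda=\sigma^2/\tau^2$ when the prior covariance is isotropic, $\Sigma=\tau^2 I_d$. A minimal numerical sketch (the dimensions, variances and the isotropic prior are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and parameters.
d, n = 3, 50
tau2, sigma2 = 2.0, 0.5        # prior variance (Sigma = tau2 * I_d), noise variance
Sigma = tau2 * np.eye(d)

beta = rng.multivariate_normal(np.zeros(d), Sigma)       # true hypothesis
X = rng.normal(size=(n, d))                              # features of the queried users
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n) # noisy answers

# Posterior mean of beta given y: (Sigma^{-1} + X^T X / sigma^2)^{-1} X^T y / sigma^2.
post_mean = np.linalg.solve(np.linalg.inv(Sigma) + X.T @ X / sigma2,
                            X.T @ y / sigma2)

# Ridge regression with penalty lambda = sigma^2 / tau^2 solves the same
# linear system (multiply the posterior system by sigma^2).
lam = sigma2 / tau2
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(post_mean, ridge))
```

The two estimators coincide, which is the usual Bayesian reading of ridge regression.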
To compute the value of a subset $S$ as defined above, we have to compute the conditional differential entropy of the posterior distribution of $\beta$ after observing the set $S$; this posterior is exactly the distribution which leads to the ridge regression estimator. Thus the following computation can be seen as a study of how the ridge regression estimator evolves as more points are observed. Let us start by computing the entropy of the multivariate normal distribution: \begin{displaymath} H(\beta) = -\frac{1}{C} \int_{b\in\mathbf{R}^d} \exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right) \log\left(\frac{1}{C}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)\right)db \end{displaymath} where: \begin{displaymath} C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma} = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db \end{displaymath} By expanding the logarithm, we get: \begin{displaymath} H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d} b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db \end{displaymath} One can notice that: \begin{displaymath} \frac{1}{C}\int_{b\in\mathbf{R}^d} b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db = \expt{b^*\Sigma^{-1}b} \end{displaymath} where $b$ follows a multivariate normal distribution of mean zero and covariance $\Sigma$. \begin{displaymath} \expt{b^*\Sigma^{-1}b} = \expt{\trace\big(\Sigma^{-1}bb^*\big)} = \trace\big(\Sigma^{-1}\expt{bb^*}\big) = \trace\big(\Sigma^{-1}\Sigma\big) = d \end{displaymath} Finally: \begin{displaymath} H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2} = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big) \end{displaymath} \subsubsection*{Conditional entropy} Let $S$ be a subset of $\mathcal{I}$ of size $n$. Let us denote by $x_1,\ldots,x_n$ the points in $S$, by $Y_1, \ldots, Y_n$ the associated random variables of interest, and by $X$ the $n\times d$ matrix whose rows are $x_1^*,\ldots,x_n^*$.
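The closed-form expression $H(\beta)=\frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)$ can be sanity-checked by Monte Carlo, writing the differential entropy as $\expt{-\log p(\beta)}$. A minimal sketch, with an arbitrary illustrative covariance:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 2
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])   # illustrative prior covariance

# Closed-form differential entropy (in nats).
H_closed = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

# Monte Carlo estimate of H = E[-log p(b)] with b ~ N(0, Sigma).
n = 200_000
b = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Sinv = np.linalg.inv(Sigma)
log_p = (-0.5 * np.einsum('ij,jk,ik->i', b, Sinv, b)
         - 0.5 * np.log((2 * np.pi) ** d * np.linalg.det(Sigma)))
H_mc = -log_p.mean()

print(H_closed, H_mc)  # the two agree up to Monte Carlo error
```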
Using the Bayes rule for conditional entropy, we get that: \begin{displaymath} H(\beta\,|\,Y_1,\ldots ,Y_n) = H(Y_1,\ldots,Y_n\,|\,\beta) +H(\beta) - H(Y_1,\ldots, Y_n) \end{displaymath} Conditioned on $\beta$, $(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of mean $X\beta$ and of covariance matrix $\sigma^2 I_n$. Hence: \begin{displaymath} H(Y_1,\ldots,Y_n\,|\,\beta) = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right) \end{displaymath} $(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of mean 0. Let us denote by $\Sigma_Y$ its covariance matrix; writing $\varepsilon = (\varepsilon_1,\ldots,\varepsilon_n)$ for the noise vector, the cross terms vanish since $\beta$ and $\varepsilon$ are independent and centered, so that: \begin{align} \Sigma_Y & = \expt{YY^*} = \expt{(X\beta + \varepsilon)(X\beta + \varepsilon)^*}\\ & = X\Sigma X^* + \sigma^2I_n \end{align} which yields the following expression for the joint entropy: \begin{displaymath} H(Y_1,\ldots, Y_n) = \frac{1}{2}\log\left((2\pi e)^n \det(X\Sigma X^* + \sigma^2 I_n)\right) \end{displaymath} Combining the above equations and using Sylvester's determinant identity $\det(I_n + AB) = \det(I_d + BA)$, we get: \begin{align} H(\beta\,|\, Y_1,\ldots,Y_n) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma \det\left(I_d + \Sigma \frac{X^*X}{\sigma^2}\right)^{-1}\right)\\ & = \frac{1}{2}\log\left((2\pi e)^d \det\left(\Sigma^{-1} + \frac{X^*X}{\sigma^2}\right)^{-1}\right) \end{align} \subsubsection*{Value of data} By combining the formula for the entropy with the one for the conditional entropy, we get that the value of a set $S$ of points is equal to: \begin{align} V(S) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma\right) - \frac{1}{2}\log\left((2\pi e)^d \det \left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\ & = \frac{1}{2}\log\left(\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\right) \end{align} \subsubsection*{Marginal contribution} Here, we want to compute the marginal contribution of a point $x$ to a set $S$ of users: \begin{displaymath} \Delta_xV(S) = V(S\cup \{x\}) - V(S) \end{displaymath} Using that: \begin{displaymath} X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^* \end{displaymath} we get: \begin{align} \Delta_x V(S) & =
\frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right) - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\ & = \frac{1}{2}\log\det\left(I_d + \frac{xx^*}{\sigma^2}\left(\Sigma^{-1} + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\ & = \frac{1}{2}\log\left(1 + \frac{1}{\sigma^2}x^*\left(\Sigma^{-1} + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right) \end{align} where the last equality follows from the matrix determinant lemma $\det(I_d + uv^*) = 1 + v^*u$. \section*{Appendix: Submodularity} In this section, we consider that we are given a \emph{universe} set $U$. A set function $f$ is a function defined on the power set of $U$, $\mathfrak{P}(U)$. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{increasing} if it is increasing with regard to inclusion, that is: \begin{displaymath} \forall\,S\subseteq T,\quad f(S)\leq f(T) \end{displaymath} A \emph{decreasing} function on $\mathfrak{P}(U)$ is defined similarly. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{submodular} if it verifies the diminishing returns property, that is, if the marginal increment obtained when adding a point to a set is a set decreasing function. More formally, for any point $x$ in $U$, we define the marginal increment of $f$ with regard to $x$ as the set function: \begin{displaymath} \Delta_x f(S) = f(S\cup\{x\}) - f(S) \end{displaymath} Then, $f$ is \emph{submodular} if and only if, for all $x$ in $U$, $\Delta_x f$ is a set decreasing function. Similarly, a \emph{supermodular} function is a function whose marginal increments are set increasing functions. \begin{prop} Let $R:\mathbf{R}\rightarrow \mathbf{R}$ be a decreasing concave function and $f:\mathfrak{P}(U)\rightarrow\mathbf{R}$ be a decreasing supermodular function, then the composed function $R\circ f$ is increasing and submodular. \end{prop} \begin{proof} The increasingness of $R\circ f$ follows immediately from the decreasingness of $R$ and $f$. For the submodularity, let $S$ and $T$ be two sets such that $S\subseteq T$, and let $V$ be any set. It suffices to prove: \begin{displaymath} R\big(f(S)\big)-R\big(f(S\cup V)\big)\leq R\big(f(T)\big)-R\big(f(T\cup V)\big) \end{displaymath} By decreasingness of $f$, we have: \begin{displaymath} f(T)\leq f(S)\quad\mathrm{and}\quad f(T\cup V)\leq f(S\cup V) \end{displaymath} and by supermodularity of $f$: \begin{displaymath} 0\leq f(T)-f(T\cup V)\leq f(S)-f(S\cup V) \end{displaymath} If $f(T)=f(T\cup V)$, the right-hand side of the desired inequality is zero while its left-hand side is nonpositive by increasingness of $R\circ f$, so the inequality holds; we may therefore assume $f(S)-f(S\cup V)\geq f(T)-f(T\cup V)>0$. The interval $\big[f(T\cup V), f(T)\big]$ lies to the left of $\big[f(S\cup V), f(S)\big]$, and the secant slopes of the concave function $R$ are nonincreasing from left to right, so: \begin{displaymath} \frac{R\big(f(S)\big)-R\big(f(S\cup V)\big)}{f(S)-f(S\cup V)} \leq\frac{R\big(f(T)\big)-R\big(f(T\cup V)\big)}{f(T)-f(T\cup V)} \end{displaymath} Multiplying this inequality by the positive quantities $f(S)-f(S\cup V)$ and $f(T)-f(T\cup V)$ yields: \begin{multline*} \Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big)\\ \leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big) \end{multline*} Moreover, $R\big(f(T)\big)-R\big(f(T\cup V)\big)\leq 0$ by increasingness of $R\circ f$; multiplying $f(T)-f(T\cup V)\leq f(S)-f(S\cup V)$ by this nonpositive quantity gives: \begin{multline*} \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)\\ \leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big) \end{multline*} Chaining the two last inequalities and dividing by $f(T)-f(T\cup V)>0$ yields: \begin{displaymath} R\big(f(S)\big)-R\big(f(S\cup V)\big) \leq R\big(f(T)\big)-R\big(f(T\cup V)\big) \end{displaymath} that is, $R\circ f(S\cup V)-R\circ f(S)\geq R\circ f(T\cup V)-R\circ f(T)$: the marginal increments of $R\circ f$ are set decreasing, which is exactly the submodularity of $R\circ f$. \end{proof} \end{document}