\documentclass{IEEEtran}
%\usepackage{mathptmx}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amsthm,amsfonts}
\newtheorem{lemma}{Lemma}
\newtheorem{fact}{Fact}
\newtheorem{example}{Example}
\newtheorem{prop}{Proposition}
\newcommand{\var}{\mathop{\mathrm{Var}}}
\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
\newcommand{\norm}[1]{\lVert#1\rVert}
\newcommand{\tr}[1]{#1^*}
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\mse}{\mathop{\mathrm{MSE}}}
\DeclareMathOperator{\trace}{tr}
\DeclareMathOperator*{\argmax}{arg\,max}
\title{Value of data}
\author{Stratis Ioannidis \and Thibaut Horel}
\begin{document}
\maketitle

\section{Introduction}
The goal of this work is to propose a framework to study the value of user data. Although it is clear that some notion of value is attached to user data (for example, user data can be used to generate revenue through online advertising, recommender systems, etc.), it is not clear which definition of value should be used in a formal study of this notion. After proposing a definition of value, we study how generic economic problems behave with respect to this definition. Finally, we study the computational feasibility of these problems.

\section{Data model}
\label{sec:data-model}
There is a set of users and an experimenter. Each user has a vector of public features (e.g.\ age, height, binary features, labels, etc.) and a private piece of information: an undisclosed variable. The experimenter wants to learn the relationship between the public features and the private variable. Before conducting the experiment, he has prior knowledge of this relationship, called his \emph{hypothesis}, and the experiment consists of selecting a set of users and asking them to reveal their private variables. Based on the observed data, the experimenter updates his hypothesis.

For the experimenter, there is a notion of value attached to a group of users: how much their data teaches him about the hypothesis, that is, how much it reduces his uncertainty about the hypothesis. For the users, there is a cost attached to revealing their data. The experimenter also has a budget constraint on the amount of money he can spend. The problems arising in this setup are natural: the experimenter wants to maximize his utility, namely the value of the set of users he selects, while compensating the users and respecting his budget constraint. The users' costs can either be public, which directly leads to combinatorial optimization problems, or private, in which case the users may report their costs strategically and the way the experimenter compensates them calls for an auction approach.

Formally, there is a set of users indexed by a set $\mathcal{I}$. The public feature vector of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set $E$, his undisclosed variable is denoted by $y_i$ and belongs to some space $A$. The cost of user $i$ for revealing his data is a positive real number $c_i\in\mathbf{R}_+$. The budget of the experimenter is denoted by $B$. The prior knowledge of the experimenter takes the form of a random variable $H$ over $A^E$, or over a subset of $A^E$ called the \emph{hypothesis set}. This random variable expresses his uncertainty about the true hypothesis $h$.
The true hypothesis gives the relationship between the feature vector of user $i$ and his private variable through the equation:
\begin{equation}\label{eq:hypothesis-model}
  y_i = h(x_i) + \varepsilon_i
\end{equation}
where $\varepsilon_i$ is a random variable over $A$. \emph{TODO: explain why this model is not restrictive: $y$ can always be written as a deterministic function of $x$ plus something independent of $x$.}

\emph{Examples.}
\begin{enumerate}
\item If the hypothesis set is finite (the experimenter has a few deterministic models he wants to choose from), observing data allows him to rule out parts of the hypothesis set. In this case, the uncertainty of the experimenter could simply be measured by the size of the remaining hypothesis set.
\item If $A=\{0,1\}$ and the hypothesis set is the set of all binary functions on $E$, the learning task of the experimenter is a binary classification problem.
\item If $A=\mathbf{R}$, $E=\mathbf{R}^d$ and the hypothesis set is the set $\mathcal{L}(E,A)$ of linear functions from $E$ to $A$, the learning task is a linear regression. The prior knowledge of the experimenter is a prior distribution on the parameters of the linear function, which is equivalent to regularized linear regression (e.g.\ ridge regression).
\end{enumerate}

\section{Economics of data}
\label{sec:economics}
The goal of this section is to further discuss the optimization problems mentioned in the previous section. The value function (the utility function of the experimenter) will be denoted by $V$; it is simply a function mapping a set of users $S\subset \mathcal{I}$ to $V(S)\in\mathbf{R}_+$. The choice of a specific value function will be discussed in Section~\ref{sec:data-value}.

\subsection{Optimal observation selection}
In this problem, the costs $(c_i)_{i\in \mathcal{I}}$ of the users are public. When selecting the set $S\subset\mathcal{I}$ of users, the experimenter has to pay exactly $\sum_{i\in S} c_i$. Hence, the optimization problem consists of selecting $S^*$ defined by:
\begin{equation}\label{eq:observation-selection}
  S^* = \argmax_{S\subset\mathcal{I}}\Big\{V(S)\;\Big|\; \sum_{i\in S} c_i \leq B\Big\}
\end{equation}
This is a set function maximization problem under a knapsack constraint. If $V$ can be an arbitrary set function, this problem is obviously NP-hard. A common assumption is that $V$ is submodular (see the Appendix), a property which can be seen as the analogue of convexity for set functions. However, maximizing a submodular function under a knapsack constraint is still NP-hard. Sviridenko (2004) gave a polynomial-time $(1-1/e)$-approximation algorithm for this problem when the value function is non-decreasing and submodular.

Note that this problem covers the case where all users have the same cost $c$. In this case, letting $B' = \left\lfloor B/c\right\rfloor$, the problem becomes a maximization problem under a cardinality constraint:
\begin{displaymath}
  S^* = \argmax_{S\subset\mathcal{I}}\left\{V(S)\;|\; |S| \leq B'\right\}
\end{displaymath}
for which Nemhauser et al.\ (1978) gave an optimal polynomial-time $(1-1/e)$-approximation algorithm in the case of a non-decreasing submodular function; a sketch of the corresponding greedy heuristic is given below.
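The sketch below is only meant to illustrate the greedy rule underlying these guarantees; the value oracle \texttt{V} is a hypothetical placeholder for any non-decreasing submodular value function, and \texttt{users} is a set of user indices.
\begin{verbatim}
# Sketch of the greedy heuristic for maximizing a
# non-decreasing submodular function V under the
# cardinality constraint |S| <= k.  `V` maps a set
# of users to a float and is a placeholder oracle.
def greedy(users, V, k):
    S = set()
    for _ in range(k):
        # user with the largest marginal gain
        best, best_gain = None, 0.0
        for i in users - S:
            gain = V(S | {i}) - V(S)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:   # no further improvement
            break
        S.add(best)
    return S
\end{verbatim}
For the knapsack-constrained problem \eqref{eq:observation-selection}, combining the same rule applied to the marginal-gain-to-cost ratio with partial enumeration over small seed sets yields Sviridenko's $(1-1/e)$ guarantee.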
\subsection{Budget feasible auctions}
Here, the costs of the users are private. Before the beginning of the experiment, each user reports a cost $c_i'$ which is not necessarily equal to his true cost: a user may decide to lie in order to receive more money; however, by doing so, he risks not being included in the experiment. This notion of strategy involved in the way the users reveal their costs roots this problem in auction and mechanism design theory. Formally, the experimenter wants to design an allocation function $f: \mathbf{R}_+^{\mathcal{I}} \rightarrow 2^{\mathcal{I}}$ which, given the reported costs of the users, selects the set of users to be included in the experiment, and a payment function $p: \mathbf{R}_+^{\mathcal{I}} \rightarrow \mathbf{R}_+^{\mathcal{I}}$ which, given the reported costs, returns the vector of payments made to the users. For notational convenience, we will assume that the costs $\{c_i,\, i\in\mathcal{I}\}$ of the users are given, and we will denote by $\{s_i,\, i\in\mathcal{I}\}$ the characteristic function of $f(\{c_i,\, i\in\mathcal{I}\})$, that is, $s_i = 1$ iff $i\in f(\{c_i,\, i\in\mathcal{I}\})$. The payment received by user $i$ will be denoted by $p_i$. The mechanism should satisfy the following conditions:
\begin{itemize}
\item \textbf{Normalized:} if $s_i = 0$ then $p_i = 0$.
\item \textbf{Individually rational:} $p_i \geq s_ic_i$.
\item \textbf{Truthful:} $p_i - s_ic_i \geq p_i' - s_i'c_i$, where $p_i'$ and $s_i'$ are the payment and allocation of user $i$ had he reported a cost $c_i'$ different from his true cost $c_i$ (keeping the costs reported by the other users unchanged).
\item \textbf{Budget feasible:} the payments should be within budget:
\begin{displaymath}
  \sum_{i\in \mathcal{I}} s_ip_i \leq B
\end{displaymath}
\end{itemize}
Singer (2010) proved a lower bound of 2 on the approximation ratio achievable by a polynomial-time mechanism for this problem. He also gave a general randomized mechanism with an approximation ratio of 117.7, although for specific problems (coverage, knapsack, etc.) better ratios can be attained. The state of the art is due to Chen, Gravin and Lu: a lower bound of $1+\sqrt{2}$ and an upper bound of 8.34, assuming the fractional relaxation of the non-strategic optimization problem can be solved in polynomial time.

\section{Value of data}
\label{sec:data-value}
Here, we discuss a choice of the value function appearing in the problems of Section~\ref{sec:economics}. Such a value function should at least be normalized, positive and non-decreasing. It should furthermore capture a notion of \emph{uncertainty reduction} related to the learning task of the experimenter.

The prior knowledge of the experimenter, i.e.\ the distribution of the random variable $H$ over the hypothesis set, conveys his uncertainty about the true hypothesis. A common measure of uncertainty is the entropy. If we denote by $P_H$ the probability distribution of $H$, its entropy is defined by\footnote{Here we choose to write the entropy of a discrete distribution, but our results do not rely on this assumption: if the distribution is continuous, one simply needs to replace the entropy by the differential entropy.}:
\begin{displaymath}
  \mathbb{H}(H) = -\sum_{h\in A^E}P_H(h)\log\big(P_H(h)\big)
\end{displaymath}
We will denote by $Y_i$, $i\in\mathcal{I}$, the answer of user $i$ as modeled by the experimenter's knowledge:
\begin{equation}\label{eq:data-model}
  Y_i = H(x_i) + \varepsilon_i
\end{equation}
and $Y_S$ will denote the set of answers of the users in a subset $S$ of $\mathcal{I}$. We can now define the value of a group $S$ of users as the decrease in entropy induced by observing $S$:
\begin{equation}\label{eq:value}
  \forall S\subset \mathcal{I},\quad V(S) = \mathbb{H}(H) - \mathbb{H}(H\,|\,Y_S)
\end{equation}
where $\mathbb{H}(H\,|\,Y_S)$ is the conditional entropy of $H$ given $Y_S$.
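As a concrete illustration, the following is a minimal numerical sketch of \eqref{eq:value} in the setting of the first example of Section~\ref{sec:data-model}: a finite hypothesis set with a uniform prior and, for simplicity, noiseless answers ($\varepsilon_i = 0$). The hypotheses and feature values are hypothetical toy data, and entropies are measured in bits.
\begin{verbatim}
# V(S) = H(H) - H(H | Y_S) for a uniform prior over a
# finite hypothesis set and noiseless answers.  The
# hypotheses and features below are toy values.
from math import log2
from collections import Counter

hypotheses = [lambda x: x % 2, lambda x: 0,
              lambda x: 1, lambda x: int(x > 1)]
features = {0: 0, 1: 1, 2: 2}   # user i -> x_i

def entropy(weights):
    tot = sum(weights)
    return -sum(w/tot * log2(w/tot) for w in weights if w)

def value(S):
    prior = entropy([1] * len(hypotheses))
    # hypotheses inducing the same answers on S remain
    # indistinguishable after the observation
    groups = Counter(tuple(h(features[i]) for i in S)
                     for h in hypotheses)
    posterior = sum(c/len(hypotheses) * entropy([1]*c)
                    for c in groups.values())
    return prior - posterior

print(value(set()), value({0}), value({0, 1, 2}))
\end{verbatim}
Observing more users refines the partition of the hypothesis set induced by the observed answers, which is exactly the uncertainty reduction that \eqref{eq:value} is meant to capture.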
One can also note that the value defined in \eqref{eq:value} is simply the mutual information $I(H;Y_S)$ between $H$ and $Y_S$. Submodularity is preserved by composition on the left with a non-decreasing concave function (see the Appendix). Hence, the definition of the value of $S$ can be extended to any $f\big(V(S)\big)$ where $f$ is a non-decreasing concave function.

This notion of value, also known as the \emph{value of information} (TODO: ref), can be motivated by the information-theoretic interpretation of the entropy: suppose that the experimenter has access to an oracle to whom he can ask yes/no questions. Then the entropy of the distribution is (up to one bit) the expected number of questions he needs to ask to fully identify the hypothesis. If he has to pay for each question asked to the oracle, our definition of value directly relates to the cost decrease implied by observing a set of users.

Using the \emph{information never hurts} principle, it is easy to see that the value function defined by \eqref{eq:value} is positive and non-decreasing (with regard to inclusion). Furthermore, if we add the natural assumption that the $(\varepsilon_i)_{i\in\mathcal{I}}$ are jointly independent, which is equivalent to saying that, conditioned on the hypothesis, the $(Y_i)_{i\in\mathcal{I}}$ are independent, we get the following fact.

\begin{fact}\label{value-submodularity}
If the $(Y_i)_{i\in\mathcal{I}}$ defined in \eqref{eq:data-model} are independent conditioned on $H$, then $V$ defined in \eqref{eq:value} is submodular.
\end{fact}
\begin{proof}
By the chain rule for entropy (equivalently, the symmetry of mutual information), one can rewrite the value of $S$ as:
\begin{displaymath}
  V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\, H)
\end{displaymath}
We can now use the conditional independence of the $Y_i$ to write the conditional entropy as a sum:
\begin{displaymath}
  V(S) = \mathbb{H}(Y_S) - \sum_{s\in S}\mathbb{H}(Y_s\,|\, H)
\end{displaymath}
It is well known that the joint entropy of a set of random variables is a submodular set function. Thus the last equation expresses $V$ as the sum of a submodular function and a modular (additive) function. As a consequence, $V$ is submodular.
\end{proof}

\section{Value of data in the linear regression setup}
In this section we assume a linear model: the feature vectors belong to $\mathbf{R}^d$ and the private variables belong to $\mathbf{R}$. The private variable is a linear combination of the features plus noise:
\begin{displaymath}
  y = \beta^*x + \epsilon
\end{displaymath}
The noise $\epsilon$ is normally distributed with mean zero and variance $\sigma^2$; furthermore, the noise terms of different users are independent and identically distributed. The hypothesis set of the experimenter is the set of all linear forms from $\mathbf{R}^d$ to $\mathbf{R}$. Because a linear form can be uniquely represented as the inner product with a vector, this is equivalent to saying that the hypothesis set is $\mathbf{R}^d$ and that the experimenter's hypothesis is a random variable $\beta$ over $\mathbf{R}^d$. A common assumption in linear regression is that $\beta$ is normally distributed with mean zero; its covariance matrix is denoted by $\Sigma$. This prior distribution conveys the idea that $\beta$ should have a small $L^2$ norm, i.e.\ it shrinks the weights of the model toward zero.
Indeed, it is easy to prove that choosing the $\hat\beta$ which maximizes the \emph{a posteriori} distribution given the observations under a normal prior is equivalent to solving the following optimization problem:
\begin{displaymath}
  \hat\beta = \arg\min_{\beta\in\mathbf{R}^d}\|Y-X\beta\|^2 + \lambda \|\beta\|^2
\end{displaymath}
where, for an isotropic prior $\Sigma = \tau^2 I_d$, $\lambda = \sigma^2/\tau^2$ (for a general $\Sigma$, the penalty becomes $\sigma^2\beta^*\Sigma^{-1}\beta$). This optimization problem is known as ridge regression: it is simply a least squares fit of the data with a penalization of the vectors $\beta$ with large $L^2$ norm, which is consistent with the prior distribution.

\begin{fact}
Under the linear regression model with a multivariate normal prior, the value of a set $S$ of users is given by:
\begin{equation}\label{eq:linear-regression-value}
  V(S) = \frac{1}{2}\log\det\left(I_d + \frac{\Sigma}{\sigma^2}X_S^*X_S\right)
\end{equation}
where $X_S$ is the matrix whose rows are the vectors $x_s^*$ for $s$ in $S$.
\end{fact}
\begin{proof}
Recall that the entropy of a multivariate normal variable $B$ in dimension $d$ with covariance matrix $\Sigma$ (the mean is not relevant) is given by:
\begin{equation}\label{eq:multivariate-entropy}
  \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma\big)
\end{equation}
Using the chain rule as in the proof of Fact~\ref{value-submodularity}, we get:
\begin{displaymath}
  V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta)
\end{displaymath}
Let $n = |S|$ and let $E = (\epsilon_s)_{s\in S}$ denote the noise vector. Conditioned on $\beta$, $Y_S$ follows a multivariate normal distribution with mean $X_S\beta$ and covariance matrix $\sigma^2 I_n$. Hence:
\begin{equation}\label{eq:h1}
  \mathbb{H}(Y_S\,|\,\beta) = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
\end{equation}
$Y_S$ also follows a multivariate normal distribution with mean zero. Let us compute its covariance matrix $\Sigma_Y$:
\begin{align*}
  \Sigma_Y & = \expt{Y_SY_S^*} = \expt{(X_S\beta + E)(X_S\beta + E)^*}\\
  & = X_S\Sigma X_S^* + \sigma^2I_n
\end{align*}
Thus, we get:
\begin{equation}\label{eq:h2}
  \mathbb{H}(Y_S) = \frac{1}{2}\log\left((2\pi e)^n \det(X_S\Sigma X_S^* + \sigma^2 I_n)\right)
\end{equation}
Combining \eqref{eq:h1} and \eqref{eq:h2} we get:
\begin{displaymath}
  V(S) = \frac{1}{2}\log\det\left(I_n+\frac{1}{\sigma^2}X_S\Sigma X_S^*\right)
\end{displaymath}
Finally, Sylvester's determinant identity yields the result.
\end{proof}

\emph{Remarks.}
\begin{enumerate}
\item It is known that, given a collection of symmetric positive semidefinite matrices, defining the value of a set to be the $\log\det$ of the identity plus the sum of the corresponding matrices yields a non-decreasing, submodular value function. Noting that:
\begin{displaymath}
  X_S^*X_S = \sum_{s\in S}x_sx_s^*
\end{displaymath}
and that $\det\big(I_d + \sigma^{-2}\Sigma X_S^*X_S\big) = \det\big(I_d + \sigma^{-2}\Sigma^{1/2} X_S^*X_S\Sigma^{1/2}\big)$, so that the summands $\sigma^{-2}\Sigma^{1/2}x_sx_s^*\Sigma^{1/2}$ are symmetric positive semidefinite, it is clear that our value function is non-decreasing and submodular. Positivity follows from a direct application of the spectral theorem.
\item The matrix which appears in the value function:
\begin{displaymath}
  I_d + \frac{\Sigma}{\sigma^2}X_S^*X_S
\end{displaymath}
is, up to the factor $\Sigma$, the inverse of the posterior covariance matrix of $\beta$, i.e.\ of the covariance matrix of the ridge regression estimator. In optimal experiment design, it is common to use the determinant of the inverse of the estimator's covariance matrix as a measure of the quality of the prediction ($D$-optimality). Indeed, this quantity is inversely proportional to the volume of the confidence ellipsoid.
\item This value function can be computed up to a fixed decimal precision in polynomial time; a small numerical sketch is given after this list.
\end{enumerate}
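The following minimal sketch evaluates \eqref{eq:linear-regression-value} on a hypothetical toy instance (the dimension, noise variance, prior covariance and feature matrix are all made up for illustration).
\begin{verbatim}
# V(S) = 1/2 log det(I_d + Sigma/sigma^2 X_S^T X_S)
# on a toy instance.
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma2 = 3, 5, 0.5
Sigma = np.eye(d)             # isotropic prior on beta
X = rng.normal(size=(n, d))   # row i = feature x_i

def value(S):
    X_S = X[sorted(S)]
    M = np.eye(d) + Sigma @ (X_S.T @ X_S) / sigma2
    return 0.5 * np.linalg.slogdet(M)[1]

print(value(set()), value({0, 1}), value(set(range(n))))
\end{verbatim}
On such an instance one can check numerically that $V(\emptyset)=0$, that $V$ is non-decreasing, and that its marginal gains are diminishing, as stated in the first remark above.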
\subsubsection*{Marginal contribution}
Here, we compute the marginal contribution of a point $x$ to a set $S$ of users:
\begin{displaymath}
  \Delta_xV(S) = V(S\cup \{x\}) - V(S)
\end{displaymath}
Using that:
\begin{displaymath}
  X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
\end{displaymath}
we get:
\begin{align*}
  \Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)\\
  & \quad - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
  & = \frac{1}{2}\log\det\left(I_d + xx^*\left(\sigma^2\Sigma^{-1} + X_S^*X_S\right)^{-1}\right)\\
  & = \frac{1}{2}\log\left(1 + x^*\left(\sigma^2\Sigma^{-1} + X_S^*X_S\right)^{-1}x\right)
\end{align*}
\emph{Remark.} This formula shows that, given a set $S$, users do not all bring the same contribution to $S$. The contribution of a new point $x$ is governed by the quadratic form defined by the matrix $(\sigma^2\Sigma^{-1} + X_S^*X_S)^{-1}$ evaluated at $x$, which reflects how well the new point \emph{aligns} with the already observed points. A numerical check of this closed form is sketched below.
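The following minimal sketch checks the closed-form expression for $\Delta_xV(S)$ against the direct difference of log-determinants; as above, the instance is a hypothetical toy example.
\begin{verbatim}
# The rank-one formula for Delta_x V(S) agrees with
# the direct difference V(S + {x}) - V(S).
import numpy as np

rng = np.random.default_rng(0)
d, sigma2 = 3, 0.5
Sigma = np.eye(d)             # toy isotropic prior
X = rng.normal(size=(5, d))   # toy feature rows x_i

def value(S):
    X_S = X[sorted(S)]
    M = np.eye(d) + Sigma @ (X_S.T @ X_S) / sigma2
    return 0.5 * np.linalg.slogdet(M)[1]

def marginal(S, i):
    X_S = X[sorted(S)]
    A = sigma2 * np.linalg.inv(Sigma) + X_S.T @ X_S
    return 0.5 * np.log(1 + X[i] @ np.linalg.solve(A, X[i]))

S = {0, 1}
print(marginal(S, 2), value(S | {2}) - value(S))  # equal
\end{verbatim}
Such marginal evaluations are exactly what the greedy heuristic sketched in Section~\ref{sec:economics} performs at each step.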
\section*{Appendix: Submodularity}
\label{sec:submodularity}
In this section, we consider a \emph{universe} set $U$. A set function $f$ is a function defined on the power set $\mathfrak{P}(U)$ of $U$. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{non-decreasing} if it is non-decreasing with regard to inclusion, that is:
\begin{displaymath}
  \forall\,S\subseteq T,\quad f(S)\leq f(T)
\end{displaymath}
A \emph{non-increasing} set function is defined similarly. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{submodular} if it satisfies the diminishing returns property: the marginal increment obtained when adding a point to a set is a non-increasing set function. More formally, for any point $x$ in $U$, the marginal increment of $f$ with regard to $x$ is the set function defined by:
\begin{displaymath}
  \Delta_x f(S) = f(S\cup\{x\}) - f(S)
\end{displaymath}
Then, $f$ is \emph{submodular} iff for all $x$ in $U$, $\Delta_x f$ is a non-increasing set function. Similarly, $f$ is \emph{supermodular} if its marginal increments are non-decreasing set functions. The following proposition is the composition property used in Section~\ref{sec:data-value}.
\begin{prop}
Let $R:\mathbf{R}\rightarrow \mathbf{R}$ be a non-decreasing concave function and $f:\mathfrak{P}(U)\rightarrow\mathbf{R}$ be a non-decreasing submodular function. Then the composed function $R\circ f$ is non-decreasing and submodular.
\end{prop}
\begin{proof}
$R\circ f$ is non-decreasing as the composition of two non-decreasing functions. For submodularity, let $S$ and $T$ be two sets such that $S\subseteq T$ and let $x\in U$; we must show that $\Delta_x (R\circ f)(S) \geq \Delta_x (R\circ f)(T)$. Write:
\begin{displaymath}
  a = f(S),\quad b = f(S\cup\{x\}),\quad c = f(T),\quad d = f(T\cup\{x\})
\end{displaymath}
Since $f$ is non-decreasing, $a\leq b$, $c\leq d$ and $a\leq c$; since $f$ is submodular, $b-a\geq d-c\geq 0$. First, $b\geq a+(d-c)$ and $R$ is non-decreasing, hence:
\begin{displaymath}
  R(b)-R(a)\geq R\big(a+(d-c)\big)-R(a)
\end{displaymath}
Second, a concave function has non-increasing increments: for any $h\geq 0$, the map $t\mapsto R(t+h)-R(t)$ is non-increasing (this follows from applying the definition of concavity to the points $t_1$, $t_1+h$, $t_2$, $t_2+h$ for $t_1\leq t_2$). Applying this with $h=d-c$ and $a\leq c$:
\begin{displaymath}
  R\big(a+(d-c)\big)-R(a)\geq R\big(c+(d-c)\big)-R(c) = R(d)-R(c)
\end{displaymath}
Combining the two inequalities yields:
\begin{displaymath}
  R\big(f(S\cup\{x\})\big)-R\big(f(S)\big)\geq R\big(f(T\cup\{x\})\big)-R\big(f(T)\big)
\end{displaymath}
that is, $\Delta_x (R\circ f)(S) \geq \Delta_x (R\circ f)(T)$, which is exactly the submodularity of $R\circ f$.
\end{proof}
\end{document}