\documentclass{IEEEtran}
%\usepackage{mathptmx}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amsthm,amsfonts}
\newtheorem{lemma}{Lemma}
\newtheorem{fact}{Fact}
\newtheorem{example}{Example}
\newtheorem{prop}{Proposition}
\newcommand{\var}{\mathop{\mathrm{Var}}}
\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
\newcommand{\norm}[1]{\lVert#1\rVert}
\newcommand{\tr}[1]{#1^*}
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\mse}{\mathop{\mathrm{MSE}}}
\DeclareMathOperator{\trace}{tr}
\DeclareMathOperator*{\argmax}{arg\,max}
\title{Value of data}
\author{Stratis Ioannidis \and Thibaut Horel}
\begin{document}
\maketitle

\section{Introduction}
The goal of this work is to propose a framework to study the value of user data. Although it is clear that some notion of value is attached to user data (for example, user data can be used to generate revenue through online advertising, recommender systems, etc.), it is not clear which definition of value should be used in a formal study of this notion. After proposing a definition of value, we study how generic economic problems behave with respect to this definition. Finally, we study the computational feasibility of these problems.

\section{Data model}
\label{sec:data-model}
There is a set of users and an experimenter. Each user has a vector of public features (e.g.\ age, height, binary features, labels, etc.) and a private piece of information: an undisclosed variable. The experimenter wants to learn the relationship between the public features and the private variable. Before conducting the experiment, he has prior knowledge of this relationship, called his \emph{hypothesis}, and the experiment consists of selecting a set of users and asking them to reveal their private variables. Based on the observed data, the experimenter updates his hypothesis.

For the experimenter, there is a notion of value attached to a group of users: how much their data teaches him about the hypothesis, that is, how much it reduces his uncertainty about the hypothesis. For the users, there is a cost attached to revealing their data. The experimenter also has a budget constraint on the amount of money he can spend. The problems arising in this setup are natural: the experimenter wants to maximize his utility, namely the value of the set of users he selects, while compensating the users and respecting his budget constraint. The users' costs can either be public, which directly leads to combinatorial optimization problems, or private, in which case the users may report their costs strategically and the way the experimenter compensates them calls for an auction approach.

Formally, there is a set of users indexed by a set $\mathcal{I}$. The public feature vector of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set $E$, his undisclosed variable is denoted by $y_i$ and belongs to some space $A$. The cost of user $i$ for revealing his data is a positive real number $c_i\in\mathbf{R}_+$. The budget of the experimenter is denoted by $B$. The prior knowledge of the experimenter takes the form of a random variable $H$ over $A^E$, or over a subset of $A^E$ called the \emph{hypothesis set}. This random variable expresses his uncertainty about the true hypothesis $h$.
The true hypothesis gives the relationship between the feature vector of user $i$ and his private variable through the equation:
\begin{equation}\label{eq:hypothesis-model}
  y_i = h(x_i) + \varepsilon_i
\end{equation}
where $\varepsilon_i$ is a random variable over $A$. \emph{TODO: explain why this model is not restrictive: $y$ can always be written as a deterministic function of $x$ plus something independent of $x$.}

\emph{Examples.}
\begin{enumerate}
\item If the hypothesis set is finite (the experimenter has a few deterministic models he wants to choose from), observing data allows him to rule out parts of the hypothesis set. In this case, the uncertainty of the experimenter could simply be measured by the size of the remaining hypothesis set.
\item If $A=\{0,1\}$ and the hypothesis set is the set of all binary functions on $E$, the learning task of the experimenter is a binary classification problem.
\item If $A=\mathbf{R}$, $E=\mathbf{R}^d$ and the hypothesis set is the set $\mathcal{L}(E,A)$ of linear functions from $E$ to $A$, the learning task is a linear regression. The prior knowledge of the experimenter is a prior distribution on the parameters of the linear function, which is equivalent to regularized linear regression (e.g.\ ridge regression).
\end{enumerate}

\section{Economics of data}
\label{sec:economics}
The goal of this section is to further discuss the optimization problems mentioned in the previous section. The value function (the utility function of the experimenter) will be denoted by $V$; it is simply a function mapping a set of users $S\subset \mathcal{I}$ to $V(S)\in\mathbf{R}_+$. The choice of a specific value function will be discussed in Section~\ref{sec:data-value}.

\subsection{Optimal observation selection}
In this problem, the costs $(c_i)_{i\in \mathcal{I}}$ of the users are public. When selecting the set $S\subset\mathcal{I}$ of users, the experimenter has to pay exactly $\sum_{i\in S} c_i$. Hence, the optimization problem consists of selecting $S^*$ defined by:
\begin{equation}\label{eq:observation-selection}
  S^* = \argmax_{S\subset\mathcal{I}}\Big\{V(S)\;\Big|\; \sum_{i\in S} c_i \leq B\Big\}
\end{equation}
This is a set function maximization problem under a knapsack constraint. If $V$ can be an arbitrary set function, this problem is obviously NP-hard. A common assumption is that $V$ is submodular (see the Appendix), a property which can be seen as the analogue of convexity for set functions. However, maximizing a submodular function under a knapsack constraint is still NP-hard. Sviridenko (2004) gave a polynomial-time $(1-1/e)$-approximation algorithm for this problem when the value function is non-decreasing and submodular.

Note that this problem covers the case where all users have the same cost $c$. In this case, letting $B' = \left\lfloor B/c\right\rfloor$, the problem becomes a maximization problem under a cardinality constraint:
\begin{displaymath}
  S^* = \argmax_{S\subset\mathcal{I}}\left\{V(S)\;|\; |S| \leq B'\right\}
\end{displaymath}
for which Nemhauser et al.\ (1978) gave an optimal polynomial-time $(1-1/e)$-approximation algorithm in the case of a non-decreasing submodular function; a sketch of the corresponding greedy heuristic is given below.
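The sketch below is only meant to illustrate the greedy rule underlying these guarantees; the value oracle \texttt{V} is a hypothetical placeholder for any non-decreasing submodular value function, and \texttt{users} is a set of user indices.
\begin{verbatim}
# Sketch of the greedy heuristic for maximizing a
# non-decreasing submodular function V under the
# cardinality constraint |S| <= k.  `V` maps a set
# of users to a float and is a placeholder oracle.
def greedy(users, V, k):
    S = set()
    for _ in range(k):
        # user with the largest marginal gain
        best, best_gain = None, 0.0
        for i in users - S:
            gain = V(S | {i}) - V(S)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:   # no further improvement
            break
        S.add(best)
    return S
\end{verbatim}
For the knapsack-constrained problem \eqref{eq:observation-selection}, combining the same rule applied to the marginal-gain-to-cost ratio with partial enumeration over small seed sets yields Sviridenko's $(1-1/e)$ guarantee.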
\subsection{Budget feasible auctions}
Here, the costs of the users are private. Before the beginning of the experiment, each user reports a cost $c_i'$ which is not necessarily equal to his true cost: a user may decide to lie in order to receive more money; however, by doing so, he risks not being included in the experiment. This notion of strategy involved in the way the users reveal their costs roots this problem in auction and mechanism design theory. Formally, the experimenter wants to design an allocation function $f: \mathbf{R}_+^{\mathcal{I}} \rightarrow 2^{\mathcal{I}}$ which, given the reported costs of the users, selects the set of users to be included in the experiment, and a payment function $p: \mathbf{R}_+^{\mathcal{I}} \rightarrow \mathbf{R}_+^{\mathcal{I}}$ which, given the reported costs, returns the vector of payments made to the users. For notational convenience, we will assume that the costs $\{c_i,\, i\in\mathcal{I}\}$ of the users are given, and we will denote by $\{s_i,\, i\in\mathcal{I}\}$ the characteristic function of $f(\{c_i,\, i\in\mathcal{I}\})$, that is, $s_i = 1$ iff $i\in f(\{c_i,\, i\in\mathcal{I}\})$. The payment received by user $i$ will be denoted by $p_i$. The mechanism should satisfy the following conditions:
\begin{itemize}
\item \textbf{Normalized:} if $s_i = 0$ then $p_i = 0$.
\item \textbf{Individually rational:} $p_i \geq s_ic_i$.
\item \textbf{Truthful:} $p_i - s_ic_i \geq p_i' - s_i'c_i$, where $p_i'$ and $s_i'$ are the payment and allocation of user $i$ had he reported a cost $c_i'$ different from his true cost $c_i$ (keeping the costs reported by the other users unchanged).
\item \textbf{Budget feasible:} the payments should be within budget:
\begin{displaymath}
  \sum_{i\in \mathcal{I}} s_ip_i \leq B
\end{displaymath}
\end{itemize}
Singer (2010) proved a lower bound of 2 on the approximation ratio achievable by a polynomial-time mechanism for this problem. He also gave a general randomized mechanism with an approximation ratio of 117.7, although for specific problems (coverage, knapsack, etc.) better ratios can be attained. The state of the art is due to Chen, Gravin and Lu: a lower bound of $1+\sqrt{2}$ and an upper bound of 8.34, assuming the fractional relaxation of the non-strategic optimization problem can be solved in polynomial time.

\section{Value of data}
\label{sec:data-value}
Here, we discuss a choice of the value function appearing in the problems of Section~\ref{sec:economics}. Such a value function should at least be normalized, positive and non-decreasing. It should furthermore capture a notion of \emph{uncertainty reduction} related to the learning task of the experimenter.

The prior knowledge of the experimenter, i.e.\ the distribution of the random variable $H$ over the hypothesis set, conveys his uncertainty about the true hypothesis. A common measure of uncertainty is the entropy. If we denote by $P_H$ the probability distribution of $H$, its entropy is defined by\footnote{Here we choose to write the entropy of a discrete distribution, but our results do not rely on this assumption: if the distribution is continuous, one simply needs to replace the entropy by the differential entropy.}:
\begin{displaymath}
  \mathbb{H}(H) = -\sum_{h\in A^E}P_H(h)\log\big(P_H(h)\big)
\end{displaymath}
We will denote by $Y_i$, $i\in\mathcal{I}$, the answer of user $i$ as modeled by the experimenter's knowledge:
\begin{equation}\label{eq:data-model}
  Y_i = H(x_i) + \varepsilon_i
\end{equation}
and $Y_S$ will denote the set of answers of the users in a subset $S$ of $\mathcal{I}$. We can now define the value of a group $S$ of users as the decrease in entropy induced by observing $S$:
\begin{equation}\label{eq:value}
  \forall S\subset \mathcal{I},\quad V(S) = \mathbb{H}(H) - \mathbb{H}(H\,|\,Y_S)
\end{equation}
where $\mathbb{H}(H\,|\,Y_S)$ is the conditional entropy of $H$ given $Y_S$.
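As a concrete illustration, the following is a minimal numerical sketch of \eqref{eq:value} in the setting of the first example of Section~\ref{sec:data-model}: a finite hypothesis set with a uniform prior and, for simplicity, noiseless answers ($\varepsilon_i = 0$). The hypotheses and feature values are hypothetical toy data, and entropies are measured in bits.
\begin{verbatim}
# V(S) = H(H) - H(H | Y_S) for a uniform prior over a
# finite hypothesis set and noiseless answers.  The
# hypotheses and features below are toy values.
from math import log2
from collections import Counter

hypotheses = [lambda x: x % 2, lambda x: 0,
              lambda x: 1, lambda x: int(x > 1)]
features = {0: 0, 1: 1, 2: 2}   # user i -> x_i

def entropy(weights):
    tot = sum(weights)
    return -sum(w/tot * log2(w/tot) for w in weights if w)

def value(S):
    prior = entropy([1] * len(hypotheses))
    # hypotheses inducing the same answers on S remain
    # indistinguishable after the observation
    groups = Counter(tuple(h(features[i]) for i in S)
                     for h in hypotheses)
    posterior = sum(c/len(hypotheses) * entropy([1]*c)
                    for c in groups.values())
    return prior - posterior

print(value(set()), value({0}), value({0, 1, 2}))
\end{verbatim}
Observing more users refines the partition of the hypothesis set induced by the observed answers, which is exactly the uncertainty reduction that \eqref{eq:value} is meant to capture.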
One can also note that the value defined in \eqref{eq:value} is simply the mutual information $I(H;Y_S)$ between $H$ and $Y_S$. Submodularity is preserved by composition on the left with a non-decreasing concave function (see the Appendix). Hence, the definition of the value of $S$ can be extended to any $f\big(V(S)\big)$ where $f$ is a non-decreasing concave function.

This notion of value, also known as the \emph{value of information} (TODO: ref), can be motivated by the information-theoretic interpretation of the entropy: suppose that the experimenter has access to an oracle to whom he can ask yes/no questions. Then the entropy of the distribution is (up to one bit) the expected number of questions he needs to ask to fully identify the hypothesis. If he has to pay for each question asked to the oracle, our definition of value directly relates to the cost decrease implied by observing a set of users.

Using the \emph{information never hurts} principle, it is easy to see that the value function defined by \eqref{eq:value} is positive and non-decreasing (with regard to inclusion). Furthermore, if we add the natural assumption that the $(\varepsilon_i)_{i\in\mathcal{I}}$ are jointly independent, which is equivalent to saying that, conditioned on the hypothesis, the $(Y_i)_{i\in\mathcal{I}}$ are independent, we get the following fact.

\begin{fact}\label{value-submodularity}
If the $(Y_i)_{i\in\mathcal{I}}$ defined in \eqref{eq:data-model} are independent conditioned on $H$, then $V$ defined in \eqref{eq:value} is submodular.
\end{fact}
\begin{proof}
By the chain rule for entropy (equivalently, the symmetry of mutual information), one can rewrite the value of $S$ as:
\begin{displaymath}
  V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\, H)
\end{displaymath}
We can now use the conditional independence of the $Y_i$ to write the conditional entropy as a sum:
\begin{displaymath}
  V(S) = \mathbb{H}(Y_S) - \sum_{s\in S}\mathbb{H}(Y_s\,|\, H)
\end{displaymath}
It is well known that the joint entropy of a set of random variables is a submodular set function. Thus the last equation expresses $V$ as the sum of a submodular function and a modular (additive) function. As a consequence, $V$ is submodular.
\end{proof}

\section{Value of data in the linear regression setup}
In this section we assume a linear model: the feature vectors belong to $\mathbf{R}^d$ and the private variables belong to $\mathbf{R}$. The private variable is a linear combination of the features plus noise:
\begin{displaymath}
  y = \beta^*x + \epsilon
\end{displaymath}
The noise $\epsilon$ is normally distributed with mean zero and variance $\sigma^2$; furthermore, the noise terms of different users are independent and identically distributed. The hypothesis set of the experimenter is the set of all linear forms from $\mathbf{R}^d$ to $\mathbf{R}$. Because a linear form can be uniquely represented as the inner product with a vector, this is equivalent to saying that the hypothesis set is $\mathbf{R}^d$ and that the experimenter's hypothesis is a random variable $\beta$ over $\mathbf{R}^d$. A common assumption in linear regression is that $\beta$ is normally distributed with mean zero; its covariance matrix is denoted by $\Sigma$. This prior distribution conveys the idea that $\beta$ should have a small $L^2$ norm, i.e.\ it shrinks the weights of the model toward zero.
Indeed, it is easy to prove that choosing the $\hat\beta$ which maximizes the \emph{a posteriori} distribution given the observations under a normal prior is equivalent to solving the following optimization problem:
\begin{displaymath}
  \hat\beta = \arg\min_{\beta\in\mathbf{R}^d}\|Y-X\beta\|^2 + \lambda \|\beta\|^2
\end{displaymath}
where, for an isotropic prior $\Sigma = \tau^2 I_d$, $\lambda = \sigma^2/\tau^2$ (for a general $\Sigma$, the penalty becomes $\sigma^2\beta^*\Sigma^{-1}\beta$). This optimization problem is known as ridge regression: it is simply a least squares fit of the data with a penalization of the vectors $\beta$ with large $L^2$ norm, which is consistent with the prior distribution.

\begin{fact}
Under the linear regression model with a multivariate normal prior, the value of a set $S$ of users is given by:
\begin{equation}\label{eq:linear-regression-value}
  V(S) = \frac{1}{2}\log\det\left(I_d + \frac{\Sigma}{\sigma^2}X_S^*X_S\right)
\end{equation}
where $X_S$ is the matrix whose rows are the vectors $x_s^*$ for $s$ in $S$.
\end{fact}
\begin{proof}
Recall that the entropy of a multivariate normal variable $B$ in dimension $d$ with covariance matrix $\Sigma$ (the mean is not relevant) is given by:
\begin{equation}\label{eq:multivariate-entropy}
  \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma\big)
\end{equation}
Using the chain rule as in the proof of Fact~\ref{value-submodularity}, we get:
\begin{displaymath}
  V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta)
\end{displaymath}
Let $n = |S|$ and let $E = (\epsilon_s)_{s\in S}$ denote the noise vector. Conditioned on $\beta$, $Y_S$ follows a multivariate normal distribution with mean $X_S\beta$ and covariance matrix $\sigma^2 I_n$. Hence:
\begin{equation}\label{eq:h1}
  \mathbb{H}(Y_S\,|\,\beta) = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
\end{equation}
$Y_S$ also follows a multivariate normal distribution with mean zero. Let us compute its covariance matrix $\Sigma_Y$:
\begin{align*}
  \Sigma_Y & = \expt{Y_SY_S^*} = \expt{(X_S\beta + E)(X_S\beta + E)^*}\\
  & = X_S\Sigma X_S^* + \sigma^2I_n
\end{align*}
Thus, we get:
\begin{equation}\label{eq:h2}
  \mathbb{H}(Y_S) = \frac{1}{2}\log\left((2\pi e)^n \det(X_S\Sigma X_S^* + \sigma^2 I_n)\right)
\end{equation}
Combining \eqref{eq:h1} and \eqref{eq:h2} we get:
\begin{displaymath}
  V(S) = \frac{1}{2}\log\det\left(I_n+\frac{1}{\sigma^2}X_S\Sigma X_S^*\right)
\end{displaymath}
Finally, Sylvester's determinant identity yields the result.
\end{proof}

\emph{Remarks.}
\begin{enumerate}
\item It is known that, given a collection of symmetric positive semidefinite matrices, defining the value of a set to be the $\log\det$ of the identity plus the sum of the corresponding matrices yields a non-decreasing, submodular value function. Noting that:
\begin{displaymath}
  X_S^*X_S = \sum_{s\in S}x_sx_s^*
\end{displaymath}
and that $\det\big(I_d + \sigma^{-2}\Sigma X_S^*X_S\big) = \det\big(I_d + \sigma^{-2}\Sigma^{1/2} X_S^*X_S\Sigma^{1/2}\big)$, so that the summands $\sigma^{-2}\Sigma^{1/2}x_sx_s^*\Sigma^{1/2}$ are symmetric positive semidefinite, it is clear that our value function is non-decreasing and submodular. Positivity follows from a direct application of the spectral theorem.
\item The matrix which appears in the value function:
\begin{displaymath}
  I_d + \frac{\Sigma}{\sigma^2}X_S^*X_S
\end{displaymath}
is, up to the factor $\Sigma$, the inverse of the posterior covariance matrix of $\beta$, i.e.\ of the covariance matrix of the ridge regression estimator. In optimal experiment design, it is common to use the determinant of the inverse of the estimator's covariance matrix as a measure of the quality of the prediction ($D$-optimality). Indeed, this quantity is inversely proportional to the volume of the confidence ellipsoid.
\item This value function can be computed up to a fixed decimal precision in polynomial time; a small numerical sketch is given after this list.
\end{enumerate}
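The following minimal sketch evaluates \eqref{eq:linear-regression-value} on a hypothetical toy instance (the dimension, noise variance, prior covariance and feature matrix are all made up for illustration).
\begin{verbatim}
# V(S) = 1/2 log det(I_d + Sigma/sigma^2 X_S^T X_S)
# on a toy instance.
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma2 = 3, 5, 0.5
Sigma = np.eye(d)             # isotropic prior on beta
X = rng.normal(size=(n, d))   # row i = feature x_i

def value(S):
    X_S = X[sorted(S)]
    M = np.eye(d) + Sigma @ (X_S.T @ X_S) / sigma2
    return 0.5 * np.linalg.slogdet(M)[1]

print(value(set()), value({0, 1}), value(set(range(n))))
\end{verbatim}
On such an instance one can check numerically that $V(\emptyset)=0$, that $V$ is non-decreasing, and that its marginal gains are diminishing, as stated in the first remark above.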
\subsubsection*{Marginal contribution}
Here, we compute the marginal contribution of a point $x$ to a set $S$ of users:
\begin{displaymath}
  \Delta_xV(S) = V(S\cup \{x\}) - V(S)
\end{displaymath}
Using that:
\begin{displaymath}
  X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
\end{displaymath}
we get:
\begin{align*}
  \Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)\\
  & \quad - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
  & = \frac{1}{2}\log\det\left(I_d + xx^*\left(\sigma^2\Sigma^{-1} + X_S^*X_S\right)^{-1}\right)\\
  & = \frac{1}{2}\log\left(1 + x^*\left(\sigma^2\Sigma^{-1} + X_S^*X_S\right)^{-1}x\right)
\end{align*}
\emph{Remark.} This formula shows that, given a set $S$, users do not all bring the same contribution to $S$. The contribution of a new point $x$ is governed by the quadratic form defined by the matrix $(\sigma^2\Sigma^{-1} + X_S^*X_S)^{-1}$ evaluated at $x$, which reflects how well the new point \emph{aligns} with the already observed points. A numerical check of this closed form is sketched below.
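The following minimal sketch checks the closed-form expression for $\Delta_xV(S)$ against the direct difference of log-determinants; as above, the instance is a hypothetical toy example.
\begin{verbatim}
# The rank-one formula for Delta_x V(S) agrees with
# the direct difference V(S + {x}) - V(S).
import numpy as np

rng = np.random.default_rng(0)
d, sigma2 = 3, 0.5
Sigma = np.eye(d)             # toy isotropic prior
X = rng.normal(size=(5, d))   # toy feature rows x_i

def value(S):
    X_S = X[sorted(S)]
    M = np.eye(d) + Sigma @ (X_S.T @ X_S) / sigma2
    return 0.5 * np.linalg.slogdet(M)[1]

def marginal(S, i):
    X_S = X[sorted(S)]
    A = sigma2 * np.linalg.inv(Sigma) + X_S.T @ X_S
    return 0.5 * np.log(1 + X[i] @ np.linalg.solve(A, X[i]))

S = {0, 1}
print(marginal(S, 2), value(S | {2}) - value(S))  # equal
\end{verbatim}
Such marginal evaluations are exactly what the greedy heuristic sketched in Section~\ref{sec:economics} performs at each step.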
\section*{Appendix: Submodularity}
\label{sec:submodularity}
In this section, we consider a \emph{universe} set $U$. A set function $f$ is a function defined on the power set $\mathfrak{P}(U)$ of $U$. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{non-decreasing} if it is non-decreasing with regard to inclusion, that is:
\begin{displaymath}
  \forall\,S\subseteq T,\quad f(S)\leq f(T)
\end{displaymath}
A \emph{non-increasing} set function is defined similarly. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{submodular} if it satisfies the diminishing returns property: the marginal increment obtained when adding a point to a set is a non-increasing set function. More formally, for any point $x$ in $U$, the marginal increment of $f$ with regard to $x$ is the set function defined by:
\begin{displaymath}
  \Delta_x f(S) = f(S\cup\{x\}) - f(S)
\end{displaymath}
Then, $f$ is \emph{submodular} iff for all $x$ in $U$, $\Delta_x f$ is a non-increasing set function. Similarly, $f$ is \emph{supermodular} if its marginal increments are non-decreasing set functions. The following proposition is the composition property used in Section~\ref{sec:data-value}.
\begin{prop}
Let $R:\mathbf{R}\rightarrow \mathbf{R}$ be a non-decreasing concave function and $f:\mathfrak{P}(U)\rightarrow\mathbf{R}$ be a non-decreasing submodular function. Then the composed function $R\circ f$ is non-decreasing and submodular.
\end{prop}
\begin{proof}
$R\circ f$ is non-decreasing as the composition of two non-decreasing functions. For submodularity, let $S$ and $T$ be two sets such that $S\subseteq T$ and let $x\in U$; we must show that $\Delta_x (R\circ f)(S) \geq \Delta_x (R\circ f)(T)$. Write:
\begin{displaymath}
  a = f(S),\quad b = f(S\cup\{x\}),\quad c = f(T),\quad d = f(T\cup\{x\})
\end{displaymath}
Since $f$ is non-decreasing, $a\leq b$, $c\leq d$ and $a\leq c$; since $f$ is submodular, $b-a\geq d-c\geq 0$. First, $b\geq a+(d-c)$ and $R$ is non-decreasing, hence:
\begin{displaymath}
  R(b)-R(a)\geq R\big(a+(d-c)\big)-R(a)
\end{displaymath}
Second, a concave function has non-increasing increments: for any $h\geq 0$, the map $t\mapsto R(t+h)-R(t)$ is non-increasing (this follows from applying the definition of concavity to the points $t_1$, $t_1+h$, $t_2$, $t_2+h$ for $t_1\leq t_2$). Applying this with $h=d-c$ and $a\leq c$:
\begin{displaymath}
  R\big(a+(d-c)\big)-R(a)\geq R\big(c+(d-c)\big)-R(c) = R(d)-R(c)
\end{displaymath}
Combining the two inequalities yields:
\begin{displaymath}
  R\big(f(S\cup\{x\})\big)-R\big(f(S)\big)\geq R\big(f(T\cup\{x\})\big)-R\big(f(T)\big)
\end{displaymath}
that is, $\Delta_x (R\circ f)(S) \geq \Delta_x (R\circ f)(T)$, which is exactly the submodularity of $R\circ f$.
\end{proof}
\end{document}