\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amsthm,amsfonts}
\usepackage{comment}

\newtheorem{lemma}{Lemma}
\newtheorem{fact}{Fact}
\newtheorem{example}{Example}
\newtheorem{prop}{Proposition}

\newcommand{\var}{\mathop{\mathrm{Var}}}
\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
\newcommand{\norm}[1]{\lVert#1\rVert}
\newcommand{\tr}[1]{#1^*}
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\mse}{\mathop{\mathrm{MSE}}}
\newcommand{\trace}{\mathop{\mathrm{tr}}}

\begin{document}

\section{Problem}

\paragraph{Data Model}
There is a database $D$ of $N$ users. Each user $i\in\{1,\ldots,N\}$ is represented by a vector of features $x_i\in\mathbf{R}^d$. Each user $i$ is also associated with a variable of interest $y_i$, which can be inferred from $x_i$. We thus assume that there is a \emph{model} describing the relation between $y_i$ and $x_i$, which in its most general form can be written as:
\begin{displaymath}
y_i = h(x_i)
\end{displaymath}
where $h$ is an unknown function. The database is valuable because, by observing the database or a subset $S$ of it, one can learn the unknown model function $h$ (or improve prior knowledge of this function) and thus predict the variable of interest associated with a given user.

\paragraph{Value}
Consequently, the value $V(S)$ of a subset $S$ of the database is defined as the decrease of \emph{uncertainty} about $h$ induced by observing $S$. Depending on the definition of uncertainty used, one can distinguish two general approaches to defining $V(S)$:
\begin{itemize}
\item \emph{Information theoretic:} our knowledge of $h$ is represented by a probability distribution, and the \emph{entropy} of $h$ is used as a measure of uncertainty. Observing the set $S$ allows us to update the distribution of $h$, and the uncertainty becomes the conditional entropy of $h$ given the observation. This leads to the definition:
\begin{displaymath}
V(S) = H(h) - H(h\,|\,S)
\end{displaymath}
\item \emph{Statistical:} in this case, we maintain a predictor $\tilde{y}$ which, given a user's feature vector $x$, returns a prediction $\tilde{y}(x)$ of his variable of interest. The uncertainty is then measured by the variance of the predictor at a given point $x$, denoted $\sigma^2_x$. After observing a subset $S$ of the database, the predictor is updated, and its new variance is denoted by $\sigma_{x|S}^2$. Thus, we could define the value of $S$ by $V(S) = \sigma_x^2 - \sigma_{x|S}^2$. However, this definition depends on the point $x$. If we knew the distribution of $x$, we could simply take the expectation of the previous quantity as the definition of the value. Another way to resolve the issue is to average this quantity over the database, leading to the definition:
\begin{displaymath}
V(S) = \frac{1}{N}\sum_{x\in D} \big(\sigma_x^2 - \sigma_{x|S}^2\big)
\end{displaymath}
\end{itemize}
Once a notion of value is defined, there are several problems one can address:
\begin{enumerate}
\item \emph{budget auctions:} the points in the database are associated with users. Each user $i$ has a cost $c_i$ for revealing his data. You are given a budget $B$ and you want to select a subset $S$ of users together with a payment scheme which is truthful. The value of the subset $S$ should be as large as possible under the constraint that the payments stay within the budget.
\item \emph{value sharing:} the goal is to find a value sharing method to split the whole value of a subset $S$ among its participants.
\item \emph{$k$-best auctions:} this problem is very similar to \emph{budget auctions}, but instead of a budget constraint, the constraint is on the number of people in the subset $S$.
\end{enumerate}
\emph{[TODO: add references to the papers mentioning these problems, what we know about the value functions defined above, etc.]}
These problems have already been studied, and optimal or near-optimal solutions are known when the value function is submodular.

\section{Data value in the Gaussian world}

In this section we assume a multivariate Gaussian model:
\begin{displaymath}
y = \beta^*x + \epsilon
\end{displaymath}
The prior distribution of $\beta$ is a multivariate normal distribution with mean zero and covariance $\Sigma$; the noise $\epsilon$ is modeled by a normal distribution with mean zero and variance $\sigma^2$. Note that this is the probabilistic model underlying ridge regression. To compute the value of a subset $S$ as defined above, we have to compute the conditional differential entropy of the posterior distribution after observing the set $S$, which is exactly the distribution that leads to the ridge regression estimator. Thus the following computation can be seen as a study of how the ridge regression estimator evolves as more points are observed.

Let us start by computing the entropy of the multivariate normal distribution:
\begin{displaymath}
H(\beta) = -\frac{1}{C} \int_{b\in\mathbf{R}^d} \exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right) \log\left(\frac{1}{C}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)\right)db
\end{displaymath}
where:
\begin{displaymath}
C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma} = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
\end{displaymath}
Expanding the logarithm gives:
\begin{displaymath}
H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d} b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
\end{displaymath}
One can notice that:
\begin{displaymath}
\frac{1}{C}\int_{b\in\mathbf{R}^d} b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db = \expt{b^*\Sigma^{-1}b}
\end{displaymath}
where $b$ follows a multivariate normal distribution with mean zero and covariance $\Sigma$, so that:
\begin{displaymath}
\expt{b^*\Sigma^{-1}b} = \expt{\trace(\Sigma^{-1}bb^*)} = \trace\big(\Sigma^{-1}\expt{bb^*}\big) = \trace(\Sigma^{-1}\Sigma) = d
\end{displaymath}
Hence:
\begin{displaymath}
H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2} = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
\end{displaymath}
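To connect this entropy to the value defined in the previous section, one can use the posterior distribution of $\beta$ after observing $S$. We record it here without proof, as it is the standard Bayesian linear regression (ridge) computation: if $X_S$ denotes the matrix whose rows are the $x_i^*$ for $i\in S$, the posterior of $\beta$ is Gaussian with covariance
\begin{displaymath}
\Sigma_{|S} = \left(\Sigma^{-1} + \frac{1}{\sigma^2}X_S^*X_S\right)^{-1}
\end{displaymath}
Since this covariance does not depend on the observed values $y_i$, the conditional entropy is simply the entropy of the posterior, and the entropy formula above gives:
\begin{displaymath}
V(S) = H(\beta) - H(\beta\,|\,S) = \frac{1}{2}\log\frac{\det\Sigma}{\det\Sigma_{|S}} = \frac{1}{2}\log\det\left(I_d + \frac{1}{\sigma^2}\Sigma X_S^*X_S\right)
\end{displaymath}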
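The following short Python sketch (an illustration of ours, not part of the formal development) computes both notions of value on a synthetic database, using the posterior covariance above; the prior, noise level, and dimensions are arbitrary choices.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma2 = 50, 3, 0.01       # database size, feature dim, noise variance
X = rng.normal(size=(N, d))      # feature vectors x_i as rows
Sigma = np.eye(d)                # prior covariance of beta (arbitrary choice)

def posterior_cov(S):
    """Posterior covariance of beta after observing the users indexed by S."""
    XS = X[sorted(S)]
    return np.linalg.inv(np.linalg.inv(Sigma) + XS.T @ XS / sigma2)

def value_entropy(S):
    """Information-theoretic value V(S) = H(beta) - H(beta | S)."""
    _, logdet_prior = np.linalg.slogdet(Sigma)
    _, logdet_post = np.linalg.slogdet(posterior_cov(S))
    return 0.5 * (logdet_prior - logdet_post)

def value_variance(S):
    """Statistical value: average reduction of predictive variance over D."""
    Sigma_S = posterior_cov(S)
    return np.mean([x @ (Sigma - Sigma_S) @ x for x in X])

S = {0, 1, 2}
print(value_entropy(S), value_variance(S))
# marginal gains shrink as S grows (diminishing returns):
print(value_entropy(S | {3}) - value_entropy(S))
\end{verbatim}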
\section{Submodularity}

In this section, we consider a \emph{universe} set $U$. A set function $f$ is a function defined on the power set of $U$, $\mathfrak{P}(U)$. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{increasing} if it is increasing with respect to inclusion, that is:
\begin{displaymath}
\forall\,S\subseteq T,\quad f(S)\leq f(T)
\end{displaymath}
A \emph{decreasing} function on $\mathfrak{P}(U)$ is defined similarly. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{submodular} if it satisfies the diminishing returns property, that is, the marginal increment from adding a point to a set is a decreasing set function. More formally, for any point $x$ in $U$, the marginal increment of $f$ with respect to $x$ is the set function defined as:
\begin{displaymath}
\Delta_x f(S) = f(S\cup\{x\}) - f(S)
\end{displaymath}
Then, $f$ is \emph{submodular} if and only if, for all $x$ in $U$, $\Delta_x f$ is a decreasing set function. Similarly, a \emph{supermodular} function is one whose marginal increments are increasing set functions. Note that, by adding the elements of a set $V$ one at a time and summing the increments, both properties extend from single points to sets: for instance, if $f$ is supermodular then, for all $S\subseteq T$ and all $V$ disjoint from $T$, $f(S\cup V)-f(S)\leq f(T\cup V)-f(T)$.

\begin{prop}
Let $R:\mathbf{R}\rightarrow \mathbf{R}$ be a decreasing concave function and $f:\mathfrak{P}(U)\rightarrow\mathbf{R}$ be a decreasing supermodular function. Then the composed function $R\circ f$ is increasing and submodular.
\end{prop}
\begin{proof}
The increasingness of $R\circ f$ follows immediately from the decreasingness of $R$ and $f$. For the submodularity, let $S$ and $T$ be two sets such that $S\subseteq T$, and let $V$ be a set disjoint from $T$. By decreasingness of $f$, we have:
\begin{displaymath}
f(T)\leq f(S)\quad\mathrm{and}\quad f(T\cup V)\leq f(S\cup V)
\end{displaymath}
If $f(S)=f(S\cup V)$ then, by the set version of supermodularity and the decreasingness of $f$, $0\leq f(T)-f(T\cup V)\leq f(S)-f(S\cup V)=0$, and both marginal increments of $R\circ f$ vanish; if only $f(T)=f(T\cup V)$, the desired inequality follows from the increasingness of $R\circ f$. We may therefore assume $f(S\cup V)<f(S)$ and $f(T\cup V)<f(T)$. The chord slopes of a concave function decrease when both endpoints increase, so the concavity of $R$ gives:
\begin{displaymath}
\frac{R\big(f(S)\big)-R\big(f(S\cup V)\big)}{f(S)-f(S\cup V)} \leq\frac{R\big(f(T)\big)-R\big(f(T\cup V)\big)}{f(T)-f(T\cup V)}
\end{displaymath}
$f$ is decreasing, so the two denominators are positive, and multiplying by both yields:
\begin{multline}\label{eq:base}
\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big)\\
\leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)
\end{multline}
$f$ is supermodular, so:
\begin{displaymath}
f(T)-f(T\cup V)\leq f(S)-f(S\cup V)
\end{displaymath}
$R\circ f$ is increasing, so:
\begin{displaymath}
R\big(f(S)\big)-R\big(f(S\cup V)\big)\leq 0
\end{displaymath}
Multiplying the first of these two inequalities by the nonpositive quantity $R\big(f(S)\big)-R\big(f(S\cup V)\big)$ reverses it:
\begin{multline*}
\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)\\
\leq \Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big)
\end{multline*}
Injecting \eqref{eq:base} into this last inequality gives:
\begin{multline*}
\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)\\
\leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)
\end{multline*}
Dividing both sides by $f(S)-f(S\cup V)>0$ yields:
\begin{displaymath}
R\big(f(S)\big)-R\big(f(S\cup V)\big) \leq R\big(f(T)\big)-R\big(f(T\cup V)\big)
\end{displaymath}
that is, $(R\circ f)(S\cup V)-(R\circ f)(S) \geq (R\circ f)(T\cup V)-(R\circ f)(T)$, which is exactly the submodularity of $R\circ f$.
\end{proof}
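As a quick sanity check of the proposition (an illustration of ours, with an arbitrary choice of $R$ and $f$ satisfying its hypotheses): $f(S)=\prod_{i\in S}q_i$ with $q_i\in(0,1)$ is decreasing and supermodular, $R(x)=-x^2$ is decreasing and concave on $[0,1]$, and the following Python sketch verifies by brute force that $R\circ f$ is indeed increasing and submodular on a small universe.
\begin{verbatim}
from itertools import combinations
from math import prod

U = range(4)
q = [0.9, 0.7, 0.5, 0.3]   # one factor q_i in (0,1) per element of U

def f(S):
    """f(S) = product of q_i for i in S: decreasing and supermodular."""
    return prod(q[i] for i in S)

def R(x):
    """Decreasing and concave on [0,1]."""
    return -x * x

def g(S):
    return R(f(S))

subsets = [frozenset(c) for k in range(len(U) + 1)
           for c in combinations(U, k)]

# increasing: S subset of T implies g(S) <= g(T)
assert all(g(S) <= g(T) for S in subsets for T in subsets if S <= T)

# submodular: for S subset of T and x not in T, the marginal
# increment of g at S dominates the one at T
assert all(g(S | {x}) - g(S) >= g(T | {x}) - g(T)
           for S in subsets for T in subsets if S <= T
           for x in U if x not in T)

print("R o f is increasing and submodular on this instance")
\end{verbatim}

\end{document}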