\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amsthm,amsfonts}
\usepackage{comment}
\newtheorem{lemma}{Lemma}
\newtheorem{fact}{Fact}
\newtheorem{example}{Example}
\newtheorem{prop}{Proposition}
\newcommand{\var}{\mathop{\mathrm{Var}}}
\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
\newcommand{\norm}[1]{\lVert#1\rVert}
\newcommand{\tr}[1]{#1^*}
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\mse}{\mathop{\mathrm{MSE}}}
\DeclareMathOperator{\trace}{tr}
\begin{document}
\section{Introduction}
\subsection{Framework}
We are given a set of users; each user has a vector of features which is known to us and allows us to distinguish this user from the others. We are interested in studying a \emph{survey} in which a subset of users will be selected and queried. We assume there is a hypothesis function such that, when a user is submitted to a query, his answer is equal to this function applied to his feature vector, plus some noise. The noise is assumed to be independent across users, which is equivalent to saying that, conditioned on the hypothesis, the answers are independent. The hypothesis is unknown to us, and the answers of the users will be used to learn it.

Because we do not know the hypothesis, we maintain a distribution over the hypothesis set, and this distribution is updated after the survey. This distribution conveys the uncertainty we have about the hypothesis; more precisely, the uncertainty is captured by the \emph{entropy} of the distribution. The setup thus takes the form of an \emph{active learning} problem: we want to select the most interesting users before conducting the survey. Here, ``most interesting'' refers to a notion of \emph{value} implied by the model: the value of a subset of users is defined as the decrease in entropy caused by observing the answers of this group of users. This definition of value is motivated by the information-theoretic interpretation of entropy: suppose we have an oracle to whom we can ask yes/no questions about the hypothesis. The entropy of the hypothesis distribution is then (up to rounding) the expected number of questions we need to ask to identify the hypothesis completely. If we have to pay for each question asked to the oracle, our definition of value directly measures the cost decrease implied by observing a set of users.

\subsection{Problem}
Once a notion of value has been defined, there are several problems one can address:
\begin{enumerate}
\item \emph{budget auctions:} the points in the database are associated with users. Each user $i$ has a cost $c_i$ for revealing his data. We are given a budget $B$ and we want to select a subset $S$ of users together with a payment scheme which is truthful. The value of the subset $S$ should be as large as possible, under the constraint that the payments stay within the budget.
\item \emph{value sharing:} the goal is to find a value sharing method to split the total value of a subset $S$ among its participants.
\item \emph{k-best auctions:} this problem is very similar to \emph{budget auctions}, except that the budget constraint is replaced by a constraint on the number of users in the subset $S$.
\end{enumerate}
\emph{[TODO: add references to the papers mentioning these problems, what we know about the value functions defined above, etc.]}
These problems have already been studied, and optimal or near-optimal solutions are already known when the value function is submodular.

\section{Value of data in the Gaussian world}
In this section we assume a multivariate Gaussian model:
\begin{displaymath}
y = \beta^*x + \epsilon
\end{displaymath}
where $x\in\mathbf{R}^d$ is the feature vector of a user and $\beta\in\mathbf{R}^d$ is the hypothesis. The prior distribution of $\beta$ is a multivariate normal distribution of mean zero and covariance $\Sigma$. The noise $\epsilon$ is modeled by a normal distribution of mean zero and variance $\sigma^2$, independent of $\beta$. Note that this is the probabilistic model underlying ridge regression. To compute the value of a subset $S$ as defined above, we have to compute the conditional differential entropy of $\beta$ given the answers of $S$, that is, the entropy of the posterior distribution after observing the set $S$; this posterior is exactly the distribution that leads to the ridge regression estimator. Thus the following computation can be seen as a study of how the ridge regression estimator evolves as more points are observed.

Let us start by computing the entropy of the multivariate normal prior:
\begin{displaymath}
H(\beta) = -\frac{1}{C} \int_{b\in\mathbf{R}^d} \exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right) \log\left(\frac{1}{C}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)\right)db
\end{displaymath}
where:
\begin{displaymath}
C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma} = \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
\end{displaymath}
By expanding the logarithm, we get:
\begin{displaymath}
H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d} b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
\end{displaymath}
One can notice that:
\begin{displaymath}
\frac{1}{C}\int_{b\in\mathbf{R}^d} b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db = \expt{b^*\Sigma^{-1}b}
\end{displaymath}
where $b$ follows a multivariate normal distribution of mean zero and covariance $\Sigma$. Hence:
\begin{displaymath}
\expt{b^*\Sigma^{-1}b} = \expt{\trace\big(\Sigma^{-1}bb^*\big)} = \trace\big(\Sigma^{-1}\Sigma\big) = \trace(I_d) = d
\end{displaymath}
Finally:
\begin{displaymath}
H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2} = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
\end{displaymath}
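As a quick sanity check of this formula, here is a worked instance; the prior covariance below is an arbitrary choice made only for illustration.
\begin{example}
Take $d = 2$ and $\Sigma = \mathrm{diag}(1, 4)$, so that $\det\Sigma = 4$. Then
\begin{displaymath}
H(\beta) = \frac{1}{2}\log\big((2\pi e)^2 \cdot 4\big) = \log(2\pi e) + \log 2 \approx 3.53\ \text{nats}.
\end{displaymath}
Replacing $\Sigma$ by $\Sigma/4$ decreases the entropy by $\frac{d}{2}\log 4 = \log 4$: a tighter prior means less initial uncertainty, as expected.
\end{example}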
\subsubsection*{Conditional entropy}
Let $S$ be a subset of the database $D$, of size $n$. Let us denote by $x_1,\ldots,x_n$ the points in $S$ and by $Y_1, \ldots, Y_n$ the associated random variables of interest. Let $X$ be the $n\times d$ matrix whose $i$-th row is $x_i^*$ and let $E = (\epsilon_1,\ldots,\epsilon_n)^*$ be the noise vector, so that $Y = X\beta + E$. Using the Bayes rule for conditional entropy, we get that:
\begin{displaymath}
H(\beta\,|\,Y_1,\ldots ,Y_n) = H(Y_1,\ldots,Y_n\,|\,\beta) +H(\beta) - H(Y_1,\ldots, Y_n)
\end{displaymath}
Conditioned on $\beta$, $(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of mean $X\beta$ and covariance matrix $\sigma^2 I_n$. Hence:
\begin{displaymath}
H(Y_1,\ldots,Y_n\,|\,\beta) = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
\end{displaymath}
Marginally, $(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of mean 0, being a linear combination of the independent centered Gaussian vectors $\beta$ and $E$. Let us denote by $\Sigma_Y$ its covariance matrix:
\begin{align}
\Sigma_Y & = \expt{YY^*} = \expt{(X\beta + E)(X\beta + E)^*}\\
& = X\Sigma X^* + \sigma^2I_n
\end{align}
where the cross terms vanish because $\beta$ and $E$ are independent and centered. This yields the following expression for the joint entropy:
\begin{displaymath}
H(Y_1,\ldots, Y_n) = \frac{1}{2}\log\left((2\pi e)^n \det(X\Sigma X^* + \sigma^2 I_n)\right)
\end{displaymath}
Combining the above equations and using Sylvester's determinant identity $\det(\sigma^2 I_n + X\Sigma X^*) = \sigma^{2n}\det\left(I_d + \Sigma\frac{X^*X}{\sigma^2}\right)$, we get:
\begin{align}
H(\beta\,|\, Y_1,\ldots,Y_n) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma \det\left(I_d + \Sigma \frac{X^*X}{\sigma^2}\right)^{-1}\right)\\
& = \frac{1}{2}\log\left((2\pi e)^d \det\left(\Sigma^{-1} + \frac{X^*X}{\sigma^2}\right)^{-1}\right)
\end{align}
We recognize the entropy of a Gaussian distribution of covariance $\left(\Sigma^{-1} + \frac{X^*X}{\sigma^2}\right)^{-1}$, which is indeed the posterior covariance of $\beta$ in the ridge regression model.

\subsubsection*{Value of data}
By combining the formulas for the entropy and the conditional entropy, we get that the value of a set $S$ of points, $V(S) = H(\beta) - H(\beta\,|\,Y_1,\ldots,Y_n)$, is equal to:
\begin{align}
V(S) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma\right) - \frac{1}{2}\log\left((2\pi e)^d \det \left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
& = \frac{1}{2}\log\left(\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\right)
\end{align}
where $X_S$ denotes the design matrix built from the points of $S$.

\subsubsection*{Marginal contribution}
Here, we want to compute the marginal contribution of a point $x$ to a set $S$ of users:
\begin{displaymath}
\Delta_xV(S) = V(S\cup \{x\}) - V(S)
\end{displaymath}
Using that:
\begin{displaymath}
X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
\end{displaymath}
we get:
\begin{align}
\Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right) - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
& = \frac{1}{2}\log\det\left(I_d + \frac{xx^*}{\sigma^2}\left(\Sigma^{-1} + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
& = \frac{1}{2}\log\left(1 + \frac{1}{\sigma^2}x^*\left(\Sigma^{-1} + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
\end{align}
where the second equality follows from factoring $I_d + \Sigma\frac{X_S^*X_S}{\sigma^2} = \Sigma\left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)$ (and similarly for the first determinant), and the last equality is the matrix determinant lemma $\det(I_d + uv^*) = 1 + v^*u$.
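Before turning to submodularity, here is a small worked instance of these formulas; the isotropic prior $\Sigma = \tau^2 I_d$ is an assumption made only for illustration.
\begin{example}
For a single user with feature vector $x$ and prior covariance $\Sigma = \tau^2 I_d$, the matrix determinant lemma gives
\begin{displaymath}
V(\{x\}) = \frac{1}{2}\log\det\left(I_d + \tau^2\frac{xx^*}{\sigma^2}\right) = \frac{1}{2}\log\left(1 + \frac{\tau^2\norm{x}^2}{\sigma^2}\right),
\end{displaymath}
so under this prior a user is all the more valuable as his feature vector is large compared to the noise level. Note also that, in the formula for $\Delta_x V(S)$ above, enlarging $S$ can only increase $\Sigma^{-1} + \frac{X_S^*X_S}{\sigma^2}$ in the positive semidefinite order, and hence decrease the quadratic form $x^*\left(\Sigma^{-1} + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x$: the marginal contribution of $x$ shrinks as the set grows. This is the diminishing returns property studied in the next section.
\end{example}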
\section{Submodularity}
In this section, we consider that we are given a \emph{universe} set $U$. A set function $f$ is a function defined on the power set of $U$, $\mathfrak{P}(U)$. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{increasing} if it is increasing with respect to inclusion, that is:
\begin{displaymath}
\forall\,S\subseteq T,\quad f(S)\leq f(T)
\end{displaymath}
A \emph{decreasing} function on $\mathfrak{P}(U)$ is defined similarly. A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{submodular} if it satisfies the diminishing returns property, that is, if the marginal increment obtained by adding a point to a set is a decreasing set function. More formally, for any point $x$ in $U$, the marginal increment of $f$ at $x$ is the set function defined by:
\begin{displaymath}
\Delta_x f(S) = f(S\cup\{x\}) - f(S)
\end{displaymath}
Then $f$ is \emph{submodular} if and only if, for every $x$ in $U$, $\Delta_x f$ is a decreasing set function over the subsets of $U$ that do not contain $x$. Similarly, a \emph{supermodular} function is one whose marginal increments are increasing set functions (again over the subsets not containing the point being added).
\begin{prop}
Let $R:\mathbf{R}\rightarrow \mathbf{R}$ be a decreasing convex function and $f:\mathfrak{P}(U)\rightarrow\mathbf{R}$ be a decreasing submodular function. Then the composed function $R\circ f$ is increasing and supermodular.
\end{prop}
\begin{proof}
The increasingness of $R\circ f$ follows immediately from the decreasingness of $R$ and $f$. For the supermodularity, let $x$ be a point of $U$ and let $S$ and $T$ be two sets not containing $x$ such that $S\subseteq T$; we have to show that $\Delta_x (R\circ f)(S)\leq \Delta_x(R\circ f)(T)$.

If $f(S) = f(S\cup\{x\})$, then $\Delta_x(R\circ f)(S) = 0 \leq \Delta_x(R\circ f)(T)$ because $R\circ f$ is increasing, and the claim holds trivially. If $f(T) = f(T\cup\{x\})$, then by decreasingness and submodularity of $f$ we have $0\leq f(S) - f(S\cup\{x\}) \leq f(T) - f(T\cup\{x\}) = 0$, and we are back to the previous case. We may therefore assume $f(S\cup\{x\}) < f(S)$ and $f(T\cup\{x\}) < f(T)$.

By decreasingness of $f$, we have:
\begin{displaymath}
f(T)\leq f(S)\quad\mathrm{and}\quad f(T\cup \{x\})\leq f(S\cup \{x\})
\end{displaymath}
so the interval $\big[f(T\cup\{x\}), f(T)\big]$ lies to the left of the interval $\big[f(S\cup\{x\}), f(S)\big]$. Since the slopes of the chords of a convex function increase as the interval moves to the right, the convexity of $R$ yields:
\begin{displaymath}
\frac{R\big(f(T)\big)-R\big(f(T\cup \{x\})\big)}{f(T)-f(T\cup \{x\})} \leq\frac{R\big(f(S)\big)-R\big(f(S\cup \{x\})\big)}{f(S)-f(S\cup \{x\})}
\end{displaymath}
$f$ is decreasing, so both denominators are positive; multiplying this last inequality by $f(S)-f(S\cup \{x\})$ and $f(T)-f(T\cup \{x\})$ yields:
\begin{multline}\label{eq:base}
\Big(R\big(f(T)\big)-R\big(f(T\cup \{x\})\big)\Big)\big(f(S)-f(S\cup \{x\})\big)\\
\leq \Big(R\big(f(S)\big)-R\big(f(S\cup \{x\})\big)\Big)\big(f(T)-f(T\cup \{x\})\big)
\end{multline}
$f$ is submodular, so:
\begin{displaymath}
f(T\cup \{x\})-f(T)\leq f(S\cup \{x\}) - f(S),
\quad\text{that is,}\quad
f(S)-f(S\cup \{x\})\leq f(T)-f(T\cup \{x\})
\end{displaymath}
$R\circ f$ is increasing, so:
\begin{displaymath}
R\big(f(S)\big)-R\big(f(S\cup \{x\})\big)\leq 0
\end{displaymath}
Multiplying the former inequality by this nonpositive quantity reverses it:
\begin{multline*}
\Big(R\big(f(S)\big)-R\big(f(S\cup \{x\})\big)\Big)\big(f(T)-f(T\cup \{x\})\big)\\
\leq \Big(R\big(f(S)\big)-R\big(f(S\cup \{x\})\big)\Big)\big(f(S)-f(S\cup \{x\})\big)
\end{multline*}
Chaining this inequality with \eqref{eq:base} gives:
\begin{multline*}
\Big(R\big(f(T)\big)-R\big(f(T\cup \{x\})\big)\Big)\big(f(S)-f(S\cup \{x\})\big)\\
\leq \Big(R\big(f(S)\big)-R\big(f(S\cup \{x\})\big)\Big)\big(f(S)-f(S\cup \{x\})\big)
\end{multline*}
Dividing both sides by $f(S)-f(S\cup \{x\})>0$ yields:
\begin{displaymath}
R\big(f(T)\big)-R\big(f(T\cup \{x\})\big) \leq R\big(f(S)\big)-R\big(f(S\cup \{x\})\big)
\end{displaymath}
that is, $\Delta_x (R\circ f)(S)\leq \Delta_x(R\circ f)(T)$, which is exactly the supermodularity of $R\circ f$.
\end{proof}
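To illustrate the proposition, here is a small numerical instance; the function values below are chosen arbitrarily, only for illustration.
\begin{example}
Take $U=\{1,2\}$ and define $f$ by $f(\emptyset)=2$, $f(\{1\})=f(\{2\})=1$ and $f(\{1,2\})=-1/2$; this $f$ is decreasing, and it is submodular since $f(\{1,2\})-f(\{1\}) = -3/2 \leq f(\{2\})-f(\emptyset) = -1$ (and symmetrically for the other point). Take $R(t)=e^{-t}$, which is decreasing and convex. Then $R\circ f$ takes the values $e^{-2}\approx 0.14$, $e^{-1}\approx 0.37$, $e^{-1}\approx 0.37$ and $e^{1/2}\approx 1.65$: it is increasing, and its marginal increments grow with the set, e.g.
\begin{displaymath}
(R\circ f)(\{1,2\})-(R\circ f)(\{1\}) \approx 1.28 \;\geq\; (R\circ f)(\{2\})-(R\circ f)(\emptyset)\approx 0.23,
\end{displaymath}
as the proposition predicts.
\end{example}
\end{document}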