| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-11-06 02:53:42 +0100 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-11-06 02:53:42 +0100 |
| commit | a90b10d4653eb81c2647fa1b267dadc0a9fcacd8 (patch) | |
| tree | aade432d37b7e0fdd68c5ef1674f32e570f702e9 /notes.tex | |
| parent | 2f6accb332cfa17521490fdb239327c91ed1dcd4 (diff) | |
Post submission clean-up
Diffstat (limited to 'notes.tex')
| -rw-r--r-- | notes.tex | 496 |
1 files changed, 0 insertions, 496 deletions
diff --git a/notes.tex b/notes.tex
deleted file mode 100644
index 1ce4cc4..0000000
--- a/notes.tex
+++ /dev/null
@@ -1,496 +0,0 @@
-\documentclass{IEEEtran}
-%\usepackage{mathptmx}
-\usepackage[utf8]{inputenc}
-\usepackage{amsmath,amsthm,amsfonts}
-\newtheorem{lemma}{Lemma}
-\newtheorem{fact}{Fact}
-\newtheorem{example}{Example}
-\newtheorem{prop}{Proposition}
-\newcommand{\var}{\mathop{\mathrm{Var}}}
-\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
-\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
-\newcommand{\norm}[1]{\lVert#1\rVert}
-\newcommand{\tr}[1]{#1^*}
-\newcommand{\ip}[2]{\langle #1, #2 \rangle}
-\newcommand{\mse}{\mathop{\mathrm{MSE}}}
-\DeclareMathOperator{\trace}{tr}
-\DeclareMathOperator*{\argmax}{arg\,max}
-\title{Value of data}
-\author{Stratis Ioannidis \and Thibaut Horel}
-\begin{document}
-\maketitle
-
-\section{Introduction}
-
-The goal of this work is to propose a framework to study the value of user
-data. Although it is clear that there is some notion of value attached to
-user data (for example user data can be used to generate revenue through online
-advertising, recommender systems, etc.), it is not clear which definition of
-value should be used for formal works on this notion. After having proposed
-a definition of value, we study how generic economic problems behave with
-regard to this definition. Finally we study the computational feasibility of
-these problems.
-
-\section{Data model}
-\label{sec:data-model}
-
-There is a set of users and an experimenter. Each user has a vector of public
-features (e.g. age, height, binary features, labels, etc.) and a private piece
-of information: an undisclosed variable.
-
-The experimenter wants to learn the relationship between the public features
-and the private variable. Before conducting the experiment, he has a prior
-knowledge of this relationship, called his \emph{hypothesis}, and the experiment
-consists in selecting a set of users and asking them to reveal their private
-variables. Based on the observed data, the experimenter updates his hypothesis.
-
-For the experimenter, there is a notion of value attached to a group of users:
-this is how much the data teaches him about the hypothesis, how much it reduces
-his uncertainty about the hypothesis. For the users, there is a cost attached
-to revealing their data. The experimenter also has a budget constraint on the
-amount of money he can spend.
-
-The problems arising in this setup are natural: the experimenter wants to
-maximize his utility, the value of the set of users he selects, but he needs to
-compensate the users by taking into account their costs and his budget
-constraint. The users' costs can either be public, which directly leads to
-combinatorial optimization problems, or private, in which case a notion of
-strategy intervenes in the way the experimenter compensates the users and
-requires an auction approach.
-
-Formally, there is a set of users indexed by a set $\mathcal{I}$. The public
-feature vector of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set
-$E$, his undisclosed variable is denoted by $y_i$ and belongs to some space
-$A$. The cost of user $i$ for revealing his data is a positive real number
-$c_i\in\mathbf{R}_+$. The budget of the experimenter is denoted by $B$.
-
-The prior knowledge of the experimenter takes the form of a random variable $H$
-over $A^E$ or a subset of $A^E$ called the \emph{hypothesis set}. This random
-variable expresses his uncertainty about the true hypothesis $h$. The true
-hypothesis gives the relationship between the feature vector of user $i$ and
-his private variable through the equation:
-\begin{equation}\label{eq:hypothesis-model}
-    y_i = h(x_i) + \varepsilon_i
-\end{equation}
-where $\varepsilon_i$ is a random variable over $A$.
-
-\emph{TODO: explain why this model is not restrictive: $y$ can always be
-    written as a deterministic function of $x$ plus something independent of
-$x$}
-
-\emph{Examples.}
-\begin{enumerate}
-    \item if the hypothesis set is finite (the experimenter has a few
-        deterministic models he wants to choose from), observing data allows him
-        to rule out parts of the hypothesis set. In this case, the uncertainty
-        of the experimenter could simply be measured by the size of the
-        hypothesis set.
-    \item if $A=\{0,1\}$ and the hypothesis set is the set of all binary
-        functions on $E$, the learning task of the experimenter is a binary
-        classification problem.
-    \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and the hypothesis set is the set
-        of linear functions from $E$ to $A$: $\mathcal{L}(A,E)$, the learning
-        task is a linear regression. The prior knowledge of the experimenter is
-        a prior distribution on the parameters of the linear function, which is
-        equivalent to regularized linear regression (e.g. ridge regression).
-\end{enumerate}
-
-\section{Economics of data}
-\label{sec:economics}
-
-The goal of this section is to further discuss the optimization problems
-mentioned in the previous section.
-
-The value function (the utility function of the experimenter) will be denoted
-by $V$ and is simply a function mapping a set of users $S\subset \mathcal{I}$
-to $V(S)\in\mathbf{R}_+$. The choice of a specific value function will be
-discussed in Section~\ref{sec:data-value}.
-
-\subsection{Optimal observation selection}
-
-In this problem, the costs $(c_i)_{i\in \mathcal{I}}$ of the users are public.
-When selecting the set $S\subset\mathcal{I}$ of users, the experimenter has to
-pay exactly $\sum_{i\in S} c_i$. Hence, the optimization problem consists in
-selecting $S^*$ defined by:
-\begin{equation}\label{eq:observation-selection}
-    S^* = \argmax_{S\subset\mathcal{I}}\Big\{V(S)\;\Big|\;
-    \sum_{i\in S} c_i \leq B\Big\}
-\end{equation}
-
-This is a function maximization problem under a knapsack constraint. If $V$ can
-be any function, then this problem is obviously NP-hard. A common assumption to
-make on $V$ is that it is submodular (see Section~\ref{sec:submodularity})
-which is the extension of the notion of convexity for set functions. However,
-maximizing a submodular function under a knapsack constraint is still NP-hard.
-
-Sviridenko (2004) gave a polynomial-time $(1-1/e)$-approximation algorithm for
-this problem, when the value function is non-decreasing and submodular.
-
-Note that this problem covers the case where all users have the same cost $c$.
-In this case, letting $B' = \left\lfloor B/c\right\rfloor$, the problem
-becomes a maximization problem under a size constraint:
-\begin{displaymath}
-    S^* = \argmax_{S\subset\mathcal{I}}\left\{V(S)\;|\;
-    | S| \leq B'\right\}
-\end{displaymath}
-for which Nemhauser et al. (1978) gave an optimal $(1-1/e)$ polynomial-time
-approximation algorithm, in the case of a non-decreasing submodular function.
-
-\subsection{Budget feasible auctions}
-
-Here, the costs of the users are private. Before the beginning of the
-experiment, they report a cost $c_i'$ which is not necessarily equal to their
-true cost: a user may decide to lie to receive more money; however, by doing
-so, there is a risk of not being included in the experiment.
-
-This notion of strategy involved in the way the users reveal their costs roots
-this problem in auction and mechanism design theory. Formally, the
-experimenter wants to design an allocation function $f:
-\mathbf{R}_+^{\mathcal{I}} \rightarrow 2^{\mathcal{I}}$, which, given the
-reported costs of the users, selects the set of users to be included in the
-experiment, and a payment function $p: \mathbf{R}_+^{\mathcal{I}} \rightarrow
-\mathbf{R}_+^{\mathcal{I}}$, which given the reported costs returns the vector
-of payments to allocate to each user.
-
-For notational convenience, we will assume given the costs $\{c_i,
-i\in\mathcal{I}\}$ of the users, and we will denote by $\{s_i,
-i\in\mathcal{I}\}$ the characteristic function of $f(\{c_i,
-i\in\mathcal{I}\})$, that is, $s_i = 1$ iff $i\in f(\{c_i, i\in\mathcal{I}\})$.
-The payment received by user $i$ will be denoted by $p_i$.
-
-The mechanism should satisfy the following conditions:
-\begin{itemize}
-    \item \textbf{Normalized} if $s_i = 0$ then $p_i = 0$.
-    \item \textbf{Individually rational} $p_i \geq s_ic_i$.
-    \item \textbf{Truthful} $p_i - s_ic_i \geq p_i' - s_i'c_i$, where $p_i'$
-        and $s_i'$ are the payment and allocation of user $i$ had he reported
-        a cost $c_i'$ different from his true cost $c_i$ (keeping the costs
-        reported by the other users the same).
-    \item \textbf{Budget feasible} the payments should be within budget:
-        \begin{displaymath}
-            \sum_{i\in \mathcal{I}} s_ip_i \leq B
-        \end{displaymath}
-\end{itemize}
-
-Yaron Singer (2010) proved a lower bound of 2 for the approximation ratio of
-a polynomial algorithm to solve this problem. He also gave a randomized general
-algorithm with an approximation ratio of 117.7, although for specific problems
-(coverage, knapsack, etc.) better ratios can be attained.
-
-State of the art: Chen, Gravin and Lu, lower bound of $1+\sqrt{2}$ and upper
-bound of 8.34 when the fractional non-strategic optimization problem can be
-solved in polynomial time.
-
-\section{Value of data}
-\label{sec:data-value}
-
-Here, we discuss a choice for the value function appearing in the problems
-discussed in Section~\ref{sec:economics}. Such a value function should be at
-least normalized, positive and non-decreasing. It should furthermore capture
-a notion of \emph{uncertainty reduction} related to the learning task of the
-experimenter.
-
-The prior knowledge of the experimenter, the distribution of the random
-variable $H$ over the hypothesis set, conveys his uncertainty about the true
-hypothesis. A common measure of uncertainty is given by the entropy. If we
-denote by $P_H$ the probability distribution of $H$, its entropy is defined
-by\footnote{Here we choose to write the entropy of a discrete distribution, but
-    our results do not rely on this assumption: if the distribution is
-    continuous, one simply needs to replace the entropy by the differential
-entropy.}:
-\begin{displaymath}
-    \mathbb{H}(H) = -\sum_{h\in A^E}P_H(h)\log\big(P_H(h)\big)
-\end{displaymath}
-
-We will denote by $Y_i$, $i\in\mathcal{I}$ the answer of user $i$ according to
-the experimenter's knowledge:
-\begin{equation}\label{eq:data-model}
-    Y_i = H(x_i) + \varepsilon_i
-\end{equation}
-and $Y_S$ will denote the set of answers from the users in a subset $S$ of
-$\mathcal{I}$.
-
-We can now define the value of a group $S$ of users as being the decrease in
-entropy induced by observing $S$:
-\begin{equation}\label{eq:value}
-    \forall S\subset\mathcal{I},\; V(S)
-    = \mathbb{H}(H) - \mathbb{H}(H\,|\,Y_S)
-\end{equation}
-where $\mathbb{H}(H\,|\,Y_S)$ is the conditional entropy of $H$ given $Y_S$.
-One can also note that the definition of the value given in \eqref{eq:value} is
-simply the mutual information $I(H;Y_S)$ between $H$ and $Y_S$. Submodularity
-is preserved by composition by a concave function on the left (see
-Section~\ref{sec:submodularity}). Hence, the definition of the value of $S$ can be
-extended to any $f\big(V(S)\big)$ where $f$ is a non-decreasing concave
-function.
-
-This notion of value, also known as the \emph{value of information} (TODO: ref)
-can be motivated by considering the information theoretic interpretation of the
-entropy: suppose that the experimenter has access to an oracle to whom he can
-ask yes/no questions. Then, the entropy of the distribution is, up to one
-question, the expected number of questions he needs to ask to fully know the
-hypothesis. If he needs to pay for each question asked to the oracle, then our
-definition of value directly relates to the cost decrease implied by observing
-a set of users.
-
-Using the \emph{information never hurts} principle, it is easy to see that the
-value function defined by \eqref{eq:value} is positive and non-decreasing (with
-regard to inclusion).
-
-Furthermore, if we add the natural assumption that
-$(\varepsilon_i)_{i\in\mathcal{I}}$ are jointly independent, which is
-equivalent to saying that conditioned on the hypothesis, the
-$(Y_i)_{i\in\mathcal{I}}$ are independent, we get the following fact.
-
-\begin{fact}\label{value-submodularity}
-    Assuming that $(Y_i)_{i\in\mathcal{I}}$ defined in \eqref{eq:data-model}
-    are independent conditioned on $H$, then $V$ defined in \eqref{eq:value} is
-    submodular.
-\end{fact}
-
-\begin{proof}
-    Using the chain rule, one can rewrite the value of $S$ as:
-    \begin{displaymath}
-        V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\, H)
-    \end{displaymath}
-    We can now use the conditional independence of $Y_S$ to write the
-    conditional entropy as a sum:
-    \begin{displaymath}
-        V(S) = \mathbb{H}(Y_S) - \sum_{s\in S}\mathbb{H}(Y_s\,|\, H)
-    \end{displaymath}
-
-    It is well known that the joint entropy of a set of random variables is
-    submodular. Thus the last equation expresses $V$ as the sum of a submodular
-    function and of an additive function. As a consequence, $V$ is submodular.
-\end{proof}
-
-\section{Value of data in the linear regression setup}
-
-In this section we will assume a linear model: the feature vectors belong to
-$\mathbf{R}^d$ and the private variables belong to $\mathbf{R}$. The private
-variable can be expressed as a linear combination of the features:
-\begin{displaymath}
-    y = \beta^*x + \epsilon
-\end{displaymath}
-The noise $\epsilon$ is normally distributed, zero-mean and of variance
-$\sigma^2$. Furthermore, the noise is independent of the user.
-
-The hypothesis set of the experimenter is the set of all linear forms from
-$\mathbf{R}^d$ to $\mathbf{R}$. Because a linear form can be uniquely
-represented as the inner product by a vector, it is equivalent to saying that the
-hypothesis set is $\mathbf{R}^d$ and that the experimenter's hypothesis is
-a random variable $\beta$ over $\mathbf{R}^d$.
-
-A common assumption made in linear regression is that $\beta$ is normally
-distributed, zero-mean, and its covariance matrix is denoted by $\Sigma$. This
-prior distribution conveys the idea that $\beta$ should have a small $L^2$
-norm, which means that the model should have some kind of sparsity. Indeed, it
-is easy to prove that choosing the $\hat\beta$ which maximizes the \emph{a
-posteriori} distribution given the observations under a normal prior is
-equivalent to solving the following optimization problem:
-\begin{displaymath}
-    \hat\beta = \arg\min_{\beta\in\mathbf{R}^d}\|Y-X\beta\|^2 + \lambda
-    \|\beta\|^2
-\end{displaymath}
-where $\lambda$ can be expressed as a function of $\Sigma$ and $\sigma^2$. This
-optimization problem is known as ridge regression, and is simply a least-squares
-fit of the data with a penalization on the $\beta$ which have a large $L^2$
-norm, which is consistent with the prior distribution.
-
-\begin{fact}
-    Under the linear regression model, with a multivariate normal prior, the
-    value of data of a set $S$ of users is given by:
-    \begin{equation}\label{eq:linear-regression-value}
-        V(S) = \frac{1}{2}\log\det\left(I_d
-        + \frac{\Sigma}{\sigma^2}X_S^*X_S\right)
-    \end{equation}
-    where $X_S$ is the matrix whose rows are the row vectors $x_s^*$ for $s$ in
-    $S$.
-\end{fact}
-
-\begin{proof}
-Let us recall that the entropy of a multivariate normal variable $B$ in
-dimension $d$ and of covariance $\Sigma$ (the mean is not relevant) is given
-by:
-\begin{equation}\label{eq:multivariate-entropy}
-    \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma\big)
-\end{equation}
-
-Using the chain rule as in the proof of Fact~\ref{value-submodularity} we get
-that:
-\begin{displaymath}
-    V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta)
-\end{displaymath}
-
-Conditioned on $\beta$, $(Y_S)$ follows a multivariate normal distribution of
-mean $X_S\beta$ and of covariance matrix $\sigma^2 I_n$, with $n=|S|$. Hence:
-\begin{equation}\label{eq:h1}
-    \mathbb{H}(Y_S\,|\,\beta)
-    = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
-\end{equation}
-
-$(Y_S)$ also follows a multivariate normal distribution of mean zero. Let us
-compute its covariance matrix, $\Sigma_Y$:
-\begin{align*}
-    \Sigma_Y & = \expt{YY^*} = \expt{(X_S\beta + E)(X_S\beta + E)^*}\\
-    & = X_S\Sigma X_S^* + \sigma^2I_n
-\end{align*}
-Thus, we get that:
-\begin{equation}\label{eq:h2}
-    \mathbb{H}(Y_S)
-    = \frac{1}{2}\log\left((2\pi e)^n \det(X_S\Sigma X_S^* + \sigma^2 I_n)\right)
-\end{equation}
-
-Combining \eqref{eq:h1} and \eqref{eq:h2} we get:
-\begin{displaymath}
-    V(S) = \frac{1}{2}\log\det\left(I_n+\frac{1}{\sigma^2}X_S\Sigma
-    X_S^*\right)
-\end{displaymath}
-
-Finally, we can use Sylvester's determinant identity to get the result.
-\end{proof}
-
-\emph{Remarks.}
-\begin{enumerate}
-    \item it is known that for a set of symmetric positive definite matrices,
-        defining the value of a set to be the $\log\det$ of the sum of the
-        matrices yields a non-decreasing, submodular value function. Noting that:
-        \begin{displaymath}
-            X_S^*X_S = \sum_{s\in S}x_sx_s^*
-        \end{displaymath}
-        it is clear that our value function is non-decreasing and submodular.
-        The positivity follows from a direct application of the spectral
-        theorem.
-    \item the matrix which appears in the value function:
-        \begin{displaymath}
-            I_d + \frac{\Sigma}{\sigma^2}X_S^*X_S
-        \end{displaymath}
-        is also the inverse of the covariance matrix of the ridge regression
-        estimator. In optimal experiment design, it is common to use the
-        determinant of the inverse of the estimator's covariance matrix as
-        a measure of the quality of the prediction. Indeed, this directly relates to
-        the inverse of the volume of the confidence ellipsoid.
-    \item This value function can be computed up to a fixed decimal precision in
-        polynomial time.
-\end{enumerate}
-\subsubsection*{Marginal contribution}
-
-Here, we want to compute the marginal contribution of a point $x$ to a set $S$ of
-users:
-\begin{displaymath}
-    \Delta_xV(S) = V(S\cup \{x\}) - V(S)
-\end{displaymath}
-
-Using that:
-\begin{displaymath}
-    X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
-\end{displaymath}
-we get:
-\begin{align*}
-    \Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d
-    + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)\\
-    & - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
-    & = \frac{1}{2}\log\det\left(I_d + xx^*\left(\sigma^2\Sigma^{-1} +
-X_S^*X_S\right)^{-1}\right)\\
-& = \frac{1}{2}\log\left(1 + x^*\left(\sigma^2\Sigma^{-1}
-+ X_S^*X_S\right)^{-1}x\right)
-\end{align*}
-
-\emph{Remark.} This formula shows that given a set $S$, users do not all bring
-the same contribution to the set $S$. This contribution depends on the norm of
-$x$ for the bilinear form defined by the matrix $(\sigma^2\Sigma^{-1}
-+ X_S^*X_S)^{-1}$ which reflects how well the new point $x$ \emph{aligns} with
-the already existing points.
-
-\section*{Appendix: Submodularity}
-\label{sec:submodularity}
-
-In this section, we will consider that we are given a \emph{universe} set $U$.
-A set function $f$ is a function defined on the power set of $U$, $\mathfrak{P}(U)$.
-
-A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{increasing} if
-it is increasing with regard to inclusion, that is:
-
-\begin{displaymath}
-    \forall\,S\subseteq T,\quad f(S)\leq f(T)
-\end{displaymath}
-
-A \emph{decreasing} function on $\mathfrak{P}(U)$ is defined similarly.
-
-A set function $f$ defined on $\mathfrak{P}(U)$ is said to be
-\emph{submodular} if it verifies the diminishing returns property, that is,
-the marginal increment when adding a point to a set is a decreasing set
-function. More formally, for any point $x$ in $U$, we can define the marginal
-increment of $f$ with respect to $x$ as the set function:
-\begin{displaymath}
-    \Delta_x f(S) = f(S\cup\{x\}) - f(S)
-\end{displaymath}
-Then, $f$ is \emph{submodular} iff for all $x$ in $U$, $\Delta_x f$ is a
-decreasing set function.
-
-Similarly, a \emph{supermodular} function is one whose marginal increments are
-increasing set functions.
-\begin{prop}
-    Let $R:\mathbf{R}\rightarrow \mathbf{R}$ be a decreasing concave function and
-    $f:\mathfrak{P}(U)\rightarrow\mathbf{R}$ be a decreasing submodular function,
-    then the composed function $R\circ f$ is increasing and supermodular.
-\end{prop}
-
-\begin{proof}
-    The fact that $R\circ f$ is increasing follows immediately from $R$ and $f$
-    being decreasing.
-
-    For the supermodularity, let $S$ and $T$ be two sets such that $S\subseteq
-    T$. Since $f$ is decreasing, we have:
-    \begin{displaymath}
-        \forall\,V,\quad f(T)\leq f(S)\quad\mathrm{and}\quad f(T\cup V)\leq f(S\cup V)
-    \end{displaymath}
-
-    Thus, by concavity of $R$:
-    \begin{displaymath}
-        \begin{split}
-            \forall\,V,\quad\frac{R\big(f(S)\big)-R\big(f(S\cup V)\big)}{f(S)-f(S\cup
-            V)}\\
-            \leq\frac{R\big(f(T)\big)-R\big(f(T\cup V)\big)}{f(T)-f(T\cup V)}
-        \end{split}
-    \end{displaymath}
-
-    $f$ is decreasing, so multiplying this last inequality by
-    $f(S)-f(S\cup V)$ and $f(T)-f(T\cup V)$ yields:
-    \begin{multline}\label{eq:base}
-        \forall V,\quad\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big)\\
-        \leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)
-    \end{multline}
-
-    $f$ is submodular, so:
-    \begin{displaymath}
-        f(T\cup V)-f(T)\leq f(S\cup V) - f(S)
-    \end{displaymath}
-
-    $R\circ f$ is increasing, so:
-    \begin{displaymath}
-        R\big(f(S)\big)-R\big(f(S\cup V)\big)\leq 0
-    \end{displaymath}
-
-    By combining the two previous inequalities, we get:
-    \begin{multline*}
-        \forall V,\quad\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)\\
-        \leq \Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big)
-    \end{multline*}
-
-    Injecting this last inequality into \eqref{eq:base} gives:
-    \begin{multline*}
-        \forall V,\quad\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)\\
-        \leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)
-    \end{multline*}
-
-    Dividing left and right by $f(S)-f(S\cup V)$ yields:
-    \begin{displaymath}
-        \forall V,\quad R\big(f(S)\big)-R\big(f(S\cup V)\big)
-        \leq R\big(f(T)\big)-R\big(f(T\cup V)\big)
-    \end{displaymath}
-    which is exactly the supermodularity of $R\circ f$.
-\end{proof}
-
-\end{document}
-
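The closed-form value function of the linear regression section, and the marginal contribution derived from it, can be checked numerically. The following is a minimal sketch in plain Python and is not taken from the deleted file. It assumes an isotropic prior $\Sigma = \tau^2 I_d$, a simplification introduced here so that the matrix stays symmetric and a Cholesky factorization applies. The names `precision_matrix`, `cholesky`, `value`, `marginal_gain`, `sigma2` and `tau2` are hypothetical.

```python
import math


def precision_matrix(rows, d, sigma2, tau2):
    """Return M = (sigma2 / tau2) * I_d + X_S^T X_S as a list of lists."""
    m = [[sigma2 / tau2 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x in rows:
        for i in range(d):
            for j in range(d):
                m[i][j] += x[i] * x[j]
    return m


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def value(rows, d, sigma2=1.0, tau2=1.0):
    """V(S) = 1/2 log det(I_d + (tau2 / sigma2) X_S^T X_S), assuming Sigma = tau2 * I_d."""
    low = cholesky(precision_matrix(rows, d, sigma2, tau2))
    # det(I_d + c G) = c^d det(M), with c = tau2 / sigma2 and M = I_d / c + G.
    logdet_m = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return 0.5 * (logdet_m + d * math.log(tau2 / sigma2))


def marginal_gain(x, rows, d, sigma2=1.0, tau2=1.0):
    """Delta_x V(S) = 1/2 log(1 + x^T M^{-1} x), with M as in precision_matrix."""
    low = cholesky(precision_matrix(rows, d, sigma2, tau2))
    z = []  # forward substitution: solve low z = x, so that x^T M^{-1} x = |z|^2
    for i in range(d):
        z.append((x[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    return 0.5 * math.log(1.0 + sum(v * v for v in z))


if __name__ == "__main__":
    S = [[1.0, 0.0], [0.5, 0.5]]   # feature vectors of the users already selected
    T = S + [[0.0, 2.0]]           # a superset of S
    x = [1.0, 1.0]                 # candidate user
    # The difference of values should match the closed-form marginal gain.
    print(value(S + [x], 2) - value(S, 2), marginal_gain(x, S, 2))
    # Diminishing returns: the gain of x with respect to S is at least its gain with respect to T.
    print(marginal_gain(x, S, 2) >= marginal_gain(x, T, 2))
```

Under this assumption, `value(S + [x], d) - value(S, d)` and `marginal_gain(x, S, d)` should agree up to floating-point error, which is the content of the marginal contribution computation, and comparing gains over nested sets illustrates the submodularity remark. A general covariance $\Sigma$ would need a determinant routine for non-symmetric matrices, which this sketch does not cover.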
