Diffstat (limited to 'notes.tex')
-rw-r--r--  notes.tex  496
1 file changed, 0 insertions, 496 deletions
diff --git a/notes.tex b/notes.tex
deleted file mode 100644
index 1ce4cc4..0000000
--- a/notes.tex
+++ /dev/null
@@ -1,496 +0,0 @@
-\documentclass{IEEEtran}
-%\usepackage{mathptmx}
-\usepackage[utf8]{inputenc}
-\usepackage{amsmath,amsthm,amsfonts}
-\newtheorem{lemma}{Lemma}
-\newtheorem{fact}{Fact}
-\newtheorem{example}{Example}
-\newtheorem{prop}{Proposition}
-\newcommand{\var}{\mathop{\mathrm{Var}}}
-\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
-\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
-\newcommand{\norm}[1]{\lVert#1\rVert}
-\newcommand{\tr}[1]{#1^*}
-\newcommand{\ip}[2]{\langle #1, #2 \rangle}
-\newcommand{\mse}{\mathop{\mathrm{MSE}}}
-\DeclareMathOperator{\trace}{tr}
-\DeclareMathOperator*{\argmax}{arg\,max}
-\title{Value of data}
-\author{Stratis Ioannidis \and Thibaut Horel}
-\begin{document}
-\maketitle
-
-\section{Introduction}
-
-The goal of this work is to propose a framework to study the value of user
-data. Although it is clear that some notion of value is attached to user data
-(for example, it can be used to generate revenue through online advertising,
-recommender systems, etc.), it is not clear which definition of value should
-be adopted in a formal treatment of this notion. After proposing a definition
-of value, we study how generic economic problems behave with regard to this
-definition. Finally, we study the computational feasibility of these problems.
-
-\section{Data model}
-\label{sec:data-model}
-
-There is a set of users and an experimenter. Each user has a vector of public
-features (e.g. age, height, binary features, labels, etc.) and a private piece
-of information: an undisclosed variable.
-
-The experimenter wants to learn the relationship between the public features
-and the private variable. Before conducting the experiment, he has prior
-knowledge of this relationship, called his \emph{hypothesis}, and the
-experiment consists of selecting a set of users and asking them to reveal
-their private variables. Based on the observed data, the experimenter updates
-his hypothesis.
-
-For the experimenter, there is a notion of value attached to a group of users:
-how much their data teaches him about the hypothesis, that is, how much it
-reduces his uncertainty about the hypothesis. For the users, there is a cost
-attached to revealing their data. The experimenter also has a budget
-constraint on the amount of money he can spend.
-
-The problems arising in this setup are natural: the experimenter wants to
-maximize his utility, the value of the set of users he selects, while
-compensating the users within his budget constraint. The users' costs can
-either be public, which directly leads to combinatorial optimization problems,
-or private, in which case the users may report their costs strategically,
-which calls for an auction-based approach.
-
-Formally, there is a set of users indexed by a set $\mathcal{I}$. The public
-feature vector of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set
-$E$, his undisclosed variable is denoted by $y_i$ and belongs to some space
-$A$. The cost of user $i$ for revealing his data is a positive real number
-$c_i\in\mathbf{R}_+$. The budget of the experimenter is denoted by $B$.
-
-The prior knowledge of the experimenter takes the form of a random variable $H$
-over $A^E$ or a subset of $A^E$ called the \emph{hypothesis set}. This random
-variable expresses his uncertainty about the true hypothesis $h$. The true
-hypothesis gives the relationship between the feature vector of user $i$ and
-his private variable through the equation:
-\begin{equation}\label{eq:hypothesis-model}
- y_i = h(x_i) + \varepsilon_i
-\end{equation}
-where $\varepsilon_i$ is a random variable over $A$.
-
-\emph{TODO: explain why this model is not restrictive: $y$ can always be
- written as a deterministic function of $x$ plus something independent of
-$x$}
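-
-\emph{A sketch of this argument: setting $h(x) = \condexp{y}{x}$ and
-$\varepsilon = y - h(x)$ gives the decomposition \eqref{eq:hypothesis-model}
-with}
-\begin{displaymath}
-    \condexp{\varepsilon}{x} = \condexp{y}{x} - \condexp{y}{x} = 0
-\end{displaymath}
-\emph{so a decomposition with conditionally mean-zero noise always exists;
-full independence of $\varepsilon$ from $x$ is a stronger modeling
-assumption.}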
-
-\emph{Examples.}
-\begin{enumerate}
-    \item if the hypothesis set is finite (the experimenter has a few
-        deterministic models he wants to choose from), observing data allows him
- to rule out parts of the hypothesis set. In this case, the uncertainty
- of the experimenter could simply be measured by the size of the
- hypothesis set.
- \item if $A=\{0,1\}$ and the hypothesis set is the set of all binary
- functions on $E$, the learning task of the experimenter is a binary
- classification problem.
- \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and the hypothesis set is the set
- of linear functions from $E$ to $A$: $\mathcal{L}(A,E)$, the learning
- task is a linear regression. The prior knowledge of the experimenter is
- a prior distribution on the parameters of the linear function, which is
- equivalent to regularized linear regression (e.g. ridge regression).
-\end{enumerate}
-
-\section{Economics of data}
-\label{sec:economics}
-
-The goal of this section is to further discuss the optimization problems
-mentioned in the previous section.
-
-The value function (the utility function of the experimenter) will be denoted
-by $V$ and is simply a function mapping a set of users $S\subset \mathcal{I}$
-to $V(S)\in\mathbf{R}_+$. The choice of a specific value function will be
-discussed in Section~\ref{sec:data-value}.
-
-\subsection{Optimal observation selection}
-
-In this problem, the costs $(c_i)_{i\in \mathcal{I}}$ of the users are public.
-When selecting the set $S\subset\mathcal{I}$ of users, the experimenter has to
-pay exactly $\sum_{i\in S} c_i$. Hence, the optimization problem consists of
-selecting $S^*$ defined by:
-\begin{equation}\label{eq:observation-selection}
- S^* = \argmax_{S\subset\mathcal{I}}\Big\{V(S)\;\Big|\;
- \sum_{i\in S} c_i \leq B\Big\}
-\end{equation}
-
-This is a set function maximization problem under a knapsack constraint. If
-$V$ can be any function, then this problem is obviously NP-hard. A common
-assumption on $V$ is that it is submodular (see
-Section~\ref{sec:submodularity}), a diminishing returns property which plays
-for set functions a role analogous to concavity. However, maximizing
-a submodular function under a knapsack constraint is still NP-hard.
-
-Sviridenko (2004) gave a polynomial-time $(1-1/e)$-approximation algorithm for
-this problem, when the value function is non-decreasing and submodular.
-
-Note that this problem covers the case where all users have the same cost $c$.
-In this case, letting $B' = \left\lfloor B/c\right\rfloor$, the problem
-becomes a maximization problem under a cardinality constraint:
-\begin{displaymath}
- S^* = \argmax_{S\subset\mathcal{I}}\left\{V(S)\;|\;
- | S| \leq B'\right\}
-\end{displaymath}
-for which Nemhauser, Wolsey and Fisher (1978) gave a greedy polynomial-time
-$(1-1/e)$-approximation algorithm, which is optimal, in the case of
-a non-decreasing submodular function.
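-
-To make the greedy algorithm concrete, here is a minimal sketch (in Python;
-the function name is ours, and \texttt{value} stands for any non-decreasing
-submodular oracle $V$):
-\begin{verbatim}
-def greedy(universe, value, k):
-    """Greedily pick up to k users for a non-decreasing
-    submodular set function `value` given as an oracle."""
-    S = set()
-    for _ in range(k):
-        # add the user with the largest marginal increment
-        best = max((u for u in universe if u not in S),
-                   key=lambda u: value(S | {u}) - value(S),
-                   default=None)
-        if best is None:
-            break
-        S.add(best)
-    return S
-\end{verbatim}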
-
-\subsection{Budget feasible auctions}
-
-Here, the costs of the users are private. Before the beginning of the
-experiment, each user reports a cost $c_i'$ which is not necessarily equal to
-his true cost: a user may decide to lie to receive more money; however, by
-doing so, he runs the risk of not being included in the experiment.
-
-The strategic way in which the users reveal their costs roots this problem in
-auction and mechanism design theory. Formally, the experimenter wants to
-design an allocation function $f:
-\mathbf{R}_+^{\mathcal{I}} \rightarrow 2^{\mathcal{I}}$, which, given the
-reported costs of the users, selects the set of users to be included in the
-experiment, and a payment function $p: \mathbf{R}_+^{\mathcal{I}} \rightarrow
-\mathbf{R}_+^{\mathcal{I}}$, which, given the reported costs, returns the
-vector of payments to allocate to each user.
-
-For notational convenience, we will assume the costs $\{c_i,
-i\in\mathcal{I}\}$ of the users are given, and we will denote by $\{s_i,
-i\in\mathcal{I}\}$ the characteristic function of $f(\{c_i,
-i\in\mathcal{I}\})$, that is, $s_i = 1$ iff $i\in f(\{c_i, i\in\mathcal{I}\})$.
-The payment received by user $i$ will be denoted by $p_i$.
-
-The mechanism should satisfy the following conditions:
-\begin{itemize}
- \item \textbf{Normalized} if $s_i = 0$ then $p_i = 0$.
- \item \textbf{Individually rational} $p_i \geq s_ic_i$.
- \item \textbf{Truthful} $p_i - s_ic_i \geq p_i' - s_i'c_i$, where $p_i'$
- and $s_i'$ are the payment and allocation of user $i$ had he reported
- a cost $c_i'$ different from his true cost $c_i$ (keeping the costs
- reported by the other users the same).
- \item \textbf{Budget feasible} the payments should be within budget:
- \begin{displaymath}
- \sum_{i\in \mathcal{I}} s_ip_i \leq B
- \end{displaymath}
-\end{itemize}
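-
-As an illustration of these conditions, here is a toy posted-price mechanism
-(a sketch in Python; this is not the mechanism of Singer discussed below, and
-the names are ours). Each user faces a take-it-or-leave-it offer, so truthful
-reporting is a dominant strategy, and the four conditions above are easy to
-verify:
-\begin{verbatim}
-def posted_price(reported_costs, budget, price):
-    """Offer `price` to users in a fixed order, selecting as
-    long as the budget allows.  Normalized, individually
-    rational, truthful and budget feasible by construction."""
-    selected, payments = set(), {}
-    remaining = budget
-    for i in sorted(reported_costs):
-        if reported_costs[i] <= price and remaining >= price:
-            selected.add(i)
-            payments[i] = price   # p_i >= c_i for selected users
-            remaining -= price    # total payments stay within B
-        else:
-            payments[i] = 0.0     # unselected users are paid 0
-    return selected, payments
-\end{verbatim}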
-
-Singer (2010) proved a lower bound of 2 on the approximation ratio of any
-polynomial-time mechanism for this problem. He also gave a general randomized
-mechanism with an approximation ratio of 117.7, although for specific problems
-(coverage, knapsack, etc.) better ratios can be attained.
-
-State of the art: Chen, Gravin and Lu give a lower bound of $1+\sqrt{2}$ and
-an upper bound of 8.34, assuming the fractional non-strategic optimization
-problem can be solved in polynomial time.
-
-\section{Value of data}
-\label{sec:data-value}
-
-Here, we discuss a choice for the value function appearing in the problems
-discussed in Section~\ref{sec:economics}. Such a value function should be at
-least normalized, positive and non-decreasing. It should furthermore capture
-a notion of \emph{uncertainty reduction} related to the learning task of the
-experimenter.
-
-The prior knowledge of the experimenter, the distribution of the random
-variable $H$ over the hypothesis set, conveys his uncertainty about the true
-hypothesis. A common measure of uncertainty is given by the entropy. If we
-denote by $P_H$ the probability distribution of $H$, its entropy is defined
-by\footnote{Here we choose to write the entropy of a discrete distribution, but
- our results do not rely on this assumption: if the distribution is
- continuous, one simply needs to replace the entropy by the differential
-entropy.}:
-\begin{displaymath}
- \mathbb{H}(H) = -\sum_{h\in A^E}P_H(h)\log\big(P_H(h)\big)
-\end{displaymath}
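-
-For instance, in the first example of Section~\ref{sec:data-model}, a uniform
-prior over a finite hypothesis set of size $n$ has entropy
-\begin{displaymath}
-    \mathbb{H}(H) = -\sum_{k=1}^{n}\frac{1}{n}\log\frac{1}{n} = \log n
-\end{displaymath}
-so measuring the uncertainty by the size of the hypothesis set is, up to
-a logarithm, a special case of measuring it by the entropy.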
-
-We will denote by $Y_i$, $i\in\mathcal{I}$ the answer of user $i$ according to
-the experimenter's knowledge:
-\begin{equation}\label{eq:data-model}
- Y_i = H(x_i) + \varepsilon_i
-\end{equation}
-and $Y_S$ will denote the set of answers from the users in a subset $S$ of
-$\mathcal{I}$.
-
-We can now define the value of a group $S$ of users as being the decrease of
-entropy induced by observing $S$:
-\begin{equation}\label{eq:value}
-    \forall S\subset \mathcal{I},\; V(S)
- = \mathbb{H}(H) - \mathbb{H}(H\,|\,Y_S)
-\end{equation}
-where $\mathbb{H}(H\,|\,Y_S)$ is the conditional entropy of $H$ given $Y_S$.
-One can also note that the definition of the value given in \eqref{eq:value} is
-simply the mutual information $I(H;Y_S)$ between $H$ and $Y_S$. Submodularity
-is preserved by left composition with a non-decreasing concave function (see
-Section~\ref{sec:submodularity}). Hence, the definition of the value of $S$ can
-be extended to any $f\big(V(S)\big)$ where $f$ is a non-decreasing concave
-function.
-
-This notion of value, also known as the \emph{value of information} (TODO: ref),
-can be motivated by the information theoretic interpretation of the entropy:
-suppose the experimenter has access to an oracle to whom he can ask yes/no
-questions. Then, the entropy of the distribution is, up to rounding, the
-expected number of questions he needs to ask to fully identify the hypothesis.
-If he needs to pay for each question asked to the oracle, then our definition
-of value directly relates to the cost decrease implied by observing a set of
-users.
-
-Using the \emph{information never hurts} principle, it is easy to see that the
-value function defined by \eqref{eq:value} is positive and non-decreasing (with
-regard to inclusion).
-
-Furthermore, if we add the natural assumption that the
-$(\varepsilon_i)_{i\in\mathcal{I}}$ are jointly independent, which is
-equivalent to saying that, conditioned on the hypothesis, the
-$(Y_i)_{i\in\mathcal{I}}$ are independent, we get the following fact.
-
-\begin{fact}\label{value-submodularity}
- Assuming that $(Y_i)_{i\in\mathcal{I}}$ defined in \eqref{eq:data-model}
- are independent conditioned on $H$, then $V$ defined in \eqref{eq:value} is
- submodular.
-\end{fact}
-
-\begin{proof}
- Using the chain rule, one can rewrite the value of $S$ as:
- \begin{displaymath}
- V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\, H)
- \end{displaymath}
- We can now use the conditional independence of $Y_S$ to write the
- conditional entropy as a sum:
- \begin{displaymath}
- V(S) = \mathbb{H}(Y_S) - \sum_{s\in S}\mathbb{H}(Y_s\,|\, H)
- \end{displaymath}
-
-    It is well known that the joint entropy of a set of random variables is
-    submodular. Thus, the last equation expresses $V$ as the sum of a submodular
-    function and an additive function. As a consequence, $V$ is submodular.
-\end{proof}
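-
-As a sanity check of Fact~\ref{value-submodularity}, here is a small numerical
-sketch (in Python; the model is a hypothetical toy example, not the general
-setup): $H$ is uniform on $\{0,1\}$ and each answer is $H$ flipped
-independently with probability $p$, so that $V$ only depends on $|S|$ and the
-marginal increments should shrink as the set grows:
-\begin{verbatim}
-import itertools, math
-
-def value(n, p=0.2):
-    """Mutual information I(H; Y_1..Y_n) for the toy model."""
-    def pr_y_given_h(y, h):
-        return math.prod(p if yi != h else 1 - p for yi in y)
-    total = 0.0
-    for y in itertools.product((0, 1), repeat=n):
-        p_y = 0.5 * (pr_y_given_h(y, 0) + pr_y_given_h(y, 1))
-        for h in (0, 1):
-            joint = 0.5 * pr_y_given_h(y, h)
-            if joint > 0:
-                total += joint * math.log(joint / (p_y * 0.5))
-    return total
-
-vals = [value(n) for n in range(5)]
-print([b - a for a, b in zip(vals, vals[1:])])  # decreasing
-\end{verbatim}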
-
-\section{Value of data in the linear regression setup}
-
-In this section we will assume a linear model: the feature vectors belong to
-$\mathbf{R}^d$ and the private variables belong to $\mathbf{R}$. The private
-variable can be expressed as a linear combination of the features:
-\begin{displaymath}
-    y = \beta^*x + \varepsilon
-\end{displaymath}
-The noise $\varepsilon$ is normally distributed, zero-mean and of variance
-$\sigma^2$. Furthermore, the noise terms are independent across users and
-identically distributed.
-
-The hypothesis set of the experimenter is the set of all linear forms from
-$\mathbf{R}^d$ to $\mathbf{R}$. Because a linear form can be uniquely
-represented as the inner product with a fixed vector, it is equivalent to say
-that the hypothesis set is $\mathbf{R}^d$ and that the experimenter's hypothesis is
-a random variable $\beta$ over $\mathbf{R}^d$.
-
-A common assumption made in linear regression is that $\beta$ is normally
-distributed, zero-mean, with covariance matrix denoted by $\Sigma$. This
-prior distribution conveys the idea that $\beta$ should have a small $L^2$
-norm, i.e., it shrinks the coefficients towards zero. Indeed, it is easy to
-prove that choosing the $\hat\beta$ which maximizes the \emph{a posteriori}
-distribution given the observations under a normal prior is equivalent to
-solving the following optimization problem:
-\begin{displaymath}
- \hat\beta = \arg\min_{\beta\in\mathbf{R}^d}\|Y-X\beta\|^2 + \lambda
- \|\beta\|^2
-\end{displaymath}
-where, for an isotropic prior $\Sigma = \tau^2 I_d$, $\lambda = \sigma^2/\tau^2$
-(a general $\Sigma$ yields the penalty $\sigma^2\beta^*\Sigma^{-1}\beta$). This
-optimization problem is known as ridge regression: a least-squares fit of the
-data with a penalty on the $\beta$'s having a large $L^2$ norm, which is
-consistent with the prior distribution.
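-
-A minimal numerical sketch of this equivalence (in Python with NumPy; the
-isotropic prior $\Sigma = \tau^2 I_d$ and all constants are our choices for
-illustration):
-\begin{verbatim}
-import numpy as np
-
-rng = np.random.default_rng(0)
-n, d, sigma, tau = 50, 3, 0.5, 1.0
-X = rng.standard_normal((n, d))
-beta_true = tau * rng.standard_normal(d)
-Y = X @ beta_true + sigma * rng.standard_normal(n)
-
-# MAP estimate under the normal prior = ridge estimator
-lam = sigma**2 / tau**2
-beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
-print(beta_hat)
-\end{verbatim}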
-
-\begin{fact}
- Under the linear regression model, with a multivariate normal prior, the
- value of data of a set $S$ of users is given by:
- \begin{equation}\label{eq:linear-regression-value}
- V(S) = \frac{1}{2}\log\det\left(I_d
- + \frac{\Sigma}{\sigma^2}X_S^*X_S\right)
- \end{equation}
-    where $X_S$ is the matrix whose rows are the row vectors $x_s^*$ for $s$ in
-    $S$.
-\end{fact}
-
-\begin{proof}
-Let us recall that the entropy of a multivariate normal variable $B$ in
-dimension $d$ and of covariance $\Sigma$ (the mean is not relevant) is given
-by:
-\begin{equation}\label{eq:multivariate-entropy}
- \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma\big)
-\end{equation}
-
-Using the chain rule as in the proof of Fact~\ref{value-submodularity} we get
-that:
-\begin{displaymath}
- V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta)
-\end{displaymath}
-
-Conditioned on $\beta$, $Y_S$ follows a multivariate normal distribution of
-mean $X_S\beta$ and covariance matrix $\sigma^2 I_n$, where $n = |S|$. Hence:
-\begin{equation}\label{eq:h1}
- \mathbb{H}(Y_S\,|\,\beta)
- = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
-\end{equation}
-
-$Y_S$ also follows a multivariate normal distribution of mean zero. Denoting
-by $E$ the vector of noise terms, which is independent of $\beta$, its
-covariance matrix $\Sigma_Y$ is:
-\begin{align*}
-    \Sigma_Y & = \expt{Y_SY_S^*} = \expt{(X_S\beta + E)(X_S\beta + E)^*}\\
-    & = X_S\Sigma X_S^* + \sigma^2I_n
-\end{align*}
-where the cross terms vanish since $\beta$ and $E$ are independent and
-zero-mean.
-Thus, we get that:
-\begin{equation}\label{eq:h2}
- \mathbb{H}(Y_S)
- = \frac{1}{2}\log\left((2\pi e)^n \det(X_S\Sigma X_S^* + \sigma^2 I_n)\right)
-\end{equation}
-
-Combining \eqref{eq:h1} and \eqref{eq:h2} we get:
-\begin{displaymath}
- V(S) = \frac{1}{2}\log\det\left(I_n+\frac{1}{\sigma^2}X_S\Sigma
- X_S^*\right)
-\end{displaymath}
-
-Finally, we can use Sylvester's determinant identity,
-$\det(I_n + AB) = \det(I_d + BA)$, to get the result.
-\end{proof}
-
-\emph{Remarks.}
-\begin{enumerate}
- \item it is known that for a set of symmetric positive definite matrices,
- defining the value of a set to be the $\log\det$ of the sum of the
- matrices yields a non-decreasing, submodular value function. Noting that:
- \begin{displaymath}
- X_S^*X_S = \sum_{s\in S}x_sx_s^*
- \end{displaymath}
- it is clear that our value function is non-decreasing and submodular.
- The positivity follows from a direct application of the spectral
- theorem.
- \item the matrix which appears in the value function:
- \begin{displaymath}
- I_d + \frac{\Sigma}{\sigma^2}X_S^*X_S
- \end{displaymath}
-        is also the inverse of the covariance matrix of the ridge regression
-        estimator. In optimal experimental design, it is common to use the
-        determinant of the inverse of the estimator's covariance matrix as
-        a measure of the quality of the prediction (this is the D-optimality
-        criterion). Indeed, it directly relates to the inverse of the volume
-        of the confidence ellipsoid.
-    \item this value function can be computed up to a fixed decimal precision
-        in polynomial time, as the sketch after this list illustrates.
-\end{enumerate}
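-
-A minimal sketch of this computation (in Python with NumPy; the helper name is
-ours):
-\begin{verbatim}
-import numpy as np
-
-def data_value(X_S, Sigma, sigma2):
-    """V(S) = (1/2) log det(I_d + Sigma X_S^* X_S / sigma^2),
-    computed via a numerically stable slogdet."""
-    d = Sigma.shape[0]
-    M = np.eye(d) + Sigma @ X_S.T @ X_S / sigma2
-    _, logdet = np.linalg.slogdet(M)
-    return 0.5 * logdet
-\end{verbatim}
-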
-\subsubsection*{Marginal contribution}
-
-Here, we want to compute the marginal contribution of a point $x$ to a set $S$ of
-users:
-\begin{displaymath}
- \Delta_xV(S) = V(S\cup \{x\}) - V(S)
-\end{displaymath}
-
-Using that:
-\begin{displaymath}
- X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
-\end{displaymath}
-we get:
-\begin{align*}
- \Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d
- + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)\\
- & - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
- & = \frac{1}{2}\log\det\left(I_d + xx^*\left(\sigma^2\Sigma^{-1} +
-X_S^*X_S\right)^{-1}\right)\\
-& = \frac{1}{2}\log\left(1 + x^*\left(\sigma^2\Sigma^{-1}
-+ X_S^*X_S\right)^{-1}x\right)
-\end{align*}
-
-\emph{Remark.} This formula shows that, given a set $S$, users do not all
-bring the same contribution to the set $S$. The contribution of a user depends
-on the norm of $x$ under the bilinear form defined by the matrix
-$(\sigma^2\Sigma^{-1} + X_S^*X_S)^{-1}$, which reflects how well the new point
-$x$ \emph{aligns} with the already existing points.
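-
-A short numerical sketch of this marginal contribution formula (in Python with
-NumPy; the helper name is ours):
-\begin{verbatim}
-import numpy as np
-
-def marginal_value(x, X_S, Sigma, sigma2):
-    """Delta_x V(S) = (1/2) log(1 + x^* M^{-1} x)
-    with M = sigma^2 Sigma^{-1} + X_S^* X_S."""
-    M = sigma2 * np.linalg.inv(Sigma) + X_S.T @ X_S
-    return 0.5 * np.log1p(x @ np.linalg.solve(M, x))
-\end{verbatim}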
-
-\section*{Appendix: Submodularity}
-\label{sec:submodularity}
-
-In this section, we will consider that we are given a \emph{universe} set $U$.
-A set function $f$ is a function defined on the power set of $U$, $\mathfrak{P}(U)$.
-
-A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{increasing}
-if it is increasing with regard to inclusion, that is:
-
-\begin{displaymath}
- \forall\,S\subseteq T,\quad f(S)\leq f(T)
-\end{displaymath}
-
-A \emph{decreasing} function on $\mathfrak{P}(U)$ is defined similarly.
-
-A set function $f$ defined on $\mathfrak{P}(U)$ is said to be
-\emph{submodular} if it satisfies the diminishing returns property, that is,
-if the marginal increment from adding a point to a set is a decreasing set
-function. More formally, for any point $x$ in $U$, we define the marginal
-increment of $f$ with respect to $x$ as the set function:
-\begin{displaymath}
-    \Delta_x f(S) = f(S\cup\{x\}) - f(S)
-\end{displaymath}
-Then, $f$ is \emph{submodular} if and only if, for all $x$ in $U$, $\Delta_x f$
-is a decreasing set function.
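-
-For example, if $g:\mathbf{R}\rightarrow\mathbf{R}$ is concave, the function
-$f(S) = g(|S|)$ is submodular: its marginal increment
-$\Delta_x f(S) = g(|S|+1) - g(|S|)$ for $x\notin S$ decreases as $S$ grows,
-by concavity of $g$.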
-
-Similarly, a \emph{supermodular} function is a function whose marginal
-increments are increasing set functions.
-\begin{prop}
-    Let $R:\mathbf{R}\rightarrow \mathbf{R}$ be an increasing concave function
-    and $f:\mathfrak{P}(U)\rightarrow\mathbf{R}$ be an increasing submodular
-    function, then the composed function $R\circ f$ is increasing and
-    submodular.
-\end{prop}
-
-\begin{proof}
-    That $R\circ f$ is increasing follows immediately from the increasingness
-    of $R$ and $f$.
-
-    For the submodularity, let $S$ and $T$ be two sets such that $S\subseteq
-    T$ and let $x$ be a point of $U$. Write $a = f(S)$, $b = f(S\cup\{x\})$,
-    $c = f(T)$ and $d = f(T\cup\{x\})$. Since $f$ is increasing, $a\leq b$,
-    $c\leq d$ and $a\leq c$; since $f$ is submodular:
-    \begin{displaymath}
-        0\leq d - c\leq b - a
-    \end{displaymath}
-
-    $R$ is concave, so its increments are decreasing: for all $\delta\geq 0$
-    and $a\leq c$,
-    \begin{displaymath}
-        R(c+\delta) - R(c)\leq R(a+\delta) - R(a)
-    \end{displaymath}
-    Applying this inequality with $\delta = d - c$ yields:
-    \begin{displaymath}
-        R(d) - R(c)\leq R(a+d-c) - R(a)
-    \end{displaymath}
-
-    Moreover, $d - c\leq b - a$ implies $a+d-c\leq b$, and $R$ is increasing,
-    so:
-    \begin{displaymath}
-        R(a+d-c) - R(a)\leq R(b) - R(a)
-    \end{displaymath}
-
-    Combining the two previous inequalities, we get:
-    \begin{displaymath}
-        R\big(f(T\cup\{x\})\big) - R\big(f(T)\big)
-        \leq R\big(f(S\cup\{x\})\big) - R\big(f(S)\big)
-    \end{displaymath}
-    that is, $\Delta_x (R\circ f)$ is a decreasing set function, which is
-    exactly the submodularity of $R\circ f$.
-\end{proof}
-
-\end{document}
-