| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-06 01:35:45 +0200 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-06 01:35:45 +0200 |
| commit | 316875f370ca33b82033072b03da0731d05b851f (patch) | |
| tree | 524fd1a6bc9a3be6eab748741e6c81716aa76f0c /notes.tex | |
| parent | 429adf8b0db780e579570c664edce09c1a117a37 (diff) | |
| download | recommendation-316875f370ca33b82033072b03da0731d05b851f.tar.gz | |
Rewrite of the introduction
Diffstat (limited to 'notes.tex')
-rw-r--r--	notes.tex	122
1 file changed, 78 insertions, 44 deletions
@@ -14,66 +14,99 @@
 \newcommand{\ip}[2]{\langle #1, #2 \rangle}
 \newcommand{\mse}{\mathop{\mathrm{MSE}}}
 \DeclareMathOperator{\trace}{tr}
+\title{Value of data: an approach to the economics of user data}
 \begin{document}
+\maketitle
 \section{Introduction}
+With the recent development of communication technologies, user data is now
+present everywhere. It is also clear that there is some notion of value
+attached to user data: this is particularly obvious when considering the
+market for targeted advertising on the Web. However, there is no clear notion
+of what a good definition of the value of data should be.
-\subsection{Framework}
-We are given a set of users, each user has a vector of features which is known
-to us and allows us to distinguish this user from the others. We are interested
-in studying a \emph{survey} in which a subset of users will be selected and
-queried.
+The goal of this work is to propose a framework to study the value of data.
+This framework, inspired both by the fields of Information Theory and
+Statistical Learning, leads to a natural definition of the value of data, which
+can then be used to answer economic questions regarding the use of this data in
+a data market.
-We assume there is a hypothesis function such that, when submitted to a query,
-the answer of a specific user to the query will be equal to the function
-applied to his feature vector with some noise added. The noise is considered to
-be independent of the user, which is equivalent to say that conditioned on the
-hypothesis, the answers are independent.
+\section{Value of data}
-The hypothesis is unknown to us, and the answers of the users are going to be
-used to learn it. Because we do not know the hypothesis, we maintain
-a distribution over the hypothesis set, and this distribution is going to be
-updated after the survey. The meaning of this distribution is that it conveys
-the uncertainty we have about the hypothesis. More precisely the uncertainty is
-captured by the \emph{entropy} of the distribution.
+There is a set of users that can be queried. We assume that, when queried,
+a user's answer to the query is a function of his intrinsic characteristics,
+distorted by some noise.
-Thus, the setup takes the form of an \emph{active learning} setup: we want to
-select the most interesting users before conducting the survey. Here, ``most
-interesting'' refers to a notion of \emph{value} implied by the model: the
-value of a subset of users will be defined as the decrease of entropy caused by
-the observation of the answers of this group of users.
+An experimenter is about to conduct a survey, in which he is going to select
+a subset of users and query them. The answers will be used to learn the
+function underlying the answers of the users.
+
+Because the experimenter does not know the function beforehand, he maintains
+a probability distribution over the set of possible functions. The meaning of
+this distribution is that it conveys the uncertainty of the experimenter about
+the underlying function. Observing the users' answers during the survey allows
+him to update his knowledge of the underlying function, thus reducing his
+uncertainty.
+
+A classical measure of uncertainty is the entropy, which we use to define the
+value of the data of a subset of users: \emph{the value of a subset of users
+is the decrease of entropy induced by observing the users' answers to
+a query}.
 The motivation of this definition of value makes sense when considering the
-information theoretic interpretation of the entropy: let us consider that we
-have an oracle to whom we can ask yes/no questions about the hypothesis. Then,
-the entropy of the hypothesis distribution is exactly the number of questions
-we need to ask to fully know the hypothesis. If we need to pay for each
-question asked to the oracle, then our definition of value directly relates to
-the cost decrease implied by observing a set of users.
+information-theoretic interpretation of the entropy: suppose that the
+experimenter has access to an oracle to whom he can ask yes/no questions. Then,
+the entropy of the distribution is exactly the number of questions he needs to
+ask to fully know the underlying function. If he needs to pay for each question
+asked to the oracle, then our definition of value directly relates to the cost
+decrease implied by observing a set of users.
-\subsection{Problem}
+We will now formalize the above definition. The set of users is indexed by an
+index set $I$. The intrinsic characteristics of user $i$ are denoted by $x_i$,
+a vector belonging to a feature space $E$. The answers of the users, when
+queried, belong to the answer space $A$. The underlying function $f$,
+expressing how the answers relate to the feature vectors, is a mapping from
+$E$ to $A$. The answer of user $i$ will be denoted by $y_i$. Thus, the answer
+model takes the form:
+\begin{displaymath}
+    y_i = f(x_i) + \varepsilon_i
+\end{displaymath}
+where $\varepsilon_i$ is the error distorting the answer of user $i$, which is
+a random variable over $A$. We assume that the errors are mutually independent.
+
+The experimenter's prior knowledge of $f$ is a random variable $F$ over $A^E$.
+The entropy of $F$ is denoted by $H(F)$:
+\begin{displaymath}
+    H(F) = -\sum_{f\in A^E} P(F = f)\log\big(P(F = f)\big)
+\end{displaymath}
-Once you have defined a notion of value, there are several problems you can
-address:
+If $S$ is a subset of users, we will denote by $A_S$ the set of answers given
+by these users, $A_S = \{y_i,\, i\in S\}$. We can now give the definition of
+the value $V:2^I\rightarrow \mathbf{R}^+$:
+\begin{displaymath}
+    \forall S\in 2^I,\; V(S) = H(F) - H(F\,|\, A_S)
+\end{displaymath}
+where $H(F\,|\, A_S)$ denotes the conditional entropy of $F$ given $A_S$.
+
+\section{Economics of data}
+
+Once a definition of value is available, several natural economic problems
+arise, which can be expressed independently of the definition of the value
+function:
 \begin{enumerate}
-    \item \emph{budget auctions:} the points in the database are associated
-        with users. Each user $i$ has a cost $c_i$ for revealing his data. You
-        are given a budget $B$ and you want to select a subset $S$ of users and
-        a payment scheme which is truthful. The value of the subset $S$ should
-        be as big as possible with the constraint that the payments should stay
-        within budget.
-    \item \emph{value sharing:} the goal is to find a value sharing method to
-        split the whole value of a subset $S$ among its participants.
+    \item \emph{budget-feasible auctions:} each user $i$ has a cost $c_i$ for
+        revealing his data. You are given a budget $B$ and you want to select
+        a subset of players and a payment scheme such that the value of the
+        subset is maximal, subject to the constraint that the sum of the
+        payments stays within the budget. The mechanism must be truthful.
+    \item \emph{cost sharing:}
     \item \emph{k-best auctions:} this one is very similar to \emph{budget
         auctions}, but instead of having a budget constraint, the constraint
        is on the number of people in the subset $S$.
 \end{enumerate}
-\emph{[TODO: add references to the papers mentioning these problems, what we
-know about the value functions defined above, etc.]}
-
-These problems have already been studied, and optimal or near-optimal solution
-are already known when the value function is submodular.
+\textbf{TODO} add references. Explain that our value function is submodular and
+write the theorems we get when applying previous results to our specific case.
 \section{Value of data in the Gaussian world}
@@ -206,7 +239,7 @@ we get:
     + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
 \end{align}
-\section{Submodularity}
+\section*{Appendix: Submodularity}
 In this section, we will consider that we are given a \emph{universe} set $U$.
 A set function $f$ is a function defined on the power set of $U$,
 $\mathfrak{P}(U)$.
@@ -293,3 +326,4 @@ increasing functions.
 \end{proof}
 \end{document}
+
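The entropy-based value function introduced in this commit, $V(S) = H(F) - H(F\,|\,A_S)$, can be sketched numerically. Below is a minimal toy example, not part of the notes themselves: the binary feature space, the uniform prior over the four hypotheses, and the 10% answer-flip noise are all illustrative assumptions chosen to keep the computation exact by enumeration.

```python
import itertools
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy instance: features E = {0, 1}, answers A = {0, 1}; a hypothesis
# f: E -> A is encoded as the tuple (f(0), f(1)). Uniform prior over A^E.
HYPOTHESES = list(itertools.product([0, 1], repeat=2))
PRIOR = {f: 1 / len(HYPOTHESES) for f in HYPOTHESES}
FLIP = 0.1  # each answer y_i = f(x_i) is flipped with probability FLIP

def likelihood(f, xs, ys):
    """P(observed answers ys | hypothesis f) for users with features xs."""
    p = 1.0
    for x, y in zip(xs, ys):
        p *= (1 - FLIP) if f[x] == y else FLIP
    return p

def value(xs):
    """V(S) = H(F) - H(F | A_S) for a set of users with feature list xs."""
    h_prior = entropy(PRIOR.values())
    h_post = 0.0
    # Conditional entropy: average posterior entropy over all answer tuples.
    for ys in itertools.product([0, 1], repeat=len(xs)):
        joint = {f: PRIOR[f] * likelihood(f, xs, ys) for f in HYPOTHESES}
        p_ys = sum(joint.values())
        if p_ys > 0:
            h_post += p_ys * entropy(q / p_ys for q in joint.values())
    return h_prior - h_post

v1 = value([0])      # one user with feature 0; ~0.53 bits out of H(F) = 2
v2 = value([0, 1])   # two users covering both features
print(v1, v2)
```

The numbers also illustrate the diminishing-returns behaviour the TODO above refers to: a second user with the same feature vector contributes strictly less than the first (`value([0, 0]) < 2 * value([0])`), which is the submodularity property the notes claim for this value function.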
