| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-06 01:35:45 +0200 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-06 01:35:45 +0200 |
| commit | 316875f370ca33b82033072b03da0731d05b851f (patch) | |
| tree | 524fd1a6bc9a3be6eab748741e6c81716aa76f0c /notes.tex | |
| parent | 429adf8b0db780e579570c664edce09c1a117a37 (diff) | |
| download | recommendation-316875f370ca33b82033072b03da0731d05b851f.tar.gz | |
Rewrite of the introduction
Diffstat (limited to 'notes.tex')
-rw-r--r--	notes.tex	122
1 file changed, 78 insertions, 44 deletions
@@ -14,66 +14,99 @@
 \newcommand{\ip}[2]{\langle #1, #2 \rangle}
 \newcommand{\mse}{\mathop{\mathrm{MSE}}}
 \DeclareMathOperator{\trace}{tr}
+\title{Value of data: an approach to the economics of user data}
 \begin{document}
+\maketitle
 \section{Introduction}
+With the recent development of communication technologies, user data is now
+present everywhere. It is also clear that there is some notion of value
+attached to user data: this is particularly obvious when considering the
+market for targeted advertising on the Web. However, there is no clear notion
+of what a good definition of the value of data should be.
-\subsection{Framework}
-We are given a set of users, each user has a vector of features which is known
-to us and allows us to distinguish this user from the others. We are interested
-in studying a \emph{survey} in which a subset of users will be selected and
-queried.
+The goal of this work is to propose a framework to study the value of data.
+This framework, inspired both by the fields of Information Theory and
+Statistical Learning, leads to a natural definition of the value of data, which
+can then be used to answer economic questions regarding the use of this data in
+a data market.
-We assume there is a hypothesis function such that, when submitted to a query,
-the answer of a specific user to the query will be equal to the function
-applied to his feature vector with some noise added. The noise is considered to
-be independent of the user, which is equivalent to say that conditioned on the
-hypothesis, the answers are independent.
+\section{Value of data}
-The hypothesis is unknown to us, and the answers of the users are going to be
-used to learn it. Because we do not know the hypothesis, we maintain
-a distribution over the hypothesis set, and this distribution is going to be
-updated after the survey. The meaning of this distribution is that it conveys
-the uncertainty we have about the hypothesis. More precisely the uncertainty is
-captured by the \emph{entropy} of the distribution.
+There is a set of users that can be queried. We assume that, when queried,
+a user's answer to the query is a function of his intrinsic characteristics,
+distorted by some noise.
-Thus, the setup takes the form of an \emph{active learning} setup: we want to
-select the most interesting users before conducting the survey. Here, ``most
-interesting'' refers to a notion of \emph{value} implied by the model: the
-value of a subset of users will be defined as the decrease of entropy caused by
-the observation of the answers of this group of users.
+An experimenter is about to conduct a survey, in which he is going to select
+a subset of users and query them. The answers will be used to learn the
+function underlying the answers of the users.
+
+Because the experimenter does not know the function beforehand, he maintains
+a probability distribution over the set of possible functions. The meaning of
+this distribution is that it conveys the uncertainty of the experimenter about
+the underlying function. Observing the users' answers during the survey allows
+him to update his knowledge of the underlying function, thus reducing his
+uncertainty.
+
+A classical measure of uncertainty is the entropy, which we use to define the
+value of the data of a subset of users: \emph{the value of a subset of users
+is the decrease of entropy induced by observing the users' answers to
+a query}.
 The motivation of this definition of value makes sense when considering the
-information theoretic interpretation of the entropy: let us consider that we
-have an oracle to whom we can ask yes/no questions about the hypothesis. Then,
-the entropy of the hypothesis distribution is exactly the number of questions
-we need to ask to fully know the hypothesis. If we need to pay for each
-question asked to the oracle, then our definition of value directly relates to
-the cost decrease implied by observing a set of users.
+information-theoretic interpretation of the entropy: suppose that the
+experimenter has access to an oracle to whom he can ask yes/no questions. Then,
+the entropy of the distribution is exactly the number of questions he needs to
+ask to fully know the underlying function. If he needs to pay for each question
+asked to the oracle, then our definition of value directly relates to the cost
+decrease implied by observing a set of users.
-\subsection{Problem}
+We will now formalize the above definition. The set of users is indexed by an
+index set $I$. The intrinsic characteristics of user $i$ are denoted by $x_i$,
+a vector belonging to a feature space $E$. The answers of the users, when
+queried, belong to the answer space $A$. The underlying function $f$,
+expressing how the answers relate to the feature vectors, is a mapping from
+$E$ to $A$. The answer of user $i$ will be denoted by $y_i$. Thus, the answer
+model takes the form:
+\begin{displaymath}
+    y_i = f(x_i) + \varepsilon_i
+\end{displaymath}
+where $\varepsilon_i$ is the error distorting the answer of user $i$, which is
+a random variable over $A$. We assume that the errors are mutually independent.
+
+The experimenter's prior knowledge of $f$ is a random variable $F$ over $A^E$.
+The entropy of $F$ is denoted by $H(F)$:
+\begin{displaymath}
+    H(F) = -\sum_{f\in A^E} P(F = f)\log\big(P(F = f)\big)
+\end{displaymath}
-Once you have defined a notion of value, there are several problems you can
-address:
+If $S$ is a subset of users, we will denote by $A_S$ the set of answers given
+by these users, $A_S = \{y_i,\, i\in S\}$. We can now give the definition of
+the value $V:2^I\rightarrow \mathbf{R}^+$:
+\begin{displaymath}
+    \forall S\in 2^I,\; V(S) = H(F) - H(F\,|\, A_S)
+\end{displaymath}
+where $H(F\,|\, A_S)$ denotes the conditional entropy of $F$ given $A_S$.
+
+\section{Economics of data}
+
+Once a definition of value is available, several natural economic problems
+arise, which can be expressed independently of the definition of the value
+function:
 \begin{enumerate}
-    \item \emph{budget auctions:} the points in the database are associated
-        with users. Each user $i$ has a cost $c_i$ for revealing his data. You
-        are given a budget $B$ and you want to select a subset $S$ of users and
-        a payment scheme which is truthful. The value of the subset $S$ should
-        be as big as possible with the constraint that the payments should stay
-        within budget.
-    \item \emph{value sharing:} the goal is to find a value sharing method to
-        split the whole value of a subset $S$ among its participants.
+    \item \emph{budget-feasible auctions:} each user $i$ has a cost $c_i$ for
+        revealing his data. You are given a budget $B$ and you want to select
+        a subset of players and a payment scheme such that the value of the
+        subset is maximal, subject to the constraint that the sum of the
+        payments stays within the budget. The mechanism must be truthful.
+    \item \emph{cost sharing:}
     \item \emph{k-best auctions:} this one is very similar to \emph{budget
         auctions}, but instead of having a budget constraint, the constraint
        is on the number of people in the subset $S$.
 \end{enumerate}
-\emph{[TODO: add references to the papers mentioning these problems, what we
-know about the value functions defined above, etc.]}
-
-These problems have already been studied, and optimal or near-optimal solution
-are already known when the value function is submodular.
+\textbf{TODO} add references. Explain that our value function is submodular and
+write the theorems we get when applying previous results to our specific case.
 \section{Value of data in the Gaussian world}
@@ -206,7 +239,7 @@ we get:
     + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
 \end{align}
-\section{Submodularity}
+\section*{Appendix: Submodularity}
 In this section, we will consider that we are given a \emph{universe} set $U$.
 A set function $f$ is a function defined on the power set of $U$,
 $\mathfrak{P}(U)$.
@@ -293,3 +326,4 @@ increasing functions.
 \end{proof}
 \end{document}
+
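The entropy-based value function introduced in this commit, $V(S) = H(F) - H(F\,|\,A_S)$, can be sketched numerically. Below is a minimal toy example, not part of the notes themselves: the binary feature space, the uniform prior over the four hypotheses, and the 10% answer-flip noise are all illustrative assumptions chosen to keep the computation exact by enumeration.

```python
import itertools
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy instance: features E = {0, 1}, answers A = {0, 1}; a hypothesis
# f: E -> A is encoded as the tuple (f(0), f(1)). Uniform prior over A^E.
HYPOTHESES = list(itertools.product([0, 1], repeat=2))
PRIOR = {f: 1 / len(HYPOTHESES) for f in HYPOTHESES}
FLIP = 0.1  # each answer y_i = f(x_i) is flipped with probability FLIP

def likelihood(f, xs, ys):
    """P(observed answers ys | hypothesis f) for users with features xs."""
    p = 1.0
    for x, y in zip(xs, ys):
        p *= (1 - FLIP) if f[x] == y else FLIP
    return p

def value(xs):
    """V(S) = H(F) - H(F | A_S) for a set of users with feature list xs."""
    h_prior = entropy(PRIOR.values())
    h_post = 0.0
    # Conditional entropy: average posterior entropy over all answer tuples.
    for ys in itertools.product([0, 1], repeat=len(xs)):
        joint = {f: PRIOR[f] * likelihood(f, xs, ys) for f in HYPOTHESES}
        p_ys = sum(joint.values())
        if p_ys > 0:
            h_post += p_ys * entropy(q / p_ys for q in joint.values())
    return h_prior - h_post

v1 = value([0])      # one user with feature 0; ~0.53 bits out of H(F) = 2
v2 = value([0, 1])   # two users covering both features
print(v1, v2)
```

The numbers also illustrate the diminishing-returns behaviour the TODO above refers to: a second user with the same feature vector contributes strictly less than the first (`value([0, 0]) < 2 * value([0])`), which is the submodularity property the notes claim for this value function.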
