path: root/notes.tex
author    Thibaut Horel <thibaut.horel@gmail.com>    2012-09-06 01:35:45 +0200
committer Thibaut Horel <thibaut.horel@gmail.com>    2012-09-06 01:35:45 +0200
commit    316875f370ca33b82033072b03da0731d05b851f (patch)
tree      524fd1a6bc9a3be6eab748741e6c81716aa76f0c /notes.tex
parent    429adf8b0db780e579570c664edce09c1a117a37 (diff)
download  recommendation-316875f370ca33b82033072b03da0731d05b851f.tar.gz
Rewrite of the introduction
Diffstat (limited to 'notes.tex')
-rw-r--r--    notes.tex    122
1 file changed, 78 insertions(+), 44 deletions(-)
diff --git a/notes.tex b/notes.tex
index f6d1cb7..db2b6a3 100644
--- a/notes.tex
+++ b/notes.tex
@@ -14,66 +14,99 @@
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\mse}{\mathop{\mathrm{MSE}}}
\DeclareMathOperator{\trace}{tr}
+\title{Value of data: an approach to the economics of user data}
\begin{document}
+\maketitle
\section{Introduction}
+With the recent development of communication technologies, user data is now
+ubiquitous. It is also clear that some notion of value is attached to user
+data: this is particularly obvious in the market for targeted advertising on
+the Web. However, there is no consensus on what a good definition of the
+value of data should be.
-\subsection{Framework}
-We are given a set of users, each user has a vector of features which is known
-to us and allows us to distinguish this user from the others. We are interested
-in studying a \emph{survey} in which a subset of users will be selected and
-queried.
+The goal of this work is to propose a framework for studying the value of
+data. This framework, inspired by both Information Theory and Statistical
+Learning, leads to a natural definition of the value of data, which can then
+be used to answer economic questions regarding the use of this data in
+a data market.
-We assume there is a hypothesis function such that, when submitted to a query,
-the answer of a specific user to the query will be equal to the function
-applied to his feature vector with some noise added. The noise is considered to
-be independent of the user, which is equivalent to say that conditioned on the
-hypothesis, the answers are independent.
+\section{Value of data}
-The hypothesis is unknown to us, and the answers of the users are going to be
-used to learn it. Because we do not know the hypothesis, we maintain
-a distribution over the hypothesis set, and this distribution is going to be
-updated after the survey. The meaning of this distribution is that it conveys
-the uncertainty we have about the hypothesis. More precisely the uncertainty is
-captured by the \emph{entropy} of the distribution.
+There is a set of users that can be queried. We assume that when a user is
+queried, his answer to the query question is a function of his intrinsic
+characteristics, distorted by some noise.
-Thus, the setup takes the form of an \emph{active learning} setup: we want to
-select the most interesting users before conducting the survey. Here, ``most
-interesting'' refers to a notion of \emph{value} implied by the model: the
-value of a subset of users will be defined as the decrease of entropy caused by
-the observation of the answers of this group of users.
+An experimenter is about to conduct a survey, in which he will select
+a subset of users and query them. The answers will be used to learn the
+function underlying the users' answers.
+
+Because the experimenter does not know the function beforehand, he maintains
+a probability distribution over the set of possible functions. This
+distribution conveys the experimenter's uncertainty about the underlying
+function. Observing the users' answers during the survey allows him to update
+his knowledge of the underlying function, thus reducing his uncertainty.
+
+A classical measure of uncertainty is the entropy, which we use to define the
+value of the data of a subset of users: \emph{the value of a subset of users
+is the decrease in entropy induced by observing these users' answers to
+a query}.
The motivation of this definition of value makes sense when considering the
-information theoretic interpretation of the entropy: let us consider that we
-have an oracle to whom we can ask yes/no questions about the hypothesis. Then,
-the entropy of the hypothesis distribution is exactly the number of questions
-we need to ask to fully know the hypothesis. If we need to pay for each
-question asked to the oracle, then our definition of value directly relates to
-the cost decrease implied by observing a set of users.
+information theoretic interpretation of the entropy: suppose the experimenter
+has access to an oracle to whom he can ask yes/no questions. Then, the
+entropy of the distribution is (up to one bit) the expected number of
+questions he needs to ask to fully determine the function. If he needs to pay
+for each question asked to the oracle, then our definition of value directly
+relates to the cost decrease implied by observing a set of users.
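This yes/no-question interpretation can be checked numerically. A minimal sketch (not part of the notes): with a uniform prior over eight candidate functions, identifying the true one takes exactly $\log_2 8 = 3$ binary questions, which matches the entropy of the prior.

```python
import math

def entropy_bits(p):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Uniform prior over 8 candidate functions: pinning down the true one
# with yes/no questions takes log2(8) = 3 questions, equal to the entropy.
uniform = [1 / 8] * 8
print(entropy_bits(uniform))  # 3.0
```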
-\subsection{Problem}
+We will now formalize the above definition. The set of users is indexed by an
+index set $I$. The intrinsic characteristics of user $i$ are denoted by $x_i$,
+a vector belonging to a feature space $E$. The answer of a user, when
+queried, belongs to the answer space $A$. The underlying function $f$,
+expressing how the answer relates to the feature vector, is a mapping from
+$E$ to $A$. The answer of user $i$ will be denoted by $y_i$. Thus, the answer
+model takes the form:
+\begin{displaymath}
+ y_i = f(x_i) + \varepsilon_i
+\end{displaymath}
+where $\varepsilon_i$ is the error distorting the answer of user $i$, which is
+a random variable over $A$. We assume that the errors are mutually independent.
+
+The experimenter's prior knowledge of $f$ is a random variable $F$ over $A^E$.
+The entropy of $F$ is denoted by $H(F)$:
+\begin{displaymath}
  H(F) = -\sum_{f\in A^E} P(F = f)\log\big(P(F = f)\big)
+\end{displaymath}
-Once you have defined a notion of value, there are several problems you can
-address:
+If $S$ is a subset of users, we will denote by $Y_S$ the set of answers given
+by these users, $Y_S = \{y_i,\, i\in S\}$. We can now give the definition of
+the value $V:2^I\rightarrow \mathbf{R}^+$:
+\begin{displaymath}
  \forall S\in 2^I,\; V(S) = H(F) - H(F\,|\, Y_S)
+\end{displaymath}
+where $H(F\,|\, Y_S)$ denotes the conditional entropy of $F$ given $Y_S$.
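This definition can be computed exactly for a small discrete model. A minimal sketch, not from the notes: two users with binary features and binary answers, a uniform prior over all four functions $\{0,1\}\to\{0,1\}$, and noise that flips each answer independently with probability $\varepsilon = 0.1$.

```python
import itertools
import math

def H(p):
    """Shannon entropy (base 2) of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

eps = 0.1                    # flip probability of the binary noise (toy value)
x = [0, 1]                   # feature of each user (toy values)
funcs = list(itertools.product([0, 1], repeat=2))  # all f: {0,1} -> {0,1}
prior = {f: 1 / len(funcs) for f in funcs}         # uniform prior

def likelihood(f, S, answers):
    """P(answers of the users in S | F = f) with independent flip noise."""
    p = 1.0
    for i, a in zip(S, answers):
        p *= (1 - eps) if a == f[x[i]] else eps
    return p

def value(S):
    """V(S) = H(F) - H(F | answers of S), computed by exhaustive enumeration."""
    h_prior = H(prior.values())
    h_cond = 0.0
    for answers in itertools.product([0, 1], repeat=len(S)):
        p_a = sum(prior[f] * likelihood(f, S, answers) for f in funcs)
        posterior = [prior[f] * likelihood(f, S, answers) / p_a for f in funcs]
        h_cond += p_a * H(posterior)
    return h_prior - h_cond

print(value(()), value((0,)), value((0, 1)))
```

As expected, the empty set has zero value and the value grows as more users are observed.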
+
+\section{Economics of data}
+
+Once a notion of value is defined, several natural economic problems arise;
+they can be stated independently of the particular definition of the value
+function:
\begin{enumerate}
- \item \emph{budget auctions:} the points in the database are associated
- with users. Each user $i$ has a cost $c_i$ for revealing his data. You
- are given a budget $B$ and you want to select a subset $S$ of users and
- a payment scheme which is truthful. The value of the subset $S$ should
- be as big as possible with the constraint that the payments should stay
- within budget.
- \item \emph{value sharing:} the goal is to find a value sharing method to
- split the whole value of a subset $S$ among its participants.
+ \item \emph{budget-feasible auctions:} each user $i$ has a cost $c_i$ for
+ revealing his data. Given a budget $B$, the goal is to select a subset
+ of users and a payment scheme such that the value of the subset is
+ maximal, under the constraint that the sum of the payments stays
+ within the budget. The mechanism must be truthful.
+ \item \emph{cost sharing:} find a method to split the total value of
+ a subset $S$ among its participants.
\item \emph{k-best auctions:} this one is very similar to \emph{budget
auctions}, but instead of having a budget constraint, the constraint is
on the number of people in the subset $S$.
\end{enumerate}
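For monotone submodular value functions, the \emph{k-best} problem admits the classical greedy $(1 - 1/e)$-approximation of Nemhauser, Wolsey and Fisher. A minimal sketch, with a placeholder modular value function for illustration:

```python
def greedy_k_best(users, value, k):
    """Greedily pick k users, each round adding the user with the largest
    marginal gain; for monotone submodular value functions this achieves
    a (1 - 1/e) approximation (Nemhauser-Wolsey-Fisher)."""
    S = []
    for _ in range(k):
        best = max((u for u in users if u not in S),
                   key=lambda u: value(S + [u]) - value(S))
        S.append(best)
    return S

# Toy modular value: the value of a set is the sum of individual weights.
weights = {"a": 3.0, "b": 1.0, "c": 2.0}
v = lambda S: sum(weights[u] for u in S)
print(greedy_k_best(list(weights), v, 2))  # ['a', 'c']
```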
-\emph{[TODO: add references to the papers mentioning these problems, what we
-know about the value functions defined above, etc.]}
-
-These problems have already been studied, and optimal or near-optimal solution
-are already known when the value function is submodular.
+\textbf{TODO} add references. Explain that our value function is submodular and
+write the theorems we get when applying previous results to our specific case.
\section{Value of data in the Gaussian world}
@@ -206,7 +239,7 @@ we get:
+ \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
\end{align}
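In the Gaussian linear model, the entropy decrease has a closed form: assuming a $\mathcal{N}(0, I)$ prior on the weight vector and i.i.d. Gaussian answer noise of variance $\sigma^2$ (a standard fact for Bayesian linear regression), the information gain of observing a subset $S$ with design matrix $X_S$ is $\frac{1}{2}\log\det\left(I + X_S X_S^*/\sigma^2\right)$. A numerical sketch with toy data:

```python
import numpy as np

def gaussian_value(X_S, sigma2):
    """Information gain (in nats) of observing the answers of the users whose
    feature rows form X_S, under a N(0, I) prior on the weight vector and
    i.i.d. N(0, sigma2) answer noise (toy assumptions)."""
    n = X_S.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + X_S @ X_S.T / sigma2)
    return 0.5 * logdet

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))      # 5 users, 3 features (toy data)
v1 = gaussian_value(X[:2], 1.0)
v2 = gaussian_value(X[:4], 1.0)
print(v1, v2)                        # more users => larger value
```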
-\section{Submodularity}
+\section*{Appendix: Submodularity}
In this section, we will consider that we are given a \emph{universe} set $U$.
A set function $f$ is a function defined on the power set of $U$, $\mathfrak{P}(U)$.
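On a small universe, submodularity ($f(S) + f(T) \geq f(S\cup T) + f(S\cap T)$ for all $S, T$) can be checked by brute force. A minimal sketch, with a toy coverage function (coverage functions are a standard example of submodular functions):

```python
from itertools import chain, combinations

def powerset(U):
    return chain.from_iterable(combinations(U, r) for r in range(len(U) + 1))

def is_submodular(U, f):
    """Brute-force check of f(S) + f(T) >= f(S | T) + f(S & T) on all pairs."""
    subsets = [frozenset(s) for s in powerset(U)]
    return all(f(S) + f(T) >= f(S | T) + f(S & T) - 1e-12
               for S in subsets for T in subsets)

# Toy coverage function: the value of a set of users is the number of
# distinct elements their sets jointly cover.
cover = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}}
f = lambda S: len(set().union(*(cover[i] for i in S))) if S else 0
print(is_submodular({0, 1, 2}, f))  # True
```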
@@ -293,3 +326,4 @@ increasing functions.
\end{proof}
\end{document}
+