| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-10 18:02:43 +0200 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-10 18:02:43 +0200 |
| commit | b65081f4ba1d15abef445bc8bed66b1e288b2165 (patch) | |
| tree | 8d58ae436c32303866343a8991192c267b77688c /notes.tex | |
| parent | 316875f370ca33b82033072b03da0731d05b851f (diff) | |
| download | recommendation-b65081f4ba1d15abef445bc8bed66b1e288b2165.tar.gz | |
Take Stratis' comments into account for the first sections. Added some content in
the "economics of data" section.
Diffstat (limited to 'notes.tex')
| -rw-r--r-- | notes.tex | 180 |
1 file changed, 124 insertions, 56 deletions
@@ -14,6 +14,7 @@ \newcommand{\ip}[2]{\langle #1, #2 \rangle}
 \newcommand{\mse}{\mathop{\mathrm{MSE}}}
 \DeclareMathOperator{\trace}{tr}
+\DeclareMathOperator*{\argmax}{arg\,max}
 \title{Value of data: an approach to the economics of user data}
 \begin{document}
 \maketitle
@@ -31,27 +32,83 @@ Statistical Learning,
 leads to a natural definition of the value of data, which can then be used to
 answer economic questions regarding the use of this data in a data market.

+\section{Data model}
+
+There is a set of users and an experimenter who is about to conduct a survey.
+The survey process is the following: the experimenter is going to select
+a subset of users and query them, that is, ask the same question to each user
+in the selected subgroup.
+
+To each user are attached some intrinsic characteristics, known to the
+experimenter, which affect his answer to the experimenter's query. More
+precisely, the answer of the user is a function of his intrinsic
+characteristics, distorted by some noise.
+
+Formally, we have a set of users indexed by a set $\mathcal{I}$. The intrinsic
+characteristics of user $i\in\mathcal{I}$ are an element $x_i$ of a feature set
+$E$. When the experimenter queries user $i$, he gets the answer $y_i$:
+\begin{equation}\label{eq:answer-model}
+    y_i = h(x_i) + \varepsilon_i
+\end{equation}
+where $h$ is a mapping from the feature set $E$ to the answer space $A$ and
+$\varepsilon_i$ is a random variable over $A$. We assume that the errors
+are mutually independent.
+
+\emph{Examples.}
+\begin{enumerate}
+    \item if $A=\{0,1\}$ and $h$ is a binary function on $E$, the model
+        above is a binary classification setup.
+    \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and $h$ is a linear function,
+        the model above is a linear regression setup.
+\end{enumerate}
+
 \section{Value of data}

-There is a set of users than can be queried. We assume that when queried,
-the answer of the user to the query question is a function of its intrinsic
-characteristics, distorted by some noise.
+The experimenter does not know the function $h$ which appears in
+\eqref{eq:answer-model}. Instead, he has a given set of \emph{hypotheses}
+$\mathcal{H}$ and prior knowledge of the true hypothesis $h$, which is modeled
+by a random variable $H$ over $\mathcal{H}$. The data model in
+\eqref{eq:answer-model} expresses that, conditioned on the hypothesis, the
+answers to the experimenter's query are independent.
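To make the added data model concrete, here is a minimal sketch (an editorial illustration, not part of the diff; the one-dimensional features, the three threshold hypotheses and the bit-flip noise are all assumptions) of the answer model $y_i = h(x_i) + \varepsilon_i$ with a prior over a finite hypothesis set, under which the answers are independent once the hypothesis is drawn:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance of the binary classification example: E = [0, 1],
# A = {0, 1}, and each hypothesis is a threshold function on the feature.
x = rng.uniform(0, 1, size=10)        # intrinsic characteristics x_i, known to the experimenter
hypotheses = [lambda v, t=t: (v > t).astype(int) for t in (0.3, 0.5, 0.7)]
prior = np.array([0.25, 0.5, 0.25])   # experimenter's prior over the hypothesis set
flip_prob = 0.1                       # noise: each answer is flipped with probability 0.1

def sample_answers(rng, x, hypotheses, prior, flip_prob):
    """Draw H from the prior, then noisy answers y_i; here the 'noise'
    flips the noiseless answer h(x_i), and the answers are mutually
    independent conditioned on the drawn hypothesis."""
    h = hypotheses[rng.choice(len(hypotheses), p=prior)]
    noiseless = h(x)
    flips = rng.random(len(x)) < flip_prob
    return np.where(flips, 1 - noiseless, noiseless)

answers = sample_answers(rng, x, hypotheses, prior, flip_prob)
```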
-An experimenter is about to conduct a survey, where he is going to select
-a subset of users and query them. The answers will be used to learn the
-function underlying the answers of the users.
+The prior distribution can also be seen as an expression of the uncertainty of
+the experimenter about the true hypothesis $h$. The uncertainty of the
+distribution $P$ of $H$ can be classically measured by its entropy
+$\mathbb{H}(H)$\footnote{Here we choose to write the entropy of a discrete
+distribution, but our results do not rely on this assumption: if the
+distribution is continuous, one simply needs to replace the entropy by the
+differential entropy.}:
+\begin{displaymath}
+    \mathbb{H}(H) = -\sum_{h\in \mathcal{H}}P(H = h)\log\big(P(H = h)\big)
+\end{displaymath}
+
+Given a subset $S\subset \mathcal{I}$ of users, we denote by $\mathcal{A}_S$
+the set of answers given by the users in $S$ according to the experimenter's
+knowledge of the hypothesis:
+\begin{displaymath}
+    \mathcal{A}_S = \{Y_i,\, i\in S\} \quad\mathrm{with}
+    \quad Y_i = H(x_i) + \varepsilon_i
+\end{displaymath}
-Because the experimenter does not know the function beforehand, he maintains
-a probabilistic distribution over the set of possible functions. The meaning of
-this distribution is that it conveys the uncertainty of the experimenter about
-the underlying function. Observing the users' answers during the survey allows
-him to update his knowledge of the underlying function, thus reducing his
-uncertainty.
+The goal of the experimenter is to use the answers of the users to learn the
+true hypothesis. This setup is the one commonly found in Bayesian Active
+Learning with Noise. In this setup, the value of data, or value of
+information, when the entropy is used as the measure of uncertainty, is simply
+the decrease in entropy implied by observing the information. In our case,
+this leads to defining the value function $V$:
+\begin{equation}\label{eq:value}
+    \forall S\subset \mathcal{I},\; V(S)
+    = \mathbb{H}(H) - \mathbb{H}(H\,|\,\mathcal{A}_S)
+\end{equation}
+where $\mathbb{H}(H\,|\,\mathcal{A}_S)$ is the conditional entropy of $H$ given
+$\mathcal{A}_S$. One can also recognize that the value defined above is simply
+the mutual information $I(H;\mathcal{A}_S)$ between $H$ and $\mathcal{A}_S$.
-A classical measure of uncertainty is the entropy, which we use to define the
-value of the data of a subset of users : \emph{the value of a subset of users
-is the decrease of the entropy induced by observing the users' answer to
-a query}.
+Using the \emph{information never hurts} principle, it is easy to see that the
+value function defined in \eqref{eq:value} is nonnegative and increasing as
+a set function. Thus one could extend the definition of the value to be any
+increasing function of the mutual information.

 The motivation of this definition of value makes sense when considering the
 information theoretic interpretation of the entropy: if we consider that the
@@ -61,52 +118,63 @@ ask to fully know the hypothesis.
 If he needs to pay for each question asked
 to the oracle, then our definition of value directly relates to the cost
 decrease implied by observing a set of users.

-We will now formalize the above definition. The set of users is indexed by an
-index set $I$. The intrinsic characteristics of user $i$ are denoted by $x_i$,
-a vector belonging to a feature space $E$. The answer of the users, when
-queried belongs to the answer space $A$. The underlying function $f$, expressing
-how the answer relates to the feature vector is a mapping from $E$ to $A$. The
-answer of user $i$ will be denoted by $y_i$. Thus, the answer model takes the
+\section{Economics of data}
+
+Independently of the chosen definition of value, several optimization
+problems, motivated by economic considerations, naturally arise in the above
+setup.
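Every optimization problem below takes the value function $V(S) = \mathbb{H}(H) - \mathbb{H}(H\,|\,\mathcal{A}_S)$ as its objective. As a point of reference, here is a small editorial sketch (not part of the diff; the bit-flip noise model and base-2 logarithms are assumptions) that evaluates $V(S)$ exactly for the finite hypothesis class of the previous snippet, by enumerating the $2^{|S|}$ possible answer vectors:

```python
import numpy as np
from itertools import product

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def value(S, x, hypotheses, prior, flip_prob):
    """V(S) = H(H) - H(H | A_S) for binary answers flipped with probability
    flip_prob, computed by enumerating all 2^|S| answer vectors (only
    practical for small S and a small hypothesis class)."""
    prior = np.asarray(prior, dtype=float)
    cond_entropy = 0.0
    for y in product([0, 1], repeat=len(S)):
        # P(A_S = y | H = h): answers are independent given the hypothesis
        likelihoods = []
        for h in hypotheses:
            noiseless = h(x)
            p = 1.0
            for i, yi in zip(S, y):
                p *= (1 - flip_prob) if noiseless[i] == yi else flip_prob
            likelihoods.append(p)
        joint = prior * np.array(likelihoods)
        p_y = joint.sum()
        if p_y > 0:
            cond_entropy += p_y * entropy(joint / p_y)
    return entropy(prior) - cond_entropy

# e.g. with the variables of the previous sketch:
# print(value([0, 3, 7], x, hypotheses, prior, flip_prob))
```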
+
+\subsection{Optimal observation selection}
+
+If we assume that the experimenter has a cost $c$ for each user that he
+includes in his experiment, his goal will be to maximize the value of his
+experiment while minimizing his cost. Because the cost does not depend on the
+users, this is equivalent to maximizing the value while minimizing the size of
+the chosen set of users. Hence, the optimization problem takes the
 following form:
-\begin{displaymath}
-    y_i = f(x_i) + \varepsilon_i
-\end{displaymath}
-where $\varepsilon_i$ is the error distorting the answer of user $i$, which is
-a random variable over $A$. We assume that the errors are mutually independent.
+\begin{equation}\label{eq:observation-selection}
+    S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \vert S\vert \leq k}} V(S)
+\end{equation}
+\emph{Note: a $(1-1/e)$ approximation algorithm has been known since
+Nemhauser et al. (1978).}

-The experimenter's prior knowledge of $f$ is a random variable $F$ over $A^E$.
-The entropy of $F$ is denoted by $H(F)$:
-\begin{displaymath}
-    H(F) = \sum_{f\in A^E} P(F = f)\log\big(P(F = f)\big)
-\end{displaymath}
+The fact that there is a function defining the value of a set of users
+strongly suggests that some users are more valuable than others. It is then
+natural to let each user $i$ have a specific cost $c_i$. The above
+optimization problem then takes the form:
+\begin{equation}\label{eq:cost-observation-selection}
+    S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \sum_{i\in S}c_i\leq B}} V(S)
+\end{equation}

-If $S$ is a subset of users, we will denote by $A_S$ the set of answers given
-by these users, $A_S = \{y_i,\, i\in S\}$. We can now give the definition of
-the value $V:2^I\rightarrow \mathbf{R}^+$:
-\begin{displaymath}
-    \forall S\in 2^I,\; V(S) = H(F) - H(F\,|\, A_S)
-\end{displaymath}
-where $H(F\,|\, A_S)$ denotes the conditional entropy of $F$ given $A_S$.
+\emph{Note: this is known as the budgeted maximization problem, or
+maximization under a knapsack (linear) constraint. This problem is much harder
+than the previous one. A $(1-1/e)$ approximation algorithm was known in the
+case where the value function is the one used in the Max-Coverage problem;
+for general non-decreasing submodular functions, this is a result of
+Sviridenko (2004), ``A note on maximizing a submodular set function subject to
+a knapsack constraint''.}
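Both selection problems are instances of maximizing a monotone submodular set function, of which the value function $V$ is one. Below is a minimal sketch of the classical greedy heuristic behind the Nemhauser et al. $(1-1/e)$ guarantee for the cardinality-constrained problem (editorial; `value` stands for any set function, such as the one sketched above):

```python
def greedy_selection(users, value, k):
    """Greedy heuristic for the cardinality-constrained selection problem
    max V(S) subject to |S| <= k.  For a monotone submodular value function
    this achieves the (1 - 1/e) approximation of Nemhauser et al. (1978)."""
    selected = []
    for _ in range(k):
        candidates = [i for i in users if i not in selected]
        if not candidates:
            break
        base = value(selected)
        # add the user with the largest marginal value V(S + i) - V(S)
        best = max(candidates, key=lambda i: value(selected + [i]) - base)
        selected.append(best)
    return selected

# e.g. greedy_selection(range(len(x)),
#                       lambda S: value(list(S), x, hypotheses, prior, flip_prob),
#                       k=3)
```

For the knapsack-constrained version, the usual variant ranks candidates by marginal value per unit cost and, roughly as in Sviridenko (2004), combines this with partial enumeration of small seed sets to recover the $(1-1/e)$ guarantee.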
+\subsection{Budget feasible auctions}

-\section{Economics of data}

+In this section, the experimenter wants to compensate users for joining the
+experiment. Because the payments made by the experimenter are based on the
+costs reported by the users, reporting a false cost could be a strategy for
+a user to maximize his revenue.

-As soon as you have a definition of value, there are several natural economic
-problems arising, and which can be expressed independently of the definition of
-the value function:
-\begin{enumerate}
-    \item \emph{budget-feasible auctions:} each user $i$ has a cost $c_i$ for
-        revealing his data. You are given a budget $B$ and you want to select
-        a subset of players and a payment scheme such that the value of the
-        subset is maximal with the constraint that the sum of the payments
-        should stay within budget. The mechanism must be truthful.
-    \item \emph{cost sharing:}
-    \item \emph{k-best auctions:} this one is very similar to \emph{budget
-        auctions}, but instead of having a budget constraint, the constraint is
-        on the number of people in the subset $S$.
-\end{enumerate}

+Therefore, this is an auction setup: the experimenter observes the costs
+reported by the users, $c_i, i\in\mathcal{I}$, and selects a subset $S$ of
+users, as well as payments $p_i, i\in S$, such that the mechanism is:
+\begin{itemize}
+    \item \textbf{individually rational}: each selected user is paid at least
+        his cost, $p_i \geq c_i$;
+    \item \textbf{truthful}: reporting his true cost is a dominant strategy
+        for every user;
+    \item \textbf{budget-constrained}: $\sum_{i\in S} p_i \leq B$;
+    \item \textbf{approximation}: $OPT \leq \alpha V(S)$, where $OPT$ is the
+        optimal value of the budgeted maximization problem in
+        \eqref{eq:cost-observation-selection}.
+\end{itemize}
+
+\emph{Note: Yaron Singer (2010), ``Budget feasible auctions'', gives
+a polynomial-time mechanism with approximation ratio 117.7 for this problem.
+It is proven that no mechanism can do better than $2-\epsilon$ for any
+$\epsilon$.}

-\textbf{TODO} add references. Explain that our value function is submodular and
-write the theorems we get when applying previous results to our specific case.
+\subsection{Value sharing}

 \section{Value of data in the Gaussian world}
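The closing section heading is left without content in this revision. For reference, in the linear-Gaussian special case (a linear hypothesis $h(x)=\langle w, x\rangle$ with Gaussian prior $w\sim\mathcal{N}(0,\Sigma)$ and Gaussian noise, assumptions that go beyond what the diff itself states), the value function has the closed form $V(S) = \tfrac{1}{2}\log\det\big(I + \sigma^{-2} X_S \Sigma X_S^\top\big)$; a minimal sketch:

```python
import numpy as np

def gaussian_value(S, features, prior_cov, noise_var):
    """Value of a user set S in the linear-Gaussian case: the hypothesis is
    h(x) = <w, x> with w ~ N(0, prior_cov), and the answers are
    y_i = <w, x_i> + eps_i with eps_i ~ N(0, noise_var) i.i.d.  Then
    V(S) = I(w; A_S) = 1/2 * log det(I + X_S prior_cov X_S^T / noise_var)
    in nats, where X_S stacks the feature vectors of the users in S."""
    X_S = np.atleast_2d(features[list(S)])
    gram = X_S @ prior_cov @ X_S.T / noise_var
    _, logdet = np.linalg.slogdet(np.eye(len(S)) + gram)
    return 0.5 * logdet

# example: value of querying users 0, 2 and 5
# rng = np.random.default_rng(1)
# features = rng.normal(size=(10, 3))
# print(gaussian_value([0, 2, 5], features, np.eye(3), noise_var=0.25))
```

This log-determinant form also makes the monotonicity and submodularity of $V$ explicit in the Gaussian case.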