| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-10 18:02:43 +0200 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-10 18:02:43 +0200 |
| commit | b65081f4ba1d15abef445bc8bed66b1e288b2165 (patch) | |
| tree | 8d58ae436c32303866343a8991192c267b77688c /notes.tex | |
| parent | 316875f370ca33b82033072b03da0731d05b851f (diff) | |
| download | recommendation-b65081f4ba1d15abef445bc8bed66b1e288b2165.tar.gz | |
Take Stratis' comments into account for the first sections. Added some content in
the "economics of data" section.
Diffstat (limited to 'notes.tex')
| -rw-r--r-- | notes.tex | 180 |
1 file changed, 124 insertions, 56 deletions
@@ -14,6 +14,7 @@ \newcommand{\ip}[2]{\langle #1, #2 \rangle}
 \newcommand{\mse}{\mathop{\mathrm{MSE}}}
 \DeclareMathOperator{\trace}{tr}
+\DeclareMathOperator*{\argmax}{arg\,max}
 \title{Value of data: an approach to the economics of user data}
 \begin{document}
 \maketitle
@@ -31,27 +32,83 @@ Statistical Learning,
 leads to a natural definition of the value of data, which can then be used to
 answer economic questions regarding the use of this data in a data market.

+\section{Data model}
+
+There is a set of users and an experimenter who is about to conduct a survey.
+The survey process is the following: the experimenter is going to select
+a subset of users and query them, that is, ask the same question to each user
+in the selected subgroup.
+
+To each user are attached some intrinsic characteristics, known to the
+experimenter, which affect his answer to the experimenter's query. More
+precisely, the answer of the user is a function of his intrinsic
+characteristics, distorted by some noise.
+
+Formally, we have a set of users indexed by a set $\mathcal{I}$. The intrinsic
+characteristics of user $i\in\mathcal{I}$ are an element $x_i$ of a feature set
+$E$. When the experimenter queries user $i$, he gets the answer $y_i$:
+\begin{equation}\label{eq:answer-model}
+    y_i = h(x_i) + \varepsilon_i
+\end{equation}
+where $h$ is a mapping from the feature set $E$ to the answer space $A$ and
+$\varepsilon_i$ is a random variable over $A$. We assume that the errors
+are mutually independent.
+
+\emph{Examples.}
+\begin{enumerate}
+    \item if $A=\{0,1\}$ and $h$ is a binary function on $E$, the model
+        above is a binary classification setup.
+    \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and $h$ is a linear function,
+        the model above is a linear regression setup.
+\end{enumerate}
+
 \section{Value of data}

-There is a set of users than can be queried. We assume that when queried,
-the answer of the user to the query question is a function of its intrinsic
-characteristics, distorted by some noise.
+The experimenter does not know the function $h$ which appears in
+\eqref{eq:answer-model}. Instead, he has a given set of \emph{hypotheses}
+$\mathcal{H}$ and prior knowledge of the true hypothesis $h$, which is modeled
+by a random variable $H$ over $\mathcal{H}$. The data model in
+\eqref{eq:answer-model} expresses that, conditioned on the hypothesis, the
+answers to the experimenter's query are independent.
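To make the added data model concrete, here is a minimal sketch (an editorial illustration, not part of the diff; the one-dimensional features, the three threshold hypotheses and the bit-flip noise are all assumptions) of the answer model $y_i = h(x_i) + \varepsilon_i$ with a prior over a finite hypothesis set, under which the answers are independent once the hypothesis is drawn:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance of the binary classification example: E = [0, 1],
# A = {0, 1}, and each hypothesis is a threshold function on the feature.
x = rng.uniform(0, 1, size=10)        # intrinsic characteristics x_i, known to the experimenter
hypotheses = [lambda v, t=t: (v > t).astype(int) for t in (0.3, 0.5, 0.7)]
prior = np.array([0.25, 0.5, 0.25])   # experimenter's prior over the hypothesis set
flip_prob = 0.1                       # noise: each answer is flipped with probability 0.1

def sample_answers(rng, x, hypotheses, prior, flip_prob):
    """Draw H from the prior, then noisy answers y_i; here the 'noise'
    flips the noiseless answer h(x_i), and the answers are mutually
    independent conditioned on the drawn hypothesis."""
    h = hypotheses[rng.choice(len(hypotheses), p=prior)]
    noiseless = h(x)
    flips = rng.random(len(x)) < flip_prob
    return np.where(flips, 1 - noiseless, noiseless)

answers = sample_answers(rng, x, hypotheses, prior, flip_prob)
```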
-An experimenter is about to conduct a survey, where he is going to select
-a subset of users and query them. The answers will be used to learn the
-function underlying the answers of the users.
+The prior distribution can also be seen as an expression of the uncertainty of
+the experimenter about the true hypothesis $h$. The uncertainty of the
+distribution $P$ of $H$ can be classically measured by its entropy
+$\mathbb{H}(H)$\footnote{Here we choose to write the entropy of a discrete
+distribution, but our results do not rely on this assumption: if the
+distribution is continuous, one simply needs to replace the entropy by the
+differential entropy.}:
+\begin{displaymath}
+    \mathbb{H}(H) = -\sum_{h\in \mathcal{H}}P(H = h)\log\big(P(H = h)\big)
+\end{displaymath}
+
+Given a subset $S\subset \mathcal{I}$ of users, we denote by $\mathcal{A}_S$
+the set of answers given by the users in $S$ according to the experimenter's
+knowledge of the hypothesis:
+\begin{displaymath}
+    \mathcal{A}_S = \{Y_i,\, i\in S\} \quad\mathrm{with}
+    \quad Y_i = H(x_i) + \varepsilon_i
+\end{displaymath}
-Because the experimenter does not know the function beforehand, he maintains
-a probabilistic distribution over the set of possible functions. The meaning of
-this distribution is that it conveys the uncertainty of the experimenter about
-the underlying function. Observing the users' answers during the survey allows
-him to update his knowledge of the underlying function, thus reducing his
-uncertainty.
+The goal of the experimenter is to use the answers of the users to learn the
+true hypothesis. This setup is the one commonly found in Bayesian Active
+Learning with Noise. In this setup, the value of data, or value of
+information, when the entropy is used as the measure of uncertainty, is simply
+the decrease in entropy implied by observing the information. In our case,
+this leads to defining the value function $V$:
+\begin{equation}\label{eq:value}
+    \forall S\subset \mathcal{I},\; V(S)
+    = \mathbb{H}(H) - \mathbb{H}(H\,|\,\mathcal{A}_S)
+\end{equation}
+where $\mathbb{H}(H\,|\,\mathcal{A}_S)$ is the conditional entropy of $H$ given
+$\mathcal{A}_S$. One can also recognize that the value defined above is simply
+the mutual information $I(H;\mathcal{A}_S)$ between $H$ and $\mathcal{A}_S$.
-A classical measure of uncertainty is the entropy, which we use to define the
-value of the data of a subset of users : \emph{the value of a subset of users
-is the decrease of the entropy induced by observing the users' answer to
-a query}.
+Using the \emph{information never hurts} principle, it is easy to see that the
+value function defined in \eqref{eq:value} is nonnegative and increasing as
+a set function. Thus one could extend the definition of the value to be any
+increasing function of the mutual information.

 The motivation of this definition of value makes sense when considering the
 information theoretic interpretation of the entropy: if we consider that the
@@ -61,52 +118,63 @@ ask to fully know the hypothesis.
 If he needs to pay for each question asked
 to the oracle, then our definition of value directly relates to the cost
 decrease implied by observing a set of users.

-We will now formalize the above definition. The set of users is indexed by an
-index set $I$. The intrinsic characteristics of user $i$ are denoted by $x_i$,
-a vector belonging to a feature space $E$. The answer of the users, when
-queried belongs to the answer space $A$. The underlying function $f$, expressing
-how the answer relates to the feature vector is a mapping from $E$ to $A$. The
-answer of user $i$ will be denoted by $y_i$. Thus, the answer model takes the
+\section{Economics of data}
+
+Independently of the chosen definition of value, several optimization
+problems, motivated by economic considerations, naturally arise in the above
+setup.
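Every optimization problem below takes the value function $V(S) = \mathbb{H}(H) - \mathbb{H}(H\,|\,\mathcal{A}_S)$ as its objective. As a point of reference, here is a small editorial sketch (not part of the diff; the bit-flip noise model and base-2 logarithms are assumptions) that evaluates $V(S)$ exactly for the finite hypothesis class of the previous snippet, by enumerating the $2^{|S|}$ possible answer vectors:

```python
import numpy as np
from itertools import product

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def value(S, x, hypotheses, prior, flip_prob):
    """V(S) = H(H) - H(H | A_S) for binary answers flipped with probability
    flip_prob, computed by enumerating all 2^|S| answer vectors (only
    practical for small S and a small hypothesis class)."""
    prior = np.asarray(prior, dtype=float)
    cond_entropy = 0.0
    for y in product([0, 1], repeat=len(S)):
        # P(A_S = y | H = h): answers are independent given the hypothesis
        likelihoods = []
        for h in hypotheses:
            noiseless = h(x)
            p = 1.0
            for i, yi in zip(S, y):
                p *= (1 - flip_prob) if noiseless[i] == yi else flip_prob
            likelihoods.append(p)
        joint = prior * np.array(likelihoods)
        p_y = joint.sum()
        if p_y > 0:
            cond_entropy += p_y * entropy(joint / p_y)
    return entropy(prior) - cond_entropy

# e.g. with the variables of the previous sketch:
# print(value([0, 3, 7], x, hypotheses, prior, flip_prob))
```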
+
+\subsection{Optimal observation selection}
+
+If we assume that the experimenter has a cost $c$ for each user that he
+includes in his experiment, his goal will be to maximize the value of his
+experiment while minimizing his cost. Because the cost does not depend on the
+users, this is equivalent to maximizing the value while minimizing the size of
+the chosen set of users. Hence, the optimization problem takes the
 following form:
-\begin{displaymath}
-    y_i = f(x_i) + \varepsilon_i
-\end{displaymath}
-where $\varepsilon_i$ is the error distorting the answer of user $i$, which is
-a random variable over $A$. We assume that the errors are mutually independent.
+\begin{equation}\label{eq:observation-selection}
+    S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \vert S\vert \leq k}} V(S)
+\end{equation}
+\emph{Note: a $(1-1/e)$ approximation algorithm has been known since
+Nemhauser et al. (1978).}

-The experimenter's prior knowledge of $f$ is a random variable $F$ over $A^E$.
-The entropy of $F$ is denoted by $H(F)$:
-\begin{displaymath}
-    H(F) = \sum_{f\in A^E} P(F = f)\log\big(P(F = f)\big)
-\end{displaymath}
+The fact that there is a function defining the value of a set of users
+strongly suggests that some users are more valuable than others. It is then
+natural to let each user $i$ have a specific cost $c_i$. The above
+optimization problem then takes the form:
+\begin{equation}\label{eq:cost-observation-selection}
+    S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \sum_{i\in S}c_i\leq B}} V(S)
+\end{equation}

-If $S$ is a subset of users, we will denote by $A_S$ the set of answers given
-by these users, $A_S = \{y_i,\, i\in S\}$. We can now give the definition of
-the value $V:2^I\rightarrow \mathbf{R}^+$:
-\begin{displaymath}
-    \forall S\in 2^I,\; V(S) = H(F) - H(F\,|\, A_S)
-\end{displaymath}
-where $H(F\,|\, A_S)$ denotes the conditional entropy of $F$ given $A_S$.
+\emph{Note: this is known as the budgeted maximization problem, or
+maximization under a knapsack (linear) constraint. This problem is much harder
+than the previous one. A $(1-1/e)$ approximation algorithm was known in the
+case where the value function is the one used in the Max-Coverage problem;
+for general non-decreasing submodular functions, this is a result of
+Sviridenko (2004), ``A note on maximizing a submodular set function subject to
+a knapsack constraint''.}
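Both selection problems are instances of maximizing a monotone submodular set function, of which the value function $V$ is one. Below is a minimal sketch of the classical greedy heuristic behind the Nemhauser et al. $(1-1/e)$ guarantee for the cardinality-constrained problem (editorial; `value` stands for any set function, such as the one sketched above):

```python
def greedy_selection(users, value, k):
    """Greedy heuristic for the cardinality-constrained selection problem
    max V(S) subject to |S| <= k.  For a monotone submodular value function
    this achieves the (1 - 1/e) approximation of Nemhauser et al. (1978)."""
    selected = []
    for _ in range(k):
        candidates = [i for i in users if i not in selected]
        if not candidates:
            break
        base = value(selected)
        # add the user with the largest marginal value V(S + i) - V(S)
        best = max(candidates, key=lambda i: value(selected + [i]) - base)
        selected.append(best)
    return selected

# e.g. greedy_selection(range(len(x)),
#                       lambda S: value(list(S), x, hypotheses, prior, flip_prob),
#                       k=3)
```

For the knapsack-constrained version, the usual variant ranks candidates by marginal value per unit cost and, roughly as in Sviridenko (2004), combines this with partial enumeration of small seed sets to recover the $(1-1/e)$ guarantee.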
+\subsection{Budget feasible auctions}

-\section{Economics of data}

+In this section, the experimenter wants to compensate users for joining the
+experiment. Because the payments made by the experimenter are based on the
+costs reported by the users, reporting a false cost could be a strategy for
+a user to maximize his revenue.

-As soon as you have a definition of value, there are several natural economic
-problems arising, and which can be expressed independently of the definition of
-the value function:
-\begin{enumerate}
-    \item \emph{budget-feasible auctions:} each user $i$ has a cost $c_i$ for
-        revealing his data. You are given a budget $B$ and you want to select
-        a subset of players and a payment scheme such that the value of the
-        subset is maximal with the constraint that the sum of the payments
-        should stay within budget. The mechanism must be truthful.
-    \item \emph{cost sharing:}
-    \item \emph{k-best auctions:} this one is very similar to \emph{budget
-        auctions}, but instead of having a budget constraint, the constraint is
-        on the number of people in the subset $S$.
-\end{enumerate}

+Therefore, this is an auction setup: the experimenter observes the costs
+reported by the users, $c_i, i\in\mathcal{I}$, and selects a subset $S$ of
+users, as well as payments $p_i, i\in S$, such that the mechanism is:
+\begin{itemize}
+    \item \textbf{individually rational}: each selected user is paid at least
+        his cost, $p_i \geq c_i$;
+    \item \textbf{truthful}: reporting his true cost is a dominant strategy
+        for every user;
+    \item \textbf{budget-constrained}: $\sum_{i\in S} p_i \leq B$;
+    \item \textbf{approximation}: $OPT \leq \alpha V(S)$, where $OPT$ is the
+        optimal value of the budgeted maximization problem in
+        \eqref{eq:cost-observation-selection}.
+\end{itemize}
+
+\emph{Note: Yaron Singer (2010), ``Budget feasible auctions'', gives
+a polynomial-time mechanism with approximation ratio 117.7 for this problem.
+It is proven that no mechanism can do better than $2-\epsilon$ for any
+$\epsilon$.}

-\textbf{TODO} add references. Explain that our value function is submodular and
-write the theorems we get when applying previous results to our specific case.
+\subsection{Value sharing}

 \section{Value of data in the Gaussian world}
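The closing section heading is left without content in this revision. For reference, in the linear-Gaussian special case (a linear hypothesis $h(x)=\langle w, x\rangle$ with Gaussian prior $w\sim\mathcal{N}(0,\Sigma)$ and Gaussian noise, assumptions that go beyond what the diff itself states), the value function has the closed form $V(S) = \tfrac{1}{2}\log\det\big(I + \sigma^{-2} X_S \Sigma X_S^\top\big)$; a minimal sketch:

```python
import numpy as np

def gaussian_value(S, features, prior_cov, noise_var):
    """Value of a user set S in the linear-Gaussian case: the hypothesis is
    h(x) = <w, x> with w ~ N(0, prior_cov), and the answers are
    y_i = <w, x_i> + eps_i with eps_i ~ N(0, noise_var) i.i.d.  Then
    V(S) = I(w; A_S) = 1/2 * log det(I + X_S prior_cov X_S^T / noise_var)
    in nats, where X_S stacks the feature vectors of the users in S."""
    X_S = np.atleast_2d(features[list(S)])
    gram = X_S @ prior_cov @ X_S.T / noise_var
    _, logdet = np.linalg.slogdet(np.eye(len(S)) + gram)
    return 0.5 * logdet

# example: value of querying users 0, 2 and 5
# rng = np.random.default_rng(1)
# features = rng.normal(size=(10, 3))
# print(gaussian_value([0, 2, 5], features, np.eye(3), noise_var=0.25))
```

This log-determinant form also makes the monotonicity and submodularity of $V$ explicit in the Gaussian case.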