| field | value | date |
|---|---|---|
| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-13 14:11:46 -0700 |
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-13 14:11:46 -0700 |
| commit | 1b32037a9cd0232493da6a4af45ec803a9604458 (patch) | |
| tree | d4f3a145c2382fa598283d8c070fddc60fc23d00 | |
| parent | b65081f4ba1d15abef445bc8bed66b1e288b2165 (diff) | |
| download | recommendation-1b32037a9cd0232493da6a4af45ec803a9604458.tar.gz | |
Update on section 2, need to update following sections accordingly.
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | notes.tex | 81 |

1 file changed, 61 insertions, 20 deletions
@@ -20,6 +20,7 @@
 \maketitle
 \section{Introduction}
+
 With the recent development of communication technologies, user data is now
 present everywhere. It is also clear that there is some notion of value
 attached to user data: this is particularly obvious when you consider the
@@ -34,32 +35,72 @@
 a data market.
 \section{Data model}
-There is a set of users and an experimenter who is about to conduct a survey.
-The survey process is the following: the experimenter is going to select
-a subset of users an query them, that is, ask the same question to each user in
-the selected subgroup.
+There is a set of users and an experimenter. Each user has a vector of public
+features (e.g.\ age, height, movie ratings, binary features, labels) and
+a private piece of information, an undisclosed variable.
+
+The experimenter has prior knowledge of how the undisclosed variable relates
+to the public vector of features; this prior knowledge is called the
+\emph{hypothesis}. The experimenter wants to test and refine his hypothesis
+against the users' data: he is going to select a set of users and ask them to
+reveal their private variable. Based on the observed data, he will be able to
+update his prior knowledge.
+
+Because the users' data helps the experimenter refine his hypothesis, there
+is a notion of value attached to a group of users, which quantifies how much
+uncertainty about the hypothesis is removed by observing their data.
+
+The users also have a cost for revealing their data (for example for privacy
+reasons, or because providing the information to the experimenter takes them
+some time). The experimenter has finite resources: there is a budget
+constraint on how much he can spend.
-To each user is attached some intrinsic characteristics, known to the
-experimenter, which impacts on his answer to the experimenter's query. More
-precisely, the answer of the user is a function of its intrinsic characteristic
-distorted by some noise.
+The problems arising in this setup are natural: the experimenter wants to
+maximize his utility, the value of the set of users he selects, but he needs
+to compensate the users, taking into account their costs and his budget
+constraint. The users' costs can either be public, which directly leads to
+combinatorial optimization problems, or private, in which case strategic
+behavior affects the way the experimenter compensates the users and an
+auction approach is required.
-Formally, we have a set of users indexed by a set $\mathcal{I}$. The intrinsic
-characteristics of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set
-$E$. When the experimenter queries user $i$, he gets the answer $y_i$:
-\begin{equation}\label{eq:answer-model}
-    y_i = h(x_i) + \varepsilon_i
+Formally, there is a set of users indexed by a set $\mathcal{I}$. The public
+feature vector of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set
+$E$; his undisclosed variable is denoted by $y_i$ and belongs to some space
+$A$. The cost of user $i$ for revealing his data is a positive real number
+$c_i\in\mathbf{R}^+$.
+
+The prior knowledge of the experimenter takes the form of a random variable
+$H$ over $A^E$, or over a subset of $A^E$ called the \emph{hypothesis set}.
+This random variable expresses $y_i$ as a function of $x_i$ through the
+equation:
+\begin{equation}\label{eq:hypothesis-model}
+    y_i = H(x_i) + \varepsilon_i
 \end{equation}
-where $h$ is a mapping from the feature set $E$ to the answer space $A$ and
-$\varepsilon_i$ is a random variable over $A$. We assume that the errors
-are mutually independent.
+where $\varepsilon_i$ is a random variable over $A$ independent of $H$.
+
+\emph{TODO: explain why this model is not restrictive: $y$ can always be
+  written as a deterministic function of $x$ plus something independent of
+  $x$}
+
+The framework for the experimenter is Bayesian: his prior knowledge, the
+distribution of $H$, is the prior distribution. Based on the disclosed
+variables of the selected users, he updates his knowledge to get the
+posterior distribution.
 \emph{Examples.}
 \begin{enumerate}
-    \item if $A=\{0,1\}$ and $h$ is a binary function on $E$ and the model
-    above is a binary classification setup.
-    \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and $h$ is a linear function, the
-    model above is a linear regressions setup.
+    \item if the hypothesis set is finite (the experimenter has a few
+    deterministic models he wants to choose from), observing data allows him
+    to rule out parts of the hypothesis set. In this case, the uncertainty
+    of the experimenter can simply be measured by the size of the
+    hypothesis set.
+    \item if $A=\{0,1\}$ and the hypothesis set is the set of all binary
+    functions on $E$, the learning task of the experimenter implied by the
+    above setup is a binary classification problem.
+    \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and the hypothesis set is the
+    set $\mathcal{L}(E,A)$ of linear functions from $E$ to $A$, the learning
+    task is a linear regression. The prior knowledge of the experimenter is
+    a prior distribution on the parameters of the linear function, which
+    makes the learning task equivalent to regularized linear regression
+    (e.g.\ ridge regression).
 \end{enumerate}
 \section{Value of data}
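The ridge-regression equivalence asserted in the last example of the diff can be made explicit. Below is a minimal worked version under an assumed Gaussian specialization of the model, $H(x) = \langle w, x \rangle$ with $w \sim \mathcal{N}(0, \tau^2 I)$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$; these distributional choices are illustrative, not fixed by the commit.

```latex
% Sketch under an assumed Gaussian prior and noise (not part of the commit):
% the posterior mean of $w$, after the users in the selected set $S$ reveal
% their undisclosed variables, is
\begin{equation*}
  \hat{w} = \left( X^\top X + \lambda I \right)^{-1} X^\top y,
  \qquad \lambda = \frac{\sigma^2}{\tau^2},
\end{equation*}
% where the rows of $X$ are the feature vectors $x_i$, $i \in S$, and $y$
% stacks the disclosed $y_i$. This is exactly the ridge regression estimator
% with regularization parameter $\lambda$.
```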
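For an end-to-end view, here is a minimal NumPy sketch of the model $y_i = H(x_i) + \varepsilon_i$ and the Bayesian update it implies. The linear hypothesis class, the Gaussian prior and noise, the selected set `S`, and all parameter values are illustrative assumptions matching the third example, not part of the commit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instantiation of y_i = H(x_i) + eps_i with E = R^d, A = R,
# and H a random linear function (all names and parameters are assumptions).
d, n = 3, 10
tau, sigma = 1.0, 0.1                    # prior scale of H, noise scale

x = rng.normal(size=(n, d))              # public feature vectors x_i
w = rng.normal(scale=tau, size=d)        # one draw of H: H(x) = <w, x>
eps = rng.normal(scale=sigma, size=n)    # noise, independent of H
y = x @ w + eps                          # undisclosed variables y_i
c = rng.uniform(0.1, 1.0, size=n)        # cost c_i of revealing y_i

# Bayesian update: posterior over w once the users in S reveal their y_i.
# With Gaussian prior and noise, the posterior mean is the ridge regression
# estimate with lambda = sigma^2 / tau^2 (see the equation above).
S = [0, 2, 5]
X_S, y_S = x[S], y[S]
lam = sigma**2 / tau**2
w_post = np.linalg.solve(X_S.T @ X_S + lam * np.eye(d), X_S.T @ y_S)
print(w_post)
```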
