author    Thibaut Horel <thibaut.horel@gmail.com> 2012-09-13 14:11:46 -0700
committer Thibaut Horel <thibaut.horel@gmail.com> 2012-09-13 14:11:46 -0700
commit    1b32037a9cd0232493da6a4af45ec803a9604458 (patch)
tree      d4f3a145c2382fa598283d8c070fddc60fc23d00 /notes.tex
parent    b65081f4ba1d15abef445bc8bed66b1e288b2165 (diff)
download  recommendation-1b32037a9cd0232493da6a4af45ec803a9604458.tar.gz
Update on section 2, need to update following sections accordingly.
Diffstat (limited to 'notes.tex')
-rw-r--r--  notes.tex | 81
1 file changed, 61 insertions(+), 20 deletions(-)
diff --git a/notes.tex b/notes.tex
index ad34a21..12e04a5 100644
--- a/notes.tex
+++ b/notes.tex
@@ -20,6 +20,7 @@
\maketitle
\section{Introduction}
+
With the recent development of communication technologies, user data is now
present everywhere. It is also clear that there is some notion of value
attached to user data: this is particularly obvious when you consider the
@@ -34,32 +35,72 @@ a data market.
\section{Data model}
-There is a set of users and an experimenter who is about to conduct a survey.
-The survey process is the following: the experimenter is going to select
-a subset of users and query them, that is, ask the same question to each user in
-the selected subgroup.
+There is a set of users and an experimenter. Each user has a vector of public
+features (e.g. age, height, movie ratings, binary features, labels) and
+a private piece of information, called the undisclosed variable.
+
+The experimenter has prior knowledge of how the undisclosed variable relates
+to the public feature vector; this prior knowledge is called the
+\emph{hypothesis}. The experimenter wants to test and refine his hypothesis
+against the users' data: he is going to select a set of users and ask them to
+reveal their private variable. Based on the observed data, he will be able to
+update his prior knowledge.
+
+Because the users' data helps the experimenter refine his hypothesis, there
+is a notion of value attached to a group of users: it quantifies how much
+uncertainty about the hypothesis is removed by observing their data.
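+
+A natural candidate for this value (using the notation introduced below, and
+assuming uncertainty is measured by Shannon entropy, which is only one
+possible choice) is the mutual information between the hypothesis $H$ and
+the disclosed variables of a group $S\subseteq\mathcal{I}$ of users:
+\begin{equation}
+	V(S) = I\bigl(H; (y_i)_{i\in S}\bigr).
+\end{equation}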
+
+The users also have a cost for revealing their data (for example for privacy
+reasons, or because providing the information to the experimenter takes them
+some time). The experimenter has finite resources: there is a budget
+constraint on how much he can spend.
-To each user are attached some intrinsic characteristics, known to the
-experimenter, which impact his answer to the experimenter's query. More
-precisely, the answer of the user is a function of his intrinsic characteristics
-distorted by some noise.
+The problems arising in this setup are natural: the experimenter wants to
+maximize his utility, i.e. the value of the set of users he selects, but he
+needs to compensate the users while taking into account their costs and his
+budget constraint. The users' costs can either be public, which directly
+leads to combinatorial optimization problems, or private, in which case the
+users may behave strategically and the experimenter has to compensate them
+through an auction mechanism.
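+
+In the public-cost case, a possible formalization (a sketch only; the exact
+objective is made precise in the following sections) is the budgeted
+selection problem
+\begin{equation}
+	\max_{S\subseteq\mathcal{I}} V(S) \qquad \mbox{subject to} \qquad
+	\sum_{i\in S} c_i \leq B,
+\end{equation}
+where $B\in\mathbf{R}^+$ denotes the experimenter's budget and $V(S)$ the
+value of the group $S$ of users.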
-Formally, we have a set of users indexed by a set $\mathcal{I}$. The intrinsic
-characteristics of user $i\in\mathcal{I}$ are represented by an element $x_i$ of a feature set
-$E$. When the experimenter queries user $i$, he gets the answer $y_i$:
-\begin{equation}\label{eq:answer-model}
- y_i = h(x_i) + \varepsilon_i
+Formally, there is a set of users indexed by a set $\mathcal{I}$. The public
+feature vector of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set
+$E$; his undisclosed variable is denoted by $y_i$ and belongs to some space
+$A$. The cost of user $i$ for revealing his data is a positive real number
+$c_i\in\mathbf{R}^+$.
+
+The prior knowledge of the experimenter takes the form of a random variable
+$H$ over $A^E$, or over a subset of $A^E$ called the \emph{hypothesis set}.
+This random variable relates $y_i$ to $x_i$ through the equation:
+\begin{equation}\label{eq:hypothesis-model}
+ y_i = H(x_i) + \varepsilon_i
\end{equation}
-where $h$ is a mapping from the feature set $E$ to the answer space $A$ and
-$\varepsilon_i$ is a random variable over $A$. We assume that the errors
-are mutually independent.
+where $\varepsilon_i$ is a random variable over $A$ independent of $H$.
+
+\emph{TODO: explain why this model is not restrictive: $y$ can always be
+ written as a deterministic function of $x$ plus something independent of
+$x$}
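+
+A possible sketch of this argument when $A=\mathbf{R}$ (assuming the
+conditional expectation exists): setting
+\begin{equation}
+	h(x) = \mathbf{E}[y \mid x] \qquad \mbox{and} \qquad \varepsilon = y - h(x)
+\end{equation}
+gives $\mathbf{E}[\varepsilon \mid x] = 0$, so the noise is centered and
+uncorrelated with every function of $x$; full independence of $\varepsilon$
+and $x$ is a stronger assumption and still needs to be argued.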
+
+The framework for the experimenter is Bayesian: his prior knowledge, the
+distribution of $H$, is the prior distribution. Based on the disclosed
+variables of the selected users, he updates his knowledge to obtain the
+posterior distribution.
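+
+Concretely, if the errors $\varepsilon_i$ are mutually independent with
+common density $f$ (an assumption made here for illustration only), the
+update for a selected set $S$ of users follows Bayes' rule:
+\begin{equation}
+	\Pr[H = h \mid (y_i)_{i\in S}] \propto
+	\Pr[H = h] \prod_{i\in S} f\bigl(y_i - h(x_i)\bigr).
+\end{equation}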
\emph{Examples.}
\begin{enumerate}
- \item if $A=\{0,1\}$ and $h$ is a binary function on $E$, the model
- above is a binary classification setup.
- \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and $h$ is a linear function, the
- model above is a linear regression setup.
+ \item if the hypothesis set is finite (the experimenter has a few
+ deterministic models he wants to choose from), observing data allows him
+ to rule out parts of the hypothesis set. In this case, the uncertainty
+ of the experimenter can simply be measured by the number of hypotheses
+ still consistent with the observed data.
+ \item if $A=\{0,1\}$ and the hypothesis set is the set of all binary
+ functions on $E$, the learning task of the experimenter implied by the
+ above setup is a binary classification problem.
+ \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and the hypothesis set is the set
+ $\mathcal{L}(E,A)$ of linear functions from $E$ to $A$, the learning
+ task is linear regression. The prior knowledge of the experimenter is
+ a prior distribution on the parameters of the linear function; maximum
+ a posteriori estimation under such a prior is equivalent to regularized
+ linear regression (e.g. ridge regression), as sketched after the list.
\end{enumerate}
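+
+To make the last example concrete, here is a standard sketch (the Gaussian
+assumptions are chosen for illustration only): if $H(x) = w^\top x$ with
+prior $w \sim \mathcal{N}(0, \tau^2 I_d)$, where $I_d$ is the identity
+matrix, and noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, the maximum
+a posteriori estimate given the answers of a set $S$ of users is
+\begin{equation}
+	\hat{w} = \mathop{\mathrm{argmin}}_{w\in\mathbf{R}^d}
+	\sum_{i\in S} (y_i - w^\top x_i)^2 + \frac{\sigma^2}{\tau^2}\|w\|^2,
+\end{equation}
+i.e. ridge regression with regularization parameter $\sigma^2/\tau^2$.
+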
\section{Value of data}