| field | value | date |
|---|---|---|
| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-13 14:11:46 -0700 |
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-09-13 14:11:46 -0700 |
| commit | 1b32037a9cd0232493da6a4af45ec803a9604458 (patch) | |
| tree | d4f3a145c2382fa598283d8c070fddc60fc23d00 | |
| parent | b65081f4ba1d15abef445bc8bed66b1e288b2165 (diff) | |
| download | recommendation-1b32037a9cd0232493da6a4af45ec803a9604458.tar.gz | |
Update on section 2, need to update following sections accordingly.
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | notes.tex | 81 |

1 file changed, 61 insertions, 20 deletions
@@ -20,6 +20,7 @@
 \maketitle
 \section{Introduction}
+
 With the recent development of communication technologies, user data is now
 present everywhere. It is also clear that there is some notion of value
 attached to user data: this is particularly obvious when you consider the
@@ -34,32 +35,72 @@
 a data market.
 \section{Data model}
-There is a set of users and an experimenter who is about to conduct a survey.
-The survey process is the following: the experimenter is going to select
-a subset of users an query them, that is, ask the same question to each user in
-the selected subgroup.
+There is a set of users and an experimenter. Each user has a vector of public
+features (e.g.\ age, height, movie ratings, binary features, labels) and
+a private piece of information, an undisclosed variable.
+
+The experimenter has prior knowledge of how the undisclosed variable relates
+to the public vector of features; this prior knowledge is called the
+\emph{hypothesis}. The experimenter wants to test and refine his hypothesis
+against the users' data: he is going to select a set of users and ask them to
+reveal their private variable. Based on the observed data, he will be able to
+update his prior knowledge.
+
+Because the users' data helps the experimenter refine his hypothesis, there
+is a notion of value attached to a group of users, which quantifies how much
+uncertainty about the hypothesis is removed by observing their data.
+
+The users also have a cost for revealing their data (for example for privacy
+reasons, or because providing the information to the experimenter takes them
+some time). The experimenter has finite resources: there is a budget
+constraint on how much he can spend.
-To each user is attached some intrinsic characteristics, known to the
-experimenter, which impacts on his answer to the experimenter's query. More
-precisely, the answer of the user is a function of its intrinsic characteristic
-distorted by some noise.
+The problems arising in this setup are natural: the experimenter wants to
+maximize his utility, the value of the set of users he selects, but he needs
+to compensate the users, taking into account their costs and his budget
+constraint. The users' costs can either be public, which directly leads to
+combinatorial optimization problems, or private, in which case strategic
+behavior affects the way the experimenter compensates the users and an
+auction approach is required.
-Formally, we have a set of users indexed by a set $\mathcal{I}$. The intrinsic
-characteristics of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set
-$E$. When the experimenter queries user $i$, he gets the answer $y_i$:
-\begin{equation}\label{eq:answer-model}
-    y_i = h(x_i) + \varepsilon_i
+Formally, there is a set of users indexed by a set $\mathcal{I}$. The public
+feature vector of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set
+$E$; his undisclosed variable is denoted by $y_i$ and belongs to some space
+$A$. The cost of user $i$ for revealing his data is a positive real number
+$c_i\in\mathbf{R}^+$.
+
+The prior knowledge of the experimenter takes the form of a random variable
+$H$ over $A^E$, or over a subset of $A^E$ called the \emph{hypothesis set}.
+This random variable expresses $y_i$ as a function of $x_i$ through the
+equation:
+\begin{equation}\label{eq:hypothesis-model}
+    y_i = H(x_i) + \varepsilon_i
 \end{equation}
-where $h$ is a mapping from the feature set $E$ to the answer space $A$ and
-$\varepsilon_i$ is a random variable over $A$. We assume that the errors
-are mutually independent.
+where $\varepsilon_i$ is a random variable over $A$ independent of $H$.
+
+\emph{TODO: explain why this model is not restrictive: $y$ can always be
+  written as a deterministic function of $x$ plus something independent of
+  $x$}
+
+The framework for the experimenter is Bayesian: his prior knowledge, the
+distribution of $H$, is the prior distribution. Based on the disclosed
+variables of the selected users, he updates his knowledge to get the
+posterior distribution.
 \emph{Examples.}
 \begin{enumerate}
-    \item if $A=\{0,1\}$ and $h$ is a binary function on $E$ and the model
-    above is a binary classification setup.
-    \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and $h$ is a linear function, the
-    model above is a linear regressions setup.
+    \item if the hypothesis set is finite (the experimenter has a few
+    deterministic models he wants to choose from), observing data allows him
+    to rule out parts of the hypothesis set. In this case, the uncertainty
+    of the experimenter can simply be measured by the size of the
+    hypothesis set.
+    \item if $A=\{0,1\}$ and the hypothesis set is the set of all binary
+    functions on $E$, the learning task of the experimenter implied by the
+    above setup is a binary classification problem.
+    \item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and the hypothesis set is the
+    set $\mathcal{L}(E,A)$ of linear functions from $E$ to $A$, the learning
+    task is a linear regression. The prior knowledge of the experimenter is
+    a prior distribution on the parameters of the linear function, which
+    makes the learning task equivalent to regularized linear regression
+    (e.g.\ ridge regression).
 \end{enumerate}
 \section{Value of data}
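The ridge-regression equivalence asserted in the last example of the diff can be made explicit. Below is a minimal worked version under an assumed Gaussian specialization of the model, $H(x) = \langle w, x \rangle$ with $w \sim \mathcal{N}(0, \tau^2 I)$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$; these distributional choices are illustrative, not fixed by the commit.

```latex
% Sketch under an assumed Gaussian prior and noise (not part of the commit):
% the posterior mean of $w$, after the users in the selected set $S$ reveal
% their undisclosed variables, is
\begin{equation*}
  \hat{w} = \left( X^\top X + \lambda I \right)^{-1} X^\top y,
  \qquad \lambda = \frac{\sigma^2}{\tau^2},
\end{equation*}
% where the rows of $X$ are the feature vectors $x_i$, $i \in S$, and $y$
% stacks the disclosed $y_i$. This is exactly the ridge regression estimator
% with regularization parameter $\lambda$.
```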
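For an end-to-end view, here is a minimal NumPy sketch of the model $y_i = H(x_i) + \varepsilon_i$ and the Bayesian update it implies. The linear hypothesis class, the Gaussian prior and noise, the selected set `S`, and all parameter values are illustrative assumptions matching the third example, not part of the commit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instantiation of y_i = H(x_i) + eps_i with E = R^d, A = R,
# and H a random linear function (all names and parameters are assumptions).
d, n = 3, 10
tau, sigma = 1.0, 0.1                    # prior scale of H, noise scale

x = rng.normal(size=(n, d))              # public feature vectors x_i
w = rng.normal(scale=tau, size=d)        # one draw of H: H(x) = <w, x>
eps = rng.normal(scale=sigma, size=n)    # noise, independent of H
y = x @ w + eps                          # undisclosed variables y_i
c = rng.uniform(0.1, 1.0, size=n)        # cost c_i of revealing y_i

# Bayesian update: posterior over w once the users in S reveal their y_i.
# With Gaussian prior and noise, the posterior mean is the ridge regression
# estimate with lambda = sigma^2 / tau^2 (see the equation above).
S = [0, 2, 5]
X_S, y_S = x[S], y[S]
lam = sigma**2 / tau**2
w_post = np.linalg.solve(X_S.T @ X_S + lam * np.eye(d), X_S.T @ y_S)
print(w_post)
```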
