| author | Thibaut Horel <thibaut.horel@gmail.com> | 2012-10-26 11:13:49 -0700 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2012-10-26 11:13:49 -0700 |
| commit | 5810ae8660dffd8805317a83a0d4761ab2018cec (patch) | |
| tree | b0e3018489e62bbb8eb9d826c53b61b6a5b8c1dc | |
| parent | f6b1d3078d0116554d404123ced21664268c3b48 (diff) | |
| download | recommendation-5810ae8660dffd8805317a83a0d4761ab2018cec.tar.gz | |
Some progress in section 2
| -rw-r--r-- | paper.tex | 170 |
1 files changed, 109 insertions, 61 deletions
@@ -44,21 +44,24 @@ \subsection{Notations}
-\begin{itemize}
-    \item If $x\in\mathbf{R}^d$, $x^*$ will denote the transpose of vector $x$.
-    If $(x, y)\in\mathbf{R}^d\times\mathbf{R}^d$, $x^*y$ is the standard
-    inner product of $\mathbf{R}^d$.
-    \item We will often use the following order over symmetric matrices: if $A$
-    and $B$ are two $d\times d$ real symmetric matrices, we write that
-    $A\leq B$ iff:
-    \begin{displaymath}
-        \forall x\in\mathbf{R}^d,\quad x^*Ax \leq x^*Bx
-    \end{displaymath}
-    That is, iff $B-A$ is symmetric semi-definite positive.
-    \item Let us recall this property of the ordered defined above: if $A$ and
-    $B$ are two symmetric definite positive matrices such that $A\leq B$,
-    then $A^{-1}\leq B^{-1}$.
-\end{itemize}
+Throughout the paper, we will make use of the following notations: if $x$ is
+a (column) vector in $\mathbf{R}^d$, $x^*$ denotes its transposed (row)
+vector. Thus, the standard inner product between two vectors $x$ and $y$ is
+simply $x^*y$, and $\norm{x}_2 = \sqrt{x^*x}$ denotes the $L_2$ norm of $x$.
+
+We will also often use the following order over symmetric matrices: if $A$
+and $B$ are two $d\times d$ real symmetric matrices, we write $A\leq B$ iff:
+\begin{displaymath}
+    \forall x\in\mathbf{R}^d,\quad x^*Ax \leq x^*Bx
+\end{displaymath}
+that is, iff $B-A$ is symmetric positive semi-definite.
+
+This order lets us define the notions of a \emph{decreasing} and of
+a \emph{convex} matrix function similarly to their real counterparts. In
+particular, let us recall that matrix inversion is decreasing and convex over
+symmetric positive definite matrices.

\subsection{Data model}

@@ -77,7 +80,9 @@ we assume a linear model:
\forall i\in\mathcal{N},\quad y_i = \beta^* x_i + \varepsilon_i
\end{displaymath}
where $\beta\in\mathbf{R}^d$ and $\varepsilon_i\in\mathbf{R}$ follows a normal
-distribution of mean $0$ and variance $\sigma^2$.
+distribution of mean $0$ and variance $\sigma^2$.
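The inversion property recalled in the added notation paragraph (if $A\leq B$ for symmetric positive definite matrices, then $B^{-1}\leq A^{-1}$) can be sanity-checked numerically. The following is a hypothetical sketch, not part of the commit; the two matrices are made up, with $B-A$ diagonal and non-negative so that $A\leq B$ holds by construction:

```python
# Hypothetical sketch (not from the paper): a 2x2 check that the order
# A <= B (i.e. B - A positive semi-definite) reverses under inversion:
# B^{-1} <= A^{-1} for symmetric positive definite A and B.

def inv2(m):
    """Inverse of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(m, x):
    """Quadratic form x^* M x for a 2x2 matrix M."""
    return sum(x[i] * m[i][j] * x[j] for i in range(2) for j in range(2))

A = [[2.0, 0.5], [0.5, 1.0]]      # symmetric positive definite
B = [[3.0, 0.5], [0.5, 2.5]]      # B - A = diag(1, 1.5) >= 0, so A <= B
invA, invB = inv2(A), inv2(B)

# Check both quadratic-form inequalities on a few directions x.
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.7, -0.4], [-0.3, 0.9]):
    assert quad(A, x) <= quad(B, x) + 1e-12
    assert quad(invB, x) <= quad(invA, x) + 1e-12
```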
+Furthermore, we assume the error $\varepsilon$ to be independent of the user:
+the $(\varepsilon_i)_{i\in\mathcal{N}}$ are mutually independent.

After observing the data, the experimenter could simply do linear regression
to learn the model parameter $\beta$. However, in a more general setup, the

@@ -99,48 +104,91 @@ which is the well-known \emph{ridge regression}
This can thus be seen as linear regression with a regularization term which
prevents $\beta$ from having a large $L_2$-norm.

+\subsection{Value of data}
+
-\begin{itemize}
-    \item set of $n$ users $\mathcal{N} = \{1,\ldots, n\}$
-    \item each user has a public vector of features $x_i\in\mathbf{R}^d$ and
-    some undisclosed variable $y_i\in\mathbf{R}$
-    \item we assume that the data has been normalized: $\forall
-    i\in\mathcal{N}$, $\norm{x_i}\leq 1$
-    \item \textbf{Ridge regression model:}
-    \begin{itemize}
-        \item $y_i = \beta^*x_i + \varepsilon_i$
-        \item $\varepsilon_i \sim \mathcal{N}(0,\sigma^2)$,
-        $(\varepsilon_i)_{i\in \mathcal{N}}$ are mutually independent.
-        \item prior knowledge of $\beta$: $\beta\sim\mathcal{N}(0,\kappa I_d)$
-        \item $\mu = \frac{\kappa}{\sigma^2}$ is the regularization factor.
-        \item after the variables $y_i$ are disclosed, the experimenter
-        does \emph{maximum a posteriori estimation} which is equivalent
-        under Gaussian prior to solving the ridge regression
-        maximisation problem:
-        \begin{displaymath}
-            \beta_{\text{max}} = \argmax \sum_i (y_i - \beta^*x_i)^2
-            + \frac{1}{\mu}\sum_i \norm{\beta}_2^2
-        \end{displaymath}
-    \end{itemize}
-\end{itemize}
+Because the users' private variables $y_i$ have not been observed yet when
+the experimenter has to decide which users to include in his experiment, we
+treat $\beta$ as a random variable whose distribution is updated after
+observing the data.
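The ridge-regression-as-MAP claim above can be illustrated in one dimension. The sketch below is hypothetical (data and parameter values are made up, and the standard argmin form of the ridge objective is assumed): with prior $\beta\sim\mathcal{N}(0,\kappa)$, noise variance $\sigma^2$, and $\mu = \kappa/\sigma^2$, the closed-form estimate $\hat\beta = \sum x_iy_i / (\sum x_i^2 + 1/\mu)$ minimizes the regularized squared error:

```python
# Hypothetical 1-d sketch (not from the paper; data are synthetic): the MAP
# estimate under a Gaussian prior beta ~ N(0, kappa) is the minimizer of the
# ridge objective with regularization factor mu = kappa / sigma^2.
import random

random.seed(1)
sigma2, kappa = 0.25, 4.0
mu = kappa / sigma2
beta_true = 0.7
xs = [random.uniform(-1, 1) for _ in range(50)]   # normalized: |x_i| <= 1
ys = [beta_true * x + random.gauss(0, sigma2 ** 0.5) for x in xs]

# Closed-form ridge estimate for d = 1.
beta_hat = (sum(x * y for x, y in zip(xs, ys))
            / (sum(x * x for x in xs) + 1 / mu))

def ridge_loss(b):
    """Ridge objective: squared error plus (1/mu) * ||beta||^2."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) + b ** 2 / mu

# The closed form is the exact minimizer of this convex quadratic.
assert all(ridge_loss(beta_hat) <= ridge_loss(beta_hat + e)
           for e in (-0.05, -0.01, 0.01, 0.05))
```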
-\subsection{Economics}
+Let us recall that if $\beta$ is a random variable over $\mathbf{R}^d$ whose
+probability distribution has a density function $f$ with respect to the
+Lebesgue measure, its entropy is given by:
+\begin{displaymath}
+    \mathbb{H}(\beta) \defeq -\int_{b\in\mathbf{R}^d} f(b)\log f(b)\,\text{d}b
+\end{displaymath}

-Value function: following the experiment design theory, the value of
-data is the decrease of uncertainty about the model. Common measure of
-uncertainty, the entropy. Thus, the value of a set $S$ of users is:
+A usual way to measure the decrease of uncertainty induced by the observation
+of data is to use the entropy. This leads to the following definition of the
+value of data, called the \emph{value of information}:
\begin{displaymath}
-    V(S) = \mathbb{H}(\beta) - \mathbb{H}(\beta|Y_S)
+    \forall S\subset\mathcal{N},\quad V(S) = \mathbb{H}(\beta)
+    - \mathbb{H}(\beta\,|\,Y_S)
\end{displaymath}
-where $Y_S = \{y_i,\,i\in S\}$ is the set of observations.
+where $Y_S = \{y_i,\,i\in S\}$ is the set of observed data.

-In our case (under Gaussian prior):
+\begin{theorem}
+    Under the ridge regression model explained in section TODO, the value of
+    data is equal to:
+    \begin{align*}
+        \forall S\subset\mathcal{N},\; V(S)
+        & = \frac{1}{2}\log\det\left(I_d + \mu\sum_{i\in S} x_ix_i^*\right)\\
+        & \defeq \frac{1}{2}\log\det A(S)
+    \end{align*}
+\end{theorem}
+
+\begin{proof}
+Let us denote by $X_S$ the matrix whose rows are the vectors
+$(x_i^*)_{i\in S}$.
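Before the proof, the theorem can be sanity-checked numerically. The sketch below is hypothetical (parameters and vectors are made up, $d=2$, $|S|=3$): it computes the entropy drop from the Gaussian prior and posterior covariances of $\beta$, assuming the standard Bayesian posterior precision $\kappa^{-1}I_d + \sigma^{-2}X_S^*X_S$, and checks that it matches $\frac{1}{2}\log\det A(S)$ and, via Sylvester's identity, the $n\times n$ determinant appearing later in the proof:

```python
# Hypothetical numeric check of the theorem (not from the paper), d = 2,
# |S| = n = 3.  Gaussian entropies reduce to log-determinants, the (2 pi e)^d
# factors cancel, and the entropy drop becomes half a log-determinant ratio.
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

kappa, sigma2 = 2.0, 0.5
mu = kappa / sigma2
xs = [[0.6, 0.2], [0.1, 0.9], [0.5, 0.5]]          # the x_i, norms <= 1

# d x d side: A(S) = I_d + mu * X_S^* X_S  (X_S has the x_i^* as rows).
gram = [[sum(x[i] * x[j] for x in xs) for j in range(2)] for i in range(2)]
A = [[(i == j) + mu * gram[i][j] for j in range(2)] for i in range(2)]

# n x n side: I_n + (kappa / sigma^2) * X_S X_S^*.
K = [[(i == j) + mu * sum(a * b for a, b in zip(xs[i], xs[j]))
      for j in range(3)] for i in range(3)]

# Entropy drop: half the log of det(prior cov) / det(posterior cov), with
# posterior precision (assumed) kappa^{-1} I_d + sigma^{-2} X_S^* X_S.
post_prec = [[(i == j) / kappa + gram[i][j] / sigma2 for j in range(2)]
             for i in range(2)]
entropy_drop = 0.5 * math.log(kappa ** 2 * det2(post_prec))

assert abs(entropy_drop - 0.5 * math.log(det2(A))) < 1e-9
assert abs(0.5 * math.log(det3(K)) - 0.5 * math.log(det2(A))) < 1e-9
```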
+Observe that $A(S)$ can simply be written as:
+\begin{displaymath}
+    A(S) = I_d + \mu X_S^* X_S
+\end{displaymath}
+
+Let us recall that the entropy of a multivariate normal variable $B$ over
+$\mathbf{R}^d$ with covariance matrix $\Sigma$ is given by:
+\begin{equation}\label{eq:multivariate-entropy}
+    \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma\big)
+\end{equation}
+
+Using the chain rule for conditional entropy, we get that:
+\begin{displaymath}
+    V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta)
+\end{displaymath}
+
+Conditioned on $\beta$, $Y_S$ follows a multivariate normal distribution of
+mean $X_S\beta$ and of covariance matrix $\sigma^2 I_n$, where $n = |S|$.
+Hence:
+\begin{equation}\label{eq:h1}
+    \mathbb{H}(Y_S\,|\,\beta)
+    = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
+\end{equation}
+
+$Y_S$ also follows a multivariate normal distribution of mean zero. Let us
+compute its covariance matrix, $\Sigma_Y$:
+\begin{align*}
+    \Sigma_Y & = \expt{Y_SY_S^*} = \expt{(X_S\beta + E)(X_S\beta + E)^*}\\
+    & = \kappa X_S X_S^* + \sigma^2I_n
+\end{align*}
+Thus, we get that:
+\begin{equation}\label{eq:h2}
+    \mathbb{H}(Y_S)
+    = \frac{1}{2}\log\left((2\pi e)^n
+    \det(\kappa X_S X_S^* + \sigma^2 I_n)\right)
+\end{equation}
+
+Combining \eqref{eq:h1} and \eqref{eq:h2}, we get:
+\begin{displaymath}
+    V(S) = \frac{1}{2}\log\det\left(I_n
+    + \frac{\kappa}{\sigma^2}X_S X_S^*\right)
+\end{displaymath}
+
+Finally, we can use Sylvester's determinant theorem,
+$\det(I_n + MN) = \det(I_d + NM)$, with $M = \frac{\kappa}{\sigma^2}X_S$ and
+$N = X_S^*$, to get the result.
+\end{proof}
+
+It is also interesting to consider the marginal contribution of a user to
+a set: the increase in value induced by adding a user to an already existing
+set of users. We have the following lemma.
+\begin{lemma}[Marginal contribution]
+    \begin{displaymath}
@@ -160,18 +208,18 @@ In our case (under Gaussian prior):
\end{align*}
where the last equality comes from Sylvester's determinant formula.
\end{proof}
-\begin{itemize}
-    \item each user $i$ has a cost $c_i$
-    \item the auctioneer has a budget constraint $B$
-    \item optimisation problem:
-    \begin{displaymath}
-        OPT(V,\mathcal{N}, B) = \max_{S\subset\mathcal{N}} \left\{ V(S)\,|\,
-        \sum_{i\in S}c_i\leq B\right\}
-    \end{displaymath}
-\end{itemize}
-Problem, the costs are reported by users, they can lie. It is an auction
-problem. Goal design a mechanism (selection and payment scheme) which is:
+Because $A(S)$ is symmetric positive definite, the marginal contribution is
+positive, which proves that the value function is set-increasing.
+Furthermore, it is easy to see that if $S\subset S'$, then $A(S)\leq A(S')$.
+Using the fact that matrix inversion is decreasing, we see that the marginal
+contribution of a fixed user is a set-decreasing function. This is the
+\emph{submodularity} of the value function.
+
+\subsection{Auction}
+
+Explain the optimization problem, why it has to be formulated as an auction
+problem. Explain the goals:
\begin{itemize}
    \item truthful
    \item individually rational
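The submodularity claim added in this hunk (the marginal contribution of a fixed user shrinks as the set grows) can be illustrated numerically. This is a hypothetical sketch with made-up vectors, $d = 2$ and $\mu = 1$:

```python
# Hypothetical illustration (not from the paper): the marginal contribution
# V(S + {i}) - V(S) of a fixed user decreases as S grows, i.e. the value
# function V(S) = (1/2) log det A(S) is submodular.
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def A(users, mu=1.0):
    """A(S) = I_d + mu * sum_{i in S} x_i x_i^* for d = 2."""
    m = [[1.0, 0.0], [0.0, 1.0]]
    for x in users:
        for i in range(2):
            for j in range(2):
                m[i][j] += mu * x[i] * x[j]
    return m

def V(users):
    return 0.5 * math.log(det2(A(users)))

x1, x2, x3 = [0.8, 0.1], [0.3, 0.7], [0.5, 0.5]
small, large = [x1], [x1, x2]            # small is a subset of large
gain_small = V(small + [x3]) - V(small)  # marginal contribution of x3
gain_large = V(large + [x3]) - V(large)
assert 0 < gain_large <= gain_small      # positive and diminishing
```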
