% Use the 'submit' option when you submit your solutions;
% omit it to typeset the problems by themselves.
\documentclass[submit]{harvardml}

\usepackage{url}

% Put in your full name and email address.
\name{Thibaut Horel}
\email{thorel@seas.harvard.edu}

% List any people you worked with.
\collaborators{%
}

% You don't need to change these.
\course{CS281-F13}
\assignment{Assignment \#4}
\duedate{23:59pm November 13, 2015}

% Useful macros
\newcommand{\bw}{\boldsymbol{w}}
\newcommand{\distNorm}{\mathcal{N}}
\newcommand{\given}{\,|\,}
\newcommand{\ident}{\mathbb{I}}
\newcommand{\btheta}{\boldsymbol{\theta}}
\newcommand{\bz}{\boldsymbol{z}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{graphicx}

% Some useful macros.
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\var}{\text{var}}
\newcommand{\cov}{\text{cov}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\ep}{\varepsilon}

\theoremstyle{plain}
\newtheorem{lemma}{Lemma}
\input{def.tex}
\begin{document}

\begin{problem}[10 pts]
  For a target density~$p(x)$ and proposal
  density~$q(x'\gets x)$, the Metropolis--Hastings transition
  operator is given by
  \begin{align*}
    T(x'\gets x) &= q(x'\gets x)\,\min\left\{1,
    \frac{\tilde p(x')\,q(x\gets x')}{\tilde p(x)\,q(x'\gets x)}
    \right\}
  \end{align*}
  where $\tilde p(x)$ is the unnormalized target density.

  Show that this transition operator satisfies detailed balance.
\end{problem}

\paragraph{Solution.} We need to show that for any two states $x$ and $x'$:
\begin{displaymath}
  p(x)T(x'\gets x) = p(x')T(x\gets x')
\end{displaymath}
Expanding both sides, this amounts to showing that:
\begin{displaymath}
  p(x) q(x'\gets x)\,\min\left\{1,
  \frac{\tilde p(x')\,q(x\gets x')}{\tilde p(x)\,q(x'\gets x)}
  \right\}
  = p(x')q(x\gets x')\,\min\left\{1,
  \frac{\tilde p(x)\,q(x'\gets x)}{\tilde p(x')\,q(x\gets x')}
  \right\}
\end{displaymath}
We distinguish two cases:
\begin{itemize}
  \item $\tilde{p}(x')q(x\gets x') \leq \tilde{p}(x)q(x'\gets x)$: in this
    case the $\min$ on the left-hand side collapses to its second argument.
    Writing $p = \tilde{p}/Z$ with $Z$ the normalization constant, the
    left-hand side becomes:
    \begin{displaymath}
      p(x)\,\frac{\tilde p(x')\,q(x\gets x')}{\tilde p(x)}
      = p(x')\,q(x\gets x')
    \end{displaymath}
    In the same case, the $\min$ on the right-hand side collapses to $1$
    (its first argument), so the right-hand side also equals
    $p(x')q(x\gets x')$, and the two sides agree.
  \item $\tilde{p}(x')q(x\gets x') > \tilde{p}(x)q(x'\gets x)$: this case
    follows from the previous one by swapping the roles of $x$ and $x'$.
\end{itemize}
Together, the two cases establish detailed balance.

\newpage

\section*{Modeling users and jokes with a Bayesian latent bilinear model}

The next two questions will develop Bayesian inference methods for the simplest version of the latent bilinear model you used to model joke ratings in HW3.


The data set we'll use is the same as in HW3, a modified and preprocessed variant of the Jester data set.
However, to make things easier (and to make being Bayesian more worthwhile) {\bf we'll only use the first 100,000 joke ratings} for your training set. The second 100,000 ratings will form your test set.

\subsection*{The model}

The model is the same as in HW3, but with Gaussian priors on the latent parameter matrices $U$ and $V$.
Let~${r_{i,j}\in\{1,2,3,4,5\}}$ be the rating of user $i$ on joke $j$. A latent linear model introduces a vector ${u_i\in\R^K}$ for each user and a vector~${v_j\in\R^K}$ for each joke. Then, each rating is modeled as a noisy version of the appropriate inner product. Specifically,
\[
r_{i,j} \sim \mathcal{N}(u_i^T v_j, \sigma_\epsilon^2).
\]
Fix $\sigma_\epsilon^2$ to be 1.0, and $K = 2$.

We put independent Gaussian priors on each element of $U$ and $V$:
\[U_{i,k} \sim \mathcal{N}(0, \sigma_U^2=5)\]
\[V_{j,k} \sim \mathcal{N}(0, \sigma_V^2=5)\]

\begin{problem}[Gibbs sampling, 30pts]

\begin{enumerate}

\item Write down the Gibbs update equations for $U$ and $V$. That is to say, write their conditional distributions, conditioned on all the other variables as well as the training data:
%
$$p(U | V, R )$$
$$p(V | U, R )$$

Because the model is bilinear, these updates should have fairly simple forms.

\item Write a Gibbs sampler for $U$ and $V$ that updates $U$ and $V$ in turn.

\item Run the Gibbs sampler for 100 steps (i.e.\ update both $U$ and $V$ 100 times).
Plot the training and test-set log-likelihood as a function of the number of steps through your training set.
That is, use all previous samples of $U, V$ to evaluate the predictive probability of all ratings.

\item One reason to be Bayesian is that you don't have to worry about overfitting.
Run your Gibbs sampler for $K = 1$ to $K = 10$, and plot the training and test-set log-likelihood for each value of $K$. How do the shapes of these curves differ from the curves you saw when doing maximum likelihood estimation in HW3?


\end{enumerate}
\end{problem}

\paragraph{Solution.} 1.
Using the results about systems of Gaussians from
Murphy's chapter 4, we get:
\begin{displaymath}
  u_i \mid R, V \sim \mathcal{N}\left(\frac{\Sigma_1}{\sigma_\epsilon^2} V_i^Tr_i,
  \Sigma_1\right)
\end{displaymath}
where $r_i$ is the vector of ratings given by user $i$, $V_i$ is the projection
of matrix $V$ which only keeps the jokes (rows) rated by user $i$, and $\Sigma_1$
is given by:
\begin{displaymath}
  \Sigma_1^{-1} = \frac{\mathrm{Id}}{\sigma_U^2} + \frac{V_i^TV_i}{\sigma^2_\epsilon}
\end{displaymath}

Similarly, we get that:
\begin{displaymath}
  v_j \mid R, U \sim
  \mathcal{N}\left(\frac{\Sigma_2}{\sigma_\epsilon^2}U_j^Tr_j,
  \Sigma_2\right)
\end{displaymath}
where $U_j$ is the projection of matrix $U$ obtained by only keeping the rows
(users) which have rated joke $j$, $r_j$ is the vector of ratings given to joke
$j$, and $\Sigma_2$ is given by:
\begin{displaymath}
  \Sigma_2^{-1} = \frac{\mathrm{Id}}{\sigma_V^2} + \frac{U_j^TU_j}{\sigma^2_\epsilon}
\end{displaymath}

2. Please see the code in the appendix \textsf{2.py}. I used two copies of the
   ratings matrix, one in CSC format and one in CSR format, to be able to
   quickly extract $r_i$ (the ratings of a given user) and $r_j$ (the ratings
   given to a given joke), as well as the indices of these ratings, which are
   used to build the matrices $V_i$ and $U_j$ appearing in the above formulas.

3. The plot of the likelihood (up to a multiplicative factor) can be found
   below:
\begin{center}
  \includegraphics[width=0.8\textwidth]{2.pdf}
\end{center}

4. For clarity I decided to plot the MSE (mean squared error) in log scale
   instead of the log-likelihood; otherwise the lines were hard to distinguish.
   Note that since the model is Gaussian with fixed noise variance, the MSE is
   simply the negative of the log-likelihood, up to an additive constant and
   a multiplicative factor.
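   As an aside, the conditional update for a single user vector derived in
   part 1 can be sketched in NumPy as follows. This is only a minimal
   illustration of the formulas above, not the code from the appendix; the
   function name \textsf{sample\_user} and its argument layout are my own.

```python
import numpy as np

def sample_user(V_i, r_i, sigma_eps2=1.0, sigma_U2=5.0, rng=None):
    """Draw u_i ~ N(Sigma_1 V_i^T r_i / sigma_eps2, Sigma_1).

    V_i : (n_i, K) array, rows of V for the jokes rated by user i
    r_i : (n_i,) array, the corresponding ratings
    """
    rng = rng if rng is not None else np.random.default_rng()
    K = V_i.shape[1]
    # Sigma_1^{-1} = Id / sigma_U^2 + V_i^T V_i / sigma_eps^2
    precision = np.eye(K) / sigma_U2 + V_i.T @ V_i / sigma_eps2
    Sigma1 = np.linalg.inv(precision)
    mean = Sigma1 @ (V_i.T @ r_i) / sigma_eps2
    return rng.multivariate_normal(mean, Sigma1)
```

   The update for a joke vector $v_j$ is symmetric, with $U_j$, $r_j$ and
   $\sigma_V^2$ in place of $V_i$, $r_i$ and $\sigma_U^2$.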
   The plot of the training MSE for different values of $K$ can be found below:
\begin{center}
  \includegraphics[width=0.7\textwidth]{train.pdf}
\end{center}
And for the test MSE:
\begin{center}
  \includegraphics[width=0.7\textwidth]{test.pdf}
\end{center}
As we can see, the MSE is smaller on the training dataset, which is to be
expected. However, unlike in the previous assignment, the test MSE keeps
improving as $K$ increases. We do not observe the overfitting phenomenon where
the training MSE decreases with $K$ but the test MSE starts increasing again
for large values of $K$.


\begin{problem}[Stochastic Variational Inference, 30pts]

Now we'll repeat the same experiments, but using stochastic variational inference instead.

Recall that variational inference optimizes a lower bound on the log marginal likelihood (integrating out parameters $\theta$), like so:
\begin{align}
\log p(x) & = \log \int p(x, \theta) d\theta = \log \int p(x|\theta) p(\theta) d\theta \\
& = \log \int \frac{q_\phi(\theta)}{q_\phi(\theta)} p(x|\theta) p(\theta) d\theta
  = \log \mathbb{E}_{q_\phi} \left[ \frac{1}{q_\phi(\theta)} p(x|\theta) p(\theta) \right] \\
& \geq \mathbb{E}_{q_\phi} \log \left[ \frac{1}{q_\phi(\theta)} p(x|\theta) p(\theta) \right]
  = \underbrace{-\mathbb{E}_{q_\phi} \log q_\phi(\theta)}_{\textnormal{entropy}} + \underbrace{\mathbb{E}_{q_\phi} \log p(\theta)}_{\textnormal{prior}} + \underbrace{\mathbb{E}_{q_\phi} \log p(x|\theta)}_{\textnormal{likelihood}}
= \mathcal{L}(\phi)
\end{align}
%
In this case, $\theta = U,V$ and $x = R$:
%
\begin{align}
\mathcal{L}(\phi) = -\mathbb{E}_{q_\phi} \log q_\phi(U, V) + \mathbb{E}_{q_\phi} \log p(U, V) + \sum_{n=1}^N \mathbb{E}_{q_\phi} \log p(r_n | U, V)
\end{align}
%
The nice part about writing the lower bound as a set of expectations is that if we can sample from $q_\phi(\theta)$, then we can estimate the bound (and its gradient) by simple Monte Carlo.
That is, we can sample from $q_\phi(\theta)$, and estimate the bound or its gradients by evaluating at the samples and averaging.

This is a very general formula that works for many different priors, likelihoods and variational approximations.
In this case, we'll keep things simple and choose $q(U,V)$ to be a Gaussian factorized over every single entry of each matrix (the same form as the prior).
Thus our variational parameters will consist of a mean and variance for each entry in $U$ and $V$.
% $\phi^{(\mu U)}_{ik}$, $\phi^{(\sigma^2 U)}_{ik}$, $\phi^{(\mu V)}_{jk}$, and $\phi^{(\sigma^2 V)}_{jk}$.

To make our method scalable, we have to update the variational parameters without passing over the whole dataset each time.
We can do this by taking Monte Carlo estimates over ratings.
However, we need to scale our estimates so that their expectations are the same as if we did the sum exactly.


\begin{enumerate}

\item
Write an unbiased Monte Carlo estimate of the lower bound, depending only on one rating at a time.
Write the entropy and prior terms exactly, without using Monte Carlo. Hint: You can combine them into a KL divergence between two Gaussians.

\item Write the gradient of your estimate of the lower bound w.r.t.\ the variational parameters.

Hint: To capture all the ways in which the gradient depends on the parameters, if your gradient has an expectation of the form $\mathbb{E}_{X \sim \mathcal{N}(\mu, \sigma)}[f(X)]$, re-write it in terms of $\mathbb{E}_{Z \sim \mathcal{N}(0, 1)}[f(Z \sigma + \mu)]$.

\item For $K = 2$, optimize the variational parameters for 10 epochs over the first 100,000 datapoints. Use whichever optimization method you prefer.

Plot the training and test-set log-likelihood as a function of the number of epochs, as well as the marginal likelihood lower bound.
That is to say: at the end of each epoch, evaluate the log of the average predictive probability of all ratings in the training and test sets using 100 samples from $q(U,V)$.
The lower bound is the sum of entropy, prior and likelihood terms, while the training-set and test-set likelihoods only use the likelihood term.

\item Fit your variational model for $K = 1$ to $K = 10$, and plot the training-set log-likelihood, test-set log-likelihood, and lower bound for each value of $K$.
How do the shapes of these curves differ?

\item What are the pros and cons of Gibbs sampling versus stochastic variational inference?

\end{enumerate}
\end{problem}

\appendix
\section{\textsf{2.py}}
\input{2.tex}
\end{document}
