% Use the 'submit' option when you submit your solutions;
% omit it to typeset the problems by themselves.
\documentclass[submit]{harvardml}

\usepackage{url}

% Put in your full name and email address.
\name{Thibaut Horel}
\email{thorel@seas.harvard.edu}

% List any people you worked with.
\collaborators{%
}

% You don't need to change these.
\course{CS281-F13}
\assignment{Assignment \#4}
\duedate{23:59pm November 13, 2015}

% Useful macros
\newcommand{\bw}{\boldsymbol{w}}
\newcommand{\distNorm}{\mathcal{N}}
\newcommand{\given}{\,|\,}
\newcommand{\ident}{\mathbb{I}}
\newcommand{\btheta}{\boldsymbol{\theta}}
\newcommand{\bz}{\boldsymbol{z}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{graphicx}

% Some useful macros.
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\var}{\text{var}}
\newcommand{\cov}{\text{cov}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\ep}{\varepsilon}

\theoremstyle{plain}
\newtheorem{lemma}{Lemma}
\input{def.tex}
\begin{document}

\begin{problem}[10 pts]
  For a target density~$p(x)$ and proposal
  density~$q(x'\gets x)$, the Metropolis--Hastings transition
  operator is given by
  \begin{align*}
    T(x'\gets x) &= q(x'\gets x)\,\min\left\{1,
    \frac{\tilde p(x')\,q(x\gets x')}{\tilde p(x)\,q(x'\gets x)}
    \right\}
  \end{align*}
  where $\tilde p(x)$ is the unnormalized target density.

  Show that this transition operator satisfies detailed balance.
\end{problem}

\paragraph{Solution.} We need to show that for any two states $x$ and $x'$:
\begin{displaymath}
  p(x)T(x'\gets x) = p(x')T(x\gets x')
\end{displaymath}
Expanding both sides, this amounts to showing that:
\begin{displaymath}
  p(x) q(x'\gets x)\,\min\left\{1,
  \frac{\tilde p(x')\,q(x\gets x')}{\tilde p(x)\,q(x'\gets x)}
  \right\}
  = p(x')q(x\gets x')\,\min\left\{1,
  \frac{\tilde p(x)\,q(x'\gets x)}{\tilde p(x')\,q(x\gets x')}
  \right\}
\end{displaymath}
We distinguish two cases:
\begin{itemize}
  \item $\tilde{p}(x')q(x\gets x') \leq \tilde{p}(x)q(x'\gets x)$: in this
    case the $\min$ on the left-hand side collapses to its second argument.
    Writing $p = \tilde{p}/Z$ with $Z$ the normalization constant, the
    left-hand side becomes:
    \begin{displaymath}
      p(x)\,\frac{\tilde p(x')\,q(x\gets x')}{\tilde p(x)}
      = p(x')\,q(x\gets x')
    \end{displaymath}
    In the same case, the $\min$ on the right-hand side collapses to $1$
    (its first argument), so the right-hand side also equals
    $p(x')q(x\gets x')$, and the two sides agree.
  \item $\tilde{p}(x')q(x\gets x') > \tilde{p}(x)q(x'\gets x)$: this case
    follows from the previous one by swapping the roles of $x$ and $x'$.
\end{itemize}
Together, the two cases establish detailed balance.

\newpage

\section*{Modeling users and jokes with a Bayesian latent bilinear model}

The next two questions will develop Bayesian inference methods for the simplest version of the latent bilinear model you used to model joke ratings in HW3.


The data set we'll use is the same as in HW3, a modified and preprocessed variant of the Jester data set.
However, to make things easier (and to make being Bayesian more worthwhile) {\bf we'll only use the first 100,000 joke ratings} for your training set. The second 100,000 ratings will form your test set.

\subsection*{The model}

The model is the same as in HW3, but with Gaussian priors on the latent parameter matrices $U$ and $V$.
Let~${r_{i,j}\in\{1,2,3,4,5\}}$ be the rating of user $i$ on joke $j$. A latent linear model introduces a vector ${u_i\in\R^K}$ for each user and a vector~${v_j\in\R^K}$ for each joke. Then, each rating is modeled as a noisy version of the appropriate inner product. Specifically,
\[
r_{i,j} \sim \mathcal{N}(u_i^T v_j, \sigma_\epsilon^2).
\]
Fix $\sigma_\epsilon^2$ to be 1.0, and $K = 2$.

We put independent Gaussian priors on each element of $U$ and $V$:
\[U_{i,k} \sim \mathcal{N}(0, \sigma_U^2=5)\]
\[V_{j,k} \sim \mathcal{N}(0, \sigma_V^2=5)\]

\begin{problem}[Gibbs sampling, 30pts]

\begin{enumerate}

\item Write down the Gibbs update equations for $U$ and $V$. That is to say, write their conditional distributions, conditioned on all the other variables as well as the training data:
%
$$p(U | V, R )$$
$$p(V | U, R )$$

Because the model is bilinear, these updates should have fairly simple forms.

\item Write a Gibbs sampler for $U$ and $V$ that updates $U$ and $V$ in turn.

\item Run the Gibbs sampler for 100 steps (i.e.\ update both $U$ and $V$ 100 times).
Plot the training and test-set log-likelihood as a function of the number of steps through your training set.
That is, use all previous samples of $U, V$ to evaluate the predictive probability of all ratings.

\item One reason to be Bayesian is that you don't have to worry about overfitting.
Run your Gibbs sampler for $K = 1$ to $K = 10$, and plot the training and test-set log-likelihood for each value of $K$. How do the shapes of these curves differ from the curves you saw when doing maximum likelihood estimation in HW3?


\end{enumerate}
\end{problem}

\paragraph{Solution.} 1.
Using the results about systems of Gaussians from
Murphy's chapter 4, we get:
\begin{displaymath}
  u_i \mid R, V \sim \mathcal{N}\left(\frac{\Sigma_1}{\sigma_\epsilon^2} V_i^Tr_i,
  \Sigma_1\right)
\end{displaymath}
where $r_i$ is the vector of ratings given by user $i$, $V_i$ is the projection
of matrix $V$ which only keeps the jokes (rows) rated by user $i$, and $\Sigma_1$
is given by:
\begin{displaymath}
  \Sigma_1^{-1} = \frac{\mathrm{Id}}{\sigma_U^2} + \frac{V_i^TV_i}{\sigma^2_\epsilon}
\end{displaymath}

Similarly, we get that:
\begin{displaymath}
  v_j \mid R, U \sim
  \mathcal{N}\left(\frac{\Sigma_2}{\sigma_\epsilon^2}U_j^Tr_j,
  \Sigma_2\right)
\end{displaymath}
where $U_j$ is the projection of matrix $U$ obtained by only keeping the rows
(users) which have rated joke $j$, $r_j$ is the vector of ratings given to joke
$j$, and $\Sigma_2$ is given by:
\begin{displaymath}
  \Sigma_2^{-1} = \frac{\mathrm{Id}}{\sigma_V^2} + \frac{U_j^TU_j}{\sigma^2_\epsilon}
\end{displaymath}

2. Please see the code in the appendix \textsf{2.py}. I used two copies of the
   ratings matrix, one in CSC format and one in CSR format, to be able to
   quickly extract $r_i$ (the ratings of a given user) and $r_j$ (the ratings
   given to a given joke), as well as the indices of these ratings, which are
   used to build the matrices $V_i$ and $U_j$ appearing in the above formulas.

3. The plot of the likelihood (up to a multiplicative factor) can be found
   below:
\begin{center}
  \includegraphics[width=0.8\textwidth]{2.pdf}
\end{center}

4. For clarity I decided to plot the MSE (mean squared error) in log scale
   instead of the log-likelihood; otherwise the lines were hard to distinguish.
   Note that since the model is Gaussian with fixed noise variance, the MSE is
   simply the negative of the log-likelihood, up to an additive constant and
   a multiplicative factor.
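   As an aside, the conditional update for a single user vector derived in
   part 1 can be sketched in NumPy as follows. This is only a minimal
   illustration of the formulas above, not the code from the appendix; the
   function name \textsf{sample\_user} and its argument layout are my own.

```python
import numpy as np

def sample_user(V_i, r_i, sigma_eps2=1.0, sigma_U2=5.0, rng=None):
    """Draw u_i ~ N(Sigma_1 V_i^T r_i / sigma_eps2, Sigma_1).

    V_i : (n_i, K) array, rows of V for the jokes rated by user i
    r_i : (n_i,) array, the corresponding ratings
    """
    rng = rng if rng is not None else np.random.default_rng()
    K = V_i.shape[1]
    # Sigma_1^{-1} = Id / sigma_U^2 + V_i^T V_i / sigma_eps^2
    precision = np.eye(K) / sigma_U2 + V_i.T @ V_i / sigma_eps2
    Sigma1 = np.linalg.inv(precision)
    mean = Sigma1 @ (V_i.T @ r_i) / sigma_eps2
    return rng.multivariate_normal(mean, Sigma1)
```

   The update for a joke vector $v_j$ is symmetric, with $U_j$, $r_j$ and
   $\sigma_V^2$ in place of $V_i$, $r_i$ and $\sigma_U^2$.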
   The plot of the training MSE for different values of $K$ can be found below:
\begin{center}
  \includegraphics[width=0.7\textwidth]{train.pdf}
\end{center}
And for the test MSE:
\begin{center}
  \includegraphics[width=0.7\textwidth]{test.pdf}
\end{center}
As we can see, the MSE is smaller on the training dataset, which is to be
expected. However, unlike in the previous assignment, the test MSE keeps
improving as $K$ increases. We do not observe the overfitting phenomenon where
the training MSE decreases with $K$ but the test MSE starts increasing again
for large values of $K$.


\begin{problem}[Stochastic Variational Inference, 30pts]

Now we'll repeat the same experiments, but using stochastic variational inference instead.

Recall that variational inference optimizes a lower bound on the log marginal likelihood (integrating out parameters $\theta$), like so:
\begin{align}
\log p(x) & = \log \int p(x, \theta) d\theta = \log \int p(x|\theta) p(\theta) d\theta \\
& = \log \int \frac{q_\phi(\theta)}{q_\phi(\theta)} p(x|\theta) p(\theta) d\theta
  = \log \mathbb{E}_{q_\phi} \left[ \frac{1}{q_\phi(\theta)} p(x|\theta) p(\theta) \right] \\
& \geq \mathbb{E}_{q_\phi} \log \left[ \frac{1}{q_\phi(\theta)} p(x|\theta) p(\theta) \right]
  = \underbrace{-\mathbb{E}_{q_\phi} \log q_\phi(\theta)}_{\textnormal{entropy}} + \underbrace{\mathbb{E}_{q_\phi} \log p(\theta)}_{\textnormal{prior}} + \underbrace{\mathbb{E}_{q_\phi} \log p(x|\theta)}_{\textnormal{likelihood}}
= \mathcal{L}(\phi)
\end{align}
%
In this case, $\theta = U,V$ and $x = R$:
%
\begin{align}
\mathcal{L}(\phi) = -\mathbb{E}_{q_\phi} \log q_\phi(U, V) + \mathbb{E}_{q_\phi} \log p(U, V) + \sum_{n=1}^N \mathbb{E}_{q_\phi} \log p(r_n | U, V)
\end{align}
%
The nice part about writing the lower bound as a set of expectations is that if we can sample from $q_\phi(\theta)$, then we can estimate the bound (and its gradient) by simple Monte Carlo.
That is, we can sample from $q_\phi(\theta)$, and estimate the bound or its gradients by evaluating at the samples and averaging.

This is a very general formula that works for many different priors, likelihoods and variational approximations.
In this case, we'll keep things simple and choose $q(U,V)$ to be a Gaussian factorized over every single entry of each matrix (the same form as the prior).
Thus our variational parameters will consist of a mean and variance for each entry in $U$ and $V$.
% $\phi^{(\mu U)}_{ik}$, $\phi^{(\sigma^2 U)}_{ik}$, $\phi^{(\mu V)}_{jk}$, and $\phi^{(\sigma^2 V)}_{jk}$.

To make our method scalable, we have to update the variational parameters without passing over the whole dataset each time.
We can do this by taking Monte Carlo estimates over ratings.
However, we need to scale our estimates so that their expectations are the same as if we did the sum exactly.


\begin{enumerate}

\item
Write an unbiased Monte Carlo estimate of the lower bound, depending only on one rating at a time.
Write the entropy and prior terms exactly, without using Monte Carlo. Hint: You can combine them into a KL divergence between two Gaussians.

\item Write the gradient of your estimate of the lower bound w.r.t.\ the variational parameters.

Hint: To capture all the ways in which the gradient depends on the parameters, if your gradient has an expectation of the form $\mathbb{E}_{X \sim \mathcal{N}(\mu, \sigma)}[f(X)]$, re-write it in terms of $\mathbb{E}_{Z \sim \mathcal{N}(0, 1)}[f(Z \sigma + \mu)]$.

\item For $K = 2$, optimize the variational parameters for 10 epochs over the first 100,000 datapoints. Use whichever optimization method you prefer.

Plot the training and test-set log-likelihood as a function of the number of epochs, as well as the marginal likelihood lower bound.
That is to say: at the end of each epoch, evaluate the log of the average predictive probability of all ratings in the training and test sets using 100 samples from $q(U,V)$.
The lower bound is the sum of entropy, prior and likelihood terms, while the training-set and test-set likelihoods only use the likelihood term.

\item Fit your variational model for $K = 1$ to $K = 10$, and plot the training-set log-likelihood, test-set log-likelihood, and lower bound for each value of $K$.
How do the shapes of these curves differ?

\item What are the pros and cons of Gibbs sampling versus stochastic variational inference?

\end{enumerate}
\end{problem}

\appendix
\section{\textsf{2.py}}
\input{2.tex}
\end{document}
