% Use the 'submit' option when you submit your solutions;
% omit it to typeset the problems by themselves.
\documentclass[submit]{harvardml}
\usepackage{url}
% Put in your full name and email address.
\name{Thibaut Horel}
\email{thorel@seas.harvard.edu}
% List any people you worked with.
\collaborators{%
}
% You don't need to change these.
\course{CS281-F13}
\assignment{Assignment \#4}
\duedate{11:59pm, November 13, 2015}
% Useful macros
\newcommand{\bw}{\boldsymbol{w}}
\newcommand{\distNorm}{\mathcal{N}}
\newcommand{\given}{\,|\,}
\newcommand{\ident}{\mathbb{I}}
\newcommand{\btheta}{\boldsymbol{\theta}}
\newcommand{\bz}{\boldsymbol{z}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{graphicx}
% Some useful macros.
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\var}{\text{var}}
\newcommand{\cov}{\text{cov}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\ep}{\varepsilon}
\usepackage{amsthm}
\theoremstyle{plain}
\newtheorem{lemma}{Lemma}
\input{def.tex}
\begin{document}
\begin{problem}[10 pts]
For a target density~$p(x)$ and proposal
density~$q(x'\gets x)$, the Metropolis--Hastings transition
operator is given by
\begin{align*}
T(x'\gets x) &= q(x'\gets x)\,\min\left\{1,
\frac{\tilde p(x')\,q(x\gets x')}{\tilde p(x)\,q(x'\gets x)}
\right\}
\end{align*}
where $\tilde p(x)$ is the unnormalized target density.
Show that this transition operator satisfies detailed balance.
\end{problem}
\paragraph{Solution.} We need to show that for any two states $x$ and $x'$:
\begin{displaymath}
p(x)T(x'\gets x) = p(x')T(x\gets x')
\end{displaymath}
Expanding both sides, we need to show that:
\begin{displaymath}
p(x) q(x'\gets x)\,\min\left\{1,
\frac{\tilde p(x')\,q(x\gets x')}{\tilde p(x)\,q(x'\gets x)}
\right\}
= p(x')q(x\gets x')\,\min\left\{1,
\frac{\tilde p(x)\,q(x'\gets x)}{\tilde p(x')\,q(x\gets x')}
\right\}
\end{displaymath}
We distinguish two cases:
\begin{itemize}
\item $\tilde{p}(x')q(x\gets x') \leq \tilde{p}(x)q(x'\gets x)$: in that
    case the $\min$ on the left-hand side equals its second argument, and
    since $p(x) = \tilde p(x)/Z$ for the normalizing constant $Z$, the
    left-hand side collapses to:
\begin{displaymath}
    p(x)\,\frac{\tilde p(x')\,q(x\gets x')}{\tilde p(x)} = p(x')q(x\gets x')
\end{displaymath}
    But in that case the $\min$ on the right-hand side collapses to $1$
    (its first argument) and the right-hand side collapses to:
\begin{displaymath}
    p(x')q(x\gets x')
\end{displaymath}
    and we see that the two sides are equal.
\item $\tilde{p}(x')q(x\gets x') > \tilde{p}(x)q(x'\gets x)$: this case
    follows from the previous one by symmetry, swapping the roles of $x$
    and $x'$.
\end{itemize}
The two cases together conclude the proof of detailed balance.
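As an aside, the transition operator above is easy to turn into code. Below is
a minimal sketch of a Metropolis--Hastings sampler with a symmetric
(random-walk) proposal, in which case the $q$ terms in the acceptance ratio
cancel; the target \texttt{log\_p\_tilde} and the proposal scale are
illustrative assumptions, not part of the assignment.
\begin{verbatim}
import numpy as np

def metropolis_hastings(log_p_tilde, x0, n_steps, scale=1.0, rng=None):
    """Random-walk MH; the symmetric proposal makes q cancel in the ratio."""
    rng = np.random.default_rng() if rng is None else rng
    x, samples = x0, []
    for _ in range(n_steps):
        x_prop = x + scale * rng.standard_normal()       # propose x' <- x
        log_ratio = log_p_tilde(x_prop) - log_p_tilde(x)
        if np.log(rng.uniform()) < min(0.0, log_ratio):  # accept w.p. min{1, ratio}
            x = x_prop
        samples.append(x)
    return np.array(samples)

# Example: sample from an unnormalized standard Gaussian.
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_steps=5000)
\end{verbatim}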
\newpage
\section*{Modeling users and jokes with a Bayesian latent bilinear model}
The next two questions will develop Bayesian inference methods for the simplest version of the latent bilinear model you used to model joke ratings in HW3.
The data set we'll use is the same as in HW3, a modified and preprocessed variant of the Jester data set.
However, to make things easier (and to make being Bayesian more worthwhile) {\bf we'll only use the first 100,000 joke ratings} for your training set. The second 100,000 ratings will form your test set.
\subsection*{The model}
The model is the same as in HW3, but with Gaussian priors on the latent parameter matrices $U$ and $V$.
Let~${r_{i,j}\in\{1,2,3,4,5\}}$ be the rating of user $i$ on joke $j$. A latent linear model introduces a vector ${u_i\in\R^K}$ for each user and a vector~${v_j\in\R^K}$ for each joke. Then, each rating is modeled as a noisy version of the appropriate inner product. Specifically,
\[
r_{i,j} \sim \mathcal{N}(u_i^T v_j, \sigma_\epsilon^2).
\]
Fix $\sigma_\epsilon^2$ to be 1.0, and $K = 2$.
We put independent Gaussian priors on each element of $U$ and $V$:
\[U_{i,k} \sim \mathcal{N}(0, \sigma_U^2=5)\]
\[V_{i,k} \sim \mathcal{N}(0, \sigma_V^2=5)\]
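As a concrete illustration of this generative process, the following minimal
sketch samples synthetic ratings from the model (the dimensions are
illustrative assumptions, not those of the Jester data):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_users, n_jokes, K = 50, 20, 2
sigma_U2, sigma_V2, sigma_eps2 = 5.0, 5.0, 1.0

U = rng.normal(0.0, np.sqrt(sigma_U2), size=(n_users, K))  # U_{i,k} ~ N(0, 5)
V = rng.normal(0.0, np.sqrt(sigma_V2), size=(n_jokes, K))  # V_{j,k} ~ N(0, 5)
# r_{i,j} ~ N(u_i^T v_j, sigma_eps2)
R = U @ V.T + rng.normal(0.0, np.sqrt(sigma_eps2), size=(n_users, n_jokes))
\end{verbatim}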
\begin{problem}[Gibbs sampling, 30pts]
.
\begin{enumerate}
\item Write down the Gibbs update equations for $U$ and $V$. That is to say, write their conditional distributions, conditioned on all the other variables as well as the training data:
%
$$p(U | V, R )$$
$$p(V | U, R )$$
Because the model is bilinear, these updates should have fairly simple forms.
\item Write a Gibbs sampler for $U$ and $V$ that updates $U$ and $V$ in turn.
\item Run the Gibbs sampler for 100 steps (i.e. update both $U$ and $V$ 100 times).
Plot the training and test-set log-likelihood as a function of the number of steps through your training set.
That is, use all previous samples of $U, V$ to evaluate the predictive probability of all ratings.
\item One reason to be Bayesian is that you don't have to worry about overfitting.
Run your Gibbs sampler for $K = 1$ to $K = 10$, and plot the training and test-set log-likelihood for each value of $K$. How do the shapes of these curves differ from the curves you saw when doing maximum likelihood estimation in HW3?
\end{enumerate}
\end{problem}
\paragraph{Solution.} 1. Using the results on linear Gaussian systems from
Chapter 4 of Murphy, we get:
\begin{displaymath}
    u_i \given R, V \sim \mathcal{N}\left(\frac{\Sigma_1}{\sigma_\epsilon^2} V_i^T r_i,
    \Sigma_1\right)
\end{displaymath}
where $r_i$ is the vector of ratings given by user $i$, $V_i$ is the
submatrix of $V$ which only keeps the jokes (rows) rated by user $i$, and
$\Sigma_1$ is given by:
\begin{displaymath}
    \Sigma_1^{-1} = \frac{\ident}{\sigma_U^2} + \frac{V_i^T V_i}{\sigma^2_\epsilon}
\end{displaymath}
Similarly, we get that:
\begin{displaymath}
    v_j \given R, U \sim
    \mathcal{N}\left(\frac{\Sigma_2}{\sigma_\epsilon^2} U_j^T r_j,
    \Sigma_2\right)
\end{displaymath}
where $U_j$ is the submatrix of $U$ which only keeps the rows (users) that
have rated joke $j$, $r_j$ is the vector of ratings given to joke $j$, and
$\Sigma_2$ is given by:
\begin{displaymath}
    \Sigma_2^{-1} = \frac{\ident}{\sigma_V^2} + \frac{U_j^T U_j}{\sigma^2_\epsilon}
\end{displaymath}
2. Please see the code in appendix \textsf{2.py}. I used two copies of the
ratings matrix, one in CSC format and one in CSR format, in order to quickly
extract the ratings $r_i$ of a given user and the ratings $r_j$ of a given
joke, as well as the indices of these ratings, which are used to build the
submatrices $V_i$ and $U_j$ appearing in the formulas above.
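As an illustration of these updates, here is a minimal sketch of one Gibbs
update of $u_i$, assuming the ratings are stored in a \texttt{scipy.sparse}
CSR matrix; the variable names are illustrative and do not match the code in
the appendix. The update of $v_j$ is symmetric, using the CSC copy to extract
columns instead of rows.
\begin{verbatim}
import numpy as np

def gibbs_update_user(i, R_csr, V, sigma_U2=5.0, sigma_eps2=1.0, rng=None):
    """Sample u_i | R, V ~ N(Sigma1 V_i^T r_i / sigma_eps2, Sigma1)."""
    rng = np.random.default_rng() if rng is None else rng
    row = R_csr.getrow(i)
    jokes = row.indices              # jokes rated by user i
    r_i = row.data                   # their ratings r_i
    V_i = V[jokes]                   # rows of V for those jokes
    K = V.shape[1]
    prec = np.eye(K) / sigma_U2 + V_i.T @ V_i / sigma_eps2   # Sigma1^{-1}
    Sigma1 = np.linalg.inv(prec)
    mean = Sigma1 @ (V_i.T @ r_i) / sigma_eps2
    return rng.multivariate_normal(mean, Sigma1)
\end{verbatim}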
3. The plot of the log-likelihood (up to an additive constant) can be found
below:
\begin{center}
\includegraphics[width=0.8\textwidth]{2.pdf}
\end{center}
4. For clarity I decided to plot the MSE (mean squared error) on a log scale
instead of the log-likelihood; otherwise the lines were hard to distinguish.
Note that, because the model is Gaussian with a fixed noise variance, the MSE
is simply an affine function of the negative log-likelihood.
The plot of the training MSE for different values of $K$ can be found below:
\begin{center}
\includegraphics[width=0.7\textwidth]{train.pdf}
\end{center}
And for the test MSE:
\begin{center}
\includegraphics[width=0.7\textwidth]{test.pdf}
\end{center}
As we can see, the MSE is smaller on the training dataset, which is to be
expected. However, unlike in the previous assignment, the test MSE keeps
improving as $K$ increases. We do not observe the overfitting phenomenon where
the training MSE decreases with $K$ but the test MSE starts increasing again
for large values of $K$.
\begin{problem}[Stochastic Variational Inference, 30pts]
Now we'll repeat the same experiments, but using stochastic variational inference instead.
Recall that variational inference optimizes a lower bound on the log marginal likelihood (integrating out parameters $\theta$), like so:
\begin{align}
\log p(x) & = \log \int p(x, \theta) d\theta = \log \int p(x|\theta) p(\theta) d\theta \\
& = \log \int \frac{q_\phi(\theta)}{q_\phi(\theta)} p(x|\theta) p(\theta) d\theta
= \log \mathbb{E}_{q_\phi} \left[ \frac{p(x|\theta)\, p(\theta)}{q_\phi(\theta)} \right] \\
& \geq \mathbb{E}_{q_\phi} \log \left[ \frac{1}{q_\phi(\theta)} p(x|\theta) p(\theta) \right]
= \underbrace{-\mathbb{E}_{q_\phi} \log q_\phi(\theta)}_{\textnormal{entropy}} + \underbrace{\mathbb{E}_{q_\phi} \log p(\theta)}_{\textnormal{prior}} + \underbrace{\mathbb{E}_{q_\phi} \log p(x|\theta)}_{\textnormal{likelihood}}
= \mathcal{L}(\phi)
\end{align}
%
In this case, $\theta = U,V$ and $x = R$:
%
\begin{align}
\mathcal{L}(\phi) = -\mathbb{E}_{q_\phi} \log q_\phi(U, V) + \mathbb{E}_{q_\phi} \log p(U, V) + \sum_{n=1}^N \mathbb{E}_{q_\phi} \log p(r_n | U, V)
\end{align}
%
The nice part about writing the lower bound as a set of expectations is that if we can sample from $q_\phi(\theta)$, then we can estimate the bound (and its gradient) by simple Monte Carlo.
That is, we can sample from $q(\theta)$, and estimate the bound or its gradients by evaluating at the samples and averaging them.
This is a very general formula that works for many different priors, likelihoods and variational approximations.
In this case, we'll keep things simple and choose $q(U,V)$ to be a Gaussian factorized over every single entry of each matrix (the same form as the prior).
Thus our variational parameters will consist of a mean and variance for each entry in U and V.
% $\phi^{(\mu U)}_{ik}$, $\phi^{(\sigma^2 U)}_{ik}$, $\phi^{(\mu V)}_{jk}$, and $\phi^{(\sigma^2 V)}_{jk}$.
To make our method scalable, we have to update the variational parameters without passing over the whole dataset each time.
We can do this by taking Monte Carlo estimates over ratings.
However, we need to scale our estimates so that their expectations are the same as if we did the sum exactly.
\begin{enumerate}
\item
Write an unbiased Monte Carlo estimate of the lower bound, depending only on one rating at a time.
Write the entropy and prior terms exactly, without using Monte Carlo. Hint: You can combine them into a KL divergence between two Gaussians.
\item Write the gradient of your estimate of the lower bound w.r.t. the variational parameters.
Hint: To capture all the ways in which the gradient depends on the parameters, if your gradient has an expectation of the form $\mathbb{E}_{X \sim \mathcal{N}(\mu, \sigma)}[f(X)]$, re-write it in terms of $\mathbb{E}_{Z \sim \mathcal{N}(0, 1)}[f(Z \sigma + \mu)]$.
\item For $K = 2$, optimize the variational parameters for 10 epochs over the first 100,000 datapoints. Use whichever optimization method you prefer.
Plot the training and test-set log-likelihood as a function of the number of epochs, as well as the marginal likelihood lower bound.
That is to say: at the end of each epoch, evaluate the log of the average predictive probability of all ratings in the training and test sets using 100 samples from $q(U,V)$.
The lower bound is the sum of entropy, prior and likelihood terms, while the training-set and test-set likelihoods only use the likelihood term.
\item Fit your variational model for $K = 1$ to $K = 10$, and plot the training-set log-likelihood, test-set log-likelihood, and lower bound for each value of $K$.
How do the shapes of these curves differ?
\item What are the pros and cons of Gibbs sampling versus stochastic variational inference?
\end{enumerate}
\end{problem}
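Although not part of the required solution, the estimator described in the
problem can be sketched compactly. The entropy and prior terms combine into a
closed-form KL divergence: for each entry with variational parameters
$(\mu, \sigma^2)$ and prior variance $\sigma_0^2$,
\begin{displaymath}
\mathrm{KL}\left(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(0, \sigma_0^2)\right)
= \log\frac{\sigma_0}{\sigma} + \frac{\sigma^2 + \mu^2}{2\sigma_0^2} - \frac{1}{2}.
\end{displaymath}
The likelihood term can be estimated from a single rating using the
reparameterization trick, rescaled by $N$ so the estimate is unbiased for the
full sum. The following sketch assumes per-row arrays of variational means and
standard deviations; all names are illustrative.
\begin{verbatim}
import numpy as np

def likelihood_term_hat(i, j, r_ij, mu_U, s_U, mu_V, s_V,
                        N, sigma_eps2=1.0, rng=None):
    """Unbiased one-rating estimate of E_q[log p(R | U, V)]."""
    rng = np.random.default_rng() if rng is None else rng
    # Reparameterize: u_i = mu + sigma * z with z ~ N(0, I), same for v_j.
    u_i = mu_U[i] + s_U[i] * rng.standard_normal(mu_U.shape[1])
    v_j = mu_V[j] + s_V[j] * rng.standard_normal(mu_V.shape[1])
    log_lik = -0.5 * (np.log(2 * np.pi * sigma_eps2)
                      + (r_ij - u_i @ v_j) ** 2 / sigma_eps2)
    return N * log_lik   # rescaling by N keeps the estimate unbiased
\end{verbatim}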
\appendix
\section{\textsf{2.py}}
\input{2.tex}
\end{document}