% Use the 'submit' option when you submit your solutions.
\documentclass[submit]{harvardml}
% Put in your full name and email address.
\name{Thibaut Horel}
\email{thorel@seas.harvard.edu}
% List any people you worked with.
\collaborators{%
}
% You don't need to change these.
\course{CS281-F15}
\assignment{Assignment \#3}
\duedate{11:59pm October 28, 2015}
\usepackage{url, enumitem}
\usepackage{amsfonts}
\usepackage{listings}
\usepackage{graphicx}
\usepackage{bm}
% Some useful macros.
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\prob}{\mathbb{P}}
\newcommand{\given}{\,|\,}
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\var}{\text{var}}
\newcommand{\cov}{\text{cov}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\ep}{\varepsilon}
\theoremstyle{plain}
\newtheorem{lemma}{Lemma}
\begin{document}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{problem}[An orthogonal view of PCA, 20pts]
In lecture we showed that PCA maximizes the variance explained by the top $r$ principal components for every integer $r$. In this problem you will prove an equivalent characterization of PCA as an optimal low-rank approximation of the data. One reason we might seek such a representation is if we have a very high-dimensional data set to which we'd like to apply an expensive learning procedure. In this case, a lower-dimensional approximation of our data --- if it's sufficiently good --- might let us do our learning much faster by allowing us to operate in the lower-dimensional space.
Let's begin. Suppose we are given a data matrix $X \in \R^{n \times m}$ that is centered, i.e., the mean of every column is 0. Given some set of orthonormal vectors $\{ \pi_1, \ldots, \pi_r\} \subset \R^m$, let $\Pi \in \R^{m \times r}$ be the matrix whose columns are the $\pi_i$. We seek to represent observations as linear combinations of the vectors $\pi_d$. This won't be possible exactly because $r < m$, so instead we ask which set of vectors will give us the best approximation with respect to the $L^2$ norm.
\begin{enumerate}[label=(\alph*)]
\item
Show that if $x$ is an $m$-dimensional row vector then $x \Pi \Pi^T = \sum_{d=1}^r \langle x, \pi_d \rangle \pi_d^T$, so that $\Pi \Pi^T$ projects onto the span of the vectors $\pi_d^T$.
\item
Argue that the rank of $X \Pi \Pi^T$ is at most $r$.
\item
Suppose that $r=1$ so that $\Pi$ contains just one column vector $\pi_1$, and consider the matrix $X \Pi\Pi^T = X \pi_1 \pi_1^T$. Prove that the reconstruction error
\[
\| X \pi_1\pi_1^T - X \|_2^2
\]
is minimized if and only if
\[
\frac{1}{n} \sum_{i=1}^n (x_i \cdot \pi_1)^2 = \frac{1}{n} \| X \pi_1 \|_2^2
\]
is maximized, where $x_i$ is a row-vector containing the $i$-th row of $X$ and $\| \cdot \|_2$ is the vector/matrix 2-norm.
\item
Argue that the previous part implies that the reconstruction error is minimized if and only if the variance of the data in the $\pi_1$ direction is maximized. That is, if $X^*$ denotes a random observation from $X$ then an optimal $\pi_1$ maximizes $\var(X^* \cdot \pi_1)$. Conclude that when $r=1$ an optimal choice for $\pi_1$ is the top principal component of $X$.
\item
What if $r > 1$? Given a vector $x \in \R^m$, use the fact that $\Pi \Pi^T$ is a projection (part a), as well as the previous part, to show similarly to part c that the reconstruction error
\[
\| X \Pi \Pi^T - X \|_2^2
\]
is minimized if and only if the sum of the variances along the directions $\pi_d$,
\[
\sum_{d=1}^r \var(X^* \cdot \pi_d)
\]
is maximized. Conclude that in general the top $r$ principal components give us an optimal rank-$r$ approximation of the data matrix $X$.
\end{enumerate}
\end{problem}
\paragraph{Solution.} (a) By definition of the matrix product, $\Pi\Pi^T
= \sum_{d=1}^r \pi_d\pi_d^T$. Hence, $x\Pi\Pi^T = \sum_{d=1}^r x\pi_d\pi_d^T$.
Since $x$ is a row-vector and $\pi_d$ is a column vector, $x\pi_d = \langle x,
\pi_d \rangle$, and we obtain the required form for $x\Pi\Pi^T$.
Furthermore, it is easy to see that $\Pi^T\Pi$ (the Gramian matrix of the
$\pi_d$ vectors) is equal to $I_r$ since the vectors $\pi_d$ are orthonormal.
Hence $(\Pi\Pi^T)^2 = (\Pi\Pi^T)$, which shows that $\Pi\Pi^T$ is a projection.
This form shows that $\Pi\Pi^T$ projects $x$ onto the span of the
vectors $\pi_d^T$. The coordinate along vector $\pi_d^T$ is given by $\langle
x, \pi_d\rangle$.
(b) Since $\Pi\Pi^T$ projects onto the span $S$ of the vectors $\pi_d^T$, every
row of $X\Pi\Pi^T$ lies in $S$. The subspace $S$ has dimension exactly $r$ (the
vectors $\pi_d$ form an orthonormal basis of it), so the row space of
$X\Pi\Pi^T$ has dimension at most $r$. That is, the rank of $X\Pi\Pi^T$ is at
most $r$.
(c) For matrices $X$ and $Y$ of compatible dimensions, by definition of the
matrix product, the $i$th row of matrix $XY$ is equal to $x_i Y$ where $x_i$ is
the $i$th row of $X$.
In particular:
\begin{displaymath}
\|X\pi_1\pi_1^T - X\|_2^2 = \sum_{i=1}^n \|x_i\pi_1\pi_1^T - x_i\|_2^2
\end{displaymath}
Using the identity $\|a-b\|^2 = \|a\|^2 + \|b\|^2 - 2\langle a, b\rangle$, we
get:
\begin{displaymath}
\|X\pi_1\pi_1^T - X\|_2^2 = \sum_{i=1}^n\big( \|x_i\pi_1\pi_1^T\|_2^2
+ \|x_i\|_2^2 - 2x_i\pi_1\pi_1^Tx_i^T\big)
\end{displaymath}
We also note that:
\begin{displaymath}
\|x_i\pi_1\pi_1^T\|_2^2 = x_i\pi_1\pi_1^T\pi_1\pi_1^Tx_i^T
= x_i\pi_1\pi_1^Tx_i^T
\end{displaymath}
where the last equality uses that $\pi_1^T\pi_1 = 1$. Plugging this into the
equality above, we obtain:
\begin{displaymath}
\|X\pi_1\pi_1^T - X\|_2^2 = \sum_{i=1}^n\big(\|x_i\|_2^2 - x_i\pi_1\pi_1^Tx_i^T\big)
\end{displaymath}
Since the first term in the sum does not depend on $\pi_1$, minimizing the
reconstruction error is equivalent to maximizing:
\begin{displaymath}
\sum_{i=1}^n x_i\pi_1\pi_1^Tx_i^T = \sum_{i=1}^n (x_i\cdot \pi_1)^2 =
\|X\pi_1\|^2
\end{displaymath}
Finally, multiplying by $\frac{1}{n}$ does not change the maximization problem.
(d) Let $X^*$ be a random observation from $X$, that is, a row of $X$ chosen
uniformly at random. First:
\begin{displaymath}
\E(X^*\cdot\pi_1) = \frac{1}{n}\sum_{i=1}^n x_i\cdot\pi_1
= \left(\frac{1}{n}\sum_{i=1}^n x_i\right)\cdot\pi_1
= 0\cdot\pi_1 = 0
\end{displaymath}
where the penultimate equality uses the fact that the columns of $X$ sum to
zero (the data are centered). Hence:
\begin{displaymath}
\var(X^*\cdot\pi_1) = \E\big((X^*\cdot\pi_1)^2\big)
= \frac{1}{n}\sum_{i=1}^n (x_i\cdot\pi_1)^2
\end{displaymath}
and we obtain exactly the quantity computed in (c). This shows that choosing
a direction $\pi_1$ of maximum variance for $X$ is equivalent to choosing
$\pi_1$ to minimize the reconstruction error. So for $r=1$, the optimal choice
for $\pi_1$ (to minimize the reconstruction error) is the top principal
component of $X$.
(e) Using part (a), one can write similarly to (c):
\begin{displaymath}
\|X\Pi\Pi^T - X\|_2^2 = \sum_{i=1}^n \|x_i\Pi\Pi^T - x_i\|_2^2
= \sum_{i=1}^n \|\sum_{d=1}^r(x_i\cdot\pi_d)\pi_d^T - x_i\|_2^2
\end{displaymath}
Expanding the squares and using that the $\pi_d$ vectors are orthonormal:
\begin{displaymath}
\|\sum_{d=1}^r(x_i\cdot\pi_d)\pi_d^T - x_i\|_2^2
= \sum_{d=1}^r(x_i\cdot\pi_d)^2 + \|x_i\|_2^2
- 2\sum_{d=1}^r (x_i\cdot\pi_d)^2
= -\sum_{d=1}^r(x_i\cdot\pi_d)^2 + \|x_i\|_2^2
\end{displaymath}
Hence minimizing the reconstruction error is equivalent to maximizing:
\begin{displaymath}
\frac{1}{n}\sum_{i=1}^n\sum_{d=1}^r(x_i\cdot\pi_d)^2
\end{displaymath}
On the other hand, similarly to (d), the sum of the variances along the
projection directions,
\begin{displaymath}
\sum_{d=1}^r\var(X^*\cdot\pi_d) = \sum_{d=1}^r\frac{1}{n}\sum_{i=1}^n(x_i\cdot\pi_d)^2
\end{displaymath}
is exactly the same quantity as above after swapping the two sums.
Hence minimizing the reconstruction error is equivalent to maximizing the sum
of the variances. We conclude that for general $r$, the top $r$ principal
components give an optimal rank-$r$ approximation of the data matrix $X$.
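As a quick numerical sanity check of this equivalence, the identity $\|X\Pi\Pi^T - X\|_2^2 = \|X\|_2^2 - \|X\Pi\|_2^2$ derived above can be verified directly; the sketch below uses plain NumPy on arbitrary synthetic data and takes $\Pi$ to be the top $r$ right singular vectors of $X$ (names and sizes are purely illustrative).
\begin{lstlisting}[language=Python]
# Numerical check: reconstruction error = ||X||^2 - ||X Pi||^2.
import numpy as np

rng = np.random.RandomState(0)
n, m, r = 200, 10, 3
X = rng.randn(n, m)
X -= X.mean(axis=0)                   # center the columns

# Top-r principal directions = top-r right singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Pi = Vt[:r].T                         # m x r matrix with orthonormal columns

recon_error = np.linalg.norm(X @ Pi @ Pi.T - X) ** 2
explained = np.linalg.norm(X @ Pi) ** 2   # n times the summed variances
assert np.isclose(recon_error, np.linalg.norm(X) ** 2 - explained)
\end{lstlisting}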
\begin{problem}[Modeling users and jokes with a latent linear model, 25pts]
In this problem, we'll use a latent linear model to jointly model the ratings
users assign to jokes. The data set we'll use is a modified and preprocessed
variant of the Jester data set (version 2) from
\url{http://eigentaste.berkeley.edu/dataset/}. The data we provide you with are user
ratings of 150 jokes. There are over 1.7M ratings with values 1, 2, 3, 4 and 5,
from about sixty thousand users. {\bf The data we give you is a modified version of the original Jester data set;
please use the files we provide on Canvas and not the original ones}. The texts of the jokes are also available.
Warning: most of the jokes are bad, tasteless, or both. At best, they were
funny to late-night TV hosts in 1999-2000. Note also that some of the jokes do
not have any ratings and so can be ignored.
\begin{enumerate}[label=(\alph*)]
\item
Let~${r_{i,j}\in\{1,2,3,4,5\}}$ be the rating of user $i$ on joke $j$. A latent linear model introduces a vector ${u_i\in\R^K}$ for each user and a vector~${v_j\in\R^K}$ for each joke. Then, each rating is modeled as a noisy version of the appropriate inner product. Specifically,
\[
r_{i,j} \sim \mathcal{N}(u_i^T v_j, \sigma^2).
\]
so that the log-likelihood is
\[
\log P(R) = \sum_{i,j} \left( -\frac{1}{2\sigma^2} \left( r_{i,j} - u_i^T v_j \right)^2 - \log \sigma - \frac{1}{2} \log(2\pi) \right)
\]
where the sum runs over all $i,j$ for which a rating exists.
What are the gradients of this log-likelihood with respect to $u_i$ and $v_j$?
\item
Randomly choose 100K ratings to hold out as a validation set. Now set $K = 2$,
$\sigma^2=1.0$ and initialize the components of $u_i$ and $v_j$ by sampling from
a Gaussian distribution with zero mean and standard deviation $0.01$. After
this, adjust the vectors $u_i$ and $v_j$ on your training set using stochastic
gradient descent without minibatches: iterate over the training ratings
updating $u_i$ and $v_j$ for each $r_{ij}$ that is available. Use $\lambda =
0.05$ as the learning rate and iterate 10 times (epochs) over the training
data. Note that the maximum likelihood estimate of $\sigma^2$ is just the mean
squared error of your predictions on the training set. Report your MLE of
$\sigma^2$.
\item Evaluate different choices of $K$ on the validation set. Evaluate $K = 1,
\ldots, 10$ and produce a plot that shows the root-mean-squared error on both
the training set and the validation set for each trial and for each $K$. What
seems like a good value for~$K$?
\item We might imagine that some jokes are just better or worse than others.
We might also imagine that some users tend to have higher or lower means in
their ratings. In this case, we can introduce biases into the model so that
$r_{ij} \approx u_i^\text{T} v_j + a_i + b_j + g$, where $a_i$, $b_j$ and $g$ are user,
joke and global biases, respectively. Change the model to incorporate these
biases and fit it again with $K=2$, learning these new biases as well. Write
down the likelihood that you are optimizing. One side-effect is that you should
be able to rank the jokes from best to worst. What are the best and worst jokes
and their respective biases? What is the value of the global bias?
\item Sometimes we have users or jokes that only have a few ratings. We don't
want to overfit with these and so we might want to put priors on them. What are
reasonable priors for the latent features and the biases? Draw a directed
graphical model that shows all of these variables and their relationships.
Note that you are not required to code this new model up; just discuss
reasonable priors and draw the graphical model.
\end{enumerate}
\end{problem}
\paragraph{Solution.} The code for this problem can be found in file \textsf{2.py}.
(a) The gradient with respect to $u_i$ is:
\begin{displaymath}
\frac{1}{\sigma^2}\sum_{j}(r_{i,j} - u_i^Tv_j)v_j
\end{displaymath}
Similarly, the gradient with respect to $v_j$ is:
\begin{displaymath}
\frac{1}{\sigma^2}\sum_{i}(r_{i,j} - u_i^Tv_j)u_i
\end{displaymath}
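For concreteness, here is a minimal sketch of the per-rating SGD updates these gradients suggest, with $\sigma^2 = 1$ absorbed into the learning rate as in part (b); the names \textsf{U}, \textsf{V} and \textsf{ratings} are illustrative and not necessarily those used in \textsf{2.py}.
\begin{lstlisting}[language=Python]
import numpy as np

def sgd_epoch(U, V, ratings, lam=0.05):
    """One pass over the training triples (i, j, r_ij)."""
    for i, j, r in ratings:
        err = r - U[i] @ V[j]     # r_ij - u_i^T v_j
        ui = U[i].copy()          # keep the old u_i for the v_j update
        U[i] += lam * err * V[j]
        V[j] += lam * err * ui
\end{lstlisting}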
(b) The MLE obtained for $\sigma^2$ is approximately $1.106$.
(c) The plot of the RMSE for different values of $K$ can be found below:
\begin{center}
\includegraphics[width=0.8\textwidth]{rmse.pdf}
\end{center}
It seems that a good value for $K$ is 2 or 3. For these values, the RMSE on the
validation set is minimal. For larger values of $K$, the RMSE decreases on the training
set but becomes worse on the validation set: we are overfitting.
(d) The new log-likelihood is:
\begin{displaymath}
\log P(R) = \sum_{i,j}
\left( -\frac{1}{2\sigma^2} \left( r_{i,j} - u_i^T v_j - a_i - b_j - g\right)^2
- \log \sigma - \frac{1}{2} \log(2\pi) \right)
\end{displaymath}
After fitting this new model, we obtain an MSE of 0.932 on the training set and
1.111 on the validation set. The global bias is 1.4429.
By ordering jokes by their bias, we obtain the best joke (72) with a bias of 1.76:
\begin{center}
\emph{on the first day of college the dean addressed the students pointing
out some of the rules the female dormitory will be out of bounds for
all male students and the male dormitory to the female students anybody
caught breaking this rule will be finded the first time he continued
anybody caught breaking this rule the second time will be fined being
caught a third time will cost you a fine of are there any questions at
this point a male student in the crowd inquired how much for a season
pass}
\end{center}
and the worst joke (7) with a bias of -0.18:
\begin{center}
\emph{how many feminists does it take to screw in a light bulb thats not
funny}
\end{center}
(e) A zero-mean Gaussian prior on the coordinates of the latent vectors, for
both the users and the jokes, seems reasonable: it prevents the parameters from
becoming too large. Another reasonable choice would be a Laplace prior, which
encourages sparse latent vectors.
For the user biases, the joke biases and the global bias, we can use a uniform
prior over $[1, 5]$.
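For concreteness, with zero-mean Gaussian priors $u_i \sim \N(0, \tau^2 I)$ and $v_j \sim \N(0, \tau^2 I)$ on the latent vectors (where $\tau^2$ is a prior variance to be chosen), the log-posterior would simply add quadratic penalties to the log-likelihood of part (d):
\begin{displaymath}
\log p(U, V, a, b, g \given R) = \log P(R)
- \frac{1}{2\tau^2}\sum_i \|u_i\|_2^2
- \frac{1}{2\tau^2}\sum_j \|v_j\|_2^2
+ \sum_i \log p(a_i) + \sum_j \log p(b_j) + \log p(g) + \text{const.}
\end{displaymath}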
The graphical model looks like this in nested plate notation:
\begin{center}
\includegraphics[width=0.5\textwidth]{model.pdf}
\end{center}
\section*{Ordinal Regression}
We now address the problem of predicting joke ratings given the text of the
joke. The previous models assumed that the ratings where continuous real
numbers, while they are actually integers from 1 to 5. To take this into
account, we will use an ordinal regression model.
Let the rating values
be $r = 1,\ldots,R$. In the ordinal regression model the real line is
partitioned into $R$ contiguous intervals with boundaries
\begin{align}
-\infty = b_1 < b_2 < \ldots < b_{R+1} = +\infty\,,
\end{align}
such that the interval $[b_r,b_{r+1})$ corresponds to the $r$-th rating value.
We will assume that $b_2 = -2$, $b_3 = 0$, $b_4 = 2$ and $b_5 = 4$.
Instead of directly modeling the ratings, we will be modeling them in terms of a
hidden variable $f$. We have that the rating $r$ is observed
if $f$ falls in the interval for that rating. The conditional probability of
$r$ given $f$ is then
\begin{align}
p(r|f) =
\left\{
\begin{array}{ll}
1 & \mbox{if } b_r \leq f < b_{r+1} \\
0 & \mbox{otherwise}
\end{array}
\right.
= \Theta(f - b_r) - \Theta(f-b_{r+1})\,,
\end{align}
where $\Theta(x)$ is the Heaviside step function, that is, $\Theta(x)=1$ if
$x>0$ and $\Theta(x)=0$ otherwise. Uncertainty about the exact value of $f$ can
be modeled by adding Gaussian noise to $f$. Let $\sigma^2$ be the variance
of this noise. Then $p(f|h) = \mathcal{N}(f|h,\sigma^2)$, where $h$ is a new latent
variable that is the noise-free version of $f$.
Given some features $\mathbf{x}$ for a particular joke, we can then combine the
previous likelihood with a linear model to make predictions for the possible
rating values for the joke. In particular, we can assume that the noise-free
rating value $h$ is a linear function of $\mathbf{x}$, that is
$h(\mathbf{x})=\mathbf{w}^\text{T} \mathbf{x}$, where $\mathbf{w}$ is a vector
of regression coefficients. We will assume that the prior for $\mathbf{w}$ is
$p(\mathbf{w})=\mathcal{N}(\bm 0, \mathbf{I})$.
The problem now is what feature representation to use for the jokes. We propose
here that you compute binary features that indicate the presence or absence of
the 150 most common words across all the jokes plus a bias term that is always
equal to one. The file \texttt{jester\_items.clean.dat} on Canvas has the joke texts with
punctuation and case removed. To turn the texts into the proposed binary features you can use the
Python script \texttt{generate\_binary\_features.py}, which is also available on Canvas.
\subsection*{Stochastic optimization with Adam}
Stochastic optimization of cost functions for models with many parameters can
be difficult because each parameter may require a different learning
rate. Adam (\url{http://arxiv.org/pdf/1412.6980v8.pdf}) is a method that can be
used to avoid this problem. Adam computes individual adaptive learning rates
for different parameters from estimates of the first and second moments of the
gradients. Algorithm 1 of that paper gives the pseudo-code; you are encouraged
to look at the PDF for more details.
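For reference, here is a minimal sketch of the Adam update with its default hyper-parameters ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$); \textsf{grad\_fn} is a hypothetical function returning the gradient of the minibatch objective at \textsf{theta}.
\begin{lstlisting}[language=Python]
import numpy as np

def adam(grad_fn, theta0, batches, alpha=0.001,
         beta1=0.9, beta2=0.999, eps=1e-8):
    theta = theta0.copy()
    m = np.zeros_like(theta)          # first-moment estimate
    v = np.zeros_like(theta)          # second-moment estimate
    t = 0
    for batch in batches:
        t += 1
        g = grad_fn(theta, batch)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta
\end{lstlisting}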
\subsection*{Sparse matrices}
A collection of ratings given by a set of users on a set of items can be
efficiently manipulated by storing them in a sparse matrix format. The non-zero
entries in the matrix would contain the observed rating values, while the zero
entries encode missing ratings. The package \texttt{scipy.sparse} provides
the following sparse matrix types:
{\footnotesize
\begin{verbatim}
bsr_matrix Block Sparse Row matrix
coo_matrix A sparse matrix in COOrdinate format.
csc_matrix Compressed Sparse Column matrix
csr_matrix Compressed Sparse Row matrix
dia_matrix Sparse matrix with DIAgonal storage
dok_matrix Dictionary Of Keys based sparse matrix.
lil_matrix Row-based linked list sparse matrix
\end{verbatim}
}
Each sparse matrix type presents different computational advantages for
accessing and manipulating the data depending on whether the operations are
performed by rows or by columns or in an arbitrary way. If you encode the
ratings in a matrix with users as rows and jokes as columns, the
\texttt{csc\_matrix} type can be useful to quickly identify the ratings that
are available for any particular joke. The \texttt{lil\_matrix} type is useful
if we want to incrementally construct a sparse matrix by adding new blocks of
non-zero entries.
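A small usage sketch (sizes and rating triples below are purely illustrative): build the matrix incrementally as a \texttt{lil\_matrix}, then convert it to a \texttt{csc\_matrix} to read off the available ratings of a given joke from its column.
\begin{lstlisting}[language=Python]
import scipy.sparse as sp

n_users, n_jokes = 60000, 150                      # illustrative sizes
R = sp.lil_matrix((n_users, n_jokes))
for i, j, r in [(0, 3, 5), (1, 3, 2), (2, 7, 4)]:  # toy (user, joke, rating)
    R[i, j] = r

R = R.tocsc()                  # fast column (i.e., per-joke) access
col = R.getcol(3)              # sparse column holding the ratings of joke 3
rated_users = col.nonzero()[0]
rating_values = col.data
\end{lstlisting}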
\begin{problem}[Ordinal linear regression, 25pts] You are required to
\begin{enumerate}
\vspace{-0.1cm}
\item Explain what the form of $p(r|h)$ is in the ordinal regression model.
\vspace{-0.1cm}
\item Explain what the mean of the predictive distribution is in the ordinal regression model.
\vspace{-0.1cm}
\item Code a function that splits the ratings for each joke into a training set
with 95\% of the ratings and a test set with the remaining 5\%. After this,
each joke should have 95\% of its ratings in the training set and 5\% of them
in the test set. For this, it can be useful to work with sparse matrices of dimension $n\times
m$ where $n$ is the number of users and $m$ is the number of jokes and the
non-zero entries in the matrix are the observed rating values.
\vspace{-0.1cm}
\item Code a function that computes the log posterior distribution of the ordinal
regression model up to a normalization constant. Code a function that computes
the gradients of the previous function with respect to $\mathbf{w}$ and
$\sigma^2$. You are encouraged to use automatic differentiation tools such as autograd
(\url{https://github.com/HIPS/autograd}).
\vspace{-0.1cm}
\item Code a function that finds the MAP solution for $\mathbf{w}$ and
$\sigma^2$ in the ordinal regression model given the available training data using mini-batch stochastic
gradient descent. Use minibatches of size 100 and Adam with its default
parameter values for 100 epochs (iterations over the training data). Report the
average test RMSE.
\vspace{-0.1cm}
\item Modify the previous model to have a Gaussian likelihood, that is, $p(r|h)=\mathcal{N}(r|h,\sigma^2)$.
Report the resulting average test RMSE of the new model. Does performance
increase or decrease? Why?
\vspace{-0.1cm}
\item How does the performance of the models analyzed in this problem compare to
the performance of the model from problem 2? Which model performs best? Why?
\end{enumerate}
\end{problem}
\paragraph{Solution.} The code for this problem can be found in file
\textsf{3.py}.
(1) Let us denote by $\Phi$ the CDF of a standard Gaussian variable (mean 0 and
variance 1). Then:
\begin{displaymath}
p(r|h) = \Phi\left(\frac{b_{r+1} - h}{\sigma}\right) -
\Phi\left(\frac{b_{r} - h}{\sigma}\right)
\end{displaymath}
which is the probability that $f\sim \mathcal{N}(h,\sigma^2)$ falls in the
interval $[b_r, b_{r+1})$.
(2) In the ordinal regression model, $h = h(x) = w^Tx$. So for a given weight
vector $w$ and a joke with features $x$, the mean of the predictive
distribution is:
\begin{displaymath}
\mu(w, x, \sigma) = \sum_{r=1}^5 r\left(
\Phi\left(\frac{b_{r+1} - w^Tx}{\sigma}\right) -
\Phi\left(\frac{b_{r} - w^Tx}{\sigma}\right)
\right)
\end{displaymath}
It is not entirely clear to me what was being asked here. We could also compute
the marginal predictive distribution by integrating the above quantity against
the prior $p(w)$; however, after reading Piazza, I believe the quantity above
is what was expected.
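As an illustration, both quantities take only a few lines to compute; the sketch below uses SciPy's Gaussian CDF and the boundaries from the problem statement, with $b_1 = -\infty$ and $b_6 = +\infty$.
\begin{lstlisting}[language=Python]
import numpy as np
from scipy.stats import norm

b = np.array([-np.inf, -2.0, 0.0, 2.0, 4.0, np.inf])  # b_1, ..., b_6

def rating_probs(h, sigma):
    """p(r|h) for r = 1..5: mass of N(h, sigma^2) in [b_r, b_{r+1})."""
    cdf = norm.cdf((b - h) / sigma)
    return cdf[1:] - cdf[:-1]

def predictive_mean(w, x, sigma):
    """Mean of the predictive distribution for a joke with features x."""
    h = w @ x
    return np.sum(np.arange(1, 6) * rating_probs(h, sigma))
\end{lstlisting}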
(3) See function \textsf{build\_train\_test} in \textsf{3.py}.
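A possible way to implement such a per-joke 95\%/5\% split with sparse matrices is sketched below (this is only one option, not necessarily how \textsf{build\_train\_test} proceeds).
\begin{lstlisting}[language=Python]
import numpy as np
import scipy.sparse as sp

def split_ratings(R, test_frac=0.05, seed=0):
    """Per-joke split of a (users x jokes) sparse rating matrix."""
    rng = np.random.RandomState(seed)
    R = R.tocsc()                                  # one column per joke
    train, test = sp.lil_matrix(R.shape), sp.lil_matrix(R.shape)
    for j in range(R.shape[1]):
        users = R.getcol(j).nonzero()[0]           # users who rated joke j
        rng.shuffle(users)
        n_test = int(test_frac * len(users))
        for i in users[:n_test]:
            test[i, j] = R[i, j]
        for i in users[n_test:]:
            train[i, j] = R[i, j]
    return train.tocsr(), test.tocsr()
\end{lstlisting}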
(4) See functions \textsf{log\_posterior} and \textsf{grad\_log\_posterior} in
\textsf{3.py}.
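A sketch of how autograd could be used for this step is given below (the names are illustrative, not those of \textsf{3.py}); it assumes autograd's wrapped \texttt{scipy.stats.norm.cdf} and parametrizes the noise through $\log\sigma^2$ so that the variance stays positive.
\begin{lstlisting}[language=Python]
import autograd.numpy as np
from autograd import grad
from autograd.scipy.stats import norm

b = np.array([-np.inf, -2.0, 0.0, 2.0, 4.0, np.inf])   # rating boundaries

def log_posterior(params, X_batch, r_batch, n_total):
    w, log_sigma2 = params[:-1], params[-1]
    sigma = np.exp(0.5 * log_sigma2)
    h = np.dot(X_batch, w)
    upper = norm.cdf((b[r_batch] - h) / sigma)      # Phi((b_{r+1} - h)/sigma)
    lower = norm.cdf((b[r_batch - 1] - h) / sigma)  # Phi((b_r - h)/sigma)
    log_lik = np.sum(np.log(upper - lower + 1e-12))
    log_prior = -0.5 * np.dot(w, w)                 # w ~ N(0, I)
    # rescale the minibatch likelihood to the size of the full training set
    return (n_total / len(r_batch)) * log_lik + log_prior

grad_log_posterior = grad(log_posterior)            # gradient w.r.t. params
\end{lstlisting}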
(5) See function \textsf{sgd} in \textsf{3.py}. I obtained an average test RMSE
of 1.2871.
(6) See function \textsf{log\_posterior\_bis}. This function is simpler to
write, because the likelihood part of the log-posterior is the likelihood of
Gaussian variables, which takes the standard form of the squared error on the
training dataset (like in linear regression). The regularization term is the
same as in (5). The new average test RMSE is 1.2896. This is slightly worse
than what was obtained in (5). This can be explained by the fact that this model
does not capture the \emph{ordinal} nature (integer ratings) of the
variable to be predicted.
(7) The models in this problem perform worse than the model from problem 2.
This can be explained by the fact that the model of problem 2 is richer: it
assumes that the ratings depend on latent features of both the jokes and the
users.
In contrast, the models in this problem assume that the ratings of a joke
depend only on features of the joke; the fluctuations across individual users
are simply modeled as independent random noise.
An ideal model would combine ideas from both: the ordinal likelihood of this
problem, to capture the integer nature of the ratings, and the low-rank
user/joke model of problem 2, to model the mean of the latent Gaussian
variables.
\end{document}