% No 'submit' option for the problems by themselves.
\documentclass[submit]{harvardml}
% Use the 'submit' option when you submit your solutions.
%\documentclass[submit]{harvardml}

% Put in your full name and email address.
\name{Thibaut Horel}
\email{thorel@seas.harvard.edu}

% List any people you worked with.
\collaborators{%
}

% You don't need to change these.
\course{CS281-F15}
\assignment{Assignment \#3}
\duedate{11:59pm October 28, 2015}

\usepackage{url, enumitem}
\usepackage{amsfonts}
\usepackage{listings}
\usepackage{graphicx}
\usepackage{bm}

% Some useful macros.
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\prob}{\mathbb{P}}
\newcommand{\given}{\,|\,}
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\var}{\text{var}}
\newcommand{\cov}{\text{cov}}
\newcommand{\N}{\mathcal{N}}
\newcommand{\ep}{\varepsilon}

\theoremstyle{plain}
\newtheorem{lemma}{Lemma}

\begin{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{problem}[An orthogonal view of PCA, 20pts]
In lecture we showed that PCA maximizes the variance explained by the top $r$ principal components for every integer $r$. In this problem you will prove an equivalent characterization of PCA as an optimal low-rank approximation of the data. One reason we might seek such a representation is if we have a very high-dimensional data set to which we'd like to apply an expensive learning procedure. In this case, a lower-dimensional approximation of our data --- if it's sufficiently good --- might allow us to do our learning much faster by allowing us to operate in the lower-dimensional space.

Let's begin. Suppose we are given a data matrix $X \in \R^{n \times m}$ that is centered, i.e., the mean of every column is 0. Given some set of orthonormal vectors $\{ \pi_1, \ldots, \pi_r\} \subset \R^m$, let $\Pi \in \R^{m \times r}$ be the matrix whose columns are the $\pi_i$. We seek to represent observations as linear combinations of the vectors $\pi_d$. This won't be possible exactly because $r < m$, so instead we ask which set of vectors will give us the best approximation with respect to the $L^2$ norm.

\begin{enumerate}[label=(\alph*)]
\item
Show that if $x$ is an $m$-dimensional row vector then $x \Pi \Pi^T = \sum_{d=1}^r \langle x, \pi_d \rangle \pi_d^T$, so that $\Pi \Pi^T$ projects onto the span of the vectors $\pi_d^T$.

\item
Argue that the rank of $X \Pi \Pi^T$ is at most $r$.

\item
Suppose that $r=1$ so that $\Pi$ contains just one column vector $\pi_1$, and consider the matrix $X \Pi\Pi^T = X \pi_1 \pi_1^T$. Prove that the reconstruction error
\[
\| X \pi_1\pi_1^T - X \|_2^2
\]
is minimized if and only if
\[
\frac{1}{n} \sum_{i=1}^n (x_i \cdot \pi_1)^2 = \frac{1}{n} \| X \pi_1 \|_2^2
\]
is maximized, where $x_i$ is a row-vector containing the $i$-th row of $X$ and $\| \cdot \|_2$ is the vector/matrix 2-norm.

\item
Argue that the previous part implies that the reconstruction error is minimized if and only if the variance of the data in the $\pi_1$ direction is maximized. That is, if $X^*$ denotes a random observation from $X$ then an optimal $\pi_1$ maximizes $\var(X^* \cdot \pi_1)$. Conclude that when $r=1$ an optimal choice for $\pi_1$ is the top principal component of $X$.

\item
What if $r > 1$?
Given a vector $x \in \R^m$, use the fact that $\Pi \Pi^T$ is a projection (part a), as well as the previous part, to show similarly to part c that the reconstruction error
\[
\| X \Pi \Pi^T - X \|_2^2
\]
is minimized if and only if the sum of the variances along the directions $\pi_d$,
\[
\sum_{d=1}^r \var(X^* \cdot \pi_d)
\]
is maximized. Conclude that in general the top $r$ principal components give us an optimal rank-$r$ approximation of the data matrix $X$.
\end{enumerate}
\end{problem}

\paragraph{Solution.} (a) By definition of the matrix product, $\Pi\Pi^T = \sum_{d=1}^r \pi_d\pi_d^T$. Hence, $x\Pi\Pi^T = \sum_{d=1}^r x\pi_d\pi_d^T$. Since $x$ is a row vector and $\pi_d$ is a column vector, $x\pi_d = \langle x, \pi_d \rangle$, and we obtain the required form for $x\Pi\Pi^T$.

Furthermore, $\Pi^T\Pi$ (the Gramian matrix of the $\pi_d$ vectors) is equal to $I_r$ since the vectors $\pi_d$ are orthonormal. Hence $(\Pi\Pi^T)^2 = \Pi(\Pi^T\Pi)\Pi^T = \Pi\Pi^T$, which shows that $\Pi\Pi^T$ is a projection.

This form shows that $\Pi\Pi^T$ projects $x$ onto the span of the vectors $\pi_d^T$. The coordinate along vector $\pi_d^T$ is given by $\langle x, \pi_d\rangle$.

(b) Since $\Pi\Pi^T$ projects onto the span $S$ of the vectors $\pi_d^T$, its image is a subspace of $S$. $S$ has dimension exactly $r$ (the vectors $\pi_d$ form an orthonormal basis of it), so the image of $\Pi\Pi^T$ has dimension at most $r$; that is, the rank of $\Pi\Pi^T$ is at most $r$. Since the rank of a product is at most the rank of each of its factors, the rank of $X\Pi\Pi^T$ is also at most $r$.

(c) For matrices $X$ and $Y$ of compatible dimensions, by definition of the matrix product, the $i$th row of the matrix $XY$ is equal to $x_i Y$, where $x_i$ is the $i$th row of $X$.

In particular:
\begin{displaymath}
  \|X\pi_1\pi_1^T - X\|_2^2 = \sum_{i=1}^n \|x_i\pi_1\pi_1^T - x_i\|_2^2
\end{displaymath}
Using the identity $\|a+b\|^2 = \|a\|^2 + \|b\|^2 + 2\langle a, b\rangle$ (applied with $b$ replaced by $-b$), we get:
\begin{displaymath}
  \|X\pi_1\pi_1^T - X\|_2^2 = \sum_{i=1}^n\big( \|x_i\pi_1\pi_1^T\|_2^2 + \|x_i\|_2^2 - 2x_i\pi_1\pi_1^Tx_i^T\big)
\end{displaymath}
We also note that:
\begin{displaymath}
  \|x_i\pi_1\pi_1^T\|_2^2 = x_i\pi_1\pi_1^T\pi_1\pi_1^Tx_i^T
  = x_i\pi_1\pi_1^Tx_i^T
\end{displaymath}
where the last equality uses that $\pi_1^T\pi_1 = 1$. Plugging this into the above equality we obtain:
\begin{displaymath}
  \|X\pi_1\pi_1^T - X\|_2^2 = \sum_{i=1}^n\big(\|x_i\|_2^2 - x_i\pi_1\pi_1^Tx_i^T\big)
\end{displaymath}
Since the first term in the sum does not depend on $\pi_1$, minimizing the reconstruction error is equivalent to maximizing:
\begin{displaymath}
  \sum_{i=1}^n x_i\pi_1\pi_1^Tx_i^T = \sum_{i=1}^n (x_i\cdot \pi_1)^2 =
  \|X\pi_1\|_2^2
\end{displaymath}
Finally, multiplying by $\frac{1}{n}$ does not change the maximization problem.

(d) Let $X^*$ be a random observation from $X$, that is, a row of $X$ chosen uniformly at random. First:
\begin{displaymath}
  \E(X^*\cdot\pi_1) = \frac{1}{n}\sum_{i=1}^n x_i\cdot\pi_1
  = \left(\frac{1}{n}\sum_{i=1}^n x_i\right)\cdot\pi_1
  = 0\cdot\pi_1 = 0
\end{displaymath}
where the penultimate equality uses that the columns of $X$ sum up to zero. Hence:
\begin{displaymath}
  \var(X^*\cdot\pi_1) = \E\big((X^*\cdot\pi_1)^2\big)
  = \frac{1}{n}\sum_{i=1}^n (x_i\cdot\pi_1)^2
\end{displaymath}
and we obtain exactly the quantity computed in (c). This shows that choosing a direction $\pi_1$ of maximum variance for $X$ is equivalent to choosing $\pi_1$ minimizing the reconstruction error. So for $r=1$, the optimal choice for $\pi_1$ (to minimize the reconstruction error) is the top principal component of $X$.

(e) Using part (a), one can write similarly to (c):
\begin{displaymath}
  \|X\Pi\Pi^T - X\|_2^2 = \sum_{i=1}^n \|x_i\Pi\Pi^T - x_i\|_2^2
  = \sum_{i=1}^n \Big\|\sum_{d=1}^r(x_i\cdot\pi_d)\pi_d^T - x_i\Big\|_2^2
\end{displaymath}
Expanding the squares and using that the $\pi_d$ vectors are orthonormal:
\begin{displaymath}
  \Big\|\sum_{d=1}^r(x_i\cdot\pi_d)\pi_d^T - x_i\Big\|_2^2
  = \sum_{d=1}^r(x_i\cdot\pi_d)^2 + \|x_i\|_2^2 - 2\sum_{d=1}^r (x_i\cdot\pi_d)^2
  = -\sum_{d=1}^r(x_i\cdot\pi_d)^2 + \|x_i\|_2^2
\end{displaymath}
Hence minimizing the reconstruction error is equivalent to maximizing:
\begin{displaymath}
  \frac{1}{n}\sum_{i=1}^n\sum_{d=1}^r(x_i\cdot\pi_d)^2
\end{displaymath}

On the other hand, similarly to (d), the sum of the variances along the projecting directions:
\begin{displaymath}
  \sum_{d=1}^r\var(X^*\cdot\pi_d) = \sum_{d=1}^r\frac{1}{n}\sum_{i=1}^n(x_i\cdot\pi_d)^2
\end{displaymath}
is exactly the same quantity as above after swapping the sums.

Hence minimizing the reconstruction error is equivalent to maximizing the sum of the variances. We conclude that for $r>1$, finding the best low-rank approximation is equivalent to using the top $r$ principal components as the projection basis.
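As a quick numerical sanity check of this equivalence (not part of the assignment code), one can compare the reconstruction error of the top $r$ principal components with that of a random set of orthonormal directions, for instance with NumPy; the variable names below are illustrative:

\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 500, 10, 3
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))  # correlated columns
X -= X.mean(axis=0)                                     # center, as assumed above

def recon_error(X, Pi):
    # Squared norm of X Pi Pi^T - X, i.e. the sum of squared row norms.
    return np.linalg.norm(X @ Pi @ Pi.T - X) ** 2

# Top-r right singular vectors of X = top-r principal components.
Pi_pca = np.linalg.svd(X, full_matrices=False)[2][:r].T

# Any other set of r orthonormal directions should do no better.
Q, _ = np.linalg.qr(rng.normal(size=(m, r)))
assert recon_error(X, Pi_pca) <= recon_error(X, Q) + 1e-9
\end{lstlisting}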

\begin{problem}[Modeling users and jokes with a latent linear model, 25pts]
In this problem, we'll use a latent linear model to jointly model the ratings users assign to jokes. The data set we'll use is a modified and preprocessed variant of the Jester data set (version 2) from \url{http://eigentaste.berkeley.edu/dataset/}. The data we provide you with are user ratings of 150 jokes. There are over 1.7M ratings with values 1, 2, 3, 4 and 5, from about sixty thousand users. {\bf The data we give you is a modified version of the original Jester data set; please use the files we provide on Canvas and not the original ones}. The texts of the jokes are also available. Warning: most of the jokes are bad, tasteless, or both. At best, they were funny to late-night TV hosts in 1999-2000. Note also that some of the jokes do not have any ratings and so can be ignored.

\begin{enumerate}[label=(\alph*)]
\item
Let~${r_{i,j}\in\{1,2,3,4,5\}}$ be the rating of user $i$ on joke $j$. A latent linear model introduces a vector ${u_i\in\R^K}$ for each user and a vector~${v_j\in\R^K}$ for each joke. Then, each rating is modeled as a noisy version of the appropriate inner product. Specifically,
\[
r_{i,j} \sim \mathcal{N}(u_i^T v_j, \sigma^2),
\]
so that the log-likelihood is
\[
\log P(R) = \sum_{i,j} \left( -\frac{1}{2\sigma^2} \left( r_{i,j} - u_i^T v_j \right)^2 - \log \sigma - \frac{1}{2} \log(2\pi) \right)
\]
where the sum runs over all $i,j$ for which a rating exists.

What are the gradients of this log-likelihood with respect to $u_i$ and $v_j$?

\item
Randomly choose 100K ratings to hold out as a validation set. Now set $K = 2$, $\sigma^2=1.0$ and initialize the components of $u_i$ and $v_j$ by sampling from a Gaussian distribution with zero mean and standard deviation $0.01$. After this, adjust the vectors $u_i$ and $v_j$ on your training set using stochastic gradient descent without minibatches: iterate over the training ratings, updating $u_i$ and $v_j$ for each $r_{ij}$ that is available. Use $\lambda = 0.05$ as the learning rate and iterate 10 times (epochs) over the training data. Note that the maximum likelihood estimate of $\sigma^2$ is just the mean squared error of your predictions on the training set. Report your MLE of $\sigma^2$.

\item Evaluate different choices of $K$ on the validation set. Evaluate $K = 1, \ldots, 10$ and produce a plot that shows the root-mean-squared error on both the training set and the validation set for each trial and for each $K$. What seems like a good value for~$K$?

\item We might imagine that some jokes are just better or worse than others. We might also imagine that some users tend to have higher or lower means in their ratings. In this case, we can introduce biases into the model so that $r_{ij} \approx u_i^\text{T} v_j + a_i + b_j + g$, where $a_i$, $b_j$ and $g$ are user, joke and global biases, respectively. Change the model to incorporate these biases and fit it again with $K=2$, learning these new biases as well. Write down the likelihood that you are optimizing. One side-effect is that you should be able to rank the jokes from best to worst. What are the best and worst jokes and their respective biases? What is the value of the global bias?

\item Sometimes we have users or jokes that only have a few ratings. We don't want to overfit with these and so we might want to put priors on them. What are reasonable priors for the latent features and the biases? Draw a directed graphical model that shows all of these variables and their relationships. Note that you are not required to code this new model up, just discuss reasonable priors and write the graphical model.
\end{enumerate}
\end{problem}

\paragraph{Solution.} The code for this problem can be found in file \textsf{2.py}.

(a) The gradient with respect to $u_i$ is:
\begin{displaymath}
  \frac{1}{\sigma^2}\sum_{j}(r_{i,j} - u_i^Tv_j)v_j
\end{displaymath}
where the sum runs over the jokes $j$ rated by user $i$. Similarly, the gradient with respect to $v_j$ is:
\begin{displaymath}
  \frac{1}{\sigma^2}\sum_{i}(r_{i,j} - u_i^Tv_j)u_i
\end{displaymath}
where the sum runs over the users $i$ who rated joke $j$.
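A condensed sketch of the SGD loop described in part (b) of the problem, using the gradients above (the variable names are illustrative and this is not the submitted \textsf{2.py}):

\begin{lstlisting}[language=Python]
import numpy as np

def fit_latent_model(ratings, n_users, n_jokes, K=2, lr=0.05,
                     sigma2=1.0, epochs=10, seed=0):
    """SGD on the log-likelihood: one update of u_i, v_j per observed rating."""
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.normal(size=(n_users, K))
    V = 0.01 * rng.normal(size=(n_jokes, K))
    for _ in range(epochs):
        for i, j, r in ratings:       # ratings: iterable of (user, joke, value)
            err = r - U[i] @ V[j]     # r_ij - u_i^T v_j
            u_old = U[i].copy()
            U[i] += lr * err * V[j] / sigma2   # gradient from part (a)
            V[j] += lr * err * u_old / sigma2
    return U, V
\end{lstlisting}

The MLE of $\sigma^2$ asked for in part (b) is then just the mean of the squared residuals \texttt{err} over the training ratings.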
(b) The obtained MLE for $\sigma^2$ is $1.1064889001$.

(c) The plot of the RMSE for different values of $K$ can be found below:
\begin{center}
\includegraphics[width=0.8\textwidth]{rmse.pdf}
\end{center}
It seems that a good value for $K$ is 2 or 3. For these values, the RMSE on the validation set is minimal. For larger values of $K$, the RMSE decreases on the training set but becomes worse on the validation set: we are overfitting.

(d) The new log-likelihood is:
\begin{displaymath}
\log P(R) = \sum_{i,j}
\left( -\frac{1}{2\sigma^2} \left( r_{i,j} - u_i^T v_j - a_i - b_j - g\right)^2
- \log \sigma - \frac{1}{2} \log(2\pi) \right)
\end{displaymath}

After fitting this new model, we obtain an MSE of 0.932 and 1.111 on the training and validation sets respectively. The global bias is 1.4429.

By ordering jokes by their bias, we obtain the best joke (72) with a bias of 1.76:
\begin{center}
  \emph{on the first day of college the dean addressed the students pointing
  out some of the rules the female dormitory will be out of bounds for
  all male students and the male dormitory to the female students anybody
  caught breaking this rule will be finded the first time he continued
  anybody caught breaking this rule the second time will be fined being
  caught a third time will cost you a fine of are there any questions at
  this point a male student in the crowd inquired how much for a season
  pass}
\end{center}

and the worst joke (7) with a bias of -0.18:
\begin{center}
  \emph{how many feminists does it take to screw in a light bulb thats not
  funny}
\end{center}

(e) A Gaussian prior with mean 0 seems reasonable for the coordinates of the latent vectors for both the users and the jokes. This way we avoid excessively large parameter values. Another reasonable prior would be a Laplace prior, to enforce sparse latent vectors.

For the user bias and the joke bias, as well as the global bias, we can use a uniform prior over $[1, 5]$.

The graphical model looks like this in nested plate notation:
\begin{center}
\includegraphics[width=0.5\textwidth]{model.pdf}
\end{center}
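If we were to fit the MAP version of the model with the zero-mean Gaussian priors suggested above (with a hypothetical prior variance $\tau^2$ on the latent vectors), each SGD step would only gain a shrinkage term. This was not implemented for this assignment; a minimal sketch:

\begin{lstlisting}[language=Python]
import numpy as np

def map_sgd_step(U, V, i, j, r, lr=0.05, sigma2=1.0, tau2=1.0):
    """One SGD step on the log-posterior: Gaussian likelihood plus
    N(0, tau2 I) priors on the latent vectors u_i and v_j."""
    err = r - U[i] @ V[j]
    u_old = U[i].copy()
    U[i] += lr * (err * V[j] / sigma2 - u_old / tau2)
    V[j] += lr * (err * u_old / sigma2 - V[j] / tau2)
\end{lstlisting}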

\section*{Ordinal Regression}

We now address the problem of predicting joke ratings given the text of the joke. The previous models assumed that the ratings were continuous real numbers, while they are actually integers from 1 to 5. To take this into account, we will use an ordinal regression model. Let the rating values be $r = 1,\ldots,R$. In the ordinal regression model the real line is partitioned into $R$ contiguous intervals with boundaries
\begin{align}
-\infty = b_1 < b_2 < \ldots < b_{R+1} = +\infty\,,
\end{align}
such that the interval $[b_r,b_{r+1})$ corresponds to the $r$-th rating value. We will assume that $b_1 = -4$, $b_2 = -2$, $b_3 = 0$, $b_4 = 2$ and $b_5 = 4$. Instead of directly modeling the ratings, we will be modeling them in terms of a hidden variable $f$. We have that the rating $r$ is observed if $f$ falls in the interval for that rating. The conditional probability of $r$ given $f$ is then
\begin{align}
p(r|f) =
\left\{
  \begin{array}{ll}
    1 & \mbox{if }\quad b_r \leq f < b_{r+1} \\
    0 & \mbox{otherwise}
  \end{array}
\right.
= \Theta(f - b_r) - \Theta(f-b_{r+1})\,,
\end{align}
where $\Theta(x)$ is the Heaviside step function, that is, $\Theta(x)=1$ if $x>0$ and $\Theta(x)=0$ otherwise. Uncertainty about the exact value of $f$ can be modeled by adding Gaussian noise to $f$. Let $\sigma^2$ be the variance of this noise. Then $p(f|h) = \mathcal{N}(f|h,\sigma^2)$, where $h$ is a new latent variable that is the noise-free version of $f$.

Given some features $\mathbf{x}$ for a particular joke, we can then combine the previous likelihood with a linear model to make predictions for the possible rating values for the joke. In particular, we can assume that the noise-free rating value $h$ is a linear function of $\mathbf{x}$, that is, $h(\mathbf{x})=\mathbf{w}^\text{T} \mathbf{x}$, where $\mathbf{w}$ is a vector of regression coefficients. We will assume that the prior for $\mathbf{w}$ is $p(\mathbf{w})=\mathcal{N}(\bm 0, \mathbf{I})$.

The problem now is what feature representation to use for the jokes. We propose here that you compute binary features that indicate the presence or absence of the 150 most common words across all the jokes, plus a bias term that is always equal to one. The file \texttt{jester\_items.clean.dat} on Canvas has the joke texts with punctuation and case removed. To turn the texts into the proposed binary features you can use the Python script \texttt{generate\_binary\_features.py}, which is also available on Canvas.
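The provided \texttt{generate\_binary\_features.py} is the reference implementation; the idea can be sketched in a few lines (this version counts raw word occurrences and may differ from the provided script in details):

\begin{lstlisting}[language=Python]
import numpy as np
from collections import Counter

def binary_features(joke_texts, n_words=150):
    """Indicators for the n_words most frequent words, plus a constant bias column."""
    counts = Counter(w for text in joke_texts for w in text.split())
    vocab = [w for w, _ in counts.most_common(n_words)]
    X = np.zeros((len(joke_texts), len(vocab) + 1))
    for i, text in enumerate(joke_texts):
        present = set(text.split())
        X[i, :len(vocab)] = [w in present for w in vocab]
    X[:, -1] = 1.0  # bias feature, always equal to one
    return X
\end{lstlisting}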

\subsection*{Stochastic optimization with Adam}

Stochastic optimization of cost functions for models with many parameters can be difficult because each parameter may require a different learning rate. Adam (\url{http://arxiv.org/pdf/1412.6980v8.pdf}) is a method that can be used to avoid this problem. Adam computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients. Algorithm 1 shows the pseudo-code (extracted from the above reference). You are encouraged to look at the pdf for more details.
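For reference, the Adam update of Algorithm 1 is short enough to restate as code. This is a generic sketch for minimization with the paper's default hyperparameters, not the pseudo-code from the reference:

\begin{lstlisting}[language=Python]
import numpy as np

def adam_update(theta, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step; `state` holds the moment estimates and the step count."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])      # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** state["t"])      # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Example: state for a parameter vector w (e.g. 150 word features + bias).
w = np.zeros(151)
state = {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0}
\end{lstlisting}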

\subsection*{Sparse matrices}

A collection of ratings given by a set of users on a set of items can be efficiently manipulated by storing them in a sparse matrix format. The non-zero entries in the matrix would contain the observed rating values, while the zero entries encode missing ratings. In the package \texttt{scipy.sparse} you have available the following sparse matrix types:

{\footnotesize
\begin{verbatim}
bsr_matrix Block Sparse Row matrix
coo_matrix A sparse matrix in COOrdinate format.
csc_matrix Compressed Sparse Column matrix
csr_matrix Compressed Sparse Row matrix
dia_matrix Sparse matrix with DIAgonal storage
dok_matrix Dictionary Of Keys based sparse matrix.
lil_matrix Row-based linked list sparse matrix
\end{verbatim}
}

Each sparse matrix type presents different computational advantages for accessing and manipulating the data, depending on whether the operations are performed by rows, by columns, or in an arbitrary way. If you encode the ratings in a matrix with users as rows and jokes as columns, the \texttt{csc\_matrix} type can be useful to quickly identify the ratings that are available for any particular joke. The \texttt{lil\_matrix} type is useful if we want to incrementally construct a sparse matrix by adding new blocks of non-zero entries.

\begin{problem}[Ordinal linear regression, 25pts] You are required to
\begin{enumerate}
\vspace{-0.1cm}
\item Explain what form $p(r|h)$ takes in the ordinal regression model.

\vspace{-0.1cm}
\item Explain what the mean of the predictive distribution is in the ordinal regression model.

\vspace{-0.1cm}
\item Code a function that splits the ratings for each joke into a training set with 95\% of the ratings and a test set with the remaining 5\%. After this, each joke should have 95\% of its ratings in the training set and 5\% of them in the test set. For this, it can be useful to work with sparse matrices of dimension $n\times m$, where $n$ is the number of users, $m$ is the number of jokes, and the non-zero entries in the matrix are the observed rating values.

\vspace{-0.1cm}
\item Code a function that computes the log posterior distribution of the ordinal regression model up to a normalization constant. Code a function that computes the gradients of the previous function with respect to $\mathbf{w}$ and $\sigma^2$. You are encouraged to use automatic differentiation tools such as autograd (\url{https://github.com/HIPS/autograd}).

\vspace{-0.1cm}
\item Code a function that finds the MAP solution for $\mathbf{w}$ and $\sigma^2$ in the ordinal regression model given the available training data, using mini-batch stochastic gradient descent. Use minibatches of size 100 and Adam with its default parameter values for 100 epochs (iterations over the training data). Report the average test RMSE.

\vspace{-0.1cm}
\item Modify the previous model to have a Gaussian likelihood, that is, $p(r|h)=\mathcal{N}(r|h,\sigma^2)$. Report the resulting average test RMSE of the new model. Does performance increase or decrease? Why?

\vspace{-0.1cm}
\item How does the performance of the models analyzed in this problem compare to the performance of the model from problem 2? Which model performs best? Why?
\end{enumerate}
\end{problem}

\paragraph{Solution.} The code for this problem can be found in file \textsf{3.py}.

(1) Let us denote by $\Phi$ the CDF of a Gaussian variable of mean 0 and variance 1. Then:
\begin{displaymath}
  p(r|h) = \Phi\left(\frac{b_{r+1} - h}{\sigma}\right) -
  \Phi\left(\frac{b_{r} - h}{\sigma}\right)
\end{displaymath}
This is the probability that $f\sim \mathcal{N}(h,\sigma^2)$ falls in the interval $[b_r, b_{r+1})$.

(2) In the ordinal regression model, $h = h(x) = w^Tx$. So for a given weight vector $w$ and a joke with attributes $x$, the mean of the predictive distribution is:
\begin{displaymath}
  \mu(w, x, \sigma^2) = \sum_{r=1}^5 r\left(
  \Phi\left(\frac{b_{r+1} - w^Tx}{\sigma}\right) -
  \Phi\left(\frac{b_{r} - w^Tx}{\sigma}\right)
  \right)
\end{displaymath}
It is not entirely clear to me what was being asked for this question. If we wanted to, we could also compute the marginal predictive distribution by integrating the above quantity against the prior $p(w)$. However, after reading Piazza, I think the above quantity is what was expected.

(3) See function \textsf{build\_train\_test} in \textsf{3.py}.

(4) See functions \textsf{log\_posterior} and \textsf{grad\_log\_posterior} in \textsf{3.py}.

(5) See function \textsf{sgd} in \textsf{3.py}. I obtained an average test RMSE of 1.2871.

(6) See function \textsf{log\_posterior\_bis}. This function is simpler to write, because the likelihood part of the log-posterior is the likelihood of Gaussian variables, which takes the standard form of the squared error on the training dataset (as in linear regression). The regularization term is the same as in (5). The new average test RMSE is 1.2896. This is slightly worse than what was obtained in (5), which can be explained by the fact that this model does not capture the \emph{ordinal} nature (integer ratings) of the variable to be predicted.

(7) The performance of the models in this problem is worse than that of the model from problem 2.

This can be explained by the fact that the model of problem 2 is richer: it assumes that the ratings depend on latent features of the jokes and latent features of the users.

In contrast, the models in this problem assume that the ratings of a joke only depend on features of the joke. The fluctuations across individual users are simply modeled as independent random noise.

An ideal model would combine ideas from the models in this problem, to capture the ordinal nature of the ratings, with ideas from the model in problem 2, using a low-rank model on both users and jokes to model the mean of the Gaussian variables.

\end{document}
