| author | Thibaut Horel <thibaut.horel@gmail.com> | 2015-09-18 14:02:20 -0400 |
|---|---|---|
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2015-09-18 14:02:20 -0400 |
| commit | 3aea510694365b79ac6e396ee902d6013de3d79f (patch) | |
| tree | d3504d050d84140a5f84a19140dc9f513cd1f84b /hw1 | |
| parent | 80ad83621f7dc3802404bb35927d78e9ba3641af (diff) | |
| download | cs281-3aea510694365b79ac6e396ee902d6013de3d79f.tar.gz | |
[HW1] Problem 1
Diffstat (limited to 'hw1')
| mode | file | changes |
|---|---|---|
| -rw-r--r-- | hw1/assignment-1.pdf | bin (0 -> 134960 bytes) |
| -rw-r--r-- | hw1/harvardml.cls | 97 |
| -rw-r--r-- | hw1/main.tex | 333 |
3 files changed, 430 insertions, 0 deletions
diff --git a/hw1/assignment-1.pdf b/hw1/assignment-1.pdf
new file mode 100644
index 0000000..a5a5adc
--- /dev/null
+++ b/hw1/assignment-1.pdf
Binary files differ
diff --git a/hw1/harvardml.cls b/hw1/harvardml.cls
new file mode 100644
index 0000000..e285173
--- /dev/null
+++ b/hw1/harvardml.cls
@@ -0,0 +1,97 @@
+% Ryan Adams
+% School of Engineering and Applied Sciences
+% Harvard University
+% v0.01, 31 August 2013
+% Based on HMC Math Dept. template by Eric J. Malm.
+\NeedsTeXFormat{LaTeX2e}[1995/01/01]
+\ProvidesClass{harvardml}
+[2013/08/31 v0.01 Harvard ML Assignment Class]
+
+\RequirePackage{ifpdf}
+
+\newif\ifhmlset@submit
+\DeclareOption{submit}{%
+  \hmlset@submittrue%
+}
+\DeclareOption{nosubmit}{%
+  \hmlset@submitfalse%
+}
+
+\DeclareOption*{\PassOptionsToClass{\CurrentOption}{article}}
+\ExecuteOptions{nosubmit}
+\ProcessOptions\relax
+
+\LoadClass[10pt,letterpaper]{article}
+
+\newif\ifhmlset@header
+
+\hmlset@headertrue
+
+\RequirePackage{mathpazo}
+\RequirePackage{palatino}
+\RequirePackage{amsmath}
+\RequirePackage{amssymb}
+\RequirePackage{amsthm}
+\RequirePackage{fullpage}
+\RequirePackage{mdframed}
+
+\newtheoremstyle{hmlplain}
+  {3pt}% Space above
+  {3pt}% Space below
+  {}% Body font
+  {}% Indent amount
+  {\bfseries}% Theorem head font
+  {\\*[3pt]}% Punctuation after theorem head
+  {.5em}% Space after theorem head
+  {}% Theorem head spec (can be left empty, meaning `normal')
+
+\def\titlebar{\hrule height2pt\vskip .25in\vskip-\parskip}
+
+\newcommand{\headerblock}{%
+  \noindent\begin{minipage}{0.33\textwidth}
+    \begin{flushleft}
+      \ifhmlset@submit
+      \mbox{\hmlset@name}\\
+      \mbox{\tt \hmlset@email}\\
+      \mbox{\hmlset@course}
+      \fi
+    \end{flushleft}
+  \end{minipage}
+  \noindent\begin{minipage}{0.33\textwidth}
+    \begin{center}
+      \mbox{\Large\hmlset@assignment}\protect\\
+      Due: \hmlset@duedate
+    \end{center}
+  \end{minipage}
+  \noindent\begin{minipage}{0.33\textwidth}
+    \begin{flushright}
+      \ifhmlset@submit
+      Collaborators: \hmlset@collaborators
+      \fi
+    \end{flushright}
+  \end{minipage}
+  \vspace{0.1cm}
+  \titlebar
+}
+
+\ifhmlset@header\AtBeginDocument{\headerblock}\fi
+
+\def\hmlset@name{}
+\def\hmlset@email{}
+\def\hmlset@course{}
+\def\hmlset@assignment{}
+\def\hmlset@duedate{}
+\def\hmlset@collaborators{}
+\def\hmlset@extraline{}
+
+% commands to set header block info
+\newcommand{\name}[1]{\def\hmlset@name{#1}}
+\newcommand{\email}[1]{\def\hmlset@email{#1}}
+\newcommand{\course}[1]{\def\hmlset@course{#1}}
+\newcommand{\assignment}[1]{\def\hmlset@assignment{#1}}
+\newcommand{\duedate}[1]{\def\hmlset@duedate{#1}}
+\newcommand{\collaborators}[1]{\def\hmlset@collaborators{#1}}
+\newcommand{\extraline}[1]{\def\hmlset@extraline{#1}}
+
+\theoremstyle{hmlplain}
+\newmdtheoremenv[skipabove=\topsep,skipbelow=\topsep,nobreak=true]{problem}{Problem}
diff --git a/hw1/main.tex b/hw1/main.tex
new file mode 100644
index 0000000..cd5c1f9
--- /dev/null
+++ b/hw1/main.tex
@@ -0,0 +1,333 @@
+% No 'submit' option for the problems by themselves.
+\documentclass[submit]{harvardml}
+% Use the 'submit' option when you submit your solutions.
+%\documentclass[submit]{harvardml}
+
+% Put in your full name and email address.
+\name{Thibaut Horel}
+\email{thorel@seas.harvard.edu}
+
+% List any people you worked with.
+\collaborators{%
+}
+
+% You don't need to change these.
+\course{CS281-F15}
+\assignment{Assignment \#1}
+\duedate{11:59pm September 23, 2015}
+
+\usepackage{url, enumitem}
+\usepackage{amsfonts}
+\usepackage{listings}
+\usepackage{bm}
+
+% Some useful macros.
+\newcommand{\prob}{\mathbb{P}}
+\newcommand{\given}{\,|\,}
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\var}{\text{var}}
+\newcommand{\cov}{\text{cov}}
+\newcommand{\N}{\mathcal{N}}
+\newcommand{\ep}{\varepsilon}
+
+\newcommand{\Dir}{\text{Dirichlet}}
+
+\theoremstyle{plain}
+\newtheorem{lemma}{Lemma}
+
+\begin{document}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\begin{problem}[The Gaussian algebra, 10pts]
+Let $X$ and $Y$ be independent univariate Gaussian random variables. In the previous problem set, you likely used the fact that $X + Y$ is also Gaussian. Here you'll prove this fact.
+
+\begin{enumerate}[label=(\alph*)]
+\item Suppose $X$ and $Y$ have mean 0 and variances $\sigma_X^2$ and $\sigma_Y^2$ respectively. Write the pdf of $X + Y$ as an integral.
+\item Evaluate the integral from the previous part to find a closed-form expression for the pdf of $X+Y$, then argue that this expression implies that $X+Y$ is also Gaussian with mean $0$ and variance $\sigma_X^2 + \sigma_Y^2$. Hint: what is the integral, over the entire real line, of
+\[
+f(x) = \frac{1}{\sqrt{2\pi}s} \exp\left( -\frac{1}{2s^2}(x - m)^2 \right) ,
+\]
+i.e., the pdf of a univariate Gaussian random variable?
+\item Extend the above result to the case in which $X$ and $Y$ may have arbitrary means.
+\item Univariate Gaussians are supported on the entire real line. Sometimes this is undesirable because we're modeling a quantity with positive support. A common way to transform a Gaussian to solve this problem is to exponentiate it. Suppose $X$ is a univariate Gaussian with mean $\mu$ and variance $\sigma^2$. What is the pdf of $e^X$?
+\end{enumerate}
+\end{problem}
+
+\paragraph{Solution} (a) Writing $f_X$, $f_Y$ and $g$ for the pdfs of $X$, $Y$ and
+$X+Y$ respectively, we have:
+\begin{displaymath}
+  g(z) = \int_{\mathbb{R}} f_X(u)f_Y(z-u)du,\quad z\in\mathbb{R}
+\end{displaymath}
+
+(b) Using the formula for the pdf of a Gaussian, we obtain:
+\begin{displaymath}
+  g(z) = \frac{1}{2\pi\sigma_X\sigma_Y}\int_{\mathbb{R}}
+  \exp\left(-\frac{u^2}{2\sigma_X^2} - \frac{(z-u)^2}{2\sigma_Y^2}\right)du
+\end{displaymath}
+The argument of the $\exp$ function can be reordered in ``canonical form'' with
+respect to $u$:
+\begin{displaymath}
+  -\frac{u^2}{2\sigma_X^2} - \frac{(z-u)^2}{2\sigma_Y^2}
+  = -\frac{\sigma_X^2+ \sigma_Y^2}{2\sigma_X^2\sigma_Y^2}\left(u
+  - \frac{\sigma_X^2}{\sigma_X^2+\sigma_Y^2}z\right)^2
+  - \frac{z^2}{2(\sigma_X^2+\sigma_Y^2)}
+\end{displaymath}
+Hence:
+\begin{displaymath}
+  g(z) = \frac{1}{2\pi\sigma_X\sigma_Y}
+  \exp\left(- \frac{z^2}{2(\sigma_X^2+\sigma_Y^2)}\right)
+  \int_{\mathbb{R}}
+  \exp\left(-\frac{\sigma_X^2+ \sigma_Y^2}{2\sigma_X^2\sigma_Y^2}\left(u
+  - \frac{\sigma_X^2}{\sigma_X^2+\sigma_Y^2}z\right)^2\right)du
+\end{displaymath}
+We now recognize the integral of the un-normalized pdf of a Gaussian variable
+with variance $\sigma_X^2\sigma_Y^2/(\sigma_X^2+\sigma_Y^2)$. Hence the
+integral evaluates to:
+\begin{displaymath}
+  \sqrt{2\pi} \frac{\sigma_X\sigma_Y}{\sqrt{\sigma_X^2 + \sigma_Y^2}}
+\end{displaymath}
+After simplification we obtain the following formula for $g(z)$:
+\begin{displaymath}
+  g(z) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_X^2+\sigma_Y^2}}
+  \exp\left(- \frac{z^2}{2(\sigma_X^2+\sigma_Y^2)}\right)
+\end{displaymath}
+That is, $X+Y$ is a Gaussian variable with mean $0$ and variance
+$\sigma_X^2+\sigma_Y^2$.
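+
+As a quick numerical sanity check of (b) (not part of the derivation), one can
+compare a Monte Carlo estimate of the distribution of $X+Y$ with the Gaussian
+pdf derived above. The snippet below is only an illustrative sketch; the values
+of $\sigma_X$ and $\sigma_Y$ are arbitrary:
+\begin{verbatim}
+>>> import numpy as np
+>>> from scipy.stats import norm
+>>> sx, sy = 1.3, 0.7                    # arbitrary standard deviations
+>>> z = sx * np.random.randn(10 ** 6) + sy * np.random.randn(10 ** 6)
+>>> z.var()                              # should be close to sx ** 2 + sy ** 2
+>>> hist, edges = np.histogram(z, bins = 50, density = True)
+>>> centers = 0.5 * (edges[ 1 : ] + edges[ : -1 ])
+>>> np.abs(hist - norm.pdf(centers, 0.0, np.sqrt(sx ** 2 + sy ** 2))).max()
+\end{verbatim}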
+
+(c) If $X$ has mean $\mu_X$ and $Y$ has mean $\mu_Y$, $X-\mu_X$ and $Y-\mu_Y$
+are still independent variables with the same variances as before and zero mean.
+Applying the previous result, $X+Y-\mu_X-\mu_Y$ is a Gaussian variable with
+mean $0$ and variance $\sigma_X^2 + \sigma_Y^2$. Hence, $X+Y$ is normally
+distributed with mean $\mu_X+\mu_Y$ and variance $\sigma_X^2+\sigma_Y^2$.
+
+(d) Since $x\mapsto e^x$ is a bijection mapping $\mathbb{R}$ to $\mathbb{R}^+$,
+the pdf of $Y = e^X$ can be obtained by using an inverse transform. Formally,
+for any interval $[a,b]$ with $a,b>0$:
+\begin{align*}
+  \prob(Y\in[a,b]) = \prob(X\in [\log a, \log b])
+  &=\frac{1}{\sqrt{2\pi}\sigma_X}\int_{\log a}^{\log b}
+  \exp\left(-\frac{1}{2\sigma_X^2}(x-\mu)^2\right)dx\\
+  &=\frac{1}{\sqrt{2\pi}\sigma_X}\int_{a}^{b}
+  \frac{1}{z}\exp\left(-\frac{1}{2\sigma_X^2}(\log z-\mu)^2\right)dz
+\end{align*}
+where we used the change of variable $z = e^x$ (so that $dx = dz/z$) in the
+last equality. Hence the pdf of $Y$:
+\begin{displaymath}
+  f_Y(z) = \begin{cases}
+    \frac{1}{\sqrt{2\pi}\sigma_X z}
+    \exp\left(-\frac{1}{2\sigma_X^2}(\log z-\mu)^2\right)&\text{if $z>0$}\\
+    0&\text{otherwise}
+  \end{cases}
+\end{displaymath}
+This is the density of a log-normal distribution.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{problem}[Regression, 13pts]
+Suppose that $X \in \R^{n \times m}$ with $n \geq m$ and $Y \in \R^n$, and that $Y \sim \N(X\beta, \sigma^2 I)$. You learned in class that the maximum likelihood estimate $\hat\beta$ of $\beta$ is given by
+\[
+\hat\beta = (X^TX)^{-1}X^TY
+\]
+\begin{enumerate}[label=(\alph*)]
+\item Why do we need to assume that $n \geq m$?
+\item Define $H = X(X^TX)^{-1}X^T$, so that the ``fitted'' values $\hat Y = X\hat{\beta}$ satisfy $\hat Y = HY$. Show that $H$ is an orthogonal projection matrix that projects onto the column space of $X$, so that the fitted y-values are a projection of $Y$ onto the column space of $X$.
+\item What are the expectation and covariance matrix of $\hat\beta$?
+\item Compute the gradient with respect to $\beta$ of the log likelihood implied by the model above, assuming we have observed $Y$ and $X$.
+\item Suppose we place a normal prior on $\beta$. That is, we assume that $\beta \sim \N(0, \tau^2 I)$. Show that the MAP estimate of $\beta$ given $Y$ in this context is
+\[
+\hat \beta_{MAP} = (X^TX + \lambda I)^{-1}X^T Y
+\]
+where $\lambda = \sigma^2 / \tau^2$. (You may employ standard conjugacy results about Gaussians without proof in your solution.)
+
+Estimating $\beta$ in this way is called {\em ridge regression} because the matrix $\lambda I$ looks like a ``ridge''. Ridge regression is a common form of {\em regularization} that is used to avoid the overfitting (resp. underdetermination) that happens when the sample size is close to (resp. smaller than) the number of features in linear regression.
+\item Do we need $n \geq m$ to do ridge regression? Why or why not?
+\item Show that ridge regression is equivalent to adding $m$ additional rows to $X$, where the $j$-th additional row has its $j$-th entry equal to $\sqrt{\lambda}$ and all other entries equal to zero, adding $m$ corresponding additional entries to $Y$ that are all 0, and then computing the maximum likelihood estimate of $\beta$ using the modified $X$ and $Y$.
+\end{enumerate}
+\end{problem}
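+
+The augmentation identity in part (g) can also be checked numerically. The
+sketch below (made-up data and an arbitrary $\lambda$; it is an illustration,
+not part of the required solution) compares the closed-form ridge estimate with
+an ordinary least-squares fit on the padded design matrix:
+\begin{verbatim}
+>>> import numpy as np
+>>> n, m, lam = 50, 5, 2.0               # made-up sizes and lambda
+>>> X = np.random.randn(n, m)
+>>> Y = np.random.randn(n)
+>>> ridge = np.linalg.solve(X.T.dot(X) + lam * np.eye(m), X.T.dot(Y))
+>>> X_pad = np.vstack((X, np.sqrt(lam) * np.eye(m)))
+>>> Y_pad = np.concatenate((Y, np.zeros(m)))
+>>> ls = np.linalg.lstsq(X_pad, Y_pad)[0]
+>>> np.allclose(ridge, ls)               # expect True
+\end{verbatim}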
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\begin{problem}[The Dirichlet and multinomial distributions, 12pts]
+The Dirichlet distribution over $K$ categories is a generalization of the beta distribution. It has a shape parameter $a \in \R^K$ with positive entries and is supported over the set of $K$-dimensional positive vectors whose components sum to 1. Its density is given by
+\[
+f(x | a) = \frac{\Gamma\left( \sum_{k=1}^K a_k \right)}{\prod_{k=1}^K \Gamma(a_k)} \prod_{k=1}^K x_k^{a_k - 1}
+\]
+(Notice that when $K=2$, this reduces to the density of a beta distribution since in that case $x_2 = 1 - x_1$.) For the rest of this problem, assume a fixed $K \geq 2$.
+\begin{enumerate}[label=(\alph*)]
+\item Suppose $X$ is Dirichlet-distributed with shape parameter $a$. Without proof, state the value of $E(X)$. Your answer should be a vector defined in terms of either $a$ or $K$ or potentially both.
+\item Suppose that $\theta \sim \text{Dirichlet}(a)$ and that $X \sim \text{Categorical}(\theta)$. That is, suppose we first sample a $K$-dimensional vector $\theta$ with entries in $(0,1)$ from a Dirichlet distribution and then roll a $K$-sided die such that the probability of rolling the number $k$ is $\theta_k$. Prove that $\theta | X$ also follows a Dirichlet distribution. What is its shape parameter?
+\item Now suppose that $\theta \sim \text{Dirichlet}(a)$ and that $X_1, X_2, \ldots \stackrel{iid}{\sim} \text{Categorical}(\theta)$. Show that
+\[
+P(X_n = k | X_1, \ldots, X_{n-1}) = \frac{a_{k,n}}{\sum_{j=1}^K a_{j,n}}
+\]
+where $a_{k,n} = a_k + \sum_{i=1}^{n-1} \mathbb{1}\{X_i = k\}$. (Bonus points if your solution does not involve any integrals.)
+\item Argue that the random variable $\lim_{n \rightarrow \infty} \frac{1}{n}\sum_{i=1}^n \mathbb{1}\{X_i = k\}$ is marginally beta-distributed. What is its shape parameter? If you're not sure how to rigorously talk about convergence of random variables, give an informal argument. Hint: what would you say if $\theta$ were fixed?
+\item Suppose we have $K$ distinct colors and an urn with $a_k$ balls of color $k$. At each time step, we choose a ball uniformly at random from the urn and then add into the urn an additional new ball of the same color as the chosen ball. (So if at the first time step we choose a ball of color 1, we'll end up with $a_1+1$ balls of color 1 and $a_k$ balls of color $k$ for all $k > 1$ at the start of the second time step.) Let $\rho_{k,n}$ be the fraction of all the balls that are of color $k$ at time $n$. What is the distribution of $\lim_{n \rightarrow \infty} \rho_{k,n}$? Prove your answer.
+\end{enumerate}
+\end{problem}
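+
+The predictive rule in part (c) lends itself to a quick simulation check. The
+sketch below (with an arbitrary shape parameter; it is an illustration, not
+part of the required solution) estimates $P(X_2 = 1 \given X_1 = 1)$ and
+compares it with $(a_1 + 1)/(\sum_k a_k + 1)$:
+\begin{verbatim}
+>>> import numpy as np
+>>> rng = np.random.RandomState(0)
+>>> a = np.array([1.0, 2.0, 3.0])        # arbitrary shape parameter
+>>> N = 500000
+>>> theta = rng.dirichlet(a, size = N)
+>>> cum = theta.cumsum(axis = 1)
+>>> x1 = (rng.rand(N, 1) > cum).sum(axis = 1)   # one categorical draw per theta
+>>> x2 = (rng.rand(N, 1) > cum).sum(axis = 1)   # second draw from the same theta
+>>> (x2[ x1 == 0 ] == 0).mean()          # should approach (a[0] + 1) / (a.sum() + 1)
+\end{verbatim}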
+
+\section*{Physicochemical Properties of Protein Tertiary Structure}
+
+In the following problems we will code two different approaches for
+solving linear regression problems and compare how they scale as a function of
+the dimensionality of the data. We will also investigate the effects of
+linear and non-linear features on the predictions made by linear models.
+
+We will be working with the regression data set Protein
+Tertiary Structure:
+\url{https://archive.ics.uci.edu/ml/machine-learning-databases/00265/CASP.csv}.
+This data set contains information about predicted conformations for 45730
+proteins. In the data, the target variable $y$ is the root-mean-square
+deviation (RMSD) of the predicted conformations with respect to the true properly
+folded form of the protein. The RMSD is a measure of the average distance
+between the atoms (usually the backbone atoms) of superimposed proteins.
+The features $\mathbf{x}$ are
+physico-chemical properties of the proteins in their true folded form.
+After downloading the file CASP.csv we can load the data into Python using
+\begin{verbatim}
+>>> import numpy as np
+>>> data = np.loadtxt('CASP.csv', delimiter = ',', skiprows = 1)
+\end{verbatim}
+We can then obtain the vector of target variables and the feature matrix using
+\begin{verbatim}
+>>> y = data[ : , 0 ]
+>>> X = data[ : , 1 : ]
+\end{verbatim}
+We can then split the original data into a training set with 90\% of the data
+entries in the file CASP.csv and a test set with the remaining 10\% of the
+entries. Normally, the splitting of the data is done at random, but here {\bf we ask
+you to put into the training set the first 90\% of the elements from the
+file CASP.csv} so that we can verify that the values that you will be reporting are correct.
+(This should not cause problems, because the rows of the file are in a random order.)
+The following Python function splits the data as requested:
+\begin{verbatim}
+def split_train_test(X, y, fraction_train = 9.0 / 10.0):
+    end_train = int(round(X.shape[ 0 ] * fraction_train))
+    X_train = X[ 0 : end_train, ]
+    y_train = y[ 0 : end_train ]
+    X_test = X[ end_train :, ]
+    y_test = y[ end_train : ]
+    return X_train, y_train, X_test, y_test
+\end{verbatim}
+We can then normalize the features so that they have zero mean and unit standard
+deviation in the training set. This is a standard step before the application of
+many machine learning methods. We can use the following Python function to perform
+this operation:
+\begin{verbatim}
+def normalize_features(X_train, X_test):
+    mean_X_train = np.mean(X_train, 0)
+    std_X_train = np.std(X_train, 0)
+    std_X_train[ std_X_train == 0 ] = 1
+    X_train_normalized = (X_train - mean_X_train) / std_X_train
+    X_test_normalized = (X_test - mean_X_train) / std_X_train
+    return X_train_normalized, X_test_normalized
+\end{verbatim}
+After these steps are done, we can concatenate a bias feature (one feature which
+always takes value 1) to the observations in the normalized training and test sets:
+\begin{verbatim}
+>>> X_train, y_train, X_test, y_test = split_train_test(X, y)
+>>> X_train, X_test = normalize_features(X_train, X_test)
+>>> X_train = np.concatenate((X_train, np.ones((X_train.shape[ 0 ], 1))), 1)
+>>> X_test = np.concatenate((X_test, np.ones((X_test.shape[ 0 ], 1))), 1)
+\end{verbatim}
+We are now ready to apply our machine learning methods to the normalized training set and
+evaluate their performance on the normalized test set.
+In the following problems, you will be asked to report some numbers and produce
+some figures. Include these numbers and figures in your assignment report.
+{\bf The numbers should be reported with up to 8 decimals}.
+
+%Notation consistent with prob 2?
+\begin{problem}[7pts]\label{prob:analytic_linear_model}
+Assume that the targets $y$ are obtained as a function of the normalized
+features $\mathbf{x}$ according to a Bayesian linear model with additive Gaussian noise with variance
+$\sigma^2 = 1.0$ and a Gaussian prior on the regression coefficients $\mathbf{w}$
+with precision matrix $\tau^{-2}\mathbf{I}$ where $\tau^{-2} = 10$. Code a routine that finds the Maximum a
+Posteriori (MAP) value $\hat{\mathbf{w}}$ for $\mathbf{w}$ given the normalized
+training data using the QR decomposition (see Section 7.5.2 in Murphy's book).
+\begin{itemize}
+\item Report the value of $\hat{\mathbf{w}}$ obtained.
+\item Report the root mean squared error (RMSE) of $\hat{\mathbf{w}}$ in the normalized test set.
+\end{itemize}
+\end{problem}
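+
+One possible skeleton for the QR-based routine is sketched below. The function
+name, the reduction of the MAP problem to an augmented least-squares problem,
+and the reuse of numpy (imported as np above) are illustrative assumptions, not
+a prescribed implementation:
+\begin{verbatim}
+def map_weights_qr(X, y, sigma2 = 1.0, tau2_inv = 10.0):
+    # MAP for y ~ N(Xw, sigma2 I) with prior w ~ N(0, (1 / tau2_inv) I),
+    # computed via a QR decomposition of the ridge-augmented design matrix.
+    lam = sigma2 * tau2_inv
+    X_aug = np.vstack((X, np.sqrt(lam) * np.eye(X.shape[ 1 ])))
+    y_aug = np.concatenate((y, np.zeros(X.shape[ 1 ])))
+    Q, R = np.linalg.qr(X_aug)
+    return np.linalg.solve(R, Q.T.dot(y_aug))
+\end{verbatim}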
+
+\begin{problem}[14pts]\label{prob:numerical_linear_model}
+Write code that evaluates the log-posterior probability (up to an
+additive constant) of $\mathbf{w}$ according to the previous Bayesian linear model and the normalized training data.
+Also write code that evaluates the gradient of the log-posterior probability with respect to
+$\mathbf{w}$. Merge the two pieces of code into a
+function called \emph{obj\_and\_gradient} that, for a value of $\mathbf{w}$,
+returns the log-posterior probability value and the corresponding gradient.
+Verify that the gradient computation is correct using the approximation
+\begin{align}
+\frac{df(x)}{dx} \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}\,.
+\end{align}
+Then write a function that uses \emph{obj\_and\_gradient} to find the MAP solution $\hat{\mathbf{w}}$ for
+$\mathbf{w}$ by running 100 iterations of the L-BFGS numerical optimization
+method with $\mathbf{0}$ as the initial point for the optimization.
+
+The L-BFGS method is an iterative method for solving nonlinear optimization problems.
+You will learn more about numerical optimization and about this method in one of the sections of this course. However,
+for this problem you can use the method as a black box that returns the MAP solution
+by sequentially evaluating the objective function and its gradient for different input values.
+In Python, a variant of the L-BFGS method called L-BFGS-B is implemented in scipy.optimize.minimize.
+Since scipy.optimize.minimize only minimizes, you can work with the negative
+log-posterior probability to transform the maximization problem into a minimization one.
+
+\begin{itemize}
+\item Report the value of $\hat{\mathbf{w}}$ obtained.
+\item Report the RMSE of the predictions made with $\hat{\mathbf{w}}$ in the normalized test set.
+\end{itemize}
+\end{problem}
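+
+A minimal sketch of how such a routine could fit together is shown below.
+The function and variable names are illustrative assumptions; it works with the
+negative log-posterior, as suggested above, and assumes the normalized
+X\_train and y\_train from the previous section:
+\begin{verbatim}
+import numpy as np
+from scipy.optimize import minimize
+
+def neg_obj_and_gradient(w, X, y, sigma2 = 1.0, tau2_inv = 10.0):
+    # negative log-posterior (up to an additive constant) and its gradient
+    r = y - X.dot(w)
+    value = 0.5 * r.dot(r) / sigma2 + 0.5 * tau2_inv * w.dot(w)
+    grad = -X.T.dot(r) / sigma2 + tau2_inv * w
+    return value, grad
+
+# central-difference check of one gradient coordinate
+w0 = np.zeros(X_train.shape[ 1 ])
+eps = np.zeros_like(w0); eps[ 0 ] = 1e-6
+numerical = (neg_obj_and_gradient(w0 + eps, X_train, y_train)[0]
+             - neg_obj_and_gradient(w0 - eps, X_train, y_train)[0]) / 2e-6
+analytic = neg_obj_and_gradient(w0, X_train, y_train)[1][0]
+
+result = minimize(neg_obj_and_gradient, w0, args = (X_train, y_train),
+                  jac = True, method = 'L-BFGS-B', options = {'maxiter': 100})
+w_hat = result.x
+\end{verbatim}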
+
+Linear regression can be extended to model non-linear relationships by
+replacing the original features $\mathbf{x}$ with some non-linear functions of
+the original features $\bm \phi(\mathbf{x})$. We can automatically generate one
+such non-linear function by sampling a random weight vector $\mathbf{a}
+\sim \N(0,\mathbf{I})$ and a corresponding random bias $b \sim
+\text{U}[0, 2\pi]$ and then making $\phi(\mathbf{x}) = \cos(\mathbf{a}^\text{T}
+\mathbf{x} + b)$. By repeating this process $d$ times we can generate $d$
+non-linear functions that, when applied to the original features, produce a
+non-linear mapping of the data into a new $d$-dimensional space.
+We can encode these $d$ functions into a matrix $\mathbf{A}$ with $d$ rows, each
+row containing the weights of one function, and a $d$-dimensional vector $\mathbf{b}$
+with the biases of the functions. The new mapped features are then obtained as
+$\bm \phi (\mathbf{x}) = \cos(\mathbf{A} \mathbf{x} + \mathbf{b})$, where
+$\cos$ applied to a vector returns another vector whose elements are the result
+of applying $\cos$ to the individual elements of the original vector.
+
+\begin{problem}[14pts]\label{prob:non_linear_model}
+Generate 4 sets of non-linear functions, each one with $d=100, 200, 400, 600$ functions, respectively, and use
+them to map the features in the original normalized training and test sets into
+4 new feature spaces, each one of dimensionality given by the value of $d$. After this, for each
+value of $d$, find the MAP solution $\hat{\mathbf{w}}$ for $\mathbf{w}$ using the
+corresponding new training set and the method from problem
+\ref{prob:analytic_linear_model} (QR). Use the same values for $\sigma^2$ and $\tau^{-2}$ as before.
+You are also asked to
+record the time taken by the QR method to obtain a value for $\hat{\mathbf{w}}$.
+In Python you can compute the time taken by a routine using the time module:
+\begin{verbatim}
+>>> import time
+>>> time_start = time.time()
+>>> routine_to_call()
+>>> running_time = time.time() - time_start
+\end{verbatim}
+Next, compute the RMSE of the resulting predictor in the normalized test
+set. Repeat this process with the method from problem
+\ref{prob:numerical_linear_model} (L-BFGS).
+
+\begin{itemize}
+\item Report the test RMSE obtained by each method for each value of $d$.
+\end{itemize}
+
+You are asked to generate a plot
+with the results obtained by each method (QR and L-BFGS)
+for each value of $d$. In this plot
+the $x$ axis should represent the time taken by each method to
+run and the $y$ axis should be the RMSE of the resulting predictor in the
+normalized test set. The plot should
+contain 4 points in red, representing the results obtained by the QR method for
+each value of $d$, and 4 points in blue, representing the results obtained
+by the L-BFGS method for each value of $d$. Answer the following questions:
+\begin{itemize}
+\item Do the non-linear transformations help to reduce the prediction error? Why?
+\item Which method (QR or L-BFGS) is faster? Why?
+\end{itemize}
+\end{problem}
+
+\end{document}
