diff --git a/hw2/main.tex b/hw2/main.tex
new file mode 100644
index 0000000..f4a1067
--- /dev/null
+++ b/hw2/main.tex
@@ -0,0 +1,326 @@
+% Use the 'submit' option when you submit your solutions.
+\documentclass[submit]{harvardml}
+% No 'submit' option for the problems by themselves.
+%\documentclass{harvardml}
+
+% Put in your full name and email address.
+\name{Thibaut Horel}
+\email{thorel@seas.harvard.edu}
+\usepackage[utf8x]{inputenc}
+
+% List any people you worked with.
+\collaborators{%
+}
+
+% You don't need to change these.
+\course{CS281-F13}
+\assignment{Assignment \#2}
+\duedate{23:59pm October 7, 2015}
+
+\usepackage{graphicx}
+% Useful macros.
+\DeclareMathOperator*{\argmax}{argmax}
+\DeclareMathOperator*{\argmin}{argmin}
+\newcommand{\given}{\,|\,}
+\newcommand{\prob}{\mathbb{P}}
+\newcommand{\trans}{\mathsf{T}}
+\newcommand{\bx}{\mathbf{x}}
+\newcommand{\by}{\mathbf{y}}
+\newcommand{\bw}{\mathbf{w}}
+\newcommand{\distNorm}{\mathcal{N}}
+\newcommand{\bzero}{\mathbf{0}}
+\newcommand{\ident}{\mathbb{I}}
+
+\begin{document}
+
+
+\begin{problem}[10pts]
+One intuitive way to summarize a probability density is via the mode,
+as this is the ``most likely'' value in some sense. A common example
+of this is using the maximum \textit{a posteriori} (MAP) estimate of a
+model's parameters. In high dimensions, however, the mode becomes
+less and less representative of typical samples. Consider variates
+from a~$D$-dimensional zero mean spherical Gaussian with unit
+variance:
+\begin{align*}
+ \bx &\sim \distNorm(\bzero_D, \ident_D),
+\end{align*}
+where~$\bzero_D$ indicates a column vector of~$D$ zeros and~$\ident_D$
+is a~${D\times D}$ identity matrix.
+\begin{enumerate}
+ \item Compute the distribution that this implies over the distance
+ of these points from the origin. That is, compute the
+ distribution over~$\sqrt{\bx^\trans\bx}$, if~$\bx$ is a
+ realization from~$\distNorm(\bzero_D, \ident_D)$. (Hint: Consider
+ transformations of a Gamma distribution.)
+ \item Make a plot that shows this probability density function for
+    several different values of~$D$, up to~${D=100}$.
+
+ \item Make a plot of the cumulative distribution function (CDF) over
+    this distance distribution for~${D=100}$. (A closed-form solution
+    may be difficult to compute, so you can do this numerically.)
+
+ \item From examining the CDF we can think about where most of the
+ mass lives as a function of radius. For example, most of the mass
+ for~${D=100}$ is within a thin spherical shell. From eyeballing
+ the plot, what are the inner and outer radii for the shell that
+ contains 90\% of the mass in this case?
+\end{enumerate}
+\end{problem}
+
+\paragraph{Solution.} The code for this problem can be found in file
+\textsf{p1.py}.
+
+1. By definition of $\bx$, its coordinates are
+independent univariate Gaussian variables of mean $0$ and variance $1$. Hence,
+$\bx^T\bx = \sum_{i=1}^D\bx_i^2$ follows a chi-squared distribution with $D$
+degrees of freedom. So for all $0\leq a\leq b$ we have:
+\begin{align*}
+    p(a\leq \sqrt{\bx^T\bx}\leq b)
+    &= p(a^2\leq \bx^T\bx \leq b^2)
+    =\frac{1}{2^{D/2}\Gamma(D/2)}\int_{a^2}^{b^2}x^{D/2-1}e^{-x/2}\,dx\\
+    &= \frac{1}{2^{D/2-1}\Gamma(D/2)} \int_a^b u^{D-1}e^{-u^2/2}\,du
+\end{align*}
+where we used the density of the chi-squared distribution for the second
+equality and the change of variables $u=\sqrt{x}$ for the last equality. Since
+this holds for all $a\leq b$, the density of the distance is given by:
+\begin{displaymath}
+ p(u) = \frac{1}{2^{D/2-1}\Gamma(D/2)} u^{D-1}e^{-u^2/2}
+\end{displaymath}
+
+2. The pdf for different values of $D$ can be found below:
+\begin{center}
+\includegraphics[width=0.8\textwidth]{dist1.pdf}
+\end{center}
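+
+For reference, here is a minimal sketch of how such a plot could be produced
+from the density derived above (illustrative only; not necessarily identical
+to the code in \textsf{p1.py}):
+\begin{verbatim}
+import numpy as np
+import matplotlib.pyplot as plt
+from scipy.special import gammaln
+
+def distance_pdf(u, D):
+    # density of the distance ||x|| for x ~ N(0, I_D), derived in part 1,
+    # computed in log-space for numerical stability
+    log_p = (D - 1) * np.log(u) - u**2 / 2 \
+            - (D / 2 - 1) * np.log(2) - gammaln(D / 2)
+    return np.exp(log_p)
+
+u = np.linspace(1e-3, 15, 1000)
+for D in [1, 2, 5, 10, 25, 50, 100]:
+    plt.plot(u, distance_pdf(u, D), label="D = %d" % D)
+plt.xlabel("distance from the origin")
+plt.ylabel("density")
+plt.legend()
+plt.savefig("dist1.pdf")
+\end{verbatim}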
+
+3. The cdf for $D=100$ is plotted below:
+\begin{center}
+\includegraphics[width=0.8\textwidth]{cumul.pdf}
+\end{center}
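+
+As a sketch, the cdf can be evaluated numerically, for instance with the chi
+distribution implemented in \textsf{scipy} (again illustrative, not necessarily
+what \textsf{p1.py} does):
+\begin{verbatim}
+import numpy as np
+import matplotlib.pyplot as plt
+from scipy.stats import chi
+
+# the distance ||x|| follows a chi distribution with D degrees of freedom
+D = 100
+u = np.linspace(0, 15, 1000)
+plt.plot(u, chi(D).cdf(u))
+plt.xlabel("distance from the origin")
+plt.ylabel("cumulative probability")
+plt.savefig("cumul.pdf")
+\end{verbatim}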
+
+4. Eyeballing the plot, 90\% of the mass lies in the shell whose inner and
+   outer radii are approximately $9$ and $11$ respectively.
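+
+The eyeballed values can be checked numerically against the quantiles of the
+same chi distribution; a minimal sketch:
+\begin{verbatim}
+from scipy.stats import chi
+
+# radii of the spherical shell containing the central 90% of the mass
+inner, outer = chi(100).ppf([0.05, 0.95])
+\end{verbatim}
+This should land close to the values of $9$ and $11$ read off the plot.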
+
+\begin{problem}[Bayesian Linear Regression, 30pts]
+Here are some one-dimensional data to regress:
+\begin{verbatim}
+x = [-1.87 -1.76 -1.67 -1.22 -0.07 0.11 0.67 1.60 2.22 2.51]'
+y = [0.06 1.67 0.54 -1.45 -0.18 -0.67 0.92 2.95 5.13 5.18]'
+\end{verbatim}
+Construct a Bayesian linear regression model using Gaussian observation noise with fixed variance $\sigma^2$, and a basis of your choosing (e.g., polynomial, sinusoids, radial basis functions).
+For the prior on the regression weights, use a normal distribution:
+$$p(\bw) = \distNorm(\bw | \bw_0, \mathbf V_0)$$
+
+
+\begin{enumerate}
+ \item Write down:
+ \begin{enumerate}
+ \item your basis
+ \item the posterior distribution on $\bw$ given the data: $p(\bw | \bx, \by, \bw_0, \mathbf V_0, \sigma^2)$.
+ \item the posterior predictive distribution $p(y^\star | x^\star, \bx, \by, \bw_0, \mathbf V_0, \sigma^2)$ of a new function value $y^\star = f(x^\star)$ obtained by integrating out $\bw$.
+ \item The marginal likelihood, $p(\by | \bx, \bw_0,\boldsymbol V_0, \sigma^2)$ obtained by integrating out the regression weights.
+ \end{enumerate}
+ \item Choose values of $\bw_0, \mathbf V_0$ and $\sigma^2$ that seem sensible for the regression weight prior and the noise.
+ Plot several random functions drawn from your prior.
+
+ \item Plot the data, as well as several typical posterior samples of the function given the data.
+
+ \item Plot the 95\% central credible interval region of the
+ predictive density as a function of~$x$. That is, produce a plot
+ that shows the ``tube'' containing most of the functions that are
+ consistent with the data under your model.
+
+ \item Try varying the basis functions used. For example, you could change the
+ order of polynomial, or how many radial basis functions to put
+ down. Produce a bar plot of the marginal likelihood for several of these hypotheses.
+ Which hypotheses are supported by the data?
+ \end{enumerate}
+\end{problem}
+
+\paragraph{Solution.} 1. (a) I chose Gaussian radial basis functions as my
+basis:
+\begin{displaymath}
+ \phi_j(x) = \exp\left(\frac{-(x-\mu_j)^2}{2\epsilon^2}\right)
+\end{displaymath}
+For the plots I chose 10 basis functions whose $\mu_j$ were evenly spread over
+the interval $[-2, 3]$, with $\epsilon = 1$.
+
+(b) Let us denote by $\Phi(\bx)$ the matrix of the observations after applying
+the basis functions to them. That is, the $i$th row of $\Phi(\bx)$ is $\phi(x_i)
+= [\phi_1(x_i),\ldots,\phi_N(x_i)]$ where $N$ is the number of basis functions.
+Then the posterior is given by:
+\begin{displaymath}
+    p(\bw \given \bx, \by) \propto \distNorm(\by \given \Phi(\bx)\bw, \sigma^2\ident)\,
+    \distNorm(\bw \given \bw_0, \mathbf V_0)
+\end{displaymath}
+Using standard results for Gaussian variables (e.g. Theorem 4.4.1 from Murphy)
+we get that $\bw \given \bx, \by \sim \distNorm(\mu, \Sigma)$ with:
+\begin{displaymath}
+    \Sigma^{-1} = \mathbf V_0^{-1} + \frac{1}{\sigma^2}\Phi(\bx)^T\Phi(\bx)
+    \quad\text{and}\quad \mu = \Sigma \mathbf V_0^{-1}\bw_0
+    +\frac{1}{\sigma^2}\Sigma\Phi(\bx)^T\by
+\end{displaymath}
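+
+As a concrete sketch of these formulas on the data of this problem (illustrative
+code, not necessarily identical to what produced the figures below; the prior
+parameters are the ones chosen in part 2):
+\begin{verbatim}
+import numpy as np
+
+x = np.array([-1.87, -1.76, -1.67, -1.22, -0.07, 0.11, 0.67, 1.60, 2.22, 2.51])
+y = np.array([0.06, 1.67, 0.54, -1.45, -0.18, -0.67, 0.92, 2.95, 5.13, 5.18])
+
+N, eps, sigma2 = 10, 1.0, 1.0       # number of basis functions, epsilon, noise
+centers = np.linspace(-2, 3, N)     # mu_j evenly spread over [-2, 3]
+
+def Phi(x):
+    # design matrix: row i is [phi_1(x_i), ..., phi_N(x_i)]
+    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * eps**2))
+
+w0 = np.zeros(N)                    # prior mean
+V0 = np.eye(N)                      # prior covariance
+
+P = Phi(x)
+Sigma = np.linalg.inv(np.linalg.inv(V0) + P.T @ P / sigma2)
+mu = Sigma @ (np.linalg.inv(V0) @ w0 + P.T @ y / sigma2)
+\end{verbatim}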
+
+(c) Again using Theorem 4.4.1 we compute the posterior predictive:
+\begin{displaymath}
+ p(y^*\given x^*) \sim \distNorm\left(\phi(x^*)^T\mu, \sigma^2 + \phi(x^*)^T
+ \Sigma \phi(x^*)\right)
+\end{displaymath}
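+
+Continuing the sketch above, the predictive mean and variance can be evaluated
+on a grid of new inputs $x^*$ as follows:
+\begin{verbatim}
+# posterior predictive mean and variance on a grid of test inputs
+xs = np.linspace(-3, 4, 200)
+Ps = Phi(xs)
+pred_mean = Ps @ mu
+pred_var = sigma2 + np.sum((Ps @ Sigma) * Ps, axis=1)  # sigma^2 + phi' Sigma phi
+\end{verbatim}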
+
+(d) And the marginal likelihood:
+\begin{displaymath}
+    p(\by\given \bx)
+    = \distNorm(\Phi(\bx)\bw_0, \sigma^2\ident + \Phi(\bx)\mathbf V_0\Phi(\bx)^T)
+\end{displaymath}
+I am using a slight abuse of notation here: in (c) I meant that the posterior
+predictive is a normal distribution whose parameters were indicated as the
+arguments of the $\distNorm$ function. Here I mean that the marginal likelihood
+is computed by evaluating the density of a normal distribution with the
+indicated parameters at the point $\by$.
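+
+Continuing the sketch above, the (log) marginal likelihood can be evaluated
+numerically as:
+\begin{verbatim}
+# log marginal likelihood: log density of y under N(Phi(x) w0, sigma^2 I + Phi V0 Phi')
+from scipy.stats import multivariate_normal
+
+ml_cov = sigma2 * np.eye(len(x)) + P @ V0 @ P.T
+log_ml = multivariate_normal(P @ w0, ml_cov).logpdf(y)
+\end{verbatim}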
+
+2. I chose $\bw_0$ to be the zero vector, $\mathbf V_0$ to be the identity, and
+   $\sigma^2$ to be $1$. Several functions drawn from the prior are plotted below:
+\begin{center}
+\includegraphics[width=0.7\textwidth]{prior.pdf}
+\end{center}
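+
+A sketch of how such prior draws can be generated (reusing the quantities
+defined in the sketch of part 1):
+\begin{verbatim}
+# draw weight vectors from the prior and plot the corresponding functions
+import matplotlib.pyplot as plt
+
+xs = np.linspace(-3, 4, 200)
+for _ in range(5):
+    w = np.random.multivariate_normal(w0, V0)
+    plt.plot(xs, Phi(xs) @ w)
+plt.savefig("prior.pdf")
+\end{verbatim}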
+
+3. The data and several functions drawn from the posterior are plotted below:
+\begin{center}
+\includegraphics[width=0.8\textwidth]{posterior.pdf}
+\end{center}
+
+4. The data, the mean of the posterior predictive, and the 95\% interval of the
+posterior predictive (blue-grey tube) can be found below. For any point $x$,
+the posterior predictive distribution at this point is a univariate Gaussian
+whose mean and variance were computed in part 1(c). The 95\% interval is computed
+by taking the interval of half-width $1.96$ standard deviations around the mean.
+\begin{center}
+\includegraphics[width=0.8\textwidth]{predictive.pdf}
+\end{center}
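+
+A sketch of how this tube can be drawn from the predictive mean and variance
+computed in the sketch of part 1(c):
+\begin{verbatim}
+# 95% central credible interval of the posterior predictive
+import matplotlib.pyplot as plt
+
+half_width = 1.96 * np.sqrt(pred_var)
+plt.fill_between(xs, pred_mean - half_width, pred_mean + half_width, alpha=0.3)
+plt.plot(xs, pred_mean)
+plt.scatter(x, y)
+plt.savefig("predictive.pdf")
+\end{verbatim}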
+
+5. The plot below shows the marginal likelihood as a function of the number of
+ basis functions in my RBF basis.
+\begin{center}
+\includegraphics[width=0.8\textwidth]{ml.pdf}
+\end{center}
+It seems that a basis of about 25 functions is a reasonable hypothesis.
+
+
+
+\begin{problem}[30pts, Hastie et al., Murphy]
+In this problem, we'll apply logistic regression to a data set of spam
+email. These data consist of 4601 email messages, from which 57
+features have been extracted. These are as follows:
+\begin{itemize}
+ \item $48$ features in $[0,100]$, giving the percentage of words in
+ a given message which match a given word on a list containing,
+ e.g., ``business'', ``free'', etc.
+ \item $6$ features in $[0,100]$, giving the percentage of characters
+ in the email that match characters on a list containing, e.g.,
+ ``\$'', ``\#'', etc.
+ \item Feature 55: The average length of an uninterrupted sequence of
+ capital letters.
+ \item Feature 56: The length of the longest uninterrupted sequence
+ of capital letters.
+ \item Feature 57: The sum of the lengths of uninterrupted sequences
+ of capital letters.
+\end{itemize}
+There are files \texttt{spam.train.dat} and \texttt{spam.test.dat}
+(available on the course website) in which each row is an email.
+There are 3000 training and 1601 test examples. The final column in
+each file indicates whether the email was spam.
+
+\begin{enumerate}
+ \item Code up ~$\ell_2$-regularized logistic regression.
+ That is, code up a model of the form:
+ \begin{align}
+ p(\mathbf{w}) & = \mathcal{N}(0, \sigma^2 I) \\
+ p(y_i = 1 | \mathbf{x}, \mathbf{w}) & = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}
+ \end{align}
+
+ and use it to compute the MAP estimate of $\mathbf{w}$, given the training examples.
+ Compute the gradients of the posterior probability of the data with respect to $\mathbf{w}$, and check that they're correct numerically.
+ Show the code you used for computing the derivatives.
+
+ \item Use
+ 10-fold cross-validation on the training set to determine an appropriate regularization
+ penalty $\sigma^2$. Plot different values of $\log_e \sigma^2$ ranging from -8 to 8 against the cross-validation error and test error. Which is the best?
+ \item There are different ways one might preprocess the data. One
+ typical thing to do is to linearly ``standardize'' each input feature so
+ that it has mean zero and variance one. Do this standardization
+ and evaluate the model again. Report the discovered $\bw$, and the training and test error after performing cross-validation as in part 2.
+ \item It may also be the case that some dimensions are better represented a binary variables, or by taking the logarithm, than the given values.
+ Hypothesize whether any of the variables in this data sets might be better coded as binary (or using their log), and explain your reasoning.
+ Then do the transform and apply the same cross-validation as before. Do your results improve? Conjecture why or why not.
+\end{enumerate}
+
+\end{problem}
+
+\paragraph{Solution.} The code for this problem can be found in \textsf{p3.py}.
+
+1. The MAP estimate of $w$ is given by:
+\begin{displaymath}
+    w^* = \argmax_{w} \sum_{i=1}^n \log p(y_i \given x_i, w) + \log p(w)
+\end{displaymath}
+For $\ell_2$-regularized logistic regression:
+\begin{align*}
+    \log p(y_i\given x_i, w) &= - y_i\log\left(1+\exp(-w^Tx_i)\right)
+ + (1- y_i)\log\left(\frac{\exp(-w^Tx_i)}{1+\exp(-w^Tx_i)}\right)\\
+ &= -(1-y_i) w^Tx_i -\log\left(1+\exp(-w^T x_i)\right)
+\end{align*}
+and:
+\begin{displaymath}
+ \log p(w) = -\frac{1}{2\sigma^2}\|w\|^2 + C
+\end{displaymath}
+where $C$ is a normalization constant which does not depend on $w$. The
+MAP optimization problem can thus be written as:
+\begin{displaymath}
+ w^* = \argmin_w \sum_{i=1}^n \log\left(1+\exp(-w^Tx_i)\right)
+ + \sum_{i=1}^n(1-y_i)w^Tx_i + \frac{1}{2\sigma^2}\|w\|^2
+\end{displaymath}
+Denote by $L(w)$ the objective function of the above optimization problem.
+Its gradient is given by:
+\begin{displaymath}
+ \nabla L(w) = \sum_{i=1}^n \frac{-x_i}{1+\exp(w^Tx_i)}
+ + \sum_{i=1}^n (1-y_i)x_i + \frac{1}{\sigma^2}w
+\end{displaymath}
+The code to compute $L$ and its gradient and to test the gradient can be found
+in \textsf{p3.py}.
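+
+For completeness, here is a simplified sketch of the objective, its gradient,
+and the numerical check (not necessarily identical to \textsf{p3.py}; the
+random data used for the check is only for illustration):
+\begin{verbatim}
+import numpy as np
+from scipy.optimize import check_grad, minimize
+
+def neg_log_posterior(w, X, y, sigma2):
+    # L(w) as derived above (up to constants independent of w)
+    z = X @ w
+    return (np.sum(np.logaddexp(0, -z)) + np.sum((1 - y) * z)
+            + w @ w / (2 * sigma2))
+
+def grad(w, X, y, sigma2):
+    z = X @ w
+    return -X.T @ (1 / (1 + np.exp(z))) + X.T @ (1 - y) + w / sigma2
+
+# numerical check of the gradient on small random data; should print a small value
+X, y = np.random.randn(20, 5), np.random.randint(0, 2, 20)
+print(check_grad(neg_log_posterior, grad, np.random.randn(5), X, y, 1.0))
+
+# MAP estimate
+w_map = minimize(neg_log_posterior, np.zeros(5), jac=grad, args=(X, y, 1.0)).x
+\end{verbatim}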
+
+2. The plot of the cross-validation error and test error as a function of $\log
+ \sigma^2$ can be found below:
+\begin{center}
+\includegraphics[width=0.8\textwidth]{scores_txt.pdf}
+\end{center}
+I used the fraction of misclassified samples as the measure of error. Strangely,
+the plot doesn't exhibit the standard ``sweet spot'' behavior: the test error
+should increase for large values of $\sigma^2$, because at that point we
+are simply over-fitting the data, but I haven't been able to understand why
+I am not observing this. In practice, I would use $\log\sigma^2 = -2$.
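+
+A sketch of the cross-validation loop, using scikit-learn as a shortcut for the
+MAP estimation instead of my own solver (with the penalty
+$\frac{1}{2\sigma^2}\|w\|^2$, the corresponding regularization setting is
+$C = \sigma^2$, up to the unpenalized intercept; the file is assumed to be
+whitespace-separated with the label in the last column, as described in the
+problem):
+\begin{verbatim}
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import KFold
+
+data = np.loadtxt("spam.train.dat")
+X_train, y_train = data[:, :-1], data[:, -1]
+
+cv_errors = []
+for log_sigma2 in np.arange(-8, 9):
+    errs = []
+    for tr, va in KFold(n_splits=10).split(X_train):
+        clf = LogisticRegression(C=np.exp(log_sigma2), max_iter=1000)
+        clf.fit(X_train[tr], y_train[tr])
+        errs.append(np.mean(clf.predict(X_train[va]) != y_train[va]))
+    cv_errors.append(np.mean(errs))
+\end{verbatim}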
+
+3. The cross-validation plot for the normalized data can be found below:
+\begin{center}
+\includegraphics[width=0.8\textwidth]{scores2_txt.pdf}
+\end{center}
+The error is much smaller in this case. For $\log\sigma^2 = 2$, I obtained
+a cross-validation error of 0.073 and a test error of 0.086. The value of $w$
+can be found in the file \textsf{w.txt.pdf}.
+
+4. I believe features 55, 56 and 57 would be better represented by taking their
+   logarithms. My reasoning is the following: if, for example, an email is
+   written entirely in capital letters, the variable will simply be equal to the
+   length of the email. But two emails of different lengths written entirely in
+   capitals should be considered equally bad. In other words, as long as there
+   are many capital letters, whether there are 100 or 1000, it seems to me that
+   the emails are almost equally likely to be spam. Taking the logarithm favors
+   the order of magnitude over the actual count.
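+
+A minimal sketch of this transformation (with 0-based indexing, features 55--57
+are columns 54--56 of the feature matrix from the previous sketch):
+\begin{verbatim}
+# log-transform the capital-letter features; log1p in case a count is 0
+X_log = X_train.copy()
+X_log[:, 54:57] = np.log1p(X_log[:, 54:57])
+\end{verbatim}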
+
+   The plot of the CV and test error after this transformation can be found
+   below:
+\begin{center}
+\includegraphics[width=0.8\textwidth]{scores3_txt.pdf}
+\end{center}
+Despite the transformation, the error stays essentially the same.
+
+\end{document}