| field | value | date |
|---|---|---|
| author | Thibaut Horel <thibaut.horel@gmail.com> | 2015-10-08 22:37:26 -0400 |
| committer | Thibaut Horel <thibaut.horel@gmail.com> | 2015-10-08 22:37:26 -0400 |
| commit | 0f9fc1bdb1755f5677e3201bcd35706e09235844 | |
| tree | ec4cac94a4f06c873f163abdc0b6c8f77fedfdd4 /hw1 | |
| parent | fb90afefbf3e6d10164a4c1a496e71a688e600f0 | |
| download | cs281-0f9fc1bdb1755f5677e3201bcd35706e09235844.tar.gz | |
[hw1] old changes
Diffstat (limited to 'hw1')
| -rw-r--r-- | hw1/main.tex | 15 |
1 file changed, 8 insertions, 7 deletions
diff --git a/hw1/main.tex b/hw1/main.tex
index ee9b915..02d80ae 100644
--- a/hw1/main.tex
+++ b/hw1/main.tex
@@ -212,14 +212,14 @@ and we obtain the desired formula for the MAP estimator.
 (f) $n$ need not be greater than $m$ since $(\lambda Id + X^TX)$ might be
 invertible even if $X^TX$ is not.
 
-(g) Replacing $X$ by $X'$:
+(g) Let us define $X'$:
 \begin{displaymath}
   X' = \begin{pmatrix}
     X\\
     \sqrt{\lambda}Id
   \end{pmatrix}
 \end{displaymath}
-and $Y^T$ by $Y'^ = [Y\; 0]$ and replacing $X$ by $X'$ in the least square
+Replacing $Y^T$ by $Y'^T = [Y^T\; 0]$ and replacing $X$ by $X'$ in the least squares
 estimator $\hat{\beta}$, we see that $X'^TY' = X^TY$. Furthermore:
 \begin{displaymath}
   X'^TX' = \begin{pmatrix}
@@ -230,8 +230,9 @@ estimator $\hat{\beta}$, we see that $X'^TY' = X^TY$. Furthermore:
     \sqrt{\lambda}Id
   \end{pmatrix}
   = X^TX +\lambda Id
 \end{displaymath}
-Hence we get $\hat{\beta} = (X^TX+\lambda Id)^{-1}X^TY$, which is exactly the
-expression of the MAP estimator in ridge regression.
+Hence the least squares estimator yields $\hat{\beta} = (X^TX+\lambda
+Id)^{-1}X^TY$, which is exactly the expression of the MAP estimator in ridge
+regression.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{problem}[The Dirichlet and multinomial distributions, 12pts]
@@ -507,13 +508,13 @@ by the method L-BFGS for each value of $d$. Answer the following questions:
 We see that the non-linear transformations help reduce the prediction error:
 the data is probably not very well modeled by a linear relationship. Allowing
 more complex relationships (as $d$ becomes larger the relationship becomes more
-complex) fits it better.
+complex) is a better fit and improves prediction.
 
 QR is faster than L-BFGS when the number of features is not too large ($d\leq
 400$). For $d=600$, L-BFGS is faster. The asymptotic complexity of QR is
 $O(nd^2 + d^3)$ whereas the complexity of L-BFGS is $O(nd)$ (for a fixed number
-of iterations). So asymptotically, L-BFGS is be faster as we observe for
-$d=600$ (for smaller values of $d$ the constants hidden in the $O$ do not allow
+of iterations). So asymptotically, L-BFGS is faster, as we observe for $d=600$
+(for smaller values of $d$ the constants hidden in the $O$ do not allow
 a formal comparison).
 
 \end{document}
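The equivalence used in part (g) above is easy to check numerically. Below is a minimal sketch (hypothetical code, not part of this repository; the dimensions and variable names are made up) showing that ordinary least squares on the augmented design matrix recovers the ridge/MAP estimator:

```python
# Hypothetical check, not from the cs281 repo: least squares on the
# augmented system recovers the ridge / MAP estimator from part (g).
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 50, 5, 0.3               # sample count, feature count, penalty
X = rng.normal(size=(n, m))
Y = rng.normal(size=n)

# Stack sqrt(lambda) * I under X and zeros under Y, as in the write-up.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(m)])
Y_aug = np.concatenate([Y, np.zeros(m)])

# Ordinary least squares on the augmented problem ...
beta_ls, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)
# ... versus the closed-form ridge / MAP estimator.
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

assert np.allclose(beta_ls, beta_map)
```

The assertion holds because $X'^TX' = X^TX + \lambda Id$ and $X'^TY' = X^TY$, exactly as derived in the hunk above.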

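The QR-versus-L-BFGS discussion in the last hunk can likewise be sketched. The snippet below is illustrative only, assuming synthetic data and SciPy's generic `L-BFGS-B` solver (the homework's actual solver code is not shown in this diff); wrapping each path in a timer and varying $d$ should reproduce the crossover described above:

```python
# Illustrative sketch, assuming synthetic data and SciPy's L-BFGS-B.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d, lam = 5000, 600, 0.1
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# Direct path, O(nd^2 + d^3): QR on the augmented matrix from part (g).
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
Y_aug = np.concatenate([Y, np.zeros(d)])
Q, R = np.linalg.qr(X_aug)           # reduced QR: R is d x d
beta_qr = np.linalg.solve(R, Q.T @ Y_aug)

# Iterative path, O(nd) per iteration: L-BFGS on the ridge objective.
def objective(beta):
    r = X @ beta - Y
    return 0.5 * (r @ r) + 0.5 * lam * (beta @ beta)

def gradient(beta):
    return X.T @ (X @ beta - Y) + lam * beta

res = minimize(objective, np.zeros(d), jac=gradient, method="L-BFGS-B")
print(np.max(np.abs(res.x - beta_qr)))  # small, up to solver tolerance
```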