-rw-r--r--   DECSUP00465.tex     1
-rw-r--r--   jmlr-17-687.tex   174
2 files changed, 174 insertions(+), 1 deletion(-)
diff --git a/DECSUP00465.tex b/DECSUP00465.tex
index ab2e99f..73ba764 100644
--- a/DECSUP00465.tex
+++ b/DECSUP00465.tex
@@ -89,7 +89,6 @@ of infinitely many data points, the steps are combined in a way which makes it
hard to be convinced that the entire procedure converges to a good solution in
the limit.
-
\paragraph{Additional remark on the overall approach.} At a high-level, the
problem is an instance of the following problem: optimizing an
unknown function based on noisy observations of its values. This is
diff --git a/jmlr-17-687.tex b/jmlr-17-687.tex
new file mode 100644
index 0000000..53c5fc0
--- /dev/null
+++ b/jmlr-17-687.tex
@@ -0,0 +1,174 @@
+\documentclass[12pt]{article}
+\usepackage[T1]{fontenc}
+\usepackage[utf8]{inputenc}
+\usepackage[hmargin=1.2in, vmargin=1.2in]{geometry}
+
+\title{Review of \emph{Proximal Distance Algorithms: Theory and Practice}}
+\date{}
+\begin{document}
+
+\maketitle
+
+\vspace{-3em}
+
+\paragraph{Summary of the results.} This paper introduces a new class of
+algorithms called \emph{proximal distance algorithms} to find the minimum of an
+objective function $f$ over a constraint set $\mathcal{C}$. The main idea is to
+combine the penalty method with the majorization-minimization (MM) principle,
+specifically:
+\begin{itemize}
+ \item the algorithm solves a sequence of unconstrained minimization
+ problems with objective $f(x) + \rho q(x)$ where $q$ is a penalty
+ function which measures the amount of violation of the constraints and
+ where the parameter $\rho$ is chosen according to a sequence increasing
+ towards $\infty$.
+ \item the proposed penalty function is the squared distance: ${\rm dist}(x,
+ \mathcal{C})^2$ or more precisely, if $\mathcal{C} = \bigcap_{i=1}^m
+ C_i$ is the intersection of closed sets, $q(x)
+ = \frac{1}{m}\sum_{i=1}^m \|x-P_{C_i}(x)\|^2$ where $P_{C_i}$ denotes
+ the (possibly multi-valued) projection of $x$ onto the set $C_i$.
+    \item the optimization problem for fixed $\rho$ is solved using the MM
+        principle, by defining $x_{k+1}$ at step $k+1$ as a minimizer of the
+        surrogate function:
+ \begin{displaymath}
+ g(x|x_k) = f(x) + \frac{\rho}{2m}\sum_{i=1}^m \|x-P_{C_i}(x_k)\|^2
+ \end{displaymath}
+\end{itemize}
+
+The authors observe that minimizing the surrogate upper bound on the objective
+function amounts to computing the proximal operator of $\rho^{-1}f$ at the
+point $y_k = \frac{1}{m}\sum_{i=1}^m P_{C_i}(x_k)$, thus connecting this
+algorithm to the field of proximal algorithms and leading to a simple analysis
+of the convergence of the proposed algorithm for fixed $\rho > 0$. An
+accelerated variant of the algorithm is also proposed. The authors then show
+how to instantiate their algorithms on seven different examples and provide
+experimental evidence that the algorithms compare favorably to traditional
+general-purpose convex optimization algorithms.
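+
+Written out, the inner update that this summary describes for fixed $\rho > 0$
+is, in my own compact restatement (obtained by expanding the squares in
+$g(x|x_k)$ and dropping the terms that do not depend on $x$; this is
+a reformulation, not a formula quoted from the paper):
+\begin{displaymath}
+    y_k = \frac{1}{m}\sum_{i=1}^m P_{C_i}(x_k),
+    \qquad
+    x_{k+1} = \mathop{\rm argmin}_x \Big\{ f(x)
+        + \frac{\rho}{2}\,\|x - y_k\|^2 \Big\}
+    = {\rm prox}_{\rho^{-1}f}(y_k).
+\end{displaymath}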
+
+\paragraph{Comments on the proposed algorithm.} The proposed algorithms follow
+from simple and general principles and lead to intuitive-looking procedures,
+which is a positive point. The interpretation of the optimization for fixed
+$\rho > 0$ via proximal operators is interesting since it allows the proposed
+algorithms to leverage what is already known about computing proximal
+operators and provides a proof of convergence of the procedure for fixed
+$\rho > 0$ essentially for free.
+
+A complaint about the proposed algorithms: the paper departs from the
+traditional way of writing the constraints as a set of inequalities $f_i(x)\leq
+0$ or equalities $f_j(x) = 0$ (typically for convex functions $f_i$ or $f_j$)
+and instead writes them as $x\in \cap_{i=1}^m C_i$ for (typically convex) sets
+$C_i$. This allows the authors to use $\|x-P_{C_i}(x)\|^2$ as the penalty
+function as opposed to $\max (0, f_i(x))^q$ for some exponent $q$ as is usual
+in traditional applications of the penalty method. While very convenient at an
+abstract mathematical level, the notation $P_{C_i}(x)$ and this choice of
+penalty function hides the implicit assumption that one has access to
+a representation of the convex sets $C_i$ defining the constraints in a way
+which allows for easy computation of $P_{C_i}(x)$. While the examples show that
+this is not necessarily a problem, this implicit assumption should be
+highlighted and commented on: if the optimization problem we are facing is
+written with constraints $f_i(x) \leq 0$ and only ``black-box'' access to $f_i$
+is provided, then there is no efficient way to represent the associated convex
+set $C_i$ or to project onto it. While mathematically equivalent, the choice
+of representation of the constraints matters.
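+
+To make the contrast concrete (this is my own illustration, not an example
+taken from the paper): if a constraint set is given explicitly as a halfspace
+$C = \{x : a^\top x \leq b\}$, the projection has the closed form
+\begin{displaymath}
+    P_C(x) = x - \frac{\max(0,\, a^\top x - b)}{\|a\|^2}\, a,
+\end{displaymath}
+whereas if the same set is only accessible through black-box evaluations of
+a function $f_i$ with $C = \{x : f_i(x)\leq 0\}$, computing $P_C(x)$ is itself
+a constrained optimization problem of the same nature as the original one.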
+
+\paragraph{Absence of theoretical results.} The only theoretical result in this
+paper is the proof of convergence of the procedure to solve the penalized
+problem for fixed $\rho>0$ via the distance majorization principle. This follows
+from the interpretation of the procedure as a proximal algorithm and a direct
+application of the general convergence theorem of Krasnosel'skii and Mann. This
+still leaves many gaps in the theoretical understanding of the proposed
+algorithms:
+\begin{itemize}
+ \item importantly, the theoretical discussion does not give any indication
+ of the \textbf{rate of convergence} of the procedure and how it
+ compares to other traditional algorithms in convex optimization.
+    \item there is no proof of convergence of the overall procedure where the
+        parameter $\rho$ is chosen according to a diverging sequence. There is
+        only a proof sketch at the end of section 3, which assumes that the
+        objective function is coercive (without any comment on whether the
+        objective functions considered in the examples satisfy this
+        assumption).
+    \item the overall procedure assumes that the inner optimization problem for
+        fixed $\rho$ is solved exactly, which is not the case in practice. An
+        analysis of this nested algorithm should account for the approximate
+        solution of the problem for fixed $\rho > 0$ and for how approximation
+        errors accumulate when $\rho$ changes. In fact, the details of the
+        definition of the overall procedure (how to choose the sequence
+        $(\rho_n)$, the fact that the solution $x_{\rho_n}$ of the problem for
+        $\rho_n$ is used as the initial solution for the problem with parameter
+        $\rho_{n+1}$) are not provided and are left as guesswork to the reader;
+        a sketch of what I assume is intended is given after this list (cf.
+        also my comment on the writing below).
+    \item the accelerated variant of the algorithm comes without any
+        convergence guarantee.
+\end{itemize}
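+
+For reference, the nested procedure that I assume is intended (my own
+reconstruction from the pieces scattered in the paper, not a specification
+taken from it) is the following: choose an initial point $x_{\rho_0}$ and an
+increasing sequence $\rho_0 < \rho_1 < \cdots \to \infty$, and for
+$n = 1, 2, \ldots$ compute
+\begin{displaymath}
+    x_{\rho_n} \approx \mathop{\rm argmin}_x \big\{ f(x) + \rho_n\, q(x) \big\}
+\end{displaymath}
+by running the MM/proximal iteration described above warm-started at
+$x_{\rho_{n-1}}$, until some stopping criterion is met. Every ingredient of
+this reconstruction (the choice of the sequence $(\rho_n)$, the warm start,
+the stopping criterion) is precisely what the paper leaves unspecified.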
+
+\paragraph{Comments on the experiments.} The chosen examples are interesting,
+cover a wide range of applications, and demonstrate that the proposed
+algorithms can be instantiated successfully in all these applications. The
+fact that the details of the instantiations (how to compute the projections,
+etc.) are worked out in detail is valuable.
+
+A surprising and regrettable fact is that the algorithm is compared against
+other algorithms only via computation time (CPU time), which is a very indirect
+measure of the efficiency of an algorithm since it is highly dependent on
+implementation choices such as the programming language. A more usual and
+better (or at least complementary) way to make the comparison would be to show
+plots of the approximation error on the y-axis as the number of iterations
+increases on the x-axis. Without this, it is really hard to compare the
+algorithms in terms of rate of convergence, especially given the absence of
+theoretical guarantees on the rate of convergence (cf. my comment above on the
+absence of theoretical results).
+
+\paragraph{Comments on the writing.} Given that the paper aims at expanding on
+an already published paper on the same topic, the exposition should be
+particularly polished to provide a good overview of this class of algorithms.
+Unfortunately, the current exposition suffers from important lapses:
+\begin{itemize}
+    \item as already mentioned above, the overall procedure is never described
+        in full detail in one place but is instead presented in multiple
+        pieces throughout the first three sections. The fact that the overall
+        procedure is in fact a nested procedure, where a different optimization
+        problem is solved at each step, could easily be overlooked. Moreover,
+        the details of this nested procedure (the choice of the diverging
+        sequence for $\rho$, how the problems for the different values of
+        $\rho$ are combined) are never given in the theoretical sections and
+        are only briefly commented on in the experimental section. The
+        paper would greatly benefit from a complete specification of the
+        algorithm, for example in a dedicated framed box or ``algorithm''
+        environment.
+    \item sections 1, 2 and 3 are not well-structured, with an uninterrupted
+        flow of theoretical derivations, examples, proofs and proof sketches
+        without separators. The exposition would benefit from clearly
+        indicating the examples (using a dedicated environment, for example).
+        The theoretical guarantee(s) in section 3 should be clearly stated in
+        a separate theorem or proposition environment, followed by a proof
+        whose beginning and end can be clearly identified. The accelerated
+        variant could be presented in a separate section, possibly also in
+        a dedicated framed box or algorithm environment.
+\end{itemize}
+
+A few additional minor comments on the writing:
+\begin{itemize}
+ \item the notation $P_C$ should be clearly defined at the point it is
+ introduced (notably, the fact that it is in general multi-valued,
+ a point which only becomes clear later in the paper).
+ \item in the introduction, the sentence ``fills in many gaps of theory and
+ practice'' should be rephrased: it is not clear which gaps are filled
+ by the current exposition.
+    \item the writing suffers at times from the use of subjective wording, such
+        as ``painfully slow'' at the bottom of page 5, which would be better
+        replaced by a precise quantitative statement. Similarly, the first
+ sentence of the introduction ``part science and part art'', while
+ arguably acceptable at the beginning of an introduction, would benefit
+ from some additional justification or a discussion of what is actually
+ meant.
+\end{itemize}
+
+\paragraph{Conclusion.} While the proposed algorithms are interesting, and both
+the connection to proximal algorithms and the fact that they follow from
+general principles are appealing, the paper suffers from serious lapses in the
+theoretical analysis, the exposition and the experimental comparison to other
+methods.
+
+
+\end{document}