-rw-r--r--   DECSUP00465.tex     1
-rw-r--r--   jmlr-17-687.tex   174
2 files changed, 174 insertions(+), 1 deletion(-)
diff --git a/DECSUP00465.tex b/DECSUP00465.tex
index ab2e99f..73ba764 100644
--- a/DECSUP00465.tex
+++ b/DECSUP00465.tex
@@ -89,7 +89,6 @@ of infinitely many data points, the steps are combined in a way which makes it
hard to be convinced that the entire procedure converges to a good solution in
the limit.
-
\paragraph{Additional remark on the overall approach.} At a high-level, the
problem is an instance of the following problem: optimizing an
unknown function based on noisy observations of its values. This is
diff --git a/jmlr-17-687.tex b/jmlr-17-687.tex
new file mode 100644
index 0000000..53c5fc0
--- /dev/null
+++ b/jmlr-17-687.tex
@@ -0,0 +1,174 @@
+\documentclass[12pt]{article}
+\usepackage[T1]{fontenc}
+\usepackage[utf8]{inputenc}
+\usepackage[hmargin=1.2in, vmargin=1.2in]{geometry}
+
+\title{Review of \emph{Proximal Distance Algorithms: Theory and Practice}}
+\date{}
+\begin{document}
+
+\maketitle
+
+\vspace{-3em}
+
+\paragraph{Summary of the results.} This paper introduces a new class of
+algorithms called \emph{proximal distance algorithms} to find the minimum of an
+objective function $f$ over a constraint set $\mathcal{C}$. The main idea is to
+combine the penalty method with the majorization-minimization (MM) principle,
+specifically:
+\begin{itemize}
+ \item the algorithm solves a sequence of unconstrained minimization
+ problems with objective $f(x) + \rho q(x)$ where $q$ is a penalty
+ function which measures the amount of violation of the constraints and
+ where the parameter $\rho$ is chosen according to a sequence increasing
+ towards $\infty$.
+ \item the proposed penalty function is the squared distance: ${\rm dist}(x,
+ \mathcal{C})^2$ or more precisely, if $\mathcal{C} = \bigcap_{i=1}^m
+ C_i$ is the intersection of closed sets, $q(x)
+ = \frac{1}{m}\sum_{i=1}^m \|x-P_{C_i}(x)\|^2$ where $P_{C_i}$ denotes
+ the (possibly multi-valued) projection of $x$ onto the set $C_i$.
+    \item the optimization problem for fixed $\rho$ is solved using the MM
+        principle, by defining $x_{k+1}$ at step $k+1$ as a minimizer of the
+        surrogate function:
+ \begin{displaymath}
+ g(x|x_k) = f(x) + \frac{\rho}{2m}\sum_{i=1}^m \|x-P_{C_i}(x_k)\|^2
+ \end{displaymath}
+\end{itemize}
+
+The authors observe that minimizing the surrogate upper bound on the objective
+function amounts to computing the proximal operator of $\rho^{-1}f$ at the
+point $y_k = \frac{1}{m}\sum_{i=1}^m P_{C_i}(x_k)$, thus connecting this
+algorithm to the field of proximal algorithms and leading to a simple analysis
+of the convergence of the proposed algorithm for fixed $\rho > 0$. An
+accelerated variant of the algorithm is also proposed. The authors then show
+how to instantiate their algorithms on seven different examples and provide
+experimental evidence that the algorithms compare favorably to traditional
+general-purpose convex optimization algorithms.
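+
+Written out, the inner update that this summary describes for fixed $\rho > 0$
+is, in my own compact restatement (obtained by expanding the squares in
+$g(x|x_k)$ and dropping the terms that do not depend on $x$; this is
+a reformulation, not a formula quoted from the paper):
+\begin{displaymath}
+    y_k = \frac{1}{m}\sum_{i=1}^m P_{C_i}(x_k),
+    \qquad
+    x_{k+1} = \mathop{\rm argmin}_x \Big\{ f(x)
+        + \frac{\rho}{2}\,\|x - y_k\|^2 \Big\}
+    = {\rm prox}_{\rho^{-1}f}(y_k).
+\end{displaymath}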
+
+\paragraph{Comments on the proposed algorithm.} The proposed algorithms follow
+from simple and general principles and lead to intuitive-looking procedures,
+which is a positive point. The interpretation of the optimization for fixed
+$\rho > 0$ via proximal operators is interesting since it allows the proposed
+algorithms to leverage what is already known about computing proximal
+operators and provides a proof of convergence of the procedure for fixed
+$\rho > 0$ essentially for free.
+
+A complaint about the proposed algorithms: the paper departs from the
+traditional way of writing the constraints as a set of inequalities $f_i(x)\leq
+0$ or equalities $f_j(x) = 0$ (typically for convex functions $f_i$ or $f_j$)
+and instead writes them as $x\in \cap_{i=1}^m C_i$ for (typically convex) sets
+$C_i$. This allows the authors to use $\|x-P_{C_i}(x)\|^2$ as the penalty
+function as opposed to $\max (0, f_i(x))^q$ for some exponent $q$ as is usual
+in traditional applications of the penalty method. While very convenient at an
+abstract mathematical level, the notation $P_{C_i}(x)$ and this choice of
+penalty function hides the implicit assumption that one has access to
+a representation of the convex sets $C_i$ defining the constraints in a way
+which allows for easy computation of $P_{C_i}(x)$. While the examples show that
+this is not necessarily a problem, this implicit assumption should be
+highlighted and commented on: if the optimization problem we are facing is
+written with constraints $f_i(x) \leq 0$ and only ``black-box'' access to $f_i$
+is provided, then there is no efficient way to represent the associated convex
+set $C_i$ or to project onto it. While mathematically equivalent, the choice
+of representation of the constraints matters.
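+
+To make the contrast concrete (this is my own illustration, not an example
+taken from the paper): if a constraint set is given explicitly as a halfspace
+$C = \{x : a^\top x \leq b\}$, the projection has the closed form
+\begin{displaymath}
+    P_C(x) = x - \frac{\max(0,\, a^\top x - b)}{\|a\|^2}\, a,
+\end{displaymath}
+whereas if the same set is only accessible through black-box evaluations of
+a function $f_i$ with $C = \{x : f_i(x)\leq 0\}$, computing $P_C(x)$ is itself
+a constrained optimization problem of the same nature as the original one.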
+
+\paragraph{Absence of theoretical results.} The only theoretical result in this
+paper is the proof of convergence of the procedure to solve the penalized
+problem for fixed $\rho>0$ via the distance majorization principle. This follows
+from the interpretation of the procedure as a proximal algorithm and a direct
+application of the general convergence theorem of Krasnosel'skii and Mann. This
+still leaves many gaps in the theoretical understanding of the proposed
+algorithms:
+\begin{itemize}
+ \item importantly, the theoretical discussion does not give any indication
+ of the \textbf{rate of convergence} of the procedure and how it
+ compares to other traditional algorithms in convex optimization.
+    \item there is no proof of convergence of the overall procedure where the
+        parameter $\rho$ is chosen according to a diverging sequence. There is
+        only a proof sketch at the end of section 3, which assumes that the
+        objective function is coercive (without any comment on whether the
+        objective functions considered in the examples satisfy this
+        assumption).
+    \item the overall procedure assumes that the inner optimization problem for
+        fixed $\rho$ is solved exactly, which is not the case in practice. An
+        analysis of this nested algorithm should account for the approximate
+        solution of the problem for fixed $\rho > 0$ and for how approximation
+        errors accumulate when $\rho$ changes. In fact, the details of the
+        definition of the overall procedure (how to choose the sequence
+        $(\rho_n)$, the fact that the solution $x_{\rho_n}$ of the problem for
+        $\rho_n$ is used as the initial solution for the problem with parameter
+        $\rho_{n+1}$) are not provided and are left as guesswork to the reader;
+        a sketch of what I assume is intended is given after this list (cf.
+        also my comment on the writing below).
+    \item the accelerated variant of the algorithm comes without any
+        convergence guarantee.
+\end{itemize}
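+
+For reference, the nested procedure that I assume is intended (my own
+reconstruction from the pieces scattered in the paper, not a specification
+taken from it) is the following: choose an initial point $x_{\rho_0}$ and an
+increasing sequence $\rho_0 < \rho_1 < \cdots \to \infty$, and for
+$n = 1, 2, \ldots$ compute
+\begin{displaymath}
+    x_{\rho_n} \approx \mathop{\rm argmin}_x \big\{ f(x) + \rho_n\, q(x) \big\}
+\end{displaymath}
+by running the MM/proximal iteration described above warm-started at
+$x_{\rho_{n-1}}$, until some stopping criterion is met. Every ingredient of
+this reconstruction (the choice of the sequence $(\rho_n)$, the warm start,
+the stopping criterion) is precisely what the paper leaves unspecified.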
+
+\paragraph{Comments on the experiments.} The chosen examples are interesting,
+cover a wide range of applications, and demonstrate that the proposed
+algorithms can be instantiated successfully in all these applications. The
+fact that the details of the instantiations (how to compute the projections,
+etc.) are worked out in detail is valuable.
+
+A surprising and regrettable fact is that the algorithm is compared against
+other algorithms only via computation time (CPU time), which is a very indirect
+measure of the efficiency of an algorithm since it is highly dependent on
+implementation choices such as the programming language. A more usual and
+better (or at least complementary) way to make the comparison would be to show
+plots of the approximation error on the y-axis as the number of iterations
+increases on the x-axis. Without this, it is really hard to compare the
+algorithms in terms of rate of convergence, especially given the absence of
+theoretical guarantees on the rate of convergence (cf. my comment above on the
+absence of theoretical results).
+
+\paragraph{Comments on the writing.} Given that the paper aims at expanding on
+an already published paper on the same topic, the exposition should be
+particularly polished to provide a good overview of this class of algorithms.
+Unfortunately, the current exposition suffers from important lapses:
+\begin{itemize}
+    \item as already mentioned above, the overall procedure is never described
+        in full detail in one place but is instead presented in multiple
+        pieces throughout the first three sections. The fact that the overall
+        procedure is in fact a nested procedure, where a different optimization
+        problem is solved at each step, could easily be overlooked. Moreover,
+        the details of this nested procedure (the choice of the diverging
+        sequence for $\rho$, how the problems for the different values of
+        $\rho$ are combined) are never given in the theoretical sections and
+        are only briefly commented on in the experimental section. The
+        paper would greatly benefit from a complete specification of the
+        algorithm, for example in a dedicated framed box or ``algorithm''
+        environment.
+    \item sections 1, 2 and 3 are not well-structured, with an uninterrupted
+        flow of theoretical derivations, examples, proofs and proof sketches
+        without separators. The exposition would benefit from clearly
+        indicating the examples (using a dedicated environment, for example).
+        The theoretical guarantee(s) in section 3 should be clearly stated in
+        a separate theorem or proposition environment, followed by a proof
+        whose beginning and end can be clearly identified. The accelerated
+        variant could be presented in a separate section, possibly also in
+        a dedicated framed box or algorithm environment.
+\end{itemize}
+
+A few additional minor comments on the writing:
+\begin{itemize}
+ \item the notation $P_C$ should be clearly defined at the point it is
+ introduced (notably, the fact that it is in general multi-valued,
+ a point which only becomes clear later in the paper).
+ \item in the introduction, the sentence ``fills in many gaps of theory and
+ practice'' should be rephrased: it is not clear which gaps are filled
+ by the current exposition.
+    \item the writing suffers at times from the use of subjective wording, such
+        as ``painfully slow'' at the bottom of page 5, which would be better
+        replaced by a precise quantitative statement. Similarly, the first
+ sentence of the introduction ``part science and part art'', while
+ arguably acceptable at the beginning of an introduction, would benefit
+ from some additional justification or a discussion of what is actually
+ meant.
+\end{itemize}
+
+\paragraph{Conclusion.} While the proposed algorithms are interesting, and both
+the connection to proximal algorithms and the fact that they follow from
+general principles are appealing, the paper suffers from serious lapses in the
+theoretical analysis, the exposition and the experimental comparison to other
+methods.
+
+
+\end{document}