\documentclass[12pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1.2in, vmargin=1.2in]{geometry}

\title{Review of \emph{Proximal Distance Algorithms: Theory and Practice}}
\date{}
\begin{document}

\maketitle

\vspace{-3em}

\paragraph{Summary of the results.} This paper introduces a new class of
algorithms called \emph{proximal distance algorithms} to minimize an objective
function $f$ over a constraint set $\mathcal{C}$. The main idea is to combine
the penalty method with the majorization-minimization (MM) principle,
specifically:
\begin{itemize}
    \item the algorithm solves a sequence of unconstrained minimization
        problems with objective $f(x) + \rho q(x)$, where $q$ is a penalty
        function measuring the violation of the constraints and where the
        parameter $\rho$ is chosen according to a sequence increasing to
        $\infty$.
    \item the proposed penalty function is the squared distance ${\rm dist}(x,
        \mathcal{C})^2$; more precisely, if $\mathcal{C} = \bigcap_{i=1}^m
        C_i$ is an intersection of closed sets, $q(x)
        = \frac{1}{m}\sum_{i=1}^m \|x-P_{C_i}(x)\|^2$, where $P_{C_i}(x)$
        denotes the (possibly multi-valued) projection of $x$ onto the set
        $C_i$.
    \item the optimization problem for fixed $\rho$ is solved using the MM
        principle, defining $x_{k+1}$ at step $k+1$ as a minimizer of the
        surrogate function
        \begin{displaymath}
            g(x|x_k) = f(x) + \frac{\rho}{2m}\sum_{i=1}^m \|x-P_{C_i}(x_k)\|^2.
        \end{displaymath}
\end{itemize}

The authors observe that minimizing this surrogate upper bound on the
objective function amounts to computing the proximal operator of $\rho^{-1}f$
at the point $y_k = \frac{1}{m}\sum_{i=1}^m P_{C_i}(x_k)$, thus connecting the
algorithm to the field of proximal algorithms; this connection leads to
a simple analysis of the convergence of the proposed algorithm for fixed
$\rho > 0$. An accelerated variant of the algorithm is also proposed. The
authors then show how to instantiate their algorithms on seven different
examples and provide experimental evidence that the algorithms compare
favorably to traditional general-purpose convex optimization algorithms.
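
For concreteness, the following short sketch summarizes the overall nested
procedure as I reconstructed it from the paper; the multiplicative schedule
for $\rho$, the warm starts, and the inner stopping rule are my own
assumptions, since these details are not fully specified (cf. my comments
below).
\begin{verbatim}
import numpy as np

def proximal_distance(prox_f, projections, x0, rho=1.0, rho_mult=1.2,
                      n_outer=100, n_inner=200, tol=1e-8):
    # prox_f(y, t) returns argmin_x f(x) + ||x - y||^2 / (2t);
    # projections is a list of maps x -> P_{C_i}(x), assumed cheap to evaluate.
    x, m = np.asarray(x0, dtype=float), len(projections)
    for _ in range(n_outer):
        for _ in range(n_inner):            # MM iterations for fixed rho
            y = sum(P(x) for P in projections) / m
            # minimizing g(x | x_k) amounts to the prox of f / rho at y
            x_new = prox_f(y, 1.0 / rho)
            if np.linalg.norm(x_new - x) < tol:
                x = x_new
                break
            x = x_new
        rho *= rho_mult                     # increase penalty, warm-start from x
    return x
\end{verbatim}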
\paragraph{Comments on the proposed algorithm.} The proposed algorithms follow
from simple and general principles leading to intuitive-looking procedures,
which is a positive point. The interpretation of the optimization for fixed
$\rho > 0$ via proximal operators is interesting since it allows the proposed
algorithms to leverage what is already known about computing proximal
operators and provides a proof of convergence of the procedure for fixed
$\rho > 0$ essentially for free.

A complaint about the proposed algorithms: the paper departs from the
traditional way of writing the constraints as a set of inequalities $f_i(x)\leq
0$ or equalities $f_j(x) = 0$ (typically for convex functions $f_i$ or $f_j$)
and instead writes them as $x\in \cap_{i=1}^m C_i$ for (typically convex) sets
$C_i$. This allows the authors to use $\|x-P_{C_i}(x)\|^2$ as the penalty
function, as opposed to $\max (0, f_i(x))^q$ for some exponent $q$ as is usual
in traditional applications of the penalty method. While very convenient at an
abstract mathematical level, the notation $P_{C_i}(x)$ and this choice of
penalty function hide the implicit assumption that one has access to
a representation of the convex sets $C_i$ defining the constraints that allows
for easy computation of $P_{C_i}(x)$. While the examples show that this is not
necessarily a problem, this implicit assumption should be highlighted and
commented on: if the optimization problem we are facing is written with
constraints $f_i(x) \leq 0$ and only ``black-box'' access to $f_i$ is
provided, then there is no efficient way to represent the associated convex
set $C_i$ and to project onto it. While mathematically equivalent, the choice
of representation of the constraints matters.

\paragraph{Absence of theoretical results.} The only theoretical result in
this paper is the proof of convergence of the procedure that solves the
penalized problem for fixed $\rho>0$ via the distance majorization principle.
This follows from the interpretation of the procedure as a proximal algorithm
and a direct application of the general convergence theorem of Krasnosel'skii
and Mann. This still leaves many gaps in the theoretical understanding of the
proposed algorithms:
\begin{itemize}
    \item importantly, the theoretical discussion does not give any indication
        of the \textbf{rate of convergence} of the procedure and how it
        compares to other traditional algorithms in convex optimization.
    \item there is no proof of convergence of the overall procedure where the
        parameter $\rho$ is chosen according to a diverging sequence. There is
        only a sketch of a proof at the end of section 3, assuming, among
        other things, that the objective function is coercive (without any
        comment on whether the objective functions at hand satisfy this
        assumption).
    \item the overall procedure assumes that the inner optimization problem
        for fixed $\rho$ is solved exactly, which is not the case in practice.
        An analysis of this nested algorithm should account for approximate
        solutions of the problem for fixed $\rho > 0$ and for how the
        approximation errors accumulate when $\rho$ changes. In fact, the
        details of the overall procedure (how to choose the sequence of values
        of $\rho$, the fact that the solution $x_{\rho_n}$ of the problem for
        $\rho_n$ is used as the initial point for the problem with parameter
        $\rho_{n+1}$) are not provided and are left as guesswork to the reader
        (cf. my comment on the writing below).
    \item the accelerated variant of the algorithm comes without a convergence
        guarantee.
\end{itemize}

\paragraph{Comments on the experiments.} The chosen examples are interesting,
cover a wide range of applications, and demonstrate that the proposed
algorithms can be instantiated successfully in all these applications. The
fact that the details of the instantiations (how to compute the projections,
etc.) are worked out in detail is valuable.

A surprising and regrettable fact is that the algorithm is compared against
other algorithms only via computation time (CPU time), which is a very
indirect measure of the efficiency of an algorithm since it is highly
dependent on implementation choices such as the programming language. A more
usual and better (or at least complementary) way to make the comparison would
be to plot the approximation error on the y-axis against the number of
iterations on the x-axis. Without such plots, it is really hard to compare the
algorithms to each other in terms of rate of convergence, especially given the
absence of theoretical guarantees on this rate (cf. my comment above on the
absence of theoretical results).
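
To illustrate, a figure produced along the following lines would convey this
information; this is only a minimal sketch, and it assumes that the
per-iteration error traces (e.g., the objective gap or the distance to the
constraint set) have been recorded while running each solver.
\begin{verbatim}
import matplotlib.pyplot as plt

def plot_convergence(traces):
    # traces maps a solver name to its recorded per-iteration errors
    for name, errs in traces.items():
        plt.semilogy(range(1, len(errs) + 1), errs, label=name)
    plt.xlabel("iteration k")
    plt.ylabel("approximation error (log scale)")
    plt.legend()
    plt.show()
\end{verbatim}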
\paragraph{Comments on the writing.} Given that the paper aims at expanding on
an already published paper on the same topic, the exposition should be
particularly polished to provide a good overview of this class of algorithms.
Unfortunately, the current exposition suffers from important lapses:
\begin{itemize}
    \item as already mentioned above, the overall procedure is never described
        in full detail in one place but is instead presented in multiple
        pieces throughout the first three sections. The fact that the overall
        procedure is in fact a nested procedure, where a different
        optimization problem is solved at each step, could easily be
        overlooked. In fact, the details of this nested procedure (the choice
        of the diverging sequence for $\rho$, how the problems for different
        values of $\rho$ are chained together) are never given in the
        theoretical sections and are only briefly commented on in the
        experimental section. The paper would greatly benefit from a complete
        specification of the algorithm, for example in a dedicated framed box
        or ``algorithm'' environment.
    \item sections 1, 2 and 3 are not well-structured, with an uninterrupted
        flow of theoretical derivations, examples, proofs and proof sketches
        without separators. The exposition would benefit from clearly
        indicating the examples (using a separate environment, for example).
        The theoretical guarantee(s) in section 3 should be clearly stated in
        a separate theorem or proposition environment, followed by a proof
        whose beginning and end can be clearly identified. The accelerated
        variant could be presented in a separate section, possibly also in
        a dedicated framed box or algorithm environment.
\end{itemize}

A few additional minor comments on the writing:
\begin{itemize}
    \item the notation $P_C$ should be clearly defined at the point where it
        is introduced (notably the fact that it is in general multi-valued,
        a point which only becomes clear later in the paper).
    \item in the introduction, the sentence ``fills in many gaps of theory and
        practice'' should be rephrased: it is not clear which gaps are filled
        by the current exposition.
    \item the writing suffers at times from the use of subjective words, such
        as ``painfully slow'' at the bottom of page 5, which would be better
        replaced by a precise quantitative statement. Similarly, the first
        sentence of the introduction, ``part science and part art'', while
        arguably acceptable at the beginning of an introduction, would benefit
        from some additional justification or a discussion of what is actually
        meant.
\end{itemize}

\paragraph{Conclusion.} While the proposed algorithms are interesting, and
while the connection to proximal algorithms and the fact that they follow from
general principles are appealing, the paper suffers from many lapses in the
theoretical analysis, the exposition, and the experimental comparison to other
methods.


\end{document}
