\documentclass[12pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1.2in, vmargin=1.2in]{geometry}
\title{Review of \emph{Proximal Distance Algorithms: Theory and Practice}}
\date{}
\begin{document}
\maketitle
\vspace{-3em}

\paragraph{Summary of the results.}
This paper introduces a new class of algorithms, called \emph{proximal distance algorithms}, for minimizing an objective function $f$ over a constraint set $\mathcal{C}$. The main idea is to combine the penalty method with the majorization-minimization (MM) principle. Specifically:
\begin{itemize}
\item the algorithm solves a sequence of unconstrained minimization problems with objective $f(x) + \frac{\rho}{2}\, q(x)$, where $q$ is a penalty function measuring the violation of the constraints and the parameter $\rho$ is chosen along a sequence increasing towards $\infty$.
\item the proposed penalty is the squared distance $q(x) = {\rm dist}(x, \mathcal{C})^2$ or, more precisely, when $\mathcal{C} = \bigcap_{i=1}^m C_i$ is an intersection of closed sets, the averaged version $q(x) = \frac{1}{m}\sum_{i=1}^m \|x-P_{C_i}(x)\|^2$, where $P_{C_i}(x)$ denotes the (possibly multi-valued) projection of $x$ onto the set $C_i$.
\item the optimization problem for fixed $\rho$ is solved using the MM principle, by defining the iterate $x_{k+1}$ at step $k+1$ as a minimizer of the surrogate function
\begin{displaymath}
g(x \,|\, x_k) = f(x) + \frac{\rho}{2m}\sum_{i=1}^m \|x-P_{C_i}(x_k)\|^2,
\end{displaymath}
which majorizes the penalized objective since ${\rm dist}(x, C_i) \leq \|x - P_{C_i}(x_k)\|$ for every $i$, with equality at $x = x_k$.
\end{itemize}
The authors observe that minimizing the surrogate amounts to computing the proximal operator of $\rho^{-1}f$ at the point $y_k = \frac{1}{m}\sum_{i=1}^m P_{C_i}(x_k)$, since up to an additive constant $g(x\,|\,x_k) = f(x) + \frac{\rho}{2}\|x - y_k\|^2$. This connects the algorithm to the field of proximal algorithms and leads to a simple convergence analysis for fixed $\rho > 0$. An accelerated variant of the algorithm is also proposed. The authors then show how to instantiate their algorithms on seven different examples and provide experimental evidence that the algorithms compare favorably to traditional general-purpose convex optimization algorithms.

\paragraph{Comments on the proposed algorithm.}
The proposed algorithms follow from simple and general principles and lead to intuitive procedures, which is a positive point. The interpretation of the optimization for fixed $\rho > 0$ via proximal operators is interesting, since it allows the proposed algorithms to leverage what is already known about computing proximal operators and provides a proof of convergence for fixed $\rho > 0$ essentially for free.

One complaint about the proposed algorithms: the paper departs from the traditional way of writing the constraints as a set of inequalities $f_i(x)\leq 0$ or equalities $f_j(x) = 0$ (typically for convex functions $f_i$ and $f_j$) and instead writes them as $x\in \bigcap_{i=1}^m C_i$ for (typically convex) sets $C_i$. This allows the authors to use $\|x-P_{C_i}(x)\|^2$ as the penalty, as opposed to $\max(0, f_i(x))^p$ for some exponent $p$, as is usual in traditional applications of the penalty method. While very convenient at an abstract mathematical level, the notation $P_{C_i}(x)$ and this choice of penalty hide the implicit assumption that one has access to a representation of the convex sets $C_i$ defining the constraints which allows easy computation of $P_{C_i}(x)$.
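To make the role of this assumption concrete, here is a minimal sketch of the nested procedure as I understand it (my own illustrative code, not the authors'), for a toy problem in which both the proximal map of $f$ and the projections onto the constraint sets are available in closed form; the geometric schedule for $\rho$ and the warm start across values of $\rho$ are illustrative choices, since the paper does not fully specify them:
\begin{verbatim}
import numpy as np

# Illustrative sketch: minimize f(x) = 0.5*||x - b||^2 subject to
# x in C1 (nonnegative orthant) intersected with C2 (Euclidean ball).
# The constraint sets enter only through their projection maps.

def proj_orthant(x):
    return np.maximum(x, 0.0)                     # P_{C1}

def proj_ball(x, r=1.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x       # P_{C2}

def prox_f(y, b, rho):
    # argmin_x 0.5*||x - b||^2 + (rho/2)*||x - y||^2
    return (b + rho * y) / (1.0 + rho)

def proximal_distance(b, rho=1.0, n_outer=30, n_inner=200, tol=1e-10):
    projections = [proj_orthant, proj_ball]
    x = b.copy()
    for _ in range(n_outer):                      # outer loop: rho grows
        for _ in range(n_inner):                  # inner MM/proximal steps
            y = np.mean([P(x) for P in projections], axis=0)
            x_new = prox_f(y, b, rho)             # x_{k+1} = prox_{f/rho}(y_k)
            if np.linalg.norm(x_new - x) < tol:
                x = x_new
                break
            x = x_new
        rho *= 2.0                                # warm start at the new rho
    return x

x_hat = proximal_distance(np.array([2.0, -1.0, 0.5]))  # toy usage
\end{verbatim}
The point of the sketch is that the constraint sets enter the algorithm only through the calls to the projection maps, so the practical scope of the method is determined by which projections can be computed efficiently.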
While the examples show that this is not necessarily a problem, this implicit assumption should be highlighted and commented on: if the optimization problem we are facing is written with constraints $f_i(x) \leq 0$ and only ``black-box'' access to $f_i$ is provided, then there is no efficient way to represent the associated convex set $C_i$ and to project onto it. While mathematically equivalent, the choice of representation of the constraints matters.

\paragraph{Absence of theoretical results.}
The only theoretical result in this paper is the proof of convergence of the procedure solving the penalized problem for fixed $\rho>0$ via the distance majorization principle. This follows from the interpretation of the procedure as a proximal algorithm and a direct application of the general convergence theorem of Krasnosel'skii and Mann. This still leaves many gaps in the theoretical understanding of the proposed algorithms:
\begin{itemize}
\item importantly, the theoretical discussion does not give any indication of the \textbf{rate of convergence} of the procedure or of how it compares to that of other traditional algorithms in convex optimization.
\item there is no proof of convergence of the overall procedure in which the parameter $\rho$ follows a diverging sequence, only a proof sketch at the end of Section 3 which assumes that the objective function is coercive (without any comment on whether the objective functions at hand satisfy this assumption).
\item the overall procedure assumes that the inner optimization problem for fixed $\rho$ is solved exactly, which is not the case in practice. An analysis of this nested algorithm should account for the approximate solution of the problem for fixed $\rho > 0$ and for how the approximation errors accumulate as $\rho$ changes. In fact, the details of the definition of the overall procedure (how to choose the sequence $(\rho_n)$, the fact that the solution $x_{\rho_n}$ of the problem for $\rho_n$ is used as the initial point for the problem with parameter $\rho_{n+1}$) are not provided and are left as guesswork to the reader (cf. my comment on the writing below).
\item the accelerated variant of the algorithm comes without any convergence guarantee.
\end{itemize}

\paragraph{Comments on the experiments.}
The chosen examples are interesting, cover a wide range of applications, and demonstrate that the proposed algorithms can be instantiated successfully in all of these applications. The fact that the instantiations (how to compute the projections, etc.) are worked out in detail is valuable.

A surprising and regrettable fact is that the algorithm is compared against other algorithms only via computation time (CPU time), which is a very indirect measure of the efficiency of an algorithm since it is highly dependent on implementation choices such as the programming language. A more usual and better (or at least complementary) way to make the comparison would be to plot the approximation error against the number of iterations. Without this, it is hard to compare the algorithms in terms of rate of convergence, especially given the absence of theoretical guarantees on this rate (cf. my comment above on the absence of theoretical results).

\paragraph{Comments on the writing.}
Given that the paper aims to expand on an already published paper on the same topic, the exposition should be particularly polished so as to provide a good overview of this class of algorithms.
Unfortunately, the current exposition suffers from important lapses:
\begin{itemize}
\item as already mentioned above, the overall procedure is never described in full detail in one place but is instead presented in multiple pieces throughout the first three sections. The fact that the overall procedure is a nested one, in which a different optimization problem is solved at each outer step, could easily be overlooked. Moreover, the details of this nested procedure (the choice of the diverging sequence for $\rho$, how the subproblems for the different values of $\rho$ are chained together) are never given in the theoretical sections and are only briefly commented on in the experimental section. The paper would greatly benefit from a complete specification of the algorithm, for example in a dedicated framed box or ``algorithm'' environment.
\item Sections 1, 2 and 3 are not well structured, with an uninterrupted flow of theoretical derivations, examples, proofs and proof sketches without separators. The exposition would benefit from clearly marking the examples (using a dedicated environment, for instance). The theoretical guarantee(s) in Section 3 should be clearly stated in a separate theorem or proposition environment, followed by a proof whose beginning and end can be clearly identified. The accelerated variant could be presented in a separate section, possibly also in a dedicated framed box or algorithm environment.
\end{itemize}
A few additional minor comments on the writing:
\begin{itemize}
\item the notation $P_C$ should be clearly defined at the point where it is introduced (notably, the fact that it is in general multi-valued, a point which only becomes clear later in the paper).
\item in the introduction, the sentence ``fills in many gaps of theory and practice'' should be rephrased: it is not clear which gaps are filled by the current exposition.
\item the writing at times relies on subjective wording, such as ``painfully slow'' at the bottom of page 5, which would be better replaced by a precise quantitative statement. Similarly, the first sentence of the introduction, ``part science and part art'', while arguably acceptable at the beginning of an introduction, would benefit from some additional justification or a discussion of what is actually meant.
\end{itemize}

\paragraph{Conclusion.}
While the proposed algorithms are interesting, and both the connection to proximal algorithms and the fact that they follow from general principles are appealing, the paper suffers from many lapses in the theoretical analysis, the exposition, and the experimental comparison to other methods.
\end{document}