\documentclass[12pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1.2in, vmargin=1.2in]{geometry}

\title{Review of \emph{Proximal Distance Algorithms: Theory and Practice}}
\date{}
\begin{document}

\maketitle

\vspace{-3em}

\paragraph{Summary of the results.} This paper introduces a new class of
algorithms called \emph{proximal distance algorithms} to minimize an objective
function $f$ over a constraint set $\mathcal{C}$. The main idea is to combine
the penalty method with the majorization-minimization (MM) principle.
Specifically:
\begin{itemize}
	\item the algorithm solves a sequence of unconstrained minimization
		problems with objective $f(x) + \rho q(x)$, where $q$ is a penalty
		function measuring the violation of the constraints and where the
		parameter $\rho$ is taken from a sequence increasing to $\infty$.
	\item the proposed penalty function is based on the squared distance
		${\rm dist}(x, \mathcal{C})^2$; more precisely, if $\mathcal{C}
		= \bigcap_{i=1}^m C_i$ is an intersection of closed sets, $q(x)
		= \frac{1}{2m}\sum_{i=1}^m \|x-P_{C_i}(x)\|^2$, where $P_{C_i}$ denotes
		the (possibly multi-valued) projection of $x$ onto the set $C_i$.
	\item the penalized problem for fixed $\rho$ is solved with the MM
		principle: at step $k+1$, the iterate $x_{k+1}$ is defined as
		a minimizer of the surrogate function
		\begin{displaymath}
			g(x|x_k) = f(x) + \frac{\rho}{2m}\sum_{i=1}^m \|x-P_{C_i}(x_k)\|^2,
		\end{displaymath}
		which majorizes the penalized objective $f + \rho q$ and coincides
		with it at $x = x_k$.
\end{itemize}

The authors observe that minimizing the surrogate upper bound on the penalized
objective amounts to computing the proximal operator of $\rho^{-1}f$ at the
point $y_k = \frac{1}{m}\sum_{i=1}^m P_{C_i}(x_k)$. This connects the algorithm
to the field of proximal algorithms and leads to a simple convergence analysis
of the proposed algorithm for fixed $\rho > 0$. An accelerated variant of the
algorithm is also proposed. The authors then show how to instantiate their
algorithms on seven different examples and provide experimental evidence that
the algorithms compare favorably with traditional general-purpose convex
optimization algorithms.
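
For concreteness, here is a minimal sketch of the fixed-$\rho$ iteration as
I understand it, on a toy instance in which both the proximal map of $f$ and
the projections onto the constraint sets are available in closed form; the
instance, the value of $\rho$ and the stopping rule are my own choices for
illustration and are not taken from the paper.

\begin{verbatim}
import numpy as np

# Toy instance: minimize f(x) = 0.5 * ||x - a||^2 over the intersection of
# the box [-1, 1]^n and the unit Euclidean ball.  Both projections are
# explicit, which is exactly the kind of access the method assumes.
n = 5
rng = np.random.default_rng(0)
a = 3.0 * rng.normal(size=n)

def proj_box(x):
    return np.clip(x, -1.0, 1.0)

def proj_ball(x):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm

projections = [proj_box, proj_ball]
m = len(projections)

def prox_f(y, rho):
    # prox of f/rho at y: argmin_x 0.5*||x - a||^2 + (rho/2)*||x - y||^2
    return (a + rho * y) / (1.0 + rho)

def fixed_rho_solve(x0, rho, max_iter=1000, tol=1e-10):
    # MM iteration for fixed rho: average the projections of the current
    # iterate, then apply the proximal operator of f/rho to that average.
    x = x0
    for _ in range(max_iter):
        y = sum(P(x) for P in projections) / m
        x_new = prox_f(y, rho)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

print(fixed_rho_solve(np.zeros(n), rho=1e3))
\end{verbatim}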

\paragraph{Comments on the proposed algorithm.} The proposed algorithms follow
from simple and general principles and lead to intuitive-looking procedures,
which is a positive point. The interpretation of the optimization for fixed
$\rho > 0$ via proximal operators is interesting: it allows the proposed
algorithms to leverage what is already known about computing proximal
operators, and it yields a proof of convergence of the procedure for fixed
$\rho > 0$ essentially for free.

A complaint about the proposed algorithms: the paper departs from the
traditional way of writing the constraints as a set of inequalities $f_i(x)\leq
0$ or equalities $f_j(x) = 0$ (typically for convex functions $f_i$ or $f_j$)
and instead writes them as $x\in \bigcap_{i=1}^m C_i$ for (typically convex)
sets $C_i$. This allows the authors to use $\|x-P_{C_i}(x)\|^2$ as the penalty
function instead of $\max (0, f_i(x))^p$ for some exponent $p$, as is usual in
traditional applications of the penalty method. While very convenient at an
abstract mathematical level, the notation $P_{C_i}(x)$ and this choice of
penalty function hide the implicit assumption that one has access to
a representation of the convex sets $C_i$ defining the constraints which
allows for easy computation of $P_{C_i}(x)$. While the examples show that
this is not necessarily a problem, this implicit assumption should be
highlighted and commented on: if the optimization problem at hand is
written with constraints $f_i(x) \leq 0$ and only ``black-box'' access to $f_i$
is provided, then there is no efficient way to represent the associated convex
set $C_i$ and to project onto it. While mathematically equivalent, the choice
of representation of the constraints matters.
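
To make this point concrete with a standard example of my own (it does not
appear in the paper): for a half-space $C_i = \{x : a_i^\top x \le b_i\}$ the
projection is explicit,
\begin{displaymath}
	P_{C_i}(x) = x - \frac{\max(0,\, a_i^\top x - b_i)}{\|a_i\|^2}\, a_i,
\end{displaymath}
and this formula is available only because the algebraic description of $C_i$
is known; with nothing but black-box evaluations of a general constraint
function $f_i$, no comparable expression for the projection onto
$C_i = \{x : f_i(x) \le 0\}$ can be written down.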

\paragraph{Absence of theoretical results.} The only theoretical result in this
paper is the proof of convergence of the procedure to solve the penalized
problem for fixed $\rho>0$ via the distance majorization principle. This follows
from the interpretation of the procedure as a proximal algorithm and a direct
application of the general convergence theorem of Krasnosel'skii and Mann. This
still leaves many gaps in the theoretical understanding of the proposed
algorithms:
\begin{itemize}
	\item importantly, the theoretical discussion does not give any indication
		of the \textbf{rate of convergence} of the procedure and how it
		compares to other traditional algorithms in convex optimization.
	\item there is no proof of convergence of the overall procedure where the
		parameter $\rho$ is chosen according to a diverging sequence. There is
		only a proof sketch at the end of Section 3, which assumes that the
		objective function is coercive, without any comment on whether the
		objective functions at hand satisfy this assumption.
	\item the overall procedure assumes that the inner optimization problem for
		fixed $\rho$ is solved exactly, which is not the case in practice. An
		analysis of this nested algorithm should account for the approximate
		solving of the problem for fixed $\rho > 0$ and for how approximation
		errors accumulate as $\rho$ changes. In fact, the details of the
		definition of the overall procedure (how to choose the sequence of
		values of $\rho$, the fact that the solution $x_{\rho_n}$ of the
		problem for $\rho_n$ is used as the initial point for the problem with
		parameter $\rho_{n+1}$) are not provided and left as guesswork to the
		reader (cf.\ my comments on the writing below); a sketch of one
		possible specification is given after this list.
	\item the accelerated variant of the algorithm comes without any
		convergence guarantee.
\end{itemize}
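
To illustrate what a complete specification of the nested procedure could look
like (cf.\ the third item above), here is one possible outer loop, continuing
the toy sketch given in the summary; the geometric schedule for $\rho$, the
warm starts and the tightening of the inner tolerance are my own guesses at
a reasonable scheme and are not taken from the paper.

\begin{verbatim}
def proximal_distance(x0, rho0=1.0, growth=2.0, n_outer=30, tol0=1e-4):
    # Outer loop: drive rho to infinity along a geometric schedule,
    # warm-start each penalized problem at the solution of the previous one,
    # and tighten the inner tolerance as rho grows so that the inner errors
    # do not dominate the penalty error.
    x, rho, tol = x0, rho0, tol0
    for _ in range(n_outer):
        x = fixed_rho_solve(x, rho, tol=tol)   # from the sketch above
        rho *= growth
        tol /= growth
    return x
\end{verbatim}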

\paragraph{Comments on the experiments.} The chosen examples are interesting,
cover a wide range of applications, and demonstrate that the proposed
algorithms can be instantiated successfully in all these applications. The fact
that the instantiations (how to compute the projections, etc.) are worked out
in detail is valuable.

A surprising and regrettable fact is that the algorithm is compared against
other algorithms only via computation time (CPU time), which is a very indirect
measure of the efficiency of an algorithm since it is highly dependent on
implementation choices such as the programming language. A more usual and
better (or at least complementary) way to make the comparison would be to plot
the approximation error (y-axis) against the number of iterations (x-axis).
Without this, it is hard to compare the algorithms to each other in terms of
rate of convergence, especially given the absence of theoretical guarantees on
the rate of convergence (cf.\ my comment above on the absence of theoretical
results).
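
As an illustration of the kind of per-iteration comparison I have in mind,
continuing the toy sketch from the summary (the tracked quantities and the
plotting choices are mine, not the paper's):

\begin{verbatim}
import matplotlib.pyplot as plt

f = lambda x: 0.5 * np.linalg.norm(x - a) ** 2   # objective of the toy instance

def track(step, x0, n_iter=200):
    # Record objective value and constraint violation at every iteration, so
    # that algorithms can be compared independently of implementation speed.
    hist, x = [], x0
    for _ in range(n_iter):
        x = step(x)
        viol = max(np.linalg.norm(x - P(x)) for P in projections)
        hist.append((f(x), viol))
    return np.array(hist)

one_step = lambda x: prox_f(sum(P(x) for P in projections) / m, 1e3)
hist = track(one_step, np.zeros(n))
plt.semilogy(np.abs(hist[:, 0] - hist[-1, 0]) + 1e-16, label="|f(x_k) - f(x_K)|")
plt.semilogy(hist[:, 1] + 1e-16, label="constraint violation")
plt.xlabel("iteration"); plt.legend(); plt.show()
\end{verbatim}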

\paragraph{Comments on the writing.} Given that the paper aims at expanding on
an already published paper on the same topic, the exposition should be
particularly polished to provide a good overview of this class of algorithms.
Unfortunately, the current exposition suffers from important lapses:
\begin{itemize}
	\item as already mentioned above, the overall procedure is never described
		in full detail in one place but is instead presented in multiple
		pieces throughout the first three sections. The fact that the overall
		procedure is in fact a nested procedure, where a different optimization
		problem is solved at each step, could easily be overlooked. The details
		of this nested procedure (the choice of the diverging sequence for
		$\rho$, how the problems for different values of $\rho$ are chained
		together) are never given in the theoretical sections and are only
		briefly commented on in the experimental section. The paper would
		greatly benefit from a complete specification of the algorithm, for
		example in a dedicated framed box or ``algorithm'' environment.
	\item Sections 1, 2 and 3 are not well structured, with an uninterrupted
		flow of theoretical derivations, examples, proofs and proof sketches
		without separators. The exposition would benefit from clearly marking
		the examples (using a separate environment, for example). The
		theoretical guarantee(s) in Section 3 should be clearly stated in
		a separate theorem or proposition environment, followed by a proof
		whose beginning and end can be clearly identified. The accelerated
		variant could be presented in a separate section, possibly also in
		a dedicated framed box or algorithm environment.
\end{itemize}

A few additional minor comments on the writing:
\begin{itemize}
	\item the notation $P_C$ should be clearly defined at the point it is
		introduced (notably, the fact that it is in general multi-valued,
		a point which only becomes clear later in the paper).
	\item in the introduction, the sentence ``fills in many gaps of theory and
		practice'' should be rephrased: it is not clear which gaps are filled
		by the current exposition.
	\item the writing suffers at times from the use of subjective words, such as
		``painfully slow'' at the bottom of page 5, which would be better
		replaced by a precise quantitative statement. Similarly, the first
		sentence of the introduction, ``part science and part art'', while
		arguably acceptable at the beginning of an introduction, would benefit
		from some additional justification or a discussion of what is actually
		meant.
\end{itemize}

\paragraph{Conclusion.} While the proposed algorithms are interesting, and
their connection to proximal algorithms and derivation from general principles
are appealing, the paper greatly suffers from many lapses in the theoretical
analysis, the exposition and the experimental comparison to other methods.


\end{document}