\documentclass[10pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1in, vmargin=1in]{geometry}
\usepackage{amsmath,amsfonts}
\title{\large Review of \emph{The stochastic Auxiliary Problem Principle in Banach Spaces}}
\date{}
\begin{document}
\maketitle
\paragraph{Summary.} This paper considers the Stochastic Auxiliary Problem
Principle (stochastic APP) algorithm, a general stochastic approximation scheme
introduced by Culioli and Cohen in 1990 and generalizing the Stochastic Proximal
Gradient method as well as the Stochastic Mirror Descent method. While this
method was originally introduced for infinite-dimensional Hilbert spaces, the
main goal of this paper is to analyze its properties when the iterates lie in
a Banach space. The paper focuses on the case where the objective function
decomposes as the sum of a subdifferentiable lower semicontinuous (l.s.c.)
function and a (possibly non-differentiable) l.s.c.\ function (typically,
a regularizer in learning applications). Importantly, random noise in the
evaluation of the subdifferential of the first of these two functions is
considered. The main contributions are:
\begin{itemize}
\item a proof of the measurability of the iterates of the stochastic APP
method in the case of a reflexive separable Banach space.
\item a proof of convergence of the stochastic APP algorithm, both in
function value and for the iterates themselves.
\item an analysis of the rate of convergence of the stochastic APP
algorithm (in function value); both averaged iterates and the final
iterate are considered.
\end{itemize}
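For concreteness, the update rule I understand the paper to analyze can be
sketched as follows; the notation is my own shorthand and only approximates
the paper's, with $K$ denoting the auxiliary function, $\epsilon^k$ the step
size, $w^{k+1}$ the noise sample, and $g^k$ a noisy evaluation of an element
of the subdifferential of $j^C(\cdot, w^{k+1})$ at $u^k$ (any constraint set
is absorbed into $j^\Sigma$ in this sketch):
\begin{equation*}
u^{k+1} \in \operatorname*{arg\,min}_{u \in \mathbb{U}} \;
K(u) + \bigl\langle \epsilon^k g^k - K'(u^k),\, u \bigr\rangle
+ \epsilon^k\, j^\Sigma(u, w^{k+1}).
\end{equation*}
Particular choices of $K$ then recover the stochastic proximal gradient and
stochastic mirror descent methods mentioned above.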
\paragraph{General evaluation.} This paper considers an important and general
stochastic approximation method, which encompasses many well-known stochastic
approximation algorithms. As such, analyzing the properties of this procedure
in as general a setting as possible is very relevant to the optimization
community at large. Thus, this paper makes a valuable contribution by
presenting a general analysis which not only extends previous results from
Hilbert spaces to Banach spaces, but also considers the case where random noise
is added to subdifferential estimates (previously, either no noise was
considered for the case of Hilbert spaces, or noise was considered but only for
the specific case of projected stochastic gradient descent). Furthermore, the
discussion of the measurability of the iterates is important and, as noted by
the authors, crucially missing in most if not all prior work. Finally, the
paper is very well-written and a pleasure to read.
\paragraph{Major comments.}
\begin{enumerate}
\item My main concern with this paper is with respect to the generality of
the setting. While the abstract and introduction talk about a general
Banach space, the technical parts of the paper only consider
a separable and \emph{reflexive} Banach space, which is conceptually
much closer to a Hilbert space. In fact, it becomes clear when reading
the proofs of convergence that most of the ideas from the Hilbert case
can be directly applied here as well. Because of this, the more
important contributions of the paper are (i) the discussion of the
measurability of the iterates and (ii) allowing random noise in
subdifferential evaluations, rather than the generalization from
Hilbert spaces to Banach spaces per se. I still believe these are
important contributions, but the reflexivity assumption should be
mentioned earlier (possibly already in the abstract) so as not to
mislead the reader into a false sense of generality. If non-trivial
adaptations were required at the technical level to accommodate the
Banach space setting, those should be emphasized as well.
\item Related to the previous point, it might be helpful in Section 2.2 to
give examples where previous work has considered stochastic
approximation methods in reflexive Banach spaces that are not
Hilbert spaces. The examples I have in mind quite often involve
optimization problems over $L^1$ spaces, which are not reflexive, and
for which the results of this paper would not apply.
\item I believe the presentation of Section 3.1 could be streamlined, since
a significant part of it amounts to presenting previously known results
from the theory of normal integrands and measurable selections. These
could be moved to the appendix, keeping in the main part of the
paper only the results specific to this paper and not directly taken
from previous work. I believe these would be Propositions
3.17 and 3.18 as well as Corollary 3.19. This is a matter
of taste though, and I will leave it to the editor to decide what is
preferable and best matches the style of the journal.
\item Related to the previous point, since one goal of this paper is
generality, it might be worth mentioning that all the results from
Section 3.1.1 are also valid in Polish spaces (separable and completely
metrizable spaces); that is, normability is not necessary.
\item Was measurability of the iterates ever discussed in prior work, or
known in some specific cases? Maybe in the finite-dimensional case,
where it is simpler to establish? If so, this should be mentioned
somewhere in Section 3.
\item A discussion of the assumptions introduced at the beginning of
Section 4.1 could be valuable. Which ones seem strictly necessary, and
which ones could perhaps be relaxed with more work? Are they commonly
assumed in similar work? This seems particularly relevant for
Assumption A12, which, if I am not mistaken, is not always present in
similar work (the remaining assumptions seem more standard to me).
\end{enumerate}
\paragraph{Minor comments.}
\begin{enumerate}
\item Mention early on in the introduction that $j^C$ and $j^\Sigma$ are
functions from $\mathbb{U}\times\mathbb{W}$ to $\mathbb{R}$. This
becomes clear later but might be slightly confusing at the beginning.
\item When first introducing the G\^ateaux differentiability assumption,
explain which variant of the definition is used (depending on the
authors, linearity and boundedness of the directional derivative are
not required, but they are here); a possible statement is sketched
after this list.
\item Second paragraph of Section 3.1: when introducing the Borel
$\sigma$-field of $\mathbb{U}^\star$, explain which topology on
$\mathbb{U}^\star$ is considered (it becomes clear later that it is the
topology of the norm induced by the primal norm).
\item It might be worth mentioning that Proposition 3.5 is also commonly
known as the Kuratowski–Ryll-Nardzewski selection theorem \cite{KR65}.
I also believe that it should be explained below Proposition 3.5, or
even stated as a corollary, that an important consequence of this
proposition is that if $\Gamma$ is Effros-measurable, then it admits
a measurable selection; a possible phrasing is sketched after this
list. This might be missed at first reading and is the way this
proposition is truly used in the paper.
\item In the proof of Lemma 4.1, a better explanation is needed for
inequality (4.3). I am convinced it is true, but I was not immediately
able to find a clean justification, since the objective function is the
sum of a differentiable function and a non-differentiable function.
\end{enumerate}
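Concerning minor comment 2 above, one way to state the variant of the
definition that I understand to be used (my phrasing, to be checked against
the paper's) is the following: $j$ is G\^ateaux differentiable at
$u \in \mathbb{U}$ if the limit
\begin{equation*}
j'(u; h) = \lim_{t \to 0} \frac{j(u + th) - j(u)}{t}
\end{equation*}
exists for every direction $h \in \mathbb{U}$ and the map
$h \mapsto j'(u; h)$ is linear and continuous, hence an element of
$\mathbb{U}^\star$. Weaker variants found in the literature require only the
existence of these limits.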
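Concerning minor comment 4 above, the consequence I would suggest making
explicit could be phrased along the following lines (to be adapted to the
paper's exact setting): if $\Gamma$ is an Effros-measurable multifunction
from $\Omega$ to the nonempty closed subsets of $\mathbb{U}$, then there
exists a measurable map $\gamma : \Omega \to \mathbb{U}$ such that
$\gamma(\omega) \in \Gamma(\omega)$ for every $\omega \in \Omega$.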
\paragraph{Conclusion.} This paper elegantly presents a unified and general
analysis of an important stochastic approximation procedure, while filling
theoretical gaps left open by previous work. I believe it can be a valuable
contribution to SIOPT after the above comments are taken into account.
\bibliographystyle{plain}
\bibliography{main}
\end{document}