\documentclass[10pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1in, vmargin=1in]{geometry}
\usepackage{amsmath,amsfonts}
\title{\large Review of \emph{The stochastic Auxiliary Problem Principle in Banach Spaces}}
\date{}
\begin{document}
\maketitle
\paragraph{Summary.} This paper considers the Stochastic Auxiliary Problem
Principle (stochastic APP) algorithm, a general stochastic approximation scheme
introduced by Culioli and Cohen in 1990 and generalizing the Stochastic Proximal
Gradient method as well as the Stochastic Mirror Descent method. While this
method was originally introduced for infinite-dimensional Hilbert spaces, the
main goal of this paper is to analyze its properties when the iterates lie in
a Banach space. The paper focuses on the case where the objective function
decomposes as the sum of a subdifferentiable lower semicontinuous (l.s.c.)
function and a (possibly non-differentiable) l.s.c.\ function (typically,
a regularizer in learning applications). Importantly, random noise in the
evaluation of the subdifferential of the first of these two functions is
considered. The main contributions are:
\begin{itemize}
\item a proof of the measurability of the iterates of the stochastic APP
method in the case of a reflexive separable Banach space.
\item a proof of convergence of the stochastic APP algorithm, both in
function value and for the iterates themselves.
\item an analysis of the rate of convergence of the stochastic APP
algorithm (in function value); both averaged iterates and the final
iterate are considered.
\end{itemize}
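For concreteness, the update rule I understand the paper to analyze can be
sketched as follows; the notation is my own shorthand and only approximates
the paper's, with $K$ denoting the auxiliary function, $\epsilon^k$ the step
size, $w^{k+1}$ the noise sample, and $g^k$ a noisy evaluation of an element
of the subdifferential of $j^C(\cdot, w^{k+1})$ at $u^k$ (any constraint set
is absorbed into $j^\Sigma$ in this sketch):
\begin{equation*}
u^{k+1} \in \operatorname*{arg\,min}_{u \in \mathbb{U}} \;
K(u) + \bigl\langle \epsilon^k g^k - K'(u^k),\, u \bigr\rangle
+ \epsilon^k\, j^\Sigma(u, w^{k+1}).
\end{equation*}
Particular choices of $K$ then recover the stochastic proximal gradient and
stochastic mirror descent methods mentioned above.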
\paragraph{General evaluation.} This paper considers an important and general
stochastic approximation method, which encompasses many well-known stochastic
approximation algorithms. As such, analyzing the properties of this procedure
in as general a setting as possible is very relevant to the optimization
community at large. Thus, this paper makes a valuable contribution by
presenting a general analysis which not only extends previous results from
Hilbert spaces to Banach spaces, but also considers the case where random noise
is added to subdifferential estimates (previously, either no noise was
considered for the case of Hilbert spaces, or noise was considered but only for
the specific case of projected stochastic gradient descent). Furthermore, the
discussion of the measurability of the iterates is important and, as noted by
the authors, crucially missing in most if not all prior work. Finally, the
paper is very well-written and a pleasure to read.
\paragraph{Major comments.}
\begin{enumerate}
\item My main concern with this paper is with respect to the generality of
the setting. While the abstract and introduction talk about a general
Banach space, the technical parts of the paper only consider
a separable and \emph{reflexive} Banach space, which is conceptually
much closer to a Hilbert space. In fact, it becomes clear when reading
the proofs of convergence that most of the ideas from the Hilbert case
can be directly applied here as well. Because of this, the more
important contributions of the paper are (i) the discussion of the
measurability of the iterates and (ii) allowing random noise in
subdifferential evaluations, rather than the generalization from
Hilbert spaces to Banach spaces per se. I still believe these are
important contributions, but the reflexivity assumption should be
mentioned earlier (possibly already in the abstract) so as not to
mislead the reader into a false sense of generality. If non-trivial
adaptations were required at the technical level to accommodate the
Banach space setting, those should be emphasized as well.
\item Related to the previous point, it might be helpful in Section 2.2 to
give examples where previous work has considered stochastic
approximation methods in reflexive Banach spaces that are not
Hilbert spaces. The examples I have in mind quite often involve
optimization problems over $L^1$ spaces, which are not reflexive, and
for which the results of this paper would not apply.
\item I believe the presentation of Section 3.1 could be streamlined, since
a significant part of it amounts to presenting previously known results
from the theory of normal integrands and measurable selections. These
could be moved to the appendix, keeping in the main part of the
paper only the results specific to this paper and not directly taken
from previous work. I believe these would be Propositions
3.17 and 3.18 as well as Corollary 3.19. This is a matter
of taste though, and I will leave it to the editor to decide what is
preferable and best matches the style of the journal.
\item Related to the previous point, since one goal of this paper is
generality, it might be worth mentioning that all the results from
Section 3.1.1 are also valid in Polish spaces (separable and completely
metrizable spaces); that is, normability is not necessary.
\item Was measurability of the iterates ever discussed in prior work, or
known in some specific cases? Maybe in the finite-dimensional case,
where it is simpler to establish? If so, this should be mentioned
somewhere in Section 3.
\item A discussion of the assumptions introduced at the beginning of
Section 4.1 could be valuable. Which ones seem strictly necessary, and
which ones could perhaps be relaxed with more work? Are they commonly
assumed in similar work? This seems particularly relevant for
Assumption A12, which, if I am not mistaken, is not always present in
similar work (the remaining assumptions seem more standard to me).
\end{enumerate}
\paragraph{Minor comments.}
\begin{enumerate}
\item Mention early on in the introduction that $j^C$ and $j^\Sigma$ are
functions from $\mathbb{U}\times\mathbb{W}$ to $\mathbb{R}$. This
becomes clear later but might be slightly confusing at the beginning.
\item When first introducing the G\^ateaux differentiability assumption,
explain which variant of the definition is used (depending on the
authors, linearity and boundedness of the directional derivative are
not required, but they are here); a possible statement is sketched
after this list.
\item Second paragraph of Section 3.1: when introducing the Borel
$\sigma$-field of $\mathbb{U}^\star$, explain which topology on
$\mathbb{U}^\star$ is considered (it becomes clear later that it is the
topology of the norm induced by the primal norm).
\item It might be worth mentioning that Proposition 3.5 is also commonly
known as the Kuratowski–Ryll-Nardzewski selection theorem \cite{KR65}.
I also believe that it should be explained below Proposition 3.5, or
even stated as a corollary, that an important consequence of this
proposition is that if $\Gamma$ is Effros-measurable, then it admits
a measurable selection; a possible phrasing is sketched after this
list. This might be missed at first reading and is the way this
proposition is truly used in the paper.
\item In the proof of Lemma 4.1, a better explanation is needed for
inequality (4.3). I am convinced it is true, but I was not immediately
able to find a clean justification, since the objective function is the
sum of a differentiable function and a non-differentiable function.
\end{enumerate}
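Concerning minor comment 2 above, one way to state the variant of the
definition that I understand to be used (my phrasing, to be checked against
the paper's) is the following: $j$ is G\^ateaux differentiable at
$u \in \mathbb{U}$ if the limit
\begin{equation*}
j'(u; h) = \lim_{t \to 0} \frac{j(u + th) - j(u)}{t}
\end{equation*}
exists for every direction $h \in \mathbb{U}$ and the map
$h \mapsto j'(u; h)$ is linear and continuous, hence an element of
$\mathbb{U}^\star$. Weaker variants found in the literature require only the
existence of these limits.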
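Concerning minor comment 4 above, the consequence I would suggest making
explicit could be phrased along the following lines (to be adapted to the
paper's exact setting): if $\Gamma$ is an Effros-measurable multifunction
from $\Omega$ to the nonempty closed subsets of $\mathbb{U}$, then there
exists a measurable map $\gamma : \Omega \to \mathbb{U}$ such that
$\gamma(\omega) \in \Gamma(\omega)$ for every $\omega \in \Omega$.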
\paragraph{Conclusion.} This paper elegantly presents a unified and general
analysis of an important stochastic approximation procedure, while filling
theoretical gaps left open by previous work. I believe it can be a valuable
contribution to SIOPT after the above comments are taken into account.
\bibliographystyle{plain}
\bibliography{main}
\end{document}