\documentclass[10pt]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[hmargin=1in, vmargin=1in]{geometry}
\usepackage{amsmath,amsfonts}
\title{\large Review of \emph{The stochastic Auxiliary Problem Principle in Banach Spaces}}
\date{}
\begin{document}
\maketitle

\paragraph{Summary.} This paper considers the Stochastic Auxiliary Problem Principle (stochastic APP) algorithm, a general stochastic approximation scheme introduced by Culioli and Cohen in 1990, which generalizes both the Stochastic Proximal Gradient method and the Stochastic Mirror Descent method. While this method was originally introduced for infinite-dimensional Hilbert spaces, the main goal of this paper is to analyze its properties when the iterates lie in a Banach space. The paper focuses on the case where the objective function decomposes as the sum of a subdifferentiable lower semicontinuous (l.s.c.) function and a (possibly non-differentiable) l.s.c.\ function (typically, a regularizer in learning applications). Importantly, random noise in the evaluation of the subdifferential of the first of these two functions is considered.
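For concreteness, my understanding is that the problem under study takes the form
\[
\min_{u \in \mathbb{U}} \; J(u) = \underbrace{\mathbb{E}\big[j^C(u,W)\big]}_{\text{subdifferentiable, l.s.c.}} + \underbrace{\mathbb{E}\big[j^\Sigma(u,W)\big]}_{\text{possibly non-differentiable, l.s.c.}},
\]
where the aggregated notation for the two expectations is mine and may not match the paper's, and where the algorithm only has access to noisy evaluations of elements of the subdifferential of the first term.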
The main contributions are:
\begin{itemize}
\item a proof of the measurability of the iterates of the stochastic APP method in the case of a reflexive separable Banach space.
\item a proof of convergence of the stochastic APP algorithm, both in function value and for the iterates themselves.
\item an analysis of the rate of convergence of the APP algorithm in function value, for both averaged iterates and the final iterate.
\end{itemize}

\paragraph{General evaluation.} This paper considers an important and general stochastic approximation method, which encompasses many well-known stochastic approximation algorithms. As such, analyzing the properties of this procedure in as general a setting as possible is very relevant to the optimization community at large. This paper therefore makes a valuable contribution by presenting a general analysis which not only extends previous results from Hilbert spaces to Banach spaces, but also covers the case where random noise is added to subdifferential estimates (previously, either no noise was considered in the Hilbert case, or noise was considered but only for the specific case of projected stochastic gradient descent). Furthermore, the discussion of the measurability of the iterates is important and, as noted by the authors, crucially missing in most if not all prior work. Finally, the paper is very well written and a pleasure to read.

\paragraph{Major comments.}
\begin{enumerate}
\item My main concern with this paper is the generality of the setting. While the abstract and introduction speak of a general Banach space, the technical parts of the paper only consider a separable and \emph{reflexive} Banach space, which is conceptually much closer to a Hilbert space. In fact, it becomes clear when reading the convergence proofs that most of the ideas from the Hilbert case apply directly here as well. Because of this, the more important contributions of the paper are (i) the discussion of the measurability of the iterates and (ii) allowing random noise in subdifferential evaluations, and not so much the generalization from Hilbert spaces to Banach spaces. I still believe these are important contributions, but the reflexivity assumption should be mentioned earlier (possibly already in the abstract) in order not to mislead the reader into a false sense of generality. If non-trivial adaptations were required at the technical level to accommodate Banach spaces, these should be emphasized as well.
\item Related to the previous point, it might be helpful in Section 2.2 to give examples where previous work has considered stochastic approximation methods in reflexive Banach spaces that are not Hilbert spaces. The examples I have in mind quite often involve optimization problems over $L^1$ spaces, which are not reflexive and to which the results of this paper would not apply.
\item I believe the presentation of Section 3.1 could be streamlined, since a significant part of it amounts to presenting previously known results from the theory of normal integrands and measurable selections. These could be moved to the appendix, keeping in the main part of the paper only the results specific to this work; I believe these would be Propositions 3.17 and 3.18 as well as Corollary 3.19. This is a matter of taste though, and I will leave it to the editor to decide what is preferable and best matches the style of the journal.
\item Related to the previous point, since one goal of this paper is generality, it might be worth mentioning that all the results of Section 3.1.1 are also valid in Polish spaces (separable, completely metrizable spaces); that is, normability is not necessary.
\item Was the measurability of the iterates ever discussed in prior work, or known in some specific cases, perhaps in the finite-dimensional case where it is simpler to establish? If so, this should be mentioned somewhere in Section 3.
\item A discussion of the assumptions introduced at the beginning of Section 4.1 would be valuable. Which ones seem strictly necessary, and which could perhaps be relaxed with more work? Are they commonly made in similar work? This seems particularly relevant for Assumption A12, which, if I am not mistaken, is not always present in similar work (the remaining assumptions seem more standard to me).
\end{enumerate}

\paragraph{Minor comments.}
\begin{enumerate}
\item Mention early on in the introduction that $j^C$ and $j^\Sigma$ are functions from $\mathbb{U}\times\mathbb{W}$ to $\mathbb{R}$. This becomes clear later but might be slightly confusing at first.
\item When first introducing the Gateaux differentiability assumption, explain which variant of the definition is used (depending on the authors, linearity and boundedness of the directional derivative are not required, but they are here); see the display after this list for the variant I believe is intended.
\item Second paragraph of Section 3.1: when introducing the Borel $\sigma$-field of $\mathbb{U}^\star$, explain which topology on $\mathbb{U}^\star$ is considered (it becomes clear later that it is the topology of the norm induced by the primal norm).
\item It might be worth mentioning that Prop.~3.5 is also commonly known as the Kuratowski–Ryll-Nardzewski selection theorem \cite{KR65}. I also believe that it should be explained below Proposition 3.5, or even stated as a corollary, that an important consequence of this proposition is that if $\Gamma$ is Effros-measurable then it admits a measurable selection. This might be missed at first reading, and it is how the proposition is truly used in the paper.
\item In the proof of Lemma 4.1, a better explanation is needed for inequality (4.3). I am convinced it is true, but I was not immediately able to find a clean justification, since the objective function is the sum of a differentiable function and a non-differentiable one.
\end{enumerate}
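For reference, and to make minor comment 2 precise, the variant I believe the authors use is the following: $J \colon \mathbb{U} \to \mathbb{R}$ is Gateaux differentiable at $u$ if the directional derivative
\[
J'(u;h) = \lim_{t \downarrow 0} \frac{J(u+th) - J(u)}{t}
\]
exists for every $h \in \mathbb{U}$ \emph{and} the map $h \mapsto J'(u;h)$ is linear and continuous, so that it defines an element of $\mathbb{U}^\star$; other authors only require the existence of the limit. (The notation $J'(u;h)$ is mine and may differ from the paper's.)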
\paragraph{Conclusion.} This paper elegantly presents a unified and general analysis of an important stochastic approximation procedure, while filling theoretical gaps left open by previous work. I believe it can be a valuable contribution to SIOPT once the above comments are taken into account.

\bibliographystyle{plain}
\bibliography{main}

\end{document}