This paper proposes a general stochastic first-order method for the optimization of a smooth non-convex function over a Riemannian manifold. While SGD is known to be minimax optimal in general (requiring O(eps^-4) stochastic gradient queries to find an eps-approximate stationary point), it has been shown that variants of SGD can achieve O(eps^-3) sample complexity when the objective function additionally satisfies a stronger, mean-squared smoothness assumption. Previous adaptations of these methods to the Riemannian case were either specific to certain manifolds or required large-batch gradient updates at certain iterations, leading to prohibitive computational costs.

The method introduced in this paper can be described as the Riemannian counterpart of the Stochastic Recursive Momentum method of [Cutkosky et al., 2019]. The main idea is to interpolate smoothly, over the iterations of the algorithm, between vanilla Riemannian SGD and the momentum-based RSPRG algorithm of [Han et al., 2020; Zhou et al., 2019] (I sketch my understanding of the update rule at the end of my comments). The main theoretical result is that with a proper decaying schedule for the step size and the interpolation parameter (which controls how much of the momentum term is mixed into the gradient update), the proposed algorithm converges at a nearly optimal rate (up to logarithmic terms) while requiring only one random sample and two gradient computations per iteration. The result holds for arbitrary Riemannian manifolds and only requires tuning two hyperparameters.

An experimental section evaluates the proposed algorithm on three distinct tasks: PCA (on the Grassmann manifold), ICA (on the Stiefel manifold), and Riemannian centroid computation, and shows that it consistently outperforms other Riemannian stochastic first-order methods.

Detailed comments
=================

The problem considered in this paper is relevant, since many tasks in data analysis are naturally constrained to a Riemannian manifold (for example the PCA and ICA problems considered in the experimental section). Furthermore, it is satisfying, both from a theoretical and an applicability perspective, to have a general method that can be analyzed for arbitrary manifolds.

The convergence result is strong and the proofs are correct as far as I was able to check. A large part of the derivation follows ideas similar to the Euclidean case, which I was able to verify, but not being an expert in Riemannian geometry, I could have missed a technical mistake specific to the Riemannian case.

The paper is well written (as far as the 7-page limit permits…) and I found the experimental evaluation very convincing.

I have a minor concern about the significance of the result because of the extra assumption of mean-squared Lipschitzness of the gradients. It was not clear to me how restrictive this assumption is and how often it holds in practice. In particular, I was surprised not to see any mention of it in the experimental section: do the examples considered there satisfy the assumption? I think it would improve the paper to say a bit more about this.
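To be explicit about the assumption I have in mind (in my own notation, which may differ from the paper's): in the Euclidean setting, mean-squared smoothness requires

    E_z || grad f(x; z) - grad f(y; z) ||^2 <= L^2 || x - y ||^2   for all x, y,

which, by Jensen's inequality, is stronger than Lipschitz-continuity of the expected gradient alone. I assume the Riemannian analogue compares the two stochastic gradients after parallel (or vector) transport and replaces || x - y || with the Riemannian distance, i.e.

    E_z || grad f(x; z) - Gamma_{y->x} grad f(y; z) ||^2 <= L^2 d(x, y)^2,

where Gamma_{y->x} denotes transport from T_y M to T_x M.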
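For concreteness, my reading of the interpolation described above (again in my own notation, with Gamma a vector transport, Retr a retraction, eta_t the step size and a_t in [0, 1] the interpolation parameter, so the exact form may differ from the paper's) is the recursive update

    d_t = grad f(x_t; z_t) + (1 - a_t) Gamma_{x_{t-1}->x_t}( d_{t-1} - grad f(x_{t-1}; z_t) ),
    x_{t+1} = Retr_{x_t}( - eta_t d_t ).

Setting a_t = 1 recovers vanilla Riemannian SGD, while a_t = 0 gives the recursive variance-reduced estimator; in either case only one sample z_t and two stochastic gradient evaluations are needed per iteration, consistent with the per-iteration cost stated above.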