Summary
-------

This paper departs from the traditional setting of stochastic gradient descent by considering a setting where computing the gradient of the loss function (here, the negative log-likelihood) at a single data point is computationally expensive. This is for example the case for the Cox proportional hazards model, in which the observations are not independent and the probability of "failure" of an individual at a given time is a function of the individuals still alive at that time.

The idea introduced in this paper is natural but also fairly general and could be applied to settings beyond the Cox model (as is mentioned in several places in the paper): if the gradient of the loss function at a single data point can be written as an expectation, then it can be approximated using MCMC techniques. This idea naturally yields a "doubly stochastic" algorithm which uses two different sampling steps to approximate the true gradient: (1) random sampling of a data point, as in standard SGD; (2) MCMC sampling to approximate the gradient at the data point sampled in (1). A minimal sketch of this update is given below.

The paper provides convergence guarantees for the proposed algorithm under standard assumptions (in particular, both the strongly convex and non-strongly convex cases are analyzed). Finally, experiments show that the proposed algorithm compares favorably to state-of-the-art algorithms.
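To make the two levels of randomness concrete, here is a minimal sketch of one update of the doubly stochastic scheme as I understand it. The paper's actual algorithm adds Prox-SVRG-style variance reduction and a proximal step on top of this; `grad_i` and `mcmc_draws` are hypothetical stand-ins for the paper's Cox-model quantities, not names from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def doubly_stochastic_step(theta, n, grad_i, mcmc_draws, step_size, n_mcmc):
    """One doubly stochastic gradient update (sketch, my notation).

    grad_i(i, theta, z): integrand whose expectation over z ~ pi_i is the
                         true gradient of the loss at data point i.
    mcmc_draws(i, theta, m): m approximate samples from pi_i obtained by MCMC.
    """
    # (1) outer randomness: sample a data point uniformly, as in standard SGD
    i = rng.integers(n)
    # (2) inner randomness: MCMC draws targeting the point-specific pi_i
    zs = mcmc_draws(i, theta, n_mcmc)
    # average the integrand over the draws to estimate the gradient at point i
    grad_est = np.mean([grad_i(i, theta, z) for z in zs], axis=0)
    # plain SGD step; the paper uses a variance-reduced proximal update instead
    return theta - step_size * grad_est
```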
General comments
----------------

The proposed algorithm is natural and elegantly combines ideas from the SGD literature (more specifically the Prox-SVRG algorithm of Xiao and Zhang) with MCMC techniques. The idea seems general enough to be applied to many scenarios beyond the Cox model, and it would be very interesting to see this done in future work (as suggested by the authors).

The convergence guarantees are strong and obtained under realistic assumptions that are in particular easily satisfied in the Cox model. The proofs of convergence are correct as far as I was able to check (and assuming that the lemmas imported from previous papers are themselves correct).

The experiments are sound and convincingly illustrate that the proposed algorithm performs well compared to state-of-the-art algorithms. In particular, measuring performance by plotting the error as a function of the number of inner products computed seems to be a good metric in the case of the Cox model, and I appreciated that the authors were careful about designing their performance metric.

For the above reasons, I strongly recommend this paper for publication and only have minor comments and suggestions to improve the exposition (see below).

Comments about the exposition
-----------------------------

The paper is well-written and easy to read. Some minor comments and suggestions:

* In section 2, "SGD techniques" paragraph, second equation: \tilde{\theta} appears without having been defined yet, and the sentence below this equation does not completely clarify its definition either. Even though it becomes clear later in the paper, I recommend adding a sentence here saying that \tilde{\theta} is some estimate of the optimal solution at which a full gradient is known, and referring to Algorithm 1 and the analysis for details.

* In section 2, "Our setting" paragraph, it might help to explicitly mention that the distribution \pi_i used to estimate the gradient at data point i depends on the index i. This explains why a doubly stochastic algorithm seems necessary: if the distribution were some \pi independent of the index, one could directly approximate the full gradient by sampling from \pi, and the algorithm would reduce to a standard SGD algorithm. (A short derivation illustrating this dependence in the Cox model is sketched after this list.)

* In section 5, Figure 1: the caption should use "left" and "right" instead of "top" and "bottom".

* In section 5, "Hybrid SVRG algorithm" paragraph: it would be useful to say that the details are provided in Algorithm 4. Also, the justification that the hybrid algorithm is desirable because N_k increases exponentially did not completely convince me: given that time is measured by the number of inner products computed, the growth rate of N_k should not matter as long as the overall number of inner products is the same. Otherwise, it would mean that the inner products used to approximate the gradient at a single data point in the MCMC estimation are somehow not as "useful" as the inner products used to approximate the gradient in standard SVRG; if this is the case, some explanation (at least at an intuitive level) should be given.

* I noticed two glitches in the references:
  - the Xiao and Zhang 2010 paper does not have a title;
  - the Toulis and Airoldi 2014 paper is now known under a different title.
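As promised in the \pi_i comment above, here is the standard derivation showing why the sampling distribution is point-specific in the Cox model (my notation, which may differ from the paper's): the gradient of the negative log partial likelihood at failure i is an expectation over the risk set R_i under weights that depend on both i and \theta.

```latex
% Gradient of the negative log Cox partial likelihood at failure i
% (standard derivation; my notation, not necessarily the paper's):
\nabla f_i(\theta)
  = -x_i + \sum_{j \in R_i} \pi_i(j)\, x_j
  = -x_i + \mathbb{E}_{j \sim \pi_i}[x_j],
\qquad \text{where } \pi_i(j)
  = \frac{\exp(\theta^\top x_j)}{\sum_{k \in R_i} \exp(\theta^\top x_k)}.
```

Since \pi_i is supported on the risk set R_i, which changes with i, and its weights change with \theta, there is no single fixed distribution to sample from, which is exactly why the doubly stochastic scheme is needed.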