Summary
-------

This paper departs from the traditional setting of stochastic gradient descent by considering a setting where computing the gradient of the loss function (here, the negative log-likelihood) at a single data point is computationally expensive. This is for example the case for the Cox proportional hazards model, in which the observations are not independent and the probability of "failure" of an individual at a given time is a function of the individuals still alive at that time.

The idea introduced in this paper is natural but also fairly general and could be applied to settings beyond the Cox model (as is mentioned in several places in the paper): if the gradient of the loss function at a single data point can be written as an expectation, then it can be approximated using MCMC techniques. This idea naturally yields a "doubly stochastic" algorithm which uses two different sampling steps to approximate the true gradient: (1) random sampling of a data point, as in standard SGD; (2) MCMC sampling to approximate the gradient at the data point sampled in (1). A minimal sketch of this update is given below.

The paper provides convergence guarantees for the proposed algorithm under standard assumptions (in particular, both the strongly convex and non-strongly convex cases are analyzed). Finally, experiments show that the proposed algorithm compares favorably to state-of-the-art algorithms.
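To make the two levels of randomness concrete, here is a minimal sketch of one update of the doubly stochastic scheme as I understand it. The paper's actual algorithm adds Prox-SVRG-style variance reduction and a proximal step on top of this; `grad_i` and `mcmc_draws` are hypothetical stand-ins for the paper's Cox-model quantities, not names from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def doubly_stochastic_step(theta, n, grad_i, mcmc_draws, step_size, n_mcmc):
    """One doubly stochastic gradient update (sketch, my notation).

    grad_i(i, theta, z): integrand whose expectation over z ~ pi_i is the
                         true gradient of the loss at data point i.
    mcmc_draws(i, theta, m): m approximate samples from pi_i obtained by MCMC.
    """
    # (1) outer randomness: sample a data point uniformly, as in standard SGD
    i = rng.integers(n)
    # (2) inner randomness: MCMC draws targeting the point-specific pi_i
    zs = mcmc_draws(i, theta, n_mcmc)
    # average the integrand over the draws to estimate the gradient at point i
    grad_est = np.mean([grad_i(i, theta, z) for z in zs], axis=0)
    # plain SGD step; the paper uses a variance-reduced proximal update instead
    return theta - step_size * grad_est
```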
General comments
----------------

The proposed algorithm is natural and elegantly combines ideas from the SGD literature (more specifically the Prox-SVRG algorithm of Xiao and Zhang) with MCMC techniques. The idea seems general enough to be applied to many scenarios beyond the Cox model, and it would be very interesting to see this done in future work (as suggested by the authors).

The convergence guarantees are strong and obtained under realistic assumptions that are in particular easily satisfied in the Cox model. The proofs of convergence are correct as far as I was able to check (and assuming that the lemmas imported from previous papers are themselves correct).

The experiments are sound and convincingly illustrate that the proposed algorithm performs well compared to state-of-the-art algorithms. In particular, measuring performance by plotting the error as a function of the number of inner products computed seems to be a good metric in the case of the Cox model, and I appreciated that the authors were careful about designing their performance metric.

For the above reasons, I strongly recommend this paper for publication and only have minor comments and suggestions to improve the exposition (see below).

Comments about the exposition
-----------------------------

The paper is well-written and easy to read. Some minor comments and suggestions:

* In section 2, "SGD techniques" paragraph, second equation: \tilde{\theta} appears without having been defined yet, and the sentence below this equation does not completely clarify its definition either. Even though it becomes clear later in the paper, I recommend adding a sentence here saying that \tilde{\theta} is some estimate of the optimal solution at which a full gradient is known, and referring to Algorithm 1 and the analysis for details.

* In section 2, "Our setting" paragraph, it might help to explicitly mention that the distribution \pi_i used to estimate the gradient at data point i depends on the index i. This explains why a doubly stochastic algorithm seems necessary: if the distribution were some \pi independent of the index, one could directly approximate the full gradient by sampling from \pi, and the algorithm would reduce to a standard SGD algorithm. (A short derivation illustrating this dependence in the Cox model is sketched after this list.)

* In section 5, Figure 1: the caption should use "left" and "right" instead of "top" and "bottom".

* In section 5, "Hybrid SVRG algorithm" paragraph: it would be useful to say that the details are provided in Algorithm 4. Also, the justification that the hybrid algorithm is desirable because N_k increases exponentially did not completely convince me: given that time is measured by the number of inner products computed, the growth rate of N_k should not matter as long as the overall number of inner products is the same. Otherwise, it would mean that the inner products used to approximate the gradient at a single data point in the MCMC estimation are somehow not as "useful" as the inner products used to approximate the gradient in standard SVRG; if this is the case, some explanation (at least at an intuitive level) should be given.

* I noticed two glitches in the references:
  - the Xiao and Zhang 2010 paper does not have a title;
  - the Toulis and Airoldi 2014 paper is now known under a different title.
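As promised in the \pi_i comment above, here is the standard derivation showing why the sampling distribution is point-specific in the Cox model (my notation, which may differ from the paper's): the gradient of the negative log partial likelihood at failure i is an expectation over the risk set R_i under weights that depend on both i and \theta.

```latex
% Gradient of the negative log Cox partial likelihood at failure i
% (standard derivation; my notation, not necessarily the paper's):
\nabla f_i(\theta)
  = -x_i + \sum_{j \in R_i} \pi_i(j)\, x_j
  = -x_i + \mathbb{E}_{j \sim \pi_i}[x_j],
\qquad \text{where } \pi_i(j)
  = \frac{\exp(\theta^\top x_j)}{\sum_{k \in R_i} \exp(\theta^\top x_k)}.
```

Since \pi_i is supported on the risk set R_i, which changes with i, and its weights change with \theta, there is no single fixed distribution to sample from, which is exactly why the doubly stochastic scheme is needed.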