Summary
-------

This paper departs from the traditional setting of stochastic gradient descent
by considering a setting where computing the gradient of the loss function
(here, the negative log-likelihood) on a single data point is computationally
expensive. This is for example the case for the Cox proportional hazards
model, in which the observations are not independent and the probability of
"failure" of an individual at a given time depends on the individuals still
alive at that time.

The idea introduced in this paper is natural but also fairly general, and it
could be applied to settings beyond the Cox model (as mentioned in several
places in the paper): if the loss function at a single data point can be
written as an expectation, then it can be approximated using MCMC techniques.
This idea naturally yields a "doubly stochastic" algorithm which uses two
different sampling steps to approximate the true gradient: (1) random sampling
of a data point, as in standard SGD; (2) MCMC sampling to approximate the
gradient at the data point sampled in (1).

The paper provides convergence guarantees for the proposed algorithm under
standard assumptions (in particular, both the strongly convex and non-strongly
convex cases are analyzed). Finally, experiments show that the proposed
algorithm compares favorably to state-of-the-art algorithms.
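For concreteness, the following toy sketch summarizes the structure of one
iteration as I understand it. The quadratic integrand, the Gaussian targets
\pi_i and the Metropolis kernel are placeholders of my own, not the authors'
model (in the paper, the inner expectation comes from the Cox partial
likelihood and its risk sets):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy objective: F(theta) = (1/n) sum_i f_i(theta), with
    # f_i(theta) = E_{x ~ pi_i}[0.5 * (theta - x)^2] and pi_i = N(mu[i], 1),
    # a target that depends on the index i and that we pretend can only be
    # sampled via MCMC.  The minimizer of F is mu.mean().
    mu = np.array([-1.0, 0.5, 2.0])
    n = len(mu)

    def metropolis_step(x, i):
        # One random-walk Metropolis step targeting pi_i = N(mu[i], 1).
        prop = x + rng.normal(scale=0.5)
        log_accept = 0.5 * ((x - mu[i]) ** 2 - (prop - mu[i]) ** 2)
        return prop if np.log(rng.uniform()) < log_accept else x

    theta = 0.0
    for k in range(2000):
        i = rng.integers(n)              # (1) outer sampling: pick a data point
        x, grad = 0.0, 0.0               # fresh chain for pi_i
        for t in range(30):              # (2) inner sampling: short MCMC run
            x = metropolis_step(x, i)
            if t >= 20:                  # discard the first steps as burn-in
                grad += theta - x        # gradient of 0.5 * (theta - x)^2
        theta -= 0.05 * grad / 10        # SGD step on the doubly stochastic gradient

    print(theta)                         # approaches mu.mean() = 0.5, up to noise

The variance-reduced version analyzed in the paper replaces the plain SGD step
by a Prox-SVRG-style update; see the schematic note at the end of this report.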
General comments
----------------

The proposed algorithm is natural and elegantly combines ideas from the SGD
literature (more specifically, the Prox-SVRG algorithm of Xiao and Zhang) with
MCMC techniques. The idea seems general enough to be applied to many scenarios
beyond the Cox model, and it would be very interesting to see this done in
future work (as suggested by the authors).

The convergence guarantees are strong and obtained under realistic assumptions
which, in particular, are easily satisfied in the Cox model. The proofs of
convergence are correct as far as I was able to check (and assuming that the
lemmas imported from previous papers are themselves correct).

The experiments are sound and convincingly illustrate that the proposed
algorithm performs well compared to state-of-the-art algorithms. In
particular, measuring performance by plotting the error as a function of the
number of inner products computed seems to be a good metric in the case of the
Cox model, and I appreciated how carefully the authors designed their
performance metric.

For the above reasons, I strongly recommend this paper for publication and
only have minor comments and suggestions to improve the exposition (see
below).

Comments about the exposition
-----------------------------

The paper is well-written and easy to read. Some minor comments and
suggestions:

* In Section 2, "SGD techniques" paragraph, second equation: \tilde{\theta}
  appears without having been defined yet, and the sentence below this
  equation does not completely clarify the definition of \tilde{\theta}
  either. Even though it becomes clear later in the paper, I recommend adding
  a sentence here saying that \tilde{\theta} is some estimate of the optimal
  solution at which a full gradient is known, and referring to Algorithm 1 and
  the analysis for details (see also the schematic update after this list).

* In Section 2, "Our setting" paragraph: it would perhaps help to explicitly
  mention that the distribution \pi_i used to estimate the gradient at data
  point i depends on the index i. This explains why a doubly stochastic
  algorithm seems necessary: if the distribution were some \pi independent of
  the index, one could directly approximate the full gradient by sampling from
  \pi, and the algorithm would reduce to a standard SGD algorithm.

* In Section 5, Figure 1: the caption should use "left" and "right" instead of
  "top" and "bottom".

* In Section 5, "Hybrid SVRG algorithm" paragraph: it would be useful to say
  that the details are provided in Algorithm 4. Also, the justification that
  the hybrid algorithm is desirable because N_k increases exponentially was
  not completely convincing to me: given that time is measured in terms of the
  number of inner products computed, the rate of growth of N_k should not
  matter as long as the overall number of inner products is the same.
  Otherwise, this means that, somehow, the inner products used to approximate
  the gradient at a single data point in the MCMC estimation are not as
  "useful" as inner products used to approximate the gradient in standard
  SVRG. If this is the case, some explanation (at least at an intuitive level)
  should be given.

* I noticed two glitches in the references:
  - the Xiao and Zhang 2010 paper does not have a title;
  - the Toulis and Airoldi 2014 paper is now known under a different title.
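A note on the schematic update referred to in the first exposition comment
above: as I understand the algorithm, \tilde{\theta} plays the usual role of
the snapshot point in Prox-SVRG, so that the (idealized) stochastic gradient
at step k has the variance-reduced form

    g_k = \nabla f_{i_k}(\theta_k) - \nabla f_{i_k}(\tilde{\theta})
          + \nabla F(\tilde{\theta}),

where \nabla F(\tilde{\theta}) is a full gradient computed once at the
snapshot \tilde{\theta}, which is refreshed at the start of each epoch. In the
paper's setting, the two \nabla f_{i_k} terms are themselves replaced by their
MCMC estimates, which is what makes the algorithm doubly stochastic.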
