Summary
-------

This paper departs from the traditional setting of stochastic gradient descent
by considering a setting where computing the gradient of the loss function
(here, the negative log-likelihood) on a single data point is computationally
expensive. This is for example the case for the Cox proportional hazards
model, in which the observations are not independent and the probability of
"failure" of an individual at a given time depends on the individuals still
alive at that time.

The idea introduced in this paper is natural but also fairly general, and it
could be applied to settings beyond the Cox model (as mentioned in several
places in the paper): if the loss function at a single data point can be
written as an expectation, then it can be approximated using MCMC techniques.
This idea naturally yields a "doubly stochastic" algorithm which uses two
different sampling steps to approximate the true gradient: (1) random sampling
of a data point, as in standard SGD; (2) MCMC sampling to approximate the
gradient at the data point sampled in (1).

The paper provides convergence guarantees for the proposed algorithm under
standard assumptions (in particular, both the strongly convex and non-strongly
convex cases are analyzed). Finally, experiments show that the proposed
algorithm compares favorably to state-of-the-art algorithms.
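For concreteness, the following toy sketch summarizes the structure of one
iteration as I understand it. The quadratic integrand, the Gaussian targets
\pi_i and the Metropolis kernel are placeholders of my own, not the authors'
model (in the paper, the inner expectation comes from the Cox partial
likelihood and its risk sets):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy objective: F(theta) = (1/n) sum_i f_i(theta), with
    # f_i(theta) = E_{x ~ pi_i}[0.5 * (theta - x)^2] and pi_i = N(mu[i], 1),
    # a target that depends on the index i and that we pretend can only be
    # sampled via MCMC.  The minimizer of F is mu.mean().
    mu = np.array([-1.0, 0.5, 2.0])
    n = len(mu)

    def metropolis_step(x, i):
        # One random-walk Metropolis step targeting pi_i = N(mu[i], 1).
        prop = x + rng.normal(scale=0.5)
        log_accept = 0.5 * ((x - mu[i]) ** 2 - (prop - mu[i]) ** 2)
        return prop if np.log(rng.uniform()) < log_accept else x

    theta = 0.0
    for k in range(2000):
        i = rng.integers(n)              # (1) outer sampling: pick a data point
        x, grad = 0.0, 0.0               # fresh chain for pi_i
        for t in range(30):              # (2) inner sampling: short MCMC run
            x = metropolis_step(x, i)
            if t >= 20:                  # discard the first steps as burn-in
                grad += theta - x        # gradient of 0.5 * (theta - x)^2
        theta -= 0.05 * grad / 10        # SGD step on the doubly stochastic gradient

    print(theta)                         # approaches mu.mean() = 0.5, up to noise

The variance-reduced version analyzed in the paper replaces the plain SGD step
by a Prox-SVRG-style update; see the schematic note at the end of this report.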
General comments
----------------

The proposed algorithm is natural and elegantly combines ideas from the SGD
literature (more specifically, the Prox-SVRG algorithm of Xiao and Zhang) with
MCMC techniques. The idea seems general enough to be applied to many scenarios
beyond the Cox model, and it would be very interesting to see this done in
future work (as suggested by the authors).

The convergence guarantees are strong and obtained under realistic assumptions
which, in particular, are easily satisfied in the Cox model. The proofs of
convergence are correct as far as I was able to check (and assuming that the
lemmas imported from previous papers are themselves correct).

The experiments are sound and convincingly illustrate that the proposed
algorithm performs well compared to state-of-the-art algorithms. In
particular, measuring performance by plotting the error as a function of the
number of inner products computed seems to be a good metric in the case of the
Cox model, and I appreciated how carefully the authors designed their
performance metric.

For the above reasons, I strongly recommend this paper for publication and
only have minor comments and suggestions to improve the exposition (see
below).

Comments about the exposition
-----------------------------

The paper is well-written and easy to read. Some minor comments and
suggestions:

* In Section 2, "SGD techniques" paragraph, second equation: \tilde{\theta}
  appears without having been defined yet, and the sentence below this
  equation does not completely clarify the definition of \tilde{\theta}
  either. Even though it becomes clear later in the paper, I recommend adding
  a sentence here saying that \tilde{\theta} is some estimate of the optimal
  solution at which a full gradient is known, and referring to Algorithm 1 and
  the analysis for details (see also the schematic update after this list).

* In Section 2, "Our setting" paragraph: it would perhaps help to explicitly
  mention that the distribution \pi_i used to estimate the gradient at data
  point i depends on the index i. This explains why a doubly stochastic
  algorithm seems necessary: if the distribution were some \pi independent of
  the index, one could directly approximate the full gradient by sampling from
  \pi, and the algorithm would reduce to a standard SGD algorithm.

* In Section 5, Figure 1: the caption should use "left" and "right" instead of
  "top" and "bottom".

* In Section 5, "Hybrid SVRG algorithm" paragraph: it would be useful to say
  that the details are provided in Algorithm 4. Also, the justification that
  the hybrid algorithm is desirable because N_k increases exponentially was
  not completely convincing to me: given that time is measured in terms of the
  number of inner products computed, the rate of growth of N_k should not
  matter as long as the overall number of inner products is the same.
  Otherwise, this means that, somehow, the inner products used to approximate
  the gradient at a single data point in the MCMC estimation are not as
  "useful" as inner products used to approximate the gradient in standard
  SVRG. If this is the case, some explanation (at least at an intuitive level)
  should be given.

* I noticed two glitches in the references:
  - the Xiao and Zhang 2010 paper does not have a title;
  - the Toulis and Airoldi 2014 paper is now known under a different title.
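A note on the schematic update referred to in the first exposition comment
above: as I understand the algorithm, \tilde{\theta} plays the usual role of
the snapshot point in Prox-SVRG, so that the (idealized) stochastic gradient
at step k has the variance-reduced form

    g_k = \nabla f_{i_k}(\theta_k) - \nabla f_{i_k}(\tilde{\theta})
          + \nabla F(\tilde{\theta}),

where \nabla F(\tilde{\theta}) is a full gradient computed once at the
snapshot \tilde{\theta}, which is refreshed at the start of each epoch. In the
paper's setting, the two \nabla f_{i_k} terms are themselves replaced by their
MCMC estimates, which is what makes the algorithm doubly stochastic.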
