This paper introduces a new stochastic first-order method for the optimization
of a non-convex population risk. The main idea is to combine the standard Adam
procedure with techniques from the field of differential privacy to obtain
better generalization guarantees. Specifically, given an iid sample from the
population, the procedure splits it into two halves, the training data and the
test data, and gradients are computed only on the training data, unless they
differ significantly from the gradients computed on the test data. In cases
where there is a significant difference, the “test gradients” are revealed via
a differentially private mechanism and used instead of the “training
gradients”.

The authors study two concrete instantiations of the above procedure, based on
the Laplace and Gaussian mechanisms respectively, and provide theoretical
convergence guarantees for both of them, showing that the procedures converge
to a stationary point of the population loss. Finally, the procedures are
evaluated through experiments and claimed to perform better than other
state-of-the-art methods.
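For concreteness, the following is a minimal sketch of the check-and-release
step as I understood it from the summary above; the threshold rule, the noise
scale, and all names (noisy_gradient, tau, sigma) are my own illustrative
choices and not the authors' exact algorithm.

    import numpy as np

    def noisy_gradient(grad_train, grad_test, tau, sigma, rng):
        """One step of the check-and-release pattern summarized above.

        grad_train, grad_test: gradients at the current iterate, computed on
            the training half and on the held-out (test) half respectively.
        tau: threshold deciding when the gradients "differ significantly".
        sigma: noise scale of the Gaussian mechanism.
        rng: a numpy random generator, e.g. np.random.default_rng().
        """
        # Default case: the two halves agree, so the training gradient is
        # used as-is.
        if np.linalg.norm(grad_train - grad_test) <= tau:
            return grad_train
        # Otherwise, reveal the test gradient through a Gaussian mechanism:
        # independent Gaussian noise is added to every coordinate.
        return grad_test + rng.normal(scale=sigma, size=grad_test.shape)

In the Laplace-mechanism instantiation, the last line would presumably add
Laplace noise instead, e.g. rng.laplace(scale=b, size=grad_test.shape), with
the scale b calibrated to the sensitivity of the gradient.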
Detailed Comments
=================

The problem studied in this paper is a central question in the field of AI
nowadays: understanding the extent to which a model trained via stochastic
first-order methods to minimize the empirical risk also minimizes the
population risk. The proposed methods are interesting and based on a natural
application of differential privacy ideas, which should intuitively lead to
more robust out-of-sample performance.

I have several concerns about the significance and novelty of the results on
the one hand, and about the quality of writing and presentation on the other
hand.

1. Novelty and significance:

   * this paper seems very closely related to the paper “Towards Better
     Generalization of Adaptive Gradient Methods” published in NeurIPS'20 and
     mentioned in the related work of the present paper. The procedures
     studied in this prior work are also based on a combination of Adam and
     differential privacy techniques. The structure of the theoretical part of
     the present paper closely follows this prior work and the theoretical
     guarantees seem almost identical. Compare, in particular, Theorems 4 and
     5 of the present paper to Theorems 1 and 2 of the previous paper.

   * I wasn't able to find in the present paper a clear explanation of how the
     newly proposed procedure performs better than the ones in the previous
     paper. I did see some minor differences in the technical lemmas, but the
     theoretical guarantees for the main theorems (Theorems 4 and 5) seem
     identical.

   * relatedly, I was surprised not to find a comparison with the methods of
     the previous paper in the experimental evaluation.

   * finally, the paper claims that the proposed methods are superior to
     state-of-the-art methods, but the experimental section does not paint
     a clear picture: when looking at the test error in Figure 2, it seems to
     me that the proposed methods perform very similarly to Yogi and AdaBound.

2. Writing and presentation. The paper seems to have been hastily put together
   and would benefit from a thorough proofreading and polishing pass:

   * a lot of technical details are missing or unclear, making it hard to
     fully assess the correctness of the results or even to understand what
     they mean. A non-exhaustive list:

     - assumption 2 on page 2: the noisy approximation \nabla\hat f(w) and
       \tilde g_t haven't been defined yet at this point, so it is not clear
       what the assumption means. Looking further down in the paper, one gets
       to understand that it presumably applies to the \tilde g_t computed in
       lines 8 and 10 of Algorithm 1. However, these quantities are random
       variables, so it is not clear in which sense their norm is supposed to
       be bounded: almost surely? in expectation?

     - the descriptions of \phi_t and \psi_t below Algorithm 1 make it sound
       like they can be quite general, with the ones used in Adam given as
       just one example. However, the analysis in the supplementary material
       is only done for this one example, although this is not explicitly
       stated in the theorem statements.

     - relatedly, one has to guess that all the arithmetic operations in line
       13 of the algorithm are performed componentwise when applied to
       vectors. While this is fairly standard in this context, it is
       surprising to only see it mentioned inside a proof in the supplementary
       material (a sketch of the componentwise reading I assumed is given at
       the end of these comments).

     - the methods evaluated in the experimental section are in fact based on
       a minibatch variant of the methods introduced in the paper. Nothing is
       said about this variant in the main paper; it is only described in the
       supplementary material.

     - Theorems 2, 4 and 6 start with "let \tilde g_1, ..., \tilde g_T" but
       then give a statement about \hat g_1, ..., \hat g_T, so it is not clear
       which of the two quantities is actually considered. Looking at the
       proof in the supplementary material, it seems that the statement is in
       fact about \tilde g_1, ..., \tilde g_T, but the reader would have to
       guess unless they look at the proof details.

   * many sentences in the informal exposition, in particular in the
     introduction and in the introductory paragraphs of the sections, seem to
     have been left in an intermediate state between two variants, probably as
     a result of hasty editing.
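For reference, this is the componentwise reading of line 13 that I assumed
while reviewing, using the standard Adam choices of \phi_t and \psi_t (bias
correction omitted); the variable names and default constants below are mine,
not necessarily what the authors intend.

    import numpy as np

    def adam_style_update(w, g, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam-style step; every operation acts componentwise."""
        m = beta1 * m + (1 - beta1) * g       # \phi_t: average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2  # \psi_t: average of squared grads
        # Division, square root and addition are all coordinate by coordinate.
        w = w - lr * m / (np.sqrt(v) + eps)
        return w, m, v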
