diff --git a/ijcai-2021-47.txt b/ijcai-2021-47.txt
new file mode 100644
index 0000000..1c40bae
--- /dev/null
+++ b/ijcai-2021-47.txt
@@ -0,0 +1,99 @@
+This paper introduces a new stochastic first-order method for the optimization
+of a non-convex population risk. The main idea is to combine the standard Adam
+procedure with techniques from the field of differential privacy to obtain
+better generalization guarantees. Specifically, given an iid sample from the
+population, the procedure splits it into two halves, the training data and the
+test data, and gradients are computed only on the training data, unless they
+differ significantly from the gradients computed on the test data. In cases
+where there is a significant difference, the “test gradients” are revealed via
+a differentially private mechanism and used instead of the “training
+gradients”.
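+
+(For concreteness, the following is a minimal sketch of my reading of this
+gradient-selection step; the names, the thresholding rule and the dp_release
+hook are mine, not the authors'.)
+
+    import numpy as np
+
+    def select_gradient(g_train, g_test, threshold, dp_release, rng):
+        # My reading of the core step: keep the training gradient as long as
+        # it stays close to the held-out ("test") gradient; otherwise reveal
+        # the test gradient through a differentially private mechanism and
+        # use that instead.
+        if np.linalg.norm(g_train - g_test) <= threshold:
+            return g_train
+        return dp_release(g_test, rng)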
+
+The authors study two concrete instantiations of the above procedure, based on
+the Laplace and Gaussian mechanisms respectively, and provide theoretical
+convergence guarantees for both of them, showing that the procedures converge
+to a stationary point of the population loss. Finally, the procedures are
+evaluated through experiments and claimed to perform better than other
+state-of-the-art methods.
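+
+(Again only as a sketch of my understanding: the two instantiations presumably
+plug the textbook Laplace and Gaussian mechanisms into the dp_release hook
+above; the exact sensitivity and calibration used in the paper may differ.)
+
+    import numpy as np
+
+    def laplace_release(g, l1_sensitivity, epsilon, rng):
+        # Textbook Laplace mechanism: noise scale = L1 sensitivity / epsilon.
+        return g + rng.laplace(0.0, l1_sensitivity / epsilon, size=g.shape)
+
+    def gaussian_release(g, l2_sensitivity, epsilon, delta, rng):
+        # Textbook Gaussian mechanism (for epsilon < 1):
+        # sigma = L2 sensitivity * sqrt(2 ln(1.25 / delta)) / epsilon.
+        sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
+        return g + rng.normal(0.0, sigma, size=g.shape)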
+
+Detailed Comments
+=================
+
+The problem studied in this paper is a central question in the field of AI
+today: understanding the extent to which a model trained via stochastic
+first-order methods to minimize the empirical risk also minimizes the
+population risk. The proposed methods are interesting and based on a natural
+application of differential privacy ideas, which should intuitively lead to
+more robust out-of-sample performance.
+
+I have several concerns about the significance and novelty of the results on
+the one hand, and about the quality of the writing and presentation on the
+other.
+
+1. Novelty and significance:
+
+    * this paper seems very closely related to the paper “Towards Better
+      Generalization of Adaptive Gradient Methods” published in NeurIPS'20 and
+      mentioned in the related work of the present paper. The procedures
+      studied in this prior work are also based on a combination of Adam and
+      differential privacy techniques. The structure of the theoretical part
+      of the present paper closely follows this prior work, and the
+      theoretical guarantees seem almost identical. Compare, in particular,
+      Theorems 4 and 5 of the present paper to Theorems 1 and 2 of the
+      previous paper.
+
+    * I was unable to find in the present paper a clear explanation of how
+      the newly proposed procedures perform better than the ones in the
+      previous paper. I did see some minor differences in the technical
+      lemmas, but the theoretical guarantees for the main theorems (Theorems
+      4 and 5) seem identical.
+
+    * relatedly, I was surprised that the experimental evaluation does not
+      include a comparison with the methods from this previous paper.
+
+    * finally, the paper claims that the proposed methods are superior to
+      state-of-the-art methods, but the experimental section does not paint
+      a clear picture: looking at the test error in Figure 2, the proposed
+      methods seem to me to perform very similarly to Yogi and AdaBound.
+
+2. Writing and presentation. The paper seems to have been hastily put together
+   and would benefit from a thorough proofreading and polishing pass:
+
+    * a lot of technical details are missing or unclear, making it hard to
+      fully assess the correctness of the results or even to understand what
+      they mean. A non-exhaustive list:
+
+      - assumption 2 on page 2: the noisy approximation \nabla\hat f(w) and
+        \tilde g_t have not been defined yet at this point, so it is not
+        clear what the assumption means. Looking further down in the paper,
+        one comes to understand that it presumably applies to the \tilde g_t
+        computed in lines 8 and 10 of Algorithm 1. However, these quantities
+        are random variables, so it is unclear in which sense their norm is
+        bounded: almost surely? in expectation?
+
+      - the descriptions of \phi_t and \psi_t below Algorithm 1 make it sound
+        like they can be quite general, with the ones used in Adam given as
+        just one example (see the sketch at the end of this list for the Adam
+        instantiation I have in mind). However, the analysis in the
+        supplementary material is carried out only for this one example, and
+        this restriction is not explicitly stated in the theorem statements.
+
+      - relatedly, one has to guess that all arithmetic operations in line 13
+        of the algorithm are applied componentwise to vectors (again, see the
+        sketch below). While this is fairly standard in this context, it is
+        surprising to see it mentioned only inside a proof in the
+        supplementary material.
+
+      - the methods evaluated in the experimental section are in fact based
+        on a minibatch variant of the methods introduced in the paper.
+        Nothing is said about this variant in the main paper, and it is
+        described only in the supplementary material.
+
+      - Theorems 2, 4 and 6 start with "let \tilde g_1, ..., \tilde g_T" but
+        then give a statement about \hat g_1, ..., \hat g_T, so it is not
+        clear which of the two quantities is actually considered. Looking at
+        the proof in the supplementary material, it seems that the statement
+        is in fact about \tilde g_1, ..., \tilde g_T, but the reader has to
+        guess this unless they look at the proof details.
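+
+      For reference, the following is the standard Adam instantiation of
+      \phi_t and \psi_t that I assume is meant throughout the analysis, with
+      the componentwise update of line 13 written out explicitly; this is my
+      own sketch of standard Adam, not a quote of the paper's Algorithm 1.
+
+          import numpy as np
+
+          def adam_style_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
+                              eps=1e-8):
+              # phi_t: exponential moving average of past gradients.
+              m = b1 * m + (1.0 - b1) * g
+              # psi_t: exponential moving average of past squared gradients.
+              v = b2 * v + (1.0 - b2) * g ** 2
+              # Bias-corrected estimates, as in standard Adam.
+              m_hat = m / (1.0 - b1 ** t)
+              v_hat = v / (1.0 - b2 ** t)
+              # The update itself; every operation below is componentwise.
+              w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
+              return w, m, v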
+
+    * many sentences in the informal exposition, in particular in the
+      introduction and in the introductory paragraphs of the sections, seem
+      to have been left in an intermediate state between two variants,
+      probably as a result of hasty editing.