This paper introduces a new stochastic first-order method for the optimization of a non-convex population risk. The main idea is to combine the standard Adam procedure with techniques from the field of differential privacy to obtain better generalization guarantees. Specifically, given an iid sample from the population, the procedure splits it into two halves, the training data and the test data, and gradients are computed only on the training data, unless they differ significantly from the gradients computed on the test data. In cases where there is a significant difference, the "test gradients" are revealed via a differentially private mechanism and used instead of the "training gradients" (a rough sketch of my reading of this mechanism is included at the end of this review). The authors study two concrete instantiations of the above procedure, based on the Laplace and Gaussian mechanisms respectively, and provide theoretical convergence guarantees for both of them, showing that the procedures converge to a stationary point of the population loss. Finally, the procedures are evaluated through experiments and claimed to perform better than other state-of-the-art methods.

Detailed Comments
=================

The problem studied in this paper is a central question in the field of AI nowadays: understanding the extent to which a model trained via stochastic first-order methods to minimize the empirical risk also minimizes the population risk. The proposed methods are interesting and based on a natural application of differential privacy ideas, which should intuitively lead to more robust out-of-sample performance. I have several concerns about the significance and novelty of the results on the one hand, and about the quality of writing and presentation on the other hand.

1. Novelty and significance:

* this paper seems very closely related to the paper "Towards Better Generalization of Adaptive Gradient Methods" published in NeurIPS'20 and mentioned in the related work of the present paper. The procedures studied in this prior work are also based on a combination of Adam and differential privacy techniques. The structure of the theoretical part of the present paper follows this prior work closely, and the theoretical guarantees seem almost identical. Compare, in particular, Theorems 4 and 5 of the present paper to Theorems 1 and 2 of the previous paper.

* I wasn't able to find in the present paper a clear explanation of how the newly proposed procedure performs better than the ones in the previous paper. I did see some minor differences in the technical lemmas, but the theoretical guarantees for the main theorems (Theorems 4 and 5) seem identical.

* relatedly, I was surprised not to find a comparison with the methods of the previous paper in the experimental evaluation.

* finally, the paper claims that the proposed methods are superior to state-of-the-art methods, but the experimental section does not paint a clear picture: when looking at the test error in Figure 2, it seems to me that the proposed methods perform very similarly to Yogi and AdaBound.

2. Writing and presentation. The paper seems to have been hastily put together and would benefit from a thorough proofreading and polishing pass:

* a lot of technical details are missing or unclear, making it hard to fully assess the correctness of the results or even understand what they mean. A non-exhaustive list:

  - Assumption 2 on page 2: the noisy approximation \nabla\hat f(w) and \tilde g_t haven't been defined yet at this point, so it is not clear what it means.
    Looking further down in the paper, one gets the impression that this assumption applies to the \tilde g_t computed in lines 8 and 10 of Algorithm 1. However, these quantities are random variables, so I don't know how to interpret the statement that their norm is bounded: is it almost surely? In expectation?

  - the descriptions of \phi_t and \psi_t below Algorithm 1 make it sound like they can be quite general, with the ones used in Adam given as just one example. However, the analysis in the supplementary material is only carried out for this one example, although this is not explicitly stated in the theorem statements.

  - relatedly, one has to guess that all the arithmetic operations in line 13 of the algorithm are performed componentwise when applied to vectors. While this is fairly standard in this context, it is surprising to only see it mentioned inside a proof in the supplementary material.

  - the methods evaluated in the experimental section are in fact based on a minibatch variant of the methods introduced in the paper. Nothing is said about this variant in the main paper; it is only described in the supplementary material.

  - Theorems 2, 4 and 6 start with "let \tilde g_1, ..., \tilde g_T" but then give a statement about \hat g_1, ..., \hat g_T, so it is not clear which of the two quantities is actually considered. Looking at the proof in the supplementary material, it seems that the statement is in fact about \tilde g_1, ..., \tilde g_T, but the reader has to guess unless they look at the proof details.

* many sentences in the informal exposition, in particular in the introduction and in the introductory paragraphs of the sections, seem to have been left in an intermediate state between two variants, probably as a result of hasty editing.
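
For concreteness, here is a minimal sketch of the gradient-release mechanism as I understand it from the paper's summary above, in its Laplace-mechanism instantiation. All names and parameter values (dp_gradient_step, threshold, laplace_scale, etc.) are my own illustrative choices rather than the authors' notation, and this is not intended as a faithful reimplementation of Algorithm 1.

```python
import numpy as np

def dp_gradient_step(train_grad, test_grad, threshold, laplace_scale, rng):
    """Illustrative sketch of the train/test gradient check (my reading).

    The training gradient is used by default. Only when it deviates from the
    held-out ("test") gradient by more than a noisy threshold is the test
    gradient released, perturbed via the Laplace mechanism.
    """
    # Noisy comparison of the two gradient estimates (overfitting check).
    gap = np.linalg.norm(train_grad - test_grad)
    noisy_threshold = threshold + rng.laplace(scale=laplace_scale)
    if gap <= noisy_threshold:
        # The two gradients agree: keep the training gradient.
        return train_grad
    # Significant disagreement: reveal the held-out gradient with Laplace
    # noise added, so that repeated accesses remain differentially private.
    noise = rng.laplace(scale=laplace_scale, size=test_grad.shape)
    return test_grad + noise

# Hypothetical usage inside an Adam-style update loop:
rng = np.random.default_rng(0)
g_tilde = dp_gradient_step(np.array([0.5, -0.2]), np.array([0.1, 0.3]),
                           threshold=0.2, laplace_scale=0.05, rng=rng)
```

If this reading is correct, the returned quantity would play the role of \tilde g_t in lines 8 and 10 of Algorithm 1, which is precisely why the interpretation of Assumption 2 (almost sure vs. in-expectation boundedness of a random quantity) matters.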