This paper introduces a new stochastic first-order method for the optimization
of a non-convex population risk. The main idea is to combine the standard Adam
procedure with techniques from the field of differential privacy to obtain
better generalization guarantees. Specifically, given an iid sample from the
population, the procedure splits it into two halves, the training data and the
test data, and gradients are computed only on the training data, unless they
differ significantly from the gradients computed on the test data. In cases
where there is a significant difference, the “test gradients” are revealed via
a differentially private mechanism and used instead of the “training
gradients”.

The authors study two concrete instantiations of the above procedure, based on
the Laplace and Gaussian mechanisms respectively, and provide theoretical
convergence guarantees for both of them, showing that the procedures converge
to a stationary point of the population loss. Finally, the procedures are
evaluated through experiments and claimed to perform better than other
state-of-the-art methods.
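For concreteness, the following is a minimal sketch of the check-and-release
step as I understood it from the summary above; the threshold rule, the noise
scale, and all names (noisy_gradient, tau, sigma) are my own illustrative
choices and not the authors' exact algorithm.

    import numpy as np

    def noisy_gradient(grad_train, grad_test, tau, sigma, rng):
        """One step of the check-and-release pattern summarized above.

        grad_train, grad_test: gradients at the current iterate, computed on
            the training half and on the held-out (test) half respectively.
        tau: threshold deciding when the gradients "differ significantly".
        sigma: noise scale of the Gaussian mechanism.
        rng: a numpy random generator, e.g. np.random.default_rng().
        """
        # Default case: the two halves agree, so the training gradient is
        # used as-is.
        if np.linalg.norm(grad_train - grad_test) <= tau:
            return grad_train
        # Otherwise, reveal the test gradient through a Gaussian mechanism:
        # independent Gaussian noise is added to every coordinate.
        return grad_test + rng.normal(scale=sigma, size=grad_test.shape)

In the Laplace-mechanism instantiation, the last line would presumably add
Laplace noise instead, e.g. rng.laplace(scale=b, size=grad_test.shape), with
the scale b calibrated to the sensitivity of the gradient.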
Detailed Comments
=================

The problem studied in this paper is a central question in the field of AI
nowadays: understanding the extent to which a model trained via stochastic
first-order methods to minimize the empirical risk also minimizes the
population risk. The proposed methods are interesting and based on a natural
application of differential privacy ideas, which should intuitively lead to
more robust out-of-sample performance.

I have several concerns about the significance and novelty of the results on
the one hand, and about the quality of writing and presentation on the other
hand.

1. Novelty and significance:

   * this paper seems very closely related to the paper “Towards Better
     Generalization of Adaptive Gradient Methods” published in NeurIPS'20 and
     mentioned in the related work of the present paper. The procedures
     studied in this prior work are also based on a combination of Adam and
     differential privacy techniques. The structure of the theoretical part of
     the present paper closely follows this prior work and the theoretical
     guarantees seem almost identical. Compare, in particular, Theorems 4 and
     5 of the present paper to Theorems 1 and 2 of the previous paper.

   * I wasn't able to find in the present paper a clear explanation of how the
     newly proposed procedure performs better than the ones in the previous
     paper. I did see some minor differences in the technical lemmas, but the
     theoretical guarantees for the main theorems (Theorems 4 and 5) seem
     identical.

   * relatedly, I was surprised not to find a comparison with the methods of
     the previous paper in the experimental evaluation.

   * finally, the paper claims that the proposed methods are superior to
     state-of-the-art methods, but the experimental section does not paint
     a clear picture: when looking at the test error in Figure 2, it seems to
     me that the proposed methods perform very similarly to Yogi and AdaBound.

2. Writing and presentation. The paper seems to have been hastily put together
   and would benefit from a thorough proofreading and polishing pass:

   * a lot of technical details are missing or unclear, making it hard to
     fully assess the correctness of the results or even to understand what
     they mean. A non-exhaustive list:

     - assumption 2 on page 2: the noisy approximation \nabla\hat f(w) and
       \tilde g_t haven't been defined yet at this point, so it is not clear
       what the assumption means. Looking further down in the paper, one gets
       to understand that it presumably applies to the \tilde g_t computed in
       lines 8 and 10 of Algorithm 1. However, these quantities are random
       variables, so it is not clear in which sense their norm is supposed to
       be bounded: almost surely? in expectation?

     - the descriptions of \phi_t and \psi_t below Algorithm 1 make it sound
       like they can be quite general, with the ones used in Adam given as
       just one example. However, the analysis in the supplementary material
       is only done for this one example, although this is not explicitly
       stated in the theorem statements.

     - relatedly, one has to guess that all the arithmetic operations in line
       13 of the algorithm are performed componentwise when applied to
       vectors. While this is fairly standard in this context, it is
       surprising to only see it mentioned inside a proof in the supplementary
       material (a sketch of the componentwise reading I assumed is given at
       the end of these comments).

     - the methods evaluated in the experimental section are in fact based on
       a minibatch variant of the methods introduced in the paper. Nothing is
       said about this variant in the main paper; it is only described in the
       supplementary material.

     - Theorems 2, 4 and 6 start with "let \tilde g_1, ..., \tilde g_T" but
       then give a statement about \hat g_1, ..., \hat g_T, so it is not clear
       which of the two quantities is actually considered. Looking at the
       proof in the supplementary material, it seems that the statement is in
       fact about \tilde g_1, ..., \tilde g_T, but the reader would have to
       guess unless they look at the proof details.

   * many sentences in the informal exposition, in particular in the
     introduction and in the introductory paragraphs of the sections, seem to
     have been left in an intermediate state between two variants, probably as
     a result of hasty editing.
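For reference, this is the componentwise reading of line 13 that I assumed
while reviewing, using the standard Adam choices of \phi_t and \psi_t (bias
correction omitted); the variable names and default constants below are mine,
not necessarily what the authors intend.

    import numpy as np

    def adam_style_update(w, g, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam-style step; every operation acts componentwise."""
        m = beta1 * m + (1 - beta1) * g       # \phi_t: average of gradients
        v = beta2 * v + (1 - beta2) * g ** 2  # \psi_t: average of squared grads
        # Division, square root and addition are all coordinate by coordinate.
        w = w - lr * m / (np.sqrt(v) + eps)
        return w, m, v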
