This paper introduces a new stochastic first-order method for the optimization
of a non-convex population risk. The main idea is to combine the standard Adam
procedure with techniques from differential privacy in order to obtain better
generalization guarantees. Specifically, given an iid sample from the
population, the procedure splits it into two halves, the training data and the
test data, and gradients are computed only on the training data, unless they
differ significantly from the gradients computed on the test data. When there
is a significant difference, the “test gradients” are released via
a differentially private mechanism and used instead of the “training
gradients”.
The authors study two concrete instantiations of the above procedure, based on
the Laplace and Gaussian mechanisms respectively, and provide theoretical
convergence guarantees for both, showing that the procedures converge to
a stationary point of the population loss. Finally, the procedures are
evaluated experimentally and claimed to outperform other state-of-the-art
methods.
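For concreteness, here is a minimal sketch of one step of the procedure as
I understand it. The function names (grad, adam_update), the norm-threshold
acceptance rule, and the plain Gaussian perturbation are my own guesses for
illustration, not details taken from the paper; the Laplace instantiation
would presumably swap the Gaussian noise for Laplace noise of matching scale.

    import numpy as np

    def dp_adam_step(w, train_batch, test_batch, grad, adam_update,
                     threshold=1.0, sigma=1.0, rng=None):
        # Sketch of my reading of one iteration; names and constants are guesses.
        rng = np.random.default_rng() if rng is None else rng
        g_train = grad(w, train_batch)  # gradient on the training half
        g_test = grad(w, test_batch)    # gradient on the held-out ("test") half
        if np.linalg.norm(g_train - g_test) <= threshold:
            g = g_train                 # gradients agree: keep the training gradient
        else:
            # gradients disagree: release the test gradient via a Gaussian mechanism
            g = g_test + sigma * rng.standard_normal(g_test.shape)
        return adam_update(w, g)        # hand the (possibly noisy) gradient to Adam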
Detailed Comments
=================
The problem studied in this paper is a central question in machine learning
today: understanding the extent to which a model trained via stochastic
first-order methods to minimize the empirical risk also minimizes the
population risk. The proposed methods are interesting and based on a natural
application of differential privacy ideas, which should intuitively lead to
more robust out-of-sample performance.
I have several concerns about the novelty and significance of the results on
the one hand, and about the quality of the writing and presentation on the
other.
1. Novelty and significance:
* this paper seems very closely related to the paper “Towards Better
Generalization of Adaptive Gradient Methods” published in NeurIPS'20 and
mentioned in the related work of the present paper. The procedures studied
in this prior work are also based on a combination of Adam and
differential privacy techniques. The structure of the theoretical part of
the present paper closely follows this prior work, and the theoretical
guarantees seem almost identical. Compare, in particular, Theorems 4 and
5 of the present paper to Theorems 1 and 2 of the previous paper.
* I wasn't able to find in the present paper a clear explanation of how the
newly proposed procedures perform better than the ones in the previous
paper. I did see some minor differences in the technical lemmas, but the
theoretical guarantees for the main theorems (Theorems 4 and 5) seem
identical.
* relatedly, I was surprised not to find a comparison with the methods in
the previous paper in the experimental evaluation.
* finally, the paper claims that the proposed methods are superior to
state-of-the-art methods, but the experimental section does not paint
a clear picture: looking at the test error in Figure 2, the proposed
methods seem to me to perform very similarly to Yogi and AdaBound.
2. Writing and presentation. The paper seems to have been hastily put together
and would benefit from a thorough proofreading and polishing pass:
* a lot of technical details are missing or unclear, making it hard to fully
assess the correctness of the results or even to understand what they mean.
A non-exhaustive list:
- Assumption 2 on page 2: the noisy approximations \nabla\hat f(w) and
\tilde g_t have not been defined at this point, so it is not clear what the
assumption means. Looking further down in the paper, one gathers that it
presumably applies to the \tilde g_t computed in lines 8 and 10 of
Algorithm 1. However, these quantities are random variables, so I don't
know in which sense their norm is supposed to be bounded: almost surely?
in expectation?
- the descriptions of \phi_t and \psi_t below Algorithm 1 make them sound
quite general, with the ones used in Adam given as just one example.
However, the analysis in the supplementary material is carried out only for
this one example, although this is not explicitly stated in the theorem
statements. (I spell out the Adam instance I have in mind in the note at
the end of this review.)
- relatedly, one has to guess that all the arithmetic operations in line 13
of the algorithm are performed componentwise when applied to vectors.
While this is fairly standard in this context, it is surprising to only
see it mentioned inside a proof in the supplementary material.
- the methods evaluated in the experimental section are in fact based on
a minibatch variant of the methods introduced in the paper. Nothing is
said about this variant in the main paper and it is only described in the
supplementary material.
- Theorems 2, 4 and 6 start with "let \tilde g_1, ..., \tilde g_T" but then
give a statement about \hat g_1, ..., \hat g_T, so it is not clear which of
the two quantities is actually considered. Looking at the proofs in the
supplementary material, it seems that the statements are in fact about
\tilde g_1, ..., \tilde g_T, but the reader would have to guess unless they
look at the proof details.
* many sentences in the informal exposition, in particular in the
introduction and in the introductory paragraphs of the sections, seem to
have been left in an intermediate state between two variants, probably as
a result of hasty editing.
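A note to make the comment on \phi_t and \psi_t above concrete: the Adam
instance I have in mind is the standard one, i.e. \phi_t(g_1, ..., g_t) = m_t
with m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, and \psi_t(g_1, ..., g_t) = v_t
with v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, all operations taken
componentwise. Whether the authors also intend the usual bias corrections is
not clear to me from the paper, so that part is my own guess.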