This paper introduces a new stochastic first-order method for the optimization
of a non-convex population risk. The main idea is to combine the standard Adam
procedure with techniques from differential privacy in order to obtain better
generalization guarantees. Specifically, given an iid sample from the
population, the procedure splits it into two halves, the training data and the
test data, and gradients are computed only on the training data, unless they
differ significantly from the gradients computed on the test data. When there
is a significant difference, the “test gradients” are released via
a differentially private mechanism and used instead of the “training
gradients”.
The authors study two concrete instantiations of the above procedure, based on
the Laplace and Gaussian mechanisms respectively, and provide theoretical
convergence guarantees for both, showing that the procedures converge to
a stationary point of the population loss. Finally, the procedures are
evaluated experimentally and claimed to outperform other state-of-the-art
methods.
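For concreteness, here is a minimal sketch of one step of the procedure as
I understand it. The function names (grad, adam_update), the norm-threshold
acceptance rule, and the plain Gaussian perturbation are my own guesses for
illustration, not details taken from the paper; the Laplace instantiation
would presumably swap the Gaussian noise for Laplace noise of matching scale.

    import numpy as np

    def dp_adam_step(w, train_batch, test_batch, grad, adam_update,
                     threshold=1.0, sigma=1.0, rng=None):
        # Sketch of my reading of one iteration; names and constants are guesses.
        rng = np.random.default_rng() if rng is None else rng
        g_train = grad(w, train_batch)  # gradient on the training half
        g_test = grad(w, test_batch)    # gradient on the held-out ("test") half
        if np.linalg.norm(g_train - g_test) <= threshold:
            g = g_train                 # gradients agree: keep the training gradient
        else:
            # gradients disagree: release the test gradient via a Gaussian mechanism
            g = g_test + sigma * rng.standard_normal(g_test.shape)
        return adam_update(w, g)        # hand the (possibly noisy) gradient to Adam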
Detailed Comments
=================
The problem studied in this paper is a central question in machine learning
today: understanding the extent to which a model trained via stochastic
first-order methods to minimize the empirical risk also minimizes the
population risk. The proposed methods are interesting and based on a natural
application of differential privacy ideas, which should intuitively lead to
more robust out-of-sample performance.
I have several concerns about the novelty and significance of the results on
the one hand, and about the quality of the writing and presentation on the
other.
1. Novelty and significance:
* this paper seems very closely related to the paper “Towards Better
Generalization of Adaptive Gradient Methods” published in NeurIPS'20 and
mentioned in the related work of the present paper. The procedures studied
in this prior work are also based on a combination of Adam and
differential privacy techniques. The structure of the theoretical part of
the present paper closely follows this prior work, and the theoretical
guarantees seem almost identical. Compare, in particular, Theorems 4 and
5 of the present paper to Theorems 1 and 2 of the previous paper.
* I wasn't able to find in the present paper a clear explanation of how the
newly proposed procedures perform better than the ones in the previous
paper. I did see some minor differences in the technical lemmas, but the
theoretical guarantees for the main theorems (Theorems 4 and 5) seem
identical.
* relatedly, I was surprised not to find a comparison with the methods in
the previous paper in the experimental evaluation.
* finally, the paper claims that the proposed methods are superior to
state-of-the-art methods, but the experimental section does not paint
a clear picture: looking at the test error in Figure 2, the proposed
methods seem to me to perform very similarly to Yogi and AdaBound.
2. Writing and presentation. The paper seems to have been hastily put together
and would benefit from a thorough proofreading and polishing pass:
* a lot of technical details are missing or unclear, making it hard to fully
assess the correctness of the results or even to understand what they mean.
A non-exhaustive list:
- Assumption 2 on page 2: the noisy approximations \nabla\hat f(w) and
\tilde g_t have not been defined at this point, so it is not clear what the
assumption means. Looking further down in the paper, one gathers that it
presumably applies to the \tilde g_t computed in lines 8 and 10 of
Algorithm 1. However, these quantities are random variables, so I don't
know in which sense their norm is supposed to be bounded: almost surely?
in expectation?
- the descriptions of \phi_t and \psi_t below Algorithm 1 make them sound
quite general, with the ones used in Adam given as just one example.
However, the analysis in the supplementary material is carried out only for
this one example, although this is not explicitly stated in the theorem
statements. (I spell out the Adam instance I have in mind in the note at
the end of this review.)
- relatedly, one has to guess that all the arithmetic operations in line 13
of the algorithm are performed componentwise when applied to vectors.
While this is fairly standard in this context, it is surprising to only
see it mentioned inside a proof in the supplementary material.
- the methods evaluated in the experimental section are in fact based on
a minibatch variant of the methods introduced in the paper. Nothing is
said about this variant in the main paper and it is only described in the
supplementary material.
- Theorems 2, 4 and 6 start with "let \tilde g_1, ..., \tilde g_T" but then
give a statement about \hat g_1, ..., \hat g_T, so it is not clear which of
the two quantities is actually considered. Looking at the proofs in the
supplementary material, it seems that the statements are in fact about
\tilde g_1, ..., \tilde g_T, but the reader would have to guess unless they
look at the proof details.
* many sentences in the informal exposition, in particular in the
introduction and in the introductory paragraphs of the sections, seem to
have been left in an intermediate state between two variants, probably as
a result of hasty editing.
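A note to make the comment on \phi_t and \psi_t above concrete: the Adam
instance I have in mind is the standard one, i.e. \phi_t(g_1, ..., g_t) = m_t
with m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, and \psi_t(g_1, ..., g_t) = v_t
with v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, all operations taken
componentwise. Whether the authors also intend the usual bias corrections is
not clear to me from the paper, so that part is my own guess.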