 -rw-r--r--  ijcai-2021-1821.txt  57
 -rw-r--r--  ijcai-2021-3017.txt  62
 -rw-r--r--  ijcai-2021-366.txt   54
 -rw-r--r--  ijcai-2021-47.txt    99
4 files changed, 272 insertions, 0 deletions
diff --git a/ijcai-2021-1821.txt b/ijcai-2021-1821.txt
new file mode 100644
index 0000000..e942f3a
--- /dev/null
+++ b/ijcai-2021-1821.txt
@@ -0,0 +1,57 @@
+This paper considers the problem of designing an efficient (welfare-optimal)
+combinatorial auction in the value query model. Specifically, there are
+n bidders and m items, each bidder has preferences over bundles (sets) of
+items, and the goal is to allocate the items to the bidders in a way which
+(approximately) maximizes social welfare, and to design payments which
+incentivize truthful reporting of preferences.
+
+Starting from the iterative MLCA mechanism of Brero et al. (2021) (in which
+bidders' value functions are approximated by a neural network trained on their
+answers to previous queries), the mechanism designed in the present paper goes
+one step further by making use of Fourier analysis for set functions.
+Specifically, motivated by the fact that bidders' valuations are likely to be
+approximately sparse in the Fourier domain (for a well-chosen transform), the
+mechanism finds a Fourier-sparse approximation of the neural networks
+approximating the bidders' preferences. This then requires computing the
+efficient allocation for these Fourier-sparse approximations, for which the
+authors develop an MIP-based algorithm.
+
+On the empirical side, the authors use the STATS test suite of Weiss et al.
+(2017) to generate synthetic instances of combinatorial auctions. After
+evaluating the extent to which bidders' valuations are Fourier-sparse, they
+show that their proposed mechanism compares favorably to MLCA.
+
+Detailed comments
+=================
+
+The paper considers an important and difficult problem and brings a fresh and
+interesting perspective by applying ideas from Fourier analysis for set
+functions. The paper was clear and well-written.
+
+My main issue with the proposed method is a conceptual one: while I believe
+that in many situations it is reasonable to assume that bidders' valuations
+are Fourier-sparse, I find that the hybrid approach somewhat undermines this
+assumption by having two competing types of approximations for bidders'
+valuations: one based on neural networks and one based on Fourier-sparse
+functions.
+
+While I understand that a purely Fourier-based approximation would be
+prohibitively expensive in terms of the number of queries made to the bidders,
+the interaction between the neural network approximation and the Fourier-sparse
+one is hard to reason about: for example, it could be that even if the true
+value functions are approximately Fourier-sparse, the neural network
+approximation, which is performed first, obfuscates this fact to some extent
+and makes it harder for the Fourier reconstruction to work.
+
+Overall, I would say that this hybrid approximation is not very well motivated
+from a conceptual perspective, and is only justified by the fact that it yields
+some improvements (although arguably fairly minor ones) when evaluated
+experimentally. However, it seems much harder to analyze theoretically
+(although this was not a goal of the present paper) and comes at a cost in
+terms of implementation. In particular, there are now more hyperparameters to
+tune, mainly the splits l_1 to l_4 for the number of queries.
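+
+To make the sparsity assumption above concrete (this is my own notation, and
+the Walsh-Hadamard-type basis below is only one possible choice, not
+necessarily the exact transform used in the paper): a valuation v is
+k-Fourier-sparse if it can be written as
+
+    v(S) = \sum_{T \in \mathcal{T}} \hat{v}(T) (-1)^{|S \cap T|},   with |\mathcal{T}| \le k.
+
+The hybrid pipeline first fits a neural network to the query answers and only
+then extracts such a sparse representation; nothing constrains the network's
+own Fourier spectrum to be sparse, which is the source of my concern above.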
+
+In conclusion, while I believe in using more ideas from Fourier analysis in the
+design of combinatorial auctions, I am not convinced that the cost of this
+specific hybrid instantiation, both from a conceptual and an implementation
+perspective, is justified by the minor improvements observed experimentally.
diff --git a/ijcai-2021-3017.txt b/ijcai-2021-3017.txt
new file mode 100644
index 0000000..b869aba
--- /dev/null
+++ b/ijcai-2021-3017.txt
@@ -0,0 +1,62 @@
+This paper introduces a general framework to find a truthful, (approximately)
+revenue-maximizing combinatorial auction mechanism inside a given class of
+mechanisms. A typical situation where this is useful is when considering
+a parametrized class of mechanisms for which the optimal value of the parameter
+depends on the (unknown) distribution of bidders and thus needs to be learned.
+
+The framework is based on the general idea of random sampling, where we learn
+the optimal mechanism on one portion of the population but apply it to another
+portion. The main idea of the framework—which can be thought of as a way to
+make it more robust to variability in the bidders' population—is to first draw
+a set of participatory bidders and then learn on the complement set of bidders
+by repeatedly drawing samples of bidders from the complement set and finding
+a mechanism which maximizes revenue *on average* over the samples. The
+mechanism thus identified is then applied to the set of participatory bidders.
+
+In addition to the introduction of this general framework, the main
+contributions of the paper are:
+
+1. a general bound on the revenue of the mechanisms output by this method,
+   which depends on various parameters quantifying the complexity of the
+   auction class and the variability in the population of bidders.
+
+2. a specific instantiation of the framework to a new class of auctions
+   introduced by the authors, bundling-boosted auctions, for which it is
+   possible to quantify more finely the various parameters described in item 1.
+
+3. an algorithm to efficiently find the mechanism maximizing expected revenue
+   (as required by the framework) for sparse bundling-boosted auctions.
+
+Detailed comments
+=================
+
+The problem considered in this paper is important and the paper brings a nice
+unifying perspective to it by introducing a general framework and analyzing
+its performance through learning-theoretic complexity measures.
+Furthermore, the paper is very clear and nicely written (despite the sadly
+restrictive 7-page limit…).
+
+* an important feature of the framework is that it completely gives up on
+  extracting any revenue from non-participatory bidders. While I understand
+  that this might be a necessary feature given the generality of the scheme,
+  some prior work partially avoids this loss, albeit in much more restrictive
+  settings. Maybe more could be said as motivation about the trade-off
+  between generality of the scheme and how much revenue we can hope to extract.
+
+* relatedly, I was a bit surprised to see that the bounds on the revenue of the
+  scheme are expressed with respect to the efficient allocation on the entire
+  population. Why use this as a benchmark, as opposed to, for example, the
+  optimal revenue that could be obtained on the entire population? This might
+  be standard in this line of work, but adding a sentence or two about this
+  choice might be helpful.
+
+* a drawback of the scheme, which again might be due to its generality, is that
+  its guarantees depend on parameters which might be hard to compute or even
+  bound in practice. Although the authors try to make them explicit for
+  specific auction classes, they still remain somewhat opaque. This is
+  particularly true of the partition discrepancy quantity.
+
+* the scheme is reminiscent of the bootstrap method in statistics, where
+  multiple samples are drawn with replacement from the same initial sample to
+  account for variability in the population. Maybe this could be mentioned
+  somewhere to connect the paper to the broader statistics/learning literature.
diff --git a/ijcai-2021-366.txt b/ijcai-2021-366.txt
new file mode 100644
index 0000000..f0be2e8
--- /dev/null
+++ b/ijcai-2021-366.txt
@@ -0,0 +1,54 @@
+This paper proposes a general stochastic first-order method for the
+optimization of a smooth non-convex function over a Riemannian manifold.
+
+While SGD is known to be minimax optimal in general (requiring O(eps^-4)
+stochastic gradient queries to find an eps-approximate stationary point), it
+has been shown that variants of SGD can achieve O(eps^-3) sample complexity
+when the objective function is required to satisfy a stronger, mean-squared
+smoothness assumption. Previous adaptations of these methods to the Riemannian
+case were either specific to certain manifolds or required large-batch gradient
+updates at certain iterations, leading to prohibitive computation costs.
+
+The method introduced in this paper can be described as the Riemannian
+counterpart of the Stochastic Recursive Momentum method of [Cutkosky et al.,
+2019]. The main idea is to interpolate smoothly, over the iterations of the
+algorithm, between vanilla Riemannian SGD and the momentum-based RSPRG
+algorithm of [Han et al., 2020; Zhou et al., 2019].
+
+The main theoretical result is that with a proper decaying schedule for the
+step size and the interpolation parameter (controlling how much of the momentum
+term to mix into the gradient update), the proposed algorithm converges at
+a nearly optimal rate (up to logarithmic terms) and only requires one random
+sample and two gradient computations per iteration. The result holds for
+arbitrary Riemannian manifolds and only requires tuning two hyperparameters.
+
+An experimental section evaluates the proposed algorithm on three distinct
+tasks: PCA (on the Grassmann manifold), ICA (on the Stiefel manifold) and
+Riemannian centroid computation, and shows that it consistently outperforms
+other Riemannian stochastic first-order methods.
+
+Detailed comments
+=================
+
+The problem considered in this paper is relevant since many tasks in data
+analysis are naturally constrained to a Riemannian manifold (for example the
+PCA and ICA tasks considered in the experimental section). Furthermore, it is
+satisfying, both from a theoretical and an applicability perspective, to have
+a general method which can be analyzed for arbitrary manifolds.
+
+The convergence result is strong and the proofs are correct as far as I was
+able to check. A big part of the derivation follows ideas similar to the
+Euclidean case, which I was able to verify, but not being an expert in
+Riemannian geometry, I could have missed a technical mistake specific to the
+Riemannian case.
+
+The paper is well-written (as far as the 7-page limit permits…) and I found the
+experimental evaluations very convincing.
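+
+For the record, my reading of the interpolation mentioned above (written in
+Euclidean notation, ignoring the parallel-transport operator that is needed to
+compare tangent vectors at different points of the manifold) is a STORM-type
+recursion for the gradient estimate d_t:
+
+    d_t = \nabla f(x_t; \xi_t) + (1 - a_t) (d_{t-1} - \nabla f(x_{t-1}; \xi_t)),
+
+so that a_t = 1 recovers vanilla (Riemannian) SGD and a_t = 0 recovers
+a SPIDER/SARAH-style recursive estimator; the schedule for a_t is what realizes
+the smooth interpolation between the two regimes.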
+
+I have a minor concern about the significance of the result because of the
+extra assumption of mean-squared Lipschitzness of the gradients. It was not
+clear to me how restrictive this assumption is and how often it holds in
+practice. In particular, I was surprised not to see any mention of it in the
+experimental section: do the examples considered there satisfy the assumption?
+I think it would improve the paper to say a bit more about this.
+
diff --git a/ijcai-2021-47.txt b/ijcai-2021-47.txt
new file mode 100644
index 0000000..1c40bae
--- /dev/null
+++ b/ijcai-2021-47.txt
@@ -0,0 +1,99 @@
+This paper introduces a new stochastic first-order method for the optimization
+of a non-convex population risk. The main idea is to combine the standard Adam
+procedure with techniques from the field of differential privacy to obtain
+better generalization guarantees. Specifically, given an iid sample from the
+population, the procedure splits it into two halves, the training data and the
+test data, and gradients are computed only on the training data, unless they
+differ significantly from the gradients computed on the test data. In cases
+where there is a significant difference, the “test gradients” are revealed via
+a differentially private mechanism and used instead of the “training
+gradients”. (A rough sketch of my reading of this step is included at the end
+of this review.)
+
+The authors study two concrete instantiations of the above procedure, based on
+the Laplace and Gaussian mechanisms respectively, and provide theoretical
+convergence guarantees for both of them, showing that the procedures converge
+to a stationary point of the population loss. Finally, the procedures are
+evaluated through experiments and claimed to perform better than other
+state-of-the-art methods.
+
+Detailed Comments
+=================
+
+The problem studied in this paper is a central question in the field of AI
+nowadays: understanding the extent to which a model trained via stochastic
+first-order methods to minimize the empirical risk also minimizes the
+population risk. The proposed methods are interesting and based on a natural
+application of differential privacy ideas which should intuitively lead to more
+robust out-of-sample performance.
+
+I have several concerns about the significance and novelty of the results on
+the one hand, and about the quality of writing and presentation on the other
+hand.
+
+1. Novelty and significance:
+
+   * this paper seems very closely related to the paper “Towards Better
+     Generalization of Adaptive Gradient Methods” published in NeurIPS'20 and
+     mentioned in the related work of the present paper. The procedures studied
+     in this prior work are also based on a combination of Adam and
+     differential privacy techniques. The structure of the theoretical part of
+     the present paper closely follows this prior work, and the theoretical
+     guarantees seem almost identical. Compare, in particular, Theorems 4 and
+     5 of the present paper to Theorems 1 and 2 of the previous paper.
+
+   * I wasn't able to find in the present paper a clear explanation of how the
+     newly proposed procedure performs better than the ones in the previous
+     paper. I did see some minor differences in the technical lemmas, but the
+     theoretical guarantees for the main theorems (Theorems 4 and 5) seem
+     identical.
+
+   * relatedly, I was surprised not to find a comparison with the methods of
+     the previous paper in the experimental evaluation.
+
+   * finally, the paper claims that the proposed methods are superior to
+     state-of-the-art methods, but the experimental section does not paint
+     a clear picture: when looking at the test error in Figure 2, it seems to
+     me that the proposed methods perform very similarly to Yogi and AdaBound.
+
+2. Writing and presentation. The paper seems to have been hastily put together
+   and would benefit from a thorough proofreading and polishing pass:
+
+   * a lot of technical details are missing or unclear, making it hard to fully
+     assess the correctness of the results or even understand what they mean.
+     A non-exhaustive list:
+
+     - assumption 2 on page 2: the noisy approximation \nabla\hat f(w) and
+       \tilde g_t haven't been defined yet at this point, so it is not clear
+       what it means. Looking further down in the paper, one gets to understand
+       that this assumption perhaps applies to the \tilde g_t computed in lines
+       8 and 10 of Algorithm 1. However, these quantities are random variables,
+       so I don't know how to interpret in which sense their norm is bounded:
+       is it almost surely? in expectation?
+
+     - the descriptions of \phi_t and \psi_t below Algorithm 1 make it sound
+       like they can be quite general, with the ones used in Adam given as just
+       one example. However, the analysis in the supplementary material is only
+       done for this one example, although this is not explicitly stated in the
+       theorem statements.
+
+     - relatedly, one has to guess that all the arithmetic operations in line 13
+       of the algorithm are performed componentwise when applied to vectors.
+       While this is fairly standard in this context, it is surprising to only
+       see it mentioned inside a proof in the supplementary material.
+
+     - the methods evaluated in the experimental section are in fact based on
+       a minibatch variant of the methods introduced in the paper. Nothing is
+       said about this variant in the main paper and it is only described in the
+       supplementary material.
+
+     - Theorems 2, 4 and 6 start with "let \tilde g_1, ... \tilde g_T" but then
+       give a statement about \hat g_1, ... \hat g_T, so it is not clear which
+       of the two quantities is actually considered. Looking at the proof in the
+       supplementary material, it seems that the statement is in fact about
+       \tilde g_1, ... , \tilde g_T, but the reader would have to guess unless
+       they look at the proof details.
+
+   * many sentences in the informal exposition, in particular in the
+     introduction and in the introductory paragraphs of the sections, seem to
+     have been left in an intermediate state between two variants, probably as
+     a result of hasty editing.
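+
+As an appendix to the comments above, here is a rough sketch of my reading of
+the check-and-reveal step (referenced in the summary at the top of this
+review). This is my own paraphrase, not the authors' code: the threshold tau,
+the noise scale sigma and the plain Adam update are placeholders and do not
+reflect the paper's exact \phi_t/\psi_t choices or privacy calibration.
+
+    import numpy as np
+
+    def private_gradient_step(w, grad_train, grad_test, m, v, t,
+                              tau=1.0, sigma=0.1, lr=1e-3,
+                              beta1=0.9, beta2=0.999, eps=1e-8):
+        # Use the training gradient unless it disagrees with the held-out
+        # (test) gradient; in that case, reveal a noisy version of the test
+        # gradient (Gaussian-mechanism-style variant).
+        if np.linalg.norm(grad_train - grad_test) > tau:
+            g = grad_test + np.random.normal(scale=sigma, size=grad_test.shape)
+        else:
+            g = grad_train
+        # Standard Adam moment updates with bias correction (componentwise).
+        m = beta1 * m + (1 - beta1) * g
+        v = beta2 * v + (1 - beta2) * g ** 2
+        m_hat = m / (1 - beta1 ** t)
+        v_hat = v / (1 - beta2 ** t)
+        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
+        return w, m, v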
