author    Thibaut Horel <thibaut.horel@gmail.com>  2016-07-18 00:39:35 -0400
committer Thibaut Horel <thibaut.horel@gmail.com>  2016-07-18 00:39:35 -0400
commit    5550e00c788050053a01cae9d5ab40219ff03a67 (patch)
tree      70101d13a01c2f70e5b8235d6b9eb7ca6bef8530
parent    072b6f30503223c130d9cd6679d11cf74b3bf003 (diff)
download  reviews-5550e00c788050053a01cae9d5ab40219ff03a67.tar.gz
NIPS 2016 reviews (only one good paper)
-rw-r--r--  nips-2016-1303.txt  53
-rw-r--r--  nips-2016-1690.txt  58
-rw-r--r--  nips-2016-220.txt   61
-rw-r--r--  nips-2016-2230.txt  56
-rw-r--r--  nips-2016-2253.txt  55
-rw-r--r--  nips-2016-471.txt   46
-rw-r--r--  nips-2016-988.txt   46
7 files changed, 375 insertions, 0 deletions
diff --git a/nips-2016-1303.txt b/nips-2016-1303.txt
new file mode 100644
index 0000000..77b67f5
--- /dev/null
+++ b/nips-2016-1303.txt
@@ -0,0 +1,53 @@
+Summary
+=======
+
+The paper studies the problem of maximizing a submodular set function under
+multiple knapsack (packing/linear) constraints. The authors build on a now
+standard technique: optimizing the multilinear extension of the submodular
+function and then rounding the fractional solution using the "contention
+resolution scheme" framework. The key novelty is to observe that in the
+continuous greedy algorithm, the step size can be chosen proportional to
+1/(s+l+1)^3, where s is the upper bound on the cardinality of the solution set
+and l is the number of packing constraints. Furthermore, it is shown that the
+algorithm can be implemented in such a way that the running solution has
+sparsity (s+l+1), which leads to faster updates.
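+
+(For concreteness, a minimal sketch in Python of the generic continuous greedy
+scheme being discussed, with the step size set as above; grad_F and lp_oracle
+are placeholder oracles of my own, not the authors' implementation.)
+
+import math
+import numpy as np
+
+def continuous_greedy(grad_F, lp_oracle, n, s, l):
+    """Generic continuous greedy on the multilinear extension (sketch only).
+    grad_F(x): (estimated) gradient of the multilinear extension at x.
+    lp_oracle(w): point of the packing polytope maximizing <w, y>."""
+    step = 1.0 / (s + l + 1) ** 3          # step size analyzed in the paper
+    num_steps = math.ceil(1.0 / step)
+    x = np.zeros(n)
+    t = 0.0
+    for _ in range(num_steps):
+        dt = min(step, 1.0 - t)            # last step may be shorter
+        v = lp_oracle(grad_F(x))           # best feasible ascent direction
+        x = x + dt * v
+        t += dt
+    return x                               # fractional point, rounded afterwards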
+
+The suggested algorithms are evaluated experimentally and compared to the
+"threshold greedy algorithm" of [8].
+
+Comments
+========
+
+The problem of maximizing a submodular function under packing constraints
+is important and interesting, and there is definitely value in obtaining
+a better understanding of the computational complexity of this problem under
+various settings. My main concern is that this paper seems to be an incremental
+improvement over already known results:
+
+* the paper mentions "improved algorithms" (plural) but there is only one
+algorithm which is explicitly stated (Algorithm 1). Furthermore, this
+algorithm is the standard continuous greedy algorithm from [15] with a specific
+accounting of the sparsity of the solution. I am assuming that the other
+algorithm is the multiplicative weight algorithm, but the authors simply refer
+to [5] for this algorithm, so it can't be considered novel.
+
+* instead of mentioning "improved algorithms", I think it would be more
+accurate to say "improved analysis". To my understanding, the algorithms are
+the same, only the analysis is refined to properly account for the sparsity of
+the solution.
+
+Furthermore, this paper is lacking the following two things:
+
+* a precise statement (in a theorem/proposition) of the end-to-end
+ computational complexity of the suggested algorithm. This is important if the
+ main contribution is improved running times.
+
+* an explanation of why the multiplicative weight algorithm has worse running
+ time in the experiments, even though the point of this algorithm is to
+ improve the running time of the continuous greedy algorithm. This is
+ counter-intuitive since I would expect the LP solving part of the algorithm
+ to dominate the running time (and the multiplicative weight algorithm
+ specifically addresses this). Is there any reason at all to prefer the
+ multiplicative weight algorithm? Also, it would be interesting in the
+ experiment to show a breakdown of the running time: which fraction of the
+ time is spent solving LPs?
diff --git a/nips-2016-1690.txt b/nips-2016-1690.txt
new file mode 100644
index 0000000..74227ff
--- /dev/null
+++ b/nips-2016-1690.txt
@@ -0,0 +1,58 @@
+Summary
+=======
+
+This paper studies a facility-location problem formulated as a submodular
+optimization problem. In machine learning, this problem has previously been
+applied in the context of dataset summarization and k-medians clustering.
+Despite the submodularity of the objective function, solving the problem with
+the standard greedy algorithm is prohibitive because its complexity is k*n*m,
+where k is the cardinality constraint and n*m is the size of the "benefit
+matrix". A previously suggested approach is to first sparsify the benefit
+matrix and then apply an optimization algorithm on the sparsified matrix.
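+
+(To make the k*n*m cost concrete, here is a bare-bones Python version of the
+standard greedy on the dense benefit matrix; this is my own illustration, not
+the authors' code.)
+
+import numpy as np
+
+def greedy_facility_location(B, k):
+    """B is the n x m benefit matrix; f(S) = sum_i max_{j in S} B[i, j]."""
+    n, m = B.shape
+    best = np.zeros(n)                # current per-row coverage max_{j in S} B[i, j]
+    S = []
+    for _ in range(k):                # k rounds...
+        gains = np.maximum(B, best[:, None]).sum(axis=0) - best.sum()
+        j = int(np.argmax(gains))     # ...each scanning all n*m entries
+        S.append(j)
+        best = np.maximum(best, B[:, j])
+    return S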
+
+This paper refines the results from [36] by giving theoretical guarantees on
+the approximation ratio obtained by this method as a function of the chosen
+sparsification level. Two methods of sparsification are analyzed:
+
+* top t: only the top t items (most similar) are kept in each row of the
+benefit matrix
+
+* threshold-based: only the entries of the benefit matrix above the chosen
+threshold are kept (a quick sketch of both schemes is given below).
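+
+(A quick numpy sketch of the two sparsification schemes as I understand them;
+the exact tie-breaking and thresholding conventions may differ from the paper.)
+
+import numpy as np
+
+def sparsify_top_t(B, t):
+    """Keep only the t largest entries in each row of the benefit matrix."""
+    S = np.zeros_like(B)
+    idx = np.argpartition(B, -t, axis=1)[:, -t:]   # row-wise top-t column indices
+    rows = np.arange(B.shape[0])[:, None]
+    S[rows, idx] = B[rows, idx]
+    return S
+
+def sparsify_threshold(B, tau):
+    """Keep only the entries of the benefit matrix that are at least tau."""
+    return np.where(B >= tau, B, 0.0)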
+
+Finally, experiments show that the suggested approach outperforms (in terms of
+the quality of the solution) the (non-sparsified) greedy algorithm for the same
+running time. Or equivalently, reaching a given solution quality takes much
+longer if no sparsification is used.
+
+Comments
+========
+
+The problem is interesting and well-motivated. While I haven't read paper [36],
+on which this paper builds, I understand that a complete
+theoretical understanding of the sparsification-based approach was still
+missing and this paper is a good attempt at completing the picture.
+
+I think having precise theoretical guarantees is very valuable, especially
+since they come with an almost matching lower bound (up to logarithmic factors).
+
+My only concern is about section 3.1 (threshold-based sparsification): I think
+this section should have a more precise discussion of the following two facts:
+
+* the stated theoretical guarantee is data-dependent. Is there any way to
+obtain a data-independent bound under extra assumptions? For example, assuming
+that the benefit matrix comes from a distribution, one could show that for
+a given threshold x the sparsification is "good" with high probability. The
+benefit of such a result would be to give some guidance on how to choose the
+threshold, or what a good threshold is (which the paper only briefly discusses).
+
+* it is also not clear how the LSH-based approach combines with the
+threshold-based approach. It would be nice to see an "end-to-end"
+theorem/proposition of the following flavor: assume the matrix is sparsified
+using LSH; then, with high probability, we get an approximation of...
+
+Finally, the experiments are interesting and convincing. However, while the
+conclusion of section 4.1 is clear, the results of section 4.2 are hard to
+interpret: there is no metric to quantify the quality of the solution and it
+seems that the reader is simply invited to "eyeball" the results. As
+a consequence, I am not sure that this section is necessary.
diff --git a/nips-2016-220.txt b/nips-2016-220.txt
new file mode 100644
index 0000000..2c5b67a
--- /dev/null
+++ b/nips-2016-220.txt
@@ -0,0 +1,61 @@
+Summary
+=======
+
+This paper studies submodular functions over a continuous domain: these
+functions are sometimes referred to as "smooth" submodular functions and are
+the direct generalization of discrete submodular functions for continuous
+variables. While prior work has focused on obtaining algorithms and
+optimization guarantees for minimizing smooth submodular functions, this paper
+focuses on obtaining approximation guarantees for maximizing smooth submodular
+functions. Specifically, the authors:
+
+* give and prove a characterization of submodularity (and coordinate-wise
+  concavity) in terms of a diminishing returns (DR) property, written out
+  below. This is a generalization of a well-known characterization in the
+  discrete case.
+
+* give a (1-1/e) approximation algorithm for smooth submodular functions which
+  are coordinate-wise concave and monotone
+
+* give a 1/3 approximation algorithm for arbitrary smooth submodular functions
+
+* give examples of problems which lead to optimizing a smooth submodular
+ function
+
+* validate their algorithms through experiments
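+
+(For the record, the DR property in question as I understand it, with notation
+of my own: for all x <= y coordinate-wise, every coordinate i and every k >= 0,
+
+\[ f(x + k e_i) - f(x) \;\ge\; f(y + k e_i) - f(y), \]
+
+which the paper shows to be equivalent to submodularity,
+\( f(x) + f(y) \ge f(x \vee y) + f(x \wedge y) \), together with coordinate-wise
+concavity.)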
+
+Comments
+========
+
+Smooth submodular functions have gathered interest in recent years and, as
+the authors point out, even though results were known for minimization, there
+was a lack of results on the maximization side.
+
+While it is interesting to see that the results of the discrete case translate
+well to the continuous case, this is also the source of a (relative) weakness
+of the paper: a lack of technical novelty.
+
+Section 4 in particular is a direct adaptation of the continuous algorithm of
+Vondrak 2008. I don't see any fundamental difference in the statement of the
+algorithm and the proof follows a very similar technique. Also, it would have
+been interesting to mention that the multilinear extension of a monotone
+submodular function is DR-submodular to clearly show the connection with
+already existing results.
+
+The algorithm in section 5 is more novel: it draws heavily on the algorithm
+of [Buchbinder 2012] but requires an adaptation to the continuous case (and
+a specific analysis).
+
+It is interesting that the authors give examples where maximizing smooth
+submodular functions arises. Some of these examples seem a bit contrived, but
+I don't have enough domain-specific knowledge to judge this confidently.
+
+The paper is well written overall, but suffers from some local notation
+inconsistencies which can be unsettling for the reader. Two examples:
+
+* the definition of submodularity on line 80 is given for arbitrary compact
+sets, but later the authors seem to focus on the specific case of the positive
+orthant of R^n; it is not clear whether the results hold in the most general
+case.
+
+* definition 3.1 is hard to read: is it necessary to introduce a ground set
+and then the characteristic vectors of the ground set elements? Why not simply
+use R^n and the standard basis vectors?
diff --git a/nips-2016-2230.txt b/nips-2016-2230.txt
new file mode 100644
index 0000000..ea73164
--- /dev/null
+++ b/nips-2016-2230.txt
@@ -0,0 +1,56 @@
+Summary
+=======
+
+In this paper, the authors study the streaming submodular cover problem: given
+a monotone submodular function f and a target utility Q, the goal is to find
+a set S of minimal size such that f(S) is at least Q. The setting is that of
+streaming algorithms, where the goal is to perform a small (constant) number of
+passes over the data stream and use sublinear memory (i.e., the algorithm
+cannot simply store the entire stream of data).
+
+The first result is an impossibility result: any non-trivial approximation
+guarantee (better than m/2 where m is the size of the stream) requires linear
+memory.
+
+This first result motivates a bi-criteria formulation of the streaming
+submodular cover problem: find a set of size at most delta times the size of
+the optimal set which achieves a fraction (1-epsilon) of the target utility.
+The authors provide an algorithm which provably achieves this, with delta
+= 1/epsilon.
+
+Comments
+========
+
+The paper is well-written and the results are correct as far as I checked (I
+didn't read the proofs in the appendix but checked the ones in the main body of
+the paper).
+
+The experiments are interesting and illustrate well both the scalability of
+the approach and the trade-off introduced by the parameter alpha in the
+algorithm.
+
+My main concern is a conceptual question regarding the computational model:
+I think it is hard to formulate a sound computational model for streaming
+optimization of arbitrary submodular functions. Indeed, since these functions
+can have exponential size representations in the general case, the authors
+assume access to a value oracle (as is standard in the offline setting). It is
+not clear to me what such oracles imply in the streaming setting: for example,
+the algorithm could cheat and use value queries to avoid having to store items.
+The exact interplay between the value queries and the memory limitation
+introduced by the streaming model should be properly discussed.
+
+For specific examples (like submodular set cover which was previously studied
+in the streaming setting, or the ones discussed in the experiments), this
+problem does not exist since the function has an explicit representation which
+can be computed as the stream of elements is received, but I think the general
+case requires a more in-depth discussion of the computational model.
+
+Finally, while the lower-bound is novel and technically interesting, the
+algorithm and its analysis seem to be an incremental variation on a well-known
+technique: it has already been observed in other settings (MapReduce submodular
+optimization, for example) that the following simple threshold-based algorithm
+gives a constant approximation:
+
+for decreasing values of the threshold (divided by a constant at each
+iteration), include in the solution all elements whose marginal contribution
+is above the threshold.
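+
+(A rough Python sketch of that generic decreasing-threshold scheme, written as
+an offline procedure; the stopping threshold and the factor 2 are arbitrary
+choices of mine, not the authors' streaming algorithm.)
+
+def threshold_greedy_cover(f, ground_set, Q, eps=0.1):
+    """Decreasing-threshold greedy for monotone submodular cover (offline sketch)."""
+    S = set()
+    tau = max(f({e}) for e in ground_set)      # largest singleton value
+    while f(S) < Q and tau > eps * Q / len(ground_set):
+        for e in ground_set:
+            if e not in S and f(S | {e}) - f(S) >= tau:
+                S.add(e)
+                if f(S) >= Q:
+                    break
+        tau /= 2.0                             # divide the threshold by a constant
+    return S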
diff --git a/nips-2016-2253.txt b/nips-2016-2253.txt
new file mode 100644
index 0000000..56e1b0b
--- /dev/null
+++ b/nips-2016-2253.txt
@@ -0,0 +1,55 @@
+Summary
+=======
+
+This paper studies the problem of feature selection in sparse learning: given
+a learning problem where we seek to find a vector of parameters b maximizing
+some objective function (typically the likelihood given the observed data), the
+goal is to find the optimal such vector under the constraint that its support
+must be of size at most k.
+
+This problem has been studied by Das and Kempe in the specific case where the
+learning task to be performed is least-squares regression. In this setting they
+showed that if the data satisfies the RIP property, then the objective function
+is weakly submodular, leading to provable guarantees of a greedy algorithm for
+feature selection.
+
+The present paper generalizes this result to arbitrary objective functions:
+the main result is to show that if the objective function is strongly convex,
+then the resulting feature selection problem is weakly submodular, leading to
+provable guarantees for greedy algorithms.
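+
+(For concreteness, a bare-bones Python version of the greedy forward selection
+being analyzed; score(S) is a placeholder for the best objective value
+achievable with support S, e.g. the maximum log-likelihood restricted to S.)
+
+def greedy_feature_selection(score, d, k):
+    """Greedily grow a support of size at most k out of d features."""
+    S = []
+    for _ in range(k):
+        remaining = [j for j in range(d) if j not in S]
+        best_j = max(remaining, key=lambda j: score(S + [j]))  # largest marginal gain
+        S.append(best_j)
+    return S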
+
+The (synthetic) experiments validate the approach and show that the algorithms
+perform according to the theoretical guarantees and outperform the baseline
+(Lasso selection).
+
+Comments
+========
+
+This is a really "clean" and nice paper: the connection between strong convexity
+and weak submodularity is very interesting conceptually. Even though the
+consequence for feature selection in sparse learning is an obvious gain,
+I believe the connection is interesting in itself since it is a new bridge
+between convexity and submodularity (and such bridges have led to interesting
+results in submodular optimization in the past 6-7 years).
+
+The paper is well-written and the proofs are easy to check for correctness
+(they follow from rather standard arguments in submodularity). It is
+interesting to have three algorithms with different running time complexity and
+approximation guarantees. The practical implication (as illustrated in the
+experiments) is that the practitioner can trade-off solution quality for
+running time.
+
+A standard assumption under which theoretical guarantees can be obtained in
+submodular optimization is known as "submodular curvature". If there is
+a connection between weak submodularity and this notion of curvature it could
+be interesting to mention it (I haven't spent much time thinking about it, but
+it seems that there should be a connection, at least one implying the other).
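+
+For reference, the definitions I have in mind (written down mostly for my own
+benefit; the normalizations may differ slightly from the papers'): the
+submodularity ratio of f, which is bounded away from 0 for weakly submodular
+functions,
+
+\[ \gamma_{U,k} = \min_{L \subseteq U,\; |S| \le k,\; S \cap L = \emptyset}
+   \frac{\sum_{x \in S} \big[ f(L \cup \{x\}) - f(L) \big]}{f(L \cup S) - f(L)}, \]
+
+and the total curvature of a monotone submodular f with \( f(\emptyset) = 0 \),
+
+\[ c = 1 - \min_{j \in \Omega} \frac{f(\Omega) - f(\Omega \setminus \{j\})}{f(\{j\})}. \]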
+
+A final comment which is not about the quality of the paper (but rather about
+the general field of sparse learning and specifically paper [1]): it is always
+hard to interpret the conditions under which the theorems hold. The strong
+convexity condition implies something about the data (some form of
+non-degeneracy of the data) in the case where the objective function is defined
+in terms of the observed data. It is hard to obtain probabilistic guarantees
+under which these "non-degeneracy" properties hold, and the practical
+implications of these results are hard to grasp.
diff --git a/nips-2016-471.txt b/nips-2016-471.txt
new file mode 100644
index 0000000..938ccb2
--- /dev/null
+++ b/nips-2016-471.txt
@@ -0,0 +1,46 @@
+Summary
+=======
+
+In this paper, the authors study the standard influence maximization problem:
+assuming the independent cascade model (a random diffusion process in graphs),
+the goal is to select a seed set of k nodes. While the traditional influence
+maximization problem focuses on maximizing the expected "cascade size"
+(i.e. number of "infected" vertices at the end of the random process), the
+authors note that this is not always desirable in risk-averse settings. The
+contributions are:
+
+* formulating a maximization problem where the objective function is the CVaR
+(conditional value at risk) risk measure
+
+* providing a "multiplicative weight update" algorithm to solve the problem.
+the algorithm is proved to find a solution within an additive 1/e approximation
+of the optimal solution
+
+* comparing their algorithm experimentally with the standard greedy algorithm
+which maximizes the expected cascade size.
+
+Comments
+========
+
+The problem is very interesting: a lot of effort has been put into maximizing
+expected cascade size in the influence maximization problem, but little is
+known as far as "worst case" or "risk measures" are concerned, and I think
+there is a lot of value in obtaining these more nuanced results.
+
+The experiments show very well that the resulting cascade size distribution is
+very different when maximizing the CVaR metric as opposed to maximizing the
+expected size. It is also clear that the distribution obtained when maximizing
+CVaR is more desirable in settings where risk needs to be mitigated.
+
+The algorithm is interesting: the ideas are intuitive and standard (sampling
++ multiplicative updates) but to the best of my knowledge are not traditionally
+used in the context of influence maximization.
+
+My only concern is about the specific choice of objective function: I feel that
+the CVaR metric is not very well motivated. Are there any ways to relate it to
+the one used in [5] or [20] which seem more intuitive and easier to interpret?
+The interpretability is made even harder by not choosing a single seed set
+(fixed strategy) but choosing a distribution over seed sets (random strategy)
+and maximizing the CVaR of this random strategy. Also, it would be interesting
+to know more about the properties of the CVaR objective function: is it
+convex/concave in the choice of the random strategy?
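+
+For reference, the variational form of CVaR for a reward Z at level alpha that
+I assume the paper uses (up to sign conventions):
+
+\[ \mathrm{CVaR}_\alpha(Z) = \sup_{\tau \in \mathbb{R}}
+   \Big\{ \tau - \tfrac{1}{\alpha}\, \mathbb{E}\big[ (\tau - Z)^+ \big] \Big\}, \]
+
+i.e., roughly the expected value of Z over its worst alpha-fraction of outcomes.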
diff --git a/nips-2016-988.txt b/nips-2016-988.txt
new file mode 100644
index 0000000..40f4513
--- /dev/null
+++ b/nips-2016-988.txt
@@ -0,0 +1,46 @@
+Summary
+=======
+
+This paper studies the network inference problem for the RDS
+(Respondent Driven Sampling) diffusion process. This random diffusion process
+models a standard procedure used in epidemiology studies. Under this model, the
+authors formulate the network reconstruction process (finding the unknown graph
+over which the diffusion took place) as a MAP (maximum a posteriori)
+optimization problem. The contributions are:
+
+* showing that the objective function is log-submodular
+
+* giving a "variational inference" algorithm: the objective function is
+approximated by linear lower and upper bounds and solving for those upper and
+lower bounds provides an approximate solution to the problem
+
+* experiments showing the validity of the approach
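+
+(For illustration, the kind of modular lower bound from the framework of [5]
+that I would have liked to see spelled out; this is my own sketch of the
+generic construction, not the paper's algorithm.)
+
+def modular_lower_bound(f, order):
+    """Modular lower bound of a submodular set function f along a permutation
+    `order` of the ground set: weights[e] = f(prefix | {e}) - f(prefix).
+    The induced modular function m(A) = sum_{e in A} weights[e] satisfies
+    m(A) <= f(A) for every A, with equality on every prefix of `order`."""
+    weights = {}
+    prefix = set()
+    for e in order:
+        weights[e] = f(prefix | {e}) - f(prefix)
+        prefix.add(e)
+    return weights
+
+def evaluate_modular(weights, A):
+    return sum(weights[e] for e in A)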
+
+Comments
+========
+
+The problem of network reconstruction is important and hadn't been studied from
+this perspective in the context of Respondent Driven Sampling. While the
+experiments show that the suggested approach works well in practice, I find the
+paper to be lacking a conceptual/technical contribution:
+
+* it was already shown in [8] that the log-likelihood is submodular for the
+closely related continuous-time independent cascade model. Even though the
+diffusion process here is slightly different, this result can hardly be
+considered novel.
+
+* the suggested approach (approximating the objective function with linear
+lower and upper bounds) was already suggested in [5], including the specific
+choice of the lower and upper additive approximations
+
+* the formulas from [5] are barely instantiated to the specific objective
+function at hand here. It could have been interesting to show how the general
+framework of [5] leads to a simple algorithm for this specific problem, but no
+algorithm is provided and very little is said about how the computation should
+be performed in practice. For example, the paragraph starting at line 252 is
+very vague.
+
+* no theoretical guarantees are provided for the approximation obtained by the
+suggested approach. Some guarantees were obtained in [5] under a curvature
+assumption. It could have been interesting to show which form these assumptions
+take in this specific context.