Summary
=======

This paper studies the problem of feature selection in sparse learning: given a learning problem in which we seek a parameter vector b maximizing some objective function (typically the likelihood of the observed data), the goal is to find the optimal such vector under the constraint that its support has size at most k. This problem was studied by Das and Kempe in the specific case where the learning task is least-squares regression. In that setting they showed that if the data satisfies the RIP property, then the objective function is weakly submodular, which yields provable guarantees for a greedy feature-selection algorithm. The present paper generalizes this result to arbitrary objective functions: the main result shows that if the objective function is strongly convex, then the resulting feature selection problem is weakly submodular, leading to provable guarantees for greedy algorithms (the precise notion of weak submodularity I have in mind is recalled in the note at the end of this review). The (synthetic) experiments validate the approach and show that the algorithms perform according to the theoretical guarantees and outperform the baseline (Lasso selection).

Comments
========

This is a really "clean" and nice paper: the connection between strong convexity and weak submodularity is very interesting conceptually. Even though the consequence for feature selection in sparse learning is an obvious gain, I believe the connection is interesting in itself, since it is a new bridge between convexity and submodularity (and such bridges have led to interesting results in submodular optimization in the past 6-7 years). The paper is well written and the proofs are easy to check for correctness (they follow from rather standard arguments in submodularity).

It is interesting to have three algorithms with different running-time complexities and approximation guarantees. The practical implication (as illustrated in the experiments) is that the practitioner can trade off solution quality for running time.

A standard assumption under which theoretical guarantees can be obtained in submodular optimization is known as "submodular curvature". If there is a connection between weak submodularity and this notion of curvature, it could be interesting to mention it (I haven't spent much time thinking about it, but it seems that there should be a connection, at least one implying the other; see the note at the end of this review for the definition I have in mind).

A final comment, which is not about the quality of the paper but rather about the general field of sparse learning and specifically paper [1]: it is always hard to interpret the conditions under which the theorems hold. When the objective function is defined in terms of the observed data, the strong convexity condition implies something about the data (some form of non-degeneracy). It is hard to obtain probabilistic guarantees that these "non-degeneracy" properties hold, and the practical implications of these results are hard to grasp.
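
Note on weak submodularity: for concreteness, the notion I refer to above is the submodularity ratio of Das and Kempe. The notation below is my own and may differ from the paper's; I assume the set function f induced by the objective is monotone and normalized (f(∅) = 0), with the convention that the ratio is 1 when the denominator vanishes:

\[
\gamma_{U,k} \;=\; \min_{\substack{L \subseteq U,\; S \cap L = \emptyset,\; |S| \le k}} \; \frac{\sum_{x \in S} \big( f(L \cup \{x\}) - f(L) \big)}{f(L \cup S) - f(L)} .
\]

The standard consequence, and (if I read the paper correctly) the type of guarantee proved here, is that greedy forward selection with budget k returns a set $S^g$ with $f(S^g) \ge (1 - e^{-\gamma})\, f(S^*)$, where $S^*$ is an optimal support of size at most k.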
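
Note on curvature: the notion I have in mind in the comment above is the total curvature of Conforti and Cornuéjols (again in my own notation, for a ground set Ω and a monotone, normalized submodular f),

\[
c \;=\; 1 - \min_{j \in \Omega} \frac{f(\Omega) - f(\Omega \setminus \{j\})}{f(\{j\})},
\]

under which, as far as I recall, the greedy algorithm enjoys a $\frac{1}{c}\big(1 - e^{-c}\big)$ approximation guarantee under a cardinality constraint. Whether the submodularity ratio $\gamma$ and (a weakly submodular analogue of) the curvature $c$ can be related, e.g. one bounding the other, is what I would find interesting to see discussed.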