Summary
=======
This paper studies feature selection in sparse learning: given a learning
problem in which we seek a parameter vector b maximizing some objective
function (typically the likelihood of the observed data), the goal is to find
the best such vector under the constraint that its support has size at most k.
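For concreteness, the problem as I read it is

    max_b f(b)   subject to   |supp(b)| <= k,

where f is the (data-dependent) objective and supp(b) denotes the set of
non-zero coordinates of b.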
This problem was studied by Das and Kempe in the special case where the
learning task is least-squares regression. In that setting they showed that if
the data satisfies the restricted isometry property (RIP), then the objective
is weakly submodular, which yields provable guarantees for a greedy feature
selection algorithm.
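For readers not familiar with this notion: if I recall the definition of Das
and Kempe correctly, the submodularity ratio of a monotone set function f is

    gamma_{U,k} = min over L ⊆ U and S disjoint from L with |S| <= k of
                  [ sum_{x in S} ( f(L ∪ {x}) - f(L) ) ] / [ f(L ∪ S) - f(L) ],

and f is called weakly submodular when this ratio is bounded away from zero;
the greedy guarantee then takes the form 1 - e^{-gamma} instead of the usual
1 - 1/e.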
The present paper generalizes this to arbitrary objective functions: the main
result shows that if the objective function is strongly convex, then the
induced set function is weakly submodular, which again yields provable
guarantees for greedy feature selection algorithms.
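To make the algorithmic side concrete, here is a minimal sketch of the
standard forward greedy scheme these guarantees apply to (my own illustration,
not code from the paper; the function names and the least-squares R^2
objective are only one possible instantiation, in the spirit of Das and
Kempe):

    import numpy as np

    def greedy_feature_selection(objective, n_features, k):
        # Forward greedy: repeatedly add the feature whose inclusion
        # increases the objective the most, until k features are chosen.
        selected, remaining = [], set(range(n_features))
        for _ in range(k):
            base = objective(selected)
            gains = {j: objective(selected + [j]) - base for j in remaining}
            best = max(gains, key=gains.get)
            if gains[best] <= 0:
                break  # no remaining feature improves the objective
            selected.append(best)
            remaining.remove(best)
        return selected

    def make_r2_objective(X, y):
        # R^2 of the least-squares fit restricted to a given support
        # (the Das-Kempe setting); any monotone objective could be plugged in.
        def r2(support):
            if not support:
                return 0.0
            Xs = X[:, support]
            coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            resid = y - Xs @ coef
            return 1.0 - (resid @ resid) / (y @ y)
        return r2

    # Toy usage: 3 relevant features out of 20.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 20))
    beta = np.zeros(20)
    beta[[2, 5, 11]] = [1.0, -2.0, 1.5]
    y = X @ beta + 0.1 * rng.standard_normal(200)
    print(greedy_feature_selection(make_r2_objective(X, y), 20, 3))

Presumably the paper's three algorithms differ mainly in how much work is
spent evaluating or approximating these marginal gains, which is where the
running-time/quality trade-off mentioned below comes from.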
The (synthetic) experiments validate the approach, showing that the algorithms
perform in line with the theoretical guarantees and outperform the baseline
(Lasso selection).
Comments
========
This is a really "clean" and nice paper: the connection between strong
convexity and weak submodularity is very interesting conceptually. Even though
the consequence for feature selection in sparse learning is an obvious gain,
I believe the connection is interesting in itself, since it is a new bridge
between convexity and submodularity (and such bridges have led to interesting
results in submodular optimization over the past 6-7 years).
The paper is well written and the proofs are easy to check for correctness
(they follow from rather standard arguments in submodularity). It is
interesting to have three algorithms with different running-time complexities
and approximation guarantees. The practical implication (as illustrated in the
experiments) is that the practitioner can trade off solution quality for
running time.
A standard assumption under which theoretical guarantees can be obtained in
submodular optimization is known as "submodular curvature". If there is
a connection between weak submodularity and this notion of curvature, it would
be interesting to mention it (I have not spent much time thinking about it, but
it seems that there should be a connection, at least with one notion implying
the other).
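For reference, if I remember the definition correctly, the (total) curvature
of a monotone set function f over a ground set N is

    c = 1 - min over j in N of [ f(N) - f(N \ {j}) ] / [ f({j}) - f(∅) ],

and for submodular functions the greedy guarantee improves from 1 - 1/e to
(1/c)(1 - e^{-c}); it would be nice to know how this quantity relates to the
submodularity ratio.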
A final comment, which is not about the quality of this paper but rather about
the general field of sparse learning (and specifically about paper [1]): it is
always hard to interpret the conditions under which the theorems hold. When the
objective function is defined in terms of the observed data, the strong
convexity condition implicitly requires something of the data (some form of
non-degeneracy). It is hard to obtain probabilistic guarantees that these
"non-degeneracy" properties hold, and the practical implications of these
results are therefore hard to grasp.