author    Thibaut Horel <thibaut.horel@gmail.com>  2015-04-01 14:30:54 -0400
committer Thibaut Horel <thibaut.horel@gmail.com>  2015-04-01 14:30:54 -0400
commit    b07276072fd182c9228e6a6b800f0390672d05f1 (patch)
tree      f938776d6300c80eaf66d759226b0cc5a9cda083
parent    e5c19d5ee7d1f338561160f2d9e50b39eaf0cb2b (diff)
download  learn-optimize-b07276072fd182c9228e6a6b800f0390672d05f1.tar.gz
Starting point for examples of submodular functions
-rw-r--r--  results.tex  143
1 file changed, 98 insertions(+), 45 deletions(-)
diff --git a/results.tex b/results.tex
index c9acedd..8e6a186 100644
--- a/results.tex
+++ b/results.tex
@@ -7,6 +7,9 @@
\DeclareMathOperator{\P}{\mathbb{P}}
\newcommand{\ex}[1]{\E\left[#1\right]}
\newcommand{\prob}[1]{\P\left[#1\right]}
+\newcommand{\reals}{\mathbb{R}}
+\newcommand{\ints}{\mathbb{N}}
+\renewcommand{\O}{\mathcal{O}}
\newtheorem{proposition}{Proposition}
@@ -116,6 +119,98 @@ More generally, I think the “right” way to think about this is to look at $A
as the adjacency matrix of a random $k$-regular graph. There are many
results about the probability that the adjacency matrix is non-singular.
+\section{Examples}
+
+\subsection{Generic Functions}
+
+\paragraph{TODO:} Add citations!
+
+These are generic examples and serve as building blocks for the applications
+below:
+\begin{itemize}
+	\item $f(S) = g\left(\sum_{i\in S} w_i\right)$ for a concave
+		$g:\reals\to\reals$ and weights $(w_i)_{i\in N}\in \reals_+^{|N|}$
+		(a short submodularity check is given after this list). Note that
+		$f$ is monotone iff $g$ is monotone. When $g$ is monotone, it does not
+		matter for the purpose of optimization: the sets are ranked in the same
+		order with or without $g$; the only things that matter are the weights,
+		which serve as natural parameters for this class of functions. This
+		class of functions contains:
+ \begin{itemize}
+ \item additive functions (when $g$ is the identity function).
+ \item $f(S) = |S\cap X|$ for some set $X\subseteq N$. This is the
+ case where the weights are $0/1$ and $g$ is the identity
+ function.
+			\item functions of the cardinality only (sometimes called
+				symmetric): when the weights are all equal to one, so that
+				$f(S) = g(|S|)$.
+ \item budget-additive functions, when $g:x\mapsto \min(B, x)$ for
+ some $B$.
+ \end{itemize}
+ \item $f(S) = \max_{i\in S} w_i$ for weights $(w_i)_{i\in N}\in
+ \reals_+^{|N|}$. This class of functions is also naturally parametrized
+ by the weights.
+ \item (weighted) matroid rank functions. Given a matroid $M$ over a ground
+ set $N$, we define its rank function to be:
+ \begin{displaymath}
+ \forall S\subseteq N,\;
+ r(S) = \max_{\substack{I\subseteq S\\I\in M}} |I|
+ \end{displaymath}
+		More generally, given a weight function $w:N\to\reals_+$, we define the
+ weighted matroid rank function:
+ \begin{displaymath}
+ \forall S\subseteq N,\;
+ r(S) = \max_{\substack{I\subseteq S\\I\in M}} \sum_{i\in I} w_i
+ \end{displaymath}
+\end{itemize}
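+
+As a quick check for the first class above (a sketch, writing
+$w(S) = \sum_{i\in S} w_i$): for $A\subseteq B$ and $j\notin B$, the marginal
+gain of $j$ satisfies
+\begin{displaymath}
+	f(A\cup\{j\}) - f(A) = g\big(w(A) + w_j\big) - g\big(w(A)\big)
+	\geq g\big(w(B) + w_j\big) - g\big(w(B)\big) = f(B\cup\{j\}) - f(B),
+\end{displaymath}
+since $w(A)\leq w(B)$ and, by concavity of $g$, the increment
+$t\mapsto g(t + w_j) - g(t)$ is non-increasing in $t$. Hence $f$ is
+submodular.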
+
+\paragraph{Remark} The function $f(S)= \max_{i\in S}w_i$ is a weighted matroid
+rank function for the $1$-uniform matroid (the matroid where the independent
+sets are the sets of size at most one).
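+Indeed, for the $1$-uniform matroid the independent subsets of $S$ are
+$\emptyset$ and the singletons $\{i\}$, $i\in S$, so that for non-empty $S$,
+\begin{displaymath}
+	r(S) = \max_{\substack{I\subseteq S\\|I|\leq 1}} \sum_{i\in I} w_i
+	= \max_{i\in S} w_i,
+\end{displaymath}
+using $w_i\geq 0$ for the last equality.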
+
+\subsection{Applications}
+
+\begin{itemize}
+ \item \emph{Coverage functions:} they can be written as a positive linear
+ combination of matroid rank functions:
+ \begin{displaymath}
+ f(S) = \sum_{u\in\mathcal{U}} w_u c_u(S)
+ \end{displaymath}
+		where $c_u(S) = 1$ iff $S$ contains an element covering $u$, i.e.\
+		$c_u$ is the rank function of the matroid whose independent sets are
+		$\emptyset$ and the singletons $\{i\}$ for the elements $i$ covering
+		$u$.
+ \item \emph{Facility location:} (cite Bilmes) there is a universe
+		$\mathcal{U}$ of locations and a proximity score $s_{u,v}$ for each
+ pair of locations. We pick a subset of locations $S$ and each point in
+ the universe is allocated to its closest location (the one with highest
+ proximity):
+ \begin{displaymath}
+ f(S) = \sum_{u\in\mathcal{U}} \max_{v\in S} s_{u,v}
+ \end{displaymath}
+		This can be seen as a sum of weighted matroid rank functions: one
+		$1$-uniform matroid for each point $u$ of the universe, with weights
+		$(s_{u,v})_v$ (see the display after this list); another application
+		with the same structure is job scheduling.
+
+	\item \emph{Image segmentation:} (cite Jegelka) the objective can in some
+		cases be written as a graph cut function; for image segmentation the
+		goal is to minimize the cut:
+ \begin{displaymath}
+ f(S) = \sum_{e\in E} w_e c_e(S)
+ \end{displaymath}
+ where $c_e(S)$ is one iff $e\in E(S,\bar{S})$. \textbf{TODO:} I think
+ this can be written as a matroid rank function.
+ \item \emph{Learning/value of information}: (cite Krause) there is
+ a hypothesis $A$ (a random variable) which is “refined” when more
+ observations are made. Imagine that there is a finite set $X_1,\dots,
+ X_n$ of possible observations (random variables). Then, assuming that
+ the observations are independent conditioned on $A$, the information
+ gain:
+ \begin{displaymath}
+ f(S) = H(A) - H\big(A\,|\,(X_i)_{i\in S}\big)
+ \end{displaymath}
+		is submodular. The $\log\det$ objective is the special case of a linear
+		hypothesis observed with additive independent Gaussian noise (see the
+		display after this list). This can be written as multivariate concave
+		over modular (\textbf{TODO:} I think multivariate concave over modular
+		is not submodular in general; it is for $\log\det$. Understand this
+		better).
+\end{itemize}
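+
+For the facility location example, writing $r_u$ for the weighted rank
+function of the $1$-uniform matroid with weights $(s_{u,v})_{v}$ (notation
+introduced here just for this sketch), the decomposition reads
+\begin{displaymath}
+	f(S) = \sum_{u\in\mathcal{U}} \max_{v\in S} s_{u,v}
+	= \sum_{u\in\mathcal{U}} r_u(S).
+\end{displaymath}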
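+
+For the information gain example, here is a sketch of the linear-Gaussian
+case (notation introduced here just for this sketch): suppose
+$A = \theta\sim\mathcal{N}(0,\Sigma)$ and $X_i = a_i^\top\theta
++ \varepsilon_i$ with $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$ i.i.d.\ and
+independent of $\theta$, so that the observations are indeed independent
+conditioned on $A$. Writing $M_S$ for the matrix whose rows are the
+$a_i^\top$, $i\in S$, the information gain becomes
+\begin{displaymath}
+	f(S) = H(A) - H\big(A\,|\,(X_i)_{i\in S}\big)
+	= \frac{1}{2}\log\det\Big(I + \frac{1}{\sigma^2} M_S \Sigma M_S^\top\Big),
+\end{displaymath}
+which is the $\log\det$ objective mentioned above.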
+
\section{Passive Optimization}
Let $\Omega$ be the universe of elements and $f$ a function defined on subsets
@@ -186,9 +281,7 @@ g(S^*) \geq \alpha f(S^*) \geq \alpha f(OPT) \geq
where $OPT \in \arg \max_{S \in K} g(S)$. This can be taken one step further by
considering a function $g$ such that there exists $\alpha, \beta >0$, an
additive function $f$ and a bijective univariate function $\phi$, such that
-$\forall S, \alpha \phi(f(S)) \leq g(S) \leq \beta \phi(f(S))$. In this case,
-we solve the system $A w = \phi^{-1}(B)$ and obtain once again an
-$\alpha/\beta$-approximation to the optimum of $g$.
+$\forall S, \alpha \phi(f(S)) \leq g(S) \leq \beta \phi(f(S))$. In this case, we solve the system $A w = \phi^{-1}(B)$ and obtain once again an $\alpha/\beta$-approximation to the optimum of $g$.
\begin{remark}
Note that here $D-$optimizing $f$ is easy because $D-$learning $f$ is easy. We
@@ -221,51 +314,11 @@ If $D$ is a product distribution such that $ \exists c > 0, \forall i,
\mathbb{P}(i) \geq c$, it is
easy to show that $\forall i \in
\Omega,\ \hat W_i \rightarrow_{|O| \rightarrow +\infty} w_i$
-By using standard concentration bounds, we can hope to obtain within $poly(n,
-\frac{1}{\epsilon})$ observations and with high probability:
+By using standard concentration bounds, we can hope to obtain accurate estimates $\hat W_i$ within $poly(n, \frac{1}{\epsilon})$ observations.
-\begin{align}
- &\forall i \in S : w_i \neq 0 \implies (1-\epsilon) w_i \leq \hat W_i \leq (1
- + \epsilon) w_i \label{eq:first_condition}\\
- & \forall i \in S: w_i = 0 \implies 0 \leq \hat W_i \leq \epsilon
- \label{eq:second_condition}
-\end{align}
-
-Note that if Eq.~\ref{eq:first_condition} and Eq.~\ref{eq:second_condition} are
-both verified, then we can approximately $D-$optimize $f$ under cardinality
-constraint. Let $K \in \mathbb{N}$, let $J$ be the indices of the $K$ largest
-estimated weights $\hat W_i$ and $I$ be the indices of the $K$ largest true
-weights $w_i$. We have:
-
-\begin{align*}
- \sum_{j \in J \cap \{j : w_j \neq 0\}} w_i & \geq \frac{1}{1+\epsilon}
- \sum_{j \in J \cap \{j : w_j \neq 0\}} \hat W_i \\
- & \geq \frac{1}{1+\epsilon}
- \left( \sum_{j \in J} \hat W_i - |J \cap \{j: w_j =0\} |\epsilon \right) \\
- & \geq \frac{1}{1+\epsilon} \left( \sum_{j in I} \hat W_i - K\epsilon \right)
- \\ & \geq \frac{1-\epsilon}{1+\epsilon} OPT - \frac{\epsilon}{1+\epsilon} K
-\end{align*}
-
-Let $\epsilon \leq 1/(1+K/(\delta OPT)) \approx \delta OPT/K$, then
-it follows that choosing the $K$ largest estimated weights $\hat W_i$ is a
-$(1-\delta)(1-\epsilon)/(1+\epsilon)$-approximation to the optimum.
-
-\paragraph{Interpretation} Notice that the expression of our estimator is the
-finite-sample approximation of the differential of the multilinear extension of
-$f$ with respect to $w_i$. Because $f$ is additive, the gradient of its
-multilinear extension with respect to its parameters is constant. Recalling the
-continuous-greedy procedure, it is intuitive that we can $D$-optimize an
-additive function by estimating the gradient of its multilinear extension at
-the unit vectors.
-
-\begin{remark}
-Note that we have not managed to do without a learning phase, we are still
-learning $f$ approximately to optimize it.
-\end{remark}
%For every node
-%$i$, we compute the \emph{average weight of every set containing element $i$}.
-%Let $w_i$ be
+%$i$, we compute the \emph{average weight of every set containing element $i$}. Let $w_i$ be
%the weight of element $i$, $w \equiv f(\Omega) = \sum_{i \in \Omega} w_i$ and
%$p \equiv \frac{k}{n}$, then $$\forall i \in \Omega, \mathbb{E}_{S \sim
%D_p}\left[f(S)| i \in S\right] = pw + (1 -p)w_i$$