author    Thibaut Horel <thibaut.horel@gmail.com>  2015-04-01 14:30:54 -0400
committer Thibaut Horel <thibaut.horel@gmail.com>  2015-04-01 14:30:54 -0400
commit    b07276072fd182c9228e6a6b800f0390672d05f1 (patch)
tree      f938776d6300c80eaf66d759226b0cc5a9cda083
parent    e5c19d5ee7d1f338561160f2d9e50b39eaf0cb2b (diff)
download  learn-optimize-b07276072fd182c9228e6a6b800f0390672d05f1.tar.gz
Starting point for examples of submodular functions
-rw-r--r--  results.tex  143
1 file changed, 98 insertions(+), 45 deletions(-)
diff --git a/results.tex b/results.tex
index c9acedd..8e6a186 100644
--- a/results.tex
+++ b/results.tex
@@ -7,6 +7,9 @@
\DeclareMathOperator{\P}{\mathbb{P}}
\newcommand{\ex}[1]{\E\left[#1\right]}
\newcommand{\prob}[1]{\P\left[#1\right]}
+\newcommand{\reals}{\mathbb{R}}
+\newcommand{\ints}{\mathbb{N}}
+\renewcommand{\O}{\mathcal{O}}
\newtheorem{proposition}{Proposition}
@@ -116,6 +119,98 @@ More generally, I think the “right” way to think about this is to look at $A
as the adjacency matrix of a random $k$-regular graph. There are many
results about the probability that the adjacency matrix is non-singular.
+\section{Examples}
+
+\subsection{Generic Functions}
+
+\paragraph{TODO:} Add citations!
+
+These are generic examples and serve as building blocks for the applications
+below:
+\begin{itemize}
+	\item $f(S) = g\left(\sum_{i\in S} w_i\right)$ for a concave
+		$g:\reals\to\reals$ and weights $(w_i)_{i\in N}\in \reals_+^{|N|}$
+		(a short submodularity check is given after this list). Note that
+		$f$ is monotone iff $g$ is monotone. When $g$ is monotone, it does not
+		matter for the purpose of optimization: the sets are ranked in the same
+		order with or without $g$; the only things that matter are the weights,
+		which serve as natural parameters for this class of functions. This
+		class of functions contains:
+ \begin{itemize}
+ \item additive functions (when $g$ is the identity function).
+ \item $f(S) = |S\cap X|$ for some set $X\subseteq N$. This is the
+ case where the weights are $0/1$ and $g$ is the identity
+ function.
+			\item functions of the cardinality only (sometimes called
+				symmetric): when the weights are all equal to one, so that
+				$f(S) = g(|S|)$.
+ \item budget-additive functions, when $g:x\mapsto \min(B, x)$ for
+ some $B$.
+ \end{itemize}
+ \item $f(S) = \max_{i\in S} w_i$ for weights $(w_i)_{i\in N}\in
+ \reals_+^{|N|}$. This class of functions is also naturally parametrized
+ by the weights.
+ \item (weighted) matroid rank functions. Given a matroid $M$ over a ground
+ set $N$, we define its rank function to be:
+ \begin{displaymath}
+ \forall S\subseteq N,\;
+ r(S) = \max_{\substack{I\subseteq S\\I\in M}} |I|
+ \end{displaymath}
+		More generally, given a weight function $w:N\to\reals_+$, we define the
+ weighted matroid rank function:
+ \begin{displaymath}
+ \forall S\subseteq N,\;
+ r(S) = \max_{\substack{I\subseteq S\\I\in M}} \sum_{i\in I} w_i
+ \end{displaymath}
+\end{itemize}
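+
+As a quick check for the first class above (a sketch, writing
+$w(S) = \sum_{i\in S} w_i$): for $A\subseteq B$ and $j\notin B$, the marginal
+gain of $j$ satisfies
+\begin{displaymath}
+	f(A\cup\{j\}) - f(A) = g\big(w(A) + w_j\big) - g\big(w(A)\big)
+	\geq g\big(w(B) + w_j\big) - g\big(w(B)\big) = f(B\cup\{j\}) - f(B),
+\end{displaymath}
+since $w(A)\leq w(B)$ and, by concavity of $g$, the increment
+$t\mapsto g(t + w_j) - g(t)$ is non-increasing in $t$. Hence $f$ is
+submodular.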
+
+\paragraph{Remark} The function $f(S)= \max_{i\in S}w_i$ is a weighted matroid
+rank function for the $1$-uniform matroid (the matroid where the independent
+sets are the sets of size at most one).
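+Indeed, for the $1$-uniform matroid the independent subsets of $S$ are
+$\emptyset$ and the singletons $\{i\}$, $i\in S$, so that for non-empty $S$,
+\begin{displaymath}
+	r(S) = \max_{\substack{I\subseteq S\\|I|\leq 1}} \sum_{i\in I} w_i
+	= \max_{i\in S} w_i,
+\end{displaymath}
+using $w_i\geq 0$ for the last equality.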
+
+\subsection{Applications}
+
+\begin{itemize}
+ \item \emph{Coverage functions:} they can be written as a positive linear
+ combination of matroid rank functions:
+ \begin{displaymath}
+ f(S) = \sum_{u\in\mathcal{U}} w_u c_u(S)
+ \end{displaymath}
+		where $c_u(S) = 1$ iff $S$ contains an element covering $u$, i.e.\
+		$c_u$ is the rank function of the matroid whose independent sets are
+		$\emptyset$ and the singletons $\{i\}$ for the elements $i$ covering
+		$u$.
+ \item \emph{Facility location:} (cite Bilmes) there is a universe
+		$\mathcal{U}$ of locations and a proximity score $s_{u,v}$ for each
+ pair of locations. We pick a subset of locations $S$ and each point in
+ the universe is allocated to its closest location (the one with highest
+ proximity):
+ \begin{displaymath}
+ f(S) = \sum_{u\in\mathcal{U}} \max_{v\in S} s_{u,v}
+ \end{displaymath}
+		This can be seen as a sum of weighted matroid rank functions: one
+		$1$-uniform matroid for each point $u$ of the universe, with weights
+		$(s_{u,v})_v$ (see the display after this list); another application
+		with the same structure is job scheduling.
+
+	\item \emph{Image segmentation:} (cite Jegelka) the objective can in some
+		cases be written as a graph cut function; for image segmentation the
+		goal is to minimize the cut:
+ \begin{displaymath}
+ f(S) = \sum_{e\in E} w_e c_e(S)
+ \end{displaymath}
+ where $c_e(S)$ is one iff $e\in E(S,\bar{S})$. \textbf{TODO:} I think
+ this can be written as a matroid rank function.
+ \item \emph{Learning/value of information}: (cite Krause) there is
+ a hypothesis $A$ (a random variable) which is “refined” when more
+ observations are made. Imagine that there is a finite set $X_1,\dots,
+ X_n$ of possible observations (random variables). Then, assuming that
+ the observations are independent conditioned on $A$, the information
+ gain:
+ \begin{displaymath}
+ f(S) = H(A) - H\big(A\,|\,(X_i)_{i\in S}\big)
+ \end{displaymath}
+		is submodular. The $\log\det$ objective is the special case of a linear
+		hypothesis observed with additive independent Gaussian noise (see the
+		display after this list). This can be written as multivariate concave
+		over modular (\textbf{TODO:} I think multivariate concave over modular
+		is not submodular in general; it is for $\log\det$. Understand this
+		better).
+\end{itemize}
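+
+For the facility location example, writing $r_u$ for the weighted rank
+function of the $1$-uniform matroid with weights $(s_{u,v})_{v}$ (notation
+introduced here just for this sketch), the decomposition reads
+\begin{displaymath}
+	f(S) = \sum_{u\in\mathcal{U}} \max_{v\in S} s_{u,v}
+	= \sum_{u\in\mathcal{U}} r_u(S).
+\end{displaymath}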
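+
+For the information gain example, here is a sketch of the linear-Gaussian
+case (notation introduced here just for this sketch): suppose
+$A = \theta\sim\mathcal{N}(0,\Sigma)$ and $X_i = a_i^\top\theta
++ \varepsilon_i$ with $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$ i.i.d.\ and
+independent of $\theta$, so that the observations are indeed independent
+conditioned on $A$. Writing $M_S$ for the matrix whose rows are the
+$a_i^\top$, $i\in S$, the information gain becomes
+\begin{displaymath}
+	f(S) = H(A) - H\big(A\,|\,(X_i)_{i\in S}\big)
+	= \frac{1}{2}\log\det\Big(I + \frac{1}{\sigma^2} M_S \Sigma M_S^\top\Big),
+\end{displaymath}
+which is the $\log\det$ objective mentioned above.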
+
\section{Passive Optimization}
Let $\Omega$ be the universe of elements and $f$ a function defined on subsets
@@ -186,9 +281,7 @@ g(S^*) \geq \alpha f(S^*) \geq \alpha f(OPT) \geq
where $OPT \in \arg \max_{S \in K} g(S)$. This can be taken one step further by
considering a function $g$ such that there exists $\alpha, \beta >0$, an
additive function $f$ and a bijective univariate function $\phi$, such that
-$\forall S, \alpha \phi(f(S)) \leq g(S) \leq \beta \phi(f(S))$. In this case,
-we solve the system $A w = \phi^{-1}(B)$ and obtain once again an
-$\alpha/\beta$-approximation to the optimum of $g$.
+$\forall S, \alpha \phi(f(S)) \leq g(S) \leq \beta \phi(f(S))$. In this case, we solve the system $A w = \phi^{-1}(B)$ and obtain once again an $\alpha/\beta$-approximation to the optimum of $g$.
\begin{remark}
Note that here $D-$optimizing $f$ is easy because $D-$learning $f$ is easy. We
@@ -221,51 +314,11 @@ If $D$ is a product distribution such that $ \exists c > 0, \forall i,
\mathbb{P}(i) \geq c$, it is
easy to show that $\forall i \in
\Omega,\ \hat W_i \rightarrow_{|O| \rightarrow +\infty} w_i$
-By using standard concentration bounds, we can hope to obtain within $poly(n,
-\frac{1}{\epsilon})$ observations and with high probability:
+By using standard concentration bounds, we can hope to obtain accurate estimates $\hat W_i$ within $poly(n, \frac{1}{\epsilon})$ observations.
-\begin{align}
- &\forall i \in S : w_i \neq 0 \implies (1-\epsilon) w_i \leq \hat W_i \leq (1
- + \epsilon) w_i \label{eq:first_condition}\\
- & \forall i \in S: w_i = 0 \implies 0 \leq \hat W_i \leq \epsilon
- \label{eq:second_condition}
-\end{align}
-
-Note that if Eq.~\ref{eq:first_condition} and Eq.~\ref{eq:second_condition} are
-both verified, then we can approximately $D-$optimize $f$ under cardinality
-constraint. Let $K \in \mathbb{N}$, let $J$ be the indices of the $K$ largest
-estimated weights $\hat W_i$ and $I$ be the indices of the $K$ largest true
-weights $w_i$. We have:
-
-\begin{align*}
- \sum_{j \in J \cap \{j : w_j \neq 0\}} w_i & \geq \frac{1}{1+\epsilon}
- \sum_{j \in J \cap \{j : w_j \neq 0\}} \hat W_i \\
- & \geq \frac{1}{1+\epsilon}
- \left( \sum_{j \in J} \hat W_i - |J \cap \{j: w_j =0\} |\epsilon \right) \\
- & \geq \frac{1}{1+\epsilon} \left( \sum_{j in I} \hat W_i - K\epsilon \right)
- \\ & \geq \frac{1-\epsilon}{1+\epsilon} OPT - \frac{\epsilon}{1+\epsilon} K
-\end{align*}
-
-Let $\epsilon \leq 1/(1+K/(\delta OPT)) \approx \delta OPT/K$, then
-it follows that choosing the $K$ largest estimated weights $\hat W_i$ is a
-$(1-\delta)(1-\epsilon)/(1+\epsilon)$-approximation to the optimum.
-
-\paragraph{Interpretation} Notice that the expression of our estimator is the
-finite-sample approximation of the differential of the multilinear extension of
-$f$ with respect to $w_i$. Because $f$ is additive, the gradient of its
-multilinear extension with respect to its parameters is constant. Recalling the
-continuous-greedy procedure, it is intuitive that we can $D$-optimize an
-additive function by estimating the gradient of its multilinear extension at
-the unit vectors.
-
-\begin{remark}
-Note that we have not managed to do without a learning phase, we are still
-learning $f$ approximately to optimize it.
-\end{remark}
%For every node
-%$i$, we compute the \emph{average weight of every set containing element $i$}.
-%Let $w_i$ be
+%$i$, we compute the \emph{average weight of every set containing element $i$}. Let $w_i$ be
%the weight of element $i$, $w \equiv f(\Omega) = \sum_{i \in \Omega} w_i$ and
%$p \equiv \frac{k}{n}$, then $$\forall i \in \Omega, \mathbb{E}_{S \sim
%D_p}\left[f(S)| i \in S\right] = pw + (1 -p)w_i$$