 results.tex | 143
 1 file changed, 98 insertions(+), 45 deletions(-)
diff --git a/results.tex b/results.tex
index c9acedd..8e6a186 100644
--- a/results.tex
+++ b/results.tex
@@ -7,6 +7,9 @@ \DeclareMathOperator{\P}{\mathbb{P}}
 \newcommand{\ex}[1]{\E\left[#1\right]}
 \newcommand{\prob}[1]{\P\left[#1\right]}
+\newcommand{\reals}{\mathbb{R}}
+\newcommand{\ints}{\mathbb{N}}
+\renewcommand{\O}{\mathcal{O}}
 
 \newtheorem{proposition}{Proposition}
 
@@ -116,6 +119,98 @@ More generally, I think the “right” way to think about this is to look at $A
 as the adjacency matrix of a random $k$-regular graph. There are tons of
 results about the probability of the adjacency matrix to be non singular.
 
+\section{Examples}
+
+\subsection{Generic Functions}
+
+\paragraph{TODO:} Add citations!
+
+These are generic examples and serve as building blocks for the applications
+below:
+\begin{itemize}
+    \item $f(S) = g\left(\sum_{i\in S} w_i\right)$ for a concave
+        $g:\reals\to\reals$ and weights $(w_i)_{i\in N}\in \reals_+^{|N|}$. Note
+        that $f$ is monotone iff $g$ is monotone. In this case, $g$ does not
+        matter for the purpose of optimization: the sets are in the same order
+        with or without $g$; the only things which matter are the weights, which
+        serve as natural parameters for this class of functions. This class of
+        functions contains:
+        \begin{itemize}
+            \item additive functions (when $g$ is the identity function).
+            \item $f(S) = |S\cap X|$ for some set $X\subseteq N$. This is the
+                case where the weights are $0/1$ and $g$ is the identity
+                function.
+            \item symmetric submodular functions: when the weights are all one.
+            \item budget-additive functions, when $g:x\mapsto \min(B, x)$ for
+                some $B$.
+        \end{itemize}
+    \item $f(S) = \max_{i\in S} w_i$ for weights $(w_i)_{i\in N}\in
+        \reals_+^{|N|}$. This class of functions is also naturally parametrized
+        by the weights.
+    \item (weighted) matroid rank functions. Given a matroid $M$ over a ground
+        set $N$, we define its rank function to be:
+        \begin{displaymath}
+            \forall S\subseteq N,\;
+            r(S) = \max_{\substack{I\subseteq S\\I\in M}} |I|
+        \end{displaymath}
+        More generally, given a weight function $w:N\to\reals_+$, we define the
+        weighted matroid rank function:
+        \begin{displaymath}
+            \forall S\subseteq N,\;
+            r(S) = \max_{\substack{I\subseteq S\\I\in M}} \sum_{i\in I} w_i
+        \end{displaymath}
+\end{itemize}
+
+\paragraph{Remark} The function $f(S)= \max_{i\in S}w_i$ is a weighted matroid
+rank function for the $1$-uniform matroid (the matroid where the independent
+sets are the sets of size at most one).
+
+\subsection{Applications}
+
+\begin{itemize}
+    \item \emph{Coverage functions:} they can be written as a positive linear
+        combination of matroid rank functions:
+        \begin{displaymath}
+            f(S) = \sum_{u\in\mathcal{U}} w_u c_u(S)
+        \end{displaymath}
+        where $c_u$ is the rank function of the matroid $M = \big\{ \emptyset,
+        \{u\}\big\}$.
+    \item \emph{Facility location:} (cite Bilmes) there is a universe
+        $\mathcal{U}$ of locations and a proximity score $s_{i,j}$ for each
+        pair of locations. We pick a subset of locations $S$ and each point in
+        the universe is allocated to its closest location (the one with highest
+        proximity):
+        \begin{displaymath}
+            f(S) = \sum_{u\in\mathcal{U}} \max_{v\in S} s_{u,v}
+        \end{displaymath}
+        This can be seen as a sum of weighted matroid rank functions: one for
+        each location in the universe associated with a $1$-uniform matroid
+        (other applications: job scheduling).
+
+    \item \emph{Image segmentation:} (cite Jegelka) can be (in some cases)
+        written as a graph cut function. For image segmentation, the goal is to
+        minimize the cut.
+        \begin{displaymath}
+            f(S) = \sum_{e\in E} w_e c_e(S)
+        \end{displaymath}
+        where $c_e(S)$ is one iff $e\in E(S,\bar{S})$. \textbf{TODO:} I think
+        this can be written as a matroid rank function.
+    \item \emph{Learning/value of information}: (cite Krause) there is
+        a hypothesis $A$ (a random variable) which is “refined” when more
+        observations are made. Imagine that there is a finite set $X_1,\dots,
+        X_n$ of possible observations (random variables). Then, assuming that
+        the observations are independent conditioned on $A$, the information
+        gain:
+        \begin{displaymath}
+            f(S) = H(A) - H\big(A\,|\,(X_i)_{i\in S}\big)
+        \end{displaymath}
+        is submodular. The $\log\det$ is the specific case of a linear
+        hypothesis observed with additive independent Gaussian noise.
+        This can be written as multivariate concave over modular
+        (\textbf{TODO:} I think multivariate concave over modular is not
+        submodular in general; it is for $\log\det$. Understand this better).
+\end{itemize}
+
 \section{Passive Optimization}
 
 Let $\Omega$ be the universe of elements and $f$ a function defined on subsets
@@ -186,9 +281,7 @@ g(S^*) \geq \alpha f(S^*) \geq \alpha f(OPT) \geq
 where $OPT \in \arg \max_{S \in K} g(S)$. This can be taken one step further
 by considering a function $g$ such that there exists $\alpha, \beta >0$, an
 additive function $f$ and a bijective univariate function $\phi$, such that
-$\forall S, \alpha \phi(f(S)) \leq g(S) \leq \beta \phi(f(S))$. In this case,
-we solve the system $A w = \phi^{-1}(B)$ and obtain once again an
-$\alpha/\beta$-approximation to the optimum of $g$.
+$\forall S, \alpha \phi(f(S)) \leq g(S) \leq \beta \phi(f(S))$. In this case, we solve the system $A w = \phi^{-1}(B)$ and obtain once again an $\alpha/\beta$-approximation to the optimum of $g$.
 
 \begin{remark}
 Note that here $D-$optimizing $f$ is easy because $D-$learning $f$ is easy. We
@@ -221,51 +314,11 @@ If $D$ is a product distribution such that $ \exists c > 0, \forall i,
 \mathbb{P}(i) \geq c$, it is easy to show that $\forall i \in \Omega,\ \hat
 W_i \rightarrow_{|O| \rightarrow +\infty} w_i$
 
-By using standard concentration bounds, we can hope to obtain within $poly(n,
-\frac{1}{\epsilon})$ observations and with high probability:
+By using standard concentration bounds, we can hope to obtain within $poly(n, \frac{1}{\epsilon})$ observations:
 
-\begin{align}
- &\forall i \in S : w_i \neq 0 \implies (1-\epsilon) w_i \leq \hat W_i \leq (1
- + \epsilon) w_i \label{eq:first_condition}\\
- & \forall i \in S: w_i = 0 \implies 0 \leq \hat W_i \leq \epsilon
- \label{eq:second_condition}
-\end{align}
-
-Note that if Eq.~\ref{eq:first_condition} and Eq.~\ref{eq:second_condition} are
-both verified, then we can approximately $D-$optimize $f$ under cardinality
-constraint. Let $K \in \mathbb{N}$, let $J$ be the indices of the $K$ largest
-estimated weights $\hat W_i$ and $I$ be the indices of the $K$ largest true
-weights $w_i$. We have:
-
-\begin{align*}
- \sum_{j \in J \cap \{j : w_j \neq 0\}} w_i & \geq \frac{1}{1+\epsilon}
- \sum_{j \in J \cap \{j : w_j \neq 0\}} \hat W_i \\
- & \geq \frac{1}{1+\epsilon}
- \left( \sum_{j \in J} \hat W_i - |J \cap \{j: w_j =0\} |\epsilon \right) \\
- & \geq \frac{1}{1+\epsilon} \left( \sum_{j in I} \hat W_i - K\epsilon \right)
- \\ & \geq \frac{1-\epsilon}{1+\epsilon} OPT - \frac{\epsilon}{1+\epsilon} K
-\end{align*}
-
-Let $\epsilon \leq 1/(1+K/(\delta OPT)) \approx \delta OPT/K$, then
-it follows that choosing the $K$ largest estimated weights $\hat W_i$ is a
-$(1-\delta)(1-\epsilon)/(1+\epsilon)$-approximation to the optimum.
-
-\paragraph{Interpretation} Notice that the expression of our estimator is the
-finite-sample approximation of the differential of the multilinear extension of
-$f$ with respect to $w_i$. Because $f$ is additive, the gradient of its
-multilinear extension with respect to its parameters is constant. Recalling the
-continuous-greedy procedure, it is intuitive that we can $D$-optimize an
-additive function by estimating the gradient of its multilinear extension at
-the unit vectors.
-
-\begin{remark}
-Note that we have not managed to do without a learning phase, we are still
-learning $f$ approximately to optimize it.
-\end{remark}
 
 %For every node
-%$i$, we compute the \emph{average weight of every set containing element $i$}.
-%Let $w_i$ be
+%$i$, we compute the \emph{average weight of every set containing element $i$}. Let $w_i$ be
 %the weight of element $i$, $w \equiv f(\Omega) = \sum_{i \in \Omega} w_i$ and
 %$p \equiv \frac{k}{n}$, then $$\forall i \in \Omega, \mathbb{E}_{S \sim
 %D_p}\left[f(S)| i \in S\right] = pw + (1 -p)w_i$$
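
To make the generic building blocks of the new \section{Examples} hunk concrete, here is a minimal Python sketch. It is illustrative only and not taken from results.tex: the function names are made up, the matroid is assumed to be given as an independence oracle `is_independent`, and the weighted rank is computed with the standard greedy scan, which is exact on matroids.

    from typing import Callable, Dict, Set

    def concave_of_modular(weights: Dict[str, float],
                           g: Callable[[float], float]) -> Callable[[Set[str]], float]:
        # f(S) = g(sum_{i in S} w_i): additive when g is the identity,
        # budget-additive when g = min(B, .), a function of |S| when all w_i = 1.
        return lambda S: g(sum(weights[i] for i in S))

    def max_weight(weights: Dict[str, float]) -> Callable[[Set[str]], float]:
        # f(S) = max_{i in S} w_i, with f(empty set) = 0.
        return lambda S: max((weights[i] for i in S), default=0.0)

    def weighted_matroid_rank(weights: Dict[str, float],
                              is_independent: Callable[[Set[str]], bool]
                              ) -> Callable[[Set[str]], float]:
        # r(S) = maximum weight of an independent subset of S.  Greedy (scan S
        # by decreasing weight, keep an element whenever independence is
        # preserved) is exact for matroids.
        def rank(S: Set[str]) -> float:
            kept: Set[str] = set()
            for i in sorted(S, key=lambda j: weights[j], reverse=True):
                if is_independent(kept | {i}):
                    kept.add(i)
            return sum(weights[i] for i in kept)
        return rank

    # f(S) = max_i w_i is the weighted rank of the 1-uniform matroid.
    w = {"a": 3.0, "b": 1.0, "c": 2.0}
    one_uniform = lambda I: len(I) <= 1            # independent sets: |I| <= 1
    assert weighted_matroid_rank(w, one_uniform)({"a", "b", "c"}) == \
           max_weight(w)({"a", "b", "c"}) == 3.0
    budget_additive = concave_of_modular(w, lambda x: min(4.0, x))   # B = 4
    assert budget_additive({"a", "c"}) == 4.0

Coverage and facility location fit the same pattern: each universe element contributes the (weighted) rank of a matroid over the covering sets, which is what the sums over $u\in\mathcal{U}$ in the Applications subsection express.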

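In the same illustrative spirit, the passive-optimization recipe in the later hunks (observe pairs $(S, g(S))$, solve $A w = \phi^{-1}(B)$ for the weights of the underlying additive $f$, then optimize under a cardinality constraint by keeping the $K$ largest estimated weights) can be sketched as follows. The helper names, the noiseless toy observations, and the use of NumPy least squares are assumptions made for the example, not part of results.tex.

    import numpy as np

    def recover_weights(observations, n, phi_inverse=lambda y: y):
        # Observations are pairs (S, value) with value ~ phi(f(S)) for an
        # additive f.  Build the 0/1 incidence matrix A of the observed sets
        # and solve A w = phi^{-1}(B) in the least-squares sense.
        A = np.zeros((len(observations), n))
        b = np.zeros(len(observations))
        for row, (S, value) in enumerate(observations):
            A[row, list(S)] = 1.0
            b[row] = phi_inverse(value)
        w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
        return w_hat

    def top_k(w_hat, k):
        # Cardinality-constrained optimization of the recovered additive
        # function: keep the indices of the k largest estimated weights.
        return set(np.argsort(w_hat)[-k:].tolist())

    # Toy run on made-up data: noiseless observations of an additive f.
    rng = np.random.default_rng(0)
    w_true = np.array([5.0, 1.0, 0.0, 3.0, 2.0])
    observations = []
    for _ in range(20):
        S = {i for i in range(5) if rng.random() < 0.5}
        observations.append((S, w_true[list(S)].sum()))
    w_hat = recover_weights(observations, n=5)
    print(top_k(w_hat, 2))   # with enough observations this is {0, 3}

With noisy observations the same solve plays the role of the concentration argument removed in the last hunk: once every $\hat W_i$ is within a $(1 \pm \epsilon)$ factor of $w_i$, picking the $K$ largest estimates is near-optimal.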