Starting paper draft

author: Thibaut Horel <thibaut.horel@gmail.com> 2015-04-14 15:20:19 -0400
committer: Thibaut Horel <thibaut.horel@gmail.com> 2015-04-14 15:20:19 -0400
commit: d8a68c6917f5b6053117e0145f6d4d80a8bec26b (patch)
tree: 2abe1acc26ced6358dc2d112fea88147deac6283 /paper/sections/positive.tex
parent: f5a0b1da9a7ff3346ac3af4ad9c1dd0fd9e71f46 (diff)
download: learn-optimize-d8a68c6917f5b6053117e0145f6d4d80a8bec26b.tar.gz
1 files changed, 365 insertions, 0 deletions
diff --git a/paper/sections/positive.tex b/paper/sections/positive.tex
new file mode 100644
index 0000000..c4e8e0f
--- /dev/null
+++ b/paper/sections/positive.tex
@@ -0,0 +1,365 @@
+\subsection{Applications}
+
+\paragraph{TODO:} Add citations!
+
+\subsubsection{Building blocks}
+
+These are generic examples and serve as building blocks for the applications
+below:
+\begin{itemize}
+    \item $f(S) = g\left(\sum_{i\in S} w_i\right)$ for a concave
+        $g:\reals\to\reals$ and weights $(w_i)_{i\in N}\in \reals_+^{|N|}$. Note
+        that $f$ is monotone iff $g$ is monotone. In this case, $g$ does not
+        matter for the purpose of optimization: the sets are in the same order
+        with or without $g$, the only things which matter are the weights which
+        serve as natural parameters for this class of functions. This class of
+        functions contains:
+        \begin{itemize}
+            \item additive functions (when $g$ is the identity function).
+            \item $f(S) = |S\cap X|$ for some set $X\subseteq N$. This is the
+                case where the weights are $0/1$ and $g$ is the identity
+                function.
+            \item symmetric submodular functions: when the weights are all one.
+            \item budget-additive functions, when $g:x\mapsto \min(B, x)$ for
+                some $B$.
+        \end{itemize}
+    \item $f(S) = \max_{i\in S} w_i$ for weights $(w_i)_{i\in N}\in
+        \reals_+^{|N|}$. This class of functions is also naturally parametrized
+        by the weights.
+    \item (weighted) matroid rank functions. Given a matroid $M$ over a ground
+        set $N$, we define its rank function to be:
+        \begin{displaymath}
+            \forall S\subseteq N,\;
+            r(S) = \max_{\substack{I\subseteq S\\I\in M}} |I|
+        \end{displaymath}
+        more generally, given a weight function $w:N\to\reals_+$, we define the
+        weighted matroid rank function:
+        \begin{displaymath}
+            \forall S\subseteq N,\;
+            r(S) = \max_{\substack{I\subseteq S\\I\in M}} \sum_{i\in I} w_i
+        \end{displaymath}
+\end{itemize}
+
+\paragraph{Remark} The function $f(S)= \max_{i\in S}w_i$ is a weighted matroid
+rank function for the $1$-uniform matroid (the matroid where the independent
+sets are the sets of size at most one).
+
+\subsubsection{Examples}
+
+\begin{itemize}
+    \item \emph{Coverage functions:} they can be written as a positive linear
+        combination of matroid rank functions:
+        \begin{displaymath}
+            f(S) = \sum_{u\in\mathcal{U}} w_u c_u(S)
+        \end{displaymath}
+        where $c_u$ is the rank function of the matroid $M = \big\{ \emptyset,
+        \{u\}\big\}$.
+    \item \emph{Facility location:} (cite Bilmes) there is a universe
+        $\mathcal{U}$ of locations and a proximity score $s_{i,j}$ for each
+        pair of locations.  We pick a subset of locations $S$ and each point in
+        the universe is allocated to its closest location (the one with highest
+        proximity):
+        \begin{displaymath}
+            f(S) = \sum_{u\in\mathcal{U}} \max_{v\in S} s_{u,v}
+        \end{displaymath}
+        This can be seen as a sum of weighted matroid rank functions: one for
+        each location in the universe associated with a $1$-uniform matroid
+        (other applications: job scheduling).
+
+    \item \emph{Image segmentation:} (cite Jegelka) can be (in some cases)
+        written as a graph cut function. For image segmentation the goal is to
+        minimize the cut.
+        \begin{displaymath}
+            f(S) = \sum_{e\in E} w_e c_e(S)
+        \end{displaymath}
+        where $c_e(S)$ is one iff $e\in E(S,\bar{S})$. \textbf{TODO:} I think
+        this can be written as a matroid rank function.
+    \item \emph{Learning} (cite Krause) there is
+        a hypothesis $A$ (a random variable) which is “refined” when more
+        observations are made.  Imagine that there is a finite set $X_1,\dots,
+        X_n$ of possible observations (random variables). Then, assuming that
+        the observations are independent conditioned on $A$, the information
+        gain:
+        \begin{displaymath}
+            f(S) = H(A) - H\big(A\,|\,(X_i)_{i\in S}\big)
+        \end{displaymath}
+        is submodular. The $\log\det$ is the specific case of a linear
+        hypothesis observed with additional independent Gaussian noise.
+    \item \emph{Entropy:} Closely related to the previous one. If $(X_1,\dots,
+        X_n)$ are random variables, then:
+        $ f(S) = H(X_S) $ is submodular. In particular, if $(X_1,\dots,
+        X_n)$ are jointly multivariate gaussian, then:
+        \begin{displaymath}
+            f(S) = c|S| + \frac{1}{2}\log\det X_SX_S^T
+        \end{displaymath}
+        for $c= 2\pi e...$ and we fall back to the usual $\log\det$ function.
+    \item \emph{data subset selection/summarization:} in statistical machine
+        translation, Bilmes used sum of concave over modular:
+        \begin{displaymath}
+            f(S) = \sum_{f} \lambda_f \phi\left(\sum_{e\in S}w_f(e)\right)
+        \end{displaymath}
+        where each $f$ represents a feature, $w_f(e)$ represents how much of
+        $f$ element $e$ has, and $\phi$ captures decreasing marginal gain when
+        we have a lot of a given feature.
+        Facility location functions are also commonly used for subset selection.
+    \item \emph{concave spectral functions} One would be tempted to say that
+        any multivariate concave function of a modular function is submodular.
+        This would be the natural generalization of concave over modular.
+        However \textbf{this is not true in general}. However, a possible nice
+        generalization is the following. Let $M$ be a symmetric $n\times
+        n$ matrix, and $g$ is a ``matrix concave'' function. Then:
+        \begin{displaymath}
+            f(S) = \mathrm{Tr}\big(g(M_S)\big)
+        \end{displaymath}
+        is submodular. This contains the $\log\det$ (when $g$ is the matrix
+        $\log$) but tons of other functions (like quantum entropy).
+\end{itemize}
+In summary, the two most general classes of submodular functions (which capture
+all the examples known to man) are: sums of matrix concave functions and sums
+of matroid rank functions. Sums of concave over modular are also nice if we
+want to start with a simpler example.
+
+\subsection{Additive Functions}
+
+We consider here the simple case of additive functions. A function $f$ is
+additive if there exists a weight vector $(w_s)_{s \in \Omega}$ such that
+$\forall S \subseteq \Omega, \ f(S) = \sum_{s \in S} w_s$. We will sometimes
+adopt the notation $w \equiv f(\Omega)$.
+
+Suppose we have observed $n^c$ set-value pairs ${(S_j, f(S_j))}_{j \in N}$ with
+$S_j \sim D$ where $n \equiv |\Omega|$. Consider the $n^c \times n$ matrix $A$
+where for all $i$, the row $A_i = \chi_{S_i}$, the indicator vector of set
+$S_i$. Let $B$ be the $n^c \times 1$ vector such that $\forall i, b_j \equiv
+f(S_j)$. It is easy to see that if $w$ is the $n \times 1$ corresponding weight
+vector for $f$ then:
+\begin{displaymath}
+  A w = B
+\end{displaymath}
+
+Note that if $A$ has full column rank, then we can solve for $w$ exactly and
+also optimize $f$ under any cardinality constraint. We can therefore cast the
+question of $D-$learning and $D-$optimizing $f$ as a random matrix theory
+problem: what is the probability that after $n^c$ for $c > 0$ independent
+samples from $D$, the matrix $A$ will have full rank? See
+section~\ref{sec:matrix_theory}
+
+\paragraph{Extension}
+Note that the previous reasoning can easily be extended to any \emph{almost}
+additive function. Consider a function $g$ such that there exists $\alpha > 0$
+and $\beta > 0$ and an additive function $f$ such that $\forall S, \alpha f(S)
+\leq g(S) \leq \beta f(S)$, then by solving for $\max_{S\in K} f(S)$ we have a
+$\alpha/\beta$-approximation to the optimum of $g$ since:
+
+\begin{displaymath}
+g(S^*) \geq \alpha f(S^*) \geq \alpha f(OPT) \geq
+\frac{\alpha}{\beta} g(OPT)
+\end{displaymath}
+
+where $OPT \in \arg \max_{S \in K} g(S)$. This can be taken one step further by
+considering a function $g$ such that there exists $\alpha, \beta >0$, an
+additive function $f$ and a bijective univariate function $\phi$, such that
+$\forall S, \alpha \phi(f(S)) \leq g(S) \leq \beta \phi(f(S))$. In this case, we
+solve the system $A w = \phi^{-1}(B)$ and obtain once again an
+$\alpha/\beta$-approximation to the optimum of $g$.
+
+
+\subsubsection{Average weight method}
+
+We consider here another method to $D-$optimizing $f$, which only requires
+$D-$learning $f$ approximately. Note that for every element $i \in \Omega$, and
+for a \emph{product} distribution $D$:
+\begin{displaymath}
+  \mathbb{E}_{S \sim D}(f(S)|i \in S) - \mathbb{E}_{S \sim D}(f(S) | i \notin
+  S) = \sum_{j \neq i \in \Omega} w_s \left(\mathbb{P}(j\in S|i \in S) -
+  \mathbb{P}(j \in S | i \notin S)\right) + w_i(1 - 0) = w_i
+\end{displaymath}
+
+Let $O$ be the collection of all sets $S$ for which we have observed $f(S)$ and
+let $O_i \equiv \{S \in O : i \in S\}$ and $O^c_i \equiv \{ S\in O: i \notin S
+\}$. If $O_i$ and $O^c_i$ are non-empty, define the following weight estimator:
+\begin{displaymath}
+  \hat W_i \equiv \frac{1}{|O_i|} \sum_{S \in O_i} f(S) - \frac{1}{|O_i^c|}
+  \sum_{S \in O_i^c} f(S)
+\end{displaymath}
+
+If $D$ is a product distribution such that $ \exists c > 0, \forall i,
+\mathbb{P}(i) \geq c$, it is
+easy to show that $\forall i \in
+\Omega,\ \hat W_i \rightarrow_{|O| \rightarrow +\infty} w_i$
+By using standard concentration bounds, we can hope to obtain within $poly(n,
+\frac{1}{\epsilon})$ observations:
+
+\subsubsection{Generic Random Matrix Theory}
+\label{sec:matrix_theory}
+
+We cite the following result from~\cite{vershynin2010introduction} (Remark 5.40):
+
+\begin{proposition}\label{prop:non_isotropic_isometry} 
+Assume that $A$ is an $N \times n$ matrix whose rows $A_i$ are independent
+sub-gaussian random vectors in $\mathbb{R}^n$ with second moment matrix
+$\Sigma$. Then for every $t \geq 0$, the following inequality holds with
+probability at least: $1- 2 \exp(-ct^2)$: \begin{equation} \|\frac{1}{N} A^T A
+- \Sigma \| \leq \max(\delta, \delta^2) \ \text{where}\ \delta =
+C\sqrt{\frac{n}{N}} + \frac{t}{\sqrt{N}} \end{equation} where $C$, $c$ depend
+only on the subgaussian norm of the rows $K \equiv \max_i \| A_i\|_{\psi_2}$.
+\end{proposition}
+
+The following result is a simple corollary of
+Proposition~\ref{prop:non_isotropic_isometry}:
+
+\begin{corollary}\label{cor:number_of_rows_needed} Let $n \in \mathbb{N}$ and
+  $k \in ]0,n[$. Assume that $A$ is an $N \times n$ matrix whose rows $A_i$ are
+  independent Bernoulli variable vectors in $\mathbb{R}^n$ such that
+  $\mathbb{P}(A_{i,j} = 1) = \frac{k}{n} = 1 - \mathbb{P}(A_{i,j} = 0)$ and let
+  $\sigma \equiv \frac{k}{n}(1- \frac{k}{n}) \neq 0$, then if $N > {(C+1)}^2
+  n/\sigma^2$, the matrix A has full row rank with probability at least $1 -
+  e^{-cn}$, where $C, c$ are constant depending only on  the subgaussian norm
+  of the rows $K \equiv \sup_{p \geq 1}
+  {\frac{1}{\sqrt{p}}(\mathbb{E}|A_1|^p)}^\frac{1}{p} = k$\footnote{Is this true?
+  And how do the constants behave?} \end{corollary}
+
+\begin{proof} It is easy to compare the kernels of $A$ and $A^TA$ and notice
+  that $rank(A) = rank(A^T A)$. Since $A^TA$ is an $n \times n$ matrix, it
+  follows that if $A^TA$ is invertible, then $A$ has full row rank. In other
+  words, if $A$ has smallest singular value $\sigma_{\min}(A) > 0$, then $A$
+  has full row rank.  Consider Prop.~\ref{prop:non_isotropic_isometry} with $t
+  \equiv \sqrt{n}$ and with $N > {(C+1)}^{2} n$, then with probability $1 -
+  2\exp(-cn)$: \begin{equation} \|\frac{1}{N} A^T A - \sigma I \| \leq
+    (C+1)\sqrt{\frac{n}{N}} \end{equation}
+
+It follows that for any vector $x \in \mathbb{R}^n$, $|\frac{1}{N}\|A x\|^2_2
+-\sigma | \leq (C+1)\sqrt{\frac{n}{N}} \implies \|A x\|^2_2 \geq N\left(\sigma
+- (C+1)\sqrt{\frac{n}{N}}\right)$. If $N > {(C+1)}^2n/\sigma^2$, then
+$\sigma_{\min}(A) > 0$ with probability at least $1 - e^{-cn}$ \end{proof}
+
+\subsubsection{Algebraic Approach}
+
+Here we work on $\mathbb{F}_2$. An $n\times n$ binary matrix can be seen as
+a matrix over $\mathbb{F}_2$. Let us assume that each row of $A$ is chosen
+uniformly at random among all binary rows of length $n$. A standard counting
+arguments shows that the number of non-singular matrices over $\mathbb{F}_2$
+is:
+\begin{displaymath}
+    N_n = (2^n-1)(2^n - 2)\dots (2^n- 2^{n-1})
+    = (2^n)^n\prod_{i=1}^n\left(1-\frac{1}{2^i}\right)
+\end{displaymath}
+
+Hence the probability that our random matrix $A$ is invertible is:
+\begin{displaymath}
+    P_n  = \prod_{i=1}^n\left(1-\frac{1}{2^i}\right)
+\end{displaymath}
+It is easy to see that $P_n$ is a decreasing sequence. Its limit is
+$\phi(\frac{1}{2})$ where $\phi$ is the Euler function. We have
+$\phi(\frac{1}{2})\approx
+0.289$ and $P_n$ converges
+exponentially fast to this constant (one way to see this is to use the
+Pentagonal number theorem).
+
+Hence, if we observe $\ell\cdot n$ uniformly random binary rows, the
+probability that they will have full column rank is at least:
+\begin{displaymath}
+    P_{\ell\cdot n}\geq 1 - \left(1-\phi\left(\frac{1}{2}\right)\right)^{\ell}
+\end{displaymath}
+
+Note that this is the probability of having full column rank over
+$\mathbb{F}_2$. A standard linear algebra argument shows that this implies full
+column rank over $\mathbb{R}$.
+
+\paragraph{TODO:} Study the case where we only observe sets of size exactly
+$k$, or at most $k$. This amounts to replacing $2^n$ in the computation above
+by:
+\begin{displaymath}
+{n\choose k}\quad\text{or}\quad\sum_{j=0}^k{n\choose j}
+\end{displaymath}
+
+Thinking about it, why do we assume that the sample sets are of size exactly
+$k$. I think it would make more sense from the learning perspective to consider
+uniformly random sets. In this case, the above approach allows us to conclude
+directly.
+
+More generally, I think the “right” way to think about this is to look at $A$
+as the adjacency matrix of a random $k$-regular graph. There are tons of
+results about the probability of the adjacency matrix being non singular.
+
+\subsection{Influence Functions}
+
+Let $G = (V, E)$ be a weighted directed graph. Without loss of generality, we
+can assign a weight $p_{u,v} \in [0,1]$ to every possible edge $(u,v) \in V^2$.
+Let $m$ be the number of non-zero edges of $G$. Let $\sigma_G(S, p)$ be the
+influence of the set $S \subseteq V$ in $G$ under the IC model with parameters
+$p$.
+
+Recall from \cite{Kempe:03} that:
+\begin{equation}
+    \sigma_G(S, p) = \sum_{v\in V} \P_{G_p}\big[r_{G_p}(S\leadsto v)\big]
+\end{equation}
+where $G_p$ is a random graph where each edge $(u,v)\in E$ appears with
+probability $p_{u,v}$ and $r_{G_p}(S\leadsto v)$ is the indicator variable of
+\emph{there exists a path from $S$ to $v$ in $G_p$}.
+
+\begin{claim}
+\label{cla:oracle}
+If for all $(u,v) \in V^2$, $p_{u,v} \geq p'_{u,v} \geq p_{u,v}
+- \frac{1}{\alpha m}$, then:
+\begin{displaymath}
+    \forall S \subseteq V,\, \sigma_{G}(S, p) \geq \sigma_{G}(S,p') \geq (1
+    - 1/\alpha) \sigma_{G}(S,p)
+\end{displaymath}
+\end{claim}
+
+\begin{proof}
+    We define two coupled random graphs $G_p$ and $G_p'$ as follows: for each
+    edge $(u,v)\in E$, draw a uniform random variable $\mathcal{U}_{u,v}\in
+    [0,1]$. Include $(u,v)$ in $G_p$ (resp. $G_{p'}$) iff
+    $\mathcal{U}_{u,v}\leq p_{u,v}$ (resp. $\mathcal{U}_{u,v}\leq p'_{u,v}$).
+    It is clear from the construction that each edge $(u,v)$ will be present in
+    $G_p$ (resp. $G_p'$) with probability $p_{u,v}$ (resp. $p'_{u,v}$). Hence:
+    \begin{displaymath}
+        \sigma_G(S, p) = \sum_{v\in V} \P_{G_p}\big[r_{G_p}(S\leadsto v)\big]
+        \text{  and  }
+        \sigma_G(S, p') = \sum_{v\in V} \P_{G_p'}\big[r_{G_p'}(S\leadsto v)\big]
+    \end{displaymath}
+
+    By construction $G_{p'}$ is always a subgraph of $G_p$, i.e
+    $r_{G_p'}(S\leadsto v)\leq r_{G_p}(S\leadsto v)$. This proves the left-hand
+    side of the Claim's inequality.
+
+    The probability that an edge is present in $G_p$ but not in $G_p'$ is at
+    most $\frac{1}{\alpha m}$, so with probability $\left(1-\frac{1}{\alpha
+    m}\right)^m$, $G_p$ and $G_p'$ have the same set of edges. This implies that:
+    \begin{displaymath} \P_{G_{p'}}\big[r_{G_p'}(S\leadsto v)\big]\geq
+    \left(1-\frac{1}{\alpha m}\right)^m\P_{G_{p}}\big[r_{G_p}(S\leadsto v)\big]
+  \end{displaymath}
+    which proves the right-hand side of the claim after observing that
+    $\left(1-\frac{1}{\alpha m}\right)^m\geq 1-\frac{1}{\alpha}$ with equality
+    iff $m=1$.
+\end{proof}
+
+ We can use Claim~\ref{cla:oracle} to find a constant factor approximation
+ algorithm to maximising influence on $G$ by maximising influence on $G'$. For
+ $k \in \mathbb{N}^*$, let $O_k \in \arg\max_{|S| \leq k} \sigma_G(S)$ and
+ $\sigma_{G}^* = \sigma_G(O_k)$ where we omit the $k$ when it is unambiguous.
+
+
+\begin{proposition}
+\label{prop:approx_optim}
+Suppose we have an unknown graph $G$ and a known graph $G'$ such that $V = V'$
+and for all $(u,v) \in V^2, |p_{u,v} - p_{u,v}'| \leq  \frac{1}{\alpha m}$.
+Then for all $k \in \mathbb{N}^*$, we can find a set $\hat O_k$ such that
+$\sigma_{G}(\hat O_k) \geq (1 - e^{\frac{2}{\alpha} - 1}) \sigma^*_{G}$
+\end{proposition}
+
+\begin{proof} For every edge $(u,v) \in V^2$, let $\hat p = p'_{u,v} -
+  \frac{1}{\alpha m}$. We are now in the conditions of Claim~\ref{cla:oracle}
+  with $\alpha \leftarrow \alpha/2$. We return the set $\hat O_k$ obtained by
+  greedy maximisation on $\hat G$. It is a classic exercise to show then that
+  $\sigma_G(\hat O_k) \geq 1 - e^{\frac{2}{\alpha} - 1}$ (see Pset 1, CS284r).
+\end{proof}
+
+A small note on the approximation factor: it is only $>0$ for $\alpha > 2$.
+Note that $\alpha \geq 7 \implies 1 - e^{\frac{2}{\alpha} - 1} \geq
+\frac{1}{2}$ and that it converges to $(1 - 1/e)$ which is the best possible
+polynomial-time approximation ratio to influence maximisation unless $P = NP$.
+Also note that in the limit of large $m$, $(1 -\frac{1}{\alpha m})^m
+\rightarrow \exp(-1/\alpha)$ and the approximation ratio goes to $1 -
+\exp(-\exp(-2/\alpha))$.
author	Thibaut Horel <thibaut.horel@gmail.com>	2015-04-14 15:20:19 -0400
committer	Thibaut Horel <thibaut.horel@gmail.com>	2015-04-14 15:20:19 -0400
commit	d8a68c6917f5b6053117e0145f6d4d80a8bec26b (patch)
tree	2abe1acc26ced6358dc2d112fea88147deac6283 /paper/sections/positive.tex
parent	f5a0b1da9a7ff3346ac3af4ad9c1dd0fd9e71f46 (diff)
download	learn-optimize-d8a68c6917f5b6053117e0145f6d4d80a8bec26b.tar.gz