| author | jeanpouget-abadie <jean.pougetabadie@gmail.com> | 2015-03-20 14:03:23 -0400 |
|---|---|---|
| committer | jeanpouget-abadie <jean.pougetabadie@gmail.com> | 2015-03-20 14:03:23 -0400 |
| commit | 0965d21907aaa81c7b0e11a2abae2746d7fe0d0c (patch) | |
| tree | 70d0c95b9420ee9bd76e2df47fc893acac2b6302 /results.tex | |
| parent | 3dd861100ecbcef42e8e92758cd0e2bdd6b6d623 (diff) | |
| download | learn-optimize-0965d21907aaa81c7b0e11a2abae2746d7fe0d0c.tar.gz | |
some reorganizing
Diffstat (limited to 'results.tex')
| -rw-r--r-- | results.tex | 172 |
1 file changed, 80 insertions, 92 deletions
diff --git a/results.tex b/results.tex
index abb5f8e..e167a00 100644
--- a/results.tex
+++ b/results.tex
@@ -65,17 +65,6 @@ It follows that for any vector $x \in \mathbb{R}^n$, $|\frac{1}{N}\|A x\|^2_2 -
 (C+1)\sqrt{\frac{n}{N}}\right)$. If $N > {(C+1)}^2n/\sigma^2$, then
 $\sigma_{\min}(A) > 0$ with probability at least $1 - e^{-cn}$.
 \end{proof}
-Note that this is clearly not optimal. For example, note that for $k=1$, the
-solution to the coupon collector problem states that only ${\cal O}(n\log n)$
-rows are sufficient before the matrix $A$ has full row rank with high
-probability, and not ${\cal O}(n^2)$ as stated by
-Corollary~\ref{cor:number_of_rows_needed}.
-
-\paragraph{Note:} $\Theta(n\log n)$ is in expectation for the coupon collector
-problem. If one wants an exponentially small probability of failure, then
-I think results on the coupon collector problem also imply that a quadratic
-number of rows are required.
-
 \subsection{More Direct Approach}
 
 Here we work on $\mathbb{F}_2$. An $n\times n$ binary matrix can be seen as
@@ -125,30 +114,27 @@ More generally, I think the “right” way to think about this is to look at $A$
 as the adjacency matrix of a random $k$-regular graph. There are tons of
 results about the probability that the adjacency matrix is non-singular.
 
-\section{Passive Optimization of Additive Functions}
-
-Let $\Omega$ be a universe of elements of size $n$. For $k \in [n]$,
-let $D_k$ be the uniform probability distribution over all sets of size $k$,
-and let $D_p$ be the product probability distribution over sets such that
-$\forall j \in \Omega, \mathbb{P}_{S \sim D_p}(j \in S) = p$. Note that $D_{k}$
-and $D_{p}$ for $p \equiv \frac{k}{n}$ are very similar, and will be considered
-almost interchangeably throughout the following notes.
-
-\textbf{TODO:} Make the argument about independently choosing elements with
-proba $k/n$ vs choosing a set of size $k$ rigorous. The “more direct” approach
-above should allow us to consider the case of sets of size $k$ directly.
+\section{Passive Optimization}
 
-Consider the following problem:
+In general, we say we can efficiently optimize a function $f$ under constraints
+$K$ when we have a polynomial-time algorithm making adaptive value queries to
+$f$, which returns a set $S$ such that $S \in K$ and $f(S) \geq \alpha f(S^*)$
+with high probability, where $\alpha$ is an absolute constant and $S^* \in \arg
+\max_{S \in K} f(S)$.
-\begin{problem}
-Let $f = (w_i)_{i \in \Omega}$ be an unknown additive function: $f(S) \equiv
-\sum_{u \in S} w_u$. How many (set, value) pairs $(S, f(S))$ where $S \sim D_k$
-must be observed until we can optimize $f$ exactly or to a constant
-approximation factor under cardinality constraint $k$?
-\end{problem}
+Here, we consider the scenario where we cannot make adaptive value queries and,
+in fact, where we cannot make queries at all! Instead, we suppose that we
+observe a polynomial number of set-value pairs $(S, f(S))$, where $S$ is drawn
+from a known distribution $D$. Let $L$ be the problem size. We say we can
+efficiently \emph{passively optimize} $f$ under distribution $D$ and
+constraints $K$ when, after observing ${\cal O}(L^c)$ set-value pairs from $D$,
+where $c > 0$ is an absolute constant, we can return a set $S$ such that $S
+\in K$ and $f(S) \geq \alpha f(S^*)$ with high probability, where $\alpha$ is
+an absolute constant and $S^* \in \arg \max_{S \in K} f(S)$.
+\subsection{Additive functions}
 
-\subsection{Inverting the system}\label{subsec:invert_the_system}
+\subsubsection{Inverting the system}\label{subsec:invert_the_system}
 
 Note that for observations ${(S_j, f(S_j))}_{j \in N}$ we can write the system
 $A w = b$, where each row of $A_j$ corresponds to the indicator vector of the set
@@ -164,7 +150,7 @@ and optimize $f$ for any cardinality constraint.
 optimize it under any cardinality constraint, with probability $1-e^{-cn}$.
 \end{proposition}
 
-\subsection{Average weight method}
+\subsubsection{Average weight method}
 
 Another method is to compute for every node $i$ the \emph{average weight of
 every set containing element $i$}. Let $w_i$ be the weight of element $i$, $w
@@ -212,9 +198,9 @@ As such, for $C > 0$ and $N \geq (C+1)\frac{8}{p}\log n$, we have $\forall i
 $(1+\epsilon)/(1-\epsilon)$ factor for any cardinality constraint with
 probability $XXX$.
 \end{proposition}
 
-\section{Passive Optimization of Submodular functions}
+\subsection{General submodular functions}
 
-\subsection{Inverting the system}
+\subsubsection{Inverting the system}
 
 A result by~\cite{balcan2011learning} states that:
 
@@ -229,66 +215,12 @@ $Ax=f^{-1}(b)$? Probably not; while we might not mind failing to learn the
 zeros of the function perfectly, not being able to estimate correctly a
 fraction of the non-zero sets is a big problem.
 
-\subsection{Average weight method}
-
-\section{Passive Optimization vs. Passive Learning}
-
-\subsection{Failed Attempt: returning max of observations}
-
-This doesn't work. Give examples as to why! Remember that there are strong
-concentration results for submodular functions -> look at expected value of
-observed sets
-
-\subsection{Example where optimization possible, learning impossible}
-
-Recall the matroid construction from~\cite{balcan2011learning}:
-\begin{theorem}
-  For any $k$ with $k = 2^{o(n^{1/3})}$, there exists a family of sets ${\cal A}
-  \subseteq2^{[n]}$ and a family of matroids $\cal{M} = \{M_{\cal{B}} :
-  \cal{B} \subseteq\cal{A} \}$ such that:
-  \begin{itemize}
-  \item $|{\cal A}| = k$ and $|A| = n^{1/3}$ for every $A \in \cal{A}$
-  \item For every $\cal{B} \subseteq\cal{A}$ and every $A \in \cal{A}$, we
-  have:
-  \begin{align*}
-  \text{rank}_{M_{\cal{B}}}(A) & = 8 \log k \ if A\in{\cal B} \\
-  & = |A| \ if A\in {\cal A}\backslash{\cal B}
-  \end{align*}
-  \end{itemize}
-\end{theorem}
-
-Consider the following subset of the above family of matroids: ${\cal
-M}^{\epsilon} = \{M_{\cal B} : {\cal B} \subseteq{\cal A} \wedge |{\cal
-A}\backslash{\cal B}| \geq \epsilon|{\cal A}|\}$ for $k = 2^{n^{1/6}}$.
-Consider an \emph{unknown} function $f$, corresponding to the rank function of
-one of the matroids $M_{\cal B}$ from ${\cal M}^{\epsilon}$. Note that as long
-as we observe \emph{one} set from ${\cal A} \backslash {\cal B}$, we can
-optimize $f$ exactly! Indeed, the largest possible value for $f$ under
-cardinality constraint $n^{1/3}$ is $\max_{A\in 2^{[n]}} |A| = n^{1/3}$.
-
-One example of a distribution under which this occurs with probability at least
-a constant is $D_u$, the uniform distribution over all sets of ${\cal A}$.
-For $c>0$, after $n^c$ observations $O \equiv (S_i, f(S_i))_i$ for $S_i \sim D_u$,
-we will observe at least one element from $\cal{A}\backslash\cal{B}$ with
-constant probability:
-
-\begin{equation} \nonumber \mathbb{P}(\bigcup_{S_i} S_i \in {\cal
-A}\backslash{\cal B}) \geq 1 - (1 - \epsilon)^{n^c} \approx \epsilon n^c
-\end{equation}
-
-However, it follows from the analysis of~\cite{balcan2011learning} that we
-cannot learn $f$ under any distribution, even with active value queries!
-Indeed, for any polynomially-sized set of observations, there exists a
-super-polynomially number of functions in ${\cal M^1}$ which coincide on this
-set of observations, but which take very different values outside of this set
-of observations: $8n^{1/6}$ for $A \in {\cal B}$ and $n^{1/3}$ for $A \in {\cal
-A}\backslash {\cal B}$.
+\subsubsection{Average weight method}
-
-{\bf TODO:} A cleaner simpler example would be nice.
-\section{The case of parametric submodular functions}
+\subsection{Parametric submodular functions}
-
-\subsection{Influence Functions}
+\subsubsection{Influence Functions}
 
 Let $G = (V, E)$ be a weighted directed graph. Without loss of generality, we
 can assign a weight $p_{u,v} \in [0,1]$ to every possible edge $(u,v) \in V^2$.
@@ -371,7 +303,63 @@ Also note that in the limit of large $m$, $(1
 -\frac{1}{\alpha m})^m \rightarrow \exp(-1/\alpha)$ and the approximation ratio
 goes to $1 - \exp(-\exp(-2/\alpha))$.
 
-\subsection{Active set selection of stationary Gaussian Processes}
+\subsubsection{Active set selection of stationary Gaussian Processes}
+
+\section{Passive Optimization vs. Passive Learning}
+
+\subsection{Failed Attempt: returning max of observations}
+
+This doesn't work. Give examples as to why! Remember that there are strong
+concentration results for submodular functions; look at the expected value of
+the observed sets.
+
+\subsection{Example where optimization possible, learning impossible}
+
+Recall the matroid construction from~\cite{balcan2011learning}:
+\begin{theorem}
+  For any $k$ with $k = 2^{o(n^{1/3})}$, there exists a family of sets ${\cal A}
+  \subseteq 2^{[n]}$ and a family of matroids ${\cal M} = \{M_{\cal B} :
+  {\cal B} \subseteq {\cal A} \}$ such that:
+  \begin{itemize}
+  \item $|{\cal A}| = k$ and $|A| = n^{1/3}$ for every $A \in {\cal A}$
+  \item For every ${\cal B} \subseteq {\cal A}$ and every $A \in {\cal A}$, we
+  have:
+  \begin{align*}
+  \text{rank}_{M_{\cal B}}(A) & = 8 \log k \text{ if } A\in{\cal B} \\
+  & = |A| \text{ if } A\in {\cal A}\backslash{\cal B}
+  \end{align*}
+  \end{itemize}
+\end{theorem}
+
+Consider the following subset of the above family of matroids: ${\cal
+M}^{\epsilon} = \{M_{\cal B} : {\cal B} \subseteq {\cal A} \wedge |{\cal
+A}\backslash{\cal B}| \geq \epsilon|{\cal A}|\}$ for $k = 2^{n^{1/6}}$.
+Consider an \emph{unknown} function $f$, corresponding to the rank function of
+one of the matroids $M_{\cal B}$ from ${\cal M}^{\epsilon}$. Note that as long
+as we observe \emph{one} set from ${\cal A} \backslash {\cal B}$, we can
+optimize $f$ exactly! Indeed, the largest possible value for $f$ under
+cardinality constraint $n^{1/3}$ is $\max_{A : |A| \leq n^{1/3}} |A| = n^{1/3}$.
+
+One example of a distribution under which this occurs with probability at least
+a constant is $D_u$, the uniform distribution over all sets of ${\cal A}$.
+For $c>0$, after $n^c$ observations $O \equiv (S_i, f(S_i))_i$ for $S_i \sim D_u$,
+we will observe at least one set from ${\cal A}\backslash{\cal B}$ with
+probability approaching one:
+
+\begin{equation} \nonumber \mathbb{P}\left(\exists\, i : S_i \in {\cal
+A}\backslash{\cal B}\right) \geq 1 - (1 - \epsilon)^{n^c} \geq 1 - e^{-\epsilon n^c}
+\end{equation}
+
+However, it follows from the analysis of~\cite{balcan2011learning} that we
+cannot learn $f$ under any distribution, even with active value queries!
+Indeed, for any polynomially-sized set of observations, there exists a
+super-polynomial number of functions in ${\cal M}^1$ which coincide on this
+set of observations, but which take very different values outside of this set
+of observations: $8n^{1/6}$ for $A \in {\cal B}$ and $n^{1/3}$ for $A \in {\cal
+A}\backslash {\cal B}$.
+
+{\bf TODO:} A cleaner, simpler example would be nice.
+
 \bibliography{optimize}
 \bibliographystyle{apalike}
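The sketches below are illustrative only and are not part of the diffed notes. First, the question of how many passively observed sets are needed before the observation matrix $A$ has full column rank (the condition used above to recover the additive weights exactly) can be checked empirically. This is a minimal sketch assuming sets drawn uniformly from $D_k$ and measuring rank over the reals rather than over $\mathbb{F}_2$; the function name and the toy parameters are made up for illustration.

```python
import numpy as np

def rows_until_full_column_rank(n, k, rng, max_rows=10_000):
    """Draw uniform size-k subsets of [n], stack their indicator vectors as
    rows of a matrix A, and return how many rows were needed before A has
    full column rank (recomputing the rank after each new row)."""
    rows = []
    for count in range(1, max_rows + 1):
        S = rng.choice(n, size=k, replace=False)   # a set drawn from D_k
        row = np.zeros(n)
        row[S] = 1.0
        rows.append(row)
        if np.linalg.matrix_rank(np.vstack(rows)) == n:
            return count
    return None  # full rank not reached within max_rows

rng = np.random.default_rng(0)
n, k = 40, 5
trials = [rows_until_full_column_rank(n, k, rng) for _ in range(10)]
# empirical count vs. the n and n log n reference points discussed in the notes
print(np.mean(trials), n, n * np.log(n))
```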
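The "inverting the system" approach for additive functions amounts to stacking the indicator vectors of the observed sets into $A$, solving $Aw = b$, and keeping the $k$ largest recovered weights. A minimal sketch under the $D_p$ observation model with $p = k/n$; the helper name `recover_and_optimize` and the toy parameters are assumptions, not from the source.

```python
import numpy as np

def recover_and_optimize(observations, n, k):
    """Stack the indicator vectors of the observed sets into A, solve A w = b
    by least squares, and return the k elements with the largest recovered
    weights.  Recovery is exact once A has full column rank, which the notes
    argue happens with high probability after enough observations."""
    A = np.zeros((len(observations), n))
    b = np.zeros(len(observations))
    for j, (S, value) in enumerate(observations):
        A[j, list(S)] = 1.0
        b[j] = value
    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.argsort(w_hat)[-k:], w_hat

# Toy usage under the D_p model with p = k/n.
rng = np.random.default_rng(1)
n, k, N = 50, 5, 2000
w = rng.exponential(size=n)                      # hidden additive weights
f = lambda S: float(w[S].sum())                  # f(S) = sum_{u in S} w_u
obs = []
for _ in range(N):
    S = np.flatnonzero(rng.random(n) < k / n)    # each element kept w.p. k/n
    obs.append((S, f(S)))
top_k, w_hat = recover_and_optimize(obs, n, k)
print(set(top_k) == set(np.argsort(w)[-k:]))     # exact recovery, no noise
```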
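The average weight method scores each element $i$ by the empirical mean of $f(S)$ over observed sets containing $i$. Under $D_p$ this mean concentrates around $w_i + p\sum_{j \neq i} w_j$, an increasing affine function of $w_i$, so the induced ranking matches the ranking of the true weights. A sketch (names illustrative) that reuses `obs`, `w`, `f`, `n`, and `k` from the previous snippet:

```python
import numpy as np

def average_weight_top_k(observations, n, k):
    """Score element i by the average value of the observed sets containing i,
    then return the k highest-scoring elements.  Elements never observed get
    score -inf so they are ranked last."""
    totals = np.zeros(n)
    counts = np.zeros(n)
    for S, value in observations:
        totals[list(S)] += value
        counts[list(S)] += 1
    scores = np.divide(totals, counts,
                       out=np.full(n, -np.inf), where=counts > 0)
    return np.argsort(scores)[-k:]

chosen = average_weight_top_k(obs, n, k)
best = np.argsort(w)[-k:]
print(f(chosen) / f(best))   # close to 1 when the ranking is (nearly) correct
```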
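For the influence functions of the parametric section, the diff only shows the setup (a directed graph with edge weights $p_{u,v} \in [0,1]$), not the definition of the influence itself. Assuming the usual live-edge / independent-cascade semantics, where the influence of a seed set is the expected number of nodes reachable when each edge $(u,v)$ is kept independently with probability $p_{u,v}$, a Monte Carlo estimator looks like the following sketch (all names illustrative).

```python
import numpy as np

def estimate_influence(p, seeds, rng, num_samples=200):
    """Monte Carlo estimate of the influence of `seeds`: the expected number
    of nodes reachable from the seed set when each directed edge (u, v) is
    kept independently with probability p[u, v].  This live-edge semantics is
    an assumption; the diff does not show the notes' exact definition."""
    n = p.shape[0]
    total = 0
    for _ in range(num_samples):
        live = rng.random((n, n)) < p            # sample a live-edge graph
        reached = set(seeds)
        frontier = list(seeds)
        while frontier:                          # BFS over live edges
            u = frontier.pop()
            for v in np.flatnonzero(live[u]):
                if v not in reached:
                    reached.add(v)
                    frontier.append(v)
        total += len(reached)
    return total / num_samples

# Toy usage: a sparse random graph and a single seed node.
rng = np.random.default_rng(2)
n = 30
p = (rng.random((n, n)) < 0.1) * rng.uniform(0.1, 0.5, size=(n, n))
np.fill_diagonal(p, 0.0)
print(estimate_influence(p, [0], rng))
```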
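Given such an estimator, optimizing under a cardinality constraint is typically done with the standard greedy algorithm for monotone submodular maximization, which achieves a $1 - 1/e$ approximation when the function is evaluated exactly; the ratios quoted in the diff account for the additional loss from using estimated parameters. A sketch reusing `estimate_influence`, `p`, and `rng` from the previous snippet:

```python
def greedy_seed_selection(p, k, rng, num_samples=200):
    """Standard greedy for monotone submodular maximization under a
    cardinality constraint: repeatedly add the node whose addition gives the
    largest estimated influence (equivalently, the largest marginal gain)."""
    n = p.shape[0]
    seeds = []
    for _ in range(k):
        gains = []
        for v in range(n):
            if v in seeds:
                gains.append(-np.inf)            # never re-add a chosen seed
            else:
                gains.append(estimate_influence(p, seeds + [v], rng, num_samples))
        seeds.append(int(np.argmax(gains)))
    return seeds

print(greedy_seed_selection(p, 3, rng))
```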
