| author | jeanpouget-abadie <jean.pougetabadie@gmail.com> | 2015-03-20 14:03:23 -0400 |
|---|---|---|
| committer | jeanpouget-abadie <jean.pougetabadie@gmail.com> | 2015-03-20 14:03:23 -0400 |
| commit | 0965d21907aaa81c7b0e11a2abae2746d7fe0d0c (patch) | |
| tree | 70d0c95b9420ee9bd76e2df47fc893acac2b6302 /results.tex | |
| parent | 3dd861100ecbcef42e8e92758cd0e2bdd6b6d623 (diff) | |
| download | learn-optimize-0965d21907aaa81c7b0e11a2abae2746d7fe0d0c.tar.gz | |
some reorganizing
Diffstat (limited to 'results.tex')
| -rw-r--r-- | results.tex | 172 |
1 file changed, 80 insertions, 92 deletions
diff --git a/results.tex b/results.tex
index abb5f8e..e167a00 100644
--- a/results.tex
+++ b/results.tex
@@ -65,17 +65,6 @@ It follows that for any vector $x \in \mathbb{R}^n$, $|\frac{1}{N}\|A x\|^2_2 -
 (C+1)\sqrt{\frac{n}{N}}\right)$. If $N > {(C+1)}^2n/\sigma^2$, then
 $\sigma_{\min}(A) > 0$ with probability at least $1 - e^{-cn}$.
 \end{proof}
-Note that this is clearly not optimal. For example, note that for $k=1$, the
-solution to the coupon collector problem states that only ${\cal O}(n\log n)$
-rows are sufficient before the matrix $A$ has full row rank with high
-probability, and not ${\cal O}(n^2)$ as stated by
-Corollary~\ref{cor:number_of_rows_needed}.
-
-\paragraph{Note:} $\Theta(n\log n)$ is in expectation for the coupon collector
-problem. If one wants an exponentially small probability of failure, then
-I think results on the coupon collector problem also imply that a quadratic
-number of rows are required.
-
 \subsection{More Direct Approach}
 
 Here we work on $\mathbb{F}_2$. An $n\times n$ binary matrix can be seen as
@@ -125,30 +114,27 @@ More generally, I think the “right” way to think about this is to look at $A$
 as the adjacency matrix of a random $k$-regular graph. There are tons of
 results about the probability that the adjacency matrix is non-singular.
 
-\section{Passive Optimization of Additive Functions}
-
-Let $\Omega$ be a universe of elements of size $n$. For $k \in [n]$,
-let $D_k$ be the uniform probability distribution over all sets of size $k$,
-and let $D_p$ be the product probability distribution over sets such that
-$\forall j \in \Omega, \mathbb{P}_{S \sim D_p}(j \in S) = p$. Note that $D_{k}$
-and $D_{p}$ for $p \equiv \frac{k}{n}$ are very similar, and will be considered
-almost interchangeably throughout the following notes.
-
-\textbf{TODO:} Make the argument about independently choosing elements with
-proba $k/n$ vs choosing a set of size $k$ rigorous. The “more direct” approach
-above should allow us to consider the case of sets of size $k$ directly.
+\section{Passive Optimization}
 
-Consider the following problem:
+In general, we say we can efficiently optimize a function $f$ under constraints
+$K$ when we have a polynomial-time algorithm making adaptive value queries to
+$f$, which returns a set $S$ such that $S \in K$ and $f(S) \geq \alpha f(S^*)$
+with high probability, where $\alpha$ is an absolute constant and $S^* \in \arg
+\max_{S \in K} f(S)$.
-\begin{problem}
-Let $f = (w_i)_{i \in \Omega}$ be an unknown additive function: $f(S) \equiv
-\sum_{u \in S} w_u$. How many (set, value) pairs $(S, f(S))$ where $S \sim D_k$
-must be observed until we can optimize $f$ exactly or to a constant
-approximation factor under cardinality constraint $k$?
-\end{problem}
+Here, we consider the scenario where we cannot make adaptive value queries and,
+in fact, where we cannot make queries at all! Instead, we suppose that we
+observe a polynomial number of set-value pairs $(S, f(S))$, where $S$ is drawn
+from a known distribution $D$. Let $L$ be the problem size. We say we can
+efficiently \emph{passively optimize} $f$ under distribution $D$ and
+constraints $K$ when, after observing ${\cal O}(L^c)$ set-value pairs from $D$,
+where $c > 0$ is an absolute constant, we can return a set $S$ such that $S
+\in K$ and $f(S) \geq \alpha f(S^*)$ with high probability, where $\alpha$ is
+an absolute constant and $S^* \in \arg \max_{S \in K} f(S)$.
+\subsection{Additive functions}
 
-\subsection{Inverting the system}\label{subsec:invert_the_system}
+\subsubsection{Inverting the system}\label{subsec:invert_the_system}
 
 Note that for observations ${(S_j, f(S_j))}_{j \in N}$ we can write the system
 $A w = b$, where each row of $A_j$ corresponds to the indicator vector of the set
@@ -164,7 +150,7 @@ and optimize $f$ for any cardinality constraint.
 optimize it under any cardinality constraint, with probability $1-e^{-cn}$.
 \end{proposition}
 
-\subsection{Average weight method}
+\subsubsection{Average weight method}
 
 Another method is to compute for every node $i$ the \emph{average weight of
 every set containing element $i$}. Let $w_i$ be the weight of element $i$, $w
@@ -212,9 +198,9 @@ As such, for $C > 0$ and $N \geq (C+1)\frac{8}{p}\log n$, we have $\forall i
 $(1+\epsilon)/(1-\epsilon)$ factor for any cardinality constraint with
 probability $XXX$.
 \end{proposition}
 
-\section{Passive Optimization of Submodular functions}
+\subsection{General submodular functions}
 
-\subsection{Inverting the system}
+\subsubsection{Inverting the system}
 
 A result by~\cite{balcan2011learning} states that:
 
@@ -229,66 +215,12 @@ $Ax=f^{-1}(b)$? Probably not; while we might not mind failing to learn the
 zeros of the function perfectly, not being able to estimate correctly a
 fraction of the non-zero sets is a big problem.
 
-\subsection{Average weight method}
-
-\section{Passive Optimization vs. Passive Learning}
-
-\subsection{Failed Attempt: returning max of observations}
-
-This doesn't work. Give examples as to why! Remember that there are strong
-concentration results for submodular functions -> look at expected value of
-observed sets
-
-\subsection{Example where optimization possible, learning impossible}
-
-Recall the matroid construction from~\cite{balcan2011learning}:
-\begin{theorem}
-  For any $k$ with $k = 2^{o(n^{1/3})}$, there exists a family of sets ${\cal A}
-  \subseteq2^{[n]}$ and a family of matroids $\cal{M} = \{M_{\cal{B}} :
-  \cal{B} \subseteq\cal{A} \}$ such that:
-  \begin{itemize}
-  \item $|{\cal A}| = k$ and $|A| = n^{1/3}$ for every $A \in \cal{A}$
-  \item For every $\cal{B} \subseteq\cal{A}$ and every $A \in \cal{A}$, we
-  have:
-  \begin{align*}
-  \text{rank}_{M_{\cal{B}}}(A) & = 8 \log k \ if A\in{\cal B} \\
-  & = |A| \ if A\in {\cal A}\backslash{\cal B}
-  \end{align*}
-  \end{itemize}
-\end{theorem}
-
-Consider the following subset of the above family of matroids: ${\cal
-M}^{\epsilon} = \{M_{\cal B} : {\cal B} \subseteq{\cal A} \wedge |{\cal
-A}\backslash{\cal B}| \geq \epsilon|{\cal A}|\}$ for $k = 2^{n^{1/6}}$.
-Consider an \emph{unknown} function $f$, corresponding to the rank function of
-one of the matroids $M_{\cal B}$ from ${\cal M}^{\epsilon}$. Note that as long
-as we observe \emph{one} set from ${\cal A} \backslash {\cal B}$, we can
-optimize $f$ exactly! Indeed, the largest possible value for $f$ under
-cardinality constraint $n^{1/3}$ is $\max_{A\in 2^{[n]}} |A| = n^{1/3}$.
-
-One example of a distribution under which this occurs with probability at least
-a constant is $D_u$, the uniform distribution over all sets of ${\cal A}$.
-For $c>0$, after $n^c$ observations $O \equiv (S_i, f(S_i))_i$ for $S_i \sim D_u$,
-we will observe at least one element from $\cal{A}\backslash\cal{B}$ with
-constant probability:
-
-\begin{equation} \nonumber \mathbb{P}(\bigcup_{S_i} S_i \in {\cal
-A}\backslash{\cal B}) \geq 1 - (1 - \epsilon)^{n^c} \approx \epsilon n^c
-\end{equation}
-
-However, it follows from the analysis of~\cite{balcan2011learning} that we
-cannot learn $f$ under any distribution, even with active value queries!
-Indeed, for any polynomially-sized set of observations, there exists a
-super-polynomially number of functions in ${\cal M^1}$ which coincide on this
-set of observations, but which take very different values outside of this set
-of observations: $8n^{1/6}$ for $A \in {\cal B}$ and $n^{1/3}$ for $A \in {\cal
-A}\backslash {\cal B}$.
+\subsubsection{Average weight method}
-
-{\bf TODO:} A cleaner simpler example would be nice.
-\section{The case of parametric submodular functions}
+\subsection{Parametric submodular functions}
-
-\subsection{Influence Functions}
+\subsubsection{Influence Functions}
 
 Let $G = (V, E)$ be a weighted directed graph. Without loss of generality, we
 can assign a weight $p_{u,v} \in [0,1]$ to every possible edge $(u,v) \in V^2$.
@@ -371,7 +303,63 @@ Also note that in the limit of large $m$, $(1
 -\frac{1}{\alpha m})^m \rightarrow \exp(-1/\alpha)$ and the approximation ratio
 goes to $1 - \exp(-\exp(-2/\alpha))$.
 
-\subsection{Active set selection of stationary Gaussian Processes}
+\subsubsection{Active set selection of stationary Gaussian Processes}
+
+\section{Passive Optimization vs. Passive Learning}
+
+\subsection{Failed Attempt: returning max of observations}
+
+This doesn't work. Give examples as to why! Remember that there are strong
+concentration results for submodular functions; look at the expected value of
+the observed sets.
+
+\subsection{Example where optimization possible, learning impossible}
+
+Recall the matroid construction from~\cite{balcan2011learning}:
+\begin{theorem}
+  For any $k$ with $k = 2^{o(n^{1/3})}$, there exists a family of sets ${\cal A}
+  \subseteq 2^{[n]}$ and a family of matroids ${\cal M} = \{M_{\cal B} :
+  {\cal B} \subseteq {\cal A} \}$ such that:
+  \begin{itemize}
+  \item $|{\cal A}| = k$ and $|A| = n^{1/3}$ for every $A \in {\cal A}$
+  \item For every ${\cal B} \subseteq {\cal A}$ and every $A \in {\cal A}$, we
+  have:
+  \begin{align*}
+  \text{rank}_{M_{\cal B}}(A) & = 8 \log k \text{ if } A\in{\cal B} \\
+  & = |A| \text{ if } A\in {\cal A}\backslash{\cal B}
+  \end{align*}
+  \end{itemize}
+\end{theorem}
+
+Consider the following subset of the above family of matroids: ${\cal
+M}^{\epsilon} = \{M_{\cal B} : {\cal B} \subseteq {\cal A} \wedge |{\cal
+A}\backslash{\cal B}| \geq \epsilon|{\cal A}|\}$ for $k = 2^{n^{1/6}}$.
+Consider an \emph{unknown} function $f$, corresponding to the rank function of
+one of the matroids $M_{\cal B}$ from ${\cal M}^{\epsilon}$. Note that as long
+as we observe \emph{one} set from ${\cal A} \backslash {\cal B}$, we can
+optimize $f$ exactly! Indeed, the largest possible value for $f$ under
+cardinality constraint $n^{1/3}$ is $\max_{A : |A| \leq n^{1/3}} |A| = n^{1/3}$.
+
+One example of a distribution under which this occurs with probability at least
+a constant is $D_u$, the uniform distribution over all sets of ${\cal A}$.
+For $c>0$, after $n^c$ observations $O \equiv (S_i, f(S_i))_i$ for $S_i \sim D_u$,
+we will observe at least one set from ${\cal A}\backslash{\cal B}$ with
+probability approaching one:
+
+\begin{equation} \nonumber \mathbb{P}\left(\exists\, i : S_i \in {\cal
+A}\backslash{\cal B}\right) \geq 1 - (1 - \epsilon)^{n^c} \geq 1 - e^{-\epsilon n^c}
+\end{equation}
+
+However, it follows from the analysis of~\cite{balcan2011learning} that we
+cannot learn $f$ under any distribution, even with active value queries!
+Indeed, for any polynomially-sized set of observations, there exists a
+super-polynomial number of functions in ${\cal M}^1$ which coincide on this
+set of observations, but which take very different values outside of this set
+of observations: $8n^{1/6}$ for $A \in {\cal B}$ and $n^{1/3}$ for $A \in {\cal
+A}\backslash {\cal B}$.
+
+{\bf TODO:} A cleaner, simpler example would be nice.
+
 \bibliography{optimize}
 \bibliographystyle{apalike}
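The sketches below are illustrative only and are not part of the diffed notes. First, the question of how many passively observed sets are needed before the observation matrix $A$ has full column rank (the condition used above to recover the additive weights exactly) can be checked empirically. This is a minimal sketch assuming sets drawn uniformly from $D_k$ and measuring rank over the reals rather than over $\mathbb{F}_2$; the function name and the toy parameters are made up for illustration.

```python
import numpy as np

def rows_until_full_column_rank(n, k, rng, max_rows=10_000):
    """Draw uniform size-k subsets of [n], stack their indicator vectors as
    rows of a matrix A, and return how many rows were needed before A has
    full column rank (recomputing the rank after each new row)."""
    rows = []
    for count in range(1, max_rows + 1):
        S = rng.choice(n, size=k, replace=False)   # a set drawn from D_k
        row = np.zeros(n)
        row[S] = 1.0
        rows.append(row)
        if np.linalg.matrix_rank(np.vstack(rows)) == n:
            return count
    return None  # full rank not reached within max_rows

rng = np.random.default_rng(0)
n, k = 40, 5
trials = [rows_until_full_column_rank(n, k, rng) for _ in range(10)]
# empirical count vs. the n and n log n reference points discussed in the notes
print(np.mean(trials), n, n * np.log(n))
```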
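The "inverting the system" approach for additive functions amounts to stacking the indicator vectors of the observed sets into $A$, solving $Aw = b$, and keeping the $k$ largest recovered weights. A minimal sketch under the $D_p$ observation model with $p = k/n$; the helper name `recover_and_optimize` and the toy parameters are assumptions, not from the source.

```python
import numpy as np

def recover_and_optimize(observations, n, k):
    """Stack the indicator vectors of the observed sets into A, solve A w = b
    by least squares, and return the k elements with the largest recovered
    weights.  Recovery is exact once A has full column rank, which the notes
    argue happens with high probability after enough observations."""
    A = np.zeros((len(observations), n))
    b = np.zeros(len(observations))
    for j, (S, value) in enumerate(observations):
        A[j, list(S)] = 1.0
        b[j] = value
    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.argsort(w_hat)[-k:], w_hat

# Toy usage under the D_p model with p = k/n.
rng = np.random.default_rng(1)
n, k, N = 50, 5, 2000
w = rng.exponential(size=n)                      # hidden additive weights
f = lambda S: float(w[S].sum())                  # f(S) = sum_{u in S} w_u
obs = []
for _ in range(N):
    S = np.flatnonzero(rng.random(n) < k / n)    # each element kept w.p. k/n
    obs.append((S, f(S)))
top_k, w_hat = recover_and_optimize(obs, n, k)
print(set(top_k) == set(np.argsort(w)[-k:]))     # exact recovery, no noise
```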
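The average weight method scores each element $i$ by the empirical mean of $f(S)$ over observed sets containing $i$. Under $D_p$ this mean concentrates around $w_i + p\sum_{j \neq i} w_j$, an increasing affine function of $w_i$, so the induced ranking matches the ranking of the true weights. A sketch (names illustrative) that reuses `obs`, `w`, `f`, `n`, and `k` from the previous snippet:

```python
import numpy as np

def average_weight_top_k(observations, n, k):
    """Score element i by the average value of the observed sets containing i,
    then return the k highest-scoring elements.  Elements never observed get
    score -inf so they are ranked last."""
    totals = np.zeros(n)
    counts = np.zeros(n)
    for S, value in observations:
        totals[list(S)] += value
        counts[list(S)] += 1
    scores = np.divide(totals, counts,
                       out=np.full(n, -np.inf), where=counts > 0)
    return np.argsort(scores)[-k:]

chosen = average_weight_top_k(obs, n, k)
best = np.argsort(w)[-k:]
print(f(chosen) / f(best))   # close to 1 when the ranking is (nearly) correct
```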
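For the influence functions of the parametric section, the diff only shows the setup (a directed graph with edge weights $p_{u,v} \in [0,1]$), not the definition of the influence itself. Assuming the usual live-edge / independent-cascade semantics, where the influence of a seed set is the expected number of nodes reachable when each edge $(u,v)$ is kept independently with probability $p_{u,v}$, a Monte Carlo estimator looks like the following sketch (all names illustrative).

```python
import numpy as np

def estimate_influence(p, seeds, rng, num_samples=200):
    """Monte Carlo estimate of the influence of `seeds`: the expected number
    of nodes reachable from the seed set when each directed edge (u, v) is
    kept independently with probability p[u, v].  This live-edge semantics is
    an assumption; the diff does not show the notes' exact definition."""
    n = p.shape[0]
    total = 0
    for _ in range(num_samples):
        live = rng.random((n, n)) < p            # sample a live-edge graph
        reached = set(seeds)
        frontier = list(seeds)
        while frontier:                          # BFS over live edges
            u = frontier.pop()
            for v in np.flatnonzero(live[u]):
                if v not in reached:
                    reached.add(v)
                    frontier.append(v)
        total += len(reached)
    return total / num_samples

# Toy usage: a sparse random graph and a single seed node.
rng = np.random.default_rng(2)
n = 30
p = (rng.random((n, n)) < 0.1) * rng.uniform(0.1, 0.5, size=(n, n))
np.fill_diagonal(p, 0.0)
print(estimate_influence(p, [0], rng))
```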
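Given such an estimator, optimizing under a cardinality constraint is typically done with the standard greedy algorithm for monotone submodular maximization, which achieves a $1 - 1/e$ approximation when the function is evaluated exactly; the ratios quoted in the diff account for the additional loss from using estimated parameters. A sketch reusing `estimate_influence`, `p`, and `rng` from the previous snippet:

```python
def greedy_seed_selection(p, k, rng, num_samples=200):
    """Standard greedy for monotone submodular maximization under a
    cardinality constraint: repeatedly add the node whose addition gives the
    largest estimated influence (equivalently, the largest marginal gain)."""
    n = p.shape[0]
    seeds = []
    for _ in range(k):
        gains = []
        for v in range(n):
            if v in seeds:
                gains.append(-np.inf)            # never re-add a chosen seed
            else:
                gains.append(estimate_influence(p, seeds + [v], rng, num_samples))
        seeds.append(int(np.argmax(gains)))
    return seeds

print(greedy_seed_selection(p, 3, rng))
```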
