path: root/general.tex
author    Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)>  2013-02-11 15:49:50 -0800
committer Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)>  2013-02-11 15:49:50 -0800
commit    05da1a98508fdc6a7e2745d7dc649ccfb921edee (patch)
tree      d71bf5b7707239d66cac92ff0f7e93819f18b116 /general.tex
parent    ec9b831af5c4e953e6000485380c591b3d7ad965 (diff)
download  recommendation-05da1a98508fdc6a7e2745d7dc649ccfb921edee.tar.gz
general
Diffstat (limited to 'general.tex')
-rw-r--r--  general.tex | 55
1 file changed, 26 insertions(+), 29 deletions(-)
diff --git a/general.tex b/general.tex
index 5f80c58..87475a4 100644
--- a/general.tex
+++ b/general.tex
@@ -11,8 +11,8 @@ matrix $R$, the value function takes the general form given by
\eqref{dcrit}.
Applying the mechanism described in Algorithm~\ref{mechanism} and adapting the
-analysis of the approximation ratio, we get the following result which extends
-Theorem~\ref{thm:main}:
+analysis of the approximation ratio \emph{mutatis mutandis}, we get the following result which extends
+Theorem~\ref{thm:main}.
\begin{theorem}
There exists a truthful, individually rational and budget feasible mechanism for the objective
@@ -30,8 +30,8 @@ where $\mu$ is the smallest eigenvalue of $R$.
In the non-Bayesian setting, \emph{i.e.,} when the experimenter has no prior
distribution on the model, the covariance matrix $R$ is the zero matrix. In this case,
-the ridge regression estimation proceedure \eqref{ridge} reduces to simple least squares (\emph{i.e.}, linear regression),
-and the $D$-optimality reduces to the entropy of $\hat{\beta}$, given by:
+the ridge regression estimation procedure \eqref{ridge} reduces to simple least squares (\emph{i.e.}, linear regression),
+and the $D$-optimality criterion reduces to the entropy of $\hat{\beta}$, given by:
\begin{equation}\label{eq:d-optimal}
V(S) = \log\det(X_S^TX_S)
\end{equation}
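As an illustrative aside (toy data, not from the paper), the $D$-optimality criterion $V(S) = \log\det(X_S^TX_S)$ above is easy to evaluate directly, and greedy selection by marginal gain is the natural heuristic for maximizing it. The sketch below assumes $d=2$ features so the determinant has a closed form; all names and data are hypothetical:

```python
import math
import random

# Toy sketch of the D-optimality value function V(S) = log det(X_S^T X_S)
# for least squares with d = 2 features, plus greedy experiment selection.
random.seed(0)
n, d = 8, 2
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # public features x_i

def V(S):
    """Log-determinant of X_S^T X_S; 0 on the empty set, -inf if singular."""
    if not S:
        return 0.0
    a = b = c = 0.0                    # M = X_S^T X_S = [[a, b], [b, c]]
    for i in S:
        x0, x1 = X[i]
        a += x0 * x0
        b += x0 * x1
        c += x1 * x1
    det = a * c - b * b
    return math.log(det) if det > 0 else float("-inf")

# Greedy: repeatedly add the experiment with the largest marginal gain.
S = set()
while len(S) < 4:
    best = max(set(range(n)) - S, key=lambda i: V(S | {i}))
    S.add(best)
```

Note that any $S$ with $|S| < d$ gives a singular $X_S^TX_S$, hence $V(S) = -\infty$; monotonicity holds because adding a row adds a positive-semidefinite rank-one term to $X_S^TX_S$.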
@@ -58,37 +58,36 @@ value function $V(S) = \det{\T{X_S}X_S}$.
\subsection{Beyond Linear Models}
Selecting experiments that maximize the information gain in the Bayesian setup
leads to a natural generalization to other learning examples beyond linear
-regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$ take values in some general set $\Omega$, called the \emph{query space}. Measurements $y_i\in\reals$ are given by
+regression. In particular, consider the following variant of the standard PAC learning setup \cite{valiant}: assume that the features $x_i$, $i\in \mathcal{N}$ take values in some generic set $\Omega$, called the \emph{query space}.
+Measurements $y_i\in\reals$ are given by
\begin{equation}\label{eq:hypothesis-model}
y_i = h(x_i) + \varepsilon_i
\end{equation}
where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings
-$h:\Omega\to\reals$, called the \emph{hypothesis space}, and $\varepsilon_i$
+$h:\Omega\to\reals$, called the \emph{hypothesis space}. As before, we assume that the experimenter has a prior distribution on the
+hypothesis $h\in \mathcal{H}$; we also assume that $\varepsilon_i$
are random variables in $\reals$, not necessarily identically distributed, that
-are independent \emph{conditioned on $h$}. This model is quite broad, and
-captures many learning tasks, such as support vector machines (SVM) and linear
-discriminant analysis (LDA); we give a few concrete examples below:
+are independent \emph{conditioned on $h$}. As before, the features $x_i$ are public, and the goal of the experimenter is to (a) retrieve measurements $y_i$ and (b) estimate $h$ as accurately as possible.
+
+
+This model is quite broad, and
+captures many classic machine learning tasks; we give a few concrete examples below:
\begin{enumerate}
-\item\textbf{Generalized Linear Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of linear maps $\{h(x) = \T{\beta}x \text{ s.t. } \beta\in \reals^d\}$, and $\varepsilon_i$ are independent zero-mean normal variables, where $\expt{\varepsilon_i^2}=\sigma_i$.
-\item\textbf{Logistic Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of maps $\{h(x) = \frac{e^{\T{\beta} x}}{1+e^{\T{\beta} x}} \text{ s.t. } \beta\in\reals^d\}$, and $\varepsilon_i$ are independent conditioned on $h$ such that $$\varepsilon_i=\begin{cases} 1- h(x_i),& \text{w.~prob.}\;h(x_i)\\-h(x_i),&\text{w.~prob.}\;1-h(x_i)\end{cases}$$
-\item\textbf{Learning Binary Functions with Bernoulli Noise.} $\Omega = \{0,1\}^d$, and $\mathcal{H}$ is some subset of $2^{\Omega}$, and $$\varepsilon_i =\begin{cases}0, &\text{w.~prob.}\;p\\\bar{h}(x_i)-h(x_i), &\text{w.~prob.}\;1-p\end{cases}$$
+\item\textbf{Generalized Linear Regression.} In this case, $\Omega=\reals^d$, $\mathcal{H}$ is the set of linear maps $\{h(x) = \T{\beta}x \text{ s.t. } \beta\in \reals^d\}$, and $\varepsilon_i$ are independent zero-mean variables (not necessarily identically distributed).
+\item\textbf{Learning Binary Functions with Bernoulli Noise.} When learning a binary function under noise, the experimenter wishes to determine a binary function $h$ by testing its output on different inputs; however, the output may be corrupted with probability $p$. Formally, $\Omega = \{0,1\}^d$, $\mathcal{H}$ is some subset of binary functions $h:\Omega\to\{0,1\}$, and $$\varepsilon_i =\begin{cases}0, &\text{w.~prob.}\;1-p\\\bar{h}(x_i)-h(x_i), &\text{w.~prob.}\;p\end{cases}$$
+\item\textbf{Logistic Regression.} Logistic regression aims to learn a hyperplane separating $+1$-labeled values from $-1$-labeled values; again, labels can be corrupted, and the probability that a label is flipped drops with the distance from the hyperplane. Formally, $\Omega=\reals^d$, $\mathcal{H}$ is the set of maps $\{h(x) = \mathop{\mathrm{sign}}(\T{\beta}x) \text{ for some } \beta\in\reals^d\}$, and the $\varepsilon_i$ are independent conditioned on $\beta$ such that $$\varepsilon_i=\begin{cases} -2\cdot\id_{\T{\beta}x_i>0}, & \text{w.~prob.}\;\frac{1}{1+e^{\T{\beta}x_i}},\\ +2\cdot\id_{\T{\beta}x_i<0},&\text{w.~prob.}\;\frac{e^{\T{\beta}x_i}}{1+e^{\T{\beta}x_i}}.\end{cases}$$
\end{enumerate}
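The Bernoulli-noise model in the list above is simple to simulate: the response is $y_i = h(x_i) + \varepsilon_i$, where $\varepsilon_i = 0$ with probability $1-p$ and $\varepsilon_i = \bar{h}(x_i) - h(x_i)$ with probability $p$, i.e., the output is flipped. A minimal sketch, with a parity hypothesis chosen purely for illustration:

```python
import random

# Toy simulation of "learning a binary function with Bernoulli noise":
# y_i = h(x_i) + eps_i, where eps_i flips the output with probability p.
# The parity hypothesis h below is an illustrative choice, not from the text.
random.seed(1)
d, p = 4, 0.1

def h(x):
    return sum(x) % 2          # example hypothesis: parity of the bits of x

def measure(x):
    y = h(x)
    if random.random() < p:    # corruption event, probability p
        y = 1 - y              # adds \bar{h}(x_i) - h(x_i), i.e., flips y
    return y

xs = [[random.randrange(2) for _ in range(d)] for _ in range(1000)]
flips = sum(measure(x) != h(x) for x in xs)   # roughly p * 1000 corruptions
```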
-In this setup, assuming that the experimenter has a prior distribution on the
-hypothesis $h\in \mathcal{H}$, the information gain objective can be written
-again as the mutual information between $h$ and $y_S$.
+We can again define the information gain as an objective to maximize:
\begin{align}\label{general}
V(S) = \entropy(h) -\entropy(h\mid y_S),\quad S\subseteq\mathcal{N}
\end{align}
This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$.
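These properties can be checked numerically on a toy instance. The sketch below (an illustration, not part of the paper) takes a uniform prior over all binary hypotheses on three points with conditionally independent Bernoulli noise, computes $V(S) = H(h) + H(y_S) - H(h, y_S)$ exactly, and verifies monotonicity and submodularity by enumeration:

```python
import itertools
import math

# Exact computation of V(S) = H(h) - H(h | y_S) for a tiny hypothesis class
# with conditionally independent Bernoulli noise (flip probability p).
p = 0.2
points = range(3)                                 # three candidate experiments
hyps = list(itertools.product([0, 1], repeat=3))  # all h: points -> {0, 1}
prior = {h: 1 / len(hyps) for h in hyps}          # uniform prior on h

def entropy(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

def info_gain(S):
    """V(S) = H(h) + H(y_S) - H(h, y_S), from the exact joint distribution."""
    S = tuple(sorted(S))
    joint = {}                                    # P(h, y_S)
    for h, ph in prior.items():
        for ys in itertools.product([0, 1], repeat=len(S)):
            pr = ph
            for i, y in zip(S, ys):               # noise independent given h
                pr *= (1 - p) if y == h[i] else p
            joint[(h, ys)] = pr
    p_y = {}                                      # marginal P(y_S)
    for (h, ys), pr in joint.items():
        p_y[ys] = p_y.get(ys, 0.0) + pr
    return (entropy(prior.values()) + entropy(p_y.values())
            - entropy(joint.values()))

subsets = [set(c) for r in range(4) for c in itertools.combinations(points, r)]
monotone = all(info_gain(A) <= info_gain(A | {i}) + 1e-9
               for A in subsets for i in points)
submodular = all(info_gain(A | {i}) - info_gain(A)
                 >= info_gain(B | {i}) - info_gain(B) - 1e-9
                 for A in subsets for B in subsets
                 if A <= B for i in set(points) - B)
```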
-Though, in general, mutual information is not a submodular function, this
-specific setup leads to a submodular formulation.
-
-\begin{lemma}
+In general, the information gain is not a submodular function. However, when the errors $\varepsilon_i$ are independent conditioned on $h$, the following lemma holds:
+\begin{lemma}
The value function given by the information gain \eqref{general} is submodular.
\end{lemma}
\begin{proof}
-Using the chain rule for the conditional entropy we get:
+A more general statement for graphical models is shown in~\cite{krause2005near}; in short, using the chain rule for the conditional entropy, we get:
\begin{equation}\label{eq:chain-rule}
V(S) = H(y_S) - H(y_S \mid h)
= H(y_S) - \sum_{i\in S} H(y_i \mid h)
@@ -101,20 +100,18 @@ variables is a submodular function. Thus, the value function is written in
This lemma implies that learning an \emph{arbitrary hypothesis, under an
arbitrary prior} when noise is conditionally independent leads to a submodular
-value function. Hence, we can apply the previously known results to get
+value function. Hence, we can apply the previously known results by \citeN{singer-mechanisms} and \citeN{chen} to get
the following corollary:
\begin{corollary}
- For Bayesian experimental design with the objective given by the
- information gain \eqref{general}, there exists a randomized, budget
- feasible, individually rational, and universally truthful mechanism. This
- mechanism has a complexity $O\big(\text{poly}(n,d)\big)$ and an
- approximation ratio of $7.91$.
+ For Bayesian experimental design with an objective given by the
+ information gain \eqref{general}, there exists a randomized, polynomial-time, budget
+ feasible, individually rational, and universally truthful mechanism with
+ a $7.91$ approximation ratio, in expectation.
In cases where maximizing \eqref{general} can be done in polynomial time in
- the full-information setup, there exists a randomized, budget feasible,
+ the full-information setup, there exists a deterministic, polynomial-time, budget feasible,
individually rational, and truthful mechanism for Bayesian experimental
- design. This mechanism has a complexity $O\big(\text{poly}(n,d)\big)$ and
- an approximation ratio of $8.34$.
+ design with an $8.34$ approximation ratio.
\end{corollary}
Note however that, in many scenarios covered by this model (including the last