| field | value | date |
|---|---|---|
| author | Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)> | 2012-11-05 11:56:16 -0800 |
| committer | Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)> | 2012-11-05 11:56:16 -0800 |
| commit | af686855b5a6589411edd116619ef233349c53ad (patch) | |
| tree | 42af8efdca7a2fcb38fa39924f881cc181dae37e /general.tex | |
| parent | b3d93b68f5d032a4b438400edd26c3282c2ae5d1 (diff) | |
| parent | 9ad46372b87ff8e77b354169e1736bce4c0a538c (diff) | |
| download | recommendation-af686855b5a6589411edd116619ef233349c53ad.tar.gz | |
muthu section 5
Diffstat (limited to 'general.tex')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | general.tex | 53 |

1 file changed, 36 insertions, 17 deletions
diff --git a/general.tex b/general.tex
index c11cb99..957521a 100644
--- a/general.tex
+++ b/general.tex
@@ -5,11 +5,13 @@
 has a natural interpretation in this context, further motivating its selection as our objective.
 Moreover, we extend Theorem~\ref{thm:main} to a more general Bayesian setting.
-In the Bayesian setting, it is assumed that the experimenter has a prior distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior with zero mean and covariance $\sigma^2R\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance).
+In the Bayesian setting, it is assumed that the experimenter has a prior
+distribution on $\beta$: in particular, $\beta$ has a multivariate normal prior
+with zero mean and covariance $\sigma^2R^{-1}\in \reals^{d^2}$ (where $\sigma^2$ is the noise variance).
 The experimenter estimates $\beta$ through \emph{maximum a posteriori estimation}: \emph{i.e.}, finding the parameter which maximizes the posterior distribution of $\beta$ given the observations $y_S$.
 Under the linearity assumption \eqref{model} and the Gaussian prior on $\beta$, maximum a posteriori estimation leads to the following maximization \cite{hastie}:
 \begin{displaymath}
 \hat{\beta} = \argmin_{\beta\in\reals^d} \sum_i (y_i - \T{\beta}x_i)^2
- + \sum_i \norm{R\beta}_2^2
+ + \T{\beta}R\beta
 \end{displaymath}
 This optimization, commonly known as \emph{ridge regression}, includes an additional
 penalty term compared to the least squares estimation \eqref{leastsquares}.
@@ -47,7 +49,7 @@ Theorem~\ref{thm:main}:
 \begin{theorem}
 There exists a truthful, individually rational and budget feasible mechanism
 for the objective function $\tilde{V}$ given by \eqref{eq:normalized}. Furthermore, for any $\varepsilon
- > 0$, in time $O(\text{poly}(|\mathcal{N}|, d, \log\log \varepsilon^{-1}))$,
+ > 0$, in time $O(\text{poly}(n, d, \log\log \varepsilon^{-1}))$,
 the algorithm computes a set $S^*$ such that:
 \begin{displaymath}
 OPT \leq
@@ -93,11 +95,18 @@ For any $M>1$, there is no $M$-approximate, truthful, budget feasible, individua
 
 \subsection{Beyond Linear Models}
 
-Selecting experiments that maximize the information gain in the Bayesian setup leads to a natural generalization to other learning examples beyond linear regression. In particular, in the more general PAC learning setup \cite{...}, the features $x_i$, $i\in \mathcal{N}$ take values in some general set $\Omega$, called the \emph{query space}, and measurements $y_i\in\reals$ are given by
+Selecting experiments that maximize the information gain in the Bayesian setup
+leads to a natural generalization to other learning examples beyond linear
+regression. In particular, in the more general PAC learning setup \cite{valiant}, the features $x_i$, $i\in \mathcal{N}$ take values in some general set $\Omega$, called the \emph{query space}, and measurements $y_i\in\reals$ are given by
 \begin{equation}\label{eq:hypothesis-model}
 y_i = h(x_i) + \varepsilon_i
 \end{equation}
-where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings $h:\Omega\to\reals$, called the \emph{hypothesis space}, and $\varepsilon_i$ are random variables in $\reals$, not necessarily identically distributed, that are independent \emph{conditioned on $h$}. This model is quite broad, and captures many learning tasks, such as:
+where $h\in \mathcal{H}$ for some subset $\mathcal{H}$ of all possible mappings
+$h:\Omega\to\reals$, called the \emph{hypothesis space}, and $\varepsilon_i$
+are random variables in $\reals$, not necessarily identically distributed, that
+are independent \emph{conditioned on $h$}. This model is quite broad, and
+captures many learning tasks, such as support vector machines (SVM), linear
+discriminant analysis (LDA), and the following tasks:
 \begin{enumerate}
 \item\textbf{Generalized Linear Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of linear maps $\{h(x) = \T{\beta}x \text{ s.t. } \beta\in \reals^d\}$, and $\varepsilon_i$ are independent zero-mean normal variables, where $\expt{\varepsilon_i^2}=\sigma_i$.
 \item\textbf{Logistic Regression.} $\Omega=\reals^d$, $\mathcal{H}$ is the set of maps $\{h(x) = \frac{e^{\T{\beta} x}}{1+e^{\T{\beta} x}} \text{ s.t. } \beta\in\reals^d\}$, and $\varepsilon_i$ are independent conditioned on $h$ such that $$\varepsilon_i=\begin{cases} 1- h(x_i),& \text{w.~prob.}\;h(x_i)\\-h(x_i),&\text{w.~prob.}\;1-h(x_i)\end{cases}$$
@@ -113,27 +122,37 @@ This is a monotone set function, and it clearly satisfies $V(\emptyset)=0$. Thou
 The value function given by the information gain \eqref{general} is submodular.
 \end{lemma}
 
 \begin{proof}
-The lemma is proved in a slightly different context in \cite{krause2005near}; we
-repeat the proof here for the sake of completeness. Using the chain rule for
-the conditional entropy we get:
+Using the chain rule for the conditional entropy we get:
 \begin{equation}\label{eq:chain-rule}
 V(S) = H(y_S) - H(y_S \mid \beta) = H(y_S) - \sum_{i\in S} H(y_i \mid \beta)
 \end{equation}
 where the second equality comes from the independence of the $y_i$'s
 conditioned on $\beta$. Recall that the joint entropy of a set of random
-variables is a submodular function. Thus, our value function is written in
-\eqref{eq:chain-rule} as the sum of a submodular function and a modular function.
+variables is a submodular function. Thus, the value function is written in
+\eqref{eq:chain-rule} as the sum of a submodular function and a modular function.
 \end{proof}
 
-This lemma that implies that learning an \emph{arbitrary hypothesis, under an
+This lemma implies that learning an \emph{arbitrary hypothesis, under an
 arbitrary prior} when noise is conditionally independent leads to a submodular
-value function. Hence the universally truthful 7.91-approximate mechanism by
-Chen \emph{et al.} \cite{chen} immediately applies in the value query model.
-Moreover, in cases where maximizing \eqref{general} can be done in polynomial
-time in the full-information setup, the truthful $8.34$-appoximate mechanism of
-Chen \emph{et al.}~applies again. However that, in many scenarios covered by
-this model (inlcuding the last two examples above), even computing the entropy
+value function. Hence, we can apply the results from Chen \emph{et al.} to get
+the following corollary:
+\begin{corollary}
+    For Bayesian experimental design with the objective given by the
+    information gain \eqref{general}, there exists a randomized, budget
+    feasible, individually rational, and universally truthful mechanism. This
+    mechanism has a complexity $O\big(\text{poly}(n,d)\big)$ and an
+    approximation ratio of $7.91$.
+
+    In cases where maximizing \eqref{general} can be done in polynomial time in
+    the full-information setup, there exists a randomized, budget feasible,
+    individually rational, and truthful mechanism for Bayesian experimental
+    design. This mechanism has a complexity $O\big(\text{poly}(n,d)\big)$ and
+    an approximation ratio of $8.34$.
+\end{corollary}
+
+Note however that, in many scenarios covered by
+this model (including the last two examples above), even computing the entropy
 under a given set might be a hard task---\emph{i.e.}, the value query model may
 not apply. Hence, identifying learning tasks in the above class for which
 truthful or universally truthful constant approximation mechanisms exist, or
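A note on the first hunk: its two edits are two halves of the same fix. Under the corrected prior covariance $\sigma^2R^{-1}$, the log posterior produces the single quadratic penalty $\T{\beta}R\beta$, not a per-observation sum. A minimal sketch of the Gaussian algebra, using the paper's macros; the design matrix $X_S$ (rows $\T{x_i}$, $i\in S$) is our notation, not the patch's:

```latex
% Prior: beta ~ N(0, sigma^2 R^{-1}); model: y_i = beta^T x_i + eps_i with
% eps_i ~ N(0, sigma^2). Up to constants that do not depend on beta:
\begin{displaymath}
\log p(\beta \mid y_S)
  = -\frac{1}{2\sigma^2} \Big( \sum_{i\in S} (y_i - \T{\beta}x_i)^2
    + \T{\beta}R\beta \Big) + \mathrm{const}
\end{displaymath}
% Maximizing this is exactly the patched ridge objective, with closed form
\begin{displaymath}
\hat{\beta} = \big(\T{X_S}X_S + R\big)^{-1} \T{X_S} y_S
\end{displaymath}
```

Since $\T{\beta}R\beta = \norm{R^{1/2}\beta}_2^2$ is one quadratic form, the removed per-observation penalty $\sum_i \norm{R\beta}_2^2$ could not have come from this prior, which is presumably why both lines changed together.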

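Relatedly, for the Gaussian linear model the information gain \eqref{general} has the standard closed form $V(S)=\tfrac{1}{2}\log\det\big(I + R^{-1}\T{X_S}X_S\big)$ (the noise variance cancels under the prior covariance $\sigma^2R^{-1}$). Below is a small Python sketch of that objective together with a cost-benefit greedy, the full-information primitive that mechanisms in the Chen et al. line build on. This is illustrative only: it is not the paper's mechanism, it ignores payments and truthfulness entirely, and the names (`info_gain`, `greedy_select`, `costs`, `budget`) are ours.

```python
# Illustrative sketch only -- not the mechanism from the paper or from Chen
# et al. It shows the submodular objective and the greedy primitive that
# budget-feasible mechanisms of this kind build on.
import numpy as np

def info_gain(X, R, S):
    """V(S) = 0.5 * log det(I + R^{-1} X_S^T X_S): information gain of the
    Gaussian linear model with prior covariance proportional to R^{-1}."""
    d = X.shape[1]
    X_S = X[sorted(S)]
    M = np.eye(d) + np.linalg.solve(R, X_S.T @ X_S)
    sign, logdet = np.linalg.slogdet(M)
    return 0.5 * logdet

def greedy_select(X, R, costs, budget):
    """Cost-benefit greedy: repeatedly add the affordable experiment with the
    largest marginal gain per unit cost. Submodularity of V (the lemma in the
    last hunk) is what makes this primitive amenable to approximation
    guarantees."""
    selected, spent = set(), 0.0
    while True:
        base = info_gain(X, R, selected)
        best, best_ratio = None, 0.0
        for i in range(X.shape[0]):
            if i in selected or spent + costs[i] > budget:
                continue
            ratio = (info_gain(X, R, selected | {i}) - base) / costs[i]
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            return selected
        selected.add(best)
        spent += costs[best]

# Toy usage: 20 candidate experiments in R^3, identity prior, unit-ish costs.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
costs = rng.uniform(0.5, 1.5, size=20)
print(greedy_select(X, np.eye(3), costs, budget=3.0))
```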