| author | Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)> | 2012-10-28 15:16:26 -0700 |
|---|---|---|
| committer | Stratis Ioannidis <stratis@stratis-Latitude-E6320.(none)> | 2012-10-28 15:16:26 -0700 |
| commit | 3fdd3c6674acfecffa3ec734f0183f1df6a239c1 (patch) | |
| tree | b595e3a821e6e2ae679dfd725f106c89964cc752 | |
| parent | cc28557b334d2bc53e9c8243409550ea1c9f368f (diff) | |
| download | recommendation-3fdd3c6674acfecffa3ec734f0183f1df6a239c1.tar.gz | |
file splitting
| -rw-r--r-- | conclusion.tex | 0 |
| -rw-r--r-- | definitions.tex | 16 |
| -rw-r--r-- | general.tex | 0 |
| -rw-r--r-- | intro.tex | 17 |
| -rw-r--r-- | main.tex | 550 |
| -rw-r--r-- | paper.tex | 780 |
| -rw-r--r-- | problem.tex | 193 |
7 files changed, 782 insertions, 774 deletions
diff --git a/conclusion.tex b/conclusion.tex new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/conclusion.tex diff --git a/definitions.tex b/definitions.tex new file mode 100644 index 0000000..ca6b3ba --- /dev/null +++ b/definitions.tex @@ -0,0 +1,16 @@ +\newtheorem{lemma}{Lemma} +\newtheorem{fact}{Fact} +\newtheorem{example}{Example} +\newtheorem{prop}{Proposition} +\newtheorem{theorem}{Theorem} +\newcommand*{\defeq}{\stackrel{\text{def}}{=}} +\newcommand{\var}{\mathop{\mathrm{Var}}} +\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]} +\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]} +\newcommand{\norm}[1]{\lVert#1\rVert} +\newcommand{\tr}[1]{#1^*} +\newcommand{\ip}[2]{\langle #1, #2 \rangle} +\newcommand{\mse}{\mathop{\mathrm{MSE}}} +\DeclareMathOperator{\trace}{tr} +\DeclareMathOperator*{\argmax}{arg\,max} + diff --git a/general.tex b/general.tex new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/general.tex diff --git a/intro.tex b/intro.tex new file mode 100644 index 0000000..649e7de --- /dev/null +++ b/intro.tex @@ -0,0 +1,17 @@ + +\begin{itemize} + \item already existing field of experiment design: survey-like setup, what + are the best points to include in your experiment? Measure of the + usefulness of the data: variance-reduction or entropy-reduction. + \item nowadays, there is also a big focus on purchasing data: paid surveys, + mechanical turk, etc. that add economic aspects to the problem of + experiment design + \item recent advances (Singer, Chen) in the field of budgeted mechanisms + \item we study ridge regression, very widely used in statistical learning, + and treat it as a problem of budgeted experiment design + \item we make the following contributions: ... + \item extension to a more general setup which includes a wider class of + machine learning problems +\end{itemize} + + diff --git a/main.tex b/main.tex new file mode 100644 index 0000000..189156c --- /dev/null +++ b/main.tex @@ -0,0 +1,550 @@ + +All the budget feasible mechanisms studied recently (TODO ref Singer Chen) +rely on using a greedy heuristic extending the idea of the greedy heuristic for +the knapsack problem. In knapsack, objects are selected based on their +\emph{value-per-cost} ratio. The quantity which plays a similar role for +general submodular functions is the \emph{marginal-contribution-per-cost} +ratio: let us assume that you have already selected a set of points $S$, then +the \emph{marginal-contribution-per-cost} ratio per cost of a new point $i$ is +defined by: +\begin{displaymath} + \frac{V(S\cup\{i\}) - V(S)}{c_i} +\end{displaymath} + +The greedy heuristic then simply repeatedly selects the point whose +marginal-contribution-per-cost ratio is the highest until it reaches the budget +limit. Mechanism considerations aside, this is known to have an unbounded +approximation ratio. However, lemma TODO ref in Chen or Singer shows that the +maximum between the greedy heuristic and the point with maximum value (as +a singleton set) provides a $\frac{5e}{e-1}$ approximation ratio. + +Unfortunately, TODO Singer 2011 points out that taking the maximum between the +greedy heuristic and the most valuable point is not incentive compatible. +Singer and Chen tackle this issue similarly: instead of comparing the most +valuable point to the greedy solution, they compare it to a solution which is +close enough to keep a constant approximation ratio: +\begin{itemize} + \item Chen suggests using $OPT(V,\mathcal{N}\setminus\{i\}, B)$. 
+ Unfortunately, in the general case, this cannot be computed exactly in + polynomial time. + \item Singer uses using the optimal value of a relaxed objective function + which can be proven to be close to the optimal of the original + objective function. The function used is tailored to the specific + problem of coverage. +\end{itemize} + +Here, we use a relaxation of the objective function which is tailored to the +problem of ridge regression. We define: +\begin{displaymath} + \forall\lambda\in[0,1]^{|\mathcal{N}|}\,\quad L_{\mathcal{N}}(\lambda) \defeq + \log\det\left(I_d + \mu\sum_{i\in\mathcal{N}} \lambda_i x_i x_i^*\right) +\end{displaymath} + +We can now present the mechanism we use, which has a same flavor to Chen's and +Singer's + +\begin{algorithm}\label{mechanism} + \caption{Mechanism for ridge regression} + \begin{algorithmic}[1] + \State $i^* \gets \argmax_{j\in\mathcal{N}}V(j)$ + \State $x^* \gets \argmax_{x\in[0,1]^n} \{L_{\mathcal{N}\setminus\{i^*\}}(x) + \,|\, c(x)\leq B\}$ + \Statex + \If{$L(x^*) < CV(i^*)$} + \State \textbf{return} $\{i^*\}$ + \Else + \State $i \gets \argmax_{1\leq j\leq n}\frac{V(j)}{c_j}$ + \State $S \gets \emptyset$ + \While{$c_i\leq \frac{B}{2}\frac{V(S\cup\{i\})-V(S)}{V(S\cup\{i\})}$} + \State $S \gets S\cup\{i\}$ + \State $i \gets \argmax_{j\in\mathcal{N}\setminus S} + \frac{V(S\cup\{j\})-V(S)}{c_j}$ + \EndWhile + \State \textbf{return} $S$ + \EndIf + \end{algorithmic} +\end{algorithm} + +Notice, that the stopping condition in the while loop is more sophisticated +than just ensuring that the sum of the costs does not exceed the budget. This +is because the selected users will be payed more than their costs, and this +stopping condition ensures budget feasibility when the users are paid their +threshold payment. + +We can now state the main result of this section: +\begin{theorem} + The mechanism in \ref{mechanism} is truthful, individually rational, budget + feasible. Furthermore, choosing: + \begin{multline*} + C = C^* = \frac{5e-1 + C_\mu(2e+1)}{2C_\mu(e-1)}\\ + + \frac{\sqrt{C_\mu^2(1+2e)^2 + + 2C_\mu(14e^2+5e+1) + (1-5e)^2}}{2C_\mu(e-1)} + \end{multline*} + we get an approximation ratio of: + \begin{multline*} + 1 + C^* = \frac{5e-1 + C_\mu(4e-1)}{2C_\mu(e-1)}\\ + + \frac{\sqrt{C_\mu^2(1+2e)^2 + + 2C_\mu(14e^2+5e+1) + (1-5e)^2}}{2C_\mu(e-1)} + \end{multline*} + where: + \begin{displaymath} + C_\mu = \frac{\log(1+\mu)}{2\mu} + \end{displaymath} +\end{theorem} + +The proof will consist of the claims of the theorem broken down into lemmas. + +Because the user strategy is parametrized by a single parameter, truthfulness +is equivalent to monotonicity (TODO ref). The proof here is the same as in TODO +and is given for the sake of completeness. + +\begin{lemma} +The mechanism is monotone. +\end{lemma} + +\begin{proof} + We assume by contradiction that there exists a user $i$ that has been + selected by the mechanism and that would not be selected had he reported + a cost $c_i'\leq c_i$ (all the other costs staying the same). + + If $i\neq i^*$ and $i$ has been selected, then we are in the case where + $L(x^*) \geq C V(i^*)$ and $i$ was included in the result set by the greedy + part of the mechanism. By reporting a cost $c_i'\leq c_i$, using the + submodularity of $V$, we see that $i$ will satisfy the greedy selection + rule: + \begin{displaymath} + i = \argmax_{j\in\mathcal{N}\setminus S} \frac{V(S\cup\{j\}) + - V(S)}{c_j} + \end{displaymath} + in an earlier iteration of the greedy heuristic. Let us denote by $S_i$ + (resp. 
$S_i'$) the set to which $i$ is added when reporting cost $c_i$ + (resp. $c_i'$). We have $S_i'\subset S_i$. Moreover: + \begin{align*} + c_i' & \leq c_i \leq + \frac{B}{2}\frac{V(S_i\cup\{i\})-V(S_i)}{V(S_i\cup\{i\})}\\ + & \leq \frac{B}{2}\frac{V(S_i'\cup\{i\})-V(S_i')}{V(S_i'\cup\{i\})} + \end{align*} + Hence $i$ will still be included in the result set. + + If $i = i^*$, $i$ is included iff $L(x^*) \leq C V(i^*)$. Reporting $c_i'$ + instead of $c_i$ does not change the value $V(i^*)$ nor $L(x^*)$ (which is + computed over $\mathcal{N}\setminus\{i^*\}$). Thus $i$ is still included by + reporting a different cost. +\end{proof} + +\begin{lemma} +The mechanism is budget feasible. +\end{lemma} + +The proof is the same as in Chen and is given here for the sake of +completeness. +\begin{proof} + +\end{proof} + +Using the characterization of the threshold payments from Singer TODO, we can +prove individual rationality similarly to Chen TODO. +\begin{lemma} +The mechanism is individually rational +\end{lemma} + +\begin{proof} + +\end{proof} + +The following lemma requires to have a careful look at the relaxation function +we chose in the mechanism. The next section will be dedicated to studying this +relaxation and will contain the proof of the following lemma: +\begin{lemma} + We have: + \begin{displaymath} + OPT(L_\mathcal{N}, B) \leq \frac{1}{C_\mu}\big(2 OPT(V,\mathcal{N},B) + + \max_{i\in\mathcal{N}}V(i)\big) + \end{displaymath} +\end{lemma} + +\begin{lemma} + Let us denote by $S_M$ the set returned by the mechanism. Let us also + write: + \begin{displaymath} + C_{\textrm{max}} = \max\left(1+C,\frac{e}{e-1}\left( 3 + \frac{12e}{C\cdot + C_\mu(e-1) -5e +1}\right)\right) + \end{displaymath} + + Then: + \begin{displaymath} + OPT(V, \mathcal{N}, B) \leq + C_\text{max}\cdot V(S_M) + \end{displaymath} +\end{lemma} + +\begin{proof} + + If the condition on line 3 of the algorithm holds, then: + \begin{displaymath} + V(i^*) \geq \frac{1}{C}L(x^*) \geq + \frac{1}{C}OPT(V,\mathcal{N}\setminus\{i\}, B) + \end{displaymath} + + But: + \begin{displaymath} + OPT(V,\mathcal{N},B) \leq OPT(V,\mathcal{N}\setminus\{i\}, B) + V(i^*) + \end{displaymath} + + Hence: + \begin{displaymath} + V(i^*) \geq \frac{1}{C+1} OPT(V,\mathcal{N}, B) + \end{displaymath} + + If the condition of the algorithm does not hold: + \begin{align*} + V(i^*) & \leq \frac{1}{C}L(x^*) \leq \frac{1}{C\cdot C_\mu} + \big(2 OPT(V,\mathcal{N}, B) + V(i^*)\big)\\ + & \leq \frac{1}{C\cdot C_\mu}\left(\frac{2e}{e-1}\big(3 V(S_M) + + 2 V(i^*)\big) + + V(i^*)\right) + \end{align*} + + Thus: + \begin{align*} + V(i^*) \leq \frac{6e}{C\cdot C_\mu(e-1)- 5e + 1} V(S_M) + \end{align*} + + Finally, using again that: + \begin{displaymath} + OPT(V,\mathcal{N},B) \leq \frac{e}{e-1}\big(3 V(S_M) + 2 V(i^*)\big) + \end{displaymath} + + We get: + \begin{displaymath} + OPT(V, \mathcal{N}, B) \leq \frac{e}{e-1}\left( 3 + \frac{12e}{C\cdot + C_\mu(e-1) -5e +1}\right) V(S_M) + \end{displaymath} +\end{proof} + +The optimal value for $C$ is: +\begin{displaymath} + C^* = \arg\min C_{\textrm{max}} +\end{displaymath} + +This equation has two solutions. Only one of those is such that: +\begin{displaymath} + C\cdot C_\mu(e-1) -5e +1 \geq 0 +\end{displaymath} +which is needed in the proof of the previous lemma. Computing this solution, +we can state the main result of this section. + +\subsection{Relaxations of the value function} + +To prove lemma TODO, we will use a general method called pipage rounding, +introduced in TODO. 
Two consecutive relaxations are used: the one that we are +interested in, whose optimization can be computed efficiently, and the +multilinear extension which presents a \emph{cross-convexity} like behavior +which allows for rounding of fractional solution without decreasing the value +of the objective function and thus ensures a constant approximation of the +value function. The difficulty resides in showing that the ratio of the two +relaxations is bounded. + +We say that $R_\mathcal{N}:[0,1]^n\rightarrow\mathbf{R}$ is a relaxation of the +value function $V$ over $\mathcal{N}$ if it coincides with $V$ at binary +points. Formally, for any $S\subset\mathcal{N}$, let $\mathbf{1}_S$ denote the +indicator vector of $S$. $R_\mathcal{N}$ is a relaxation of $V$ over +$\mathcal{N}$ iff: +\begin{displaymath} + \forall S\subset\mathcal{N},\; R_\mathcal{N}(\mathbf{1}_S) = V(S) +\end{displaymath} + +We can extend the optimisation problem defined above to a relaxation by +extending the cost function: +\begin{displaymath} + \forall \lambda\in[0,1]^n,\; c(\lambda) + = \sum_{i\in\mathcal{N}}\lambda_ic_i +\end{displaymath} +The optimisation problem becomes: +\begin{displaymath} + OPT(R_\mathcal{N}, B) = + \max_{\lambda\in[0,1]^n}\left\{R_\mathcal{N}(\lambda)\,|\, c(\lambda)\leq B\right\} +\end{displaymath} +The relaxations we will consider here rely on defining a probability +distribution over subsets of $\mathcal{N}$. + +Let $\lambda\in[0,1]^n$, let us define: +\begin{displaymath} + P_\mathcal{N}^\lambda(S) = \prod_{i\in S}\lambda_i + \prod_{i\in\mathcal{N}\setminus S}(1-\lambda_i) +\end{displaymath} +$P_\mathcal{N}^\lambda(S)$ is the probability of picking the set $S$ if we select +a subset of $\mathcal{N}$ at random by deciding independently for each point to +include it in the set with probability $\lambda_i$ (and to exclude it with +probability $1-\lambda_i$). + +We will consider two relaxations of the value function $V$ over $\mathcal{N}$: +\begin{itemize} + \item the \emph{multi-linear extension} of $V$: + \begin{align*} + F_\mathcal{N}(\lambda) + & = \mathbb{E}_{S\sim P_\mathcal{N}^\lambda}\big[\log\det A(S)\big]\\ + & = \sum_{S\subset\mathcal{N}} P_\mathcal{N}^\lambda(S) V(S)\\ + & = \sum_{S\subset\mathcal{N}} P_\mathcal{N}^\lambda(S) \log\det A(S)\\ + \end{align*} + \item the \emph{concave relaxation} of $V$: + \begin{align*} + L_{\mathcal{N}}(\lambda) + & = \log\det \mathbb{E}_{S\sim P_\mathcal{N}^\lambda}\big[A(S)\big]\\ + & = \log\det\left(\sum_{S\subset N} + P_\mathcal{N}^\lambda(S)A(S)\right)\\ + & = \log\det\left(I_d + \mu\sum_{i\in\mathcal{N}} + \lambda_ix_ix_i^*\right)\\ + & \defeq \log\det \tilde{A}(\lambda) + \end{align*} +\end{itemize} + +\begin{lemma} + The \emph{concave relaxation} $L_\mathcal{N}$ is concave\footnote{Hence + this relaxation is well-named!}. +\end{lemma} + +\begin{proof} + This follows from the concavity of the $\log\det$ function over symmetric + positive semi-definite matrices. 
More precisely, if $A$ and $B$ are two
+    symmetric positive semi-definite matrices, then:
+    \begin{multline*}
+        \forall\alpha\in [0, 1],\; \log\det\big(\alpha A + (1-\alpha) B\big)\\
+        \geq \alpha\log\det A + (1-\alpha)\log\det B
+    \end{multline*}
+\end{proof}
+
+\begin{lemma}[Rounding]\label{lemma:rounding}
+    For any feasible $\lambda\in[0,1]^n$, there exists a feasible
+    $\bar{\lambda}\in[0,1]^n$ such that at most one of its components is
+    fractional, that is, lies in $(0,1)$, and:
+    \begin{displaymath}
+        F_{\mathcal{N}}(\lambda)\leq F_{\mathcal{N}}(\bar{\lambda})
+    \end{displaymath}
+\end{lemma}
+
+\begin{proof}
+    We give a rounding procedure which, given a feasible $\lambda$ with at
+    least two fractional components, returns a feasible $\lambda'$ with one
+    fewer fractional component such that:
+    \begin{displaymath}
+        F_\mathcal{N}(\lambda) \leq F_\mathcal{N}(\lambda')
+    \end{displaymath}
+    Applying this procedure recursively yields the lemma's result.
+
+    Let us consider such a feasible $\lambda$. Let $i$ and $j$ be two
+    fractional components of $\lambda$ and let us define the following
+    function:
+    \begin{displaymath}
+        F_\lambda(\varepsilon) = F(\lambda_\varepsilon)
+        \quad\textrm{where}\quad
+        \lambda_\varepsilon = \lambda + \varepsilon\left(e_i-\frac{c_i}{c_j}e_j\right)
+    \end{displaymath}
+
+    It is easy to see that if $\lambda$ is feasible, then:
+    \begin{multline}\label{eq:convex-interval}
+        \forall\varepsilon\in\Big[\max\Big(-\lambda_i,(\lambda_j-1)\frac{c_j}{c_i}\Big),
+        \min\Big(1-\lambda_i, \lambda_j\frac{c_j}{c_i}\Big)\Big],\;\\
+        \lambda_\varepsilon\;\;\textrm{is feasible}
+    \end{multline}
+
+    Furthermore, the function $F_\lambda$ is convex; indeed:
+    \begin{align*}
+        F_\lambda(\varepsilon)
+        & = \mathbb{E}_{S'\sim P_{\mathcal{N}\setminus\{i,j\}}^\lambda}\Big[
+        (\lambda_i+\varepsilon)\Big(\lambda_j-\varepsilon\frac{c_i}{c_j}\Big)V(S'\cup\{i,j\})\\
+        & + (\lambda_i+\varepsilon)\Big(1-\lambda_j+\varepsilon\frac{c_i}{c_j}\Big)V(S'\cup\{i\})\\
+        & + (1-\lambda_i-\varepsilon)\Big(\lambda_j-\varepsilon\frac{c_i}{c_j}\Big)V(S'\cup\{j\})\\
+        & + (1-\lambda_i-\varepsilon)\Big(1-\lambda_j+\varepsilon\frac{c_i}{c_j}\Big)V(S')\Big]
+    \end{align*}
+    Thus, $F_\lambda$ is a degree-2 polynomial whose leading coefficient is:
+    \begin{multline*}
+        \frac{c_i}{c_j}\mathbb{E}_{S'\sim
+        P_{\mathcal{N}\setminus\{i,j\}}^\lambda}\Big[
+        V(S'\cup\{i\})+V(S'\cup\{j\})\\
+        -V(S'\cup\{i,j\})-V(S')\Big]
+    \end{multline*}
+    which is nonnegative by submodularity of $V$. Hence, the maximum of
+    $F_\lambda$ over the interval given in \eqref{eq:convex-interval} is
+    attained at one of its endpoints, at which either the $i$-th or the
+    $j$-th component of $\lambda_\varepsilon$ becomes integral.
+\end{proof}
+
+\begin{lemma}\label{lemma:relaxation-ratio}
+    The following inequality holds:
+    \begin{displaymath}
+        \forall\lambda\in[0,1]^n,\;
+        \frac{\log\big(1+\mu\big)}{2\mu}
+        \,L_\mathcal{N}(\lambda)\leq
+        F_\mathcal{N}(\lambda)\leq L_{\mathcal{N}}(\lambda)
+    \end{displaymath}
+\end{lemma}
+
+\begin{proof}
+    We will prove that:
+    \begin{displaymath}
+        \frac{\log\big(1+\mu\big)}{2\mu}
+    \end{displaymath}
+    is a lower bound of the ratio $\partial_i F_\mathcal{N}(\lambda)/\partial_i
+    L_\mathcal{N}(\lambda)$.
+ + This will be enough to conclude, by observing that: + \begin{displaymath} + \frac{F_\mathcal{N}(\lambda)}{L_\mathcal{N}(\lambda)} + \sim_{\lambda\rightarrow 0} + \frac{\sum_{i\in \mathcal{N}}\lambda_i\partial_i F_\mathcal{N}(0)} + {\sum_{i\in\mathcal{N}}\lambda_i\partial_i L_\mathcal{N}(0)} + \end{displaymath} + and that an interior critical point of the ratio + $F_\mathcal{N}(\lambda)/L_\mathcal{N}(\lambda)$ is defined by: + \begin{displaymath} + \frac{F_\mathcal{N}(\lambda)}{L_\mathcal{N}(\lambda)} + = \frac{\partial_i F_\mathcal{N}(\lambda)}{\partial_i + L_\mathcal{N}(\lambda)} + \end{displaymath} + + Let us start by computing the derivatives of $F_\mathcal{N}$ and + $L_\mathcal{N}$ with respect to + the $i$-th component. + + For $F$, it suffices to look at the derivative of + $P_\mathcal{N}^\lambda(S)$: + \begin{displaymath} + \partial_i P_\mathcal{N}^\lambda(S) = \left\{ + \begin{aligned} + & P_{\mathcal{N}\setminus\{i\}}^\lambda(S\setminus\{i\})\;\textrm{if}\; i\in S \\ + & - P_{\mathcal{N}\setminus\{i\}}^\lambda(S)\;\textrm{if}\; + i\in \mathcal{N}\setminus S \\ + \end{aligned}\right. + \end{displaymath} + + Hence: + \begin{multline*} + \partial_i F_\mathcal{N} = + \sum_{\substack{S\subset\mathcal{N}\\ i\in S}} + P_{\mathcal{N}\setminus\{i\}}^\lambda(S\setminus\{i\})V(S)\\ + - \sum_{\substack{S\subset\mathcal{N}\\ i\in \mathcal{N}\setminus S}} + P_{\mathcal{N}\setminus\{i\}}^\lambda(S)V(S)\\ + \end{multline*} + + Now, using that every $S$ such that $i\in S$ can be uniquely written as + $S'\cup\{i\}$, we can write: + \begin{multline*} + \partial_i F_\mathcal{N} = + \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} + P_{\mathcal{N}\setminus\{i\}}^\lambda(S)V(S\cup\{i\})\\ + - \sum_{\substack{S\subset\mathcal{N}\\ i\in \mathcal{N}\setminus S}} + P_{\mathcal{N}\setminus\{i\}}^\lambda(S)V(S)\\ + \end{multline*} + + Finally, by using the expression for the marginal contribution of $i$ to + $S$: + \begin{displaymath} + \partial_i F_\mathcal{N}(\lambda) = + \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} + P_{\mathcal{N}\setminus\{i\}}^\lambda(S) + \log\Big(1 + \mu x_i^*A(S)^{-1}x_i\Big) + \end{displaymath} + + The computation of the derivative of $L_\mathcal{N}$ uses standard matrix + calculus and gives: + \begin{displaymath} + \partial_i L_\mathcal{N}(\lambda) + = \mu x_i^* \tilde{A}(\lambda)^{-1}x_i + \end{displaymath} + + Using the following inequalities: + \begin{gather*} + \forall S\subset\mathcal{N}\setminus\{i\},\quad + P_{\mathcal{N}\setminus\{i\}}^\lambda(S)\geq + P_{\mathcal{N}\setminus\{i\}}^\lambda(S\cup\{i\})\\ + \forall S\subset\mathcal{N},\quad P_{\mathcal{N}\setminus\{i\}}^\lambda(S) + \geq P_\mathcal{N}^\lambda(S)\\ + \forall S\subset\mathcal{N},\quad A(S)^{-1} \geq A(S\cup\{i\})^{-1}\\ + \end{gather*} + we get: + \begin{align*} + \partial_i F_\mathcal{N}(\lambda) + & \geq \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} + P_\mathcal{N}^\lambda(S) + \log\Big(1 + \mu x_i^*A(S)^{-1}x_i\Big)\\ + & \geq \frac{1}{2} + \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} + P_\mathcal{N}^\lambda(S) + \log\Big(1 + \mu x_i^*A(S)^{-1}x_i\Big)\\ + &\hspace{-3.5em}+\frac{1}{2} + \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} + P_\mathcal{N}^\lambda(S\cup\{i\}) + \log\Big(1 + \mu x_i^*A(S\cup\{i\})^{-1}x_i\Big)\\ + &\geq \frac{1}{2} + \sum_{S\subset\mathcal{N}} + P_\mathcal{N}^\lambda(S) + \log\Big(1 + \mu x_i^*A(S)^{-1}x_i\Big)\\ + \end{align*} + + Using that $A(S)\geq I_d$ we get that: + 
\begin{displaymath} + \mu x_i^*A(S)^{-1}x_i \leq \mu + \end{displaymath} + + Moreover: + \begin{displaymath} + \forall x\leq\mu,\; \log(1+x)\geq + \frac{\log\big(1+\mu\big)}{\mu} x + \end{displaymath} + + Hence: + \begin{displaymath} + \partial_i F_\mathcal{N}(\lambda) \geq + \frac{\log\big(1+\mu\big)}{2\mu} + x_i^*\bigg(\sum_{S\subset\mathcal{N}}P_\mathcal{N}^\lambda(S)A(S)^{-1}\bigg)x_i + \end{displaymath} + + Finally, using that the inverse is a matrix convex function over symmetric + positive definite matrices: + \begin{align*} + \partial_i F_\mathcal{N}(\lambda) &\geq + \frac{\log\big(1+\mu\big)}{2\mu} + x_i^*\bigg(\sum_{S\subset\mathcal{N}}P_\mathcal{N}^\lambda(S)A(S)\bigg)^{-1}x_i\\ + & \geq \frac{\log\big(1+\mu\big)}{2\mu} + \partial_i L_\mathcal{N}(\lambda) + \end{align*} +\end{proof} + +We can now prove lemma TODO from previous section. + +\begin{proof} + Let us consider a feasible point $\lambda^*\in[0,1]^n$ such that $L_\mathcal{N}(\lambda^*) + = OPT(L_\mathcal{N}, B)$. By applying lemma~\ref{lemma:relaxation-ratio} + and lemma~\ref{lemma:rounding} we get a feasible point $\bar{\lambda}$ with at most + one fractional component such that: + \begin{equation}\label{eq:e1} + L_\mathcal{N}(\lambda^*) \leq \frac{1}{C_\mu} + F_\mathcal{N}(\bar{\lambda}) + \end{equation} + + Let $\lambda_i$ denote the fractional component of $\bar{\lambda}$ and $S$ + denote the set whose indicator vector is $\bar{\lambda} - \lambda_i e_i$. + Using the fact that $F_\mathcal{N}$ is linear with respect to the $i$-th + component and is a relaxation of the value function, we get: + \begin{displaymath} + F_\mathcal{N}(\bar{\lambda}) = V(S) +\lambda_i V(S\cup\{i\}) + \end{displaymath} + + Using the submodularity of $V$: + \begin{displaymath} + F_\mathcal{N}(\bar{\lambda}) \leq 2 V(S) + V(i) + \end{displaymath} + + Note that since $\bar{\lambda}$ is feasible, $S$ is also feasible and + $V(S)\leq OPT(V,\mathcal{N}, B)$. Hence: + \begin{equation}\label{eq:e2} + F_\mathcal{N}(\bar{\lambda}) \leq 2 OPT(V,\mathcal{N}, B) + + \max_{i\in\mathcal{N}} V(i) + \end{equation} + + Putting \eqref{eq:e1} and \eqref{eq:e2} together gives the results. +\end{proof} + + @@ -3,789 +3,21 @@ \usepackage{amsmath,amsfonts} \usepackage{algorithm} \usepackage{algpseudocode} -\newtheorem{lemma}{Lemma} -\newtheorem{fact}{Fact} -\newtheorem{example}{Example} -\newtheorem{prop}{Proposition} -\newtheorem{theorem}{Theorem} -\newcommand*{\defeq}{\stackrel{\text{def}}{=}} -\newcommand{\var}{\mathop{\mathrm{Var}}} -\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]} -\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]} -\newcommand{\norm}[1]{\lVert#1\rVert} -\newcommand{\tr}[1]{#1^*} -\newcommand{\ip}[2]{\langle #1, #2 \rangle} -\newcommand{\mse}{\mathop{\mathrm{MSE}}} -\DeclareMathOperator{\trace}{tr} -\DeclareMathOperator*{\argmax}{arg\,max} +\input{definitions} \title{Budgeted Auctions for Experiment Design} \begin{document} \maketitle \section{Intro} - -\begin{itemize} - \item already existing field of experiment design: survey-like setup, what - are the best points to include in your experiment? Measure of the - usefulness of the data: variance-reduction or entropy-reduction. - \item nowadays, there is also a big focus on purchasing data: paid surveys, - mechanical turk, etc. 
that add economic aspects to the problem of - experiment design - \item recent advances (Singer, Chen) in the field of budgeted mechanisms - \item we study ridge regression, very widely used in statistical learning, - and treat it as a problem of budgeted experiment design - \item we make the following contributions: ... - \item extension to a more general setup which includes a wider class of - machine learning problems -\end{itemize} +\input{intro} \section{Problem formulation} - -\subsection{Notations} - -Throughout the paper, we will make use of the following notations: if $x$ is -a (column) vector in $\mathbf{R}^d$, $x^*$ denotes its transposed (line) -vector. Thus, the standard inner product between two vectors $x$ and $y$ is -simply $x^* y$. $\norm{x}_2 = x^*x$ will denote the $L_2$ norm of $x$. - -We will also often use the following order over symmetric matrices: if $A$ and -$B$ are two $d\times d$ and $B$ are two $d\times d$ real symmetric matrices, we -write that $A\leq B$ iff: -\begin{displaymath} - \forall x\in\mathbf{R}^d,\quad - x^*Ax \leq x^*Bx -\end{displaymath} -That is, iff $B-A$ is symmetric semi-definite positive. - -This order let us define the notion of a \emph{decreasing} or \emph{convex} -matrix function similarly to their real counterparts. In particular, let us -recall that the matrix inversion is decreasing and convex over symmetric -definite positive matrices. - -\subsection{Data model} - -There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. Each user -$i\in\mathcal{N}$ has a public vector of features $x_i\in\mathbf{R}^d$ and an -undisclosed piece of information $y_i\in\mathbf{R}$. We assume that the data -has already been normalized so that $\norm{x_i}_2\leq 1$ for all -$i\in\mathcal{N}$. - -The experimenter is going to select a set of users and ask them to reveal their -private piece of information. We are interested in a \emph{survey setup}: the -experimenter has not seen the data yet, but he wants to know which users he -should be selecting. His goal is to learn the model underlying the data. Here, -we assume a linear model: -\begin{displaymath} - \forall i\in\mathcal{N},\quad y_i = \beta^* x_i + \varepsilon_i -\end{displaymath} -where $\beta\in\mathbf{R}^d$ and $\varepsilon_i\in\mathbf{R}$ follows a normal -distribution of mean $0$ and variance $\sigma^2$. Furthermore, we assume the -error $\varepsilon$ to be independent of the user: -$(\varepsilon_i)_{i\in\mathcal{N}}$ are mutually independent. - -After observing the data, the experimenter could simply do linear regression to -learn the model parameter $\beta$. However, in a more general setup, the -experimenter has a prior knowledge about $\beta$, a distribution over -$\mathbf{R}^d$. After observing the data, the experimenter performs -\emph{maximum a posteriori estimation}: computing the point which maximizes the -posterior distribution of $\beta$ given the observations. - -Here, we will assume, as it is often done, that the prior distribution is -a multivariate normal distribution of mean zero and covariance matrix $\kappa -I_d$. Maximum a posteriori estimation leads to the following maximization -problem: -\begin{displaymath} - \beta_{\text{max}} = \argmax_{\beta\in\mathbf{R}^d} \sum_i (y_i - \beta^*x_i)^2 - + \frac{1}{\mu}\sum_i \norm{\beta}_2^2 -\end{displaymath} -which is the well-known \emph{ridge regression}. $\mu -= \frac{\kappa}{\sigma^2}$ is the regularization parameter. 
Ridge regression -can thus be seen as linear regression with a regularization term which -prevents $\beta$ from having a large $L_2$-norm. - -\subsection{Value of data} - -Because the user private variables $y_i$ have not been observed yet when the -experimenter has to decide which users to include in his experiment, we treat -$\beta$ as a random variable whose distribution is updated after observing the -data. - -Let us recall that if $\beta$ is random variable over $\mathbf{R}^d$ whose -probability distribution has a density function $f$ with respect to the -Lebesgue measure, its entropy is given by: -\begin{displaymath} - \mathbb{H}(\beta) \defeq - \int_{b\in\mathbf{R}^d} \log f(b) f(b)\text{d}b -\end{displaymath} - -A usual way to measure the decrease of uncertainty induced by the observation -of data is to use the entropy. This leads to the following definition of the -value of data called the \emph{value of information}: -\begin{displaymath} - \forall S\subset\mathcal{N},\quad V(S) = \mathbb{H}(\beta) - - \mathbb{H}(\beta\,|\, - Y_S) -\end{displaymath} -where $Y_S = \{y_i,\,i\in S\}$ is the set of observed data. - -\begin{theorem} - Under the ridge regression model explained in section TODO, the value of data - is equal to: - \begin{align*} - \forall S\subset\mathcal{N},\; V(S) - & = \frac{1}{2}\log\det\left(I_d - + \mu\sum_{i\in S} x_ix_i^*\right)\\ - & \defeq \frac{1}{2}\log\det A(S) - \end{align*} -\end{theorem} - -\begin{proof} - -Let us denote by $X_S$ the matrix whose rows are the vectors $(x_i^*)_{i\in -S}$. Observe that $A_S$ can simply be written as: -\begin{displaymath} - A_S = I_d + \mu X_S^* X_S -\end{displaymath} - -Let us recall that the entropy of a multivariate normal variable $B$ over -$\mathbf{R}^d$ of covariance $\Sigma I_d$ is given by: -\begin{equation}\label{eq:multivariate-entropy} - \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma I_d\big) -\end{equation} - -Using the chain rule for conditional entropy, we get that: -\begin{displaymath} - V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta) -\end{displaymath} - -Conditioned on $\beta$, $(Y_S)$ follows a multivariate normal -distribution of mean $X\beta$ and of covariance matrix $\sigma^2 I_n$. Hence: -\begin{equation}\label{eq:h1} - \mathbb{H}(Y_S\,|\,\beta) - = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right) -\end{equation} - -$(Y_S)$ also follows a multivariate normal distribution of mean zero. Let us -compute its covariance matrix, $\Sigma_Y$: -\begin{align*} - \Sigma_Y & = \expt{YY^*} = \expt{(X_S\beta + E)(X_S\beta + E)^*}\\ - & = \kappa X_S X_S^* + \sigma^2I_n -\end{align*} -Thus, we get that: -\begin{equation}\label{eq:h2} - \mathbb{H}(Y_S) - = \frac{1}{2}\log\left((2\pi e)^n \det(\kappa X_S X_S^* + \sigma^2 I_n)\right) -\end{equation} - -Combining \eqref{eq:h1} and \eqref{eq:h2} we get: -\begin{displaymath} - V(S) = \frac{1}{2}\log\det\left(I_n+\frac{\kappa}{\sigma^2}X_S - X_S^*\right) -\end{displaymath} - -Finally, we can use Sylvester's determinant theorem to get the result. -\end{proof} - -It is also interesting to look at the marginal contribution of a user to a set: the -increase of value induced by adding a user to an already existing set of users. -We have the following lemma. 
- -\begin{lemma}[Marginal contribution] - \begin{displaymath} - \Delta_i V(S)\defeq V(S\cup\{i\}) - V(S) - = \frac{1}{2}\log\left(1 + \mu x_i^*A(S)^{-1}x_i\right) - \end{displaymath} -\end{lemma} - -\begin{proof} - We have: - \begin{align*} - V(S\cup\{i\}) & = \frac{1}{2}\log\det A(S\cup\{i\})\\ - & = \frac{1}{2}\log\det\left(A(S) + \mu x_i x_i^*\right)\\ - & = V(S) + \frac{1}{2}\log\det\left(I_d + \mu A(S)^{-1}x_i - x_i^*\right)\\ - & = V(S) + \frac{1}{2}\log\left(1 + \mu x_i^* A(S)^{-1}x_i\right) - \end{align*} - where the last equality comes from Sylvester's determinant formula. -\end{proof} - -Because $A(S)$ is symmetric definite positive, the marginal contribution is -positive, which proves that the value function is set increasing. Furthermore, -it is easy to see that if $S\subset S'$, then $A(S)\leq A(S')$. Using the fact -that matrix inversion is decreasing, we see that the marginal contribution of -a fixed user is a set decreasing function. This is the \emph{submodularity} of -the value function. - -TODO? Explain what are the points which are the most valuable : points which -are aligned along directions where the spread over the already existing points -is small. - -\subsection{Auction} - -TODO Explain the optimization problem, why it has to be formulated as an auction -problem. Explain the goals: -\begin{itemize} - \item truthful - \item individually rational - \item budget feasible - \item has a good approximation ratio - -TODO Explain what is already known: it is ok when the function is submodular. When -should we introduce the notion of submodularity? -\end{itemize} - +\input{problem} \section{Main result} - -All the budget feasible mechanisms studied recently (TODO ref Singer Chen) -rely on using a greedy heuristic extending the idea of the greedy heuristic for -the knapsack problem. In knapsack, objects are selected based on their -\emph{value-per-cost} ratio. The quantity which plays a similar role for -general submodular functions is the \emph{marginal-contribution-per-cost} -ratio: let us assume that you have already selected a set of points $S$, then -the \emph{marginal-contribution-per-cost} ratio per cost of a new point $i$ is -defined by: -\begin{displaymath} - \frac{V(S\cup\{i\}) - V(S)}{c_i} -\end{displaymath} - -The greedy heuristic then simply repeatedly selects the point whose -marginal-contribution-per-cost ratio is the highest until it reaches the budget -limit. Mechanism considerations aside, this is known to have an unbounded -approximation ratio. However, lemma TODO ref in Chen or Singer shows that the -maximum between the greedy heuristic and the point with maximum value (as -a singleton set) provides a $\frac{5e}{e-1}$ approximation ratio. - -Unfortunately, TODO Singer 2011 points out that taking the maximum between the -greedy heuristic and the most valuable point is not incentive compatible. -Singer and Chen tackle this issue similarly: instead of comparing the most -valuable point to the greedy solution, they compare it to a solution which is -close enough to keep a constant approximation ratio: -\begin{itemize} - \item Chen suggests using $OPT(V,\mathcal{N}\setminus\{i\}, B)$. - Unfortunately, in the general case, this cannot be computed exactly in - polynomial time. - \item Singer uses using the optimal value of a relaxed objective function - which can be proven to be close to the optimal of the original - objective function. The function used is tailored to the specific - problem of coverage. 
-\end{itemize} - -Here, we use a relaxation of the objective function which is tailored to the -problem of ridge regression. We define: -\begin{displaymath} - \forall\lambda\in[0,1]^{|\mathcal{N}|}\,\quad L_{\mathcal{N}}(\lambda) \defeq - \log\det\left(I_d + \mu\sum_{i\in\mathcal{N}} \lambda_i x_i x_i^*\right) -\end{displaymath} - -We can now present the mechanism we use, which has a same flavor to Chen's and -Singer's - -\begin{algorithm}\label{mechanism} - \caption{Mechanism for ridge regression} - \begin{algorithmic}[1] - \State $i^* \gets \argmax_{j\in\mathcal{N}}V(j)$ - \State $x^* \gets \argmax_{x\in[0,1]^n} \{L_{\mathcal{N}\setminus\{i^*\}}(x) - \,|\, c(x)\leq B\}$ - \Statex - \If{$L(x^*) < CV(i^*)$} - \State \textbf{return} $\{i^*\}$ - \Else - \State $i \gets \argmax_{1\leq j\leq n}\frac{V(j)}{c_j}$ - \State $S \gets \emptyset$ - \While{$c_i\leq \frac{B}{2}\frac{V(S\cup\{i\})-V(S)}{V(S\cup\{i\})}$} - \State $S \gets S\cup\{i\}$ - \State $i \gets \argmax_{j\in\mathcal{N}\setminus S} - \frac{V(S\cup\{j\})-V(S)}{c_j}$ - \EndWhile - \State \textbf{return} $S$ - \EndIf - \end{algorithmic} -\end{algorithm} - -Notice, that the stopping condition in the while loop is more sophisticated -than just ensuring that the sum of the costs does not exceed the budget. This -is because the selected users will be payed more than their costs, and this -stopping condition ensures budget feasibility when the users are paid their -threshold payment. - -We can now state the main result of this section: -\begin{theorem} - The mechanism in \ref{mechanism} is truthful, individually rational, budget - feasible. Furthermore, choosing: - \begin{multline*} - C = C^* = \frac{5e-1 + C_\mu(2e+1)}{2C_\mu(e-1)}\\ - + \frac{\sqrt{C_\mu^2(1+2e)^2 - + 2C_\mu(14e^2+5e+1) + (1-5e)^2}}{2C_\mu(e-1)} - \end{multline*} - we get an approximation ratio of: - \begin{multline*} - 1 + C^* = \frac{5e-1 + C_\mu(4e-1)}{2C_\mu(e-1)}\\ - + \frac{\sqrt{C_\mu^2(1+2e)^2 - + 2C_\mu(14e^2+5e+1) + (1-5e)^2}}{2C_\mu(e-1)} - \end{multline*} - where: - \begin{displaymath} - C_\mu = \frac{\log(1+\mu)}{2\mu} - \end{displaymath} -\end{theorem} - -The proof will consist of the claims of the theorem broken down into lemmas. - -Because the user strategy is parametrized by a single parameter, truthfulness -is equivalent to monotonicity (TODO ref). The proof here is the same as in TODO -and is given for the sake of completeness. - -\begin{lemma} -The mechanism is monotone. -\end{lemma} - -\begin{proof} - We assume by contradiction that there exists a user $i$ that has been - selected by the mechanism and that would not be selected had he reported - a cost $c_i'\leq c_i$ (all the other costs staying the same). - - If $i\neq i^*$ and $i$ has been selected, then we are in the case where - $L(x^*) \geq C V(i^*)$ and $i$ was included in the result set by the greedy - part of the mechanism. By reporting a cost $c_i'\leq c_i$, using the - submodularity of $V$, we see that $i$ will satisfy the greedy selection - rule: - \begin{displaymath} - i = \argmax_{j\in\mathcal{N}\setminus S} \frac{V(S\cup\{j\}) - - V(S)}{c_j} - \end{displaymath} - in an earlier iteration of the greedy heuristic. Let us denote by $S_i$ - (resp. $S_i'$) the set to which $i$ is added when reporting cost $c_i$ - (resp. $c_i'$). We have $S_i'\subset S_i$. 
Moreover: - \begin{align*} - c_i' & \leq c_i \leq - \frac{B}{2}\frac{V(S_i\cup\{i\})-V(S_i)}{V(S_i\cup\{i\})}\\ - & \leq \frac{B}{2}\frac{V(S_i'\cup\{i\})-V(S_i')}{V(S_i'\cup\{i\})} - \end{align*} - Hence $i$ will still be included in the result set. - - If $i = i^*$, $i$ is included iff $L(x^*) \leq C V(i^*)$. Reporting $c_i'$ - instead of $c_i$ does not change the value $V(i^*)$ nor $L(x^*)$ (which is - computed over $\mathcal{N}\setminus\{i^*\}$). Thus $i$ is still included by - reporting a different cost. -\end{proof} - -\begin{lemma} -The mechanism is budget feasible. -\end{lemma} - -The proof is the same as in Chen and is given here for the sake of -completeness. -\begin{proof} - -\end{proof} - -Using the characterization of the threshold payments from Singer TODO, we can -prove individual rationality similarly to Chen TODO. -\begin{lemma} -The mechanism is individually rational -\end{lemma} - -\begin{proof} - -\end{proof} - -The following lemma requires to have a careful look at the relaxation function -we chose in the mechanism. The next section will be dedicated to studying this -relaxation and will contain the proof of the following lemma: -\begin{lemma} - We have: - \begin{displaymath} - OPT(L_\mathcal{N}, B) \leq \frac{1}{C_\mu}\big(2 OPT(V,\mathcal{N},B) - + \max_{i\in\mathcal{N}}V(i)\big) - \end{displaymath} -\end{lemma} - -\begin{lemma} - Let us denote by $S_M$ the set returned by the mechanism. Let us also - write: - \begin{displaymath} - C_{\textrm{max}} = \max\left(1+C,\frac{e}{e-1}\left( 3 + \frac{12e}{C\cdot - C_\mu(e-1) -5e +1}\right)\right) - \end{displaymath} - - Then: - \begin{displaymath} - OPT(V, \mathcal{N}, B) \leq - C_\text{max}\cdot V(S_M) - \end{displaymath} -\end{lemma} - -\begin{proof} - - If the condition on line 3 of the algorithm holds, then: - \begin{displaymath} - V(i^*) \geq \frac{1}{C}L(x^*) \geq - \frac{1}{C}OPT(V,\mathcal{N}\setminus\{i\}, B) - \end{displaymath} - - But: - \begin{displaymath} - OPT(V,\mathcal{N},B) \leq OPT(V,\mathcal{N}\setminus\{i\}, B) + V(i^*) - \end{displaymath} - - Hence: - \begin{displaymath} - V(i^*) \geq \frac{1}{C+1} OPT(V,\mathcal{N}, B) - \end{displaymath} - - If the condition of the algorithm does not hold: - \begin{align*} - V(i^*) & \leq \frac{1}{C}L(x^*) \leq \frac{1}{C\cdot C_\mu} - \big(2 OPT(V,\mathcal{N}, B) + V(i^*)\big)\\ - & \leq \frac{1}{C\cdot C_\mu}\left(\frac{2e}{e-1}\big(3 V(S_M) - + 2 V(i^*)\big) - + V(i^*)\right) - \end{align*} - - Thus: - \begin{align*} - V(i^*) \leq \frac{6e}{C\cdot C_\mu(e-1)- 5e + 1} V(S_M) - \end{align*} - - Finally, using again that: - \begin{displaymath} - OPT(V,\mathcal{N},B) \leq \frac{e}{e-1}\big(3 V(S_M) + 2 V(i^*)\big) - \end{displaymath} - - We get: - \begin{displaymath} - OPT(V, \mathcal{N}, B) \leq \frac{e}{e-1}\left( 3 + \frac{12e}{C\cdot - C_\mu(e-1) -5e +1}\right) V(S_M) - \end{displaymath} -\end{proof} - -The optimal value for $C$ is: -\begin{displaymath} - C^* = \arg\min C_{\textrm{max}} -\end{displaymath} - -This equation has two solutions. Only one of those is such that: -\begin{displaymath} - C\cdot C_\mu(e-1) -5e +1 \geq 0 -\end{displaymath} -which is needed in the proof of the previous lemma. Computing this solution, -we can state the main result of this section. - -\subsection{Relaxations of the value function} - -To prove lemma TODO, we will use a general method called pipage rounding, -introduced in TODO. 
Two consecutive relaxations are used: the one that we are -interested in, whose optimization can be computed efficiently, and the -multilinear extension which presents a \emph{cross-convexity} like behavior -which allows for rounding of fractional solution without decreasing the value -of the objective function and thus ensures a constant approximation of the -value function. The difficulty resides in showing that the ratio of the two -relaxations is bounded. - -We say that $R_\mathcal{N}:[0,1]^n\rightarrow\mathbf{R}$ is a relaxation of the -value function $V$ over $\mathcal{N}$ if it coincides with $V$ at binary -points. Formally, for any $S\subset\mathcal{N}$, let $\mathbf{1}_S$ denote the -indicator vector of $S$. $R_\mathcal{N}$ is a relaxation of $V$ over -$\mathcal{N}$ iff: -\begin{displaymath} - \forall S\subset\mathcal{N},\; R_\mathcal{N}(\mathbf{1}_S) = V(S) -\end{displaymath} - -We can extend the optimisation problem defined above to a relaxation by -extending the cost function: -\begin{displaymath} - \forall \lambda\in[0,1]^n,\; c(\lambda) - = \sum_{i\in\mathcal{N}}\lambda_ic_i -\end{displaymath} -The optimisation problem becomes: -\begin{displaymath} - OPT(R_\mathcal{N}, B) = - \max_{\lambda\in[0,1]^n}\left\{R_\mathcal{N}(\lambda)\,|\, c(\lambda)\leq B\right\} -\end{displaymath} -The relaxations we will consider here rely on defining a probability -distribution over subsets of $\mathcal{N}$. - -Let $\lambda\in[0,1]^n$, let us define: -\begin{displaymath} - P_\mathcal{N}^\lambda(S) = \prod_{i\in S}\lambda_i - \prod_{i\in\mathcal{N}\setminus S}(1-\lambda_i) -\end{displaymath} -$P_\mathcal{N}^\lambda(S)$ is the probability of picking the set $S$ if we select -a subset of $\mathcal{N}$ at random by deciding independently for each point to -include it in the set with probability $\lambda_i$ (and to exclude it with -probability $1-\lambda_i$). - -We will consider two relaxations of the value function $V$ over $\mathcal{N}$: -\begin{itemize} - \item the \emph{multi-linear extension} of $V$: - \begin{align*} - F_\mathcal{N}(\lambda) - & = \mathbb{E}_{S\sim P_\mathcal{N}^\lambda}\big[\log\det A(S)\big]\\ - & = \sum_{S\subset\mathcal{N}} P_\mathcal{N}^\lambda(S) V(S)\\ - & = \sum_{S\subset\mathcal{N}} P_\mathcal{N}^\lambda(S) \log\det A(S)\\ - \end{align*} - \item the \emph{concave relaxation} of $V$: - \begin{align*} - L_{\mathcal{N}}(\lambda) - & = \log\det \mathbb{E}_{S\sim P_\mathcal{N}^\lambda}\big[A(S)\big]\\ - & = \log\det\left(\sum_{S\subset N} - P_\mathcal{N}^\lambda(S)A(S)\right)\\ - & = \log\det\left(I_d + \mu\sum_{i\in\mathcal{N}} - \lambda_ix_ix_i^*\right)\\ - & \defeq \log\det \tilde{A}(\lambda) - \end{align*} -\end{itemize} - -\begin{lemma} - The \emph{concave relaxation} $L_\mathcal{N}$ is concave\footnote{Hence - this relaxation is well-named!}. -\end{lemma} - -\begin{proof} - This follows from the concavity of the $\log\det$ function over symmetric - positive semi-definite matrices. 
More precisely, if $A$ and $B$ are two - symmetric positive semi-definite matrices, then: - \begin{multline*} - \forall\alpha\in [0, 1],\; \log\det\big(\alpha A + (1-\alpha) B\big)\\ - \geq \alpha\log\det A + (1-\alpha)\log\det B - \end{multline*} -\end{proof} - -\begin{lemma}[Rounding]\label{lemma:rounding} - For any feasible $\lambda\in[0,1]^n$, there exists a feasible - $\bar{\lambda}\in[0,1]^n$ such that at most one of its component is - fractional, that is, lies in $(0,1)$ and: - \begin{displaymath} - F_{\mathcal{N}}(\lambda)\leq F_{\mathcal{N}}(\bar{\lambda}) - \end{displaymath} -\end{lemma} - -\begin{proof} - We give a rounding procedure which given a feasible $\lambda$ with at least - two fractional components, returns some $\lambda'$ with one less fractional - component, feasible such that: - \begin{displaymath} - F_\mathcal{N}(\lambda) \leq F_\mathcal{N}(\lambda') - \end{displaymath} - Applying this procedure recursively yields the lemma's result. - - Let us consider such a feasible $\lambda$. Let $i$ and $j$ be two - fractional components of $\lambda$ and let us define the following - function: - \begin{displaymath} - F_\lambda(\varepsilon) = F(\lambda_\varepsilon) - \quad\textrm{where} \quad - \lambda_\varepsilon = \lambda + \varepsilon\left(e_i-\frac{c_i}{c_j}e_j\right) - \end{displaymath} - - It is easy to see that if $\lambda$ is feasible, then: - \begin{multline}\label{eq:convex-interval} - \forall\varepsilon\in\Big[\max\Big(-\lambda_i,(\lambda_j-1)\frac{c_j}{c_i}\Big), \min\Big(1-\lambda_i, \lambda_j - \frac{c_j}{c_i}\Big)\Big],\;\\ - \lambda_\varepsilon\;\;\textrm{is feasible} - \end{multline} - - Furthermore, the function $F_\lambda$ is convex, indeed: - \begin{align*} - F_\lambda(\varepsilon) - & = \mathbb{E}_{S'\sim P_{\mathcal{N}\setminus\{i,j\}}^\lambda(S')}\Big[ - (\lambda_i+\varepsilon)\Big(\lambda_j-\varepsilon\frac{c_i}{c_j}\Big)V(S'\cup\{i,j\})\\ - & + (\lambda_i+\varepsilon)\Big(1-\lambda_j+\varepsilon\frac{c_i}{c_j}\Big)V(S'\cup\{i\})\\ - & + (1-\lambda_i-\varepsilon)\Big(\lambda_j-\varepsilon\frac{c_i}{c_j}\Big)V(S'\cup\{j\})\\ - & + (1-\lambda_i-\varepsilon)\Big(1-\lambda_j+\varepsilon\frac{c_i}{c_j}\Big)V(S')\Big]\\ - \end{align*} - Thus, $F_\lambda$ is a degree 2 polynomial whose dominant coefficient is: - \begin{multline*} - \frac{c_i}{c_j}\mathbb{E}_{S'\sim - P_{\mathcal{N}\setminus\{i,j\}}^\lambda(S')}\Big[ - V(S'\cup\{i\})+V(S'\cup\{i\})\\ - -V(S'\cup\{i,j\})-V(S')\Big] - \end{multline*} - which is positive by submodularity of $V$. Hence, the maximum of - $F_\lambda$ over the interval given in \eqref{eq:convex-interval} is - attained at one of its limit, at which either the $i$-th or $j$-th component of - $\lambda_\varepsilon$ becomes integral. -\end{proof} - -\begin{lemma}\label{lemma:relaxation-ratio} - The following inequality holds: - \begin{displaymath} - \forall\lambda\in[0,1]^n,\; - \frac{\log\big(1+\mu\big)}{2\mu} - \,L_\mathcal{N}(\lambda)\leq - F_\mathcal{N}(\lambda)\leq L_{\mathcal{N}}(\lambda) - \end{displaymath} -\end{lemma} - -\begin{proof} - - We will prove that: - \begin{displaymath} - \frac{\log\big(1+\mu\big)}{2\mu} - \end{displaymath} - is a lower bound of the ratio $\partial_i F_\mathcal{N}(\lambda)/\partial_i - L_\mathcal{N}(\lambda)$. 
- - This will be enough to conclude, by observing that: - \begin{displaymath} - \frac{F_\mathcal{N}(\lambda)}{L_\mathcal{N}(\lambda)} - \sim_{\lambda\rightarrow 0} - \frac{\sum_{i\in \mathcal{N}}\lambda_i\partial_i F_\mathcal{N}(0)} - {\sum_{i\in\mathcal{N}}\lambda_i\partial_i L_\mathcal{N}(0)} - \end{displaymath} - and that an interior critical point of the ratio - $F_\mathcal{N}(\lambda)/L_\mathcal{N}(\lambda)$ is defined by: - \begin{displaymath} - \frac{F_\mathcal{N}(\lambda)}{L_\mathcal{N}(\lambda)} - = \frac{\partial_i F_\mathcal{N}(\lambda)}{\partial_i - L_\mathcal{N}(\lambda)} - \end{displaymath} - - Let us start by computing the derivatives of $F_\mathcal{N}$ and - $L_\mathcal{N}$ with respect to - the $i$-th component. - - For $F$, it suffices to look at the derivative of - $P_\mathcal{N}^\lambda(S)$: - \begin{displaymath} - \partial_i P_\mathcal{N}^\lambda(S) = \left\{ - \begin{aligned} - & P_{\mathcal{N}\setminus\{i\}}^\lambda(S\setminus\{i\})\;\textrm{if}\; i\in S \\ - & - P_{\mathcal{N}\setminus\{i\}}^\lambda(S)\;\textrm{if}\; - i\in \mathcal{N}\setminus S \\ - \end{aligned}\right. - \end{displaymath} - - Hence: - \begin{multline*} - \partial_i F_\mathcal{N} = - \sum_{\substack{S\subset\mathcal{N}\\ i\in S}} - P_{\mathcal{N}\setminus\{i\}}^\lambda(S\setminus\{i\})V(S)\\ - - \sum_{\substack{S\subset\mathcal{N}\\ i\in \mathcal{N}\setminus S}} - P_{\mathcal{N}\setminus\{i\}}^\lambda(S)V(S)\\ - \end{multline*} - - Now, using that every $S$ such that $i\in S$ can be uniquely written as - $S'\cup\{i\}$, we can write: - \begin{multline*} - \partial_i F_\mathcal{N} = - \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} - P_{\mathcal{N}\setminus\{i\}}^\lambda(S)V(S\cup\{i\})\\ - - \sum_{\substack{S\subset\mathcal{N}\\ i\in \mathcal{N}\setminus S}} - P_{\mathcal{N}\setminus\{i\}}^\lambda(S)V(S)\\ - \end{multline*} - - Finally, by using the expression for the marginal contribution of $i$ to - $S$: - \begin{displaymath} - \partial_i F_\mathcal{N}(\lambda) = - \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} - P_{\mathcal{N}\setminus\{i\}}^\lambda(S) - \log\Big(1 + \mu x_i^*A(S)^{-1}x_i\Big) - \end{displaymath} - - The computation of the derivative of $L_\mathcal{N}$ uses standard matrix - calculus and gives: - \begin{displaymath} - \partial_i L_\mathcal{N}(\lambda) - = \mu x_i^* \tilde{A}(\lambda)^{-1}x_i - \end{displaymath} - - Using the following inequalities: - \begin{gather*} - \forall S\subset\mathcal{N}\setminus\{i\},\quad - P_{\mathcal{N}\setminus\{i\}}^\lambda(S)\geq - P_{\mathcal{N}\setminus\{i\}}^\lambda(S\cup\{i\})\\ - \forall S\subset\mathcal{N},\quad P_{\mathcal{N}\setminus\{i\}}^\lambda(S) - \geq P_\mathcal{N}^\lambda(S)\\ - \forall S\subset\mathcal{N},\quad A(S)^{-1} \geq A(S\cup\{i\})^{-1}\\ - \end{gather*} - we get: - \begin{align*} - \partial_i F_\mathcal{N}(\lambda) - & \geq \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} - P_\mathcal{N}^\lambda(S) - \log\Big(1 + \mu x_i^*A(S)^{-1}x_i\Big)\\ - & \geq \frac{1}{2} - \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} - P_\mathcal{N}^\lambda(S) - \log\Big(1 + \mu x_i^*A(S)^{-1}x_i\Big)\\ - &\hspace{-3.5em}+\frac{1}{2} - \sum_{\substack{S\subset\mathcal{N}\\ i\in\mathcal{N}\setminus S}} - P_\mathcal{N}^\lambda(S\cup\{i\}) - \log\Big(1 + \mu x_i^*A(S\cup\{i\})^{-1}x_i\Big)\\ - &\geq \frac{1}{2} - \sum_{S\subset\mathcal{N}} - P_\mathcal{N}^\lambda(S) - \log\Big(1 + \mu x_i^*A(S)^{-1}x_i\Big)\\ - \end{align*} - - Using that $A(S)\geq I_d$ we get that: - 
\begin{displaymath} - \mu x_i^*A(S)^{-1}x_i \leq \mu - \end{displaymath} - - Moreover: - \begin{displaymath} - \forall x\leq\mu,\; \log(1+x)\geq - \frac{\log\big(1+\mu\big)}{\mu} x - \end{displaymath} - - Hence: - \begin{displaymath} - \partial_i F_\mathcal{N}(\lambda) \geq - \frac{\log\big(1+\mu\big)}{2\mu} - x_i^*\bigg(\sum_{S\subset\mathcal{N}}P_\mathcal{N}^\lambda(S)A(S)^{-1}\bigg)x_i - \end{displaymath} - - Finally, using that the inverse is a matrix convex function over symmetric - positive definite matrices: - \begin{align*} - \partial_i F_\mathcal{N}(\lambda) &\geq - \frac{\log\big(1+\mu\big)}{2\mu} - x_i^*\bigg(\sum_{S\subset\mathcal{N}}P_\mathcal{N}^\lambda(S)A(S)\bigg)^{-1}x_i\\ - & \geq \frac{\log\big(1+\mu\big)}{2\mu} - \partial_i L_\mathcal{N}(\lambda) - \end{align*} -\end{proof} - -We can now prove lemma TODO from previous section. - -\begin{proof} - Let us consider a feasible point $\lambda^*\in[0,1]^n$ such that $L_\mathcal{N}(\lambda^*) - = OPT(L_\mathcal{N}, B)$. By applying lemma~\ref{lemma:relaxation-ratio} - and lemma~\ref{lemma:rounding} we get a feasible point $\bar{\lambda}$ with at most - one fractional component such that: - \begin{equation}\label{eq:e1} - L_\mathcal{N}(\lambda^*) \leq \frac{1}{C_\mu} - F_\mathcal{N}(\bar{\lambda}) - \end{equation} - - Let $\lambda_i$ denote the fractional component of $\bar{\lambda}$ and $S$ - denote the set whose indicator vector is $\bar{\lambda} - \lambda_i e_i$. - Using the fact that $F_\mathcal{N}$ is linear with respect to the $i$-th - component and is a relaxation of the value function, we get: - \begin{displaymath} - F_\mathcal{N}(\bar{\lambda}) = V(S) +\lambda_i V(S\cup\{i\}) - \end{displaymath} - - Using the submodularity of $V$: - \begin{displaymath} - F_\mathcal{N}(\bar{\lambda}) \leq 2 V(S) + V(i) - \end{displaymath} - - Note that since $\bar{\lambda}$ is feasible, $S$ is also feasible and - $V(S)\leq OPT(V,\mathcal{N}, B)$. Hence: - \begin{equation}\label{eq:e2} - F_\mathcal{N}(\bar{\lambda}) \leq 2 OPT(V,\mathcal{N}, B) - + \max_{i\in\mathcal{N}} V(i) - \end{equation} - - Putting \eqref{eq:e1} and \eqref{eq:e2} together gives the results. -\end{proof} - +\input{main} \section{General setup} - +\input{general} \section{Conclusion} - +\input{conclusion} \end{document} diff --git a/problem.tex b/problem.tex new file mode 100644 index 0000000..fb9f8e1 --- /dev/null +++ b/problem.tex @@ -0,0 +1,193 @@ +\subsection{Notations} + +Throughout the paper, we will make use of the following notations: if $x$ is +a (column) vector in $\mathbf{R}^d$, $x^*$ denotes its transposed (line) +vector. Thus, the standard inner product between two vectors $x$ and $y$ is +simply $x^* y$. $\norm{x}_2 = x^*x$ will denote the $L_2$ norm of $x$. + +We will also often use the following order over symmetric matrices: if $A$ and +$B$ are two $d\times d$ and $B$ are two $d\times d$ real symmetric matrices, we +write that $A\leq B$ iff: +\begin{displaymath} + \forall x\in\mathbf{R}^d,\quad + x^*Ax \leq x^*Bx +\end{displaymath} +That is, iff $B-A$ is symmetric semi-definite positive. + +This order let us define the notion of a \emph{decreasing} or \emph{convex} +matrix function similarly to their real counterparts. In particular, let us +recall that the matrix inversion is decreasing and convex over symmetric +definite positive matrices. + +\subsection{Data model} + +There is a set of $n$ users, $\mathcal{N} = \{1,\ldots, n\}$. 
Each user
+$i\in\mathcal{N}$ has a public vector of features $x_i\in\mathbf{R}^d$ and an
+undisclosed piece of information $y_i\in\mathbf{R}$. We assume that the data
+has already been normalized so that $\norm{x_i}_2\leq 1$ for all
+$i\in\mathcal{N}$.
+
+The experimenter is going to select a set of users and ask them to reveal their
+private piece of information. We are interested in a \emph{survey setup}: the
+experimenter has not seen the data yet but must decide which users to select.
+His goal is to learn the model underlying the data. Here, we assume a linear
+model:
+\begin{displaymath}
+    \forall i\in\mathcal{N},\quad y_i = \beta^* x_i + \varepsilon_i
+\end{displaymath}
+where $\beta\in\mathbf{R}^d$ and $\varepsilon_i\in\mathbf{R}$ follows a normal
+distribution of mean $0$ and variance $\sigma^2$. Furthermore, we assume the
+errors to be independent across users: the $(\varepsilon_i)_{i\in\mathcal{N}}$
+are mutually independent.
+
+After observing the data, the experimenter could simply do linear regression to
+learn the model parameter $\beta$. However, in a more general setup, the
+experimenter has prior knowledge about $\beta$, in the form of a distribution
+over $\mathbf{R}^d$. After observing the data, the experimenter performs
+\emph{maximum a posteriori estimation}: computing the point which maximizes the
+posterior distribution of $\beta$ given the observations.
+
+Here, we will assume, as is often done, that the prior distribution is
+a multivariate normal distribution of mean zero and covariance matrix $\kappa
+I_d$. Maximum a posteriori estimation then leads to the following minimization
+problem:
+\begin{displaymath}
+    \beta_{\text{MAP}} = \operatorname*{arg\,min}_{\beta\in\mathbf{R}^d}
+    \sum_i (y_i - \beta^*x_i)^2 + \frac{1}{\mu}\norm{\beta}_2^2
+\end{displaymath}
+which is the well-known \emph{ridge regression}; $\mu
+= \frac{\kappa}{\sigma^2}$ is the regularization parameter. Ridge regression
+can thus be seen as linear regression with a regularization term which
+prevents $\beta$ from having a large $L_2$-norm.
+
+\subsection{Value of data}
+
+Because the user private variables $y_i$ have not been observed yet when the
+experimenter has to decide which users to include in his experiment, we treat
+$\beta$ as a random variable whose distribution is updated after observing the
+data.
+
+Let us recall that if $\beta$ is a random variable over $\mathbf{R}^d$ whose
+probability distribution has a density function $f$ with respect to the
+Lebesgue measure, its entropy is given by:
+\begin{displaymath}
+    \mathbb{H}(\beta) \defeq - \int_{b\in\mathbf{R}^d} \log f(b) f(b)\text{d}b
+\end{displaymath}
+
+A standard way to measure the decrease in uncertainty induced by observing the
+data is the entropy. This leads to the following definition of the value of
+data, called the \emph{value of information}:
+\begin{displaymath}
+    \forall S\subset\mathcal{N},\quad V(S) = \mathbb{H}(\beta)
+    - \mathbb{H}(\beta\,|\, Y_S)
+\end{displaymath}
+where $Y_S = \{y_i,\,i\in S\}$ is the set of observed data.
+
+\begin{theorem}
+    Under the ridge regression model explained in section TODO, the value of
+    data is equal to:
+    \begin{align*}
+        \forall S\subset\mathcal{N},\; V(S)
+        & = \frac{1}{2}\log\det\left(I_d
+        + \mu\sum_{i\in S} x_ix_i^*\right)\\
+        & \defeq \frac{1}{2}\log\det A(S)
+    \end{align*}
+\end{theorem}
+
+\begin{proof}
+
+Let us denote by $X_S$ the matrix whose rows are the vectors
+$(x_i^*)_{i\in S}$.
Observe that $A_S$ can simply be written as: +\begin{displaymath} + A_S = I_d + \mu X_S^* X_S +\end{displaymath} + +Let us recall that the entropy of a multivariate normal variable $B$ over +$\mathbf{R}^d$ of covariance $\Sigma I_d$ is given by: +\begin{equation}\label{eq:multivariate-entropy} + \mathbb{H}(B) = \frac{1}{2}\log\big((2\pi e)^d \det \Sigma I_d\big) +\end{equation} + +Using the chain rule for conditional entropy, we get that: +\begin{displaymath} + V(S) = \mathbb{H}(Y_S) - \mathbb{H}(Y_S\,|\,\beta) +\end{displaymath} + +Conditioned on $\beta$, $(Y_S)$ follows a multivariate normal +distribution of mean $X\beta$ and of covariance matrix $\sigma^2 I_n$. Hence: +\begin{equation}\label{eq:h1} + \mathbb{H}(Y_S\,|\,\beta) + = \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right) +\end{equation} + +$(Y_S)$ also follows a multivariate normal distribution of mean zero. Let us +compute its covariance matrix, $\Sigma_Y$: +\begin{align*} + \Sigma_Y & = \expt{YY^*} = \expt{(X_S\beta + E)(X_S\beta + E)^*}\\ + & = \kappa X_S X_S^* + \sigma^2I_n +\end{align*} +Thus, we get that: +\begin{equation}\label{eq:h2} + \mathbb{H}(Y_S) + = \frac{1}{2}\log\left((2\pi e)^n \det(\kappa X_S X_S^* + \sigma^2 I_n)\right) +\end{equation} + +Combining \eqref{eq:h1} and \eqref{eq:h2} we get: +\begin{displaymath} + V(S) = \frac{1}{2}\log\det\left(I_n+\frac{\kappa}{\sigma^2}X_S + X_S^*\right) +\end{displaymath} + +Finally, we can use Sylvester's determinant theorem to get the result. +\end{proof} + +It is also interesting to look at the marginal contribution of a user to a set: the +increase of value induced by adding a user to an already existing set of users. +We have the following lemma. + +\begin{lemma}[Marginal contribution] + \begin{displaymath} + \Delta_i V(S)\defeq V(S\cup\{i\}) - V(S) + = \frac{1}{2}\log\left(1 + \mu x_i^*A(S)^{-1}x_i\right) + \end{displaymath} +\end{lemma} + +\begin{proof} + We have: + \begin{align*} + V(S\cup\{i\}) & = \frac{1}{2}\log\det A(S\cup\{i\})\\ + & = \frac{1}{2}\log\det\left(A(S) + \mu x_i x_i^*\right)\\ + & = V(S) + \frac{1}{2}\log\det\left(I_d + \mu A(S)^{-1}x_i + x_i^*\right)\\ + & = V(S) + \frac{1}{2}\log\left(1 + \mu x_i^* A(S)^{-1}x_i\right) + \end{align*} + where the last equality comes from Sylvester's determinant formula. +\end{proof} + +Because $A(S)$ is symmetric definite positive, the marginal contribution is +positive, which proves that the value function is set increasing. Furthermore, +it is easy to see that if $S\subset S'$, then $A(S)\leq A(S')$. Using the fact +that matrix inversion is decreasing, we see that the marginal contribution of +a fixed user is a set decreasing function. This is the \emph{submodularity} of +the value function. + +TODO? Explain what are the points which are the most valuable : points which +are aligned along directions where the spread over the already existing points +is small. + +\subsection{Auction} + +TODO Explain the optimization problem, why it has to be formulated as an auction +problem. Explain the goals: +\begin{itemize} + \item truthful + \item individually rational + \item budget feasible + \item has a good approximation ratio + +TODO Explain what is already known: it is ok when the function is submodular. When +should we introduce the notion of submodularity? +\end{itemize} + + |
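As a reading aid for the value function defined in problem.tex and the greedy branch of the mechanism in main.tex above, here is a small numerical sketch (not part of the commit). It assumes numpy; the helper names (`value`, `marginal_gain`, `greedy_select`) are illustrative only, and it omits the other branch of the mechanism (the comparison of the best singleton against the concave relaxation $L_{\mathcal{N}}$) as well as the threshold payments.

```python
import numpy as np

def value(X, S, mu):
    """V(S) = 1/2 * logdet(I_d + mu * sum_{i in S} x_i x_i^T), as in problem.tex."""
    idx = list(S)
    d = X.shape[1]
    A = np.eye(d) + mu * X[idx].T @ X[idx]
    return 0.5 * np.linalg.slogdet(A)[1]

def marginal_gain(X, S, i, mu):
    """Delta_i V(S) = 1/2 * log(1 + mu * x_i^T A(S)^{-1} x_i) (marginal contribution lemma)."""
    idx = list(S)
    d = X.shape[1]
    A = np.eye(d) + mu * X[idx].T @ X[idx]
    xi = X[i]
    return 0.5 * np.log1p(mu * xi @ np.linalg.solve(A, xi))

def greedy_select(X, costs, B, mu):
    """Greedy branch of the mechanism in main.tex: repeatedly add the point with
    the highest marginal-contribution-per-cost ratio while the proportional-share
    rule c_i <= (B/2) * (V(S+i) - V(S)) / V(S+i) holds."""
    n = len(costs)
    S, remaining = set(), set(range(n))
    while remaining:
        # Candidate with the best marginal-contribution-per-cost ratio.
        i = max(remaining, key=lambda j: marginal_gain(X, S, j, mu) / costs[j])
        v_si = value(X, S | {i}, mu)
        gain = v_si - value(X, S, mu)
        # Proportional-share stopping condition from the while loop of the mechanism.
        if v_si <= 0 or costs[i] > 0.5 * B * gain / v_si:
            break
        S.add(i)
        remaining.remove(i)
    return S

if __name__ == "__main__":
    # Toy data (hypothetical): 20 users, 3 public features each, norms <= 1.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
    costs = rng.uniform(0.1, 1.0, size=20)
    print(sorted(greedy_select(X, costs, B=2.0, mu=1.0)))
```

As main.tex notes, the proportional-share stopping rule (rather than a plain budget check on the sum of reported costs) is what keeps the mechanism budget feasible once selected users are paid their threshold payments.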