\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amsthm,amsfonts}
\usepackage{comment}
\newtheorem{lemma}{Lemma}
\newtheorem{fact}{Fact}
\newtheorem{example}{Example}
\newtheorem{prop}{Proposition}
\newcommand{\var}{\mathop{\mathrm{Var}}}
\newcommand{\condexp}[2]{\mathop{\mathbb{E}}\left[#1|#2\right]}
\newcommand{\expt}[1]{\mathop{\mathbb{E}}\left[#1\right]}
\newcommand{\norm}[1]{\lVert#1\rVert}
\newcommand{\tr}[1]{#1^*}
\newcommand{\ip}[2]{\langle #1, #2 \rangle}
\newcommand{\mse}{\mathop{\mathrm{MSE}}}
\DeclareMathOperator{\trace}{tr}
\DeclareMathOperator*{\argmax}{arg\,max}
\title{Value of data: an approach to the economics of user data}
\begin{document}
\maketitle
\section{Introduction}
With the recent development of communication technologies, user data is now
present everywhere. It is also clear that some notion of value is attached to
user data: this is particularly obvious when one considers the market for
targeted advertising on the Web. However, there is no consensus on what a good
definition of the value of data should be.
The goal of this work is to propose a framework to study the value of data.
This framework, inspired both by the fields of Information Theory and
Statistical Learning, leads to a natural definition of the value of data, which
can then be used to answer economic questions regarding the use of this data in
a data market.
\section{Data model}
There is a set of users and an experimenter. Each user has a vector of public
features (e.g. age, height, movie ratings, binary features, labels, etc.) and
a private piece of information, an undisclosed variable.
The experimenter has prior knowledge of how the undisclosed variable relates
to the public vector of features; this prior knowledge is called the
\emph{hypothesis}. The experimenter wants to test and refine his hypothesis
against the users' data: he is going to select a set of users and ask them to
reveal their private variable. Based on the observed data, he will be able to
update his prior knowledge.
Because the users' data helps the experimenter refine his hypothesis, there
is a notion of value attached to a group of users which quantifies how much
uncertainty about the hypothesis is removed by observing their data.
The users also incur a cost for revealing their data (for example for privacy
reasons, or because providing the information to the experimenter takes them
some time). The experimenter has finite resources: there is a budget
constraint on how much he can spend.
The problems arising in this setup are natural: the experimenter wants to
maximize his utility, the value of the set of users he selects, but he needs to
compensate the users while taking into account their costs and his budget
constraint. The users' costs can either be public, which directly leads to
combinatorial optimization problems, or private, in which case a notion of
strategy intervenes in the way the experimenter compensates the users and
requires an auction approach.
Formally, there is a set of users indexed by a set $\mathcal{I}$. The public
feature vector of user $i\in\mathcal{I}$ is an element $x_i$ of a feature set
$E$; his undisclosed variable is denoted by $y_i$ and belongs to some space
$A$. The cost of user $i$ for revealing his data is a positive real number
$c_i\in\mathbf{R}^+$.
The prior knowledge of the experimenter takes the form of a random variable $H$
over $A^E$, or over a subset of $A^E$ called the \emph{hypothesis set}. This
random variable expresses $y_i$ as a function of $x_i$ through the equation:
\begin{equation}\label{eq:hypothesis-model}
y_i = H(x_i) + \varepsilon_i
\end{equation}
where $\varepsilon_i$ is a random variable over $A$ independent from $H$.
\emph{TODO: explain why this model is not restrictive: $y$ can always be
written as a deterministic function of $x$ plus something independent of
$x$}
The framework for the experimenter is Bayesian: his prior knowledge, the
distribution of $H$, is the prior distribution. Based on the disclosed variables
of the selected users, he updates his knowledge to get the posterior
distribution.
\emph{Examples.}
\begin{enumerate}
\item if the hypothesis set is finite (the experimenter has a few
    deterministic models he wants to choose from), observing data allows him
    to rule out parts of the hypothesis set. In this case, the uncertainty
    of the experimenter can simply be measured by the size of the
    hypothesis set.
\item if $A=\{0,1\}$ and the hypothesis set is the set of all binary
functions on $E$, the learning task of the experimenter implied by the
above setup is a binary classification problem.
\item if $A=\mathbf{R}$, $E=\mathbf{R}^d$ and the hypothesis set is the set
    of linear functions from $E$ to $A$, $\mathcal{L}(E,A)$, the learning
    task is a linear regression. The prior knowledge of the experimenter is
    a prior distribution on the parameters of the linear function, so that
    the learning task is equivalent to regularized linear regression
    (e.g. ridge regression), as made precise right after this list.
\end{enumerate}
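To make the third example precise, here is a standard derivation (not in the
original notes): writing $X$ for a design matrix stacking the observed feature
vectors and $y$ for the vector of observed answers, and anticipating the
Gaussian model of the last section (prior $\beta\sim\mathcal{N}(0,\Sigma)$,
noise of variance $\sigma^2$), the maximum a posteriori estimate of $\beta$ is
a ridge-type estimator:
\begin{displaymath}
    \hat\beta_{\mathrm{MAP}}
    = \argmax_{\beta} \log p(\beta\,|\,y)
    = \argmax_{\beta}\left(-\frac{\norm{y - X\beta}^2}{2\sigma^2}
        - \frac{1}{2}\beta^*\Sigma^{-1}\beta\right)
\end{displaymath}
so that for $\Sigma = \tau^2 I_d$ the MAP estimate minimizes
$\norm{y-X\beta}^2 + \lambda\norm{\beta}^2$ with $\lambda = \sigma^2/\tau^2$,
which is exactly ridge regression.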
\section{Value of data}
The experimenter does not know the function which appears in
\eqref{eq:hypothesis-model}. Instead, he has a given set of \emph{hypotheses}
$\mathcal{H}$ and prior knowledge of the true hypothesis $h$, which is modeled
by a random variable $H$ over $\mathcal{H}$. The data model in
\eqref{eq:hypothesis-model} expresses that, conditioned on the hypothesis, the
answers to the experimenter's queries are independent.
The prior distribution can also be seen as an expression of the uncertainty of
the experimenter about the true hypothesis $h$. The uncertainty of the
distribution $P$ of $H$ can classically be measured by its entropy
$\mathbb{H}(H)$\footnote{Here we choose to write the entropy of a discrete
    distribution, but our results do not rely on this assumption: if the
    distribution is continuous, one simply needs to replace the entropy by the
    differential entropy.}:
\begin{displaymath}
    \mathbb{H}(H) = -\sum_{h\in \mathcal{H}}P(H = h)\log\big(P(H = h)\big)
\end{displaymath}
Given a subset $S\subset \mathcal{I}$ of users, we denote by $\mathcal{A}_S$ the
set of answers given by the users in $S$ according to the experimenter's
knowledge of the hypothesis:
\begin{displaymath}
\mathcal{A}_S = \{Y_i,\, i\in S\} \quad\mathrm{with}
\quad Y_i = H(x_i) + \varepsilon_i
\end{displaymath}
The goal of the experimenter is to use the answers of the users to learn the
true hypothesis. This setup is the one commonly found in Bayesian Active
Learning with Noise. In this setup, the value of data, or value of information,
when entropy is used as the measure of uncertainty, is simply the decrease
in entropy implied by observing the information. In our case, this leads to
defining the value function $V$:
\begin{equation}\label{eq:value}
    \forall S\subset \mathcal{I},\; V(S)
    = \mathbb{H}(H) - \mathbb{H}(H\,|\,\mathcal{A}_S)
\end{equation}
where $\mathbb{H}(H\,|\,\mathcal{A}_S)$ is the conditional entropy of $H$ given
$\mathcal{A}_S$. One can also recognize that the value defined above is simply
the mutual information $I(H;\mathcal{A}_S)$ between $H$ and $\mathcal{A}_S$.
Using the \emph{information never hurts} principle, it is
easy to see that the value function defined in \eqref{eq:value} is nonnegative
and set increasing. Moreover, since the answers are independent conditioned on
$H$, this value function is also submodular (a classical fact in Bayesian
experimental design; see the Appendix for the relevant definitions), which is
the key property behind the optimization results of the next section. Thus one
could extend the definition of the value to be any increasing function of the
mutual information; a concave increasing function additionally preserves
submodularity, as follows from the proposition proved in the Appendix.
The motivation for this definition of value comes from the information-theoretic
interpretation of the entropy: suppose the experimenter has access to an oracle
to whom he can ask yes/no questions. Then the entropy of the distribution is,
up to one bit, the expected number of questions he needs to ask to fully know
the hypothesis. If he has to pay for each question asked to the oracle, our
definition of value directly relates to the cost decrease implied by observing
a set of users.
\section{Economics of data}
Independently of the chosen definition of value, several optimization problems,
motivated by economic considerations, naturally arise in the above setup.
\subsection{Optimal observation selection}
If we assume that the experimenter pays the same cost $c$ for each user that he
includes in his experiment, his goal will be to maximize the value of his
experiment while minimizing his cost. Because the cost does not depend on the
users, this is equivalent to maximizing the value while bounding the size of
the chosen set of users. Hence, the optimization problem takes the following
form:
\begin{equation}\label{eq:observation-selection}
S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \vert S\vert \leq k}} V(S)
\end{equation}
\emph{Note: for non-decreasing submodular value functions, a $(1-1/e)$
    approximation algorithm (the greedy algorithm sketched below) has been
    known since Nemhauser et al. (1978).}
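The greedy rule simply adds, at each step, the user with the largest marginal
contribution. A minimal Python sketch, assuming \texttt{value} is some
implementation of $V$ (all names here are hypothetical):
\begin{verbatim}
# Greedy selection: maximize V(S) subject to |S| <= k,
# assuming V is non-decreasing and submodular.
def greedy_selection(users, k, value):
    S = set()
    for _ in range(k):
        # Pick the user with the largest marginal contribution.
        best = max((u for u in users if u not in S),
                   key=lambda u: value(S | {u}) - value(S),
                   default=None)
        if best is None:
            break
        S.add(best)
    return S
\end{verbatim}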
The fact that there is a function defining the value of a set of users
strongly suggests that some users are more valuable than others. It is then
natural that each user $i$ has a specific cost $c_i$. The above optimization
problem then takes the form:
\begin{equation}\label{eq:cost-observation-selection}
S^* = \argmax_{\substack{S\subset \mathcal{I}\\ \sum_{i\in S}c_i\leq B}} V(S)
\end{equation}
\emph{Note: this is known as the budgeted maximization problem, or maximization
    under a knapsack/linear constraint. This problem is much harder than
    the previous one. A $(1-1/e)$ approximation algorithm was known in the case
    where the value function is the one used in the Max-Coverage problem; for
    general non-decreasing submodular functions, this is a result of
    Sviridenko (2004), ``A note on maximizing a submodular set function subject
    to a knapsack constraint''.}
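Sviridenko's algorithm is a cost-benefit greedy with partial enumeration over
small seed sets. The sketch below implements the simpler cost-benefit greedy
safeguarded by the best affordable singleton; variants of this combination are
known to achieve constant-factor guarantees, e.g. $\frac{1}{2}(1-1/e)$
(Leskovec et al., 2007). Names are hypothetical, as above:
\begin{verbatim}
# Budgeted selection: maximize V(S) subject to total cost <= B,
# assuming V is non-decreasing and submodular.
def budgeted_greedy(users, costs, B, value):
    S, spent = set(), 0.0
    while True:
        affordable = [u for u in users
                      if u not in S and spent + costs[u] <= B]
        if not affordable:
            break
        # Cost-benefit rule: marginal value per unit of cost.
        best = max(affordable,
                   key=lambda u: (value(S | {u}) - value(S)) / costs[u])
        S.add(best)
        spent += costs[best]
    # Guard against the greedy missing one very valuable user.
    singles = [u for u in users if costs[u] <= B]
    if singles:
        top = max(singles, key=lambda u: value({u}))
        if value({top}) > value(S):
            return {top}
    return S
\end{verbatim}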
\subsection{Budget feasible auctions}
In this section, the experimenter wants to compensate users for joining the
experiment. Because the payments made by the experimenter are based on the
costs reported by the users, reporting a false cost could be a strategy for a
user to maximize his revenue.
Therefore, this is an auction setup: the experimenter observes the costs
reported by the users, $c_i, i\in\mathcal{I}$, and selects a subset $S$ of
users, as well as payments $p_i, i\in S$, such that the mechanism is:
\begin{itemize}
\item \textbf{individually rational}: each selected user is paid at least
    his cost, $p_i \geq c_i$ for all $i\in S$
\item \textbf{truthful}: reporting his true cost is a dominant strategy
    for each user
\item \textbf{budget-constrained}: $\sum_{i\in S} p_i \leq B$
\item \textbf{approximation}: $OPT \leq \alpha V(S)$, where $OPT$ is the value
    of the solution to the budgeted maximization problem in
    \eqref{eq:cost-observation-selection}
\end{itemize}
\emph{Note: Yaron Singer (2010), ``Budget feasible mechanisms'', gives a
    polynomial-time truthful mechanism with approximation ratio 117.7 for this
    problem. It is also proven that no truthful mechanism can do better than
    $2-\epsilon$ for any $\epsilon > 0$.}
\subsection{Value sharing}
\section{Value of data in the Gaussian world}
In this section we assume a multivariate Gaussian model:
\begin{displaymath}
    y = \beta^*x + \epsilon
\end{displaymath}
where $x,\beta\in\mathbf{R}^d$. The prior distribution of $\beta$ is a
multivariate normal distribution of mean zero and covariance $\Sigma$, and the
noise $\epsilon$ is modeled by a normal distribution of mean zero and variance
$\sigma^2$.
Note that this model is the probabilistic model underlying ridge regression. To
compute the value of a subset $S$ as defined above, we have to compute the
differential entropy of the posterior distribution after observing
the set $S$, which is exactly the distribution whose mean is the ridge
regression estimator. Thus the following computation can be seen as a study
of how the ridge regression estimator evolves as more points are observed.
Let us start by computing the differential entropy of the multivariate normal
prior:
\begin{displaymath}
    H(\beta) = -\frac{1}{C}
    \int_{b\in\mathbf{R}^d} \exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)
    \log\left(\frac{1}{C}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)\right)db
\end{displaymath}
where:
\begin{displaymath}
C = \big(\sqrt{2\pi}\big)^d\sqrt{\det\Sigma}
= \int_{b\in\mathbf{R}^d}\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
\end{displaymath}
By expanding the logarithm, we get:
\begin{displaymath}
H(\beta) = \log(C) + \frac{1}{2C}\int_{b\in\mathbf{R}^d}
b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
\end{displaymath}
One can notice that:
\begin{displaymath}
\frac{1}{C}\int_{b\in\mathbf{R}^d}
b^*\Sigma^{-1}b\exp\left(-\frac{1}{2}b^*\Sigma^{-1}b\right)db
= \expt{b^*\Sigma^{-1}b}
\end{displaymath}
where $b$ follows a multivariate normal distribution of covariance $\Sigma$
and mean zero.
\begin{displaymath}
    \expt{b^*\Sigma^{-1}b} = \expt{\trace\big(\Sigma^{-1}bb^*\big)}
    = \trace\big(\Sigma^{-1}\Sigma\big)
    = \trace\big(I_d\big)
    = d
\end{displaymath}
Finally:
\begin{displaymath}
    H(\beta) = \frac{1}{2}\log\big((2\pi)^d\det\Sigma\big) + \frac{d}{2}
    = \frac{1}{2}\log\big((2\pi e)^d\det\Sigma\big)
\end{displaymath}
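This closed form is easy to check numerically; a minimal sketch (assuming NumPy
and SciPy are available, with hypothetical variable names):
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

d = 3
A = np.random.randn(d, d)
Sigma = A @ A.T + np.eye(d)  # a random covariance matrix

# 0.5 * log((2*pi*e)^d * det(Sigma)), via slogdet for stability
formula = 0.5 * (d * np.log(2 * np.pi * np.e)
                 + np.linalg.slogdet(Sigma)[1])
builtin = multivariate_normal(mean=np.zeros(d), cov=Sigma).entropy()
assert np.isclose(formula, builtin)
\end{verbatim}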
\subsubsection*{Conditional entropy}
Let $S$ be a subset of $\mathcal{I}$ of size $n$. Let us denote by
$x_1,\ldots,x_n$ the feature vectors of the users in $S$ and by
$Y_1, \ldots, Y_n$ the associated random variables of interest.
Using the Bayes rule for conditional entropy, we get that:
\begin{displaymath}
H(\beta\,|\,Y_1,\ldots ,Y_n) = H(Y_1,\ldots,Y_n\,|\,\beta)
+H(\beta) - H(Y_1,\ldots, Y_n)
\end{displaymath}
Conditioned on $\beta$, $(Y_1,\ldots,Y_n)$ follows a multivariate normal
distribution of mean $X\beta$, where $X$ is the $n\times d$ design matrix whose
$i$-th row is $x_i^*$, and of covariance matrix $\sigma^2 I_n$. Hence:
\begin{displaymath}
H(Y_1,\ldots,Y_n\,|\,\beta)
= \frac{1}{2}\log\left((2\pi e)^n \det(\sigma^2I_n)\right)
\end{displaymath}
Marginally, $(Y_1,\ldots,Y_n)$ follows a multivariate normal distribution of
mean 0. Let us denote by $\Sigma_Y$ its covariance matrix; writing
$Y = X\beta + E$ where $E = (\varepsilon_1,\ldots,\varepsilon_n)$ is the noise
vector, independent of $\beta$:
\begin{align}
    \Sigma_Y & = \expt{YY^*} = \expt{(X\beta + E)(X\beta + E)^*}\\
             & = X\Sigma X^* + \sigma^2I_n
\end{align}
which yields the following expression for the joint entropy:
\begin{displaymath}
H(Y_1,\ldots, Y_n)
= \frac{1}{2}\log\left((2\pi e)^n \det(X\Sigma X^* + \sigma^2 I_n)\right)
\end{displaymath}
Combining the above equations, and using Sylvester's determinant identity
$\det(X\Sigma X^* + \sigma^2 I_n)
= \sigma^{2n}\det\left(I_d + \Sigma\frac{X^*X}{\sigma^2}\right)$, we get:
\begin{align}
    H(\beta\,|\, Y_1,\ldots,Y_n)
    & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma
        \det\left(I_d + \Sigma \frac{X^*X}{\sigma^2}\right)^{-1}\right)\\
    & = \frac{1}{2}\log\left((2\pi e)^d
        \det\left(\Sigma^{-1} + \frac{X^*X}{\sigma^2}\right)^{-1}\right)
\end{align}
\subsubsection*{Value of data}
By combining the formulas for the entropy and the conditional entropy, we get
that the value of a set $S$ of points is equal to:
\begin{align}
    V(S) & = \frac{1}{2}\log\left((2\pi e)^d \det \Sigma\right)
    - \frac{1}{2}\log\left((2\pi e)^d \det
    \left(\Sigma^{-1}+\frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
    & = \frac{1}{2}\log\left(\det\left(I_d
    + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\right)
\end{align}
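For concreteness, here is a minimal numerical sketch of this closed form
(hypothetical names, assuming NumPy), where \texttt{X\_S} is the design matrix
of the selected users:
\begin{verbatim}
import numpy as np

def value(X_S, Sigma, sigma2):
    """V(S) = 0.5 * log det(I_d + Sigma X_S^* X_S / sigma^2)."""
    d = Sigma.shape[0]
    M = np.eye(d) + Sigma @ (X_S.T @ X_S) / sigma2
    # slogdet is numerically safer than log(det(M)).
    return 0.5 * np.linalg.slogdet(M)[1]
\end{verbatim}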
\subsubsection*{Marginal contribution}
Here, we want to compute the marginal contribution of a point $x$ to a set $S$ of
users:
\begin{displaymath}
\Delta_xV(S) = V(S\cup \{x\}) - V(S)
\end{displaymath}
Using that:
\begin{displaymath}
X_{S\cup\{x\}}^*X_{S\cup\{x\}} = X_S^*X_S + xx^*
\end{displaymath}
we get:
\begin{align}
    \Delta_x V(S) & = \frac{1}{2}\log\det\left(I_d
    + \Sigma\frac{X_S^*X_S}{\sigma^2} + \Sigma\frac{xx^*}{\sigma^2}\right)
    - \frac{1}{2}\log\det\left(I_d + \Sigma\frac{X_S^*X_S}{\sigma^2}\right)\\
    & = \frac{1}{2}\log\det\left(I_d + \frac{xx^*}{\sigma^2}\left(\Sigma^{-1} +
    \frac{X_S^*X_S}{\sigma^2}\right)^{-1}\right)\\
    & = \frac{1}{2}\log\left(1 + \frac{1}{\sigma^2}x^*\left(\Sigma^{-1}
    + \frac{X_S^*X_S}{\sigma^2}\right)^{-1}x\right)
\end{align}
where the last equality follows from the matrix determinant lemma,
$\det(I_d + uv^*) = 1 + v^*u$.
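The marginal contribution thus reduces to a single quadratic form, so greedy
selection is cheap in this Gaussian world. A sketch in the same hypothetical
conventions as the previous snippet:
\begin{verbatim}
import numpy as np

def marginal_contribution(x, X_S, Sigma, sigma2):
    """Delta_x V(S), i.e. V(S + {x}) - V(S), in closed form."""
    M = np.linalg.inv(Sigma) + (X_S.T @ X_S) / sigma2
    # Quadratic form x^* M^{-1} x via a linear solve (avoids inv(M)).
    q = x @ np.linalg.solve(M, x)
    return 0.5 * np.log1p(q / sigma2)
\end{verbatim}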
\section*{Appendix: Submodularity}
In this section, we consider that we are given a \emph{universe} set $U$.
A set function $f$ is a function defined on the power set of $U$, denoted
$\mathfrak{P}(U)$.
A set function $f$ defined on $\mathfrak{P}(U)$ is said to be \emph{increasing}
if it is increasing with regards to inclusion, that is:
\begin{displaymath}
    \forall\,S\subseteq T,\quad f(S)\leq f(T)
\end{displaymath}
A \emph{decreasing} function on $\mathfrak{P}(U)$ is defined similarly.
A set function $f$ defined on $\mathfrak{P}(U)$ is said to be
\emph{submodular} if it satisfies the diminishing returns property, that is,
the marginal increment from adding a point to a set is a set-decreasing
function. More formally, for any point $x$ in $U$, the marginal increment of
$f$ at $x$ is the set function defined by:
\begin{displaymath}
    \Delta_x f(S) = f(S\cup\{x\}) - f(S)
\end{displaymath}
Then, $f$ is \emph{submodular} if and only if, for all $x$ in $U$, $\Delta_x f$
is a set-decreasing function.
Similarly, a \emph{supermodular} function is one whose marginal increments are
set-increasing functions. Note that in both cases the same inequalities hold
when the single point $x$ is replaced by an arbitrary set $V$ (by adding the
points of $V$ one at a time); we use this form in the proof below.
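These definitions can be checked by brute force on a small universe; the
following sketch (a hypothetical helper, assuming the set function is given as
a Python callable on frozensets) tests submodularity directly from the
definition:
\begin{verbatim}
from itertools import combinations

def is_submodular(universe, f):
    """Check Delta_x f(S) >= Delta_x f(T) for all S <= T, x not in T."""
    subsets = [frozenset(c) for r in range(len(universe) + 1)
               for c in combinations(universe, r)]
    for S in subsets:
        for T in subsets:
            if not S <= T:
                continue
            for x in universe:
                if x in T:
                    continue
                dS = f(S | {x}) - f(S)
                dT = f(T | {x}) - f(T)
                if dS < dT - 1e-12:  # tolerance for float inputs
                    return False
    return True
\end{verbatim}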
The following proposition makes precise the remark of the value section:
applied to $f(S) = \mathbb{H}(H\,|\,\mathcal{A}_S)$, which is decreasing and
supermodular when the value is submodular, it shows that concave increasing
transforms of the mutual information remain submodular.
\begin{prop}
    Let $R:\mathbf{R}\rightarrow \mathbf{R}$ be a decreasing concave function and
    $f:\mathfrak{P}(U)\rightarrow\mathbf{R}$ be a decreasing supermodular function,
    then the composed function $R\circ f$ is increasing and submodular.
\end{prop}
\begin{proof}
The increasingness of $R\circ f$ follows immediately from the decreasingness
of $R$ and $f$.
For the submodularity of $R\circ f$, let $S$ and $T$ be two sets such that
$S\subseteq T$. By decreasingness of $f$, we have:
\begin{displaymath}
    \forall\,V,\quad f(T)\leq f(S)\quad\mathrm{and}\quad f(T\cup V)\leq f(S\cup V)
\end{displaymath}
In what follows we may assume that the differences $f(S)-f(S\cup V)$ and
$f(T)-f(T\cup V)$, which are nonnegative since $f$ is decreasing, are in fact
positive: if either vanishes, the desired inequality is immediate. The slope of
a chord of the concave function $R$ is nonincreasing in each of its endpoints,
and both endpoints for $T$ lie below those for $S$, so:
\begin{equation}\label{eq:base}
    \forall\,V,\quad\frac{R\big(f(S)\big)-R\big(f(S\cup V)\big)}{f(S)-f(S\cup V)}
    \leq\frac{R\big(f(T)\big)-R\big(f(T\cup V)\big)}{f(T)-f(T\cup V)}
\end{equation}
Both denominators are positive, so multiplying \eqref{eq:base} by
$f(S)-f(S\cup V)$ and $f(T)-f(T\cup V)$ yields:
\begin{multline}
    \forall V,\quad\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big)\\
    \leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)
\end{multline}
$f$ is supermodular, so:
\begin{displaymath}
    f(T)-f(T\cup V)\leq f(S)-f(S\cup V)
\end{displaymath}
$R\circ f$ is increasing, so:
\begin{displaymath}
R\big(f(S)\big)-R\big(f(S\cup V)\big)\leq 0
\end{displaymath}
By combining the two previous inequalities, we get:
\begin{multline*}
\forall V,\quad\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)\\
\leq \Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(T)-f(T\cup V)\big)
\end{multline*}
Chaining this last inequality with the product form of \eqref{eq:base} gives:
\begin{multline*}
    \forall V,\quad\Big(R\big(f(S)\big)-R\big(f(S\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)\\
    \leq \Big(R\big(f(T)\big)-R\big(f(T\cup V)\big)\Big)\big(f(S)-f(S\cup V)\big)
\end{multline*}
Dividing both sides by $f(S)-f(S\cup V)>0$ yields:
\begin{displaymath}
    \forall V,\quad R\big(f(S)\big)-R\big(f(S\cup V)\big)
    \leq R\big(f(T)\big)-R\big(f(T\cup V)\big)
\end{displaymath}
which rearranges to
$R\big(f(T\cup V)\big)-R\big(f(T)\big)\leq R\big(f(S\cup V)\big)-R\big(f(S)\big)$:
the marginal increments of $R\circ f$ are set decreasing, which is exactly the
submodularity of $R\circ f$.
\end{proof}
\end{document}