\section{Algorithms}
\label{sec:algorithms}

In Section~\ref{sec:uniqueness}, we showed that a nearest-neighbor classifier
can accurately predict whether two sets of skeletal measurements belong to the
same person when the measurement error is small. In this section, we propose a
probabilistic model for skeleton recognition, in which a skeleton is classified
based on its distance from the average skeleton profiles of the people in the
training set.


\subsection{Mixture of Gaussians}
\label{sec:mixture of Gaussians}

A mixture of Gaussians \cite{bishop06pattern} is a generative probabilistic
model, which is typically applied to modeling problems where class densities
are unimodal and the feature space is low-dimensional. The joint probability
distribution of the model is given by:
\begin{align}
  P(\bx, y) =  \cN(\bx | \bar{\bx}_y, \Sigma) P(y),
  \label{eq:mixture of Gaussians}
\end{align}
where $P(y)$ is the probability of class $y$ and $\cN(\bx | \bar{\bx}_y,
\Sigma)$ is a multivariate normal distribution, which models the density of
$\bx$ given $y$. The mean of the distribution is $\bar{\bx}_y$ and the variance
of $\bx$ is captured by the covariance matrix $\Sigma$. The decision boundary
between any two classes is known to be linear when all conditionals $\cN(\bx
| \bar{\bx}_y, \Sigma)$ share the same covariance matrix \cite{bishop06pattern}.
In this setting, the mixture of Gaussians model can be viewed as a
probabilistic variant of the nearest-neighbor (NN) classifier in
Section~\ref{sec:uniqueness}.
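
Under a shared covariance matrix, the linearity of the decision boundary can be
seen from the log-odds between any two classes $y$ and $y'$, which is an affine
function of $\bx$:
\begin{align*}
  \log \frac{P(y | \bx)}{P(y' | \bx)} =
  (\bar{\bx}_y - \bar{\bx}_{y'})^\top \Sigma^{-1} \bx
  - \frac{1}{2} \left(\bar{\bx}_y^\top \Sigma^{-1} \bar{\bx}_y
  - \bar{\bx}_{y'}^\top \Sigma^{-1} \bar{\bx}_{y'}\right)
  + \log \frac{P(y)}{P(y')},
\end{align*}
so the set of points where any two posteriors are equal is a hyperplane.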

The mixture of Gaussians model has two main advantages. First, the model can be
easily learned by maximum-likelihood (ML) estimation \cite{bishop06pattern}.
In particular, $P(y)$ is estimated by the relative frequency of $y$ in the
training set, $\bar{\bx}_y$ by the empirical mean of $\bx$ given $y$, and the
shared covariance matrix by $\Sigma = \sum_y P(y) \Sigma_y$, where $\Sigma_y$
is the empirical covariance of $\bx$ given $y$. Second, inference in the model
can be performed in closed form. In particular, the model predicts $\hat{y} =
\arg\max_y P(y | \bx)$, where:

\begin{align}
  P(y | \bx) =
  \frac{P(\bx | y) P(y)}{\sum_{y'} P(\bx | y') P(y')} =
  \frac{\cN(\bx | \bar{\bx}_y, \Sigma) P(y)}{\sum_{y'} \cN(\bx | \bar{\bx}_{y'}, \Sigma) P(y')}.
  \label{eq:inference}
\end{align}

In practice, the prediction $\hat{y}$ is accepted only when the classifier is
confident, that is, when $P(\hat{y} | \bx) \! > \! \delta$, where $\delta \in
(0, 1)$ is a threshold that controls the precision and recall of the
classifier. In general, the higher the threshold $\delta$, the lower the recall
and the higher the precision.
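
For concreteness, the following Python sketch (a minimal illustration with
hypothetical variable names, not the implementation used in our experiments)
fits the model by ML estimation and makes thresholded predictions as in
(\ref{eq:inference}):
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def fit(X, Y):
    # ML estimates: class priors P(y), class means, and the
    # pooled covariance matrix shared by all classes.
    classes = np.unique(Y)
    priors = np.array([np.mean(Y == c) for c in classes])
    means = np.array([X[Y == c].mean(axis=0) for c in classes])
    cov = sum(p * np.cov(X[Y == c], rowvar=False, bias=True)
              for p, c in zip(priors, classes))
    return classes, priors, means, cov

def predict(x, classes, priors, means, cov, delta=0.9):
    # Posterior P(y | x) via Bayes' rule with a shared covariance.
    likes = np.array([multivariate_normal.pdf(x, mean=m, cov=cov)
                      for m in means])
    post = likes * priors
    post = post / post.sum()
    best = np.argmax(post)
    # Accept the prediction only when the classifier is confident.
    return classes[best] if post[best] > delta else None
\end{verbatim}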

In this work, we use the mixture of Gaussians model for skeleton recognition.
Skeleton measurements are represented by a vector $\bx$ and each person is
assigned to one class $y$. In particular, our dataset $\cD = \set{(\bx_1, y_1),
\dots, (\bx_n, y_n)}$ consists of $n$ pairs $(\bx_i, y_i)$, where $y_i$ is the
label of the skeleton $\bx_i$. To verify that our method is suitable for
skeleton recognition, we plot, for each skeleton feature $x_k$
(Section~\ref{sec:experiment}), the histogram of differences
$(\bx_i)_k - \E{}{x_k | y_i}$ between the measurements and their expectations
given the class (Figure~\ref{fig:error marginals}). All histograms look
approximately normal. This suggests that the class conditionals $P(\bx | y)$
are well approximated by multivariate normal distributions and that our
generative model, although very simple, may be nearly optimal
\cite{bishop06pattern}.
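
A minimal sketch of this check, reusing the quantities from the previous code
sketch, computes the per-feature residuals and plots their histograms
(matplotlib assumed):
\begin{verbatim}
import matplotlib.pyplot as plt

def residuals(X, Y, classes, means):
    # Difference between each measurement and its class mean,
    # i.e., (x_i)_k - E[x_k | y_i] for every feature k.
    idx = np.searchsorted(classes, Y)
    return X - means[idx]

R = residuals(X, Y, classes, means)
for k in range(R.shape[1]):
    plt.figure()
    plt.hist(R[:, k], bins=50, density=True)
plt.show()
\end{verbatim}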

\begin{figure}[t]
  \centering
  \includegraphics{graphics/limbs.pdf}
  \caption{Histograms of differences between 9 skeleton measurements
  $x_k$ (Section~\ref{sec:experiment}) and their expectation given the
  class $y$. In red, the p.d.f. of a normal distribution with mean and
  variance equal to the empirical mean and variance of the measurement.}
  \label{fig:error marginals}
\end{figure}


\subsection{Sequential hypothesis testing}
\label{sec:SHT}

The mixture of Gaussians model can be extended to temporal inference through
sequential hypothesis testing. Sequential hypothesis testing
\cite{wald47sequential} is an established statistical framework in which a
subject is sequentially tested for belonging to one of several classes.
Assuming that the measurements are conditionally independent given the class,
the probability that the sequence of data $\bx^{(1)}, \dots, \bx^{(t)}$ belongs
to class $y$ at time $t$ is given by:

\begin{align}
  P(y | \bx^{(1)}, \dots, \bx^{(t)}) =
  \frac{\prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_y, \Sigma) P(y)}
  {\sum_{y'} \prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_{y'}, \Sigma) P(y')}.
  \label{eq:SHT}
\end{align}
In practice, the prediction $\hat{y} = \arg\max_y P(y | \bx^{(1)}, \dots,
\bx^{(t)})$ is accepted only when the classifier is confident, that is, when
$P(\hat{y} | \bx^{(1)}, \dots, \bx^{(t)}) > \delta$, where the threshold
$\delta \in (0, 1)$ controls the precision and recall of the predictor. In
general, the higher the threshold $\delta$, the higher the precision and the
lower the recall.
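
A minimal sketch of this sequential update, continuing the earlier code sketch,
accumulates per-class log-likelihoods over the sequence, which is numerically
preferable to multiplying densities:
\begin{verbatim}
from scipy.special import logsumexp

def sequential_predict(xs, classes, priors, means, cov, delta=0.9):
    # Unnormalized log-posterior after observing x^(1), ..., x^(t).
    logpost = np.log(priors)
    for x in xs:
        logpost = logpost + np.array(
            [multivariate_normal.logpdf(x, mean=m, cov=cov)
             for m in means])
    # Normalize in log space and accept only a confident prediction.
    post = np.exp(logpost - logsumexp(logpost))
    best = np.argmax(post)
    return classes[best] if post[best] > delta else None
\end{verbatim}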

Sequential hypothesis testing is a common technique for smoothing temporal
predictions. In particular, note that the prediction at time $t$ depends on all
data up to time $t$. This reduces the variance of predictions, especially when
input data are noisy, such as in the domain of skeleton recognition.

In skeleton recognition, the sequence $\bx^{(1)}, \dots, \bx^{(t)}$ consists of
skeleton measurements of a single person, for instance one walking towards the
camera. If the Kinect detects multiple people, we use the figure tracking
provided by the tools in \xref{sec:experiment:dataset} to separate the
individual skeleton sequences.