\section{Algorithms}

\label{sec:algorithms}

In Section~\ref{sec:uniqueness}, we showed that a nearest-neighbor classifier
can accurately predict whether two sets of skeletal measurements belong to the
same person, provided the measurement error is small. In this section, we
propose a probabilistic model for skeleton measurements that leads to a simple
algorithm for skeleton recognition.

\subsection{Mixture of Gaussians}
\label{sec:mixture of Gaussians}

In this section, skeleton measurements are represented by a feature vector
$\bx\in\mathbf{R}^9$ (see Section~\ref{sec:experiment-design} for a description
of the skeleton features). Each measurement is labeled with a class $y$
representing the identity of the individual it belongs to (there are as many
classes as there are participants in the experiment).

We want a probabilistic model for the joint distribution $P(\bx, y)$. Our
dataset $\cD = \set{(\bx_1, y_1),\ldots,(\bx_n, y_n)}$ consists of $n$ pairs
$(\bx_i, y_i)$, where $y_i$ is the label of the skeleton $\bx_i$. For each
feature $k\in \set{1,\ldots, 9}$, we plot the histogram of the set of
differences $\set{(\bx_i)_k - (\bar{\bx}_{y_i})_k,\; i\in\set{1,\ldots, n}}$
between a measurement of a feature $(\bx_i)_k$ and the average of all the
measurements of the same feature belonging to the same class,
$(\bar{\bx}_{y_i})_k$. These histograms are presented in Figure~\ref{fig:error
marginals}. All histograms look approximately normal, which suggests that the
class conditionals $P(\bx | y)$ are multivariate normal.
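These per-feature residuals are straightforward to compute directly. The
sketch below is illustrative only (it assumes \texttt{numpy}; the function
name is ours, and the histogram plotting itself is omitted):

```python
import numpy as np

def residuals_per_feature(X, y):
    """Center each measurement by its class mean, giving the
    differences (x_i)_k - (xbar_{y_i})_k whose histograms are
    inspected for normality (one histogram per feature k)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # mean of all measurements belonging to each class
    class_means = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    # subtract each row's own class mean
    return X - np.array([class_means[c] for c in y])
```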

Based on these observations, our model takes the form:
\begin{equation}\label{eq:mixture of Gaussians}
    P(\bx, y) = \cN(\bx | \bar{\bx}_y, \Sigma) P(y)
\end{equation}
which means that the conditional distribution of $\bx$ given $y$ is
multivariate normal with mean $\bar{\bx}_y = \E{}{\bx | y}$ and covariance
$\Sigma$. Note that the covariance matrix $\Sigma$ does not depend on the class
$y$: we assume that the observed variance is caused only by the noise
introduced by the measurement process (the low resolution of the Kinect
camera and the approximations of the skeleton-mapping algorithm provided by
the Microsoft SDK), not by the identity of the individual walking in front of
the camera.

This probabilistic model is called a mixture of Gaussians
\cite{bishop06pattern} and has several advantages. First, the parameters of
the model are easy to learn: $P(y)$, the prior probability of class $y$, is
simply the frequency of $y$ in the training set; $\bar{\bx}_y$ is the sample
mean of the observations belonging to class $y$; and $\Sigma$ is learned as
$\Sigma = \sum_y P(y)\Sigma_y$, where $\Sigma_y$ is the sample
covariance matrix of $\bx$ given $y$. Second, inference in the model can be
performed in closed form: given an observation $\bx$, the model predicts the
maximum a posteriori class $\hat{y} = \arg\max_y P(y | \bx)$, where:
\begin{equation}\label{eq:inference}
    P(y|\bx) = \frac{P(\bx, y)}{\sum_{y'} P(\bx|y')P(y')}
    = \frac{\cN(\bx|\bar{\bx}_y, \Sigma)P(y)}{\sum_{y'}\cN(\bx|\bar{\bx}_{y'},
    \Sigma)P(y')}
\end{equation}
In this setting, the decision boundary between two classes $y_1$ and $y_2$,
$\set{\bx | P(\bx, y_1) = P(\bx, y_2)}$, is a hyperplane \cite{bishop06pattern},
and the mixture of Gaussians model can be viewed as a probabilistic variant of
the nearest-neighbor (NN) classifier of Section~\ref{sec:uniqueness}.
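To see why this boundary is linear, note that since $\Sigma$ is shared by all
classes, the quadratic term $\bx^\top\Sigma^{-1}\bx$ cancels in the log-ratio
of the two joint densities:
\[
    \log\frac{P(\bx, y_1)}{P(\bx, y_2)}
    = (\bar{\bx}_{y_1} - \bar{\bx}_{y_2})^\top \Sigma^{-1} \bx
    - \frac{1}{2}\left(\bar{\bx}_{y_1}^\top \Sigma^{-1} \bar{\bx}_{y_1}
    - \bar{\bx}_{y_2}^\top \Sigma^{-1} \bar{\bx}_{y_2}\right)
    + \log\frac{P(y_1)}{P(y_2)},
\]
which is affine in $\bx$, so the set where it vanishes is a hyperplane.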

In practice, the prediction $\hat{y}$ is accepted when the classifier is
confident, that is, $P(\hat{y} | \bx) \! > \! \delta$, where $\delta \in
(0, 1)$ is a threshold that controls the precision and recall of the
classifier. In general, the higher the threshold $\delta$, the lower the recall
and the higher the precision.
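Concretely, learning the parameters and making a thresholded prediction can be
sketched as follows. This is an illustrative implementation only (it assumes
\texttt{numpy}, and the function names are ours, not part of any released
code):

```python
import numpy as np

def fit_gaussian_mixture(X, y):
    """Learn the priors P(y), the class means, and the pooled
    covariance Sigma = sum_y P(y) Sigma_y from labeled features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # pooled (prior-weighted) maximum-likelihood covariance
    sigma = sum(p * np.cov(X[y == c], rowvar=False, bias=True)
                for p, c in zip(priors, classes))
    return classes, priors, means, sigma

def predict(x, classes, priors, means, sigma, delta=0.5):
    """MAP prediction; the label is rejected (None) unless the
    posterior of the best class exceeds the threshold delta."""
    inv = np.linalg.inv(sigma)
    diffs = x - means
    # log N(x | mean_y, Sigma) up to a constant shared by all classes,
    # which cancels when the posterior is normalized
    log_post = -0.5 * np.einsum('ij,jk,ik->i', diffs, inv, diffs) + np.log(priors)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    best = int(np.argmax(post))
    label = classes[best] if post[best] > delta else None
    return label, post[best]
```

Raising \texttt{delta} makes the classifier reject more borderline
observations, trading recall for precision, exactly as described above.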

\begin{figure*}[t]
  \begin{center}
    \includegraphics[width=0.99\textwidth]{graphics/hallway.png}
  \end{center}
  \vspace{-\baselineskip}
  \caption{Experiment setting. Color image, depth image, and fitted
  skeleton as captured by the Kinect in a single frame}
  \label{fig:hallway}
\end{figure*}

\subsection{Sequential hypothesis testing}
\label{sec:SHT}

In our setting (see \xref{sec:experiment-design}), skeleton measurements
are not isolated: every time a person walks in front of the
camera, we get a set of time-indexed measurements, all belonging to the same
individual, that we want to classify. The mixture of Gaussians model can be
extended to temporal inference through the sequential hypothesis
testing \cite{wald47sequential} framework. In this framework, a subject is
sequentially tested for membership in one of several classes, under the
assumption that, conditioned on the class, the measurements are independent
realizations of the same random variable. In our case, the posterior probability
at time $t$ that the sequence of measurements
$\bx^{(1)}, \dots, \bx^{(t)}$ belongs to class $y$ is given by:
\begin{equation}\label{eq:SHT}
  P(y | \bx^{(1)}, \dots, \bx^{(t)}) =
  \frac{\prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_y, \Sigma) P(y)}
  {\sum_{y'} \prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_{y'}, \Sigma) P(y')}.
\end{equation}

As in the previous section, the prediction is the maximum a posteriori class
$\hat{y} = \arg\max_y P(y | \bx^{(1)}, \dots, \bx^{(t)})$. In practice, the
prediction is accepted when the classifier is confident, that is, when $P(\hat{y}
| \bx^{(1)}, \dots, \bx^{(t)}) > \delta$, where the threshold $\delta \in (0,
1)$ controls the precision and recall of the predictor.
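Equation~(\ref{eq:SHT}) is conveniently evaluated incrementally in log space,
which avoids numerical underflow as $t$ grows. The sketch below is illustrative
only (it assumes \texttt{numpy}, with the class means, shared covariance, and
priors already learned as in Section~\ref{sec:mixture of Gaussians}):

```python
import numpy as np

def sequential_posterior(frames, means, sigma, priors):
    """Posterior over classes after observing all frames of one
    walking sequence, accumulated in log space for stability.

    frames: (t, d) array of time-indexed measurements."""
    inv = np.linalg.inv(sigma)
    log_post = np.log(priors)                 # log P(y), before any data
    for x in frames:
        diffs = x - means                     # one row per class
        # add log N(x | mean_y, Sigma) up to a shared constant,
        # which cancels in the normalization below
        log_post = log_post - 0.5 * np.einsum('ij,jk,ik->i', diffs, inv, diffs)
    post = np.exp(log_post - log_post.max())  # stabilize before exp
    return post / post.sum()
```

As in the single-frame case, the $\arg\max_y$ of this posterior is accepted
only when its value exceeds the threshold $\delta$.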

Sequential hypothesis testing is a common technique for smoothing temporal
predictions. In particular, note that the prediction at time $t$ depends on all
of the data up to time $t$. This reduces the variance of the predictions,
especially when the input data are noisy, as is the case for skeleton
measurements.