\section{Algorithms} \label{sec:algorithms}

In Section~\ref{sec:uniqueness}, we showed that a nearest-neighbor classifier can accurately predict whether two sets of skeletal measurements belong to the same person, provided that the error of the skeleton measurements is small. In this section, we propose a probabilistic model of skeleton measurements that leads to a simple algorithm for skeleton recognition.

\subsection{Mixture of Gaussians} \label{sec:mixture of Gaussians}

In this section, skeleton measurements are represented by a feature vector $\bx\in\mathbf{R}^9$ (see Section~\ref{sec:experiment-design} for a description of the skeleton features). Each measurement is labeled with a class $y$ representing the identity of the individual it belongs to; there are as many classes as people taking part in the experiment. We seek a probabilistic model of the joint distribution $P(\bx, y)$. Our dataset $\cD = \set{(\bx_1, y_1),\ldots,(\bx_n, y_n)}$ consists of $n$ pairs $(\bx_i, y_i)$, where $y_i$ is the label of the skeleton $\bx_i$.

For each feature $k\in \set{1,\ldots, 9}$, we plot the histogram of the set of differences $\set{(\bx_i)_k - (\bar{\bx}_{y_i})_k,\; i\in\set{1,\ldots, n}}$ between a measurement $(\bx_i)_k$ of feature $k$ and the average $(\bar{\bx}_{y_i})_k$ of all measurements of that feature belonging to the same class. These histograms are presented in Figure~\ref{fig:error marginals}. All of them look approximately normal, which suggests that the class conditionals $P(\bx | y)$ are multivariate normal. Based on these observations, our model takes the form:
\begin{equation}\label{eq:mixture of Gaussians}
P(\bx, y) = \cN(\bx | \bar{\bx}_y, \Sigma) P(y)
\end{equation}
that is, the conditional distribution of $\bx$ given $y$ is multivariate normal with mean $\bar{\bx}_y = \E{}{\bx | y}$ and covariance $\Sigma$. Note that the covariance matrix $\Sigma$ does not depend on the class $y$. This is because we assume that the observed variance is caused only by the noise introduced by the measurement process (the low resolution of the Kinect camera and the approximations of the skeleton-fitting algorithm provided by the Microsoft SDK), and not by the identity of the individual walking in front of the camera.

This probabilistic model is called a mixture of Gaussians \cite{bishop06pattern} and has many advantages. First, its parameters are easily learned: $P(y)$, the probability of belonging to class $y$, is simply the frequency of $y$ in the training set; $\bar{\bx}_y$ is the sample mean of the observations belonging to class $y$; and $\Sigma$ is estimated as $\Sigma = \sum_y P(y)\Sigma_y$, where $\Sigma_y$ is the sample covariance matrix of $\bx$ given $y$. Second, inference in the model can be performed in closed form: given an observation $\bx$, the model predicts the class with the highest posterior probability, $\hat{y} = \arg\max_y P(y | \bx)$, where:
\begin{equation}\label{eq:inference}
P(y|\bx) = \frac{P(\bx, y)}{\sum_{y'} P(\bx|y')P(y')} = \frac{\cN(\bx|\bar{\bx}_y, \Sigma)P(y)}{\sum_{y'}\cN(\bx|\bar{\bx}_{y'}, \Sigma)P(y')}
\end{equation}
In this setting, the decision boundary between two classes $y_1$ and $y_2$, $\set{\bx | P(\bx, y_1) = P(\bx, y_2)}$, is a hyperplane \cite{bishop06pattern}, and the mixture of Gaussians model can thus be viewed as a probabilistic variant of the nearest-neighbor (NN) classifier of Section~\ref{sec:uniqueness}. In practice, the prediction $\hat{y}$ is accepted only when the classifier is confident, that is, when $P(\hat{y} | \bx) > \delta$, where $\delta \in (0, 1)$ is a threshold that controls the precision and recall of the classifier. In general, the higher the threshold $\delta$, the higher the precision and the lower the recall.
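For concreteness, the Python sketch below shows one way to implement the training and inference steps just described. It relies on NumPy and SciPy; the function names and the default confidence threshold are illustrative choices, not part of our system.

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def fit(X, y):
    # X: n x 9 array of skeleton features; y: n integer class labels.
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])        # P(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])  # class means
    # Shared covariance: prior-weighted average of the per-class
    # sample covariances, Sigma = sum_y P(y) Sigma_y.
    cov = sum(p * np.cov(X[y == c], rowvar=False)
              for c, p in zip(classes, priors))
    return classes, priors, means, cov

def posterior(x, priors, means, cov):
    # P(y | x) as defined above, computed in log space for stability.
    log_joint = np.array([multivariate_normal.logpdf(x, m, cov) + np.log(p)
                          for m, p in zip(means, priors)])
    return np.exp(log_joint - logsumexp(log_joint))

def predict(x, classes, priors, means, cov, delta=0.9):
    # Return the most probable class, or None when confidence is below delta.
    post = posterior(x, priors, means, cov)
    k = np.argmax(post)
    return classes[k] if post[k] > delta else None
\end{verbatim}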
\begin{figure*}[t]
\begin{center}
\includegraphics[width=0.99\textwidth]{graphics/hallway.png}
\end{center}
\vspace{-1.5\baselineskip}
\caption{Experiment setting. Color image, depth image, and fitted skeleton as captured by the Kinect in a single frame.}
\label{fig:hallway}
\end{figure*}

\subsection{Sequential hypothesis testing} \label{sec:SHT}

In our setting (see \xref{sec:experiment-design}), skeleton measurements are not isolated. On the contrary, every time a person walks in front of the camera, we obtain a set of time-indexed measurements, all belonging to the same individual, that we want to classify. The mixture of Gaussians model can be extended to temporal inference through the sequential hypothesis testing framework \cite{wald47sequential}. In this framework, a subject is sequentially tested for membership in one of several classes, under the assumption that, conditioned on the class, the measurements are independent realizations of the same random variable. In our case, the probability that the sequence of measurements $\bx^{(1)}, \dots, \bx^{(t)}$ observed up to time $t$ belongs to class $y$ is given by:
\begin{equation}\label{eq:SHT}
P(y | \bx^{(1)}, \dots, \bx^{(t)}) = \frac{\prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_y, \Sigma) P(y)} {\sum_{y'} \prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_{y'}, \Sigma) P(y')}.
\end{equation}
As in the previous section, the model predicts the class with the highest posterior probability, $\hat{y} = \arg\max_y P(y | \bx^{(1)}, \dots, \bx^{(t)})$, and the prediction is accepted only when the classifier is confident, that is, when $P(\hat{y} | \bx^{(1)}, \dots, \bx^{(t)}) > \delta$, where the threshold $\delta \in (0, 1)$ again controls the precision and recall of the predictor.

Sequential hypothesis testing is a common technique for smoothing temporal predictions. In particular, the prediction at time $t$ depends on all data up to time $t$. This reduces the variance of the predictions, which is especially valuable when the input data are noisy, as in skeleton recognition.
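Because the measurements are assumed conditionally independent given the class, the posterior above can be maintained online by accumulating per-frame log-likelihoods. The sketch below extends the previous one accordingly; as before, it is an illustration in NumPy/SciPy, not the implementation used in our experiments.

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def sequential_posterior(frames, priors, means, cov):
    # Yield P(y | x^(1), ..., x^(t)) after each incoming frame x^(t).
    log_score = np.log(priors)  # starts at log P(y)
    for x in frames:
        log_score = log_score + np.array(
            [multivariate_normal.logpdf(x, m, cov) for m in means])
        yield np.exp(log_score - logsumexp(log_score))

# Usage: accept the first confident prediction along the walk.
# for post in sequential_posterior(frames, priors, means, cov):
#     k = np.argmax(post)
#     if post[k] > delta:
#         print("identified:", classes[k]); break
\end{verbatim}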