\section{Algorithms}
\label{sec:algorithms}

In Section~\ref{sec:uniqueness}, we showed that a nearest-neighbor classifier can accurately predict whether a skeleton belongs to a given person when the error of the skeleton measurements is small. In this section, we propose a probabilistic model for skeleton recognition. In this model, a skeleton is classified based on its distance from the average skeleton profiles of the people in the training set.

\subsection{Mixture of Gaussians}
\label{sec:mixture of Gaussians}

A mixture of Gaussians \cite{bishop06pattern} is a generative probabilistic model that is typically applied to modeling problems where the class densities are unimodal and the feature space is low-dimensional. The joint probability distribution of the model is given by:
\begin{align}
  P(\bx, y) = \cN(\bx | \bar{\bx}_y, \Sigma) P(y),
  \label{eq:mixture of Gaussians}
\end{align}
where $P(y)$ is the probability of class $y$ and $\cN(\bx | \bar{\bx}_y, \Sigma)$ is a class conditional that is modeled as a multivariate normal distribution. The distribution is centered at a vector $\bar{\bx}_y$, and the spread of the class is described by the covariance matrix $\Sigma$. The decision boundary between any two classes $y_1$ and $y_2$, $P(\bx, y_1) \! = \! P(\bx, y_2)$, is linear when all class conditionals have the same covariance matrix $\Sigma$ \cite{bishop06pattern}. Thus, the mixture of Gaussians model is a probabilistic version of the nearest-neighbor (NN) classifier from Section~\ref{sec:uniqueness}.

The mixture of Gaussians model has many advantages. First, the model can be easily learned using maximum-likelihood (ML) estimation \cite{bishop06pattern}. In particular, $P(y)$ is the frequency of class $y$ in the training data, $\bar{\bx}_y$ is the empirical mean of $\bx$ on the examples of class $y$, and the covariance matrix $\Sigma$ is estimated as a weighted sum $\Sigma = \sum_y P(y) \Sigma_y$, where $\Sigma_y$ is the covariance matrix corresponding to class $y$. Second, the inference in the model can be performed in a closed form. In particular, the predicted label is given by $\hat{y} = \arg\max_y P(y | \bx)$, where:
\begin{align}
  P(y | \bx)
  = \frac{P(\bx | y) P(y)}{\sum_{y'} P(\bx | y') P(y')}
  = \frac{\cN(\bx | \bar{\bx}_y, \Sigma) P(y)}{\sum_{y'} \cN(\bx | \bar{\bx}_{y'}, \Sigma) P(y')}.
  \label{eq:inference}
\end{align}
In practice, the prediction $\hat{y}$ is accepted only when the classifier is confident. In other words, $P(\hat{y} | \bx) \! > \! \delta$, where $\delta \in (0, 1)$ is a threshold that controls the precision and recall of the classifier. In general, the higher the threshold $\delta$, the lower the recall and the higher the precision.

In this work, we use the mixture of Gaussians model for skeleton recognition. In this problem, the feature vector $\bx$ consists of skeleton measurements and each person corresponds to one class $y$.
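To make the estimation and inference above concrete, the following is a minimal sketch in Python with NumPy. It assumes that the training data are given as an array \texttt{X} of skeleton measurement vectors and an array \texttt{y} of integer person labels; the function names \texttt{fit}, \texttt{posterior}, and \texttt{predict} are ours, introduced for illustration, not part of any library.
\begin{verbatim}
# Illustrative sketch, not part of the paper's codebase.
import numpy as np

def fit(X, y):
    # ML estimates: class priors P(y), class means, and the
    # pooled covariance Sigma = sum_y P(y) Sigma_y.
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    cov = sum(p * np.cov(X[y == c], rowvar=False, bias=True)
              for p, c in zip(priors, classes))
    return classes, priors, means, cov

def posterior(x, priors, means, cov):
    # P(y | x) by Bayes' rule, computed in log space for stability.
    d = means.shape[1]
    prec = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = x - means                                   # one row per class
    maha = np.einsum('kd,de,ke->k', diff, prec, diff)  # Mahalanobis terms
    logjoint = np.log(priors) - 0.5 * (maha + logdet
                                       + d * np.log(2 * np.pi))
    p = np.exp(logjoint - logjoint.max())
    return p / p.sum()

def predict(x, classes, priors, means, cov, delta=0.9):
    # Return the MAP class when P(y | x) > delta, otherwise abstain.
    p = posterior(x, priors, means, cov)
    k = int(np.argmax(p))
    return classes[k] if p[k] > delta else None
\end{verbatim}
The posterior is evaluated in log space because products and exponentials of Gaussian densities easily underflow in double precision; the same trick carries over to the sequential extension below.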
\subsection{Sequential hypothesis testing}
\label{sec:SHT}

The mixture of Gaussians model can be extended to temporal inference through sequential hypothesis testing. Sequential hypothesis testing \cite{wald47sequential} is an established statistical framework in which a subject is sequentially tested for belonging to one of several classes. The probability that the sequence of data $\bx^{(1)}, \dots, \bx^{(t)}$ belongs to class $y$ at time $t$ is given by:
\begin{align}
  P(y | \bx^{(1)}, \dots, \bx^{(t)})
  = \frac{\prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_y, \Sigma) P(y)}
  {\sum_{y'} \prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_{y'}, \Sigma) P(y')}.
  \label{eq:SHT}
\end{align}
In practice, the prediction $\hat{y} = \arg\max_y P(y | \bx^{(1)}, \dots, \bx^{(t)})$ is accepted only when the classifier is confident. In other words, $P(\hat{y} | \bx^{(1)}, \dots, \bx^{(t)}) > \delta$, where the threshold $\delta \in (0, 1)$ controls the precision and recall of the predictor. In general, the higher the threshold $\delta$, the higher the precision and the lower the recall.

Sequential hypothesis testing is a common technique for smoothing temporal predictions. In particular, note that the prediction at time $t$ depends on all data up to time $t$. This reduces the variance of the predictions, especially when the input data are noisy, as in real-world skeleton recognition. In skeleton recognition, the sequence $\bx^{(1)}, \dots, \bx^{(t)}$ consists of skeleton measurements of a person walking towards the camera, for instance. If the camera detects multiple people, we use the camera's tracking to identify individual skeleton sequences.
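A minimal sketch of the sequential update, reusing the hypothetical \texttt{fit} from the previous sketch, is given below. The running log-joint accumulates one Gaussian log-likelihood per measurement, and the prediction is accepted as soon as the posterior of some class exceeds $\delta$; stopping at the first confident prediction is one natural reading of the acceptance rule above. The sequence \texttt{xs} is assumed to contain the tracked measurements of a single person.
\begin{verbatim}
# Illustrative sketch, not part of the paper's codebase.
import numpy as np

def log_likelihoods(x, means, cov):
    # log N(x | mean_y, Sigma) for every class y.
    d = means.shape[1]
    prec = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = x - means
    maha = np.einsum('kd,de,ke->k', diff, prec, diff)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

def sequential_predict(xs, classes, priors, means, cov, delta=0.9):
    # Accumulate log P(y) + sum_i log N(x^(i) | mean_y, Sigma) and stop
    # once the normalized posterior of the best class exceeds delta.
    logjoint = np.log(priors)
    for t, x in enumerate(xs, start=1):
        logjoint = logjoint + log_likelihoods(x, means, cov)
        p = np.exp(logjoint - logjoint.max())
        p /= p.sum()
        k = int(np.argmax(p))
        if p[k] > delta:
            return classes[k], t
    return None, len(xs)
\end{verbatim}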