Diffstat (limited to 'algorithm.tex')
-rw-r--r--  algorithm.tex  130
1 file changed, 69 insertions, 61 deletions
diff --git a/algorithm.tex b/algorithm.tex
index c02a0e5..97d0b42 100644
--- a/algorithm.tex
+++ b/algorithm.tex
@@ -4,57 +4,67 @@
In Section~\ref{sec:uniqueness}, we showed that a nearest-neighbor classifier
can accurately predict if two sets of skeletal measurements belong to the same
person if the error of the skeleton measurements is small. In this section, we
-suggest a probabilistic model for skeleton recognition. In this model, a
-skeleton is classified based on the distance from average skeleton profiles of
-people in the training set.
-
+suggest a probabilistic model of skeleton measurements that leads to a simple
+algorithm for skeleton recognition.
\subsection{Mixture of Gaussians}
\label{sec:mixture of Gaussians}
-A mixture of Gaussians \cite{bishop06pattern} is a generative probabilistic
-model, which is typically applied to modeling problems where class densities
-are unimodal and the feature space is low-dimensional. The joint probability
-distribution of the model is given by:
-\begin{align}
- P(\bx, y) = \cN(\bx | \bar{\bx}_y, \Sigma) P(y),
- \label{eq:mixture of Gaussians}
-\end{align}
-where $\bx$ denotes an observation, $y$ is a class, $P(y)$ is the probability of the class, and $\cN(\bx | \bar{\bx}_y, \Sigma)$ is the conditional probability of $\bx$ given $y$. The conditional is a multivariate normal distribution. Its mean is $\bar{\bx}_y$ and the variance of $\bx$ is captured by the covariance matrix $\Sigma$. The decision boundary between two classes $y_1$ and $y_2$, $P(\bx, y_1) = P(\bx, y_2)$ for all $\bx$, is linear when all conditionals $\cN(\bx | \bar{\bx}_y, \Sigma)$ have the same covariance matrix $\Sigma$ \cite{bishop06pattern}. In this case, the mixture of Gaussians model can be viewed as a probabilistic variant of the nearest-neighbor (NN) classifier in Section~\ref{sec:uniqueness}.
+In this section, skeleton measurements are represented by a feature vector
+$\bx\in\mathbf{R}^9$ (see Section~\ref{sec:experiment-design} for a description
+of the skeleton features). Each measurement is labeled with a class $y$
+representing the identity of the individual it belongs to (there are as many
+classes as people taking part in the experiment).
+
+We want a probabilistic model for the joint distribution $P(\bx, y)$. Our
+dataset $\cD = \set{(\bx_1, y_1),\ldots,(\bx_n, y_n)}$ consists of $n$ pairs
+$(\bx_i, y_i)$, where $y_i$ is the label of the skeleton $\bx_i$. For each
+feature $k\in \set{1,\ldots, 9}$, we plot the histogram of the differences
+$\set{(\bx_i)_k - (\bar{\bx}_{y_i})_k,\; i\in\set{3,\ldots, n}}$ between the
+measurement $(\bx_i)_k$ of feature $k$ and the average $(\bar{\bx}_{y_i})_k$ of
+all measurements of that feature belonging to the same class. These histograms
+are presented in Figure~\ref{fig:error marginals}. All histograms look
+approximately normal, which suggests that all class conditionals $P(\bx | y)$
+are multivariate normal.
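+Here, $\bar{\bx}_{y}$ denotes the per-class sample average used above, that is,
+\begin{equation*}
+ \bar{\bx}_{y} = \frac{1}{n_y} \sum_{i \,:\, y_i = y} \bx_i,
+\end{equation*}
+where $n_y$ is the number of measurements in $\cD$ with label $y$.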
+
+Based on these observations, our model takes the form:
+\begin{equation}\label{eq:mixture of Gaussians}
+ P(\bx, y) = \cN(\bx | \bar{\bx}_y, \Sigma) P(y),
+\end{equation}
+which means that the conditional distribution of $\bx$ given $y$ is
+multivariate normal with mean $\bar{\bx}_y = \E{}{\bx | y}$ and covariance
+matrix $\Sigma$. Note that the covariance matrix $\Sigma$ does not depend on
+the class $y$. This is because we assume that the observed variance is caused
+only by the noise introduced by the measurement process (the low resolution of
+the Kinect camera and the approximations of the skeleton-mapping algorithm
+provided by the Microsoft SDK) and not by the identity of the individual
+walking in front of the camera.
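+For reference, the class conditional in Equation~\ref{eq:mixture of Gaussians}
+is the standard multivariate normal density on $\mathbf{R}^9$,
+\begin{equation*}
+ \cN(\bx | \bar{\bx}_y, \Sigma) = (2\pi)^{-\frac{9}{2}} |\Sigma|^{-\frac{1}{2}}
+ \exp\left(-\tfrac{1}{2}(\bx - \bar{\bx}_y)^\top \Sigma^{-1}
+ (\bx - \bar{\bx}_y)\right),
+\end{equation*}
+in which only the mean $\bar{\bx}_y$ depends on the class.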
+
+This probabilistic model is called a mixture of Gaussians
+\cite{bishop06pattern} and has many advantages. First, the parameters of the
+model can be easily learned by maximum-likelihood estimation: $P(y)$, the
+probability of belonging to class $y$, is simply the frequency of $y$ in the
+training set; $\bar{\bx}_y$ is the sample mean of the observations belonging
+to class $y$; and $\Sigma$ is learned as $\Sigma = \sum_y P(y)\Sigma_y$, where
+$\Sigma_y$ is the sample covariance matrix of $\bx$ given $y$. Second,
+inference in the model can be performed in closed form: given an observation
+$\bx$, the model predicts the most probable class $\hat{y} = \arg\max_y
+P(y | \bx)$, where:
+\begin{equation}\label{eq:inference}
+ P(y|\bx) = \frac{P(\bx, y)}{\sum_y P(\bx|y)P(y)}
+ = \frac{\cN(\bx|\bar{\bx}_y, \Sigma)P(y)}{\sum_y\cN(\bx|\bar{\bx}_y,
+ \Sigma)P(y)}.
+\end{equation}
+In this setting, the decision boundary between two classes $y_1$ and $y_2$,
+$\set{\bx | P(\bx, y_1) = P(\bx, y_2)}$, is a hyperplane \cite{bishop06pattern},
+and the mixture of Gaussians model can be viewed as a probabilistic variant of
+the nearest-neighbor (NN) classifier in Section~\ref{sec:uniqueness}.
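+The linearity of this boundary can be checked directly: because both classes
+share the covariance matrix $\Sigma$, the quadratic term
+$\bx^\top\Sigma^{-1}\bx$ cancels in the log-ratio of the two joint densities,
+\begin{equation*}
+ \log\frac{P(\bx, y_1)}{P(\bx, y_2)}
+ = (\bar{\bx}_{y_1} - \bar{\bx}_{y_2})^\top \Sigma^{-1} \bx
+ - \tfrac{1}{2}\left(\bar{\bx}_{y_1}^\top \Sigma^{-1} \bar{\bx}_{y_1}
+ - \bar{\bx}_{y_2}^\top \Sigma^{-1} \bar{\bx}_{y_2}\right)
+ + \log\frac{P(y_1)}{P(y_2)},
+\end{equation*}
+which is affine in $\bx$, so the set where it vanishes is a hyperplane.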
-The mixture of Gaussians model has many advantages. First, the model can be
-easily learned using maximum-likelihood (ML) estimation \cite{bishop06pattern}.
-In particular, $P(y)$ is the frequency of $y$ in the training set,
-$\bar{\bx}_y$ is the expectation of $\bx$ given $y$, and the covariance matrix
-is computed as $\Sigma = \sum_y P(y) \Sigma_y$, where $\Sigma_y$ represents the
-covariance of $\bx$ given $y$. Second, the inference in the model can be
-performed in a closed form. In particular, the model predicts $\hat{y} =
-\arg\max_y P(y | \bx)$, where:
-\begin{align}
- P(y | \bx) =
- \frac{P(\bx | y) P(y)}{\sum_y P(\bx | y) P(y)} =
- \frac{\cN(\bx | \bar{\bx}_y, \Sigma) P(y)}{\sum_y \cN(\bx | \bar{\bx}_y, \Sigma) P(y)}.
- \label{eq:inference}
-\end{align}
In practice, the prediction $\hat{y}$ is accepted when the classifier is
-confident. In other words, $P(\hat{y} | \bx) \! > \! \delta$, where $\delta \in
+confident, that is, $P(\hat{y} | \bx) \! > \! \delta$, where $\delta \in
(0, 1)$ is a threshold that controls the precision and recall of the
classifier. In general, the higher the threshold $\delta$, the lower the recall
and the higher the precision.
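+For concreteness, the complete procedure (parameter estimation followed by a
+thresholded prediction) can be sketched in a few lines of Python with
+NumPy/SciPy. This is an illustration only, not the implementation used in our
+experiments; the array layout, the helper names, and the example threshold
+below are ours:
+\begin{verbatim}
+import numpy as np
+from scipy.stats import multivariate_normal
+
+def fit(X, y):
+    # X: n x 9 array of skeleton features, y: n labels (identities).
+    # Assumes at least two measurements per class.
+    classes = np.unique(y)
+    priors = np.array([np.mean(y == c) for c in classes])        # P(y)
+    means = np.array([X[y == c].mean(axis=0) for c in classes])  # class means
+    # Pooled covariance: Sigma = sum_y P(y) * Sigma_y.
+    Sigma = sum(p * np.cov(X[y == c], rowvar=False)
+                for p, c in zip(priors, classes))
+    return classes, priors, means, Sigma
+
+def predict(x, classes, priors, means, Sigma, delta=0.9):
+    # Posterior P(y | x) as in the inference equation above.
+    post = priors * np.array([multivariate_normal.pdf(x, m, Sigma)
+                              for m in means])
+    post /= post.sum()
+    best = int(np.argmax(post))
+    # Accept the prediction only when the classifier is confident.
+    return classes[best] if post[best] > delta else None
+\end{verbatim}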
-In this work, we use the mixture of Gaussians model for skeleton recognition.
-Skeleton measurements are represented by a vector $\bx$ and each person is
-assigned to one class $y$. In particlar, our dataset $\cD = \set{(\bx_1, y_1),
-\dots, (\bx_n, y_n)}$ consists of $n$ pairs $(\bx_i, y_i)$, where $y_i$ is the
-label of the skeleton $\bx_i$. To verify that our method is suitable for
-skeleton recognition, we plot for each skeleton feature $x_k$
-(Section~\ref{sec:experiment}) the histogram of differences between all
-measurements and the expectation given the class $(\bx_i)_k - \E{}{x_k | y_i}$
-(Figure~\ref{fig:error marginals}). All histograms look approximately normal.
-This suggests that all class conditionals $P(\bx | y)$ are multivariate normal
-and our generative model, although very simple, may be nearly optimal
-\cite{bishop06pattern}.
-
\begin{figure*}[t]
\centering
\includegraphics[width=0.99\textwidth]{graphics/limbs.pdf}
@@ -65,35 +75,33 @@ and our generative model, although very simple, may be nearly optimal
\label{fig:error marginals}
\end{figure*}
-
\subsection{Sequential hypothesis testing}
\label{sec:SHT}
-The mixture of Gaussians model can be extended to temporal inference through
-sequential hypothesis testing. Sequential hypothesis testing
-\cite{wald47sequential} is an established statistical framework, where a
-subject is sequentially tested for belonging to one of several classes. The
-probability that the sequence of data $\bx^{(1)}, \dots, \bx^{(t)}$ belongs to
-the class $y$ at time $t$ is given by:
-\begin{align}
+In our setting (see Section~\ref{sec:experiment-design}), skeleton measurements
+are not isolated. On the contrary, every time a person walks in front of the
+camera, we obtain a set of time-indexed measurements belonging to the same
+individual that we want to classify. The mixture of Gaussians model can be
+extended to temporal inference through the sequential hypothesis testing
+framework \cite{wald47sequential}. In this framework, a subject is sequentially
+tested for membership in one of several classes, under the assumption that,
+conditioned on the class, the measurements are independent realizations of the
+same random variable. In our case, the probability that the sequence of data
+$\bx^{(1)}, \dots, \bx^{(t)}$ belongs to the class $y$ at time $t$ is given by:
+\begin{equation}\label{eq:SHT}
P(y | \bx^{(1)}, \dots, \bx^{(t)}) =
\frac{\prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_y, \Sigma) P(y)}
{\sum_y \prod_{i = 1}^t \cN(\bx^{(i)} | \bar{\bx}_y, \Sigma) P(y)}.
- \label{eq:SHT}
-\end{align}
-In practice, the prediction $\hat{y} = \arg\max_y P(y | \bx^{(1)}, \dots,
-\bx^{(t)})$ is accepted when the classifier is confident. In other words,
-$P(\hat{y} | \bx^{(1)}, \dots, \bx^{(t)}) > \delta$, where the threshold
-$\delta \in (0, 1)$ controls the precision and recall of the predictor. In
-general, the higher the threshold $\delta$, the higher the precision and the
-lower the recall.
+\end{equation}
+
+As in the previous section, the prediction is the class with the highest
+posterior probability, $\hat{y} = \arg\max_y P(y | \bx^{(1)}, \dots,
+\bx^{(t)})$. In practice, the prediction is accepted when the classifier is
+confident, that is, $P(\hat{y} | \bx^{(1)}, \dots, \bx^{(t)}) > \delta$, where
+the threshold $\delta \in (0, 1)$ controls the precision and recall of the
+predictor.
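+The same illustrative sketch extends to the sequential rule: accumulating
+per-measurement log-densities corresponds to the product in
+Equation~\ref{eq:SHT}, and working in log space avoids numerical underflow for
+long sequences. As before, the helper below is only a sketch of the
+computation, not our experimental code:
+\begin{verbatim}
+def predict_sequence(frames, classes, priors, means, Sigma, delta=0.9):
+    # frames: time-indexed measurements from one walk past the camera.
+    log_post = np.log(priors)
+    for x in frames:
+        log_post = log_post + np.array(
+            [multivariate_normal.logpdf(x, m, Sigma) for m in means])
+    post = np.exp(log_post - log_post.max())   # normalize in a stable way
+    post /= post.sum()                         # P(y | x^(1), ..., x^(t))
+    best = int(np.argmax(post))
+    return classes[best] if post[best] > delta else None
+\end{verbatim}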
Sequential hypothesis testing is a common technique for smoothing temporal
predictions. In particular, note that the prediction at time $t$ depends on all
data up to time $t$. This reduces the variance of predictions, especially when
input data are noisy, such as in the domain of skeleton recognition.
-In skeleton recognition, the sequence $\bx^{(1)}, \dots, \bx^{(t)}$ is skeleton
-measurements of a person walking towards the camera, for instance. If the
-Kinect detects multiple people, we use the figure tracking of the tools in
-\xref{sec:experiment:dataset} to distinguish individual skeleton sequences.