\section{Real-World Evaluation}
\label{sec:experiment}

We conduct a real-life, uncontrolled experiment using the Kinect to test the algorithm. First, we present the manner and environment in which we perform data collection. Second, we describe how the data is processed and classified. Finally, we discuss the results.

\subsection{Dataset}

The Kinect outputs three primary signals in real time: a color image stream, a depth image stream, and microphone output. For our purposes, we focus on the depth image stream. As the Kinect was designed to interface directly with the Xbox 360, the tools to interact with it on a PC are limited. Libfreenect~\cite{libfreenect} is a reverse-engineered driver which gives access to the raw depth images from the Kinect. This raw data could be used to implement the algorithms of, \eg, Plagemann~\etal{}~\cite{plagemann:icra10}. Alternatively, OpenNI~\cite{openni}, a framework sponsored by PrimeSense~\cite{primesense}, the company behind the technology of the Kinect, offers figure-detection and skeleton-fitting algorithms on top of raw access to the data streams. However, the skeleton-fitting algorithm of OpenNI requires each individual to strike a specific pose for calibration. More recently, the Kinect for Windows SDK~\cite{kinect-sdk} was released; its skeleton-fitting algorithm operates in real time without calibration. Given that the Kinect for Windows SDK is the state of the art, we use it to perform our data collection.

We collect data using the Kinect SDK over a period of one week in a research laboratory setting. The Kinect is placed at the tee of a well-traversed hallway. The view from the Kinect is shown in \fref{fig:hallway}: the color image, the depth image, and the fitted skeleton of a person in a single frame. For each frame where a person is detected and a skeleton is fitted, we collect the 3D coordinates of 20 body joints and the color image recorded by the RGB camera.

\begin{figure}
  \begin{center}
    \includegraphics[width=0.99\textwidth]{graphics/hallway.png}
  \end{center}
  \caption{Experiment setting. Color image, depth image, and fitted skeleton as captured by the Kinect in a single frame.}
  \label{fig:hallway}
\end{figure}

For some frames, one or several joints are out of the frame or occluded by another part of the body. In those cases, the coordinates of these joints are either absent from the frame or present but tagged as \emph{Inferred} by the Kinect SDK: even though the joint is not visible in the frame, the skeleton-fitting algorithm attempts to guess its location.

Ground-truth person identification is obtained by manually labelling each run (a sequence of frames in which the same skeleton is tracked; see the next subsection) based on the images captured by the RGB camera of the Kinect. For ease of labelling, only the runs with people walking toward the camera are kept, i.e., the runs where the average distance from the skeleton joints to the camera is decreasing.

\subsection{Experiment design}

Several reductions are applied to the data set to extract \emph{features} from the raw data. First, the lengths of 15 body parts are computed from the joint coordinates; these are the distances between two contiguous joints in the human body. If one of the two joints of a body part is absent or inferred in a frame, the corresponding body part is reported as absent for that frame. Second, the number of features is reduced to 9 using the vertical symmetry of the human body: if two body parts are symmetric about the vertical axis, we bundle them into one feature by averaging their lengths. If only one of the pair is present, we use its length alone. If neither is present, the feature is reported as missing for the frame. The resulting nine features are: Head-ShoulderCenter, ShoulderCenter-Shoulder, Shoulder-Elbow, Elbow-Wrist, ShoulderCenter-Spine, Spine-HipCenter, HipCenter-HipSide, HipSide-Knee, Knee-Ankle. Finally, any frame with a missing feature is filtered out.
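In other words, writing $p_a^{(t)}$ for the 3D position of joint $a$ reported in frame $t$ and taking distances to be Euclidean, a body-part length and the feature obtained from a symmetric pair with lengths $\ell_{\mathrm{L}}^{(t)}$ and $\ell_{\mathrm{R}}^{(t)}$ are
\[
  \ell_{ab}^{(t)} = \bigl\| p_a^{(t)} - p_b^{(t)} \bigr\|_2 ,
  \qquad
  f^{(t)} = \frac{1}{2}\bigl(\ell_{\mathrm{L}}^{(t)} + \ell_{\mathrm{R}}^{(t)}\bigr),
\]
where $f^{(t)}$ falls back to the single available length when only one of the two body parts is measurable in frame $t$, and is marked missing when neither is.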
Each detected skeleton also has an ID number which identifies the figure it maps to from the figure-detection stage. When this ID number stays the same across several consecutive frames, the skeleton-fitting algorithm was able to track the same figure contiguously. This allows us to define the concept of a \emph{run}: a sequence of consecutive frames with the same skeleton ID.

\begin{table}
  \begin{center}
    \caption{Data set statistics. The right part of the table shows the average numbers per person for different intervals of $k$, the rank of a person when people are ordered by decreasing number of frames.}
    \label{tab:dataset}
    \begin{tabular}{|l|r||r|r|r|}
      \hline
      Number of people & 25 & $k\leq 5$ & $5 < k\leq 20$ & $k > 20$\\
      \hline
      Number of frames & 15945 & 1211 & 561 & 291 \\
      \hline
      Number of runs & 244 & 18 & 8 & 4\\
      \hline
    \end{tabular}
  \end{center}
\end{table}

\subsection{Results}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "kinect"
%%% End: