\section{Real-World Evaluation}
We conduct a real-life, uncontrolled experiment using the Kinect to test the
algorithm. First we present the manner and environment in which we perform
data collection. Second we describe how the data is processed and classified.
Finally, we discuss the results.
\subsection{Dataset}
The Kinect outputs three primary signals in real-time: a color image stream, a
depth image stream, and microphone output. For our purposes, we focus on the
depth image stream. As the Kinect was designed to interface directly with the
Xbox 360, the tools to interact with it on a PC are limited.
Libfreenect~\cite{libfreenect} is a reverse engineered driver which gives
access to the raw depth images from the Kinect. This raw data could be used to
implement the algorithms \eg of Plagemann~\etal{}~\cite{plagemann:icra10}.
Alternatively, OpenNI~\cite{openni}, a framework sponsored by
PrimeSense~\cite{primesense}, the company behind the technology of the Kinect,
offers figure detection and skeleton fitting algorithms on top of raw access to
the data streams. However, the skeleton fitting algorithm of OpenNI requires
each individual to strike a specific pose for calibration. More recently, the
Kinect for Windows SDK~\cite{kinect-sdk} was released, and its skeleton fitting
algorithm operates in real-time without calibration. Given that the Kinect for
Windows SDK is the state-of-the-art, we use it to perform our data collection.
We collect data using the Kinect SDK over a period of a week in a research
laboratory setting. The Kinect is placed at the T-junction of a well-traversed
hallway. The view of the Kinect is seen in \fref{fig:hallway}, showing the
color image, the depth image, and the fitted skeleton of a person in a single
frame. For each frame where a person is detected and a skeleton is fitted we
collect the 3D coordinates of 20 body joints, and the color image recorded by
the RGB camera.
\begin{figure}
\begin{center}
\includegraphics[width=0.99\textwidth]{graphics/hallway.png}
\end{center}
\caption{The view of the Kinect at the collection site: the color image,
  the depth image, and the fitted skeleton of a person in a single frame.}
\label{fig:hallway}
\end{figure}
For some frames, one or several joints are out of the frame or are occluded by
another part of the body. In those cases, the coordinates of these joints are
either absent from the frame or present but tagged as \emph{Inferred} by the
Kinect SDK. Inferred means that even though the joint is not visible in the
frame, the skeleton-fitting algorithm attempts to guess the right location.
Ground truth person identification is obtained by manually labelling each run
based on the images captured by the RGB camera of the Kinect. For ease of
labelling, only the runs with people walking toward the camera are kept. These
are the runs where the average distance from the skeleton joints to the camera
decreases over time.
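This filtering step can be sketched as follows. The frame layout (joint
names mapped to $(x, y, z)$ tuples, with $z$ the distance to the sensor)
is a hypothetical data structure for illustration, not the SDK's actual
types:

```python
def walking_toward_camera(run):
    """True when the mean joint-to-camera distance shrinks over the
    run, i.e. the person approaches the Kinect.  Assumes each frame
    maps joint names to (x, y, z) tuples with z the distance to the
    sensor in metres (hypothetical layout for illustration)."""
    def mean_depth(frame):
        return sum(z for _, _, z in frame.values()) / len(frame)

    depths = [mean_depth(f) for f in run]
    # Compare the run's endpoints instead of demanding a strictly
    # monotone sequence, which sensor jitter would almost always break.
    return depths[-1] < depths[0]
```

Comparing only the endpoints keeps the test robust to frame-to-frame
jitter in the depth estimates.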
\subsection{Experiment design}
Several reductions are then applied to the data set to extract \emph{features}
from the raw data. First, the lengths of 15 body parts are computed from the
joint coordinates. These are distances between two contiguous joints in the
human body. If one of the two joints of a body part is not present or inferred
in a frame, the corresponding body part is reported as absent for the frame.
Second, the number of features is reduced to 9 by using the vertical symmetry
of the human body: if two body parts are symmetric about the vertical axis, we
bundle them into one feature by averaging their lengths. If only one of them is
present, we take the value of its counterpart. If none of them are present, the
feature is reported as missing for the frame. The resulting nine features are:
Head-ShoulderCenter, ShoulderCenter-Shoulder, Shoulder-Elbow, Elbow-Wrist,
ShoulderCenter-Spine, Spine-HipCenter, HipCenter-HipSide, HipSide-Knee,
Knee-Ankle. Finally, any frame with a missing feature is filtered out.
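The reduction pipeline above can be sketched as follows. The frame layout
and the joint pairings are our assumptions (only two of the nine features
are shown); the joint names follow the Kinect SDK's naming:

```python
import math

# Two illustrative features out of the nine; each maps to a
# (left pair, right pair) of contiguous joints.  Central features on
# the spine axis simply repeat the same pair twice.  The exact
# pairing is reconstructed from the feature names and is an assumption.
FEATURES = {
    "Head-ShoulderCenter": (("Head", "ShoulderCenter"),
                            ("Head", "ShoulderCenter")),
    "Shoulder-Elbow": (("ShoulderLeft", "ElbowLeft"),
                       ("ShoulderRight", "ElbowRight")),
}

def length(frame, a, b):
    """Distance between joints a and b, or None if either joint is
    absent from the frame or tagged Inferred by the SDK."""
    ja, jb = frame.get(a), frame.get(b)
    if ja is None or jb is None or ja["inferred"] or jb["inferred"]:
        return None
    return math.dist(ja["pos"], jb["pos"])

def extract_features(frame):
    """Feature vector for one frame, or None when any feature is
    missing (such frames are filtered out of the data set)."""
    feats = {}
    for name, (left, right) in FEATURES.items():
        # Average the symmetric body parts; fall back to whichever
        # one is present when the other is missing or inferred.
        lengths = [v for pair in (left, right)
                   if (v := length(frame, *pair)) is not None]
        if not lengths:
            return None        # both symmetric parts missing
        feats[name] = sum(lengths) / len(lengths)
    return feats
```

Returning `None` for the whole frame implements the final filtering
step: a single missing feature discards the frame.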
Each detected skeleton also carries an ID number identifying which figure
it maps to from the figure-detection stage. When the ID number stays the
same across several consecutive frames, the skeleton-fitting algorithm was
able to track the skeleton continuously. This allows us to define the
concept of a \emph{run}: a maximal sequence of consecutive frames with the
same skeleton ID.
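Segmenting the frame stream into runs amounts to grouping consecutive
frames by skeleton ID. A minimal sketch, assuming the frames arrive as a
time-ordered list of (skeleton ID, frame data) pairs:

```python
from itertools import groupby

def split_into_runs(frames):
    """Group a time-ordered list of (skeleton_id, frame) pairs into
    runs: maximal blocks of consecutive frames sharing one ID.
    groupby only merges adjacent equal keys, so a person who leaves
    and re-enters the scene starts a new run."""
    return [(sid, [frame for _, frame in group])
            for sid, group in groupby(frames, key=lambda pair: pair[0])]
```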
\begin{table}
\begin{center}
\begin{tabular}{|l|r||r|r|r|}
\hline
Number of people & 25 & $k\leq 5$ & $5< k\leq 20$ & $k> 20$\\
\hline
Number of frames & 15945 & 1211 & 561 & 291 \\
\hline
Number of runs & 244 & 18 & 8 & 4\\
\hline
\end{tabular}
\end{center}
\caption{Data set statistics. The right part of the table shows
  averages over people in different intervals of $k$, where $k$ is a
  person's rank when people are ordered by their number of frames.}
\label{tab:dataset}
\end{table}
\subsection{Results}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "kinect"
%%% End: