\section{Experiment design}
We conduct a real-life, uncontrolled experiment using the Kinect to
test the algorithm. First, we discuss the signal outputs of the
Kinect. Second, we describe the environment in which we collect the
data. Finally, we describe the resulting data set.
\subsection{Kinect}
The Kinect outputs three primary signals in real time: a color image
stream, a depth image stream, and microphone output. For our
purposes, we focus on the depth image stream. As the Kinect was
designed to interface directly with the Xbox 360~\cite{xbox}, the
tools to interact with it on a PC are limited.
Libfreenect~\cite{libfreenect} is a reverse-engineered driver that
gives access to the raw depth images from the Kinect. This raw data
could be used to implement algorithms such as the one of
Plagemann~\etal{}~\cite{plagemann:icra10}. Alternatively,
OpenNI~\cite{openni}, a framework sponsored by
PrimeSense~\cite{primesense}, the company behind the technology of
the Kinect, offers figure-detection and skeleton-fitting algorithms
on top of raw access to the data streams. However, the
skeleton-fitting algorithm of OpenNI requires each individual to
strike a specific pose for calibration. More recently, the Kinect for
Windows SDK~\cite{kinect-sdk} was released; its skeleton-fitting
algorithm operates in real time without calibration. Since the Kinect
for Windows SDK is the state of the art, we use it for our data
collection.
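As an aside, the raw depth stream mentioned above can be read on a PC
through the Python bindings commonly shipped with libfreenect; the
snippet below is a minimal sketch of such raw access (the module name
\texttt{freenect} and its synchronous API are assumptions based on the
standard wrapper), not the pipeline we use, since our data collection
relies on the Kinect for Windows SDK.
\begin{verbatim}
# Minimal sketch of raw depth access via libfreenect's Python
# wrapper (assumed module name: freenect). Our actual data
# collection uses the Kinect for Windows SDK instead.
import freenect

depth, timestamp = freenect.sync_get_depth()
# depth is a 2-D array of raw depth readings for one frame
print(depth.shape, timestamp)
\end{verbatim}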
\subsection{Environment}
The data was collected in a real-life, uncontrolled environment over
a period of one week and involves 23 people.
\subsection{Data set}
The original data set consists of the sequence of all frames in which
a skeleton was detected by the Kinect SDK. For each frame, the
following data is available:
\begin{itemize}
\item the 3D coordinates of 20 body joints,
\item a color picture recorded by the video camera.
\end{itemize}
In some frames, one or several joints are occluded by another part of
the body. In those cases, the coordinates of these joints are either
absent from the frame or present but tagged as \emph{Inferred} by the
Kinect SDK, meaning that even though the joint is not visible in the
frame, the skeleton-fitting algorithm is able to estimate its
location.
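To illustrate, a frame can be viewed as a record holding a skeleton ID
and up to 20 joints, each with a tracking state; the sketch below (in
Python, with field names and values that are our own illustration, not
the exact SDK data types) shows how absent and merely inferred joints
can be told apart.
\begin{verbatim}
# Illustrative frame record; field names and values are our own,
# not the exact Kinect SDK data types.
frame = {
    "skeleton_id": 3,
    "joints": {
        "Head":      {"xyz": (0.02, 0.85, 2.31), "state": "Tracked"},
        "WristLeft": {"xyz": (-0.31, 0.12, 2.28), "state": "Inferred"},
        # ... up to 20 joints; fully occluded joints may be absent
    },
}

def is_reliable(frame, joint):
    """A joint is usable only if present and actually tracked."""
    j = frame["joints"].get(joint)
    return j is not None and j["state"] == "Tracked"
\end{verbatim}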
Each frame also carries a skeleton ID number. If this number stays
the same across several frames, the skeleton-fitting algorithm was
able to track the skeleton continuously. This allows us to define the
concept of a \emph{run}: a sequence of consecutive frames with the
same skeleton ID.
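In practice, runs can be recovered by grouping consecutive frames that
share the same skeleton ID, as in the following sketch (assuming the
frame representation of the previous sketch):
\begin{verbatim}
from itertools import groupby

def split_into_runs(frames):
    """Group consecutive frames with the same skeleton ID into runs."""
    return [list(group) for _, group in
            groupby(frames, key=lambda f: f["skeleton_id"])]
\end{verbatim}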
Ground-truth person identities are obtained by manually labelling
each run based on the images captured by the video camera of the
Kinect. For ease of labelling, only the runs in which the person
walks toward the camera are kept; these are the runs where the
average distance from the skeleton joints to the camera decreases
over time.
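Concretely, this selection can be sketched as follows, assuming the
camera sits at the origin of the joint coordinate frame (a convention
we adopt only for the sketch) and reusing the frame representation
above:
\begin{verbatim}
import numpy as np

def mean_distance(frame):
    """Average Euclidean distance of the joints to the camera,
    assuming the camera is at the origin of the coordinate frame."""
    pts = np.array([j["xyz"] for j in frame["joints"].values()])
    return np.linalg.norm(pts, axis=1).mean()

def walks_toward_camera(run):
    """Keep a run if the average joint distance decreases over the run."""
    return mean_distance(run[-1]) < mean_distance(run[0])
\end{verbatim}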
Several reductions are then applied to the data set to extract
\emph{features} from the raw data (a sketch of this pipeline is given
after the list):
\begin{itemize}
\item From the joint coordinates, the lengths of 15 body parts are
  computed. These are the distances between two adjacent joints of
  the human body. If either joint of a body part is absent from a
  frame or only inferred, the corresponding body part is reported as
  absent for that frame.
\item The number of features is then reduced to 9 by using the
  vertical symmetry of the human body: if two body parts are
  symmetric about the body's vertical axis, we bundle them into one
  feature by averaging their lengths. If only one of them is present,
  we use its length alone. If neither is present, the feature is
  reported as missing for the frame. The resulting 9 features are:
  Head-ShoulderCenter, ShoulderCenter-Shoulder, Shoulder-Elbow,
  Elbow-Wrist, ShoulderCenter-Spine, Spine-HipCenter,
  HipCenter-HipSide, HipSide-Knee, Knee-Ankle.
\item Finally, all frames in which any of the 9 features is missing
  are filtered out.
\end{itemize}
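The feature-extraction pipeline can be sketched as follows; the joint
names follow the Kinect SDK naming convention, while the explicit
pairing of symmetric body parts is our own reconstruction of the list
above.
\begin{verbatim}
import numpy as np

# The 9 features; each is a set of one or two symmetric body parts,
# and each body part is a pair of adjacent joints (15 parts in total).
FEATURES = {
    "Head-ShoulderCenter":     [("Head", "ShoulderCenter")],
    "ShoulderCenter-Shoulder": [("ShoulderCenter", "ShoulderLeft"),
                                ("ShoulderCenter", "ShoulderRight")],
    "Shoulder-Elbow":          [("ShoulderLeft", "ElbowLeft"),
                                ("ShoulderRight", "ElbowRight")],
    "Elbow-Wrist":             [("ElbowLeft", "WristLeft"),
                                ("ElbowRight", "WristRight")],
    "ShoulderCenter-Spine":    [("ShoulderCenter", "Spine")],
    "Spine-HipCenter":         [("Spine", "HipCenter")],
    "HipCenter-HipSide":       [("HipCenter", "HipLeft"),
                                ("HipCenter", "HipRight")],
    "HipSide-Knee":            [("HipLeft", "KneeLeft"),
                                ("HipRight", "KneeRight")],
    "Knee-Ankle":              [("KneeLeft", "AnkleLeft"),
                                ("KneeRight", "AnkleRight")],
}

def tracked(frame, joint):
    """True if the joint is present in the frame and not merely inferred."""
    j = frame["joints"].get(joint)
    return j is not None and j["state"] == "Tracked"

def part_length(frame, a, b):
    """Length of the body part (a, b), or None if either joint is
    absent or only inferred."""
    if not (tracked(frame, a) and tracked(frame, b)):
        return None
    pa = np.array(frame["joints"][a]["xyz"])
    pb = np.array(frame["joints"][b]["xyz"])
    return float(np.linalg.norm(pa - pb))

def extract_features(frame):
    """Return the 9 features for a frame, or None if any feature is
    missing (such frames are filtered out)."""
    feats = {}
    for name, parts in FEATURES.items():
        lengths = [l for l in (part_length(frame, a, b) for a, b in parts)
                   if l is not None]
        if not lengths:
            return None
        feats[name] = sum(lengths) / len(lengths)
    return feats
\end{verbatim}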
\begin{table}
\begin{center}
\begin{tabular}{|l|r||r|r|r|}
\hline
Number of people & 25 & $k\leq 5$ & $5\leq k\leq 20$ & $k\geq 20$\\
\hline
Number of frames & 15945 & 1211 & 561 & 291 \\
\hline
Number of runs & 244 & 18 & 8 & 4\\
\hline
\end{tabular}
\end{center}
\caption{Data set statistics. The left part of the table gives totals
  over the whole data set; the right part shows, for different
  intervals of $k$, the average number of frames and runs per person,
  where $k$ is the rank of a person in the ordering by number of
  frames.}
\label{tab:dataset}
\end{table}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "kinect"
%%% End: