 conclusion.tex   | 37
 experimental.tex | 87
 2 files changed, 74 insertions(+), 50 deletions(-)
diff --git a/conclusion.tex b/conclusion.tex
index c54bc9a..40eac53 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,24 +1,33 @@
 \section{Conclusion}
 \label{sec:conclusion}
 
-In this paper, we introduce skeleton recognition. We show that skeleton
-measurements are unique enough to distinguish individuals using a dataset of
-real skeletons. We present a probabilistic model for recognition, and extend
-it to take advantage of consecutive frames. Finally we test our model by
-collecting data for a week in a real-world setting. Our results show that
-skeleton recognition performs close to face recognition, and it can be
-used in other scenarios.
+In this paper, we present exciting and promising results for skeleton recognition.
+With greater than 90\% accuracy for fewer than 10 people, skeleton recognition
+can already be used in households, \eg to load personalized settings on a home
+entertainment system. Skeleton recognition performs less than 10\% worse than
+face recognition in the current setting. This is a good result considering
+face recognition has been studied for years and is more mature. Furthermore,
+skeleton recognition works in many situations where face recognition does not,
+for example when a person is not facing the camera or when there is not enough
+light.
 
-However, the Kinect SDK does have some limitations. First of all, the Kinect
+%we introduce skeleton recognition. We show that skeleton
+%measurements are unique enough to distinguish individuals using a dataset of
+%real skeletons. We present a probabilistic model for recognition, and extend
+%it to take advantage of consecutive frames. Finally we test our model by
+%collecting data for a week in a real-world setting. Our results show that
+%skeleton recognition performs close to face recognition, and it can be
+%used in other scenarios.
+
+Skeleton recognition has room for improvement. First of all, the Kinect
 SDK can only fit two skeletons at a time.
 Therefore, when a group of people walk in front of the Kinect, not all of
 them can be recognized via skeleton,
-where they might be by face recognition. Second, some times figure detection
-gives false positives, which caused skeletons to be fit on a window and a
-vacuum cleaner during our data collection (both of these are reflective
-surfaces, which might explain the failure).
+where they might be by face recognition. Second, figure detection can
+give false positives, which caused skeletons to be fit on a window and a
+vacuum cleaner during our data collection.
 
-Skeleton recognition can only get more accurate as the resolution of range
-cameras increases and skeleton fitting algorithms improve. Microsoft is
+Finally, as the resolution of range cameras increases and skeleton fitting
+algorithms improve, so will the accuracy of skeleton recognition. Microsoft is
 planning on putting the Kinect technology inside
 laptops~\footnote{\url{http://www.thedaily.com/page/2012/01/27/012712-tech-kinect-laptop/}}
 and the Asus Xtion
diff --git a/experimental.tex b/experimental.tex
index dee0626..57189ab 100644
--- a/experimental.tex
+++ b/experimental.tex
@@ -40,10 +40,10 @@ operates in real-time without calibration.
 %that it is the state-of-the-art and does not require calibration.
 
 We collect data using the Kinect SDK over a period of a week in a research
-laboratory setting. The Kinect is placed at the tee of a well traversed
+laboratory setting. The Kinect is placed at the tee of a frequently used
 hallway. The view of the Kinect is seen in \fref{fig:hallway}, showing the
 color image, the depth image, and the fitted skeleton of a person in a single
-frame. Skeletons are fit from \~1-5 meters away from the Kinect. For each
+frame. Skeletons are fit from roughly 1-5 meters away from the Kinect. For each
 frame where a person is detected and a skeleton is fit we capture the 3-D
 coordinates of 20 body joints, and the color image.
@@ -167,15 +167,25 @@ varying group size $n_p = \{3,5,10,25\}$.
 
 \fref{fig:offline} shows the precision-recall plot as the threshold varies.
 Both algorithms perform three times better than the majority class baseline of
-15\% with a recall of 100\% on all people. Several curves are obtained for
-different group sizes: people are ordered based on their frequency of
-appearance (\fref{fig:frames}), and all the frames belonging to people beyond a
-given rank in this ordering are removed. The decrease of performance when
-increasing the number of people in the dataset can be explained by the
-overlaps between skeleton profiles due to the noise, as discussed in
-Section~\ref{sec:uniqueness}, but also by the very few number of runs available
-for the least present people, as seen in \fref{fig:frames}, which does not
-permit a proper training of the algorithm.
+15\% with a recall of 100\% on all people. We make two main observations.
+First, as expected, SHT performs better than MoG because of temporal smoothing.
+Second, performance is inversely proportional to group size. As we test
+against more people, there are more overlaps between skeleton profiles due to
+the noise, as discussed in Section~\ref{sec:uniqueness}. Also, the least
+present people have a small number of frames, as seen in \fref{fig:frames},
+which may not permit proper training of the algorithm. For 3 and 5
+people (typical family sizes), we see recognition rates mostly above 90\%, and
+we reach 90\% accuracy at 60\% recall for a group size of 10 people.
+
+%Several curves are obtained for
+%different group sizes: people are ordered based on their frequency of
+%appearance (\fref{fig:frames}), and all the frames belonging to people beyond a
+%given rank in this ordering are removed. The decrease of performance when
+%increasing the number of people in the dataset can be explained by the
+%overlaps between skeleton profiles due to the noise, as discussed in
+%Section~\ref{sec:uniqueness}, but also by the very few number of runs available
+%for the least present people, as seen in \fref{fig:frames}, which does not
+%permit a proper training of the algorithm.
 
 \begin{figure*}[t]
 \begin{center}
@@ -214,16 +224,21 @@ to the dataset. The identification algorithm is then retrained on the
 augmented dataset, and the newly obtained classifier can be deployed in the
 building.
 
-In this setting, the sequential hypothesis testing (SHT) algorithm is more
-suitable than the algorithm used in Section~\ref{sec:experiment:offline}, because it
-accounts for the fact that a person identity does not change across a
-run. The analysis is therefore performed by partitioning the dataset
-into 10 time sequences of equal size. For a given threshold, the algorithm
-is trained and tested incrementally: trained on the first $k$
-sequences (in the chronological order) and tested on the $(k+1)$-th
-sequence. \fref{fig:online} shows the prediction-recall
-curve when averaging the prediction rate of the 10 incremental
-experiments.
+%In this setting, the sequential hypothesis testing (SHT) algorithm is more
+%suitable than the algorithm used in Section~\ref{sec:experiment:offline}, because it
+%accounts for the fact that a person identity does not change across a
+%run.
+We only evaluate SHT in this setting since it already takes consecutive frames
+into account and because it performed better than MoG in the offline setting
+(\xref{sec:experiment:offline}). We partition the dataset into 10 time
+sequences of equal size. For a given threshold, the algorithm is trained and
+tested incrementally: train on the first $k$ sequences (in chronological
+order) and test on the $(k+1)$-th sequence. \fref{fig:online} shows the
+precision-recall curve when averaging the prediction rate over the 10
+incremental experiments. Overall performance is worse than in
+\fref{fig:offline:sht} since the system trains on less data than in
+\xref{sec:experiment:offline} in all but the last step. We still see
+recognition rates mostly above 90\% for group sizes of 3 and 5.
 
 \begin{figure}[t]
 %\subfloat[Mixture of Gaussians]{
@@ -268,14 +283,14 @@ recognition, but not by a large margin.
 
 We use the publicly available REST API of \textsf{face.com} to do face
 recognition on our dataset. Due to the restrictions of the API, for this
 experiment we train on one half of the data and test on the remaining half. For
-comparison, MoG algorithm is run with the same training-testing partitioning of
+comparison, the MoG algorithm is run with the same training-testing partitioning of
 the dataset. In this setting, SHT is not relevant for the comparison, because
 \textsf{face.com} does not give the possibility to mark a sequence of frames as
 belonging to the same run. This additional information would be used by the SHT
 algorithm and would thus bias the results in favor of skeleton recognition.
-However, this result does not take into account the disparity in the number of
-runs which face recognition and skeleton recognition can classify frames, which
-we discuss in the next experiment.
+%However, this result does not take into account the disparity in the number of
+%runs which face recognition and skeleton recognition can classify frames, which
+%we discuss in the next experiment.
 
 \begin{figure}[t]
 \parbox[t]{0.49\linewidth}{
@@ -306,7 +321,7 @@ we discuss in the next experiment.
 
 In the next experiment, we include the runs in which people are walking away
 from the Kinect that we could positively identify. The performance of face
-recognition ourperforms skeleton recognition the previous setting. However,
+recognition outperforms skeleton recognition in the previous setting. However,
 there are many cases where only skeleton recognition is possible. The most
 obvious one is when people are walking away from the camera. Coming back to the
 raw data collected during the experiment design, we manually label the runs of
@@ -327,15 +342,15 @@ well as when they are walking towards the camera.
 %  \label{fig:back}
 %\end{figure}
 
-\fref{fig:back} compares the curve obtained in \xref{sec:experiment:offline}
-with people walking toward the camera, with the curve obtained by running the
+\fref{fig:back} compares the results obtained in \xref{sec:experiment:offline}
+with people walking toward the camera, with the results of the
 same experiment on the dataset of runs of people walking away from the camera.
-The two curves are sensibly the same. However, one could argue that as the two
+The two results are similar. However, one could argue that as the two
 datasets are completely disjoint, the SHT algorithm is not learning the same
 profile for a person walking toward the camera and for a person walking away
 from the camera. The third curve of \fref{fig:back} shows the precision-recall
 curve when training and testing on the combined dataset of runs toward and away
-from the camera.
+from the camera, with similar performance.
 
 \subsection{Reducing the noise}
 
@@ -355,13 +370,13 @@ observation $\bx_i$ is replaced by $\bx_i'$ defined by:
 \begin{equation}
 \bx_i' = \bar{\bx}_{y_i} + \frac{\bx_i-\bar{\bx}_{y_i}}{2}
 \end{equation}
-We believe that a reducing factor of 2 for the noise's variance is
-realistic given the relative low resolution of the Kinect's infrared
-camera.
+We believe that a reducing factor of 2 for the noise's standard deviation is
+realistic given the relatively low resolution of the Kinect's infrared camera.
 
-\fref{fig:var} compares the Precision-recall curve of
-\fref{fig:offline:sht} to the curve of the same experiment run on
-the newly obtained dataset.
+\fref{fig:var} compares the precision-recall curve of \fref{fig:offline:sht} to
+the curve of the same experiment run on the newly obtained dataset. We observe
+a roughly 20\% increase in performance across most thresholds. Note that these
+results would significantly outperform face recognition.
 
 %\begin{figure}[t]
 %  \begin{center}
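The incremental protocol described in the online-setting hunk (partition the data into 10 chronological sequences, train on the first $k$, test on the $(k+1)$-th) can be sketched as follows. This is a minimal illustration, not the paper's code: `evaluate_incremental`, `X`, and `y` are hypothetical names, and the nearest-mean classifier is a stand-in for the paper's MoG/SHT algorithms.

```python
import numpy as np

def evaluate_incremental(X, y, n_seq=10):
    """Split frames chronologically into n_seq equal sequences; train on the
    first k sequences and test on the (k+1)-th, averaging accuracy over the
    n_seq - 1 incremental steps."""
    seq_X = np.array_split(X, n_seq)
    seq_y = np.array_split(y, n_seq)
    accs = []
    for k in range(1, n_seq):
        train_X = np.concatenate(seq_X[:k])
        train_y = np.concatenate(seq_y[:k])
        test_X, test_y = seq_X[k], seq_y[k]
        # Stand-in classifier: assign each test frame to the person whose
        # mean training profile is nearest (not the paper's MoG/SHT).
        labels = np.unique(train_y)
        means = np.stack([train_X[train_y == c].mean(axis=0) for c in labels])
        dists = ((test_X[:, None, :] - means) ** 2).sum(axis=-1)
        pred = labels[np.argmin(dists, axis=1)]
        accs.append((pred == test_y).mean())
    return float(np.mean(accs))

# Hypothetical usage on synthetic data standing in for skeleton features:
rng = np.random.default_rng(1)
y = rng.integers(0, 3, size=300)                    # person label per frame
X = np.eye(3)[y] * 10 + rng.normal(scale=0.1, size=(300, 3))
mean_acc = evaluate_incremental(X, y)               # averaged over 9 steps
```

As in the patch's discussion, early steps train on little data, so per-step accuracy tends to improve as $k$ grows.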
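The noise-reduction transform in the last hunk, $\bx_i' = \bar{\bx}_{y_i} + (\bx_i - \bar{\bx}_{y_i})/2$, can be sketched in a few lines. This is an illustrative implementation, not the paper's code; `reduce_noise`, `X`, and `y` are hypothetical names for the joint-measurement matrix and person labels.

```python
import numpy as np

def reduce_noise(X, y):
    """Shrink each frame's measurements halfway toward that person's mean
    profile: x' = mean_y + (x - mean_y) / 2, mirroring the patch's equation."""
    Xp = np.asarray(X, dtype=float).copy()
    for label in np.unique(y):
        m = (y == label)
        mean = Xp[m].mean(axis=0)
        Xp[m] = mean + (Xp[m] - mean) / 2.0
    return Xp

# Hypothetical data: 200 frames of 20 joint measurements for 3 people.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)
X = rng.normal(size=(3, 20))[y] + rng.normal(scale=0.1, size=(200, 20))
Xp = reduce_noise(X, y)
```

Note that halving every deviation from the class mean halves the within-class standard deviation and therefore divides the within-class variance by 4, while leaving each person's mean profile unchanged.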
