
Human Pose Estimation Using Multiple RGB-D Sensors

Andrew Phan, PhD Candidate
Centre for Intelligent Machines McGill University

December 9, 2014 at  2:00 PM
Room 603, McConnell Engineering

The context for this presentation is the Collaborative, Human-focused, Assistive Robotics for Manufacturing (CHARM) project, a collaboration between General Motors Canada, the University of British Columbia, Université Laval, and McGill University that aims to revolutionize the conventional assembly line by designing one in which a worker and a robotic assistant can safely share the same workspace and work together. To accomplish this, the robotic assistant must be provided with the worker's position and configuration, known as the pose.

In this PhD seminar, we present our research progress on estimating the 3D articulated full-body pose of up to two subjects at interactive rates using a calibrated sensor network. The sensor network consists of up to four Microsoft Kinect RGB-D cameras placed around a simulated workcell in an approximately orthogonal configuration. To initialize the pose estimation, the subject enters the workcell and adopts an unambiguous T-pose. The focus of this research is on updating the estimated pose over time, called pose refinement, while mitigating the negative effects of three types of undesired interactions: a) subject-subject, b) subject-object, and c) self-contact, which we collectively call the contact problem. Our goal is to resolve the ambiguities caused by these contacts in order to produce an accurate estimate of the current pose.

In particular, we propose a novel multi-view flow-based voting scheme to constrain the problem and track the displacement of individual body parts across frames. In a first implementation, we compute the optical flow, the apparent motion of objects in the scene, on the intensity images, and obtain promising preliminary results when combining the output from the four cameras. In the future, we hope to incorporate additional constraints using 3D-based algorithms such as range flow and iterative closest point (ICP), as well as other 2D-based algorithms such as the scale-invariant feature transform (SIFT). We also propose a new, publicly accessible Reparti Motion Tracking Database with synchronized multi-view RGB-D and Vicon motion capture data, containing varying levels of undesired contacts, to serve as ground truth.

We have optimized our C++ application, which is currently CPU-bound and operates at ~8 Hz with ~650 ms latency on an AMD X4 640 3 GHz processor released in 2010. We also use the OpenCV GPU implementation to obtain real-time optical flow across the four views, thanks to an Nvidia Titan Black graphics card. Our preliminary skeleton joint errors with respect to the Vicon ground truth currently range from 0.15 to 0.35 m depending on the dataset. To conclude, we describe how we hope to reduce these errors while maintaining performance.
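The core fusion step of a multi-view voting scheme like the one above can be sketched as follows. This is a minimal illustrative sketch, not the authors' actual formulation: it assumes each camera contributes a 3D displacement hypothesis for a body part (e.g. back-projected from that view's 2D optical flow) together with a confidence weight, and fuses them by confidence-weighted averaging. The `FlowVote` structure, the weighting strategy, and the averaging rule are all assumptions made for illustration.

```cpp
#include <array>
#include <vector>

// Hypothetical per-camera vote for one body part: a 3D displacement
// (in metres) back-projected from that view's 2D optical flow, plus a
// confidence weight (which might reflect visibility or flow quality).
struct FlowVote {
    std::array<double, 3> displacement;
    double weight;
};

// Fuse the votes from all cameras into a single 3D displacement via a
// confidence-weighted average. A camera that cannot see the body part
// (e.g. due to occlusion or contact) contributes zero weight, so the
// remaining views dominate the estimate.
std::array<double, 3> fuseVotes(const std::vector<FlowVote>& votes) {
    std::array<double, 3> fused{0.0, 0.0, 0.0};
    double totalWeight = 0.0;
    for (const auto& v : votes) {
        for (int i = 0; i < 3; ++i)
            fused[i] += v.weight * v.displacement[i];
        totalWeight += v.weight;
    }
    if (totalWeight > 0.0)
        for (int i = 0; i < 3; ++i)
            fused[i] /= totalWeight;
    return fused;
}
```

With four approximately orthogonal cameras, down-weighting an occluded view lets the other three still constrain the displacement, which is one way such a scheme could mitigate the contact problem.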