Human Pose Estimation (HPE) aims to predict the keypoints of each person from perceived signals. Benefiting from deep learning, RGB-frame-based HPE has developed rapidly, yet it still inherits the drawbacks of frame-based cameras: keypoint predictions become inaccurate in scenarios with motion blur or high dynamic range.
Event cameras, such as the Dynamic Vision Sensor (DVS), are bio-inspired asynchronous sensors that respond to brightness changes at each pixel independently, offering a higher dynamic range (over 100 dB) and a finer temporal resolution (on the order of μs). These properties allow event cameras to overcome the disadvantages of frame-based cameras and maintain stable output in such extreme scenes. Event-based HPE, however, has not been fully studied and holds great potential for extreme scenes and efficiency-critical conditions. In this project, we are the first to estimate 2D human pose directly from the 3D event point cloud.
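Each DVS event is a tuple (x, y, t, p): pixel coordinates, a microsecond timestamp, and a polarity. Treating the timestamp as a third spatial axis turns a window of events into a 3D point cloud. The sketch below illustrates this conversion; the constant-count sampling, padding strategy, and normalization are our assumptions for illustration, not the exact preprocessing of the paper.

```python
import numpy as np

def events_to_point_cloud(x, y, t, num_points=2048):
    """Sample a constant-count window of events and normalize it
    into a (num_points, 3) point cloud with t as the third axis."""
    n = len(t)
    if n >= num_points:
        # Keep the most recent `num_points` events.
        idx = np.arange(n - num_points, n)
    else:
        # Pad short windows by repeating random events.
        idx = np.concatenate([np.arange(n),
                              np.random.randint(0, n, num_points - n)])
    cloud = np.stack([x[idx], y[idx], t[idx]], axis=1).astype(np.float32)
    # Normalize each axis to [0, 1] so spatial and temporal scales match.
    cloud -= cloud.min(axis=0)
    return cloud / np.maximum(cloud.max(axis=0), 1e-6)

# Example: 5000 synthetic events on a 346x260 sensor.
rng = np.random.default_rng(0)
pc = events_to_point_cloud(rng.integers(0, 346, 5000),
                           rng.integers(0, 260, 5000),
                           np.sort(rng.integers(0, 10_000, 5000)))
print(pc.shape)  # (2048, 3)
```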
We explore the feasibility of estimating human pose directly from 3D event point clouds, which, to the best of our knowledge, is the first work from this perspective. We demonstrate the effectiveness of well-known LiDAR point cloud learning backbones for event-point-cloud-based human pose estimation. We further propose a new event representation, the rasterized event point cloud, which preserves 3D features from multiple statistical cues while significantly reducing memory consumption and computational overhead at the same precision.
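The following is a minimal sketch of one way such a rasterization could work: events sharing a pixel (x, y) within a window are merged into a single point carrying statistical cues. The specific features chosen here (event count, mean timestamp, mean polarity) are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def rasterize_events(x, y, t, p):
    """Merge all events at the same pixel into one point with
    statistical features; assumes y < 100000 (true for DVS sensors)."""
    key = x.astype(np.int64) * 100000 + y.astype(np.int64)
    uniq, inv = np.unique(key, return_inverse=True)
    count = np.bincount(inv).astype(np.float64)
    mean_t = np.bincount(inv, weights=t) / count  # mean timestamp per pixel
    mean_p = np.bincount(inv, weights=p) / count  # mean polarity per pixel
    px, py = uniq // 100000, uniq % 100000
    # Each rasterized point: (x, y, mean_t) position + (count, mean_p) cues.
    return np.stack([px, py, mean_t, count, mean_p], axis=1).astype(np.float32)
```

Since many raw events collapse into one rasterized point, the downstream backbone processes far fewer points at the same spatial coverage, which is where the memory and compute savings come from.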
Our PointNet-based method with a 2048-point input achieves 82.46 mm in MPJPE3D on the DHP19 dataset, with a latency of only 12.29 ms on an NVIDIA Jetson Xavier NX edge computing platform, making it well suited for real-time detection with event cameras.
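For reference, MPJPE (Mean Per Joint Position Error) is the average Euclidean distance between predicted and ground-truth joint positions. A minimal sketch, with illustrative shapes (DHP19 annotates 13 joints):

```python
import numpy as np

def mpjpe_3d(pred, gt):
    """Mean Euclidean distance over all frames and joints, in mm.
    pred, gt: arrays of shape (num_frames, num_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

pred = np.random.rand(10, 13, 3) * 1000  # hypothetical predictions in mm
gt = np.random.rand(10, 13, 3) * 1000    # hypothetical ground truth in mm
print(f"MPJPE3D: {mpjpe_3d(pred, gt):.2f} mm")
```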
On new data captured with a different device, our approach still delivers impressive results. Moreover, we find that the DHP19 model fails on this new data despite denoising and filtering, whereas our event-point-cloud models with multidimensional features generalize well to such unseen domains and are robust to noise from both the background and the device itself.
We also find that our event-point-cloud-based method handles static limbs more robustly than the DHP19 model. Limbs that stay static during a movement generate few events, so they become invisible when events are accumulated into a constant-count event frame. In the event point cloud, however, these few events are retained as explicit points, which the point-wise backbone can process more effectively.
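The sketch below illustrates the contrast: accumulating a constant-count window into a 2D frame lets sparse regions (e.g. a static limb emitting few events) fade into near-zero pixels, whereas the point cloud keeps every event as an explicit point. The sensor resolution is illustrative.

```python
import numpy as np

def constant_count_frame(x, y, height=260, width=346):
    """Accumulate a constant-count event window into a 2D frame."""
    frame = np.zeros((height, width), dtype=np.float32)
    np.add.at(frame, (y, x), 1.0)       # per-pixel event counts
    return frame / max(frame.max(), 1)  # normalized intensity image

# In a 2048-event window where only 5 events come from a nearly static
# limb, those 5 events are 5 faint pixels in the frame, but remain
# 5 full points in the event point cloud fed to the backbone.
```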
@inproceedings{chen2022EPP,
title={Efficient Human Pose Estimation via 3D Event Point Cloud},
author={Chen, Jiaan and Shi, Hao and Ye, Yaozu and Yang, Kailun and Sun, Lei and Wang, Kaiwei},
booktitle={2022 International Conference on 3D Vision (3DV)},
year={2022}
}