We propose a new dataset and a novel approach to learning hand-object interaction priors for hand and articulated object pose estimation. We first collect a dataset using visual teleoperation, where a human operator interacts directly within a physics simulator to manipulate articulated objects. We record the data and obtain free, accurate annotations of object poses and contact information from the simulator. Our system requires only an iPhone to record human hand motion, so it scales easily and substantially lowers the cost of data and annotation collection. With this data, we learn 3D interaction priors, including a discriminator (in a GAN) that captures the distribution of how object parts are arranged, and a diffusion model that generates contact regions on articulated objects to guide hand pose estimation. These structural and contact priors transfer easily to real-world data with barely any domain gap. Using our data and learned priors, our method significantly improves joint hand and articulated object pose estimation over existing state-of-the-art methods.
We present a new hand-object interaction dataset, ContactArt, and a learning pipeline for hand and articulated object pose estimation. We train our method on the proposed ContactArt dataset and test it on the real-world HOI4D dataset. Our method learns 3D interaction priors from the large-scale dataset. The priors transfer easily to real-world data and generalize to novel objects.
Our method can also estimate object poses accurately on other real-world datasets, such as the RBO and BMVC datasets. Some examples are shown below.
Our method learns the articulation prior with a discriminator during training. At test time, we feed the predicted articulation structures to the discriminator, whose parameters are kept fixed. Then we compute the adversarial loss and back-propagate the gradients to update the estimator. We iterate this process to refine the estimator so that it predicts better articulated object poses.
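The test-time loop described above can be illustrated with a minimal sketch. Everything here is a toy assumption rather than our actual implementation: the discriminator is a frozen logistic unit with hypothetical weights `w` and `b`, the "estimator" simply outputs its parameters `theta` as the predicted structure, and the adversarial loss is taken to be `-log D(theta)` with a hand-derived gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discriminator(x, w, b):
    """Frozen toy discriminator: probability that structure x is plausible."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def adapt_at_test_time(theta, w, b, steps=50, lr=0.5):
    """Refine estimator parameters using only the adversarial loss.

    The toy estimator outputs theta directly, so for L = -log D(theta)
    the gradient w.r.t. theta is -(1 - D(theta)) * w. The discriminator
    weights (w, b) are never updated.
    """
    theta = list(theta)
    for _ in range(steps):
        d = discriminator(theta, w, b)
        grad = [-(1.0 - d) * wi for wi in w]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical frozen weights and an initially implausible prediction.
w, b = [1.0, -2.0], 0.0
theta0 = [-1.0, 1.0]
theta1 = adapt_at_test_time(theta0, w, b)

score_before = discriminator(theta0, w, b)
score_after = discriminator(theta1, w, b)
```

In this sketch the discriminator score of the prediction rises over the iterations while the discriminator itself stays unchanged, which is the behavior the paragraph above relies on. A real estimator would be a neural network updated through automatic differentiation.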
Our method learns the contact prior with a contact diffusion model. At test time, we use the contact diffusion model to predict the contact regions on objects. Then we optimize the hand pose by minimizing the distance between hand vertices and the contact points in the predicted contact regions. In this way, we achieve better performance on hand pose estimation.
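The contact-guided optimization can be sketched in simplified form as follows. This is an illustration under stated assumptions, not our implementation: the hand "pose" is reduced to a global translation `t` of a fixed vertex template, the contact points are given directly instead of being sampled from a diffusion model, and each contact point is matched to its nearest hand vertex at every step. The names `fit_translation` and `contact_loss` are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, points):
    """Return the point in `points` closest to p."""
    return min(points, key=lambda q: sq_dist(p, q))

def translate(template, t):
    return [[v[i] + t[i] for i in range(3)] for v in template]

def contact_loss(hand, contacts):
    """Mean squared distance from each contact point to its nearest hand vertex."""
    return sum(sq_dist(nearest(c, hand), c) for c in contacts) / len(contacts)

def fit_translation(template, contacts, steps=100, lr=0.2):
    """Gradient descent on the translation t that pulls the hand onto the contacts."""
    t = [0.0, 0.0, 0.0]
    for _ in range(steps):
        hand = translate(template, t)
        grad = [0.0, 0.0, 0.0]
        for c in contacts:
            v = nearest(c, hand)  # matching is held fixed within a step
            for i in range(3):
                grad[i] += 2.0 * (v[i] - c[i]) / len(contacts)
        t = [t[i] - lr * grad[i] for i in range(3)]
    return t

# Toy data: two fingertip vertices and two contact points one unit above them.
template = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
contacts = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]

loss_before = contact_loss(template, contacts)
t_opt = fit_translation(template, contacts)
loss_after = contact_loss(translate(template, t_opt), contacts)
```

A full pipeline would optimize articulated hand parameters rather than a single translation, but the objective has the same shape: distances between hand vertices and predicted contact points are driven toward zero.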