Motion capture, the process of recording people’s movements, traditionally requires equipment, cameras, and software tailored for the purpose. But researchers at the Max Planck Institute and Facebook Reality Labs claim they’ve developed a machine learning algorithm, PhysCap, that works with any off-the-shelf DSLR camera running at 25 frames per second. In a paper expected to be published in the journal ACM Transactions on Graphics in November 2020, the team details PhysCap, which they say is the first approach of its kind for real-time, physically plausible 3D motion capture that accounts for environmental constraints like floor placement. It ostensibly achieves state-of-the-art accuracy on existing benchmarks and qualitatively improves stability at training time.
Motion capture is a core part of modern film, game, and even app development. There have been countless attempts at making motion capture practical for amateur videographers, from a $2,500 suit to a commercially available framework that leverages Microsoft’s depth-sensing Kinect. But they’re imperfect: even the best human pose-estimating systems struggle to produce smooth animations, yielding 3D models with improper balance, inaccurate body leaning, and other artifacts of instability.
By contrast, PhysCap reportedly captures anatomically correct poses that adhere to physics constraints.
In its first stage, PhysCap estimates 3D body poses in a purely kinematic way with a convolutional neural network (CNN) that infers combined 2D and 3D joint positions from a video. After some refinement, the second stage commences, in which foot contact and motion states are predicted for every frame by a second CNN. (This CNN detects heel and forefoot placement on the ground and classifies the observed poses into “stationary” or “non-stationary” categories.) In the final stage, the kinematic pose estimates from the first stage (in both 2D and 3D) are reproduced as closely as possible while accounting for physical constraints such as gravity, collisions, and foot placement.
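To make the role of the contact labels concrete, here is a minimal Python sketch of how a per-frame contact flag and a floor plane might be used to correct a kinematic foot estimate. It is a toy illustration only, not PhysCap’s actual physics-based optimization: the names `FootSample`, `apply_floor_constraints`, and `FLOOR_Y`, the single height value per foot, and the sample numbers are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

# Assumed height of the calibrated floor plane, in meters (hypothetical value).
FLOOR_Y = 0.0


@dataclass
class FootSample:
    """One frame of (hypothetical) input for a single foot."""
    y: float          # foot height from the kinematic estimate (stage one)
    in_contact: bool  # predicted ground contact for this frame (stage two)


def apply_floor_constraints(samples: List[FootSample]) -> List[float]:
    """Return a corrected foot height for every frame."""
    corrected = []
    for sample in samples:
        if sample.in_contact:
            # A foot flagged as touching the ground is pinned to the floor.
            corrected.append(FLOOR_Y)
        else:
            # A foot in the air keeps its estimate but may not sink below the floor.
            corrected.append(max(sample.y, FLOOR_Y))
    return corrected


# Made-up sample frames: slightly floating, slightly penetrating, clearly airborne.
frames = [
    FootSample(y=0.03, in_contact=True),
    FootSample(y=-0.02, in_contact=False),
    FootSample(y=0.15, in_contact=False),
]
print(apply_floor_constraints(frames))  # [0.0, 0.0, 0.15]
```

In this simplified picture, a wrong contact flag would directly produce a wrong foot position, which hints at why contact prediction matters so much to the final result.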
In experiments, the researchers tested PhysCap with a Sony DSC-RX0 camera and a PC with 32GB of RAM, a GeForce RTX 2070 graphics card, and an eight-core Ryzen 7 processor, with which they captured and processed six motion sequences performed by two actors. The coauthors found that while PhysCap generalized well across scenes with different backgrounds, it sometimes mispredicted foot contacts and therefore foot velocity. Other limitations that arose were the need for a calibrated floor plane and a ground plane in the scene, which the researchers note is harder to find outdoors.
To address these limitations, the researchers plan to investigate modelling hand-scene interactions and contacts between legs and body in sitting and lying poses. “Since the output of PhysCap is environment-aware and the returned root position is global, it is directly suitable for virtual character animation, without any further post-processing,” the researchers wrote. “Here, applications in character animation, virtual and augmented reality, telepresence, or human-computer interaction, are only a few examples of high importance for graphics.”