1Huawei Technologies Co., Ltd. 2Huawei Central Media Technology Institute
*Equal Contributions †Corresponding Author
Abstract
Motion capture now underpins content creation far beyond digital humans, yet most pipelines remain species- or template-specific.
We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt,
the goal is to reconstruct a rotation-based animation (e.g., BVH) that directly drives the specific asset. We present MoCapAnything,
a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations
via constraint-aware Inverse Kinematics (IK) Fitting. MoCapAnything comprises three learnable modules and a lightweight IK stage:
a Reference Prompt Encoder that distills per-joint queries from the asset’s skeleton, mesh, and rendered image set; a Video Feature
Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the modality gap between RGB tokens
and the point-cloud–like joint space; and a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories.
We also curate Truebones Zoo with 1,038 motion clips, each providing a standardized skeleton–mesh–rendered-video triad. Experiments on
in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits non-trivial cross-species
retargeting across heterogeneous rigs, offering a scalable path toward prompt-based 3D motion capture for arbitrary assets.
Method
Figure: Overview of our modular pipeline.
Detailed architecture of our method. A multi-modal Reference Prompt Encoder fuses mesh, skeleton, and appearance of the target
asset into per-joint queries. A monocular video is converted into a 4D mesh sequence, and both mesh and video features are extracted. The
Unified Motion Decoder fuses these signals via multi-branch attention to predict 3D keypoints, which are converted to asset-specific joint
rotations via an optimization-based IK layer.
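To make the factorized design concrete, the sketch below outlines the forward pass in PyTorch-style code. Every module name, shape, and hyperparameter here is an illustrative assumption for exposition, not the released implementation: the prompt encoder, video extractor, and multi-branch attention are reduced to standard transformer blocks.

import torch.nn as nn

class MoCapAnythingSketch(nn.Module):
    # Illustrative forward pass: per-joint prompt queries cross-attend over
    # video features to produce 3D keypoints. Shapes and modules are assumptions.
    def __init__(self, d_model=512, vid_dim=1024):
        super().__init__()
        # Stand-in for the Reference Prompt Encoder (skeleton/mesh/render fusion omitted)
        self.prompt_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4)
        # Stand-in for the Video Feature Extractor's projection head
        self.video_proj = nn.Linear(vid_dim, d_model)
        # Stand-in for the Unified Motion Decoder (multi-branch attention
        # simplified to one cross-attention stack)
        self.motion_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.to_xyz = nn.Linear(d_model, 3)

    def forward(self, joint_queries, video_feats):
        # joint_queries: (B, J, d_model) queries distilled from the asset prompt
        # video_feats:   (B, N, vid_dim) dense descriptors, incl. coarse 4D-mesh cues
        q = self.prompt_encoder(joint_queries)
        ctx = self.video_proj(video_feats)
        fused = self.motion_decoder(q, ctx)   # joints attend over the video context
        return self.to_xyz(fused)             # (B, J, 3) keypoints for one frame;
                                              # temporal coherence handling omitted

The predicted keypoints are then handed to the constraint-aware IK stage (Section 3) to recover asset-specific rotations.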
1. Monocular Motion Capture
Below we show MoCapAnything results on zoo animals, Objaverse assets, and in-the-wild videos.
1.1 Comparison with GenZoo
Video panels: GT Video | GT Skeleton | Ours Skeleton | GenZoo Skeleton
Our predictions show smoother trajectories and greater stability across species than GenZoo.
To the best of our knowledge, GenZoo currently supports the widest range of animal species among existing methods, but it is restricted to quadruped skeletons. In contrast, our framework generalizes to arbitrary skeletal structures, including bipeds, birds, reptiles, and even non-biological rigs.
1.2 Mocap: Truebones Zoo
Video panels: GT Video | GT Skeleton | Ours (two examples per row)
1.3 Mocap: Objaverse
Video panels: GT Video | GT Skeleton | Ours (two examples per row)
1.4 Mocap: In-the-Wild
Video panels: GT Video | Ours Skeleton (two examples per row, three rows)
Our method demonstrates robust motion capture across humans and animals, and generalizes to in-the-wild videos spanning a wide range of species: flying, running, and swimming animals; bipeds, quadrupeds, multi-legged creatures, and even limbless skeletons. It effectively supports arbitrary skeletal structures.
2. Cross-Species Motion Retargeting
Our model is never trained for retargeting, yet its design naturally enables motion transfer across arbitrary skeletons; a minimal sketch of the underlying prompt-swapping mechanism follows below. Here we showcase a variety of cross-species transfers: animal-to-animal, human-to-animal, animal-to-human, and even wild-video inputs. The results often reveal entertaining combinations: quadrupeds attempting to fly, birds trying to run, animals performing human actions, and humans imitating animal behaviors such as rolling, gliding, or even slithering like a snake.
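Because the decoder is conditioned on per-joint queries distilled from the reference asset, retargeting needs no extra machinery: keep the source video, swap the asset prompt. A minimal usage sketch, where extract_video_feats, encode_prompt, model, and ik_fit are hypothetical names standing in for the corresponding modules:

# Hypothetical prompt-swapping retargeting; all names below are placeholders.
src_feats = extract_video_feats(lion_video)          # motion source: a lion clip
eagle_queries = encode_prompt(eagle_asset)           # target rig: eagle skeleton, mesh, renders
keypoints = model(eagle_queries, src_feats)          # lion motion expressed on eagle joints
animation = ik_fit(eagle_asset.skeleton, keypoints)  # rotations for BVH via constraint-aware IK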
2.1 Zoo → Zoo
Retargeting grid. Source motions: Eagle, Jaguar, Lion, Parrot. Reference assets: Crocodile, Dog, Eagle, Jaguar.
Interesting results: birds attempting quadruped walking, and leopards attempting to fly.
2.2 Human → Zoo
Retargeting grid. Reference assets: Lion, Ostrich, PolarBearB, Turtle. Source human motions: Human_1, Human_2, Human_3, Human_4.
Interesting results: animals can mimic human actions, whether they are birds or quadrupeds.
2.3 Zoo → Human (Part 1)
Retargeting grid. Reference assets: Human_1, Human_2. Source zoo motions: Hamster, Ostrich, PolarBearB, Turtle.
Interesting results: humans can mimic animal behaviors—rolling like a reptile, flapping like a bird, or walking like a quadruped.
2.4 Zoo → Human (Part 2)
Retargeting grid. Reference assets: Human_1, Human_2. Source zoo motions: Anaconda, Horse, Lion, Parrot.
Interesting results: humans can even imitate the movements of a snake. (Note: the horse skeleton includes numerous non-anatomical joints, e.g., reins and saddle attachments, which increases the difficulty of retargeting.)
2.5 Wild → Human
Retargeting grid. Reference assets: Human_1, Human_2. Source wild-video motions: Eagle, Leopard, Lion, PolarBear.
Our method performs well on real in-the-wild videos as well, provided that the foreground subject is pre-segmented.
3. Constraint-Aware IK Fitting
We convert predicted 3D joint trajectories into rotation-based BVH animations while
respecting rigid-body constraints, hierarchy, and joint limits.
Video panels: Input Joints | IK Result (two examples per row)
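One way to realize such a stage is a direct gradient-based fit: optimize per-joint axis-angle rotations so that forward kinematics over the asset's hierarchy reproduces the predicted keypoints. Bone lengths stay rigid by construction (the rest-pose offsets are fixed), and a soft penalty stands in for joint limits. The following is a minimal single-frame sketch under these assumptions, not the actual solver:

import torch

def axis_angle_to_matrix(aa):
    # Rodrigues' formula: (J, 3) axis-angle vectors -> (J, 3, 3) rotation matrices.
    theta = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = aa / theta                                   # unit rotation axes
    K = torch.zeros(aa.shape[0], 3, 3)               # skew-symmetric cross-product matrices
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    t = theta.unsqueeze(-1)
    I = torch.eye(3).expand_as(K)
    return I + torch.sin(t) * K + (1 - torch.cos(t)) * (K @ K)

def fk(R_local, offsets, parents):
    # Forward kinematics: local rotations + fixed bone offsets -> global joint positions.
    # Fixed offsets enforce rigid bone lengths; global root translation is omitted.
    J = offsets.shape[0]
    R_glob, p_glob = [None] * J, [None] * J
    for j in range(J):                               # assumes parents[j] < j (topological order)
        if parents[j] < 0:
            R_glob[j], p_glob[j] = R_local[j], offsets[j]
        else:
            R_glob[j] = R_glob[parents[j]] @ R_local[j]
            p_glob[j] = p_glob[parents[j]] + R_glob[parents[j]] @ offsets[j]
    return torch.stack(p_glob)

def ik_fit_frame(target_xyz, offsets, parents, steps=200, lr=0.05, limit=2.5):
    # Fit per-joint axis-angle rotations to one frame of predicted keypoints.
    aa = (0.01 * torch.randn(offsets.shape[0], 3)).requires_grad_()
    opt = torch.optim.Adam([aa], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pos = fk(axis_angle_to_matrix(aa), offsets, parents)
        loss = ((pos - target_xyz) ** 2).sum()                    # keypoint fitting term
        loss = loss + 0.01 * torch.relu(aa.abs() - limit).sum()   # soft joint-limit penalty
        loss.backward()
        opt.step()
    return aa.detach()                               # per-joint rotations, e.g. for BVH export

Running this per frame, optionally warm-starting each frame from the previous solution, yields the rotation curves a BVH exporter can consume.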
Our Other Works
SWiT-4D:
Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation.
Summary:
A lightweight temporal adaptation framework that, with only a single sequence for fine-tuning, upgrades existing image-to-3D models into temporally coherent video-to-4D generators via sliding-window modeling, without introducing additional parameters.
Project Page
·
arXiv
Citation
If you find our work useful, please cite:
@article{gong2025mocapanything,
title={MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos},
author={Gong, Kehong and Wen, Zhengyu and He, Weixia and Xu, Mingxi and
Wang, Qi and Zhang, Ning and Li, Zhengyu and
Lian, Dongze and Zhao, Wei and He, Xiaoyu and Zhang, Mingyuan},
journal={arXiv preprint arXiv:2512.10881},
year={2025}
}
Acknowledgement
We referred to the project page of Nerfies when creating this project page.