MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Authors

Kehong Gong*1, Zhengyu Wen*2, Weixia He2, Mingxi Xu2, Qi Wang2, Ning Zhang2, Zhengyu Li2,
Dongze Lian2, Wei Zhao2, Xiaoyu He2, Mingyuan Zhang†2

1Huawei Technologies Co., Ltd.
2Huawei Central Media Technology Institute

*Equal contribution    †Corresponding author

Abstract

Motion capture now underpins content creation far beyond digital humans, yet most pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation (e.g., BVH) that directly drives that specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware Inverse Kinematics (IK) fitting. MoCapAnything comprises three learnable modules and a lightweight IK stage: a Reference Prompt Encoder that distills per-joint queries from the asset’s skeleton, mesh, and rendered image set; a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the modality gap between RGB tokens and the point-cloud–like joint space; and a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo into 1,038 motion clips, each providing a standardized skeleton–mesh–rendered-video triad. Experiments on in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits non-trivial cross-species retargeting across heterogeneous rigs, offering a scalable path toward prompt-based 3D motion capture for arbitrary assets.

Method

Figure: Detailed architecture of our method. A multi-modal Reference Prompt Encoder fuses the mesh, skeleton, and appearance of the target asset into per-joint queries. The monocular video is converted into a 4D mesh sequence, and features are extracted from both the mesh and the video. The Unified Motion Decoder fuses these signals via multi-branch attention to predict 3D keypoints, which an optimization-based IK layer converts into asset-specific joint rotations.
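To make the data flow concrete, below is a minimal PyTorch sketch of the factorized design. It is not the authors' implementation: the tensor shapes, module widths, and class interfaces are our own simplifying assumptions, and the 4D-mesh branch is folded into a generic video-feature tensor.

# Minimal sketch (assumed shapes and layers, not the released code):
# per-joint queries from the asset prompt cross-attend to per-frame
# video features, yielding one 3D keypoint per joint per frame.
import torch
import torch.nn as nn

class ReferencePromptEncoder(nn.Module):
    """Distills per-joint queries from tokens of the asset's skeleton/mesh/renders."""
    def __init__(self, d_model=256, n_joints=24):
        super().__init__()
        self.joint_embed = nn.Embedding(n_joints, d_model)
        self.fuse = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, asset_tokens):                    # (B, N_asset, D)
        q = self.joint_embed.weight.unsqueeze(0).expand(asset_tokens.size(0), -1, -1)
        fused = self.fuse(torch.cat([q, asset_tokens], dim=1))
        return fused[:, : q.size(1)]                    # (B, J, D) per-joint queries

class UnifiedMotionDecoder(nn.Module):
    """Cross-attends joint queries to each frame's features; regresses keypoints."""
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Linear(d_model, 3)

    def forward(self, joint_q, video_feats):            # (B, J, D), (B, T, N, D)
        keypoints = []
        for t in range(video_feats.size(1)):            # per-frame cross-attention
            fused, _ = self.attn(joint_q, video_feats[:, t], video_feats[:, t])
            keypoints.append(self.head(fused))          # (B, J, 3)
        return torch.stack(keypoints, dim=1)            # (B, T, J, 3) trajectories

B, T, N, J, D = 1, 8, 64, 24, 256
enc, dec = ReferencePromptEncoder(D, J), UnifiedMotionDecoder(D)
asset_tokens = torch.randn(B, 128, D)                   # stand-in prompt features
video_feats = torch.randn(B, T, N, D)                   # stand-in visual descriptors
traj = dec(enc(asset_tokens), video_feats)
print(traj.shape)                                       # torch.Size([1, 8, 24, 3])

The key property this sketch illustrates is that the output skeleton is fixed entirely by the per-joint queries, which is what later enables retargeting by simply swapping the asset prompt.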

1. Monocular Motion Capture

Below we show MoCapAnything results on zoo animals, Objaverse assets, and in-the-wild videos.

1.1 Comparison with GenZoo

GT Video | GT Skeleton | Ours Skeleton | GenZoo Skeleton
Our predictions show smoother trajectories and stronger multi-species stability than GenZoo's. To the best of our knowledge, GenZoo supports the widest range of animal species among existing methods, yet it is restricted to quadruped skeletons. In contrast, our framework naturally generalizes to arbitrary skeletal structures, including bipeds, birds, reptiles, and even non-biological rigs.

1.2 Mocap: Truebones Zoo

GT Video | GT Skeleton | Ours    GT Video | GT Skeleton | Ours

1.3 Mocap: Objaverse

GT Video | GT Skeleton | Ours    GT Video | GT Skeleton | Ours

1.4 Mocap: In-the-Wild

GT Video | Ours Skeleton    GT Video | Ours Skeleton
Our method demonstrates robust motion capture across humans and animals, and generalizes to in-the-wild videos covering a wide variety of subjects: flying, running, and swimming motions performed by bipeds, quadrupeds, multi-legged creatures, and even limbless skeletons. It thus effectively supports arbitrary skeletal structures.

2. Cross-Species Motion Retargeting

Our model is never trained for retargeting, yet its prompt-based design naturally enables motion transfer across skeletons: since the reference prompt alone determines the output rig, any source video can be paired with any target asset (see the sketch after this paragraph). Here we showcase a variety of cross-species behaviors, including animal-to-animal, human-to-animal, animal-to-human, and even wild-video inputs. The results often reveal entertaining combinations: quadrupeds attempting to fly, birds trying to run, animals performing human actions, and humans imitating animal behaviors such as rolling, gliding, or even slithering like a snake.
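The toy sketch below illustrates this prompt-swap mechanism; mocap and its inputs are hypothetical stand-ins for our modules, not the released interface.

# Toy illustration of prompt-swap retargeting (hypothetical stand-ins):
# the same source clip drives any rig, because only the per-joint
# queries decide the output skeleton.
import torch

def mocap(video_feats: torch.Tensor, joint_queries: torch.Tensor) -> torch.Tensor:
    """Stand-in for the full pipeline: returns (T, J, 3) joint trajectories."""
    T, J = video_feats.shape[0], joint_queries.shape[0]
    return torch.zeros(T, J, 3)  # placeholder motion

video_feats = torch.randn(8, 256)          # one source clip (any species)
for rig, n_joints in [("Parrot", 35), ("Human_1", 24)]:
    queries = torch.randn(n_joints, 256)   # per-joint queries for this rig
    print(rig, mocap(video_feats, queries).shape)  # motion on the rig's own joints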

2.1 Zoo → Zoo

Target skeletons: Eagle, Jaguar, Lion, Parrot
Reference motions: Crocodile, Dog, Eagle, Jaguar
Interesting results: birds attempting quadruped walking, and jaguars attempting to fly.

2.2 Human → Zoo

Target skeletons: Lion, Ostrich, PolarBearB, Turtle
Reference motions: Human_1, Human_2, Human_3, Human_4
Interesting results: animals, whether birds or quadrupeds, can mimic human actions.

2.3 Zoo → Human (Part 1)

Target skeletons: Human_1, Human_2
Reference motions: Hamster, Ostrich, PolarBearB, Turtle
Interesting results: humans can mimic animal behaviors, such as rolling like a reptile, flapping like a bird, or walking like a quadruped.

2.4 Zoo → Human (Part 2)

Target skeletons: Human_1, Human_2
Reference motions: Anaconda, Horse, Lion, Parrot
Interesting results: humans can even imitate the movements of a snake. Note that the horse skeleton includes numerous non-anatomical joints (e.g., reins and saddle attachments), which increases the difficulty of retargeting.

2.5 Wild → Human

Target skeletons: Human_1, Human_2
Reference motions: Eagle, Leopard, Lion, PolarBear
Our method also performs well on real in-the-wild videos, provided that the foreground subject is pre-segmented; a minimal preprocessing sketch is given below.
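For reference, here is a minimal OpenCV preprocessing sketch. It is not part of our pipeline; the file paths and per-frame mask layout are hypothetical, and the masks themselves would come from any off-the-shelf video segmenter.

# Apply pre-computed per-frame foreground masks (8-bit, frame-sized PNGs)
# to an in-the-wild clip before feeding it to the model.
import os
import cv2

def segment_clip(video_path: str, mask_dir: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                                   # end of video
            break
        mask = cv2.imread(os.path.join(mask_dir, f"{idx:05d}.png"),
                          cv2.IMREAD_GRAYSCALE)
        if mask is not None:
            fg = cv2.bitwise_and(frame, frame, mask=mask)   # zero out background
            cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.png"), fg)
        idx += 1
    cap.release()

segment_clip("wild_clip.mp4", "masks/", "segmented/")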

3. Constraint-Aware IK Fitting

We convert the predicted 3D joint trajectories into rotation-based BVH animations while respecting rigid-body constraints, the skeletal hierarchy, and joint limits; a toy sketch of this optimization-based fit follows the examples below.
Input Joints | IK Result    Input Joints | IK Result
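The sketch below conveys the flavor of this fit on a toy three-joint chain; the chain, offsets, joint limits, and loss weights are illustrative assumptions, not the paper's rigs or hyperparameters. Bone lengths stay rigid because the offsets are fixed, the hierarchy is enforced by forward kinematics, and joint limits enter as a soft penalty.

# Toy optimization-based IK fit: recover per-joint rotations whose
# forward kinematics matches predicted 3D keypoints.
import torch

def axis_angle_to_matrix(aa):
    # Rodrigues' formula: (J, 3) axis-angle -> (J, 3, 3) rotation matrices.
    theta = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = aa / theta
    K = torch.zeros(aa.shape[0], 3, 3)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    s, c = theta.sin().unsqueeze(-1), theta.cos().unsqueeze(-1)
    return torch.eye(3).expand_as(K) + s * K + (1 - c) * (K @ K)

def forward_kinematics(parents, offsets, aa, root):
    # Compose local rotations down the hierarchy; return global joint positions.
    R = axis_angle_to_matrix(aa)
    glob_R, pos = [None] * len(parents), [None] * len(parents)
    for j, p in enumerate(parents):
        if p < 0:                                  # root joint
            glob_R[j], pos[j] = R[j], root
        else:                                      # child inherits parent frame
            glob_R[j] = glob_R[p] @ R[j]
            pos[j] = pos[p] + glob_R[p] @ offsets[j]
    return torch.stack(pos)

parents = [-1, 0, 1]                               # toy chain: root -> mid -> tip
offsets = torch.tensor([[0., 0., 0.], [0., 1., 0.], [0., 1., 0.]])  # fixed bones
root = torch.zeros(3)
target = torch.tensor([[0., 0., 0.], [0., .7, .7], [0., 1.4, .7]])  # predicted joints
limit = 1.6                                        # ~90 deg soft per-axis limit

aa = (0.01 * torch.randn(len(parents), 3)).requires_grad_()  # rotations to optimize
opt = torch.optim.Adam([aa], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    pred = forward_kinematics(parents, offsets, aa, root)
    loss = ((pred - target) ** 2).sum()                      # keypoint-fitting term
    loss = loss + 0.01 * torch.relu(aa.abs() - limit).sum()  # joint-limit penalty
    loss.backward()
    opt.step()
print(forward_kinematics(parents, offsets, aa.detach(), root))

Applied per frame, the fitted rotations can then be converted to the target rig's rotation channels and written out as a BVH animation.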

Citation

If you find our work useful, please cite:

@article{gong2025mocapanything,
  title={MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos},
  author={Gong, Kehong and Wen, Zhengyu and He, Weixia and Xu, Mingxi and 
          Wang, Qi and Zhang, Ning and Li, Zhengyu and 
          Lian, Dongze and Zhao, Wei and He, Xiaoyu and Zhang, Mingyuan},
  journal={arXiv preprint arXiv:2501.xxxxx},
  year={2025}
}