MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Authors

Kehong Gong*1, Zhengyu Wen*2, Weixia He2, Mingxi Xu2, Qi Wang2, Ning Zhang2, Zhengyu Li2,
Dongze Lian2, Wei Zhao2, Xiaoyu He2, Mingyuan Zhang†2

1Huawei Technologies Co., Ltd.
2Huawei Central Media Technology Institute

*Equal contribution    †Corresponding author

Abstract

Motion capture now underpins content creation far beyond digital humans, yet most pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation (e.g., BVH) that directly drives that specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware Inverse Kinematics (IK) fitting. MoCapAnything comprises three learnable modules and a lightweight IK stage: a Reference Prompt Encoder that distills per-joint queries from the asset’s skeleton, mesh, and rendered image set; a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the modality gap between RGB tokens and the point-cloud–like joint space; and a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo into 1,038 motion clips, each providing a standardized skeleton–mesh–rendered-video triad. Experiments on in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits non-trivial cross-species retargeting across heterogeneous rigs, offering a scalable path toward prompt-based 3D motion capture for arbitrary assets.

Method

Figure: Detailed architecture of our method. A multi-modal Reference Prompt Encoder fuses the mesh, skeleton, and appearance of the target asset into per-joint queries. The monocular video is converted into a 4D mesh sequence, and features are extracted from both the mesh and the video. The Unified Motion Decoder fuses these signals via multi-branch attention to predict 3D keypoints, which an optimization-based IK layer converts into asset-specific joint rotations.
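To make the data flow concrete, below is a minimal PyTorch sketch of the factorized design. It is not the authors' implementation: the tensor shapes, module widths, and class interfaces are our own simplifying assumptions, and the 4D-mesh branch is folded into a generic video-feature tensor.

# Minimal sketch (assumed shapes and layers, not the released code):
# per-joint queries from the asset prompt cross-attend to per-frame
# video features, yielding one 3D keypoint per joint per frame.
import torch
import torch.nn as nn

class ReferencePromptEncoder(nn.Module):
    """Distills per-joint queries from tokens of the asset's skeleton/mesh/renders."""
    def __init__(self, d_model=256, n_joints=24):
        super().__init__()
        self.joint_embed = nn.Embedding(n_joints, d_model)
        self.fuse = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, asset_tokens):                    # (B, N_asset, D)
        q = self.joint_embed.weight.unsqueeze(0).expand(asset_tokens.size(0), -1, -1)
        fused = self.fuse(torch.cat([q, asset_tokens], dim=1))
        return fused[:, : q.size(1)]                    # (B, J, D) per-joint queries

class UnifiedMotionDecoder(nn.Module):
    """Cross-attends joint queries to each frame's features; regresses keypoints."""
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Linear(d_model, 3)

    def forward(self, joint_q, video_feats):            # (B, J, D), (B, T, N, D)
        keypoints = []
        for t in range(video_feats.size(1)):            # per-frame cross-attention
            fused, _ = self.attn(joint_q, video_feats[:, t], video_feats[:, t])
            keypoints.append(self.head(fused))          # (B, J, 3)
        return torch.stack(keypoints, dim=1)            # (B, T, J, 3) trajectories

B, T, N, J, D = 1, 8, 64, 24, 256
enc, dec = ReferencePromptEncoder(D, J), UnifiedMotionDecoder(D)
asset_tokens = torch.randn(B, 128, D)                   # stand-in prompt features
video_feats = torch.randn(B, T, N, D)                   # stand-in visual descriptors
traj = dec(enc(asset_tokens), video_feats)
print(traj.shape)                                       # torch.Size([1, 8, 24, 3])

The key property this sketch illustrates is that the output skeleton is fixed entirely by the per-joint queries, which is what later enables retargeting by simply swapping the asset prompt.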

1. Monocular Motion Capture

Below we show MoCapAnything results on zoo animals, Objaverse assets, and in-the-wild videos.

1.1 Comparison with GenZoo

GT Video | GT Skeleton | Ours Skeleton | GenZoo Skeleton
Our predictions show smoother trajectories and stronger multi-species stability than GenZoo's. To the best of our knowledge, GenZoo supports the widest range of animal species among existing methods, yet it is restricted to quadruped skeletons. In contrast, our framework naturally generalizes to arbitrary skeletal structures, including bipeds, birds, reptiles, and even non-biological rigs.

1.2 Mocap: Truebones Zoo

GT Video | GT Skeleton | Ours    GT Video | GT Skeleton | Ours

1.3 Mocap: Objaverse

GT Video | GT Skeleton | Ours    GT Video | GT Skeleton | Ours

1.4 Mocap: In-the-Wild

GT Video | Ours Skeleton    GT Video | Ours Skeleton
Our method demonstrates robust motion capture across humans and animals, and generalizes to in-the-wild videos covering a wide variety of subjects: flying, running, and swimming motions performed by bipeds, quadrupeds, multi-legged creatures, and even limbless skeletons. It thus effectively supports arbitrary skeletal structures.

2. Cross-Species Motion Retargeting

Our model is never trained for retargeting, yet its prompt-based design naturally enables motion transfer across skeletons: since the reference prompt alone determines the output rig, any source video can be paired with any target asset (see the sketch after this paragraph). Here we showcase a variety of cross-species behaviors, including animal-to-animal, human-to-animal, animal-to-human, and even wild-video inputs. The results often reveal entertaining combinations: quadrupeds attempting to fly, birds trying to run, animals performing human actions, and humans imitating animal behaviors such as rolling, gliding, or even slithering like a snake.
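The toy sketch below illustrates this prompt-swap mechanism; mocap and its inputs are hypothetical stand-ins for our modules, not the released interface.

# Toy illustration of prompt-swap retargeting (hypothetical stand-ins):
# the same source clip drives any rig, because only the per-joint
# queries decide the output skeleton.
import torch

def mocap(video_feats: torch.Tensor, joint_queries: torch.Tensor) -> torch.Tensor:
    """Stand-in for the full pipeline: returns (T, J, 3) joint trajectories."""
    T, J = video_feats.shape[0], joint_queries.shape[0]
    return torch.zeros(T, J, 3)  # placeholder motion

video_feats = torch.randn(8, 256)          # one source clip (any species)
for rig, n_joints in [("Parrot", 35), ("Human_1", 24)]:
    queries = torch.randn(n_joints, 256)   # per-joint queries for this rig
    print(rig, mocap(video_feats, queries).shape)  # motion on the rig's own joints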

2.1 Zoo → Zoo

Target skeletons: Eagle, Jaguar, Lion, Parrot
Reference motions: Crocodile, Dog, Eagle, Jaguar
Interesting results: birds attempting quadruped walking, and jaguars attempting to fly.

2.2 Human → Zoo

Target skeletons: Lion, Ostrich, PolarBearB, Turtle
Reference motions: Human_1, Human_2, Human_3, Human_4
Interesting results: animals, whether birds or quadrupeds, can mimic human actions.

2.3 Zoo → Human (Part 1)

Target skeletons: Human_1, Human_2
Reference motions: Hamster, Ostrich, PolarBearB, Turtle
Interesting results: humans can mimic animal behaviors, such as rolling like a reptile, flapping like a bird, or walking like a quadruped.

2.4 Zoo → Human (Part 2)

Target skeletons: Human_1, Human_2
Reference motions: Anaconda, Horse, Lion, Parrot
Interesting results: humans can even imitate the movements of a snake. Note that the horse skeleton includes numerous non-anatomical joints (e.g., reins and saddle attachments), which increases the difficulty of retargeting.

2.5 Wild → Human

Target skeletons: Human_1, Human_2
Reference motions: Eagle, Leopard, Lion, PolarBear
Our method also performs well on real in-the-wild videos, provided that the foreground subject is pre-segmented; a minimal preprocessing sketch is given below.
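For reference, here is a minimal OpenCV preprocessing sketch. It is not part of our pipeline; the file paths and per-frame mask layout are hypothetical, and the masks themselves would come from any off-the-shelf video segmenter.

# Apply pre-computed per-frame foreground masks (8-bit, frame-sized PNGs)
# to an in-the-wild clip before feeding it to the model.
import os
import cv2

def segment_clip(video_path: str, mask_dir: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:                                   # end of video
            break
        mask = cv2.imread(os.path.join(mask_dir, f"{idx:05d}.png"),
                          cv2.IMREAD_GRAYSCALE)
        if mask is not None:
            fg = cv2.bitwise_and(frame, frame, mask=mask)   # zero out background
            cv2.imwrite(os.path.join(out_dir, f"{idx:05d}.png"), fg)
        idx += 1
    cap.release()

segment_clip("wild_clip.mp4", "masks/", "segmented/")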

3. Constraint-Aware IK Fitting

We convert the predicted 3D joint trajectories into rotation-based BVH animations while respecting rigid-body constraints, the skeletal hierarchy, and joint limits; a toy sketch of this optimization-based fit follows the examples below.
Input Joints | IK Result    Input Joints | IK Result
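The sketch below conveys the flavor of this fit on a toy three-joint chain; the chain, offsets, joint limits, and loss weights are illustrative assumptions, not the paper's rigs or hyperparameters. Bone lengths stay rigid because the offsets are fixed, the hierarchy is enforced by forward kinematics, and joint limits enter as a soft penalty.

# Toy optimization-based IK fit: recover per-joint rotations whose
# forward kinematics matches predicted 3D keypoints.
import torch

def axis_angle_to_matrix(aa):
    # Rodrigues' formula: (J, 3) axis-angle -> (J, 3, 3) rotation matrices.
    theta = aa.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = aa / theta
    K = torch.zeros(aa.shape[0], 3, 3)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    s, c = theta.sin().unsqueeze(-1), theta.cos().unsqueeze(-1)
    return torch.eye(3).expand_as(K) + s * K + (1 - c) * (K @ K)

def forward_kinematics(parents, offsets, aa, root):
    # Compose local rotations down the hierarchy; return global joint positions.
    R = axis_angle_to_matrix(aa)
    glob_R, pos = [None] * len(parents), [None] * len(parents)
    for j, p in enumerate(parents):
        if p < 0:                                  # root joint
            glob_R[j], pos[j] = R[j], root
        else:                                      # child inherits parent frame
            glob_R[j] = glob_R[p] @ R[j]
            pos[j] = pos[p] + glob_R[p] @ offsets[j]
    return torch.stack(pos)

parents = [-1, 0, 1]                               # toy chain: root -> mid -> tip
offsets = torch.tensor([[0., 0., 0.], [0., 1., 0.], [0., 1., 0.]])  # fixed bones
root = torch.zeros(3)
target = torch.tensor([[0., 0., 0.], [0., .7, .7], [0., 1.4, .7]])  # predicted joints
limit = 1.6                                        # ~90 deg soft per-axis limit

aa = (0.01 * torch.randn(len(parents), 3)).requires_grad_()  # rotations to optimize
opt = torch.optim.Adam([aa], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    pred = forward_kinematics(parents, offsets, aa, root)
    loss = ((pred - target) ** 2).sum()                      # keypoint-fitting term
    loss = loss + 0.01 * torch.relu(aa.abs() - limit).sum()  # joint-limit penalty
    loss.backward()
    opt.step()
print(forward_kinematics(parents, offsets, aa.detach(), root))

Applied per frame, the fitted rotations can then be converted to the target rig's rotation channels and written out as a BVH animation.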

Citation

If you find our work useful, please cite:

@article{gong2025mocapanything,
  title={MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos},
  author={Gong, Kehong and Wen, Zhengyu and He, Weixia and Xu, Mingxi and 
          Wang, Qi and Zhang, Ning and Li, Zhengyu and 
          Lian, Dongze and Zhao, Wei and He, Xiaoyu and Zhang, Mingyuan},
  journal={arXiv preprint arXiv:2501.xxxxx},
  year={2025}
}