¹Huawei Technologies Co., Ltd.
²Huawei Central Media Technology Institute
*Equal Contribution
†Corresponding Author
Abstract
Despite significant progress in 4D content generation, converting monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging.
The scarcity of large-scale, naturally captured 4D mesh datasets exacerbates the difficulty of learning generalizable video-to-4D models from scratch in a purely data-driven manner.
Fortunately, substantial progress in image-to-3D generation, supported by extensive datasets, offers powerful foundation models that can serve as priors.
To better leverage these models while minimizing reliance on 4D supervision, we introduce a novel method, SWiT-4D—a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation.
SWiT-4D can be seamlessly integrated with any Diffusion Transformer (DiT)-based image-to-3D generator, augmenting it with spatio-temporal modeling across video frames while preserving the original single-image forward process. This enables 4D mesh reconstruction from videos of arbitrary length.
Furthermore, to recover the global translation of the mesh within the world coordinate system, we also introduce an optimization-based trajectory prediction module specifically tailored for static-camera monocular videos.
Remarkably, SWiT-4D demonstrates strong data efficiency: with only a single short ($<$10s) video for fine-tuning, our model attains high-fidelity geometry and stable temporal consistency, highlighting its practical deployability even under extremely scarce 4D supervision.
Comprehensive experiments on both in-domain Zoo test sets and challenging out-of-domain benchmarks (Consistent4D, Objaverse, and in-the-wild videos) show that our method consistently outperforms existing baselines in temporal smoothness, underscoring the practical merits of the proposed framework.
Method
Figure 1. Pipeline Overview.
SWiT-4D is a parameter-free temporal extension to image-to-3D diffusion transformers.
Top: Conventional single-frame 3D generation. A shape VAE encodes each 3D mesh into latent space, the diffusion transformer performs denoising, and the shape VAE decodes the latent back to 3D geometry—without any temporal reasoning.
Bottom: Our method introduces temporal modeling losslessly through a sliding-window mechanism applied to both self- and cross-attention.
A 1D rotary positional encoding (1D-RoPE) encodes temporal phase, ensuring identical behavior to the single-frame model when $W{=}0$, while enabling temporal residual learning when $W{>}0$. This design allows coherent motion perception and temporally consistent 4D generation without adding new parameters or supervision.
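To make the mechanism concrete, below is a minimal PyTorch sketch of the temporal sliding-window attention with 1D-RoPE, written purely from the description above. The tensor layout (frames × latent tokens), the rope_1d helper, the half-split rotation, and the single-head simplification are illustrative assumptions, not the released implementation; the caption also applies the windowing to cross-attention, which the sketch omits.

```python
import torch

def rope_1d(x, t_idx, base=10000.0):
    """Rotate features by a per-frame phase (1D rotary positional encoding).

    x:     (B, T, N, D) tokens, D even; t_idx: (T,) integer frame indices.
    Pairs (x[..., i], x[..., i + D//2]) are rotated by angle t * freq_i.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, device=x.device).float() / half)
    ang = t_idx.float()[:, None] * freqs[None, :]         # (T, half)
    cos, sin = ang.cos()[None, :, None, :], ang.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def sliding_window_self_attention(q, k, v, window):
    """Single-head temporal sliding-window attention (heads omitted for brevity).

    q, k, v: (B, T, N, D) -- T frames, N latent tokens per frame.
    window:  temporal radius W; frame t attends to frames with |t - t'| <= W.
             window == 0 keeps attention strictly within each frame.
    """
    B, T, N, D = q.shape
    t = torch.arange(T, device=q.device)
    q, k = rope_1d(q, t), rope_1d(k, t)
    allowed = (t[None, :] - t[:, None]).abs() <= window   # (T, T) frame mask
    mask = allowed[:, None, :, None].expand(T, N, T, N).reshape(T * N, T * N)
    qf, kf, vf = (x.reshape(B, T * N, D) for x in (q, k, v))
    attn = (qf @ kf.transpose(-2, -1)) / D ** 0.5
    attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    return (attn @ vf).reshape(B, T, N, D)
```

Because the per-frame rotation is orthogonal, queries and keys from the same frame are rotated identically and their dot products are unchanged; setting window=0 therefore reproduces the per-frame attention of the base image-to-3D model exactly, which is the lossless, parameter-free property the caption refers to.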
Results
We evaluate SWiT-4D under diverse conditions — from controlled multi-species datasets to real-world videos.
The model consistently delivers high-fidelity 4D reconstructions with smooth temporal evolution across varying sequence lengths, unseen object categories, and complex real-world scenes.
Our method supports diverse animal species and short-sequence 4D generation on the Zoo dataset.
Zoo-Mid: Mid-Sequence Temporal Consistency
SWiT-4D enables temporally consistent, mid-length mesh sequence generation across various Zoo species.
Zoo-Long: Long-Sequence Structural Stability
SWiT-4D stably generates long sequences for different animals while maintaining temporal coherence.
Consistent4D: Generalization to Unseen Object Categories
Our method generalizes to novel objects and unseen categories on the Consistent4D benchmark.
In-the-Wild: Robust Real-World Video Reconstruction
SWiT-4D supports real-world video input and demonstrates robustness in the wild.
Ablation-1Shot: Reconstruction with One-Shot Training
SWiT-4D achieves significant improvements even with very limited training data (a single video, ~150 frames), in both same-species and extreme cross-species settings.
Figure legend: GT, 1-shot in-domain (trained on one alligator), 1-shot out-of-domain (trained on one horse), TripoSG.
Zoo: Reconstruction + Global Translation
With dedicated post-processing, SWiT-4D can also reconstruct 4D meshes with global translation.
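The abstract describes this post-processing as an optimization-based trajectory prediction module tailored to static-camera monocular videos. As a rough illustration only (the paper's actual module may differ substantially), one could recover a per-frame global translation by minimizing a 2D reprojection error under the fixed camera plus a temporal smoothness term; every name and loss term below is an assumption:

```python
import torch

def fit_global_translation(verts, K, centers_2d, iters=500, lam=10.0):
    """Hypothetical per-frame global-translation fitting (not the paper's code).

    verts:      (T, V, 3) per-frame mesh vertices from SWiT-4D, object-centered.
    K:          (3, 3) intrinsics of the static camera.
    centers_2d: (T, 2) tracked 2D object centers in the video.
    Returns a (T, 3) translation placing each mesh in the camera frame.
    """
    T = verts.shape[0]
    trans = torch.zeros(T, 3)
    trans[:, 2] = 2.0                    # start the object in front of the camera
    trans.requires_grad_(True)
    opt = torch.optim.Adam([trans], lr=1e-2)
    for _ in range(iters):
        opt.zero_grad()
        centroids = verts.mean(dim=1) + trans              # (T, 3)
        proj = centroids @ K.T                             # pinhole projection
        uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)    # (T, 2) pixel coords
        reproj = ((uv - centers_2d) ** 2).sum(-1).mean()   # match tracked centers
        smooth = ((trans[1:] - trans[:-1]) ** 2).sum(-1).mean()  # static camera
        loss = reproj + lam * smooth
        loss.backward()
        opt.step()
    return trans.detach()
```

With only a 2D center per frame, depth is weakly constrained, so a real system would need further cues (e.g., silhouette or scale consistency); the smoothness term here merely regularizes the trajectory.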
In-the-Wild: Reconstruction + Global Translation
SWiT-4D adapts to challenging in-the-wild videos and reconstructs 4D meshes with global translation.
Zoo-Comparison: In-Domain Visualization Results
SWiT-4D achieves better temporal consistency and preserves more geometric detail than existing methods.
Even when trained only on the Truebones Zoo dataset, our method still generates better results on the unseen Objaverse dataset.
Figure legend: GT, Ours, Ours-1shot, TripoSG, GVFD, LG4M, GenZoo.
Main Results on Objaverse
Our Other Works
MoCapAnything:
Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Summary:
A category-agnostic framework for estimating 3D motion from monocular videos, conditioned on an arbitrary rigged 3D asset.
a) Mocap:
Using the same reference skeleton yields standard motion capture.
b) Retargeting:
Conditioning on a reference skeleton from a different species naturally enables cross-species motion retargeting across heterogeneous skeletons.
@article{gong2025swit4d,
title = {SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation},
author = {Gong, Kehong and Wen, Zhengyu and Xu, Mingxi and He, Weixia and Wang, Qi and
Zhang, Ning and Li, Zhengyu and Li, Chenbin and Lian, Dongze and
Zhao, Wei and He, Xiaoyu and Zhang, Mingyuan},
journal = {arXiv preprint arXiv:2512.10860},
year = {2025}
}
Acknowledgement
We referred to the project page of Nerfies when creating this project page.