Kehong Gong*1, Zhengyu Wen*2, Mingxi Xu*2, Weixia He2, Qi Wang2, Ning Zhang2
Zhengyu Li2, Chenbin Li2, Dongze Lian2, Wei Zhao2, Xiaoyu He2, Mingyuan Zhang†2
1Huawei Technologies Co., Ltd. 2Huawei Central Media Technology Institute
*Equal Contribution †Corresponding Author
Despite significant progress in 4D content generation, converting monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further complicates learning generalizable video-to-4D models from scratch in a purely data-driven manner. Fortunately, substantial progress in image-to-3D, supported by extensive datasets, provides powerful foundation models that can serve as priors. To better leverage these models while minimizing reliance on 4D supervision, we introduce a novel method, SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D can be seamlessly integrated with any Diffusion Transformer (DiT)-based image-to-3D generator, augmenting it with spatio-temporal modeling across video frames while preserving the original single-image forward process, which enables 4D mesh reconstruction from videos of arbitrary length. Furthermore, to recover the global translation of the mesh in the world coordinate system, we introduce an optimization-based trajectory prediction module tailored to static-camera monocular videos. Remarkably, SWiT-4D demonstrates strong data efficiency: with only a single short ($<$10s) video for fine-tuning, our model attains high-fidelity geometry and stable temporal consistency, highlighting its practical deployability even under extremely scarce 4D supervision. Comprehensive experiments on both in-domain Zoo test sets and challenging out-of-domain benchmarks (Consist4D, Objaverse, and in-the-wild videos) show that our method consistently outperforms existing baselines in temporal smoothness, underscoring the practical merits of the proposed framework.
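To give a feel for the trajectory prediction component, the snippet below is a toy PyTorch sketch, not the paper's actual formulation: under a static camera with known intrinsics, per-frame global translation can be recovered by minimizing a 2D reprojection error of the mesh centroid against a tracked image point, plus a temporal smoothness prior. The inputs `centroids_3d`, `obs_2d`, and `K`, the function name `recover_translation`, and the single-point objective are all illustrative assumptions.

```python
# Toy sketch (an assumption, not the paper's module) of optimization-based
# trajectory recovery for a static-camera video: per-frame global translation
# is optimized so the projected mesh centroid matches a tracked 2D point,
# with a smoothness prior linking neighbouring frames.

import torch


def recover_translation(centroids_3d, obs_2d, K, iters=500, smooth_w=10.0):
    """centroids_3d: (F, 3) per-frame mesh centroids in the object frame.
    obs_2d: (F, 2) tracked 2D centroids in pixels.  K: (3, 3) fixed intrinsics."""
    F_ = centroids_3d.shape[0]
    trans = torch.zeros(F_, 3, requires_grad=True)        # per-frame translation
    trans.data[:, 2] = 3.0                                 # initialize in front of the camera
    opt = torch.optim.Adam([trans], lr=1e-2)

    for _ in range(iters):
        p_cam = centroids_3d + trans                       # static camera: world frame == camera frame
        proj = (K @ p_cam.T).T                             # pinhole projection
        uv = proj[:, :2] / proj[:, 2:3]
        reproj = ((uv - obs_2d) ** 2).sum(dim=-1).mean()   # 2D reprojection error
        smooth = ((trans[1:] - trans[:-1]) ** 2).sum(dim=-1).mean()  # temporal smoothness
        loss = reproj + smooth_w * smooth
        opt.zero_grad()
        loss.backward()
        opt.step()
    return trans.detach()
```

A single tracked 2D point constrains depth only weakly, which is why the smoothness term is essential in this toy setup; the module described in the paper may use richer image evidence.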
Figure 1. Pipeline Overview.
SWiT-4D, a parameter-free temporal extension to image-to-3D diffusion transformers. Top: Conventional single-frame 3D generation. A shape VAE encodes each 3D mesh into latent space, the diffusion transformer performs denoising, and the shape VAE decodes the latent back to 3D geometry—without any temporal reasoning. Bottom: Our method introduces temporal modeling losslessly through a sliding-window mechanism applied to both self- and cross-attention. A 1D rotary positional encoding (1D-RoPE) encodes temporal phase, ensuring identical behavior to the single-frame model when $W{=}0$, while enabling temporal residual learning when $W{>}0$. This design allows coherent motion perception and temporally consistent 4D generation without adding new parameters or supervision.
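To make the mechanism described above concrete, here is a minimal PyTorch sketch (not the released implementation) of sliding-window temporal attention with 1D-RoPE over the frame index. It assumes per-frame token sequences of shape (T, N, D) taken from the frozen single-frame attention layer; the helper names `apply_rope_1d` and `temporal_window_attention` are hypothetical.

```python
# Illustrative sketch of sliding-window temporal attention with 1D-RoPE.
# Assumes per-frame latents of shape (T, N, D) with D even, produced by the
# frozen single-frame q/k/v projections of a DiT block.

import torch
import torch.nn.functional as F


def apply_rope_1d(x, frame_idx, base=10000.0):
    """Rotate feature pairs by a phase proportional to the frame index (1D-RoPE)."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = frame_idx[:, None].to(x.dtype) * freqs                 # (T, d/2)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]   # broadcast over tokens
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


def temporal_window_attention(q, k, v, window):
    """Frame t attends to the tokens of frames [t-window, t+window].

    q, k, v: (T, N, D) per-frame token sequences from the single-frame layer.
    """
    T, N, D = q.shape
    frame_idx = torch.arange(T, device=q.device)
    q = apply_rope_1d(q, frame_idx)
    k = apply_rope_1d(k, frame_idx)

    out = torch.empty_like(q)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        k_win = k[lo:hi].reshape(1, -1, D)                 # flatten neighbouring frames
        v_win = v[lo:hi].reshape(1, -1, D)
        out[t] = F.scaled_dot_product_attention(q[t:t + 1], k_win, v_win)[0]
    return out
```

With $W{=}0$, every attended token shares the query's frame index, so the rotary phase cancels in the query-key dot product and the output matches the single-frame model exactly, which is consistent with the lossless, parameter-free property stated in the caption; with $W{>}0$, neighbouring frames enter the attention with a relative temporal phase.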
We evaluate SWiT-4D under diverse conditions — from controlled multi-species datasets to real-world videos. The model consistently delivers high-fidelity 4D reconstructions with smooth temporal evolution across varying sequence lengths, unseen object categories, and complex real-world scenes.
Our method supports diverse animal species and short-sequence 4D generation on the Zoo dataset.
SWiT-4D enables temporally consistent, mid-length mesh sequence generation across various Zoo species.
SWiT-4D stably generates long temporal sequences for different animals, maintaining temporal coherence.
Our method generalizes to novel objects and unseen categories on the Consist4D benchmark.
SWiT-4D supports real-world video input and demonstrates robustness in the wild.
SWiT-4D achieves significant improvements even with very limited training data (a single video, ~150 frames), in both same-species and extreme cross-species settings.
With dedicated post-processing, SWiT-4D can also reconstruct 4D meshes with global translation.
SWiT-4D can adapt to challenging in-the-wild videos and reconstruct 4D meshes with global translation.
SWiT-4D achieves better temporal consistency and preserves more geometric details than existing methods.
Even when trained only on the Truebones Zoo dataset, our method still generates better results on the unseen Objaverse dataset.
@article{gong2025swit4d,
title = {SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation},
author = {Gong, Kehong and Wen, Zhengyu and Xu, Mingxi and He, Weixia and Wang, Qi and
Zhang, Ning and Li, Zhengyu and Li, Chenbin and Lian, Dongze and
Zhao, Wei and He, Xiaoyu and Zhang, Mingyuan},
journal = {arXiv preprint arXiv:2501.xxxxx},
year = {2025}
}