¹Huawei Technologies Co., Ltd.
²Huawei Central Media Technology Institute
*Equal Contribution
†Corresponding Author
Abstract
Despite significant progress in 4D content generation, converting monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging.
The scarcity of large-scale, naturally captured 4D mesh datasets exacerbates the difficulty of learning generalizable video-to-4D models from scratch in a purely data-driven manner.
Fortunately, substantial progress in image-to-3D generation, supported by extensive datasets, offers powerful foundation models that can serve as priors.
To better leverage these models while minimizing reliance on 4D supervision, we introduce a novel method, SWiT-4D—a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation.
SWiT-4D can be seamlessly integrated with any Diffusion Transformer (DiT)-based image-to-3D generator, augmenting it with spatio-temporal modeling across video frames while preserving the original single-image forward process. This enables 4D mesh reconstruction from videos of arbitrary length.
Furthermore, to recover the global translation of the mesh within the world coordinate system, we also introduce an optimization-based trajectory prediction module specifically tailored for static-camera monocular videos.
Remarkably, SWiT-4D demonstrates strong data efficiency: with only a single short ($<$10s) video for fine-tuning, our model attains high-fidelity geometry and stable temporal consistency, highlighting its practical deployability even under extremely scarce 4D supervision.
Comprehensive experiments on both in-domain Zoo test sets and challenging out-of-domain benchmarks (Consistent4D, Objaverse, and in-the-wild videos) show that our method consistently outperforms existing baselines in temporal smoothness, underscoring the practical merits of the proposed framework.
Method
Figure 1. Pipeline Overview.
SWiT-4D is a parameter-free temporal extension to image-to-3D diffusion transformers.
Top: Conventional single-frame 3D generation. A shape VAE encodes each 3D mesh into latent space, the diffusion transformer performs denoising, and the shape VAE decodes the latent back to 3D geometry—without any temporal reasoning.
Bottom: Our method introduces temporal modeling losslessly through a sliding-window mechanism applied to both self- and cross-attention.
A 1D rotary positional encoding (1D-RoPE) encodes temporal phase, ensuring identical behavior to the single-frame model when $W{=}0$, while enabling temporal residual learning when $W{>}0$. This design allows coherent motion perception and temporally consistent 4D generation without adding new parameters or supervision.
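To make the mechanism concrete, below is a minimal PyTorch sketch of the temporal sliding-window attention with 1D-RoPE, written purely from the description above. The tensor layout (frames × latent tokens), the rope_1d helper, the half-split rotation, and the single-head simplification are illustrative assumptions, not the released implementation; the caption also applies the windowing to cross-attention, which the sketch omits.

```python
import torch

def rope_1d(x, t_idx, base=10000.0):
    """Rotate features by a per-frame phase (1D rotary positional encoding).

    x:     (B, T, N, D) tokens, D even; t_idx: (T,) integer frame indices.
    Pairs (x[..., i], x[..., i + D//2]) are rotated by angle t * freq_i.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, device=x.device).float() / half)
    ang = t_idx.float()[:, None] * freqs[None, :]         # (T, half)
    cos, sin = ang.cos()[None, :, None, :], ang.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def sliding_window_self_attention(q, k, v, window):
    """Single-head temporal sliding-window attention (heads omitted for brevity).

    q, k, v: (B, T, N, D) -- T frames, N latent tokens per frame.
    window:  temporal radius W; frame t attends to frames with |t - t'| <= W.
             window == 0 keeps attention strictly within each frame.
    """
    B, T, N, D = q.shape
    t = torch.arange(T, device=q.device)
    q, k = rope_1d(q, t), rope_1d(k, t)
    allowed = (t[None, :] - t[:, None]).abs() <= window   # (T, T) frame mask
    mask = allowed[:, None, :, None].expand(T, N, T, N).reshape(T * N, T * N)
    qf, kf, vf = (x.reshape(B, T * N, D) for x in (q, k, v))
    attn = (qf @ kf.transpose(-2, -1)) / D ** 0.5
    attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    return (attn @ vf).reshape(B, T, N, D)
```

Because the per-frame rotation is orthogonal, queries and keys from the same frame are rotated identically and their dot products are unchanged; setting window=0 therefore reproduces the per-frame attention of the base image-to-3D model exactly, which is the lossless, parameter-free property the caption refers to.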
Results
We evaluate SWiT-4D under diverse conditions — from controlled multi-species datasets to real-world videos.
The model consistently delivers high-fidelity 4D reconstructions with smooth temporal evolution across varying sequence lengths, unseen object categories, and complex real-world scenes.
Our method supports diverse animal species and short-sequence 4D generation on the Zoo dataset.
Zoo-Mid: Mid-Sequence Temporal Consistency
SWiT-4D enables temporally consistent, mid-length mesh sequence generation across various Zoo species.
Zoo-Long: Long-Sequence Structural Stability
SWiT-4D stably generates long sequences for different animals while maintaining temporal coherence.
Consistent4D: Generalization to Unseen Object Categories
Our method generalizes to novel objects and unseen categories on the Consistent4D benchmark.
In-the-Wild: Robust Real-World Video Reconstruction
SWiT-4D supports real-world video input and demonstrates robustness in the wild.
Ablation-1Shot: Reconstruction with One-Shot Training
SWiT-4D achieves significant improvements even with very limited training data (a single video, ~150 frames), in both same-species and extreme cross-species settings.
Figure legend: GT, 1-shot in-domain (trained on one alligator), 1-shot out-of-domain (trained on one horse), TripoSG.
Zoo: Reconstruction + Global Translation
With dedicated post-processing, SWiT-4D can also reconstruct 4D meshes with global translation.
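The abstract describes this post-processing as an optimization-based trajectory prediction module tailored to static-camera monocular videos. As a rough illustration only (the paper's actual module may differ substantially), one could recover a per-frame global translation by minimizing a 2D reprojection error under the fixed camera plus a temporal smoothness term; every name and loss term below is an assumption:

```python
import torch

def fit_global_translation(verts, K, centers_2d, iters=500, lam=10.0):
    """Hypothetical per-frame global-translation fitting (not the paper's code).

    verts:      (T, V, 3) per-frame mesh vertices from SWiT-4D, object-centered.
    K:          (3, 3) intrinsics of the static camera.
    centers_2d: (T, 2) tracked 2D object centers in the video.
    Returns a (T, 3) translation placing each mesh in the camera frame.
    """
    T = verts.shape[0]
    trans = torch.zeros(T, 3)
    trans[:, 2] = 2.0                    # start the object in front of the camera
    trans.requires_grad_(True)
    opt = torch.optim.Adam([trans], lr=1e-2)
    for _ in range(iters):
        opt.zero_grad()
        centroids = verts.mean(dim=1) + trans              # (T, 3)
        proj = centroids @ K.T                             # pinhole projection
        uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)    # (T, 2) pixel coords
        reproj = ((uv - centers_2d) ** 2).sum(-1).mean()   # match tracked centers
        smooth = ((trans[1:] - trans[:-1]) ** 2).sum(-1).mean()  # static camera
        loss = reproj + lam * smooth
        loss.backward()
        opt.step()
    return trans.detach()
```

With only a 2D center per frame, depth is weakly constrained, so a real system would need further cues (e.g., silhouette or scale consistency); the smoothness term here merely regularizes the trajectory.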
In-the-Wild: Reconstruction + Global Translation
SWiT-4D adapts to challenging in-the-wild videos and reconstructs 4D meshes with global translation.
Zoo-Comparison: In-Domain Visualization Results
SWiT-4D achieves better temporal consistency and preserves more geometric detail than existing methods.
Even when trained only on the Truebones Zoo dataset, our method still generates better results on the unseen Objaverse dataset.
Figure legend: GT, Ours, Ours-1shot, TripoSG, GVFD, LG4M, GenZoo.
Main Results on Objaverse
Our Other Works
MoCapAnything:
Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Summary:
A category-agnostic framework for estimating 3D motion from monocular videos, conditioned on an arbitrary rigged 3D asset.
a) Mocap:
Using the same reference skeleton yields standard motion capture.
b) Retargeting:
Conditioning on a reference skeleton from a different species naturally enables cross-species motion retargeting across heterogeneous skeletons.
@article{gong2025swit4d,
title = {SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation},
author = {Gong, Kehong and Wen, Zhengyu and Xu, Mingxi and He, Weixia and Wang, Qi and
Zhang, Ning and Li, Zhengyu and Li, Chenbin and Lian, Dongze and
Zhao, Wei and He, Xiaoyu and Zhang, Mingyuan},
journal = {arXiv preprint arXiv:2512.10860},
year = {2025}
}
Acknowledgement
We referred to the project page of Nerfies when creating this project page.