DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

Ning Zhang1, Zhengyu Li1, Kwong Weng Loh1, Mingxi Xu1, Qi Wang1, Zhengyu Wen1
Xiaoyu He1, Wei Zhao1, Kehong Gong2, Mingyuan Zhang1,†

1Huawei Central Media Technology Institute    2Huawei Technologies Co., Ltd.

†Corresponding author

Abstract

We present DiMo, a diffusion-LLM framework for bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs multi-step parallel denoising, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference. On HumanML3D, our method achieves competitive T2M/M2T results against strong baselines. Besides T2M/M2T, we further demonstrate motion completion and prediction under both text-free and text-conditioned settings. We also incorporate Residual VQ (RVQ) as the motion tokenizer to improve quantization fidelity, and adopt GRPO within the framework to enhance alignment and controllability. To the best of our knowledge, this is the first work to bring diffusion-LLMs to bidirectional text-motion modeling.

DiMo

DiMo is a unified framework for bidirectional text-motion generation, inspired by the recent success of discrete diffusion language models (dLLMs). Instead of decoding tokens one at a time autoregressively, dLLMs apply random masking and iterative denoising, which naturally supports parallel inference. This allows the model to refine corrupted sequences over multiple steps, dynamically revise low-confidence predictions, and leverage bidirectional attention for stronger contextual reasoning.
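To make the decoding paradigm concrete, below is a minimal sketch of confidence-based iterative parallel denoising over a discrete token sequence. The `toy_denoiser` is a hypothetical stand-in for the actual bidirectional transformer, the `MASK` sentinel and the linear unmasking schedule are assumptions for illustration, not DiMo's actual implementation details.

```python
import numpy as np

MASK = -1  # sentinel id for masked positions (assumption)

def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the bidirectional denoising model: returns
    per-position logits over the motion-token vocabulary.
    Here the logits are random, purely for illustration."""
    return rng.normal(size=(len(tokens), vocab_size))

def iterative_decode(seq_len, vocab_size, num_steps=4, seed=0):
    """Parallel denoising: start from a fully masked sequence; at
    each step commit the most confident predictions and re-mask
    the rest. Fewer steps -> lower latency, coarser quality."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK)
    for step in range(num_steps):
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        logits = toy_denoiser(tokens, vocab_size, rng)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        preds = probs[masked].argmax(-1)   # predicted token per masked slot
        conf = probs[masked].max(-1)       # confidence of each prediction
        # linear schedule: target fraction of unmasked tokens grows each step
        target = int(round((step + 1) / num_steps * seq_len))
        k = max(1, target - (seq_len - masked.size))
        keep = np.argsort(-conf)[:k]       # commit the top-k confident slots
        tokens[masked[keep]] = preds[keep]
    return tokens
```

Because multiple tokens are committed per step, the whole sequence is produced in `num_steps` forward passes rather than one pass per token, which is the quality-latency knob the abstract refers to.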

Results

Click the tabs to switch tasks. Use the arrows to navigate the video examples within each task.

Comparison with Other Methods

Click the tabs to switch tasks. Use the arrows to navigate the video examples within each task.

Citation

If you find this work useful, please consider citing:

@article{zhang2026dimo,
  title   = {DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding},
  author  = {Zhang, Ning and Li, Zhengyu and Loh, Kwong Weng and Xu, Mingxi and Wang, Qi and 
             Wen, Zhengyu and He, Xiaoyu and Zhao, Wei and Gong, Kehong and Zhang, Mingyuan},
  journal = {arXiv preprint arXiv:2602.04188},
  year    = {2026}
}