We present DiMo, a diffusion-LLM framework for bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode it sequentially, DiMo performs multi-step parallel denoising, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally exposes a quality-latency trade-off at inference time. On HumanML3D, our method achieves competitive T2M/M2T results against strong baselines. Beyond T2M/M2T, we further demonstrate motion completion and prediction under both text-free and text-conditioned settings. We also incorporate Residual VQ (RVQ) as the motion tokenizer to improve quantization fidelity, and adopt GRPO within the framework to enhance alignment and controllability. To the best of our knowledge, this is the first work to bring diffusion-LLMs to bidirectional text-motion modeling.
DiMo is a unified framework for bidirectional text-motion generation, inspired by the recent success of discrete diffusion language models (dLLMs). Instead of decoding tokens autoregressively one at a time, dLLMs apply random masking and iterative denoising, which naturally supports parallel inference. This allows the model to refine corrupted sequences over multiple steps, dynamically revise low-confidence predictions, and leverage bidirectional attention for stronger contextual reasoning.
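To make this decoding paradigm concrete, below is a minimal sketch of confidence-based iterative unmasking over discrete motion tokens. It is not DiMo's actual implementation: the model interface (a callable returning per-position logits), the MASK_ID placeholder, and the cosine unmasking schedule are illustrative assumptions. The sketch only shows how masked positions are filled in parallel over a few passes, and how choosing num_steps trades quality for latency.

# Minimal sketch of confidence-based parallel denoising over discrete motion tokens.
# Assumptions (not from the paper): `model(tokens, text_emb)` returns per-position
# logits of shape (1, seq_len, vocab_size); MASK_ID marks masked positions;
# a cosine schedule controls how many positions stay masked after each step.
import math
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token in the motion codebook

@torch.no_grad()
def diffusion_decode(model, text_emb, seq_len, num_steps=10):
    """Iteratively unmask a fully masked motion-token sequence in `num_steps` passes."""
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long)

    for step in range(num_steps):
        logits = model(tokens, text_emb)          # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)            # per-position confidence and argmax

        # Only positions that are still masked compete for unmasking this step.
        still_masked = tokens == MASK_ID
        conf = conf.masked_fill(~still_masked, -1.0)

        # Cosine schedule: how many positions should remain masked after this step.
        ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_keep_masked = int(seq_len * ratio)
        num_to_unmask = int(still_masked.sum()) - num_keep_masked

        if num_to_unmask > 0:
            # Commit the most confident predictions; low-confidence positions stay
            # masked and are revisited (and possibly revised) in later steps.
            idx = conf.topk(num_to_unmask, dim=-1).indices
            tokens.scatter_(1, idx, pred.gather(1, idx))

    return tokens  # discrete motion tokens, to be decoded back to motion by the tokenizer

Fewer steps means fewer forward passes (lower latency) but each step must commit more low-confidence tokens; more steps refine the sequence more gradually, which is the quality-latency trade-off mentioned above.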
Click the tabs to switch tasks, and use the arrows to navigate the video examples within each task.
If you find this work useful, please consider citing:
@article{zhang2026dimo,
  title   = {DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding},
  author  = {Zhang, Ning and Li, Zhengyu and Loh, Kwong Weng and Xu, Mingxi and Wang, Qi and Wen, Zhengyu and He, Xiaoyu and Zhao, Wei and Gong, Kehong and Zhang, Mingyuan},
  journal = {arXiv preprint arXiv:2602.04188},
  year    = {2026}
}